CN113297837A

CN113297837A - PDF form information extraction method, device, equipment and storage medium

Info

Publication number: CN113297837A
Application number: CN202110692819.9A
Authority: CN
Inventors: 李宗波; 傅永德
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2021-06-22
Filing date: 2021-06-22
Publication date: 2021-08-24

Abstract

The invention discloses a method, device, equipment and medium for extracting PDF form information. The method includes: parsing a PDF template file to obtain a first text block set, the starting position of each first text block, and the form field attributes; filling the template with pre-filling information generated for each form field attribute to obtain a temporary filling file, and parsing it to obtain the second text block set in the temporary filling file and the starting position of each second text block; comparing the second text block set with the first text block set and adding the selected target text blocks to a difference text block set; matching the text blocks in the difference text block set with the pre-filling information to obtain the filling position of each form field attribute; and parsing the target PDF file to obtain the third text blocks and the starting position of each third text block, and extracting the text content of the third text block whose starting position is at the filling position as form information. The invention improves the accuracy of form information extraction.

Description

PDF form information extraction method, device, equipment and storage medium

Technical Field

The invention relates to the technical field of document processing, and in particular to a method, a device, equipment and a storage medium for extracting PDF (Portable Document Format) form information.

Background

In business scenarios in the computer field, a PDF file is commonly used to store an electronic contract signed by a user. After the contract is signed, the filled-in form information must be extracted in order to verify that the contract was signed completely and correctly. In a contract document generated by filling form fields, however, the original form fields may be removed once they have been filled with text, so the filled-in form information cannot be obtained by reading the form fields. The existing extraction method compares the content of each text block in the PDF template file with that in the target PDF file, finds the text blocks of the target PDF file that differ from the template, and takes their text content as the extracted form information. This method cannot reliably determine whether two text blocks with the same text content are on the same page and at the same position, so the difference text blocks it finds, and therefore the extraction result, may deviate from the true form information.

Disclosure of Invention

The invention mainly aims to provide a method, a device, equipment and a storage medium for extracting PDF form information, and aims to solve the technical problem that the extraction result of the existing method for extracting form information by comparing text block contents is inaccurate.

In order to achieve the above object, the present invention provides a PDF form information extraction method, including the following steps:

acquiring a PDF template file, analyzing to obtain a first text block set consisting of first text blocks in a page of a target page number in the PDF template file and the initial position of each first text block in the page, and analyzing to obtain each form field attribute in the page of the target page number in the PDF template file;

generating pre-filling information corresponding to each form field attribute respectively, filling the PDF template file by adopting each pre-filling information to obtain a temporary filling file, and analyzing to obtain a second text block set formed by second text blocks in the page of the target page number in the temporary filling file and the initial positions of each second text block in the page respectively;

comparing the text blocks in the second text block set with the text blocks in the first text block set, selecting target text blocks in the second text block set and adding them to a difference text block set, wherein each target text block differs from every first text block in at least one of text content and starting position;

matching the text content of the text block in the difference text block set with each pre-filling information to obtain a filling position corresponding to each form field attribute;

and acquiring a target PDF file from which form information is to be extracted, parsing it to obtain the third text blocks in the page of the target page number in the target PDF file and the starting position of each third text block in the page, and extracting the text content of the third text block whose starting position is at the filling position as the form information corresponding to the form field attribute.
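The final step above can be illustrated with a minimal sketch. It assumes that text blocks have already been parsed into a hypothetical `TextBlock` tuple of text plus horizontal and vertical offsets, and that the filling positions were obtained in the preceding step; the names and sample values are illustrative only and do not come from the source.

```python
from typing import NamedTuple


class TextBlock(NamedTuple):
    """Hypothetical parsed text block: content plus starting position."""
    text: str
    x: int  # horizontal offset of the starting character
    y: int  # vertical offset from the top edge of the page


def extract_form_info(fill_positions, third_blocks):
    """Map each form field attribute to the text of the third text block
    whose starting position equals the attribute's filling position."""
    by_position = {(b.x, b.y): b.text for b in third_blocks}
    return {
        attr: by_position[pos]
        for attr, pos in fill_positions.items()
        if pos in by_position
    }


# Illustrative data: two filling positions and four parsed text blocks.
fill_positions = {"name": (12, 140), "id_number": (12, 180)}
third_blocks = [
    TextBlock("Name:", 4, 140),
    TextBlock("Zhang San", 12, 140),
    TextBlock("ID number:", 0, 180),
    TextBlock("440301199001011234", 12, 180),
]
form_info = extract_form_info(fill_positions, third_blocks)
print(form_info)
```

Because the lookup key is the starting position rather than the text, the result also records which form field attribute each extracted value belongs to.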

Optionally, the step of obtaining, by analysis, a first text block set formed by first text blocks in a page of the target page number in the PDF template file and a starting position of each of the first text blocks in the page includes:

analyzing the PDF template file to obtain a page object corresponding to a target page in the PDF template file;

parsing each resource in the page object line by line, taking the width of a space character as the step length, to obtain each first text block in the page corresponding to the target page number and the horizontal offset of the starting character of the first text block, wherein the horizontal offset is the number of step lengths preceding the starting character in the line where the starting character is located;

calculating the vertical offset of the starting character of the first text block relative to the top edge of the page according to the starting vertical coordinate information of each resource in the page object;

and taking the horizontal offset and the vertical offset of the starting character of the first text block as the starting position of the first text block in a page.
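The starting position described above can be sketched as follows. This is a simplified illustration, not the parser of the embodiment: it assumes that the x coordinate of the starting character is measured from the left edge of the page, that the y coordinate is measured from the bottom edge (the default orientation of PDF user space), and that `start_position` and its parameters are hypothetical names.

```python
def start_position(x_points, y_points, page_height, space_width):
    """Return (horizontal offset, vertical offset) of a starting character.

    horizontal offset: number of space-width steps before the character
    vertical offset:   distance from the top edge of the page
    """
    steps = round(x_points / space_width)
    vertical = page_height - y_points
    return steps, vertical


# A character 72 units from the left and 700 units above the bottom edge
# of an 842-unit-high page, with a space character 3 units wide.
position = start_position(x_points=72.0, y_points=700.0,
                          page_height=842.0, space_width=3.0)
print(position)
```

Expressing the horizontal offset in steps rather than raw coordinates makes two blocks comparable as long as they start in the same column of the same line.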

Optionally, the step of analyzing the PDF template file to obtain a page object corresponding to a target page number in the PDF template file includes:

acquiring a root object number stored at the tail of the PDF template file, and analyzing the PDF template file according to the root object number to obtain a root object;

analyzing the PDF template file according to the page group object number stored in the root object to obtain a page group object;

and determining a target page object number corresponding to a target page number from the page object numbers stored in the page group object, and analyzing the PDF template file according to the target page object number to obtain a page object corresponding to the target page number.
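The lookup chain above (file tail, root object, page group object, page object) can be sketched on a tiny hand-written PDF fragment. This sketch makes strong simplifying assumptions: the file is uncompressed, every object has generation number 0, the page tree is flat, and objects are located by scanning for `N 0 obj` rather than through the cross-reference table. `read_object` and `page_object_number` are hypothetical helper names.

```python
import re

# Minimal illustrative fragment; it is not a complete, renderable PDF.
SAMPLE = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R 4 0 R] /Count 2 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R >>
endobj
4 0 obj
<< /Type /Page /Parent 2 0 R >>
endobj
trailer
<< /Size 5 /Root 1 0 R >>
%%EOF
"""


def read_object(data: bytes, number: int) -> bytes:
    """Return the body of object `number` (generation 0 assumed)."""
    match = re.search(rb"(?<!\d)%d 0 obj(.*?)endobj" % number, data, re.S)
    if match is None:
        raise KeyError(number)
    return match.group(1)


def page_object_number(data: bytes, page_number: int) -> int:
    """Follow trailer -> root object -> page group object -> page object."""
    trailer = data[data.rindex(b"trailer"):]
    root_num = int(re.search(rb"/Root\s+(\d+)\s+\d+\s+R", trailer).group(1))
    root = read_object(data, root_num)
    pages_num = int(re.search(rb"/Pages\s+(\d+)\s+\d+\s+R", root).group(1))
    pages = read_object(data, pages_num)
    kids = re.search(rb"/Kids\s*\[(.*?)\]", pages, re.S).group(1)
    kid_nums = [int(n) for n in re.findall(rb"(\d+)\s+\d+\s+R", kids)]
    return kid_nums[page_number - 1]  # page numbers are 1-based


target = page_object_number(SAMPLE, 2)
print(target, read_object(SAMPLE, target))
```

Real files may use cross-reference streams, object streams, and nested page trees, none of which this sketch handles.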

Optionally, the step of generating pre-filling information corresponding to each form field attribute respectively includes:

and randomly generating pre-filling information corresponding to each form field attribute, wherein the text content identical to the pre-filling information does not exist in the first text block set.
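One way to generate such pre-filling information is sketched below. It assumes random hexadecimal strings are acceptable filler and regenerates any value that collides with existing template text. Keeping the values distinct from one another is an extra assumption of this sketch, made so that each value identifies exactly one attribute; `generate_prefill` is a hypothetical name.

```python
import secrets


def generate_prefill(field_attributes, first_block_texts, length=12):
    """Return {attribute: random value}, avoiding any text already present
    in the first text block set and any previously generated value."""
    used = set(first_block_texts)
    prefill = {}
    for attr in field_attributes:
        value = secrets.token_hex(length // 2)  # `length` hex characters
        while value in used:
            value = secrets.token_hex(length // 2)
        used.add(value)
        prefill[attr] = value
    return prefill


template_texts = {"Name:", "Date:"}
prefill = generate_prefill(["name", "date"], template_texts)
print(prefill)
```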

Optionally, the step of matching the text content of the text block in the difference text block set with each of the pre-filling information to obtain a filling position corresponding to each of the form field attributes includes:

respectively matching the text content of each text block in the difference text block set with each pre-filling information, and taking the text block with the same content as the pre-filling information as a form information text block according to a matching result;

and taking the initial position of the form information text block in the page as the filling position of the corresponding form field attribute.

Optionally, after the step of generating pre-filling information corresponding to each form field attribute respectively, the method further includes:

establishing an attribute fill value dictionary associating each form field attribute with the corresponding pre-filling information;

the step of taking the starting position of the form information text block in the page as the filling position of the corresponding form field attribute comprises the following steps:

looking up, in the attribute fill value dictionary, the form field attribute corresponding to the pre-filling information whose content is the same as that of the form information text block;

and taking the initial position of the form information text block in the page as the filling position of the searched form field attribute.
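The dictionary lookup described above can be sketched as follows, assuming difference text blocks are represented as hypothetical `(text, position)` pairs and that the pre-filling values are unique so the dictionary can be inverted. `locate_fill_positions` and the sample values are illustrative.

```python
def locate_fill_positions(attribute_fill_values, difference_blocks):
    """Return {attribute: starting position} for every difference text
    block whose text equals the pre-filling value of an attribute."""
    # Invert attribute -> value into value -> attribute for direct lookup.
    attribute_by_value = {v: k for k, v in attribute_fill_values.items()}
    positions = {}
    for text, position in difference_blocks:
        attr = attribute_by_value.get(text)
        if attr is not None:  # this block is a form information text block
            positions[attr] = position
    return positions


attribute_fill_values = {"name": "a1b2c3", "date": "d4e5f6"}
difference_blocks = [("a1b2c3", (12, 140)), ("d4e5f6", (30, 200))]
fill_positions = locate_fill_positions(attribute_fill_values,
                                       difference_blocks)
print(fill_positions)
```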

Optionally, before the steps of obtaining, by analysis, a first text block set formed by first text blocks in a page of a target page number in the PDF template file, and a starting position of each of the first text blocks in the page, and obtaining, by analysis, each form field attribute in the page of the target page number in the PDF template file, the method further includes:

parsing the PDF template file, and determining each target page object containing a form field object in the PDF template file;

and respectively taking the page number corresponding to each target page object as the target page number.
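Selecting the target page numbers can be sketched on a simplified in-memory model. The sketch assumes parsed objects are held in a hypothetical dictionary keyed by object number, and that a form field on a page shows up as a widget annotation in the page's annotation list; the data layout and the name `target_page_numbers` are illustrative.

```python
def target_page_numbers(objects, page_object_numbers):
    """Return the 1-based numbers of pages containing a form field object."""
    targets = []
    for page_number, obj_num in enumerate(page_object_numbers, start=1):
        annots = objects[obj_num].get("Annots", [])
        if any(objects[a].get("Subtype") == "Widget" for a in annots):
            targets.append(page_number)
    return targets


# Page 1 (object 3) carries a text form field; page 2 (object 4) only a link.
objects = {
    3: {"Type": "Page", "Annots": [5]},
    4: {"Type": "Page", "Annots": [6]},
    5: {"Subtype": "Widget", "FT": "Tx", "T": "name"},
    6: {"Subtype": "Link"},
}
targets = target_page_numbers(objects, [3, 4])
print(targets)
```

Restricting later parsing to these pages avoids comparing text blocks on pages that cannot contain form information.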

In order to achieve the above object, the present invention further provides a PDF form information extracting apparatus, including:

the first analysis module is used for acquiring a PDF template file, analyzing to obtain a first text block set formed by first text blocks in a page of a target page number in the PDF template file and initial positions of the first text blocks in the page respectively, and analyzing to obtain each form field attribute in the page of the target page number in the PDF template file;

the second analysis module is used for respectively generating pre-filling information corresponding to each form field attribute, filling the PDF template file by adopting each pre-filling information to obtain a temporary filling file, and analyzing to obtain a second text block set formed by second text blocks in a page of the target page number in the temporary filling file and the initial positions of each second text block in the page;

a comparison module, configured to compare text blocks in the second text block set with text blocks in the first text block set, select a target text block in the second text block set, and add the target text block into a difference text block set, where at least one of text content and an initial position of the target text block is different from that of each of the first text blocks;

a matching module, configured to match text contents of text blocks in the difference text block set with each of the pre-filling information, so as to obtain filling positions corresponding to each of the form field attributes;

and the extraction module is used for acquiring a target PDF file of the form information to be extracted, analyzing to obtain a third text block in a page of the target page number in the target PDF file and the initial position of each third text block in the page, and extracting the text content in the third text block of the initial position at the filling position as the form information corresponding to the form field attribute.

In order to achieve the above object, the present invention further provides a PDF form information extraction device, including a memory, a processor, and a PDF form information extraction program that is stored in the memory and executable on the processor, wherein the PDF form information extraction program, when executed by the processor, implements the steps of the PDF form information extraction method described above.

In addition, to achieve the above object, the present invention further provides a computer readable storage medium, which stores thereon a PDF form information extraction program, and when the PDF form information extraction program is executed by a processor, the PDF form information extraction program implements the steps of the PDF form information extraction method as described above.

In the invention, a PDF template file is acquired and parsed to obtain a first text block set formed by the first text blocks in the page of the target page number, the starting position of each first text block in the page, and each form field attribute in that page. Pre-filling information corresponding to each form field attribute is generated, the PDF template file is filled with the pre-filling information to obtain a temporary filling file, and the temporary filling file is parsed to obtain a second text block set formed by the second text blocks in the page of the target page number and the starting position of each second text block in the page. The text blocks in the second text block set are compared with those in the first text block set, and target text blocks in the second text block set are selected and added to a difference text block set, a target text block being one that differs from every first text block in at least one of text content and starting position. The text content of the text blocks in the difference text block set is matched with each piece of pre-filling information to obtain the filling position corresponding to each form field attribute. Finally, a target PDF file from which form information is to be extracted is acquired and parsed to obtain the third text blocks in the page of the target page number and the starting position of each third text block in the page, and the text content of the third text block whose starting position is at the filling position is extracted as the form information corresponding to the form field attribute.

Compared with directly comparing text block content against the target PDF file, the invention parses the files to obtain the starting position of each text block in the page and compares both text content and starting position. This avoids inaccurate extraction caused by being unable to determine whether two text blocks with the same text content are on the same page and at the same position, and thereby improves the accuracy of form information extraction. Moreover, the form field attributes are extracted from the PDF template file, pre-filling information is filled in according to the form field attributes to obtain a temporary filling file, the text blocks of the PDF template file and the temporary filling file are compared to obtain the filling position of each form field attribute, and the form information in the target PDF file is extracted by filling position. The correspondence between each piece of form information and its form field attribute is therefore obtained at the same time as the form information itself, which further improves the accuracy of form information extraction.

Drawings

FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a PDF form information extraction method according to a first embodiment of the present invention;

FIG. 3 is a functional module diagram of a PDF form information extraction apparatus according to a preferred embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.

It should be noted that, the PDF form information extraction device in the embodiment of the present invention may be a smart phone, a personal computer, a server, and the like, and is not limited herein.

As shown in fig. 1, the PDF form information extraction device may include a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 enables communication among these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM or a non-volatile memory (e.g., a magnetic disk memory). Alternatively, the memory 1005 may be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the device configuration shown in fig. 1 does not constitute a limitation of the PDF form information extraction device, and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components.

As shown in fig. 1, the memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a PDF form information extraction program. The operating system is a program that manages and controls the hardware and software resources of the device and supports the operation of the PDF form information extraction program and other software or programs. In the device shown in fig. 1, the user interface 1003 is mainly used for data communication with a client; the network interface 1004 is mainly used for establishing a communication connection with a server; and the processor 1001 may be configured to call the PDF form information extraction program stored in the memory 1005 and perform the operations of the PDF form information extraction method set out above.

Further, the processor 1001 may be configured to call the PDF form information extraction program stored in the memory 1005 and perform the optional operations set out above: parsing to obtain the first text block set formed by the first text blocks in the page of the target page number in the PDF template file and the starting position of each first text block in the page; parsing the PDF template file to obtain the page object corresponding to the target page number; generating the pre-filling information corresponding to each form field attribute; matching the text content of the text blocks in the difference text block set with each piece of pre-filling information to obtain the filling position corresponding to each form field attribute; the operations performed after the pre-filling information is generated, together with taking the starting position of the form information text block in the page as the filling position of the corresponding form field attribute; and the operations performed before the first text block set, the starting positions, and the form field attributes are obtained by parsing.

Based on the above structure, various embodiments of the PDF form information extraction method are proposed.

Referring to fig. 2, fig. 2 is a flowchart illustrating a PDF form information extracting method according to a first embodiment of the present invention.

While a logic sequence is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than that shown or described herein. In this embodiment, the PDF form information extraction method includes:

Step S10, a PDF template file is obtained, a first text block set formed by first text blocks in a page of a target page number in the PDF template file and the starting position of each first text block in the page are obtained through parsing, and each form field attribute in the page of the target page number in the PDF template file is obtained through parsing;

In this embodiment, the PDF template file is a template used for generating PDF files. Specific regions for filling in differentiated text are set in the PDF template file and are called form fields. A form field attribute is set in each form field to indicate what the user should fill in; for example, a form field attribute of "name" indicates that the user should fill in his or her name. The form information is the text value filled into a form field for its form field attribute; for example, "Zhang San" filled in for the form field attribute "name" is form information. The PDF file is a file obtained by filling the form fields in a PDF template file. After the form information is filled into the PDF template file, the form fields are removed, so no form field attribute exists in the PDF file. Compared with the PDF template file, the PDF file contains the additional filled-in form information in each page that had a form field, and the form information extraction scheme of this embodiment extracts that form information. Once extracted, the form information may serve different purposes depending on the application scenario, for example checking the integrity and correctness of contract signing; this embodiment places no specific limitation on its use.

In this embodiment, to extract the form information, the PDF template file may be obtained first. Specifically, in an embodiment, after determining a PDF file whose form information needs to be extracted, a PDF template file corresponding to the PDF file may be acquired. In another embodiment, a filling position of the form field attribute may be extracted for a PDF template file, and then when a PDF file (obtained by filling form information in the PDF template file) whose form information needs to be extracted is obtained subsequently, form information in the PDF file may be extracted by using the pre-extracted filling position.

The obtained PDF template file may then be parsed. Parsing yields each text block in the page of the target page number in the PDF template file (hereinafter, this page is referred to as the first page, and a text block parsed from the PDF template file is referred to as a first text block), and the set of first text blocks extracted from the first page is referred to as the first text block set. A text block is the form in which text resources are stored in a PDF file: text resources are stored block by block, each text block contains several text characters, and the lengths of text blocks are not necessarily the same. The PDF template file may be parsed into first text blocks using an existing text block parsing method, which is not limited in this embodiment.

Parsing also yields the starting position of each first text block in the first page; the starting position of a text block is relative to the page in which it is located. In general, the starting position of a text block can be defined as the offset of its starting character relative to the upper left corner of the page, but other definitions are possible. For example, when the page content is laid out from right to left and from top to bottom, the starting position can instead be defined as the offset of the starting character relative to the upper right corner of the page. Both the PDF template file and the PDF file record the position of each resource in the page, for example as an offset relative to the upper left corner of the page or as an offset relative to the preceding resource, so the starting position of a text block in the page can be obtained by parsing the PDF template file.

Parsing further yields each form field attribute in the first page of the PDF template file. The form field attributes of each page are recorded in the PDF template file, so each form field attribute in the first page can be obtained by parsing the PDF template file. Specifically, the form field objects in the first page of the PDF template file may be extracted, and the form field attributes may be extracted from the form field objects.

In an embodiment, the page numbers of the pages in the PDF template file may be respectively used as target page numbers, that is, the same analysis may be performed on each page.

Step S20, respectively generating pre-filling information corresponding to each form field attribute, filling the PDF template file with each pre-filling information to obtain a temporary filling file, and analyzing to obtain a second text block set formed by second text blocks in the page of the target page number in the temporary filling file and a starting position of each second text block in the page;

After each form field attribute in the page of the target page number (i.e. the first page) in the PDF template file is obtained through parsing, pre-filling information corresponding to each form field attribute may be generated. In this embodiment, the content of the pre-filling information is not limited. The pre-filling information may therefore be generated as a random number using a random algorithm, or several pieces of information may be defined in advance and the pre-filling information for each form field attribute selected from them at random.

After the pre-filling information corresponding to each form field attribute of the first page is generated, the PDF template file may be filled with the pre-filling information; specifically, the form fields of the first page in the PDF template file are filled to obtain a PDF file with filled form fields, which is referred to as the temporary filling file. The filling may follow an existing form field filling method, which is not described in detail here. It will be appreciated that the pre-filling information in the temporary filling file is the form information of that file. A characteristic of a PDF template file is that, after the form information is filled in, the content and position of every resource outside the form fields remain unchanged in each page. The temporary filling file therefore has the same number of pages as the PDF template file, and the pages of the two files correspond one to one. Compared with the first page, the page of the target page number in the temporary filling file (hereinafter referred to as the second page) lacks the form fields and contains the additional pre-filling information, while the content and position of all other resources in the two pages are identical.

After the temporary filling file is obtained, it may be parsed. Parsing yields each text block in the second page of the temporary filling file (a text block parsed from the temporary filling file is referred to as a second text block), and the set of second text blocks extracted from the second page is referred to as the second text block set. The temporary filling file is parsed into second text blocks in the same way as the PDF template file is parsed into first text blocks, so the description is not repeated here. Parsing also yields the starting position of each second text block in the second page, in the same way as the starting position of each first text block is obtained from the PDF template file.

Step S30, comparing the text blocks in the second text block set with the text blocks in the first text block set, selecting a target text block in the second text block set and adding the target text block into a difference text block set, wherein the target text block and each of the first text blocks have at least one difference between the text content and the initial position;

After the first text block set and the second text block set are obtained, the text blocks in the second text block set can be compared with the text blocks in the first text block set, and target text blocks in the second text block set are selected and added to the difference text block set, where a target text block differs from every first text block in at least one of text content and starting position. That is, if a second text block differs from each first text block in at least one of text content and starting position, it is taken as a target text block, and all target text blocks in the second text block set are added to the difference text block set. Specifically, in one embodiment, the second text block set may be traversed and each second text block compared with each first text block in the first text block set. If the text content and starting position of the second text block are both the same as those of one of the first text blocks, the second text block is considered not to be form information; if no first text block has the same text content and starting position as the second text block, the second text block is considered to be form information and is added to the difference text block set. In another embodiment, while traversing the second text block set, once a first text block has been found to have the same text content and starting position as a second text block, subsequent second text blocks need not be compared with that first text block, which avoids wasting computing resources.
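The selection of difference text blocks can be sketched as follows. Text blocks are represented here as hypothetical `(text, position)` tuples, and the pairwise traversal described above is replaced by a set lookup, which is an implementation choice of this sketch rather than something stated in the embodiment.

```python
def difference_blocks(first_blocks, second_blocks):
    """Return the second text blocks that match no first text block in
    both text content and starting position."""
    known = set(first_blocks)  # each element is (text, (x, y))
    return [block for block in second_blocks if block not in known]


first = [("Name:", (4, 140)), ("Date:", (4, 200))]
second = [
    ("Name:", (4, 140)),    # unchanged template text
    ("a1b2c3", (12, 140)),  # pre-filling information: new text
    ("Date:", (4, 200)),    # unchanged template text
    ("Date:", (12, 200)),   # same text as the template, different position
]
diff = difference_blocks(first, second)
print(diff)
```

The last sample block shows why position matters: a filled value that happens to equal template text is still detected because it starts somewhere else.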

In an embodiment, when the text contents of the two text blocks are compared, the text contents of the two text blocks may be compared character by character, and once two different characters appear, the text contents of the two text blocks may be considered to be different, and the comparison of subsequent characters is not performed. For example, the text block 1 and the text block 2 are compared, specifically, the first character of the text block 1 is compared with the first character of the text block 2, if the first character and the first character are different, the text contents of the two text blocks are considered to be different, if the first character and the first character are the same, the second character of the text block 1 is compared with the second character of the text block 2, and so on.
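The character-by-character comparison with early termination can be sketched as below. The initial length check is an addition of this sketch, made so that a text block that is merely a prefix of another is not reported as equal; `same_text` is a hypothetical name.

```python
def same_text(block_a: str, block_b: str) -> bool:
    """Compare two text contents character by character, stopping at the
    first pair of characters that differ."""
    if len(block_a) != len(block_b):
        return False
    for char_a, char_b in zip(block_a, block_b):
        if char_a != char_b:
            return False  # remaining characters are not compared
    return True


print(same_text("contract", "contract"), same_text("contract", "contrast"))
```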

Step S40, matching the text content of the text block in the difference text block set with each of the pre-filling information to obtain a filling position corresponding to each of the form field attributes;

After the difference text block set is obtained, the text content of the text blocks in the difference text block set is matched with each piece of pre-filling information to obtain the filling position corresponding to each form field attribute. It can be understood that, compared with the first page, the second page contains the pre-filling information in place of the form fields, while the content and position of the other resources are the same. In addition, a characteristic of PDF files is that the form information filled into one form field is stored in one text block, so the difference text block set includes, for each piece of pre-filling information, a text block with the same content. Therefore, by matching the text content of each text block in the difference text block set with each piece of pre-filling information, a text block with the same content can be found in the difference text block set for every piece of pre-filling information. The pre-filling information corresponds one-to-one to the form field attributes, and each text block has a unique starting position, so the filling position corresponding to each form field attribute can be obtained by using the correspondence between the pre-filling information and the text blocks in the difference text block set as a bridge. The filling position of a form field attribute is the starting position, in the page, of the form information filled in for that form field attribute.

Further, in an embodiment, the step S40 includes:

step S401, respectively matching the text content of each text block in the difference text block set with each pre-filling information, and using the text block with the same content as the pre-filling information as a form information text block according to the matching result;

The text content of each text block in the difference text block set is matched with each piece of pre-filling information to determine whether pre-filling information with the same content exists for that text block. Based on the matching result, the text blocks whose content is the same as a piece of pre-filling information are taken as form information text blocks.

Step S402, the initial position of the form information text block in the page is used as the filling position of the corresponding form field attribute.

After finding the form information text block corresponding to each pre-filling information, for each pre-filling information, the starting position of the form information text block corresponding to the pre-filling information in the page may be used as the filling position of the form field attribute corresponding to the pre-filling information.
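Steps S401 and S402 can be sketched as below. This is an illustrative sketch under stated assumptions: text blocks are modelled as `(text, position)` pairs, the pre-filling information is held in a plain mapping from form field attribute to its value, and the name `fill_positions` is hypothetical.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, float]   # (horizontal steps, vertical offset), hypothetical
Block = Tuple[str, Position]   # (text content, starting position), hypothetical


def fill_positions(diff_blocks: List[Block],
                   prefill_by_attr: Dict[str, str]) -> Dict[str, Position]:
    """Map each form field attribute to the starting position of its pre-filled text."""
    attr_by_prefill = {value: attr for attr, value in prefill_by_attr.items()}
    positions: Dict[str, Position] = {}
    for text, position in diff_blocks:
        attr = attr_by_prefill.get(text)   # exact-content match only
        if attr is not None:               # block is a form information text block
            positions[attr] = position
    return positions


diff_blocks = [
    ("Xq7", (6, 10.0)),
    ("Signed by ", (0, 30.0)),            # split-off template text, not form information
    ("Zk2", (10, 30.0)),
    (" on the date below", (13, 30.0)),   # split-off template text, not form information
]
positions = fill_positions(diff_blocks, {"name": "Xq7", "signer": "Zk2"})
```

Blocks that entered the difference set only because a surrounding sentence was split by the filled field match no pre-filling information and are ignored.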

It can be understood that the text before and after a form field may belong to a single text block in the first page but, after the pre-filling information is filled in, be split into two text blocks in the second page, neither of which is form information. When the text blocks in the second text block set are compared with those in the first text block set, both of these text blocks are added as target text blocks because their text content differs from that of every first text block. However, when the text blocks in the difference text block set are matched with the pre-filling information, these non-form-information text blocks have no pre-filling information with the same content and are therefore not taken as form information text blocks, so the accuracy of the finally obtained filling positions of the form field attributes is not affected.

In the existing scheme, by contrast, form information is extracted by directly comparing the text block content of the PDF template file with that of the target PDF file, so the text blocks split off before and after a form field are also extracted as difference text blocks and mistaken for form information.

Step S50, acquiring a target PDF file from which form information is to be extracted, parsing it to obtain the third text blocks in the page of the target page number in the target PDF file and the starting position of each third text block in the page, and extracting the text content of the third text block whose starting position is at the filling position as the form information corresponding to the form field attribute.

After the filling position corresponding to each form field attribute of the first page in the PDF template file is obtained, the filling positions can be used to extract form information from the page with the same page number as the first page in a PDF file. For a PDF file obtained by filling the PDF template file, if form information needs to be extracted from it, that PDF file is used as the target PDF file. After the target PDF file is obtained, it may be parsed to obtain the text blocks in the page of the target page number (hereinafter referred to as the third page); for ease of distinction, text blocks parsed from the target PDF file are hereinafter referred to as third text blocks. The starting position of each third text block in the third page is also obtained through parsing. The third text blocks and their starting positions are obtained from the target PDF file in the same way as the first text blocks and their starting positions are obtained from the PDF template file, and the description is not repeated here.

After the third text blocks and the initial positions of the third text blocks are obtained through analysis, the text content in the third text block with the initial position at the filling position of the form field attribute can be used as the form information of the form field attribute. Specifically, the start position of each third text block may be compared with the filling position of each form field attribute, and if the start position of one third text block is the same as the filling position of one form field attribute, the text content of the third text block is extracted as the form information of the form field attribute. By the method, each form information in the target PDF file can be extracted, and the form field attribute corresponding to each extracted form information can be obtained.
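The position lookup in step S50 can be sketched as follows. The `(text, position)` representation and the name `extract_form_info` are assumptions made for illustration; the sketch indexes the third text blocks by starting position instead of comparing every pair, which yields the same matches.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, float]   # (horizontal steps, vertical offset), hypothetical
Block = Tuple[str, Position]   # (text content, starting position), hypothetical


def extract_form_info(third_blocks: List[Block],
                      fill_positions: Dict[str, Position]) -> Dict[str, str]:
    """Return the text found at each form field attribute's filling position."""
    text_at = {position: text for text, position in third_blocks}
    return {
        attr: text_at[position]
        for attr, position in fill_positions.items()
        if position in text_at   # no block at this filling position: nothing to extract
    }


third_blocks = [("Name:", (0, 10.0)), ("Alice", (6, 10.0)), ("Bob", (10, 30.0))]
known_positions = {"name": (6, 10.0), "signer": (10, 30.0), "date": (4, 90.0)}
form_info = extract_form_info(third_blocks, known_positions)
```

Each extracted value stays keyed by its form field attribute, which is how the correspondence between form information and form field attribute is preserved.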

In this embodiment, a PDF template file is obtained and parsed to obtain a first text block set formed by the first text blocks in the page of the target page number, the starting position of each first text block in the page, and each form field attribute in that page. Pre-filling information corresponding to each form field attribute is generated, the PDF template file is filled with the pre-filling information to obtain a temporary filling file, and the temporary filling file is parsed to obtain a second text block set formed by the second text blocks in the page of the target page number and the starting position of each second text block in the page. The text blocks in the second text block set are compared with those in the first text block set, and the target text blocks, each of which differs from every first text block in at least one of text content and starting position, are added into the difference text block set. The text content of the text blocks in the difference text block set is matched with each piece of pre-filling information to obtain the filling position corresponding to each form field attribute. Finally, a target PDF file from which form information is to be extracted is obtained and parsed to obtain the third text blocks in the page of the target page number and the starting position of each third text block in the page, and the text content of the third text block whose starting position is at a filling position is extracted as the form information corresponding to that form field attribute.

Compared with directly comparing the text block content of the PDF template file with that of the target PDF file, this embodiment obtains the starting position of each text block in the page by parsing the file and compares both text content and starting position. This avoids the inaccurate extraction that arises when it cannot be determined whether two text blocks with the same text content are on the same page and at the same position, and thus improves the accuracy of form information extraction. Moreover, the form field attributes are extracted from the PDF template file, the pre-filling information is filled in according to the form field attributes to obtain a temporary filling file, the text blocks of the PDF template file and the temporary filling file are compared to obtain the filling position of each form field attribute, and the form information in the target PDF file is extracted through the filling positions. The correspondence between each piece of form information and its form field attribute is therefore obtained together with the form information itself, which further improves the accuracy of form information extraction.

Further, based on the first embodiment, a second embodiment of the PDF form information extraction method according to the present invention is provided, in this embodiment, the step S10 of obtaining, by parsing, a first text block set formed by first text blocks in a page of a target page number in the PDF template file, and a starting position of each first text block in the page includes:

step S101, analyzing the PDF template file to obtain a page object corresponding to a target page number in the PDF template file;

In this embodiment, the horizontal offset and the vertical offset of the starting character of a text block relative to the top left corner of the page may be used as the starting position of the text block in the page. The horizontal offset may be measured in units of the width of a space character, and the vertical offset may be measured in the unit that the file uses for the page height. The PDF template file records the starting vertical coordinate information of each resource, from which the vertical offset of the starting character of a text block relative to the top edge of the page can be calculated. By parsing the PDF template file line by line with the width of a space character as the step length, each text block can be obtained, together with the number of space characters (that is, the number of steps) that precede the starting character of the text block in its line; this number is the horizontal offset of the starting character of the text block.

Specifically, in an embodiment, the PDF template file may be parsed to obtain a page object corresponding to the target page number in the PDF template file. In the PDF template file, a page is stored in the form of a page object, and various resources in the page are defined or described in the page object, including types, locations, contents, and organization forms of the various resources.

Further, in an embodiment, the step S101 includes:

step S1011, acquiring a root object number stored at the tail of the file in the PDF template file, and analyzing the PDF template file according to the root object number to obtain a root object;

step S1012, analyzing the PDF template file according to the page group object number stored in the root object to obtain a page group object;

step S1013, determining a target page object number corresponding to a target page number from the page object numbers stored in the page group object, and analyzing the PDF template file according to the target page object number to obtain a page object corresponding to the target page number.

The PDF template file mainly comprises four parts: the file header, the file body, the cross-reference table, and the file trailer (Trailer). The file body stores the indirect objects; the page group object (Pages), the page objects (Page) and the text blocks are all stored in the file body. The cross-reference table stores the starting address of each indirect object in the file. The file trailer stores the starting position of the cross-reference table and the object number of the root object (Root), that is, the root object number.

The root object is obtained by reading the root object number stored in the file trailer of the PDF template file and parsing the PDF template file according to that number; specifically, the root object is taken out of the file body at the starting address recorded for the root object number in the cross-reference table. After the root object is obtained, the page group object is parsed from the PDF template file according to the page group object number stored in the root object; specifically, the page group object is taken out of the file body at the starting address recorded for the page group object number in the cross-reference table. The target page object number corresponding to the target page number is then determined from the page object numbers stored in the page group object: the page group object stores the page number corresponding to each page object number, and the page object number corresponding to the target page number is used as the target page object number. After the target page object number is determined, the page object corresponding to the target page number is parsed from the PDF template file according to it; specifically, the target page object is taken out of the file body at the starting address recorded for the target page object number in the cross-reference table.
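The lookup chain of steps S1011 to S1013 (trailer, root object, page group object, page object) can be sketched on a toy in-memory model. Everything here is an assumption for illustration: `objects` stands in for reading an object from the file body at the address given by the cross-reference table, the dictionary layout and the name `resolve_page` are hypothetical, and the page group object is modelled as a flat, ordered list of page object numbers.

```python
from typing import Any, Dict

PdfObject = Dict[str, Any]


def resolve_page(trailer: PdfObject,
                 objects: Dict[int, PdfObject],
                 page_number: int) -> PdfObject:
    """Follow trailer -> root object -> page group object -> page object."""
    root = objects[trailer["Root"]]      # S1011: root object number from the trailer
    page_group = objects[root["Pages"]]  # S1012: page group object number from the root
    kids = page_group["Kids"]            # page object numbers, in page order
    if not 1 <= page_number <= len(kids):
        raise IndexError(f"page {page_number} is not in the document")
    return objects[kids[page_number - 1]]   # S1013: target page object


objects = {
    1: {"Type": "Catalog", "Pages": 2},
    2: {"Type": "Pages", "Kids": [3, 4]},
    3: {"Type": "Page", "Contents": "page one"},
    4: {"Type": "Page", "Contents": "page two"},
}
page = resolve_page({"Root": 1}, objects, 2)
```

A real file may nest page group objects inside one another; the flat list is a simplification.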

Step S102, analyzing each resource in the page object line by taking the width of a space character as a step length to obtain each first text block in a page corresponding to the target page number and a transverse offset of an initial character of the first text block, wherein the transverse offset is the step length number contained in front of the initial character in a line where the initial character is located;

Each resource in the page object is parsed line by line, with the width of a space character as the step length, to obtain each first text block in the page corresponding to the target page number and the horizontal offset of the starting character of each first text block. Line-by-line parsing means parsing the resources one by one in the order in which they are arranged in the page. The page object records the starting abscissa information of each resource, which is either the abscissa offset of the starting point of the resource relative to the starting point of the previous resource, or its abscissa offset relative to the top left corner of the page. If the starting abscissa information is relative to the previous resource, the abscissa relative to the top left corner of the page can be calculated by accumulating the offsets while the resources are parsed line by line; if it is already relative to the top left corner of the page, it can be used directly. During line-by-line parsing, each resource of the text type is taken as a parsed text block. The abscissa offset of the starting point of the text block relative to the top left corner of the page is divided by the width of a space character to obtain the number of space characters, that is, the number of steps, that precede the starting character of the text block in its line, and this number of steps is used as the horizontal offset of the starting character of the text block.

Step S103, calculating to obtain the longitudinal offset of the initial character of the first text block relative to the top edge of the page according to the initial vertical coordinate information of each resource in the page object;

The page object records the starting vertical coordinate information of each resource. In a manner similar to the calculation of the abscissa offset, the vertical coordinate offset of the starting point of the first text block relative to the top edge of the page (or the top left corner of the page) can be calculated, and this offset is used as the vertical offset of the starting character of the first text block relative to the top edge of the page.

Step S104, taking the horizontal offset and the vertical offset of the initial character of the first text block as the initial position of the first text block in the page.

And taking the horizontal offset and the vertical offset of the starting character of the first text block as the starting position of the first text block in the page.
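Steps S102 to S104 can be sketched as below for the case where each resource records its abscissa relative to the previous resource. The `Resource` type, its fields and the name `start_positions` are hypothetical; the sketch also assumes the vertical coordinate is already expressed as a distance from the top edge of the page, and it rounds the step count to the nearest integer.

```python
from typing import List, NamedTuple, Tuple


class Resource(NamedTuple):
    """Hypothetical page resource as recorded in a page object."""
    kind: str       # "text" for text blocks, anything else for other resources
    content: str
    dx: float       # abscissa offset relative to the previous resource's start
    top: float      # assumed: distance of the starting point from the top edge


def start_positions(resources: List[Resource],
                    space_width: float) -> List[Tuple[str, Tuple[int, float]]]:
    """Return (text, (horizontal steps, vertical offset)) for every text resource."""
    blocks = []
    x = 0.0                              # abscissa relative to the top left corner
    for resource in resources:           # resources in their order within the page
        x += resource.dx                 # accumulate relative offsets
        if resource.kind == "text":
            steps = round(x / space_width)   # space-width steps before the first character
            blocks.append((resource.content, (steps, resource.top)))
    return blocks


resources = [
    Resource("text", "Name:", 0.0, 72.0),
    Resource("image", "", 50.0, 72.0),
    Resource("text", "Date:", -30.0, 90.0),
]
blocks = start_positions(resources, space_width=5.0)
```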

Further, based on the first and/or second embodiments, a third embodiment of the PDF form information extraction method according to the present invention is proposed, in this embodiment, the step of generating pre-filling information corresponding to each form field attribute in step S20 includes:

in step S201, pre-filling information corresponding to each form field attribute is randomly generated, where the same text content as the pre-filling information does not exist in the first text block set.

In this embodiment, after extracting the respective form field attributes in the first page, pre-filling information corresponding to the respective form field attributes may be randomly generated, and the generated pre-filling information needs to satisfy a condition that the text content identical to the pre-filling information does not exist in the first text block set. In the present embodiment, any random generation manner capable of generating the pre-population information satisfying the condition may be adopted, and is not limited in particular. By generating the pre-filling information which is not contained in the first text block set, when the text blocks in the second text block set are compared with the text blocks in the first text block set, the pre-filling information can be prevented from being matched with the non-form information in the first text block set, and the accuracy of extracting the form information can be improved.

Further, in an embodiment, a random character generation algorithm may be used to generate a random character string, and the generated string is compared with each text block in the first text block set to determine whether any text block contains it. If not, the random character string is used as the pre-filling information of the form field attribute; otherwise, a new random character string is generated and compared with the text blocks in the first text block set again, and this is repeated until pre-filling information for the form field attribute is obtained. Pre-filling information corresponding to each form field attribute is generated in the same way.
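A minimal sketch of this generate-and-check loop follows. The function names, the 12-character alphanumeric format and the optional `rng` parameter are assumptions; the extra check that two form field attributes never receive the same string is also an assumption, added so that the one-to-one correspondence between pre-filling information and form field attributes holds.

```python
import random
import string
from typing import Dict, Iterable, List, Optional


def generate_prefill(first_texts: List[str], length: int = 12,
                     rng: Optional[random.Random] = None) -> str:
    """Generate a random string that is contained in no first text block."""
    if rng is None:
        rng = random.Random()
    alphabet = string.ascii_letters + string.digits
    while True:
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        if not any(candidate in text for text in first_texts):
            return candidate   # not present in the template page: usable


def generate_all(attrs: Iterable[str], first_texts: List[str],
                 rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Generate distinct pre-filling information for every form field attribute."""
    if rng is None:
        rng = random.Random()
    values: Dict[str, str] = {}
    for attr in attrs:
        value = generate_prefill(first_texts, rng=rng)
        while value in values.values():   # assumption: values must be unique
            value = generate_prefill(first_texts, rng=rng)
        values[attr] = value
    return values


prefill = generate_all(["name", "date"], ["Name:", "Date:"], rng=random.Random(0))
```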

Further, in an embodiment, the method further comprises:

step S60, establishing an attribute filling value dictionary associating each form field attribute with the corresponding pre-filling information;

After the pre-filling information corresponding to each form field attribute is generated, an attribute filling value dictionary may be established in which each form field attribute is stored in association with its pre-filling information, that is, the form field attributes and the pre-filling information are in one-to-one correspondence in the dictionary.

The step S402 includes:

step S4021, searching the attribute filling value dictionary for the form field attribute corresponding to the pre-filling information with the same content as the form information text block;

step S4022, using the initial position of the form information text block in the page as the filling position of the found form field attribute.

After the form information text blocks corresponding to the respective pieces of pre-filling information are determined in the difference text block set, for each form information text block, the attribute filling value dictionary may be searched for the form field attribute corresponding to the pre-filling information whose content is the same as that of the form information text block; that is, the pre-filling information is used as the index to look up the corresponding form field attribute in the attribute filling value dictionary. After the form field attribute is found, the starting position of the form information text block in the page is used as the filling position of the found form field attribute.

Further, in an embodiment, before the step in step S10 of parsing to obtain the first text block set formed by the first text blocks in the page of the target page number in the PDF template file and the starting position of each first text block in the page, and parsing to obtain each form field attribute in the page of the target page number in the PDF template file, the method further includes:

step S70, parsing the PDF template file, and determining each target page object containing a form field object in the PDF template file;

After the PDF template file is obtained, it may be parsed to determine each target page object that contains a form field object. Among the pages of the PDF template file, some have form fields and some do not, and form information needs to be extracted only from pages that have form fields. Therefore, each page object in the PDF template file that contains a form field object is taken as a target page object.

Step S80, taking the page number corresponding to each target page object as the target page number.

The page number corresponding to each target page object is used as a target page number. It should be noted that when several pages contain form fields, several target page objects are obtained, and the page number of each of them is used as a target page number, that is, form information is extracted from each of those pages. When only one page contains form fields, only one target page object is obtained and only its page number is used as the target page number, that is, form information is extracted only from that page.
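Steps S70 and S80 can be sketched on a toy page model as below. The source does not say how a page object is recognised as containing a form field object; this sketch assumes the usual PDF convention that a form field appears on a page as an annotation whose subtype is `Widget`. The dictionary layout and the name `target_page_numbers` are hypothetical.

```python
from typing import Any, Dict, List

PageObject = Dict[str, Any]


def target_page_numbers(pages: List[PageObject]) -> List[int]:
    """Return the 1-based page numbers of pages holding at least one form field."""
    targets = []
    for number, page in enumerate(pages, start=1):
        annotations = page.get("Annots", [])
        # Assumption: a form field shows up as a Widget annotation on the page.
        if any(a.get("Subtype") == "Widget" for a in annotations):
            targets.append(number)
    return targets


pages = [
    {"Annots": []},
    {"Annots": [{"Subtype": "Widget"}]},
    {},
    {"Annots": [{"Subtype": "Link"}, {"Subtype": "Widget"}]},
]
targets = target_page_numbers(pages)
```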

In addition, an embodiment of the present invention further provides a PDF form information extraction device, where, with reference to fig. 3, the device includes:

the first analysis module 10 is configured to obtain a PDF template file, analyze the PDF template file to obtain a first text block set formed by first text blocks in a page of a target page number in the PDF template file and a starting position of each first text block in the page, and analyze the PDF template file to obtain each form field attribute in the page of the target page number;

a second parsing module 20, configured to generate pre-filling information corresponding to each form field attribute, fill the PDF template file with each pre-filling information to obtain a temporary filling file, and parse the temporary filling file to obtain a second text block set formed by second text blocks in the page of the target page number and a starting position of each second text block in the page;

a comparing module 30, configured to compare text blocks in the second text block set with text blocks in the first text block set, select a target text block in the second text block set, and add the target text block into a difference text block set, where at least one of text content and an initial position of the target text block is different from that of each of the first text blocks;

a matching module 40, configured to match text contents of text blocks in the difference text block set with each of the pre-filling information to obtain filling positions corresponding to each of the form field attributes;

the extracting module 50 is configured to obtain a target PDF file of the form information to be extracted, analyze the target PDF file to obtain a third text block in the page of the target page number in the target PDF file and an initial position of each third text block in the page, and extract text content in the third text block at the initial position in the filling position as the form information corresponding to the form field attribute.

Further, the first parsing module 10 includes:

the first analysis unit is used for analyzing the PDF template file to obtain a page object corresponding to a target page number in the PDF template file;

a second analyzing unit, configured to analyze, line by line, each resource in the page object with a width of a space character as a step length to obtain each first text block in a page corresponding to the target page number and a lateral offset of a start character of the first text block, where the lateral offset is a step number included in front of the start character in a line where the start character is located;

the calculation unit is used for calculating and obtaining the longitudinal offset of the initial character of the first text block relative to the top edge of the page according to the initial vertical coordinate information of each resource in the page object;

a first determining unit, configured to use the lateral offset and the longitudinal offset of the starting character of the first text block as a starting position of the first text block in a page.

Further, the first parsing unit is further configured to:

Further, the second parsing module 20 includes:

and a generating unit, configured to randomly generate pre-filling information corresponding to each form field attribute, where the same text content as the pre-filling information does not exist in the first text block set.

Further, the matching module 40 includes:

the matching unit is used for respectively matching the text content of each text block in the difference text block set with each pre-filling information, and taking the text block with the same content as the pre-filling information as a form information text block according to the matching result;

and the second determining unit is used for taking the initial position of the form information text block in the page as the filling position of the corresponding form field attribute.

Further, the apparatus further comprises:

an establishing module, configured to establish an attribute filling value dictionary associating each form field attribute with the corresponding pre-filling information;

the second determination unit is further configured to:

Further, the apparatus further comprises:

the third analysis module is used for parsing the PDF template file and determining each target page object containing a form field object in the PDF template file;

and the determining module is used for respectively taking the page number corresponding to each target page object as the target page number.

The specific implementation of the PDF form information extraction apparatus of the present invention is basically the same as the above embodiments of the PDF form information extraction method, and is not described herein again.

In addition, an embodiment of the present invention further provides a computer-readable storage medium on which a PDF form information extraction program is stored; when executed by a processor, the PDF form information extraction program implements the steps of the PDF form information extraction method described above.

The embodiments of the PDF form information extraction device and the computer-readable storage medium of the present invention can refer to the embodiments of the PDF form information extraction method of the present invention, and are not described herein again.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A PDF form information extraction method is characterized by comprising the following steps:

2. The method as claimed in claim 1, wherein the step of parsing to obtain the first text block set formed by the first text block in the page of the target page number in the PDF template file and the respective starting positions of the first text blocks in the page comprises:

3. The method for extracting PDF form information according to claim 2, wherein said step of parsing the PDF template file to obtain a page object corresponding to a target page number in the PDF template file comprises:

4. The PDF form information extraction method of claim 1, wherein the step of generating pre-populated information corresponding to each of the form field attributes respectively comprises:

5. The method of extracting PDF form information according to claim 1, wherein said step of matching the text content of the text block in said set of difference text blocks with each of said pre-populated information to obtain a filling position corresponding to each of said form field attributes comprises:

6. The PDF form information extraction method of claim 5, wherein said step of generating pre-populated information corresponding to each of said form field attributes, respectively, further comprises:

7. The method as claimed in any one of claims 1 to 6, wherein before the step of obtaining the first text block set formed by the first text block in the page of the target page number in the PDF template file and the starting position of each first text block in the page respectively by parsing, and obtaining each form field attribute in the page of the target page number in the PDF template file by parsing, the method further comprises:

8. An apparatus for extracting PDF form information, the apparatus comprising:

9. A PDF form information extracting apparatus, comprising: a memory, a processor and a PDF form information extraction program stored on the memory and executable on the processor, the PDF form information extraction program when executed by the processor implementing the steps of the PDF form information extraction method of any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that a PDF form information extraction program is stored thereon, which when executed by a processor, implements the steps of the PDF form information extraction method according to any one of claims 1 to 7.