CN114821613A

CN114821613A - Extraction method and system of table information in PDF

Info

Publication number: CN114821613A
Application number: CN202210342716.4A
Authority: CN
Inventors: 王则远; 刘鹏
Original assignee: Lingxi Quantum Beijing Medical Technology Co ltd
Current assignee: Lingxi Quantum Beijing Medical Technology Co ltd
Priority date: 2022-03-31
Filing date: 2022-03-31
Publication date: 2022-07-29

Abstract

The invention relates to the technical field of deep learning, and provides a method and a system for extracting table information in PDF. The method comprises the following steps: acquiring a PDF file, and identifying an image page including a table in the file; segmenting the image page to obtain a plurality of table units, and running a table structure recognition model with the table units as input to obtain a target sequence; and integrating the text recognition result of the table units into the target sequence to obtain a table extraction result in the form of html code. Because the table structure frame is recognized from table units obtained by image segmentation, that is, by a model trained with multiple-instance learning, the table information in a PDF can be recognized and extracted more effectively and accurately; moreover, an html-sequence output format for the table structure frame is better suited to a model trained with multiple-instance learning, so that the table information extraction task, especially in complex scenes, achieves better efficiency and accuracy.

Description

Extraction method and system of table information in PDF

Technical Field

The invention relates to the technical field of deep learning, in particular to a method and a system for extracting table information in PDF.

Background

Processing and using table data in PDF files is widely needed in many practical production scenarios. In recent years, with the rapid development of artificial-intelligence-based computer vision algorithms, extracting PDF table information with AI technology has become a direction of great value and significance.

In actual production, table data in PDF files often needs to be analyzed and organized systematically. A PDF file usually contains many tables, and organizing them manually is tedious work that consumes a large amount of labor and time. Extracting PDF table information automatically or semi-automatically by technical means has therefore become an important research subject, and how to provide an efficient and accurate method and system for extracting table information in PDF has become a technical problem that urgently needs to be solved in the industry.

Disclosure of Invention

The invention provides a method and a system for extracting table information in PDF (Portable Document Format) files, which address the high labor cost and time cost of extracting table information from PDFs in the prior art and achieve more efficient and accurate extraction of table information in PDF.

The invention provides a method for extracting table information in PDF, which comprises the following steps:

acquiring a PDF file, and identifying an image page comprising a table in the PDF file;

segmenting the image page to obtain a plurality of table units, and running a table structure recognition model with the table units as input to obtain a target sequence; the target sequence is an html sequence based on a table structure frame;

integrating the text recognition result of the table unit into the target sequence to obtain a table extraction result in the form of html codes;

the table structure recognition model is a model obtained through sample training.

According to the method for extracting table information in PDF provided by the invention, the table structure recognition model is a model with an encoder-decoder structure;

the encoder can extract the local features, the global features and the associated features of the table units, and encode the local features, the global features and the associated features to obtain feature extraction results;

the decoder can obtain a table structure frame according to the feature extraction result.

According to the extraction method of the table information in the PDF, provided by the invention, the table structure identification model is a model based on a self-attention mechanism;

the encoder is capable of extracting local features, global features, associated features and element sequence features of the source sequence elements; the source sequence comprises the plurality of table units in order; a source sequence element refers to a table unit, or to a character string obtained by splitting a table unit;

the decoder is capable of:

obtaining the 1 st element feature of the target sequence according to the local feature, the global feature, the association feature and the element sequence feature of the source sequence element;

obtaining the ith element characteristic of the target sequence according to the local characteristic, the global characteristic, the association characteristic and the element sequence characteristic of the source sequence element and the 1 st to the (i-1) th element characteristics of the target sequence;

obtaining elements of the target sequence according to the element characteristics of the target sequence;

the target sequence element is an html character or a character string.
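The decoder behavior described above, in which the i-th target element depends on the source features and on target elements 1 to i-1, is the usual autoregressive decoding loop. A minimal sketch follows, assuming greedy selection and a hypothetical `next_element` callable standing in for the trained decoder; the toy decoder below simply replays a fixed single-row table and is not a model.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    source_features: Sequence[object],
    next_element: Callable[[Sequence[object], List[str]], str],
    end_element: str = "</table>",
    max_len: int = 256,
) -> List[str]:
    """Build the target sequence one element at a time.

    Each call to next_element sees the source features and every target
    element produced so far, and returns the next html element.
    """
    target: List[str] = []
    while len(target) < max_len:
        element = next_element(source_features, target)
        target.append(element)
        if element == end_element:
            break
    return target


def toy_next_element(source_features: Sequence[object], prefix: List[str]) -> str:
    # Hypothetical stand-in for a decoder: one row with one cell per source element.
    script = ["<table>", "<tr>"] + ["<td></td>"] * len(source_features) + ["</tr>", "</table>"]
    return script[len(prefix)]


print(greedy_decode(["feature-1", "feature-2"], toy_next_element))
```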

According to the method for extracting table information in PDF provided by the invention, the step of acquiring a PDF file and identifying an image page comprising a table in the PDF file comprises:

acquiring a PDF file;

identifying an image page comprising a table in the PDF file according to a preset table identification rule and/or a preset PDF identification model;

the PDF identification model is obtained by training a YOLOv5-based model on samples and labels, with a PDF file as input and the image pages comprising a table in the PDF file as output.

According to the extraction method of the table information in the PDF provided by the invention, the step of integrating the text recognition result of the table unit into the target sequence to obtain the table extraction result in the html code form comprises the following steps:

running a text recognition model based on the table unit to obtain a text recognition result of the table unit;

determining a table structure according to the target sequence, and filling the text recognition result into the table structure to obtain a table extraction result in an html code form;

the text recognition model is obtained by training a CTPN-based model on samples and labels, with the table unit in picture format as input and the text recognition result of the table unit as output.

According to the extraction method of the table information in the PDF provided by the invention, the step of determining a table structure according to the target sequence and filling the text recognition result into the table structure to obtain the table extraction result in the html code form comprises the following steps:

determining at least one table structure according to the target sequence; the number of the table structures is the same as the number of the tables in the PDF file;

determining a mapping relation between a text in the text recognition result and the table structure, and filling the text to the table structure according to the mapping relation to obtain a table extraction result in an html code form; the number of table extraction results in the html code form is the same as that of the table structures.
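The two steps above (one table structure per table, then filling text according to a mapping) can be sketched as follows. The source does not specify how the mapping is determined; this sketch assumes the simplest possible mapping, reading order, in which the k-th recognized text goes into the k-th empty cell. The function names are hypothetical.

```python
import re
from html import escape
from typing import List


def split_tables(target_sequence: str) -> List[str]:
    """Return one structure string per <table>...</table> in the target sequence."""
    return re.findall(r"<table>.*?</table>", target_sequence, flags=re.S)


def fill_table(structure: str, texts: List[str]) -> str:
    """Fill empty <td> cells in reading order (assumed mapping, not from the source)."""
    remaining = iter(texts)
    return re.sub(
        r"(<td[^>]*>)(</td>)",
        lambda m: m.group(1) + escape(next(remaining, "")) + m.group(2),
        structure,
    )


def extract(target_sequence: str, texts_per_table: List[List[str]]) -> List[str]:
    """Produce one filled html result per table structure."""
    structures = split_tables(target_sequence)
    return [fill_table(s, t) for s, t in zip(structures, texts_per_table)]


sequence = (
    "<table><tr><td></td><td></td></tr></table>"
    "<table><tr><td></td></tr></table>"
)
print(extract(sequence, [["Drug", "Dose"], ["5 mg"]]))
```

A production system would more plausibly map text to cells by position (for example, bounding-box overlap); reading order is used here only to keep the example self-contained.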

The invention also provides a system for extracting table information in PDF, which comprises:

the acquisition module is used for acquiring a PDF file and identifying an image page comprising a table in the PDF file;

the segmentation module is used for segmenting the image page to obtain a plurality of table units, and running a table structure recognition model with the table units as input to obtain a target sequence; the target sequence is an html sequence based on a table structure frame;

the extraction module is used for integrating the text recognition result of the table unit into the target sequence to obtain a table extraction result in the form of html codes;

The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the method for extracting the table information in the PDF.

The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for extracting table information in a PDF as described in any one of the above.

The present invention also provides a computer program product comprising a computer program, which when executed by a processor implements the steps of the method for extracting table information in a PDF as described in any one of the above.

According to the method and system for extracting table information in PDF provided by the invention, the table structure frame is recognized from table units obtained by image segmentation, that is, by a model trained with multiple-instance learning, so that the table information in a PDF can be recognized and extracted more effectively and accurately; moreover, an html-sequence output format for the table structure frame is better suited to a model trained with multiple-instance learning, so that the table information extraction task, especially in complex scenes, achieves better efficiency and accuracy.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of a method for extracting table information in a PDF according to the present invention;

FIG. 2 is a table extraction architecture according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a table information extraction process provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of a table to be extracted in a PDF according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an html file of a table extraction result provided by an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an electronic device provided by the present invention;

FIG. 7 is a schematic structural diagram of an apparatus for extracting table information in PDF according to an embodiment of the present invention;

FIG. 8 is a flowchart illustrating an implementation of LX-tableOCR according to an embodiment of the present invention.

Reference numerals:

610: a processor;

620: a communication interface;

630: a memory;

640: a communication bus;

701: an acquisition module;

702: a segmentation module;

703: an extraction module.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The following describes the method for extracting table information in PDF according to the present invention with reference to FIGS. 1 to 5.

As shown in fig. 1, an embodiment of the present invention provides a method for extracting table information in a PDF, including:

step 102, acquiring a PDF file, and identifying an image page comprising a table in the PDF file;

step 104, segmenting the image page to obtain a plurality of table units, and running a table structure recognition model with the table units as input to obtain a target sequence; the target sequence is an html sequence based on a table structure frame;

step 106, integrating the text recognition result of the table unit into the target sequence to obtain a table extraction result in the form of html codes;

In a preferred embodiment, the table units are segmented based on the table structure: for a fully ruled table, the table units can be obtained by segmenting along the horizontal and vertical frame lines (and, in some cases, diagonal frame lines); for tables with only horizontal frame lines, only vertical frame lines, or no frame lines, the table units can be obtained by segmentation based on character clustering.
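For the fully ruled case, segmentation along frame lines reduces to pairing consecutive line positions. The sketch below assumes the frame-line coordinates have already been detected and that the table has no merged cells or diagonal lines; the function name `cells_from_lines` is hypothetical, and the character-clustering case is not covered.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def cells_from_lines(xs: Sequence[int], ys: Sequence[int]) -> List[List[Box]]:
    """Return a row-major grid of boxes bounded by consecutive frame lines.

    xs are the x positions of vertical lines, ys the y positions of
    horizontal lines; duplicates are dropped and both are sorted first.
    """
    xs, ys = sorted(set(xs)), sorted(set(ys))
    return [
        [(x0, y0, x1, y1) for x0, x1 in zip(xs, xs[1:])]
        for y0, y1 in zip(ys, ys[1:])
    ]


# Three vertical and three horizontal lines give a 2 x 2 grid.
print(cells_from_lines([0, 50, 100], [0, 20, 40]))
```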

In another preferred embodiment, the table units are segmented randomly or at a set size.
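Segmentation at a set size can be sketched as plain tiling of the page image. The example below covers only the fixed-size variant, assumes non-overlapping tiles with smaller tiles at the right and bottom edges, and uses a hypothetical function name `tile_boxes`.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_boxes(width: int, height: int, tile_w: int, tile_h: int) -> List[Box]:
    """Cover a width x height image with non-overlapping tiles, row by row."""
    boxes = []
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            # Clip tiles that would extend past the image border.
            boxes.append((left, top, min(left + tile_w, width), min(top + tile_h, height)))
    return boxes


print(tile_boxes(100, 60, 40, 30))
```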

In this embodiment, the input of the table structure recognition model is a table unit in picture format, and the output is html code.

The beneficial effect of this embodiment lies in:

the table structure frame is recognized from table units obtained by image segmentation, that is, by a model trained with multiple-instance learning, so that the table information in the PDF can be recognized and extracted more effectively and accurately; moreover, an html-sequence output format for the table structure frame is better suited to a model trained with multiple-instance learning, so that the table information extraction task, especially in complex scenes, achieves better efficiency and accuracy.

According to the above embodiment, in the present embodiment:

the table structure identification model is a model of an encoder-decoder structure;

The table structure recognition model is a model based on a self-attention mechanism;

the decoder is capable of:

the target sequence element is an html character or a character string.

In this embodiment, the training process of the table recognition model of the encoder-decoder structure is as follows:

following the principle of multiple-instance learning, the image blocks of each image are grouped into a multi-instance bag; each instance in the bag is further split into an embedding sequence, forming tokens similar to those used in natural language processing (NLP); the tokens are encoded by a Transformer, and the instances are encoded by an outer Transformer; the different local and global combinations of the tokens are mapped correspondingly, and the information each of them reflects is summed spatially; meanwhile, a distillation token is added to the outer sequence of each instance, so that the newly added distillation token interacts with the original tokens through the self-attention layers and is learned through back-propagation, which makes the training process efficient.

That is, the table recognition model of this embodiment is a self-attention mechanism model based on natural language understanding, and its input includes, in addition to the table units, distillation tokens that interact with the table units to learn their sequence features.
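The data layout implied by this description (image, then instances, then tokens, plus a distillation token on the outer sequence) can be illustrated without any model. The sketch below is an assumption-laden toy: the image is a nested list of numbers, each instance is summarized by mean pooling rather than a learned embedding, and the names `split_grid`, `build_bag` and `outer_sequence` are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def split_grid(image: Grid, size: int) -> List[Grid]:
    """Split a 2D grid into non-overlapping size x size blocks, row-major."""
    height, width = len(image), len(image[0])
    return [
        [row[c:c + size] for row in image[r:r + size]]
        for r in range(0, height, size)
        for c in range(0, width, size)
    ]


def build_bag(image: Grid, patch: int, sub: int) -> List[List[List[float]]]:
    """Image -> bag of instances; each instance -> list of flattened tokens."""
    bag = []
    for instance in split_grid(image, patch):
        tokens = [[v for row in block for v in row] for block in split_grid(instance, sub)]
        bag.append(tokens)
    return bag


def outer_sequence(bag: List[List[List[float]]], distill_token: List[float]) -> List[List[float]]:
    """One summary vector per instance (mean pooling, an assumption) plus the distillation token."""
    summaries = [[sum(col) / len(tokens) for col in zip(*tokens)] for tokens in bag]
    return summaries + [distill_token]


image = [[float(4 * r + c) for c in range(4)] for r in range(4)]  # toy 4 x 4 "image"
bag = build_bag(image, patch=2, sub=1)
print(outer_sequence(bag, distill_token=[0.0]))
```

In a trained model the token and instance vectors would come from learned projections, and the distillation token would be a learned parameter rather than a fixed zero vector.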

The beneficial effect of this embodiment lies in:

the table structure is recognized by the self-attention mechanism model based on natural language understanding, so the table structure can be constructed from the semantic relations among the characters in the table without relying on the table's frame lines; recognition efficiency and accuracy are therefore higher for tables without frame lines (or similar partially ruled tables) and for complex table structures.

According to any of the embodiments described above, in this embodiment:

the step of acquiring a PDF file and identifying an image page comprising a table in the PDF file comprises:

acquiring a PDF file;

The step of integrating the text recognition result of the table unit into the target sequence to obtain the table extraction result in the html code form comprises the following steps:

The step of determining a table structure according to the target sequence and filling the text recognition result into the table structure to obtain a table extraction result in an html code form includes:

The beneficial effect of this embodiment lies in:

in this embodiment, tables in the PDF are first identified by preset rules; if rule-based identification succeeds, the image pages comprising a table in the PDF file are output directly (in which case the image pages are output more efficiently); if rule-based identification fails, the image pages comprising a table in the PDF file are output by the YOLOv5-based PDF identification model (in which case a wider range of input PDFs can be handled).
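The rule-first, model-second behavior described here can be sketched as a simple fallback. The rule used below (a caption such as "Table 1" in the page text) and the per-page granularity are assumptions made for illustration; `detector` is a hypothetical callable standing in for the trained YOLOv5-based model and does not reflect any real YOLOv5 API.

```python
import re
from typing import Callable, List, Sequence, Tuple

# Assumed rule: a page is a table page if its text contains a caption like "Table 3".
TABLE_CAPTION = re.compile(r"\bTable\s+\d+", re.IGNORECASE)


def find_table_pages(
    page_texts: Sequence[str],
    detector: Callable[[int], bool],
) -> List[Tuple[int, str]]:
    """Return (page index, source) pairs, where source is 'rule' or 'model'.

    The cheap rule is tried first; the detector is consulted only for
    pages the rule does not match.
    """
    selected = []
    for index, text in enumerate(page_texts):
        if TABLE_CAPTION.search(text):
            selected.append((index, "rule"))
        elif detector(index):
            selected.append((index, "model"))
    return selected


pages = ["Table 1. Dosage by age group", "", "Plain prose without tables"]
print(find_table_pages(pages, detector=lambda index: index == 1))
```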

In addition, in the embodiment, a text recognition model based on the CTPN algorithm is used for text recognition, and in some embodiments, other text recognition algorithms may be used to achieve similar effects.

According to the above embodiments, a more complete embodiment will be provided in the following from the perspective of flow implementation.

The embodiment provides improvement for the following defects in the prior art:

Extraction of table information from PDFs is a relatively active topic. Based on a survey of prior-art schemes, existing approaches fall roughly into two categories. The first extracts the table information in the PDF by rules written in programming languages such as Python or Java, and then extracts the content through the languages' built-in packages and similar means; this places high requirements on the types of PDF and of table information, so extraction coverage is low and generality is poor. The second performs content recognition on the table pictures in the PDF using OCR technology; this solves the problem of low coverage, but the recognition accuracy of the table information and the recognition of the table structure information still need to be improved.

The embodiment aims to provide a method for extracting table information in a PDF, and relates to the technical field of PDF table analysis and the like.

In summary, this embodiment proceeds as follows: first, the page images containing tables in the PDF file are identified using the object detection algorithm YOLOv5; then the table structure is trained and recognized using the self-developed image recognition algorithm LX-tableOCR, which structures the table into html tags; next, the text content is recognized and extracted using the CTPN text recognition algorithm; finally, the text content and the html table structure are integrated to output html code that fully expresses the table information.

Referring to fig. 2, the scheme of the present embodiment will be described in detail below.

1. Constructing a PDF file table identification data set.

The tables in 5,000 PDFs are annotated to construct the PDF file table identification data set.

2. Constructing an html data set corresponding to the table information.

In this embodiment, 3,000 medical PDF documents are annotated, the tables in these documents are converted to html, and, in combination with the public PubTabNet data set, training data for table structure recognition and for table content recognition are constructed.
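One plausible way to derive both kinds of training data from a single html label is to separate the tag sequence (structure) from the cell text (content). The sketch below uses only the Python standard library's `html.parser`; the class and function names are hypothetical, and the source does not state that its labels were processed this way.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class TableLabelParser(HTMLParser):
    """Collect structure tokens and per-cell text from an html table label."""

    def __init__(self) -> None:
        super().__init__()
        self.structure: List[str] = []
        self.cells: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # Keep attributes such as colspan/rowspan as part of the structure token.
        attr_text = "".join(f' {k}="{v}"' if v is not None else f" {k}" for k, v in attrs)
        self.structure.append(f"<{tag}{attr_text}>")
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        self.structure.append(f"</{tag}>")
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def split_label(html_table: str) -> Tuple[List[str], List[str]]:
    """Return (structure tokens, cell texts) for one html table."""
    parser = TableLabelParser()
    parser.feed(html_table)
    parser.close()
    return parser.structure, [cell.strip() for cell in parser.cells]


label = '<table><tr><td colspan="2">Dose</td></tr><tr><td>5</td><td>mg</td></tr></table>'
structure, cells = split_label(label)
print(structure)
print(cells)
```

The structure tokens would serve as the target sequence for table structure recognition, and the cell texts as targets for table content recognition.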

3. Building each functional module.

Referring to fig. 3:

1) A table page splitting module: textual cues that can distinguish tables are found by inspecting the PDF file and are combined into programmatic rules for identifying the table pages in the PDF file; pages that the rules cannot resolve are identified by the YOLOv5 model trained on the data set.

2) A table structure identification module: in this module, the self-developed image recognition algorithm LX-tableOCR is trained and fine-tuned on table structures to obtain the table structure recognition model. The general principle is as follows: following the principle of multiple-instance learning, the image blocks of each image are grouped into a multi-instance bag; each instance in the bag is further split into an embedding sequence, forming tokens similar to those used in NLP; the tokens are encoded by a Transformer, and the instances are encoded by an outer Transformer; the different local and global combinations of the tokens are mapped correspondingly, and the information each of them reflects is summed spatially. Meanwhile, a distillation token is added to the outer sequence of each instance, so that the newly added distillation token interacts with the original tokens through the self-attention layers and is learned through back-propagation, which makes training of the resulting model efficient.

3) A text content identification module: in text content recognition, the classical text recognition algorithm CTPN is used for text content detection.

4) A result integration module: finally, the html frame representing the table structure, generated by inference with the table structure model, is integrated with the corresponding text content produced by text content detection to form html code representing the table information; the number of html files output equals the number of tables in the table pages split from the input PDF.
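The interaction between a distillation token and the original tokens through a self-attention layer, mentioned in the table structure identification module above, can be illustrated with scaled dot-product self-attention. This is a deliberately reduced sketch: a single head, identity query/key/value projections, plain Python lists, and no training; the real model would use learned projections and back-propagation, neither of which is shown.

```python
import math
from typing import List

Vector = List[float]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention with identity projections."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors (here, the tokens themselves).
        outputs.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)])
    return outputs


original_tokens = [[1.0, 0.0], [0.0, 1.0]]
distill_token = [0.0, 0.0]  # stand-in for a learned distillation token
attended = self_attention(original_tokens + [distill_token])
print(attended[-1])  # the distillation token after attending to all tokens
```

Because the zero-valued distillation token scores equally against every token, its output here is the plain average of all three vectors; with learned values it would weight the original tokens unevenly.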

4. Verifying the table parsing result.

Some PDF table pictures not included in the training set or the validation set are taken for testing, with the following result: FIG. 4 shows a PDF table, and FIG. 5 shows the html file code generated to represent the table.

The key points of this embodiment are as follows:

1. with this method, the table image data of the PDF is converted into html code, so that table image information in complex scenes can be extracted conveniently;

2. the PDF table information extraction method provided by this embodiment fuses several algorithms from the field of AI computer vision, and the extraction is realized through the cooperation of a plurality of modules;

3. by applying the Transformer structure combined with multi-instance learning to PDF table image detection, the two-level Transformer makes image processing more accurate and efficient.

The beneficial effect of this embodiment lies in:

in this embodiment, a data set in which table structures correspond to html is constructed and used for training, yielding a method that generates html code corresponding to the rows and columns of the PDF table information and then parses the table content according to the html tag information. The method can accurately extract the table structure, table content and other information of PDF tables, so that the table information in PDFs can be applied more accurately to downstream tasks.

Further, as shown in FIG. 8, in a preferred embodiment, the LX-tableOCR is used in a manner comprising:

first, the table page images in the whole PDF file are identified using rules combined with YOLOv5; the table page images are then taken as the input of LX-tableOCR, which, combined with the multiple-instance learning method, yields a synthesized PDF table screenshot and the corresponding html structure code (without content); the synthesized PDF table screenshot and the html structure code are then taken as the input of the CTPN algorithm to extract the content; finally, the html structure and the content information are integrated and filled into the html to obtain the final html file.

The following describes an extraction device of table information in PDF according to the present invention, and the extraction device of table information in PDF described below and the extraction method of table information in PDF described above may be referred to in correspondence with each other.

As shown in fig. 7, an embodiment of the present invention provides a system for extracting table information in PDF, including:

an obtaining module 701, configured to obtain a PDF file, and identify an image page in the PDF file that includes a table;

a segmentation module 702, configured to segment the image page to obtain a plurality of table units, and to run a table structure recognition model with the table units as input to obtain a target sequence; the target sequence is an html sequence based on a table structure frame;

the extraction module 703 is configured to integrate the text recognition result of the table unit into the target sequence to obtain a table extraction result in the html code form;

Further, the obtaining module 701 includes:

an acquisition unit configured to acquire a PDF file;

the identification unit is used for identifying image pages comprising tables in the PDF file according to a preset table identification rule and/or a preset PDF identification model;

The extraction module 703 includes:

the text unit is used for operating a text recognition model based on the table unit to obtain a text recognition result of the table unit;

the filling unit is used for determining a table structure according to the target sequence and filling the text recognition result into the table structure to obtain a table extraction result in an html code form;

Still further, the filling unit includes:

a table structure subunit, configured to determine at least one table structure according to the target sequence; the number of the table structures is the same as the number of the tables in the PDF file;

the mapping and filling subunit is used for determining the mapping relationship between the text in the text recognition result and the table structure, and filling the text into the table structure according to the mapping relationship to obtain a table extraction result in the form of the html code; the number of table extraction results in the html code form is the same as that of the table structures.

The beneficial effect of this embodiment lies in:

FIG. 6 illustrates a physical structure diagram of an electronic device. As shown in FIG. 6, the electronic device may include: a processor 610, a communication interface 620, a memory 630 and a communication bus 640, wherein the processor 610, the communication interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may call logic instructions in the memory 630 to perform a method of extracting table information in PDF, the method comprising: acquiring a PDF file, and identifying an image page comprising a table in the PDF file; segmenting the image page to obtain a plurality of table units, and running a table structure recognition model with the table units as input to obtain a target sequence; the target sequence is an html sequence based on a table structure frame; integrating the text recognition result of the table units into the target sequence to obtain a table extraction result in the form of html code; the table structure recognition model is a model obtained through sample training.

In addition, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program code.

In another aspect, the present invention further provides a computer program product, where the computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer can execute the method for extracting table information in PDF provided above, the method comprising: acquiring a PDF file, and identifying an image page comprising a table in the PDF file; segmenting the image page to obtain a plurality of table units, and running a table structure recognition model with the table units as input to obtain a target sequence; the target sequence is an html sequence based on a table structure frame; integrating the text recognition result of the table units into the target sequence to obtain a table extraction result in the form of html code; the table structure recognition model is a model obtained through sample training.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method for extracting table information in PDF provided above, the method comprising: acquiring a PDF file, and identifying an image page comprising a table in the PDF file; segmenting the image page to obtain a plurality of table units, and running a table structure recognition model with the table units as input to obtain a target sequence; the target sequence is an html sequence based on a table structure frame; integrating the text recognition result of the table units into the target sequence to obtain a table extraction result in the form of html code; the table structure recognition model is a model obtained through sample training.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for extracting table information in PDF is characterized by comprising the following steps:

2. The method of extracting table information in PDF according to claim 1, wherein said table structure identification model is a model of an encoder-decoder structure;

3. The method for extracting table information in PDF according to claim 2, wherein said table structure identification model is a model based on the self-attention mechanism;

the decoder is capable of:

the target sequence element is an html character or a character string.

4. The method according to claim 1, wherein the step of acquiring the PDF file and identifying the image page of the table included in the PDF file comprises:

acquiring a PDF file;

5. The method according to claim 1, wherein the step of integrating the text recognition result of the table unit into the target sequence to obtain the table extraction result in html code form comprises:

6. The method according to claim 5, wherein the step of determining a table structure according to the target sequence and filling the text recognition result into the table structure to obtain the table extraction result in the html code form comprises:

7. A system for extracting table information in a PDF, comprising:

8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for extracting table information in a PDF according to any one of claims 1 to 6.

9. A non-transitory computer readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, implements the steps of the method for extracting table information in PDF according to any one of claims 1 to 6.

10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method for extracting table information in a PDF according to any one of claims 1 to 6.