CN110390269B - PDF document table extraction method, device, equipment and computer readable storage medium - Google Patents

PDF document table extraction method, device, equipment and computer readable storage medium

Info

Publication number
CN110390269B
CN110390269B
Authority
CN
China
Prior art keywords
pdf document
text
information
picture
area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910560432.0A
Other languages
Chinese (zh)
Other versions
CN110390269A (en)
Inventor
刘克亮
卢波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910560432.0A priority Critical patent/CN110390269B/en
Publication of CN110390269A publication Critical patent/CN110390269A/en
Application granted granted Critical
Publication of CN110390269B publication Critical patent/CN110390269B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/412Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Character Discrimination (AREA)
  • Character Input (AREA)

Abstract

The application relates to the technical field of artificial intelligence and discloses a PDF document table extraction method, device, equipment and computer readable storage medium, wherein the method comprises the following steps: acquiring a PDF document to be identified, and processing the PDF document to be identified; preprocessing the processed PDF document, inputting the preprocessed PDF document into a convolutional neural network to output a feature map, inputting the feature map into an RPN region candidate network, and determining a table region; preprocessing the table region and extracting features based on OCR (optical character recognition) technology to obtain a feature picture, performing text detection on the feature picture to determine a text region, performing text recognition on the text region, and determining text information, wherein the text information comprises text position information and text content information; and determining the structural information of the table according to the text coordinate information, dividing the cells of the table based on the structural information, and filling the text corresponding to the text content information into the corresponding cells of the table. By the method and the device, the accuracy of PDF document table extraction is improved.

Description

PDF document table extraction method, device, equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a PDF document table extraction method, apparatus, device, and computer readable storage medium.
Background
Existing methods for extracting tables from PDF files are mainly aimed at PDFs with extractable text, where the table area is obtained from the structural information of the PDF. For picture-type PDF files, table extraction can only be carried out by traditional image processing: first the table frame is extracted, then the in-frame regions are extracted according to the table frame, and finally OCR (optical character recognition) is performed on the in-frame region images to extract the table contents. However, this approach is effective only for tables with complete ruled lines; if the table ruled lines are incomplete, the located table area or the cell contents may be incomplete, resulting in low table extraction accuracy.
Disclosure of Invention
The main purpose of the application is to provide a PDF document table extraction method, device, equipment and computer readable storage medium, aiming to solve the technical problems of the limited application range and low accuracy of existing PDF document table extraction methods.
In order to achieve the above object, the present application provides a PDF document table extraction method, which includes the following steps:
acquiring a PDF document to be identified, and processing the PDF document to be identified, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and processing the PDF document comprises converting the PDF document capable of extracting text content into a PDF document of the picture type;
preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a characteristic diagram of the processed PDF document based on the preset convolutional neural network, inputting the characteristic diagram into an RPN region candidate network, and determining a table region in the processed PDF document;
preprocessing the table area based on an OCR text recognition technology, extracting features to obtain a feature picture of the table area, detecting the features of the feature picture, determining a text area in the table area, recognizing the text area, and determining text information in the table area, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates;
and determining the structural information of the table according to the text coordinate information, dividing each cell of the table based on the structural information, and filling the text corresponding to the text content information into each corresponding cell of the table.
Optionally, the preprocessing the processed PDF document and inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, and inputting the feature map into an RPN region candidate network, where before determining a table region in the processed PDF document, the method further includes:
acquiring a PDF document sample to be trained, and converting the PDF document sample to be trained to obtain a sample picture;
acquiring marking information corresponding to the PDF document sample to be trained, and marking the table position in the sample picture based on the marking information;
training a preset initial model based on the marked sample picture to obtain a form identification model;
and saving the table identification model.
Optionally, training the preset initial model based on the labeled sample picture, and obtaining the form identification model includes:
preprocessing the marked sample picture, wherein the preprocessing process comprises mean value removal, normalization and whitening;
inputting the preprocessed sample picture into a preset convolutional neural network to obtain a feature map of the marked sample picture;
inputting the feature map into an RPN region candidate network, and detecting the position of a table region in the marked sample picture;
and when the current detection is determined to reach the convergence condition, obtaining a form identification model.
Optionally, the determining the structure information of the table according to the text coordinate information, dividing each cell of the table based on the structure information, and filling the text corresponding to the text content information into each corresponding cell of the table includes:
generating text boxes based on the text position information, wherein each text box contains a line of text;
dividing a text box into different row and column positions of the table by the overlapping proportion of the text box in the horizontal direction and the vertical direction, and dividing each cell of the table based on the different row and column positions;
and filling corresponding texts corresponding to the text content information into each cell of the table.
In addition, in order to achieve the above object, the present application further provides a PDF document table extraction device including:
the processing module is used for acquiring a PDF document to be identified and processing the PDF document to be identified, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and the processing of the PDF document comprises the conversion of the PDF document capable of extracting text content into the PDF document of the picture type;
the identification module is used for preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, inputting the feature map into an RPN area candidate network and determining a form area in the processed PDF document;
the positioning module is used for preprocessing the table area and extracting features based on an OCR text recognition technology to obtain a feature picture of the table area, detecting the characters of the feature picture, determining a text area in the table area, recognizing the characters of the text area, and determining text information in the table area, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates;
and the filling module is used for determining the structural information of the form according to the text coordinate information, dividing each cell of the form based on the structural information, and filling the text corresponding to the text content information into each corresponding cell of the form.
Optionally, the PDF document table extracting device further includes:
the conversion module is used for obtaining a PDF document sample to be trained and converting the PDF document sample to be trained to obtain a sample picture;
the marking module is used for acquiring marking information corresponding to the PDF document sample to be trained and marking the table position in the sample picture based on the marking information;
the training module is used for training the preset initial model based on the marked sample picture to obtain a form identification model;
and the storage module is used for storing the table identification model.
Optionally, the training module includes:
the preprocessing unit is used for preprocessing the marked sample picture, wherein the preprocessing process comprises mean value removal, normalization and whitening;
the extraction unit is used for inputting the preprocessed sample picture into a preset convolutional neural network to obtain a feature map of the marked sample picture;
the detection unit is used for inputting the feature map into an RPN region candidate network and detecting the position of a table region in the marked sample picture;
and the confirmation unit is used for obtaining a form identification model when the current detection is determined to reach the convergence condition.
Optionally, the filling module includes:
a generation unit, configured to generate text boxes based on the text position information, where each text box contains a line of text;
a dividing unit for dividing the text box into different row and column positions of the table by the overlapping proportion of the text box in the horizontal direction and the vertical direction, and dividing each cell of the table based on the different row and column positions;
and the filling unit is used for filling corresponding texts corresponding to the text content information in each cell of the form.
In addition, in order to achieve the above object, the present application further provides a PDF document table extraction apparatus, which includes an input-output unit, a memory and a processor, the memory storing computer-readable instructions that, when executed by the processor, cause the processor to implement the steps of the PDF document table extraction method described above.
In addition, in order to achieve the above object, the present application also provides a computer-readable storage medium having stored thereon a PDF document table extraction program which, when executed by a processor, implements the steps of the PDF document table extraction method described above.
The PDF document table extraction method comprises the steps of firstly obtaining a PDF document to be identified, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and processing the PDF document, i.e. converting the PDF document capable of extracting text content into a PDF document of the picture type; preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network to obtain a feature map of the processed PDF document, and inputting the feature map into an RPN region candidate network, so as to determine a table region in the processed PDF document; preprocessing the table region and extracting features based on OCR character recognition and positioning to obtain a feature picture of the table region, performing text detection on the feature picture to determine a text region, and determining text information in the table region based on a character recognition algorithm; and finally, determining the structural information of the table through the text coordinate information, dividing the cells of the table based on the structural information, and filling the text corresponding to the text content information into the corresponding cells of the table, thereby completing the table content extraction in the PDF document. The PDF document table extraction method provided by the application can accurately position the table and extract the content in the table for PDF documents of different formats, thereby enlarging the application range of PDF document table extraction and improving its accuracy.
Drawings
FIG. 1 is a schematic diagram of a PDF document table extraction device in a hardware running environment according to an embodiment of the present application;
FIG. 2 is a flowchart of an embodiment of a method for extracting a PDF document table;
FIG. 3 is a schematic diagram of a functional module of an embodiment of a PDF document table extraction device according to the present application;
FIG. 4 is a schematic diagram of a functional module of another embodiment of a PDF document form extraction device according to the present application;
FIG. 5 is a schematic diagram of functional units of the training module 70 in another embodiment of the PDF document table extracting device of the present application;
FIG. 6 is a schematic diagram of functional units of a filling module 40 in another embodiment of the PDF document table extraction device of the present application;
fig. 7 is a schematic diagram of a text box in an embodiment of a PDF document table extraction method of the present application.
The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments, referring to the attached drawings.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a PDF document table extraction device of a hardware running environment according to an embodiment of the present application.
The PDF document table extraction device in the embodiment of the present application may be a terminal device with data processing capability, such as a portable computer, a server, or the like.
As shown in fig. 1, the PDF document table extraction apparatus may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as disk storage. The memory 1005 may optionally also be a storage device separate from the aforementioned processor 1001.
It will be appreciated by those skilled in the art that the PDF document table extraction device structure shown in fig. 1 does not constitute a limitation of the PDF document table extraction device, which may include more or fewer components than illustrated, combine certain components, or have a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a PDF document table extraction program may be included in the memory 1005 as one type of computer storage medium.
In the PDF document table extraction apparatus shown in fig. 1, the network interface 1004 is mainly used to connect to a background server, and perform data communication with the background server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be used to call a PDF document table extraction program stored in the memory 1005 and perform the operations of the following embodiments of the PDF document table extraction method.
Referring to fig. 2, fig. 2 is a flowchart illustrating an embodiment of a PDF document table extraction method, where the PDF document table extraction method includes:
step S10, a PDF document to be identified is obtained and is processed, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and the PDF document is processed by converting the PDF document capable of extracting text content into the PDF document of the picture type.
In this embodiment, the PDF document to be identified may be in multiple formats; for example, it may be a PDF document from which text content can be extracted, or a picture-type PDF document. After the PDF document to be identified is obtained, it is first processed; specifically, a PDF document from which text content can be extracted is converted into pictures, while a PDF document that is already of the picture type does not need to be processed.
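For illustration, this page-to-picture conversion can be sketched as follows; the use of the PyMuPDF library (imported as fitz) and the DPI value are assumptions for this sketch, not part of the embodiment.

```python
import fitz  # PyMuPDF; the library choice is an assumption for illustration


def pdf_to_images(pdf_path, dpi=200):
    """Render every page of a PDF (text-extractable or scanned) to a PNG picture."""
    image_paths = []
    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            # Rasterize the page; a text-extractable PDF thus becomes picture-type input
            pix = page.get_pixmap(dpi=dpi)
            out_path = f"page_{page_index + 1}.png"
            pix.save(out_path)
            image_paths.append(out_path)
    return image_paths
```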
Step S20, preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, inputting the feature map into an RPN region candidate network, and determining a table region in the processed PDF document.
Further, the converted picture is input into a preset form recognition model, so that the form recognition model recognizes a form region in the picture.
It can be understood that the preset form identification model is obtained by training in advance, and the preset form identification model is trained through the PDF document sample to be trained so as to improve the accuracy of form area identification.
Specifically, the process of performing table area recognition on the processed PDF document by the preset table recognition model is as follows:
Firstly, the picture obtained by converting the PDF document is preprocessed, the preprocessing steps including mean value removal, normalization and whitening, the aim of the preprocessing being to strengthen the features of the picture; further, the preprocessed picture is input into a preset convolutional neural network to extract a feature map; finally, the obtained feature map is input into an RPN (Region Proposal Network) to identify and locate the table region.
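One way to prototype this detection step is shown below; the use of torchvision's generic Faster R-CNN constructor and the score threshold are illustrative assumptions standing in for the preset table recognition model, which in this embodiment is obtained by training as described later.

```python
import torch
import torchvision
from torchvision.transforms import functional as F
from PIL import Image

# Illustrative only: an untrained, generic Faster R-CNN detector stands in for the
# preset table recognition model (two classes: background and table).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=2)
model.eval()


def detect_table_regions(image_path, score_threshold=0.7):
    """Return candidate table bounding boxes [x1, y1, x2, y2] for one page picture."""
    image = Image.open(image_path).convert("RGB")
    tensor = F.to_tensor(image)  # scales pixels to [0, 1]; further preprocessing is model-specific
    with torch.no_grad():
        # backbone CNN -> feature map -> RPN region proposals -> final detections
        prediction = model([tensor])[0]
    keep = prediction["scores"] >= score_threshold
    return prediction["boxes"][keep].tolist()
```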
Step S30, preprocessing and feature extraction are carried out on the table area based on the OCR text recognition technology, feature pictures of the table area are obtained, text detection is carried out on the feature pictures, text areas in the table area are determined, text recognition is carried out on the text areas, text information in the table area is determined, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates.
Further, after the table area in the picture is identified, the characters in the table area are positioned and identified by utilizing an OCR character recognition technology, and text position information and text content information in the table area are determined.
Specifically, the OCR character recognition process is as follows: firstly, the picture containing the table area is preprocessed, the aim of the preprocessing being to reduce useless information in the picture and prepare the picture containing characters for subsequent feature extraction; further, the text parts in the preprocessed picture are detected, where the text regions in the picture can be framed with a common image detection algorithm, which is not detailed here; finally, the text in the detected text regions is recognized through a text recognition algorithm, and the specific text content information is determined.
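A minimal sketch of this locate-and-recognize step; the embodiment does not prescribe a particular OCR engine, so the use of Tesseract via pytesseract and the language packs below are assumptions. Words detected on the same line are merged into one line-level text box.

```python
import pytesseract
from PIL import Image


def ocr_table_region(region_image_path, lang="chi_sim+eng"):
    """Return (text, (x1, y1, x2, y2)) pairs, one per recognized line in the table region."""
    image = Image.open(region_image_path).convert("L")  # simple preprocessing: grayscale
    data = pytesseract.image_to_data(image, lang=lang,
                                     output_type=pytesseract.Output.DICT)
    lines = {}
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        x1, y1 = data["left"][i], data["top"][i]
        x2, y2 = x1 + data["width"][i], y1 + data["height"][i]
        text, box = lines.get(key, ("", (x1, y1, x2, y2)))
        # Concatenate words of the same line and grow the line's bounding box
        lines[key] = (text + word, (min(box[0], x1), min(box[1], y1),
                                    max(box[2], x2), max(box[3], y2)))
    return list(lines.values())
```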
Step S40, the structure information of the table is determined according to the text coordinate information, each cell of the table is divided based on the structure information, and the text corresponding to the text content information is filled into each corresponding cell of the table.
It will be appreciated that specific text location information may be obtained during text recognition and location within the form area, and that the text location information may be represented by the top left and bottom right coordinates of each line of text. Therefore, through the upper left corner coordinate and the lower right corner coordinate of each line of text, the corresponding row and column position of each line of text in the table can be determined, and the structural information of the table can be determined. After the structural information of the table is determined, the text content information corresponding to the text position information is filled in the corresponding row and column positions.
In this embodiment, firstly, a PDF document to be identified is obtained, the PDF document to be identified comprising a PDF document capable of extracting text content and a PDF document of a picture type, the PDF document is processed, and the PDF document capable of extracting text content is converted into a PDF document of the picture type; the processed PDF document is preprocessed and input into a preset convolutional neural network to obtain a feature map of the processed PDF document, and the feature map is input into an RPN region candidate network, so as to determine a table region in the processed PDF document; the table region is preprocessed and its features extracted based on OCR character recognition and positioning to obtain a feature picture of the table region, text detection is performed on the feature picture to determine a text region, and text information in the table region is finally determined based on a character recognition algorithm; finally, the structural information of the table is determined through the text coordinate information, the cells of the table are divided based on the structural information, and the text corresponding to the text content information is filled into the corresponding cells of the table, thereby completing the table content extraction in the PDF document. The PDF document table extraction method provided by the application can accurately position the table and extract the content in the table for PDF documents of different formats, thereby enlarging the application range of PDF document table extraction and improving its accuracy.
Further, before step S20, the method further includes:
step S50, a PDF document sample to be trained is obtained, and the PDF document sample to be trained is converted to obtain a sample picture;
step S60, marking information corresponding to a PDF document sample to be trained is obtained, and the table position in a sample picture is marked based on the marking information;
step S70, training a preset initial model based on the marked sample picture to obtain a form identification model;
step S80, save the form identification model.
In this embodiment, a preset table recognition model is trained through PDF document samples, so that the trained table recognition model can be used to perform table region recognition on the PDF document to be recognized. Specifically, the PDF document samples to be trained may include both PDF documents from which text content can be extracted and picture-type PDF documents. If a PDF document sample to be trained is a PDF document from which text content can be extracted, it is converted page by page into sample pictures; if the PDF document to be trained is a picture-type PDF document, no processing is needed.
Further, labeling information corresponding to the PDF document sample to be trained is obtained so as to label the converted sample picture; specifically, the table position in the sample picture is labeled, and the labeling information can be represented in the form of coordinates. For example, a rectangular frame is used to select the table region in the sample picture, and the rectangular frame is then represented by its upper left corner coordinates and lower right corner coordinates, that is, the table position in the sample picture is represented by the upper left corner coordinates and the lower right corner coordinates of the rectangular frame.
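As an illustration only, one possible shape of such a labeling record; the field names and values are hypothetical and not prescribed by the embodiment.

```python
# Hypothetical annotation record for one sample picture: each table position is
# expressed by the top-left and bottom-right corners of its bounding rectangle.
annotation = {
    "image": "sample_page_003.png",
    "tables": [
        {"top_left": [120, 340], "bottom_right": [1460, 980]},
    ],
}
```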
In this embodiment, the algorithm used for training the preset table recognition model is the Faster R-CNN algorithm, and the training process is as follows:
firstly, preprocessing is carried out on an input sample picture, and the aim of the preprocessing is to strengthen the characteristics of the sample picture. Specifically, the preprocessing process sequentially comprises mean value removal, normalization and whitening, wherein the purpose of mean value removal is to center each dimension in input sample data to 0, namely, the center in the sample data is pulled back to the origin of a coordinate system; the normalization aims to normalize the amplitude of the sample data to the same range and reduce the interference caused by the difference of the value ranges of the data in each dimension; whitening is the normalization of the amplitude on each characteristic axis of the sample data. Through the steps, the sample picture obtained by converting the PDF document to be trained is subjected to characteristic reinforcement.
Further, the preprocessed sample picture is input into a preset convolutional neural network to extract the feature map. Specifically, in this embodiment, the preset convolutional neural network includes 13 convolutional layers, 13 excitation layers and 4 pooling layers. The convolution kernel of the convolutional layers is 3×3 with a padding of 1, the padding ensuring that the convolutional layers do not change the size between input and output; the kernel of the pooling layers is 2×2 and the stride is 2×2. The preprocessed sample picture is subjected to convolution, excitation, pooling and other operations to obtain a feature vector, which represents the vector information of the feature map corresponding to the sample picture.
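The layer counts above (13 convolutions with 3×3 kernels and padding 1, 13 excitation layers, 4 pooling layers with 2×2 kernels and stride 2) correspond to a VGG-16-style feature extractor; a PyTorch sketch under that assumption, with the channel widths chosen for illustration:

```python
import torch.nn as nn

# Assumed VGG-16-style layout: 13 conv layers (3x3, padding 1), 13 ReLU layers,
# 4 max-pooling layers (2x2, stride 2). "M" marks a pooling layer.
cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M", 512, 512, 512, "M", 512, 512, 512]


def build_backbone():
    layers, in_channels = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers.append(nn.Conv2d(in_channels, v, kernel_size=3, padding=1))  # size-preserving
            layers.append(nn.ReLU(inplace=True))
            in_channels = v
    return nn.Sequential(*layers)
```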
Further, the obtained feature map is input into the RPN to identify and locate the table area. First, a 3×3 convolution is performed on the feature map to obtain a 256-dimensional feature; then, 9 candidate windows are computed for each position of the 256-dimensional feature through scale transformation, and the candidate windows are classified as foreground or background based on a softmax function; meanwhile, bounding-box regression offsets are calculated for the candidate windows, so as to preliminarily determine the table region; finally, target regions are obtained based on the foreground candidate windows and the bounding-box regression offsets, and regions that are too small or exceed the boundary are eliminated, leaving the final target region, namely the finally identified table region.
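A sketch of the RPN head just described; the channel counts follow the backbone sketch above, while anchor generation, scale transformation and proposal filtering are omitted, so this is an assumption-laden fragment rather than the trained network of the embodiment.

```python
import torch
import torch.nn as nn


class RPNHead(nn.Module):
    """Region Proposal Network head: 3x3 conv -> 256-d features, then per-anchor
    foreground/background scores and bounding-box regression offsets (9 anchors per position)."""

    def __init__(self, in_channels=512, mid_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.cls_score = nn.Conv2d(mid_channels, num_anchors * 2, kernel_size=1)  # fg / bg
        self.bbox_pred = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)  # box offsets

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        # softmax over the foreground/background dimension for every anchor at every position
        scores = torch.softmax(self.cls_score(x).view(x.size(0), 2, -1), dim=1)
        offsets = self.bbox_pred(x)
        return scores, offsets
```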
In this embodiment, since every sample picture converted from a PDF document sample to be trained carries labeling information, namely the position information of the table region contained in the sample picture, the sample picture is input into the preset table recognition model to identify and locate its table region, and the output result of the preset table recognition model is compared with the labeling information; if they are inconsistent, the parameters of the preset table recognition model are readjusted until the output result is consistent with the labeling information, at which point the preset table recognition model is determined to be trained.
Further, after the training is confirmed, a preset form recognition model is stored, so that the form region recognition is carried out on the PDF document to be recognized by using the trained form recognition model.
Further, step S40 includes:
step S401, generating text boxes based on the text position information, wherein each text box contains a line of text;
step S402, dividing the text box into different row and column positions of the table by the overlapping proportion of the text box in the horizontal direction and the vertical direction, and dividing each cell of the table based on the different row and column positions;
step S403, the text corresponding to the corresponding text content information is filled into each cell of the table.
In this embodiment, in the process of text region detection based on the OCR character recognition technology, specific text position information can be obtained, and the text position information can be represented by the upper left corner coordinates and the lower right corner coordinates of each line of text; as shown in fig. 7, the text information framed and located with a rectangular box is denoted as a text box.
The text boxes can be divided into different row and column positions of the table by the overlapping proportion of the text boxes in the horizontal and vertical directions. Specifically, text boxes in the same row have the same coordinate range in the vertical direction, while their coordinate ranges in the horizontal direction do not intersect; text boxes in the same column have the same coordinate range in the horizontal direction, while their coordinate ranges in the vertical direction do not intersect. Therefore, by analyzing the coordinate information of the text boxes, the structural information of the table can be determined. For example, the structural information of the table may be 3×5, indicating that the table has 3 rows and 5 columns. If a text spans several columns, as shown in fig. 7, the vertical grid lines within that cell are removed.
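A minimal sketch of this grouping, assuming each text box is an (x1, y1, x2, y2) tuple; grouping here uses simple interval intersection along one axis, which is an illustrative simplification of the overlapping-proportion test described above.

```python
def group_indices(boxes, axis):
    """Group text boxes whose coordinate ranges overlap along one axis (0 = rows, 1 = columns).

    Boxes in the same row overlap vertically but not horizontally; boxes in the
    same column overlap horizontally but not vertically.
    """
    lo, hi = (1, 3) if axis == 0 else (0, 2)  # rows compare y-ranges, columns compare x-ranges
    groups = []
    for i, box in enumerate(boxes):
        for group in groups:
            ref = boxes[group[0]]
            if min(box[hi], ref[hi]) > max(box[lo], ref[lo]):  # intervals intersect
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


def table_structure(text_boxes):
    """Return (row_count, column_count) inferred from the text box coordinates."""
    return len(group_indices(text_boxes, axis=0)), len(group_indices(text_boxes, axis=1))
```

Under this sketch, two boxes side by side on one line plus one box below the first would yield a structure of 2 rows and 2 columns.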
After the structural information of the table is determined, corresponding text contents are correspondingly filled into each cell of the table according to the identified text content information, and then the table content extraction in the PDF document can be completed.
Referring to fig. 3, fig. 3 is a schematic functional block diagram of an embodiment of a PDF document table extracting apparatus of the present application.
In this embodiment, the PDF document table extraction device includes:
the processing module 10 is configured to obtain a PDF document to be identified, and process the PDF document to be identified, where the PDF document to be identified includes a PDF document capable of extracting text content and a PDF document of a picture type, and processing the PDF document includes converting the PDF document capable of extracting text content into a PDF document of the picture type;
the identifying module 20 is configured to pre-process the processed PDF document, input the pre-processed PDF document into a preset convolutional neural network, output a feature map of the processed PDF document based on the preset convolutional neural network, input the feature map into an RPN region candidate network, and determine a table region in the processed PDF document;
the positioning module 30 is configured to perform preprocessing and feature extraction on the table area based on an OCR text recognition technology, obtain a feature picture of the table area, perform text detection on the feature picture, determine a text area in the table area, perform text recognition on the text area, and determine text information in the table area, where the text information includes text position information and text content information, and the text position information is represented by coordinates;
and a filling module 40, configured to determine structural information of the form according to the text coordinate information, divide each cell of the form based on the structural information, and fill text corresponding to the text content information into each corresponding cell of the form.
Further, referring to fig. 4, the PDF document table extraction apparatus further includes:
the conversion module 50 is used for obtaining a PDF document sample to be trained and converting the PDF document sample to be trained to obtain a sample picture;
the labeling module 60 is configured to obtain labeling information corresponding to the PDF document sample to be trained, and label a table position in the sample picture based on the labeling information;
the training module 70 is configured to train the preset initial model based on the labeled sample picture to obtain a form recognition model;
a saving module 80, configured to save the table identification model.
Further, referring to fig. 5, the training module 70 includes:
the preprocessing unit 701 is configured to perform preprocessing on the marked sample picture, where the preprocessing includes mean removal, normalization and whitening;
the extracting unit 702 is configured to input the preprocessed sample picture into a preset convolutional neural network, so as to obtain a feature map of the labeled sample picture;
a detection unit 703, configured to input the feature map into an RPN region candidate network, and detect a position of a table region in the labeled sample picture;
and a confirmation unit 704, configured to obtain a table identification model when it is determined that the current detection reaches the convergence condition.
Further, referring to fig. 6, the filling module 40 includes:
a generating unit 401, configured to generate text boxes based on the text position information, where each text box contains a line of text;
a dividing unit 402, configured to divide a text box into different row and column positions of the table by overlapping proportions of the text box in a horizontal direction and a vertical direction, and divide each cell of the table based on the different row and column positions;
and a filling unit 403, configured to fill in a text corresponding to the text content information in each cell of the form.
The specific embodiments of the PDF document table extracting device of the present application are substantially the same as the embodiments of the PDF document table extracting method described above, and are not described herein.
In addition, the embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a PDF document table extraction program, and the PDF document table extraction program realizes the steps of the PDF document table extraction method when being executed by a processor.
The specific embodiments of the computer readable storage medium are basically the same as the embodiments of the PDF document table extraction method described above, and are not described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are for description only and do not represent the relative merits of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) as described above, including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (10)

1. The PDF document table extraction method is characterized by comprising the following steps of:
acquiring a PDF document to be identified, and processing the PDF document to be identified, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and processing the PDF document comprises converting the PDF document capable of extracting text content into a PDF document of the picture type;
preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a characteristic diagram of the processed PDF document based on the preset convolutional neural network, inputting the characteristic diagram into an RPN region candidate network, and determining a table region in the processed PDF document;
preprocessing the table area based on an OCR text recognition technology, extracting features to obtain a feature picture of the table area, detecting the features of the feature picture, determining a text area in the table area, recognizing the text area, and determining text information in the table area, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates;
and determining the structural information of the table according to the text coordinate information, dividing each cell of the table based on the structural information, and filling the text corresponding to the text content information into each corresponding cell of the table.
2. The PDF document table extraction method of claim 1 wherein the preprocessing the processed PDF document and inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, and inputting the feature map into an RPN region candidate network, and before determining a table region in the processed PDF document, further comprising:
acquiring a PDF document sample to be trained, and converting the PDF document sample to be trained to obtain a sample picture;
acquiring marking information corresponding to the PDF document sample to be trained, and marking the table position in the sample picture based on the marking information;
training a preset initial model based on the marked sample picture to obtain a form identification model;
and saving the table identification model.
3. The PDF document table extraction method of claim 2, wherein training the preset initial model based on the annotated sample picture to obtain the table identification model includes:
preprocessing the marked sample picture, wherein the preprocessing process comprises mean value removal, normalization and whitening;
inputting the preprocessed sample picture into a preset convolutional neural network to obtain a feature map of the marked sample picture;
inputting the feature map into an RPN region candidate network, and detecting the position of a table region in the marked sample picture;
and when the current detection is determined to reach the convergence condition, obtaining a form identification model.
4. The PDF document table extraction method of claim 1, wherein the determining structural information of the table from the text coordinate information, dividing cells of the table based on the structural information, and filling text corresponding to the text content information into corresponding cells of the table includes:
generating text boxes based on the text position information, wherein each text box contains a line of text;
dividing a text box into different row and column positions of the table by the overlapping proportion of the text box in the horizontal direction and the vertical direction, and dividing each cell of the table based on the different row and column positions;
and filling corresponding texts corresponding to the text content information into each cell of the table.
5. A PDF document table extraction device, characterized in that the PDF document table extraction device includes:
the processing module is used for acquiring a PDF document to be identified and processing the PDF document to be identified, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and the processing of the PDF document comprises the conversion of the PDF document capable of extracting text content into the PDF document of the picture type;
the identification module is used for preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, inputting the feature map into an RPN area candidate network and determining a form area in the processed PDF document;
the positioning module is used for preprocessing the table area and extracting features based on an OCR text recognition technology to obtain a feature picture of the table area, detecting the characters of the feature picture, determining a text area in the table area, recognizing the characters of the text area, and determining text information in the table area, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates;
and the filling module is used for determining the structural information of the form according to the text coordinate information, dividing each cell of the form based on the structural information, and filling the text corresponding to the text content information into each corresponding cell of the form.
6. The PDF document table extraction device of claim 5, wherein said PDF document table extraction device further comprises:
the conversion module is used for obtaining a PDF document sample to be trained and converting the PDF document sample to be trained to obtain a sample picture;
the marking module is used for acquiring marking information corresponding to the PDF document sample to be trained and marking the table position in the sample picture based on the marking information;
the training module is used for training the preset initial model based on the marked sample picture to obtain a form identification model;
and the storage module is used for storing the table identification model.
7. The PDF document table extraction device of claim 6, wherein the training module comprises:
the preprocessing unit is used for preprocessing the marked sample picture, wherein the preprocessing process comprises mean value removal, normalization and whitening;
the extraction unit is used for inputting the preprocessed sample picture into a preset convolutional neural network to obtain a feature map of the marked sample picture;
the detection unit is used for inputting the feature map into an RPN region candidate network and detecting the position of a table region in the marked sample picture;
and the confirmation unit is used for obtaining a form identification model when the current detection is determined to reach the convergence condition.
8. The PDF document form extraction device of claim 5, wherein said filling module includes:
a generation unit, configured to generate text boxes based on the text position information, where each text box contains a line of text;
a dividing unit for dividing the text box into different row and column positions of the table by the overlapping proportion of the text box in the horizontal direction and the vertical direction, and dividing each cell of the table based on the different row and column positions;
and the filling unit is used for filling corresponding texts corresponding to the text content information in each cell of the form.
9. A PDF document table extraction device comprising an input output unit, a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the PDF document table extraction method of any of claims 1 to 4.
10. A computer-readable storage medium, wherein a PDF document table extraction program is stored on the computer-readable storage medium, which when executed by a processor, implements the steps of the PDF document table extraction method of any one of claims 1 to 4.
CN201910560432.0A 2019-06-26 2019-06-26 PDF document table extraction method, device, equipment and computer readable storage medium Active CN110390269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910560432.0A CN110390269B (en) 2019-06-26 2019-06-26 PDF document table extraction method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910560432.0A CN110390269B (en) 2019-06-26 2019-06-26 PDF document table extraction method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110390269A CN110390269A (en) 2019-10-29
CN110390269B (en) 2023-08-01

Family

ID=68285644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910560432.0A Active CN110390269B (en) 2019-06-26 2019-06-26 PDF document table extraction method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110390269B (en)

Families Citing this family (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795919B (en) * 2019-11-07 2023-10-31 达观数据有限公司 Form extraction method, device, equipment and medium in PDF document
CN111104871B (en) * 2019-11-28 2023-11-07 北京明略软件系统有限公司 Form region identification model generation method and device and form positioning method and device
CN111241365B (en) * 2019-12-23 2023-06-30 望海康信(北京)科技股份公司 Table picture analysis method and system
CN111144282B (en) * 2019-12-25 2023-12-05 北京同邦卓益科技有限公司 Form recognition method and apparatus, and computer-readable storage medium
CN111259830A (en) * 2020-01-19 2020-06-09 中国农业科学院农业信息研究所 Method and system for fragmenting PDF document contents in overseas agriculture
CN111368744B (en) * 2020-03-05 2023-06-27 中国工商银行股份有限公司 Method and device for identifying unstructured table in picture
CN111382717B (en) * 2020-03-17 2022-09-09 腾讯科技(深圳)有限公司 Table identification method and device and computer readable storage medium
CN111340000A (en) * 2020-03-23 2020-06-26 深圳智能思创科技有限公司 Method and system for extracting and optimizing PDF document table
CN111523292B (en) * 2020-04-23 2023-09-15 北京百度网讯科技有限公司 Method and device for acquiring image information
CN111597943B (en) * 2020-05-08 2021-09-03 杭州火石数智科技有限公司 Table structure identification method based on graph neural network
CN111626027B (en) * 2020-05-20 2023-03-24 北京百度网讯科技有限公司 Table structure restoration method, device, equipment, system and readable storage medium
CN111680491B (en) * 2020-05-27 2024-02-02 北京字跳网络技术有限公司 Method and device for extracting document information and electronic equipment
CN111832403A (en) * 2020-06-04 2020-10-27 北京百度网讯科技有限公司 Document structure recognition method, and model training method and device for document structure recognition
CN111695553B (en) * 2020-06-05 2023-09-08 北京百度网讯科技有限公司 Form identification method, device, equipment and medium
CN111898411B (en) * 2020-06-16 2021-08-31 华南理工大学 Text image labeling system, method, computer device and storage medium
CN111709956B (en) * 2020-06-19 2024-01-12 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and readable storage medium
CN111782839B (en) * 2020-06-30 2023-08-22 北京百度网讯科技有限公司 Image question-answering method, device, computer equipment and medium
CN111814443A (en) * 2020-07-21 2020-10-23 北京来也网络科技有限公司 Table generation method and device combining RPA and AI, computing equipment and storage medium
CN114077830A (en) * 2020-08-17 2022-02-22 税友软件集团股份有限公司 Method, device and equipment for analyzing PDF table document based on position
CN111914805A (en) * 2020-08-18 2020-11-10 科大讯飞股份有限公司 Table structuring method and device, electronic equipment and storage medium
CN112149506A (en) * 2020-08-25 2020-12-29 北京来也网络科技有限公司 Table generation method, apparatus and storage medium in image combining RPA and AI
CN111985459B (en) * 2020-09-18 2023-07-28 北京百度网讯科技有限公司 Table image correction method, apparatus, electronic device and storage medium
CN112115865B (en) * 2020-09-18 2024-04-12 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing image
CN112232198A (en) * 2020-10-15 2021-01-15 北京来也网络科技有限公司 Table content extraction method, device, equipment and medium based on RPA and AI
CN112329548A (en) * 2020-10-16 2021-02-05 北京临近空间飞行器系统工程研究所 Document chapter segmentation method and device and storage medium
CN112364790B (en) * 2020-11-16 2022-10-25 中国民航大学 Airport work order information identification method and system based on convolutional neural network
CN112241730A (en) * 2020-11-21 2021-01-19 杭州投知信息技术有限公司 Form extraction method and system based on machine learning
CN113807158A (en) * 2020-12-04 2021-12-17 四川医枢科技股份有限公司 PDF content extraction method, device and equipment
CN112597773B (en) * 2020-12-08 2022-12-13 上海深杳智能科技有限公司 Document structuring method, system, terminal and medium
CN112434496B (en) * 2020-12-11 2021-06-22 深圳司南数据服务有限公司 Method and terminal for identifying form data of bulletin document
CN112633116B (en) * 2020-12-17 2024-02-02 西安理工大学 Method for intelligently analyzing PDF graphics context
CN112560417A (en) * 2020-12-24 2021-03-26 万兴科技集团股份有限公司 Table editing method and device, computer equipment and storage medium
CN112560767A (en) * 2020-12-24 2021-03-26 南方电网深圳数字电网研究院有限公司 Document signature identification method and device and computer readable storage medium
CN112686223B (en) * 2021-03-12 2021-06-18 腾讯科技(深圳)有限公司 Table identification method and device and computer readable storage medium
CN113221711A (en) * 2021-04-30 2021-08-06 北京金山数字娱乐科技有限公司 Information extraction method and device
CN113177541B (en) * 2021-05-17 2023-12-19 上海云扩信息科技有限公司 Method for extracting text content in PDF document and picture by computer program
CN113255583B (en) * 2021-06-21 2023-02-03 中国平安人寿保险股份有限公司 Data annotation method and device, computer equipment and storage medium
CN113536951B (en) * 2021-06-22 2023-11-24 科大讯飞股份有限公司 Form identification method, related device, electronic equipment and storage medium
CN113269153B (en) * 2021-06-26 2024-03-19 中国电子系统技术有限公司 Form identification method and device
CN113591746A (en) * 2021-08-05 2021-11-02 上海金仕达软件科技有限公司 Document table structure detection method and device
CN113657274B (en) * 2021-08-17 2022-09-20 北京百度网讯科技有限公司 Table generation method and device, electronic equipment and storage medium
CN113626444B (en) * 2021-08-26 2023-11-28 平安国际智慧城市科技股份有限公司 Table query method, device, equipment and medium based on bitmap algorithm
CN113505762B (en) * 2021-09-09 2021-11-30 冠传网络科技(南京)有限公司 Table identification method and device, terminal and storage medium
CN113989823B (en) * 2021-09-14 2022-10-18 北京左医科技有限公司 Image table restoration method and system based on OCR coordinates
CN113963367B (en) * 2021-10-22 2024-05-28 深圳前海环融联易信息科技服务有限公司 Model-based financial transaction file and money extraction method
CN114022883A (en) * 2021-11-05 2022-02-08 深圳前海环融联易信息科技服务有限公司 Financial field transaction file form date extraction method based on model
CN114218233A (en) * 2022-02-22 2022-03-22 子长科技(北京)有限公司 Annual newspaper processing method and device, electronic equipment and storage medium
CN114220103B (en) * 2022-02-22 2022-05-06 成都明途科技有限公司 Image recognition method, device, equipment and computer readable storage medium
CN114783584A (en) * 2022-03-09 2022-07-22 广州方舟信息科技有限公司 Method and device for recording drug delivery receipt
CN114637845B (en) * 2022-03-11 2023-04-14 上海弘玑信息技术有限公司 Model testing method, device, equipment and storage medium
CN114821612B (en) * 2022-05-30 2023-04-07 浙商期货有限公司 Method and system for extracting information of PDF document in securities future scene
CN115082941A (en) * 2022-08-23 2022-09-20 平安银行股份有限公司 Form information acquisition method and device for form document image
CN115331245B (en) * 2022-10-12 2023-02-03 中南民族大学 Table structure identification method based on image instance segmentation
CN115527227A (en) * 2022-10-13 2022-12-27 澎湃数智(北京)科技有限公司 Character recognition method and device, storage medium and electronic equipment
CN115577688B (en) * 2022-12-09 2023-04-28 深圳智能思创科技有限公司 Table structuring processing method, device, storage medium and apparatus
CN115713775B (en) * 2023-01-05 2023-04-25 达而观信息科技(上海)有限公司 Method, system and computer equipment for extracting form from document
CN116562251A (en) * 2023-05-19 2023-08-08 中国矿业大学(北京) Form classification method for stock information disclosure long document
CN116861912B (en) * 2023-08-31 2023-12-05 合肥天帷信息安全技术有限公司 Deep learning-based form entity extraction method and system
CN117332761B (en) * 2023-11-30 2024-02-09 北京一标数字科技有限公司 PDF document intelligent identification marking system
CN117593752B (en) * 2024-01-18 2024-04-09 星云海数字科技股份有限公司 PDF document input method, PDF document input system, storage medium and electronic equipment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101976232A (en) * 2010-09-19 2011-02-16 深圳市万兴软件有限公司 Method for identifying data form in document and device thereof
WO2015009297A1 (en) * 2013-07-16 2015-01-22 Recommind, Inc. Systems and methods for extracting table information from documents
WO2019041527A1 (en) * 2017-08-31 2019-03-07 平安科技(深圳)有限公司 Method of extracting chart in document, electronic device and computer-readable storage medium
WO2019104879A1 (en) * 2017-11-30 2019-06-06 平安科技(深圳)有限公司 Information recognition method for form-type image, electronic device and readable storage medium
CN108416279A (en) * 2018-02-26 2018-08-17 阿博茨德(北京)科技有限公司 Form analysis method and device in file and picture
CN109670461A (en) * 2018-12-24 2019-04-23 广东亿迅科技有限公司 PDF text extraction method, device, computer equipment and storage medium
CN109685056A (en) * 2019-01-04 2019-04-26 达而观信息科技(上海)有限公司 Obtain the method and device of document information

Also Published As

Publication number Publication date
CN110390269A (en) 2019-10-29

Similar Documents

Publication Publication Date Title
CN110390269B (en) PDF document table extraction method, device, equipment and computer readable storage medium
CN110766014B (en) Bill information positioning method, system and computer readable storage medium
CN111476227B (en) Target field identification method and device based on OCR and storage medium
US20200311460A1 (en) Character identification method and device
CN110503054B (en) Text image processing method and device
CN109740515B (en) Evaluation method and device
CN112070076B (en) Text paragraph structure reduction method, device, equipment and computer storage medium
CN110197238B (en) Font type identification method, system and terminal equipment
CN110728687B (en) File image segmentation method and device, computer equipment and storage medium
CN113505762B (en) Table identification method and device, terminal and storage medium
CN111737478B (en) Text detection method, electronic device and computer readable medium
CN111814905A (en) Target detection method, target detection device, computer equipment and storage medium
CN111144372A (en) Vehicle detection method, device, computer equipment and storage medium
CN114663904A (en) PDF document layout detection method, device, equipment and medium
CN112215811A (en) Image detection method and device, electronic equipment and storage medium
CN114663897A (en) Table extraction method and table extraction system
US11386685B2 (en) Multiple channels of rasterized content for page decomposition using machine learning
CN111738252B (en) Text line detection method, device and computer system in image
US11906441B2 (en) Inspection apparatus, control method, and program
CN113673528B (en) Text processing method, text processing device, electronic equipment and readable storage medium
CN112580499A (en) Text recognition method, device, equipment and storage medium
CN113936137A (en) Method, system and storage medium for removing overlapping of image type text line detection areas
CN116311300A (en) Table generation method, apparatus, electronic device and storage medium
Naz et al. Challenges in baseline detection of cursive script languages
CN111291756B (en) Method and device for detecting text region in image, computer equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant