CN110390269B - PDF document table extraction method, device, equipment and computer readable storage medium - Google Patents
- Publication number
- CN110390269B (application CN201910560432.0A)
- Authority
- CN
- China
- Prior art keywords
- pdf document
- text
- information
- picture
- area
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/412—Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Character Discrimination (AREA)
- Character Input (AREA)
Abstract
The application relates to the technical field of artificial intelligence and discloses a PDF document table extraction method, device, equipment and computer readable storage medium. The method comprises the following steps: acquiring a PDF document to be identified and processing it; preprocessing the processed PDF document, inputting it into a convolutional neural network to output a feature map, inputting the feature map into an RPN region candidate network, and determining a table region; preprocessing the table region and extracting features based on OCR (optical character recognition) to obtain a feature picture, detecting characters in the feature picture to determine a text region, and recognizing the characters of the text region to determine text information, wherein the text information comprises text position information and text content information; and determining the structural information of the table according to the text coordinate information, dividing the cells of the table based on the structural information, and filling the text corresponding to the text content information into the corresponding cells of the table. By the method and the device, the accuracy of PDF document table extraction is improved.
Description
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a PDF document table extraction method, apparatus, device, and computer readable storage medium.
Background
Existing methods for extracting tables from PDF files are mostly aimed at PDFs with extractable text, where the table region is extracted from the structural information of the PDF. For picture-type PDF files, tables can only be extracted by traditional image processing: first the table frame is extracted, then the in-frame regions are extracted from the frame, and finally OCR (optical character recognition) is performed on the in-frame images to extract the table contents. However, this approach works only for tables with complete ruled lines; if the ruled lines are incomplete, the located table region or the cell contents may be incomplete, resulting in low table extraction accuracy.
Disclosure of Invention
The main purpose of the application is to provide a PDF document table extraction method, device, equipment and computer readable storage medium, which aim to solve the technical problems of the narrow application range and low accuracy of existing PDF document table extraction methods.
In order to achieve the above object, the present application provides a PDF document table extraction method, which includes the following steps:
acquiring a PDF document to be identified, and processing the PDF document to be identified, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and processing the PDF document comprises converting the PDF document capable of extracting text content into a PDF document of the picture type;
preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a characteristic diagram of the processed PDF document based on the preset convolutional neural network, inputting the characteristic diagram into an RPN region candidate network, and determining a table region in the processed PDF document;
preprocessing the table area based on an OCR text recognition technology, extracting features to obtain a feature picture of the table area, detecting the features of the feature picture, determining a text area in the table area, recognizing the text area, and determining text information in the table area, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates;
and determining the structural information of the table according to the text coordinate information, dividing each cell of the table based on the structural information, and filling the text corresponding to the text content information into each corresponding cell of the table.
Optionally, the preprocessing the processed PDF document and inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, and inputting the feature map into an RPN region candidate network, where before determining a table region in the processed PDF document, the method further includes:
acquiring a PDF document sample to be trained, and converting the PDF document sample to be trained to obtain a sample picture;
acquiring marking information corresponding to the PDF document sample to be trained, and marking the table position in the sample picture based on the marking information;
training a preset initial model based on the marked sample picture to obtain a form identification model;
and saving the table identification model.
Optionally, training the preset initial model based on the labeled sample picture, and obtaining the form identification model includes:
preprocessing the marked sample picture, wherein the preprocessing process comprises mean value removal, normalization and whitening;
inputting the preprocessed sample picture into a preset convolutional neural network to obtain a feature map of the marked sample picture;
inputting the feature map into an RPN region candidate network, and detecting the position of a table region in the marked sample picture;
and when the current detection is determined to reach the convergence condition, obtaining a form identification model.
Optionally, the determining the structure information of the table according to the text coordinate information, dividing each cell of the table based on the structure information, and filling the text corresponding to the text content information into each corresponding cell of the table includes:
generating text boxes based on the text position information, wherein each text box contains a line of text;
dividing a text box into different row and column positions of the table by the overlapping proportion of the text box in the horizontal direction and the vertical direction, and dividing each cell of the table based on the different row and column positions;
and filling corresponding texts corresponding to the text content information into each cell of the table.
In addition, in order to achieve the above object, the present application further provides a PDF document table extraction device including:
the processing module is used for acquiring a PDF document to be identified and processing the PDF document to be identified, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and the processing of the PDF document comprises the conversion of the PDF document capable of extracting text content into the PDF document of the picture type;
the identification module is used for preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, inputting the feature map into an RPN area candidate network and determining a form area in the processed PDF document;
the positioning module is used for preprocessing the table area and extracting features based on an OCR text recognition technology to obtain a feature picture of the table area, detecting the characters of the feature picture, determining a text area in the table area, recognizing the characters of the text area, and determining text information in the table area, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates;
and the filling module is used for determining the structural information of the form according to the text coordinate information, dividing each cell of the form based on the structural information, and filling the text corresponding to the text content information into each corresponding cell of the form.
Optionally, the PDF document table extracting device further includes:
the conversion module is used for obtaining a PDF document sample to be trained and converting the PDF document sample to be trained to obtain a sample picture;
the marking module is used for acquiring marking information corresponding to the PDF document sample to be trained and marking the table position in the sample picture based on the marking information;
the training module is used for training the preset initial model based on the marked sample picture to obtain a form identification model;
and the storage module is used for storing the table identification model.
Optionally, the training module includes:
the preprocessing unit is used for preprocessing the marked sample picture, wherein the preprocessing process comprises mean value removal, normalization and whitening;
the extraction unit is used for inputting the preprocessed sample picture into a preset convolutional neural network to obtain a feature map of the marked sample picture;
the detection unit is used for inputting the feature map into an RPN region candidate network and detecting the position of a table region in the marked sample picture;
and the confirmation unit is used for obtaining a form identification model when the current detection is determined to reach the convergence condition.
Optionally, the filling module includes:
a generation unit, configured to generate text boxes based on the text position information, where each text box contains a line of text;
a dividing unit for dividing the text box into different row and column positions of the table by the overlapping proportion of the text box in the horizontal direction and the vertical direction, and dividing each cell of the table based on the different row and column positions;
and the filling unit is used for filling corresponding texts corresponding to the text content information in each cell of the form.
In addition, in order to achieve the above object, the present application further provides a PDF document table extraction apparatus, which includes an input-output unit, a memory, and a processor, the memory storing computer-readable instructions that, when executed by the processor, cause the processor to implement the steps of the PDF document table extraction method described above.
In addition, in order to achieve the above object, the present application also provides a computer-readable storage medium having stored thereon a PDF document table extraction program which, when executed by a processor, implements the steps of the PDF document table extraction method described above.
The PDF document table extraction method first obtains a PDF document to be identified, where the PDF document to be identified may be a PDF document with extractable text content or a picture-type PDF document, and processes it by converting any PDF document with extractable text content into a picture-type PDF document. The processed PDF document is then preprocessed and input into a preset convolutional neural network to obtain a feature map, and the feature map is input into an RPN region candidate network to determine the table region in the processed PDF document. Next, the table region is preprocessed and its features are extracted using OCR character recognition and positioning to obtain a feature picture of the table region; character detection is performed on the feature picture to determine the text region, and a character recognition algorithm finally determines the text information in the table region. Finally, the structural information of the table is determined from the text coordinate information, the cells of the table are divided based on that structural information, and the text corresponding to the text content information is filled into the corresponding cells, completing the extraction of table content from the PDF document. The method provided by the present application can accurately locate tables and extract their contents for PDF documents of different formats, thereby widening the application range of PDF document table extraction and improving its accuracy.
Drawings
FIG. 1 is a schematic diagram of a PDF document table extraction device in a hardware running environment according to an embodiment of the present application;
FIG. 2 is a flowchart of an embodiment of a method for extracting a PDF document table;
FIG. 3 is a schematic diagram of a functional module of an embodiment of a PDF document table extraction device according to the present application;
FIG. 4 is a schematic diagram of a functional module of another embodiment of a PDF document form extraction device according to the present application;
FIG. 5 is a schematic diagram of functional units of the training module 70 in another embodiment of the PDF document table extracting device of the present application;
FIG. 6 is a schematic diagram of functional units of a filling module 40 in another embodiment of the PDF document table extraction device of the present application;
fig. 7 is a schematic diagram of a text box in an embodiment of a PDF document table extraction method of the present application.
The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments, referring to the attached drawings.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a PDF document table extraction device of a hardware running environment according to an embodiment of the present application.
The PDF document table extraction device in the embodiment of the present application may be a terminal device with data processing capability, such as a portable computer, a server, or the like.
As shown in fig. 1, the PDF document table extraction apparatus may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display, an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a stable memory (non-volatile memory), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the aforementioned processor 1001.
It will be appreciated by those skilled in the art that the PDF document table extraction device structure shown in fig. 1 does not constitute a limitation of the PDF document table extraction device, which may include more or fewer components than illustrated, may combine certain components, or may have a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a PDF document table extraction program may be included in the memory 1005 as one type of computer storage medium.
In the PDF document table extraction apparatus shown in fig. 1, the network interface 1004 is mainly used to connect to a background server, and perform data communication with the background server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be used to call a PDF document table extraction program stored in the memory 1005 and perform the operations of the following embodiments of the PDF document table extraction method.
Referring to fig. 2, fig. 2 is a flowchart illustrating an embodiment of a PDF document table extraction method, where the PDF document table extraction method includes:
step S10, a PDF document to be identified is obtained and is processed, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and the PDF document is processed by converting the PDF document capable of extracting text content into the PDF document of the picture type.
In this embodiment, the PDF document to be identified may be in multiple formats: for example, a PDF document from which text content can be extracted, or a picture-type PDF document. After the PDF document to be identified is obtained, it is first processed; specifically, it is converted into a picture. If the PDF document to be identified is already a picture-type PDF document, no processing is needed.
Step S20, preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, inputting the feature map into an RPN region candidate network, and determining a table region in the processed PDF document.
Further, the converted picture is input into a preset form recognition model, so that the form recognition model recognizes a form region in the picture.
It can be understood that the preset form identification model is obtained by training in advance, and the preset form identification model is trained through the PDF document sample to be trained so as to improve the accuracy of form area identification.
Specifically, the process of performing table area recognition on the processed PDF document by the preset table recognition model is as follows:
Firstly, the picture obtained by converting the PDF document is preprocessed, where the preprocessing steps comprise mean value removal, normalization and whitening, and the aim of preprocessing is to strengthen the features of the picture. Further, the preprocessed picture is input into the preset convolutional neural network to extract a feature map. Finally, the obtained feature map is input into an RPN (Region Proposal Network) to identify and locate the table region.
Step S30, preprocessing and feature extraction are carried out on the table area based on the OCR text recognition technology, feature pictures of the table area are obtained, text detection is carried out on the feature pictures, text areas in the table area are determined, text recognition is carried out on the text areas, text information in the table area is determined, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates.
Further, after the table area in the picture is identified, the characters in the table area are positioned and identified by utilizing an OCR character recognition technology, and text position information and text content information in the table area are determined.
Specifically, the OCR text recognition proceeds as follows: firstly, the picture containing the table region is preprocessed, with the aim of reducing useless information in the picture and preparing the character-bearing picture for subsequent feature extraction; next, the text portions of the preprocessed picture are detected, where a common image detection algorithm may be used to frame the text regions in the picture (details are not repeated here); finally, the text in the detected text regions is recognized by a character recognition algorithm to determine the specific text content information.
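The description does not commit to a specific text-detection algorithm. As a purely illustrative sketch (not the patented method), text-line bands in a binarized picture can be found with a horizontal projection profile:

```python
def find_text_rows(binary_image):
    """Return (top, bottom) row-index pairs for horizontal bands that
    contain at least one foreground (1) pixel, i.e. candidate text lines."""
    # Horizontal projection: count foreground pixels in each row.
    projection = [sum(row) for row in binary_image]
    bands, start = [], None
    for y, count in enumerate(projection):
        if count > 0 and start is None:
            start = y                      # entering a text band
        elif count == 0 and start is not None:
            bands.append((start, y - 1))   # leaving a text band
            start = None
    if start is not None:
        bands.append((start, len(projection) - 1))
    return bands

# Two text lines separated by one blank row.
image = [
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
print(find_text_rows(image))  # [(0, 1), (3, 3)]
```

A real system would combine such projections (or a learned detector) in both directions and then pass each band to the character recognizer.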
Step S40, the structure information of the table is determined according to the text coordinate information, each cell of the table is divided based on the structure information, and the text corresponding to the text content information is filled into each corresponding cell of the table.
It will be appreciated that specific text position information can be obtained during text recognition and location within the table region, and that this position information may be represented by the upper-left and lower-right corner coordinates of each line of text. From these coordinates, the row and column position of each line of text in the table, and hence the structural information of the table, can be determined. After the structural information of the table is determined, the text content information corresponding to each text position is filled into the corresponding row and column position.
In this embodiment, a PDF document to be identified is first obtained, where the PDF document to be identified may be a PDF document with extractable text content or a picture-type PDF document, and is processed by converting any PDF document with extractable text content into a picture-type PDF document. The processed PDF document is preprocessed and input into a preset convolutional neural network to obtain a feature map, and the feature map is input into an RPN region candidate network to determine the table region in the processed PDF document. The table region is then preprocessed and its features are extracted using OCR character recognition and positioning to obtain a feature picture of the table region; character detection on the feature picture determines the text region, and a character recognition algorithm determines the text information in the table region. Finally, the structural information of the table is determined from the text coordinate information, the cells of the table are divided based on that structural information, and the text corresponding to the text content information is filled into the corresponding cells, completing the extraction of table content from the PDF document. The method provided by the present application can accurately locate tables and extract their contents for PDF documents of different formats, thereby widening the application range of PDF document table extraction and improving its accuracy.
Further, before step S20, the method further includes:
step S50, a PDF document sample to be trained is obtained, and the PDF document sample to be trained is converted to obtain a sample picture;
step S60, marking information corresponding to a PDF document sample to be trained is obtained, and the table position in a sample picture is marked based on the marking information;
step S70, training a preset initial model based on the marked sample picture to obtain a form identification model;
step S80, save the form identification model.
In this embodiment, a preset table recognition model is trained on PDF document samples, so that table region recognition can be performed on the PDF document to be recognized using the trained model. Specifically, the PDF document samples to be trained may include both PDF documents with extractable text content and picture-type PDF documents. If a PDF document sample to be trained has extractable text content, it is converted page by page into sample pictures; if it is a picture-type PDF document, no processing is needed.
Further, labeling information corresponding to the PDF document sample to be trained is obtained to label the converted sample picture; specifically, the table position in the sample picture is labeled, where the labeling information may be expressed as coordinates. For example, a rectangular frame is drawn around a table region in the sample picture, and the rectangle is then represented by its upper-left and lower-right corner coordinates; that is, the table position in the sample picture is represented by the upper-left and lower-right coordinates of the rectangular frame.
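A label of this kind reduces to one rectangle per table, stored as two corner coordinates. The record below is a minimal sketch of such an annotation; the field names and the `contains` helper are illustrative choices, not part of the patent:

```python
from dataclasses import dataclass

@dataclass
class TableAnnotation:
    """One labeled table region on one sample picture (hypothetical format)."""
    page: int
    x1: float  # upper-left corner x
    y1: float  # upper-left corner y
    x2: float  # lower-right corner x
    y2: float  # lower-right corner y

    def contains(self, x, y):
        """True if point (x, y) falls inside the labeled rectangle."""
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2

label = TableAnnotation(page=0, x1=50, y1=120, x2=540, y2=400)
print(label.contains(300, 200))  # True
```

During training, the model's predicted rectangles are compared against records like this one.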
In this embodiment, the algorithm used for training the preset table recognition model is the Faster R-CNN algorithm, and the training process is as follows:
firstly, preprocessing is carried out on an input sample picture, and the aim of the preprocessing is to strengthen the characteristics of the sample picture. Specifically, the preprocessing process sequentially comprises mean value removal, normalization and whitening, wherein the purpose of mean value removal is to center each dimension in input sample data to 0, namely, the center in the sample data is pulled back to the origin of a coordinate system; the normalization aims to normalize the amplitude of the sample data to the same range and reduce the interference caused by the difference of the value ranges of the data in each dimension; whitening is the normalization of the amplitude on each characteristic axis of the sample data. Through the steps, the sample picture obtained by converting the PDF document to be trained is subjected to characteristic reinforcement.
Further, the preprocessed sample picture is input into the preset convolutional neural network for feature map extraction. Specifically, in this embodiment, the preset convolutional neural network comprises 13 convolutional layers, 13 excitation layers and 4 pooling layers. The convolution kernels of the convolutional layers are 3×3 with a padding of 1, the function of the padding being that the convolutional layers do not change the spatial size between input and output; the pooling windows are 2×2 with a stride of 2. After the convolution, excitation and pooling operations, the preprocessed sample picture yields a feature vector representing the feature map corresponding to the sample picture.
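Because every 3×3 convolution with padding 1 preserves spatial size and each of the 4 pooling layers halves it, the feature map is the input downsampled by 2⁴ = 16. A quick check of that arithmetic (layer counts taken from the paragraph above):

```python
def feature_map_size(input_size, n_conv=13, n_pool=4):
    """Trace one spatial dimension through the network described above:
    3x3 conv with padding 1 preserves size; 2x2 pool with stride 2 halves it."""
    size = input_size
    for _ in range(n_conv):
        size = (size + 2 * 1 - 3) // 1 + 1   # conv output size: unchanged
    for _ in range(n_pool):
        size = size // 2                     # pool output size: halved
    return size

print(feature_map_size(800))  # 50, i.e. 800 / 2**4
```

This factor of 16 is what maps each feature-map position back to a 16×16 patch of the input picture when the RPN places candidate windows.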
Further, the obtained feature map is input into the RPN to identify and locate the table region. Firstly, a 3×3 convolution is applied to the feature map to obtain a 256-dimensional vector at each position; 9 candidate windows are then generated for each position by scale transformation, and a softmax function classifies each candidate window as foreground or background. Meanwhile, bounding-box regression offsets are computed for the candidate windows, so as to preliminarily determine the table region. Finally, target regions are obtained from the foreground candidate windows and the bounding-box regression offsets, and candidates that are too small or extend beyond the image boundary are eliminated, giving the final target region, i.e., the finally identified table region.
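The 9 candidate windows per position are conventionally formed as 3 scales × 3 aspect ratios. A minimal sketch of that generation step; the base size, scales, and ratios follow common Faster R-CNN defaults, which the patent does not specify:

```python
def make_anchors(cx, cy, base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate the 9 candidate windows (3 scales x 3 aspect ratios) that an
    RPN considers at one feature-map position, centered at pixel (cx, cy).
    Returns (x1, y1, x2, y2) boxes."""
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2      # target anchor area at this scale
        for ratio in ratios:
            w = round((area / ratio) ** 0.5)  # width so that w * (w * ratio) = area
            h = round(w * ratio)
            anchors.append((cx - w // 2, cy - h // 2, cx + w // 2, cy + h // 2))
    return anchors

boxes = make_anchors(400, 400)
print(len(boxes))  # 9 candidate windows for this position
```

Each of these boxes is then scored foreground/background by the softmax branch and refined by the regression offsets described above.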
In this embodiment, each sample picture converted from the PDF document samples to be trained carries labeling information, i.e., the position of the table region it contains. The sample pictures are therefore input into the preset table recognition model to identify and locate their table regions, and the output of the model is compared with the labeling information. If they are inconsistent, the parameters of the model are readjusted until the output is consistent with the labeling information, at which point training of the preset table recognition model is determined to be complete.
Further, after training is confirmed complete, the preset table recognition model is saved, so that table region recognition can be performed on the PDF document to be recognized using the trained model.
Further, step S40 includes:
step S401, generating text boxes based on the text position information, where each text box contains one line of text;
step S402, assigning the text boxes to row and column positions of the table according to their overlap proportions in the horizontal and vertical directions, and dividing the cells of the table based on those row and column positions;
step S403, filling the text corresponding to the text content information into each cell of the table.
In this embodiment, during text region detection based on OCR text recognition technology, specific text position information is obtained. The text position information may be represented by the upper-left and lower-right corner coordinates of each line of text; as shown in fig. 7, the text selected and located by such a rectangular box is referred to as a text box.
The text boxes can be assigned to row and column positions of the table according to their overlap proportions in the horizontal and vertical directions. Specifically, text boxes in the same row have the same coordinate range in the vertical direction, while their coordinate ranges in the horizontal direction do not intersect; text boxes in the same column have the same coordinate range in the horizontal direction, while their coordinate ranges in the vertical direction do not intersect. Therefore, by analyzing the coordinate information of the text boxes, the structural information of the table can be determined. For example, the structural information of the table may be 3×5, indicating that the table has 3 rows and 5 columns. If a text spans multiple columns, as shown in FIG. 7, the vertical grid lines within that cell are removed.
After the structural information of the table is determined, the corresponding text content is filled into each cell of the table according to the recognized text content information, completing the extraction of the table content from the PDF document.
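The row/column assignment in steps S401 to S403 can be sketched as a grouping of text boxes by projection overlap. This is a minimal illustration under assumptions: the patent does not specify the overlap threshold, and matching each box against the first box of a group is a simplification.

```python
def overlap_ratio(a_lo, a_hi, b_lo, b_hi):
    """Proportion of the smaller interval covered by the intersection."""
    inter = max(0, min(a_hi, b_hi) - max(a_lo, b_lo))
    smaller = min(a_hi - a_lo, b_hi - b_lo)
    return inter / smaller if smaller > 0 else 0.0

def group_boxes(boxes, axis, threshold=0.5):
    """Group text boxes (x1, y1, x2, y2) whose projections on one axis overlap
    enough: axis=1 groups rows (vertical overlap), axis=0 groups columns."""
    lo, hi = (1, 3) if axis == 1 else (0, 2)
    groups = []
    for box in sorted(boxes, key=lambda b: b[lo]):
        for g in groups:
            # Compare against the group's first box (a sketch-level simplification).
            if overlap_ratio(box[lo], box[hi], g[0][lo], g[0][hi]) >= threshold:
                g.append(box)
                break
        else:
            groups.append([box])
    return groups
```

The table's structural information then follows directly: `len(group_boxes(boxes, axis=1))` rows by `len(group_boxes(boxes, axis=0))` columns, after which each cell is filled with the text content of the box assigned to it.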
Referring to fig. 3, fig. 3 is a schematic functional block diagram of an embodiment of a PDF document table extracting apparatus of the present application.
In this embodiment, the PDF document table extraction device includes:
the processing module 10 is configured to obtain a PDF document to be identified, and process the PDF document to be identified, where the PDF document to be identified includes a PDF document capable of extracting text content and a PDF document of a picture class, and process the PDF document includes converting the PDF document capable of extracting text content into a PDF document of a picture class;
the identifying module 20 is configured to pre-process the processed PDF document, input the pre-processed PDF document into a preset convolutional neural network, output a feature map of the processed PDF document based on the preset convolutional neural network, input the feature map into an RPN region candidate network, and determine a table region in the processed PDF document;
the positioning module 30 is configured to perform preprocessing and feature extraction on the table area based on an OCR text recognition technology, obtain a feature picture of the table area, perform text detection on the feature picture, determine a text area in the table area, perform text recognition on the text area, and determine text information in the table area, where the text information includes text position information and text content information, and the text position information is represented by coordinates;
and a filling module 40, configured to determine structural information of the form according to the text coordinate information, divide each cell of the form based on the structural information, and fill text corresponding to the text content information into each corresponding cell of the form.
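The four modules above form a sequential pipeline, which can be sketched as follows. The function interfaces are hypothetical, introduced only to show how the processing, identifying, positioning, and filling modules hand data to one another; they are not the patent's actual APIs.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical wiring of the four modules; the callable signatures are
# assumptions for illustration, not the patent's actual interfaces.

@dataclass
class TableExtractor:
    process: Callable   # module 10: convert the PDF into picture-class pages
    identify: Callable  # module 20: CNN + RPN, page -> table regions
    locate: Callable    # module 30: OCR, region -> text boxes and contents
    fill: Callable      # module 40: structure inference and cell filling

    def extract(self, pdf_path):
        tables = []
        for page in self.process(pdf_path):
            for region in self.identify(page):
                texts = self.locate(region)
                tables.append(self.fill(texts))
        return tables
```

Any concrete implementations of the four stages can be plugged in, e.g. `TableExtractor(process=..., identify=..., locate=..., fill=...).extract("report.pdf")`.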
Further, referring to fig. 4, the PDF document table extraction apparatus further includes:
the conversion module 50 is used for obtaining a PDF document sample to be trained and converting the PDF document sample to be trained to obtain a sample picture;
the labeling module 60 is configured to obtain labeling information corresponding to the PDF document sample to be trained, and label a table position in the sample picture based on the labeling information;
the training module 70 is configured to train the preset initial model based on the labeled sample picture to obtain a form recognition model;
a saving module 80, configured to save the table identification model.
Further, referring to fig. 5, the training module 70 includes:
the preprocessing unit 701 is configured to perform preprocessing on the marked sample picture, where the preprocessing includes mean removal, normalization and whitening;
the extracting unit 702 is configured to input the preprocessed sample picture into a preset convolutional neural network, so as to obtain a feature map of the labeled sample picture;
a detection unit 703, configured to input the feature map into an RPN region candidate network, and detect a position of a table region in the labeled sample picture;
and a confirmation unit 704, configured to obtain a table identification model when it is determined that the current detection reaches the convergence condition.
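The preprocessing the unit 701 performs (mean removal, normalization, and whitening) can be sketched as below. This is a generic ZCA-whitening illustration under assumptions; the patent does not specify the whitening variant or its parameters.

```python
import numpy as np

def preprocess(images):
    """Mean removal, normalization, and (ZCA) whitening of flattened samples.
    A sketch: the epsilon value and the ZCA choice are assumptions."""
    X = np.asarray(images, dtype=np.float64)
    X = X - X.mean(axis=0)                 # mean removal (per feature)
    std = X.std(axis=0)
    X = X / np.where(std > 0, std, 1.0)    # normalization to unit scale
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    eps = 1e-5                             # numerical stabilizer (assumed)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return X @ W                           # decorrelated, unit-variance features
```

After this step the feature dimensions are approximately decorrelated with unit variance, which is the usual motivation for whitening before feeding samples to a convolutional network.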
Further, referring to fig. 6, the filling module 40 includes:
a generating unit 401, configured to generate text boxes based on the text position information, where each text box contains a line of text;
a dividing unit 402, configured to divide a text box into different row and column positions of the table by overlapping proportions of the text box in a horizontal direction and a vertical direction, and divide each cell of the table based on the different row and column positions;
and a filling unit 403, configured to fill in a text corresponding to the text content information in each cell of the form.
The specific embodiments of the PDF document table extracting device of the present application are substantially the same as the embodiments of the PDF document table extracting method described above, and are not described herein.
In addition, the embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a PDF document table extraction program, and the PDF document table extraction program realizes the steps of the PDF document table extraction method when being executed by a processor.
The specific embodiments of the computer readable storage medium are basically the same as the embodiments of the PDF document table extraction method described above, and are not described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for description and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, though in many cases the former is preferred. Based on such an understanding, the technical solution of the present application, or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to perform the methods described in the embodiments of the present application.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.
Claims (10)
1. A PDF document table extraction method, characterized by comprising the following steps:
acquiring a PDF document to be identified, and processing the PDF document to be identified, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and processing the PDF document comprises converting the PDF document capable of extracting text content into a PDF document of the picture type;
preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a characteristic diagram of the processed PDF document based on the preset convolutional neural network, inputting the characteristic diagram into an RPN region candidate network, and determining a table region in the processed PDF document;
preprocessing the table area and extracting features based on an OCR text recognition technology to obtain a feature picture of the table area, performing text detection on the feature picture, determining a text area in the table area, performing text recognition on the text area, and determining text information in the table area, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates;
and determining the structural information of the table according to the text coordinate information, dividing each cell of the table based on the structural information, and filling the text corresponding to the text content information into each corresponding cell of the table.
2. The PDF document table extraction method of claim 1 wherein the preprocessing the processed PDF document and inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, and inputting the feature map into an RPN region candidate network, and before determining a table region in the processed PDF document, further comprising:
acquiring a PDF document sample to be trained, and converting the PDF document sample to be trained to obtain a sample picture;
acquiring marking information corresponding to the PDF document sample to be trained, and marking the table position in the sample picture based on the marking information;
training a preset initial model based on the marked sample picture to obtain a form identification model;
and saving the table identification model.
3. The PDF document table extraction method of claim 2, wherein training the preset initial model based on the annotated sample picture to obtain the table identification model includes:
preprocessing the marked sample picture, wherein the preprocessing process comprises mean value removal, normalization and whitening;
inputting the preprocessed sample picture into a preset convolutional neural network to obtain a feature map of the marked sample picture;
inputting the feature map into an RPN region candidate network, and detecting the position of a table region in the marked sample picture;
and when the current detection is determined to reach the convergence condition, obtaining a form identification model.
4. The PDF document table extraction method of claim 1, wherein the determining structural information of the table from the text coordinate information, dividing cells of the table based on the structural information, and filling text corresponding to the text content information into corresponding cells of the table includes:
generating text boxes based on the text position information, wherein each text box contains a line of text;
dividing a text box into different row and column positions of the table by the overlapping proportion of the text box in the horizontal direction and the vertical direction, and dividing each cell of the table based on the different row and column positions;
and filling corresponding texts corresponding to the text content information into each cell of the table.
5. A PDF document table extraction device, characterized in that the PDF document table extraction device includes:
the processing module is used for acquiring a PDF document to be identified and processing the PDF document to be identified, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and the processing of the PDF document comprises the conversion of the PDF document capable of extracting text content into the PDF document of the picture type;
the identification module is used for preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, inputting the feature map into an RPN area candidate network and determining a form area in the processed PDF document;
the positioning module is used for preprocessing the table area and extracting features based on an OCR text recognition technology to obtain a feature picture of the table area, detecting the characters of the feature picture, determining a text area in the table area, recognizing the characters of the text area, and determining text information in the table area, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates;
and the filling module is used for determining the structural information of the form according to the text coordinate information, dividing each cell of the form based on the structural information, and filling the text corresponding to the text content information into each corresponding cell of the form.
6. The PDF document table extraction device of claim 5, wherein said PDF document table extraction device further comprises:
the conversion module is used for obtaining a PDF document sample to be trained and converting the PDF document sample to be trained to obtain a sample picture;
the marking module is used for acquiring marking information corresponding to the PDF document sample to be trained and marking the table position in the sample picture based on the marking information;
the training module is used for training the preset initial model based on the marked sample picture to obtain a form identification model;
and the storage module is used for storing the table identification model.
7. The PDF document table extraction device of claim 6, wherein the training module comprises:
the preprocessing unit is used for preprocessing the marked sample picture, wherein the preprocessing process comprises mean value removal, normalization and whitening;
the extraction unit is used for inputting the preprocessed sample picture into a preset convolutional neural network to obtain a feature map of the marked sample picture;
the detection unit is used for inputting the feature map into an RPN region candidate network and detecting the position of a table region in the marked sample picture;
and the confirmation unit is used for obtaining a form identification model when the current detection is determined to reach the convergence condition.
8. The PDF document form extraction device of claim 5, wherein said filling module includes:
a generation unit, configured to generate text boxes based on the text position information, where each text box contains a line of text;
a dividing unit for dividing the text box into different row and column positions of the table by the overlapping proportion of the text box in the horizontal direction and the vertical direction, and dividing each cell of the table based on the different row and column positions;
and the filling unit is used for filling corresponding texts corresponding to the text content information in each cell of the form.
9. A PDF document table extraction device, comprising an input/output unit, a memory, and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the PDF document table extraction method of any of claims 1 to 4.
10. A computer-readable storage medium, wherein a PDF document table extraction program is stored on the computer-readable storage medium, which when executed by a processor, implements the steps of the PDF document table extraction method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910560432.0A CN110390269B (en) | 2019-06-26 | 2019-06-26 | PDF document table extraction method, device, equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110390269A CN110390269A (en) | 2019-10-29 |
CN110390269B true CN110390269B (en) | 2023-08-01 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101976232A (en) * | 2010-09-19 | 2011-02-16 | 深圳市万兴软件有限公司 | Method for identifying data form in document and device thereof |
WO2015009297A1 (en) * | 2013-07-16 | 2015-01-22 | Recommind, Inc. | Systems and methods for extracting table information from documents |
CN108416279A (en) * | 2018-02-26 | 2018-08-17 | 阿博茨德(北京)科技有限公司 | Form analysis method and device in file and picture |
WO2019041527A1 (en) * | 2017-08-31 | 2019-03-07 | 平安科技(深圳)有限公司 | Method of extracting chart in document, electronic device and computer-readable storage medium |
CN109670461A (en) * | 2018-12-24 | 2019-04-23 | 广东亿迅科技有限公司 | PDF text extraction method, device, computer equipment and storage medium |
CN109685056A (en) * | 2019-01-04 | 2019-04-26 | 达而观信息科技(上海)有限公司 | Obtain the method and device of document information |
WO2019104879A1 (en) * | 2017-11-30 | 2019-06-06 | 平安科技(深圳)有限公司 | Information recognition method for form-type image, electronic device and readable storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||