CN110390269B - PDF document table extraction method, device, equipment and computer readable storage medium - Google Patents

PDF document table extraction method, device, equipment and computer readable storage medium

Info

Publication number
CN110390269B
CN110390269B
Authority
CN
China
Prior art keywords
pdf document
text
information
picture
area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910560432.0A
Other languages
Chinese (zh)
Other versions
CN110390269A (en)
Inventor
刘克亮
卢波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910560432.0A priority Critical patent/CN110390269B/en
Publication of CN110390269A publication Critical patent/CN110390269A/en
Application granted granted Critical
Publication of CN110390269B publication Critical patent/CN110390269B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/412Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Character Discrimination (AREA)
  • Character Input (AREA)

Abstract

The application relates to the technical field of artificial intelligence and discloses a PDF document table extraction method, device, equipment and computer readable storage medium, wherein the method comprises the following steps: acquiring a PDF document to be identified, and processing the PDF document to be identified; preprocessing the processed PDF document, inputting the preprocessed PDF document into a convolutional neural network to output a feature map, inputting the feature map into an RPN region candidate network, and determining a table region; preprocessing the table region and extracting features based on OCR (optical character recognition) technology to obtain a feature picture, performing text detection on the feature picture to determine a text region, performing text recognition on the text region, and determining text information, wherein the text information comprises text position information and text content information; and determining the structural information of the table according to the text coordinate information, dividing the cells of the table based on the structural information, and filling the text corresponding to the text content information into the corresponding cells of the table. By the method and the device, the accuracy of PDF document table extraction is improved.

Description

PDF document table extraction method, device, equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a PDF document table extraction method, apparatus, device, and computer readable storage medium.
Background
Existing methods for extracting tables from PDF files are mainly aimed at PDFs with extractable text, where the table area is obtained from the structural information of the PDF. For picture-type PDF files, table extraction can only be carried out by traditional image processing: first the table frame is extracted, then the in-frame regions are extracted according to the table frame, and finally OCR (optical character recognition) is performed on the in-frame region images to extract the table contents. However, this approach is effective only for tables with complete ruled lines; if the table ruled lines are incomplete, the located table area or the cell contents may be incomplete, resulting in low table extraction accuracy.
Disclosure of Invention
The main purpose of the application is to provide a PDF document table extraction method, device, equipment and computer readable storage medium, aiming to solve the technical problems of the limited application range and low accuracy of existing PDF document table extraction methods.
In order to achieve the above object, the present application provides a PDF document table extraction method, which includes the following steps:
acquiring a PDF document to be identified, and processing the PDF document to be identified, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and processing the PDF document comprises converting the PDF document capable of extracting text content into a PDF document of the picture type;
preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a characteristic diagram of the processed PDF document based on the preset convolutional neural network, inputting the characteristic diagram into an RPN region candidate network, and determining a table region in the processed PDF document;
preprocessing the table area based on an OCR text recognition technology, extracting features to obtain a feature picture of the table area, detecting the features of the feature picture, determining a text area in the table area, recognizing the text area, and determining text information in the table area, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates;
and determining the structural information of the table according to the text coordinate information, dividing each cell of the table based on the structural information, and filling the text corresponding to the text content information into each corresponding cell of the table.
Optionally, the preprocessing the processed PDF document and inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, and inputting the feature map into an RPN region candidate network, where before determining a table region in the processed PDF document, the method further includes:
acquiring a PDF document sample to be trained, and converting the PDF document sample to be trained to obtain a sample picture;
acquiring marking information corresponding to the PDF document sample to be trained, and marking the table position in the sample picture based on the marking information;
training a preset initial model based on the marked sample picture to obtain a form identification model;
and saving the table identification model.
Optionally, training the preset initial model based on the labeled sample picture, and obtaining the form identification model includes:
preprocessing the marked sample picture, wherein the preprocessing process comprises mean value removal, normalization and whitening;
inputting the preprocessed sample picture into a preset convolutional neural network to obtain a feature map of the marked sample picture;
inputting the feature map into an RPN region candidate network, and detecting the position of a table region in the marked sample picture;
and when the current detection is determined to reach the convergence condition, obtaining a form identification model.
Optionally, the determining the structure information of the table according to the text coordinate information, dividing each cell of the table based on the structure information, and filling the text corresponding to the text content information into each corresponding cell of the table includes:
generating text boxes based on the text position information, wherein each text box contains a line of text;
dividing a text box into different row and column positions of the table by the overlapping proportion of the text box in the horizontal direction and the vertical direction, and dividing each cell of the table based on the different row and column positions;
and filling corresponding texts corresponding to the text content information into each cell of the table.
In addition, in order to achieve the above object, the present application further provides a PDF document table extraction device including:
the processing module is used for acquiring a PDF document to be identified and processing the PDF document to be identified, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and the processing of the PDF document comprises the conversion of the PDF document capable of extracting text content into the PDF document of the picture type;
the identification module is used for preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, inputting the feature map into an RPN area candidate network and determining a form area in the processed PDF document;
the positioning module is used for preprocessing the table area and extracting features based on an OCR text recognition technology to obtain a feature picture of the table area, detecting the characters of the feature picture, determining a text area in the table area, recognizing the characters of the text area, and determining text information in the table area, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates;
and the filling module is used for determining the structural information of the form according to the text coordinate information, dividing each cell of the form based on the structural information, and filling the text corresponding to the text content information into each corresponding cell of the form.
Optionally, the PDF document table extracting device further includes:
the conversion module is used for obtaining a PDF document sample to be trained and converting the PDF document sample to be trained to obtain a sample picture;
the marking module is used for acquiring marking information corresponding to the PDF document sample to be trained and marking the table position in the sample picture based on the marking information;
the training module is used for training the preset initial model based on the marked sample picture to obtain a form identification model;
and the storage module is used for storing the table identification model.
Optionally, the training module includes:
the preprocessing unit is used for preprocessing the marked sample picture, wherein the preprocessing process comprises mean value removal, normalization and whitening;
the extraction unit is used for inputting the preprocessed sample picture into a preset convolutional neural network to obtain a feature map of the marked sample picture;
the detection unit is used for inputting the feature map into an RPN region candidate network and detecting the position of a table region in the marked sample picture;
and the confirmation unit is used for obtaining a form identification model when the current detection is determined to reach the convergence condition.
Optionally, the filling module includes:
a generation unit, configured to generate text boxes based on the text position information, where each text box contains a line of text;
a dividing unit for dividing the text box into different row and column positions of the table by the overlapping proportion of the text box in the horizontal direction and the vertical direction, and dividing each cell of the table based on the different row and column positions;
and the filling unit is used for filling corresponding texts corresponding to the text content information in each cell of the form.
In addition, in order to achieve the above object, the present application further provides a PDF document table extraction apparatus, which includes an input-output unit, a memory and a processor, the memory storing computer-readable instructions that, when executed by the processor, cause the processor to implement the steps of the PDF document table extraction method described above.
In addition, in order to achieve the above object, the present application also provides a computer-readable storage medium having stored thereon a PDF document table extraction program which, when executed by a processor, implements the steps of the PDF document table extraction method described above.
The PDF document table extraction method comprises the steps of firstly obtaining a PDF document to be identified, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and processing the PDF document, i.e. converting the PDF document capable of extracting text content into a PDF document of the picture type; preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network to obtain a feature map of the processed PDF document, and inputting the feature map into an RPN region candidate network, so as to determine a table region in the processed PDF document; preprocessing the table region and extracting features based on OCR character recognition and positioning to obtain a feature picture of the table region, performing text detection on the feature picture to determine a text region, and determining text information in the table region based on a character recognition algorithm; and finally, determining the structural information of the table through the text coordinate information, dividing the cells of the table based on the structural information, and filling the text corresponding to the text content information into the corresponding cells of the table, thereby completing the table content extraction in the PDF document. The PDF document table extraction method provided by the application can accurately position the table and extract the content in the table for PDF documents of different formats, thereby enlarging the application range of PDF document table extraction and improving its accuracy.
Drawings
FIG. 1 is a schematic diagram of a PDF document table extraction device in a hardware running environment according to an embodiment of the present application;
FIG. 2 is a flowchart of an embodiment of a method for extracting a PDF document table;
FIG. 3 is a schematic diagram of a functional module of an embodiment of a PDF document table extraction device according to the present application;
FIG. 4 is a schematic diagram of a functional module of another embodiment of a PDF document form extraction device according to the present application;
FIG. 5 is a schematic diagram of functional units of the training module 70 in another embodiment of the PDF document table extracting device of the present application;
FIG. 6 is a schematic diagram of functional units of a filling module 40 in another embodiment of the PDF document table extraction device of the present application;
fig. 7 is a schematic diagram of a text box in an embodiment of a PDF document table extraction method of the present application.
The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments, referring to the attached drawings.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a PDF document table extraction device of a hardware running environment according to an embodiment of the present application.
The PDF document table extraction device in the embodiment of the present application may be a terminal device with data processing capability, such as a portable computer, a server, or the like.
As shown in fig. 1, the PDF document table extraction apparatus may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as disk storage. The memory 1005 may optionally also be a storage device separate from the aforementioned processor 1001.
It will be appreciated by those skilled in the art that the PDF document table extraction device structure shown in fig. 1 does not constitute a limitation of the PDF document table extraction device, which may include more or fewer components than illustrated, combine certain components, or have a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a PDF document table extraction program may be included in the memory 1005 as one type of computer storage medium.
In the PDF document table extraction apparatus shown in fig. 1, the network interface 1004 is mainly used to connect to a background server, and perform data communication with the background server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be used to call a PDF document table extraction program stored in the memory 1005 and perform the operations of the following embodiments of the PDF document table extraction method.
Referring to fig. 2, fig. 2 is a flowchart illustrating an embodiment of a PDF document table extraction method, where the PDF document table extraction method includes:
step S10, a PDF document to be identified is obtained and is processed, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and the PDF document is processed by converting the PDF document capable of extracting text content into the PDF document of the picture type.
In this embodiment, the PDF document to be identified may be in multiple formats; for example, it may be a PDF document from which text content can be extracted, or a picture-type PDF document. After the PDF document to be identified is obtained, it is first processed; specifically, a PDF document from which text content can be extracted is converted into pictures, while a PDF document that is already of the picture type does not need to be processed.
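For illustration, this page-to-picture conversion can be sketched as follows; the use of the PyMuPDF library (imported as fitz) and the DPI value are assumptions for this sketch, not part of the embodiment.

```python
import fitz  # PyMuPDF; the library choice is an assumption for illustration


def pdf_to_images(pdf_path, dpi=200):
    """Render every page of a PDF (text-extractable or scanned) to a PNG picture."""
    image_paths = []
    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            # Rasterize the page; a text-extractable PDF thus becomes picture-type input
            pix = page.get_pixmap(dpi=dpi)
            out_path = f"page_{page_index + 1}.png"
            pix.save(out_path)
            image_paths.append(out_path)
    return image_paths
```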
Step S20, preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, inputting the feature map into an RPN region candidate network, and determining a table region in the processed PDF document.
Further, the converted picture is input into a preset form recognition model, so that the form recognition model recognizes a form region in the picture.
It can be understood that the preset form identification model is obtained by training in advance, and the preset form identification model is trained through the PDF document sample to be trained so as to improve the accuracy of form area identification.
Specifically, the process of performing table area recognition on the processed PDF document by the preset table recognition model is as follows:
Firstly, the picture obtained by converting the PDF document is preprocessed, the preprocessing steps including mean value removal, normalization and whitening, the aim of the preprocessing being to strengthen the features of the picture; further, the preprocessed picture is input into a preset convolutional neural network to extract a feature map; finally, the obtained feature map is input into an RPN (Region Proposal Network) to identify and locate the table region.
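One way to prototype this detection step is shown below; the use of torchvision's generic Faster R-CNN constructor and the score threshold are illustrative assumptions standing in for the preset table recognition model, which in this embodiment is obtained by training as described later.

```python
import torch
import torchvision
from torchvision.transforms import functional as F
from PIL import Image

# Illustrative only: an untrained, generic Faster R-CNN detector stands in for the
# preset table recognition model (two classes: background and table).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=2)
model.eval()


def detect_table_regions(image_path, score_threshold=0.7):
    """Return candidate table bounding boxes [x1, y1, x2, y2] for one page picture."""
    image = Image.open(image_path).convert("RGB")
    tensor = F.to_tensor(image)  # scales pixels to [0, 1]; further preprocessing is model-specific
    with torch.no_grad():
        # backbone CNN -> feature map -> RPN region proposals -> final detections
        prediction = model([tensor])[0]
    keep = prediction["scores"] >= score_threshold
    return prediction["boxes"][keep].tolist()
```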
Step S30, preprocessing and feature extraction are carried out on the table area based on the OCR text recognition technology, feature pictures of the table area are obtained, text detection is carried out on the feature pictures, text areas in the table area are determined, text recognition is carried out on the text areas, text information in the table area is determined, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates.
Further, after the table area in the picture is identified, the characters in the table area are positioned and identified by utilizing an OCR character recognition technology, and text position information and text content information in the table area are determined.
Specifically, the OCR character recognition process is as follows: firstly, the picture containing the table area is preprocessed, the aim of the preprocessing being to reduce useless information in the picture and prepare the picture containing characters for subsequent feature extraction; further, the text parts in the preprocessed picture are detected, where the text regions in the picture can be framed with a common image detection algorithm, which is not detailed here; finally, the text in the detected text regions is recognized through a text recognition algorithm, and the specific text content information is determined.
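A minimal sketch of this locate-and-recognize step; the embodiment does not prescribe a particular OCR engine, so the use of Tesseract via pytesseract and the language packs below are assumptions. Words detected on the same line are merged into one line-level text box.

```python
import pytesseract
from PIL import Image


def ocr_table_region(region_image_path, lang="chi_sim+eng"):
    """Return (text, (x1, y1, x2, y2)) pairs, one per recognized line in the table region."""
    image = Image.open(region_image_path).convert("L")  # simple preprocessing: grayscale
    data = pytesseract.image_to_data(image, lang=lang,
                                     output_type=pytesseract.Output.DICT)
    lines = {}
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        x1, y1 = data["left"][i], data["top"][i]
        x2, y2 = x1 + data["width"][i], y1 + data["height"][i]
        text, box = lines.get(key, ("", (x1, y1, x2, y2)))
        # Concatenate words of the same line and grow the line's bounding box
        lines[key] = (text + word, (min(box[0], x1), min(box[1], y1),
                                    max(box[2], x2), max(box[3], y2)))
    return list(lines.values())
```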
Step S40, the structure information of the table is determined according to the text coordinate information, each cell of the table is divided based on the structure information, and the text corresponding to the text content information is filled into each corresponding cell of the table.
It will be appreciated that specific text location information may be obtained during text recognition and location within the form area, and that the text location information may be represented by the top left and bottom right coordinates of each line of text. Therefore, through the upper left corner coordinate and the lower right corner coordinate of each line of text, the corresponding row and column position of each line of text in the table can be determined, and the structural information of the table can be determined. After the structural information of the table is determined, the text content information corresponding to the text position information is filled in the corresponding row and column positions.
In this embodiment, firstly, a PDF document to be identified is obtained, the PDF document to be identified comprising a PDF document capable of extracting text content and a PDF document of a picture type, the PDF document is processed, and the PDF document capable of extracting text content is converted into a PDF document of the picture type; the processed PDF document is preprocessed and input into a preset convolutional neural network to obtain a feature map of the processed PDF document, and the feature map is input into an RPN region candidate network, so as to determine a table region in the processed PDF document; the table region is preprocessed and its features extracted based on OCR character recognition and positioning to obtain a feature picture of the table region, text detection is performed on the feature picture to determine a text region, and text information in the table region is finally determined based on a character recognition algorithm; finally, the structural information of the table is determined through the text coordinate information, the cells of the table are divided based on the structural information, and the text corresponding to the text content information is filled into the corresponding cells of the table, thereby completing the table content extraction in the PDF document. The PDF document table extraction method provided by the application can accurately position the table and extract the content in the table for PDF documents of different formats, thereby enlarging the application range of PDF document table extraction and improving its accuracy.
Further, before step S20, the method further includes:
step S50, a PDF document sample to be trained is obtained, and the PDF document sample to be trained is converted to obtain a sample picture;
step S60, marking information corresponding to a PDF document sample to be trained is obtained, and the table position in a sample picture is marked based on the marking information;
step S70, training a preset initial model based on the marked sample picture to obtain a form identification model;
step S80, save the form identification model.
In this embodiment, a preset table recognition model is trained through PDF document samples, so that the trained table recognition model can be used to perform table region recognition on the PDF document to be recognized. Specifically, the PDF document samples to be trained may include both PDF documents from which text content can be extracted and picture-type PDF documents. If a PDF document sample to be trained is a PDF document from which text content can be extracted, it is converted page by page into sample pictures; if the PDF document to be trained is a picture-type PDF document, no processing is needed.
Further, labeling information corresponding to the PDF document sample to be trained is obtained so as to label the converted sample picture; specifically, the table position in the sample picture is labeled, and the labeling information can be represented in the form of coordinates. For example, a rectangular frame is used to select the table region in the sample picture, and the rectangular frame is then represented by its upper left corner coordinates and lower right corner coordinates, that is, the table position in the sample picture is represented by the upper left corner coordinates and the lower right corner coordinates of the rectangular frame.
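As an illustration only, one possible shape of such a labeling record; the field names and values are hypothetical and not prescribed by the embodiment.

```python
# Hypothetical annotation record for one sample picture: each table position is
# expressed by the top-left and bottom-right corners of its bounding rectangle.
annotation = {
    "image": "sample_page_003.png",
    "tables": [
        {"top_left": [120, 340], "bottom_right": [1460, 980]},
    ],
}
```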
In this embodiment, the algorithm used for training the preset table recognition model is the Faster R-CNN algorithm, and the training process is as follows:
firstly, preprocessing is carried out on an input sample picture, and the aim of the preprocessing is to strengthen the characteristics of the sample picture. Specifically, the preprocessing process sequentially comprises mean value removal, normalization and whitening, wherein the purpose of mean value removal is to center each dimension in input sample data to 0, namely, the center in the sample data is pulled back to the origin of a coordinate system; the normalization aims to normalize the amplitude of the sample data to the same range and reduce the interference caused by the difference of the value ranges of the data in each dimension; whitening is the normalization of the amplitude on each characteristic axis of the sample data. Through the steps, the sample picture obtained by converting the PDF document to be trained is subjected to characteristic reinforcement.
Further, the preprocessed sample picture is input into a preset convolutional neural network to extract the feature map. Specifically, in this embodiment, the preset convolutional neural network includes 13 convolutional layers, 13 excitation layers and 4 pooling layers. The convolution kernel of the convolutional layers is 3×3 with a padding of 1, the padding ensuring that the convolutional layers do not change the size between input and output; the kernel of the pooling layers is 2×2 and the stride is 2×2. The preprocessed sample picture is subjected to convolution, excitation, pooling and other operations to obtain a feature vector, which represents the vector information of the feature map corresponding to the sample picture.
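The layer counts above (13 convolutions with 3×3 kernels and padding 1, 13 excitation layers, 4 pooling layers with 2×2 kernels and stride 2) correspond to a VGG-16-style feature extractor; a PyTorch sketch under that assumption, with the channel widths chosen for illustration:

```python
import torch.nn as nn

# Assumed VGG-16-style layout: 13 conv layers (3x3, padding 1), 13 ReLU layers,
# 4 max-pooling layers (2x2, stride 2). "M" marks a pooling layer.
cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M", 512, 512, 512, "M", 512, 512, 512]


def build_backbone():
    layers, in_channels = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers.append(nn.Conv2d(in_channels, v, kernel_size=3, padding=1))  # size-preserving
            layers.append(nn.ReLU(inplace=True))
            in_channels = v
    return nn.Sequential(*layers)
```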
Further, the obtained feature map is input into the RPN to identify and locate the table area. First, a 3×3 convolution is performed on the feature map to obtain a 256-dimensional feature; then, 9 candidate windows are computed for each position of the 256-dimensional feature through scale transformation, and the candidate windows are classified as foreground or background based on a softmax function; meanwhile, bounding-box regression offsets are calculated for the candidate windows, so as to preliminarily determine the table region; finally, target regions are obtained based on the foreground candidate windows and the bounding-box regression offsets, and regions that are too small or exceed the boundary are eliminated, leaving the final target region, namely the finally identified table region.
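A sketch of the RPN head just described; the channel counts follow the backbone sketch above, while anchor generation, scale transformation and proposal filtering are omitted, so this is an assumption-laden fragment rather than the trained network of the embodiment.

```python
import torch
import torch.nn as nn


class RPNHead(nn.Module):
    """Region Proposal Network head: 3x3 conv -> 256-d features, then per-anchor
    foreground/background scores and bounding-box regression offsets (9 anchors per position)."""

    def __init__(self, in_channels=512, mid_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.cls_score = nn.Conv2d(mid_channels, num_anchors * 2, kernel_size=1)  # fg / bg
        self.bbox_pred = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)  # box offsets

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        # softmax over the foreground/background dimension for every anchor at every position
        scores = torch.softmax(self.cls_score(x).view(x.size(0), 2, -1), dim=1)
        offsets = self.bbox_pred(x)
        return scores, offsets
```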
In this embodiment, since every sample picture converted from a PDF document sample to be trained carries labeling information, namely the position information of the table region contained in the sample picture, the sample picture is input into the preset table recognition model to identify and locate its table region, and the output result of the preset table recognition model is compared with the labeling information; if they are inconsistent, the parameters of the preset table recognition model are readjusted until the output result is consistent with the labeling information, at which point the preset table recognition model is determined to be trained.
Further, after the training is confirmed, a preset form recognition model is stored, so that the form region recognition is carried out on the PDF document to be recognized by using the trained form recognition model.
Further, step S40 includes:
step S401, generating text boxes based on the text position information, wherein each text box contains a line of text;
step S402, dividing the text box into different row and column positions of the table by the overlapping proportion of the text box in the horizontal direction and the vertical direction, and dividing each cell of the table based on the different row and column positions;
step S403, the text corresponding to the corresponding text content information is filled into each cell of the table.
In this embodiment, in the process of text region detection based on the OCR character recognition technology, specific text position information can be obtained, and the text position information can be represented by the upper left corner coordinates and the lower right corner coordinates of each line of text; as shown in fig. 7, the text information framed and located with a rectangular box is denoted as a text box.
The text boxes can be divided into different row and column positions of the table by the overlapping proportion of the text boxes in the horizontal and vertical directions. Specifically, text boxes in the same row have the same coordinate range in the vertical direction, while their coordinate ranges in the horizontal direction do not intersect; text boxes in the same column have the same coordinate range in the horizontal direction, while their coordinate ranges in the vertical direction do not intersect. Therefore, by analyzing the coordinate information of the text boxes, the structural information of the table can be determined. For example, the structural information of the table may be 3×5, indicating that the table has 3 rows and 5 columns. If a text spans several columns, as shown in fig. 7, the vertical grid lines within that cell are removed.
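A minimal sketch of this grouping, assuming each text box is an (x1, y1, x2, y2) tuple; grouping here uses simple interval intersection along one axis, which is an illustrative simplification of the overlapping-proportion test described above.

```python
def group_indices(boxes, axis):
    """Group text boxes whose coordinate ranges overlap along one axis (0 = rows, 1 = columns).

    Boxes in the same row overlap vertically but not horizontally; boxes in the
    same column overlap horizontally but not vertically.
    """
    lo, hi = (1, 3) if axis == 0 else (0, 2)  # rows compare y-ranges, columns compare x-ranges
    groups = []
    for i, box in enumerate(boxes):
        for group in groups:
            ref = boxes[group[0]]
            if min(box[hi], ref[hi]) > max(box[lo], ref[lo]):  # intervals intersect
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


def table_structure(text_boxes):
    """Return (row_count, column_count) inferred from the text box coordinates."""
    return len(group_indices(text_boxes, axis=0)), len(group_indices(text_boxes, axis=1))
```

Under this sketch, two boxes side by side on one line plus one box below the first would yield a structure of 2 rows and 2 columns.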
After the structural information of the table is determined, corresponding text contents are correspondingly filled into each cell of the table according to the identified text content information, and then the table content extraction in the PDF document can be completed.
Referring to fig. 3, fig. 3 is a schematic functional block diagram of an embodiment of a PDF document table extracting apparatus of the present application.
In this embodiment, the PDF document table extraction device includes:
the processing module 10 is configured to obtain a PDF document to be identified, and process the PDF document to be identified, where the PDF document to be identified includes a PDF document capable of extracting text content and a PDF document of a picture type, and processing the PDF document includes converting the PDF document capable of extracting text content into a PDF document of the picture type;
the identifying module 20 is configured to pre-process the processed PDF document, input the pre-processed PDF document into a preset convolutional neural network, output a feature map of the processed PDF document based on the preset convolutional neural network, input the feature map into an RPN region candidate network, and determine a table region in the processed PDF document;
the positioning module 30 is configured to perform preprocessing and feature extraction on the table area based on an OCR text recognition technology, obtain a feature picture of the table area, perform text detection on the feature picture, determine a text area in the table area, perform text recognition on the text area, and determine text information in the table area, where the text information includes text position information and text content information, and the text position information is represented by coordinates;
and a filling module 40, configured to determine structural information of the form according to the text coordinate information, divide each cell of the form based on the structural information, and fill text corresponding to the text content information into each corresponding cell of the form.
Further, referring to fig. 4, the PDF document table extraction apparatus further includes:
the conversion module 50 is used for obtaining a PDF document sample to be trained and converting the PDF document sample to be trained to obtain a sample picture;
the labeling module 60 is configured to obtain labeling information corresponding to the PDF document sample to be trained, and label a table position in the sample picture based on the labeling information;
the training module 70 is configured to train the preset initial model based on the labeled sample picture to obtain a form recognition model;
a saving module 80, configured to save the table identification model.
Further, referring to fig. 5, the training module 70 includes:
the preprocessing unit 701 is configured to perform preprocessing on the marked sample picture, where the preprocessing includes mean removal, normalization and whitening;
the extracting unit 702 is configured to input the preprocessed sample picture into a preset convolutional neural network, so as to obtain a feature map of the labeled sample picture;
a detection unit 703, configured to input the feature map into an RPN region candidate network, and detect a position of a table region in the labeled sample picture;
and a confirmation unit 704, configured to obtain a table identification model when it is determined that the current detection reaches the convergence condition.
Further, referring to fig. 6, the filling module 40 includes:
a generating unit 401, configured to generate text boxes based on the text position information, where each text box contains a line of text;
a dividing unit 402, configured to divide a text box into different row and column positions of the table by overlapping proportions of the text box in a horizontal direction and a vertical direction, and divide each cell of the table based on the different row and column positions;
and a filling unit 403, configured to fill in a text corresponding to the text content information in each cell of the form.
The specific embodiments of the PDF document table extracting device of the present application are substantially the same as the embodiments of the PDF document table extracting method described above, and are not described herein.
In addition, the embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a PDF document table extraction program, and the PDF document table extraction program realizes the steps of the PDF document table extraction method when being executed by a processor.
The specific embodiments of the computer readable storage medium are basically the same as the embodiments of the PDF document table extraction method described above, and are not described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are for description only and do not represent the relative merits of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) as described above, including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (10)

1. The PDF document table extraction method is characterized by comprising the following steps of:
acquiring a PDF document to be identified, and processing the PDF document to be identified, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and processing the PDF document comprises converting the PDF document capable of extracting text content into a PDF document of the picture type;
preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a characteristic diagram of the processed PDF document based on the preset convolutional neural network, inputting the characteristic diagram into an RPN region candidate network, and determining a table region in the processed PDF document;
preprocessing the table area based on an OCR text recognition technology, extracting features to obtain a feature picture of the table area, detecting the features of the feature picture, determining a text area in the table area, recognizing the text area, and determining text information in the table area, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates;
and determining the structural information of the table according to the text coordinate information, dividing each cell of the table based on the structural information, and filling the text corresponding to the text content information into each corresponding cell of the table.
2. The PDF document table extraction method of claim 1 wherein the preprocessing the processed PDF document and inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, and inputting the feature map into an RPN region candidate network, and before determining a table region in the processed PDF document, further comprising:
acquiring a PDF document sample to be trained, and converting the PDF document sample to be trained to obtain a sample picture;
acquiring marking information corresponding to the PDF document sample to be trained, and marking the table position in the sample picture based on the marking information;
training a preset initial model based on the marked sample picture to obtain a form identification model;
and saving the table identification model.
3. The PDF document table extraction method of claim 2, wherein training the preset initial model based on the annotated sample picture to obtain the table identification model includes:
preprocessing the marked sample picture, wherein the preprocessing process comprises mean value removal, normalization and whitening;
inputting the preprocessed sample picture into a preset convolutional neural network to obtain a feature map of the marked sample picture;
inputting the feature map into an RPN region candidate network, and detecting the position of a table region in the marked sample picture;
and when the current detection is determined to reach the convergence condition, obtaining a form identification model.
4. The PDF document table extraction method of claim 1, wherein the determining structural information of the table from the text coordinate information, dividing cells of the table based on the structural information, and filling text corresponding to the text content information into corresponding cells of the table includes:
generating text boxes based on the text position information, wherein each text box contains a line of text;
dividing a text box into different row and column positions of the table by the overlapping proportion of the text box in the horizontal direction and the vertical direction, and dividing each cell of the table based on the different row and column positions;
and filling corresponding texts corresponding to the text content information into each cell of the table.
5. A PDF document table extraction device, characterized in that the PDF document table extraction device includes:
the processing module is used for acquiring a PDF document to be identified and processing the PDF document to be identified, wherein the PDF document to be identified comprises a PDF document capable of extracting text content and a PDF document of a picture type, and the processing of the PDF document comprises the conversion of the PDF document capable of extracting text content into the PDF document of the picture type;
the identification module is used for preprocessing the processed PDF document, inputting the preprocessed PDF document into a preset convolutional neural network, outputting a feature map of the processed PDF document based on the preset convolutional neural network, inputting the feature map into an RPN area candidate network and determining a form area in the processed PDF document;
the positioning module is used for preprocessing the table area and extracting features based on an OCR text recognition technology to obtain a feature picture of the table area, detecting the characters of the feature picture, determining a text area in the table area, recognizing the characters of the text area, and determining text information in the table area, wherein the text information comprises text position information and text content information, and the text position information is represented by coordinates;
and the filling module is used for determining the structural information of the form according to the text coordinate information, dividing each cell of the form based on the structural information, and filling the text corresponding to the text content information into each corresponding cell of the form.
6. The PDF document table extraction device of claim 5, wherein said PDF document table extraction device further comprises:
the conversion module is used for obtaining a PDF document sample to be trained and converting the PDF document sample to be trained to obtain a sample picture;
the marking module is used for acquiring marking information corresponding to the PDF document sample to be trained and marking the table position in the sample picture based on the marking information;
the training module is used for training the preset initial model based on the marked sample picture to obtain a form identification model;
and the storage module is used for storing the table identification model.
7. The PDF document table extraction device of claim 6, wherein the training module comprises:
the preprocessing unit is used for preprocessing the marked sample picture, wherein the preprocessing process comprises mean value removal, normalization and whitening;
the extraction unit is used for inputting the preprocessed sample picture into a preset convolutional neural network to obtain a feature map of the marked sample picture;
the detection unit is used for inputting the feature map into an RPN region candidate network and detecting the position of a table region in the marked sample picture;
and the confirmation unit is used for obtaining a form identification model when the current detection is determined to reach the convergence condition.
8. The PDF document form extraction device of claim 5, wherein said filling module includes:
a generation unit, configured to generate text boxes based on the text position information, where each text box contains a line of text;
a dividing unit for dividing the text box into different row and column positions of the table by the overlapping proportion of the text box in the horizontal direction and the vertical direction, and dividing each cell of the table based on the different row and column positions;
and the filling unit is used for filling corresponding texts corresponding to the text content information in each cell of the form.
9. A PDF document table extraction device comprising an input output unit, a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the PDF document table extraction method of any of claims 1 to 4.
10. A computer-readable storage medium, wherein a PDF document table extraction program is stored on the computer-readable storage medium, which when executed by a processor, implements the steps of the PDF document table extraction method of any one of claims 1 to 4.
CN201910560432.0A 2019-06-26 2019-06-26 PDF document table extraction method, device, equipment and computer readable storage medium Active CN110390269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910560432.0A CN110390269B (en) 2019-06-26 2019-06-26 PDF document table extraction method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910560432.0A CN110390269B (en) 2019-06-26 2019-06-26 PDF document table extraction method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110390269A CN110390269A (en) 2019-10-29
CN110390269B (en) 2023-08-01

Family

ID=68285644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910560432.0A Active CN110390269B (en) 2019-06-26 2019-06-26 PDF document table extraction method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110390269B (en)

Families Citing this family (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795919B (en) * 2019-11-07 2023-10-31 达观数据有限公司 Form extraction method, device, equipment and medium in PDF document
CN111104871B (en) * 2019-11-28 2023-11-07 北京明略软件系统有限公司 Form region identification model generation method and device and form positioning method and device
CN111241365B (en) * 2019-12-23 2023-06-30 望海康信(北京)科技股份公司 Table picture analysis method and system
CN111144282B (en) * 2019-12-25 2023-12-05 北京同邦卓益科技有限公司 Form recognition method and apparatus, and computer-readable storage medium
CN111259830A (en) * 2020-01-19 2020-06-09 中国农业科学院农业信息研究所 Method and system for fragmenting PDF document contents in overseas agriculture
CN111368744B (en) * 2020-03-05 2023-06-27 中国工商银行股份有限公司 Method and device for identifying unstructured table in picture
CN111382717B (en) * 2020-03-17 2022-09-09 腾讯科技(深圳)有限公司 Table identification method and device and computer readable storage medium
CN111340000A (en) * 2020-03-23 2020-06-26 深圳智能思创科技有限公司 Method and system for extracting and optimizing PDF document table
CN111523292B (en) * 2020-04-23 2023-09-15 北京百度网讯科技有限公司 Method and device for acquiring image information
CN111597943B (en) * 2020-05-08 2021-09-03 杭州火石数智科技有限公司 Table structure identification method based on graph neural network
CN111626027B (en) * 2020-05-20 2023-03-24 北京百度网讯科技有限公司 Table structure restoration method, device, equipment, system and readable storage medium
CN111680491B (en) * 2020-05-27 2024-02-02 北京字跳网络技术有限公司 Method and device for extracting document information and electronic equipment
CN111832403A (en) * 2020-06-04 2020-10-27 北京百度网讯科技有限公司 Document structure recognition method, and model training method and device for document structure recognition
CN111695553B (en) * 2020-06-05 2023-09-08 北京百度网讯科技有限公司 Form identification method, device, equipment and medium
CN111898411B (en) * 2020-06-16 2021-08-31 华南理工大学 Text image labeling system, method, computer device and storage medium
CN111709956B (en) * 2020-06-19 2024-01-12 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and readable storage medium
CN111782839B (en) * 2020-06-30 2023-08-22 北京百度网讯科技有限公司 Image question-answering method, device, computer equipment and medium
CN111814443A (en) * 2020-07-21 2020-10-23 北京来也网络科技有限公司 Table generation method and device combining RPA and AI, computing equipment and storage medium
CN114077830A (en) * 2020-08-17 2022-02-22 税友软件集团股份有限公司 Method, device and equipment for analyzing PDF table document based on position
CN111914805A (en) * 2020-08-18 2020-11-10 科大讯飞股份有限公司 Table structuring method and device, electronic equipment and storage medium
CN112149506A (en) * 2020-08-25 2020-12-29 北京来也网络科技有限公司 Table generation method, apparatus and storage medium in image combining RPA and AI
CN111985459B (en) * 2020-09-18 2023-07-28 北京百度网讯科技有限公司 Table image correction method, apparatus, electronic device and storage medium
CN112115865B (en) * 2020-09-18 2024-04-12 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing image
CN112232198A (en) * 2020-10-15 2021-01-15 北京来也网络科技有限公司 Table content extraction method, device, equipment and medium based on RPA and AI
CN112329548A (en) * 2020-10-16 2021-02-05 北京临近空间飞行器系统工程研究所 Document chapter segmentation method and device and storage medium
CN112364790B (en) * 2020-11-16 2022-10-25 中国民航大学 Airport work order information identification method and system based on convolutional neural network
CN112241730A (en) * 2020-11-21 2021-01-19 杭州投知信息技术有限公司 Form extraction method and system based on machine learning
CN113807158A (en) * 2020-12-04 2021-12-17 四川医枢科技股份有限公司 PDF content extraction method, device and equipment
CN112597773B (en) * 2020-12-08 2022-12-13 上海深杳智能科技有限公司 Document structuring method, system, terminal and medium
CN112434496B (en) * 2020-12-11 2021-06-22 深圳司南数据服务有限公司 Method and terminal for identifying form data of bulletin document
CN112633116B (en) * 2020-12-17 2024-02-02 西安理工大学 Method for intelligently analyzing PDF graphics context
CN112560417A (en) * 2020-12-24 2021-03-26 万兴科技集团股份有限公司 Table editing method and device, computer equipment and storage medium
CN112560767A (en) * 2020-12-24 2021-03-26 南方电网深圳数字电网研究院有限公司 Document signature identification method and device and computer readable storage medium
CN112686223B (en) * 2021-03-12 2021-06-18 腾讯科技(深圳)有限公司 Table identification method and device and computer readable storage medium
CN113221711A (en) * 2021-04-30 2021-08-06 北京金山数字娱乐科技有限公司 Information extraction method and device
CN113177541B (en) * 2021-05-17 2023-12-19 上海云扩信息科技有限公司 Method for extracting text content in PDF document and picture by computer program
CN113255583B (en) * 2021-06-21 2023-02-03 中国平安人寿保险股份有限公司 Data annotation method and device, computer equipment and storage medium
CN113536951B (en) * 2021-06-22 2023-11-24 科大讯飞股份有限公司 Form identification method, related device, electronic equipment and storage medium
CN113269153B (en) * 2021-06-26 2024-03-19 中国电子系统技术有限公司 Form identification method and device
CN113591746A (en) * 2021-08-05 2021-11-02 上海金仕达软件科技有限公司 Document table structure detection method and device
CN113657274B (en) * 2021-08-17 2022-09-20 北京百度网讯科技有限公司 Table generation method and device, electronic equipment and storage medium
CN113626444B (en) * 2021-08-26 2023-11-28 平安国际智慧城市科技股份有限公司 Table query method, device, equipment and medium based on bitmap algorithm
CN113505762B (en) * 2021-09-09 2021-11-30 冠传网络科技(南京)有限公司 Table identification method and device, terminal and storage medium
CN113989823B (en) * 2021-09-14 2022-10-18 北京左医科技有限公司 Image table restoration method and system based on OCR coordinates
CN113963367B (en) * 2021-10-22 2024-05-28 深圳前海环融联易信息科技服务有限公司 Model-based financial transaction file and money extraction method
CN114022883A (en) * 2021-11-05 2022-02-08 深圳前海环融联易信息科技服务有限公司 Financial field transaction file form date extraction method based on model
CN114218233A (en) * 2022-02-22 2022-03-22 子长科技(北京)有限公司 Annual newspaper processing method and device, electronic equipment and storage medium
CN114220103B (en) * 2022-02-22 2022-05-06 成都明途科技有限公司 Image recognition method, device, equipment and computer readable storage medium
CN114783584A (en) * 2022-03-09 2022-07-22 广州方舟信息科技有限公司 Method and device for recording drug delivery receipt
CN114637845B (en) * 2022-03-11 2023-04-14 上海弘玑信息技术有限公司 Model testing method, device, equipment and storage medium
CN114821612B (en) * 2022-05-30 2023-04-07 浙商期货有限公司 Method and system for extracting information of PDF document in securities future scene
CN115082941A (en) * 2022-08-23 2022-09-20 平安银行股份有限公司 Form information acquisition method and device for form document image
CN115331245B (en) * 2022-10-12 2023-02-03 中南民族大学 Table structure identification method based on image instance segmentation
CN115527227A (en) * 2022-10-13 2022-12-27 澎湃数智(北京)科技有限公司 Character recognition method and device, storage medium and electronic equipment
CN115577688B (en) * 2022-12-09 2023-04-28 深圳智能思创科技有限公司 Table structuring processing method, device, storage medium and apparatus
CN115713775B (en) * 2023-01-05 2023-04-25 达而观信息科技(上海)有限公司 Method, system and computer equipment for extracting form from document
CN116562251A (en) * 2023-05-19 2023-08-08 中国矿业大学(北京) Form classification method for stock information disclosure long document
CN116861912B (en) * 2023-08-31 2023-12-05 合肥天帷信息安全技术有限公司 Deep learning-based form entity extraction method and system
CN117332761B (en) * 2023-11-30 2024-02-09 北京一标数字科技有限公司 PDF document intelligent identification marking system
CN117593752B (en) * 2024-01-18 2024-04-09 星云海数字科技股份有限公司 PDF document input method, PDF document input system, storage medium and electronic equipment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101976232A (en) * 2010-09-19 2011-02-16 深圳市万兴软件有限公司 Method for identifying data form in document and device thereof
WO2015009297A1 (en) * 2013-07-16 2015-01-22 Recommind, Inc. Systems and methods for extracting table information from documents
WO2019041527A1 (en) * 2017-08-31 2019-03-07 平安科技(深圳)有限公司 Method of extracting chart in document, electronic device and computer-readable storage medium
WO2019104879A1 (en) * 2017-11-30 2019-06-06 平安科技(深圳)有限公司 Information recognition method for form-type image, electronic device and readable storage medium
CN108416279A (en) * 2018-02-26 2018-08-17 阿博茨德(北京)科技有限公司 Form analysis method and device in file and picture
CN109670461A (en) * 2018-12-24 2019-04-23 广东亿迅科技有限公司 PDF text extraction method, device, computer equipment and storage medium
CN109685056A (en) * 2019-01-04 2019-04-26 达而观信息科技(上海)有限公司 Obtain the method and device of document information

Also Published As

Publication number Publication date
CN110390269A (en) 2019-10-29

Similar Documents

Publication Publication Date Title
CN110390269B (en) PDF document table extraction method, device, equipment and computer readable storage medium
CN110766014B (en) Bill information positioning method, system and computer readable storage medium
CN111476227B (en) Target field identification method and device based on OCR and storage medium
US20200311460A1 (en) Character identification method and device
CN110503054B (en) Text image processing method and device
CN109740515B (en) Evaluation method and device
CN112070076B (en) Text paragraph structure reduction method, device, equipment and computer storage medium
CN110197238B (en) Font type identification method, system and terminal equipment
CN110728687B (en) File image segmentation method and device, computer equipment and storage medium
CN113505762B (en) Table identification method and device, terminal and storage medium
CN111737478B (en) Text detection method, electronic device and computer readable medium
CN111814905A (en) Target detection method, target detection device, computer equipment and storage medium
CN111144372A (en) Vehicle detection method, device, computer equipment and storage medium
CN114663904A (en) PDF document layout detection method, device, equipment and medium
CN112215811A (en) Image detection method and device, electronic equipment and storage medium
CN114663897A (en) Table extraction method and table extraction system
US11386685B2 (en) Multiple channels of rasterized content for page decomposition using machine learning
CN111738252B (en) Text line detection method, device and computer system in image
US11906441B2 (en) Inspection apparatus, control method, and program
CN113673528B (en) Text processing method, text processing device, electronic equipment and readable storage medium
CN112580499A (en) Text recognition method, device, equipment and storage medium
CN113936137A (en) Method, system and storage medium for removing overlapping of image type text line detection areas
CN116311300A (en) Table generation method, apparatus, electronic device and storage medium
Naz et al. Challenges in baseline detection of cursive script languages
CN111291756B (en) Method and device for detecting text region in image, computer equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant