CN112069991A - PDF table information extraction method and related device - Google Patents

PDF table information extraction method and related device

Info

Publication number
CN112069991A
Authority
CN
China
Prior art keywords
array
pdf
rectangular
information
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010922836.2A
Other languages
Chinese (zh)
Inventor
余昊旻
张青龙
陈强
丁明
蒋坡良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Servyou Software Group Co ltd
Original Assignee
Servyou Software Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Servyou Software Group Co ltd
Priority to CN202010922836.2A
Publication of CN112069991A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Discrimination (AREA)

Abstract

The application discloses a method for extracting table information from a PDF, comprising: parsing the text of the PDF file to obtain the text and text position information; performing closed-contour recognition on the picture corresponding to the PDF file by means of an image recognition algorithm to obtain a rectangular contour array; and structuring the text according to the rectangular contour array and the text position information to obtain table information. Because the table contours are recognized from the picture corresponding to the PDF file and the table information is then spliced together according to those contours, the efficiency of table information extraction is improved and the extraction quality is ensured. The application also discloses a PDF table information extraction apparatus, a computing device, and a computer-readable storage medium, which have the same beneficial effects.

Description

PDF table information extraction method and related device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a computing apparatus, and a computer-readable storage medium for extracting table information of a PDF.
Background
With the continuous development of information technology, a variety of document file formats have emerged. Among them, the Portable Document Format (PDF) is a file format that presents documents independently of application software, hardware, and operating system.
According to the PDF specification, a page presented by a PDF is composed of vector graphics, bitmaps, text, and interactive elements. A table is likewise composed only of vector graphics, bitmaps, and text. Consequently, a PDF cannot be read row by row and column by column as directly as an Excel spreadsheet can.
In the prior art, solutions such as PDFBox, Tabula, and iText all read text data by parsing the file according to the PDF specification. However, because the PDF specification does not define a table, the content of a rendered table cannot be extracted directly. When a table is complex and hard to recognize, the recognition rate drops and the table content may not be extractable at all.
Therefore, how to improve the efficiency of extracting table information from a PDF is a key issue for those skilled in the art.
Disclosure of Invention
The purpose of the application is to provide a PDF table information extraction method, a PDF table information extraction apparatus, a computing device, and a computer-readable storage medium.
To solve the above technical problem, the present application provides a method for extracting table information from a PDF, comprising:
parsing the text of the PDF file to obtain the text and text position information;
performing closed-contour recognition on the picture corresponding to the PDF file by means of an image recognition algorithm to obtain a rectangular contour array;
and structuring the text according to the rectangular contour array and the text position information to obtain table information.
Optionally, parsing the text of the PDF file to obtain the text and text position information comprises:
parsing the text of the PDF file with a PDF parsing library to obtain the text and the text position information.
Optionally, performing closed-contour recognition on the picture corresponding to the PDF file by means of an image recognition algorithm to obtain a rectangular contour array comprises:
converting the PDF file into a picture;
binarizing the picture to obtain a black-and-white picture;
recognizing the black-and-white picture with a closed-contour algorithm to obtain rectangular areas;
and converting the rectangular areas into array form to obtain the rectangular contour array.
Optionally, structuring the text according to the rectangular contour array and the text position information to obtain table information comprises:
querying the rectangular contour array according to the text position information to determine the position of each piece of text within the rectangular contour array;
and splicing all the text according to those positions to obtain the table information.
The present application further provides an apparatus for extracting table information from a PDF, comprising:
a text parsing module, configured to parse the text of the PDF file to obtain the text and text position information;
a table contour recognition module, configured to perform closed-contour recognition on the picture corresponding to the PDF file by means of an image recognition algorithm to obtain a rectangular contour array;
and a table structuring module, configured to structure the text according to the rectangular contour array and the text position information to obtain table information.
Optionally, the text parsing module is specifically configured to parse the text of the PDF file with a PDF parsing library to obtain the text and the text position information.
Optionally, the table contour recognition module comprises:
a PDF conversion unit, configured to convert the PDF file into a picture;
a binarization unit, configured to binarize the picture to obtain a black-and-white picture;
a closed-contour recognition unit, configured to recognize the black-and-white picture with a closed-contour algorithm to obtain rectangular areas;
and an array conversion unit, configured to convert the rectangular areas into array form to obtain the rectangular contour array.
Optionally, the table structuring module comprises:
a position query unit, configured to query the rectangular contour array according to the text position information and determine the position of each piece of text within the rectangular contour array;
and a text splicing unit, configured to splice all the text according to those positions to obtain the table information.
The present application further provides a computing device, comprising:
a memory for storing a computer program;
and a processor for implementing the steps of the table information extraction method described above when executing the computer program.
The present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the table information extraction method described above.
The application provides a method for extracting table information from a PDF, comprising: parsing the text of the PDF file to obtain the text and text position information; performing closed-contour recognition on the picture corresponding to the PDF file by means of an image recognition algorithm to obtain a rectangular contour array; and structuring the text according to the rectangular contour array and the text position information to obtain table information.
The method first parses the text and text position information from the PDF file, then performs contour recognition on the picture corresponding to the PDF file by means of an image recognition algorithm to obtain a rectangular contour array, that is, the contour array of the table, and finally structures the text according to the rectangular contour array and the text position information to obtain the table information.
The present application further provides a PDF table information extraction apparatus, a computing device, and a computer-readable storage medium, which have the same beneficial effects; these are not repeated here.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present application; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of a method for extracting table information from a PDF according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a PDF table information extraction apparatus according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a PDF table information extraction method, a PDF table information extraction apparatus, a computing device, and a computer-readable storage medium, in which the table contours are recognized from the picture corresponding to the PDF file by image recognition and the table information is then spliced together according to those contours, so that the efficiency of table information extraction is improved and the extraction quality is ensured.
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present application.
In the prior art, solutions such as PDFBox, Tabula, and iText all read text data by parsing the file according to the PDF specification. However, because the PDF specification does not define a table, the content of a rendered table cannot be extracted directly. When a table is complex and hard to recognize, the recognition rate drops and the table content may not be extractable at all.
Therefore, the method provided by the application first parses the text and text position information from the PDF file, then performs contour recognition on the picture corresponding to the PDF file by means of an image recognition algorithm to obtain a rectangular contour array, that is, the contour array of the table, and finally structures the text according to the rectangular contour array and the text position information to obtain the table information.
Referring to Fig. 1, Fig. 1 is a flowchart of a method for extracting table information from a PDF according to an embodiment of the present application.
In this embodiment, the method may include:
S101: parsing the text of the PDF file to obtain the text and text position information;
This step parses the text of the PDF file to obtain the text and the text position information. That is, the text in the PDF file is parsed so that all of it is identified. It should be understood that this covers all text in the PDF file, including the text inside tables. The position information corresponding to each piece of text is obtained at the same time.
The text position information indicates where each piece of text lies within the PDF file. It can generally be expressed as coordinates in the rectangular coordinate system of the PDF file.
Optionally, this step may include:
parsing the text of the PDF file with a PDF parsing library to obtain the text and the text position information.
This alternative explains how the text parsing is performed: a PDF parsing library parses the text in the PDF file to obtain the text and the corresponding position information. The library may be any PDF parsing library available in the prior art and is not described further here.
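As an illustration of what the parsed text and its position information might look like, the following is a minimal Python sketch. The `TextItem` type, the `pdf_point_to_pixel` helper, and the sample values are hypothetical and not part of the application. The conversion assumes the default PDF user space (origin at the bottom-left of the page, units of 1/72 inch) with no page rotation or crop offset, and maps it to the top-left-origin pixel coordinates of a rendered picture so that text positions can later be compared with contours found in that picture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextItem:
    """One piece of parsed text and its position in image pixels (top-left origin)."""
    text: str
    x: float
    y: float


def pdf_point_to_pixel(x_pt, y_pt, page_height_pt, dpi=72.0):
    """Convert a PDF user-space point (bottom-left origin, 1/72 inch units)
    to pixel coordinates in a page picture rendered at `dpi` (top-left origin)."""
    scale = dpi / 72.0
    return x_pt * scale, (page_height_pt - y_pt) * scale


# Hypothetical parser output: (text, x, y) in PDF points on a page 842 pt tall.
parsed = [("Name", 72.0, 770.0), ("Amount", 200.0, 770.0)]

# Positions expressed in the pixel grid of a picture rendered at 144 dpi.
items = [TextItem(t, *pdf_point_to_pixel(x, y, 842.0, dpi=144.0)) for t, x, y in parsed]
```

Keeping text and contours in one coordinate system is a design choice of this sketch; converting the contours into PDF points instead would serve equally well.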
S102: performing closed-contour recognition on the picture corresponding to the PDF file by means of an image recognition algorithm to obtain a rectangular contour array;
Building on S101, this step performs closed-contour recognition on the picture corresponding to the PDF file by means of an image recognition algorithm to obtain a rectangular contour array. That is, the rectangle information corresponding to the table is extracted by image recognition, instead of relying on a parsing library to analyze a table that it cannot recognize properly.
Any image recognition algorithm available in the prior art may be used in this step, as long as it recognizes the table in the picture.
Further, the contours in the picture corresponding to the PDF file are recognized by the image recognition algorithm. Tables share a uniform shape characteristic: for example, they are all made up of closed rectangles. Using this characteristic, the closed rectangles of each table can be recognized in the picture corresponding to the PDF file, and the recognized closed rectangles are collected into the rectangular contour array. The rectangular contour array is an array that represents the dimensions and the number of the tables. Depending on how many kinds of information it records, it may, for example, be a two-dimensional array.
Optionally, this step may include:
step 1, converting the PDF file into a picture;
step 2, binarizing the picture to obtain a black-and-white picture;
step 3, recognizing the black-and-white picture with a closed-contour algorithm to obtain rectangular areas;
step 4, converting the rectangular areas into array form to obtain the rectangular contour array.
This alternative explains how the rectangular contour array is obtained. Specifically, the PDF file is first converted into a picture. The picture is then binarized to obtain a black-and-white picture; the main purpose of binarization is to make the image features related to the table stand out. Next, the black-and-white picture is recognized with a closed-contour algorithm to obtain rectangular areas. Finally, the rectangular areas are converted into array form to obtain the rectangular contour array.
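The binarization in step 2 can be sketched as a simple fixed threshold. This is a minimal pure-Python illustration, not the implementation of the application: the function name `binarize`, the threshold of 128, and the tiny sample picture are assumptions, and a real pipeline would normally call an image library rather than loop over nested lists.

```python
def binarize(gray, threshold=128):
    """Map every gray value (0-255) to 0 (black) or 255 (white)."""
    return [[255 if value >= threshold else 0 for value in row] for row in gray]


# A tiny 3x4 grayscale "page": dark ruling lines around a light cell interior.
gray = [
    [10, 20, 15, 12],
    [30, 240, 200, 25],
    [18, 22, 11, 90],
]

bw = binarize(gray)
```

After thresholding, the ruling lines are uniformly black and the cell interior is uniformly white, which is what makes the closed contours easy to trace in the next step.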
It should be noted that, in this alternative, the step of recognizing the black-and-white picture with a closed-contour algorithm to obtain the rectangular areas may include: recognizing the black-and-white picture with the closed-contour algorithm to obtain all closed contours in the picture; then, using the relationships among the closed contours, eliminating the closed regions unrelated to the largest closed region; and finally obtaining the rectangular areas.
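One way to picture the elimination of regions unrelated to the largest closed region is the following Python sketch. It is a simplification under stated assumptions: rectangles are `(x, y, width, height)` tuples, the candidate list includes the outer border of the table, and "related to the largest closed region" is approximated as "geometrically contained in the largest rectangle". The function names are hypothetical.

```python
def area(rect):
    """Area of an (x, y, width, height) rectangle."""
    _, _, w, h = rect
    return w * h


def contains(outer, inner):
    """True if `inner` lies entirely within `outer`."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh


def keep_cells_in_largest_region(rects):
    """Keep only the rectangles inside the largest one (assumed to be the table border)."""
    if not rects:
        return []
    largest = max(rects, key=area)
    return [r for r in rects if r != largest and contains(largest, r)]


candidates = [
    (10, 10, 200, 100),   # outer table border (largest region)
    (10, 10, 100, 50),    # cell
    (110, 10, 100, 50),   # cell
    (10, 60, 200, 50),    # cell spanning both columns
    (400, 300, 30, 30),   # stray box elsewhere on the page, e.g. a logo frame
]

cells = keep_cells_in_largest_region(candidates)
```

Here the stray box far from the table is dropped while the three cells survive.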
S103: structuring the text according to the rectangular contour array and the text position information to obtain table information.
Building on S102, this step structures the text according to the rectangular contour array and the text position information to obtain the table information. That is, once the rectangular contour array has been obtained, in other words once all rectangular tables in the picture have been determined, the text is fitted into the corresponding table according to the text position information.
Optionally, this step may include:
step 1, querying the rectangular contour array according to the text position information to determine the position of each piece of text within the rectangular contour array;
step 2, splicing all the text according to those positions to obtain the table information.
This alternative explains how the table information is spliced together: the position of each piece of text within the rectangular contour array is determined by a query, and all the text is then spliced into the corresponding table according to those positions to obtain the table information.
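The query-then-splice idea can be sketched in Python as a point-in-rectangle lookup followed by concatenation. This is an illustrative sketch only: rectangles are assumed to be `(x, y, width, height)` tuples, text items are assumed to be `(text, x, y)` tuples in the same coordinate system as the rectangles, and the names `cell_index` and `splice_text` are hypothetical.

```python
def cell_index(rects, x, y):
    """Return the index of the first rectangle containing point (x, y), or None."""
    for i, (rx, ry, rw, rh) in enumerate(rects):
        if rx <= x < rx + rw and ry <= y < ry + rh:
            return i
    return None


def splice_text(rects, items):
    """Concatenate the text fragments that fall inside each rectangle."""
    cells = {i: [] for i in range(len(rects))}
    # Visit fragments top-to-bottom, then left-to-right, to keep reading order.
    for text, x, y in sorted(items, key=lambda item: (item[2], item[1])):
        i = cell_index(rects, x, y)
        if i is not None:  # text outside every rectangle is not table content
            cells[i].append(text)
    return ["".join(cells[i]) for i in range(len(rects))]


rects = [(0, 0, 100, 50), (100, 0, 100, 50)]
items = [("al", 40, 20), ("Tot", 10, 20), ("42", 120, 20), ("footer", 300, 500)]

table_cells = splice_text(rects, items)
```

A linear scan per text item is enough for a sketch; for pages with many cells, a spatial index or a lookup on sorted row and column boundaries would reduce the cost.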
In summary, this embodiment first parses the text and text position information from the PDF file, then performs contour recognition on the picture corresponding to the PDF file by means of an image recognition algorithm to obtain a rectangular contour array, that is, the contour array of the table, and finally structures the text according to the rectangular contour array and the text position information to obtain the table information. The table information is thus extracted from the PDF file by image recognition, which avoids the extraction errors that arise when the table cannot be parsed, improves the efficiency of table information extraction, and ensures the extraction quality.
The method for extracting table information from a PDF provided by the present application is further described below through a specific embodiment.
In this embodiment, the method may include:
Step 1: generating a picture for each page of a given PDF file;
Step 2: parsing the text in the PDF and the position information of the text with an existing PDF parsing library, and keeping the result as a text array for subsequent processing; the PDF parsing library may be, for example, iText.
Step 3: performing closed-contour recognition on the pictures generated in step 1 with OpenCV (a cross-platform computer vision and machine learning software library) to obtain a rectangle array.
Preferably, this step may include: first, splitting the picture into its three RGB (red, green, blue) color channels; then binarizing the red channel alone to convert the picture to black and white; then finding all closed contours with the FindContours function, discarding contours with small areas, keeping only the lowest-level contours, and taking the rectangular area of each contour to obtain a rectangle array; next, filling these rectangles onto a blank canvas, calling the FindContours function again to find the largest black region, and removing from the rectangle array any rectangle that does not lie within that region, which yields a new rectangle array; and finally, aggregating and de-duplicating the x and y coordinates of the rectangles in the array to obtain a row array and a column array.
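The final aggregation and de-duplication of x and y coordinates can be sketched as follows. This Python sketch is an assumption-laden illustration, not the application's code: rectangles are `(x, y, width, height)` tuples, the left edge `x` of each rectangle yields the column array and the top edge `y` yields the row array, and coordinates that differ by no more than a small pixel tolerance `tol` (a parameter introduced here, not in the source) are treated as the same grid line.

```python
def dedupe_sorted(values, tol=2):
    """Sort the values and merge those within `tol` pixels of the last kept value."""
    kept = []
    for value in sorted(values):
        if not kept or value - kept[-1] > tol:
            kept.append(value)
    return kept


def row_col_arrays(rects, tol=2):
    """Derive the row array (top edges) and column array (left edges) of a grid."""
    cols = dedupe_sorted([x for x, _, _, _ in rects], tol)
    rows = dedupe_sorted([y for _, y, _, _ in rects], tol)
    return rows, cols


# Cell rectangles with one pixel of jitter, as contour detection often produces.
rects = [(10, 10, 100, 50), (110, 11, 100, 50), (11, 60, 200, 50)]

rows, cols = row_col_arrays(rects)
```

The tolerance matters because rectangles derived from detected contours rarely align to the exact pixel; without it, a single grid line could be counted as two rows or two columns.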
Step 4: data structuring: traversing the rectangle array obtained above and comparing each rectangle with the row array and the column array to determine the rows and columns it occupies, formatting the result as, for example, A1:B3; then searching the parsed text for the text that lies inside the rectangle and splicing it together. In this way, the position in the table and the content of each rectangle are obtained.
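Formatting the occupied rows and columns as a range such as A1:B3 can be sketched in Python as below. The sketch assumes that the row and column arrays hold the top and left edge coordinates of the grid lines, that each rectangle is an `(x, y, width, height)` tuple aligned with that grid, and that spreadsheet-style letters are used for columns. The helper names and the tolerance `tol` are hypothetical.

```python
def column_letter(index):
    """Spreadsheet-style column name: 0 -> 'A', 25 -> 'Z', 26 -> 'AA'."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def occupied(starts, low, high, tol=2):
    """Indices of the grid lines in `starts` that begin inside [low, high)."""
    return [i for i, start in enumerate(starts) if low - tol <= start < high - tol]


def span_label(rect, rows, cols, tol=2):
    """Format the rows and columns covered by `rect` as e.g. 'A1:B3'.

    Assumes the rectangle is aligned with the grid, so at least one row
    and one column start inside it.
    """
    x, y, w, h = rect
    col_ids = occupied(cols, x, x + w, tol)
    row_ids = occupied(rows, y, y + h, tol)
    first = f"{column_letter(col_ids[0])}{row_ids[0] + 1}"
    last = f"{column_letter(col_ids[-1])}{row_ids[-1] + 1}"
    return first if first == last else f"{first}:{last}"


rows = [10, 60, 110]   # top edge of each row
cols = [10, 110]       # left edge of each column

merged = span_label((10, 10, 200, 150), rows, cols)   # covers 2 columns, 3 rows
single = span_label((110, 60, 100, 50), rows, cols)   # one ordinary cell
```

Expressing a merged cell as a range keeps the row and column structure recoverable even when a rectangle spans several grid positions.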
It can be seen that this embodiment first parses the text and text position information from the PDF file, then performs contour recognition on the picture corresponding to the PDF file by means of an image recognition algorithm to obtain a rectangular contour array, that is, the contour array of the table, and finally structures the text according to the rectangular contour array and the text position information to obtain the table information.
A PDF table information extraction apparatus according to an embodiment of the present application is described below. The apparatus described below and the method described above may be referred to in correspondence with each other.
Referring to Fig. 2, Fig. 2 is a schematic structural diagram of a PDF table information extraction apparatus according to an embodiment of the present application.
In this embodiment, the apparatus may include:
a text parsing module 100, configured to parse the text of the PDF file to obtain the text and text position information;
a table contour recognition module 200, configured to perform closed-contour recognition on the picture corresponding to the PDF file by means of an image recognition algorithm to obtain a rectangular contour array;
and a table structuring module 300, configured to structure the text according to the rectangular contour array and the text position information to obtain table information.
Optionally, the text parsing module 100 is specifically configured to parse the text of the PDF file with a PDF parsing library to obtain the text and the text position information.
Optionally, the table contour recognition module 200 may include:
a PDF conversion unit, configured to convert the PDF file into a picture;
a binarization unit, configured to binarize the picture to obtain a black-and-white picture;
a closed-contour recognition unit, configured to recognize the black-and-white picture with a closed-contour algorithm to obtain rectangular areas;
and an array conversion unit, configured to convert the rectangular areas into array form to obtain the rectangular contour array.
Optionally, the table structuring module 300 may include:
a position query unit, configured to query the rectangular contour array according to the text position information and determine the position of each piece of text within the rectangular contour array;
and a text splicing unit, configured to splice all the text according to those positions to obtain the table information.
An embodiment of the present application further provides a computing device, including:
a memory for storing a computer program;
and a processor for implementing the steps of the table information extraction method described in the above embodiments when executing the computer program.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the table information extraction method described in the above embodiments.
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the parts that are the same or similar can be understood by cross-reference. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief; for the relevant details, refer to the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The method, apparatus, computing device, and computer-readable storage medium for extracting table information from a PDF provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present application; they are provided only to help in understanding the method and its core idea. It should be noted that those skilled in the art can make improvements and modifications to the present application without departing from its principles, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (10)

1. A method for extracting table information from a PDF, characterized by comprising:
parsing the text of the PDF file to obtain the text and text position information;
performing closed-contour recognition on the picture corresponding to the PDF file by means of an image recognition algorithm to obtain a rectangular contour array;
and structuring the text according to the rectangular contour array and the text position information to obtain table information.
2. The table information extraction method of claim 1, wherein parsing the text of the PDF file to obtain the text and text position information comprises:
parsing the text of the PDF file with a PDF parsing library to obtain the text and the text position information.
3. The table information extraction method of claim 1, wherein performing closed-contour recognition on the picture corresponding to the PDF file by means of an image recognition algorithm to obtain a rectangular contour array comprises:
converting the PDF file into a picture;
binarizing the picture to obtain a black-and-white picture;
recognizing the black-and-white picture with a closed-contour algorithm to obtain rectangular areas;
and converting the rectangular areas into array form to obtain the rectangular contour array.
4. The table information extraction method of claim 1, wherein structuring the text according to the rectangular contour array and the text position information to obtain table information comprises:
querying the rectangular contour array according to the text position information to determine the position of each piece of text within the rectangular contour array;
and splicing all the text according to those positions to obtain the table information.
5. A form information extraction apparatus of a PDF, comprising:
the character analysis module is used for carrying out character analysis on the PDF file to obtain characters and character position information;
the table contour identification module is used for carrying out closed contour identification processing on the picture corresponding to the PDF file through an image identification algorithm to obtain a rectangular contour array;
and the table structured processing module is used for carrying out structured processing on the characters according to the rectangular outline array and the character position information to obtain table information.
6. The form information extraction information of claim 5, wherein the text parsing module is specifically configured to perform text parsing on the PDF file according to a PDF parsing library to obtain the text and the text position information.
7. The form information extraction information of claim 5, wherein the form outline recognition module comprises:
a PDF conversion unit for converting the PDF file into a picture;
a binarization processing unit, configured to perform binarization processing on the picture to obtain a black-and-white picture;
a closed contour recognition unit, configured to identify the black-and-white picture through a closed contour algorithm to obtain the rectangular area;
and an array conversion unit, configured to convert the rectangular area into array form to obtain the rectangular outline array.
8. The table information extraction apparatus according to claim 5, wherein the table structuring processing module comprises:
a position query unit, configured to query the rectangular outline array according to the text position information and determine the position of each character in the rectangular outline array;
and a character splicing unit, configured to splice all the characters according to the positions to obtain the table information.
9. A computing device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the table information extraction method according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a processor, implements the steps of the table information extraction method according to any one of claims 1 to 4.
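The position query and character splicing steps recited in claims 4 and 8 can be illustrated with a minimal sketch. It is not the applicant's implementation. The names `Rect`, `Char`, `find_cell`, and `splice_cells` are hypothetical, and the sketch assumes that the rectangular outline array and the text position information are already expressed in the same coordinate system (origin at the top left, y increasing downward). In practice the picture coordinates would first have to be mapped to PDF coordinates.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rect:
    """One entry of the rectangular outline array (hypothetical layout)."""
    x: float
    y: float
    w: float
    h: float

    def contains(self, px: float, py: float) -> bool:
        # Half-open intervals so a point on a shared border belongs to one cell only.
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h


@dataclass(frozen=True)
class Char:
    """A parsed character with its position (assumed to be a single point)."""
    text: str
    x: float
    y: float


def find_cell(outlines: List[Rect], ch: Char) -> Optional[int]:
    """Query the outline array; return the index of the cell holding the character."""
    for i, rect in enumerate(outlines):
        if rect.contains(ch.x, ch.y):
            return i
    return None  # the character lies outside every table cell


def splice_cells(outlines: List[Rect], chars: List[Char]) -> List[str]:
    """Splice the characters of each cell in reading order (top to bottom, left to right)."""
    cells = [""] * len(outlines)
    for ch in sorted(chars, key=lambda c: (c.y, c.x)):
        i = find_cell(outlines, ch)
        if i is not None:
            cells[i] += ch.text
    return cells


if __name__ == "__main__":
    outlines = [Rect(0, 0, 50, 20), Rect(50, 0, 50, 20)]
    chars = [Char("B", 60, 5), Char("A", 5, 5), Char("b", 70, 5)]
    print(splice_cells(outlines, chars))
```

The linear scan in `find_cell` is chosen for clarity. For large tables, a spatial index or a pre-sorted outline array could replace it without changing the result.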
CN202010922836.2A 2020-09-04 2020-09-04 PDF table information extraction method and related device Pending CN112069991A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010922836.2A CN112069991A (en) 2020-09-04 2020-09-04 PDF table information extraction method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010922836.2A CN112069991A (en) 2020-09-04 2020-09-04 PDF table information extraction method and related device

Publications (1)

Publication Number Publication Date
CN112069991A true CN112069991A (en) 2020-12-11

Family

ID=73665541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010922836.2A Pending CN112069991A (en) 2020-09-04 2020-09-04 PDF table information extraction method and related device

Country Status (1)

Country Link
CN (1) CN112069991A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818894A (en) * 2021-02-08 2021-05-18 深圳万兴软件有限公司 Method and device for identifying text box in PDF file, computer equipment and storage medium
CN112861603A (en) * 2020-12-17 2021-05-28 西安理工大学 Automatic identification and analysis method for limited forms
CN112949455A (en) * 2021-02-26 2021-06-11 武汉天喻信息产业股份有限公司 Value-added tax invoice identification system and method
CN113221649A (en) * 2021-04-08 2021-08-06 西安理工大学 Method for solving wired table identification and analysis
CN113326797A (en) * 2021-06-17 2021-08-31 上海电气集团股份有限公司 Method for converting form information extracted from PDF document into structured knowledge

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105426856A (en) * 2015-11-25 2016-03-23 成都数联铭品科技有限公司 Image table character identification method
CN106897690A (en) * 2017-02-22 2017-06-27 南京述酷信息技术有限公司 PDF table extracting methods
CN108132916A (en) * 2017-11-30 2018-06-08 厦门市美亚柏科信息股份有限公司 Parse method, the storage medium of PDF list datas
CN110163030A (en) * 2018-02-11 2019-08-23 鼎复数据科技(北京)有限公司 A kind of PDF based on image information has frame table abstracting method

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861603A (en) * 2020-12-17 2021-05-28 西安理工大学 Automatic identification and analysis method for limited forms
CN112861603B (en) * 2020-12-17 2023-12-22 西安理工大学 Automatic identification and analysis method for limited form
CN112818894A (en) * 2021-02-08 2021-05-18 深圳万兴软件有限公司 Method and device for identifying text box in PDF file, computer equipment and storage medium
CN112818894B (en) * 2021-02-08 2023-12-15 深圳万兴软件有限公司 Method and device for identifying text box in PDF (portable document format) file, computer equipment and storage medium
CN112949455A (en) * 2021-02-26 2021-06-11 武汉天喻信息产业股份有限公司 Value-added tax invoice identification system and method
CN112949455B (en) * 2021-02-26 2024-04-05 武汉天喻信息产业股份有限公司 Value-added tax invoice recognition system and method
CN113221649A (en) * 2021-04-08 2021-08-06 西安理工大学 Method for solving wired table identification and analysis
CN113221649B (en) * 2021-04-08 2023-04-18 西安理工大学 Method for solving wired table identification and analysis
CN113326797A (en) * 2021-06-17 2021-08-31 上海电气集团股份有限公司 Method for converting form information extracted from PDF document into structured knowledge

Similar Documents

Publication Publication Date Title
CN112069991A (en) PDF table information extraction method and related device
KR101334483B1 (en) Apparatus and method for digitizing a document, and computer-readable recording medium
JP3139521B2 (en) Automatic language determination device
US8428356B2 (en) Image processing device and image processing method for generating electronic document with a table line determination portion
JPH10228473A (en) Document picture processing method, document picture processor and storage medium
US6711292B2 (en) Block selection of table features
US11615635B2 (en) Heuristic method for analyzing content of an electronic document
CN111291572A (en) Character typesetting method and device and computer readable storage medium
US20080069447A1 (en) Character recognition method, character recognition device, and computer product
US8538154B2 (en) Image processing method and image processing apparatus for extracting heading region from image of document
CN115828874A (en) Industry table digital processing method based on image recognition technology
CN108319578B (en) Method for generating medium for data recording
JP2002015280A (en) Device and method for image recognition, and computer- readable recording medium with recorded image recognizing program
Yuan et al. An opencv-based framework for table information extraction
CN116822634A (en) Document visual language reasoning method based on layout perception prompt
CN115880708A (en) Method for detecting character paragraph spacing compliance in APP (application) aging-adapted mode
CN114579796A (en) Machine reading understanding method and device
CN112560849A (en) Neural network algorithm-based grammar segmentation method and system
Ferilli et al. A distance-based technique for non-manhattan layout analysis
CN112257719A (en) Character recognition method, system and storage medium
CN116311301B (en) Wireless form identification method and system
KR102673900B1 (en) Table data extraction system and the method of thereof
CN118155230A (en) File processing method, storage medium and computer device
JPH0743718B2 (en) Multimedia document structuring method
CN117291152A (en) Table extraction method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination