CN112069991A - PDF table information extraction method and related device - Google Patents
PDF table information extraction method and related device
- Publication number
- CN112069991A (application CN202010922836.2A)
- Authority
- CN
- China
- Prior art keywords
- array
- rectangular
- information
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Geometry (AREA)
- Computer Graphics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Character Discrimination (AREA)
Abstract
The application discloses a PDF table information extraction method, which comprises the following steps: performing character analysis on a PDF file to obtain characters and character position information; carrying out closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array; and carrying out structured processing on the characters according to the rectangular contour array and the character position information to obtain table information. Because the table contours are recognized from the picture corresponding to the PDF file by image recognition, and the table information is then spliced together according to those contours, the extraction efficiency of the table information is improved and the extraction effect is ensured. The application also discloses a PDF table information extraction device, a computing device, and a computer-readable storage medium, which have the same beneficial effects.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a computing apparatus, and a computer-readable storage medium for extracting table information of a PDF.
Background
With the continuous development of information technology, a variety of document file formats have emerged. Among them, the Portable Document Format (PDF) is a file format that presents a document in a manner independent of the application program, hardware, and operating system.
According to the PDF specification, the content presented by a PDF is composed of vector graphics, bitmaps, text, and interactive elements. A table in a PDF is likewise composed of vector graphics, bitmaps, and text. As a result, when a PDF is read, its rows and columns cannot be read as directly as those of an Excel file.
In the prior art, technical solutions such as PDFBox, Tabula, and iText all read text data by parsing the file according to the PDF specification. However, since the PDF specification does not define a table, the rendered table content cannot be extracted directly. When the table content is relatively complicated and difficult to recognize, the recognition rate of the table content drops, and the table content cannot be extracted.
Therefore, how to improve the efficiency of extracting table information from a PDF is a key issue for those skilled in the art.
Disclosure of Invention
The application aims to provide a PDF table information extraction method, a PDF table information extraction device, a computing device, and a computer-readable storage medium.
In order to solve the above technical problem, the present application provides a method for extracting table information of PDF, including:
performing character analysis on the PDF file to obtain characters and character position information;
carrying out closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array;
and carrying out structured processing on the characters according to the rectangular contour array and the character position information to obtain table information.
Optionally, performing character analysis on the PDF file to obtain characters and character position information includes:
performing character analysis on the PDF file according to a PDF parsing library to obtain the characters and the character position information.
Optionally, carrying out closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array includes:
converting the PDF file into a picture;
carrying out binarization processing on the picture to obtain a black-and-white picture;
recognizing the black-and-white picture through a closed contour algorithm to obtain rectangular areas;
and converting the rectangular areas into array form to obtain the rectangular contour array.
Optionally, carrying out structured processing on the characters according to the rectangular contour array and the character position information to obtain table information includes:
querying the rectangular contour array according to the character position information to determine the position of each character in the rectangular contour array;
and splicing all the characters according to those positions to obtain the table information.
The present application further provides a device for extracting table information of PDF, including:
the character analysis module is used for carrying out character analysis on the PDF file to obtain characters and character position information;
the table contour identification module is used for carrying out closed contour identification processing on the picture corresponding to the PDF file through an image identification algorithm to obtain a rectangular contour array;
and the table structuring processing module is used for carrying out structured processing on the characters according to the rectangular contour array and the character position information to obtain table information.
Optionally, the character analysis module is specifically configured to perform character analysis on the PDF file according to a PDF parsing library to obtain the characters and the character position information.
Optionally, the table contour identification module includes:
a PDF conversion unit for converting the PDF file into a picture;
a binarization processing unit, configured to perform binarization processing on the picture to obtain a black-and-white picture;
a closed contour identification unit, configured to recognize the black-and-white picture through a closed contour algorithm to obtain rectangular areas;
and an array conversion unit, configured to convert the rectangular areas into array form to obtain the rectangular contour array.
Optionally, the table structuring processing module includes:
a position query unit, configured to query the rectangular contour array according to the character position information to determine the position of each character in the rectangular contour array;
and a character splicing unit, configured to splice all the characters according to those positions to obtain the table information.
The present application further provides a computing device comprising:
a memory for storing a computer program;
a processor for implementing the steps of the table information extraction method as described above when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the table information extraction method as described above.
The application provides a PDF table information extraction method, which comprises the following steps: performing character analysis on the PDF file to obtain characters and character position information; carrying out closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array; and carrying out structured processing on the characters according to the rectangular contour array and the character position information to obtain table information.
The method first parses the characters and the character position information from the PDF file, then performs contour recognition on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array, that is, the contour array of the table, and finally performs structured processing on the characters according to the rectangular contour array and the character position information to obtain the table information.
The present application further provides a device for extracting table information of PDF, a computing device, and a computer-readable storage medium, which have the above beneficial effects and are not described herein again.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are only embodiments of the present application; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of a method for extracting table information of a PDF according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a PDF table information extraction device according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a PDF table information extraction method, a PDF table information extraction device, a computing device, and a computer-readable storage medium. The table contours are recognized from the picture corresponding to the PDF file by image recognition, and the table information is then spliced together according to those contours, so that the extraction efficiency of the table information is improved and the extraction effect is ensured.
In order to make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, embodiments of the present application. All other embodiments derived by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present application.
In the prior art, technical solutions such as PDFBox, Tabula, and iText all read text data by parsing the file according to the PDF specification. However, since the PDF specification does not define a table, the rendered table content cannot be extracted directly. When the table content is relatively complicated and difficult to recognize, the recognition rate of the table content drops, and the table content cannot be extracted.
Therefore, the present application provides a method for extracting table information of a PDF. The method first parses the characters and the character position information from the PDF file, then performs contour recognition on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array, that is, the contour array of the table, and finally performs structured processing on the characters according to the rectangular contour array and the character position information to obtain the table information.
Referring to Fig. 1, Fig. 1 is a flowchart illustrating a method for extracting table information of a PDF according to an embodiment of the present disclosure.
In this embodiment, the method may include:
S101, performing character analysis on a PDF file to obtain characters and character position information;
This step aims to perform character analysis on the PDF file to obtain characters and character position information. That is, the PDF file is parsed to determine all the characters it contains. It should be understood that the characters include all the text in the PDF file, including all the text inside tables. The character position information corresponding to each character is acquired at the same time.
The character position information generally indicates the position of each character relative to the PDF file. It can generally be expressed as coordinates in the rectangular coordinate system of the PDF file.
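Because the later steps compare character positions with rectangles found in a rendered picture, the two coordinate systems have to agree. The following is a minimal sketch of one possible conversion, assuming the parser reports positions in default PDF user space (units of 1/72 inch, origin at the bottom-left of the page) and that the picture was rendered at a known DPI. The names `TextItem` and `to_pixel_box` are hypothetical and do not come from the application; some parsing libraries already report top-left-origin coordinates, in which case the vertical flip is not needed.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # default PDF user-space unit is 1/72 inch


@dataclass(frozen=True)
class TextItem:
    """One parsed character or text run with its bounding box in PDF points."""
    text: str
    x0: float  # left edge
    y0: float  # bottom edge (assumed bottom-left page origin)
    x1: float  # right edge
    y1: float  # top edge


def to_pixel_box(item, page_height_pt, dpi):
    """Map a PDF-space box to (left, top, right, bottom) in image pixels."""
    scale = dpi / POINTS_PER_INCH
    left = item.x0 * scale
    right = item.x1 * scale
    # Flip the y axis: image rows grow downward from the top edge.
    top = (page_height_pt - item.y1) * scale
    bottom = (page_height_pt - item.y0) * scale
    return (left, top, right, bottom)
```

For example, a character occupying points (72, 720) to (144, 756) on a 792-point-high page maps to the pixel box (144, 72, 288, 144) when the page is rendered at 144 DPI.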
Optionally, this step may include:
and performing character analysis on the PDF file according to a PDF analysis library to obtain the characters and the character position information.
It can be seen that the present alternative mainly explains how to perform text parsing. In the alternative scheme, a PDF analysis library is mainly adopted to analyze characters in a PDF file to obtain corresponding characters and character position information. The PDF analysis library may be any one of PDF analysis libraries provided in the prior art, which is not described herein again.
S102, carrying out closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array;
On the basis of S101, this step aims to perform closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array. That is, the rectangle information corresponding to the table is extracted by image recognition, instead of relying on a parsing library to analyze a table that it cannot recognize properly.
In this step, any image recognition algorithm provided in the prior art may be used, as long as it can recognize the table in the picture.
Furthermore, the contours in the picture corresponding to the PDF file are recognized through the image recognition algorithm. Table shapes share uniform characteristics, for example, they are all closed rectangles. The closed rectangles of each table can therefore be recognized from the picture corresponding to the PDF file by means of these characteristics, and the recognized closed rectangles are collected into the rectangular contour array. The rectangular contour array is an array representing the length, the size, and the number of the tables. Depending on the number of information types recorded, it may be a binary array.
Optionally, this step may include:
step 1, converting the PDF file into a picture;
step 2, carrying out binarization processing on the picture to obtain a black-and-white picture;
step 3, recognizing the black-and-white picture through a closed contour algorithm to obtain rectangular areas;
step 4, converting the rectangular areas into array form to obtain the rectangular contour array.
It can be seen that this alternative mainly explains how to obtain the rectangular contour array. Specifically, the PDF file is first converted into a picture. The picture is then binarized to obtain a black-and-white picture; binarization is performed mainly to make the image features related to the table information in the picture more prominent. Next, the black-and-white picture is recognized through a closed contour algorithm to obtain rectangular areas. Finally, the rectangular areas are converted into array form to obtain the rectangular contour array.
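The binarization step can be illustrated with a small, library-free sketch. It thresholds a single color channel, which mirrors the red-channel variant mentioned in the specific embodiment of this description. The function name `binarize`, the pixel layout (rows of `(r, g, b)` tuples), and the threshold of 128 are assumptions made for illustration; an image library would normally perform this step.

```python
def binarize(pixels, channel=0, threshold=128):
    """Return a 2D grid of 0 (black, ink) and 1 (white, background).

    `pixels` is a list of rows; each row is a list of (r, g, b) tuples.
    Only the selected channel is inspected (0 = red by default).
    """
    return [
        [1 if px[channel] >= threshold else 0 for px in row]
        for row in pixels
    ]
```

After this step every pixel is either ink or background, so table borders stand out as connected black lines regardless of the original line color.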
It should be noted that, in this alternative, the step of recognizing the black-and-white picture through a closed contour algorithm to obtain the rectangular areas may include: recognizing the black-and-white picture through the closed contour algorithm to obtain all closed contours in the picture; then eliminating, according to the relationships among the closed contours, the closed regions unrelated to the largest closed region; and finally obtaining the rectangular areas.
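The idea of finding closed regions and turning them into rectangles can be sketched without an image library. The example below is not the closed contour algorithm of the application: it swaps in a plain flood fill over white pixels, treats any white region that touches the picture border as page background, and reports the bounding box of every remaining enclosed region. The function name `find_cell_rectangles` and the `min_area` filter are assumptions; the filter stands in for discarding small closed shapes such as the interior of a letter.

```python
from collections import deque


def find_cell_rectangles(grid, min_area=1):
    """Return (x, y, w, h) boxes of white regions enclosed by black lines.

    `grid` is a list of rows of 0 (black) and 1 (white). Regions that touch
    the picture border are treated as background and skipped.
    """
    height, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(height)]
    rects = []
    for sy in range(height):
        for sx in range(width):
            if grid[sy][sx] != 1 or seen[sy][sx]:
                continue
            # Breadth-first flood fill over 4-connected white pixels.
            queue = deque([(sx, sy)])
            seen[sy][sx] = True
            min_x = max_x = sx
            min_y = max_y = sy
            area = 0
            touches_border = False
            while queue:
                x, y = queue.popleft()
                area += 1
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
                if x in (0, width - 1) or y in (0, height - 1):
                    touches_border = True
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and grid[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if not touches_border and area >= min_area:
                rects.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    return rects
```

The returned list plays the role of the rectangular contour array: one `(x, y, w, h)` entry per detected cell, in top-to-bottom, left-to-right scan order.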
S103, carrying out structured processing on the characters according to the rectangular contour array and the character position information to obtain table information.
On the basis of S102, this step aims to perform structured processing on the characters according to the rectangular contour array and the character position information to obtain table information. That is, once the rectangular contour array has been acquired, in other words once all rectangular tables in the picture have been determined, the characters are placed into their corresponding tables by means of the character position information.
Optionally, this step may include:
step 1, querying the rectangular contour array according to the character position information to determine the position of each character in the rectangular contour array;
step 2, splicing all the characters according to those positions to obtain the table information.
It can be seen that the alternative scheme mainly explains how to splice the table information. In this alternative, the position of each character in the rectangular array is determined by query, and then all characters are spliced into the corresponding table through the position to obtain the table information.
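The query-and-splice idea can be sketched as a point-in-rectangle lookup. This is only one possible reading of the step: it assumes that rectangles are `(x, y, w, h)` tuples, that each character is represented by the centre point of its box in the same pixel coordinate system, and that characters arrive in reading order. The function name `splice_text_into_cells` is hypothetical.

```python
def splice_text_into_cells(rects, items):
    """Group text by the rectangle that contains each item's centre point.

    rects: list of (x, y, w, h) cell boxes.
    items: list of (text, cx, cy) tuples in reading order.
    Returns one concatenated string per rectangle, in the order of `rects`.
    """
    cells = [[] for _ in rects]
    for text, cx, cy in items:
        for index, (x, y, w, h) in enumerate(rects):
            if x <= cx < x + w and y <= cy < y + h:
                cells[index].append(text)
                break  # a character belongs to at most one cell
    return ["".join(parts) for parts in cells]
```

Characters whose centre falls outside every rectangle, such as body text around the table, are simply left out of the result.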
In summary, this embodiment first parses the characters and the character position information from the PDF file, then performs contour recognition on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array, that is, the contour array of the table, and finally performs structured processing on the characters according to the rectangular contour array and the character position information to obtain the table information. The table information is thus extracted from the PDF file by means of image recognition, which avoids the information extraction errors caused by tables that cannot be parsed, improves the extraction efficiency of the table information, and ensures the extraction effect.
The following further describes a method for extracting table information of a PDF provided by the present application by a specific embodiment.
In this embodiment, the method may include:
step 1, generating a picture for each page of the PDF from a given PDF file;
step 2, parsing the characters in the PDF and their position information through an existing PDF parsing library, and using them as a character array for subsequent processing; the PDF parsing library may be, for example, iText.
step 3, carrying out closed contour recognition processing on the picture files generated in step 1 through OpenCV (a cross-platform computer vision and machine learning software library) to obtain a rectangle array.
Preferably, this step may include: first, extracting the three RGB color channels of the picture; then binarizing the picture using only the red channel to convert it into black and white; then searching for all closed contours through the FindContours function, discarding contours with small areas, keeping only the contours at the bottommost level, and taking the bounding rectangle of each remaining contour to obtain a rectangle array; next, filling these rectangles onto a blank canvas, finding the largest black area by calling the FindContours function again, and removing from the rectangle array the rectangles that are not inside that black area to obtain a new rectangle array; and finally, aggregating and de-duplicating the x and y coordinates of each rectangle in the rectangle array to obtain a row array and a column array.
step 4, data structuring: traversing the acquired rectangle array and comparing each rectangle with the row and column arrays to obtain the rows and columns it occupies, formatting the result (for example, as A1:B3), and then searching the parsed characters for the text located inside the rectangle and splicing it together. In this way, the table position information and the content of each rectangle are obtained.
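The last part of step 3 and the formatting in step 4 can be sketched together. The code below is an illustration under several assumptions: rectangles are `(x, y, w, h)` tuples, coordinates that differ by no more than a small `tolerance` (for example, the thickness of a border line) are treated as the same grid line, and the far edges of the rectangles are included so that the last row and column have a closing line. All function names are hypothetical; only the A1:B3 style of the result comes from the description above.

```python
def unique_sorted(values, tolerance=2):
    """Sort values and merge those within `tolerance` of the previous one."""
    merged = []
    for value in sorted(values):
        if not merged or value - merged[-1] > tolerance:
            merged.append(value)
    return merged


def column_letters(index):
    """Convert a zero-based column index to spreadsheet letters (0 -> 'A')."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def nearest_index(lines, value):
    """Index of the grid line closest to `value`."""
    return min(range(len(lines)), key=lambda i: abs(lines[i] - value))


def cell_ranges(rects, tolerance=2):
    """Label each rectangle with the rows and columns it spans."""
    xs = [x for x, _, _, _ in rects] + [x + w for x, _, w, _ in rects]
    ys = [y for _, y, _, _ in rects] + [y + h for _, y, _, h in rects]
    cols = unique_sorted(xs, tolerance)  # column array (vertical grid lines)
    rows = unique_sorted(ys, tolerance)  # row array (horizontal grid lines)
    labels = []
    for x, y, w, h in rects:
        c0, c1 = nearest_index(cols, x), nearest_index(cols, x + w) - 1
        r0, r1 = nearest_index(rows, y), nearest_index(rows, y + h) - 1
        start = f"{column_letters(c0)}{r0 + 1}"
        end = f"{column_letters(c1)}{r1 + 1}"
        labels.append(start if start == end else f"{start}:{end}")
    return labels
```

With a header cell spanning two columns above two ordinary cells, `cell_ranges([(0, 0, 21, 10), (0, 11, 10, 10), (11, 11, 10, 10)])` labels the rectangles `A1:B1`, `A2`, and `B2`, which shows how a merged cell ends up occupying more than one column.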
It can be seen that this embodiment first parses the characters and the character position information from the PDF file, then performs contour recognition on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array, that is, the contour array of the table, and finally performs structured processing on the characters according to the rectangular contour array and the character position information to obtain the table information.
The following describes a table information extraction device of a PDF according to an embodiment of the present application, and the table information extraction device of a PDF described below and the table information extraction method of a PDF described above may be referred to in correspondence with each other.
Referring to Fig. 2, Fig. 2 is a schematic structural diagram of a PDF table information extraction device according to an embodiment of the present application.
In this embodiment, the apparatus may include:
the character analysis module 100 is used for performing character analysis on the PDF file to obtain characters and character position information;
the table contour identification module 200 is configured to perform closed contour identification processing on the picture corresponding to the PDF file through an image identification algorithm to obtain a rectangular contour array;
and the table structuring processing module 300 is configured to perform structured processing on the characters according to the rectangular contour array and the character position information to obtain table information.
Optionally, the character analysis module 100 is specifically configured to perform character analysis on the PDF file according to a PDF parsing library to obtain the characters and the character position information.
Optionally, the table contour identification module 200 may include:
a PDF conversion unit for converting the PDF file into a picture;
a binarization processing unit, configured to perform binarization processing on the picture to obtain a black-and-white picture;
a closed contour identification unit, configured to recognize the black-and-white picture through a closed contour algorithm to obtain rectangular areas;
and an array conversion unit, configured to convert the rectangular areas into array form to obtain the rectangular contour array.
Optionally, the table structuring processing module 300 may include:
a position query unit, configured to query the rectangular contour array according to the character position information to determine the position of each character in the rectangular contour array;
and a character splicing unit, configured to splice all the characters according to those positions to obtain the table information.
An embodiment of the present application further provides a computing apparatus, including:
a memory for storing a computer program;
a processor for implementing the steps of the table information extraction method as described in the above embodiments when the computer program is executed.
The present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the table information extraction method according to the above embodiments are implemented.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant points can be found in the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The PDF table information extraction method, device, and computer-readable storage medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present application, and the description of the embodiments is only intended to help in understanding the method and core idea of the present application. It should be noted that those skilled in the art can make several improvements and modifications to the present application without departing from its principles, and such improvements and modifications also fall within the scope of the claims of the present application.
Claims (10)
1. A PDF table information extraction method, characterized by comprising the following steps:
performing character analysis on the PDF file to obtain characters and character position information;
carrying out closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array;
and carrying out structured processing on the characters according to the rectangular contour array and the character position information to obtain table information.
2. The table information extraction method of claim 1, wherein performing character analysis on the PDF file to obtain characters and character position information comprises:
performing character analysis on the PDF file according to a PDF parsing library to obtain the characters and the character position information.
3. The table information extraction method of claim 1, wherein carrying out closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array comprises:
converting the PDF file into a picture;
carrying out binarization processing on the picture to obtain a black-and-white picture;
recognizing the black-and-white picture through a closed contour algorithm to obtain rectangular areas;
and converting the rectangular areas into array form to obtain the rectangular contour array.
4. The table information extraction method of claim 1, wherein carrying out structured processing on the characters according to the rectangular contour array and the character position information to obtain table information comprises:
querying the rectangular contour array according to the character position information to determine the position of each character in the rectangular contour array;
and splicing all the characters according to those positions to obtain the table information.
5. A PDF table information extraction apparatus, comprising:
a character analysis module, configured to perform character analysis on the PDF file to obtain characters and character position information;
a table contour identification module, configured to perform closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array;
and a table structuring processing module, configured to perform structured processing on the characters according to the rectangular contour array and the character position information to obtain table information.
6. The table information extraction apparatus of claim 5, wherein the character analysis module is specifically configured to perform character analysis on the PDF file according to a PDF parsing library to obtain the characters and the character position information.
7. The table information extraction apparatus of claim 5, wherein the table contour identification module comprises:
a PDF conversion unit for converting the PDF file into a picture;
a binarization processing unit, configured to perform binarization processing on the picture to obtain a black-and-white picture;
a closed contour identification unit, configured to recognize the black-and-white picture through a closed contour algorithm to obtain rectangular areas;
and an array conversion unit, configured to convert the rectangular areas into array form to obtain the rectangular contour array.
8. The table information extraction apparatus of claim 5, wherein the table structuring processing module comprises:
a position query unit, configured to query the rectangular contour array according to the character position information to determine the position of each character in the rectangular contour array;
and a character splicing unit, configured to splice all the characters according to those positions to obtain the table information.
9. A computing device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the table information extraction method according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the table information extraction method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010922836.2A CN112069991A (en) | 2020-09-04 | 2020-09-04 | PDF table information extraction method and related device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010922836.2A CN112069991A (en) | 2020-09-04 | 2020-09-04 | PDF table information extraction method and related device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112069991A (en) | 2020-12-11 |
Family
ID=73665541
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010922836.2A Pending CN112069991A (en) | 2020-09-04 | 2020-09-04 | PDF table information extraction method and related device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112069991A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818894A (en) * | 2021-02-08 | 2021-05-18 | 深圳万兴软件有限公司 | Method and device for identifying text box in PDF file, computer equipment and storage medium |
CN112861603A (en) * | 2020-12-17 | 2021-05-28 | 西安理工大学 | Automatic identification and analysis method for limited forms |
CN112949455A (en) * | 2021-02-26 | 2021-06-11 | 武汉天喻信息产业股份有限公司 | Value-added tax invoice identification system and method |
CN113221649A (en) * | 2021-04-08 | 2021-08-06 | 西安理工大学 | Method for solving wired table identification and analysis |
CN113326797A (en) * | 2021-06-17 | 2021-08-31 | 上海电气集团股份有限公司 | Method for converting form information extracted from PDF document into structured knowledge |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105426856A (en) * | 2015-11-25 | 2016-03-23 | 成都数联铭品科技有限公司 | Image table character identification method |
CN106897690A (en) * | 2017-02-22 | 2017-06-27 | 南京述酷信息技术有限公司 | PDF table extracting methods |
CN108132916A (en) * | 2017-11-30 | 2018-06-08 | 厦门市美亚柏科信息股份有限公司 | Parse method, the storage medium of PDF list datas |
CN110163030A (en) * | 2018-02-11 | 2019-08-23 | 鼎复数据科技(北京)有限公司 | A kind of PDF based on image information has frame table abstracting method |
- 2020-09-04: CN application CN202010922836.2A (publication CN112069991A), status Pending
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112861603A (en) * | 2020-12-17 | 2021-05-28 | 西安理工大学 | Automatic identification and analysis method for limited forms |
CN112861603B (en) * | 2020-12-17 | 2023-12-22 | 西安理工大学 | Automatic identification and analysis method for limited form |
CN112818894A (en) * | 2021-02-08 | 2021-05-18 | 深圳万兴软件有限公司 | Method and device for identifying text box in PDF file, computer equipment and storage medium |
CN112818894B (en) * | 2021-02-08 | 2023-12-15 | 深圳万兴软件有限公司 | Method and device for identifying text box in PDF (portable document format) file, computer equipment and storage medium |
CN112949455A (en) * | 2021-02-26 | 2021-06-11 | 武汉天喻信息产业股份有限公司 | Value-added tax invoice identification system and method |
CN112949455B (en) * | 2021-02-26 | 2024-04-05 | 武汉天喻信息产业股份有限公司 | Value-added tax invoice recognition system and method |
CN113221649A (en) * | 2021-04-08 | 2021-08-06 | 西安理工大学 | Method for solving wired table identification and analysis |
CN113221649B (en) * | 2021-04-08 | 2023-04-18 | 西安理工大学 | Method for solving wired table identification and analysis |
CN113326797A (en) * | 2021-06-17 | 2021-08-31 | 上海电气集团股份有限公司 | Method for converting form information extracted from PDF document into structured knowledge |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112069991A (en) | | PDF table information extraction method and related device |
KR101334483B1 (en) | | Apparatus and method for digitizing a document, and computer-readable recording medium |
JP3139521B2 (en) | | Automatic language determination device |
US8428356B2 (en) | | Image processing device and image processing method for generating electronic document with a table line determination portion |
JPH10228473A (en) | | Document picture processing method, document picture processor and storage medium |
US6711292B2 (en) | | Block selection of table features |
US11615635B2 (en) | | Heuristic method for analyzing content of an electronic document |
CN111291572A (en) | | Character typesetting method and device and computer readable storage medium |
US20080069447A1 (en) | | Character recognition method, character recognition device, and computer product |
US8538154B2 (en) | | Image processing method and image processing apparatus for extracting heading region from image of document |
CN115828874A (en) | | Industry table digital processing method based on image recognition technology |
CN108319578B (en) | | Method for generating medium for data recording |
JP2002015280A (en) | | Device and method for image recognition, and computer-readable recording medium with recorded image recognizing program |
Yuan et al. | | An OpenCV-based framework for table information extraction |
CN116822634A (en) | | Document visual language reasoning method based on layout perception prompt |
CN115880708A (en) | | Method for detecting character paragraph spacing compliance in APP (application) aging-adapted mode |
CN114579796A (en) | | Machine reading understanding method and device |
CN112560849A (en) | | Neural network algorithm-based grammar segmentation method and system |
Ferilli et al. | | A distance-based technique for non-manhattan layout analysis |
CN112257719A (en) | | Character recognition method, system and storage medium |
CN116311301B (en) | | Wireless form identification method and system |
KR102673900B1 (en) | | Table data extraction system and the method of thereof |
CN118155230A (en) | | File processing method, storage medium and computer device |
JPH0743718B2 (en) | | Multimedia document structuring method |
CN117291152A (en) | | Table extraction method and apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||