CN110147697A - A PDF table extraction method based on human-machine collaboration
- Publication number: CN110147697A
- Application number: CN201810142939.XA
- Authority: CN (China)
- Prior art keywords: information, text block, column, page
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Computer Graphics (AREA)
- Geometry (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a PDF table extraction method based on human-machine collaboration, comprising the following steps: step 1, uploading the PDF file to be parsed to a browser and opening it; step 2, drag-selecting the table region on the PDF page to obtain the table's position information on the page, the position information comprising left, right, bottom and top information; step 3, transmitting to a backend server the left, right, bottom and top information obtained in step 2, together with the page number of the table in the PDF file; step 4, parsing the PDF table on the backend server; step 5, the backend server returning the parsing result to the browser, where it is displayed on the right side. Because the table is extracted through human-machine collaboration, extraction accuracy is greatly improved and can approach 100%.
Description
Technical field
The present invention relates to the extraction of PDF tables and, in particular, to a PDF table extraction method based on human-machine collaboration.
Background
PDF is a formal international standard, and what is commonly called a PDF file is a file generated according to this standard. It is now very widely used: every industry likes to distribute documents as PDF, first because it ensures security and second because it keeps formatting consistent.
However, PDF files also bring considerable inconvenience. In particular, tabular data in a PDF cannot be exported directly, which creates difficulty for the many people who need to collect tabular data from documents. This is especially true in finance: for financial reports and industry research reports, researchers need to process tables further, so the tables in the PDF must be converted into a regular row-and-column form (such as an Excel sheet).
In the prior art, PDF files are mostly processed in offline batches: a program reads the PDF content, parses the text and the text positions, infers the table boundaries, and then extracts the table content. Although efficient, this approach has the following two major problems:
(1) It copes poorly with non-standard tables. For example, a document may have several tables on one line with no obvious dividing line; or it may use the two-column layout common in Hong Kong stock financial reports, with one table on the left and one on the right; or the document itself may have a coloured or picture background. In most of these cases the tables cannot be extracted, or the extraction quality is very poor.
(2) The extracted boundaries are blurred. For borderless tables with inconsistent row heights or column widths, pattern matching often fails and rows and columns cannot be divided accurately, so that several rows are extracted into one row or several columns into one column, and accuracy cannot be guaranteed.
Summary of the invention
To overcome the above problems, the inventors conducted intensive research and extract PDF tables through human-machine collaboration, which greatly improves extraction accuracy, to nearly 100%, thereby completing the present invention.
The present invention provides a PDF table extraction method based on human-machine collaboration, comprising the following steps:
Step 1: upload the PDF file to be parsed to a browser and open it;
Step 2: drag-select the table region on the PDF page to obtain the table's position information on the page, the position information comprising left, right, bottom and top information;
Step 3: transmit to a backend server the left, right, bottom and top information of the PDF table on the PDF page obtained in step 2, together with the page number of the table in the PDF file;
Step 4: parse the PDF table on the backend server;
Step 5: the backend server returns the parsing result to the browser, where it is displayed on the right side.
Brief description of the drawings
Fig. 1 is a schematic diagram of the interaction page of the present invention;
Fig. 2 is a schematic diagram illustrating text blocks and cells;
Figs. 3-1 to 3-6 show the results of the PDF table processing steps in Embodiment 1;
Figs. 4-1 and 4-2 show, respectively, the original table processed in Embodiment 2 and the processed table;
Figs. 5-1 to 5-3 show the results of the PDF table processing steps in Embodiment 3.
Detailed description of embodiments
The present invention is described in more detail below with reference to the drawings. Its features and advantages will become clearer from this description.
One aspect of the present invention provides a PDF table extraction method based on human-machine collaboration, comprising the following steps:
Step 1: upload the PDF file to be parsed to a browser and open it.
Here, the PDF file is a PDF file that is not in image format and that contains PDF tables. The PDF tables include bordered tables, borderless tables and tables with incomplete borders. A bordered table is a complete table whose ruling lines cover all rows and columns; a borderless table has no row or column lines at all; a table with incomplete borders lacks some row or column lines.
According to a preferred embodiment of the invention, in step 1 the left side of the browser shows the original PDF and the right side shows the parsing result, forming an interaction page (as shown in Fig. 1, Fig. 3-6 and Fig. 5-3).
In a further preferred embodiment, the interaction page is implemented with HTML + CSS + JavaScript.
In a still further preferred embodiment, in step 1, after the PDF file is uploaded to the browser, it is rendered by the browser's rendering engine and converted into an HTML view layer and a canvas view layer.
Here, rendering means that, once the web page has loaded the PDF file, the browser's rendering engine converts it into files or elements that the browser can handle. Rendering divides the PDF region into two view layers. The upper layer is the HTML view layer, which holds the content and coordinate information of the characters and digits in the PDF file. The lower layer is the canvas view layer, which holds the graphical information in the PDF file, such as table background colours and ruling lines.
Step 2: drag-select the table region on the PDF page to obtain the table's position information on the page, the position information comprising left, right, bottom and top information.
Here, the user drags the mouse over the table to be parsed in the PDF region. The browser captures the mouse click position and extent, from which the selected area is obtained and the extent of the region to be parsed is determined. As shown in Fig. 1, the left information is the distance from the left edge of the PDF table to the left edge of the PDF page; the right information is the distance from the right edge of the PDF table to the left edge of the PDF page; the bottom information is the distance from the bottom edge of the PDF table to the bottom edge of the PDF page; and the top information is the distance from the top edge of the PDF table to the bottom edge of the PDF page.
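The four values defined above can be derived from a mouse drag with a small coordinate conversion. The following is a minimal sketch, not the patent's implementation: it assumes the viewer reports the drag in pixels with the origin at the top-left of the rendered page, and the names `TableRegion`, `selection_to_region`, `page_height` and `scale` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TableRegion:
    """Selected table region, using the four values named in the text."""
    left: float    # left edge of the table, measured from the page's left edge
    right: float   # right edge of the table, also measured from the page's left edge
    bottom: float  # bottom edge of the table, measured from the page's bottom edge
    top: float     # top edge of the table, also measured from the page's bottom edge


def selection_to_region(x0, y0, x1, y1, page_height, scale=1.0):
    """Convert a drag in viewer pixels (origin top-left, y pointing down)
    into page units (origin bottom-left, y pointing up).

    page_height is in page units; scale is viewer pixels per page unit.
    Both are assumptions about what the viewer exposes.
    """
    # The user may drag in any direction, so normalise the corners first.
    px_left, px_right = sorted((x0, x1))
    px_top, px_bottom = sorted((y0, y1))
    return TableRegion(
        left=px_left / scale,
        right=px_right / scale,
        bottom=page_height - px_bottom / scale,
        top=page_height - px_top / scale,
    )
```

Note that both horizontal values are measured from the left edge of the page and both vertical values from the bottom edge, so `right > left` and `top > bottom` always hold for a non-empty selection.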
In the present invention, all tables in the PDF file can be parsed, but preferably only the table in the selected area is parsed, so that the required table can be extracted quickly and in a targeted way.
According to a preferred embodiment of the invention, step 1' is optionally carried out after step 1 and before step 2:
Step 1': use canvas technology to draw row lines and/or column lines on a borderless table or a table with incomplete borders, completing the rows and/or columns of the PDF table.
When the PDF table is borderless or has incomplete borders, its borders need to be completed in the browser with canvas technology after step 1, and step 2 then follows. When the PDF table is a bordered table, step 1' is not needed and step 2 follows step 1 directly.
In a further preferred embodiment, in step 1' the row lines and/or column lines are drawn on the borderless table or the table with incomplete borders in the browser by moving the mouse, using canvas technology.
As the mouse moves, the browser tracks its position and the canvas draws a line along the mouse's path.
Step 3: transmit to the backend server the left, right, bottom and top information of the PDF table on the PDF page obtained in step 2, together with the page number of the table in the PDF file.
According to a preferred embodiment of the invention, in step 3 the link address of the PDF file is also transmitted to the backend server, which loads the PDF file in order to parse the PDF table.
In a further preferred embodiment, in step 3 the coordinate position information of the row lines and/or column lines drawn in step 1' is also transmitted to the backend server for parsing the PDF table.
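The data transmitted in step 3 can be pictured as a single request record. The sketch below is only an illustration: the patent does not specify a wire format, so the JSON encoding and every field name (`pdf_url`, `row_lines`, `col_lines`, and so on) are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ExtractionRequest:
    # Position of the selected table on the page (see step 2).
    left: float
    right: float
    bottom: float
    top: float
    # Page number of the table within the PDF file.
    page: int
    # Optional link address of the PDF file, so the server can load it.
    pdf_url: Optional[str] = None
    # Optional coordinates of lines drawn in step 1' (hypothetical layout):
    # y positions of row lines and x positions of column lines.
    row_lines: List[float] = field(default_factory=list)
    col_lines: List[float] = field(default_factory=list)


def encode_request(req: ExtractionRequest) -> str:
    """Serialise the request as JSON for transmission to the server."""
    return json.dumps(asdict(req), sort_keys=True)
```

For a bordered table the two line lists would simply stay empty, matching the text's remark that no drawn-line coordinates are transmitted in that case.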
Step 4: parse the PDF table on the backend server.
According to a preferred embodiment of the invention, step 4 comprises the following sub-steps:
Step 4-1: load the PDF file on the backend server and parse out the information of each text block in the PDF file.
The concept of a text block in the present invention derives from the English term "block", an established concept in this field. It denotes a fixed area of the PDF display that may contain several characters and other related information. For the concept of a block, see the PDF specification issued by the International Organization for Standardization, document number ISO 32000-1:2008.
Specifically, a text block is formed of continuous text, and the concept differs from that of a cell: a cell may contain no text block (when it is empty), one text block (when the text in the cell has no line break), or several text blocks (when the text in the cell has line breaks). For example, as shown in Fig. 2, a marks a text block and b marks a cell, and the cell marked b contains several text blocks.
According to a preferred embodiment of the invention, the information of a text block comprises all the character information in the text block and the coordinate position of the text block on the PDF page, the coordinates being absolute position coordinates on the PDF page.
Preferably, in step 4-1 the text blocks are parsed out according to the PDF protocol.
Step 4-2: in virtual memory, use the text block information obtained in step 4-1 (in particular the coordinate position of each text block) to rearrange the text blocks on the PDF page, forming the text to be processed.
According to a preferred embodiment of the invention, in step 4-2 rearranging the text blocks means that, according to their coordinate positions, text blocks at the same horizontal position are arranged into a row from left to right and, at the same time, text blocks at the same vertical position are arranged into a column from top to bottom.
Here, the coordinate position is the absolute position of the text block on the PDF page. Accurate text block positions can only be obtained through this rearrangement. They cannot be taken directly from the order in which text blocks are read from the PDF file, because under the PDF protocol information such as the footer may be recorded at the front, which would introduce significant errors into the reading result.
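The rearrangement in step 4-2 amounts to sorting by coordinates rather than trusting file order. A minimal sketch follows, assuming each text block carries a bounding box with y increasing upwards; the `TextBlock` type, the `arrange_into_rows` helper and the `y_tol` tolerance for "same horizontal position" are hypothetical choices, not taken from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y0: float  # bottom edge (y grows upwards, as in PDF page space)
    y1: float  # top edge


def arrange_into_rows(blocks: List[TextBlock], y_tol: float = 2.0) -> List[List[TextBlock]]:
    """Group blocks into rows from top to bottom, each row ordered left to right.

    Blocks whose bottom edges lie within y_tol of the first block of the
    current row are treated as being on the same horizontal position.
    """
    rows: List[List[TextBlock]] = []
    # Visit blocks from the top of the page downwards.
    for block in sorted(blocks, key=lambda b: -b.y0):
        if rows and abs(rows[-1][0].y0 - block.y0) <= y_tol:
            rows[-1].append(block)
        else:
            rows.append([block])
    # Within each row, order the blocks from left to right.
    for row in rows:
        row.sort(key=lambda b: b.x0)
    return rows
```

In this sketch a footer that happens to be stored first in the file still ends up in the last row, because only its coordinates decide its place.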
According to a preferred embodiment of the invention, step 4-1' is carried out after step 4-1 and before step 4-2:
Step 4-1': split the text blocks from step 4-1, dividing each text block that needs splitting into two or more text blocks.
In a further preferred embodiment, in step 4-1' a text block needs to be split when the spacing between adjacent characters within it is greater than a predetermined value.
In the present invention, the predetermined value is defined as the average size of one Chinese character on the PDF page.
In a still further preferred embodiment, the split is made between the adjacent characters whose spacing is greater than the predetermined value.
If a text block contains two or more such pairs of adjacent characters, it must be split two or more times, so that none of the resulting text blocks still needs splitting.
In the present invention, splitting text blocks in step 4-1' improves the precision and recall of the final table.
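The splitting rule of step 4-1' can be sketched as a single pass over the characters of a text block. This is an illustrative sketch only: it assumes each character is available with its horizontal extent, and `split_block` and `max_gap` are hypothetical names, with `max_gap` standing in for the average Chinese character size on the page.

```python
from typing import List, Tuple

# One character with its horizontal extent: (char, x0, x1).
Char = Tuple[str, float, float]


def split_block(chars: List[Char], max_gap: float) -> List[List[Char]]:
    """Split a left-to-right run of characters wherever the gap between
    two neighbouring characters is greater than max_gap."""
    if not chars:
        return []
    pieces = [[chars[0]]]
    for prev, cur in zip(chars, chars[1:]):
        gap = cur[1] - prev[2]  # left edge of current minus right edge of previous
        if gap > max_gap:
            pieces.append([cur])    # start a new text block at the wide gap
        else:
            pieces[-1].append(cur)  # still part of the same text block
    return pieces
```

Because every wide gap is handled in the same pass, a block with several wide gaps is split as many times as needed, and no resulting piece still contains a gap above the threshold.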
Step 4-3: using the left, right, bottom and top information of the PDF table on the PDF page and the page number of the table in the PDF file, as transmitted in step 3, filter the table to be processed out of the text to be processed.
Because step 3 has given the backend server the position of the PDF table on the PDF page, the table to be processed can be picked out of the PDF page accurately.
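The filtering in step 4-3 is essentially a rectangle test on each text block. The sketch below makes two assumptions that the patent does not state: text blocks are plain dictionaries with `page`, `x0`, `x1`, `y0` and `y1` keys, and a block counts as selected when its centre lies inside the region.

```python
def blocks_in_region(blocks, page, left, right, bottom, top):
    """Keep the blocks on the given page whose centre lies inside the
    selected rectangle (left/right from the page's left edge,
    bottom/top from the page's bottom edge)."""
    selected = []
    for b in blocks:
        cx = (b["x0"] + b["x1"]) / 2  # horizontal centre of the block
        cy = (b["y0"] + b["y1"]) / 2  # vertical centre of the block
        if b["page"] == page and left <= cx <= right and bottom <= cy <= top:
            selected.append(b)
    return selected
```

Testing the centre rather than requiring full containment tolerates a selection that clips the edge of a text block slightly; a stricter containment test would be an equally valid reading.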
According to a preferred embodiment of the invention, step 4-3' is optionally carried out after step 4-3 and before step 4-4:
Step 4-3': delete the irregular tables from the tables to be processed obtained in step 4-3.
Irregular tables include tables of 2 rows or fewer and tables of 2 columns or fewer. This step effectively removes redundant content such as flow charts from the tables to be processed. Irregular tables can also be screened out according to the distance between adjacent text blocks in the same row, which improves the accuracy of the tables to be processed.
Step 4-4: divide the cells, drawing the rows and columns of the table and completing the table's ruling lines.
According to a preferred embodiment of the invention, when the PDF table contained in the PDF file of step 1 is a bordered table, cell division is carried out directly in step 4-4.
A bordered table is one whose ruling lines cover all rows and columns. For a bordered table, no row lines or column lines are drawn in step 1', so no row-line or column-line coordinates are transmitted in step 3.
In a further preferred embodiment, cells are divided where adjacent text blocks do not overlap, that is, in the blank space between them, completing the table's ruling lines.
Because the original PDF table already has complete ruling lines, dividing its cells does not require the adjustment of text block widths described later.
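Dividing cells "in the blank space" between text blocks can be sketched as finding the gaps left when all text block extents are projected onto one axis. The function below does this for the horizontal axis; placing the divider at the midpoint of each gap, and the name `column_dividers`, are assumptions made for illustration.

```python
def column_dividers(intervals):
    """intervals: (x0, x1) horizontal extents of the text blocks in a table.

    Returns the x positions midway through every blank gap, i.e. every
    stretch of the axis that no text block covers.
    """
    dividers = []
    covered_to = None  # right-most x covered by the blocks seen so far
    for x0, x1 in sorted(intervals):
        if covered_to is not None and x0 > covered_to:
            # Nothing overlaps the span (covered_to, x0): a blank gap.
            dividers.append((covered_to + x0) / 2)
        covered_to = x1 if covered_to is None else max(covered_to, x1)
    return dividers
```

The same routine applied to vertical extents would give candidate row dividers.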
According to a preferred embodiment of the invention, when the PDF table contained in the PDF file of step 1 is a borderless table, row lines and column lines are drawn in step 1' (completing the table), and their coordinate position information is transmitted to the backend server in step 3.
A borderless table is a table with no row or column lines at all.
In a further preferred embodiment, in step 4-4 the cells are divided directly from the row-line and column-line coordinates transmitted to the server.
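When the user has drawn the row and column lines, assigning each text block to a cell reduces to locating its position between those lines. A minimal sketch follows, assuming the drawn lines are the interior separators only and that each text block is represented by its centre point; `build_grid` and its argument layout are hypothetical.

```python
from bisect import bisect_right


def build_grid(blocks, row_lines, col_lines):
    """blocks: (text, x_centre, y_centre) tuples, y growing upwards.
    row_lines: y positions of the drawn row lines.
    col_lines: x positions of the drawn column lines.

    Returns the table as a list of rows (top to bottom) of cell strings.
    """
    rows = sorted(row_lines)
    cols = sorted(col_lines)
    grid = [["" for _ in range(len(cols) + 1)] for _ in range(len(rows) + 1)]
    # Visit blocks top to bottom, left to right, so that several blocks in
    # one cell are joined in reading order.
    for text, cx, cy in sorted(blocks, key=lambda b: (-b[2], b[1])):
        col = bisect_right(cols, cx)
        # y grows upwards, so the band above the highest line is row 0.
        row = len(rows) - bisect_right(rows, cy)
        grid[row][col] = (grid[row][col] + " " + text).strip()
    return grid
```

Joining with a space is a simplification; a cell whose text wraps over several lines would contain several text blocks, as the description of Fig. 2 notes.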
According to a preferred embodiment of the invention, when the PDF table contained in the PDF file of step 1 is a table with incomplete borders, row lines and/or column lines are drawn in step 1' (completing the table), and their coordinate position information is transmitted to the backend server in step 3.
A table with incomplete borders lacks some row or column lines, so only the missing row lines and/or column lines are drawn.
In a further preferred embodiment, in step 4-4 a partial cell division is first carried out using the row-line and/or column-line coordinates transmitted to the server.
Because the table has incomplete borders, only the missing rows and/or columns are drawn in step 1', and step 3 therefore transmits only those missing rows and/or columns to the backend server. Cell division based on them can consequently only be partial.
In a still further preferred embodiment, when the PDF table contained in the PDF file of step 1 is a table with incomplete borders, cell division also involves the following processing: the widths of the text blocks in the table to be processed are adjusted so that the cells they form have the same width within a column, which facilitates cell division.
Differences in text block width between rows of the same column would interfere with the later cell division. Preferably, during the adjustment in step 4-4 the character size and character spacing within a text block are unchanged, and the characters are centred.
In a still further preferred embodiment, the adjustment proceeds as follows:
(1) When adjacent rows of the table to be processed have one or more text blocks aligned in the column direction, the narrower text blocks in each column are extended to the left and right so that the cell widths match.
For text blocks whose cells differ in width, the narrower text block is first extended along the row direction so that the left and right positions within the column are aligned.
(2) When adjacent rows of the table to be processed have no text blocks aligned in the column direction, all text blocks of the row are shifted left or right together so that the left or right end of the row is aligned with the left or right end of the adjacent row; then the narrower text blocks in each column of the row are extended to the left and right so that the cell widths match; finally, cells are divided where adjacent text blocks do not overlap, that is, in the blank space, completing the table's ruling lines.
The adjustment reduces the number of lines in the final table, which reduces the workload and improves efficiency while also improving the accuracy of the final result.
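The two adjustment cases can be sketched on bounding extents alone. This is a simplified illustration under stated assumptions: only the horizontal extents of the text blocks are modelled (the characters themselves, which the text says stay the same size and are centred, are not), case (2) is shown for left alignment only, and the helper names are hypothetical.

```python
def equalise_column(extents):
    """Case (1): extents are the (x0, x1) of each text block in one column.
    Every block is widened to the column's overall left and right limits,
    so that all cells of the column have the same width."""
    left = min(x0 for x0, _ in extents)
    right = max(x1 for _, x1 in extents)
    return [(left, right) for _ in extents]


def align_row_left(row, reference_row):
    """Case (2), first half: shift every block of `row` by the same amount
    so that the row's left end matches the left end of `reference_row`."""
    shift = min(x0 for x0, _ in reference_row) - min(x0 for x0, _ in row)
    return [(x0 + shift, x1 + shift) for x0, x1 in row]
```

After `align_row_left`, the shifted row can be passed column by column through `equalise_column`, mirroring the order of operations described in case (2).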
Step 4-5: delete redundant empty rows and/or empty columns, completing the formatting of the PDF table.
This makes the final table clean and tidy.
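Removing empty rows and columns is straightforward once the table is held as a grid of strings. A minimal sketch, assuming a rectangular list-of-lists grid in which an empty or whitespace-only string marks an empty cell (the `drop_empty` name is hypothetical):

```python
def drop_empty(grid):
    """Remove rows and columns in which every cell is empty.

    grid is a rectangular list of rows, each a list of cell strings.
    """
    # Keep only rows that contain at least one non-blank cell.
    rows = [r for r in grid if any(c.strip() for c in r)]
    if not rows:
        return []
    # Keep only column indices that are non-blank in at least one kept row.
    keep = [j for j in range(len(rows[0])) if any(r[j].strip() for r in rows)]
    return [[r[j] for j in keep] for r in rows]
```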
Step 5: the backend server returns the parsing result to the browser, where it is displayed on the right side.
In the present invention, coordinate position information refers to absolute coordinate positions on the PDF page.
The beneficial effects of the present invention include:
(1) the method extracts PDF tables through human-computer interaction and can extract them accurately, with an extraction accuracy that can reach 100%;
(2) the method can handle bordered tables, borderless tables and tables with incomplete borders;
(3) the method extracts and parses PDF tables into an editable form, such as Excel;
(4) the method is simple and easy to implement.
Embodiments
Embodiment 1: extraction of a borderless PDF table
The PDF file contains the mid-term examination results of Class 1 of a certain grade at a school. The table is a borderless table, as shown in Fig. 3-1. The table is extracted from the PDF as follows:
(1) the PDF file is uploaded to the website, and the browser's rendering engine renders it (as shown in Fig. 3-2);
(2) the PDF table region is drag-selected (as shown in Fig. 3-3) to obtain the left, right, bottom and top information of the PDF table; in addition, row lines and column lines are drawn on the borderless table using canvas technology (as shown in Fig. 3-4) to complete the rows and columns of the PDF table, giving the coordinate position information of the drawn row lines and column lines;
(3) the left, right, bottom and top information of the PDF table, the page number of the table in the PDF file, and the coordinate position information of the drawn row lines and column lines are transmitted to the backend server;
(4) the PDF table is parsed on the backend server; the parsing result is shown in Fig. 3-5;
(5) the backend server returns the parsing result to the browser, where it is displayed on the right side, as shown in Fig. 3-6.
Embodiment 2: extraction of a borderless PDF table
The borderless PDF table shown in Fig. 4-1 is extracted using the method of the invention; after processing, the table shown in Fig. 4-2 is obtained.
Embodiment 3: extraction of a table with incomplete borders
The content of a PDF table with incomplete borders is extracted as follows:
(1) the PDF file is uploaded to the website, and the browser's rendering engine renders it (as shown in Fig. 5-1);
(2) the PDF table region is drag-selected (as shown in Fig. 5-1) to obtain the left, right, bottom and top information of the PDF table; in addition, column lines are drawn on the table using canvas technology (as shown in Fig. 5-2) to complete the rows and columns of the PDF table, giving the coordinate position information of the drawn column lines;
(3) the left, right, bottom and top information of the PDF table, the page number of the table in the PDF file, and the coordinate position information of the drawn column lines are transmitted to the backend server;
(4) the PDF table is parsed on the backend server;
(5) the backend server returns the parsing result to the browser, where it is displayed on the right side, as shown in Fig. 5-3.
The present invention has been described above with reference to preferred embodiments, but these embodiments are merely exemplary and serve only an illustrative purpose. On this basis, various substitutions and improvements can be made to the invention, all of which fall within its scope of protection.
Claims (10)
1. A PDF table extraction method based on human-machine collaboration, comprising the following steps:
Step 1: upload the PDF file to be parsed to a browser and open it;
Step 2: drag-select the table region on the PDF page to obtain the table's position information on the page, the position information comprising left, right, bottom and top information;
Step 3: transmit to a backend server the left, right, bottom and top information of the PDF table on the PDF page obtained in step 2, together with the page number of the table in the PDF file;
Step 4: parse the PDF table on the backend server;
Step 5: the backend server returns the parsing result to the browser, where it is displayed on the right side.
2. The method according to claim 1, characterized in that in step 1 the left side of the browser shows the original PDF and the right side of the browser shows the parsing result of the PDF file, forming an interaction page.
3. The method according to claim 1 or 2, characterized in that step 1' is optionally carried out after step 1 and before step 2:
Step 1': use canvas technology to draw row lines and/or column lines on a borderless table or a table with incomplete borders, completing the rows and/or columns of the PDF table.
4. The method according to any one of claims 1 to 3, characterized in that in step 3,
the link address of the PDF file is also transmitted to the backend server, which loads the PDF file in order to parse the PDF table;
preferably, the coordinate position information of the row lines and/or column lines drawn in step 1' is also transmitted to the backend server for parsing the PDF table.
5. The method according to any one of claims 1 to 4, characterized in that step 4 comprises the following sub-steps:
Step 4-1: load the PDF file on the backend server and parse out the information of each text block in the PDF file;
Step 4-2: in virtual memory, use the text block information obtained in step 4-1 (in particular the coordinate position of each text block) to rearrange the text blocks on the PDF page, forming the text to be processed;
Step 4-3: using the left, right, bottom and top information of the PDF table on the PDF page and the page number of the table in the PDF file, as transmitted in step 3, filter the table to be processed out of the text to be processed;
Step 4-4: divide the cells, drawing the rows and columns of the table and completing the table's ruling lines;
Step 4-5: delete redundant empty rows and/or empty columns, completing the formatting of the PDF table.
6. The method according to any one of claims 1 to 5, characterized in that in step 4-2 rearranging the text blocks means that, according to their coordinate positions, text blocks at the same horizontal position are arranged into a row from left to right and, at the same time, text blocks at the same vertical position are arranged into a column from top to bottom.
7. The method according to any one of claims 1 to 6, characterized in that
step 4-1' is carried out after step 4-1 and before step 4-2:
Step 4-1': split the text blocks from step 4-1, dividing each text block that needs splitting into two or more text blocks;
preferably, in step 4-1' a text block needs to be split when the spacing between adjacent characters within it is greater than a predetermined value;
more preferably, the split is made between the adjacent characters whose spacing is greater than the predetermined value;
wherein the predetermined value is defined as the average size of one Chinese character on the PDF page;
and/or
step 4-3' is optionally carried out after step 4-3 and before step 4-4:
Step 4-3': delete the irregular tables from the tables to be processed obtained in step 4-3.
8. The method according to any one of claims 1 to 7, characterized in that when the PDF table contained in the PDF file of step 1 is a bordered table, cell division is carried out directly in step 4-4;
preferably, cells are divided where adjacent text blocks do not overlap, that is, in the blank space, completing the table's ruling lines.
9. The method according to any one of claims 1 to 7, characterized in that when the PDF table contained in the PDF file of step 1 is a borderless table, in step 4-4 the cells are divided directly from the row-line and column-line coordinates transmitted to the server;
wherein, when the PDF table contained in the PDF file of step 1 is a borderless table, row lines and column lines are drawn in step 1' (completing the table), and their coordinate position information is transmitted to the backend server in step 3.
10. The method according to one of claims 1 to 7, characterized in that, when the PDF table contained in the PDF document of step 1 is a table with incomplete borders, in step 4-4 the cells are first divided using the coordinate position information of the drawn rows and/or columns that was transmitted to the server;
Preferably, the following processing is also performed when dividing cells: the widths of the text blocks in the table to be processed are adjusted so that the cells they form have a consistent width within the same column, which facilitates cell division;
More preferably, the adjustment is carried out as follows:
(1) when adjacent rows of the table to be processed have one or more text blocks aligned in the column direction, the narrower text blocks in each column are extended to the left and right so that the cell widths are consistent;
Wherein, for text blocks whose cells have different widths, the narrower text block must first be lengthened in the row direction so that the left and right positions within the same column are aligned;
(2) when adjacent rows of the table to be processed have no text blocks aligned in the column direction, the text blocks of the row are moved left or right together so that the left or right edge of the row is aligned with the left or right edge of the adjacent row; the narrower text blocks in each column of the row are then extended to the left and right so that the cell widths are consistent; finally, cells are divided where adjacent text blocks do not interlace, that is, at the blank gaps, which completes the table border lines.
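The two adjustment cases above can be sketched roughly as follows. This is a hedged illustration under simplifying assumptions, not the patented procedure: the `Block` class and the functions `equalize_columns` and `shift_row_to_align` are hypothetical names, only horizontal extents are modelled, every row is assumed to hold the same number of blocks, and case (2) is shown for left-edge alignment only.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Block:
    """Horizontal extent of a text block (hypothetical structure)."""
    text: str
    x0: float
    x1: float


def equalize_columns(rows):
    """Case (1): widen every block to the widest extent seen in its column.

    Assumes each row contains the same number of blocks, in column order.
    """
    n_cols = len(rows[0])
    lefts = [min(row[c].x0 for row in rows) for c in range(n_cols)]
    rights = [max(row[c].x1 for row in rows) for c in range(n_cols)]
    return [
        [replace(b, x0=lefts[c], x1=rights[c]) for c, b in enumerate(row)]
        for row in rows
    ]


def shift_row_to_align(row, neighbour):
    """Case (2): move the whole row so its left edge matches the neighbour's."""
    dx = neighbour[0].x0 - row[0].x0
    return [replace(b, x0=b.x0 + dx, x1=b.x1 + dx) for b in row]
```

After shifting a row with `shift_row_to_align`, it can be passed together with its neighbour to `equalize_columns`, mirroring the order of operations described in case (2).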
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810142939.XA CN110147697A (en) | 2018-02-11 | 2018-02-11 | A kind of PDF table extracting method based on man-machine mutual assistance |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110147697A (en) | 2019-08-20 |
Family
ID=67589076
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810142939.XA Pending CN110147697A (en) | 2018-02-11 | 2018-02-11 | A kind of PDF table extracting method based on man-machine mutual assistance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110147697A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112560417A (en) * | 2020-12-24 | 2021-03-26 | 万兴科技集团股份有限公司 | Table editing method and device, computer equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040093355A1 (en) * | 2000-03-22 | 2004-05-13 | Stinger James R. | Automatic table detection method and system |
CN101976232A (en) * | 2010-09-19 | 2011-02-16 | 深圳市万兴软件有限公司 | Method for identifying data form in document and device thereof |
CN107622041A (en) * | 2017-09-18 | 2018-01-23 | 北京神州泰岳软件股份有限公司 | recessive table extracting method and device |
Non-Patent Citations (2)
Title |
---|
ALESZU BAJAK: "How to use Tabula to extract tables from PDFs", 《STORYBENCH website》 * |
CODEPLAYER: "Explanation of how to draw straight lines, polylines, and other lines with HTML5 Canvas" (in Chinese), 《脚本之家》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102855232B (en) | A kind of tabular analysis adapts job operation | |
CN104268127B (en) | A kind of method of electronics shelves layout files reading order analysis | |
CN108595402A (en) | A kind of system of extraction PDF form datas | |
CN102253979A (en) | Vision-based web page extracting method | |
CN104346319B (en) | Method and system for inspecting document style | |
CN106446072B (en) | The treating method and apparatus of web page contents | |
CN103186511A (en) | Method and equipment for word formation of Chinese characters, and method for constructing font object library | |
CN104516867A (en) | Table reordering method and table reordering system | |
CN103064827A (en) | Method and device for extracting webpage content | |
JP5664174B2 (en) | Apparatus and method for extracting circumscribed rectangle of character from portable electronic file | |
US9535888B2 (en) | System, method, software arrangement and computer-accessible medium for a generator that automatically identifies regions of interest in electronic documents for transcoding | |
CN101593186A (en) | Visual web editor method and visual web editor system | |
DE102021001321A1 (en) | Logical grouping of exported text blocks | |
CN106909572A (en) | A kind of construction method and device of question and answer knowledge base | |
US20200364452A1 (en) | A heuristic method for analyzing content of an electronic document | |
CN105320734A (en) | Web page core content extraction method | |
CN106547895B (en) | Webpage information extraction method and device | |
CN105740355B (en) | Webpage context extraction method and device based on aggregation text density | |
CN110688825A (en) | Method for extracting information of table containing lines in layout document | |
CN110147697A (en) | A kind of PDF table extracting method based on man-machine mutual assistance | |
CN112380812A (en) | Method, device, equipment and storage medium for extracting incomplete frame line table of PDF (Portable document Format) | |
CN103970890B (en) | Real-time webpage data generation method and device | |
CN109726369A (en) | A kind of intelligent template questions record Implementation Technology based on normative document | |
CN116311300A (en) | Table generation method, apparatus, electronic device and storage medium | |
CN106294431A (en) | The automatic intercept method of a kind of field and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
AD01 | Patent right deemed abandoned | Effective date of abandoning: 20220614 |