CN110147697A

CN110147697A - A PDF table extraction method based on human-machine collaboration

Info

Publication number: CN110147697A
Application number: CN201810142939.XA
Authority: CN
Inventors: 淡强强; 刘炬光; 陈前力; 吴雪军
Original assignee: Dingfu Data Technology (beijing) Co Ltd
Current assignee: Dingfu Data Technology (beijing) Co Ltd
Priority date: 2018-02-11
Filing date: 2018-02-11
Publication date: 2019-08-20

Abstract

The invention discloses a PDF table extraction method based on human-machine collaboration, the method comprising the following steps: step 1, uploading a PDF file to be parsed to a browser and opening the PDF file; step 2, box-selecting a PDF table region in the PDF page to obtain position information of the PDF table in the PDF page, the position information comprising left information, right information, bottom information and top information; step 3, transmitting to a backend server the left, right, bottom and top information of the PDF table in the PDF page obtained in step 2, together with the page number of the PDF table in the PDF file; step 4, parsing the PDF table on the backend server; step 5, the backend server returning the parsing result to the browser, where it is displayed on the right side of the browser. Because the method extracts PDF tables through human-machine collaboration, it greatly improves extraction accuracy, which can reach almost 100%.

Description

A PDF table extraction method based on human-machine collaboration

Technical field

The present invention relates to the extraction of PDF tables and, in particular, to a PDF table extraction method based on human-machine collaboration.

Background art

PDF is a formal international standard, and what is commonly called a PDF file is a file generated according to this standard. PDF is now very widely used, and all industries favor it for distributing documents: first, it ensures security; second, it ensures consistent formatting.

However, PDF files also bring considerable inconvenience. In particular, tabular data in a PDF cannot be exported directly, which creates difficulty for the many people who collate tabular data from documents. This is especially true in the financial field: for financial reports and industry research reports, researchers need to process tables further, so the tables in a PDF must be converted into a regular row-and-column form (such as an Excel table).

In the prior art, PDF files are mostly processed in offline batches: a program reads the content of the PDF file, parses the text content and text positions, infers the table boundaries, and then extracts the table content. Although efficient, this approach has the following two major problems:

(1) It copes poorly with non-standard tables. For example, a single line of the document may contain several tables with no obvious dividing line; or, as often occurs in Hong Kong stock financial reports, the content is laid out in columns with two tables side by side; or the document itself has a colored background or an image background. In most of these cases the table cannot be extracted, or the extraction result is very poor.

(2) Extraction boundaries are ambiguous. For borderless tables with inconsistent row heights or column widths, pattern matching often fails and rows and columns cannot be divided accurately, so that multiple rows of content are extracted into one row, or multiple columns of content into one column, and accuracy cannot be guaranteed.

Summary of the invention

To overcome the above problems, the present inventors conducted intensive research and extract PDF tables through human-machine collaboration, which greatly improves extraction accuracy, reaching almost 100%, thereby completing the present invention.

The present invention provides a PDF table extraction method based on human-machine collaboration, the method comprising the following steps:

Step 1: upload the PDF file to be parsed to a browser and open the PDF file;

Step 2: box-select the PDF table region in the PDF page to obtain the position information of the PDF table in the PDF page, the position information comprising left information, right information, bottom information and top information;

Step 3: transmit to a backend server the left, right, bottom and top information of the PDF table in the PDF page obtained in step 2, together with the page number of the PDF table in the PDF file;

Step 4: parse the PDF table on the backend server;

Step 5: the backend server returns the parsing result to the browser, where it is displayed on the right side of the browser.

Brief description of the drawings

Fig. 1 is a schematic diagram of the interactive page of the present invention;

Fig. 2 is a schematic diagram illustrating text blocks and cells;

Figs. 3-1 to 3-6 show the results of the PDF table processing in Embodiment 1;

Figs. 4-1 and 4-2 show, respectively, the original table processed in Embodiment 2 and the processed table;

Figs. 5-1 to 5-3 show the results of the PDF table processing in Embodiment 3.

Detailed description of the embodiments

The present invention is described in more detail below with reference to the drawings. Through these descriptions, the features and advantages of the invention will become clearer.

One aspect of the present invention provides a PDF table extraction method based on human-machine collaboration, the method comprising the following steps:

Step 1: upload the PDF file to be parsed to a browser and open the PDF file.

Here, the PDF file refers to a PDF file that is not in image format and that contains a PDF table. PDF tables include bordered tables, borderless tables and tables with incomplete borders. A bordered table is a complete table containing all row and column information; a borderless table is a table without any row or column lines; a table with incomplete borders is a table in which some row or column lines are missing.

According to a preferred embodiment of the present invention, in step 1, the left side of the browser displays the original PDF and the right side of the browser displays the parsing result of the PDF file, forming an interactive page (as shown in Fig. 1, Fig. 3-6 and Fig. 5-3).

In a further preferred embodiment, the interactive page is implemented with HTML + CSS + JavaScript.

In a still further preferred embodiment, in step 1, after the PDF file is uploaded to the browser, it is rendered by the browser's rendering engine and converted in the browser into an HTML view layer and a canvas view layer.

Here, rendering means that, after the web page loads the PDF file, the browser's rendering engine converts the PDF file into files or elements that the browser can handle. Through rendering, the PDF region is divided into two view layers. The upper layer is the HTML view layer, which contains the content and coordinate information of the characters and digits in the PDF file; the lower layer is the canvas view layer, which contains graphical information such as the background color and ruling lines of the tables in the PDF file.

Step 2: box-select the PDF table region in the PDF page to obtain the position information of the PDF table in the PDF page, the position information comprising left information, right information, bottom information and top information.

Here, the table to be parsed is box-selected with the mouse in the PDF region. The browser obtains the click position and the extent of the mouse drag, from which the selected region, and hence the range of the region to be parsed, is determined. As shown in Fig. 1, the left information is the distance from the left edge of the PDF table to the left edge of the PDF page; the right information is the distance from the right edge of the PDF table to the left edge of the PDF page; the bottom information is the distance from the bottom edge of the PDF table to the bottom edge of the PDF page; and the top information is the distance from the top edge of the PDF table to the bottom edge of the PDF page.
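The four distances defined above can be illustrated with a minimal Python sketch. It assumes that the mouse selection is reported in pixels with the origin at the top-left corner of the rendered page, and that a single zoom factor relates pixels to page units; the names `PdfRegion`, `selection_to_region`, `page_height_px` and `scale` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class PdfRegion:
    left: float    # table's left edge, measured from the page's left edge
    right: float   # table's right edge, measured from the page's left edge
    bottom: float  # table's bottom edge, measured from the page's bottom edge
    top: float     # table's top edge, measured from the page's bottom edge


def selection_to_region(x0, y0, x1, y1, page_height_px, scale=1.0):
    """Convert a mouse-drag rectangle (pixels, origin top-left) to page units.

    Assumption: the rendered page is page_height_px pixels tall and `scale`
    is the number of pixels per page unit.
    """
    x_min, x_max = sorted((x0, x1))
    y_min, y_max = sorted((y0, y1))
    return PdfRegion(
        left=x_min / scale,
        right=x_max / scale,
        # Screen y grows downward, page y grows upward, so flip the axis.
        bottom=(page_height_px - y_max) / scale,
        top=(page_height_px - y_min) / scale,
    )


# The drag may start at any corner; sorting normalizes the rectangle.
region = selection_to_region(300, 500, 100, 200, page_height_px=1000, scale=2.0)
print(region)
```

The axis flip reflects the fact that both the bottom and the top information are measured from the bottom edge of the page, whereas screen coordinates usually grow downward.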

In the present invention, all tables in the PDF file can be parsed; preferably, only the table in the selected region is parsed, so that the required table can be extracted quickly and in a targeted manner.

According to a preferred embodiment of the present invention, step 1' is optionally carried out after step 1 and before step 2:

Step 1': draw row lines and/or column lines on the borderless table or the table with incomplete borders using canvas technology, to complete the rows and/or columns of the PDF table.

Here, when the PDF table is a borderless table or a table with incomplete borders, the borders of the table must be completed in the browser using canvas technology after step 1, and step 2 is then carried out. When the PDF table is a bordered table, step 1' is not needed and step 2 is carried out directly after step 1.

In a further preferred embodiment, in step 1', row lines and/or column lines are drawn on the borderless table or the table with incomplete borders in the browser by moving the mouse, using canvas technology.

Here, as the mouse moves, the browser tracks its position, and the canvas draws a line along the positions through which the mouse moves.

Step 3: transmit to the backend server the left, right, bottom and top information of the PDF table in the PDF page obtained in step 2, together with the page number of the PDF table in the PDF file.

According to a preferred embodiment of the present invention, in step 3, the link address of the PDF file is also transmitted to the backend server, and the PDF file is loaded on the backend server in order to parse the PDF table.

In a further preferred embodiment, in step 3, the coordinate position information of the row lines and/or column lines drawn in step 1' is also transmitted to the backend server for parsing the PDF table.
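As an illustration only, the data sent in step 3 could be gathered into a single request object. The patent does not specify a wire format; the JSON encoding, the class name `ExtractionRequest`, all field names and the example URL below are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ExtractionRequest:
    pdf_url: str   # link address of the PDF file (hypothetical field name)
    page: int      # page number of the table within the PDF file
    left: float
    right: float
    bottom: float
    top: float
    # Coordinates of lines drawn in step 1'; empty for bordered tables.
    row_lines: List[float] = field(default_factory=list)
    column_lines: List[float] = field(default_factory=list)


request = ExtractionRequest(
    pdf_url="https://example.com/report.pdf",  # placeholder URL
    page=3,
    left=50.0,
    right=150.0,
    bottom=250.0,
    top=400.0,
    column_lines=[90.0, 120.0],
)
body = json.dumps(asdict(request))  # payload to send to the backend server
print(body)
```

Leaving `row_lines` and `column_lines` empty corresponds to the bordered-table case, in which no drawn-line coordinates are transmitted.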

Step 4: parse the PDF table on the backend server.

According to a preferred embodiment of the present invention, step 4 comprises the following sub-steps:

Step 4-1: load the PDF file on the backend server and parse the information of each text block in the PDF file.

Here, the concept of a text block in the present invention derives from the English term "block" and is an established concept in the field. It refers to a fixed region on the display of a PDF file, which may contain multiple characters and other related information. For the concept of a block, reference may be made to the PDF specification issued by the International Organization for Standardization, document number ISO 32000-1:2008.

Specifically, a text block is formed by continuous text, and the concept differs from that of a cell: a cell may contain no text block (when it is empty), one text block (when the text in the cell has no line break), or multiple text blocks (when the text in the cell has line breaks). For example, as shown in Fig. 2, a marks a text block and b marks a cell, and the cell marked b contains multiple text blocks.

According to a preferred embodiment of the present invention, the information of a text block comprises all the character information in the text block and the coordinate position information of the text block in the PDF page, the coordinates being absolute position coordinates in the PDF page.

Preferably, in step 4-1, the text blocks are parsed out according to the PDF specification.

Step 4-2: in virtual memory, use the information of the text blocks obtained in step 4-1 (in particular their coordinate position information) to rearrange the text blocks in the PDF page, forming the text to be processed.

According to a preferred embodiment of the present invention, in step 4-2, rearranging the text blocks means that, according to the coordinate position information of the text blocks, the text blocks located at the same horizontal position are arranged in a row from left to right and, at the same time, the text blocks located at the same vertical position are arranged in a column from top to bottom.

Here, the coordinate position information refers to the absolute position coordinates of a text block in the PDF page. Accurate text block positions can only be obtained through the rearrangement described above; they cannot be obtained directly from the order in which the text blocks are read from the PDF file, because in the PDF format the footer information is recorded at the front, which would introduce significant errors into the reading result.
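The rearrangement can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the `TextBlock` type, the `tolerance` used to decide that two blocks lie on the same horizontal position, and the use of the bottom edge `y0` as the row key are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    text: str
    x0: float  # left edge, absolute page coordinate
    y0: float  # bottom edge, absolute page coordinate (y grows upward)


def arrange_into_rows(blocks: List[TextBlock], tolerance: float = 2.0) -> List[List[TextBlock]]:
    """Group blocks into rows by position, ignoring the order they were read in."""
    rows: List[List[TextBlock]] = []
    # Larger y means higher on the page, so sort descending for top-to-bottom.
    for block in sorted(blocks, key=lambda b: -b.y0):
        if rows and abs(rows[-1][0].y0 - block.y0) <= tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    for row in rows:
        row.sort(key=lambda b: b.x0)  # left to right within a row
    return rows


# The footer is read first, as the text above warns, yet ends up last.
read_order = [
    TextBlock("Page 1", 50, 20),
    TextBlock("Score", 150, 700),
    TextBlock("Name", 50, 700),
    TextBlock("90", 150, 681),
    TextBlock("Alice", 50, 680.5),
]
for row in arrange_into_rows(read_order):
    print([block.text for block in row])
```

Because rows are built purely from coordinates, a footer that appears first in the file's reading order is placed at the bottom of the arranged text.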

According to a preferred embodiment of the present invention, step 4-1' is carried out after step 4-1 and before step 4-2:

Step 4-1': split the text blocks obtained in step 4-1, dividing each text block that needs splitting into two or more text blocks.

In a further preferred embodiment, in step 4-1', a text block needs to be split when the spacing between adjacent characters within it is greater than a predetermined value.

Here, in the present invention, the predetermined value is defined as the average size of one Chinese character in the PDF page.

In a still further preferred embodiment, the split is made between the adjacent characters whose spacing is greater than the predetermined value.

Here, if a text block contains two or more such pairs of adjacent characters, the text block must be split two or more times, so that none of the resulting text blocks still needs splitting.

In the present invention, splitting text blocks in step 4-1' improves the precision and recall of the finally generated table.
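The splitting rule can be sketched as below. The representation of a character as a `(character, left x, right x)` tuple and the function name `split_block` are assumptions made for illustration; in practice the threshold would be the average size of one Chinese character on the page, as stated above.

```python
from typing import List, Tuple

Char = Tuple[str, float, float]  # (character, left x, right x); hypothetical layout


def split_block(chars: List[Char], threshold: float) -> List[str]:
    """Split one text block wherever the gap between neighbours exceeds threshold."""
    pieces: List[str] = []
    current = ""
    previous_right = None
    for ch, left, right in chars:
        if previous_right is not None and left - previous_right > threshold:
            pieces.append(current)  # close the piece before the wide gap
            current = ""
        current += ch
        previous_right = right
    if current:
        pieces.append(current)
    return pieces


# Two wide gaps (25 units each) produce three text blocks.
chars = [("A", 0, 10), ("B", 10, 20), ("C", 45, 55), ("D", 55, 65), ("E", 90, 100)]
print(split_block(chars, threshold=10))
```

A single pass handles any number of wide gaps, so no resulting piece still contains a gap above the threshold.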

Step 4-3: using the left, right, bottom and top information of the PDF table in the PDF page and the page number of the PDF table in the PDF file transmitted in step 3, filter out the table to be processed from the text to be processed.

Here, the table to be processed in the PDF page can be filtered out accurately according to the information about the PDF table in the PDF page that was transmitted to the backend server in step 3.
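A minimal sketch of this filtering step follows. It assumes that each text block is a dictionary with `page`, `x0`, `y0`, `x1` and `y1` keys and that a block is kept only when it lies entirely inside the selected region; both the data layout and the containment rule are assumptions, since the patent does not state how blocks straddling the boundary are treated.

```python
from typing import Dict, List


def filter_blocks(blocks: List[Dict], page: int, left: float, right: float,
                  bottom: float, top: float) -> List[Dict]:
    """Keep the text blocks on the given page that lie fully inside the region."""
    return [
        b for b in blocks
        if b["page"] == page
        and left <= b["x0"] and b["x1"] <= right
        and bottom <= b["y0"] and b["y1"] <= top
    ]


blocks = [
    {"text": "Name", "page": 3, "x0": 60, "y0": 380, "x1": 90, "y1": 392},
    {"text": "Heading", "page": 3, "x0": 60, "y0": 450, "x1": 120, "y1": 465},  # above the region
    {"text": "Name", "page": 4, "x0": 60, "y0": 380, "x1": 90, "y1": 392},      # wrong page
]
selected = filter_blocks(blocks, page=3, left=50, right=150, bottom=250, top=400)
print([b["text"] for b in selected])
```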

According to a preferred embodiment of the present invention, step 4-3' is optionally carried out after step 4-3 and before step 4-4:

Step 4-3': delete the irregular tables from the tables to be processed obtained in step 4-3.

Here, irregular tables include tables with 2 rows or fewer and tables with 2 columns or fewer. This step effectively removes redundancy such as flow charts from the tables to be processed. Irregular tables can also be screened out according to the distance between adjacent text blocks in the same row, which improves the accuracy of the tables to be processed.
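The row and column count rule might be sketched as follows. The thresholds are parameters because the exact boundary (whether a table with exactly 2 rows or 2 columns counts as irregular) is an interpretation of the text; the names `is_irregular` and `drop_irregular` are hypothetical, and the distance-based screening mentioned above is not shown.

```python
from typing import List

Table = List[List[str]]  # a table as a list of rows of cell text


def is_irregular(table: Table, max_rows: int = 2, max_columns: int = 2) -> bool:
    """Treat a table as irregular if it has too few rows or too few columns."""
    row_count = len(table)
    column_count = max((len(row) for row in table), default=0)
    return row_count <= max_rows or column_count <= max_columns


def drop_irregular(tables: List[Table]) -> List[Table]:
    return [table for table in tables if not is_irregular(table)]


flow_chart = [["Start"], ["Check"], ["End"]]  # 3 rows, 1 column
grades = [["Name", "Math", "Art"], ["Alice", "90", "85"], ["Bob", "80", "95"]]
print(len(drop_irregular([flow_chart, grades])))
```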

Step 4-4: carry out cell division, drawing the rows and columns of the table and completing the table ruling lines.

According to a preferred embodiment of the present invention, when the PDF table contained in the PDF file of step 1 is a bordered table, cell division is carried out directly in step 4-4.

Here, a bordered table is a table containing all row and column information. When the PDF table is a bordered table, no row or column drawing is performed in step 1', and therefore no coordinate position information of drawn rows and/or columns is transmitted in step 3.

In a further preferred embodiment, cells are divided in the blank spaces where adjacent text blocks do not overlap, completing the table ruling lines.

Here, because the PDF table already had complete ruling lines, the adjustment of text block widths in the table to be processed, described later, is not needed when dividing its cells.

According to a preferred embodiment of the present invention, when the PDF table contained in the PDF file of step 1 is a borderless table, row lines and column lines are drawn in step 1' (completing the table), and the coordinate position information of the drawn rows and columns is transmitted to the backend server in step 3.

Here, a borderless table is a table without any row or column lines.

In a further preferred embodiment, in step 4-4, cell division is carried out directly using the coordinate position information of the drawn rows and columns that was transmitted to the server.
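For the borderless case, dividing cells from the user-drawn lines can be pictured as assigning each text block to the grid cell that contains its center. The sketch below assumes interior lines only (the outer border is the selected region), page coordinates with y growing upward, and text blocks given as `(text, x_center, y_center)` tuples; these conventions and the name `build_grid` are assumptions.

```python
from bisect import bisect_right
from typing import List, Tuple

Block = Tuple[str, float, float]  # (text, x_center, y_center); hypothetical layout


def build_grid(blocks: List[Block], row_lines: List[float],
               column_lines: List[float]) -> List[List[str]]:
    """Place each block into the cell bounded by the drawn row and column lines."""
    cols = sorted(column_lines)  # x coordinates of drawn vertical lines
    rows = sorted(row_lines)     # y coordinates of drawn horizontal lines
    grid = [["" for _ in range(len(cols) + 1)] for _ in range(len(rows) + 1)]
    for text, x_center, y_center in blocks:
        col = bisect_right(cols, x_center)
        # y grows upward, so the band above the highest line is row 0.
        row = len(rows) - bisect_right(rows, y_center)
        grid[row][col] = (grid[row][col] + " " + text).strip()
    return grid


blocks = [("Name", 50, 520), ("Score", 150, 520), ("Alice", 50, 480), ("90", 150, 480)]
for row in build_grid(blocks, row_lines=[500], column_lines=[100]):
    print(row)
```

Text blocks that fall into the same cell are joined with a space, which mirrors the case where one cell contains several text blocks.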

According to a preferred embodiment of the present invention, when the PDF table contained in the PDF file of step 1 is a table with incomplete borders, row lines and/or column lines are drawn in step 1' (completing the table), and the coordinate position information of the drawn rows and/or columns is transmitted to the backend server in step 3.

Here, a table with incomplete borders is a table in which some row or column lines are missing; therefore only some rows and/or columns are drawn.

In a further preferred embodiment, in step 4-4, partial cell division is first carried out using the coordinate position information of the drawn rows and/or columns that was transmitted to the server.

Here, because the object being processed is a table with incomplete borders, only the missing rows and/or columns are drawn in step 1'. What step 3 transmits to the backend server is therefore only those (missing) rows and/or columns of the table, and cell division based on them can only divide part of the cells.

In a still further preferred embodiment, when the PDF table contained in the PDF file of step 1 is a table with incomplete borders, the following processing is also carried out during cell division: the widths of the text blocks in the table to be processed are adjusted so that the cells they form have a consistent width within the same column, which facilitates cell division.

Here, differences in text block width between rows of the same column affect the subsequent cell division. Preferably, in the adjustment of step 4-4, the character size and character spacing within a text block remain unchanged and the characters are centered.

In a still further preferred embodiment, the adjustment is carried out as follows:

(1) When, in the table to be processed, one or more text blocks in adjacent rows are aligned in the column direction, the narrower text blocks in each column are extended to the left and right so that the cell widths are consistent;

Here, for text blocks whose cells differ in width, the narrower text blocks must first be extended along the row direction so that the left and right positions within the same column are aligned;

(2) When, in the table to be processed, no text blocks in adjacent rows are aligned in the column direction, the text blocks of the row are shifted left or right together so that the left or right end of the row is aligned with the left or right end of the adjacent row; the narrower text blocks in each column of the row are then extended to the left and right so that the cell widths are consistent; finally, cells are divided in the blank spaces where adjacent text blocks do not overlap, completing the table ruling lines.

Here, the adjustment reduces the number of lines in the finally generated table, which both reduces the workload and improves efficiency, and also improves the accuracy of the final result.
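The idea of dividing cells in the blank spaces between columns of text blocks can be approximated by projecting all text blocks onto the horizontal axis and placing a separator in every gap that no block covers. This is a simplified stand-in for case (1) of the adjustment, not the method as claimed: merging overlapping horizontal extents has an effect similar to widening the narrower blocks of a column to the width of the widest one. The row shifting of case (2) is not modelled, and the function name and data layout are assumptions.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (left x, right x) of one text block


def column_separators(rows: List[List[Interval]]) -> List[float]:
    """Return x positions of vertical lines placed in gaps no text block covers."""
    intervals = sorted(interval for row in rows for interval in row)
    separators: List[float] = []
    if not intervals:
        return separators
    covered_right = intervals[0][1]
    for left, right in intervals[1:]:
        if left > covered_right:
            # Blank space between two column extents: put the line in the middle.
            separators.append((covered_right + left) / 2)
        covered_right = max(covered_right, right)
    return separators


rows = [
    [(10, 40), (60, 90)],   # first row: two text blocks
    [(12, 30), (65, 100)],  # second row: narrower and wider blocks in the same columns
]
print(column_separators(rows))
```

Each separator is placed midway between the right edge of one merged column extent and the left edge of the next.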

Step 4-5: delete redundant empty rows and/or empty columns, completing the formatting of the PDF table.

In this way, the finally obtained table is simple and neat.
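Removing empty rows and columns can be sketched as follows, assuming the table is a rectangular list of rows of cell text; the function name `drop_empty` is hypothetical.

```python
from typing import List


def drop_empty(grid: List[List[str]]) -> List[List[str]]:
    """Remove rows and columns whose cells are all empty or whitespace."""
    rows = [row for row in grid if any(cell.strip() for cell in row)]
    if not rows:
        return []
    # Assumes every row has the same number of cells.
    keep = [i for i in range(len(rows[0])) if any(row[i].strip() for row in rows)]
    return [[row[i] for i in keep] for row in rows]


grid = [
    ["Name", "", "Score"],
    ["", "", ""],
    ["Alice", "", "90"],
]
for row in drop_empty(grid):
    print(row)
```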

In the present invention, the coordinate position information refers to absolute coordinate positions in the PDF page.

The beneficial effects of the present invention include:

(1) The method of the present invention extracts PDF tables through human-machine interaction and can extract them accurately; extraction accuracy can reach 100%;

(2) The method of the present invention can process bordered tables, borderless tables and tables with incomplete borders;

(3) The method of the present invention extracts and parses PDF tables to obtain an editable form, such as Excel;

(4) The method of the present invention is simple and easy to implement.

Embodiments

Embodiment 1: Extraction of a borderless PDF table

The content of the PDF file is the midterm examination results of Class 1 of a certain grade at a school. The table is a borderless table, as shown in Fig. 3-1. The table in PDF format is extracted as follows:

(1) The PDF file is uploaded to the website, and the browser's rendering engine renders the PDF file (as shown in Fig. 3-2);

(2) The PDF table region is box-selected (as shown in Fig. 3-3) to obtain the left, right, bottom and top information of the PDF table; in addition, row lines and column lines are drawn on the borderless table using canvas technology (as shown in Fig. 3-4) to complete the rows and columns of the PDF table, giving the coordinate position information of the drawn rows and columns;

(3) The left, right, bottom and top information of the PDF table, the page number of the PDF table in the PDF file, and the coordinate position information of the drawn rows and columns are transmitted to the backend server;

(4) The PDF table is parsed on the backend server; the parsing result is shown in Fig. 3-5;

(5) The backend server returns the parsing result to the browser, where it is displayed on the right side of the browser, as shown in Fig. 3-6.

Embodiment 2: Extraction of a borderless PDF table

The borderless PDF table shown in Fig. 4-1 is extracted using the method of the present invention; after processing, the table shown in Fig. 4-2 is obtained.

Embodiment 3: Extraction of a table with incomplete borders

Content is extracted from a PDF table with incomplete borders as follows:

(1) The PDF file is uploaded to the website, and the browser's rendering engine renders the PDF file (as shown in Fig. 5-1);

(2) The PDF table region is box-selected (as shown in Fig. 5-1) to obtain the left, right, bottom and top information of the PDF table; in addition, column lines are drawn on the table with incomplete borders using canvas technology (as shown in Fig. 5-2) to complete the rows and columns of the PDF table, giving the coordinate position information of the drawn columns;

(3) The left, right, bottom and top information of the PDF table, the page number of the PDF table in the PDF file, and the coordinate position information of the drawn columns are transmitted to the backend server;

(4) The PDF table is parsed on the backend server;

(5) The backend server returns the parsing result to the browser, where it is displayed on the right side of the browser, as shown in Fig. 5-3.

The present invention has been described above in connection with preferred embodiments, but these embodiments are merely exemplary and serve only an illustrative purpose. On this basis, various substitutions and improvements can be made to the present invention, all of which fall within the scope of protection of the present invention.

Claims

1. A PDF table extraction method based on human-machine collaboration, the method comprising the following steps:

Step 1: uploading a PDF file to be parsed to a browser and opening the PDF file;

Step 2: box-selecting a PDF table region in the PDF page to obtain position information of the PDF table in the PDF page, the position information comprising left information, right information, bottom information and top information;

Step 3: transmitting to a backend server the left, right, bottom and top information of the PDF table in the PDF page obtained in step 2, together with the page number of the PDF table in the PDF file;

Step 4: parsing the PDF table on the backend server;

Step 5: the backend server returning the parsing result to the browser, where it is displayed on the right side of the browser.

2. The method according to claim 1, characterized in that, in step 1, the left side of the browser displays the original PDF and the right side of the browser displays the parsing result of the PDF file, forming an interactive page.

3. The method according to claim 1 or 2, characterized in that step 1' is optionally carried out after step 1 and before step 2:

Step 1': drawing row lines and/or column lines on a borderless table or a table with incomplete borders using canvas technology, to complete the rows and/or columns of the PDF table.

4. The method according to any one of claims 1 to 3, characterized in that, in step 3,

the link address of the PDF file is also transmitted to the backend server, and the PDF file is loaded on the backend server in order to parse the PDF table;

preferably, the coordinate position information of the row lines and/or column lines drawn in step 1' is also transmitted to the backend server for parsing the PDF table.

5. The method according to any one of claims 1 to 4, characterized in that step 4 comprises the following sub-steps:

Step 4-1: loading the PDF file on the backend server and parsing the information of each text block in the PDF file;

Step 4-2: in virtual memory, using the information of the text blocks obtained in step 4-1 (in particular their coordinate position information) to rearrange the text blocks in the PDF page, forming the text to be processed;

Step 4-3: using the left, right, bottom and top information of the PDF table in the PDF page and the page number of the PDF table in the PDF file transmitted in step 3, filtering out the table to be processed from the text to be processed;

Step 4-4: carrying out cell division, drawing the rows and columns of the table and completing the table ruling lines;

Step 4-5: deleting redundant empty rows and/or empty columns, completing the formatting of the PDF table.

6. The method according to any one of claims 1 to 5, characterized in that, in step 4-2, rearranging the text blocks means that, according to the coordinate position information of the text blocks, the text blocks located at the same horizontal position are arranged in a row from left to right and, at the same time, the text blocks located at the same vertical position are arranged in a column from top to bottom.

7. The method according to any one of claims 1 to 6, characterized in that

step 4-1' is carried out after step 4-1 and before step 4-2:

Step 4-1': splitting the text blocks obtained in step 4-1, dividing each text block that needs splitting into two or more text blocks;

preferably, in step 4-1', a text block needs to be split when the spacing between adjacent characters within it is greater than a predetermined value;

more preferably, the split is made between the adjacent characters whose spacing is greater than the predetermined value;

wherein the predetermined value is defined as the average size of one Chinese character in the PDF page;

and/or

step 4-3' is optionally carried out after step 4-3 and before step 4-4:

Step 4-3': deleting the irregular tables from the tables to be processed obtained in step 4-3.

8. The method according to any one of claims 1 to 7, characterized in that, when the PDF table contained in the PDF file of step 1 is a bordered table, cell division is carried out directly in step 4-4;

preferably, cells are divided in the blank spaces where adjacent text blocks do not overlap, completing the table ruling lines.

9. The method according to any one of claims 1 to 7, characterized in that, when the PDF table contained in the PDF file of step 1 is a borderless table, in step 4-4, cell division is carried out directly using the coordinate position information of the drawn rows and columns that was transmitted to the server;

wherein, when the PDF table contained in the PDF file of step 1 is a borderless table, row lines and column lines are drawn in step 1' (completing the table), and the coordinate position information of the drawn rows and columns is transmitted to the backend server in step 3.

10. The method according to any one of claims 1 to 7, characterized in that, when the PDF table contained in the PDF file of step 1 is a table with incomplete borders, in step 4-4, partial cell division is first carried out using the coordinate position information of the drawn rows and/or columns that was transmitted to the server;

preferably, the following processing is also carried out during cell division: the widths of the text blocks in the table to be processed are adjusted so that the cells they form have a consistent width within the same column, which facilitates cell division;

more preferably, the adjustment is carried out as follows:

(1) when, in the table to be processed, one or more text blocks in adjacent rows are aligned in the column direction, the narrower text blocks in each column are extended to the left and right so that the cell widths are consistent;

(2) when, in the table to be processed, no text blocks in adjacent rows are aligned in the column direction, the text blocks of the row are shifted left or right together so that the left or right end of the row is aligned with the left or right end of the adjacent row; the narrower text blocks in each column of the row are then extended to the left and right so that the cell widths are consistent; finally, cells are divided in the blank spaces where adjacent text blocks do not overlap, completing the table ruling lines.