CN108132916A

CN108132916A - Method and storage medium for parsing PDF table data

Info

Publication number: CN108132916A
Application number: CN201711235867.5A
Authority: CN
Inventors: 蓝树和; 段涵瑞; 薛艳英; 江汉祥
Original assignee: Xiamen Meiya Pico Information Co Ltd
Current assignee: Xiamen Meiya Pico Information Co Ltd
Priority date: 2017-11-30
Filing date: 2017-11-30
Publication date: 2018-06-08
Anticipated expiration: 2037-11-30
Also published as: CN108132916B

Abstract

The present invention provides a method and a storage medium for parsing PDF table data. The method includes: obtaining the coordinates of each line segment and of each character on every page of a PDF; dividing the page into cells according to the intersections of the line segments, and obtaining the rectangle coordinates of each cell; and obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates. By relating the coordinates of the line segments and the characters, the invention accurately delimits the cells and the characters inside them, extracts the tables and their data from the PDF, and thereby parses PDF tables accurately, conveniently and automatically.

Description

Method and storage medium for parsing PDF table data

Technical field

The present invention relates to the field of data parsing, and in particular to a method and a storage medium for parsing PDF table data.

Background art

Existing PDF parsers generally target text. A table inside a PDF is only a visual construct: there is no actual table object, and each cell is merely delimited by line segments. The PDF format records only the position information of the text, line segments, pictures and so on.

Existing parsers extract the text, but table data requires each value to be matched strictly to its column header. PDF has peculiarities that make this hard, such as tables that continue across consecutive pages, unpredictable line breaks within a single cell, and watermarks. Simply splitting on characters is impractical: for every table format one would first have to analyze its distinguishing features and then write a matching script to import it into a database, and the workload is enormous. Automatically extracting PDF table data and storing it in a database is therefore difficult to achieve.

Moreover, the PDF parsers currently on the market are mostly closed-source and treat such table data as plain characters, so it is difficult to match data to headers and difficult to determine the relationships between one row of data and another.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method and a storage medium for parsing PDF table data that parse table data fully automatically and accurately and are highly practical.

To solve the above technical problem, the technical solution adopted by the present invention is:

A method for parsing PDF table data, including:

obtaining the coordinates of each line segment and of each character on every page of the PDF;

dividing the page into cells according to the intersections of the line segments, and obtaining the rectangle coordinates of each cell;

obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates.

Another technical solution provided by the present invention is:

A computer-readable storage medium on which a computer program is stored, the program implementing the following steps when executed by a processor:

obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates.

The beneficial effects of the present invention are as follows. It provides a vision-based method for parsing PDF table data that does not require analyzing, for each specific PDF file, how the fields are delimited, and does not require the table header to be identified. It parses and organizes the field block data automatically and accurately, and is widely applicable. Specifically, the cells and the characters inside them are accurately delimited from the relationship between the coordinates of the line segments and the characters, so the tables and their data are extracted from the PDF accurately and with a high degree of automation, which greatly simplifies the import of PDF tables. The present invention can thus greatly improve the accuracy and convenience of parsing PDF table data.

Description of the drawings

Fig. 1 is a schematic diagram of a PDF table in single-table form;

Fig. 2 is a schematic diagram of randomly placed blank cells;

Fig. 3 is a schematic diagram of a cell split across pages;

Fig. 4 is a schematic diagram of a table with multi-layer watermarks;

Fig. 5 is a flow diagram of the method for parsing PDF table data of the present invention;

Fig. 6 is a schematic diagram of a line-segment intersection;

Fig. 7 is a schematic diagram of the line segments that form a valid cell;

Fig. 8 is a flow diagram of embodiment one.

Detailed description of the embodiments

To explain the technical content, objectives and effects of the present invention in detail, the following description is given with reference to the embodiments and the accompanying drawings.

The key idea of the present invention is to delimit the cells and the characters inside them accurately from the relationship between the coordinates of the line segments and the characters, to extract the tables and their data from the PDF accurately, and thereby to parse PDF tables accurately, conveniently and automatically.

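Referring to Fig. 5, the present invention provides a method for parsing PDF table data, including the three steps set out in the summary above: obtaining the coordinates of each line segment and of each character on every page of the PDF; dividing the page into cells according to the intersections of the line segments and obtaining the rectangle coordinates of each cell; and obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates.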

Further, the method also includes:

determining the cells that belong to each row according to the center lines of the cells.

As can be seen from the above, whether cells lie in the same row is decided by whether the difference between their center lines falls within an error range. This regularizes the cells and yields a neatly aligned table.

Further, the method also includes:

converting every page of the PDF into image data;

if, after the adjoining preceding and following pages are progressively brought together along the Y axis, corresponding vertical line segments can be obtained and a horizontal line segment can be found on each of those vertical line segments, merging the cells at the junction of the two pages.

As can be seen from the above, the relationship between cells that adjoin across two pages can be analyzed from visual features of the images, to decide whether they are one cell that was split by pagination and, if so, to merge them. Split cells are thus merged automatically and accurately.

Further, the step of merging the cells at the junction of the two pages, if corresponding vertical line segments can be obtained after the adjoining pages are progressively brought together along the Y axis and a horizontal line segment can be found on each of those vertical line segments, is specifically:

presetting the upper-left corner of every PDF page as the coordinate origin;

on the current page, starting from the maximum Y value and advancing toward the origin, obtaining a vertical line segment and then judging whether a horizontal line segment intersecting it exists on that vertical line segment; and, at the same time,

on the next page, starting from Y = 0 and advancing toward the maximum value, obtaining a vertical line segment and then judging whether a horizontal line segment intersecting it exists on that vertical line segment;

if so, merging the cell corresponding to the adjacent vertical line segment on the current page and the corresponding cell on the next page into a single cell.

As can be seen from the above, a vision algorithm can be used to judge the relationship between cells on adjacent PDF pages and to merge automatically a cell that was split, which further improves the final table.

Further, obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates is specifically:

obtaining the characters belonging to each non-blank rectangle according to whether the character coordinates lie within the rectangle coordinates;

excluding the watermark characters from each non-blank rectangle according to the matrix coefficients that map each character from the PDF coordinate space to the user visual space;

composing a field block from the characters of each non-blank rectangle, and supplying a blank field for each blank rectangle;

obtaining the field block of each cell.

As can be seen from the above, watermark characters are removed effectively, which ensures the accuracy of the parsed table. In addition, a blank field is assigned to every blank cell, so that blank cells stay aligned with their headers. The completeness and accuracy of the final table are thereby ensured.

Further, obtaining the coordinates of each line segment and of each character on every page of the PDF is specifically:

rendering the line segments and characters of every PDF page to a CImage handle, and capturing the coordinates of each line segment and each character during rendering.

As can be seen from the above, rendering the line segments and characters to a CImage handle turns the structured PDF data into image data that is easy to analyze and process. Subsequent detection and analysis can work directly on the image data to obtain the features of the line segments and characters and, from them, the required data.

Further, dividing the page into cells according to the intersections of the line segments and obtaining the rectangle coordinates of each cell is specifically:

if the distance between an endpoint of one line segment and an endpoint of another line segment is within a preset first threshold range, judging that the two line segments intersect;

if four adjacent line segments intersect end to end in sequence and the region they enclose is larger than a preset second threshold range, obtaining the coordinates of the four line segments and marking them as the rectangle coordinates of the cell formed by the four line segments.

As can be seen from the above, because coordinates in PDF user space are floating-point values, whether two line segments intersect is decided by whether the distance between the two corresponding points falls within a threshold range. This makes it possible to divide the cells accurately from the intersections afterwards.

Another technical solution provided by the invention is：

Further, the program can also implement the following steps:

converting every page of the PDF into image data;

determining the cells that belong to each row according to the center lines of the cells;

Further, the step of obtaining the coordinates of each line segment and of each character on every page of the PDF is specifically:

rendering the line segments and characters of every PDF page to a CImage handle, and capturing the coordinates of each line segment and each character during rendering;

the step of dividing the page into cells according to the intersections of the line segments and obtaining the rectangle coordinates of each cell is specifically:

if four adjacent line segments intersect end to end in sequence and the region they enclose is larger than a preset second threshold range, obtaining the coordinates of the four line segments and marking them as the rectangle coordinates of the cell formed by the four line segments;

the step of obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates is specifically:

obtaining the field block of each cell.

Embodiment one

This embodiment provides a method for parsing PDF table data. It is suited to parsing the tables in PDF-format data and obtaining the corresponding table data for subsequent editing. For example, in front-end data cleaning, a large proportion of the lists and bills provided by clients are PDFs in tabular form; with this embodiment, such tabular PDFs can be extracted into CSV format and imported automatically into a database for analysis.

Figs. 1 to 4 show several PDF table forms that are common today. Specifically, Fig. 1 shows a single table; Fig. 2 shows randomly placed blank cells; Fig. 3 shows a cell split across pages; and Fig. 4 shows a table with multi-layer watermarks. Existing PDF table parsers are mostly closed-source and treat such table data as plain characters, so it is difficult to match data to headers and even more difficult to determine the relationships between rows.

To address these problems, this embodiment uses several specific implementations, each handling the parsing of a different table form.

Referring to Fig. 8, the method for parsing PDF table data of this embodiment includes:

S1: Convert every page of the PDF into image data, taking the upper-left corner of every page as the coordinate origin, and obtain the coordinates of each line segment and of each character on every page.

This step specifically includes:

S101: Load the PDF file and loop to obtain the object of each page. The object is a pointer to a PDF page and is used to fetch the data of each page in order.

S102: Render the line segments and characters of each PDF page to a CImage handle, taking the upper-left corner of each page's image data as the coordinate origin, and capture the coordinates of each line segment and each character during rendering.

Rendering to a CImage handle serves the following purposes: (1) the structured PDF data is copied and then converted into image data; (2) the copy is processed independently, which preserves the original PDF data and avoids damaging the source file; (3) an image containing only line segments can be obtained, which removes interference for the later image binarization and line detection; (4) image data is convenient for subsequent processing, since the relevant features, and from them the required data, can be obtained directly by image detection.

The coordinates of the line segments and of the characters can be obtained together during rendering; the rendering itself serves only to obtain the image of pure line segments. Obtaining the coordinates of a line segment means obtaining the coordinates of its pair of endpoints.
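The data captured in S102 can be pictured with the following minimal Python sketch. It does not perform the CImage rendering; it only shows one possible in-memory form for the captured coordinates. The names `Segment`, `Char`, `to_top_left` and `segment_to_image` are hypothetical, and the sketch assumes the coordinates arrive in PDF default user space (origin at the lower-left corner, y increasing upward), so that they must be flipped to the upper-left origin used in this embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A line segment stored as its pair of endpoints."""
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass(frozen=True)
class Char:
    """A single character with the coordinates of its anchor point."""
    text: str
    x: float
    y: float


def to_top_left(y_pdf: float, page_height: float) -> float:
    """Flip a y value from a lower-left origin to an upper-left origin.

    Assumption: the input uses PDF default user space, where y grows upward.
    """
    return page_height - y_pdf


def segment_to_image(seg: Segment, page_height: float) -> Segment:
    """Return the same segment expressed with an upper-left origin."""
    return Segment(
        seg.x1, to_top_left(seg.y1, page_height),
        seg.x2, to_top_left(seg.y2, page_height),
    )
```

Keeping segments as endpoint pairs matches the statement above that the coordinates of a line segment are the coordinates of a pair of points.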

S2: Divide the page into cells according to the intersections of the line segments, and obtain the rectangle coordinates of each cell.

Because coordinates in PDF user space are floating-point values, an intersection of line segments in the xy coordinate space of the image data is defined as follows: an endpoint of one line segment and an endpoint of another line segment lie within a certain threshold distance of each other. As shown in Fig. 6, line segment A, with endpoints (x1, y1) and (x2, y2), and line segment B, with endpoints (x3, y3) and (x4, y4), have one intersection. A cell is formed by four line segments that have four intersections; it is regarded as a valid cell when the region they enclose is larger than a certain threshold. As shown in Fig. 7, the four adjacent line segments A, B, C and D form one cell.

Step S2 therefore specifically includes:

S201: If the distance between an endpoint of one line segment and an endpoint of another line segment is within a preset first threshold range, judge that the two line segments intersect.

S202: If four adjacent line segments intersect end to end in sequence and the region they enclose is larger than a preset second threshold range, judge that the four line segments form a valid cell, obtain their coordinates, and mark these as the rectangle coordinates of the cell.

S203: Obtain the rectangle coordinates of every cell.
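Steps S201 and S202 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function names, the default tolerance `tol` (the "first threshold") and the default `min_area` (the "second threshold") are hypothetical, and each segment is represented as a pair of `(x, y)` endpoints.

```python
import math


def endpoints_meet(seg_a, seg_b, tol=1.0):
    """S201: two segments 'intersect' when an endpoint of one lies within
    `tol` of an endpoint of the other (coordinates are floating point)."""
    return any(math.dist(p, q) <= tol for p in seg_a for q in seg_b)


def cell_from_segments(top, right, bottom, left, tol=1.0, min_area=4.0):
    """S202: return the rectangle (x0, y0, x1, y1) of a valid cell, or None.

    The four segments must meet end to end in sequence, and the enclosed
    region must be larger than `min_area`.
    """
    ring = [top, right, bottom, left]
    # Each segment must meet the next one; the last must meet the first.
    for i in range(4):
        if not endpoints_meet(ring[i], ring[(i + 1) % 4], tol):
            return None
    xs = [x for seg in ring for (x, _) in seg]
    ys = [y for seg in ring for (_, y) in seg]
    x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    if (x1 - x0) * (y1 - y0) <= min_area:
        return None  # too small to count as a valid cell
    return (x0, y0, x1, y1)
```

The area test discards degenerate regions, for example the tiny rectangles that can appear where thick border lines are drawn as several nearly coincident segments.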

S3: Obtain the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates.

A field block is the ordered set of all valid characters inside a single PDF cell, excluding any watermark characters that fall within the cell.

Step S3 specifically includes:

S301: According to the containment relationship between the character coordinates and the rectangle coordinates, judge whether a rectangle contains characters, that is, whether the coordinates of a character fall within the rectangle coordinates of a cell. If so, perform S302; if not, perform S303. This relationship can be read off directly from the spatial coordinates of the image data.

S302: Obtain all characters within the rectangle in order, and compose them into the field block of that rectangle.

S303: If a rectangle is judged to contain no characters, set its field block to empty; that is, supply an empty field block for the rectangle, so that the blank cell it represents stays aligned with its header.
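A minimal Python sketch of S301 to S303 follows. The function name `build_field_blocks` and the tuple layouts are hypothetical; the sketch assumes each character is reduced to a single point and that the characters are already supplied in reading order.

```python
def build_field_blocks(cells, chars):
    """Assign characters to cells by coordinate containment.

    cells: list of rectangles (x0, y0, x1, y1)
    chars: list of (text, x, y) tuples, assumed to be in reading order
    Returns one string per cell; a cell with no characters gets "" (S303),
    so blank cells still occupy their column.
    """
    blocks = [[] for _ in cells]
    for text, x, y in chars:
        for index, (x0, y0, x1, y1) in enumerate(cells):
            # S301: does the character fall inside this rectangle?
            if x0 <= x <= x1 and y0 <= y <= y1:
                blocks[index].append(text)  # S302: collect in order
                break  # a character belongs to at most one cell
    return ["".join(block) for block in blocks]
```

Returning an empty string rather than dropping the cell is what keeps a blank cell aligned with its header in the final table.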

In a specific implementation, after a rectangle has been judged to contain characters and before step S302, the following step is also performed, specifically to handle PDF tables that carry watermarks.

The step specifically includes: excluding the watermark characters in each rectangle according to the matrix coefficients that map each character from the PDF coordinate space to the user visual space (in this embodiment, the xy coordinate space of the image data). Specifically, the matrix coefficients are the entries of the matrix that maps a character from the PDF coordinate space to the user visual space. A watermark is usually drawn at an angle, so its transformation matrix differs from that of a normal character, and this difference can be used to judge whether a character belongs to a watermark.
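One way to read this test is sketched below in Python. It assumes each character comes with a PDF-style transformation matrix written as six numbers `(a, b, c, d, e, f)`, and that a watermark is recognizable by a rotation component, as the text suggests; a watermark drawn without rotation would not be caught. The function names and the `tol_degrees` tolerance are hypothetical.

```python
import math


def rotation_degrees(matrix):
    """Rotation angle encoded in a matrix (a, b, c, d, e, f).

    For a matrix combining uniform scaling and rotation,
    a = s*cos(t) and b = s*sin(t), so t = atan2(b, a).
    """
    a, b, _c, _d, _e, _f = matrix
    return math.degrees(math.atan2(b, a))


def is_watermark(matrix, tol_degrees=1.0):
    """Treat a character as watermark text when it is visibly rotated."""
    return abs(rotation_degrees(matrix)) > tol_degrees


def drop_watermark_chars(chars):
    """chars: list of (text, matrix). Keep the text of unrotated characters."""
    return [text for text, matrix in chars if not is_watermark(matrix)]
```

Normal horizontal text has `b = 0` with `a > 0`, giving an angle of zero, so only characters whose matrix carries a rotation are filtered out.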

Step S3 of this embodiment then further includes:

S304: Obtain the field block of every cell.

In another specific implementation, S305 is also included, so that multiple rows of cells can be parsed as well.

S305: Determine the cells that belong to each row according to the center lines of the cells. Specifically, whether cells belong to the same row of data is decided by the error between their center lines: the y coordinates of the center lines of all cells in one row lie within a certain threshold range of each other, and a cell whose center line falls outside that range is judged not to be in the same row and is split off into a different row.
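The row grouping of S305 can be sketched as follows. The function name `group_rows` and the default tolerance `tol` are hypothetical, and the sketch compares each cell with the center line of the first cell of the current row, which is one simple choice among several.

```python
def group_rows(cells, tol=2.0):
    """Group cells (x0, y0, x1, y1) into rows by the y of their center line.

    Cells whose center-line y values differ by at most `tol` from the first
    cell of a row are placed in that row; otherwise a new row is started.
    Rows come back top to bottom, each ordered left to right.
    """
    def center_y(cell):
        return (cell[1] + cell[3]) / 2

    rows = []  # each entry: (reference center y, list of cells)
    for cell in sorted(cells, key=lambda c: (center_y(c), c[0])):
        if rows and abs(center_y(cell) - rows[-1][0]) <= tol:
            rows[-1][1].append(cell)
        else:
            rows.append((center_y(cell), [cell]))
    return [sorted(members, key=lambda c: c[0]) for _, members in rows]
```

Using the center line rather than the top edge tolerates small floating-point differences between the borders of neighboring cells.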

In another specific implementation, S4 and S5 are also included, in order to parse cells split across pages.

S4: Convert the image data of each PDF page into an OpenCV Mat object.

S5: On the current page, starting from the maximum Y value and advancing toward the origin, obtain a vertical line segment and then judge whether a horizontal line segment intersecting it can be detected on that vertical line segment; and, at the same time,

on the next page, starting from Y = 0 and advancing toward the maximum value, obtain a vertical line segment and then judge whether a horizontal line segment intersecting it can be detected on that vertical line segment.

If both conditions are met, merge the cell corresponding to the adjacent vertical line segment on the current page and the corresponding cell on the next page into a single cell. As shown in Fig. 3, the incomplete cells produced by the page break between the two pages are merged into one complete cell.

This embodiment further includes the following step:

S6: Collect and organize the results into table data in CSV format.

The PDF table parsing method provided by this embodiment does not require analyzing, for each specific PDF file, how the fields are delimited, and does not require the table header to be identified. It parses and organizes the field block data automatically and accurately, and is practical and widely applicable. Further, this embodiment delimits the data accurately from the relationships between coordinates and uses a vision algorithm to judge the relationship between cells on adjacent PDF pages. The method thus extracts PDF table data accurately with a new approach and a high degree of automation, and greatly simplifies the import of such data. In summary, this embodiment parses this class of PDF table data automatically, accurately and comprehensively, and considerably improves the accuracy and convenience of data cleaning.
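Step S6 can be illustrated with Python's standard `csv` module. The function name `rows_to_csv` is hypothetical, and the sketch assumes the field blocks have already been arranged into rows of strings, with an empty string for each blank cell.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows of field blocks (lists of strings) as CSV text.

    Blank cells are passed in as empty strings, so every row keeps the same
    number of columns and values stay under their headers.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)  # quotes fields that contain commas
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    table = [["Name", "Amount"], ["Smith, J.", ""]]
    print(rows_to_csv(table))
```

Leaving the quoting to `csv.writer` matters here because field blocks taken from a PDF cell can themselves contain commas or line breaks.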

Embodiment two

Corresponding to embodiment one, this embodiment provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements all the steps included in embodiment one.

In summary, the method and storage medium for parsing PDF table data provided by the present invention parse PDF tables accurately, conveniently and automatically. They accurately parse not only the data of single tables and multiple tables but also randomly placed blank cells, cells split across pages, and cells with multi-layer watermarks, so they are practical and widely applicable. Further, the invention parses on the basis of both character coordinates and line-segment coordinates, unlike existing processing based purely on characters; this makes parsing more accurate and convenient and also guarantees that data is matched to headers. The relationships between rows can likewise be analyzed, which supports the parsing of many types of tables.

The above are only embodiments of the present invention and do not limit its patent scope. Any equivalent transformation made using the content of the specification and drawings of the present invention, whether applied directly or indirectly in a related technical field, likewise falls within the scope of patent protection of the present invention.

Claims

1. A method for parsing PDF table data, characterized by comprising:

obtaining the coordinates of each line segment and of each character on every page of the PDF;

dividing the page into cells according to the intersections of the line segments, and obtaining the rectangle coordinates of each cell;

obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates.

2. The method for parsing PDF table data according to claim 1, characterized by further comprising:

determining the cells that belong to each row according to the center lines of the cells.

3. The method for parsing PDF table data according to claim 1, characterized by further comprising:

converting every page of the PDF into image data;

if, after the adjoining preceding and following pages are progressively brought together along the Y axis, corresponding vertical line segments can be obtained and a horizontal line segment can be found on each of those vertical line segments, merging the cells at the junction of the two pages.

4. The method for parsing PDF table data according to claim 3, characterized in that the step of merging the cells at the junction of the two pages, if corresponding vertical line segments can be obtained after the adjoining pages are progressively brought together along the Y axis and a horizontal line segment can be found on each of those vertical line segments, is specifically:

presetting the upper-left corner of every PDF page as the coordinate origin;

on the current page, starting from the maximum Y value and advancing toward the origin, obtaining a vertical line segment and then judging whether a horizontal line segment intersecting it exists on that vertical line segment; and, at the same time,

on the next page, starting from Y = 0 and advancing toward the maximum value, obtaining a vertical line segment and then judging whether a horizontal line segment intersecting it exists on that vertical line segment;

if so, merging the cell corresponding to the adjacent vertical line segment on the current page and the corresponding cell on the next page into a single cell.

5. The method for parsing PDF table data according to claim 1, characterized in that obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates is specifically:

obtaining the characters belonging to each non-blank rectangle according to whether the character coordinates lie within the rectangle coordinates;

excluding the watermark characters from each non-blank rectangle according to the matrix coefficients that map each character from the PDF coordinate space to the user visual space;

composing a field block from the characters of each non-blank rectangle, and supplying a blank field for each blank rectangle;

obtaining the field block of each cell.

6. The method for parsing PDF table data according to claim 1, characterized in that obtaining the coordinates of each line segment and of each character on every page of the PDF is specifically:

rendering the line segments and characters of every PDF page to a CImage handle, and capturing the coordinates of each line segment and each character during rendering.

7. The method for parsing PDF table data according to claim 1, characterized in that dividing the page into cells according to the intersections of the line segments and obtaining the rectangle coordinates of each cell is specifically:

if the distance between an endpoint of one line segment and an endpoint of another line segment is within a preset first threshold range, judging that the two line segments intersect;

if four adjacent line segments intersect end to end in sequence and the region they enclose is larger than a preset second threshold range, obtaining the coordinates of the four line segments and marking them as the rectangle coordinates of the cell formed by the four line segments.

8. A computer-readable storage medium on which a computer program is stored, characterized in that the program implements the following steps when executed by a processor:

obtaining the coordinates of each line segment and of each character on every page of the PDF;

dividing the page into cells according to the intersections of the line segments, and obtaining the rectangle coordinates of each cell;

obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates.

9. The computer-readable storage medium according to claim 8, characterized in that the program can also implement the following steps:

converting every page of the PDF into image data;

determining the cells that belong to each row according to the center lines of the cells;

if, after the adjoining preceding and following pages are progressively brought together along the Y axis, corresponding vertical line segments can be obtained and a horizontal line segment can be found on each of those vertical line segments, merging the cells at the junction of the two pages.

10. The computer-readable storage medium according to claim 8, characterized in that the step of obtaining the coordinates of each line segment and of each character on every page of the PDF is specifically:

rendering the line segments and characters of every PDF page to a CImage handle, and capturing the coordinates of each line segment and each character during rendering;

the step of dividing the page into cells according to the intersections of the line segments and obtaining the rectangle coordinates of each cell is specifically:

if the distance between an endpoint of one line segment and an endpoint of another line segment is within a preset first threshold range, judging that the two line segments intersect;

if four adjacent line segments intersect end to end in sequence and the region they enclose is larger than a preset second threshold range, obtaining the coordinates of the four line segments and marking them as the rectangle coordinates of the cell formed by the four line segments;

the step of obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates is specifically:

obtaining the characters belonging to each non-blank rectangle according to whether the character coordinates lie within the rectangle coordinates;

excluding the watermark characters from each non-blank rectangle according to the matrix coefficients that map each character from the PDF coordinate space to the user visual space;

composing a field block from the characters of each non-blank rectangle, and supplying a blank field for each blank rectangle;

obtaining the field block of each cell.