CN108132916A - Method and storage medium for parsing PDF table data - Google Patents

Method and storage medium for parsing PDF table data

Info

Publication number
CN108132916A
Authority
CN
China
Prior art keywords
line segment
character
coordinate
pdf
cell
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711235867.5A
Other languages
Chinese (zh)
Other versions
CN108132916B (en)
Inventor
蓝树和
段涵瑞
薛艳英
江汉祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meiya Pico Information Co Ltd
Original Assignee
Xiamen Meiya Pico Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meiya Pico Information Co Ltd
Priority to CN201711235867.5A
Publication of CN108132916A
Application granted
Publication of CN108132916B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Editing Of Facsimile Originals (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present invention provides a method and a storage medium for parsing PDF table data. The method includes: obtaining the coordinates of each line segment and of each character on every PDF page; dividing the page into cells according to the intersections of the line segments, and obtaining the rectangle coordinates of each cell; and obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates. Based on the relationship between line-segment coordinates and character coordinates, the invention accurately delimits the cells and the characters inside them, accurately extracts the tables of a PDF and the data they contain, and thereby achieves accurate, convenient, and automated parsing of PDF tables.

Description

Method and storage medium for parsing PDF table data
Technical field
The present invention relates to the field of data parsing, and in particular to a method and a storage medium for parsing PDF table data.
Background art
Existing PDF parsing generally targets text. A table in a PDF is purely visual: there is no real table object, and the cells are delimited only by line segments. The PDF format merely records the position information of the text, line segments, pictures, and so on.
Existing parsers extract the text, but table data requires each value to correspond strictly to the column of its header. PDF has peculiarities that make this hard, such as tables that continue across consecutive pages, unpredictable line breaks inside a single cell, and watermarks. Simple character-level splitting is therefore impractical: for each table layout one would first have to analyze its distinguishing features and then write a dedicated script to import it into a database. The workload is enormous, so automatically extracting PDF table data into a database is hard to achieve.
In short, the PDF parsers currently on the market are mostly closed-source and treat such table data as plain characters. They struggle to match data to headers and to determine how data rows and columns relate to one another.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and a storage medium for parsing PDF table data that parse table data fully automatically and accurately and are highly practical.
To solve the above technical problem, the technical solution adopted by the present invention is:
A method for parsing PDF table data, including:
obtaining the coordinates of each line segment and of each character on every PDF page;
dividing the page into cells according to the intersections of the line segments, and obtaining the rectangle coordinates of each cell;
obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates.
Another technical solution provided by the present invention is:
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
obtaining the coordinates of each line segment and of each character on every PDF page;
dividing the page into cells according to the intersections of the line segments, and obtaining the rectangle coordinates of each cell;
obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates.
The beneficial effects of the present invention are as follows. It provides a vision-based method for parsing PDF table data that does not require analyzing, for each specific PDF file, how the fields are separated, and does not require identifying the header row of the table. It automatically and accurately parses and organizes the field block data and is widely applicable. Specifically, the cells and the characters inside them are delimited accurately from the relationship between line-segment coordinates and character coordinates, so the tables of a PDF and the data they contain are extracted accurately. The process is highly automated and greatly simplifies the import of PDF tables. The present invention can substantially improve the accuracy and convenience of parsing PDF table data.
Brief description of the drawings
Fig. 1 is a schematic diagram of a PDF table in single-table form;
Fig. 2 is a schematic diagram of randomly placed blank cells;
Fig. 3 is a schematic diagram of a cell that spans pages;
Fig. 4 is a schematic diagram of a table with multi-layer watermarks;
Fig. 5 is a flow diagram of a method for parsing PDF table data according to the present invention;
Fig. 6 is a schematic diagram of a line-segment intersection;
Fig. 7 is a schematic diagram of the line segments that form a valid cell;
Fig. 8 is a flow diagram of Embodiment 1.
Detailed description of the embodiments
To explain the technical content, objectives, and effects of the present invention in detail, the following description refers to the embodiments and the accompanying drawings.
The key idea of the present invention is to delimit the cells and the characters inside them accurately from the relationship between line-segment coordinates and character coordinates, to extract the tables of a PDF and the data they contain accurately, and thereby to achieve accurate, convenient, and automated parsing of PDF tables.
Referring to Fig. 5, the present invention provides a method for parsing PDF table data, including:
obtaining the coordinates of each line segment and of each character on every PDF page;
dividing the page into cells according to the intersections of the line segments, and obtaining the rectangle coordinates of each cell;
obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates.
Further, the method also includes:
determining the cells that belong to each row according to the midlines of the cells.
As the above description shows, whether cells lie in the same row is determined by whether the midlines of the identified cells agree within an error range. This regularizes the cells and yields a neatly aligned table.
Further, the method also includes:
converting each PDF page into image data;
if, after two consecutive pages are moved toward each other and overlaid step by step along the Y axis, corresponding vertical line segments can be obtained and a horizontal line segment can be found on each of those vertical line segments, merging the cells at the junction of the two pages.
As the above description shows, the correlation between the cells that meet at a page boundary can be analyzed from visual image features to judge whether they are one cell that was split by pagination; if so, they are merged. Split cells are thus merged automatically and accurately.
Further, the step of merging the cells at the junction of two consecutive pages, when corresponding vertical line segments can be obtained after overlaying the pages step by step along the Y axis and a horizontal line segment can be found on each of those vertical line segments, is specifically:
presetting the upper-left corner of each PDF page as the coordinate origin;
for the current page, starting from the maximum Y value and advancing toward the origin, obtaining a vertical line segment and then judging whether a horizontal line segment that intersects it exists on that vertical line segment; and at the same time
for the next page, starting from Y coordinate zero and advancing toward the maximum value, obtaining a vertical line segment and then judging whether a horizontal line segment that intersects it exists on that vertical line segment;
if so, merging the cell corresponding to the adjacent vertical line segment in the current page and the corresponding cell in the next page into one cell.
As the above description shows, a vision algorithm can be used to judge the correlation of cells between PDF pages and to merge automatically the cells that were split, which further improves the final table.
Further, obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates is specifically:
determining, according to whether the coordinates of a character lie within a rectangle, the characters that belong to each non-blank rectangle;
excluding the watermark characters from each non-blank rectangle according to the matrix coefficients that map each character from the PDF coordinate space to the user's visual space;
composing a field block from the characters of each non-blank rectangle, and supplying an empty field for each blank rectangle;
obtaining the field block of each cell.
As the above description shows, watermark characters can be removed effectively, which ensures the accuracy of the parsed table. In addition, an empty field is supplied for each blank cell so that the blank cell stays aligned with its header. This ensures the completeness and accuracy of the final table.
Further, obtaining the coordinates of each line segment and of each character on every PDF page is specifically:
rendering the line segments and characters of each PDF page to a CImage handle, and capturing the coordinates of each line segment and each character while rendering.
As the above description shows, rendering the line segments and characters to a CImage handle converts the structured PDF data into image data that is easy to analyze and process. Subsequent detection and analysis can then work directly on the image data to obtain the features of the line segments and characters and, from them, the required data.
Further, dividing the page into cells according to the intersections of the line segments and obtaining the rectangle coordinates of each cell is specifically:
if the distance between the coordinates of one endpoint of a line segment and the coordinates of one endpoint of another line segment is within a preset first threshold range, judging that the two line segments intersect;
if four adjacent line segments intersect end to end in sequence and the region they enclose is larger than a preset second threshold range, obtaining the coordinates of the four line segments and marking them as the rectangle coordinates of the cell formed by the four line segments.
As the above description shows, because coordinates in PDF user space are floating-point values, two line segments are judged to intersect when the distance between the corresponding points falls within a threshold range. This makes it possible to divide the cells accurately afterwards according to the number of intersections.
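The endpoint-distance test described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `endpoints_meet`, the tuple representation of segments, and the default tolerance are assumptions chosen for the example.

```python
import math

# A point is (x, y); a segment is a pair of endpoints. Both are illustrative types.
Point = tuple[float, float]
Segment = tuple[Point, Point]


def endpoints_meet(a: Segment, b: Segment, tol: float = 1.0) -> bool:
    """Return True if any endpoint of `a` lies within `tol` of any endpoint of `b`.

    `tol` plays the role of the "first threshold"; its value here is arbitrary.
    """
    return any(math.dist(p, q) <= tol for p in a for q in b)


top = ((0.0, 0.0), (10.0, 0.0))
right = ((10.0004, 0.0003), (10.0, 5.0))   # endpoint off by floating-point noise
lower = ((0.0, 3.0), (10.0, 3.0))

print(endpoints_meet(top, right, tol=0.01))  # endpoints nearly coincide
print(endpoints_meet(top, lower, tol=0.01))  # no endpoints are close
```

Comparing distances against a tolerance instead of testing for exact equality is what makes the check robust to floating-point coordinates.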
Another technical solution provided by the present invention is:
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
obtaining the coordinates of each line segment and of each character on every PDF page;
dividing the page into cells according to the intersections of the line segments, and obtaining the rectangle coordinates of each cell;
obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates.
Further, the program can also implement the following steps:
converting each PDF page into image data;
determining the cells that belong to each row according to the midlines of the cells;
if, after two consecutive pages are moved toward each other and overlaid step by step along the Y axis, corresponding vertical line segments can be obtained and a horizontal line segment can be found on each of those vertical line segments, merging the cells at the junction of the two pages.
Further, the step of obtaining the coordinates of each line segment and of each character on every PDF page is specifically:
rendering the line segments and characters of each PDF page to a CImage handle, and capturing the coordinates of each line segment and each character while rendering;
the step of dividing the page into cells according to the intersections of the line segments and obtaining the rectangle coordinates of each cell is specifically:
if the distance between the coordinates of one endpoint of a line segment and the coordinates of one endpoint of another line segment is within a preset first threshold range, judging that the two line segments intersect;
if four adjacent line segments intersect end to end in sequence and the region they enclose is larger than a preset second threshold range, obtaining the coordinates of the four line segments and marking them as the rectangle coordinates of the cell formed by the four line segments;
the step of obtaining the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates is specifically:
determining, according to whether the coordinates of a character lie within a rectangle, the characters that belong to each non-blank rectangle;
excluding the watermark characters from each non-blank rectangle according to the matrix coefficients that map each character from the PDF coordinate space to the user's visual space;
composing a field block from the characters of each non-blank rectangle, and supplying an empty field for each blank rectangle;
obtaining the field block of each cell.
Embodiment 1
This embodiment provides a method for parsing PDF table data. It is suitable for parsing tables stored in PDF format and obtaining the corresponding table data for subsequent editing. For example, in front-end data cleaning, a large share of the statements and bills supplied by clients are tabular PDFs; with this embodiment, a tabular PDF can be extracted into the corresponding CSV form and imported into a database automatically for analysis.
Figs. 1-4 show several common existing PDF table forms. Specifically, Fig. 1 corresponds to a single table; Fig. 2 corresponds to randomly placed blank cells; Fig. 3 corresponds to a cell that spans pages; and Fig. 4 corresponds to forms such as multi-layer watermarks. Existing PDF table parsers are mostly closed-source and treat such table data as plain characters. They struggle to match data to headers, and find it even harder to determine how rows and columns relate to one another.
To address these problems, this embodiment uses several specific implementations to handle the parsing of the different forms.
Referring to Fig. 8, the method of this embodiment for parsing PDF table data includes:
S1: Convert each PDF page into image data; take the upper-left corner of each PDF page as the coordinate origin; obtain the coordinates of each line segment and of each character on every PDF page.
This step specifically includes:
S101: Load the PDF file and loop to obtain the object of each page. The object is a pointer to a PDF page and is used to fetch the data of each page in order.
S102: Render the line segments and characters of each PDF page to a CImage handle, taking the upper-left corner of each page of image data as the coordinate origin; capture the coordinates of each line segment and each character while rendering.
Rendering to a CImage handle serves the following purposes: 1. the structured PDF data is copied and then converted to image data; 2. the copy is processed independently, so the original PDF data is preserved and the source file cannot be lost; 3. image data containing only line segments can be obtained, which removes interference for the later image binarization and straight-line detection; 4. image data is convenient for subsequent processing, because the relevant features, and from them the required data, can be obtained directly by image detection.
The coordinates of the line segments and characters can be obtained together during rendering; the rendering itself is only meant to produce the image of pure line segments. Obtaining the coordinates of a line segment means obtaining the coordinates of its pair of endpoints.
S2: Divide the page into cells according to the intersections of the line segments, and obtain the rectangle coordinates of each cell.
Because coordinates in PDF user space are floating-point values, in the xy coordinate space of the image data an intersection of two line segments means that the distance between one endpoint of one line segment and one endpoint of the other is within a certain threshold range. As shown in Fig. 6, line segment A, with endpoints (x1, y1) and (x2, y2), and line segment B, with endpoints (x3, y3) and (x4, y4), have one intersection. A cell means four line segments in the coordinate space that have four intersections; it is regarded as a valid cell when the region they enclose is larger than a certain threshold. As shown in Fig. 7, the four adjacent line segments A, B, C, and D form a cell.
Therefore, step S2 specifically includes:
S201: If the distance between the coordinates of one endpoint of a line segment and the coordinates of one endpoint of another line segment is within a preset first threshold range, judge that the two line segments intersect.
S202: If four adjacent line segments intersect end to end in sequence and the region they enclose is larger than a preset second threshold range, judge that the four line segments form a valid cell; at the same time, obtain the coordinates of the four line segments and mark them as the rectangle coordinates of the cell that the four line segments form.
S203: Obtain the rectangle coordinates of each cell.
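Steps S201 to S203 can be sketched as follows. This is a hedged illustration only: `cell_from_segments`, the tuple layout of segments, and the numeric values of `tol` (first threshold) and `min_area` (second threshold) are assumptions made for the example, and the four segments are assumed to be supplied already ordered around the candidate cell.

```python
import math


def meets(a, b, tol):
    """S201: two segments intersect if any pair of their endpoints is within `tol`."""
    return any(math.dist(p, q) <= tol for p in a for q in b)


def cell_from_segments(segs, tol=1.0, min_area=4.0):
    """S202: return (x0, y0, x1, y1) if four ordered segments close into a valid cell.

    Returns None when the chain does not close or the enclosed area is too small.
    """
    if len(segs) != 4:
        return None
    # Each segment must meet the next one, and the last must meet the first.
    for i in range(4):
        if not meets(segs[i], segs[(i + 1) % 4], tol):
            return None
    xs = [p[0] for s in segs for p in s]
    ys = [p[1] for s in segs for p in s]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    # Reject slivers, for example the gap between the two strokes of a double border.
    if (x1 - x0) * (y1 - y0) <= min_area:
        return None
    return (x0, y0, x1, y1)


a = ((0.0, 0.0), (10.0, 0.0))   # top
b = ((10.0, 0.0), (10.0, 5.0))  # right
c = ((10.0, 5.0), (0.0, 5.0))   # bottom
d = ((0.0, 5.0), (0.0, 0.0))    # left
print(cell_from_segments([a, b, c, d]))
```

Calling the function on every candidate group of four segments and collecting the non-`None` results corresponds to S203.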
S3: Obtain the field block of each cell according to the containment relationship between the character coordinates and the rectangle coordinates.
A field block is the ordered set of all valid characters in a single PDF cell (watermark characters that fall inside the cell are excluded).
Step S3 specifically includes:
S301: According to the containment relationship between the character coordinates and the rectangle coordinates, judge whether the rectangle contains characters, that is, whether the coordinates of a character fall within the rectangle (the cell). If so, perform S302; if not, perform S303. This check follows directly from the spatial positions in the image coordinate space.
S302: Obtain all characters in the rectangle in order, and compose the field block of that rectangle from them.
S303: If a rectangle is judged to contain no characters, set the field block of that rectangle to empty, that is, supply an empty field block for the rectangle, so that the blank cell it represents stays aligned with the corresponding header.
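Steps S301 to S303 amount to a point-in-rectangle test followed by concatenation. The sketch below assumes that characters arrive as `(x, y, text)` tuples already in reading order and that cells are `(x0, y0, x1, y1)` rectangles; these shapes and the name `field_blocks` are illustrative, not taken from the patent.

```python
def field_blocks(cells, chars):
    """Return one string per cell, in cell order; blank cells yield "".

    cells: list of (x0, y0, x1, y1) rectangles.
    chars: list of (x, y, text) tuples, assumed to be in reading order.
    """
    blocks = []
    for x0, y0, x1, y1 in cells:
        # S301: keep the characters whose coordinates fall inside the rectangle.
        inside = [t for x, y, t in chars if x0 <= x <= x1 and y0 <= y <= y1]
        # S302 / S303: join them; an empty list gives the empty field block.
        blocks.append("".join(inside))
    return blocks


cells = [(0, 0, 10, 5), (10, 0, 20, 5)]
chars = [(1, 2, "A"), (3, 2, "B")]
print(field_blocks(cells, chars))  # the second cell has no characters
```

Emitting an explicit empty string for a blank cell keeps the number of fields per row constant, which is what keeps values aligned with their headers.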
In one specific implementation, after the rectangle has been judged to contain characters, that is, before step S302, the following step is also performed in order to handle PDF tables that carry watermarks.
The step specifically includes: excluding the watermark characters in each rectangle according to the matrix coefficients that map a character from the PDF coordinate space to the user's visual space (in this embodiment, the xy coordinate space of the image data). Specifically, the matrix coefficients are the set of matrix values that map a character from the PDF coordinate space to the user's visual space. A watermark is usually text set at an angle, so its transformation matrix differs from that of a normal character, and this difference makes it possible to judge whether a given character is a watermark.
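One way to act on the matrix difference is to recover the rotation angle from the first two coefficients of a character's transformation matrix. In the six-value PDF form `[a b c d e f]`, a pure rotation by θ has `a = cos θ` and `b = sin θ`, so upright text has `b` close to zero. The sketch below is an assumption-laden illustration: the names `is_rotated` and `drop_watermarks`, the `(text, matrix)` tuple layout, and the one-degree cutoff are all hypothetical, and the patent does not state which coefficients it compares.

```python
import math


def is_rotated(matrix, max_angle_deg=1.0):
    """Return True if the matrix [a, b, c, d, e, f] rotates text by more than the cutoff."""
    a, b = matrix[0], matrix[1]
    angle = math.degrees(math.atan2(b, a))
    return abs(angle) > max_angle_deg


def drop_watermarks(chars):
    """Keep only characters drawn upright; chars is a list of (text, matrix) tuples."""
    return [(t, m) for t, m in chars if not is_rotated(m)]


s = math.sin(math.radians(45))
chars = [
    ("7", (12.0, 0.0, 0.0, 12.0, 100.0, 200.0)),  # upright, scaled to 12 units
    ("W", (s, s, -s, s, 300.0, 400.0)),           # rotated 45 degrees
]
print([t for t, _ in drop_watermarks(chars)])
```

A table whose legitimate text is itself rotated would need a different criterion, so the cutoff should be treated as a tunable assumption.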
Then, S3 of this embodiment further includes:
S304: Obtain the field block of each cell.
In another specific implementation, S305 is also included so that tables with multiple rows of cells can be parsed.
S305: Determine the cells that belong to each row according to the midlines of the cells. Specifically, whether cells belong to the same data row is determined by the error range between their midlines. If cells lie in the same row, the y coordinates of their midlines fall within a certain threshold range; cells whose midlines fall outside that range are judged not to be in the same row and are split into different rows.
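The midline comparison in S305 can be sketched as a sort followed by a single pass. The name `group_rows`, the rectangle layout `(x0, y0, x1, y1)`, and the default tolerance are assumptions for illustration; each row is compared against the midline of its first cell, which is one simple choice among several.

```python
def group_rows(cells, tol=2.0):
    """Group (x0, y0, x1, y1) cells into rows whose vertical midlines agree within `tol`."""
    rows = []
    # Sort top to bottom by midline, then left to right.
    for cell in sorted(cells, key=lambda c: ((c[1] + c[3]) / 2, c[0])):
        mid = (cell[1] + cell[3]) / 2
        if rows and abs(mid - rows[-1]["mid"]) <= tol:
            rows[-1]["cells"].append(cell)   # same row as the previous cell
        else:
            rows.append({"mid": mid, "cells": [cell]})  # start a new row
    return [sorted(r["cells"], key=lambda c: c[0]) for r in rows]


cells = [(10, 0.3, 20, 5.1), (0, 5, 10, 10), (0, 0, 10, 5)]
for row in group_rows(cells):
    print(row)
```

Using the midline rather than the top or bottom edge tolerates cells in the same row whose borders differ slightly in height.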
In another specific implementation, S4 and S5 are also included so that cells spanning pages can be parsed.
S4: Convert each page of PDF data in image form into an OpenCV Mat object.
S5: For the current page, starting from the maximum Y value and advancing toward the origin, obtain a vertical line segment and then judge whether a horizontal line segment that intersects it can be detected on that vertical line segment; and at the same time
for the next page, starting from Y coordinate zero and advancing toward the maximum value, obtain a vertical line segment and then judge whether a horizontal line segment that intersects it can be detected on that vertical line segment.
If both conditions are met, merge the cell corresponding to the adjacent vertical line segment in the current page and the corresponding cell in the next page into one cell. As shown in Fig. 3, the incomplete cells produced by the page break between the two pages are merged into one complete cell.
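The text leaves some room for interpretation, so the following sketch encodes one possible reading and works on already extracted segments rather than on image data: a cell is treated as continuing across the break when vertical segments reach the bottom of the table area on the current page and the top of it on the next page at matching x positions, each is bounded by a horizontal segment further inside its page, and no horizontal segment closes the cell at the break itself. All names (`open_edge_xs`, `spans_page_break`), the parameters `bottom_y` and `top_y`, and the "not closed at the edge" condition are assumptions added for the example.

```python
def crossing_ys(vseg, horizontals, tol):
    """Return the y values of horizontal segments that touch or cross a vertical segment."""
    x = vseg[0][0]
    y_lo, y_hi = sorted((vseg[0][1], vseg[1][1]))
    ys = []
    for (hx0, hy), (hx1, _) in horizontals:
        lo, hi = sorted((hx0, hx1))
        if lo - tol <= x <= hi + tol and y_lo - tol <= hy <= y_hi + tol:
            ys.append(hy)
    return ys


def open_edge_xs(verticals, horizontals, edge_y, tol):
    """x positions of vertical segments that reach `edge_y` and are left open there."""
    xs = []
    for v in verticals:
        if min(abs(v[0][1] - edge_y), abs(v[1][1] - edge_y)) > tol:
            continue  # this vertical segment does not reach the page break
        ys = crossing_ys(v, horizontals, tol)
        bounded_inside = any(abs(y - edge_y) > tol for y in ys)
        closed_at_edge = any(abs(y - edge_y) <= tol for y in ys)
        if bounded_inside and not closed_at_edge:
            xs.append(v[0][0])
    return sorted(xs)


def spans_page_break(cur, nxt, bottom_y, top_y=0.0, tol=1.0):
    """cur and nxt are (verticals, horizontals) for the current and next page."""
    a = open_edge_xs(cur[0], cur[1], bottom_y, tol)
    b = open_edge_xs(nxt[0], nxt[1], top_y, tol)
    return len(a) >= 2 and len(a) == len(b) and all(abs(p - q) <= tol for p, q in zip(a, b))


cur = ([((0.0, 700.0), (0.0, 800.0)), ((50.0, 700.0), (50.0, 800.0))],
       [((0.0, 700.0), (50.0, 700.0))])
nxt = ([((0.0, 0.0), (0.0, 40.0)), ((50.0, 0.0), (50.0, 40.0))],
       [((0.0, 40.0), (50.0, 40.0))])
print(spans_page_break(cur, nxt, bottom_y=800.0))
```

When the function reports a continuation, the field blocks of the two half cells would be concatenated into one; that merging step is omitted here.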
This embodiment further includes the following step:
S6: Aggregate the results and organize them into table data in CSV form.
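Step S6 can be sketched with Python's standard `csv` module. The function name `rows_to_csv` and the list-of-lists input are assumptions for the example; each inner list holds the field blocks of one row, with empty strings for blank cells.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows of field blocks to CSV text; blank cells are empty strings."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


rows = [["Name", "Amount"], ["Alice", ""], ["Bob", "1,200"]]
print(rows_to_csv(rows), end="")
```

The writer quotes fields that contain the delimiter, so a value such as `1,200` remains a single field when the CSV is imported into a database.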
The PDF table parsing method of this embodiment does not require analyzing, for each specific PDF file, how the fields are separated, and does not require identifying the header row of the table. It automatically and accurately parses and organizes the field block data, and is both practical and widely applicable. Further, this embodiment divides the content accurately from the relationships between coordinates and uses a vision algorithm to judge the correlation of cells between PDF pages. This approach extracts the table data of a PDF accurately and with a high degree of automation, and greatly simplifies the import of such data. In summary, this embodiment can parse PDF table data automatically, accurately, and comprehensively, and considerably improves the accuracy and convenience of data cleaning.
Embodiment 2
Corresponding to Embodiment 1, this embodiment provides a computer-readable storage medium storing a computer program which, when executed by a processor, can implement all the steps of Embodiment 1.
In conclusion, the method and storage medium for parsing PDF table data provided by the present invention achieve accurate, convenient, and automated parsing of PDF tables. They can parse not only the data of single and multiple tables precisely, but also randomly placed blank cells, cells that span pages, and cells with multi-layer watermarks; they are practical and widely applicable. Further, the present invention parses on the basis of character coordinates and line-segment coordinates, which distinguishes it from existing purely character-based processing. This makes parsing more accurate and convenient and also ensures that data corresponds to its header. The correlation between rows and columns can be analyzed on the same basis, which supports the parsing of many types of tables.
The foregoing describes only embodiments of the present invention and does not limit the scope of the patent. Any equivalent transformation made using the content of the specification and drawings of the present invention, whether applied directly or indirectly in related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims (10)

  1. A kind of 1. method for parsing PDF list datas, which is characterized in that including:
    Obtain the coordinate of each line segment of every page of PDF and the coordinate of each character;
    Cell is marked off, and obtain the corresponding rectangular coordinates of each unit lattice according to intersection point of line segments;
    According to the coordinate of character and the inclusion relation of rectangular coordinates, the corresponding field block of each unit lattice is obtained.
  2. 2. the method for parsing PDF list datas as described in claim 1, which is characterized in that further include:
    According to the neutrality line of cell, determine per the corresponding cell of a line.
  3. 3. the method for parsing PDF list datas as described in claim 1, which is characterized in that further include:
    Every page of PDF is changed into image data form;
    If the upper nextpage being mutually connected gradually is superimposed along Y direction draw close after, corresponding vertical line segment can be obtained, and can Horizontal line segment is got on the vertical line segment respectively, then merges the cell of the upper nextpage joining place.
  4. 4. the method for parsing PDF list datas as claimed in claim 3, which is characterized in that if the upper nextpage being mutually connected It is gradually superimposed along Y direction after drawing close, corresponding vertical line segment can be obtained, and can be respectively in the vertical line segment On get horizontal line segment, then merge the cell of the upper nextpage joining place, specially:
    The upper left corner for presetting every page of PDF is coordinate origin;
    To current page since Y-axis maximum value, advance toward origin direction after obtaining vertical line segment, judge the vertical line segment It is upper to whether there is the horizontal line section intersected with it;And simultaneously
    To lower one page since zero coordinate of Y-axis, advance toward maximum value direction after obtaining vertical line segment, judge the vertical line With the presence or absence of the horizontal line section intersected with it in section;
    If so, by the corresponding cell of the vertical line segment adjacent in the current page with it is corresponding in described lower one page Cell span is same cell.
  5. 5. the method for parsing PDF list datas as described in claim 1, which is characterized in that the coordinate according to character with The inclusion relation of rectangular coordinates obtains the corresponding field block of each unit lattice, specially:
    Whether the coordinate according to character is located in rectangular coordinates, obtains the corresponding character of each rectangular coordinates of non-blank-white;
    The matrix coefficient of user's visual space is mapped to from the coordinate space of PDF according to character, excludes each square of the non-blank-white Watermark character in shape coordinate;
    The corresponding character composition field block of each rectangular coordinates of the non-blank-white, each rectangular coordinates for supplementing blank correspond to blank Field;
    Obtain the corresponding field block of each unit lattice.
  6. 6. the method for parsing PDF list datas as described in claim 1, which is characterized in that each line for obtaining every page of PDF The coordinate of section and the coordinate of each character, specially:
    By the line segment of every page of PDF and character rendering to CImage handles, the seat of each line segment and each character is captured while rendering Mark.
  7. The method for parsing PDF table data as described in claim 1, characterized in that dividing out the cells according to the line segment intersections and obtaining the rectangular coordinates corresponding to each cell is specifically:
    If the distance between an endpoint coordinate of one line segment and an endpoint coordinate of another line segment is within a preset first threshold range, judge that the one line segment intersects the other line segment;
    If four adjacent line segments intersect one another end to end in sequence, and the region they form is larger than a preset second threshold range, obtain the coordinates of the four line segments and mark them as the rectangular coordinates corresponding to the cell formed by the four line segments.
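The two threshold tests above can be illustrated with a short sketch. It is only one possible reading of the claim: segments are plain endpoint pairs, `tol` stands in for the first threshold, `min_area` stands in for the second threshold, and the function names are hypothetical.

```python
import math
from itertools import permutations


def intersects(s, t, tol):
    """Endpoint-proximity test: some endpoint of s lies within tol of some endpoint of t."""
    return any(math.dist(p, q) <= tol for p in s for q in t)


def cell_rect(four_segments, tol, min_area):
    """Return (left, top, right, bottom) if the four segments close into a large enough cell."""
    first, rest = four_segments[0], four_segments[1:]
    for order in permutations(rest):
        loop = (first,) + order
        # Each segment must meet the next one, and the last must meet the first.
        closed = all(intersects(loop[i], loop[(i + 1) % 4], tol) for i in range(4))
        if not closed:
            continue
        xs = [p[0] for seg in four_segments for p in seg]
        ys = [p[1] for seg in four_segments for p in seg]
        left, right, top, bottom = min(xs), max(xs), min(ys), max(ys)
        # Discard regions that are too small to be a real cell.
        if (right - left) * (bottom - top) > min_area:
            return (left, top, right, bottom)
        return None
    return None
```

The area filter presumably exists to reject slivers formed by double-drawn borders, though the source does not state the motivation.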
  8. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the following steps:
    Obtain the coordinates of each line segment and the coordinates of each character of every PDF page;
    Divide out the cells according to the line segment intersections, and obtain the rectangular coordinates corresponding to each cell;
    According to the inclusion relation between the character coordinates and the rectangular coordinates, obtain the field block corresponding to each cell.
  9. The computer-readable storage medium as described in claim 1, characterized in that the program can also implement the following steps:
    Convert every PDF page into image data format;
    According to the midlines of the cells, determine the cells corresponding to each row;
    If, after two adjoining pages are gradually brought together and superimposed along the Y-axis direction, corresponding vertical line segments can be obtained, and a horizontal line segment can be obtained on each of those vertical line segments, then merge the cells at the junction of the two pages.
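The row-determination step above is terse, so the following is only one plausible interpretation: a cell belongs to a row if its horizontal midline falls inside the vertical extent of that row's first cell. The function name `group_rows` and the `(left, top, right, bottom)` rectangle layout are assumptions made for this sketch.

```python
def group_rows(rects):
    """Group (left, top, right, bottom) cells into rows using each cell's midline."""
    rows = []
    # Visit cells from top to bottom by midline, then left to right.
    for rect in sorted(rects, key=lambda r: ((r[1] + r[3]) / 2, r[0])):
        mid = (rect[1] + rect[3]) / 2
        if rows and rows[-1][0][1] <= mid <= rows[-1][0][3]:
            # The midline lies within the current row's vertical extent.
            rows[-1].append(rect)
        else:
            rows.append([rect])
    return [sorted(row, key=lambda r: r[0]) for row in rows]
```

Using the midline rather than the top edge tolerates small vertical misalignments between neighbouring cells.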
  10. The computer-readable storage medium as described in claim 1, characterized in that the step of obtaining the coordinates of each line segment and the coordinates of each character of every PDF page is specifically:
    Render the line segments and characters of every PDF page to a CImage handle, and capture the coordinates of each line segment and each character during rendering;
    The step of dividing out the cells according to the line segment intersections and obtaining the rectangular coordinates corresponding to each cell is specifically:
    If the distance between an endpoint coordinate of one line segment and an endpoint coordinate of another line segment is within a preset first threshold range, judge that the one line segment intersects the other line segment;
    If four adjacent line segments intersect one another end to end in sequence, and the region they form is larger than a preset second threshold range, obtain the coordinates of the four line segments and mark them as the rectangular coordinates corresponding to the cell formed by the four line segments;
    The step of obtaining the field block corresponding to each cell according to the inclusion relation between the character coordinates and the rectangular coordinates is specifically:
    According to whether the coordinates of a character lie within a set of rectangular coordinates, obtain the characters corresponding to each non-blank set of rectangular coordinates;
    According to the matrix coefficients that map the character from the PDF coordinate space to the user's visual space, exclude the watermark characters within each non-blank set of rectangular coordinates;
    The characters corresponding to each non-blank set of rectangular coordinates form a field block, and each blank set of rectangular coordinates is supplemented with a corresponding blank field;
    Obtain the field block corresponding to each cell.
CN201711235867.5A 2017-11-30 2017-11-30 Method for analyzing PDF table data and storage medium Active CN108132916B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711235867.5A CN108132916B (en) 2017-11-30 2017-11-30 Method for analyzing PDF table data and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711235867.5A CN108132916B (en) 2017-11-30 2017-11-30 Method for analyzing PDF table data and storage medium

Publications (2)

Publication Number Publication Date
CN108132916A true CN108132916A (en) 2018-06-08
CN108132916B CN108132916B (en) 2022-02-11

Family

ID=62390012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711235867.5A Active CN108132916B (en) 2017-11-30 2017-11-30 Method for analyzing PDF table data and storage medium

Country Status (1)

Country Link
CN (1) CN108132916B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446487A (en) * 2018-11-01 2019-03-08 北京神州泰岳软件股份有限公司 A kind of method and device parsing portable document format document table
CN109670461A (en) * 2018-12-24 2019-04-23 广东亿迅科技有限公司 PDF text extraction method, device, computer equipment and storage medium
CN109815958A (en) * 2019-02-01 2019-05-28 杭州睿琪软件有限公司 A kind of laboratory test report recognition methods, device, electronic equipment and storage medium
CN109871524A (en) * 2019-02-21 2019-06-11 腾讯科技(深圳)有限公司 A kind of chart generation method and device
CN110134957A (en) * 2019-05-14 2019-08-16 云南电网有限责任公司电力科学研究院 A kind of scientific and technological achievement storage method and system based on semantic analysis
WO2020140698A1 (en) * 2019-01-04 2020-07-09 阿里巴巴集团控股有限公司 Table data acquisition method and apparatus, and server
CN112069991A (en) * 2020-09-04 2020-12-11 税友软件集团股份有限公司 PDF table information extraction method and related device
CN112541332A (en) * 2020-12-08 2021-03-23 北京百度网讯科技有限公司 Form information extraction method and device, electronic equipment and storage medium
CN112712014A (en) * 2020-12-29 2021-04-27 平安健康保险股份有限公司 Table picture structure analysis method, system, equipment and readable storage medium
CN113361257A (en) * 2021-06-29 2021-09-07 深圳壹账通智能科技有限公司 PDF document analysis method, system, electronic device and storage medium
CN113435166A (en) * 2021-06-09 2021-09-24 深圳市世强元件网络有限公司 Underlining method and system, computer device and readable storage medium
CN113642408A (en) * 2021-07-15 2021-11-12 杭州玖欣物联科技有限公司 Method for processing and analyzing picture data in real time through industrial internet

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102467378A (en) * 2010-11-11 2012-05-23 深圳市金蝶友商电子商务服务有限公司 HTML (Hypertext Markup Language) form processing method based on two-dimensional matrix and computer
CN101866335B (en) * 2010-06-14 2012-12-12 深圳市万兴软件有限公司 Form processing method and device in document conversion
CN104268127A (en) * 2014-09-22 2015-01-07 同方知网(北京)技术有限公司 Method for analyzing reading order of electronic layout file
US20160247020A1 (en) * 2013-03-19 2016-08-25 Fujian Foxit Software Development Joint Stock Co., Ltd. A method for identifying pdf document
CN105989013A (en) * 2015-01-28 2016-10-05 腾讯科技(深圳)有限公司 Method and device for removing character watermarks
CN105988979A (en) * 2015-02-16 2016-10-05 北京邮电大学 Form extraction method and device based on PDF (Portable Document Format) file
CN106484340A (en) * 2016-09-08 2017-03-08 中标软件有限公司 Watermark interpolation is carried out in print procedure to document knows method for distinguishing with watermark
CN106897690A (en) * 2017-02-22 2017-06-27 南京述酷信息技术有限公司 PDF table extracting methods
CN106951400A (en) * 2017-02-06 2017-07-14 北京因果树网络科技有限公司 The information extraction method and device of a kind of pdf document

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446487A (en) * 2018-11-01 2019-03-08 北京神州泰岳软件股份有限公司 A kind of method and device parsing portable document format document table
CN109670461A (en) * 2018-12-24 2019-04-23 广东亿迅科技有限公司 PDF text extraction method, device, computer equipment and storage medium
WO2020140698A1 (en) * 2019-01-04 2020-07-09 阿里巴巴集团控股有限公司 Table data acquisition method and apparatus, and server
CN109815958A (en) * 2019-02-01 2019-05-28 杭州睿琪软件有限公司 A kind of laboratory test report recognition methods, device, electronic equipment and storage medium
CN109871524A (en) * 2019-02-21 2019-06-11 腾讯科技(深圳)有限公司 A kind of chart generation method and device
CN110134957B (en) * 2019-05-14 2023-06-13 云南电网有限责任公司电力科学研究院 Scientific and technological achievement warehousing method and system based on semantic analysis
CN110134957A (en) * 2019-05-14 2019-08-16 云南电网有限责任公司电力科学研究院 A kind of scientific and technological achievement storage method and system based on semantic analysis
CN112069991A (en) * 2020-09-04 2020-12-11 税友软件集团股份有限公司 PDF table information extraction method and related device
CN112541332A (en) * 2020-12-08 2021-03-23 北京百度网讯科技有限公司 Form information extraction method and device, electronic equipment and storage medium
CN112541332B (en) * 2020-12-08 2023-06-23 北京百度网讯科技有限公司 Form information extraction method and device, electronic equipment and storage medium
CN112712014A (en) * 2020-12-29 2021-04-27 平安健康保险股份有限公司 Table picture structure analysis method, system, equipment and readable storage medium
CN112712014B (en) * 2020-12-29 2024-04-30 平安健康保险股份有限公司 Method, system, device and readable storage medium for parsing table picture structure
CN113435166A (en) * 2021-06-09 2021-09-24 深圳市世强元件网络有限公司 Underlining method and system, computer device and readable storage medium
CN113435166B (en) * 2021-06-09 2024-03-19 深圳市世强元件网络有限公司 Underline method and system, computer device and readable storage medium
CN113361257B (en) * 2021-06-29 2022-10-11 深圳壹账通智能科技有限公司 PDF document analysis method, system, electronic device and storage medium
CN113361257A (en) * 2021-06-29 2021-09-07 深圳壹账通智能科技有限公司 PDF document analysis method, system, electronic device and storage medium
CN113642408A (en) * 2021-07-15 2021-11-12 杭州玖欣物联科技有限公司 Method for processing and analyzing picture data in real time through industrial internet

Also Published As

Publication number Publication date
CN108132916B (en) 2022-02-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant