CN105589841B - Method for recognizing tables in PDF documents - Google Patents

Method for recognizing tables in PDF documents

Info

Publication number
CN105589841B
CN105589841B CN201610025529.8A CN201610025529A CN105589841B CN 105589841 B CN105589841 B CN 105589841B CN 201610025529 A CN201610025529 A CN 201610025529A CN 105589841 B CN105589841 B CN 105589841B
Authority
CN
China
Prior art keywords
line
doubtful
title
page
row
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610025529.8A
Other languages
Chinese (zh)
Other versions
CN105589841A (en)
Inventor
邹季英
袁仁慧
梁洵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
Original Assignee
TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
Priority to CN201610025529.8A
Publication of CN105589841A
Application granted
Publication of CN105589841B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for recognizing tables in PDF documents, comprising: obtaining the character set of the page and merging the characters into rows to build a row set; extracting the horizontal and vertical lines from the page paths to build a line set; detecting candidate table titles in the row set and candidate table lines in the line set; if candidate table titles and candidate table lines both exist, recognizing tables with the region-growing method based on table titles and the line set; if only candidate table lines exist, using the line set and the row set to detect fully ruled tables first and then three-line tables; if only candidate table titles exist, recognizing tables with the region-growing method based on table titles and the row set; if neither candidate table lines nor candidate table titles exist, judging that the page contains no table; and detecting secondary table elements such as the table header and table notes, then outputting the table recognition result for the page. The invention treats the table title, the table lines, and the arrangement of table characters as the three major features of a table, and uses the idea of parallel region growing to locate tables accurately in complex layouts where several tables share one page.

Description

Method for recognizing tables in PDF documents
Technical field
The present invention relates to a method for recognizing tables in PDF documents, and belongs to the field of layout analysis and layout understanding of fixed-layout electronic documents.
Background
With the rapid development of the domestic digital publishing industry, the industry currently needs to solve how to exploit publishing resources in depth, how to carry out deep processing of document resources quickly, and how to fragment and recombine resources to meet the demand for multi-form, multi-channel, multimedia digital publishing. Document resource fragmentation includes the indexing of basic metadata such as title, author, keywords, and references, and also the fragmentation of body content such as paragraphs, pictures, tables, and formulas. Document layout technology is the key technology for automatic document fragmentation; the PDF table recognition method described here belongs to the layout analysis and understanding technology of fixed-layout documents.
PDF (Portable Document Format) is an electronic document format developed by Adobe. It is independent of the operating system platform and has become a widely used, well-suited format for electronic document distribution and digital information dissemination. PDF is a fixed-layout format: pages are relatively independent of one another, and changing the content of one page does not affect the layout of other pages. Although fixed-layout documents are good at describing document layout and rendering pages precisely, they do not record the logical structure of the document. They contain no logical elements such as paragraphs, tables, or formulas, cannot express how the document is organized, and cannot store the document logically.
In a PDF file, a table is split into table lines and table content, which are described separately. Table lines are drawn one by one with path (PATH) operators, but paths are not limited to table lines; they may also represent formula fraction lines, vector graphics, characters converted to outlines, decorative page patterns, and other elements. Table content is represented by character strings, but there is no guarantee that all characters of the same table appear together; they are often mixed with other page content. This storage scheme makes table recognition complicated.
A table generally consists of a table title, a table body, table notes, and other elements; the table body includes table lines and character content. Existing techniques and methods for recognizing tables in fixed-layout documents (such as CN103377177A) focus on table line features and ignore the fact that the table title is an important table feature. In complex layouts where multiple tables share one page, especially where multiple three-line tables (tables containing only horizontal lines) share one page, relying only on intersecting table lines and the arrangement of table body text reduces the accuracy and efficiency of table recognition.
Summary of the invention
To solve the above technical problems, and based on how tables are stored in PDF documents, the present invention aims to provide a method for recognizing tables in PDF documents.
The object of the present invention is achieved by the following technical solution:
A method for recognizing tables in PDF documents, comprising:
obtaining the character set of the page and merging the characters into rows to build a row set;
extracting the horizontal and vertical lines from the page paths to build a line set;
detecting candidate table titles in the row set and candidate table lines in the line set;
if candidate table titles and candidate table lines both exist, recognizing tables with the region-growing method based on table titles and the line set;
if only candidate table lines exist, using the line set and the row set to detect fully ruled tables first and then three-line tables;
if only candidate table titles exist, recognizing tables with the region-growing method based on table titles and the row set;
if neither candidate table lines nor candidate table titles exist, judging that the page contains no table;
detecting secondary table elements such as the table header and table notes, and outputting the table recognition result for the page.
Compared with the prior art, one or more embodiments of the present invention can have the following advantages:
The recognition difficulty of different table types is graded. From easiest to hardest they are: tables with both a table title and table lines, fully ruled tables without a title, three-line tables without a title, unruled tables with a title, and unruled tables without a title. Recognizing them in order from easy to hard improves both the accuracy and the efficiency of table recognition;
In the region-growing method based on table titles and the line set, each table title is treated as a seed that marks the initial growth position, and the candidate table line set supplies the growth elements. Growing multiple seeds (multiple candidate tables) in parallel locates tables accurately and handles complex layouts where several tables share one page;
The region-growing method based on table titles and the row set uses table titles as seeds and the row set as growth elements; multiple seeds grow simultaneously, so multiple unruled tables are detected in parallel;
The method not only recognizes the table body automatically but also detects table information such as the table title, table header, and table notes, and associates them with the matching table body so that they stay synchronized.
Brief description of the drawings
Fig. 1 is a flowchart of an implementation of the present invention;
Fig. 2 is a schematic diagram of the region-growing method.
Detailed description of the embodiment
To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to an embodiment and the accompanying drawings.
As shown in Fig. 1, the flow of the PDF table recognition method comprises:
obtaining the character set of the page and merging the characters into rows to build a row set;
extracting the horizontal and vertical lines from the page paths to build a line set;
detecting candidate table titles in the row set and candidate table lines in the line set;
if candidate table titles and candidate table lines both exist, recognizing tables with the region-growing method based on table titles and the line set;
if only candidate table lines exist, using the line set and the row set to detect fully ruled tables first and then three-line tables;
if only candidate table titles exist, recognizing tables with the region-growing method based on table titles and the row set;
if neither candidate table lines nor candidate table titles exist, judging that the page contains no table;
detecting secondary table elements such as the table header and table notes, and outputting the table recognition result for the page.
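The branching in the flow above can be sketched as a small dispatch function. This is an illustrative sketch only: the function name and the strategy labels are hypothetical and do not come from the patent, and each label merely stands for the corresponding step in the text.

```python
from typing import List


def choose_strategies(has_titles: bool, has_lines: bool) -> List[str]:
    """Return the recognition steps for one page, in the order the text lists them.

    The string labels are hypothetical placeholders for the real detectors.
    """
    if has_titles and has_lines:
        # Candidate titles and candidate lines both exist.
        return ["grow_from_titles_over_line_set"]
    if has_lines:
        # Only candidate lines: fully ruled tables first, then three-line tables.
        return ["detect_fully_ruled_tables", "detect_three_line_tables"]
    if has_titles:
        # Only candidate titles: grow over text rows instead of ruled lines.
        return ["grow_from_titles_over_row_set"]
    # Neither feature is present: the page is judged to contain no table.
    return []
```

Header and note detection would then run on whatever tables the chosen steps return.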
Characters obtained from the page are merged into rows mainly according to character stream order and the spatial arrangement of the characters. In most PDF documents the character stream order matches the reading order and characters are typeset neatly in rows, so merging characters into rows in stream order is efficient. During merging, spatial arrangement features such as the vertical distance and horizontal gap between characters are used as constraints: the closer two characters are vertically and the smaller the horizontal gap between them, the more likely they are adjacent and the higher the probability that they belong to the same row. After the preliminary merge in stream order, rows should also be merged horizontally with one another to handle cases where the character stream order differs from the reading order.
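A minimal sketch of the first pass, merging characters into rows in stream order under vertical and horizontal constraints, might look as follows. The `Char` type, the thresholds, and the assumption that y grows downward are all illustrative choices, not values from the patent; the second pass that merges rows with one another is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    text: str
    x0: float  # left
    y0: float  # top (y is assumed to grow downward)
    x1: float  # right
    y1: float  # bottom


def same_row(prev: Char, cur: Char,
             min_v_overlap: float = 0.5,
             max_gap_factor: float = 1.5) -> bool:
    """Hypothetical constraint: enough vertical overlap and a small horizontal gap."""
    overlap = min(prev.y1, cur.y1) - max(prev.y0, cur.y0)
    height = min(prev.y1 - prev.y0, cur.y1 - cur.y0)
    if height <= 0 or overlap / height < min_v_overlap:
        return False
    gap = cur.x0 - prev.x1
    # Reject large backward jumps as well as large forward gaps.
    return -0.5 * height <= gap <= max_gap_factor * height


def merge_into_rows(chars: List[Char]) -> List[List[Char]]:
    """Walk the characters in stream order and start a new row when the constraint fails."""
    rows: List[List[Char]] = []
    for ch in chars:
        if rows and same_row(rows[-1][-1], ch):
            rows[-1].append(ch)
        else:
            rows.append([ch])
    return rows


chars = [Char("A", 0, 0, 5, 10), Char("B", 6, 0, 11, 10), Char("C", 0, 20, 5, 30)]
rows = merge_into_rows(chars)
row_texts = ["".join(c.text for c in row) for row in rows]
```

In this toy input, "A" and "B" overlap vertically and sit one unit apart, while "C" lies on a lower baseline and opens a new row.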
Candidate table titles are detected by keyword matching: rows in the row set that begin with keywords such as "表" ("table"), "Tab", or "Table" are treated as candidate table titles. In some documents the table title is given in both Chinese and English, so the multilingual titles belonging to one table are merged when necessary.
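Keyword matching at the start of a row can be sketched with a regular expression. The pattern below is an assumption: the text only says that a row begins with a keyword, and the extra requirement that a digit follows the keyword is added here to avoid matching ordinary sentences such as "Tabular data follow".

```python
import re
from typing import List

# Hypothetical pattern: keyword at the start of the row, followed by a digit.
TITLE_PATTERN = re.compile(r"^\s*(表|Table|Tab\.?)\s*\d", re.IGNORECASE)


def find_candidate_titles(row_texts: List[str]) -> List[int]:
    """Return the indexes of rows that look like table titles."""
    return [i for i, text in enumerate(row_texts) if TITLE_PATTERN.match(text)]


sample_rows = [
    "表1 参数",
    "Table 1 Parameters",
    "Tabular data follow",
    "Tab. 2 Results",
    "body text",
]
title_indexes = find_candidate_titles(sample_rows)
```

Merging a Chinese title with the English title directly below it would be a separate step on top of these indexes.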
Table lines in a PDF document are stored as paths, but paths are not limited to table lines; they may also represent formula fraction lines, vector graphics, characters converted to outlines, decorative background patterns, and so on. Page paths can represent the following kinds of layout elements. Line paths: table lines, header and footer lines, formula fraction lines, annotation separator lines, decorative page background lines, and so on. Text paths: characters converted to outlines, which may appear in paragraphs, formulas, figures, and tables. Vector graphics: composed of paths such as straight lines, curves, and outlined characters. Editing paths: clipping paths, used to clip page regions, and filling paths, used to fill background colors and the like.
Table lines are line paths and are thin and long. Whether a path is a line path, and whether it is a horizontal or a vertical line, can be judged from the angle between the path's bounding rectangle and the horizontal or vertical direction.
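The "thin and long" test can be sketched on axis-aligned bounding rectangles. This is a simplification: the text speaks of the angle of the bounding rectangle, whereas the sketch below only compares the width and height of an axis-aligned box, and both thresholds are assumed values.

```python
from typing import Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def classify_line(rect: Rect,
                  max_thickness: float = 2.0,
                  min_length: float = 10.0) -> Optional[str]:
    """Return "horizontal", "vertical", or None if the path is not thin and long."""
    x0, y0, x1, y1 = rect
    width, height = abs(x1 - x0), abs(y1 - y0)
    if height <= max_thickness and width >= min_length:
        return "horizontal"
    if width <= max_thickness and height >= min_length:
        return "vertical"
    return None


kinds = [
    classify_line((0, 0, 100, 0.5)),   # thin and wide
    classify_line((0, 0, 0.5, 80)),    # thin and tall
    classify_line((0, 0, 50, 50)),     # a filled square, not a line
]
```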
In some documents a table line that looks like a single straight line is actually assembled from several short segments, so segments are spliced when necessary. There are two kinds of splicing, horizontal and vertical. Two horizontal lines are spliced horizontally when the vertical overlap ratio of their bounding rectangles exceeds a threshold and the horizontal gap between them is below a threshold. Two vertical lines are spliced vertically when the horizontal overlap ratio of their bounding rectangles exceeds a threshold and the vertical gap between them is below a threshold.
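Horizontal splicing under the two thresholds can be sketched as below; vertical splicing is symmetric with the axes swapped. The threshold values and the definition of the overlap ratio (overlap divided by the thinner rectangle's height) are assumptions, since the text leaves both open.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def vertical_overlap_ratio(a: Rect, b: Rect) -> float:
    """Vertical overlap of two rectangles relative to the thinner one."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    thinner = min(a[3] - a[1], b[3] - b[1])
    if thinner <= 0:
        # Zero-thickness lines: count them as overlapping only if they touch.
        return 1.0 if overlap >= 0 else 0.0
    return max(0.0, overlap) / thinner


def splice_horizontal(segments: List[Rect],
                      min_overlap: float = 0.5,
                      max_gap: float = 1.0) -> List[Rect]:
    """Join horizontal segments that lie on the same line and nearly touch."""
    merged: List[Rect] = []
    for seg in sorted(segments, key=lambda r: r[0]):  # left to right
        for i, m in enumerate(merged):
            gap = seg[0] - m[2]
            if vertical_overlap_ratio(m, seg) > min_overlap and gap < max_gap:
                merged[i] = (m[0], min(m[1], seg[1]), max(m[2], seg[2]), max(m[3], seg[3]))
                break
        else:
            merged.append(seg)
    return merged


spliced = splice_horizontal([(0, 10, 50, 11), (50.5, 10, 100, 11), (0, 40, 100, 41)])
```

The first two segments share a baseline and are 0.5 units apart, so they become one line; the third sits on a different baseline and is left alone.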
A table generally contains either a group of left- and right-aligned horizontal lines (as in a three-line table), or a group of left- and right-aligned horizontal lines together with a group of top- and bottom-aligned vertical lines (as in a fully ruled table). A table has at least two rows and two columns, each column holds at least one character, and columns are separated by at least one character width, so the width of a horizontal table line is >= 3 character widths and the height of a vertical table line is >= 2 character heights. When candidate table lines are detected from the line set, a group of horizontal lines or a group of vertical lines that satisfies both the alignment condition and the minimum-length limit is judged to be candidate table lines.
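The alignment condition and the minimum-length limit for horizontal lines can be sketched as follows. The tolerance, the bucketing by rounded endpoints, and the minimum group size of two are assumptions made for illustration; bucketing can split lines that fall on either side of a bucket boundary, which a real implementation would need to handle.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def candidate_horizontal_groups(lines: List[Rect],
                                char_width: float,
                                tol: float = 2.0,
                                min_group: int = 2) -> List[List[Rect]]:
    """Group horizontal lines that share left and right ends and are long enough."""
    # Minimum-length limit from the text: width >= 3 character widths.
    long_enough = [l for l in lines if (l[2] - l[0]) >= 3 * char_width]
    groups: Dict[Tuple[int, int], List[Rect]] = {}
    for l in long_enough:
        key = (round(l[0] / tol), round(l[2] / tol))  # coarse alignment bucket
        groups.setdefault(key, []).append(l)
    return [g for g in groups.values() if len(g) >= min_group]


page_lines = [
    (10, 0, 200, 0.5),
    (10, 50, 200, 50.5),
    (10, 100, 200, 100.5),
    (30, 70, 40, 70.5),  # too short to be a table line
]
groups = candidate_horizontal_groups(page_lines, char_width=10)
```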
In the region-growing method based on table titles and the line set, each table title is treated as a seed that marks the initial growth position, and the candidate table line set supplies the growth elements. Growing multiple seeds (multiple candidate tables) vertically and in parallel locates table body regions accurately and readily handles complex pages where several tables share one page. As shown in Fig. 2, the method has the following four steps:
(1) Set all candidate table titles on the page as seeds; they are the initial growth positions;
(2) Determine the maximum growth range of each seed and build the growth pool. The horizontal candidate table lines near a seed are collected as its growth elements, and the region bounded by those elements is the seed's maximum growth range. Seeds may share growth elements and their maximum growth ranges may overlap. The union of the growth elements of all seeds forms the growth pool;
(3) Vertical parallel growth: because most table titles sit above the table body, the growth direction can be set to vertically downward. In parallel growth, each growth step takes from the growth pool the seed and candidate table line with the shortest vertical distance between them, then judges whether the line can be merged into the seed's region and whether the seed can keep growing or must stop. Constraints such as the page's column layout and the spatial exclusivity of tables (tables neither intersect nor contain one another) are taken into account during growth. Parallel growth is not a serial process in which the next seed starts only after the previous one stops; the growth order is determined by the distances between seeds and growth elements in the growth pool;
(4) Table schema analysis: after all seeds stop growing, check whether the line subset and the row subset in each seed's region show the common features of a table, for example whether the region contains two or more aligned horizontal lines, whether the number of rows is >= 2, and whether the number of columns is >= 2. Whether the seed's region is a table is judged from these common features.
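Steps (1) to (3) can be sketched with a priority queue standing in for the growth pool. This is a deliberately reduced model, not the patented procedure: titles and lines are represented only by their y coordinates (assumed to grow downward), column layout is ignored, spatial exclusivity is reduced to "a line belongs to at most one seed", and `max_gap` is a hypothetical stopping threshold.

```python
import heapq
from typing import Dict, List, Tuple


def grow_regions(title_ys: List[float],
                 line_ys: List[float],
                 max_gap: float = 60.0) -> Dict[int, List[float]]:
    """Grow every title seed downward over horizontal lines, nearest pair first."""
    # Growth pool: every (seed, line) pair where the line lies below the seed,
    # ordered by the vertical distance between them.
    pool: List[Tuple[float, int, int]] = []
    for s, ty in enumerate(title_ys):
        for l, ly in enumerate(line_ys):
            if ly > ty:
                heapq.heappush(pool, (ly - ty, s, l))

    bottom = list(title_ys)              # current lower edge of each region
    stopped = [False] * len(title_ys)
    owner: Dict[int, int] = {}           # line index -> seed index
    regions: Dict[int, List[float]] = {s: [] for s in range(len(title_ys))}

    while pool:
        _, s, l = heapq.heappop(pool)
        if stopped[s]:
            continue
        ly = line_ys[l]
        # Stop when the line already belongs to another table or is too far away.
        if l in owner or ly - bottom[s] > max_gap:
            stopped[s] = True
            continue
        owner[l] = s
        regions[s].append(ly)
        bottom[s] = ly
    return regions


regions = grow_regions(title_ys=[0, 100], line_ys=[10, 20, 30, 110, 120])
```

Because pairs are popped by distance rather than seed by seed, the second title claims the lines at 110 and 120 before the first seed reaches them, which is what stops the first seed at 30. Step (4) would then test each region against the common table features.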
Fully ruled tables are detected by extracting from the candidate table line set a group of aligned horizontal lines and aligned vertical lines that effectively intersect, and judging whether the region they cover shows the common features of a table; if so, the region is provisionally marked as a candidate table. Within the set formed by all candidate tables together with the tables recognized by the region-growing method based on table titles and the line set, candidate tables that intersect or are contained by others are rejected according to the spatial exclusivity of tables, and the candidate tables that satisfy the rule are judged to be tables.
Extracting a group of effectively intersecting aligned horizontal and vertical lines from the candidate table line set works as follows: first, a region-intersection test judges whether the region covered by the aligned horizontal lines intersects the region covered by the aligned vertical lines. If the regions intersect, the judgment continues with the line-intersection test; if they do not, the group of horizontal lines and vertical lines is judged not to intersect.
The line-intersection test takes each horizontal line and each vertical line in turn and judges whether they intersect. If they do, the two lines are marked as an effective horizontal line and an effective vertical line respectively. When the number of effective horizontal lines is >= 3 and the number of effective vertical lines is >= 2, the group of horizontal lines and vertical lines is judged to intersect effectively; otherwise the intersection is judged invalid.
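The two-stage test, a cheap region check followed by pairwise line checks, can be sketched as below. The line representations and the tolerance `tol` are assumptions; the thresholds of three effective horizontal lines and two effective vertical lines come from the text.

```python
from typing import List, Set, Tuple

HLine = Tuple[float, float, float]  # (x0, x1, y)
VLine = Tuple[float, float, float]  # (x, y0, y1)
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def bbox_h(hs: List[HLine]) -> Box:
    return (min(h[0] for h in hs), min(h[2] for h in hs),
            max(h[1] for h in hs), max(h[2] for h in hs))


def bbox_v(vs: List[VLine]) -> Box:
    return (min(v[0] for v in vs), min(v[1] for v in vs),
            max(v[0] for v in vs), max(v[2] for v in vs))


def boxes_intersect(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def crosses(h: HLine, v: VLine, tol: float = 1.0) -> bool:
    """A horizontal and a vertical line cross if each spans the other's position."""
    return (h[0] - tol <= v[0] <= h[1] + tol) and (v[1] - tol <= h[2] <= v[2] + tol)


def effectively_intersect(hs: List[HLine], vs: List[VLine], tol: float = 1.0) -> bool:
    if not hs or not vs:
        return False
    # Stage 1: region-intersection test on the two covered regions.
    if not boxes_intersect(bbox_h(hs), bbox_v(vs)):
        return False
    # Stage 2: line-intersection test over every horizontal/vertical pair.
    eff_h: Set[int] = set()
    eff_v: Set[int] = set()
    for i, h in enumerate(hs):
        for j, v in enumerate(vs):
            if crosses(h, v, tol):
                eff_h.add(i)
                eff_v.add(j)
    return len(eff_h) >= 3 and len(eff_v) >= 2


grid_h = [(0, 100, 0), (0, 100, 50), (0, 100, 100)]
grid_v = [(0, 0, 100), (100, 0, 100)]
far_v = [(500, 0, 100), (600, 0, 100)]
```

The region test lets the far-away vertical lines be rejected without any pairwise checks.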
Three-line tables are detected by extracting a group of aligned horizontal lines from the candidate table line set and, according to the spatial exclusivity of tables, splitting the aligned horizontal lines with the set of tables already recognized, which yields subsets of aligned horizontal lines. Within each subset the horizontal lines are ordered from top to bottom, and the candidate start position and candidate end position of a table are located. The region between the candidate start and end positions is treated as a candidate table. Spatial exclusivity is applied again to reject candidate tables that intersect or contain one another, and the candidate tables that satisfy the rule are judged to be tables.
The set of tables already recognized consists of the tables recognized by the region-growing method based on table titles and the line set and the tables recognized by the fully-ruled-table detection method. Splitting the aligned horizontal lines with this set relies on spatial exclusivity: aligned horizontal lines separated by a recognized table cannot together form one table body. This reduces the horizontal line set and narrows the detection range for three-line tables.
Locating the candidate start position means checking whether the rows between each pair of adjacent horizontal lines show a table-like arrangement (at least two columns, a vertical distance between the horizontal lines that is not too large, and so on). If they do, that position is judged to be the candidate start position of the table; otherwise the search for a start position continues downward.
Locating the candidate end position means that, after the candidate start position is found, the rows between each subsequent pair of adjacent horizontal lines are checked for a table-like arrangement. If the arrangement is present the search continues downward; otherwise that position is treated as the candidate end position of the table.
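The start and end search can be sketched as a scan over the bands between adjacent horizontal lines. The helper names, the `max_band_height` threshold, and the reduction of the "table-like arrangement" test to a column count plus a height limit are assumptions for illustration.

```python
from typing import List, Tuple


def band_has_table_arrangement(column_count: int,
                               band_height: float,
                               max_band_height: float = 200.0) -> bool:
    """Hypothetical test: at least two columns and lines that are not too far apart."""
    return column_count >= 2 and band_height <= max_band_height


def find_table_spans(band_is_tabular: List[bool]) -> List[Tuple[int, int]]:
    """Band i lies between line i and line i + 1 (lines sorted top to bottom).

    Returns (start_line, end_line) index pairs for each run of tabular bands.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, ok in enumerate(band_is_tabular):
        if ok and start is None:
            start = i                      # candidate start position
        elif not ok and start is not None:
            spans.append((start, i))       # candidate end position
            start = None
    if start is not None:
        spans.append((start, len(band_is_tabular)))
    return spans


# Five lines give four bands; the third band holds ordinary paragraph text.
spans = find_table_spans([True, True, False, True])
```

Here lines 0 to 2 form one candidate three-line table and lines 3 to 4 form a second candidate, which later checks could accept or reject.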
The region-growing method based on table titles and the row set uses table titles as seeds that mark the initial growth positions and the row set as growth elements; multiple seeds grow simultaneously to recognize tables.
The region-growing method based on table titles and the row set and the one based on table titles and the line set both apply the idea of region growing. They differ in that the former uses the row set as growth elements while the latter uses the candidate table line set; and the common table features used by the former cover only the arrangement of the row set, while those used by the latter cover the arrangement of both the row set and the line set.
As shown in Fig. 2, the region-growing method based on table titles and the row set has the following four steps:
(1) Set all candidate table titles on the page as seeds; they are the initial growth positions;
(2) Determine the maximum growth range of each seed and build the growth pool. The maximum growth range of each seed is determined from the seed's position on the page and the page's column layout, and the growth pool is built with the row set as growth elements. As in the method based on table titles and the line set, seeds may share growth elements and their maximum growth ranges may overlap;
(3) Vertical parallel growth: the growth direction is vertically downward. In parallel growth, each growth step takes from the growth pool the seed and row with the shortest vertical distance between them, then judges whether the row can be merged into the seed's region and whether the seed can keep growing or must stop. The growth order is determined by the distances between seeds and growth elements in the growth pool;
(4) Table schema analysis: after all seeds stop growing, judge whether the row subset in each seed's region shows the common features of a table, for example whether it has at least two rows and two columns and whether the row spacing or column spacing is uniform. Whether the seed's region is a table is judged from these features.
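Step (4) for unruled tables can be sketched as a check on the rows collected in a seed's region. The inputs (one y position and one column count per row) and the spacing tolerance are assumptions; the text lists uniform row or column spacing as example features, and this sketch checks row spacing only.

```python
from typing import List


def looks_like_table(row_ys: List[float],
                     column_counts: List[int],
                     spacing_tol: float = 0.25) -> bool:
    """At least two rows, at least two columns per row, and roughly uniform row spacing."""
    if len(row_ys) < 2 or len(column_counts) != len(row_ys):
        return False
    if min(column_counts) < 2:
        return False
    gaps = [b - a for a, b in zip(row_ys, row_ys[1:])]
    mean_gap = sum(gaps) / len(gaps)
    # Every gap must stay within spacing_tol of the mean gap.
    return all(abs(g - mean_gap) <= spacing_tol * mean_gap for g in gaps)


regular = looks_like_table([0, 12, 24, 36], [3, 3, 3, 3])
irregular = looks_like_table([0, 12, 60], [3, 3, 3])
single_column = looks_like_table([0, 12], [1, 3])
```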
Unruled tables without a table title lack both basic table features, the title and the table lines, so they are easily confused with other page elements such as matrices and determinants, which causes false recognition. Statistical analysis of a large number of Chinese specification PDF documents shows that unruled tables without a title are rarely used in practice, so they are not handled here.
The table header is usually located below the table title and above the table body, and is commonly used to describe the unit of the table elements, for example "N/mm". Table notes are usually located below the table body and commonly give supplementary information such as the source of the table. The table header and table notes are identified here mainly by their position and by keyword matching (such as "note").
All table recognition results for the page are output. Each result includes: the table title position and content, the table header position and content, the table body position, and the table note position and content.
Although the embodiments disclosed herein are as described above, the content is only an embodiment adopted to facilitate understanding of the present invention and is not intended to limit it. Any person skilled in the art to which the invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention remains subject to the scope defined by the appended claims.

Claims (5)

  1. A method for recognizing tables in PDF documents, characterized in that the method comprises:
    obtaining the character set of the page and merging the characters into rows to build a row set;
    extracting the horizontal and vertical lines from the page paths to build a line set;
    detecting candidate table titles in the row set and candidate table lines in the line set;
    if candidate table titles and candidate table lines both exist, recognizing tables with the region-growing method based on table titles and the line set;
    if only candidate table lines exist, using the line set and the row set to detect fully ruled tables first and then three-line tables;
    if only candidate table titles exist, recognizing tables with the region-growing method based on table titles and the row set;
    if neither candidate table lines nor candidate table titles exist, judging that the page contains no table;
    detecting secondary table elements such as the table header and table notes, and outputting the table recognition result for the page.
  2. The method for recognizing tables in PDF documents according to claim 1, characterized in that the recognition difficulty of different table types is graded, from easiest to hardest: tables with both a table title and table lines, fully ruled tables without a title, three-line tables without a title, unruled tables with a title, and unruled tables without a title; the tables are also recognized in order from easy to hard.
  3. The method for recognizing tables in PDF documents according to claim 1, characterized in that detecting candidate table titles in the row set comprises: detecting the start of each row by keyword matching, and merging the multilingual candidate table titles that belong to one table.
  4. The method for recognizing tables in PDF documents according to claim 1, characterized in that detecting fully ruled tables with the line set and the row set comprises: judging whether a group of aligned horizontal lines and a group of aligned vertical lines effectively intersect, where effective intersection is judged first with the region-intersection test and then with the line-intersection test.
  5. The method for recognizing tables in PDF documents according to claim 1, characterized in that the table information recognized by the method includes the table body, table title, table header, and table notes, and the table title, table header, and table notes are associated with the matching table body and kept synchronized.
CN201610025529.8A 2016-01-15 2016-01-15 A kind of method of PDF document Table recognition Active CN105589841B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610025529.8A CN105589841B (en) 2016-01-15 2016-01-15 A kind of method of PDF document Table recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610025529.8A CN105589841B (en) 2016-01-15 2016-01-15 A kind of method of PDF document Table recognition

Publications (2)

Publication Number Publication Date
CN105589841A CN105589841A (en) 2016-05-18
CN105589841B true CN105589841B (en) 2018-03-30

Family

ID=55929431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610025529.8A Active CN105589841B (en) 2016-01-15 2016-01-15 A kind of method of PDF document Table recognition

Country Status (1)

Country Link
CN (1) CN105589841B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084117A (en) * 2019-03-22 2019-08-02 中国科学院自动化研究所 Document table line detecting method, system based on binary map segmented projection

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446863B (en) * 2016-10-11 2020-01-21 同方知网(北京)技术有限公司 PDF document logic diagram identification method
CN106802884B (en) * 2017-02-17 2020-09-22 同方知网(北京)技术有限公司 Method for fragmenting text of layout document
CN106897690B (en) * 2017-02-22 2018-04-13 南京述酷信息技术有限公司 PDF table extracting methods
CN107133566A (en) * 2017-03-31 2017-09-05 常诚 A kind of method of chart in identification PDF document
CN107315989B (en) * 2017-05-03 2020-06-12 天方创新(北京)信息技术有限公司 Text recognition method and device for medical data picture
CN108170697B (en) * 2017-07-12 2021-08-20 信号旗智能科技(上海)有限公司 International trade file processing method and system and server
CN107943956A (en) * 2017-11-24 2018-04-20 北京金堤科技有限公司 Conversion of page method, apparatus and conversion of page equipment
CN107909064B (en) * 2017-12-27 2018-11-16 掌阅科技股份有限公司 Three line table recognition methods, electronic equipment and storage medium
CN108197216A (en) * 2017-12-28 2018-06-22 深圳市巨鼎医疗设备有限公司 A kind of method of information processing
CN110163030B (en) * 2018-02-11 2021-04-23 鼎复数据科技(北京)有限公司 PDF framed table extraction method based on image information
CN108470021B (en) * 2018-03-26 2022-06-03 阿博茨德(北京)科技有限公司 Method and device for positioning table in PDF document
CN108446264B (en) * 2018-03-26 2022-02-15 阿博茨德(北京)科技有限公司 Method and device for analyzing table vector in PDF document
CN109062874B (en) * 2018-06-12 2022-03-04 平安科技(深圳)有限公司 Financial data acquisition method, terminal device and medium
CN109284495B (en) * 2018-11-03 2023-02-07 上海犀语科技有限公司 Method and device for performing table-free line table cutting on text
CN109635268B (en) * 2018-12-29 2023-05-05 南京吾道知信信息技术有限公司 Method for extracting form information in PDF file
CN109934160B (en) * 2019-03-12 2023-06-02 天津瑟威兰斯科技有限公司 Method and system for extracting table text information based on table recognition
CN110263792B (en) * 2019-06-12 2021-10-22 广东小天才科技有限公司 Image recognizing and reading and data processing method, intelligent pen, system and storage medium
CN110413979A (en) * 2019-08-05 2019-11-05 金税桥大数据科技股份有限公司 Industry table digitalized processing method based on image recognition technology
CN110659346B (en) * 2019-08-23 2024-04-12 平安科技(深圳)有限公司 Form extraction method, form extraction device, terminal and computer readable storage medium
CN110516048A (en) * 2019-09-02 2019-11-29 苏州朗动网络科技有限公司 The extracting method, equipment and storage medium of list data in pdf document
CN112380812B (en) * 2020-10-09 2022-02-22 北京中科凡语科技有限公司 Method, device, equipment and storage medium for extracting incomplete frame line table of PDF (Portable document Format)
CN112464626B (en) * 2020-12-09 2022-04-01 上海携宁计算机科技股份有限公司 Graph extraction method of PDF (Portable document Format) document, electronic equipment and storage medium
CN113239818B (en) * 2021-05-18 2023-05-30 上海交通大学 Table cross-modal information extraction method based on segmentation and graph convolution neural network
CN113408251B (en) * 2021-06-30 2023-08-18 北京百度网讯科技有限公司 Layout document processing method and device, electronic equipment and readable storage medium
CN114612921B (en) * 2022-05-12 2022-07-19 中信证券股份有限公司 Form recognition method and device, electronic equipment and computer readable medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866335A (en) * 2010-06-14 2010-10-20 深圳市万兴软件有限公司 Form processing method and device in document conversion
CN101976232A (en) * 2010-09-19 2011-02-16 深圳市万兴软件有限公司 Method for identifying data form in document and device thereof
CN103377177A (en) * 2012-04-27 2013-10-30 北大方正集团有限公司 Method and device for identifying forms in digital format files
CN104063364A (en) * 2013-03-19 2014-09-24 福建福昕软件开发股份有限公司北京分公司 PDF document recognition method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9047533B2 (en) * 2012-02-17 2015-06-02 Palo Alto Research Center Incorporated Parsing tables by probabilistic modeling of perceptual cues

Non-Patent Citations (3)

Title
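Table Recognition and Understanding from PDF Files; Tamir Hassan et al.; International Conference on Document Analysis and Recognition; 20070926; pp. 1-5 *
Research on table recognition technology based on PDF text streams; Zhang Bo; China Master's Theses Full-text Database, Information Science and Technology; 20100915; Vol. 2010 (No. 9); pp. I138-534 *
Automatic table detection and performance evaluation for fixed-layout electronic documents; Fang Jing et al.; Journal of Peking University (Natural Science Edition); 20130131; Vol. 49 (No. 1); pp. 45-53 *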

Cited By (1)

Publication number Priority date Publication date Assignee Title
CN110084117A (en) * 2019-03-22 2019-08-02 中国科学院自动化研究所 Document table line detecting method, system based on binary map segmented projection

Also Published As

Publication number Publication date
CN105589841A (en) 2016-05-18

Similar Documents

Publication Publication Date Title
CN105589841B (en) A kind of method of PDF document Table recognition
CN105930159B (en) Method and system for image-based GUI code generation
US7054871B2 (en) Method for identifying and using table structures
CN106802884B (en) Method for fragmenting text of layout document
CN106250830B (en) Digital book structured analysis processing method
CA2486528C (en) Document structure identifier
Simon et al. ViPER: augmenting automatic information extraction with visual perceptions
CN108470021A (en) Method and device for locating tables in PDF documents
EP1679613A2 (en) Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents
US20070196015A1 (en) Table of contents extraction with improved robustness
CN101329731A (en) Automatic recognition method of mathematical formula in image
CN101206639A (en) Method for indexing complex impression based on PDF
CN101354727B (en) Method and apparatus for establishing links between digital document catalog and text
CN111274239A (en) Test paper structuralization processing method, device and equipment
US10762377B2 (en) Floating form processing based on topological structures of documents
Klampfl et al. A comparison of two unsupervised table recognition methods from digital scientific articles
CN106502991A (en) Publication treating method and apparatus
CN113962201A (en) Document structuralization and extraction method for documents
US9049400B2 (en) Image processing apparatus, and image processing method and program
CN106446863B (en) PDF document logic diagram identification method
CN107590448A (en) Method for automatically obtaining QTL data from documents
CN110688825A (en) Method for extracting information of table containing lines in layout document
US9418051B2 (en) Methods and devices for extracting document structure
Huang et al. Associating text and graphics for scientific chart understanding
Asi et al. User-assisted alignment of arabic historical manuscripts

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant