CN105589841B - A method for table recognition in PDF documents - Google Patents
A method for table recognition in PDF documents
- Publication number
- CN105589841B (application CN201610025529.8A)
- Authority
- CN
- China
- Prior art keywords
- line
- doubtful
- title
- page
- row
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/177—Editing, e.g. inserting or deleting of tables; using ruled lines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a method for recognizing tables in PDF documents, comprising: obtaining the character set of a page and merging the characters into text lines to build a text-line set; extracting the horizontal and vertical lines from the page's paths to build a line set; detecting candidate table captions in the text-line set and candidate table lines in the line set; if both candidate table captions and candidate table lines are present, recognizing tables with a region-growing method based on table captions and the line set; if only candidate table lines are present, using the line set and the text-line set to detect fully ruled tables first and three-line tables second; if only candidate table captions are present, recognizing tables with a region-growing method based on table captions and the text-line set; if neither candidate table lines nor candidate table captions are present, concluding that the page contains no table; and detecting secondary table elements such as table headers and table notes, then outputting the table recognition result for the page. The invention treats the table caption, the table lines, and the arrangement of table characters as the three principal features of a table, and uses parallel region growing to locate tables accurately in complex layouts where several tables share one page.
Description
Technical field
The present invention relates to a method for recognizing tables in PDF documents, and belongs to the field of layout analysis and layout understanding for fixed-layout electronic documents.
Background art
With the rapid development of the domestic digital publishing industry, the industry currently needs to solve how to exploit publishing resources in depth, how to process document resources quickly, how to fragment and recombine those resources, and how to meet the demands of multi-form, multi-channel, multi-media digital publishing. Document fragmentation covers the indexing of basic metadata such as title, author, keywords, and references, as well as the fragmentation of body content such as paragraphs, figures, tables, and formulas. Document layout understanding is the key technology for automatic document fragmentation, and the PDF table recognition method described here belongs to the layout analysis and understanding techniques for fixed-layout documents.
PDF (Portable Document Format) is an electronic document format developed by Adobe. It is independent of the operating system platform and has become a widely used format for electronic document distribution and digital information dissemination. PDF is a fixed-layout format: pages are relatively independent, and changing the content of one page does not affect the layout of the others. Although a fixed-layout document describes and renders a page layout precisely, it does not record the document's logical structure. It has no logical elements such as paragraphs, tables, or formulas, so it cannot express how the document is organized or store the document logically.
In a PDF file, a table is split into table lines and table content, which are described separately. Each table line is drawn by path (PATH) operators, but paths are not limited to table lines: they may also represent formula fraction bars, vector graphics, characters converted to outlines, decorative page patterns, and other elements. Table content is not guaranteed to be stored as a single string, and the characters of one table are not guaranteed to appear together; they are often interleaved with other page content. This storage scheme makes table recognition complicated.
A table generally consists of a table caption, a table body, table notes, and similar elements; the table body comprises table lines and character content. Existing techniques for recognizing tables in fixed-layout documents (such as CN103377177A) emphasize table-line features and overlook the caption as an important table feature. In complex layouts where several tables share a page, especially several three-line tables (tables containing only horizontal lines), relying only on intersecting table lines and the arrangement of body text reduces both the accuracy and the efficiency of table recognition.
Summary of the invention
To solve the above technical problem, the present invention, based on how tables are stored in PDF documents, aims to provide a method for recognizing tables in PDF documents.
The purpose of the present invention is achieved by the following technical solution:
A method for recognizing tables in PDF documents, comprising:
Obtaining the character set of a page and merging the characters into text lines to build a text-line set;
Extracting the horizontal and vertical lines from the page's paths to build a line set;
Detecting candidate table captions in the text-line set and candidate table lines in the line set;
If both candidate table captions and candidate table lines are present, recognizing tables with the region-growing method based on table captions and the line set;
If only candidate table lines are present, using the line set and the text-line set to detect fully ruled tables first and three-line tables second;
If only candidate table captions are present, recognizing tables with the region-growing method based on table captions and the text-line set;
If neither candidate table lines nor candidate table captions are present, concluding that the page contains no table;
Detecting secondary table elements such as table headers and table notes, and outputting the table recognition result for the page.
Compared with the prior art, one or more embodiments of the present invention can have the following advantages:
Table types are graded by recognition difficulty, from easiest to hardest: tables with both a caption and table lines, fully ruled tables without a caption, three-line tables without a caption, lineless tables with a caption, and lineless tables without a caption. Recognizing them in easy-to-hard order improves both the accuracy and the efficiency of table recognition;
The region-growing method based on table captions and the line set treats each caption as a seed that marks the initial growth position and uses the set of candidate table lines as growth elements. Growing multiple seeds (multiple candidate tables) in parallel locates tables accurately and handles complex layouts in which several tables share a page;
The region-growing method based on table captions and the text-line set uses captions as seeds and the text-line set as growth elements; growing multiple seeds simultaneously detects multiple lineless tables in parallel;
The method not only recognizes the table body automatically but also detects table information such as the caption, header, and notes, and associates them with the matching table body so that they stay synchronized.
Brief description of the drawings
Fig. 1 is a flow chart of an implementation of the present invention;
Fig. 2 is region-growing method schematic diagram.
Detailed description
To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments and the accompanying drawings.
As shown in Fig. 1, the method flow for table recognition in PDF documents comprises:
Obtaining the character set of a page and merging the characters into text lines to build a text-line set;
Extracting the horizontal and vertical lines from the page's paths to build a line set;
Detecting candidate table captions in the text-line set and candidate table lines in the line set;
If both candidate table captions and candidate table lines are present, recognizing tables with the region-growing method based on table captions and the line set;
If only candidate table lines are present, using the line set and the text-line set to detect fully ruled tables first and three-line tables second;
If only candidate table captions are present, recognizing tables with the region-growing method based on table captions and the text-line set;
If neither candidate table lines nor candidate table captions are present, concluding that the page contains no table;
Detecting secondary table elements such as table headers and table notes, and outputting the table recognition result for the page.
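The four-way branch in the flow above can be sketched as a small dispatch function. This is only an illustration: the names `Strategy` and `choose_strategy` are hypothetical and do not come from the patent, and the two boolean inputs stand for the results of the caption and table-line detection steps.

```python
from enum import Enum


class Strategy(Enum):
    GROW_WITH_LINES = "region growing on captions + line set"
    RULED_THEN_THREE_LINE = "fully ruled tables first, then three-line tables"
    GROW_WITH_TEXT_LINES = "region growing on captions + text-line set"
    NO_TABLE = "page contains no table"


def choose_strategy(has_captions: bool, has_table_lines: bool) -> Strategy:
    """Pick the recognition branch from the two detection results."""
    if has_captions and has_table_lines:
        return Strategy.GROW_WITH_LINES
    if has_table_lines:
        return Strategy.RULED_THEN_THREE_LINE
    if has_captions:
        return Strategy.GROW_WITH_TEXT_LINES
    return Strategy.NO_TABLE
```

The order of the checks mirrors the easy-to-hard ordering described in the text: the branch with both features is tried first, and the no-table verdict is the fallback.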
Characters are obtained from the page and merged into text lines mainly by following the character stream order and the spatial arrangement of the characters. In most PDF documents the character stream order matches the reading order and characters are typeset neatly in lines, so merging characters into lines in stream order is efficient. During merging, spatial features such as the vertical distance and horizontal gap between characters are used as constraints: the smaller the vertical distance and the horizontal gap between two characters, the more likely they are adjacent and the higher the probability that they belong to the same line. After the preliminary merge in stream order, text lines should also be merged horizontally with one another to handle cases where the character stream order differs from the reading order.
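A minimal sketch of the preliminary merge in stream order is given below. The `Char` record, the function name, and both threshold ratios are assumptions made for illustration; the patent does not specify concrete values, and the later line-to-line horizontal merge is not shown.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x0: float    # left edge
    x1: float    # right edge
    y: float     # baseline position
    size: float  # font size, used to scale the tolerances


def merge_into_lines(chars, max_dy_ratio=0.5, max_gap_ratio=3.0):
    """Group characters into text lines, following character-stream order.

    A character joins the current line when it is vertically close to the
    previous character and the horizontal gap between them is small.
    Both ratios are illustrative thresholds expressed in font sizes.
    """
    lines = []
    for ch in chars:
        if lines:
            prev = lines[-1][-1]
            dy = abs(ch.y - prev.y)
            gap = ch.x0 - prev.x1
            same_line = (
                dy <= max_dy_ratio * prev.size
                and -prev.size <= gap <= max_gap_ratio * prev.size
            )
            if same_line:
                lines[-1].append(ch)
                continue
        lines.append([ch])  # start a new text line
    return ["".join(c.text for c in line) for line in lines]
```

Only the previous character is compared, which keeps the pass linear in the number of characters; that is the efficiency argument the paragraph makes for following stream order.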
Candidate table captions are detected by keyword matching: text lines that begin with a keyword such as "table" (Chinese "表"), "Tab", or "Table" are selected from the text-line set and treated as candidate table captions. Some documents give the caption in several languages at once, for example Chinese and English, so multilingual captions that belong to the same table are merged when necessary.
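Keyword matching at the start of a text line could look like the sketch below. The keywords follow the examples in the text; requiring a number after the keyword is an added assumption intended to cut false positives, and the names `CAPTION_RE` and `is_candidate_caption` are hypothetical.

```python
import re

# Keyword at the start of the line, then a table number.
# The trailing "\d+" is an assumption, not something the patent states.
CAPTION_RE = re.compile(r"^\s*(表|Table|Tab\.?)\s*\d+", re.IGNORECASE)


def is_candidate_caption(line_text: str) -> bool:
    """True when the text line starts like a table caption."""
    return CAPTION_RE.match(line_text) is not None
```

A line such as "Tables are useful" is rejected because no number follows the keyword, while "Tab. 3" and "表1" are accepted.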
The table lines of a PDF document are stored as paths, but paths are not limited to table lines; they may also represent formula fraction bars, vector graphics, outlined characters, decorative background patterns, and so on. Page paths can represent the following kinds of layout elements. Line paths: table lines, header and footer rules, formula fraction bars, footnote separator rules, decorative page-background lines, and so on. Text paths: characters converted to outlines, which may appear in paragraphs, formulas, figures, and tables. Vector graphics: composed of paths such as straight lines, curves, and outlined characters. Editing paths: clipping paths, which clip a page region, and filling paths, which fill background color and the like. Table lines are line paths and are "thin and long", so the angle that a path's bounding rectangle makes with the horizontal or vertical direction can be used to judge whether the path is a line path and whether it is a horizontal or a vertical line.
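One way to read the "thin and long" test is to look at the diagonal of the path's bounding rectangle: a nearly flat diagonal means a horizontal line, a nearly upright one a vertical line. The sketch below follows that reading; the function name and both thresholds are assumptions, since the patent gives no numeric limits.

```python
import math


def classify_path(x0, y0, x1, y1, max_thickness=2.0, max_angle_deg=2.0):
    """Return 'horizontal', 'vertical', or None for a path's bounding box."""
    width, height = abs(x1 - x0), abs(y1 - y0)
    if width == 0 and height == 0:
        return None
    # Angle of the box diagonal measured from the horizontal axis.
    angle = math.degrees(math.atan2(height, width))
    if angle <= max_angle_deg and height <= max_thickness:
        return "horizontal"
    if angle >= 90 - max_angle_deg and width <= max_thickness:
        return "vertical"
    return None  # not a thin, long line path
```

Paths that fail both tests (for example outlined characters or vector graphics) are simply left out of the line set.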
In some documents a table line that looks like a single straight line is actually made of several short segments spliced together, so segments are spliced when necessary. There are two kinds of splicing, horizontal and vertical. Two horizontal lines are spliced horizontally when the vertical overlap ratio of their bounding rectangles exceeds a threshold and the horizontal gap between them is below a threshold. Two vertical lines are spliced vertically when the horizontal overlap ratio of their bounding rectangles exceeds a threshold and the vertical gap between them is below a threshold.
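The horizontal case can be sketched as follows; vertical splicing is symmetric with the axes swapped. Boxes are `(x0, y0, x1, y1)` tuples, and the overlap definition (overlap divided by the smaller box height) and both thresholds are assumptions, because the patent only speaks of "a threshold".

```python
def vertical_overlap_ratio(a, b):
    """Overlap of two boxes' y-ranges divided by the smaller box height."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return max(overlap, 0.0) / shorter if shorter > 0 else 0.0


def splice_horizontal(segments, min_overlap=0.5, max_gap=1.0):
    """Splice horizontal line boxes (x0, y0, x1, y1) that lie on the same rule."""
    merged = []
    for seg in sorted(segments):           # left to right by x0
        for i, line in enumerate(merged):
            gap = seg[0] - line[2]         # space between line end and segment start
            if vertical_overlap_ratio(line, seg) > min_overlap and gap < max_gap:
                merged[i] = (line[0], min(line[1], seg[1]),
                             max(line[2], seg[2]), max(line[3], seg[3]))
                break
        else:
            merged.append(tuple(seg))      # no partner found: keep as its own line
    return merged
```

Processing segments from left to right means each merged line only ever has to be extended at its right end.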
A table generally contains one group of left- and right-aligned horizontal lines (as in a three-line table), or one group of left- and right-aligned horizontal lines together with one group of top- and bottom-aligned vertical lines (as in a fully ruled table). A table has at least two rows and two columns, each column holds at least one character, and adjacent columns are separated by at least one character gap, so a horizontal table line is at least 3 character widths wide and a vertical table line is at least 2 character heights tall. When candidate table lines are detected in the line set, a group of horizontal lines or a group of vertical lines that satisfies both the alignment condition and the minimum-length limit is judged to be candidate table lines.
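For horizontal lines, the two conditions (minimum length and left/right alignment) can be sketched as a filter followed by grouping. Lines are `(x0, x1, y)` tuples; the function name, the rounding-based bucketing, and the minimum group size of two are assumptions, and bucketing by rounding can split lines whose ends fall on either side of a bucket boundary.

```python
from collections import defaultdict


def candidate_horizontal_lines(lines, char_width, tol=1.0):
    """Group horizontal lines (x0, x1, y) that share both end positions.

    Lines shorter than 3 character widths are discarded first, which is the
    minimum-length limit stated in the text.
    """
    groups = defaultdict(list)
    for x0, x1, y in lines:
        if x1 - x0 < 3 * char_width:
            continue
        key = (round(x0 / tol), round(x1 / tol))  # same left and right ends
        groups[key].append((x0, x1, y))
    # Keep only groups with at least two aligned lines, ordered top to bottom.
    return [sorted(g, key=lambda l: l[2]) for g in groups.values() if len(g) >= 2]
```

The same idea applies to vertical lines, with 2 character heights as the minimum and top/bottom ends as the grouping key.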
The region-growing method based on table captions and the line set treats each table caption as a seed that marks the initial growth position and uses the set of candidate table lines as growth elements. By growing multiple seeds (multiple candidate tables) vertically and in parallel, it locates table body regions accurately and readily handles complex pages on which several tables coexist. As shown in Fig. 2, the method has four steps:
(1) All candidate table captions on the page are set as seeds and serve as the initial growth positions;
(2) The maximum growth range of each seed is determined and a growth pool is built. The set of candidate horizontal table lines near a seed is collected as its growth elements, and the region bounded by those growth elements is the maximum growth range of the seed. Seeds may share growth elements, and their maximum growth ranges may overlap. The union of the growth elements of all seeds forms the growth pool;
(3) Vertical parallel growth: because most table captions are located above the table body, the growth direction can be set to vertically downward. Parallel growth means that in each growth step the seed and corresponding candidate table line with the shortest vertical distance between them are taken from the growth pool, and the method judges whether the line can be merged into the region of that seed and whether the seed can keep growing or should stop. Constraints such as the page's column layout and the spatial exclusivity of tables (tables neither intersect nor contain one another) are taken into account during growth. Parallel growth is not a serial process in which the next seed starts growing only after the previous one has stopped; instead, the order of growth is determined by the distances between the seeds and the growth elements in the growth pool;
(4) Table structure analysis: after all seeds have stopped growing, the line subset and the text-line subset in the region of each seed are checked for common table features, for example whether the region contains two or more aligned horizontal lines, whether the number of rows is >= 2, and whether the number of columns is >= 2. Whether the region of a seed is a table is decided from these common table features.
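Steps (1) to (3) can be illustrated with a heavily simplified one-dimensional sketch in which captions and horizontal lines are reduced to their vertical positions. Everything here is an assumption made for illustration: the function name, the `max_step` limit, coordinates that increase downward, and the reduction of spatial exclusivity to "a region may not grow past another caption". Column layout and the structure analysis of step (4) are not modelled.

```python
def grow_regions(caption_ys, line_ys, max_step=200.0):
    """Assign horizontal lines to caption seeds by parallel downward growth.

    Coordinates increase downward. Returns {caption_y: [line_y, ...]}.
    """
    captions = sorted(caption_ys)
    regions = {c: [] for c in captions}
    bottom = {c: c for c in captions}   # current lower edge of each region
    active = set(captions)
    free = sorted(line_ys)              # growth pool

    while active and free:
        # Closest (seed, line) pair, looking only below each region.
        pairs = [(y - bottom[c], c, y) for c in active for y in free if y > bottom[c]]
        if not pairs:
            break
        dist, seed, line = min(pairs)
        # Spatial exclusivity: a region may not grow past another caption.
        blocked = any(bottom[seed] < other < line for other in captions if other != seed)
        if dist > max_step or blocked:
            active.discard(seed)        # this seed stops growing
            continue
        regions[seed].append(line)
        bottom[seed] = line
        free.remove(line)
    return regions
```

Because the globally closest pair is chosen at every step, the seeds take turns according to distance rather than one seed finishing before the next starts, which is the sense of "parallel" used in step (3).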
To detect fully ruled tables, a group of effectively intersecting aligned horizontal lines and aligned vertical lines is extracted from the candidate table-line set, and the region covered by these lines is checked for common table features; if they are present, the region is provisionally marked as a candidate table. In the set formed by all candidate tables together with the tables recognized by the region-growing method based on table captions and the line set, candidate tables that intersect or are contained in another table are rejected according to the spatial exclusivity of tables, and the candidate tables that satisfy the rules are judged to be tables.
Extracting a group of effectively intersecting aligned horizontal lines and aligned vertical lines from the candidate table-line set means first using the region-intersection method to judge whether the area covered by the aligned horizontal lines intersects the area covered by the aligned vertical lines. If the areas intersect, the straight-line intersection method is applied next; if they do not, the group of horizontal lines and the group of vertical lines are judged not to intersect.
The straight-line intersection method takes each horizontal line and each vertical line in turn and judges whether they intersect. If they do, the two lines are marked as an effective horizontal line and an effective vertical line respectively. When the number of effective horizontal lines is >= 3 and the number of effective vertical lines is >= 2, the group of horizontal lines and the group of vertical lines are judged to intersect effectively; otherwise the intersection is judged invalid.
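The two-stage test can be sketched as a cheap bounding-box check followed by the pairwise line check. Horizontal lines are `(x0, x1, y)` and vertical lines `(x, y0, y1)`; the function names and the tolerance `tol` are assumptions, while the thresholds of 3 effective horizontal lines and 2 effective vertical lines come from the text.

```python
def boxes_intersect(a, b):
    """Axis-aligned boxes given as (x0, y0, x1, y1)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def effectively_intersect(h_lines, v_lines, tol=1.0):
    """h_lines: (x0, x1, y) tuples; v_lines: (x, y0, y1) tuples."""
    if not h_lines or not v_lines:
        return False
    # Stage 1: region test on the areas covered by each group.
    h_box = (min(l[0] for l in h_lines), min(l[2] for l in h_lines),
             max(l[1] for l in h_lines), max(l[2] for l in h_lines))
    v_box = (min(l[0] for l in v_lines), min(l[1] for l in v_lines),
             max(l[0] for l in v_lines), max(l[2] for l in v_lines))
    if not boxes_intersect(h_box, v_box):
        return False
    # Stage 2: pairwise test of every horizontal line against every vertical line.
    valid_h, valid_v = set(), set()
    for i, (hx0, hx1, hy) in enumerate(h_lines):
        for j, (vx, vy0, vy1) in enumerate(v_lines):
            if hx0 - tol <= vx <= hx1 + tol and vy0 - tol <= hy <= vy1 + tol:
                valid_h.add(i)
                valid_v.add(j)
    return len(valid_h) >= 3 and len(valid_v) >= 2
```

Running the region test first avoids the quadratic pairwise loop for groups that are nowhere near each other.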
To detect three-line tables, a group of aligned horizontal lines is extracted from the candidate table-line set and, according to the spatial exclusivity of tables, split by the set of previously recognized tables into subsets of aligned horizontal lines. Within each subset the horizontal lines are ordered from top to bottom, and the candidate start position and candidate end position of a table are located. The region between the candidate start and end positions is regarded as a candidate table. Spatial exclusivity is then applied again to reject candidate tables that intersect or are contained in another table, and the candidate tables that satisfy the rules are judged to be tables.
The set of previously recognized tables is the set made up of the tables recognized by the region-growing method based on table captions and the line set and the tables recognized by the fully ruled table detection method. Splitting the aligned horizontal lines by this set means that, by the spatial exclusivity of tables, aligned horizontal lines separated by an already recognized table cannot together form one table body; this reduces the horizontal-line set and narrows the detection range for three-line tables.
Locating the candidate start position of a table means checking whether the text lines between each pair of adjacent horizontal lines show a table-like arrangement (at least two columns, a vertical distance between the horizontal lines that is not too large, and so on). If so, this position is judged to be the candidate start position of a table; otherwise the search for a start position continues downward.
Locating the candidate end position of a table means that, once a candidate start position has been found, the text lines between each subsequent pair of adjacent horizontal lines are checked for a table-like arrangement. If it is present the search continues downward; otherwise this position is regarded as the candidate end position of the table.
The region-growing method based on table captions and the text-line set uses table captions as seeds marking the initial growth positions and the text-line set as growth elements, and grows multiple seeds simultaneously to recognize tables.
Both this method and the region-growing method based on table captions and the line set apply the idea of region growing. They differ in that the former uses the text-line set as growth elements while the latter uses the candidate table-line set; and the common table features used by the former cover only the arrangement of the text-line set, whereas those used by the latter cover the arrangement of both the text-line set and the line set.
As shown in Fig. 2, the region-growing method based on table captions and the text-line set has four steps:
(1) All candidate table captions on the page are set as seeds and serve as the initial growth positions;
(2) The maximum growth range of each seed is determined and a growth pool is built. The maximum growth range of each seed is determined from the position of the seed on the page and the page's column layout, and the growth pool is built with the text-line set as growth elements. As in the region-growing method based on table captions and the line set, seeds may share growth elements and their maximum growth ranges may overlap;
(3) Vertical parallel growth: the growth direction is vertically downward. Parallel growth means that in each growth step the seed and corresponding text line with the shortest vertical distance between them are taken from the growth pool, and the method judges whether the text line can be merged into the region of that seed and whether the seed can keep growing or should stop. The order of growth is determined by the distances between the seeds and the growth elements in the growth pool;
(4) Table structure analysis: after all seeds have stopped growing, the text-line subset in the region of each seed is checked for common table features, for example whether it contains at least two rows and two columns and whether it has uniform row spacing or column spacing. Whether the region of a seed is a table is decided from these features.
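The structure analysis in step (4) can be sketched as a check on the text lines collected for one seed. Each row is a `(y, [cell_x0, ...])` pair; the function name and the `spacing_tol` tolerance are assumptions, and this sketch tests only row spacing, whereas the text allows uniform row spacing or column spacing.

```python
def looks_like_table(rows, spacing_tol=0.25):
    """rows: list of (y, [cell_x0, ...]) for the text lines in one region."""
    if len(rows) < 2 or any(len(cells) < 2 for _, cells in rows):
        return False  # needs at least two rows and two columns
    ys = sorted(y for y, _ in rows)
    gaps = [b - a for a, b in zip(ys, ys[1:])]
    mean_gap = sum(gaps) / len(gaps)
    # Uniform row spacing: every gap lies within spacing_tol of the mean gap.
    return all(abs(g - mean_gap) <= spacing_tol * mean_gap for g in gaps)
```

A region that fails this check is discarded even though a caption-like line seeded it, which guards against lines such as "Table 2 shows ..." in running text.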
A lineless table without a caption lacks both basic table features, the caption and the table lines, so it is easily confused with other page elements such as matrices and determinants, which causes false detections. A statistical analysis of a large number of standard Chinese PDF documents shows that lineless tables without captions are very rare in practice, so they are not handled here.
The table header is usually located below the table caption and above the table body, and is commonly used to state the unit of the table entries, such as "N/mm". Table notes are usually located below the table body and commonly give special remarks such as the source of the table. Headers and notes are recognized here mainly from their positions and by keyword matching ("note").
The recognition results for all tables on the page are output. Each result includes the position and content of the table caption, the position and content of the table header, the position of the table body, and the position and content of the table notes.
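The output described above maps naturally onto a small record type. The sketch below is one possible shape, not the patent's data format; the names `Box`, `TextBlock`, and `TableResult` are hypothetical, and optional fields reflect that a table may lack a caption, header, or notes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class TextBlock:
    box: Box   # position on the page
    text: str  # recognized content


@dataclass
class TableResult:
    body: Box                            # table body: position only
    caption: Optional[TextBlock] = None  # position and content
    header: Optional[TextBlock] = None   # position and content
    note: Optional[TextBlock] = None     # position and content
```

Keeping the caption, header, and note on the same record as the body is one way to keep them associated with the matching table body, as the text requires.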
Although the embodiments are disclosed above, the content described is only an embodiment adopted to facilitate understanding of the present invention and is not intended to limit it. Any person skilled in the art to which the invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention remains subject to the scope defined by the appended claims.
Claims (5)
- 1. A method for recognizing tables in PDF documents, characterized in that the method comprises: obtaining the character set of a page and merging the characters into text lines to build a text-line set; extracting the horizontal and vertical lines from the page's paths to build a line set; detecting candidate table captions in the text-line set and candidate table lines in the line set; if both candidate table captions and candidate table lines are present, recognizing tables with the region-growing method based on table captions and the line set; if only candidate table lines are present, using the line set and the text-line set to detect fully ruled tables first and three-line tables second; if only candidate table captions are present, recognizing tables with the region-growing method based on table captions and the text-line set; if neither candidate table lines nor candidate table captions are present, concluding that the page contains no table; and detecting secondary table elements such as table headers and table notes, and outputting the table recognition result for the page.
- 2. The method for recognizing tables in PDF documents according to claim 1, characterized in that table types are graded by recognition difficulty, from easiest to hardest: tables with both a caption and table lines, fully ruled tables without a caption, three-line tables without a caption, lineless tables with a caption, and lineless tables without a caption; and the tables are recognized in the same easy-to-hard order.
- 3. The method for recognizing tables in PDF documents according to claim 1, characterized in that detecting candidate table captions in the text-line set comprises: checking the beginning of each text line by keyword matching, and merging multilingual candidate table captions that belong to the same table.
- 4. The method for recognizing tables in PDF documents according to claim 1, characterized in that detecting fully ruled tables with the line set and the text-line set comprises: judging whether a group of aligned horizontal lines and a group of aligned vertical lines intersect effectively, wherein effective intersection is determined by first applying the region-intersection method and then the straight-line intersection method.
- 5. The method for recognizing tables in PDF documents according to claim 1, characterized in that the table information recognized by the method comprises the table body, table caption, table header, and table notes, and the table caption, table header, and table notes are associated with the matching table body and kept synchronized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610025529.8A CN105589841B (en) | 2016-01-15 | 2016-01-15 | A method for table recognition in PDF documents |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610025529.8A CN105589841B (en) | 2016-01-15 | 2016-01-15 | A method for table recognition in PDF documents |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105589841A CN105589841A (en) | 2016-05-18 |
CN105589841B CN105589841B (en) | 2018-03-30 |
Family
ID=55929431
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610025529.8A Active CN105589841B (en) | 2016-01-15 | 2016-01-15 | A method for table recognition in PDF documents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105589841B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110084117A (en) * | 2019-03-22 | 2019-08-02 | 中国科学院自动化研究所 | Document table line detecting method, system based on binary map segmented projection |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106446863B (en) * | 2016-10-11 | 2020-01-21 | 同方知网(北京)技术有限公司 | PDF document logic diagram identification method |
CN106802884B (en) * | 2017-02-17 | 2020-09-22 | 同方知网(北京)技术有限公司 | Method for fragmenting text of layout document |
CN106897690B (en) * | 2017-02-22 | 2018-04-13 | 南京述酷信息技术有限公司 | PDF table extracting methods |
CN107133566A (en) * | 2017-03-31 | 2017-09-05 | 常诚 | A kind of method of chart in identification PDF document |
CN107315989B (en) * | 2017-05-03 | 2020-06-12 | 天方创新(北京)信息技术有限公司 | Text recognition method and device for medical data picture |
CN108170697B (en) * | 2017-07-12 | 2021-08-20 | 信号旗智能科技(上海)有限公司 | International trade file processing method and system and server |
CN107943956A (en) * | 2017-11-24 | 2018-04-20 | 北京金堤科技有限公司 | Conversion of page method, apparatus and conversion of page equipment |
CN107909064B (en) * | 2017-12-27 | 2018-11-16 | 掌阅科技股份有限公司 | Three line table recognition methods, electronic equipment and storage medium |
CN108197216A (en) * | 2017-12-28 | 2018-06-22 | 深圳市巨鼎医疗设备有限公司 | A kind of method of information processing |
CN110163030B (en) * | 2018-02-11 | 2021-04-23 | 鼎复数据科技(北京)有限公司 | PDF framed table extraction method based on image information |
CN108470021B (en) * | 2018-03-26 | 2022-06-03 | 阿博茨德(北京)科技有限公司 | Method and device for positioning table in PDF document |
CN108446264B (en) * | 2018-03-26 | 2022-02-15 | 阿博茨德(北京)科技有限公司 | Method and device for analyzing table vector in PDF document |
CN109062874B (en) * | 2018-06-12 | 2022-03-04 | 平安科技(深圳)有限公司 | Financial data acquisition method, terminal device and medium |
CN109284495B (en) * | 2018-11-03 | 2023-02-07 | 上海犀语科技有限公司 | Method and device for performing table-free line table cutting on text |
CN109635268B (en) * | 2018-12-29 | 2023-05-05 | 南京吾道知信信息技术有限公司 | Method for extracting form information in PDF file |
CN109934160B (en) * | 2019-03-12 | 2023-06-02 | 天津瑟威兰斯科技有限公司 | Method and system for extracting table text information based on table recognition |
CN110263792B (en) * | 2019-06-12 | 2021-10-22 | 广东小天才科技有限公司 | Image recognizing and reading and data processing method, intelligent pen, system and storage medium |
CN110413979A (en) * | 2019-08-05 | 2019-11-05 | 金税桥大数据科技股份有限公司 | Industry table digitalized processing method based on image recognition technology |
CN110659346B (en) * | 2019-08-23 | 2024-04-12 | 平安科技(深圳)有限公司 | Form extraction method, form extraction device, terminal and computer readable storage medium |
CN110516048A (en) * | 2019-09-02 | 2019-11-29 | 苏州朗动网络科技有限公司 | The extracting method, equipment and storage medium of list data in pdf document |
CN112380812B (en) * | 2020-10-09 | 2022-02-22 | 北京中科凡语科技有限公司 | Method, device, equipment and storage medium for extracting incomplete frame line table of PDF (Portable document Format) |
CN112464626B (en) * | 2020-12-09 | 2022-04-01 | 上海携宁计算机科技股份有限公司 | Graph extraction method of PDF (Portable document Format) document, electronic equipment and storage medium |
CN113239818B (en) * | 2021-05-18 | 2023-05-30 | 上海交通大学 | Table cross-modal information extraction method based on segmentation and graph convolution neural network |
CN113408251B (en) * | 2021-06-30 | 2023-08-18 | 北京百度网讯科技有限公司 | Layout document processing method and device, electronic equipment and readable storage medium |
CN114612921B (en) * | 2022-05-12 | 2022-07-19 | 中信证券股份有限公司 | Form recognition method and device, electronic equipment and computer readable medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101866335A (en) * | 2010-06-14 | 2010-10-20 | 深圳市万兴软件有限公司 | Form processing method and device in document conversion |
CN101976232A (en) * | 2010-09-19 | 2011-02-16 | 深圳市万兴软件有限公司 | Method for identifying data form in document and device thereof |
CN103377177A (en) * | 2012-04-27 | 2013-10-30 | 北大方正集团有限公司 | Method and device for identifying forms in digital format files |
CN104063364A (en) * | 2013-03-19 | 2014-09-24 | 福建福昕软件开发股份有限公司北京分公司 | PDF document recognition method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9047533B2 (en) * | 2012-02-17 | 2015-06-02 | Palo Alto Research Center Incorporated | Parsing tables by probabilistic modeling of perceptual cues |
- 2016-01-15: CN application CN201610025529.8A, patent CN105589841B, status Active
Non-Patent Citations (3)
Title |
---|
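Table Recognition and Understanding from PDF Files; Tamir Hassan et al.; International Conference on Document Analysis and Recognition; 20070926; pp. 1-5 * |
Research on Table Recognition Technology Based on PDF Text Streams; 张伯; China Master's Theses Full-text Database, Information Science and Technology; 20100915; Vol. 2010 (No. 9); pp. I138-534 * |
Automatic Table Detection and Performance Evaluation for Fixed-Layout Electronic Documents; 房婧 et al.; Acta Scientiarum Naturalium Universitatis Pekinensis; 20130131; Vol. 49 (No. 1); pp. 45-53 * |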
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110084117A (en) * | 2019-03-22 | 2019-08-02 | 中国科学院自动化研究所 | Document table line detecting method, system based on binary map segmented projection |
Also Published As
Publication number | Publication date |
---|---|
CN105589841A (en) | 2016-05-18 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN105589841B (en) | A kind of method of PDF document Table recognition | |
CN105930159B (en) | A kind of method and system that the GUI code based on image generates | |
US7054871B2 (en) | Method for identifying and using table structures | |
CN106802884B (en) | Method for fragmenting text of layout document | |
CN106250830B (en) | Digital book structured analysis processing method | |
CA2486528C (en) | Document structure identifier | |
Simon et al. | ViPER: augmenting automatic information extraction with visual perceptions | |
CN108470021A (en) | The localization method and device of table in PDF document | |
EP1679613A2 (en) | Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents | |
US20070196015A1 (en) | Table of contents extraction with improved robustness | |
CN101329731A (en) | Automatic recognition method of mathematical formula in image | |
CN101206639A (en) | Method for indexing complex impression based on PDF | |
CN101354727B (en) | Method and apparatus for establishing links between digital document catalog and text | |
CN111274239A (en) | Test paper structuralization processing method, device and equipment | |
US10762377B2 (en) | Floating form processing based on topological structures of documents | |
Klampfl et al. | A comparison of two unsupervised table recognition methods from digital scientific articles | |
CN106502991A (en) | Publication treating method and apparatus | |
CN113962201A (en) | Document structuralization and extraction method for documents | |
US9049400B2 (en) | Image processing apparatus, and image processing method and program | |
CN106446863B (en) | PDF document logic diagram identification method | |
CN107590448A (en) | The method for obtaining QTL data automatically from document | |
CN110688825A (en) | Method for extracting information of table containing lines in layout document | |
US9418051B2 (en) | Methods and devices for extracting document structure | |
Huang et al. | Associating text and graphics for scientific chart understanding | |
Asi et al. | User-assisted alignment of Arabic historical manuscripts | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||