CN107133566A - Method for identifying charts in a PDF document - Google Patents

Method for identifying charts in a PDF document

Info

Publication number
CN107133566A
Authority
CN
China
Prior art keywords
chart
drawing object
character
word
pdf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710209497.1A
Other languages
Chinese (zh)
Inventor
常诚
何黎刚
陈浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201710209497.1A
Publication of CN107133566A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Abstract

The invention relates to a method for identifying charts in a PDF file. The method includes: Step 1: read and record the region information of all text and drawing objects in the PDF format standard, namely the rectangle formed by the position coordinates and the width and height; Step 2: compute the character density of the text objects, and obtain the average character density d and variance v and the minimum character width w and height h; Step 3: filter out non-compliant drawing objects, and expand the region outward for detection if the rectangle area is zero; Step 4: traverse the drawing objects, and if one intersects the rectangle of another object, merge the regions into a new drawing object and record its character count, until no range changes any further; Step 5: compute the character density D of each drawing object, and if D lies outside the interval [d - v, d + v], judge the object to be a chart, the corresponding rectangle being the chart region. The invention can identify charts and their positions in a document, in preparation for subsequent processing such as extraction and analysis.

Description

Method for identifying charts in a PDF document
Technical Field
The invention belongs to the technical field of PDF document content processing and analysis, and relates to a method for identifying charts in a PDF file.
Background
Documents that carry large amounts of chart information, such as financial statements, technical reports, academic journals and papers of all kinds, are often stored in PDF format. PDF content extraction scenarios require charts to be identified, for example to keep chart text from being mixed into the body text when re-typesetting after conversion to other formats (such as the EPUB and MOBI e-book formats), or to analyze the chart data further.
There are many reasons why charts cannot be distinguished from text content. Apart from uncontrollable factors such as the characteristics of the original document and typesetting errors, the main reason is the particular nature of the PDF format. First, the PDF format standard has no logical concept of a "chart"; it only draws objects on the page according to instructions. A chart is formed only when some drawing objects (such as paths) are combined with text objects: coordinate axes consist of two perpendicular straight lines (paths), and a table consists of paths surrounding text. Second, drawing objects also often appear within text paragraphs, most typically in inline mathematical formulas containing fraction bars or radical signs. In addition, drawing objects are often used as typesetting elements such as underlines and dividing lines, which can also interfere with chart identification. In summary, body text and charts are logically built from the same components, so they are difficult to tell apart directly.
The prior art generally falls into two classes. One is manual or template-based annotation, in which the boundary to process is determined by a specified range; manual operation is inefficient and cannot handle documents at scale, and templates are inflexible. The other determines charts simply by extracting drawing objects, but this produces many errors for the reasons described above. A simple example is shown in Fig. 1: the structure of a PDF document mixing text and graphics contains a number of text and drawing objects, of which only four are labeled here (101 to 104). Object type alone cannot serve as the basis for deciding what is a chart: drawing objects such as 102 may appear in the text portion, and text objects such as 104 also appear in the chart portion. In general, in documents with many charts, the text portion also contains numbers, formulas and other items that interfere with identification. Moreover, the caption below the chart area, "Figure 1. A sample figure", cannot be distinguished from the text that follows it by its content attributes, yet logically it is part of the chart, which causes confusion during conversion. Layout in practice, such as multi-column text and charts spanning columns, can be more complex still, so conventional methods often cause conversion errors and disrupt subsequent processing.
Summary of the Invention
The technical problem to be solved by the invention is to provide a method for identifying charts in a PDF file that can parse the chart content on a page and identify its position.
Step 1: Traverse the file data and record the region information of all text and drawing (path, clipping, bitmap) objects in the PDF format standard, expressed as the rectangle formed by the object's position coordinates and its width and height parameters.
Step 2: Compute the character density of each text object (number of characters divided by rectangle area), and obtain the average character density d and variance v over all text objects, together with the minimum character width w and height h.
Step 3: Filter out non-compliant drawing objects; the conditions include a rectangle that exceeds the page content boundary, plus other user-defined conditions. If the rectangle area is zero, expand the drawing region by w/2 horizontally and h/2 vertically.
Step 4: Traverse the drawing objects; if one intersects the rectangle of another object (text or drawing), merge the regions into a new drawing object and record its character count, until no range changes any further.
Step 5: Compute the character density D of each drawing object; if D lies outside the interval [d - v, d + v], judge the object to be a chart, the corresponding rectangle being the chart region.
The beneficial effects of the invention are as follows. By traversing the PDF file, drawing objects and text objects are extracted and handled separately: the former are filtered, merged and identified, while the latter supply an important parameter, the average character density. The principle is that the area occupied by a text object is proportional to its number of characters: the more characters, the larger the area, and vice versa. The character density of a document therefore fluctuates within a certain range, represented by its mean and variance. By comparison, the area of a chart region grows (or shrinks) as drawing objects are merged while its number of characters stays the same, so its character density differs from that of text regions, and a judgment can be made on that basis. The same applies when drawing objects are mixed into a text region: the large number of text objects maintains the density of the merged region, which avoids misjudgments caused by typesetting elements, inline formulas and similar factors, and thereby greatly improves the accuracy of chart identification.
On the basis of the above technical solution, the invention can also be extended as follows to identify figure notes, captions and table notes. After step 5, the method further includes step 6: read the text directly above or below the identified chart drawing region; if a specified keyword is found, that text paragraph is determined to be a figure note, table note or caption, and is also part of the chart. The keywords depend on the document type and language, for example "Figure", "figure", "Table", "table".
Brief Description of the Drawings
Fig. 1 is a schematic diagram of a PDF document containing a chart.
Fig. 2 is a flowchart of the method proposed by the invention for identifying charts in a PDF file.
Detailed Description
The principles and features of the invention are described below with reference to the drawings. The examples given serve only to explain the invention and are not intended to limit its scope.
Fig. 2 is a flowchart of the method proposed by the invention for identifying charts in a PDF file. Here PDF is the abbreviation of Portable Document Format, an electronic document format, and a PDF file is an electronic file in that format. The format is commonly used to store and distribute documents with complex layouts. "Portable" in the name means that the same layout is obtained across hardware devices and software platforms, which makes the format particularly suitable for storing scientific, technical, financial and other documents that require exact display and must not be modified.
In the invention, a chart in a PDF file may be a figure or table drawn with straight lines, curves and text, or a picture (called a bitmap in PDF), used to illustrate a process, display data, list comparisons and so on. Charts in formal documents usually also carry figure notes, captions and table notes, that is, text stating the purpose of the chart.
Fig. 2 shows the processing flow of the method, which is described in detail below.
Step 201: Traverse the PDF file and record the position and region information of the text objects and drawing objects.
To "traverse" in this step is a form of computation: following some search route, each node in a set is visited once and only once. Traversal and the concept of a set are common knowledge in the computer field and are not described further here.
In this step, the file is traversed by reading the object data in the PDF file one by one. An object in the PDF format contains not only content but also instructions for how it is displayed. Objects come in many types, of which the invention is concerned with only four: text (Text), path (Path), bitmap (Bitmap) and clipping (Clipping).
Each object includes the region in which it is displayed, which can be expressed as a rectangle determined by the lower-right position coordinate (x, y) and the width and height parameters (width, height). The invention divides objects into two classes, text and drawing, the latter comprising paths, bitmaps and clippings. The region of each object is recorded in preparation for subsequent processing.
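The record kept for each object in step 201 could be sketched in Python as below. This is only an illustration under stated assumptions: the names Kind, PageObject and split_objects are hypothetical and do not come from the source, and reading the objects out of a real PDF file is left out entirely.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    """The four object types the method cares about."""
    TEXT = "text"
    PATH = "path"
    BITMAP = "bitmap"
    CLIPPING = "clipping"


@dataclass(frozen=True)
class PageObject:
    """One recorded object: its type plus its display rectangle."""
    kind: Kind
    x: float
    y: float
    width: float
    height: float
    chars: int = 0  # number of characters; stays 0 for drawing objects

    @property
    def is_drawing(self) -> bool:
        # Paths, bitmaps and clippings all count as drawing objects.
        return self.kind is not Kind.TEXT

    @property
    def area(self) -> float:
        return self.width * self.height


def split_objects(objects):
    """Partition recorded objects into (text, drawing) lists."""
    text = [o for o in objects if not o.is_drawing]
    drawing = [o for o in objects if o.is_drawing]
    return text, drawing
```

A horizontal line recorded as a path has a height of zero and therefore an area of zero, which is the case that step 203 corrects.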
Step 202: Compute the average character density and variance of the text objects, and record the minimum character width and height.
The basic text object is a single character. When parsing, a PDF can be read in units of a word (for languages such as English that rely on spaces to separate words) or a line (for languages such as Chinese, Japanese and Korean); in essence each unit is a group of consecutive characters.
For the set of text objects from step 201, the character density of a single object is computed from the area of the character group, $s_i = width_i \times height_i$, and its number of characters $c_i$, as $d_i = c_i / s_i$. The average character density is then $d = \frac{1}{n}\sum_{i=1}^{n} d_i$, and the variance is $v = \frac{1}{n}\sum_{i=1}^{n} (d_i - d)^2$, where $n$ is the number of text objects. In addition, while traversing the set, the minimum width and height among the text objects are recorded as w and h respectively.
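The statistics of step 202 can be sketched as follows. The function name text_statistics and the tuple layout are hypothetical, and the use of the population variance (division by the number of text objects) is an assumption, since the description does not say which normalization is intended.

```python
from statistics import fmean, pvariance


def text_statistics(text_objects):
    """text_objects: iterable of (width, height, char_count) tuples.

    Returns (d, v, w, h): the mean character density, the population
    variance of the density, and the minimum width and height found
    among the text objects.
    """
    # Skip degenerate rectangles so the density is always defined.
    items = [t for t in text_objects if t[0] > 0 and t[1] > 0]
    if not items:
        raise ValueError("no text objects with a non-zero area")
    # Density of one object: number of characters divided by area.
    densities = [c / (wd * ht) for wd, ht, c in items]
    d = fmean(densities)
    v = pvariance(densities, mu=d)
    w = min(wd for wd, _, _ in items)
    h = min(ht for _, ht, _ in items)
    return d, v, w, h
```

For example, three text objects with densities 0.02, 0.01 and 0.03 give a mean density of 0.02.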
Step 203: Filter out non-compliant drawing objects, and enlarge drawing object rectangles whose area is zero.
The set of drawing objects from step 201 is filtered and corrected. Filtering means removing an object from the set so that it is no longer handled in subsequent steps; the rules are that the object exceeds the page range, plus other user-defined conditions. Correction works by computing the area of each drawing object: if it is zero, the drawing is a point or a line, and its range is enlarged to (x - w/2, y - h/2, width + w/2, height + h/2) so that it can be detected in the subsequent steps. The amount of enlargement is determined by the minimum character width and height.
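A minimal sketch of the filtering and correction in step 203 follows. The function name filter_and_expand is hypothetical, only the page-range rule is implemented (the other user-defined conditions are not specified in the source), and the bounds check assumes that (x, y) is the corner of the rectangle with the smallest coordinates.

```python
def filter_and_expand(drawings, page_w, page_h, w, h):
    """drawings: list of (x, y, width, height) rectangles.

    Drops rectangles that extend past the page, then enlarges zero-area
    rectangles (points and lines) using the minimum character size (w, h).
    """
    kept = []
    for x, y, width, height in drawings:
        # Assumption: (x, y) is the corner with the smallest coordinates.
        outside = x < 0 or y < 0 or x + width > page_w or y + height > page_h
        if outside:
            continue  # non-compliant: exceeds the page range
        if width * height == 0:
            # Enlargement exactly as written in the description:
            # (x - w/2, y - h/2, width + w/2, height + h/2).
            x, y = x - w / 2, y - h / 2
            width, height = width + w / 2, height + h / 2
        kept.append((x, y, width, height))
    return kept
```

With a minimum character size of 4 by 8, a horizontal line at (10, 10) of length 100 becomes the rectangle (8, 6, 102, 4), which has a non-zero area and can take part in the intersection tests of the next step.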
Step 204: Traverse the drawing objects, merge the rectangles of other intersecting objects, and record the character count of text objects.
Following step 203, the drawing objects are traversed: each is tested for geometric intersection with the rectangles of other objects and, if they intersect, the two are merged and the result is added to the drawing object set. The process is repeated until all drawing objects are stable and no longer grow.
"Other objects" include both text and drawing objects. If the new object contains text objects, their character counts are accumulated as the character count of the new object. Rectangle intersection and merging are common operations in two-dimensional plane geometry and belong to general mathematical knowledge, so they are not described further.
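The merging loop of step 204 can be sketched as a fixed-point iteration. The function names are hypothetical, rectangles are plain (x, y, width, height) tuples, and treating rectangles that merely touch as intersecting is an assumption that the source does not settle.

```python
def intersects(a, b):
    """Axis-aligned rectangles (x, y, width, height); touching edges count."""
    return (a[0] <= b[0] + b[2] and b[0] <= a[0] + a[2]
            and a[1] <= b[1] + b[3] and b[1] <= a[1] + a[3])


def union(a, b):
    """Smallest rectangle that covers both a and b."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)


def merge_regions(drawings, texts):
    """drawings: [(x, y, w, h)]; texts: [((x, y, w, h), char_count)].

    Returns [(rect, char_count)] once no region can grow any further.
    """
    regions = [(rect, 0) for rect in drawings]
    pool = list(texts)  # text objects not yet absorbed by any region
    changed = True
    while changed:
        changed = False
        merged = []
        for rect, count in regions:
            # Absorb every free text object that intersects this region.
            for item in pool[:]:
                if intersects(rect, item[0]):
                    rect, count = union(rect, item[0]), count + item[1]
                    pool.remove(item)
                    changed = True
            # Fold into an already collected region if the two intersect.
            for k, (other, other_count) in enumerate(merged):
                if intersects(rect, other):
                    merged[k] = (union(rect, other), count + other_count)
                    changed = True
                    break
            else:
                merged.append((rect, count))
        regions = merged
    return regions
```

Each change removes either a region or a free text object, so the loop terminates. Text objects that never touch a drawing region stay in the pool and remain ordinary body text.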
Step 205: Compute the character density of the new drawing objects and decide whether each is a chart.
The drawing objects above are traversed, and the character density D of each region is computed from its (new) area and character count. If the density of the region lies outside the interval [d - v, d + v], it is judged to be a chart.
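The decision rule of step 205 reduces to a single comparison, sketched below. The function name is_chart is hypothetical, the half-width of the interval is the variance v exactly as the description states, and returning False for a region without a measurable area is an assumption made only to keep the sketch total.

```python
def is_chart(area, char_count, d, v):
    """True when the region's density D lies outside [d - v, d + v]."""
    if area <= 0:
        # Assumption: a region without a measurable area is not classified.
        return False
    D = char_count / area
    return D < d - v or D > d + v
```

With the illustrative values d = 0.02 and v = 0.005, a merged region of area 8000 holding 5 characters has D = 0.000625 and is classified as a chart, while a region of area 1000 holding 20 characters has D = 0.02 and is not.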
Step 206: Identify figure notes, table notes and captions.
Following step 205, the text directly above or below the identified chart drawing region is read. If a specified keyword is found, that text paragraph is determined to be a figure note, table note or caption. The keywords depend on the document type and language, for example "Figure", "figure", "Table", "table".
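Step 206 can be sketched as a keyword test on the text lines next to a chart region. Everything specific here is an assumption for illustration: the function name find_captions, the regular expression (which also requires a number after the keyword), the max_gap distance, and a coordinate system in which y increases upward.

```python
import re

# Hypothetical keyword pattern; real keywords depend on document type and language.
CAPTION = re.compile(r"^\s*(figure|fig\.|table)\s*\d+", re.IGNORECASE)


def find_captions(chart, text_lines, max_gap=20.0):
    """chart: (x, y, w, h); text_lines: [((x, y, w, h), string)].

    Returns the text lines directly above or below the chart, within
    max_gap units of it, that start with a caption keyword.
    """
    cx, cy, cw, ch = chart
    found = []
    for (x, y, w, h), s in text_lines:
        # The line must share some horizontal extent with the chart.
        overlaps_x = x < cx + cw and cx < x + w
        # y grows upward: "below" means the line's top edge is under the chart.
        below = 0 <= cy - (y + h) <= max_gap
        above = 0 <= y - (cy + ch) <= max_gap
        if overlaps_x and (below or above) and CAPTION.match(s):
            found.append(s)
    return found
```

A line reading "Figure 1. A sample figure" placed just under the chart is returned, whereas body text further down the page is not, even if it mentions a figure.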
Analyzing the chart portion of the Fig. 1 example with the invention, the merged region contains a series of points, straight lines, curves and text, but its text density is low compared with the text regions, so it is determined to be a chart, and the caption below it can be identified accordingly. On the other hand, although the presence of 101 causes the characters around it to be merged into a drawing object, the computed text density is still comparable to that of body text, which also matches intuition. It is therefore not judged to be a chart.
It can be seen that the invention has the following advantages. By traversing the PDF file, it exploits the intuitive difference in text density between body text and charts: using the object types and the content position information specific to the PDF format, the notion of density is turned into a computable value, and a complete processing flow is designed around that parameter, including filtering, correction and the extraction of auxiliary information such as figure notes and table notes. The method prevents drawing objects that appear outside chart regions from interfering with identification, obtains the position of each chart, and greatly improves the accuracy of chart identification in PDF documents, making accurate automated processing and analysis of documents possible.
The foregoing is only a preferred embodiment of the invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (1)

1. A method for identifying charts in a PDF file, characterized in that the method includes: Step 1: traverse the file data and record the region information of all text and drawing (path, clipping, bitmap) objects in the PDF format standard, expressed as the rectangle formed by the object's position coordinates and its width and height parameters; Step 2: compute the character density of each text object (number of characters divided by rectangle area), and obtain the average character density d and variance v over all text objects, together with the minimum character width w and height h; Step 3: filter out non-compliant drawing objects, the conditions including a rectangle that exceeds the page content boundary, plus other user-defined conditions, and, if the rectangle area is zero, expand the drawing region by w/2 horizontally and h/2 vertically; Step 4: traverse the drawing objects, and if one intersects the rectangle of another object (text or drawing), merge the regions into a new drawing object and record its character count, until no range changes any further; Step 5: compute the character density D of each drawing object, and if D lies outside the interval (d - v, d + v), judge the object to be a chart, the corresponding rectangle being the chart region; wherein, after step 5, the method further includes Step 6: read the text directly above or below the identified chart drawing region, and if a specified keyword is found, determine that text paragraph to be a figure note, table note or caption, which is also part of the chart; the keywords depend on the document type and language, for example "Figure", "figure", "Table", "table".
CN201710209497.1A 2017-03-31 2017-03-31 Method for identifying charts in a PDF document Pending CN107133566A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201710209497.1A | 2017-03-31 | 2017-03-31 | Method for identifying charts in a PDF document

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201710209497.1A | 2017-03-31 | 2017-03-31 | Method for identifying charts in a PDF document

Publications (1)

Publication Number | Publication Date
CN107133566A | 2017-09-05

Family

ID=59716357

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201710209497.1A (Pending, published as CN107133566A (en)) | Method for identifying charts in a PDF document | 2017-03-31 | 2017-03-31

Country Status (1)

Country | Link
CN (1) | CN107133566A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108038426A (en) * 2017-11-29 2018-05-15 阿博茨德(北京)科技有限公司 The method and device of chart-information in a kind of extraction document
WO2019041526A1 (en) * 2017-08-31 2019-03-07 平安科技(深圳)有限公司 Method of extracting chart in document, electronic device and computer-readable storage medium
CN109522539A (en) * 2018-11-26 2019-03-26 常诚 Mobile device-based PDF academic paper reset system and method
CN109948123A (en) * 2018-11-27 2019-06-28 阿里巴巴集团控股有限公司 A kind of image combining method and device
CN110443202A (en) * 2019-08-06 2019-11-12 北京如优教育科技有限公司 Paper font carefully and neatly spends instant analysis platform, method and storage medium
CN112818894A (en) * 2021-02-08 2021-05-18 深圳万兴软件有限公司 Method and device for identifying text box in PDF file, computer equipment and storage medium
CN112861821A (en) * 2021-04-06 2021-05-28 刘羽 Map data reduction method based on PDF file analysis
CN112861822A (en) * 2021-04-06 2021-05-28 刘羽 Map data processing method based on PDF file analysis
CN116110051A (en) * 2023-04-13 2023-05-12 合肥机数量子科技有限公司 File information processing method and device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104751148A (en) * 2015-04-16 2015-07-01 同方知网数字出版技术股份有限公司 Method for recognizing scientific formulas in layout file
CN105589841A (en) * 2016-01-15 2016-05-18 同方知网(北京)技术有限公司 Portable document format (PDF) document form identification method
CN106446863A (en) * 2016-10-11 2017-02-22 同方知网(北京)技术有限公司 PDF document logic diagram identification method

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019041526A1 (en) * 2017-08-31 2019-03-07 平安科技(深圳)有限公司 Method of extracting chart in document, electronic device and computer-readable storage medium
CN108038426A (en) * 2017-11-29 2018-05-15 阿博茨德(北京)科技有限公司 The method and device of chart-information in a kind of extraction document
CN109522539A (en) * 2018-11-26 2019-03-26 常诚 Mobile device-based PDF academic paper reset system and method
CN109948123B (en) * 2018-11-27 2023-06-02 创新先进技术有限公司 Image merging method and device
CN109948123A (en) * 2018-11-27 2019-06-28 阿里巴巴集团控股有限公司 A kind of image combining method and device
CN110443202A (en) * 2019-08-06 2019-11-12 北京如优教育科技有限公司 Paper font carefully and neatly spends instant analysis platform, method and storage medium
CN110443202B (en) * 2019-08-06 2022-11-01 超级知识产权顾问(北京)有限公司 System, method and storage medium for real-time analysis of paper font regularity
CN112818894A (en) * 2021-02-08 2021-05-18 深圳万兴软件有限公司 Method and device for identifying text box in PDF file, computer equipment and storage medium
CN112818894B (en) * 2021-02-08 2023-12-15 深圳万兴软件有限公司 Method and device for identifying text box in PDF (portable document format) file, computer equipment and storage medium
CN112861821A (en) * 2021-04-06 2021-05-28 刘羽 Map data reduction method based on PDF file analysis
CN112861822A (en) * 2021-04-06 2021-05-28 刘羽 Map data processing method based on PDF file analysis
CN112861822B (en) * 2021-04-06 2024-03-12 刘羽 Map data processing method based on PDF file analysis
CN112861821B (en) * 2021-04-06 2024-04-19 刘羽 Map data reduction method based on PDF file analysis
CN116110051A (en) * 2023-04-13 2023-05-12 合肥机数量子科技有限公司 File information processing method and device, computer equipment and storage medium
CN116110051B (en) * 2023-04-13 2023-07-14 合肥机数量子科技有限公司 File information processing method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107133566A (en) A kind of method of chart in identification PDF document
EP1739574B1 (en) Method of identifying words in an electronic document
US9798925B2 (en) Method for identifying PDF document
US5517578A (en) Method and apparatus for grouping and manipulating electronic representations of handwriting, printing and drawings
EP1376390B1 (en) Writing guide for a free-form document editor
US20190163970A1 (en) Method and device for extracting chart information in file
US20060294460A1 (en) Generating a text layout boundary from a text block in an electronic document
KR101985612B1 (en) Method for manufacturing digital articles of paper-articles
CN106951400A (en) The information extraction method and device of a kind of pdf document
JP2005526314A (en) Document structure identifier
CN104516891A (en) Layout analyzing method and system
CN101206639A (en) Method for indexing complex impression based on PDF
CN110704570A (en) Continuous page layout document structured information extraction method
CN110968667A (en) Periodical and literature table extraction method based on text state characteristics
CN110163030A (en) A kind of PDF based on image information has frame table abstracting method
CN104751148A (en) Method for recognizing scientific formulas in layout file
JP6327963B2 (en) Character recognition device and character recognition method
US7929772B2 (en) Method for generating typographical line
Dori et al. Segmentation and recognition of dimensioning text from engineering drawings
CN110688825A (en) Method for extracting information of table containing lines in layout document
JP5950700B2 (en) Image processing apparatus, image processing method, and program
CN110413962A (en) Rimless form analysis technology in file and picture
CN112861485B (en) Nuclear power DCS control logic drawing processing method, device and equipment
CN102110108A (en) Method and device for processing galley proof file
CN114022888B (en) Method, apparatus and medium for identifying PDF form

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170905