CN102081736A - Equipment and method for extracting enclosing rectangles of characters from portable electronic documents - Google Patents

Equipment and method for extracting enclosing rectangles of characters from portable electronic documents

Info

Publication number
CN102081736A
Authority
CN
China
Prior art keywords
character
font
portable electronic
extracts
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2009102498487A
Other languages
Chinese (zh)
Other versions
CN102081736B (en
Inventor
杜成
徐文晖
长谷川史裕
井上浩一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Priority to CN200910249848.7A priority Critical patent/CN102081736B/en
Publication of CN102081736A publication Critical patent/CN102081736A/en
Application granted granted Critical
Publication of CN102081736B publication Critical patent/CN102081736B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Controls And Circuits For Display Device (AREA)

Abstract

The invention provides equipment for extracting enclosing rectangles of characters from portable electronic documents. The equipment comprises a text segment display command extracting device, a font information extracting device, a size information extracting device and an enclosing rectangle computing device, wherein the text segment display command extracting device is used for extracting text segment display commands of text segments in the pages in the portable electronic documents from content streams of the pages; the font information extracting device is used for extracting the font information corresponding to the text segments from the resources of the pages with regard to the extracted text segment display commands; the size information extracting device is used for extracting the size information of the characters in the text segments; and the enclosing rectangle computing device is used for computing the enclosing rectangles of the characters in the text segments. The invention also provides a method for extracting the enclosing rectangles of the characters from the portable electronic documents.

Description

Equipment and method for extracting enclosing rectangles of characters from portable electronic documents
Technical field
The invention provides equipment and a method for extracting the enclosing rectangles of characters from portable electronic documents.
Background art
Portable electronic documents, such as PDF (Portable Document Format) and PS (PostScript) files, are widely used in everyday office work. Extracting specific information from them nevertheless remains difficult. For example, a PDF file does not explicitly store the minimum enclosing rectangle of each character, so obtaining it requires parsing the document and performing calculations. Extracting the minimum enclosing rectangles of characters has wide application in document retrieval systems: by matching the positions and sizes of the minimum enclosing rectangles of adjacent characters, an electronic document can be matched quickly against a document image, which enables document retrieval.
US Patent 6801673B2 provides a method for extracting words from a PDF document. The method extracts words either by searching for word separators (spaces) within a text segment or by measuring the distance between adjacent text segments: if the distance exceeds a threshold, the adjacent text segments are split into two words. The input of the method is a PDF document and the output is the set of words the document contains.
US Patent 5832530 proposes a tool for extracting content fragments from a PDF document. The user first drags a rectangle in the PDF viewer interface; the tool then extracts the PDF content enclosed by the rectangle and stores it as a new PDF document. The tool extracts and pastes low-level PDF commands and does not extract high-level document content such as pictures and tables.
Summary of the invention
The present invention has been made in view of the above problems in the prior art and proposes equipment and a method for extracting the minimum enclosing rectangles of characters from portable electronic documents. The pages of the vast majority of portable electronic documents are oriented horizontally or vertically, that is, the page rotation angle is 0, 90, 180, or 270 degrees; this angle can be read from the PDF document tree structure. The characters on such pages are likewise horizontal or vertical and have no arbitrary rotation angle. The present invention can therefore handle portable electronic documents in most cases. The invention belongs to the field of document processing and can be applied to document content extraction, document reuse, and document retrieval.
According to one aspect of the present invention, equipment for extracting enclosing rectangles of characters from a portable electronic document is provided, comprising: a text segment display command extracting device that, for a text segment in a page of the portable electronic document, extracts the text segment display command of that text segment from the content stream of the page; a font information extracting device that, for the extracted text segment display command, extracts the font information corresponding to the text segment from the resources of the page; a size information extracting device that extracts the character size information of each character in the text segment; and an enclosing rectangle computing device that computes the enclosing rectangle of each character in the text segment.
According to another aspect of the present invention, a method for extracting enclosing rectangles of characters from a portable electronic document is provided, comprising: a text segment display command extracting step of extracting, for a text segment in a page of the portable electronic document, the text segment display command of that text segment from the content stream of the page; a font information extracting step of extracting, for the extracted text segment display command, the font information corresponding to the text segment from the resources of the page; a size information extracting step of extracting the character size information of each character in the text segment; and an enclosing rectangle computing step of computing the enclosing rectangle of each character in the text segment.
The present invention can be used to extract the minimum enclosing rectangles of characters from portable electronic documents such as PDF and PS files, and the extracted rectangles can be used for document reuse, document retrieval, and similar purposes. For example, comparing the geometric distribution of character enclosing rectangles makes it possible to match an electronic document against a document image and thereby to retrieve documents.
The above and other objects, features, advantages, and technical and industrial significance of the present invention will be better understood by reading the following detailed description of the preferred embodiments in conjunction with the accompanying drawings.
Brief description of the drawings
Fig. 1 illustrates a computer system that, according to an embodiment of the invention, extracts the minimum enclosing rectangles of characters from a portable electronic document.
Fig. 2 is an overall block diagram of the equipment for extracting the minimum enclosing rectangles of characters from a portable electronic document according to an embodiment of the invention.
Fig. 3 illustrates, by way of example, the character start position, the character size information, and the character enclosing rectangle.
Fig. 4 illustrates, by way of example, the PDF document tree structure.
Fig. 5 schematically illustrates the character height obtained by processing the rasterized character bitmap.
Detailed description of the embodiments
As shown in Fig. 1, a computer system 10 that extracts the minimum enclosing rectangles of characters from a portable electronic document according to an embodiment of the invention comprises a computer 11, a keyboard 16, a display 17, a printer 18, a floppy disk drive 19, a network access device 20, and a hard disk drive 21. The computer 11 comprises a data bus 12, a random access memory (RAM) 13, a read-only memory (ROM) 14, a central processing unit 15, and a peripheral bus 22.
Following instructions received from the random access memory 13, the central processing unit 15 controls the reception and processing of input data and its output to the display 17 or other peripherals. In the present embodiment, one function of the central processing unit 15 is to process an input PDF document and extract the minimum enclosing rectangles of the characters it contains. The extracted character enclosing rectangles can then be used by other application programs running on the central processing unit 15.
The central processing unit 15 accesses the random access memory 13 and the read-only memory 14 through the data bus 12. The random access memory 13 is readable and writable memory that the central processing unit 15 uses as the work area and variable data storage area of each process. The read-only memory 14 stores portable electronic documents such as PDF documents, the character enclosing rectangle extraction program, and other application programs that use the extracted character enclosing rectangles.
The peripheral bus 22 is used to access the input, output, and storage peripherals connected to the computer 11. In the present embodiment, these peripherals include the display 17, the printer 18, the floppy disk drive 19, the network access device 20, and the hard disk drive 21. The display 17 shows the data and images output by the central processing unit 15 through the peripheral bus 22; it can be a raster display device such as a CRT or LCD display. The printer 18 prints the data and images supplied by the central processing unit 15 on paper or a paper-like medium. To show a PDF document on an output device such as the display 17 or the printer 18, the computer system 10 must implement a document rasterization process that converts the PDF document into its corresponding image representation. In other embodiments, an output device such as the printer 18 may itself contain a central processing unit or similar processor that performs this conversion from PDF document to image. The floppy disk drive 19 and the hard disk drive 21 are used to store PDF documents. With the floppy disk drive 19, PDF documents can be transferred between different computer systems; the hard disk drive 21 offers more storage space and faster access. Other storage devices, such as flash memory, can also store PDF files for access by the computer system 10. Through the network access device 20, the computer system 10 sends data over the network and receives data from other computer systems. The user can input instructions to the computer system 10 through the keyboard 16.
Fig. 2 is an overall block diagram of the equipment for extracting the minimum enclosing rectangles of characters from a portable electronic document according to an embodiment of the invention. As shown in Fig. 2, the equipment comprises a text segment display command extracting device 100, a font information extracting device 200, a size information extracting device 300, and an enclosing rectangle computing device 400. The portable electronic document can be a PDF document or a document in another format such as PS.
The text segment display command extracting device 100 extracts, for a text segment in a page of the portable electronic document, the text segment display command of that text segment from the content stream of the page. For each page of a PDF file, it extracts the text segment display commands from the content stream of the current page and stores them in a text segment display command list. A text segment consists of characters, which may be ideographic characters such as Chinese or alphabetic characters such as English; a text segment is not necessarily the same as a word. According to the PDF Reference, a PDF page content stream consists of a sequence of PDF commands and their parameters, and the PDF page is drawn by executing these commands in order. The text segments of a PDF page are drawn by text segment display commands. The embodiment is described for one text segment in one page of a PDF document; applying it in turn to every text segment of every page processes the whole PDF document.
The font information extracting device 200 extracts, for each extracted text segment display command, the font information corresponding to the text segment from the resources of the page. That is, for each command in the text segment display command list, it extracts the font information of the text segment that the command displays. The page content stream and the page resources are two distinct concepts in a PDF document: the page content stream stores the PDF commands, while the page resources store the resources those commands refer to, including fonts, images, and color spaces. Fig. 4 illustrates, by way of example, the PDF document tree structure. The PDF font setting command selects, from the page resources, the font information that applies to the current text segment display command.
The size information extracting device 300 extracts the character size information of each character in the text segment. It does so by processing the text segment display command together with its corresponding font information, which yields the size information of every character in the text segment that the command displays.
The enclosing rectangle computing device 400 computes the enclosing rectangle of each character in the text segment. It processes the other PDF commands in the page content stream together with the extracted character size information to compute the minimum enclosing rectangle of every character that the text segment display command displays. Once all commands in the text segment display command list have been handled, the processing of one page of the PDF document is complete; once every page has been processed, the processing of the PDF document is complete.
The text segment display command extracting device 100 comprises a content stream extracting device 110, a content stream decoding device 120, and a command extracting device 130. The content stream extracting device 110 extracts the content stream of the page from the tree structure of the portable electronic document. The content stream decoding device 120 decodes the extracted content stream according to the encoding it uses. The command extracting device 130 extracts the text segment display commands from the decoded content stream.
The content stream extracting device 110 extracts the content stream of each page by parsing the PDF document tree structure, an example of which is shown in Fig. 4. Building the PDF document tree structure is a known technique; see the PDF Reference (third edition). The content stream decoding device 120 decodes the extracted page content stream. According to the PDF Reference, PDF documents support several stream encodings, such as FlateDecode and LZWDecode, and the encoding used by a page content stream can be obtained from the tree structure. The command extracting device 130 parses the decoded page content stream, extracts the PDF command list, and extracts each text segment display command from that list in order. PDF text segment display commands display text segments on the PDF page; in the present embodiment, the PDF commands (Tj), (TJ), ('), and (") are the text segment display commands. For example, the command (string Tj) displays the text segment "string" at the current position according to the current text state and the current graphics state. The current text state comprises text attributes such as the current font, character spacing, and word spacing; the current graphics state comprises graphics attributes such as the current transformation matrix, foreground color, and background color.
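The decoding and command extraction steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the raw stream bytes and the filter name are taken as already read from the document tree, only FlateDecode is handled, and the tokenizer is deliberately simplified (no nested parentheses, hexadecimal strings, or inline images). The names `decode_stream`, `extract_commands`, and `text_show_commands` are hypothetical and do not come from the patent.

```python
import re
import zlib

# Text segment display commands named in the description: Tj, TJ, ' and ".
TEXT_SHOW_OPS = {"Tj", "TJ", "'", '"'}

# Simplified token pattern (an assumption, not a full PDF lexer).
TOKEN_RE = re.compile(
    r"""\((?:\\.|[^\\()])*\)      # literal string without nested parentheses
      | \[[^\]]*\]                # array operand, as used by TJ
      | /[^\s/\[\]()<>]+          # name such as /F1
      | [-+]?\d*\.?\d+            # number
      | [A-Za-z'"*]+              # command (operator)
    """,
    re.VERBOSE,
)


def decode_stream(raw: bytes, filter_name: str) -> str:
    """Decode a page content stream according to its declared filter."""
    if filter_name == "FlateDecode":
        raw = zlib.decompress(raw)
    elif filter_name:
        # LZWDecode and other filters are left out of this sketch.
        raise NotImplementedError(filter_name)
    return raw.decode("latin-1")


def extract_commands(content: str):
    """Return the command list as (command, operands) pairs, in stream order."""
    commands, operands = [], []
    for match in TOKEN_RE.finditer(content):
        token = match.group(0)
        if token[0] in "([/+-." or token[0].isdigit():
            operands.append(token)          # operand: collect until a command
        else:
            commands.append((token, operands))
            operands = []
    return commands


def text_show_commands(commands):
    """Keep only the text segment display commands, with their list index."""
    return [(i, op, args) for i, (op, args) in enumerate(commands)
            if op in TEXT_SHOW_OPS]


raw = zlib.compress(b"BT /F1 10 Tf 72 700 Td (string) Tj ET")
cmds = extract_commands(decode_stream(raw, "FlateDecode"))
shows = text_show_commands(cmds)
```

Keeping the index of each display command in the full command list makes it straightforward to look backwards for the font setting and positioning commands that apply to it.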
The font information extracting device 200 comprises a font resource extracting device 210, a font setting command search device 220, and an information extracting device 230. The font resource extracting device 210 extracts the font resources of the page from the tree structure of the portable electronic document. The font setting command search device 220 searches the command list of the page content stream for the font setting command closest in order to the text segment display command. The information extracting device 230 uses the parameters of the font setting command found to extract, from the font resources, the font information corresponding to the text segment display command.
For each text segment display command extracted by the text segment display command extracting device 100, the font information extracting device 200 extracts the corresponding font information. The font resource extracting device 210 extracts the font resources from the page resources; the font resources comprise all fonts used by the text segments of the page. The font setting command search device 220 finds, in the command list of the page content stream, the font setting command closest in order to the text segment display command currently being processed. A font setting command sets the font of the text segment to be displayed. For example, the command (/F1 10 Tf) sets the current font to "/F1" and the current font size to 10, where "/F1" is the name of a font in the page font resources. The information extracting device 230 extracts the font information of the text segment to be displayed from the page resources by font name. PDF font information specifies the attributes of a font, including its encoding, font type, and font family.
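A minimal Python sketch of the font setting command search follows. It assumes that "closest in order" means the nearest Tf command that precedes the display command in the command list, that commands are represented as hypothetical (command, operands) pairs, and that the page font resources are available as a plain dictionary; none of these representations is prescribed by the patent.

```python
def find_font_setting(commands, show_index):
    """Return (font_name, font_size) of the nearest Tf before show_index.

    commands is a list of (command, operands) pairs in content stream order.
    Returns None when no font setting command precedes the display command.
    """
    for op, args in reversed(commands[:show_index]):
        if op == "Tf" and len(args) == 2:
            return args[0], float(args[1])
    return None


def lookup_font_info(font_resources, font_name):
    """Look up font information by name; the leading slash is dropped."""
    return font_resources.get(font_name.lstrip("/"))


# Hypothetical page content and font resources, for illustration only.
commands = [
    ("Tf", ["/F1", "10"]),
    ("Tj", ["(a)"]),
    ("Tf", ["/F2", "12"]),
    ("Tj", ["(b)"]),
]
font_resources = {
    "F1": {"Subtype": "Type1", "BaseFont": "Courier"},
    "F2": {"Subtype": "Type3"},
}

name, size = find_font_setting(commands, 3)
info = lookup_font_info(font_resources, name)
```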
The size information extracting device 300 comprises a font size extracting device 310, a character decoding device 320, and a character size extracting device 330. The font size extracting device 310 obtains the font size information corresponding to the extracted font information. The character decoding device 320 decodes the characters in the text segment display command according to the font information of the text segment. The character size extracting device 330 uses the name of each decoded character to extract that character's size information from the font size information.
The size information extracting device 300 extracts the size information of each character in the text segment to be displayed. The font size extracting device 310 extracts the size information of the current font; here, font size information consists of a set of character size information entries. Each entry describes a character's name, width, minimum enclosing rectangle, and similar data. The enclosing rectangle of a character generally refers to its minimum enclosing rectangle. For example, (C 65; WX 600; N A; B 3 0 597 562) is one character size information entry: the character name is "A", its corresponding code is 65, its width is 600, and its minimum enclosing rectangle is (3 0 597 562). Note that the enclosing rectangle extracted by the font size extracting device 310 is expressed in glyph space, a concept known in the art, which is the local coordinate system of the character. The character enclosing rectangle computed by the enclosing rectangle computing device 400 in the present embodiment, by contrast, is expressed in PDF page space, also a concept known in the art, which is the global coordinate system of the page. The PDF commands in the page content stream specify the transformation of each character from glyph space to PDF page space. Depending on the font type, the font size information is obtained either by processing the PDF file or by parsing an external font size information file, which defines the correspondence between font information and font size information. The character decoding device 320 decodes each character of the text segment to be displayed according to the font information of that text segment; different fonts use different character encodings, and the encoding can be obtained from the font information. Decoding yields the character name of each character. The character size extracting device 330 then uses the character name to look up the corresponding character size information in the font size information.
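The character size information entry above can be parsed and indexed by character name as in the following Python sketch. It assumes the entry uses semicolon-separated key/value fields, with the enclosing rectangle field "B" carrying four numbers (lower-left x, lower-left y, upper-right x, upper-right y). The `CharMetric` class and `parse_char_metric` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CharMetric:
    code: int      # character code (field C)
    width: float   # character width (field WX)
    name: str      # character name (field N)
    bbox: tuple    # (llx, lly, urx, ury) in the character's local coordinates


def parse_char_metric(line: str) -> CharMetric:
    """Parse one entry such as 'C 65 ; WX 600 ; N A ; B 3 0 597 562 ;'."""
    fields = {}
    for part in line.split(";"):
        tokens = part.split()
        if tokens:
            fields[tokens[0]] = tokens[1:]   # key -> list of value strings
    return CharMetric(
        code=int(fields["C"][0]),
        width=float(fields["WX"][0]),
        name=fields["N"][0],
        bbox=tuple(float(v) for v in fields["B"]),
    )


# Font size information as a lookup table keyed by character name.
metric = parse_char_metric("C 65 ; WX 600 ; N A ; B 3 0 597 562 ;")
font_metrics = {metric.name: metric}
```

Indexing by character name mirrors the lookup performed after character decoding, where the decoded name is the key into the font size information.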
The font size extracting device 310 comprises a font type extracting device 340 and a font size information extracting device 350. The font type extracting device 340 extracts the font type from the font information. If the font type is described by commands of the portable electronic document, the font size information extracting device 350 obtains the font size information from the character streams in the font information; if it is not, the device obtains the font size information corresponding to the font type from an external font size file. An example of a font type described by commands of the portable electronic document is the Type3 font of PDF documents.
The font type extracting device 340 extracts the type of the current font. If the current font type is, for example, "Type3", the font size information extracting device 350 obtains the font size information by processing the character streams in the font information. If the current font type is not "Type3", the font size information extracting device 350 extracts the current font name, uses it to locate the corresponding external font size information file, and looks up the size information of the current font in that file. The PDF Reference defines several font types, including Type0, Type1, and Type3. Unlike the other font types, a "Type3" font embeds all of its font attributes in the PDF file and describes them with PDF commands; for the other font types, the PDF processing program needs an external font size information file to obtain information such as font sizes and glyph shapes. A "Type3" font consists of a set of character streams and font attributes such as font encoding information; each character stream consists of a sequence of PDF commands that describe the glyph of one character of the font. In the present embodiment, the external font size information file can be an Adobe Font Metrics (AFM) file. Parsing an external font size information file such as an AFM file can be done with known techniques.
Alternatively, for a font type described by commands of the portable electronic document, such as "Type3", the font size information extracting device 350 may replace the font with a known font and obtain the font size information by parsing the external font size information file of that known font. Although the font size information obtained through font substitution differs from the true font size information, the error is tolerable for the vast majority of applications.
Where the font type is described by commands of the portable electronic document, the font size information extracting device 350 can comprise a character stream decoding device 351 and an adding device 352. The character stream decoding device 351 obtains each character stream of the font and decodes it according to the encoding the character stream uses. If the first command in the decoded character stream is a character size setting command, the adding device 352 obtains the character size information from that command and adds it to the font size information. Otherwise, it executes the commands of the character stream in order to rasterize the character, obtains the character size information from the rasterized bitmap, and adds it to the font size information.
For example, for a "Type3" font, the character stream decoding device 351 obtains and decodes each character stream of the font in turn. The adding device 352 checks the first PDF command of each character stream. If it is the "Type3" character size setting command, the adding device 352 obtains the character size information from the parameters of the command and adds it to the font size information. If it is not, the adding device 352 executes the PDF commands of the character stream in order to rasterize the character; rasterization here means converting PDF commands into a character bitmap. The adding device 352 then processes the rasterized character bitmap to obtain the size information of the character and adds it to the font size information.
In the present embodiment, according to the PDF Reference, the "Type3" character size setting command is "d1". For example, the command (1000 0 0 0 750 750 d1) sets the size information, in glyph space, of the character corresponding to the character stream to (0 0 750 750). The rasterization itself can use known techniques. One way to obtain the size information from the rasterized character bitmap is to compute its horizontal and vertical projection histograms: the positions of the first and last non-zero entries of the two histograms determine the horizontal and vertical extent of the character enclosing rectangle. Alternatively, the rasterized bitmap can be processed as follows. Scan the rows of the image from top to bottom and stop at the first row that contains a black pixel, which gives y1; scan the rows from bottom to top and stop at the first row that contains a black pixel, which gives y2; the character height is y2-y1. Fig. 5 schematically illustrates the character height obtained in this way. Likewise, scan the columns from left to right and stop at the first column that contains a black pixel, which gives x1; scan the columns from right to left and stop at the first column that contains a black pixel, which gives x2; the character width is x2-x1.
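Both ways of obtaining the size information of a Type3 character can be sketched in Python. The sketch assumes that the d1 operands arrive as six values in the order wx, wy, llx, lly, urx, ury, and that the rasterized bitmap is a list of rows in which a non-zero entry is a black pixel; the rasterizer itself is outside the sketch. The function names are hypothetical.

```python
def parse_d1(operands):
    """Read the d1 operands wx wy llx lly urx ury.

    Returns (width, (llx, lly, urx, ury)), the enclosing rectangle being
    expressed in the character's local coordinate system.
    """
    wx, wy, llx, lly, urx, ury = (float(v) for v in operands)
    return wx, (llx, lly, urx, ury)


def bitmap_extent(bitmap):
    """Return (x1, y1, x2, y2): first and last columns and rows with ink.

    Equivalent to scanning rows from the top and bottom and columns from
    the left and right until a black pixel appears. Returns None for an
    empty bitmap.
    """
    rows = [y for y, row in enumerate(bitmap) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(bitmap[0]))
            if any(row[x] for row in bitmap)]
    return cols[0], rows[0], cols[-1], rows[-1]


width, bbox = parse_d1(["1000", "0", "0", "0", "750", "750"])

bitmap = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
x1, y1, x2, y2 = bitmap_extent(bitmap)
char_width = x2 - x1    # as defined in the text: x2 - x1
char_height = y2 - y1   # as defined in the text: y2 - y1
```

The lists `rows` and `cols` play the role of the non-zero positions of the two projection histograms, so the same function covers both variants described above.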
The enclosing rectangle computing device 400 comprises a start position computing device 410 and a vertex coordinate computing device 420. The start position computing device 410 computes the start position coordinates of the character. The vertex coordinate computing device 420 computes the vertex coordinates of the character enclosing rectangle from the start position coordinates and the character size information of the character.
The enclosing rectangle computing device 400 computes, in PDF page space, the minimum enclosing rectangle of each character in the text segment to be displayed. The start position computing device 410 computes the position of the origin of the current character, and the vertex coordinate computing device 420 computes the position of its minimum enclosing rectangle. The computed minimum enclosing rectangle can then be stored in a character enclosing rectangle list; the next character in the text segment becomes the current character, the character start position is updated, and the process repeats until all characters in the text segment have been processed.
The minimum enclosing rectangle of a character can be computed with the matrices TRM, TM, and CTM and the formulas below. TRM specifies the affine transformation from glyph space to PDF page space, with TRM = TM × CTM. TM is the text matrix and specifies the affine transformation from glyph space to user space. CTM is the current transformation matrix and specifies the affine transformation from user space to PDF page space. The TM matrix can be modified by the PDF text positioning commands (Td), (TD), (Tm), and (T*). For example, the PDF command (tx ty Td) changes TM according to expression (1), where Td is the command name and tx and ty are its parameters; the Td command is read from the page content stream. The TM on the right-hand side of expression (1) is the matrix before the change and the TM on the left-hand side is the matrix after it; the initial value of TM is the identity matrix.
$$\mathrm{TM} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ tx & ty & 1 \end{pmatrix} \times \mathrm{TM} \tag{1}$$
The CTM matrix can be modified by the PDF command cm. For example, in the PDF command (a b c d e f cm), cm is the command name and a, b, c, d, e, and f are its parameters; the cm command is read from the page content stream. It modifies the CTM matrix according to expression (2). The CTM on the right-hand side of expression (2) is the matrix before the change and the CTM on the left-hand side is the matrix after it; the initial value of CTM is the identity matrix.
$$\mathrm{CTM} = \begin{pmatrix} a & b & 0 \\ c & d & 0 \\ e & f & 1 \end{pmatrix} \times \mathrm{CTM} \tag{2}$$
The start position computing device 410 obtains the start position of the current character of the text segment to be displayed from TRM. TRM is computed as TRM = TM × CTM and is written as expression (3), where h, i, j, k, l, and m are the values resulting from that product.
$$\mathrm{TRM} = \begin{bmatrix} h & i & 0 \\ j & k & 0 \\ l & m & 1 \end{bmatrix} \qquad (3)$$
The start position calculating device 410 computes the start position of the current character of the text segment to be displayed in PDF page space as (xStart, yStart) = (l, m). The scaling ratios xScale and yScale from glyph space to PDF page space in the x and y directions are obtained from TRM by expressions (4) and (5), respectively.
$$xScale = \sqrt{h^{2} + i^{2}} \qquad (4)$$
$$yScale = \sqrt{j^{2} + k^{2}} \qquad (5)$$
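Expressions (3) to (5) can be illustrated with a short Python sketch. It assumes 3×3 row-major lists for the matrices; the function names `text_rendering_matrix` and `start_and_scale` are hypothetical and chosen only for this illustration.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain 3x3 matrix product a x b."""
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def text_rendering_matrix(tm: Matrix, ctm: Matrix) -> Matrix:
    """TRM = TM x CTM, as in expression (3)."""
    return matmul(tm, ctm)


def start_and_scale(trm: Matrix) -> Tuple[Tuple[float, float], float, float]:
    """Return ((xStart, yStart), xScale, yScale) read from TRM."""
    (h, i, _), (j, k, _), (l, m, _) = trm
    x_scale = math.hypot(h, i)  # expression (4)
    y_scale = math.hypot(j, k)  # expression (5)
    return (l, m), x_scale, y_scale


# TM translates by (10, 20); CTM scales by 2 in both directions.
tm = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [10.0, 20.0, 1.0]]
ctm = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
trm = text_rendering_matrix(tm, ctm)
start, x_scale, y_scale = start_and_scale(trm)
```

In this example the translation is scaled together with the glyphs, so the start position becomes (20, 40) and both scaling ratios equal 2.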
The vertex coordinate calculating device 420 computes, in PDF page space, the coordinates of the minimum enclosing rectangle of each character in the text segment to be displayed. When the rotation angle of the text segment and the rotation angle of the page are both 0, the minimum enclosing rectangle of a character in the text segment can be obtained by expressions (6) to (9).
x = xStart + CharMetric.boundingBox.lowerLeftX × fontSize × xScale; (6)
y = yStart + CharMetric.boundingBox.lowerLeftY × fontSize × yScale; (7)
width = (CharMetric.boundingBox.upperRightX - CharMetric.boundingBox.lowerLeftX) × fontSize × xScale; (8)
height = (CharMetric.boundingBox.upperRightY - CharMetric.boundingBox.lowerLeftY) × fontSize × yScale; (9)
Here, fontSize is a parameter of the font setting command Tf; the Tf command and its parameters are read from the page content stream. xScale and yScale are obtained from the TRM matrix; see expressions (4) and (5). xStart and yStart are the start position of the current character in the text segment to be displayed. CharMetric.boundingBox is the character size information extracted by the size information extracting device 300 and represents the character enclosing rectangle in glyph space. It comprises CharMetric.boundingBox.lowerLeftX, CharMetric.boundingBox.lowerLeftY, CharMetric.boundingBox.upperRightX and CharMetric.boundingBox.upperRightY, which are respectively the lower-left x coordinate, lower-left y coordinate, upper-right x coordinate and upper-right y coordinate of the enclosing rectangle of the current character in glyph space. The tuple (x, y, width, height) computed by expressions (6) to (9) describes the minimum enclosing rectangle of the current character in PDF page space: x and y are the x and y coordinates of the rectangle's start position, that is, its lower-left point, and width and height are the width and height of the rectangle in page space.
Assuming that the text segment to be displayed is a horizontal text segment, once the current character has been processed and the next character is to be processed, the character start position is updated by expressions (10) and (11):
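Expressions (6) to (9) can be sketched in Python as below. The class `BoundingBox` and the function `char_rect` are hypothetical names. The sketch also assumes that the bounding box values are already normalized so that multiplying by fontSize is meaningful (metrics given in thousandths of an em would first have to be divided by 1000, which is an assumption not stated in the source), and that both rotation angles are 0.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BoundingBox:
    """Character enclosing rectangle in glyph space (CharMetric.boundingBox)."""
    lower_left_x: float
    lower_left_y: float
    upper_right_x: float
    upper_right_y: float


def char_rect(x_start: float, y_start: float, bbox: BoundingBox,
              font_size: float, x_scale: float,
              y_scale: float) -> Tuple[float, float, float, float]:
    """Return (x, y, width, height) in page space per expressions (6)-(9)."""
    x = x_start + bbox.lower_left_x * font_size * x_scale                  # (6)
    y = y_start + bbox.lower_left_y * font_size * y_scale                  # (7)
    width = (bbox.upper_right_x - bbox.lower_left_x) * font_size * x_scale   # (8)
    height = (bbox.upper_right_y - bbox.lower_left_y) * font_size * y_scale  # (9)
    return x, y, width, height


# A glyph with a descender, set at 12 units with a scaling ratio of 2.
rect = char_rect(20.0, 40.0, BoundingBox(0.0, -0.25, 0.5, 0.75), 12.0, 2.0, 2.0)
```

The negative lower-left y value moves the bottom edge of the rectangle below the start position, which is how a descender shows up in the result.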
xStart=xStart+charSpace+wordSpace+width (10)
yStart=yStart (11)
In the expression formula (10), charSpace is the current character spacing, by being provided with in the command list (CLIST) of the page or leaf content of PDF stream with when the nearest Tc order of the text chunk display command ordering of pre-treatment, wordSpace is current speech spacing, by being provided with in the command list (CLIST) of the page or leaf content of PDF stream with when the nearest Tw order of the text chunk display command ordering of pre-treatment, Tc order and Tw order all can be read from the content stream of the current page of PDF.If previous character is the space, then from nearest Tw order, read the speech spacing, otherwise wordSpace is 0.Fig. 3 exemplarily illustrates the character reference position, character size information is the width height, reaches the PDF character-circumscribed rectangle, and character-circumscribed rectangle is expressed as rectangle frame.
The present invention may also be embodied as a method for extracting character enclosing rectangles from a portable electronic document, comprising: a text segment display command extraction step, which may be performed by the aforementioned text segment display command extracting device 100, of extracting, for a page in the portable electronic document and for a text segment in that page, the text segment display command of the text segment from the content stream of the page; a font information extraction step, which may be performed by the aforementioned font information extracting device 200, of extracting, for the extracted text segment display command, the font information corresponding to the text segment from the resources of the page; a size information extraction step, which may be performed by the aforementioned size information extracting device 300, of extracting, for a character in the text segment, the character size information of that character; and an enclosing rectangle calculation step, which may be performed by the aforementioned enclosing rectangle computing device 400, of computing, for a character in the text segment, the enclosing rectangle of that character.
The text segment display command extraction step comprises: a content stream extraction step, which may be performed by the aforementioned content stream extracting device 110, of extracting the content stream of the page from the tree structure of the portable electronic document; a content stream decoding step, which may be performed by the aforementioned content stream decoding device 120, of decoding the extracted content stream according to the encoding it uses; and a command extraction step, which may be performed by the aforementioned command extracting device 130, of extracting the text segment display command from the decoded content stream.
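The decoding and command extraction steps can be illustrated with a deliberately simplified Python sketch. It assumes that the content stream is either unencoded or compressed with the PDF FlateDecode filter (zlib), and it recognizes only the simple `(string) Tj` form of the text display command. The names `decode_content_stream` and `extract_show_commands` are hypothetical, and a real implementation would need a full content stream tokenizer (hex strings, nested parentheses, TJ arrays and other display operators are not handled here).

```python
import re
import zlib
from typing import List

# Matches "(...) Tj", allowing backslash escapes inside the string.
TJ_PATTERN = re.compile(rb"\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj")


def decode_content_stream(raw: bytes, filter_name: str) -> bytes:
    """Decode a content stream according to the filter it declares."""
    if filter_name == "FlateDecode":
        return zlib.decompress(raw)
    if filter_name == "":
        return raw  # stream is stored without encoding
    raise ValueError("unsupported filter in this sketch: " + filter_name)


def extract_show_commands(decoded: bytes) -> List[bytes]:
    """Return the raw string operand of every simple Tj command."""
    return [match.group("text") for match in TJ_PATTERN.finditer(decoded)]


# Build a tiny compressed stream and run both steps on it.
plain = b"BT /F1 12 Tf 10 20 Td (Hello) Tj 0 -14 Td (World) Tj ET"
raw = zlib.compress(plain)
decoded = decode_content_stream(raw, "FlateDecode")
shown = extract_show_commands(decoded)
```

The extracted operands are still encoded character codes; mapping them to character names is the job of the character decoding step described further on.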
The font information extraction step comprises: a font resource extraction step, which may be performed by the aforementioned font resource extracting device 210, of extracting the font resources of the page from the tree structure of the portable electronic document; a font setting command searching step, which may be performed by the aforementioned font setting command searching device 220, of searching the command list of the content stream for the font setting command nearest in order to the text segment display command; and an information extraction step, which may be performed by the aforementioned information extracting device 230, of extracting the font information corresponding to the text segment display command from the font resources according to the parameters of the font setting command that was found.
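The font setting command search can be sketched as a backward scan over the command list. The sketch assumes that "nearest in order" means the most recent Tf command preceding the display command, and that commands have already been parsed into (operator, operands) pairs; the names `find_font_setting` and `font_info_for` and the dictionary layout of the font resources are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Command = Tuple[str, Tuple[Any, ...]]


def find_font_setting(commands: List[Command],
                      show_index: int) -> Optional[Tuple[str, float]]:
    """Scan backwards from the display command for the nearest Tf command."""
    for index in range(show_index - 1, -1, -1):
        operator, operands = commands[index]
        if operator == "Tf":
            font_name, font_size = operands
            return font_name, font_size
    return None  # no font has been set before this display command


def font_info_for(commands: List[Command], show_index: int,
                  font_resources: Dict[str, Dict[str, str]]
                  ) -> Optional[Dict[str, str]]:
    """Look up the font information named by the nearest Tf command."""
    setting = find_font_setting(commands, show_index)
    if setting is None:
        return None
    return font_resources.get(setting[0])


commands = [
    ("Tf", ("F1", 12.0)),
    ("Td", (10.0, 20.0)),
    ("Tj", ("Hello",)),
    ("Tf", ("F2", 9.0)),
    ("Tj", ("World",)),
]
# Hypothetical font resources of the page, keyed by resource name.
fonts = {"F1": {"Subtype": "Type1"}, "F2": {"Subtype": "Type3"}}
first = find_font_setting(commands, 2)
second_info = font_info_for(commands, 4, fonts)
```

The second value returned by `find_font_setting` is the fontSize parameter that expressions (6) to (9) use.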
The size information extraction step comprises: a font size extraction step, which may be performed by the aforementioned font size extracting device 310, of obtaining the font size information corresponding to the extracted font information; a character decoding step, which may be performed by the aforementioned character decoding device 320, of decoding the characters in the text segment display command according to the font information of the text segment; and a character size extraction step, which may be performed by the aforementioned character size extracting device 330, of extracting the character size information of a character from the font size information according to the name of the decoded character.
The enclosing rectangle calculation step comprises: a start position calculation step, which may be performed by the aforementioned start position calculating device 410, of computing the start position coordinates of the character; and a vertex coordinate calculation step, which may be performed by the aforementioned vertex coordinate calculating device 420, of computing the vertex coordinates of the character enclosing rectangle from the start position coordinates and the character size information of the character.
The font size extraction step comprises: a font type extraction step, which may be performed by the aforementioned font type extracting device 340, of extracting the font type from the font information; and a font size information extraction step, which may be performed by the aforementioned font size information extracting device 350, in which, if the font type is described by commands of the portable electronic document, the font size information is obtained from the character streams of the font information, and, if the font type is not described by commands of the portable electronic document, the font size information corresponding to the font type is obtained from an external font size file.
The font size information extraction step comprises: a character stream decoding step, which may be performed by the aforementioned character stream decoding device 351, of obtaining, when the font type is described by commands of the portable electronic document, each character stream of the font type and decoding it according to the encoding the character stream uses; and an adding step, which may be performed by the aforementioned adding device 352, in which, if the first command in the decoded character stream is a character size setting command, the character size information is obtained from that command and added to the font size information, and, if the first command in the decoded character stream is not a character size setting command, the commands in the character stream are executed in order so as to rasterize the character, the character size information is obtained from the rasterized bitmap, and that character size information is added to the font size information.
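The first branch of the adding step can be sketched in Python. The sketch assumes that the "character size setting command" corresponds to the PDF Type 3 glyph operator `d1`, whose operands are `wx wy llx lly urx ury`; the source does not name the operator, so this mapping is an assumption. The function name `bbox_from_char_stream` is hypothetical, and the rasterization fallback is not implemented: the function simply returns `None` to signal that it would be needed.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]


def bbox_from_char_stream(decoded: bytes) -> Optional[BBox]:
    """Read (llx, lly, urx, ury) if the stream starts with a d1 command.

    Returns None when the first command is anything else, in which case
    the caller would have to rasterize the glyph to measure it.
    """
    tokens = decoded.split()
    # A leading d1 command is six numeric operands followed by the operator.
    if len(tokens) >= 7 and tokens[6] == b"d1":
        wx, wy, llx, lly, urx, ury = (float(t.decode("ascii"))
                                      for t in tokens[:6])
        return llx, lly, urx, ury
    return None


# A glyph description that declares its own bounding box.
with_size = bbox_from_char_stream(b"1000 0 0 -200 750 700 d1 0 0 m 750 700 l S")
# A glyph description that starts with a different command.
without_size = bbox_from_char_stream(b"1000 0 d0 0 0 m 750 700 l S")
```

Reading the declared box avoids executing the drawing commands at all, which is presumably why the source checks the first command before falling back to rasterization.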
Although this specification uses PDF documents as its example, those skilled in the art will understand that embodiments of the invention may also be applied to portable electronic documents in other formats, such as the PS format.
The series of operations described in this specification can be carried out by hardware, by software, or by a combination of the two. When the operations are carried out by software, the computer program can be installed in the memory of a computer built into dedicated hardware, so that the computer executes the program. Alternatively, the computer program can be installed on a general-purpose computer capable of performing various types of processing, so that the computer executes the program.
For example, the computer program can be stored in advance on a hard disk or in a ROM (read-only memory) serving as the recording medium. Alternatively, the computer program can be stored (recorded) temporarily or permanently on a removable recording medium such as a floppy disk, a CD-ROM (compact disc read-only memory), an MO (magneto-optical) disc, a DVD (digital versatile disc), a magnetic disk, or a semiconductor memory. Such a removable recording medium can be provided as packaged software.
The present invention has been described in detail with reference to specific embodiments. It is evident, however, that those skilled in the art can modify and substitute the embodiments without departing from the spirit of the invention. In other words, the invention is disclosed by way of illustration and should not be construed restrictively. The appended claims should be considered in determining the gist of the invention.

Claims (10)

1. Equipment for extracting character enclosing rectangles from a portable electronic document, comprising:
a text segment display command extracting device which, for a page in the portable electronic document and for a text segment in that page, extracts the text segment display command of the text segment from the content stream of the page;
a font information extracting device which, for the extracted text segment display command, extracts the font information corresponding to the text segment from the resources of the page;
a size information extracting device which, for a character in the text segment, extracts the character size information of that character; and
an enclosing rectangle computing device which, for a character in the text segment, computes the enclosing rectangle of that character.
2. The equipment for extracting character enclosing rectangles from a portable electronic document according to claim 1, wherein the text segment display command extracting device comprises:
a content stream extracting device which extracts the content stream of the page from the tree structure of the portable electronic document;
a content stream decoding device which decodes the extracted content stream according to the encoding it uses; and
a command extracting device which extracts the text segment display command from the decoded content stream.
3. The equipment for extracting character enclosing rectangles from a portable electronic document according to claim 1, wherein the font information extracting device comprises:
a font resource extracting device which extracts the font resources of the page from the tree structure of the portable electronic document;
a font setting command searching device which searches the command list of the content stream for the font setting command nearest in order to the text segment display command; and
an information extracting device which extracts the font information corresponding to the text segment display command from the font resources according to the parameters of the font setting command that was found.
4. The equipment for extracting character enclosing rectangles from a portable electronic document according to claim 1, wherein the size information extracting device comprises:
a font size extracting device which obtains the font size information corresponding to the extracted font information;
a character decoding device which decodes the characters in the text segment display command according to the font information of the text segment; and
a character size extracting device which extracts the character size information of a character from the font size information according to the name of the decoded character.
5. The equipment for extracting character enclosing rectangles from a portable electronic document according to claim 1, wherein the enclosing rectangle computing device comprises:
a start position calculating device which computes the start position coordinates of the character; and
a vertex coordinate calculating device which computes the vertex coordinates of the character enclosing rectangle from the start position coordinates and the character size information of the character.
6. The equipment for extracting character enclosing rectangles from a portable electronic document according to claim 4, wherein the font size extracting device comprises:
a font type extracting device which extracts the font type from the font information; and
a font size information extracting device which, if the font type is described by commands of the portable electronic document, obtains the font size information from the character streams of the font information, and, if the font type is not described by commands of the portable electronic document, obtains the font size information corresponding to the font type from an external font size file.
7. The equipment for extracting character enclosing rectangles from a portable electronic document according to claim 6, wherein the font size information extracting device comprises:
a character stream decoding device which, when the font type is described by commands of the portable electronic document, obtains each character stream of the font type and decodes it according to the encoding the character stream uses; and
an adding device which, if the first command in the decoded character stream is a character size setting command, obtains the character size information from that command and adds it to the font size information, and, if the first command in the decoded character stream is not a character size setting command, executes the commands in the character stream in order so as to rasterize the character, obtains the character size information from the rasterized bitmap, and adds that character size information to the font size information.
8. The equipment for extracting character enclosing rectangles from a portable electronic document according to claim 1, wherein the portable electronic document is a PDF document.
9. The equipment for extracting character enclosing rectangles from a portable electronic document according to claim 6, wherein the font type described by commands of the portable electronic document is the Type3 font type of a PDF document.
10. A method for extracting character enclosing rectangles from a portable electronic document, comprising:
a text segment display command extraction step of extracting, for a page in the portable electronic document and for a text segment in that page, the text segment display command of the text segment from the content stream of the page;
a font information extraction step of extracting, for the extracted text segment display command, the font information corresponding to the text segment from the resources of the page;
a size information extraction step of extracting, for a character in the text segment, the character size information of that character; and
an enclosing rectangle calculation step of computing, for a character in the text segment, the enclosing rectangle of that character.
CN200910249848.7A 2009-11-27 2009-11-27 Equipment and method for extracting enclosing rectangles of characters from portable electronic documents Expired - Fee Related CN102081736B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910249848.7A CN102081736B (en) 2009-11-27 2009-11-27 Equipment and method for extracting enclosing rectangles of characters from portable electronic documents

Publications (2)

Publication Number Publication Date
CN102081736A (en) 2011-06-01
CN102081736B (en)

Family

ID=44087691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910249848.7A Expired - Fee Related CN102081736B (en) 2009-11-27 2009-11-27 Equipment and method for extracting enclosing rectangles of characters from portable electronic documents

Country Status (1)

Country Link
CN (1) CN102081736B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6274181A (en) * 1985-09-27 1987-04-04 Sony Corp Character recognizing device
CN1495660A (en) * 1995-09-06 2004-05-12 富士通株式会社 Header extracting device and method for extracting header from file image
CN1467682A (en) * 2002-06-28 2004-01-14 富士通株式会社 Apparatus and method of analyzing layout of document, and computer product
US20080063276A1 (en) * 2006-09-08 2008-03-13 Luc Vincent Shape clustering in post optical character recognition processing
CN101325642A (en) * 2007-06-15 2008-12-17 佳能株式会社 Information processing apparatus and method thereof
JP2008312063A (en) * 2007-06-15 2008-12-25 Canon Inc Information processor and its method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JOSEF B. BAKER, ALAN P. SEXTON, VOLKER SORGE: "《Extracting Precise Data on the Mathematical Content of PDF Documents》", 《MATHEMATICAL CONTENT OF PDF DOCUMENTS》 *
张秀秀,张立峰: "《Research on Text Content Extraction from PDF Files》", 《科技情报开发与经济》 (Sci-Tech Information Development & Economy) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106033412A (en) * 2015-03-20 2016-10-19 广州金山移动科技有限公司 Text conversion method and device
CN106033412B (en) * 2015-03-20 2019-07-26 广州金山移动科技有限公司 A kind of text conversion method and device
CN107688789A (en) * 2017-08-31 2018-02-13 平安科技(深圳)有限公司 Document charts abstracting method, electronic equipment and computer-readable recording medium
CN107688789B (en) * 2017-08-31 2021-05-18 平安科技(深圳)有限公司 Document chart extraction method, electronic device and computer readable storage medium
CN108897730A (en) * 2018-06-29 2018-11-27 国信优易数据有限公司 A kind of processing method and device of PDF text
CN108897730B (en) * 2018-06-29 2022-07-29 国信优易数据股份有限公司 PDF text processing method and device
CN109670461A (en) * 2018-12-24 2019-04-23 广东亿迅科技有限公司 PDF text extraction method, device, computer equipment and storage medium
CN111027285A (en) * 2019-12-17 2020-04-17 南京上游软件有限公司 Method and system for automatically extracting order information from pdf format order
CN111027285B (en) * 2019-12-17 2023-06-16 南京上游软件有限公司 Method and system for automatically extracting order information from pdf format order
CN114546936A (en) * 2021-12-22 2022-05-27 上海电机学院 PDF form data extraction method, storage medium and device

Also Published As

Publication number Publication date
CN102081736B (en) 2014-11-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20141126

Termination date: 20201127