CN102081736B - Equipment and method for extracting enclosing rectangles of characters from portable electronic documents - Google Patents


Publication number
CN102081736B
Authority
CN
China
Prior art keywords
character
font
information
portable electronic
size information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200910249848.7A
Other languages
Chinese (zh)
Other versions
CN102081736A (en
Inventor
杜成
徐文晖
长谷川史裕
井上浩一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd
Priority to CN200910249848.7A
Publication of CN102081736A
Application granted
Publication of CN102081736B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Controls And Circuits For Display Device (AREA)

Abstract

The invention provides equipment for extracting enclosing rectangles of characters from portable electronic documents. The equipment comprises a text segment display command extracting device, a font information extracting device, a size information extracting device and an enclosing rectangle computing device, wherein the text segment display command extracting device is used for extracting text segment display commands of text segments in the pages in the portable electronic documents from content streams of the pages; the font information extracting device is used for extracting the font information corresponding to the text segments from the resources of the pages with regard to the extracted text segment display commands; the size information extracting device is used for extracting the size information of the characters in the text segments; and the enclosing rectangle computing device is used for computing the enclosing rectangles of the characters in the text segments. The invention also provides a method for extracting the enclosing rectangles of the characters from the portable electronic documents.

Description

Equipment and method for extracting enclosing rectangles of characters from portable electronic documents
Technical field
The invention provides equipment and a method for extracting the enclosing rectangles of characters from portable electronic documents.
Background art
Portable electronic documents, such as PDF (Portable Document Format) and PS (PostScript) documents, are widely used in everyday office work. However, extracting specific information from a portable electronic document remains difficult. For example, a PDF file does not explicitly store the minimum enclosing rectangle of each character, so extracting it requires parsing the document and computing the rectangle. Extracting the minimum enclosing rectangles of characters from a document has wide application in document retrieval systems: by matching the positions and sizes of the minimum enclosing rectangles of adjacent characters, an electronic document can be matched quickly against a document image, which enables document retrieval.
US Patent 6801673B2 provides a method for extracting words from a PDF document. The method extracts words either by searching for word separators (spaces) within text segments or by evaluating the distance between adjacent text segments: if the distance exceeds a threshold, the adjacent text segments are split into two words. The input of the method is a PDF document, and the output is the set of words the document contains.
US Patent 5832530 proposes a tool for extracting content fragments from a PDF document. The user first drags a rectangle in a PDF viewer interface; the tool then extracts the PDF content fragment enclosed by the rectangle and stores it as a new PDF document. The tool extracts and pastes low-level PDF commands and does not extract high-level document content such as pictures and tables.
Summary of the invention
The present invention has been made in view of the above problems in the prior art, and proposes equipment and a method for extracting the minimum enclosing rectangles of characters from a portable electronic document. The pages of the vast majority of portable electronic documents are oriented horizontally or vertically; the page rotation angle is 0, 90, 180, or 270 degrees and can be read from the PDF document tree structure. The characters on such pages are likewise horizontal or vertical, so no arbitrary rotation angle has to be handled. The present invention can therefore process portable electronic documents in most cases. The invention belongs to the field of document processing and is applicable to document content extraction, document reuse, and document retrieval.
According to one aspect of the present invention, equipment for extracting enclosing rectangles of characters from a portable electronic document is provided, comprising: a text segment display command extracting device which, for a text segment in a page of the portable electronic document, extracts the text segment display command of the text segment from the content stream of the page; a font information extracting device which, for the extracted text segment display command, extracts the font information corresponding to the text segment from the resources of the page; a size information extracting device which, for a character in the text segment, extracts the character size information of the character; and an enclosing rectangle computing device which, for a character in the text segment, computes the enclosing rectangle of the character.
According to another aspect of the present invention, a method for extracting enclosing rectangles of characters from a portable electronic document is provided, comprising: a text segment display command extracting step of, for a text segment in a page of the portable electronic document, extracting the text segment display command of the text segment from the content stream of the page; a font information extracting step of, for the extracted text segment display command, extracting the font information corresponding to the text segment from the resources of the page; a size information extracting step of, for a character in the text segment, extracting the character size information of the character; and an enclosing rectangle computing step of, for a character in the text segment, computing the enclosing rectangle of the character.
The present invention can be used to extract the minimum enclosing rectangles of characters from portable electronic documents such as PDF and PS documents, and the extracted rectangles can be used for document reuse, document retrieval, and the like. For example, by comparing the geometric distribution of character enclosing rectangles, an electronic document can be matched against a document image, which enables document retrieval.
The above and other objects, features, advantages, and technical and industrial significance of the present invention will be better understood by reading the following detailed description of the preferred embodiments of the invention in conjunction with the accompanying drawings.
Brief description of the drawings
Fig. 1 illustrates a computer system that implements extraction of the minimum enclosing rectangles of characters from a portable electronic document according to an embodiment of the present invention.
Fig. 2 illustrates the overall block diagram of the equipment for extracting the minimum enclosing rectangles of characters from a portable electronic document according to an embodiment of the present invention.
Fig. 3 illustrates, by way of example, the character start position, the character size information, and the character enclosing rectangle.
Fig. 4 illustrates, by way of example, a PDF document tree structure.
Fig. 5 schematically illustrates the character height obtained by processing a rasterized character bitmap.
Embodiments
As shown in Fig. 1, a computer system 10 that implements extraction of the minimum enclosing rectangles of characters from a portable electronic document according to an embodiment of the present invention comprises a computer 11, a keyboard 16, a display 17, a printer 18, a floppy disk 19, a network access device 20, and a hard disk drive 21. The computer 11 comprises a data bus 12, a random access memory (RAM) 13, a read-only memory (ROM) 14, a central processing unit 15, and a peripheral bus 22.
In accordance with instructions received from the random access memory 13, the central processing unit 15 controls the reception and processing of input data and the output to the display 17 or other peripherals. In this embodiment, one function of the central processing unit 15 is to process an input PDF document and extract the minimum enclosing rectangles of the characters it contains. The extracted character enclosing rectangles can be used by other application programs running on the central processing unit 15.
The central processing unit 15 accesses the random access memory 13 and the read-only memory 14 through the data bus 12. The random access memory 13 is readable and writable memory that the central processing unit 15 uses as the work area and variable data storage area of each process. The read-only memory 14 stores portable electronic documents such as PDF documents, the extracted character enclosing rectangles, the character enclosing rectangle extraction program, and other application programs.
The peripheral bus 22 is used to access the input, output, storage, and other peripherals connected to the computer 11. In this embodiment, the peripherals comprise the display 17, the printer 18, the floppy disk 19, the network access device 20, and the hard disk drive 21. The display 17 shows the data and images that the central processing unit 15 outputs through the peripheral bus 22. The display 17 can be a raster display device, such as a CRT or LCD display. The printer 18 prints the data and images supplied by the central processing unit 15 onto paper or a paper-like medium. To show a PDF document on an output device such as the display 17 or the printer 18, the computer system 10 has to implement a document rasterization process that converts the PDF document into its corresponding image representation. In other embodiments, an output device such as the printer 18 may itself include a central processing unit or similar processor to perform a similar conversion from PDF document to image. The floppy disk 19 and the hard disk drive 21 are used to store PDF documents. With the floppy disk 19, PDF documents can be transferred between different computer systems. The hard disk drive 21 offers more storage space and faster access. Other storage devices, such as flash memory, can also be used to store PDF files for access by the computer system 10. The computer system 10 sends data over a network and receives data from other computer systems through the network access device 20. The user can input instructions to the computer system 10 through the keyboard 16.
Fig. 2 illustrates the overall block diagram of the equipment for extracting the minimum enclosing rectangles of characters from a portable electronic document according to an embodiment of the present invention. As shown in Fig. 2, the equipment comprises a text segment display command extracting device 100, a font information extracting device 200, a size information extracting device 300, and an enclosing rectangle computing device 400. The portable electronic document can be a PDF document or a document in another format such as PS.
The text segment display command extracting device 100 is used, for a text segment in a page of the portable electronic document, to extract the text segment display command of the text segment from the content stream of the page. For each page in a PDF file, the device 100 extracts the text segment display commands from the content stream of the current page and stores them in a text segment display command list. A text segment consists of characters, which may be ideographic characters such as Chinese or alphabetic characters such as English; a text segment is not necessarily equivalent to a word. According to the PDF Reference, a PDF page content stream consists of a sequence of PDF commands and their parameters. Executing these commands in order draws the PDF page. Text segments in a PDF page are drawn by text segment display commands. The embodiment is described for one text segment in one page of a PDF document; applying the invention in order to every text segment of every page processes the whole PDF document.
The font information extracting device 200 is used, for the extracted text segment display command, to extract the font information corresponding to the text segment from the resources of the page. For each command in the text segment display command list, the device 200 extracts the font information corresponding to the text segment that the command displays. The page content stream and the page resources are two concepts in a PDF document: the page content stream stores the PDF commands, and the page resources store the resources those commands refer to, including fonts, images, color spaces, and so on. Fig. 4 illustrates, by way of example, a PDF document tree structure. The PDF font-setting command (Tf) selects from the page resources the font information corresponding to the current text segment display command.
The size information extracting device 300 is used, for a character in the text segment, to extract the character size information of the character. By processing a text segment display command and its corresponding font information, the device extracts the size information of each character in the text segment that the command displays.
The enclosing rectangle computing device 400 is used, for a character in the text segment, to compute the enclosing rectangle of the character. By processing the other PDF commands in the page content stream together with the extracted character size information, the device 400 computes the minimum enclosing rectangle of each character that the text segment display command displays. Once all commands in the text segment display command list have been processed, one page of the PDF document is complete. Once every page has been processed, the PDF document is complete.
The text segment display command extracting device 100 comprises a content stream extracting device 110, a content stream decoding device 120, and a command extracting device 130. The content stream extracting device 110 extracts the content stream of the page from the tree structure of the portable electronic document. The content stream decoding device 120 decodes the content stream according to the encoding that the extracted content stream uses. The command extracting device 130 extracts the text segment display commands from the decoded content stream.
The content stream extracting device 110 extracts the content stream of each page by parsing a PDF document tree structure such as the one shown in Fig. 4. Building the PDF document tree structure is a known technique; see the PDF Reference (third edition). The content stream decoding device 120 decodes the extracted page content stream. According to the PDF Reference, PDF documents support different stream encodings, such as FlateDecode and LZWDecode, and the encoding used by a page content stream can be obtained from the tree structure. The command extracting device 130 parses the decoded page content stream, extracts the list of PDF commands, and extracts each text segment display command from the list in order. PDF text segment display commands show text segments on a PDF page; in this embodiment, the PDF commands (Tj), (TJ), ('), and (") are the text segment display commands. For example, the command (string Tj) shows the text segment "string" at the current position according to the current text state and the current graphics state. Here the current text state comprises text attributes such as the current font, character spacing, and word spacing, and the current graphics state comprises graphics attributes such as the current transformation matrix, foreground color, and background color.
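The decoding and command extraction steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes the raw stream bytes have already been located in the document tree, that the stream is FlateDecode-encoded (handled here with Python's standard `zlib` module), and that the content is simple enough for a regular-expression tokenizer. The names `parse_commands` and `extract_text_show_commands` are hypothetical.

```python
import re
import zlib

# Text segment display commands named in the description.
TEXT_SHOW_OPERATORS = {"Tj", "TJ", "'", '"'}

# Simplified tokenizer (an assumption of this sketch): literal strings,
# arrays, names, then numbers/operators. It does not handle nested
# parentheses, hex strings, or inline images.
TOKEN_RE = re.compile(
    r"\((?:[^()\\]|\\.)*\)|\[[^\]]*\]|/[^\s()\[\]/]+|[^\s()\[\]/]+"
)


def is_operand(token):
    """Strings, arrays, names, and numbers are operands; the rest are operators."""
    if token[0] in "([/":
        return True
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_commands(decoded):
    """Group operands with the operator that follows them."""
    commands, operands = [], []
    for token in TOKEN_RE.findall(decoded):
        if is_operand(token):
            operands.append(token)
        else:
            commands.append((token, operands))
            operands = []
    return commands


def extract_text_show_commands(raw_stream, flate_encoded=True):
    """Decode a page content stream and list its text display commands.

    Returns (index in command list, operator, operands) tuples.
    """
    data = zlib.decompress(raw_stream) if flate_encoded else raw_stream
    commands = parse_commands(data.decode("latin-1"))
    return [
        (index, op, args)
        for index, (op, args) in enumerate(commands)
        if op in TEXT_SHOW_OPERATORS
    ]
```

Keeping the index of each display command in the full command list makes it possible to look backwards later for the font, spacing, and positioning commands that apply to it.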
The font information extracting device 200 comprises a font resource extracting device 210, a font-setting command searching device 220, and an information extracting device 230. The font resource extracting device 210 extracts the font resources of the page from the tree structure of the portable electronic document. The font-setting command searching device 220 searches the command list of the page content stream for the font-setting command nearest in sequence to the text segment display command. The information extracting device 230 extracts the font information corresponding to the text segment display command from the font resources according to the parameters of the font-setting command that was found.
For each text segment display command extracted by the text segment display command extracting device 100, the font information extracting device 200 extracts the corresponding font information. The font resource extracting device 210 extracts the font resources from the page resources; the font resources in the page resources consist of all fonts used by the text segments of the page. The font-setting command searching device 220 finds, in the command list of the page content stream, the font-setting command nearest in sequence to the text segment display command being processed. A font-setting command sets the font of the text segment to be displayed. For example, the command (/F1 10 Tf) sets the current font to "/F1" and the current font size to 10, where "/F1" is the name of a font in the page font resources. The information extracting device 230 extracts the font information corresponding to the text segment to be displayed from the page resources according to the font name. PDF font information specifies the attributes of a font, including the font encoding, font type, font family, and so on.
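The search for the font-setting command can be sketched as follows. The sketch assumes the command list is a list of `(operator, operands)` pairs and interprets "nearest in sequence" as the nearest Tf command that precedes the display command, since a Tf command takes effect for the commands that follow it. The function names and the dictionary layout of the font resources are hypothetical.

```python
def find_font_for_command(commands, show_index):
    """Return (font name, font size) from the nearest preceding Tf command.

    commands: list of (operator, operands) pairs in content stream order.
    show_index: position of the text segment display command in that list.
    Returns None if no Tf command precedes the display command.
    """
    for op, args in reversed(commands[:show_index]):
        if op == "Tf":
            return args[0], float(args[1])
    return None


def lookup_font_info(font_resources, font_name):
    """Look up font information by name in the page font resources.

    font_resources: dict keyed by font name without the leading slash
    (a hypothetical layout chosen for this sketch).
    """
    return font_resources.get(font_name.lstrip("/"))
```

For the command list of a stream such as `BT /F1 10 Tf 72 700 Td (string) Tj ET`, the display command at index 3 resolves to the font name "/F1" with size 10, which is then used as the key into the font resources.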
The size information extracting device 300 comprises a font metrics extracting device 310, a character decoding device 320, and a character size extracting device 330. The font metrics extracting device 310 obtains the font metrics information corresponding to the extracted font information. The character decoding device 320 decodes the characters in the text segment display command according to the font information of the text segment. The character size extracting device 330 extracts the character size information of a character from the font metrics information according to the name of the decoded character.
The size information extracting device 300 extracts the size information of each character in the text segment to be displayed. The font metrics extracting device 310 extracts the metrics information of the current font; here, font metrics information consists of a set of character size information entries. A character size information entry describes the character name, the character width, the minimum enclosing rectangle, and similar information. The enclosing rectangle of a character generally means its minimum enclosing rectangle. For example, (C 65; WX 600; N A; B 3 0 597 562) is one character size information entry: the character name is "A", the corresponding Unicode code is 65, and the minimum enclosing rectangle is (3 0 597 562). Note that the enclosing rectangle extracted by the font metrics extracting device 310 is expressed in glyph space, a concept well known in the art, which is the local coordinate system of the character. The character enclosing rectangle computed by the enclosing rectangle computing device 400 in this embodiment is expressed in PDF page space, also a concept well known in the art, which is the global coordinate system of the page. The PDF commands in the page content stream specify the transformation of each character from glyph space to PDF page space. Depending on the font type, the font metrics information can be obtained either by processing the PDF file or by parsing an external font metrics file, which defines the correspondence between font information and font metrics information. The character decoding device 320 decodes each character in the text segment to be displayed according to the font information of the text segment; different fonts use different character encodings, and the encoding can be obtained from the font information. Decoding yields the character name of each character. The character size extracting device 330 retrieves the corresponding character size information from the font metrics information according to the character name.
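Retrieving character size information by character name can be sketched as follows, using an entry written in the AFM character metric style (`C 65 ; WX 600 ; N A ; B 3 0 597 562 ;`). The `CharMetric` class and both function names are hypothetical, and the parser is a minimal sketch that handles only the C, WX, N, and B fields.

```python
from dataclasses import dataclass


@dataclass
class CharMetric:
    code: int      # character code (C field)
    width: float   # advance width (WX field)
    name: str      # character name (N field)
    bbox: tuple    # (llx, lly, urx, ury) in glyph space (B field)


def parse_char_metric(line):
    """Parse one semicolon-separated character metric entry."""
    fields = {}
    for part in line.split(";"):
        tokens = part.split()
        if tokens:
            fields[tokens[0]] = tokens[1:]
    return CharMetric(
        code=int(fields["C"][0]),
        width=float(fields["WX"][0]),
        name=fields["N"][0],
        bbox=tuple(float(v) for v in fields["B"]),
    )


def build_font_metrics(lines):
    """Index character size entries by character name for later lookup."""
    return {metric.name: metric for metric in map(parse_char_metric, lines)}
```

Indexing by character name mirrors the lookup the description attributes to the character size extracting device: once a character has been decoded to its name, its glyph-space rectangle is a single dictionary access.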
The font metrics extracting device 310 comprises a font type extracting device 340 and a font metrics information extracting device 350. The font type extracting device 340 extracts the font type from the font information. The font metrics information extracting device 350 obtains the font metrics information from the character streams of the font information if the font type is described by commands of the portable electronic document; if the font type is not described by commands of the portable electronic document, it obtains the font metrics information corresponding to the font type from an external font metrics file. An example of a font type described by commands of the portable electronic document is the Type3 font type of PDF documents.
The font type extracting device 340 extracts the current font type. If the current font type is, for example, "Type3", the font metrics information extracting device 350 obtains the font metrics information by processing the character streams in the font information. If the current font type is not "Type3", the device 350 extracts the current font name, uses it to find the corresponding external font metrics file, and looks up the current font metrics information in that file. The PDF Reference defines different font types, including Type0, Type1, and Type3. Unlike the other font types, a "Type3" font embeds all font attributes in the PDF file and describes them with PDF commands; for other font types, a PDF processing program needs an external font metrics file to obtain information such as the font metrics and glyphs. A "Type3" font consists of font attributes such as a set of character streams and font encoding information; each character stream consists of a sequence of PDF commands that describe the glyph of one character of the font. In this embodiment, the external font metrics file can be an Adobe Font Metrics (AFM) file. Parsing an external font metrics file such as an AFM file can be done with known techniques.
For a font type described by commands of the portable electronic document, such as "Type3", the font metrics information extracting device 350 may also replace the font with a known font and obtain the font metrics information by parsing the external font metrics file corresponding to that known font. Although font metrics obtained through font substitution deviate from the true font metrics, the error is tolerable for the vast majority of applications.
For the case in which the font type is described by commands of the portable electronic document, the font metrics information extracting device 350 can comprise a character stream decoding device 351 and an adding device 352. The character stream decoding device 351 obtains each character stream of the font and decodes it according to the encoding the character stream uses. If the first command in the decoded character stream is a character size setting command, the adding device 352 obtains the character size information from that command and adds it to the font metrics information. If the first command in the decoded character stream is not a character size setting command, the adding device 352 executes each command of the character stream in order to rasterize the character, obtains the character size information from the rasterized bitmap, and adds it to the font metrics information.
For example, for a "Type3" font, the character stream decoding device 351 obtains and decodes each character stream of the font in order. The adding device 352 checks the first PDF command in each character stream. If this command is the "Type3" character size setting command, the device obtains the character size information from the parameters of the command and adds it to the font metrics information. If the first command in the character stream is not the "Type3" character size setting command, the adding device 352 executes each PDF command of the character stream in order to rasterize the character; rasterization here means the conversion from PDF commands to a character bitmap. The adding device 352 then processes the rasterized character bitmap to obtain the size information of the character and adds it to the font metrics information.
In this embodiment, according to the PDF Reference, the "Type3" character size setting command is "d1". For example, the command (1000 0 0 0 750 750 d1) sets the size information, in glyph space, of the character corresponding to the character stream to (0 0 750 750). The rasterization can use known techniques. The rasterized character bitmap can be processed, for example, by computing its projection histograms in the horizontal and vertical directions; the positions of the first and last non-zero entries of the two histograms determine the horizontal and vertical positions of the character enclosing rectangle. Alternatively, the rasterized character bitmap can be processed as follows. The rows of the image are scanned from top to bottom, stopping at the first row that contains a black pixel, which gives y1; the rows are scanned from bottom to top, stopping at the first row that contains a black pixel, which gives y2; the character height is y2-y1. Fig. 5 schematically illustrates the character height obtained by processing the rasterized character bitmap in this way. The columns of the image are scanned from left to right, stopping at the first column that contains a black pixel, which gives x1; the columns are scanned from right to left, stopping at the first column that contains a black pixel, which gives x2; the character width is x2-x1.
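The row and column scans can be sketched as follows. The bitmap is assumed to be a list of rows of 0/1 values with 1 for a black pixel (a representation chosen for this sketch), and the function name `bitmap_extent` is hypothetical. The width and height follow the description literally, as x2-x1 and y2-y1.

```python
def bitmap_extent(bitmap):
    """Return (x1, y1, width, height) of the black pixels in a bitmap.

    bitmap: list of rows, each a list of 0/1 values (1 = black pixel).
    Returns None if the bitmap contains no black pixel.
    """
    # Rows that contain at least one black pixel; the first and last
    # correspond to the top-to-bottom and bottom-to-top scans.
    rows = [y for y, row in enumerate(bitmap) if any(row)]
    if not rows:
        return None
    # Columns that contain at least one black pixel; the first and last
    # correspond to the left-to-right and right-to-left scans.
    cols = [
        x for x in range(len(bitmap[0]))
        if any(row[x] for row in bitmap)
    ]
    y1, y2 = rows[0], rows[-1]
    x1, x2 = cols[0], cols[-1]
    return x1, y1, x2 - x1, y2 - y1
```

Collecting the non-empty rows and columns is equivalent to locating the first and last non-zero entries of the two projection histograms, so the same function covers both variants described above.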
The enclosing rectangle computing device 400 comprises a start position computing device 410 and a vertex coordinate computing device 420. The start position computing device 410 computes the start position coordinates of the character. The vertex coordinate computing device 420 computes the vertex coordinates of the character enclosing rectangle from the start position coordinates and the character size information of the character.
The enclosing rectangle computing device 400 computes the minimum enclosing rectangle, in PDF page space, of each character in the text segment to be displayed. The start position computing device 410 computes the start position of the current character in the text segment. The vertex coordinate computing device 420 computes the position of the minimum enclosing rectangle of the character. The computed minimum enclosing rectangle can then be stored in a character enclosing rectangle list; the next character in the text segment becomes the current character, the character start position is updated, and the process repeats until all characters in the text segment have been processed.
The minimum enclosing rectangle of a character can be computed with the matrices TRM, TM, and CTM and the expressions below. TRM specifies the affine transformation from glyph space to PDF page space, with TRM = TM × CTM. TM is the text matrix, which specifies the affine transformation from glyph space to user space. CTM is the current transformation matrix, which specifies the affine transformation from user space to PDF page space. The TM matrix can be modified by the PDF text positioning commands, which comprise (Td), (TD), (Tm), and (T*). For example, the PDF command (tx ty Td) changes TM according to expression (1), where Td is the command name and tx and ty are the command parameters; the Td command is read from the page content stream. The TM on the right-hand side of expression (1) is the TM before the change, and the TM on the left-hand side is the TM after the change; the initial value of TM is the identity matrix.
$$\mathrm{TM} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ \mathit{tx} & \mathit{ty} & 1 \end{bmatrix} \times \mathrm{TM} \qquad (1)$$
The CTM matrix can be modified by the PDF command cm. For example, in the PDF command (a b c d e f cm), cm is the command name and a, b, c, d, e, and f are the command parameters; the cm command is read from the page content stream. The CTM matrix is modified according to expression (2). The CTM on the right-hand side of expression (2) is the CTM before the change, and the CTM on the left-hand side is the CTM after the change; the initial value of CTM is the identity matrix.
$$\mathrm{CTM} = \begin{bmatrix} a & b & 0 \\ c & d & 0 \\ e & f & 1 \end{bmatrix} \times \mathrm{CTM} \qquad (2)$$
The start position computing device 410 obtains the start position of the current character of the text segment to be displayed from TRM. TRM is computed as TRM = TM × CTM and is written as expression (3), where h, i, j, k, l, and m result from that computation.
$$\mathrm{TRM} = \begin{bmatrix} h & i & 0 \\ j & k & 0 \\ l & m & 1 \end{bmatrix} \qquad (3)$$
The start position computing device 410 computes the start position of the current character of the text segment to be displayed in PDF page space as (xStart, yStart) = (l, m). The scaling ratios xScale and yScale from glyph space to PDF page space in the x and y directions are obtained from TRM by expressions (4) and (5), respectively.
$$\mathit{xScale} = \sqrt{h^2 + i^2} \qquad (4)$$
$$\mathit{yScale} = \sqrt{j^2 + k^2} \qquad (5)$$
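Expressions (1) to (5) can be sketched in code as follows. The sketch uses plain 3×3 lists in the row-vector convention of the expressions, with the translation in the bottom row; all function names are hypothetical.

```python
import math

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [
        [sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
        for r in range(3)
    ]


def apply_td(tm, tx, ty):
    """Expression (1): update TM for the command (tx ty Td)."""
    return matmul([[1, 0, 0], [0, 1, 0], [tx, ty, 1]], tm)


def apply_cm(ctm, a, b, c, d, e, f):
    """Expression (2): update CTM for the command (a b c d e f cm)."""
    return matmul([[a, b, 0], [c, d, 0], [e, f, 1]], ctm)


def text_rendering_matrix(tm, ctm):
    """TRM = TM x CTM, written element-wise as expression (3)."""
    return matmul(tm, ctm)


def start_and_scale(trm):
    """Return ((xStart, yStart), xScale, yScale) from TRM.

    The start position is (l, m); the scales follow expressions (4), (5).
    """
    (h, i, _), (j, k, _), (l, m, _) = trm
    return (l, m), math.hypot(h, i), math.hypot(j, k)
```

As a usage example, starting from identity matrices, the commands (2 0 0 2 10 20 cm) and (72 700 Td) give a TRM whose bottom row is (154, 1420, 1), so the start position is (154, 1420) and both scaling ratios are 2.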
The vertex coordinate computing device 420 computes the coordinates of the minimum enclosing rectangle, in PDF page space, of each character in the text segment to be displayed. For the case in which both the text segment rotation angle and the page rotation angle are 0, the minimum enclosing rectangle of a character in the text segment can be obtained by expressions (6) to (9).
x = xStart + CharMetric.boundingBox.lowerLeftX × fontSize × xScale; (6)
y = yStart + CharMetric.boundingBox.lowerLeftY × fontSize × yScale; (7)
width = (CharMetric.boundingBox.upperRightX - CharMetric.boundingBox.lowerLeftX) × fontSize × xScale; (8)
height = (CharMetric.boundingBox.upperRightY - CharMetric.boundingBox.lowerLeftY) × fontSize × yScale; (9)
Here, fontSize is the parameter of the set-font command Tf; the Tf command and its parameters are read from the page content stream. xScale and yScale are obtained from the TRM matrix; see expressions (4) and (5). xStart and yStart are the start position of the current character in the text segment to be displayed. CharMetric.boundingBox is the character size information extracted by the size information extracting device 300 and represents the enclosing rectangle of the character in figure space. CharMetric.boundingBox comprises CharMetric.boundingBox.lowerLeftX, CharMetric.boundingBox.lowerLeftY, CharMetric.boundingBox.upperRightX and CharMetric.boundingBox.upperRightY, which are respectively the lower-left x coordinate, lower-left y coordinate, upper-right x coordinate and upper-right y coordinate of the enclosing rectangle of the current character in figure space. The tuple (x, y, width, height) calculated by expressions (6) to (9) describes the minimum enclosing rectangle of the current character in PDF page space: x and y are the x and y coordinates of the start position of this rectangle, that is, its lower-left point, and width and height are its width and height in page space.
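Expressions (6) to (9) translate directly into code. The sketch below is a minimal Python version for the unrotated case; the `BoundingBox` class and the `char_rect` function are hypothetical names, and the example bounding box is assumed to be expressed in units that become page-space lengths once multiplied by fontSize and the scaling ratio.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BoundingBox:
    """Stand-in for CharMetric.boundingBox (figure-space character box)."""
    lower_left_x: float
    lower_left_y: float
    upper_right_x: float
    upper_right_y: float


def char_rect(x_start: float, y_start: float, bbox: BoundingBox,
              font_size: float, x_scale: float, y_scale: float
              ) -> Tuple[float, float, float, float]:
    """Return (x, y, width, height) of the character in page space."""
    x = x_start + bbox.lower_left_x * font_size * x_scale                      # (6)
    y = y_start + bbox.lower_left_y * font_size * y_scale                      # (7)
    width = (bbox.upper_right_x - bbox.lower_left_x) * font_size * x_scale     # (8)
    height = (bbox.upper_right_y - bbox.lower_left_y) * font_size * y_scale    # (9)
    return x, y, width, height


# Illustrative values only: a glyph with a descender, 10-unit font, scale 2.
rect = char_rect(25.0, 45.0, BoundingBox(0.0, -0.25, 0.5, 0.75),
                 font_size=10.0, x_scale=2.0, y_scale=2.0)
```

Because the lower-left y coordinate of the box is negative for a glyph with a descender, the resulting rectangle starts below yStart, which is the behaviour expression (7) describes.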
Assuming that the text segment to be displayed is a horizontal text segment, once the current character has been processed and the next character is to be processed, the character start position is updated by expressions (10) and (11):
xStart=xStart+charSpace+wordSpace+width (10)
yStart=yStart (11)
In expression formula (10), charSpace is current character spacing, in the command list (CLIST) of the page content flow of PDF with the Tc command set that sorts nearest when the text chunk display command of pre-treatment, wordSpace is current word spacing, in the command list (CLIST) of the page content flow of PDF, with the Tw command set that sorts nearest when the text chunk display command of pre-treatment, Tc order and Tw order all can be read from the content flow of the current page of PDF.If previous character is space, from nearest Tw order, read word spacing, otherwise wordSpace is 0.Fig. 3 exemplarily illustrates character reference position, character size information is width height and PDF character-circumscribed rectangle, and character-circumscribed rectangle is expressed as rectangle frame.
The present invention can also be embodied as a method for extracting enclosing rectangles of characters from a portable electronic document, comprising: a text segment display command extracting step, which can be performed by the aforementioned text segment display command extracting device 100 and which, for a text segment in a page of the portable electronic document, extracts the text segment display command of that text segment from the content stream of the page; a font information extracting step, which can be performed by the aforementioned font information extracting device 200 and which, for the extracted text segment display command, extracts the font information corresponding to the text segment from the resources of the page; a size information extracting step, which can be performed by the aforementioned size information extracting device 300 and which, for a character in the text segment, extracts the character size information of that character; and an enclosing rectangle calculating step, which can be performed by the aforementioned enclosing rectangle calculating device 400 and which, for a character in the text segment, calculates the enclosing rectangle of that character.
The text segment display command extracting step comprises: a content stream extracting step, which can be performed by the aforementioned content stream extracting device 110 and which extracts the content stream of the page from the tree structure of the portable electronic document; a content stream decoding step, which can be performed by the aforementioned content stream decoding device 120 and which decodes the extracted content stream according to the encoding it uses; and a command extracting step, which can be performed by the aforementioned command extracting device 130 and which extracts the text segment display command from the decoded content stream.
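The decoding and command extracting steps can be illustrated with a deliberately simplified Python sketch. It assumes the content stream uses either no filter or the FlateDecode filter (which is zlib/deflate compression) and that text is shown with the Tj operator on a literal string; other filters, TJ arrays and hexadecimal strings are not handled. The function names are hypothetical, and a regular expression stands in for a real content stream tokenizer.

```python
import re
import zlib
from typing import List, Optional

# A literal string operand followed by the Tj operator; backslash escapes
# inside the string are allowed, nested parentheses are not handled.
_TJ_PATTERN = re.compile(rb"\(((?:[^()\\]|\\.)*)\)\s*Tj")


def decode_content_stream(raw: bytes, filter_name: Optional[str]) -> bytes:
    """Decode a content stream according to the encoding it declares."""
    if filter_name is None:
        return raw
    if filter_name == "FlateDecode":
        return zlib.decompress(raw)
    raise NotImplementedError(f"filter not covered by this sketch: {filter_name}")


def extract_show_text_operands(decoded: bytes) -> List[bytes]:
    """Return the string operand of every Tj command, in stream order."""
    return [match.group(1) for match in _TJ_PATTERN.finditer(decoded)]


# Build a small compressed stream for demonstration purposes.
sample = zlib.compress(b"BT /F1 12 Tf 10 20 Td (Hello) Tj 0 -14 Td (World) Tj ET")
operands = extract_show_text_operands(decode_content_stream(sample, "FlateDecode"))
```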
The font information extracting step comprises: a font resource extracting step, which can be performed by the aforementioned font resource extracting device 210 and which extracts the font resources of the page from the tree structure of the portable electronic document; a set-font command searching step, which can be performed by the aforementioned set-font command searching device 220 and which searches the command list of the content stream for the set-font command nearest in order to the text segment display command; and an information extracting step, which can be performed by the aforementioned information extracting device 230 and which extracts the font information corresponding to the text segment display command from the font resources according to the parameters of the set-font command that was found.
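One possible reading of the search for the nearest set-font command is a backward scan through the command list, starting just before the display command. The Python sketch below assumes that "nearest" means the closest preceding Tf command, that commands are stored as (operator, operands) pairs, and that the page font resources are a plain dictionary keyed by resource name; all of these representations and names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Command = Tuple[str, Sequence[object]]


def find_set_font(commands: List[Command], show_index: int
                  ) -> Optional[Tuple[str, float]]:
    """Return (font resource name, font size) from the closest preceding Tf."""
    for idx in range(show_index - 1, -1, -1):
        operator, operands = commands[idx]
        if operator == "Tf":
            return str(operands[0]), float(operands[1])
    return None  # no font has been set before this display command


def lookup_font_info(font_resources: Dict[str, dict], commands: List[Command],
                     show_index: int) -> Optional[dict]:
    """Resolve the Tf parameter against the page font resources."""
    found = find_set_font(commands, show_index)
    if found is None:
        return None
    return font_resources.get(found[0])


# Illustrative command list and resources.
commands = [("Tf", ["F1", 12]), ("Td", [10, 20]), ("Tj", ["Hello"]),
            ("Tf", ["F2", 9]), ("Tj", ["World"])]
resources = {"F1": {"Subtype": "Type1"}, "F2": {"Subtype": "Type3"}}
first = find_set_font(commands, 2)
second_info = lookup_font_info(resources, commands, 4)
```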
The size information extracting step comprises: a font size extracting step, which can be performed by the aforementioned font size extracting device 310 and which obtains the font size information corresponding to the extracted font information; a character decoding step, which can be performed by the aforementioned character decoding device 320 and which decodes the characters in the text segment display command according to the font information of the text segment; and a character size extracting step, which can be performed by the aforementioned character size extracting device 330 and which extracts the character size information of a character from the font size information according to the name of the decoded character.
The enclosing rectangle calculating step comprises: a start position calculating step, which can be performed by the aforementioned start position calculating device 410 and which calculates the start position coordinates of the character; and a vertex coordinate calculating step, which can be performed by the aforementioned vertex coordinate calculating device 420 and which calculates the vertex coordinates of the character enclosing rectangle according to the start position coordinates and the character size information of the character.
The font size extracting step comprises: a font type extracting step, which can be performed by the aforementioned font type extracting device 340 and which extracts the font type from the font information; and a font size information extracting step, which can be performed by the aforementioned font size information extracting device 350 and which, if the font type is described by commands of the portable electronic document, obtains the font size information from the character streams of the font information and, if the font type is not described by commands of the portable electronic document, obtains the font size information corresponding to the font type from an external font size file.
The font size information extracting step comprises: a character stream decoding step, which can be performed by the aforementioned character stream decoding device 351 and which, when the font type is described by commands of the portable electronic document, obtains each character stream of the font type and decodes it according to the encoding the character stream uses; and an adding step, which can be performed by the aforementioned adding device 352. In the adding step, if the first command in the decoded character stream is a character size setting command, the character size information is obtained from that command and added to the font size information. If the first command in the decoded character stream is not a character size setting command, the commands in the character stream are executed in order to rasterize the character, the character size information is obtained from the resulting bitmap, and this character size information is added to the font size information.
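The two branches of the adding step can be sketched in Python as follows. The sketch assumes that the "character size setting command" corresponds to the `d1` operator of a PDF Type 3 glyph description, whose operands are `wx wy llx lly urx ury`; this mapping is an assumption, not something the description states. The `rasterize` callback is a hypothetical stand-in for a real renderer, and the box measured from the bitmap is in pixel indices rather than figure-space units.

```python
from typing import Callable, List, Sequence, Tuple

Command = Tuple[str, Sequence[float]]
Bitmap = List[List[int]]
Box = Tuple[float, float, float, float]


def bitmap_box(bitmap: Bitmap) -> Box:
    """Smallest (min_col, min_row, max_col, max_row) covering all set pixels."""
    rows = [r for r, row in enumerate(bitmap) if any(row)]
    cols = [c for row in bitmap for c, pixel in enumerate(row) if pixel]
    if not rows:
        return (0, 0, 0, 0)  # empty glyph: nothing was painted
    return (min(cols), min(rows), max(cols), max(rows))


def glyph_size_info(commands: List[Command],
                    rasterize: Callable[[List[Command]], Bitmap]) -> Box:
    """Take the box from a leading d1 command, else measure the rasterized glyph."""
    if commands and commands[0][0] == "d1":
        _wx, _wy, llx, lly, urx, ury = commands[0][1]
        return (llx, lly, urx, ury)
    return bitmap_box(rasterize(commands))


# Branch 1: the glyph stream declares its own box.
declared = glyph_size_info([("d1", [500, 0, 0, -100, 450, 700])], lambda cmds: [])


def fake_rasterizer(cmds: List[Command]) -> Bitmap:
    """Hypothetical renderer returning a fixed 4x4 bitmap for demonstration."""
    return [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]


# Branch 2: no size setting command, so the glyph is rasterized and measured.
measured = glyph_size_info([("re", [0, 0, 10, 10]), ("f", [])], fake_rasterizer)
```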
Although this specification uses PDF documents as an example, those skilled in the art will understand that the embodiments of the present invention can also be applied to portable electronic documents in other formats, such as PostScript (PS).
The series of operations described in this specification can be carried out by hardware, by software, or by a combination of both. When the series of operations is carried out by software, the computer program can be installed in a memory of a computer built into dedicated hardware, so that the computer executes the program. Alternatively, the computer program can be installed in a general-purpose computer capable of performing various types of processing, so that the computer executes the program.
For example, the computer program can be stored in advance on a hard disk or in a ROM (read-only memory) serving as the recording medium. Alternatively, the computer program can be stored (recorded) temporarily or permanently on a removable recording medium such as a floppy disk, a CD-ROM (compact disc read-only memory), an MO (magneto-optical) disc, a DVD (digital versatile disc), a magnetic disk or a semiconductor memory. Such a removable recording medium can be provided as packaged software.
The present invention has been described in detail above with reference to specific embodiments. It is evident, however, that those skilled in the art can modify and substitute the embodiments without departing from the spirit of the invention. In other words, the invention has been disclosed by way of illustration and should not be interpreted restrictively. The appended claims should be considered in determining the gist of the invention.

Claims (6)

1. Equipment for extracting enclosing rectangles of characters from a portable electronic document, comprising:
a text segment display command extracting device which, for a text segment in a page of the portable electronic document, extracts the text segment display command of that text segment from the content stream of the page;
a font information extracting device which, for the extracted text segment display command, extracts the font information corresponding to the text segment from the resources of the page;
a size information extracting device which, for a character in the text segment, extracts the character size information of that character; and
an enclosing rectangle calculating device which, for the character in the text segment, calculates the enclosing rectangle of that character in PDF page space rather than in figure space,
wherein the size information extracting device comprises:
a font size extracting device which obtains the font size information corresponding to the extracted font information;
a character decoding device which decodes the characters in the text segment display command according to the font information of the text segment; and
a character size extracting device which extracts the character size information of the character from the font size information according to the name of the decoded character,
wherein the font size extracting device comprises:
a font type extracting device which extracts the font type from the font information; and
a font size information extracting device which, if the font type is described by commands of the portable electronic document, obtains the font size information from the character streams of the font information and, if the font type is not described by commands of the portable electronic document, obtains the font size information corresponding to the font type from an external font size file,
wherein the font size information extracting device comprises:
a character stream decoding device which, when the font type is described by commands of the portable electronic document, obtains each character stream of the font type and decodes the character stream according to the encoding it uses; and
an adding device which, if the first command in the decoded character stream is a character size setting command, obtains the character size information from that command and adds it to the font size information and, if the first command in the decoded character stream is not a character size setting command, executes the commands in the character stream in order to rasterize the character, obtains the character size information from the resulting bitmap, and adds this character size information to the font size information,
wherein the enclosing rectangle calculating device comprises:
a start position calculating device which calculates the start position coordinates of the character; and
a vertex coordinate calculating device which calculates the vertex coordinates of the character enclosing rectangle in PDF page space rather than in figure space, according to the start position coordinates of the character, the character size information, and the scaling ratios from figure space to PDF page space in the horizontal and vertical directions and/or the text segment rotation angle and the page rotation angle.
2. The equipment for extracting enclosing rectangles of characters from a portable electronic document according to claim 1, wherein the text segment display command extracting device comprises:
a content stream extracting device which extracts the content stream of the page from the tree structure of the portable electronic document;
a content stream decoding device which decodes the extracted content stream according to the encoding it uses; and
a command extracting device which extracts the text segment display command from the decoded content stream.
3. The equipment for extracting enclosing rectangles of characters from a portable electronic document according to claim 1, wherein the font information extracting device comprises:
a font resource extracting device which extracts the font resources of the page from the tree structure of the portable electronic document;
a set-font command searching device which searches the command list of the content stream for the set-font command nearest in order to the text segment display command; and
an information extracting device which extracts the font information corresponding to the text segment display command from the font resources according to the parameters of the set-font command that was found.
4. The equipment for extracting enclosing rectangles of characters from a portable electronic document according to claim 1, wherein the portable electronic document is a PDF document.
5. The equipment for extracting enclosing rectangles of characters from a portable electronic document according to claim 1, wherein the font type described by commands of the portable electronic document is the Type3 font type of PDF documents.
6. A method for extracting enclosing rectangles of characters from a portable electronic document, comprising:
a text segment display command extracting step of, for a text segment in a page of the portable electronic document, extracting the text segment display command of that text segment from the content stream of the page;
a font information extracting step of, for the extracted text segment display command, extracting the font information corresponding to the text segment from the resources of the page;
a size information extracting step of, for a character in the text segment, extracting the character size information of that character; and
an enclosing rectangle calculating step of, for the character in the text segment, calculating the enclosing rectangle of that character in PDF page space rather than in figure space,
wherein the size information extracting step comprises:
a font size extracting step of obtaining the font size information corresponding to the extracted font information;
a character decoding step of decoding the characters in the text segment display command according to the font information of the text segment; and
a character size extracting step of extracting the character size information of the character from the font size information according to the name of the decoded character,
wherein the font size extracting step comprises:
a font type extracting step of extracting the font type from the font information; and
a font size information extracting step of, if the font type is described by commands of the portable electronic document, obtaining the font size information from the character streams of the font information and, if the font type is not described by commands of the portable electronic document, obtaining the font size information corresponding to the font type from an external font size file,
wherein the font size information extracting step comprises:
a character stream decoding step of, when the font type is described by commands of the portable electronic document, obtaining each character stream of the font type and decoding the character stream according to the encoding it uses; and
an adding step of, if the first command in the decoded character stream is a character size setting command, obtaining the character size information from that command and adding it to the font size information and, if the first command in the decoded character stream is not a character size setting command, executing the commands in the character stream in order to rasterize the character, obtaining the character size information from the resulting bitmap, and adding this character size information to the font size information,
wherein the enclosing rectangle calculating step comprises:
a start position calculating step of calculating the start position coordinates of the character; and
a vertex coordinate calculating step of calculating the vertex coordinates of the character enclosing rectangle in PDF page space rather than in figure space, according to the start position coordinates of the character, the character size information, and the scaling ratios from figure space to PDF page space in the horizontal and vertical directions and/or the text segment rotation angle and the page rotation angle.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910249848.7A CN102081736B (en) 2009-11-27 2009-11-27 Equipment and method for extracting enclosing rectangles of characters from portable electronic documents

Publications (2)

Publication Number Publication Date
CN102081736A CN102081736A (en) 2011-06-01
CN102081736B true CN102081736B (en) 2014-11-26

Family

ID=44087691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910249848.7A Expired - Fee Related CN102081736B (en) 2009-11-27 2009-11-27 Equipment and method for extracting enclosing rectangles of characters from portable electronic documents

Country Status (1)

Country Link
CN (1) CN102081736B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106033412B (en) * 2015-03-20 2019-07-26 广州金山移动科技有限公司 A kind of text conversion method and device
CN107688789B (en) * 2017-08-31 2021-05-18 平安科技(深圳)有限公司 Document chart extraction method, electronic device and computer readable storage medium
CN108897730B (en) * 2018-06-29 2022-07-29 国信优易数据股份有限公司 PDF text processing method and device
CN109670461A (en) * 2018-12-24 2019-04-23 广东亿迅科技有限公司 PDF text extraction method, device, computer equipment and storage medium
CN111027285B (en) * 2019-12-17 2023-06-16 南京上游软件有限公司 Method and system for automatically extracting order information from pdf format order
CN114546936A (en) * 2021-12-22 2022-05-27 上海电机学院 PDF form data extraction method, storage medium and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1467682A (en) * 2002-06-28 2004-01-14 富士通株式会社 Apparatus and method of analyzing layout of document, and computer product
CN1495660A (en) * 1995-09-06 2004-05-12 富士通株式会社 Header extracting device and method for extracting header from file image
CN101325642A (en) * 2007-06-15 2008-12-17 佳能株式会社 Information processing apparatus and method thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6274181A (en) * 1985-09-27 1987-04-04 Sony Corp Character recognizing device
US8175394B2 (en) * 2006-09-08 2012-05-08 Google Inc. Shape clustering in post optical character recognition processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JP昭62-74181A 1987.04.04 *
JP特开2008-312063A 2008.12.25 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20141126

Termination date: 20201127