US7460710B2 - Converting digital images containing text to token-based files for rendering - Google Patents

Info

Publication number
US7460710B2
Authority
US
United States
Prior art keywords
token
tokens
computer
vectorized
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US11/392,213
Other languages
English (en)
Other versions
US20070237401A1 (en)
Inventor
Adam Brian Coath
Frederick Ziya Ramos Akalin
Robert L. Goodwin
Joshua Shagam
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Amazon Technologies Inc
Original Assignee
Amazon Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US11/392,213 (US7460710B2)
Application filed by Amazon Technologies Inc
Priority to CN2007800155655A (CN101432761B)
Priority to PCT/US2007/064616 (WO2007121029A2)
Priority to CN2011100955146A (CN102176230B)
Priority to EP07780285.8A (EP1999688B1)
Priority to JP2009503161A (JP4987960B2)
Publication of US20070237401A1
Assigned to AMAZON TECHNOLOGIES, INC. Assignment of assignors interest (see document for details). Assignors: SHAGAM, JOSHUA; AKALIN, FREDERICK ZIYA RAMOS; COATH, ADAM BRIAN; GOODWIN, ROBERT L.
Application granted
Publication of US7460710B2
Legal status: Active
Adjusted expiration

Classifications

    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G5/00 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
    • G09G5/22 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators characterised by the display of characters or indicia using display control signals derived from coded signals representing the characters or indicia, e.g. with a character-code memory
    • G09G5/24 Generation of individual character patterns
    • G09G5/28 Generation of individual character patterns for enhancement of character form, e.g. smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G06V30/242 Division of the character sequences into groups prior to recognition; Selection of dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/26 Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262 Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Definitions

  • the present invention is directed to processing of digital images, and more particularly to processing images of content having text therein.
  • This content includes traditional media such as books, magazines, newspapers, newsletters, manuals, guides, references, articles, reports, documents, etc., that exist in print, as well as electronic media in which the aforesaid content exists in digital form or is transformed from print into digital form through the use of a scanning device.
  • the Internet has facilitated the wider publication of digital content through downloading and display of images of content. As data transmission speeds increase, more and more images of pages of content are becoming available online. A page image allows a reader to see the page of content as it would appear in print.
  • Digital images may be represented at a variety of resolutions, typically denoted by the number of pixels in the image in both the horizontal and vertical directions.
  • higher resolution images have a larger file size and require a greater amount of memory for storage.
  • the cost of storing images of content can greatly multiply when one considers the number of images it takes to capture and store large volumes of media, such as books, magazines, etc. While reducing the size and resolution of images often reduces the requirements for storing the images, low resolution images eventually reach a point where the images, in particular any text contained therein, are difficult for readers to perceive when displayed.
  • Content providers wishing to provide page images with text must ensure that the images can be rendered in sufficiently high resolution so that displayed text will be legible. Yet another challenge faced by the content providers is to provide page images that are scalable, i.e., that may be readily scaled up or down so as to be rendered, for example, on various-sized displays at relatively high resolution while ensuring the minimum quality and legibility of the text in the images.
  • What is needed is a method and system for reliably processing scanned-in page images including text so that the text in the page images, upon rendering, will be legible and in sufficiently high resolution, and further scalable, without requiring an excessive amount of memory space for storage.
  • a token refers to a graphical unit, which may or may not represent a single character or a symbol. From scanned-in page images, numerous tokens are separated. Then, tokens of similar shapes may be grouped together and their shapes are combined to create a combined token, which is morphologically representative of all of the tokens included in the group. The combined token is further converted into a vectorized token, which is a mathematical representation of the combined token and is capable of representing the shape of the combined token in clean curves.
  • a number of vectorized tokens are created in this manner, each representing a group of similarly shaped tokens.
  • the position of each of the (original, unprocessed) tokens forming a group is associated with the vectorized token that represents the group of tokens.
  • the position of each token may be defined by a page number and the X-Y coordinates of the position within each page at which the token appears, and the position is associated with a pointer to the corresponding vectorized token.
  • the vectorized token, as opposed to the original token, is displayed at this position to thereby create a page image that consists only of vectorized tokens.
  • Because vectorized tokens are mathematical representations of token shapes, they can be rendered at any resolution, including high resolution, and appear crisp and legible when displayed. Further, because multiple positions of similarly shaped tokens are merely associated with a pointer to their representative vectorized token, the storage requirement for the page images can be minimized.
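The association between token positions and shared vectorized tokens described above can be sketched as a small data structure. This is only an illustrative layout, not the file format of the patent: the class names `VectorizedToken`, `Placement`, and `TokenBasedFile`, the integer `token_id` used as the pointer, and the tuple-of-points curve representation are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class VectorizedToken:
    """Shape shared by every token in one group (hypothetical layout)."""
    token_id: int
    # Each curve is stored as four points: end point, two control points, end point.
    curves: List[Tuple[Point, Point, Point, Point]] = field(default_factory=list)


@dataclass(frozen=True)
class Placement:
    """One occurrence of a token: page number, X-Y position, and a pointer."""
    page: int
    x: float
    y: float
    token_id: int  # plays the role of the pointer to the vectorized token


@dataclass
class TokenBasedFile:
    tokens: Dict[int, VectorizedToken] = field(default_factory=dict)
    placements: List[Placement] = field(default_factory=list)

    def add_token(self, token: VectorizedToken) -> None:
        self.tokens[token.token_id] = token

    def place(self, page: int, x: float, y: float, token_id: int) -> None:
        # Refuse positions that point at a shape that was never stored.
        if token_id not in self.tokens:
            raise KeyError(f"unknown vectorized token {token_id}")
        self.placements.append(Placement(page, x, y, token_id))

    def page_items(self, page: int) -> List[Tuple[float, float, VectorizedToken]]:
        """Resolve every position on a page to the shape drawn there."""
        return [(p.x, p.y, self.tokens[p.token_id])
                for p in self.placements if p.page == page]
```

Because each placement stores only a page number, two coordinates, and an identifier, repeated occurrences of a shape add little to the file, which is the storage argument made in the text.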
  • a computer-implemented method for converting an electronic image containing text into a token-based file.
  • the method includes generally five steps. First, various tokens (i.e., graphical units) are identified in the electronic image. Second, identified tokens having similar shapes are grouped together to form a token group. Thus, multiple token groups are formed, each including one or more tokens having similar shapes. Third, in each token group, a representative token is generated (or found) that morphologically represents the shapes of tokens included in the group. For example, a representative token may be generated by combining (e.g., averaging) the shapes of tokens in the token group.
  • each representative (e.g., combined) token is converted into a vectorized token, which is a mathematical representation of the shape of the representative token.
  • multiple vectorized tokens are created, each mathematically representing the shape of a representative token, which in turn morphologically represents the shape of one or more tokens classified into one token group.
  • each of the vectorized tokens is associated with the positions of the tokens represented by the vectorized token, to thereby form a token-based file.
  • the position of each of the tokens forming a group is associated with the vectorized token that represents the group of tokens.
  • the vectorized token is displayed at this position to thereby create a page image consisting only of razor-sharp token images based on the vectorized tokens.
  • the step of separating tokens is carried out by using a connected component (or a “flood fill”) analysis.
  • the step of grouping tokens having similar shapes is carried out by calculating a center of mass for each token, aligning the tokens using the center of mass, calculating the “distance” between a pair of tokens by, for example, calculating a root-mean-square error between the two tokens, and grouping the tokens within a predefined distance with each other.
  • the step of vectorizing each representative token (e.g., a combined token) to create a vectorized token is carried out based on a raster to vector conversion method using a mathematical representation, such as Bezier splines.
  • a system for converting an electronic image into a token-based file.
  • the system includes generally two components: a page image database for storing electronic images containing text, such as page images; and a computing device in communication with the page image database.
  • the computing device is operative to process the electronic images containing text to identify tokens therein, and to classify the identified tokens into multiple token groups.
  • the computing device is further operative to create a vectorized token, for each of the token groups, which mathematically represents the shapes of the tokens included in the token group, and to generate a token-based file in which each vectorized token is associated with positions of the tokens represented by the vectorized token.
  • a computer-accessible medium having instructions encoded thereon is provided to create a token-based file.
  • the instructions, when executed by a computing apparatus, cause the computing apparatus to (1) process an image having text therein to identify tokens therein; (2) classify the identified tokens into multiple token groups according to their shapes; (3) for each of the token groups, create a vectorized token that mathematically represents the shapes of the tokens included in the token group; and (4) replace the tokens represented by a vectorized token with the vectorized token.
  • FIG. 1 is a functional block diagram of an exemplary computing system that may be used to implement an embodiment of the present invention;
  • FIG. 2 is a flow diagram of an exemplary method for converting an electronic image containing text to a token-based file according to one embodiment of the present invention;
  • FIG. 3 is a pictorial diagram schematically illustrating some of the steps of a method of converting an electronic image containing text to a token-based file shown in FIG. 2; and
  • FIGS. 4A-4C illustrate various techniques that may be used to identify and classify tokens according to their shapes in a method of converting an electronic image containing text to a token-based file in accordance with various embodiments of the present invention.
  • the present invention is directed to a method, system, and computer-accessible medium having instructions for converting an electronic (digital) image containing text (for example, a scanned image) into a token-based file suitable for high-resolution rendering without requiring an excessive amount of storage space.
  • rendering of the token-based file can be done on a variety of output media such as digital displays and print media.
  • FIG. 1 illustrates a functional block diagram of a computing system 10 that may be used to implement the present invention.
  • the computing system 10 includes a computing device 11 having a processor 12 in communication with a variety of computing elements, including a network interface 14 , an input/output interface 16 , and a memory 19 .
  • the network interface 14 enables the computing device 11 to communicate data, control signals, data requests, and other information with a computer network 15 (LAN, WAN, Internet, etc.).
  • the computing device 11 may receive a file containing page images of books, magazines, etc., from a page image database 17 connected to the computer network 15 via the network interface 14 .
  • a token-based file database 18 may be connected to the computer network 15 , to which token-based files generated by the computing device 11 are sent via the network interface 14 for storage.
  • the computer network 15 may be the Internet, a local or wide area network that connects servers storing related documents and associated files, scripts, and databases, or a broadcast communication network that includes set-top boxes or other information appliances providing access to audio or video files, documents, scripts, databases, etc.
  • the input/output interface 16 enables the computing device 11 to communicate with various local input and output devices.
  • An input device 20, in communication with the input/output interface 16, may include computing elements that provide input signals to the computing device 11, such as a scanner, a scanning pen, a digital camera, a video camera, a copier, a keyboard, a mouse, an external memory, a disk drive, etc.
  • Input devices comprising scanners and cameras, for example, may be used to provide electronic images such as page images including text to the computing device 11 , which then converts these electronic images into a token-based file in accordance with the present invention.
  • An output device 22, in communication with the input/output interface 16, may include typical output devices, such as a computer display (e.g., CRT or LCD screen), a television, printer, facsimile machine, copy machine, etc. As to the present invention, the output device 22 may be used to display token-based file images for an operator to manually confirm their accuracy and legibility.
  • the processor 12 is configured to operate in accordance with computer program instructions stored in a memory, such as the memory 19 .
  • Program instructions may also be embodied in a hardware format, such as in a programmed digital signal processor.
  • the memory 19 generally comprises RAM, ROM, and/or permanent memory.
  • the memory 19 may be configured to store digital images of text for processing, transmission, and display in accordance with the present invention.
  • the memory 19 stores an operating system 23 for controlling the general operation of the computing device 11 .
  • the operating system 23 may be a general-purpose operating system such as a Microsoft® operating system, UNIX® operating system, or Linux® operating system.
  • the memory 19 may further store an optical character recognition (OCR) application 24 comprised of program code and data for analyzing digital images containing text therein.
  • the memory 19 additionally stores a token-based file generator application 25 .
  • the token-based file generator application 25 contains program code and data for processing an electronic image containing text received via the network interface 14 , the input/output interface 16 , etc., to generate a token-based file.
  • the token-based file may then be sent to and stored in the token-based file database 18 .
  • FIG. 2 is a flow diagram of an exemplary method 30 implemented by the token-based file generator application 25 for converting one or more electronic images containing text to a token-based file according to one embodiment of the present invention.
  • the term “text” includes all forms of letters, characters, symbols, numbers, formulas, graphics, line drawings, table borders, etc., that may be used to represent information in an electronic image (e.g., a page image).
  • the method 30 starts at block 31 where the computing device 11 receives electronic images (e.g., page images) containing text. For example, page images, as previously scanned into the page image database 17 ( FIG.
  • the received images may be of relatively low resolution, such as in 300 dpi (dots per inch).
  • the format of the page images as received may vary, and can include page images in which the content of the page image is represented in a non-text accessible format, such as in a JPEG, TIFF, GIF, or BMP file, or in which the content of the page image is represented in a text-accessible format, such as in an Adobe Portable Document Format (PDF) file.
  • the page images may undergo standard OCR or OCR-like preprocessing techniques, such as contrast adjustment, deskewing, despeckling, and/or page rotation correction, prior to undergoing the token-based file generation processing method 30 .
  • a token refers to a graphical unit, which may or may not represent a single character or a symbol. Rather, a token is a unit that is identified to be sufficiently discrete purely in a graphical sense to thereby form a single unit.
  • a search for tokens in an electronic image occurs within a background region, which is typically white.
  • a token is presumed wherever a pixel color deviates sufficiently from the background color.
  • tokens may be identified using a connected component analysis (or a flood fill analysis).
  • an electronic image including a text component “every day” is analyzed based on a connected component technique to identify “e,” “v,” “e,” “r,” “y,” “d,” “a,” and “y” as separate units, i.e., as tokens. Further, each of these tokens can be bound within a bounding box, as illustrated.
  • the connected component analysis and finding of bounding boxes may be performed using a suitable OCR or OCR-like software program stored in the memory 19 ( FIG. 1 ) as is well known in the art.
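A connected component (flood fill) pass with bounding boxes can be sketched as follows. This is a minimal illustration rather than the OCR software the text refers to: the function name `find_tokens`, the 8-connectivity, the white background value of 255, and the `tolerance` used to decide when a pixel deviates sufficiently from the background are all assumptions.

```python
from collections import deque


def find_tokens(image, background=255, tolerance=64):
    """Return bounding boxes (left, top, right, bottom) of connected components.

    `image` is a list of rows of grayscale values (0 = black, 255 = white).
    A pixel counts as ink when it differs from `background` by more than
    `tolerance`; neighbouring ink pixels (8-connectivity) form one token.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []

    def is_ink(x, y):
        return abs(image[y][x] - background) > tolerance

    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or not is_ink(x0, y0):
                continue
            # Flood fill outward from this unvisited ink pixel.
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            left = right = x0
            top = bottom = y0
            while queue:
                x, y = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and not seen[ny][nx] and is_ink(nx, ny)):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes
```

Each returned box corresponds to one graphical unit, whether or not that unit happens to be a single character.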
  • pixels within an electronic image may be represented as a graph having edge weights based on the pixel intensities and edge magnitudes and directions.
  • a connection determination can be made by determining the shortest path between two sets of pixels. If sets of pixels are sufficiently connected, they may be identified as jointly forming a single token.
  • two letters may touch each other to form a single connected component, as in the case of “ra” shown in FIG. 4A .
  • the original word “raw” is separated based on a connected component analysis into two tokens “ra” 50 and “w” 51 .
  • While it is not critical for a method of the present invention to identify each letter as a separate token, it may be desirable to do so for the purpose of reducing the number of token types in order to reduce storage requirements.
  • further processing may be performed to separate out a connected component that may be a combination of two or more letters or symbols. For example, in the case of FIG.
  • the bounding box for the token “ra” 50 may be too large in its horizontal dimension for this token to be representative of a single letter or a symbol.
  • a token whose bounding box has a longer horizontal dimension than a vertical dimension may be suspected of potentially representing two or more letters or symbols. If so determined, the suspected token may be further analyzed, again using a suitable OCR or OCR-like software program (e.g., a maze algorithm), to identify the shortest path 52 from one side (e.g., the top side) to the other (e.g., the bottom side) to sever the token into two portions.
  • OCR or OCR-like software may also be useful in recognizing, in the above example, that “r” is at a given location so as to make it easier to split it from adjacent letter(s) that it may be touching (i.e., “a” in the above example).
  • the path along which to possibly sever the token into two portions can be computed by representing the pixels as a graph with edge weights based on the pixel intensities and edge magnitudes and directions. Then, the shortest path 52 can be found between two points on opposite sides of the token (e.g., between the center of the top edge and the center of the bottom edge). In the example of FIG.
  • the shortest path 52 is found to cut the token “ra” 50 into two tokens, “r” and “a.” Thereafter, the accuracy of separated “r” and “a” tokens may be confirmed by comparing the “r” and “a” tokens to other “r” and “a” tokens, respectively, which have already been unambiguously identified as tokens.
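The shortest-path cut between the top and bottom edges of a suspect token can be sketched with Dijkstra's algorithm over the pixel grid. This is a simplified illustration: the text describes edge weights based on pixel intensities and on edge magnitudes and directions, whereas the sketch below uses pixel darkness only, plus a small hypothetical `step_cost` so that shorter paths are preferred. The function name `split_path` and the 4-connected moves are also assumptions.

```python
import heapq


def split_path(token, step_cost=0.01):
    """Find a low-ink path from the top-edge center to the bottom-edge center.

    `token` is a list of rows of darkness values in [0, 1] (0 = white,
    1 = black). The cost of entering a pixel is its darkness plus
    `step_cost`, so the path prefers to run through white gaps.
    Returns the path as a list of (x, y) pixel coordinates.
    """
    height, width = len(token), len(token[0])
    start = (width // 2, 0)
    goal = (width // 2, height - 1)
    best = {start: token[0][start[0]]}
    parent = {}
    heap = [(best[start], start)]
    while heap:
        cost, (x, y) = heapq.heappop(heap)
        if (x, y) == goal:
            break
        if cost > best[(x, y)]:
            continue  # stale queue entry
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                new_cost = cost + token[ny][nx] + step_cost
                if new_cost < best.get((nx, ny), float("inf")):
                    best[(nx, ny)] = new_cost
                    parent[(nx, ny)] = (x, y)
                    heapq.heappush(heap, (new_cost, (nx, ny)))
    # Walk the parent links back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    path.reverse()
    return path
```

Pixels to the left and right of the returned path would then form the two candidate tokens, which can be checked against unambiguously identified tokens before the split is accepted.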
  • any ambiguous token (i.e., a token that is suspected of containing two or more letters or symbols) may be separated into multiple tokens only if the resulting separated portions will match some unambiguously identified tokens.
  • the token “ra” 50 in FIG. 4A may be separated into “r” and “a” tokens only if each of the resulting separated “r” and “a” tokens will match unambiguously identified tokens “r” and “a,” respectively. If each of the resulting separated tokens cannot find a close match with an unambiguously identified token, then the ambiguous token should not be separated into multiple tokens.
  • In step 43, two “e” tokens from the text “every day” are grouped together as having similar shapes into an “e” bucket, and two “y” tokens from the same text are grouped together as having similar shapes into a “y” bucket.
  • the grouping is carried out based on the morphological characteristics of the tokens.
  • regular “e” and “e” set forth in bold type may well be treated as having sufficiently different shapes to be grouped into two different buckets.
  • a “center of mass” is calculated for each token and is used to align tokens so that they can be compared with each other.
  • the “mass” of a pixel in a grayscale image is defined as its deviation from the background color (typically pure white). If the grayscale image is treated as a grid of point masses, one point mass for each pixel, the “center of mass” of the image can be considered as a representative point of the image.
  • For a color image, the “mass” and “center of mass” can still be calculated similarly, by first converting the color image to a grayscale image using any suitable conversion method. The center of mass calculated for each token image may then be used to align token images according to their respective center of mass values.
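The center-of-mass calculation described above can be sketched as follows, treating each pixel as a point mass equal to its deviation from the background. The function names, the default white background of 255, and the channel-averaging grayscale conversion are assumptions; the text allows any suitable conversion method.

```python
def center_of_mass(image, background=255):
    """Return the (x, y) center of mass of a grayscale image.

    `image` is a list of rows of grayscale values. The mass of a pixel is
    its absolute deviation from `background`.
    """
    total = sum_x = sum_y = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            mass = abs(background - value)
            total += mass
            sum_x += mass * x
            sum_y += mass * y
    if total == 0:
        raise ValueError("image contains no pixels that differ from the background")
    return (sum_x / total, sum_y / total)


def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to grayscale by averaging channels."""
    return [[(r + g + b) / 3 for (r, g, b) in row] for row in rgb_image]
```

Two token images can then be aligned by translating one so that its center of mass coincides with that of the other.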
  • the aligned tokens can be compared to determine if the tokens are sufficiently similar.
  • Each pixel in each (grayscale) image may be normalized so that 0.0 represents white and 1.0 represents black.
  • a “distance” between the images is calculated to ascertain the similarity in shape between the token images.
  • Various methods are possible to calculate such a distance. In one embodiment, one can calculate a distance in terms of a Root-Mean-Square (RMS) error.
  • the two token images may be considered the same or sufficiently similar in shape to each other so as to belong to the same token group if the RMS error value is no more than a predefined threshold value, such as 0.10.
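The RMS-error comparison can be sketched as below. The sketch assumes the two token images have already been aligned on their centers of mass and padded to the same dimensions, which the code does not do itself; the function names are hypothetical, and the default threshold of 0.10 is the example value given in the text.

```python
import math


def normalize(image, white=255):
    """Map grayscale values so that 0.0 represents white and 1.0 black."""
    return [[(white - value) / white for value in row] for row in image]


def rms_error(a, b):
    """Root-mean-square error between two aligned, equally sized images."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    if count == 0:
        raise ValueError("images are empty")
    return math.sqrt(total / count)


def same_group(a, b, threshold=0.10):
    """Treat two tokens as similar when their RMS error is within the threshold."""
    return rms_error(a, b) <= threshold
```

For normalized images, an RMS error of 0.0 means the images are identical and 1.0 means every pixel is fully inverted.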
  • FIG. 4B illustrates a token “e” in a bounding box 53 , in which a center of mass is found at point “x.”
  • a bounding box may be divided using horizontal and vertical lines through the center of mass point 53 ′ into multiple sections, such as four sections 54 , 55 , 56 , and 57 as illustrated. Center of mass values may be found for the four sections, respectively, at four points “x” as indicated.
  • the four center of mass values may be represented as (x, y) coordinate values relative to the center of mass point 53 ′ used as the origin.
  • the four center of mass values may be respectively compared with the corresponding center of mass values of another token from a token group (e.g., by taking the average squared difference between the two sets of the four center of mass values), to roughly determine which token group the token at issue might belong in.
  • a true match may be confirmed using a more comprehensive comparison test, such as the RMS error based method described above.
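The four-section comparison can be sketched as a coarse signature of four center-of-mass points. This is one possible reading of the description: the function names are hypothetical, pixels lying exactly on a dividing line are assigned to the right or lower section, and the "average squared difference" is taken here as the mean squared Euclidean distance between corresponding points.

```python
def _center(points):
    """Center of mass of (x, y, mass) triples; (0.0, 0.0) if there is no mass."""
    total = sum(m for _, _, m in points)
    if total == 0:
        return (0.0, 0.0)
    return (sum(x * m for x, _, m in points) / total,
            sum(y * m for _, y, m in points) / total)


def quadrant_signature(darkness):
    """Return four section centers of mass relative to the overall center.

    `darkness` is a list of rows of values in [0, 1] (0 = white, 1 = black).
    The order is upper-left, upper-right, lower-left, lower-right.
    """
    points = [(x, y, m)
              for y, row in enumerate(darkness)
              for x, m in enumerate(row) if m > 0]
    cx, cy = _center(points)
    quadrants = [[], [], [], []]
    for x, y, m in points:
        index = (1 if x >= cx else 0) + (2 if y >= cy else 0)
        # Store coordinates with the overall center of mass as the origin.
        quadrants[index].append((x - cx, y - cy, m))
    return [_center(q) for q in quadrants]


def signature_distance(sig_a, sig_b):
    """Average squared difference between two four-point signatures."""
    diffs = [(ax - bx) ** 2 + (ay - by) ** 2
             for (ax, ay), (bx, by) in zip(sig_a, sig_b)]
    return sum(diffs) / len(diffs)
```

Since a signature holds only eight numbers, it is cheap to compare against many token groups before running the full pixel-by-pixel test on the few closest candidates.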
  • various other methods may be used to preliminarily classify a token into a candidate token group in which the token may belong. For example, OCR or OCR-like processing may be performed to obtain letter information such as the actual character detected and various formatting details such as a font, an approximate font size, whether the letter is bold, italic, or underlined, etc. If two tokens are detected to have the same OCR character and about the same size, it may be preliminarily determined that the two tokens are similar in shape to each other. As before, however, even if a match is found according to this method, a true match may still be confirmed using a more comprehensive comparison method, such as the RMS error based method.
  • a sum of blackness analysis may be used to compare the shapes of various tokens.
  • Another example is a cross-entropy method. Given two tokens A and B, the cross-entropy of B with respect to A can be calculated by compressing the token image for B using the information in the token image for A as a guide. Then, the number of bits in the final compressed file for the token image B is taken.
  • the cross-entropy of A with respect to B can be calculated by compressing the token image for A using the information in the token image for B and by taking the number of bits in the final compressed file for the token image A. Then, the maximum between the cross-entropy of A with respect to B and the cross-entropy of B with respect to A is taken, and used as a measure of “distance” (i.e., closeness in shape) between the two token images.
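The compression-based distance can be approximated with a general-purpose compressor. The sketch below is a stand-in, not the compressor of the patent: it uses Python's `zlib` with a preset dictionary (`zdict`) as a rough way of compressing one token image using the other as a guide, and it assumes each token image has been serialized to a non-empty byte string. The function names are hypothetical.

```python
import zlib


def conditional_bits(target: bytes, guide: bytes) -> int:
    """Number of bits needed to compress `target` with `guide` as a preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=guide)
    compressed = compressor.compress(target) + compressor.flush()
    return 8 * len(compressed)


def cross_entropy_distance(a: bytes, b: bytes) -> int:
    """Take the maximum of the two directed compressed sizes as the distance."""
    return max(conditional_bits(a, b), conditional_bits(b, a))
```

The smaller the result, the more one image helps to predict the other. Absolute values depend on the compressor, so only comparisons between distances computed the same way are meaningful.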
  • FIG. 4C schematically illustrates one technique suitable for use in classifying various shaped tokens into many token groups.
  • a search tree or a classification tree may be built so as to speed up the classification process.
  • a search tree or a classification tree comprises a branching structure in which each state (node) may give rise to a new set of states (child nodes) and each of these may in turn give rise to successor states of its own (grandchild nodes), and so on.
  • a computer routine may rapidly classify new tokens into various token groups (forming leaf nodes).
  • token groups for “e,” “a,” and “b” have been formed. Further, the difference between the token groups for “e” and “a” (for example in terms of the RMS error value between the “e” token image and “a” token image) has been found to be 0.3, and the difference between the token groups for “e” and “b” has been found to be 0.4. In this example, the “e” token group is used as a reference point.
  • the RMS error value between the next token to be classified “?” and the “e” token group is calculated as “Δx.” If Δx is less than 0.3, then “?” can be classified into a new token group that has not yet been created, because no existing token group differs from the “e” token group by less than 0.3. Likewise, if Δx is more than 0.4, then “?” can be classified into a new token group that has not yet been created, because no existing token group differs from the “e” token group by more than 0.4.
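The pruning rule in this example can be sketched as a lookup against the stored distances from the reference group. The sketch follows the example literally and rests on several assumptions: the function name `classify_by_reference` is hypothetical, the token is presumed to have already failed to match the reference group itself, and ordering the remaining candidates by how close their stored distance is to the measured one is an added heuristic rather than something stated in the text.

```python
def classify_by_reference(delta_x, distances_from_reference):
    """Prune candidate groups using distances measured from one reference group.

    `distances_from_reference` maps a group name to its stored distance from
    the reference group, e.g. {"a": 0.3, "b": 0.4} with "e" as the reference.
    Returns None when the token falls outside the known range, meaning a new
    group is needed; otherwise returns candidate names for a full comparison.
    """
    nearest = min(distances_from_reference.values())
    farthest = max(distances_from_reference.values())
    if delta_x < nearest or delta_x > farthest:
        return None  # no existing group lies at a comparable distance
    return sorted(distances_from_reference,
                  key=lambda name: abs(distances_from_reference[name] - delta_x))
```

With the values from the example, a measured distance of 0.2 or 0.5 returns `None`, while 0.32 returns the “a” group ahead of the “b” group for confirmation by a more comprehensive comparison.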
  • the tokens in each group may undergo any suitable image processing or preprocessing.
  • various digital image processing filters may be applied to the tokens classified in each group to, for example, smooth out the outlines of tokens, remove obvious artifacts, etc.
  • filters in this regard are known in the art, and may be part of a commercially available OCR or OCR-like software program stored in the memory 19 ( FIG. 1 ).
  • a representative token that morphologically represents all the tokens classified in the token group is found. For example, all the tokens in the token group may be combined to obtain a combined token.
  • Various methods for combining tokens or, more specifically, token shapes, are possible, such as averaging, taking a median, etc., as will be apparent to one skilled in the art.
  • a representative token is found as an averaged token ( FIG. 3 ).
  • Averaging can be performed by aligning the center of mass points of all token images and taking the average of every pixel location (e.g., by taking each coinciding pixel and calculating the average color (grayscale) value for each pixel, by summing all the color (grayscale) values and dividing the sum by the number of token images). Additionally, interpolation may be performed to obtain sub-pixel level average values.
  • When an averaged token is created by averaging all of the token images in a token group, various imperfections or artifacts that may have been present in the original token images will be averaged away (or minimized) to produce generally smoothed edges, albeit at the potential price of increased blurriness.
  • not all the tokens included in a token group need to be combined (e.g., averaged) to produce a combined token. For example, when there is a large number of tokens in a token group, such as over 1000 tokens, then it may not be necessary to average all the tokens because the quality of the averaged token image does not increase appreciably after a few hundred tokens. In such a case, only 100 or so “closest” token images may be taken and averaged to produce an averaged token.
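Pixel-wise averaging of a token group can be sketched as follows. The sketch assumes the token images have already been aligned on their centers of mass and padded to a common size, and it omits the sub-pixel interpolation mentioned above; the function name `average_token` is hypothetical.

```python
def average_token(images):
    """Average aligned grayscale token images pixel by pixel.

    `images` is a non-empty list of images, each a list of rows of
    grayscale values, all with identical dimensions.
    """
    if not images:
        raise ValueError("at least one token image is required")
    height = len(images[0])
    width = len(images[0][0])
    for image in images:
        if len(image) != height or any(len(row) != width for row in image):
            raise ValueError("all token images must have the same dimensions")
    count = len(images)
    # Sum the values at each pixel location and divide by the number of images.
    return [[sum(image[y][x] for image in images) / count
             for x in range(width)]
            for y in range(height)]
```

To average only a subset of a large group, the caller would pass the chosen subset (for example, about 100 images) instead of the whole group.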
  • a representative (e.g., combined or averaged) token which morphologically represents the shapes of all the tokens in a token group but with some blurriness, is converted into a vectorized token, which is a mathematical representation of the representative token.
  • vectorize refers to the process of finding an outline that best represents the shape of a representative token and representing the outline in mathematical formulae (together with suitable fill instructions to fill any enclosed portions). Any suitable raster-to-vector conversion software for converting bitmaps into vector graphics may be used for vectorizing representative tokens, according to the present invention.
  • additional preprocessing techniques such as contrast adjustment, deskewing, despeckling, and/or page rotation correction, may be utilized prior to vectorization at block 36 .
  • an outline for a representative token is found based on the analysis of token regions. Specifically, each representative token is divided into two or more regions. For example, a letter “e” has three regions: a background; a solid portion representing “e”; and the semicircle-shape hole in the upper portion of “e”. An outline can be found as a collection of boundaries between any adjacent regions. For example, an outline of “e” can be found as a boundary between the background and the solid portion “e” in combination with another boundary between the solid portion and the semicircle-shape hole. Similarly, a letter “i” has three regions: a background and two solid portions; and its outline can be found as a boundary between the first (top) solid portion and the background in combination with another boundary between the second (bottom) solid portion and the background.
  • a vectorized token “e” is represented by nine end points 1-9 and mathematical formulae representing nine curves between adjacent pairs of the end points: 1-2, 2-3, 3-4, 4-5, 5-6, 6-7, 7-1, 8-9, and 9-8.
  • Each adjacent pair of end points also has two other control points that are used to control the appearance (or “curviness”) of the Bezier curve.
  • the number of Bezier curves used to define each curve of a representative token may vary depending on how frequently the representative token (or, more specifically, the tokens represented by the representative token) appears in a document. For example, some tokens will occur thousands of times in a document while others may occur only a few times. By allowing more Bezier curves to be used to define the frequently occurring tokens, one can improve the image quality for the vast majority of tokens in the document, while still achieving excellent compression of the infrequently occurring tokens.
  • Because a vectorized token is a mathematical representation of a shape, it can be rendered at any resolution, including relatively high resolutions such as 2400 dpi or even 19200 dpi.
  • a vectorized token is significantly compressed in terms of its memory space, as compared to any of the original tokens that it represents. For example, in various exemplary embodiments of the present invention, it may take as few as 180 bytes to represent a single vectorized token.
  • vectorized tokens may be defined, each representing a group of tokens having similar shapes.
  • page images from a 200-page book may be processed to create over 2,000 vectorized tokens to each represent a group of similarly shaped tokens.
  • all of the tokens that were initially identified in the book are now represented by one of the 2,000-plus vectorized tokens.
  • a method of the present invention defines vectorized tokens without recognizing them as specific characters or as belonging to a certain font type. Rather, the method defines vectorized tokens purely as images, based on an analysis of the morphological features of all tokens found in the original document, such as a scanned-in book. This image-based approach to processing a scanned-in document is one of the keys to creating a token-based file that can be rendered in high resolution while maintaining the same look and feel as the original printed document.
  • a token-based file is created based on the vectorized tokens previously defined in block 36 .
  • each vectorized token is assigned a token number, and the position of each of the tokens forming a token group is associated with the vectorized token (or, more specifically, its token number) that represents the group of tokens.
  • the position of each token may be defined by a page number and the X-Y coordinates of the position within each page at which the token appears, and the position is associated with a pointer to the corresponding vectorized token.
  • the vectorized token, as opposed to the original token, is displayed at this position, thereby creating a page image consisting only of vectorized tokens.
  • Because vectorized tokens are mathematical representations of token shapes, they can be rendered at any resolution, including high resolution, and appear crisp and legible when displayed. Further, because the multiple positions of similarly shaped tokens are merely associated with a pointer to their representative vectorized token (which has a small memory size), there is no need to store the original tokens for these positions, and the storage requirement for the page images can therefore be minimized. For example, on average, a book can be converted into a token-based file having a memory size of approximately 2 MB. Still further, due to the small memory size of each vectorized token (e.g., 180 bytes), very fast rendering of a token-based file is possible. Finally, the token-based file may be rendered on any number of print media.
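The idea of averaging only the "closest" tokens in a large group can be sketched as follows. This is a minimal illustration, not the patented method: the text does not define "closest", so the sketch assumes it means the smallest squared pixel distance to the mean of the whole group. The bitmap representation (rows of intensities in [0, 1]) and the names `mean_bitmap`, `distance`, and `average_closest` are hypothetical.

```python
from typing import List

# Hypothetical representation: a token bitmap is a list of rows of
# pixel intensities in [0, 1]; all bitmaps in a group share one size.
Bitmap = List[List[float]]


def mean_bitmap(tokens: List[Bitmap]) -> Bitmap:
    """Pixel-wise average of equally sized bitmaps."""
    n = len(tokens)
    rows, cols = len(tokens[0]), len(tokens[0][0])
    return [
        [sum(t[r][c] for t in tokens) / n for c in range(cols)]
        for r in range(rows)
    ]


def distance(a: Bitmap, b: Bitmap) -> float:
    """Sum of squared pixel differences (an assumed similarity measure)."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def average_closest(tokens: List[Bitmap], k: int = 100) -> Bitmap:
    """Average only the k tokens closest to the mean of the whole group."""
    if len(tokens) <= k:
        return mean_bitmap(tokens)
    center = mean_bitmap(tokens)
    closest = sorted(tokens, key=lambda t: distance(t, center))[:k]
    return mean_bitmap(closest)
```

With a group of three one-row bitmaps where two agree and one is an outlier, `average_closest(tokens, k=2)` averages only the two agreeing tokens, so the outlier does not blur the result.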
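A curve defined by two end points and two control points is a cubic Bezier curve. The following sketch shows how such a curve might be evaluated and sampled at an arbitrary output scale, which is what lets a vectorized token be rendered at any resolution. The function names and the `scale` and `steps` parameters are assumptions for illustration and do not come from the patent.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cubic_bezier(p0: Point, c1: Point, c2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1].

    p0 and p3 are the end points; c1 and c2 are the two control points
    that shape the curve between them.
    """
    u = 1.0 - t
    # Bernstein basis weights for a cubic curve.
    b0, b1, b2, b3 = u * u * u, 3 * u * u * t, 3 * u * t * t, t * t * t
    return (
        b0 * p0[0] + b1 * c1[0] + b2 * c2[0] + b3 * p3[0],
        b0 * p0[1] + b1 * c1[1] + b2 * c2[1] + b3 * p3[1],
    )


def sample_curve(p0: Point, c1: Point, c2: Point, p3: Point,
                 scale: float, steps: int = 16) -> List[Point]:
    """Sample steps + 1 points along the curve, scaled to the output size."""
    points = []
    for i in range(steps + 1):
        x, y = cubic_bezier(p0, c1, c2, p3, i / steps)
        points.append((x * scale, y * scale))
    return points
```

Because the curve is stored as four points rather than pixels, raising `scale` (for a higher-dpi device) only changes the sampled coordinates, not the stored token.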
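The token-based file described above pairs a small table of numbered vectorized tokens with a list of positions (page number and X-Y coordinates), each pointing at a token number. A minimal sketch of one possible in-memory layout follows. The class names, field names, and the use of opaque bytes for a serialized outline are all hypothetical, and the patent does not specify an on-disk format.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Placement:
    """One occurrence of a token: where it sits and which outline it uses."""
    page: int
    x: float
    y: float
    token_number: int  # pointer to the representative vectorized token


@dataclass
class TokenFile:
    tokens: Dict[int, bytes]       # token number -> serialized outline (assumed)
    placements: List[Placement]


def build_token_file(
    groups: Sequence[Tuple[bytes, Sequence[Tuple[int, float, float]]]]
) -> TokenFile:
    """Assign a token number per group and record every member's position.

    Each group is (vectorized outline, [(page, x, y), ...]).
    """
    tokens: Dict[int, bytes] = {}
    placements: List[Placement] = []
    for number, (outline, positions) in enumerate(groups):
        tokens[number] = outline
        for page, x, y in positions:
            placements.append(Placement(page, x, y, number))
    return TokenFile(tokens, placements)


def page_contents(tf: TokenFile, page: int) -> List[Tuple[float, float, bytes]]:
    """Resolve each placement on a page to its vectorized outline."""
    return [
        (p.x, p.y, tf.tokens[p.token_number])
        for p in tf.placements
        if p.page == page
    ]
```

Each outline is stored once regardless of how many times the shape occurs, so the file grows mainly with the number of placements, which are only a few numbers each.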

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Hardware Design (AREA)
  • Processing Or Creating Images (AREA)
  • Character Input (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)
  • Image Generation (AREA)
US11/392,213 2006-03-29 2006-03-29 Converting digital images containing text to token-based files for rendering Active 2027-04-17 US7460710B2 (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
US11/392,213 US7460710B2 (en) 2006-03-29 2006-03-29 Converting digital images containing text to token-based files for rendering
PCT/US2007/064616 WO2007121029A2 (en) 2006-03-29 2007-03-22 Converting digital images containing text to token-based files for rendering
CN2011100955146A CN102176230B (zh) 2006-03-29 2007-03-22 Converting digital images containing text to token-based files for rendering
EP07780285.8A EP1999688B1 (en) 2006-03-29 2007-03-22 Converting digital images containing text to token-based files for rendering
CN2007800155655A CN101432761B (zh) 2006-03-29 2007-03-22 Method for converting digital images containing text to token-based files for rendering
JP2009503161A JP4987960B2 (ja) 2006-03-29 2007-03-22 Converting digital images containing character strings to token-based files for rendering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/392,213 US7460710B2 (en) 2006-03-29 2006-03-29 Converting digital images containing text to token-based files for rendering

Publications (2)

Publication Number Publication Date
US20070237401A1 US20070237401A1 (en) 2007-10-11
US7460710B2 true US7460710B2 (en) 2008-12-02

Family

ID=38575327

Family Applications (1)

Application Number Title Priority Date Filing Date
US11/392,213 Active 2027-04-17 US7460710B2 (en) 2006-03-29 2006-03-29 Converting digital images containing text to token-based files for rendering

Country Status (5)

Country Link
US (1) US7460710B2 (en)
EP (1) EP1999688B1 (en)
JP (1) JP4987960B2 (ja)
CN (2) CN101432761B (zh)
WO (1) WO2007121029A2 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070234203A1 (en) * 2006-03-29 2007-10-04 Joshua Shagam Generating image-based reflowable files for rendering on various sized displays
US20080267535A1 (en) * 2006-03-28 2008-10-30 Goodwin Robert L Efficient processing of non-reflow content in a digital image
US8023738B1 (en) 2006-03-28 2011-09-20 Amazon Technologies, Inc. Generating reflow files from digital images for rendering on various sized displays
US8413048B1 (en) 2006-03-28 2013-04-02 Amazon Technologies, Inc. Processing digital images including headers and footers into reflow content
US8499236B1 (en) 2010-01-21 2013-07-30 Amazon Technologies, Inc. Systems and methods for presenting reflowable content on a display
US8572480B1 (en) 2008-05-30 2013-10-29 Amazon Technologies, Inc. Editing the sequential flow of a page
US8782516B1 (en) 2007-12-21 2014-07-15 Amazon Technologies, Inc. Content style detection
US9208133B2 (en) 2006-09-29 2015-12-08 Amazon Technologies, Inc. Optimizing typographical content for transmission and display
US9229911B1 (en) 2008-09-30 2016-01-05 Amazon Technologies, Inc. Detecting continuation of flow of a page
WO2021176278A2 (en) 2020-02-05 2021-09-10 Amazon Technologies, Inc. Dynamic layout adjustment for reflowable content

Families Citing this family (25)

Publication number Priority date Publication date Assignee Title
US20060050961A1 (en) * 2004-08-13 2006-03-09 Mohanaraj Thiyagarajah Method and system for locating and verifying a finder pattern in a two-dimensional machine-readable symbol
US7596270B2 (en) * 2005-09-23 2009-09-29 Dynacomware Taiwan Inc. Method of shuffling text in an Asian document image
US8144978B2 (en) * 2007-08-01 2012-03-27 Tandent Vision Science, Inc. System and method for identifying complex tokens in an image
EP2201472A1 (en) * 2007-10-09 2010-06-30 Skiff, Llc Methods, apparatus, and systems for providing local and online data services
US8086040B2 (en) * 2007-12-05 2011-12-27 Xerox Corporation Text representation method and apparatus
JP5121599B2 (ja) * 2008-06-30 2013-01-16 キヤノン株式会社 Image processing apparatus, image processing method, program therefor, and storage medium
US8255820B2 (en) 2009-06-09 2012-08-28 Skiff, Llc Electronic paper display device event tracking
US8195626B1 (en) * 2009-06-18 2012-06-05 Amazon Technologies, Inc. Compressing token-based files for transfer and reconstruction
US8396301B2 (en) 2009-09-24 2013-03-12 Gtech Corporation System and method for document location and recognition
FR2950713A1 (fr) * 2009-09-29 2011-04-01 Movea Sa System and method for recognizing gestures
US20110173532A1 (en) * 2010-01-13 2011-07-14 George Forman Generating a layout of text line images in a reflow area
US8463041B2 (en) * 2010-01-26 2013-06-11 Hewlett-Packard Development Company, L.P. Word-based document image compression
WO2011137409A1 (en) 2010-04-30 2011-11-03 Vucomp, Inc. Malignant mass detection and classification in radiographic images
US8675933B2 (en) 2010-04-30 2014-03-18 Vucomp, Inc. Breast segmentation in radiographic images
CN101853246B (zh) * 2010-06-14 2012-05-23 深圳市万兴软件有限公司 Method and device for converting document formats
US9256799B2 (en) * 2010-07-07 2016-02-09 Vucomp, Inc. Marking system for computer-aided detection of breast abnormalities
US9349202B1 (en) * 2012-10-01 2016-05-24 Amazon Technologies, Inc. Digital conversion of imaged content
US9501499B2 (en) * 2013-10-21 2016-11-22 Google Inc. Methods and systems for creating image-based content based on text-based content
JP6000992B2 (ja) * 2014-01-24 2016-10-05 京セラドキュメントソリューションズ株式会社 Document file generation device and document file generation method
US9852337B1 (en) 2015-09-30 2017-12-26 Open Text Corporation Method and system for assessing similarity of documents
US9684842B2 (en) * 2015-10-29 2017-06-20 The Nielsen Company (Us), Llc Methods and apparatus to extract text from imaged documents
US9990521B2 (en) * 2016-09-06 2018-06-05 Amazon Technologies, Inc. Bundled unit identification and tracking
US10296788B1 (en) 2016-12-19 2019-05-21 Matrox Electronic Systems Ltd. Method and system for processing candidate strings detected in an image to identify a match of a model string in the image
WO2018125926A1 (en) * 2016-12-27 2018-07-05 Datalogic Usa, Inc Robust string text detection for industrial optical character recognition
CN112053410A (zh) * 2020-08-24 2020-12-08 海南太美航空股份有限公司 Image processing method, system, and electronic device based on vector graphics drawing

Citations (5)

Publication number Priority date Publication date Assignee Title
US5523946A (en) * 1992-02-11 1996-06-04 Xerox Corporation Compact encoding of multi-lingual translation dictionaries
US6064767A (en) * 1998-01-16 2000-05-16 Regents Of The University Of California Automatic language identification by stroke geometry analysis
US6562077B2 (en) * 1997-11-14 2003-05-13 Xerox Corporation Sorting image segments into clusters based on a distance measurement
US6621941B1 (en) * 1998-12-18 2003-09-16 Xerox Corporation System of indexing a two dimensional pattern in a document drawing
US20040146199A1 (en) 2003-01-29 2004-07-29 Kathrin Berkner Reformatting documents using document analysis information

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
JPH0296885A (ja) * 1988-10-03 1990-04-09 Ricoh Co Ltd Image processing device
JPH06180771A (ja) * 1992-12-11 1994-06-28 Matsushita Electric Ind Co Ltd English character recognition device
US5956419A (en) 1995-04-28 1999-09-21 Xerox Corporation Unsupervised training of character templates using unsegmented samples
JPH1091724A (ja) * 1996-09-10 1998-04-10 Riibuson:Kk Pattern recognition device
JP2000113112A (ja) * 1998-09-30 2000-04-21 Oki Electric Ind Co Ltd Character recognition circuit and English word recognition method
JP4085183B2 (ja) * 2002-05-31 2008-05-14 株式会社 エヌティーアイ Font generation system using a genetic algorithm
CN1416041A (zh) * 2002-11-07 2003-05-07 白世宾 Graphic symbol information processing and input method
US7486294B2 (en) * 2003-03-27 2009-02-03 Microsoft Corporation Vector graphics element-based model, application programming interface, and markup language
JP4574235B2 (ja) * 2004-06-04 2010-11-04 キヤノン株式会社 Image processing apparatus, control method therefor, and program

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
US5523946A (en) * 1992-02-11 1996-06-04 Xerox Corporation Compact encoding of multi-lingual translation dictionaries
US5787386A (en) * 1992-02-11 1998-07-28 Xerox Corporation Compact encoding of multi-lingual translation dictionaries
US6562077B2 (en) * 1997-11-14 2003-05-13 Xerox Corporation Sorting image segments into clusters based on a distance measurement
US6064767A (en) * 1998-01-16 2000-05-16 Regents Of The University Of California Automatic language identification by stroke geometry analysis
US6621941B1 (en) * 1998-12-18 2003-09-16 Xerox Corporation System of indexing a two dimensional pattern in a document drawing
US20040146199A1 (en) 2003-01-29 2004-07-29 Kathrin Berkner Reformatting documents using document analysis information
US7272258B2 (en) * 2003-01-29 2007-09-18 Ricoh Co., Ltd. Reformatting documents using document analysis information

Cited By (13)

Publication number Priority date Publication date Assignee Title
US20080267535A1 (en) * 2006-03-28 2008-10-30 Goodwin Robert L Efficient processing of non-reflow content in a digital image
US7961987B2 (en) 2006-03-28 2011-06-14 Amazon Technologies, Inc. Efficient processing of non-reflow content in a digital image
US8023738B1 (en) 2006-03-28 2011-09-20 Amazon Technologies, Inc. Generating reflow files from digital images for rendering on various sized displays
US8413048B1 (en) 2006-03-28 2013-04-02 Amazon Technologies, Inc. Processing digital images including headers and footers into reflow content
US20070234203A1 (en) * 2006-03-29 2007-10-04 Joshua Shagam Generating image-based reflowable files for rendering on various sized displays
US7966557B2 (en) * 2006-03-29 2011-06-21 Amazon Technologies, Inc. Generating image-based reflowable files for rendering on various sized displays
US8566707B1 (en) 2006-03-29 2013-10-22 Amazon Technologies, Inc. Generating image-based reflowable files for rendering on various sized displays
US9208133B2 (en) 2006-09-29 2015-12-08 Amazon Technologies, Inc. Optimizing typographical content for transmission and display
US8782516B1 (en) 2007-12-21 2014-07-15 Amazon Technologies, Inc. Content style detection
US8572480B1 (en) 2008-05-30 2013-10-29 Amazon Technologies, Inc. Editing the sequential flow of a page
US9229911B1 (en) 2008-09-30 2016-01-05 Amazon Technologies, Inc. Detecting continuation of flow of a page
US8499236B1 (en) 2010-01-21 2013-07-30 Amazon Technologies, Inc. Systems and methods for presenting reflowable content on a display
WO2021176278A2 (en) 2020-02-05 2021-09-10 Amazon Technologies, Inc. Dynamic layout adjustment for reflowable content

Also Published As

Publication number Publication date
JP4987960B2 (ja) 2012-08-01
JP2009531788A (ja) 2009-09-03
CN101432761B (zh) 2011-11-09
EP1999688A2 (en) 2008-12-10
WO2007121029A2 (en) 2007-10-25
CN102176230A (zh) 2011-09-07
EP1999688B1 (en) 2013-10-16
CN102176230B (zh) 2013-01-16
US20070237401A1 (en) 2007-10-11
EP1999688A4 (en) 2011-07-13
CN101432761A (zh) 2009-05-13
WO2007121029A3 (en) 2008-10-16

Similar Documents

Publication Publication Date Title
US7460710B2 (en) Converting digital images containing text to token-based files for rendering
JP3345350B2 (ja) Document image recognition device, method thereof, and recording medium
US6640010B2 (en) Word-to-word selection on images
US8634644B2 (en) System and method for identifying pictures in documents
AU2006252025B2 (en) Recognition of parameterised shapes from document images
KR100390264B1 (ko) System and method for automatic page registration and automatic area detection during form processing
US5335290A (en) Segmentation of text, picture and lines of a document image
US5664027A (en) Methods and apparatus for inferring orientation of lines of text
KR20190123790A (ko) Data extraction from electronic documents
US8965125B2 (en) Image processing device, method and storage medium for storing and displaying an electronic document
US20150169951A1 (en) Comparing documents using a trusted source
US8412705B2 (en) Image processing apparatus, image processing method, and computer-readable storage medium
JP2004318879A (ja) Automated techniques for comparing image content
US10423851B2 (en) Method, apparatus, and computer-readable medium for processing an image with horizontal and vertical text
JPH01253077A (ja) Character string detection method
Borovikov A survey of modern optical character recognition techniques
US20030012438A1 (en) Multiple size reductions for image segmentation
US8195626B1 (en) Compressing token-based files for transfer and reconstruction
US9323726B1 (en) Optimizing a glyph-based file
US5923782A (en) System for detecting and identifying substantially linear horizontal and vertical lines of engineering drawings
EP2545498A2 (en) Resolution adjustment of an image that includes text undergoing an ocr process
US7873228B2 (en) System and method for creating synthetic ligatures as quality prototypes for sparse multi-character clusters
US20020164087A1 (en) System and method for fast rotation of binary images using block matching method
Konya et al. Adaptive methods for robust document image understanding
Boiangiu et al. Efficient solutions for ocr text remote correction in content conversion systems

Legal Events

Date Code Title Description
AS Assignment

Owner name: AMAZON TECHNOLOGIES, INC., NEVADA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:COATH, ADAM BRIAN;AKALIN, FREDERICK ZIYA RAMOS;GOODWIN, ROBERT L.;AND OTHERS;REEL/FRAME:021662/0896;SIGNING DATES FROM 20060502 TO 20060509

STCF Information on status: patent grant

Free format text: PATENTED CASE

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

CC Certificate of correction
FPAY Fee payment

Year of fee payment: 4

FPAY Fee payment

Year of fee payment: 8

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 12