US9442898B2 - Electronic document that inhibits automatic text extraction - Google Patents

Electronic document that inhibits automatic text extraction

Info

Publication number
US9442898B2
Authority
US
United States
Prior art keywords
glyph
document
character
modified
modifying
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US13/550,696
Other versions
US20140022260A1 (en)
Inventor
Tracy ATTEBERRY
Harry Shaun LIPPY
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oracle International Corp
Original Assignee
Oracle International Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oracle International Corp
Priority to US13/550,696
Assigned to ORACLE INTERNATIONAL CORPORATION (assignment of assignors' interest; see document for details). Assignors: ATTEBERRY, TRACY; LIPPY, HARRY SHAUN
Publication of US20140022260A1
Application granted
Publication of US9442898B2
Active
Adjusted expiration

Classifications

    • G06F17/214
    • G06F17/2217
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6209 Protecting access to data via a platform, e.g. using keys or access control rules to a single file or object, e.g. in a secure envelope, encrypted and accessed using a key, or with access control rules appended to the object itself
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/109 Font handling; Temporal or kinetic typography
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/20 Drawing from basic elements, e.g. lines or circles
    • G06T11/203 Drawing of straight lines or curves
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G2370/00 Aspects of data communication
    • G09G2370/02 Networking aspects
    • G09G2370/027 Arrangements and methods specific for the display of internet documents
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G5/00 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
    • G09G5/22 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators characterised by the display of characters or indicia using display control signals derived from coded signals representing the characters or indicia, e.g. with a character-code memory
    • G09G5/24 Generation of individual character patterns

Definitions

  • One embodiment is directed to a computer system, and more particularly, to a computer system that publishes electronic documents.
  • One embodiment is directed to a system that generates one or more fonts for a document.
  • the system creates glyph data associated with each font of the one or more fonts, where the glyph data produces one or more glyphs that are displayed within the document.
  • the system further modifies the glyph data, where the modified glyph data produces one or more modified glyphs, and where each modified glyph is substantially identical to a corresponding glyph when displayed within the document.
  • the system further creates one or more character mappings, where each character mapping maps a unique character code of one or more unique character codes to a modified glyph of the one or more modified glyphs, where one or more instances of a character in the document are replaced with a unique character code of the one or more unique character codes.
  • FIG. 1 illustrates a block diagram of a system that can implement an embodiment of the invention.
  • FIG. 2 illustrates the generation and modification of glyph data, in accordance with an embodiment of the invention.
  • FIG. 3 illustrates the generation of a plurality of character mappings, in accordance with an embodiment of the invention.
  • FIG. 4 illustrates a flow diagram of the functionality of a document font generation module, according to an embodiment of the invention.
  • character maps for fonts that can be delivered with a document can be produced using a two-part process.
  • glyph data associated with the fonts is randomly “fuzzed” (i.e., the glyph drawing instructions and/or the (x, y) coordinates defining the glyph are modified), in a way that does not affect how the glyph is viewed by a human reader, but which affects how the glyph data is hashed, and thus, foils any attempt to hash the glyph data into the corresponding character map.
  • character mappings are created, where several characters are mapped to one or more “fuzzed” (i.e., modified) glyphs.
  • the number of character mappings can be equal to the number of times the character was used (per font, and per document).
  • the character mapping is equivalent to a one-time pad, a type of encryption proven to be impossible to crack. This approach is far more effective than, for example, a simple substitution cypher, which is easily recognized and cracked.
  • One known solution is to scramble a character mapping (“cmap”) table of the embedded fonts in a PDF or an HTML document.
  • a character code for ‘C’ might map to a glyph for ‘W,’
  • a character code for ‘A’ might map to a glyph for ‘X,’
  • a character code for ‘T’ might map to a glyph for ‘Y,’ etc.
  • “CAT” may be shown when the document is displayed on a screen, but any search or copy/paste operation will result in “WXY.”
  • reverse-engineering extraction techniques can get around the scrambling by maintaining a database of hashes for glyph data where the hashes are generated using a hashing algorithm (such as an MD5 message-digest algorithm (“MD5”)), and then using the hashes to map the glyphs directly to their correct character codes.
  • according to embodiments, character maps for fonts that can be delivered with a document, such as embedded fonts, can be produced using a process as described.
  • glyph data (such as glyph instructions and/or coordinates) associated with the fonts is modified in such a way that the rendered glyphs are not visibly different from the unmodified glyphs, but in a way that affects how the glyph data is hashed, and thus foils any attempt to map the hash of the glyph data to the correct character code for that glyph.
  • character mappings are created, where several characters are mapped to one or more modified glyphs. Therefore, document providers can have simple documents (such as PDF and HTML documents) that can be viewed as intended but do not allow legible copy/paste or machine searchability by any known method.
  • FIG. 1 illustrates a block diagram of a system 10 that can implement one embodiment of the invention.
  • System 10 includes a bus 12 or other communications mechanism for communicating information between components of system 10.
  • System 10 also includes a processor 22, operatively coupled to bus 12, for processing information and executing instructions or operations.
  • Processor 22 may be any type of general or specific purpose processor.
  • System 10 further includes a memory 14 for storing information and instructions to be executed by processor 22.
  • Memory 14 can be comprised of any combination of random access memory (“RAM”), read only memory (“ROM”), static storage such as a magnetic or optical disk, or any other type of machine or computer-readable medium.
  • System 10 further includes a communication device 20, such as a network interface card or other communications interface, to provide access to a network. As a result, a user may interface with system 10 directly, or remotely through a network or any other method.
  • a computer-readable medium may be any available medium that can be accessed by processor 22.
  • a computer-readable medium may include both a volatile and nonvolatile medium, a removable and non-removable medium, a communication medium, and a storage medium.
  • a communication medium may include computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any other form of information delivery medium known in the art.
  • a storage medium may include RAM, flash memory, ROM, erasable programmable read-only memory (“EPROM”), electrically erasable programmable read-only memory (“EEPROM”), registers, hard disk, a removable disk, a compact disk read-only memory (“CD-ROM”), or any other form of storage medium known in the art.
  • Processor 22 can also be operatively coupled via bus 12 to a display 24, such as a Liquid Crystal Display (“LCD”).
  • Display 24 can display information to the user.
  • a keyboard 26 and a cursor control device 28 can also be operatively coupled to bus 12 to enable the user to interface with system 10.
  • memory 14 can store software modules that may provide functionality when executed by processor 22.
  • the modules can include an operating system 15, a document font generation module 16, as well as other functional modules 18.
  • Operating system 15 can provide an operating system functionality for system 10.
  • Document font generation module 16 can provide functionality for generating fonts for documents to prevent automatic text extraction, as will be described in more detail below.
  • document font generation module 16 can comprise a plurality of modules, where each module provides specific individual functionality for generating fonts for documents to prevent automatic text extraction.
  • System 10 can also be part of a larger system.
  • system 10 can include one or more additional functional modules 18 to include the additional functionality.
  • functional modules 18 may include modules that provide additional functionality, such as an “Outside In” product from Oracle Corporation, where an example of an “Outside In” product is “Clean Content.”
  • Database 34 can store data in an integrated collection of logically-related records or files.
  • Database 34 can be an operational database, an analytical database, a data warehouse, a distributed database, an end-user database, an external database, a navigational database, an in-memory database, a document-oriented database, a real-time database, a relational database, an object-oriented database, or any other database known in the art.
  • FIG. 2 illustrates the generation and modification of glyph data, in accordance with an embodiment of the invention.
  • a “glyph” is a vector drawing that can be displayed within a user interface when a document is displayed within the user interface, where the vector drawing represents at least a portion of the data, such as text, contained within the document that is displayed within the user interface.
  • glyph data can comprise a plurality of (x, y) coordinates that define a shape of a glyph (such as one or more shape contours), and a plurality of byte code instructions that describe how to define the shape of the glyph to provide the best display of the glyph at various sizes.
  • the plurality of byte code instructions can describe how to alter the shape contours of the glyph defined by the plurality of (x,y) coordinates in order to provide the best display of the glyph at a specified size.
  • grid 200 includes glyph 210, where glyph 210 is an example of a glyph that can be displayed within a user interface when the document is displayed within the user interface.
  • glyph 210 is a representation of the character “o”.
  • glyph 210 is merely an example of a glyph, and glyph 210 can represent any data, such as any character, or any text.
  • Glyph 210 includes a plurality of (x, y) coordinates that correspond to pixels, where a pixel is a smallest addressable element in a display device that can display a user interface.
  • An example coordinate, as illustrated in FIG. 2, is coordinate 211.
  • glyph 210 is comprised of a plurality of coordinates, where the other coordinates are not specifically illustrated in FIG. 2, for sake of visibility.
  • glyph 210 can be produced from glyph data, where glyph data can comprise a plurality of (x, y) coordinates that define shape contours and a plurality of byte code instructions that describe how to alter the shape when necessary to provide the best display of the glyph at various sizes.
  • the glyph data that corresponds to glyph 210 can be modified in a way so that a modified glyph (i.e., modified glyph 220) can be produced from the glyph data, where modified glyph 220 can be displayed so that the appearance of modified glyph 220 is not substantially or appreciably different from the appearance of glyph 210.
  • the modification to the glyph data can comprise a modification to the one or more (x, y) coordinates, a modification to the plurality of byte code instructions, or a combination thereof.
  • modified glyph 220 can be produced from the glyph data, rather than glyph 210.
  • modified glyph 220 is displayed as having a smaller height than glyph 210.
  • modified glyph 220 includes different (x, y) coordinates from glyph 210.
  • modified glyph 220 includes coordinate 221, which has a different position than coordinate 211 of glyph 210. While the appearance of modified glyph 220 is different than the appearance of glyph 210 in FIG. 2, one of ordinary skill in the art would readily appreciate that the difference in appearance between glyph 210 and modified glyph 220 is exaggerated in FIG. 2.
  • glyph 210 and modified glyph 220 can have substantially identical appearances when displayed within a user interface, notwithstanding that glyph 210 and modified glyph 220 can include different (x, y) coordinates.
  • by “substantially identical,” what is meant is that glyph 210 and modified glyph 220 appear to be identical to a viewer of the document, even though they may not be identical.
  • glyph data can comprise a plurality of (x, y) coordinates that define shape contours and a plurality of byte code instructions that describe how to alter the shape when necessary to provide the best display of the glyph at various sizes.
  • the modification of the glyph data can include a modification to one or more (x, y) coordinates of the plurality of (x, y) coordinates.
  • the modification to the one or more (x, y) coordinates can include the modification of the position of the one or more (x, y) coordinates.
  • the modification of the position of the one or more (x, y) coordinates can be such that the display of the glyph associated with the glyph data is not substantially altered.
  • if an (x, y) coordinate is modified by less than 1/1000th of an em-square (i.e., a grid used to define a glyph), the difference is generally not detectable once the glyph is displayed as a collection of pixels.
  • a majority of fonts use an em-square greater than 1000 units by 1000 units in size, and thus, moving any (x, y) coordinate one pixel in any direction generally does not result in a detectable difference.
  • the glyph coordinates alone can be modified in 16,384 different ways, which is a full one quarter of the basic multi-lingual plane character code points in a Unicode font.
  • the modification of the glyph data can include a modification to the plurality of byte code instructions. These embodiments can include embodiments where the em-square is less than 1000 units × 1000 units.
  • the modification to the plurality of byte code instructions can include adding one or more instructions to the plurality of byte code instructions.
  • the modification to the plurality of byte code instructions can include removing one or more instructions from the plurality of byte code instructions.
  • the modification to the plurality of byte code instructions can include both adding one or more instructions to the plurality of byte code instructions, and removing one or more instructions from the plurality of byte code instructions.
  • the modification of the glyph data can include both a modification to one or more (x, y) coordinates of the plurality of (x, y) coordinates, and a modification to the plurality of byte code instructions.
  • modifying the glyph data is the first part of the process to produce fonts that can be delivered with a document.
  • the second part is to create a non-reversible mapping of character codes to glyphs for each particular font that is delivered with the document.
  • mapping character codes to glyphs in a font: for $n$ occurrences of a character code $c$ in a given font for a given document, $r$ ($\le n$) unique character codes $\{c_i : 0 < i \le r\}$ are mapped to $m$ ($\le r$, since a single character code $c_i$ can only map to a single glyph) glyphs using $r$ mapping functions $\{f_i(c_i, g_i(c)) : 0 < i \le r,\ 0 < j \le r\}$ such that each $g_i(c)$ renders the glyph for $c$ and each $f_i(\,)$ maps an input character code $c_i$ to that rendering.
  • G(c) refers to a glyph mapping function for c described in an original font file.
  • each unique character in an input set can map to one and only one glyph that represents the character.
  • the character code “c” maps to a glyph representation of “c”
  • the character code “a” maps to a glyph representation of “a”
  • the character code “t” maps to a glyph representation of “t.”
  • a document producer can “scramble” a font's cmap table.
  • the character code “x” maps to a glyph representation of “c”
  • the character code “y” maps to a glyph representation of “a”
  • the character code “z” maps to a glyph representation of “t.”
  • Such a simple substitution cypher is susceptible to a hashing algorithm, such as an MD5 hashing algorithm, as previously described, where a database of glyph MD5 hashes can be maintained to reverse-engineer the character mapping.
  • a character mapping that is equivalent to a one-time pad, where a one-time pad is a type of encryption which is proven to be impossible to break.
  • the character mapping is not susceptible to reverse-engineering using a hashing algorithm and is not subject to cryptanalysis.
  • each instance of a character in a document is replaced with a unique c i , where each unique c i maps to a different glyph in a font, where each glyph is modified differently, but each modified glyph produces the same display of the original character.
  • the character code “q” maps to a glyph representation of “c”
  • the character code “r” maps to a glyph representation of “c”
  • the character code “s” maps to a glyph representation of “c.”
  • each instance of a glyph representation maps to a different character code.
  • each instance of a character code maps to a different glyph representation.
  • achieving a one-time pad when using a single font can be possible if the total number of characters in a document is less than 65,535 (i.e., the number of glyph slots available).
  • additional fonts can be created (with each font allowing for an additional 65,535 characters), and as many additional fonts as necessary can be used to preserve the one-time pad.
  • a one-time pad level of security (proven to be unbreakable) can be provided for a plurality of character mappings (i.e., character-code-to-glyph mappings), where the plurality of character mappings can be provided for all documents and fonts, regardless of document size.
  • a character mapping can still be provided. While the character mapping is no longer a one-time pad because there are not enough unused glyphs in the font to cover all instances of each character in the document (and thus, the encryption is no longer proven to be unbreakable), the cryptographic security of the document can still be kept at a very high level, as the pattern of the character mapping is extremely subtle, and non-trivial to break.
  • a document can have over 58 pages before the number of characters and glyphs available for remapping in a standard Unicode font are exhausted. Therefore, if a document uses a single Unicode font, it would take 58 pages before the security of the character mapping became less strong than a one-time pad. However, the addition of additional fonts (thus, increasing file space), allows the maintenance of a one-time pad level of security for documents of any size.
  • FIG. 3 illustrates the generation of a plurality of character mappings, in accordance with an embodiment of the invention.
  • FIG. 3 illustrates character mappings 310, 320, and 330.
  • character code 311 is mapped to glyph 312.
  • character code 321 is mapped to glyph 322.
  • character code 331 is mapped to glyph 332.
  • character code 311 is a character code of the character “q”
  • character code 321 is a character code of the character “r”
  • character code 331 is a character code of the character “s.”
  • each instance of a glyph representation maps to a different character code, and likewise, each instance of a character code maps to a different glyph representation.
  • a one-time pad character mapping can be produced, where the one-time pad character mapping is proven to be unbreakable.
  • a “not-quite one-time pad” character mapping can alternatively be produced, where most instances of a glyph representation map to a different character, but certain instances of a glyph representation map to an identical character.
  • most instances of a character map to a different glyph representation, but certain instances of a character map to an identical glyph representation.
  • FIG. 4 illustrates a flow diagram of the functionality of a document font generation module (such as document font generation module 16 of FIG. 1), according to an embodiment of the invention.
  • the functionality of the flow diagram of FIG. 4 is implemented by software stored in a memory or some other computer-readable or tangible medium, and executed by a processor.
  • the functionality may be performed by hardware (e.g., through the use of an application specific integrated circuit (“ASIC”), a programmable gate array (“PGA”), a field programmable gate array (“FPGA”), etc.), or any combination of hardware and software.
  • the flow begins and proceeds to 410.
  • glyph data associated with each font of the one or more fonts is created, where the glyph data produces one or more glyphs that are displayed within the document.
  • the glyph data includes a plurality of coordinates that define a shape for the one or more glyphs.
  • the glyph data includes a plurality of byte code instructions that define a shape for the one or more glyphs.
  • the glyph data includes both a plurality of coordinates that define a shape for the one or more glyphs, and a plurality of byte code instructions that define a shape for the one or more glyphs.
  • the plurality of byte code instructions can alter a shape defined by the plurality of coordinates.
  • the glyph data is modified, where the modified glyph data produces one or more modified glyphs, and where each modified glyph is substantially identical to a corresponding glyph when displayed within the document.
  • the glyph data is modified by modifying at least one coordinate of the plurality of coordinates. In some of these embodiments, at least one coordinate is modified by modifying the position of at least one coordinate. In some of these embodiments, the position of at least one coordinate is modified by less than 1/1000th of an em-square of each glyph of the one or more glyphs.
  • the glyph data is modified by modifying at least one byte code instruction of the plurality of byte code instructions. In some of these embodiments, one or more byte code instructions are removed. In other embodiments, one or more byte code instructions are added. In yet other embodiments, one or more byte code instructions are removed, and one or more byte code instructions are added. The flow then proceeds to 430.
  • one or more character mappings are created, where each character mapping maps a unique character code of one or more unique character codes to a modified glyph of the one or more modified glyphs.
  • one or more instances of a character in the document are replaced with a unique character code of the one or more unique character codes.
  • all of the one or more unique character codes are mapped to different modified glyphs.
  • some of the one or more unique character codes are mapped to different modified glyphs.
  • the document and the one or more fonts are delivered.
  • the one or more fonts are embedded fonts that are embedded within the document.
  • the document is a PDF document. In other embodiments, the document is an HTML document. The flow then ends.
  • a product can produce documents (such as PDF documents and HTML documents) with deliverable fonts (such as embedded fonts) using the process previously described.
  • the product can be a document export product that provides functionality for exporting documents that include deliverable fonts.
  • the product can also be a document creation product, where the process is part of a conversion process that creates a document.
  • documents with deliverable fonts specially created with character mappings that are not reversible by any known technology can be generated.
  • the low-level source of the document itself would not be human-readable, or machine-searchable, but the document can be displayed as intended within an appropriate document view.
  • This process can protect against current reverse-engineering techniques that are used against scrambled character mapping tables, and can protect against reverse-engineering techniques developed in the future.
  • a customer can provide their documents in common formats delivered in traditional manners with no special plug-ins or software required.
  • the documents can be in common formats, such as PDF and HTML.
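The glyph-slot budget mentioned in the list above (65,535 slots per font, with additional fonts added as needed) reduces to simple arithmetic. The following Python sketch is only an illustration: the helper name `fonts_needed`, the one-slot-per-character-instance model, and the treatment of the exact 65,535 boundary are assumptions and are not part of the described embodiments.

```python
import math

GLYPH_SLOTS_PER_FONT = 65_535  # figure quoted in the text above

def fonts_needed(total_characters: int) -> int:
    """Fonts required so that every character instance gets its own glyph slot.

    Hypothetical helper: assumes one slot per character instance and that a
    font holding exactly 65,535 instances is still acceptable.
    """
    return max(1, math.ceil(total_characters / GLYPH_SLOTS_PER_FONT))

# Characters per page implied by the "58 pages" figure (integer division).
chars_per_page = GLYPH_SLOTS_PER_FONT // 58

# 16,384 coordinate variants is one quarter of the 65,536 BMP code points.
bmp_quarter = 65_536 // 4
```

Under these assumptions, a document of 200,000 characters would need four fonts to keep every character instance on its own glyph.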

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Bioethics (AREA)
  • Controls And Circuits For Display Device (AREA)

Abstract

A system that generates one or more fonts for a document is provided. Glyph data associated with the one or more fonts is modified in a way that modifies one or more glyphs, but does not affect how the one or more glyphs are displayed within the document. Subsequently, character mappings are created, where each character of a plurality of characters is mapped to one or more modified glyphs.

Description

FIELD
One embodiment is directed to a computer system, and more particularly, to a computer system that publishes electronic documents.
BACKGROUND
Organizations often want to publish electronic documents (such as portable document format (“PDF”) documents and hypertext markup language (“HTML”) documents) on an open network, such as the Internet, and have those documents easily accessible, readable, and printable by their target audiences without the use of special plug-ins. At the same time, however, organizations often do not want these documents to be machine-searchable (and possibly indexed) in an automated fashion by either commercial search engines or competitors.
SUMMARY
One embodiment is directed to a system that generates one or more fonts for a document. The system creates glyph data associated with each font of the one or more fonts, where the glyph data produces one or more glyphs that are displayed within the document. The system further modifies the glyph data, where the modified glyph data produces one or more modified glyphs, and where each modified glyph is substantially identical to a corresponding glyph when displayed within the document. The system further creates one or more character mappings, where each character mapping maps a unique character code of one or more unique character codes to a modified glyph of the one or more modified glyphs, where one or more instances of a character in the document are replaced with a unique character code of the one or more unique character codes.
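The steps summarized above can be sketched in Python as a toy model. This is a minimal illustration, not the implementation of any embodiment: the outlines in `BASE_GLYPHS`, the helper names `fuzz_unique` and `protect`, the one-unit shift, and the choice of the Unicode Private Use Area for replacement codes are all assumptions introduced here.

```python
import random

# Hypothetical base outlines on a 2048-unit em-square (not real font data).
BASE_GLYPHS = {
    "c": [(100, 0), (900, 0), (900, 200), (100, 200)],
    "a": [(100, 300), (900, 300), (900, 700), (100, 700)],
    "t": [(400, 0), (600, 0), (600, 1400), (400, 1400)],
}
FIRST_CODE = 0xE000  # assumption: replacement codes start in the Private Use Area

def fuzz_unique(points, used, rng):
    """Shift every point by -1, 0, or +1 units until an unused variant appears."""
    while True:
        variant = tuple(
            (x + rng.choice((-1, 0, 1)), y + rng.choice((-1, 0, 1)))
            for x, y in points
        )
        if variant != tuple(points) and variant not in used:
            used.add(variant)
            return list(variant)

def protect(text, base_glyphs, seed=0):
    """Give each character instance its own code and its own modified glyph."""
    rng = random.Random(seed)
    used = set()
    cmap = {}  # unique character code -> modified outline
    encoded = []
    for offset, ch in enumerate(text):
        code = FIRST_CODE + offset
        cmap[code] = fuzz_unique(base_glyphs[ch], used, rng)
        encoded.append(chr(code))
    return "".join(encoded), cmap

TEXT = "cattac"
encoded, cmap = protect(TEXT, BASE_GLYPHS)
```

In this sketch the stored text `encoded` contains six distinct codes, none of which is "c", "a", or "t", while every entry in `cmap` stays within one unit of the outline a reader expects to see. The sketch ignores the size of the Private Use Area (6,400 code points in the Basic Multilingual Plane), so it only suits short texts.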
BRIEF DESCRIPTION OF THE DRAWINGS
Further embodiments, details, advantages, and modifications will become apparent from the following detailed description of the preferred embodiments, which is to be taken in conjunction with the accompanying drawings.
FIG. 1 illustrates a block diagram of a system that can implement an embodiment of the invention.
FIG. 2 illustrates the generation and modification of glyph data, in accordance with an embodiment of the invention.
FIG. 3 illustrates the generation of a plurality of character mappings, in accordance with an embodiment of the invention.
FIG. 4 illustrates a flow diagram of the functionality of a document font generation module, according to an embodiment of the invention.
DETAILED DESCRIPTION
According to an embodiment, character maps for fonts that can be delivered with a document, such as embedded fonts, can be produced using a two-part process. In the first part, glyph data associated with the fonts is randomly “fuzzed” (i.e., the glyph drawing instructions and/or the (x, y) coordinates defining the glyph are modified), in a way that does not affect how the glyph is viewed by a human reader, but which affects how the glyph data is hashed, and thus, foils any attempt to hash the glyph data into the corresponding character map. In the second part, character mappings are created, where several characters are mapped to one or more “fuzzed” (i.e., modified) glyphs. Ideally, the number of character mappings can be equal to the number of times the character was used (per font, and per document). Thus, in one embodiment, the character mapping is equivalent to a one-time pad, a type of encryption proven to be impossible to crack. This approach is far more effective than, for example, a simple substitution cypher, which is easily recognized and cracked.
As previously described, organizations often want to make documents available and readable without special plug-ins, but do not want the documents to be machine-searchable by either search engines or their competition. Currently, solutions to this problem involve either compromising the accessibility, readability, or printability of the document (for example, by encryption, password protection, or the use of specialized plug-ins), or obfuscating the document format (such as a PDF or HTML format). The former is not desirable and the latter is difficult.
One known solution is to scramble a character mapping (“cmap”) table of the embedded fonts in a PDF or an HTML document. In other words, a character code for ‘C’ might map to a glyph for ‘W,’ a character code for ‘A’ might map to a glyph for ‘X,’ a character code for ‘T’ might map to a glyph for ‘Y,’ etc. Thus, while “CAT” may be displayed on a screen, when the document is displayed within the screen, any search or copy/paste operations will result in “WXY.” There are two drawbacks to this solution. First, reverse-engineering extraction techniques can get around the scrambling by maintaining a database of hashes for glyph data where the hashes are generated using a hashing algorithm (such as an MD5 message-digest algorithm (“MD5”)), and then using the hashes to map the glyphs directly to their correct character codes. Second, if the scrambling is done in a naïve way by using a simple substitution cipher, it would be very easy for someone to deduce the correct character codes, in an automatic way and on a per document basis.
In contrast, according to embodiments, character maps for fonts that can be delivered with a document, such as embedded fonts, can be produced using a process as described. Specifically, in one embodiment, glyph data (such as glyph instructions and/or coordinates) associated with the fonts is modified in such a way that the rendered glyphs are not visibly different from the unmodified glyphs, but the hash of the glyph data changes, which foils any attempt to map the hash of the glyph data to the correct character code for that glyph. Further, in one embodiment, character mappings are created, where several characters are mapped to one or more modified glyphs. Therefore, document providers can have simple documents (such as PDF and HTML documents) that can be viewed as intended but do not allow legible copy/paste or machine searchability by any known method.
FIG. 1 illustrates a block diagram of a system 10 that can implement one embodiment of the invention. System 10 includes a bus 12 or other communications mechanism for communicating information between components of system 10. System 10 also includes a processor 22, operatively coupled to bus 12, for processing information and executing instructions or operations. Processor 22 may be any type of general or specific purpose processor. System 10 further includes a memory 14 for storing information and instructions to be executed by processor 22. Memory 14 can be comprised of any combination of random access memory (“RAM”), read only memory (“ROM”), static storage such as a magnetic or optical disk, or any other type of machine or computer-readable medium. System 10 further includes a communication device 20, such as a network interface card or other communications interface, to provide access to a network. As a result, a user may interface with system 10 directly, or remotely through a network or any other method.
A computer-readable medium may be any available medium that can be accessed by processor 22. A computer-readable medium may include both a volatile and nonvolatile medium, a removable and non-removable medium, a communication medium, and a storage medium. A communication medium may include computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any other form of information delivery medium known in the art. A storage medium may include RAM, flash memory, ROM, erasable programmable read-only memory (“EPROM”), electrically erasable programmable read-only memory (“EEPROM”), registers, hard disk, a removable disk, a compact disk read-only memory (“CD-ROM”), or any other form of storage medium known in the art.
Processor 22 can also be operatively coupled via bus 12 to a display 24, such as a Liquid Crystal Display (“LCD”). Display 24 can display information to the user. A keyboard 26 and a cursor control device 28, such as a computer mouse, can also be operatively coupled to bus 12 to enable the user to interface with system 10.
According to one embodiment, memory 14 can store software modules that may provide functionality when executed by processor 22. The modules can include an operating system 15, a document font generation module 16, as well as other functional modules 18. Operating system 15 can provide an operating system functionality for system 10. Document font generation module 16 can provide functionality for generating fonts for documents to prevent automatic text extraction, as will be described in more detail below. In certain embodiments, document font generation module 16 can comprise a plurality of modules, where each module provides specific individual functionality for generating fonts for documents to prevent automatic text extraction. System 10 can also be part of a larger system. Thus, system 10 can include one or more additional functional modules 18 to include the additional functionality. For example, functional modules 18 may include modules that provide additional functionality, such as an “Outside In” product from Oracle Corporation, where an example of an “Outside In” product is “Clean Content.”
Processor 22 can also be operatively coupled via bus 12 to a database 34. Database 34 can store data in an integrated collection of logically-related records or files. Database 34 can be an operational database, an analytical database, a data warehouse, a distributed database, an end-user database, an external database, a navigational database, an in-memory database, a document-oriented database, a real-time database, a relational database, an object-oriented database, or any other database known in the art.
FIG. 2 illustrates the generation and modification of glyph data, in accordance with an embodiment of the invention. As understood by one of ordinary skill in the art, a “glyph” is a vector drawing that can be displayed within a user interface when a document is displayed within the user interface, where the vector drawing represents at least a portion of the data, such as text, contained within the document that is displayed within the user interface. Thus, glyph data can comprise a plurality of (x, y) coordinates that define a shape of a glyph (such as one or more shape contours), and a plurality of byte code instructions that describe how to define the shape of the glyph to provide the best display of the glyph at various sizes. In certain embodiments, the plurality of byte code instructions can describe how to alter the shape contours of the glyph defined by the plurality of (x,y) coordinates in order to provide the best display of the glyph at a specified size.
According to the illustrated embodiment, grid 200 includes glyph 210, where glyph 210 is an example of a glyph that can be displayed within a user interface when the document is displayed within the user interface. In the illustrated embodiment, glyph 210 is a representation of the character “o”. However, glyph 210 is merely an example of a glyph, and glyph 210 can represent any data, such as any character, or any text. Glyph 210 includes a plurality of (x, y) coordinates that correspond to pixels, where a pixel is a smallest addressable element in a display device that can display a user interface. An example coordinate, as illustrated in FIG. 2, is coordinate 211. As one of ordinary skill in the art would readily appreciate, glyph 210 is comprised of a plurality of coordinates, where the other coordinates are not specifically illustrated in FIG. 2, for sake of visibility.
As previously described, glyph 210 can be produced from glyph data, where glyph data can comprise a plurality of (x, y) coordinates that define shape contours and a plurality of byte code instructions that describe how to alter the shape when necessary to provide the best display of the glyph at various sizes. According to an embodiment, as illustrated in FIG. 2 and described below in greater detail, the glyph data that corresponds to glyph 210 can be modified in a way so that a modified glyph (i.e., modified glyph 220) can be produced from the glyph data, where modified glyph 220 can be displayed so that the appearance of modified glyph 220 is not substantially or appreciably different from the appearance of glyph 210. By “not substantially different” or “not appreciably different,” what is meant is that the appearance of modified glyph 220 may be different from the appearance of glyph 210, but that a viewer of the document cannot detect the difference in appearance between modified glyph 220 and glyph 210. As also described below in greater detail, the modification to the glyph data can comprise a modification to the one or more (x, y) coordinates, a modification to the plurality of byte code instructions, or a combination thereof.
While the modification of the glyph data does not substantially or appreciably change the appearance of modified glyph 220, as compared to glyph 210, the modification does result in a different hash value for modified glyph 220 than would be obtained by hashing glyph 210. Thus, if an attempt is made to hash modified glyph 220, and then to use this hash value to look up the correct character code in a table that maps known glyph hashes to character codes, the correct character will not be found and either an incorrect character will be returned from the table look up or, more likely, no match will be found at all.
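The effect on a hash lookup can be illustrated with a minimal Python sketch. The outline coordinates, the serialization used for hashing, and the helper names `glyph_hash` and `fuzz` are assumptions made for illustration only; they are not taken from any font format or from the embodiment itself.

```python
import hashlib
import random

def glyph_hash(coords):
    """MD5 over a simple serialization of (x, y) coordinates (illustrative only)."""
    data = ",".join(f"{x}:{y}" for x, y in coords).encode("ascii")
    return hashlib.md5(data).hexdigest()

def fuzz(coords, rng):
    """Move one randomly chosen point by one font unit in x or y."""
    out = list(coords)
    i = rng.randrange(len(out))
    x, y = out[i]
    dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
    out[i] = (x + dx, y + dy)
    return out

# Hypothetical outline for "o" on a 2048-unit em-square.
glyph_o = [(1024, 0), (1600, 300), (1800, 1024), (1600, 1750),
           (1024, 2048), (450, 1750), (250, 1024)]

# Table an extractor might keep: hash of known glyph data -> character.
known_hashes = {glyph_hash(glyph_o): "o"}

modified = fuzz(glyph_o, random.Random(0))
lookup = known_hashes.get(glyph_hash(modified))  # no entry for the modified glyph
```

Because a single one-unit change alters the serialized bytes, the digest of the modified glyph no longer matches the stored entry and the lookup yields no character.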
As previously described, the glyph data can be modified so that modified glyph 220 can be produced from the glyph data, rather than glyph 210. As illustrated in FIG. 2, modified glyph 220 is displayed as having a smaller height than glyph 210. More specifically, modified glyph 220 includes different (x, y) coordinates from glyph 210. For example, modified glyph 220 includes coordinate 221, which has a different position than coordinate 211 of glyph 210. While the appearance of modified glyph 220 is different than the appearance of glyph 210 in FIG. 2, one of ordinary skill in the art would readily appreciate that the difference in appearance between glyph 210 and modified glyph 220 is exaggerated in FIG. 2 (for sake of visibility), and that in alternate embodiments, glyph 210 and modified glyph 220 can have substantially identical appearances when displayed within a user interface, notwithstanding that glyph 210 and modified glyph 220 can include different (x, y) coordinates. By “substantially identical,” what is meant is that the appearances of glyph 210 and modified glyph 220 appear to be identical to a viewer of the document, even though they may not be identical.
As previously described, glyph data can comprise a plurality of (x, y) coordinates that define shape contours and a plurality of byte code instructions that describe how to alter the shape when necessary to provide the best display of the glyph at various sizes. In certain embodiments, the modification of the glyph data can include a modification to one or more (x, y) coordinates of the plurality of (x, y) coordinates. According to these embodiments, the modification to the one or more (x, y) coordinates can include the modification of the position of the one or more (x, y) coordinates. The modification of the position of the one or more (x, y) coordinates can be such that the display of the glyph associated with the glyph data is not substantially altered. For example, if the position of an (x, y) coordinate is modified by less than 1/1000th of an em-square (i.e., a grid used to define a glyph), then the difference is generally not detectable once the glyph is displayed as a collection of pixels. A majority of fonts use an em-square greater than 1000 units by 1000 units in size, and thus, moving any (x, y) coordinate one unit in any direction generally does not result in a detectable difference. For example, if a glyph has a shape contour with seven points, the glyph coordinates alone can be modified in 16,384 different ways, which is a full one quarter of the basic multi-lingual plane character code points in a Unicode font.
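The 16,384 figure can be checked with a short calculation. The counting rule used here is an assumption, since the text does not state it: each of the 14 coordinate values (seven points, each with an x and a y value) is treated as either left unchanged or nudged.

```python
points = 7
values = points * 2          # each point has an x value and a y value
variants = 2 ** values       # assumed: each value is either unchanged or nudged
bmp_code_points = 2 ** 16    # 65,536 code points in the Basic Multilingual Plane
fraction = variants / bmp_code_points
```

Under that assumption, `variants` is 16,384 and `fraction` is 0.25, which agrees with the "one quarter" stated above.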
In certain embodiments, the modification of the glyph data can include a modification to the plurality of byte code instructions. These embodiments can include embodiments where the em-square is less than 1000 units×1000 units. The modification to the plurality of byte code instructions can include adding one or more instructions to the plurality of byte code instructions. In alternate embodiments, the modification to the plurality of byte code instructions can include removing one or more instructions from the plurality of byte code instructions. In yet alternate embodiments, the modification to the plurality of byte code instructions can include both adding one or more instructions to the plurality of byte code instructions, and removing one or more instructions from the plurality of byte code instructions. In certain embodiments, the modification of the glyph data can include both a modification to one or more (x, y) coordinates of the plurality of (x, y) coordinates, and a modification to the plurality of byte code instructions.
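One way to add instructions without changing the rendered shape is to append a pair of instructions with no net effect. The following Python sketch assumes TrueType-style byte code, in which PUSHB[0] (0xB0) pushes one byte onto the interpreter stack and POP (0x21) discards the top element; the text does not name a byte code format, and the `original` program bytes and the helper `add_noop` are placeholders for illustration.

```python
import hashlib

PUSHB_1 = 0xB0  # TrueType PUSHB[0]: push the next byte onto the stack
POP = 0x21      # TrueType POP: discard the top stack element

def add_noop(program: bytes, value: int = 0) -> bytes:
    """Append a push/pop pair that leaves the interpreter stack unchanged."""
    return program + bytes([PUSHB_1, value & 0xFF, POP])

# Placeholder bytes standing in for a glyph's instruction program.
original = bytes([0xB0, 0x03, 0x2F])
padded = add_noop(original, value=7)

# The byte sequences differ, so their digests differ as well.
same_hash = hashlib.md5(original).digest() == hashlib.md5(padded).digest()
```

Varying the pushed `value`, or the number of appended pairs, gives many distinct byte sequences for the same glyph, which is useful when the em-square is too coarse for coordinate changes to go unnoticed.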
In certain embodiments, modifying the glyph data is the first part of the process to produce fonts that can be delivered with a document. According to these embodiments, the second part is to create a non-reversible mapping of character codes to glyphs for each particular font that is delivered with the document.
A general description of mapping character codes to glyphs in a font is now provided. For n occurrences of a character code c in a given font for a given document, r (≦n) unique character codes {ci: 0≦i<r} are mapped to m (≦r, since a single character code ci can only map to a single glyph) glyphs using r mapping functions {fi(ci,gj(c)): 0≦i<r, 0≦j<m} such that each gj(c) renders the glyph for c and each fi( ) maps an input character code ci to that rendering. Also, for the sake of notation below, G(c) refers to a glyph mapping function for c described in an original font file.
In a standard known scenario of mapping character codes to glyphs, each unique character in an input set can map to one and only one glyph that represents the character. In other words, r=m=1, c0=c, g0(c)=G(c). This is how conventional fonts generally map character codes to their glyph representations. Below is an example of such a known character mapping:
    • c0=c→G(c)
    • c0=a→G(a)
    • c0=t→G(t)
Thus, in the above example, the character code “c” maps to a glyph representation of “c,” the character code “a” maps to a glyph representation of “a,” and the character code “t” maps to a glyph representation of “t.”
In a known simple substitution scenario, a document producer can “scramble” a font's cmap table. In other words, r=m=1, c0≠c, g0(c)=G(c). This in effect changes the mapping of each character code, and thus, acts as a simple substitution cypher.
Below is an example of such a known character mapping:
    • c0=x→G(c)
    • c0=y→G(a)
    • c0=z→G(t)
Thus, in the above example, the character code “x” maps to a glyph representation of “c,” the character code “y” maps to a glyph representation of “a,” and the character code “z” maps to a glyph representation of “t.” Such a simple substitution cypher is susceptible to a hashing algorithm, such as an MD5 hashing algorithm, as previously described, where a database of glyph MD5 hashes can be maintained to reverse-engineer the character mapping.
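The weakness of the scrambled mapping can be sketched in Python as follows. The byte strings in `G` are placeholders standing in for unmodified glyph data, and `hash_db` and `recover` are hypothetical names for the extractor's hash database and lookup routine.

```python
import hashlib

def md5_of(outline: bytes) -> str:
    """Hex MD5 digest of a glyph's raw data."""
    return hashlib.md5(outline).hexdigest()

# Placeholder byte strings standing in for unmodified glyph data G(c), G(a), G(t).
G = {"c": b"outline-c", "a": b"outline-a", "t": b"outline-t"}

# Scrambled cmap: character code -> glyph data (x->G(c), y->G(a), z->G(t)).
scrambled_cmap = {"x": G["c"], "y": G["a"], "z": G["t"]}

# Extractor's database built from the stock font: glyph hash -> true character.
hash_db = {md5_of(data): ch for ch, data in G.items()}

def recover(text: str) -> str:
    """Undo the scramble by hashing each code's glyph data and looking it up."""
    return "".join(hash_db[md5_of(scrambled_cmap[code])] for code in text)

extracted = "xyz"              # what copy/paste yields from the scrambled document
recovered = recover(extracted)
```

Because the glyph data itself is untouched, every hash matches an entry in the database and the original text "cat" is recovered regardless of how the character codes were permuted.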
According to an embodiment of the invention, a character mapping that is equivalent to a one-time pad is provided, where a one-time pad is a type of encryption which is proven to be impossible to break. Thus, the character mapping is not susceptible to reverse-engineering using a hashing algorithm and is not subject to cryptanalysis. According to the embodiment, each instance of a character in a document is replaced with a unique ci, where each unique ci maps to a different glyph in a font, where each glyph is modified differently, but each modified glyph produces the same display of the original character. In other words, r=m=n, ci≠c, gi(c)≠gj(c) for each i≠j, and no gi(c)=G(c). Below is an example of such a character mapping:
    • c0=q→g0(c)
    • c1=r→g1(c)
    • c2=s→g2(c)
Thus, in the above example, the character code “q” maps to a glyph representation of “c,” the character code “r” maps to a glyph representation of “c,” and the character code “s” maps to a glyph representation of “c.” Thus, each instance of a glyph representation maps to a different character code. Likewise, each instance of a character code maps to a different glyph representation.
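A minimal Python sketch of such a mapping follows. It assumes that replacement codes are drawn sequentially from the Unicode Private Use Area starting at U+E000 and that a glyph is a tuple of (x, y) points; both choices, along with the helper names `fuzz` and `encode`, are illustrative assumptions rather than details of the embodiment.

```python
import random

def fuzz(coords, rng, used):
    """Return a one-unit perturbation of coords that has not been produced before."""
    while True:
        out = tuple((x + rng.choice((-1, 0, 1)), y + rng.choice((-1, 0, 1)))
                    for x, y in coords)
        if out != tuple(coords) and out not in used:
            used.add(out)
            return out

def encode(text, outlines, seed=0):
    """Replace every character instance with a fresh code mapped to its own glyph."""
    rng = random.Random(seed)
    used, cmap, codes = set(), {}, []
    next_code = 0xE000   # assumed: codes drawn from the Unicode Private Use Area
    for ch in text:
        cmap[next_code] = fuzz(outlines[ch], rng, used)
        codes.append(next_code)
        next_code += 1
    return codes, cmap

# Hypothetical four-point outline standing in for G(c).
outlines = {"c": [(0, 0), (500, 0), (500, 700), (0, 700)]}
codes, cmap = encode("ccc", outlines)
```

Three instances of "c" yield three distinct codes and three distinct modified glyphs, none equal to the original outline, which corresponds to r=m=n in the notation above. A practical generator would likely randomize the code assignment rather than use sequential codes.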
According to the embodiment, achieving a one-time pad when using a single font can be possible if the total number of characters in a document is less than 65,535 (i.e., the number of glyph slots available). In situations where the document exceeds 65,535 characters in a single font (roughly 58 pages), additional fonts can be created (with each font allowing for an additional 65,535 characters), and as many additional fonts as necessary can be used to preserve the one-time pad. Therefore, according to the embodiment, a one-time pad level of security (proven to be unbreakable) can be provided for a plurality of character mappings (i.e., character-code-to-glyph mappings), where the plurality of character mappings can be provided for all documents and fonts, regardless of document size.
In alternate embodiments, where a single font is required, and a document is of a sufficient size to require more than 65,535 characters, a character mapping can still be provided. While the character mapping is no longer a one-time pad because there are not enough unused glyphs in the font to cover all instances of each character in the document (and thus, the encryption is no longer proven to be unbreakable), the cryptographic security of the document can still be kept at a very high level, as the pattern of the character mapping is extremely subtle, and non-trivial to break. According to the embodiment, most instances of a character in a document are replaced with a unique ci, where each unique ci maps to a different glyph in a font, where each glyph is modified differently, but each modified glyph produces the same display of the original character. Furthermore, according to the embodiment, some instances of a character in a document are replaced with a ci that is identical to a ci that corresponds to a previous instance of the character, because there are no longer any unused glyphs to map the instance of the character to. In other words, m≦r≦n, ci≠c, gi(c)≠gj(c) for each i≠j, and no gi(c)=G(c). Below is an example of such a character mapping:
    • c0=q→g0(c)
    • c1=r→g0(c)
    • c2=s→g0(c)
    • c3=t→g1(c)
As can be seen in the example, there are multiple character codes, c0, c1, and c2, that map to the same glyph rendering, g0(c). Thus, the unbreakable security of the one-time pad cannot be guaranteed in this embodiment. However, as long as the total number of characters in the document is not too large, the embodiment will result in an encryption that would be very difficult to crack using standard cryptographic techniques such as character frequency distributions.
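The budget-limited case can be sketched in Python as follows. The round-robin reuse policy and the function name `assign` are assumptions chosen for brevity; the text does not prescribe how codes or glyphs are reused once the budgets run out.

```python
def assign(n_instances, code_budget, glyph_budget):
    """Map n instances of one character to (code index, glyph index) pairs.

    Fresh codes are used while the code budget lasts and are then reused
    round-robin. Each code always maps to the same glyph, and several codes
    may share a glyph, so m <= r <= n.
    """
    if not 0 < glyph_budget <= code_budget:
        raise ValueError("need 0 < glyph_budget <= code_budget")
    pairs = []
    for k in range(n_instances):
        code = k % code_budget          # reuse codes once the budget is spent
        glyph = code % glyph_budget     # fixed glyph per code; glyphs are shared
        pairs.append((code, glyph))
    return pairs

# n = 6 instances, r = 4 codes, m = 2 modified glyphs.
pairs = assign(6, code_budget=4, glyph_budget=2)
```

With these numbers the six instances use four distinct codes and two distinct glyphs, so some repetition becomes visible to an analyst, which is the weakening relative to the one-time pad described above.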
Thus, when restricted to a single font, and where r and m must be less than n, a stronger restriction will generally be on m as this represents the number of glyphs in a font file, and, in practice, a little bit of cryptographic security can be traded for a smaller font size. The number of characters available to map into the glyph table, however, is still 65,535. So the cryptographic security of the document can be kept at a very high level, even if a small weakening of the mapping security embedded in the font is produced from reducing the file size. In other words, even though a little security may be traded for a smaller font size, the resulting minor security weakness is buried in the location most technically difficult to access, the font's cmap table.
As an example, given about 1125 characters per page in a document, a document can have over 58 pages before the number of characters and glyphs available for remapping in a standard Unicode font is exhausted. Therefore, if a document uses a single Unicode font, it would take 58 pages before the security of the character mapping became less strong than a one-time pad. However, adding further fonts (thus increasing file size) allows a one-time pad level of security to be maintained for documents of any size.
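The page estimate, and the number of fonts needed for a longer document, follow from simple arithmetic. In this Python sketch the two constants are the figures given in the text, while `fonts_needed` and the 200-page document are hypothetical illustrations.

```python
import math

GLYPH_SLOTS = 65_535      # glyph slots per font, as stated in the text
CHARS_PER_PAGE = 1_125    # approximate characters per page, as stated in the text

pages_per_font = GLYPH_SLOTS / CHARS_PER_PAGE      # about 58.25 pages

def fonts_needed(total_chars: int) -> int:
    """Fonts required so that every character instance gets its own glyph slot."""
    return max(1, math.ceil(total_chars / GLYPH_SLOTS))

# Hypothetical 200-page document: 225,000 characters.
fonts_for_200_pages = fonts_needed(200 * CHARS_PER_PAGE)
```

Under these figures a single font covers a little over 58 pages, and the hypothetical 200-page document would need four fonts to keep one glyph slot per character instance.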
FIG. 3 illustrates the generation of a plurality of character mappings, in accordance with an embodiment of the invention. Specifically, FIG. 3 illustrates character mappings 310, 320, and 330. As part of character mapping 310, character code 311 is mapped to glyph 312. As part of character mapping 320, character code 321 is mapped to glyph 322. As part of character mapping 330, character code 331 is mapped to glyph 332. As illustrated in FIG. 3, character code 311 is a character code of the character “q,” character code 321 is a character code of the character “r,” and character code 331 is a character code of the character “s.” As illustrated in FIG. 3, glyphs 312, 322, and 332 are each a glyph representation of the character “c.” Thus, according to the embodiment, each instance of a glyph representation maps to a different character code, and likewise, each instance of a character code maps to a different glyph representation. Thus, a one-time pad character mapping can be produced, where the one-time pad character mapping is proven to be unbreakable. Further, a “not-quite one-time pad” character mapping can alternatively be produced, where most instances of a glyph representation map to a different character, but certain instances of a glyph representation map to an identical character. Likewise, most instances of a character map to a different glyph representation, but certain instances of a character map to an identical glyph representation.
FIG. 4 illustrates a flow diagram of the functionality of a document font generation module (such as document font generation module 16 of FIG. 1), according to an embodiment of the invention. In one embodiment, the functionality of the flow diagram of FIG. 4, described below, is implemented by software stored in a memory or some other computer-readable or tangible medium, and executed by a processor. In other embodiments, the functionality may be performed by hardware (e.g., through the use of an application specific integrated circuit (“ASIC”), a programmable gate array (“PGA”), a field programmable gate array (“FPGA”), etc.), or any combination of hardware and software.
The flow begins and proceeds to 410. At 410, glyph data associated with each font of the one or more fonts is created, where the glyph data produces one or more glyphs that are displayed within the document. In certain embodiments, the glyph data includes a plurality of coordinates that define a shape for the one or more glyphs. In other embodiments, the glyph data includes a plurality of byte code instructions that define a shape for the one or more glyphs. In yet other embodiments, the glyph data includes both a plurality of coordinates that define a shape for the one or more glyphs, and a plurality of byte code instructions that define a shape for the one or more glyphs. In some of these embodiments, the plurality of byte code instructions can alter a shape defined by the plurality of coordinates. The flow then proceeds to 420.
At 420, the glyph data is modified, where the modified glyph data produces one or more modified glyphs, and where each modified glyph is substantially identical to a corresponding glyph when displayed within the document. In certain embodiments, the glyph data is modified by modifying at least one coordinate of the plurality of coordinates. In some of these embodiments, at least one coordinate is modified by modifying the position of at least one coordinate. In some of these embodiments, the position of at least one coordinate is modified by less than 1/1000th of an em-square of each glyph of the one or more glyphs. In other embodiments, the glyph data is modified by modifying at least one byte code instruction of the plurality of byte code instructions. In some of these embodiments, one or more byte code instructions are removed. In other embodiments, one or more byte code instructions are added. In yet other embodiments, one or more byte code instructions are removed, and one or more byte code instructions are added. The flow then proceeds to 430.
At 430, one or more character mappings are created, where each character mapping maps a unique character code of one or more unique character codes to a modified glyph of the one or more modified glyphs. Thus, one or more instances of a character in the document are replaced with a unique character code of the one or more unique character codes. In some embodiments, all of the one or more unique character codes are mapped to different modified glyphs. In other embodiments, some of the one or more unique character codes are mapped to different modified glyphs. The flow then proceeds to 440.
At 440, the document and the one or more fonts are delivered. In certain embodiments, the one or more fonts are embedded fonts that are embedded within the document. In some embodiments, the document is a PDF document. In other embodiments, the document is an HTML document. The flow then ends.
In certain embodiments, a product can produce documents (such as PDF documents and HTML documents) with deliverable fonts (such as embedded fonts) using the process previously described. The product can be a document export product that provides functionality for exporting documents that include deliverable fonts. The product can also be a document creation product, where the process is part of a conversion process that creates a document.
Thus, according to an embodiment, documents with deliverable fonts specially created with character mappings that are not reversible by any known technology can be generated. The low-level source of the document itself would not be human-readable, or machine-searchable, but the document can be displayed as intended within an appropriate document view. This process can protect against current reverse-engineering techniques that are used against scrambled character mapping tables, and can protect against reverse-engineering techniques developed in the future. By providing protection at a level of a deliverable font, a customer can provide their documents in common formats delivered in traditional manners with no special plug-ins or software required. Thus, according to the embodiment, not only are documents produced that include a higher level of security, but the documents can be in common formats, such as PDF and HTML.
The features, structures, or characteristics of the invention described throughout this specification may be combined in any suitable manner in one or more embodiments. For example, the usage of “one embodiment,” “some embodiments,” “a certain embodiment,” “certain embodiments,” or other similar language, throughout this specification refers to the fact that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present invention. Thus, appearances of the phrases “one embodiment,” “some embodiments,” “a certain embodiment,” “certain embodiments,” or other similar language, throughout this specification do not necessarily all refer to the same group of embodiments, and the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
One having ordinary skill in the art will readily understand that the invention as discussed above may be practiced with steps in a different order, and/or with elements in configurations which are different than those which are disclosed. Therefore, although the invention has been described based upon these preferred embodiments, it would be apparent to those of skill in the art that certain modifications, variations, and alternative constructions would be apparent, while remaining within the spirit and scope of the invention. In order to determine the metes and bounds of the invention, therefore, reference should be made to the appended claims.

Claims (20)

We claim:
1. A non-transitory computer-readable medium having instructions stored thereon that, when executed by a processor, cause the processor to generate a font for a document, the generating comprising:
creating glyph data associated with the font, wherein the glyph data produces a glyph that is displayed within the document;
modifying the glyph data, wherein the modified glyph data produces a modified glyph, and wherein the modified glyph is visually substantially identical to the unmodified glyph when displayed within the document;
creating a particular character mapping, wherein the character mapping maps a unique character code to the modified glyph;
replacing a first instance of a particular character in the document with a first unique character code, wherein the first unique character code is mapped, using the particular character mapping, to a first modified glyph; and
replacing a second instance of the particular character in the document with a second unique character code, wherein the second unique character code is mapped, using the particular character mapping, to a second modified glyph.
2. The non-transitory computer-readable medium of claim 1, wherein the glyph data comprises a plurality of coordinates that define a shape for the glyph, and wherein the modifying the glyph data comprises modifying at least one coordinate of the plurality of coordinates.
3. The non-transitory computer-readable medium of claim 2, wherein the modifying at least one coordinate of the plurality of coordinates comprises modifying a position of the at least one coordinate of the plurality of coordinates.
4. The non-transitory computer-readable medium of claim 3, wherein the modifying the position of the at least one coordinate of the plurality of coordinates comprises modifying the position by less than 1/1000th of an em-square of the glyph.
5. The non-transitory computer-readable medium of claim 1, wherein the glyph data comprises a plurality of byte code instructions that define a shape for the glyph, and wherein the modifying the glyph data comprises modifying at least one byte code instruction of the plurality of byte code instructions.
6. The non-transitory computer-readable medium of claim 1, wherein the font is an embedded font that is embedded within the document.
7. The non-transitory computer-readable medium of claim 1, wherein the document is a portable document format document.
8. The non-transitory computer-readable medium of claim 1, wherein the document is a hypertext markup language document.
9. The non-transitory computer-readable medium of claim 1, the generating further comprising delivering the document and the font.
10. The non-transitory computer-readable medium of claim 1, wherein the character mapping maps a unique character code to each character in the document.
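The steps recited in claims 1 and 10 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that glyph data is a plain list of (x, y) outline points, that unique character codes are drawn from the Unicode Private Use Area starting at U+E000, and that a one-unit shift of a single point stands in for the "modifying". The names `ObfuscatedFont`, `obfuscate`, `PUA_START`, and `nudge` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field

# Assumption: unique codes come from the Private Use Area (U+E000..U+F8FF),
# so this toy handles at most 6400 character instances.
PUA_START = 0xE000


@dataclass
class ObfuscatedFont:
    """Toy font: a character mapping from unique code to modified outline."""
    cmap: dict = field(default_factory=dict)


def obfuscate(text, base_outlines, nudge=1):
    """Replace every character instance in `text` with its own unique code.

    base_outlines maps each character to a list of (x, y) outline points.
    Returns the rewritten text and the font holding the character mapping.
    """
    font = ObfuscatedFont()
    seen = Counter()                      # instances seen so far, per character
    out = []
    for i, ch in enumerate(text):
        code = PUA_START + i              # a fresh code for this instance
        points = list(base_outlines[ch])  # copy of the unmodified glyph data
        k = seen[ch] % len(points)        # move a different point each time
        x, y = points[k]
        points[k] = (x + nudge, y)        # small shift yields the modified glyph
        font.cmap[code] = points          # unique code -> modified glyph
        seen[ch] += 1
        out.append(chr(code))
    return "".join(out), font


square = [(0, 0), (500, 0), (500, 500), (0, 500)]
protected, font = obfuscate("aba", {"a": square, "b": square})
```

In this sketch the two instances of "a" end up with different codes and slightly different outlines, so text copied out of the document would carry only Private Use Area codes rather than the original letters.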
11. A computer-implemented method for generating a font for a document, the computer-implemented method comprising:
creating glyph data associated with the font, wherein the glyph data produces a glyph that is displayed within the document;
modifying the glyph data, wherein the modified glyph data produces a modified glyph, and wherein the modified glyph is visually substantially identical to the unmodified glyph when displayed within the document;
creating a particular character mapping, wherein the character mapping maps a unique character to the modified glyph;
replacing a first instance of a particular character in the document with a first unique character code, wherein the first unique character code is mapped, using the particular character mapping, to a first modified glyph; and
replacing a second instance of the particular character in the document with a second unique character code, wherein the second unique character code is mapped, using the particular character mapping, to a second modified glyph.
12. The computer-implemented method of claim 11, wherein the glyph data comprises a plurality of coordinates that define a shape for the glyph, and wherein the modifying the glyph data comprises modifying at least one coordinate of the plurality of coordinates.
13. The computer-implemented method of claim 12,
wherein the modifying at least one coordinate of the plurality of coordinates comprises modifying a position of the at least one coordinate of the plurality of coordinates; and
wherein the modifying the position of the at least one coordinate of the plurality of coordinates comprises modifying the position by less than 1/1000th of an em-square of the glyph.
14. The computer-implemented method of claim 11, wherein the glyph data comprises a plurality of byte code instructions that define a shape for the glyph, and wherein the modifying the glyph data comprises modifying at least one byte code instruction of the plurality of byte code instructions.
15. The computer-implemented method of claim 11, wherein the character mapping maps a unique character code to each character in the document.
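The bound in claims 4, 13, and 18 (a shift of less than 1/1000th of the em-square) can be worked through with a short sketch. It assumes that outline coordinates are whole font units, that the bound applies to each axis separately, and that the em size is given as units per em. The helper names `max_shift_units` and `nudge_point` are hypothetical.

```python
def max_shift_units(units_per_em):
    """Largest whole-unit shift that is strictly less than 1/1000 of the em."""
    return (units_per_em - 1) // 1000


def nudge_point(points, index, dx, dy, units_per_em):
    """Return a copy of `points` with one point moved by (dx, dy).

    Assumption: the bound is checked per axis. Shifts at or beyond
    1/1000 of the em are rejected.
    """
    limit = max_shift_units(units_per_em)
    if abs(dx) > limit or abs(dy) > limit:
        raise ValueError("shift is not below 1/1000 of the em")
    moved = list(points)
    x, y = moved[index]
    moved[index] = (x + dx, y + dy)
    return moved


# With 2048 units per em, 1/1000 em is 2.048 units, so shifts of 1 or 2 fit.
shifted = nudge_point([(0, 0), (10, 10)], 1, 2, -1, 2048)
```

Under these assumptions a font with 1000 units per em leaves no room for a whole-unit shift, since one unit is exactly 1/1000 of the em, while a 2048-unit em allows shifts of one or two units.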
16. A system, comprising:
a memory configured to store one or more instructions;
a processor configured to execute the one or more instructions;
a glyph data creation module, when stored within the memory and executed by the processor, configured to create glyph data associated with a font, wherein the glyph data produces a glyph that is displayed within a document;
a glyph data modification module, when stored within the memory and executed by the processor, configured to modify the glyph data, wherein the modified glyph data produces a modified glyph, and wherein the modified glyph is visually substantially identical to the unmodified glyph when displayed within the document; and
a character mapping creation module configured to create a particular character mapping, wherein the character mapping maps a unique character code to the modified glyph;
wherein the character mapping creation module, when stored within the memory and executed by the processor, is further configured to replace a first instance of a particular character in the document with a first unique character code, wherein the first unique character code is mapped, using the particular character mapping, to a first modified glyph; and
wherein the character mapping creation module, when stored within the memory and executed by the processor, is further configured to replace a second instance of the particular character in the document with a second unique character code, wherein the second unique character code is mapped, using the particular character mapping, to a second modified glyph.
17. The system of claim 16, wherein the glyph data comprises a plurality of coordinates that define a shape for the glyphs, and wherein the glyph data modification module is further configured to modify at least one coordinate of the plurality of coordinates.
18. The system of claim 17,
wherein the glyph data modification module is further configured to modify a position of the at least one coordinate of the plurality of coordinates; and
wherein the glyph data modification module is further configured to modify the position by less than 1/1000th of an em-square of the glyph.
19. The system of claim 16, wherein the glyph data comprises a plurality of byte code instructions that define a shape for the glyph, and wherein the modifying the glyph data comprises modifying at least one byte code instruction of the plurality of byte code instructions.
20. The system of claim 16, wherein the character mapping maps a unique character code to each character in the document.
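Claims 5, 14, and 19 recite modifying a byte code instruction instead of a coordinate. The claims do not say which instruction changes, so the sketch below shows only one assumed possibility: appending TrueType push-and-pop pairs, which leave the interpreter stack as they found it, to a glyph's instruction program. The opcodes 0xB0 (PUSHB[0]) and 0x21 (POP) are from the TrueType instruction set. The function `pad_instructions` is hypothetical, and the sketch does not check the font's declared stack or instruction-size limits.

```python
PUSHB_1 = 0xB0  # TrueType PUSHB[0]: push the single byte that follows
POP = 0x21      # TrueType POP: discard the top stack element


def pad_instructions(program: bytes, copies: int) -> bytes:
    """Append `copies` push/pop pairs to a glyph instruction program.

    Each pair pushes a zero byte and pops it again, so the stack is
    unchanged while the stored byte sequence differs per copy count.
    """
    if copies < 0:
        raise ValueError("copies must be non-negative")
    return program + bytes([PUSHB_1, 0x00, POP]) * copies


# Two variants of an (empty) instruction program that differ only in padding.
variant_a = pad_instructions(b"", 1)
variant_b = pad_instructions(b"", 2)
```

Each variant could then back a different unique character code, giving glyph data that differs at the byte level while the drawn outline is left alone.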
US13/550,696 2012-07-17 2012-07-17 Electronic document that inhibits automatic text extraction Active 2034-02-19 US9442898B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/550,696 US9442898B2 (en) 2012-07-17 2012-07-17 Electronic document that inhibits automatic text extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US13/550,696 US9442898B2 (en) 2012-07-17 2012-07-17 Electronic document that inhibits automatic text extraction

Publications (2)

Publication Number Publication Date
US20140022260A1 US20140022260A1 (en) 2014-01-23
US9442898B2 true US9442898B2 (en) 2016-09-13

Family

ID=49946163

Family Applications (1)

Application Number Title Priority Date Filing Date
US13/550,696 Active 2034-02-19 US9442898B2 (en) 2012-07-17 2012-07-17 Electronic document that inhibits automatic text extraction

Country Status (1)

Country Link
US (1) US9442898B2 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8867741B2 (en) * 2012-04-13 2014-10-21 Xerox Corporation Mobile field level encryption of private documents
US20150169508A1 (en) * 2013-12-13 2015-06-18 Konica Minolta Laboratory U.S.A., Inc. Obfuscating page-description language output to thwart conversion to an editable format
US9652669B2 (en) * 2014-09-16 2017-05-16 Lenovo (Singapore) Pte. Ltd. Reflecting handwriting attributes in typographic characters
US10402471B2 (en) * 2014-09-26 2019-09-03 Guy Le Henaff Method for obfuscating the display of text

Citations (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5338043A (en) * 1989-07-13 1994-08-16 Rehm Peter H Cryptographic guessing game
US5526477A (en) * 1994-01-04 1996-06-11 Digital Equipment Corporation System and method for generating glyphs of unknown characters
US5715473A (en) * 1992-12-29 1998-02-03 Apple Computer, Inc. Method and apparatus to vary control points of an outline font to provide a set of variations for the outline font
US6504545B1 (en) * 1998-03-27 2003-01-07 Canon Kabushiki Kaisha Animated font characters
US20040001606A1 (en) * 2002-06-28 2004-01-01 Levy Kenneth L. Watermark fonts
US20050005122A1 (en) * 2001-10-01 2005-01-06 Abraham Nigel Christopher Optical encoding
US6993662B2 (en) * 1998-06-14 2006-01-31 Finjan Software Ltd. Method and system for copy protection of displayed data content
US20060171588A1 (en) * 2005-01-28 2006-08-03 Microsoft Corporation Scalable hash-based character recognition
US20070200852A1 (en) * 2006-02-28 2007-08-30 Cisco Technology, Inc. Method to protect display text from eavesdropping
US20080028304A1 (en) * 2006-07-25 2008-01-31 Monotype Imaging, Inc. Method and apparatus for font subsetting
US20080049023A1 (en) * 2006-08-22 2008-02-28 Monotype Imaging, Inc. Method for reducing size and increasing speed for font generation of instructions
US7420692B2 (en) * 2003-07-11 2008-09-02 Sharp Laboratories Of America, Inc. Security font system and method for generating traceable pages in an electronic document
US20080301431A1 (en) * 2007-06-01 2008-12-04 Hea Young Sun Text security method
US20090109227A1 (en) * 2007-10-31 2009-04-30 Leroy Luc H System and method for independent font substitution of string characters
US20100164984A1 (en) * 2008-12-31 2010-07-01 Shantanu Rane Method for Embedding Messages into Documents Using Distance Fields
US20110188761A1 (en) * 2010-02-02 2011-08-04 Boutros Philip Character identification through glyph data matching
US20110203000A1 (en) 2010-02-16 2011-08-18 Extensis Inc. Preventing unauthorized font linking
US20120001922A1 (en) * 2009-01-26 2012-01-05 Escher Marc System and method for creating and sharing personalized fonts on a client/server architecture
US20120260108A1 (en) * 2011-04-11 2012-10-11 Steve Lee Font encryption and decryption system and method
US8330760B1 (en) * 2009-05-26 2012-12-11 Adobe Systems Incorporated Modifying glyph outlines
US20130027406A1 (en) * 2011-07-29 2013-01-31 International Business Machines Corporation System And Method For Improved Font Substitution With Character Variant Replacement
US8402371B2 (en) * 2008-03-18 2013-03-19 Crimsonlogic Pte Ltd Method and system for embedding covert data in text document using character rotation
US20130174017A1 (en) * 2011-12-29 2013-07-04 Chegg, Inc. Document Content Reconstruction
US8762828B2 (en) * 2011-09-23 2014-06-24 Guy Le Henaff Tracing an electronic document in an electronic publication by modifying the electronic page description of the electronic document
US9081529B1 (en) * 2012-06-22 2015-07-14 Amazon Technologies, Inc. Generation of electronic books

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
Adobe Font Development Kit, Adobe Systems Incorporated, Copyright 2012, http://www.adobeaudience.com/devnet/opentype/afdko.html (last visited Jul. 16, 2012).
Apple TrueType Reference Manual, Apple Computer Inc., Copyright 1997-2002, https://developer.apple.com/fonts/TTRefMan/index.html (last visited Jul. 16, 2012).
Apple TrueType Reference Manual, Chapter 1, Digitizing Letterform Designs, Apple Inc., Copyright 2011, https://developer.apple.com/fonts/TTRefMan/RM01/Chap1.html (last visited Jul. 16, 2012).
Converting Outlines to True Type Format, Apple Inc., Copyright 2011, http://developer.apple.com/fonts/TTRefMan/RM08/appendixE.html (last visited Jul. 16, 2012).
OpenType Font Specification, Chapter 1, TrueType Fundamentals, Microsoft Corporation, Copyright 1997, http://www.microsoft.com/typography/otspec/TTCH01.htm (last visited Jul. 16, 2012).
OpenType Font Specification, cmap-Character to Glyph Index Mapping Table, Microsoft Corporation, Copyright 2008, http://www.microsoft.com/typography/otspec/cmap.htm (last visited Jul. 16, 2012).
OpenType Font Specification, Microsoft Corporation, Copyright 2009, http://www.microsoft.com/typography/otspec/ (last visited Jul. 16, 2012).

Also Published As

Publication number Publication date
US20140022260A1 (en) 2014-01-23

Similar Documents

Publication Publication Date Title
CN110447035B (en) User content obfuscation in structured user data files
CN107239713B (en) Sensitive content data information protection method and system
US9754120B2 (en) Document redaction with data retention
US8065739B1 (en) Detecting policy violations in information content containing data in a character-based language
US6324555B1 (en) Comparing contents of electronic documents
CN117195307A (en) Configurable annotations for privacy-sensitive user content
CN109740317A (en) A kind of digital finger-print based on block chain deposits card method and device
US10534931B2 (en) Systems, devices and methods for automatic detection and masking of private data
CN110245469B (en) Webpage watermark generation method, watermark analysis method, device and storage medium
US9237136B2 (en) Mapping a glyph to character code in obfuscated data
Heather Turnitoff: Identifying and fixing a hole in current plagiarism detection software
US20110188761A1 (en) Character identification through glyph data matching
US9442898B2 (en) Electronic document that inhibits automatic text extraction
US10706160B1 (en) Methods, systems, and articles of manufacture for protecting data in an electronic document using steganography techniques
US20140212040A1 (en) Document Alteration Based on Native Text Analysis and OCR
CN111859853A (en) Webpage text encryption and decryption method based on random font
CN106789856A (en) A kind of information coding method, coding/decoding method and device
CN114626079A (en) File viewing method, device, equipment and storage medium based on user permission
US9081529B1 (en) Generation of electronic books
CN117725333A (en) Text steganography method and device for browser webpage and electronic equipment
KR20150044430A (en) Document processing system, electronic document, document processing method, and program
CN112464180A (en) Page screenshot outgoing control method and system, electronic device and storage medium
TWI664849B (en) Method, computer program product and processing system for generating secure alternative representation
US10403392B1 (en) Data de-identification methodologies
CN115982675A (en) Document processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
AS Assignment

Owner name: ORACLE INTERNATIONAL CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ATTEBERRY, TRACY;LIPPY, HARRY SHAUN;REEL/FRAME:028563/0731

Effective date: 20120711

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8