US20140101456A1 - Confidentiality preserving document analysis system and method - Google Patents
- Publication number
- US20140101456A1 (application US 13/648,462)
- Authority
- US
- United States
- Prior art keywords
- document
- encoded
- tokens
- text
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/146—Coding or compression of tree-structured data
- G06F40/143—Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
- G06F40/149—Adaptation of the text data for streaming purposes, e.g. Efficient XML Interchange [EXI] format
Definitions
- the exemplary embodiment relates to the field of electronic document processing. It finds particular application in connection with electronic document format conversion and in particular with processing documents formatted in an unstructured or semi-structured format, and will be described with particular reference thereto. However, it is to be appreciated that the following is amenable to other like applications.
- Legacy documents are frequently either stored as Adobe portable document format (PDF) files or scanned from hard copies into PDF.
- Other common formats are image formats such as portable network graphics format (png), graphics interchange format (gif), and the like.
- Legacy documents may also be in word processing formats or other, possibly proprietary, formats.
- the target formats are often XML, SGML, or HTML, which allows easy conversion into other structured formats, e.g., the epub format for ebook readers. Reformatting into the structured document may entail segmenting the document by finding paragraph divisions and generating a table of contents, information that is not readily available from unstructured scanned documents or PDF documents.
- a system for processing an encoded source document includes memory which receives an encoded source document from an associated client system, the encoded source document comprising structural information and encoded content information, the encoded content information comprising a plurality of encoded tokens generated by individually encoding each of a plurality of text tokens of the source document, the structural information comprising location information for each of the plurality of text tokens.
- a processing module processes the encoded document, without decoding the encoded tokens, to generate a modified document.
- a transmission module transmits the modified document to an associated client system whereby the client system is able to generate a transformed document based on the modified document and the plurality of text tokens.
- a processor implements one or more of the modules.
- a method of processing an encoded source document includes receiving an encoded source document from an associated client system, the encoded source document including structural information and encoded content information.
- the encoded content information includes a plurality of encoded tokens, the encoded tokens having been generated by individually encoding each of a plurality of text tokens of the source document to a respective value.
- the structural information includes location information for each of the plurality of tokens.
- the encoded document is processed using a processor, without decoding the encoded tokens, to generate a modified document.
- the modified document is transmitted to an associated client system, whereby the client system is able to generate a transformed document based on the modified document and the plurality of text tokens.
- a computer system includes memory which receives a source document.
- the source document includes content information and structural information.
- the content information includes a plurality of text tokens.
- the structural information includes location information for each of the plurality of text tokens.
- An encoding module encodes each of the plurality of text tokens individually to produce an encoded document including encoded tokens.
- a transmission module transmits the encoded document to an associated service provider for processing the encoded document to generate a modified document.
- the encoded document being transmitted in a manner which allows the service provider to access at least a part of the structural information without decoding the encoded tokens.
- a reconstruction module generates a transformed document based on the modified document received from the service provider and the text tokens.
- a processor implements one or more of the modules.
- the content information includes a plurality of tokens.
- the structural information includes location information for each of the plurality of tokens.
- An encoding module encodes each of the tokens of the content information individually to produce an encoded document.
- a transmission module transmits the encoded document to an associated service provider for processing the encoded document to generate a modified document, the encoded document being transmitted in a manner which allows the service provider to access at least a part of the structural information without decoding the content information.
- a reconstruction module generates a transformed document based on the modified document received from the service provider.
- a processor implements one or more of the modules.
- a method for processing a source document includes receiving a source document, the source document including structural information and content information, the content information comprising a plurality of tokens. The method further includes encoding each of the plurality of tokens individually to produce an encoded document, and transmitting the encoded document to an associated service provider for processing the encoded document to generate a modified document, the encoded document being transmitted in a manner which allows the service provider to access at least a part of the structural information without decoding the content information. A transformed document is generated based on the modified document received from the service provider and the transformed document is output.
- FIG. 1 illustrates an example XML document containing structural information portions and content information portions
- FIG. 2 illustrates an example ancillary XML document embodied as an XML schema
- FIG. 3 illustrates a client system for encoding the content information of an XML document to be transferred to a service provider
- FIG. 4 illustrates a service provider system for processing an XML document having encoded content
- FIG. 5 is a flow chart illustrating an exemplary method for processing an XML document.
- aspects of the exemplary embodiment relate to a method and system for document encoding which allows a service provider to process a document without having access to the textual content of the document.
- the document is in a format, such as Extensible Markup Language (XML) format, that allows structural information (such as content location and page layout information) to be separated from the content information of the document (text, vector graphics, and images).
- Content information and structural information, for purposes herein, are mutually exclusive parts of the document which, in combination, form the entire document.
- the textual content information of the document can be extracted as a sequence of text elements or “tokens”, each token roughly corresponding to a word, punctuation, or number sequence.
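As a minimal sketch of the token extraction described above, the following splits a text string into word, number, and punctuation tokens. The regular expression and the name `tokenize` are illustrative assumptions; the source obtains its tokens from a document conversion tool rather than from this rule.

```python
import re

# Hypothetical rule: a token is a run of letters, a run of digits,
# or a single punctuation character. Whitespace is discarded.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+|\d+|[^\w\s]")

def tokenize(text):
    """Return the sequence of text tokens found in `text`."""
    return TOKEN_PATTERN.findall(text)

sample_tokens = tokenize("The DAILY report, issue 42.")
```

Each element of `sample_tokens` can then be encoded on its own, which is what allows the structure around the tokens to stay readable.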
- the tokens of the document can then be individually encoded by the document owner (the client).
- the document can be sent from the client to the service provider who then processes and transforms the structural data of the document, based in part on the structural information associated with the encoded tokens.
- the service provider then
- the method is particularly suited to a document format that allows decoupling the content information from the structural information.
- the content information includes content elements (e.g., characters or words forming tokens) while the structural information includes location information, such as position, width, and height of content elements, as well as page layout information, geometry, and typography, which may be generally referred to herein as metadata.
- the decoupling enables the content elements to be encrypted without encrypting the metadata or by encryption of the metadata with a different encryption method such as encryption using a separate key, allowing the metadata to be decrypted by the service provider (and operations to be performed on the metadata) without decrypting the content information, in particular, the encoded tokens.
- the client's original documents may not be in a format that separates the text content from the structural information. Accordingly, scanned or legacy documents may be minimally processed, e.g., by the client, to convert them to a format that allows the text of the document to be automatically processed separately from the structural information associated with the text.
- document is used herein to refer to all versions of a document, which may be the result of the document being normalized, encoded, encrypted, modified, and/or otherwise processed in the exemplary method described herein.
- FIG. 1 illustrates an example input (source) document 2 in a structured format which may be processed using the exemplary method.
- the document 2 is in the form of a tree with a root node corresponding to the entire document and increasingly lower level nodes corresponding to page, column, paragraph, text (approximately a line of text or a part of a line) and, at the lowest level, token nodes 3 , which include the textual content 4 of the document.
- the rest of the information constitutes the structural information 5 .
- the illustrated structural information includes tags 6 , location information 8 , typographical information 10 , and page layout information 12 , although it is to be appreciated that structural information is not limited to such information.
- each node of the tree-structured document is identified by a respective one of set 6 of tags (DOCUMENT, PAGE, COLUMN, PARAGRAPH, TEXT, TOKEN).
- the tags form part of the structural information portion of the document and identify the respective type of each of the nodes of the tree-structured document 2 .
- each tag is recognizable by arrowheads < >.
- Each token node 3 includes a single text token 4 , which generally includes one or more characters and is contained within arrowheads > <. Although only one token node including text content 4 (the word “DAILY”) is shown in FIG. 1 , a document may include many text tokens, roughly one per word (in the case of documents converted using OCR, there may be some errors in separating tokens, resulting in words being occasionally split into two or more tokens or two or more words combined into a single token).
- Although the exemplary content is shown as text 4 , it is to be appreciated that the content portion may also include images and/or graphics.
- Each text token 4 has location information 8 associated with it (here, incorporated into the token node 3 ), indicating the location of the token relative to a predefined location on the page, such as x,y coordinates relative to the top left corner of the page, as well as typographical information 10 , such as font name, font size, font color, bold/not bold, italic/not italic, underline/not underline, and the like.
- Page layout information 12 in this example, is associated with the PAGE tag within a page level node 14 . While not shown, other nodes may also have structural information associated with them.
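The separation of content from structure described above can be sketched as follows. The tag names follow the text (DOCUMENT, PAGE, COLUMN, PARAGRAPH, TEXT, TOKEN), but the attribute names (`x`, `y`, `width`, `height`, `font-size`) and the sample values are assumptions made for illustration, not the layout of FIG. 1.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; attribute names are illustrative only.
SAMPLE = """<DOCUMENT>
  <PAGE width="612" height="792">
    <COLUMN>
      <PARAGRAPH>
        <TEXT>
          <TOKEN x="72.0" y="90.5" width="48.0" height="12.0" font-size="12">DAILY</TOKEN>
          <TOKEN x="124.0" y="90.5" width="40.0" height="12.0" font-size="12">NEWS</TOKEN>
        </TEXT>
      </PARAGRAPH>
    </COLUMN>
  </PAGE>
</DOCUMENT>"""

def split_content_and_structure(xml_text):
    """Return (content, structure): token texts and their numeric attributes."""
    root = ET.fromstring(xml_text)
    content, structure = [], []
    for token in root.iter("TOKEN"):
        content.append(token.text)
        structure.append({name: float(value) for name, value in token.attrib.items()})
    return content, structure

content, structure = split_content_and_structure(SAMPLE)
```

Here `content` holds only the text tokens, while `structure` holds only location and typographical values, so the two parts can be handled separately.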
- Although the format of the document 2 in FIG. 1 uses the standard XML format, any markup language format that separates textual content 4 from the structural information 5 may be employed.
- an ancillary document may be provided that defines the structure of the document, called a schema or Document Type Definition (DTD).
- An example schema 16 for the document of FIG. 1 is shown in FIG. 2 .
- the schema defines the tags DOCUMENT 20 , PAGE 22 , COLUMN 24 , PARAGRAPH 26 , TEXT 28 , and TOKEN 30 .
- Some of these tags have attributes 32 . Examples of attributes are “a-Size” and “a-BoundingBox” which denote the size and location of the content element (such as TOKEN 30 ).
- the attribute elements have typed values, as shown in the attribute section 34 .
- the corresponding structural information 5 can be extracted from the document 2 . In other embodiments, the structural information and content information in the document 2 can be extracted/distinguished without the need for a schema.
- The format of FIGS. 1 and 2 is a simplified version of the XML structure produced by the open-source tool “PDF-to-XML.” As noted above, a token contains, roughly speaking, one word.
- the structural information 5 provided in the XML format includes the location information 8 (e.g., size and position, rotation) of each text element 4 .
- a bounding box which encompasses each text element can be defined, based on the location information. Since the metadata 8 contains the size and position, the textual content 4 , corresponding to the image data from a scanned document is no longer needed, and need not be sent to the service provider for processing. That is, the image of the original, confidential document may be kept secret from the service provider and only the structural information 5 sent, optionally together with the appropriate schema 16 .
- the document can be processed by the service provider, which may include modifying some of the structural information, such as changing the location of text elements in the document, while the actual text content is not revealed to the service provider. Since the metadata is rarely considered as confidential information, the tags, etc., of the document and the schema may be revealed to the service provider, although they may be encrypted for transmission to the service provider.
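To illustrate how structural processing can proceed without access to the text, the sketch below computes bounding boxes and a simple reading order from location information alone. The class `EncodedToken`, the function `reading_order`, and the fixed `line_height` bucketing are hypothetical simplifications, not the service provider's actual method.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncodedToken:
    value: str     # opaque encoded token; never decoded in this sketch
    x: float       # left edge, relative to the top left corner of the page
    y: float       # top edge
    width: float
    height: float

    def bounding_box(self):
        """Return (left, top, right, bottom) for this token."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)

def reading_order(tokens, line_height=12.0):
    """Order tokens top-to-bottom, then left-to-right, using coordinates only."""
    return sorted(tokens, key=lambda t: (int(t.y // line_height), t.x))

scrambled = [
    EncodedToken("h3", 72.0, 110.0, 30.0, 12.0),
    EncodedToken("h2", 124.0, 90.5, 40.0, 12.0),
    EncodedToken("h1", 72.0, 90.5, 48.0, 12.0),
]
ordered = reading_order(scrambled)
```

The ordering depends only on `x` and `y`, so the `value` field could hold any encoded string without affecting the result.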
- the text tokens 4 of the XML document 2 are individually encoded and each encoded token may be stored within the respective token placeholders > < in the transmitted document, without changing the structural information to account for a change in size of the encoded token 38 .
- the encoding may be performed using symmetric or asymmetric encryption or by hashing the tokens, e.g. using a cryptographic hash function, such as the Message-Digest Algorithm (e.g., MD5).
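A minimal sketch of the hashing option, assuming each token is hashed independently with MD5 (the example named above) via Python's standard `hashlib`. The function names are illustrative.

```python
import hashlib

def encode_token(token):
    """Hash a single text token to a hexadecimal digest."""
    return hashlib.md5(token.encode("utf-8")).hexdigest()

def encode_tokens(tokens):
    """Encode each token individually, preserving order."""
    return [encode_token(token) for token in tokens]

encoded = encode_tokens(["cat", "dog", "cat"])
```

Because each token is encoded on its own, repeated words map to the same digest, which is what lets repeated content be recognized without being read.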
- FIG. 1 illustrates an exemplary text encoded document 36 which may be formed by encoding only the text tokens 4 to provide encoded tokens 38 .
- the entire document 36 may be globally encoded for transmission to form a fully-encoded document.
- the encryption of the text encoded document 36 may be performed using any suitable encryption method, such as HTTPS (Hypertext Transfer Protocol, HTTP, over Secure Sockets Layer, SSL, or Transport Layer Security, TLS) or application specific protocols, such as globally encrypting the document using an asymmetric or symmetric encryption scheme during the exchange.
- the shared key is exchanged, for example, using a preliminary key exchange step which encrypts the shared key using an asymmetric encryption scheme.
- the encoding scheme (encryption or hashing) may be agreed upon in advance, or a protocol for choosing an encryption/hashing scheme during each document transfer may be agreed upon.
- each text element 4 is given an identifier, which is inserted in its place in the respective token placeholder, without changing the location information, and the text token 4 is extracted from the document and stored in a list, indexed by the identifier.
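The identifier-based variant can be sketched as below. The `id<N>` placeholder format and the function names are assumptions for illustration; the only point taken from the text is that each token is replaced by an identifier and kept in a list indexed by that identifier.

```python
def substitute_identifiers(tokens):
    """Replace each token with an identifier; keep the originals client-side."""
    lookup = []         # stays with the client, indexed by identifier
    placeholders = []   # what would be sent in the token placeholders
    for token in tokens:
        identifier = len(lookup)
        lookup.append(token)
        placeholders.append(f"id{identifier}")
    return placeholders, lookup

def restore(placeholders, lookup):
    """Map placeholders (possibly reordered by the provider) back to tokens."""
    return [lookup[int(placeholder[2:])] for placeholder in placeholders]

placeholders, lookup = substitute_identifiers(["the", "daily", "cat"])
restored = restore(list(reversed(placeholders)), lookup)
```

In this sketch every occurrence receives its own identifier, so identical words are not linked to each other in the transmitted document.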
- the exemplary system may rely on the same token encoding (encrypting or hashing) to the same value, so if tokens are processed, such as padded or salted, the processing may be consistent throughout the document when the same token is encrypted or hashed. This allows the client to keep the document text confidential and allows the service provider to deliver a service to the client without the need for disclosing proprietary software to the client.
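One way to keep salting consistent within a document, sketched under the assumption that a single secret key per document is acceptable, is a keyed hash. HMAC-SHA256 is a substitution chosen here for illustration; the text itself only requires that the same token encode to the same value.

```python
import hashlib
import hmac

def encode_token(token, document_key):
    """Keyed hash of one token; the same key and token give the same value."""
    return hmac.new(document_key, token.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical per-document keys held only by the client.
key_a = b"document-a-secret"
key_b = b"document-b-secret"

first = encode_token("cat", key_a)
second = encode_token("cat", key_a)
other_document = encode_token("cat", key_b)
```

Within one document the encoding is repeatable, while a different key yields unrelated values for the same word.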
- a network computing system includes a client computing system (“client system”) 100 for encoding a customer's XML documents 2 and a service provider computing system (“provider system”) 200 for processing documents 36 with encoded content.
- the two systems 100 , 200 may be entirely independently and separately controlled by separate entities, i.e., the client and the service provider.
- the systems 100 , 200 can each include one or more computing devices that each include a respective network interface device 102 , 202 for receiving and outputting documents across a network.
- Systems 100 , 200 also have input/output devices 104 , 204 for transmitting data to/from, for example, storage units such as disks or user interface devices.
- client system 100 is connected to document storage device 130 .
- System 100 receives as input at least one unencrypted source XML document 2 , an optional ancillary XML document 16 , and an optional encryption key or keys 40 .
- System 100 may receive an encoded, modified XML document 42 , output from the service provider 200 , which is decoded by the client device 100 to produce a transformed document 44 .
- System 200 receives as input an encoded XML document 36 and optional ancillary XML document 16 .
- inputs 2 , 16 , 40 , 36 , and 42 may be produced by and/or stored in the respective systems 100 , 200 themselves.
- the encryption key 40 may be generated as needed.
- the input source document 2 and ancillary document 16 are XML documents.
- input documents 2 , 16 , 36 , and/or 42 may be any markup language document, e.g., a document representing a logical tree structure, wherein structural information may be separated from the content information.
- the ancillary XML document 16 may not be needed, may already reside on the provider system 200 , or may be input into the provider system 200 separately. In such instances, there may be no ancillary XML document 16 input into client system 100 .
- the ancillary XML document 16 may be encrypted at the client system 100 with an asymmetric encryption key received from the provider system 200 . In this case, the service provider is trusted to the extent that the provider system 200 may decrypt the ancillary XML document 16 (but not the content 4 of source XML document 2 ) with a private key corresponding to the public encryption key.
- the systems 100 , 200 may process more than one source XML document 2 and/or ancillary XML document 16 at a time, either in parallel or serially.
- a separate encryption key 40 may be used for each of one or more documents 2 , 36 , although this is not required.
- source XML document 2 , ancillary XML document 16 , encryption key(s) 40 , and XML document with encrypted content 38 may be stored in any suitable non-transitory data memory storage device 106 , 206 , such as a disk, ROM or RAM, or may be input into systems 100 , 200 in the form of a carrier wave, e.g., via a wired or wireless network 126 , such as the Internet.
- the network interfaces 102 , 202 may include a modem link, a wired or wireless connection, USB port, floppy or hard disk receiver, transceiver portion of a cellular telephone, or the like and may be separated or combined with other components of systems 100 , 200 .
- System 100 includes data memory 106 for storing the source XML document 2 , ancillary XML document 16 , encryption key 40 , encoded XML document 36 , and any other input or intermediate data generated during processing.
- System 200 includes corresponding data memory 206 for storing encoded XML document 36 , ancillary XML document 16 , and any other input or intermediate data generated during processing.
- Main memory 108 , 208 of systems 100 , 200 stores instructions 110 , 210 , respectively for performing the exemplary method.
- main memory 108 of client system 100 stores a parser 112 (e.g. an XML document object model), a normalization module 114 , a content encoding/decoding module (encoder) 116 , a transmission encryption module 118 (which may be implemented as TLS/SSL), and a reconstruction module 119 .
- the parser 112 includes a structure detection module 120 and a content detection module 121 .
- main memory 208 of provider system 200 stores a parser 212 (e.g. an XML document object model), an XML processing module 214 , and a transmission encryption module 218 (which may be implemented as TLS/SSL).
- the parser 212 includes a structure detection module 220 and a content detection module 221 .
- It is to be appreciated that memories 106 , 206 , and 108 , 208 of the respective systems 100 , 200 may be embodied as a single memory unit, or that one or both of memories 106 , 206 , and/or 108 , 208 may comprise two or more component memory units.
- the instructions 110 , 210 are suitably executed by a corresponding digital processor 122 , 222 such as respective computer processors.
- Each digital processor 122 , 222 may be variously embodied, such as by a single core processor, a dual core processor (or more generally by a multiple core processor), a digital processor and cooperating math coprocessor, a digital controller, or the like.
- Outputs from modules 110 , 112 , 114 , 116 , 118 , 119 may be stored in memory 106 and/or output via input/output device 104 or network interface devices 102 to the provider system 200 and outputs from modules 210 , 212 , 214 , and 220 may be stored in memory 206 and/or output via input/output device 204 to client system 100 or another device such as an external computer having memory and/or a processor, optionally through a wired or wireless local or wide area network 126 such as the Internet.
- systems 100 and/or 200 are hosted by a respective server computer that is part of a distributed or cloud computing network, in which case inputs and outputs may be made via a single network interface 102 , 202 , suitably networked to a portal server.
- the software modules 112 , 114 , 116 , 118 , 119 , and 212 , 214 , 218 , of systems 100 and 200 are described briefly below. The functional characteristics of the modules are explained in greater detail with respect to the exemplary method(s) of FIGS. 5 , 6 , and 7 .
- the parser module 112 of system 100 parses the input XML documents 2 and 16 to identify the structural information portions 6 , 8 , 10 , 12 and the content information portion 4 .
- the parser module 212 of system 200 parses the XML document 36 with encoded content (after the document-level encoding has been decoded).
- the parsers 112 , 212 are each configured to construct an XML document object model (DOM).
- Parser 112 includes a structure detection module 120 to detect and store the structural information portion of source XML document 2 and modified document 42 with encrypted content.
- Parser 212 similarly includes a structure detection module 220 to detect and store the structural information portion of encoded XML document 36 with encoded content.
- Each parser also includes a respective content detection module 121 , 221 to detect and store the content portions 4 of the respective XML documents.
- the normalization module 114 of client system 100 is configured for normalization of the text content 4 , which may include one or more of: changing all letters to lower case; removing accents; removing apostrophes; removing punctuation from within word tokens; removing numbers; removing small words (e.g., those with fewer than a predefined number of characters, such as single-character tokens like “a”); and removing stop words (words in a predefined set which occur in high frequency in documents, such as “and” and “the”).
- the client and service provider may agree ahead of time on what data normalization operations will be performed by the client system, if any.
- the normalization scheme employed by the normalization module 114 may be configured to allow the service provider to process the contents, even after they have been encrypted.
- the normalization may be dependent on what type of processing is to be performed by the service provider. Removing small words, accents, apostrophes and punctuation, and lower casing allows, for example, the XML processing module 214 of provider device 200 to find near matches that might differ by only punctuation or a small word to enable table of contents detection. Once the text 4 is encoded, it becomes difficult to detect such near matches, as similar words (e.g., which differ in their case) tend to be encoded to completely unrelated values. Similarly, word tokens may include commas or periods in the tokens, causing a word to encode to a different value due to the included punctuation. That is, “cat” and “cat?” may encode to different values.
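The effect of normalization on encoding can be sketched as follows. The stop-word set, the minimum length, and the function names are illustrative assumptions; MD5 is used only because it is the hash named earlier in the text.

```python
import hashlib
import string
import unicodedata

STOP_WORDS = {"and", "the"}  # hypothetical, deliberately tiny set

def normalize(token, min_length=2):
    """Lower-case, strip accents and ASCII punctuation; None means 'drop'."""
    lowered = token.lower()
    decomposed = unicodedata.normalize("NFKD", lowered)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    stripped = "".join(c for c in no_accents if c not in string.punctuation)
    if len(stripped) < min_length or stripped in STOP_WORDS:
        return None
    return stripped

def encode(token):
    return hashlib.md5(token.encode("utf-8")).hexdigest()

raw_match = encode("cat") == encode("cat?")                                  # no normalization
normalized_match = encode(normalize("cat")) == encode(normalize("Cat?"))     # with normalization
```

Without normalization the two spellings hash to unrelated values; after normalization both reduce to the same string and therefore to the same encoded value.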
- a document may be sent to the service provider two or more times, in different normalized forms, for different processing operations to be performed.
- the content encoding/decoding module (encoder) 116 encodes the content 4 of the document 2 after it has been normalized (if any normalization is to be performed).
- the encoder 116 also decodes the content when the modified document 42 is received from the service provider.
- the reconstruction module 119 reconstructs the document, to generate the transformed (unencoded) document 44 .
- Reconstruction of the modified document 42 may be desirable if, for example, a text reordering service has been provided.
- the reconstruction module 119 may only need to replace the metadata in source document 2 with the metadata from the modified document 42 , avoiding the need to decode the modified document 42 .
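Reconstruction on the client side can be sketched with a lookup table built at encoding time, which also works when the tokens were hashed and cannot be decoded directly. The table-based approach, the `(encoded value, metadata)` pair layout, and the function names are assumptions for illustration.

```python
import hashlib

def encode(token):
    return hashlib.md5(token.encode("utf-8")).hexdigest()

def build_lookup(tokens):
    """Client-side table from encoded value back to the original token."""
    return {encode(token): token for token in tokens}

def reconstruct(modified, lookup):
    """Pair each original token with the metadata returned by the provider."""
    return [(lookup[value], metadata) for value, metadata in modified]

source_tokens = ["daily", "news"]
lookup = build_lookup(source_tokens)

# Hypothetical provider output: same encoded tokens, new order and positions.
modified = [
    (encode("news"), {"x": 72.0, "y": 90.5}),
    (encode("daily"), {"x": 124.0, "y": 90.5}),
]
transformed = reconstruct(modified, lookup)
```

The provider only ever handles the digests and coordinates; the mapping back to readable text stays with the client.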
- a transmission encryption module (a document-level encryption module) 118 may be provided by the operating system that runs on the client system 100 or may be a higher level component.
- the encryption module 118 encrypts the text-level encrypted XML document 36 before it is transmitted to protect the document metadata 6 , 8 , 10 , 12 from interception during transmission. This may be accomplished by shared key or asymmetric encryption.
- the encryption module 118 may encrypt the entire document, without regard to which part of the document is content or structure, potentially encrypting a portion of the document that is already encoded.
- the encryption module 118 also decrypts the modified document 42 when it is received from the provider system 200 .
- the provider system 200 includes a corresponding encryption module 218 which decrypts the XML document 36 to provide an XML document with encoded content 38 .
- the encryption module 218 is not configured for decoding encoded content 38 which was encrypted using a different encrypting scheme or encryption key. If the content 4 was hashed, there is generally no practical way to reconstruct the content 4 from the hash.
- the transmission encryption modules 118 , 218 are optional and may not be used for each document.
- the provider system 200 ( FIG. 4 ) also includes an XML processing module 214 , which may include one or more dedicated processing modules, illustrated as modules 240 , 242 , 244 , 246 , and 248 .
- modules 240 , 242 , 244 , 246 , and 248 perform, respectively, text segmentation, text ordering, ToC detection, header/footer detection, and image segmentation and ordering. These modules are discussed below with respect to the method of FIG. 5 .
- two or more of modules 240 , 242 , 244 , 246 , and 248 may be combined into a single module.
- components 110 , 210 , 112 , 212 , 114 , 214 , 116 , 216 , 118 , 218 , 119 , 120 , 220 , 121 , 221 , 240 , 242 , 244 , 246 , and 248 comprise software instructions stored in respective main memory 108 , 208 , which are executed by the respective computer processor 122 , 222 .
- the processor 122 , 222 such as the computer's CPU, may also control the overall operation of the respective computer systems 100 , 200 by execution of processing instructions stored in memories 108 , 208 and/or 106 , 206 .
- Hardware components 102 , 104 , 106 , 108 , and 122 may be communicatively connected by a data control bus 132 .
- a similar data control bus 232 for device 200 connects hardware components 202 , 204 , 206 , 208 , and 222 .
- systems 100 , 200 may include fewer or more components/modules while still having the same functionality.
- components 112 , 114 , 116 , 118 , 119 , 120 , and 121 may be combined to form fewer components, or may be functionally separated to form more individual components. The same may apply to components of device 200 .
- the client system 100 and encrypted XML processing service provider system 200 may each comprise one or more computing devices, such as a personal computer, PDA, laptop computer, server computer, or combination thereof.
- the systems 100 , 200 may be incorporated into an overall distributed architecture.
- Memories 106 , 108 ( 206 , 208 ) may be integral or separate and may represent any type of computer readable medium such as random access memory (RAM), read only memory (ROM), magnetic disk or tape, optical disk, flash memory, or holographic memory.
- the memories 106 , 108 ( 206 , 208 ) comprise a combination of random access memory and read only memory.
- the processor 122 ( 222 ) and memory 106 ( 206 ) and/or 108 ( 208 ) may be combined in a single chip.
- the systems 100 , 200 may output a portion or all of outputs 36 , 42 to an external output device, such as a client terminal, database system, or the like.
- the term “software” as used herein is intended to encompass any collection or set of instructions executable by a computer or other digital system so as to configure the computer or other digital system to perform the task that is the intent of the software.
- the term “software” as used herein is intended to encompass such instructions stored in a storage medium such as RAM, a hard disk, optical disk, or so forth, and is also intended to encompass so-called “firmware” that is software stored on a ROM or so forth.
- Such software may be organized in various ways, and may include software components organized as libraries, Internet-based programs stored on a remote server or so forth, source code, interpretive code, object code, directly executable code, and so forth. It is contemplated that the software may invoke system-level code or calls to other software residing on a server or other location to perform certain functions.
- FIG. 5 illustrates an integrated method for processing an XML document performed by the client system 100 and the service provider 200 remotely through a network 126 such as the Internet.
- the client and service provider may agree on a data normalization method and, optionally, a transmission encryption method.
- the method begins at S 100 .
- a source XML document 2 is received by client system 100 , and may be stored in memory 106 during processing.
- the XML document 2 may be retrieved from document storage 130 .
- an ancillary document 16 is also retrieved.
- document 2 is generated from a scanned document that is transformed into an XML document. This may involve performing optical character recognition on the scanned document and storing the document as a PDF.
- a conversion tool, such as PDF2XML, may be used to translate the document from an unstructured or minimally structured format, such as PDF, to XML.
- the document may be a legacy document translated into XML by a process specific to the format of the legacy document.
- the content portions of document 2 are identified.
- the content containing portions 4 and the structural information 6 , 8 , 10 , 12 are identified/distinguished, for example, by the structure detection module 120 and/or content detection module 121 of the parser 112 , optionally using the schema 16 .
- all portions within inverted arrows or other content-denoting symbols may be identified as text content portions 4 .
- the content portions of document 2 are optionally normalized, e.g., using normalization module 114 .
- only the content portions 4 of document 2 are encoded, e.g., by encoder 116 , to produce document 36 with encoded content.
- Each token 4 is separately encoded to provide a respective encoded element 38 , which in the exemplary embodiment is stored in its respective TOKEN node 3 .
- the content portions 4 of document 2 may be encoded (S 108 ) by hashing or encryption, token by token. Because the content is encoded token by token, the service provider system can still perform textual similarity comparison between text elements or between sequences of text elements, as two tokens 4 consisting of the same word will hash or encrypt to the same value.
- a shared key may be used, as the content may only be decrypted by the client system 100 , though public/private key encryption may be used.
- the content may alternatively be hashed, for example using MD5.
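- The token-by-token encoding described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses MD5 because that is the hash named in the text, and the `x`/`y` attribute names, the sample document, and the function names `encode_token` and `encode_document` are assumptions made for the example. The sketch shows that identical tokens receive identical encoded values while tags and location attributes remain readable.

```python
import hashlib
import xml.etree.ElementTree as ET


def encode_token(text: str) -> str:
    """Deterministically encode one token (MD5, as named in the text)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def encode_document(xml_source: str):
    """Encode the text of every TOKEN node; leave tags and attributes untouched.

    Returns the encoded XML string and a lookup table
    (encoded value -> original token) that stays on the client.
    """
    root = ET.fromstring(xml_source)
    lookup = {}
    for node in root.iter("TOKEN"):
        original = node.text or ""
        encoded = encode_token(original)
        lookup[encoded] = original
        node.text = encoded
    return ET.tostring(root, encoding="unicode"), lookup


# Hypothetical sample document; attribute names x and y are assumptions.
SOURCE = (
    "<PAGE><TEXT>"
    '<TOKEN x="10" y="20">daily</TOKEN>'
    '<TOKEN x="60" y="20">news</TOKEN>'
    '<TOKEN x="10" y="40">daily</TOKEN>'
    "</TEXT></PAGE>"
)

encoded_xml, lookup = encode_document(SOURCE)
encoded_values = [n.text for n in ET.fromstring(encoded_xml).iter("TOKEN")]
print(encoded_values[0] == encoded_values[2])  # both "daily" tokens match
```

- Because an unkeyed hash of a common word can be guessed by hashing a dictionary, a keyed scheme (for example, encryption with a client-held key, as the text also allows) may be preferable where guessing is a concern.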
- the document 36 may be encrypted as a whole in a manner which can be decrypted by the service provider to protect against third parties intercepting the document during transmission.
- the structural information 5 may be left unencrypted, in markup language text, or may be at least partially encrypted by shared key or a public key of the service provider, allowing the service provider to decrypt the structural information.
- the ancillary document 16 if used, may also be encrypted using a shared key or the service provider's public key, if it is to be transmitted to the service provider.
- the normalized document is optionally encrypted by globally encrypting the entire document or by encrypting only the plain text structural portions. The global encrypting approach may be performed by the module 118 using TLS/SSL.
- the encoded and optionally encrypted document 36 is transmitted to the provider system 200 e.g., via the network 126 .
- the document 36 is received by the system 200 and stored in memory 206 .
- the document is decrypted, e.g., by the module 218 .
- the structural portions and encoded content portions of document 14 are identified/distinguished by structure detection module 220 , and/or content detection module 221 of the parser 212 , optionally using the ancillary document, similarly to step S 220 of FIG. 6 .
- one or more processing operations are performed on the text document 36 .
- one or more modules of the XML processing module 214 operate on the decrypted, but still content-encoded document 36 to produce a modified document 42 .
- Textual operations (which may involve changing image locations) on the document may include one or more XML operations, possibly using ancillary document 16 , on the document 36 to produce a modified document 42 .
- the operations, discussed below, may include one or more of text ordering, text segmenting, table of contents (ToC) recognition, page header/footer detection, and image segmentation/ordering.
- the document may be re-encrypted, e.g., by module 218 , optionally using the same document encryption scheme used by system 100 . This may be performed similarly to step S 110 of FIG. 5 .
- the ancillary document generally need not be re-encrypted, as the ancillary document contains unaltered structural information, and may not be sent back with the document 36 .
- the processed and optionally encrypted document 42 is output, e.g., transmitted back from the system 200 to system 100 , e.g., via the network 126 .
- the modified document 42 is received by the system 100 , which can be the same or a different computing device from the device 100 which transmitted the document to the system 200 , and stored in memory 106 .
- the modified document 42 received by the client system 100 is decrypted, if it was sent from the service provider in encrypted form.
- the processing performed on the document 36 by the provider system 200 is used to generate a transformed document 44 .
- this may be achieved by decoding the encoded content 38 of document 42 and reconstructing the document to produce document 44 .
- the changes between document 36 and document 42 that were implemented by the XML processing module 214 are identified from the modified document 42 and applied to the original document 2 to generate document 44 .
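- The first reconstruction approach, decoding the encoded content of the modified document, might look like the sketch below. It assumes the client kept a lookup table from encoded value to original token when it encoded the document; the function name `reconstruct`, the placeholder values `h1`/`h2`, and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def reconstruct(modified_xml: str, lookup: dict) -> str:
    """Replace each encoded TOKEN value with its original text.

    Structural changes made by the service provider (new tags, reordered
    tokens) are kept as they are; only the token text is restored.
    """
    root = ET.fromstring(modified_xml)
    for node in root.iter("TOKEN"):
        node.text = lookup[node.text]
    return ET.tostring(root, encoding="unicode")


# Hypothetical client-side table and a modified document in which the
# provider reordered the tokens and added a PARAGRAPH tag.
lookup_table = {"h1": "daily", "h2": "news"}
modified = "<PAGE><PARAGRAPH><TOKEN>h2</TOKEN><TOKEN>h1</TOKEN></PARAGRAPH></PAGE>"

transformed = reconstruct(modified, lookup_table)
print(transformed)
```

- A table keyed by encoded value can restore only one spelling per value, so when a lossy normalization has merged distinct tokens, applying the provider's changes to the original document may be the safer of the two approaches.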
- this allows the client system 100 to generate a document which is equivalent to that which would have been generated had the service provider been provided with the source document in an unencoded form, but optionally encrypted as a whole in a manner which allows the service provider to decrypt the document including the tokens.
- the client system may output document 44 , for example to document storage system 130 or to a user.
- the method may return to S 106 , where a different normalization operation may be performed and steps S 108 -S 130 repeated, such that at S 120 , a different processing operation is performed on the encoded document 36 .
- This further processing may be performed by the same or a different provider system 200 .
- the method may be reiterated more times.
- the method ends at S 134 .
- the normalization may include one or more of converting all characters to lower case, converting accented characters to unaccented characters (or to a combination of characters), removing apostrophes, removing punctuation from within word tokens, removing numbers, and ignoring or removing short words, etc.
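- A normalization step of this kind could be sketched as follows. The sketch assumes Latin-script text and keeps only the letters a to z after lower-casing and accent removal; the function name `normalize_token` and the `min_length` parameter (used to drop short words) are assumptions for illustration.

```python
import re
import unicodedata


def normalize_token(token: str, min_length: int = 1):
    """Return the normalized token, or None if the token should be dropped."""
    # Lower-case, then split accented characters into base letter + mark.
    decomposed = unicodedata.normalize("NFKD", token.lower())
    # Drop the combining marks, i.e. the accents.
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove apostrophes, punctuation, and digits (assumption: Latin script).
    letters = re.sub(r"[^a-z]", "", unaccented)
    return letters if len(letters) >= min_length else None


for raw in ["Café", "don't", "U.S.A.", "1984"]:
    print(raw, "->", normalize_token(raw))
```

- Tokens that normalize to nothing, such as pure numbers, are returned as `None` so the caller can decide whether to drop them or handle them separately.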
- the service provider may provide the software tool 114 for performing normalization, as the normalization steps are determined by what transformations will be performed by the service provider.
- the XML schema may be enriched to allow normalization information to be incorporated into the document 2 and passed to the service provider.
- this may entail modifying the processing software used by the service provider, which may not be desirable for maintaining conformity over a variety of different clients.
- differences in capitalization may be incorporated into the encryption (capital letters encrypted as capitals and lower case letters as lower case). Although providing a relatively weak level of protection, this may not be a problem.
- the textual content may be normalized and encoded at the same time by replacing any lowercase letter by ‘I’ (for instance), any uppercase letter by ‘L’, and any number by ‘0’.
- the text is lost (i.e., it cannot be regenerated from the encoding) but the customer can reconstruct the appropriate result by re-injecting the text in the token placeholders of the modified document 42 or by incorporating the modifications made in the modified document 42 into the source document 2 . While this method preserves the same number of characters in the encoded token, this can be a disadvantage for security.
- any sequence of ‘I’ is replaced by one single ‘I’. This may provide better encryption but the word length is lost.
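- The combined normalize-and-encode scheme, and its run-collapsing variant, can be illustrated with the short sketch below. The replacement characters follow the text; passing other characters (such as punctuation) through unchanged is an assumption, as are the function names.

```python
import re


def shape_encode(token: str) -> str:
    """Replace lowercase letters by 'I', uppercase letters by 'L', digits by '0'."""
    out = []
    for ch in token:
        if ch.islower():
            out.append("I")
        elif ch.isupper():
            out.append("L")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)  # assumption: other characters pass through unchanged
    return "".join(out)


def collapse_runs(shape: str) -> str:
    """Replace any sequence of 'I' by a single 'I', hiding the word length."""
    return re.sub(r"I+", "I", shape)


print(shape_encode("Daily"), collapse_runs(shape_encode("Daily")))
```

- In both variants the original text cannot be recovered from the encoding, which is why the client re-injects the text itself.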
- low entropy encrypted data may, in some cases, raise a security issue, for example if common text values are easily guessed by an attacker.
- the method makes use of encryption padding to make calculating the encryption mechanism more difficult. Identical padding may be used for identical tokens to ensure that they encrypt to the same value.
- normalization may include removal of common prefixes and suffixes, such as removal of “ing” and “ed” to allow conjugated verbs to match each other.
- the service provider system 200 may be configured to receive and process both encoded and unencoded documents (e.g., from two different clients with different security needs).
- the service provider's XML processing module 214 need not be able to distinguish between encoded and unencoded text and may perform a second normalization on all documents, even encoded documents.
- This second normalization by the service provider can be performed without the service provider having to provide separate processing paths for encoded and unencoded documents.
- the encoding alphabet (after normalization), may thus contain only lower case characters and contain no accents or white space characters.
- the service provider processing module 214 may be aware of whether it receives encoded or unencoded text, or may be configured for detecting the different forms of text automatically. In some embodiments, where the service provider system 200 performs a second normalization, the service provider system may also “de normalize” the document after processing, by keeping track of all normalized operations and undoing them.
- Example content (e.g., text) processing operations which may be performed on encrypted document 36 at S 120 may include some or all of the following.
- the text segmentation module 240 segments the flow of text into lines and paragraphs.
- the text ordering module 242 orders the flow so as to respect human-reading order. For example, if an original document has two columns, the contents of the two columns may, after scanning and OCR, be stored in one large column, in which each row contains a space in the middle, separating the two columns. While such format would print correctly and be viewable on screen, converting the document to HTML or epub would render the document unreadable.
- the text ordering module is configured to recognize the two columns and re-order the text so that lines run consecutively, allowing further conversion into another format.
- the text ordering may move the encrypted or hashed tokens to new locations in the document to produce modified document 42 .
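- A much simplified sketch of the two-column case follows. It assumes each token carries `x` and `y` coordinates and that the gutter position between the columns is already known; real column detection is more involved, and the names `Token`, `reading_order`, and `gutter_x` are hypothetical. The encoded values are treated as opaque and are never decoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    value: str  # opaque encoded value; never decoded by the provider
    x: float    # horizontal position on the page
    y: float    # vertical position on the page (larger = lower)


def _top_to_bottom(token: Token):
    """Sort key: line by line, then left to right within a line."""
    return (token.y, token.x)


def reading_order(tokens, gutter_x: float):
    """Return tokens with the whole left column first, then the right column."""
    left = [t for t in tokens if t.x < gutter_x]
    right = [t for t in tokens if t.x >= gutter_x]
    return sorted(left, key=_top_to_bottom) + sorted(right, key=_top_to_bottom)


# Row-by-row order as it might come out of OCR: left and right interleaved.
scanned = [
    Token("a1", 0, 10), Token("b1", 100, 10),
    Token("a2", 0, 20), Token("b2", 100, 20),
]
print([t.value for t in reading_order(scanned, gutter_x=50)])
```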
- the reconstruction module 119 of the client system may replace the encoded tokens with their corresponding words to reconstruct the document text.
- Segmentation identifies paragraphs in the document, which can be an aid to other services, discussed below.
- the segmentation module 240 may store the paragraph information by the addition of metatags (structural information) to the document 36 to produce modified document 42 .
- segmentation works at a page level, and does not consider paragraphs which span two or more pages. This is generally sufficient to allow for further processing.
- Both segmentation and text ordering generally rely on the geometry of the words (tokens) on a page.
- the client system may encrypt or hash each token (word) without altering the operation of the text ordering and segmentation module.
- the service provider system does not generally need to compare the encrypted text content 36 , so, for this particular service, it may be sufficient to use a serial number or other unique identifier for each token, rather than encryption or hashing.
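- Where only ordering and segmentation are requested, the serial-number alternative could be sketched as below. The identifier format (`t1`, `t2`, ...) and the function name `assign_serial_ids` are assumptions. Unlike hashing, repeated words receive different identifiers, so the provider cannot compare token content at all.

```python
import itertools


def assign_serial_ids(tokens):
    """Give every token occurrence its own identifier.

    Returns the list of identifiers (in document order) and a client-side
    table mapping each identifier back to the original token.
    """
    counter = itertools.count(1)
    ids = []
    table = {}
    for token in tokens:
        token_id = f"t{next(counter)}"  # hypothetical identifier format
        ids.append(token_id)
        table[token_id] = token
    return ids, table


ids, table = assign_serial_ids(["daily", "news", "daily"])
print(ids)
```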
- the client system may add a tag to the metadata 6 of the document 36 , indicating the order of the tokens, and the text ordering module 242 of the service provider system may update the metadata to reflect the modified ordering of the tokens.
- a format that is line based rather than token based is used.
- Each row of text is made up of one or more lines.
- each row may contain a maximum of two lines, in the ideal case.
- the metadata 6 may contain a line order tag, which the service provider may update, after reordering portions of lines instead of single tokens.
- This approach may also be extended to paragraphs. Recognizing lines or paragraphs may entail greater sophistication on the part of the client's scanning software and/or greater processing on the client system.
- One potential advantage of this approach is that the encryption or hash can be more secure, as larger chunks may be encrypted, making it more difficult for an unauthorized user to infer any plaintext from the layout of the document.
- the identification of text order and segments can be performed using existing techniques. Order may be detected, for example, from layout clues that indicate columns. Segments may be inferred from gaps or whitespace in the document. An XY cut method may be used to find narrow or wide strips to identify paragraphs and/or lines, as described in U.S. Pat. No. 7,392,473 to Meunier, entitled METHOD AND APPARATUS FOR DETERMINING LOGICAL DOCUMENT STRUCTURE. An example method for segmenting a document is described in US Pub. No. 2004/0013302 to Ma, et al., entitled DOCUMENT CLASSIFICATION AND LABELING USING LAYOUT GRAPH MATCHING. The disclosures of both of these documents are incorporated herein by reference in their entireties.
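- As a rough illustration of inferring segments from whitespace, the sketch below groups lines into paragraphs whenever the vertical gap between consecutive lines exceeds a threshold. This is a one-dimensional simplification written for this example, not the XY cut method of the cited patent; the function name `split_on_gaps` and the `max_gap` parameter are assumptions.

```python
def split_on_gaps(lines, max_gap: float):
    """Group lines into paragraphs by vertical whitespace.

    lines: list of (top, bottom) vertical extents, sorted top to bottom.
    A new group starts when the gap to the previous line exceeds max_gap.
    """
    groups = []
    for top, bottom in lines:
        if groups and top - groups[-1][-1][1] <= max_gap:
            groups[-1].append((top, bottom))
        else:
            groups.append([(top, bottom)])
    return groups


page_lines = [(0, 10), (12, 22), (40, 50), (52, 62)]
print(split_on_gaps(page_lines, max_gap=5))
```

- Only geometry is used, so the same grouping would result whether the tokens on each line are encoded or not.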
- the Table of Contents (ToC) module 244 may process document 36 after processing by the text ordering module 242 and/or segmentation module 240 .
- the ToC module is configured for recognizing a Table of Contents in a document, if it exists, and for each entry in the ToC which is identified, to construct a link or other pointer to the portion of the document referenced by the entry in the ToC.
- Paragraphs or pages which are referenced by the ToC may also be labeled by a metadata tag. This can be a flag or Boolean, that is true for paragraphs or pages which are referenced by the ToC.
- the ToC referenced paragraphs or pages may also be labeled with metadata identifying the link from the ToC.
- the ToC module 244 may function by first computing the textual similarity between pairs of document parts, where the parts can each be a sequence of tokens, such as a paragraph, line, a few tokens, or even a single token, then by identifying the ToC based on this pairwise similarity.
- the sequences compared can be sequences of encoded tokens 38 .
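- The pairwise comparison can operate directly on encoded values, as the sketch below suggests. The text does not name a particular similarity measure; Jaccard similarity over token sets is swapped in here purely for illustration, and the threshold, the function names, and the sample values are assumptions.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity between two sequences of encoded tokens."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def candidate_links(parts, threshold: float = 0.5):
    """Return (i, j, score) for each pair of parts that is similar enough."""
    links = []
    for i in range(len(parts)):
        for j in range(i + 1, len(parts)):
            score = jaccard(parts[i], parts[j])
            if score >= threshold:
                links.append((i, j, score))
    return links


# Hypothetical encoded parts: part 0 is a ToC entry, part 2 a section title.
parts = [
    ["a1", "b2", "c3"],
    ["x9", "y8"],
    ["a1", "b2", "c3", "d4"],
]
print(candidate_links(parts))
```

- Because identical words encode to identical values, a ToC entry and the heading it refers to remain similar after encoding, even though neither can be read.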
- the ToC may be further refined using the metadata (e.g., clues such as typography changes or page layout).
- the similarity measure used by the ToC module 244 may function better if the normalization of the textual content provides lowercasing and removes small, frequent words. Left in place, such words, once encoded, could cause two pieces of text that should be considered similar enough to match to be treated as different.
- the reconstruction module 119 of the client system may update the original document 2 with the updated ToC and additional metadata (link locations and paragraph flags). Generating a ToC generally does not reorder the textual data, so the client need not decrypt or look up the hash values of the encoded content 38 ; however, this method of reconstruction is not excluded.
- the header/footer module 246 can perform one or both of page header detection and page footer detection, in which encoded tokens 38 are tagged with metadata identifying them as being part of a respective header or footer. This detection may be based on structural information and/or similarity between the encoded tokens on subsequent pages.
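- One simple way to exploit repetition across pages is sketched below: a line of encoded tokens that recurs at the same position (top or bottom) on a sufficient fraction of pages is treated as a header or footer. Exact matching stands in for the similarity measure, and the data layout, the `min_fraction` threshold, and the function name `repeated_lines` are assumptions for this example rather than the method of the patent cited below.

```python
from collections import Counter


def repeated_lines(pages, position: int, min_fraction: float = 0.6):
    """Find lines that repeat at a given position across pages.

    pages: list of pages, each a list of lines, each line a sequence of
           encoded tokens.
    position: 0 for the top line of each page, -1 for the bottom line.
    """
    counts = Counter(tuple(page[position]) for page in pages if page)
    needed = min_fraction * len(pages)
    return {line for line, count in counts.items() if count >= needed}


# Hypothetical encoded pages: same top line everywhere, varying bottom line.
pages = [
    [("h1", "h2"), ("p1", "p2"), ("n1",)],
    [("h1", "h2"), ("p3", "p4"), ("n2",)],
    [("h1", "h2"), ("p5",), ("n3",)],
]
print(repeated_lines(pages, position=0))
```

- In this sample the bottom lines differ on every page, as page numbers would, so exact matching finds no footer; this is consistent with the text's remark that page-number detection may need an encoding with detectable sequentiality.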
- the client system can thus treat the header and footer as regular text, relying on the header/footer module 246 of the service provider to identify the header and footer based on, for example, page layout, in a similar manner to text ordering and segmentation.
- the header and footer information may be tagged by adding tags to the metadata.
- the module 246 may also be configured for detection of page numbers, which may involve recognition of a pattern of numbers (Arabic and/or Roman numerals) and replacing them with a simple regular expression. This may entail the encoding scheme incorporating a detectable sequentiality.
- Page header/footer detection may be performed, for example, using the methods described in U.S. Pat. No. 7,937,653, issued May 3, 2011, entitled METHOD AND APPARATUS FOR DETECTING PAGINATION CONSTRUCTS INCLUDING A HEADER AND A FOOTER IN LEGACY DOCUMENTS, by Hervé Déjean, et al., the disclosure of which is incorporated herein by reference in its entirety.
- Images in the document 36 may also be segmented and/or ordered by the image segmentation/ordering module 248 . This may be performed similarly to the text ordering already described, for example, based on size, layout, and similarity of nearby text. Groups of related images may be identified and kept together as one large image. In one embodiment, the images are not sent within the encoded document, only their bounding boxes. No data about the image need be sent other than its dimensions, so encryption of the images is unnecessary. In another embodiment, the images are encrypted or hashed. Surrounding text may be identified as labels or captions associated with the image. Text surrounding the image may be used to identify an image that is referenced by the ToC. Just as in the case of text ordering, the images may be moved in the document.
- the images (or bounding boxes) in the encoded document may have metadata associated with them at the client system, indicating their sequence in the document. This information may be modified by the image segmenting/ordering module 248 .
- the method illustrated in FIG. 5 may be implemented in one or more computer program product(s) that may be executed on a respective computer.
- the computer program product(s) may comprise a non-transitory computer-readable recording medium on which a control program is recorded, such as a disk, hard drive, or the like.
- Common forms of non-transitory computer-readable media include, for example, floppy disks, flexible disks, hard disks, magnetic tape, or any other magnetic storage medium, CD-ROM, DVD, or any other optical medium, a RAM, a PROM, an EPROM, a FLASH-EPROM, or other memory chip or cartridge, or any other non-transitory medium from which a computer can read and use.
- the method may be implemented in transitory media, such as a transmittable carrier wave in which the control program is embodied as a data signal using transmission media, such as acoustic or light waves, such as those generated during radio wave and infrared data communications, and the like.
- the exemplary method may be implemented on general purpose computers, special purpose computer(s), a programmed microprocessor or microcontroller and peripheral integrated circuit elements, an ASIC or other integrated circuit, a digital signal processor, a hardwired electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD, PLA, FPGA, Graphical card CPU (GPU), or PAL, or the like.
- any devices capable of implementing a finite state machine that is in turn capable of implementing a respective part of the flowchart shown in FIG. 5 , can be used to implement the exemplary method.
Description
- The exemplary embodiment relates to the field of electronic document processing. It finds particular application in connection with electronic document format conversion and in particular with processing documents formatted in an unstructured or semi-structured format, and will be described with particular reference thereto. However, it is to be appreciated that the following is amenable to other like applications.
- Organizations frequently have documents that are stored in an unstructured or semi-structured format that is difficult to reformat for different viewing devices. A common task is the batch conversion of these documents into an electronic form which allows searching and automatic transformation for presentation by different devices. Legacy documents are frequently either stored as Adobe portable document format (PDF) files or scanned from hard copies into PDF. Other common formats are image format such as portable network graphics format (png), graphics interchange format (gif), and the like. Legacy documents may also be in word processing formats or other, possibly proprietary, formats. The target formats are often XML, SGML, or HTML, which allows easy conversion into other structured formats, e.g., the epub format for ebook readers. Reformatting into the structured document may entail segmenting the document by finding paragraph divisions and generating a table of contents, information that is not readily available from unstructured scanned documents or PDF documents.
- When large quantities of such documents are to be processed, batch processing by an outside service provider may be desirable. If the documents are confidential in nature, however, there may be concerns that sensitive information may be released, either during transmission or by the service provider. An encrypted channel may be used to protect the sensitive information during transmission, but this still poses a risk of disclosure by the service provider when the documents are decrypted.
- It would be desirable to have a method and system for transmitting a document such that a service provider may perform processing of the structure of the document and limited processing of the content without having full access to the content.
- The following reference, the disclosure of which is incorporated herein by reference in its entirety, is mentioned:
- U.S. application Ser. No. 13/103,216, filed on May 9, 2011, entitled METHOD AND SYSTEM FOR SECURED MANAGEMENT OF ONLINE XML DOCUMENT SERVICES THROUGH STRUCTURE-PRESERVING ASYMMETRIC ENCRYPTION, to Vion-Dury, describes a method of allowing the service provider to process structural data without having access to the textual information of a document.
- In accordance with one aspect of the exemplary embodiment, a system for processing an encoded source document includes memory which receives an encoded source document from an associated client system, the encoded source document comprising structural information and encoded content information, the encoded content information comprising a plurality of encoded tokens generated by individually encoding each of a plurality of text tokens of the source document, the structural information comprising location information for each of the plurality of text tokens. A processing module processes the encoded document, without decoding the encoded tokens, to generate a modified document. A transmission module transmits the modified document to an associated client system whereby the client system is able to generate a transformed document based on the modified document and the plurality of text tokens. A processor implements one or more of the modules.
- In accordance with another aspect of the exemplary embodiment, a method of processing an encoded source document includes receiving an encoded source document from an associated client system, the encoded source document including structural information and encoded content information. The encoded content information includes a plurality of encoded tokens, the encoded tokens having been generated by individually encoding each of a plurality of text tokens of the source document to a respective value. The structural information includes location information for each of the plurality of tokens. The encoded document is processed using a processor, without decoding the encoded tokens, to generate a modified document. The modified document is transmitted to an associated client system, whereby the client system is able to generate a transformed document based on the modified document and plurality of text tokens.
- In accordance with another aspect of the exemplary embodiment, a computer system includes memory which receives a source document. The source document includes content information and structural information. The content information includes a plurality of text tokens. The structural information includes location information for each of the plurality of text tokens. An encoding module encodes each of the plurality of text tokens individually to produce an encoded document including encoded tokens. A transmission module transmits the encoded document to an associated service provider for processing the encoded document to generate a modified document. The encoded document being transmitted in a manner which allows the service provider to access at least a part of the structural information without decoding the encoded tokens. A reconstruction module generates a transformed document based on the modified document received from the service provider and the text tokens. A processor implements one or more of the modules.
- The content information includes a plurality of tokens. The structural information includes location information for each of the plurality of tokens. An encoding module encodes each of the tokens of the content information individually to produce an encoded document. A transmission module transmits the encoded document to an associated service provider for processing the encoded document to generate a modified document, the encoded document being transmitted in a manner which allows the service provider to access at least a part of the structural information without decoding the content information. A reconstruction module generates a transformed document based on the modified document received from the service provider. A processor implements one or more of the modules.
- In accordance with another aspect of the exemplary embodiment, a method for processing a source document includes receiving a source document, the source document including structural information and content information, the content information comprising a plurality of tokens. The method further includes encoding each of the plurality of tokens individually to produce an encoded document, and transmitting the encoded document to an associated service provider for processing the encoded document to generate a modified document, the encoded document being transmitted in a manner which allows the service provider to access at least a part of the structural information without decoding the content information. A transformed document is generated based on the modified document received from the service provider and the transformed document is output.
- FIG. 1 illustrates an example XML document containing structural information portions and content information portions;
- FIG. 2 illustrates an example ancillary XML document embodied as an XML schema;
- FIG. 3 illustrates a client system for encoding the content information of an XML document to be transferred to a service provider;
- FIG. 4 illustrates a service provider system for processing an XML document having encoded content; and
- FIG. 5 is a flow chart illustrating an exemplary method for processing an XML document.
- Aspects of the exemplary embodiment relate to a method and system for document encoding which allows a service provider to process a document without having access to the textual content of the document. The document is in a format, such as Extensible Markup Language (XML) format, that allows structural information (such as content location and page layout information) to be separated from the content information of the document (text, vector graphics, and images). Content information and structural information, for purposes herein, are mutually exclusive parts of the document which in combination, form the entire document. The textual content information of the document can be extracted as a sequence of text elements or “tokens”, each token roughly corresponding to a word, punctuation, or number sequence. The tokens of the document can then be individually encoded by the document owner (the client). In this format, the document can be sent from the client to the service provider who then processes and transforms the structural data of the document, based in part on the structural information associated with the encoded tokens. The service provider then returns the modified document to the client.
- The method is particularly suited to a document format that allows decoupling the content information from the structural information. The content information includes content elements (e.g., characters or words forming tokens) while the structural information includes location information, such as position, width, and height of content elements, as well as page layout information, geometry, and typography, which may be generally referred to herein as metadata. The decoupling enables the content elements to be encrypted without encrypting the metadata or by encryption of the metadata with a different encryption method such as encryption using a separate key, allowing the metadata to be decrypted by the service provider (and operations to be performed on the metadata) without decrypting the content information, in particular, the encoded tokens.
- The client's original documents may not be in a format that separates the text content from the structural information. Accordingly, scanned or legacy documents may be minimally processed, e.g., by the client, to convert them to a format that allows the text of the document to be automatically processed separately from the structural information associated with the text.
- The term “document” is used herein to refer to all versions of a document, which may be the result of the document being normalized, encoded, encrypted, modified, and/or otherwise processed in the exemplary method described herein.
-
FIG. 1 illustrates an example input (source) document 2 in a structured format which may be processed using the exemplary method. The document 2 is in the form of a tree with a root node corresponding to the entire document and increasingly lower level nodes corresponding to page, column, paragraph, text (approximately a line of text or a part of a line) and, at the lowest level, token nodes 3, which include the textual content 4 of the document. In the illustrated document 2, the rest of the information constitutes the structural information 5. The illustrated structural information includes tags 6, location information 8, typographical information 10, and page layout information 12, although it is to be appreciated that structural information is not limited to such information. - First, each node of the tree-structured document is identified by a respective one of a
set 6 of tags (DOCUMENT, PAGE, COLUMN, PARAGRAPH, TEXT, TOKEN). The tags form part of the structural information portion of the document and identify the respective type of each of the nodes of the tree-structured document 2. In the example, each tag is recognizable by arrowheads < >. Each token node 3 includes a single text token 4, which generally includes one or more characters and is contained within arrowheads > <. Although only one token node including text content 4 (the word “DAILY”) is shown in FIG. 1, it is to be understood that generally a document may include many text tokens, roughly one per word (in the case of documents converted using OCR, there may be some errors in separating tokens, resulting in words being occasionally split into two or more tokens or two or more words being combined into a single token). - While the exemplary content is shown as
text 4, it is to be appreciated that the content portion may also include images and/or graphics. - Each
text token 4 has location information 8 associated with it (here, incorporated into the token node 3), indicating the location of the token relative to a predefined location on the page, such as x,y coordinates relative to the top left corner of the page, as well as typographical information 10, such as font name, font size, font color, bold/not bold, italic/not italic, underline/not underline, and the like. Page layout information 12, in this example, is associated with the PAGE tag within a page level node 14. While not shown, other nodes may also have structural information associated with them. Although the format of the document 2 in FIG. 1 uses the standard XML format, any markup language format that separates textual content 4 from the structural information 5 may be employed. - In order to parse the document of
FIG. 1, an ancillary document may be provided that defines the structure of the document, called a schema or Document Type Definition (DTD). An example schema 16 for the document of FIG. 1 is shown in FIG. 2. The schema defines the tags DOCUMENT 20, PAGE 22, COLUMN 24, PARAGRAPH 26, TEXT 28, and TOKEN 30. Some of these tags have attributes 32. Examples of attributes are “a-Size” and “a-BoundingBox” which denote the size and location of the content element (such as TOKEN 30). The attribute elements have typed values, as shown in the attribute section 34. Using the schema 16, the corresponding structural information 5 can be extracted from the document 2. In other embodiments, the structural information and content information in the document 2 can be extracted/distinguished without the need for a schema. - The format of
FIGS. 1 and 2 is a simplified version of the XML structure produced by the open-source tool “PDF-to-XML.” As noted above, a token contains, roughly speaking, one word. - The exemplary system and method make use of the fact that the structural information 5 provided in the XML format includes the location information 8 (e.g., size and position, rotation) of each
text element 4. Thus, a bounding box which encompasses each text element can be defined, based on the location information. Since the metadata 8 contains the size and position, the textual content 4, corresponding to the image data from a scanned document, is no longer needed and need not be sent to the service provider for processing. That is, the image of the original, confidential document may be kept secret from the service provider and only the structural information 5 sent, optionally together with the appropriate schema 16. - In this way, the document can be processed by the service provider, which may include modifying some of the structural information, such as changing the location of text elements in the document, while the actual text content is not revealed to the service provider. Since the metadata is rarely considered confidential information, the tags, etc., of the document and the schema may be revealed to the service provider, although they may be encrypted for transmission to the service provider.
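The separation of content from structure described above can be illustrated with a minimal Python sketch. The sample markup only loosely follows the tree described for FIG. 1. The attribute names (`x`, `y`, `width`, `height`, `font-size`) and the function names are assumptions made for illustration and are not taken from the figures or from any particular conversion tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; attribute names are illustrative assumptions.
SAMPLE = """<DOCUMENT>
  <PAGE width="612" height="792">
    <COLUMN><PARAGRAPH><TEXT>
      <TOKEN x="72" y="90" width="40" height="12" font-size="12">DAILY</TOKEN>
      <TOKEN x="116" y="90" width="38" height="12" font-size="12">NEWS</TOKEN>
    </TEXT></PARAGRAPH></COLUMN>
  </PAGE>
</DOCUMENT>"""


def split_content_and_structure(xml_text):
    """Return (tokens, structure): token texts and per-token attribute dicts."""
    root = ET.fromstring(xml_text)
    tokens, structure = [], []
    for node in root.iter("TOKEN"):
        tokens.append((node.text or "").strip())
        structure.append(dict(node.attrib))
    return tokens, structure


def bounding_box(attrs):
    """Compute (left, top, right, bottom) from the location attributes."""
    x, y = float(attrs["x"]), float(attrs["y"])
    return (x, y, x + float(attrs["width"]), y + float(attrs["height"]))


tokens, structure = split_content_and_structure(SAMPLE)
boxes = [bounding_box(attrs) for attrs in structure]
```

In this sketch, `boxes` is computed from the structural attributes alone, which is the property that lets a provider work on geometry without seeing `tokens`.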
- In one embodiment, the
text tokens 4 of the XML document 2 are individually encoded and each encoded token may be stored within the respective token placeholders > < in the transmitted document, without changing the structural information to account for a change in size of the encoded token 38. The encoding may be performed using symmetric or asymmetric encryption or by hashing the tokens, e.g., using a cryptographic hash function, such as the Message-Digest Algorithm (e.g., MD5). For the transformations described herein, either encryption or hashing may be used, and the term encoding is used to refer to encrypting or hashing the tokens into encrypted values or hash values. -
FIG. 1 illustrates an exemplary text encoded document 36 which may be formed by encoding only the text tokens 4 to provide encoded tokens 38. - After the tokens have been encoded, the
entire document 36, including already encoded values, may be globally encoded for transmission to form a fully-encoded document. The encryption of the text encoded document 36, for transmission between the client and the service provider, may be performed using any suitable encryption method, such as HTTPS (hypertext transfer protocol, HTTP, over secure sockets layer, SSL, or transport layer security, TLS) or application specific protocols, such as globally encrypting the document using an asymmetric or symmetric encryption scheme during the exchange. In the case where a symmetric scheme is used, the shared key is exchanged, for example, using a preliminary key exchange step which encrypts the shared key using an asymmetric encryption scheme. The encoding scheme (encryption or hashing) may be agreed upon in advance, or a protocol for choosing an encryption/hashing scheme during each document transfer may be agreed upon. - In one embodiment, the encoding includes inserting an XML attribute named ‘id’, which contains the identifier, e.g., <TOKEN>DAILY</TOKEN> is replaced with <TOKEN id=“12345”>abdcw</TOKEN>.
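A minimal sketch of token-level encoding with an inserted ‘id’ attribute follows. It assumes Python's standard library and uses a keyed hash (HMAC-SHA256) in place of the MD5 example mentioned above. The sample markup, the key, the 16-character truncation, the sequential identifiers, and the function names are illustrative assumptions, not part of the described system.

```python
import hashlib
import hmac
import xml.etree.ElementTree as ET


def encode_token(text, key):
    """Deterministically encode one token with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; an illustrative choice


def encode_document(xml_text, key):
    """Replace each TOKEN's text with its encoding and add an 'id' attribute."""
    root = ET.fromstring(xml_text)
    for index, node in enumerate(root.iter("TOKEN")):
        node.set("id", str(index))  # hypothetical sequential identifier
        node.text = encode_token(node.text or "", key)
    return ET.tostring(root, encoding="unicode")


# Hypothetical input: the structural attribute 'x' is left untouched.
SOURCE = (
    '<TEXT><TOKEN x="72">DAILY</TOKEN>'
    '<TOKEN x="116">NEWS</TOKEN>'
    '<TOKEN x="160">DAILY</TOKEN></TEXT>'
)
encoded = encode_document(SOURCE, key=b"client-secret-key")
```

Because the encoding is deterministic for a given key, both occurrences of the same word receive the same encoded value, while the location attribute of each token is unchanged.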
- In another embodiment, each
text element 4 is given an identifier, which is inserted in its place in the respective token placeholder, without changing the location information, and the text token 4 is extracted from the document and stored in a list, indexed by the identifier. - The exemplary system may rely on identical tokens encoding (encrypting or hashing) to the same value, so if tokens are processed before encoding, such as by padding or salting, the processing should be consistent throughout the document so that the same token always encrypts or hashes to the same value. This allows the client to keep the document text confidential and allows the service provider to deliver a service to the client without the need for disclosing proprietary software to the client.
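The identifier-based embodiment described above, in which each token is replaced by an identifier and the text is kept in a client-side list, might be sketched as follows. The identifier format (`t0`, `t1`, ...), the sample markup, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_tokens(xml_text):
    """Replace each TOKEN's text with an identifier; return (xml, lookup)."""
    root = ET.fromstring(xml_text)
    lookup = {}
    for index, node in enumerate(root.iter("TOKEN")):
        identifier = f"t{index}"  # hypothetical identifier format
        lookup[identifier] = node.text or ""
        node.text = identifier
    return ET.tostring(root, encoding="unicode"), lookup


def reinject_tokens(xml_text, lookup):
    """Put the original text back in place of each identifier."""
    root = ET.fromstring(xml_text)
    for node in root.iter("TOKEN"):
        node.text = lookup[node.text]
    return ET.tostring(root, encoding="unicode")


SOURCE = '<TEXT><TOKEN x="72">DAILY</TOKEN><TOKEN x="116">NEWS</TOKEN></TEXT>'
stripped, lookup = extract_tokens(SOURCE)      # safe to send; lookup stays local
restored = reinject_tokens(stripped, lookup)   # applied to the returned document
```

Since every token receives a distinct identifier in this sketch, two occurrences of the same word are not comparable by the provider, unlike with the deterministic encryption or hashing discussed above.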
- With reference now to
FIGS. 3 and 4 , a network computing system includes a client computing system (“client system”) 100 for encoding a customer'sXML documents 2 and a service provider computing system (“provider system”) 200 for processingdocuments 36 with encoded content. The twosystems systems network interface device Systems output devices client system 100 is connected to documentstorage device 130. -
System 100 receives as input at least one unencrypted source XML document 2, an optional ancillary XML document 16, and an optional encryption key or keys 40. System 100 may receive an encoded, modified XML document 42, output from the service provider 200, which is decoded by the client device 100 to produce a transformed document 44. -
System 200 receives as input an encodedXML document 36 and optionalancillary XML document 16. In the alternative,inputs respective systems encryption key 40 may be generated as needed. In the exemplary embodiment, theinput source document 2 andancillary document 16 are XML documents. In alternate embodiments,input documents provider system 200, or may be input into theprovider system 200 separately. In such instances, there may be noancillary XML document 16 input intoclient system 100. In some embodiments, theancillary XML document 16 may be encrypted at theclient system 100 with an asymmetric encryption key received from theprovider system 200. In this case, the service provider is trusted to the extent that theprovider system 200 may decrypt the ancillary XML document 16 (but not thecontent 4 of source XML document 2) with a private key corresponding to the public encryption key. - The
systems source XML document 2 and/orancillary XML document 16 at a time, either in parallel or serially. In such embodiments, each document may use aseparate encryption key 40 for one ormore documents source XML document 2 andancillary XML document 16 is processed at a time. During processing,source XML document 2,ancillary XML document 16, encryption key(s) 40, and XML document withencrypted content 38 may be stored in any suitable non-transitory datamemory storage device systems wireless network 126, such as the Internet. The network interfaces 102, 202 may include a modem link, a wired or wireless connection, USB port, floppy or hard disk receiver, transceiver portion of a cellular telephone, or the like and may be separated or combined with other components ofsystems -
System 100 includes data memory 106 for storing the source XML document 2, ancillary XML document 16, encryption key 40, encoded XML document 36, and any other input or intermediate data generated during processing. System 200 includes corresponding data memory 206 for storing encoded XML document 36, ancillary XML document 16, and any other input or intermediate data generated during processing. -
Main memory systems stores instructions main memory 108 ofclient system 100 stores a parser 112 (e.g. an XML document object model), anormalization module 114, a content encoding/decoding module (encoder) 116, a transmission encryption module 118 (which may be implemented as TLS/SSL), and areconstruction module 119. Theparser 112 includes astructure detection module 120 and acontent detection module 121. Similarly,main memory 208 ofprovider system 200 stores a parser 212 (e.g. an XML document object model), anXML processing module 214, and a transmission encryption module 218 (which may be implemented as TLS/SSL). Theparser 212 includes astructure detection module 220 and acontent detection module 221. - It is to be appreciated that
memories respective systems memories instructions digital processor digital processor modules memory 106 and/or output via input/output device 104 ornetwork interface devices 102 to theprovider system 200 and outputs frommodules memory 206 and/or output via input/output device 204 toclient system 100 or another device such as an external computer having memory and/or a processor, optionally through a wired or wireless local orwide area network 126 such as the Internet. - In one illustrative example,
systems 100 and/or 200 are hosted by a respective server computer that is part of a distributed or cloud computing network, in which case inputs and outputs may be made via asingle network interface - The
software modules systems FIGS. 5 , 6, and 7. - The
parser module 112 ofsystem 100 parses theinput XML documents structural information portions content information portion 4. Similarly, theparser module 212 ofsystem 200 parses theXML document 36 with encoded content (after the document-level encoding has been decoded). In one embodiment, theparsers Parser 112 includes astructure detection module 120 to detect and store the structural information portion ofsource XML document 2 and modifieddocument 42 with encrypted content.Parser 220 similarly includes astructure detection module 220 to detect and store the structural information portion of encodedXML document 36 with encoded content. Each parser also includes a respectivecontent detection module content portions 4 of the respective XML documents. - The
normalization module 114 of client system 100 is configured for normalization of the text content 4, which may include one or more of: changing all letters to lower case; removing accents; removing apostrophes; removing punctuation from within word tokens; removing numbers; removing small words (e.g., those with fewer than a predefined number of characters, such as tokens of only one character, such as “a”); and removing stop words (words in a predefined set which occur with high frequency in documents, such as “and” and “the”). The client and service provider may agree ahead of time on what data normalization operations will be performed by the client system, if any. The normalization scheme employed by the normalization module 114 may be configured to allow the service provider to process the contents, even after they have been encrypted. The normalization may be dependent on what type of processing is to be performed by the service provider. Removing small words, accents, apostrophes and punctuation, and lower casing allows, for example, the XML processing module 214 of provider device 200 to find near matches that might differ by only punctuation or a small word to enable table of contents detection. Once the text 4 is encoded, it becomes difficult to detect such near matches, as similar words (e.g., which differ in their case) tend to be encoded to completely unrelated values. Similarly, word tokens may include commas or periods in the tokens, causing a word to encode to a different value due to the included punctuation. That is, “cat” and “cat?” may encode to different values. Removing punctuation, accents, and apostrophes avoids this. For other processing operations, it may be desirable to perform different normalization operations. For example, processing which identifies sentences and paragraphs may benefit from retaining some of these features.
Thus, a document may be sent to the service provider two or more times, in different normalized forms, for different processing operations to be performed. - The content encoding/decoding module (encoder) 116 encodes the
content 4 of the document 2 after it has been normalized (if any normalization is to be performed). The encoder 116 also decodes the content when the modified document 42 is received from the service provider. The reconstruction module 119 reconstructs the document, to generate the transformed (unencoded) document 44. Reconstruction of the modified document 42 may be desirable if, for example, a text reordering service has been provided. For some services, such as table of contents (ToC) detection, which mainly add metadata, the reconstruction module 119 may only need to replace the metadata in source document 2 with the metadata from the modified document 42, avoiding the need to decode the modified document 42. - Optionally, a transmission encryption module (a document-level encryption module) 118 may be provided by the operating system that runs on the
client system 100 or may be a higher level component. Theencryption module 118 encrypts the text-levelencrypted XML document 36 before it is transmitted to protect thedocument metadata encryption module 118 may encrypt the entire document, without regard to which part of the document is content or structure, potentially encrypting a portion of the document that is already encoded. Theencryption module 118 also decrypts the modifieddocument 42 when it is received from theprovider system 200. Theprovider system 200 includes acorresponding encryption module 218 which decrypts theXML document 36 to provide an XML document with encodedcontent 38. Theencryption module 218 is not configured for decoding encodedcontent 38 which was encrypted using a different encrypting scheme or encryption key. If thecontent 4 was hashed, there is generally no practical way to reconstruct thecontent 4 from the hash. Thetransmission encryption modules - The provider system 200 (
FIG. 4 ) also includes anXML processing module 214, which may include one or more dedicated processing modules, illustrated asmodules exemplary system 200 these modules perform, respectively, text segmentation, text ordering, ToC detection, header/footer detection, and image segmentation and ordering. These modules are discussed below with respect to the method ofFIGS. 5 . As will be appreciated, rather than several modules as shown, two or more ofmodules - In the exemplary embodiment,
components main memory respective computer processor processor respective computer systems memories Hardware components data control bus 132. A similardata control bus 232 fordevice 200 connectshardware components - As will be appreciated,
systems components device 200. - The
client system 100 and encrypted XML processingservice provider system 200 may each comprise one or more computing devices, such as a personal computer, PDA, laptop computer, server computer, or combination thereof. In some embodiments, thesystems Memories 106, 108 (206, 208) may be integral or separate and may represent any type of computer readable medium such as random access memory (RAM), read only memory (ROM), magnetic disk or tape, optical disk, flash memory, or holographic memory. In one embodiment, thememories 106, 108 (206, 208) comprise a combination of random access memory and read only memory. In some embodiments, the processor 122 (222) and memory 106 (206) and/or 108 (208) may be combined in a single chip. - The
systems outputs - The term “software” as used herein is intended to encompass any collection or set of instructions executable by a computer or other digital system so as to configure the computer or other digital system to perform the task that is the intent of the software. The term “software” as used herein is intended to encompass such instructions stored in a storage medium such as RAM, a hard disk, optical disk, or so forth, and is also intended to encompass so-called “firmware” that is software stored on a ROM or so forth. Such software may be organized in various ways, and may include software components organized as libraries, Internet-based programs stored on a remote server or so forth, source code, interpretive code, object code, directly executable code, and so forth. It is contemplated that the software may invoke system-level code or calls to other software residing on a server or other location to perform certain functions.
-
FIG. 5 illustrates an integrated method for processing an XML document performed by the client system 100 and the service provider 200 remotely through a network 126 such as the Internet. Before transmitting a document 36, the client and service provider may agree on a data normalization method and, optionally, a transmission encryption method. - The method begins at S100.
- At S102, a
source XML document 2 is received by client system 100, and may be stored in memory 106 during processing. The XML document 2 may be retrieved from document storage 130. Optionally, an ancillary document 16 is also retrieved. In some embodiments, document 2 is generated from a scanned document that is transformed into an XML document. This may involve performing optical character recognition on the scanned document and storing the document as a PDF. A conversion tool, such as PDF2XML, may be used to translate the document from an unstructured or minimally structured format, such as PDF, to XML. Alternatively, the document may be a legacy document translated into XML by a process specific to the format of the legacy document. - At S104, the content portions of
document 2 are identified. Thecontent containing portions 4 and thestructural information structure detection module 120 and/orcontent detection module 121 of theparser 112, optionally using theschema 16. For example, all portions within inverted arrows or other content-denoting symbols may be identified astext content portions 4. - At S106, the content portions of
document 2 are optionally normalized, e.g., using normalization module 114. - At S108, the
content portions 4 only of document 2 are encoded, e.g., by encoder 116, to produce document 36 with encoded content. Each token 4 is separately encoded to provide a respective encoded element 38, which in the exemplary embodiment is stored in its respective TOKEN node 3. The content portions 4 of document 2 may be encoded (S108) by hashing or encryption, token by token. Because the content is encoded token by token, the service provider system can still perform textual similarity comparison between text elements or between sequences of text elements, as two tokens 4 consisting of the same word will hash or encrypt to the same value. A shared key may be used, as the content may only be decrypted by the client system 100, though public/private key encryption may be used. The content may alternatively be hashed, for example using MD5. - At S110, the
document 36 may be encrypted as a whole in a manner which can be decrypted by the service provider to protect against third parties intercepting the document during transmission. The structural information 5 may be left unencrypted, in markup language text, or may be at least partially encrypted by a shared key or a public key of the service provider, allowing the service provider to decrypt the structural information. The ancillary document 16, if used, may also be encrypted using a shared key or the service provider's public key, if it is to be transmitted to the service provider. For example, the normalized document is optionally encrypted by globally encrypting the entire document or by encrypting only the plain text structural portions. The global encrypting approach may be performed by the module 118 using TLS/SSL. - At S112 the encoded and optionally
encrypted document 36 is transmitted to the provider system 200, e.g., via the network 126. - At S114, the
document 36 is received by the system 200 and stored in memory 206. At S116, in the case of a document which has been encrypted at S110, the document is decrypted, e.g., by the module 218. - At S118, the structural portions and encoded content portions of
document 36 are identified/distinguished by structure detection module 220 and/or content detection module 221 of the parser 212, optionally using the ancillary document, similarly to step S220 of FIG. 6. - At S120, one or more processing operations are performed on the
text document 36. In particular, one or more modules of the XML processing module 214 operate on the decrypted, but still content-encoded document 36 to produce a modified document 42. Textual operations (which may involve changing image locations) on the document may include one or more XML operations, possibly using ancillary document 16, on the document 36 to produce a modified document 42. The operations, discussed below, may include one or more of text ordering, text segmenting, table of contents (ToC) recognition, page header/footer detection, and image segmentation/ordering. - At S122, the document may be re-encrypted, e.g., by
module 218, optionally using the same document encryption scheme used by system 100. This may be performed similarly to step S110 of FIG. 5. The ancillary document generally need not be re-encrypted, as the ancillary document contains unaltered structural information, and may not be sent back with the document 36. - At S124, the processed and optionally
encrypted document 42 is output, e.g., transmitted back from the system 200 to system 100 via the network 126. - At S126, the modified
document 42 is received by the system 100, which can be the same or a different computing device from the device 100 which transmitted the document to the system 200, and stored in memory 106. - At S128, the modified
document 42 received by the client system 100 is decrypted, if it was sent from the service provider in encrypted form. - At S130, the processing performed on the
document 36 by the provider system 200 is used to generate a transformed document 44. In one embodiment, this may be achieved by decoding the encoded content 38 of document 42 and reconstructing the document to produce document 44. In other embodiments, the changes between document 36 and document 42 that were implemented by the XML processing module 214 are identified from the modified document 42 and applied to the original document 2 to generate document 44. In the exemplary embodiment, this allows the client system 100 to generate a document which is equivalent to that which would have been generated had the service provider been provided with the source document in an unencoded form, but optionally encrypted as a whole in a manner which allows the service provider to decrypt the document including the tokens. - At S132, the client system may
output document 44, for example to document storage system 130 or to a user. In other embodiments, the method may return to S106, where a different normalization operation may be performed and steps S108-S130 repeated, such that at S120, a different processing operation is performed on the encoded document 36. This further processing may be performed by the same or a different provider system 200. As will be appreciated, the method may be reiterated more times. - The method ends at S134.
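The client/provider round trip of steps S108, S120, and S130 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the described implementation: tokens are modelled as dictionaries with hypothetical `text`, `x`, and `y` fields, the encoding is a truncated HMAC-SHA256, the provider's processing is a stand-in sort by position, and transmission encryption (S110, S116, S122, S128) is omitted.

```python
import hashlib
import hmac


def client_encode(tokens, key):
    """S108: encode each token; keep a private table for later decoding."""
    table, encoded = {}, []
    for tok in tokens:
        code = hmac.new(key, tok["text"].encode("utf-8"),
                        hashlib.sha256).hexdigest()[:12]
        table[code] = tok["text"]
        encoded.append({**tok, "text": code})
    return encoded, table


def provider_process(encoded):
    """S120: stand-in processing; order tokens top-to-bottom, left-to-right."""
    return sorted(encoded, key=lambda t: (t["y"], t["x"]))


def client_decode(modified, table):
    """S130: restore the original text in the provider's new order."""
    return [{**tok, "text": table[tok["text"]]} for tok in modified]


source = [
    {"text": "news", "x": 120, "y": 90},
    {"text": "daily", "x": 72, "y": 90},
    {"text": "headline", "x": 72, "y": 60},
]
encoded, table = client_encode(source, key=b"client-secret-key")
modified = provider_process(encoded)       # provider never sees plain text
restored = client_decode(modified, table)
```

The lookup table plays the role of decoding for a hash-based encoding, which cannot be inverted directly. With reversible encryption the client could decrypt each token instead.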
- Further details of the method will now be described.
- The normalization may include one or more of converting all characters to lower case, converting accented characters to unaccented characters (or to a combination of characters), removing apostrophes, removing punctuation from within word tokens, removing numbers, and ignoring or removing short words, etc. Once text is encoded, the normalization generally cannot be accomplished. The service provider may provide the
software tool 114 for performing normalization, as the normalization steps are determined by what transformations will be performed by the service provider. - In some embodiments, the XML schema may be enriched to allow normalization information to be incorporated into the
document 2 and passed to the service provider. For example, one or more new XML attributes may be introduced, e.g. via attributes such as Capitalized=“true”. However, this may entail modifying the processing software used by the service provider, which may not be desirable for maintaining conformity over a variety of different clients. In another embodiment, differences in capitalization may be incorporated into the encryption (capital letters encrypted as capitals and lower case letters as lower case). Although providing a relatively weak level of protection, this may not be a problem. - In another embodiment, the textual content may be normalized and encoded at the same time by replacing any lowercase letter by ‘I’ (for instance), any uppercase letter by ‘L’, and any number by ‘0’. The text is lost (i.e., it cannot be regenerated from the encoding) but the customer can reconstruct the appropriate result by re-injecting the text in the token placeholders of the modified
document 42 or by incorporating the modifications made in the modified document 42 into the source document 2. While this method preserves the same number of characters in the encoded token, this can be a disadvantage for security. In another embodiment, any sequence of ‘I’ is replaced by one single ‘I’. This may provide better encryption but the word length is lost. - As noted in above-mentioned U.S. application Ser. No. 13/103,216, low entropy encrypted data may, in some cases, raise a security issue, for example if common text values are easily guessed by an attacker. In one embodiment, the method makes use of encryption padding to make it more difficult to work out the encryption mechanism. Identical padding may be used for identical tokens to ensure that they encrypt to the same value.
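The character-class embodiment described above (lowercase letters to ‘I’, uppercase letters to ‘L’, digits to ‘0’, optionally collapsing runs of ‘I’) might be sketched as follows. The function names are hypothetical, and passing other characters through unchanged is an assumption, since the text does not say how they are handled.

```python
import re


def class_encode(token):
    """Map lowercase letters to 'I', uppercase letters to 'L', digits to '0'."""
    out = []
    for ch in token:
        if ch.islower():
            out.append("I")
        elif ch.isupper():
            out.append("L")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)  # assumption: other characters pass through
    return "".join(out)


def collapse_lowercase_runs(encoded):
    """Replace any run of 'I' with a single 'I', hiding the word length."""
    return re.sub(r"I+", "I", encoded)


same_length = class_encode("Daily")                # keeps the character count
shortened = collapse_lowercase_runs(same_length)   # word length no longer visible
```

Because this mapping is many-to-one, the client cannot decode it and has to re-inject the original text by position, as described above.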
- In some embodiments, normalization may include removal of common prefixes and suffixes, such as removal of “ing” and “ed” to allow conjugated verbs to match each other.
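A minimal sketch of a client-side normalization step covering several of the operations listed above (lower casing, accent removal, apostrophe and punctuation removal, dropping short words and stop words, and crude suffix removal). The stop-word and suffix sets contain only the examples given in the text. The minimum length, the guard that keeps at least three characters of a stem, and the function name are assumptions.

```python
import string
import unicodedata

STOP_WORDS = {"and", "the"}   # the two examples given in the text
SUFFIXES = ("ing", "ed")      # the two examples given in the text


def normalize_token(token, min_length=2):
    """Return the normalized token, or None if the token should be dropped."""
    text = token.lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation (including ') and the curly apostrophe.
    text = "".join(
        ch for ch in text if ch not in string.punctuation and ch != "\u2019"
    )
    if len(text) < min_length or text in STOP_WORDS:
        return None
    # Crude suffix removal; keeps at least three characters of the stem.
    for suffix in SUFFIXES:
        if text.endswith(suffix) and len(text) > len(suffix) + 2:
            text = text[: -len(suffix)]
            break
    return text


normalized = [normalize_token(t) for t in ["Cat?", "cat", "Café", "The", "walked"]]
```

With this step applied before encoding, “cat” and “cat?” normalize to the same string and therefore encode to the same value under a deterministic encoding.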
- In some embodiments, the
service provider system 200 may be configured to receive and process both encoded and unencoded documents (e.g., from two different clients with different security needs). In one embodiment, the service provider's XML processing module 214 need not be able to distinguish between encoded and unencoded text and may perform a second normalization on all documents, even encoded documents. By using an encoding alphabet that is normalization safe (unchanged by the second normalization algorithm), this second normalization by the service provider can be performed without the service provider having to provide separate processing paths for encoded and unencoded documents. The encoding alphabet (after normalization) may thus contain only lower case characters and contain no accents or white space characters. In another embodiment, the service provider processing module 214 may be aware of whether it receives encoded or unencoded text, or may be configured for detecting the different forms of text automatically. In some embodiments, where the service provider system 200 performs a second normalization, the service provider system may also “de-normalize” the document after processing, by keeping track of all normalization operations and undoing them. - Example content (e.g., text) processing operations which may be performed on
encrypted document 36 at S120 may include some or all of the following.
- Text Ordering and Segmentation
- The text segmentation module 240 segments the flow of text into lines and paragraphs. The text ordering module 242 orders the flow so as to respect human-reading order. For example, if an original document has two columns, the contents of the two columns may, after scanning and OCR, be stored in one large column, in which each row contains a space in the middle, separating the two columns. While such a format would print correctly and be viewable on screen, converting the document to HTML or epub would render the document unreadable. The text ordering module is configured to recognize the two columns and re-order the text so that lines run consecutively, allowing further conversion into another format. The text ordering may move the encrypted or hashed tokens to new locations in the document to produce modified document 42. On receipt, the reconstruction module 119 of the client system may replace the encoded tokens with their corresponding words to reconstruct the document text.
- Segmentation identifies paragraphs in the document, which can be an aid to other services, discussed below. The segmentation module 240 may store the paragraph information by the addition of metatags (structural information) to the document 36 to produce modified document 42. Generally, segmentation works at a page level, and does not consider paragraphs which span two or more pages. This is generally sufficient to allow for further processing. Both segmentation and text ordering generally rely on the geometry of the words (tokens) on a page. As text ordering and segmentation services rely on the page layout data 12 of the structural data (such as location and size), the client system may encrypt or hash each token (word) without altering the operation of the text ordering and segmentation module.
- In the case of text ordering and segmentation, the service provider system does not generally need to compare the encrypted text content 36, so, for this particular service, it may be sufficient to use a serial number or other unique identifier for each token, rather than encryption or hashing. However, since text ordering and segmentation is often performed as a precursor to other services, it may be advantageous to use encryption or hashing, to avoid having to construct a second version of the encoded document 36 for the service provider to perform the subsequent operations.
- In another embodiment, the client system may add a tag to the metadata 6 of the document 36, indicating the order of the tokens, and the text ordering module 242 of the service provider system may update the metadata to reflect the modified ordering of the tokens.
- In another embodiment, a format that is line based rather than token based is used. Each row of text is made up of one or more lines. For example, in a two column document, each row may contain a maximum of two lines, in the ideal case. The metadata 6 may contain a line order tag, which the service provider may update, after reordering portions of lines instead of single tokens. This approach may also be extended to paragraphs. Recognizing lines or paragraphs may entail greater sophistication on the part of the client's scanning software and/or greater processing on the client system. One potential advantage of this approach is that the encryption or hash can be more secure, as larger chunks may be encrypted, making it more difficult for an unauthorized user to infer any plaintext from the layout of the document. As with the unique identifier approach above, however, too large a chunk of text may make further processing on the encrypted text difficult. In the line or paragraph approach, a hash method, possibly salted, could be used. The customer simply needs to reconstruct the correct output document by locating each part of the original text through its hash. This assumes that the document partition is preserved by the service provider system and that the parts generally have a unique identifier, allowing the customer to project the newly created metadata (in the received, encrypted document 42) back onto the original document 2.
- The identification of text order and segments can be performed using existing techniques. Order may be detected, for example, from layout clues that indicate columns. Segments may be inferred from gaps or whitespace in the document. An XY cut method may be used to find narrow or wide strips to identify paragraphs and/or lines, as described in U.S. Pat. No. 7,392,473 to Meunier, entitled METHOD AND APPARATUS FOR DETERMINING LOGICAL DOCUMENT STRUCTURE. An example method for segmenting a document is described in US Pub. No. 2004/0013302 to Ma, et al., entitled DOCUMENT CLASSIFICATION AND LABELING USING LAYOUT GRAPH MATCHING. The disclosures of both of these documents are incorporated herein by reference in their entireties.
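The geometry-only reasoning above can be illustrated with a minimal sketch. It assumes each encoded token carries an opaque code plus a bounding box, looks for the widest empty vertical strip (a single vertical cut, far simpler than the full XY cut method cited above), and re-orders the tokens column by column. The names `EncodedToken`, `find_column_split`, and `reading_order`, and the `min_gap` threshold, are hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedToken:
    code: str  # opaque hash or ciphertext; never interpreted here
    x: float   # left edge of the bounding box
    y: float   # top edge (a larger y is lower on the page)
    w: float   # width of the bounding box


def find_column_split(tokens, min_gap=20.0):
    """Return the x position of the widest empty vertical strip, or None."""
    if not tokens:
        return None
    intervals = sorted((t.x, t.x + t.w) for t in tokens)
    best_gap, best_x = 0.0, None
    reach = intervals[0][1]  # rightmost x covered so far
    for left, right in intervals[1:]:
        if left - reach > best_gap:
            best_gap, best_x = left - reach, (left + reach) / 2
        reach = max(reach, right)
    return best_x if best_gap >= min_gap else None


def reading_order(tokens, min_gap=20.0):
    """Order tokens left column first, then right column, top to bottom."""
    split = find_column_split(tokens, min_gap)

    def key(t):
        column = 0 if split is None or t.x < split else 1
        return (column, t.y, t.x)

    return sorted(tokens, key=key)


# Two-column page as stored after OCR: rows interleave both columns.
scanned = [
    EncodedToken("h1", 0, 0, 40), EncodedToken("h2", 50, 0, 40),
    EncodedToken("h5", 200, 0, 40), EncodedToken("h6", 250, 0, 40),
    EncodedToken("h3", 0, 20, 40), EncodedToken("h4", 50, 20, 40),
    EncodedToken("h7", 200, 20, 40), EncodedToken("h8", 250, 20, 40),
]
print([t.code for t in reading_order(scanned)])
```

The function only reads positions and widths, so it behaves the same whether `code` holds a hash, a ciphertext, or a serial number, which is the property the passage relies on.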
- Table of Contents Recognition
- The Table of Contents (ToC) module 244 may process document 36 after processing by the text ordering module 242 and/or segmentation module 240.
- The ToC module is configured for recognizing a Table of Contents in a document, if it exists, and, for each entry in the ToC which is identified, for constructing a link or other pointer to the portion of the document referenced by the entry in the ToC. Paragraphs or pages which are referenced by the ToC may also be labeled by a metadata tag. This can be a flag or Boolean that is true for paragraphs or pages which are referenced by the ToC. The ToC referenced paragraphs or pages may also be labeled with metadata identifying the link from the ToC. Methods for identifying a table of contents may be found in U.S. Pat. No. 7,890,859, issued Feb. 15, 2011, entitled RAPID SIMILARITY LINKS COMPUTATION FOR TABLE OF CONTENTS DETERMINATION, to Meunier, and U.S. Pat. No. 7,743,327, entitled TABLE OF CONTENTS EXTRACTION WITH IMPROVED ROBUSTNESS, to Meunier and Déjean, the disclosures of which are incorporated herein by reference in their entireties.
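As a rough sketch of how such links might be computed on encoded content alone: if the client encodes each word deterministically, equal words yield equal codes, so a ToC entry and the part it references can be matched by comparing sets of codes. The sketch below uses a keyed hash (HMAC-SHA-256) and Jaccard similarity as stand-ins; neither is claimed to be the method of the cited patents, and `encode_text`, `link_toc_entries`, the stopword list, the truncated digest length, and the threshold are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical list of small, frequent words dropped before encoding.
STOPWORDS = {"the", "a", "an", "of", "and"}


def encode_text(text, key):
    """Client side: lowercase, drop stopwords, then hash each word with a key."""
    words = [w.lower() for w in text.split()]
    return [
        hmac.new(key, w.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
        for w in words
        if w not in STOPWORDS
    ]


def jaccard(a, b):
    """Similarity of two code sequences, ignoring order and repetition."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def link_toc_entries(toc_entries, parts, threshold=0.5):
    """Service side: map each ToC entry index to its best-matching part index."""
    links = {}
    if not parts:
        return links
    for i, entry in enumerate(toc_entries):
        scores = [jaccard(entry, part) for part in parts]
        best = max(range(len(parts)), key=scores.__getitem__)
        if scores[best] >= threshold:
            links[i] = best
    return links


key = b"demo-key"  # placeholder; a real key would stay on the client
toc = [encode_text("System Overview", key), encode_text("Encoding of Tokens", key)]
parts = [
    encode_text("this page was scanned yesterday", key),
    encode_text("SYSTEM OVERVIEW", key),
    encode_text("some unrelated body text", key),
    encode_text("Encoding the tokens", key),
]
print(link_toc_entries(toc, parts))
```

Lowercasing and stopword removal happen before hashing because the service side cannot undo them afterwards: "Overview" and "OVERVIEW" would otherwise produce unrelated codes.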
- The ToC module 244 may function by first computing the textual similarity between pairs of document parts, where the parts can each be a sequence of tokens, such as a paragraph, line, a few tokens, or even a single token, then by identifying the ToC based on this pairwise similarity. In the exemplary embodiment, the sequences compared can be sequences of encoded tokens 38. The ToC may be further refined using the metadata (e.g., clues such as typography changes or page layout).
- The similarity measure used by the ToC module 244 may function better if the normalization of the textual content provides lowercasing and removes small, frequent words, since such variations, once encoded, could cause text which should be considered a match to be treated as different.
- Once the service provider sends back the result 42, the reconstruction module 119 of the client system may update the original document 2 with the updated ToC and additional metadata (link locations and paragraph flags). Generating a ToC generally does not reorder the textual data, so the client need not decrypt or look up the hash values of the encoded content 38; however, this method of reconstruction is not excluded.
- Page Headers/Footers
- The header/footer module 246 can perform one or both of page header detection and page footer detection, in which encoded tokens 38 are tagged with metadata identifying them as being part of a respective header or footer. This detection may be based on structural information and/or similarity between the encoded tokens on subsequent pages. The client system can thus treat the header and footer as regular text, relying on the header/footer module 246 of the service provider to identify the header and footer based on, for example, page layout, in a similar manner to text ordering and segmentation. The header and footer information may be tagged by adding tags to the metadata.
- The module 246 may also be configured for detection of page numbers, which may involve recognition of a pattern of numbers (Arabic and/or Roman numerals) and replacing them with a simple regular expression. This may entail the encoding scheme incorporating a detectable sequentiality.
- Page header/footer detection may be performed, for example, using the methods described in U.S. Pat. No. 7,937,653, issued May 3, 2011, entitled METHOD AND APPARATUS FOR DETECTING PAGINATION CONSTRUCTS INCLUDING A HEADER AND A FOOTER IN LEGACY DOCUMENTS, by Hervé Déjean, et al., the disclosure of which is incorporated herein by reference in its entirety.
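The combination of layout and cross-page similarity described above might be sketched as follows. This is a simplified illustration, not the method of the cited patent: it assumes each page is a list of `(code, y)` pairs, treats the top and bottom bands of the page as candidate zones, and tags a token as header or footer only if the same code recurs in the same zone on most pages. The function name `tag_headers_footers` and the `band` and `min_share` parameters are hypothetical.

```python
from collections import Counter


def tag_headers_footers(pages, page_height, band=0.1, min_share=0.6):
    """Tag each (code, y) token as 'header', 'footer', or 'body'."""
    top_limit = band * page_height
    bottom_limit = (1 - band) * page_height

    def zone(y):
        if y <= top_limit:
            return "header"
        if y >= bottom_limit:
            return "footer"
        return None

    # Count each (zone, code) pair at most once per page.
    counts = Counter()
    for page in pages:
        counts.update({(zone(y), code) for code, y in page if zone(y)})

    needed = min_share * len(pages)
    tagged = []
    for page in pages:
        row = []
        for code, y in page:
            z = zone(y)
            repeated = z is not None and counts[(z, code)] >= needed
            row.append((code, z if repeated else "body"))
        tagged.append(row)
    return tagged


# Three pages sharing a header ("a1", "b2") and a footer ("f9");
# "n0".."n2" stand for page numbers, whose codes differ on every page.
pages = [
    [("a1", 20), ("b2", 20), (f"p{i}", 500), ("f9", 980), (f"n{i}", 985)]
    for i in range(3)
]
for row in tag_headers_footers(pages, page_height=1000):
    print(row)
```

In this sketch the page-number tokens fall in the footer band but are tagged as body, because their codes never repeat. That is consistent with the remark above that page-number detection may require an encoding scheme with detectable sequentiality.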
- Images
- Images in the document 36 may also be segmented and/or ordered by the image segmentation/ordering module 248. This may be performed similarly to the text ordering already described, for example, based on size, layout, and similarity of nearby text. Groups of related images may be identified and kept together as one large image. In one embodiment, the images are not sent within the encoded document, only their bounding boxes. No data about the image need be sent other than its dimensions, so encryption of the images is unnecessary. In another embodiment, the images are encrypted or hashed. Surrounding text may be identified as labels or captions associated with the image. Text surrounding the image may be used to identify an image that is referenced by the ToC. Just as in the case of text ordering, the images may be moved in the document.
- In another embodiment, the images (or bounding boxes) in the encoded document may have metadata associated with them at the client system, indicating their sequence in the document. This information may be modified by the image segmentation/ordering module 248.
- The following references, the disclosures of which are incorporated herein by reference in their entireties, relate to other methods of page layout analysis which may be adapted for use by the XML processing module 214: U.S. patent application Ser. No. 13/484,708, filed on May 31, 2012, entitled TYPOGRAPHICAL BLOCK GENERATION, by Hervé Déjean; U.S. patent application Ser. No. 13/462,053, filed on May 2, 2012, entitled POST OPTICAL CHARACTER RECOGNITION DETERMINATION OF FONT SIZE, by Jean-Luc Meunier; U.S. Pat. No. 5,708,730, entitled TABLE RECOGNITION APPARATUS, by Katsuhiko Itonori; U.S. Pat. No. 6,785,420, entitled METHOD AND APPARATUS FOR TABLE RECOGNITION, APPARATUS FOR CHARACTER RECOGNITION, AND COMPUTER PRODUCT, by Toshifumi Yamaai; U.S. Pat. No. 7,392,473, issued Jun. 24, 2008, entitled METHOD AND APPARATUS FOR DETERMINING LOGICAL DOCUMENT STRUCTURE, by Jean-Luc Meunier; U.S. Pat. No. 7,693,848, issued Apr. 6, 2010, entitled METHOD AND APPARATUS FOR STRUCTURING DOCUMENTS BASED ON LAYOUT, CONTENT AND COLLECTION, by Hervé Déjean, et al.; U.S. Pat. No. 7,739,587, issued Jun. 15, 2010, entitled METHODS AND APPARATUSES FOR FINDING RECTANGLES AND APPLICATION TO SEGMENTATION OF GRID-SHAPED TABLES, by Jean-Yves Vion-Dury; U.S. Pub. No. 20060155703, published Jul. 13, 2006, entitled METHOD AND APPARATUS FOR DETECTING A TABLE OF CONTENTS AND REFERENCE DETERMINATION, by Hervé Déjean, et al.; U.S. Pub. No. 20120005225, published Jan. 5, 2012, entitled METHOD FOR LAYOUT BASED DOCUMENT ZONE QUERYING, by Boris Chidlovskii; U.S. application Ser. No. 12/974,843, filed on Dec. 21, 2010, entitled SYSTEM AND METHOD FOR LOGICAL STRUCTURING OF DOCUMENTS BASED ON TRAILING AND LEADING PAGES, by Hervé Déjean; U.S. application Ser. No. 13/155,011, filed Jun. 7, 2011, entitled A GENERATE-AND-TEST METHOD FOR COLUMN SEGMENTATION, by Hervé Déjean; and U.S. application Ser. No. 13/162,858, filed Jun. 17, 2011, entitled DETECTION AND EXTRACTION OF ELEMENTS CONSTITUTING IMAGES IN UNSTRUCTURED DOCUMENT FILES, by Hervé Déjean.
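For the embodiment above in which only image bounding boxes are sent, grouping related images into one large image can be approximated from geometry alone. The sketch below repeatedly merges any two boxes whose separation is at most a threshold, replacing them with their union. It is an illustrative assumption rather than the method of any cited reference; `Box`, `gap`, `union`, `group_boxes`, and `max_gap` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float


def gap(a, b):
    """Largest axis-wise separation between two boxes (0 if they overlap)."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return max(dx, dy)


def union(a, b):
    """Smallest box containing both a and b."""
    return Box(min(a.x0, b.x0), min(a.y0, b.y0), max(a.x1, b.x1), max(a.y1, b.y1))


def group_boxes(boxes, max_gap=10.0):
    """Merge boxes closer than max_gap until no further merge is possible."""
    groups = list(boxes)
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if gap(groups[i], groups[j]) <= max_gap:
                    groups[i] = union(groups[i], groups[j])
                    del groups[j]
                    merged_any = True
                    break
            if merged_any:
                break  # restart the scan, since the merged box has grown
    return groups


# Two adjacent image fragments and one distant image.
boxes = [Box(0, 0, 100, 100), Box(105, 0, 200, 100), Box(0, 400, 100, 500)]
print(group_boxes(boxes))
```

Because the input is nothing more than coordinates, no image content has to leave the client for this step.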
- The method illustrated in FIG. 5 may be implemented in one or more computer program product(s) that may be executed on a respective computer. The computer program product(s) may comprise a non-transitory computer-readable recording medium on which a control program is recorded, such as a disk, hard drive, or the like. Common forms of non-transitory computer-readable media include, for example, floppy disks, flexible disks, hard disks, magnetic tape, or any other magnetic storage medium, CD-ROM, DVD, or any other optical medium, a RAM, a PROM, an EPROM, a FLASH-EPROM, or other memory chip or cartridge, or any other non-transitory medium from which a computer can read and use.
- Alternatively, the method may be implemented in transitory media, such as a transmittable carrier wave in which the control program is embodied as a data signal using transmission media, such as acoustic or light waves, such as those generated during radio wave and infrared data communications, and the like.
- The exemplary method may be implemented on general purpose computers, special purpose computer(s), a programmed microprocessor or microcontroller and peripheral integrated circuit elements, an ASIC or other integrated circuit, a digital signal processor, a hardwired electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD, PLA, FPGA, Graphical card CPU (GPU), or PAL, or the like. In general, any devices, capable of implementing a finite state machine that is in turn capable of implementing a respective part of the flowchart shown in FIG. 5, can be used to implement the exemplary method.
- It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may be combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.
Claims (27)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/648,462 US8812870B2 (en) | 2012-10-10 | 2012-10-10 | Confidentiality preserving document analysis system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
US20140101456A1 true US20140101456A1 (en) | 2014-04-10 |
US8812870B2 US8812870B2 (en) | 2014-08-19 |
Family
ID=50433721
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/648,462 Active 2032-12-20 US8812870B2 (en) | 2012-10-10 | 2012-10-10 | Confidentiality preserving document analysis system and method |
Country Status (1)
Country | Link |
---|---|
US (1) | US8812870B2 (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140101007A1 (en) * | 2012-10-04 | 2014-04-10 | Quickdash, Llc | Methods and apparatus for providing data normalization, scalability and maintainability |
US20140279426A1 (en) * | 2013-03-15 | 2014-09-18 | Elwha Llc | Devices, methods, and systems for technologically shifting options and modalities |
US20150104013A1 (en) * | 2013-10-10 | 2015-04-16 | Elwha Llc | Methods, systems, and devices for handling captured image data that is received by devices |
US20150104002A1 (en) * | 2013-10-10 | 2015-04-16 | Elwha Llc | Methods, systems, and devices for handling image data from captured images |
US9799036B2 (en) | 2013-10-10 | 2017-10-24 | Elwha Llc | Devices, methods, and systems for managing representations of entities through use of privacy indicators |
CN107809643A (en) * | 2017-11-13 | 2018-03-16 | 郑州云海信息技术有限公司 | A kind of coding/decoding method of image, device and medium |
US10013564B2 (en) | 2013-10-10 | 2018-07-03 | Elwha Llc | Methods, systems, and devices for handling image capture devices and captured images |
US10055384B2 (en) * | 2016-12-31 | 2018-08-21 | Entefy Inc. | Advanced zero-knowledge document processing and synchronization |
US10102543B2 (en) | 2013-10-10 | 2018-10-16 | Elwha Llc | Methods, systems, and devices for handling inserted data into captured images |
US10185841B2 (en) | 2013-10-10 | 2019-01-22 | Elwha Llc | Devices, methods, and systems for managing representations of entities through use of privacy beacons |
CN109753285A (en) * | 2018-12-26 | 2019-05-14 | 浙江大学 | A kind of XML parser based on FPGA, restructurable computing system |
US10304095B2 (en) * | 2008-02-04 | 2019-05-28 | Thomson Reuters Global Resources Unlimited Company | System and method for accounting gateway |
US10346624B2 (en) | 2013-10-10 | 2019-07-09 | Elwha Llc | Methods, systems, and devices for obscuring entities depicted in captured images |
US10681019B2 (en) * | 2012-10-19 | 2020-06-09 | International Business Machines Corporation | Secure sharing and collaborative editing of documents in cloud based applications |
US10834290B2 (en) | 2013-10-10 | 2020-11-10 | Elwha Llc | Methods, systems, and devices for delivering image data from captured images to devices |
US11068486B2 (en) * | 2014-04-04 | 2021-07-20 | Siemens Aktiengesellschaft | Method for automatically processing a number of log files of an automation system |
US20210350080A1 (en) * | 2020-01-24 | 2021-11-11 | Thomson Reuters Enterprise Centre Gmbh | Systems and methods for deviation detection, information extraction and obligation deviation detection |
US20220405474A1 (en) * | 2021-06-21 | 2022-12-22 | Kongju National University Industry-University Cooperation Foundation | Method, computing device and computer-readable medium for classification of encrypted data using neural network |
US11625662B2 (en) | 2016-09-22 | 2023-04-11 | Qvinci Software, Llc | Methods and apparatus for the manipulating and providing of anonymized data collected from a plurality of sources |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9477913B2 (en) | 2014-11-12 | 2016-10-25 | Xerox Corporation | Print path obfuscation method and system for document content analytics assessment |
CN105653520B (en) * | 2015-12-30 | 2018-11-06 | 北京奇艺世纪科技有限公司 | A kind of segmenting method and device based on graphics processor GPU |
WO2023055807A1 (en) * | 2021-09-28 | 2023-04-06 | RDW Advisors, LLC. | System and method for an artificial intelligence data analytics platform for cryptographic certification management |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6424989B1 (en) * | 1991-09-20 | 2002-07-23 | Venson M. Shaw | Object-oriented transaction computing system |
JP2789971B2 (en) | 1992-10-27 | 1998-08-27 | 富士ゼロックス株式会社 | Table recognition device |
US6385341B1 (en) * | 1997-04-17 | 2002-05-07 | Microsoft Corporation | Technique for decoding variable length data codes |
JP4159720B2 (en) | 2000-03-15 | 2008-10-01 | 株式会社リコー | Table recognition method, table recognition device, character recognition device, and storage medium storing table recognition program |
US20040013302A1 (en) | 2001-12-04 | 2004-01-22 | Yue Ma | Document classification and labeling using layout graph matching |
US7693848B2 (en) | 2005-01-10 | 2010-04-06 | Xerox Corporation | Method and apparatus for structuring documents based on layout, content and collection |
US8706475B2 (en) * | 2005-01-10 | 2014-04-22 | Xerox Corporation | Method and apparatus for detecting a table of contents and reference determination |
US7937653B2 (en) | 2005-01-10 | 2011-05-03 | Xerox Corporation | Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents |
US7392473B2 (en) | 2005-05-26 | 2008-06-24 | Xerox Corporation | Method and apparatus for determining logical document structure |
US7743327B2 (en) | 2006-02-23 | 2010-06-22 | Xerox Corporation | Table of contents extraction with improved robustness |
US7890859B2 (en) | 2006-02-23 | 2011-02-15 | Xerox Corporation | Rapid similarity links computation for table of contents determination |
US7739587B2 (en) | 2006-06-12 | 2010-06-15 | Xerox Corporation | Methods and apparatuses for finding rectangles and application to segmentation of grid-shaped tables |
US8185565B2 (en) * | 2007-11-16 | 2012-05-22 | Canon Kabushiki Kaisha | Information processing apparatus, control method, and storage medium |
US8302170B2 (en) * | 2008-09-22 | 2012-10-30 | Bespoke Innovations S.A.R.L. | Method for enhancing network application security |
US9135249B2 (en) * | 2009-05-29 | 2015-09-15 | Xerox Corporation | Number sequences detection systems and methods |
US9007604B2 (en) * | 2010-06-30 | 2015-04-14 | Xpedite Systems, Llc | System, method, and apparatus for an interactive virtual fax machine |
US8606789B2 (en) | 2010-07-02 | 2013-12-10 | Xerox Corporation | Method for layout based document zone querying |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10304095B2 (en) * | 2008-02-04 | 2019-05-28 | Thomson Reuters Global Resources Unlimited Company | System and method for accounting gateway |
US20140101007A1 (en) * | 2012-10-04 | 2014-04-10 | Quickdash, Llc | Methods and apparatus for providing data normalization, scalability and maintainability |
US9858624B2 (en) * | 2012-10-04 | 2018-01-02 | Qvinci Software, Llc | Methods and apparatus for providing data normalization, scalability and maintainability |
US11283780B2 (en) * | 2012-10-19 | 2022-03-22 | International Business Machines Corporation | Secure sharing and collaborative editing of documents in cloud based applications |
US10681019B2 (en) * | 2012-10-19 | 2020-06-09 | International Business Machines Corporation | Secure sharing and collaborative editing of documents in cloud based applications |
US20140279426A1 (en) * | 2013-03-15 | 2014-09-18 | Elwha Llc | Devices, methods, and systems for technologically shifting options and modalities |
US9799036B2 (en) | 2013-10-10 | 2017-10-24 | Elwha Llc | Devices, methods, and systems for managing representations of entities through use of privacy indicators |
US10013564B2 (en) | 2013-10-10 | 2018-07-03 | Elwha Llc | Methods, systems, and devices for handling image capture devices and captured images |
US10102543B2 (en) | 2013-10-10 | 2018-10-16 | Elwha Llc | Methods, systems, and devices for handling inserted data into captured images |
US10834290B2 (en) | 2013-10-10 | 2020-11-10 | Elwha Llc | Methods, systems, and devices for delivering image data from captured images to devices |
US10185841B2 (en) | 2013-10-10 | 2019-01-22 | Elwha Llc | Devices, methods, and systems for managing representations of entities through use of privacy beacons |
US20150104002A1 (en) * | 2013-10-10 | 2015-04-16 | Elwha Llc | Methods, systems, and devices for handling image data from captured images |
US10289863B2 (en) | 2013-10-10 | 2019-05-14 | Elwha Llc | Devices, methods, and systems for managing representations of entities through use of privacy beacons |
US20150104013A1 (en) * | 2013-10-10 | 2015-04-16 | Elwha Llc | Methods, systems, and devices for handling captured image data that is received by devices |
US10346624B2 (en) | 2013-10-10 | 2019-07-09 | Elwha Llc | Methods, systems, and devices for obscuring entities depicted in captured images |
US11068486B2 (en) * | 2014-04-04 | 2021-07-20 | Siemens Aktiengesellschaft | Method for automatically processing a number of log files of an automation system |
US11625662B2 (en) | 2016-09-22 | 2023-04-11 | Qvinci Software, Llc | Methods and apparatus for the manipulating and providing of anonymized data collected from a plurality of sources |
US10169300B2 (en) * | 2016-12-31 | 2019-01-01 | Entefy Inc. | Advanced zero-knowledge document processing and synchronization |
US10055384B2 (en) * | 2016-12-31 | 2018-08-21 | Entefy Inc. | Advanced zero-knowledge document processing and synchronization |
CN107809643A (en) * | 2017-11-13 | 2018-03-16 | 郑州云海信息技术有限公司 | A kind of coding/decoding method of image, device and medium |
CN109753285A (en) * | 2018-12-26 | 2019-05-14 | 浙江大学 | A kind of XML parser based on FPGA, restructurable computing system |
US20210350080A1 (en) * | 2020-01-24 | 2021-11-11 | Thomson Reuters Enterprise Centre Gmbh | Systems and methods for deviation detection, information extraction and obligation deviation detection |
US11763079B2 (en) | 2020-01-24 | 2023-09-19 | Thomson Reuters Enterprise Centre Gmbh | Systems and methods for structure and header extraction |
US11803706B2 (en) | 2020-01-24 | 2023-10-31 | Thomson Reuters Enterprise Centre Gmbh | Systems and methods for structure and header extraction |
US11886814B2 (en) | 2020-01-24 | 2024-01-30 | Thomson Reuters Enterprise Centre Gmbh | Systems and methods for deviation detection, information extraction and obligation deviation detection |
US20220405474A1 (en) * | 2021-06-21 | 2022-12-22 | Kongju National University Industry-University Cooperation Foundation | Method, computing device and computer-readable medium for classification of encrypted data using neural network |
Also Published As
Publication number | Publication date |
---|---|
US8812870B2 (en) | 2014-08-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: XEROX CORPORATION, CONNECTICUT Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MEUNIER, JEAN-LUC;DEJEAN, HERVE;REEL/FRAME:029104/0206 Effective date: 20120927 |
|
FEPP | Fee payment procedure |
Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551) Year of fee payment: 4 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |
|
AS | Assignment |
Owner name: CITIBANK, N.A., AS AGENT, DELAWARE Free format text: SECURITY INTEREST;ASSIGNOR:XEROX CORPORATION;REEL/FRAME:062740/0214 Effective date: 20221107 |
|
AS | Assignment |
Owner name: XEROX CORPORATION, CONNECTICUT Free format text: RELEASE OF SECURITY INTEREST IN PATENTS AT R/F 062740/0214;ASSIGNOR:CITIBANK, N.A., AS AGENT;REEL/FRAME:063694/0122 Effective date: 20230517 |
|
AS | Assignment |
Owner name: CITIBANK, N.A., AS COLLATERAL AGENT, NEW YORK Free format text: SECURITY INTEREST;ASSIGNOR:XEROX CORPORATION;REEL/FRAME:064760/0389 Effective date: 20230621 |
|
AS | Assignment |
Owner name: JEFFERIES FINANCE LLC, AS COLLATERAL AGENT, NEW YORK Free format text: SECURITY INTEREST;ASSIGNOR:XEROX CORPORATION;REEL/FRAME:065628/0019 Effective date: 20231117 |
|
AS | Assignment |
Owner name: CITIBANK, N.A., AS COLLATERAL AGENT, NEW YORK Free format text: SECURITY INTEREST;ASSIGNOR:XEROX CORPORATION;REEL/FRAME:066741/0001 Effective date: 20240206 |