IES20060361A2

IES20060361A2 - Electronic document conversion

Info

Publication number: IES20060361A2
Application number: IE20060361A
Authority: IE
Inventors: Seamus Mcgrenery; Brian Mcgrath
Original assignee: Big River Ltd
Priority date: 2006-05-05
Filing date: 2006-05-05
Publication date: 2007-10-31
Also published as: WO2007129288A2; WO2007129288A3

Abstract

An apparatus and a method are disclosed for converting an electronic document formatted for printing into an electronic document formatted for display. The apparatus comprises processing means, memory means and data input means. The memory means stores instructions, a data structure defining at least one electronic document formatted for printing and having a layout, and at least one document layout template. The instructions configure the processing means to perform a plurality of converting operations, according to the method. A comparison of the document layout defined by the data structure is performed against at least one document layout template. Alphanumerical data in the data structure is mapped to corresponding ASCII character data and optically recognized. A comparison of the optically-recognized alphanumerical data is performed against the mapped alphanumerical data. Image data is identified in the data structure and rescaled if it exceeds an image data parameter. The electronic document including the optionally rescaled image data and the compared ASCII character data is output, formatted for display according to the document layout template. <Figure 1>

Description

The present invention relates to a system and method for converting electronic documents. More particularly, the present invention relates to an enhanced system and method for converting electronic documents formatted for printing or display on a specific device.

Background of the Invention

Many computerised systems and methods used therewith are known, with which users input any combination of alphabetical, numerical and image data for generating electronic versions of print documents. Typically, such documents have been prepared for distribution to recipients in printed form, whereby the finalised document is output by the computer on which it has been generated to a printing device. The ubiquitous development and adoption of the Internet has however simplified such distribution, as any electronic version of a print document, such as a report or a brochure which beforehand needed to be printed and mailed as a hardcopy to a distant recipient, may now simply be broadcast across the Internet to the respective terminals of one or many recipients, irrespective of their often disparate geographical locations, and be received quasi-instantaneously. Similarly, documents may be prepared for display on a particular type of device such as a computer with a video display unit (VDU) screen and a particular type of software.

This network-enabled distribution is traditionally accomplished either via electronic mail messaging, wherein the electronic version of the print document may be sent and subsequently received by a recipient as a file attached to an electronic mail message (email), or via downloading the electronic document as a file made available on an Internet page (webpage) via its storage location reference in the network. Particularly well-known types of such files include the portable document format ("pdf") file format developed by Adobe Systems Inc. of San Jose, California and the Microsoft Word document ("doc") file format developed by the Microsoft Corporation of Redmond, Washington.

Although such files and the respective computer applications used to produce them have immensely facilitated the production and distribution of document-based information since their advent, especially for end-users unskilled in the particular arts of typesetting, page-setting and graphic design, they feature an important disadvantage, inherent to their initial purpose of presenting and reproducing graphic and textual information in print, or on a specific device such as a standard VDU, in that the intended meaning of the content may be lost or confused when they are converted for use or display on a range of devices. This disadvantage arises principally because the software applications used to create and display such documents (native applications) use a combination of layered data and layout data to convey meaning through visual presentation of a combination of text and graphical elements, and this meaning can be lost when the text and graphic data are extracted for use or display on another device. Typically, a document designer may choose to convey the meaning that a particular string of text is a heading by rendering it as an image, which the native application will display as if it were large text. Alternatively the document designer may choose to make text stand out by layering the same string of text twice with one slightly offset. In the first example, the intended meaning will be lost if the text is extracted for use on a different device or application. In the second example, the meaning may be confused by the text being repeated.

Additionally, the native applications used to display documents are typically optimised for interaction using a combination of a display device, such as a computer video display unit (VDU), with user interaction by visual control such as a mouse or pointing device. This combination provides difficulties for users who are trying to access the document on a non-standard VDU screen, and/or need to interact with controls using another method, such as a keyboard. This disadvantage can equally arise from characteristics of the document itself, such as the information contained in the document not being structured according to the intended reading order.

Because computers, and the software for creating documents using them, have been developed at different times in different countries, a further disadvantage of electronic copies of print documents is that characters which make up the text may be stored electronically using one of a number of standards. These include ASCII, extended ASCII, ANSI and the Latin-1 character set defined in ISO-8859-1 for languages using Latin script. In languages where the entire character set exceeds this character range (such as Japanese or Chinese for example), two bytes are used to represent a single character. Many countries use their own standard sets of double-byte encoding to represent character sets in each language. Alternatively characters may be stored in a universal character set, such as Unicode (e.g. UTF-16), or the 8-bit Universal Character Set Transformation Format, UTF-8. Therefore the printed letters or 'glyphs' in document text may be mapped to one of a number of different digital representations of a character.
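By way of illustration only, the following minimal Python sketch shows how one character can be stored as different byte sequences under several of the standards named above. The function name `stored_forms`, the list of encodings and the sample characters are assumptions chosen for this example and do not form part of the described system.

```python
# Illustrative sketch only: the encodings listed and the sample characters
# are assumptions chosen for this example.
ENCODINGS = ("ascii", "latin-1", "shift_jis", "utf-16-be", "utf-8")


def stored_forms(char: str) -> dict:
    """Return the bytes each encoding would store for `char`.

    Encodings that cannot represent the character are left out of the result.
    """
    forms = {}
    for name in ENCODINGS:
        try:
            forms[name] = char.encode(name)
        except UnicodeEncodeError:
            continue  # this standard has no representation for the character
    return forms


if __name__ == "__main__":
    for ch in ("A", "é", "日"):
        print(ch, {name: data.hex() for name, data in stored_forms(ch).items()})
```

A plain Latin letter is stored as a single byte under most of these standards, whereas a Japanese character needs two bytes under a double-byte national encoding and three bytes under UTF-8, so a converter cannot interpret stored bytes without knowing which standard applies.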

The visual representation of the character may be further controlled by using 'font' characteristics to call an alternative 'glyph' for the same character. In most instances, such as where the variation is to a bold or italic version of the character, this does not affect the meaning. It does nonetheless in some cases, such as where the ASCII character 'e', or alternatively '$', is modified by font controls to represent the euro symbol '€'. In another example, the '€' may be represented by the general currency sign ¤ [#164], relying on settings from the document or display computer to display the correct glyph. Other examples include the use of ASCII characters modified by font controls to represent Mathematical Pi characters, and special symbols such as ™ or ©.
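A related ambiguity can be reproduced with character sets alone. The minimal Python sketch below, offered as an analogous illustration rather than as the font-control mechanism described above, decodes the single byte value 164 under three code pages: it yields the general currency sign under ISO-8859-1 and Windows-1252, but the euro symbol under ISO-8859-15. The function name `glyphs_for` and the choice of code pages are assumptions made for this example.

```python
# Illustrative sketch only: byte value 164 (0xA4) and the three code pages
# below are example choices, not part of the described system.
AMBIGUOUS_BYTE = bytes([0xA4])


def glyphs_for(raw: bytes, encodings=("latin-1", "iso8859-15", "cp1252")) -> dict:
    """Decode the same stored bytes under each code page and return the results."""
    return {name: raw.decode(name) for name in encodings}


if __name__ == "__main__":
    for name, glyph in glyphs_for(AMBIGUOUS_BYTE).items():
        print(f"{name}: U+{ord(glyph):04X} {glyph}")
```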

A further disadvantage of electronic documents is that they may be compiled from a variety of sources, or on a variety of computer operating systems, resulting in individual text objects being encoded in character sets other than that declared for the document or document section as a whole.

These interrelated disadvantages are increasingly problematic, as there is an ongoing global trend to increase information data distribution from more traditional delivery platforms, such as desktop computers, to mobile platforms in the so-called pervasive computing era. This is evidenced by increasingly frequent announcements by device manufacturers of new personal digital assistants (PDAs), mobile telephone handsets and personal media centres (PMCs) with enhanced functionality. Mobile platforms present their own unique challenges, especially in the domain of graphical information processing and presentation. The devices are typically restricted to having low computational power, small memory storage space, limited-bandwidth network access to the Internet and ever-increasing miniaturisation requirements, particularly of their video display units (VDUs). By way of example, the resolution of the VDU of desktop computers has increased dramatically over time, from the Color Graphics Adapter (CGA) standard in 1981 of 160x200 16-colour pixels, through to the more recent Ultra Extended Graphics Array (UXGA) of 1600x1200 32-bit colour pixels. Conversely, the VDUs of most modern portable devices, such as mobile phones or PDAs, are still limited to a resolution of approximately 320x200 16-bit colour pixels.

The interrelated disadvantages of inconsistent mapping of printed characters to the data stored in the document, and of document meaning conveyed through layered data or through a method which will only convey the intended meaning correctly in the native application, have a disproportionately negative effect on people who rely on assistive technology to access documents, such as for example blind people who use text-to-voice software. These users may find that words cannot be spoken because one or more letters are inconsistently mapped to ASCII, or lines from one paragraph may be spoken as part of another paragraph because of the way the document is laid out or saved, and they may have to listen to the whole text of a very large document to find the part that is of interest to them. Existing software for creating and converting copies of electronic documents does not meet the needs of document publishers in relation to making documents accessible to users of assistive technology.

Previous attempts to solve the above, interrelated disadvantages have followed two main approaches. On the one hand, some solutions have focused upon converting the contents of electronic versions of print documents in for example "doc" or "pdf" formats into alternative, structured documents formatted according to the HyperText Mark-up Language (HTML) or Extensible Mark-up Language (XML) to remedy the disadvantages related to the resolution and the navigation functionality. Such converting solutions generally produce inaccurate results and low quality output, because they do not take all the inconsistencies between the visual appearance of a document, facilitating its reading in print form, and the manner in which data is stored in the document into account. On the other hand, some other solutions have focused upon improving the data formatting and file definition characteristics of electronic versions of print documents, to remedy the disadvantages related to the file size. However, such solutions have not remedied the disadvantages linked to the desired portability of the document, particularly as a function of target recipient devices.

Moreover, in the context of distributing information data over a Wide Area Network (WAN) such as the Internet, an increasing proportion of the above prior art solutions rely upon an information data recipient downloading task-specific applications known as applets or plug-ins in order to access the document contents. The requirement for a recipient user to use such applets, for instance Active-X applets, to process and view the information data poses a security risk as such applets have been known to harbour, or be particularly vulnerable to, computer viruses and other such data processing applications with a nefarious purpose. The downloading and installation of additional software or 'plug-ins' poses a disproportionate difficulty for people with low levels of computer skill and users of non-standard devices.

Object of the Invention

It is an object of the present invention to provide an improved system for converting an electronic document formatted for printing or display on computer and VDU into an electronic document formatted for display on a variety of devices.

It is another object of the present invention to provide an automatic electronic document conversion system, in which text character glyphs are accurately mapped to electronic representations of text characters, so that the intended character is displayed.

It is yet another object of the present invention to provide an automatic electronic document conversion system, in which image-formatted text data is identified and processed for outputting in corresponding text data.

It is a further object of the present invention to provide an improved method of converting an electronic document formatted for printing or display on a VDU into an electronic document formatted for display on multiple devices, by identifying a meaning conveyed by layering multiple objects for display as a single object in the native application, and rendering these as a single object.

It is a further object of the present invention to provide an improved method of converting an electronic document, whereby additional text introduced into the design to create a visual effect in the native application is identified and removed, or replaced by markup which conveys the same meaning without the use of additional text.

It is a further object of the present invention to provide an improved automatic electronic document conversion system, in which textual conventions for referring readers to another part of the document are augmented by hyperlinks, which the reader can follow in the electronic copy of the document.

It is a further object of the present invention to provide an automatic electronic document conversion system, in which textual conventions for referring a reader to an electronic resource such as URL or e-mail address are processed into hyperlinks, which the user can follow.

It is a further object of this invention to improve the characteristics of electronic copies of print documents so that they can be accessed by users of devices other than standard computers, such as users of assistive technology and mobile devices.
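The object of processing textual references to electronic resources into hyperlinks can be sketched as follows. This is a minimal Python illustration under stated assumptions: the pattern `LINK_RE` and the function `add_hyperlinks` are hypothetical names, the patterns are deliberately simple and do not cover the full URL or e-mail address grammars, and HTML is assumed as the output markup.

```python
import html
import re

# Hypothetical, deliberately simple pattern: a URL starting with http://,
# https:// or www., or an e-mail address. Trailing punctuation is excluded
# from the URL so that a sentence-ending full stop is not made part of the link.
LINK_RE = re.compile(
    r"(?P<url>\b(?:https?://|www\.)[^\s<>\"']*[^\s<>\"'.,;:!?)])"
    r"|(?P<email>\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
)


def add_hyperlinks(text: str) -> str:
    """Return `text` as HTML in which URLs and e-mail addresses become links."""
    parts = []
    position = 0
    for match in LINK_RE.finditer(text):
        # Escape the plain text between the previous link and this one.
        parts.append(html.escape(text[position:match.start()]))
        token = match.group(0)
        if match.group("url"):
            href = token if token.startswith("http") else "http://" + token
        else:
            href = "mailto:" + token
        parts.append(
            '<a href="{}">{}</a>'.format(html.escape(href, quote=True), html.escape(token))
        )
        position = match.end()
    parts.append(html.escape(text[position:]))
    return "".join(parts)


if __name__ == "__main__":
    print(add_hyperlinks("See www.example.com, or write to info@example.com."))
```

Both kinds of reference are matched in a single pass so that an address already wrapped in a link is not processed a second time.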

Summary of the Invention

According to an aspect of the present invention, an apparatus is provided for converting electronic documents formatted for printing, or display on a specific device, the apparatus comprising processing means, memory means and data input means and the memory means storing instructions. The instructions configure the processing means to select metadata for an output electronic document formatted for display in response to user settings, map alphanumerical data in the data structure to corresponding ASCII character data, optically recognise alphanumerical data in the data structure, compare the optically-recognised alphanumerical data against the mapped alphanumerical data, identify image data in the data structure and, in response to a selection of the at least one recipient terminal profile, optionally rescale the image data, identify sections of the document content where the intended meaning is conveyed, in the document's native application, by the use of layered text or text and image data and process this layered data according to rules to ensure that the intended meaning is preserved in the converted document, identify where text has been added to the source document for purely visual effect and delete or replace the additional text by markup, and output the electronic document, including the optionally rescaled image data and ASCII character data, wherein the output electronic document is formatted for display according to the selected metadata and profile.

According to another aspect of the present invention, a method is provided for converting electronic documents formatted for printing, or display on a specific device, the apparatus comprising processing means, memory means and data input means and the memory means storing instructions. The instructions configure the processing means to select metadata for an output electronic document formatted for display in response to user settings, map alphanumerical data in the data structure to corresponding ASCII character data, optically recognise alphanumerical data in the data structure, compare the optically-recognised alphanumerical data against the mapped alphanumerical data, identify image data in the data structure and, in response to a selection of the at least one recipient terminal profile, optionally rescale the image data, identify sections of the document content where the intended meaning is conveyed, in the document's native application, by the use of layered text or text and image data and process this layered data according to rules to ensure that the intended meaning is preserved in the converted document, identify where text has been added to the source document for purely visual effect and delete or replace the additional text by markup, and output the electronic document including the optionally rescaled image data and ASCII character data, wherein the output electronic document is formatted for display according to the selected metadata and profile.

Brief Description of the Drawings

The invention will be better understood upon consideration of the following detailed description and the accompanying drawings, in which:

Figure 1 shows a preferred embodiment of the present invention in an environment comprising a plurality of network-connected user terminals, including those of a sender and of a recipient;

Figure 2 provides an example of a sender user terminal shown in Figure 1, which includes memory means, processing means and networking means;

Figure 3 details the processing steps according to which the sender user terminal of Figures 1 and 2 operates, including a step of loading a data structure defining a print document;

Figure 4 illustrates the contents of the memory means shown in Figure 2 further to the loading step shown in Figure 3;

Figure 5 illustrates the components of the document shown in Figure 4, including layer data and layout data;

Figure 6 shows examples of document layers and layouts according to the data shown in Figure 5;

Figure 7 details the processing steps according to which the sender user terminal of Figures 1 to 5 operates, including a step of processing a data structure for output to a video display unit; and

Figure 8 details the processing steps according to which the sender user terminal of Figures 1 to 5 operates, including a step of processing a data structure for output to a video display unit.

Detailed Description of the Drawings

The words "comprises/comprising" and the words "having/including" when used herein with reference to the present invention are used to specify the presence of stated features, integers, steps or components but do not preclude the presence or addition of one or more other features, integers, steps, components or groups thereof.

A preferred embodiment of the present invention is shown in an environment in Figure 1. A plurality of network-connected user terminals is shown, which first includes recipient user devices such as a mobile telephone handset 101. The handset 101 is configured with wireless telecommunication emitting and receiving functionality, such as over a cellular telephone network configured according to either Global System for Mobile Communication ('GSM') or General Packet Radio Service ('GPRS') network industry standards. Handset 101 receives or emits voice and/or data encoded as a digital signal over wireless data transmission 102, wherein the signal is relayed respectively to or from the handset 101 by the geographically-closest communication link relay 103 of a plurality thereof.

The plurality of communication link relays allows the digital signal to be routed between handset 101 and its intended recipient or from its remote emitter, in the example sender user terminal 104 of content service provider 105, by means of a remote gateway 106. Gateway 106 is for instance a communication network switch coupling digital signal traffic between wireless telecommunication networks, such as the network within which wireless data transmission 102 takes place, and a wide area network (WAN) 107, an example of which is the Internet, or an Intranet or Extranet. The gateway 106 further provides protocol conversion if required, for instance if handset 101 uses a Wireless Application Protocol (WAP) differing from the Internet Transmission Control Protocol/Internet Protocol (TCP/IP) in order to receive from, and optionally distribute data to, terminal 104, which is itself only connected to the WAN 107 via an Internet Service Provider (ISP) 108.

The user of handset 101 may also have the use of another mobile terminal 109, for instance a Personal Digital Assistant (PDA). In the example, PDA 109 is also connected to the WAN 107 and configured to exchange data according to the TCP/IP protocol with the sender user terminal 104 via a different network 110, such as a Local Area Network (LAN) or a Wireless Local Area Network (WLAN) conforming to the IEEE 802.11b or IEEE 802.11g standard, interfaced with WAN 107, although it will be readily understood by those skilled in the art that the present invention is not limited thereto and may indeed include any other such protocol or standard, depending upon the device, its networking means, operating system and processing capacity.

Thus, the potential exists for data exchange between any of handset 101, terminals 104 and 109, by way of wireless data transmission 102 interfaced by gateway 106 or wireless data transmission within network 110 and the Internet 107.

An example of the terminal 104 at the service provider 105 shown in Figure 1 is provided in Figure 2. Terminal 104 is a computer terminal configured with a data processing unit 201, data outputting means such as video display unit (VDU) 202, data inputting means such as a keyboard 203 and a pointing device (mouse) 204 and data inputting/outputting means such as network connection 205, magnetic data-carrying medium reader/writer 206A and optical data-carrying medium reader/writer 207A for, respectively, reading from or writing data to magnetic data-carrying medium (floppy disk or solid state memory card) 206B and reading from or writing data to optical data-carrying medium (CD-RAM, CD-RW, DVD-RAM, DVD-RW, DVD-R and the like) 207B.

Within data processing unit 201, a central processing unit (CPU) 208, such as an Intel Pentium 4 manufactured by the Intel Corporation, provides task co-ordination and data processing functionality. Instructions and data for the CPU 208 are stored in main memory 209 and a hard disk storage unit 210 facilitates non-volatile storage of data and several software applications. WAN connection 205 is provided by way of a modem 211 as a wired dial-up connection or, alternatively, by way of a Network Interface Card (NIC) 212 as a wired high-bandwidth connection to the ISP 108 and the Internet 107.

A universal serial bus (USB) input/output interface 213 facilitates connection to the keyboard and pointing device 203, 204. All of the above devices are connected to a data input/output bus 214, to which the magnetic data-carrying medium reader/writer 206A and optical data-carrying medium reader/writer 207A are also connected. A video adapter 215 receives CPU instructions over the bus 214 for outputting processed data to VDU 202. In the embodiment, data processing unit 201 is of the type generally known as a compatible Personal Computer ('PC'), but may equally be any device configured with processing means, output data display means, memory means, input means and wired or wireless network connectivity.

The processing steps according to which the terminal 104 operates according to the present invention are further detailed in Figure 3. At step 301 the terminal 104 is switched on, whereby a first set of instructions known as the Operating System (OS) is loaded in memory 209 at step 302, to configure the terminal with basic interoperability and connectivity with ISP 108 and the Internet 107. At step 303, a second set of instructions hereinafter referred to as a converting application is loaded in memory 209, either from storage 210 or from a remote terminal, for instance a remote server accessible via the Internet 107, by way of a data download.

In the preferred embodiment, the second set of instructions includes a plurality of functional modules further described herein below, at least one of which allows a user to select and load an electronic document configured for printing or display on a specific device at the next step 304. At step 305, the converting application receives user input to indicate the desired output file format of the electronic document configured for display on multiple devices, and optionally to indicate the type or characteristics of the recipient user terminal, in order to parameterize the processing of the electronic document configured for print by the modules at the next step 306 and the outputting of the processed electronic document now configured for display by the application at the next step 307. After step 307 the user of terminal 104 may then write the electronic document output at step 307 to removable media 206B, 207B and/or distribute same to one or a plurality of remote recipient user terminals, such as mobile phone 101, by way of its network connection.

A question can be asked at step 308, as to whether another electronic document configured for print should be processed according to steps 304 to 307, and control returns to step 304 for the selection thereof if the question is answered in the affirmative. Alternatively, the question is answered negatively, and the user of terminal 104 may then decide to terminate the processing of the application first loaded at step 303 at step 309 and eventually switch the terminal 104 off at step 310.

The contents of the memory means 209 after the steps 303, 304 of loading instructions and an electronic document are illustrated in Figure 4. The OS loaded at step 302 is shown at 401, which in the example is Windows® XP® Professional®, manufactured and distributed by the Microsoft® Corporation of Redmond, Washington, United States, but it will be readily apparent to those skilled in the art that the present invention is not limited thereto and that any other OS may be used to configure terminal 104 with basic functionality, such as Mac OS X® manufactured and distributed by Apple® Inc. of Cupertino, California, United States, or Linux which is freely distributed.

The application loaded at step 303 is shown at 402 and comprises a plurality of modules including a document processing module 403, a text data processing module 404, an Optical Character Recognition (OCR) module 405 and an image data scanning module 406, the respective functionalities of which will be further described herein below. In operation, each module apportions a part of memory 209 as a respective buffer, the buffers being shown as 407 in the figure. A document layout template is shown at 408, which is also loaded in memory 209 at step 303 along with the application 402, and which the document processing module 403 preferably uses at step 306 in conjunction with text data processing module 404 for processing a data structure shown at 409 after the loading thereof according to step 304. The data structure 409 is an electronic document formatted for print and, in the example, is formatted according to the Portable Document Format (*.pdf) file format of Adobe Systems Inc. of San Jose, California, United States and output by Adobe applications, notably the well-known Adobe® Acrobat® Reader application. It will be readily apparent to those skilled in the art that the present invention is not limited thereto, however, and a preferred embodiment of the present invention may process many other types of electronic documents formatted for print, such as electronic documents formatted according to the Microsoft Word (*.doc) file format of Microsoft® Corporation, or documents prepared for display, such as HTML (*.html or *.htm) or XML (*.xml) documents.

Various types of data components of the data structure 409 are further described in Figure 5. The document 409 initially comprises metadata 501, which may be better understood as a description or definition of the electronic data contained in the data structure 409. Metadata 501 therefore preferably includes data structure information, such as the document file type, the language of the document, the page format and the number of pages of the document, but can also include information about when the document was created, by whom and what changes have been made to the document since its creation, or even include descriptive HTML tags (also known to those skilled in the art as Meta tags) if the data structure 409 is an HTML document.

Document 409 also includes text data 502 encoded in one or more of a plurality of character representations including, but not limited to, ASCII, ANSI and ISO character sets, and text data 503 encoded in 16-bit Unicode or its 8-bit representation UTF-8, which was developed as a universal character set for international use and allows a wider variety of text characters than ASCII to be used, particularly for representing the written characters of Asian languages.

In some instances, document 409 may include textual content stored in an image file format to convey a particular meaning when displayed in the native application. In some other instances, document 409 will include additional text data 502, the purpose of which is to create a visual effect for either implying a meaning or for artistic effect.

Document 409 next includes image data 504, which in the context of the preferred embodiment is any graphical data in the document which does not convey textual information, such as photographs, artistic or design works, charts and the like. Each image or graphical component 504 is individually defined within the document 409 in size by a pixel area, in resolution by a dot per inch (dpi) value and in content by individual pixels, each of which comprises respective Red, Green and Blue (RGB) values and, optionally, an Alpha (transparency) value. Alternatively graphical content may be defined by reference to a vector object such as a line of a given width in pixels, and colour values in RGB between points A and B. Image data 504 may further be modified by being overlaid with text or other graphic objects to convey a particular meaning when displayed in the native application.
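The optional rescaling of image data that exceeds an image data parameter may be sketched as below. This minimal Python illustration assumes that the parameter is a maximum pixel area taken from a recipient terminal profile and that the aspect ratio is to be preserved; the names `ImageInfo` and `fit_within` are hypothetical, and only the target dimensions are computed, not the resampled pixels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageInfo:
    """Size and resolution of one image component (hypothetical structure)."""
    width_px: int
    height_px: int
    dpi: int


def fit_within(image: ImageInfo, max_width: int, max_height: int) -> ImageInfo:
    """Return target dimensions that fit the limits, preserving aspect ratio.

    The image is returned unchanged when it already fits. Integer arithmetic
    is used throughout so that the result does not depend on float rounding.
    """
    if image.width_px <= max_width and image.height_px <= max_height:
        return image
    # Cross-multiply to find which dimension is the limiting one.
    if image.width_px * max_height >= image.height_px * max_width:
        new_width = max_width
        new_height = max(1, image.height_px * max_width // image.width_px)
    else:
        new_height = max_height
        new_width = max(1, image.width_px * max_height // image.height_px)
    return ImageInfo(new_width, new_height, image.dpi)


if __name__ == "__main__":
    # A UXGA-sized image reduced for a display of approximately 320x200 pixels.
    print(fit_within(ImageInfo(1600, 1200, 300), 320, 200))
```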

Document 409 is formatted for print and therefore the various visual components 502 to 504 thereof are disposed relative to one another to achieve a particular result, be it in terms of a visual effect imparted upon the reader or a specific order in which those visual components should be read or observed when the document is printed out. Data defining this disposition may take one of two forms, or possibly both, respectively represented as document layer data 505 and document layout data 506. Either or both of the layer data 505 and layout data 506 may be included as part of metadata 501.

Examples of layer and layout data are shown in Figure 6. Layer data 505 comprises data defining which visual components 502 to 504 belong to which layer of the document formatted for printing. Layers may be described as document areas overlaying one another, figuratively "stacked" on top of each other, the aggregation of which results in the final document. A typical example may include a watermark image 504 belonging in a bottom layer 601 and a letter comprising text belonging in a top layer 602, the aggregation 603 of which results in a letter having text 502 or 503 over the watermark 504 when printed. Layer data 505 therefore defines the bottom and top layers 601, 602 with which watermark 504 and text 502 or 503 are respectively associated. Layer data can be used for artistic effect or to convey meaning. A typical example of layer data being used to convey meaning is where a block of text is drawn to the reader’s attention by being layered over a different background colour to that of the remaining text.

Layout data 506 comprises data defining where visual components 502 to 504 are physically disposed within the document area, e.g. where the watermark 504 is disposed relative to the total area of the document, for instance centered relative to the four boundaries of the document. The disposition of visual components 502 to 504 according to layout data 506 can vary to a very large extent, depending on the intended use and/or purpose of the document 409. The single-sided example 603 of a watermarked letter 409 above can for instance be contrasted with another example 604 of a double-sided brochure, based upon an A4 document print format to be folded in thirds 605 to 607, wherein visual components 502 to 504 are laid out according to an intended reading order (605 recto, 605 verso, 606 verso, 607 verso, 607 recto, 606 recto) of the brochure, once printed and folded.

It can readily be appreciated that, unless the example document 604 is physically printed and folded, a user would find reading document 604 presented on a VDU, such as VDU 202, particularly difficult, since layout data 506 specifies in this particular example that image and text data are rotated clockwise by 90 degrees relative to a VDU reading orientation shown as arrow 608 and that image and text data are apportioned to their respective brochure recto or verso "face" 605 to 607, again irrespective of VDU reading order 608. In this instance the text reading order of the source document file will need to be changed in order to preserve the intended reading order.
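
By way of illustration only, the change of reading order for the brochure example 604 may be sketched as follows. The sketch assumes that each block of text has already been tagged with the brochure portion (605 to 607) and face (recto or verso) to which it belongs; the names `Block`, `READING_ORDER` and `in_reading_order` are hypothetical and do not form part of application 402.

```python
from dataclasses import dataclass

# Intended reading order of the folded brochure, as listed in the description.
READING_ORDER = [
    (605, "recto"), (605, "verso"), (606, "verso"),
    (607, "verso"), (607, "recto"), (606, "recto"),
]


@dataclass
class Block:
    portion: int  # brochure portion 605, 606 or 607
    face: str     # "recto" or "verso"
    text: str


def in_reading_order(blocks):
    """Return the blocks sorted by the intended reading order of their face."""
    rank = {key: position for position, key in enumerate(READING_ORDER)}
    # sorted() is stable, so blocks on the same face keep their source order.
    return sorted(blocks, key=lambda b: rank[(b.portion, b.face)])
```

Because the sort is stable, blocks sharing a face retain the order in which they appear in the source file, which is an assumption about what is wanted within a single face.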

The step 306 according to which the instructions 402 configure the data processing system 201 to process document 409 into an electronic document formatted for display is further detailed in Figure 7. At step 701, application 402 processes the metadata 501 to identify document data, which it temporarily stores in its respective portion of buffer 407. This data is useful to determine characteristics of text data 502, 503, for instance its language, which is particularly useful in the case of written language using Unicode or UTF-8 text data 503, and also to determine whether the loaded document 409 incorporates layer data 505 and/or layout data 506. At step 702, the text processing module 404 of application 402 extracts text data 502 from the document 409, which it also temporarily stores in its respective portion of buffer 407.

All text is passed through text processing module 404 using rules, which detect if there is text encoded in character sets other than that declared for the document, or illegal in the output format. Where such characters are encountered they are processed using rules, and output in the correct character set and character. Where there is more than one possible correct output character a comparison is made with the text extracted using the OCR module 405.
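
One possible form of such rules is sketched below, by way of example only. The sketch assumes that the text is available as raw bytes together with a declared character set, and that the output format is XML or HTML; the fallback list and both function names are hypothetical, and the comparison with OCR module 405 for ambiguous characters is not shown.

```python
def decode_with_fallback(raw, declared, fallbacks=("utf-8", "cp1252")):
    """Decode raw bytes, trying the declared character set first.

    Returns the decoded text and the name of the character set that worked.
    """
    for encoding in (declared, *fallbacks):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: mark undecodable bytes with U+FFFD rather than fail.
    return raw.decode(declared, errors="replace"), declared


def strip_illegal(text):
    """Drop C0 control characters, which XML 1.0 does not allow,
    keeping tab, line feed and carriage return."""
    return "".join(ch for ch in text if ch in "\t\n\r" or ord(ch) >= 0x20)
```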

The application detects if the document contains any instances where additional text may have been included for visual effect. In one example, additional text is defined as repeated layered text, which can be discarded or replaced by markup. In another example where layered text is detected, the layers are examined to determine if one is used for visual effect, by having a significantly less prominent colour or a colour close to that of the background. It will be apparent that many other examples exist where a similar technique can be used to impart the intended meaning to the document.
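
The two examples above may be sketched as follows, purely as an illustration. The sketch assumes that each layer is available as a pair of a text string and an (r, g, b) colour, and uses a Euclidean colour distance with an arbitrary threshold of 40; the threshold and the function names are assumptions, not values taken from the application.

```python
def colour_distance(a, b):
    """Euclidean distance between two (r, g, b) colours, channels 0 to 255."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def meaningful_text(layers, background, threshold=40.0):
    """Return the text of layers that are neither repeats nor near-invisible."""
    seen = set()
    kept = []
    for text, colour in layers:
        if text in seen:
            continue  # repeated layered text, e.g. a drop-shadow copy
        if colour_distance(colour, background) < threshold:
            continue  # colour close to that of the background
        seen.add(text)
        kept.append(text)
    return kept
```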

Application 402 coordinates the respective data processing steps of each of processing modules 403 to 406, and at step 703, application 402 invokes the scanning module 406 to scan the document, in its composite form if it comprises layer data 505 or if an initial reading of the character strings stored at step 702 indicates that there may be more than one possible glyph for any of the characters stored, and then stores the result in its respective buffer portion 407, and subsequently invokes the OCR processing module 405 to optically recognize characters from the buffered scanned document and temporarily store an OCR version of the document 409 in its respective buffer portion 407.

Thereafter, at step 704, the text processing module 404 performs a comparison of the text data 502, 503 extracted at step 702 with the buffered OCR version of the document output at step 703, in order to detect and correct extraction errors and to convert alphanumerical data featured in document 409 in the form of image data 504 into ASCII text data 502. Examples of such alphanumerical data in the form of image data 504 include alphanumerical data in charts, graphs or in a stylized form, or defined by layer data 505 as or in a graphic layer 601 as opposed to a text layer 602. At step 705, the extracted and verified text data is stored in the buffer portion 407 of document processing module 403.
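
The comparison of step 704 may, for instance, be approximated with a standard sequence-matching routine. The sketch below uses `difflib.SequenceMatcher` from the Python standard library and simply lists the fragments on which the extracted text and the OCR text disagree; how each disagreement is resolved is left open, and the function name `disagreements` is hypothetical.

```python
from difflib import SequenceMatcher


def disagreements(extracted, ocr):
    """Return (extracted_fragment, ocr_fragment) pairs where the texts differ."""
    matcher = SequenceMatcher(None, extracted, ocr, autojunk=False)
    return [
        (extracted[i1:i2], ocr[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

An empty result indicates that the two versions agree character for character. A fragment present only on the OCR side may correspond to alphanumerical data that exists in the document only as image data 504.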

In order to facilitate the document conversion, a number of document layout templates 408 is provided, for instance for the more common types of document layouts that may be encountered, and therefore incorporating a letter, a magazine article with two or more columns, a broadsheet article with four or more columns, a double sided brochure and folded variations thereof as provided in the example 604. It will be readily apparent to those skilled in the art that the above templates are provided by way of example only and are not meant to be limitative. At the next step 706, application 402 matches the verified text data with a document layout template 408. In the example, application 402 obtains from the performing of steps 701 to 705 that document 409 has an A4 format from step 701, that text data has been rotated by 90 degrees for the purposes of performing the OCR step 703, and that the disposition of text data 502 within the A4 area of document 409 defines three portions 605 to 607. In an alternative embodiment of the present invention, application 402 obtains from performing step 701 that layout data 506 defines three columns corresponding to the three portions 605 to 607.
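
A minimal sketch of the matching of step 706 is given below, assuming that the features obtained from steps 701 to 705 reduce to a page format, a number of text portions and a rotation flag. The `LayoutTemplate` type, the three sample templates and the scoring rule are hypothetical and serve only to illustrate the principle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutTemplate:
    name: str
    page_format: str
    columns: int
    rotated: bool


TEMPLATES = [
    LayoutTemplate("letter", "A4", 1, False),
    LayoutTemplate("magazine article", "A4", 2, False),
    LayoutTemplate("tri-fold brochure", "A4", 3, True),
]


def match_template(page_format, columns, rotated, templates=TEMPLATES):
    """Return the template sharing the most observed features with the document."""
    def score(template):
        return (
            (template.page_format == page_format)
            + (template.columns == columns)
            + (template.rotated == rotated)
        )
    return max(templates, key=score)
```

For the example document 604, the observed features (A4, three portions, rotated text) select the brochure template.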

At step 707, application 402 processes the buffered scanned document output by the scanning module at step 703 for identifying respective images 504 n. A question is therefore asked at step 708, as to whether an image 504 n is present in the document 409. An initial question is asked as to whether any layer data is stored above the image. If layer data is stored above the image, the image is recaptured using a ‘scan’ of the source document display with OCR module 405, and processed as at step 709. If the question of step 708 is answered positively, application 402 processes the identified image 504 n for reducing its respective storage requirement, e.g. its respective size in bytes, at the next step 709, for instance by resampling the image to a lower resolution and subsequently performing a one-pass sharpening of the re-sampled image, and buffers the processed image 504 n. In one embodiment of application 402, application 402 outputs the image 504 n to VDU 202 for the user to optionally increase or decrease its scale, or change its boundary relative to text or graphic objects, so that for example an adjacent text string can be included or excluded from the image.
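
The resampling of step 709 may be illustrated with a simple box filter, in which each tile of pixels is replaced by its average. The sketch assumes a greyscale image held as a list of rows of integers and an integer reduction factor; it is one of many possible resampling methods, the sharpening pass is not shown, and the function name `downsample` is hypothetical.

```python
def downsample(pixels, factor):
    """Average each factor x factor tile of a greyscale image (list of rows).

    Rows and columns that do not fill a complete tile are discarded.
    """
    height = len(pixels) - len(pixels) % factor
    width = len(pixels[0]) - len(pixels[0]) % factor
    out = []
    for y in range(0, height, factor):
        row = []
        for x in range(0, width, factor):
            tile = [
                pixels[y + dy][x + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(tile) // len(tile))
        out.append(row)
    return out
```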

In an alternative embodiment, a question is then asked as to whether there is a text string either overlaying the image, or adjacent to the image, which might be used as a text alternative to assist blind users, or users accessing the document on a small screen without image display (or with image display turned off), for determining the meaning of the image. Adjacent text may be selected for inclusion by the use of one or more of a plurality of features including: textual conventions such as the inclusion of a word such as ‘caption’, ‘image’ or ‘photo’; layout conventions such as the placement of a short text string in isolation and adjacent to an image; and typographical conventions such as a change in font, font style, size or weight. In this embodiment, such text alternatives will be displayed to the user by application 402 for acceptance, editing or rejection. The identified text alternative is stored in association with image 504 for output.
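
The three families of features listed above may be combined into a simple score, as in the following sketch. It assumes that adjacency, isolation and font change have already been determined elsewhere and are passed in as flags; the word limit of 20 and the function name `caption_score` are assumptions made for the example.

```python
import re

CAPTION_WORDS = re.compile(r"\b(caption|image|photo)\b", re.IGNORECASE)


def caption_score(text, is_adjacent, is_isolated, font_differs, max_words=20):
    """Count how many caption conventions a candidate string satisfies."""
    if not is_adjacent:
        return 0
    score = 0
    if CAPTION_WORDS.search(text):
        score += 1  # textual convention
    if is_isolated and len(text.split()) <= max_words:
        score += 1  # layout convention: short string placed in isolation
    if font_differs:
        score += 1  # typographical convention
    return score
```

A candidate with a non-zero score could then be displayed to the user for acceptance, editing or rejection, as described above.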

A question is subsequently asked at step 710, as to whether another image 504 n+1 is present in the document 409. If the question of step 711 is answered positively, a further question is asked as to whether this image has already been used in the document. If the image has been used before, this instance is replaced by a reference to the previous image and control returns to step 709; otherwise control returns to step 709 for the processing of this next image as previously described, and the image processing cycle continues until all images 504 n have been processed. The question of step 711 is therefore eventually answered negatively, whereby application 402 may then invoke document processing module 403 for outputting the electronic document formatted for display at step 307. Referring to the question of step 708, if the question is answered negatively and there is no image data 504 n to process in document 409, application 402 likewise next invokes document processing module 403 for outputting the converted document at step 307.
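
The replacement of a repeated image by a reference to its earlier instance may be sketched with a content hash. The example assumes that each image is available as a byte string and treats two images as the same only if they are byte-for-byte identical, which is a simplification; the function name `dedupe_images` is hypothetical.

```python
import hashlib


def dedupe_images(images):
    """Map each image (bytes) to the index of its first occurrence."""
    first_seen = {}
    references = []
    for index, data in enumerate(images):
        digest = hashlib.sha256(data).hexdigest()
        # setdefault returns the stored index if this digest was seen before.
        references.append(first_seen.setdefault(digest, index))
    return references
```

An entry equal to its own position denotes a new image to be processed at step 709; any other entry denotes a reference to a previously processed image.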

Therefore, regardless of the format and content of an electronic document formatted for printing 409, application 402 and data processing modules 403 to 406 thereof process document 409 into text data 502, encoded in ASCII or one of a plurality of character sets, and downsampled image data 504, substantially reducing the storage space required to store the visual components of the original document.

In a further step, application 402 processes the extracted text to search for print conventions which indicate a reference to another part of the document. Application 402 seeks to identify printed page numbers. In formats where such numbers are stored directly, such as ‘doc’ or some versions of ‘pdf’, the numbers are identified and stored. Where there are no stored numbers, sequential numbers may be identified by searching for print conventions such as ‘page’ or ‘p’ with a number, or a number placed in isolation at a consistent location on the page or mirrored location on facing pages. These are stored as page numbers and mapped by offset to the leaf number or file section representing a page. Application 402 next searches for a ‘contents’ page, identified by strings of data matching headings adjacent to numbers matching the page number of pages with the same heading. When a contents page is identified, all its entries are hyper-linked. Similarly, references within the text which contain clear identifiers such as ‘see page n’ are hyper-linked to the section with the appropriate page number.
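
The hyper-linking of ‘see page n’ references may be sketched as follows, assuming that the mapping from printed page numbers to output file sections has already been built and that the output is HTML. The pattern, the mapping argument and the function name `link_page_refs` are hypothetical.

```python
import re

PAGE_REF = re.compile(r"\bsee page (\d+)\b", re.IGNORECASE)


def link_page_refs(text, printed_to_section):
    """Wrap 'see page n' references in links to the section holding page n."""
    def replace(match):
        target = printed_to_section.get(int(match.group(1)))
        if target is None:
            return match.group(0)  # unknown page: leave the text untouched
        return f'<a href="{target}">{match.group(0)}</a>'
    return PAGE_REF.sub(replace, text)
```

References to pages absent from the mapping are left as plain text rather than linked to a guessed target.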

In another example, application 402 will search for a reference such as ‘see Appendix A’ and it will then search for a heading matching text string ‘Appendix A’ and, having successfully located the heading, provide a hyperlink within the document from the text reference to the heading. Similarly, conventions such as those identifying footnotes or endnotes will be identified and hyper-linked.

In a further embodiment, application 402 searches the text for characters which suggest a reference to an external electronic resource such as a URL for a web page or an electronic mailing address. These are deduced by the presence of text such as www.name or name.com, or name@domain.topleveldomain, for example. Where such text strings are detected, application 402 first checks the string for characters such as white space or line feed which should not be present. It then, where an appropriate network connection is available, tests the link to the resource and where valid provides an active hyperlink.
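
Detection of such text strings may be approximated with regular expressions, as sketched below. The patterns are deliberately simple assumptions: they recognise strings beginning with `http://`, `https://` or `www.`, and addresses of the form name@domain.topleveldomain, but not bare forms such as name.com. Neither the repair of stray white space nor the network test of the link is shown.

```python
import re

URL = re.compile(r"\b(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def find_external_refs(text):
    """Return (urls, emails) found in text; candidates only, not yet tested."""
    # Trailing sentence punctuation is not part of the address.
    urls = [url.rstrip(".,;:)") for url in URL.findall(text)]
    emails = EMAIL.findall(text)
    return urls, emails
```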

In one embodiment of application 402, the user is presented with a prompt, which will allow them to test all links and send test e-mails to all addresses detected in the document. Where the resource has been tested successfully, either by the successful loading of a web resource, or the receipt of a confirmation e-mail, or indeed by the e-mails having not been returned as undeliverable, the resource will be hyperlinked.

The step 307 of outputting the electronic document formatted for display, converted from the electronic document formatted for printing 409 according to the present invention, is further detailed in Figure 8. The document processing module 403 firstly processes the buffered metadata 501 of step 701 to identify document structuring data, for instance layout data 506 comprising structural tags if document 409 is formatted according to HTML or XML structured languages, or headings if the format of document 409 is capable of storing such information, as for example "doc" or "pdf" file formats. A first question is therefore asked at step 801, as to whether the buffered metadata comprises such information. If the question of step 801 is answered negatively, then at step 802 the document processing module 403 recursively processes the text data 502 output at step 705 to identify headings, sub-headings and other document subdivisions according to one or more of a plurality of conventions: conventional document sub-dividing conventions, including for instance identifying strings of alphanumerical data 502 such as "chapter", "part", sequential numbering and the like; layout conventions such as the placement of a short text string in isolation; and typographical conventions such as a change in font, font style, size or weight. A second question is asked at step 803 further to the identifying attempt of step 802, as to whether a document structure has been identified. If the question of step 803 is answered positively, then at step 804 the document processing module 403 temporarily stores data of the identified structure in buffer 407.
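
The identification of step 802 may be sketched as a test applied to each line of text. The example assumes that the blank lines around a candidate and any change of font have been determined beforehand and are passed in as flags. The pattern, the limit of 12 words and the way the three conventions are combined are assumptions; the pattern is a heuristic which will, for instance, also accept a sentence beginning with a year.

```python
import re

HEADING_PATTERN = re.compile(
    r"^(?:(?:chapter|part)\s+(?:\d+|[ivxlc]+)\b|\d+(?:\.\d+)*\s+\S)",
    re.IGNORECASE,
)


def looks_like_heading(line, prev_blank, next_blank, font_changed, max_words=12):
    """Apply the sub-dividing, layout and typographical conventions to a line."""
    text = line.strip()
    if not text:
        return False
    conventional = bool(HEADING_PATTERN.match(text))
    isolated = prev_blank and next_blank and len(text.split()) <= max_words
    return conventional or (isolated and font_changed)
```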

If the question of step 803 is answered negatively, the document processing module further processes the text to ask if potential headings can be identified to a lower level of certainty. In one implementation of application 402 the potential headings identified are presented to the user for acceptance, editing or rejection. Then, at step 805 the document processing module 403 recursively processes the text data 502 output at step 705 to identify headings, subheadings and other document subdivisions by selecting a first string of alphanumerical data 502 and processing the frequency and disposition of said first string throughout the document. At step 806, module 403 then assigns a respective probability of structural heading to the first string 502 based upon the processing of step 805 and temporarily stores said probability in respect of said first string. At step 807, a question is asked as to whether a string of alphanumerical data 502 remains to be processed in the document according to steps 805 and 806 and, if answered positively, control returns to step 805, whereby a next string of alphanumerical data 502 is selected for which a probability is obtained and stored, and so on and so forth until all probable strings 502 have been processed. The question of step 807 is therefore eventually answered negatively, and the document processing module 403 derives a probable document structure from the stored probabilities, which it temporarily stores at step 804.
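
The description does not specify how the probability of step 806 is computed. The sketch below shows one crude possibility only: the score is the fraction of occurrences of a string at which it stands alone on its line, halved for long strings. The formula, the limit of 12 words and the function name `heading_probability` are all assumptions.

```python
def heading_probability(candidate, lines, max_words=12):
    """Crude score in [0, 1] from how often and where a string occurs."""
    occurrences = [i for i, line in enumerate(lines) if candidate in line]
    if not occurrences:
        return 0.0
    alone = sum(1 for i in occurrences if lines[i].strip() == candidate)
    score = alone / len(occurrences)  # disposition: stands on its own line
    if len(candidate.split()) > max_words:
        score *= 0.5  # long strings are less likely to be headings
    return score
```

A probable document structure could then be derived by retaining the strings whose score exceeds a chosen threshold, in document order.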

If the question of step 801 is answered positively, or as and when document processing module 403 stores structure data at step 804 as a result of identifying conventional headings at step 802 or deriving structural headings according to steps 805 to 807, control proceeds to step 808, at which the document processing module 403 generates a table of contents (TOC) from the structural data, then a file index at step 809, from which TOC and index application 402 can subsequently compile an electronic document formatted for display at step 810. In a preferred embodiment of the present invention, the electronic document formatted for display is a CHM file, known to those skilled in the art as a ‘Help’ file, which can be processed for display by any Windows OS 401, compiled from the TOC of step 808 as a HHC file, or Html Help table of Contents, and the Index of step 809 as a HHK file, or Html Help index file.
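
The generation of the HHC file at step 808 may be illustrated as follows. The sketch assumes the sitemap-style markup used by HTML Help contents files, in which each entry is an `OBJECT` of type `text/sitemap` carrying `Name` and `Local` parameters, and produces a flat, single-level list. The function name `build_hhc` is hypothetical, and compilation into a CHM file requires a separate HTML Help compiler which is not shown.

```python
from html import escape


def build_hhc(entries):
    """Render (title, target_file) pairs as a flat HTML Help contents file."""
    items = []
    for title, target in entries:
        items.append(
            '<LI><OBJECT type="text/sitemap">\n'
            f'<param name="Name" value="{escape(title, quote=True)}">\n'
            f'<param name="Local" value="{escape(target, quote=True)}">\n'
            "</OBJECT>"
        )
    body = "\n".join(items)
    return (
        '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">\n'
        "<HTML><HEAD></HEAD><BODY>\n"
        f"<UL>\n{body}\n</UL>\n"
        "</BODY></HTML>\n"
    )
```

Nested headings and sub-headings would be expressed by nesting further `UL` lists inside an entry, which this flat sketch omits.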

In another embodiment of the invention, the terminal 104 is itself a server, connected to the network, such as the Internet or an Intranet, which a number of users at remote locations can use to convert documents according to the present invention. The remote users can initiate document conversion by such means as transferring an electronic copy of the document described herein below to be converted to a specific location, either directly or by use of an automated system for transferring documents such as an ECM (Electronic Content Management) System or other Document Management System. Equally however, the remote user could initiate the conversion process by such means as dragging an icon representing the document to an icon representing application 402, or using a command within an application such as Microsoft Word and choosing the conversion process, the remote user being thus able to control the conversion process by means such as pre-set commands, commands loaded into an editable *.ini file, or input by a remote user using an XML interface. In this embodiment, the application is based on the server 104 and pre-set to convert all documents delivered to one or more specific locations on a computer network. Alternatively it can be pre-set to convert a document in response to user input, either directly or indirectly through an application such as an ECM system or by using a command from software on a remote computer.

Claims

1. An apparatus for converting an electronic document formatted for printing into an electronic document formatted for display, the apparatus comprising processing means, memory means and data input means, the memory means storing instructions, a data structure defining at least one electronic document formatted for printing and having a layout, and at least one document layout template, the instructions configuring the processing means to: compare the document layout defined by the data structure against the at least one document layout template; map alphanumerical data in the data structure to corresponding ASCII character data; optically recognize alphanumerical data in the data structure; compare the optically-recognized alphanumerical data against the mapped alphanumerical data; identify image data in the data structure; rescale the image data if the image data exceeds an image data parameter; and output the electronic document including the optionally rescaled image data and compared ASCII character data, wherein the output electronic document is formatted for display according to the document layout template.

2. An apparatus according to claim 1, wherein the memory means further stores a plurality of recipient terminal profiles and the instructions further configure the processing means to read a selection of a recipient terminal profile by a user and to format the output electronic document for display according to the selected recipient terminal profile.

3. A method for converting an electronic document formatted for printing into an electronic document formatted for display, the method comprising the steps of: comparing a document layout defined by the data structure of the electronic document against at least one document layout template; mapping alphanumerical data in the data structure to corresponding ASCII character data; optically recognizing alphanumerical data in the data structure; comparing the optically-recognized alphanumerical data against the mapped alphanumerical data; identifying image data in the data structure and rescaling the image data if the image data exceeds an image data parameter; and outputting the electronic document including the optionally rescaled image data and compared ASCII character data, wherein the output electronic document is formatted for display according to the document layout template.

4. A method for converting an electronic document formatted for printing into an electronic document formatted for display, the method comprising the steps of: mapping alphanumerical data in a data structure of the electronic document to corresponding ASCII character data; optically recognizing alphanumerical data in the data structure; comparing the optically-recognized alphanumerical data against the mapped alphanumerical data; identifying image data in the data structure; identifying sections of the electronic document content where the intended meaning is conveyed by the use of layered text or text and image data and processing this layered data according to rules to ensure that the intended meaning is preserved in the converted document; identifying additional text included for decorative effect and deleting it or replacing it with appropriate markup; rescaling the image data if the image data exceeds an image data parameter; and outputting the electronic document including the optionally rescaled image data, processed layered data and compared ASCII character data.

5. A system substantially as herein described in relation to and in association with the accompanying drawings.