AU2006235880A1

AU2006235880A1 - Page identification

Info

Publication number: AU2006235880A1
Application number: AU2006235880A
Authority: AU
Inventors: Anthony Hok Tsung Ko; Mark Ronald Tainsh; Alyce Widjaja
Original assignee: Canon Information Systems Research Australia Pty Ltd
Current assignee: Canon Information Systems Research Australia Pty Ltd
Priority date: 2006-11-06
Filing date: 2006-11-06
Publication date: 2008-05-22

Description

AUSTRALIA

PATENTS ACT 1990

COMPLETE SPECIFICATION

NAME OF APPLICANT(S): Canon Information Systems Research Australia Pty Ltd

ADDRESS FOR SERVICE: DAVIES COLLISON CAVE, Patent Attorneys, 255 Elizabeth Street, Sydney, New South Wales, Australia, 2000

INVENTION TITLE: Page identification

The following statement is a full description of this invention, including the best method of performing it known to me/us:

PAGE IDENTIFICATION

Field of the Invention

The present invention relates to a method and apparatus for identifying a scanned page in a document, and a method and apparatus for generating a reference signature for use in identifying a page in a document.

Description of the Background Art

The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that the prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.

With the wide availability of scanners in environments such as offices and universities, it is easy for a user to reproduce parts of a physical document, including those that are subject to copyright, such as pages from a published document. According to governmental regulations, copyright owners must be fairly compensated for the reproduction of any copyright materials that they own. In order to accurately track the reproduction of copyrighted materials, it is necessary to first determine if the reproduction involves materials that are subject to copyright. The conventional technique requires the user to manually submit a form that details the copyright information of the materials being reproduced. Not only is this technique cumbersome, but it is also hard to verify that the provided copyright information corresponds to the pages that are being reproduced.

A way to ensure that the obtained copyright information corresponds to the pages being reproduced is to embed a printable code that contains copyright information, such as a barcode, on each page. When a page is scanned, the presence of the printable code on the page is detected and the code is then deciphered to obtain the copyright information. A disadvantage of this technique is the requirement for each page to be physically modified. In addition, the layout of some pages may not allow a printable code to be embedded due to the lack of space.

Another technique involves identifying a page using document signals. A document signal is generated from the scanned page and compared to a library of document signals to find the closest match. In addition to document signals, the library contains the copyright information for each reference document signal. When a match is found and the page identified, the copyright information for the document can be obtained. Although no physical modification is required, the method of comparing signals is not scalable to a large library of documents, since the generated signal has to be compared to every page of every document in the library to find the closest match. In addition, if documents in the library contain similar pages, there is a possibility of attributing the page to the wrong document.

Various other techniques have been proposed that involve using a reference library to identify a scanned image, but none has been found to be easily scalable to a large library. For example, document matching can be done over word-level topological properties, such as the word layout of the document or the location of a portion of words within the document. Word-level topological properties are extracted from the scanned image of the document and the extracted properties are then compared to a library in order to find a match. Even though optimisation can be performed, such as using a tree-like structure to store the properties, this method is not easily scalable to a large library of documents. In addition, this method cannot be used to identify pages that only contain images.

A technique normally used to identify pages containing images is image hashing. Image hashing is the process of producing a binary string, called a hash value, from an image using a hash function. Perceptually similar images should map to the same hash value and perceptually different images should map to different hash values. A disadvantage of this method is that artefacts in the image, which are common for scanned images, may cause the same page to have different hash values. In addition, when the library of documents to be identified against is large, the probability of a collision increases. That is, there is a higher chance of perceptually different images mapping to the same hash value, resulting in a false positive identification of images.

The ability to identify a scanned page can also be applied to other areas, such as using the physical document to locate the electronic version of the document. This is useful in the scenario where multiple copies of a document are printed and distributed to various people in a meeting. A person who receives a copy of the physical document can then locate the electronic version by scanning the document. There is a need for a scalable method that is able to reliably identify a page at the point of scanning and that can be used for pages with different types of content.

Multi-function print devices (MFDs) are devices that integrate a number of hard-copy document handling functions, such as facsimile transceiver, scanner, copier and printer, in a single device. MFDs have become commonplace in the modern office environment and find particular application in the so-called "home office" where the need for each function often exists but the workload for each function does not justify a stand-alone or dedicated device.

Summary of the Present Invention

It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.

In a first broad form the present invention provides a method of identifying a scanned page in a document, the method including, in a document handling device: a) determining a document identifier associated with the document; b) determining a page signature of the scanned page using a selected algorithm; c) determining one or more reference signatures using the document identifier; and, d) identifying the page by comparing the page signature to the one or more reference signatures.

In a second broad form the present invention provides apparatus for identifying a scanned page in a document, the apparatus including a document handling device for: a) determining a document identifier associated with the document; b) determining a page signature by applying a selected algorithm to the scanned page; c) determining one or more reference signatures using the document identifier; and, d) identifying the page by comparing the page signature to the one or more reference signatures.

In a third broad form the present invention provides a computer program product for use in identifying a scanned page in a document, the computer program product including computer executable code which when executed on a processor in a document handling device, causes the document handling device to: a) determine a document identifier associated with the document; b) determine a page signature by applying a selected algorithm to the scanned page; c) determine one or more reference signatures using the document identifier; and, d) identify the page by comparing the page signature to the one or more reference signatures.

In a fourth broad form the present invention provides a method of generating a reference signature for use in identifying a page in a document, the method including, in a document handling device: a) determining a document identifier associated with the document; b) generating a reference signature indicative of the page using a selected algorithm; and, c) storing an indication of the reference signature and the document identifier.

In a fifth broad form the present invention provides apparatus for generating a reference signature for use in identifying a page in a document, the apparatus including a document handling device for: a) determining a document identifier associated with the document; b) generating a reference signature indicative of the page using a selected algorithm; and, c) storing an indication of the reference signature and the document identifier.

In a sixth broad form the present invention provides a computer program product for generating a reference signature for use in identifying a page in a document, the computer program product including computer executable code which when executed on a processor in a document handling device, causes the document handling device to: a) determine a document identifier associated with the document; b) generate a reference signature indicative of the page using a selected algorithm; and, c) store an indication of the reference signature and the document identifier.

Brief Description of the Drawings

An example of the present invention will now be described with reference to the accompanying drawings, in which:

Figure 1A is a flow chart of an example of a process for identifying a scanned page from a document;
Figure 1B is a flow chart of an example of a process for generating a reference signature for use in identifying a page from a document;
Figure 2 is a schematic diagram of an example of an MFD;
Figure 3 is a schematic diagram of an example of a networked environment containing a number of MFDs;
Figure 4 is a schematic diagram of an example of a computer;
Figure 5 is a schematic diagram of an example of a page identification system;
Figure 6A is an example of the format of a database table used to store the reference signatures;
Figure 6B is an example of the format of a database table used to store an indication of the algorithm used to generate page signatures;
Figure 7 is a block diagram showing an example of a system incorporating the page identification system of Figure 5;
Figure 8 is a flowchart of a first specific example of the process for identifying a scanned page; and,
Figure 9 is a flowchart of a second specific example of the process for identifying a scanned page.

Detailed Description Including Best Mode

A method of identifying a scanned page using a document handling device will now be described with reference to Figure 1A.

In this example, the method includes, at step 100, determining a document identifier associated with the document containing the scanned page. This may be achieved in any one of a number of ways, such as by having a user provide a document identifier, for example an indication of an ISBN or other identifier. Alternatively this may include scanning some or all of the pages in the document, allowing the document identifier to be automatically determined therefrom, for example by detecting a barcode provided on the page which is indicative of the document identifier.

At step 110 a page signature is determined from the scanned page, using a selected algorithm.

This is achieved by applying the algorithm to a digital representation of the scanned page, to allow an output to be provided which is indicative of the identity of the page.

The nature of and the output from the algorithm used may vary depending on the nature of the pages and the nature of the content provided thereon, such as whether the page contains text as opposed to images. Thus, for example, in the case of text, the algorithm may involve using optical character recognition techniques to identify one or more of the characters on the page, such as page numbers. Alternatively, this may involve the use of image hashing functions or the like, if the content includes an image. In any event, the signature is generally selected such that each page within the document has a respective signature, and the manner in which this is achieved will vary depending on the preferred implementation, as will be described in more detail below.

At step 120 one or more reference signatures are determined using the document identifier.

The reference signatures are predetermined signatures indicative of each of the pages in the document, and these are typically generated using a separate process as will be described in more detail below. The reference signatures are typically retrieved from a database or other data store, using the document identifier, although any suitable technique may be used depending on the preferred implementation.

At step 130 the page signature is compared to the one or more reference signatures to allow the page to be identified. This typically involves performing a matching process, to match the page signature to one of the reference signatures. Once a match has been found, determination of the page associated with the matching reference signature allows the scanned page to be identified.
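Steps 100 to 130 can be illustrated with a minimal Python sketch. It is not the implementation described in this specification: the in-memory `REFERENCE_STORE`, the function `identify_page` and the placeholder `toy_algorithm` are all hypothetical names introduced here, and the document identifier is assumed to have been obtained already at step 100.

```python
from typing import Callable, Dict, Optional

# Hypothetical in-memory stand-in for the reference signature store:
# document identifier -> {reference signature -> page number}.
REFERENCE_STORE: Dict[str, Dict[str, int]] = {
    "isbn-0000000001": {"sig-a": 1, "sig-b": 2},
    "isbn-0000000002": {"sig-a": 7},  # same signature, different document
}


def identify_page(
    document_id: str,
    scanned_page: bytes,
    algorithm: Callable[[bytes], str],
) -> Optional[int]:
    """Return the matching page number, or None if no match is found."""
    # Step 110: derive the page signature from the scanned page.
    page_signature = algorithm(scanned_page)
    # Step 120: fetch only the reference signatures for this document.
    references = REFERENCE_STORE.get(document_id, {})
    # Step 130: compare the page signature with the reference signatures.
    return references.get(page_signature)


def toy_algorithm(page: bytes) -> str:
    """Placeholder signature algorithm: decodes the page bytes as text."""
    return page.decode("ascii")
```

In this sketch the signature "sig-a" appears in two documents, and the document identifier decides which page is returned.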

An example of the process for generating the reference signatures will now be described with reference to Figure 1B.

In this example, at step 150, the process involves determining a document identifier associated with a document. This may be achieved in any one of a number of ways, as described for example with respect to step 100 above.

At step 160, the process involves generating a reference signature using a selected algorithm.

Whilst the algorithm may be selected using any suitable technique, this is generally performed so as to allow the same algorithm to be selected during determination of the page signatures, as well as to ensure that the reference signatures for each of the pages within a given document are unique.

Thus, in one example, the algorithm may be selected manually or automatically, with an indication of the algorithm being stored together with the document identifier, so that the algorithm can be subsequently determined from the document identifier.

Alternatively, or additionally, the algorithm may be selected based on the nature of the content on the page. In this instance, as long as the algorithm is selected in a consistent manner, the same algorithm will be selected each time a signature is generated for the page, thereby ensuring consistency between the generation of reference and page signatures.

At step 170, an indication of the reference signature is stored together with an indication of the document identifier, thereby allowing the reference signature to be subsequently retrieved. This process may also involve storing an indication of the algorithm used to determine the reference signature, to allow subsequent identification of the algorithm, although this will depend on the preferred implementation. It will also be appreciated that the reference signature and document identifier may be stored in any suitable manner, and in any suitable location, depending on the preferred implementation.
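Steps 150 to 170 might be sketched as follows. This is an illustration under stated assumptions: `ALGORITHMS`, `build_reference_record` and the algorithm name "sha256-demo" are hypothetical, and SHA-256 is used only as a convenient placeholder (it is a cryptographic hash of the raw bytes, not a perceptual signature of the kind a scanned page would need).

```python
import hashlib
from typing import Callable, Dict, List

# Hypothetical table of named signature algorithms.
ALGORITHMS: Dict[str, Callable[[bytes], str]] = {
    "sha256-demo": lambda page: hashlib.sha256(page).hexdigest(),
}


def build_reference_record(
    document_id: str, pages: List[bytes], algorithm_name: str
) -> dict:
    """Generate reference signatures for every page of one document."""
    algorithm = ALGORITHMS[algorithm_name]
    # Step 160: one reference signature per page, using the selected algorithm.
    signatures = [algorithm(page) for page in pages]
    # Signatures only have to be unique within this document.
    if len(set(signatures)) != len(signatures):
        raise ValueError("algorithm does not distinguish all pages")
    # Step 170: keep the signatures with the document identifier and the
    # algorithm name, so the same algorithm can be selected at scan time.
    return {
        "document_id": document_id,
        "algorithm": algorithm_name,
        "signatures": {sig: n for n, sig in enumerate(signatures, start=1)},
    }
```

The uniqueness check reflects the requirement that reference signatures be unique within a document; a publisher could respond to a failure by choosing a different algorithm.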

Thus, it will be appreciated that the above described process utilises a document identifier to identify the document containing the scanned page, with a page signature derived from the scanned page being used to identify the page itself.

The use of the document identifier to identify the document containing the page means that the page signatures and reference signatures (hereinafter referred to generally as "signatures") do not need to be unique between documents. Instead, it is only necessary for the reference signatures to be unique within a document, with common reference signatures between documents being distinguished through the use of the document identifier. This has a number of implications.

Firstly, as the signatures need only to distinguish between pages in the document, the complexity of the signature and the associated algorithm used to determine the signature is vastly reduced.

Secondly, this allows different algorithms to be used to derive signatures for different documents, thereby allowing an optimum algorithm to be selected depending on the nature of content within the document.

Thirdly, this vastly reduces the number of signature records that need to be searched in order to allow pages to be identified. In particular, only a number of reference signatures corresponding to the number of pages in the document need be searched, as opposed to signatures representing every page in a library. This vastly reduces the amount of processing time required and enhances the speed at which pages can be identified.
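The size of the reduction can be illustrated with a small calculation. The library dimensions below are hypothetical figures chosen only for the example, and `comparisons_needed` is a name introduced here; the count assumes a simple linear search in both cases.

```python
from typing import List, Tuple


def comparisons_needed(
    pages_per_document: List[int], document_index: int
) -> Tuple[int, int]:
    """Return worst-case comparison counts as (library-wide, per-document)."""
    # Without a document identifier, every page of every document is a candidate.
    library_wide = sum(pages_per_document)
    # With a document identifier, only that document's pages are candidates.
    per_document = pages_per_document[document_index]
    return library_wide, per_document


# Hypothetical library: 10,000 documents of 300 pages each.
library = [300] * 10_000
print(comparisons_needed(library, 0))  # (3000000, 300)
```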

The above described process may be performed in any one of a number of manners, and is typically performed at least in part using a document handling device, such as an MFD and/or a computer.

An example of an MFD is shown in more detail in Figure 2.

In this example, the MFD includes a scanner 200, a printer 205, a fax unit 210, an optional dedicated copier 215, an Input/Output controller 220, a multi-function controller 225, a user interface controller 230, and an optional memory 260, coupled together via a bus 235, as shown. An optional reader, such as a bar-code reader, may also be provided as shown at 255, to allow the document identifier to be determined.

The user interface controller 230 is typically coupled to one or more user interface devices, such as a touch screen 240 and keypad 245, to allow a user to view information provided by the MFD 300 and provide appropriate input commands. A recognition device 250 may also be provided for obtaining information for identifying users. This may include for example a biometric scanning device, or a swipe card or RFID (Radio Frequency Identification) tag reader for reading information from a suitable swipe card or RFID tag.

In use, the I/O controller 220 operates to handle interaction with external devices, such as remote computers or the like, whilst the multi-function controller 225 operates to control the scanner 200, printer 205, fax 210 and copier 215, to allow desired jobs to be performed. It will therefore be appreciated that the controllers are typically implemented as software executed by a suitable processor, which is operating under control of appropriate software applications stored in a store (not shown).

In particular, in one example, the page identification processes described in more detail below may be performed through the use of a suitable module loaded into the processor from memory, and this is typically implemented by the multi-function controller 225. This may be achieved in any one of a number of manners, but in one example may be achieved using a Java module that activates a graphical user interface (GUI) on the touch screen 240, and interacts with the remote computers and/or the servers as required. This allows the MFD to display information relating to the page identification process, as well as allowing a user to provide input commands to control the process.

In one example, the MFD may be provided in a network environment as will now be described with respect to Figure 3.

In particular, in this example the network environment includes a number of Multi-Function Devices (MFDs) 300, coupled to a number of computers 320, and optionally a number of servers 330, via a communications network 310. The servers may also be coupled to one or more databases 340, as shown.

In use, the MFDs 300 are used to perform various document handling jobs, such as printing, scanning, copying, or faxing of documents, or the like. As part of this process, the computers 320 may be used to provide documents to the MFDs 300, for example in the case of printing applications, or may be used to display job results, for example following scanning of the documents by the MFDs 300. Similarly, the servers 330 may be used to provide or receive documents used in jobs, as well as to provide additional network based activities, such as access to reference signatures, and this may require interaction with data in the database 340.

It will therefore be appreciated that a wide range of network architectures are encompassed by the system and the configuration shown is for the purpose of example only. Thus, for example, the communications network may be any suitable communications network, but is typically a Local Area Network (LAN) 310 such as an intranet, although it may also include a Wide Area Network, the Internet, or the like. Furthermore, any number of MFDs 300, computers 320, or servers 330 may be used, and the number shown is for the purpose of illustration only.

An example of a general-purpose computer 320 is shown in Figure 4.

The computer module 401 typically includes at least one processor unit 405, and a memory unit 406, for example formed from semiconductor random access memory (RAM) and read only memory (ROM).

The module 401 includes a number of input/output interfaces including an audio-video interface 407 that couples to the video display 414 and loudspeakers 417, and an I/O interface 413 for the keyboard 402 and mouse 403 and optionally a joystick (not illustrated). This allows the computer system 400 to determine and interpret input commands supplied by a user.

An I/O interface 408, such as a network interface card (NIC), is also typically used for connecting the computer to the computer network 310, which can optionally provide onward connectivity to a network printer 451, the server 330 and the database 340. The I/O interface 408 can also provide connectivity to a local printer 415.

A storage device 409 is provided and typically includes a hard disk drive 410 and a floppy disk drive 411. A magnetic tape drive (not illustrated) may also be used. A CD-ROM drive 412 is typically provided as a non-volatile source of data.

The components 405 to 413 of the computer module 401 typically communicate via an interconnected bus 404 and in a manner that results in a conventional mode of operation of the computer system 400 known to those in the relevant art. Examples of computers on which the described arrangements can be practised include IBM computers and compatibles, Sun Sparcstations or the like.

The process of performing jobs such as page identification is typically implemented using software, such as one or more application programs executing within the computer system 400. Typically, the application activates a GUI on the video display 414 of the computer system 400 which displays pages to be identified, such as pages to be printed, scanned or copied.

In particular, the methods and processes are effected by instructions in the software that are carried out by the computer. The instructions may be formed as one or more code modules, each for performing one or more particular tasks. Typically the execution of the instructions may require a number of different application programs to interact, and may also require the presence of a suitable driver that is configured to operate with a specific device or MFD.

The software may be stored in a computer readable medium, and loaded into the computer, from the computer readable medium, to allow execution. A computer readable medium having such software or computer program recorded on it is a computer program product.

The use of the computer program product in the computer preferably effects an advantageous apparatus for distributed printing, scanning or copying.

The term "computer readable medium" as used herein refers to any storage or transmission medium that participates in providing instructions and/or data to the computer system 400 for execution and/or processing. Examples of storage media include floppy disks, magnetic tape, CD-ROM, a hard disk drive, a ROM or integrated circuit, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external to the computer module 401. Examples of transmission media include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.

Specific examples of the processes of page identification will now be described in more detail.

First Specific Example

Figure 5 is a block diagram showing a basic arrangement for a system used to identify a scanned page.

In this example, the system includes a data store 504, such as a database 340, containing reference signatures. A signature describes the result of applying an algorithm to a scanned image of a page in order to distinguish the page based on its visual characteristics. The format of the signature depends on the type of algorithm used to extract the distinguishing visual characteristics. For example, if the algorithm involves detecting the characters located at a specified position using character recognition techniques, such as OCR, the signature will be the expected characters. If the algorithm performs an image hashing on the page, the signature will be the expected hash result.

The reference signatures are stored according to the document that contains the pages, allowing the reference signatures of all the pages in a document to be retrieved by specifying a document identifier 501.

This can be achieved by storing the reference signatures in a database table as shown in Figure 6A. In this example, the name of the algorithm used to generate the reference signature can also be stored in a database table as shown in Figure 6B. The reference signatures for a specified document can then be retrieved by querying the database for all rows that have the identifier of the required document.
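A possible layout for these two tables can be sketched with Python's built-in `sqlite3` module. The table names, column names and sample rows are assumptions made for illustration; the text does not give the exact column layout of Figures 6A and 6B.

```python
import sqlite3

# In-memory database standing in for the data store of reference signatures.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE reference_signature (      -- assumed shape of Figure 6A
        document_id TEXT    NOT NULL,
        page_number INTEGER NOT NULL,
        signature   TEXT    NOT NULL,
        PRIMARY KEY (document_id, page_number)
    );
    CREATE TABLE document_algorithm (       -- assumed shape of Figure 6B
        document_id    TEXT PRIMARY KEY,
        algorithm_name TEXT NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO reference_signature VALUES (?, ?, ?)",
    [("doc-1", 1, "Chapter One:Page "), ("doc-1", 2, "Chapter Two:Page ")],
)
conn.execute("INSERT INTO document_algorithm VALUES (?, ?)", ("doc-1", "ocr-text"))


def load_references(conn: sqlite3.Connection, document_id: str):
    """Return (algorithm name, [(page number, signature), ...]) for a document."""
    rows = conn.execute(
        "SELECT page_number, signature FROM reference_signature "
        "WHERE document_id = ? ORDER BY page_number",
        (document_id,),
    ).fetchall()
    algo = conn.execute(
        "SELECT algorithm_name FROM document_algorithm WHERE document_id = ?",
        (document_id,),
    ).fetchone()
    return (algo[0] if algo else None), rows
```

Keying both tables on the document identifier means a single identifier yields both the candidate signatures and the name of the algorithm that produced them.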

The document identifier 501 is used to uniquely identify a document that contains the scanned page 502. The type of information being used as a document identifier depends on the document type. It can be the ISBN number of a published document or a combination of the title, author and revision of the document.

The reference signatures are generally generated prior to the system being used. In order to facilitate the generation of reference signatures, an application, for example a plug-in integrated with document publishing software, can be distributed to publishers. The application contains the set of signature generator algorithms known to the system. When a document is published, the publisher can distribute, together with the document, a set of reference signatures for the document. The application can also be extended to allow publishers to define a custom algorithm that can distinguish the pages in their documents with more accuracy. If a custom algorithm is used, it should be distributed together with the reference signatures.

The system also includes a data store 503, which again may form part of the database 340, that contains the algorithms used to generate the reference signatures for the documents. The data store 503 can be implemented as a simple hash table that maps the name of the algorithm to the implementation of the algorithm. The name of the algorithm has to be unique. In this example, a single algorithm is used to generate all the reference signatures for a particular document. An example of an algorithm used to generate a page signature for a page that only contains text is:

START
    Use OCR to read characters starting from position (x1, y1) to position (x2, y2) on the image;
    Feature1 = first 12 characters starting from position (x1, y1);
    Use OCR to read characters starting from position (x3, y3) to position (x4, y4) on the image;
    Feature2 = first 5 characters starting from position (x3, y3);
    Page signature = Feature1 + Feature2;
END
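The text-page algorithm above, together with the hash table of data store 503, might look roughly like this in Python. Several things are assumptions: the OCR engine is replaced by a caller-supplied `read_region` function, the region coordinates are invented because the specification leaves them unspecified, the two features are assumed to be concatenated, and the names `text_page_signature` and `ALGORITHM_TABLE` are hypothetical.

```python
from typing import Callable, Dict, Tuple

Region = Tuple[int, int, int, int]    # (x_start, y_start, x_end, y_end)
ReadRegion = Callable[[Region], str]  # stand-in for an OCR call on one region

# Hypothetical coordinates; the source does not give values for these regions.
REGION_1: Region = (0, 0, 200, 20)
REGION_2: Region = (0, 280, 50, 300)


def text_page_signature(read_region: ReadRegion) -> str:
    """Build a signature from the leading characters of two page regions."""
    feature1 = read_region(REGION_1)[:12]  # first 12 characters of region 1
    feature2 = read_region(REGION_2)[:5]   # first 5 characters of region 2
    return feature1 + feature2             # assumed to be a concatenation


# Data store 503 modelled as a hash table: unique algorithm name -> implementation.
ALGORITHM_TABLE: Dict[str, Callable[[ReadRegion], str]] = {
    "text-two-regions": text_page_signature,
}

# Usage with a fake OCR result standing in for a scanned page.
fake_ocr = {REGION_1: "Chapter One: Beginnings", REGION_2: "Page 12 of 300"}
signature = ALGORITHM_TABLE["text-two-regions"](fake_ocr.__getitem__)
```

Passing the OCR step in as a function keeps the sketch independent of any particular OCR library.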

In this example, the system includes a signature generation module 505, which takes as inputs the scanned image and the algorithm implementation used to generate page signatures and provides an output in the form of the page signature of the scanned image.

A comparison module 506 is then used to compare the page signature of the scanned page with the reference signatures for the document containing the scanned page. The output of this module is a boolean value indicating whether or not a scanned page can be identified.

The type of comparison performed depends on the format of the page signature. For example, if the page signature is a string of characters that is expected to be present at a certain location, a simple string comparison can be performed.
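For string signatures, the comparison module could be as small as the following sketch. The function name `can_identify` is hypothetical, and exact matching is an assumption; a practical system might need some tolerance for OCR errors, which is not shown here.

```python
from typing import Iterable


def can_identify(page_signature: str, reference_signatures: Iterable[str]) -> bool:
    """Return True if the page signature equals any reference signature."""
    # any() stops at the first matching reference signature.
    return any(page_signature == reference for reference in reference_signatures)
```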

It will be appreciated that the signature generation module 505 and the comparison module 506 are typically implemented as software modules within a document handling device, such as one or a combination of the computers 320, the MFDs 300 and the server 330.

An example of the process for identifying a scanned page will be detailed with reference to Figures 7 and 8.

Figure 7 shows a system that uses the above described processes to identify pages. In this example, the system includes a number of MFDs 702 (similar to the MFDs 300 described above), each of which is connected to the data store 504, which contains the reference signatures.

Before a page is scanned by a user 701, the document identifier for the document containing the page is obtained in step 801. The document identifier 501 can be obtained using various input methods. For example, if the page belongs to a published document, the ISBN barcode can be scanned using the bar code reader 255, to obtain the identifier. If the identity of the document cannot be found physically, the user can be prompted to provide user input commands to thereby manually enter in the document identifier, for example using an input such as the touch screen 240.

P:\WPDOCSS\spcci\ 2712UI do4)311/06 0 Once the document identifier is obtained at the MFD 702, at step 802 the document identifier z is used to obtain reference signatures from the data store 504. The name of the algorithm used O to generate the reference signatures can also be obtained based on the document identifier.

The name of the algorithm is then used to retrieve the algorithm implementation to be used 00oO 5 from data store 503, which in this example is implemented as a simple hash table.

00oO c In this example, the algorithm used is an image key generation algorithm, such as the IND algorithm outlined in US20060050985. The algprithm is capable of generating a unique key for images that is substantially invariant to rotation, scale and translation. The algorithm works by first forming a spatial document representation of the image that is substantially 0 invariant to translation, then forming a log-polar resampled image from the spatial domain representation. Finally, the algorithm forms a representation of the log-polar resampled image that is substantially invariant to translation.

The scanned image 502 is obtained at step 803 when the user 701 scans the page at the MFD 702. The algorithm obtained in step 802 is then applied to the scanned image 502 by the signature generation module 505 at step 804. The generated page signature has the same format as the reference signature since the same algorithm is used to generate it. The generated page signature is then passed on to the comparison module 506 to be compared against the reference signatures obtained at step 802. The comparison step 805 involves comparing the generated page signature with each reference page signature until a match is found. If a match is found, the page is then identified.
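
Steps 804 and 805 can be sketched as a simple loop. This is an illustrative sketch under stated assumptions: `identify_page` and `toy_signature` are hypothetical names, the reference data is made up, and signatures are compared by exact equality, whereas a practical comparison might allow a tolerance.

```python
from typing import Callable, Dict, Optional

def identify_page(
    scanned_image: bytes,
    algorithm: Callable[[bytes], str],
    references: Dict[int, str],
) -> Optional[int]:
    """Generate the page signature (step 804), then compare it with each
    reference signature until a match is found (step 805)."""
    page_signature = algorithm(scanned_image)
    for page_number, reference_signature in references.items():
        if page_signature == reference_signature:
            return page_number  # match found: the page is identified
    return None  # no reference signature matched

# Hypothetical placeholder algorithm and reference data, for illustration only.
def toy_signature(image: bytes) -> str:
    return format(sum(image) % 65536, "04x")

references = {1: "0006", 2: "000f"}
matched = identify_page(bytes([4, 5, 6]), toy_signature, references)
unmatched = identify_page(bytes([9, 9, 9]), toy_signature, references)
```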

At this stage, the MFD 702 can be adapted to display a representation indicative of whether or not the page can be identified. This can be used by the user to assess whether to proceed further with a document handling job, or the like.

Furthermore, if no identification can be made, this can be used to further control any document handling job, such as by preventing copies of the scanned page being made. In this instance, any image of the scanned page will be cleared from the memory of the system, and the user informed that copies cannot be made. A record of the event can also be sent to the administrator.

Thus, failure to identify a page can be used to at least partially prevent further operations, such as scanning further pages, or copying scanned pages, if the page cannot be identified.
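
The handling of an unidentified page described above might look like the following sketch. The `ScanJob` class, its fields, the message text and the list used as an administrator log are all hypothetical and chosen only to make the control flow concrete.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ScanJob:
    """Hypothetical in-memory state of one document handling job."""
    user: str
    image: Optional[bytes]
    copying_allowed: bool = True
    messages: List[str] = field(default_factory=list)

def handle_unidentified_page(job: ScanJob, admin_log: List[str]) -> None:
    """Block copying, clear the scanned image, inform the user, log the event."""
    job.copying_allowed = False
    job.image = None  # clear the scanned page from memory
    job.messages.append("Page could not be identified; copies cannot be made.")
    admin_log.append(f"unidentified page scanned by {job.user}")

admin_log: List[str] = []
job = ScanJob(user="user701", image=b"\x01\x02")
handle_unidentified_page(job, admin_log)
```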

When the signature comparison has been completed, the user then has a choice at step 806 to scan another page from the same document. If more pages are to be scanned, the user can go directly to step 803 in Figure 8 without needing to repeat the process of obtaining the document identifier (step 801).

Once the page has been identified, the user can also use the identification in performing further document handling operations, such as copying the scanned page. In particular, identification of the page can be used in recording the number of copies made of respective pages. This can in turn allow a fee associated with the copying of each page to be determined, further allowing an indication of copyright fees payable to be stored or otherwise recorded.
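
Recording copies per identified page and deriving a fee could be sketched as below. The function names, the identifier, and the flat fee of 10 cents per copy for a whole document are assumptions for illustration; the specification does not state how fees are set.

```python
from collections import Counter
from typing import Dict

# (document identifier, page number) -> number of copies made
copy_counts: Counter = Counter()

def record_copies(doc_id: str, page_number: int, copies: int) -> None:
    """Add to the running copy count of one identified page."""
    copy_counts[(doc_id, page_number)] += copies

def fees_payable(fee_per_copy_cents: Dict[str, int]) -> Dict[str, int]:
    """Total fee per document, in cents, from the recorded copy counts."""
    totals: Dict[str, int] = {}
    for (doc_id, _page), copies in copy_counts.items():
        totals[doc_id] = totals.get(doc_id, 0) + copies * fee_per_copy_cents[doc_id]
    return totals

record_copies("isbn-0000000000", 2, 3)
record_copies("isbn-0000000000", 5, 1)
record_copies("isbn-0000000000", 2, 2)
totals = fees_payable({"isbn-0000000000": 10})
```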

Second Specific Example

The first specific example is based on the premise that the same algorithm is used to generate the signatures of all the pages in the same document. However, when the pages in the same document have different types of content, a better result can be obtained by using different algorithms to generate the signatures of the pages. This way, the algorithm used can be customised for each page.

The basic arrangement of the system in this example is substantially the same as the example shown in Figure 5. The main difference is that in addition to storing the algorithms used to generate page signatures, the data store 503 also stores the criteria used to select the algorithm to be used for each document. Since different types of algorithm can be used to generate page signatures for different pages in the same document, selection criteria are required to determine the algorithm to be used for a particular page. The selection criteria used depend on the types of content in the pages in the document.

For example, consider a document that contains two types of pages: those that contain only text and those that contain both text and images. In the first case, the algorithm used to generate the page signature only involves the use of OCR to detect the presence of characters at a certain location. In the case where the page contains both text and images, the algorithm used involves the use of the image key generation technique outlined previously. An example of the selection criteria used is:

    START
        Algorithms = [ocr_algorithm, key_algorithm];
        Run block recognition algorithm on scanned page;
        IF page contains image THEN
            Algorithm = key_algorithm;
        ELSE
            Algorithm = ocr_algorithm;
    END

The criteria used for the document use a block recognition algorithm to detect the presence of images on the page. If no image is detected, the algorithm that only uses OCR is used on the scanned page. Otherwise, the algorithm that uses key generation techniques is used.
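
The selection criteria above translate directly into a small function. In this sketch the block recognition result is assumed to be a list of block types, and both signature functions are placeholders that return fixed strings; none of these details come from the specification.

```python
from typing import Callable, List

# Hypothetical block types that a block recognition pass might report.
TEXT, IMAGE = "text", "image"

def ocr_algorithm(blocks: List[str]) -> str:
    return "ocr-signature"  # placeholder for an OCR-based signature

def key_algorithm(blocks: List[str]) -> str:
    return "key-signature"  # placeholder for an image key signature

def select_algorithm(blocks: List[str]) -> Callable[[List[str]], str]:
    """Apply the criterion: image present -> key_algorithm, else ocr_algorithm."""
    if IMAGE in blocks:
        return key_algorithm
    return ocr_algorithm

text_only_choice = select_algorithm([TEXT, TEXT])
mixed_choice = select_algorithm([TEXT, IMAGE])
```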

Figure 9 describes the process of identifying a scanned page when different algorithms are used in generating page signatures for the different pages in the document. In this example, steps similar to those used in the first specific example use similar reference numerals and will therefore not be described in detail.

At step 801 the document identifier 501 is obtained, before being used at step 901 to obtain at least one algorithm selection criterion for the document, in addition to the page signatures. In this case, multiple algorithms are obtained for the document.

After the page has been scanned by the user in step 803, an initial pass is performed on the scanned page as shown at step 902. The purpose of the initial pass is to obtain high-level information about the scanned page that can be used to select an algorithm using the selection criteria obtained in step 901. The type of information extracted from the scanned page during the initial pass depends on the type of information required for the selection criteria. For example, if the criteria select the algorithm based on the presence of a certain colour on the page, the initial pass could be done to determine if the page contains colours other than black and white.
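
A colour-detecting initial pass of the kind mentioned above might be sketched as follows. The pixel representation, the function name `contains_colour` and the tolerance value are assumptions; in particular, shades of grey are treated here as black and white, which is one possible reading of the criterion.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255

def contains_colour(pixels: Iterable[Pixel], tolerance: int = 16) -> bool:
    """Return True if any pixel is noticeably non-grey.

    A pixel counts as coloured when its channels differ by more than
    `tolerance`; the threshold is an arbitrary illustrative value.
    """
    return any(max(p) - min(p) > tolerance for p in pixels)

mono_page = [(0, 0, 0), (255, 255, 255), (128, 130, 127)]
colour_page = [(0, 0, 0), (255, 255, 255), (200, 30, 30)]
mono_result = contains_colour(mono_page)
colour_result = contains_colour(colour_page)
```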

Once the algorithm to be used has been selected at step 903, it is then applied to the scanned page as shown at step 804. The generated page signature is compared against the reference signatures and the page is identified if a match is found. Again, when the signature comparison has been completed, the user has a choice 806 to scan another page from the same document. If there are more pages to be scanned, the user can go directly to step 803 in Figure 9 without repeating the process of obtaining a document identifier.

Other Examples

When more than one page needs to be identified, the system can be extended to predict the next page that is going to be scanned. When comparing the scanned page with reference signatures, the scanned page is compared with the predicted page first to reduce processing time. The prediction can be based simply on page numbers, but may involve more sophisticated predictive algorithms.
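
The simplest form of prediction, based on page numbers, can be sketched by reordering the comparison so that the page following the last identified page is tried first. The function names and sample signatures are hypothetical.

```python
from typing import Dict, List, Optional

def comparison_order(references: Dict[int, str], last_page: Optional[int]) -> List[int]:
    """Order pages so that the predicted next page is compared first."""
    pages = sorted(references)
    predicted = None if last_page is None else last_page + 1
    if predicted in references:
        pages.remove(predicted)
        pages.insert(0, predicted)
    return pages

def identify_with_prediction(
    signature: str, references: Dict[int, str], last_page: Optional[int]
) -> Optional[int]:
    """Compare against reference signatures in predicted-first order."""
    for page in comparison_order(references, last_page):
        if references[page] == signature:
            return page
    return None

references = {1: "aa", 2: "bb", 3: "cc", 4: "dd"}
order = comparison_order(references, last_page=2)
found = identify_with_prediction("cc", references, last_page=2)
```

When the user scans pages in sequence, the first comparison succeeds and the remaining reference signatures are never examined.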

In another example, a page in a document can have multiple reference signatures. This is useful when the page is scanned in different environments. For example, if the scanning device is able to scan in colour, a different algorithm can produce a more accurate result when compared to the algorithm used for a black and white scanning device. The method of determining which algorithm to use is incorporated into the selection criteria as detailed in the second specific example. In the case where the selection criteria are complex, multiple passes of the scanned page may be required to determine the algorithm to be used.
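
One way to hold several reference signatures per page is to key them by the algorithm that produced them, as in the sketch below. The algorithm names `colour` and `mono`, the signature strings and the device flag are illustrative assumptions only.

```python
from typing import Dict, Optional

# page number -> {algorithm name -> reference signature}; values are illustrative.
references: Dict[int, Dict[str, str]] = {
    1: {"colour": "c-01", "mono": "m-01"},
    2: {"colour": "c-02", "mono": "m-02"},
}

def algorithm_for_device(scans_in_colour: bool) -> str:
    """Selection criterion based on the scanning environment."""
    return "colour" if scans_in_colour else "mono"

def identify(signature: str, algorithm_name: str) -> Optional[int]:
    """Compare only against reference signatures made with the same algorithm."""
    for page, by_algorithm in references.items():
        if by_algorithm.get(algorithm_name) == signature:
            return page
    return None

mono_match = identify("m-02", algorithm_for_device(scans_in_colour=False))
cross_match = identify("m-02", algorithm_for_device(scans_in_colour=True))
```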

Thus, the above described process may optionally implement further features, including: providing a plug-in component for document publishing software capable of generating page signatures from digital data; prohibiting the user from continuing if the scanned page cannot be identified; predicting the next page to be scanned once a page is identified; and allowing a page to have multiple reference signatures generated using different algorithms.

Accordingly, the above described method and apparatus can be used to identify a page by using a document identifier and reference signatures, which are derived by applying an algorithm to a page in order to distinguish the page based on its visual characteristics.

In one example, a document library containing mappings of a document identifier to reference signatures for the pages in the document is established. Before a page is scanned, the document containing the scanned page is identified. The document identifier is then used to obtain reference signatures corresponding to the pages in the identified document and the algorithm used to obtain the reference signatures. The same algorithm can then be used to derive the page signature of the scanned page. The derived page signature is compared to the reference signatures to identify the scanned page. If a match is found, the scanned page is identified.

The use of a document identifier to narrow the search space when identifying pages from a document library allows the process to scale to large databases. For the same document, the search space is independent of the size of the document library. In addition, using a document identifier can potentially provide a more accurate result in cases where pages from different documents are very similar, as it eliminates the possibility of identifying a page from the wrong document.

In another example, different algorithms can be used to derive page signatures for different pages in the same document. In this case, the document identifier is used to obtain the different types of algorithm used to create reference signatures for the document and also the criteria for selecting the algorithm to be used. An initial pass is performed over the scanned page and the algorithm to be used is selected based on the result of the initial pass and the selection criteria. The selected algorithm is then used to derive the page signature of the scanned page, and the derived page signature is compared to the reference signatures.

The ability to use different types of algorithm for different pages in the document allows for a more accurate process when identifying a page in a document containing pages with different types of content. In addition, this flexibility allows less memory to be used when storing the reference signatures for easily distinguishable pages.

In the above described examples, it will be appreciated that the MFD operation is typically controlled using a JAVA module executed by an appropriate processor, such as the multi-function controller, although any suitable control mechanism may be used.

In the above described examples, specific reference is made to applications software. However, it will be appreciated that this encompasses multiple software applications, elements, or other modules, such as drivers.

It will be appreciated from this that whilst the above examples have been described with respect to MFDs, the techniques may be applied to any devices that are capable of performing document handling jobs, such as printers, copiers, scanners, facsimile machines, or the like.

The term document handling device is also understood to encompass any one or combination of the processing systems provided in the network environment, including but not limited to one or more of the computers 320, the servers 330, and/or the MFDs 300.

The foregoing describes only some examples of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the examples being illustrative and not restrictive.

In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises", have correspondingly varied meanings.