AU2007249103B2

AU2007249103B2 - Document analysis method

Info

Publication number: AU2007249103B2
Application number: AU2007249103A
Authority: AU
Inventors: Yu-Ling Chen; Eric Wai-Shing Chong; Steven Richard Irrgang
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2007-12-05
Filing date: 2007-12-18
Publication date: 2011-05-12
Anticipated expiration: 2027-12-05
Also published as: AU2007249099A1; AU2007237365A1; AU2007249103A1; AU2007249098A1; AU2007237365B2; AU2007249099B2; AU2007249098B2

Description

S&F Ref: 831922

AUSTRALIA
PATENTS ACT 1990
COMPLETE SPECIFICATION FOR A STANDARD PATENT

Name and Address of Applicant: Canon Kabushiki Kaisha, of 30-2, Shimomaruko 3-chome, Ohta-ku, Tokyo, 146, Japan
Actual Inventor(s): Steven Richard Irrgang, Yu-Ling Chen, Eric Wai-Shing Chong
Address for Service: Spruson & Ferguson, St Martins Tower, Level 35, 31 Market Street, Sydney NSW 2000 (CCN 3710000177)
Invention Title: Document analysis method

The following statement is a full description of this invention, including the best method of performing it known to me/us:

DOCUMENT ANALYSIS METHOD

TECHNICAL FIELD

The present invention relates to a method of document analysis and, in particular, to improving the speed of colour document segmentation and pixel level classification.

BACKGROUND ART

The proliferation of scanning technology combined with ever increasing computational processing power has led to many advances in the area of document analysis. Document analysis systems may be used to extract semantic information from a scanned document, often by means of optical character recognition (OCR) technology. This technology is used in a growing number of applications such as automated form reading. Document analysis systems can also be used to improve compression of a document by selectively using an appropriate compression method depending on the content of each part of the page. Improved document compression lends itself to applications such as electronic document archiving and electronic document distribution.

Document analysis can typically be broken into three stages. The first of these stages involves pixel colour analysis. The second stage is document layout analysis, which identifies content types such as text, backgrounds and images in different regions of the page. The final stage uses the results of the first two stages and some further analysis to create a final output. The desired final output depends on the application.
Many different methods for document layout analysis exist. Some methods partition the page into fixed sized blocks to give a coarse classification of the page. Methods such as these, however, can only give a single classification to a region, applying to the pixels of all colours within that region. For example, a region may be classified as containing text, but the pixels which are part of the text are not distinguished from the pixels in the background by that classification. In most such systems analysis is done in terms of a binary image, so it is clear that the text is one colour and the background another. In such cases, classifications of 'text' and 'inverted text' are sufficient to distinguish which is which. However, in a complicated multi-colour document, a single region may contain text of multiple colours, perhaps over backgrounds of multiple colours, including even natural images. In such cases, a binary image cannot be generated to sufficiently represent the text in the document without first analysing the document to determine where the text resides in different areas, which is itself the problem the system is trying to solve. In such a case, a coarse region-based classification is not sufficient to represent the document content.

Other methods of document layout analysis use the background structure. Again, however, this is generally done on black and white images, and does not extend easily to complicated colour documents.

There is therefore a need for methods which provide a pixel level classification in a complicated colour document. Some methods do exist for this; however, in providing an analysis at a pixel level, they generally lack context from the rest of the page, which may be helpful to the classification. Many such methods also involve a large number of operations to be applied for each pixel.
For an application of document analysis embedded in a scanner, such methods may be too slow when running with the limited computational resources available inside most document scanners.

It is therefore desirable to provide a method of document analysis which affords efficiency in an environment with low resources, offers a pixel level of detail in its classification, makes use of context over a large area for these classifications, and which will perform well on colour documents with complicated layouts.

SUMMARY

Presently disclosed is a method of document layout analysis that affords a pixel level of detail in classification of objects in the document, while still using large scale context to provide an accurate classification. This is achieved by creating a multi-layered representation of the page, referred to as 'macroregions', and then classifying each of these macroregions.

In accordance with one aspect of the present disclosure there is provided a method of classifying regions of a scanned document image, said method comprising the steps of:
(a) partitioning the scanned image into a plurality of tiles;
(b) determining at least one dominant colour for each of the plurality of tiles;
(c) generating superpositioned regions based on dominant colours, each said region representing a group of tiles wherein at least one tile is grouped into two superpositioned regions and each dominant colour is represented by at most one of the regions;
(d) calculating statistics for each said region using pixel level statistics from each of the tiles included in said region; and
(e) determining a classification for each region based on the calculated statistics.

Other aspects are also disclosed.

BRIEF DESCRIPTION OF THE DRAWINGS

At least one embodiment of the present invention will now be described with reference to the drawings in which:
Fig. 1 shows the processing steps of a page layout analysis method;
Fig.
2 is a flowchart describing the noise filtering stage of the system;
Fig. 3 is a flowchart showing the classification optimisation stage of the processing of Fig. 1;
Fig. 4 is a high-level operational flowchart of a document analysis system;
Fig. 5 is a flowchart describing a method of decomposing a coloured image into a multi-layered representation;
Fig. 6 is a flowchart describing the process of macroregion generation of Fig. 5;
Fig. 7 is an illustration of macroregions in a typical document;
Fig. 8 is an illustration of macroregion generation at tile level; and
Fig. 9 is a schematic block diagram representation of a general purpose computer system upon which the arrangements presently described may be performed.

DETAILED DESCRIPTION INCLUDING BEST MODE

The document analysis methods presently disclosed may be implemented using a computer system 900, such as that shown in Fig. 9, wherein the processes of Figs. 1 to 8 may be implemented as software, such as one or more application programs executable within the computer system 900. In particular, the steps of the document analysis methods are effected by instructions in the software that are carried out within the computer system 900. The instructions may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules perform the document analysis methods and a second part and the corresponding code modules manage a user interface between the first part and the user. The user interface may not necessarily be required and the process may be entirely automated, for example to be performed during scanning operations. The software may be stored in a computer readable medium, including the storage devices described below, for example.
The software is loaded into the computer system 900 from the computer readable medium, and then executed by the computer system 900. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer system 900 preferably effects an advantageous apparatus for document analysis.

As seen in Fig. 9, the computer system 900 is formed by a computer module 901, input devices such as a keyboard 902, a mouse pointer device 903 and scanner 918, and output devices including a printer 915, a display device 914 and loudspeakers 917. An external Modulator-Demodulator (Modem) transceiver device 916 may be used by the computer module 901 for communicating to and from a communications network 920 via a connection 921. The network 920 may be a wide-area network (WAN), such as the Internet or a private WAN. Where the connection 921 is a telephone line, the modem 916 may be a traditional "dial-up" modem. Alternatively, where the connection 921 is a high capacity (eg: cable) connection, the modem 916 may be a broadband modem. A wireless modem may also be used for wireless connection to the network 920.

The computer module 901 typically includes at least one processor unit 905, and a memory unit 906, for example formed from semiconductor random access memory (RAM) and read only memory (ROM). The module 901 also includes a number of input/output (I/O) interfaces including an audio-video interface 907 that couples to the video display 914 and loudspeakers 917, an I/O interface 913 for the keyboard 902 and mouse 903 and optionally a joystick (not illustrated), and an interface 908 for the external modem 916, scanner 918 and printer 915. In some implementations, the modem 916 may be incorporated within the computer module 901, for example within the interface 908.
The computer module 901 also has a local network interface 911 which, via a connection 923, permits coupling of the computer system 900 to a local computer network 922, known as a Local Area Network (LAN). As also illustrated, the local network 922 may also couple to the wide network 920 via a connection 924, which would typically include a so-called "firewall" device or similar functionality. The interface 911 may be formed by an Ethernet™ circuit card, a wireless Bluetooth™ or an IEEE 802.11 wireless arrangement.

The interfaces 908 and 913 may afford both serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 909 are provided and typically include a hard disk drive (HDD) 910. Other devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 912 is typically provided to act as a non-volatile source of data. Portable memory devices, such as optical disks (eg: CD-ROM, DVD), USB-RAM, and floppy disks for example, may then be used as appropriate sources of data to the system 900.

The components 905 to 913 of the computer module 901 typically communicate via an interconnected bus 904 and in a manner which results in a conventional mode of operation of the computer system 900 known to those in the relevant art. Examples of computers on which the described arrangements can be practised include IBM-PC's and compatibles, Sun Sparcstations, Apple Mac™ or alike computer systems evolved therefrom.

The scanner 918 may be used to scan pages of documents to provide a scanned image to the computer module 901, for storage in the HDD 910, for example. That scanned image may then be subject to image analysis and other processing to perform the colour page decomposition tasks.
Scanned images may also be sourced from the networks 920 and 922, for example.

Typically, the application programs discussed above are resident on the hard disk drive 910 and read and controlled in execution by the processor 905. Intermediate storage of such programs and any data fetched from the networks 920 and 922 may be accomplished using the semiconductor memory 906, possibly in concert with the hard disk drive 910. In some instances, the application programs may be supplied to the user encoded on one or more CD-ROM and read via the corresponding drive 912, or alternatively may be read by the user from the networks 920 or 922. Still further, the software can also be loaded into the computer system 900 from other computer readable media. Computer readable media refers to any storage medium that participates in providing instructions and/or data to the computer system 900 for execution and/or processing. Examples of such media include floppy disks, magnetic tape, CD-ROM, a hard disk drive, a ROM or integrated circuit, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 901. Examples of computer readable transmission media that may also participate in the provision of instructions and/or data include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.

The second part of the application programs and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 914.
Through manipulation of the keyboard 902 and the mouse 903, a user of the computer system 900 and the application may manipulate the interface to provide controlling commands and/or input to the applications associated with the GUI(s).

One or more of the methods of document layout analysis may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub functions of document layout analysis. Such dedicated hardware may also include one or more microprocessors and associated memories.

The arrangements described further may be configured to operate within the scanner 918 automatically upon the scanning of a document. In this fashion the scanned data is automatically analysed and the data arising from the analysis, such as classification etc., may be made available with the scanned data to the computer 901.

In this description, a colour distance is used in a number of places. This is understood to include any function that takes two colours and gives a number describing how close, in terms of colour proximity, those colours are to each other. A small distance indicates the colours are close to being identical (eg. pastel green and turquoise blue), whereas a large distance indicates dissimilar colours (eg. yellow and blue). In a preferred implementation, a "city-block" distance [|Y1 - Y2| + |Cb1 - Cb2| + |Cr1 - Cr2|] in the YCbCr colour space is used. This is then adjusted based on the hue distance between the colours.

Fig. 4 shows an overview of a document analysis system 400 according to the present disclosure. The system 400 receives as an input a single layered image 405, for example derived from the scanning of a hard copy document. The document will typically be a "compound" document having text portions, and other image type portions including background fills or bitmap images. Processing of the input image 405 commences with a pixel colour analysis process 410.
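As an illustration only, the city-block colour distance described above might be sketched as follows. The RGB to YCbCr conversion coefficients (full-range BT.601, as used in JFIF) and the function names are assumptions, since the description does not specify them, and the subsequent hue-based adjustment is omitted because its form is not given.

```python
def rgb_to_ycbcr(rgb):
    """Convert an 8-bit (R, G, B) triple to (Y, Cb, Cr).

    Assumption: full-range BT.601 (JFIF) coefficients; the description
    does not say which YCbCr variant is used.
    """
    r, g, b = rgb
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return (y, cb, cr)


def colour_distance(rgb1, rgb2):
    """City-block distance |Y1-Y2| + |Cb1-Cb2| + |Cr1-Cr2| in YCbCr.

    The hue-based adjustment mentioned in the description is not applied.
    """
    ycc1, ycc2 = rgb_to_ycbcr(rgb1), rgb_to_ycbcr(rgb2)
    return sum(abs(a - b) for a, b in zip(ycc1, ycc2))


if __name__ == "__main__":
    # Dissimilar colours (yellow, blue) should be far apart;
    # two nearly identical greens should be close.
    print(colour_distance((255, 255, 0), (0, 0, 255)))
    print(colour_distance((100, 200, 100), (105, 200, 100)))
```

A city-block distance needs only subtractions and additions per colour pair, which suits the low-resource scanner environment the description targets.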
The input image is preferably an RGB image at a resolution of 300dpi. In the presently described implementation, this includes breaking or partitioning the image 405 into non-overlapping, uniform sized tiles, preferably of size 32 x 32 pixels, and finding a small number of dominant colours and tile statistics to represent each tile. A tile of size 32 x 32 has 1024 pixels, thus giving it a maximum possible of 1024 distinct colours. However, at 300dpi, a tile is a very small area, and can typically be represented by only a few colours. These colours are representative of the colours of the pixels in the tile and are referred to as the dominant colours of the tile. Useful tile statistics and information such as pixel count, edge ratios, bitmap, and shared boundary pixel counts for each dominant colour may be extracted in step 410. Other statistics such as colour variance may also be calculated. Thus, a reference to a dominant colour refers to the representative colour of a set of pixels, and its associated statistics. The pixel colour analysis step 410 may be implemented in a number of different ways. It may involve a number of image processing operations. A typical implementation may include the following processes: colour conversion, noise filtering, image enhancement, colour quantisation, dominant colour detection, and neighbourhood analysis, as known in the art. The colour analysed tiles are stored for page layout analysis in step 420.

Page layout analysis 420 is then performed to classify each part of the page into one of a number of classes such as text, image, background and line-art. The analysis of step 420 is described in more detail below with reference to Fig. 1. The consequence of the analysis 420 is the creation of an output document 430 which concludes the document analysis system 400.
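The tiling and dominant colour detection of step 410 can be pictured with a minimal sketch. The description leaves the quantisation method to techniques known in the art, so the uniform per-channel quantisation used here, the limit of four colours per tile, the handling of partial tiles at the page edge, and all function and parameter names are assumptions made purely for illustration.

```python
from collections import Counter

TILE_SIZE = 32  # preferred tile size given in the description


def iter_tiles(image, tile_size=TILE_SIZE):
    """Yield (tile_row, tile_col, pixels) in raster order.

    `image` is a list of rows, each row a list of (R, G, B) tuples.
    Tiles on the right and bottom edges may be partial (an assumption).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            pixels = [
                image[y][x]
                for y in range(top, min(top + tile_size, height))
                for x in range(left, min(left + tile_size, width))
            ]
            yield top // tile_size, left // tile_size, pixels


def dominant_colours(pixels, max_colours=4, step=32):
    """Return up to `max_colours` (colour, pixel_count) pairs, most frequent first.

    Hypothetical stand-in for the quantisation stage: each channel is
    rounded down to a multiple of `step` before counting.
    """
    counts = Counter(tuple((c // step) * step for c in p) for p in pixels)
    return counts.most_common(max_colours)


if __name__ == "__main__":
    # A 64 x 32 page: left tile red, right tile blue.
    page = [[(255, 0, 0)] * 32 + [(0, 0, 255)] * 32 for _ in range(32)]
    for row, col, pixels in iter_tiles(page):
        print(row, col, dominant_colours(pixels))
```

Only the pixel count statistic is gathered here; edge ratios, bitmaps and shared boundary pixel counts would be collected in the same pass in a fuller implementation.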
The type of document output 430 depends on the particular application for which analysis is being performed, and may include a compressed format, an editable document or a number of other document formats. Some of these outputs may also require additional processing.

Fig. 1 shows the basic breakdown of the page layout analysis stage 420 of the system 400. Each of the steps 110-150 of the stage 420 will be described in more detail below. In the first step 150, a multi-layered presentation of the page is formed. The result of this stage is a set of macroregions. A macroregion in the context of the present description is a document structural object that encompasses a group of dominant colours with similar characteristics in close proximity. The grouping represents a region of semantically related coloured segments. This is performed in a fast, one pass process that may be done in parallel with, or using an output from, the pixel colour analysis 410 performed earlier in the system 400. In step 110, which then follows, noise filtering is performed to reduce the number of macroregions to process in the later steps 120, 130 and 140. Step 120 then ascribes an initial classification to the macroregions based on colour and shape statistics associated with each macroregion. The statistics are used to interpret the context of content of each macroregion. In step 130, these classifications are improved and optimised using relationships between the macroregions. In step 140, a hierarchy is created for the macroregions representing which of the macroregions are contained within the context of other ones of the macroregions. Each of these steps will be described in more detail below.

Forming a multi-layered presentation

Fig. 5 is a flowchart of a method 150 for decomposing a coloured image into a multi-layered representation. The process in Fig.
5 employs a loop structure beginning in step 520 where the colour analysed tiles are processed preferably in raster order - that is from left to right, and top to bottom. The first tile to be processed is the top-left tile of the input image, and the last tile to be processed is the bottom-right tile of the input image. This form of tiling is used for efficiency purposes. Alternatively, overlapping and non fixed size tiles may be used. The tiles may alternatively be referred to as blocks.

Steps 520 to 550 form a loop which processes each of the colour analysed tiles. Step 520 receives a tile of colour analysed data. Pixels with the same dominant colour may or may not be connected. This can be seen in Fig. 8(a) where two tiles 801 and 802 are shown side-by-side and which have dominant colours 803, 804 and 805. It is seen that the colour 805 represents one "connected" region, whereas each of the colours 803 and 804 has two connected regions 806, 807 and 808, 809 respectively. It has been determined through numerical experiments that a majority of the tiles in a document image with foreground information can be reliably represented by four colours or less. Tiles may also be represented by more than four colours.

Step 540 operates to generate macroregions. Figs. 7(a) and 7(b) provide illustrations of macroregions on a typical document. In Fig. 7(a), a scanned document forming a single layered image 700 is shown, composed of four types of objects, being an overall background 710, a local background 720, text lines 730, and an image/graphical object 740. Fig. 7(b) shows how these four types of objects may be grouped into multiple semantically coherent regions or macroregions that collectively form a multi-layered representation 799. It can be seen that each object type forms at least one macroregion.
The background 710 forms a macroregion 770, the image 740 forms a macroregion 780, and the local background 720 forms a macroregion 790. The text lines 730 produce two macroregions 750 and 760 due to the significant gap between the two paragraphs 732 and 734. Although not accurately depicted in Fig. 7(b), the layer 770 has cut-outs sized and shaped to accommodate the overlying macroregions 780 and 790, and further, the macroregion 790 has cut-outs corresponding to the outlines of the particular text characters present in the layers 750 and 760. As a consequence, when the various macroregions 750-790 are superimposed or superpositioned as their layers they collectively represent the image 700. Note that the layering described here is not the same as layered objects in a graphical object rendering system where each object may have its own "z-level". In this description, the layers are for representative purposes to illustrate how the various macroregions superimpose.

Fig. 8(b) is an illustration of macroregions at tile level. In Fig. 8(a), the left tile 801 contains two dominant colours 803 and 805, and the right tile has three dominant colours 803, 804 and 805. Each dominant colour may be associated with a number of segments. For the purpose of macroregion generation, segments of the same dominant colour are treated as a single entity, whose combined statistics are used for determining merging decisions. Thus any reference to a dominant colour of a tile refers to all the tile segments with the same dominant colour and its associated statistics. In this example, dominant colour segments with similar tile statistics are filled with the same pattern. These segments merge across the tile border, based on their statistics, to form three macroregions as shown in Fig. 8(b), corresponding to the dominant colours 803, 804 and 805.
It can be seen that the left tile 801 belongs to two macroregions, while the right tile 802 is part of three macroregions.

A macroregion or a memory record of a macroregion may include the following data features: an average colour, bounding box coordinates, a binary mask, the number of tiles, the number of pixels, the number of neighbouring macroregions, pointers to those neighbours, and various statistics derived from edge, colour and contrast information within the macroregion.

Processing in step 540 begins by receiving tile dominant colour and statistics in tile raster order, and by which each dominant colour is either merged to an existing macroregion or converted to a new macroregion. Details of this macroregion generation process 540 will be explained further with reference to Fig. 6 below. Step 550 tests if any more tiles remain to be processed. If so, the method 150 returns to step 520 to get the next tile of colour analysed data. Where there are no more tiles, the method 150 ends and the resulting macroregions form a multi-layered representation 560 of the input image 405.

By decomposing the document image 405 into the multi-level overlapping document object representation 560, it is possible to satisfy the conflicting requirements of remaining stable to local colour fluctuations due to various undesirable noises, and remaining sensitive to genuine changes in the document.

The macroregion generation step 540 is further expanded upon in Fig. 6. A macroregion is formed by merging dominant colours with similar tile statistics in adjacent tiles. The purpose of this step is to find suitable neighbouring macroregions for the current dominant colour to merge with. If there are no suitable neighbouring macroregions, a new macroregion is formed using the current dominant colour.
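The merge-or-create decision described above can be illustrated with a simplified sketch. The `Macroregion` record below keeps only a few of the listed data features, and the suitability test is reduced to a city-block colour comparison against a hypothetical threshold of 40; the description bases merging on tile statistics more generally, so the names, the threshold and the test itself are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Macroregion:
    """Simplified record holding a subset of the data features listed above."""
    colour_sum: list                          # per-channel sum, weighted by pixel count
    pixel_count: int = 0
    tiles: set = field(default_factory=set)   # (row, col) of member tiles

    @property
    def average_colour(self):
        return tuple(s / self.pixel_count for s in self.colour_sum)

    @property
    def bounding_box(self):
        """(min_row, min_col, max_row, max_col) in tile coordinates."""
        rows = [r for r, _ in self.tiles]
        cols = [c for _, c in self.tiles]
        return (min(rows), min(cols), max(rows), max(cols))

    def add(self, tile_pos, colour, pixel_count):
        for i, channel in enumerate(colour):
            self.colour_sum[i] += channel * pixel_count
        self.pixel_count += pixel_count
        self.tiles.add(tile_pos)


def city_block(c1, c2):
    return sum(abs(a - b) for a, b in zip(c1, c2))


def merge_or_create(tile_pos, colour, pixel_count, neighbours, regions, threshold=40):
    """Merge a dominant colour into the closest neighbouring macroregion,
    or start a new macroregion when no neighbour is close enough.

    `neighbours` holds macroregions present in the above and left tiles;
    `threshold` is a hypothetical value chosen for illustration.
    """
    best = min(neighbours, key=lambda m: city_block(m.average_colour, colour), default=None)
    if best is None or city_block(best.average_colour, colour) > threshold:
        best = Macroregion(colour_sum=[0, 0, 0])
        regions.append(best)
    best.add(tile_pos, colour, pixel_count)
    return best
```

Because each dominant colour of a tile is handled separately, one tile can end up in several macroregions, which is what allows the layers to superimpose.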
In tile raster order processing, with the exception of the first row to be considered, at any instant there are at most two adjacent previously considered tiles to the current tile location: one above and one to the left. It is important to note that a tile may belong to more than one macroregion.

The process 540 in Fig. 6 begins with the current tile data and statistics as an input 610. Step 620 commences a loop that operates for each dominant colour in the tile. Step 625 then obtains an adjacent tile. This adjacent tile can either be from above or left. In step 630, the most suitable macroregion for merging from the adjacent tile is chosen as the best match macroregion. This may be performed by conducting a colour distance comparison between a dominant colour of a current tile and each dominant colour of adjacent tiles, such as a left tile and an above tile in raster tile order. The colour having the smallest distance is then used to define the best match macroregion associated with that colour. A test is then performed on the best match macroregion in step 640 to further determine its suitability for merging with other macroregions. This best match macroregion is stored in a list in step 650 if it passes the test as a suitable candidate for merging, otherwise it is ignored. Processing continues at decision step 655, which checks whether all adjacent (eg. left and above, and/or diagonally connected where desired) tiles have been processed. If not, the remaining adjacent tiles are processed by returning to step 625. Once all the adjacent tiles have been processed, the potential merging candidate macroregions are compared in step 660 in order to select or consolidate the merging candidates into a final merging candidate. A check at decision step 665 is performed to determine if the candidate list is empty.
If the list is empty, a new macroregion is formed in step 680 using the current dominant colour and statistics for the current tile. Otherwise the current dominant colour is merged with the final candidate macroregion in step 670. This process is repeated via operation of step 690 for each of the dominant colours within the current tile whereupon the step 540 ends.

Noise Filtering

The macroregions formed in stage 150 of the system 420 may fragment the page contents into a number of smaller macroregions. In the preferred implementation as described, this is partly because macroregions are created using a fast, single (raster tile order) pass approach. As a consequence, not enough contextual information is available to group the macroregions properly. It is also easier to merge macroregions than split them up, so it is preferable to fragment page content into smaller macroregions than to combine different types of content into a single macroregion. This means the macroregion formation is designed to be conservative in merging areas together.

There are also small macroregions which may represent additional over-segmented colours in particular tiles on the page. It is desirable to merge these small macroregions into larger macroregions, using additional contextual information. If these small macroregions are merged correctly, this reduces the later processing by reducing the number of macroregions to process. This also avoids the difficulty of classifying macroregions with statistics which are gathered only over small areas of the page. This contextual merging is performed by noise filtering.

Fig. 2 shows the process of noise filtering 110 used in the arrangement of Fig. 1. Steps 210 and 290 operate so that each macroregion is processed in turn. In contrast to tile-by-tile based processing, the process of Fig. 2 is preferably performed across the whole page on a macroregion basis.
In step 220, the current macroregion is tested for whether it is a "noise" macroregion. In a preferred implementation, macroregions which either cover less than a predetermined number of (eg. 5) tiles of the page image, or which have a tile border pixel count which is greater than a predetermined percentage (eg. 95%) of the total number of pixels in the macroregion are considered to be noise macroregions on the basis of their size. The coverage value simply operates to exclude macroregions that are too small to influence the appearance of the document image. The tile border pixel count is a statistic of the number of pixels in the macroregion which appear on one of the four borders of the tile they are in. In step 230, macroregions are tested for whether to be force-merged. A macroregion is force-merged if a high proportion of pixels in the macroregion are on tile borders, this being another statistic. If a macroregion is to be force-merged, in step 240, the best macroregion to force-merge it to is determined or otherwise found. To do this, first the side (left, top, right or bottom) of the current macroregion with the most tile border pixels is found. The macroregion to which the current macroregion is to merge is required to be a neighbour on that side. Among the neighbours on that side, the macroregion chosen for merging is the one with the closest colour to the current macroregion. These two macroregions are then merged in step 250, whereupon control returns to step 290 to check for more macroregions.

If the current macroregion is not chosen to be force-merged, then step 260 follows where each adjacent macroregion is given a score. The score is based on the colour distance between the macroregions.
However, this distance may be reduced or modified if the two macroregions associated with each score have a large number of tile border pixels on common borders, and also adjusted for how the merge would affect the geometry of the larger macroregion. A measure used in a preferred implementation, representing the geometry of the larger macroregion, is the ratio of the number of tiles in the macroregion to the area enclosed by a bounding box of the macroregion.

The neighbouring macroregion with the smallest modified distance is then found in step 270. The distance associated with this macroregion is compared to a fixed threshold in step 280, and if it is close enough, the current macroregion is merged with that neighbouring macroregion in step 250. In a preferred implementation, the statistics of the larger macroregion may not be updated based on the noise macroregion, as the values for noise macroregions are generally unreliable and may pollute the statistics of the larger macroregion. Otherwise the macroregion is assumed to be an area of page content which genuinely only covers a small number of tiles. One example of this could be a page number in an isolated corner of the page. Processing then continues to the next macroregion, until all macroregions have been processed and step 110 of Fig. 2 concludes.

Classification

In step 120 each macroregion is given an initial classification, based on its own statistics. The method used for this classification in the preferred implementation is a support vector machine, using a set of features derived from the statistics gathered about the macroregion. Support vector machines are a method of machine learning for classification problems and are known in the art. Alternatively, other machine learning techniques such as neural networks, decision trees, or a set of human generated rules may be used for this classification. Features that may be used include:

1.
Statistics such as average, and variance, on the number of pixels in dominant colours included in the macroregion. 10 2. Total number of dominant colours included in the macroregion. 3. Ratio of the number of dominant colours to the bounding box area. 4. Statistics based on the 'edge ratio's of the dominant colours in the macroregion. The 'edge ratio' for a dominant colour is the ratio between the number edges (pairs of adjacent pixels where one pixel is in this 15 macroregion and the other pixel is not), and the total number of pixels. 5. Average colour values for each macroregion. 6. Number of tiles quantised to 3 or more colours. 7. Number of other macroregions that this macroregion shares tiles with. 8. Total number of pixels on the tile borders of the macroregion. 20 9. The contrast level between the dominant colours in this macroregion and other dominant colours from the same tile. 10. The total number of dominant colours in the macroregion. Using these features, the preferred implementation classifies each macroregion into one of three classes, being: 1064936_1 831922_speci_02 -18 1. Text; 2. Flat colour areas; and 3. Image. The support vector machine used in the preferred embodiment is trained using a set 5 of manually annotated truth values. Each macroregion in these test pages is given a classification from among the four classes, and these are used training and testing the support vector machine. Classification Optimisation Fig. 3 illustrates the classification optimisation stage 130 in more detail. This stage 10 aims to improve the classification of step 120 and consolidate the macroregions. In step 310, macroregions classified as image are examined. Image regions on a page tend to include a number of different colours, and so they are generally represented by a number of overlapping macroregions with an image classification. 
Because of this, pairs of macroregions which are both classified as image and share a number of common tiles are merged together to consolidate the image. Once this consolidation is complete, unmerged image-classified macroregions are reclassified to be of an 'unknown' type, since if there are no other image-classified macroregions nearby it is likely that they were originally misclassified as image.

In step 320, blend detection is applied. This is to help deal with areas of the input page which are a continuous blend from one colour to another. These areas can often confuse the later stages of processing if not detected at this stage. Blends are detected in the preferred implementation by searching for pairs of macroregions with the following properties:

(i) The local colours in adjacent tiles between the two macroregions are very similar, despite the overall average colours of the two macroregions being different.
(ii) The macroregions share a lot of adjacent tiles but only rarely appear in the same tile.
(iii) The macroregions tend to have a high and equal number of border pixels on either side of the shared border between adjacent tiles.

Pairs of macroregions with these properties are likely to be from a blended area of the page, and so they are merged together at this stage.

In step 330, another reclassification is done on each of the non-image macroregions, taking into account the way in which these macroregions overlap each other. To do this, a graph of the macroregions is formed with edges between each pair of macroregions which include different colours from a common tile. A global cost function is then formed on the graph based on a given classification of the macroregions as text or flat. The cost function desirably includes the following terms:

(i) A cost is given to each edge between two macroregions with the same classification, weighted by the number of common tiles the two macroregions share.
(ii) A cost is given to each macroregion which has changed its classification from the original classification given in step 120.

The classification of non-image macroregions into text or background which minimises this cost is then found. For this, each macroregion is initialised to its current classification. Macroregions previously classified as image and now unknown are initialised to the flat classification. Then each macroregion in turn is tested to see whether changing its classification would reduce the global cost, and the classification is changed if this is the case. This is repeated in a number of iterations until no more changes are made. Note that this may not find the global minimum of the cost function, but it will find a local minimum close to the original classifications. This may even be a better result than the true global minimum of the cost function.

Step 340 is then implemented, by which table consolidation is performed. This stage attempts to merge fragmented foreground regions, particularly tables. Macroregions classified as Text are considered as foreground regions and other types are considered as background regions. For very thin lines, the colour information from the scanner and the earlier stages of processing may often be inaccurate, causing fragmentation of otherwise related or close colours across different macroregions. For example, for very thin lines, colour fringes may occur due to the different locations of the sensors of different colours. Printer colour registration problems can also cause a similar effect. In this case, objects such as tables may be fragmented into multiple macroregions, which may not merge together if their colours are too far apart.
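The iterative reclassification of step 330 described above can be illustrated with a minimal sketch. The names `total_cost` and `optimise`, the `change_cost` weight, and the dictionary-based graph representation are assumptions made for illustration only; the specification does not give concrete weights or data structures. The sketch handles only the text and flat labels and assumes any 'unknown' macroregions have already been initialised to flat.

```python
from typing import Dict, Tuple

TEXT, FLAT = "text", "flat"


def total_cost(labels: Dict[str, str],
               original: Dict[str, str],
               shared_tiles: Dict[Tuple[str, str], int],
               change_cost: float) -> float:
    """Global cost: term (i) plus term (ii) from the description."""
    # (i) each edge joining two macroregions with the same label costs
    #     the number of tiles the pair has in common
    edge_term = sum(weight for (a, b), weight in shared_tiles.items()
                    if labels[a] == labels[b])
    # (ii) each macroregion whose label differs from its step-120 label
    #      costs a fixed amount (hypothetical weight)
    change_term = change_cost * sum(labels[m] != original[m] for m in labels)
    return edge_term + change_term


def optimise(original: Dict[str, str],
             start: Dict[str, str],
             shared_tiles: Dict[Tuple[str, str], int],
             change_cost: float = 1.0) -> Dict[str, str]:
    """Flip one label at a time while doing so lowers the global cost."""
    labels = dict(start)
    changed = True
    while changed:  # repeat passes until a full pass makes no change
        changed = False
        for m in labels:
            before = total_cost(labels, original, shared_tiles, change_cost)
            previous = labels[m]
            labels[m] = FLAT if previous == TEXT else TEXT
            if total_cost(labels, original, shared_tiles, change_cost) < before:
                changed = True
            else:
                labels[m] = previous  # revert: the flip did not help
    return labels


step_120 = {"a": TEXT, "b": TEXT, "c": FLAT}
overlaps = {("a", "b"): 5, ("b", "c"): 1}
result = optimise(step_120, step_120, overlaps, change_cost=2.0)
```

Because every accepted flip strictly lowers the cost and there are finitely many labellings, the loop terminates, but only at a local minimum near the starting labels, which matches the behaviour noted in the description.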
To account for possible colour differences in different parts of the same foreground object without merging things which should not be merged, other evidence is used to determine whether certain foreground macroregions should be merged together.

In step 340, each pair of adjacent foreground macroregions is given a distance based on a combination of the following pieces of evidence:

(1) The colour distance between the two macroregions.
(2) An overlap factor. This accounts for the number of adjacent tiles between the macroregions and thus an extent of overlap. A multiple of the number of tiles where the two macroregions both appear together is then subtracted from this. The goal is that if the macroregions share a lot of tiles then this factor will be a penalty, making the macroregions less likely to merge, while if they share a lot of adjacent tiles without appearing in the same tiles this factor will increase the likelihood of them merging.
(3) A tile border factor. Each time macroregions appear in adjacent tiles, a score is given based on how many pixels from each macroregion are on the tile border. If a roughly equal number of pixels are on both sides, then this factor will make the macroregions more likely to merge, while if the values are often very different then they will be less likely to merge.
(4) An edge ratio similarity factor. This causes macroregions with similar edge ratios in adjacent tiles to be more likely to merge together, and macroregions with very different edge ratios to be less likely to merge.
(5) An amount of evidence factor. This is used to account for when the other factors are based on very small amounts of evidence. If there are not many adjacent tiles between the two macroregions, and so not much evidence for or against their similarity, then this factor will make them less likely to merge.
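One way the five pieces of evidence might be combined into a single distance is sketched below. The specification does not state a formula, so the linear combination, every weight, the `similarity` helper and the `PairEvidence` fields are hypothetical choices made only to show the direction in which each factor pushes the result.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PairEvidence:
    colour_distance: float                 # evidence (1)
    adjacent_tiles: int                    # tile adjacencies between the pair
    shared_tiles: int                      # tiles in which both appear
    border_pixels: List[Tuple[int, int]]   # per adjacency: border pixels on each side
    edge_ratio_a: float
    edge_ratio_b: float


def similarity(a: float, b: float) -> float:
    """1.0 when the two non-negative values are equal, tending to 0.0 as they diverge."""
    hi = max(a, b)
    return 1.0 if hi == 0 else min(a, b) / hi


def merge_distance(ev: PairEvidence, k: float = 2.0, w_overlap: float = 1.0,
                   w_border: float = 10.0, w_edge: float = 10.0,
                   w_evidence: float = 20.0) -> float:
    """Smaller result means the pair is more likely to be merged (all weights assumed)."""
    # (2) adjacency helps merging; a multiple of the shared tiles is subtracted
    overlap = ev.adjacent_tiles - k * ev.shared_tiles
    # (3) roughly equal border pixel counts on both sides help merging
    border = (sum(similarity(a, b) for a, b in ev.border_pixels)
              / len(ev.border_pixels)) if ev.border_pixels else 0.0
    # (4) similar edge ratios help merging
    edge = similarity(ev.edge_ratio_a, ev.edge_ratio_b)
    # (5) few adjacent tiles means little evidence, which hinders merging
    scarcity = 1.0 / (1 + ev.adjacent_tiles)
    return (ev.colour_distance - w_overlap * overlap - w_border * border
            - w_edge * edge + w_evidence * scarcity)


fragmented = PairEvidence(40.0, 8, 0, [(5, 5)] * 8, 0.3, 0.3)
overlapping = PairEvidence(40.0, 8, 6, [(9, 1)] * 8, 0.3, 0.05)
sparse = PairEvidence(40.0, 1, 0, [(5, 5)], 0.3, 0.3)
```

With these illustrative weights, the `fragmented` pair (adjacent, never in the same tile, matching border counts and edge ratios) receives a smaller distance than either the `overlapping` pair or the `sparse` pair, even though all three have the same colour distance.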
Macroregions which are part of a fragmented text or table region should have the property that they cover distinct but adjacent areas, have similar statistics, and have objects which connect from one macroregion to the next across the adjacent tiles. The factors above are designed to detect this situation and enable such macroregions to merge even in cases where various artifacts may have caused the colours to be significantly different.

In step 350, a final merging is done between any nearby (proximate) macroregions which have similar colours and the same classification. In earlier stages, macroregions may be separated despite having similar colours, if it is considered possible that they represent different types of page content. Colour similarity may be assessed using colour distance determination and comparison against a threshold. Now that all the classification has finished, if they have been classified as representing the same type of page content, then they can be merged. Classification optimisation of step 130 therefore ends.

Form Hierarchy

Returning to Fig. 1, the next stage in the macroregion analysis process 420 is to form a hierarchy from the macroregions in step 140. This hierarchy is represented in terms of at most one parent macroregion associated with each macroregion. If a text macroregion is determined to fit within the bounding box of a particular background, then the macroregion representing that background will become its parent. If a background macroregion of a particular colour is surrounded by a larger background of a different colour, then the larger background will be a parent of the smaller background.

The rules used for determining the hierarchy in the preferred implementation are that the parent of a macroregion is the smallest other non-text macroregion which has a bounding box completely containing it.
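The parent rule stated above can be sketched as follows. The `Macroregion` record, the inclusive tile-coordinate bounding box and the function names are assumptions for illustration. The sketch also requires a parent's box to be strictly larger than the child's, an added assumption that prevents two macroregions with identical boxes from becoming each other's parent.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive tile coordinates


@dataclass(frozen=True)
class Macroregion:
    name: str
    kind: str  # "text", "flat" or "image"
    box: Box


def area(box: Box) -> int:
    return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)


def contains(outer: Box, inner: Box) -> bool:
    """True when the outer box completely contains the inner box."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def find_parents(regions: List[Macroregion]) -> Dict[str, Optional[str]]:
    """Map each macroregion name to its parent's name, or None if it has no parent."""
    parents: Dict[str, Optional[str]] = {}
    for region in regions:
        candidates = [p for p in regions
                      if p is not region
                      and p.kind != "text"                # parents are non-text
                      and contains(p.box, region.box)
                      and area(p.box) > area(region.box)]  # assumed tie-break
        parent = min(candidates, key=lambda p: area(p.box), default=None)
        parents[region.name] = parent.name if parent else None
    return parents


page = Macroregion("page", "flat", (0, 0, 9, 9))
panel = Macroregion("panel", "flat", (2, 2, 6, 6))
word = Macroregion("word", "text", (3, 3, 4, 3))
parents = find_parents([page, panel, word])
```

In this example the text macroregion `word` is assigned `panel` rather than `page`, because `panel` is the smallest non-text macroregion whose bounding box contains it.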
The final macroregion classifications must be different for hierarchical association between macroregions.

The net result of the process 420 described above is a classification of macroregions of a mixed content document from an original image. This is achieved using pixel level analysis and colour document segmentation.

INDUSTRIAL APPLICABILITY

The arrangements described are applicable to the computer and data processing industries and particularly for decomposition of colour documents for layout analysis and classification.

The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.

(Australia Only) In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises" have correspondingly varied meanings.

Claims

1. A method of classifying regions of a scanned document image, said method comprising the steps of:
(a) partitioning the scanned image into a plurality of tiles;
(b) determining at least one dominant colour for each of the plurality of tiles;
(c) generating superpositioned regions based on dominant colours, each said region representing a group of tiles wherein at least one tile is grouped into two superpositioned regions and each dominant colour is represented by at most one of the regions;
(d) calculating statistics for each said region using pixel level statistics from each of the tiles included in said region; and
(e) determining a classification for each region based on the calculated statistics.

2. A method according to claim 1 wherein each region comprises a macroregion.

3. A method according to claim 2 wherein step (d) further comprises examining statistics of said macroregions and merging macroregions having a size considered to be noise.

4. A method according to claim 3 wherein the size considered to be noise is determined by using at least one of a coverage value and a border count approach.

5. A method according to claim 4 wherein the coverage value comprises those macroregions that cover less than a predetermined number of said tiles.

6. A method according to claim 4 wherein the border count approach comprises those macroregions which have a tile border count greater than a predetermined percentage of the total number of pixels in the macroregion.

7. A method according to claim 2 wherein step (e) comprises the steps of:
(ea) ascribing an initial classification to each said macroregion based on colour and shape statistics associated with said macroregion; and
(eb) assessing relationships between said macroregions to optimise the classification by merging sufficiently related ones of said macroregions.

8. A method according to claim 7 wherein step (ea) comprises classifying a macroregion according to at least one of the following statistics:
(1) at least one of an average and variance of the number of pixels in dominant colours included in the macroregion;
(2) a total number of dominant colours included in the macroregion;
(3) a ratio of the number of dominant colours to a bounding box area surrounding the macroregion;
(4) at least one edge ratio of the dominant colours in the macroregion;
(5) an average colour value for the macroregion;
(6) a number of tiles quantised to 3 or more colours;
(7) a number of other macroregions with which the macroregion shares tiles;
(8) a total number of pixels on the tile borders of the macroregion;
(9) a contrast level between the dominant colours in the macroregion and other dominant colours from the same tile; and
(10) a total number of dominant colours in the macroregion.

9. A method according to claim 7 or 8 wherein the initial classification ascribed each macroregion to one of three classes being: text, flat colour and image.

10. A method according to claim 9 wherein step (eb) comprises the steps of:
(eba) consolidating macroregions classified as image;
(ebb) detecting blends within the image and thereby merging corresponding macroregions;
(ebc) reclassifying macroregions based upon an extent of overlap between such macroregions;
(ebd) merging foreground fragmented macroregions of the image to the same classification; and
(ebe) merging proximate macroregions having similar colours and the same classification.

11. A method according to claim 10 wherein step (ebc) comprises determining a cost function associated with pairs of macroregions and minimising the cost function to classify macroregions as text or flat.

12. A method according to claim 10 wherein step (ebd) comprises assessing the fragmentation using at least one of:
(1) a colour distance between the two macroregions;
(2) an overlap factor determined using a number of tiles in which two macroregions are present;
(3) a tile border factor associated with a number of pixels from each macroregion on a tile border;
(4) an edge ratio similarity factor based upon edge ratios of macroregions; and
(5) an amount of evidence factor based upon the incidence of evidence contributing to a likelihood of merging.

13. A method according to any one of claims 2 to 12 further comprising the step of:
(f) forming a final output segmentation of said image by hierarchically associating at least one said macroregion within another said macroregion of different classification.

14. A computer readable medium having a program recorded thereon, the program being executable by a computer to classify regions of a scanned document image, said program comprising:
code for partitioning the scanned image into a plurality of tiles;
code for determining at least one dominant colour for each of the plurality of tiles;
code for generating superpositioned regions based on dominant colours, each said region representing a group of tiles wherein at least one tile is grouped into two superpositioned regions and each dominant colour is represented by at most one of the regions;
code for calculating statistics for each said region using pixel level statistics from each of the tiles included in said region; and
code for determining a classification (120, 130) for each region based on the calculated statistics.

15. A computer readable medium according to claim 14 wherein each segmented content comprises a macroregion and said code for calculating further comprises code for examining statistics of said macroregions and merging macroregions having a size considered to be noise.

16. A computer readable medium according to claim 15 wherein said code for determining comprises:
code for ascribing an initial classification to each said macroregion based on colour and shape statistics associated with said macroregion; and
code for assessing relationships between said macroregions to optimise the classification by merging sufficiently related ones of said macroregions.

17. A computer readable medium according to claim 16 wherein said code for ascribing comprises code for classifying a macroregion according to at least one of the following statistics:
(1) at least one of an average and variance of the number of pixels in dominant colours included in the macroregion;
(2) a total number of dominant colours included in the macroregion;
(3) a ratio of the number of dominant colours to a bounding box area surrounding the macroregion;
(4) at least one edge ratio of the dominant colours in the macroregion;
(5) an average colour value for the macroregion;
(6) a number of tiles quantised to 3 or more colours;
(7) a number of other macroregions with which the macroregion shares tiles;
(8) a total number of pixels on the tile borders of the macroregion;
(9) a contrast level between the dominant colours in the macroregion and other dominant colours from the same tile; and
(10) a total number of dominant colours in the macroregion.

18. A computer readable medium according to claim 15 or 16 wherein the initial classification ascribed each macroregion to one of three classes being: text, flat colour and image, and the code for assessing comprises:
code for consolidating macroregions classified as image;
code for detecting blends within the image and thereby merging corresponding macroregions;
code for reclassifying macroregions based upon an extent of overlap between such macroregions by determining a cost function associated with pairs of macroregions and minimising the cost function to classify macroregions as text or flat;
code for merging foreground fragmented macroregions of the image to the same classification; and
code for merging proximate macroregions having similar colours and the same classification.

19. A computer readable medium according to claim 18 wherein said code for merging comprises code for assessing the fragmentation using at least one of:
(1) a colour distance between the two macroregions;
(2) an overlap factor determined using a number of tiles in which two macroregions are present;
(3) a tile border factor associated with a number of pixels from each macroregion on a tile border;
(4) an edge ratio similarity factor based upon edge ratios of macroregions; and
(5) an amount of evidence factor based upon the incidence of evidence contributing to a likelihood of merging.

20. A computer readable medium according to any one of claims 13 to 19 further comprising:
code for forming a final output segmentation of said image by hierarchically associating at least one said macroregion within another said macroregion of different classification.


21. Computer apparatus adapted to perform the method of any one of claims 1 to 13.