AU2013260720A1 - Method, apparatus and system for generating a codebook - Google Patents

Method, apparatus and system for generating a codebook

Info

Publication number
AU2013260720A1
Authority
AU
Australia
Prior art keywords
partition
partitions
feature vectors
text
codebook
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2013260720A
Inventor
Getian Ye
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to AU2013260720A
Publication of AU2013260720A1
Legal status: Abandoned

Landscapes

  • Character Discrimination (AREA)

Abstract

A method of generating a codebook for classifying text attributes in a document is disclosed. For each of a plurality of partitions in a feature space, a count of a plurality of feature vectors in the partition is determined. One of the partitions having a non-zero count of feature vectors is selected. A first one of the partitions containing a shift point in the feature space is determined. The shift point has an offset from the selected partition according to the determined count of feature vectors associated with one or more of the plurality of partitions, the offset moving towards a partition having a higher density of the plurality of feature vectors. A second one of the partitions located adjacent to, and including, the first partition is selected to move towards a density peak in the feature space. The second partition is selected according to the determined counts of the plurality of feature vectors. A codeword for the codebook is determined based on the selected second partition to generate the codebook for classifying the text attributes in the document.

[Fig. 1A: functional diagram of the OCR system, in which a camera (127) or scanner (126) supplies a document image to text region selection, text attribute classification, and OCR system selection, which picks from an OCR bank (180) of engines (182, 184, 186, 188) for handwritten English, machine printed Chinese, handwritten Chinese, and machine printed Japanese.]

Description

METHOD, APPARATUS AND SYSTEM FOR GENERATING A CODEBOOK

TECHNICAL FIELD

The present invention relates generally to document image processing and, in particular, to codebook generation for classifying text attributes in a document image. The present invention also relates to a method and apparatus for generating a codebook for text attribute classification, and to a computer program product including a computer readable medium having recorded thereon a computer program for generating a codebook for text attribute classification.

BACKGROUND

Documents such as letters, posters, forms, etc. are commonly used to convey important information. Even though documents are typically intended to convey information directly to human readers, it can be useful to computationally understand the content of documents. Computationally understanding the content of documents allows automation of many routine tasks associated with documents, such as sorting, filing, summarising, cross referencing, etc., thereby allowing human attention to focus more directly on the relevant information rather than the document formatting.

Often documents are not in a convenient format for automation. In these cases, some form of document analysis is required in order to determine the necessary information about the document. When the document is stored as a digitised image, document image processing techniques may be used to derive information from the document image. One well-known example of a document image processing method is Optical Character Recognition (OCR), which extracts the textual content of a document image.

Text attribute classification is a document image processing method that identifies text attributes used in an image that depicts text. A text attribute describes aspects, properties, or characteristics of a piece of text. For example, a text attribute may be a script corresponding to a piece of text. Script is the graphical form used by a writing system to write statements expressible in a language such as English or Chinese. A script may be used by only one language. Some languages share a script with other languages. For example, both English and French use the Latin script. Some languages such as Japanese use multiple scripts. An example of a document using a number of scripts is shown in Fig. 9A.

As another example, a text attribute may describe whether text is handwritten or machine printed. Documents such as letters, bank cheques, legal forms, etc. often contain both machine printed and handwritten text. Handwriting often indicates signatures, corrections, additions, or other supplemental information that should be treated differently from machine printed text. An example of a document using handwritten and machine printed text is shown in Fig. 9B. As yet another example, a text attribute may be a combination of different text attributes, such as whether text is handwritten English or machine printed Chinese.

Knowledge of a text attribute used in a document image is important for automatic document image analysis. For example, identifying a script used in a document image helps to select an appropriate optical character recognition (OCR) system from the optical character recognition (OCR) bank.
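To make the attribute-to-engine relationship concrete, the following is a minimal sketch of such engine selection; it is illustrative only, the registry, function and engine names are hypothetical, and the engine numbering follows Fig. 1A rather than any code disclosed in this application:

```python
from collections import Counter

# Hypothetical registry mirroring the OCR bank (180) of Fig. 1A.
OCR_BANK = {
    ("handwritten", "English"): "ocr_handwritten_english",            # engine 182
    ("machine printed", "Chinese"): "ocr_machine_printed_chinese",    # engine 184
    ("handwritten", "Chinese"): "ocr_handwritten_chinese",            # engine 186
    ("machine printed", "Japanese"): "ocr_machine_printed_japanese",  # engine 188
}

def select_ocr_engine(region_attributes):
    """Select the engine for the text attribute class that occurs most
    often among the classified text regions of a page."""
    dominant, _ = Counter(region_attributes).most_common(1)[0]
    return OCR_BANK[dominant]
```

For example, a page whose regions are mostly classified as ("machine printed", "Chinese") would be routed to the machine printed Chinese engine, matching the behaviour described for the OCR system selection module later in this description.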
Separating handwritten text from machine printed text helps other document image processing tasks such as digitisation, text line extraction, optical character recognition (OCR) system selection, etc., because handwritten text is significantly different from machine printed text.

Text attribute classification relies on the fact that each class of text attribute has unique visual characteristics that make it possible to distinguish the text attribute from other classes of attributes. For example, the shape of strokes may distinguish handwritten text from machine printed text, as handwritten text is usually more curvy and irregular than machine printed text. Hence, one component of a text attribute classification system is the set of visual feature vectors extracted from text regions within a given document image. Each element of a feature vector indicates a visual characteristic such as shape, orientation, or gradient. The feature extraction can be performed at different levels inside a document image (i.e., page level, paragraph/text block level, text line level, and even word level).

Another component of a text attribute classification system is a classifier, which is often designed using a machine learning algorithm. A machine learning algorithm usually comprises a training phase and a testing phase. During the training phase, the classifier is trained or learned using numerous training images and associated labels. The label for each training image indicates the text attribute class used in the image. During the testing phase, the trained classifier is used for identifying or predicting the text attribute class for an input test image with an unknown text attribute class.

A bag-of-words (BoW) representation, which models an image as an unordered collection of local image feature vectors, has become increasingly popular for document image analysis due to its simplicity and good performance. A conventional BoW method first extracts feature vectors from each training image during a training phase. For example, a suitable feature vector may be a Scale Invariant Feature Transform (SIFT) vector, a Histogram of Oriented Gradients (HOG) vector, or a k Adjacent Segment (kAS) vector. A clustering process is then performed on the feature vectors to generate a set of cluster centres. Each cluster centre is called a "codeword" and the set of cluster centres is called a "codebook". By comparing the distances between feature vectors and codewords, each feature vector is uniquely mapped to a nearest codeword. After the mapping operation, each training image is represented by a histogram that indicates how many feature vectors are mapped to each codeword in the whole training image. A classifier is then trained or learned using the training histograms and the labels associated with the training images. For example, a suitable classifier may be comprised of all the histograms and labels associated with the training images.

During the testing phase, the feature extraction method used in the training phase is performed on an input test image to extract feature vectors. Each feature vector is then uniquely mapped to a nearest codeword of the pre-learned codebook. After the mapping operation, the input test image is represented by a histogram. Such a test histogram is compared with all the training histograms based on distance calculation. The text attribute class label for the input test image may be the label of the training histogram nearest to the test histogram.
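The codeword mapping and histogram construction just described can be sketched in a few lines. This is a minimal illustration of the conventional BoW pipeline, not the patented PBCG method; the function and variable names are ours:

```python
import numpy as np

def bow_histogram(features, codebook):
    """Map each local feature vector to its nearest codeword and count
    the occurrences of each codeword (the BoW mapping described above).

    features: (n, d) array of local feature vectors (e.g. SIFT, HOG, kAS).
    codebook: (k, d) array of codewords (cluster centres).
    """
    # Squared Euclidean distance from every feature vector to every codeword.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)                  # index of the nearest codeword
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)              # normalise so images are comparable
```

At training time one such histogram is stored per training image together with its label; at test time the test histogram is compared against all stored training histograms by distance, as described above.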
Alternatively, the text attribute class label for the input test image may be obtained using a majority voting scheme. That is, the label of a majority of training histograms near to the test histogram is the predicted label.

A conventional method of generating a codebook is referred to as "K-means", which can group feature vectors into k clusters. A cluster centre is usually the mean of all the feature vectors belonging to the cluster. Each feature vector is determined to belong to a cluster if the distance from the feature vector to the cluster centre is the smallest compared with the distances to other cluster centres. The K-means method usually requires the user to provide a reasonable guess for the number of clusters present. In order to achieve reasonable results, the K-means method usually needs to be performed more than once with different initialisations. That is, k feature vectors are randomly selected as the initial cluster centres. The K-means method may produce erroneous results if the underlying distribution of the feature space is not spherically shaped. Moreover, the K-means method is not robust to "outliers". The term outliers refers to feature vectors that do not belong to any of the k clusters and which may move the estimated cluster centres away from the dense regions in the feature space.

Another conventional method of generating a codebook is referred to as "mean shift", which is a mode seeking method for determining the modes (or local maxima, or local density peaks) of the underlying distribution of a feature space. The mean shift method begins at each feature vector and first estimates the local density gradient of similar feature vectors. The gradient estimates are used within an iterative procedure to determine the peaks in the local density. All the feature vectors that are drawn upwards to the same peak are then considered to be members of the same cluster. Compared with the K-means method, the mean shift method does not require the underlying distribution of the feature space to have a parametric structure. In addition, the mean shift method may automatically find the number of clusters in the feature space without any prior knowledge of the number of clusters. Moreover, the mean shift method is robust against outliers.

The mean shift method has two main disadvantages. One disadvantage is high computational complexity. The computational complexity of the mean shift method depends on the number of feature vectors to be clustered, the average number of iterations required by each feature vector, and the dimensionality of the feature space. The mean shift method may be infeasible for handling a high-dimensional large-scale dataset. The linear convergence rate of the mean shift method usually results in a high number of iterations for finding the mode of each feature vector, particularly near the mode. Another disadvantage is that the mean shift method may become trapped in spurious local maxima, especially when the underlying distribution has multiple peaked modes. This is mainly because the mean shift method assumes that the initialisation point falls within the basin of attraction of the desired mode.
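For concreteness, the conventional mean shift iteration described above can be sketched as follows. This is a minimal illustration assuming a Gaussian kernel; names are ours, and a real implementation must choose the bandwidth carefully:

```python
import numpy as np

def mean_shift_mode(start, features, bandwidth=1.0, tol=1e-4, max_iter=500):
    """Iterate the mean shift update from `start` until it converges to a
    mode (local density peak) of the feature distribution."""
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Gaussian kernel weight of every feature vector relative to x.
        w = np.exp(-((features - x) ** 2).sum(axis=1) / (2 * bandwidth ** 2))
        y = (w[:, None] * features).sum(axis=0) / w.sum()   # shift to weighted mean
        if np.linalg.norm(y - x) < tol:   # linear convergence: many small steps near the mode
            return y
        x = y
    return x
```

Running this loop from every one of n feature vectors, with each update touching all n vectors, is precisely what produces the high computational complexity noted above.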
Various data reduction schemes have been proposed to reduce the number of feature vectors and improve the conventional mean shift method. The common scheme of data reduction is binning (or the use of multi-dimensional histograms), which divides a feature space using an equally-spaced mesh of grid points and assigns each feature vector to a nearest grid point with a multidimensional range search. Grid points are then used in the mean shift procedure. The binning scheme requires a large amount of memory storage, especially when the dimensionality of the feature vectors is high. For example, if M grid points are selected at each dimension of a d-dimensional feature space, the total number of grid points is M^d. Building a high-dimensional histogram (e.g., of dimension more than ten (10)) with a large number of grid points is also very difficult and computationally expensive.

Another data reduction scheme which may be used to improve the conventional mean shift method uses cluster centres obtained from the K-means method in the mean shift method. The K-means method usually needs to be performed more than once with different initialisations. In addition, the pre-selection of k clusters often leads to different clustering results.

Another data reduction scheme is random sampling from the entire distribution based on a pre-selected kernel function. However, the implementation of the sampling procedure is only applicable for a limited number of kernel functions, such as a Gaussian kernel and a uniform kernel.

Another data reduction scheme which may be used to improve the conventional mean shift method is spatial discretisation, which uses binning or a k-d tree to group feature vectors based on the relationship of the feature vectors in the spatial domain of the image. The spatial discretisation reduction scheme is mainly designed for colour-based image segmentation tasks and may not be applicable for other feature descriptors such as HOG and kAS feature descriptors.

Another method of improving the mean shift method is to reduce the average number of iterations or improve convergence. One method of reducing the average number of iterations is based on the fact that the mean shift method is an Expectation-Maximisation (EM) algorithm, and takes advantage of the sparse EM algorithm so that a full iteration of the mean shift method is run infrequently. However, such a method of reducing the average number of iterations is sensitive to the parameter settings of the EM algorithm and may result in unacceptably large errors for a suboptimal parameter setting.

Another method which may be used to improve the mean shift method is to combine the EM algorithm with Newton's method. However, one issue with using such a method is deciding when to enable the Newton step. The use of the Newton step introduces extra computational cost in each iteration. Furthermore, the Newton step may even reduce the efficiency of the mean shift method, may be undefined, or may be too long if the Hessian of the density is not positive definite.

Another method which may be used to improve the mean shift method is to improve the robustness of the mean shift method to spurious local maxima, for example, by random perturbation that perturbs a candidate mode by a random vector of small norm. Such random perturbation may move a candidate mode to a region which is further away from the desired mode or contains an undesired mode. Therefore, random perturbation may introduce extra computational complexity or lead to erroneous results.
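Returning to the binning scheme discussed at the start of this survey, the idea can be sketched as follows. This is illustrative only: simple coordinate rounding stands in for the multidimensional range search, and only occupied grid points are materialised, although the M^d growth described above remains the fundamental limit of a dense mesh:

```python
import numpy as np

def bin_features(features, cell_width=1.0):
    """Assign each feature vector to its nearest grid point on an
    equally-spaced mesh and return the occupied grid points with counts."""
    idx = np.round(features / cell_width).astype(int)      # nearest grid index per dimension
    points, counts = np.unique(idx, axis=0, return_counts=True)
    return points * cell_width, counts                     # grid points to feed the mean shift
```

The mean shift procedure is then run over the returned grid points, weighted by their counts, instead of over every raw feature vector.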
SUMMARY

It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.

Disclosed are arrangements, referred to as Partition Based Codebook Generation (PBCG) arrangements, which seek to address the above problems to generate a codebook with low computational complexity, fast convergence rate, and high robustness to spurious local maxima by exploiting a space partition structure. The codebook so formed may then be used by a classifier for identifying the text attribute class of a text region in question.

According to one aspect of the present disclosure, there is provided a method of generating a codebook for classifying text attributes in a document, the method comprising:

determining, for each of a plurality of partitions in a feature space, a count of a plurality of feature vectors in the partition;

selecting one of said partitions having a non-zero count of feature vectors;

determining a first one of said partitions containing a shift point in the feature space, the shift point having an offset from the selected partition according to the determined count of feature vectors associated with one or more of the plurality of partitions, the offset moving towards a partition having a higher density of the plurality of feature vectors;

selecting a second one of said partitions located adjacent to, and including, the first partition to move towards a density peak in said feature space, the second partition being selected according to the determined counts of the plurality of feature vectors; and

determining a codeword for the codebook based on the selected second partition to generate the codebook for classifying the text attributes in the document.

According to another aspect of the present disclosure, there is provided an apparatus for generating a codebook for classifying text attributes in a document, the apparatus comprising:

means for determining, for each of a plurality of partitions in a feature space, a count of a plurality of feature vectors in the partition;

means for selecting one of said partitions having a non-zero count of feature vectors;

means for determining a first one of said partitions containing a shift point in the feature space, the shift point having an offset from the selected partition according to the determined count of feature vectors associated with one or more of the plurality of partitions, the offset moving towards a partition having a higher density of the plurality of feature vectors;

means for selecting a second one of said partitions located adjacent to, and including, the first partition to move towards a density peak in said feature space, the second partition being selected according to the determined counts of the plurality of feature vectors; and

means for determining a codeword for the codebook based on the selected second partition to generate the codebook for classifying the text attributes in the document.
According to still another aspect of the present disclosure, there is provided a system for generating a codebook for classifying text attributes in a document, the system comprising:

a memory for storing data and a computer program;

a processor coupled to the memory for executing the computer program, said computer program comprising instructions for:

determining, for each of a plurality of partitions in a feature space, a count of a plurality of feature vectors in the partition;

selecting one of said partitions having a non-zero count of feature vectors;

determining a first one of said partitions containing a shift point in the feature space, the shift point having an offset from the selected partition according to the determined count of feature vectors associated with one or more of the plurality of partitions, the offset moving towards a partition having a higher density of the plurality of feature vectors;

selecting a second one of said partitions located adjacent to, and including, the first partition to move towards a density peak in said feature space, the second partition being selected according to the determined counts of the plurality of feature vectors; and

determining a codeword for the codebook based on the selected second partition to generate the codebook for classifying the text attributes in the document.

According to still another aspect of the present disclosure, there is provided a computer readable medium having a computer program stored thereon for generating a codebook for classifying text attributes in a document, the program comprising:

code for determining, for each of a plurality of partitions in a feature space, a count of a plurality of feature vectors in the partition;

code for selecting one of said partitions having a non-zero count of feature vectors;

code for determining a first one of said partitions containing a shift point in the feature space, the shift point having an offset from the selected partition according to the determined count of feature vectors associated with one or more of the plurality of partitions, the offset moving towards a partition having a higher density of the plurality of feature vectors;

code for selecting a second one of said partitions located adjacent to, and including, the first partition to move towards a density peak in said feature space, the second partition being selected according to the determined counts of the plurality of feature vectors; and

code for determining a codeword for the codebook based on the selected second partition to generate the codebook for classifying the text attributes in the document.

Other aspects of the invention are also disclosed.

BRIEF DESCRIPTION OF THE DRAWINGS

Some aspects of the prior art and at least one embodiment of the present invention will now be described with reference to the following drawings, in which:

Fig. 1A is a functional block diagram of an optical character recognition (OCR) system utilising text attribute classification according to one Partition Based Codebook Generation (PBCG) arrangement;

Figs. 1B and 1C form a schematic block diagram of a general purpose computer system upon which a computer module of the optical character recognition (OCR) system of Fig. 1A can be practised;

Fig. 2 shows a bitmap image depicting multiple paragraphs of text;

Fig. 3 is a schematic flow diagram showing a method of classifying a text region having an unknown text attribute;
Fig. 4 is a schematic flow diagram showing a method of determining a codebook and a classifier as used in the method of Fig. 3;

Fig. 5 is a schematic flow diagram showing a method of generating a codebook;

Fig. 6 is a schematic flow diagram showing a method of mode seeking according to a PBCG arrangement;

Fig. 7 is a diagram showing some feature vectors and a shift point in a two-dimensional projection of a d-dimensional subspace Hd;

Fig. 8A is a diagram showing some feature vectors and a shift point in a two-dimensional feature space that is hierarchically partitioned by splitting lines;

Fig. 8B is a diagram showing a k-d tree constructed based on the partitioning shown in Fig. 8A;

Fig. 9A is an example of a document image using a number of scripts; and

Fig. 9B is an example of a document image using handwritten and machine printed texts.

DETAILED DESCRIPTION INCLUDING BEST MODE

Where reference is made in any one or more of the accompanying drawings to steps and/or features which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.

It is to be noted that the discussions contained in the "Background" section and the section above relating to prior art arrangements relate to discussions of documents or devices which may form public knowledge through their respective publication and/or use. Such discussions should not be interpreted as a representation by the present inventor(s) or the patent applicant that such documents or devices in any way form part of the common general knowledge in the art.

The Partition Based Codebook Generation (PBCG) arrangements described below effectively generate a codebook for a high-dimensional large-scale dataset with low computational cost, fast convergence, and high robustness to spurious local maxima. Codebook generation facilitates further processing such as text attribute classification and OCR engine selection.

Fig. 1A shows a system 100 for performing OCR on a document image. The OCR system 100 processes a bitmap image 171 of an input document 170 to produce an electronic document 190. The electronic document 190 may be produced in accordance with the methods described below. The described methods may be implemented, for example, by a software application program 133 (see Fig. 1B) resident on a hard disk drive 110 and being controlled in its execution by a processor 105 (see Fig. 1B) of a computer module 101. The document 190 produced in accordance with the described methods can be edited in a word processing environment or can be indexed using typical text search tools.

The bitmap image 171 may be produced by any of a number of sources, such as by a scanner 126 (see Fig. 1B) scanning a hardcopy document 170, by retrieval from a data storage system such as the hard disk drive 110 of the computer module 101 having a database of images stored on the hard disk drive 110, or by digital photography using a camera 127. The scanner 126, hard disk drive 110 and camera 127 are merely examples of how the bitmap image 171 might be provided. As another example, the bitmap image 171 may be created by the software application 133 as an extension of the printing functionality of the software application 133.

The OCR system 100 performs text region selection using a text region selection module 172, whereby at least one text region is extracted from the bitmap image 171.
In one arrangement, the text region selection module 172 may be implemented as one or more software code modules of the software application program 133 resident on the hard disk drive 110 of the computer module 101 and being controlled in their execution by a processor 105 of the computer module 101.

Text attribute classification is then performed by a text attribute classification module 173 on each of the selected text regions. Again, in one arrangement, the text attribute classification module 173 may be implemented as one or more software code modules of the software application program 133 resident on the hard disk drive 110 of the computer module 101 and being controlled in their execution by the processor 105.

Text attribute classification results determined by the text attribute classification module 173 are considered by an OCR system selection module 174. In one arrangement, the OCR system selection module 174 may be implemented as one or more software code modules of the software application program 133 resident on the hard disk drive 110 of the computer module 101 and being controlled in their execution by the processor 105. The OCR system selection module 174 determines which of a plurality of OCR engines 182, 184, 186 and 188 from a bank of OCR engines 180 are appropriate for the bitmap image 171. For example, if the text attribute classification results suggest a high incidence of machine printed Chinese text, the OCR system selection module 174 considers the OCR engine for machine printed Chinese 184. Again, in one arrangement, the bank of OCR engines 180 may be implemented as one or more software code modules of the software application program 133. The OCR system 100 processes the bitmap image 171 using the appropriate OCR engine 182, 184, 186 or 188 to produce the electronic document 190 with recognised text.

As seen in Fig. 1B, the system 100 includes: the computer module 101; input devices such as a keyboard 102, a mouse pointer device 103, the scanner 126, the camera 127, and a microphone 180; and output devices including a printer 115, a display device 114 and loudspeakers 117. An external Modulator-Demodulator (Modem) transceiver device 116 may be used by the computer module 101 for communicating to and from a communications network 120 via a connection 121. The communications network 120 may be a wide-area network (WAN), such as the Internet, a cellular telecommunications network, or a private WAN. Where the connection 121 is a telephone line, the modem 116 may be a traditional "dial-up" modem. Alternatively, where the connection 121 is a high capacity (e.g., cable) connection, the modem 116 may be a broadband modem. A wireless modem may also be used for wireless connection to the communications network 120.

The computer module 101 typically includes at least one processor unit 105, and a memory unit 106. For example, the memory unit 106 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM). The computer module 101 also includes a number of input/output (I/O) interfaces including: an audio-video interface 107 that couples to the video display 114, loudspeakers 117 and microphone 180; an I/O interface 113 that couples to the keyboard 102, mouse 103, scanner 126, camera 127 and optionally a joystick or other human interface device (not illustrated); and an interface 108 for the external modem 116 and printer 115.
In some implementations, the modem 116 may be incorporated within the computer module 101, for example within the interface 108. The computer module 101 also has a local network interface 111, which permits coupling of the system 100 via a connection 123 to a local-area communications network 122, known as a Local Area Network (LAN). As illustrated in Fig. 1B, the local communications network 122 may also couple to the wide network 120 via a connection 124, which would typically include a so-called "firewall" device or device of similar functionality. The local network interface 111 may comprise an Ethernet circuit card, a Bluetooth® wireless arrangement or an IEEE 802.11 wireless arrangement; however, numerous other types of interfaces may be practised for the interface 111.

The I/O interfaces 108 and 113 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 109 are provided and typically include the hard disk drive (HDD) 110. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 112 is typically provided to act as a non-volatile source of data. Portable memory devices, such as optical disks (e.g., CD-ROM, DVD, Blu-ray Disc™), USB-RAM, portable external hard drives, and floppy disks, for example, may be used as appropriate sources of data to the system 100.

The components 105 to 113 of the computer module 101 typically communicate via an interconnected bus 104 and in a manner that results in a conventional mode of operation of the computer system 100 known to those in the relevant art. For example, the processor 105 is coupled to the system bus 104 using a connection 118. Likewise, the memory 106 and optical disk drive 112 are coupled to the system bus 104 by connections 119. Examples of computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun Sparcstations, Apple Mac™ or alike computer systems.

The described methods may be implemented using the system 100, wherein the processes of Figs. 3 to 6, to be described, may be implemented as one or more code modules of the software application program 133 executable within the system 100. In particular, the steps of the described methods are effected by instructions 131 (see Fig. 1C) in the software 133 that are carried out within the system 100. The software instructions 131 may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules perform the described methods and a second part and the corresponding code modules manage a user interface between the first part and the user.

The software 133 may be stored in a computer readable medium, including the storage devices described below, for example. The software 133 is typically stored in the HDD 110 or the memory 106. The software 133 is loaded into the system 100 from the computer readable medium, and then executed by the system 100. Thus, for example, the software 133 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 125 that is read by the optical disk drive 112.
A computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product. The use of the computer program product in the system 100 preferably effects an advantageous apparatus for implementing the described methods.

In some instances, the application programs 133 may be supplied to the user encoded on one or more CD-ROMs 125 and read via the corresponding drive 112, or alternatively may be read by the user from the networks 120 or 122. Still further, the software can also be loaded into the computer system 100 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 100 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu-ray™ Disc, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 101. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of software, application programs, instructions and/or data to the computer module 101 include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.

The second part of the application programs 133 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 114. Through manipulation of typically the keyboard 102 and the mouse 103, a user of the computer system 100 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilising speech prompts output via the loudspeakers 117 and user voice commands input via the microphone 180.

Fig. 1C is a detailed schematic block diagram of the processor 105 and a "memory" 134. The memory 134 represents a logical aggregation of all the memory modules (including the HDD 109 and semiconductor memory 106) that can be accessed by the computer module 101 in Fig. 1B.

When the computer module 101 is initially powered up, a power-on self-test (POST) program 150 executes. The POST program 150 is typically stored in a ROM 149 of the semiconductor memory 106 of Fig. 1B. A hardware device such as the ROM 149 storing software is sometimes referred to as firmware. The POST program 150 examines hardware within the computer module 101 to ensure proper functioning and typically checks the processor 105, the memory 134 (109, 106), and a basic input-output systems software (BIOS) module 151, also typically stored in the ROM 149, for correct operation. Once the POST program 150 has run successfully, the BIOS 151 activates the hard disk drive 110 of Fig. 1B. Activation of the hard disk drive 110 causes a bootstrap loader program 152 that is resident on the hard disk drive 110 to execute via the processor 105.
This loads an operating system 153 into the RAM 106, upon which the operating system 153 commences operation. The operating system 153 is a system level application, executable by the processor 105, to fulfil various high level functions, including processor management, memory management, device management, storage management, software application interface, and generic user interface.

The operating system 153 manages the memory 134 (109, 106) to ensure that each process or application running on the computer module 101 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the system 100 of Fig. 1B must be used properly so that each process can run effectively. Accordingly, the aggregated memory 134 is not intended to illustrate how particular segments of memory are allocated (unless otherwise stated), but rather to provide a general view of the memory accessible by the computer system 100 and how such is used.

As shown in Fig. 1C, the processor 105 includes a number of functional modules including a control unit 139, an arithmetic logic unit (ALU) 140, and a local or internal memory 148, sometimes called a cache memory. The cache memory 148 typically includes a number of storage registers 144-146 in a register section. One or more internal busses 141 functionally interconnect these functional modules. The processor 105 typically also has one or more interfaces 142 for communicating with external devices via the system bus 104, using a connection 118. The memory 134 is coupled to the bus 104 using a connection 119.

The application program 133 includes a sequence of instructions 131 that may include conditional branch and loop instructions. The program 133 may also include data 132 which is used in execution of the program 133. The instructions 131 and the data 132 are stored in memory locations 128, 129, 130 and 135, 136, 137, respectively. Depending upon the relative size of the instructions 131 and the memory locations 128-130, a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 130. Alternatively, an instruction may be segmented into a number of parts, each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 128 and 129.

In general, the processor 105 is given a set of instructions which are executed therein. The processor 105 waits for a subsequent input, to which the processor 105 reacts by executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 102, 103, data received from an external source across one of the networks 120, 122, data retrieved from one of the storage devices 106, 109, or data retrieved from a storage medium 125 inserted into the corresponding reader 112, all depicted in Fig. 1B. The execution of a set of the instructions may in some cases result in output of data. Execution may also involve storing data or variables to the memory 134.

The described arrangements use input variables 154, which are stored in the memory 134 in corresponding memory locations 155, 156, 157. The described arrangements produce output variables 161, which are stored in the memory 134 in corresponding memory locations 162, 163, 164.
Intermediate variables 158 may be stored in memory locations 159, 160, 166 and 167.

Referring to the processor 105 of Fig. 1C, the registers 144, 145, 146, the arithmetic logic unit (ALU) 140, and the control unit 139 work together to perform sequences of micro-operations needed to perform "fetch, decode, and execute" cycles for every instruction in the instruction set making up the program 133. Each fetch, decode, and execute cycle comprises:

a fetch operation, which fetches or reads an instruction 131 from a memory location 128, 129, 130;

a decode operation in which the control unit 139 determines which instruction has been fetched; and

an execute operation in which the control unit 139 and/or the ALU 140 execute the instruction.

Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 139 stores or writes a value to a memory location 132.

Each step or sub-process in the processes of Figs. 3 to 6 is associated with one or more segments of the program 133 and is performed by the register section 144, 145, 146, the ALU 140, and the control unit 139 in the processor 105 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the noted segments of the program 133.

The described methods may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub-functions of the described methods. Such dedicated hardware may include graphic processors, digital signal processors, or one or more microprocessors and associated memories.

The text region selection module 172 will now be described with reference to the bitmap image 171, which is shown in more detail in Fig. 2. As seen in Fig. 2, the bitmap image 171 comprises multiple paragraphs of text. The text region selection module 172 is configured to select at least one text region of a bitmap image, such as the bitmap image 171, for processing. The text region selection module 172 may use text segmentation to break text of the bitmap image 171 up into meaningful elements. In the example of Fig. 2, the text region selection module 172 is configured for segmenting the bitmap image 171 into text regions including a text region 210 containing a single word, a text region 220 containing a single text line, and a text region 230 containing a single text paragraph.

Alternatively, the text region selection module 172 may be configured to select text regions of an image without significant text segmentation. For example, the text region selection module 172 may be configured to select a text region in the form of a square-shaped subregion of a bitmap image. In the example of Fig. 2, the text region selection module 172 may be configured to select a square-shaped subregion 240 of the bitmap image 171 containing text, including some partial characters as well as full characters. In one arrangement, the text region selection module 172 may be configured to select the entire bitmap image 171 as a text region.

The selected text regions may be stored in the memory 106 by the text region selection module 172 under execution of the processor 105. The text region selection module 172, under execution of the processor 105, provides the selected text regions to the text attribute classification module 173.
The text region selection module 172 provides different amounts of context to the text attribute classification module 173 depending on the text regions selected in the bitmap image 171. The context is used by the text attribute classification module 173 to provide specificity to the classification of the selected text. The text is selected by the text region selection module 172 according to the intended use of the system 100 (i.e., the optical character recognition (OCR) system). For example, in one arrangement, the system 100 may be used for indexing search terms found in the bitmap image 171 by selecting regions of the bitmap image 171 which are of a similar quantum to a typical search term. For such an arrangement, the text region selection module 172 may be configured for selecting regions of the bitmap image 171 containing at least a word and at most a text line. The text attribute classification module 173, as described in more detail below, can function with any suitable selection method used by the text region selection module 172.

A method 300 of classifying a text region, as executed by the text attribute classification module 173, will now be described with reference to Fig. 3. The method 300 will be described by way of example with reference to the text region 310. As described above, the text attribute classification module 173 may be implemented as one or more software code modules of the software application program 133 resident on the hard disk drive 110 of the computer module 101 and being controlled in their execution by the processor 105.

The method 300 begins at receiving step 310, where a text region 310 having an unknown text attribute is received by the text attribute classification module 173 from the system 100. For example, the text region 310 may be received from the text region selection module 172 of the system 100.

Then at feature extraction step 320, a feature extraction process is executed by the text attribute classification module 173, under execution of the processor 105, to determine feature vectors (e.g., SIFT, HOG or kAS feature vectors) from the text region received at step 310.

At a subsequent comparing step 330, the text attribute classification module 173, under execution of the processor 105, compares each of the feature vectors determined at step 320 with a pre-learned codebook by determining the distances between each feature vector and all the codewords of the codebook. The pre-learned codebook will be described in more detail below.

Then at a histogram generation step 340, the text attribute classification module 173, under execution of the processor 105, generates a histogram representing the text region 310 by mapping each feature vector to a codeword nearest the particular feature vector. The histogram indicates how many feature vectors are mapped to each codeword in the text region 310. The determined histogram may be stored in the memory 106 by the text attribute classification module 173, under execution of the processor 105.

The histogram is used by a subsequent text attribute determination step 350 for determining the text attribute class of the text region 310 based on a pre-learned classifier, such as a classifier comprised of all training histograms, which will be described below. Alternatively, the text attribute class of the text region 310 may be determined based on a support vector machine (SVM).
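The flow of steps 330 to 350 can be sketched compactly. This is an illustration only, not the patented implementation: it assumes L1-normalised histograms and the nearest-training-histogram classifier, whereas the description also permits majority voting or an SVM:

```python
import numpy as np

def classify_text_region(region_features, codebook, train_hists, train_labels):
    """Outline of steps 330-350: map each feature to its nearest codeword
    (330), build the region's codeword histogram (340), and label the region
    with the label of the nearest training histogram (350)."""
    d = ((region_features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(codebook)).astype(float)
    hist /= max(hist.sum(), 1.0)                   # normalised histogram of step 340
    nearest = np.linalg.norm(train_hists - hist, axis=1).argmin()
    return train_labels[nearest]                   # nearest-histogram label of step 350
```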
The pre-learned codebook and classifier used by the text attribute classification module 173 in the method 300 may be determined in accordance with a method 400 of determining a codebook and a classifier, as shown in Fig. 4. The method 400 may be referred to as a "training or learning process". The codebook may be used for classifying text attributes in the document 170, as described below. The method 400 may be implemented as one or more software code modules of the software application program 133 resident on the hard disk drive 110 of the computer module 101 and being controlled in their execution by the processor 105.

The method 400 requires a set of training samples composed of training image regions and associated labels. The training samples may be created by a user with some domain knowledge using manual annotation. The created training samples may be stored in the hard disk drive 110 under execution of the processor 105.

The method 400 begins at an accessing step 410, where the set of training samples 410 is accessed, for example, from the hard disk drive 110.

Then at step 420, a feature extraction process (equivalent to the feature extraction process executed at step 320 of Fig. 3) is executed by the processor 105 to determine feature vectors from each training image region. The feature vectors may be referred to as "training feature vectors" and may be stored in the memory 106.

Then at a codebook generation step 430, the processor 105 groups all the feature vectors determined at step 420 into a plurality of clusters and generates a codebook by assigning the centre of each of the clusters as a codeword of the codebook. The codebook may be stored in the memory 106. A method 500 of generating a codebook, as executed at step 430, will be described in detail below with reference to Fig. 5. The generated codebook may be used for classifying text attributes in a document.

At a subsequent comparing step 440, the processor 105 is used for comparing each feature vector determined at step 420 with codewords of the codebook determined at step 430 based on distance calculations.

The method 400 continues at a histogram generation step 450, where the processor 105 is used to generate a histogram representing each training image region accessed at step 410 by mapping each feature vector determined at step 420 to a nearest codeword.

Then at step 460, the histograms and class labels corresponding with the training image regions are used to train a classifier, such as a support vector machine (SVM), which may be stored in the memory 106.

The method 400 concludes at a storing step 470, where the codebook determined at step 430 and the classifier determined at step 460 are stored in the memory 106 and/or the hard disk drive 110 for text attribute classification by the text attribute classification module 173.

The method 500 of generating a codebook, as executed at step 430, will be described in detail below with reference to Fig. 5. The method 500 may be implemented as one or more software code modules of the software application program 133 resident on the hard disk drive 110 of the computer module 101 and being controlled in their execution by the processor 105.

The method 500 begins at accessing step 510, where the training feature vectors determined by the feature extraction process executed at step 420 are accessed from the memory 106 by the processor 105.
At a subsequent selection step 520, the processor 105 is used to select a partition structure that divides a feature space corresponding to the training feature vectors into "subregions" or "partitions". The selection of a suitable partition structure at step 520 depends on the dimensionality of the feature space. The partition structure selected at step 520 may be a k-d tree or a lattice. A k-d tree, which is a binary tree, defines a hierarchical partitioning or splitting of a k-dimensional feature space into disjoint subregions or partitions. A lattice is an infinite set of regularly spaced points in a Euclidean space, and each point represents "a subregion" or "a partition" of the space. A k-d tree is effective for handling low-dimensional data (e.g., where the dimension is less than ten (10)). A lattice is effective for handling high-dimensional data. Based on the partition structure selected at step 520, the feature space is split into disjoint partitions.

For each training feature vector, a searching step 530 is performed, where the processor 105 is used to determine a partition containing the training feature vector using a search algorithm. Each partition centre has a record of the number of training feature vectors close to the partition centre. The number of training feature vectors close to each partition centre will be referred to below as a "count", so that each partition centre has an associated count. The partition determined at step 530 may be stored in the memory 106 together with the associated count. In one arrangement, prior to execution of the step 530, the processor 105 may be configured for determining, for each of the plurality of partitions in the feature space corresponding to the training vectors, the count of feature vectors in each partition. As described above, each partition centre has an associated count.

The method 500 continues at an identification step 540, where the processor 105 is used to identify the centres of each partition with associated counts larger than zero. The partition centres with counts larger than zero are called "active" partition centres, and the partition centres identified at step 540 may be stored in the memory 106.

The active partition centres and associated counts are used by the processor 105 at a mode seeking step 550 for generating a mode set, which may be stored in the memory 106. The mode set contains the mode for each active partition centre. A method 600 of generating a mode set for the active partition centres, as executed at step 550, will be described in detail below with reference to Fig. 6.

Then at a codeword forming step 560, the processor 105 is used to form the codewords of the codebook determined at step 430 and stored in the memory 106. Each codeword may be a unique mode of the mode set determined at step 550. Alternatively, a clustering process may be performed at step 560 to obtain cluster centres by grouping the modes of the mode set based on distances between the modes. In this instance, each cluster centre is then assigned to be a codeword of the codebook.
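The flow of method 500 can be summarised in a short sketch. The helper names are hypothetical: `quantise` stands in for the step 530 search over the chosen partition structure (k-d tree leaf or lattice decoder), and `seek_mode` for the mode seeking of method 600 described below:

```python
from collections import Counter
import numpy as np

def generate_codebook(features, quantise, seek_mode):
    """Outline of method 500: quantise each training feature vector to its
    partition centre (step 530), keep the 'active' centres with non-zero
    counts (step 540), run mode seeking from each active centre (step 550),
    and keep the unique modes as codewords (step 560)."""
    counts = Counter(tuple(quantise(f)) for f in features)    # steps 530-540
    modes = {tuple(np.round(seek_mode(np.array(c), counts), 6))
             for c in counts}                                 # step 550, deduplicated
    return [np.array(m) for m in modes]                       # step 560: one codeword per mode
```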
A codeword of the codebook is a representation of similar feature vectors. All the codewords of a codebook represent typical visual characteristics extracted from training images. By mapping each feature vector extracted from an image to the codewords, the image may be represented by a histogram indicating the distribution of typical visual characteristics in a feature space. For example, a codeword may describe shape characteristics of a curvy stroke, such as the length and orientation of the stroke. After a mapping operation, the histogram for an image belonging to handwritten text may have a higher number of occurrences of the codeword which describes the curvy stroke than the histogram for an image belonging to machine printed text. The codebook enables the text attribute classification module 173, when analysing feature vectors observed from a text region, to predict a likely text attribute class by comparing the occurrences of codewords of the codebook.

In one arrangement, as described below, an A* lattice is selected as the partition structure at the partition structure selection step 520. In such an arrangement, the plurality of partitions of the feature space is formed according to the A* lattice partition structure. Other lattices such as the A lattice, D lattice, D* lattice, Z lattice, and Leech lattice may also be selected. Accordingly, the partition structure may be determined based on one of an A* lattice, A lattice, D lattice, D* lattice, Z lattice, and Leech lattice.

A lattice is an infinite set of regularly spaced points in a Euclidean space. The term $A^*$ denotes a family of lattices, where a lattice $A_d^*$ denotes a lattice of dimension d. The $A^*$ family can be defined in terms of the A lattice family. The lattice $A_d$, defined in accordance with Equation (1) below:

$$A_d = \{\, p \in \mathbb{Z}^{d+1} \mid \textstyle\sum_{i=1}^{d+1} p_i = 0 \,\} \tag{1}$$

is a d-dimensional lattice that is embedded in $\mathbb{R}^{d+1}$ to make the coordinates integers, where p is a (d + 1)-dimensional lattice point with every dimension having an integer coordinate (i.e., p belongs to the (d + 1)-dimensional integer lattice $\mathbb{Z}^{d+1}$), and $p_i$ is the i-th coordinate of p. Thus, from the aforementioned definition of the lattice $A_d$, the lattice $A_d$ contains all such lattice points p with coordinates that sum to zero (0).

The family of lattices $A^*$ is defined to be the dual of $A_d$ and may be similarly embedded inside the same d-dimensional subspace, $H_d$, of $\mathbb{R}^{d+1}$, where the d-dimensional subspace $H_d$ is defined in accordance with Equation (2) below:

$$H_d = \{\, q \in \mathbb{R}^{d+1} \mid \textstyle\sum_{i=1}^{d+1} q_i = 0 \,\} \tag{2}$$

where q is a (d + 1)-dimensional point with every dimension having a real coordinate (i.e., q belongs to the (d + 1)-dimensional space of real numbers $\mathbb{R}^{d+1}$), and $q_i$ is the i-th coordinate of q. Thus, from the aforementioned definition of the d-dimensional subspace $H_d$, the d-dimensional subspace $H_d$ contains all such points q with coordinates that sum to zero (0).

Given a first lattice, a dual lattice is the set of dual lattice points where, for each dual lattice point in the dual lattice, the dot product between the dual lattice point and each lattice point in the first lattice is an integer. In other words, the family of lattices $A^*$ is defined by Equation (3) below:

$$A_d^* = \{\, q \in H_d \mid \forall p \in A_d,\ q \cdot p \in \mathbb{Z} \,\} \tag{3}$$

where q is a lattice point belonging to $H_d$ as previously defined, p is a point belonging to the lattice $A_d$ as previously defined, and $\cdot$ denotes the dot-product operation. Thus, from the aforementioned definition of the family of lattices $A^*$, the family of lattices $A^*$ contains all the lattice points q that produce integers when the dot-product operation is performed between the lattice point q and any arbitrary $A_d$ lattice point p.
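To make Equations (1) to (3) concrete, the following sketch checks the definitions numerically. It is illustrative only; the basis construction is our assumption, using the conventional $A_d$ basis vectors $e_i - e_{i+1}$, which is sufficient because the dot product is linear:

```python
import numpy as np

def a_basis(d):
    """A conventional basis for A_d (an assumption, not from the patent):
    the d vectors e_i - e_{i+1} in Z^{d+1}, each summing to zero."""
    b = np.zeros((d, d + 1), dtype=int)
    for i in range(d):
        b[i, i], b[i, i + 1] = 1, -1
    return b

def in_A(p):
    """Equation (1): p has integer coordinates that sum to zero."""
    p = np.asarray(p, dtype=float)
    return np.allclose(p, np.round(p)) and abs(p.sum()) < 1e-9

def in_A_star(q, d, tol=1e-9):
    """Equations (2) and (3): q lies in H_d (coordinates sum to zero) and its
    dot product with every A_d basis vector, hence with every A_d point,
    is an integer."""
    q = np.asarray(q, dtype=float)
    dots = a_basis(d) @ q
    return abs(q.sum()) < tol and np.allclose(dots, np.round(dots), atol=tol)
```

For instance, in_A_star([2/3, -1/3, -1/3], d=2) holds, reflecting that the two-dimensional lattice $A_2^*$ is the familiar hexagonal lattice.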
Fig. 7 shows a set of feature vectors in a two-dimensional projection 700 of the d-dimensional subspace $H_d$. Feature vectors are marked as a cross, such as cross 730. Solid circles, such as circles 710 and 720, indicate some lattice points in the projection of the subspace $H_d$.

The partition or subregion associated with a lattice point is called a Voronoi cell 740, which is the set of all points that are closer to that lattice point than to any other lattice point. Hence each partition or Voronoi cell corresponds to a single lattice point. The lattice point that is closer to a feature vector than any other lattice point can be found by using a lattice decoding or quantisation method.

Lattice points such as lattice points 720, 721, 722, 723 and 724 are active lattice points, as there are feature vectors that are closer to these lattice points than to any other lattice points. Each active lattice point has a record of the number of feature vectors, or a count, associated with it. For example, the lattice point 720 has a count of three (3) feature vectors. Lattice points with an associated zero count of feature vectors, such as lattice point 710 in Fig. 7, are inactive.

A Delaunay cell is the convex hull of all the lattice points whose Voronoi cells share a vertex. For example, the Voronoi cells associated with the lattice points 721, 722 and 723 share a vertex 750. A region 760 enclosed by dashed lines in Fig. 7 indicates the Delaunay cell surrounding the vertex 750. The vertices of a Delaunay cell are lattice points.
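The quantisation just mentioned can be illustrated with a brute-force search over a finite set of candidate lattice points; a real implementation would use the efficient lattice decoding described below, and all names here are illustrative.

    import numpy as np

    def nearest_lattice_point(x, lattice_points):
        # Return the candidate lattice point whose Voronoi cell contains x,
        # i.e. the closest candidate.  This brute-force search only works
        # over a finite candidate set; a practical system would use a proper
        # lattice decoding method instead.
        pts = np.asarray(lattice_points, dtype=float)
        return pts[np.argmin(np.linalg.norm(pts - np.asarray(x), axis=1))]

    # A patch of hexagonal lattice points (the two-dimensional member of the
    # A lattice family), purely for illustration.
    pts = [(i + 0.5 * j, j * np.sqrt(3) / 2) for i in range(-2, 3)
           for j in range(-2, 3)]
    print(nearest_lattice_point([0.4, 0.7], pts))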
The method 600 of generating a mode set for the active partition centres, as executed at step 550, will now be described in detail with reference to Fig. 6. The method 600 may be implemented as one or more software code modules of the software application program 133 resident on the hard disk drive 110 of the computer module and being controlled in their execution by the processor 105.

The method 600 begins at accessing step 601, where the active partition centres and associated counts determined at step 540 are accessed from the memory 106. In one arrangement, the partition structure is selected as an A* lattice. In such an arrangement, each active partition centre is an active lattice point, and the partition associated with a lattice point is the Voronoi cell of the lattice point. The method 600 will be described below by way of example where the partition structure is selected as an A* lattice and each active partition centre is an active lattice point.

At step 605, the processor 105 is used for selecting one of the active partition centres. As described above, an active partition centre is a partition centre having a non-zero count of feature vectors. A mode seeking process is then performed in the following steps 610 to 655 to determine the mode for each active partition centre, where each active partition centre is a lattice point.

As described below, the steps of the method 600 are used for determining a first one of the partitions containing a shift point in the feature space, the shift point having an offset from the determined partition according to a determined count of feature vectors associated with nearby ones of a plurality of partitions in the feature space, the offset moving towards a partition having a higher density of the feature vectors. As also described below, a second one of the partitions located adjacent to, and including, the first partition is then selected to move towards a density peak in the feature space. The second partition is selected according to determined counts of feature vectors in the second partition. The density peak is represented by a higher count of feature vectors in the second partition.

Then at an assigning step 610, the selected active partition centre, in the form of an active lattice point, is assigned as a candidate point using the processor 105. The selected active partition centre corresponds to one of the partitions having a non-zero count of feature vectors.

The method 600 continues at a mean shift step 615, where a mean shift method is executed to determine a shift point for the candidate point assigned at step 610. The partition containing the shift point determined at step 615 corresponds to a first one of the partitions in the feature space. The shift point determined at step 615 has an offset from the candidate point according to a determined count of feature vectors associated with selected ones of the plurality of active partitions in the feature space. The selected active partitions may be all the active partitions in the feature space. Alternatively, the selected active partitions may be nearby ones of the plurality of active partitions in the feature space, to reduce the computational complexity of determining the offset. The offset determined by the mean shift method moves towards a partition having a higher density of the plurality of feature vectors. The shift point may, for instance, be determined at step 615 in accordance with Equation (4) below:

$$y = \frac{\sum_{n=1}^{N} l_n\, K\!\left(x^{(j)} - l_n\right) c_n}{\sum_{n=1}^{N} K\!\left(x^{(j)} - l_n\right) c_n} \qquad (4)$$

where $y$ is the shift point, $x^{(j)}$ is the candidate point at the $j$-th iteration, $l_n$ is the $n$-th lattice point, $c_n$ is the count associated with the $n$-th lattice point, $N$ is the total number of lattice points of the selected active partitions, and $K$ is a predetermined kernel function, which may be a Gaussian kernel, a uniform kernel, etc.

After the shift point is determined at step 615, the processor 105 is used to determine the partition centres close to the shift point, which may be used for determining the partition containing the shift point. As described below, the vertices of the partition, in the form of a Delaunay cell, containing the shift point are determined at the determining step 620 using an A* lattice decoding method. For example, a suitable method for determining the vertices of the Delaunay cell at step 620 is first to determine the closest remainder-zero point and then to sort the differences between the closest remainder-zero point and the shift point. The resulting permutation and translation are used to determine the vertices of the Delaunay cell. The closest remainder-zero point may be determined at step 620 by rounding each coordinate of the shift point to the nearest multiple of (d + 1), where d is the dimension of the feature space. If the closest remainder-zero point so determined is outside the d-dimensional subspace $H_d$, then the coordinates that moved the furthest are identified and rounded in the other direction.

The vertices of the Delaunay cell are lattice points or partition centres corresponding to the first partition containing the shift point determined at step 615 and the partitions located adjacent to the first partition. The partition centre, in the form of a vertex or lattice point, with the highest count is then selected using the processor 105 at a selecting step 625.
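A minimal sketch of the Equation (4) update, assuming a Gaussian kernel and assuming the active partition centres $l_n$ and counts $c_n$ are held in numpy arrays, might look as follows; the names are illustrative.

    import numpy as np

    def mean_shift_point(x, centres, counts, bandwidth=1.0):
        # One evaluation of Equation (4): the count-weighted kernel mean of
        # the active partition centres l_n, which pulls the candidate point
        # towards regions of higher density.
        diff = centres - x
        # Gaussian kernel K(x - l_n); a uniform kernel would equally satisfy
        # Equation (4).
        k = np.exp(-0.5 * np.sum(diff ** 2, axis=1) / bandwidth ** 2)
        w = k * counts
        return (w[:, None] * centres).sum(axis=0) / w.sum()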
For example, the solid triangle 770 shown in Fig. 7 indicates the shift point for a candidate point such as candidate point 720. The region enclosed by dashed lines 760 indicates the Delaunay cell containing the shift point 770. The vertices of the Delaunay cell are lattice points 721, 722 and 723. In the example of Fig. 7, the vertex or lattice point 723 is selected, as the vertex 723 has the highest count compared with the other vertices 721 and 722, which correspond to partitions adjacent to the partition of the lattice point 723. The selected vertex 723 corresponds to a second one of the partitions in the feature space 700.

After the partition centre (i.e., in the form of a vertex) with the highest count is selected, at a next decision step 630 the processor 105 is used to check whether the selected partition centre is equal to the candidate point. If the selected partition centre is equal to the candidate point, then the shift point is updated to be the candidate point at an updating step 640. Otherwise, the candidate point is updated to be the selected partition centre (or vertex) at an updating step 635. The distance between the candidate point and the shift point is determined at a determining step 645. At a subsequent decision step 650, the processor 105 is used to determine whether the distance determined at step 645 is larger than a predetermined threshold. In one arrangement, for example, the predetermined threshold may be 0.01.

If the distance determined at step 645 is larger than the predetermined threshold, then the method 600 returns to step 615 and the candidate point is processed in another iteration of the mode seeking process performed by steps 615 to 655. Otherwise, the candidate point is added by the processor 105 to a mode set configured within the memory 106 at an adding step 655. The mode set contains the modes for all the active partitions.

At decision step 660, if there are further active partition centres to be processed, then the method 600 returns to step 605. Otherwise, the method 600 concludes and the mode set is used at the codeword forming step 560 to generate a codebook.
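Combining the steps just described, the mode seeking loop for a single active partition centre might be sketched as follows. It reuses mean_shift_point from the previous sketch; the neighbours_of callable, which would wrap the Delaunay decoding at step 620, is a hypothetical stand-in, and the maximum iteration count is a safety guard not present in the described method.

    import numpy as np

    def seek_mode(start, centres, counts, neighbours_of,
                  bandwidth=1.0, tol=0.01, max_iter=100):
        # neighbours_of(shift) must return an index array selecting the
        # active centres of the partition containing `shift` and the
        # partitions adjacent to it (Delaunay vertices for a lattice).
        candidate = np.asarray(start, dtype=float)
        for _ in range(max_iter):             # safety guard, not in the text
            shift = mean_shift_point(candidate, centres, counts, bandwidth)  # step 615
            idx = neighbours_of(shift)                                       # step 620
            best = centres[idx[np.argmax(counts[idx])]]                      # step 625
            if np.allclose(best, candidate):  # step 630
                shift = candidate             # step 640: forces termination
            else:
                candidate = best              # step 635
            if np.linalg.norm(candidate - shift) <= tol:  # steps 645 and 650
                break
        return candidate                      # step 655: the mode found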
In another arrangement, as described below, a k-d tree is selected as the partition structure at the partition structure selection step 520. In such an arrangement, the plurality of partitions of the feature space is formed according to the k-d tree structure. Other trees, such as a randomised tree, a random projection tree, and a k-d forest, may alternatively be used as the partition structure. Accordingly, the partition structure may be determined based on one of a k-d tree, a randomised tree, a random projection tree and a k-d forest.

A k-d tree, which is a binary tree, defines a hierarchical partitioning or splitting of a k-dimensional feature space into disjoint subregions or partitions. Each node of the k-d tree defines a corresponding partition of the original feature space, and hence the feature vectors contained in that partition. The root node of the k-d tree defines the whole feature space containing all the feature vectors. Each non-terminal node has two successors or child nodes representing the two partitions obtained by splitting the partition associated with the parent node using a splitting hyperplane.

There are a number of methods of selecting a splitting hyperplane. For example, the splitting hyperplane may be an axis-orthogonal plane, which splits orthogonally to the longest side of each subregion through the median coordinate of the associated feature vectors. Alternatively, the splitting hyperplane may be a plane perpendicular to the principal axis of the feature vectors contained in each partition, based on principal component analysis. The splitting hyperplane used at a non-terminal node is associated with that node and determines how the partition represented by the node is split into child nodes. Such a splitting process for a partition stops when the number of feature vectors contained in the partition is less than a predetermined threshold. In one example, the predetermined threshold may be thirty (30) feature vectors. A node that defines such a partition is a terminal node.

In a k-d tree arrangement of the described methods, each terminal node is associated with the centre of the partition defined by the terminal node and a count. The count represents the number of feature vectors contained in the partition defined by the terminal node. The partition centre may, for instance, be the mean or weighted mean of all the feature vectors contained in the partition. All the partition centres represented in a k-d tree are active partition centres.

In a k-d tree arrangement of the described methods, the k-d tree provides an efficient distance-based search, which may be used to determine feature vectors in the k-d tree that are close to a given query point. The determined feature vectors may, for instance, be the k nearest neighbours or fixed-radius near neighbours of the given query point. Searching for near neighbours can be implemented efficiently by using the properties of the k-d tree to quickly eliminate large portions of the search space.
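One possible sketch of the construction just described, assuming axis-orthogonal median splits and the example leaf threshold of thirty (30) feature vectors, is given below; the class and helper names are illustrative.

    import numpy as np

    class Node:
        # Terminal nodes carry a partition centre and a count; non-terminal
        # nodes carry the splitting axis and threshold.
        def __init__(self, axis=None, threshold=None, left=None, right=None,
                     centre=None, count=0):
            self.axis, self.threshold = axis, threshold
            self.left, self.right = left, right
            self.centre, self.count = centre, count

    def build_kdtree(vectors, min_leaf=30):
        vectors = np.asarray(vectors, dtype=float)
        if len(vectors) < min_leaf:
            # Terminal node: the partition centre is the mean of its vectors.
            return Node(centre=vectors.mean(axis=0), count=len(vectors))
        # Axis-orthogonal split through the median of the widest dimension.
        axis = int(np.argmax(vectors.max(axis=0) - vectors.min(axis=0)))
        threshold = float(np.median(vectors[:, axis]))
        mask = vectors[:, axis] <= threshold
        if mask.all() or not mask.any():      # degenerate split: stop here
            return Node(centre=vectors.mean(axis=0), count=len(vectors))
        return Node(axis=axis, threshold=threshold,
                    left=build_kdtree(vectors[mask], min_leaf),
                    right=build_kdtree(vectors[~mask], min_leaf))

    def leaves(node):
        # Yield (centre, count) for every terminal node: these are the
        # active partition centres used by the mode seeking step.
        if node.centre is not None:
            yield node.centre, node.count
        else:
            yield from leaves(node.left)
            yield from leaves(node.right)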
Fig. 8A shows a set of feature vectors in a two-dimensional feature space 800. The feature space 800 is partitioned by axis-orthogonal splitting lines. In Fig. 8A, feature vectors are marked as a cross, such as cross 820. The feature space 800 is split by a splitting line 830 into two partitions 855 and 865. One partition 855 is on the left side of the splitting line 830 and the other partition 865 is on the right side. The left partition 855 is then split by a splitting line 831 into two partitions 855A and 855B. The partition 855A is on the top side of the splitting line 831 and the partition 855B is on the bottom side. Similarly, the right partition 865 is split by a splitting line 832 into two partitions 865A and 865B. Solid circles, such as circles 840 and 841, indicate some of the partition centres obtained after the partitioning process has been executed on the feature space 800 to produce the partitions 855A, 855B, 865A and 865B.

Fig. 8B shows a k-d tree 890 constructed based on the partitioning process performed on the feature space 800 in Fig. 8A. Unfilled circles, such as circles 870, 871, and 872, indicate non-terminal nodes. Solid circles, such as circles 873 and 874, represent some terminal nodes. The root node 870 defines the whole feature space 800 and all the feature vectors shown in Fig. 8A. The non-terminal node 871 defines the left partition obtained by splitting the whole feature space with the splitting line 830. Thus, the splitting line 830 is associated with the root node 870. The terminal nodes 873 and 874 define the top and bottom partitions obtained by splitting the partition defined by the non-terminal node 871 using the splitting line 831.

As described above, the mode generation method 600 is executed at step 550 of the method 500. The active partition centres and associated counts identified at step 540 are accessed at step 601 of the method 600. In a k-d tree arrangement of the described methods, each active partition centre is the centre of the partition defined by a terminal node. A mode seeking process is performed to determine the mode for each active partition centre in accordance with steps 605 to 660.

At the assigning step 610, an active partition centre is selected and assigned as a candidate point. The selected active partition centre corresponds to one of the active partitions having a non-zero count of feature vectors. At the mean shift step 615, the mean shift method is then performed to determine a shift point for the candidate point. The partition containing the shift point determined at step 615 corresponds to a first one of the partitions in the feature space 800. The shift point has an offset from the candidate point and, as described above, may be determined in accordance with Equation (4).

After the shift point is determined at step 615, a number of active partition centres close to the shift point are determined at step 620. The active partition centres determined at step 620 correspond to the first partition containing the shift point and the partitions located close to the first partition. In the k-d tree arrangement of the described methods, the active partition centres close to the shift point may be determined using a k nearest neighbour search or a radius search. Such a near neighbour search uses the shift point as the query point. The parameter k (e.g., k = 5) or the radius (e.g., a radius of 1.2) used by the search is predetermined. The active partition centre with the highest count is then selected at step 625.

For example, the solid triangle 850 shown in Fig. 8A represents a shift point for a candidate point such as point 842. The dashed circle 860 of Fig. 8A represents the region considered for a fixed-radius search around the shift point 850. In the example of Fig. 8A, active partition centres 842 and 843 are determined by the fixed-radius search. The partition centre 843 is selected, as it has the highest count compared with the other partition centre 842. The selected partition centre corresponds to a second one of the partitions in the feature space 800.
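In the k-d tree arrangement, the neighbour search at step 620 might be sketched as follows; this is a brute-force stand-in for an efficient tree search, and the names and default values are illustrative.

    import numpy as np

    def near_neighbours(query, centres, radius=1.2, k=5):
        # Indices of active partition centres within `radius` of the query
        # (the fixed-radius search), falling back to the k nearest centres
        # if no centre lies inside the radius.
        centres = np.asarray(centres, dtype=float)
        dist = np.linalg.norm(centres - np.asarray(query), axis=1)
        idx = np.flatnonzero(dist <= radius)
        return idx if idx.size else np.argsort(dist)[:k]

A closure such as lambda s: near_neighbours(s, centres) then satisfies the neighbours_of contract assumed in the earlier mode seeking sketch; a production system would instead exploit the k-d tree itself to prune the search.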
As described above, after the active partition centre with the highest count is selected, at decision step 630 the processor 105 is used to check whether the selected active partition centre is equal to the candidate point. If it is, the shift point is updated to be the candidate point at step 640. Otherwise, the candidate point is updated to be the selected active partition centre at step 635. The distance between the candidate point and the shift point is determined at the determining step 645. Then at the subsequent decision step 650, the processor 105 is used to determine whether the distance is larger than the predetermined threshold (e.g., 0.01).

If the distance determined at step 645 is larger than the predetermined threshold, then the method 600 returns to step 615 and the candidate point is processed in another iteration of the mode seeking process performed by steps 615 to 655. Otherwise, the candidate point is added by the processor 105 to the mode set configured within the memory 106 at the adding step 655. After the modes for all the active partition centres have been determined, the mode set is used at the codeword forming step 560 to generate a codebook.

Regardless of the arrangement of the described methods used, step 530 is a data reduction step that groups feature vectors into disjoint partitions of a feature space. Each active partition centre is a representation of the feature vectors belonging to the same partition. The data reduction performed at step 530 reduces the number of points used by the mode seeking step 550, so that mode seeking is only performed on active partition centres.

In the mode seeking method 600, the mean shift step 615 is performed on each active partition centre to determine a shift point close to a mode. By exploiting the selected partition structure (e.g., a k-d tree or a lattice), the active partitions near the shift point can be determined efficiently at step 620. If a nearby active partition has the highest count (i.e., a higher density of the plurality of feature vectors in the feature space), the nearby active partition is closer to a desired mode than the shift point, or may even contain the desired mode. Therefore, selecting the centre of the nearby active partition with the highest count as the candidate point at step 635 increases the offset from the original active partition centre, and thus improves the convergence of the mode seeking. In addition, selecting the centre of the nearby active partition with the highest count improves robustness to spurious local maxima. For example, if the shift point gets trapped in a spurious local maximum, repeatedly applying the mean shift step 615 cannot move the shift point away, as the mean shift procedure converges to local maxima. Selecting the nearby active partition centre with the highest count allows the shift point to move away from the spurious local maximum and move close to a desired mode.

Industrial Applicability

The arrangements described are applicable to the computer and data processing industries and particularly for codebook generation, image segmentation and data clustering.

The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.

In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises", have correspondingly varied meanings.

Claims (11)

1. A method of generating a codebook for classifying text attributes in a document, the method comprising: determining, for each of a plurality of partitions in a feature space, a count of a plurality of feature vectors in the partition; selecting one of said partitions having a non-zero count of feature vectors; determining a first one of said partitions containing a shift point in the feature space, the shift point having an offset from the selected partition according to the determined count of feature vectors associated with one or more of the plurality of partitions, the offset moving towards a partition having a higher density of the plurality of feature vectors; selecting a second one of said partitions located adjacent to, and including, the first partition to move towards a density peak in said feature space, the second partition being selected according to the determined counts of the plurality of feature vectors; and determining a codeword for the codebook based on the selected second partition to generate the codebook for classifying the text attributes in the document.
2. The method according to claim 1, wherein the plurality of partitions of said feature space are formed according to a lattice structure.
3. The method according to claim 2, wherein the structure of the plurality of partitions is determined based on one of an A* lattice, A lattice, D lattice, D* lattice, Z lattice, and Leech lattice.
4. The method according to claim 1, wherein the plurality of partitions of said feature space are formed according to a tree structure.
5. The method according to claim 4, wherein the structure of the plurality of partitions is determined based on one of a k-d tree, a randomised tree, a random projection tree and a k-d forest.
6. The method according to claim 1, further comprising segmenting a bitmap image of the document into one or more text regions.
7. The method according to claim 6, further comprising classifying the text regions.
8. The method according to claim 1, further comprising performing optical character recognition on the document.
9. An apparatus for generating a codebook for classifying text attributes in a document, the apparatus comprising: means for determining, for each of a plurality of partitions in a feature space, a count of a plurality of feature vectors in the partition; means for selecting one of said partitions having a non-zero count of feature vectors; means for determining a first one of said partitions containing a shift point in the feature space, the shift point having an offset from the selected partition according to the determined count of feature vectors associated with one or more of the plurality of partitions, the offset moving towards a partition having a higher density of the plurality of feature vectors; means for selecting a second one of said partitions located adjacent to, and including, the first partition to move towards a density peak in said feature space, the second partition being selected according to the determined counts of the plurality of feature vectors; and means for determining a codeword for the codebook based on the selected second partition to generate the codebook for classifying the text attributes in the document.
10. A system for generating a codebook for classifying text attributes in a document, the system comprising: a memory for storing data and a computer program; a processor coupled to the memory for executing the computer program, said computer program comprising instructions for: determining, for each of a plurality of partitions in a feature space, a count of a plurality of feature vectors in the partition; selecting one of said partitions having a non-zero count of feature vectors; determining a first one of said partitions containing a shift point in the feature space, the shift point having an offset from the selected partition according to the determined count of feature vectors associated with one or more of the plurality of partitions, the offset moving towards a partition having a higher density of the plurality of feature vectors; selecting a second one of said partitions located adjacent to, and including, the first partition to move towards a density peak in said feature space, the second partition being selected according to the determined counts of the plurality of feature vectors; and determining a codeword for the codebook based on the selected second partition to generate the codebook for classifying the text attributes in the document.
11. A computer readable medium having a computer program stored thereon for generating a codebook for classifying text attributes in a document, the program comprising: code for determining, for each of a plurality of partitions in a feature space, a count of a plurality of feature vectors in the partition; code for selecting one of said partitions having a non-zero count of feature vectors; code for determining a first one of said partitions containing a shift point in the feature space, the shift point having an offset from the selected partition according to the determined count of feature vectors associated with one or more of the plurality of partitions, the offset moving towards a partition having a higher density of the plurality of feature vectors; code for selecting a second one of said partitions located adjacent to, and including, the first partition to move towards a density peak in said feature space, the second partition being selected according to the determined counts of the plurality of feature vectors; and code for determining a codeword for the codebook based on the selected second partition to generate the codebook for classifying the text attributes in the document. CANON KABUSHIKI KAISHA Patent Attorneys for the Applicant/Nominated Person SPRUSON & FERGUSON
AU2013260720A 2013-11-22 2013-11-22 Method, apparatus and system for generating a codebook Abandoned AU2013260720A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2013260720A AU2013260720A1 (en) 2013-11-22 2013-11-22 Method, apparatus and system for generating a codebook

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2013260720A AU2013260720A1 (en) 2013-11-22 2013-11-22 Method, apparatus and system for generating a codebook

Publications (1)

Publication Number Publication Date
AU2013260720A1 true AU2013260720A1 (en) 2015-06-11

Family

ID=53275841

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2013260720A Abandoned AU2013260720A1 (en) 2013-11-22 2013-11-22 Method, apparatus and system for generating a codebook

Country Status (1)

Country Link
AU (1) AU2013260720A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107545038A (en) * 2017-07-31 2018-01-05 中国农业大学 Text classification method and equipment
CN107545038B (en) * 2017-07-31 2019-12-10 中国农业大学 Text classification method and equipment

Similar Documents

Publication Publication Date Title
Shi et al. Script identification in the wild via discriminative convolutional neural network
US8737739B2 (en) Active segmentation for groups of images
Yi et al. Feature representations for scene text character recognition: A comparative study
US7570816B2 (en) Systems and methods for detecting text
Rothacker et al. Bag-of-features representations for offline handwriting recognition applied to Arabic script
CN110363049B (en) Method and device for detecting, identifying and determining categories of graphic elements
CN109697451B (en) Similar image clustering method and device, storage medium and electronic equipment
US20160026848A1 (en) Global-scale object detection using satellite imagery
Gomez et al. A fast hierarchical method for multi-script and arbitrary oriented scene text extraction
Zhang et al. Automatic discrimination of text and non-text natural images
US20110295778A1 (en) Information processing apparatus, information processing method, and program
Wu et al. Scale-invariant visual language modeling for object categorization
CN112163114B (en) Image retrieval method based on feature fusion
Bharath et al. Scalable scene understanding using saliency-guided object localization
Mandal et al. Bag-of-visual-words for signature-based multi-script document retrieval
Li et al. Multilingual text detection with nonlinear neural network
Khalifa et al. A novel word based Arabic handwritten recognition system using SVM classifier
Tong et al. A review of indoor-outdoor scene classification
Parikh et al. Determining patch saliency using low-level context
Altun et al. SKETRACK: stroke-based recognition of online hand-drawn sketches of arrow-connected diagrams and digital logic circuit diagrams
Chherawala et al. Arabic word descriptor for handwritten word indexing and lexicon reduction
Dai et al. Discovering scene categories by information projection and cluster sampling
Farhangi et al. Improvement the bag of words image representation using spatial information
CN111553442B (en) Optimization method and system for classifier chain tag sequence
AU2013260720A1 (en) Method, apparatus and system for generating a codebook

Legal Events

Date Code Title Description
MK4 Application lapsed section 142(2)(d) - no continuation fee paid for the application