AU2012268796A1 - Directional stroke width variation feature for script recognition - Google Patents

Directional stroke width variation feature for script recognition Download PDF

Info

Publication number
AU2012268796A1
AU2012268796A1 AU2012268796A AU2012268796A AU2012268796A1 AU 2012268796 A1 AU2012268796 A1 AU 2012268796A1 AU 2012268796 A AU2012268796 A AU 2012268796A AU 2012268796 A AU2012268796 A AU 2012268796A AU 2012268796 A1 AU2012268796 A1 AU 2012268796A1
Authority
AU
Australia
Prior art keywords
text
region
pixels
subregion
classifying
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2012268796A
Inventor
Getian Ye
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to AU2012268796A priority Critical patent/AU2012268796A1/en
Publication of AU2012268796A1 publication Critical patent/AU2012268796A1/en
Abandoned legal-status Critical Current

Links

Landscapes

  • Character Discrimination (AREA)

Abstract

Abstract DIRECTIONAL STROKE WIDTH VARIATION FEATURE FOR SCRIPT RECOGNITION A method of classifying a region of text (310) in a bitmap image as one of a predetermined 5 number of scripts, the method comprising receiving (311) the region of text from the bitmap image, generating (320) a thinned representation of the received region of text by eroding (460) edge pixels in the region, each of the edge pixels being eroded in one of a plurality of predetermined directions using predefined structural elements (500) of pixels in sequence, forming (330) a feature vector based on a count of the eroded edge pixels in each of the 10 predetermined directions; and classifying (340) the region of text into one of the predetermined number of scripts according to the feature vector.

Description

1 DIRECTIONAL STROKE WIDTH VARIATION FEATURE FOR SCRIPT RECOGNITION TECHNICAL FIELD The present invention relates generally to document image processing and in particular to the 5 identification of script in a document image. A script is a graphical form used by a writing system, and is related to (but not equivalent to) language. The present invention also relates to a method and apparatus for identifying script in a document, and to a computer program product including a computer readable medium having recorded thereon a computer program for identifying script in a document. 10 BACKGROUND Documents such as letters, posters, forms, etc. are commonly used to convey important information. Even though documents are typically intended to convey information directly to human readers, it can be useful to computationally understand the content of documents. This allows automation of many routine tasks associated with documents, such as sorting, filing, 15 summarising, cross-referencing, etc., thereby allowing human attention to focus more directly on the relevant information rather than the document formatting. Often documents are not in a convenient format for automation. In these cases, some form of document analysis is required, in order to determine the necessary information about the document. When the document is stored as a digitised image, document image processing 20 techniques may be used to derive information from the document image. One well-known example of a document image processing technique is Optical Character Recognition (OCR), which extracts the textual content of a document image. Script recognition is a document image processing technique that identifies the script used in an image that depicts text. As previously noted a script is the graphical form used by a writing 25 system, and is related to (but not equivalent to) language. Some languages share a script with other languages. Thus, for example, both English and French use the Latin script. Some languages such as Japanese use multiple scripts. An example of a document using a number of scripts is shown in Fig. 14. Knowledge of a script used in a document provides insight into which language or language family the document uses, which can enable some forms of 2 computational automation that would otherwise be less reliable or slower without this information. Script identification and recognition relies on the fact that each script has unique visual attributes that make it possible to distinguish it from other scripts. Hence, one critical 5 component of script recognition is the set of visual features extracted from text regions within a given document image. The feature extraction can be performed at different levels inside a document image, i.e., page level, paragraph/text block level, textline level, and even word/character level. Another major component of a script recognition system is a script classifier, which is often designed using a machine learning algorithm. The classifier is trained 10 or learns using numerous training samples composed of features extracted from training images. Then the trained classifier is used to identify or predict the script class for an input test sample with unknown script class. Various classifiers and their methods of training are known in the Machine Learning discipline of statistical analysis. One classifier, called a Support Vector Machine (SVM), uses n 15 dimensional vectors as its training samples. The SVM considers these training samples to be points in an n-dimensional vector space (called the "feature space"). To distinguish between two different classes (e.g. two different scripts), the SVM determines an (n-1)-dimensional hyperplane that maximally separates the training samples of the two different classes. A hyperplane is said to maximally separate the training samples of two classes when the following 20 conditions are true: * The hyperplane separates the feature space in two, such that each of the training samples of one class lies on one side of the hyperplane, and each of the training samples of the other class lies on the other side of the hyperplane; and * The shortest distance between any point on the hyperplane and any training sample is 25 maximised. When the training samples of the two classes cannot be separated by a hyperplane, a variation of the aforementioned conditions may be used, whereby some training samples are allowed to lie on the incorrect side of the hyperplane. In this variation, the number of and/or severity (in terms of distance from closest points on the hyperplane) of such incorrectly separated training 30 points are minimised, further to the aforementioned conditions applying to the correctly separated training points.
3 To classify a sample of an unknown class, the SVM determines on which side of the determined hyperplane the sample lies, and classifies the sample accordingly. When the training samples are members of more than two classes, a Multiclass Support Vector Machine can be created using Support Vector Machines for all the pairs of classes. 5 Another solution to the classification problem uses a technique called Kernel Density Estimation (KDE). Similarly to the Support Vector Machine method, the KDE method considers its training samples to be points in the feature space. Unlike the Support Vector Machine method, the KDE method uses a kernel to estimate the distribution (i.e. the probability density) of each class across the feature space. An estimated distribution for a class is formed 10 by placing a shaped copy of the kernel (shaped according to a "bandwidth matrix") at each training sample for that class. The sum of these shaped kernels gives a distribution across the feature space. A classifier called a Parzen classifier uses the estimated distributions of multiple classes to classify samples of unknown class located in the feature space, by assigning the class with highest estimated density at the sample location. 15 Where a Parzen classifier is infeasible due to the number of training samples, an approximation to the KDE method can be created by binning (i.e. forming a histogram of) sample points, and calculating each density estimate using bin centres weighted by the numbers of samples in the bins. A Parzen classifier can then be formed using the approximated density estimates for the classes. 20 A first conventional approach for script recognition is to identify significant structures within the text region in question. For example, one script recognition method examines the horizontal runs of consecutive pixels within a character to identify structural features indicative of certain scripts. The method identifies upward concavities of text characters, where a horizontal intra character gap is bridged at some lower row of the character. The distribution of these upward 25 branching structures is used to distinguish between Han and Latin scripts. This method produces its best results when the characters of the text region are well formed and not degraded by poor image quality, but has difficulties when these conditions aren't possible. Another script recognition method uses mathematical morphology to identify long stroke structures at four orientations in a text region. The minimal target stroke length is determined 30 based on the average connected component (CC) height in the text region in question, and then a structural element is created per orientation as a rotated rectangle of this stroke length. At each orientation, the image is eroded once by the associated structural element to produce a 6982860 1 / P053280 / filed specification 4 marker image. Morphological reconstruction is then performed using the original text region as the mask image, which has the effect of selecting CCs with long strokes of the associated orientation. The fractions of reconstructed pixels in the text region area at each orientation aid in discriminating between Kannada, Hindi, Urdu and Latin scripts. This method is reliant on an 5 appropriate selection of a target stroke length, which may be difficult for heterogeneous regions e.g. regions containing multiple font sizes. Another conventional approach for script recognition is to consider a text region based on its overall appearance as a texture rather than by looking for features of the constituent characters. One such script recognition method uses Gabor filtering at 16 orientations on the text region to 10 form a feature. The text region must be of normalised size for features of a script to correspond optimally. This can be a difficult requirement to meet. Another script recognition method considers appearance properties of a text region as a whole, by calculating a horizontal projection profile of the text region, and searching the projection profile for peaks. Some script families can be characterised by the location of these peaks. 15 This method can be useful for identifying distinct script families apart, but can have difficulties distinguishing between more closely related scripts. Another conventional approach for script recognition is to recognise the characters segments in a text region. This is typically achieved by comparing character segments extracted from the text region against a large database of known characters segments and their associated scripts. 20 This comparison may take place in a transformed space or an otherwise non-pixel representation of the region, to speed up the comparison or to improve its robustness against minor variations, etc. Methods using this approach typically require a relatively large amount of context in the text region to perform well, as well as requiring a large database to compare against. 25 SUMMARY It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements. Disclosed are arrangements, referred to as Erosion Based Feature Vector (EBFV) arrangements, which seek to address the above problems by forming a feature vector for script recognition 30 from a sequence of iterative pixel-level erosions of a text region in question. The feature vector 5 so formed can then be processed by a classifier to identify the script of the text region. The number of pixels eroded at each step, corresponding to different directions, is discriminative of different scripts. According to a first aspect of the present invention, there is provided a method of classifying a 5 region of text in a bitmap image as one of a predetermined number of scripts, the method comprising: receiving the region of text from the bitmap image; generating a thinned representation of the received region of text by eroding edge pixels in the region, each of the edge pixels being eroded in one of a plurality of predetermined 10 directions using predefined structural elements of pixels in sequence; forming a feature vector based on a count of the eroded edge pixels in each of the predetermined directions; and classifying the region of text into one of the predetermined number of scripts according to the feature vector. 15 According to another aspect of the present invention, there is provided an apparatus for classifying a region of text in a bitmap image as one of a predetermined number of scripts, the apparatus comprising: a processor; and a memory storing a computer executable software program for directing the processor to 20 perform a method comprising the steps of: receiving the region of text from the bitmap image; generating a thinned representation of the received region of text by eroding edge pixels in the region, each of the edge pixels being eroded in one of a plurality of predetermined directions using predefined structural elements of pixels in sequence; 25 forming a feature vector based on a count of the eroded edge pixels in each of the predetermined directions; and classifying the region of text into one of the predetermined number of scripts according to the feature vector. According to another aspect of the present invention, there is provided a non-transitory 30 computer readable medium storing a computer executable software program for directing a processor to execute a method for classifying a region of text in a bitmap image as one of a predetermined number of scripts, program comprising: software executable code for receiving the region of text from the bitmap image; 6 software executable code for generating a thinned representation of the received region of text by eroding edge pixels in the region, each of the edge pixels being eroded in one of a plurality of predetermined directions using predefined structural elements of pixels in sequence; software executable code for forming a feature vector based on a count of the eroded 5 edge pixels in each of the predetermined directions; and software executable code for classifying the region of text into one of the predetermined number of scripts according to the feature vector. Other aspects of the invention are also disclosed. BRIEF DESCRIPTION OF THE DRAWINGS 10 One or more embodiments of the invention will now be described with reference to the following drawings, in which: Fig. 1 is a functional block diagram of an OCR system utilising script recognition according to an EBFV arrangement; Fig. 2 is an illustration of various text region selection approaches whereby a region of text may 15 be received by an EBFV arrangement; Fig. 3 is a schematic flow diagram illustrating a method of script recognition according to an EBFV arrangement; Fig. 4 is a schematic flow diagram illustrating a method of iteratively thinning a text region using a sequence of directional structural elements according to an EBFV arrangement; 20 Fig. 5 is a diagram illustrating a collection of directional structural elements as used in the method of Fig. 4; Fig. 6a is a diagram illustrating a structural element; Fig. 6b is a diagram illustrating an image region; Fig. 6c is a diagram illustrating the identified pixels of the image region of Fig. 6b using the 25 structural element of Fig. 6a; 7 Fig. 7 is a diagram illustrating one iteration of the iterative thinning method of Fig. 4, performed on a text region using the directional structural elements of Fig. 5; Fig. 8 is a schematic flow diagram illustrating a method of iteratively thinning a text region using a sequence of directional lookup tables according to an EBFV arrangement; 5 Fig. 9a is a diagram illustrating a structural element; Fig. 9b is a diagram illustrating an index weight matrix for converting an image subregion to a lookup table index for the structural element of Fig. 9a; Fig. 9c is a diagram illustrating an image subregion that the structural element of Fig. 9a matches, as well as its weighted index matrix; 10 Fix. 9d is a diagram illustrating an image subregion that the structural element of Fig. 9a doesn't match, as well as its weighted index matrix; Fig. 9e is a diagram illustrating a lookup table for the structural element of Fig. 9a, showing the entries for the image subregions of Fig. 9c and Fig. 9d; Fig. 10a is a diagram illustrating an offline training method for a Parzen classifier using an A* 15 lattice histogram; Fig. 10b is a diagram illustrating an offline classification method for a Parzen classifier using an A*-lattice histogram; Fig. 11 a is a diagram illustrating an online training method for a Parzen classifier using an A* lattice histogram; 20 Fig. 1 lb is a diagram illustrating an online classification method for a Parzen classifier using an A*-lattice histogram; Fig. 12 is a diagram illustrating directional feature vector formation for an example set of directional pixel counts and an example normalisation factor; Figs. 13A and 13B form a schematic block diagram of a general purpose computer system upon 25 which EBFV arrangements described can be practiced; 8 Fig. 14 is an example of a document using a number of scripts; Fig. 15 is a diagram illustrating some training samples in a two-dimensional projection of Hd; and Fig. 16 is a diagram illustrating a hash table describing the training samples depicted in Fig. 15. 5 DETAILED DESCRIPTION INCLUDING BEST MODE Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears. 10 It is to be noted that the discussions contained in the "Background" section and the section above relating to prior art arrangements relate to discussions of documents or devices which may form public knowledge through their respective publication and/or use. Such discussions should not be interpreted as a representation by the present inventor(s) or the patent applicant that such documents or devices in any way form part of the common general knowledge in the 15 art. The EBFV arrangements described below effectively perform script recognition without the need for parameter selection. The script recognition facilitates further processing such as OCR language engine selection. Context 20 Fig. 1 depicts a system for performing OCR on a document image. The OCR system 100 processes a bitmap image 111 of an input document to produce an electronic document 190 that can be edited in a word processing environment or indexed using typical text search tools. The bitmap image 111 may be produced by any of a number of sources, such as by a scanner 120 scanning a hardcopy document 110, by retrieval from a data storage system 130 such as a 25 hard disk having a database of images stored on the hard disk, or by digital photography using a camera 140. These are merely examples of how the bitmap image 111 might be provided. As another example, the bitmap image 111 could be created by a software application as an extension of that software application's printing functionality. 6982860 1 / P053280 / filed specification 9 The OCR system 100 performs text region selection in a step 150, whereby at least one text region is extracted from the bitmap image 111. The process 150, as well as processes 160, 170, and 190, as well as OCR engines 182, 184, 186 and 188 can, in one EBFV arrangement, be implemented as software modules in an EBFV software application 1333 executing on a 5 general-purpose computer module 1301 described hereinafter in more detail with reference to Figs. 13A and 13B. The OCR system 100 then performs script recognition in the following step 160 on each of the selected text regions. The script recognition results are considered by OCR language prediction in the following step 170 that determines which of the language-specific OCR engines 182, 184, 186 and 188 from a bank of OCR engines 180 are appropriate for the 10 bitmap image 111. For example, if the script recognition results suggest a high incidence of Latin text, the OCR language prediction step 170 would consider the English-language OCR 182 and possibly the French-language OCR 184 to be appropriate, but neither the Traditional Chinese OCR 186 nor the Japanese-language OCR 188 would be deemed appropriate. Finally, the OCR system 100 processes the bitmap image 111 using the appropriate OCR engines to 15 produce an electronic document in the following step 190 with recognised text. Figs. 13A and 13B depict a general-purpose computer system 1300, upon which the various EBFV arrangements described can be practiced. As seen in Fig. 13A, the computer system 1300 includes: the computer module 1301; input devices such as a keyboard 1302, a mouse pointer device 1303, a scanner 1326, a camera 1327, 20 and a microphone 1380; and output devices including a printer 1315, a display device 1314 and loudspeakers 1317. An external Modulator-Demodulator (Modem) transceiver device 1316 may be used by the computer module 1301 for communicating to and from a communications network 1320 via a connection 1321. The communications network 1320 may be a wide-area network (WAN), such as the Internet, a cellular telecommunications network, or a private 25 WAN. Where the connection 1321 is a telephone line, the modem 1316 may be a traditional "dial-up" modem. Alternatively, where the connection 1321 is a high capacity (e.g., cable) connection, the modem 1316 may be a broadband modem. A wireless modem may also be used for wireless connection to the communications network 1320. The computer module 1301 typically includes at least one processor unit 1305, and a memory 30 unit 1306. For example, the memory unit 1306 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM). The computer module 1301 also includes an number of input/output (1/0) interfaces including: an audio-video interface 10 1307 that couples to the video display 1314, loudspeakers 1317 and microphone 1380; an 1/0 interface 1313 that couples to the keyboard 1302, mouse 1303, scanner 1326, camera 1327 and optionally a joystick or other human interface device (not illustrated); and an interface 1308 for the external modem 1316 and printer 1315. In some implementations, the modem 1316 may be 5 incorporated within the computer module 1301, for example within the interface 1308. The computer module 1301 also has a local network interface 1311, which permits coupling of the computer system 1300 via a connection 1323 to a local-area communications network 1322, known as a Local Area Network (LAN). As illustrated in Fig. 13A, the local communications network 1322 may also couple to the wide network 1320 via a connection 1324, which would 10 typically include a so-called "firewall" device or device of similar functionality. The local network interface 1311 may comprise an EthernetTM circuit card, a BluetoothTM wireless arrangement or an IEEE 802.11 wireless arrangement; however, numerous other types of interfaces may be practiced for the interface 1311. The 1/0 interfaces 1308 and 1313 may afford either or both of serial and parallel connectivity, 15 the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 1309 are provided and typically include a hard disk drive (HDD) 1310. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 1312 is typically provided to act as a non-volatile source of data. Portable memory devices, 20 such optical disks (e.g., CD-ROM, DVD, Blu ray DiscTM), USB-RAM, portable, external hard drives, and floppy disks, for example, may be used as appropriate sources of data to the system 1300. The components 1305 to 1313 of the computer module 1301 typically communicate via an interconnected bus 1304 and in a manner that results in a conventional mode of operation of the 25 computer system 1300 known to those in the relevant art. For example, the processor 1305 is coupled to the system bus 1304 using a connection 1318. Likewise, the memory 1306 and optical disk drive 1312 are coupled to the system bus 1304 by connections 1319. Examples of computers on which the described arrangements can be practised include IBM-PC's and compatibles, Sun Sparcstations, Apple MacTM or a like computer systems. 30 The EBFV method may be implemented using the computer system 1300 wherein the processes of Figs. 3, 4, 8, 10a, 10b, 1 la, and 11 b, to be described, may be implemented as one or more EBFV software application programs 1333 executable within the computer system 1300. In 11 particular, the steps of the EBFV method are effected by instructions 1331 (see Fig. 13B) in the EBFV software 1333 that are carried out within the computer system 1300. The software instructions 1331 may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first 5 part and the corresponding code modules performs the EBFV methods and a second part and the corresponding code modules manage a user interface between the first part and the user. The EBFV software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 1300 from the computer readable medium, and then executed by the computer system 1300. A 10 computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product. The use of the computer program product in the computer system 1300 preferably effects an advantageous apparatus for implementing the EBFV arrangements. The EBFV software 1333 is typically stored in the HDD 1310 or the memory 1306. The 15 software is loaded into the computer system 1300 from a computer readable medium, and executed by the computer system 1300. Thus, for example, the EBFV software 1333 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 1325 that is read by the optical disk drive 1312. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program 20 product in the computer system 1300 preferably effects an apparatus for implementing the EBFV arrangements. In some instances, the EBFV application programs 1333 may be supplied to the user encoded on one or more CD-ROMs 1325 and read via the corresponding drive 1312, or alternatively may be read by the user from the networks 1320 or 1322. Still further, the software can also be 25 loaded into the computer system 1300 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 1300 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu rayTM Disc, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical 30 disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 1301. Examples of transitory or non tangible computer readable transmission media that may also participate in the provision of 12 software, application programs, instructions and/or data to the computer module 1301 include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like. 5 The second part of the EBFV application programs 1333 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 1314. Through manipulation of typically the keyboard 1302 and the mouse 1303, a user of the computer system 1300 and the application may manipulate the interface in a functionally adaptable manner to provide 10 controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 1317 and user voice commands input via the microphone 1380. Fig. 13B is a detailed schematic block diagram of the processor 1305 and a "memory" 1334. 15 The memory 1334 represents a logical aggregation of all the memory modules (including the HDD 1309 and semiconductor memory 1306) that can be accessed by the computer module 1301 in Fig. 13A. When the computer module 1301 is initially powered up, a power-on self-test (POST) program 1350 executes. The POST program 1350 is typically stored in a ROM 1349 of the 20 semiconductor memory 1306 of Fig. 13A. A hardware device such as the ROM 1349 storing software is sometimes referred to as firmware. The POST program 1350 examines hardware within the computer module 1301 to ensure proper functioning and typically checks the processor 1305, the memory 1334 (1309, 1306), and a basic input-output systems software (BIOS) module 1351, also typically stored in the ROM 1349, for correct operation. Once the 25 POST program 1350 has run successfully, the BIOS 1351 activates the hard disk drive 1310 of Fig. 13A. Activation of the hard disk drive 1310 causes a bootstrap loader program 1352 that is resident on the hard disk drive 1310 to execute via the processor 1305. This loads an operating system 1353 into the RAM memory 1306, upon which the operating system 1353 commences operation. The operating system 1353 is a system level application, executable by the processor 30 1305, to fulfil various high level functions, including processor management, memory management, device management, storage management, software application interface, and generic user interface.
13 The operating system 1353 manages the memory 1334 (1309, 1306) to ensure that each process or application running on the computer module 1301 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the system 1300 of Fig. 13A must be used properly so that each process 5 can run effectively. Accordingly, the aggregated memory 1334 is not intended to illustrate how particular segments of memory are allocated (unless otherwise stated), but rather to provide a general view of the memory accessible by the computer system 1300 and how such is used. As shown in Fig. 13B, the processor 1305 includes a number of functional modules including a control unit 1339, an arithmetic logic unit (ALU) 1340, and a local or internal memory 1348, 10 sometimes called a cache memory. The cache memory 1348 typically includes a number of storage registers 1344 - 1346 in a register section. One or more internal busses 1341 functionally interconnect these functional modules. The processor 1305 typically also has one or more interfaces 1342 for communicating with external devices via the system bus 1304, using a connection 1318. The memory 1334 is coupled to the bus 1304 using a connection 15 1319. The EBFV application program 1333 includes a sequence of instructions 1331 that may include conditional branch and loop instructions. The program 1333 may also include data 1332 which is used in execution of the program 1333. The instructions 1331 and the data 1332 are stored in memory locations 1328, 1329, 1330 and 1335, 1336, 1337, respectively. Depending upon the 20 relative size of the instructions 1331 and the memory locations 1328-1330, a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 1330. Alternately, an instruction may be segmented into a number of parts each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 1328 and 1329. 25 In general, the processor 1305 is given a set of instructions which are executed therein. The processor 1105 waits for a subsequent input, to which the processor 1305 reacts to by executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 1302, 1303, data received from an external source across one of the networks 1320, 1302, data retrieved from one of the 30 storage devices 1306, 1309 or data retrieved from a storage medium 1325 inserted into the corresponding reader 1312, all depicted in Fig. 13A. The execution of a set of the instructions 14 may in some cases result in output of data. Execution may also involve storing data or variables to the memory 1334. The disclosed EBFV arrangements use input variables 1354, which are stored in the memory 1334 in corresponding memory locations 1355, 1356, 1357. The EBFV arrangements produce 5 output variables 1361, which are stored in the memory 1334 in corresponding memory locations 1362, 1363, 1364. Intermediate variables 1358 may be stored in memory locations 1359, 1360, 1366 and 1367. Referring to the processor 1305 of Fig. 13B, the registers 1344, 1345, 1346, the arithmetic logic unit (ALU) 1340, and the control unit 1339 work together to perform sequences of micro 10 operations needed to perform "fetch, decode, and execute" cycles for every instruction in the instruction set making up the program 1333. Each fetch, decode, and execute cycle comprises: * a fetch operation, which fetches or reads an instruction 1331 from a memory location 1328, 1329, 1330; e a decode operation in which the control unit 1339 determines which instruction has been 15 fetched; and * an execute operation in which the control unit 1339 and/or the ALU 1340 execute the instruction. Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 1339 stores or writes a 20 value to a memory location 1332. Each step or sub-process in the processes of Figs. 3, 4, 8, 10a, 10b, 1la, and 1 lb is associated with one or more segments of the program 1333 and is performed by the register section 1344, 1345, 1347, the ALU 1340, and the control unit 1339 in the processor 1305 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the 25 noted segments of the program 1333. The EBFV arrangements may alternatively be implemented in dedicated hardware such as one or more gate arrays and/or integrated circuits performing the EBFV functions or sub functions. Such dedicated hardware may also include graphic processors, digital signal processors, or one 15 or more microprocessors and associated memories. If gate arrays are used, the process flow charts in Figs. 3, 4, 8, 10a, 10b, 1 la, and 1 lb are converted to Hardware Description Language (HDL) form. This HDL description is converted to a device level netlist which is used by a Place and Route (P&R) tool to produce a file which is downloaded to the gate array to program 5 it with the design specified in the HDL description. The characteristics of the text region selection step 150, performed by the processor 1305 directed by the software program 1333, will now be described with reference to Fig. 2. Fig. 2. illustrates a bitmap image 200 depicting multiple paragraphs of text. The purpose of the text region selection step 150 is to extract at least one text region of the bitmap image that is 10 appropriate for processing by the script recognition step 160. The method employed by the step 150 for text region selection may use some level of text segmentation to break the text up into meaningful elements. For example, a text region 210 contains a single word, a text region 220 contains a single text line, and a text region 230 contains a single text paragraph. Alternatively, text regions can be selected without significant text segmentation. For example, a text region 15 240 is a square-shaped subregion of 200 containing text, including some partial characters as well as full characters. In the extreme case, the entire bitmap image 200 could be selected as a text region. The various methods for text region selection effected by the step 150 provide different amounts of context for the script recognition step 160 and provide specificity to the script 20 recognition results. The selection method 150 should be chosen according to the intended use of the OCR system 100. For example, if the OCR system 100 is to be used for indexing search terms found in the bitmap image 111, it would be sensible for the selection method 150 to select regions of a similar quantum to a typical search term, e.g. containing at least a word and at most a text line. The script recognition method 160 described hereinafter can function with any of the 25 various selection methods used by the step 150. Overview of the EBFV arrangement The script recognition step 160 can in one example be performed as illustrated in a script recognition flowchart 300 of Fig. 3. A text region 310 is received, as depicted by an arrow 311, from the embodying system, for example from the text region selection step 150 of the OCR 30 system 100. A following directional iterative thinning process 320, performed by the processor 1305 directed by the software program 1333, processes and erodes the text region, counting and eroding edge pixels identified in each of a sequence of directions. A subsequent directional 16 feature vector formation step 330, performed by the processor 1305 directed by the software program 1333, forms a feature vector from the counts of removed edge pixels in each direction. The feature vector is used by a subsequent script classification step 340, performed by the processor 1305 directed by the software program 1333, for classifying the text region to one of 5 a predetermined number of scripts by identifying script present in the text region 310. This denotes the endpoint 399 of the script recognition method 160. Arrangement 1 In a first EBFV arrangement, the directional iterative thinning process 320 of Fig. 3 proceeds as illustrated in a thinning flowchart 400 depicted in Fig. 4. Thinning means reducing binary 10 objects or shapes to strokes that are less wide than in the original binary objects or shapes, possibly reducing to strokes that are single pixel wide. A received text region 410 (equivalent to the text region 310 of Fig. 3) is expected to be a bilevel image. If the text region 410 is not bilevel, it is first binarised by a region binariser 420 using a method such as Otsu's method, such that "on" pixels substantially depict text and "off' pixels substantially depict non-text. 15 Otsu's method can be used to automatically perform histogram-based image thresholding or the reduction of a gray level image to a binary image. This binarised text region then replaces the received text region 410 for the remainder of the first EBFV arrangement. The bilevel text region 410 is eroded using directional structural elements according to a sequence controlled by a direction iterator process (consisting of a direction initialiser process 20 425a, performed by the processor 1305 directed by the software program 1333, and a direction advancer process 425b, performed by the processor 1305 directed by the software program 1333,). In one example, the direction sequence is south, south-west, east, south-east, west, north-east, north and north-west. Other direction sequences may also be used. The direction initialiser process 425a sets an initial direction to be the first direction of the sequence (e.g. 25 south). The iteration advancer process 425b advances the direction in the sequence if possible; otherwise it terminates the direction iterator process when the final direction in the sequence has been processed. The processing of a full direction sequence is called a "pass". For each direction in sequence, a structural element for that direction is selected by a step 430, performed by the processor 1305 directed by the software program 1333. Fig. 5 illustrates a set 30 of eight structural elements 500, each associated with one of the directions in the sequence. Other sets of structural elements may also be used. Each structural element consists of nine pixels in a three-by-three arrangement. When a structural element is compared to an image 17 subregion of equal size, each pixel of the structural element is compared to a corresponding pixel of the image subregion. The structural element is said to "match" the image subregion when both the following predefined EBFV conditions are true: e Hatched pixels (such as 510, also referred to as "active" pixels) of the structural element 5 correspond to "on" (i.e. text) pixels of the image subregion; and e Unfilled pixels (such as 530, also referred to as "inactive" pixels,) of the structural element correspond to "off' (i.e. non-text) pixels in the image subregion. Crossed pixels (such as 520) have no influence on whether the structural element matches. If a structural element matches an image subregion, it is said to match at the central pixel of the 10 subregion (such as the pixel of the subregion that corresponds with 540). The set of structural elements 500 has the property that, if a structural element of the set matches an image subregion, the match occurs at an "on" pixel that is alongside "off' pixels. This boundary of "on" and "off' pixels is called an edge, and therefore the set of structural elements 500 is said to match edge pixels. Consider for example the structural element 15 depicted in Fig. 5c. If this structural element matches an image subregion, the match occurs at the pixel of the image subregion corresponding to the central pixel 540. Due to the aforementioned predefined EBFV conditions and the location of the structural element's "active" and "inactive" pixels, the matched image subregion has "on" pixels to the left, top-left and bottom-left of the match location, and has "off' pixels to the right, top-right and bottom-right of 20 the match location. Further, the match location is an "on" pixel in the image subregion for equivalent reasons. As the match location for this structural element must occur at an interface between a set of "on" pixels and a set of "off' pixels in the matched image subregion, this structural element is said to match edge pixels. After a structural element has been selected, a subsequent pixel identification step 440, 25 performed by the processor 1305 directed by the software program 1333, identifies pixels of the text region 410 that are matched by the selected structural element. The structural element is compared to each contiguous subregion of the text region 410 that is the same size as the structural element. Whenever the structural element matches a subregion, the associated central pixel is marked as "identified" in the text region 410 (eg see 720 in Fig. 7c). 6982860 1 / P053280 / filed specification 18 Figs. 6a, 6b and 6c collectively illustrate the pixel identification step 440. A structural element 600 shown in Fig. 6a is compared to contiguous subregions of a text region 601 shown in Fig. 6b to thereby produce identified pixels 670 and 671 of an output text region 602 shown in Fig. 6c. 5 In other words, the structural element 610 is placed, as depicted by a dashed rectangle 651, over a subregion of the text region 601. Thereafter, a pixel-by-pixel comparison is performed in which each pixel of the structural element is compared to a corresponding pixel of the image subregion. The structural element is said to "match" the image subregion when the predefined EBFV conditions described in regard to Fig. 5 are met. If the structural element matches the 10 subregion in question, then the centre pixel 652 is identified for subsequent erosion in the present pass. Otherwise, the pixel 652 is not subsequently eroded in the present pass. The structural element is then moved to successive contiguous subregions such as the contiguous subregion 653, and the matching process is again performed. Hatched pixels such as 620, unfilled pixels such as 610 and crossed pixels such as 630 of the 15 structural element 600 have the same meanings as previously described in regard to Fig. 5. Pixels of the text region 601 are either unfilled pixels such as 640 indicating "off' pixels, or hatched pixels such as 650 indicating "on" pixels. The structural element 600 matches two subregions of the text region 601: the first matched subregion spans columns 2 to 4 and rows 1 to 3; the second matched subregion spans columns I to 3 and rows 3 to 5. From these 20 subregions, the central pixels at column 3, row 2 and column 2, row 4 respectively are classified as being "identified". Fig. 6c shows the output text region 602 containing non identified pixels such as 660 (unfilled) and the two identified pixels 670 and 671 (cross hatched). Note that the identified pixels lie on the western edge of the "on" pixels of the text region depicted in 601 as a result of the specific properties of the structural element 600 used. 25 A following identified pixel counting step 450, performed by the processor 1305 directed by the software program 1333, counts the number of identified pixels following the pixel identification step 440. This count is stored and associated with the current direction as controlled by the direction iterator process. For the example of Figs. 6a, 6b and 6c, the identified pixel count is two. 30 Following the identified pixel counting step 450, an identified pixel eroding step 460, performed by the processor 1305 directed by the software program 1333, changes all identified pixels from "on" to "off' in the text region 410.
19 When all the directions of the sequence have been processed, a sufficient thinning decision step 470, performed by the processor 1305 directed by the software program 1333, determines whether any further thinning would be achieved by reinitialising and rerunning the direction initialiser. The sufficient thinning decision step 470 checks the counts associated with the most 5 recent pass of directional thinning. If all such counts are zero, it determines that no further thinning can be achieved, and the directional thinning process 320 finishes; otherwise, the direction iterator process is reinitialised and rerun for another pass of directional thinning, repeating the process, and this continues until the step 470 determines that no further thinning can be achieved. 10 Fig. 7 illustrates the results of one directional thinning pass, showing text regions 700 at each iteration of the direction iterator steps 425a, 425b. In this example, an original text region [labelled "a) original"] is an image representation of an uppercase "C" Latin character. In general the original text region could be any text region received from the text region selection step 150 and may contain more than one character. Directional thinning is performed using the 15 structural elements 500 of Fig. 5. Each text region in Fig. 7 has the following: * hatched pixels such as 710 that were "on" pixels throughout the iteration; * crosshatched pixels such as 720 that were "on" at the start of the iteration, but were identified using the associated structural element of 500 and subsequently eroded (i.e. turned "off"); and 20 * unfilled pixels such as 730 that were "off' throughout the iteration. As seen from the crosshatched pixels such as 720 of Fig. 7: * six pixels are eroded in the south iteration; * five pixels are eroded in the east iteration; * three pixels are eroded in each of the south-west and south-east iterations; 25 * two pixels are eroded in the north-east iteration; * one pixel is eroded in each of the west and north iterations; and 20 no pixels are eroded in the north-west iteration. These erosion values and their associated directions are stored by the identified pixel counting step 450 during this pass. As not all these counts were zero, the sufficient thinning decision step 470 causes another directional thinning pass to occur. 5 Following the directional thinning process 320, a directional feature vector formation step 330, performed by the processor 1305 directed by the software program 1333, creates a vector using the counts stored by the identified pixel counting step 450 for this text region 310. Specifically, the vector has one entry per direction in the direction sequence, in the same order as the direction sequence. Each entry in the vector is initialised as the sum of the counts associated 10 with the direction at the same position in the sequence. For example, the first entry in the vector pertains to the south direction. It is initialised to the sum of the counts associated with the south structural element for this text region by the step 450. If the thinning process required three passes through the direction sequence, and the south counts were 12, 9 and 0 respectively for these passes, then the first entry in the vector would be initialised to 21, i.e. 12 + 9 + 0. 15 After creating the vector, the direction feature vector formation step 330 then normalises the vector by dividing each entry by the number of "on" pixels of the text region 310 prior to the directional thinning process 320. For example, the number of "on" pixels in the text region shown in Fig. 7a is 50. This normalised vector is the output of the directional feature vector formation step 330. 20 Fig. 12 illustrates an example of a directional feature vector formation step 1201, performed by the processor 1305 directed by the software program 1333, for an example text region that undergoes four directional thinning passes. Directional pixel count vectors 1210, 1220, 1230 and 1240 correspond to counts stored by the identified pixel counting step 450 during the first, second, third and fourth passes respectively. An initialised vector 1250 is created by summing 25 together pixel counts of the first, second, third and fourth directional pixel count vectors 1210, 1220, 1230, 1240 that pertain to a common direction. For example, the initialised vector 1250 has a south entry of 16, which is determined by summing the south entries from the four directional pixel count vectors (7, 6, 3 and 0 respectively). A normalised vector 1270 is created by dividing each entry of the initialised vector 1250 by a normalisation factor 1260 equal to the 30 initial number of "on" pixels of the text region. In this example, the normalisation factor 1260 is 70. As a result, the south entry of the normalised vector 1270 is 16 / 70, or approximately 0.23. Equivalent calculations are used to determine entries of the normalised vector 1270 for 21 the other directions. It is noted that other normalisation values can be used, such as for example the number of "on" pixels in the text region following the final directional thinning pass. The feature vector 1270 produced by the feature vector formation step 330 is then used by the script classification process 340 to identify a script present in the text region 310. The feature 5 vector is appropriate for use with any standard Machine Learning classification technique, such as Support Vector Machines, Decision Trees, Neural Networks, naive Bayes, etc. Though a wide variety of classifiers are suitable for the script classification process 340, a Parzen classifier using an A*-lattice binned KDE approximation will be described. A*-binned Parzen classifier 10 A lattice is an infinite set of points in a Euclidean space. The term A* denotes a family of lattices, where a lattice A* denotes a lattice of dimension d. The A* family can be defined in terms of the A lattice family. The lattice defined as Ad = {p E 2d+1 I = o1 is a d-dimensional lattice that is embedded in Rd+1 to make the coordinates integers where p is a d+1-dimensional point with every dimension having an integer coordinate (i.e. p belongs to 15 the d+1-dimensional integer lattice Zd+1), and pi is the ih coordinate of p. Thus from the aforementioned definition of Ad, Ad contains all such points p with coordinates that sum to 0. A* is defined to be the dual of Ad and may be similarly embedded inside the same d dimensional subspace, Hd, of R ,+1 where Hd = (xE Rd+ I xi = 01 where x is a d+l-dimensional point with every dimension having a real coordinate (i.e. x belongs to the d+l-dimensional space of real numbers Rd+1), and xi is the ith coordinate of x. Thus from the aforementioned definition of Hd, Hd contains all such points x with coordinates that sum to 0. Given a first lattice, a dual lattice is the set of dual lattice points where for each dual lattice point in the dual lattice the dot product between the dual lattice point and each lattice point in the first lattice is an integer. In other words, A* is defined by the 6982860 1 / P053280 / filed specification 22 equation: A* = {p E HdIVz E Ad,p -Z E Z} where p is a point belonging to Hd as previously defined, z is a point belonging to Ad as previously defined, and -denotes the dot-product operation. Thus from the aforementioned definition of A*, A* contains all such points p that produce integers when dotted with (i.e. when the dot-product operation is performed with p and with) any arbitrary such point z. 5 The points of the A* lattice are used as bin centres for a binned KDE approximation, where d is the feature vector length. For example, when feature vectors with 8 entries are used, such as e.g. the feature vector 1270 depicted in Fig. 12, the points of the A* lattice are used as bin centres. A Parzen classifier is created, by the processor 1305 directed by the software 1333, using the approximated density estimates of each class. To train the classifier, a set of "training samples" 10 is analysed. Each training sample comprises: * a feature vector, such as 1270, derived from a text region; and * a "class" associated with the feature vector. The class is a label that corresponds to a specific script; e.g. the class for the original text region depicted in Fig. 7a would signify the Latin script. 15 During training, the classifier learns a general correspondence between features of the training samples' respective feature vectors and classes. Subsequent to the classifier being trained, the classifier is used to estimate a likely class for an arbitrary feature vector. Such a feature vector (i.e. a feature vector that is not yet associated with a class) is known as a "query sample" or a "sample of unknown class". 20 As is described hereinafter, feature vectors are embedded into the Hd space and treated as points in that space for the purposes of training and classification. The position of a feature vector in Hd is indicative of the values of the feature entries, and the proximity of a pair of feature vectors in Hd is indicative of how similar the respective feature entry values of the feature vectors are. 25 Training can either be performed in an offline manner (training a classifier from scratch with all the training samples provided at once) or an online manner (providing a smaller set of training 23 samples at a time, and updating the training with each new set). When training is performed in an offline manner, an associated offline form of classification is used. Similarly, when training is performed in an online manner, an associated online form of classification is used. The offline training process will now be described with reference to Fig. 10a. 5 The offline training process 1000 takes place between a training startpoint 1001 and an endpoint 1002. When performed in an offline manner, it comprises a training sample quantisation step 1010, performed by the processor 1305 directed by the software 1333, a following active lattice point identification step 1020, performed by the processor 1305 directed by the software 1333, a subsequent bin count updating step 1030, performed by the processor 10 1305 directed by the software 1333, a following kernel weight computation step 1040, performed by the processor 1305 directed by the software 1333, and a subsequent convolution step 1050, performed by the processor 1305 directed by the software 1333. First, each training sample is quantised to the nearest point of the A* lattice in the training sample quantisation step 1010. This is performed by embedding each sample point in Hd and 15 then performing A* lattice decoding by determining a closest point of A* to the sample point in Hd. The quantised sample is then that closest point of A*. Second, the active lattice point identification step 1020 identifies all the A* lattice points that had at least one training sample quantised to them in the training sample quantisation step 1010. These lattice points are marked as active lattice points. 20 Third, the bin count updating step 1030 counts the number of quantised training samples of each class at each active lattice point as marked by the active lattice point identification step 1020. These bin counts are stored in a vector (one entry per class), and the vector is stored in a hash table indexed by the associated active lattice point. Lattice points that are not present as an index in the hash table are inactive, and are considered to have a bin count vector containing 25 only zeroes. Therefore at the end of the bin count updating step 1030, the hash table contains a sparse approximate view of the training samples. Figs. 15 and 16 collectively illustrate how bin count vectors are stored by the bin count updating step 1030. Fig. 15 shows a set of training samples in a two-dimensional projection of Hd 1500. Samples marked as a cross, such as 1530, indicate a first class, and samples marked 30 with a filled triangle, such as 1540, indicate a second class. Solid circles, such as 1510 and 24 1520, indicate some lattice points in the projection of Hd. Lattice points 1520, 1521, 1522, 1523 and 1524 are active lattice points, as there are training samples that are closer to these lattice points than to any other lattice point of A* . All other lattice points depicted in Fig. 15, such as 1510, are inactive. Fig. 16 shows a hash table 1600 describing the set of training 5 samples depicted in Fig. 15. The hash table has a row for each of the active lattice points depicted in Fig. 15. The hash table rows 1610, 1611, 1612, 1613 and 1614 correspond to the active lattice points 1520, 1521, 1522, 1523 and 1524 respectively. The values in the hash code column are derived from the respective positions of the active lattice points, and are used to quickly locate the appropriate row of the table. The bin count updating step 1030 creates the 10 count column of the hash table 1600 for each class at each active lattice point. The creation and usage of the density estimate column of the hash table 1600 is described hereinafter. Fourth, the kernel weight computation step 1040 and the convolution step 1050 collectively estimate the class densities at the active lattice points by placing a shaped kernel at each quantised training sample (i.e. using the binned KDE approach). The resulting density estimate 15 pi(x) for a given class at a lattice point x is given by: M p(x) = KH(X m)Cm m=1 where N is the number of training samples of the class, M is the number of active lattice points, KH is the shaped kernel, 1m is the mth active lattice point, and Cm is the bin count for the class at 1m. The kernel weight computation step 1040 calculates the kernel weight W(x, 1m) 20 KH(x - mi) for all pairs of active lattice points x and 1m. As KH is usually given with respect to the feature vectors (i.e. in Rd) and (x - 1m) is a distance between points embedded in Hd, it is important to reconcile the two scales. Due to the regular spacing property of lattice points, D(1j, 1m) = yD(oj, om), where D(oj, om) represents the distance between two lattice points in Rd and y is the scale of the lattice, which is determined by the lattice's packing radius and the 25 dimension. The convolution step 1050 then, for each class, calculates the density estimate pi(x) for a given class at each active lattice point x as 25 pi(x) = W(X,Im)cm M=1 where N is the number of training samples of the class, M is the number of active lattice points, 1m is the mth active lattice point, W(x, 1in) is the aforementioned kernel weight for the lattice points x and 1i, and cm is the bin count for the class at 1m. For each active lattice point, the density estimate of each class at that point is stored in a vector. 5 The vector is then added to the hash table, indexed by the associated active lattice point. As a result, the hash table contains a vector of bin counts and a vector of density estimates for each active lattice point. Following the convolution step 1050, the offline training process 1000 reaches its endpoint 1002. The offline classification process will now be described with reference to Fig. 10b. 10 The offline classification process 1005 takes place between a classification startpoint 1006 and an endpoint 1007. When offline training has been used, it comprises a sample quantisation step 1060, performed by the processor 1305 directed by the software 1333, and a class determination step 1070, performed by the processor 1305 directed by the software 1333. First the sample quantisation step 1060 quantises the sample of unknown class to the nearest 15 point of the A* lattice, in the same manner as the training sample quantisation step 1010. Then the class determination step 1070 checks if the quantised sample produced by the sample quantisation step 1060 is an active lattice point. This is achieved by checking if the quantised sample is an index of the hash table. If the quantised sample is an index, the density estimate vector is retrieved for that index, and 20 the class with highest density estimate is selected as the classification. If the quantised sample is not an index, density estimates for each class are extrapolated from the density estimates at nearby active lattice points. The class with highest extrapolated estimate is selected as the classification. The selected class signifies the most likely script for the query sample, according to the 25 aforementioned density estimates.
26 Following the class determination step 1070, the offline classification process 1005 reaches its endpoint 1007. Offline processes for training and classification have now been described. Alternatively, training and classification can proceed in an online manner with some variations to the offline 5 processes. The online training and online classification processes will now be described with reference to Fig. 11 a and Fig. 1 lb respectively. The main difference as compared to the offline processes is that the kernel weight computation step 1150 and the convolution step 1160 take place in an online classification process 1105 rather than an online training process 1100. In the online training process 1100, only new training samples are quantised by a training 10 sample quantisation step 1110, performed by the processor 1305 directed by the software 1333. A subsequent active lattice point identification step 1120, performed by the processor 1305 directed by the software 1333, updates any existing set of active lattice points to include the newly quantised training samples. In other words, if a lattice point is newly active due to the new quantised training samples, the active lattice point identification step 1120 marks these 15 lattice points as active in addition to lattice points already marked active. A following bin count updating step 1130, performed by the processor 1305 directed by the software 1333, adds the counts of the newly quantised samples to any already existing bin counts, creating a new vector of bin counts if one did not previously exist. This denotes the endpoint 1102 of the online training process 1100. 20 In the online classification process 1105, first a sample quantisation step 1140, performed by the processor 1305 directed by the software 1333, occurs as for the offline equivalent 1060. Second, a kernel weight computation step 1150, performed by the processor 1305 directed by the software 1333, calculates any kernel weights that need updating. A kernel weight needs updating if it pertains to a newly active lattice point (identified by the active lattice point 25 identification step 1120) since the previous run of the online classification process 1105. Third, a convolution step 1160, performed by the processor 1305 directed by the software 1333, calculates the density estimates of each class at the quantised sample s . If s is an active lattice point, the density estimate p3(s) for a given class is calculated for each class and at each active lattice point as: 27 f (s) = N W(S, m)Cm M=1 where N is the number of training samples of the class, M is the number of active lattice points, 1m is the mth active lattice point, W(s, 1in) is the aforementioned kernel weight for the lattice points s and 1i, and cm is the bin count for the class at 1m. 5 If s is not an active lattice point, the kernel weights have not been precalculated by the kernel weight computation step 1150. Therefore the density estimates of each class are calculated using active lattice points near the quantised sample: p3(s) = 1 Y KH(S - 1nearby)Cnearby Nearby where N is the number of training samples of the class, KH is the shaped kernel, 'nearby is one of the nearby active lattice points, and Cnearby is the bin count for the class at that nearby active 10 lattice point. Finally, a class determination step 1170, performed by the processor 1305 directed by the software 1333, selects the class with highest density estimate p3(s) at s using the density estimates calculated by the convolution step 1160. The selected class signifies the most likely script for the query sample, according to the aforementioned density estimates. This denotes 15 the endpoint 1107 of the online classification process 1105. Regardless of whether online or offline classification is used, the script classification method determines a likely script class for a directional feature vector derived from a text region. This is achieved by exercising a learned general correspondence between features of a feature vector and various scripts. Feature vectors formed using directional counts of eroded edge pixels are 20 useful for this purpose.
28 EBFV arrangement No. 2 In a second EBFV arrangement, the directional iterative thinning process 320 of Fig. 3 proceeds as illustrated in a thinning process 800 of Fig. 8. Similarly to the first EBFV arrangement, a received text region 810 (equivalent to the text region 310 of Fig. 3) is replaced by a binarised 5 region (created by a region binariser 820) if it is not already bilevel. The text region is then eroded according to a sequence controlled by a direction iterator process (consisting of a direction initialiser process 825a, performed by the processor 1305 directed by the software program 1333, and a direction advancer process 825b, performed by the processor 1305 directed by the software program 1333,). However in contrast to the first EBFV arrangement, 10 erosion is performed using lookup tables derived from directional structural elements. There is one lookup table per direction of the sequence, generated using the structural elements 500 of Fig. 5 described earlier. Accordingly, in an EBFV arrangement having eight structural elements as depicted in Fig. 5, there are eight corresponding lookup tables. The generation of a lookup table will now be described with reference to Figs. 9a - 9e. 15 An EBFV lookup table is a mapping of subregion indices to a central pixel replacement value for the subregion in question when processed by a given structural element. Each index is a number uniquely representing the pixel arrangement in an image subregion of the same size as the structural element. The lookup table is filled with entries such that it has an entry for every possible image subregion of the structural element's size. In the case of a three-by-three pixel 20 structural element, the lookup table will therefore have 512 entries. Each index maps to a central pixel replacement value. This value is the resulting central pixel value of the subregion if the index's associated image subregion is eroded using the structural element (i.e. if the structural element matches the image subregion, the central pixel replacement value is "off"; otherwise it is "on".) 25 Fig. 9a illustrates a structural element 900 used to generate an example lookup table. The structural element has hatched pixels 920, unfilled pixels 910 and crossed pixels 930 with the same meanings as for Fig. 5. Fig. 9b illustrates an index weight matrix 901 used to generate an index number for an image subregion of the same size as the structural element 900. Each value in the matrix 901 is a 30 different power of 2. To generate an index number for an image subregion, the subregion is considered to have a 1 where pixels are "on" and a 0 where pixels are "off'. The index weight matrix 901 is multiplied elementwise with the pixel values in the subregion in question (such as 29 902) to produce a corresponding weighted index matrix (such as 903). The elements of the weighted index matrix are summed to produce the index number (such as 16 + 8 + 2 + 1 = 27 as depicted by a reference numeral 960). Figs. 9c and 9d each provide an example of the aforementioned process. 5 First the image subregion 902 of Fig. 9c and a second image subregion 904 of Fig. 9d each have "off" pixels such as 940 (unfilled) and "on" pixels such as 950 (hatched). The first image subregion 902 produces a weighted index matrix 903 which produces an index number 27 (see 960). The second image subregion 904 produces a weighted index matrix 905 and an index number 90 (see 970). 10 Fig. 9e illustrates two rows of a lookup table 906 generated for the structural element 900, with the first column containing the index numbers and the second column containing the associated central pixel replacement values. One entry 960 pertains to the first image subregion 902, and another entry 970 pertains to the second image subregion 904. As the structural element 900 matches the first image subregion 902 (according to the EBFV conditions), the central pixel 15 replacement value of the first entry 960 is "off'. Conversely, as the structural element 900 does not match the second image subregion 904 (according to the EBFV conditions), the central pixel replacement value of the second entry 970 is "on". To generate the entire lookup table 906, each of the 512 possible image subregions is iterated through, and its entry is created using the process described above. 20 In summary, in order to create the lookup table for a structural element such as 900, in one example two processes are involved, namely generating the lookup table indices, and generating the associated lookup table values. The lookup table indices are generated by generating 512 sample subregions each having a distinct combination of "on" and "off' pixels having associated values "1" and "0". Thereafter, for each sample subregion, each pixel value 25 (i.e. "1" or "0") is multiplied, on a per-pixel basis, with the corresponding element value of the index weight matrix 901, to thereby generate a corresponding weighted index matrix (such as 903). The elements of the weighted index matrix are summed, as depicted at 960, to form the index associated with the sample subregion in question. The lookup table values are generated by matching the structural element 900, using the EBFV conditions, against each of the 512 30 sample subregions. Those sample subregions which satisfy the EBFV matching conditions are assigned a value of "off'. Those sample subregions which do not satisfy the EBFV matching 30 conditions are assigned a value of "on". The lookup table thus speeds up the process of establishing a match between the structural element and the subregion in question. Referring again to Fig. 8, a following directional lookup table selection step 830, performed by the processor 1305 directed by the software program 1333, selects the lookup table associated 5 with the current direction of the direction iterator process. This is the lookup table generated from that direction's structural element from the set of structural elements 500 of Fig. 5. A following pixel replacement step 840, performed by the processor 1305 directed by the software program 1333, processes each image subregion of the text region 810 that is the same size as the structural elements 500. The step 840 determines the index associated with a 10 subregion, and retrieves the selected lookup table's entry for that index. It then replaces the central pixel value in the text region 810 with the central pixel replacement value of the retrieved entry. A subsequent eroded pixel counting step 850, performed by the processor 1305 directed by the software program 1333, counts the number of pixels that were changed from "on" to "off" by 15 the pixel replacement step 840. This count is stored and associated with the current direction as controlled by the direction iterator process. A following sufficient thinning decision step 860, performed by the processor 1305 directed by the software program 1333, performs the same actions as the sufficient thinning decision step 470 of the first EBFV arrangement. The step 860 checks if any of the counts from the last pass 20 are not zero, and if so, it causes the direction iterator process to be reinitialised and rerun. The directional feature vector formation step 330 and the script classification step 340 proceed as per the first EBFV arrangement, with the directional feature vector formation step 330 using the counts stored by the eroded pixel counting step 850 rather than the counts stored by the identified pixel counting step 450. 25 Advantages The first and second EBFV arrangements described above have a number of desirable properties.
31 The EBFV arrangements use feature vectors for script recognition that are generated without requiring or estimating important numerical parameters such as an indicative character size for the text region in question. The EBFV arrangements are nonetheless robust to common input degradations such as inadvertently fragmented characters due to poor input quality to an OCR 5 system 100, notwithstanding the source of the fragmented characters, whether occurring in a hard copy document 110, or occurring due to input processes such as scanning by a scanner 120 or photography using a camera 140, or occurring due to other such reasons. The EBFV arrangements are capable of being used with a variety of text region selection approaches, as the script recognition process does not perform its own text region segmentation, 10 nor does it expect a particular text region segmentation to have been performed. Industrial Applicability The arrangements described are applicable to the computer and data processing industries and particularly for the document processing industry. The foregoing describes only some embodiments of the present invention, and modifications 15 and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive. In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of'. Variations of the word "comprising", such as "comprise" and "comprises" have correspondingly varied 20 meanings.

Claims (10)

1. A method of classifying a region of text in a bitmap image as one of a predetermined number of scripts, the method comprising: 5 receiving the region of text from the bitmap image; generating a thinned representation of the received region of text by eroding edge pixels in the region, each of the edge pixels being eroded in one of a plurality of predetermined directions using predefined structural elements of pixels in sequence; forming a feature vector based on a count of the eroded edge pixels in each of the 10 predetermined directions; and classifying the region of text into one of the predetermined number of scripts according to the feature vector.
2. A method according to claim 1, wherein the step of generating a thinned representation of the received region of text comprises, for each direction of the plurality of predetermined 15 directions, the steps of: selecting a corresponding structural element; matching the selected structural element against successive subregions of the region of text using predefined matching conditions; eroding, for each said subregion, a central pixel of the subregion if the predefined 20 matching conditions are satisfied; and repeating the selecting, matching and eroding steps for each successive said direction until the predefined matching conditions are not satisfied for a full pass of the plurality of predetermined directions. 33
3. A method according to claim 2, wherein the step of matching the selected structural element against a subregion comprises determining, for each pixel in the structural element, if the pixel matches a corresponding pixel of the subregion according to the predefined conditions.
4. A method according to claim 2, wherein the step of matching the selected structural 5 element against a subregion comprises determining a lookup table index based upon the subregion and establishing the match from the lookup table value associated with the determined index.
5. A method according to claim 1, wherein there are eight predetermined directions, and the sequence of said direction is south, south-west, east, south-east, west, north-east, north and 10 north-west.
6. An apparatus for classifying a region of text in a bitmap image as one of a predetermined number of scripts, the apparatus comprising: a processor; and a memory storing a computer executable software program for directing the processor to 15 perform a method comprising the steps of: receiving the region of text from the bitmap image; generating a thinned representation of the received region of text by eroding edge pixels in the region, each of the edge pixels being eroded in one of a plurality of predetermined directions using predefined structural elements of pixels in sequence; 20 forming a feature vector based on a count of the eroded edge pixels in each of the predetermined directions; and classifying the region of text into one of the predetermined number of scripts according to the feature vector.
7. A non-transitory computer readable medium storing a computer executable software 25 program for directing a processor to execute a method for classifying a region of text in a bitmap image as one of a predetermined number of scripts, program comprising: 34 software executable code for receiving the region of text from the bitmap image; software executable code for generating a thinned representation of the received region of text by eroding edge pixels in the region, each of the edge pixels being eroded in one of a plurality of predetermined directions using predefined structural elements of pixels in sequence; 5 software executable code for forming a feature vector based on a count of the eroded edge pixels in each of the predetermined directions; and software executable code for classifying the region of text into one of the predetermined number of scripts according to the feature vector.
8. A method of classifying a region of text, substantially as described herein with reference 10 to the accompanying drawings.
9. An apparatus for classifying a region of text, substantially as described herein with reference to the accompanying drawings.
10. A non-transitory computer readable medium storing a computer executable software program for directing a processor to execute a method for classifying a region of text, 15 substantially as described herein with reference to the accompanying drawings. Dated 20th day of December 2012 CANON KABUSHIKI KAISHA Patent Attorneys for the Applicant/Nominated Person SPRUSON & FERGUSON 20
AU2012268796A 2012-12-20 2012-12-20 Directional stroke width variation feature for script recognition Abandoned AU2012268796A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2012268796A AU2012268796A1 (en) 2012-12-20 2012-12-20 Directional stroke width variation feature for script recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2012268796A AU2012268796A1 (en) 2012-12-20 2012-12-20 Directional stroke width variation feature for script recognition

Publications (1)

Publication Number Publication Date
AU2012268796A1 true AU2012268796A1 (en) 2014-07-10

Family

ID=51228837

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2012268796A Abandoned AU2012268796A1 (en) 2012-12-20 2012-12-20 Directional stroke width variation feature for script recognition

Country Status (1)

Country Link
AU (1) AU2012268796A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781925A (en) * 2019-09-29 2020-02-11 支付宝(杭州)信息技术有限公司 Software page classification method and device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781925A (en) * 2019-09-29 2020-02-11 支付宝(杭州)信息技术有限公司 Software page classification method and device, electronic equipment and storage medium
CN110781925B (en) * 2019-09-29 2023-03-10 支付宝(杭州)信息技术有限公司 Software page classification method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Liu et al. Squeezedtext: A real-time scene text recognition by binary convolutional encoder-decoder network
Afzal et al. Cutting the error by half: Investigation of very deep cnn and advanced training strategies for document image classification
Ye et al. Text detection and recognition in imagery: A survey
Bhowmik et al. Text and non-text separation in offline document images: a survey
US8744196B2 (en) Automatic recognition of images
CN109685065B (en) Layout analysis method and system for automatically classifying test paper contents
Bhowmik et al. Handwritten Bangla word recognition using HOG descriptor
Wang et al. A coarse-to-fine word spotting approach for historical handwritten documents based on graph embedding and graph edit distance
US20210374455A1 (en) Utilizing machine learning and image filtering techniques to detect and analyze handwritten text
Wu et al. Scene text detection using adaptive color reduction, adjacent character model and hybrid verification strategy
CN112163114B (en) Image retrieval method based on feature fusion
Sampath et al. Handwritten optical character recognition by hybrid neural network training algorithm
Song et al. Robust and parallel Uyghur text localization in complex background images
Sah et al. Text and non-text recognition using modified HOG descriptor
Roy et al. Word searching in scene image and video frame in multi-script scenario using dynamic shape coding
Mandal et al. Bag-of-visual-words for signature-based multi-script document retrieval
CN115203408A (en) Intelligent labeling method for multi-modal test data
Chakraborty et al. Application of daisy descriptor for language identification in the wild
Rahul et al. Multilingual text detection and identification from Indian signage boards
Kavitha et al. A robust script identification system for historical Indian document images
Jubair et al. A simplified method for handwritten character recognition from document image
Joshi et al. Combination of multiple image features along with KNN classifier for classification of Marathi Barakhadi
Neycharan et al. Edge color transform: a new operator for natural scene text localization
CN113128496B (en) Method, device and equipment for extracting structured data from image
Bozkurt et al. Classifying fonts and calligraphy styles using complex wavelet transform

Legal Events

Date Code Title Description
MK4 Application lapsed section 142(2)(d) - no continuation fee paid for the application