WO2008138356A2 - System and method for arabic omni font written optica character recognition - Google Patents

System and method for arabic omni font written optica character recognition Download PDF

Info

Publication number
WO2008138356A2
WO2008138356A2 PCT/EG2007/000018 EG2007000018W WO2008138356A2 WO 2008138356 A2 WO2008138356 A2 WO 2008138356A2 EG 2007000018 W EG2007000018 W EG 2007000018W WO 2008138356 A2 WO2008138356 A2 WO 2008138356A2
Authority
WO
WIPO (PCT)
Prior art keywords
frame
segment
module
feature
arabic
Prior art date
Application number
PCT/EG2007/000018
Other languages
French (fr)
Other versions
WO2008138356A3 (en
Inventor
Mohsen Abdel-Razik Ali Rashwan
Mohamed Attia Mohamed El-Araby Ahmed
Original Assignee
The Engineering Company For The Development Of Computer Systems ; (Rdi)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Engineering Company For The Development Of Computer Systems ; (Rdi) filed Critical The Engineering Company For The Development Of Computer Systems ; (Rdi)
Priority to PCT/EG2007/000018 priority Critical patent/WO2008138356A2/en
Publication of WO2008138356A2 publication Critical patent/WO2008138356A2/en
Publication of WO2008138356A3 publication Critical patent/WO2008138356A3/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/18Extraction of features or characteristics of the image
    • G06V30/1801Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G06V30/18019Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections by matching or filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • G06F18/295Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Definitions

  • OCR optical character recognition
  • the text resulting from OCRing the images is used for all kinds of Information Retrieval (IR) and Knowledge Management (KM) purposes which are insensitive to the inevitable error rate of the OCR process as long as this rate is kept small enough ( ⁇ 4% of the word rate is a rational criterion).
  • IR Information Retrieval
  • KM Knowledge Management
  • HMMs Hidden Markov Models
  • the success of HMMs in speech recognition has led many researchers to apply them to character recognition by representing each word image as a sequence of observations.
  • HMMs qualify as a suitable tool for cursive script recognition for a number of reasons.
  • they are stochastic models that can cope with noise and pattern variations occurring as in different fonts.
  • the number of tokens representing an unknown input word may be of variable length. This is a fundamental requirement in cursive fonts because the lengths of the individual input words may greatly vary.
  • the segmentation problem which is extremely difficult and error prone, can be avoided. This means that the features extracted from an input word need not necessarily correspond with letters.
  • dotting is a significant source of confusion - hence recognition errors - in Arabic typewritten OCR systems especially when run on noisy documents; e.g. those produced by photocopiers. On the contrary, dotting may be helpful for Arabic handwritten OCR systems as dots are usually sensed as separate short strokes.
  • Broad graphemes set Multiple grapheme cases as well as the occurrence of ligatures directly lead to broad grapheme sets; e.g. a common highly involved font like Traditional Arabic contains around 190 graphemes, and another common less involved one (with less ligatures) like Simplified Arabic contains around 95 graphemes. Compare this to English where 40 or 50 graphemes are enough! Again, a broader grapheme set means higher ambiguity, and hence more confusion.
  • the language models are essential in recovering strings of words or characters after they have been passed through a noisy channel such as scanning or print degradation.
  • a noisy channel such as scanning or print degradation.
  • Any font-written OCR system needs to perform some processing on the input scanned text block prior to the recognition phase. Decomposing the text block into lines hence into words is perhaps the most fundamental such pre-processing where errors add in the total error rate of the whole OCR system.
  • the system is divided into two subsystems: the trainer subsystem which is used to initialize and train Hidden Markov Models using many scanned bitmaps with their corresponding text files, and the Recognizer subsystem which uses the Viterbi algorithm to recognize input bitmaps to output characters.
  • the block diagram of the trainer subsystem and Recognizer subsystem is shown in Figure 6. The system is built up using the following modules:
  • optical scanners are used to convert printed documents to computer images. It is common practice to convert the multilevel image into a bi-level image of black and white. Often this process, known as thresholding, is performed on the scanner to save memory space and computational effort.
  • Filters can be designed for smoothing, sharpening, thresholding, removing slightly textured or colored background, and contrast adjustment purposes.
  • a method that is often used to reduce noise in two dimensional (off-line) images is to traverse the image, pixel-by-pixel, using a 3 x 3 window and to change the value of a pixel based on the value of its 8-neighbors.
  • a 5 x 5 median filter has proven effective for the cleaning of input scanned text images from the salt & pepper noise
  • Ill- Word Decomposer Module The word decomposer module is used to segment paper into lines, and then to segment each line into words or sub-words stalling from upper right corner of the page. A robust Histogram-based algorithm is used for lines & words decomposition
  • the target text rectangle bitmap is passed through a 5 by 5 median filter in order to clean the salt & pepper noise.
  • Step 2 Getting the Darkness & Brightness Centroids
  • n(j) is defined as the index of the nearest writing-line, as obtained from the sub-step 3.1.1, to the scan-line l ⁇ .
  • Step 4 Writing-Lines Grouping
  • Heights of writing lines are usually considered as a rough - never an accurate - indication of the font size they are written in. So, gathering the writing lines in a given text-image in groups, where the heights of these lines in the same group are of a sufficiently small variance, may be useful for the recognition process that might tune some parameters for each group according to say its average height.
  • this grouping is a necessary auxiliary step towards deciding whether a given space between successive separate word segments is an inter or intra word space. See step 6 below.
  • is tuned to 0.2.
  • the 1st and 2nd writing lines from the top of figure 10 are gathered in one group, while the rest are gathered in another one.
  • a blankness condition is formulated as:
  • Figure 11 shows the result of decomposing the part-of-word segments in our running example, with each marked by a surrounding rectangle.
  • Step6 Full Words Concatenation 6.1- Eliminating Rouge Spaces Among Part-of- Words. This problem is in fact a hard one of outliers detection as named in the literature of data mining. However, a much simpler and effective approach to solving this problem is presented here based on the heuristics emanating from our rich experience about the nature of Arabic, or Arabic dominated script.
  • Figure 12 illustrating a histogram of a spaces group where the widths of intra- word spaces, inter-word spaces, and rogue spaces are plotted.
  • the optimal threshold between the intra-word centroid Si ntra and inter-word centroid Si nt e r is Si ntra + X A • (Sinte r - Si ntra ). If symmetry around Si n t er is also assumed, the cut-off space width for eliminating the rogue spaces can then be expressed as
  • S ⁇ vg may be expressed as
  • the feature-extractor module is used to extract features of the sliding window, which slides through the body of the Arabic word from right to left, and provides a series of feature vectors to the vector quantizer module.
  • F 1 is the frame feature vector and S j is the segment feature vector
  • the words are consisting of number of frames and each frame is consisting of number of segments, and the max number of segments per frame is 4 in Arabic fonts. (See fig. 14)
  • Fig 15 is a flow chart for feature extraction.
  • the vector-quantizer module provides a series of quantized observations to the Discrete HMM Decoder module using the codebook made by the Codebook Maker Module during the training phase.
  • the HMM recognizer stage receives a series of observations representing the scanned word from the vector quantizer and uses the Viterbi algorithm to produce the most probable sequence of models that produces these observations. Using a statistical ligature bi-grams Probabilities generated by Ligature m-grams Prob. Estimator, the most probable model with the highest score is produced by the Viterbi algorithm.
  • left-to-right HMM In our system, we used left-to-right HMM as shown in figure 16.
  • the state-transitions are allowed only from the current state to the next state or from the current state to itself (i.e. to stay or to advance).
  • the number of states per model is one of the key factors that affect the system performance. It is specified empirically to increase the performance of the system.
  • Decoding is controlled by a recognition network compiled from a word-level network, a dictionary and a set of HMMs.
  • the recognition network consists of a set of nodes connected by arcs. Each node is either a HMM model instance or a word-end. Each model node is itself a network consisting of states connected by arcs. Thus, once fully compiled, a recognition network ultimately consists of HMM states connected by transitions. However, it can be viewed at three different levels: word, model and state. Figure 17 illustrates this hierarchy.
  • Every path from the start node to the exit node of the network which passes through exactly T emitting HMM states is a potential recognition hypothesis.
  • Each of these paths has a log probability which is computed by summing the log probability of each individual transition in the path and the log probability of each emitting state generating the corresponding observation.
  • Within-HMM transitions are determined from the HMM parameters, between-model transitions are constant and word-end transitions are determined by the language model likelihoods attached to the word level networks.
  • a token represents a partial path through the network extending from time 0 through to time t. At time 0, a token is placed in every possible start node.
  • tokens are propagated along connecting transitions stopping whenever they reach an emitting HMM state.
  • the token is copied so that all possible paths are explored in parallel.
  • the token passes across transitions and through nodes, its log probability is incremented by the corresponding transition and emission probabilities.
  • a network node can hold at most N tokens. Hence, at the end of each time step, all but the N best tokens in any node are discarded. As each token passes through the network it must maintain a history recording its route. The amount of detail in this history depends on the required recognition output. Normally, only word sequences are wanted and hence, only transitions out of word-end nodes need be recorded.
  • This module converts the sequence of recognized ligature patterns back into ASCII equivalent codes and the whole recognized text is stored in a text file to be stored.
  • Vlll-Character to Ligature Converter Module This module uses the reference text files of the training set to produce a series of pattern numbers of all ligatures included in each word using the ligature-set of the used font and applies this series with quantized feature vectors of the word to the next stage for training the models of the ligatures included in the word.
  • Statistical Language Model uses the language text statistics for increasing the recognition accuracy.
  • Language models provide an estimate of the probability of a character sequence for the given recognition task.
  • the bigram model is based on the approximation that a character (or ligature) is statistically dependant only on the temporally previous character (or ligature).
  • a huge Arabic text corpus from NEMLAR database is used for building an Arabic bigram language model that is used to improve the recognition accuracy.
  • This module is used to detect the dynamic range of the features and calculate the normalization parameter from the feature vectors of the training data after excluding the extreme values. These normalization parameters are then used during the recognition phase to normalize the feature vectors of the data to be recognized.
  • the codebook maker creates a codebook using the conventional K-means algorithm.
  • the number of centroids of the codebook is another factor that affects the recognition accuracy. It is selected empirically to maximize the recognition rates of the system.
  • the HMM trainer receives the observation sequence which is the quantized feature vectors of each word, and also the pattern numbers of the ligatures included in this word generated by the character to ligature converter module and then uses the Baum-Welch re-estimation procedure to train the models of these ligatures, i.e. to maximize the probabilities of the observation sequences given the models.
  • HInit The basic principle of HInit depends on the concept of a HMM as a generator of character vectors. Every training example can be viewed as the output of the HMM whose parameters are to be estimated. Thus, if the state that generated each vector in the training data was known, then the unknown means and variances could be estimated by averaging all the vectors associated with each state. Similarly, the transition matrix could be estimated by simply counting the number of time slots that each state was occupied.
  • Figurel ⁇ The above idea can be implemented by an iterative scheme as shown in Figurel ⁇ .
  • the Viterbi algorithm is used to find the most likely state sequence corresponding to each training example, then the HMM parameters are estimated.
  • the log likelihood of the training data can be computed.
  • the whole estimation process can be repeated until no further increase in likelihood is obtained.
  • This process requires some initial HMM parameters to get started.
  • HInit starts by uniformly segmenting the data and associating each successive segment with successive states.
  • HMM training procedures for building subword systems revolve around the concept of embedded training. Unlike the processes described before, embedded training simultaneously updates all of the HMMs in a system using all of the training data. It is performed by HRest step which, unlike HRest, performs just a single iteration.
  • HRest step (as shown in figure 20) works as follows. On startup, HERest loads in a complete set of HMM definitions. Every training file must have an associated label file which gives a transcription for that file. HERest processes each training file in turn. After loading it into memory, it uses the associated transcription to construct a composite HMM which spans the whole word.
  • This composite HMM is made by concatenating instances of the phone HMMs corresponding to each label in the transcription.
  • the Forward-Backward algorithm is then applied and the sums needed to form the weighted averages accumulated in the normal way.
  • the new parameter estimates are formed from the weighted sums and the updated HMM set is output.
  • Fig 1 represents Grapheme segmentation process illustrated by manually inserting vertical gray lines at appropriate grapheme connection points.
  • Fig2 represents example sets of dotting-differentiated graphemes.
  • Fig3 represents the 4 cases; Starting, Middle, Ending, and Separate cases of the grapheme representing character "Ein” highlighted in red.
  • Fig4 represents some ligatures in the Traditional Arabic font and overlapped characters.
  • Fig5 represents the diacritics added to Arabic text.
  • Fig6 represents Discrete HMM Based Recognition System Block Diagram. And it is composed of the following modules:
  • optical scanners are used to convert printed documents to computer images. It is common practice to convert the multilevel image into a bi-level image of black and white. Often this process, known as thresholding, is performed on the scanner to save memory space and computational effort.
  • Filters can be designed for smoothing, sharpening, thresholding, removing slightly textured or colored background, and contrast adjustment purposes.
  • a method that is often used to reduce noise in two dimensional (off-line) images is to traverse the image, pixel-by- pixel, using a 3 x 3 window and to change the value of a pixel based on the value of its 8-neighbors.
  • a 5 x 5 median filter has proven effective for the cleaning of input scanned text images from the salt & pepper noise
  • the word decomposer module is used to segment paper into lines, and then to segment each line into words or sub-words starting from upper right corner of the page.
  • a robust Histogram- based algorithm is used for lines & words decomposition
  • the feature-extractor module is used to extract features of the sliding window, which slides through the body of the Arabic word from right to left, and provides a series of feature vectors to the vector quantizer module.
  • the vector-quantizer module provides a series of quantized observations to the Discrete HMM Decoder module using the codebook made by the Codebook Maker Module during the training phase.
  • the HMM recognizer stage receives a series of observations representing the scanned word from the vector quantizer and uses the Viterbi algorithm to produce the most probable sequence of models that produces these observations. Using a statistical ligature bi-grams Probabilities generated by Ligature m-grams Prob. Estimator, the most probable model with the highest score is produced by the Viterbi algorithm.
  • left-to-right HMM In our system, we used left-to-right HMM.
  • the state-transitions are allowed only from the current state to the next state or from the current state to itself (i.e. to stay or to advance).
  • the number of states per model is one of the key factors that affect the system performance. It is specified empirically to increase the performance of the system.
  • This module converts the sequence of recognized ligature patterns back into ASCII equivalent codes and the whole recognized text is stored in a text file to be stored.
  • This module uses the reference text files of the training set to produce a series of pattern numbers of all ligatures included in each word using the ligature-set of the used font and applies this series with quantized feature vectors of the word to the next stage for training the models of the ligatures included in the word.
  • Statistical Language Model uses the language text statistics for increasing the recognition accuracy.
  • Language models provide an estimate of the probability of a character sequence for the given recognition task.
  • the bigram model is based on the approximation that a character (or ligature) is statistically dependant only on the temporally previous character (or ligature).
  • a huge Arabic text corpus from NEMLAR database is used for building an Arabic bigram language model that is used to improve the recognition accuracy.
  • This module is used to detect the dynamic range of the features and calculate the normalization parameter from the feature vectors of the training data after excluding the extreme values. These normalization parameters are then used during the recognition phase to normalize the feature vectors of the data to be recognized.
  • the codebook maker creates a codebook using the conventional K-means algorithm.
  • the number of centroids of the codebook is another factor that affects the recognition accuracy. It is selected empirically to maximize the recognition rates of the system.
  • the HMM trainer receives the observation sequence which is the quantized feature vectors of each word, and also the pattern numbers of the ligatures included in this word generated by the character to ligature converter module and then uses the Baum- Welch re-estimation procedure to train the models of these ligatures, i.e. to maximize the probabilities of the observation sequences given the models.
  • Fig7 represents the Gray values histogram of the pixels population of the sample text rectangle.
  • Fig8 represents the Sample text rectangle after initial lines decomposition.
  • Fig9 represents the Sample text rectangle after refined lines decomposition.
  • FiglO represents the Sample text rectangle after logical writing lines concatenation.
  • Figll represents the Sample text rectangle after part-of-words decomposition.
  • Figl2 represents a hypothetical histogram of intra-word, inter- word, and rogue part-of-word spaces.
  • Figl3 represents the Sample text rectangle after full words concatenation.
  • Figl4 represents the illustration of the frames and segments.
  • Figl5 represents the flow chart for feature extraction.
  • Figl6 represents the Simple Left-Right HMM.
  • Figl7 represents the Recognition Network Levels.
  • Figl8 represents the Initialization Operation.
  • Figl9 represents the HRest Operation.
  • Fig20 represents the HERest File Processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Discrimination (AREA)

Abstract

Optical character recognition system and method are disclosed for Arabic scripts. The system is based on Hidden Markov Models (HMM), an approach that has proven to be very successful in the area of automatic speech recognition. The system is divided into two subsystems, the trainer subsystem based on Hidden Markov Models (HMM) and the recognizer subsystem comprising a feature extractor (M4) providing a series of feature vectors to the vector quantizer module (M5) and using the Viterbi algorithm to recognize input bitmaps to output characters. The system uses a speciel histogram-based algorithm for lines & words decomposition.

Description

A Large-Scale Arabic Omni Font-Written OCR
Based on an Analogy with HMM-Based ASR Technical field:
Converting the Text picture into an editable text file using the optical character recognition.
Background Art:
Optical character recognition (OCR) is the branch of pattern recognition that studies automatic reading. The ultimate goal of OCR is to imitate the human ability to read at a much faster rate by associating symbolic identities with images of characters.
The potential application areas of automatic reading machines are numerous. One of the earliest and most successful applications is sorting checks in banks, as the volume of checks that circulates daily has proven to be too enormous for manual entry. Other applications include reading postal addresses off envelops and automatically sorting mail, helping the blind to read, reading customer-filled forms (tax forms, insurance claims, application forms), automating offices, archiving and retrieving text, and improving human-computer interfaces (e.g., pen-based computers). Among numerous applications of typewritten OCR systems comes Document Management Systems (DMS) as the largest industrial consumer. In such systems scanned images of electronically unavailable documents are archived by the DMS, meanwhile OCR runs on each scanned image. While the images are used for viewing the document, the text resulting from OCRing the images is used for all kinds of Information Retrieval (IR) and Knowledge Management (KM) purposes which are insensitive to the inevitable error rate of the OCR process as long as this rate is kept small enough (< 4% of the word rate is a rational criterion).
Since the mid- 1940s researchers have carried out extensive work and published many papers on character recognition. Most of the published work on OCR has been on Latin characters, with work on Japanese and Chinese characters emerging in the mid-1960s.
Although almost a third of a billion people worldwide, in several different languages, use Arabic characters for writing (alongside Arabic, Persian and Urdu are the most noted examples), Arabic character recognition has not been researched as thoroughly as Latin, Japanese, or Chinese. The first published work on Arabic optical text recognition (AOTR) may be traced back to 1975, and was by Nazif. In his master's thesis he developed a system for recognizing printed Arabic characters based on extracting strokes, that he called radicals (20 radicals are used) and their positions. He used correlation between the templates of the radicals and the character image. A segmentation stage was included to segment the cursive text. Years later, Badi and Shimura and Noun worked on printed Arabic characters and Amin on handwritten Arabic characters. In the 1980s research work on AOTR increased considerably, a trend that is continuing in the 1990s.
During the past decade, Hidden Markov Models (HMMs) have become the predominant approach to automatic speech recognition. The success of HMMs in speech recognition has led many researchers to apply them to character recognition by representing each word image as a sequence of observations. HMMs qualify as a suitable tool for cursive script recognition for a number of reasons. First, they are stochastic models that can cope with noise and pattern variations occurring as in different fonts. Next, the number of tokens representing an unknown input word may be of variable length. This is a fundamental requirement in cursive fonts because the lengths of the individual input words may greatly vary. Moreover, using an HMM-based approach, the segmentation problem, which is extremely difficult and error prone, can be avoided. This means that the features extracted from an input word need not necessarily correspond with letters. Instead, they may be derived from part of one letter, or from several letters. Thus, the operations carried out by an HMM are in some sense holistic, combining local feature interpretation with contextual knowledge. Finally, there are standard algorithms known from the literature for both training and recognition using HMMs. These algorithms are fast and can be implemented with reasonable effort.
There has been little work in the use of HMMs for the recognition of Arabic script. Three references report on the use of HMMs to perform recognition of Arabic script. Allam uses contour tracing to locate groups of connected characters; the recognition is then performed on each such group as a whole, using features extracted from vertical slices. The feature extraction of the papers by Ben Amara and Belaid and by Yarman-Vural and Atici appear to be specific to Arabic and may not be easily generalizable.
In 1999 Issam Bazzi, Richard Schwartz, and John Makhoul present an omnifont, unlimited- vocabulary OCR system for English and Arabic. The system is based on Hidden Markov Models. Their system includes the use of a trigram language model on character sequences, the achieved character error rates of 1.1 percent (about 4.4% WER)on data from the University of Washington English Document Image Database and 3.3 (about 13.2% WER) percent on data from the DARPA Arabic OCR Corpus.
As the Arabic market - esp. in the Gulf countries - is currently a hot one, there is a quite high need for reliable typewritten Arabic OCR engines to be integrated in such DMS systems. Perhaps the - by far - most ready and best equipped system is Automatic Reader© 7.0 provided by Sakhr®. Affording document retrieval, omni Arabic OCR, learning mode, 95% word-level accuracy rate, SDK for integration, and being able to deal with bilingual Arabic-Latin documents, this system is the best-to-choose-now for heavy duty serious applications.
Automatic Reader© 7.0 from Sakhr® is essentially based on a huge set of ad hoc orthographic rules and tips put in a work frame of AI searching techniques to decide on each of the Arabic OCR phases; preprocessing of tiny blocks like dots and diacritics, segmentation, and grapheme recognition. The last phase of synthesizing the recognized text is cleverly guided by Sakhr' s Arabic NLP tools for filtering out nonsensical results.
The lag in research on Arabic text recognition, compared to other languages, may be attributed to
(i) The lack of adequate support in terms of journals, books, conferences, and funding, and the lack of interaction between researchers in this field.
(ii) The lack of general supporting utilities like Arabic text databases, dictionaries, programming tools, and supporting staff.
(iii) The late start of Arabic text recognition,
(iv) The special characteristics of Arabic script.
Challenges in Arabic Recognition:
1- The connectivity challenge: Whether handwritten or typewritten, Arabic text can only be scripted in connected (or cursive) mode; i.e. graphemes are connected to one another within the same word with this connection interrupted at few certain characters or at the end of the word. This necessitates any Arabic OCR to do not only the traditional grapheme recognition task, but also another tough grapheme segmentation one (see figure 1). To make things even harder, both of these tasks are mutually dependent and must hence be done simultaneously. - The dotting challenge: As stated before; dotting is extensively used to differentiate characters sharing similar graphemes. According to figure 2 where some example sets of dotting-differentiated graphemes, it is apparent that the digital differences between the members of the same set are small. Whether the dots are eliminated before the recognition process, or recognition features are extracted from the dotted script, dotting is a significant source of confusion - hence recognition errors - in Arabic typewritten OCR systems especially when run on noisy documents; e.g. those produced by photocopiers. On the contrary, dotting may be helpful for Arabic handwritten OCR systems as dots are usually sensed as separate short strokes.
3- The multiple grapheme cases challenge: Due to the mandatory connectivity in Arabic orthography; the same grapheme representing the same character can has multiple variants according to its relative position within the Arabic word segment {Starting, Middle, Ending, Separate} as exemplified by the 4 variants of the Arabic character "Ein" highlighted in red in figure 3.
4- The ligatures challenge: To make things even more complex, certain compounds of characters at certain positions of the Arabic word segments are represented by single atomic graphemes called ligatures. Ligatures are found in almost all the Arabic fonts, but their number depends on the involvement of the specific font in use. Figure 4 illustrates some ligatures in the famous font "Traditional Arabic" highlighted in red.
Broad graphemes set: Multiple grapheme cases as well as the occurrence of ligatures directly lead to broad grapheme sets; e.g. a common highly involved font like Traditional Arabic contains around 190 graphemes, and another common less involved one (with less ligatures) like Simplified Arabic contains around 95 graphemes. Compare this to English where 40 or 50 graphemes are enough! Again, a broader grapheme set means higher ambiguity, and hence more confusion.
5- The diacritics challenge: Unless the reader is knowledgeable enough, each character in Arabic strictly needs one or more diacritical marks to be drawn over or under the corresponding grapheme in order to ensure the intended phonetic transcription and hence the correct pronunciation. Apart from the teaching purposes, Arabic diacritics are used in practice only when they help in resolving linguistic ambiguity of the text. The problem of diacritics with typewritten Arabic OCR is that their direction of flow is vertical while the main writing direction of the body Arabic text is horizontal from right to left. (See figure 5) Like dots; diacritics - when existent - are a source of confusion of typewritten OCR systems especially when run on noisy documents, but due to their relatively larger size they are usually preprocessed. 6- The variation challenge: Arabic characters do not have a fixed size (height and width). Size varies across different characters and across the different shapes of the same character
7- The overlapping challenges: - Characters in a word may overlap vertically (even without touching)
(see Fig. 4).
In our research, we implement an Omni-font-open vocabulary HMM-based typewritten Arabic Text recognition system. We address the issue of how to perform a character recognition system for an independent style (unconstrained style) Arabic typewritten data; this will be done by using a large amount of training data to train the HMMs.
Building an open vocabulary system will derive us to the usage of a character based HMM system not a word based HMM system. Our research on building an independent style-open vocabulary HMM typewritten Arabic recognition system is handled by trying different kinds of HMMs (either discrete or continuous) after choosing the best features that best discriminate between different classes of Arabic characters. Our first system architecture is based on the Discrete HMM.
The basic philosophy of our used autonomously differential recognition feature is as follows; given a sequence of such a feature, one can fully retrieve the written script with only a scaling factor as well as the digital distortion error.
Whatever the features and the approach used for recognition, the language models are essential in recovering strings of words or characters after they have been passed through a noisy channel such as scanning or print degradation. Hence, we in cooperate a bigram character based language models in our system to improve its performance.
Any font-written OCR system needs to perform some processing on the input scanned text block prior to the recognition phase. Decomposing the text block into lines hence into words is perhaps the most fundamental such pre-processing where errors add in the total error rate of the whole OCR system.
Despite being a widely used, simple-to-implement, and computationally efficient, the histogram-based approach may be vulnerable to some idiosyncrasies of real-life documents esp. as per noise and structural textual complexities. The latter of those idiosyncrasies is apparent while dealing with Arabic script. In our work, we elaborated on this approach to overcome its shortcomings esp. - with Arabic script - and came out with a robust and powerful decomposition algorithm we used in our system. This quite robust algorithm were proved with real-life Arabic (or Arabic dominated) documents via extensive experimentation that it is highly efficient algorithm.
Disclosure of invention:
Our system is divided into two subsystems: the trainer subsystem which is used to initialize and train Hidden Markov Models using many scanned bitmaps with their corresponding text files, and the Recognizer subsystem which uses the Viterbi algorithm to recognize input bitmaps to output characters. The block diagram of the trainer subsystem and Recognizer subsystem is shown in Figure 6. The system is built up using the following modules:
I-Quality Scanning Module:
In this stage, optical scanners are used to convert printed documents to computer images. It is common practice to convert the multilevel image into a bi-level image of black and white. Often this process, known as thresholding, is performed on the scanner to save memory space and computational effort.
Il-Primary Noise Handler Module:
This aims to remove noise and other uninteresting variations in the image which are often byproducts of image acquisition. Filters can be designed for smoothing, sharpening, thresholding, removing slightly textured or colored background, and contrast adjustment purposes. A method that is often used to reduce noise in two dimensional (off-line) images is to traverse the image, pixel-by-pixel, using a 3 x 3 window and to change the value of a pixel based on the value of its 8-neighbors. In our study, a 5 x 5 median filter has proven effective for the cleaning of input scanned text images from the salt & pepper noise
Median filters are, however, hard deciding and cause considerable erosive distortion to the graphemes of written text which is generally harmful to the recognition performance. So, they should be used only while the lines & words decomposition, whereas other softer deciding filters - like Gaussian ones - might be used - if necessary - while the recognition process.
Ill- Word Decomposer Module: The word decomposer module is used to segment paper into lines, and then to segment each line into words or sub-words stalling from upper right corner of the page. A robust Histogram-based algorithm is used for lines & words decomposition
It done with the following steps:
Step 1: Filtering
The target text rectangle bitmap is passed through a 5 by 5 median filter in order to clean the salt & pepper noise.
Step 2: Getting the Darkness & Brightness Centroids
For deciding about the blankness of horizontal/vertical pixels-scan-lines in later steps of this algorithm, it is necessary to accurately determine the centroid of darkness CD , and the centroid of brightness eg. We apply the famous K-means clustering algorithm on the pixels population histogram of the target text image, namely the histogram of their gray values on the standard scale from 0 (darkest)- to-255 (brightest), to accurately acquire these two important values (see Figure 7)
Step 3: Writing-Lines Decomposition
Getting the vertical range
Figure imgf000008_0001
height h,- = yi,top -yt.bottom is accomplished through the two following sub-steps:
3.1- Physical Writing Lines Decomposition. This sub-step is trivially done once we get rationally able to decide about whether a given horizontal scan-line /, is blank or non-blank. Intuitively, /} is blank if the following condition is true;
∑ (255 - p(k, J)) ≤ M- (255 -cB) (A)
Where w,- is the effective width of L, to which Ij belongs. However We calculate Wj we use the following two sub-steps: 1- Initial Lines Decomposition: Here, initial writing lines L*, with vertical ranges [y*,,bottom , y\,top] and heights h*, =
Figure imgf000009_0001
- y*i,bottom , are computed with the blankness condition modified to:
∑{255- p(k, J)) ≤ W -(255-C11) (B)
Noting that w, < W, all the scan-lines decided to be non-blank are truly so, according to (A), See figure 8. While few short scan-lines may be falsely decided blank and have to be reconsidered in the next sub-step
2- Refined Lines Decomposition:
For each scan-line Ij judged as blank in the previous sub-step; it gets re-tested in this sub-step with the blankness condition modified to:
∑(255 - p(k,j)) ≤ w,,ω - (255 -cB) (C) k=xSI≠ι,,U)-"«UI+l
n(j) is defined as the index of the nearest writing-line, as obtained from the sub-step 3.1.1, to the scan-line l}. After this second pass of blankness decisions; the writing lines L, are accordingly re-labeled as shown in figure 9
3.2- Logical Writing-Lines Concatenation. Orthography in general, and Arabic script in specific, is full of dots and diacritics over and under the basic body of grapheme sequences This may, cause the sub-step 3.1 generate false thin physical writing-lines which logically belong to thick neighbors To realize that concatenation, a concatenation test is iterating until it states false for all the writing-lines in the target text-image. For a writing-line L1, the concatenation conditions with its neighbors L,.ιtl>o and
Figure imgf000009_0002
Concat,hprc = < cx)AND{yιA'h K""' < β) (D)
Figure imgf000009_0003
Figure imgf000009_0004
ConcatAJtcr = (A. < < β) (E)
Figure imgf000009_0005
If both conditions are FALSE, no concatenation occurs and the test advances to the next writing-line. If either ConcatBef0re or ConcatAfter is TRUE, concatenation happens with the corresponding line, and the test resumes iterating at the first writing-line on top of text-image. If both Concatβefore and ConcatAfter are TRUE, concatenation happens with the closer writing-line, and the test resumes iterating at the first writing-line on top of text-image. As per the nature of Arabic script, our extensive experiments show that, the best values for α and β are 0.35 and 0.30 respectively. Figure 10 shows the resulting logical writing lines.
Step 4: Writing-Lines Grouping
Heights of writing lines are usually considered as a rough - never an accurate - indication of the font size they are written in. So, gathering the writing lines in a given text-image in groups, where the heights of these lines in the same group are of a sufficiently small variance, may be useful for the recognition process that might tune some parameters for each group according to say its average height.
Far more importantly, this grouping is a necessary auxiliary step towards deciding whether a given space between successive separate word segments is an inter or intra word space. See step 6 below.
We here group the writing-lines so that the standard deviation of the writing-lines heights σg in each group and the average of these heights hg satisfy the condition σg ≤ yhs (F)
In our experiments, γ is tuned to 0.2. The 1st and 2nd writing lines from the top of figure 10 are gathered in one group, while the rest are gathered in another one.
Step 5; Part-of-Words Decomposition
In this step, part-of-word segments on each writing line X, are separated by vertical spaces among them. This step is trivially done once we get able to decide about whether a given vertical scan-line segment at x = k within the range of Lt is blank or non-blank. Like formula (A) in step 3.1; such a blankness condition is formulated as:
∑(255 - p(k, J)) ≤ H1 ■ (255 -cB) (G)
Figure 11 shows the result of decomposing the part-of-word segments in our running example, with each marked by a surrounding rectangle.
Step6: Full Words Concatenation 6.1- Eliminating Rouge Spaces Among Part-of- Words. This problem is in fact a hard one of outliers detection as named in the literature of data mining. However, a much simpler and effective approach to solving this problem is presented here based on the heuristics emanating from our rich experience about the nature of Arabic, or Arabic dominated script.
Figure 12 illustrating a histogram of a spaces group where the widths of intra- word spaces, inter-word spaces, and rogue spaces are plotted.
The optimal threshold between the intra-word centroid Sintra and inter-word centroid Sinter is Sintra + XA (Sinter - Sintra). If symmetry around Sinter is also assumed, the cut-off space width for eliminating the rogue spaces can then be expressed as
SCul-Off = S Inter "*" '^ ' \S Inter ~ S Intra ) (H)
While the average of all spaces Savg is easily computable, the centroids Sintra and Sjnter are unknown. So, we seek an expression of Sαu-Offin terms of only Sαvg. Let
Simrα =s-s,tuer^ ≤ε < \ (I)
Substituting from (I) in (H), we get
On the other hand, Sαvg may be expressed as
Sαyg = sInlm (Vz + δ) + S/αUr (J* - δ);0 ≤ δ < V2 (K)
Where δ is a weighting preference parameter. Substituting in (K); for Sjnm from (I), then for Sinter from (J), we get
Sc'"-oχr = {l-2-δ) + ε-(l + 2 - δ) ' Smg (L)
Setting ε=0, δ=0 at the extreme, results in Scut-off = 3 Sαvg. However, our empirical knowledge of the Arabic script in general realistic conditions states εe[0.1,0.15], ^e[0.05, 0.1] which produces the simple rogue-spaces cut-off formula that scm-off ≡ P.7 sαvg,3Λ 5 sαrg] (M)
6.2- Clustering Inter-word/Intra-Word Spaces. Once the rogue spaces are eliminated - if any - from a spaces group, the rest of these spaces in the group are sent to the K-means algorithm to cluster them into an intra-word spaces cluster whose centroid is Sintra and another one of inter-word spaces cluster whose centroid is Sinter; Sintra < Sinter. Figure 13 shows the result of diagnosing inter-word spaces, hence grouping the part-of-word segments within each word in our running example together in one framing rectangle.
IV-Feature Extractor Module:
The feature-extractor module is used to extract features of the sliding window, which slides through the body of the Arabic word from right to left, and provides a series of feature vectors to the vector quantizer module.
The basic philosophy of our autonomously differential recognition feature is as follows; given a sequence of such a feature, one can fully retrieve the written script with only a scaling factor as well as the digital distortion error. The feature proposed here is a time-domain differential one where the script contour is totally differentiated along the x-direction, which can be expressed by equation [1] as follows:
Figure imgf000012_0001
Where F1 is the frame feature vector and Sj is the segment feature vector
The words are consisting of number of frames and each frame is consisting of number of segments, and the max number of segments per frame is 4 in Arabic fonts. (See fig. 14)
For each segment we have four cases:
1- Segment located in the start frame of the word:
S= r CG. of the segment - CG. of the first segment segment length _j
Figure imgf000012_0002
2- Segment located in a frame came after a non empty frame:
S= [ CG. of the segment - CG. of the previous frame ^ segment length ? γ< overlapped segment ; last overlapped segment] previous frame length previous frame lengthh
If there is overlapping. §= f CG. of the segment -CG. of the previous frame segment length jχ .2] previous frame length previous frame lengthh
If there is no overlapping.
3- Segment located in a frame came after an empty frames in the same word:
S= [ CG, of the segment -CG, of the first previous non empty frame } CG. of the segment - CG. of the First previous non empty frame } -3 5 .3] first previous non empty frame length first previous non empty frame length
4- Empty frame:
S= [0, 0, 0, 0]
• Where CG is the Center of Gravity
Fig 15 is a flow chart for feature extraction.
V- Vector Quantizer Module;
The vector-quantizer module provides a series of quantized observations to the Discrete HMM Decoder module using the codebook made by the Codebook Maker Module during the training phase.
VI-Discrete HMM Decoder Module:
The HMM recognizer stage receives a series of observations representing the scanned word from the vector quantizer and uses the Viterbi algorithm to produce the most probable sequence of models that produces these observations. Using a statistical ligature bi-grams Probabilities generated by Ligature m-grams Prob. Estimator, the most probable model with the highest score is produced by the Viterbi algorithm.
In our system, we used left-to-right HMM as shown in figure 16. The state-transitions are allowed only from the current state to the next state or from the current state to itself (i.e. to stay or to advance). The number of states per model is one of the key factors that affect the system performance. It is specified empirically to increase the performance of the system.
Decoding is controlled by a recognition network compiled from a word-level network, a dictionary and a set of HMMs. The recognition network consists of a set of nodes connected by arcs. Each node is either a HMM model instance or a word-end. Each model node is itself a network consisting of states connected by arcs. Thus, once fully compiled, a recognition network ultimately consists of HMM states connected by transitions. However, it can be viewed at three different levels: word, model and state. Figure 17 illustrates this hierarchy.
For an unknown input word with T frames, every path from the start node to the exit node of the network which passes through exactly T emitting HMM states is a potential recognition hypothesis. Each of these paths has a log probability which is computed by summing the log probability of each individual transition in the path and the log probability of each emitting state generating the corresponding observation. Within-HMM transitions are determined from the HMM parameters, between-model transitions are constant and word-end transitions are determined by the language model likelihoods attached to the word level networks.
The job of the decoder is to find those paths through the network which have the highest log probability. These paths are found using a Token Passing algorithm. A token represents a partial path through the network extending from time 0 through to time t. At time 0, a token is placed in every possible start node.
Each time step, tokens are propagated along connecting transitions stopping whenever they reach an emitting HMM state. When there are multiple exits from a node, the token is copied so that all possible paths are explored in parallel. As the token passes across transitions and through nodes, its log probability is incremented by the corresponding transition and emission probabilities.
A network node can hold at most N tokens. Hence, at the end of each time step, all but the N best tokens in any node are discarded. As each token passes through the network it must maintain a history recording its route. The amount of detail in this history depends on the required recognition output. Normally, only word sequences are wanted and hence, only transitions out of word-end nodes need be recorded.
VII-Ligature to Character Converter Module:
This module converts the sequence of recognized ligature patterns back into ASCII equivalent codes and the whole recognized text is stored in a text file to be stored.
Vlll-Character to Ligature Converter Module: This module uses the reference text files of the training set to produce a series of pattern numbers of all ligatures included in each word using the ligature-set of the used font and applies this series with quantized feature vectors of the word to the next stage for training the models of the ligatures included in the word.
IX-Ligature m-grams Prob. Estimator Module:
Statistical Language Model (SLM) uses the language text statistics for increasing the recognition accuracy. Language models provide an estimate of the probability of a character sequence for the given recognition task. In our system we used the bigram SLM. The bigram model is based on the approximation that a character (or ligature) is statistically dependant only on the temporally previous character (or ligature). In our system a huge Arabic text corpus from NEMLAR database is used for building an Arabic bigram language model that is used to improve the recognition accuracy.
X-DRNP Estimator Module:
This module is used to detect the dynamic range of the features and calculate the normalization parameter from the feature vectors of the training data after excluding the extreme values. These normalization parameters are then used during the recognition phase to normalize the feature vectors of the data to be recognized.
Xl-Codebook Maker Module:
From many images of the training set of documents representing different Arabic fonts with different sizes, the codebook maker creates a codebook using the conventional K-means algorithm. The number of centroids of the codebook is another factor that affects the recognition accuracy. It is selected empirically to maximize the recognition rates of the system.
XII-HMM Trainer Module:
The HMM trainer receives the observation sequence which is the quantized feature vectors of each word, and also the pattern numbers of the ligatures included in this word generated by the character to ligature converter module and then uses the Baum-Welch re-estimation procedure to train the models of these ligatures, i.e. to maximize the probabilities of the observation sequences given the models.
The basic procedures involved in training a set of our system models are gone through three steps: Stepl: HMM Initialization (HInit)
The basic principle of HInit depends on the concept of a HMM as a generator of character vectors. Every training example can be viewed as the output of the HMM whose parameters are to be estimated. Thus, if the state that generated each vector in the training data was known, then the unknown means and variances could be estimated by averaging all the vectors associated with each state. Similarly, the transition matrix could be estimated by simply counting the number of time slots that each state was occupied. The above idea can be implemented by an iterative scheme as shown in Figurelδ.
Firstly, the Viterbi algorithm is used to find the most likely state sequence corresponding to each training example, then the HMM parameters are estimated. As a side-effect of finding the Viterbi state alignment, the log likelihood of the training data can be computed. Hence, the whole estimation process can be repeated until no further increase in likelihood is obtained. This process requires some initial HMM parameters to get started. To circumvent this problem, HInit starts by uniformly segmenting the data and associating each successive segment with successive states.
Step2: HMM Re-Estimation (HRest)
Its operation is very similar to HInit except that, as shown in Figure 19, it expects the input HMM definition to have been initialized and it uses Baum- Welch re-estimation in place of Viterbi training. This involves finding the probability of being in each state at each time frame using the Forward-Backward algorithm. This probability is then used to form weighted averages for the HMM parameters. Thus, whereas Viterbi training makes a hard decision as to which state each training vector was "generated" by, Baum- Welch takes a soft decision.
Step3; HMM Re-Estimation Embedded Training (HERest)
Whereas isolated unit training is sufficient for building whole word models and initialization of models using hand-labeled bootstrap data, the main HMM training procedures for building subword systems revolve around the concept of embedded training. Unlike the processes described before, embedded training simultaneously updates all of the HMMs in a system using all of the training data. It is performed by HRest step which, unlike HRest, performs just a single iteration. In outline, HERest step (as shown in figure 20) works as follows. On startup, HERest loads in a complete set of HMM definitions. Every training file must have an associated label file which gives a transcription for that file. HERest processes each training file in turn. After loading it into memory, it uses the associated transcription to construct a composite HMM which spans the whole word. This composite HMM is made by concatenating instances of the phone HMMs corresponding to each label in the transcription. The Forward-Backward algorithm is then applied and the sums needed to form the weighted averages accumulated in the normal way. When all of the training files have been processed, the new parameter estimates are formed from the weighted sums and the updated HMM set is output.
Brief description of the drawings;
Fig 1: represents Grapheme segmentation process illustrated by manually inserting vertical gray lines at appropriate grapheme connection points. Fig2: represents example sets of dotting-differentiated graphemes. Fig3: represents the 4 cases; Starting, Middle, Ending, and Separate cases of the grapheme representing character "Ein" highlighted in red. Fig4: represents some ligatures in the Traditional Arabic font and overlapped characters.
Fig5: represents the diacritics added to Arabic text. Fig6: represents Discrete HMM Based Recognition System Block Diagram. And it is composed of the following modules:
Quality Scanning (Ml) Module:
In this stage, optical scanners are used to convert printed documents to computer images. It is common practice to convert the multilevel image into a bi-level image of black and white. Often this process, known as thresholding, is performed on the scanner to save memory space and computational effort.
Primary Noise Handler (M2) Module
This aims to remove noise and other uninteresting variations in the image which are often byproducts of image acquisition. Filters can be designed for smoothing, sharpening, thresholding, removing slightly textured or colored background, and contrast adjustment purposes. A method that is often used to reduce noise in two dimensional (off-line) images is to traverse the image, pixel-by- pixel, using a 3 x 3 window and to change the value of a pixel based on the value of its 8-neighbors. In our study, a 5 x 5 median filter has proven effective for the cleaning of input scanned text images from the salt & pepper noise
Median filters are, however, hard deciding and cause considerable erosive distortion to the graphemes of written text which is generally harmful to the recognition performance. So, they should be used only while the lines & words decomposition, whereas other softer deciding filters - like Gaussian ones - might be used - if necessary - while the recognition process..
Word Decomposer (M3) Module
The word decomposer module is used to segment paper into lines, and then to segment each line into words or sub-words starting from upper right corner of the page. A robust Histogram- based algorithm is used for lines & words decomposition
Feature Extractor (M4) Module
The feature-extractor module is used to extract features of the sliding window, which slides through the body of the Arabic word from right to left, and provides a series of feature vectors to the vector quantizer module.
The basic philosophy of our autonomously differential recognition feature is as follows;. given a sequence of such a feature, one can fully retrieve the written script with only a scaling factor as well as the digital distortion error. The feature proposed here is a time-domain differential one where the script contour is totally differentiated along the x-direction, which can be expressed by equation [1] as follows:
M 'k = \S_X u S2 u ...SM] = [JS j Eq. [1]
;=i
Vector Quantizer (M5) Module
The vector-quantizer module provides a series of quantized observations to the Discrete HMM Decoder module using the codebook made by the Codebook Maker Module during the training phase.
Discrete HMM Decoder (M6) Module: The HMM recognizer stage receives a series of observations representing the scanned word from the vector quantizer and uses the Viterbi algorithm to produce the most probable sequence of models that produces these observations. Using a statistical ligature bi-grams Probabilities generated by Ligature m-grams Prob. Estimator, the most probable model with the highest score is produced by the Viterbi algorithm.
In our system, we used left-to-right HMM. The state-transitions are allowed only from the current state to the next state or from the current state to itself (i.e. to stay or to advance). The number of states per model is one of the key factors that affect the system performance. It is specified empirically to increase the performance of the system.
Ligature to Character Converter (M7) Module:
This module converts the sequence of recognized ligature patterns back into ASCII equivalent codes and the whole recognized text is stored in a text file to be stored.
Character to Ligature Converter (Tl) Module:
This module uses the reference text files of the training set to produce a series of pattern numbers of all ligatures included in each word using the ligature-set of the used font and applies this series with quantized feature vectors of the word to the next stage for training the models of the ligatures included in the word.
Ligature m-grams Prob. Estimator (T2) Module:
Statistical Language Model (SLM) uses the language text statistics for increasing the recognition accuracy. Language models provide an estimate of the probability of a character sequence for the given recognition task. In our system we used the bigram SLM. The bigram model is based on the approximation that a character (or ligature) is statistically dependant only on the temporally previous character (or ligature). In our system a huge Arabic text corpus from NEMLAR database is used for building an Arabic bigram language model that is used to improve the recognition accuracy.
DRNP Estimator (T3) Module
This module is used to detect the dynamic range of the features and calculate the normalization parameter from the feature vectors of the training data after excluding the extreme values. These normalization parameters are then used during the recognition phase to normalize the feature vectors of the data to be recognized.
Codebook Maker (T4) Module
From many images of the training set of documents representing different Arabic fonts with different sizes, the codebook maker creates a codebook using the conventional K-means algorithm. The number of centroids of the codebook is another factor that affects the recognition accuracy. It is selected empirically to maximize the recognition rates of the system.
HMM Trainer (T5) Module
The HMM trainer receives the observation sequence which is the quantized feature vectors of each word, and also the pattern numbers of the ligatures included in this word generated by the character to ligature converter module and then uses the Baum- Welch re-estimation procedure to train the models of these ligatures, i.e. to maximize the probabilities of the observation sequences given the models.
Fig7: represents the Gray values histogram of the pixels population of the sample text rectangle.
Fig8: represents the Sample text rectangle after initial lines decomposition.
Fig9: represents the Sample text rectangle after refined lines decomposition.
FiglO: represents the Sample text rectangle after logical writing lines concatenation.
Figll: represents the Sample text rectangle after part-of-words decomposition.
Figl2: represents a hypothetical histogram of intra-word, inter- word, and rogue part-of-word spaces.
Figl3: represents the Sample text rectangle after full words concatenation.
Figl4: represents the illustration of the frames and segments.
Figl5: represents the flow chart for feature extraction.
Figl6: represents the Simple Left-Right HMM.
Figl7: represents the Recognition Network Levels.
Figl8: represents the Initialization Operation.
Figl9: represents the HRest Operation.
Fig20: represents the HERest File Processing

Claims

Claims
1. System for enhancing the Arabic OCR includes;
Means for feature extractor, training process, and recognition process
System is divided into two subsystems: the trainer subsystem (feature extractor, training process) which is used to initialize and train Hidden Markov Models using many scanned bitmaps with their corresponding text files, and the Recognizer subsystem (feature extractor, and recognition process) which uses the Viterbi algorithm to recognize input bitmaps to output characters.
The system enhances the problem of the Arabic OCR system by using the new features extractor steps.
The new feature-extractor module is used to extract features of the sliding window, which slides through the body of the Arabic word from right to left, and provides a series of feature vectors to the vector quantizer module. The basic philosophy of our autonomously differential recognition feature is as follows; given a sequence of such a feature, one can fully retrieve the written script with only a scaling factor as well as the digital distortion error. The feature proposed here is a time-domain differential one where the script contour is totally differentiated along the x-direction, which can be expressed by equation [1] as follows:
Figure imgf000021_0001
Where Fj is the frame feature vector and Sj is the segment feature vector
2. According to claim 1 this feature can be used with any language (i.e. Arabic, English,...) whether it was with connected characters or with isolated characters ,Also whether it was handwritten or type written.
3. According to claim 1 this feature can be implemented on any platform whether it was implemented in a software on a computer with any operating system (Windows, Linux, OS2, Solaris, ) or was implemented on any hardware device.
4. The method for calculating features according to claim 1 , comprise
The words are consisting of number of frames and each frame is consisting of number of segments, and the max number of segments per frame is 4 in Arabic fonts.
For each segment we calculate it according to the location of its frame. The first case if the Segment is located in the start frame of the word:
§= r CG. of the segment -CG. of the first segment segment length ^ _j ^ _-n First segment length First segment length
5. The method for calculating features according to claim 1 and 2, comprise
The words are consisting of number of frames and each frame is consisting of number of segments, and the max number of segments per frame is 4 in Arabic fonts.
The second case if the Segment is located in a frame came after a non empty frame:
S= [ CG. of the segment - CG. of the previous frame ^ segment length ^, overlapped segment , last overiapped segment] previous frame length previous frame lengthh
► If there is overlapping.
S= r CG. of the segment- CG. of the previous frame segment length _2 _21 previous frame length previous frame lengthh
If there is no overlapping.
6. The method for calculating features according to claim 1 and 2, comprise The words are consisting of number of frames and each frame is consisting of number of Segments and the max number of segments per frame is 4 in Arabic fonts.
The third case if Segment located in a frame came after an empty frame in the same word:
S= f CG. of the segment -CG. of the first previous non empty frame , CG, of the segment -CG. of the First previous non empty frame } .3 , .3] first previous non empty frame length first previous non empty frame length
7. The method for calculating features according to claim 1 and 2, comprise
The words are consisting of number of frames and each frame is consisting of number of segments, and the max number of segments per frame is 4 in Arabic fonts.
The fourth case if the frame is Empty:
S= [0, 0, 0, 0]
8. According to any previous claims, the process of the new feature is done by the following steps:
1- Getting a new word 2- Get frame
3- Test if the frame is any case from the four previous cases at claims 2, 3, 4, and 5
4- Calculate the frame features according to the case that determined in step 3
5- If this frame is not the last one go to step 2 6- End
9. According to claim 1, the training process includes:
Quality Scanning (Ml) Module, Primary Noise Handler (M2) Module, Word Decomposer (M3) Module, Feature Extractor (M4) Module, Vector Quantizer (M5) Module and Discrete HMM Decoder (M6) Module.
10. According to claim 1, the recognition process includes:
Quality Scanning (Ml) Module, Primary Noise Handler (M2) Module, Word Decomposer (M3) Module, Feature Extractor (M4) Module, Vector Quantizer (M5) Module, Discrete HMM Decoder (M6) Module, Ligature to Character Converter (M7) Module, Character to Ligature Converter (Tl) Module, Ligature m-grams Prob. Estimator (T2) Module, DRNP Estimator (T3) Module, Codebook Maker (T4) Module and HMM Trainer (T5) Module.
11. Method for enhancement Arabic OCR includes:
Method for feature extractor, training process, and recognition process
System is divided into two subsystems: the trainer subsystem (feature extractor, training process) which is used to initialize and train Hidden Markov Models using many scanned bitmaps with their corresponding text files, and the Recognizer subsystem (feature extractor, and recognition process) which uses the Viterbi algorithm to recognize input bitmaps to output characters.
The system enhances the problem of the Arabic OCR system by using the new features extractor steps.
The new feature extractor module is used to extract features of the sliding window, which slides through the body of the Arabic word from right to left, and provides a series of feature vectors to the vector quantizer module. The basic philosophy of our autonomously differential recognition feature is as follows; given a sequence of such a feature, one can fully retrieve the written script with only a scaling factor as well as the digital distortion error. The feature proposed here is a time-domain differential one where the script contour is totally differentiated along the x-direction, which can be expressed by equation [1] as follows:
Figure imgf000024_0001
Where F1 is the frame feature vector and Sj is the segment feature vector
12. According to claim 11 this feature can be used with any language (i.e. Arabic, English,...) whether it was with connected characters or with isolated characters ,Also whether it was handwritten or type written.
13. According to claim 11 this feature can be implemented on any platform whether it was implemented in a software on a computer with any operating system (Windows, Linux, OS2, Solaris, ) or was implemented on any hardware device.
14. According to claim 11, the method for calculating features according to claim 1, comprise
The words are consisting of number of frames and each frame is consisting of number of segments, and the max number of segments per frame is 4 in Arabic fonts.
For each segment we calculate it according to the location of its frame.
The first case if Segment located in the start frame of the word
Figure imgf000024_0002
- The second case if Segment located in a frame came after a non empty frame:
S= [ CG. of thesegment-CG. of the previous frame ? segment length ^ γΛ overlapped segment , last overiapped segment] previous frame length previous frame lengthh
► If there is overlapping.
S= r CG. of thesegment-CG. of the previous frame segment length _2 -2~\ previous frame length previous frame lengthh
If there is no overlapping. - The third case if Segment located in a frame came after an empty frame in the same word:
S= [ CG. of the segment -CO. of the first previous non empty frame } CG. of the segment -CG. of the first previous non empty frame } .3 s .3] first previous non empty frame length first previous non empty frame length
- The fourth case if the frame is Empty:
S= [0, 0, 0, 0]
15. Feature extractor Process is calculated by the algorithm according to the following equation
M
* -a rf& = \Sι V S2 u .. SH] = (JS j Eq. [1] x=l J=I
Where F1 is the frame feature vector and S1 is the segment feature vector
Includes the following steps:
1- Getting anew word
2- Get frame
3- Test if the frame is any case from the four previous cases at claim 10
4- Calculate the frame features according to the case that determined in step 3
5- If this frame is not the last one go to step 2 6- End
PCT/EG2007/000018 2007-05-15 2007-05-15 System and method for arabic omni font written optica character recognition WO2008138356A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EG2007/000018 WO2008138356A2 (en) 2007-05-15 2007-05-15 System and method for arabic omni font written optica character recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EG2007/000018 WO2008138356A2 (en) 2007-05-15 2007-05-15 System and method for arabic omni font written optica character recognition

Publications (2)

Publication Number Publication Date
WO2008138356A2 true WO2008138356A2 (en) 2008-11-20
WO2008138356A3 WO2008138356A3 (en) 2009-09-24

Family

ID=40002673

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EG2007/000018 WO2008138356A2 (en) 2007-05-15 2007-05-15 System and method for arabic omni font written optica character recognition

Country Status (1)

Country Link
WO (1) WO2008138356A2 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013235574A (en) * 2012-05-09 2013-11-21 King Abdulaziz City For Science & Technology (Kacst) Method and system for preprocessing image for optical character recognition
JP2014106961A (en) * 2009-04-27 2014-06-09 King Abdulaziz City For Science & Technology (Kacst) Method executed by computer for automatically recognizing text in arabic, and computer program
JP2014219954A (en) * 2013-05-12 2014-11-20 キング・アブドゥルアジズ・シティ・フォー・サイエンス・アンド・テクノロジー(ケイ・エイ・シィ・エス・ティ)King Abdulaziz City For Scienceand Technology (Kacst) Method for automatically recognizing arabic text
US9014481B1 (en) 2014-04-22 2015-04-21 King Fahd University Of Petroleum And Minerals Method and apparatus for Arabic and Farsi font recognition
US9501708B1 (en) 2015-09-10 2016-11-22 King Fahd University Of Petroleum And Minerals Adaptive sliding windows for text recognition
CN109255113A (en) * 2018-09-04 2019-01-22 郑州信大壹密科技有限公司 Intelligent critique system
US11270153B2 (en) 2020-02-19 2022-03-08 Northrop Grumman Systems Corporation System and method for whole word conversion of text in image

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5933525A (en) * 1996-04-10 1999-08-03 Bbn Corporation Language-independent and segmentation-free optical character recognition system and method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5933525A (en) * 1996-04-10 1999-08-03 Bbn Corporation Language-independent and segmentation-free optical character recognition system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
'Computational Intelligence for Measurement Systems and Applications, CIMSA 2004 IEEE International Conference on Boston, MA, USA', 14 July 2004 article GOUDA ET AL.: 'Segmentation of connected arabic characters using hidden Markov models', pages 115 - 119, XP010773387 *
'Document Image Analysis for Libraries, 2006. DIAL '06. Second International Conference on Lyon, France', 27 April 2006 article NADIA BEN AMOR ET AL.: 'Combining a hybrid Approach for Features Selection and Hidden Markov Models in Multifont Arabic Characters Recognition', pages 103 - 107, XP010912514 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014106961A (en) * 2009-04-27 2014-06-09 King Abdulaziz City For Science & Technology (Kacst) Method executed by computer for automatically recognizing text in arabic, and computer program
JP2013235574A (en) * 2012-05-09 2013-11-21 King Abdulaziz City For Science & Technology (Kacst) Method and system for preprocessing image for optical character recognition
JP2014219954A (en) * 2013-05-12 2014-11-20 キング・アブドゥルアジズ・シティ・フォー・サイエンス・アンド・テクノロジー(ケイ・エイ・シィ・エス・ティ)King Abdulaziz City For Scienceand Technology (Kacst) Method for automatically recognizing arabic text
US9014481B1 (en) 2014-04-22 2015-04-21 King Fahd University Of Petroleum And Minerals Method and apparatus for Arabic and Farsi font recognition
US9501708B1 (en) 2015-09-10 2016-11-22 King Fahd University Of Petroleum And Minerals Adaptive sliding windows for text recognition
CN109255113A (en) * 2018-09-04 2019-01-22 郑州信大壹密科技有限公司 Intelligent critique system
CN109255113B (en) * 2018-09-04 2022-10-11 郑州信大壹密科技有限公司 Intelligent proofreading system
US11270153B2 (en) 2020-02-19 2022-03-08 Northrop Grumman Systems Corporation System and method for whole word conversion of text in image

Also Published As

Publication number Publication date
WO2008138356A3 (en) 2009-09-24

Similar Documents

Publication Publication Date Title
Kolcz et al. A line-oriented approach to word spotting in handwritten documents
El-Yacoubi et al. An HMM-based approach for off-line unconstrained handwritten word modeling and recognition
Pugin Optical Music Recognitoin of Early Typographic Prints using Hidden Markov Models.
AlKhateeb et al. Offline handwritten Arabic cursive text recognition using Hidden Markov Models and re-ranking
Jayadevan et al. Offline recognition of Devanagari script: a survey
Hu et al. Writer independent on-line handwriting recognition using an HMM approach
EP2144188B1 (en) Word detection method and system
Natarajan et al. Multilingual machine printed OCR
Shaw et al. Offline Handwritten Devanagari Word Recognition: A holistic approach based on directional chain code feature and HMM
WO2008138356A2 (en) System and method for arabic omni font written optica character recognition
Lu et al. Robust language-independent OCR system
Mozaffari et al. Lexicon reduction using dots for off-line Farsi/Arabic handwritten word recognition
Ly et al. Recognition of anomalously deformed kana sequences in Japanese historical documents
US8340428B2 (en) Unsupervised writer style adaptation for handwritten word spotting
Amara et al. Printed PAW recognition based on planar hidden Markov models
Xiang et al. Offline Arabic handwriting recognition system based on HMM
Gilloux et al. Strategies for cursive script recognition using hidden Markov models
Nopsuwanchai et al. Maximization of mutual information for offline Thai handwriting recognition
Procter et al. Cursive handwriting recognition using hidden Markov models and a lexicon-driven level building algorithm
El-Mahallawy A large scale HMM-based omni front-written OCR system for cursive scripts
Wang et al. Multi-branch and two-pass HMM modeling approaches for off-line cursive handwriting recognition
Schüßler et al. A HMM-based system for recognition of handwritten address words
Ajao et al. Hidden markov model approach for offline Yoruba handwritten word recognition
El Moubtahij et al. Recognition of off-line Arabic handwriting words using HMM toolkit (HTK)
Rao et al. Orthographic properties based Telugu text recognition using hidden Markov models

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07722742

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13/04/10)

122 Ep: pct application non-entry in european phase

Ref document number: 07722742

Country of ref document: EP

Kind code of ref document: A2