CN101866417A - Method for identifying handwritten Uigur characters - Google Patents

Info

Publication number
CN101866417A
Authority
CN
China
Prior art keywords
character
uygur
parts
handwritten
stroke
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010204177
Other languages
Chinese (zh)
Other versions
CN101866417B (en)
Inventor
卢朝阳
李静
许亚美
阿地力·依米提
谭福秀
王炜
曹琎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN 201010204177
Publication of CN101866417A
Application granted
Publication of CN101866417B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention discloses a method for recognizing handwritten Uigur characters. It belongs to the field of character pattern recognition and can recognize the 128 classes of handwritten Uigur characters effectively and in real time. The method comprises three parts: (1) establishing a Uigur character component dictionary and a handwritten Uigur character component library; (2) a character training process, comprising acquisition of on-line character information, character preprocessing, training-time component segmentation, extraction of time-division direction features, computation of the component training features, and so on; and (3) a character recognition process, comprising acquisition of on-line character information, character preprocessing, recognition-time component segmentation, extraction of time-division direction features and point component features, fusion of the component features, output of the recognition result, and so on. The method is the basis for recognizing whole Uigur words. Combined with a handwritten Uigur word segmentation method, it lets a user write Uigur words freely, naturally and fluently at any angle in the handwriting area of a mobile platform such as a smartphone, and recognizes them robustly.

Description

Method for identifying handwritten Uygur characters
Technical field
The invention belongs to the field of character pattern recognition within pattern recognition, specifically to handwritten character recognition, and is used to recognize handwritten Uygur characters entered on a mobile terminal.
Background art
The Uygur language belongs to the western Hunnic branch of the Turkic group of the Altaic language family and is the language of the Uygurs, one of China's important ethnic groups. Modern Uygur script consists of 32 letters. Depending on its position in a word, each letter has 2 to 8 written forms, such as the front-connected, double-connected, back-connected and isolated forms, giving 124 characters. A Uygur word consists of one or more characters. Besides the 124 classes of letter-variant characters, these include two compound characters and two connecting symbols, for 128 characters in total. The characters in a word are joined along a horizontal line called the baseline. Characters are written from right to left and from top to bottom. The part written along the baseline is called the main part and determines the main shape of a character; the symbols, dots and similar marks off the baseline are called the additional part and are the basis for distinguishing similar characters.
Processing and recognizing Uygur script helps promote informatization and the development of science and technology in ethnic minority areas, but research on Uygur script recognition is still at an exploratory stage. Uygur text is made up of Uygur words, and a Uygur word is made up of several characters joined along the baseline, so taking the whole word as the recognition object matches Uygur writing habits and is practical. For printed Uygur, Ding Xiaoqing of Tsinghua University, Halimulati of Xinjiang University and others have studied the segmentation of printed Uygur words and the recognition of printed Uygur characters. In that work, pre-classification information is used to divide the whole character set into several subsets, directional line element features are extracted from the input character, and printed Uygur characters are classified with a modified quadratic discriminant function to obtain the recognition result. For handwritten Uygur, Yuan Baoshe of Xinjiang University and others studied a recognition method for 41 classes of handwritten Uygur characters. They proposed a feature set of 21 features, including the number of strokes, structural features of the main stroke and structural features of the additional strokes, XORed it with the feature data in a sample library, and took the sample with the fewest 1s after the operation as the recognized sample, also providing candidate samples. However, the 41 character classes cover only the isolated and simple isolated forms of the 32 Uygur letters, so the method cannot be applied to Uygur word recognition based on character segmentation.
At present, no recognition method for the 128 classes of handwritten Uygur characters has been reported in the literature. How to combine the specific rules of Uygur script with existing character recognition algorithms to recognize the 128 classes of handwritten Uygur characters is a problem that urgently needs solving.
Summary of the invention
The object of the invention is to provide a method for recognizing the 128 classes of handwritten Uygur characters. The method recognizes the 128 classes of characters segmented from Uygur words and is the basis for recognizing whole Uygur words. Combined with a handwritten Uygur word segmentation method, it ultimately lets a user write Uygur words without constraint, naturally and fluently, at any angle, in the writing area of a mobile platform such as a smartphone, and recognizes them robustly.
The invention combines structural and statistical methods. For the 128 classes of Uygur letter-variant characters, it first establishes a Uygur character component library and component dictionary, and uses prior knowledge of Uygur script together with the component training features to segment a Uygur character into several components; it then extracts time-division direction features from each component, and finally combines the component features with a weighted distance fusion function to recognize the whole character. The recognition method comprises the following three parts:
(1) establishing a Uygur character component dictionary for the 128 classes and a handwritten Uygur character component library;
(2) handwritten Uygur character training process: collecting a certain number of handwritten Uygur character samples, training with the Uygur character component dictionary and the relevant training algorithms, and storing the training features in the handwritten Uygur character training feature library;
(3) handwritten Uygur character recognition process: recognizing a handwritten Uygur character with the Uygur character component dictionary, the handwritten Uygur character component training feature library and the relevant recognition algorithms.
The training process comprises the following steps:
(a) collecting on-line information of handwritten Uygur characters on a mobile terminal platform, the information being a series of stroke coordinate trajectories sampled in time order, and collecting multiple sets of character samples as the training sample set;
(b) preprocessing the coordinate trajectory of each training sample, including slant correction, normalization, resampling, smoothing and broken-stroke connection;
(c) with reference to the Uygur character component dictionary, segmenting the preprocessed character into four components with the training-time component segmentation algorithm: the main component, the first additional component, the second additional component and the connected-point component;
(d) extracting time-division direction features from each segmented component: a 4 × 9-dimensional feature for the main component, and 4 × 4-dimensional features for the additional components and the connected-point component;
(e) averaging the time-division direction features of each component over all its samples to obtain the training feature of that component, and storing it in the handwritten Uygur character component training feature library.
The recognition process comprises the following steps:
(a) collecting on-line information of a handwritten Uygur character on a mobile terminal platform;
(b) preprocessing the collected coordinate trajectory, including slant correction, normalization, resampling, smoothing and broken-stroke connection;
(c) with reference to the handwritten Uygur character component training feature library, segmenting the preprocessed character into four components with the recognition-time component segmentation algorithm: the main component, the first additional component, the second additional component and the point component;
(d) extracting features from each segmented component: a 4 × 9-dimensional time-division direction feature for the main component, 4 × 4-dimensional time-division direction features for the additional components, and the number-of-points, position and two-point structure features for the point component;
(e) with reference to the Uygur character component dictionary and the handwritten Uygur character component training feature library, computing the feature distance between each component of the character and the corresponding component of each character template (128 classes), fusing the component distances with the weighted distance fusion function, and outputting the recognition result by the minimum-distance criterion.
The invention has the following beneficial effects:
1. The invention is based on component analysis of handwritten Uygur characters. This approach overcomes the randomness of stroke positions in handwritten characters, reduces feature complexity and the number of classes, and amplifies small discriminating details, which reduces the misclassification of similar characters; it is an effective approach to handwritten character recognition.
2. During feature training the invention treats connected points as a special kind of component and extracts their time-division direction features; during recognition the connected-point training features are used to identify point components correctly. This solves the problem, common in handwritten Uygur, of point strokes being written joined together.
3. The invention uses the time-division direction feature as the discriminating feature of each component. The feature suits cursive scripts such as handwritten Uygur, captures the topology and structure of a stroke well, has a relatively low dimension, and makes distance computation simple.
4. The invention explores and studies the rules and writing conventions of handwritten Uygur characters. The effectiveness of the method confirms that, for minority-language scripts such as Uygur, making full use of the script's own particular rules in combination with general character recognition algorithms is of great benefit to the final recognition rate.
The character recognition experiments were run on a PC (Intel dual-core T2300 processor, 512 MB of memory) on a handwritten Uygur character set written by Uygur writers and collected on a mobile phone platform. The experiments show that the proposed method effectively recognizes the 128 classes of handwritten Uygur characters written with free stroke order and free stroke connection, with an average recognition rate of 84.23% and a recognition time of 174 ms per character, laying a good foundation for handwritten Uygur word recognition based on character segmentation.
Description of drawings
Fig. 1 shows the 128-class handwritten Uygur character set of the invention
Fig. 2 shows example entries of the Uygur character component dictionary
Fig. 3 shows the handwritten Uygur character component library
Fig. 4 is the overall flowchart of the character recognition system
Fig. 5 is the flowchart of the training-time component segmentation algorithm of the character recognition system
Fig. 6 is the flowchart of the recognition-time component segmentation algorithm of the character recognition system
Fig. 7 illustrates direction code computation in the character recognition system: (a) for the starting point of a stroke, (b) for the non-starting points of a stroke
Fig. 8 illustrates time-division direction feature extraction in the character recognition system: (a) a sample of component No. 054, (b) an enlarged view of segment A-B in (a), (c) regularization of the direction codes
Fig. 9 shows some of the samples used in the experimental tests of the character recognition system
Fig. 10 shows the experimental test results of the character recognition system
Detailed description of the embodiments
The method of the invention is based on the 128 classes of Uygur characters; the 128-class handwritten Uygur character set is shown in Fig. 1. The method is divided into three parts, and the technical solution is described further below with reference to the drawings and an embodiment.
Part one: establishing the Uygur character component dictionary and the handwritten Uygur character component library. To build the component dictionary, the main part of each Uygur character is treated as one component, called the main component. The additional part is divided into point components and additional components according to whether its strokes are point strokes. To unify the model, an empty component, denoted "NULL", is also defined to indicate that a part is absent. Every Uygur character can thus be decomposed in a fixed way into four components: the main component, the first additional component, the second additional component and the point component. Fig. 2 shows example entries of the component dictionary, where M, A_1, A_2 and D denote the main component, the first additional component, the second additional component and the point component respectively, and the dotted line marks the baseline position.
For all components in the Uygur character component dictionary, a handwritten Uygur character component library can be built (see Fig. 3). It contains 58 main components, 6 additional components, 8 point components and 4 connected-point components. A connected-point component is the cursively joined form of a point component. The component dictionary contains no connected-point components, but they are trained as a special kind of component so that, during recognition, the joined forms of point components can be correctly distinguished from additional components.
The overall flow of the handwritten Uygur character recognition system (Fig. 4) is divided into a training process and a recognition process. In Fig. 4, rectangles denote processing algorithms, ellipses denote stored data, solid lines show the flow of the data being processed, and dotted lines show the reference data required by an algorithm. The training process comprises preprocessing, training-time component segmentation, feature computation and storing the component features in the library; the recognition process comprises preprocessing, recognition-time component segmentation, feature computation, distance fusion and output of the recognition result.
Part two, the training process: collect handwritten Uygur character samples, train with the Uygur character component dictionary and the relevant algorithms, and store the training features in the training feature library.
The training process comprises the following steps:
Step 1: collect handwritten Uygur character samples on a mobile phone platform. The on-line information of a sample is a series of stroke coordinate trajectories sampled in time order, with strokes separated by a delimiter, at a resolution of 512 × 512. Multiple sets of character samples must be collected for training.
Step 2: preprocess each training sample character, including slant correction, normalization, resampling, smoothing and broken-stroke connection.
Slant correction uses the Hough transform. Normalization is linear, and the character size after normalization is 256 × 256. Resampling treats every two adjacent points as joined by a straight line and interpolates along the computed line equation at an interval of one point every 3 pixels. Smoothing is a multi-point weighted average that takes into account the current point and the 2 points before and after it.
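The resampling and smoothing steps can be illustrated with a minimal sketch. The function names `resample` and `smooth` are hypothetical, the (1, 2, 1) weights are an assumption (the text does not give the weights), and "the 2 points before and after" is read here as one neighbour on each side.

```python
import math

def resample(points, spacing=3.0):
    """Resample a stroke so that consecutive points lie `spacing` pixels
    apart along the path, interpolating linearly between the input points."""
    if len(points) < 2:
        return list(points)
    out = [points[0]]
    carry = 0.0  # path length travelled since the last emitted point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg == 0:
            continue
        d = spacing - carry  # distance into this segment of the next point
        while d <= seg:
            t = d / seg
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            d += spacing
        carry = seg - (d - spacing)
    return out

def smooth(points, weights=(1, 2, 1)):
    """Weighted average of each interior point with its two neighbours.
    The weights are an assumed choice; the end points are kept unchanged."""
    if len(points) < 3:
        return list(points)
    w0, w1, w2 = weights
    total = w0 + w1 + w2
    out = [points[0]]
    for p, c, n in zip(points, points[1:], points[2:]):
        out.append(((w0 * p[0] + w1 * c[0] + w2 * n[0]) / total,
                    (w0 * p[1] + w1 * c[1] + w2 * n[1]) / total))
    out.append(points[-1])
    return out
```

Resampling to a fixed spacing makes the number of sample points proportional to stroke length, which is what lets later steps measure stroke length by counting points.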
The broken-stroke connection algorithm was designed specifically for handwritten Uygur characters and has two parts: conventional broken-stroke connection and main-component stroke connection. Conventional broken-stroke connection joins strokes broken by rewriting or by accidental pen lifts during writing. Main-component stroke connection joins the main component of a character into a single stroke, to make component segmentation easier.
Let the stroke sequence be S_1, ..., S_i, ..., S_n, let S_i.l denote the length of stroke S_i, let |S_i − S_j| denote the distance between two strokes, i.e. the minimum distance from the head or tail of one stroke to the points of the other, and let "S_i + S_j" denote joining strokes S_i and S_j. The broken-stroke connection algorithm is then:
(a) S_i + S_j rule: when two strokes are joined, if the tail of S_i is close to the head of S_j, S_j is appended after S_i; if the head of S_i is close to the tail of S_j, S_i is appended after S_j;
(b) conventional broken-stroke connection: for S_i, if |S_i − S_j| < min(S_i.l, S_j.l)/6, then S_i = S_i + S_j, for i = 1, ..., n and j = i+1, ..., n;
(c) main-component stroke connection: if S_1.l ≠ max(S_1.l, ..., S_n.l), find the i for which S_i.l = max(S_1.l, ..., S_n.l); then S_1 = S_1 + S_2 + ... + S_i.
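Rules (a) to (c) can be sketched as follows. This is an illustration under assumptions, not the patented implementation: strokes are non-empty lists of (x, y) points, stroke length is the number of sample points, and the function names are hypothetical.

```python
import math

def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def stroke_distance(a, b):
    """Minimum distance from the head or tail of one stroke to any point
    of the other stroke (assumed reading of the stroke distance)."""
    return min(min(_dist(e, q) for e in (a[0], a[-1]) for q in b),
               min(_dist(e, p) for e in (b[0], b[-1]) for p in a))

def join(a, b):
    """Rule (a): append b after a if a's tail is nearer b's head than
    a's head is to b's tail; otherwise append a after b."""
    if _dist(a[-1], b[0]) <= _dist(a[0], b[-1]):
        return a + b
    return b + a

def connect_broken_strokes(strokes):
    strokes = [list(s) for s in strokes]
    # Rule (b): merge stroke j into stroke i when they are close
    # relative to the shorter of the two.
    i = 0
    while i < len(strokes):
        j = i + 1
        while j < len(strokes):
            limit = min(len(strokes[i]), len(strokes[j])) / 6
            if stroke_distance(strokes[i], strokes[j]) < limit:
                strokes[i] = join(strokes[i], strokes.pop(j))
            else:
                j += 1
        i += 1
    # Rule (c): if the first stroke is not the longest, merge everything
    # up to and including the longest stroke into the first stroke.
    if strokes:
        k = max(range(len(strokes)), key=lambda t: len(strokes[t]))
        if k != 0:
            merged = strokes[0]
            for s in strokes[1:k + 1]:
                merged = join(merged, s)
            strokes = [merged] + strokes[k + 1:]
    return strokes
```

Rule (c) relies on the writing convention that the main component is begun first, so any strokes written before the longest one are taken to belong to it.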
Step 3: segment the preprocessed character into four components with the training-time component segmentation algorithm: the main component, the first additional component, the second additional component and the connected-point component.
The flowchart of the training-time component segmentation algorithm is shown in Fig. 5. Let M be the main component, A_1 and A_2 the additional components, and F the connected-point component; F.c denotes the connected-point type, i.e. which kind of connected writing the point connection is; D.n, D.p and D.r are the number, position and two-point features of the point component. The training-time component segmentation algorithm is then:
(1) find the point strokes and the main component by stroke length; the remaining strokes are additional components and connected-point components.
A point stroke is a stroke whose length is below a threshold (the point threshold is 1/10 of the normalized character width). The main component is the longest stroke. Stroke length is measured as the number of sample points after normalization and resampling.
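A minimal sketch of this length-based split follows. The function name is hypothetical, and converting the pixel threshold (1/10 of the 256-pixel width) into a sample-point count by dividing by the 3-pixel resampling interval is an assumption; the text does not say in which unit the threshold is compared.

```python
def split_by_length(strokes, char_width=256, spacing=3):
    """Return (main, point_strokes, other_strokes) for a list of strokes,
    where each stroke is a list of resampled (x, y) points."""
    # Assumption: express the threshold in sample points, not pixels.
    point_threshold = (char_width / 10) / spacing
    main_index = max(range(len(strokes)), key=lambda i: len(strokes[i]))
    point_strokes, others = [], []
    for i, stroke in enumerate(strokes):
        if i == main_index:
            continue
        if len(stroke) < point_threshold:
            point_strokes.append(stroke)
        else:
            others.append(stroke)
    return strokes[main_index], point_strokes, others
```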
(2) determine the baseline position from the main component M.
The algorithm is: find the horizontal segments in M whose length exceeds 1/6 of the total length; the baseline position is the vertical coordinate of the first such segment.
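One way to read the baseline rule is sketched below. The text does not define "horizontal segment", so treating a step as horizontal when |dx| >= |dy|, and taking the mean y of the first qualifying run as the baseline, are both assumptions; `baseline_y` is a hypothetical name.

```python
def baseline_y(main_stroke):
    """Return the y coordinate of the first horizontal run of the main
    component that is longer than 1/6 of the stroke, or None if no run
    qualifies. `main_stroke` is a list of resampled (x, y) points."""
    horizontal = [abs(x1 - x0) >= abs(y1 - y0)
                  for (x0, y0), (x1, y1) in zip(main_stroke, main_stroke[1:])]
    min_len = len(horizontal) / 6
    start = 0
    while start < len(horizontal):
        if not horizontal[start]:
            start += 1
            continue
        end = start
        while end < len(horizontal) and horizontal[end]:
            end += 1
        if end - start > min_len:
            ys = [p[1] for p in main_stroke[start:end + 1]]
            return sum(ys) / len(ys)
        start = end
    return None
```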
(3) make a preliminary decision on the additional components and the connected-point component using the Uygur character component rules.
The Uygur character component rules are: 1) the main-component stroke is longer than the other strokes; 2) when both additional components of a character are non-empty, they are arranged left and right; 3) when an additional component and a point component are both non-empty, the additional component lies below the point component; 4) additional components are written only above the baseline.
According to Fig. 5, the algorithm is:
i) initialization: list the remaining strokes as S_1, ..., S_i, ..., S_n, and set A_1 = NULL, A_2 = NULL, F = NULL, i = 1;
ii) if S_i lies below the baseline, set F = S_i and go to step iv); otherwise go to step iii);
iii) if A_1 = NULL, set A_1 = S_i and go to step iv); otherwise, if A_2 = NULL, set A_2 = S_i and go to step iv);
iv) set i = i + 1; if i ≤ n, go to step ii); otherwise go to step v);
v) if A_1 ≠ NULL, A_2 ≠ NULL and A_1, A_2 are arranged one above the other, go to step vi); otherwise the algorithm ends;
vi) if A_1 is above A_2, set F = A_1, A_1 = A_2, A_2 = NULL and end; otherwise set F = A_2, A_2 = NULL and end.
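Steps i) to vi) can be sketched as below, visiting every remaining stroke once. The geometric tests are assumptions: a stroke's position is taken to be its centroid, y is assumed to grow downward (screen coordinates), and two strokes count as stacked when their vertical offset exceeds their horizontal offset. The function names are hypothetical.

```python
def centroid(stroke):
    n = len(stroke)
    return (sum(p[0] for p in stroke) / n, sum(p[1] for p in stroke) / n)

def preliminary_assign(strokes, baseline_y):
    """Return (a1, a2, f): the two additional components and the
    connected-point component, each a stroke or None (NULL)."""
    a1 = a2 = f = None
    for s in strokes:
        if centroid(s)[1] > baseline_y:  # below the baseline
            f = s
        elif a1 is None:
            a1 = s
        elif a2 is None:
            a2 = s
    if a1 is not None and a2 is not None:
        (x1, y1), (x2, y2) = centroid(a1), centroid(a2)
        if abs(y1 - y2) > abs(x1 - x2):  # stacked, so one is really points
            if y1 < y2:                  # a1 is above a2
                f, a1, a2 = a1, a2, None
            else:
                f, a2 = a2, None
    return a1, a2, f
```

The final swap applies rules 2) and 3): two genuine additional components sit side by side, so a stacked pair must be a point component written above an additional component.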
(4) refine the decision on the remaining additional components and the connected-point component using the Uygur character component dictionary.
The algorithm is: compute D.n, D.p and D.r from the point strokes, and compute F.c from the values of F, D.n, D.p and D.r. Look up the Uygur character component dictionary by the known sample class and check whether D.n, D.p and D.r match the point component of that character class in the dictionary. If they do not match, correct the current values of A_1, A_2, F and F.c; otherwise output M, A_1, A_2, F and F.c, and the algorithm ends.
Step 4: compute the time-division direction features of the main component, additional components and connected-point component segmented from the character.
The time-division direction feature algorithm is:
(1) compute a direction code for each sample point of the stroke.
An on-line handwritten Uygur character is represented by a series of coordinate trajectories collected in real time, and the invention computes the direction code of each sample point with a fuzzy direction code extraction method. Referring to Fig. 7, for the starting point of a stroke the plane is divided into 4 regions, as shown in Fig. 7(a), and the direction is coded directly by the region it falls into. From the second point of the stroke onward the plane is divided into 8 regions, as shown in Fig. 7(b). The unshaded regions are clear regions: a stroke direction falling into one is coded directly by region. The shaded regions are fuzzy regions: a stroke direction falling into one is judged to have changed from the previous direction only if the angle between it and the previous direction exceeds a threshold. This effectively avoids the decision errors caused by small variations of the direction around a region boundary.
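The fuzzy coding idea can be sketched as follows. Fig. 7 is not reproduced here, so the region layout is assumed: four 90-degree sectors centred on the axes and numbered 1 to 4 counter-clockwise from the positive x direction, with fuzzy bands of `margin` degrees on either side of each sector boundary. The `margin` and `threshold` values and all function names are hypothetical.

```python
import math

def sector_code(angle):
    """Map an angle in degrees to one of four sectors (codes 1..4)."""
    return int(((angle + 45) % 360) // 90) + 1

def in_fuzzy_zone(angle, margin):
    """True when the angle is within `margin` degrees of a sector
    boundary (45, 135, 225 or 315 degrees)."""
    off = (angle - 45) % 90
    return min(off, 90 - off) < margin

def angle_diff(a, b):
    d = abs(a - b) % 360
    return min(d, 360 - d)

def direction_codes(points, margin=15.0, threshold=30.0):
    """Direction code of every step of a stroke. The first step is coded
    directly by sector; a later step inside a fuzzy band keeps the
    previous code unless it turned by more than `threshold` degrees."""
    codes = []
    prev_angle = prev_code = None
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if (x0, y0) == (x1, y1):
            continue  # repeated point, no direction
        angle = math.degrees(math.atan2(y1 - y0, x1 - x0)) % 360
        code = sector_code(angle)
        if prev_code is not None and in_fuzzy_zone(angle, margin):
            if angle_diff(angle, prev_angle) <= threshold:
                code = prev_code
        codes.append(code)
        prev_angle, prev_code = angle, code
    return codes
```

In the first check below the second step points at about 50 degrees, just past the 45-degree boundary, but it turned by less than the threshold, so the previous code is kept.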
(2) regularize the stroke direction code sequence to remove jitter noise from writing.
Computing a direction code for every point of a stroke yields a direction code sequence in time order. Referring to Fig. 8, Fig. 8(a) is a sample of component No. 053 and Fig. 8(b) is an enlarged view of segment A-B in Fig. 8(a). Coding the line between each two points by the method shown in Fig. 7 gives the direction code sequence of segment A-B, 444411111121122, as shown in Fig. 8(c). To remove jitter noise from the handwritten character, a direction code that changes only briefly is replaced by the adjacent direction code before it; the regularized direction code sequence of segment A-B is 444411111111122, as shown in Fig. 8(c).
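A minimal sketch of the regularization follows. The text does not say how short a "brief" change is; treating runs shorter than `min_run` = 2 as jitter is an assumption chosen because it reproduces the A-B example, and `regularize` is a hypothetical name.

```python
from itertools import groupby

def regularize(codes, min_run=2):
    """Replace every run of identical codes shorter than `min_run` with
    the code that precedes it. A short run at the very start is kept,
    since it has no predecessor."""
    out = []
    for code, run in groupby(codes):
        n = len(list(run))
        if n < min_run and out:
            out.extend([out[-1]] * n)
        else:
            out.extend([code] * n)
    return out

# The A-B segment: the isolated "2" inside the run of 1s is absorbed.
smoothed = "".join(regularize("444411111121122"))
assert smoothed == "4444" + "1" * 9 + "22"
```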
(3) divide the direction code sequence evenly into time segments and count the direction codes in each segment to obtain the time-division direction feature.
The following feature is extracted from the regularized direction code sequence. The whole direction code stream is divided in time order into L regions, and for each region a 4-dimensional vector X = (x_1, x_2, x_3, x_4) is defined, where x_i, i = 1, 2, 3, 4, is the number of direction codes of direction i in that region. Each sub-region thus yields a 4-dimensional vector, and arranging the vectors of all sub-regions in time order gives a 4L-dimensional feature vector, as shown in Fig. 8(c). Because this statistical feature divides the direction code sequence by time segment, it is called the time-division direction feature, and L is called the number of time segments.
The choice of L matters: if it is too small, the feature discriminates poorly between classes; if it is too large, the feature is unstable within a class. Balancing stability against discriminating power, L was determined from a large number of experiments on Uygur samples: L_m = 9 for the main component, and L_a = 4 for the additional components and the connected-point component.
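The feature extraction itself is short enough to sketch directly. The function name is hypothetical, and splitting at index k·n/L (integer division) is one assumed way to divide a sequence of n codes into L nearly equal time segments.

```python
def time_division_feature(codes, num_segments):
    """Return the 4L-dimensional feature of a direction code sequence:
    for each of the L time segments, the counts of codes 1, 2, 3, 4."""
    n = len(codes)
    feature = []
    for k in range(num_segments):
        segment = codes[k * n // num_segments:(k + 1) * n // num_segments]
        feature.extend(segment.count(d) for d in (1, 2, 3, 4))
    return feature

# The regularized A-B sequence, split into 3 segments for illustration.
codes = [4] * 4 + [1] * 9 + [2] * 2
assert time_division_feature(codes, 3) == [1, 0, 0, 4, 5, 0, 0, 0, 3, 2, 0, 0]
```

With L_m = 9 the main component yields a 36-dimensional vector, and with L_a = 4 the other components yield 16-dimensional vectors.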
Step 5: average the time-division direction features of all samples of each component dimension by dimension to obtain the training feature of that component, and store it in the handwritten Uygur character component training feature library.
The library stores three groups of training features: main-component time-division direction features (58 features, 4 × 9 dimensions), additional-component time-division direction features (6 features, 4 × 4 dimensions) and connected-point time-division direction features (4 features, 4 × 4 dimensions). The time-division direction feature of the empty component is defined as the all-zero 4 × 4-dimensional vector.
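The training feature of a component is then just a per-dimension mean over its samples, as in this small sketch (the names `mean_feature` and `NULL_FEATURE` are hypothetical):

```python
def mean_feature(samples):
    """Average a list of equal-length feature vectors dimension by dimension."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

# The empty (NULL) component: an all-zero 4 x 4 = 16-dimensional vector.
NULL_FEATURE = [0.0] * 16

assert mean_feature([[1, 2], [3, 4]]) == [2.0, 3.0]
```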
Part three, the recognition process: recognize a handwritten Uygur character using the Uygur character component dictionary, the handwritten Uygur character component training feature library and the relevant recognition algorithms.
The recognition process comprises the following steps:
Step 1: collect the on-line information of a handwritten Uygur character on a mobile phone platform. The information is a series of stroke coordinate trajectories sampled in time order, with strokes separated by a delimiter, at a resolution of 512 × 512.
Step 2: preprocess the collected coordinate trajectory in the same way as in step 2 of the training process.
Step 3: segment the preprocessed character into four components with the recognition-time component segmentation algorithm: the main component, the first additional component, the second additional component and the point component.
The flowchart of the recognition-time component segmentation algorithm is shown in Fig. 6. Let M be the main component, A_1 and A_2 the additional components, and F the connected-point component; F.c denotes the connected-point type; D.n, D.p and D.r are the number, position and two-point features of the point component. The recognition-time component segmentation algorithm is then:
(1) find the point strokes and the main component by stroke length; the remaining strokes are additional components and connected-point components. The algorithm is the same as step (1) of the training-time component segmentation algorithm.
(2) determine the baseline position from the main component M, as in step (2) of the training-time component segmentation algorithm.
(3) make a preliminary decision on the additional components A_1, A_2 and the connected-point component F using the Uygur character component rules, as in step (3) of the training-time component segmentation algorithm.
(4) refine the decision on the remaining additional components and the connected-point component using the handwritten Uygur character component training feature library.
The algorithm is:
i) compute the time-division direction feature of F and its feature distances to the connected-point components and additional components in the training feature library, and find the component at minimum distance. If that component is an additional component, go to step ii); otherwise go to step iii);
ii) if A_1 = NULL, set A_1 = F and go to step iii); otherwise, if A_2 = NULL, set A_2 = F and go to step iii);
iii) correct F.c, correct the values of D.n, D.p and D.r according to F.c, and output M, A_1, A_2, D.n, D.p and D.r; the algorithm ends.
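Steps i) and ii) amount to a nearest-neighbour check against the trained features, sketched below. The library layout (dicts from a component name to its feature vector), the names `add_x` and `conn_y`, the short 4-dimensional vectors, and the function names are all hypothetical; how F.c then updates D.n, D.p and D.r in step iii) is not specified in enough detail to sketch and is left out.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def refine_connected_point(f_feature, a1, a2, additional_lib, connected_lib):
    """Return (a1, a2, f_type). If the candidate F is closest to a trained
    additional component it moves into the first free additional slot and
    f_type is None; otherwise f_type names the nearest connected-point type."""
    candidates = [("additional", name, feat)
                  for name, feat in additional_lib.items()]
    candidates += [("connected", name, feat)
                   for name, feat in connected_lib.items()]
    kind, name, _ = min(candidates,
                        key=lambda c: euclidean(f_feature, c[2]))
    if kind == "additional":
        if a1 is None:
            a1 = f_feature
        elif a2 is None:
            a2 = f_feature
        return a1, a2, None
    return a1, a2, name
```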
Step 4: extract time-division direction features from the segmented main component and additional components, using the same algorithm as in step 4 of the training process; for the segmented point component, extract the number-of-points, position and two-point structure features.
The position feature is the position of the point strokes relative to the baseline. The two-point structure feature indicates whether two points are arranged horizontally or vertically relative to each other; it is used to distinguish component No. 203 from component No. 206 when the number of point strokes is 2.
Step 5: with reference to the Uygur character component dictionary and the handwritten Uygur character component training feature library, compute the distance between the character to be recognized and each character template (128 classes) with the distance fusion algorithm.
Let D be the distance between the character to be recognized and a character template, and let D_m, D_a and D_d be the distances of the main component, the additional components and the point component respectively. The distance fusion algorithm is then:
(1) look up the Uygur character component dictionary to obtain the components of the character template, then look up the handwritten Uygur character component training feature library to obtain the time-division direction features of the template's main component and additional components;
(2) compute the main-component distance D_m: the Euclidean distance between the main-component features of the character to be recognized and of the character template;
(3) compute the additional-component distance D_a: the average of the Euclidean distances between the features of the two additional components of the character to be recognized and of the character template;
(4) compute the point-component distance D_d: if the number, position and two-point features of the point component of the character to be recognized all match those of the character template, D_d = 0; otherwise D_d = d, where d is the Euclidean distance between the training features of component No. 101 (the additional component with the most inflection points) and component No. 106 (the empty component);
(5) merge the characteristic distance that function merges each parts with Weighted distance, and with minimum distance criterion output recognition result.
Weighted distance merges function: D=λ m* D m+ λ a* (L m/ L a) * D a+ λ d* (L m/ L a) * D d, L wherein m/ L aBe normalization coefficient, L mAnd L aIt is respectively main element and optional feature time-division hop count when directional characteristic; λ m, λ a, λ dBe weighting coefficient, characterized the importance ratio of each parts in identification.To 128 Uygur character types statistics, wherein do not contain a little and the character type of optional feature has 34, the character type that contains optional feature has 33, and the character type that contains a parts has 67 (comprising 6 character types that not only contained optional feature but also contained a parts), so λ m, λ a, λ dValue be respectively: λ m=34/128, λ a=33/128, λ d=67/128.
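The fusion rule can be written out directly. The sketch below uses the weights stated above and the time segment counts L_m = 9 and L_a = 4 from the experimental settings; the dictionary layout of a sample and a template, the function names, and the `dot_penalty` argument (standing in for the distance d between components No. 101 and No. 106) are assumptions made for illustration only.

```python
import math

# Weights from the class statistics; segment counts from the experiment settings.
LAMBDA_M, LAMBDA_A, LAMBDA_D = 34 / 128, 33 / 128, 67 / 128
L_M, L_A = 9, 4

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def fused_distance(d_m, d_a, d_d):
    """D = lambda_m*D_m + lambda_a*(L_m/L_a)*D_a + lambda_d*(L_m/L_a)*D_d."""
    scale = L_M / L_A
    return LAMBDA_M * d_m + LAMBDA_A * scale * d_a + LAMBDA_D * scale * d_d

def recognize(sample, templates, dot_penalty):
    """Return (label, distance) of the template with the minimum fused distance.

    Assumed layout: {"main": vector, "additional": [vector, vector], "dots": tuple},
    where "dots" holds (count, position, two-dot structure).
    """
    best = None
    for label, tpl in templates.items():
        d_m = euclidean(sample["main"], tpl["main"])
        d_a = sum(euclidean(s, t)
                  for s, t in zip(sample["additional"], tpl["additional"])) / 2
        d_d = 0.0 if sample["dots"] == tpl["dots"] else dot_penalty
        d = fused_distance(d_m, d_a, d_d)
        if best is None or d < best[1]:
            best = (label, d)
    return best
```

Sorting all templates by the fused distance instead of keeping only the minimum would give a ranked candidate list.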
The effectiveness of the handwritten Uygur character recognition system of the present invention is further illustrated by the following experiment.
The experiment was run on a PC with an Intel dual-core T2300 processor and 512 MB of memory. It used the handwritten Uygur character set collected by the Intelligent Signal Processing and Pattern Recognition Laboratory of Xidian University. The data were collected on a mobile-phone platform and written by Uygur writers without any writing restrictions, which ensures the accuracy and practical value of the samples; some of the samples are shown in Figure 9. The sample set contains 128 character classes in 115 sets; 60 sets were selected at random for training and the remaining 55 sets were used for testing.
The experiment compares three algorithms; in all of them the directional features are the online 4-direction features of the present invention. Algorithm 1 is not based on component analysis; its feature extraction uses the elastic mesh directional features (EMDF) proposed in "Handwritten Chinese Character Recognition with Directional Decomposition Cellular Features" (LianWen Jin, Circuits Systems and Computers, 1998), with an 8 × 8 mesh. Algorithm 2 is radical-based (RB), that is, based on components, with the same feature extraction as Algorithm 1. Algorithm 3 is the algorithm of the present invention; its feature extraction uses time-division directional features (TDDF), with L_m = 9 time segments for the main component and L_a = 4 time segments for additional components and connected-dot components. The recognition rate for each group of test samples is shown in Figure 10, and the average recognition rate and recognition rate range over all test samples are given in Table 2.
Table 2: Average recognition rate and recognition rate range of the three algorithms
 | Algorithm 1 | Algorithm 2 | Algorithm 3
Average recognition rate (%) | 75.53 | 81.60 | 84.23
Recognition rate range (%) | 37.5~90.63 | 65.63~91.41 | 70.03~93.75
Comparing the recognition performance of Algorithm 1 and Algorithm 2 shows that, with the same feature extraction method, the component-analysis-based algorithm of the present invention raises the average recognition rate by 6.07 percentage points (from 75.53% to 81.60%). In addition, for samples with more connected strokes or with freer stroke positions and stroke order, the whole-character algorithm shows dips in recognition rate, whereas the algorithm of the present invention does not need to take connected strokes, stroke order or stroke position into account; its recognition rate is comparatively stable, and its results are more reliable and practical.
Comparing the recognition performance of Algorithm 2 and Algorithm 3 shows that, for character recognition based on component analysis, time-division directional features raise the average recognition rate by 2.63 percentage points over elastic mesh directional features (from 81.60% to 84.23%). This confirms that, for single cursive characters, the time-division directional features of the present invention capture stroke topology well and are more effective than elastic mesh directional features.
Table 3 lists the candidate recognition rates of the recognition algorithm of the present invention, which is based on component analysis and time-division directional features; the average recognition speed is 174 ms per character.
Table 3: Candidate recognition rates of the algorithm of the present invention
 | 1st candidate | Top 2 candidates | Top 3 candidates | Top 5 candidates | Top 10 candidates
Average recognition rate (%) | 84.23 | 89.66 | 91.45 | 94.12 | 96.48
The above experiment shows that the handwritten Uygur character recognition system of the present invention achieves an average first-candidate recognition rate of 84.23% and an average top-10 candidate recognition rate of 96.48%, at a recognition speed of 174 ms per character, so the performance of the algorithm meets practical requirements. The present invention is based on 128 classes of Uygur characters; it can be used for handwritten Uygur word recognition, and also for handwritten Uygur character recognition based on character segmentation.

Claims (10)

1. A method for identifying handwritten Uigur characters, characterized by comprising the following parts:
(1) establishing a 128-class Uygur character component dictionary and a handwritten Uygur character component library;
(2) a handwritten Uygur character training process: collecting a quantity of handwritten Uygur character samples, training on them with the relevant training algorithm using the Uygur character component dictionary, and storing the training features in the handwritten Uygur character component training feature library;
(3) a handwritten Uygur character recognition process: recognizing a handwritten Uygur character with the relevant recognition algorithm using the Uygur character component dictionary and the handwritten Uygur character component training feature library.
2. The method for identifying handwritten Uigur characters according to claim 1, characterized in that establishing the 128-class Uygur character component dictionary and the handwritten Uygur character component library comprises:
(1) Establishing the 128-class Uygur character component dictionary
The Uygur character component dictionary is established by decomposing each Uygur character into a main component, a first additional component, a second additional component and a dot component; to keep the model uniform, an empty component is defined to represent the absence of a part and is denoted by "NULL";
(2) Establishing the handwritten Uygur character component library
For all components in the Uygur character component dictionary, a handwritten Uygur character component library is established, comprising 58 main components, 6 additional components, 8 dot components and 4 connected-dot components, where a connected-dot component is the connected-stroke written form of a dot component.
3. The method for identifying handwritten Uigur characters according to claim 1, characterized in that the handwritten Uygur character training process comprises the following steps:
(3a) collecting the online information of handwritten Uygur characters on a mobile terminal platform, this information being a series of stroke coordinate trajectories sampled in time order, and collecting multiple sets of character samples as the training sample set;
(3b) preprocessing the character coordinate trajectory of each training sample, including slant correction, normalization, resampling, smoothing and broken-stroke connection;
(3c) with reference to the Uygur character component dictionary, segmenting the preprocessed character into four parts with the training-time component segmentation algorithm: main component, first additional component, second additional component and connected-dot component;
(3d) extracting time-division directional features from each component segmented from the character: 4 × 9-dimensional time-division directional features for the main component, and 4 × 4-dimensional time-division directional features for the additional components and the connected-dot component;
(3e) averaging the time-division directional features of each component over all samples to obtain the training feature of that component, and storing it in the handwritten Uygur character component training feature library.
4. The method for identifying handwritten Uigur characters according to claim 1, characterized in that the handwritten Uygur character recognition process is as follows:
(4a) collecting the online information of a handwritten Uygur character on a mobile terminal platform;
(4b) preprocessing the collected character coordinate trajectory, including slant correction, normalization, resampling, smoothing and broken-stroke connection;
(4c) with reference to the handwritten Uygur character component training feature library, segmenting the preprocessed character into four parts with the recognition-time component segmentation algorithm: main component, first additional component, second additional component and dot component;
(4d) extracting features from each component segmented from the character: 4 × 9-dimensional time-division directional features for the main component, 4 × 4-dimensional time-division directional features for the additional components, and the number of dots, the position and the two-dot structure features for the dot component;
(4e) with reference to the Uygur character component dictionary and the handwritten Uygur character component training feature library, computing the feature distance between each component of the character and the corresponding component of each character template, fusing the component distances with the weighted distance fusion function, and outputting the recognition result according to the minimum distance criterion.
5. The method for identifying handwritten Uigur characters according to claim 3 or 4, characterized in that the preprocessing of the character coordinate trajectory in step (3b) of the handwritten Uygur character training process and step (4b) of the handwritten Uygur character recognition process, namely slant correction, normalization, resampling, smoothing and broken-stroke connection, is as follows:
Slant correction uses the Hough transform; normalization uses linear normalization, with a size of 256 × 256 after normalization; the resampling interval is 1 point per 3 pixels; smoothing is a multi-point weighted average taken over the current point and the 2 points before and after it; broken-stroke connection comprises two parts, conventional broken-stroke connection and main-component stroke connection: conventional broken-stroke connection joins the breaks caused during writing by writing habits and accidental pen lifts, and main-component stroke connection joins the main component of the character into a single stroke to facilitate component segmentation.
6. The method for identifying handwritten Uigur characters according to claim 5, characterized in that the broken-stroke connection algorithm in the character preprocessing method is as follows:
(6a) stroke connection rule: the order in which two strokes are connected is determined by their closest positions;
(6b) conventional broken-stroke connection: when the distance between two strokes is less than a certain threshold, the two strokes are connected;
(6c) main-component stroke connection: the non-dot stroke preceding the longest stroke is connected to the longest stroke in writing order.
7. The method for identifying handwritten Uigur characters according to claim 3, characterized in that the training-time component segmentation algorithm in step (3c) of the handwritten Uygur character training process is as follows:
(7a) identifying the dot strokes and the main component according to stroke length, the remaining strokes being additional components and connected-dot components;
(7b) determining the baseline position from the main component;
(7c) making a preliminary judgment of the additional components and connected-dot components using the Uygur character component rules;
(7d) making a further judgment on the remaining additional components and connected-dot components according to the Uygur character component dictionary.
8. The method for identifying handwritten Uigur characters according to claim 4, characterized in that the recognition-time component segmentation algorithm in step (4c) of the handwritten Uygur character recognition process is as follows:
(8a) identifying the dot strokes and the main component according to stroke length, the remaining strokes being additional components and connected-dot components;
(8b) determining the baseline position from the main component;
(8c) making a preliminary judgment of the additional components and connected-dot components using the Uygur character component rules;
(8d) making a further judgment on the remaining additional components and connected-dot components using the handwritten Uygur character component training feature library.
9. The method for identifying handwritten Uigur characters according to claim 3 or 4, characterized in that the time-division directional feature algorithm used in step (3d) of the handwritten Uygur character training process and step (4d) of the handwritten Uygur character recognition process is as follows:
(9a) computing a direction code for each sampling point in the stroke to form a direction code sequence;
(9b) regularizing the stroke direction code sequence to remove jitter noise introduced during writing;
(9c) dividing the direction code sequence evenly into time segments and counting the direction codes in each segment to obtain the time-division directional feature, where the number of time segments is called the time segment count; the time segment count is 9 for the main component and 4 for additional components and connected-dot components.
10. The method for identifying handwritten Uigur characters according to claim 4, characterized in that the distance fusion algorithm in step (4e) of the handwritten Uygur character recognition process is as follows:
(10a) querying the Uygur character component dictionary and the handwritten Uygur character component training feature library to obtain the time-division directional features of the main component and additional components of the character template;
(10b) computing the main component distance: the Euclidean distance between the main component features of the character to be recognized and those of the character template;
(10c) computing the additional component distance: the average of the Euclidean distances between the features of the two additional components of the character to be recognized and the corresponding features of the character template;
(10d) computing the dot component distance: if the number of dots, the position and the structure features of the dot components of the character to be recognized all match those of the character template, the distance is 0; otherwise the distance is the Euclidean distance between the additional component with the most inflection points and the empty component;
(10e) fusing the feature distances of the components with the weighted distance fusion function, and outputting the recognition result according to the minimum distance criterion.
The weighted distance fusion function is $D = \lambda_m D_m + \lambda_a \frac{L_m}{L_a} D_a + \lambda_d \frac{L_m}{L_a} D_d$, where $D_m$, $D_a$ and $D_d$ are respectively the main component, additional component and dot component distances between the character to be recognized and the character template; $L_m / L_a$ is a normalization coefficient, and $L_m$ and $L_a$ are the numbers of time segments used for the time-division directional features of the main component and the additional components, respectively; $\lambda_m$, $\lambda_a$ and $\lambda_d$ are weighting coefficients that characterize the relative importance of each component in recognition and are obtained from classification statistics over the 128 Uygur character classes: $\lambda_m = 34/128$, $\lambda_a = 33/128$, $\lambda_d = 67/128$.
CN 201010204177 2010-06-18 2010-06-18 Method for identifying handwritten Uigur characters Expired - Fee Related CN101866417B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010204177 CN101866417B (en) 2010-06-18 2010-06-18 Method for identifying handwritten Uigur characters

Publications (2)

Publication Number Publication Date
CN101866417A true CN101866417A (en) 2010-10-20
CN101866417B CN101866417B (en) 2013-06-12

Family

ID=42958139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010204177 Expired - Fee Related CN101866417B (en) 2010-06-18 2010-06-18 Method for identifying handwritten Uigur characters

Country Status (1)

Country Link
CN (1) CN101866417B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664975B (en) * 2018-04-24 2022-03-25 新疆大学 Uyghur handwritten letter recognition method and system and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000348142A (en) * 1999-06-08 2000-12-15 Nippon Telegr & Teleph Corp <Ntt> Character recognizing device, method therefor and recording medium for recording program executing the method
CN1606028A (en) * 2004-11-12 2005-04-13 清华大学 Printed font character identification method based on Arabic character set
CN1652138A (en) * 2005-02-08 2005-08-10 华南理工大学 Method for identifying hand-writing characters
CN1741035A (en) * 2005-09-23 2006-03-01 清华大学 Blocks letter Arabic character set text dividing method
CN101551710A (en) * 2009-05-14 2009-10-07 广东国笔科技股份有限公司 Uigur input system and input method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
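《西苑出版社》 (Xiyuan Press) 20071231 玉素甫.艾白都拉 Research on character segmentation in pen-based Uyghur script recognition (笔式维吾尔文识别的中的文字切分研究) , 2 *
《西苑出版社版权时间页》 (Xiyuan Press, copyright date page) 20071231 玉素甫.艾白都拉 Research on character segmentation in pen-based Uyghur script recognition (笔式维吾尔文识别的中的文字切分研究) , 2 *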

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102054172A (en) * 2010-12-10 2011-05-11 浙江农林大学 Method and device for synthesizing handwritten Chinese characters based on differential vector generation component
CN103885607A (en) * 2012-12-19 2014-06-25 新疆信息产业有限责任公司 Method for judging and storing concatenation of Uyghur based on embedded system
CN103927539A (en) * 2014-03-24 2014-07-16 新疆大学 Efficient feature extraction method for off-line recognition of Uyghur handwritten signature
CN104156721A (en) * 2014-07-31 2014-11-19 南京师范大学 Off-line Chinese character stroke extraction method based on template matching
CN104156721B (en) * 2014-07-31 2017-06-23 南京师范大学 A kind of off line Chinese-character stroke extracting method based on template matches
CN104899601A (en) * 2015-05-29 2015-09-09 西安电子科技大学宁波信息技术研究院 Identification method of handwritten Uyghur words
CN106339726A (en) * 2015-07-17 2017-01-18 佳能株式会社 Method and device for handwriting recognition
CN105589925A (en) * 2015-11-25 2016-05-18 小米科技有限责任公司 Information recommendation method, device and system
CN106127162A (en) * 2016-06-28 2016-11-16 联想(北京)有限公司 A kind of signature identifying method and electronic equipment
CN106295631A (en) * 2016-07-27 2017-01-04 新疆大学 A kind of image Uighur word recognition methods and device
CN108090489A (en) * 2018-01-15 2018-05-29 兰州理工大学 Offline handwriting Balakrishnan word recognition methods of the computer based according to grapheme segmentation
CN108090489B (en) * 2018-01-15 2021-06-29 兰州理工大学 Off-line hand-written Uygur word recognition method based on grapheme segmentation based on computer
CN108764036A (en) * 2018-04-24 2018-11-06 西安电子科技大学 A kind of handwritten form Tibetan language word fourth recognition methods
CN108764155A (en) * 2018-05-30 2018-11-06 新疆大学 A kind of handwriting Uighur words cutting recognition methods
CN108764155B (en) * 2018-05-30 2021-10-12 新疆大学 Handwritten Uyghur word segmentation recognition method
CN109034186A (en) * 2018-06-11 2018-12-18 东北大学秦皇岛分校 The method for establishing DA-RBM sorter model
CN109034186B (en) * 2018-06-11 2022-05-24 东北大学秦皇岛分校 Handwriting data identification method based on DA-RBM classifier model
CN110210476A (en) * 2019-05-24 2019-09-06 北大方正集团有限公司 Basic character component clustering method, device, equipment and computer readable storage medium
CN110210476B (en) * 2019-05-24 2021-04-09 北大方正集团有限公司 Character component clustering method, device, equipment and computer readable storage medium
CN110287952A (en) * 2019-07-01 2019-09-27 中科软科技股份有限公司 A kind of recognition methods and system for tieing up sonagram piece character
CN110287952B (en) * 2019-07-01 2021-07-20 中科软科技股份有限公司 Method and system for recognizing characters of dimension picture
CN111310543A (en) * 2019-12-04 2020-06-19 湖北工业大学 Method for extracting and authenticating stroke connecting stroke characteristics in online handwriting authentication
CN111310543B (en) * 2019-12-04 2023-05-30 湖北工业大学 Method for extracting and authenticating stroke-extracting continuous stroke characteristics in online handwriting authentication
CN113011413A (en) * 2021-04-15 2021-06-22 深圳市鹰硕云科技有限公司 Method, device and system for processing handwritten image based on smart pen and storage medium
CN114898345A (en) * 2021-12-13 2022-08-12 华东师范大学 Arabic text recognition method and system

Also Published As

Publication number Publication date
CN101866417B (en) 2013-06-12

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130612

Termination date: 20190618