CN101866417B - Method for identifying handwritten Uigur characters

Info

Publication number: CN101866417B
Application number: CN 201010204177
Authority: CN
Inventors: 卢朝阳; 李静; 许亚美; 阿地力·依米提; 谭福秀; 王炜; 曹琎
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2010-06-18
Filing date: 2010-06-18
Publication date: 2013-06-12
Anticipated expiration: 2030-06-18
Also published as: CN101866417A

Abstract

The invention discloses a method for identifying handwritten Uyghur characters. It belongs to the field of character pattern recognition and can identify 128 classes of handwritten Uyghur characters effectively and in real time. The method comprises three parts: (1) establishing a Uyghur character component dictionary and a handwritten Uyghur character component library; (2) a character training process, comprising online character information acquisition, character preprocessing, component segmentation during training, extraction of time-division directional features, and calculation of the training features of each component; and (3) a character recognition process, comprising online character information acquisition, character preprocessing, component segmentation during recognition, extraction of time-division directional features and dot component features, fusion of the component features, and output of the recognition result. The method is the basis for recognizing whole Uyghur words. Combined with a handwritten Uyghur word segmentation method, it lets a user write Uyghur words freely, naturally and smoothly at any angle in the handwriting area of a mobile platform such as a smartphone, and recognizes them robustly.

Description

A method for identifying handwritten Uigur characters

Technical field

The invention belongs to the field of character pattern recognition within pattern recognition, specifically to handwritten character recognition, and is used to recognize handwritten Uyghur characters entered on a mobile terminal.

Background technology

Uyghur belongs to the Western Hunnic branch of the Turkic group of the Altaic language family and is the language of the Uyghur people, one of China's major ethnic groups. The modern Uyghur alphabet has 32 letters. Depending on its position in a word, each letter takes between 2 and 8 written forms, such as the front-connected, double-connected, back-connected and isolated forms, which yields 124 characters. A Uyghur word consists of one or more characters. Besides the 124 letter-variant characters, the character set includes two precomposed characters and two word-initial symbol forms, for 128 characters in total. The characters of a word are joined along a horizontal line called the baseline. Characters are written from right to left and from top to bottom. The part written along the baseline is called the main part and determines the principal shape of the character; the marks off the baseline, such as symbols and dots, are called the additional part and are the basis for distinguishing similar characters.

Processing and recognizing Uyghur text helps advance informatization and science and technology in ethnic minority regions, yet research on Uyghur recognition is still at an exploratory stage. Uyghur text is made up of words, and each word consists of several characters written joined along the baseline, so taking the whole word as the recognition target matches Uyghur writing habits and is practical. For printed Uyghur, Ding Xiaoqing of Tsinghua University, Halimulati of Xinjiang University and others have studied the segmentation of printed Uyghur words and the recognition of printed Uyghur characters. For printed characters, pre-classification information is used to divide the whole character set into several subsets, directional line element features are extracted from the input character, and classification is completed with a modified quadratic discriminant function to give the recognition result. For handwritten Uyghur, Yuan Baoshe of Xinjiang University and others studied a recognition method for 41 classes of handwritten Uyghur characters. They proposed a feature set of 21 features, including stroke count, main stroke structure features and additional structure features, XORed it with the feature data in a sample library, and took the sample with the fewest 1s after the operation as the recognized sample, also giving candidate samples. However, the 41 classes cover only the isolated forms of the 32 Uyghur letters and some simple isolated forms, so the method cannot be applied to Uyghur word recognition based on character segmentation.

To date, no published work reports a recognition method for the 128 classes of handwritten Uyghur characters. How to combine the specific rules of Uyghur script with existing character recognition algorithms to recognize these 128 classes is a problem that urgently needs solving.

Summary of the invention

The object of the invention is to provide a method for recognizing 128 classes of handwritten Uyghur characters. The method recognizes the 128 classes of characters segmented from Uyghur words and is the basis for recognizing whole Uyghur words. Combined with a handwritten Uyghur word segmentation method, it ultimately lets a user write Uyghur words without constraint, naturally and fluently, at any angle in the writing area of a mobile platform such as a smartphone, and recognizes them robustly.

The invention combines structural and statistical methods. For the 128 classes of Uyghur letter-variant characters, it establishes a Uyghur character component library and a component dictionary, and uses prior knowledge together with the component training features to segment each Uyghur character into several components. It then extracts time-division directional features from each component, and finally combines the component features with a weighted distance fusion function to recognize the whole character. The recognition method comprises the following three parts:

(1) Establish a component dictionary for the 128 classes of Uyghur characters and a handwritten Uyghur character component library;

(2) Handwritten Uyghur character training process: collect a certain number of handwritten Uyghur character samples, train with the Uyghur character component dictionary and the associated training algorithm, and store the training features in the handwritten Uyghur character component training feature library;

(3) Handwritten Uyghur character recognition process: recognize a handwritten Uyghur character using the Uyghur character component dictionary, the handwritten Uyghur character component training feature library and the associated recognition algorithm.

The training process comprises the following steps:

(a) Collect the online information of handwritten Uyghur characters on a mobile terminal platform. This information is a series of stroke coordinate trajectories sampled in time order. Collect multiple sets of character samples as the training sample set;

(b) Preprocess the coordinate trajectory of each training sample, including slant correction, normalization, resampling, smoothing and broken-stroke connection;

(c) With reference to the Uyghur character component dictionary, segment the preprocessed character into four components using the training-time component segmentation algorithm: the main component, the first additional component, the second additional component and the connected-dot component;

(d) Extract time-division directional features from every component segmented from the character: a 4 × 9-dimensional feature for the main component, and a 4 × 4-dimensional feature for each additional component and for the connected-dot component;

(e) Average the time-division directional features of each component over all samples to obtain that component's training feature, and store it in the handwritten Uyghur character component training feature library.

The recognition process comprises the following steps:

(a) Collect the online information of the handwritten Uyghur character on a mobile terminal platform;

(b) Preprocess the captured character coordinate trajectory, including slant correction, normalization, resampling, smoothing and broken-stroke connection;

(c) With reference to the handwritten Uyghur character component training feature library, segment the preprocessed character into four components using the recognition-time component segmentation algorithm: the main component, the first additional component, the second additional component and the dot component;

(d) Extract features from every component segmented from the character: a 4 × 9-dimensional time-division directional feature for the main component, a 4 × 4-dimensional time-division directional feature for each additional component, and the dot count, position and two-dot structure features for the dot component;

(e) With reference to the Uyghur character component dictionary and the handwritten Uyghur character component training feature library, compute the feature distance between each component of the character and the corresponding component of each of the 128 character templates, fuse the component distances with the weighted distance fusion function, and output the recognition result by the minimum-distance criterion.

The beneficial effects of the invention are as follows:

1. The invention is based on component analysis of handwritten Uyghur characters. This approach overcomes the arbitrariness of stroke positions in handwritten characters, reduces feature complexity and the number of classes, and amplifies small discriminating details, which reduces the misclassification of similar characters. It is an effective approach to handwritten character recognition;

2. During feature training the invention treats a connected dot as a special kind of component and extracts its time-division directional features. During recognition it uses the trained connected-dot features to identify dot components correctly, which solves the problem of dot strokes written joined together, common in handwritten Uyghur;

3. The invention uses time-division directional features as the discriminating feature of each component. These features suit cursive scripts such as handwritten Uyghur, capture the topology and structure of strokes well, have a relatively low dimension, and make distance computation simple;

4. The invention explores and studies the patterns and writing rules of handwritten Uyghur characters. The effectiveness of the method confirms that, for less widely used scripts such as Uyghur, making full use of the rules particular to the script and combining them with general character recognition algorithms greatly benefits the final recognition rate.

The handwritten Uyghur character set used by the invention was written by Uyghur writers and collected on a mobile phone platform; the character recognition experiments were run on a PC (Intel dual-core T2300 processor, 512 MB of memory). The experiments show that the proposed method effectively recognizes 128 classes of handwritten Uyghur characters written with free stroke order and free stroke joining, with an average recognition rate of 84.23% and a recognition time of 174 ms per character, which lays a good foundation for handwritten Uyghur word recognition based on character segmentation.

Description of drawings

Fig. 1 shows the set of 128 classes of handwritten Uyghur characters used by the invention

Fig. 2 shows example entries of the Uyghur character component dictionary

Fig. 3 shows the handwritten Uyghur character component library

Fig. 4 is the overall flowchart of the character recognition system

Fig. 5 is the flowchart of the training-time component segmentation algorithm of the character recognition system

Fig. 6 is the flowchart of the recognition-time component segmentation algorithm of the character recognition system

Fig. 7 illustrates direction code calculation in the character recognition system, where (a) shows the calculation for the starting point of a stroke and (b) shows the calculation for non-starting points

Fig. 8 illustrates extraction of the time-division directional feature, where (a) is a sample of component No. 054, (b) is an enlarged view of section A-B in (a), and (c) illustrates direction code regularization

Fig. 9 shows some of the samples used in the experimental test of the character recognition system

Fig. 10 shows the experimental results of the character recognition system

Embodiment

The method of the invention is based on 128 classes of Uyghur characters; the set of 128 classes of handwritten Uyghur characters is shown in Fig. 1. The method is divided into three parts, and the technical scheme is further described below with reference to the drawings and an embodiment.

Part 1: establishing the Uyghur character component dictionary and the handwritten Uyghur character component library. To build the component dictionary, the main part of each Uyghur character is treated as one component, called the main component. The additional part is divided into dot components and additional components according to whether it consists of dot strokes. In addition, to keep the model uniform, an empty component, written "NULL", indicates that a part is absent. Every Uyghur character can thus be decomposed into a fixed set of four components: the main component, the first additional component, the second additional component and the dot component. Example entries of the component dictionary are shown in Fig. 2, where $M$, $A_1$, $A_2$ and $D$ denote the main component, the first additional component, the second additional component and the dot component respectively, and the dashed line marks the baseline position.

A handwritten Uyghur character component library can be built for all components in the component dictionary. As shown in Fig. 3, the library contains 58 main components, 6 additional components, 8 dot components and 4 connected-dot components. A connected-dot component is the form a dot component takes when its dots are written joined together. The component dictionary contains no connected-dot components, but they are trained as a special kind of component so that, during recognition, the joined form of a dot component can be correctly told apart from an additional component.

The overall flow of the handwritten Uyghur character recognition system, shown in Fig. 4, is divided into a training process and a recognition process. In Fig. 4, rectangles denote processing algorithms, ellipses denote stored data, solid lines show the flow of the data being processed, and dashed lines show reference data needed by the corresponding algorithm. The training process comprises preprocessing, training-time component segmentation, feature calculation and storage of the component features. The recognition process comprises preprocessing, recognition-time component segmentation, feature calculation, distance fusion and output of the recognition result.

Part 2, the training process: collect handwritten Uyghur character samples, train with the Uyghur character component dictionary and the associated algorithms, and store the training features in the training feature library.

The specific steps of the training process are as follows:

Step 1: collect handwritten Uyghur character samples on a mobile phone platform. The online information of a sample is a series of stroke coordinate trajectories sampled in time order, with a separator between consecutive strokes, at a resolution of 512 × 512. For training, multiple sets of character samples must be collected;

Step 2: preprocess each training sample character, including slant correction, normalization, resampling, smoothing and broken-stroke connection;

Slant correction uses the Hough transform. Normalization is linear, and the character size after normalization is 256 × 256. Resampling approximates the path between every two consecutive points as a straight line and interpolates along the computed line equation at a spacing of one point every 3 pixels. Smoothing is a multi-point weighted average that takes into account the current point and the 2 points before and after it.
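The resampling and smoothing steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the function names `resample` and `smooth` are hypothetical, the smoothing weights (1, 2, 3, 2, 1) are an assumed choice because the text does not give them, and "the 2 points before and after" is read here as two neighbours on each side.

```python
import math


def resample(stroke, spacing=3.0):
    """Resample a polyline at a fixed arc-length spacing by linear interpolation."""
    if len(stroke) < 2:
        return list(stroke)
    out = [stroke[0]]
    carry = 0.0  # distance travelled since the last emitted point
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg == 0:
            continue  # skip repeated points
        d = spacing - carry  # distance along this segment to the next sample
        while d <= seg:
            t = d / seg
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            d += spacing
        carry = seg - (d - spacing)
    return out


def smooth(stroke, weights=(1, 2, 3, 2, 1)):
    """Weighted moving average over each point and its neighbours (weights are assumed)."""
    n = len(stroke)
    half = len(weights) // 2
    total = float(sum(weights))
    out = []
    for i in range(n):
        sx = sy = 0.0
        for k, w in enumerate(weights):
            j = min(max(i + k - half, 0), n - 1)  # clamp indices at the stroke ends
            sx += w * stroke[j][0]
            sy += w * stroke[j][1]
        out.append((sx / total, sy / total))
    return out
```

Clamping the neighbour index at the ends keeps the number of points unchanged, which matters later because stroke length is measured as a point count.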

The broken-stroke connection algorithm was designed specifically for handwritten Uyghur characters. It has two parts: conventional broken-stroke connection and main component stroke connection. Conventional broken-stroke connection joins strokes that were broken during writing by retracing or by accidentally lifting the pen. Main component stroke connection joins the strokes of the character's main component into a single stroke, which makes component segmentation easier.

Let the stroke sequence be $S_1, \ldots, S_i, \ldots, S_n$. Let $S_i.l$ denote the length of stroke $S_i$, and let $|S_i - S_j|$ denote the distance between two strokes, that is, the minimum distance from the head or tail of one stroke to any point of the other stroke. "$S_i + S_j$" denotes joining strokes $S_i$ and $S_j$. The broken-stroke connection algorithm is as follows:

(a) Rule for $S_i + S_j$: when two strokes are joined, if the tail of $S_i$ is close to the head of $S_j$, $S_j$ is appended after $S_i$; if the head of $S_i$ is close to the tail of $S_j$, $S_i$ is appended after $S_j$;

(b) Conventional broken-stroke connection: for each $S_i$, if $|S_i - S_j| < \min(S_i.l, S_j.l)/6$, set $S_i = S_i + S_j$, for $i = 1, \ldots, n$ and $j = i+1, \ldots, n$;

(c) Main component stroke connection: if $S_1.l \neq \max(S_1.l, \ldots, S_n.l)$, find the $i$ for which $S_i.l = \max(S_1.l, \ldots, S_n.l)$ and set $S_1 = S_1 + S_2 + \cdots + S_i$.
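Rules (a) to (c) can be sketched as below. This is an illustrative sketch, not the patent's code: strokes are assumed to be lists of (x, y) points, stroke length is taken as the number of points (the definition the text gives for component segmentation), and the helper names `stroke_distance`, `join` and `connect_broken_strokes` are hypothetical.

```python
import math


def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def stroke_distance(a, b):
    """Minimum distance from the head or tail of one stroke to any point of the other."""
    return min(
        min(_dist(e, p) for e in (a[0], a[-1]) for p in b),
        min(_dist(e, p) for e in (b[0], b[-1]) for p in a),
    )


def join(a, b):
    """Rule (a): order the two strokes so that the closer head/tail pair meets."""
    if _dist(a[0], b[-1]) < _dist(a[-1], b[0]):
        return b + a  # head of a is near tail of b, so a follows b
    return a + b      # tail of a is near head of b, so b follows a


def connect_broken_strokes(strokes):
    strokes = [list(s) for s in strokes]
    # Rule (b): conventional broken-stroke connection.
    i = 0
    while i < len(strokes):
        j = i + 1
        while j < len(strokes):
            limit = min(len(strokes[i]), len(strokes[j])) / 6
            if stroke_distance(strokes[i], strokes[j]) < limit:
                strokes[i] = join(strokes[i], strokes[j])
                del strokes[j]  # the next stroke moves into position j
            else:
                j += 1
        i += 1
    # Rule (c): if the first stroke is not the longest, merge strokes
    # S_1 .. S_i, where S_i is the longest, into one main stroke.
    lengths = [len(s) for s in strokes]
    k = lengths.index(max(lengths))
    if k > 0:
        merged = strokes[0]
        for s in strokes[1:k + 1]:
            merged = join(merged, s)
        strokes = [merged] + strokes[k + 1:]
    return strokes
```

Note that rule (b) compares a distance in pixels with a length in sample points, as the source formula does; with the 3-pixel resampling interval the two quantities are proportional.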

Step 3: segment the preprocessed character into four components using the training-time component segmentation algorithm: the main component, the first additional component, the second additional component and the connected-dot component;

The flowchart of the training-time component segmentation algorithm is shown in Fig. 5. Let $M$ be the main component, $A_1$ and $A_2$ the additional components, and $F$ the connected-dot component. $F.c$ denotes the connected-dot type, that is, which kind of joined dot it is. $D.n$, $D.p$ and $D.r$ are the dot count, position and two-dot structure features of the dot component. The training-time component segmentation algorithm is as follows:

(1) Find the dot strokes and the main component by stroke length; the remaining strokes are additional components and connected-dot components;

A dot stroke is a stroke whose length is below a threshold (the dot threshold is 1/10 of the normalized character width), and the main component is the longest stroke. Stroke length is measured as the number of sample points after normalization and resampling.

(2) Determine the baseline position from the main component $M$;

Specifically, find the horizontal sections of $M$ whose length exceeds 1/6 of the total length; the baseline position is the ordinate of the first such section.

(3) Use the Uyghur character component rules to make a preliminary decision on the additional components and the connected-dot component;

The Uyghur character component rules are: 1) the main component stroke is longer than any other stroke; 2) when both additional components of a character are non-empty, they are arranged side by side (left and right); 3) when the additional component and the dot component are both non-empty, the additional component lies below the dot component; 4) additional components are written only above the baseline.

Following Fig. 5, the algorithm can be described as follows:

i) Initialization: list the remaining strokes as $S_1, \ldots, S_i, \ldots, S_n$ and set $A_1 = \mathrm{NULL}$, $A_2 = \mathrm{NULL}$, $F = \mathrm{NULL}$, $i = 1$;

ii) If $S_i$ lies below the baseline, set $F = S_i$ and go to step iv); otherwise go to step iii);

iii) If $A_1 = \mathrm{NULL}$, set $A_1 = S_i$ and go to step iv); otherwise, if $A_2 = \mathrm{NULL}$, set $A_2 = S_i$ and go to step iv);

iv) Set $i = i + 1$. If $i \neq n$, go to step ii); otherwise go to step v);

v) If $A_1 \neq \mathrm{NULL}$, $A_2 \neq \mathrm{NULL}$, and $A_1$ and $A_2$ are arranged one above the other, go to step vi); otherwise the algorithm ends;

vi) If $A_1$ is above $A_2$, set $F = A_1$, $A_1 = A_2$, $A_2 = \mathrm{NULL}$ and end; otherwise set $F = A_2$, $A_2 = \mathrm{NULL}$ and end.

(4) Use the Uyghur character component dictionary to make a further decision on the remaining additional components and the connected-dot component.

Specifically, compute $D.n$, $D.p$ and $D.r$ from the dot strokes, and compute $F.c$ from $F$ together with $D.n$, $D.p$ and $D.r$. Look up the component dictionary using the known class of the sample and check whether $D.n$, $D.p$ and $D.r$ match the dot component of that character class in the dictionary. If they do not match, correct the current values of $A_1$, $A_2$, $F$ and $F.c$; otherwise output $M$, $A_1$, $A_2$, $F$ and $F.c$, and the algorithm ends.
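Steps (1) to (3) of the segmentation can be sketched as follows. The sketch rests on several assumptions that the text leaves open: image coordinates with y increasing downward, a segment counting as horizontal when |dx| >= |dy|, a stroke's position taken from its centroid, and "one above the other" meaning the centroids differ more in y than in x. All function names are hypothetical, and the loop simply visits every remaining stroke.

```python
def split_by_length(strokes, dot_threshold=256 / 10):
    """Step (1): longest stroke is the main component, short strokes are dots."""
    main = max(strokes, key=len)
    dots, others = [], []
    for s in strokes:
        if s is main:
            continue
        (dots if len(s) < dot_threshold else others).append(s)
    return main, dots, others


def find_baseline(main, min_fraction=1 / 6):
    """Step (2): ordinate of the first horizontal run longer than 1/6 of the stroke."""
    total = len(main) - 1  # number of segments
    run_start = None
    for i in range(total + 1):
        horizontal = False
        if i < total:
            dx = main[i + 1][0] - main[i][0]
            dy = main[i + 1][1] - main[i][1]
            horizontal = abs(dx) >= abs(dy) and (dx, dy) != (0, 0)
        if horizontal and run_start is None:
            run_start = i
        elif not horizontal and run_start is not None:
            if i - run_start > total * min_fraction:
                ys = [p[1] for p in main[run_start:i + 1]]
                return sum(ys) / len(ys)
            run_start = None
    return None  # no sufficiently long horizontal section found


def centroid(stroke):
    n = len(stroke)
    return sum(p[0] for p in stroke) / n, sum(p[1] for p in stroke) / n


def assign_components(strokes, baseline_y):
    """Step (3), sub-steps i) to vi): returns (A1, A2, F), with None for NULL."""
    a1 = a2 = f = None
    for s in strokes:
        if centroid(s)[1] > baseline_y:  # below the baseline: connected dot
            f = s
        elif a1 is None:
            a1 = s
        elif a2 is None:
            a2 = s
    if a1 is not None and a2 is not None:
        (x1, y1), (x2, y2) = centroid(a1), centroid(a2)
        if abs(y1 - y2) > abs(x1 - x2):  # stacked vertically
            if y1 < y2:                  # A1 is above A2: the upper one is the dot
                f, a1, a2 = a1, a2, None
            else:
                f, a2 = a2, None
    return a1, a2, f
```

The vertical-stacking test encodes rules 2) and 3): two genuine additional components sit side by side, so a stacked pair must be an additional component with a connected dot above it.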

Step 4: compute time-division directional features for the main component, additional components and connected-dot component segmented from the character;

The time-division directional feature algorithm is as follows:

(1) Compute a direction code for each sample point of a stroke;

An online handwritten Uyghur character is represented by a series of coordinate trajectories captured in real time, and the invention computes the direction code of each sample point with a fuzzy direction code extraction method. Referring to Fig. 7, for the starting point of a stroke the plane is divided into 4 regions, as shown in Fig. 7(a), and the direction is coded directly by the region it falls in. From the second point of the stroke onward, the plane is divided into 8 regions, as shown in Fig. 7(b). The unshaded regions are clear regions: a stroke direction falling in one is coded directly by region. The shaded regions are fuzzy regions: a stroke direction falling in one is judged to have changed relative to the previous direction only if the angle between it and the previous direction exceeds a threshold. This effectively avoids the decision errors introduced when the direction fluctuates slightly around a region boundary.

(2) Regularize the stroke direction code sequence to remove the jitter noise of handwriting;

Computing a direction code for every point of a stroke yields a direction code sequence in time order. Referring to Fig. 8, Fig. 8(a) is a sample of component No. 053 and Fig. 8(b) is an enlarged view of section A-B in Fig. 8(a). Coding the line between every two consecutive points with the method of Fig. 7 gives the direction code sequence of section A-B, 444411111121122, as shown in Fig. 8(c). To eliminate the jitter noise in a handwritten character, a direction code that changes only briefly is replaced by the adjacent code before it. The regularized direction code sequence of section A-B is 444411111111122, as shown in Fig. 8(c).

(3) Divide the direction code sequence evenly into time segments and count the direction codes in each segment to obtain the time-division directional feature.

The following feature is extracted from the regularized direction code sequence. The whole direction code stream is divided in time order into $L$ regions. For each region a 4-dimensional vector $X = (x_1, x_2, x_3, x_4)$ is defined, where $x_i$, $i = 1, 2, 3, 4$, is the number of direction codes of the $i$-th direction in that region. Each sub-region thus gives a 4-dimensional vector, and arranging the vectors of all sub-regions in time order forms a $4L$-dimensional feature vector, as shown in Fig. 8(c). This statistical feature is called the time-division directional feature, and $L$ is called the number of time segments into which the direction code sequence is divided.

The value of $L$ matters. If it is too small, the ability to separate classes is weak; if it is too large, the feature is unstable within a class. Weighing stability against discriminating power, the invention determined $L$ through extensive experiments on samples: $L_m = 9$ for the main component, and $L_a = 4$ for the additional components and the connected-dot component.
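The three steps of the time-division directional feature can be sketched as follows. Several details are assumptions, since they are defined only in Fig. 7: the numbering of the four direction codes, sector boundaries on the diagonals, the fuzzy-zone half-width (15 degrees) and the change threshold (30 degrees) are all hypothetical values, and a "brief" change is taken to be a run of a single code. The regularization and counting steps follow the text directly and reproduce its A-B example.

```python
import math


def _sector(angle):
    """Map an angle in degrees to a direction code 1..4 (numbering is assumed)."""
    return int(((angle + 45) % 360) // 90) + 1


def direction_codes(stroke, fuzzy_half_width=15.0, change_threshold=30.0):
    """Step (1): fuzzy direction codes, one per segment between consecutive points."""
    codes, prev_angle = [], None
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        angle = math.degrees(math.atan2(y1 - y0, x1 - x0)) % 360
        code = _sector(angle)
        if codes:
            r = (angle - 45) % 90
            in_fuzzy = min(r, 90 - r) < fuzzy_half_width  # near a sector boundary
            turn = abs((angle - prev_angle + 180) % 360 - 180)
            if in_fuzzy and turn <= change_threshold:
                code = codes[-1]  # small wobble near a boundary: keep previous code
        codes.append(code)
        prev_angle = angle
    return codes


def regularize(codes, min_run=2):
    """Step (2): replace runs shorter than min_run with the code before them."""
    out, i = [], 0
    while i < len(codes):
        j = i
        while j < len(codes) and codes[j] == codes[i]:
            j += 1
        run = j - i
        fill = out[-1] if (run < min_run and out) else codes[i]
        out.extend([fill] * run)
        i = j
    return out


def tddf(codes, segments):
    """Step (3): split evenly in time and count each direction per segment (4L dims)."""
    n = len(codes)
    feature = []
    for k in range(segments):
        chunk = codes[k * n // segments:(k + 1) * n // segments]
        feature.extend(chunk.count(d) for d in (1, 2, 3, 4))
    return feature
```

With `segments=9` the feature has 36 dimensions, matching the 4 × 9 main component feature, and with `segments=4` it has 16, matching the 4 × 4 feature of the other components.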

Step 5: average, dimension by dimension, the time-division directional features of all samples of each component to obtain that component's training feature, and store it in the handwritten Uyghur character component training feature library.

The training feature library stores three groups of training features: main component time-division directional features (58 of them, 4 × 9 dimensions each), additional component time-division directional features (6, 4 × 4 dimensions each) and connected-dot component time-division directional features (4, 4 × 4 dimensions each). The time-division directional feature of the empty component is defined as the all-zero 4 × 4-dimensional vector.
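As a small illustration of step 5, the per-dimension average and the empty-component feature might look like this. The names `mean_feature` and `EMPTY_COMPONENT_FEATURE` are hypothetical; the text does not say how the library is stored.

```python
def mean_feature(samples):
    """Average a list of equal-length feature vectors dimension by dimension."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


# The empty (NULL) component is defined as an all-zero 4 x 4 = 16-dimensional vector.
EMPTY_COMPONENT_FEATURE = [0.0] * 16
```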

Part 3, the recognition process: recognize a handwritten Uyghur character using the Uyghur character component dictionary, the handwritten Uyghur character component training feature library and the associated recognition algorithm.

The specific steps of the recognition process are as follows:

Step 1: collect the online information of the handwritten Uyghur character on a mobile phone platform. This information is a series of stroke coordinate trajectories sampled in time order, with a separator between consecutive strokes, at a resolution of 512 × 512;

Step 2: preprocess the captured character coordinate trajectory, using the same preprocessing as in step 2 of the training process;

Step 3: segment the preprocessed character into four components using the recognition-time component segmentation algorithm: the main component, the first additional component, the second additional component and the dot component;

The flowchart of the recognition-time component segmentation algorithm is shown in Fig. 6. Let $M$ be the main component, $A_1$ and $A_2$ the additional components, and $F$ the connected-dot component. $F.c$ denotes the connected-dot type, and $D.n$, $D.p$ and $D.r$ are the dot count, position and two-dot structure features of the dot component. The recognition-time component segmentation algorithm is as follows:

(1) Find the dot strokes and the main component by stroke length; the remaining strokes are additional components and connected-dot components. The algorithm is the same as step (1) of the training-time component segmentation algorithm;

(2) Determine the baseline position from the main component $M$. The algorithm is the same as step (2) of the training-time component segmentation algorithm;

(3) Use the Uyghur character component rules to make a preliminary decision on the additional components $A_1$ and $A_2$ and the connected-dot component $F$. The algorithm is the same as step (3) of the training-time component segmentation algorithm;

(4) Use the handwritten Uyghur character component training feature library to make a further decision on the remaining additional components and the connected-dot component.

The algorithm can be described as follows:

i) Compute the time-division directional feature of $F$ and its feature distance to each connected-dot component and additional component in the training feature library, and find the component at minimum distance. If that component is an additional component, go to step ii); otherwise go to step iii);

ii) If $A_1 = \mathrm{NULL}$, set $A_1 = F$ and go to step iii); otherwise, if $A_2 = \mathrm{NULL}$, set $A_2 = F$ and go to step iii);

iii) Correct $F.c$, correct the values of $D.n$, $D.p$ and $D.r$ according to $F.c$, and output $M$, $A_1$, $A_2$, $D.n$, $D.p$ and $D.r$; the algorithm ends.
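Sub-steps i) and ii) amount to a nearest-neighbour decision, which can be sketched as below. The library layout (dictionaries mapping a component identifier to its 16-dimensional training feature), the identifiers and the function names are all assumptions for illustration; the correction of $F.c$ and the dot features in sub-step iii) depends on the component dictionary and is not shown.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_component(f_feature, additional_lib, connected_dot_lib):
    """Sub-step i): find the closest trained component and report its kind."""
    candidates = [("additional", k, v) for k, v in additional_lib.items()]
    candidates += [("connected_dot", k, v) for k, v in connected_dot_lib.items()]
    kind, comp_id, _ = min(candidates, key=lambda c: euclidean(f_feature, c[2]))
    return kind, comp_id


def reassign(f, kind, a1, a2):
    """Sub-step ii): if F is really an additional component, move it to A1 or A2."""
    if kind == "additional":
        if a1 is None:
            return f, a2, None
        if a2 is None:
            return a1, f, None
    return a1, a2, f
```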

Step 4: extract time-division directional features from the main component and additional components segmented from the character, using the same algorithm as in step 4 of the training process. For the segmented dot component, extract the dot count, position and two-dot structure features;

The position feature is the position of the dot strokes relative to the baseline. The two-dot structure feature indicates whether the two dots are arranged horizontally or vertically, and is used to distinguish component No. 203 from component No. 206 when the number of dot strokes is 2.
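The three dot features can be sketched as follows, assuming image coordinates with y increasing downward and taking each dot stroke's position from its centroid. The function name and the string labels are hypothetical, and dots contributed by a connected-dot stroke (via $F.c$) are not handled here.

```python
def dot_features(dot_strokes, baseline_y):
    """Return (count, position, two-dot structure) for a list of dot strokes."""
    centers = []
    for s in dot_strokes:
        n = len(s)
        centers.append((sum(p[0] for p in s) / n, sum(p[1] for p in s) / n))
    count = len(centers)
    if count == 0:
        return 0, None, None
    mean_y = sum(y for _, y in centers) / count
    position = "below" if mean_y > baseline_y else "above"
    structure = None
    if count == 2:  # the structure feature is only defined for two dots
        (x1, y1), (x2, y2) = centers
        structure = "horizontal" if abs(x1 - x2) >= abs(y1 - y2) else "vertical"
    return count, position, structure
```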

Step 5: with reference to the Uyghur character component dictionary and the handwritten Uyghur character component training feature library, compute the distance between the character to be recognized and each of the 128 character templates using the distance fusion algorithm.

Let the distance between the character to be recognized and a character template be $D$, and let the distances for the main component, additional components and dot component be $D_m$, $D_a$ and $D_d$ respectively. The distance fusion algorithm is as follows:

(1) Look up the Uyghur character component dictionary to obtain the components of the character template, then look up the handwritten Uyghur character component training feature library to obtain the time-division directional features of the template's main component and additional components;

(2) Compute the main component distance $D_m$: the Euclidean distance between the main component features of the character to be recognized and of the character template;

(3) Compute the additional component distance $D_a$: the average of the Euclidean distances between the features of the two additional components of the character to be recognized and the corresponding features of the character template;

(4) Compute the dot component distance $D_d$: if the dot count, position and two-dot structure features of the dot component of the character to be recognized all match those of the character template, $D_d = 0$; otherwise $D_d = d$, where $d$ is the Euclidean distance between the training features of component No. 101 (the additional component with the most inflection points) and component No. 106 (the empty component);

(5) Fuse the feature distances of all components with the weighted distance fusion function and output the recognition result by the minimum-distance criterion.

Weighted distance fusion function: $D = \lambda_m D_m + \lambda_a (L_m / L_a) D_a + \lambda_d (L_m / L_a) D_d$, where $L_m / L_a$ is a normalization coefficient, $L_m$ and $L_a$ are the numbers of time segments of the time-division directional features of the main component and the additional component respectively, and $\lambda_m$, $\lambda_a$, $\lambda_d$ are weighting coefficients that characterize the relative importance of each component in recognition. Counting over the 128 Uyghur character classes, 34 classes contain neither dot nor additional components, 33 classes contain additional components, and 67 classes contain dot components (including 6 classes that contain both additional and dot components), so the coefficients are $\lambda_m = 34/128$, $\lambda_a = 33/128$, $\lambda_d = 67/128$.
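The fusion function and the minimum-distance decision can be sketched as follows, using the coefficients stated above. The function names and the input layout (a dictionary from class label to the triple $(D_m, D_a, D_d)$) are hypothetical.

```python
import math

LAMBDA_M, LAMBDA_A, LAMBDA_D = 34 / 128, 33 / 128, 67 / 128
L_M, L_A = 9, 4  # time segments for the main and additional components


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def fused_distance(d_m, d_a, d_d):
    """D = lambda_m*D_m + lambda_a*(L_m/L_a)*D_a + lambda_d*(L_m/L_a)*D_d."""
    scale = L_M / L_A  # brings 16-dimensional distances onto the 36-dimensional scale
    return LAMBDA_M * d_m + LAMBDA_A * scale * d_a + LAMBDA_D * scale * d_d


def recognize(component_distances):
    """Return the class label whose fused distance is smallest."""
    return min(component_distances,
               key=lambda label: fused_distance(*component_distances[label]))
```

For example, `recognize({"a": (10, 0, 0), "b": (1, 0, 0)})` picks `"b"`, the template with the smaller fused distance.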

The effect of the handwritten Uygur character recognition system of the present invention is further illustrated by the following experimental test.

The experiment was run on a PC with an Intel dual-core T2300 processor and 512 MB of memory. It uses the handwritten Uygur character set collected by the Intelligent Signal Processing and Pattern Recognition Laboratory of Xidian University. The data were collected on a mobile-phone platform and written by Uygur writers without any writing restrictions, which ensures that the samples are accurate and of practical value; some of the samples are shown in Figure 9. The sample set contains 128 character classes in 115 sets in total; 60 sets were selected at random for training and the remaining 55 sets were used for testing.

Three algorithms are compared in the experiment; in all of them the directional features are the online 4-direction features of the present invention. Algorithm one is not based on component analysis; its feature extraction uses the elastic mesh directional features (EMDF) proposed in "Handwritten Chinese Character Recognition with Directional Decomposition Cellular Features" (LianWen Jin, Circuits Systems and Computers, 1998), with a grid of 8 × 8. Algorithm two is component-based (radical-based, RB), with the same feature extraction as algorithm one. Algorithm three is the algorithm of the present invention; its feature extraction uses time-division directional features (TDDF), with the number of time segments set to $L_m = 9$ for the main component and $L_a = 4$ for additional components and dot-connected components. The recognition rate of each test sample group is shown in Figure 10, and the average recognition rate and recognition rate range over all test samples are listed in Table 2.

Table 2. Average recognition rate and recognition rate range of the three algorithms

	Algorithm one	Algorithm two	Algorithm three
Average recognition rate (%)	75.53	81.60	84.23
Recognition rate range (%)	37.5～90.63	65.63～91.41	70.03～93.75

Comparing algorithms one and two shows that, with the same feature extraction method, the component-analysis-based algorithm of the present invention raises the average recognition rate by 6.07 percentage points. Moreover, for samples with more connected strokes, or with freer stroke positions and stroke order, the whole-character algorithm shows troughs in recognition rate, whereas the algorithm of the present invention does not need to consider connected strokes, stroke order or stroke position; its recognition rate is comparatively stable and its results are more reliable and practical.

Comparing algorithms two and three shows that, for character recognition based on component analysis, the time-division directional features raise the average recognition rate by 2.63 percentage points over the elastic mesh directional features. This confirms that, for single cursive characters, the time-division directional features of the present invention describe the stroke topology well and are more effective than the elastic mesh directional features.

Table 3 lists the candidate recognition rates of the recognition algorithm of the present invention, which is based on component analysis and time-division directional features; the average recognition speed is 174 ms per character.

Table 3. Candidate recognition rates of the algorithm of the present invention

	1st candidate	Top 2 candidates	Top 3 candidates	Top 5 candidates	Top 10 candidates
Average recognition rate (%)	84.23	89.66	91.45	94.12	96.48

The above experiments show that the handwritten Uygur character recognition system of the present invention achieves an average first-candidate recognition rate of 84.23% and an average top-10 candidate recognition rate of 96.48%, at a recognition speed of 174 ms per character, so the performance of the algorithm meets practical requirements. The present invention covers 128 classes of Uygur characters; it can be used for handwritten Uygur word recognition based on character segmentation, and also for handwritten Uygur character recognition.

Claims

1. A method for identifying handwritten Uygur characters, characterized by comprising the following parts:

(1) Establishment of the handwritten Uygur character component library: for handwritten Uygur characters, a handwritten Uygur character component library is established with 76 components in total, comprising 58 main components, 6 additional components, 8 dot components and 4 dot-connected components, where a dot-connected component is the connected-stroke form of a dot component;

(2) Establishment of the Uygur character component dictionary: for the 128 classes of Uygur characters, a Uygur character component dictionary is established in which each Uygur character is decomposed into a main component, a first additional component, a second additional component and a dot component; to keep the model uniform, an empty component, denoted "NULL", is defined to indicate that the corresponding part is absent;

(3) Training process of the handwritten Uygur character recognition algorithm: a certain number of handwritten Uygur character samples are collected; using the component structure provided by the Uygur character component template, the character samples are partitioned into components; the partitioned components are trained with the corresponding training algorithm, and the training features are stored in the handwritten Uygur character component training feature library;

(4) Recognition process of the handwritten Uygur character recognition algorithm: the handwritten character sample to be recognized is partitioned into components according to Uygur character writing rules; different features are extracted from each partitioned component and its recognition distance is computed; the component composition of each character class is given by the character component dictionary, and the multiple components are fused with a weighted-sum strategy to obtain the final character recognition result.
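As an illustration only, the component dictionary of part (2) could be represented as a mapping from character class to a four-slot decomposition, with the empty component standing in for absent slots. The class labels and component identifiers below (`char_A`, `main_01`, `add_01`, `dot_02`) are hypothetical placeholders; the claims do not disclose the actual entries.

```python
from typing import NamedTuple, Optional

NULL = None  # stands in for the "NULL" empty component


class Decomposition(NamedTuple):
    """Four slots per character class, as described in part (2)."""
    main: str
    additional_1: Optional[str]
    additional_2: Optional[str]
    dot: Optional[str]


# Hypothetical entries; a real dictionary would hold all 128 classes.
COMPONENT_DICTIONARY = {
    "char_A": Decomposition("main_01", NULL, NULL, NULL),
    "char_B": Decomposition("main_01", "add_01", NULL, "dot_02"),
}


def components_of(label):
    """Return the non-empty components of a character class, in slot order."""
    return [c for c in COMPONENT_DICTIONARY[label] if c is not NULL]
```

Keeping all four slots for every class, even when most are empty, is what lets a single fusion formula apply uniformly to every character.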

2. The method for identifying handwritten Uygur characters according to claim 1, characterized in that the handwritten Uygur character training process comprises the following steps:

(2a) Collect the online information of handwritten Uygur characters on a mobile terminal platform, this information being a series of stroke coordinate trajectories sampled in time order; collect multiple sets of character samples as the training sample set;

(2b) Preprocess the character coordinate trajectory of each training sample, including slant correction, normalization, resampling, smoothing and broken-stroke connection;

(2c) With reference to the Uygur character component dictionary, partition the preprocessed character with the training-time component partitioning algorithm into four parts: main component, first additional component, second additional component and dot-connected component;

(2d) Extract time-division directional features from each component partitioned from the character: 4 × 9-dimensional time-division directional features for the main component, and 4 × 4-dimensional time-division directional features for additional components and dot-connected components;

(2e) For each component, average the time-division directional features over all samples to obtain the training features of that component, and store them in the handwritten Uygur character component training feature library.
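Step (2e) amounts to an element-wise mean over the feature vectors of one component. A minimal sketch, assuming each sample's feature is a flat list of equal length (the function name `mean_feature` is illustrative):

```python
def mean_feature(samples):
    """Element-wise mean of equal-length feature vectors for one component."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]
```

For instance, `mean_feature([[1, 2], [3, 4]])` gives `[2.0, 3.0]`.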

3. The method for identifying handwritten Uygur characters according to claim 1, characterized in that the handwritten Uygur character recognition process is as follows:

(3a) Collect the online information of handwritten Uygur characters on a mobile terminal platform;

(3b) Preprocess the collected character coordinate trajectory, including slant correction, normalization, resampling, smoothing and broken-stroke connection;

(3c) With reference to the handwritten Uygur character component training feature library, partition the preprocessed character with the recognition-time component partitioning algorithm into four parts: main component, first additional component, second additional component and dot component;

(3d) Extract features from each component partitioned from the character: 4 × 9-dimensional time-division directional features for the main component, 4 × 4-dimensional time-division directional features for additional components, and the number of dots, position and two-dot structure features for the dot component;

(3e) With reference to the Uygur character component dictionary and the handwritten Uygur character component training feature library, compute the feature distance between each component of the character and the corresponding component of each character template, fuse the component features with the weighted distance fusion function, and output the recognition result according to the minimum distance criterion.

4. The method for identifying handwritten Uygur characters according to claim 2 or 3, characterized in that, in step (2b) of the handwritten Uygur character training process and step (3b) of the handwritten Uygur character recognition process, the preprocessing of the character coordinate trajectory by slant correction, normalization, resampling, smoothing and broken-stroke connection is as follows:

Slant correction uses the Hough transform; normalization uses linear normalization, with a size of 256 × 256 after normalization; the resampling interval is 1 point per 3 pixels; smoothing is a multi-point weighted mean taken over the current point and the 2 points before and after it; broken-stroke connection comprises two parts, conventional broken-stroke connection and main-component stroke connection: conventional broken-stroke connection joins the breaks caused during writing by rewriting and accidental pen lifts, and main-component stroke connection joins the main component of the character into a single stroke to facilitate component partitioning.
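The normalization, resampling and smoothing steps can be sketched in Python as below. This is a hedged illustration, not the claimed implementation: the function names are invented for the example, the smoothing weights `(1, 2, 3, 2, 1)` are hypothetical because the claim gives no values, the window is assumed to span two points on each side of the current point, and each axis is assumed to be scaled independently. Hough-based slant correction and broken-stroke connection are omitted.

```python
import math

SIZE = 256                   # normalized box size, from the claim
STEP = 3.0                   # resampling interval in pixels, from the claim
WEIGHTS = (1, 2, 3, 2, 1)    # hypothetical smoothing weights (assumption)


def normalize(points, size=SIZE):
    """Linearly map the bounding box of the points onto [0, size - 1]^2."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    w = max(max(xs) - x0, 1e-9)  # guard against zero-width input
    h = max(max(ys) - y0, 1e-9)  # guard against zero-height input
    return [((x - x0) / w * (size - 1), (y - y0) / h * (size - 1))
            for x, y in points]


def resample(stroke, step=STEP):
    """Emit one point every `step` pixels of arc length along one stroke."""
    out = [stroke[0]]
    carry = 0.0  # arc length travelled since the last emitted point
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg == 0:
            continue
        d = step - carry  # distance into this segment of the next point
        while d <= seg:
            t = d / seg
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            d += step
        carry = seg - (d - step)
    return out


def smooth(stroke, weights=WEIGHTS):
    """Weighted moving average; the window is clipped at the stroke ends."""
    half = len(weights) // 2
    out = []
    for i in range(len(stroke)):
        sx = sy = sw = 0.0
        for k, w in enumerate(weights):
            j = i + k - half
            if 0 <= j < len(stroke):
                sx += w * stroke[j][0]
                sy += w * stroke[j][1]
                sw += w
        out.append((sx / sw, sy / sw))
    return out
```

Resampling by arc length rather than by time makes the point density independent of writing speed, which matters because the later directional features count points per time segment.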

5. The method for identifying handwritten Uygur characters according to claim 4, characterized in that the broken-stroke connection algorithm in the character preprocessing method is as follows:

(5a) Stroke connection rule: the connection order is determined by the closest positions of the two strokes;

(5b) Conventional broken-stroke connection: when the distance between two strokes is less than a certain threshold, the two strokes are connected;

(5c) Main-component stroke connection: the non-dot strokes preceding the longest stroke are connected in order with the longest stroke.
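Rules (5a) and (5b) can be sketched as follows. The sketch assumes that "distance between two strokes" means the smallest gap between their endpoints, and it leaves the threshold as a caller-supplied parameter because the claim only says "a certain threshold"; the function name `join_if_close` is illustrative. Rule (5c) is not covered.

```python
import math


def join_if_close(a, b, threshold):
    """Join strokes a and b at their nearest endpoints if the gap is small.

    Each stroke is a list of (x, y) tuples. Returns the merged stroke,
    or None when the smallest endpoint gap is not below `threshold`.
    """
    candidates = [
        (math.dist(a[-1], b[0]), a + b),         # end of a to start of b
        (math.dist(a[-1], b[-1]), a + b[::-1]),  # end of a to end of b
        (math.dist(a[0], b[0]), a[::-1] + b),    # start of a to start of b
        (math.dist(a[0], b[-1]), b + a),         # end of b to start of a
    ]
    gap, merged = min(candidates, key=lambda c: c[0])
    return merged if gap < threshold else None
```

Choosing the pair of endpoints with the smallest gap fixes the connection order, so the merged trajectory does not depend on the order in which the two fragments were written.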

6. The method for identifying handwritten Uygur characters according to claim 2, characterized in that the training-time component partitioning algorithm of step (2c) in the handwritten Uygur character training process is as follows:

(6a) Identify the dot strokes and the main component according to stroke length; the remaining strokes are additional components and dot-connected components;

(6b) Determine the baseline position from the main component;

(6c) Use the Uygur character component rules to make a preliminary judgment of the additional components and dot-connected components;

(6d) According to the Uygur character component dictionary, make a further judgment of the remaining additional components and dot-connected components.
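Step (6a) can be illustrated with a coarse length-based split. The sketch below rests on assumptions not stated in the claim: the main component is taken to be the single longest stroke (plausible once main-component stroke connection has been applied), dot strokes are taken to be those no longer than a hypothetical cutoff `dot_max_len`, and the function names are illustrative. Baseline estimation and the rule-based and dictionary-based refinements of steps (6b) to (6d) are not shown.

```python
import math


def stroke_length(stroke):
    """Polyline length of one stroke given as a list of (x, y) tuples."""
    return sum(math.dist(p, q) for p, q in zip(stroke, stroke[1:]))


def coarse_partition(strokes, dot_max_len=12.0):
    """Split stroke indices into main, dot and undecided groups by length.

    `dot_max_len` is a hypothetical cutoff in pixels (assumption).
    """
    lengths = [stroke_length(s) for s in strokes]
    main_idx = max(range(len(strokes)), key=lengths.__getitem__)
    dots = [i for i, n in enumerate(lengths)
            if i != main_idx and n <= dot_max_len]
    undecided = [i for i in range(len(strokes))
                 if i != main_idx and i not in dots]
    return {"main": main_idx, "dots": dots, "undecided": undecided}
```

The `undecided` group corresponds to the strokes that the later steps classify as additional or dot-connected components.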

7. The method for identifying handwritten Uygur characters according to claim 3, characterized in that the recognition-time component partitioning algorithm of step (3c) in the handwritten Uygur character recognition process is as follows:

(7a) Identify the dot strokes and the main component according to stroke length; the remaining strokes are additional components and dot-connected components;

(7b) Determine the baseline position from the main component;

(7c) Use the Uygur character component rules to make a preliminary judgment of the additional components and dot-connected components;

(7d) Use the handwritten Uygur character component training feature library to make a further judgment of the remaining additional components and dot-connected components.

8. The method for identifying handwritten Uygur characters according to claim 2 or 3, characterized in that the time-division directional feature algorithm of step (2d) and step (3d) in the handwritten Uygur character training process and the handwritten Uygur character recognition process is as follows:

(8a) Compute a direction code for each sampling point in the stroke, forming a direction code sequence;

(8b) Regularize the stroke direction code sequence to remove the jitter noise introduced in writing;

(8c) Divide the direction code sequence evenly into time segments and count the direction codes in each segment to obtain the time-division directional feature; the number of time segments is called the time-segment count, which is 9 for the main component and 4 for additional components and dot-connected components.
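Steps (8a) and (8c) can be sketched as below. The 4-direction quantization used here (horizontal, 45° diagonal, vertical, 135° diagonal, ignoring the sense of travel) is an assumption, since the claim does not define the direction codes; the function names are illustrative, and the regularization of step (8b) is omitted.

```python
import math


def direction_codes(stroke):
    """One code in {0, 1, 2, 3} per pair of consecutive points.

    Assumed coding: 0 = horizontal, 1 = 45 degree diagonal,
    2 = vertical, 3 = 135 degree diagonal.
    """
    codes = []
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        if (x0, y0) == (x1, y1):
            continue  # a repeated point carries no direction
        angle = math.degrees(math.atan2(y1 - y0, x1 - x0)) % 180.0
        codes.append(int((angle + 22.5) // 45) % 4)
    return codes


def tddf(codes, segments):
    """Count each direction code per time segment.

    Returns a flat list of 4 * segments counts, segment by segment.
    """
    feature = [0] * (4 * segments)
    n = len(codes)
    for i, c in enumerate(codes):
        seg = min(i * segments // n, segments - 1)
        feature[seg * 4 + c] += 1
    return feature
```

With `segments=9` the result has 36 entries, matching the 4 × 9 feature of the main component; with `segments=4` it has 16, matching the 4 × 4 feature of the other components.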

9. The method for identifying handwritten Uygur characters according to claim 3, characterized in that the distance fusion algorithm of step (3e) in the handwritten Uygur character recognition process is as follows:

(9a) Query the Uygur character component dictionary and the handwritten Uygur character component training feature library to obtain the time-division directional features of the main component and additional components of the character template;

(9b) Compute the main component distance: the Euclidean distance between the main-component features of the character to be recognized and those of the character template;

(9c) Compute the additional component distance: the average of the Euclidean distances between the features of the two additional components of the character to be recognized and the corresponding features of the character template;

(9d) Compute the dot component distance: if the number, position and features of the dot components of the character to be recognized all match those of the character template, the distance is 0; otherwise the distance is the Euclidean distance between the additional component with the most inflection points and the empty component;

(9e) Fuse the feature distances of all components with the weighted distance fusion function, and output the recognition result according to the minimum distance criterion;

Weighted distance fusion function:

$$D = \lambda_m D_m + \lambda_a \frac{L_m}{L_a} D_a + \lambda_d \frac{L_m}{L_a} D_d$$

where $D_m$, $D_a$ and $D_d$ are respectively the main component distance, additional component distance and dot component distance between the character to be recognized and the character template; $L_m / L_a$ is a normalization coefficient, and $L_m$ and $L_a$ are the numbers of time segments used for the time-division directional features of the main component and the additional components, respectively; $\lambda_m$, $\lambda_a$ and $\lambda_d$ are weighting coefficients that characterize the relative importance of each component in recognition and are obtained from classification statistics over the 128 Uygur character classes: $\lambda_m = 34/128$, $\lambda_a = 33/128$, $\lambda_d = 67/128$.