CN1570958A - Method for recognizing multi-font, multi-size printed Tibetan characters - Google Patents

Method for recognizing multi-font, multi-size printed Tibetan characters

Info

Publication number
CN1570958A
Authority
CN
China
Prior art keywords
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200410034107
Other languages
Chinese (zh)
Other versions
CN1251130C (en)
Inventor
丁晓青
王华
刘长松
彭良瑞
方驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN 200410034107 priority Critical patent/CN1251130C/en
Publication of CN1570958A publication Critical patent/CN1570958A/en
Application granted granted Critical
Publication of CN1251130C publication Critical patent/CN1251130C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Abstract

A method for recognizing multi-font, multi-size printed Tibetan characters. Its characteristic is a normalization scheme aimed at the fact that printed Tibetan characters are not square like Chinese characters: the character is separated at the baseline, i.e. the upper horizontal line, into two non-overlapping sub-images, and each sub-image is normalized by a position normalization that combines the centroid with the bounding frame and a size normalization based on cubic B-spline interpolation. Four-direction line-element features that reflect the compositional information of Tibetan characters are extracted, and linear discriminant analysis (LDA) is used to compress them and obtain a compact character feature vector. The character class is decided by a coarse-to-fine two-stage classification strategy based on confidence analysis; the coarse and fine classifiers adopt the Euclidean distance with deviation (EDD) and the modified quadratic discriminant function (MQDF), respectively.

Description

Method for recognizing multi-font, multi-size printed Tibetan characters
Technical field
The method for recognizing multi-font, multi-size printed Tibetan characters belongs to the field of character recognition.
Background technology
Tibetan character recognition is an important component of Chinese multilingual information processing systems and has both high theoretical value and broad application prospects. Character recognition methods can be summarized into two classes: statistical decision methods and syntactic (structural) methods. In a statistical decision method, each character pattern is represented by a feature vector and regarded as a point in feature space; recognition amounts to assigning the pattern to be identified to the correct class region of that space. A syntactic method extracts a limited number of indivisible minimal sub-patterns (primitives) for a given character set; these primitives are combined in specific orders and according to specific rules to form any character in the set. Exploiting the similarity between character structure and language, character recognition can then be performed by analyzing the structure of the character with a formal grammar (including syntactic rules).
The large number of character classes, the complex character structure, the many font types and the high proportion of similar characters all make Tibetan character recognition research challenging. Research on Tibetan recognition at home and abroad is still very limited, and no successful algorithm or system has yet appeared. Although Tibetan is an alphabetic script and every character is composed of several components (letters and some letter variants), the components and their interconnections are complex, so correctly separating the components of a character is very difficult. Considering also the notable weaknesses of syntactic methods, such as poor robustness to interference, the present invention adopts the statistical decision approach to multi-font, multi-size printed Tibetan character recognition and takes the whole of a single Tibetan character as the basic recognition unit.
In Chinese character recognition, directional line-element features describe well the quantitative relations of the four elementary stroke units (horizontal, vertical, left-falling and right-falling) at the different positions they occupy, and thus reflect the compositional information of a Chinese character comprehensively, accurately and stably. A Tibetan character is built by stacking its components vertically in a fixed order; the components are composed of strokes, and the connections between the strokes within a component are fixed. Each Tibetan character therefore has a specific structure that can be described at the levels of layout, components and details, and directional line-element features are an effective means of capturing these structural characteristics.
On the basis of a comprehensive and careful investigation of the characteristics of Tibetan characters, the present invention selects an appropriate normalization method for the specific form of Tibetan characters, extracts directional line-element features with strong descriptive power, and obtains the recognition result with a two-stage statistical classifier based on confidence analysis, thereby realizing a high-performance recognition method for multi-font, multi-size Tibetan characters. No method of this kind has appeared in any other publication to date.
Summary of the invention
The objective of the invention is a method for recognizing multi-font, multi-size printed Tibetan characters. Taking a single Tibetan character as the object of processing, the character object is first given the necessary normalization, comprising position normalization and size normalization; the four-direction line-element features that reflect the character's characteristics well are then extracted and compressed with the LDA (linear discriminant analysis) transform; and the class is decided with a coarse-to-fine two-stage statistical classifier based on confidence analysis. A high single-character recognition rate is thus obtained. A recognition system for multi-font, multi-size printed Tibetan characters has been implemented according to this method.
As a printed Tibetan character recognition system, the method also comprises the collection of single-character samples: the system first scans the input printed Tibetan text and performs character segmentation automatically. Using the collected training sample database, directional line-element feature extraction and feature transformation are carried out to obtain the feature database of the training samples, on the basis of which the classifier parameters are determined experimentally. For an unknown input character sample, features are extracted in the same way and then compared with the feature database by the classifier, which decides the class attribute of the input character.
The present invention consists of the following parts: character normalization, four-direction line-element feature extraction, feature transformation, and classifier design.
1. character normalization
1.1 place normalization
Let the original character image be [F(i,j)]_{W×H}, with image width W and height H; the pixel in row i and column j has value F(i,j), i = 1, 2, ..., H, j = 1, 2, ..., W. According to the characteristics of Tibetan characters, [F(i,j)]_{W×H} can be regarded as the vertical concatenation of two non-overlapping sub-images [F_1(i,j)]_{W×H_1} and [F_2(i,j)]_{W×H_2}, where [F_1(i,j)]_{W×H_1} is the part above the baseline (the upper horizontal line), i.e. the upper vowel part, [F_2(i,j)]_{W×H_2} is the part at and below the baseline, and H_1 + H_2 = H. The horizontal projection V(i), i = 1, 2, ..., H, of the character image is computed by:

$$V(i) = \sum_{j=1}^{W} F(i,j)$$

Then the ordinate P_I of the baseline position is:

$$P_I = \arg\max_i \big(V(i) - V(i-1)\big), \quad i = 2, 3, \ldots, H$$

From P_I and the ordinate of the top of the character, H_1 can be determined; in the coordinate system of the invention (Fig. 4), H_1 is numerically equal to P_I.
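The baseline search and the split into two sub-images can be summarized in a few lines. The following is a minimal sketch, assuming a binary NumPy image with stroke pixels equal to 1 and 0-based row indexing; it illustrates the projection rule above rather than reproducing the patent's exact implementation.

```python
import numpy as np

def split_at_baseline(F):
    """Locate the baseline (upper horizontal line) by the largest jump of the
    horizontal projection and split the character into the upper-vowel part
    and the part at and below the baseline."""
    V = F.sum(axis=1)                        # horizontal projection V(i)
    diffs = V[1:] - V[:-1]                   # V(i) - V(i-1), i = 2, ..., H
    baseline = int(np.argmax(diffs)) + 1     # 0-based row with the largest jump
    return baseline, F[:baseline, :], F[baseline:, :]
```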
Let the normalized character image be [G(i,j)]_{M×N}, with image width M and height N; the pixel in row i and column j has value G(i,j), i = 1, 2, ..., N, j = 1, 2, ..., M. Similarly, [G(i,j)]_{M×N} can be regarded as the vertical concatenation of two non-overlapping sub-images [G_1(i,j)]_{M×N_1} and [G_2(i,j)]_{M×N_2}, where [G_1(i,j)]_{M×N_1} is the part above the baseline and [G_2(i,j)]_{M×N_2} is the part at and below it. Based on an analysis of the baseline position in Tibetan characters, N_1 = N/4 and N_2 = 3N/4 are set here. Normalization can thus be regarded as the process of mapping the input lattices [F_1(i,j)]_{W×H_1} and [F_2(i,j)]_{W×H_2} to the target lattices [G_1(i,j)]_{M×N_1} and [G_2(i,j)]_{M×N_2}, respectively. In this process a reference point U_k(u_{ik}, u_{jk}), k = 1, 2, is selected in each input lattice [F_k(i,j)]_{W×H_k}, and the input lattice is shifted so that this reference point lies at the center of the target lattice [G_k(i,j)]_{M×N_k}, thereby completing the position normalization of the input character.
Let the centroid and the geometric center of the bounding frame of [F_k(i,j)]_{W×H_k}, k = 1, 2, be A_k(a_{ik}, a_{jk}) and B_k(b_{ik}, b_{jk}), k = 1, 2, respectively. The reference point U_k(u_{ik}, u_{jk}) is taken as a point between A_k and B_k, that is:

$$U_k = \beta A_k + (1-\beta) B_k, \quad k = 1, 2,$$

where β is a constant and 0 ≤ β ≤ 1.
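As an illustration of the position-normalization reference point, the sketch below blends the stroke centroid with the geometric center of the bounding frame; reading "a point between A_k and B_k" as the convex combination U_k = βA_k + (1−β)B_k is an assumption, and β = 0.5 follows embodiment 1.

```python
import numpy as np

def reference_point(Fk, beta=0.5):
    """Reference point U_k for one sub-image Fk (binary, stroke pixels = 1):
    a blend of the stroke centroid A_k and the geometric center B_k of the
    bounding frame, U_k = beta*A_k + (1 - beta)*B_k."""
    rows, cols = np.nonzero(Fk)
    A = np.array([rows.mean(), cols.mean()])               # centroid
    B = np.array([(rows.min() + rows.max()) / 2.0,          # bounding-frame center
                  (cols.min() + cols.max()) / 2.0])
    return beta * A + (1.0 - beta) * B
```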
1.2 size normalization
Tibetan characters are not square like Chinese characters: their widths are relatively stable while their heights differ greatly from character to character, so they cannot be normalized to a square lattice as Chinese characters are. Statistics of the height-to-width ratio over the 1200 collected sets of Tibetan character samples, 710,400 characters in total (6 fonts, 7 font sizes, 592 characters per set), show that taking a normalized height-to-width ratio of 2 is reasonable; it is a compromise among the height-to-width ratios of the different fonts.
Consider the relation between the input character image [F_k(i,j)]_{W×H_k}, k = 1, 2, and the normalized target lattice [G_k(i,j)]_{M×N_k}, k = 1, 2:

$$G_k(i,j) = F_k(i/r_i,\ j/r_j), \quad k = 1, 2$$

where r_i and r_j are the scale factors in the i and j directions: r_i = N_k/H_k, r_j = M/W. According to this formula, the point (i, j) of the output lattice corresponds to the point (i/r_i, j/r_j) of the input character. F_k(i, j) is a discrete function and i/r_i, j/r_j are generally not integers, so the value of F_k at (i/r_i, j/r_j) has to be estimated from its values at the known discrete points. The invention uses cubic B-spline interpolation to reduce distortions such as stepped edges in the normalized character pattern. For a given (i, j), let:
p_0 = [i/r_i], Δp = i/r_i − p_0, q_0 = [j/r_j], Δq = j/r_j − q_0,

where [·] is the bracket (integer-part) function. The interpolation can then be expressed as:
$$G_k(i,j) = F_k(p_0+\Delta p,\ q_0+\Delta q) = \sum_{m=-1}^{2}\sum_{l=-1}^{2} F_k(p_0+m,\ q_0+l)\, R_B(m-\Delta p)\, R_B\big(-(l-\Delta q)\big)$$
where R_B(z) is the cubic B-spline function:

$$R_B(z) = \frac{1}{6}\Big[(z+2)^3 W(z+2) - 4(z+1)^3 W(z+1) + 6z^3 W(z) - 4(z-1)^3 W(z-1)\Big]$$

and W(z) is the unit step function: W(z) = 1 for z ≥ 0 and W(z) = 0 for z < 0.
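A direct transcription of these interpolation formulas is sketched below; the clipping of the 4×4 neighbourhood at the image border is an assumption not spelled out in the text, and the loops are written for clarity rather than speed.

```python
import numpy as np

def cubic_bspline(z):
    """Cubic B-spline kernel R_B(z), written with the step function W
    (W(z) = 1 for z >= 0, else 0) exactly as in the formula above."""
    W = lambda t: 1.0 if t >= 0 else 0.0
    return ((z + 2) ** 3 * W(z + 2) - 4 * (z + 1) ** 3 * W(z + 1)
            + 6 * z ** 3 * W(z) - 4 * (z - 1) ** 3 * W(z - 1)) / 6.0

def resize_bspline(Fk, M, Nk):
    """Scale one sub-image Fk (Hk x W) to Nk rows by M columns with cubic
    B-spline interpolation, G_k(i, j) = F_k(i / r_i, j / r_j)."""
    Hk, W_in = Fk.shape
    r_i, r_j = Nk / Hk, M / W_in
    Gk = np.zeros((Nk, M))
    for i in range(Nk):
        for j in range(M):
            p, q = i / r_i, j / r_j
            p0, q0 = int(p), int(q)              # bracket function [.]
            dp, dq = p - p0, q - q0
            val = 0.0
            for m in range(-1, 3):
                for l in range(-1, 3):
                    pp = min(max(p0 + m, 0), Hk - 1)   # clip at the border
                    qq = min(max(q0 + l, 0), W_in - 1)
                    val += Fk[pp, qq] * cubic_bspline(m - dp) * cubic_bspline(-(l - dq))
            Gk[i, j] = val
    return Gk
```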
2. Directional line-element feature extraction
2.1 Extracting the character contour
Suppose that in a character image the pixels belonging to strokes are black pixels and the background pixels are white pixels. A stroke pixel is a contour point if its 8-neighbourhood contains at least one white pixel and the pixel is not isolated (i.e. the number of black pixels in its 8-neighbourhood is not 0). The contour image is extracted by scanning the whole character pattern: a black pixel at a given position is kept if both the number of black pixels and the number of white pixels in its 8-neighbourhood are greater than 0; otherwise the value of the character pattern at that position is changed to 0. In this way the contour image [G'(i,j)]_{M×N} is obtained from the normalized character image [G(i,j)]_{M×N}.
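The contour extraction rule translates almost literally into code; this sketch assumes a binary NumPy image with stroke pixels equal to 1 and treats pixels outside the image as white.

```python
import numpy as np

def extract_contour(G):
    """Keep a black pixel only if its 8-neighbourhood contains both black
    and white pixels, as described above."""
    rows, cols = G.shape
    padded = np.pad(G, 1, constant_values=0)
    contour = np.zeros_like(G)
    for i in range(rows):
        for j in range(cols):
            if G[i, j] == 1:
                nb = padded[i:i + 3, j:j + 3]
                blacks = nb.sum() - 1          # black pixels among the 8 neighbours
                whites = 8 - blacks            # white pixels among the 8 neighbours
                if blacks > 0 and whites > 0:
                    contour[i, j] = 1
    return contour
```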
2.2 Blocking and formation of the feature vector
For each black pixel of the contour lattice [G'(i,j)]_{M×N}, line elements of four kinds, horizontal (0°), vertical (90°), left-falling (45°) and right-falling (135°), are assigned according to its positional relation to two other adjacent black pixels. Two cases are considered: if the three black pixels lie on the same straight line, only one kind of line element is assigned to the center pixel, with value 2 (Fig. 9 a-d); if the three black pixels do not lie on the same straight line, two kinds of line elements are assigned to the center pixel, each with value 1 (Fig. 9 e-p). In the case shown in Fig. 9 k, for example, the line elements assigned to the center pixel are right-falling and vertical, each with value 1, and the remaining cases are handled analogously. Assigning line elements to each black pixel of the character pattern according to these principles, each black pixel (i, j) yields a 4-dimensional vector X(i,j) = (x_v, x_k, x_p, x_o)^T whose components give the amounts of the four kinds of line elements at that black pixel.
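The per-pixel assignment of the four line elements can be illustrated as follows. The collinear/non-collinear rule follows the description above, but how the two adjacent black pixels are chosen when more than two exist is an assumption (the first two found in the 8-neighbourhood), so the exact enumeration of Fig. 9 may differ.

```python
import numpy as np

# Direction indices: 0 = horizontal (0 deg), 1 = vertical (90 deg),
# 2 = left-falling (45 deg), 3 = right-falling (135 deg).
def direction_of(di, dj):
    """Map an offset between two 8-connected pixels to one of the four
    line-element directions (image coordinates: i down, j right)."""
    if di == 0:
        return 0
    if dj == 0:
        return 1
    return 2 if di * dj < 0 else 3

def pixel_line_elements(contour):
    """Per-pixel 4-dimensional line-element vectors X(i, j): value 2 for one
    direction when the pixel and its two black neighbours are collinear,
    otherwise value 1 for each of the two directions."""
    N, M = contour.shape
    X = np.zeros((N, M, 4))
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for i in range(N):
        for j in range(M):
            if contour[i, j] != 1:
                continue
            nbs = [(di, dj) for di, dj in offsets
                   if 0 <= i + di < N and 0 <= j + dj < M
                   and contour[i + di, j + dj] == 1]
            if len(nbs) < 2:
                continue
            a, b = direction_of(*nbs[0]), direction_of(*nbs[1])
            if a == b:                 # the three pixels are collinear
                X[i, j, a] += 2
            else:                      # two different directions, value 1 each
                X[i, j, a] += 1
                X[i, j, b] += 1
    return X
```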
After the above step, the M×N lattice is divided evenly into sub-regions of width M_0 and height N_0 (Fig. 10); adjacent sub-regions overlap by M_0/2 pixels horizontally and N_0/2 pixels vertically, so the number of sub-regions obtainable from the whole M×N lattice is (2M/M_0 − 1)×(2N/N_0 − 1). Each sub-region is then divided into four nested blocks A, B, C and D (Fig. 11), whose sizes are (M_0/4)×(N_0/4), (M_0/2)×(N_0/2), (3M_0/4)×(3N_0/4) and M_0×N_0, respectively. For each block a 4-dimensional vector X_A = (x_v, x_k, x_p, x_o)^T, X_B = (x_v, x_k, x_p, x_o)^T, X_C = (x_v, x_k, x_p, x_o)^T, X_D = (x_v, x_k, x_p, x_o)^T is defined, giving the sums of the line-element amounts in the 0°, 90°, 45° and 135° directions over the pixels in that block, that is:
$$X_A = \sum_{(i,j)\in A} X(i,j), \quad X_B = \sum_{(i,j)\in B} X(i,j), \quad X_C = \sum_{(i,j)\in C} X(i,j), \quad X_D = \sum_{(i,j)\in D} X(i,j)$$
The directional line-element feature vector X_S = (x_v, x_k, x_p, x_o)^T of the whole sub-region is expressed as the weighted sum of the block feature vectors of that sub-region, that is:

$$X_S = \alpha_A X_A + \alpha_B X_B + \alpha_C X_C + \alpha_D X_D$$

where α_A, α_B, α_C, α_D are constants between 0 and 1 that express how much the feature vector of each block contributes to the overall feature vector of the sub-region. A 4-dimensional feature vector is thus obtained from each sub-region, and the vectors of all sub-regions are concatenated in order to form a 4(2M/M_0 − 1)(2N/N_0 − 1)-dimensional feature vector, which is the directional line-element feature representing the character.
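The blocking and weighting step might be pooled as in the sketch below, which assumes the per-pixel vectors X(i,j) are stored as an N×M×4 array, that the nested blocks A to D are centred in their sub-region, and that M and N are multiples of M_0/2 and N_0/2; M_0 = N_0 = 16 and the weights 0.4, 0.3, 0.2, 0.1 follow embodiment 1.

```python
import numpy as np

def subregion_features(X, M0=16, N0=16, alphas=(0.4, 0.3, 0.2, 0.1)):
    """Pool per-pixel line-element vectors X (N x M x 4) into the overlapping
    sub-region features X_S = a_A X_A + a_B X_B + a_C X_C + a_D X_D."""
    N, M = X.shape[:2]
    sizes = [(M0 // 4, N0 // 4), (M0 // 2, N0 // 2),
             (3 * M0 // 4, 3 * N0 // 4), (M0, N0)]
    feats = []
    for top in range(0, N - N0 + 1, N0 // 2):        # (2N/N0 - 1) rows of sub-regions
        for left in range(0, M - M0 + 1, M0 // 2):   # (2M/M0 - 1) columns
            ci, cj = top + N0 // 2, left + M0 // 2   # sub-region centre
            xs = np.zeros(4)
            for (w, h), a in zip(sizes, alphas):     # nested blocks A, B, C, D
                block = X[ci - h // 2: ci + h // 2, cj - w // 2: cj + w // 2]
                xs += a * block.reshape(-1, 4).sum(axis=0)
            feats.append(xs)
    return np.concatenate(feats)     # 4*(2M/M0 - 1)*(2N/N0 - 1)-dimensional vector
```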
3. Feature transformation
The growth of the feature dimensionality and the shortage of training samples cause serious problems both for classifier parameter estimation and for the amount of computation in recognition. According to common experience in classifier design, the number of training samples should be more than ten times the feature dimensionality. To reduce the difficulties that an excessively high feature dimensionality and the relative shortage of training samples create for classifier design and parameter estimation, the invention compresses the high-dimensional original features with LDA.
Let the number of character classes be c (c = 592 in Tibetan character recognition) and the number of training samples of class ω be O_ω, ω = 1, 2, ..., c. Extracting the four-direction line-element features from the training samples of this class with the method above gives the feature-vector set {X_1^ω, X_2^ω, ..., X_{O_ω}^ω}, where each X_k^ω (k = 1, 2, ..., O_ω) is a 4(2M/M_0 − 1)(2N/N_0 − 1)-dimensional vector.
First compute the center μ_ω of the feature vectors of each class ω (1 ≤ ω ≤ c) and the center μ of the feature vectors of all classes:

$$\mu_\omega = \frac{1}{O_\omega}\sum_{k=1}^{O_\omega} X_k^\omega, \qquad \mu = \frac{1}{c}\sum_{\omega=1}^{c} \mu_\omega$$
Then compute the between-class scatter matrix S_b and the average within-class scatter matrix S_w:

$$S_b = \frac{1}{c}\sum_{\omega=1}^{c}(\mu_\omega-\mu)(\mu_\omega-\mu)^T, \qquad S_w = \frac{1}{c}\sum_{\omega=1}^{c}\frac{1}{O_\omega}\sum_{k=1}^{O_\omega}(X_k^\omega-\mu_\omega)(X_k^\omega-\mu_\omega)^T$$
A transformation matrix Φ is sought that maximizes tr[(Φ^T S_w Φ)^{-1}(Φ^T S_b Φ)], i.e. that maximizes the ratio of between-class scatter to within-class scatter so as to increase the separability of the pattern classes.
With a matrix computation tool, compute the d (d ≤ 4(2M/M_0 − 1)(2N/N_0 − 1)) largest non-zero eigenvalues ξ_k (k = 1, 2, ..., d) of the matrix S_w^{-1} S_b and the corresponding eigenvectors φ_k (k = 1, 2, ..., d), i.e.

$$S_w^{-1} S_b\, \varphi_k = \xi_k \varphi_k, \quad k = 1, 2, \ldots, d.$$

The transformation matrix of the LDA transform is then Φ = [φ_1, φ_2, ..., φ_d], and the corresponding feature transform is Y = Φ^T X, where Y is the d-dimensional feature with the greatest discriminability.
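The LDA compression described above amounts to an eigen-decomposition of S_w^{-1} S_b; the sketch below follows the formulas given here (d = 128 as in embodiment 1), with a small ridge term added to S_w as an assumption to keep it invertible.

```python
import numpy as np

def lda_transform(features, labels, d=128):
    """Compute the LDA transformation matrix Phi from training features
    (n_samples x dim array) and integer class labels; use as Y = Phi.T @ X."""
    classes = np.unique(labels)
    mu_cls = np.array([features[labels == w].mean(axis=0) for w in classes])
    mu = mu_cls.mean(axis=0)
    dim = features.shape[1]
    Sb = np.zeros((dim, dim))
    Sw = np.zeros((dim, dim))
    for idx, w in enumerate(classes):
        Xw = features[labels == w]
        diff = mu_cls[idx] - mu
        Sb += np.outer(diff, diff)                 # between-class scatter term
        Sw += np.cov(Xw.T, bias=True)              # within-class scatter of class w
    Sb /= len(classes)
    Sw /= len(classes)
    Sw += 1e-6 * np.eye(dim)                       # ridge for invertibility
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-eigvals.real)[:d]          # d largest eigenvalues
    return eigvecs[:, order].real                  # transformation matrix Phi
```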
4. classifier design
Classifier design is one of core technology of character recognition, and the researcher has proposed many pattern classifiers at different problems.But under multiple factor restriction, when handling the large character set identification problem, often still select minimum distance classifier at present.Thick, the thin two-stage classification strategy (Figure 13) that the present invention's employing is analyzed based on degree of confidence is finished the judgement of the affiliated classification of input Tibetan language character to be identified.
4.1 Coarse classification
The purpose of coarse classification is to select quickly, from a large character set, a candidate set of relatively small size while keeping the probability that the correct class of the character to be recognized is contained in the candidate set as high as possible. This requires the coarse classifier to be structurally simple and fast. For this purpose the invention designs a Euclidean distance with deviation (EDD) classifier.
Let Y = (y_1, y_2, ..., y_d)^T be the d-dimensional feature vector of the unknown input character and Y_ω = (y_{ω1}, y_{ω2}, ..., y_{ωd})^T the standard feature vector of class ω. The Euclidean distance with deviation is defined as follows:

$$D(Y, Y_\omega) = \sum_{k=1}^{d}\big[t(y_k, y_{\omega k})\big]^2$$
where t(y_k, y_{ωk}) is a deviation-adjusted difference between y_k and y_{ωk}; its parameters are σ_{ωk}, the standard deviation of the k-th component of the class-ω feature vectors, θ_ω and γ_ω, constants depending on ω, and C, a constant independent of the character class. The most important characteristic of this distance is that it introduces second-order statistics of the character features into the Euclidean distance, which gives the classifier a certain ability to describe the spatial distribution of the features.
4.2 Fine classification
The Bayes classifier is the theoretically optimal statistical classifier, and in practice one tries to approximate it as closely as possible. Under the conditions that the character features follow Gaussian distributions and the prior probabilities of all classes are equal, the Bayes classifier reduces to the Mahalanobis-distance classifier. These conditions are usually hard to satisfy in practice, however, and the performance of the Mahalanobis-distance classifier deteriorates seriously as errors in the covariance matrices arise. The invention therefore adopts the MQDF (modified quadratic discriminant function), a variant of the Mahalanobis distance, as the fine-classification measure. The MQDF discriminant function takes the form:
$$Q(Y, Y_\omega) = \frac{1}{h^2}\left\{\sum_{l=1}^{d}(y_l - y_{\omega l})^2 - \sum_{l=1}^{K}\Big(1 - \frac{h^2}{\lambda_{\omega l}}\Big)\Big[(Y - Y_\omega)^T \varphi_{\omega l}\Big]^2\right\} + \ln\Big(h^{2(d-K)}\prod_{l=1}^{K}\lambda_{\omega l}\Big)$$
where λ_{ωl} and φ_{ωl} are the l-th eigenvalue and eigenvector of the covariance matrix Σ_ω of the class-ω samples, K is the number of principal eigenvectors retained, i.e. the dimension of the principal subspace of the class, whose optimal value is determined experimentally, and h² is an estimate of the small eigenvalues. MQDF produces a quadratic decision surface; since only the first K principal eigenvectors of each class covariance matrix need to be estimated, the adverse effect of estimation errors in the small eigenvalues is avoided. The MQDF discriminant distance can be regarded as the weighted sum of the Mahalanobis distance in the K-dimensional principal subspace and the Euclidean distance in the remaining (d−K)-dimensional space, with weighting factor 1/h².
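The MQDF score itself is a direct transcription of the formula above; the sketch evaluates it for one class, with the logarithmic term rewritten as a sum of logarithms for numerical stability.

```python
import numpy as np

def mqdf_distance(Y, mean, eigvals, eigvecs, h2, K):
    """Modified quadratic discriminant function Q(Y, Y_w): eigvals/eigvecs are
    the leading eigenvalues/eigenvectors of the class covariance matrix, K the
    principal-subspace dimension and h2 the estimate of the small eigenvalues."""
    d = Y.shape[0]
    diff = Y - mean
    euclid = float(diff @ diff)                        # sum_l (y_l - y_wl)^2
    proj = eigvecs[:, :K].T @ diff                     # (Y - Y_w)^T phi_wl
    correction = float(np.sum((1.0 - h2 / eigvals[:K]) * proj ** 2))
    log_term = (d - K) * np.log(h2) + np.sum(np.log(eigvals[:K]))
    return (euclid - correction) / h2 + log_term

# Per class, the mean, eigenvalues and eigenvectors are precomputed from the
# training features; embodiment 1 uses K = 32 and sets h2 to the average of
# the K retained eigenvalues.
```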
4.3 confidence calculations
Let the output candidate set of the coarse classifier be CanSet = {(e_1, D_1), (e_2, D_2), ..., (e_L, D_L)}, where L is the candidate-set capacity, e_k and D_k are the candidate characters and the corresponding coarse-classification distances, and D_1 ≤ D_2 ≤ ... ≤ D_L. The role of the fine classifier is to re-rank CanSet according to the recomputed discriminant distances and find the most probable class of the input character. If the reliability of the coarse-classification result is sufficiently high, in other words, if e_1 is already the correct class of the input character, fine classification need not be carried out at all. The invention therefore performs a confidence analysis on the candidate set CanSet to decide whether fine classification is needed, using the distances output by EDD to compute the confidence according to:

$$\mathrm{Conf}(\mathrm{CanSet}) = \frac{D_2 - D_1}{D_1}$$
When the confidence is below a threshold Conf_TH, CanSet is passed to the fine classifier; otherwise CanSet is output directly.
The invention is characterized by being a printed Tibetan character recognition technique that can recognize multiple fonts and multiple font sizes. It comprises the following steps in order:
The input single Tibetan character is first given suitable position and size normalization, to eliminate as far as possible the differences in shape and pose caused by different font sizes and fonts; four-direction line-element features that reflect the structural characteristics of the Tibetan character well are then extracted; on this basis the LDA transform is used to extract the most discriminative features while reducing the dimensionality; and the transformed features are fed into a coarse-to-fine two-stage classifier based on recognition confidence to decide the class of the character. In a system composed of an image acquisition device and a computer, the method comprises the following steps in order:
1. Collection of character samples
Texts printed with multi-font, multi-size Tibetan characters are scanned in; after the necessary preprocessing with existing algorithms (noise removal, binarization, etc.), the Tibetan text is segmented to separate single characters, and the image of each character is labelled with the internal code of the correct character. The single Tibetan character samples for training and testing are thus collected and the training sample database is established.
2. Normalization, comprising normalization of character position and size
2.1 Locating the baseline position of a single Tibetan character
Let the original character image be [F(i,j)]_{W×H}, where W is the image width and H the image height; the pixel in row i and column j has value F(i,j), i = 1, 2, ..., H, j = 1, 2, ..., W.
The horizontal projection V(i), i = 1, 2, ..., H, of the character image is computed by:

$$V(i) = \sum_{j=1}^{W} F(i,j)$$

Then the ordinate P_I of the baseline position is:

$$P_I = \arg\max_i \big(V(i) - V(i-1)\big), \quad i = 2, 3, \ldots, H$$
2.2 Splitting the input image into two sub-images at the baseline
[F(i,j)]_{W×H} can be regarded as the vertical concatenation of two sub-images [F_1(i,j)]_{W×H_1} and [F_2(i,j)]_{W×H_2}, where [F_1(i,j)]_{W×H_1} is the part above the baseline, i.e. the upper vowel part, and [F_2(i,j)]_{W×H_2} is the part at and below the baseline. The two do not overlap but together vertically compose [F(i,j)]_{W×H}, and H_1 + H_2 = H.
Correspondingly, the normalized target character image [G(i,j)]_{M×N} can be regarded as the vertical concatenation of two sub-images [G_1(i,j)]_{M×N_1} and [G_2(i,j)]_{M×N_2}, where M is the width and N the height of the target image; [G_1(i,j)]_{M×N_1} is the part above the baseline, i.e. the upper vowel part, and [G_2(i,j)]_{M×N_2} is the part at and below it. The two do not overlap but together vertically compose [G(i,j)]_{M×N}, with N_1 = N/4 and N_2 = 3N/4.
2.3 Selection of the position-normalization reference points U_k(u_{ik}, u_{jk}), k = 1, 2
Let the centroid and the geometric center of the bounding frame of [F_k(i,j)]_{W×H_k}, k = 1, 2, be A_k(a_{ik}, a_{jk}) and B_k(b_{ik}, b_{jk}), k = 1, 2, respectively. The reference point U_k(u_{ik}, u_{jk}) is taken as a point between A_k and B_k, that is:

$$U_k = \beta A_k + (1-\beta) B_k, \quad k = 1, 2,$$

where β is a constant and 0 ≤ β ≤ 1.
The input lattice is shifted so that this reference point lies at the geometric center of the target lattice [G_k(i,j)]_{M×N_k}, k = 1, 2, thereby completing the position normalization of the input character.
2.4 size normalization
Since the relation between [F_k(i,j)]_{W×H_k} and [G_k(i,j)]_{M×N_k}, k = 1, 2, is G_k(i,j) = F_k(i/r_i, j/r_j), k = 1, 2, where r_i and r_j are the scale factors in the i and j directions (r_i = N_k/H_k, r_j = M/W), cubic B-spline interpolation is used to reduce distortions such as stepped edges in the normalized character. For a given (i, j), let

p_0 = [i/r_i], Δp = i/r_i − p_0, q_0 = [j/r_j], Δq = j/r_j − q_0,

where [·] is the bracket (integer-part) function. The interpolation can then be expressed as:

$$G_k(i,j) = F_k(p_0+\Delta p,\ q_0+\Delta q) = \sum_{m=-1}^{2}\sum_{l=-1}^{2} F_k(p_0+m,\ q_0+l)\, R_B(m-\Delta p)\, R_B\big(-(l-\Delta q)\big)$$

where R_B(z) is the cubic B-spline function

$$R_B(z) = \frac{1}{6}\Big[(z+2)^3 W(z+2) - 4(z+1)^3 W(z+1) + 6z^3 W(z) - 4(z-1)^3 W(z-1)\Big]$$

and W(z) is the unit step function: W(z) = 1 for z ≥ 0 and W(z) = 0 for z < 0.
3. Extraction of the four-direction line-element features of the Tibetan character
3.1 Character contour extraction
The whole character pattern is scanned; for each black pixel, whether it is kept is decided from the distribution of the pixels in its 8-neighbourhood. In this way the contour image [G'(i,j)]_{M×N} of the normalized character image [G(i,j)]_{M×N} is obtained.
3.2 Directional line-element feature extraction
First, each black pixel (i, j) of the contour lattice [G'(i,j)]_{M×N} is assigned horizontal (0°), vertical (90°), left-falling (45°) and right-falling (135°) line elements according to its positional relation to two other adjacent black pixels, recorded as a 4-dimensional vector X(i,j) = (x_v, x_k, x_p, x_o)^T.
The whole M×N contour image [G'(i,j)]_{M×N} is divided evenly into (2M/M_0 − 1)×(2N/N_0 − 1) sub-regions; each sub-region is further divided into four nested blocks A, B, C and D of sizes (M_0/4)×(N_0/4), (M_0/2)×(N_0/2), (3M_0/4)×(3N_0/4) and M_0×N_0, respectively. The feature vectors X_A = (x_v, x_k, x_p, x_o)^T, X_B = (x_v, x_k, x_p, x_o)^T, X_C = (x_v, x_k, x_p, x_o)^T, X_D = (x_v, x_k, x_p, x_o)^T of the blocks are the sums of the feature vectors of all black pixels in each block:
$$X_A = \sum_{(i,j)\in A} X(i,j), \quad X_B = \sum_{(i,j)\in B} X(i,j), \quad X_C = \sum_{(i,j)\in C} X(i,j), \quad X_D = \sum_{(i,j)\in D} X(i,j)$$
The directional line-element feature vector X_S = (x_v, x_k, x_p, x_o)^T of the whole sub-region is the weighted sum of the block feature vectors of that sub-region:

$$X_S = \alpha_A X_A + \alpha_B X_B + \alpha_C X_C + \alpha_D X_D$$

In this way a 4-dimensional feature vector is obtained from each sub-region; the vectors of all sub-regions are concatenated in order to form the 4(2M/M_0 − 1)(2N/N_0 − 1)-dimensional directional line-element feature vector of the input character.
4. Feature transformation
Let the number of character classes be c and the number of training samples of class ω be O_ω, ω = 1, 2, ..., c. Extracting the four-direction line-element features from the training samples of this class with the method above gives the feature-vector set {X_1^ω, X_2^ω, ..., X_{O_ω}^ω}, where each X_k^ω (k = 1, 2, ..., O_ω) is a 4(2M/M_0 − 1)(2N/N_0 − 1)-dimensional vector.
The original features are compressed with the LDA transform as follows.
First compute the center μ_ω of the feature vectors of each class ω (1 ≤ ω ≤ c), the center μ of the feature vectors of all classes, the between-class scatter matrix S_b and the average within-class scatter matrix S_w:

$$\mu_\omega = \frac{1}{O_\omega}\sum_{k=1}^{O_\omega} X_k^\omega, \qquad \mu = \frac{1}{c}\sum_{\omega=1}^{c} \mu_\omega$$

$$S_b = \frac{1}{c}\sum_{\omega=1}^{c}(\mu_\omega-\mu)(\mu_\omega-\mu)^T, \qquad S_w = \frac{1}{c}\sum_{\omega=1}^{c}\frac{1}{O_\omega}\sum_{k=1}^{O_\omega}(X_k^\omega-\mu_\omega)(X_k^\omega-\mu_\omega)^T$$
Find the transformation matrix Φ that maximizes tr[(Φ^T S_w Φ)^{-1}(Φ^T S_b Φ)]; the corresponding LDA feature transform is then Y = Φ^T X, where Y is the d-dimensional feature with the greatest discriminability.
5. Deciding the class of the input character, i.e. extracting features from the character image of unknown class and comparing them with the data in the recognition database to determine its correct character code.
5.1 Classifier design
For the feature vectors Y obtained by LDA compression, compute the mean vector \bar{Y}^ω (ω = 1, 2, ..., c) of each class and the variance σ_s^ω (ω = 1, 2, ..., c; s = 1, 2, ..., d) of each class's feature vectors on each dimension, d being the dimensionality of Y:
$$\bar{Y}^\omega = \frac{1}{O_\omega}\sum_{k=1}^{O_\omega} Y_k^\omega, \qquad \sigma_s^\omega = \frac{1}{O_\omega}\sum_{k=1}^{O_\omega}\big(y_{ks}^\omega - \bar{y}_s^\omega\big)^2$$
where {Y_1^ω, Y_2^ω, ..., Y_{O_ω}^ω} is the feature set of each Tibetan character class ω (1 ≤ ω ≤ c). The discriminative-feature mean vector and per-dimension variances of each class are stored in the discriminative-feature database file, together with the classifier parameters obtained by experiment.
5.2 Classification decision
For an input character image of unknown class, position and size normalization are first applied and the four-direction line-element feature X is extracted; the LDA transformation matrix Φ then converts the original directional line-element feature X into Y = Φ^T X = (y_1, y_2, ..., y_d)^T, where d is the dimensionality of the transformed feature.
The mean vectors \bar{Y}^ω = (\bar{y}_1^ω, \bar{y}_2^ω, ..., \bar{y}_d^ω)^T (ω = 1, 2, ..., c) and per-dimension variances σ_s^ω (ω = 1, 2, ..., c; s = 1, 2, ..., d) of all classes are read from the feature database file, and the Euclidean distance with deviation from Y to \bar{Y}^ω is computed:

$$D(Y, \bar{Y}^\omega) = \sum_{s=1}^{d}\big[t(y_s, \bar{y}_s^\omega)\big]^2$$

where t(y_s, \bar{y}_s^ω) is the deviation-adjusted difference term of the EDD classifier.
The distances for all ω = 1, 2, ..., c are sorted in ascending order, and the first L (1 ≤ L ≤ c) distances together with the character class codes e_k, k = 1, 2, ..., L, that they represent form the coarse-classification candidate set CanSet = {(e_1, D_1), (e_2, D_2), ..., (e_L, D_L)}, with D_1 ≤ D_2 ≤ ... ≤ D_L.
The recognition confidence Conf(CanSet) of the first candidate in CanSet is computed:

$$\mathrm{Conf}(\mathrm{CanSet}) = \frac{D_2 - D_1}{D_1}$$

If Conf(CanSet) exceeds the threshold Conf_TH, (e_1, D_1) is output directly as the recognition result, i.e. the input character is taken to belong to the character class of code e_1 with recognition distance D_1. Otherwise, the MQDF discriminant distance from Y to the character class of each code in CanSet is computed for ω = 1, 2, ..., L:
$$Q(Y, \bar{Y}^\omega) = \frac{1}{h^2}\left\{\sum_{l=1}^{d}\big(y_l - \bar{y}_l^\omega\big)^2 - \sum_{l=1}^{K}\Big(1 - \frac{h^2}{\lambda_{\omega l}}\Big)\Big[(Y - \bar{Y}^\omega)^T \varphi_{\omega l}\Big]^2\right\} + \ln\Big(h^{2(d-K)}\prod_{l=1}^{K}\lambda_{\omega l}\Big)$$
If Q(Y, \bar{Y}^τ) = min_{1≤ω≤L} Q(Y, \bar{Y}^ω), the input character is assigned to the character class of code e_τ, i.e. τ = arg min_{1≤ω≤L} Q(Y, \bar{Y}^ω).
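The coarse-to-fine decision of steps 5.1 and 5.2 can be put together as in the following sketch, where coarse_distance and fine_distance stand for the EDD and MQDF distances defined above; conf_th = 0.9 follows embodiment 1, while the candidate-set size L = 10 is an illustrative assumption.

```python
def classify(Y, coarse_distance, fine_distance, classes, L=10, conf_th=0.9):
    """Coarse-to-fine classification with confidence analysis."""
    ranked = sorted(classes, key=lambda w: coarse_distance(Y, w))
    canset = [(w, coarse_distance(Y, w)) for w in ranked[:L]]
    d1, d2 = canset[0][1], canset[1][1]
    conf = (d2 - d1) / d1                      # Conf(CanSet) = (D2 - D1) / D1
    if conf >= conf_th:                        # confident: accept coarse result
        return canset[0][0]
    return min((w for w, _ in canset), key=lambda w: fine_distance(Y, w))
```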
Experiments show that the recognition rate of the invention on the multi-font, multi-size printed Tibetan single-character test set reaches 99.83%, and the recognition rate on real texts also exceeds 99%.
Description of drawings
Fig. 1 Hardware configuration of a typical Tibetan character recognition system.
Fig. 2 Generation of Tibetan single-character samples.
Fig. 3 Structure of the Tibetan character recognition system.
Fig. 4 The image coordinate system used.
Fig. 5 Character normalization flow.
Fig. 6 Character normalization example.
Fig. 7 Directional line-element feature extraction flow.
Fig. 8 A normalized character and its contour.
Fig. 9 The horizontal, vertical, left-falling and right-falling direction attributes of the four-direction line elements.
Fig. 10 Division of the image into sub-regions.
Fig. 11 The blocks constituting a sub-region.
Fig. 12 Flow of the LDA feature transformation.
Fig. 13 The classification strategy.
Fig. 14 A multi-font, multi-size printed Tibetan character recognition system based on this algorithm.
Fig. 15 The multi-font printed Tibetan (mixed Chinese-English) document recognition system.
Embodiment
As shown in Fig. 1, a printed Tibetan character recognition system consists of two hardware parts: an image acquisition device and a computer. The image acquisition device, generally a scanner, is used to obtain digital images of the Tibetan characters. The computer processes the digital images and decides the character classes.
Fig. 2 shows the generation of the Tibetan single-character training and test samples. A printed Tibetan sample page is first scanned into the computer as a digital image. Preprocessing measures such as binarization and noise removal are applied to obtain a binary image. Line segmentation of the input image yields text lines; character segmentation is then applied to each line to obtain single Tibetan characters, and the class of each character image is labelled. The errors produced in the line/character segmentation and class labelling stages are then checked and corrected manually. Finally, the original character images belonging to the same character class are extracted and saved, completing the collection of the Tibetan single-character samples.
As shown in Fig. 3, the printed Tibetan character recognition algorithm is divided into two parts: a training system and a test system. In the training system, each sample of the input Tibetan single-character training set is normalized appropriately, its four-direction line-element features reflecting its compositional information are extracted, LDA is used to transform the features and reduce the original feature dimensionality, and a suitable classifier is then trained to obtain the feature database file. In the test system, a character image of unknown class is normalized and its features extracted in the same way as in the training system; the features are transformed with the matrix obtained in training and fed to the classifier, which decides the class of the input character.
The realization of a practical multi-font, multi-size printed Tibetan character recognition system therefore has to address the following aspects:
A) acquisition of Tibetan single-character samples;
B) realization of the training system;
C) realization of the test system.
These three aspects are described in detail below.
A) Acquisition of Tibetan single-character samples
The acquisition of printed Tibetan single-character samples proceeds as shown in Fig. 2. A printed Tibetan paper document is scanned to obtain a digital image, which is input to the computer. The image is then preprocessed by noise removal and binarization; many filtering methods for removing noise are documented in the existing literature, and binarization may use existing global or locally adaptive methods. Layout analysis is then performed on the document to obtain the character regions, and line segmentation and character segmentation based on the horizontal and vertical projection histograms, respectively, yield single characters. Segmentation errors at this stage are corrected manually. The classes of the resulting single Tibetan characters are labelled, usually automatically by computer, with manual handling of the errors (corrections, deletions, etc.). Finally, the original character images of the different fonts and sizes corresponding to characters with the same internal code are saved, yielding the multi-font, multi-size printed Tibetan single-character samples.
B) Realization of the training system
B.1 character normalization
B.1.1 Position normalization
Let the original character image be [F(i,j)]_{W×H}, with width W and height H; the pixel in row i and column j has value F(i,j), i = 1, 2, ..., H, j = 1, 2, ..., W. [F(i,j)]_{W×H} can be regarded as the vertical concatenation of two sub-images, the part above the baseline [F_1(i,j)]_{W×H_1} and the part at and below the baseline [F_2(i,j)]_{W×H_2}, with H_1 + H_2 = H. The horizontal projection V(i), i = 1, 2, ..., H, of the character image is computed by:

$$V(i) = \sum_{j=1}^{W} F(i,j)$$

Then the ordinate P_I of the baseline position is:

$$P_I = \arg\max_i \big(V(i) - V(i-1)\big), \quad i = 2, 3, \ldots, H$$

From P_I and the ordinate of the top of the character, H_1 can be determined; in the coordinate system of the invention (Fig. 4), H_1 is numerically equal to P_I.
Let the normalized character image be [G(i,j)]_{M×N}, with width M and height N; the pixel in row i and column j has value G(i,j), i = 1, 2, ..., N, j = 1, 2, ..., M. Similarly, [G(i,j)]_{M×N} can be regarded as the vertical concatenation of two sub-images, the part above the baseline [G_1(i,j)]_{M×N_1} and the part at and below the baseline [G_2(i,j)]_{M×N_2}, with N_1 = N/4 and N_2 = 3N/4 set here. Normalization can thus be regarded as mapping the input lattices [F_1(i,j)]_{W×H_1} and [F_2(i,j)]_{W×H_2} to the target lattices [G_1(i,j)]_{M×N_1} and [G_2(i,j)]_{M×N_2}, respectively. In this process a reference point U_k(u_{ik}, u_{jk}), k = 1, 2, is selected in each input lattice [F_k(i,j)]_{W×H_k}, and the input lattice is shifted so that this reference point lies at the center of the target lattice [G_k(i,j)]_{M×N_k}, thereby completing the position normalization of the input character.
Let the centroid and the geometric center of the bounding frame of [F_k(i,j)]_{W×H_k}, k = 1, 2, be A_k(a_{ik}, a_{jk}) and B_k(b_{ik}, b_{jk}), k = 1, 2, respectively. The reference point U_k(u_{ik}, u_{jk}) is taken as a point between A_k and B_k, that is:

$$U_k = \beta A_k + (1-\beta) B_k, \quad k = 1, 2,$$

where β is a constant and 0 ≤ β ≤ 1.
B.1.2 size normalization
Considering the relation between the input character image [F_k(i,j)]_{W×H_k}, k = 1, 2, and the normalized target lattice [G_k(i,j)]_{M×N_k}, k = 1, 2:

$$G_k(i,j) = F_k(i/r_i,\ j/r_j), \quad k = 1, 2$$

where r_i and r_j are the scale factors in the i and j directions: r_i = N_k/H_k, r_j = M/W. According to this formula, the point (i, j) of the output lattice corresponds to the point (i/r_i, j/r_j) of the input character. F_k(i, j) is a discrete function and i/r_i, j/r_j are generally not integers, so the value of F_k at (i/r_i, j/r_j) has to be estimated from its values at the known discrete points. Cubic B-spline interpolation is used to reduce distortion of the normalized character. For a given (i, j), let

p_0 = [i/r_i], Δp = i/r_i − p_0, q_0 = [j/r_j], Δq = j/r_j − q_0,

where [·] is the bracket (integer-part) function. The interpolation can then be expressed as:

$$G_k(i,j) = F_k(p_0+\Delta p,\ q_0+\Delta q) = \sum_{m=-1}^{2}\sum_{l=-1}^{2} F_k(p_0+m,\ q_0+l)\, R_B(m-\Delta p)\, R_B\big(-(l-\Delta q)\big)$$

where R_B(z) is the cubic B-spline function

$$R_B(z) = \frac{1}{6}\Big[(z+2)^3 W(z+2) - 4(z+1)^3 W(z+1) + 6z^3 W(z) - 4(z-1)^3 W(z-1)\Big]$$

and W(z) is the unit step function: W(z) = 1 for z ≥ 0 and W(z) = 0 for z < 0.
B.2 Directional line-element feature extraction
B.2.1 Extracting the character contour
The whole character pattern is scanned; a black pixel at a given position is kept if both the number of black pixels and the number of white pixels in its 8-neighbourhood are greater than 0, and otherwise the value of the character pattern at that position is changed to 0. In this way the contour image [G'(i,j)]_{M×N} is obtained from the normalized character image [G(i,j)]_{M×N}.
B.2.2 Blocking and formation of the feature vector
For each black pixel of the contour lattice [G'(i,j)]_{M×N}, line elements of four kinds, horizontal (0°), vertical (90°), left-falling (45°) and right-falling (135°), are assigned according to its positional relation to two other adjacent black pixels. Two cases are considered: if the three black pixels lie on the same straight line, only one kind of line element is assigned to the center pixel, with value 2; if they do not lie on the same straight line, two kinds of line elements are assigned to the center pixel, each with value 1. Assigning line elements to each black pixel of the character pattern according to these principles, each black pixel (i, j) yields a 4-dimensional vector X(i,j) = (x_v, x_k, x_p, x_o)^T whose components give the amounts of the four kinds of line elements at that black pixel.
After this, the M×N lattice is divided evenly into sub-regions of width M_0 and height N_0; adjacent sub-regions overlap by M_0/2 pixels horizontally and N_0/2 pixels vertically, so the total number of sub-regions is (2M/M_0 − 1)×(2N/N_0 − 1). Each sub-region is then divided into four nested blocks A, B, C and D of sizes (M_0/4)×(N_0/4), (M_0/2)×(N_0/2), (3M_0/4)×(3N_0/4) and M_0×N_0, respectively. For each block a 4-dimensional vector X_A = (x_v, x_k, x_p, x_o)^T, X_B = (x_v, x_k, x_p, x_o)^T, X_C = (x_v, x_k, x_p, x_o)^T, X_D = (x_v, x_k, x_p, x_o)^T is defined, giving the sums of the line-element amounts in the 0°, 90°, 45° and 135° directions over the pixels in that block, that is:
$$X_A = \sum_{(i,j)\in A} X(i,j), \quad X_B = \sum_{(i,j)\in B} X(i,j), \quad X_C = \sum_{(i,j)\in C} X(i,j), \quad X_D = \sum_{(i,j)\in D} X(i,j)$$
The directional line-element feature vector X_S = (x_v, x_k, x_p, x_o)^T of the whole sub-region is expressed as the weighted sum of the block feature vectors of that sub-region, that is:

$$X_S = \alpha_A X_A + \alpha_B X_B + \alpha_C X_C + \alpha_D X_D$$

where α_A, α_B, α_C, α_D are constants between 0 and 1 that express how much the feature vector of each block contributes to the overall feature vector of the sub-region. A 4-dimensional feature vector is thus obtained from each sub-region, and the vectors of all sub-regions are concatenated in order to form the 4(2M/M_0 − 1)(2N/N_0 − 1)-dimensional directional line-element feature vector.
B.3 Feature transformation
Let the number of character classes be c (c = 592 in Tibetan character recognition) and the number of training samples of class ω be O_ω, ω = 1, 2, ..., c; their original directional line-element feature vectors form the set {X_1^ω, X_2^ω, ..., X_{O_ω}^ω}, where each X_k^ω (k = 1, 2, ..., O_ω) is a 4(2M/M_0 − 1)(2N/N_0 − 1)-dimensional vector.
First compute the center μ_ω of the feature vectors of each class ω (1 ≤ ω ≤ c), the center μ of the feature vectors of all classes, the between-class scatter matrix S_b and the average within-class scatter matrix S_w:
$$\mu_\omega = \frac{1}{O_\omega}\sum_{k=1}^{O_\omega} X_k^\omega, \qquad \mu = \frac{1}{c}\sum_{\omega=1}^{c} \mu_\omega$$

$$S_b = \frac{1}{c}\sum_{\omega=1}^{c}(\mu_\omega-\mu)(\mu_\omega-\mu)^T, \qquad S_w = \frac{1}{c}\sum_{\omega=1}^{c}\frac{1}{O_\omega}\sum_{k=1}^{O_\omega}(X_k^\omega-\mu_\omega)(X_k^\omega-\mu_\omega)^T$$
Find the transformation matrix Φ that maximizes tr[(Φ^T S_w Φ)^{-1}(Φ^T S_b Φ)], i.e. that maximizes the ratio of between-class scatter to within-class scatter so as to increase the separability of the pattern classes.
With a matrix computation tool, compute the d (d ≤ 4(2M/M_0 − 1)(2N/N_0 − 1)) largest non-zero eigenvalues ξ_k (k = 1, 2, ..., d) of the matrix S_w^{-1} S_b and the corresponding eigenvectors φ_k (k = 1, 2, ..., d). The transformation matrix of the LDA transform is then Φ = [φ_1, φ_2, ..., φ_d], and the corresponding feature transform is Y = Φ^T X, where Y is the d-dimensional feature with the greatest discriminability.
B.4 Classifier design
For the feature vectors Y obtained by the LDA transform, compute the mean vector \bar{Y}^ω (ω = 1, 2, ..., c) of each class and the variance σ_s^ω (ω = 1, 2, ..., c; s = 1, 2, ..., d) of each class's feature vectors on each dimension, d being the dimensionality of Y:

$$\bar{Y}^\omega = \frac{1}{O_\omega}\sum_{k=1}^{O_\omega} Y_k^\omega, \qquad \sigma_s^\omega = \frac{1}{O_\omega}\sum_{k=1}^{O_\omega}\big(y_{ks}^\omega - \bar{y}_s^\omega\big)^2$$

where {Y_1^ω, Y_2^ω, ..., Y_{O_ω}^ω} is the most separable feature set of each Tibetan character class ω (1 ≤ ω ≤ c). The discriminative-feature mean vector and per-dimension variances of each class are stored in the discriminative-feature database file, together with the classifier parameter values adjusted by experiment. The design and training of the classifier are thus complete.
C) Realization of the test system
For an input character image of unknown class, position and size normalization are first applied and the four-direction line-element feature X is extracted; the LDA transformation matrix Φ then converts the original directional line-element feature X into Y = Φ^T X = (y_1, y_2, ..., y_d)^T, where d is the dimensionality of the transformed feature.
The mean vectors \bar{Y}^ω = (\bar{y}_1^ω, \bar{y}_2^ω, ..., \bar{y}_d^ω)^T (ω = 1, 2, ..., c) and per-dimension variances σ_s^ω (ω = 1, 2, ..., c; s = 1, 2, ..., d) of all classes are read from the feature database file, and the Euclidean distance with deviation from Y to \bar{Y}^ω is computed:

$$D(Y, \bar{Y}^\omega) = \sum_{s=1}^{d}\big[t(y_s, \bar{y}_s^\omega)\big]^2$$

where t(y_s, \bar{y}_s^ω) is the deviation-adjusted difference term of the EDD classifier. The distances D(Y, \bar{Y}^ω) for all ω = 1, 2, ..., c are sorted in ascending order, and the first L (1 ≤ L ≤ c) distances together with the character class codes e_k, k = 1, 2, ..., L, that they represent form the coarse-classification candidate set CanSet = {(e_1, D_1), (e_2, D_2), ..., (e_L, D_L)}, with D_1 ≤ D_2 ≤ ... ≤ D_L.
The recognition confidence Conf(CanSet) of the first candidate in CanSet is computed:

$$\mathrm{Conf}(\mathrm{CanSet}) = \frac{D_2 - D_1}{D_1}$$

If Conf(CanSet) exceeds the threshold Conf_TH, (e_1, D_1) is output directly as the recognition result, i.e. the input character is taken to belong to the character class of code e_1 with recognition distance D_1. Otherwise, the MQDF discriminant distance from Y to the character class of each code in CanSet is computed for ω = 1, 2, ..., L:
$$Q(Y, \bar{Y}^\omega) = \frac{1}{h^2}\left\{\sum_{l=1}^{d}\big(y_l - \bar{y}_l^\omega\big)^2 - \sum_{l=1}^{K}\Big(1 - \frac{h^2}{\lambda_{\omega l}}\Big)\Big[(Y - \bar{Y}^\omega)^T \varphi_{\omega l}\Big]^2\right\} + \ln\Big(h^{2(d-K)}\prod_{l=1}^{K}\lambda_{\omega l}\Big)$$
If Q(Y, \bar{Y}^τ) = min_{1≤ω≤L} Q(Y, \bar{Y}^ω), the input character is assigned to the character class of code e_τ, i.e. τ = arg min_{1≤ω≤L} Q(Y, \bar{Y}^ω). Two concrete implementation examples are given below.
Embodiment 1: a multi-font, multi-size printed Tibetan character recognition system. Based on the recognition system of the invention shown in Fig. 14a, experiments were carried out on 1200 sets of collected printed Tibetan documents (each document covers all 592 modern Tibetan characters). Most of the sample documents were taken from the main current Tibetan publishing systems (Founder, Huaguang), and a small portion were printed directly from TrueType fonts. The fonts include not only the most common lean style, black style and ordinary style but also round, long and bamboo styles, and the font sizes range from No. 6 to No. 1. The sample quality varies, the ratio of normal, broken and touching characters being about 2:1:1. Through scanning, line and character segmentation and internal-code labelling, the 1200 sets of Tibetan documents were converted into 1200 sets of single-character samples (i.e. 1200 single-character samples per character class), of which 900 sets were drawn at random to form the training set and the remaining 300 sets were kept as test samples.
In the experiments, each Tibetan character was normalized with the method of the invention to a 48×96 lattice, with normalization parameter β = 0.5. The sub-regions in the four-direction line-element feature extraction were divided as shown in Fig. 10, with M_0 = N_0 = 16, and the weighting coefficients α_A, α_B, α_C, α_D of the block feature vectors within a sub-region were 0.4, 0.3, 0.2 and 0.1, respectively. After the directional line-element features were extracted with the flow of Fig. 7, LDA was applied for feature compression, the transformed feature dimensionality d being chosen as 128 (Fig. 14c). The parameters of the coarse classifier EDD were θ_1 = θ_2 = ... = θ_592 = 0.8, γ_1 = γ_2 = ... = γ_592 = 2.2 and C = 20; the threshold used in the coarse-classification confidence analysis was Conf_TH = 0.9; and in the fine classifier MQDF (Fig. 14b) K = 32, with h² estimated as the average of the K eigenvalues of the covariance matrix of each class. The experimental results on the test set are shown in Table 1.
Table 1. Recognition rates of the system on the test sample sets of six Tibetan fonts
Font                  Lean     Black    Ordinary  Round    Long     Bamboo   Average
Number of characters  36112    39072    35520     30192    14800    22496    -
Recognition rate      99.94%   99.86%   99.83%    99.85%   99.58%   99.76%   99.83%
As can be seen from Table 1, the average recognition rate for multi-font, multi-size Tibetan characters reaches 99.83%, demonstrating the effectiveness of the proposed method.
Embodiment 2: a multi-font printed Tibetan (mixed Chinese-English) document recognition system
The multi-font printed Tibetan (mixed Chinese-English) document recognition system was developed to meet the needs of office automation in Tibetan areas and to promote the development of Chinese multilingual information processing technology; its block diagram is shown in Fig. 15. It mainly comprises an image input and preprocessing subsystem, a line and character segmentation subsystem, a character recognition subsystem and a post-processing subsystem. The present invention is the main component of the character recognition subsystem; in cooperation with the Chinese and English recognition cores, it automatically recognizes multi-font printed documents that are mainly Tibetan with some interspersed Chinese, English, digits and symbols, converting the document image into text that the computer can "read".
The Tibetan character recognition part of this system uses the method proposed by the invention, with the same parameters as in embodiment 1 and the character feature database transplanted from embodiment 1. The system passed the expert appraisal organized by the Ministry of Education in November 2003. In the performance test, 62 pages containing 95,583 characters in total were selected at random from more than 500 pages (over 520,000 characters) of actual printed Tibetan documents (taken from books, newspapers, magazines and other publications) provided by Northwest University for Nationalities; the results are as follows:
Table 2. Test performance of the multi-font printed Tibetan (mixed Chinese-English) document recognition system
Character type      Number of characters   Recognition rate (%)   ACE (%)   ASE (%)   UTE (%)
Tibetan             91636                  99.06                  0.30      0.57      0.07
Chinese             804                    96.27                  1.99      1.74      0
English + symbols   2118                   86.59                  5.24      6.66      1.51
Digits              1025                   92.39                  3.61      3.42      0.58
Total               95583                  98.68                  -         -         -
Note: ACE is the interpretable segmentation error rate, ASE the interpretable recognition error rate, and UTE the error rate of errors whose type cannot be judged. The results show that the multi-font, multi-size printed Tibetan character recognition proposed by the invention fully meets the needs of practical application, achieves good recognition performance, and has broad application prospects.

Claims (1)

1. A method for recognizing multi-font, multi-size printed Tibetan characters, characterized in that a normalization scheme is proposed for the characteristic that printed Tibetan characters are non-Chinese characters: using the baseline, i.e. the upper horizontal line, as the dividing line, the character image is decomposed into two non-overlapping sub-images, and each sub-image is subjected to position normalization combining the centre of gravity with the outer frame, and to size normalization based on cubic B-spline function interpolation; four-direction line-element features that fully reflect the composition information of Tibetan characters are extracted, and compact character feature vectors are obtained after dimensionality reduction by linear discriminant analysis (LDA); a coarse-to-fine two-level classification strategy based on confidence analysis is used to decide the character class, the coarse and fine classifiers being, respectively, the Euclidean distance with deviation (EDD) and the modified quadratic discriminant function (MQDF); in a system composed of an image acquisition device and a computer, the method comprises the following steps in sequence:
(1) Settings:
(1.1) the total number of Tibetan character classes handled by the present invention, c = 592;
(1.2) the character width M and height N after normalization, and the position normalization parameter β;
(1.3) the subregion width M_0 and height N_0 used when extracting the directional line-element features, and the weighting coefficients α_A, α_B, α_C, α_D of each square's feature vector within a subregion;
(1.4) the parameters C, θ_k, γ_k of the coarse classifier EDD, where k = 1, 2, …, 592;
(1.5) the confidence threshold Conf_TH;
(2) Collection of character samples
Texts printed with multi-font, multi-size Tibetan characters are input to the computer through a scanner; after necessary preprocessing with existing methods, such as noise removal and binarization, the Tibetan text is segmented into individual characters, and the image of each character is labelled with the internal code of the character it corresponds to; the Tibetan single-character samples for training and testing are thereby collected and a training sample database is established;
(3) Normalization, comprising normalization of character position and size
(3.1) Locating the baseline position of a single Tibetan character
Let the original character image be $[F(i,j)]_{W \times H}$, where W is the image width, H is the image height, and the value of the pixel in row i and column j is F(i, j), i = 1, 2, …, H, j = 1, 2, …, W.
The horizontal projection V(i), i = 1, 2, …, H, of the character image is computed as
$$V(i) = \sum_{j=1}^{W} F(i,j),$$
and the ordinate $P_1$ of the baseline position is then
$$P_1 = \arg\max_{i} \left( V(i) - V(i-1) \right), \quad i = 2, 3, \cdots, H;$$
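The following Python sketch illustrates the projection-based baseline location of step (3.1). The array layout (H rows × W columns), the foreground-equals-1 convention and the function name are illustrative assumptions, not part of the claim.

```python
import numpy as np

def locate_baseline(F):
    """Locate the baseline row of a binary Tibetan character image F
    (H x W, foreground pixels = 1): the row where the horizontal
    projection V(i) shows the largest increase over the previous row."""
    V = F.sum(axis=1)               # horizontal projection V(i), i = 0..H-1
    diffs = V[1:] - V[:-1]          # V(i) - V(i-1) for i = 1..H-1
    P1 = int(np.argmax(diffs)) + 1  # 0-based row index of the baseline
    return P1
```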
(3.2) Separating the input image into two sub-images at the baseline
$[F(i,j)]_{W \times H}$ can be regarded as the vertical concatenation of two sub-images $[F_1(i,j)]_{W \times H_1}$ and $[F_2(i,j)]_{W \times H_2}$, where $[F_1(i,j)]_{W \times H_1}$ is the part above the baseline, i.e. the upper vowel part, and $[F_2(i,j)]_{W \times H_2}$ is the part from the baseline downwards; the two do not overlap but are vertically combined into $[F(i,j)]_{W \times H}$, with $H_1 + H_2 = H$, and the size of $H_1$ is determined by the difference between $P_1$ and the ordinate of the character top.
Correspondingly, the normalized target character image $[G(i,j)]_{M \times N}$ can also be regarded as the vertical concatenation of two sub-images $[G_1(i,j)]_{M \times N_1}$ and $[G_2(i,j)]_{M \times N_2}$, where M is the width of the target image and N is its height; $[G_1(i,j)]_{M \times N_1}$ is the part above the baseline, i.e. the upper vowel part, and $[G_2(i,j)]_{M \times N_2}$ is the part from the baseline downwards; they likewise do not overlap but are vertically combined into $[G(i,j)]_{M \times N}$, and $N_1 = N/4$, $N_2 = 3N/4$ are set;
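A minimal sketch of the split in step (3.2) follows. Splitting exactly at row P1 (rather than measuring H_1 from the character top) and the function name are simplifying assumptions.

```python
import numpy as np

def split_at_baseline(F, P1, N=96):
    """Split a binary character image F (H rows x W columns) at the
    baseline row P1 into the upper vowel part F1 and the lower body
    part F2, and return the target heights allotted to each part."""
    F1 = F[:P1, :]                 # part above the baseline (upper vowels)
    F2 = F[P1:, :]                 # part from the baseline downwards
    N1, N2 = N // 4, 3 * N // 4    # target heights N1 = N/4 and N2 = 3N/4
    return F1, F2, N1, N2
```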
(3.3) Position normalization
The reference point $U_k(u_{ik}, u_{jk})$, k = 1, 2, is selected as follows. The centre of gravity and the outer-frame centre of $[F_k(i,j)]_{W \times H_k}$, k = 1, 2, are, respectively, $A_k(a_{ik}, a_{jk})$ and $B_k(b_{ik}, b_{jk})$, k = 1, 2, where the centre of gravity is the foreground centroid of the sub-image and the outer-frame centre is the centre of its bounding box.
$U_k(u_{ik}, u_{jk})$, k = 1, 2, is then taken as a point lying between $A_k(a_{ik}, a_{jk})$ and $B_k(b_{ik}, b_{jk})$, determined by the weight β, where β is a constant and 0 ≤ β ≤ 1.
The input image lattice is shifted so that this reference point is located at the geometric centre of the target lattice $[G_k(i,j)]_{M \times N_k}$, k = 1, 2, thereby completing the position normalization of the input character;
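A sketch of the position normalization of step (3.3) follows. The patent only states that U_k lies between the centroid A_k and the frame centre B_k; the concrete convex combination U = (1−β)A + βB used below, and the pixel-copying shift, are assumptions (with the experimental β = 0.5 both orderings give the midpoint).

```python
import numpy as np

def position_normalize(Fk, M, Nk, beta=0.5):
    """Shift a binary sub-image Fk so that the reference point U_k, taken
    between the centroid A_k and the bounding-box centre B_k with weight
    beta, lands on the geometric centre of an M x Nk target grid."""
    ii, jj = np.nonzero(Fk)
    A = np.array([ii.mean(), jj.mean()])                # centre of gravity
    B = np.array([(ii.min() + ii.max()) / 2.0,
                  (jj.min() + jj.max()) / 2.0])         # outer-frame centre
    U = (1 - beta) * A + beta * B                       # reference point (assumed form)
    target_centre = np.array([(Nk - 1) / 2.0, (M - 1) / 2.0])
    di, dj = np.round(target_centre - U).astype(int)
    G = np.zeros((Nk, M), dtype=int)
    for i, j in zip(ii, jj):                            # copy shifted foreground pixels
        ti, tj = i + di, j + dj
        if 0 <= ti < Nk and 0 <= tj < M:
            G[ti, tj] = 1
    return G
```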
(3.4) Size normalization
Since the relation between $[F_k(i,j)]_{W \times H_k}$, k = 1, 2, and $[G_k(i,j)]_{M \times N_k}$, k = 1, 2, is
$$G_k(i,j) = F_k(i/r_i,\; j/r_j), \quad k = 1, 2,$$
where $r_i$ and $r_j$ are the scale factors in the i and j directions respectively, $r_i = N_k / H_k$ and $r_j = M / W$, a cubic B-spline function is used to carry out the interpolation.
For a given (i, j), let
$$p_0 = [\,i/r_i\,], \quad \Delta p = i/r_i - p_0, \quad q_0 = [\,j/r_j\,], \quad \Delta q = j/r_j - q_0,$$
where [·] is the integer-part (floor) function.
The interpolation can then be expressed as
$$G_k(i,j) = F_k(p_0+\Delta p,\; q_0+\Delta q) = \sum_{m=-1}^{2} \sum_{l=-1}^{2} F_k(p_0+m,\; q_0+l)\, R_B(m-\Delta p)\, R_B\!\left(-(l-\Delta q)\right),$$
where $R_B(z)$ is the cubic B-spline function
$$R_B(z) = \frac{1}{6}\left[ (z+2)^3 W(z+2) - 4(z+1)^3 W(z+1) + 6 z^3 W(z) - 4(z-1)^3 W(z-1) \right],$$
and W(z) is the unit step function;
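The size normalization of step (3.4) can be illustrated by the following sketch. Edge-replication padding (so the 4 × 4 neighbourhood is always available) and the function names are assumptions; the kernel is the R_B(z) formula above.

```python
import numpy as np

def cubic_bspline(z):
    """Cubic B-spline kernel R_B(z) built from the unit step W(z)."""
    W = lambda t: np.where(t >= 0, 1.0, 0.0)
    return ((z + 2) ** 3 * W(z + 2)
            - 4 * (z + 1) ** 3 * W(z + 1)
            + 6 * z ** 3 * W(z)
            - 4 * (z - 1) ** 3 * W(z - 1)) / 6.0

def resize_bspline(Fk, M, Nk):
    """Rescale a sub-image Fk (Hk x W) to Nk x M by cubic B-spline
    interpolation, following the interpolation formula of step (3.4)."""
    Hk, W_ = Fk.shape
    ri, rj = Nk / Hk, M / W_                 # scale factors r_i and r_j
    Fp = np.pad(Fk.astype(float), 2, mode='edge')
    G = np.zeros((Nk, M))
    for i in range(Nk):
        p0, dp = int(i / ri), i / ri - int(i / ri)
        for j in range(M):
            q0, dq = int(j / rj), j / rj - int(j / rj)
            for m in (-1, 0, 1, 2):          # 4 x 4 interpolation neighbourhood
                for l in (-1, 0, 1, 2):
                    G[i, j] += (Fp[p0 + m + 2, q0 + l + 2]
                                * cubic_bspline(m - dp)
                                * cubic_bspline(-(l - dq)))
    return G
```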
(4) Extracting the four-direction line-element features of the Tibetan character
(4.1) Character contour extraction
The whole character pattern is scanned; for a black pixel at a given position, if the numbers of black pixels and of background pixels in its 8-neighbourhood are both greater than 0, this black pixel is kept; otherwise it is set to a background pixel. In this way the contour image $[G'(i,j)]_{M \times N}$ of the normalized character image $[G(i,j)]_{M \times N}$ is obtained;
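A sketch of the 8-neighbourhood contour test of step (4.1); zero padding at the image border and the function name are assumptions.

```python
import numpy as np

def extract_contour(G):
    """Keep a foreground pixel of the binary image G only if its
    8-neighbourhood contains both foreground and background pixels,
    i.e. drop interior pixels of strokes."""
    G = (G > 0).astype(int)
    Gp = np.pad(G, 1, mode='constant')        # pad so every pixel has 8 neighbours
    C = np.zeros_like(G)
    H, W = G.shape
    for i in range(H):
        for j in range(W):
            if G[i, j]:
                nb = Gp[i:i + 3, j:j + 3]     # 3x3 window centred on (i, j)
                black = nb.sum() - G[i, j]    # black pixels among the 8 neighbours
                white = 8 - black
                if black > 0 and white > 0:   # boundary pixel: keep it
                    C[i, j] = 1
    return C
```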
(4.2) Formation of the directional line-element features
First, for each black pixel (i, j) in the character contour lattice $[G'(i,j)]_{M \times N}$, according to the positional relation between it and two other adjacent black pixels, it is assigned line-element values of the four kinds — horizontal, vertical, left-falling and right-falling — recorded as a 4-dimensional vector $X(i,j) = (x_v, x_k, x_p, x_o)^T$.
The whole character contour image $[G'(i,j)]_{M \times N}$ of size M × N is evenly divided into $\left(\frac{2M}{M_0}-1\right) \times \left(\frac{2N}{N_0}-1\right)$ subregions of width $M_0$ and height $N_0$; each subregion is further divided into four nested squares A, B, C, D whose sizes are, in turn, $(M_0/4) \times (N_0/4)$, $(M_0/2) \times (N_0/2)$, $(3M_0/4) \times (3N_0/4)$ and $M_0 \times N_0$. The feature vectors $X_A$, $X_B$, $X_C$, $X_D$ of the squares are the sums of the feature vectors of all black pixels inside each square:
$$X_A = \sum_{(i,j) \in A} X(i,j), \quad X_B = \sum_{(i,j) \in B} X(i,j), \quad X_C = \sum_{(i,j) \in C} X(i,j), \quad X_D = \sum_{(i,j) \in D} X(i,j).$$
The directional line-element feature vector $X_S = (x_v, x_k, x_p, x_o)^T$ of the whole subregion is the weighted sum of the squares' feature vectors within that subregion:
$$X_S = \alpha_A X_A + \alpha_B X_B + \alpha_C X_C + \alpha_D X_D,$$
where $\alpha_A, \alpha_B, \alpha_C, \alpha_D$ are constants between 0 and 1. In this way a 4-dimensional feature vector is obtained from each subregion, and the feature vectors of all subregions, arranged in order, form the $4\left(\frac{2M}{M_0}-1\right)\left(\frac{2N}{N_0}-1\right)$-dimensional original directional line-element feature vector of the input character;
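A sketch of the subregion feature pooling of step (4.2) follows. The exact rule by which a contour pixel is assigned its four direction values from two adjacent black pixels is not spelled out here, so the direction map below simply marks a direction when the neighbour in that direction is also a contour pixel; centring the nested squares inside each subregion is likewise an assumption.

```python
import numpy as np

# Assumed offsets for horizontal, vertical, left-falling and right-falling strokes.
DIRS = {0: (0, 1), 1: (1, 0), 2: (1, -1), 3: (1, 1)}

def direction_map(C):
    """4-channel map X(i,j) of directional line elements on contour C."""
    H, W = C.shape
    X = np.zeros((H, W, 4))
    for i in range(H):
        for j in range(W):
            if C[i, j]:
                for d, (di, dj) in DIRS.items():
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W and C[ni, nj]:
                        X[i, j, d] = 1
    return X

def line_element_features(C, M0=16, N0=16, alphas=(0.4, 0.3, 0.2, 0.1)):
    """Overlapping-subregion features: (2M/M0-1) x (2N/N0-1) subregions,
    each pooled over four nested squares A..D weighted by alpha_A..alpha_D."""
    X = direction_map(C)
    N, M = C.shape                                     # N rows x M columns
    feats = []
    for ci in range(0, N - N0 + 1, N0 // 2):           # subregions overlap by half
        for cj in range(0, M - M0 + 1, M0 // 2):
            sub = X[ci:ci + N0, cj:cj + M0]
            vec = np.zeros(4)
            for a, f in zip(alphas, (0.25, 0.5, 0.75, 1.0)):
                h, w = int(N0 * f), int(M0 * f)        # nested square size
                i0, j0 = (N0 - h) // 2, (M0 - w) // 2  # centred inside the subregion
                vec += a * sub[i0:i0 + h, j0:j0 + w].sum(axis=(0, 1))
            feats.append(vec)
    return np.concatenate(feats)        # 4*(2M/M0-1)*(2N/N0-1) dimensions
```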
(5) Feature transformation
Let the number of Tibetan character classes be c and the number of training samples of character class ω be $O_\omega$, ω = 1, 2, …, c. After the four-direction line-element features are extracted from the training samples of this class with the above method, the resulting feature vector set is $\{X_1^\omega, X_2^\omega, \cdots, X_{O_\omega}^\omega\}$, where each $X_k^\omega$ (k = 1, 2, …, $O_\omega$) is a $4\left(\frac{2M}{M_0}-1\right)\left(\frac{2N}{N_0}-1\right)$-dimensional vector.
The original features are compressed with the LDA transform as follows.
First compute the centre $\mu_\omega$ of the feature vectors of each character class ω (1 ≤ ω ≤ c), the centre μ over all character classes, the between-class scatter matrix $S_b$ and the average within-class scatter matrix $S_w$ of the feature vectors:
$$\mu_\omega = \frac{1}{O_\omega} \sum_{k=1}^{O_\omega} X_k^\omega,$$
$$\mu = \frac{1}{c} \sum_{\omega=1}^{c} \mu_\omega,$$
$$S_b = \frac{1}{c} \sum_{\omega=1}^{c} (\mu_\omega - \mu)(\mu_\omega - \mu)^T,$$
$$S_w = \frac{1}{c} \sum_{\omega=1}^{c} \frac{1}{O_\omega} \sum_{k=1}^{O_\omega} (X_k^\omega - \mu_\omega)(X_k^\omega - \mu_\omega)^T.$$
Find the transformation matrix Φ that maximizes $\mathrm{tr}\!\left[(\Phi^T S_w \Phi)^{-1} (\Phi^T S_b \Phi)\right]$; the eigentransform corresponding to LDA is then $Y = \Phi^T X$, where Y is the d-dimensional discriminative feature;
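A sketch of the LDA compression of step (5): the trace criterion is maximized by the leading generalized eigenvectors of (S_b, S_w). The small regularizer added to S_w and the function name are assumptions made so the sketch runs on rank-deficient scatter matrices.

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(class_features, d=128):
    """Learn the LDA matrix Phi from a list of per-class (O_w x D) feature
    arrays by solving the generalized eigenproblem Sb v = lambda Sw v and
    keeping the d leading eigenvectors; project with Y = Phi.T @ X."""
    mus = np.array([f.mean(axis=0) for f in class_features])   # class centres
    mu = mus.mean(axis=0)                                       # overall centre
    D = mus.shape[1]
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for f, m in zip(class_features, mus):
        Sb += np.outer(m - mu, m - mu)
        Sw += np.cov(f.T, bias=True)          # within-class scatter of this class
    c = len(class_features)
    Sb /= c
    Sw /= c
    Sw += 1e-6 * np.eye(D)                    # regularizer so Sw is positive definite
    vals, vecs = eigh(Sb, Sw)                 # generalized symmetric eigenproblem
    Phi = vecs[:, np.argsort(vals)[::-1][:d]]
    return Phi
```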
(6) Decision of the class of the input character, i.e., for a character image of unknown class, features are extracted and compared with the data stored in the recognition library to determine its correct character code;
(6.1) Classifier design
For the feature vectors Y obtained by LDA compression, compute the mean vector $\overline{Y^\omega}$ (ω = 1, 2, …, c) of each character class and the variance $\sigma_s^\omega$ (ω = 1, 2, …, c; s = 1, 2, …, d) of each class's features in every dimension, where d is the dimension of Y:
$$\overline{Y^\omega} = \frac{1}{O_\omega} \sum_{k=1}^{O_\omega} Y_k^\omega,$$
$$\sigma_s^\omega = \frac{1}{O_\omega} \sum_{k=1}^{O_\omega} \left( y_{ks}^\omega - \overline{y}_s^\omega \right)^2,$$
where $\{Y_k^\omega\}$ (1 ≤ ω ≤ c) is the feature set of each Tibetan character class ω. The discriminative feature mean vector of each character and the per-dimension variances are stored in a discriminative feature database file, and the classifier parameters obtained through experiments are stored in the library file at the same time;
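A minimal sketch of the per-class statistics gathered in step (6.1); the container layout and function name are assumptions.

```python
import numpy as np

def build_class_statistics(class_lda_features):
    """For every character class w, compute the mean vector and the
    per-dimension variance of its LDA-compressed features; these are the
    quantities stored in the feature library file."""
    means, variances = [], []
    for Yw in class_lda_features:            # Yw: (O_w x d) array for class w
        means.append(Yw.mean(axis=0))
        variances.append(Yw.var(axis=0))     # (1/O_w) * sum_k (y_ks - mean_s)^2
    return np.array(means), np.array(variances)
```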
(6.2) Classification decision
For an input character image of unknown class, position normalization and size normalization are first performed, the four-direction line-element feature X is then extracted, and the LDA linear transformation matrix Φ is used to transform the original line-element feature X into $Y = \Phi^T X = (y_1, y_2, \cdots, y_d)^T$, where d is the feature dimension after transformation.
The mean vectors $\overline{Y^\omega} = (\overline{y_1^\omega}, \overline{y_2^\omega}, \cdots, \overline{y_d^\omega})^T$ (ω = 1, 2, …, c) of all character classes and the per-dimension variances $\sigma_s^\omega$ (ω = 1, 2, …, c; s = 1, 2, …, d) of each class are read from the library file, and the Euclidean distance with deviation $D(Y, \overline{Y^\omega})$ from Y to $\overline{Y^\omega}$ is computed:
$$D(Y, \overline{Y^\omega}) = \sum_{s=1}^{d} \left[ t(y_s, \overline{y_s^\omega}) \right]^2,$$
where $t(y_s, \overline{y_s^\omega})$ is the per-dimension deviation term of the EDD classifier, defined from $y_s$, $\overline{y_s^\omega}$, $\sigma_s^\omega$ and the EDD parameters.
All the distances $D(Y, \overline{Y^\omega})$, ω = 1, 2, …, c, are computed and sorted in ascending order, and the first L (1 ≤ L ≤ c) distances and the character class codes $e_k$, k = 1, 2, …, L, they represent are selected to form the coarse-classification candidate set CanSet = {(e_1, D_1), (e_2, D_2), …, (e_L, D_L)}, with D_1 ≤ D_2 ≤ … ≤ D_L.
The recognition confidence Conf(CanSet) of the first candidate in CanSet is computed:
$$Conf(CanSet) = \frac{D_2 - D_1}{D_1}.$$
If Conf(CanSet) is higher than the threshold Conf_TH, (e_1, D_1) is output directly as the recognition result of the input character, i.e. the input character is considered to belong to the character class corresponding to $e_1$, with recognition distance $D_1$. Otherwise, the MQDF discriminant distance $Q(Y, \overline{Y^\omega})$, ω = 1, 2, …, L, from Y to the character class corresponding to each internal code in CanSet is computed:
$$Q(Y, \overline{Y^\omega}) = \frac{1}{h^2} \left\{ \sum_{l=1}^{d} \left( y_l - \overline{y_l^\omega} \right)^2 - \sum_{l=1}^{K} \left( 1 - \frac{h^2}{\lambda_{\omega l}} \right) \left[ \left( Y - \overline{Y^\omega} \right)^T \varphi_{\omega l} \right]^2 \right\} + \ln \left( h^{2(d-K)} \prod_{l=1}^{K} \lambda_{\omega l} \right).$$
If $Q(Y, \overline{Y^\tau}) = \min_{1 \le \omega \le L} Q(Y, \overline{Y^\omega})$, the input character belongs to the character class corresponding to $e_\tau$, i.e. $\tau = \arg\min_{1 \le \omega \le L} Q(Y, \overline{Y^\omega})$.
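The coarse-to-fine decision of step (6.2) can be sketched as follows. Because the per-dimension deviation term t(·) of the EDD is given in the original only as a figure, the coarse stage below uses a plain Euclidean distance as an explicitly labelled stand-in; the MQDF stage follows the Q(Y, Ȳ^ω) formula above. Function names and the parameter container are assumptions.

```python
import numpy as np

def mqdf_distance(Y, mean, eigvals, eigvecs, h2, K):
    """Modified quadratic discriminant function distance for one class;
    eigvals/eigvecs are the K leading eigenpairs of the class covariance."""
    diff = Y - mean
    d = Y.shape[0]
    proj = eigvecs[:, :K].T @ diff                        # (Y - mean)^T phi_l
    q = (diff @ diff - np.sum((1 - h2 / eigvals[:K]) * proj ** 2)) / h2
    return q + (d - K) * np.log(h2) + np.sum(np.log(eigvals[:K]))

def classify(Y, means, variances, mqdf_params, L=10, conf_th=0.9):
    """Two-stage decision: coarse candidate set by distance to the class
    means (stand-in for the band-deviation EDD), confidence test on the
    top two candidates, then MQDF re-ranking of the L candidates."""
    dists = np.linalg.norm(means - Y, axis=1) ** 2        # coarse distances
    order = np.argsort(dists)[:L]                         # candidate set CanSet
    D1, D2 = dists[order[0]], dists[order[1]]
    if (D2 - D1) / D1 > conf_th:                          # confidence Conf(CanSet)
        return int(order[0])                              # accept the coarse result
    # otherwise re-rank the L candidates with the finer MQDF distance
    q = [mqdf_distance(Y, means[w], *mqdf_params[w]) for w in order]
    return int(order[int(np.argmin(q))])
```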
CN 200410034107 2004-04-23 2004-04-23 Method for identifying multi-font multi-character size print form Tibetan character Expired - Fee Related CN1251130C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200410034107 CN1251130C (en) 2004-04-23 2004-04-23 Method for identifying multi-font multi-character size print form Tibetan character

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200410034107 CN1251130C (en) 2004-04-23 2004-04-23 Method for identifying multi-font multi-character size print form Tibetan character

Publications (2)

Publication Number Publication Date
CN1570958A true CN1570958A (en) 2005-01-26
CN1251130C CN1251130C (en) 2006-04-12

Family

ID=34481469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200410034107 Expired - Fee Related CN1251130C (en) 2004-04-23 2004-04-23 Method for identifying multi-font multi-character size print form Tibetan character

Country Status (1)

Country Link
CN (1) CN1251130C (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101366017B (en) * 2005-12-12 2010-06-16 微软公司 Logical structure and layout based character recognition method and system
CN100440250C (en) * 2007-03-09 2008-12-03 清华大学 Recognition method of printed mongolian character
WO2009114967A1 (en) * 2008-03-19 2009-09-24 东莞市步步高教育电子产品有限公司 Motion scan-based image processing method and device
CN101510259B (en) * 2009-03-18 2011-04-06 西北民族大学 On-line identification method for 'ding' of handwriting Tibet character
CN102184383A (en) * 2011-04-18 2011-09-14 哈尔滨工业大学 Automatic generation method of image sample of printed character
CN102184383B (en) * 2011-04-18 2013-04-10 哈尔滨工业大学 Automatic generation method of image sample of printed character
CN103999097B (en) * 2011-07-11 2017-04-12 华为技术有限公司 System and method for compact descriptor for visual search
CN103999097A (en) * 2011-07-11 2014-08-20 华为技术有限公司 System and method for compact descriptor for visual search
CN102360436A (en) * 2011-10-24 2012-02-22 中国科学院软件研究所 Identification method for on-line handwritten Tibetan characters based on components
CN102360436B (en) * 2011-10-24 2012-11-07 中国科学院软件研究所 Identification method for on-line handwritten Tibetan characters based on components
CN104809442B (en) * 2015-05-04 2017-11-17 北京信息科技大学 A kind of Dongba pictograph grapheme intelligent identification Method
CN104809442A (en) * 2015-05-04 2015-07-29 北京信息科技大学 Intelligent recognition method for graphemes of Dongba pictographs
CN107025452A (en) * 2016-01-29 2017-08-08 富士通株式会社 Image-recognizing method and image recognition apparatus
CN106355200A (en) * 2016-08-29 2017-01-25 大连民族大学 Manchu handwritten recognition device
CN106408002A (en) * 2016-08-29 2017-02-15 大连民族大学 Hand-written manchu alphabet identification system
CN106127266A (en) * 2016-08-29 2016-11-16 大连民族大学 Hand-written Manchu alphabet recognition methods
CN108932454A (en) * 2017-05-23 2018-12-04 杭州海康威视系统技术有限公司 A kind of character recognition method based on picture, device and electronic equipment
CN107730511A (en) * 2017-09-20 2018-02-23 北京工业大学 A kind of Tibetan language historical document line of text cutting method based on baseline estimations
CN107730511B (en) * 2017-09-20 2020-10-27 北京工业大学 Tibetan historical literature text line segmentation method based on baseline estimation
CN111553336A (en) * 2020-04-27 2020-08-18 西安电子科技大学 Print Uyghur document image recognition system and method based on link segment
CN111553336B (en) * 2020-04-27 2023-03-24 西安电子科技大学 Print Uyghur document image recognition system and method based on link segment
CN111583217A (en) * 2020-04-30 2020-08-25 深圳开立生物医疗科技股份有限公司 Tumor ablation curative effect prediction method, device, equipment and computer medium

Also Published As

Publication number Publication date
CN1251130C (en) 2006-04-12

Similar Documents

Publication Publication Date Title
CN1251130C (en) Method for identifying multi-font multi-character size print form Tibetan character
CN1794266A (en) Biocharacteristics fusioned identity distinguishing and identification method
CN100336070C (en) Method of robust human face detection in complicated background image
CN1664846A (en) On-line hand-written Chinese characters recognition method based on statistic structural features
CN1156791C (en) Pattern recognizing apparatus and method
CN1275201C (en) Parameter estimation apparatus and data collating apparatus
CN1184796C (en) Image processing method and equipment, image processing system and storage medium
CN1151465C (en) Model identification equipment using condidate table making classifying and method thereof
CN1171162C (en) Apparatus and method for retrieving charater string based on classification of character
CN1310825A (en) Methods and apparatus for classifying text and for building a text classifier
CN1200387C (en) Statistic handwriting identification and verification method based on separate character
CN1599913A (en) Iris identification system and method, and storage media having program thereof
CN1041773C (en) Character recognition method and apparatus based on 0-1 pattern representation of histogram of character image
CN1311394C (en) Appts. and method for binary image
CN1924897A (en) Image processing apparatus and method and program
CN1741035A (en) Blocks letter Arabic character set text dividing method
CN1122022A (en) Scribble matching
CN1574269A (en) Method and device for analyzing fail bit maps of wafers
CN1186287A (en) Method and apparatus for character recognition
CN1251128C (en) Pattern ranked matching device and method
CN1403959A (en) Content filter based on text content characteristic similarity and theme correlation degree comparison
CN1904906A (en) Device and method of address identification
CN1973757A (en) Computerized disease sign analysis system based on tongue picture characteristics
CN1266643C (en) Printed font character identification method based on Arabic character set
CN1247615A (en) Method and appts. for recognizing patterns

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20060412

Termination date: 20140423