CN1570958A - Method for recognizing multi-font, multi-size printed Tibetan characters - Google Patents

Method for recognizing multi-font, multi-size printed Tibetan characters

Info

Publication number
CN1570958A
Authority
CN
China
Prior art keywords
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200410034107
Other languages
Chinese (zh)
Other versions
CN1251130C (en)
Inventor
丁晓青
王华
刘长松
彭良瑞
方驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN 200410034107 priority Critical patent/CN1251130C/en
Publication of CN1570958A publication Critical patent/CN1570958A/en
Application granted granted Critical
Publication of CN1251130C publication Critical patent/CN1251130C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Abstract

A method for recognizing multi-font, multi-size printed Tibetan characters. Its characteristic is a normalization scheme aimed at the fact that printed Tibetan characters are not square like Chinese characters: the character is separated at the baseline, i.e. the upper horizontal line, into two non-overlapping sub-images, and each sub-image is normalized by a position normalization that combines the centroid with the bounding frame and a size normalization based on cubic B-spline interpolation. Four-direction line-element features that reflect the compositional information of Tibetan characters are extracted, and linear discriminant analysis (LDA) is used to compress them and obtain a compact character feature vector. The character class is decided by a coarse-to-fine two-stage classification strategy based on confidence analysis; the coarse and fine classifiers adopt the Euclidean distance with deviation (EDD) and the modified quadratic discriminant function (MQDF), respectively.

Description

Method for recognizing multi-font, multi-size printed Tibetan characters
Technical field
The method for recognizing multi-font, multi-size printed Tibetan characters belongs to the field of character recognition.
Background technology
Tibetan character recognition is an important component of Chinese multilingual information processing systems and has both high theoretical value and broad application prospects. Character recognition methods can be summarized into two classes: statistical decision methods and syntactic (structural) methods. In a statistical decision method, each character pattern is represented by a feature vector and regarded as a point in feature space; recognition amounts to assigning the pattern to be identified to the correct class region of that space. A syntactic method extracts a limited number of indivisible minimal sub-patterns (primitives) for a given character set; these primitives are combined in specific orders and according to specific rules to form any character in the set. Exploiting the similarity between character structure and language, character recognition can then be performed by analyzing the structure of the character with a formal grammar (including syntactic rules).
The large number of character classes, the complex character structure, the many font types and the high proportion of similar characters all make Tibetan character recognition research challenging. Research on Tibetan recognition at home and abroad is still very limited, and no successful algorithm or system has yet appeared. Although Tibetan is an alphabetic script and every character is composed of several components (letters and some letter variants), the components and their interconnections are complex, so correctly separating the components of a character is very difficult. Considering also the notable weaknesses of syntactic methods, such as poor robustness to interference, the present invention adopts the statistical decision approach to multi-font, multi-size printed Tibetan character recognition and takes the whole of a single Tibetan character as the basic recognition unit.
In Chinese character recognition, directional line-element features describe well the quantitative relations of the four elementary stroke units (horizontal, vertical, left-falling and right-falling) at the different positions they occupy, and thus reflect the compositional information of a Chinese character comprehensively, accurately and stably. A Tibetan character is built by stacking its components vertically in a fixed order; the components are composed of strokes, and the connections between the strokes within a component are fixed. Each Tibetan character therefore has a specific structure that can be described at the levels of layout, components and details, and directional line-element features are an effective means of capturing these structural characteristics.
On the basis of a comprehensive and careful investigation of the characteristics of Tibetan characters, the present invention selects an appropriate normalization method for the specific form of Tibetan characters, extracts directional line-element features with strong descriptive power, and obtains the recognition result with a two-stage statistical classifier based on confidence analysis, thereby realizing a high-performance recognition method for multi-font, multi-size Tibetan characters. No method of this kind has appeared in any other publication to date.
Summary of the invention
The objective of the invention is a method for recognizing multi-font, multi-size printed Tibetan characters. Taking a single Tibetan character as the object of processing, the character object is first given the necessary normalization, comprising position normalization and size normalization; the four-direction line-element features that reflect the character's characteristics well are then extracted and compressed with the LDA (linear discriminant analysis) transform; and the class is decided with a coarse-to-fine two-stage statistical classifier based on confidence analysis. A high single-character recognition rate is thus obtained. A recognition system for multi-font, multi-size printed Tibetan characters has been implemented according to this method.
As a printed Tibetan character recognition system, the method also comprises the collection of single-character samples: the system first scans the input printed Tibetan text and performs character segmentation automatically. Using the collected training sample database, directional line-element feature extraction and feature transformation are carried out to obtain the feature database of the training samples, on the basis of which the classifier parameters are determined experimentally. For an unknown input character sample, features are extracted in the same way and then compared with the feature database by the classifier, which decides the class attribute of the input character.
The present invention consists of the following parts: character normalization, four-direction line-element feature extraction, feature transformation, and classifier design.
1. character normalization
1.1 place normalization
Let the original character image be [F(i,j)]_{W×H}, with image width W and height H; the pixel in row i and column j has value F(i,j), i = 1, 2, ..., H, j = 1, 2, ..., W. According to the characteristics of Tibetan characters, [F(i,j)]_{W×H} can be regarded as the vertical concatenation of two non-overlapping sub-images [F_1(i,j)]_{W×H_1} and [F_2(i,j)]_{W×H_2}, where [F_1(i,j)]_{W×H_1} is the part above the baseline (the upper horizontal line), i.e. the upper vowel part, [F_2(i,j)]_{W×H_2} is the part at and below the baseline, and H_1 + H_2 = H. The horizontal projection V(i), i = 1, 2, ..., H, of the character image is computed by:

$$V(i) = \sum_{j=1}^{W} F(i,j)$$

Then the ordinate P_I of the baseline position is:

$$P_I = \arg\max_i \big(V(i) - V(i-1)\big), \quad i = 2, 3, \ldots, H$$

From P_I and the ordinate of the top of the character, H_1 can be determined; in the coordinate system of the invention (Fig. 4), H_1 is numerically equal to P_I.
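The baseline search and the split into two sub-images can be summarized in a few lines. The following is a minimal sketch, assuming a binary NumPy image with stroke pixels equal to 1 and 0-based row indexing; it illustrates the projection rule above rather than reproducing the patent's exact implementation.

```python
import numpy as np

def split_at_baseline(F):
    """Locate the baseline (upper horizontal line) by the largest jump of the
    horizontal projection and split the character into the upper-vowel part
    and the part at and below the baseline."""
    V = F.sum(axis=1)                        # horizontal projection V(i)
    diffs = V[1:] - V[:-1]                   # V(i) - V(i-1), i = 2, ..., H
    baseline = int(np.argmax(diffs)) + 1     # 0-based row with the largest jump
    return baseline, F[:baseline, :], F[baseline:, :]
```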
Let the normalized character image be [G(i,j)]_{M×N}, with image width M and height N; the pixel in row i and column j has value G(i,j), i = 1, 2, ..., N, j = 1, 2, ..., M. Similarly, [G(i,j)]_{M×N} can be regarded as the vertical concatenation of two non-overlapping sub-images [G_1(i,j)]_{M×N_1} and [G_2(i,j)]_{M×N_2}, where [G_1(i,j)]_{M×N_1} is the part above the baseline and [G_2(i,j)]_{M×N_2} is the part at and below it. Based on an analysis of the baseline position in Tibetan characters, N_1 = N/4 and N_2 = 3N/4 are set here. Normalization can thus be regarded as the process of mapping the input lattices [F_1(i,j)]_{W×H_1} and [F_2(i,j)]_{W×H_2} to the target lattices [G_1(i,j)]_{M×N_1} and [G_2(i,j)]_{M×N_2}, respectively. In this process a reference point U_k(u_{ik}, u_{jk}), k = 1, 2, is selected in each input lattice [F_k(i,j)]_{W×H_k}, and the input lattice is shifted so that this reference point lies at the center of the target lattice [G_k(i,j)]_{M×N_k}, thereby completing the position normalization of the input character.
Let the centroid and the geometric center of the bounding frame of [F_k(i,j)]_{W×H_k}, k = 1, 2, be A_k(a_{ik}, a_{jk}) and B_k(b_{ik}, b_{jk}), k = 1, 2, respectively. The reference point U_k(u_{ik}, u_{jk}) is taken as a point between A_k and B_k, that is:

$$U_k = \beta A_k + (1-\beta) B_k, \quad k = 1, 2,$$

where β is a constant and 0 ≤ β ≤ 1.
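As an illustration of the position-normalization reference point, the sketch below blends the stroke centroid with the geometric center of the bounding frame; reading "a point between A_k and B_k" as the convex combination U_k = βA_k + (1−β)B_k is an assumption, and β = 0.5 follows embodiment 1.

```python
import numpy as np

def reference_point(Fk, beta=0.5):
    """Reference point U_k for one sub-image Fk (binary, stroke pixels = 1):
    a blend of the stroke centroid A_k and the geometric center B_k of the
    bounding frame, U_k = beta*A_k + (1 - beta)*B_k."""
    rows, cols = np.nonzero(Fk)
    A = np.array([rows.mean(), cols.mean()])               # centroid
    B = np.array([(rows.min() + rows.max()) / 2.0,          # bounding-frame center
                  (cols.min() + cols.max()) / 2.0])
    return beta * A + (1.0 - beta) * B
```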
1.2 size normalization
Tibetan characters are not square like Chinese characters: their widths are relatively stable while their heights differ greatly from character to character, so they cannot be normalized to a square lattice as Chinese characters are. Statistics of the height-to-width ratio over the 1200 collected sets of Tibetan character samples, 710,400 characters in total (6 fonts, 7 font sizes, 592 characters per set), show that taking a normalized height-to-width ratio of 2 is reasonable; it is a compromise among the height-to-width ratios of the different fonts.
Consider the relation between the input character image [F_k(i,j)]_{W×H_k}, k = 1, 2, and the normalized target lattice [G_k(i,j)]_{M×N_k}, k = 1, 2:

$$G_k(i,j) = F_k(i/r_i,\ j/r_j), \quad k = 1, 2$$

where r_i and r_j are the scale factors in the i and j directions: r_i = N_k/H_k, r_j = M/W. According to this formula, the point (i, j) of the output lattice corresponds to the point (i/r_i, j/r_j) of the input character. F_k(i, j) is a discrete function and i/r_i, j/r_j are generally not integers, so the value of F_k at (i/r_i, j/r_j) has to be estimated from its values at the known discrete points. The invention uses cubic B-spline interpolation to reduce distortions such as stepped edges in the normalized character pattern. For a given (i, j), let:
p_0 = [i/r_i], Δp = i/r_i − p_0, q_0 = [j/r_j], Δq = j/r_j − q_0,

where [·] is the bracket (integer-part) function. The interpolation can then be expressed as:
$$G_k(i,j) = F_k(p_0+\Delta p,\ q_0+\Delta q) = \sum_{m=-1}^{2}\sum_{l=-1}^{2} F_k(p_0+m,\ q_0+l)\, R_B(m-\Delta p)\, R_B\big(-(l-\Delta q)\big)$$
where R_B(z) is the cubic B-spline function:

$$R_B(z) = \frac{1}{6}\Big[(z+2)^3 W(z+2) - 4(z+1)^3 W(z+1) + 6z^3 W(z) - 4(z-1)^3 W(z-1)\Big]$$

and W(z) is the unit step function: W(z) = 1 for z ≥ 0 and W(z) = 0 for z < 0.
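A direct transcription of these interpolation formulas is sketched below; the clipping of the 4×4 neighbourhood at the image border is an assumption not spelled out in the text, and the loops are written for clarity rather than speed.

```python
import numpy as np

def cubic_bspline(z):
    """Cubic B-spline kernel R_B(z), written with the step function W
    (W(z) = 1 for z >= 0, else 0) exactly as in the formula above."""
    W = lambda t: 1.0 if t >= 0 else 0.0
    return ((z + 2) ** 3 * W(z + 2) - 4 * (z + 1) ** 3 * W(z + 1)
            + 6 * z ** 3 * W(z) - 4 * (z - 1) ** 3 * W(z - 1)) / 6.0

def resize_bspline(Fk, M, Nk):
    """Scale one sub-image Fk (Hk x W) to Nk rows by M columns with cubic
    B-spline interpolation, G_k(i, j) = F_k(i / r_i, j / r_j)."""
    Hk, W_in = Fk.shape
    r_i, r_j = Nk / Hk, M / W_in
    Gk = np.zeros((Nk, M))
    for i in range(Nk):
        for j in range(M):
            p, q = i / r_i, j / r_j
            p0, q0 = int(p), int(q)              # bracket function [.]
            dp, dq = p - p0, q - q0
            val = 0.0
            for m in range(-1, 3):
                for l in range(-1, 3):
                    pp = min(max(p0 + m, 0), Hk - 1)   # clip at the border
                    qq = min(max(q0 + l, 0), W_in - 1)
                    val += Fk[pp, qq] * cubic_bspline(m - dp) * cubic_bspline(-(l - dq))
            Gk[i, j] = val
    return Gk
```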
2. Directional line-element feature extraction
2.1 Extracting the character contour
Suppose that in a character image the pixels belonging to strokes are black pixels and the background pixels are white pixels. A stroke pixel is a contour point if its 8-neighbourhood contains at least one white pixel and the pixel is not isolated (i.e. the number of black pixels in its 8-neighbourhood is not 0). The contour image is extracted by scanning the whole character pattern: a black pixel at a given position is kept if both the number of black pixels and the number of white pixels in its 8-neighbourhood are greater than 0; otherwise the value of the character pattern at that position is changed to 0. In this way the contour image [G'(i,j)]_{M×N} is obtained from the normalized character image [G(i,j)]_{M×N}.
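The contour extraction rule translates almost literally into code; this sketch assumes a binary NumPy image with stroke pixels equal to 1 and treats pixels outside the image as white.

```python
import numpy as np

def extract_contour(G):
    """Keep a black pixel only if its 8-neighbourhood contains both black
    and white pixels, as described above."""
    rows, cols = G.shape
    padded = np.pad(G, 1, constant_values=0)
    contour = np.zeros_like(G)
    for i in range(rows):
        for j in range(cols):
            if G[i, j] == 1:
                nb = padded[i:i + 3, j:j + 3]
                blacks = nb.sum() - 1          # black pixels among the 8 neighbours
                whites = 8 - blacks            # white pixels among the 8 neighbours
                if blacks > 0 and whites > 0:
                    contour[i, j] = 1
    return contour
```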
2.2 Blocking and formation of the feature vector
For each black pixel of the contour lattice [G'(i,j)]_{M×N}, line elements of four kinds, horizontal (0°), vertical (90°), left-falling (45°) and right-falling (135°), are assigned according to its positional relation to two other adjacent black pixels. Two cases are considered: if the three black pixels lie on the same straight line, only one kind of line element is assigned to the center pixel, with value 2 (Fig. 9 a-d); if the three black pixels do not lie on the same straight line, two kinds of line elements are assigned to the center pixel, each with value 1 (Fig. 9 e-p). In the case shown in Fig. 9 k, for example, the line elements assigned to the center pixel are right-falling and vertical, each with value 1, and the remaining cases are handled analogously. Assigning line elements to each black pixel of the character pattern according to these principles, each black pixel (i, j) yields a 4-dimensional vector X(i,j) = (x_v, x_k, x_p, x_o)^T whose components give the amounts of the four kinds of line elements at that black pixel.
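The per-pixel assignment of the four line elements can be illustrated as follows. The collinear/non-collinear rule follows the description above, but how the two adjacent black pixels are chosen when more than two exist is an assumption (the first two found in the 8-neighbourhood), so the exact enumeration of Fig. 9 may differ.

```python
import numpy as np

# Direction indices: 0 = horizontal (0 deg), 1 = vertical (90 deg),
# 2 = left-falling (45 deg), 3 = right-falling (135 deg).
def direction_of(di, dj):
    """Map an offset between two 8-connected pixels to one of the four
    line-element directions (image coordinates: i down, j right)."""
    if di == 0:
        return 0
    if dj == 0:
        return 1
    return 2 if di * dj < 0 else 3

def pixel_line_elements(contour):
    """Per-pixel 4-dimensional line-element vectors X(i, j): value 2 for one
    direction when the pixel and its two black neighbours are collinear,
    otherwise value 1 for each of the two directions."""
    N, M = contour.shape
    X = np.zeros((N, M, 4))
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for i in range(N):
        for j in range(M):
            if contour[i, j] != 1:
                continue
            nbs = [(di, dj) for di, dj in offsets
                   if 0 <= i + di < N and 0 <= j + dj < M
                   and contour[i + di, j + dj] == 1]
            if len(nbs) < 2:
                continue
            a, b = direction_of(*nbs[0]), direction_of(*nbs[1])
            if a == b:                 # the three pixels are collinear
                X[i, j, a] += 2
            else:                      # two different directions, value 1 each
                X[i, j, a] += 1
                X[i, j, b] += 1
    return X
```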
After the above step, the M×N lattice is divided evenly into sub-regions of width M_0 and height N_0 (Fig. 10); adjacent sub-regions overlap by M_0/2 pixels horizontally and N_0/2 pixels vertically, so the number of sub-regions obtainable from the whole M×N lattice is (2M/M_0 − 1)×(2N/N_0 − 1). Each sub-region is then divided into four nested blocks A, B, C and D (Fig. 11), whose sizes are (M_0/4)×(N_0/4), (M_0/2)×(N_0/2), (3M_0/4)×(3N_0/4) and M_0×N_0, respectively. For each block a 4-dimensional vector X_A = (x_v, x_k, x_p, x_o)^T, X_B = (x_v, x_k, x_p, x_o)^T, X_C = (x_v, x_k, x_p, x_o)^T, X_D = (x_v, x_k, x_p, x_o)^T is defined, giving the sums of the line-element amounts in the 0°, 90°, 45° and 135° directions over the pixels in that block, that is:
$$X_A = \sum_{(i,j)\in A} X(i,j), \quad X_B = \sum_{(i,j)\in B} X(i,j), \quad X_C = \sum_{(i,j)\in C} X(i,j), \quad X_D = \sum_{(i,j)\in D} X(i,j)$$
The directional line-element feature vector X_S = (x_v, x_k, x_p, x_o)^T of the whole sub-region is expressed as the weighted sum of the block feature vectors of that sub-region, that is:

$$X_S = \alpha_A X_A + \alpha_B X_B + \alpha_C X_C + \alpha_D X_D$$

where α_A, α_B, α_C, α_D are constants between 0 and 1 that express how much the feature vector of each block contributes to the overall feature vector of the sub-region. A 4-dimensional feature vector is thus obtained from each sub-region, and the vectors of all sub-regions are concatenated in order to form a 4(2M/M_0 − 1)(2N/N_0 − 1)-dimensional feature vector, which is the directional line-element feature representing the character.
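The blocking and weighting step might be pooled as in the sketch below, which assumes the per-pixel vectors X(i,j) are stored as an N×M×4 array, that the nested blocks A to D are centred in their sub-region, and that M and N are multiples of M_0/2 and N_0/2; M_0 = N_0 = 16 and the weights 0.4, 0.3, 0.2, 0.1 follow embodiment 1.

```python
import numpy as np

def subregion_features(X, M0=16, N0=16, alphas=(0.4, 0.3, 0.2, 0.1)):
    """Pool per-pixel line-element vectors X (N x M x 4) into the overlapping
    sub-region features X_S = a_A X_A + a_B X_B + a_C X_C + a_D X_D."""
    N, M = X.shape[:2]
    sizes = [(M0 // 4, N0 // 4), (M0 // 2, N0 // 2),
             (3 * M0 // 4, 3 * N0 // 4), (M0, N0)]
    feats = []
    for top in range(0, N - N0 + 1, N0 // 2):        # (2N/N0 - 1) rows of sub-regions
        for left in range(0, M - M0 + 1, M0 // 2):   # (2M/M0 - 1) columns
            ci, cj = top + N0 // 2, left + M0 // 2   # sub-region centre
            xs = np.zeros(4)
            for (w, h), a in zip(sizes, alphas):     # nested blocks A, B, C, D
                block = X[ci - h // 2: ci + h // 2, cj - w // 2: cj + w // 2]
                xs += a * block.reshape(-1, 4).sum(axis=0)
            feats.append(xs)
    return np.concatenate(feats)     # 4*(2M/M0 - 1)*(2N/N0 - 1)-dimensional vector
```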
3. Feature transformation
The growth of the feature dimensionality and the shortage of training samples cause serious problems both for classifier parameter estimation and for the amount of computation in recognition. According to common experience in classifier design, the number of training samples should be more than ten times the feature dimensionality. To reduce the difficulties that an excessively high feature dimensionality and the relative shortage of training samples create for classifier design and parameter estimation, the invention compresses the high-dimensional original features with LDA.
Let the number of character classes be c (c = 592 in Tibetan character recognition) and the number of training samples of class ω be O_ω, ω = 1, 2, ..., c. Extracting the four-direction line-element features from the training samples of this class with the method above gives the feature-vector set {X_1^ω, X_2^ω, ..., X_{O_ω}^ω}, where each X_k^ω (k = 1, 2, ..., O_ω) is a 4(2M/M_0 − 1)(2N/N_0 − 1)-dimensional vector.
First compute the center μ_ω of the feature vectors of each class ω (1 ≤ ω ≤ c) and the center μ of the feature vectors of all classes:

$$\mu_\omega = \frac{1}{O_\omega}\sum_{k=1}^{O_\omega} X_k^\omega, \qquad \mu = \frac{1}{c}\sum_{\omega=1}^{c} \mu_\omega$$
Then compute the between-class scatter matrix S_b and the average within-class scatter matrix S_w:

$$S_b = \frac{1}{c}\sum_{\omega=1}^{c}(\mu_\omega-\mu)(\mu_\omega-\mu)^T, \qquad S_w = \frac{1}{c}\sum_{\omega=1}^{c}\frac{1}{O_\omega}\sum_{k=1}^{O_\omega}(X_k^\omega-\mu_\omega)(X_k^\omega-\mu_\omega)^T$$
A transformation matrix Φ is sought that maximizes tr[(Φ^T S_w Φ)^{-1}(Φ^T S_b Φ)], i.e. that maximizes the ratio of between-class scatter to within-class scatter so as to increase the separability of the pattern classes.
With a matrix computation tool, compute the d (d ≤ 4(2M/M_0 − 1)(2N/N_0 − 1)) largest non-zero eigenvalues ξ_k (k = 1, 2, ..., d) of the matrix S_w^{-1} S_b and the corresponding eigenvectors φ_k (k = 1, 2, ..., d), i.e.

$$S_w^{-1} S_b\, \varphi_k = \xi_k \varphi_k, \quad k = 1, 2, \ldots, d.$$

The transformation matrix of the LDA transform is then Φ = [φ_1, φ_2, ..., φ_d], and the corresponding feature transform is Y = Φ^T X, where Y is the d-dimensional feature with the greatest discriminability.
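The LDA compression described above amounts to an eigen-decomposition of S_w^{-1} S_b; the sketch below follows the formulas given here (d = 128 as in embodiment 1), with a small ridge term added to S_w as an assumption to keep it invertible.

```python
import numpy as np

def lda_transform(features, labels, d=128):
    """Compute the LDA transformation matrix Phi from training features
    (n_samples x dim array) and integer class labels; use as Y = Phi.T @ X."""
    classes = np.unique(labels)
    mu_cls = np.array([features[labels == w].mean(axis=0) for w in classes])
    mu = mu_cls.mean(axis=0)
    dim = features.shape[1]
    Sb = np.zeros((dim, dim))
    Sw = np.zeros((dim, dim))
    for idx, w in enumerate(classes):
        Xw = features[labels == w]
        diff = mu_cls[idx] - mu
        Sb += np.outer(diff, diff)                 # between-class scatter term
        Sw += np.cov(Xw.T, bias=True)              # within-class scatter of class w
    Sb /= len(classes)
    Sw /= len(classes)
    Sw += 1e-6 * np.eye(dim)                       # ridge for invertibility
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-eigvals.real)[:d]          # d largest eigenvalues
    return eigvecs[:, order].real                  # transformation matrix Phi
```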
4. classifier design
Classifier design is one of core technology of character recognition, and the researcher has proposed many pattern classifiers at different problems.But under multiple factor restriction, when handling the large character set identification problem, often still select minimum distance classifier at present.Thick, the thin two-stage classification strategy (Figure 13) that the present invention's employing is analyzed based on degree of confidence is finished the judgement of the affiliated classification of input Tibetan language character to be identified.
4.1 Coarse classification
The purpose of coarse classification is to select quickly, from a large character set, a candidate set of relatively small size while keeping the probability that the correct class of the character to be recognized is contained in the candidate set as high as possible. This requires the coarse classifier to be structurally simple and fast. For this purpose the invention designs a Euclidean distance with deviation (EDD) classifier.
Let Y = (y_1, y_2, ..., y_d)^T be the d-dimensional feature vector of the unknown input character and Y_ω = (y_{ω1}, y_{ω2}, ..., y_{ωd})^T the standard feature vector of class ω. The Euclidean distance with deviation is defined as follows:

$$D(Y, Y_\omega) = \sum_{k=1}^{d}\big[t(y_k, y_{\omega k})\big]^2$$
where t(y_k, y_{ωk}) is a deviation-adjusted difference between y_k and y_{ωk}; its parameters are σ_{ωk}, the standard deviation of the k-th component of the class-ω feature vectors, θ_ω and γ_ω, constants depending on ω, and C, a constant independent of the character class. The most important characteristic of this distance is that it introduces second-order statistics of the character features into the Euclidean distance, which gives the classifier a certain ability to describe the spatial distribution of the features.
4.2 Fine classification
The Bayes classifier is the theoretically optimal statistical classifier, and in practice one tries to approximate it as closely as possible. Under the conditions that the character features follow Gaussian distributions and the prior probabilities of all classes are equal, the Bayes classifier reduces to the Mahalanobis-distance classifier. These conditions are usually hard to satisfy in practice, however, and the performance of the Mahalanobis-distance classifier deteriorates seriously as errors in the covariance matrices arise. The invention therefore adopts the MQDF (modified quadratic discriminant function), a variant of the Mahalanobis distance, as the fine-classification measure. The MQDF discriminant function takes the form:
$$Q(Y, Y_\omega) = \frac{1}{h^2}\left\{\sum_{l=1}^{d}(y_l - y_{\omega l})^2 - \sum_{l=1}^{K}\Big(1 - \frac{h^2}{\lambda_{\omega l}}\Big)\Big[(Y - Y_\omega)^T \varphi_{\omega l}\Big]^2\right\} + \ln\Big(h^{2(d-K)}\prod_{l=1}^{K}\lambda_{\omega l}\Big)$$
where λ_{ωl} and φ_{ωl} are the l-th eigenvalue and eigenvector of the covariance matrix Σ_ω of the class-ω samples, K is the number of principal eigenvectors retained, i.e. the dimension of the principal subspace of the class, whose optimal value is determined experimentally, and h² is an estimate of the small eigenvalues. MQDF produces a quadratic decision surface; since only the first K principal eigenvectors of each class covariance matrix need to be estimated, the adverse effect of estimation errors in the small eigenvalues is avoided. The MQDF discriminant distance can be regarded as the weighted sum of the Mahalanobis distance in the K-dimensional principal subspace and the Euclidean distance in the remaining (d−K)-dimensional space, with weighting factor 1/h².
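The MQDF score itself is a direct transcription of the formula above; the sketch evaluates it for one class, with the logarithmic term rewritten as a sum of logarithms for numerical stability.

```python
import numpy as np

def mqdf_distance(Y, mean, eigvals, eigvecs, h2, K):
    """Modified quadratic discriminant function Q(Y, Y_w): eigvals/eigvecs are
    the leading eigenvalues/eigenvectors of the class covariance matrix, K the
    principal-subspace dimension and h2 the estimate of the small eigenvalues."""
    d = Y.shape[0]
    diff = Y - mean
    euclid = float(diff @ diff)                        # sum_l (y_l - y_wl)^2
    proj = eigvecs[:, :K].T @ diff                     # (Y - Y_w)^T phi_wl
    correction = float(np.sum((1.0 - h2 / eigvals[:K]) * proj ** 2))
    log_term = (d - K) * np.log(h2) + np.sum(np.log(eigvals[:K]))
    return (euclid - correction) / h2 + log_term

# Per class, the mean, eigenvalues and eigenvectors are precomputed from the
# training features; embodiment 1 uses K = 32 and sets h2 to the average of
# the K retained eigenvalues.
```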
4.3 confidence calculations
Let the output candidate set of the coarse classifier be CanSet = {(e_1, D_1), (e_2, D_2), ..., (e_L, D_L)}, where L is the candidate-set capacity, e_k and D_k are the candidate characters and the corresponding coarse-classification distances, and D_1 ≤ D_2 ≤ ... ≤ D_L. The role of the fine classifier is to re-rank CanSet according to the recomputed discriminant distances and find the most probable class of the input character. If the reliability of the coarse-classification result is sufficiently high, in other words, if e_1 is already the correct class of the input character, fine classification need not be carried out at all. The invention therefore performs a confidence analysis on the candidate set CanSet to decide whether fine classification is needed, using the distances output by EDD to compute the confidence according to:

$$\mathrm{Conf}(\mathrm{CanSet}) = \frac{D_2 - D_1}{D_1}$$
When the confidence is below a threshold Conf_TH, CanSet is passed to the fine classifier; otherwise CanSet is output directly.
The invention is characterized by being a printed Tibetan character recognition technique that can recognize multiple fonts and multiple font sizes. It comprises the following steps in order:
The input single Tibetan character is first given suitable position and size normalization, to eliminate as far as possible the differences in shape and pose caused by different font sizes and fonts; four-direction line-element features that reflect the structural characteristics of the Tibetan character well are then extracted; on this basis the LDA transform is used to extract the most discriminative features while reducing the dimensionality; and the transformed features are fed into a coarse-to-fine two-stage classifier based on recognition confidence to decide the class of the character. In a system composed of an image acquisition device and a computer, the method comprises the following steps in order:
1. Collection of character samples
Texts printed with multi-font, multi-size Tibetan characters are scanned in; after the necessary preprocessing with existing algorithms (noise removal, binarization, etc.), the Tibetan text is segmented to separate single characters, and the image of each character is labelled with the internal code of the correct character. The single Tibetan character samples for training and testing are thus collected and the training sample database is established.
2. Normalization, comprising normalization of character position and size
2.1 Locating the baseline position of a single Tibetan character
Let the original character image be [F(i,j)]_{W×H}, where W is the image width and H the image height; the pixel in row i and column j has value F(i,j), i = 1, 2, ..., H, j = 1, 2, ..., W.
The horizontal projection V(i), i = 1, 2, ..., H, of the character image is computed by:

$$V(i) = \sum_{j=1}^{W} F(i,j)$$

Then the ordinate P_I of the baseline position is:

$$P_I = \arg\max_i \big(V(i) - V(i-1)\big), \quad i = 2, 3, \ldots, H$$
2.2 Splitting the input image into two sub-images at the baseline
[F(i,j)]_{W×H} can be regarded as the vertical concatenation of two sub-images [F_1(i,j)]_{W×H_1} and [F_2(i,j)]_{W×H_2}, where [F_1(i,j)]_{W×H_1} is the part above the baseline, i.e. the upper vowel part, and [F_2(i,j)]_{W×H_2} is the part at and below the baseline. The two do not overlap but together vertically compose [F(i,j)]_{W×H}, and H_1 + H_2 = H.
Correspondingly, the normalized target character image [G(i,j)]_{M×N} can be regarded as the vertical concatenation of two sub-images [G_1(i,j)]_{M×N_1} and [G_2(i,j)]_{M×N_2}, where M is the width and N the height of the target image; [G_1(i,j)]_{M×N_1} is the part above the baseline, i.e. the upper vowel part, and [G_2(i,j)]_{M×N_2} is the part at and below it. The two do not overlap but together vertically compose [G(i,j)]_{M×N}, with N_1 = N/4 and N_2 = 3N/4.
2.3 Selection of the position-normalization reference points U_k(u_{ik}, u_{jk}), k = 1, 2
Let the centroid and the geometric center of the bounding frame of [F_k(i,j)]_{W×H_k}, k = 1, 2, be A_k(a_{ik}, a_{jk}) and B_k(b_{ik}, b_{jk}), k = 1, 2, respectively. The reference point U_k(u_{ik}, u_{jk}) is taken as a point between A_k and B_k, that is:

$$U_k = \beta A_k + (1-\beta) B_k, \quad k = 1, 2,$$

where β is a constant and 0 ≤ β ≤ 1.
The input lattice is shifted so that this reference point lies at the geometric center of the target lattice [G_k(i,j)]_{M×N_k}, k = 1, 2, thereby completing the position normalization of the input character.
2.4 size normalization
Since the relation between [F_k(i,j)]_{W×H_k} and [G_k(i,j)]_{M×N_k}, k = 1, 2, is G_k(i,j) = F_k(i/r_i, j/r_j), k = 1, 2, where r_i and r_j are the scale factors in the i and j directions (r_i = N_k/H_k, r_j = M/W), cubic B-spline interpolation is used to reduce distortions such as stepped edges in the normalized character. For a given (i, j), let

p_0 = [i/r_i], Δp = i/r_i − p_0, q_0 = [j/r_j], Δq = j/r_j − q_0,

where [·] is the bracket (integer-part) function. The interpolation can then be expressed as:

$$G_k(i,j) = F_k(p_0+\Delta p,\ q_0+\Delta q) = \sum_{m=-1}^{2}\sum_{l=-1}^{2} F_k(p_0+m,\ q_0+l)\, R_B(m-\Delta p)\, R_B\big(-(l-\Delta q)\big)$$

where R_B(z) is the cubic B-spline function

$$R_B(z) = \frac{1}{6}\Big[(z+2)^3 W(z+2) - 4(z+1)^3 W(z+1) + 6z^3 W(z) - 4(z-1)^3 W(z-1)\Big]$$

and W(z) is the unit step function: W(z) = 1 for z ≥ 0 and W(z) = 0 for z < 0.
3. Extraction of the four-direction line-element features of the Tibetan character
3.1 Character contour extraction
The whole character pattern is scanned; for each black pixel, whether it is kept is decided from the distribution of the pixels in its 8-neighbourhood. In this way the contour image [G'(i,j)]_{M×N} of the normalized character image [G(i,j)]_{M×N} is obtained.
3.2 Directional line-element feature extraction
First, each black pixel (i, j) of the contour lattice [G'(i,j)]_{M×N} is assigned horizontal (0°), vertical (90°), left-falling (45°) and right-falling (135°) line elements according to its positional relation to two other adjacent black pixels, recorded as a 4-dimensional vector X(i,j) = (x_v, x_k, x_p, x_o)^T.
The whole M×N contour image [G'(i,j)]_{M×N} is divided evenly into (2M/M_0 − 1)×(2N/N_0 − 1) sub-regions; each sub-region is further divided into four nested blocks A, B, C and D of sizes (M_0/4)×(N_0/4), (M_0/2)×(N_0/2), (3M_0/4)×(3N_0/4) and M_0×N_0, respectively. The feature vectors X_A = (x_v, x_k, x_p, x_o)^T, X_B = (x_v, x_k, x_p, x_o)^T, X_C = (x_v, x_k, x_p, x_o)^T, X_D = (x_v, x_k, x_p, x_o)^T of the blocks are the sums of the feature vectors of all black pixels in each block:
$$X_A = \sum_{(i,j)\in A} X(i,j), \quad X_B = \sum_{(i,j)\in B} X(i,j), \quad X_C = \sum_{(i,j)\in C} X(i,j), \quad X_D = \sum_{(i,j)\in D} X(i,j)$$
The directional line-element feature vector X_S = (x_v, x_k, x_p, x_o)^T of the whole sub-region is the weighted sum of the block feature vectors of that sub-region:

$$X_S = \alpha_A X_A + \alpha_B X_B + \alpha_C X_C + \alpha_D X_D$$

In this way a 4-dimensional feature vector is obtained from each sub-region; the vectors of all sub-regions are concatenated in order to form the 4(2M/M_0 − 1)(2N/N_0 − 1)-dimensional directional line-element feature vector of the input character.
4. Feature transformation
Let the number of character classes be c and the number of training samples of class ω be O_ω, ω = 1, 2, ..., c. Extracting the four-direction line-element features from the training samples of this class with the method above gives the feature-vector set {X_1^ω, X_2^ω, ..., X_{O_ω}^ω}, where each X_k^ω (k = 1, 2, ..., O_ω) is a 4(2M/M_0 − 1)(2N/N_0 − 1)-dimensional vector.
The original features are compressed with the LDA transform as follows.
First compute the center μ_ω of the feature vectors of each class ω (1 ≤ ω ≤ c), the center μ of the feature vectors of all classes, the between-class scatter matrix S_b and the average within-class scatter matrix S_w:

$$\mu_\omega = \frac{1}{O_\omega}\sum_{k=1}^{O_\omega} X_k^\omega, \qquad \mu = \frac{1}{c}\sum_{\omega=1}^{c} \mu_\omega$$

$$S_b = \frac{1}{c}\sum_{\omega=1}^{c}(\mu_\omega-\mu)(\mu_\omega-\mu)^T, \qquad S_w = \frac{1}{c}\sum_{\omega=1}^{c}\frac{1}{O_\omega}\sum_{k=1}^{O_\omega}(X_k^\omega-\mu_\omega)(X_k^\omega-\mu_\omega)^T$$
Find the transformation matrix Φ that maximizes tr[(Φ^T S_w Φ)^{-1}(Φ^T S_b Φ)]; the corresponding LDA feature transform is then Y = Φ^T X, where Y is the d-dimensional feature with the greatest discriminability.
5. Deciding the class of the input character, i.e. extracting features from the character image of unknown class and comparing them with the data in the recognition database to determine its correct character code.
5.1 Classifier design
For the feature vectors Y obtained by LDA compression, compute the mean vector \bar{Y}^ω (ω = 1, 2, ..., c) of each class and the variance σ_s^ω (ω = 1, 2, ..., c; s = 1, 2, ..., d) of each class's feature vectors on each dimension, d being the dimensionality of Y:
$$\bar{Y}^\omega = \frac{1}{O_\omega}\sum_{k=1}^{O_\omega} Y_k^\omega, \qquad \sigma_s^\omega = \frac{1}{O_\omega}\sum_{k=1}^{O_\omega}\big(y_{ks}^\omega - \bar{y}_s^\omega\big)^2$$
where {Y_1^ω, Y_2^ω, ..., Y_{O_ω}^ω} is the feature set of each Tibetan character class ω (1 ≤ ω ≤ c). The discriminative-feature mean vector and per-dimension variances of each class are stored in the discriminative-feature database file, together with the classifier parameters obtained by experiment.
5.2 Classification decision
For an input character image of unknown class, position and size normalization are first applied and the four-direction line-element feature X is extracted; the LDA transformation matrix Φ then converts the original directional line-element feature X into Y = Φ^T X = (y_1, y_2, ..., y_d)^T, where d is the dimensionality of the transformed feature.
The mean vectors \bar{Y}^ω = (\bar{y}_1^ω, \bar{y}_2^ω, ..., \bar{y}_d^ω)^T (ω = 1, 2, ..., c) and per-dimension variances σ_s^ω (ω = 1, 2, ..., c; s = 1, 2, ..., d) of all classes are read from the feature database file, and the Euclidean distance with deviation from Y to \bar{Y}^ω is computed:

$$D(Y, \bar{Y}^\omega) = \sum_{s=1}^{d}\big[t(y_s, \bar{y}_s^\omega)\big]^2$$

where t(y_s, \bar{y}_s^ω) is the deviation-adjusted difference term of the EDD classifier.
The distances for all ω = 1, 2, ..., c are sorted in ascending order, and the first L (1 ≤ L ≤ c) distances together with the character class codes e_k, k = 1, 2, ..., L, that they represent form the coarse-classification candidate set CanSet = {(e_1, D_1), (e_2, D_2), ..., (e_L, D_L)}, with D_1 ≤ D_2 ≤ ... ≤ D_L.
The recognition confidence Conf(CanSet) of the first candidate in CanSet is computed:

$$\mathrm{Conf}(\mathrm{CanSet}) = \frac{D_2 - D_1}{D_1}$$

If Conf(CanSet) exceeds the threshold Conf_TH, (e_1, D_1) is output directly as the recognition result, i.e. the input character is taken to belong to the character class of code e_1 with recognition distance D_1. Otherwise, the MQDF discriminant distance from Y to the character class of each code in CanSet is computed for ω = 1, 2, ..., L:
$$Q(Y, \bar{Y}^\omega) = \frac{1}{h^2}\left\{\sum_{l=1}^{d}\big(y_l - \bar{y}_l^\omega\big)^2 - \sum_{l=1}^{K}\Big(1 - \frac{h^2}{\lambda_{\omega l}}\Big)\Big[(Y - \bar{Y}^\omega)^T \varphi_{\omega l}\Big]^2\right\} + \ln\Big(h^{2(d-K)}\prod_{l=1}^{K}\lambda_{\omega l}\Big)$$
If Q(Y, \bar{Y}^τ) = min_{1≤ω≤L} Q(Y, \bar{Y}^ω), the input character is assigned to the character class of code e_τ, i.e. τ = arg min_{1≤ω≤L} Q(Y, \bar{Y}^ω).
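The coarse-to-fine decision of steps 5.1 and 5.2 can be put together as in the following sketch, where coarse_distance and fine_distance stand for the EDD and MQDF distances defined above; conf_th = 0.9 follows embodiment 1, while the candidate-set size L = 10 is an illustrative assumption.

```python
def classify(Y, coarse_distance, fine_distance, classes, L=10, conf_th=0.9):
    """Coarse-to-fine classification with confidence analysis."""
    ranked = sorted(classes, key=lambda w: coarse_distance(Y, w))
    canset = [(w, coarse_distance(Y, w)) for w in ranked[:L]]
    d1, d2 = canset[0][1], canset[1][1]
    conf = (d2 - d1) / d1                      # Conf(CanSet) = (D2 - D1) / D1
    if conf >= conf_th:                        # confident: accept coarse result
        return canset[0][0]
    return min((w for w, _ in canset), key=lambda w: fine_distance(Y, w))
```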
Experiments show that the recognition rate of the invention on the multi-font, multi-size printed Tibetan single-character test set reaches 99.83%, and the recognition rate on real texts also exceeds 99%.
Description of drawings
Fig. 1 Hardware configuration of a typical Tibetan character recognition system.
Fig. 2 Generation of Tibetan single-character samples.
Fig. 3 Structure of the Tibetan character recognition system.
Fig. 4 The image coordinate system used.
Fig. 5 Character normalization flow.
Fig. 6 Character normalization example.
Fig. 7 Directional line-element feature extraction flow.
Fig. 8 A normalized character and its contour.
Fig. 9 The horizontal, vertical, left-falling and right-falling direction attributes of the four-direction line elements.
Fig. 10 Division of the image into sub-regions.
Fig. 11 The blocks constituting a sub-region.
Fig. 12 Flow of the LDA feature transformation.
Fig. 13 The classification strategy.
Fig. 14 A multi-font, multi-size printed Tibetan character recognition system based on this algorithm.
Fig. 15 The multi-font printed Tibetan (mixed Chinese-English) document recognition system.
Embodiment
As shown in Fig. 1, a printed Tibetan character recognition system consists of two hardware parts: an image acquisition device and a computer. The image acquisition device, generally a scanner, is used to obtain digital images of the Tibetan characters. The computer processes the digital images and decides the character classes.
Fig. 2 shows the generation of the Tibetan single-character training and test samples. A printed Tibetan sample page is first scanned into the computer as a digital image. Preprocessing measures such as binarization and noise removal are applied to obtain a binary image. Line segmentation of the input image yields text lines; character segmentation is then applied to each line to obtain single Tibetan characters, and the class of each character image is labelled. The errors produced in the line/character segmentation and class labelling stages are then checked and corrected manually. Finally, the original character images belonging to the same character class are extracted and saved, completing the collection of the Tibetan single-character samples.
As shown in Fig. 3, the printed Tibetan character recognition algorithm is divided into two parts: a training system and a test system. In the training system, each sample of the input Tibetan single-character training set is normalized appropriately, its four-direction line-element features reflecting its compositional information are extracted, LDA is used to transform the features and reduce the original feature dimensionality, and a suitable classifier is then trained to obtain the feature database file. In the test system, a character image of unknown class is normalized and its features extracted in the same way as in the training system; the features are transformed with the matrix obtained in training and fed to the classifier, which decides the class of the input character.
The realization of a practical multi-font, multi-size printed Tibetan character recognition system therefore has to address the following aspects:
A) acquisition of Tibetan single-character samples;
B) realization of the training system;
C) realization of the test system.
These three aspects are described in detail below.
A) Acquisition of Tibetan single-character samples
The acquisition of printed Tibetan single-character samples proceeds as shown in Fig. 2. A printed Tibetan paper document is scanned to obtain a digital image, which is input to the computer. The image is then preprocessed by noise removal and binarization; many filtering methods for removing noise are documented in the existing literature, and binarization may use existing global or locally adaptive methods. Layout analysis is then performed on the document to obtain the character regions, and line segmentation and character segmentation based on the horizontal and vertical projection histograms, respectively, yield single characters. Segmentation errors at this stage are corrected manually. The classes of the resulting single Tibetan characters are labelled, usually automatically by computer, with manual handling of the errors (corrections, deletions, etc.). Finally, the original character images of the different fonts and sizes corresponding to characters with the same internal code are saved, yielding the multi-font, multi-size printed Tibetan single-character samples.
B) Realization of the training system
B.1 character normalization
B.1.1 Position normalization
Let the original character image be [F(i,j)]_{W×H}, with width W and height H; the pixel in row i and column j has value F(i,j), i = 1, 2, ..., H, j = 1, 2, ..., W. [F(i,j)]_{W×H} can be regarded as the vertical concatenation of two sub-images, the part above the baseline [F_1(i,j)]_{W×H_1} and the part at and below the baseline [F_2(i,j)]_{W×H_2}, with H_1 + H_2 = H. The horizontal projection V(i), i = 1, 2, ..., H, of the character image is computed by:

$$V(i) = \sum_{j=1}^{W} F(i,j)$$

Then the ordinate P_I of the baseline position is:

$$P_I = \arg\max_i \big(V(i) - V(i-1)\big), \quad i = 2, 3, \ldots, H$$

From P_I and the ordinate of the top of the character, H_1 can be determined; in the coordinate system of the invention (Fig. 4), H_1 is numerically equal to P_I.
Let the normalized character image be [G(i,j)]_{M×N}, with width M and height N; the pixel in row i and column j has value G(i,j), i = 1, 2, ..., N, j = 1, 2, ..., M. Similarly, [G(i,j)]_{M×N} can be regarded as the vertical concatenation of two sub-images, the part above the baseline [G_1(i,j)]_{M×N_1} and the part at and below the baseline [G_2(i,j)]_{M×N_2}, with N_1 = N/4 and N_2 = 3N/4 set here. Normalization can thus be regarded as mapping the input lattices [F_1(i,j)]_{W×H_1} and [F_2(i,j)]_{W×H_2} to the target lattices [G_1(i,j)]_{M×N_1} and [G_2(i,j)]_{M×N_2}, respectively. In this process a reference point U_k(u_{ik}, u_{jk}), k = 1, 2, is selected in each input lattice [F_k(i,j)]_{W×H_k}, and the input lattice is shifted so that this reference point lies at the center of the target lattice [G_k(i,j)]_{M×N_k}, thereby completing the position normalization of the input character.
Let the centroid and the geometric center of the bounding frame of [F_k(i,j)]_{W×H_k}, k = 1, 2, be A_k(a_{ik}, a_{jk}) and B_k(b_{ik}, b_{jk}), k = 1, 2, respectively. The reference point U_k(u_{ik}, u_{jk}) is taken as a point between A_k and B_k, that is:

$$U_k = \beta A_k + (1-\beta) B_k, \quad k = 1, 2,$$

where β is a constant and 0 ≤ β ≤ 1.
B.1.2 size normalization
Considering the relation between the input character image [F_k(i,j)]_{W×H_k}, k = 1, 2, and the normalized target lattice [G_k(i,j)]_{M×N_k}, k = 1, 2:

$$G_k(i,j) = F_k(i/r_i,\ j/r_j), \quad k = 1, 2$$

where r_i and r_j are the scale factors in the i and j directions: r_i = N_k/H_k, r_j = M/W. According to this formula, the point (i, j) of the output lattice corresponds to the point (i/r_i, j/r_j) of the input character. F_k(i, j) is a discrete function and i/r_i, j/r_j are generally not integers, so the value of F_k at (i/r_i, j/r_j) has to be estimated from its values at the known discrete points. Cubic B-spline interpolation is used to reduce distortion of the normalized character. For a given (i, j), let

p_0 = [i/r_i], Δp = i/r_i − p_0, q_0 = [j/r_j], Δq = j/r_j − q_0,

where [·] is the bracket (integer-part) function. The interpolation can then be expressed as:

$$G_k(i,j) = F_k(p_0+\Delta p,\ q_0+\Delta q) = \sum_{m=-1}^{2}\sum_{l=-1}^{2} F_k(p_0+m,\ q_0+l)\, R_B(m-\Delta p)\, R_B\big(-(l-\Delta q)\big)$$

where R_B(z) is the cubic B-spline function

$$R_B(z) = \frac{1}{6}\Big[(z+2)^3 W(z+2) - 4(z+1)^3 W(z+1) + 6z^3 W(z) - 4(z-1)^3 W(z-1)\Big]$$

and W(z) is the unit step function: W(z) = 1 for z ≥ 0 and W(z) = 0 for z < 0.
B.2 Directional line-element feature extraction
B.2.1 Extracting the character contour
The whole character pattern is scanned; a black pixel at a given position is kept if both the number of black pixels and the number of white pixels in its 8-neighbourhood are greater than 0, and otherwise the value of the character pattern at that position is changed to 0. In this way the contour image [G'(i,j)]_{M×N} is obtained from the normalized character image [G(i,j)]_{M×N}.
B.2.2 Blocking and formation of the feature vector
For each black pixel of the contour lattice [G'(i,j)]_{M×N}, line elements of four kinds, horizontal (0°), vertical (90°), left-falling (45°) and right-falling (135°), are assigned according to its positional relation to two other adjacent black pixels. Two cases are considered: if the three black pixels lie on the same straight line, only one kind of line element is assigned to the center pixel, with value 2; if they do not lie on the same straight line, two kinds of line elements are assigned to the center pixel, each with value 1. Assigning line elements to each black pixel of the character pattern according to these principles, each black pixel (i, j) yields a 4-dimensional vector X(i,j) = (x_v, x_k, x_p, x_o)^T whose components give the amounts of the four kinds of line elements at that black pixel.
After this, the M×N lattice is divided evenly into sub-regions of width M_0 and height N_0; adjacent sub-regions overlap by M_0/2 pixels horizontally and N_0/2 pixels vertically, so the total number of sub-regions is (2M/M_0 − 1)×(2N/N_0 − 1). Each sub-region is then divided into four nested blocks A, B, C and D of sizes (M_0/4)×(N_0/4), (M_0/2)×(N_0/2), (3M_0/4)×(3N_0/4) and M_0×N_0, respectively. For each block a 4-dimensional vector X_A = (x_v, x_k, x_p, x_o)^T, X_B = (x_v, x_k, x_p, x_o)^T, X_C = (x_v, x_k, x_p, x_o)^T, X_D = (x_v, x_k, x_p, x_o)^T is defined, giving the sums of the line-element amounts in the 0°, 90°, 45° and 135° directions over the pixels in that block, that is:
$$X_A = \sum_{(i,j)\in A} X(i,j), \quad X_B = \sum_{(i,j)\in B} X(i,j), \quad X_C = \sum_{(i,j)\in C} X(i,j), \quad X_D = \sum_{(i,j)\in D} X(i,j)$$
The directional line-element feature vector X_S = (x_v, x_k, x_p, x_o)^T of the whole sub-region is expressed as the weighted sum of the block feature vectors of that sub-region, that is:

$$X_S = \alpha_A X_A + \alpha_B X_B + \alpha_C X_C + \alpha_D X_D$$

where α_A, α_B, α_C, α_D are constants between 0 and 1 that express how much the feature vector of each block contributes to the overall feature vector of the sub-region. A 4-dimensional feature vector is thus obtained from each sub-region, and the vectors of all sub-regions are concatenated in order to form the 4(2M/M_0 − 1)(2N/N_0 − 1)-dimensional directional line-element feature vector.
B.3 Feature transformation
Let the number of character classes be c (c = 592 in Tibetan character recognition) and the number of training samples of class ω be O_ω, ω = 1, 2, ..., c; their original directional line-element feature vectors form the set {X_1^ω, X_2^ω, ..., X_{O_ω}^ω}, where each X_k^ω (k = 1, 2, ..., O_ω) is a 4(2M/M_0 − 1)(2N/N_0 − 1)-dimensional vector.
First compute the center μ_ω of the feature vectors of each class ω (1 ≤ ω ≤ c), the center μ of the feature vectors of all classes, the between-class scatter matrix S_b and the average within-class scatter matrix S_w:
$$\mu_\omega = \frac{1}{O_\omega}\sum_{k=1}^{O_\omega} X_k^\omega, \qquad \mu = \frac{1}{c}\sum_{\omega=1}^{c} \mu_\omega$$

$$S_b = \frac{1}{c}\sum_{\omega=1}^{c}(\mu_\omega-\mu)(\mu_\omega-\mu)^T, \qquad S_w = \frac{1}{c}\sum_{\omega=1}^{c}\frac{1}{O_\omega}\sum_{k=1}^{O_\omega}(X_k^\omega-\mu_\omega)(X_k^\omega-\mu_\omega)^T$$
Find the transformation matrix Φ that maximizes tr[(Φ^T S_w Φ)^{-1}(Φ^T S_b Φ)], i.e. that maximizes the ratio of between-class scatter to within-class scatter so as to increase the separability of the pattern classes.
With a matrix computation tool, compute the d (d ≤ 4(2M/M_0 − 1)(2N/N_0 − 1)) largest non-zero eigenvalues ξ_k (k = 1, 2, ..., d) of the matrix S_w^{-1} S_b and the corresponding eigenvectors φ_k (k = 1, 2, ..., d). The transformation matrix of the LDA transform is then Φ = [φ_1, φ_2, ..., φ_d], and the corresponding feature transform is Y = Φ^T X, where Y is the d-dimensional feature with the greatest discriminability.
B.4 Classifier design
For the feature vectors Y obtained by the LDA transform, compute the mean vector \bar{Y}^ω (ω = 1, 2, ..., c) of each class and the variance σ_s^ω (ω = 1, 2, ..., c; s = 1, 2, ..., d) of each class's feature vectors on each dimension, d being the dimensionality of Y:

$$\bar{Y}^\omega = \frac{1}{O_\omega}\sum_{k=1}^{O_\omega} Y_k^\omega, \qquad \sigma_s^\omega = \frac{1}{O_\omega}\sum_{k=1}^{O_\omega}\big(y_{ks}^\omega - \bar{y}_s^\omega\big)^2$$

where {Y_1^ω, Y_2^ω, ..., Y_{O_ω}^ω} is the most separable feature set of each Tibetan character class ω (1 ≤ ω ≤ c). The discriminative-feature mean vector and per-dimension variances of each class are stored in the discriminative-feature database file, together with the classifier parameter values adjusted by experiment. The design and training of the classifier are thus complete.
C) Realization of the test system
For an input character image of unknown class, position and size normalization are first applied and the four-direction line-element feature X is extracted; the LDA transformation matrix Φ then converts the original directional line-element feature X into Y = Φ^T X = (y_1, y_2, ..., y_d)^T, where d is the dimensionality of the transformed feature.
The mean vectors \bar{Y}^ω = (\bar{y}_1^ω, \bar{y}_2^ω, ..., \bar{y}_d^ω)^T (ω = 1, 2, ..., c) and per-dimension variances σ_s^ω (ω = 1, 2, ..., c; s = 1, 2, ..., d) of all classes are read from the feature database file, and the Euclidean distance with deviation from Y to \bar{Y}^ω is computed:

$$D(Y, \bar{Y}^\omega) = \sum_{s=1}^{d}\big[t(y_s, \bar{y}_s^\omega)\big]^2$$

where t(y_s, \bar{y}_s^ω) is the deviation-adjusted difference term of the EDD classifier. The distances D(Y, \bar{Y}^ω) for all ω = 1, 2, ..., c are sorted in ascending order, and the first L (1 ≤ L ≤ c) distances together with the character class codes e_k, k = 1, 2, ..., L, that they represent form the coarse-classification candidate set CanSet = {(e_1, D_1), (e_2, D_2), ..., (e_L, D_L)}, with D_1 ≤ D_2 ≤ ... ≤ D_L.
The recognition confidence Conf(CanSet) of the first candidate in CanSet is computed:

$$\mathrm{Conf}(\mathrm{CanSet}) = \frac{D_2 - D_1}{D_1}$$

If Conf(CanSet) exceeds the threshold Conf_TH, (e_1, D_1) is output directly as the recognition result, i.e. the input character is taken to belong to the character class of code e_1 with recognition distance D_1. Otherwise, the MQDF discriminant distance from Y to the character class of each code in CanSet is computed for ω = 1, 2, ..., L:
$$Q(Y, \bar{Y}^\omega) = \frac{1}{h^2}\left\{\sum_{l=1}^{d}\big(y_l - \bar{y}_l^\omega\big)^2 - \sum_{l=1}^{K}\Big(1 - \frac{h^2}{\lambda_{\omega l}}\Big)\Big[(Y - \bar{Y}^\omega)^T \varphi_{\omega l}\Big]^2\right\} + \ln\Big(h^{2(d-K)}\prod_{l=1}^{K}\lambda_{\omega l}\Big)$$
If Q(Y, \bar{Y}^τ) = min_{1≤ω≤L} Q(Y, \bar{Y}^ω), the input character is assigned to the character class of code e_τ, i.e. τ = arg min_{1≤ω≤L} Q(Y, \bar{Y}^ω). Two concrete implementation examples are given below.
Embodiment 1: a multi-font, multi-size printed Tibetan character recognition system. Based on the recognition system of the invention shown in Fig. 14a, experiments were carried out on 1200 sets of collected printed Tibetan documents (each document covers all 592 modern Tibetan characters). Most of the sample documents were taken from the main current Tibetan publishing systems (Founder, Huaguang), and a small portion were printed directly from TrueType fonts. The fonts include not only the most common lean style, black style and ordinary style but also round, long and bamboo styles, and the font sizes range from No. 6 to No. 1. The sample quality varies, the ratio of normal, broken and touching characters being about 2:1:1. Through scanning, line and character segmentation and internal-code labelling, the 1200 sets of Tibetan documents were converted into 1200 sets of single-character samples (i.e. 1200 single-character samples per character class), of which 900 sets were drawn at random to form the training set and the remaining 300 sets were kept as test samples.
In the experiments, each Tibetan character was normalized with the method of the invention to a 48×96 lattice, with normalization parameter β = 0.5. The sub-regions in the four-direction line-element feature extraction were divided as shown in Fig. 10, with M_0 = N_0 = 16, and the weighting coefficients α_A, α_B, α_C, α_D of the block feature vectors within a sub-region were 0.4, 0.3, 0.2 and 0.1, respectively. After the directional line-element features were extracted with the flow of Fig. 7, LDA was applied for feature compression, the transformed feature dimensionality d being chosen as 128 (Fig. 14c). The parameters of the coarse classifier EDD were θ_1 = θ_2 = ... = θ_592 = 0.8, γ_1 = γ_2 = ... = γ_592 = 2.2 and C = 20; the threshold used in the coarse-classification confidence analysis was Conf_TH = 0.9; and in the fine classifier MQDF (Fig. 14b) K = 32, with h² estimated as the average of the K eigenvalues of the covariance matrix of each class. The experimental results on the test set are shown in Table 1.
Table 1. Recognition rates of the system on the test sample sets of six Tibetan fonts
Font                  Lean     Black    Ordinary  Round    Long     Bamboo   Average
Number of characters  36112    39072    35520     30192    14800    22496    -
Recognition rate      99.94%   99.86%   99.83%    99.85%   99.58%   99.76%   99.83%
As can be seen from Table 1, the average recognition rate for multi-font, multi-size Tibetan characters reaches 99.83%, demonstrating the effectiveness of the proposed method.
Embodiment 2: a multi-font printed Tibetan (mixed Chinese-English) document recognition system
The multi-font printed Tibetan (mixed Chinese-English) document recognition system was developed to meet the needs of office automation in Tibetan areas and to promote the development of Chinese multilingual information processing technology; its block diagram is shown in Fig. 15. It mainly comprises an image input and preprocessing subsystem, a line and character segmentation subsystem, a character recognition subsystem and a post-processing subsystem. The present invention is the main component of the character recognition subsystem; in cooperation with the Chinese and English recognition cores, it automatically recognizes multi-font printed documents that are mainly Tibetan with some interspersed Chinese, English, digits and symbols, converting the document image into text that the computer can "read".
The Tibetan character recognition part of this system uses the method proposed by the invention, with the same parameters as in embodiment 1 and the character feature database transplanted from embodiment 1. The system passed the expert appraisal organized by the Ministry of Education in November 2003. In the performance test, 62 pages containing 95,583 characters in total were selected at random from more than 500 pages (over 520,000 characters) of actual printed Tibetan documents (taken from books, newspapers, magazines and other publications) provided by Northwest University for Nationalities; the results are as follows:
Table 2. Test performance of the multi-font printed Tibetan (mixed Chinese-English) document recognition system
Character type      Number of characters   Recognition rate (%)   ACE (%)   ASE (%)   UTE (%)
Tibetan             91636                  99.06                  0.30      0.57      0.07
Chinese             804                    96.27                  1.99      1.74      0
English + symbols   2118                   86.59                  5.24      6.66      1.51
Digits              1025                   92.39                  3.61      3.42      0.58
Total               95583                  98.68                  -         -         -
Note: ACE is the interpretable segmentation error rate, ASE the interpretable recognition error rate, and UTE the error rate of errors whose type cannot be judged. The results show that the multi-font, multi-size printed Tibetan character recognition proposed by the invention fully meets the needs of practical application, achieves good recognition performance, and has broad application prospects.

Claims (1)

1. A method for recognizing multi-font, multi-size printed Tibetan characters, characterized in that a normalization scheme is proposed for the characteristic that printed Tibetan characters are non-Chinese characters: using the baseline, i.e. the upper horizontal line, as the dividing line, the character image is decomposed into two non-overlapping sub-images, and each sub-image is subjected to position normalization combining the centre of gravity with the outer frame, and to size normalization based on cubic B-spline function interpolation; four-direction line-element features that fully reflect the composition information of Tibetan characters are extracted, and compact character feature vectors are obtained after dimensionality reduction by linear discriminant analysis (LDA); a coarse-to-fine two-level classification strategy based on confidence analysis is used to decide the character class, the coarse and fine classifiers being, respectively, the Euclidean distance with deviation (EDD) and the modified quadratic discriminant function (MQDF); in a system composed of an image acquisition device and a computer, the method comprises the following steps in sequence:
(1) Settings:
(1.1) the total number of Tibetan character classes handled by the present invention, c = 592;
(1.2) the character width M and height N after normalization, and the position normalization parameter β;
(1.3) the subregion width M_0 and height N_0 used when extracting the directional line-element features, and the weighting coefficients α_A, α_B, α_C, α_D of each square's feature vector within a subregion;
(1.4) the parameters C, θ_k, γ_k of the coarse classifier EDD, where k = 1, 2, …, 592;
(1.5) the confidence threshold Conf_TH;
(2) Collection of character samples
Texts printed with multi-font, multi-size Tibetan characters are input to the computer through a scanner; after necessary preprocessing with existing methods, such as noise removal and binarization, the Tibetan text is segmented into individual characters, and the image of each character is labelled with the internal code of the character it corresponds to; the Tibetan single-character samples for training and testing are thereby collected and a training sample database is established;
(3) Normalization, comprising normalization of character position and size
(3.1) Locating the baseline position of a single Tibetan character
Let the original character image be $[F(i,j)]_{W \times H}$, where W is the image width, H is the image height, and the value of the pixel in row i and column j is F(i, j), i = 1, 2, …, H, j = 1, 2, …, W.
The horizontal projection V(i), i = 1, 2, …, H, of the character image is computed as
$$V(i) = \sum_{j=1}^{W} F(i,j),$$
and the ordinate $P_1$ of the baseline position is then
$$P_1 = \arg\max_{i} \left( V(i) - V(i-1) \right), \quad i = 2, 3, \cdots, H;$$
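The following Python sketch illustrates the projection-based baseline location of step (3.1). The array layout (H rows × W columns), the foreground-equals-1 convention and the function name are illustrative assumptions, not part of the claim.

```python
import numpy as np

def locate_baseline(F):
    """Locate the baseline row of a binary Tibetan character image F
    (H x W, foreground pixels = 1): the row where the horizontal
    projection V(i) shows the largest increase over the previous row."""
    V = F.sum(axis=1)               # horizontal projection V(i), i = 0..H-1
    diffs = V[1:] - V[:-1]          # V(i) - V(i-1) for i = 1..H-1
    P1 = int(np.argmax(diffs)) + 1  # 0-based row index of the baseline
    return P1
```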
(3.2) Separating the input image into two sub-images at the baseline
$[F(i,j)]_{W \times H}$ can be regarded as the vertical concatenation of two sub-images $[F_1(i,j)]_{W \times H_1}$ and $[F_2(i,j)]_{W \times H_2}$, where $[F_1(i,j)]_{W \times H_1}$ is the part above the baseline, i.e. the upper vowel part, and $[F_2(i,j)]_{W \times H_2}$ is the part from the baseline downwards; the two do not overlap but are vertically combined into $[F(i,j)]_{W \times H}$, with $H_1 + H_2 = H$, and the size of $H_1$ is determined by the difference between $P_1$ and the ordinate of the character top.
Correspondingly, the normalized target character image $[G(i,j)]_{M \times N}$ can also be regarded as the vertical concatenation of two sub-images $[G_1(i,j)]_{M \times N_1}$ and $[G_2(i,j)]_{M \times N_2}$, where M is the width of the target image and N is its height; $[G_1(i,j)]_{M \times N_1}$ is the part above the baseline, i.e. the upper vowel part, and $[G_2(i,j)]_{M \times N_2}$ is the part from the baseline downwards; they likewise do not overlap but are vertically combined into $[G(i,j)]_{M \times N}$, and $N_1 = N/4$, $N_2 = 3N/4$ are set;
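A minimal sketch of the split in step (3.2) follows. Splitting exactly at row P1 (rather than measuring H_1 from the character top) and the function name are simplifying assumptions.

```python
import numpy as np

def split_at_baseline(F, P1, N=96):
    """Split a binary character image F (H rows x W columns) at the
    baseline row P1 into the upper vowel part F1 and the lower body
    part F2, and return the target heights allotted to each part."""
    F1 = F[:P1, :]                 # part above the baseline (upper vowels)
    F2 = F[P1:, :]                 # part from the baseline downwards
    N1, N2 = N // 4, 3 * N // 4    # target heights N1 = N/4 and N2 = 3N/4
    return F1, F2, N1, N2
```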
(3.3) Position normalization
The reference point $U_k(u_{ik}, u_{jk})$, k = 1, 2, is selected as follows. The centre of gravity and the outer-frame centre of $[F_k(i,j)]_{W \times H_k}$, k = 1, 2, are, respectively, $A_k(a_{ik}, a_{jk})$ and $B_k(b_{ik}, b_{jk})$, k = 1, 2, where the centre of gravity is the foreground centroid of the sub-image and the outer-frame centre is the centre of its bounding box.
$U_k(u_{ik}, u_{jk})$, k = 1, 2, is then taken as a point lying between $A_k(a_{ik}, a_{jk})$ and $B_k(b_{ik}, b_{jk})$, determined by the weight β, where β is a constant and 0 ≤ β ≤ 1.
The input image lattice is shifted so that this reference point is located at the geometric centre of the target lattice $[G_k(i,j)]_{M \times N_k}$, k = 1, 2, thereby completing the position normalization of the input character;
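A sketch of the position normalization of step (3.3) follows. The patent only states that U_k lies between the centroid A_k and the frame centre B_k; the concrete convex combination U = (1−β)A + βB used below, and the pixel-copying shift, are assumptions (with the experimental β = 0.5 both orderings give the midpoint).

```python
import numpy as np

def position_normalize(Fk, M, Nk, beta=0.5):
    """Shift a binary sub-image Fk so that the reference point U_k, taken
    between the centroid A_k and the bounding-box centre B_k with weight
    beta, lands on the geometric centre of an M x Nk target grid."""
    ii, jj = np.nonzero(Fk)
    A = np.array([ii.mean(), jj.mean()])                # centre of gravity
    B = np.array([(ii.min() + ii.max()) / 2.0,
                  (jj.min() + jj.max()) / 2.0])         # outer-frame centre
    U = (1 - beta) * A + beta * B                       # reference point (assumed form)
    target_centre = np.array([(Nk - 1) / 2.0, (M - 1) / 2.0])
    di, dj = np.round(target_centre - U).astype(int)
    G = np.zeros((Nk, M), dtype=int)
    for i, j in zip(ii, jj):                            # copy shifted foreground pixels
        ti, tj = i + di, j + dj
        if 0 <= ti < Nk and 0 <= tj < M:
            G[ti, tj] = 1
    return G
```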
(3.4) Size normalization
Since the relation between $[F_k(i,j)]_{W \times H_k}$, k = 1, 2, and $[G_k(i,j)]_{M \times N_k}$, k = 1, 2, is
$$G_k(i,j) = F_k(i/r_i,\; j/r_j), \quad k = 1, 2,$$
where $r_i$ and $r_j$ are the scale factors in the i and j directions respectively, $r_i = N_k / H_k$ and $r_j = M / W$, a cubic B-spline function is used to carry out the interpolation.
For a given (i, j), let
$$p_0 = [\,i/r_i\,], \quad \Delta p = i/r_i - p_0, \quad q_0 = [\,j/r_j\,], \quad \Delta q = j/r_j - q_0,$$
where [·] is the integer-part (floor) function.
The interpolation can then be expressed as
$$G_k(i,j) = F_k(p_0+\Delta p,\; q_0+\Delta q) = \sum_{m=-1}^{2} \sum_{l=-1}^{2} F_k(p_0+m,\; q_0+l)\, R_B(m-\Delta p)\, R_B\!\left(-(l-\Delta q)\right),$$
where $R_B(z)$ is the cubic B-spline function
$$R_B(z) = \frac{1}{6}\left[ (z+2)^3 W(z+2) - 4(z+1)^3 W(z+1) + 6 z^3 W(z) - 4(z-1)^3 W(z-1) \right],$$
and W(z) is the unit step function;
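The size normalization of step (3.4) can be illustrated by the following sketch. Edge-replication padding (so the 4 × 4 neighbourhood is always available) and the function names are assumptions; the kernel is the R_B(z) formula above.

```python
import numpy as np

def cubic_bspline(z):
    """Cubic B-spline kernel R_B(z) built from the unit step W(z)."""
    W = lambda t: np.where(t >= 0, 1.0, 0.0)
    return ((z + 2) ** 3 * W(z + 2)
            - 4 * (z + 1) ** 3 * W(z + 1)
            + 6 * z ** 3 * W(z)
            - 4 * (z - 1) ** 3 * W(z - 1)) / 6.0

def resize_bspline(Fk, M, Nk):
    """Rescale a sub-image Fk (Hk x W) to Nk x M by cubic B-spline
    interpolation, following the interpolation formula of step (3.4)."""
    Hk, W_ = Fk.shape
    ri, rj = Nk / Hk, M / W_                 # scale factors r_i and r_j
    Fp = np.pad(Fk.astype(float), 2, mode='edge')
    G = np.zeros((Nk, M))
    for i in range(Nk):
        p0, dp = int(i / ri), i / ri - int(i / ri)
        for j in range(M):
            q0, dq = int(j / rj), j / rj - int(j / rj)
            for m in (-1, 0, 1, 2):          # 4 x 4 interpolation neighbourhood
                for l in (-1, 0, 1, 2):
                    G[i, j] += (Fp[p0 + m + 2, q0 + l + 2]
                                * cubic_bspline(m - dp)
                                * cubic_bspline(-(l - dq)))
    return G
```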
(4) Extracting the four-direction line-element features of the Tibetan character
(4.1) Character contour extraction
The whole character pattern is scanned; for a black pixel at a given position, if the numbers of black pixels and of background pixels in its 8-neighbourhood are both greater than 0, this black pixel is kept; otherwise it is set to a background pixel. In this way the contour image $[G'(i,j)]_{M \times N}$ of the normalized character image $[G(i,j)]_{M \times N}$ is obtained;
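A sketch of the 8-neighbourhood contour test of step (4.1); zero padding at the image border and the function name are assumptions.

```python
import numpy as np

def extract_contour(G):
    """Keep a foreground pixel of the binary image G only if its
    8-neighbourhood contains both foreground and background pixels,
    i.e. drop interior pixels of strokes."""
    G = (G > 0).astype(int)
    Gp = np.pad(G, 1, mode='constant')        # pad so every pixel has 8 neighbours
    C = np.zeros_like(G)
    H, W = G.shape
    for i in range(H):
        for j in range(W):
            if G[i, j]:
                nb = Gp[i:i + 3, j:j + 3]     # 3x3 window centred on (i, j)
                black = nb.sum() - G[i, j]    # black pixels among the 8 neighbours
                white = 8 - black
                if black > 0 and white > 0:   # boundary pixel: keep it
                    C[i, j] = 1
    return C
```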
(4.2) Formation of the directional line-element features
First, for each black pixel (i, j) in the character contour lattice $[G'(i,j)]_{M \times N}$, according to the positional relation between it and two other adjacent black pixels, it is assigned line-element values of the four kinds — horizontal, vertical, left-falling and right-falling — recorded as a 4-dimensional vector $X(i,j) = (x_v, x_k, x_p, x_o)^T$.
The whole character contour image $[G'(i,j)]_{M \times N}$ of size M × N is evenly divided into $\left(\frac{2M}{M_0}-1\right) \times \left(\frac{2N}{N_0}-1\right)$ subregions of width $M_0$ and height $N_0$; each subregion is further divided into four nested squares A, B, C, D whose sizes are, in turn, $(M_0/4) \times (N_0/4)$, $(M_0/2) \times (N_0/2)$, $(3M_0/4) \times (3N_0/4)$ and $M_0 \times N_0$. The feature vectors $X_A$, $X_B$, $X_C$, $X_D$ of the squares are the sums of the feature vectors of all black pixels inside each square:
$$X_A = \sum_{(i,j) \in A} X(i,j), \quad X_B = \sum_{(i,j) \in B} X(i,j), \quad X_C = \sum_{(i,j) \in C} X(i,j), \quad X_D = \sum_{(i,j) \in D} X(i,j).$$
The directional line-element feature vector $X_S = (x_v, x_k, x_p, x_o)^T$ of the whole subregion is the weighted sum of the squares' feature vectors within that subregion:
$$X_S = \alpha_A X_A + \alpha_B X_B + \alpha_C X_C + \alpha_D X_D,$$
where $\alpha_A, \alpha_B, \alpha_C, \alpha_D$ are constants between 0 and 1. In this way a 4-dimensional feature vector is obtained from each subregion, and the feature vectors of all subregions, arranged in order, form the $4\left(\frac{2M}{M_0}-1\right)\left(\frac{2N}{N_0}-1\right)$-dimensional original directional line-element feature vector of the input character;
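A sketch of the subregion feature pooling of step (4.2) follows. The exact rule by which a contour pixel is assigned its four direction values from two adjacent black pixels is not spelled out here, so the direction map below simply marks a direction when the neighbour in that direction is also a contour pixel; centring the nested squares inside each subregion is likewise an assumption.

```python
import numpy as np

# Assumed offsets for horizontal, vertical, left-falling and right-falling strokes.
DIRS = {0: (0, 1), 1: (1, 0), 2: (1, -1), 3: (1, 1)}

def direction_map(C):
    """4-channel map X(i,j) of directional line elements on contour C."""
    H, W = C.shape
    X = np.zeros((H, W, 4))
    for i in range(H):
        for j in range(W):
            if C[i, j]:
                for d, (di, dj) in DIRS.items():
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W and C[ni, nj]:
                        X[i, j, d] = 1
    return X

def line_element_features(C, M0=16, N0=16, alphas=(0.4, 0.3, 0.2, 0.1)):
    """Overlapping-subregion features: (2M/M0-1) x (2N/N0-1) subregions,
    each pooled over four nested squares A..D weighted by alpha_A..alpha_D."""
    X = direction_map(C)
    N, M = C.shape                                     # N rows x M columns
    feats = []
    for ci in range(0, N - N0 + 1, N0 // 2):           # subregions overlap by half
        for cj in range(0, M - M0 + 1, M0 // 2):
            sub = X[ci:ci + N0, cj:cj + M0]
            vec = np.zeros(4)
            for a, f in zip(alphas, (0.25, 0.5, 0.75, 1.0)):
                h, w = int(N0 * f), int(M0 * f)        # nested square size
                i0, j0 = (N0 - h) // 2, (M0 - w) // 2  # centred inside the subregion
                vec += a * sub[i0:i0 + h, j0:j0 + w].sum(axis=(0, 1))
            feats.append(vec)
    return np.concatenate(feats)        # 4*(2M/M0-1)*(2N/N0-1) dimensions
```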
(5) Feature transformation
Let the number of Tibetan character classes be c and the number of training samples of character class ω be $O_\omega$, ω = 1, 2, …, c. After the four-direction line-element features are extracted from the training samples of this class with the above method, the resulting feature vector set is $\{X_1^\omega, X_2^\omega, \cdots, X_{O_\omega}^\omega\}$, where each $X_k^\omega$ (k = 1, 2, …, $O_\omega$) is a $4\left(\frac{2M}{M_0}-1\right)\left(\frac{2N}{N_0}-1\right)$-dimensional vector.
The original features are compressed with the LDA transform as follows.
First compute the centre $\mu_\omega$ of the feature vectors of each character class ω (1 ≤ ω ≤ c), the centre μ over all character classes, the between-class scatter matrix $S_b$ and the average within-class scatter matrix $S_w$ of the feature vectors:
$$\mu_\omega = \frac{1}{O_\omega} \sum_{k=1}^{O_\omega} X_k^\omega,$$
$$\mu = \frac{1}{c} \sum_{\omega=1}^{c} \mu_\omega,$$
$$S_b = \frac{1}{c} \sum_{\omega=1}^{c} (\mu_\omega - \mu)(\mu_\omega - \mu)^T,$$
$$S_w = \frac{1}{c} \sum_{\omega=1}^{c} \frac{1}{O_\omega} \sum_{k=1}^{O_\omega} (X_k^\omega - \mu_\omega)(X_k^\omega - \mu_\omega)^T.$$
Find the transformation matrix Φ that maximizes $\mathrm{tr}\!\left[(\Phi^T S_w \Phi)^{-1} (\Phi^T S_b \Phi)\right]$; the eigentransform corresponding to LDA is then $Y = \Phi^T X$, where Y is the d-dimensional discriminative feature;
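A sketch of the LDA compression of step (5): the trace criterion is maximized by the leading generalized eigenvectors of (S_b, S_w). The small regularizer added to S_w and the function name are assumptions made so the sketch runs on rank-deficient scatter matrices.

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(class_features, d=128):
    """Learn the LDA matrix Phi from a list of per-class (O_w x D) feature
    arrays by solving the generalized eigenproblem Sb v = lambda Sw v and
    keeping the d leading eigenvectors; project with Y = Phi.T @ X."""
    mus = np.array([f.mean(axis=0) for f in class_features])   # class centres
    mu = mus.mean(axis=0)                                       # overall centre
    D = mus.shape[1]
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for f, m in zip(class_features, mus):
        Sb += np.outer(m - mu, m - mu)
        Sw += np.cov(f.T, bias=True)          # within-class scatter of this class
    c = len(class_features)
    Sb /= c
    Sw /= c
    Sw += 1e-6 * np.eye(D)                    # regularizer so Sw is positive definite
    vals, vecs = eigh(Sb, Sw)                 # generalized symmetric eigenproblem
    Phi = vecs[:, np.argsort(vals)[::-1][:d]]
    return Phi
```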
(6) Decision of the class of the input character, i.e., for a character image of unknown class, features are extracted and compared with the data stored in the recognition library to determine its correct character code;
(6.1) Classifier design
For the feature vectors Y obtained by LDA compression, compute the mean vector $\overline{Y^\omega}$ (ω = 1, 2, …, c) of each character class and the variance $\sigma_s^\omega$ (ω = 1, 2, …, c; s = 1, 2, …, d) of each class's features in every dimension, where d is the dimension of Y:
$$\overline{Y^\omega} = \frac{1}{O_\omega} \sum_{k=1}^{O_\omega} Y_k^\omega,$$
$$\sigma_s^\omega = \frac{1}{O_\omega} \sum_{k=1}^{O_\omega} \left( y_{ks}^\omega - \overline{y}_s^\omega \right)^2,$$
where $\{Y_k^\omega\}$ (1 ≤ ω ≤ c) is the feature set of each Tibetan character class ω. The discriminative feature mean vector of each character and the per-dimension variances are stored in a discriminative feature database file, and the classifier parameters obtained through experiments are stored in the library file at the same time;
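A minimal sketch of the per-class statistics gathered in step (6.1); the container layout and function name are assumptions.

```python
import numpy as np

def build_class_statistics(class_lda_features):
    """For every character class w, compute the mean vector and the
    per-dimension variance of its LDA-compressed features; these are the
    quantities stored in the feature library file."""
    means, variances = [], []
    for Yw in class_lda_features:            # Yw: (O_w x d) array for class w
        means.append(Yw.mean(axis=0))
        variances.append(Yw.var(axis=0))     # (1/O_w) * sum_k (y_ks - mean_s)^2
    return np.array(means), np.array(variances)
```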
(6.2) Classification decision
For an input character image of unknown class, position normalization and size normalization are first performed, the four-direction line-element feature X is then extracted, and the LDA linear transformation matrix Φ is used to transform the original line-element feature X into $Y = \Phi^T X = (y_1, y_2, \cdots, y_d)^T$, where d is the feature dimension after transformation.
The mean vectors $\overline{Y^\omega} = (\overline{y_1^\omega}, \overline{y_2^\omega}, \cdots, \overline{y_d^\omega})^T$ (ω = 1, 2, …, c) of all character classes and the per-dimension variances $\sigma_s^\omega$ (ω = 1, 2, …, c; s = 1, 2, …, d) of each class are read from the library file, and the Euclidean distance with deviation $D(Y, \overline{Y^\omega})$ from Y to $\overline{Y^\omega}$ is computed:
$$D(Y, \overline{Y^\omega}) = \sum_{s=1}^{d} \left[ t(y_s, \overline{y_s^\omega}) \right]^2,$$
where $t(y_s, \overline{y_s^\omega})$ is the per-dimension deviation term of the EDD classifier, defined from $y_s$, $\overline{y_s^\omega}$, $\sigma_s^\omega$ and the EDD parameters.
All the distances $D(Y, \overline{Y^\omega})$, ω = 1, 2, …, c, are computed and sorted in ascending order, and the first L (1 ≤ L ≤ c) distances and the character class codes $e_k$, k = 1, 2, …, L, they represent are selected to form the coarse-classification candidate set CanSet = {(e_1, D_1), (e_2, D_2), …, (e_L, D_L)}, with D_1 ≤ D_2 ≤ … ≤ D_L.
The recognition confidence Conf(CanSet) of the first candidate in CanSet is computed:
$$Conf(CanSet) = \frac{D_2 - D_1}{D_1}.$$
If Conf(CanSet) is higher than the threshold Conf_TH, (e_1, D_1) is output directly as the recognition result of the input character, i.e. the input character is considered to belong to the character class corresponding to $e_1$, with recognition distance $D_1$. Otherwise, the MQDF discriminant distance $Q(Y, \overline{Y^\omega})$, ω = 1, 2, …, L, from Y to the character class corresponding to each internal code in CanSet is computed:
$$Q(Y, \overline{Y^\omega}) = \frac{1}{h^2} \left\{ \sum_{l=1}^{d} \left( y_l - \overline{y_l^\omega} \right)^2 - \sum_{l=1}^{K} \left( 1 - \frac{h^2}{\lambda_{\omega l}} \right) \left[ \left( Y - \overline{Y^\omega} \right)^T \varphi_{\omega l} \right]^2 \right\} + \ln \left( h^{2(d-K)} \prod_{l=1}^{K} \lambda_{\omega l} \right).$$
If $Q(Y, \overline{Y^\tau}) = \min_{1 \le \omega \le L} Q(Y, \overline{Y^\omega})$, the input character belongs to the character class corresponding to $e_\tau$, i.e. $\tau = \arg\min_{1 \le \omega \le L} Q(Y, \overline{Y^\omega})$.
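The coarse-to-fine decision of step (6.2) can be sketched as follows. Because the per-dimension deviation term t(·) of the EDD is given in the original only as a figure, the coarse stage below uses a plain Euclidean distance as an explicitly labelled stand-in; the MQDF stage follows the Q(Y, Ȳ^ω) formula above. Function names and the parameter container are assumptions.

```python
import numpy as np

def mqdf_distance(Y, mean, eigvals, eigvecs, h2, K):
    """Modified quadratic discriminant function distance for one class;
    eigvals/eigvecs are the K leading eigenpairs of the class covariance."""
    diff = Y - mean
    d = Y.shape[0]
    proj = eigvecs[:, :K].T @ diff                        # (Y - mean)^T phi_l
    q = (diff @ diff - np.sum((1 - h2 / eigvals[:K]) * proj ** 2)) / h2
    return q + (d - K) * np.log(h2) + np.sum(np.log(eigvals[:K]))

def classify(Y, means, variances, mqdf_params, L=10, conf_th=0.9):
    """Two-stage decision: coarse candidate set by distance to the class
    means (stand-in for the band-deviation EDD), confidence test on the
    top two candidates, then MQDF re-ranking of the L candidates."""
    dists = np.linalg.norm(means - Y, axis=1) ** 2        # coarse distances
    order = np.argsort(dists)[:L]                         # candidate set CanSet
    D1, D2 = dists[order[0]], dists[order[1]]
    if (D2 - D1) / D1 > conf_th:                          # confidence Conf(CanSet)
        return int(order[0])                              # accept the coarse result
    # otherwise re-rank the L candidates with the finer MQDF distance
    q = [mqdf_distance(Y, means[w], *mqdf_params[w]) for w in order]
    return int(order[int(np.argmin(q))])
```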
CN 200410034107 2004-04-23 2004-04-23 Method for identifying multi-font multi-character size print form Tibetan character Expired - Fee Related CN1251130C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200410034107 CN1251130C (en) 2004-04-23 2004-04-23 Method for identifying multi-font multi-character size print form Tibetan character

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200410034107 CN1251130C (en) 2004-04-23 2004-04-23 Method for identifying multi-font multi-character size print form Tibetan character

Publications (2)

Publication Number Publication Date
CN1570958A true CN1570958A (en) 2005-01-26
CN1251130C CN1251130C (en) 2006-04-12

Family

ID=34481469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200410034107 Expired - Fee Related CN1251130C (en) 2004-04-23 2004-04-23 Method for identifying multi-font multi-character size print form Tibetan character

Country Status (1)

Country Link
CN (1) CN1251130C (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101366017B (en) * 2005-12-12 2010-06-16 微软公司 Logical structure and layout based character recognition method and system
CN100440250C (en) * 2007-03-09 2008-12-03 清华大学 Recognition method of printed mongolian character
WO2009114967A1 (en) * 2008-03-19 2009-09-24 东莞市步步高教育电子产品有限公司 Motion scan-based image processing method and device
CN101510259B (en) * 2009-03-18 2011-04-06 西北民族大学 On-line identification method for 'ding' of handwriting Tibet character
CN102184383A (en) * 2011-04-18 2011-09-14 哈尔滨工业大学 Automatic generation method of image sample of printed character
CN102184383B (en) * 2011-04-18 2013-04-10 哈尔滨工业大学 Automatic generation method of image sample of printed character
CN103999097B (en) * 2011-07-11 2017-04-12 华为技术有限公司 System and method for compact descriptor for visual search
CN103999097A (en) * 2011-07-11 2014-08-20 华为技术有限公司 System and method for compact descriptor for visual search
CN102360436A (en) * 2011-10-24 2012-02-22 中国科学院软件研究所 Identification method for on-line handwritten Tibetan characters based on components
CN102360436B (en) * 2011-10-24 2012-11-07 中国科学院软件研究所 Identification method for on-line handwritten Tibetan characters based on components
CN104809442B (en) * 2015-05-04 2017-11-17 北京信息科技大学 A kind of Dongba pictograph grapheme intelligent identification Method
CN104809442A (en) * 2015-05-04 2015-07-29 北京信息科技大学 Intelligent recognition method for graphemes of Dongba pictographs
CN107025452A (en) * 2016-01-29 2017-08-08 富士通株式会社 Image-recognizing method and image recognition apparatus
CN106355200A (en) * 2016-08-29 2017-01-25 大连民族大学 Manchu handwritten recognition device
CN106408002A (en) * 2016-08-29 2017-02-15 大连民族大学 Hand-written manchu alphabet identification system
CN106127266A (en) * 2016-08-29 2016-11-16 大连民族大学 Hand-written Manchu alphabet recognition methods
CN108932454A (en) * 2017-05-23 2018-12-04 杭州海康威视系统技术有限公司 A kind of character recognition method based on picture, device and electronic equipment
CN107730511A (en) * 2017-09-20 2018-02-23 北京工业大学 A kind of Tibetan language historical document line of text cutting method based on baseline estimations
CN107730511B (en) * 2017-09-20 2020-10-27 北京工业大学 Tibetan historical literature text line segmentation method based on baseline estimation
CN111553336A (en) * 2020-04-27 2020-08-18 西安电子科技大学 Print Uyghur document image recognition system and method based on link segment
CN111553336B (en) * 2020-04-27 2023-03-24 西安电子科技大学 Print Uyghur document image recognition system and method based on link segment
CN111583217A (en) * 2020-04-30 2020-08-25 深圳开立生物医疗科技股份有限公司 Tumor ablation curative effect prediction method, device, equipment and computer medium

Also Published As

Publication number Publication date
CN1251130C (en) 2006-04-12

Similar Documents

Publication Publication Date Title
CN1251130C (en) Method for identifying multi-font multi-character size print form Tibetan character
CN1794266A (en) Biocharacteristics fusioned identity distinguishing and identification method
CN100336070C (en) Method of robust human face detection in complicated background image
CN1664846A (en) On-line hand-written Chinese characters recognition method based on statistic structural features
CN1156791C (en) Pattern recognizing apparatus and method
CN1275201C (en) Parameter estimation apparatus and data collating apparatus
CN1184796C (en) Image processing method and equipment, image processing system and storage medium
CN1151465C (en) Model identification equipment using condidate table making classifying and method thereof
CN1171162C (en) Apparatus and method for retrieving charater string based on classification of character
CN1310825A (en) Methods and apparatus for classifying text and for building a text classifier
CN1200387C (en) Statistic handwriting identification and verification method based on separate character
CN1599913A (en) Iris identification system and method, and storage media having program thereof
CN1041773C (en) Character recognition method and apparatus based on 0-1 pattern representation of histogram of character image
CN1311394C (en) Appts. and method for binary image
CN1924897A (en) Image processing apparatus and method and program
CN1741035A (en) Blocks letter Arabic character set text dividing method
CN1122022A (en) Scribble matching
CN1574269A (en) Method and device for analyzing fail bit maps of wafers
CN1186287A (en) Method and apparatus for character recognition
CN1251128C (en) Pattern ranked matching device and method
CN1403959A (en) Content filter based on text content characteristic similarity and theme correlation degree comparison
CN1904906A (en) Device and method of address identification
CN1973757A (en) Computerized disease sign analysis system based on tongue picture characteristics
CN1266643C (en) Printed font character identification method based on Arabic character set
CN1247615A (en) Method and appts. for recognizing patterns

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20060412

Termination date: 20140423