CN100347723C - Off-line handwritten Chinese character segmentation method combining geometric cost and semantic discrimination cost - Google Patents

Off-line handwritten Chinese character segmentation method combining geometric cost and semantic discrimination cost Download PDF

Info

Publication number
CN100347723C
CN100347723C CNB2005100121952A CN200510012195A
Authority
CN
China
Prior art keywords
character
image
distance
run
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2005100121952A
Other languages
Chinese (zh)
Other versions
CN1719454A (en)
Inventor
丁晓青
蒋焰
付强
刘长松
彭良瑞
方驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CNB2005100121952A priority Critical patent/CN100347723C/en
Publication of CN1719454A publication Critical patent/CN1719454A/en
Application granted granted Critical
Publication of CN100347723C publication Critical patent/CN100347723C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Character Input (AREA)

Abstract

The present invention relates to an off-line handwritten Chinese character segmentation method that fuses geometric costs with semantic-recognition costs, and belongs to the field of character recognition. First, a text-line image of input off-line handwritten Chinese characters is analyzed to extract stroke segments, which are merged into sub-character blocks; the geometric cost of each sub-character merge is computed at the same time. A set of candidate segmentations is then generated from these geometric costs, each candidate is evaluated with a bigram language model to obtain its semantic-recognition cost, and finally the geometric costs and the semantic-recognition costs are combined to select the optimal segmentation-recognition scheme. Applied to the segmentation and recognition of handwritten envelope addresses, the method reaches a segmentation accuracy of 93%; it substantially improves on traditional segmentation methods and offers guidance for segmentation problems in other scripts and domains.

Description

Off-line handwritten Chinese character segmentation method combining geometric cost and semantic-recognition cost
Technical field
The off-line handwritten Chinese character segmentation method combining geometric cost and semantic-recognition cost belongs to the field of character recognition.
Background art
Optical character recognition (OCR) has long been an active topic in pattern recognition, and Chinese character OCR now has a development history of more than two decades. Off-line Chinese character recognition means recognizing Chinese character images obtained by a scanner, digital camera or video camera (Fig. 3), and the recognition of off-line handwritten Chinese characters has always been a difficult point: natural handwriting varies considerably from writer to writer, and, unlike on-line handwritten input, it provides no additional information.
A key problem in off-line Chinese character recognition is character segmentation, because existing classifiers generally make recognition decisions only on isolated character images, so the system must first determine which parts of the image belong to the same character. Separating adjacent Chinese characters is no problem for the human eye, but it is not easy for a machine. Traditional segmentation techniques basically rely on geometric features for their decisions, and some methods additionally use the confidence information supplied by the recognizer.
Judging from how humans actually read, however, this information alone is not enough. In separating the images of adjacent characters, the brain also weighs how tightly the adjacent characters combine, and tends to accept the segmentation that is more acceptable syntactically and semantically.
Chinese characters have rich structure, and a considerable portion of them have a left-right composition. With the usual left-to-right writing style, an unavoidable problem of off-line Chinese character segmentation is that a character of left-right structure gets split apart; for example, the character 村 ("village") may be cut into 木 and 寸. For such a cut the recognizer often reports very high confidence as well, so methods that rely purely on classifier confidence are unreliable. Whether they use geometric costs or recognizer confidence, existing methods all ignore what is in fact the most significant cue in Chinese character segmentation: semantic information.
Language models have been studied for a long time, and statistical language models have achieved clear results in practice, particularly in the post-processing of speech recognition and Chinese character recognition, mainly by making recognition decisions with hidden Markov models (HMMs) based on bigram or trigram grammars.
The present invention mainly uses an HMM based on a bigram language model to obtain the semantic-recognition confidence of a character segmentation, and then combines the geometric cost of the segmentation with the confidence information of character recognition to obtain a unified segmentation-recognition cost. This procedure in effect organically combines the three processes of character segmentation, recognition and post-processing, so that once the segmentation is decided, the recognition and post-processing of the characters are completed at the same time. Experiments show that the invention is fast and effective. The invention also proposes a model that unifies the segmentation, recognition and post-processing of off-line handwritten Chinese characters; this approach to Chinese character segmentation has not appeared in other literature and has a degree of novelty.
Summary of the invention
The objective of the invention is to achieve high-accuracy segmentation of off-line handwritten Chinese characters. The method operates on an input line image of off-line handwritten Chinese characters (Fig. 4 shows the general process of acquiring a handwritten line image). It first extracts the stroke segments of the characters by image-analysis methods, then merges the stroke segments into sub-characters while extracting a series of geometric features and parameter estimates. Using these geometric costs, a K-shortest-path optimization finds the K segmentations of best geometric cost; each candidate segmentation is then evaluated with a bigram language model to obtain the corresponding semantic-recognition confidence, and finally the two costs are combined to pick out the best-scoring character segmentation.
The invention comprises the following parts: theoretical derivation, parameter estimation, feature extraction, candidate-set generation and classifier decision.
1 Theory
(1) HMM semantic-recognition confidence based on a statistical language model
$x_1 x_2 \ldots x_n$ --- the character images obtained by segmenting the line image;
$N_{Cand}$ --- the number of recognition candidates that the recognition kernel returns for a single character image; this is a performance parameter of the recognition kernel and hence a constant, independent of the input character image;
$c_{i,j}$ --- the $j$-th candidate recognition result returned by the recognition kernel for the $i$-th character image $x_i$ (the kernel sorts the candidates in ascending order of recognition distance);
The usual post-processing of Chinese character recognition maximizes the posterior probability of the character string given the image sequence, i.e. it takes as the recognition result the string that maximizes the posterior probability:
$$(k_1^*, k_2^*, \ldots, k_n^*) = \arg\max_{k_1, k_2, \ldots, k_n} P(c_{1,k_1}, c_{2,k_2}, \ldots, c_{n,k_n} \mid x_1, x_2, \ldots, x_n)$$
Using Bayes' formula, we have
$$P(c_{1,k_1}, c_{2,k_2}, \ldots, c_{n,k_n} \mid x_1, x_2, \ldots, x_n) = \frac{P(x_1, x_2, \ldots, x_n \mid c_{1,k_1}, c_{2,k_2}, \ldots, c_{n,k_n})\, P(c_{1,k_1}, c_{2,k_2}, \ldots, c_{n,k_n})}{P(x_1, x_2, \ldots, x_n)} \qquad (1)$$
Post-processing for character recognition generally evaluates this formula under the following assumptions:
1. Given the character string, the occurrences of the character images are mutually independent: each image depends only on the character it represents and not on the other characters, so
$$P(x_1, x_2, \ldots, x_n \mid c_{1,k_1}, c_{2,k_2}, \ldots, c_{n,k_n}) = \prod_{i=1}^{n} P(x_i \mid c_{i,k_i}) = \prod_{i=1}^{n} \frac{P(c_{i,k_i} \mid x_i)\, P(x_i)}{P(c_{i,k_i})} \qquad (2)$$
where the second equality follows from Bayes' formula.
2. Without reference to a corpus, each Chinese character is basically equally likely to occur, so the prior probability of each character is taken to be approximately the same, i.e. every character is assumed to have an equal probability of appearing;
3. $P(c_{1,k_1}, c_{2,k_2}, \ldots, c_{n,k_n})$ is assumed to satisfy the bigram model, i.e. each Chinese character is influenced only by the character immediately before it, so
$$P(c_{1,k_1}, c_{2,k_2}, \ldots, c_{n,k_n}) = P(c_{1,k_1}) \prod_{i=2}^{n} P(c_{i,k_i} \mid c_{i-1,k_{i-1}}) \qquad (3)$$
From the viewpoint of statistical linguistics a trigram model actually fits reality better; the corresponding formula is
$$P(c_{1,k_1}, c_{2,k_2}, \ldots, c_{n,k_n}) = P(c_{1,k_1})\, P(c_{2,k_2} \mid c_{1,k_1}) \prod_{i=3}^{n} P(c_{i,k_i} \mid c_{i-2,k_{i-2}}, c_{i-1,k_{i-1}}) \qquad (4)$$
Considering time and space complexity, however, the bigram model is sufficient.
Combining (1), (2) and (3) gives
$$P(c_{1,k_1}, \ldots, c_{n,k_n} \mid x_1, \ldots, x_n) = \frac{\prod_{i=1}^{n} P(c_{i,k_i} \mid x_i) \left( P(c_{1,k_1}) \prod_{i=2}^{n} P(c_{i,k_i} \mid c_{i-1,k_{i-1}}) \right) \prod_{i=1}^{n} P(x_i)}{P(x_1, x_2, \ldots, x_n) \prod_{i=1}^{n} P(c_{i,k_i})} \qquad (5)$$
By assumption 2, $\prod_{i=1}^{n} P(c_{i,k_i})$ is a constant, and $P(x_1, x_2, \ldots, x_n)$ is also a constant once the character image sequence is given, so the formula above shows that
$$P(c_{1,k_1}, \ldots, c_{n,k_n} \mid x_1, \ldots, x_n) \propto \prod_{i=1}^{n} P(c_{i,k_i} \mid x_i) \left( P(c_{1,k_1}) \prod_{i=2}^{n} P(c_{i,k_i} \mid c_{i-1,k_{i-1}}) \right)$$
(the symbol "$\propto$" means "is proportional to").
OCR post-processing then picks from the recognition candidate set $c_{i,j}$, $1 \le i \le n$, $1 \le j \le N_{Cand}$, the optimal state sequence, where "optimal" means maximizing $\prod_{i=1}^{n} P(c_{i,k_i} \mid x_i) \left( P(c_{1,k_1}) \prod_{i=2}^{n} P(c_{i,k_i} \mid c_{i-1,k_{i-1}}) \right)$; to a certain extent this quantity reflects the semantic-recognition confidence of the recognition.
If
$$k_1^* k_2^* \ldots k_n^* = \arg\max_{1 \le k_1, k_2, \ldots, k_n \le N_{Cand}} \prod_{i=1}^{n} P(c_{i,k_i} \mid x_i) \left( P(c_{1,k_1}) \prod_{i=2}^{n} P(c_{i,k_i} \mid c_{i-1,k_{i-1}}) \right),$$
then the optimal candidate sequence $c_{1,k_1^*} c_{2,k_2^*} \ldots c_{n,k_n^*}$ can be chosen.
This is exactly the HMM problem of estimating the optimal state sequence for a given observation sequence. The Viterbi method is the classic algorithm for this problem; its time complexity is $O(n N_{Cand}^2)$, where $N_{Cand}$ is the number of recognition candidates and $n$ is the number of character images after merging.
The Viterbi algorithm is described as follows:
Let $Q[n][N_{Cand}]$ be a two-dimensional array in which $Q[t][j]$ stores the logarithm of the probability of the most likely candidate selection from the first character image up to node $c_{t,j}$, for $1 \le t \le n$, $1 \le j \le N_{Cand}$. A further two-dimensional pointer array $Path[n][N_{Cand}]$ records the computation.
Initialization ($t = 1$, $1 \le j \le N_{Cand}$):
$Path[1][j] = NULL$
$Q[1][j] = \log P(c_{1,j}) + \log P(c_{1,j} \mid x_1)$
Recursion ($2 \le t \le n$, $1 \le j \le N_{Cand}$):
$$Q[t][j] = \max_{1 \le l \le N_{Cand}} \{ Q[t-1][l] + \log P(c_{t,j} \mid c_{t-1,l}) \} + \log P(c_{t,j} \mid x_t)$$
$$l^* = \arg\max_{1 \le l \le N_{Cand}} \{ Q[t-1][l] + \log P(c_{t,j} \mid c_{t-1,l}) \}$$
$Path[t][j]$ points to node $c_{t-1,l^*}$, i.e. the parent node of $c_{t,j}$ is $c_{t-1,l^*}$.
Termination ($t = n$): $j^* = \arg\max_{1 \le j \le N_{Cand}} Q[n][j]$, and the optimal final node is $c_{n,j^*}$.
Backtracking along the path indicated by $Path[n][j^*]$ and outputting each node on the path gives the recognition result of the optimal state sequence, together with the log-probability $Q[n][j^*]$ of the most likely path.
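A minimal Python sketch of the Viterbi recursion just described, assuming the recognition confidences P(c_{t,j}|x_t), the character priors and the bigram transition probabilities are supplied as plain dictionaries and lists; all names and data layouts are illustrative, not taken from the patent.

```python
import math

def viterbi(conf, prior, bigram, cands):
    """conf[t][j]  : P(c_{t,j} | x_t) for image t, candidate j
       prior[c]    : P(c), prior probability of character c
       bigram[a][b]: P(b | a), bigram transition probability
       cands[t][j] : the j-th candidate character of image t"""
    n, ncand = len(cands), len(cands[0])
    Q = [[0.0] * ncand for _ in range(n)]
    path = [[None] * ncand for _ in range(n)]
    for j in range(ncand):                       # initialization, t = 0
        Q[0][j] = math.log(prior[cands[0][j]]) + math.log(conf[0][j])
    for t in range(1, n):                        # recursion
        for j in range(ncand):
            best = max(range(ncand),
                       key=lambda l: Q[t-1][l] + math.log(bigram[cands[t-1][l]][cands[t][j]]))
            Q[t][j] = (Q[t-1][best]
                       + math.log(bigram[cands[t-1][best]][cands[t][j]])
                       + math.log(conf[t][j]))
            path[t][j] = best
    j_star = max(range(ncand), key=lambda j: Q[n-1][j])   # termination
    seq = [j_star]
    for t in range(n - 1, 0, -1):                # backtracking
        seq.append(path[t][seq[-1]])
    seq.reverse()
    return [cands[t][j] for t, j in enumerate(seq)], Q[n-1][j_star]
```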
(2) Fusing the geometric cost with the HMM semantic-recognition cost
$s_1, s_2, \ldots, s_l$ --- the sequence of sub-character images;
$x_1, x_2, \ldots, x_n$ --- the sequence of character images after the sub-character images have been merged;
$N_{Cand}$ --- the number of recognition candidates returned by the recognition kernel for a single character image; as before, a constant;
$c_{i,j}$ --- the $j$-th candidate recognition result returned by the recognition kernel for the $i$-th character image $x_i$ (the kernel sorts the candidates in ascending order of recognition distance);
Unlike post-processing with a known segmentation, the quantity to be maximized here is the joint posterior probability, given the stroke-segment merging result, of the sub-character merging scheme and the recognition result, $P(x_1 x_2 \ldots x_n, c_{1,k_1} c_{2,k_2} \ldots c_{n,k_n} \mid s_1 s_2 \ldots s_l)$.
By the same kind of identity,
$$P(x_1 \ldots x_n, c_{1,k_1} \ldots c_{n,k_n} \mid s_1 \ldots s_l) = P(c_{1,k_1} \ldots c_{n,k_n} \mid x_1 \ldots x_n, s_1 \ldots s_l)\, P(x_1 \ldots x_n \mid s_1 \ldots s_l).$$
Once the merged images are given, the recognition process depends only on the merged images $x_1, x_2, \ldots, x_n$ and not on the stroke-segment merging result, so $s_1 s_2 \ldots s_l$ can be dropped from the condition of the first factor, which simplifies the expression to
$$P(x_1 \ldots x_n, c_{1,k_1} \ldots c_{n,k_n} \mid s_1 \ldots s_l) = P(c_{1,k_1} \ldots c_{n,k_n} \mid x_1 \ldots x_n)\, P(x_1 \ldots x_n \mid s_1 \ldots s_l) \qquad (6)$$
The posterior probability to be maximized thus splits into two parts: the first part is the semantic-recognition confidence discussed above, and the second part is the geometric cost of merging the sub-character images. Substituting (5) into (6) gives
$$P(x_1 \ldots x_n, c_{1,k_1}, \ldots, c_{n,k_n} \mid s_1 \ldots s_l) = \frac{\prod_{i=1}^{n} P(c_{i,k_i} \mid x_i) \left( P(c_{1,k_1}) \prod_{i=2}^{n} P(c_{i,k_i} \mid c_{i-1,k_{i-1}}) \right) \prod_{i=1}^{n} P(x_i)}{P(x_1, x_2, \ldots, x_n) \prod_{i=1}^{n} P(c_{i,k_i})}\, P(x_1 \ldots x_n \mid s_1 \ldots s_l) \qquad (7)$$
Segmentation, recognition and post-processing are thereby merged under one unified model. In practice, however, different segmentation paths yield different numbers of character images, and the more characters a segmentation contains, the smaller its posterior probability tends to be, so the posterior probabilities of segmentation paths with different character counts cannot be compared directly. To eliminate the influence of the different character counts, the geometric mean of $P(x_1 \ldots x_n, c_{1,k_1}, \ldots, c_{n,k_n} \mid s_1 \ldots s_l)$, i.e.
$$\left( P(x_1 \ldots x_n, c_{1,k_1}, \ldots, c_{n,k_n} \mid s_1 \ldots s_l) \right)^{1/n},$$
is taken as the objective function.
First take the logarithm of (7):
$$\log P(x_1 \ldots x_n, c_{1,k_1}, \ldots, c_{n,k_n} \mid s_1 \ldots s_l) = \sum_{i=1}^{n} \log P(x_i) - \log P(x_1, x_2, \ldots, x_n) - \sum_{i=1}^{n} \log P(c_{i,k_i}) + \log P(x_1 \ldots x_n \mid s_1 \ldots s_l) + \sum_{i=1}^{n} \log P(c_{i,k_i} \mid x_i) + \log P(c_{1,k_1}) + \sum_{i=2}^{n} \log P(c_{i,k_i} \mid c_{i-1,k_{i-1}}) \qquad (8)$$
The geometric mean corresponds to the arithmetic mean of (8), i.e. (8) divided by $n$ serves as the cost function:
$$\frac{1}{n} \log P(x_1 \ldots x_n, c_{1,k_1}, \ldots, c_{n,k_n} \mid s_1 \ldots s_l) = \frac{1}{n} \left( \sum_{i=1}^{n} \log P(x_i) - \log P(x_1, x_2, \ldots, x_n) - \sum_{i=1}^{n} \log P(c_{i,k_i}) \right) + \frac{1}{n} \log P(x_1 \ldots x_n \mid s_1 \ldots s_l) + \frac{1}{n} \left( \sum_{i=1}^{n} \log P(c_{i,k_i} \mid x_i) + \log P(c_{1,k_1}) + \sum_{i=2}^{n} \log P(c_{i,k_i} \mid c_{i-1,k_{i-1}}) \right) \qquad (9)$$
Computing this expression directly is still not easy, and further approximations are needed to simplify the model.
For $P(c_{i,k_i})$: following the assumption made in the post-processing of Chinese character recognition, with the corpus unknown each character is taken to occur in the text with the same prior probability, i.e. $P(c_{i,k_i})$ is a constant, so $\frac{1}{n} \sum_{i=1}^{n} \log P(c_{i,k_i})$ is also a constant and has no influence on the maximization.
For $P(x_1, x_2, \ldots, x_n)$, the probability that the line of character images occurs: the appearance of the character images can be approximated as an independent process, the images in front having essentially no influence on the appearance of the images behind, so $P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i)$, and therefore
$$\frac{1}{n} \sum_{i=1}^{n} \log P(x_i) - \frac{1}{n} \log P(x_1, x_2, \ldots, x_n) = \frac{1}{n} \log \frac{\prod_{i=1}^{n} P(x_i)}{P(x_1, x_2, \ldots, x_n)} = 0.$$
Combining the analysis above, the quantity to be maximized in (9) reduces to
$$T = \frac{1}{n} \left( \sum_{i=1}^{n} \log P(c_{i,k_i} \mid x_i) + \log P(c_{1,k_1}) + \sum_{i=2}^{n} \log P(c_{i,k_i} \mid c_{i-1,k_{i-1}}) \right) + \frac{1}{n} \log P(x_1 x_2 \ldots x_n \mid s_1 s_2 \ldots s_l) \qquad (10)$$
Let
$$\bar{H} = \frac{1}{n} \left( \sum_{i=1}^{n} \log P(c_{i,k_i} \mid x_i) + \log P(c_{1,k_1}) + \sum_{i=2}^{n} \log P(c_{i,k_i} \mid c_{i-1,k_{i-1}}) \right), \qquad \bar{G} = \frac{1}{n} \log P(x_1 x_2 \ldots x_n \mid s_1 s_2 \ldots s_l).$$
Then $T = \bar{H} + \bar{G}$, and the value of $T$ is the criterion for judging the optimal segmentation: $\bar{H}$ can be regarded as the average semantic-recognition cost of the merging scheme, and $\bar{G}$ as the average geometric cost corresponding to that merging scheme.
Because the geometric cost supplied by the system is a measure of distance, a monotonically decreasing function is chosen to approximate $P(x_1 x_2 \ldots x_n \mid s_1 s_2 \ldots s_l)$:
$$P(x_1 x_2 \ldots x_n \mid s_1 s_2 \ldots s_l) = \lambda e^{-\lambda \left( \frac{g}{\bar{g}} - 1 \right)} \qquad (11)$$
where $\lambda$ is a positive constant, $g$ is the geometric cost of merging the sub-character blocks $s_1 s_2 \ldots s_l$ into the character images $x_1 x_2 \ldots x_n$, and $\bar{g}$ is the minimum geometric cost over all possible merging schemes of $s_1 s_2 \ldots s_l$. Then $\bar{G} = \frac{1}{n} \log \left( \lambda e^{-\lambda (g / \bar{g} - 1)} \right)$, which is called the normalized average geometric cost.
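A small Python sketch of the combined criterion $T = \bar{H} + \bar{G}$ under the exponential mapping (11); the function names and the way each path's geometric cost g, semantic cost and character count are supplied are illustrative assumptions, not definitions from the patent.

```python
import math

def average_geometric_cost(g, g_min, n, lam=1.0):
    """Normalized average geometric cost G_bar of one candidate path (eq. 11);
    g is the path's geometric cost, g_min the minimum over all candidate paths."""
    return math.log(lam * math.exp(-lam * (g / g_min - 1.0))) / n

def combined_cost(h_bar, g, g_min, n, lam=1.0):
    """T = H_bar + G_bar; H_bar is the average semantic-recognition cost,
    e.g. the best Viterbi log-probability of the path divided by n."""
    return h_bar + average_geometric_cost(g, g_min, n, lam)

# candidate paths: (H_bar, geometric cost g, number of merged characters n)
paths = [(-4.2, 130.0, 7), (-3.8, 150.0, 6), (-5.1, 120.0, 8)]
g_min = min(g for _, g, _ in paths)
best = max(paths, key=lambda p: combined_cost(p[0], p[1], g_min, p[2]))
print(best)
```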
2 Parameter estimation
The algorithm requires a series of parameters to be estimated.
First, the prior probabilities of the characters and the inter-character transition probabilities must be estimated for the bigram model. For this purpose a corpus must be collected whose content is consistent with the line images to be segmented, so that the semantic constraints are captured for the target domain: the prior probability of each character and the transition probabilities between characters computed from different domains are different, and if a suitable corpus cannot be selected, the performance of the system suffers considerably.
For example, if the line images to be segmented are handwritten envelope addresses, then an address database must be collected as the corpus.
The following notation is used:
$N_c$ --- the number of times the Chinese character $c$ occurs in the corpus;
$N_{c_1 c_2}$ --- the number of times the two-character string $c_1 c_2$ occurs in the corpus;
$N$ --- the total number of Chinese characters in the corpus;
$P(c)$ --- the probability that the character $c$ occurs in the corpus;
$P(c_2 \mid c_1)$ --- the probability that character $c_2$ immediately follows character $c_1$ in the corpus;
$P_{smooth}(\cdot)$ --- the probability after smoothing;
$M$ --- the number of distinct Chinese characters in the corpus; the national first-level Chinese character standard contains 3755 commonly used characters, so $M = 3755$ is simply taken.
(1) Estimating the character prior, i.e. $P(c)$
Count the total number of occurrences $N_c$ of each character in the corpus and the total character count $N$; then
$$\hat{P}(c) = \frac{N_c}{N}$$
is the estimate of $P(c)$.
(2) Estimating the character transition probabilities
For the bigram model, since $P(c_2 \mid c_1) = \frac{P(c_1 c_2)}{P(c_1)}$, first count the number of occurrences $N_{c_1 c_2}$ of the pair $c_1 c_2$ in the corpus and then use $\frac{N_{c_1 c_2}}{N_{c_1}}$ as the estimate of $P(c_2 \mid c_1)$.
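A brief Python sketch of these counting estimates, assuming the corpus is given as a list of text lines; the function and variable names are illustrative.

```python
from collections import Counter

def estimate_bigram_model(lines):
    """Maximum-likelihood estimates of P(c) and P(c2|c1) from a corpus."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    total = sum(unigrams.values())
    prior = {c: n / total for c, n in unigrams.items()}          # P(c) = N_c / N
    trans = {(c1, c2): n / unigrams[c1]                          # P(c2|c1) = N_c1c2 / N_c1
             for (c1, c2), n in bigrams.items()}
    return prior, trans
```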
(3) Data smoothing
Another important issue here is data smoothing, because not every Chinese character appears in the corpus, so the prior and transition probabilities of some characters cannot be estimated directly. Various methods for smoothing have been proposed in statistical language modeling. The basic idea of smoothing for sparse data is that when the training data are insufficient, the probabilities are adjusted in some way to obtain more accurate estimates. Since the estimates in (1) and (2) underestimate unseen events as zero, and all probabilities must sum to one, the events that were observed are correspondingly overestimated. Smoothing therefore raises small (or zero) probabilities and lowers large ones, making the distribution more even overall; it not only solves the zero-probability problem but also improves the overall performance of the language model.
Smoothing techniques combine the higher-order n-gram model with a lower-order model, in two ways: back-off and interpolation.
Taking the bigram as an example, the back-off form is
$$P_{smooth}(c_2 \mid c_1) = \begin{cases} P(c_2 \mid c_1) & \text{if } N_{c_1 c_2} > 0 \\ \gamma(c_1)\, P(c_2) & \text{if } N_{c_1 c_2} = 0 \end{cases}$$
That is, if the pair $c_1 c_2$ occurs in the corpus, $P(c_2 \mid c_1)$ is used to approximate the transition probability; otherwise the model backs off to the lower-order model $P(c_2)$, where $\gamma(c_1)$ is a normalizing factor.
Interpolation instead takes a linear combination of the higher-order and lower-order models:
$$P_{smooth}(c_2 \mid c_1) = P(c_2 \mid c_1) + \gamma(c_1)\, P(c_2)$$
The difference between interpolation and back-off is that, when handling probabilities greater than zero, the former uses the information of the lower-order model while the latter does not; what they have in common is that both use the lower-order model when handling zero probabilities. The use of lower-order model information is thus the key point of smoothing techniques.
Commonly used smoothing methods include Jelinek-Mercer smoothing, Katz smoothing, absolute discounting, Witten-Bell smoothing and Kneser-Ney smoothing. The performance of the various methods changes with the size of the corpus, the order of the n-gram model and the content of the corpus, with the size of the training corpus having the largest effect on smoothing performance.
(4) Estimating the character confidence
The confidence reflects how trustworthy the recognition result actually is; it is computed after character recognition has finished.
$x$ --- the character image to be recognized;
$c_j(x)$ --- the $j$-th candidate character returned by the recognition kernel for image $x$ (the kernel sorts the candidates in ascending order of recognition distance, so $c_1(x)$ is the first-choice candidate of image $x$);
$d_j(x)$ --- the recognition distance corresponding to the $j$-th candidate $c_j(x)$ of image $x$, so $d_1(x)$ is the recognition distance of the first-choice candidate of image $x$;
$N_{Cand}$ --- the number of recognition candidates returned by the recognition kernel for a single character image; a constant that depends only on the recognition kernel itself;
$P(c_j(x) \mid x)$ --- the confidence that image $x$ is recognized as $c_j(x)$; this is the quantity to be estimated.
In general the classifier returns, for an image $x$, the recognition candidates $c_1(x), c_2(x), \ldots, c_{N_{Cand}}(x)$ together with the corresponding recognition distances $d_1(x), d_2(x), \ldots, d_{N_{Cand}}(x)$. How to compute $P(c_j(x) \mid x)$, $1 \le j \le N_{Cand}$, is a somewhat involved problem; the following approaches can be considered, and which one to adopt is a compromise among computation time, space complexity and accuracy.
1. Empirical distance formulas
$$P(c_j(x) \mid x) = \frac{1 / d_j(x)}{\sum_{k=1}^{N_{Cand}} 1 / d_k(x)}, \quad 1 \le j \le N_{Cand}, \quad \text{or}$$
$$P(c_j(x) \mid x) = \frac{\dfrac{1}{d_j(x) - d_1(x) + 1}}{\sum_{k=1}^{N_{Cand}} \dfrac{1}{d_k(x) - d_1(x) + 1}}, \quad 1 \le j \le N_{Cand}, \quad \text{or}$$
$$P(c_j(x) \mid x) = \frac{1 / d_j^2(x)}{\sum_{k=1}^{N_{Cand}} 1 / d_k^2(x)}, \quad 1 \le j \le N_{Cand}$$
2. Confidence estimation based on a Gaussian model
$$P(c_j(x) \mid x) = \frac{\exp\left( -\frac{d_j(x)}{\theta} \right)}{\sum_{k=1}^{N_{Cand}} \exp\left( -\frac{d_k(x)}{\theta} \right)} \qquad (\theta \text{ must be estimated})$$
The estimated confidence can be corrected with a confusion matrix (which depends on the recognition kernel itself and must also be estimated); the corrected confidence is obtained from
$$P(c_j(x) \mid x) = \sum_{k=1}^{N_{Cand}} P(c_k(x) \mid x)\, P(c_j(x) \mid c_k(x)).$$
If the confidence is estimated in this way, the variance parameter $\theta$ and the classifier's confusion matrix must be estimated; this is described in the implementation section. In general, the choice of estimation method must weigh the time cost against the storage cost.
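A compact Python sketch of the Gaussian (softmax-style) confidence estimate and the confusion-matrix correction above; theta and the confusion matrix are assumed to have been estimated beforehand, and all names and data layouts are illustrative.

```python
import math

def gaussian_confidence(distances, theta):
    """P(c_j(x)|x) = exp(-d_j/theta) / sum_k exp(-d_k/theta)."""
    weights = [math.exp(-d / theta) for d in distances]
    z = sum(weights)
    return [w / z for w in weights]

def corrected_confidence(conf, cands, confusion):
    """Correct each candidate's confidence with the confusion matrix:
    P'(c_j|x) = sum_k P(c_k|x) * P(recognized c_j | true class c_k)."""
    return [sum(conf[k] * confusion[cands[k]][cands[j]] for k in range(len(cands)))
            for j in range(len(cands))]
```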
3 Feature extraction
Feature extraction first obtains the stroke parameters by analyzing the line image, then extracts the basic stroke segments of the Chinese characters, merges the basic stroke segments into sub-characters, and at the same time produces the geometric-cost scores.
(1) Line-image parameter extraction stage
This stage extracts three parameters:
$w_s$ --- the stroke width;
$\bar{w}_c$ --- the average character width;
$\bar{h}_c$ --- the average character height.
1. Extracting the stroke width $w_s$
The stroke width is the width of the written strokes. First, a histogram analysis is made of the horizontal black runs of the text line (a horizontal black run is a rectangular region of black pixels occupying consecutive positions in the X direction, one pixel high, whose width is the length of the run). The horizontal axis of the histogram is the horizontal black-run length and the vertical axis is the number of horizontal black runs of that length. Let $p$ be the run length with the largest count in the histogram and $hist(p)$ that count (i.e. the maximum ordinate of the histogram is $hist(p)$ and its abscissa is $p$). Then take
$$w_s = \frac{(p-1)\, hist(p-1) + p\, hist(p) + (p+1)\, hist(p+1)}{hist(p-1) + hist(p) + hist(p+1)}.$$
(Fig. 6 shows the histogram analysis of the image in Fig. 5(a), with $p = 4$ and $hist(p) = 690$.)
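A short Python sketch of the stroke-width estimate from the horizontal black-run histogram, assuming the line image is a binary NumPy array with black pixels equal to 1; this is an illustrative reading of the formula above, not the patent's code.

```python
import numpy as np

def horizontal_black_runs(img):
    """Yield the lengths of all horizontal runs of black (1) pixels."""
    for row in img:
        run = 0
        for px in row:
            if px:
                run += 1
            elif run:
                yield run
                run = 0
        if run:
            yield run

def stroke_width(img):
    hist = np.bincount(list(horizontal_black_runs(img)))
    p = int(hist.argmax())                    # most frequent run length
    lo, hi = max(p - 1, 0), min(p + 1, len(hist) - 1)
    lengths = np.arange(lo, hi + 1)
    return float((lengths * hist[lo:hi + 1]).sum() / hist[lo:hi + 1].sum())
```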
2. Estimating the average character width $\bar{w}_c$
The average character width reflects the writing style of the text line and has a direct influence on character segmentation. First project the text-line image in the vertical direction to obtain a projection profile whose abscissa corresponds one-to-one with the abscissa of the text line and whose ordinate is the total number of black pixels in the vertical direction at the corresponding abscissa (Fig. 7 shows the projection profile of Fig. 5(a)). Perform a horizontal black-run analysis along the horizontal axis of the projection profile (ordinate 0) and take the mean length of all those horizontal black runs as the estimate of the average character width. When the character spacing in the text line is so small that the strokes of adjacent characters overlap, the black-run statistics can instead be taken along the horizontal line $y = 2 w_s + 1$ of the projection profile; the mean of those runs gives a better estimate of $\bar{w}_c$.
(Fig. 8 shows the vertical projection profile of Fig. 5(b); this is the case of severe touching.)
3. Extracting the average character height $\bar{h}_c$
Extracting the average character height is comparatively simple: divide the line image in the horizontal direction into several equal parts (generally five), and take the average of the heights of all the small images as the average character height (as in Fig. 9).
(2) Stroke-segment extraction stage
A stroke segment is one of the four basic stroke elements of Chinese characters (horizontal, vertical, left-falling and right-falling); extracting stroke segments effectively overcomes the touching of characters. The extraction method uses black-run tracking: first find a horizontal black run in the image as the start of some stroke segment, then track that horizontal black run downwards row by row until tracking ends, yielding one stroke segment.
Tracking proceeds as follows: in the row below the current horizontal black run, take the horizontal range that covers the horizontal position of the current run extended by one pixel on each side, and find all horizontal black runs within that range; then, according to the average width of the runs already in the segment and the segment direction fitted from those runs, select one of the horizontal black runs to join the run sequence of the existing segment and update the segment's information. This process is described in detail in the implementation.
Figures 10 and 11 show the results of stroke-segment extraction.
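A simplified Python sketch of the run-tracking idea: it collects horizontal black runs row by row and chains a run to a run in the next row that overlaps its one-pixel-widened range. The overlap test and the greedy choice of a single successor are simplifying assumptions; the patent's full method also uses the segment's average run width and fitted direction when choosing the successor.

```python
def runs_by_row(img):
    """runs[y] = list of (x_start, x_end) horizontal black runs in row y."""
    runs = []
    for row in img:
        row_runs, start = [], None
        for x, px in enumerate(row):
            if px and start is None:
                start = x
            elif not px and start is not None:
                row_runs.append((start, x - 1)); start = None
        if start is not None:
            row_runs.append((start, len(row) - 1))
        runs.append(row_runs)
    return runs

def track_segments(img):
    """Greedily chain runs downwards into stroke segments (lists of (y, run))."""
    runs, used, segments = runs_by_row(img), set(), []
    for y0, row in enumerate(runs):
        for r0 in row:
            if (y0, r0) in used:
                continue
            seg, y, cur = [(y0, r0)], y0, r0
            used.add((y0, r0))
            while y + 1 < len(runs):
                lo, hi = cur[0] - 1, cur[1] + 1       # widened search range
                nxt = [r for r in runs[y + 1]
                       if r[0] <= hi and r[1] >= lo and (y + 1, r) not in used]
                if not nxt:
                    break
                cur = nxt[0]; y += 1
                used.add((y, cur)); seg.append((y, cur))
            segments.append(seg)
    return segments
```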
(3) Stroke-segment merging stage
After stroke-segment extraction, the segments must be merged further. Let $R_i$ and $R_j$ be the bounding rectangles of two adjacent stroke segments (see Fig. 12).
$(x_{i,1}, y_{i,1})$ --- the coordinates of the top-left corner of $R_i$;
$(x_{i,2}, y_{i,2})$ --- the coordinates of the bottom-right corner of $R_i$;
$(x_{j,1}, y_{j,1})$ --- the coordinates of the top-left corner of $R_j$;
$(x_{j,2}, y_{j,2})$ --- the coordinates of the bottom-right corner of $R_j$;
$D_H(R_i, R_j)$ --- the horizontal distance between the right side of $R_i$ and the left side of $R_j$ (note that the sign of $D_H(R_i, R_j)$ indicates direction: in Fig. 12 the left edge of $R_i$ lies to the right of the left edge of $R_j$, so the value is negative; conversely, if the left edge of $R_i$ lies to the left of the left edge of $R_j$, the value is positive);
$D_V(R_i, R_j)$ --- the vertical distance between the bottom side of $R_i$ and the top side of $R_j$ (again the sign indicates direction: in Fig. 12 the bottom edge of $R_i$ lies below the top edge of $R_j$, so the value is negative; conversely, if the bottom edge of $R_i$ lies above the top edge of $R_j$, the value is positive);
$width(R_i)$ --- the width of $R_i$;
$width(R_j)$ --- the width of $R_j$.
Stroke segments are merged according to the following rules (see the sketch after this list):
1. If $R_i$ and $R_j$ satisfy, in the horizontal direction, that $R_i$ contains $R_j$ or $R_j$ contains $R_i$, merge segments $i$ and $j$;
2. If $R_i$ and $R_j$ satisfy $D_H(R_i, R_j) < 0$ (i.e. the left edge of $R_i$ lies to the right of the left edge of $R_j$) and $\frac{-D_H(R_i, R_j)}{width(R_i)} > T_1$ or $\frac{-D_H(R_i, R_j)}{width(R_j)} > T_1$, merge segments $i$ and $j$, where $T_1$ is a predefined threshold, generally 0.7;
3. If $R_i$ and $R_j$ satisfy $D_H(R_i, R_j) < 0$ and both $\frac{-D_H(R_i, R_j)}{width(R_i)} > T_2$ and $\frac{-D_H(R_i, R_j)}{width(R_j)} > T_2$, merge segments $i$ and $j$, where $T_2$ is a predefined threshold, generally 0.5.
The algorithm is described in detail in the implementation section; Fig. 13 shows the stroke-segment merging result for Fig. 11.
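An illustrative Python check of the three merge rules, assuming each bounding rectangle is given as (x1, y1, x2, y2) and that D_H is computed as the signed gap between the right edge of the left-hand rectangle and the left edge of the other; that sign convention is an assumption consistent with the description above.

```python
def should_merge(ri, rj, t1=0.7, t2=0.5):
    """ri, rj: bounding boxes (x1, y1, x2, y2) of two adjacent stroke segments."""
    wi, wj = ri[2] - ri[0], rj[2] - rj[0]
    # rule 1: horizontal containment
    if (ri[0] <= rj[0] and rj[2] <= ri[2]) or (rj[0] <= ri[0] and ri[2] <= rj[2]):
        return True
    d_h = rj[0] - ri[2]          # signed horizontal gap (negative = overlap)
    if d_h >= 0:
        return False
    overlap = -d_h
    # rule 2: large overlap relative to either rectangle
    if overlap / wi > t1 or overlap / wj > t1:
        return True
    # rule 3: moderate overlap relative to both rectangles
    return overlap / wi > t2 and overlap / wj > t2
```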
(4) Sub-character scoring
The result of stroke-segment merging, ordered from left to right, is denoted $s_1, s_2, \ldots, s_N$; these are the sub-character images after stroke-segment merging. To complete the segmentation, these sub-character images must in turn be merged appropriately.
The geometric cost of merging $(s_k, s_{k+1}, \ldots, s_{k+n_k-1})$ is assessed as follows.
$w_k$ --- the width of the character image obtained by merging the sub-character images $s_k, s_{k+1}, \ldots, s_{k+n_k-1}$ (Fig. 21);
$h_k$ --- the height of the character image obtained by merging the sub-character images $s_k, s_{k+1}, \ldots, s_{k+n_k-1}$ (Fig. 21);
1. Character width
If the width of $(s_k, s_{k+1}, \ldots, s_{k+n_k-1})$ is $w_k$ and the average character width of the text line is $\bar{w}_c$ (estimated in step 3-2), then
$$S(w_k) = \begin{cases} a \left( \frac{w_k}{\bar{w}_c} - 1 \right)^2 & \text{if } \frac{w_k}{\bar{w}_c} > 1 \\ b \left( \frac{w_k}{\bar{w}_c} - 1 \right)^2 & \text{if } \frac{w_k}{\bar{w}_c} \le 1 \end{cases}$$
where $a$ and $b$ are predefined parameters, taken as $a = 100$, $b = 400$.
2. Character aspect ratio
$$S(r) = \min \left\{ c \left( \frac{r}{\bar{r}} - 1 \right)^2, 100 \right\}, \quad \text{generally } c = 100,$$
where $r$ is the aspect ratio of the character $(s_k, s_{k+1}, \ldots, s_{k+n_k-1})$, $r = \frac{w_k}{h_k}$; $\bar{r}$ is the average aspect ratio of the characters, $\bar{r} = \frac{\bar{w}_c}{\bar{h}_c}$; $\bar{w}_c$ is the average character width of the text line and $\bar{h}_c$ the average character height, both taken from the line-parameter estimates above.
3. Distance between the sub-characters inside a character
The distances between the sub-characters inside a character reflect how tightly the sub-characters in the character combine. In ordinary handwriting the sub-characters belonging to the same Chinese character are relatively close together, whereas two sub-characters that do not belong to the same character are relatively far apart.
However, a distance measure based only on the sub-characters' bounding rectangles cannot fully reflect the affinity between sub-characters, so better distance measures must be adopted. Three different measures are used:
i. Horizontal bounding-rectangle distance: the horizontal distance between the sub-characters' bounding rectangles (the areas enclosed by the rectangles may overlap), as in Fig. 14(b);
ii. Euclidean distance: for a pixel 1 at $(x_1, y_1)$ and a pixel 2 at $(x_2, y_2)$, the Euclidean distance between the two pixels is defined as $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$. The Euclidean distance between sub-character A and sub-character B is defined as the minimum of the Euclidean distances over all pairs of black pixels, one in A and one in B;
iii. Average run distance: as shown in Fig. 14(d), the mean of the lengths of all the horizontal (white) runs between the two sub-characters is taken as the average run distance.
Based on the three distances above, the internal distance of $(s_k, s_{k+1}, \ldots, s_{k+n_k-1})$ is defined as $d_{in} = \sum_{i=k}^{k+n_k-2} d_{i,i+1}$, where the distance between sub-characters $s_i$ and $s_j$ is $d_{i,j} = \sum_n \gamma_n d_{ij}^n$, i.e. the weighted sum of the three distances above with weighting coefficients $\gamma_n$; in general taking all $\gamma_n = 1$ is sufficient.
The internal-distance score is defined as
$$S(d_{in}) = \begin{cases} 0 & \text{if } d_{in} < 0 \\ 100 & \text{if } d_{in} > \frac{w_k}{4} \text{ or } d_{in} > \frac{\bar{w}_c}{2} \\ \frac{400\, d_{in}}{w_k} & \text{otherwise} \end{cases}$$
(Fig. 14(a) shows the original sub-character images, Fig. 14(b) the horizontal bounding-rectangle distance between sub-characters, Fig. 14(c) the Euclidean distance between sub-characters, and Fig. 14(d) the run distance.)
4. Distance to the characters before and after
In a line of handwritten text, the distance between the sub-characters inside a character is generally smaller than the distance between sub-characters of different characters; therefore the distances between the character and the sub-characters of the characters before and after it are also assessed and scored.
Suppose the distances between the character $(s_k, s_{k+1}, \ldots, s_{k+n_k-1})$ and the sub-characters before and after it are $D_L$ and $D_R$ respectively:
$D_L$ --- the horizontal distance between the bounding rectangle of sub-character $s_k$ and that of sub-character $s_{k-1}$;
$D_R$ --- the horizontal distance between the bounding rectangle of sub-character $s_{k+n_k-1}$ and that of sub-character $s_{k+n_k}$;
If $k = 1$, then $D_L = \infty$; if $k + n_k - 1 = N$, then $D_R = \infty$. Finally take $D = \min(D_L, D_R)$.
The score for the character's neighbor distance is $S(D)$:
$$S(D) = \begin{cases} 100 & \text{if } D < -\bar{w}_c \\ \dfrac{25\, \bar{w}_c + 100\, \bar{D} - 75\, D}{\bar{D} + \bar{w}_c} & \text{if } -\bar{w}_c \le D \le \bar{D} \\ \dfrac{25\, (D_{max} - D)}{D_{max} - \bar{D}} & \text{if } \bar{D} \le D \le D_{max} \\ 0 & \text{if } D > D_{max} \end{cases}$$
where $\bar{w}_c$ is the average character width in the text line, and $\bar{D}$ and $D_{max}$ are respectively the average and the maximum horizontal distance between sub-character bounding rectangles in the text line.
5. Connectivity estimation of the sub-characters
Define the connectivity $C_{ij}$ between sub-character $s_i$ and sub-character $s_j$ (the defining formula appears only as an image in the source and is not reproduced here).
The connectivity of the character $(s_k, s_{k+1}, \ldots, s_{k+n_k-1})$ is represented by three quantities:
$C_I$ --- the internal connectivity, i.e. the degree of connection among the sub-characters inside the character;
$C_L$ --- the left connectivity, i.e. the connectivity between sub-character $s_k$ and sub-character $s_{k-1}$;
$C_R$ --- the right connectivity, i.e. the connectivity between sub-character $s_{k+n_k-1}$ and sub-character $s_{k+n_k}$;
$$C_I = \begin{cases} \dfrac{1}{n_k - 1} \sum_{i=k}^{k+n_k-2} C_{i,i+1} & n_k > 1 \\ 1 & n_k = 1 \end{cases} \qquad C_L = \begin{cases} C_{k,k-1} & k > 1 \\ 1 & k = 1 \end{cases} \qquad C_R = \begin{cases} C_{k+n_k-1,\, k+n_k} & k + n_k - 1 < N \\ 1 & k + n_k - 1 = N \end{cases}$$
The connectivity score is
$$S(C) = 100 \times \left[ 1 - \frac{1}{2} \left( 1 + C_I - \frac{C_R + C_L}{2} \right) \right].$$
The overall score is the weighted average of the five scores above:
$$S = \frac{\sum_i \alpha_i S_i}{\sum_i \alpha_i}.$$
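A condensed Python sketch of the overall geometric scoring of one candidate merge, wiring together the width score, the aspect-ratio score and the weighted average of all five partial scores described above; the equal default weights are an assumption in line with the text, and the remaining partial scores are assumed to be computed elsewhere and passed in.

```python
def geometric_cost(scores, weights=None):
    """Weighted average S = sum(alpha_i * S_i) / sum(alpha_i) of the five
    partial scores (width, aspect ratio, internal distance, neighbor
    distance, connectivity); lower scores mean a more plausible character."""
    weights = weights or [1.0] * len(scores)
    return sum(a * s for a, s in zip(weights, scores)) / sum(weights)

def width_score(w_k, w_avg, a=100.0, b=400.0):
    """S(w_k) from item 1: penalize widths far from the line's average width."""
    ratio = w_k / w_avg
    return (a if ratio > 1 else b) * (ratio - 1.0) ** 2

def aspect_score(w_k, h_k, w_avg, h_avg, c=100.0):
    """S(r) from item 2, capped at 100."""
    r, r_avg = w_k / h_k, w_avg / h_avg
    return min(c * (r / r_avg - 1.0) ** 2, 100.0)
```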
4 Candidate set generation
Current off-line Chinese character segmentation methods can basically cut apart touching characters with geometric techniques, but the number of units obtained after cutting exceeds the actual number of characters, so the pieces must be merged, and the final result is obtained by a suitable merging algorithm. Traditional segmentation methods based on geometric features simply take the segmentation of minimum geometric cost computed over the generated geometric costs by a shortest-path algorithm. From the viewpoint of this invention, however, geometric cost is not the best criterion for judging a segmentation, and relying purely on classifier confidence is also unreliable; the contextual constraints must be considered, and the model that reflects these semantic constraints is the hidden Markov model (HMM) introduced above. Since the final result is given by combining the geometric cost and the semantic-recognition cost, a candidate set of segmentations must be generated according to some criterion; the geometric cost and the semantic-recognition cost of each element of the candidate set are computed, and the optimal segmentation result is obtained by combining them.
For the $N$ segmented sub-character images $s_1, s_2, \ldots, s_N$, a directed graph $(V, E)$ is built with $N + 1$ nodes labeled $Node_1, Node_2, Node_3, \ldots, Node_{N+1}$, i.e. $V = \{Node_1, Node_2, Node_3, \ldots, Node_{N+1}\}$. For any $Node_i$ there are directed edges from $Node_i$ to $Node_{i+1}, Node_{i+2}, Node_{i+3}, \ldots$; the edge $Node_i \to Node_{i+j}$ corresponds to merging blocks $i, i+1, \ldots, i+j-1$, i.e. merging the sub-character images $s_i, s_{i+1}, \ldots, s_{i+j-1}$, and the cost of the edge is the geometric cost of that merge. Every path in the segmentation graph from the start node $Node_1$ to the end node $Node_{N+1}$ corresponds to one way of merging the sub-character images $s_1, s_2, \ldots, s_N$, i.e. to one segmentation of the line image, and is therefore called a segmentation path. For example, strokes are extracted from the line image of Fig. 16(a) and merged into sub-characters; Fig. 17(a) shows the corresponding sub-character blocks, and Fig. 20 shows the segmentation graph built from them, where each arc represents the merging of several sub-characters.
Traditional methods directly search this segmentation graph for the optimal path from start to end and take it as the segmentation. Here, instead, a "neighborhood" of that optimal solution is searched: if the geometric cost function correctly reflects the cohesion between the sub-characters, then even when the segmentation of optimal geometric cost is not the correct solution, it should differ little from the correct segmentation. The search is therefore extended to the sub-optimal paths of the segmentation graph, and once a sub-optimal path is given, the semantic-recognition confidence based on the bigram model is extracted and used to "evaluate" the candidate segmentation.
The method used to search for sub-optimal paths is the K-shortest-path algorithm, which computes the K best segmentation paths of the segmentation graph in ascending order of path cost (every path runs from the start node $Node_1$ to the end node $Node_{N+1}$).
$N_{Node}$ --- the number of nodes of the graph, $N_{Node} = N + 1$, where $N$ is the number of segmented sub-character images;
$N_{Edge}$ --- the number of edges;
$Start$ --- the start node (i.e. $Node_1$);
$End$ --- the end node (i.e. $Node_{N+1}$);
$K$ --- the number of optimal paths to compute;
$\pi_k(v)$ --- the path from the start node $Start$ to node $v$ whose total cost ranks $k$-th, where $v$ ranges over the node set $Node_1, Node_2, Node_3, \ldots, Node_{N+1}$ and the paths reaching $v$ are ordered by ascending cost, so $\pi_1(v)$ is the shortest path from the start node $Start$ to node $v$; taking $v = Node_{N+1}$, $\pi_1(Node_{N+1})$ represents the shortest path from start to end;
$\Gamma^{-1}(v)$ --- the set of predecessor nodes of $v$, i.e. the set of nodes that may connect to $v$; for any $u \in \Gamma^{-1}(v)$ the edge $u \to v$ exists;
$a \cdot b$ --- the concatenation of two paths, where the end of path $a$ is the start of path $b$; the concatenated path $a \cdot b$ starts at the start of path $a$ and ends at the end of path $b$;
$C[v]$ --- the candidate path set of node $v$.
The algorithm then proceeds as follows:
First, compute $\pi_1(v)$ for all $v \in V$, i.e. compute the shortest path from the start node $Start$ to each node.
For each $v \in V$, compute $\pi_k(v)$, $2 \le k \le K$, recursively. Suppose $\pi_1(v), \pi_2(v), \ldots, \pi_{k-1}(v)$ have been computed; they are used to compute $\pi_k(v)$ as follows.
If $k = 2$, initialize the candidate path set $C[v]$: for each element $u$ of the predecessor set $\Gamma^{-1}(v)$, find the shortest path from the start node $Start$ to node $u$ and construct the new path $\pi_1(u) \cdot v$, adding it to the candidate path set of $v$, i.e. $C[v] \leftarrow \{ \pi_1(u) \cdot v \mid u \in \Gamma^{-1}(v) \}$;
If $k > 2$, consider the path $\pi_{k-1}(v)$ and let $u_0$ be its predecessor of $v$, i.e. $\pi_{k-1}(v)$ reaches $v$ through node $u_0$. It can be proved that there is an integer $l$ with $1 \le l \le k-1$ such that the portion of $\pi_{k-1}(v)$ from the start node $Start$ to $u_0$ coincides with $\pi_l(u_0)$; that is, $\pi_{k-1}(v) = \pi_l(u_0) \cdot v$, the path $\pi_l(u_0)$ extended from its endpoint $u_0$ to node $v$. For this integer $l$, compute $\pi_{l+1}(u_0)$.
Then remove the path $\pi_l(u_0) \cdot v$ from the candidate path set $C[v]$ and add $\pi_{l+1}(u_0) \cdot v$ to it, i.e. $C[v] \leftarrow (C[v] - \{ \pi_l(u_0) \cdot v \}) \cup \{ \pi_{l+1}(u_0) \cdot v \}$; then take the shortest path in $C[v]$, which can be proved to be $\pi_k(v)$.
Applying the algorithm above recursively yields the K best segmentation paths; it can be proved that the worst-case time complexity of the algorithm is bounded in terms of $N_{Edge}$, the number of arcs, and $N_{Node}$, the number of nodes.
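A compact Python sketch of generating the K lowest-cost segmentation paths of the segmentation graph. For brevity it uses a simple best-first enumeration over partial paths rather than the recursive candidate-set formulation above; it is an illustrative equivalent for this acyclic graph, not the patent's algorithm. Edges are assumed to be given as a dict mapping a node to its outgoing (next_node, geometric_cost) pairs.

```python
import heapq

def k_best_paths(edges, start, end, k):
    """Return up to k (cost, [nodes...]) paths from start to end in ascending
    order of total geometric cost, enumerated best-first."""
    heap = [(0.0, [start])]       # (cost so far, partial path)
    results = []
    while heap and len(results) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == end:
            results.append((cost, path))
            continue
        for nxt, edge_cost in edges.get(node, []):
            heapq.heappush(heap, (cost + edge_cost, path + [nxt]))
    return results

# toy segmentation graph over 4 sub-characters (nodes 1..5); each edge
# (i -> j) merges sub-characters i..j-1 with the given geometric cost
edges = {1: [(2, 10.0), (3, 4.0)], 2: [(3, 9.0), (4, 3.0)],
         3: [(4, 8.0), (5, 2.0)], 4: [(5, 7.0)]}
print(k_best_paths(edges, 1, 5, 3))
```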
5 Decision
For an input line image, the sub-characters are first extracted and the corresponding geometric cost estimates are produced; the procedure of section 4 is then used to generate several candidate sub-character merging schemes according to the geometric costs.
For each candidate scheme, the Viterbi algorithm is used to compute the semantic-recognition cost of that way of merging the sub-characters.
Following the derivation above, for the best K paths of each line image the normalized average geometric cost $\bar{G}_k$ and the average semantic-recognition cost $\bar{H}_k$ of each path are computed; letting $T_k = \bar{H}_k + \bar{G}_k$, the candidate path with the largest $T$ value is chosen as the final segmentation. Note that when the segmentation is decided, the recognition result of each image and the post-processing result are produced at the same time.
Description of drawings
Fig. 1(a): for the "sample library of line images with correct character segmentation", the samples are divided into two parts: one part is used as training samples to compute the parameters, and the other part is used as test samples to evaluate the performance of the method of the invention;
Fig. 1(b): from the "corpus of the field to which the objects to be segmented belong", the semantic constraints are computed, i.e. the prior probabilities of the characters and the inter-character transition probabilities (see step 2-1 of the implementation);
Fig. 1(c): from the "single-character image sample library of off-line handwritten Chinese characters", the variance parameter θ and the confusion matrix are estimated; they are used to estimate the recognition confidence of each recognized character (see steps 2-2-1 and 2-2-2 of the implementation);
Fig. 2(a): the process of computing the parameter λ that fuses the geometric cost with the semantic-recognition cost (see step 9-1 of the implementation);
Fig. 2(b): the process of segmenting off-line handwritten Chinese characters with the method of the invention;
Fig. 3: documents of handwritten Chinese characters are scanned into the computer and stored as images;
Fig. 4: the scanned document image must be denoised and binarized; since the invention processes line images, the text lines containing Chinese characters must also be extracted from the document image as the objects on which the invention operates;
Fig. 5(a): an example of an extracted line image in which the characters are fairly well separated;
Fig. 5(b): an example of an extracted line image with severe touching between characters;
Fig. 6: the histogram analysis of the character stroke width of the image in Fig. 5(a); horizontal black runs of length 4 occur most often, so in step 3-1 of the implementation p = 4 and the corresponding hist(p) = 690;
Fig. 7: the vertical projection profile of the black pixels of Fig. 5(a) (see step 3-2 of the implementation);
Fig. 8: the vertical projection profile of the black pixels of Fig. 5(b) (see step 3-2 of the implementation);
Fig. 9: the estimation process of the average character height $\bar{h}_c$ in step 3-3 of the implementation;
Figs. 10 and 11: results of stroke-segment extraction; each small rectangle encloses one extracted stroke segment (see step 4 of the implementation).
Fig. 12: $R_i$ and $R_j$ are the bounding rectangles of two adjacent stroke segments; the figure marks the horizontal distance $D_H(R_i, R_j)$ between the right side of $R_i$ and the left side of $R_j$, and the vertical distance $D_V(R_i, R_j)$ between the bottom side of $R_i$ and the top side of $R_j$ (see step 5 of the implementation);
Fig. 13: the result of merging the stroke segments of Fig. 11 into sub-characters (see step 5 of the implementation);
Fig. 14(a)-(d): Fig. 14(a) shows the original sub-character images, Fig. 14(b) the horizontal bounding-rectangle distance between sub-characters, Fig. 14(c) the Euclidean distance between sub-characters, and Fig. 14(d) the horizontal run distance between sub-characters;
Fig. 15: the score for the neighbor distance D of a character, i.e. the mapping from D to S(D) (see step 6-4 of the implementation);
Figs. 16(a)-(c): line images to be segmented, extracted from envelopes;
Figs. 17(a)-(c): the sub-characters obtained from Figs. 16(a)-(c) by stroke-segment extraction and merging; each sub-character is enclosed by a bounding rectangle;
Figs. 18(a)-(c): the results obtained from Figs. 17(a)-(c) by scoring with the geometric cost, taking the sub-character merge of optimal geometric cost and merging the sub-character blocks accordingly;
Figs. 19(a)-(c): the results obtained from Figs. 17(a)-(c) by merging the sub-character blocks according to the method described in this application;
Fig. 20: the segmentation graph built from the sub-character segmentation result of Fig. 17(a) (see step 7-1 of the implementation);
Fig. 21: the general process of sub-character merging (see step 6 of the implementation);
Fig. 22: explains step 7-3 of the implementation, a step that avoids repeatedly recognizing identical character images;
Specific implementation method
The invention is characterized in that it is implemented by an image acquisition device and a computer connected to it, and comprises the following steps carried out in sequence:
Step 1: collect with the image acquisition device enough training samples for the following purposes, and build the corresponding libraries
● a single-character image sample library of off-line handwritten Chinese characters;
● a sample library of line images with correct character segmentation (see Fig. 1a): the correct segmentation of each extracted line-image sample is labeled in advance, and the samples are then divided into two parts, one part used as training samples to compute the parameters and the other part used as test samples to evaluate the performance of the method described in this application;
● a corpus of the field to which the objects to be segmented belong;
Step 2: parameter estimation
Steps 2-1 and 2-2 are carried out on the given "corpus of the field related to the line images to be segmented" and are used to compute the semantic constraints of the field related to the samples to be segmented (Fig. 1b). The following notation is used:
$N_c$ --- the number of times the Chinese character $c$ occurs in the corpus;
$N_{c_1 c_2}$ --- the number of times the two-character string $c_1 c_2$ occurs in the corpus;
$N$ --- the total number of Chinese characters in the corpus;
$P(c)$ --- the probability that the character $c$ occurs in the corpus;
$P(c_2 \mid c_1)$ --- the probability that character $c_2$ immediately follows character $c_1$ in the corpus;
$P_{smooth}(\cdot)$ --- the probability after smoothing;
$M$ --- the number of distinct Chinese characters in the corpus; the national first-level Chinese character standard contains 3755 commonly used characters, so $M = 3755$ can simply be taken;
Step 2-1: estimate on the corpus $P(c)$, the prior probability that character $c$ occurs, and also the inter-character transition probability $P(c_2 \mid c_1)$:
Step 2-1-1: $P(c) = \frac{N_c}{N}$, where $N_c$ is the counted total number of occurrences of each character in the corpus;
Step 2-1-2: for the bigram model, $P(c_2 \mid c_1) = \frac{N_{c_1 c_2}}{N_{c_1}}$, where $N_{c_1 c_2}$ is the counted number of occurrences of the pair $c_1 c_2$ in the corpus;
Step 2-1-3: the following simple procedure is adopted to smooth the data for the bigram model:
$$P_{smooth}(c_2 \mid c_1) = \begin{cases} P(c_2 \mid c_1) & \text{if } N_{c_1 c_2} > 0 \\ \varepsilon & \text{if } N_{c_1 c_2} = 0 \text{ and } N_{c_2} = 0 \\ \frac{1}{M} & \text{if } N_{c_1 c_2} = 0 \text{ and } N_{c_2} > 0 \end{cases}$$
where $\varepsilon$ is a very small positive number given in advance, taken as $\varepsilon = 10^{-9}$;
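A direct Python rendering of this simple smoothing rule; `unigrams` and `bigrams` are assumed to be Counter-style dictionaries of character and character-pair counts, as built inside the earlier bigram-estimation sketch.

```python
def p_smooth(c1, c2, unigrams, bigrams, m=3755, eps=1e-9):
    """Smoothed bigram probability P_smooth(c2 | c1) as in step 2-1-3."""
    if bigrams.get((c1, c2), 0) > 0:
        return bigrams[(c1, c2)] / unigrams[c1]   # P(c2|c1) = N_c1c2 / N_c1
    if unigrams.get(c2, 0) == 0:                  # c2 never seen at all
        return eps
    return 1.0 / m                                # pair unseen, c2 seen
```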
Step 2-2 is computed on the single-character handwritten Chinese character image samples of the "single-character image sample library of off-line handwritten Chinese characters" (Fig. 1c), where the correct character corresponding to each sample image is known. The following notation is used:
$N_{Sample}$ --- the number of image samples in the off-line handwritten Chinese character image sample library;
$x_i$ --- the $i$-th sample image;
$d_j(x_i)$ --- the recognition distance of the $j$-th candidate character returned by the recognition kernel for the $i$-th sample image $x_i$; the kernel sorts the candidates in ascending order of recognition distance, so $d_1(x_i)$ is the recognition distance of the first-choice candidate of the $i$-th sample image $x_i$;
$N_{Cand}$ --- the number of recognition candidates returned by the recognition kernel for a single character image; this is a performance parameter of the kernel and hence a constant, independent of the input character image;
$L_i$ --- the position at which the correct character of the $i$-th character image $x_i$ appears in its recognition candidate set, the candidates being sorted in ascending order of recognition distance;
Step 2-2-1: compute the variance parameter of the off-line handwritten Chinese character recognition kernel, denoted θ
First, recognize every image sample in the "image sample library of isolated off-line handwritten Chinese characters" of step 1, obtaining for each image its N_Cand recognition candidates and the corresponding recognition distances;
From the distances returned by the kernel, compute for the i-th sample image x_i the difference y_ij = d_j(x_i) − d_1(x_i) between the distance of the j-th candidate and that of the first-choice candidate. Then minimize the following expression to obtain the estimate of θ:
$E = \frac{1}{2N_{sample}}\sum_{i=1}^{N_{sample}}\left\{\sum_{j=2}^{L_i}\left[\exp\left(-\frac{y_{ij}}{\theta}\right)-1\right]^2+\sum_{j=L_i+1}^{N_{Cand}}\exp\left(-\frac{2y_{ij}}{\theta}\right)\right\}$
The minimization may be done exhaustively: take 10000 points between 0 and 100 (0.01, 0.02, 0.03, ..., 99.9, 100), substitute each as θ into the expression above, and take as the estimate the θ giving the smallest E;
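A sketch of this exhaustive search in Python, assuming the distance differences y_ij and the rank L_i of the correct character are already available for every training image; the array layout is an assumption.

```python
import numpy as np

def estimate_theta(y, L, n_cand, grid=np.arange(0.01, 100.01, 0.01)):
    """y[i]: array of length N_Cand with y[i][j-1] = d_j(x_i) - d_1(x_i);
    L[i]: 1-based rank of the correct character of sample i in its candidate list.
    Returns the grid value of theta that minimises the criterion E of step 2-2-1."""
    best_theta, best_E = None, float("inf")
    for theta in grid:
        E = 0.0
        for yi, Li in zip(y, L):
            e = np.exp(-np.asarray(yi, dtype=float) / theta)
            E += np.sum((e[1:Li] - 1.0) ** 2)       # candidates ranked 2 .. L_i
            E += np.sum(e[Li:n_cand] ** 2)          # candidates ranked L_i+1 .. N_Cand
        E /= 2.0 * len(y)
        if E < best_E:
            best_E, best_theta = E, theta
    return best_theta
```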
To compute the confusion matrix of the recognition kernel, the following notation is used:
ω_input(x) --- the true class of image x;
c_j(x) --- the j-th candidate recognition result for image x returned by the kernel; the candidates are sorted in ascending order of recognition distance, so c_1(x) is the first-choice candidate of x;
{ω_input(x) = ω} --- the set of image samples whose true character is ω;
#{ω_input(x) = ω} --- the number of image samples whose true character is ω;
Step 2-2-2: compute the confusion matrix
The confusion matrix is an M × M matrix, where M is the number of distinct Chinese characters in the corpus; the national standard level-1 character set contains 3755 characters, so we may set M = 3755. If all characters are arranged in an arbitrary but fixed order char_1, char_2, ..., char_M, then the element in row α and column β of the confusion matrix is P(char_β|char_α), the probability that a sample whose true class is char_α is recognized by the kernel as char_β;
The confusion probability matrix of the recognition kernel is computed from
$P(char_\beta|char_\alpha)=\frac{1}{\#\{\omega_{input}(x)=char_\alpha\}}\sum_{x\in\{\omega_{input}(x)=char_\alpha\}}\sum_{j=1}^{N_{Cand}}P(c_j(x)=char_\beta|x)$
where #{ω_input(x) = char_α} is the number of image samples whose true character is char_α;
$P(c_j(x)=char_\beta|x)=\frac{\exp(-d_j(x)/\theta)}{\sum_{i=1}^{N_{Cand}}\exp(-d_i(x)/\theta)}$ is the confidence, given by the recognition kernel, that image x is recognized as char_β;
and $\sum_{x\in\{\omega_{input}(x)=char_\alpha\}}\sum_{j=1}^{N_{Cand}}P(c_j(x)=char_\beta|x)$ is the sum of the recognition confidences for char_β over all images whose true character is char_α and whose candidate set contains char_β;
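A sketch of the confusion-matrix computation of step 2-2-2, assuming the kernel's candidate labels and distances for each training image are already stored; the helper names are assumptions.

```python
import numpy as np

def confusion_matrix(samples, char_to_idx, theta):
    """samples: iterable of (true_char, cand_chars, cand_dists) for each training image,
    where cand_chars / cand_dists are its N_Cand recognition candidates and distances.
    Returns an M x M matrix P with P[alpha, beta] ~ P(char_beta | char_alpha)."""
    M = len(char_to_idx)
    P = np.zeros((M, M))
    count = np.zeros(M)                       # #{omega_input(x) = char_alpha}
    for true_char, cands, dists in samples:
        a = char_to_idx[true_char]
        count[a] += 1
        conf = np.exp(-np.asarray(dists, dtype=float) / theta)
        conf /= conf.sum()                    # Gaussian-model confidence of each candidate
        for ch, p in zip(cands, conf):
            P[a, char_to_idx[ch]] += p        # accumulate confidence mass on char_beta
    return P / np.maximum(count, 1)[:, None]
```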
Besides the four items above, we also need to estimate the parameter λ that fuses the geometric cost and the semantic-recognition cost (Fig. 2a); it is computed from the line-image training samples, and its estimation is described in the last part;
Step 3: line-image parameter extraction
This step extracts the parameters of the line image, namely the stroke width, the average character width and the average character height; the following parameters are estimated:
w_s --- the stroke width;
\bar{w}_c --- the average character width;
\bar{h}_c --- the average character height;
Step 3-1: estimating the stroke width w_s, i.e. the width of the handwriting
First, a histogram analysis of the horizontal black runs of the text line is performed. A horizontal black run is a rectangular region occupied by consecutive black pixels in the horizontal direction; its height is one pixel and its width is the run length. The horizontal axis of the histogram is the horizontal black run length, and the vertical axis is the number of horizontal black runs of that length. Let p be the run length with the largest count in the histogram and hist(p) that count; that is, the maximum of the histogram is hist(p) and its abscissa is p.
Then take
$w_s=\frac{(p-1)\,hist(p-1)+p\,hist(p)+(p+1)\,hist(p+1)}{hist(p-1)+hist(p)+hist(p+1)}$
(Fig. 6 shows the histogram analysis of the image of Fig. 5(a), with p = 4 and hist(p) = 690.)
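A sketch of the stroke-width estimate of step 3-1 in Python on a binary image; the convention that 1 marks a black pixel is an assumption.

```python
import numpy as np

def stroke_width(img):
    """img: 2-D 0/1 numpy array of the text line, 1 = black pixel."""
    hist = {}
    for row in img:
        run = 0
        for v in np.append(row, 0):           # trailing 0 flushes the last run
            if v:
                run += 1
            elif run:
                hist[run] = hist.get(run, 0) + 1
                run = 0
    p = max(hist, key=hist.get)                # most frequent horizontal run length
    num = sum(q * hist.get(q, 0) for q in (p - 1, p, p + 1))
    den = sum(hist.get(q, 0) for q in (p - 1, p, p + 1))
    return num / den
```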
Step 3-2: estimating the average character width \bar{w}_c
The average character width reflects the writing style of the text line and directly influences character segmentation. First, project the line image vertically to obtain a projection profile whose abscissa corresponds one-to-one with the abscissa of the text line and whose ordinate is the total number of black pixels in the corresponding column (Fig. 7 shows the projection of Fig. 5(a)). A horizontal black-run analysis is then performed along the horizontal axis of the projection (ordinate 0), and the mean length of all these runs is taken as the estimate of the average character width. When the character spacing in the line is so small that strokes of neighboring characters overlap, the black-run statistics can instead be taken along the horizontal line y = 2w_s + 1 of the projection; the mean of those run lengths gives a better estimate of \bar{w}_c.
(Fig. 8 shows the vertical projection of Fig. 5(b), which is a case of severe touching.)
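A sketch of the average-character-width estimate of step 3-2, under one plausible reading of the description: the black runs of the projection profile are taken above level 0 in the normal case and above level 2w_s + 1 for heavily touching lines.

```python
import numpy as np

def average_char_width(img, stroke_w, touching=False):
    """img: 2-D 0/1 array (1 = black); stroke_w: estimate from step 3-1.
    Returns the mean length of the horizontal black runs of the projection profile."""
    proj = img.sum(axis=0)                        # column-wise black pixel counts
    level = 2 * stroke_w + 1 if touching else 0   # threshold for heavily touching lines
    mask = proj > level
    runs, run = [], 0
    for m in np.append(mask, False):              # trailing False flushes the last run
        if m:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    return float(np.mean(runs)) if runs else 0.0
```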
Step 3-3: estimating the average character height \bar{h}_c
The average character height is extracted simply: divide the line image horizontally into several equal parts (usually five), and average the heights of these sub-images to obtain the average character height \bar{h}_c (see Fig. 9).
Step 4: stroke segment extraction
A stroke segment is one of the four basic stroke elements of Chinese characters (horizontal, vertical, left-falling, right-falling); extracting stroke segments effectively overcomes the touching of characters.
Stroke segments are extracted by black-run tracking: first find a horizontal black run in the image as the start of a stroke segment, then track this run row by row from top to bottom until tracking ends, yielding one stroke segment.
The tracking works as follows: within a certain range of the row below the current horizontal black run (normally the horizontal extent of the current run shifted by a few pixels on each side; we shift one pixel on each side), find all horizontal black runs; then, using the information of the existing stroke segment, such as the mean width of its runs and the segment direction obtained by fitting a straight line to its runs, choose one horizontal black run to add to the segment's run list and update the segment's information. This process is described in detail by the following steps:
Step 4-1: when scanning the horizontal black runs in top-to-bottom order, if a run does not belong to the run list of any existing stroke segment, treat it as the start of a new stroke segment and add it to the run list of this new segment;
Step 4-2: for the black run most recently added to a segment's run list, search the next row for new horizontal black runs within the horizontal extent of this run extended by one pixel on each side; if a horizontal run has black pixels extending into this region, extract that run; if no black run appears in this region, the segment is finished, and we return to step 4-1 to search for new segments;
Step 4-3: decide how to add the extracted black runs to the segment's run list. Two cases are distinguished: if exactly one horizontal black run was extracted in the previous step, go to step 4-3-1; if two or more were extracted, go to step 4-3-2;
Step 4-3-1:
If only one horizontal black run was extracted:
■ if the mean width of the segment's existing horizontal runs is at least twice the width of the new run, the end point of the stroke segment is judged to have been reached and extraction of this segment finishes;
■ if the width of the new run is at least three times the mean width of the segment's existing runs, the position is judged to be a stroke crossing; the segment direction is then predicted from the existing run information, and a run centered on the predicted direction and extending half of the segment's mean run width on each side is added to the run list as the new run of the segment;
■ if the new run satisfies neither of the two conditions above, it is judged to be a normal run of the stroke segment and is added directly to the segment's run list;
Step 4-3-2:
If two or more horizontal black runs were extracted, first predict the segment direction from the existing run information, take the extracted run lying on the predicted direction as the candidate run, and then repeat the three judgments of step 4-3-1 to update the run list;
The prediction used in steps 4-3-1 and 4-3-2 is as follows: compute the midpoint of every horizontal black run tracked so far for the segment, fit a straight line through these midpoints by least squares, and use the fitted line to predict the segment direction (a sketch of this fit is given at the end of step 4);
Step 4-4: determining the attribute of a stroke segment
For each extracted segment, first compute its height and width:
■ if the mean width of all the segment's horizontal runs is greater than a given threshold (we usually set it to 10 pixels) and the segment width is greater than the segment height, the segment is judged to be a horizontal stroke.
■ Otherwise we set a small step length, denoted MinStepLength (we usually take 3 pixels):
◆ compute the midpoints of the horizontal black runs of row i and of row i + MinStepLength of the segment;
■ if the two midpoints coincide, the corresponding angle is added to the angle accumulator;
■ if the abscissa of the midpoint of row i is greater than that of the midpoint of row i + MinStepLength, the corresponding angle is added to the angle accumulator;
■ if the abscissa of the midpoint of row i is less than that of the midpoint of row i + MinStepLength, the corresponding angle is added to the angle accumulator;
◆ scan downward from the first row of the segment until the remaining distance to the bottom of the segment equals MinStepLength, i.e. over (segment height − MinStepLength) rows, add up all the angles and compute the average angle;
■ if the average angle is greater than zero and smaller than a predefined value α_1 (usually α_1 = 10 degrees), the segment is judged to be a vertical stroke;
■ if the average angle is greater than α_1 and smaller than a predefined value α_2 (usually α_2 = 88 degrees), it is judged to be a left-falling stroke;
■ if the average angle is greater than α_2 and smaller than a predefined value α_3 (usually α_3 = 98 degrees), it is judged to be a horizontal stroke;
■ if the average angle is greater than α_3 and smaller than a predefined value α_4 (usually α_4 = 176 degrees), it is judged to be a right-falling stroke;
■ if the average angle is greater than α_4, it is judged to be a vertical stroke;
If no new run is found, tracking of the current segment ends; if no new segment is found, stroke segment extraction ends. When a segment has been extracted, its attribute (horizontal, vertical, left-falling or right-falling) is also determined.
Figures 10 and 11 show the results of stroke segment extraction; each small stroke segment is enclosed by a small rectangle.
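A sketch of the least-squares midpoint fit used in steps 4-3-1 and 4-3-2 to predict the direction of a stroke segment; the representation of a segment as a list of (row, x_start, x_end) runs is an assumption, and the run tracking itself is not reproduced here.

```python
import numpy as np

def predict_direction(runs):
    """runs: list of (row, x_start, x_end) horizontal black runs of one stroke segment.
    Fits x_mid = a * row + b through the run midpoints by least squares;
    the fitted line predicts where the segment continues on the next row."""
    rows = np.array([r for r, _, _ in runs], dtype=float)
    mids = np.array([(x0 + x1) / 2.0 for _, x0, x1 in runs], dtype=float)
    if len(runs) < 2:
        return 0.0, float(mids[0])            # a single run: assume vertical continuation
    a, b = np.polyfit(rows, mids, 1)          # least-squares straight-line fit
    return float(a), float(b)

# predicted midpoint on the next row: next_mid = a * (last_row + 1) + b
```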
Step 5: stroke segment merging
After stroke segment extraction, the segments must be further merged into sub-characters. Let R_i and R_j be the bounding rectangles of two adjacent stroke segments (see Fig. 12):
(x_{i,1}, y_{i,1}) --- the coordinates of the top-left corner of R_i;
(x_{i,2}, y_{i,2}) --- the coordinates of the bottom-right corner of R_i;
(x_{j,1}, y_{j,1}) --- the coordinates of the top-left corner of R_j;
(x_{j,2}, y_{j,2}) --- the coordinates of the bottom-right corner of R_j;
D_H(R_i, R_j) --- the horizontal distance between the right side of R_i and the left side of R_j (note that the sign of D_H(R_i, R_j) indicates direction: in Fig. 12 the left edge of R_i lies to the right of the left edge of R_j, so the value is negative; conversely, if the left edge of R_i lies to the left of the left edge of R_j, the value is positive);
D_V(R_i, R_j) --- the vertical distance between the bottom side of R_i and the top side of R_j (again signed: in Fig. 12 the bottom edge of R_i lies below the top edge of R_j, so the value is negative; otherwise it is positive);
width(R_i) --- the width of R_i;
width(R_j) --- the width of R_j;
Stroke segments are merged according to the following three rules:
(1) if R_i and R_j are such that, in the horizontal direction, R_i contains R_j or R_j contains R_i, then segments i and j are merged;
(2) if R_i and R_j satisfy D_H(R_i, R_j) < 0 (i.e. the left edge of R_i lies to the right of the left edge of R_j) and $\frac{-D_H(R_i,R_j)}{width(R_i)}>T_1$ or $\frac{-D_H(R_i,R_j)}{width(R_j)}>T_1$, then segments i and j are merged; here T_1 is a predefined threshold, usually 0.7;
(3) if R_i and R_j satisfy D_H(R_i, R_j) < 0 and both $\frac{-D_H(R_i,R_j)}{width(R_i)}>T_2$ and $\frac{-D_H(R_i,R_j)}{width(R_j)}>T_2$, then segments i and j are merged; here T_2 is a predefined threshold, usually 0.5;
The segment merging algorithm consists of the following steps:
Step 5-1: initialization; the segments are sorted from left to right by their horizontal positions;
Step 5-2: search for all segments that should be merged under rule (1); whenever a pair satisfying the rule is found, merge it and return to step 5-1; otherwise go to step 5-3;
Step 5-3: search for all segments that should be merged under rule (2); whenever a pair satisfying the rule is found, merge it and return to step 5-1; otherwise go to step 5-4;
Step 5-4: search for all segments that should be merged under rule (3); whenever a pair satisfying the rule is found, merge it; when no more merges are possible, the procedure ends;
Figure 13 shows the segment merging result for Figure 11.
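A sketch of the three merging rules of step 5 on bounding rectangles given as (x1, y1, x2, y2). The signed horizontal distance follows one reading of the convention above, only adjacent boxes are tested, and the control flow of steps 5-1 to 5-4 is slightly simplified; all of these are assumptions of the sketch.

```python
def d_h(ri, rj):
    # signed horizontal distance: negative when ri and rj overlap horizontally
    return rj[0] - ri[2]

def width(r):
    return r[2] - r[0]

def should_merge(ri, rj, rule, t1=0.7, t2=0.5):
    if rule == 1:                            # rule (1): horizontal containment
        return (ri[0] <= rj[0] and rj[2] <= ri[2]) or (rj[0] <= ri[0] and ri[2] <= rj[2])
    d = d_h(ri, rj)
    if d >= 0:
        return False
    a, b = -d / width(ri), -d / width(rj)
    return (a > t1 or b > t1) if rule == 2 else (a > t2 and b > t2)

def merge_segments(boxes):
    """boxes: list of (x1, y1, x2, y2) bounding rectangles of the stroke segments."""
    boxes = sorted(boxes)                    # step 5-1: left-to-right order
    for rule in (1, 2, 3):                   # steps 5-2 to 5-4
        merged = True
        while merged:
            merged = False
            for i in range(len(boxes) - 1):
                if should_merge(boxes[i], boxes[i + 1], rule):
                    b = boxes.pop(i + 1)
                    a = boxes.pop(i)
                    boxes.insert(i, (min(a[0], b[0]), min(a[1], b[1]),
                                     max(a[2], b[2]), max(a[3], b[3])))
                    merged = True
                    break
    return boxes
```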
Step 6: geometric cost evaluation of character merging, comprising six sub-steps:
The result of stroke segment merging, ordered from left to right, is denoted s_1, s_2, ..., s_N; these are the sub-character images obtained after segment merging. To complete the segmentation, these sub-character images must be merged appropriately; we now describe the merging of sub-character images (Fig. 21 illustrates this process).
We write (s_k, s_{k+1}, ..., s_{k+n_k-1}) for the character image formed by merging the sub-character images s_k, s_{k+1}, ..., s_{k+n_k-1}, and evaluate the geometric cost of this merge as follows:
w_k --- the width of the merged character image (s_k, s_{k+1}, ..., s_{k+n_k-1}) (Fig. 21);
h_k --- the height of the merged character image (s_k, s_{k+1}, ..., s_{k+n_k-1}) (Fig. 21);
Step 6-1: character width score, denoted S(w_k)
If the width of (s_k, s_{k+1}, ..., s_{k+n_k-1}) is w_k and the average character width of the text line is \bar{w}_c (estimated in step 3-2), then
$S(w_k)=\begin{cases} a\left(\frac{w_k}{\bar{w}_c}-1\right)^2 & \text{if } \frac{w_k}{\bar{w}_c}>1 \\ b\left(\frac{w_k}{\bar{w}_c}-1\right)^2 & \text{if } \frac{w_k}{\bar{w}_c}\le 1 \end{cases}$
where a and b are predefined parameters; we take a = 100, b = 400;
Step 6-2: character aspect-ratio score, denoted S(r)
$S(r)=\min\left\{c\left(\frac{r}{\bar{r}}-1\right)^2,\;100\right\}$, where usually c = 100;
r is the aspect ratio of the character (s_k, s_{k+1}, ..., s_{k+n_k-1}), r = w_k / h_k;
\bar{r} is the average aspect ratio of a character, \bar{r} = \bar{w}_c / \bar{h}_c;
\bar{w}_c is the estimated average character width of the text line (step 3-2);
\bar{h}_c is the estimated average character height of the text line (step 3-3);
Step 6-3: score for the distances between the sub-characters inside a character
(i) bounding-rectangle horizontal distance: the horizontal distance between the bounding rectangles of the sub-characters (the rectangles may overlap) (Fig. 14b);
(ii) Euclidean distance: for two pixels with coordinates (x_1, y_1) and (x_2, y_2), the Euclidean distance is $\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$; the Euclidean distance between sub-character A and sub-character B is defined as the minimum of the Euclidean distances between all black pixels of A and all black pixels of B;
(iii) average run distance: as shown in Fig. 14(d), the average run distance is the mean length of all the horizontal (white) runs between the two sub-characters;
Using these three distances, the internal distance of (s_k, s_{k+1}, ..., s_{k+n_k-1}) is defined as $d_{in}=\sum_{i=k}^{k+n_k-2}d_{i,i+1}$, where the distance between sub-characters s_i and s_j is $d_{i,j}=\sum_n\gamma_n d_{ij}^n$, i.e. a weighted average of the three distances above with weighting coefficients γ_n; in general it is sufficient to set all γ_n = 1;
The internal distance score is defined as
$S(d_{in})=\begin{cases} 0 & \text{if } d_{in}<0 \\ 100 & \text{if } d_{in}>\frac{w_k}{4} \text{ or } d_{in}>\frac{\bar{w}_c}{2} \\ \frac{400\,d_{in}}{w_k} & \text{otherwise} \end{cases}$
(Fig. 14(a) shows the original sub-character images, Fig. 14(b) the bounding-rectangle horizontal distance between sub-characters, Fig. 14(c) the Euclidean distance between sub-characters, and Fig. 14(d) the run distance.)
Step 6-4: score for the distances to the preceding and following characters, denoted S(D)
Suppose the distances between the character (s_k, s_{k+1}, ..., s_{k+n_k-1}) and the sub-characters before and after it are D_L and D_R respectively:
D_L --- the horizontal distance between the bounding rectangle of sub-character s_k and that of sub-character s_{k-1} (defined as in step 5);
D_R --- the horizontal distance between the bounding rectangle of sub-character s_{k+n_k-1} and that of sub-character s_{k+n_k} (defined as in step 5);
If k = 1 then D_L = ∞; if k + n_k − 1 = N then D_R = ∞; finally take D = min(D_L, D_R);
The score for the distance to the neighboring characters is S(D):
$S(D)=\begin{cases} 100 & \text{if } D<-\bar{w}_c \\ \frac{25\bar{w}_c+100\bar{D}-75D}{\bar{D}+\bar{w}_c} & \text{if } -\bar{w}_c\le D\le\bar{D} \\ \frac{25(D_{max}-D)}{D_{max}-\bar{D}} & \text{if } \bar{D}\le D\le D_{max} \\ 0 & \text{if } D>D_{max} \end{cases}$
where \bar{w}_c is the average character width of the text line, and \bar{D} and D_max are respectively the average and the maximum horizontal distance between the bounding rectangles of adjacent sub-characters in the text line;
Step 6-5: connectivity score, denoted S(C)
Let C_{ij} denote the connectivity between sub-character s_i and sub-character s_j;
The connectivity of the character (s_k, s_{k+1}, ..., s_{k+n_k-1}) is described by three quantities:
C_I --- the internal connectivity;
C_L --- the left connectivity;
C_R --- the right connectivity;
The internal connectivity is the degree of connection among the sub-characters inside the character; the left and right connectivities are the connectivities between sub-characters s_k and s_{k-1} and between s_{k+n_k-1} and s_{k+n_k}, respectively;
$C_I=\begin{cases} \frac{1}{n_k-1}\sum_{i=k}^{k+n_k-2}C_{i,i+1} & n_k>1 \\ 1 & n_k=1 \end{cases}$
$C_L=\begin{cases} C_{k,k-1} & k>1 \\ 1 & k=1 \end{cases}$
$C_R=\begin{cases} C_{k+n_k-1,\,k+n_k} & k+n_k-1<N \\ 1 & k+n_k-1=N \end{cases}$
The connectivity score is $S(C)=100\times\left[1-\frac{1}{2}\left(1+C_I-\frac{C_R+C_L}{2}\right)\right]$
Step 6-6: compute the overall score as the geometric cost
The overall score is the weighted average of the five scores above, $S=\frac{\sum_i\alpha_i S_i}{\sum_i\alpha_i}$; we usually take α_1 = 5, α_2 = 2, α_3 = 3, α_4 = 5, α_5 = 2.
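A sketch that assembles the five scores of step 6 into the overall geometric cost; the distance and connectivity inputs are assumed to be pre-computed from the images, since their extraction is described separately above.

```python
def width_score(w, w_bar, a=100.0, b=400.0):
    r = w / w_bar - 1.0
    return (a if w > w_bar else b) * r * r                 # S(w_k), step 6-1

def aspect_score(w, h, w_bar, h_bar, c=100.0):
    return min(c * ((w / h) / (w_bar / h_bar) - 1.0) ** 2, 100.0)   # S(r), step 6-2

def inner_dist_score(d_in, w, w_bar):
    if d_in < 0:
        return 0.0
    if d_in > w / 4.0 or d_in > w_bar / 2.0:
        return 100.0
    return 400.0 * d_in / w                                 # S(d_in), step 6-3

def neighbour_dist_score(D, w_bar, d_bar, d_max):
    if D < -w_bar:
        return 100.0
    if D <= d_bar:
        return (25 * w_bar + 100 * d_bar - 75 * D) / (d_bar + w_bar)
    if D <= d_max:
        return 25.0 * (d_max - D) / (d_max - d_bar)
    return 0.0                                              # S(D), step 6-4

def connectivity_score(c_i, c_l, c_r):
    return 100.0 * (1.0 - 0.5 * (1.0 + c_i - (c_r + c_l) / 2.0))    # S(C), step 6-5

def geometric_cost(scores, alphas=(5, 2, 3, 5, 2)):
    # step 6-6: weighted average of the five scores
    return sum(a * s for a, s in zip(alphas, scores)) / sum(alphas)
```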
Step 7: from the estimated geometric costs, compute the K segmentations with the smallest geometric cost
After step 6 the line image has been segmented into a sequence of sub-character images (see Fig. 17(a)(b)(c), which show the results of segmenting the line images of Fig. 16(a)(b)(c) into sub-character images; each rectangle encloses one sub-character image and is its bounding rectangle), together with the geometric costs of merging sub-character blocks. The sub-character images still have to be merged into character images to complete the character segmentation. Using the merging costs of the sub-character blocks obtained in step 6, we generate a number of possible sub-character merging schemes (each scheme corresponds to one segmentation result of the line image); these schemes are evaluated in step 8, where the optimal scheme is produced;
Step 7-1: constructing the segmentation graph
For the N segmented sub-character images s_1, s_2, ..., s_N, build a directed graph (V, E) with N+1 nodes labelled Node_1, Node_2, Node_3, ..., Node_{N+1}, i.e. V = {Node_1, Node_2, Node_3, ..., Node_{N+1}}. For any Node_i there are directed edges from Node_i to Node_{i+1}, Node_{i+2}, Node_{i+3}, ...; the edge Node_i → Node_{i+j} corresponds to merging blocks i, i+1, ..., i+j-1, i.e. to merging the sub-character images s_i, s_{i+1}, ..., s_{i+j-1}, and the cost of the edge is the geometric cost of this merge. Every path in the segmentation graph from the start node Node_1 to the end node Node_{N+1} corresponds to one way of merging the sub-character images s_1, s_2, ..., s_N, i.e. to one segmentation of the line image; we therefore call it a segmentation path.
(See Fig. 20: strokes are extracted from the line image of Fig. 16a and merged into sub-characters, Fig. 17a shows the corresponding sub-character blocks, and Fig. 20 shows the segmentation graph built from them; each arc represents the merging of several sub-characters.)
Step 7-2: generating the K segmentation paths with the smallest geometric cost:
We now describe how to compute, in the segmentation graph (V, E), the first K paths from the start node Node_1 to the end node Node_{N+1} in ascending order of cost. Each such path corresponds to one segmentation of the line image, i.e. to one merging scheme of the sub-character images s_1, s_2, ..., s_N; merging the sub-characters according to this scheme completes the segmentation of the line image.
The algorithm proceeds as follows.
Given the graph (V, E), define:
N_Node --- the number of nodes of the graph, N_Node = N + 1, where N is the number of segmented sub-character images;
N_Edge --- the number of edges;
Start --- the start node (i.e. Node_1);
End --- the end node (i.e. Node_{N+1});
K --- the number of optimal paths to be computed;
π_k(v) --- the path from the start node Start to node v (where v can be any node of {Node_1, Node_2, Node_3, ..., Node_{N+1}}) whose total cost ranks k-th when the path costs are arranged in ascending order; π_1(v) is thus the shortest path from Start to v, and taking v = Node_{N+1}, π_1(Node_{N+1}) is the shortest path from the start node to the end node;
Γ^{-1}(v) --- the set of predecessor nodes of v, i.e. the set of nodes that may be connected to v; for any u ∈ Γ^{-1}(v) the edge u → v exists;
a·b --- the concatenation of two paths, where the end node of path a is the start node of path b; the concatenated path a·b starts at the start node of a and ends at the end node of b;
C[v] --- the candidate path set of node v;
The computation then proceeds as follows:
Step 7-2-1: for every v ∈ V compute π_1(v), i.e. the shortest path from the start node Start to each node;
Step 7-2-2: for every v ∈ V compute π_k(v), 2 ≤ k ≤ K, recursively;
Suppose π_1(v), π_2(v), ..., π_{k-1}(v) have been computed; we describe how π_k(v) is obtained from them.
If k = 2, initialize the candidate path set C[v]: for every element u of the predecessor set Γ^{-1}(v) of v, find the shortest path from Start to node u, construct the new path π_1(u)·v, and add it to the candidate path set of v, i.e. C[v] ← {π_1(u)·v | u ∈ Γ^{-1}(v)};
If k > 2, consider the path π_{k-1}(v) and let u_0 be its predecessor of v, i.e. π_{k-1}(v) reaches v through node u_0. It can be shown that there exists an integer l with 1 ≤ l ≤ k−1 such that the portion of π_{k-1}(v) from Start to u_0 coincides with π_l(u_0), that is, π_{k-1}(v) = π_l(u_0)·v. For this integer l we then compute π_{l+1}(u_0) (since 1 ≤ l ≤ k−1 we have 2 ≤ l+1 ≤ k, so this recursive computation is feasible).
We then remove the path π_l(u_0)·v from the candidate set C[v] and add π_{l+1}(u_0)·v to it, i.e. C[v] ← (C[v] − {π_l(u_0)·v}) ∪ {π_{l+1}(u_0)·v}; finally we take the shortest path in C[v]; it can be shown that this shortest path is exactly π_k(v).
Applying the above algorithm recursively yields the first K segmentation paths;
The choice of K in practice is a compromise between time complexity and accuracy. In fact an exactly correct segmentation is not strictly required: a near-optimal segmentation can still guarantee a very high character recognition rate. We usually take K = 200; more generally an adaptive K can be chosen (typically K equal to 10 times the number of sub-character blocks).
Since the candidate set is generated according to the geometric cost criterion, an effective geometric cost function keeps the candidate set as small as possible while letting the correct segmentation appear near the front; a sketch of the candidate generation is given below.
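A sketch of candidate generation on the segmentation graph. Instead of reproducing the recursive bookkeeping of step 7-2 literally, it uses an equivalent best-first enumeration of the K cheapest Node_1 → Node_{N+1} paths, which is valid because all geometric costs are non-negative; cost(i, j) is an assumed callback returning the geometric cost of merging s_i .. s_{j-1}.

```python
import heapq

def k_best_segmentations(n_sub, cost, K):
    """Enumerate the K cheapest Node_1 -> Node_{N+1} paths of the segmentation graph.
    n_sub: number N of sub-character images;
    cost(i, j): geometric cost (>= 0) of merging s_i .. s_{j-1}, 1 <= i < j <= N+1.
    Returns a list of (total cost, [node indices]) in ascending order of cost."""
    start, goal = 1, n_sub + 1
    frontier = [(0.0, [start])]              # partial paths ordered by accumulated cost
    results = []
    while frontier and len(results) < K:
        c, path = heapq.heappop(frontier)
        node = path[-1]
        if node == goal:                     # the k-th pop at the goal is the k-th cheapest path
            results.append((c, path))
            continue
        for nxt in range(node + 1, goal + 1):
            heapq.heappush(frontier, (c + cost(node, nxt), path + [nxt]))
    return results
```

In each returned path, consecutive node indices (p, q) mean that sub-characters s_p .. s_{q-1} form one merged character.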
Step 7-3: character recognition
In fact the time spent generating the K candidates is small; the time bottleneck is the recognition classifier. The above procedure can therefore be further optimized in a concrete application: although the K candidate segmentations differ as wholes, every segmentation path shares some segments with the others, so it is unnecessary to recognize every path separately or to recompute all confidences. The following method is used:
CharCandidatesSet --- the image recognition candidate set; each of its elements contains N_Cand recognition candidates and the N_Cand corresponding recognition confidences;
CharCandidatesSetNum --- the number of elements in the image recognition candidate set;
Given the sub-characters s_1 s_2 ... s_N, build an (N+1) × (N+1) lookup table LookupTable, initialize all its elements to −1, empty the image recognition candidate set CharCandidatesSet, and set CharCandidatesSetNum = 0;
For 1 ≤ k ≤ K:
For the k-th candidate segmentation path (the candidate paths are ordered by ascending geometric cost):
For each merge of the form s_p s_{p+1} ... s_q (1 ≤ p ≤ q ≤ N), look up the element LookupTable(p, q+1) (the element in row p, column q+1 of LookupTable);
If LookupTable(p, q+1) = −1, this merge has not yet been recognized: recognize the image obtained from this combination, obtain its N_Cand candidates, estimate the confidence of each candidate (step 8-1), add the candidates and their confidences as one element to CharCandidatesSet, set LookupTable(p, q+1) = CharCandidatesSetNum, and then increase CharCandidatesSetNum by 1, i.e. CharCandidatesSetNum = CharCandidatesSetNum + 1;
If LookupTable(p, q+1) ≠ −1, this merge has already been considered and nothing needs to be done;
End For
Figure 22 explains the process of step 7-3 in detail. The benefit of this process is that repeated work by the recognition kernel is avoided, which saves a great deal of time. If a later step needs the recognition result of the merged block s_p s_{p+1} ... s_q, it only has to look up the element in row p, column q+1 of LookupTable to obtain the index of this block's recognition result in CharCandidatesSet, and thus retrieve the recognition result and the corresponding confidences stored at that position;
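A sketch of this caching scheme: each merged block s_p .. s_q is recognized at most once and its candidates and confidences are shared by all K candidate paths; recognize_block is an assumed interface to the recognition kernel, not part of the patent.

```python
def recognize_paths(paths, recognize_block):
    """paths: the K candidate segmentation paths, each a list of node indices [1, ..., N+1];
    recognize_block(p, q): assumed kernel interface returning the candidates and
    confidences for the image obtained by merging sub-characters s_p .. s_q.
    Returns (lookup, char_candidates), where lookup[(p, q+1)] indexes the block's result."""
    lookup = {}                              # plays the role of LookupTable
    char_candidates = []                     # plays the role of CharCandidatesSet
    for path in paths:
        for p, q1 in zip(path, path[1:]):    # edge p -> q1 merges s_p .. s_{q1-1}
            if (p, q1) not in lookup:        # recognize each distinct block only once
                char_candidates.append(recognize_block(p, q1 - 1))
                lookup[(p, q1)] = len(char_candidates) - 1
    return lookup, char_candidates
```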
Step 8: for the K candidate segmentation schemes that are optimal in the geometric sense given above, compute the semantic-recognition cost of each sub-character merging scheme with the bigram language model:
Step 8-1: character confidence estimation
x --- the character image to be recognized;
c_j(x) --- the j-th candidate character for image x returned by the recognition kernel (the candidates are sorted in ascending order of recognition distance, so c_1(x) is the first-choice candidate of image x);
d_j(x) --- the recognition distance of the j-th candidate character c_j(x) of image x; d_1(x) is thus the distance of the first-choice candidate of image x;
N_Cand --- the number of recognition candidates the kernel returns for a single character image; as noted in step 2-2, this value is constant and depends only on the recognition kernel;
P(c_j(x)|x) --- the confidence that image x is recognized as c_j(x); this is the quantity to be estimated;
For all the candidates obtained in step 7-3, one of the following two methods can be chosen, according to the concrete needs, to estimate the candidate confidences:
1. Empirical distance formulas:
$P(c_j(x)|x)=\frac{1/d_j(x)}{\sum_{k=1}^{N_{Cand}}1/d_k(x)},\;1\le j\le N_{Cand}$, or
$P(c_j(x)|x)=\frac{\frac{1}{d_j(x)-d_1(x)+1}}{\sum_{k=1}^{N_{Cand}}\frac{1}{d_k(x)-d_1(x)+1}},\;1\le j\le N_{Cand}$, or
$P(c_j(x)|x)=\frac{1/d_j^2(x)}{\sum_{k=1}^{N_{Cand}}1/d_k^2(x)},\;1\le j\le N_{Cand}$
2. Confidence estimation based on a Gaussian model: $P(c_j(x)|x)=\frac{\exp(-d_j(x)/\theta)}{\sum_{k=1}^{N_{Cand}}\exp(-d_k(x)/\theta)}$ (θ was computed in step 2-2-1); the previously computed confusion matrix (step 2-2-2) can be used to refine the estimated confidence, the corrected confidence being $P(c_j(x)|x)=\sum_{k=1}^{N_{Cand}}P(c_k(x)|x)P(c_j(x)|c_k(x))$;
The confidence estimation method can be chosen flexibly from the above as needed; in the present invention we adopt the confidence estimation based on the Gaussian model.
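A sketch of the Gaussian-model confidence and its optional confusion-matrix correction, operating on the distances and candidate labels returned by the kernel; array layouts are assumptions.

```python
import numpy as np

def gaussian_confidence(dists, theta):
    """P(c_j(x)|x) = exp(-d_j/theta) / sum_k exp(-d_k/theta)."""
    e = np.exp(-np.asarray(dists, dtype=float) / theta)
    return e / e.sum()

def corrected_confidence(conf, cand_idx, confusion):
    """Refine the confidences with the confusion matrix of step 2-2-2.
    conf: raw confidences of the N_Cand candidates;
    cand_idx: row/column indices of the candidate characters in the confusion matrix;
    confusion[k, j] ~ P(char_j | char_k)."""
    sub = confusion[np.ix_(cand_idx, cand_idx)]   # P(c_j | c_k) restricted to the candidates
    return conf @ sub                             # sum_k P(c_k|x) P(c_j|c_k)
```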
Step 8-2: computing the bigram-based semantic-recognition confidence cost
For each of the K line-image segmentation paths already obtained, the average semantic-recognition confidence cost is computed with the following method; the notation is:
n_k --- merging the sub-character images according to the k-th segmentation path yields n_k merged character images in total;
image_{k,t} --- the t-th character image obtained by merging the sub-character images according to the k-th segmentation path, where 1 ≤ k ≤ K and 1 ≤ t ≤ n_k;
c_j(image_{k,t}) --- the j-th recognition candidate of the character image image_{k,t} given by the recognition kernel, where 1 ≤ j ≤ N_Cand, 1 ≤ k ≤ K, 1 ≤ t ≤ n_k, with corresponding recognition confidence P(c_j(image_{k,t}) | image_{k,t});
Since the recognition of the character images and the estimation of the confidences were completed in step 7-3, in this step the required recognition results and confidences can be obtained from CharCandidatesSet by looking up LookupTable;
For the k-th candidate segmentation path, 1 ≤ k ≤ K, the semantic-recognition cost is computed with the following Viterbi method:
Let Q[n_k][N_Cand] be a two-dimensional array, where Q[t][j] stores the logarithm of the probability of the most likely candidate assignment from some candidate of the first character image up to the node c_j(image_{k,t}); in addition, a two-dimensional pointer array Path[n_k][N_Cand] is used to record the computation.
Initialization: t = 1, 1 ≤ j ≤ N_Cand:
Path[1][j] = NULL
Q[1][j] = log P(c_j(image_{k,1})) + log P(c_j(image_{k,1}) | image_{k,1})
Recursion: for 2 ≤ t ≤ n_k and 1 ≤ j ≤ N_Cand compute Q[t][j]:
$Q[t][j]=\max_{1\le l\le N_{Cand}}\{Q[t-1][l]+\log P(c_j(image_{k,t})|c_l(image_{k,t-1}))\}+\log P(c_j(image_{k,t})|image_{k,t})$
In addition, find the l that maximizes Q[t−1][l] + log P(c_j(image_{k,t}) | c_l(image_{k,t-1})), denoted l*, i.e.
$l^*=\arg\max_{1\le l\le N_{Cand}}\{Q[t-1][l]+\log P(c_j(image_{k,t})|c_l(image_{k,t-1}))\}$
Then let Path[t][j] point to the node c_{l*}(image_{k,t-1}), i.e. the parent of node c_j(image_{k,t}) is c_{l*}(image_{k,t-1});
Termination: t = n_k
Finally find $j^*=\arg\max_{1\le j\le N_{Cand}}Q[n_k][j]$, trace back the path indicated by Path[n_k][j*], and output the character at each node of the path; the resulting character string is the character recognition result. Together with the optimal string we also obtain Q[n_k][j*], the log-probability of the most likely path; this value divided by n_k is taken as the semantic-recognition cost of this segmentation path, i.e. the average semantic-recognition cost is $H_k=\frac{Q[n_k][j^*]}{n_k}$;
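A sketch of this Viterbi computation for one segmentation path; cands and conf hold the candidate characters and confidences of each merged image, and prior and bigram are assumed to be the probability functions estimated in step 2.

```python
import math

def _log(p):
    # guard against zero probabilities (smoothing normally prevents them)
    return math.log(max(p, 1e-300))

def semantic_recognition_cost(cands, conf, prior, bigram):
    """Viterbi over one segmentation path (step 8-2).
    cands[t][j]: j-th candidate character of the t-th merged image;
    conf[t][j]:  its confidence P(c_j | image_{k,t});
    prior(c):    P(c);  bigram(c1, c2): smoothed P(c2 | c1).
    Returns (recognized string, average semantic-recognition cost H_k)."""
    n, m = len(cands), len(cands[0])
    Q = [[0.0] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    for j in range(m):                                    # initialisation (t = 1)
        Q[0][j] = _log(prior(cands[0][j])) + _log(conf[0][j])
    for t in range(1, n):                                 # recursion
        for j in range(m):
            scores = [Q[t - 1][l] + _log(bigram(cands[t - 1][l], cands[t][j]))
                      for l in range(m)]
            l_star = max(range(m), key=lambda l: scores[l])
            Q[t][j] = scores[l_star] + _log(conf[t][j])
            back[t][j] = l_star
    j = max(range(m), key=lambda jj: Q[n - 1][jj])        # termination: j*
    best, chars = Q[n - 1][j], []
    for t in range(n - 1, -1, -1):                        # trace back Path
        chars.append(cands[t][j])
        j = back[t][j]
    return "".join(reversed(chars)), best / n             # H_k = Q[n_k][j*] / n_k
```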
Step 9: combine the geometric cost and the semantic cost and produce the final result
Step 9-1: estimate the fusion parameter λ of the geometric cost and the semantic-recognition cost:
The computation of the fusion parameter λ is shown in Fig. 2a.
The following notation is used:
N_L --- the number of line images for which the sub-characters and the correct sub-character merging are given (i.e. the number of training samples);
n_{i,k} --- the number of characters in the k-th candidate segmentation of the i-th training sample;
n_{i,0} --- the number of characters in the correct segmentation of the i-th training sample;
g_{i,k} --- the geometric cost of the k-th candidate segmentation path of the i-th training sample; g_{i,1} is therefore the minimum geometric cost over all segmentations of the i-th training sample;
G_{i,k} --- the average geometric cost (normalized value) of the k-th candidate segmentation of the i-th training sample;
H_{i,k} --- the average semantic-recognition cost of the k-th candidate segmentation of the i-th training sample;
g_{i,0} --- the geometric cost of the entirely correct segmentation of the i-th training sample;
G_{i,0} --- the average geometric cost (normalized value) of the entirely correct segmentation of the i-th training sample;
H_{i,0} --- the average semantic-recognition cost of the entirely correct segmentation of the i-th training sample;
We select N_L training line images (Fig. 1a) and process each line image following steps 3 through 8, thus obtaining n_{i,k}, g_{i,k}, H_{i,k} for 1 ≤ i ≤ N_L, 1 ≤ k ≤ K. The geometric cost estimated in step 6 is normalized and averaged as
$G_{i,k}=\frac{1}{n_{i,k}}\log\left(\lambda e^{-\lambda(g_{i,k}/g_{i,1}-1)}\right),\;1\le i\le N_L,\;1\le k\le K$
In the same way we obtain the normalized geometric cost G_{i,0}, 1 ≤ i ≤ N_L, of the correct segmentation and its average semantic-recognition cost H_{i,0}, 1 ≤ i ≤ N_L (using the method of step 8-2). Write $T_i^k(\lambda)=H_{i,k}+G_{i,k}$, 1 ≤ k ≤ K, and let $T_i^0(\lambda)$ be the T value of the entirely correct segmentation of sample i, i.e. $T_i^0(\lambda)=H_{i,0}+G_{i,0}$.
Minimizing $N(\lambda)=\sum_{i=1}^{N_L}\#\{T_i^k(\lambda)>T_i^0(\lambda)\,|\,1\le k\le K\}$ gives the estimate of the weighting coefficient λ;
where $\#\{T_i^k(\lambda)>T_i^0(\lambda)\,|\,1\le k\le K\}$ denotes, for the given λ, the number of candidate paths among the K candidate segmentation paths of the i-th sample image whose T value is larger than that of the correct segmentation; the minimization can again use the exhaustive search employed to minimize θ;
Step 9-2: compute the optimal segmentation-recognition path with the fusion parameter λ (Fig. 2b)
For a general line image to be segmented, we compute the K candidate segmentation paths according to steps 3 through 8, and for every path the average recognition-semantic cost H_k, 1 ≤ k ≤ K, and the normalized average geometric cost $G_k=\frac{1}{n_k}\log\left(\lambda e^{-\lambda(g_k/g_1-1)}\right)$, 1 ≤ k ≤ K (where g_k, 1 ≤ k ≤ K, is the geometric cost of each segmentation path obtained in step 6). The combined cost is then T_k = H_k + G_k, 1 ≤ k ≤ K; taking $k^*=\arg\max_{1\le k\le K}T_k$, the k*-th candidate segmentation is output as the optimal segmentation.
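A sketch of this final fusion step, assuming the per-path geometric costs, semantic costs, and character counts have already been computed.

```python
import math

def best_segmentation(g, H, n, lam):
    """g[k]: geometric cost of the k-th candidate path (g[0] is the smallest);
    H[k]: its average semantic-recognition cost; n[k]: its number of characters;
    lam: the fusion parameter lambda estimated in step 9-1.
    Returns the index k* of the candidate path with the largest combined value T_k."""
    T = []
    for gk, Hk, nk in zip(g, H, n):
        # G_k = (1/n_k) * log(lam * exp(-lam * (g_k / g_1 - 1))), written in log form
        Gk = (math.log(lam) - lam * (gk / g[0] - 1.0)) / nk
        T.append(Hk + Gk)
    return max(range(len(T)), key=lambda k: T[k])
```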
Experimental results of the method described in this application
The following data were prepared for testing:
1. To compute the recognition confidences of the classifier, the parameter θ must be computed on character image samples of known class; we used the 50 recognition candidates provided by the THOCR Chinese character recognition kernel. The samples are handwritten Chinese character samples collected by the Intelligent Image and Text Processing Laboratory of the Department of Electronic Engineering, Tsinghua University.
2. We collected handwritten envelope images of about 4,000 envelopes, processed part of them with the CEARS program provided by the Intelligent Image and Text Laboratory of the Department of Electronic Engineering, Tsinghua University, and extracted the address lines of the envelopes (containing both the geographical address and the organization name). After excluding incorrectly extracted address lines, 1141 line images remained as experiment samples, each with its correct character segmentation given manually in advance; 908 of them were selected as training samples for computing the parameter λ, and the other 233 served as test samples.
3. The task implemented is the segmentation of handwritten postal address lines. As the training corpus we used the postal address database purchased from Space Post Information Company (the Beijing address database), which contains 180,000 address records covering organization names and physical addresses, about 370,000 entries in total.
Following steps 2 through 9 described above, we obtained θ = 2.322 (step 2-2-1) and λ = 51.85 (step 9-1), and let the recognition kernel output the top ten recognition candidates for each image to be recognized (i.e. N_Cand = 10; steps 2-2 and 8-1).
We compare two indices, R_L and R_C:
R_L denotes the line segmentation accuracy; a line is correctly segmented when every character of the line image is correctly segmented:
R_L = (number of correctly segmented lines) / (total number of test lines);
R_C denotes the character segmentation accuracy, i.e. the proportion of individual characters that are correctly segmented:
R_C = (number of correctly segmented characters) / (total number of characters);
Test results (Intel Pentium 4, 2.8 GHz, 512 MB RAM):

Method | Correctly segmented characters | Correctly segmented lines | Character segmentation accuracy (%) | Line segmentation accuracy (%)
Segmentation by the minimum-geometric-cost path | 2492 | 55 | 82.7 | 23.6
Segmentation of this application | 2814 | 147 | 93.3 | 63.1

Note: the 233 test samples contain 3013 Chinese characters in total.
The average processing time per line is under 300 milliseconds; this includes all the time spent on merging character images, recognition, post-processing, and producing the optimal segmentation-recognition result.
Compared with the traditional approach of searching only for the segmentation with the optimal geometric cost (Figs. 18(a)(b)(c) show the segmentation results obtained for Figs. 17(a)(b)(c) using the geometric cost alone, while Figs. 19(a)(b)(c) show the results produced by our method), the method proposed by the present invention achieves high-accuracy segmentation of off-line handwritten Chinese characters. In terms of the time index, the method can segment and recognize handwritten document images efficiently in real time.
When λ = 0, i.e. when the geometric cost is not taken into account, relying on the semantic information alone does not give the best result. This is because isolated-character recognition tolerates a certain amount of noise, so in some cases a correct recognition result can be obtained even without an exactly correct segmentation; for segmentations whose semantic-recognition costs are close to each other, their geometric costs must be compared further.
Conclusion
In summary, the off-line handwritten Chinese character segmentation method based on the fusion of geometric cost and semantic-recognition cost proposed by the present invention has the following advantages:
1) The over-segmentation of off-line handwritten Chinese characters, realized on the basis of extracted geometric and statistical image features, effectively overcomes the touching of handwritten characters.
2) The geometric evaluation of the sub-characters of each character adopted by the present invention reflects the affinity between the sub-characters of each character well, and provides a correct basis for the subsequent merging.
3) The present invention generates segmentation candidates by computing the K shortest paths. Traditional methods take only the segmentation with the optimal geometric cost and cannot guarantee that the best segmentation is found; the K-shortest-path method overcomes this deficiency and enlarges the search space.
4) The present invention proposes a new viewpoint: the semantic-recognition cost is the strongest criterion for evaluating the validity of a segmentation, and the geometric cost is considered on this basis to decide the segmentation. This makes the off-line handwritten Chinese character segmentation and recognition process realized here closer to the human cognitive process. The framework provided by the present invention also has a certain guiding significance: geometric or semantic costs of any form can be fused following the idea provided by the present invention.
5) The present invention generalizes well: on the one hand, it can be extended from Chinese to the segmentation of off-line handwritten characters of English or other languages (a higher-order language model may be needed); on the other hand, it can easily be applied to other fields that require efficient off-line handwritten character segmentation, since it suffices to compute the relevant parameters in advance from a corpus of that field and substitute them into the framework of the model.
6) The present invention has a certain rejection capability for documents outside the target domain: with the bigram model trained on address lines, when a name line is erroneously extracted as an address line, the semantic cost becomes very high, so such erroneously extracted lines can be rejected correctly, making the whole system more intelligent.
This method proposes a unified model of Chinese character segmentation, recognition and post-processing, and offers a new approach to off-line Chinese character segmentation.

Claims (1)

1. An off-line handwritten Chinese character segmentation method fusing geometric cost and semantic-recognition cost, characterized in that it is implemented by an image capture device and a computer connected to it, and comprises the following steps in sequence:
Step 1: collect sufficient training samples for the following purposes with the image capture device and build the corresponding libraries
● an image sample library of isolated off-line handwritten Chinese characters;
● a line-image sample library with the correct character segmentation given: the correct segmentation of each pre-extracted line image is annotated in advance, and the samples are then split into two parts, one part used as training samples for parameter estimation and the other used as test samples for evaluating the performance of the method described in this application;
● a text corpus of the domain to which the lines to be segmented belong;
Step 2: parameter estimation
Steps 2-1 and 2-2 are carried out on the given "corpus of the domain of the line images to be segmented" and are used to compute the semantic constraints of that domain. The following notation is used:
N_c --- the number of times the Chinese character c occurs in the corpus;
N_{c1c2} --- the number of times the two-character sequence c_1c_2 occurs in the corpus;
N --- the total number of Chinese characters in the corpus;
P(c) --- the probability that character c occurs in the corpus;
P(c_2|c_1) --- the probability that character c_2 immediately follows character c_1 in the corpus;
P_smooth(·) --- the probability after smoothing;
M --- the number of distinct Chinese characters in the corpus;
Step 2-1: estimate P(c), the prior probability of character c, on the corpus, and also estimate the inter-character transition probability P(c_2|c_1):
Step 2-1-1: P(c) = N_c / N, where N_c is the total number of occurrences of character c counted in the corpus;
Step 2-1-2: for the bigram model, P(c_2|c_1) = N_{c_1c_2} / N_{c_1}, where N_{c_1c_2} is the counted number of occurrences of the sequence c_1c_2 in the corpus;
Step 2-1-3: the following simple smoothing is applied to the bigram model:
$P_{smooth}(c_2|c_1)=\begin{cases} P(c_2|c_1) & \text{if } N_{c_1c_2}>0 \\ \varepsilon & \text{if } N_{c_1c_2}=0 \text{ and } N_{c_2}=0 \\ 1/M & \text{if } N_{c_1c_2}=0 \text{ and } N_{c_2}>0 \end{cases}$
where ε = 10^{-9};
Step 2-2 is carried out on the isolated handwritten character image samples in the "image sample library of isolated off-line handwritten Chinese characters", where the correct character of each sample image is known. The following notation is used:
N_sample --- the number of image samples in the off-line handwritten Chinese character image sample library;
x_i --- the i-th sample image;
d_j(x_i) --- the recognition distance of the j-th candidate character returned by the recognition kernel for the i-th sample image x_i; the kernel sorts the candidates in ascending order of recognition distance, so d_1(x_i) is the distance of the first-choice candidate of x_i;
N_Cand --- the number of recognition candidates the kernel returns for a single character image; this is a performance parameter of the kernel and is therefore a constant, independent of the input image;
L_i --- the position at which the correct character of the i-th image x_i appears in its recognition candidate set, the candidates being ordered by ascending recognition distance;
Step 2-2-1: compute the variance parameter of the off-line handwritten Chinese character recognition kernel, denoted θ
First, recognize every image sample in the "image sample library of isolated off-line handwritten Chinese characters" of step 1, obtaining for each image its N_Cand recognition candidates and the corresponding recognition distances;
From the distances returned by the kernel, compute for the i-th sample image x_i the difference y_ij = d_j(x_i) − d_1(x_i) between the distance of the j-th candidate and that of the first-choice candidate. Then minimize the following expression to obtain the estimate of θ:
$E = \frac{1}{2N_{sample}}\sum_{i=1}^{N_{sample}}\left\{\sum_{j=2}^{L_i}\left[\exp\left(-\frac{y_{ij}}{\theta}\right)-1\right]^2+\sum_{j=L_i+1}^{N_{Cand}}\exp\left(-\frac{2y_{ij}}{\theta}\right)\right\}$
The minimization is done exhaustively: take 10000 points between 0 and 100 (0.01, 0.02, 0.03, ..., 99.9, 100), substitute each as θ into the expression above, and take as the estimate the θ giving the smallest E;
To compute the confusion matrix of the recognition kernel, the following notation is used:
ω_input(x) --- the true class of image x;
c_j(x) --- the j-th candidate recognition result for image x returned by the kernel; the candidates are sorted in ascending order of recognition distance, so c_1(x) is the first-choice candidate of x;
{ω_input(x) = ω} --- the set of image samples whose true character is ω;
#{ω_input(x) = ω} --- the number of image samples whose true character is ω;
Step 2-2-2: compute the confusion matrix
The confusion matrix is an M × M matrix, where M is the number of distinct Chinese characters in the corpus. If all characters are arranged in an arbitrary but fixed order char_1, char_2, ..., char_M, then the element in row α and column β of the confusion matrix is P(char_β|char_α), the probability that a sample whose true class is char_α is recognized by the kernel as char_β;
The confusion probability matrix of the recognition kernel is computed from
$P(char_\beta|char_\alpha)=\frac{1}{\#\{\omega_{input}(x)=char_\alpha\}}\sum_{x\in\{\omega_{input}(x)=char_\alpha\}}\sum_{j=1}^{N_{Cand}}P(c_j(x)=char_\beta|x)$
where #{ω_input(x) = char_α} is the number of image samples whose true character is char_α;
$P(c_j(x)=char_\beta|x)=\frac{\exp(-d_j(x)/\theta)}{\sum_{i=1}^{N_{Cand}}\exp(-d_i(x)/\theta)}$ is the confidence, given by the recognition kernel, that image x is recognized as char_β;
and $\sum_{x\in\{\omega_{input}(x)=char_\alpha\}}\sum_{j=1}^{N_{Cand}}P(c_j(x)=char_\beta|x)$ is the sum of the recognition confidences for char_β over all images whose true character is char_α and whose candidate set contains char_β;
Besides the four items above, we also need to estimate the parameter λ that fuses the geometric cost and the semantic-recognition cost; it is computed from the line-image training samples, and its estimation is described in the last part;
Step 3: the character row image parameter is extracted
This step is finished the extraction to the row image parameter, comprises stroke width, character mean breadth and character average height, relates to the estimation of following parameter:
w s---stroke width;
Figure C2005100121950005C1
---the character mean breadth:
Figure C2005100121950005C2
---the character average height;
Step 3-1: estimating the stroke width w_s, i.e. the width of the written strokes
First, a histogram analysis of the horizontal black runs of the text line is carried out; a horizontal black run is a rectangular region of consecutive black pixels in the X direction, one pixel high, whose width is the run length; the horizontal axis of the histogram is the horizontal black-run length and the vertical axis is the number of horizontal black runs of that length; let p be the run length with the largest count in the histogram and hist(p) the corresponding count, i.e. the maximum of the histogram ordinate is hist(p) and the corresponding abscissa is p;
Then take $w_s = \frac{(p-1)\,\mathrm{hist}(p-1) + p\,\mathrm{hist}(p) + (p+1)\,\mathrm{hist}(p+1)}{\mathrm{hist}(p-1) + \mathrm{hist}(p) + \mathrm{hist}(p+1)}$;
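A minimal sketch of the run-length histogram estimate of step 3-1, assuming the text line is given as a binary NumPy array with black pixels equal to 1; the helper names are illustrative only.

```python
import numpy as np

def horizontal_black_runs(binary_line):
    """Yield the lengths of all horizontal runs of black (value 1) pixels."""
    for row in binary_line:
        run = 0
        for v in row:
            if v:
                run += 1
            elif run:
                yield run
                run = 0
        if run:
            yield run

def stroke_width(binary_line):
    """Estimate w_s from the run-length histogram around its peak p."""
    lengths = np.fromiter(horizontal_black_runs(binary_line), dtype=int)
    if lengths.size == 0:
        return 0.0
    hist = np.bincount(lengths)
    p = int(hist.argmax())
    lo, hi = max(p - 1, 0), min(p + 1, len(hist) - 1)
    ks = np.arange(lo, hi + 1)
    return float((ks * hist[ks]).sum() / hist[ks].sum())
```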
Step 3-2: estimating the average character width $\bar{w}_c$
The average character width reflects the writing style of the text line and has a direct influence on character segmentation; first, the text line image is projected in the vertical direction to obtain a projection profile whose abscissa corresponds one-to-one to the abscissa of the text line and whose ordinate is the total number of black pixels in the corresponding column; a horizontal black-run analysis is then performed on the profile along the horizontal axis, i.e. at the level where the ordinate is 0, and the mean length of all these runs is taken as the estimate of the average character width; when the character spacing in the text line is so small that strokes of neighbouring characters overlap, the run statistics are instead computed on the horizontal level y = 2w_s + 1 of the projection profile, and their mean gives a better estimate of the average character width $\bar{w}_c$;
Step 3-3: estimating the average character height $\bar{h}_c$
The extraction of the average character height is comparatively simple: the line image is divided horizontally into several equal parts (five as a rule), and the heights of these small images are averaged to give the average character height $\bar{h}_c$;
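The sketch below illustrates steps 3-2 and 3-3 under the same binary-array assumption; note that the per-part height in `mean_char_height` is taken here as the height of the ink bounding box within each part, which is an interpretation of the text rather than a detail stated in it.

```python
import numpy as np

def projection_runs(profile, level):
    """Lengths of maximal intervals where the projection exceeds `level`."""
    runs, run = [], 0
    for value in profile:
        if value > level:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    return runs

def mean_char_width(binary_line, w_s, touching=False):
    """Step 3-2: mean run length of the vertical projection profile; for lines
    with touching characters the runs are measured at the level 2*w_s + 1."""
    profile = binary_line.sum(axis=0)                # black pixels per column
    level = 2 * w_s + 1 if touching else 0
    runs = projection_runs(profile, level)
    return float(np.mean(runs)) if runs else 0.0

def mean_char_height(binary_line, parts=5):
    """Step 3-3: split the line into equal horizontal parts (five by default)
    and average the ink height of each part."""
    heights = []
    for chunk in np.array_split(binary_line, parts, axis=1):
        rows = np.where(chunk.any(axis=1))[0]
        if rows.size:
            heights.append(rows.max() - rows.min() + 1)
    return float(np.mean(heights)) if heights else 0.0
```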
Step 4: stroke segment extraction stage
A stroke segment is one of the four basic stroke elements of Chinese characters, namely the horizontal, vertical, left-falling and right-falling strokes; stroke segment extraction effectively overcomes the problem of touching characters;
The stroke segment extraction method is based on black-run tracking; the idea is as follows: first a horizontal black run is found in the image as the start of a stroke segment, and this run is then tracked downwards row by row until the tracking terminates, which yields one stroke segment;
The tracking idea is as follows: in the row below the current horizontal black run, take the horizontal range covering the position of the current run extended by one pixel to the left and to the right, and find all horizontal black runs within this range; then, according to the average width of the runs already belonging to the stroke segment and the stroke direction fitted from those runs, select one of the horizontal black runs to join the existing run record of the segment and update the segment information; we describe this process in detail, and it comprises the following steps in order:
Step 4-1: while scanning the horizontal black pixel runs from top to bottom, if a run is not in the black-run record of any existing stroke segment, it is taken as the start of a new stroke segment, and the horizontal black run is added to the black-run record of this new segment;
Step 4-2: for the black run most recently added to a segment record, search in the next row, within the horizontal range of that run extended by one pixel on each side, for new horizontal black runs; if a black run extends into this region, extract it; if no black run appears in this region, the extraction of this stroke segment is finished, and we return to step 4-1 to search for a new stroke segment;
Step 4-3: for the extracted black runs, decide how they are added to the black-run record of the stroke segment; two cases are distinguished: if exactly one horizontal black run was extracted in the previous step, go to step 4-3-1; if two or more horizontal black runs were extracted, go to step 4-3-2;
Step 4-3-1:
If only one horizontal black run was extracted:
■ if the average width of the horizontal black runs already belonging to this segment is at least twice the width of the new horizontal black run, the end point of the stroke segment is judged to have been reached and the extraction of this segment is finished;
■ if the width of the new horizontal black run is at least three times the average width of the horizontal black runs already belonging to this segment, it is judged to be a stroke crossing point; the stroke direction is then predicted from the horizontal-run information of the existing segment, and a run centred on the predicted direction and extending by half of the segment's average horizontal run width on each side is added to the record as the new black run of the segment;
■ if the width of the new horizontal black run satisfies neither of the two conditions above, it is judged to be a normal run of the stroke segment and is added directly to the segment's horizontal black-run record;
Step 4-3-2:
If two or more horizontal black runs were extracted, the stroke direction is first predicted from the horizontal-run information of the existing segment; the run extracted along the predicted direction is taken as the candidate horizontal black run, and the three decisions of step 4-3-1 are then repeated to update the black-run record;
The prediction method used in steps 4-3-1 and 4-3-2 is as follows: compute the midpoint of every horizontal black run already tracked for the stroke segment, fit a straight line through these midpoints by the least-squares principle, and use the fitted line to predict the direction of the stroke segment;
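A small sketch of this least-squares prediction, treating each run midpoint as a (row, column) pair; the function names are illustrative.

```python
import numpy as np

def fit_stroke_line(run_midpoints):
    """Least-squares fit x = a*y + b through the midpoints (row y, column x)
    of the runs already tracked for a stroke segment."""
    ys = np.array([y for y, _ in run_midpoints], dtype=float)
    xs = np.array([x for _, x in run_midpoints], dtype=float)
    if len(ys) < 2 or np.ptp(ys) == 0:
        x0 = float(xs.mean()) if len(xs) else 0.0
        return 0.0, x0                               # degenerate: vertical line
    a, b = np.polyfit(ys, xs, 1)
    return float(a), float(b)

def predict_midpoint(run_midpoints, next_row):
    """Column where the fitted line places the stroke in `next_row`; used in
    steps 4-3-1 and 4-3-2 to choose among candidate runs."""
    a, b = fit_stroke_line(run_midpoints)
    return a * next_row + b
```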
Step 4-4: determining the attribute of the stroke segment:
For each extracted stroke segment we first compute its height and width:
■ if the average width of all horizontal black runs of the segment is greater than a given threshold and the segment width is greater than the segment height, the segment is judged to be a horizontal stroke;
■ otherwise we set a small step length, denoted MinStepLength:
◆ compute the midpoints of the horizontal black runs of row i and row i + MinStepLength ("+" denotes addition):
● if the two midpoints coincide, the angle given by formula C2005100121950007C1 is added to the angle accumulator;
● if the abscissa of the midpoint of row i is greater than that of the midpoint of row i + MinStepLength, the angle given by formula C2005100121950007C2 is added to the accumulator;
● if the abscissa of the midpoint of row i is less than that of the midpoint of row i + MinStepLength, the angle given by formula C2005100121950007C3 is added to the accumulator;
◆ scan downwards from the first row of each segment in steps of MinStepLength rows until the remaining segment height is exhausted, add up all the angles and take the average angle:
● if the average angle is greater than zero and smaller than a predefined value α_1, the segment is judged to be a vertical stroke;
● if the average angle is greater than α_1 and smaller than a predefined value α_2, it is judged to be a left-falling stroke;
● if the average angle is greater than α_2 and smaller than a predefined value α_3, it is judged to be a horizontal stroke;
● if the average angle is greater than α_3 and smaller than a predefined value α_4, it is judged to be a right-falling stroke;
● if the average angle is greater than α_4, it is judged to be a vertical stroke;
If no new run is found, the tracking of the current segment is finished; if no new stroke segment is found, stroke segment extraction is finished; once a stroke segment has been extracted, its attribute, i.e. horizontal, vertical, left-falling or right-falling, is also determined;
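For illustration, a sketch of the attribute decision of step 4-4; the angle increments themselves are not reproduced here (their formulas appear only as images above), so the average angle and the thresholds α_1 … α_4 are taken as inputs rather than computed.

```python
def classify_stroke(avg_run_width, seg_width, seg_height, avg_angle,
                    width_thresh, a1, a2, a3, a4):
    """Decide the stroke attribute of an extracted segment.  avg_angle is the
    mean of the accumulated angles; a1 < a2 < a3 < a4 are the predefined
    thresholds alpha_1..alpha_4 of step 4-4."""
    if avg_run_width > width_thresh and seg_width > seg_height:
        return "horizontal"
    if 0 < avg_angle < a1:
        return "vertical"
    if a1 <= avg_angle < a2:
        return "left-falling"
    if a2 <= avg_angle < a3:
        return "horizontal"
    if a3 <= avg_angle < a4:
        return "right-falling"
    return "vertical"                                # avg_angle >= a4
```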
Step 5: stroke segment merging
After stroke segment extraction is finished, the stroke segments must be further merged into sub-characters; let R_i and R_j be the bounding rectangles of two adjacent stroke segments;
(x_{i,1}, y_{i,1}) --- the coordinates of the upper-left corner of R_i;
(x_{i,2}, y_{i,2}) --- the coordinates of the lower-right corner of R_i;
(x_{j,1}, y_{j,1}) --- the coordinates of the upper-left corner of R_j;
(x_{j,2}, y_{j,2}) --- the coordinates of the lower-right corner of R_j;
D_H(R_i, R_j) --- the horizontal distance between the right side of R_i and the left side of R_j; the sign of D_H(R_i, R_j) indicates the direction: if the left edge of R_i lies to the right of the left edge of R_j, the value is negative; conversely, if the left edge of R_i lies to the left of the left edge of R_j, the value is positive;
D_V(R_i, R_j) --- the vertical distance between the bottom side of R_i and the top side of R_j; the sign of D_V(R_i, R_j) indicates the direction: if the bottom edge of R_i lies below the top edge of R_j, the value is negative; conversely, if the bottom edge of R_i lies above the top edge of R_j, the value is positive;
width(R_i) --- the width of R_i;
width(R_j) --- the width of R_j;
Stroke segments are merged according to the following three rules:
1) if R_i and R_j satisfy, in the horizontal direction, that R_i contains R_j or R_j contains R_i, then stroke segments i and j are merged;
2) if R_i and R_j satisfy D_H(R_i, R_j) < 0, i.e. the left edge of R_i lies to the right of the left edge of R_j, and $\frac{-D_H(R_i,R_j)}{\mathrm{width}(R_i)} > T_1$ or $\frac{-D_H(R_i,R_j)}{\mathrm{width}(R_j)} > T_1$, then stroke segments i and j are merged, where T_1 is a predefined threshold, typically 0.7;
3) if R_i and R_j satisfy D_H(R_i, R_j) < 0, and $\frac{-D_H(R_i,R_j)}{\mathrm{width}(R_i)} > T_2$ and $\frac{-D_H(R_i,R_j)}{\mathrm{width}(R_j)} > T_2$, then stroke segments i and j are merged, where T_2 is a predefined threshold, typically 0.5;
The stroke-segment merging algorithm comprises the following steps in order:
Step 5-1: initialization; sort the stroke segments from left to right by their horizontal position;
Step 5-2: search for all stroke segments that should be merged under rule (1); if a pair satisfying the condition is found, merge them and return to step 5-1; otherwise go to step 5-3;
Step 5-3: search for all stroke segments that should be merged under rule (2); if a pair satisfying the condition is found, merge them and return to step 5-1; otherwise go to step 5-4;
Step 5-4: search for all stroke segments that should be merged under rule (3); if a pair satisfying the condition is found, merge them; the merging then ends;
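A sketch of steps 5-1 to 5-4, assuming each stroke segment is reduced to its bounding box (x1, y1, x2, y2). The control flow is simplified: only neighbouring boxes in left-to-right order are tested, and each rule is exhausted before moving to the next, whereas the original returns to step 5-1 after every merge.

```python
def d_h(a, b):
    """Horizontal distance between the right side of box a and the left side
    of box b (negative when they overlap); boxes are (x1, y1, x2, y2)."""
    return b[0] - a[2]

def contains_horizontally(a, b):
    return a[0] <= b[0] and b[2] <= a[2]

def should_merge(a, b, rule, t1=0.7, t2=0.5):
    wa, wb = a[2] - a[0], b[2] - b[0]
    overlap = -d_h(a, b)
    if rule == 1:
        return contains_horizontally(a, b) or contains_horizontally(b, a)
    if rule == 2:
        return overlap > 0 and (overlap / wa > t1 or overlap / wb > t1)
    return overlap > 0 and (overlap / wa > t2 and overlap / wb > t2)

def merge_segments(boxes):
    """Repeatedly merge neighbouring boxes under rules 1, 2 and 3."""
    boxes = sorted(boxes)                            # left to right by x1
    for rule in (1, 2, 3):
        merged = True
        while merged:
            merged = False
            for i in range(len(boxes) - 1):
                if should_merge(boxes[i], boxes[i + 1], rule):
                    a, b = boxes[i], boxes[i + 1]
                    boxes[i:i + 2] = [(min(a[0], b[0]), min(a[1], b[1]),
                                       max(a[2], b[2]), max(a[3], b[3]))]
                    merged = True
                    break
    return boxes
```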
Step 6: evaluation of the geometric cost of a character merge, comprising six sub-steps
The results of stroke-segment merging, sorted from left to right, are denoted s_1, s_2, ..., s_N; these are the sub-character images produced by the stroke-segment merging; to complete the segmentation operation, these sub-character images must in turn be merged appropriately;
We now introduce the merging process of the sub-character images;
We use (s_k, s_{k+1}, ..., s_{k+n_k-1}) to denote the character image formed by merging the sub-character images s_k, s_{k+1}, ..., s_{k+n_k-1}; the geometric cost of the merge (s_k, s_{k+1}, ..., s_{k+n_k-1}) is assessed in the following way:
w_k --- the width of the character image (s_k, s_{k+1}, ..., s_{k+n_k-1}) obtained after merging the sub-character images s_k, s_{k+1}, ..., s_{k+n_k-1};
h_k --- the height of the character image (s_k, s_{k+1}, ..., s_{k+n_k-1}) obtained after merging the sub-character images s_k, s_{k+1}, ..., s_{k+n_k-1};
Step 6-1: scoring the character width w_k; the width score is denoted S(w_k)
Let w_k be the width of (s_k, s_{k+1}, ..., s_{k+n_k-1}) and let $\bar{w}_c$ be the average character width of the text line obtained in step 3-2; then
$$S(w_k) = \begin{cases} a\left(\dfrac{w_k}{\bar{w}_c} - 1\right)^2 & \text{if } \dfrac{w_k}{\bar{w}_c} > 1 \\[2ex] b\left(\dfrac{w_k}{\bar{w}_c} - 1\right)^2 & \text{if } \dfrac{w_k}{\bar{w}_c} \le 1 \end{cases}$$
where a and b are predefined parameters; we generally take a = 100 and b = 400;
Step 6-2: scoring the character aspect ratio r; the aspect-ratio score is denoted S(r): $S(r) = \min\left\{c\left(\frac{r}{\bar r} - 1\right)^2, 100\right\}$, where we generally take c = 100;
r is the aspect ratio of the character (s_k, s_{k+1}, ..., s_{k+n_k-1}): $r = \frac{w_k}{h_k}$;
$\bar r$ is the average character aspect ratio: $\bar r = \frac{\bar{w}_c}{\bar{h}_c}$;
$\bar{w}_c$ is the average character width of the text line, estimated as in step 3-2;
$\bar{h}_c$ is the average character height of the text line, estimated as in step 3-3;
Step 6-3: scoring the distances between the sub-characters inside a character
1) horizontal distance between the bounding rectangles: the horizontal distance between the bounding rectangles of the sub-characters, where the areas enclosed by the rectangles are allowed to overlap;
2) Euclidean distance: the Euclidean distance between two pixels with coordinates (x_1, y_1) and (x_2, y_2) is defined as $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$; the Euclidean distance between sub-character A and sub-character B is defined as the minimum of the Euclidean distances between all black pixels of A and all black pixels of B;
3) average run distance: the mean length of all the horizontal white runs between the two sub-characters is taken as the average run distance;
Based on these three distances, the inner distance of (s_k, s_{k+1}, ..., s_{k+n_k-1}) is defined as $d_{in} = \sum_{i=k}^{k+n_k-1} d^k_{i,i+1}$, where the distance $d_{i,j}$ between sub-characters s_i and s_j is $d_{i,j} = \sum_n d^n_{ij}$, i.e. the sum over the three kinds of distance above;
The inner-distance score is defined as
$$S(d_{in}) = \begin{cases} 0 & \text{if } d_{in} < 0 \\ 100 & \text{if } d_{in} > \dfrac{w_k}{4} \text{ or } d_{in} > \dfrac{\bar{w}_c}{2} \\ \dfrac{400\, d_{in}}{w_k} & \text{otherwise} \end{cases}$$
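A direct transcription of the inner-distance score, assuming the three distances for every adjacent sub-character pair have already been computed and following the summation form of the formula above.

```python
def inner_distance_score(pair_distances, w_k, mean_char_width):
    """pair_distances: one entry per adjacent sub-character pair inside the
    merged character; each entry holds the three distances (bounding-box gap,
    minimum Euclidean distance, mean white-run length)."""
    d_in = sum(sum(triple) for triple in pair_distances)
    if d_in < 0:
        return 0.0
    if d_in > w_k / 4.0 or d_in > mean_char_width / 2.0:
        return 100.0
    return 400.0 * d_in / w_k
```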
Step 6-4: scoring the spacing before and after the character, denoted S(D)
Suppose the distances between the character (s_k, s_{k+1}, ..., s_{k+n_k-1}) and the preceding and following sub-characters are D_L and D_R respectively:
D_L --- the horizontal distance between the bounding rectangle of sub-character s_k and the bounding rectangle of sub-character s_{k-1}, defined as in step 5;
D_R --- the horizontal distance between the bounding rectangle of sub-character s_{k+n_k-1} and the bounding rectangle of sub-character s_{k+n_k}, defined as in step 5;
If k = 1 then D_L = ∞; if k + n_k − 1 = N then D_R = ∞; finally take D = min(D_L, D_R);
The score for the character spacing is S(D):
$$S(D) = \begin{cases} 100 & \text{if } D < -\bar{w}_c \\[1ex] \dfrac{25\bar{w}_c + 100\bar{D} - 75D}{\bar{D} + \bar{w}_c} & \text{if } -\bar{w}_c \le D \le \bar{D} \\[1ex] \dfrac{25(D_{\max} - D)}{D_{\max} - \bar{D}} & \text{if } \bar{D} \le D \le D_{\max} \\[1ex] 0 & \text{if } D > D_{\max} \end{cases}$$
where $\bar{w}_c$ is the average character width of the text line, and $\bar{D}$ and $D_{\max}$ are respectively the mean and the maximum of the horizontal distances between the bounding rectangles of the sub-characters in the text line;
Step 6-5: scoring the connectivity, denoted S(C)
Define the connectivity C_{ij} of sub-character s_i and sub-character s_j by formula C2005100121950011C1;
The connectivity of the character (s_k, s_{k+1}, ..., s_{k+n_k-1}) is described by three quantities:
C_I --- the internal connectivity;
C_L --- the left connectivity;
C_R --- the right connectivity;
The internal connectivity is the degree of connection among the sub-characters inside the character; the left and right connectivity are the connectivity between sub-characters s_k and s_{k-1} and between sub-characters s_{k+n_k-1} and s_{k+n_k}, respectively;
$$C_I = \begin{cases} \dfrac{1}{n_k - 1} \sum_{i=k}^{k+n_k-2} C_{i,i+1} & n_k > 1 \\ 1 & n_k = 1 \end{cases}$$
$$C_L = \begin{cases} C_{k,k-1} & k > 1 \\ 1 & k = 1 \end{cases}$$
$$C_R = \begin{cases} C_{k+n_k-1,\,k+n_k} & k + n_k - 1 < N \\ 1 & k + n_k - 1 = N \end{cases}$$
The connectivity score is $S(C) = 100 \times \left[1 - \frac{1}{2}\left(1 + C_I - \frac{C_R + C_L}{2}\right)\right]$;
Step 6-6: computing the overall score, which serves as the geometric cost
The overall score is the weighted mean of the five scores above, $S = \frac{\sum_i \alpha_i S_i}{\sum_i \alpha_i}$, where α_1 = 5, α_2 = 2, α_3 = 3, α_4 = 5, α_5 = 2;
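The partial scores of steps 6-1, 6-2 and 6-6 are simple enough to state directly; the sketch below assumes the remaining scores S(d_in), S(D) and S(C) have been computed elsewhere and are passed in.

```python
def width_score(w_k, mean_w, a=100.0, b=400.0):
    """Step 6-1: penalize deviation of w_k from the mean character width."""
    x = w_k / mean_w - 1.0
    return (a if x > 0 else b) * x * x

def ratio_score(w_k, h_k, mean_w, mean_h, c=100.0):
    """Step 6-2: penalize deviation of the aspect ratio from the line average."""
    r, r_bar = w_k / h_k, mean_w / mean_h
    return min(c * (r / r_bar - 1.0) ** 2, 100.0)

def geometric_cost(s_width, s_ratio, s_inner, s_spacing, s_connect,
                   weights=(5, 2, 3, 5, 2)):
    """Step 6-6: weighted mean of the five partial scores with
    alpha_1..alpha_5 = 5, 2, 3, 5, 2."""
    scores = (s_width, s_ratio, s_inner, s_spacing, s_connect)
    return sum(a * s for a, s in zip(weights, scores)) / sum(weights)
```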
Step 7: computing the K segmentation schemes of optimal geometric cost from the estimated geometric costs
After step 6 the line image has been cut into a sequence of sub-character images, and the geometric cost of merging sub-character blocks has been given; however, the sub-character images still have to be merged into character images in order to complete the segmentation of the characters; below, the block-merging costs obtained in step 6 are used to generate a number of possible sub-character merging schemes, each of which corresponds to one segmentation result of the line image; these schemes are evaluated in step 8, where the optimal scheme is given;
Step 7-1: building the segmentation graph
For the N segmented sub-character images s_1, s_2, ..., s_N, a directed graph (V, E) is built whose number of nodes is N + 1, labelled Node_1, Node_2, Node_3, ..., Node_{N+1}, i.e. V = {Node_1, Node_2, Node_3, ..., Node_{N+1}}; for any Node_i there exist directed edges from Node_i to Node_{i+1}, Node_{i+2}, Node_{i+3}, ...; the edge Node_i → Node_{i+j} corresponds to merging the blocks i, i+1, ..., i+j−1, that is, to merging the sub-character images s_i, s_{i+1}, ..., s_{i+j−1}, and the cost of the edge is the geometric cost of that merge; every path in the segmentation graph from the start node Node_1 to the end node Node_{N+1} corresponds to one way of merging the sub-character images s_1, s_2, ..., s_N, i.e. to one segmentation of the line image, so we call it a segmentation path;
Step 7-2: generating the K segmentation paths of lowest geometric cost:
We now describe how to compute, in the segmentation graph (V, E), the first K paths from the start node Node_1 to the end node Node_{N+1} ordered by ascending cost; each of these paths corresponds to one segmentation of the line image, i.e. to one merging scheme of the sub-character images s_1, s_2, ..., s_N; merging the sub-characters s_1, s_2, ..., s_N according to this scheme completes the segmentation of the line image;
The detailed procedure of the algorithm is as follows:
Given the graph (V, E), define:
N_Node --- the number of nodes of the graph, N_Node = N + 1, where N is the number of segmented sub-character images;
N_Edge --- the number of edges;
Start --- the start node, Node_1;
End --- the end node, Node_{N+1};
K --- the number of optimal paths to be computed;
π_k(v) --- the k-th path from the start node Start to node v when the paths are ordered by ascending total cost, where v ranges over the node set {Node_1, Node_2, Node_3, ..., Node_{N+1}}; thus π_1(v) is the shortest path from the start node Start to node v, and taking v = Node_{N+1}, π_1(Node_{N+1}) is the shortest path from the start node to the end node;
Γ^{-1}(v) --- the set of predecessor nodes of v, i.e. the set of nodes that may be connected to v; for any u ∈ Γ^{-1}(v) there exists a path u → v;
a ∘ b --- the concatenation of two paths, where the end point of path a is the start point of path b; the concatenated path a ∘ b has the start point of a as its start point and the end point of b as its end point;
C[v] --- the candidate path set of node v;
The computation then proceeds as follows:
Step 7-2-1: for every v ∈ V, compute π_1(v), i.e. the shortest path from the start node Start to each node;
Step 7-2-2: for each v ∈ V, suppose π_1(v), π_2(v), ..., π_{k-1}(v) have already been computed; we now describe how to use them to compute π_k(v), where 2 ≤ k ≤ K;
If k = 2, initialize the candidate path set C[v]: for each element u of the predecessor set Γ^{-1}(v) of v, take the shortest path from the start node Start to node u and extend it to v, forming the new path π_1(u) ∘ v, and add it to the candidate path set of v, i.e. C[v] ← {π_1(u) ∘ v | u ∈ Γ^{-1}(v)};
If k > 2, consider the path π_{k-1}(v) and let u_0 be its predecessor node of v, i.e. π_{k-1}(v) reaches v through node u_0; there must exist an integer l with 1 ≤ l ≤ k − 1 such that the portion of π_{k-1}(v) from the start node Start to u_0 coincides with π_l(u_0), that is, π_{k-1}(v) is exactly the path π_l(u_0) extended from its end point u_0 to node v, i.e. π_{k-1}(v) = π_l(u_0) ∘ v; for this integer l we then compute π_{l+1}(u_0);
We then remove the path π_l(u_0) ∘ v from the candidate path set C[v] and add π_{l+1}(u_0) ∘ v to it, i.e. C[v] ← C[v] − {π_l(u_0) ∘ v} ∪ {π_{l+1}(u_0) ∘ v}; the shortest path in C[v] is then π_k(v);
Applying the above algorithm recursively yields the first K segmentation paths;
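The recursive procedure above can be summarized, for the special structure of the segmentation graph (a DAG whose edges only run forward), by the simpler dynamic program sketched below; it returns the same K lowest-cost start-to-end paths but is not a line-by-line rendering of steps 7-2-1 and 7-2-2. `edge_cost` is a placeholder for the geometric cost of step 6.

```python
import heapq

def k_best_paths(num_subchars, edge_cost, K):
    """K lowest-cost paths from Node_1 to Node_{N+1} in the segmentation graph.
    edge_cost(i, j) is the geometric cost of merging sub-characters s_i..s_{j-1}
    (1-based, i < j <= num_subchars + 1).  Returns a list of (cost, node path)."""
    n_nodes = num_subchars + 1
    best = {1: [(0.0, [1])]}                         # up to K (cost, path) per node
    for v in range(2, n_nodes + 1):
        candidates = []
        for u in range(1, v):                        # every predecessor of v
            c_uv = edge_cost(u, v)
            for cost, path in best.get(u, []):
                candidates.append((cost + c_uv, path + [v]))
        best[v] = heapq.nsmallest(K, candidates, key=lambda t: t[0])
    return best[n_nodes]

# Toy usage: 3 sub-characters, merging j - i blocks costs 10*(j - i - 1).
# k_best_paths(3, lambda i, j: 10.0 * (j - i - 1), K=3)
```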
Step 7-3: character recognition
CharCandidatesSet --- the image recognition candidate set; each of its elements contains N_Cand recognition candidates and the N_Cand corresponding recognition confidences;
CharCandidatesSetNum --- the number of elements in the image recognition candidate set;
Given the sub-characters s_1 s_2 ... s_N, we build an (N+1) × (N+1) lookup table LookupTable; all elements of LookupTable are first set to −1, the image recognition candidate set CharCandidatesSet is emptied, and CharCandidatesSetNum is set to 0;
For 1 ≤ k ≤ K:
for the k-th candidate segmentation path, i.e. the candidate segmentation path ranked k-th in ascending order of geometric cost,
for every merge of the form s_p s_{p+1} ... s_q (1 ≤ p ≤ q ≤ N) on that path, query the element LookupTable(p, q+1), i.e. the element in row p, column q+1 of the lookup table LookupTable;
if LookupTable(p, q+1) = −1, this merge has not yet been recognized; the image obtained from this combination is recognized, yielding N_Cand candidates, and the confidence of each recognition candidate is estimated (see step 8-1); the candidates together with their corresponding confidences are then added as a single element to the image recognition candidate set CharCandidatesSet, after which we set LookupTable(p, q+1) = CharCandidatesSetNum and increase CharCandidatesSetNum by 1, i.e. CharCandidatesSetNum = CharCandidatesSetNum + 1;
if LookupTable(p, q+1) ≠ −1, this merge has already been considered and needs no further processing;
End For
The benefit of this procedure is that it avoids repeated work by the recognition kernel and saves a great deal of time; whenever a later step needs the recognition result of the merged block s_p s_{p+1} ... s_q, it suffices to query the element in row p, column q+1 of LookupTable to obtain the index of that result in CharCandidatesSet, and thus to find the recognition result and the corresponding confidences of the merged block s_p s_{p+1} ... s_q recorded at the corresponding position of CharCandidatesSet;
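A sketch of the memoization idea of step 7-3, using a dictionary in place of the (N+1)×(N+1) table initialized to −1; `recognize` and `merge_images` are placeholder callables standing in for the recognition kernel and the image merging.

```python
def recognize_paths(paths, subchar_images, recognize, merge_images, n_cand):
    """Recognize every block s_p..s_q occurring on any candidate segmentation
    path, calling the recognition kernel at most once per (p, q) combination.
    paths: lists of node indices, e.g. [1, 3, 4] = blocks s_1 s_2 and s_3.
    recognize(image, n_cand) -> list of (label, confidence) pairs."""
    lookup = {}                        # (p, q+1) -> index into candidates_set
    candidates_set = []
    for path in paths:
        for p, q_plus_1 in zip(path, path[1:]):
            key = (p, q_plus_1)
            if key in lookup:          # this merge was already recognized
                continue
            image = merge_images(subchar_images[p - 1:q_plus_1 - 1])
            candidates_set.append(recognize(image, n_cand))
            lookup[key] = len(candidates_set) - 1
    return lookup, candidates_set
```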
Step 8: given the K candidate segmentation schemes that are optimal in the geometric sense, compute the semantic-recognition cost corresponding to each sub-character merging scheme using the bigram language model:
Step 8-1: estimating the character confidence
x --- the character image to be recognized;
c_j(x) --- the j-th candidate character for image x given by the recognition kernel; the kernel orders the recognition candidates by ascending recognition distance, so c_1(x) is the first-choice recognition candidate for image x;
d_j(x) --- the recognition distance corresponding to the j-th candidate character c_j(x) of image x given by the recognition kernel; thus d_1(x) is the recognition distance corresponding to the first-choice recognition candidate of image x;
N_Cand --- the number of recognition candidates returned by the recognition kernel for a single character image; as described in step 2-2, this value is constant and depends only on the recognition kernel itself;
P(c_j(x) | x) --- the confidence that image x is recognized as c_j(x); this is the quantity we need to estimate;
For all the candidates obtained in step 7-3, the candidate confidences are estimated with a confidence estimation method based on a Gaussian model:
$P(c_j(x) \mid x) = \frac{\exp(-d_j(x)/\theta)}{\sum_{k=1}^{N_{\mathrm{Cand}}} \exp(-d_k(x)/\theta)}$, where θ is the value computed earlier in step 2-3;
The estimated confidences are then corrected with the previously computed confusion matrix (calculated in step 2-4); the corrected confidence is obtained as $P(c_j(x) \mid x) = \sum_{k=1}^{N_{\mathrm{Cand}}} P(c_k(x) \mid x)\, P(c_j(x) \mid c_k(x))$;
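A minimal sketch of step 8-1, assuming the recognition distances, the scale θ and the confusion matrix are available; `confusion[a, b]` is taken to hold P(char_b | char_a) as defined in step 2-2-2.

```python
import numpy as np

def candidate_confidences(distances, theta):
    """Step 8-1: softmax of the negative recognition distances scaled by theta."""
    w = np.exp(-np.asarray(distances, dtype=float) / theta)
    return w / w.sum()

def corrected_confidences(labels, confidences, confusion):
    """Correct each candidate's confidence with the confusion matrix:
    P'(c_j | x) = sum_k P(c_k | x) * P(c_j | c_k)."""
    labels = np.asarray(labels)
    p = np.asarray(confidences, dtype=float)
    sub = confusion[np.ix_(labels, labels)]          # P(c_j | c_k) submatrix
    return p @ sub                                   # entry j = sum_k p_k * sub[k, j]
```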
Step 8-2: computing the semantic-recognition confidence cost based on the bigram
For each of the K segmentation paths of the line image that have been generated, the corresponding average semantic-recognition confidence cost is computed by the following method; we first state the following symbols and their meanings:
n_k --- merging the sub-character images according to the k-th segmentation path yields n_k merged character images in total;
image_{k,t} --- the t-th character image obtained by merging the sub-character images according to the k-th segmentation path, where 1 ≤ k ≤ K and 1 ≤ t ≤ n_k;
c_j(image_{k,t}) --- the j-th recognition candidate given by the recognition kernel for the character image image_{k,t}, where 1 ≤ j ≤ N_Cand, 1 ≤ k ≤ K, 1 ≤ t ≤ n_k, with P(c_j(image_{k,t}) | image_{k,t}) the corresponding recognition confidence;
Since the recognition of the character images and the estimation of the confidences were completed in step 7-3, in this step the required recognition results and confidences are obtained from CharCandidatesSet by querying LookupTable;
For the k-th candidate segmentation path, 1 ≤ k ≤ K, the semantic-recognition cost is computed with the following Viterbi method:
Let Q[n_k][N_Cand] be a two-dimensional array, where Q[t][j] stores the logarithm of the probability of the most likely selection of candidates from some candidate of the first character image up to the candidate node c_j(image_{k,t}); in addition, a two-dimensional pointer array Path[n_k][N_Cand] is used to record the computation process;
Initialization: t = 1, for 1 ≤ j ≤ N_Cand:
Path[1][j] = NULL,
Q[1][j] = log P(c_j(image_{k,1})) + log P(c_j(image_{k,1}) | image_{k,1}),
Recursion: for 2 ≤ t ≤ n_k and 1 ≤ j ≤ N_Cand compute Q[t][j]:
$$Q[t][j] = \max_{1 \le l \le N_{\mathrm{Cand}}} \left\{ Q[t-1][l] + \log P\big(c_j(\mathrm{image}_{k,t}) \mid c_l(\mathrm{image}_{k,t-1})\big) \right\} + \log P\big(c_j(\mathrm{image}_{k,t}) \mid \mathrm{image}_{k,t}\big)$$
In addition, find the l that maximizes Q[t−1][l] + log P(c_j(image_{k,t}) | c_l(image_{k,t-1})), denoted l*, i.e.
$$l^* = \arg\max_{1 \le l \le N_{\mathrm{Cand}}} \left\{ Q[t-1][l] + \log P\big(c_j(\mathrm{image}_{k,t}) \mid c_l(\mathrm{image}_{k,t-1})\big) \right\},$$
then let Path[t][j] point to the candidate node c_{l*}(image_{k,t-1}), i.e. the parent node of the candidate node c_j(image_{k,t}) is c_{l*}(image_{k,t-1});
Termination: t = n_k;
Finally find $j^* = \arg\max_{1 \le j \le N_{\mathrm{Cand}}} Q[n_k][j]$, backtrack along the path indicated by Path[n_k][j*], and output each candidate node on the path; the resulting character string is our character recognition result; while obtaining the optimal character string we have also obtained the log-probability of the most likely path, Q[n_k][j*]; this value is taken as the semantic-recognition cost of this segmentation path and, divided by n_k, gives the average semantic-recognition cost $H_k = \frac{Q[n_k][j^*]}{n_k}$;
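A compact sketch of the Viterbi recursion of step 8-2, assuming per-image candidate lists of (label, confidence) pairs and placeholder functions `unigram` and `bigram` for the language-model probabilities (all assumed strictly positive so the logarithms are defined).

```python
import math

def semantic_cost(char_candidates, bigram, unigram):
    """Viterbi over one segmentation path.  char_candidates[t] is the list of
    (label, confidence) pairs for image_{k,t+1}; bigram(prev, cur) and
    unigram(cur) return language-model probabilities.
    Returns (recognized labels, average semantic-recognition cost H_k)."""
    n_k = len(char_candidates)
    # Initialization: Q[1][j] = log P(c_j) + log P(c_j | image_{k,1})
    q = [math.log(unigram(lab)) + math.log(conf)
         for lab, conf in char_candidates[0]]
    back = [[None] * len(char_candidates[0])]
    for t in range(1, n_k):
        q_new, back_t = [], []
        for lab, conf in char_candidates[t]:
            scores = [q[l] + math.log(bigram(prev_lab, lab))
                      for l, (prev_lab, _) in enumerate(char_candidates[t - 1])]
            l_star = max(range(len(scores)), key=scores.__getitem__)
            q_new.append(scores[l_star] + math.log(conf))
            back_t.append(l_star)
        q, back = q_new, back + [back_t]
    j_star = max(range(len(q)), key=q.__getitem__)
    labels, j = [], j_star
    for t in range(n_k - 1, -1, -1):                 # backtrack the best path
        labels.append(char_candidates[t][j][0])
        if back[t][j] is not None:
            j = back[t][j]
    labels.reverse()
    return labels, q[j_star] / n_k
```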
Step 9: fusing the geometric cost and the semantic cost to give the final result
Step 9-1: estimating the fusion parameter λ of the geometric cost and the semantic-recognition cost
We adopt the following notation:
N_L --- the number of line images for which the sub-characters and the correct sub-character merging are given, i.e. the number of training samples;
n_{i,k} --- the number of characters in the k-th candidate segmentation of the i-th training sample;
n_{i,0} --- the number of characters obtained from the correct segmentation of the i-th training sample;
g_{i,k} --- the geometric cost of the k-th candidate segmentation path of the i-th training sample; thus g_{i,1} is the minimum geometric cost among all segmentations of the i-th training sample;
G_{i,k} --- the average geometric cost of the k-th candidate segmentation of the i-th training sample, taken after normalization;
H_{i,k} --- the average semantic-recognition cost of the k-th candidate segmentation of the i-th training sample;
g_{i,0} --- the geometric cost of the completely correct segmentation of the i-th training sample;
G_{i,0} --- the average geometric cost of the completely correct segmentation of the i-th training sample, taken after normalization;
H_{i,0} --- the average semantic-recognition cost of the completely correct segmentation of the i-th training sample;
We select N_L training line images and process each of them in the order of step 3 to step 8, thereby obtaining n_{i,k}, g_{i,k}, H_{i,k} for 1 ≤ i ≤ N_L and 1 ≤ k ≤ K; the geometric costs estimated in step 6 are normalized and averaged as $G_{i,k} = \frac{1}{n_{i,k}} \log\left(\lambda e^{-\lambda (g_{i,k}/g_{i,1} - 1)}\right)$ for 1 ≤ i ≤ N_L, 1 ≤ k ≤ K; in the same way we obtain the normalized geometric score G_{i,0}, 1 ≤ i ≤ N_L, and the average semantic-recognition cost H_{i,0}, 1 ≤ i ≤ N_L, of the correct segmentations; write $T_i^k(\lambda) = H_{i,k} + G_{i,k}$ for 1 ≤ k ≤ K, and let $T_i^0(\lambda)$ be the T value of the completely correct segmentation of the i-th sample, i.e. $T_i^0(\lambda) = H_{i,0} + G_{i,0}$; minimizing $N(\lambda) = \sum_{i=1}^{N_L} \#\{T_i^k(\lambda) > T_i^0(\lambda) \mid 1 \le k \le K\}$ yields the estimate of the weighting coefficient λ;
Here $\#\{T_i^k(\lambda) > T_i^0(\lambda) \mid 1 \le k \le K\}$ denotes, for a given λ, the number of candidate segmentation paths among the K candidates of the i-th sample image whose T value is larger than the T value of the correct segmentation; the minimization again uses the trial-and-error search used to minimize θ;
Step 9-2: computing the optimal segmentation-recognition path with the fusion parameter λ
For a general line image to be segmented, the K candidate segmentation paths are computed according to steps 3 to 8, and for every path the average semantic-recognition cost H_k, 1 ≤ k ≤ K, and the normalized average geometric cost $G_k = \frac{1}{n_k} \log\left(\lambda e^{-\lambda (g_k/g_1 - 1)}\right)$, 1 ≤ k ≤ K, are computed, where g_k, 1 ≤ k ≤ K, is the geometric cost of the k-th segmentation path obtained in step 6; the combined cost is then $T_k = H_k + G_k$, 1 ≤ k ≤ K; taking $k^* = \arg\max_{1 \le k \le K} T_k$, the segmentation given by the k*-th candidate is output as the optimal segmentation.
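A sketch of the fusion of step 9, assuming the per-path quantities g_k, n_k and H_k have been computed as above; `lam_grid` stands in for the trial-and-error search over λ mentioned in step 9-1.

```python
import math

def normalized_geometric_cost(g_k, g_1, n_k, lam):
    """G_k = (1/n_k) * log(lambda * exp(-lambda * (g_k / g_1 - 1)))."""
    return (math.log(lam) - lam * (g_k / g_1 - 1.0)) / n_k

def best_segmentation(candidates, lam):
    """candidates: list of (g_k, n_k, H_k) per candidate path.  Returns the
    index k* of the path maximizing T_k = H_k + G_k."""
    g_1 = min(g for g, _, _ in candidates)
    totals = [h + normalized_geometric_cost(g, g_1, n, lam)
              for g, n, h in candidates]
    return max(range(len(totals)), key=totals.__getitem__)

def estimate_lambda(training, lam_grid):
    """Step 9-1: each training sample is (correct, candidates) with
    correct = (g_0, n_0, H_0).  Pick the lambda that minimizes the number of
    candidate paths scoring above the correct segmentation."""
    def errors(lam):
        total = 0
        for (g0, n0, h0), cands in training:
            g1 = min(g for g, _, _ in cands)
            t0 = h0 + normalized_geometric_cost(g0, g1, n0, lam)
            total += sum(h + normalized_geometric_cost(g, g1, n, lam) > t0
                         for g, n, h in cands)
        return total
    return min(lam_grid, key=errors)
```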
CNB2005100121952A 2005-07-15 2005-07-15 Off-line hand writing Chinese character segmentation method with compromised geomotric cast and sematic discrimination cost Expired - Fee Related CN100347723C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100121952A CN100347723C (en) 2005-07-15 2005-07-15 Off-line hand writing Chinese character segmentation method with compromised geomotric cast and sematic discrimination cost

Publications (2)

Publication Number Publication Date
CN1719454A CN1719454A (en) 2006-01-11
CN100347723C true CN100347723C (en) 2007-11-07

Family

ID=35931285

Country Status (1)

Country Link
CN (1) CN100347723C (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1369877A (en) * 2000-10-04 2002-09-18 微软公司 Method and system for identifying property of new word in non-divided text
JP2003150902A (en) * 2001-09-27 2003-05-23 Canon Inc Method and device for dividing image into character image lines, character image recognizing method and device
CN1482571A (en) * 2003-04-11 2004-03-17 清华大学 Statistic handwriting identification and verification method based on separate character

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yao Zhengbin, et al., "An online Chinese character segmentation algorithm based on stroke merging and dynamic programming", Journal of Tsinghua University (Science and Technology), Vol. 44, No. 10, 2004 *


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20071107

Termination date: 20130715