CN106570518A

CN106570518A - Chinese and Japanese handwritten text identification method

Info

Publication number: CN106570518A
Application number: CN201610895677.5A
Authority: CN
Inventors: 刘建生
Original assignee: Shanghai Newreal Auto-system Co Ltd
Current assignee: Shanghai Newreal Auto-system Co Ltd
Priority date: 2016-10-14
Filing date: 2016-10-14
Publication date: 2017-04-19

Abstract

The invention discloses a method for recognizing Chinese and Japanese handwritten text. The method recognizes handwritten input character strings with relatively high accuracy and constitutes a robust model for online handwritten character string recognition. The method comprises the following steps: stroke gaps are over-segmented, and undetermined points are introduced in addition to segmentation points and non-segmentation points; each candidate character pattern is associated with multiple candidate classes; based on a probabilistic approximation over the handwritten text and the character string classes, the confidence of each candidate segmentation-recognition path is evaluated by combining character recognition, unary and binary geometric features, and linguistic context. The path evaluation criterion flexibly combines the context scores and does not vary with path length, so the optimal segmentation path and the corresponding character string can be found efficiently by Viterbi search. In addition, the parameters of the evaluation model are estimated with a genetic algorithm so as to optimize overall character string recognition performance.

Description

A method for recognizing Chinese and Japanese handwritten text

Technical field

The invention relates to a method for recognizing Chinese and Japanese handwritten text. Specifically, it proposes a robust model for online handwritten recognition of Chinese and Japanese character strings that combines character recognition, geometric features, and linguistic context to evaluate the confidence of candidate segmentations and their corresponding character strings.

Background art

With the development of large writing surfaces such as tablet computers and electronic whiteboards, and of pen devices such as digital pens (for example, Anoto pens), people want to write more freely, which requires devices to recognize handwritten text, that is, character strings. Compared with isolated character recognition, handwritten text recognition faces one difficulty: the text cannot be reliably segmented before the characters are recognized. A further difficulty is that people often write cursively in continuous writing.

Continuous Chinese or Japanese handwritten text is hard to segment because the spaces between characters are not obvious: many characters consist of several radicals with gaps between them, and in cursive writing some characters are joined together. Existing segmentation methods attempt to split a character string using only the geometric layout of the characters (gaps, character size or position, and their relationships). Without character recognition information and linguistic context, a character string cannot be segmented unambiguously. A feasible way to overcome this ambiguity is integrated segmentation and recognition, which is further divided into implicit segmentation and explicit segmentation. Implicit segmentation, also called segmentation-free recognition, mainly uses hidden Markov models (HMMs): it simply cuts the string pattern into frames of equal length, labels them, and associates them with character strings during recognition. This approach does not capture enough character shape information. Explicit segmentation instead tries to split the pattern at its true character boundaries and label the resulting pieces, so character patterns can be recognized better. It is usually done in two steps: over-segmentation, followed by path evaluation and search. The string pattern is over-segmented into primitive segments, each of which is a character or part of a character. Consecutive segments are then combined into candidate character patterns (forming a candidate lattice), which are evaluated by character recognition together with geometric and linguistic context.

In over-segmentation-based string recognition, the key question is how to evaluate the candidate characters along each path of the candidate lattice. Under an ideal criterion, the path with the maximum score corresponds to the correct segmentation. In HMM-based recognition, every frame yields one feature vector in a single sequence, and the HMM classifies characters accordingly. In contrast, the paths of an over-segmentation candidate lattice have different lengths and correspond to different feature vector sequences, so the Bayesian decision theory that underlies HMM-based recognition cannot be applied directly to over-segmentation-based methods. Instead, candidate character recognition scores and context scores are combined heuristically to evaluate paths. Such heuristic criteria fall into two groups: summation criteria and normalized criteria. A summation criterion is the sum of the per-character log-likelihoods, or equivalently the product of the probabilities. Because each likelihood is usually less than 1, the sum (or product) tends to favor paths with fewer characters, so characters are easily over-merged. A normalized criterion, on the other hand, tends to over-split characters, because the summed score is divided by the number of segmented characters. Another problem with the normalized criterion is that it does not change monotonically as the path grows, so finding its optimal path cannot be guaranteed by Viterbi search or dynamic programming (DP).

To make better use of character shape information in HMM-based recognition while keeping the path evaluation criterion monotonic, a variable-duration HMM can be used: multiple frames are concatenated to obtain the likelihood of the whole candidate character shape, which replaces the frame-wise likelihoods. In an over-segmentation-based system, this corresponds to weighting the score of each candidate character by its number of primitive segments, which improves recognition accuracy and search efficiency. Such an over-segmentation-based method weights the character recognition score in the summation criterion by the number of primitive segments, which removes the influence of segmentation length on the result. However, this method weights only the character recognition score and does not explain why the geometric and linguistic scores are left unweighted. Such a path evaluation criterion cannot be derived whether labels are assigned to primitive segments or to character patterns. A method is therefore needed that decides whether each factor should be weighted by the number of primitive segments, and that sets the degree of weighting automatically.

Summary of the invention

1. Recognition process of the invention:

As a method for recognizing online handwritten text, the integrated segmentation and recognition system has five main steps. The input is a handwritten character string, or a text line, consisting of a sequence of strokes.

Step 100: Over-segmentation. Each stroke gap (the pen lift between two consecutive strokes) is classified as a segmentation point (SP), a non-segmentation point (NSP), or an undetermined point. A segmentation point separates two characters at the stroke gap, a non-segmentation point indicates that the stroke gap lies inside a character, and the remaining stroke gaps are classified as undetermined points according to certain geometric features. The group of consecutive strokes between two adjacent segmentation points or undetermined points is a primitive segment, and one or more consecutive primitive segments form a candidate character pattern.
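The grouping described in step 100 can be sketched as follows. This is a minimal illustration, not the patented implementation: the label strings (including `"UP"` for an undetermined point), the function names, and the `max_len` cap on segments per candidate are all assumptions introduced here.

```python
from typing import List, Tuple

SP, NSP, UP = "SP", "NSP", "UP"  # "UP" (undetermined point) is a label assumed here


def primitive_segments(gaps: List[str]) -> Tuple[List[List[int]], List[str]]:
    """Group stroke indices into primitive segments.

    gaps[i] labels the gap between stroke i and stroke i + 1.
    A segment is closed at every SP or UP gap; NSP gaps keep strokes together.
    Returns the segments and the labels of the boundaries between them.
    """
    segments, boundaries, current = [], [], [0]
    for i, label in enumerate(gaps):
        if label == NSP:
            current.append(i + 1)
        else:
            segments.append(current)
            boundaries.append(label)
            current = [i + 1]
    segments.append(current)
    return segments, boundaries


def candidate_patterns(boundaries: List[str], max_len: int = 4) -> List[Tuple[int, int]]:
    """List (first, last) segment ranges that may form one candidate character.

    A candidate may span several consecutive primitive segments,
    but it never extends across a boundary labelled SP.
    """
    n = len(boundaries) + 1
    candidates = []
    for first in range(n):
        for last in range(first, min(first + max_len, n)):
            candidates.append((first, last))
            if last < n - 1 and boundaries[last] == SP:
                break
    return candidates


segments, boundaries = primitive_segments([NSP, UP, SP, NSP])
patterns = candidate_patterns(boundaries)
```

With five strokes and the gap labels above, the strokes fall into three primitive segments, and the only multi-segment candidate is the one that spans the undetermined boundary.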

Over-segmentation methods usually use a support vector machine (SVM) classifier to divide stroke gaps into segmentation points and non-segmentation points. The classifier extracts multi-dimensional features, such as the distance and overlap between the adjacent strokes at a stroke gap, and achieves fairly high classification accuracy. However, when adjacent strokes overlap heavily, a true segmentation point tends to be misclassified as a non-segmentation point, and when the gap between adjacent strokes is large, a true non-segmentation point tends to be misclassified as a segmentation point. A stroke gap that cannot be classified reliably must therefore be labeled an undetermined point, which reduces classification errors.

The present invention uses an improved two-step classification scheme that reduces the number of undetermined points while keeping misclassification of segmentation points and non-segmentation points as low as possible. First, two geometric features are used to classify stroke gaps into non-segmentation points and hypothetical segmentation points. Then an SVM classifier further classifies some of the hypothetical segmentation points as segmentation points. The process is as follows:

1) Hypothetical segmentation

Hypothetical segmentation points are obtained by extracting two features at each stroke gap: a horizontal distance feature and an intersection length feature. The horizontal distance feature is computed from two bounding boxes: one enclosing all strokes before the stroke gap, denoted BB_{p_all}, and one enclosing all strokes after it, denoted BB_{s_all}. The distance DB_x is defined as:

DB_x = left_bound(BB_{s_all}) - right_bound(BB_{p_all}) (1)

The horizontal distance feature f_d is then given by:

f_d = DB_x / acs (2)

where acs is the average character size. It is estimated by measuring the longer side of the bounding box of each stroke, sorting these lengths over all strokes, and averaging the largest third.
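Equations (1) and (2) and the acs estimate can be sketched as below. The representation of a stroke as a list of (x, y) points, the function names, and the rounding used for "the largest third" (integer division, at least one stroke) are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]


def bounding_box(strokes: List[Stroke]) -> Tuple[float, float, float, float]:
    """Return (left, top, right, bottom) of the box enclosing all given strokes."""
    xs = [x for stroke in strokes for x, _ in stroke]
    ys = [y for stroke in strokes for _, y in stroke]
    return min(xs), min(ys), max(xs), max(ys)


def average_character_size(strokes: List[Stroke]) -> float:
    """Estimate acs: mean of the largest third of per-stroke longer box sides."""
    longer_sides = []
    for stroke in strokes:
        left, top, right, bottom = bounding_box([stroke])
        longer_sides.append(max(right - left, bottom - top))
    longer_sides.sort(reverse=True)
    top_third = longer_sides[: max(1, len(longer_sides) // 3)]  # rounding is assumed
    return sum(top_third) / len(top_third)


def horizontal_distance_feature(strokes: List[Stroke], gap_index: int, acs: float) -> float:
    """f_d for the gap between stroke gap_index and stroke gap_index + 1."""
    before = strokes[: gap_index + 1]   # all strokes before the gap
    after = strokes[gap_index + 1:]     # all strokes after the gap
    db_x = bounding_box(after)[0] - bounding_box(before)[2]  # equation (1)
    return db_x / acs                                        # equation (2)


strokes = [
    [(0.0, 0.0), (10.0, 0.0)],   # longer side 10
    [(2.0, 0.0), (2.0, 6.0)],    # longer side 6
    [(15.0, 0.0), (15.0, 12.0)], # longer side 12
]
acs = average_character_size(strokes)
f_d = horizontal_distance_feature(strokes, gap_index=1, acs=acs)
```

A positive f_d means the strokes after the gap start to the right of everything written before it; a negative value indicates horizontal overlap.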

To compute the intersection length feature, all strokes before the stroke gap and all strokes after it are considered; the former set is denoted S_{p_all} and the latter S_{s_all}. If a pair of strokes s_p ∈ S_{p_all} and s_s ∈ S_{s_all} intersect at a point p, the length of s_p from p to its right endpoint is computed and denoted l_p, and the length of s_s from p to its left endpoint is computed and denoted l_s. The intersection length between s_p and s_s is defined as:

The accumulated length is:

The overall intersection length f_i at a stroke gap is then defined as:

At this point, the stroke gaps have been divided into non-segmentation points and hypothetical segmentation points. Because the string pattern cannot be split at a non-segmentation point, the next step ignores non-segmentation points and uses an SVM classifier to further classify the hypothetical segmentation points.

2) Support vector machine classification

The hypothetical segmentation points are classified with an SVM classifier based on geometric features. Two additional features are used: the intersection length feature f_i defined in the hypothetical segmentation step, and a newly introduced width feature of the hypothetical segmentation point, defined as the width from the stroke before the hypothetical segmentation point to the stroke after it, divided by the average character size acs.

The SVM is trained with stroke gaps labeled 1 for segmentation points and -1 for non-segmentation points. At each hypothetical segmentation point of a string pattern under test, the SVM output is converted into a confidence value, which is then incorporated into the evaluation criterion for candidate segmentation-recognition paths.

A hypothetical segmentation point is selected as a segmentation point when its SVM output exceeds one threshold and its width feature exceeds another threshold. At a determined segmentation point, the adjacent primitive segments cannot be merged to form a candidate character pattern. Reducing the number of hypothetical segmentation points shrinks the candidate segmentation lattice and makes string recognition more efficient.

All remaining stroke gaps, that is, the remaining hypothetical segmentation points, are kept as undetermined points; the number of such points determines the amount of computation. Eliminating false segmentations adds SVM processing time but reduces the processing time of later steps, and it improves the recognition rate obtained in path evaluation.
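The two-threshold rule of the second stage can be sketched as follows. The text does not state the threshold values, so the numbers below are placeholders, and the function name and the `"UP"` label for undetermined points are assumptions.

```python
from typing import List


def label_hypothetical_points(
    svm_outputs: List[float],
    width_features: List[float],
    svm_threshold: float,
    width_threshold: float,
) -> List[str]:
    """Label each hypothetical segmentation point as "SP" or "UP".

    A point becomes a segmentation point only if BOTH its SVM output and its
    width feature exceed their thresholds; otherwise it stays undetermined.
    """
    labels = []
    for output, width in zip(svm_outputs, width_features):
        is_sp = output > svm_threshold and width > width_threshold
        labels.append("SP" if is_sp else "UP")
    return labels


# Placeholder thresholds, chosen only for this example.
labels = label_hypothetical_points(
    svm_outputs=[1.5, 0.2, 1.2],
    width_features=[0.9, 0.9, 0.1],
    svm_threshold=1.0,
    width_threshold=0.5,
)
```

Requiring both conditions keeps the rule conservative: a point is fixed as a segmentation point only when the classifier is confident and the geometry agrees, and every other point remains available to the path search.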

Step 102: Candidate lattice construction. Through character classification, each candidate character pattern is associated with several candidate classes, each with a confidence score. All combinations of candidate patterns and character classes are represented by a segmentation-recognition candidate lattice, in which each arc represents a segmentation point and each node represents a character class assigned to a candidate pattern.

Step 104: Confidence evaluation. The confidence scores of the candidate lattice and the corresponding character strings are evaluated.

Step 106: Path evaluation. Each segmentation path in the candidate lattice, together with its corresponding character string, is evaluated by combining the candidate character scores with the compatibility between characters, that is, the geometric and linguistic context probabilities.

The string pattern is represented as a sequence of primitive segments, X = s₁,…,s_m, which is partitioned into character patterns Z = z₁,…,z_n, where each candidate pattern contains k_i primitive segments: z_i = s_{j_i},…,s_{j_i+k_i-1}. The segmented character patterns are then assigned to classes C = C₁,…,C_n. To score the string X against the character string C, feature values are extracted from the primitive segments (or candidate patterns) and from the compatibility of the gaps between segments (or between characters). The features comprise the bounding box feature b_i, the internal gap feature q_i, the shape feature s_i or z_i, the unary position feature p^u_i of a single segment (or character), the binary position feature p^b_i of adjacent segments (or characters), and the segment gap feature g_i, which is the basis for deciding segmentation points and non-segmentation points in step 100.

Denoting the feature sequences of the primitive segments by b, q, X, p^u, g, the posterior probability of the character string is:

where the character recognizer and the geometric scoring functions are trained on a database.

Ignoring the denominator, which does not depend on the character string, and reasonably assuming that the features are mutually independent, the evaluation of the character string is equivalent to:

In formula (7), c_i denotes a hypothesized class of a primitive segment (called an over-segmentation class), and t_i takes the value SP or NSP, denoting a segmentation point or a non-segmentation point, respectively. A character class C_j comprises one or more consecutive c_i. p(z_i|C_i) is given by the character recognizer trained on the database and is a normalized class-conditional recognition score; p(b_i|C_i), p(q_i|C_i), p(p^u_i|C_i) and p(p^b_i|C_{i-1}C_i) are geometric scoring functions, obtained from quadratic discriminant function (QDF) classifiers trained on the database.

The linguistic prior P(C) is expressed by the trigram P(c_i|c_{i-2}c_{i-1}) over the over-segmentation classes of the primitive segments. Because trigrams over over-segmentation classes are difficult to obtain, the character-class trigram P(C_i|C_{i-2}C_{i-1}) is used to approximate P(c_i|c_{i-2}c_{i-1}), where C_i contains c_i.

In formula (8), λ₁₁ and λ₁₂ are weight factors, and λ₁ balances the bias due to the number of characters. Because the starting stroke and the non-starting strokes of a character pattern have different influence, different weights must be used when estimating their probabilities.

As above, the features extracted from the primitive segments are approximated by the features extracted from the candidate character patterns, and different weights are likewise used for the starting and non-starting strokes for the other features, giving the path score:

The weight factors λ_h1, λ_h2 (h = 1~7) and λ are selected on the training data set by using a genetic algorithm to optimize string recognition performance.

To reduce unimportant parameters and simplify learning, and thereby improve recognition accuracy, λ_h2 is set to 0 for h ≠ 4, that is, only the weight of P₄ is retained, and formula (9) becomes:

which further gives:

The trigram probability P(C_i|C_{i-2}C_{i-1}) can be computed from a text corpus. When C_i is the first or second character of a sentence, the trigram degenerates to a unigram or a bigram, respectively. To cope with imprecise estimates caused by insufficient training data, the trigram is smoothed as follows:

P'(C_i|C_i-2,C_i-1)=β₁P(C_i|C_i-2,C_i-1)+β₂P(C_i|C_i-1)+β₃P(C_i)+β₄, (12)

The weights are estimated from a different text corpus and satisfy β₁+β₂+β₃+β₄=1.
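The interpolation in formula (12) can be sketched as below. The text only says that the n-gram probabilities are computed from a text corpus; estimating them from raw counts, as done here, is an assumption, as are the function names and the example β values.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def count_ngrams(text: str) -> Tuple[Counter, Counter, Counter]:
    """Count character unigrams, bigrams and trigrams in a training text."""
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    trigrams = Counter(zip(text, text[1:], text[2:]))
    return unigrams, bigrams, trigrams


def smoothed_trigram(c: str, prev2: str, prev1: str,
                     unigrams: Counter, bigrams: Counter, trigrams: Counter,
                     betas: Sequence[float]) -> float:
    """P'(c | prev2, prev1) as in formula (12), using count-based estimates."""
    b1, b2, b3, b4 = betas
    assert math.isclose(sum(betas), 1.0), "the four weights must sum to 1"
    total = sum(unigrams.values())
    p_uni = unigrams[c] / total if total else 0.0
    p_bi = bigrams[(prev1, c)] / unigrams[prev1] if unigrams[prev1] else 0.0
    p_tri = (trigrams[(prev2, prev1, c)] / bigrams[(prev2, prev1)]
             if bigrams[(prev2, prev1)] else 0.0)
    return b1 * p_tri + b2 * p_bi + b3 * p_uni + b4


uni, bi, tri = count_ngrams("abcabd")
betas = (0.5, 0.3, 0.1, 0.1)  # example values only
p_seen = smoothed_trigram("c", "a", "b", uni, bi, tri, betas)
p_unseen = smoothed_trigram("z", "a", "b", uni, bi, tri, betas)
```

The constant term β₄ guarantees that a character sequence never seen in the corpus still receives a non-zero probability, so a single unseen trigram cannot rule out an otherwise good path.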

To maintain scale invariance, the geometric features b_i, q_i, p^u_i and p^b_i are normalized by the average character size acs. The feature vector b_i consists of the height and width of the bounding box of each character pattern. The feature vector q_i consists of six values, as shown in Fig. 6: the first three are the horizontal extents of three vertical gaps (found from the vertical projection), and the last three are the vertical extents of three horizontal gaps (found from the horizontal projection). The feature vector p^u_i contains the vertical distances from the center line to the top and bottom of the bounding box. The feature vector p^b_i has two elements obtained from the bounding boxes of two adjacent character patterns: the vertical distance between their upper bounds and the vertical distance between their lower bounds. p(p^b₁|C₁,C₀) is set to 1. To reduce the complexity of p(p^b_i|C_{i-1}C_i), the character classes are grouped into six super-classes according to the mean unary position vectors in the training samples, so that p(p^b_i|C_{i-1}C_i) is replaced by p(p^b_i|C'_{i-1},C'_i), where C'_{i-1} and C'_i are super-classes. Using QDF classifiers, the geometric feature vectors b_i, q_i, p^u_i and p^b_i are converted into log-likelihood scores, which are used in formula (11). The character shape score p(z_i|C_i) is given by a character recognizer. The feature vector g_i contains several measurements taken at a candidate segmentation point, which lies between two adjacent primitive segments.

p(g_i|SP) and p(g_i|NSP) are approximated using the SVM classifier. To obtain the probabilities p(o_i|SP) and p(o_i|NSP), where o_i is the SVM output for g_i, a warping function is applied to the SVM output. The warping function is obtained from the SVM outputs on a validation data set. p(o₁|SP) is set to 1.

To warp the SVM output, the histograms of p(o_i|SP) and p(o_i|NSP) are first obtained, and the cumulative probabilities p'(o_i|SP) and p'(o_i|NSP) are then computed:

p'(o_i|SP) and p'(o_i|NSP) are then each fitted with a sigmoid function, whose parameters are obtained by minimizing the squared error. The parameters of the two sigmoid functions are determined separately.

The weight factors λ_h1, λ_h2 (h = 1~7) and λ are trained by a genetic algorithm on the training string patterns so as to maximize the recognition rate on the training data. To do this, λ_h1, λ_h2 (h = 1~7) and λ are treated as the elements of a chromosome. The weight factors are estimated by the genetic algorithm in the following steps:

1) Initialization: N chromosomes are initialized with random values between 0 and 1, the average fitness f_old of the N chromosomes is set to 0, and the time t is set to 1.

2) Crossover: Two chromosomes are selected at random from the N chromosomes and crossed at two random positions to produce two new chromosomes. This step is repeated until M new chromosomes are obtained.

3) Mutation: With probability P_mut, each element of each of the N+M chromosomes is changed by a random value between -1 and 1.

4) Fitness evaluation: The fitness of each chromosome is evaluated as the recognition rate on the training data obtained with the weight values encoded in that chromosome.

5) Selection: The roulette probability of each chromosome is determined from its fitness. The two chromosomes with the highest fitness are selected first, and further chromosomes are then selected by roulette until N new chromosomes are obtained. The new chromosomes replace the original N chromosomes.

6) Iteration: The average fitness f_new of the new N chromosomes is computed. If (f_new - f_old < threshold) has occurred n_stop times, or t > T, the chromosome with the highest fitness is returned. Otherwise, f_old is set to f_new, t is incremented, and the procedure returns to step 2).

To evaluate the fitness of a chromosome, the optimal path of each training string pattern is searched for, with path scores evaluated using the weight values in the chromosome.
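Steps 1) to 6) can be sketched as the loop below. In the patent the fitness is the string recognition rate on the training data; here a toy fitness function stands in for it. The default parameter values, the reading of the mutation step as adding a random offset, and the small constant that keeps roulette weights positive are all assumptions of this sketch.

```python
import random
from typing import Callable, List

Chromosome = List[float]


def two_point_crossover(a: Chromosome, b: Chromosome, rng: random.Random):
    """Swap the genes between two random cut positions."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def genetic_search(fitness: Callable[[Chromosome], float], n_genes: int,
                   n: int = 20, m: int = 20, p_mut: float = 0.05,
                   eps: float = 1e-4, n_stop: int = 5, t_max: int = 200,
                   seed: int = 0) -> Chromosome:
    """Return the fittest chromosome found; fitness must be non-negative."""
    rng = random.Random(seed)
    # 1) Initialization: N chromosomes with genes drawn from [0, 1).
    population = [[rng.random() for _ in range(n_genes)] for _ in range(n)]
    f_old, stalls = 0.0, 0
    for _ in range(t_max):
        # 2) Crossover: create M children from randomly chosen parents.
        children: List[Chromosome] = []
        while len(children) < m:
            a, b = rng.sample(population, 2)
            children.extend(two_point_crossover(a, b, rng))
        pool = population + children[:m]
        # 3) Mutation: with probability p_mut, shift a gene by a value in [-1, 1].
        pool = [[g + rng.uniform(-1.0, 1.0) if rng.random() < p_mut else g
                 for g in chrom] for chrom in pool]
        # 4) Fitness evaluation of all N + M chromosomes.
        ranked = sorted(pool, key=fitness, reverse=True)
        weights = [fitness(chrom) + 1e-12 for chrom in ranked]
        # 5) Selection: keep the two best, fill the rest by roulette wheel.
        population = ranked[:2] + rng.choices(ranked, weights=weights, k=n - 2)
        # 6) Iteration: stop after n_stop generations with too little improvement.
        f_new = sum(fitness(chrom) for chrom in population) / n
        if f_new - f_old < eps:
            stalls += 1
            if stalls >= n_stop:
                break
        f_old = f_new
    return max(population, key=fitness)


def toy_fitness(weights: Chromosome) -> float:
    """Stand-in for the recognition rate: peaks at 1.0 when every weight is 0.5."""
    return 1.0 / (1.0 + sum((w - 0.5) ** 2 for w in weights))


best = genetic_search(toy_fitness, n_genes=3)
```

Because the fitness is only ever evaluated, never differentiated, the same loop works when the fitness is a recognition rate computed by running the path search on every training string.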

Step 108: String recognition. The path score in formula (11) is accumulated over primitive segments and therefore does not vary with the number of segmented character patterns. The optimal path can thus be found by Viterbi search (dynamic programming). By refining the path evaluation criterion and repeating the steps above, a better optimal path corresponding to correct segmentation and recognition is obtained. Finally, the optimal path with the highest score gives the final result of character segmentation and recognition.
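The search in step 108 can be sketched as a dynamic program over the candidate lattice. This is a simplified illustration under stated assumptions: scores are already additive log scores, the context term depends only on the previous class (a bigram, where the patent uses a trigram), and the lattice layout and function names are invented for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# candidates[(start, end)] lists (class, log_score) for the pattern made of
# primitive segments start .. end - 1 (half-open range).
Lattice = Dict[Tuple[int, int], List[Tuple[str, float]]]


def viterbi(n_segments: int, candidates: Lattice,
            context_score: Callable[[Optional[str], str], float]):
    """Return (best score, [(start, end, class), ...]) or None if no path exists."""
    # best[pos][cls] = (score, (start, previous class)) over paths covering
    # segments 0 .. pos - 1 whose last character has class cls.
    best: List[dict] = [dict() for _ in range(n_segments + 1)]
    best[0][None] = (0.0, None)
    for end in range(1, n_segments + 1):
        for start in range(end):
            for cls, unary in candidates.get((start, end), []):
                for prev, (prev_score, _) in best[start].items():
                    score = prev_score + unary + context_score(prev, cls)
                    if cls not in best[end] or score > best[end][cls][0]:
                        best[end][cls] = (score, (start, prev))
    if not best[n_segments]:
        return None
    cls = max(best[n_segments], key=lambda c: best[n_segments][c][0])
    total = best[n_segments][cls][0]
    path, pos = [], n_segments
    while pos > 0:  # follow back-pointers to recover the segmentation
        _, (start, prev) = best[pos][cls]
        path.append((start, pos, cls))
        pos, cls = start, prev
    return total, path[::-1]


lattice: Lattice = {
    (0, 2): [("A", -0.5)],   # segments 0 and 1 merged into one character
    (0, 1): [("B", -1.0)],
    (1, 2): [("C", -1.0)],
    (2, 3): [("D", -0.25)],
}
result = viterbi(3, lattice, lambda prev, cls: 0.0)
```

The dynamic program is exact only because the score of a path is a sum of local terms; a score divided by the number of characters would break this property, which is the reason given in the text for using beam search with a normalized criterion.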

Brief description of the drawings

Fig. 1 shows a flowchart of the steps of the handwritten text recognition method of the invention.

Fig. 2 shows a detailed flowchart of the handwritten text recognition method of the invention.

Fig. 3 shows a flowchart of the over-segmentation of the invention.

Fig. 4 shows examples of misclassified stroke gaps.

Fig. 5 shows a schematic diagram of some geometric features according to an embodiment of the invention.

Fig. 6 shows a schematic diagram of the internal gap features of a character pattern according to an embodiment of the invention.

Fig. 7 shows a schematic diagram of the thresholds set for hypothetical segmentation points according to an embodiment of the invention.

Fig. 8 shows an example of segments and the corresponding candidate lattice according to an embodiment of the invention.

Detailed description

The invention is further described below with reference to specific embodiments. It should be understood that these embodiments serve only to illustrate the invention and do not limit its scope.

Embodiment 1:

In this embodiment, to evaluate the string recognition model of the invention, the character recognizer and the geometric scoring functions are trained on a Japanese online handwriting database. The normalized recognition score gives the class-conditional probability p(z_i|C_i), and character recognition combines offline and online recognition methods. For the geometric scores, four QDF classifiers are trained, one each for p(b_i|C_i), p(q_i|C_i), p(p^u_i|C_i) and p(p^b_i|C_{i-1}C_i).

To score the linguistic context, an initial trigram table was prepared from the 1993 volume of the Asahi Shimbun and the 2002 volume of the Nikkei newspaper. The smoothing factors β₁, β₂, β₃, β₄ were estimated using the Japanese online handwriting database. By deleting n-grams that never occur, ignoring n-grams that occur rarely, and quantizing the log probabilities, the trigram data was reduced to 6 MB.

The handwriting database (Kondate) was collected from 100 people. Horizontally written text lines were extracted from it to train the weight parameters and to evaluate string recognition performance. The handwritten text lines of 75 people were used to train the SVM classifier, which yields the candidate segmentation point probabilities, and to obtain the weight factors of the path evaluation score. After training, the text lines of the remaining 25 people were used for testing. The training and test data are listed in Table 1. The experiments were run on a Pentium(R) 4 2.80 GHz processor with 512 MB of memory. To simplify computation, each weight value was first set to 1, and the top 100 recognition candidates (segmentation-recognition paths) were selected for each training string.

	Text lines	Character patterns	Character classes	Characters per line
Training	10 174	104 093	1 106	10.23
Test	3 511	35 686	790	16.89

Table 1: Training and test text-line data

In this embodiment, as shown in Fig. 7, a stroke gap is classified as a hypothetical segmentation point (undetermined point) if its horizontal distance feature is greater than 0 or if it falls within region OABCDE; otherwise it is classified as a non-segmentation point. Then, if the width between two consecutive hypothetical segmentation points, divided by acs, exceeds a threshold, the points between these two hypothetical segmentation points that were classified as non-segmentation points and fall within region OFGH are reclassified as undetermined points.

The distribution of the horizontal distance feature and the intersection length feature over the stroke gaps of a set of training strings is shown in Fig. 8. The results show that the embodiment separates segmentation points from non-segmentation points well.

Table 2: Text line recognition results obtained with the two over-segmentation methods

The over-segmentation of the invention (the two-stage classification method) is compared with over-segmentation that relies directly on the SVM output (the one-stage classification method). For a fair comparison, both methods are evaluated with the path evaluation criterion proposed by the invention. The one-stage over-segmentation method extracts 19 features from each stroke gap and applies an SVM to them to classify each stroke gap as a segmentation point, a non-segmentation point, or an undetermined point. Text line recognition performance is assessed by the character segmentation measure f (the F-measure of segmentation points), the character recognition rate C_r, and the average string recognition time T_{av_rec_tl}. The results are shown in Table 2.

Table 2 shows that, although the over-segmentation method of the invention takes more processing time than the one-stage method, it significantly improves character recognition and segmentation accuracy. The two-stage method leaves many stroke gaps undetermined. These undetermined points complicate the candidate lattice and increase the computation of string recognition, but they reduce classification errors and thereby improve recognition performance.

The performance of three path evaluation criteria is compared: the weighted-factor criterion of the invention; method one, proposed by Nakagawa et al. (Nakagawa, M., Zhu, B., Onuma, M.: A model of on-line handwritten Japanese text recognition free from line direction and writing format constraints. IEICE Trans. Inf. Syst. E88-D(8), 1815-1822 (2005)); and method two, proposed by Zhou et al. (Zhou, X.-D., Yu, J.-L., Liu, C.-L., Nagasaki, T., Marukawa, K.: Online handwritten Japanese character string recognition incorporating geometric context. In: Proceedings of the 9th International Conference on Document Analysis and Recognition, pp. 48–52. Curitiba, Brazil (2007)). Method one and method two correspond to formula (14) and formula (15), respectively. For a fair comparison, all three methods use the same trigram language model, the same character recognizer, and the same geometric context classifiers. The weight factors of each method, λ_h1, λ_h2 (h = 1~7), λ_i (i = 1~7) and λ, are optimized with the genetic algorithm. All three methods combine the same seven terms of the path evaluation in formula (9); the differences are that method one does not use the terms related to k_i (the number of primitive segments forming a character pattern), and method two normalizes the path score of method one by the number of segmented characters.

In method two, the path score is not accumulated along the character string, so beam search is used to find the optimal path in the candidate lattice. For the invention and method one, the optimal path is found by Viterbi search. For all three methods, ten candidate classes are kept for each character pattern in the candidate lattice.

To demonstrate the advantage of optimizing the weight factors with a genetic algorithm, the character recognition rate obtained with the genetic algorithm is compared with that obtained with the minimum classification error (MCE) criterion optimized by stochastic gradient descent. The MCE criterion optimized by stochastic gradient descent finds the optimal parameter vector λ by minimizing the difference between the score of the highest-scoring incorrect character string and the score of the correct character string:

L_MCE(λ, X) = σ(max(Score_incorrect) - Score_correct), (16)

σ(x) = (1 + e^-x)^-1

where Score_correct is the score of the correct path in the candidate lattice and Score_incorrect is the score of an incorrect path in the candidate lattice.
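Formula (16) can be evaluated directly once the path scores are known. The sketch below computes the loss for one training string; the function name and the example scores are illustrative assumptions, and the gradient step with respect to λ is not shown.

```python
import math
from typing import Sequence


def mce_loss(score_correct: float, scores_incorrect: Sequence[float]) -> float:
    """Sigmoid of (best incorrect path score - correct path score), formula (16)."""
    margin = max(scores_incorrect) - score_correct
    return 1.0 / (1.0 + math.exp(-margin))


loss_tie = mce_loss(2.0, [2.0, 1.0])       # best wrong path ties with the correct one
loss_good = mce_loss(5.0, [1.0, 0.5])      # correct path clearly ahead
loss_bad = mce_loss(1.0, [5.0, 0.5])       # a wrong path clearly ahead
```

The loss lies between 0 and 1 and drops below 0.5 exactly when the correct path outscores every incorrect path, so it acts as a smooth, differentiable stand-in for the string error that gradient descent can minimize.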

Table 3 shows the string recognition results of the three path evaluation methods. The weight values trained by genetic algorithm optimization are:

(λ₁₁, λ₁₂, λ₂₁, λ₂₂, λ₃₁, λ₃₂, λ₄₁, λ₄₂, λ₅₁, λ₅₂, λ₆₁, λ₆₂, λ₇₁, λ₇₂, λ)=(0.351,0.000, 0.265,0.001,0.199,0.000,1.000,0.641,0.009,0.000,0.100,0.000,0.100,0.000, 0.323,0.120,0.100)。

3 three kinds of path evaluation methods of form recognize the result of line of text

Calculate in weight factor in genetic algorithm, it can be seen that except character recognition score p (z_i|C_i) and non-segmentation score p(g_i| NSP), remaining geometric feature and language environment be not with original segments (λ_h2=0) quantity weighting.It means that Except character recognition score and non-cut-point score, the original of remaining geometric feature and language environment almost with character graphics The number for beginning to split block is unrelated.

The results show that, whether the weights are obtained by genetic algorithm optimization or under the minimum classification error criterion optimized by stochastic gradient descent, the path evaluation model of the present invention improves the accuracy of character recognition and segmentation. In Method 1, a shorter character string tends to receive a larger evaluation score than a longer one, so characters are easily over-merged. In Method 2, which uses a normalized path score, longer character strings tend to be produced, so characters are often over-segmented. The present invention overcomes these problems, and its path score does not change with the number of segmented characters. In addition, the processing times of the three methods are almost identical. Compared with parameter optimization under the minimum classification error criterion by stochastic gradient descent, genetic algorithm optimization achieves better character string recognition performance. This is because gradient descent reaches only a local optimum, whereas the genetic algorithm directly optimizes the character recognition rate on the training data (an objective that is not differentiable) and can thereby reach a global optimum.
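To make the contrast concrete, here is a minimal genetic-algorithm sketch in Python that maximizes an arbitrary fitness function over a weight vector without using gradients. It is not the configuration used in the invention: the population size, elitism, uniform crossover, Gaussian mutation, and the toy fitness function are all assumptions. In the setting of the text, `fitness` would be the character recognition rate measured on the training data.

```python
import random
from typing import Callable, List, Sequence


def genetic_optimize(
    fitness: Callable[[Sequence[float]], float],
    n_weights: int,
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.2,
    seed: int = 0,
) -> List[float]:
    """Return the fittest weight vector in [0, 1]^n_weights found by a simple GA."""
    if pop_size < 4:
        raise ValueError("pop_size must be at least 4")
    rng = random.Random(seed)                      # local generator for reproducibility
    population = [[rng.random() for _ in range(n_weights)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[: max(2, pop_size // 4)]    # survivors are copied unchanged
        children: List[List[float]] = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            # Uniform crossover: each weight comes from one of the two parents.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Mutation: occasionally perturb a weight and clip it back to [0, 1].
            child = [
                min(1.0, max(0.0, w + rng.gauss(0.0, 0.1)))
                if rng.random() < mutation_rate else w
                for w in child
            ]
            children.append(child)
        population = elite + children
    return max(population, key=fitness)


def toy_fitness(w: Sequence[float]) -> float:
    """Stand-in objective (hypothetical); peaks when every weight equals 0.5."""
    return -sum((x - 0.5) ** 2 for x in w)


best = genetic_optimize(toy_fitness, n_weights=3, seed=1)
```

Because the elite survive unchanged, the best fitness never decreases from one generation to the next, and the fitness function only needs to be evaluated, never differentiated.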

Because the weight factors are learned from data, in theory more parameters give more degrees of freedom. In practice, however, because the training database is limited, too many parameters complicate the learning problem and make it difficult to obtain the optimal parameters. In particular, according to the present invention, using formula (11), that is, keeping for λ_h2 only the weight factor of the important P₄ term and setting the rest to 0, helps the parameter optimization. The actual results also demonstrate that the method proposed by the present invention can further improve the segmentation accuracy and the character recognition accuracy, as shown in Table 4.

Table 4

Claims

1. A method for recognizing Chinese and Japanese handwritten text, characterized in that the method mainly comprises the following steps:

Step 100, over-segmentation of strokes, including classifying each inter-stroke gap as a segmentation point, a non-segmentation point, or an undetermined point;

Step 102, construction of a candidate lattice of character patterns, including classifying the characters and associating each candidate character pattern with multiple candidate classes;

Step 104, confidence evaluation of the segmented patterns, including evaluating the score of each candidate character pattern against its corresponding candidate classes;

Step 106, path evaluation of the character string, including accumulating the confidence scores and reducing the unimportant parameters in the evaluation;

Step 108, finding the optimal path and recognizing the character string; combining the evaluation of the candidate characters on a segmentation path with the score of the character string class and the character compatibility improves the path evaluation criterion and yields the best path corresponding to correct segmentation and recognition;

In step 100, the over-segmentation of strokes divides the character string pattern X = s₁,…,s_m into candidate character patterns Z = z₁,…,z_n, where each candidate character pattern contains k_i primitive segments, z_i = s_{j_i},…,s_{j_i+k_i−1}; the segmented character patterns are then classified as the class sequence C = C₁,…,C_n; a character class C_j consists of one or more consecutive c_i; c_i denotes a hypothetical class assigned to a primitive segment of the character string and is called an over-class;

In step 104, the confidence evaluation method evaluates the score of the character string pattern X against the corresponding character class sequence C by extracting the bounding-box feature b, the inner-gap feature q, the unary position feature p^u of a single segment or character, the inter-segment gap feature g, and the binary position feature p^b of adjacent segments or characters; the recognition and geometric scoring functions are obtained from the training database and include the conditional confidence p(z_i|C_i) and the geometric scores p(b_i|C_i), p(q_i|C_i), p(p_i^u|C_i) and p(p_i^b|C_{i−1}C_i). The posterior confidence of the character string is:

\begin{matrix} P (C | X) = P (C | b, q, X, p^{u}, p^{b}, g) \\ = \frac{p (b, q, X, p^{u}, p^{b}, g | C) P (C)}{p (b, q, X, p^{u}, p^{b}, g)} \end{matrix},

The evaluation of the character string class is equivalent to:

\begin{matrix} f (X, C) = \log [p (b, q, X, p^{u}, p^{b}, g | C) P (C)] \\ = \log P (C) + Σ_{i = 1}^{m} [\begin{matrix} \log p (b_{i} | c_{i}) + \log p (q_{i} | c_{i}) + \log p (s_{i} | c_{i}) \\ + \log p (p_{i}^{u} | c_{i}) + \log p (p_{i}^{b} | c_{i - 1}, c_{i}) + \log p (g_{i} | t_{i}) \end{matrix}] \end{matrix},

\begin{matrix} \log P (C) = Σ_{i = 1}^{m} \log P (c_{i} | c_{i - 2} c_{i - 1}) \\ = Σ_{i = 1}^{n} [\log P (c_{j_{i}} | c_{j_{i} - 2} c_{j_{i} - 1}) + Σ_{j = j_{i} + 1}^{j_{i} + k_{i} - 1} \log P (c_{j} | c_{j - 2} c_{j - 1})] \\ \approx Σ_{i = 1}^{n} [λ_{11} \log P (C_{i} | C_{i - 2} C_{i - 1}) + λ_{12} Σ_{j = j_{i} + 1}^{j_{i} + k_{i} - 1} \log P (C_{i} | C_{i - 2} C_{i - 1}) + λ_{1}] \\ = Σ_{i = 1}^{n} \{[λ_{11} + λ_{12} (k_{i} - 1)] \cdot \log P (C_{i} | C_{i - 2} C_{i - 1}) + λ_{1}\} \end{matrix}

In step 106, the accumulated path score is:

f (X, C) = Σ_{i = 1}^{n} \{\begin{matrix} Σ_{h = 1}^{6} [λ_{h 1} + λ_{h 2} (k_{i} - 1)] \log P_{h} + λ_{71} \log P (g_{j_{i}} | S P) \\ + λ_{72} Σ_{j = j_{i} + 1}^{j_{i} + k_{i} - 1} \log P (g_{j} | N S P) \end{matrix}\} + n λ,

where P_h, h = 1, …, 6, denote P(C_i|C_{i−2}C_{i−1}), p(b_i|C_i), p(q_i|C_i), p(z_i|C_i), p(p_i^u|C_i) and p(p_i^b|C_{i−1}C_i), respectively;

With λ_h2 = 0 set for h ≠ 4 in the accumulated path score f(X, C) of step 106, the accumulated path score becomes:

f (X, C) = Σ_{i = 1}^{n} \{\begin{matrix} Σ_{h = 1}^{5} λ_{h 1} \log P_{h} + [λ_{61} + λ_{62} (k_{i} - 1)] \log P (z_{i} | C_{i}) \\ + λ_{71} \log P (g_{j_{i}} | S P) + λ_{72} Σ_{j = j_{i} + 1}^{j_{i} + k_{i} - 1} \log P (g_{j} | N S P) \end{matrix}\} + n λ,

where P_h, h = 1, …, 5, denote P(C_i|C_{i−2}C_{i−1}), P(b_i|C_i), P(q_i|C_i), P(p_i^u|C_i) and P(p_i^b|C_{i−1}C_i), respectively; λ denotes the weight factors; and SP and NSP denote a segmentation point and a non-segmentation point, respectively.

2. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that the input is a handwritten character string, or a text line composed of a sequence of strokes, in Chinese or Japanese.

3. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that the over-segmentation step comprises first extracting two features of each inter-stroke gap, the horizontal distance and the overlap length, to obtain hypothetical segmentation points, and then selecting as segmentation points those hypothetical segmentation points whose support vector machine output and width feature value exceed two respective thresholds, the rest being undetermined points.

4. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that the feature values of the primitive segments are estimated using the feature values extracted from the candidate character patterns.

5. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that the confidence of the character string is obtained from the evaluation of the character string class, ignoring the denominator, which is independent of the character string class, and treating the feature values as mutually independent.

6. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that P(C) represents the linguistic prior and uses a trigram over the over-classes, expressed by P(c_i|c_{i−2}c_{i−1}); the character shape score of a primitive segment is given by a character recognizer.

7. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that, in computing the path score, the geometric feature vectors b_i, q_i, p_i^u and p_i^b are converted into log-likelihood scores using a quadratic discriminant function classifier; the path score is accumulated over the primitive segments and does not change with the number of segmented character patterns.

8. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that the weight factors λ₁₁ and λ₁₂ vary according to starting strokes and non-starting strokes; the weight factor λ₁ varies according to the number of characters; the weight factors λ_h1, λ_h2 (h = 1 to 7) and λ are obtained using a genetic algorithm; in the accumulated path evaluation, only the λ_h2 weight factor of the important P₄ term is retained, and the remaining unimportant weight factors are set to 0.

9. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that p(g_i|SP) and p(g_i|NSP) in the path score are approximated using a support vector machine classifier.

10. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that the optimal path is found by Viterbi search.