CN106570518A - Chinese and Japanese handwritten text identification method - Google Patents

Publication number
CN106570518A
Authority
CN
China
Prior art keywords
character
log
segmentation
lambda
path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610895677.5A
Other languages
Chinese (zh)
Inventor
刘建生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Newreal Auto-system Co Ltd
Original Assignee
Shanghai Newreal Auto-system Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/457 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by analysing connectivity, e.g. edge linking, connected component analysis or slices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Abstract

The invention discloses a method for recognizing Chinese and Japanese handwritten text. It recognizes handwritten character strings with relatively high accuracy and constitutes a robust model for online handwritten character string recognition. In the method, stroke gaps are over-segmented, and undetermined points are introduced in addition to segmentation points and non-segmentation points; each candidate character pattern is associated with multiple candidate classes; based on a probabilistic approximation over the handwritten text and the character strings, the plausibility of each candidate segmentation and recognition path is evaluated by combining character recognition, unary and binary geometric features, and linguistic context; the path evaluation criterion flexibly combines the context scores and does not vary with path length, so the optimal segmentation path and the corresponding character string can be found efficiently by Viterbi search; and the parameters of the evaluation model are estimated with a genetic algorithm to optimize overall string recognition performance.

Description

A method for recognizing Chinese and Japanese handwritten text
Technical field
The invention relates to a method for recognizing Chinese and Japanese handwritten text. Specifically, it proposes a robust model for online handwritten recognition of Chinese and Japanese character strings that combines character recognition, geometric context and linguistic context to evaluate the plausibility of candidate segmentations and their corresponding character strings.
Background
With the development of pen-based devices with large writing surfaces, such as tablet computers, electronic whiteboards and digital pens (e.g. Anoto pens), people want to write more freely, which requires devices to recognize handwritten text, that is, character strings. Compared with isolated character recognition, handwritten text recognition faces the difficulty that the text cannot be reliably segmented into characters before the characters themselves are known. A further difficulty is that continuous writing is often cursive.
Continuous Chinese or Japanese handwritten text is hard to segment because the spaces between characters are not obvious: many characters are composed of several radicals with gaps between them, and some characters are joined together in cursive writing. Current segmentation methods attempt to segment the character string using only the geometric layout of the characters (gaps, character sizes or positions, and their relations). Without character recognition information and linguistic context, a character string cannot be segmented unambiguously. A feasible way to overcome this ambiguity is integrated segmentation and recognition, which is divided into implicit segmentation and explicit segmentation. Implicit segmentation, also called segmentation-free recognition, mainly uses hidden Markov models (HMM). It simply slices the string pattern into frames of equal length, labels them, and matches the frame sequence against character strings during recognition. This approach does not capture enough character shape information. Explicit segmentation tries to split the character patterns at their true boundaries and label them, so the character patterns can be recognized better. It is generally carried out in two steps: over-segmentation, followed by path evaluation and search. The string pattern is over-segmented into primitive segments, each of which constitutes a character or part of a character. The segments are then combined into candidate character patterns (forming a candidate lattice), which are evaluated by character recognition together with geometric and linguistic context.
In over-segmentation-based string recognition, the key issue is how to evaluate the candidate characters along each path of the candidate lattice. Under an ideal criterion, the path with the maximum score is the correct segmentation. In HMM-based recognition, the frame-level feature vectors form a single unique sequence, and the HMM classifies characters accordingly. By contrast, the paths of an over-segmentation candidate lattice have different lengths and correspond to different feature vector sequences, so the over-segmentation method cannot directly apply Bayesian decision theory in the way HMM-based recognition does. Instead, candidate character recognition scores and context scores have to be combined heuristically to evaluate paths. These heuristic criteria can be divided into summation criteria and normalized criteria. The summation criterion is the sum of the character-wise log-likelihoods, or equivalently the product of the probabilities. Because each likelihood is usually smaller than 1, the sum or product tends to favor paths with fewer characters, so characters are easily over-merged. The normalized criterion, on the other hand, tends to over-split characters, because it normalizes the summed score by the number of characters in the segmentation. Another problem of the normalized criterion is that it does not change monotonically as the path is extended, so its optimal path cannot be guaranteed by Viterbi search or dynamic programming (DP).
To make better use of character shape information in HMM-based recognition while keeping the path evaluation criterion monotonic, the variable-duration property of the HMM can be exploited: multiple frames are concatenated to obtain the likelihood of a whole candidate character shape, which replaces the individual frame-level likelihoods. In an over-segmentation-based system, this corresponds to weighting each candidate character by its number of primitive segments, which improves recognition accuracy and search efficiency. Such an over-segmentation-based method weights the character recognition score in the summation criterion by the number of primitive segments and thereby overcomes the effect of segmentation length on the result. However, it weights only the character recognition score and does not explain why the geometric and linguistic scores are left unweighted. Such a path evaluation criterion cannot be derived either by labeling primitive segments or by labeling character patterns. A method is therefore needed that decides whether each factor should be weighted by the number of primitive segments and determines the degree of weighting automatically.
Summary of the invention
1. Recognition process of the invention:
As a method for recognizing online handwritten text, the integrated segmentation and recognition system has five main steps. The input is a handwritten character string or text line consisting of a sequence of strokes.
Step 100: Over-segmentation. Each stroke gap (the pen-up break between two consecutive strokes) is classified as a segmentation point (SP), a non-segmentation point (NSP), or an undetermined point. A segmentation point separates two characters at the stroke gap, a non-segmentation point indicates that the gap lies inside a character, and the remaining stroke gaps are classified as undetermined points according to certain geometric features. The group of consecutive strokes between two adjacent segmentation points or undetermined points is a primitive segment, and one or more consecutive primitive segments form a candidate character pattern.
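The grouping described in Step 100 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the label string "UP" for an undetermined point, and the `max_segments` limit are assumptions introduced here.

```python
def primitive_segments(num_strokes, gap_labels):
    """Group stroke indices into primitive segments.

    gap_labels[i] labels the gap between stroke i and stroke i + 1 as
    "SP", "NSP" or "UP" ("UP" is a hypothetical label for an undetermined point).
    Returns the segments and the label of the boundary between neighbouring segments.
    """
    assert len(gap_labels) == num_strokes - 1
    segments, boundaries, current = [], [], [0]
    for i, label in enumerate(gap_labels):
        if label == "NSP":
            current.append(i + 1)      # the gap lies inside a character: keep merging
        else:
            segments.append(current)   # SP or UP closes the current primitive segment
            boundaries.append(label)
            current = [i + 1]
    segments.append(current)
    return segments, boundaries


def candidate_patterns(segments, boundaries, max_segments=4):
    """Enumerate candidate character patterns as (first, last) segment indices."""
    candidates = []
    for start in range(len(segments)):
        for end in range(start, min(start + max_segments, len(segments))):
            # A candidate may not span a definite segmentation point.
            if end > start and boundaries[end - 1] == "SP":
                break
            candidates.append((start, end))
    return candidates
```

For five strokes with gap labels NSP, UP, SP, UP, the sketch yields four primitive segments and six candidate patterns; no candidate crosses the SP boundary.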
Over-segmentation methods generally use a support vector machine (SVM) classifier to classify stroke gaps into segmentation points and non-segmentation points. Multi-dimensional features are extracted, such as the distance and overlap between the adjacent strokes at a stroke gap, and the classification accuracy is fairly high. However, when adjacent strokes overlap heavily, a true segmentation point may be misclassified as a non-segmentation point, and when the gap between adjacent strokes is large, a true non-segmentation point is often misclassified as a segmentation point. To reduce such errors, a stroke gap that cannot be classified reliably must be classified as an undetermined point.
The invention uses an improved two-step classification scheme to reduce the number of undetermined points while keeping the misclassification of segmentation points and non-segmentation points as low as possible. First, two geometric features are used to classify stroke gaps into non-segmentation points and hypothetical segmentation points. Then an SVM classifier classifies some of the hypothetical segmentation points as segmentation points. The process is as follows:
1) Hypothetical segmentation
Hypothetical segmentation points are extracted using two features of each stroke gap: a horizontal distance feature and an intersection length feature. The horizontal distance feature is computed from two bounding boxes: one for all strokes preceding the stroke gap, denoted $BB_{p\_all}$, and one for all strokes following it, denoted $BB_{s\_all}$. The distance $DB_x$ is defined as:
$$DB_x = \mathrm{left\_bound}(BB_{s\_all}) - \mathrm{right\_bound}(BB_{p\_all}) \tag{1}$$
The horizontal distance feature $f_d$ is given by:
$$f_d = DB_x / acs \tag{2}$$
Here $acs$ is the average character size. It is estimated by measuring the long side of the bounding box of each stroke, sorting these lengths over all strokes, and averaging the largest third.
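Formulas (1) and (2) and the estimate of the average character size can be sketched as below. The box layout `(left, top, right, bottom)` and the function names are assumptions made for this example; the source does not prescribe a data structure.

```python
def average_character_size(stroke_boxes):
    """Estimate acs: average the largest third of the strokes' long bounding-box sides."""
    long_sides = sorted(
        (max(right - left, bottom - top) for left, top, right, bottom in stroke_boxes),
        reverse=True,
    )
    top_third = long_sides[: max(1, len(long_sides) // 3)]
    return sum(top_third) / len(top_third)


def union_box(boxes):
    """Bounding box enclosing all given (left, top, right, bottom) boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def horizontal_distance_feature(stroke_boxes, gap_index):
    """f_d for the gap between stroke gap_index and stroke gap_index + 1."""
    acs = average_character_size(stroke_boxes)
    bb_p_all = union_box(stroke_boxes[: gap_index + 1])   # all preceding strokes
    bb_s_all = union_box(stroke_boxes[gap_index + 1 :])   # all succeeding strokes
    db_x = bb_s_all[0] - bb_p_all[2]                      # formula (1)
    return db_x / acs                                     # formula (2)
```

A negative value indicates that the strokes before and after the gap overlap horizontally; the sketch assumes that at least one stroke has a non-zero extent so that acs is positive.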
To compute the intersection length feature, all strokes before the stroke gap and all strokes after it are considered; the former set is denoted $S_{p\_all}$ and the latter $S_{s\_all}$. If a pair of strokes $s_p \in S_{p\_all}$ and $s_s \in S_{s\_all}$ intersect at a point $p$, the length of $s_p$ from $p$ to its right end point is computed and denoted $l_p$, and the length of $s_s$ from $p$ to its left end point is computed and denoted $l_s$. The intersection length between $s_p$ and $s_s$ is defined as:
The cumulative length is:
The overall intersection length $f_i$ at a stroke gap is then defined as:
At this point the stroke gaps have been divided into non-segmentation points and hypothetical segmentation points. Because the string pattern cannot be split at a non-segmentation point, the next step disregards the non-segmentation points and uses an SVM classifier to classify the hypothetical segmentation points further.
2) Support vector machine classification
The hypothetical segmentation points are classified by an SVM classifier on the basis of geometric features. Two further features are used here: the intersection length feature $f_i$ defined in the hypothetical segmentation step, and a newly introduced width feature of the hypothetical segmentation point, defined as the width of the strokes spanning from before to after the hypothetical segmentation point, divided by the average character size $acs$.
The SVM is trained with stroke gaps labeled 1 for segmentation points and -1 for non-segmentation points. For the hypothetical segmentation points of a test string pattern, the SVM output is converted into a confidence value, which is then combined into the path evaluation criterion for candidate segmentation and recognition.
A hypothetical segmentation point whose SVM output exceeds one threshold and whose width feature also exceeds another threshold is selected as a segmentation point. At a determined segmentation point, the adjacent primitive segments cannot be merged into a candidate character pattern. Reducing the number of hypothetical segmentation points shrinks the candidate lattice and improves the efficiency of string recognition.
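The two-threshold rule above can be written as a small decision function. The threshold values, the function name and the "UP" label for an undetermined point are placeholders chosen for illustration; the source does not state concrete thresholds.

```python
def classify_hypothetical_point(svm_output, width_feature,
                                svm_threshold=0.5, width_threshold=1.0):
    """Promote a hypothetical segmentation point to SP only if both tests pass.

    svm_threshold and width_threshold are hypothetical values; in practice they
    would be tuned on training data.
    """
    if svm_output > svm_threshold and width_feature > width_threshold:
        return "SP"   # definite segmentation point: segments may not merge across it
    return "UP"       # stays undetermined and is resolved later by the path search
```

Requiring both conditions keeps the rule conservative: a gap is fixed as a segmentation point only when the classifier and the geometry agree, and every other hypothetical point is left for the lattice search to resolve.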
All remaining stroke gaps, that is, the remaining hypothetical segmentation points, are kept undetermined; the number of such points determines the amount of computation. Eliminating false segmentation points increases the processing time of the SVM classifier, but it reduces the processing time of the later steps and improves the recognition rate when the path evaluation function is computed.
Step 102: Candidate lattice construction. By character classification, each candidate character pattern is associated with several candidate classes, each with a confidence score. The combinations of all candidate patterns and character classes are represented by a segmentation-recognition candidate lattice, in which each arc represents a segmentation point and each node represents a character class assigned to a candidate pattern.
Step 104: Confidence evaluation. The confidence scores of the candidate lattice and of the corresponding character strings are evaluated.
Step 106: Path evaluation. Each segmentation path in the candidate lattice and its corresponding character string is evaluated by combining the candidate character scores with the compatibility between characters, that is, the geometric and linguistic context probabilities.
The string pattern is represented as a sequence of primitive segments, $X = s_1, \dots, s_m$, and is segmented into character patterns $Z = z_1, \dots, z_n$, where each candidate pattern contains $k_i$ primitive segments: $z_i = s_{j_i}, \dots, s_{j_i + k_i - 1}$. The segmented character patterns are assigned the classes $C = C_1, \dots, C_n$. To score the string $X$ against the character string $C$, features are extracted from the primitive segments (or candidate patterns) and from the compatibility between segments (or between characters). The features are the bounding box feature $b_i$, the inner gap feature $q_i$, the shape feature $s_i$ or $z_i$, the unary position feature $p^u_i$ of a single segment (or character), the binary position feature $p^b_i$ of adjacent segments (or characters), and the between-segment gap feature $g_i$, which is the basis for determining segmentation points and non-segmentation points in step 100.
Denoting the feature sequences of the primitive segments by $b$, $q$, $X$, $p^u$, $g$, the posterior probability of the character string is:
where the character recognizer and the geometric scoring functions are obtained from a training database.
Ignoring the denominator, which is independent of the character string, and reasonably assuming that the features are mutually independent, the evaluation of the character string is equivalent to:
In formula (7), $c_i$ denotes a hypothetical class of a part of a character or of a primitive segment (referred to as an over-class), and $t_i$ takes the value SP or NSP, denoting a segmentation point and a non-segmentation point respectively. A character class $C_j$ comprises one or more consecutive $c_i$. $p(z_i \mid C_i)$ is obtained from the character recognizer trained on the database; it is the normalized recognition score taken as a conditional probability. $p(b_i \mid C_i)$, $p(q_i \mid C_i)$, $p(p^u_i \mid C_i)$ and $p(p^b_i \mid C_{i-1} C_i)$ are geometric scoring functions, obtained from quadratic discriminant function classifiers trained on the database.
The linguistic prior $P(C)$ is expressed by the trigram of the over-classes of the primitive segments, $P(c_i \mid c_{i-2} c_{i-1})$. Because the over-class trigram is difficult to obtain, $P(c_i \mid c_{i-2} c_{i-1})$ is approximated by the character-class trigram $P(C_i \mid C_{i-2} C_{i-1})$, where $C_i$ contains $c_i$.
In formula (8), $\lambda_{11}$ and $\lambda_{12}$ are weighting factors, and $\lambda_1$ balances the bias in the number of characters. Because the initial stroke and the non-initial strokes of a character pattern have different effects, different weights are needed when their transition probabilities are estimated.
As above, the features extracted from primitive segments are approximated by the features extracted from candidate character patterns, and different weights are used for initial and non-initial strokes for the other features as well. The resulting path score is:
In formula (9), $P_h$, $h = 1, \dots, 6$, denote $P(C_i \mid C_{i-2} C_{i-1})$, $p(b_i \mid C_i)$, $p(q_i \mid C_i)$, $p(z_i \mid C_i)$, $p(p^u_i \mid C_i)$ and $p(p^b_i \mid C_{i-1} C_i)$ respectively. The term $\lambda$ in formula (9) collects all the bias terms for $h = 1, \dots, 6$. Note that $s_i$ is replaced here by $z_i$, because the probability $p(s_i \mid c_i)$ is estimated by $p(z_i \mid C_i)$.
On the training data set, the weighting factors $\lambda_{h1}$, $\lambda_{h2}$ ($h = 1, \dots, 7$) and $\lambda$ are selected by optimizing string recognition performance with a genetic algorithm.
To remove unimportant parameters, simplify learning and improve recognition accuracy, $\lambda_{h2} = 0$ is set for $h \neq 4$, that is, only the weight of $P_4$ is considered, and formula (9) becomes:
This further gives:
In formula (11), $P_h$, $h = 1, \dots, 5$, denote $P(C_i \mid C_{i-2} C_{i-1})$, $p(b_i \mid C_i)$, $p(q_i \mid C_i)$, $p(p^u_i \mid C_i)$ and $p(p^b_i \mid C_{i-1} C_i)$ respectively.
The trigram probability $P(C_i \mid C_{i-2} C_{i-1})$ can be computed from a text corpus. When $C_i$ is the first or second character of a sentence, it degenerates to a unigram or a bigram respectively. To overcome the inaccuracy caused by insufficient training data, the trigram is smoothed as follows:
$$P'(C_i \mid C_{i-2}, C_{i-1}) = \beta_1 P(C_i \mid C_{i-2}, C_{i-1}) + \beta_2 P(C_i \mid C_{i-1}) + \beta_3 P(C_i) + \beta_4 \tag{12}$$
The weights are estimated from a different text corpus and satisfy $\beta_1 + \beta_2 + \beta_3 + \beta_4 = 1$.
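The interpolation in formula (12) can be sketched as follows. The dictionary-based probability tables and the function name are assumptions made for this example; unseen n-grams are simply looked up as 0.

```python
def smoothed_trigram(c_prev2, c_prev1, c, trigram, bigram, unigram, betas):
    """Formula (12): interpolate trigram, bigram, unigram and a constant floor.

    trigram, bigram and unigram are dictionaries of (hypothetical) corpus
    estimates; betas = (b1, b2, b3, b4) must sum to 1.
    """
    b1, b2, b3, b4 = betas
    assert abs(b1 + b2 + b3 + b4 - 1.0) < 1e-9
    return (
        b1 * trigram.get((c_prev2, c_prev1, c), 0.0)
        + b2 * bigram.get((c_prev1, c), 0.0)
        + b3 * unigram.get(c, 0.0)
        + b4
    )
```

The constant term $\beta_4$ guarantees a non-zero probability even for a character sequence that never occurs in the corpus, so the logarithm used in the path score stays finite.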
To maintain scale invariance, the geometric features $b_i$, $q_i$, $p^u_i$ and $p^b_i$ are normalized by the average character size $acs$. The feature vector $b_i$ consists of the height and width of the bounding box of each character pattern. The feature vector $q_i$ consists of six values, as shown in Fig. 6: the first three are the horizontal widths of three vertical gaps (found from the vertical projection), and the last three are the vertical heights of three horizontal gaps (found from the horizontal projection). The feature vector $p^u_i$ contains the vertical distances from the center line to the top and to the bottom of the bounding box. The feature vector $p^b_i$ has two elements obtained from the bounding boxes of two adjacent character patterns: the vertical distance between their upper bounds and the vertical distance between their lower bounds. $p(p^b_1 \mid C_1, C_0)$ is set to 1. To reduce the complexity of $p(p^b_i \mid C_{i-1} C_i)$, the character classes are grouped into six super-classes according to the mean vectors of the unary position features of the training samples, so that $p(p^b_i \mid C_{i-1} C_i)$ is replaced by $p(p^b_i \mid C'_{i-1}, C'_i)$, where $C'_{i-1}$ and $C'_i$ are super-classes. Quadratic discriminant function classifiers convert the geometric feature vectors $b_i$, $q_i$, $p^u_i$ and $p^b_i$ into log-likelihood scores, which are used in formula (11). The character shape score $p(z_i \mid C_i)$ is given by a character recognizer. The feature vector $g_i$ contains several measurements taken at a candidate segmentation point, which lies between two adjacent primitive segments.
$p(g_i \mid SP)$ and $p(g_i \mid NSP)$ are approximated using the SVM classifier. To obtain the probabilities $p(o_i \mid SP)$ and $p(o_i \mid NSP)$, a warping function is applied to the SVM output, where $o_i$ is the SVM output for $g_i$. The warping function is derived from the SVM outputs on a validation data set. The value of $p(o_1 \mid SP)$ is set to 1.
To warp the SVM output, histograms of the outputs are first obtained for $p(o_i \mid SP)$ and $p(o_i \mid NSP)$, and the cumulative probabilities $p'(o_i \mid SP)$ and $p'(o_i \mid NSP)$ are then computed:
$p'(o_i \mid SP)$ and $p'(o_i \mid NSP)$ are then each fitted with a sigmoid function, with the fitting parameters obtained by minimizing the squared error. The two sigmoid functions use different parameters.
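The warping step can be illustrated with the sketch below. Several points are assumptions, since formula (13) is not reproduced in this text: the cumulative probability is taken as the empirical fraction of outputs not exceeding a grid value, the sigmoid has the form $1 / (1 + e^{-(a o + b)})$, and the least-squares fit is done by a plain grid search over hypothetical parameter ranges rather than by any particular optimizer.

```python
import math


def cumulative_fraction(outputs, grid):
    """Empirical cumulative probability of the SVM outputs at each grid value."""
    n = len(outputs)
    return [sum(1 for o in outputs if o <= x) / n for x in grid]


def sigmoid(x, a, b):
    return 1.0 / (1.0 + math.exp(-(a * x + b)))


def fit_sigmoid(xs, ys, a_values, b_values):
    """Pick the (a, b) pair from the given grids that minimizes the squared error."""
    best = None
    for a in a_values:
        for b in b_values:
            err = sum((sigmoid(x, a, b) - y) ** 2 for x, y in zip(xs, ys))
            if best is None or err < best[0]:
                best = (err, a, b)
    return best[1], best[2]
```

One sigmoid would be fitted to the cumulative curve of the SP samples and a second one to that of the NSP samples, each with its own parameters.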
The weighting factors $\lambda_{h1}$, $\lambda_{h2}$ ($h = 1, \dots, 7$) and $\lambda$ are trained by a genetic algorithm on the training string patterns so that the recognition rate on the training data is maximized. For this purpose, $\lambda_{h1}$, $\lambda_{h2}$ ($h = 1, \dots, 7$) and $\lambda$ are treated as the elements of a chromosome. The weighting factors are estimated by the genetic algorithm in the following steps:
1) Initialization: N chromosomes are initialized with random values chosen from 0 to 1, the average fitness $f_{old}$ of the N chromosomes is set to 0, and the time t is set to 1.
2) Crossover: two chromosomes are randomly selected from the N chromosomes and crossed at two random positions to produce two new chromosomes. This step is repeated until M new chromosomes have been obtained.
3) Mutation: with probability $P_{mut}$, each of the N+M chromosomes is changed by a random value between -1 and 1.
4) Fitness evaluation: the fitness of each chromosome is evaluated as the recognition rate on the training data obtained with the weight values encoded in that chromosome.
5) Selection: the roulette-wheel probability of each chromosome is determined from its fitness. The two chromosomes with the highest fitness are selected first, and further chromosomes are then selected by roulette wheel until N new chromosomes have been obtained. The new chromosomes replace the original N chromosomes.
6) Iteration: the average fitness $f_{new}$ of the new N chromosomes is computed. If $f_{new} - f_{old} <$ threshold has occurred $n_{stop}$ times, or $t > T$, the chromosome with the highest fitness is returned. Otherwise $f_{old}$ is set to $f_{new}$, t is incremented, and the procedure returns to step 2).
To evaluate the fitness of a chromosome, the optimal path of each training string pattern is searched for, where path optimality is evaluated with the weight values encoded in the chromosome.
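Steps 1) to 6) can be sketched as the loop below. This is an illustration under stated assumptions, not the patented procedure itself: the fitness function is passed in as a callable that must return a non-negative value (in the method it would be the training recognition rate), mutation is applied per gene, the best chromosome seen so far is tracked explicitly, and all default parameter values are hypothetical.

```python
import random


def two_point_crossover(a, b, rng):
    """Swap the genes between two random cut positions of parents a and b."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def roulette_pick(pool, fits, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fits)
    if total <= 0:
        return rng.choice(pool)
    r, acc = rng.uniform(0, total), 0.0
    for chrom, fit in zip(pool, fits):
        acc += fit
        if acc >= r:
            return chrom
    return pool[-1]


def evolve(fitness, n_genes, n=20, m=20, p_mut=0.05,
           max_iter=50, eps=1e-4, n_stop=5, seed=0):
    rng = random.Random(seed)
    # 1) Initialization: N chromosomes with genes drawn from [0, 1).
    pop = [[rng.random() for _ in range(n_genes)] for _ in range(n)]
    best = max(pop, key=fitness)
    best_fit = fitness(best)
    f_old, stalls = 0.0, 0
    for _ in range(max_iter):                      # stops when t > T
        # 2) Crossover until M children exist.
        children = []
        while len(children) < m:
            a, b = rng.sample(pop, 2)
            children.extend(two_point_crossover(a, b, rng))
        pool = pop + children[:m]
        # 3) Mutation: shift a gene by a value in [-1, 1] with probability p_mut.
        pool = [[g + rng.uniform(-1, 1) if rng.random() < p_mut else g
                 for g in chrom] for chrom in pool]
        # 4) Fitness evaluation.
        fits = [fitness(chrom) for chrom in pool]
        ranked = sorted(range(len(pool)), key=lambda k: fits[k], reverse=True)
        if fits[ranked[0]] > best_fit:
            best, best_fit = list(pool[ranked[0]]), fits[ranked[0]]
        # 5) Selection: keep the two fittest, fill the rest by roulette wheel.
        pop = [pool[ranked[0]], pool[ranked[1]]]
        while len(pop) < n:
            pop.append(list(roulette_pick(pool, fits, rng)))
        # 6) Iteration: stop after n_stop rounds without sufficient improvement.
        f_new = sum(fitness(c) for c in pop) / n
        if f_new - f_old < eps:
            stalls += 1
            if stalls >= n_stop:
                break
        f_old = f_new
    return best, best_fit
```

In the method described here each fitness evaluation requires a full path search over the training strings, so in practice the fitness values would be cached rather than recomputed as in this sketch.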
Step 108: String recognition. The path score in formula (11) is accumulated over primitive segments and therefore does not vary with the number of segmented character patterns. Hence the optimal path can be found by Viterbi search (dynamic programming). By refining the path evaluation criterion and repeating the above steps, a better optimal path corresponding to the correct segmentation and recognition is obtained. Finally, the optimal path with the highest score gives the final result of character segmentation and recognition.
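Because the path score is additive, the search in Step 108 can be done by dynamic programming over the candidate lattice, as in the sketch below. The lattice encoding, the function name and the `transition` callback are assumptions for this example; each edge score is assumed to be the already weighted log-score of one candidate class, and the context term is limited to the previous character for brevity, whereas the method described here uses a trigram.

```python
def viterbi_lattice(num_segments, candidates, transition):
    """Find the highest-scoring segmentation-recognition path.

    candidates maps (start, end) to a list of (char, score) pairs, meaning that
    primitive segments start..end-1 form one pattern recognized as char.
    transition(prev_char, char) returns an additive context score
    (prev_char is None at the start of the string).
    """
    start_state = (0, None)
    best = {start_state: (0.0, None)}          # state -> (score, back-pointer)
    for end in range(1, num_segments + 1):
        for (start, stop), options in candidates.items():
            if stop != end:
                continue
            for (pos, prev), (score, _) in list(best.items()):
                if pos != start:
                    continue
                for char, char_score in options:
                    total = score + char_score + transition(prev, char)
                    state = (end, char)
                    if state not in best or total > best[state][0]:
                        best[state] = (total, (pos, prev))
    finals = [(v[0], s) for s, v in best.items()
              if s[0] == num_segments and s != start_state]
    if not finals:
        return None
    score, state = max(finals, key=lambda item: item[0])
    path = []
    while state != start_state:                # follow back-pointers to the start
        prev_state = best[state][1]
        path.append((prev_state[0], state[0], state[1]))
        state = prev_state
    return score, path[::-1]
```

The returned path lists `(start, end, char)` triples, which give the segmentation and the recognized string together.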
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the handwritten text recognition method of the invention.
Fig. 2 is a detailed flow chart of the handwritten text recognition method of the invention.
Fig. 3 is a flow chart of the over-segmentation of the invention.
Fig. 4 shows examples of misclassified stroke gaps.
Fig. 5 is a schematic diagram of some geometric features according to an embodiment of the invention.
Fig. 6 is a schematic diagram of the inner gap features of a character pattern according to an embodiment of the invention.
Fig. 7 is a schematic diagram of the thresholds set for hypothetical segmentation points according to an embodiment of the invention.
Fig. 8 shows an example of primitive segments and the corresponding candidate lattice according to an embodiment of the invention.
Detailed description of embodiments
The invention is further described below with reference to specific embodiments. It should be understood that these embodiments merely illustrate the invention and do not limit its scope.
Embodiment 1:
In this embodiment, to evaluate the string recognition model of the invention, the character recognizer and the geometric scoring functions are trained on a Japanese online handwriting database. The normalized recognition score is taken as the conditional probability $p(z_i \mid C_i)$, and character recognition combines offline and online recognition methods. For the geometric scores, four quadratic discriminant function classifiers are trained for $p(b_i \mid C_i)$, $p(q_i \mid C_i)$, $p(p^u_i \mid C_i)$ and $p(p^b_i \mid C_{i-1} C_i)$ respectively.
To score the linguistic context, an initial trigram table was prepared from the 1993 volume of the Asahi Shimbun and the 2002 volume of the Nikkei newspaper. The smoothing factors $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$ were estimated using the Japanese online handwriting database. By deleting trigrams that never occur, ignoring trigrams with few occurrences, and quantizing the logarithms of the trigram probabilities, the size of the trigram data was reduced to 6 MB.
The text database (Kondate) was collected from 100 people. Horizontally written text lines were extracted from it to train the weight parameters and to evaluate string recognition performance. The handwritten text lines of 75 people were used to train the SVM classifier, giving the probabilities of the candidate segmentation points, and to obtain the weighting factors of the path evaluation score. After training, the text lines of the remaining 25 people were used for testing. The training and test data are listed in Table 1. The experiments were run on a Pentium(R) 4 2.80 GHz processor with 512 MB of memory. To simplify computation, each weight was first set to 1 and the top 100 recognition candidates (segmentation-recognition paths) were selected for each training string.
            Text lines    Character patterns    Character classes    Characters per line
Training    10,174        104,093               1,106                10.23
Test        3,511         35,686                790                  16.89
Table 1: Text-line data used for training and testing
In the embodiment of the invention, as shown in Fig. 7, a stroke gap is classified as a hypothetical segmentation point (undetermined point) if its horizontal distance feature is greater than 0 or if it lies within region OABCDE; otherwise it is classified as a non-segmentation point. Then, within region OFGH, if the width between two consecutive hypothetical segmentation points divided by $acs$ exceeds a certain threshold, the points lying between these two hypothetical segmentation points that were classified as non-segmentation points are reclassified as undetermined points.
The distribution of the horizontal distance feature and the intersection length feature of the stroke gaps of a set of training strings is shown in Fig. 8. The results show that the embodiment of the invention distinguishes segmentation points from non-segmentation points well.
Table 2: Text-line recognition results obtained with the two over-segmentation methods
The over-segmentation of the invention (two-stage classification) is compared with over-segmentation based directly on the SVM output (one-stage classification). For a fair comparison, both methods are evaluated with the path evaluation criterion proposed by the invention. The one-stage method extracts 19 features from each stroke gap and applies an SVM to these features to classify each stroke gap as a segmentation point, a non-segmentation point or an undetermined point. The performance of text-line recognition is assessed by the character segmentation measure f (the F-measure of segmentation point detection), the character recognition rate $C_r$, and the average recognition time per text line $T_{av\_rec\_tl}$. The results are shown in Table 2.
Table 2 shows that, although the over-segmentation method proposed by the invention takes more processing time than the one-stage method, it significantly improves character recognition and segmentation accuracy. The two-stage method leaves many stroke gaps undetermined. These undetermined points complicate the candidate lattice and increase the computation of string recognition, but they reduce classification errors and thereby improve recognition performance.
The performance of three path evaluation criteria is compared: the present invention; Method 1, the weighting-factor method proposed by Nakagawa et al. (Nakagawa, M., Zhu, B., Onuma, M.: A model of on-line handwritten Japanese text recognition free from line direction and writing format constraints. IEICE Trans. Inf. Syst. E88-D(8), 1815–1822 (2005)); and Method 2, proposed by Zhou et al. (Zhou, X.-D., Yu, J.-L., Liu, C.-L., Nagasaki, T., Marukawa, K.: Online handwritten Japanese character string recognition incorporating geometric context. In: Proceedings of the 9th International Conference on Document Analysis and Recognition, pp. 48–52. Curitiba, Brazil (2007)). Method 1 and Method 2 are given by formula (14) and formula (15), respectively. For a fair comparison, all three methods use the same trigram language model, the same character recognizer, and the same geometric context classifiers. The weight factors of each method, λh1, λh2 (h = 1 to 7), λi (i = 1 to 7) and λ, are optimized using a genetic algorithm. All three methods combine the same seven terms of the path evaluation in formula (9), except that Method 1 does not use the terms related to k_i (the number of primitive segments that make up a character pattern), and Method 2 normalizes the path score of Method 1 by the number of segmented characters.
In Method 2, the path score is not additive along the character string, so beam search is used to find the optimal path in the candidate lattice. For the present invention and Method 1, the optimal path is found by Viterbi search. For all three methods, ten candidate classes are selected for each character pattern in the candidate lattice.
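Because an additive path score lets the best path to each position be extended independently, Viterbi search applies. The following is a minimal sketch of such a search over a segmentation lattice, not the patented implementation: it assumes each candidate character pattern spans primitive segments `[start, stop)` and carries a precomputed log score per candidate class, and it omits the trigram language context for brevity. All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]          # primitive segments [start, stop), start < stop
Candidate = Tuple[str, float]   # (class label, log score)

def viterbi_best_path(num_segments: int,
                      lattice: Dict[Span, List[Candidate]]
                      ) -> Tuple[List[str], float]:
    """Return the best label sequence and its accumulated score."""
    neg_inf = float("-inf")
    best = [neg_inf] * (num_segments + 1)   # best score ending at each position
    back: List[Optional[Tuple[int, str]]] = [None] * (num_segments + 1)
    best[0] = 0.0

    for end in range(1, num_segments + 1):
        for (start, stop), candidates in lattice.items():
            if stop != end or best[start] == neg_inf:
                continue
            # Without language context, only the top candidate per span matters.
            label, score = max(candidates, key=lambda c: c[1])
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = (start, label)

    if best[num_segments] == neg_inf:
        raise ValueError("no complete path through the lattice")

    # Trace back from the end of the string to position 0.
    labels: List[str] = []
    position = num_segments
    while position > 0:
        start, label = back[position]
        labels.append(label)
        position = start
    labels.reverse()
    return labels, best[num_segments]
```

A normalized score such as the one in Method 2 cannot be decomposed this way, which is why a beam search over whole partial paths is used there instead.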
To demonstrate the advantage of optimizing the weight factors with a genetic algorithm, the character recognition rate obtained with the genetic algorithm optimization of the present invention can be compared with that obtained with the minimum classification error (MCE) criterion optimized by stochastic gradient descent. The MCE criterion finds the optimal parameter vector λ by minimizing the difference between the score of the best-scoring incorrect path and the score of the correct path:

$$L_{MCE}(\lambda, X) = \sigma\left(\max(Score_{incorrect}) - Score_{correct}\right), \quad (16)$$

$$\sigma(x) = (1 + e^{-x})^{-1}$$

Here Score_correct is the score of the correct path in the candidate lattice, and Score_incorrect is the score of an incorrect path in the candidate lattice.
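Formula (16) can be evaluated directly once the path scores are known. The sketch below assumes the scores of the correct path and of the competing incorrect paths have already been computed; the function names are hypothetical.

```python
import math
from typing import Sequence

def sigmoid(x: float) -> float:
    """Logistic function (1 + e^-x)^-1, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def mce_loss(score_correct: float, scores_incorrect: Sequence[float]) -> float:
    """Minimum classification error loss of formula (16).

    The loss approaches 0 when the correct path clearly outscores every
    incorrect path and approaches 1 when an incorrect path dominates.
    """
    return sigmoid(max(scores_incorrect) - score_correct)
```

When the best incorrect path ties with the correct one, the loss is exactly 0.5, the midpoint of the sigmoid.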
Table 3 shows the string recognition results of the three path evaluation methods. The trained weight values obtained by genetic algorithm optimization are:
(λ11, λ12, λ21, λ22, λ31, λ32, λ41, λ42, λ51, λ52, λ61, λ62, λ71, λ72, λ) = (0.351, 0.000, 0.265, 0.001, 0.199, 0.000, 1.000, 0.641, 0.009, 0.000, 0.100, 0.000, 0.100, 0.000, 0.323, 0.120, 0.100).
Table 3: Text line recognition results of the three path evaluation methods
From the weight factors computed by the genetic algorithm, it can be seen that, except for the character recognition score p(z_i|C_i) and the non-segmentation-point score p(g_i|NSP), the remaining geometric feature terms and the linguistic context term are not weighted by the number of primitive segments (λh2 = 0). This means that, apart from the character recognition score and the non-segmentation-point score, the geometric feature and linguistic context terms are almost independent of the number of primitive segments in a character pattern.
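To make the role of λh1 and λh2 concrete, the sketch below accumulates a path score in which each term log P_h is weighted by λh1 + λh2(k_i − 1), the segmentation-point and non-segmentation-point scores carry their own weights, and a constant λ is added per character. It is a simplified illustration under assumptions: the dictionary layout and all names are hypothetical, and the log probabilities are taken as given.

```python
from typing import Dict, List, Sequence

def path_score(characters: List[Dict],
               lam1: Sequence[float], lam2: Sequence[float],
               lam_sp: float, lam_nsp: float, lam_bias: float) -> float:
    """Accumulate the weighted path score over the characters of one path.

    Each character is a dict with:
      "k"         number of primitive segments in the character pattern
      "log_p"     log scores of the context, geometric and recognition terms
      "log_p_sp"  log score of the segmentation point that starts the character
      "log_p_nsp" log scores of the k - 1 non-segmentation points inside it
    """
    total = 0.0
    for ch in characters:
        k = ch["k"]
        for h, log_p in enumerate(ch["log_p"]):
            # With lam2[h] == 0 the term does not depend on k.
            total += (lam1[h] + lam2[h] * (k - 1)) * log_p
        total += lam_sp * ch["log_p_sp"]
        total += lam_nsp * sum(ch["log_p_nsp"])
        total += lam_bias
    return total
```

Setting every entry of `lam2` to zero except the one for the character recognition term reproduces the reduced-parameter variant in which only that term scales with the number of primitive segments.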
The results show that, whether the weights are obtained by genetic algorithm optimization or by the minimum classification error criterion optimized with stochastic gradient descent, the path evaluation model of the present invention improves the accuracy of character recognition and segmentation. In Method 1, a shorter character string tends to receive a larger evaluation score than a longer one, so characters are easily over-merged. In Method 2, which normalizes the path score, longer character strings are favored, so characters tend to be over-segmented. The present invention overcomes these problems, and its path score does not change with the number of segmented characters. In addition, the processing times of the three methods are almost identical. Compared with parameter optimization by the minimum classification error criterion with stochastic gradient descent, genetic algorithm optimization achieves better string recognition performance. This is because gradient descent reaches only a local optimum, whereas genetic algorithm optimization directly optimizes the character recognition rate on the training data (an objective that is not differentiable) and can therefore reach a global optimum.
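A genetic algorithm needs only a fitness value per weight vector, which is why it can optimize a recognition rate directly. The sketch below is a generic elitist genetic algorithm, not the specific one used in the patent: the population size, truncation selection, one-point crossover, Gaussian mutation, and the clamping of weights to [0, 1] are all assumptions, and `fitness` stands in for a function that would run recognition on the training data and return the character recognition rate.

```python
import random
from typing import Callable, List

def genetic_optimize(fitness: Callable[[List[float]], float], dim: int,
                     pop_size: int = 20, generations: int = 30,
                     sigma: float = 0.1, seed: int = 0) -> List[float]:
    """Search for a weight vector in [0, 1]^dim that maximizes fitness."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]

    for _ in range(generations):
        # Truncation selection: keep the better half unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]

        children = []
        while len(elite) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(elite, 2)
            # One-point crossover followed by Gaussian mutation.
            cut = rng.randrange(1, dim) if dim > 1 else 0
            child = parent_a[:cut] + parent_b[cut:]
            child = [min(1.0, max(0.0, w + rng.gauss(0.0, sigma)))
                     for w in child]
            children.append(child)
        population = elite + children

    return max(population, key=fitness)
```

Because the elite half is carried over unchanged, the best fitness in the population never decreases from one generation to the next.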
Because the weight factors are learned from data, more parameters give, in theory, more degrees of freedom. In practice, however, the training database is limited, so a large number of parameters complicates learning and makes it difficult to obtain optimal parameters. In particular, according to the present invention, formula (11) is used: among the λh2, only the weight factor of the important P4 term is kept and the rest are set to 0, which helps the optimization of the parameters. The actual results also confirm that the method proposed by the present invention further improves segmentation accuracy and character recognition accuracy, as shown in Table 4.
Table 4

Claims (10)

1. A method for recognizing Chinese and Japanese handwritten text, characterized in that the method mainly comprises the following steps:
Step 100, over-segmentation of strokes, including classifying each stroke gap as a segmentation point, a non-segmentation point, or an undecided point;
Step 102, constructing a candidate lattice of character patterns, including associating each candidate character pattern with multiple candidate classes through character classification;
Step 104, confidence evaluation of the segmented patterns, including evaluating the score of each candidate character pattern against its candidate classes;
Step 106, path evaluation of the character string, including accumulating the confidence scores and reducing unimportant parameters in the evaluation;
Step 108, finding the optimal path and recognizing the character string; the path evaluation criterion is improved by combining the evaluation of the candidate characters along a segmentation path with the scores of their candidate classes and character compatibility, yielding the optimal path that corresponds to the correct segmentation and recognition;
In step 100, the over-segmentation of strokes divides the character string pattern X = s_1, …, s_m into candidate character patterns Z = z_1, …, z_n, where each candidate character pattern contains k_i primitive segments, z_i = s_{j_i}, …, s_{j_i+k_i−1}; the segmented character patterns are then classified as the class sequence C = C_1, …, C_n; a character class C_j consists of one or more consecutive c_i; c_i denotes a hypothetical class of a character string or of a primitive segment and is referred to as an over-class;
In step 104, the confidence evaluation method includes evaluating the score of the character string pattern X against the character classes C by extracting the bounding-box features b, the inner-gap features q, the unary position features p^u of a single segment or character, the segment gap features g, and the binary position features p^b of adjacent segments or characters; the data verification procedure and the geometric scoring functions are obtained from a training database and include the conditional confidence p(z_i|C_i) and the geometric scores p(b_i|C_i), p(q_i|C_i), p(p_i^u|C_i) and p(p_i^b|C_{i-1}C_i). The posterior confidence of the character string is:

$$P(C \mid X) = P(C \mid b, q, X, p^u, p^b, g) = \frac{p(b, q, X, p^u, p^b, g \mid C)\, P(C)}{p(b, q, X, p^u, p^b, g)},$$

The evaluation of the character string path is equivalent to:

$$f(X, C) = \log p(b, q, X, p^u, p^b, g \mid C)\, P(C) = \log P(C) + \sum_{i=1}^{m} \left[ \log p(b_i \mid c_i) + \log p(q_i \mid c_i) + \log p(s_i \mid c_i) + \log p(p_i^u \mid c_i) + \log p(p_i^b \mid c_{i-1}, c_i) + \log p(g_i \mid t_i) \right],$$

$$\log P(C) = \sum_{i=1}^{m} \log P(c_i \mid c_{i-2} c_{i-1}) = \sum_{i=1}^{n} \left[ \log P(c_{j_i} \mid c_{j_i-2} c_{j_i-1}) + \sum_{j=j_i+1}^{j_i+k_i-1} \log P(c_j \mid c_{j-2} c_{j-1}) \right] \approx \sum_{i=1}^{n} \left[ \lambda_{11} \log P(C_i \mid C_{i-2} C_{i-1}) + \lambda_{12} \sum_{j=j_i+1}^{j_i+k_i-1} \log P(C_i \mid C_{i-2} C_{i-1}) + \lambda_1 \right] = \sum_{i=1}^{n} \left\{ \left[ \lambda_{11} + \lambda_{12}(k_i - 1) \right] \log P(C_i \mid C_{i-2} C_{i-1}) + \lambda_1 \right\}$$
In step 106, the accumulated path score is:

$$f(X, C) = \sum_{i=1}^{n} \left\{ \sum_{h=1}^{6} \left[ \lambda_{h1} + \lambda_{h2}(k_i - 1) \right] \log P_h + \lambda_{71} \log P(g_{j_i} \mid SP) + \lambda_{72} \sum_{j=j_i+1}^{j_i+k_i-1} \log P(g_j \mid NSP) \right\} + n\lambda,$$

where P_h, h = 1, …, 6, denote P(C_i|C_{i-2}C_{i-1}), p(b_i|C_i), p(q_i|C_i), p(z_i|C_i), p(p_i^u|C_i) and p(p_i^b|C_{i-1}C_i), respectively;
In the accumulated path score f(X, C) of step 106, setting λh2 = 0 for h ≠ 4 changes the accumulated path score to:

$$f(X, C) = \sum_{i=1}^{n} \left\{ \sum_{h=1}^{5} \lambda_{h1} \log P_h + \left[ \lambda_{61} + \lambda_{62}(k_i - 1) \right] \log P(z_i \mid C_i) + \lambda_{71} \log P(g_{j_i} \mid SP) + \lambda_{72} \sum_{j=j_i+1}^{j_i+k_i-1} \log P(g_j \mid NSP) \right\} + n\lambda,$$

where P_h, h = 1, …, 5, denote P(C_i|C_{i-2}C_{i-1}), P(b_i|C_i), P(q_i|C_i), P(p_i^u|C_i) and P(p_i^b|C_{i-1}C_i), respectively; λ denotes the weight factors; and SP and NSP denote segmentation point and non-segmentation point, respectively.
2. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that the input is a handwritten character string or a text line consisting of a sequence of strokes, including Chinese and Japanese.
3. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that the over-segmentation step includes first extracting two features of each stroke gap, the horizontal distance and the overlap length, to obtain hypothetical segmentation points, and then selecting as segmentation points those hypothetical segmentation points whose support vector machine output and width feature value exceed two respective thresholds; the remaining ones are undecided points.
4. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that the features of the primitive segments are estimated from the features extracted from the candidate character patterns.
5. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that the evaluation of the character string path is derived from the confidence of the character string by ignoring the denominator, which is independent of the character string, and by treating the features as mutually independent.
6. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that P(C) denotes the linguistic prior, which uses a trigram over the over-classes and is expressed through the scores P(c_i|c_{i-2}c_{i-1}) of the primitive segments, given by a character recognizer.
7. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that, in computing the path score, the geometric feature vectors b_i, q_i, p_i^u and p_i^b are converted into log-likelihood scores using quadratic discriminant function classifiers; the path score is accumulated over the primitive segments and does not change with the number of segmented character patterns.
8. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that the weight factors λ11 and λ12 differ between starting strokes and non-starting strokes; the weight factor λ1 varies with the number of characters; the weight factors λh1, λh2 (h = 1 to 7) and λ are obtained using a genetic algorithm; in the accumulated path evaluation, among the λh2 only the weight factor of the important P4 term is kept, and the remaining unimportant weight factors are 0.
9. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that p(g_i|SP) and p(g_i|NSP) in the path score are approximated using a support vector machine classifier.
10. The method for recognizing Chinese and Japanese handwritten text as claimed in claim 1, characterized in that the optimal path is found by Viterbi search.
CN201610895677.5A 2016-10-14 2016-10-14 Chinese and Japanese handwritten text identification method Pending CN106570518A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610895677.5A CN106570518A (en) 2016-10-14 2016-10-14 Chinese and Japanese handwritten text identification method

Publications (1)

Publication Number Publication Date
CN106570518A true CN106570518A (en) 2017-04-19

Family

ID=58532832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610895677.5A Pending CN106570518A (en) 2016-10-14 2016-10-14 Chinese and Japanese handwritten text identification method

Country Status (1)

Country Link
CN (1) CN106570518A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364655A (en) * 2018-01-31 2018-08-03 网易乐得科技有限公司 Method of speech processing, medium, device and computing device
CN108364655B (en) * 2018-01-31 2021-03-09 网易乐得科技有限公司 Voice processing method, medium, device and computing equipment
CN109614494A (en) * 2018-12-29 2019-04-12 东软集团股份有限公司 A kind of file classification method and relevant apparatus
CN111639646A (en) * 2020-05-18 2020-09-08 山东大学 Test paper handwritten English character recognition method and system based on deep learning
CN111639646B (en) * 2020-05-18 2021-04-13 山东大学 Test paper handwritten English character recognition method and system based on deep learning
CN111797634A (en) * 2020-06-04 2020-10-20 语联网(武汉)信息技术有限公司 Document segmentation method and device
CN111797634B (en) * 2020-06-04 2023-09-08 语联网(武汉)信息技术有限公司 Document segmentation method and device
CN112699780A (en) * 2020-12-29 2021-04-23 上海臣星软件技术有限公司 Object identification method, device, equipment and storage medium
CN113592045A (en) * 2021-09-30 2021-11-02 杭州一知智能科技有限公司 Model adaptive text recognition method and system from printed form to handwritten form

Similar Documents

Publication Publication Date Title
CN106570518A (en) Chinese and Japanese handwritten text identification method
Saba et al. Methods and strategies on off-line cursive touched characters segmentation: a directional review
Álvaro et al. An integrated grammar-based approach for mathematical expression recognition
Saba et al. Cursive script segmentation with neural confidence
Mouchere et al. Icdar 2013 crohme: Third international competition on recognition of online handwritten mathematical expressions
Saba et al. Effects of artificially intelligent tools on pattern recognition
JP2750057B2 (en) Statistical mixing method for automatic handwritten character recognition
Harouni et al. Online Persian/Arabic script classification without contextual information
Simistira et al. Recognition of online handwritten mathematical formulas using probabilistic SVMs and stochastic context free grammars
US9711117B2 (en) Method and apparatus for recognising music symbols
Awal et al. Towards handwritten mathematical expression recognition
CN110114776B (en) System and method for character recognition using a fully convolutional neural network
TWI437448B (en) Radical-based hmm modeling for handwritten east asian characters
JP2008532176A (en) Recognition graph
Saba et al. A survey on methods and strategies on touched characters segmentation
CN108280430B (en) Flow image identification method
Hu et al. MST-based visual parsing of online handwritten mathematical expressions
Liu et al. Dynamic local search based immune automatic clustering algorithm and its applications
CN105023029B (en) A kind of on-line handwritten Tibetan language syllable recognition methods and device
CN102360436B (en) Identification method for on-line handwritten Tibetan characters based on components
Sundaram et al. Bigram language models and reevaluation strategy for improved recognition of online handwritten Tamil words
Montagner et al. Staff removal using image operator learning
Le et al. Stroke order normalization for improving recognition of online handwritten mathematical expressions
Saabni et al. Hierarchical on-line arabic handwriting recognition
Wan et al. On-line Chinese character recognition system for overlapping samples

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170419