CN106570518A - Chinese and Japanese handwritten text identification method - Google Patents
- Publication number
- CN106570518A (application CN201610895677.5A)
- Authority
- CN
- China
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/457—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by analysing connectivity, e.g. edge linking, connected component analysis or slices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
Abstract
The invention discloses a method for recognizing Chinese and Japanese handwritten text. The method recognizes handwritten input character strings with comparatively high accuracy and constitutes a robust model for online handwritten character-string recognition. The text is over-segmented at the stroke gaps and, in addition to segmentation points and non-segmentation points, undecided points are introduced. Each candidate character pattern is associated with several candidate character classes. Based on a probabilistic approximation of the handwritten text and the string class, the probability of each candidate segmentation-recognition path is evaluated by combining character recognition, unary and binary geometric features, and linguistic context. The path evaluation criterion combines the context scores flexibly and does not vary with path length, so the optimal segmentation path and the corresponding string class can be found efficiently by Viterbi search. In addition, the parameters of the evaluation model are estimated with a genetic algorithm so that overall string recognition performance is optimized.
Description
Technical field
The invention relates to a method for recognizing Chinese and Japanese handwritten text. Specifically, it proposes a robust model for online handwritten recognition of Chinese and Japanese character strings that combines character recognition, geometric context and linguistic context to evaluate the probability of candidate segmentations and their corresponding string classes.
Background
With the development of pen-input devices with large writing surfaces, such as tablet computers, electronic whiteboards and digital pens (for example, Anoto pens), people want to write more freely, which requires devices to recognize handwritten text, that is, character strings. Compared with isolated character recognition, handwritten text recognition faces one difficulty: the text cannot be segmented reliably before its characters are recognized. A further difficulty is that continuous handwriting is often cursive.
Continuous Chinese or Japanese handwriting is hard to segment because the spaces between characters are not distinct: many characters are composed of several radicals with gaps between them, while some adjacent characters touch each other in cursive writing. Existing segmentation methods attempt to split a string using only the geometric layout of its characters (gaps, character sizes or positions, and their relationships). Without character recognition information and linguistic context, such methods cannot segment a string unambiguously. A feasible way to overcome this ambiguity is integrated segmentation and recognition, which is divided into implicit segmentation and explicit segmentation.
Implicit segmentation, also called segmentation-free recognition, mainly uses hidden Markov models (HMMs). It simply slices the string pattern into frames of equal length and labels them, associating the frames with string classes during recognition. This approach does not capture enough character shape information.
Explicit segmentation tries to split the string pattern at its true character boundaries and to label the resulting pieces, so character patterns can be recognized better. It is usually carried out in two steps: over-segmentation, followed by path evaluation and search. The string pattern is over-segmented into primitive segments, each of which forms a character or part of a character. Consecutive segments are then combined into candidate character patterns (forming a candidate lattice), which are evaluated by character recognition together with geometric and linguistic context.
In over-segmentation-based string recognition, the key question is how to evaluate the paths through the candidate lattice. Under an ideal criterion, the path with the maximum score is the correct segmentation. HMM-based recognition classifies one unique sequence of frame feature vectors per string. In contrast, the candidate lattice produced by over-segmentation contains paths of different lengths, each corresponding to a different sequence of feature vectors, so paths cannot be compared on the basis of Bayesian decision theory as they are in HMM-based recognition. Instead, candidate character recognition scores and context scores are combined heuristically to evaluate paths.
Such heuristic criteria fall into two groups: summation criteria and normalized criteria. A summation criterion sums the character-wise log-likelihoods or, equivalently, multiplies the likelihoods. Because each likelihood is usually smaller than 1, a summation (or product) criterion favors paths with fewer characters and therefore tends to over-merge characters. A normalized criterion, which normalizes the summed score by the number of segmented characters, tends instead to over-split characters. A further problem of the normalized criterion is that the score does not change monotonically as the path grows, so the optimal path cannot be guaranteed by Viterbi search or dynamic programming (DP).
To make better use of character shape information in HMM-based recognition while keeping the path evaluation criterion monotonic, a variable-duration HMM can be used: by concatenating multiple frames, the likelihood of the whole candidate character shape is obtained and substituted for the likelihoods of the individual frames. In an over-segmentation-based system, this corresponds to weighting each candidate character by its number of primitive segments, which improves recognition accuracy and search efficiency. This over-segmentation-based method weights the character recognition score in a summation criterion by the number of primitive segments and thereby overcomes the influence of segmentation length on the result. However, it weights only the character recognition score and does not explain why the geometric and linguistic scores are left unweighted. Such a path evaluation criterion has been derived neither from labeling primitive segments nor from labeling character patterns. A method is therefore needed that decides whether each factor should be weighted by the number of primitive segments and that determines the degree of weighting automatically.
Summary of the invention
1. Recognition process of the invention:
As a method for recognizing online handwritten text, the integrated segmentation and recognition system has five main steps. The input is a handwritten character string or text line, given as a sequence of strokes.
Step 100: Over-segmentation. Each stroke gap (the pen-up interval between two consecutive strokes) is classified as a segmentation point (SP), a non-segmentation point (NSP) or an undecided point (UP). An SP separates two characters at the gap, whereas an NSP indicates that the gap lies within a character; the remaining gaps are classified as UPs according to their geometric features. The run of consecutive strokes between two adjacent SPs or UPs is a primitive segment, and one or more consecutive primitive segments form a candidate character pattern.
Over-segmentation methods generally use a support vector machine (SVM) classifier to divide stroke gaps into SPs and NSPs. The classifier extracts multi-dimensional features, such as the distance and the overlap between the adjacent strokes at a gap, and achieves high classification accuracy. Errors still occur: when adjacent strokes overlap heavily, a true SP tends to be misclassified as an NSP, and when the gap between adjacent strokes is large, a true NSP tends to be misclassified as an SP. A stroke gap that cannot be classified reliably must therefore be treated as an undecided point, which reduces classification errors.
The invention uses an improved two-stage classification scheme that reduces the number of undecided points while keeping the misclassification of SPs and NSPs as low as possible. First, two geometric features are used to classify the stroke gaps into NSPs and hypothetical segmentation points. Then a support vector machine (SVM) classifier classifies some of the hypothetical segmentation points as SPs. The procedure is as follows:
1) Hypothetical segmentation
Hypothetical segmentation points are found by extracting two features at each stroke gap: a horizontal distance feature and an intersection length feature. The horizontal distance feature is computed from two bounding boxes: one enclosing all strokes before the gap, denoted $BB_{p\_all}$, and one enclosing all strokes after it, denoted $BB_{s\_all}$. The distance $D_{Bx}$ is defined as:
$$D_{Bx} = \mathrm{left\_bound}(BB_{s\_all}) - \mathrm{right\_bound}(BB_{p\_all}) \quad (1)$$
The horizontal distance feature $f_d$ is given by:
$$f_d = D_{Bx} / acs \quad (2)$$
Here $acs$ is the average character size. It is estimated by measuring the longer side of the bounding box of each stroke, sorting these lengths over all strokes and averaging the largest one third.
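As a minimal sketch of equations (1) and (2), the following Python code estimates the average character size and computes the horizontal distance feature for one stroke gap. The stroke representation (a list of (x, y) points) and all function names are assumptions made for illustration; the patent does not prescribe a data structure.

```python
from typing import List, Tuple

# Assumed representation: a stroke is a list of (x, y) pen coordinates;
# a bounding box is (left, top, right, bottom).
Stroke = List[Tuple[float, float]]
Box = Tuple[float, float, float, float]


def bounding_box(strokes: List[Stroke]) -> Box:
    """Bounding box enclosing every point of the given strokes."""
    xs = [x for s in strokes for x, _ in s]
    ys = [y for s in strokes for _, y in s]
    return (min(xs), min(ys), max(xs), max(ys))


def average_character_size(strokes: List[Stroke]) -> float:
    """Mean of the largest third of the longer bounding-box sides (acs)."""
    sides = []
    for s in strokes:
        left, top, right, bottom = bounding_box([s])
        sides.append(max(right - left, bottom - top))
    sides.sort(reverse=True)
    top_third = sides[:max(1, len(sides) // 3)]
    return sum(top_third) / len(top_third)


def horizontal_distance_feature(strokes: List[Stroke], gap: int, acs: float) -> float:
    """f_d for the gap between strokes[gap - 1] and strokes[gap]."""
    bb_p = bounding_box(strokes[:gap])   # all strokes before the gap
    bb_s = bounding_box(strokes[gap:])   # all strokes after the gap
    d_bx = bb_s[0] - bb_p[2]             # equation (1): left bound minus right bound
    return d_bx / acs                    # equation (2)
```

A negative feature value means that the strokes after the gap start to the left of the right edge of the strokes before it, that is, the two groups overlap horizontally.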
To compute the intersection length feature, all strokes before the stroke gap and all strokes after it are examined; the former set is denoted $S_{p\_all}$ and the latter $S_{s\_all}$. If a pair of strokes $s_p \in S_{p\_all}$, $s_s \in S_{s\_all}$ intersect at a point $p$, the length of $s_p$ from $p$ to its right endpoint is computed and denoted $l_p$, and the length of $s_s$ from $p$ to its left endpoint is computed and denoted $l_s$. The intersection length between $s_p$ and $s_s$ is defined as:
[equation (3) is not reproduced in the source text]
The accumulated length is:
[equation (4) is not reproduced in the source text]
The overall intersection length feature $f_i$ of a stroke gap is then defined as:
[equation (5) is not reproduced in the source text]
At this point the invention has divided the stroke gaps into NSPs and hypothetical segmentation points. Because the string pattern cannot be split at an NSP, NSPs are not considered in the next step; instead, the support vector machine classifier further classifies the hypothetical segmentation points.
2) Support vector machine classification
The hypothetical segmentation points are classified with a support vector machine (SVM) classifier based on geometric features. Two features are used in addition: the intersection length feature $f_i$ defined in the hypothetical segmentation step, and a newly introduced width feature of a hypothetical segmentation point, defined as the width of the strokes between the preceding and the following hypothetical segmentation point, divided by the average character size $acs$.
The SVM is trained with target values 1 and -1 for SPs and NSPs, respectively. At each hypothetical segmentation point of a test string pattern, the SVM output is converted into a probability, which is then combined into the evaluation criterion for candidate segmentation-recognition paths.
A hypothetical segmentation point whose SVM output exceeds one threshold and whose width feature also exceeds another threshold is selected as an SP. At a determined SP, the adjacent primitive segments cannot be merged into a candidate character pattern. Reducing the number of hypothetical segmentation points in this way shrinks the candidate lattice and makes string recognition more efficient.
All remaining stroke gaps, that is, the remaining hypothetical segmentation points, are left undecided, and the number of such points determines the amount of computation. Eliminating false segmentation points increases the processing time of the SVM classification stage, but it reduces the processing time of the later steps and improves the recognition rate in path evaluation.
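The selection rule above can be sketched as a two-threshold test. The threshold values and the function name below are hypothetical placeholders; the patent states only that two thresholds exist, not their values.

```python
def classify_hypothetical_point(svm_output: float, width_feature: float,
                                svm_threshold: float = 0.5,
                                width_threshold: float = 0.6) -> str:
    """Return "SP" when both tests pass, otherwise leave the point undecided ("UP").

    svm_threshold and width_threshold are illustrative values, not taken
    from the patent.
    """
    if svm_output > svm_threshold and width_feature > width_threshold:
        return "SP"
    return "UP"
```

Requiring both conditions keeps the rule conservative: a gap is fixed as a segmentation point only when the classifier is confident and the neighbouring strokes are wide enough to be a plausible character.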
Step 102: Candidate lattice construction. Through character classification, each candidate character pattern is associated with several candidate character classes, each with a probability. All combinations of candidate patterns and character classes are represented by a segmentation-recognition candidate lattice, in which each arc represents a segmentation point and each node represents a character class assigned to a candidate pattern.
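A minimal sketch of how candidate character patterns could be enumerated from the primitive segments before classes are attached to them. It assumes that a candidate pattern may span undecided points but may not extend across a determined SP, and it caps the number of segments per pattern with a hypothetical `max_segments` parameter that is not part of the patent.

```python
from typing import List, Tuple


def candidate_patterns(boundaries: List[str], max_segments: int = 4) -> List[Tuple[int, int]]:
    """Enumerate candidate patterns as (start, end) segment ranges, end exclusive.

    boundaries[j] is the type of the gap between primitive segments j and j + 1:
    "SP" for a determined segmentation point, "UP" for an undecided point.
    """
    m = len(boundaries) + 1  # number of primitive segments
    patterns = []
    for start in range(m):
        for end in range(start + 1, min(m, start + max_segments) + 1):
            patterns.append((start, end))
            # Segments on either side of a determined SP must not be merged.
            if end < m and boundaries[end - 1] == "SP":
                break
    return patterns
```

Each returned range would then be passed to the character recognizer, and its top candidate classes would become the lattice nodes for that range.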
Step 104: Confidence evaluation. The confidence scores of the candidate lattice and of the corresponding string classes are evaluated.
Step 106: Path evaluation. Each segmentation path in the candidate lattice and its corresponding string class are evaluated by combining the scores of the candidate character classes with the compatibility between characters, that is, the geometric and linguistic context probabilities.
The string pattern is represented as a sequence of primitive segments, $X = s_1, \ldots, s_m$, which is partitioned into character patterns $Z = z_1, \ldots, z_n$, where each candidate pattern contains $k_i$ primitive segments: $z_i = s_{j_i}, \ldots, s_{j_i+k_i-1}$. The segmented character patterns are assigned to the classes $C = C_1, \ldots, C_n$. To score the string $X$ against the string class $C$, features are extracted from the primitive segments (or candidate patterns) and from the compatibility between segments (or characters). The features are the bounding box feature $b_i$, the inner gap feature $q_i$, the shape feature $s_i$ or $z_i$, the unary position feature $p^u_i$ of a single segment (or character), the binary position feature $p^b_i$ of adjacent segments (or characters), and the between-segment gap feature $g_i$, which is also the basis for determining SPs and NSPs in step 100.
Denoting the feature sequences of the primitive segments by $b$, $q$, $X$, $p^u$ and $g$, the posterior probability of the string class is:
[equation (6) is not reproduced in the source text]
Here the character recognizer and the geometric scoring functions are obtained by training on a database.
Ignoring the denominator, which does not depend on the string class, and making the reasonable assumption that the features are mutually independent, the evaluation of a string class is equivalent to:
[equation (7) is not reproduced in the source text]
In formula (7), $c_i$ denotes a hypothetical class (called an over-class) of a primitive segment, and $t_i$ is either SP or NSP, denoting a segmentation point and a non-segmentation point, respectively. A character class $C_j$ comprises one or more consecutive $c_i$. $p(z_i|C_i)$ is obtained from the character recognizer trained on the database and is a normalized conditional recognition score. $p(b_i|C_i)$, $p(q_i|C_i)$, $p(p^u_i|C_i)$ and $p(p^b_i|C_{i-1}C_i)$ are geometric scoring functions obtained by training quadratic discriminant function classifiers on the database.
The linguistic prior $P(C)$ is expressed by a trigram over the over-classes of the primitive segments, $P(c_i|c_{i-2}c_{i-1})$. Because a trigram over over-classes is hard to obtain, the character-class trigram $P(C_i|C_{i-2}C_{i-1})$ is used to approximate $P(c_i|c_{i-2}c_{i-1})$, where $C_i$ contains $c_i$:
[equation (8) is not reproduced in the source text]
In formula (8), $\lambda_{11}$ and $\lambda_{12}$ are weight factors, and $\lambda_1$ is a bias that balances the deviation caused by the number of characters. Because the starting stroke of a character pattern and its non-starting strokes have different effects, different weights are needed when the transition probability is estimated.
In the same way, the features extracted from the primitive segments are approximated by the features extracted from the candidate character patterns, and different weights are used for starting and non-starting strokes for the other features as well. The resulting path score is:
[equation (9) is not reproduced in the source text]
In formula (9), $P_h$, $h = 1, \ldots, 6$, denote $P(C_i|C_{i-2}C_{i-1})$, $p(b_i|C_i)$, $p(q_i|C_i)$, $p(z_i|C_i)$, $p(p^u_i|C_i)$ and $p(p^b_i|C_{i-1}C_i)$, respectively. $\lambda$ in formula (9) collects all the bias terms for $h = 1, \ldots, 6$. Note that $s_i$ is replaced by $z_i$ here, because the probability function $p(s_i|c_i)$ is estimated by $p(z_i|C_i)$.
On the training data set, the weight factors $\lambda_{h1}$, $\lambda_{h2}$ ($h = 1, \ldots, 7$) and $\lambda$ are selected by a genetic algorithm so as to optimize string recognition performance.
To reduce the number of insignificant parameters, simplify learning and improve recognition accuracy, $\lambda_{h2}$ is set to 0 for $h \neq 4$, that is, the second weight is kept only for $P_4$. Formula (9) then becomes:
[equation (10) is not reproduced in the source text]
and further:
[equation (11) is not reproduced in the source text]
In formula (11), $P_h$, $h = 1, \ldots, 5$, denote $P(C_i|C_{i-2}C_{i-1})$, $P(b_i|C_i)$, $P(q_i|C_i)$, $P(p^u_i|C_i)$ and $P(p^b_i|C_{i-1}C_i)$, respectively.
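The path score formulas themselves are not reproduced in this text, so the following is only a sketch of one plausible reading of the weighting described above. It assumes that, for a candidate character made of $k$ primitive segments, each log-probability term is weighted by $\lambda_{h1} + \lambda_{h2}(k - 1)$ (one weight for the starting segment, another for each further segment) and that a bias is added per character. This form, the omission of the segmentation point terms and every name in the code are assumptions, not the patent's stated formula.

```python
from typing import Sequence


def character_score(log_probs: Sequence[float], k: int,
                    lam1: Sequence[float], lam2: Sequence[float],
                    bias: float) -> float:
    """Weighted score of one candidate character made of k primitive segments.

    Assumed form: sum over h of (lam1[h] + lam2[h] * (k - 1)) * log_probs[h],
    plus a per-character bias. log_probs[h] stands for log P_h.
    """
    total = bias
    for lp, a, b in zip(log_probs, lam1, lam2):
        total += (a + b * (k - 1)) * lp
    return total
```

Under this reading, setting `lam2[h] = 0` for a term means that term contributes once per character regardless of how many primitive segments the character spans, which corresponds to the simplification of keeping the second weight only for the character recognition term.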
The trigram probability $P(C_i|C_{i-2}C_{i-1})$ is computed from a text corpus. When $C_i$ is the first or second character of a sentence, the trigram degenerates to a unigram or a bigram, respectively. To compensate for imprecise estimates caused by insufficient training data, the trigram is smoothed as follows:
$$P'(C_i|C_{i-2},C_{i-1}) = \beta_1 P(C_i|C_{i-2},C_{i-1}) + \beta_2 P(C_i|C_{i-1}) + \beta_3 P(C_i) + \beta_4 \quad (12)$$
The weights are estimated from a different text corpus and satisfy $\beta_1 + \beta_2 + \beta_3 + \beta_4 = 1$.
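Equation (12) is a linear interpolation of the trigram, bigram and unigram probabilities plus a constant floor. A minimal sketch follows; the default coefficient values are purely illustrative and are not taken from the patent.

```python
from typing import Sequence


def smoothed_trigram(p_tri: float, p_bi: float, p_uni: float,
                     betas: Sequence[float] = (0.7, 0.2, 0.09, 0.01)) -> float:
    """Equation (12): interpolate trigram, bigram and unigram estimates.

    betas = (beta1, beta2, beta3, beta4) must sum to 1; the defaults are
    illustrative placeholders.
    """
    b1, b2, b3, b4 = betas
    if abs(b1 + b2 + b3 + b4 - 1.0) > 1e-9:
        raise ValueError("smoothing coefficients must sum to 1")
    return b1 * p_tri + b2 * p_bi + b3 * p_uni + b4
```

The constant term $\beta_4$ guarantees a non-zero probability even for a character sequence that never occurs in the corpus, so its logarithm stays finite in the path score.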
To preserve scale invariance, the geometric features $b_i$, $q_i$, $p^u_i$ and $p^b_i$ are normalized by the average character size $acs$. The feature vector $b_i$ consists of the height and width of the bounding box of each character pattern. The feature vector $q_i$ consists of six values, as shown in Fig. 6: the first three are the horizontal widths of three vertical gaps (found from the vertical projection), and the last three are the vertical heights of three horizontal gaps (found from the horizontal projection). The feature vector $p^u_i$ contains the vertical distances from the center line to the top and to the bottom of the bounding box. The feature vector $p^b_i$ has two elements obtained from the bounding boxes of two adjacent character patterns: the vertical distance between their upper bounds and the vertical distance between their lower bounds. $p(p^b_1|C_0,C_1)$ is set to 1.
To reduce the complexity of $p(p^b_i|C_{i-1}C_i)$, the character classes are grouped into six super-classes according to the mean vectors of their unary position features in the training samples, so that $p(p^b_i|C_{i-1}C_i)$ is replaced by $p(p^b_i|C'_{i-1},C'_i)$, where $C'_{i-1}$ and $C'_i$ are super-classes. Quadratic discriminant function classifiers convert the geometric feature vectors $b_i$, $q_i$, $p^u_i$ and $p^b_i$ into log-likelihood scores, which are used in formula (11). The character shape score $p(z_i|C_i)$ is given by a character recognizer. The feature vector $g_i$ contains several measurements taken at a candidate segmentation point, which lies between two adjacent primitive segments.
$p(g_i|SP)$ and $p(g_i|NSP)$ are approximated with a support vector machine (SVM) classifier. To obtain the probabilities $p(o_i|SP)$ and $p(o_i|NSP)$, where $o_i$ is the SVM output for $g_i$, the SVM output is passed through warping functions, which are fitted to the SVM outputs on a validation data set. The value of $p(o_1|SP)$ is set to 1.
To warp the SVM outputs, the histograms of the outputs, $p(o_i|SP)$ and $p(o_i|NSP)$, are obtained first, and the cumulative probabilities $p'(o_i|SP)$ and $p'(o_i|NSP)$ are then computed:
[equation (13) is not reproduced in the source text]
$p'(o_i|SP)$ and $p'(o_i|NSP)$ are then each fitted with a sigmoid function, whose parameters are obtained by minimizing the squared error. The two sigmoid functions are fitted separately and have their own parameters.
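The sigmoid fitting step can be sketched as a least-squares search over two parameters. The parameterization $\sigma(a \cdot o + b)$ and the coarse grid search are assumptions chosen for brevity; the patent states only that sigmoid parameters are found by minimizing the squared error, and `targets` stands for the cumulative probabilities computed from the histograms.

```python
import math
from typing import Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_sigmoid(outputs: Sequence[float], targets: Sequence[float],
                a_grid: Sequence[float], b_grid: Sequence[float]) -> Tuple[float, float]:
    """Pick (a, b) on the grid minimizing sum of (sigmoid(a * o + b) - target)^2."""
    best_err, best_ab = float("inf"), (a_grid[0], b_grid[0])
    for a in a_grid:
        for b in b_grid:
            err = sum((sigmoid(a * o + b) - t) ** 2
                      for o, t in zip(outputs, targets))
            if err < best_err:
                best_err, best_ab = err, (a, b)
    return best_ab
```

In practice a continuous optimizer would replace the grid, and the function would be called twice, once with the SP targets and once with the NSP targets.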
The weight factors $\lambda_{h1}$, $\lambda_{h2}$ ($h = 1, \ldots, 7$) and $\lambda$ are trained by a genetic algorithm on training string patterns so that the recognition rate on the training data is maximized. To this end, $\lambda_{h1}$, $\lambda_{h2}$ ($h = 1, \ldots, 7$) and $\lambda$ are treated as the elements of a chromosome. The weight factors are estimated by the genetic algorithm in the following steps:
1) Initialization: Initialize N chromosomes with random values between 0 and 1, set the average fitness $f_{old}$ to 0 and set the time t to 1.
2) Crossover: Randomly select two chromosomes from the N chromosomes and cross them at two random positions to produce two new chromosomes. Repeat this step until M new chromosomes are obtained.
3) Mutation: With probability $P_{mut}$, change each element of each of the N+M chromosomes by a random value between -1 and 1.
4) Fitness evaluation: Evaluate the fitness of each chromosome as the recognition rate on the training data obtained with the weight values encoded in that chromosome.
5) Selection: Determine the roulette-wheel probability of each chromosome from its fitness. First keep the two chromosomes with the highest fitness, then select chromosomes by roulette wheel until N new chromosomes are obtained. Replace the original N chromosomes with the new ones.
6) Iteration: Compute the average fitness $f_{new}$ of the new N chromosomes. If ($f_{new} - f_{old}$ < threshold) has occurred $n_{stop}$ times, or if t > T, return the chromosome with the highest fitness. Otherwise set $f_{old}$ to $f_{new}$, increase t and return to step 2).
To evaluate the fitness of a chromosome, the optimal path of each training string pattern is searched, with the path scores computed from the weight values in that chromosome.
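The six steps above can be sketched as follows. The fitness function is passed in as a hypothetical callable (in the patent it would be the recognition rate on the training strings, which lies between 0 and 1); the default population sizes, mutation probability and stopping values are illustrative assumptions, and the stopping test counts every occurrence of a small improvement, which is one reading of step 6.

```python
import random
from typing import Callable, List

Chromosome = List[float]


def ga_optimize(fitness: Callable[[Chromosome], float], n_genes: int,
                n: int = 20, m: int = 20, p_mut: float = 0.1,
                max_iter: int = 50, eps: float = 1e-4, n_stop: int = 5,
                seed: int = 0) -> Chromosome:
    """Genetic search for a weight vector; fitness must be non-negative."""
    rng = random.Random(seed)
    # 1) Initialization: N chromosomes with genes drawn from [0, 1].
    pop = [[rng.random() for _ in range(n_genes)] for _ in range(n)]
    f_old, stalls = 0.0, 0
    for _ in range(max_iter):
        # 2) Crossover: two-point crossover until M children exist.
        children: List[Chromosome] = []
        while len(children) < m:
            a, b = rng.sample(pop, 2)
            i, j = sorted(rng.sample(range(n_genes + 1), 2))
            children.append(a[:i] + b[i:j] + a[j:])
            children.append(b[:i] + a[i:j] + b[j:])
        pool = [c[:] for c in pop] + children[:m]
        # 3) Mutation: perturb each gene with probability p_mut.
        for c in pool:
            for g in range(n_genes):
                if rng.random() < p_mut:
                    c[g] += rng.uniform(-1.0, 1.0)
        # 4) Fitness evaluation of all N + M chromosomes.
        scored = sorted(((fitness(c), c) for c in pool),
                        key=lambda fc: fc[0], reverse=True)
        # 5) Selection: keep the best two, fill the rest by roulette wheel.
        new_pop = [scored[0][1], scored[1][1]]
        candidates = [c for _, c in scored]
        weights = [f + 1e-12 for f, _ in scored]
        while len(new_pop) < n:
            new_pop.append(rng.choices(candidates, weights=weights, k=1)[0][:])
        pop = new_pop
        # 6) Iteration: stop after n_stop small improvements of the mean fitness.
        f_new = sum(fitness(c) for c in pop) / n
        if f_new - f_old < eps:
            stalls += 1
        if stalls >= n_stop:
            break
        f_old = f_new
    return max(pop, key=fitness)
```

Because the fitness here is a recognition rate rather than a differentiable loss, a derivative-free search of this kind can optimize it directly, at the cost of running the path search on the training strings for every chromosome.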
Step 108: String recognition. The path score in formula (11) is accumulated over the primitive segments, so it does not vary with the number of segmented character patterns. The optimal path can therefore be found by Viterbi search (dynamic programming). By refining the path evaluation criterion and repeating the steps above, an optimal path that corresponds better to the correct segmentation and recognition is obtained. Finally, the optimal path with the highest score gives the final result of character segmentation and recognition.
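The search over segmentations can be sketched with a small dynamic program over segment boundaries. This is a simplified illustration: it treats the score of each candidate pattern as independent of its neighbours, so it omits the class candidates and the trigram and binary position context that the full lattice search would carry in its states. The `score` callable and `max_segments` are hypothetical.

```python
from typing import Callable, List, Tuple


def best_segmentation(m: int, score: Callable[[int, int], float],
                      max_segments: int = 4) -> Tuple[float, List[Tuple[int, int]]]:
    """Find the highest-scoring split of m primitive segments into patterns.

    score(start, end) is the additive score of the candidate pattern covering
    segments start..end-1.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (m + 1)   # best[e]: best score of any path ending at boundary e
    back = [0] * (m + 1)         # back[e]: start boundary of the last pattern on that path
    best[0] = 0.0
    for end in range(1, m + 1):
        for start in range(max(0, end - max_segments), end):
            if best[start] == neg_inf:
                continue
            s = best[start] + score(start, end)
            if s > best[end]:
                best[end], back[end] = s, start
    # Trace the back-pointers from the last boundary to recover the path.
    path, end = [], m
    while end > 0:
        path.append((back[end], end))
        end = back[end]
    path.reverse()
    return best[m], path
```

The recursion is valid only because the score is additive along the path; a score divided by the number of characters would break this property, which is why a length-normalized criterion cannot be searched this way.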
Description of the drawings
Fig. 1 is a flow chart of the steps of the handwritten text recognition method of the invention.
Fig. 2 is a detailed flow chart of the handwritten text recognition method of the invention.
Fig. 3 is a flow chart of the over-segmentation of the invention.
Fig. 4 shows examples of misclassified stroke gaps.
Fig. 5 is a schematic diagram of some geometric features according to an embodiment of the invention.
Fig. 6 is a schematic diagram of the inner gap features of a character pattern according to an embodiment of the invention.
Fig. 7 is a schematic diagram of the thresholds set for hypothetical segmentation points according to an embodiment of the invention.
Fig. 8 shows an example of primitive segments and the corresponding candidate lattice according to an embodiment of the invention.
Detailed description of the embodiments
The invention is further described below with reference to specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope.
Embodiment 1:
In this embodiment, to evaluate the string recognition model of the invention, the character recognizer and the geometric scoring functions were trained on a Japanese online handwriting database. The character recognizer combines offline and online recognition methods, and its recognition scores are normalized into the conditional probabilities $p(z_i|C_i)$. For the geometric scores, four quadratic discriminant function classifiers were trained, one each for $p(b_i|C_i)$, $p(q_i|C_i)$, $p(p^u_i|C_i)$ and $p(p^b_i|C_{i-1}C_i)$.
To score the linguistic context, an initial trigram table was prepared from the 1993 volume of the Asahi Shimbun and the 2002 volume of the Nikkei newspaper. The smoothing coefficients $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ were estimated using the Japanese online handwriting database. By deleting trigrams that do not occur, ignoring trigrams that occur rarely and quantizing the logarithm of the trigram probabilities, the size of the trigram data was reduced to 6 MB.
Horizontal text lines were extracted from the Kondate text database, which was collected from 100 writers, to train the weight parameters and to evaluate string recognition performance. The handwritten text lines of 75 writers were used to train the support vector machine classifier for the candidate segmentation point probabilities and to obtain the weight factors of the path evaluation score. After training, the text lines of the remaining 25 writers were used for testing. The training and test data are listed in Table 1. The experiments were run on a Pentium(R) 4 2.80 GHz processor with 512 MB of memory. To simplify computation, each weight value was first set to 1 and the top 100 recognition candidates (segmentation-recognition paths) were selected for each training string.
| | Text lines | Character patterns | Character classes | Characters per line |
| Training | 10 174 | 104 093 | 1 106 | 10.23 |
| Test | 3 511 | 35 686 | 790 | 16.89 |
Table 1. Text line data used for training and testing
In this embodiment, as shown in Fig. 7, a stroke gap is classified as a hypothetical segmentation point (undecided point) if its horizontal distance feature is greater than 0 or if it falls within region OABCDE; otherwise it is a non-segmentation point. Then, within region OFGH, if the width between two consecutive hypothetical segmentation points divided by $acs$ exceeds a threshold, the points between these two hypothetical segmentation points that were classified as non-segmentation points are reclassified as undecided points.
The distribution of the horizontal distance feature and the intersection length feature over the stroke gaps of a set of training string samples is shown in Fig. 8. The results show that the embodiment distinguishes segmentation points from non-segmentation points well.
Table 2. Text line recognition results obtained with the two over-segmentation methods
The over-segmentation of the invention (the two-stage classification method) was compared with over-segmentation based directly on the support vector machine output (the one-stage classification method). For a fair comparison, both methods were evaluated with the path evaluation criterion proposed by the invention. The one-stage method extracts 19 features from each stroke gap and applies a support vector machine to these features to classify each gap as a segmentation point, a non-segmentation point or an undecided point. Text line recognition performance was measured by the character segmentation measure $f$ (the F-measure of segmentation point detection), the character recognition rate $C_r$ and the average recognition time per text line $T_{av\_rec\_tl}$. The results are shown in Table 2.
Table 2 shows that, although the over-segmentation method proposed by the invention takes more processing time than the one-stage classification method, it markedly improves character recognition and segmentation accuracy. The two-stage classification method leaves many stroke gaps undecided. Although these undecided points complicate the candidate lattice and increase the computation of string recognition, they reduce classification errors and thereby improve recognition performance.
The performance of three path evaluation criteria is compared: the present invention; Method 1, which uses the weight factors proposed by Nakagawa et al. (Nakagawa, M., Zhu, B., Onuma, M.: A model of on-line handwritten Japanese text recognition free from line direction and writing format constraints. IEICE Trans. Inf. Syst. E88-D(8), 1815-1822 (2005)); and Method 2, proposed by Zhou et al. (Zhou, X.-D., Yu, J.-L., Liu, C.-L., Nagasaki, T., Marukawa, K.: Online handwritten Japanese character string recognition incorporating geometric context. In: Proceedings of the 9th International Conference on Document Analysis and Recognition, pp. 48–52. Curitiba, Brazil (2007)). Method 1 and Method 2 are given by formula (14) and formula (15), respectively. For a fair comparison, the three methods use the same trigram language model, the same character recognizer, and the same geometric context classifiers. The weight factors of each method, λ_h1, λ_h2 (h = 1 to 7), λ_i (i = 1 to 7) and λ, are optimized with a genetic algorithm. All three methods combine the same seven path evaluation terms of formula (9). The differences are that Method 1 does not use the terms related to k_i (the number of primitive segments that make up a character pattern), and Method 2 normalizes the path score of Method 1 by the number of segmented characters.
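The difference between an accumulated path score and one normalized by the number of segmented characters can be seen in a toy comparison. This is only a sketch: the per-character log scores are invented, and the two helper functions are simplified stand-ins rather than formulas (14) and (15) themselves.

```python
def additive_score(char_log_scores):
    """Path score accumulated over the characters of a path (simplified)."""
    return sum(char_log_scores)


def normalized_score(char_log_scores):
    """The same accumulated score divided by the number of segmented characters."""
    return sum(char_log_scores) / len(char_log_scores)


# Two hypothetical segmentations of the same strokes (made-up log scores).
merged_path = [-1.0, -1.2]              # 2 characters
split_path = [-0.5, -0.6, -0.5, -0.7]   # 4 characters

additive_prefers_merged = additive_score(merged_path) > additive_score(split_path)
normalized_prefers_split = normalized_score(split_path) > normalized_score(merged_path)
```

With these numbers the accumulated score ranks the two-character path higher (-2.2 against -2.3), while the normalized score ranks the four-character path higher (-0.575 against -1.1), so the two criteria can disagree on the same lattice.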
In Method 2, the path score is not accumulated along the character string, so beam search is used to find the optimal path in the candidate lattice. For the present invention and Method 1, the optimal path is found by Viterbi search. For all three methods, the candidate lattice keeps ten candidate classes for each character pattern.
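As an illustration of why an accumulated score allows Viterbi search, the following sketch runs dynamic programming over a small candidate lattice. It is a simplification under stated assumptions: each candidate carries a single log score, the trigram language context is ignored, and the tuple layout `(start, end, label, log_score)` is hypothetical.

```python
import math


def viterbi_best_path(num_segments, candidates):
    """Best-scoring path through a candidate lattice with additive scores.

    candidates: list of (start, end, label, log_score); a candidate covers
    primitive segments start .. end-1 (assumed layout for this sketch).
    """
    best = [-math.inf] * (num_segments + 1)   # best[i]: best score covering segments 0..i-1
    back = [None] * (num_segments + 1)        # back[i]: (previous position, label)
    best[0] = 0.0
    # Process candidates by end position so best[start] is final when it is used.
    for start, end, label, score in sorted(candidates, key=lambda c: c[1]):
        if best[start] + score > best[end]:
            best[end] = best[start] + score
            back[end] = (start, label)
    if back[num_segments] is None:
        return -math.inf, []                  # no path covers all segments
    labels, pos = [], num_segments
    while pos > 0:                            # follow back-pointers to the start
        pos, label = back[pos]
        labels.append(label)
    return best[num_segments], labels[::-1]


# Hypothetical lattice over 3 primitive segments.
lattice = [
    (0, 1, "木", -1.0), (1, 2, "木", -1.2), (0, 2, "林", -0.9),
    (2, 3, "の", -0.5), (1, 3, "x", -3.0),
]
score, labels = viterbi_best_path(3, lattice)
```

Because the score of a path is a sum over its candidates, the best path to each position can be extended independently of how it was reached. A score divided by the number of characters does not have this property, which is consistent with Method 2 relying on beam search.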
To demonstrate the advantage of optimizing the weight factors with a genetic algorithm, the character recognition rate obtained with the genetic algorithm optimization of the present invention can be compared with that obtained with the minimum classification error (MCE) criterion optimized by stochastic gradient descent. The MCE criterion optimized by stochastic gradient descent finds the optimal parameter vector λ by minimizing the difference between the score of the highest-scoring incorrect path and the score of the correct path:
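$$L_{MCE}(\lambda, X) = \sigma\big(\max(\mathrm{Score}_{incorrect}) - \mathrm{Score}_{correct}\big), \qquad (16)$$

$$\sigma(x) = (1 + e^{-x})^{-1}$$

Score_correct is the score of the correct path in the candidate lattice, and Score_incorrect is the score of an incorrect path in the candidate lattice.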
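Formula (16) can be evaluated directly once the path scores are known. The sketch below assumes the scores are plain floats that have already been computed from the candidate lattice; the function names are hypothetical.

```python
import math


def sigmoid(x):
    """sigma(x) = 1 / (1 + exp(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def mce_loss(score_correct, scores_incorrect):
    """Formula (16): sigmoid of (best incorrect path score - correct path score)."""
    return sigmoid(max(scores_incorrect) - score_correct)


# The loss is below 0.5 when the correct path outscores every incorrect path,
# and above 0.5 when some incorrect path scores higher.
loss_when_correct_wins = mce_loss(-10.0, [-12.0, -15.0])
loss_when_correct_loses = mce_loss(-10.0, [-8.0, -15.0])
```

The sigmoid makes the loss a smooth function of the scores, which is what allows it to be minimized by gradient descent.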
Table 3 shows the character string recognition results of the three path evaluation methods. The trained weight values for character recognition obtained by genetic algorithm optimization are:
(λ_11, λ_12, λ_21, λ_22, λ_31, λ_32, λ_41, λ_42, λ_51, λ_52, λ_61, λ_62, λ_71, λ_72, λ) = (0.351, 0.000, 0.265, 0.001, 0.199, 0.000, 1.000, 0.641, 0.009, 0.000, 0.100, 0.000, 0.100, 0.000, 0.323, 0.120, 0.100).
Table 3. Text line recognition results of the three path evaluation methods
The weight factors computed by the genetic algorithm show that, except for the character recognition score p(z_i|C_i) and the non-segmentation-point score p(g_i|NSP), the remaining geometric features and the linguistic context are not weighted by the number of primitive segments (λ_h2 = 0). This means that, apart from the character recognition score and the non-segmentation-point score, the geometric feature scores and the linguistic context score are almost independent of the number of primitive segments in a character pattern.
The results show that the path evaluation model of the present invention improves the accuracy of character recognition and segmentation, both when the weights are obtained by genetic algorithm optimization and when they are obtained with the minimum classification error criterion optimized by stochastic gradient descent. With Method 1, a shorter character string tends to receive a larger evaluation score than a longer one, so characters are easily over-merged. With Method 2, the normalized path score, longer character strings are favored, so characters tend to be over-segmented. The present invention overcomes these problems, and its path score does not change with the number of segmented characters. In addition, the processing times of the three methods are almost identical. Compared with parameter optimization by the minimum classification error criterion with stochastic gradient descent, genetic algorithm optimization achieves better character string recognition performance. This is because gradient descent reaches only a local optimum, whereas the genetic algorithm directly optimizes the character recognition rate on the training data (which is not differentiable) and can therefore reach a global optimum.
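A genetic algorithm can use the recognition rate itself as the fitness, since it needs no gradient. The following is a minimal sketch of that idea and not the optimizer used in the patent: the toy samples, the linear scoring, and all parameter values (population size, mutation width, seed) are assumptions made for illustration.

```python
import random


def recognition_rate(weights, samples):
    """Fraction of samples whose correct path outscores every incorrect path.

    samples: list of (correct_features, [incorrect_features, ...]); a path
    score is the weighted sum of its feature values (toy stand-in).
    """
    def score(features):
        return sum(w * x for w, x in zip(weights, features))
    hits = sum(1 for correct, wrong in samples
               if score(correct) > max(score(f) for f in wrong))
    return hits / len(samples)


def genetic_optimize(fitness, dim, generations=30, pop_size=20, seed=0):
    """Tiny elitist genetic algorithm over weight vectors in [0, 1]^dim (dim >= 2)."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]           # keep the better half unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, dim)               # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(dim)                    # mutate one weight, clipped to [0, 1]
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0.0, 0.1)))
            children.append(child)
        population = elite + children
    return max(population, key=fitness)


# Made-up training samples with three feature scores per path.
samples = [
    ((-1.0, -2.0, -0.5), [(-1.5, -1.0, -0.7), (-0.8, -3.0, -0.9)]),
    ((-0.7, -1.5, -1.2), [(-0.9, -1.4, -0.3)]),
    ((-2.0, -0.5, -0.8), [(-1.0, -1.5, -0.9), (-2.5, -0.4, -1.5)]),
]
best_weights = genetic_optimize(lambda w: recognition_rate(w, samples), dim=3)
best_rate = recognition_rate(best_weights, samples)
```

The fitness here is a step function of the weights, so a gradient-based method would need a smooth surrogate such as the MCE loss, while the genetic algorithm can evaluate it as is.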
Because the weight factors are learned from data, more parameters mean more degrees of freedom in theory. In practice, however, the training database is limited, so too many parameters complicate the learning structure and make it difficult to obtain optimal parameters. In particular, according to the present invention, formula (11) is used: for λ_h2, only the weight factor of the important P_4 term is kept and the rest are set to 0, which helps the optimization of the parameters. Actual results also confirm that the method proposed by the present invention further improves segmentation accuracy and character recognition accuracy, as shown in Table 4.
Table 4
Claims (10)
1. A method for recognizing Chinese and Japanese handwritten text, characterized in that the method mainly comprises the following steps:
Step 100, over-segmentation of strokes, including classifying each stroke gap as a segmentation point, a non-segmentation point, or an undecided point;
Step 102, construction of a candidate lattice of character patterns, including associating each candidate character pattern with multiple candidate classes through character classification;
Step 104, confidence evaluation of the segmented patterns, including evaluating the score of each candidate character pattern with respect to its candidate classes;
Step 106, path evaluation of the character string, including accumulating the confidence scores and removing unimportant parameters from the evaluation;
Step 108, finding the optimal path and recognizing the character string; the evaluation of the candidate characters on a segmentation path is combined with their scores for the character string classes and for character compatibility, the path evaluation criterion is improved, and the optimal path corresponding to correct segmentation and recognition is obtained;
In the step 100, the over-segmentation of strokes divides the character string pattern X = s_1, …, s_m into candidate character patterns Z = z_1, …, z_n, where each candidate character pattern contains k_i primitive segments, z_i = s_{j_i}, …, s_{j_i + k_i − 1}; the segmented character patterns are then classified into the classes C = C_1, …, C_n; a character class C_j consists of one or more consecutive c_i; c_i denotes a hypothetical class of a character string or a primitive segment and is referred to as an over-class;
In the step 104, the confidence evaluation method includes evaluating the score of the character string pattern X with respect to the character classes C, and extracting the bounding box feature b, the inner gap feature q, the unary position feature p^u of a single segment or character, the gap feature g between segments, and the binary position feature p^b of adjacent segments or characters; the classifier and the geometric scoring functions are obtained from a training database and include the conditional likelihood p(z_i|C_i), the geometric scores p(b_i|C_i) and p(q_i|C_i), and the scores of the unary position feature p^u_i and the binary position feature p^b_i;
The posterior probability of the character string is:
The evaluation of the character string class is equivalent to:
In the step 106, the accumulated path score is:
P_h, h = 1, …, 6, denote respectively P(C_i|C_{i-2}C_{i-1}), p(b_i|C_i), p(q_i|C_i), p(z_i|C_i), and the scores of the unary position feature p^u_i and the binary position feature p^b_i;
In the accumulated path score f(X, C) of the step 106, λ_h2 = 0 is set for h ≠ 4, and the accumulated path score becomes:
where P_h, h = 1, …, 5, denote respectively P(C_i|C_{i-2}C_{i-1}), P(b_i|C_i), P(q_i|C_i), and the scores of the unary position feature p^u_i and the binary position feature p^b_i; λ denotes the weight factors, and SP and NSP denote a segmentation point and a non-segmentation point, respectively.
2. The method for recognizing Chinese and Japanese handwritten text according to claim 1, characterized in that the input is a handwritten character string or a text line composed of a sequence of strokes, in Chinese or Japanese.
3. The method for recognizing Chinese and Japanese handwritten text according to claim 1, characterized in that the over-segmentation step includes first extracting two features of each stroke gap, the horizontal distance and the overlap length, to obtain hypothetical segmentation points, and then selecting as segmentation points those hypothetical segmentation points whose support vector machine output and width feature value exceed two respective thresholds, the remaining points being undecided points.
4. The method for recognizing Chinese and Japanese handwritten text according to claim 1, characterized in that the features of the primitive segments are estimated from the features extracted from the candidate character patterns.
5. The method for recognizing Chinese and Japanese handwritten text according to claim 1, characterized in that, in deriving the evaluation of the character string class from the posterior probability of the character string, the denominator that is independent of the character string class is ignored and the features are assumed to be mutually independent.
6. The method for recognizing Chinese and Japanese handwritten text according to claim 1, characterized in that P(C) represents the linguistic prior, uses a trigram of over-classes, and is expressed by the character shape score P(c_i|c_{i-2}c_{i-1}) of the primitive segments, which is given by a character recognizer.
7. The method for recognizing Chinese and Japanese handwritten text according to claim 1, characterized in that, in computing the path score, the geometric feature vectors b_i, q_i, p^u_i and p^b_i are converted into log-likelihood scores by quadratic discriminant function classifiers; the path score is accumulated over the primitive segments and does not change with the number of segmented character patterns.
8. The method for recognizing Chinese and Japanese handwritten text according to claim 1, characterized in that the weight factors λ_11 and λ_12 vary according to the starting stroke and the non-starting strokes; the weight factor λ_1 varies with the number of characters; the weight factors λ_h1, λ_h2 (h = 1 to 7) and λ are obtained with a genetic algorithm; in the accumulated path evaluation, for λ_h2 only the weight factor of the important P_4 term is kept, and the remaining unimportant weight factors are 0.
9. The method for recognizing Chinese and Japanese handwritten text according to claim 1, characterized in that p(g_i|SP) and p(g_i|NSP) in the path score are approximated using a support vector machine classifier.
10. The method for recognizing Chinese and Japanese handwritten text according to claim 1, characterized in that the optimal path is found by Viterbi search.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610895677.5A CN106570518A (en) | 2016-10-14 | 2016-10-14 | Chinese and Japanese handwritten text identification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106570518A true CN106570518A (en) | 2017-04-19 |
Family
ID=58532832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610895677.5A Pending CN106570518A (en) | 2016-10-14 | 2016-10-14 | Chinese and Japanese handwritten text identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106570518A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108364655A (en) * | 2018-01-31 | 2018-08-03 | 网易乐得科技有限公司 | Method of speech processing, medium, device and computing device |
CN108364655B (en) * | 2018-01-31 | 2021-03-09 | 网易乐得科技有限公司 | Voice processing method, medium, device and computing equipment |
CN109614494A (en) * | 2018-12-29 | 2019-04-12 | 东软集团股份有限公司 | A kind of file classification method and relevant apparatus |
CN111639646A (en) * | 2020-05-18 | 2020-09-08 | 山东大学 | Test paper handwritten English character recognition method and system based on deep learning |
CN111639646B (en) * | 2020-05-18 | 2021-04-13 | 山东大学 | Test paper handwritten English character recognition method and system based on deep learning |
CN111797634A (en) * | 2020-06-04 | 2020-10-20 | 语联网(武汉)信息技术有限公司 | Document segmentation method and device |
CN111797634B (en) * | 2020-06-04 | 2023-09-08 | 语联网(武汉)信息技术有限公司 | Document segmentation method and device |
CN112699780A (en) * | 2020-12-29 | 2021-04-23 | 上海臣星软件技术有限公司 | Object identification method, device, equipment and storage medium |
CN113592045A (en) * | 2021-09-30 | 2021-11-02 | 杭州一知智能科技有限公司 | Model adaptive text recognition method and system from printed form to handwritten form |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20170419 |