Summary of the invention
The invention provides a handwriting recognition method, a handwriting recognition system, and a handwriting recognition terminal, to solve the problem that existing character recognition results often contain errors, which lowers the recognition rate and in turn degrades the handwriting experience of multi-character input.
To solve this problem, the invention discloses a handwriting recognition method, comprising: collecting continuously input handwriting; extracting handwriting features; inputting the handwriting features into a maximum entropy model, which judges whether the current stroke is a cut point; and if so, cutting the characters to obtain the final recognition result.
Preferably, the maximum entropy model judging whether the current stroke is a cut point comprises: the maximum entropy model using the handwriting features to give the probability that the current stroke is a cut point; and if the obtained probability is greater than a preset probability, the current stroke is a cut point.
Preferably, the method further comprises a step of determining the preset probability, which comprises: cutting the character handwriting to obtain at least one cutting route; performing single-character recognition on each cutting route to obtain candidate recognition results for each cutting route and a first probability value of each candidate recognition result; scoring each candidate recognition result with a language model to obtain, for each candidate recognition result, a second probability value expressing the association between characters; obtaining a combined probability value of each candidate recognition result from its first probability value and second probability value; and selecting the maximum combined probability value as the preset probability.
Preferably, collecting continuously input handwriting comprises: collecting character handwriting that is input continuously in overlapped form (characters written one over another in the same area), in a row, or in a column.
Preferably, the method further comprises establishing the maximum entropy model, which comprises: selecting maximum entropy model features, preparing training data, and training the maximum entropy model.
Preferably, the selected maximum entropy model features comprise handwriting features of character handwriting input continuously in overlapped form; that is, at least one of the following is selected as a feature of the maximum entropy model: the relative position between strokes, the position of the stroke in the writing area, the region containing the pen-down point, the region containing the pen-up point, the size of the added stroke, the ratio of stroke height to writing-area height, or the ratio of stroke width to writing-area width.
Preferably, the selected maximum entropy model features comprise: handwriting features of character handwriting input continuously in a row, that is, at least one of the width of the gap before the current character, the width of the gap after it, and the aspect ratio of the current character; or handwriting features of character handwriting input continuously in a column, that is, at least one of the width of the gap above the current character, the width of the gap below it, and the aspect ratio of the current character.
The invention also discloses a handwriting recognition system, comprising: an acquisition module, for collecting continuously input handwriting; a feature extraction module, for extracting handwriting features; a cutting module, for inputting the handwriting features into a maximum entropy model, which judges whether the current stroke is a cut point; and a recognition module, for cutting the characters when the current stroke is a cut point and obtaining the final recognition result.
Preferably, the handwriting recognition system further comprises a determination module, for determining the preset probability. The determination module comprises:
a cutting submodule, for cutting the character handwriting to obtain at least one cutting route;
a single-character recognition submodule, for performing single-character recognition on each cutting route to obtain candidate recognition results for each cutting route and a first probability value of each candidate recognition result;
a language model recognition submodule, for scoring each candidate recognition result with a language model to obtain, for each candidate recognition result, a second probability value expressing the association between characters;
a combined judgment submodule, for obtaining a combined probability value of each candidate recognition result from its first probability value and second probability value;
a selection submodule, for selecting the maximum combined probability value as the preset probability.
The invention also discloses a handwriting recognition terminal, comprising the handwriting recognition system disclosed by the invention.
Compared with prior art, the present invention has the following advantages:
The character cutting method provided by the invention is based on maximum entropy, a statistical prediction model. It can judge the relation between one character stroke and the next more accurately and thereby decide whether a cut point lies between them, and it gives the probability of that decision. Cutting between characters is thus judged more comprehensively, which improves the accuracy of the recognition result.
Embodiment
To make the above objects, features, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
The invention proposes a handwriting recognition method, system, and terminal. The method extracts handwriting features from the user's continuous writing and inputs them into a maximum entropy model to judge whether a stroke is a cut point. The relation between one character stroke and the next can thus be judged more accurately, improving the accuracy of the recognition result.
The details are described below by way of embodiments.
Fig. 1 is a flowchart of a handwriting recognition method according to an embodiment of the invention.
Step 11: collect the continuously input handwriting;
Step 12: extract the handwriting features;
The user can input multiple characters continuously and repeatedly in the same handwriting area. The characters include Chinese characters, punctuation marks, English letters, and similar forms.
The character handwriting that the user inputs continuously is collected; character handwriting here means information input in the form of strokes. Many devices can collect handwriting input, such as electromagnetic-induction handwriting pads, pressure-sensitive handwriting pads, touch screens, trackpads, and ultrasonic pens. During collection, each device uses its sensing hardware to record the coordinates of the user's writing, that is, the handwriting points. Usually the pen-down position is recorded as the start of a stroke and the pen-up position as its end; the series of handwriting points between the pen-down position and the pen-up position forms one input stroke.
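The grouping of handwriting points into strokes described above can be sketched as follows. This is only an illustrative sketch: the `PenEvent` type and the event names "down", "move", and "up" are assumptions made for the example and do not come from any particular device API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class PenEvent:
    kind: str  # "down", "move", or "up" (hypothetical event names)
    x: float
    y: float


def group_into_strokes(events: List[PenEvent]) -> List[List[Point]]:
    """Group recorded handwriting points into strokes.

    A stroke starts at a pen-down event and ends at the next pen-up event;
    every point recorded in between belongs to that stroke.
    """
    strokes: List[List[Point]] = []
    current: Optional[List[Point]] = None
    for ev in events:
        if ev.kind == "down":
            current = [(ev.x, ev.y)]          # start position of a stroke
        elif current is not None:
            current.append((ev.x, ev.y))
            if ev.kind == "up":               # end position of the stroke
                strokes.append(current)
                current = None
    return strokes
```

Points that arrive before any pen-down event are ignored in this sketch; a real device driver may need different handling.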
Step 13: input the handwriting features into the maximum entropy model, which judges whether the current stroke is a cut point;
In the handwriting recognition method of this embodiment, what is collected is the handwriting of multiple characters that the user inputs continuously. In practice this can be the handwriting points of characters input continuously in overlapped form, in a row, or in a column.
Before judging whether a stroke is a cut point, the maximum entropy model must be established. Establishing the model can comprise: selecting maximum entropy model features, preparing training data, and training the maximum entropy model.
A concrete example is described in detail below:
(1) Selecting the maximum entropy model features
Features related to character stroke position are selected as the features of the maximum entropy model. Different handwriting features are selected for different input modes. For example, for continuous overlapped input, the selected handwriting features can include at least one of: the relative position between strokes, the position of the stroke in the writing area, the region containing the pen-down point, the region containing the pen-up point, the size of the added stroke, the ratio of stroke height to writing-area height, and the ratio of stroke width to writing-area width. The selected features include but are not limited to those listed above; the required handwriting features can be chosen according to the needs of the actual application. For continuous input in a row, the selected handwriting features can include at least one of the width of the gap before the current character, the width of the gap after it, and the aspect ratio of the current character. For continuous input in a column, the selected handwriting features can include at least one of the width of the gap above the current character, the width of the gap below it, and the aspect ratio of the current character. Overlapped input is used for illustration below.
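As an illustration of the overlapped-input features listed above, the following sketch computes a few of them for a pair of consecutive strokes. The 3x3 grid used to number regions, the feature names, and the normalization by writing-area size are assumptions made for this example; the text does not specify how regions or relative positions are encoded.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def bounding_box(stroke: List[Point]) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) of a stroke."""
    xs = [p[0] for p in stroke]
    ys = [p[1] for p in stroke]
    return min(xs), min(ys), max(xs), max(ys)


def region_index(point: Point, width: float, height: float, grid: int = 3) -> int:
    """Number the grid cell of the writing area that contains the point."""
    col = min(int(point[0] / width * grid), grid - 1)
    row = min(int(point[1] / height * grid), grid - 1)
    return row * grid + col


def stroke_features(prev: List[Point], cur: List[Point],
                    width: float, height: float) -> Dict[str, float]:
    """Features of the current stroke relative to the previous one."""
    px0, py0, px1, py1 = bounding_box(prev)
    cx0, cy0, cx1, cy1 = bounding_box(cur)
    return {
        # relative position between strokes (offset of box centers)
        "dx": ((cx0 + cx1) - (px0 + px1)) / (2 * width),
        "dy": ((cy0 + cy1) - (py0 + py1)) / (2 * height),
        # regions containing the pen-down and pen-up points
        "pen_down_region": region_index(cur[0], width, height),
        "pen_up_region": region_index(cur[-1], width, height),
        # stroke size relative to the writing area
        "height_ratio": (cy1 - cy0) / height,
        "width_ratio": (cx1 - cx0) / width,
    }
```

In overlapped writing, a stroke that jumps back toward the upper left after a stroke in the lower right (strongly negative `dx` and `dy`) is an intuitive hint that a new character has started, which is the kind of signal such features let the model learn.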
(2) Preparing the training data
After the features of the maximum entropy model are selected, the training data is prepared. First, the character-stroke-position features used in the model must be determined, such as the relative position between strokes and the position of the stroke in the writing area; these form the x of the model. Then the data is prepared: a number of overlapped character strokes are collected and labeled according to the determined features.
Consider a stochastic process $p(y|x)$ which, given an observable vector $x$, outputs some $y$ with a certain probability, where $y$ belongs to a finite set $Y$. For the character-cutting decision, $Y = \{1, 0\}$, representing cut point and non-cut point respectively. $x$ represents the features related to character stroke position, that is, the overlapped character strokes yet to be judged, including the relative position between strokes, the position of the stroke in the writing area, and so on. To reconstruct the stochastic process $p(y|x)$, we sample its output and obtain $N$ training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$. Because these training examples are generated by this stochastic process, we assume that the empirical probability of an event in the training examples equals the expected probability of that event when $p(y|x)$ is known.
(3) Training the maximum entropy model
Once the training data is ready, it is used to train the maximum entropy model. The data labeled in the previous step with the character-stroke-position features (the relative position between strokes, the position of the stroke in the writing area) is fed into maximum entropy model training. The data format is: whether to cut, feature 1, feature 2
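An event can be represented by a characteristic function $f_i(x, y)$. If the event occurs in sample $(x_j, y_j)$, then $f_i(x_j, y_j) = 1$; otherwise it is 0. For example: if $x$ indicates that the previous character has been completely written and $y$ is a cut point, then $f_i(x, y) = 1$; in all other cases $f_i(x, y) = 0$. The empirical probability of this event in the training examples is expressed as:
$$\tilde{p}(f_i) = \sum_{x, y} \tilde{p}(x, y)\, f_i(x, y)$$
where $\tilde{p}(x, y) = \mathrm{count}(x, y) / N$ is the probability that sample $(x, y)$ occurs in the training examples, and $\mathrm{count}(x, y)$ is the number of occurrences of $(x, y)$ in the training character strokes.
If $p(y|x)$ is known, the expected probability of event $f_i(x, y)$ is expressed as:
$$p(f_i) = \sum_{x, y} \tilde{p}(x)\, p(y|x)\, f_i(x, y)$$
where $\tilde{p}(x)$ is the probability of $x$ in the training examples.
According to our assumption,
$$p(f_i) = \tilde{p}(f_i)$$
that is:
$$\sum_{x, y} \tilde{p}(x)\, p(y|x)\, f_i(x, y) = \sum_{x, y} \tilde{p}(x, y)\, f_i(x, y)$$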
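The two quantities compared by this assumption, the empirical probability of a feature and its expected probability under a model $p(y|x)$, can be computed as in the sketch below. The function names and the tiny data set are illustrative assumptions. Averaging over the training examples plays the role of weighting each $x$ by its probability in the training examples.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]           # (feature vector x, label y)
Feature = Callable[[Sequence[float], int], float]
Model = Callable[[int, Sequence[float]], float]  # model(y, x) = p(y|x)


def empirical_expectation(samples: List[Sample], feature: Feature) -> float:
    """Average of f(x, y) over the labeled training examples."""
    return sum(feature(x, y) for x, y in samples) / len(samples)


def model_expectation(samples: List[Sample], feature: Feature,
                      model: Model, labels=(0, 1)) -> float:
    """Average over training x of sum_y p(y|x) * f(x, y)."""
    total = 0.0
    for x, _ in samples:
        for y in labels:
            total += model(y, x) * feature(x, y)
    return total / len(samples)


# Hypothetical feature: fires when the first component of x is large
# and the label is "cut point" (y == 1).
def large_gap_and_cut(x: Sequence[float], y: int) -> float:
    return 1.0 if x[0] > 0.5 and y == 1 else 0.0
```

A constraint is satisfied when the two values agree; training adjusts the model until they do for every feature.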
We call the characteristic function $f_i(x, y)$ a feature function, or simply a feature. The equation above is therefore called a constraint equation on feature $f_i(x, y)$, or simply a constraint. A constraint is an equation relating the stochastic process $p(y|x)$ and the training examples with respect to a given feature. It restricts the distribution of $p(y|x)$ so that, in terms of what the feature indicates, the samples it generates are statistically close to the training examples.
Suppose $n$ features are defined. All stochastic processes that satisfy the constraints of these $n$ features form a set:
$$C = \Big\{\, p \;\Big|\; \sum_{x, y} \tilde{p}(x)\, p(y|x)\, f_i(x, y) = \sum_{x, y} \tilde{p}(x, y)\, f_i(x, y),\ i = 1, 2, \ldots, n \,\Big\}$$
In general, $|C| > 1$. We choose the stochastic process in $C$ with the largest entropy as the reconstructed model. The entropy here is the conditional entropy, expressed as:
$$H(p) = -\sum_{x, y} \tilde{p}(x)\, p(y|x) \log p(y|x)$$
The final reconstructed model is then:
$$p^* = \arg\max_{p \in C} H(p) \qquad (6)$$
This model is called the maximum entropy model. The maximum entropy principle ensures that the maximum entropy model generalizes well.
Expression and parameter calculation of the maximum entropy model
Solving formula (6) shows that the maximum entropy model has the following form:
$$p^*(y|x) = \frac{1}{Z(x)} \exp\Big(\sum_{i=1}^{n} \lambda_i f_i(x, y)\Big)$$
In the formula above, $\lambda_i$ is the weight of feature $f_i(x, y)$, which can be trained from the training character strokes using an iterative algorithm such as IIS or L-BFGS. $Z(x) = \sum_{y} \exp\big(\sum_{i=1}^{n} \lambda_i f_i(x, y)\big)$ is the normalization coefficient.
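The exponential form of the model and the training of its weights can be sketched as follows. Note the substitution: the text names IIS and L-BFGS, whereas this sketch uses plain gradient ascent on the log-likelihood, chosen only because it fits in a few lines. The gradient for each weight is the empirical expectation of the feature minus its expectation under the current model. All names and hyperparameters are assumptions for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Feature = Callable[[Sequence[float], int], float]
Sample = Tuple[Sequence[float], int]


def maxent_prob(weights: List[float], features: List[Feature],
                x: Sequence[float], labels=(0, 1)) -> Dict[int, float]:
    """p(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x) for every label y."""
    scores = {
        y: math.exp(sum(w * f(x, y) for w, f in zip(weights, features)))
        for y in labels
    }
    z = sum(scores.values())                   # normalization coefficient Z(x)
    return {y: s / z for y, s in scores.items()}


def train(samples: List[Sample], features: List[Feature],
          steps: int = 200, lr: float = 0.5) -> List[float]:
    """Fit the weights by gradient ascent (stand-in for IIS or L-BFGS)."""
    weights = [0.0] * len(features)
    n = len(samples)
    for _ in range(steps):
        grad = [0.0] * len(features)
        for x, y in samples:
            probs = maxent_prob(weights, features, x)
            for i, f in enumerate(features):
                expected = sum(probs[c] * f(x, c) for c in probs)
                grad[i] += f(x, y) - expected  # empirical minus model
        weights = [w + lr * g / n for w, g in zip(weights, grad)]
    return weights
```

With the two labels used here (cut point and non-cut point), this model is equivalent to logistic regression over the feature functions, so an existing optimizer could be used in place of the loop above.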
After the maximum entropy model is established, the collected handwriting features are input into the model for judgment. The judgment process can be as described in step 14.
Step 14: if the maximum entropy model judges that the current stroke is a cut point, the characters are cut and the final recognition result is obtained. If the model judges that the current stroke is not a cut point, no cut is made, and collection of the handwriting features of the continuously input characters can continue.
Specifically, the maximum entropy model gives the probability that a cut point lies between one character stroke and the next. If this probability is large enough, a cut point is considered to lie between the strokes, and the characters are cut there. To decide whether the probability is large enough, a fixed value can be set empirically: if the obtained cutting probability is greater than this fixed value, the probability of a cut point between the strokes is considered large, and the cut can be made.
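The threshold rule described above can be sketched in a few lines. The function name and the convention that `cut_probs[i]` is the probability of a cut immediately after `strokes[i]` are assumptions for the example; the threshold value itself would be chosen empirically.

```python
from typing import List, Sequence, TypeVar

S = TypeVar("S")


def segment(strokes: Sequence[S], cut_probs: Sequence[float],
            threshold: float) -> List[List[S]]:
    """Split a stroke sequence into characters.

    A cut is made after a stroke whenever its cut probability is
    greater than the fixed threshold.
    """
    characters: List[List[S]] = []
    current: List[S] = []
    for stroke, p in zip(strokes, cut_probs):
        current.append(stroke)
        if p > threshold:                      # current stroke is a cut point
            characters.append(current)
            current = []
    if current:                                # strokes of an unfinished character
        characters.append(current)
    return characters
```

For instance, five strokes with cut probabilities 0.1, 0.9, 0.2, 0.3, 0.8 and a threshold of 0.6 are split into a two-stroke character followed by a three-stroke character.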
Alternatively, the obtained cut-point probability can be incorporated into the path search to further improve the recognition rate. In a specific implementation, a preset probability can be set: if the obtained cut-point probability is greater than the preset probability, the probability of a cut point between the strokes is considered large, and the cut can be made. The preset probability can be obtained as follows:
Cut the character handwriting to obtain at least one cutting route;
For example, cutting the handwriting of "天下" yields 4 cutting routes: "二|人|下", "二人|下", "二|人下", and "二人下". Each cutting route has a corresponding cutting probability value.
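The enumeration of cutting routes in this example can be sketched as follows: each gap between adjacent components is either cut or not, so 3 components give 2 x 2 = 4 routes. Representing the stroke groups as strings is an assumption made to keep the example readable; a real system would enumerate over groups of strokes and would usually prune unlikely routes rather than list all of them.

```python
from itertools import product
from typing import List


def cutting_routes(components: List[str]) -> List[List[str]]:
    """Enumerate every way of cutting a sequence of components.

    Each of the len(components) - 1 gaps is either a cut or not a cut.
    """
    if not components:
        return []
    routes: List[List[str]] = []
    for cuts in product([True, False], repeat=len(components) - 1):
        route: List[str] = []
        current = components[0]
        for comp, cut in zip(components[1:], cuts):
            if cut:
                route.append(current)          # close the current character
                current = comp
            else:
                current += comp                # merge into the same character
        route.append(current)
        routes.append(route)
    return routes


for route in cutting_routes(["二", "人", "下"]):
    print("|".join(route))
# prints:
# 二|人|下
# 二|人下
# 二人|下
# 二人下
```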
Perform single-character recognition on each cutting route to obtain candidate recognition results for each cutting route and a first probability value of each candidate recognition result;
The recognition process can use any existing recognition method; the embodiment of the invention places no limitation on it.
In each cutting route, every single character separated by the candidate cut points is recognized. Recognition of each single character may yield multiple candidate recognition results (single-character candidate recognition results), each with a single-character recognition probability, called the first probability value.
For example, for the short input "天下", the 4 corresponding cutting routes "二|人|下", "二人|下", "二|人下", and "二人下" are recognized separately. For the cutting route "二人|下", single-character recognition is performed on "二人" and on "下" separately. The candidate recognition results obtained for "二人" may be "天", "夫", and so on, and each candidate recognition result has a single-character recognition probability; for instance, the first probability values of "天" and "夫" are A and B respectively. Likewise, single-character recognition on "下" yields one or more candidate recognition results and the first probability value of each. Single-character recognition for the other cutting routes works the same way and is not described one by one.
Score each candidate recognition result with a language model to obtain, for each candidate recognition result, a second probability value expressing the association between characters;
The language model represents the association between characters, and this association is expressed as a probability. A language model is a model for calculating the probability of a phrase or sentence. For a given piece of writing, multiple cutting routes produce multiple candidate recognition results. A candidate recognition result here means a character, word, phrase, or sentence formed, according to the language model, by combining the single-character candidate recognition results obtained in the single-character recognition step above. For example, "二" and "人" combine into the character "天", so "天" is a candidate recognition result; "文" and "件" combine into the word "文件" (file), which is also a candidate recognition result. For each candidate recognition result, the language model can then calculate how likely it is to be correct. For example, if one candidate recognition result for the user's handwriting points is "文仵" and another is "文件", the language model gives "文件" a higher probability than "文仵". If the recognition probabilities of "文件" and "文仵" are close, the language model lets the result be determined as the more common "文件".
As for implementing the language model, a simple method considers only the probability of two adjacent characters, for example the probability that "件" is preceded by "文" and the probability that "仵" is preceded by "文", regardless of which characters come earlier. Real language does not behave this way, so more complex implementations also consider the character two positions back (or more characters), or use a word-based language model, at the cost of much more computation and storage.
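The simple two-character language model described above can be sketched as a bigram scorer. The probability table below contains made-up numbers used purely for illustration, and the `floor` value for unseen character pairs is likewise an assumption; a real model would estimate these from a corpus and apply proper smoothing.

```python
import math
from typing import Dict, Tuple


def bigram_log_prob(text: str,
                    bigram_probs: Dict[Tuple[str, str], float],
                    floor: float = 1e-6) -> float:
    """Score a candidate by summing log P(current char | previous char).

    Only adjacent character pairs are considered; earlier context is ignored.
    Unseen pairs fall back to a small floor probability.
    """
    total = 0.0
    for prev, cur in zip(text, text[1:]):
        total += math.log(bigram_probs.get((prev, cur), floor))
    return total


# Hypothetical probabilities, not taken from any real corpus.
toy_bigrams = {
    ("文", "件"): 0.2,
    ("文", "仵"): 0.0001,
}

better = max(["文件", "文仵"], key=lambda c: bigram_log_prob(c, toy_bigrams))
```

Under these toy numbers the scorer prefers "文件", matching the behavior the text describes for a common word versus an uncommon combination.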
Similarly, among the candidate recognition results "二人下", "天下", "夫下", and so on, "天下" is a common word, so the probability given by the language model is the highest; "二人下" is not a common word, so its language-model probability is lower.
Obtain the combined probability value of each candidate recognition result from its first probability value and second probability value. A simple way to calculate the combined probability value is to take a weighted sum of the first and second probability values of each candidate recognition result, giving one combined probability value for that candidate. Other, more complex calculation methods can also be used; the embodiment of the invention places no limitation on this.
Select the maximum combined probability value as the preset probability. This maximum combined probability value can represent the cutting cost of the cutting route, that is, the probability that the cut derived from the input order and the relative positions of the strokes is correct.
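The weighted-sum combination and the selection of the maximum can be sketched as follows. The weight of 0.5 and the candidate probabilities are invented for the example; the text only states that a weighted sum is one simple option and does not give weights.

```python
from typing import List, Tuple

# (candidate text, first probability value, second probability value)
Candidate = Tuple[str, float, float]


def combined_probability(first: float, second: float,
                         weight: float = 0.5) -> float:
    """Weighted sum of the single-character and language-model probabilities."""
    return weight * first + (1.0 - weight) * second


def best_candidate(candidates: List[Candidate],
                   weight: float = 0.5) -> Tuple[str, float]:
    """Return the candidate with the maximum combined probability value."""
    scored = [(text, combined_probability(p1, p2, weight))
              for text, p1, p2 in candidates]
    return max(scored, key=lambda item: item[1])


# Hypothetical probability values for three candidates.
candidates = [("天下", 0.6, 0.9), ("二人下", 0.7, 0.1), ("夫下", 0.5, 0.2)]
text, preset_probability = best_candidate(candidates)
```

Here "二人下" has the highest single-character probability, but the language-model term lets "天下" win once the two values are combined.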
The example above uses the maximum entropy model, one statistical method, to cut and recognize continuous characters. In specific applications, other statistical methods can also be used to cut and recognize continuous characters, such as the support vector machine (SVM) method. The application of these methods is described at length in the prior art and is not repeated here.
In summary, through the above flow, the handwriting recognition method extracts handwriting features from the user's continuous writing and inputs them into the maximum entropy model to judge whether a stroke is a cut point. The relation between one character stroke and the next can thus be judged more accurately, improving the accuracy of the recognition result. At the same time, because the user can input several characters at once, input speed is greatly increased.
In practical applications, the handwriting recognition method of this embodiment can be applied to products that require handwriting input, such as desktop operating systems on PCs, notebook computers, tablet computers, and handwriting pads. It can also be applied to embedded operating systems, for example: intelligent mobile terminals such as palmtop computers, mobile phones, PADs, PDAs, small-screen phones, or landscape-screen phones; GPS/GIS terminals such as personal information terminals and in-vehicle information terminals; intelligent learning terminals such as eBOOKs, electronic dictionaries, and intelligent toys; and other data terminals such as tax-control machine entry terminals, Chinese second-generation ID card reader terminals, large database query terminals, hotel management system entry terminals, intelligent alarms, interactive digital TV remote controls, karaoke song-request devices, and information appliances. The invention places low demands on the screen size of the handwriting area and is particularly suited to overlapped input and recognition on small-screen devices, which gives it a clear advantage on small-screen devices such as current mobile phones.
Preferably, in a multitask system, the above cutting and combined recognition process can run concurrently with the writing process (that is, the handwriting collection process), further speeding up recognition. A multitask system here means a system capable of multithreading. While the user is writing, handwriting collection uses little or almost no CPU, so most of the CPU is idle. In a multitask system, this idle CPU time can be used to recognize while the user writes, which speeds up recognition.
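Recognizing while the user writes can be sketched with a producer-consumer pattern: the collection side pushes each finished stroke into a queue, and a worker thread processes strokes as they arrive. The function name and the use of `None` as an end-of-input marker are assumptions for this example, and `recognize` stands for whatever per-stroke processing the system performs.

```python
import queue
import threading
from typing import Callable, Iterable, List, TypeVar

S = TypeVar("S")
R = TypeVar("R")


def recognize_while_writing(strokes: Iterable[S],
                            recognize: Callable[[S], R]) -> List[R]:
    """Process strokes in a worker thread while collection continues."""
    pending: "queue.Queue" = queue.Queue()
    results: List[R] = []

    def worker() -> None:
        while True:
            stroke = pending.get()
            if stroke is None:                 # end-of-input marker
                break
            results.append(recognize(stroke))

    thread = threading.Thread(target=worker)
    thread.start()
    for stroke in strokes:                     # stands in for handwriting collection
        pending.put(stroke)
    pending.put(None)
    thread.join()
    return results
```

Because a single worker consumes the queue in order, results come back in writing order. In CPython, CPU-bound recognition in a thread is limited by the global interpreter lock, so a production system might use a separate process or native code instead.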
Based on the foregoing, the embodiment of the invention also provides a corresponding system embodiment.
Fig. 2 is a structural diagram of a handwriting recognition system according to an embodiment of the invention.
Acquisition module 21, for collecting continuously input handwriting;
Feature extraction module 22, for extracting the handwriting features;
Cutting module 23, for inputting the handwriting features into the maximum entropy model, which judges whether the current stroke is a cut point;
Recognition module 24, for cutting the characters when the current stroke is a cut point and obtaining the final recognition result.
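The way the four modules cooperate can be sketched as a small class that wires them together. Each module is represented by a plain callable, and all field names, signatures, and the default threshold are assumptions made for illustration rather than the structure of any actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class HandwritingRecognitionSystem:
    acquire: Callable[[], List]                           # acquisition module 21
    extract: Callable[[List, int], Sequence[float]]       # feature extraction module 22
    cut_probability: Callable[[Sequence[float]], float]   # cutting module 23
    recognize: Callable[[List], str]                      # recognition module 24
    threshold: float = 0.5

    def run(self) -> List[str]:
        """Collect strokes, cut at predicted cut points, recognize each piece."""
        strokes = self.acquire()
        results: List[str] = []
        current: List = []
        for i, stroke in enumerate(strokes):
            current.append(stroke)
            features = self.extract(strokes, i)
            if self.cut_probability(features) > self.threshold:
                results.append(self.recognize(current))
                current = []
        if current:                                       # trailing strokes
            results.append(self.recognize(current))
        return results
```

Passing the modules in as callables keeps the sketch independent of how the maximum entropy model or the single-character recognizer is actually implemented.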
The cutting module 23 can make full use of statistical methods, in particular the maximum-entropy-based judgment described in the method embodiment above, to judge more accurately whether a cut point lies between characters. To improve the accuracy of the judgment further, the cutting probability obtained from the maximum entropy model can be incorporated into the path search, further improving the recognition rate. The system can therefore further comprise a determination module 25, which can comprise:
Cutting submodule 251, for cutting the character handwriting to obtain at least one cutting route;
Single-character recognition submodule 252, for performing single-character recognition on each cutting route to obtain candidate recognition results for each cutting route and a first probability value of each candidate recognition result, and for passing each candidate recognition result to the language model recognition submodule 253;
Language model recognition submodule 253, for scoring each candidate recognition result with a language model to obtain, for each candidate recognition result, a second probability value expressing the association between characters;
Combined judgment submodule 254, for obtaining the combined probability value of each candidate recognition result from its first probability value and second probability value;
Selection submodule 255, for selecting the maximum combined probability value as the preset probability.
After the preset probability value is determined, the cutting probability obtained from the maximum entropy model can be compared with it. If the cutting probability is greater than or equal to the preset probability value, a cut point can be judged to lie between the characters. This additional probability comparison improves the accuracy of judging whether a cut point lies between characters and strengthens character recognition.
Based on the above handwriting recognition system using the maximum entropy model, the embodiment of the invention also provides a handwriting recognition terminal. The terminal can comprise the above handwriting recognition system and thereby support recognition of continuous character input. For the specific structure of the handwriting recognition system, refer to Fig. 2; it is not described in detail here.
The handwriting recognition terminal can be a desktop operating system terminal such as a PC, notebook computer, tablet computer, or handwriting pad; an intelligent mobile terminal such as a palmtop computer, mobile phone, PAD, PDA, small-screen phone, or landscape-screen phone; or any kind of terminal with a multitask system.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and for the parts that are the same or similar the embodiments can be referred to one another. Because the system embodiment is basically similar to the method embodiment, its description is relatively brief; for the relevant parts, refer to the description of the method embodiment.
The handwriting recognition method, system, and handwriting recognition terminal provided by the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.