CN100390815C

CN100390815C - Template optimized character recognition method and system

Info

Publication number: CN100390815C
Application number: CNB2005100908775A
Authority: CN
Inventors: 刘芝; 康凯; 徐剑波
Original assignee: BEIDA FANGZHENG TECHN INST Co Ltd BEIJING; Peking University Founder Group Co Ltd
Current assignee: New Founder Holdings Development Co ltd; Peking University Founder Research and Development Center
Priority date: 2005-08-18
Filing date: 2005-08-18
Publication date: 2008-05-28
Anticipated expiration: 2025-08-18
Also published as: CN1916940A

Abstract

A template-optimized character recognition method comprises: forming a glyph from at least one training character in a training character set; placing the glyph into different sets to be clustered in order to extract its common templates, and storing the extracted templates in a template storage module; storing, by a template output unit, the internal code of each glyph in correspondence with pointers to that glyph's common templates at each level, thereby forming a glyph index table; and matching a character to be recognized level by level against the common templates pointed to by a found glyph index table, recording the match results in order to obtain candidate characters.

Description

Template-optimized character recognition method and system

Technical field

The present invention relates to a character recognition method and system, and in particular to a template-optimized statistical pattern recognition method and a system that carries out the method.

Background technology

Research on Chinese character recognition in China began in the early 1980s and has developed through two main stages. The first stage spanned the 1980s and concentrated on exploring algorithms and recognition schemes for Chinese character recognition. The second stage began in the 1990s, when Chinese character OCR entered an important period of rapid growth: laboratory results were brought to market, and many new algorithms and techniques for Chinese character recognition appeared.

At present, printed Chinese character recognition is quite mature and its algorithms are numerous, but essentially all of them comprise the stages of input, preprocessing, recognition, and post-processing. Preprocessing includes binarization, noise removal, skew correction, layout analysis, segmentation, and normalization; recognition includes pre-classification and single-character recognition. The recognition stage is the core of a Chinese character recognition system.

Recognition methods fall into two broad classes: statistical pattern recognition and structural pattern recognition. Structural pattern recognition tries to extract the structural features of characters and the rules by which they are composed, and uses this information as the basis for recognizing Chinese characters. Statistical pattern recognition developed earlier and is more mature; its essence is to extract a set of statistical features from the pattern to be recognized, determine a decision function according to some criterion, and classify according to that decision function. Template matching is a classic method of statistical pattern recognition. It first extracts the features of standard Chinese character patterns and stores the set of these features in what is called a dictionary; during recognition, the pattern to be recognized is matched one by one against the standard pattern features in the dictionary. For example, Chinese patent application publication No. CN85100085, filed on April 1, 1985, discloses a recognition method for a printed Chinese character recognition device that extracts and stores the stroke points and background points of Chinese characters and performs template matching from top to bottom. This method is simple to match and robust to interference, but it has shortcomings: when many Chinese characters and typefaces are supported, the storage and data volume of the pattern features to be matched grow greatly.

Summary of the invention

In view of the defects of the prior art, the object of the present invention is to provide a method and system that perform character recognition using common templates.

According to one aspect of the invention, the proposed character recognition method comprises the steps of: forming a glyph from at least one training character in a training character set; for each of the glyph's different components in turn, placing the glyph in a set to be clustered made up of glyphs similar to it in that component, clustering and extracting a common template, and storing the extracted multi-level common templates of the glyph; with reference to the extracted multi-level common templates, extracting and storing the last-level common template of the glyph; for each glyph, storing its internal code in correspondence with pointers to its common templates at each level, to generate the index table of the glyph; and, when recognizing a set of characters to be recognized, matching a character to be recognized level by level against the common templates pointed to by a found glyph index table and recording the match results, to obtain candidate characters.

In the above method, before matching against a level of common template pointed to by the found glyph index table, if a match record already exists, the recorded match result is used directly to determine whether the common template matches.

The above step of obtaining candidate characters further comprises: if the common templates at all levels match, taking the text character represented by the corresponding internal code as a candidate character for the character to be recognized; if any one level of common template does not match, stopping the matching against this glyph and matching level by level against the common templates pointed to by the next glyph index table found.

In the above method, extracting and storing the multi-level common templates of the glyph further comprises:

A. searching for glyphs similar to a given glyph in a first component, placing the found glyphs and the given glyph in the same set to be clustered and clustering them, and extracting the similar part as the first-level common template of each glyph in the set and storing it;

B. for the remaining part of the glyph, searching for glyphs similar in the corresponding part, placing the found glyphs and the given glyph in the same set to be clustered and clustering them, and extracting the similar part as the next-level common template of each glyph in the set and storing it;

C. repeating step B to extract and store the multi-level common templates of the glyph.
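Steps A to C can be sketched as follows. This is only an illustration under stated assumptions: the clustering itself is replaced by groups supplied by the caller, bitmaps are lists of rows of 0/1 values, and the names `common_points` and `extract_levels` are hypothetical rather than taken from the patent. The point of the sketch is that glyphs clustered together at a level end up pointing at one shared stored template.

```python
def common_points(bitmaps, region):
    """Return (row, col, value) for every region point on which all bitmaps agree."""
    first = bitmaps[0]
    return {
        (r, c, first[r][c])
        for (r, c) in region
        if all(b[r][c] == first[r][c] for b in bitmaps)
    }


def extract_levels(glyphs, levels):
    """Extract one shared template per group, level by level.

    glyphs: dict mapping a glyph name to its bitmap (list of rows of 0/1).
    levels: one list per level; each item is (member_names, region), where
            region is the set of (row, col) points examined at that level.
            The grouping stands in for the clustering step (an assumption).
    Returns (templates, pointers): the template store and, for each glyph,
    the indices ("pointers") of its templates in that store, ordered by level.
    """
    templates = []
    pointers = {name: [] for name in glyphs}
    for groups in levels:
        for members, region in groups:
            templates.append(common_points([glyphs[m] for m in members], region))
            for name in members:
                # Every member of the group points at the same stored template.
                pointers[name].append(len(templates) - 1)
    return templates, pointers
```

In this sketch a template is stored once per group, not once per glyph, which is where the storage saving claimed for common templates would come from.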

In the above method, the first-level common template is the radical common template, and the second-level common template is the class common template.

In the above method, alternatively, the first-level common template is the class common template, and the second-level common template is the radical common template.

In the above method, step A further comprises: selecting a radical to be extracted; searching for glyphs that have the same radical as the given glyph; placing the found glyphs and the given glyph in the same set to be clustered and clustering them; and superimposing the bitmaps of all characters in a class to generate the radical common template of each glyph in the set.

In the above method, step A further comprises: framing, on the bitmap of a character of similar glyph shape, a region that completely contains the radical to be extracted, as the mask region for bitmap superposition.

In the above method, step B further comprises: covering with a mask the part of the glyph from which the radical common template has been extracted, so that the remaining part takes part, together with other glyphs, in the extraction of the class common template.

In the above method, step B further comprises: searching for glyphs similar to the given glyph in the corresponding part on the basis of an initialized coarse-classification jump rule.

In step B above, collecting the sets to be clustered and clustering the glyphs within a set are interleaved: glyphs that were collected into a set to be clustered under a predetermined threshold but were not clustered successfully take part in the next collection of sets to be clustered.

In the above method, while a character is being recognized, a glyph index table is found by looking up the correct class according to the shape features of the character to be recognized.

In the above method, each training character in the training character set is a glyph.

In the above method, the typeface information of each glyph is stored in the glyph index table together with its internal code.

In the above method, a glyph is formed by selecting from the training character set single-character bitmaps that have the same internal code but different typefaces, clustering them, and superimposing the bitmaps of the characters gathered in one class.

In the above method, the typeface information of one of the characters forming a glyph is stored in the index table of the glyph together with the internal code of the glyph.

In the above method, the differing parts of the characters forming the glyph are further extracted and stored as variant templates of the glyph, and the pointer to each variant template of the glyph is stored in the index table of the glyph in correspondence with the typeface information of one of the characters forming that variant template.

In the above method, the step of obtaining candidate characters further comprises: if the templates at all levels match, taking the text character jointly represented by the corresponding internal code and typeface information as a candidate character for the character to be recognized; if any one level of template does not match, stopping the matching against the glyph and matching level by level against the common templates pointed to by the next glyph index table found.

In the above method, taking the text character jointly represented by the corresponding internal code and typeface information as a candidate character when the templates at all levels match further comprises: if the found glyph index table contains pointers to several variant templates, then once the character to be recognized has matched all the common templates pointed to by the table, a successful match with any one of the variant templates suffices for the glyph to be considered matched.

According to another aspect of the invention, the proposed character recognition system comprises a template generation part and a character recognition part. The character recognition part comprises a recognition unit; the template generation part comprises a glyph storage unit, a common template extraction unit, a template output unit, and a dictionary. The common template extraction unit comprises a multi-level common template extraction module and a last-level common template extraction module, and the dictionary comprises a glyph index table storage module and a template storage module.

The multi-level common template extraction module is used, for the different components of one glyph among all the glyphs in the glyph storage unit, to cluster in turn within the different sets to be clustered formed by glyphs similar in the corresponding component, and to extract the common templates of the glyph at each level. The last-level common template extraction module is used to extract the last-level common template of the glyph with reference to the extracted multi-level common templates. The template output unit is used to receive the common templates at each level extracted by the common template extraction unit and save them to the template storage module, and to save the pointers to the glyph's common templates in the template storage module into the index table of the glyph in the glyph index table storage module. The recognition unit comprises a common template matching module, used to match a character to be recognized level by level against the common templates pointed to by a found glyph index table.

The recognition unit further comprises a candidate character output module, used to output the text character represented by the corresponding internal code as a candidate character when the character to be recognized has been successfully matched, in the common template matching module, against all levels of templates pointed to by the table.

In the above system, the template generation part further comprises a glyph generation unit, used to select from the training character set single-character bitmaps that have the same internal code but different typefaces, cluster them, superimpose the bitmaps of the characters gathered in one class, and save the resulting glyph in the glyph storage unit.

In the above system, the template generation part further comprises a variant template extraction unit, used to extract the differing parts of the characters forming the glyph as the variant templates of the glyph.

In the above system, the template output unit is also used to receive the variant templates extracted by the variant template extraction unit and save them to the template storage module in the dictionary, and to save the pointers to all variant templates of the glyph in the template storage module into the index table of the glyph in the index table storage module.

In the above system, the recognition unit further comprises a variant template matching module, used to match the character to be recognized against the variant templates pointed to by the table after the character has been successfully matched, by the common template matching module, against all levels of common templates pointed to by the found glyph index table.

In the above system, the recognition unit further comprises a classifier, used to look up the correct class according to the shape features of the character to be recognized, so that the template matching modules match level by level against the templates pointed to by the glyph index tables found in that class.

The present invention has the following advantages. The parts shared between different characters are extracted according to the features of the character bitmaps, which reduces the storage required by the dictionary. Whether or not a match succeeds, the points on a common template need not be matched repeatedly, which raises the recognition speed of the character recognition system. The invention therefore optimizes, from both of these angles, pattern recognition systems that use template points as their features.

Description of drawings

These and other advantages of the invention will be more readily understood from the following description of embodiments of the invention, read in conjunction with the accompanying drawings.

Fig. 1 is an overall block diagram of a recognition system according to an embodiment of the invention;

Fig. 2 is a flowchart of the template generation stage according to an embodiment of the invention;

Fig. 3 is a flowchart of radical common template extraction according to an embodiment of the invention;

Fig. 4 is a flowchart of class common template extraction according to an embodiment of the invention;

Fig. 5 illustrates the glyph pointers of this embodiment, using a specific character as an example;

Fig. 6 is a flowchart of recognizing characters by template matching according to an embodiment of the invention;

Fig. 7 depicts in further detail the character recognition flow of the embodiment shown in Fig. 6.

Embodiment

In statistical pattern recognition, the features of a Chinese character are represented as an n-dimensional vector X = [x_1, x_2, ..., x_n], and the dictionary contains the n-dimensional feature vector of each class of Chinese character. Recognizing a Chinese character means using a distance and/or similarity formula to determine which standard class in the dictionary has the n-dimensional feature vector closest to that of the character. Common distance measures include the Euclidean distance, the city-block distance, and the Mahalanobis distance; commonly used distance and similarity formulas are given in formulas (1) and (2) respectively:

D(X, G) = \sqrt{\sum_{i=1}^{n} [X(i) - G(i)]^{2}}    (1)

R(X, G) = \frac{(X, G)}{|X| \, |G|}    (2)

In formula (1), D(X, G) denotes the Euclidean distance between vectors X and G. In formula (2), R(X, G) denotes the similarity, (X, G) is the inner product of vectors X and G, and |X| and |G| are the norms of X and G respectively.
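Formulas (1) and (2) translate directly into code. The sketch below is illustrative only; the function names and the `nearest_class` helper, which picks the dictionary entry with the smallest Euclidean distance, are assumptions introduced here and are not part of the patent.

```python
import math


def euclidean_distance(x, g):
    """Formula (1): Euclidean distance between feature vectors x and g."""
    if len(x) != len(g):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((xi - gi) ** 2 for xi, gi in zip(x, g)))


def similarity(x, g):
    """Formula (2): inner product of x and g divided by the product of their norms."""
    if len(x) != len(g):
        raise ValueError("vectors must have the same dimension")
    inner = sum(xi * gi for xi, gi in zip(x, g))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_g = math.sqrt(sum(gi * gi for gi in g))
    if norm_x == 0 or norm_g == 0:
        raise ValueError("similarity is undefined for a zero vector")
    return inner / (norm_x * norm_g)


def nearest_class(x, dictionary):
    """Return the key of the dictionary entry whose vector is closest to x."""
    return min(dictionary, key=lambda code: euclidean_distance(x, dictionary[code]))
```

A smaller distance or a larger similarity both indicate a closer match, so a classifier built on formula (2) would take the maximum where this helper takes the minimum.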

The character recognition method of the invention is a statistical pattern recognition method, built on a recognition system that stores Chinese character features as points. In existing Chinese character recognition systems, the feature points of each Chinese character are combined into a sequence of points for storage; the feature points include ordinary points extracted from strokes as well as special points such as intersections and inflection points, and each such sequence of points is called a template. During recognition, the bitmap to be recognized is matched against each template. Because Chinese characters are generally normalized to a 48 × 48 bitmap when template points are extracted, a large number of identical or nearly identical template points are extracted from many similar glyphs. When the system covers a large number of Chinese characters, many points are stored repeatedly and, during recognition, these identical points are matched repeatedly, which wastes both speed and storage.
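The redundancy described above can be made concrete with a small sketch. It assumes a template is a list of (row, column, expected value) points, with 1 for a stroke point and 0 for a background point, and it measures a match by counting disagreeing points; both the representation and the function names are assumptions for illustration, not the patent's storage format.

```python
from collections import Counter


def match_distance(bitmap, template):
    """Number of template points whose expected value disagrees with the bitmap."""
    return sum(1 for row, col, expected in template if bitmap[row][col] != expected)


def redundant_points(templates):
    """Count how many stored points duplicate a point in another template.

    A point that appears in k templates contributes k - 1 redundant copies.
    """
    counts = Counter(point for template in templates for point in set(template))
    return sum(n - 1 for n in counts.values())
```

Each redundant copy counted here is a point that a per-character template store keeps, and matches, more than once.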

The character recognition method of the invention comprises a template generation stage and a stage in which the templates are used for character recognition. The advantages of the method lie mainly in two aspects: the extraction of the templates at each level in the template generation stage, and the level-by-level matching of the character bitmap to be recognized against the templates in the recognition stage. Extracting the templates at each level comprises the steps of extracting the radical common template, extracting the class common template, and extracting the glyph common template, and may further comprise a step of extracting variant templates. The recognition stage generally comprises three processes: preprocessing, character recognition proper, and post-processing.

The invention supports many typefaces and many characters. Clustering such a large character set directly would be computationally very complex, and the clustering result would be inconvenient to use for extracting common templates. Therefore, to improve clustering efficiency and the usability of the clustering result, in all the embodiments below common templates are extracted by first collecting the characters into groups according to certain rules and then clustering the characters within each group.

Fig. 1 is an overall block diagram of a recognition system according to an embodiment of the invention. Above the dash-dot line is character recognition part 11; below it is template generation part 12.

Template generation part 12 comprises image input unit 101, which may be an image input device such as a scanner, fax machine, or digital camera, as well as preprocessing unit 102, glyph generation unit 121, glyph storage unit 122, common template extraction unit 123, variant template extraction unit 124, template output unit 125, and dictionary 126. Common template extraction unit 123 comprises multi-level common template extraction module 1231 and last-level common template extraction module 1232, and dictionary 126 comprises glyph index table storage module 1261 and template storage module 1262.

Image input unit 101 converts an input printed or handwritten document into digital image data. Preprocessing unit 102 applies preprocessing such as noise removal and binarization to the raw image data, then performs line and character segmentation to extract the individual character bitmaps one by one, and normalizes each resulting single-character bitmap to a standard 48 × 48 bitmap. Glyph generation unit 121 selects from the preprocessed training character set single-character bitmaps that have the same internal code but different typefaces, clusters them, superimposes the bitmaps of the characters gathered in one class, and saves the resulting glyph in glyph storage unit 122. Multi-level common template extraction module 1231 in common template extraction unit 123 works on the different components of one glyph among all the glyphs in glyph storage unit 122: it clusters in turn within the different sets to be clustered formed by glyphs similar in the corresponding component, and extracts the common templates of the glyph at each level. Last-level common template extraction module 1232 extracts the last-level common template of the glyph with reference to the extracted multi-level common templates. Variant template extraction unit 124 extracts the differing parts of the characters forming the glyph as the variant templates of the glyph. Template output unit 125 receives the common templates at each level and the variant templates that common template extraction unit 123 and variant template extraction unit 124 extract in turn for a glyph, saves the received templates in template storage module 1262, and saves the pointers to the glyph's common templates at each level and the pointers to all of its variant templates in template storage module 1262 into the index table of the glyph in glyph index table storage module 1261. Template storage module 1262 in dictionary 126 is thus a database that holds the common templates at each level and the variant templates extracted by common template extraction unit 123 and variant template extraction unit 124.

It should be noted that if the three levels of common templates described above suffice to distinguish a glyph from other glyphs, variant template extraction unit 124 is not needed for that glyph. When the group of similar characters with the same internal code that forms a glyph contains several characters that differ considerably, common template extraction unit 123 can extract too few common template points from the group, and the combination of the three levels of common templates is then insufficient to distinguish this glyph from other glyphs. In that case the more similar characters can be put together, so that characters that differ considerably fall into different groups. On the differing parts outside the common template points, the common points of the regrouped characters are extracted, and some of these common points are chosen as a variant template of the glyph, which increases the number of points that distinguish one glyph from another.

Preferably, the template generation part comprises glyph generation unit 121. The part may, however, omit glyph generation unit 121, in which case the training character set input through image input unit 101 is sent, after preprocessing, directly to common template extraction unit 123. Characters with the same internal code in different typefaces are then treated as different glyphs, and the single character is the basic unit of multi-level common template extraction.

In the glyph index table described above, the internal code of a glyph is stored together with the glyph's typeface information and in correspondence with the pointers to the glyph's common templates at each level. The typeface information may be stored in the glyph index table in one of three forms. In the first form, the typeface of the glyph is one representative typeface chosen from the different typefaces of the characters that form the glyph, so that one glyph internal code corresponds to one typeface. In the second form, the typeface is one representative typeface chosen from the different typefaces of the characters that form one variant template of the glyph, so that each of the glyph's variant templates corresponds to one typeface. In the third form, which applies when the template generation part does not include glyph generation unit 121, the typeface is simply the typeface to which the glyph belongs. When the glyph index table contains typeface information, the character recognition system can both convert a character to be recognized into the correct text character and give the converted text character the correct, or approximately correct, typeface. The index table may also omit the typeface information, in which case the converted text characters all use one specified typeface; such a system cannot recognize the typeface of the character to be recognized.
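One possible in-memory layout for an index table entry is sketched below. The field names, the use of integer indices as "pointers", and the typeface names in the usage example are all hypothetical; the patent specifies only what is stored together, not a concrete layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class GlyphIndexEntry:
    internal_code: int                 # internal code of the text character
    common_ptrs: List[int]             # pointers to common templates, ordered by level
    typeface: Optional[str] = None     # one representative typeface per glyph, if stored
    # (pointer, typeface) pairs: one representative typeface per variant template
    variants: List[Tuple[int, Optional[str]]] = field(default_factory=list)


def output_typeface(entry, matched_variant_ptr=None, default="default"):
    """Pick the typeface for the converted text character.

    Prefers the typeface tied to the matched variant template, then the
    glyph-level typeface, then a single specified default typeface.
    """
    for ptr, typeface in entry.variants:
        if ptr == matched_variant_ptr and typeface is not None:
            return typeface
    if entry.typeface is not None:
        return entry.typeface
    return default
```

For example, `output_typeface(GlyphIndexEntry(0x6C38, [0, 1], typeface="Song", variants=[(5, "Kai")]), 5)` returns `"Kai"`, while an entry without any typeface information falls back to the default, mirroring the case in which the system cannot recognize the typeface.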

Character recognition part 11 likewise comprises image input unit 101 and preprocessing unit 102, and in addition comprises recognition unit 111, post-processing unit 112, and text data output unit 113. Recognition unit 111 comprises classifier 1111, common template matching module 1112, variant template matching module 1113, and candidate character output module 1114.

Image input unit 101 and preprocessing unit 102 differ between character recognition part 11 and template generation part 12 only in what they process: in character recognition part 11 they process the data to be recognized, and in template generation part 12 they process the training character set. Classifier 1111 looks up the correct class according to the shape features of the character to be recognized, so that common template matching module 1112 and variant template matching module 1113 match the character level by level against the templates pointed to by the index tables of the glyphs found in that class. Candidate character output module 1114 outputs the text character represented by the corresponding internal code as a candidate character when the character to be recognized has been successfully matched, in common template matching module 1112 and variant template matching module 1113, against all levels of templates pointed to by the glyph index table; it also outputs a generated value expressing the degree of match between each candidate character and the corresponding image data. Post-processing unit 112 uses context, on the basis of the candidate characters obtained, to correct recognition errors made by recognition unit 111. Text data output unit 113 outputs the document once it has been converted into correct text data.

Preferably, the character recognition part comprises classifier 1111. The purpose of classification is to select quickly, from a large character set, a comparatively small subset of candidate characters while keeping the probability that the subset contains the correct class of the character to be recognized as high as possible. The character recognition part may, however, omit classifier 1111, in which case the set of characters to be recognized that is input through image input unit 101 is passed, after preprocessing, directly to the template matching modules in recognition unit 111 and matched level by level, in arbitrary order, against the templates in dictionary 126 pointed to by each glyph index table. Matching is then clearly much slower than with template matching modules that have a classifier in front of them.

If a character to be recognized is successfully matched with a glyph that has only common templates, no subsequent variant template matching is needed for it; that is, variant template matching module 1113 is not needed for that character.

If, when a character to be recognized is to be matched against a certain level of common template of a glyph, that common template has already been matched earlier while matching other glyphs, the match need not be repeated; the matching distance output earlier is simply reused.
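The matching behaviour described in the last few paragraphs (level-by-level matching, stopping at the first failed level, accepting any one variant template, and reusing earlier matching distances) can be sketched as follows. The tuple layout of an index table, the mismatch-count distance, and the `max_distance` threshold are assumptions made for illustration; the patent does not fix them.

```python
def mismatch_distance(bitmap, template):
    """Illustrative distance: number of template points that disagree with the bitmap."""
    return sum(1 for r, c, v in template if bitmap[r][c] != v)


def recognize(bitmap, index_tables, templates,
              distance_fn=mismatch_distance, max_distance=0):
    """Return the internal codes of all glyphs whose templates match the bitmap.

    index_tables: list of (internal_code, common_ptrs, variant_ptrs) tuples.
    templates:    template store; pointers are indices into this list.
    """
    cache = {}  # pointer -> matching distance already computed for this bitmap

    def matched(ptr):
        if ptr not in cache:              # each shared template is matched only once
            cache[ptr] = distance_fn(bitmap, templates[ptr])
        return cache[ptr] <= max_distance

    candidates = []
    for internal_code, common_ptrs, variant_ptrs in index_tables:
        # all() stops at the first level that fails, ending work on this glyph.
        if not all(matched(p) for p in common_ptrs):
            continue
        # If variant templates exist, a match with any one of them is enough.
        if variant_ptrs and not any(matched(p) for p in variant_ptrs):
            continue
        candidates.append(internal_code)
    return candidates
```

Because the cache is keyed by template pointer, two glyphs that share a radical or class template cost one comparison between them, whichever of the two is tried first.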

The operation of template generation part 12 of Fig. 1 according to an embodiment of the invention is described below with reference to Fig. 2.

First, in step 201, the training sample set is input. A large number of multi-typeface character samples collected in practice are converted into image data by image input unit 101, for example a scanner, fax machine, or digital camera.

In step 202, the image data from image input unit 101 is preprocessed. Existing techniques are used to apply the necessary preprocessing, such as noise removal and binarization, to the raw image data. Layout analysis and single-character segmentation are then performed on the preprocessed image data, and each resulting single-character bitmap is normalized to a standard 48 × 48 bitmap.
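The patent does not say how the normalization to 48 × 48 is carried out. As one simple possibility, the sketch below uses nearest-neighbour resampling of a binary bitmap; the function name and the choice of resampling method are assumptions.

```python
def normalize(bitmap, size=48):
    """Resample a binary bitmap (list of rows of 0/1) to size x size.

    Uses nearest-neighbour sampling: output point (r, c) takes the value of
    the source point at the proportionally scaled position.
    """
    rows, cols = len(bitmap), len(bitmap[0])
    return [
        [bitmap[r * rows // size][c * cols // size] for c in range(size)]
        for r in range(size)
    ]
```

A production system would more likely crop to the character's bounding box first and might use a shape-preserving normalization, but the fixed output size is what the later template steps rely on.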

In step 203, each normalized single-character bitmap is labeled with the internal code and typeface of its corresponding text character. Character bitmaps with the same internal code but different typefaces are selected to form a set to be clustered, the character bitmaps in each set are clustered, and the bitmaps of the characters gathered in one class are superimposed to form a glyph. The typefaces of all the characters forming a glyph thus make up a typeface set, and the typefaces in this set are all very similar. One representative typeface selected from the typeface set serves as the typeface of the glyph.
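Bitmap superposition can be read as adding the clustered bitmaps point by point. That reading, and the function name, are assumptions; the patent does not define the operation precisely.

```python
def superimpose(bitmaps):
    """Add same-sized 0/1 bitmaps point by point.

    In the result, a value equal to len(bitmaps) marks a point that is a
    stroke point in every bitmap, and 0 marks a point that is background
    in every bitmap; values in between mark points where the bitmaps differ.
    """
    rows, cols = len(bitmaps[0]), len(bitmaps[0][0])
    for b in bitmaps:
        if len(b) != rows or any(len(row) != cols for row in b):
            raise ValueError("all bitmaps must have the same size")
    return [
        [sum(b[r][c] for b in bitmaps) for c in range(cols)]
        for r in range(rows)
    ]
```

Under this reading, the superimposed glyph records both what the clustered characters share and where they differ, which is the information the later common and variant template steps draw on.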

After glyph clustering, with the glyph as the basic unit, the radical common template is extracted in step 204, the class common template in step 205, and the glyph common template in step 206.

For glyphs that the three kinds of templates above still cannot distinguish from other glyphs, the distinguishing points are learned in step 207, thereby generating variant templates.

The extraction of radical common template:

It is after the font cluster that the radical common template is extracted, and is that elementary cell is operated with the font.Each font in the same set to be clustered is collected to similar font according to identical radical, carries out cluster by the similarity degree between the radical in set, and the public part to poly-font extraction radical in same class forms the radical common template.Fig. 3 illustrates according to one embodiment of the invention and carries out the process flow diagram that the radical common template is extracted.

In step 301, the radicals to be extracted are selected. A radical to be extracted must have three features. First, it has few strokes and occupies no more than half of the whole Chinese character. Second, a large number of fonts belong to it. Third, regarding its position in the whole font, the radical on the left is chosen in a left-right structure, and a radical is likewise chosen in an up-down structure.

In step 302, with each radical to be extracted as the core, the fonts that belong to the same radical are selected to form a text file, called a radical sequence text.

In step 303, all fonts that have the same radical and similar typefaces are collected to form a set to be clustered. This process is repeated until every such font has been assigned to its corresponding set to be clustered.

In step 304, for one set to be clustered, a region that fully contains the radical is framed on the standard 48 × 48 dot matrix according to the shape, position, and size of the radical to be extracted in one of the set's fonts, which gives the mask region of that set. This process is repeated until a mask region has been manually created for every set to be clustered. All such regions are stored in the mask dictionary, indexed by typeface and radical ISN.

Because fonts with the same radical and similar typefaces are very much alike in the radical component, and the mask region does not need to be strictly accurate, it suffices to make the mask from one representative font. The masks are stored in the mask dictionary, which is simply a database for storing masks. If no mask region is supplied when the dot matrices are superimposed, the region of difference is enlarged and the clustering result suffers.

In step 305, the fonts in each set to be clustered are clustered by the similarity of their radicals. The dot matrices of the fonts gathered into one class are superimposed, and the points within the mask region that are common to all of them, on both strokes and background, are extracted as the radical common template of each font in the set.
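The superposition in step 305 can be pictured with the following minimal sketch. It assumes that a point is "common" only when every superimposed font agrees on it, and that the mask region is a rectangle given as inclusive bounds; both the representation and the name `extract_common_template` are illustrative assumptions rather than details given in the description.

```python
def extract_common_template(fonts, mask):
    """Return (stroke_points, background_points) shared by all fonts inside mask.

    fonts: list of equal-sized binary dot matrices (1 = stroke, 0 = background).
    mask:  (top, left, bottom, right), inclusive bounds of the mask region.
    """
    top, left, bottom, right = mask
    stroke, background = set(), set()
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            # Superimpose: count how many fonts have a stroke at this point.
            hits = sum(font[y][x] for font in fonts)
            if hits == len(fonts):
                stroke.add((y, x))        # stroke in every font
            elif hits == 0:
                background.add((y, x))    # background in every font
    return stroke, background
```

Points on which the fonts disagree are left out of both sets, which is why an overly large mask (or no mask) would shrink the template.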

In step 306, the extracted radical common template is saved in template memory module 1262 of dictionary 126. For each font, a pointer to the corresponding radical common template in template memory module 1262 is created, and the font's ISN and typeface are saved together with this pointer in font index table storage module 1261 of dictionary 126, thereby establishing the index table of the font.

Finally, in step 307, the radical portion of the dot matrix of each font that has a radical common template is covered with the mask, and the remaining portion takes part, together with the other fonts, in the extraction of the class common template.

The extraction of class common template:

The class common template is extracted after the radical common template. The sets of fonts to be clustered are defined by the coarse classification used for pre-classification in the recognition system. The coarse classification here may be any pre-classification method used in character recognition, for example a coarse classification computed from the geometric features of the character pattern. Nearly all prior-art character recognition includes such a coarse classification step. Because the recognition process of this embodiment jumps from one coarse class to another, recognition speed improves more effectively when fonts that share a common template are concentrated in the same coarse class or in neighboring coarse classes.

Clustering for the class common template is performed with the radical portion of the dot matrix masked, which reduces interference from the radical. The fonts in the same set to be clustered are clustered by the similarity of their dot matrices after the radical has been masked. For each resulting class, the dot matrices of all fonts in the class are superimposed to generate a foreground dot matrix and a background dot matrix, and the common template points are extracted as the class common template of each font in the class. If there are too few common template points, the large class is split into two groups and the templates are extracted again.

Class common template extraction differs from radical common template extraction. In radical common template extraction the grouping is non-overlapping: the set to be clustered to which a font belongs is fixed, so clustering runs within one set at a time. In class common template extraction the boundary that assigns fonts to groups is not absolute: a font may belong to one group and also to a group adjacent to it, so some redundancy is necessary, and the collection of fonts into sets to be clustered is interleaved with the clustering process.

Fig. 4 shows, according to one embodiment of the invention, the whole process in which the collection of fonts to be clustered is interleaved with clustering during class common template extraction.

First, the coarse classification jump rule is initialized in step 401. During class common template extraction, fonts to be clustered are collected into groups defined by the coarse classification used for pre-classification in the recognition system. Because recognition jumps from one coarse class to another, and the coarse classification code is usually not one-dimensional, the classification jump rule must be initialized first. For each coarse class, the other coarse classes are arranged into a one-dimensional sequence by some rule based on their distance from that class, for example in order of increasing distance.
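As a minimal sketch of such a jump rule, assume each coarse class is identified by a multi-dimensional integer code and that distance between classes is the sum of absolute coordinate differences. Neither the code layout nor the distance measure is specified in the description, and `build_jump_rule` is a hypothetical name.

```python
def build_jump_rule(class_codes):
    """Map each coarse class to the other classes, nearest first.

    class_codes: list of tuples, one multi-dimensional code per coarse class.
    Assumption: Manhattan distance between codes; ties broken by code order.
    """
    def distance(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))

    return {
        code: sorted((other for other in class_codes if other != code),
                     key=lambda other: (distance(code, other), other))
        for code in class_codes
    }
```

Precomputing the sequence once means that both template extraction and recognition can walk outward from any class without recomputing distances.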

In step 402, an empty cluster linked list is created and an initial similarity threshold is set. In this embodiment, similarity is the number of identical points when the normalized dot matrices of two fonts are superimposed; based on experience, the initial similarity threshold is therefore set to about 0.6 of the total number of points in the normalized dot matrix.
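The similarity measure defined here is simple enough to state directly. The sketch below counts identical points and compares the count with a fraction of the total; for a 48 × 48 dot matrix the total is 2304 points, so a ratio of 0.6 corresponds to roughly 1382 identical points. The function names are illustrative.

```python
def similarity(a, b):
    """Number of points at which two equal-sized dot matrices are identical."""
    return sum(pa == pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def meets_threshold(a, b, ratio=0.6):
    """True if the identical points make up at least `ratio` of all points."""
    total_points = len(a) * len(a[0])
    return similarity(a, b) >= ratio * total_points
```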

In step 403, the first coarse class is found.

In step 404, fonts are collected around the coarse class that was found. Taking this coarse class as the reference, the process jumps to the surrounding classes and selects the fonts to be clustered that satisfy the threshold; a font that already belongs to some cluster is skipped.

Decision step 405 checks whether the number of collected fonts has reached a predetermined value and whether the classification jump has gone far enough. If fewer fonts than this value have been collected and the jump has not yet gone far enough, step 406 jumps to the next coarse class according to the classification jump rule initialized in step 401. The flow then returns to step 404 and collects fonts around this coarse class. The flow loops through steps 404 to 406 until enough fonts (for example, more than 100) are available for clustering or the classification jump has gone far enough, and only then proceeds to step 407.

In step 407, clustering is performed within this sufficiently large set to be clustered according to the similarity between the fonts and the corresponding coarse class, and the classes that qualify are put into the cluster linked list.

Decision step 408 then checks whether all coarse classes have been collected under the current threshold. If not, step 409 finds a coarse class that has not yet been collected, and the flow returns to step 404 to collect fonts around that coarse class. If so, the flow proceeds to step 410.

Decision step 410 checks whether the similarity threshold has been lowered far enough, for example to below 0.2 of the total number of points in the normalized dot matrix. If it has dropped below the predetermined value, the flow proceeds to step 412. Otherwise the threshold is lowered in step 411 and steps 403 to 410 are repeated.

In step 412, the cluster linked list is output and the class common templates are created.

Finally, in step 413, the class common templates are saved in template memory module 1262 of dictionary 126, a pointer to the corresponding class common template in template memory module 1262 is created for each font, and this pointer is saved in the font's index table in font index table storage module 1261 of dictionary 126.
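The control flow of steps 402 to 412 might be sketched as below. This is only an outline under several assumptions that the description leaves open: clustering within a batch is done greedily around a seed font, a class qualifies when it holds at least two fonts, the threshold is lowered by 10 percentage points per pass, the threshold is applied during clustering rather than during collection, and fonts are flattened tuples keyed by a name. All identifiers are hypothetical.

```python
def similarity(a, b):
    """Number of positions at which two flattened dot matrices agree."""
    return sum(p == q for p, q in zip(a, b))


def collect_and_cluster(bitmaps, fonts_by_class, jump_rule,
                        min_batch=100, max_jumps=3,
                        start_pct=60, floor_pct=20, step_pct=10):
    """bitmaps: name -> flattened dot matrix; fonts_by_class: class -> names;
    jump_rule: class -> other classes, nearest first."""
    clusters = []        # step 402: empty cluster list
    clustered = set()    # fonts already placed in some cluster
    pct = start_pct      # step 402: threshold as a percentage of all points
    while True:
        for start in fonts_by_class:                        # steps 403, 408, 409
            batch = []
            for jumps, cls in enumerate([start] + jump_rule[start]):
                # Step 404: collect fonts, skipping those already clustered.
                batch += [f for f in fonts_by_class[cls]
                          if f not in clustered and f not in batch]
                if len(batch) >= min_batch or jumps >= max_jumps:   # step 405
                    break                                   # else step 406: jump on
            remaining = batch                               # step 407
            while remaining:
                seed = remaining[0]
                needed = pct * len(bitmaps[seed]) / 100
                group = [f for f in remaining
                         if similarity(bitmaps[seed], bitmaps[f]) >= needed]
                if len(group) > 1:                          # assumed qualification rule
                    clusters.append(group)
                    clustered.update(group)
                remaining = [f for f in remaining if f not in group]
        if pct < floor_pct:                                 # step 410
            return clusters                                 # step 412
        pct -= step_pct                                     # step 411
```

Fonts that fail to cluster at one threshold remain outside `clustered`, so they are collected again on the next pass under a lower threshold.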

For each resulting class, the dot matrices of all fonts in the class are superimposed. As described above, these are dot matrices with the radical masked. Foreground and background dot matrices are thus generated, and the common template points are extracted as the class common template of each font in the class. If there are too few common template points, the large class is split into two groups and the templates are extracted again.

It should be noted that, in statistical pattern recognition, radical common template extraction and class common template extraction are two separate processes whose order is not fixed. When the common templates are actually generated, which process runs first can be decided case by case.

The extraction of font common template:

The font common template is extracted after the class common template. Because the single-character dot matrices that share an ISN but differ in typeface were clustered into fonts beforehand, the fonts already exist when the font common template is extracted and no further clustering is needed. Extraction at this level refers to the points of the radical and class common templates, and takes stroke points and background points from the foreground and background samples of the font, respectively. In other words, once a font has been through radical and class common template extraction, positions in the font where radical or class common template points have already been extracted need no font template points.
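The exclusion of already-covered positions can be illustrated as follows. The sketch assumes that every remaining point becomes a font template point, which the description does not state (it may select only a subset), and `extract_font_template` is a hypothetical name.

```python
def extract_font_template(font, covered):
    """Return (stroke_points, background_points) for the font common template.

    font:    binary dot matrix of the font (1 = stroke, 0 = background).
    covered: set of (row, col) positions already used by the radical and class
             common templates; these positions are skipped.
    """
    stroke, background = set(), set()
    for y, row in enumerate(font):
        for x, value in enumerate(row):
            if (y, x) in covered:
                continue  # already represented at a higher template level
            (stroke if value else background).add((y, x))
    return stroke, background
```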

If the three templates above are not sufficient to distinguish a font from other fonts, the templates undergo learning: with reference to the font's common templates at every level, the differing parts of the characters that form the font are extracted to generate variant templates, which add distinguishing points between fonts.

After the common templates at every level have been created and saved in template memory module 1262, each font has pointers to its radical, class, and font common templates, and possibly also pointers to variant templates. The font's ISN and typeface are stored together with the pointers to its templates at every level, which generates the index table of the font. Fig. 5 uses a concrete character to illustrate the font index table obtained in this embodiment. The font shown is the character "pulls" in the Founder Yao typeface. Its ISN is stored in the font's index table together with the pointers to its common templates at every level, and each pointer gives the storage address of that level's common template in template memory module 1262. The radical common template pointer of this font leads to the radical common template of "pulls" (shown as ●) stored at address A03 in template memory module 1262. Similarly, the class common template pointer leads to the class common template of "pulls" (shown as ■) stored at address B10, and the font common template pointer leads to the font common template of "pulls" (shown with a third symbol) stored at address C07. Fig. 5 also shows that several fonts such as "pulls", "playing the part of", and "mixing" share one radical common template, and that several fonts such as "pulls", "Du", and "location" share one class common template. The font shown in Fig. 5 covers only one typeface; when several "pulls" characters have similar typefaces, they share one font common template. If these typefaces differ considerably, variant templates of "pulls" are also generated.
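The index table of Fig. 5 can be pictured as a small record of pointers into a shared template store. In the sketch below, the addresses A03, B10, and C07 come from the description; every other address, the ISN strings, the typeface label, and the class and function names are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FontIndexEntry:
    isn: str                  # internal code of the character (placeholder values)
    typeface: str             # typeface label stored with the ISN
    radical_ptr: str          # address of the radical common template
    class_ptr: str            # address of the class common template
    font_ptr: str             # address of the font common template
    variant_ptrs: list = field(default_factory=list)  # optional variant templates


def addresses_used(entries):
    """Set of distinct template addresses referenced by the index entries."""
    return {ptr
            for e in entries
            for ptr in (e.radical_ptr, e.class_ptr, e.font_ptr, *e.variant_ptrs)}


index_table = [
    FontIndexEntry("isn-1", "Yao", "A03", "B10", "C07"),  # the Fig. 5 example
    FontIndexEntry("isn-2", "Yao", "A03", "B21", "C12"),  # shares radical template A03
    FontIndexEntry("isn-3", "Yao", "A15", "B10", "C30"),  # shares class template B10
]
```

Here three fonts hold nine pointers but reference only seven distinct templates, which is the kind of sharing that reduces the stored template volume.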

When a character is recognized according to this embodiment, the early stage is the same as in traditional character recognition methods: the coarse class and fine class of the character to be recognized are found from the shape features of its dot matrix. The difference is that the traditional method matches the character to be recognized against the template of each font in these fine classes, whereas the present invention matches it level by level against the common templates of each font. A match is considered successful only when the sum of the matching distances between the dot matrix to be recognized and all the templates of the font at every level is the smallest.

Fig. 6 is a flowchart of character recognition by template matching according to the invention. When the character to be recognized is matched against one level of a font's common templates, the matching result is recorded and the operations shown in Fig. 6 are performed. First, step 601 checks whether this common template has already been matched while an earlier font was being matched. If it has not, the match is performed in step 602 and the matching distance is recorded. If it has, step 603 adds the previously recorded matching distance directly from the matching record, without matching again. Step 604 then checks the accumulated matching distance and handles two cases. If the distance is small, step 605 goes on to match the other templates of this font on the basis of this distance. If the distance is large enough, the font is considered not to match the dot matrix to be recognized, the other templates of this font are not matched, and the flow ends.

The method of this embodiment thus improves recognition speed mainly in the matching of the dot matrix to be recognized against the templates of each font during recognition. On the one hand, once a character to be recognized has been matched against a given common template for one font, the record of that first match can be reused whenever the character later has to be matched against the same common template, with no repeated matching. On the other hand, when a character to be recognized is matched level by level against the templates of a font, matching against the other templates of that font stops as soon as one level fails to match. Both aspects eliminate a large amount of repeated matching.
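Both savings can be seen in a short sketch of the Fig. 6 flow. The matching distance used here (the number of template points that the dot matrix contradicts) and the single accumulated limit are assumptions; the description does not define the distance function, and the names `template_distance` and `match_font` are hypothetical.

```python
def template_distance(template, glyph):
    """Count template points contradicted by the glyph (assumed distance)."""
    stroke, background = template
    return (sum(1 for y, x in stroke if glyph[y][x] == 0)
            + sum(1 for y, x in background if glyph[y][x] == 1))


def match_font(glyph, pointers, store, cache, limit):
    """Match one glyph against one font's templates, level by level.

    pointers: template addresses of the font, in matching order.
    store:    address -> (stroke_points, background_points).
    cache:    address -> distance already computed for this glyph.
    Returns the accumulated distance, or None if the font is abandoned early.
    """
    total = 0
    for address in pointers:
        if address not in cache:                      # step 601
            cache[address] = template_distance(store[address], glyph)  # step 602
        total += cache[address]                       # step 603 reuses the record
        if total > limit:                             # step 604
            return None                               # skip the remaining levels
    return total                                      # step 605 ran for every level
```

The cache belongs to one character to be recognized and is passed to `match_font` for each font in turn, so a template shared by many fonts is compared with the dot matrix only once.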

Fig. 7 describes the character recognition flow of the embodiment of Fig. 6 in more detail. The early stage of character recognition uses the same processing as the early stage of template generation: the paper document is first converted into image data, the necessary preprocessing such as denoising and binarization is performed, layout analysis and single-character segmentation are applied to the preprocessed data to obtain single-character dot matrices, and each single-character dot matrix is normalized to a standard 48 × 48 dot matrix.

Using prior-art coarse and fine classification techniques, the coarse/fine classification code of the normalized character pattern is calculated, a fine class is found according to the classification jump rule, and template matching is performed based on the index table of each font in that fine class. The specific steps are as follows:

In step 701, for the first font index table in this class, the dot matrix of the character to be recognized is matched against the radical common template that the index table points to, and the matching distance is recorded. Decision step 702 checks whether the radical common template rejects the character. If it does, matching against this radical common template stops and the flow returns to step 701 to look up the next font index table for template matching. Otherwise the flow proceeds to step 703.

In step 703, the dot matrix of the character to be recognized is matched against the class common template, and the matching distance is recorded and added to the radical common template matching distance. Decision step 704 checks whether the class common template rejects the character and whether the sum of the class and radical common template matching distances exceeds a predetermined threshold. If the character is rejected or the sum exceeds the threshold, matching against this class common template stops and the flow returns to step 701 to look up the next font index table for template matching. Otherwise the flow proceeds to step 705.

In step 705, the dot matrix of the character to be recognized is matched against the font common template, and the matching distance is recorded and added to the radical and class common template matching distances. Decision step 706 checks whether the font common template rejects the character and whether the sum of the font, radical, and class common template distances exceeds a predetermined threshold. If the character is rejected or the sum exceeds the threshold, matching stops and the flow returns to step 701 to look up the next font index table for template matching. Otherwise the flow proceeds to step 707.

In step 707, the dot matrix of the character to be recognized is matched against the variant template, and the matching distance is recorded and added to the radical, class, and font common template matching distances. Decision step 708 checks whether the variant template rejects the character and whether the sum of the variant template distance and the radical, class, and font common template distances exceeds a predetermined threshold. If the character is rejected or the sum exceeds the threshold, matching stops and the flow returns to step 701 to perform template matching with the next font. Otherwise, in step 709, the text character jointly represented by the ISN and typeface associated with the pointers to the above templates is added to the candidate character list.

Decision step 710 checks whether the matching distance of the candidate character is less than a predetermined threshold. If it is, step 711 treats this candidate as the correct character for the character to be recognized. At this step matching stops and the flow returns to step 701 to perform template matching with the next font.

Otherwise, in step 712, elastic matching is applied to the added candidate character according to criteria such as the classification distance, typeface, and confidence. Elastic matching adjusts the positions of the template points according to certain rules to reduce the effect of distortions such as compression, expansion, and displacement of the character dot matrix. To keep errors in the matching distance from causing misrecognition, the variant template is combined with the class, radical, and font common templates during elastic matching. That is, the four templates are merged into one template and elastically matched under the same rule.

Steps 701 to 710 are repeated until the character to be recognized has been template-matched against all the fonts in the class.

If the font index table that was found contains pointers to several variant templates, then after the character to be recognized has matched all the common templates that the table points to, a successful match with any one of the variant templates is enough for the font to be considered matched, and the text character jointly represented by the corresponding ISN and typeface information is taken as a candidate character.
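Putting the level-by-level matching of steps 701 to 709 together with the rule for several variant templates gives the following sketch. It simplifies the flow in ways the description does not prescribe: one accumulated limit stands in for the separate rejection tests and per-level thresholds, the distance is a count of contradicted template points, index entries are plain dictionaries, and elastic matching (step 712) is omitted. All names and sample values are hypothetical.

```python
def distance(template, glyph):
    """Count template points contradicted by the glyph (assumed distance)."""
    stroke, background = template
    return (sum(glyph[y][x] == 0 for y, x in stroke)
            + sum(glyph[y][x] == 1 for y, x in background))


def recognise(glyph, entries, store, limit):
    """Return candidate (distance, isn, typeface) tuples, best first."""
    candidates = []
    for entry in entries:                          # one font index table at a time
        total = 0
        for address in entry["common"]:            # radical, class, font levels
            total += distance(store[address], glyph)
            if total > limit:
                break                              # abandon this font early
        else:
            if entry["variants"]:
                # Any one variant template matching is sufficient, so use the best.
                total += min(distance(store[a], glyph) for a in entry["variants"])
                if total > limit:
                    continue
            candidates.append((total, entry["isn"], entry["typeface"]))
    return sorted(candidates)
```

Taking the minimum over the variant templates is one way to express "a successful match with any one of them": if the closest variant passes the limit, the font is accepted.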

As described above, the order in which the common templates of the invention are extracted can vary, for example by extracting the class common template before the radical common template. It should also be noted that the order in which the character to be recognized is matched against the common templates is independent of the order in which those templates were extracted, and the matching order can vary as well, for example by matching the class common template before the radical common template. As for matching order, it is preferable to match first against the common template that covers the larger number of fonts.

Experimental results show that, with the invention, a printed Chinese character recognition system that recognizes all the Chinese characters of GB GBK2312 in 56 typefaces needs only 3/5 of the original template storage in dictionary 126, with no loss in recognition rate. The storage shrinks because a separate template no longer has to be stored for every font; only the templates shared by several fonts are stored. A radical common template, a class common template, and a font common template stored in template memory module 1262 of dictionary 126 can then be combined to serve as the template of the corresponding font, sometimes together with a variant template. With the invention, template matching in the character recognition process takes only about 1/2 of the original matching time.

Accordingly, the invention greatly reduces the redundancy of information in a printed Chinese character recognition system and markedly improves recognition speed while reducing dictionary storage.

Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if such changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to cover them as well.

Claims

1. A template-optimized character recognition method, characterized in that the method comprises the steps of:

constituting a font from at least one training character in a training character set; for each of the font's different components, placing the font in turn into a different set to be clustered, made up of fonts whose corresponding component is similar, clustering it there and extracting a common template; and saving the multi-level common templates extracted for the font;

extracting and saving the last-level common template of the font with reference to the multi-level common templates already extracted;

for each font, storing its ISN in correspondence with the pointers to its common templates at every level, to generate the index table of the font;

when a set of characters to be recognized is recognized, matching a character to be recognized level by level against the common templates pointed to by a font index table that has been found, and recording the matching results, to obtain candidate characters.

2. the method for claim 1, it is characterized in that: before mating with the font concordance list that finds one-level common template pointed, if matched record has been arranged, then directly use the matching result that is write down and determine whether mate with described common template.

3. The method of claim 1 or 2, characterized in that the step of obtaining candidate characters further comprises:

if the character matches the common templates at every level, taking the text character represented by the corresponding ISN as a candidate character for the character to be recognized;

if the character fails to match the common template at one level, stopping the matching against this font and matching level by level against the common templates pointed to by the next font index table found.

4. the method for claim 1 is characterized in that: extract the multistage common template of described font and store further and comprise:

5. The method of claim 4, characterized in that the first-level common template is the radical common template and the second-level common template is the class common template.

6. The method of claim 4, characterized in that the first-level common template is the class common template and the second-level common template is the radical common template.

7. The method of claim 5, characterized in that step A further comprises: selecting a radical to be extracted; searching for fonts that have the same radical as the font; placing the fonts found and the font in the same set to be clustered and clustering them; and superimposing the dot matrices of all fonts in a class to generate the radical common template of each font in the set.

8. The method of claim 7, characterized in that step A further comprises: framing, on the dot matrix of a font with a similar typeface, a region that fully contains the radical to be extracted, as the mask region for dot matrix superposition.

9. The method of claim 8, characterized in that step B further comprises: covering with the mask the part of the font from which the radical common template was extracted, the remaining part taking part together with other fonts in the extraction of the class common template.

10. The method of claim 9, characterized in that step B further comprises: searching, based on the initialized coarse classification jump rule, for fonts whose corresponding part is similar to that of the font.

11. The method of claim 10, characterized in that in step B the process of collecting a set to be clustered is interleaved with the process of clustering the fonts in the set, and a font that was collected into a set to be clustered under a predetermined threshold but was not clustered successfully takes part in the next collection of a set to be clustered.

12. the method for claim 1 is characterized in that: in one of identification waits to become literate the process of symbol, find a font concordance list, search correct classification according to the resemblance of the described symbol of waiting to become literate and find.

13. the method for claim 1 is characterized in that: each training character concentrated in the training character all is a font.

14. method as claimed in claim 13 is characterized in that: the font information of each font is kept in the font concordance list together with its ISN.

15. the method for claim 1 is characterized in that: described font is by concentrate to select the different monocase dot matrix of ISN same font to carry out cluster from the training character and carry out the dot matrix stack at one character and form gathering.

16. method as claimed in claim 15 is characterized in that: one of font information of each character of a font of formation is kept in the concordance list of described font together with the ISN of described font.

17. method as claimed in claim 16, it is characterized in that: the difference part of further extracting each character that forms described font, as the variant template of described font and preserve, the pointer that points to each variant template of described font is mapped with one of font information of each character that forms described variant template and is kept in the concordance list of described font.

18. The method of claim 15 or 17, characterized in that the step of obtaining candidate characters further comprises:

if the character matches the templates at every level, taking the text character jointly represented by the corresponding ISN and typeface information as a candidate character for the character to be recognized;

if the character fails to match the template at one level, stopping the matching against the font and matching level by level against the common templates pointed to by the next font index table found.

19. The method of claim 18, characterized in that taking the text character jointly represented by the corresponding ISN and typeface information as a candidate character when the character matches the templates at every level further comprises: if the font index table that was found contains pointers to several variant templates, then after the character to be recognized has matched all the common templates that the table points to, a successful match with any one of the variant templates is enough for the font to be considered matched.

20. A template-optimized character recognition system, comprising a template generation part and a character recognition part, the character recognition part comprising a recognition unit, characterized in that: the template generation part comprises a font storage unit, a common template extraction unit, a template output unit, and a dictionary, wherein the common template extraction unit comprises a multi-level common template extraction module and a last-level common template extraction module, and the dictionary comprises a font index table storage module and a template memory module;

the multi-level common template extraction module is configured, for the different components of a font among the fonts in the font storage unit, to cluster the font in turn within different sets to be clustered, each made up of fonts whose corresponding component is similar, and to extract the font's common templates at every level;

the last-level common template extraction module is configured to extract the last-level common template of the font with reference to the multi-level common templates already extracted;

the template output unit is configured to receive the common templates of every level extracted by the common template extraction unit and save them in the template memory module, and to save the pointers to the font's common templates in the template memory module into the index table of the font in the font index table storage module;

the recognition unit comprises a common template matching module, configured to match a character to be recognized level by level against the common templates pointed to by a font index table that has been found.

21. The system of claim 20, characterized in that the recognition unit further comprises a candidate character output module, configured to output the text character represented by the corresponding ISN as a candidate character when the character to be recognized has matched, in the common template matching module, all the templates at every level that the table points to.

22. The system of claim 20, characterized in that the template generation part further comprises a font generation unit, configured to select from the training character set single-character dot matrices that share an ISN but differ in typeface, cluster them, superimpose the dot matrices of the characters gathered into one cluster, and save the resulting font in the font storage unit.

23. The system of claim 22, characterized in that the template generation part further comprises a variant template extraction unit, configured to extract the differing parts of the characters that form the font as the variant templates of the font.

24. The system of claim 23, characterized in that the template output unit is further configured to receive the variant templates extracted by the variant template extraction unit and save them in the template memory module of the dictionary, and to save the pointers to all variant templates of the font in the template memory module into the index table of the font in the font index table storage module.

25. The system of claim 24, characterized in that the recognition unit further comprises a variant template matching module, configured to match the character to be recognized against the variant templates that the table points to, after the character has matched, through the common template matching module, all the common templates at every level pointed to by the font index table that has been found.

26. The system of claim 20 or 25, characterized in that the recognition unit further comprises a classifier, configured to look up the correct class according to the shape features of the character to be recognized, so that the template matching module matches level by level against the templates pointed to by the font index tables found in that class.