CN101286202A - Multi-font, multi-size printed character recognition method based on the Yi character set - Google Patents

Info

Publication number
CN101286202A
Authority
CN
China
Prior art keywords
character
word
statistics
file
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008100478130A
Other languages
Chinese (zh)
Other versions
CN100589119C (en
Inventor
朱宗晓
吴显礼
刘赛
田微
程立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South Central Minzu University
Original Assignee
South Central University for Nationalities
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South Central University for Nationalities
Priority to CN200810047813A
Publication of CN101286202A
Application granted
Publication of CN100589119C
Expired - Fee Related
Anticipated expiration

Landscapes

  • Character Discrimination (AREA)
  • Character Input (AREA)

Abstract

The invention relates to a multi-font, multi-size printed character recognition method based on the Yi character set. It takes a single page of a printed Yi book or magazine as the processing object. The page is first scanned into an image file on a computer. The Yi characters and the common characters in the image (punctuation marks, English letters and digits) undergo initial segmentation, merging and re-segmentation. For each single character obtained by segmentation, high-dimensional features based on the peripheral direction contribution degree are extracted and then compressed into low-dimensional features by a feature compression transformation matrix. Character classification is completed by three-level dictionary feature matching on the low-dimensional features. Document post-processing then combines the characters and restores them to computer text, and characters in the text that may have been misrecognized are flagged.

Description

Multi-font, multi-size printed character recognition method based on the Yi character set
Technical field
The invention belongs to the field of character recognition and specifically relates to a multi-font, multi-size printed character recognition method based on the Yi character set.
Background technology
An important way to rescue the spoken and written languages of China's ethnic groups is automatic computer recognition and entry of ethnic-language documents, so that the documents recording the cultural essence of each ethnic group are processed and passed on in digital form. Since the Sichuan standard Yi script was introduced in the Liangshan Yi-inhabited area in 1980, it has been widely used and promoted in the Yi areas that speak the Northern dialect, chiefly the Liangshan Yi area of Sichuan and the Xiaoliangshan area around Ninglang in Yunnan Province. It has been put to good use in school education (primary school, junior and senior secondary school, secondary technical schools and colleges), literary and artistic creation, film and television translation and broadcasting, government documents, policies and regulations, music composition and other fields, and has given a positive impetus to the economic and cultural development of the Yi people. At present, most research on Yi input methods concentrates on keyboard code input, while research on recognition-based input of printed Yi text is still a blank. This has seriously restricted the popularization and application of information technology in minority areas. In view of this situation, we invented a multi-font, multi-size printed character recognition method based on the Yi character set. The method is applied first to printed Yi characters but is not limited to them, and it can readily be extended to document recognition for other scripts such as Chinese, Japanese and Korean. It provides a fast and convenient input mode for Yi and other similar scripts, which is of great significance for inheriting and developing ethnic culture and promoting social progress in minority areas.
All character recognition is based on the features of the script's characters. These features fall broadly into two classes: structural features and statistical features. The basic idea of structure-based recognition is to segment and reduce the character image into primitives, such as strokes, topological points and structural break points, and to compare them with a template, checking whether the required primitives are present and whether primitives that must not be present occur, thereby determining the class. Current methods mainly obtain structural primitives from the skeleton, the contour or the strokes. Structural methods are susceptible to noise and not very robust, but they distinguish similar characters easily and adapt well to deformation across fonts. Statistical features extract from the raw data the information most relevant to classification, so that within-class differences are minimized and between-class differences are maximized. The features should remain as invariant as possible to deformations within the same character class. Statistical methods resist noise well and generalize well, but if the statistical features are poorly chosen, similar characters are hard to distinguish. The Yi script has 1165 characters: 819 standard Yi characters, 345 secondary-high-tone characters and 1 sound-substitution character. A secondary-high-tone character differs only slightly in shape from its corresponding standard Yi character, so the Yi script contains a large number of similar characters. This poses a challenge to character recognition with statistical methods.
Summary of the invention
The object of the invention is to provide a method for multi-font, multi-size printed character recognition based on the Yi character set. Taking a single page of a printed Yi book, magazine or similar document as the processing object, the method first scans the page into an image file on the computer. The Yi characters and the common punctuation marks, English letters and digits in the image undergo character segmentation consisting of initial division, merging and re-division. For each single character obtained, high-dimensional features based on the peripheral direction contribution degree are extracted; a feature compression transformation matrix then compresses them into low-dimensional features, and three-level dictionary feature matching on the low-dimensional features completes the character classification. A high single-character recognition accuracy can thus be obtained. Document post-processing then combines the results and restores them to computer text, flagging characters in the text that may have been misrecognized. A printed character recognition system based on the multi-font, multi-size Yi character set has been implemented according to this method.
As a multi-font, multi-size printed character recognition system based on the Yi character set, the invention also covers the generation and verification of the font set and the feature dictionary. The system first scans a large number of printed Yi character samples and segments the characters in batch mode. Using the training sample character library built from the collected samples, it extracts high-dimensional features based on the peripheral direction contribution degree, derives the feature compression transformation matrix from the high-dimensional features of all training sample characters, and uses this matrix to compress those high-dimensional features into low-dimensional features. The feature dictionary is generated from the low-dimensional feature set of all training sample characters. The low-dimensional features of all training sample characters are then matched against the feature dictionary by three-level matching to complete the character classification. The file recognition statistics report and the character recognition statistics report verify the quality of the feature dictionary and provide a basis for correcting it.
The present invention comprises the following:
Sample preparation, character segmentation, feature extraction, feature compression, feature dictionary generation, three-level distance matching algorithm, recognition result statistics, and recognition error early warning.
1. Sample preparation
The Yi character samples used in the invention to generate the dictionary cover 12 fonts: Hei (boldface), Song, Microsoft thin, SIL, Founder Yi voice-over, Founder Yi thin Hei, Founder Yi Song, Founder Yi imitation Song, Founder Yi Hei, Founder Yi handwriting, Founder Yi round, and Founder Yi variety.
Each of the 12 fonts is prepared in two sizes (No. 3 and No. 5), two weights (regular and bold), three processing modes (original, first-generation copy and second-generation copy) and two scan resolutions (300 DPI and 400 DPI), giving 288 sets of scanned sample images. During sample preprocessing, each set is binarized at three threshold levels (130, 150 and 170), giving 864 sets of Yi dictionary samples that differ in size, weight, stroke thickness, darkness and blur. These samples take part in generating the Yi dictionary, for a total of 864 × 1165 = 1,006,560 Yi characters.
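The sample counts above follow from multiplying out the preparation options. The short Python sketch below only reproduces that arithmetic; the variable names and option labels are illustrative and not taken from the patent.

```python
from itertools import product

FONT_COUNT = 12
sizes = ["No. 3", "No. 5"]
weights = ["regular", "bold"]
generations = ["original", "first copy", "second copy"]
resolutions_dpi = [300, 400]
binarization_thresholds = [130, 150, 170]
YI_CHARACTER_COUNT = 1165

# One scanned set per font and per combination of size, weight, copy generation and resolution.
scan_variants = list(product(sizes, weights, generations, resolutions_dpi))
scan_sets = FONT_COUNT * len(scan_variants)

# Each scanned set is binarized at every threshold.
sample_sets = scan_sets * len(binarization_thresholds)
total_characters = sample_sets * YI_CHARACTER_COUNT

print(scan_sets, sample_sets, total_characters)  # 288 864 1006560
```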
2. Character segmentation
The principle of character segmentation is shown in Figure 3. Paper Yi text is scanned into a grayscale image file on the computer, and a suitable binarization threshold turns the grayscale image into a binary image file containing only the values 0 and 1. Image processing is applied to this file to isolate the text blocks. Each text block first undergoes text line segmentation based on row projection: the start and end positions of each line along the vertical axis are determined from the projection distribution of the image along the vertical axis. The segmentation information of each line is recorded, and character segmentation follows.
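Projection-based cutting can be illustrated with a minimal Python sketch. It assumes the binary image is a list of rows with 1 for black pixels; the function names `find_runs`, `segment_lines` and `segment_columns` are hypothetical helpers, not names from the patent. The same run-finding step serves both the row projection used for lines and the column projection applied to each line in the next step.

```python
def find_runs(profile):
    """Return (start, end) pairs, end exclusive, of consecutive non-zero profile entries."""
    runs, start = [], None
    for index, value in enumerate(profile):
        if value and start is None:
            start = index                      # a run of ink begins
        elif not value and start is not None:
            runs.append((start, index))        # the run ended just before index
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(image):
    """Cut text lines using the count of black pixels in every row."""
    row_profile = [sum(row) for row in image]
    return find_runs(row_profile)


def segment_columns(image, top, bottom):
    """Cut one text line (rows top..bottom-1) using the count of black pixels per column."""
    width = len(image[0])
    col_profile = [sum(image[r][c] for r in range(top, bottom)) for c in range(width)]
    return find_runs(col_profile)


page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
]
lines = segment_lines(page)                       # [(1, 3), (4, 5)]
first_line_pieces = segment_columns(page, 1, 3)   # [(0, 2), (3, 4), (5, 6)]
```

A real page would need noise handling (for example a minimum run length), which this sketch leaves out.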
Character segmentation is shown in Figure 4. Each segmented text line first undergoes initial character division based on column projection: the start and end positions of each character along the horizontal axis are determined from the line's projection distribution along the horizontal axis, which yields the initial fragments. Because the projection of a character depends heavily on the structure of the Yi glyph, initial division often splits the components of a single character into separate fragments. A set of merge rules, summarized from the characteristics of Yi characters, then merges the wrongly split components back into one character, giving the merged characters in Figure 4. If the text also contains general characters such as English letters, punctuation marks and digits, the merge rules must be prevented from merging those characters together. A set of anti-merge rules, summarized from the characteristics of Yi characters and general characters, prevents wrong merges and splits wrongly merged characters apart again. Rule-based merging of initial fragments relies mainly on three parameters: the statistical character width of a line, denoted tempflag; the statistical character height of a line, denoted templineheight; and the vertical coordinate of the line's center line, denoted rowbaseliney.
The merge rule for two initial Yi fragments is: 1) the width of the character formed by merging the two fragments does not exceed the line's statistical character width tempflag; 2) the gap between the two fragments is smaller than their gaps to the adjacent characters; 3) the larger of the two fragment heights is greater than 60% of the statistical character height templineheight.
The merge rule for two commas forming a double quotation mark is: 1) the width of the character formed by merging the two fragments does not exceed the line's statistical character width tempflag; 2) the gap between the two fragments is smaller than their gaps to the adjacent characters; 3) the vertical coordinates of both fragments are smaller than the center-line coordinate rowbaseliney; 4) the heights of both fragments are less than half the statistical character height templineheight.
The merge rule for three initial Yi fragments is: 1) the width of the character formed by merging the three fragments does not exceed the line's statistical character width tempflag; 2) the gaps between the three fragments are smaller than the gaps between the first and third fragments and their adjacent characters; 3) the largest of the three fragment heights is greater than 60% of the statistical character height templineheight.
The merge rule for an ellipsis (...) is: 1) the width of the character formed by merging the three fragments does not exceed the line's statistical character width tempflag; 2) the heights of all three fragments are less than 30% of the statistical character height; 3) the differences among the three fragments are less than 10% of the statistical character height.
Anti-merge rules summarize the situations in which merging is not allowed. For example, the anti-merge rule for two digits is: 1) the smaller of the two fragment heights is greater than 70% of the statistical character height and the larger is less than 90% of it; 2) the smaller of the two fragment widths is less than 40% of the statistical character width; 3) the height difference between the two fragments is less than 1.
Characters that meet the above conditions may be digits and must not be merged even if they satisfy a merge rule, hence the name anti-merge rule.
The anti-merge rule for a right parenthesis followed by a punctuation mark is: 1) the height of the first fragment is greater than 70% of the statistical height; 2) the width of the first fragment is less than 40% of the statistical width; 3) the height of the second fragment is less than 40% of the statistical height; 4) the width of the second fragment is less than 40% of the statistical width; 5) the vertical coordinate of the second fragment is greater than the center-line coordinate rowbaseliney.
The merge rules and anti-merge rules form an expert knowledge system summarized from extensive practice in segmenting Yi characters and general characters. They are highly targeted at segmenting text that mixes Yi and general characters and are an important part of the invention.
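As an illustration of how one merge rule and one anti-merge rule could interact, the Python sketch below encodes the two-fragment Yi merge rule and the two-digit anti-merge rule. The `Box` type, the function names and the way neighbouring gaps are passed in are assumptions made for the example; only tempflag, templineheight and the percentage thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box of an initial fragment; right and bottom are exclusive."""
    left: int
    right: int
    top: int
    bottom: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.bottom - self.top


def may_merge_two(a, b, gap_before, gap_after, tempflag, templineheight):
    """Two-fragment merge rule: conditions 1) to 3) of the text."""
    merged_width = b.right - a.left
    gap = b.left - a.right
    return (merged_width <= tempflag
            and gap < gap_before and gap < gap_after
            and max(a.height, b.height) > 0.6 * templineheight)


def looks_like_two_digits(a, b, tempflag, templineheight):
    """Two-digit anti-merge rule: if it holds, the fragments must stay separate."""
    return (min(a.height, b.height) > 0.7 * templineheight
            and max(a.height, b.height) < 0.9 * templineheight
            and min(a.width, b.width) < 0.4 * tempflag
            and abs(a.height - b.height) < 1)


def should_merge(a, b, gap_before, gap_after, tempflag, templineheight):
    """Merge only when the merge rule holds and the anti-merge rule does not."""
    return (may_merge_two(a, b, gap_before, gap_after, tempflag, templineheight)
            and not looks_like_two_digits(a, b, tempflag, templineheight))


# Two halves of one tall Yi character versus two digit-sized fragments.
yi_parts = (Box(0, 14, 0, 40), Box(16, 38, 5, 40))
digits = (Box(0, 14, 4, 36), Box(16, 30, 4, 36))
merge_yi = should_merge(*yi_parts, gap_before=6, gap_after=6, tempflag=40, templineheight=40)
merge_digits = should_merge(*digits, gap_before=6, gap_after=6, tempflag=40, templineheight=40)
```

Here `merge_yi` is true and `merge_digits` is false: the digit pair passes the merge rule but is vetoed by the anti-merge rule.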
3. Feature extraction
Extracting single-character recognition features is one of the key steps of the invention, and it has both a general side and a specific side. It is general because all character recognition needs to extract recognition features of individual characters. It is specific because scripts differ greatly in form, layout and character count, so a method tailored to the characteristics of the script is needed. One characteristic of Yi recognition is the large number of similar characters. After repeated comparison, the feature extraction algorithm finally adopted is based on the peripheral direction contribution degree.
3.1 Principle of the peripheral direction contribution degree
3.1.1 Definition of the direction contribution degree
As shown in the figure, taking the point P under study as the center, the direction contribution degree describes the relative numbers of consecutive black pixels in 8 directions (i = 1, 2, ..., 8), where i denotes the direction and l_i denotes the number of consecutive black pixels in direction i. This gives the 8-dimensional feature vector
$$d_i = \frac{l_i}{\sqrt{\sum_{j=1}^{8} l_j^2}} \qquad (1)$$
[d_1, d_2, ..., d_8] is the direction contribution degree feature vector of point P, that is, the 8-dimensional PDC of P. The directions 1 to 8 are numbered in this way for convenient processing in the program: directions 1, 2, 3 and 4 can be handled uniformly, and so can directions 5, 6, 7 and 8.
If the mutually opposite directions 1-2, 3-4, 5-8 and 6-7 are each merged into one feature, then
$$d_1 = \frac{l_1 + l_2}{\sqrt{(l_1 + l_2)^2 + (l_3 + l_4)^2 + (l_6 + l_7)^2 + (l_5 + l_8)^2}} \qquad (2\text{-}1)$$
$$d_2 = \frac{l_3 + l_4}{\sqrt{(l_1 + l_2)^2 + (l_3 + l_4)^2 + (l_6 + l_7)^2 + (l_5 + l_8)^2}} \qquad (2\text{-}2)$$
$$d_3 = \frac{l_6 + l_7}{\sqrt{(l_1 + l_2)^2 + (l_3 + l_4)^2 + (l_6 + l_7)^2 + (l_5 + l_8)^2}} \qquad (2\text{-}3)$$
$$d_4 = \frac{l_5 + l_8}{\sqrt{(l_1 + l_2)^2 + (l_3 + l_4)^2 + (l_6 + l_7)^2 + (l_5 + l_8)^2}} \qquad (2\text{-}4)$$
is the 4-dimensional PDC of this point.
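A small Python sketch of formulas (1) and (2-1) to (2-4) follows. It is only an illustration: the function names are hypothetical, the direction numbering follows the array descriptions in section 3.2, and it assumes that the point P itself is counted in every run length, which the text does not state.

```python
import math

# (row step, column step) for directions 1..8:
# 1 W->E, 2 E->W, 3 N->S, 4 S->N, 5 NW->SE, 6 NE->SW, 7 SW->NE, 8 SE->NW
STEPS = [(0, 1), (0, -1), (1, 0), (-1, 0), (1, 1), (1, -1), (-1, 1), (-1, -1)]


def run_lengths(img, row, col):
    """Count consecutive black (1) pixels from (row, col) in each of the 8 directions.
    Assumption: the starting point is included in every count."""
    lengths = []
    for d_row, d_col in STEPS:
        r, c, count = row, col, 0
        while 0 <= r < len(img) and 0 <= c < len(img[0]) and img[r][c] == 1:
            count += 1
            r += d_row
            c += d_col
        lengths.append(count)
    return lengths


def normalize(values):
    """Divide by the Euclidean norm; an all-zero vector stays all zero."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm if norm else 0.0 for v in values]


def pdc8(img, row, col):
    """8-dimensional PDC of formula (1)."""
    return normalize(run_lengths(img, row, col))


def pdc4(img, row, col):
    """4-dimensional PDC of formulas (2-1) to (2-4): opposite directions merged."""
    l1, l2, l3, l4, l5, l6, l7, l8 = run_lengths(img, row, col)
    return normalize([l1 + l2, l3 + l4, l6 + l7, l5 + l8])


# A horizontal stroke: the horizontal component dominates at its middle pixel.
stroke = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
features = pdc4(stroke, 1, 2)
```

For the stroke above the merged run lengths are 6, 2, 2 and 2, so the first component of `features` is the largest.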
3.1.2 Peripheral direction contribution degree
The peripheral direction contribution degree is the direction contribution degree of a character's periphery. The periphery means the boundary of the character, examined in eight directions and at several layers of depth. Taking the horizontal direction 1 as an example, the figure shows the Yi character 0xA004 after scanning, segmentation and normalization, and illustrates the meaning of the four layers of depth.
Boundary points of the character are searched along 8 search directions, and the direction contribution degree is then computed with each boundary point as the center. To reduce the feature dimensionality of the character, the character is divided into 8 regions parallel to each search direction. Within each region, the 8-dimensional PDC of each layer's boundary points is first computed along the search direction, and the average vector is then taken as the PDC vector feature of that region for that search direction. The total feature dimensionality of the character is then:
8 search directions × 8 regions × 4 layers of depth × 8-dimensional PDC feature = 2048 dimensions (3)
If the 4-dimensional PDC of each layer's boundary points is used, the total feature dimensionality of the character is:
8 search directions × 8 regions × 4 layers of depth × 4-dimensional PDC feature = 1024 dimensions (4)
Here the feature dimensionality of each Yi character is 1024, that is, the 4-dimensional PDC is used as the boundary feature.
3.2 Implementation of the direction contribution degree feature extraction algorithm
A. Read the 64*64 dot matrix of one character in the standard sample file into the base dot matrix char dim[64][64].
B. Divide the 64*64 dot matrix dim[64][64] by eight directions and read it into the eight direction arrays, one per direction:
WtoE[64][64]: west to east, direction 1, read from dim[0][0] to dim[63][63];
EtoW[64][64]: east to west, direction 2, read from dim[0][63] to dim[63][0];
NtoS[64][64]: north to south, direction 3, read from dim[0][0] to dim[63][63];
StoN[64][64]: south to north, direction 4, read from dim[63][0] to dim[0][63];
WNtoES[4096]: northwest to southeast, direction 5, read from dim[63][0] to dim[0][63];
ENtoWS[4196]: northeast to southwest, direction 6, read from dim[0][0] to dim[63][63];
WStoEN[4196]: southwest to northeast, direction 7, read from dim[0][0] to dim[63][63];
EStoWN[4196]: southeast to northwest, direction 8, read from dim[63][0] to dim[0][63];
C. Determine the correspondence between the eight direction arrays and the base dot matrix dim[64][64]. To speed up conversion, these correspondences are precomputed and stored in the feature conversion file PdcTransArray.kl. At run time the data in this file are read into PdcTrans[4096][16], and conversion is done by table lookup.
D. Search for boundary points in the direction arrays, one straight line of one direction at a time. In the first four direction arrays the lines have a fixed length; in the last four the line length changes by +1 or -1 from one line to the next. On a line, the first point of the first run of consecutive 1s is the first-layer boundary point; after a break, the first point of the next run of consecutive 1s is the second-layer boundary point. The third-layer and fourth-layer boundary points are obtained in the same way, unless the end of the line is reached first.
E. Convert the coordinates of each layer's boundary points found in each direction to dim[i][j], and then to the coordinates of the eight direction arrays. The numbers of consecutive 1s following those coordinates in the eight direction arrays can then be used to compute the 8-dimensional PDC and the 4-dimensional PDC of the character. Here the 4-dimensional PDC feature is computed, giving 1024 features in total. The feature data are recorded in PdcArray[8][8][4][4] and then written to a Pdc file; the invention obtains 864 Pdc files in all.
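Step D, the layered boundary search along one scan line, can be sketched in a few lines of Python. The function name `layer_boundaries` is hypothetical, and the sketch works on a plain list of 0/1 values rather than on the patent's direction arrays.

```python
def layer_boundaries(line, max_layers=4):
    """Return the indices where a run of 1s begins, scanning the line in order.

    The first index is the first-layer boundary point, the second index the
    second-layer boundary point, and so on, up to max_layers or the end of the line.
    """
    boundaries = []
    previous = 0
    for index, value in enumerate(line):
        if value == 1 and previous == 0:       # a white-to-black transition starts a new layer
            boundaries.append(index)
            if len(boundaries) == max_layers:
                break
        previous = value
    return boundaries


scan_line = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1]
west_to_east = layer_boundaries(scan_line)            # [2, 5, 8, 12]
east_to_west = layer_boundaries(scan_line[::-1])      # indices count from the east edge
```

Scanning the same line in the opposite direction only requires reversing it, which mirrors how directions 1 and 2 (or 3 and 4) share one processing routine.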
4. Feature compression
The role of feature compression is to remove redundant information from the high-dimensional feature vector and to extract the effective features with the most discriminating power. The invention uses the KL transform to compress the 1024-dimensional PDC features to 128 dimensions.
For each character, N normalized single-character standard images serve as the training sample set, and the 1024-dimensional PDC feature x is extracted from each image. The total scatter matrix of this sample set is taken as the generating matrix, that is,
$$\Sigma = E\{(x - \mu)(x - \mu)^T\}$$
or
$$\Sigma = \frac{1}{M} \sum_{i=0}^{M-1} (x_i - \mu)(x_i - \mu)^T \qquad (5)$$
In the formula, x_i is the feature vector (1024 dimensions) of the i-th training sample, μ is the mean feature vector of the training sample set (the mean over all 1165 × N characters in each of the 1024 dimensions, itself 1024-dimensional), and M is the total number of training samples. The covariance matrix of each character is obtained, and then the mean of the covariance matrices. Note that the mean feature vector of the training sample set is the mean of all sample character features; the covariance matrix is that of all sample character features with respect to the mean feature vector of the training sample set; and the mean covariance matrix is likewise the mean of the covariance matrices of all sample character features with respect to that mean feature vector.
Next, the eigenvalues λ_i and the orthonormal eigenvectors u_i of the 1024 × 1024 mean covariance matrix Σ are computed. The invention uses the Householder transformation to reduce the real symmetric matrix Σ of order 1024 to a symmetric tridiagonal matrix, and then uses a modified QR method to compute all eigenvalues λ_i and the corresponding eigenvectors u_i of the real symmetric tridiagonal matrix.
The eigenvalues are sorted in descending order, λ_0 ≥ λ_1 ≥ … ≥ λ_{M-1}, with corresponding eigenvectors u_i. The 1024-dimensional features of each character can then be projected into the subspace spanned by u_0, u_1, ..., u_{M-1}. Each character therefore corresponds to a point in the subspace, and likewise any point in the subspace corresponds to a character. Given such an M-dimensional reduced subspace, the 1024-dimensional vector of any character can be projected onto it to obtain a set of coordinate coefficients. These coefficients give the position of the character in the subspace and can therefore serve as the basis for character recognition.
With the set of eigenvectors, any new character sample f to be recognized (1024 dimensions) can be projected into the M-dimensional feature space (with the eigenvectors as basis vectors), expressed as:
$$(y_0, y_1, \ldots, y_{M-1}) = (u_0, u_1, \ldots, u_{M-1})^T (f - \mu) \qquad (6)$$
This gives M eigenvectors in all. Although M is much smaller than N (1024), M is generally still too large. In practice, depending on the application, not every u_i is worth keeping.
Since the KL transform is used as the means of compressing the characters, the first k eigenvectors with the largest eigenvalues can be chosen so that
$$\frac{\sum_{i=0}^{k} \lambda_i}{\sum_{i=0}^{M-1} \lambda_i} \ge \alpha \qquad (7)$$
In the formula above, α = 99% can be chosen. This means that the energy of the sample set on the first k axes accounts for more than 99% of the total energy. This is how the dimensionality is reduced to 128.
In application, a 128 × 1024 KL transformation matrix (u_0, u_1, …, u_{M-1})^T is first generated and multiplied by the 1024-dimensional feature vectors extracted from the Pdc files to obtain 128-dimensional compressed features, which are then written to Pkl files; the invention obtains 864 Pkl files in all. After a Pkl file is read into the computer system, it is stored in the Pkl data matrix PklData[CHARMAXNUM][128].
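The selection criterion (7) and the projection (6) can be illustrated without a linear-algebra library. The Python sketch below assumes the eigenvalues and orthonormal eigenvectors have already been computed by some eigen-solver; `choose_component_count` and `project` are hypothetical names, and the toy numbers are not from the patent.

```python
def choose_component_count(eigenvalues, alpha=0.99):
    """Smallest number of leading eigenvalues whose share of the total reaches alpha."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= alpha:
            return count
    return len(ordered)


def project(feature, mean, basis):
    """Formula (6): coordinates of (feature - mean) on each basis vector (rows of basis)."""
    centered = [f - m for f, m in zip(feature, mean)]
    return [sum(u_j * c_j for u_j, c_j in zip(u, centered)) for u in basis]


# Toy spectrum: the first four eigenvalues carry 99.5% of the energy.
kept = choose_component_count([50.0, 30.0, 15.0, 4.5, 0.5])   # 4

# Toy projection of a 2-dimensional feature onto a single unit-length axis.
coords = project([3.0, 4.0], [1.0, 1.0], [[0.6, 0.8]])        # about [3.6]
```

In the patent's setting the basis would be the 128 × 1024 KL transformation matrix and the feature a 1024-dimensional PDC vector.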
5. Generation of the feature dictionary
The feature dictionary of the invention comprises two files: the dictionary file YIDICT.dic and the weighting coefficient file YIWEIGHT.dic. The dictionary file holds the mean of the 128-dimensional features of the 1165 characters over the 864 Pkl files. In essence YIDICT.dic is therefore also a Pkl file: it contains 1165 characters, and the 128-dimensional feature vector of each character is the mean of that character's 128-dimensional feature vectors over the 864 samples. The weighting coefficients of each character are also 128-dimensional: each coefficient is the reciprocal of the variance, over the 864 samples, of the corresponding dimension of the character's 128-dimensional feature vector. After being read into the computer system, the dictionary file is stored in the dictionary matrix DicArray[CHARMAXNUM][128] and the weighting coefficient file is stored in the weighting coefficient matrix WeightArray[CHARMAXNUM][128].
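A minimal Python sketch of the dictionary entry for one character is given below. The function name `build_entry`, the use of the population variance and the small variance floor that avoids division by zero are all assumptions of this example; the text only states that the entry is the mean and that the weights are reciprocal variances.

```python
def build_entry(samples, variance_floor=1e-6):
    """Return (mean vector, weight vector) for one character.

    samples: list of compressed feature vectors of the same character,
             one per sample set (864 vectors of 128 values in the patent).
    The weight of each dimension is the reciprocal of its variance; the floor
    is an assumption added here so a constant dimension does not divide by zero.
    """
    count = len(samples)
    dims = len(samples[0])
    mean = [sum(s[d] for s in samples) / count for d in range(dims)]
    variance = [sum((s[d] - mean[d]) ** 2 for s in samples) / count for d in range(dims)]
    weights = [1.0 / max(v, variance_floor) for v in variance]
    return mean, weights


# Two toy 2-dimensional samples of one character.
entry_mean, entry_weights = build_entry([[1.0, 10.0], [3.0, 10.0]])
```

Dimensions that vary little across fonts and scan conditions receive large weights, so they count more in the weighted distance used for matching.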
6. Three-level distance matching algorithm
Sample recognition first computes three levels of weighted distance between the 128-dimensional compressed features of each character to be recognized, extracted from the Pkl sample feature file, and the 128-dimensional compressed features of the standard characters extracted from the dictionary file. The weighted distance computation for each character to be recognized is divided into three levels, whose parameters are defined in Table 1:
Table 1. Parameters of the three-level distance matching algorithm
Level    Candidate characters kept    Features compared
First    FNum = 128                   FLevel = 24
Second   SNum = 24                    SLevel = 48
Third    TNum = 10                    TLevel = 128
The first level computes the weighted distance over the first 24 dimensions of the compressed features for all characters in the dictionary and keeps the 128 characters with the smallest distances for the second level. The second level computes the distance over the first 48 dimensions for the 128 characters from the first level and keeps the 24 characters with the smallest distances for the third level. The third level computes the distance over all 128 dimensions for the 24 characters from the second level and keeps the 10 characters with the smallest distances as the candidate result set of the character. The unified weighted distance formula is:
$$d = \frac{\sum_{i=1}^{\mathrm{Level}} \mathrm{WeightArray}[\mathrm{objUnicode}][i] \times \left(\mathrm{PklData}[\mathrm{OCRNumber}][i] - \mathrm{DicArray}[\mathrm{objUnicode}][i]\right)^2}{\sum_{i=1}^{\mathrm{Level}} \mathrm{WeightArray}[\mathrm{objUnicode}][i]} \qquad (8)$$
Here Level is the number of dimensions over which the distance is computed at the current level, with values 24, 48 and 128; WeightArray[objUnicode][i] is the weighting coefficient of the i-th feature dimension; PklData[OCRNumber][i] is the compressed feature of the character to be recognized; and DicArray[objUnicode][i] is the compressed feature of the standard character in the dictionary.
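The cascade can be sketched in Python as follows. This is an illustration under stated assumptions: the dictionary and weights are plain Python dicts keyed by character code, the distance is taken as the weighted mean of squared differences in formula (8), and an ordinary full sort stands in for the improved quick sort that the text describes next. The names `weighted_distance` and `three_level_match` are hypothetical.

```python
# (features compared, candidates kept) for the three levels of Table 1
LEVELS = [(24, 128), (48, 24), (128, 10)]


def weighted_distance(feature, mean, weight, level):
    """Weighted mean of squared differences over the first `level` dimensions."""
    numerator = sum(weight[i] * (feature[i] - mean[i]) ** 2 for i in range(level))
    denominator = sum(weight[i] for i in range(level))
    return numerator / denominator


def three_level_match(feature, dictionary, weights, levels=LEVELS):
    """Return candidate character codes, best first, after coarse-to-fine filtering."""
    candidates = list(dictionary)
    for level, keep in levels:
        candidates.sort(
            key=lambda code: weighted_distance(feature, dictionary[code], weights[code], level)
        )
        candidates = candidates[:keep]          # only the closest survive to the next level
    return candidates


# Toy example with 4-dimensional features and shrunken level sizes.
toy_dictionary = {
    0xA000: [0.0, 0.0, 0.0, 0.0],
    0xA001: [0.0, 0.0, 5.0, 5.0],
    0xA002: [0.0, 0.0, 0.0, 5.0],
    0xA003: [9.0, 9.0, 9.0, 9.0],
}
toy_weights = {code: [1.0] * 4 for code in toy_dictionary}
best = three_level_match([0.0, 0.0, 0.0, 4.5], toy_dictionary, toy_weights,
                         levels=[(2, 3), (3, 2), (4, 1)])
```

In the toy run, 0xA003 is dropped at the first level, 0xA001 at the second, and the full-dimension comparison picks 0xA002, so `best` is `[0xA002]`.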
After the weighted distances of each level are computed, the characters with the smallest weighted distances must be selected, either to enter the next level or to enter the candidate result set. This requires sorting the weighted distances, and the present invention uses an improved quicksort. Ordinary quicksort works as follows: first choose one record as the pivot and rearrange the remaining records around its key, so that all records with larger keys are placed after it and all records with smaller keys are placed before it; this completes one pass. The next pass is then applied to each of the two subsequences produced by the partition, until all records are sorted by key in ascending order.
In the present invention, only the Num characters with the smallest distances need to be selected at each level, where Num is the number of candidate characters; in the three levels of distance matching its value is 128, 24, and 10 respectively (FNum, SNum, and TNum). Quicksort can therefore be improved as follows: when the position index of the pivot is greater than Num, only the subsequence before the pivot is passed to the next round; when the position index of the pivot is less than or equal to Num, the subsequence before the pivot and the first "Num - 1 - pivot position index" records after the pivot are each passed to the next round. The final result is the sorted list of the Num smallest weighted distances. Fig. 8 is a schematic diagram of the recursive calls of this improved quicksort, assuming that the first-round pivot Q0 lies at a position index >= Num and that the second-round pivot Q1 lies at a position index < Num. In the third round, quicksort is then applied separately to the subsequence before Q1 and to the records after Q1 that fall within the first Num positions, until the first Num records are all sorted in ascending order.
In practice, because the total number of records is much larger than Num, many rounds may be needed before the position index of the pivot falls below Num. However, because a large number of records that do not need sorting are discarded in every round, the method is considerably faster than ordinary quicksort, whose average-case time complexity is O(n log2 n).
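A minimal Python sketch of such a partial quicksort is given below. It is an illustration under assumptions, not the patent's code: the name `partial_quicksort`, the Lomuto partition with the last record as pivot, and the explicit stack used instead of recursion are all choices made here.

```python
def partial_quicksort(records, num):
    """Reorder `records` in place so that records[:num] holds the `num`
    smallest values in ascending order; the rest is left unsorted."""
    stack = [(0, len(records) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi or lo >= num:
            continue  # nothing to sort, or range lies outside the first num slots
        # Lomuto partition around the last record of the range.
        pivot = records[hi]
        p = lo
        for j in range(lo, hi):
            if records[j] < pivot:
                records[p], records[j] = records[j], records[p]
                p += 1
        records[p], records[hi] = records[hi], records[p]
        # The part before the pivot always overlaps the first num slots.
        stack.append((lo, p - 1))
        # The part after the pivot matters only if the pivot landed before slot num - 1.
        if p < num - 1:
            stack.append((p + 1, hi))
```

For example, calling `partial_quicksort(distances, 10)` on a list of 1165 weighted distances leaves the ten smallest, in order, at the front of the list.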
7. Recognition result statistics
When a feature dictionary is built, collecting result statistics from recognizing a specified set of samples against the dictionary produced by a given algorithm allows a quantitative analysis of how well the dictionary features represent the characters and of the recognition precision for specific fonts. This provides a basis for later improvement of the dictionary-building algorithm and for correction of recognition results. In the present invention, both the dictionary samples used to generate the dictionary and the test samples used to test it consist of the 1165 Yi characters from Unicode code 0xA000 to 0xA48C, so the Unicode code of every character is known; if a recognized Unicode code differs from the known code, it is counted as one error. The present invention records the first-choice recognition rate and the top-ten recognition rate. By object, the statistics are divided into file recognition statistics, character recognition statistics, and character distance distribution statistics. File recognition statistics record, for one or more sample files under a specified path, the first-choice and top-ten recognition rates of each sample file and the time taken to recognize the whole file, reflecting the overall recognition accuracy and speed per file. Character recognition statistics record, for each Yi character, the total number of first-choice recognition errors and of top-ten recognition errors over one or more sample files, in order to pick out the characters that are easily misrecognized and to provide guidance and data support for the later error early warning and post-processing correction. Character distance distribution statistics record the distribution of distances between each character's features in each sample used to generate the dictionary and the standard character features in the dictionary. They are used to analyze how the glyph of each character in the multi-font, multi-size Yi character set varies: a character whose glyph varies greatly has a wider distance distribution and a higher probability of misrecognition, which provides a basis for the error early-warning decision.
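The first-choice and top-ten rates and the per-character error counts can be computed as in the following Python sketch. The input layout (a list of pairs of known code and ranked candidate codes) and the function name are assumptions made for illustration.

```python
from collections import Counter


def recognition_stats(results):
    """results: list of (known_code, candidates), candidates ranked best first.

    Returns (first_choice_rate, top_ten_rate, first_errors, top_ten_errors),
    with rates in percent and errors counted per known character code.
    """
    if not results:
        raise ValueError("no recognition results to summarize")
    first_errors = Counter()
    top_ten_errors = Counter()
    for known_code, candidates in results:
        if not candidates or candidates[0] != known_code:
            first_errors[known_code] += 1       # first choice is wrong
        if known_code not in candidates[:10]:
            top_ten_errors[known_code] += 1     # not among the ten candidates
    total = len(results)
    first_rate = 100.0 * (total - sum(first_errors.values())) / total
    top_ten_rate = 100.0 * (total - sum(top_ten_errors.values())) / total
    return first_rate, top_ten_rate, first_errors, top_ten_errors
```

Sorting `first_errors` by count gives the list of characters most prone to misrecognition that the character recognition statistics are meant to expose.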
8. Recognition error early-warning mechanism
Although the recognition accuracy on the samples is 99.5%, some characters are always misrecognized. In applications, proofreading becomes more efficient if the user can be told which characters may be wrong. The present invention sets a confidence distance for each character and compares the current recognition distance against it; a character whose distance exceeds the confidence distance is considered a possible error. The confidence distance depends on two factors. The first is internal: different characters should have different confidence distances. This requires plotting, for the 864 samples of each character, the distribution of their distances to the standard character in the dictionary and setting the distribution confidence distance of the character according to that distribution; at present this distribution confidence distance is recorded in the high and low bytes of the weighting dictionary YIWEIGHT.dic. The second is external, namely the degree of blur of the scanned picture, which the present invention represents by the mean first-choice distance of all recognized characters in the picture. The steps for error early warning based on the degree of blur alone are as follows:
Step 1: Take the first distance and the second distance of the next character.
Step 2: Compare the first distance with 80% of the mean distance; if it is less than 80% of the mean distance, go to step 5, otherwise go to step 3.
Step 3: Compute the difference between the second distance and the first distance; if it is greater than 10% of the mean distance, go to step 5, otherwise go to step 4.
Step 4: Issue an error warning for this character.
Step 5: Check whether the whole picture has been processed; if not, return to step 1; if so, exit.
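The five steps above can be sketched in Python as follows. The function name and the input layout, a list of (first distance, second distance) pairs for one scanned picture, are assumptions for illustration; the 80% and 10% thresholds are those given in the steps.

```python
def blur_based_warnings(distances):
    """distances: list of (first_distance, second_distance), one pair per
    recognized character of a picture. Returns indices of flagged characters."""
    # Degree of blur: mean first-choice distance over the whole picture.
    mean_first = sum(first for first, _ in distances) / len(distances)
    flagged = []
    for index, (first, second) in enumerate(distances):
        if first < 0.8 * mean_first:
            continue  # step 2: first choice is close enough, no warning
        if second - first > 0.1 * mean_first:
            continue  # step 3: first choice is clearly separated from the second
        flagged.append(index)  # step 4: possible recognition error
    return flagged
```

A character is flagged only when its best match is relatively far from the dictionary entry and the runner-up is almost as close.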
Description of drawings
Fig. 1 is a flow chart of the printed character recognition system for the multi-font, multi-size Yi character set.
Fig. 2 is a schematic diagram of font collection and of the generation and verification of the feature dictionary.
Fig. 3 is an overall block diagram of character segmentation.
Fig. 4 is a block diagram of single-character segmentation.
Fig. 5 is a schematic diagram of the direction contribution degree.
Fig. 6 is a schematic diagram of the peripheral direction contribution degree.
Fig. 7 is a schematic diagram of the eight direction arrays.
Fig. 8 is a schematic diagram of the recursive calls of the improved quicksort.
Embodiment
Embodiment 1: Multi-font, multi-size printed Yi character recognition system
The multi-font, multi-size printed Yi character recognition system provides functions such as setting the current Unicode code, converting between GBK and Unicode codes, selecting universal characters, setting the binarization threshold, batch picture segmentation, one-key building of the recognition dictionary, one-key verification of the recognition dictionary, batch sample recognition, and sample character distance distribution statistics, which greatly improve the efficiency of building and verifying recognition dictionaries in character recognition work. The dictionary-building method used by the platform is not limited to Yi characters and can be quickly extended to the recognition of other scripts such as Chinese, Japanese, and Korean. The system divides the 12 font sets into two groups, common fonts and uncommon fonts. The uncommon fonts are the Founder Yi variety typeface and the Founder Yi handwriting typeface, whose glyphs vary greatly; the remaining ten font sets are common fonts. For the common fonts, the total first-choice recognition rate is 99.22% and the total top-ten recognition rate is 99.95%. For the uncommon fonts, the total first-choice recognition rate is 99.71% and the total top-ten recognition rate is 99.98%.
Table 2: First-choice and top-ten recognition rates for the 12 fonts
Figure A20081004781300171
Embodiment 2: Multi-font printed (Yi and Chinese) document recognition system
The multi-font printed (Yi and Chinese) document recognition system can recognize printed book documents containing the standard Yi characters that are now an international standard, the simplified Chinese characters of the level-2 character library, and universal characters such as common English letters and digits. Its main features are as follows:
High recognition rate: the system can recognize Yi characters and Chinese characters in multiple fonts and sizes with a recognition rate above 95%. The recognition rate for universal characters such as English letters, digits, and punctuation marks is also high, and the system can correctly distinguish extremely similar characters such as the comma and the single closing quotation mark, or the full stop and the angle mark.
Fast recognition: more than 400 Yi characters or more than 150 Chinese characters are recognized per second, which is sufficient for general office needs.
Comprehensive editing functions: for each recognized character, its position in the picture and its ten candidate characters are tracked and displayed at the same time, and the recognized text can be corrected by clicking the appropriate candidate character; missing characters can be added with a Yi input method; basic editing operations such as delete, cut, copy, paste, and undo can be applied to the recognized text; the font of the recognized text can be set in the font selection dialog box, and various effects can be added; a specified region of the recognized picture can be copied into the recognized text; and new objects such as formulas, tables, and pictures can be inserted at a specified position in the text. After the text has been modified, the system automatically asks whether to save it. If it is saved, it is stored as the WordPad document "picture name.rtf"; the export-to-Word function can also save it as the Word document "picture name.doc" and open that document automatically for the user to edit.
Embodiment 3: Recognition result statistical reports
The file recognition report is shown in Table 3:
Table 3: File recognition report
PKL file path: E: OCR all Pnt
Dictionary file path: E: OCR dictionary library DICT080204 YI
743 PKL files took part in recognition. Total first-choice recognition rate: 98.558795%. Total top-ten recognition rate: 99.873266%.
No.  PKL file                        First-choice rate  Top-ten rate
1    E: OCR all Pnt BLA30030.pkl     99.828326%         100.000000%
2    E: OCR all Pnt BLA30031.pkl     99.914163%         100.000000%
3    E: OCR all Pnt BLA30032.pkl     99.828326%         100.000000%
4    E: OCR all Pnt BLA30040.pkl     99.914163%         100.000000%
5    E: OCR all Pnt BLA30041.pkl     99.828326%         100.000000%
6    E: OCR all Pnt BLA30042.pkl     100.000000%        100.000000%
7    E: OCR all Pnt BLA30130.pkl     99.828326%         100.000000%
8    E: OCR all Pnt BLA30131.pkl     99.742489%         100.000000%
9    E: OCR all Pnt BLA30132.pkl     99.828326%         100.000000%
10   E: OCR all Pnt BLA30140.pkl     100.000000%        100.000000%
11   E: OCR all Pnt BLA30141.pkl     100.000000%        100.000000%
12   E: OCR all Pnt BLA30142.pkl     99.656652%         100.000000%
13   E: OCR all Pnt BLA30230.pkl     99.828326%         100.000000%
14   E: OCR all Pnt BLA30231.pkl     99.742489%         100.000000%
15   E: OCR all Pnt/BLA30232.pkl     99.742489%         100.000000%
16   E: OCR all Pnt BLA30240.pkl     100.000000%        100.000000%
17   E: OCR all Pnt BLA30241.pkl     99.914163%         100.000000%
18   E: OCR all Pnt BLA30242.pkl     99.828326%         100.000000%
19   E: OCR all Pnt BLA31030.pkl     98.884120%         99.914163%
20   E: OCR all Pnt BLA31031.pkl     98.969957%         99.828326%
21   E: OCR all Pnt BLA31032.pkl     98.626609%         99.828326%
22   E: OCR all Pnt BLA31040.pkl     98.712446%         99.914163%
23   E: OCR all Pnt BLA31041.pkl     99.055794%         99.914163%
24   E: OCR all Pnt BLA31042.pkl     98.369099%         99.828326%
25   E: OCR all Pnt BLA31130.pkl     98.540773%         100.000000%
Files marked with x have a recognition error rate above 5%. Average time per file: 0.625841 seconds.
The character recognition report is shown in Table 4:
Table 4: Character recognition report
PKL file path: E: OCR all Pnt
Dictionary file path: E: OCR dictionary library DICT080204/YI
743 PKL files took part in recognition. Total first-choice recognition rate: 98.558795%. Total top-ten recognition rate: 99.873266%. Average time per file: 0.625841 seconds.
Figure A20081004781300191
The character distance distribution report is shown in Table 5:
Table 5: Character distance distribution report
Figure A20081004781300201

Claims (4)

1. A method for recognizing multi-font, multi-size printed characters based on the Yi character set, characterized in that: (1) Yi font collection: a computer system first scans a large number of samples of printed Yi characters, performs character segmentation in batch mode, and builds a training sample character library; (2) generation of the feature dictionary: high-dimensional features based on the peripheral direction contribution degree are extracted from the collected training sample character library; a feature compression transformation matrix is derived from the high-dimensional features of all training sample characters; the high-dimensional features of all training sample characters are compressed into low-dimensional features with this matrix; and the feature dictionary is generated from the set of low-dimensional features of all training sample characters; (3) verification of the feature dictionary: high-dimensional features based on the peripheral direction contribution degree are extracted from all training sample characters; the high-dimensional features of all training sample characters are compressed into low-dimensional features with the feature compression transformation matrix; the low-dimensional features of all training sample characters are each matched against the feature dictionary by three-level matching to complete character classification; and the feature dictionary is verified by the file recognition statistical report and the character recognition statistical report, which also provide a basis for correcting the dictionary; (4) Yi document recognition: the object to be processed is a single page of a printed Yi book or magazine; the page is first scanned into a picture file in the computer; the Yi characters and conventional characters such as punctuation marks, English letters, and digits in the picture file are processed by initial segmentation, merging, and re-segmentation; high-dimensional features based on the peripheral direction contribution degree are extracted from each single character obtained by re-segmentation; the high-dimensional features are then compressed into low-dimensional features with the feature compression transformation matrix; character classification is completed by three-level matching of the low-dimensional features against the dictionary; the result is restored to computer text by document post-processing; and characters in the text that may have been misrecognized are flagged.
2. The method for recognizing multi-font, multi-size printed characters based on the Yi character set according to claim 1, wherein the specific steps are:
(1) Sample preparation: the Yi font samples used to generate the dictionary comprise 12 fonts: Hei, Song, Microsoft thin, SIL, Founder Yi voiceover, Founder Yi thin Hei, Founder Yi Song, Founder Yi imitation Song, Founder Yi Hei, Founder Yi handwriting, Founder Yi round-head, and Founder Yi variety; for each of these 12 fonts, two font sizes (size 3 and size 5), two styles (regular and bold), three processing modes (original, first-generation copy, second-generation copy), and two scanning modes (300 DPI and 400 DPI) are used, giving 288 sets of scanned sample pictures; in sample preprocessing, binarization is then sampled at three density thresholds (130, 150, and 170), giving a total of 864 sets of Yi dictionary samples that differ in size, style, stroke weight, darkness, and degree of blur, and these are used to generate the Yi dictionary;
(2) Character segmentation: the paper Yi text is scanned into a grayscale picture in the computer; by setting a suitable binarization threshold, the grayscale picture file is converted into a binary picture file containing only the two element values 0 and 1; image processing is applied to the binary picture file to isolate the text blocks in it; each text block is cut into text lines by row projection, that is, the starting and ending vertical-axis positions of each line are determined from the projection distribution of the picture file along the vertical axis, and the segmentation information of each line is obtained and recorded before character segmentation begins; each segmented text line first undergoes initial character division by projection, that is, the starting and ending horizontal-axis positions of each character are determined from the projection distribution of the line along the horizontal axis, which yields the initially divided characters; initial division frequently splits the components of a single character into separate characters, so, using the merging rules and anti-merging rules summarized from the characteristics of Yi characters, components wrongly split into several characters are merged into one character to obtain merged characters, and characters that were wrongly merged are split apart again;
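As an illustration of the projection-based line cutting and initial character division in step (2), a minimal Python sketch follows. The function names `runs` and `segment` and the list-of-rows image layout are assumptions; the merging and anti-merging rules are not modeled.

```python
def runs(profile):
    """Return (start, end) index pairs of consecutive non-zero profile entries."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(profile) - 1))
    return spans


def segment(image):
    """image: list of rows of 0/1 pixels (1 = black).

    Returns [((top, bottom), [(left, right), ...]), ...]: each text line with
    the horizontal spans of its initially divided characters.
    """
    lines = []
    row_profile = [sum(row) for row in image]  # projection onto the vertical axis
    for top, bottom in runs(row_profile):
        # Projection of this line onto the horizontal axis.
        col_profile = [
            sum(image[y][x] for y in range(top, bottom + 1))
            for x in range(len(image[0]))
        ]
        lines.append(((top, bottom), runs(col_profile)))
    return lines
```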
(3) Feature extraction:
(3.1.1) Definition of the direction contribution degree
Taking the point P under study as the center, the direction contribution degree describes the relative relationship among the numbers of consecutive black pixels in 8 directions i = 1, 2, ..., 8, where i denotes the direction and l_i denotes the number of consecutive black pixels in direction i. This gives an 8-dimensional feature vector,
$$d_i = \frac{l_i}{\sqrt{\sum_{j=1}^{8} l_j^{2}}} \qquad (1)$$
[d_1, d_2, ..., d_8] is the 8-dimensional direction contribution degree feature vector of point P. If the mutually opposite directions 1-2, 3-4, 5-8, and 6-7 are each merged into one feature, then
$$d_1 = \frac{l_1 + l_2}{\sqrt{(l_1+l_2)^{2} + (l_3+l_4)^{2} + (l_6+l_7)^{2} + (l_5+l_8)^{2}}} \qquad (2\text{-}1)$$
$$d_2 = \frac{l_3 + l_4}{\sqrt{(l_1+l_2)^{2} + (l_3+l_4)^{2} + (l_6+l_7)^{2} + (l_5+l_8)^{2}}} \qquad (2\text{-}2)$$
$$d_3 = \frac{l_6 + l_7}{\sqrt{(l_1+l_2)^{2} + (l_3+l_4)^{2} + (l_6+l_7)^{2} + (l_5+l_8)^{2}}} \qquad (2\text{-}3)$$
$$d_4 = \frac{l_5 + l_8}{\sqrt{(l_1+l_2)^{2} + (l_3+l_4)^{2} + (l_6+l_7)^{2} + (l_5+l_8)^{2}}} \qquad (2\text{-}4)$$
is the 4-dimensional direction contribution degree of this point;
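The definition in (3.1.1) can be illustrated with the short Python sketch below. It rests on assumptions that the text does not settle: the denominators are taken to be Euclidean norms, the run of black pixels is taken to include point P itself, and the direction order 1 to 8 follows the eight direction arrays (west to east, east to west, north to south, south to north, northwest to southeast, northeast to southwest, southwest to northeast, southeast to northwest). All names are illustrative.

```python
import math

# (dy, dx) step for directions 1..8, in the order of the eight direction arrays.
STEPS = [(0, 1), (0, -1), (1, 0), (-1, 0), (1, 1), (1, -1), (-1, 1), (-1, -1)]


def run_lengths(image, y, x):
    """Consecutive black (1) pixels from (y, x) in each direction, P included."""
    lengths = []
    for dy, dx in STEPS:
        count, cy, cx = 0, y, x
        while 0 <= cy < len(image) and 0 <= cx < len(image[0]) and image[cy][cx] == 1:
            count += 1
            cy += dy
            cx += dx
        lengths.append(count)
    return lengths


def normalize(values):
    """Divide by the Euclidean norm; an all-zero vector stays all zero."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm if norm else 0.0 for v in values]


def contribution8(lengths):
    """8-dimensional direction contribution degree, formula (1)."""
    return normalize(lengths)


def contribution4(lengths):
    """4-dimensional version, formulas (2-1) to (2-4): pairs 1-2, 3-4, 6-7, 5-8."""
    l = lengths
    return normalize([l[0] + l[1], l[2] + l[3], l[5] + l[6], l[4] + l[7]])
```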
(3.1.2) Peripheral direction contribution degree: take horizontal direction 1 as an example, applied to the Yi character 0xA004 obtained after scanning, segmentation, and normalization; the character is divided into 8 regions parallel to each of the 8 search directions; within each region, the 8-dimensional direction contribution degrees of the boundary points of each layer are first obtained along the search direction, and their vector average is taken as the peripheral direction contribution degree (PDC) feature vector of that region for that search direction; the total feature dimension of a character is then:
8 search directions × 8 segmented regions × 4 layers of depth × 8-dimensional PDC features = 2048 dimensions (3)
If the 4-dimensional PDC of the boundary points of each layer is used, the total feature dimension of a character is:
8 search directions × 8 segmented regions × 4 layers of depth × 4-dimensional PDC features = 1024 dimensions (4)
(3.2) Direction contribution degree feature extraction algorithm:
A. Read the 64*64 dot matrix of one character from the proofread standard sample file into the base dot matrix dim[64][64];
B. The 64*64 dot matrix dim[64][64] is split by direction and read into eight direction arrays:
WtoE[64][64]: west to east, direction 1, read from dim[0][0] to dim[63][63];
EtoW[64][64]: east to west, direction 2, read from dim[0][63] to dim[63][0];
NtoS[64][64]: north to south, direction 3, read from dim[0][0] to dim[63][63];
StoN[64][64]: south to north, direction 4, read from dim[63][0] to dim[0][63];
WNtoES[4096]: northwest to southeast, direction 5, read from dim[63][0] to dim[0][63];
ENtoWS[4196]: northeast to southwest, direction 6, read from dim[0][0] to dim[63][63];
WStoEN[4196]: southwest to northeast, direction 7, read from dim[0][0] to dim[63][63];
EStoWN[4196]: southeast to northwest, direction 8, read from dim[63][0] to dim[0][63];
C. Determine the correspondence between the eight direction arrays and the base dot matrix dim[64][64]; these conversion relations are precomputed and stored in the feature conversion file PdcTransArray.k1; at run time the data in this file are read into PdcTrans[4096][16], and conversion is done by table lookup;
D. Search for boundary points in the direction arrays, scanning one straight line in one direction at a time; the lines in the first four direction arrays have a fixed length, while the lines in the last four direction arrays change in length by +1 or -1 each time; on each line, the first point of the first run of consecutive 1s is the first-layer boundary point, the first point of the next run of consecutive 1s after a break is the second-layer boundary point, and the third-layer and fourth-layer boundary points are obtained in the same way, until the line ends;
E. Convert the coordinates of each layer's boundary points found in each direction to dim[i][j], and then to the coordinates of the eight direction arrays; the numbers of consecutive 1s following those coordinates in the eight direction arrays are used to compute the 4-dimensional PDC of the character; the feature data obtained by computing the 4-dimensional PDC features are recorded in PdcArray[8][8][4][4] and then written to the Pdc file;
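Step D, the layered boundary-point search along one scan line, can be sketched in Python as follows. The function name and the representation of a scan line as a list of 0/1 values are assumptions for illustration.

```python
def boundary_points(line, max_layers=4):
    """Indices where a run of consecutive 1s begins along one scan line.

    The first such index is the first-layer boundary point, the next one the
    second-layer boundary point, and so on, up to `max_layers` points.
    """
    points = []
    previous = 0
    for index, value in enumerate(line):
        if value == 1 and previous == 0:
            points.append(index)  # a new run of 1s starts here
            if len(points) == max_layers:
                break
        previous = value
    return points
```

A line that crosses fewer than four strokes simply yields fewer boundary points; how the missing layers are filled in PdcArray is not stated in the text.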
(4) Feature compression: the KL transform is used to compress the 1024-dimensional PDC features to 128 dimensions,
For each character, the N normalized single-character standard images form the training sample set; the 1024-dimensional PDC feature x is extracted from each single-character standard image, and the total scatter matrix of this sample set is used as the generating matrix, namely
$$\Sigma = E\{(x-\mu)(x-\mu)^{T}\} \quad \text{or} \quad \Sigma = \frac{1}{M}\sum_{i=0}^{M-1}(x_i-\mu)(x_i-\mu)^{T} \qquad (5)$$
In the formula, x_i is the 1024-dimensional feature vector of the i-th training sample, μ is the mean feature vector of the training sample set, and M is the total number of training samples; the covariance matrix of each character is obtained, and then the mean of the covariance matrices is obtained;
Next, the eigenvalues λ_i and the orthonormal eigenvectors u_i of the mean matrix Σ of the 1024 × 1024 covariance matrices are computed: the Householder transformation is used to reduce the 1024th-order real symmetric matrix Σ to a symmetric tridiagonal matrix, and a modified QR method is then used to compute all eigenvalues λ_i and the corresponding eigenvectors u_i of the real symmetric tridiagonal matrix;
The eigenvalues are sorted in descending order: λ_0 ≥ λ_1 ≥ ... ≥ λ_{M-1}, with corresponding eigenvectors u_i; the 1024-dimensional features of each character can be projected into the subspace spanned by u_0, u_1, ..., u_{M-1}, so each character corresponds to a point in the subspace, and likewise any point in the subspace corresponds to a character;
Given the set of eigenvectors, any new character sample f to be recognized can be projected into the M-dimensional feature space, expressed as:
$$(y_0, y_1, \ldots, y_{M-1}) = (u_0, u_1, \ldots, u_{M-1})^{T}(f-\mu) \qquad (6)$$
With the KL transform used as the compression method, the k eigenvectors with the largest eigenvalues can be chosen such that
$$\frac{\sum_{i=0}^{k}\lambda_i}{\sum_{i=0}^{M-1}\lambda_i} \ge \alpha \qquad (7)$$
In the above formula, α = 99% can be chosen, which means that the energy of the sample set on the first k axes accounts for more than 99% of the total energy; the dimension is thereby reduced to 128;
In application, the 128 × 1024 KL transformation matrix (u_0, u_1, ..., u_{M-1})^T is generated first; the 1024-dimensional feature vectors extracted from the Pdc file are multiplied by this KL transformation matrix to obtain the 128-dimensional compressed features, which are then written to the Pkl file; after the Pkl file is read into the computer system, it is stored in the Pkl data matrix PklData[CHARMAXNUM][128];
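Two parts of step (4) lend themselves to a short Python sketch: choosing how many eigenvectors to keep under the energy criterion of formula (7), and projecting a feature vector as in formula (6). The function names are illustrative, the eigenvalues and eigenvectors are assumed to be already available (the Householder and QR computation is not shown), and the basis is assumed to be given as a list of row vectors.

```python
def components_to_keep(eigenvalues, alpha=0.99):
    """Smallest number of leading eigenvalues whose sum reaches the fraction
    `alpha` of the total (formula (7))."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= alpha:
            return count
    return len(ordered)


def project(feature, mean, basis):
    """Formula (6): subtract the mean, then take the dot product with each
    eigenvector (each row of `basis`)."""
    centered = [f - m for f, m in zip(feature, mean)]
    return [sum(u_j * c_j for u_j, c_j in zip(u, centered)) for u in basis]
```

With 128 rows of length 1024 in `basis`, `project` turns one 1024-dimensional PDC feature into the 128-dimensional compressed feature that the text stores in the Pkl file.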
(5) Generation of the feature dictionary: the feature dictionary comprises two files, the dictionary file YIDICT.dic and the weighting coefficient file YIWEIGHT.dic; the dictionary file holds the mean 128-dimensional features of the 1165 characters over the 864 Pkl files, that is, the 128-dimensional feature vector of each character is the mean of that character's 128-dimensional feature vectors over the 864 samples; the weighting coefficient of each character is also 128-dimensional, each component being the reciprocal of the variance, over the 864 samples, of the corresponding dimension of that character's 128-dimensional feature vector; after being read into the computer system, the dictionary file is stored in the dictionary matrix DicArray[CHARMAXNUM][128] and the weighting coefficient file is stored in the weighting coefficient matrix WeightArray[CHARMAXNUM][128];
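For one character, the dictionary entry and its weighting coefficients in step (5) can be computed as in the Python sketch below. The function name, the use of the population variance, and the small floor `eps` that guards against a zero variance are assumptions not stated in the text.

```python
def build_dictionary_entry(samples, eps=1e-12):
    """samples: the compressed feature vectors of ONE character, one per sample.

    Returns (mean, weight): the per-dimension mean and the reciprocal of the
    per-dimension variance (population variance, floored at `eps`).
    """
    count = len(samples)
    dims = len(samples[0])
    mean = [sum(s[d] for s in samples) / count for d in range(dims)]
    variance = [
        sum((s[d] - mean[d]) ** 2 for s in samples) / count for d in range(dims)
    ]
    weight = [1.0 / max(v, eps) for v in variance]
    return mean, weight
```

Dimensions that are stable across fonts and scans get a small variance and therefore a large weight in the weighted distance.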
(6) Three-level distance matching algorithm: to recognize a sample, the three-level weighted distances are computed between the 128-dimensional compressed features of each character to be recognized, extracted from the Pkl sample feature compression file, and the 128-dimensional compressed features of the standard characters, extracted from the dictionary file; the weighted distance computation for each character to be recognized is divided into three levels, with the parameters defined in the following table:
Parameter     Meaning
FNum=128      Number of first-level candidate characters
FLevel=24     Number of first-level features
SNum=24       Number of second-level candidate characters
SLevel=48     Number of second-level features
TNum=10       Number of third-level candidate characters
TLevel=128    Number of third-level features
The first level computes the weighted distance between the first 24 dimensions of the compressed features of the character to be recognized and those of every character in the dictionary, and passes the 128 characters with the smallest distances to the second level; the second level computes the distance over the first 48 dimensions of the compressed features for those 128 characters and passes the 24 characters with the smallest distances to the third level; the third level computes the distance over all 128 dimensions of the compressed features for those 24 characters and places the 10 characters with the smallest distances in the candidate result set of the character; all three levels use the same weighted distance formula:
$$d = \frac{\sum_{i=1}^{\mathrm{Level}} \mathrm{WeightArray}[\mathrm{objUnicode}][i] \times \left(\mathrm{PklData}[\mathrm{OCRNumber}][i] - \mathrm{DicArray}[\mathrm{objUnicode}][i]\right)^{2}}{\sum_{i=1}^{\mathrm{Level}} \mathrm{WeightArray}[\mathrm{objUnicode}][i]} \qquad (8)$$
where Level is the number of dimensions over which the distance is computed at the current level (24, 48, or 128), WeightArray[objUnicode][i] holds the weighting coefficient of the i-th feature dimension, PklData[OCRNumber][i] holds the compressed feature of the character to be recognized, and DicArray[objUnicode][i] holds the compressed feature of the standard character in the dictionary;
After the weighted distances of each level are computed, the characters with the smallest weighted distances must be selected, either to enter the next level or to enter the candidate result set; this requires sorting the weighted distances, for which an improved quicksort is used: when the position index of the pivot used as the reference is greater than the number of candidate characters Num, only the subsequence before the pivot is passed to the next round of quicksort; when the position index of the pivot is less than or equal to Num, the subsequence before the pivot and the first "Num - 1 - pivot position index" records after the pivot are each passed to the next round of quicksort; the final result is the sorted list of the Num smallest weighted distances;
(7) Recognition result statistics: both the dictionary samples used to generate the dictionary and the test samples used to test it consist of the 1165 Yi characters from Unicode code 0xA000 to 0xA48C, so the Unicode code of every character is known; if a recognized Unicode code differs from the known code, it is counted as one error; the first-choice recognition rate and the top-ten recognition rate are recorded, and by object the statistics are divided into file recognition statistics, character recognition statistics, and character distance distribution statistics; file recognition statistics record, for one or more sample files under a specified path, the first-choice and top-ten recognition rates of each sample file and the time taken to recognize the whole file, reflecting the overall recognition accuracy and speed per file; character recognition statistics record, for each Yi character, the total number of first-choice recognition errors and of top-ten recognition errors over one or more sample files, in order to pick out the characters that are easily misrecognized and to provide guidance and data support for the later error early warning and post-processing correction; character distance distribution statistics record the distribution of distances between each character's features in each sample used to generate the dictionary and the standard character features in the dictionary, and are used to analyze how the glyph of each character in the multi-font, multi-size Yi character set varies; a character whose glyph varies greatly has a wider distance distribution and a higher probability of misrecognition, which provides a basis for the error early-warning decision;
(8) Recognition error early-warning mechanism: a confidence distance is set for each character and the current recognition distance is judged against it; a recognition whose distance exceeds the confidence distance is considered a possible error; the confidence distance depends on two factors, the character's distance distribution and the blur level of the scanned image; distance distributions differ from character to character, so for each character the distribution of distances between its 864 samples and the standard in the dictionary must be plotted, and its distribution confidence distance is set according to that distribution; at present this distribution confidence distance is recorded in the high and low bytes of the weighted dictionary YIWEIGHT.dic; the blur level of a scanned image is represented by the mean first-choice distance of all recognized characters in the image; the steps of error early warning according to blur level are as follows:
Step 1: take the first-choice distance and the second-choice distance of the next character;
Step 2: compare the first-choice distance with 80% of the mean distance; if it is less than 80% of the mean distance, go to step 5, otherwise go to step 3;
Step 3: compare the difference between the second-choice distance and the first-choice distance with 10% of the mean distance; if the difference is greater than 10% of the mean distance, go to step 5, otherwise go to step 4;
Step 4: issue an error warning for this character;
Step 5: check whether the whole image has been processed; if not, return to step 1; if so, exit.
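Steps 1 to 5 above amount to a single pass over the recognized characters of one image. The sketch below is illustrative only: the function name `flag_suspect_characters` and the input layout (a list of first-choice and second-choice distance pairs) are assumptions, while the 80% and 10% thresholds are taken from the claim.

```python
def flag_suspect_characters(distances):
    """Return indices of characters that receive an error warning.

    distances: list of (first_distance, second_distance) pairs, one per
    recognized character of a scanned image (hypothetical layout).
    """
    if not distances:
        return []
    # Blur level of the image: mean first-choice distance of all characters.
    mean_first = sum(d1 for d1, _ in distances) / len(distances)
    warnings = []
    for index, (d1, d2) in enumerate(distances):  # steps 1 and 5: next character
        if d1 < 0.8 * mean_first:                 # step 2: close match, accept
            continue
        if d2 - d1 > 0.1 * mean_first:            # step 3: clear margin, accept
            continue
        warnings.append(index)                    # step 4: warn on this character
    return warnings
```

A character is flagged only when its best match is not clearly good relative to the image's average and the runner-up is almost as close, that is, when the recognizer had little margin to separate the two candidates.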
3. The multi-font, multi-size printed character recognition method based on the Yi character set according to claim 2, wherein the merging rules are:
The merging rule for two preliminarily segmented Yi characters is: 1) the width of the character formed by merging the two preliminary segments is not greater than the statistical character width tempflag of the line; 2) the gap between the two preliminary segments is smaller than the gaps between them and their adjacent characters; 3) the maximum height of the two preliminary segments is greater than 60% of the statistical character height templineheight;
The merging rule for merging two commas into a double quotation mark is: 1) the width of the character formed by merging the two preliminary segments is not greater than the statistical character width tempflag of the line; 2) the gap between the two preliminary segments is smaller than the gaps between them and their adjacent characters; 3) the vertical coordinates of both preliminary segments are less than the vertical coordinate rowbaseliney of the center line; 4) the heights of both preliminary segments are less than half of the statistical character height templineheight;
The merging rule for three preliminarily segmented Yi characters is: 1) the width of the character formed by merging the three preliminary segments is not greater than the statistical character width tempflag of the line; 2) the gaps among the three preliminary segments are smaller than the gaps between the first and third preliminary segments and their adjacent characters; 3) the maximum height of the three preliminary segments is greater than 60% of the statistical character height templineheight;
The merging rule for the ellipsis "…" is: 1) the width of the character formed by merging the three preliminary segments is not greater than the statistical character width tempflag of the line; 2) the heights of the three preliminary segments are all less than 30% of the statistical character height; 3) the three preliminary segments differ from one another by less than 10% of the statistical character height.
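Two of the merging rules in claim 3 can be sketched as simple predicates over bounding boxes. This is a hedged illustration rather than the patent's code: the `Box` type, the `gap` helper, and both function names are hypothetical; only `tempflag` and `templineheight` are names from the claim. The sketch assumes both neighboring characters exist, and it reads the ellipsis condition 3 as a height difference, which the claim does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box of a preliminary segment (hypothetical helper type)."""
    left: int
    top: int
    width: int
    height: int

    @property
    def right(self):
        return self.left + self.width


def gap(a, b):
    """Horizontal gap between box a and the box b to its right."""
    return b.left - a.right


def can_merge_two_yi(prev, a, b, nxt, tempflag, templineheight):
    """Rule for merging two preliminary segments a, b into one Yi character."""
    merged_width = b.right - a.left
    inner = gap(a, b)
    return (merged_width <= tempflag                            # condition 1
            and inner < gap(prev, a) and inner < gap(b, nxt)    # condition 2
            and max(a.height, b.height) > 0.6 * templineheight) # condition 3


def can_merge_ellipsis(a, b, c, tempflag, templineheight):
    """Rule for merging three small segments into an ellipsis."""
    heights = [a.height, b.height, c.height]
    return (c.right - a.left <= tempflag                        # condition 1
            and all(h < 0.3 * templineheight for h in heights)  # condition 2
            # condition 3, assumed here to refer to height differences
            and max(heights) - min(heights) < 0.1 * templineheight)
```

Segments at the start or end of a line have only one neighbor; a fuller version would need a convention for that case, which the claim text does not specify.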
4. The multi-font, multi-size printed character recognition method based on the Yi character set according to claim 2, wherein the anti-merging rules are:
The anti-merging rule under which two numerals must not be merged is: 1) the minimum height of the two preliminary segments is greater than 70% of the statistical character height, and the maximum height is less than 90% of the statistical character height; 2) the minimum width of the two preliminary segments is less than 40% of the statistical character width; 3) the height difference between the two preliminary segments is less than 1;
The anti-merging rule for a right parenthesis followed by a punctuation mark is: 1) the height of the first preliminary segment is greater than 70% of the statistical height; 2) the width of the first preliminary segment is less than 40% of the statistical width; 3) the height of the second preliminary segment is less than 40% of the statistical height; 4) the width of the second preliminary segment is less than 40% of the statistical width; 5) the vertical coordinate of the second preliminary segment is greater than the vertical coordinate rowbaseliney of the center line.
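The anti-merging rules of claim 4 can likewise be written as predicates that veto a merge. The sketch below is an assumption-laden illustration: the `Seg` type, the function names, and the parameters `stat_width` and `stat_height` are hypothetical stand-ins for the statistical character width and height; `rowbaseliney` is the name used in the claim. The vertical coordinate is assumed to grow downward, as in typical image coordinates, which the claim does not state.

```python
from typing import NamedTuple


class Seg(NamedTuple):
    """Preliminary segment (hypothetical helper type)."""
    top: int     # vertical coordinate, assumed to grow downward
    width: int
    height: int


def is_two_digits(a, b, stat_width, stat_height):
    """True when a and b look like two numerals and must not be merged."""
    lowest = min(a.height, b.height)
    highest = max(a.height, b.height)
    return (lowest > 0.7 * stat_height and highest < 0.9 * stat_height  # 1)
            and min(a.width, b.width) < 0.4 * stat_width                # 2)
            and abs(a.height - b.height) < 1)                           # 3)


def is_right_paren_then_punct(a, b, stat_width, stat_height, rowbaseliney):
    """True when a looks like ')' and b like a following punctuation mark."""
    return (a.height > 0.7 * stat_height      # 1) tall first segment
            and a.width < 0.4 * stat_width    # 2) narrow first segment
            and b.height < 0.4 * stat_height  # 3) short second segment
            and b.width < 0.4 * stat_width    # 4) narrow second segment
            and b.top > rowbaseliney)         # 5) second segment past center line
```

In a segmentation pipeline these checks would presumably run before the merging rules, so that a pair matching either predicate is left unmerged even if it also satisfies a merging rule.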
CN200810047813A 2008-05-23 2008-05-23 Multi-font multi- letter size print form charater recognition method based on 'Yi' character set Expired - Fee Related CN100589119C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810047813A CN100589119C (en) 2008-05-23 2008-05-23 Multi-font multi- letter size print form charater recognition method based on 'Yi' character set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810047813A CN100589119C (en) 2008-05-23 2008-05-23 Multi-font multi- letter size print form charater recognition method based on 'Yi' character set

Publications (2)

Publication Number Publication Date
CN101286202A true CN101286202A (en) 2008-10-15
CN100589119C CN100589119C (en) 2010-02-10

Family

ID=40058400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810047813A Expired - Fee Related CN100589119C (en) 2008-05-23 2008-05-23 Multi-font multi- letter size print form charater recognition method based on 'Yi' character set

Country Status (1)

Country Link
CN (1) CN100589119C (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310210A (en) * 2012-03-13 2013-09-18 日立电脑机器株式会社 Character recognition device, recognition dictionary generation device and normalization method
CN103400132A (en) * 2013-07-02 2013-11-20 Tcl集团股份有限公司 Method and device for character segmentation
CN103995816A (en) * 2013-02-14 2014-08-20 富士施乐株式会社 Information processing apparatus, information processing method
CN104376300A (en) * 2014-11-03 2015-02-25 电子科技大学 Identification method used for intelligent matching of incomplete Chinese characters on basis of grid characteristics
CN105027178A (en) * 2013-01-09 2015-11-04 柳仲夏 Apparatus and method for editing symbol images, and recording medium in which program for executing same is recorded
CN105488471A (en) * 2015-11-30 2016-04-13 北大方正集团有限公司 Character pattern recognition method and device
CN106446898A (en) * 2016-09-14 2017-02-22 宇龙计算机通信科技(深圳)有限公司 Extraction method and extraction device of character information in image
CN106611174A (en) * 2016-12-29 2017-05-03 成都数联铭品科技有限公司 OCR recognition method for unusual fonts
CN106778507A (en) * 2016-11-24 2017-05-31 北京小米移动软件有限公司 Text extraction method and device
CN107122113A (en) * 2017-03-31 2017-09-01 北京小米移动软件有限公司 Generate the method and device of picture
CN107194394A (en) * 2016-09-29 2017-09-22 北京神州泰岳信息安全技术有限公司 Remotely access monitoring method and relevant apparatus
CN108874752A (en) * 2018-06-12 2018-11-23 广东信浓信息技术有限公司 A kind of material object picture conversion text method and system
CN109190630A (en) * 2018-08-29 2019-01-11 摩佰尔(天津)大数据科技有限公司 Character identifying method
CN109284012A (en) * 2018-09-12 2019-01-29 西南大学 A kind of Gu Yi nationality's text language in-put control system and method, information data processing terminal
CN109344834A (en) * 2018-09-06 2019-02-15 昆明理工大学 A kind of incomplete Chinese characters recognition method based on image procossing
CN109478229A (en) * 2016-08-31 2019-03-15 富士通株式会社 Training device, character recognition device and the method for sorter network for character recognition
CN109753968A (en) * 2019-01-11 2019-05-14 北京字节跳动网络技术有限公司 Generation method, device, equipment and the medium of character recognition model
CN110163203A (en) * 2019-04-09 2019-08-23 浙江口碑网络技术有限公司 Character identifying method, device, storage medium and computer equipment
CN110178139A (en) * 2016-11-14 2019-08-27 柯达阿拉里斯股份有限公司 Use the system and method for the character recognition of the full convolutional neural networks with attention mechanism
CN111553336A (en) * 2020-04-27 2020-08-18 西安电子科技大学 Print Uyghur document image recognition system and method based on link segment
CN113627175A (en) * 2021-08-17 2021-11-09 北京计算机技术及应用研究所 Method for calculating Chinese word vector by utilizing orthogonal transformation

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310210A (en) * 2012-03-13 2013-09-18 日立电脑机器株式会社 Character recognition device, recognition dictionary generation device and normalization method
CN103310210B (en) * 2012-03-13 2016-06-29 株式会社日立信息通信工程 Character recognition device, recognition dictionary generate device and method for normalizing
CN105027178A (en) * 2013-01-09 2015-11-04 柳仲夏 Apparatus and method for editing symbol images, and recording medium in which program for executing same is recorded
CN103995816A (en) * 2013-02-14 2014-08-20 富士施乐株式会社 Information processing apparatus, information processing method
CN103995816B (en) * 2013-02-14 2018-10-02 富士施乐株式会社 Information processing equipment and information processing method
CN103400132A (en) * 2013-07-02 2013-11-20 Tcl集团股份有限公司 Method and device for character segmentation
CN103400132B (en) * 2013-07-02 2017-08-25 Tcl集团股份有限公司 A kind of character segmentation method and device
CN104376300A (en) * 2014-11-03 2015-02-25 电子科技大学 Identification method used for intelligent matching of incomplete Chinese characters on basis of grid characteristics
CN104376300B (en) * 2014-11-03 2018-01-30 电子科技大学 A kind of recognition methods based on grid search-engine intelligent Matching incompleteness Chinese character
CN105488471B (en) * 2015-11-30 2019-03-29 北大方正集团有限公司 A kind of font recognition methods and device
CN105488471A (en) * 2015-11-30 2016-04-13 北大方正集团有限公司 Character pattern recognition method and device
CN109478229B (en) * 2016-08-31 2021-08-10 富士通株式会社 Training device for classification network for character recognition, character recognition device and method
CN109478229A (en) * 2016-08-31 2019-03-15 富士通株式会社 Training device, character recognition device and the method for sorter network for character recognition
CN106446898A (en) * 2016-09-14 2017-02-22 宇龙计算机通信科技(深圳)有限公司 Extraction method and extraction device of character information in image
CN107194394A (en) * 2016-09-29 2017-09-22 北京神州泰岳信息安全技术有限公司 Remotely access monitoring method and relevant apparatus
CN110178139A (en) * 2016-11-14 2019-08-27 柯达阿拉里斯股份有限公司 Use the system and method for the character recognition of the full convolutional neural networks with attention mechanism
CN106778507A (en) * 2016-11-24 2017-05-31 北京小米移动软件有限公司 Text extraction method and device
CN106611174A (en) * 2016-12-29 2017-05-03 成都数联铭品科技有限公司 OCR recognition method for unusual fonts
CN107122113A (en) * 2017-03-31 2017-09-01 北京小米移动软件有限公司 Generate the method and device of picture
CN107122113B (en) * 2017-03-31 2021-07-13 北京小米移动软件有限公司 Method and device for generating picture
CN108874752A (en) * 2018-06-12 2018-11-23 广东信浓信息技术有限公司 A kind of material object picture conversion text method and system
CN109190630A (en) * 2018-08-29 2019-01-11 摩佰尔(天津)大数据科技有限公司 Character identifying method
CN109344834A (en) * 2018-09-06 2019-02-15 昆明理工大学 A kind of incomplete Chinese characters recognition method based on image procossing
CN109284012A (en) * 2018-09-12 2019-01-29 西南大学 A kind of Gu Yi nationality's text language in-put control system and method, information data processing terminal
CN109753968A (en) * 2019-01-11 2019-05-14 北京字节跳动网络技术有限公司 Generation method, device, equipment and the medium of character recognition model
CN110163203A (en) * 2019-04-09 2019-08-23 浙江口碑网络技术有限公司 Character identifying method, device, storage medium and computer equipment
CN111553336A (en) * 2020-04-27 2020-08-18 西安电子科技大学 Print Uyghur document image recognition system and method based on link segment
CN111553336B (en) * 2020-04-27 2023-03-24 西安电子科技大学 Print Uyghur document image recognition system and method based on link segment
CN113627175A (en) * 2021-08-17 2021-11-09 北京计算机技术及应用研究所 Method for calculating Chinese word vector by utilizing orthogonal transformation
CN113627175B (en) * 2021-08-17 2024-05-28 北京计算机技术及应用研究所 Method for calculating Chinese word vector by orthogonal transformation

Also Published As

Publication number Publication date
CN100589119C (en) 2010-02-10

Similar Documents

Publication Publication Date Title
CN100589119C (en) Multi-font multi- letter size print form charater recognition method based on 'Yi' character set
Lawgali et al. HACDB: Handwritten Arabic characters database for automatic character recognition
CN101447017B (en) Method and system for quickly identifying and counting votes on the basis of layout analysis
Mahmoud Recognition of writer-independent off-line handwritten Arabic (Indian) numerals using hidden Markov models
CN110033000A (en) A kind of text detection and recognition methods of bill images
CN109871851B (en) Chinese character writing normalization judging method based on convolutional neural network algorithm
CN104966097A (en) Complex character recognition method based on deep learning
CN106446954A (en) Character recognition method based on depth learning
US20120072859A1 (en) System and method for comparing and reviewing documents
CN106096557A (en) A kind of semi-supervised learning facial expression recognizing method based on fuzzy training sample
CN103093240A (en) Calligraphy character identifying method
CN103810484B (en) The mimeograph documents discrimination method analyzed based on printing character library
DE102011079443A1 (en) Learning weights of typed font fonts in handwriting keyword retrieval
Fazilov et al. State of the art of writer identification
CN109190630A (en) Character identifying method
CN102360436B (en) Identification method for on-line handwritten Tibetan characters based on components
Akbari et al. A novel database for automatic processing of Persian handwritten bank checks
Chtourou et al. ALTID: Arabic/Latin text images database for recognition research
Ni et al. Writer identification in noisy handwritten documents
Mariyathas et al. Sinhala handwritten character recognition using convolutional neural network
Fornés et al. A keyword spotting approach using blurred shape model-based descriptors
CN116629258B (en) Structured analysis method and system for judicial document based on complex information item data
Abdalkafor et al. A Novel Database for Arabic Handwritten Recognition (NDAHR) System
Magotra et al. A Comparative analysis for identification and classification of text segmentation challenges in Takri Script
Halder et al. Individuality of isolated Bangla characters

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100210

Termination date: 20160523
