Summary of the invention
In view of the shortcomings of the prior art, an object of the present invention is to provide a new preclassification method and pre-classifier that better balance the accuracy and the speed of preclassification.
A further object of the present invention is to provide an efficient method and system for recognizing handwritten Chinese characters.
The present invention provides a preclassification method for a handwritten Chinese character recognition system, in which the system recognizes a handwritten Chinese character by first pre-classifying and then finely classifying its features. The method comprises: extracting a low-dimensional first Chinese character feature of the handwritten Chinese character and producing a first candidate group from it; extracting a high-dimensional second Chinese character feature of the handwritten Chinese character for use in fine classification; reducing the dimensionality of the high-dimensional second Chinese character feature to obtain a low-dimensional second Chinese character feature, and producing a second candidate group from it; and obtaining a final candidate group from the intersection of the first candidate group and the second candidate group.
For the same handwritten Chinese character, the present invention uses two different Chinese character features and two sub-pre-classifiers to screen out two different candidate groups. Preclassification is then based on these two candidate groups, which derive from different Chinese character features, thereby avoiding the shortcomings of producing the candidate group with only a single pre-classifier and a single Chinese character feature.
The present invention also provides a method for recognizing handwritten Chinese characters in a handwritten Chinese character recognition system. The method comprises: extracting a low-dimensional first Chinese character feature of the handwritten Chinese character and producing a first candidate group by a first sub-preclassification; extracting a high-dimensional second Chinese character feature of the handwritten Chinese character; obtaining a low-dimensional second Chinese character feature from the high-dimensional second Chinese character feature and producing a second candidate group by a second sub-preclassification; obtaining a final candidate group from the intersection of the first candidate group and the second candidate group, as the result of preclassification; and identifying the written Chinese character using the high-dimensional second Chinese character feature and the final candidate group. By using two different candidate groups together with the high-dimensional second Chinese character feature, the combined accuracy and speed of recognition improve significantly.
The present invention also provides a pre-classifier for a handwritten Chinese character recognition system. It comprises: a low-dimensional first Chinese character feature extraction device for extracting the low-dimensional first Chinese character feature of a handwritten Chinese character; a first sub-pre-classifier for producing the first candidate group; a high-dimensional second Chinese character feature extraction device for extracting the high-dimensional second Chinese character feature of the handwritten Chinese character; a dimensionality-reduction device that reduces the high-dimensional second Chinese character feature to obtain the low-dimensional second Chinese character feature; a second sub-pre-classifier that produces the second candidate group from the low-dimensional second Chinese character feature; and a final candidate group generation device that obtains the final candidate group from the intersection of the first candidate group and the second candidate group.
The present invention also provides a handwritten Chinese character recognition system. It comprises: a low-dimensional first Chinese character feature extraction device for extracting the low-dimensional first Chinese character feature of a handwritten Chinese character; a first sub-pre-classifier for producing the first candidate group; and a high-dimensional second Chinese character feature extraction device for extracting the high-dimensional second Chinese character feature of the handwritten Chinese character. The system further comprises: a dimensionality-reduction transformation device that reduces the high-dimensional second Chinese character feature to obtain the low-dimensional second Chinese character feature; a second sub-pre-classifier that produces the second candidate group from the low-dimensional second Chinese character feature thus obtained; a final candidate group generation device for producing the final candidate group; and a fine classifier that recognizes the handwritten Chinese character using the high-dimensional second Chinese character feature and the final candidate group.
The fine classifier of the present invention uses the intersection of the first candidate group and the second candidate group to recognize the handwritten Chinese character. This makes full use of the complementarity of the two candidate groups and removes unnecessary candidates, thereby improving the recognition speed of the fine classifier.
The low-dimensional first Chinese character feature differs from the low-dimensional second Chinese character feature, and the two are substantially uncorrelated. The resulting first and second candidate groups are therefore complementary to some degree.
In addition, the peripheral features of a Chinese character are more important than its internal features and are more helpful for recognition. The low-dimensional second Chinese character feature of the present invention is therefore chosen to be a peripheral statistical feature of the character. The dimensionality-reduction transformation device of the present invention summarizes, for example by summation, the peripheral features of the sampled high-dimensional second Chinese character feature to obtain the low-dimensional second Chinese character feature. This removes the need to extract the low-dimensional second Chinese character feature separately.
The present invention also proposes a method for producing a candidate group by preclassification in a handwritten Chinese character recognition system. The method comprises: training a plurality of templates of effective statistical features; dividing these templates into a plurality of statistical-feature clusters; generating, in each cluster, a cluster center that represents all the Chinese character features in that cluster; producing a character index group for each statistical-feature cluster; sampling the input Chinese character to obtain its statistical feature; comparing the sampled statistical feature with the cluster center of each cluster and selecting the clusters most similar to it, the number of selected clusters being predetermined; and merging the character index groups corresponding to the selected clusters to produce the candidate group for the input Chinese character. Comparing against cluster centers in this way is superior to the prior-art approach of comparing against the feature range of each cluster: it is more accurate and more flexible.
Embodiment
Referring to Fig. 1, the handwritten Chinese character classifier of the present invention comprises a pre-classifier 1 and a fine classifier 2. The pre-classifier comprises a first sub-pre-classifier 12 and a second sub-pre-classifier 13. Pre-classifier 1 also comprises a low-dimensional first Chinese character feature extraction device 10, which extracts the low-dimensional first Chinese character feature from the input Chinese character. This feature is generally a low-dimensional statistical feature of the character, such as a low-dimensional frequency-domain feature, or another statistical feature of the character. The first sub-pre-classifier 12 also stores a plurality of clusters adapted to the low-dimensional first Chinese character feature, including the cluster centers and the corresponding character index groups for this feature (neither is shown). Each cluster contains a plurality of Chinese characters with similar features and has a cluster center that represents the common features of the characters in the cluster. The first sub-pre-classifier compares the low-dimensional first Chinese character feature with each of its cluster centers to obtain the distance to each cluster center. Based on these distances, it selects the several clusters with the smallest distance as its output. The Chinese characters contained in these nearest clusters form the first candidate group.
Pre-classifier 1 also comprises a device for obtaining the low-dimensional second Chinese character feature, namely a dimensionality-reduction transformation device 21. This device reduces the extracted high-dimensional second Chinese character feature to the low-dimensional second Chinese character feature. The high-dimensional second Chinese character feature is extracted by the high-dimensional Chinese character feature extraction device and is used for fine classification. The second Chinese character feature, whether high-dimensional or low-dimensional, is also a statistical feature of the character, but the low-dimensional second Chinese character feature is a different statistical feature from the low-dimensional first Chinese character feature. As noted above, there are many kinds of statistical features of Chinese characters, and the first or second Chinese character feature referred to here can be any of them. The selected first and second Chinese character features must, however, differ from each other, that is, be mutually orthogonal to some degree (have almost no correlation). For example, the stroke-count feature and the stroke-direction feature of a Chinese character have low correlation. The characters for "China fir" and "close to", for instance, have similar stroke-count features and fall in the same stroke-count cluster, but their direction features differ greatly and they do not fall in the same direction-feature cluster. The second sub-pre-classifier stores a plurality of clusters adapted to the low-dimensional second Chinese character feature. Each cluster contains a plurality of Chinese characters and has a cluster center that represents the common features of the characters in the cluster. The second sub-pre-classifier compares the low-dimensional second Chinese character feature of the input character with each of its cluster centers to obtain the distance to each cluster center. Based on these distances, it selects the several clusters with the smallest distance as its output. The Chinese characters contained in these nearest clusters form the second candidate group.
Because the first candidate group and the second candidate group are complementary to some degree, their intersection can be used as the final candidate group of the pre-classifier, that is, as the candidate group of the fine classifier. Doing so removes the unnecessary Chinese characters from the first candidate group, which was screened using the low-dimensional first Chinese character feature, and from the second candidate group, which was screened using the low-dimensional second Chinese character feature. This is performed by the final candidate group generation device (intersection generating device) 14 and the pre-classifier final candidate group storage device 15 shown in Fig. 1. The approach reduces the number of Chinese characters in the candidate group that the fine classifier must process, and therefore increases the recognition speed of the fine classifier and, in turn, the speed of the whole handwritten Chinese character classifier.
Alternatively, because the low-dimensional first Chinese character feature and the low-dimensional second Chinese character feature are mutually orthogonal to some degree, the first candidate group and the second candidate group are complementary to some degree. The first candidate group, screened using the low-dimensional first Chinese character feature, and the second candidate group, screened using the low-dimensional second Chinese character feature, can supplement each other. In this case it suffices to replace the final candidate group generation device (intersection generating device) 14 in Fig. 1 with a union generating device (not shown). All the Chinese characters in the first candidate group and the second candidate group then form the final candidate group of the pre-classifier, which serves as the candidate group of the fine classifier, and fine classifier 22 identifies the handwritten Chinese character from it.
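The two ways of combining the candidate groups described above (intersection and union) can be sketched with ordinary set operations. This is a minimal illustration, not the patented device: the function name `combine_candidate_groups`, the `mode` parameter and the character indices are all hypothetical.

```python
def combine_candidate_groups(first_group, second_group, mode="intersection"):
    """Combine the outputs of the two sub-pre-classifiers into a final candidate group."""
    first, second = set(first_group), set(second_group)
    if mode == "intersection":
        # Keep only characters proposed by both sub-pre-classifiers (fewer candidates).
        return first & second
    if mode == "union":
        # Keep every character proposed by either sub-pre-classifier (more candidates).
        return first | second
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical character indices returned by the two sub-pre-classifiers.
first_group = {3, 7, 12, 25, 40}
second_group = {7, 9, 25, 31}

final_small = combine_candidate_groups(first_group, second_group, "intersection")
final_large = combine_candidate_groups(first_group, second_group, "union")
```

With these inputs the intersection holds two candidates and the union holds seven, which illustrates why the intersection favors the speed of the fine classifier while the union retains more candidates for it to examine.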
The fine classifier 22 comprises an extraction device 20 for a high-dimensional Chinese character feature, which extracts the high-dimensional Chinese character feature from the handwritten Chinese character. To give recognition sufficient precision, the high-dimensional Chinese character feature is generally a high-dimensional directional feature. The fine classifier 22 uses this high-dimensional feature to identify the handwritten Chinese character from the preselected candidate group delivered to it.
The low-dimensional second Chinese character feature is obtained by applying a dimensionality-reduction transformation to the high-dimensional Chinese character feature used by the fine classifier. This function is performed by the dimensionality-reduction transformation device 21. As noted above, the peripheral features of a handwritten Chinese character are more important than its internal features. When reducing dimensionality, the present invention therefore preferentially extracts the peripheral features from the high-dimensional Chinese character feature. Fig. 2a shows a high-dimensional statistical feature extracted by the high-dimensional Chinese character feature extraction device, in which each black dot represents a multi-dimensional feature. The features at the four corners of the high-dimensional feature (the peripheral features) are extracted as shown in Fig. 2b. The peripheral features inside each dashed box are then summarized, for example by summation, to obtain the reduced statistical feature shown in Fig. 2c. Using this reduced statistical feature as the low-dimensional second Chinese character feature simplifies feature extraction.
The character index group generating device 5 of the present invention is described below with reference to Fig. 3. This device divides the Chinese characters to be recognized into a plurality of clusters according to their features. Each cluster has a cluster center, which represents the feature of the cluster, that is, the common features of all the Chinese characters in it. Each cluster corresponds to a character index group, which contains the indices of the Chinese characters in the cluster. The character index group generating device 5 comprises statistical feature templates 51, a clustering device 52, a character index group storage device 53 and a cluster center storage device 54.
Suppose m Chinese characters need to be recognized. Effective statistical feature templates 51 are first trained, so that the number of templates is also m. A clustering technique is then used to divide the m templates into n clusters. For preclassification to be fast, n and m must satisfy n << m, that is, the number of clusters must be far smaller than the number of templates. Next, the cluster center and the character index group of each cluster are obtained; the character index group of a cluster records the indices of all the Chinese characters in that cluster. The Chinese characters in the same cluster have similar features.
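The text does not name a particular clustering technique, so the following sketch assumes plain k-means purely for illustration. The function names, the deterministic initialization and the toy templates are hypothetical; a real system would have n far smaller than m, which this six-template example cannot show.

```python
import math


def nearest_center(feature, centers):
    """Return the index of the center with the smallest Euclidean distance."""
    return min(range(len(centers)), key=lambda j: math.dist(feature, centers[j]))


def build_clusters(templates, n_clusters, iterations=10):
    """Divide m templates into n clusters (k-means, an assumed choice).

    Returns (centers, index_groups), where index_groups[j] lists the indices
    of the templates (characters) assigned to cluster j.
    """
    # Deterministic initialization: the first n templates serve as centers.
    centers = [list(t) for t in templates[:n_clusters]]
    index_groups = [[] for _ in range(n_clusters)]
    for _ in range(iterations):
        # Assignment step: put each character index into its nearest cluster.
        index_groups = [[] for _ in range(n_clusters)]
        for char_index, feature in enumerate(templates):
            index_groups[nearest_center(feature, centers)].append(char_index)
        # Update step: each center becomes the mean of its members.
        for j, group in enumerate(index_groups):
            if group:  # keep the old center if a cluster became empty
                dims = zip(*(templates[i] for i in group))
                centers[j] = [sum(d) / len(group) for d in dims]
    return centers, index_groups


# Six hypothetical 2-dimensional templates (m = 6) divided into n = 2 clusters.
templates = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
centers, index_groups = build_clusters(templates, n_clusters=2)
```

Here `centers` plays the role of the stored cluster centers and `index_groups` the role of the character index groups; any other clustering method that yields one center and one index list per cluster would fit the description equally well.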
In this way, a plurality of first clusters, cluster centers and first character index groups for the m Chinese characters can be obtained from the low-dimensional first Chinese character feature, and a plurality of second clusters, cluster centers and second character index groups for the m Chinese characters can be obtained from the low-dimensional second Chinese character feature. Applying the above method to the frequency-domain feature of the Chinese characters yields a plurality of frequency-domain feature clusters, frequency-domain feature cluster centers and frequency-domain feature character index groups.
Applying the above method to the direction feature of the Chinese characters yields a plurality of direction feature clusters, direction feature cluster centers and direction feature character index groups.
The candidate group generating device of the present invention is described in detail below with reference to Fig. 3. Each sub-pre-classifier comprises a candidate group generating device 6, which comprises a feature input device 60, a cluster center comparison device 61, a cluster selection device 62 and a character index group combination storage device 63. After the feature of the handwritten Chinese character has been extracted, the feature input device 60 inputs the feature to the sub-pre-classifier. The cluster center comparison device 61 compares the input feature with the cluster center of each cluster (or character index group) in the sub-pre-classifier. The cluster selection device 62 uses the differences obtained from the comparison to select the P clusters with the smallest distance to the input feature, that is, P character index groups. The character index group combination storage device 63 forms a candidate group from the Chinese characters in these P character index groups.
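The selection of the P nearest clusters and the merging of their character index groups can be sketched as follows. Euclidean distance is an assumption (the text only speaks of a distance), and the function name, cluster centers and index groups are hypothetical.

```python
import heapq
import math


def generate_candidate_group(feature, centers, index_groups, p):
    """Select the P clusters nearest to `feature` and merge their index groups."""
    # Indices of the P cluster centers with the smallest distance to the input.
    nearest = heapq.nsmallest(
        p, range(len(centers)), key=lambda j: math.dist(feature, centers[j])
    )
    candidates = set()
    for j in nearest:
        candidates.update(index_groups[j])
    return candidates


# Hypothetical cluster centers and character index groups of one sub-pre-classifier.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
index_groups = [[0, 4], [1, 5], [2], [3, 6]]

candidates_p1 = generate_candidate_group((0.9, 0.1), centers, index_groups, p=1)
candidates_p2 = generate_candidate_group((0.9, 0.1), centers, index_groups, p=2)
```

Raising `p` from 1 to 2 doubles the candidate group in this toy case, so the fine classifier would have more characters to compare against.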
Combining the candidate groups obtained by the two sub-pre-classifiers yields the final candidate group of the pre-classifier. The value of P affects both the recognition accuracy for handwritten Chinese characters and the number of clusters in the candidate group, that is, the number of candidates in the candidate group. If P is large, recognition accuracy improves, but the number of candidates in the candidate group also increases, which slows down the subsequent fine classifier. If P is small, the subsequent fine classifier runs quickly, but recognition accuracy falls.
The recognition of the handwritten Chinese character for "hand" is described below with reference to Fig. 5a, Fig. 5b, Fig. 6a and Fig. 6b. After the handwritten character "hand" is input, the handwritten Chinese character classifier extracts two statistical features of the character. The low-dimensional first Chinese character feature extraction device 10 extracts one (the first) low-dimensional Chinese character feature of "hand". The high-dimensional second Chinese character feature extraction device 20 extracts another (the second) high-dimensional Chinese character feature of "hand". These two statistical features can be selected from the statistical features commonly used in Chinese character recognition, such as the directional feature, the contour feature, the stroke-count feature and the frequency-domain feature. One statistical feature is used by the first sub-pre-classifier 12, and the other is used by the fine classifier 2. The two statistical features are preferably chosen to reflect different properties of the Chinese character, because the classifier also applies the dimensionality-reduction transformation to the high-dimensional second Chinese character feature to obtain the low-dimensional second Chinese character feature used by the second sub-pre-classifier 13.
In this embodiment, the low-dimensional first Chinese character feature is a low-dimensional frequency-domain feature, for example one of fewer than 30 dimensions. The high-dimensional second Chinese character feature is a high-dimensional direction feature, for example one of more than 100 dimensions.
Fig. 5 schematically shows the extraction of a high-dimensional Chinese character feature of the character "hand". To distinguish it from the Chinese character feature used by the first sub-pre-classifier 12, it is called the second Chinese character feature. After input, the character "hand" is divided into a plurality of blocks, as shown in Fig. 5a. Fig. 5a is only an example; the actual block division is determined by the required dimensionality of the statistical feature. In each block, the system computes the direction features of the strokes, giving the result shown in Fig. 5b. The symbols "-", "|", "/" and "\" in Fig. 5b represent different direction features.
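One possible way to compute per-block direction features is sketched below. The text does not specify the computation, so everything here is an assumption made for illustration: strokes are given as lists of (x, y) points, each segment is quantized into one of four orientation bins, its length is added to the histogram of the block containing its midpoint, and the y axis grows upwards. The names `direction_bin` and `block_direction_features` are hypothetical.

```python
import math

DIRECTIONS = ("-", "/", "|", "\\")  # four assumed orientation bins


def direction_bin(dx, dy):
    """Quantize a segment's orientation (ignoring its sense) into one of four bins."""
    angle = math.degrees(math.atan2(dy, dx)) % 180.0  # fold into [0, 180)
    return int(((angle + 22.5) % 180.0) // 45.0)


def block_direction_features(strokes, size, grid):
    """Return a grid x grid array of four-bin direction histograms.

    strokes: list of strokes, each a list of (x, y) points with 0 <= x, y <= size.
    Each segment adds its length to the histogram of the block holding its midpoint.
    """
    cell = size / grid
    features = [[[0.0] * len(DIRECTIONS) for _ in range(grid)] for _ in range(grid)]
    for stroke in strokes:
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            dx, dy = x1 - x0, y1 - y0
            length = math.hypot(dx, dy)
            if length == 0.0:
                continue  # skip repeated points
            col = min(int((x0 + x1) / 2 / cell), grid - 1)
            row = min(int((y0 + y1) / 2 / cell), grid - 1)
            features[row][col][direction_bin(dx, dy)] += length
    return features


# Two hypothetical strokes in an 8 x 8 box, analyzed on a 2 x 2 grid of blocks.
strokes = [[(0, 1), (3, 1)], [(6, 4), (6, 7)]]
features = block_direction_features(strokes, size=8, grid=2)
```

Flattening the result gives grid × grid × 4 values, so under these assumptions an 8 × 8 division would produce a 256-dimensional feature, in line with the remark that the block division is chosen to match the required dimensionality.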
Fig. 6 shows how the dimensionality-reduction transformation device 21 reduces the high-dimensional second Chinese character feature of Fig. 5 to a low-dimensional statistical feature, which serves as the low-dimensional second Chinese character feature. As noted above, the peripheral features of a Chinese character are much more important than its internal features. In Fig. 6a, dashed rectangles select the high-dimensional features at the four corners of the input character. The direction features of the blocks inside each dashed rectangle are then summarized, for example by summation, to reduce the dimensionality and obtain the low-dimensional direction feature shown in Fig. 6b. Because this low-dimensional direction feature is used by the second sub-pre-classifier 13, it is called the low-dimensional second Chinese character feature.
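The corner summation described above can be sketched as follows. The layout is assumed, not taken from the figures: the high-dimensional feature is held as a grid of per-block vectors, each corner region is a square of `corner` × `corner` blocks, and the function name `reduce_to_corner_features` is hypothetical.

```python
def reduce_to_corner_features(block_features, corner):
    """Sum the per-block feature vectors inside each of the four corner regions.

    block_features: rows x cols grid of equal-length feature vectors.
    corner: side length, in blocks, of each square corner region.
    Returns one flat vector ordered top-left, top-right, bottom-left, bottom-right.
    """
    rows, cols = len(block_features), len(block_features[0])
    depth = len(block_features[0][0])
    regions = [
        (range(0, corner), range(0, corner)),
        (range(0, corner), range(cols - corner, cols)),
        (range(rows - corner, rows), range(0, corner)),
        (range(rows - corner, rows), range(cols - corner, cols)),
    ]
    reduced = []
    for row_range, col_range in regions:
        total = [0.0] * depth
        for r in row_range:
            for c in col_range:
                for k, value in enumerate(block_features[r][c]):
                    total[k] += value
        reduced.extend(total)
    return reduced


# Hypothetical 6 x 6 grid of four-bin histograms (144 dimensions in total):
# every block holds [1, 0, 0, 0] except the top-left block, which holds [0, 2, 0, 0].
grid = [[[1.0, 0.0, 0.0, 0.0] for _ in range(6)] for _ in range(6)]
grid[0][0] = [0.0, 2.0, 0.0, 0.0]

low_dim = reduce_to_corner_features(grid, corner=2)  # 4 corners x 4 bins = 16 values
```

In this example a 144-dimensional feature shrinks to 16 dimensions, and the interior blocks are ignored entirely, which reflects the stated preference for peripheral over internal features.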
The above procedure yields the low-dimensional first Chinese character feature and the low-dimensional second Chinese character feature required by the pre-classifier. The first sub-pre-classifier 12 compares the low-dimensional frequency-domain feature thus obtained with each frequency-domain feature cluster center to obtain the distance between them. Based on these distances, it selects from the frequency-domain feature clusters the P1 clusters with the smallest distance. The Chinese characters in these frequency-domain feature clusters form the first candidate group. The value of P1 is chosen as a compromise between recognition accuracy (recognition rate) and required computation (speed).
Likewise, the second sub-pre-classifier 13 compares the low-dimensional direction feature thus obtained with each direction feature cluster center to obtain the distance between them. Based on these distances, it selects from the direction feature clusters the P2 clusters with the smallest distance. The Chinese characters in these direction feature clusters form the second candidate group. The value of P2 is likewise chosen as a compromise between recognition accuracy (recognition rate) and required computation (speed).
Next, the intersection device 14 receives the first candidate group output by the first sub-pre-classifier 12 and the second candidate group output by the second sub-pre-classifier 13, and computes their intersection as the final candidate group of the pre-classifier. Finally, the fine classifier 22 uses the high-dimensional direction feature obtained earlier to identify the handwritten Chinese character from this candidate group.
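The final step can be sketched as a search restricted to the final candidate group. The text does not say how the fine classifier compares features, so nearest-template matching with Euclidean distance is an assumption here, and the function name, templates and indices are hypothetical.

```python
import math


def fine_classify(high_dim_feature, candidate_group, high_dim_templates):
    """Return the candidate whose high-dimensional template is nearest to the input."""
    if not candidate_group:
        return None  # the pre-classifier left no candidate to compare against
    return min(
        candidate_group,
        key=lambda idx: math.dist(high_dim_feature, high_dim_templates[idx]),
    )


# Hypothetical high-dimensional templates keyed by character index (3-D for brevity).
high_dim_templates = {
    0: (0.0, 0.0, 0.0),
    1: (1.0, 1.0, 1.0),
    2: (1.0, 0.9, 1.0),
    3: (4.0, 4.0, 4.0),
}
final_candidates = {1, 3}  # e.g. the intersection produced by the pre-classifier

recognized = fine_classify((1.0, 0.9, 1.0), final_candidates, high_dim_templates)
```

Only the templates in the candidate group are compared, which is where the speed gain comes from. The example also shows the cost: character 2 matches the input exactly but was filtered out during preclassification, so the fine classifier returns character 1 instead.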
Taking the intersection of the first candidate group and the second candidate group is equivalent to first screening with the low-dimensional first Chinese character feature to obtain the first candidate group, and then using the low-dimensional second Chinese character feature to eliminate the impossible, that is unnecessary, Chinese characters from it. This reduces the number of Chinese characters in the final candidate group and therefore narrows the search range of the fine classifier, which speeds up recognition.
Fig. 4c shows how the pre-classifier of the present invention differs from the existing pre-classifiers of Fig. 4a and Fig. 4b. In step 91, the present invention first samples and inputs the statistical feature of the handwritten Chinese character; in step 92, it compares the statistical feature of the character with the cluster center of each cluster. In step 93, based on the comparison result, it selects the P clusters with the smallest distance to the statistical feature of the input handwritten character. In step 94, the Chinese characters in these P clusters form the candidate group. The present invention uses the statistical features of Chinese characters, and its classifier is a distance classifier rather than a dynamic-programming classifier.
Once the required recognition speed and recognition rate have been determined, the handwritten Chinese character classifier of the present invention can arrive at a recognition scheme suited to different demands by jointly choosing the value of the first cluster count P1, the value of the second cluster count P2, and whether to use the intersection or the union of the first candidate group and the second candidate group.