CN104598887A

CN104598887A - Recognition method for written Chinese address of non-specification format

Info

Publication number: CN104598887A
Application number: CN201510044955.1A
Authority: CN
Inventors: 吕岳; 韦箫华; 吕淑静
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2015-01-29
Filing date: 2015-01-29
Publication date: 2015-05-06
Anticipated expiration: 2035-01-29
Also published as: CN104598887B

Abstract

The invention provides a method for recognizing handwritten Chinese addresses written in non-standard formats, and establishes a representation of standard-format addresses. A word-level tree is proposed to store the Chinese address database: each node stores one address word, and each path from the root node to a leaf node stores one address written in the standard format. Recognition proceeds through the following steps: building the word-level tree, building a character index table, over-segmenting the image, combining segmentation blocks, recognizing characters, generating candidate address words, and recognizing the standard-format address. A handwritten address in a non-standard format is thereby mapped to its corresponding standard-format address, which achieves recognition.

Description

Recognition method for handwritten Chinese addresses in non-standard format

Technical field

The invention belongs to the technical field of handwritten Chinese address recognition, and in particular relates to the recognition of handwritten Chinese addresses in non-standard formats.

Background technology

Chinese address recognition plays a crucial role in the automatic sorting of letters and parcels. Mail centers process and dispatch large volumes of letters and parcels every day, so mail processing must be both fast and accurate. Although research on Chinese address recognition has made great progress, recognizing handwritten addresses on real mail remains a problem that has not been solved well. For example, the number of Chinese characters is large, writing styles vary widely, and adjacent characters in an address may be joined by connected strokes. In particular, the formats in which addresses are written are variable and irregular, which greatly increases the difficulty of handwritten address recognition. Little existing work specifically addresses this aspect.

Traditional handwritten Chinese address recognition methods mainly aim to recognize every Chinese character in a given address image from beginning to end. They require an address list to provide contextual information for recognition. Each entry in the list is a complete address, and the entries are usually matched one by one against the recognition result of the input address image. To make address lookup more efficient and to reduce the storage needed for the address list, methods that store address information in a search tree have been proposed. In these trees each node stores a single character, so they are called character-level trees. On the one hand, a character-level tree is sensitive to noise, because it requires all characters in the address image to be recognized in order. On the other hand, how accurately the candidate pattern blocks match the child nodes of the root node has a large effect on recognition performance. In short, address recognition based on a character-level tree depends on a predefined address list. If the address information in that list is incomplete, that is, if it does not cover all the format variations in which an address may be written, or if the list provides insufficient address information, the recognition rate of these methods drops sharply in practical applications.

Usually, an address consists of several address words, which are defined as basic administrative units. For example, the standard-format address "Zhongshan North Road, Putuo District, Shanghai" shown in Fig. 2(a) comprises the address words "Shanghai City", "Putuo District", and "Zhongshan North Road". The last character of each address word is defined as its key character, such as "province", "city", "district", or "road".

In practice, however, the ways addresses are written on envelopes are very complex, and people usually do not follow the standard address format. For example, Fig. 2(a) shows the standard written form of an address, while Fig. 2(b-e) show various non-standard ways of writing it; these non-standard forms are considered reasonable in practice.

In summary, manually collecting all of these non-standard written forms of addresses is practically impossible.

Summary of the invention

In view of the deficiencies of the prior art, the object of the invention is to propose a method based on a word-level tree structure that maps these non-standard handwritten Chinese addresses to the corresponding standard-format addresses and thereby recognizes them, overcoming the limitations of traditional methods in recognizing non-standard handwritten Chinese addresses.

The object of the invention is achieved as follows:

A method for recognizing handwritten Chinese addresses in non-standard format, comprising the following steps:

Building a word-level tree, which represents and stores addresses written in the standard format;

Building a character index table, which represents the associations between single characters and address words;

Segmentation-recognition processing, which segments the image into characters, merges the resulting blocks, and performs character recognition on the candidate pattern blocks formed by merging;

Generating candidate address words, which obtains the candidate address words with higher confidence;

Standard-format address recognition, which maps the handwritten address to be recognized to its corresponding standard-format written form; wherein:

The word-level tree has a depth of 5. Layer 1 is the root node, and layers 2 to 5 store the address words representing "province", "city", "district", and "road" names respectively, with each node storing one address word.

The character index table stores all characters contained in the address words and associates each character with all address words that contain it.

The segmentation-recognition processing further comprises:

Image over-segmentation, which splits the image into atomic blocks in order to separate overlapping parts or connected strokes between handwritten Chinese characters;

Combining segmentation blocks, which merges consecutive atomic blocks into candidate pattern blocks in order to restore single characters, or characters with a left-right structure, that were split apart by over-segmentation;

Character recognition, which recognizes the candidate pattern blocks and computes the confidence of the recognition results;

The image over-segmentation applies connected-component analysis, normalized overlap calculation, and projection analysis to over-segment the image, finally yielding a series of atomic blocks;

The combining of segmentation blocks merges consecutive atomic blocks one by one to form candidate pattern blocks;

The character recognition further comprises:

A handwritten character classifier, for classifying candidate pattern blocks;

Confidence transformation, for computing the confidence of the recognition results;

The generating of candidate address words combines the candidate pattern recognition results, the character index table, and the address words stored in the word-level tree, and prunes the word-level tree accordingly.

The standard-format address recognition combines the candidate address words with the word-level tree: a bottom-up search of the word-level tree combines the candidate address words and finally generates candidate addresses. The candidate address with the highest confidence is taken as the final address recognition result.

The invention overcomes the limitations of traditional methods in recognizing non-standard handwritten Chinese addresses. It proposes a method based on a word-level tree structure that maps an address written in a non-standard format to the corresponding standard-format address, thereby recognizing addresses written in non-standard formats.

Brief description of the drawings

Fig. 1 is the flow chart of the invention;

Fig. 2 shows examples of different ways of writing the address "Zhongshan North Road, Putuo District, Shanghai";

Fig. 3 is a schematic diagram of the word-level tree of standard-format addresses;

Fig. 4 is an example of the over-segmentation result of an address line image;

Fig. 5 is an example of a candidate pattern block diagram;

Fig. 6 is a schematic diagram of the generation of candidate address words;

Fig. 7 is an example of the positions of candidate address words in the candidate pattern block diagram;

Fig. 8 is the flow chart of the path search in the word-level tree;

Fig. 9 is an example of searching the word-level tree and generating candidate addresses;

Fig. 10 shows example recognition results for handwritten Chinese addresses in non-standard format.

Embodiment

As shown in Fig. 1, the flow chart of an embodiment of the invention, the method specifically comprises:

Building the word-level tree, which represents and stores addresses written in the standard format.

China's administrative address system is a top-down hierarchy, generally with 4 levels, corresponding to "province", "city", "district", and "road" names. A tree of depth 5 is defined according to this structure. The root node is empty, and layers 2 to 5 store the address words representing "province", "city", "district", and "road" names respectively, with each node storing one address word. In the word-level tree, each path from the root node to a leaf node corresponds to one address written in the standard format.

To handle addresses in which key characters are omitted, the last character of each address word (except the character "road") is defined as optional. The resulting word-level tree is shown in Fig. 3, where characters in parentheses are optional.

In this word-level tree, once a leaf node (i.e., a road name) is recognized, all candidate addresses containing that road name can be obtained. For example, if the address word "Zhongshan North Road" is recognized, a bottom-up search of the word-level tree yields the address words "Shanghai City", "Putuo District", "Zhejiang Province", "Hangzhou City", "Xiacheng District", and so on. The related candidate addresses, such as "Zhongshan North Road, Putuo District, Shanghai City" and "Zhongshan North Road, Xiacheng District, Hangzhou City, Zhejiang Province", are thus obtained. Furthermore, if the address word "Putuo District" or "Shanghai City" is also recognized, the candidate address "Zhongshan North Road, Putuo District, Shanghai City" becomes more likely to be the recognition result, especially when both "Putuo District" and "Shanghai City" are recognized.
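The tree and the bottom-up lookup described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class and method names (`Node`, `WordLevelTree`, `add_address`, `candidates_for_road`) are hypothetical, and allowing paths shorter than four address words (as for "Shanghai City") is an assumption about how such addresses might be stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    word: str                                  # address word ("" for the empty root)
    parent: Optional["Node"] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


class WordLevelTree:
    def __init__(self) -> None:
        self.root = Node("")
        self.leaves: Dict[str, List[Node]] = {}  # road name -> leaf nodes with that name

    def add_address(self, words: List[str]) -> None:
        """Store one standard-format address, ordered from largest unit to road name."""
        node = self.root
        for word in words:
            if word not in node.children:
                node.children[word] = Node(word, parent=node)
            node = node.children[word]
        bucket = self.leaves.setdefault(words[-1], [])
        if node not in bucket:                 # identity check, avoids duplicate leaves
            bucket.append(node)

    def candidates_for_road(self, road: str) -> List[List[str]]:
        """Walk from every leaf named `road` up to the root and return the paths."""
        paths = []
        for leaf in self.leaves.get(road, []):
            path, node = [], leaf
            while node.parent is not None:     # stop at the empty root
                path.append(node.word)
                node = node.parent
            paths.append(path[::-1])           # report from largest unit to road name
        return paths


tree = WordLevelTree()
tree.add_address(["Shanghai City", "Putuo District", "Zhongshan North Road"])
tree.add_address(["Zhejiang Province", "Hangzhou City", "Xiacheng District",
                  "Zhongshan North Road"])
candidates = tree.candidates_for_road("Zhongshan North Road")
```

Keeping a parent pointer in every node is what makes the leaf-to-root walk cheap; the `leaves` dictionary is a convenience so that a recognized road name reaches all of its leaf nodes directly.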

Building the character index table, which represents the associations between single characters and address words.

As shown in Table 1, the character index table has 3 columns. Column 2 lists all characters that appear in the address words, and column 1 gives the GB2312-80 code of the character in column 2. Column 3 lists all address words that contain that character. When a character is recognized, all address words containing it can be retrieved and used to generate the final candidate address words.

Table 1
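A character index of this kind behaves like an inverted index from characters to address words. The following Python sketch illustrates the idea under stated assumptions: the function names and the four sample address words are hypothetical, and Python's built-in `gb2312` codec is used to stand in for the GB2312-80 code column.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_char_index(address_words: Iterable[str]) -> Dict[str, Set[str]]:
    """Map every character to the set of address words that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for word in address_words:
        for ch in word:
            index[ch].add(word)
    return dict(index)


def gb2312_code(ch: str) -> str:
    """Return the two-byte GB2312 encoding of a character as a hex string."""
    return ch.encode("gb2312").hex().upper()


# Illustrative address words only; a real table would cover the whole tree.
words = ["上海市", "普陀区", "中山北路", "杭州市"]
index = build_char_index(words)

# One row of the table: (code, character, related address words)
row = (gb2312_code("市"), "市", sorted(index["市"]))
```

Looking up a recognized character such as "市" returns both "上海市" and "杭州市", which is the retrieval step the table is meant to support.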

Image over-segmentation, which separates overlapping parts or connected strokes between handwritten Chinese characters.

Connected-component analysis is first applied to the image. A normalized overlap is then computed for adjacent connected components to decide whether to merge them, since some of them may be different parts of the same character. Finally, projection analysis determines whether a connected component contains connected strokes, and if so, splits it. Overlapping parts of different characters and connected strokes between them are separated as far as possible, finally yielding a series of atomic blocks. The segmentation result for Fig. 2(d) is shown in Fig. 4, where the atomic blocks are arranged from left to right and numbered in order above each block.
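The merging decision based on normalized overlap can be illustrated with a small Python sketch. The text does not give the exact overlap formula or threshold, so both are assumptions here: overlap is measured horizontally and normalized by the width of the narrower component, and components are merged when the ratio reaches a hypothetical threshold of 0.5.

```python
from typing import List, Tuple

Box = Tuple[int, int]  # horizontal extent (left, right) of a connected component


def normalized_overlap(a: Box, b: Box) -> float:
    """Horizontal overlap divided by the width of the narrower box (assumed definition)."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return overlap / min(a[1] - a[0], b[1] - b[0])


def merge_components(boxes: List[Box], threshold: float = 0.5) -> List[Box]:
    """Merge left-to-right neighbours whose normalized overlap reaches the threshold."""
    merged: List[Box] = []
    for box in sorted(boxes):
        if merged and normalized_overlap(merged[-1], box) >= threshold:
            last = merged[-1]
            merged[-1] = (min(last[0], box[0]), max(last[1], box[1]))
        else:
            merged.append(box)
    return merged


# The second component lies entirely above or below the first, so they are merged.
blocks = merge_components([(0, 10), (2, 8), (12, 20)])
```

Projection-based splitting of connected strokes is not shown; it would operate on the pixel data of each component rather than on bounding boxes.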

Combining segmentation blocks, which restores single characters, or characters with a left-right structure, that were split apart by over-segmentation.

After the image has been over-segmented, consecutive atomic blocks are combined to generate candidate pattern blocks, as shown in Fig. 5. All candidate pattern blocks form a set P = \{p_{(1,1)}, p_{(1,2)}, p_{(1,3)}, p_{(2,1)}, p_{(2,2)}, p_{(2,3)}, \ldots, p_{(m,n)}, \ldots, p_{(l,q)}\}, where (m, n) is the index of a candidate pattern block in terms of atomic blocks (1 ≤ m ≤ l, 1 ≤ n ≤ q), l is the total number of atomic blocks, and q is the maximum number of atomic blocks that one candidate pattern block may contain. In this embodiment, q is set to 3.
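Enumerating the candidate pattern blocks is a simple double loop. The sketch below makes two assumptions that the text does not state explicitly: p_(m,n) is taken to cover n consecutive atomic blocks starting at block m, and combinations that would run past the last atomic block are dropped. The function name is hypothetical.

```python
from typing import List, Tuple


def candidate_patterns(l: int, q: int = 3) -> List[Tuple[int, int]]:
    """List the indices (m, n) of candidate pattern blocks for l atomic blocks.

    Assumption: p_(m,n) spans atomic blocks m .. m+n-1, so it must end at or before l.
    """
    return [(m, n)
            for m in range(1, l + 1)
            for n in range(1, q + 1)
            if m + n - 1 <= l]


patterns = candidate_patterns(l=4, q=3)
```

With q fixed at 3, the number of candidate pattern blocks grows only linearly with the number of atomic blocks, which keeps the later character recognition step tractable.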

Character recognition, which recognizes the candidate pattern blocks and computes the confidence of the recognition results.

In the candidate pattern block diagram, a character classifier recognizes each candidate pattern block and generates a series of candidate characters. For unconstrained handwritten Chinese characters with a large number of classes, the MQDF method is currently the most practical, but it requires a large amount of storage for character features. The invention combines discriminative learning of MQDF with a shared distribution subspace, which reduces the storage occupied by character features without lowering the recognition rate.

The confidence of character recognition, i.e., the posterior probability p(w|x) (where w is the recognized character and x is the image feature vector), is very important for character string recognition, but it cannot be obtained directly from the output of the MQDF classifier. A confidence transformation is therefore needed to convert the classifier output into a posterior probability. The invention applies a sigmoidal function to the confidence transformation, so the posterior probability of a character recognition result can be expressed as

p^{sg}(w_j \mid x) = \frac{\exp(-\alpha d_j(x) + \beta)}{1 + \exp(-\alpha d_j(x) + \beta)}, \quad j = 1, 2, \ldots, M \qquad (1)

where M is the total number of character classes, d_j(x) is the classifier's output score for class w_j, and α and β are confidence parameters to be optimized, which can be done by minimizing the cross-entropy (CE) loss function. Following Dempster-Shafer (D-S) evidence theory, the character confidence can be computed as:

p^{ds}(w_j \mid x) = \frac{\exp(-\alpha d_j(x) + \beta)}{1 + \sum_{i=1}^{M} \exp(-\alpha d_i(x) + \beta)}, \quad j = 1, 2, \ldots, M \qquad (2)
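The two confidence transformations can be compared side by side in a short Python sketch. It assumes that the classifier output d_j(x) behaves like a distance (smaller means a better match), so the exponent is taken as -αd_j(x) + β in both formulas; the function names and the sample values of α, β, and d are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid_confidence(d: Sequence[float], alpha: float, beta: float) -> List[float]:
    """Formula (1): each class is transformed independently by a sigmoid."""
    out = []
    for dj in d:
        e = math.exp(-alpha * dj + beta)
        out.append(e / (1.0 + e))
    return out


def ds_confidence(d: Sequence[float], alpha: float, beta: float) -> List[float]:
    """Formula (2): D-S style confidence with a shared denominator over all classes."""
    e = [math.exp(-alpha * dj + beta) for dj in d]
    total = 1.0 + sum(e)
    return [ej / total for ej in e]


def top_candidates(labels: Sequence[str], conf: Sequence[float],
                   k: int = 20) -> List[Tuple[str, float]]:
    """Keep the k most confident candidates, sorted in descending order of confidence."""
    ranked = sorted(zip(labels, conf), key=lambda item: item[1], reverse=True)
    return ranked[:k]


scores = [1.0, 2.0, 4.0]                      # hypothetical classifier outputs
conf = ds_confidence(scores, alpha=1.0, beta=0.0)
best = top_candidates(["a", "b", "c"], conf, k=2)
```

Because of the extra 1 in the denominator, the D-S confidences sum to less than one; the remainder can be read as the mass assigned to "none of the classes".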

Finally, for each pattern block, the 20 candidate characters with the highest confidence are taken as the recognition result and sorted in descending order of confidence. The recognition results of the candidate pattern blocks in Fig. 5 are shown in Table 2. The recognition results of some pattern blocks are empty, because their shape, i.e., their aspect ratio, directly shows that they are not plausible characters; for example, the pattern blocks p_{(10,1)}, p_{(15,1)}, p_{(17,1)}, and p_{(19,1)} are not plausible characters.

Table 2

Generating candidate address words: the word-level tree is pruned by combining the candidate pattern recognition results, the character index table, and the address words stored in the word-level tree.

The word-level tree implicitly represents an address word list (AW_O) that stores all address words, as shown in Fig. 6(a). Candidate address words are generated from this list in 3 steps. First, AW_O is pruned using the associations between the recognized candidate characters and the address words; the associated address words form a new address word list AW_R. Next, AW_R is further pruned by matching the positional constraints between the recognized candidate characters of each address word and the candidate pattern blocks, giving the address word list AW_P. Finally, the score of each address word in AW_P is computed, and the address words whose score exceeds a preset threshold are stored in the list AW_C; the address words in AW_C are the final candidate address words. The steps are described as follows:

(1) Generation of AW_R: based on the recognition results of the candidate patterns, all address words in AW_O that do not satisfy the required condition on nr and nl are deleted (nr is the number of recognized characters in an address word, and nl is the number of characters the address word contains). The remaining candidate address words form the list AW_R (see Fig. 6(b)).

(2) Generation of AW_P: the positional constraints in the image between the recognized characters of a candidate address word and their corresponding pattern blocks must be considered. If the image positions of the pattern blocks matched by the recognized characters of a candidate address word in AW_R do not satisfy the positional constraints, that candidate address word is deleted. The remaining candidate address words form the list AW_P (see Fig. 6(c)).

(3) Generation of AW_C: the invention proposes a method for computing an address word score, which is applied to the address words in AW_P; the computation is described below. If the score of a candidate address word in AW_P is below a predefined empirical threshold, that candidate address word is deleted. The remaining address words form the list AW_C, and each address word in AW_C is a final candidate address word (see Fig. 6(d)).

The score of an address word is computed by the following formula:

MSF = \frac{nr}{nl} \cdot \sum_{k=1}^{nl} \left(p_k^{ds} + SC_k\right) + v \qquad (3)

This formula takes two factors into account: the ratio of the number of recognized characters in the address word to the total number of characters it contains, and the confidence of the segmentation blocks. Here p_k^{ds} is the confidence of a single character, computed by formula (2). The factor nr/nl accounts for the proportion of recognized characters in the address word. In addition, if all characters of the address word are recognized and their matched pattern blocks are in reasonable relative positions, the confidence of the address word is increased by a constant v (1 ≤ v ≤ 4). SC is the confidence of a segmentation block, defined as

SC = m \cdot \left(1 - \frac{pw}{ph}\right) \qquad (4)

where m is the number of consecutive atomic blocks that make up a pattern block, provided the combination of these atomic blocks contains no connected strokes, and pw/ph is the aspect ratio (width to height) of the pattern block.
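Formulas (3) and (4) can be checked with a small Python sketch. Two points are assumptions rather than statements from the text: characters that were not recognized contribute nothing to the sum, and the bonus v is added only when every character is recognized and the positions are judged reasonable (passed in here as a flag). The function names and sample values are hypothetical.

```python
from typing import List, Tuple


def segmentation_confidence(m: int, pw: float, ph: float) -> float:
    """Formula (4): SC = m * (1 - pw / ph) for a pattern block of m atomic blocks."""
    return m * (1.0 - pw / ph)


def address_word_score(nl: int, matches: List[Tuple[float, float]],
                       v: float = 1.0, positions_ok: bool = True) -> float:
    """Formula (3): MSF = (nr / nl) * sum(p_ds + SC) + v.

    `matches` holds one (p_ds, SC) pair per recognized character, so nr = len(matches).
    Unrecognized characters are assumed to contribute 0 to the sum.
    """
    nr = len(matches)
    total = sum(p_ds + sc for p_ds, sc in matches)
    bonus = v if (nr == nl and positions_ok) else 0.0
    return nr / nl * total + bonus


sc = segmentation_confidence(m=2, pw=30.0, ph=40.0)              # 2 * (1 - 0.75)
partial = address_word_score(nl=3, matches=[(0.9, 0.5), (0.8, 0.5)])
full = address_word_score(nl=2, matches=[(0.9, 0.5), (0.8, 0.5)], v=2.0)
```

The nr/nl factor scales down words with missing characters, while the bonus v rewards complete matches, so a fully matched short word can outrank a partially matched long one.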

To reduce the recognition error rate, every address word in AW_P whose score is lower than a threshold ε is deleted. ε is defined as

where nl is the number of characters in the candidate address word and the accompanying parameter is an empirical value; repeated experiments show that setting it to 2.5 gives the recognition system its best performance.

After this step, the positions of the generated candidate address words in the candidate pattern block diagram are as shown in Fig. 7.

Standard-format address recognition, which maps the handwritten address to be recognized to its corresponding standard-format written form.

When a handwritten Chinese address is recognized, all of its non-standard written forms can be mapped to one path of the word-level tree. Once the candidate address words have been generated, the tree can be searched according to their node relationships in the word-level tree to generate candidate addresses in the standard format. A bottom-up search is used, going from the leaf nodes of the tree (which correspond to road names) toward the root node. This step yields several candidate addresses; the score of each candidate address is the sum of the scores of the recognized candidate address words it contains. Finally, the candidate address with the highest score is taken as the recognition result. The detailed flow of this step is shown in Fig. 8.

In the invention, 4 lists store the address words representing "province", "city", "district", and "road" names; they are denoted PR, CI, DI, and RO respectively. In addition, a triple TN = {CN, PN, AS} represents a node of the search space, where CN points to the current node of the word-level tree, PN points to the parent node of CN, and AS is the address word score accumulated during the search. For a candidate word W, its leftmost pattern block lp(W) and rightmost pattern block rp(W) correspond to its first matched character and its last matched character respectively. Whether the address word of a parent node and the address word of its child node satisfy the positional constraint in the pattern block diagram is judged by whether rp of the parent node is less than lp of the child node. Here, the positions of address words are ordered from left to right in ascending order.

Before the search, the list RO is first checked. If RO is empty, i.e., no road name has been recognized, then AS = 0, the search stops, and the result is a rejection. Otherwise, the search starts from each address word stored in RO in turn. First, CN points to an address word in RO, AS is initialized to the score of that address word, and PN points to the parent node of CN. The subsequent search distinguishes two cases: PN ∈ DI or PN ∉ DI. If PN ∉ DI, the candidate address word pointed to by PN has not been recognized; in this case PN moves directly to the parent of the node it points to, and the search continues. If PN ∈ DI, the candidate address word pointed to by PN has been recognized. If the address words pointed to by PN and CN then satisfy the positional relation rp(PN) < lp(CN), AS becomes the sum of the scores of these two address words; CN then points to the word-level tree node corresponding to PN, and PN points to the parent node of CN. Otherwise, if the two words do not satisfy the positional relation, PN moves directly to the parent of the node it points to, and the search continues. When PN points to the root node of the tree, this search ends. The standard address obtained by this reverse search from the leaf node to the root node is taken as a candidate address result, and AS is its score. A pair RS = {ξ, AS} stores this search result, where ξ is the standard candidate address obtained by the current search.

Fig. 9 illustrates the search in the word-level tree. For example, the search starts from the leaf node "Zhongshan North Road", and AS equals the score of this address word, 20.19. The address word "Putuo District" pointed to by PN has been recognized as a candidate address word, and rp("Putuo District") < lp("Zhongshan North Road"), so AS becomes 34.85 (= 20.19 + 14.66). Finally, the search along this path yields the candidate address "Zhongshan North Road, Putuo District, Shanghai", whose score of 48.41 (= 20.19 + 14.66 + 13.56) is the highest among all candidate addresses, so it is taken as the final recognition result. Through the path search, some address words that were not recognized are also included in the candidate address returned as the recognition result, but their scores are not accumulated.
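The bottom-up search and the score accumulation in this example can be reproduced with a compact Python sketch. It is a simplification under stated assumptions: nodes are identified by their address word (so words are assumed unique in this toy tree), the same recognized-or-not test is applied at every ancestor level rather than only for DI, and the lp/rp positions are made up for illustration. Only the three scores 20.19, 14.66, and 13.56 come from the example; all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Candidate:
    score: float   # address word score
    lp: int        # position of the leftmost matched pattern block
    rp: int        # position of the rightmost matched pattern block


def search_from_leaf(leaf: str, parent: Dict[str, str],
                     recognized: Dict[str, Candidate]) -> Tuple[List[str], float]:
    """Climb from a recognized road name to the root, accumulating scores (AS)."""
    cn = leaf
    accumulated = recognized[leaf].score
    path = [leaf]
    pn = parent.get(cn)                # None stands for the root node
    while pn is not None:
        path.append(pn)                # unrecognized ancestors still belong to the address
        cand = recognized.get(pn)
        if cand is not None and cand.rp < recognized[cn].lp:
            accumulated += cand.score  # recognized and positioned to the left of CN
            cn = pn
        pn = parent.get(pn)
    return path[::-1], accumulated


def recognize(roads: List[str], parent: Dict[str, str],
              recognized: Dict[str, Candidate]) -> Optional[Tuple[List[str], float]]:
    """Search from every recognized road name; return the best path, or None to reject."""
    results = [search_from_leaf(road, parent, recognized) for road in roads]
    if not results:
        return None
    return max(results, key=lambda item: item[1])


parent = {"Zhongshan North Road": "Putuo District", "Putuo District": "Shanghai City"}
recognized = {
    "Zhongshan North Road": Candidate(20.19, lp=4, rp=7),
    "Putuo District": Candidate(14.66, lp=2, rp=3),
    "Shanghai City": Candidate(13.56, lp=0, rp=1),
}
best_path, best_score = recognize(["Zhongshan North Road"], parent, recognized)
```

If "Putuo District" were missing from `recognized`, it would still appear in the returned path but would add nothing to the score, matching the behaviour described for unrecognized address words.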

Some candidate address words may overlap in position within the candidate pattern block diagram (see Fig. 7). If the overlapping address words are of the same level, the tree search is not affected: they are not in a superior-subordinate relation, i.e., they do not correspond to a parent node and child node in the tree, so they lead to different paths in the pattern block diagram; examples are "Shanghai City" and "Shanghai", or "Putuo District" and "Putuo". Conversely, if the two address words are of different levels, they may correspond to a parent node and child node on the same path, for example "Putuo District" and "Putuo Road". In this case, the word with lower priority is skipped during the search, and its score is not accumulated on that path. In the invention, the priority of an address word increases with the depth of its node's layer, so address words representing road names have the highest priority.

After all address words in RO have been searched in the tree, several candidate addresses have been generated. Finally, only the candidate address with the highest score is taken as the recognition result. The recognition result, denoted S, is defined as

S = \arg\max_{\xi} \left(AS_i \mid i = 1, 2, \ldots, n\right) \qquad (6)

where n is the total number of candidate addresses generated; in Fig. 7, n = 5. Clearly, AS_i attains its maximum score of 48.41 when i = 3, so the corresponding standard-format address "Zhongshan North Road, Putuo District, Shanghai" is the final recognition result for Fig. 2(d).

Fig. 10 shows the recognition results for the handwritten Chinese address line images in Fig. 2. As Fig. 10 shows, the invention recognizes these three types of non-standard-format addresses as the standard-format address "Zhongshan North Road, Putuo District, Shanghai".

Claims

1. A method for recognizing handwritten Chinese addresses in non-standard format, characterized in that the method comprises the following steps:

Building a word-level tree, which represents and stores addresses written in the standard format;

Building a character index table, which represents the associations between single characters and address words;

Segmentation-recognition processing, which segments the image into characters, merges the resulting blocks, and performs character recognition on the candidate pattern blocks formed by merging;

Generating candidate address words, which obtains the candidate address words with higher confidence;

2. The recognition method of claim 1, characterized in that the word-level tree has a depth of 5, layer 1 is the root node, and layers 2 to 5 store the address words representing "province", "city", "district", and "road" names respectively, with each node storing one address word.

3. The recognition method of claim 1, characterized in that the character index table stores all characters contained in the address words and associates each character with all address words that contain it.

4. The recognition method of claim 1, characterized in that the segmentation-recognition processing comprises:

Combining segmentation blocks, which merges consecutive atomic blocks one by one to form candidate pattern blocks, in order to restore single characters, or characters with a left-right structure, that were split apart by over-segmentation;

5. The recognition method of claim 4, characterized in that the image over-segmentation applies connected-component analysis, normalized overlap calculation, and projection analysis to over-segment the image, finally yielding a series of atomic blocks.

6. The recognition method of claim 4, characterized in that the character recognition further comprises:

A handwritten character classifier, for classifying candidate pattern blocks;

Confidence transformation, for computing the confidence of the recognition results.

7. The recognition method of claim 1, characterized in that the generating of candidate address words combines the candidate pattern recognition results, the character index table, and the address words stored in the word-level tree, and prunes the word-level tree accordingly.

8. The recognition method of claim 1, characterized in that the standard-format address recognition combines the candidate address words with the word-level tree, and a bottom-up search of the word-level tree combines the candidate address words and finally generates candidate addresses; the candidate address with the highest confidence is taken as the final address recognition result.