CN104598887B - Recognition methods for non-canonical format handwritten Chinese address - Google Patents
- Publication number
- CN104598887B (application CN201510044955.1A)
- Authority
- CN
- China
- Prior art keywords
- address
- word
- candidate
- recognition
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Character Discrimination (AREA)
Abstract
Description
Technical field
The invention belongs to the technical field of handwritten Chinese address recognition, and in particular relates to the recognition of handwritten Chinese addresses written in non-canonical formats.
Background
Chinese address recognition plays a key role in the automatic sorting of letters and parcels. Mail processing centers handle and dispatch large volumes of letters and parcels every day, so mail processing must be both fast and accurate. Although research on Chinese address recognition has made considerable progress, recognizing handwritten addresses on real mail remains a poorly solved problem. For example, there are a great many Chinese characters, writing styles vary widely, and adjacent characters in an address may be joined by connected strokes. In particular, the variability and irregularity of address writing formats greatly increase the difficulty of handwritten address recognition. Very little existing work addresses this aspect specifically.
Traditional handwritten Chinese address recognition methods aim to recognize every Chinese character in a given address image exactly as written. They require an address list that supplies contextual information for recognition. Each entry in the list is a complete address, and the entries are usually matched one by one against the recognition result of the input address image. To make address retrieval more efficient and to reduce the storage space of the address list, methods that store address information in a search-tree structure have been proposed. In these trees each node stores a single character, so they are also called character-level trees. Character-level trees have two drawbacks. First, they are sensitive to noise, because every character in the address image must be recognized in order. Second, recognition performance depends heavily on whether the candidate pattern blocks are matched accurately to the children of the root node. In short, address recognition based on a character-level tree relies on a predefined address list. If the address information in that list is incomplete, that is, it does not cover every writing-format variation of an address, or the list provides too little address information, the recognition rate of these methods drops sharply in practical applications.
An address usually consists of several address words, each defined as a basic administrative unit. For example, the canonically written address "上海市普陀区中山北路" (Zhongshan North Road, Putuo District, Shanghai) shown in Figure 2(a) contains the address words "上海市" (Shanghai City), "普陀区" (Putuo District), and "中山北路" (Zhongshan North Road). The last character of each address word is defined as its keyword, such as "省" (province), "市" (city), "区" (district), or "路" (road).
In practice, however, addresses on envelopes are written in many different ways, and people often do not follow the canonical address format. In Figure 2, for example, Figure 2(a) is the canonical written form of the address, while Figures 2(b) to 2(e) show various non-canonical forms that are considered reasonable in real use.
In summary, collecting all of these non-canonical written forms by hand is an almost impossible task.
Summary of the invention
To address the deficiencies of the prior art, the invention proposes a method based on a word-level tree structure that maps non-canonical handwritten Chinese addresses to their canonically written counterparts and thereby recognizes them, overcoming the limitations of traditional methods in recognizing non-canonical handwritten Chinese addresses.
The object of the invention is achieved as follows:
A method for recognizing handwritten Chinese addresses in non-canonical formats, comprising the following steps:
constructing a word-level tree, which represents and stores addresses in the canonical writing format;
constructing a character index table, which represents the association between individual characters and address words;
segmentation-recognition processing, which segments the image into character pieces, merges the segmented blocks into candidate pattern blocks, and performs character recognition on those candidate pattern blocks;
generating candidate address words, which yields candidate address words with high confidence;
canonical-format address recognition, which maps the handwritten address to be recognized to its canonically written form; wherein:
The word-level tree has a depth of 5. Level 1 is the root node, and levels 2 to 5 store the address words for the "省" (province), "市" (city), "区" (district), and "路" (road) names respectively, with each node storing one address word.
The character index table stores every character contained in the address words and associates each character with all address words that contain it.
The segmentation-recognition processing further comprises:
image over-segmentation, which splits the image into atomic blocks in order to separate overlapping or connected-stroke parts between handwritten Chinese characters;
merging segmented blocks, which merges consecutive atomic blocks into candidate pattern blocks in order to recover single characters, or characters with a left-right structure, that were split apart by over-segmentation;
character recognition, which recognizes the candidate pattern blocks and computes the confidence of the recognition results.
The image over-segmentation uses connected-component analysis, normalized overlap calculation, and projection analysis to over-segment the image into a series of atomic segmentation blocks.
The segmented blocks are merged by combining consecutive atomic segmentation blocks one by one into candidate pattern blocks.
The character recognition further comprises:
a handwritten character classifier, which classifies the candidate pattern blocks;
confidence transformation, which computes the confidence of the recognition results.
The candidate address words are generated by pruning the word-level tree using the candidate pattern recognition results, the character index table, and the address words stored in the word-level tree.
The canonical-format address recognition combines the candidate address words with the word-level tree and searches the tree bottom-up to combine candidate address words into candidate addresses. The candidate address with the highest confidence is taken as the final recognition result.
The invention overcomes the limitations of traditional methods in recognizing non-canonical handwritten Chinese addresses. The proposed word-level-tree method maps an address written in a non-canonical format to the corresponding canonical-format address, and thereby recognizes addresses written in non-canonical formats.
Description of drawings
Figure 1 is a flowchart of the invention;
Figure 2 shows examples of different ways of writing the address "上海市普陀区中山北路" (Zhongshan North Road, Putuo District, Shanghai);
Figure 3 is a schematic diagram of the word-level tree for canonically written addresses;
Figure 4 shows an example over-segmentation result for an address line image;
Figure 5 shows an example candidate pattern block diagram;
Figure 6 is a schematic diagram of candidate address word generation;
Figure 7 shows an example of the positions of candidate address words in the candidate pattern block diagram;
Figure 8 is a flowchart of the word-level tree path search;
Figure 9 shows an example of searching the word-level tree and generating candidate addresses;
Figure 10 shows example recognition results for handwritten Chinese addresses in non-canonical formats.
Detailed description
Figure 1 is a flowchart of an embodiment of the invention. The method comprises the following steps.
Constructing the word-level tree, which represents and stores addresses in the canonical writing format.
Administrative addresses in China form a top-down hierarchy that generally has 4 levels, corresponding to the "省" (province), "市" (city), "区" (district), and "路" (road) names. A tree of depth 5 is defined from this structure. The root node is empty, and levels 2 to 5 store the address words for the province, city, district, and road names respectively, one address word per node. In the word-level tree, a path from the root node to a leaf node corresponds to one canonically written address.
To handle addresses written with keywords omitted, the last character of each address word (except the character "路") is defined as optional. The constructed word-level tree is shown in Figure 3, where characters in brackets are optional.
In this word-level tree, once a leaf node (that is, a road name) has been recognized, all candidate addresses containing that road name can be obtained. For example, if the address word "中山北路" (Zhongshan North Road) is recognized, a bottom-up search of the word-level tree yields the address words "上海市" (Shanghai City), "普陀区" (Putuo District), "浙江省" (Zhejiang Province), "杭州市" (Hangzhou City), "下城区" (Xiacheng District), and so on. The related candidate addresses "上海市普陀区中山北路", "浙江省杭州市下城区中山北路", and so on can then be obtained. Furthermore, if the address word "普陀区" or "上海市" is also recognized, the candidate address "上海市普陀区中山北路" becomes more likely to be the recognition result, especially when both "普陀区" and "上海市" are recognized.
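The tree construction and the bottom-up lookup described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the names `Node`, `build_tree`, `leaves_named`, and `canonical_address` are hypothetical, optional keywords (the bracketed characters in Figure 3) are not modeled, and the three-word Shanghai path is an assumption, since the text does not say how a municipality fills the province level.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass(eq=False)
class Node:
    """One node of the word-level tree; the root stores an empty word."""
    word: str
    parent: Optional["Node"] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


def build_tree(addresses: Sequence[Sequence[str]]) -> Node:
    """Insert each canonical address (a sequence of address words) as a root-to-leaf path."""
    root = Node("")
    for path in addresses:
        node = root
        for word in path:
            if word not in node.children:
                node.children[word] = Node(word, parent=node)
            node = node.children[word]
    return root


def leaves_named(root: Node, road: str) -> List[Node]:
    """Collect every leaf whose address word equals the recognized road name."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if not node.children and node.word == road:
            found.append(node)
        stack.extend(node.children.values())
    return found


def canonical_address(leaf: Node) -> str:
    """Walk from a leaf up to the root and join the address words top-down."""
    words = []
    node = leaf
    while node.parent is not None:
        words.append(node.word)
        node = node.parent
    return "".join(reversed(words))


tree = build_tree([
    ("上海市", "普陀区", "中山北路"),            # three-word path: an assumption
    ("浙江省", "杭州市", "下城区", "中山北路"),
])
# Every canonical address that ends in the recognized road name.
candidates = sorted(canonical_address(leaf) for leaf in leaves_named(tree, "中山北路"))
```

Storing a parent pointer in each node is what makes the leaf-to-root walk cheap; a real address database would also need an index from road name to leaves instead of the full traversal used here.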
Constructing the character index table, which represents the association between individual characters and address words.
As shown in Table 1, the character index table has 3 columns. Column 2 lists every character that appears in the address words, column 1 gives the GB2312-80 code of the character in column 2, and column 3 lists all address words that contain that character. When a character is recognized, all address words containing it can be retrieved and used to generate the final candidate address words.
Table 1
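A character index table of this kind amounts to an inverted index from characters to address words. The sketch below is only illustrative: `build_char_index` and `gb2312_code` are hypothetical names, the word list is a small made-up sample, and the GB2312 code is obtained with Python's built-in `gb2312` codec as a stand-in for the GB2312-80 column of Table 1.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_char_index(address_words: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each character to the set of address words that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for word in address_words:
        for ch in word:
            index[ch].add(word)
    return index


def gb2312_code(ch: str) -> str:
    """Return the two-byte GB2312 encoding of a character as hexadecimal text."""
    return ch.encode("gb2312").hex().upper()


# Hypothetical sample of address words taken from the examples in the text.
words = ["上海市", "普陀区", "中山北路", "杭州市", "普陀路"]
index = build_char_index(words)

# Looking up a recognized character returns every address word containing it.
related_to_pu = index["普"]
```

With this structure, each recognized candidate character costs one dictionary lookup, which is what allows the later pruning steps to start from a short list of related address words.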
Image over-segmentation, which separates overlapping or connected-stroke parts between handwritten Chinese characters.
Connected-component analysis is first performed on the image. The normalized overlap between adjacent connected components is then computed to decide whether to merge them, since some of them may be different parts of the same character. Finally, projection analysis determines whether a connected component contains connected strokes and, if so, splits it. The overlapping parts of different characters and the strokes connecting them are separated as far as possible, yielding a series of atomic segmentation blocks. Figure 4 shows the segmentation result for Figure 2(d). In Figure 4 the atomic blocks are arranged from left to right, and each is numbered in order above the block.
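The merge decision in the second step might look like the sketch below. The text does not give the normalized overlap formula, so the definition used here (horizontal overlap divided by the narrower component's width) and the 0.5 threshold are assumptions, as are the names `normalized_overlap` and `merge_components`. Connected-component extraction and the projection-based splitting of connected strokes are not shown.

```python
from typing import List, Tuple

Box = Tuple[int, int]  # horizontal extent (left, right) of a connected component


def normalized_overlap(a: Box, b: Box) -> float:
    """Horizontal overlap of two components, normalized by the narrower one (assumed definition)."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return overlap / min(a[1] - a[0], b[1] - b[0])


def merge_components(boxes: List[Box], threshold: float = 0.5) -> List[Box]:
    """Merge left-to-right neighbours whose normalized overlap reaches the (hypothetical) threshold."""
    merged: List[Box] = []
    for box in sorted(boxes):
        if merged and normalized_overlap(merged[-1], box) >= threshold:
            last = merged[-1]
            merged[-1] = (min(last[0], box[0]), max(last[1], box[1]))
        else:
            merged.append(box)
    return merged


# A component nested inside another is merged; a separate one is kept apart.
blocks = merge_components([(0, 10), (2, 8), (12, 20)])
```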
Merging segmented blocks, which recovers single characters, or characters with a left-right structure, that were split apart by over-segmentation.
After over-segmentation, consecutive atomic blocks are combined to generate candidate pattern blocks, as shown in Figure 5. All candidate pattern blocks form a set $P = \{p_{(1,1)}, p_{(1,2)}, p_{(1,3)}, p_{(2,1)}, p_{(2,2)}, p_{(2,3)}, \ldots, p_{(m,n)}, \ldots, p_{(l,q)}\}$, where $(m,n)$ indexes the atomic blocks ($1 \le m \le l$, $1 \le n \le q$), $l$ is the total number of atomic blocks, and $q$ is the maximum number of atomic blocks a candidate pattern block may contain. In this embodiment $q$ is set to 3.
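Enumerating the set P can be sketched as below. The reading of the index is an assumption: m is taken as the starting atomic block and n as the number of consecutive atomic blocks merged, and combinations that would run past the last atomic block are dropped. The function name `candidate_blocks` is hypothetical.

```python
from typing import List, Tuple


def candidate_blocks(l: int, q: int = 3) -> List[Tuple[int, int]]:
    """List (m, n) pairs: a block starting at atomic block m that merges n consecutive blocks."""
    return [
        (m, n)
        for m in range(1, l + 1)
        for n in range(1, q + 1)
        if m + n - 1 <= l  # assumed boundary rule: do not run past the last atomic block
    ]


# With 4 atomic blocks and q = 3 there are 3 + 3 + 2 + 1 candidate pattern blocks.
blocks = candidate_blocks(4)
```

Because q is a small constant, the number of candidate pattern blocks grows linearly with the number of atomic blocks, so the classifier is run at most q times per atomic block.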
Character recognition, which recognizes the candidate pattern blocks and computes the confidence of the recognition results.
In the candidate pattern block diagram, a character classifier recognizes each candidate pattern block and generates a series of candidate characters. For unconstrained handwritten Chinese characters with a large number of classes, the MQDF (modified quadratic discriminant function) method is currently the most practical approach, but it needs a relatively large amount of storage for character features. The invention combines MQDF discriminative learning with a shared distribution subspace method, reducing the space occupied by stored character features without lowering the recognition rate.
The confidence of character recognition, that is, the posterior probability p(w|x), where w is the recognized character and x is the image feature vector, is very important for string recognition, but it cannot be obtained directly from the output of the MQDF classifier. A confidence transformation is therefore needed to convert the classifier output into a posterior probability. The invention applies a sigmoidal function in the confidence transformation, so that the posterior probability of a character recognition result can be expressed as formula (1),
where M is the total number of character classes, d_j(x) is the classifier's output score for class w_j, and α and β are confidence parameters that can be optimized by minimizing the cross-entropy (CE) loss function. According to Dempster-Shafer (D-S) theory, the character confidence can be computed as formula (2).
Finally, for each pattern block the 20 candidate characters with the highest confidence are taken as the recognition result and sorted in descending order of confidence. Table 2 lists the recognition results for the candidate pattern blocks in Figure 5. Some pattern blocks have an empty recognition result because their shape, that is, their aspect ratio, is enough to decide directly whether they can be a reasonable character. For example, pattern blocks p_(10,1), p_(15,1), p_(17,1), and p_(19,1) are not reasonable characters.
Table 2
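The ranking and shape check described above could be sketched as follows. The formulas of the confidence transformation are not reproduced in this text, so `sigmoid_confidence` uses one commonly seen sigmoidal form purely as an assumption, and it may differ from the patent's formula. The aspect-ratio bounds in `plausible_shape` are likewise hypothetical values; only the top-20 cut-off comes from the text.

```python
import math
from typing import List, Tuple


def sigmoid_confidence(distance: float, alpha: float, beta: float) -> float:
    """Map a classifier distance to (0, 1); the exact form is an assumption."""
    z = alpha * distance + beta
    if z > 700:  # math.exp would overflow; the confidence is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def top_candidates(scored: List[Tuple[str, float]], k: int = 20) -> List[Tuple[str, float]]:
    """Keep the k candidate characters with the highest confidence, in descending order."""
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


def plausible_shape(width: float, height: float, lo: float = 0.3, hi: float = 2.0) -> bool:
    """Reject pattern blocks whose aspect ratio is implausible for a character (bounds are hypothetical)."""
    if height <= 0:
        return False
    return lo <= width / height <= hi


ranked = top_candidates([("a", 0.2), ("b", 0.9), ("c", 0.5)], k=2)
```

Checking the shape before calling the classifier leaves the recognition result of implausible blocks empty without spending any classification time on them.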
Generating candidate address words, in which the word-level tree is pruned using the candidate pattern recognition results, the character index table, and the address words stored in the word-level tree.
The word-level tree implicitly represents an address list (AW_O) that stores all address words, as shown in Figure 6(a). Starting from this list, candidate address words are generated in 3 steps. First, AW_O is pruned according to the association between recognized candidate characters and address words, and the associated address words form a new address word list AW_R. Second, AW_R is pruned further by checking the positional constraints between the recognized candidate characters of each address word and the candidate pattern blocks, giving the list AW_P. Finally, a score is computed for each address word in AW_P, and the words whose score exceeds a preset threshold are stored in the list AW_C; the address words in AW_C are the final candidate address words. The steps are detailed below:
(1) Generating AW_R: based on the recognition results of the candidate patterns, every address word in AW_O that fails the condition on its recognized characters is deleted, where nr is the number of characters of an address word that have been recognized and nl is the number of characters the address word contains. The remaining candidate address words form the list AW_R (Figure 6(b)).
(2) Generating AW_P: the positional constraints between the recognized characters of a candidate address word and the corresponding pattern blocks in the image must be taken into account. If the pattern blocks matched by the recognized characters of a candidate address word in AW_R are located in the image in a way that violates the positional constraints, that candidate address word is deleted. The remaining candidate address words form the list AW_P (Figure 6(c)).
(3) Generating AW_C: the invention proposes a method for computing address word scores, described below, which is applied to the address words in AW_P. If the score of a candidate address word in AW_P is below a predefined empirical threshold, that word is deleted. The remaining address words form the list AW_C, and each address word in AW_C is defined as a final candidate address word (Figure 6(d)).
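The three pruning steps above can be sketched as a small pipeline. Several details are assumptions rather than the patent's definitions: the recognized-character condition is modeled as nr/nl reaching a hypothetical `min_ratio`, the positional constraint is modeled as "recognized characters appear left to right in word order", the word scores are passed in as given numbers, and a single fixed threshold replaces the word-length-dependent threshold. Scores 14.66 and 20.19 are taken from the worked example in the text; the positions, the score 4.0, and the threshold 5.0 are made up.

```python
from typing import Dict, List


def prune_by_recognition(aw_o: List[str], recognized: Dict[str, List[int]],
                         min_ratio: float = 0.5) -> List[str]:
    """AW_O -> AW_R: keep words with enough recognized characters (min_ratio is hypothetical)."""
    aw_r = []
    for word in aw_o:
        nr = sum(1 for ch in word if ch in recognized)  # recognized characters
        nl = len(word)                                   # characters in the word
        if nr / nl >= min_ratio:
            aw_r.append(word)
    return aw_r


def positions_consistent(word: str, recognized: Dict[str, List[int]]) -> bool:
    """Assumed constraint: recognized characters must occur left to right in word order."""
    last = -1
    for ch in word:
        if ch not in recognized:
            continue
        later = [p for p in recognized[ch] if p > last]
        if not later:
            return False
        last = min(later)  # the leftmost valid match keeps the most room for later characters
    return True


def prune_by_position(aw_r: List[str], recognized: Dict[str, List[int]]) -> List[str]:
    """AW_R -> AW_P: drop words whose matched pattern blocks are out of order."""
    return [w for w in aw_r if positions_consistent(w, recognized)]


def prune_by_score(aw_p: List[str], scores: Dict[str, float], threshold: float) -> List[str]:
    """AW_P -> AW_C: drop words scoring below the threshold."""
    return [w for w in aw_p if scores[w] >= threshold]


# Recognized characters and the (hypothetical) positions of their pattern blocks.
recognized = {"普": [3], "陀": [4], "区": [5], "中": [6], "山": [7],
              "北": [8], "路": [9], "州": [7], "杭": [8]}
aw_o = ["普陀区", "普陀路", "中山北路", "杭州市", "下城区"]
aw_r = prune_by_recognition(aw_o, recognized)
aw_p = prune_by_position(aw_r, recognized)
aw_c = prune_by_score(aw_p, {"普陀区": 14.66, "普陀路": 4.0, "中山北路": 20.19}, threshold=5.0)
```

In this toy run "下城区" is dropped in step 1, "杭州市" in step 2 because its characters appear in the wrong order, and "普陀路" in step 3 because of its low score.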
The address word score is computed by formula (3).
The formula takes two factors into account: the proportion of the address word's characters that have been recognized, and the confidence of the segmentation blocks. The single-character confidence is calculated by formula (2). The ratio nr/nl accounts for the proportion of recognized characters in the address word. In addition, if all characters of the address word are recognized and their matches to pattern blocks are in reasonable relative positions, the confidence of the address word is increased by a constant v (1 ≤ v ≤ 4). SC is the confidence of the segmentation block, defined by formula (4),
where m is the number of consecutive atomic blocks that make up a pattern block, provided that the combination of these atomic blocks contains no connected strokes, and pw/ph is the aspect ratio of the pattern block.
To reduce the recognition error rate, every address word in AW_P whose score is below a threshold ε is deleted. ε is defined by formula (5),
where nl is the number of characters in the candidate address word and the remaining parameter is an empirical threshold. Repeated tests showed that setting this parameter to 2.5 gives the recognition system its best performance.
Figure 7 shows the positions, in the candidate pattern block diagram, of the candidate address words generated by this step.
Canonical-format address recognition, which maps the handwritten address to be recognized to its canonically written form.
When a handwritten Chinese address is recognized, all of its non-canonical forms can be mapped to one path in the word-level tree. Once the candidate address words have been generated, the tree is searched using their node relationships in the word-level tree to generate canonically written candidate addresses. The search is bottom-up, from the leaf nodes of the tree (the road names) towards the root. This step can yield several candidate addresses, and the score of each is the sum of the scores of the recognized candidate address words it contains. Finally, the candidate address with the highest score is taken as the recognition result. Figure 8 shows the detailed flow of this step.
The invention uses 4 lists, denoted PR, CI, DI, and RO, to store the address words for the province, city, district, and road names respectively. A triple TN = {CN, PN, AS} represents a node of the search space, where CN points to the current node of the word-level tree, PN points to the parent of CN, and AS is the address word score accumulated during the search. For a candidate word W, its leftmost pattern block lp(W) and rightmost pattern block rp(W) correspond to its first and last matched characters respectively. Whether the address words of a parent node and a child node are in a reasonable positional relationship in the pattern block diagram is decided by whether the rp of the parent is smaller than the lp of the child. Positions of address words are numbered in ascending order from left to right.
Before searching, the list RO is checked. If RO is empty, that is, no road name has been recognized, then AS = 0, the search stops, and the input is rejected. Otherwise the address words stored in RO are searched one by one. First, CN points to an address word in RO, AS is initialized to that word's score, and PN points to the parent of CN. The search then distinguishes two cases: PN ∈ DI or PN ∉ DI. If PN ∉ DI, the candidate address word pointed to by PN has not been recognized; in that case PN moves directly to the parent of the node it points to, and the search continues. If PN ∈ DI, the candidate address word pointed to by PN has been recognized. If the address words pointed to by PN and CN then satisfy the positional relationship rp(PN) < lp(CN), AS becomes the sum of the scores of the two address words, CN moves to the word-level tree node of PN, and PN moves to the parent of CN. Otherwise, if the two words do not satisfy the positional relationship, PN moves directly to the parent of the node it points to, and the search continues. When PN points to the root of the tree, this search ends. The canonical address obtained by this reverse search from leaf to root is a candidate address result, with AS as its score. A pair RS = {ξ, AS} stores the search result, where ξ is the canonical candidate address obtained by the current search.
Figure 9 illustrates the search in the word-level tree. For example, the search starts from the leaf node "中山北路", and AS equals that address word's score, 20.19. The address word "普陀区" pointed to by PN has been recognized as a candidate address word and rp("普陀区") < lp("中山北路"), so AS becomes 34.85 (= 20.19 + 14.66). The search along this path finally yields the candidate address "上海市普陀区中山北路" with a score of 48.41 (= 20.19 + 14.66 + 13.56), the highest of all candidate addresses, so it is the final recognition result. Through the path search, address words that were not recognized are also included in the candidate address, but they contribute nothing to the accumulated score.
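The bottom-up search and the worked example can be sketched as below. This is a simplified reading, not the patent's implementation: each canonical address is represented as a tuple of address words rather than as tree nodes, candidate words are keyed by their canonical spelling (keyword-omitted forms are not modeled), and the `lp`/`rp` positions are hypothetical. The scores 20.19, 14.66, and 13.56 come from the example; attributing 13.56 to "上海市" is inferred from the sum given in the text. The names `Candidate`, `search_from_leaf`, and `recognize` are illustrative.

```python
from typing import Dict, NamedTuple, Optional, Sequence, Tuple


class Candidate(NamedTuple):
    score: float  # address word score
    lp: int       # position of the leftmost matched pattern block
    rp: int       # position of the rightmost matched pattern block


def search_from_leaf(path: Sequence[str],
                     candidates: Dict[str, Candidate]) -> Tuple[str, float]:
    """Walk one root-to-leaf path from the leaf upward, accumulating AS."""
    cn = path[-1]                         # CN starts at the recognized road name
    total = candidates[cn].score          # AS starts as the road name's score
    for pn in reversed(path[:-1]):        # PN climbs towards the root
        cand = candidates.get(pn)
        if cand is not None and cand.rp < candidates[cn].lp:
            total += cand.score           # recognized and positioned to the left of CN
            cn = pn
        # otherwise PN is skipped: it stays in the address but adds no score
    return "".join(path), total


def recognize(paths: Sequence[Sequence[str]],
              candidates: Dict[str, Candidate]) -> Optional[Tuple[str, float]]:
    """Search from every recognized road name; return the best (address, AS) or None for rejection."""
    results = [search_from_leaf(p, candidates) for p in paths if p[-1] in candidates]
    if not results:
        return None
    return max(results, key=lambda r: r[1])


paths = [
    ("上海市", "普陀区", "中山北路"),
    ("浙江省", "杭州市", "下城区", "中山北路"),
]
candidates = {
    "中山北路": Candidate(20.19, lp=6, rp=9),  # positions are hypothetical
    "普陀区": Candidate(14.66, lp=3, rp=5),
    "上海市": Candidate(13.56, lp=0, rp=2),
}
address, score = recognize(paths, candidates)
```

The Zhejiang path also ends in the recognized road name, but none of its upper-level words were recognized, so it keeps only the road name's score and loses to the Shanghai path.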
有一些候选地址词在分割模式框架中的位置可能会重叠(如图7)。如果是同一等级的地址词重叠,不影响树的搜索,因为它们之间不是上下级的关系,在树中不对应父节点与子节点的关系,所以在模式框图中可以得到不同的路径,比如:“上海市”和“上海”,“普陀区”和“普陀”。相反,如果这两个地址词是不同等级的,它们可能对应同一路径的父节点与子节点关系,比如:“普陀区”和“普陀路”。这种情况下,优先级低的词在搜索过程中将被跳过,同时此路径也不对这个低优先级地址词的分数进行累加。在本发明中,地址词的优先级是随着结点层数的增加而增加,如此,表示路名的地址词的优先级为最高。The positions of some candidate address words in the segmentation pattern frame may overlap (as shown in Figure 7). If the address words of the same level overlap, it will not affect the search of the tree, because there is no relationship between them, and there is no relationship between the parent node and the child node in the tree, so different paths can be obtained in the pattern diagram, such as : "Shanghai City" and "Shanghai", "Putuo District" and "Putuo". On the contrary, if the two address words are of different levels, they may correspond to the relationship between the parent node and the child node of the same path, for example: "Putuo District" and "Putuo Road". In this case, the low-priority word will be skipped during the search process, and the path will not accumulate the score of this low-priority address word. In the present invention, the priority of the address word increases with the increase of the number of node layers, so the priority of the address word representing the road name is the highest.
Once every address word in RO has been searched in the tree, several candidate addresses have been generated. Only the candidate address with the highest score is taken as the recognition result. The recognition result is denoted by S and defined as
$$S = \arg\max\, \xi\left(AS_i \mid i = 1, 2, \ldots, n\right) \tag{6}$$
where n is the total number of candidate addresses generated; in Figure 7, n = 5. $AS_i$ reaches its maximum of 48.41 at i = 3, so the corresponding canonical address "上海市普陀区中山北路" (Zhongshan North Road, Putuo District, Shanghai) is the final recognition result for Figure 2(d).
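The selection in Equation (6) amounts to taking the maximum over the accumulated scores. A minimal sketch follows, with `best_candidate` as a hypothetical helper. Only the third entry (48.41) comes from the text; the other four addresses and scores are placeholders chosen for illustration.

```python
from typing import List, Tuple


def best_candidate(candidates: List[Tuple[str, float]]) -> Tuple[int, str, float]:
    """Return (1-based index, address, AS) of the highest-scoring candidate."""
    if not candidates:
        raise ValueError("no candidate addresses were generated")
    idx = max(range(len(candidates)), key=lambda i: candidates[i][1])
    address, score = candidates[idx]
    return idx + 1, address, score


# n = 5 as in Figure 7; entries other than i = 3 are placeholders.
candidates = [
    ("<candidate 1>", 30.00),
    ("<candidate 2>", 25.00),
    ("上海市普陀区中山北路", 48.41),
    ("<candidate 4>", 20.19),
    ("<candidate 5>", 34.85),
]
i, address, score = best_candidate(candidates)
print(i, address, score)
```

If two candidates tie, this sketch returns the earlier one; the source does not say how ties are broken.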
Figure 10 shows the recognition results for the handwritten Chinese address line images in Figure 2. As Figure 10 shows, the present invention recognizes all three types of non-canonical handwritten address formats as the canonical address "上海市普陀区中山北路" (Zhongshan North Road, Putuo District, Shanghai).
Claims (3)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510044955.1A CN104598887B (en) | 2015-01-29 | 2015-01-29 | Recognition methods for non-canonical format handwritten Chinese address |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104598887A (en) | 2015-05-06 |
CN104598887B (en) | 2017-11-24 |
Family
ID=53124660
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510044955.1A Active CN104598887B (en) | 2015-01-29 | 2015-01-29 | Recognition methods for non-canonical format handwritten Chinese address |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104598887B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503634B (en) * | 2016-10-11 | 2020-02-14 | 讯飞智元信息科技有限公司 | Image alignment method and device |
CN107133215A (en) * | 2017-05-20 | 2017-09-05 | 复旦大学 | A kind of Chinese canonical address recognition methods of offline handwriting |
WO2019165644A1 (en) * | 2018-03-02 | 2019-09-06 | 福建联迪商用设备有限公司 | Address error correction method and terminal |
CN108647263B (en) * | 2018-04-28 | 2022-04-12 | 淮阴工学院 | Network address confidence evaluation method based on webpage segmentation crawling |
CN109961259B (en) * | 2019-03-28 | 2021-07-27 | 上海中通吉网络技术有限公司 | Address standardization processing method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6327386B1 (en) * | 1998-09-14 | 2001-12-04 | International Business Machines Corporation | Key character extraction and lexicon reduction for cursive text recognition |
CN101645134A (en) * | 2005-07-29 | 2010-02-10 | 富士通株式会社 | Integral place name recognition method and integral place name recognition device |
CN102289467A (en) * | 2011-07-22 | 2011-12-21 | 浙江百世技术有限公司 | Method and device for determining target site |
CN103678708A (en) * | 2013-12-30 | 2014-03-26 | 小米科技有限责任公司 | Method and device for recognizing preset addresses |
Non-Patent Citations (1)
Title |
---|
中文邮政地址识别研究;娄正良;《中国优秀博士学位论文全文数据库》;20070215;第I139-72页 * |
Also Published As
Publication number | Publication date |
---|---|
CN104598887A (en) | 2015-05-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||