CN1025764C - Character distinguishing method - Google Patents

Publication number
CN1025764C
Authority
CN
Grant status
Grant
Patent type
Prior art keywords
character
distinguishing
method
character distinguishing
distinguishing method
Prior art date
Application number
CN 92103651
Other languages
Chinese (zh)
Other versions
CN1066335A (en)
Inventor
杨源远
路浩如
杨震
杨平勇
李璇
Original Assignee
浙江大学
Priority date
Filing date
Publication date
Grant date

Abstract

The present invention relates to a character recognition method and system. Stroke features are extracted from the character image and used directly to classify the character and to recognize it by matching. The structural semantics of each character are represented as knowledge in the form of a frame. The frame emphasizes the strokes and stroke connections that matter most, disregards strokes of little significance, and specifies the stroke directions in which distortion is permitted as well as the comparison conditions needed to discriminate similar characters. This highlights the differences between characters while simplifying the matching and recognition process, and gives a higher recognition rate and better adaptability than the character recognition techniques currently in general use.

Description

The present invention relates to a character recognition method and system, and is particularly suited to recognizing handwritten Chinese characters and multi-font printed Chinese characters.

Several character recognition systems developed in China and abroad mainly extract feature parameters from the pixel distribution of the character image and then classify and match on the basis of those parameters. Examples are the character recognition system of Chinese examined patent publication CN1003257B (8 February 1989) and the technique disclosed in Chinese examined patent publication CN1010512B (21 November 1990).

Such conventional techniques therefore have the following problems:

1. They do not directly reflect the structural features of a character, and so neglect stroke structure, which is the essential property of how a character is composed.

2. With a large character set it is difficult to reach a high recognition rate.

3. Distinguishing characters of similar shape, or characters with a complex stroke structure, is very difficult.

4. For handwritten characters the written form varies greatly, so the extracted feature parameters are widely scattered and high-dimensional feature vectors are required.

The object of the invention is to create a character recognition method that extracts the stroke features of the character image as accurately as possible, so that they fully reflect the structural nature of the character; that uses the character's stroke-structure semantics directly for classification and for recognition by matching; and that represents the structural semantics as knowledge, thereby simplifying the matching process and improving both the accuracy with which similar characters are told apart and the adaptability of the method.

The character recognition method of the invention comprises six steps. First, the page on which the characters are written is scanned to obtain a character image. Second, the character image is binarized, segmented into characters, and normalized. Third, the stroke-structure features of each binarized character dot matrix are extracted. Fourth, classification feature codes are derived from the structural features to determine the class to which the character belongs. Fifth, the structural features are matched against the character models of that class and the character is recognized. Sixth, the recognition result is converted into visible output.

The third step comprises:

1. The structural pattern of a character, taken as a whole, can be decomposed into three kinds of sub-pattern: meta-characters, strokes, and stroke elements. A meta-character is a character from which other characters are built. A stroke decomposed into straight line segments yields stroke elements. The stroke element is the lowest-level sub-pattern and serves as the structural primitive for describing a character pattern; its structural features are its center coordinates, length, direction, and connection relations.

2. The character dot matrix is scanned once. For each pixel, its connections to neighboring pixels in eight directions are examined, and the pixel is classified as a stroke start, a stroke end, part of a connection region, or an ordinary stroke pixel, and marked with the corresponding symbol. This converts the character dot-matrix plane (CDP) into the character pixel-attribute plane (CAP).

3. For every pixel on the CAP that is an edge point, other than those belonging to a connection region, the number of consecutive pixels e_n is computed in each of the four directions "|", "-", "/", and "\". The direction with the largest e_n is taken as the main fiber direction of that edge point, and the value of e_n in the main direction is called the fiber length. The pixels connected along the fiber are assigned the weight corresponding to the main direction. Fibers from different edge points may cross and form an interweaving region, where the direction weights of the pixels accumulate. Once all edge points have been processed, the character fiber structure map (CFP) is obtained.

4. Using the direction features of the CAP connection regions as a reference, noise fibers are removed from the CFP, and the fibers belonging to the directions "|", "-", "/", and "\" are placed in the four planes v, h, s, and b respectively. The center coordinates, length, and direction of every stroke element can then be obtained.

5. From the endpoint and connection-region features of the CAP, together with the center coordinates, lengths, and directions already obtained, the connection relations of the stroke elements are computed.
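
The fiber computation in item 3 of the third step can be illustrated with a small sketch. It is not the patented implementation: it assumes the character is a list of rows of 0/1 values, that the run of consecutive pixels is counted through the point in both senses of each direction, and that ties are broken by the order v, h, s, b. The function names are hypothetical.

```python
# Direction name -> (row step, column step). The pairing of "|", "-", "/", "\"
# with the planes v, h, s, b follows the order given in the text.
DIRECTIONS = {"v": (1, 0), "h": (0, 1), "s": (-1, 1), "b": (1, 1)}


def run_length(grid, r, c, dr, dc):
    """Count contiguous 1-pixels through (r, c) along the line +/-(dr, dc)."""
    if not grid[r][c]:
        return 0
    rows, cols = len(grid), len(grid[0])
    n = 1  # the pixel itself
    for sign in (1, -1):
        rr, cc = r + sign * dr, c + sign * dc
        while 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc]:
            n += 1
            rr += sign * dr
            cc += sign * dc
    return n


def main_fiber(grid, r, c):
    """Return (main direction, fiber length) for the pixel at (r, c)."""
    lengths = {name: run_length(grid, r, c, dr, dc)
               for name, (dr, dc) in DIRECTIONS.items()}
    direction = max(lengths, key=lengths.get)  # first maximum wins on ties
    return direction, lengths[direction]
```

Calling `main_fiber` on the left end of a horizontal bar returns the direction `"h"` and the bar's length. Accumulating weights in interweaving regions and removing noise fibers are left out of this sketch.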

The fourth step comprises:

1. The four-corner features and four-side features of the character's peripheral structure are used as classification features, so that the peripheral structure is described and classified at two levels. A pre-classification dictionary is built from the four-corner and four-side features of known characters.

2. On the character stroke plane (CSP), taking each of the four corners of the plane as a center, the stroke element nearest to that corner is searched for.

3. The direction attribute of the stroke element nearest each corner is determined and assigned to one of six types: horizontal (横), vertical (竖), left-falling (撇), right-falling (捺), corner (角), and crossing (交). Each type is given a code, called the corner code. The string of four corner codes is the first classification feature of the character.

4. On the CSP, rays are cast from the center and swept clockwise. The polygon formed by the rays and the outermost stroke elements of the character is taken as the peripheral contour. Its convex points exceeding a certain threshold are extracted and the number of convex points on each side is counted; the resulting four-side code string is the second classification feature of the character.

5. The pre-classification dictionary is searched for the codes of the characters whose four-corner code and four-side code are identical to those of the character to be recognized. This completes the fourth step.
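
The dictionary lookup that closes the fourth step can be sketched as a mapping from the pair (four corner codes, four side codes) to the characters that share it. This is only an illustration under stated assumptions: the numeric code values and the placeholder character labels below are invented, since the actual code table of Fig. 6 is not reproduced here, and the function names are hypothetical.

```python
from collections import defaultdict


def build_dictionary(samples):
    """Build the pre-classification table.

    samples: iterable of (character, corner_codes, side_codes), where each
    code sequence has four entries.
    """
    table = defaultdict(list)
    for char, corners, sides in samples:
        table[(tuple(corners), tuple(sides))].append(char)
    return dict(table)


def preclassify(table, corners, sides):
    """Return the characters whose corner and side codes both match exactly."""
    return table.get((tuple(corners), tuple(sides)), [])
```

An exact-match key keeps the lookup to a single hash probe, which fits the role of pre-classification as a cheap filter before frame matching. An empty result means no class was found for the input codes.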

The fifth step comprises:

1. The structural semantics of a character are represented as knowledge in the form of frames, with one character frame for each character pattern. In the frame, all stroke elements of the character are grouped and ordered on the four planes h, v, s, and b, and the frame lists the necessary stroke connection relations and the conditions for discriminating the stroke-element features of similar characters. Every stroke element that takes part in the grouping and ordering is described by a stroke-element frame, which gives the element's normal direction, center position, and length, together with its weight and its permitted distortion direction. The necessary connection relations in the character frame and the weights in the stroke-element frames are where knowledge is applied: they emphasize the stroke elements and connections that strongly affect the recognition result and disregard components that are redundant or have little effect. The similar-character discrimination conditions and the permitted distortion directions let the recognition process both pick out subtle differences in stroke structure within a large set of structurally complex characters and adapt well to widely varying character forms.

2. The character models of the pre-classified class are fetched and matched in turn against the stroke-element features of the character to be recognized, and the attribute distance is computed. If the distance is below a threshold the match succeeds; otherwise it fails. This is carried out in turn on the four stroke-element sub-planes of each model until all are finished.

3. The weighted distance of the stroke-element attributes is computed using the weights specified in the stroke-element frames. Stroke elements that are critical to the character's structure carry the highest weights, which makes subtle stroke differences between characters easier to detect; stroke elements of little influence carry smaller weights, so that redundant strokes are effectively ignored.

4. If an unmatched stroke element has a permitted distortion direction, matching is retried on the sample sub-plane of that direction.

5. The necessary connection relations are checked; a character that does not satisfy them is removed from the list of match candidates.

6. The stroke-element comparison conditions and the similar-character discrimination conditions are checked; a character that does not satisfy them is removed from the list of match candidates.

7. All characters whose total matching distance lies within the threshold range are sorted by increasing distance, and the few with the smallest distances are taken as recognition candidates. If there is no recognition candidate, the character is rejected.
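
Items 3 and 7 of the fifth step, the weighted distance and the ranking of candidates, can be sketched as follows. The text does not define the attribute distance, so the sum of absolute differences in center coordinates and length used here is an assumption, as are the pairing of model and input stroke elements by position, the dictionary field names, and the default of three candidates.

```python
def attribute_distance(model_stroke, input_stroke):
    """Assumed attribute distance: L1 difference of center and length."""
    (mx, my), mlen = model_stroke["center"], model_stroke["length"]
    (ix, iy), ilen = input_stroke["center"], input_stroke["length"]
    return abs(mx - ix) + abs(my - iy) + abs(mlen - ilen)


def weighted_distance(model, strokes):
    """Sum of per-stroke distances, each scaled by the model stroke's weight.

    Model and input stroke elements are paired by position (an assumption).
    """
    return sum(m["weight"] * attribute_distance(m, s)
               for m, s in zip(model, strokes))


def rank_candidates(models, strokes, total_threshold, keep=3):
    """Return up to `keep` model names, nearest first; [] means rejection."""
    scored = [(weighted_distance(model, strokes), name)
              for name, model in models.items()]
    within = sorted(item for item in scored if item[0] <= total_threshold)
    return [name for _, name in within[:keep]]
```

Because each distance is multiplied by the model's weight, a mismatch on a heavily weighted stroke element pushes a model out of the threshold quickly, while a mismatch on a lightly weighted one barely changes the total.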

The distinctive advantages of the invention can be summarized as follows. Stroke-structure features are extracted accurately and therefore fully reflect the essential nature of the character. Stroke features are used directly to describe the structural skeleton of the character, while stroke attribute vectors accommodate the many variations in its form, so that both classification and matching recognition are achieved. Representing the structural-semantic model of the character as knowledge in frame form makes it easy to emphasize important strokes or stroke connections and to disregard strokes that contribute little to recognition, which highlights the differences between characters and simplifies matching. The frame also expresses the conditions for discriminating similar characters, making it possible to detect subtle stroke differences between pairs such as 风 and 凤, 士 and 土, or 澜 and 谰, and thereby greatly raising the recognition rate. The stroke frame additionally specifies the directions in which distortion is permitted, which markedly improves the flexibility and adaptability of recognition. Compared with the prior art, the method avoids the limit on recognition rate that statistical methods face because high-dimensional features make feature selection and pattern separability difficult, and it also avoids the weakness of structural methods, which adapt poorly to variable character forms.

The embodiment of the invention consists of an image scanner, a microcomputer host, a display, a printer, a tape drive, and the associated interface boards. Any type of scanner, including hand-held scanners, can be used. The microcomputer host most commonly runs the DOS operating system. The tape drive is not essential; it may be used optionally as an extension of, or backup for, the host's storage. The operation of the system is explained step by step with reference to the drawings below.

Brief description of the drawings

Fig. 1 is a block diagram of an embodiment of the invention.
Fig. 2 is a flowchart of structural feature extraction.
Fig. 3 is an example of structural feature extraction.
Fig. 4 describes the stroke-element connection relations.
Fig. 5 is a flowchart of pre-classification.
Fig. 6 is the table of four-corner feature codes.
Fig. 7 is the character frame.
Fig. 8 is a diagram of the conditional ordering structure of stroke elements.
Fig. 9 is a flowchart of the conditional ordering of stroke elements.
Fig. 10 is the stroke-element frame.
Fig. 11 is a flowchart of knowledge-guided matching and recognition.
Fig. 12 is a flowchart of stroke-element matching on sub-plane h.

Fig. 1 is the system block diagram of the embodiment. Characters written on paper are scanned page by page with the image scanner. Each page yields one image file, which is converted to a binary (0, 1) dot matrix using the selected gray-level threshold and stored in the computer through the interface board. The page segmentation module searches the dot matrix for the starting row, the total number of text lines, the start of each character, and the number of characters, and segments the characters automatically. After normalization, a dot matrix is obtained for each character (for example 32×32 or 64×64). Stroke features are extracted from each character dot matrix and used for classification and matching to recognize the character, until every character dot matrix stored in the machine has been recognized; the results are expressed as internal codes. Finally, the recognition results for all characters written on the sample page are displayed or printed in a standard font, or passed on for any necessary editing.
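
The binarization and the search for text lines described for Fig. 1 can be sketched in a few lines. This is a minimal illustration, assuming the scanned page is a list of rows of gray levels in which ink is darker than paper, and assuming that text lines are found by looking for runs of rows that contain any ink. The function names are hypothetical.

```python
def binarize(gray, threshold):
    """Map gray levels to a 0/1 dot matrix; pixels darker than threshold become 1."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def text_line_spans(bitmap):
    """Return (first_row, last_row) for each run of rows containing ink."""
    spans, start = [], None
    for i, row in enumerate(bitmap):
        has_ink = any(row)
        if has_ink and start is None:
            start = i                      # a text line begins
        elif not has_ink and start is not None:
            spans.append((start, i - 1))   # the text line ended on the previous row
            start = None
    if start is not None:                  # line runs to the bottom of the page
        spans.append((start, len(bitmap) - 1))
    return spans
```

The number of spans gives the total number of text lines and the first span gives the starting row. Applying the same scan to the columns of one text line would locate individual characters, again under the assumption that blank columns separate them.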

Fig. 2 is the flowchart of structural feature extraction. The process starts from the normalized character dot matrix (CDP). The rows and columns of the CDP are scanned, and the number X of consecutive pixels of value 1 is measured in both the row and column directions; the most frequent X is recorded as the stroke width w_i. When the run of consecutive pixels in the row or column direction, measured against the stroke width, is shorter than w_i, the pixel is marked "|" or "-" respectively. For each "-" pixel, its two sides are checked for 0: if the left side is 0 the pixel is a left endpoint and is marked "W"; if the right side is 0 it is a right endpoint and is marked "E". For each "|" pixel, the pixels above and below are checked for 0: if the one above is 0 the pixel is an upper endpoint, marked "N"; if the one below is 0 it is a lower endpoint, marked "S". All pixels in the CDP that are neither "-" nor "|" are marked with lowercase English letters in the coordinate order of their regions. The regions marked with letters are the connection regions of the strokes, and the features of each connection region are computed. Once every pixel of the CDP has been marked with its designated symbol, and so has been given an attribute such as stroke start, stroke end, connection region, or ordinary pixel, the result is called the character pixel-attribute plane (CAP).

Fig. 3 shows an example of structural feature extraction for the handwritten character 毗. The upper left is the CAP. Below it is the connection-region feature table: the first column is the serial number, the second the connection-region letter, the third and fourth the starting and ending column coordinates, and the fifth and sixth the starting and ending horizontal coordinates. The last column is the connection feature of the region, expressed by the codes shown in Fig. 4.

For every edge point of the CAP, other than pixels in connection regions, the number of consecutive non-zero pixels is computed in four directions: row, column, left-slanting, and right-slanting. The direction with the most pixels is taken as the main fiber direction of that edge point, the number of pixels connected in the main direction is the fiber length, and each of those pixels is assigned the weight corresponding to the main direction. Fibers from different edge points may cross and form an interweaving region, where the direction weights of the pixels accumulate. Once all edge points have been processed, the character fiber structure map (CFP) is obtained. Noise fibers in the interweaving regions are removed, and the fibers belonging to the row, column, left-slanting, and right-slanting directions are placed in the four planes h, v, s, and b respectively, which gives the center coordinates, length, and direction of every stroke element. The endpoint and connection features of the CAP are then used to obtain the connection relations of the stroke elements, which completes the structural features of the character. The upper right of Fig. 3 shows the structural features of 毗.
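
The stroke-width estimate at the start of the Fig. 2 procedure, the most frequent run length of 1-pixels over all rows and columns, can be sketched as below. The dot matrix is assumed to be a list of rows of 0/1 values; the function names are hypothetical, and the tie-breaking rule (the run length seen first wins) is an artifact of this sketch rather than something the text specifies.

```python
from collections import Counter


def runs_of_ones(line):
    """Return the lengths of the maximal runs of 1s in a sequence of 0/1 values."""
    runs, n = [], 0
    for value in line:
        if value:
            n += 1
        elif n:
            runs.append(n)
            n = 0
    if n:
        runs.append(n)
    return runs


def stroke_width(bitmap):
    """Most frequent run length over all rows and columns; 0 for a blank matrix."""
    counts = Counter()
    for row in bitmap:
        counts.update(runs_of_ones(row))
    for column in zip(*bitmap):
        counts.update(runs_of_ones(column))
    if not counts:
        return 0
    return counts.most_common(1)[0][0]
```

Runs taken across a stroke are far more numerous than runs taken along it, so the most frequent run length tends to approximate the stroke thickness.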

Fig. 5 is the pre-classification flowchart. On the character stroke plane, taking each of the four corners of the plane as a center, the stroke element nearest to that corner is found. Its direction attribute is determined and assigned to one of seven types: horizontal, vertical, left-falling, right-falling, corner, crossing, and empty. Their codes, shown in Fig. 6, are called corner codes, and the string of four corner codes is the first classification feature of the character. Rays are then cast from the center of the stroke plane and swept clockwise; the polygon formed by the rays and the outermost stroke elements is taken as the peripheral contour of the character. The convex points exceeding a certain threshold are extracted, and the number of convex points on each side is taken as that side's code. The four side codes form the four-side code string, which is the second classification feature of the character. The pre-classification dictionary is looked up with the four-side and four-corner codes to obtain the codes of the characters in the same class.

Fig. 7 is the frame that expresses the structural-semantic model of a character. In it, ε_i denotes the i-th stroke element; the stroke elements are grouped and ordered on the four sub-planes h, v, s, and b. Fig. 8 shows the conditional ordering structure of the stroke elements, and the ordering conditions are given in Fig. 9. The necessary connection relation Ω_mn is the connection relation that must hold between the m-th and n-th stroke elements of the character. For example, in the character 夫 the first horizontal stroke and the vertical stroke must cross, whereas 天 has no such requirement. The stroke-element comparison slot is used to distinguish differences in the relative length or direction of strokes within a character, as in 土 and 士 or 天 and 夭. The similar-character discrimination condition determines to which candidate character recognition should shift when a particular stroke element is absent or present, as in 风 and 凤 or 梁 and 粱.

Fig. 10 is the stroke-element frame, which expresses the structural features of each stroke element ε_i in Fig. 7. These comprise the normal direction of the stroke element, quantized into the four directions horizontal, vertical, left-falling, and right-falling and denoted h, v, s, and b respectively; the center coordinates (x0, y0)_i of the stroke element; and the stroke length. The frame also gives the permitted distortion direction ε'_i of the stroke and its structural weight w_i. The former makes matching flexible and improves the system's adaptability to variations in character form; the latter focuses matching on what matters and so simplifies it. Figs. 7 and 10 together make up the structural model of the system.

Fig. 11 is the flowchart of knowledge-guided matching and recognition, and Fig. 12 is the flowchart of stroke-element matching on one sub-plane. Following the class character codes given by pre-classification, the corresponding character models are fetched from the knowledge base one at a time, and the stroke elements already extracted are ordered by the conditional-ordering module of Fig. 9. On the four sub-planes h, v, s, and b in turn, the stroke elements of the model are matched against those of the character to be recognized, first within groups and then between groups. A match succeeds when the attribute distance between the strokes is below the specified threshold δ, and fails otherwise. On failure, the search moves on through the following stroke elements of the character; if none can match, the next model stroke element is taken. This continues until the last model stroke element. When all model stroke elements have matched, or all stroke elements of the character have been tried, the attribute distance over all strokes is computed with the specified weights, and the model is admitted as a match candidate if the distance is within the threshold Δ. If an unmatched model stroke element has a permitted distortion direction, matching is retried in the same way on the sample sub-plane of that direction. For each character model admitted as a match candidate, the character to be recognized is further checked against the specified connection relations Ω_mn. For example, in the pairs 夫 and 天 or 力 and 刀, the crossing relation is necessary for 夫 and for 力; a model whose requirement is not met is removed from the candidate list. If the model frame contains stroke-element comparison requirements, these are checked as well, and models that fail the comparison are removed from the candidate list. The matching and comparison are repeated until all models of the class have been processed. All characters whose total matching distance lies within the threshold range are arranged by increasing distance as recognition candidates; the first is the first candidate and is normally taken as the recognition result. If there is no recognition candidate, the character is rejected.
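
The two frames of Figs. 7 and 10 can be pictured as plain data structures. The sketch below is one possible encoding, not the layout used in the patent: the class names, field names, and the representation of a connection relation as an (m, n, relation) tuple are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StrokeElementFrame:
    direction: str                      # normal direction: "h", "v", "s" or "b"
    center: tuple                       # (x0, y0)
    length: float
    weight: float = 1.0                 # structural weight w_i
    distortion: Optional[str] = None    # permitted distortion direction, if any


@dataclass
class CharacterFrame:
    code: str
    elements: list = field(default_factory=list)
    # Necessary connection relations as (m, n, relation) tuples.
    required_connections: list = field(default_factory=list)

    def plane(self, direction):
        """Stroke elements whose normal direction lies on the given sub-plane."""
        return [e for e in self.elements if e.direction == direction]

    def search_planes(self, index):
        """Sub-planes to search for element `index`: normal direction first,
        then the permitted distortion direction if one is given."""
        element = self.elements[index]
        planes = [element.direction]
        if element.distortion:
            planes.append(element.distortion)
        return planes
```

Keeping the distortion direction on the stroke element itself means the fallback search needs no extra lookup: a failed match on the normal sub-plane simply moves to the next entry returned by `search_planes`.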

Claims (1)

  1. 1.一种字符识别方法,对书写有字符的页面扫描获得字符图象为第一步骤;字符图象二值化、字符切分及规格化为第二步骤:抽取字符二值化点阵的笔划结构特征为第三步骤;由结构特征求得分类特征码以确定所属分类为第四步骤;将结构特征与所属分类的字符模型进行匹配并识别之为第五步骤;将识别结果转为可见输出为第六步骤,本发明的特征是:所述的第三步骤包括:(1)字符结构模式作为模式整体可以分解为元字符、笔划和笔划元三种子模式。 1. A character recognition method, a character written with a page scan is obtained as a first step of the character image; binary character image, character segmentation and the normalized second steps of: extracting character binary lattice a third step wherein the structure of the stroke; structural features determined from the pattern to determine the classification category of the fourth step; structural characteristics matching the character models for identification and classification of the fifth step; recognition result converted to visible output a sixth step, the present invention is characterized in that: said third step comprises: (1) the structure of the character mode as the mode character element can be decomposed into a whole, the stroke and the stroke yuan seed patterns. 元字符是构造字符的字符,笔划分解为直线段即为笔划元。 Element is the character configuration of the character, the stroke is a straight line segment is the decomposition of the stroke element. 笔划元是最低级子模式,用作描述字符模式的结构基元,其结构特征包括笔划元中心坐标、长度、方向和连接关系。 Stroke is the lowest level sub-element model, as described structural motif character pattern, wherein the structure comprises a central stroke element coordinates, length, direction and connection relationship. (2)对字符点阵作一次简单的扫描,检测每一象元在8个方向上与相邻象元的连接情况,将其区分为笔划的始端、终端、连接区或普通笔划元素并标记相应的符号,从而将字符点阵平面(CDP)转换成字符象元属性平面(CAP)。 (2) to make a simple character dot scanning, is detected and each pixel adjacent pixels connection, which is divided into eight stroke directions in the start and end, or common strokes element connection area and labeled respective symbol, so that the character lattice plane (CDP) is converted into a character image attribute element plane (CAP). 
(3)除属于连接区的象元以外,在CAP上处于边缘点的象元,计算其“|”、“一”、“/”、“\”四个方向上连续的象元个数en,en最大的方向取作该边缘点的纤维主方向。 (3) other than the pixels belonging to the connecting region, CAP in pixels on the edge points, calculated "|", "a", "/", "\" as the four direction of the continuous element number en , en direction is taken as the maximum point of the edge of the main direction of the fiber. 在主方向上的en值称作纤维长度,纤维长度上连接的象元赋以主方向相应的权值。 en values ​​in the main direction of the fiber length is referred to the main direction of the respective weights assigned pixels connected to the fiber length. 各边缘点的纤维可能相交形成交织区,交织区的象元其方向权值累加。 Each fiber may be formed by the intersection edge point region interleaving, interleaved pixel region weight value accumulated direction thereof. 所有边缘点完成上述计算后即可求得字符纤维结构图(CFP)。 After all the edge points can be obtained by calculating the above fiber structure of FIG character (CFP). (4)对照CAP连接区的方向特征,除去CFP中的噪声纤维,将属于“|”、“一”、“/”、“\”四个方向的纤维分别置于V、h、s、b四个平面中,即可求得每一笔划元的中心坐标、长度和方向。 (4) control the direction of the connecting region wherein CAP, CFP noise removing fibers belonging "|", "a", "/", "\" fibers were placed in four directions V, h, s, b four planes, to obtain the center coordinates, the stroke length and direction of each element. (5)利用CAP的端点和连接区特征,结合已经求到的笔划元中心坐标、长度和方向可以计算笔划元的连接关系。 (5) using the characteristic of the terminal CAP and the connection region, with the center coordinates of the stroke element, and longitudinal directions can be calculated to have been seeking connection relationship stroke elements. 所述的第四步骤包括:(1)应用字符外围结构的四角特征和四边特征作为字符的分类特征,在二个层次上进行外围结构的描述和分类。 Said fourth step comprises: (1) features four sides and four corners feature a peripheral configuration of a character applied as a classification feature character, a peripheral structure described and classified in two levels. 由已知字符的四角特征和四边特征建立预分类字典。 Establishing a pre-classification feature dictionary by the four corners and four sides feature known characters. 
(2)在字符的笔划平面上(CSP)以平面的四个角为中心,搜索距离四角最近的笔划元。 (2) in the plane of the character stroke (CSP) to the four corners of the center plane, the four corners of the search from the nearest stroke element. (3)判断最近角点的笔划元方向属性,并分成横、竖、撇、捺、角、交六种类型,赋以相应的编码,称作角码。 (3) determining the direction of the stroke element nearest corner of the properties, and is divided into horizontal, vertical, left, right, angle, cross six types, endowed with a corresponding coding, referred to as the angle code. 由四个角码组成的码串构成字符的第一分类特征。 The first classification characteristic character code string composed of codes constituting the four corners. (4)在CSP上由中心引出射线,按顺时针扫描,获得射线与字符最外层笔划元所组成的多边形作为字符外围轮廓,抽取其超过某一阈值的凸点,分别计数每一边的凸点数求得四边的码串构成字符的第二分类特征。 A polygon (4) drawn by the center ray on the CSP, a clockwise scan, to obtain the outermost ray and the character strokes as a character element consisting of a peripheral contour extracting it exceeds a certain threshold bumps, projections were counted on each side classification characteristic points and determining a second code string of four sides constituting the character. (5)查找预分类字典中与待识字符四角码及四边码相同的同类字符代码,完成第四步骤。 (5) Search the dictionary to be presorting character identification code and the four corners of the four sides of the same code same character code, to complete the fourth step. 所述的第五步骤:(1)字符结构词义采用框架形式的知识表达,由字符框架表达每一字符模式。 The fifth step: Knowledge Representation (1) meaning the form of a frame structure using characters, each character expressed by the character frame mode. 在框架中,构成字符的全部笔划元分别在h、v、s、b四个平面上分组排序,并列出必要的笔划连接关系和相似字之间笔划元特征的辨析条件。 In the frame, all the strokes constituting the character element, respectively h, v, s, packet ordering four plane B, and the Stroke List Element Analysis conditions necessary features between similar words and the strokes of the connection relationship. 在字符框架中参与分组排序的每一个笔划元由笔划元框架描述。 Each element participating in a packet ordering strokes in the character frame element frame described by the stroke. 
The stroke-element frame expresses the normal direction, center position and length of the stroke element. In addition, it gives the weight of the stroke and its permitted distortion directions. The necessary connection relations in the character frame and the weights in the stroke-element frames are an application of knowledge representation: they emphasize the stroke elements and connection relations that strongly affect the recognition result and neglect components that are redundant or have little influence. The similar-character discrimination conditions and the permitted distortion directions allow the recognition process both to distinguish subtle differences in stroke structure between different characters in a structurally complex and very large character set, and to adapt well to widely varying glyph shapes.
(2) Take out the character models of the same pre-classification class and match them one after another against the stroke-element features of the character to be recognized, computing the attribute distance; if the distance is smaller than a certain threshold the match is regarded as successful, otherwise as failed. This process is carried out in turn on the four stroke-element sub-planes of each model until it is finished.
(3) Compute the weighted distance of the stroke-element attributes according to the weights specified in the stroke-element frames.
Stroke elements that play a key role in the character structure have the highest weights, which makes it easier to distinguish subtle stroke differences between characters; stroke elements of little influence have smaller weights, so that redundant strokes are effectively ignored.
(4) If an unmatched stroke element has permitted distortion directions, turn to the sample sub-plane of the corresponding direction and search for a match there.
(5) Check the necessary connection relations; a model that does not satisfy them is removed from the list of matching candidates.
(6) Check the stroke-element comparison conditions and the similar-character discrimination conditions; a model that does not satisfy them is removed from the list of matching candidates.
(7) All characters whose total matching distance is within the threshold range are sorted by distance in ascending order, and the few with the smallest distances are taken as recognition candidates; if there is no recognition candidate, the character is rejected.
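The matching described in items (2), (3), (4) and (7) of the fifth step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `StrokeElement`, `StrokeFrame`, `attr_distance`, `match_character` and `recognize` are hypothetical, and the attribute distance (center offset plus length difference), the thresholds and the penalty for an unmatched stroke element are assumed values that the text does not specify. The connection and similar-character checks of items (5) and (6) are left out.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class StrokeElement:
    plane: str      # one of 'h', 'v', 's', 'b'
    cx: float       # center x
    cy: float       # center y
    length: float


@dataclass(frozen=True)
class StrokeFrame:
    element: StrokeElement      # normal direction, center and length
    weight: float = 1.0         # importance of this stroke element
    distortions: tuple = ()     # other planes the stroke may fall into


def attr_distance(a, b):
    # Assumed attribute distance: center offset plus length difference.
    return hypot(a.cx - b.cx, a.cy - b.cy) + abs(a.length - b.length)


def match_character(frames, observed, match_thr=3.0, miss_penalty=10.0):
    """Weighted distance between one character frame and the observed elements."""
    used = set()
    total = 0.0
    for frame in frames:
        best = None
        # Search the normal plane first, then the permitted distortion planes.
        for plane in (frame.element.plane, *frame.distortions):
            cands = [(attr_distance(frame.element, o), i)
                     for i, o in enumerate(observed)
                     if o.plane == plane and i not in used]
            if cands and min(cands)[0] < match_thr:
                best = min(cands)
                break
        if best is None:
            total += frame.weight * miss_penalty   # assumed cost of a failed match
        else:
            used.add(best[1])
            total += frame.weight * best[0]
    return total


def recognize(models, observed, total_thr=5.0, top_k=3):
    """Return up to top_k labels in ascending distance order; [] means rejection."""
    scored = sorted((match_character(frames, observed), label)
                    for label, frames in models.items())
    return [label for dist, label in scored if dist <= total_thr][:top_k]


models = {
    "十": [StrokeFrame(StrokeElement("h", 5, 5, 8)),
           StrokeFrame(StrokeElement("v", 5, 5, 8))],
    "二": [StrokeFrame(StrokeElement("h", 5, 3, 5)),
           StrokeFrame(StrokeElement("h", 5, 7, 8))],
}
observed = [StrokeElement("h", 5, 5.5, 8), StrokeElement("v", 5, 5, 7.5)]
print(recognize(models, observed))
```

In this toy setup the models dictionary stands in for the character models of one pre-classification class. Giving a key stroke element a larger `weight` makes a mismatch on it dominate the total distance, which is the effect the weights in the stroke-element frame are meant to have.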
CN 92103651 1992-05-12 1992-05-12 Character distinguishing method CN1025764C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 92103651 CN1025764C (en) 1992-05-12 1992-05-12 Character distinguishing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 92103651 CN1025764C (en) 1992-05-12 1992-05-12 Character distinguishing method

Publications (2)

Publication Number Publication Date
CN1066335A (en) 1992-11-18
CN1025764C (en) 1994-08-24

Family

ID=4940353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 92103651 CN1025764C (en) 1992-05-12 1992-05-12 Character distinguishing method

Country Status (1)

Country Link
CN (1) CN1025764C (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102254204A (en) * 2011-06-03 2011-11-23 吴林 Coding and decoding method for graphemic code

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1317664C (en) * 2004-01-17 2007-05-23 中国科学院计算技术研究所 Confused stroke order library establishing method and on-line hand-writing Chinese character identifying and evaluating system
JP5071914B2 (en) 2005-02-28 2012-11-14 ザイ デクマ アクチボラゲット Recognition graph
CN1332348C (en) * 2005-09-23 2007-08-15 清华大学 Blocks letter Arabic character set text dividing method
CN101436254B (en) 2007-11-14 2013-07-24 佳能株式会社 Image processing method and image processing equipment
CN101436248B (en) 2007-11-14 2012-10-24 佳能株式会社 Method and equipment for generating text character string according to image
CN102024138B (en) 2009-09-15 2013-01-23 富士通株式会社 Character identification method and character identification device
CN102096662A (en) * 2010-12-06 2011-06-15 无敌科技(西安)有限公司 Code conversion method
CN103366716B (en) * 2012-03-31 2016-03-30 华为终端有限公司 Dot matrix font and character dot matrix font compression and decompression method and apparatus

Also Published As

Publication number Publication date Type
CN1066335A (en) 1992-11-18 application

Legal Events

Date Code Title Description
C10 Request of examination as to substance
C06 Publication
C14 Granted
C19 Cessation of patent right (cessation of patent right due to non-payment of the annual fee)