KR100286709B1

KR100286709B1 - Method for separating ideographic character in alphabetic string

Info

Publication number: KR100286709B1
Application number: KR1019930010895A
Authority: KR
Inventors: 노희호
Original assignee: 구자홍; 엘지전자주식회사
Priority date: 1993-06-15
Filing date: 1993-06-15
Publication date: 2001-04-16
Also published as: KR950001553A

Abstract

PURPOSE: A method for separating individual characters in an alphabetic string is provided that extracts the characters accurately and simply, without involving the character recognition unit, so that they can be recognized quickly and accurately. CONSTITUTION: A character string is input (ST30). Individual characters are extracted from the input string (ST31). Lowercase letters are distinguished among the extracted characters (ST32). The outline of touching characters is traced (ST33). Uppercase letters are distinguished among the touching characters (ST34). The touching characters are then cut apart using hole features (ST35). Finally, the height and margins of the cut characters are detected and passed to the recognition process (ST36).

Description

How to Separate Individual Characters in English Strings

The present invention relates to cutting out individual characters from an English character string in a document recognition apparatus. In particular, it relates to a method of separating individual characters in an English character string that cuts the characters out accurately, without intervention by the character recognition means, even when the characters touch one another, so that the document is recognized more quickly.

In a conventional document recognition apparatus, touching input characters are separated by selecting several candidate split points within the touching characters, running recognition at each candidate, and cutting the individual characters at the point where recognition succeeds.

For example, as shown in FIG. 1, when the letters "O" and "C" are input touching each other, the candidate cut positions are "2" and "3".

If the image is first cut at point "2" and recognition is run, the "O" is recognized as "C". Because "C" is a letter of the English alphabet, the cut at point "2" is judged appropriate, and the characters are cut there and recognized.

However, because the cutting of touching characters in such a conventional apparatus depends on recognition, a recognition error can corrupt the individual-character cutting that should precede recognition. In addition, recognition and cutting are repeated in alternation, so processing takes a long time.

Accordingly, an object of the present invention is to provide a method of separating individual characters in an English character string that cuts individual characters out accurately, without intervention by the character recognition means, even when characters touch, so that documents are recognized more quickly, and that cuts touching characters simply by using hole features, so that individual characters are recognized more quickly and accurately.

To achieve this object, the method of the present invention scans a document, binarizes the document image, separates picture regions from character regions in the binary image, and extracts character strings from the character regions in order to recognize characters. It is characterized by comprising: an individual-character extraction step of extracting individual characters from the extracted character string; a character classification step of distinguishing lowercase letters among the extracted individual characters; a contour tracing step of tracing the outline of touching characters among the extracted characters; a character discrimination step of distinguishing uppercase letters among the touching characters; a character cutting step of cutting the individual characters distinguished in the character discrimination step by using hole features; a margin detection step of obtaining the height of each cut character and the margins; and a recognition and post-processing step of recognizing the cut individual characters and correcting misrecognized portions.

FIG. 1 is an example of separating touching characters into individual characters in the conventional manner.

FIG. 2 is a block diagram of the individual-character separation system of the present invention.

FIG. 3 is a flowchart of document image recognition according to FIG. 2.

FIG. 4 is a detailed flowchart of the individual-character extraction process of FIG. 3.

FIG. 5 is a detailed flowchart of the hole-finding procedure of FIG. 4.

FIG. 6 is an example showing the black-pixel run lengths and black-pixel positions used in the character cutting step of FIG. 4.

FIG. 7 is an explanatory diagram showing the character merge condition in the character margin detection step of FIG. 4.

<Explanation of reference numerals for the main parts of the drawings>

1: scanner    2: scanner interface means

3: host computer    4: host interface means

5: digital signal processing means    6: buffer member

7: DRAM control means    8: program memory

Hereinafter, the present invention is described in more detail with reference to the accompanying drawings.

FIG. 2 shows the configuration of the individual-character separation system of the present invention. As shown, the system comprises: a scanner (1) that scans a document and generates a binary image; scanner interface means (2) that interfaces the binary image data of the scanner (1); digital signal processing means (5) that processes the image data delivered from the interface means (2) by executing the programs in a program memory (8), and stores the result in, and outputs it from, a data memory (9); a buffer member (6) that buffers the data processed by the digital signal processing means (5); DRAM control means (7) that controls an image memory (10) to store the image data output from the buffer member (6); a host computer (3) that reads the output data of the buffer member (6) and controls the overall operation of the document recognition apparatus in accordance with the control signals of the DRAM control means (7); and host interface means (4) that interfaces the control signals of the host computer (3) to control the digital signal processing means (5) and the buffer member (6).

FIG. 3 is a flowchart of document image recognition in the system of FIG. 2. As shown, it consists of: a scanning process (ST100) that generates a binary image from the input document; a region segmentation and string extraction process (ST200) that distinguishes picture portions from character portions in the input image and separates the character portions into character strings; an individual-character extraction process (ST300) that extracts individual characters from the separated strings; a recognition process (ST400) that recognizes each extracted character; and a post-processing process (ST500) that corrects portions misrecognized in the recognition process.

FIG. 4 is a detailed flowchart of the individual-character extraction process (ST300) of FIG. 3. It consists of: an individual-character extraction step (ST31) that extracts individual characters from the string separated in the region segmentation and string extraction process (ST200) and received through the string input step (ST30); a character classification step (ST32) that distinguishes lowercase letters among the extracted characters; a contour tracing step (ST33) that traces the outline of touching characters; a character discrimination step (ST34) that distinguishes uppercase letters among the touching characters; a character cutting step (ST35) that uses hole features to cut the individual characters distinguished in the character discrimination step (ST34); and a character margin detection step (ST36) that obtains the height of each cut character and the margins and passes them to the recognition process (ST400).

The operation and effects of the present invention configured in this way are described in detail below with reference to FIG. 2.

First, the scanner (1) reads the input document under the control of the host computer (3), and the image information is stored in the image memory (10).

Using this information, the digital signal processing means (5) separates the character strings according to the region segmentation program in the program memory (8) and stores the information for each string in the data memory (9).

The digital signal processing means (5) then runs the individual-character extraction program stored in the program memory (8) and cuts out individual characters following the flow of FIG. 4.

In the individual-character extraction step (ST31) of FIG. 4, the vertical black-pixel density, that is, the number of black pixels in each column, is computed for the string received from the string input step (ST30), and the string is cut wherever the number of black pixels is below a threshold.

Here the threshold is 1, so a cut is made at every column that contains no black pixels.
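The projection-based cut of step ST31 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the text line is a list of rows with 1 for black and 0 for white, and the function names `column_black_counts` and `split_on_empty_columns` are hypothetical.

```python
def column_black_counts(image):
    """Count black pixels (value 1) in every column of a binary image."""
    height = len(image)
    width = len(image[0]) if height else 0
    return [sum(image[y][x] for y in range(height)) for x in range(width)]


def split_on_empty_columns(image, threshold=1):
    """Return (start, end) column ranges, end exclusive, of the cut blocks.

    A column whose black-pixel count is below `threshold` is treated as a gap.
    With threshold = 1 this means a column with no black pixels at all.
    """
    blocks = []
    start = None
    counts = column_black_counts(image)
    for x, count in enumerate(counts):
        if count >= threshold:
            if start is None:
                start = x          # a new block begins here
        elif start is not None:
            blocks.append((start, x))  # the block ended at the previous column
            start = None
    if start is not None:
        blocks.append((start, len(counts)))
    return blocks


# Assumed toy text line: three blocks separated by empty columns 2 and 5.
image = [
    [1, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 0, 1],
]
print(split_on_empty_columns(image))
```

Blocks produced this way may still contain several touching characters, which is what the later steps address.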

For each block cut in this way, the attribute of the block is determined from the height of the character string and the width and height of the block.

For example, let shgt be the height of the character string, chgt the height of the cut block, cwid its width, and max the length of the longest vertical run in the string. The cut block is marked as a touching character if either of the following holds:

shgt - max < 5 and cwid > max × 3/4, or

shgt - max >= 5 and cwid > max × 4/5.
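The width test above can be written as a small predicate. This is a sketch under the stated definitions; the function name `is_touching_block` is hypothetical, and `max_run` stands for the quantity the text calls max (renamed here only to avoid shadowing Python's built-in). The block height chgt does not appear in the two conditions, so it is not a parameter.

```python
def is_touching_block(shgt, cwid, max_run):
    """Return True if a cut block should be marked as a touching character.

    shgt    -- height of the character string
    cwid    -- width of the cut block
    max_run -- length of the longest vertical run in the string ("max" in the text)
    """
    if shgt - max_run < 5:
        return cwid > max_run * 3 / 4
    return cwid > max_run * 4 / 5


# Assumed example values, in pixels.
print(is_touching_block(shgt=30, cwid=22, max_run=28))  # width limit is 21
print(is_touching_block(shgt=40, cwid=24, max_run=30))  # width limit is 24
```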

Once the individual-character extraction step (ST31) is complete, the following steps for cutting touching characters are carried out.

First, the character classification step (ST32) determines whether a block classified as a touching character is actually a lowercase m or w.

These letters are singled out because they are wide relative to their height compared with other letters and are therefore sometimes misclassified as touching characters. A misclassified lowercase m or w must not be cut, which is why the character classification step (ST32) is needed.

A block is accepted as a lowercase m if three strokes are crossed in the horizontal direction at the mid-height of the block and if, within the leftmost quarter of the character width, there is a continuous run of black pixels whose length differs from the block height by 3 or less.
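One possible reading of the lowercase m test is sketched below. It assumes a block is a list of rows of 0/1 pixels, takes "three strokes" to mean three black runs along the middle row, and takes the second condition to mean a near-full-height vertical run in the leftmost quarter of the columns. The helper names are hypothetical.

```python
def count_runs(values):
    """Count maximal runs of black (non-zero) pixels in a row."""
    runs = 0
    previous = 0
    for value in values:
        if value and not previous:
            runs += 1
        previous = value
    return runs


def longest_vertical_run(block, x):
    """Length of the longest continuous black run in column x."""
    best = current = 0
    for row in block:
        current = current + 1 if row[x] else 0
        best = max(best, current)
    return best


def looks_like_lowercase_m(block):
    """Apply the two conditions described for lowercase m (assumed reading)."""
    height = len(block)
    width = len(block[0])
    # Condition 1: three strokes crossed at mid-height.
    if count_runs(block[height // 2]) != 3:
        return False
    # Condition 2: a near-full-height stem within the leftmost quarter.
    left_quarter = max(1, width // 4)
    return any(height - longest_vertical_run(block, x) <= 3
               for x in range(left_quarter))


# Assumed toy glyphs.
m_block = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
]
o_block = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
print(looks_like_lowercase_m(m_block), looks_like_lowercase_m(o_block))
```

The tolerance of 3 is meaningful only at realistic scan resolutions; on the tiny glyphs above it is satisfied by almost any stem.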

A lowercase w is identified as follows.

It is recognized by whether two bends are present when the block is viewed from the top.

Touching characters that remain after the character classification step (ST32) are passed to the contour tracing step (ST33), which cuts characters apart by tracing their outlines.

Contour tracing proceeds counterclockwise, and it can cut apart touching characters with serifs that cannot be separated by vertical black-pixel density.

For example, in "OB" the B has serifs protruding at its upper and lower left. Viewed in the vertical direction, these serifs overlap the right side of the O, so the two letters cannot be separated by projection; however, because they form separate pixel clusters, they can be cut apart by contour tracing.
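The point of the "OB" example is that two glyphs can overlap in the column projection and still be disjoint pixel clusters. The sketch below demonstrates this with 8-connected component labeling by breadth-first search. This is a substitute technique chosen for brevity, not the counterclockwise contour tracing the text describes; both isolate the same clusters in this situation. All names and the toy image are assumptions.

```python
from collections import deque


def connected_components(image):
    """Return a list of pixel sets, one per 8-connected black component."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = set()
            while queue:
                cy, cx = queue.popleft()
                pixels.add((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def column_extent(pixels):
    """Leftmost and rightmost column occupied by a component."""
    xs = [x for _, x in pixels]
    return min(xs), max(xs)


# Assumed toy image: the right glyph's top serif reaches over the left glyph
# at column 2, so no column is empty, yet the glyphs never touch.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 1],
]
components = connected_components(image)
print(sorted(column_extent(c) for c in components))
```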

Next, the character discrimination step (ST34) is carried out.

This routine recognizes and cuts out the uppercase letters I and T when they touch other characters, uppercase letters in particular. In words written entirely in capitals, I and T often touch neighboring letters.

In that case, the most accurate way to cut them is to recognize I and T and separate them.

This routine therefore uses simple features to split off and cut I and T.

Both I and T contain a vertical run about as tall as the character string. Other characters may contain such a run too, but the shapes of their top and bottom edges differ from those of I and T, so I and T can still be told apart.

I is identified as follows. Let x1 and x2 be the coordinates of the start and end points of a vertical run about as tall as the character string. If the shapes at the top and bottom to the left of x1 and to the right of x2 are symmetric with each other, and the lengths of the horizontal runs at the top and bottom are at or below a threshold, the block is recognized as I and cut. Similarly, T is identified if a horizontal line exists at the top of the examined region and its length is symmetric about x1 and x2; the block is then regarded as T and cut.

Next is the character cutting step (ST35). To cut touching characters using hole features, the holes must first be found; this process is described with reference to FIG. 5.

First, for each column i, the position of the first black pixel in the vertical direction, the position of the last black pixel, and the length of the first black-pixel run are compared (ST4). If the condition is satisfied, i becomes the hole start point h_s (ST5). After a start point has been found, the first i that does not satisfy the condition (ST4) becomes the hole end position H_1 (ST6). This yields the start and end points of the hole.

Then, in step ST7, the size of the hole is compared with the length Smax of the longest black-pixel run in the character string to obtain the number of holes.

The black-pixel run length, the minimum black-pixel position, and related quantities used here are illustrated in FIG. 6.

Once the holes have been found, the touching characters are cut at the position with the minimum black-pixel density between holes.
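The hole search and the cut between holes can be sketched as follows. The text does not spell out the exact ST4 condition, so this sketch assumes one: a column lies inside a hole when its first black run ends before its last black pixel, meaning white pixels are enclosed between them. The comparison with Smax in ST7 is omitted because its rule is not given. All function names are hypothetical.

```python
def column_profile(image, x):
    """Return (first_black, last_black, first_run_length) for column x, or None."""
    ys = [y for y in range(len(image)) if image[y][x]]
    if not ys:
        return None
    first, last = ys[0], ys[-1]
    run = 0
    y = first
    while y < len(image) and image[y][x]:
        run += 1
        y += 1
    return first, last, run


def find_holes(image):
    """Return (start, end) column ranges, end exclusive, of hole regions.

    Assumed ST4 condition: the first black run stops above the last black pixel.
    """
    width = len(image[0])
    holes = []
    start = None
    for x in range(width):
        profile = column_profile(image, x)
        inside = profile is not None and profile[0] + profile[2] - 1 < profile[1]
        if inside and start is None:
            start = x                  # hole start point
        elif not inside and start is not None:
            holes.append((start, x))   # hole end position
            start = None
    if start is not None:
        holes.append((start, width))
    return holes


def cut_positions(image):
    """Columns of minimum black-pixel count between consecutive holes."""
    counts = [sum(row[x] for row in image) for x in range(len(image[0]))]
    holes = find_holes(image)
    cuts = []
    for (_, left_end), (right_start, _) in zip(holes, holes[1:]):
        cuts.append(min(range(left_end, right_start), key=lambda x: counts[x]))
    return cuts


# Assumed toy image: two ring-shaped glyphs joined by one pixel at column 4.
image = [
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 1, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
]
print(find_holes(image), cut_positions(image))
```

Note that under this assumed condition an open concavity, such as the inside of a C, also registers as a hole, since only the vertical direction is examined.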

Next is the character margin detection step (ST36), which detects the height and margins of the extracted individual characters.

This step obtains the height of each block, merges characters that were cut by mistake, and finds the blanks that separate words.

The height of a block is the difference between the positions of its first and last black pixels. For wrongly cut characters, the spacing between adjacent blocks is used to identify two blocks that are judged to be a single character cut in two, and those two blocks are merged.

FIG. 7 illustrates the merge condition for the case where a single block has been wrongly cut into two blocks, b and c; such blocks must be merged.

Let b2 be the gap inside the cut character, b1 and b3 the gaps between this character and its left and right neighbors, and w1, w2, and w3 the widths of the respective characters. Blocks b and c are merged into a single block if

b2 × 2 <= b1 and b2 × 2 <= b3 and w2 < max(w1, w3) × 3/2,

and the difference between the starting y-coordinates of blocks b and c is 3 or less.
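The merge condition translates directly into a predicate. This is a sketch; the function name `should_merge` and the parameter names `y_start_b` and `y_start_c` (the starting y-coordinates of blocks b and c) are hypothetical, while b1, b2, b3, w1, w2, and w3 follow the text.

```python
def should_merge(b1, b2, b3, w1, w2, w3, y_start_b, y_start_c):
    """Return True if blocks b and c should be merged into one character."""
    return (
        b2 * 2 <= b1                           # inner gap at most half the left gap
        and b2 * 2 <= b3                       # and at most half the right gap
        and w2 < max(w1, w3) * 3 / 2           # width bound from the text
        and abs(y_start_b - y_start_c) <= 3    # the two pieces start at a similar height
    )


# Assumed example values, in pixels.
print(should_merge(b1=6, b2=2, b3=5, w1=10, w2=12, w3=9, y_start_b=4, y_start_c=5))
print(should_merge(b1=3, b2=2, b3=5, w1=10, w2=12, w3=9, y_start_b=4, y_start_c=5))
```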

The blanks that separate words are found as follows.

Let ave_blank be the average of the inter-character gaps that are wider than 2, str_hgt the height of the character string, and bhgt the character height used for blank extraction. bhgt is obtained as follows:

bhgt = max(h), taken over the heights h of all characters in the string that satisfy str_hgt / 2 < h < str_hgt - 3; if no character satisfies this condition, bhgt = 0.

Once bhgt is known, a candidate gap test_intv that is larger than ave_blank × 3/2 is treated as a blank if bhgt > 0 and test_intv > bhgt / 3, or if bhgt = 0 and test_intv > str_hgt / 3.
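The word-blank rule can be sketched in three small functions. The names ave_blank, str_hgt, bhgt, and test_intv follow the text; the function names and the sample measurements are assumptions made for illustration.

```python
def reference_height(char_heights, str_hgt):
    """bhgt: tallest character with str_hgt / 2 < height < str_hgt - 3, else 0."""
    candidates = [h for h in char_heights if str_hgt / 2 < h < str_hgt - 3]
    return max(candidates) if candidates else 0


def average_blank(gaps):
    """ave_blank: mean of the inter-character gaps wider than 2."""
    wide = [g for g in gaps if g > 2]
    return sum(wide) / len(wide) if wide else 0.0


def is_word_blank(test_intv, ave_blank, bhgt, str_hgt):
    """Decide whether a gap of width test_intv separates two words."""
    if test_intv <= ave_blank * 3 / 2:
        return False
    if bhgt > 0:
        return test_intv > bhgt / 3
    return test_intv > str_hgt / 3


# Assumed measurements for one text line, in pixels.
str_hgt = 30
bhgt = reference_height([12, 20, 24, 29], str_hgt)
gaps = [1, 2, 3, 4, 12, 3]
ave_blank = average_blank(gaps)
print(bhgt, ave_blank)
print([is_word_blank(g, ave_blank, bhgt, str_hgt) for g in gaps])
```

In this example only the gap of 12 exceeds both ave_blank × 3/2 and bhgt / 3, so it is the only word boundary.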

In this way, each individual character cut out in the individual-character extraction process (ST300) is recognized in the recognition process (ST400), and portions misrecognized there are corrected in the post-processing process (ST500).

Thus, even when characters touch, the present invention cuts individual characters out accurately without intervention by the character recognition means, so documents are recognized more quickly; and because touching characters are cut simply by using hole features, individual characters are recognized more quickly and accurately.

Claims

1. A method of separating individual characters in an English character string, for use in a method of scanning a document, binarizing the document image, separating a picture region and a character region from the binary image, extracting a character string from the character region, and recognizing characters, the method comprising: an individual-character extraction step of extracting individual characters from the extracted character string; a character classification step of distinguishing lowercase letters among the extracted individual characters; a contour tracing step of tracing the outline of touching characters among the extracted characters; a character discrimination step of distinguishing uppercase letters among the touching characters; a character cutting step of cutting the individual characters distinguished in the character discrimination step by using hole features; a margin detection step of obtaining the height of each cut individual character and the margins; and a recognition and post-processing step of recognizing the cut individual characters and correcting misrecognized portions.

2. The method of claim 1, wherein the hole feature in the character cutting step determines the presence or absence of a hole using the vertical black-pixel positions of the target block and the lengths of its black-pixel runs.