KR910007032B1

KR910007032B1 - A method for truncating strings of characters and each character in korean documents recognition system

Info

Publication number: KR910007032B1
Application number: KR1019880018070A
Authority: KR
Inventors: 정영화
Original assignee: 삼성전자 주식회사; 안시환
Priority date: 1988-12-31
Filing date: 1988-12-31
Publication date: 1991-09-16
Also published as: KR900010597A

Abstract

The method estimates the pitch of each character in order to read Hangul characters, and reads the Hangul characters successively to process a character row. The method comprises the steps of: (A) counting the black pixels read by a scanner and comparing the count with a predetermined threshold to extract one character row; (B) determining the start and end points of each character according to the state of the black pixels and determining the cut candidates for the individual characters; (C) counting the number of candidates and determining the estimated pitches of the characters and spaces; and (D) outputting one character at a time using the estimated pitches.

Description

Method for Extracting Character Strings and Cutting Out Individual Characters in a Korean Document Recognition Apparatus

FIG. 1 is a system configuration diagram for carrying out the present invention.

FIG. 2 is a flowchart of document image recognition.

FIG. 3 is a flowchart of the present invention.

FIG. 4 is a flowchart of the character string extraction in FIG. 3.

FIG. 5 is a flowchart of determining the positions of the individual character cut candidates in FIG. 3.

FIG. 6 is a flowchart of estimating the character and space pitches within one character string in FIG. 3.

FIG. 7 is a flowchart of merging and separating the individual character cut candidates in FIG. 3.

FIG. 8 shows an example of determining the positions of individual character cut candidates.

* Explanation of reference numerals for the main parts of the drawings

11: central processing unit    12: ROM

13: RAM    14: scanner

15: image memory    16: monitor

The present invention relates to a method for recognizing document image data, and more particularly to a method for recognizing Hangul (Korean) characters.

In general, when characters such as alphabetic letters or Hangul are recognized through a scanner, the recognition method varies with the characteristics of each script. Alphabetic letters each have the form of an independent character, whereas a Hangul character is a single character assembled from its constituent graphemes, so recognizing Hangul characters is more complicated than recognizing alphabetic letters.

In conventional document recognition apparatuses, character strings were not extracted while the document was being read as the scanner converted it into binary data, which made continuous processing of a document difficult. In addition, the separation of consonants and vowels that is characteristic of Hangul characters was not merged back together, so recognition of Hangul documents was impossible.

Accordingly, an object of the present invention is to provide a method that, when a Hangul document is recognized, estimates the pitch of each individual character in an extracted character string so that the individual characters can be cut out.

Hereinafter, the present invention is described in detail with reference to the drawings.

FIG. 1 is a system configuration diagram for carrying out the present invention. The system comprises a central processing unit (11) that controls the overall operation of the document recognition apparatus; a ROM (12) that stores the document processing and recognition program; a RAM (13) that stores data generated while the program runs; a scanner (14) that scans a document under the control of the central processing unit (11) and generates binary information; an image memory (15) that stores, under the control of the central processing unit (11), the document image data obtained through the scanner (14); and a monitor (16) that displays the recognized document under the control of the central processing unit (11) after recognition.

FIG. 2 is a flowchart of document image recognition. It consists of a scanning process (1) that scans the document image and generates binary information; a character string extraction process (2) that reads the document information scanned in the scanning process (1) and automatically extracts the document information for one character string; an individual character cutting process (3) that cuts out individual characters from the string extracted in process (2); a preprocessing process (4) that removes noise from the cut-out individual characters and performs thinning and normalization; and a recognition process (5) that recognizes Hangul characters from the information produced by the preprocessing process (4).

FIG. 3 is a flowchart of the character string and individual character cutting process according to the present invention. It consists of: a process of line-scanning the document information from the scanner and extracting one character string according to the counted number of black pixels; a process of finding the start point and end point of each individual character in the extracted string to determine the individual character cut candidates; a process of counting the cut candidates, summing the widths of the individual characters and the space intervals between characters according to the count, and obtaining the estimated character and space pitches from these sums; and a process of merging and separating the cut candidates using the estimated pitches and cutting out the individual characters.

FIG. 4 is a detailed flowchart of the character string extraction process in FIG. 3, FIG. 5 is a detailed flowchart of the individual character cut candidate positioning process in FIG. 3, FIG. 6 is a detailed flowchart of the process of determining the estimated character and space pitches within a character string in FIG. 3, and FIG. 7 is a detailed flowchart of the process of merging and separating the individual character cut candidates in FIG. 3.

FIG. 8 is an example of determining the positions of individual character cut candidates. S₁ to S_n are the start points of the cut candidates and e₁ to e_n are their end points. The horizontal size of a cut candidate is the distance between its start point and end point, and it varies with the arrangement of the graphemes that make up the character. L₁ to L_n are the scan lines used to extract one character string: L₁ is the first line of the string region and L_n is its last line.

Based on the above configuration, the present invention is described in detail with reference to FIGS. 1 to 8.

The scanner (14) scans the image of the document and binarizes it, and the central processing unit (11) stores the received document image data in the image memory (15). When the central processing unit (11) recognizes that one character string has been input, it controls the motor drive to pause the scanner (14) and then carries out the process of FIG. 2 according to the document processing and recognition program in the ROM (12).

First, the central processing unit (11) performs the scanning process (1), in which the scanner (14), driven by the motor, reads the contents of the original, generates binarized document image data, and stores it in memory. The central processing unit (11) then performs the character string extraction process (2) to extract a character string from the scanned document image data. When one line of text has been extracted in process (2), the document image data of that string is stored in the image memory (15), and the individual character cutting process (3) cuts out the individual characters belonging to the string and merges and separates them. Preprocessing (4) then thins and normalizes the cut-out individual character data, after which the recognition process (5) performs character recognition. Once one character has been recognized in this way, the individual character cutting process (3) is performed again so that the next character can be recognized, and these steps are repeated. When recognition of all characters in the string has been completed by repeating these processes, the character string extraction process (2) is performed again to extract the next string to be recognized from the scanned document image data, and the recognition processes are repeated as above.

In the Hangul character recognition process described above, the present invention, which extracts a character string from the document image data generated by the scanner (14) and cuts out the individual characters in the extracted string, proceeds as follows.

First, the character string extraction process (100) of FIG. 3 is described with reference to the flow of FIG. 4. When document information is generated, the central processing unit (11) drives the step motor in step (101) and reads one line of document image data generated through the scanner (14), then counts the black pixels of that line in step (102). In step (103) it checks whether the line is the start position of a character string region, like L₁ in FIG. 8. If the black-pixel count of the line is greater than the first threshold level (Th1), the line is regarded as the start of a string region; if it is smaller, the process returns to step (104) and reads the next line of document image data received through the scanner (14). The start position of a string region is thus determined by reading the scan data one line at a time and detecting the presence or absence of black pixels. Between one character string and the next there are no Hangul characters and therefore no black pixels, whereas inside a character string Hangul characters are present and so are black pixels. Extracting the boundary defined by the presence or absence of black pixels therefore gives the start position of a character string, and the first threshold level (Th1) only needs to be set to a suitable value for this purpose.

When the black-pixel count of a line exceeds the first threshold level (Th1) in step (103) and the line is judged to be the start of a character string, step (104) reads another line in order to detect the end position of the string. Step (105) checks whether the line is the last position (L_n) of the string region. If it is not, step (106) stores that line of document image data in the string memory and the process returns to step (104) to read the next line. In other words, the document image data read line by line is stored sequentially in the image memory (15), which extracts the document image data for one character string. As with the start position, black pixels are present up to the end position of the string and then disappear, so the end position is determined as the position where the black-pixel count falls below the second threshold level (Th2).

Accordingly, when step (105) detects a black-pixel count smaller than the second threshold level (Th2), the line is the last position (L_n) of the string region, and the process proceeds to step (107) to decide whether the lines read so far can be regarded as one character string. Because a Hangul character string has a roughly constant height, that height is used as the criterion: if the value obtained by subtracting the start line from the last line of the region (L_n - L₁) is greater than the predetermined height, the region is accepted as a character string. In step (107), therefore, if the result exceeds the predetermined height, a character string has been extracted and the extraction flow ends; otherwise the process returns to step (101) to extract a character string again.
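The string extraction flow of FIG. 4 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `extract_text_line`, the representation of the page as a list of rows of 0/1 values (1 = black), and the use of an exclusive bottom index are all assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple


def extract_text_line(
    rows: List[Sequence[int]],
    th1: int,
    th2: int,
    min_height: int,
    start_row: int = 0,
) -> Optional[Tuple[int, int]]:
    """Return (top, bottom) of the next text line, bottom exclusive, or None.

    rows, th1, th2 and min_height are hypothetical names for the scan lines,
    the two threshold levels (Th1, Th2) and the predetermined string height.
    """
    y = start_row
    n = len(rows)
    while y < n:
        # Steps 101-103: skip lines until the black-pixel count exceeds Th1.
        while y < n and sum(rows[y]) <= th1:
            y += 1
        if y >= n:
            return None
        top = y
        # Steps 104-106: keep reading lines until the count drops below Th2.
        y += 1
        while y < n and sum(rows[y]) >= th2:
            y += 1
        bottom = y
        # Step 107: accept the region only if it is taller than the expected height.
        if bottom - top > min_height:
            return top, bottom
        # Otherwise treat it as noise and keep searching from the current line.
    return None
```

Calling the function again with `start_row` set to the returned `bottom` would yield the next string, which mirrors the way the description repeats process (2) after each string has been recognized.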

After one character string has been extracted from the scanned document image data as described above, process (200) determines the positions of the individual character cut candidates within the string. This is described in detail with reference to FIG. 5.

First, in step (201), the extracted character string is projected in the vertical direction to obtain the projection amount of each character, as shown in FIG. 8. From this projection, the start points (S₁ to S_{n+1}) and end points (e₁ to e_{n+1}) of the cut candidates in the string are obtained, which fixes the positions of the candidates. Here, cutting out (truncation) means cutting apart characters that touch each other when two individual characters are connected, and an individual character means a grapheme (a vowel or a consonant) that makes up one Hangul character. In step (202), with the projection amounts denoted P₀ to P_n, a position that satisfies P_{n-1} = 0 and P_n > 0 is taken as the start point of a cut candidate. Step (203) then checks whether the start points (S₁ to S_n) of all individual characters have been found; if not, the process returns to step (202) to find the start point of the next individual character.

Once all start points have been found in step (203), step (208) determines the end point positions of the cut candidates that satisfy the condition P_{n+1} > 0, and when all end points (e₁ to e_n) have been found in step (205), the cut candidate positioning flow ends.
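The projection-based positioning of FIG. 5 can be illustrated with a short sketch. The names `vertical_projection` and `find_cut_candidates` are hypothetical, and the sketch assumes that an end point is the last column before the projection returns to zero, which is an interpretation made for this example rather than a condition stated in the text.

```python
from typing import List, Sequence, Tuple


def vertical_projection(line_rows: List[Sequence[int]]) -> List[int]:
    """Count the black pixels (1s) in every column of one text line."""
    return [sum(column) for column in zip(*line_rows)]


def find_cut_candidates(projection: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (start, end) column pairs, end inclusive, for each run of ink."""
    candidates = []
    start = None
    previous = 0
    for x, amount in enumerate(projection):
        if previous == 0 and amount > 0:
            # Start point: P[x-1] == 0 and P[x] > 0.
            start = x
        elif previous > 0 and amount == 0:
            # Assumed end point: the last column that still contained ink.
            candidates.append((start, x - 1))
            start = None
        previous = amount
    if start is not None:
        # The run of ink reaches the right edge of the line.
        candidates.append((start, len(projection) - 1))
    return candidates
```

Each returned pair corresponds to one cut candidate, so a Hangul character whose consonant and vowel do not touch would show up as two pairs at this stage.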

Here, an individual character cut candidate is the character lying between a start point (S₁ to S_n) and an end point (e₁ to e_n), and its horizontal size depends on how the graphemes of the Hangul character are combined. As illustrated in FIG. 8, one Hangul character may consist of two cut candidates, such as S₁ to e₁ and S₂ to e₂, or of a single cut candidate, such as S₃ to e₃. Even within a Hangul character made of two cut candidates, the horizontal sizes differ between the consonant (S₁ to e₁) and the vowel (S₂ to e₂).

After the positions of the cut candidates have been determined as above, process (300) estimates the character pitch and space pitch within the string, as described with reference to FIG. 6. In step (301), the number of cut candidates determined by the cut candidate determination process (200) is stored in a counter. In step (302), the widths of all cut candidates that satisfy the condition for character pitch estimation are summed, and in step (303) the sum of the inter-character space intervals that satisfy the condition for space pitch estimation is obtained. Then, in steps (304) and (305), the values obtained in steps (302) and (303) are used to compute the estimated individual character pitch by equation (1) and the estimated space pitch by equation (2).

Pe = Pw / Cn    (1)

Pe: estimated pitch of an individual character

Pw: sum of the widths of the individual character cut candidates

Cn: number of cut candidates that satisfy the individual character estimation condition

Spe = Spw / Spn    (2)

Spe: estimated space pitch

Spw: sum of the space intervals between characters

Spn: number of spaces that satisfy the space pitch estimation condition
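Equations (1) and (2) amount to two averages, which the following sketch computes from a list of (start, end) cut candidates. The text does not spell out the conditions a candidate or a space must satisfy to be included, so the optional filters `char_ok` and `space_ok` are hypothetical placeholders, as is the function name `estimate_pitches`.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[int, int]  # (start point S, end point e) of one cut candidate


def estimate_pitches(
    candidates: List[Candidate],
    char_ok: Optional[Callable[[int], bool]] = None,
    space_ok: Optional[Callable[[int], bool]] = None,
) -> Tuple[float, float]:
    """Return (Pe, Spe) following equations (1) and (2)."""
    # Width of each candidate: end point minus start point.
    widths = [end - start for start, end in candidates]
    # Space between neighbours: next start point minus preceding end point.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(candidates, candidates[1:])]
    if char_ok is not None:
        widths = [w for w in widths if char_ok(w)]
    if space_ok is not None:
        gaps = [g for g in gaps if space_ok(g)]
    pe = sum(widths) / len(widths) if widths else 0.0   # Pe = Pw / Cn
    spe = sum(gaps) / len(gaps) if gaps else 0.0        # Spe = Spw / Spn
    return pe, spe
```

Returning 0.0 when no candidate or space qualifies is a choice made for the sketch so that the function never divides by zero.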

After the estimated individual character pitch Pe and the estimated space pitch Spe of the string have been determined, process (400) merges and separates the individual characters using Pe and Spe. Process (400) is described in detail with reference to FIG. 7.

First, step (401) checks whether the sizes of two cut candidates meet the merge conditions, which are defined as tolerance ranges around the estimated individual character pitch, as in equations (3-1) to (3-4).

(e_{n+1} - S_{n+1}) < (e_n - S_n) ≤ Pe × 7/10    (3-1)

(e_{n+1} - S_n) > Pe × 7/10    (3-2)

(e_{n+1} - S_n) ≤ Pe × 12/10    (3-3)

(e_{n+1} - S_{n+1}) < Pe × 4/10    (3-4)

Equation (3-1) covers the case where two individual characters are apart from each other, the width of the preceding one (e_n - S_n) is larger than the width of the following one (e_{n+1} - S_{n+1}), and it is no more than 7/10 of the one-character estimated pitch Pe. In this case the two individual characters become merge candidates. That is, when the preceding individual character is a consonant, it is wider than the following vowel, and if the consonant is at most 7/10 of the width of one character, the two individual characters (consonant + vowel) meet the merge condition.

Second, equation (3-2) states that the two individual characters become merge candidates when their combined width, computed by subtracting the start point (S_n) of the preceding individual character from the end point (e_{n+1}) of the following one, is greater than 7/10 of the one-character estimated pitch (Pe). That is, the two meet the merge condition when the span covering the preceding and following graphemes is greater than 7/10 of Pe.

Third, equation (3-3) states that the two individual characters become merge candidates when their combined width, computed by subtracting the start point (S_n) of the preceding individual character from the end point (e_{n+1}) of the following one, is no more than 12/10 of the one-character estimated pitch (Pe). That is, the two meet the merge condition when the span covering the preceding and following graphemes is at most 12/10 of Pe.

Fourth, when the value obtained by subtracting the start point (S_{n+1}) from the end point (e_{n+1}) of an individual character is smaller than 4/10 of the one-character estimated pitch (Pe), the character is merged with the preceding individual character. The following grapheme is then a vowel, so it meets the merge condition with the preceding grapheme.

In equations (3-1) to (3-4), 7/10, 12/10 and 4/10 are threshold values for checking whether the width of each individual character satisfies the merge conditions; here they are values determined by experiment. That is, the width of a consonant is at most 7/10 of the width of one character, and the width of a vowel is at most 4/10 of the width of one character. The width of one character made up of a vowel and a consonant satisfies the conditions when it is at least 7/10 and at most 12/10 of the one-character estimated pitch (Pe).
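The four merge conditions can be written as simple predicates on two neighbouring cut candidates. The text does not state how the conditions are combined into a single decision, so the rule used in `should_merge` below (the combined span must satisfy (3-2) and (3-3), and at least one of (3-1) or (3-4) must hold) is an assumption made purely for illustration, as are all function names.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[int, int]  # (start point S, end point e)


def merge_conditions(prev: Candidate, nxt: Candidate, pe: float) -> Dict[str, bool]:
    """Evaluate equations (3-1) to (3-4) for two neighbouring candidates."""
    s_n, e_n = prev
    s_n1, e_n1 = nxt
    width_prev = e_n - s_n      # width of the preceding candidate
    width_next = e_n1 - s_n1    # width of the following candidate
    span = e_n1 - s_n           # width of both candidates taken together
    return {
        "3-1": width_next < width_prev <= pe * 7 / 10,
        "3-2": span > pe * 7 / 10,
        "3-3": span <= pe * 12 / 10,
        "3-4": width_next < pe * 4 / 10,
    }


def should_merge(prev: Candidate, nxt: Candidate, pe: float) -> bool:
    """Assumed combination rule; the source does not specify one."""
    c = merge_conditions(prev, nxt, pe)
    return c["3-2"] and c["3-3"] and (c["3-1"] or c["3-4"])


def merge_candidates(candidates: List[Candidate], pe: float) -> List[Candidate]:
    """Merge neighbouring candidates left to right by joining their cut points."""
    merged: List[Candidate] = []
    for cand in candidates:
        if merged and should_merge(merged[-1], cand, pe):
            merged[-1] = (merged[-1][0], cand[1])
        else:
            merged.append(cand)
    return merged
```

With Pe = 20, a 12-pixel consonant followed by a 5-pixel vowel is joined into one 19-pixel character, while a neighbouring full-width character is left alone because the combined span would exceed 12/10 of Pe.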

If the two individual characters do not satisfy the merge conditions of equations (3-1) to (3-4), step (402) checks whether merging has been attempted for all characters. When merging is complete, the process proceeds to step (405) to perform separation; while merging is still in progress, it returns to step (401). If, in step (401), the two cut candidates do meet the merge conditions of equations (3-1) to (3-4), their cut position points are merged (in the case of FIG. 8, S₀ is set to [...] and e₁ to [...], merging the cut position points).

Likewise, step (405) then judges whether merging has finished for all individual characters. If not, the process returns to step (501); when merging is complete, step (405) checks whether the separation condition for touching characters is satisfied, as in equation (4).

(e_n - S_n) ≥ Pe × 16/10    (4)

Equation (4) means that an individual character must be separated when the value obtained by subtracting its start point (S_n) from its end point (e_n) is greater than or equal to 16/10 of the one-character estimated pitch (Pe). That is, when the width of an individual character is at least Pe × 16/10, two characters are judged to be touching and the two touching characters are separated.
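The separation test of equation (4) can be sketched as follows. The text only says that touching characters are separated using Pe, without giving the cut positions, so dividing the candidate into equal pieces whose count is the width divided by Pe (rounded) is an assumption for this example; `split_touching` is a hypothetical name.

```python
from typing import List, Tuple

Candidate = Tuple[int, int]  # (start point S, end point e)


def split_touching(candidate: Candidate, pe: float) -> List[Candidate]:
    """Split a candidate that satisfies equation (4) into pitch-sized pieces."""
    start, end = candidate
    width = end - start
    # Equation (4): only widths of at least Pe * 16/10 are treated as touching.
    if pe <= 0 or width < pe * 16 / 10:
        return [candidate]
    # Assumed splitting rule: as many equal pieces as pitches fit, at least two.
    count = max(2, round(width / pe))
    step = width / count
    cuts = [start + round(i * step) for i in range(count + 1)]
    cuts[-1] = end
    pieces = []
    for i in range(count):
        # Every piece after the first starts one column past the previous cut.
        piece_start = cuts[i] if i == 0 else cuts[i] + 1
        pieces.append((piece_start, cuts[i + 1]))
    return pieces
```

A real implementation would more likely look for a projection minimum near each predicted cut position; the equal-width cut is used here only to keep the example short.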

When an individual character satisfies the condition of equation (4), step (407) separates the touching characters using the one-character estimated pitch (Pe). If the separation condition of equation (4) is not satisfied in step (405), or after step (407) has been performed, step (406) or (408) judges whether touching-character separation has been carried out for all characters. While separation is still in progress, the process returns to step (405) and repeats the separation; when separation is complete, the flow ends.

Preprocessing (4) is then performed on each of the individual characters in the string obtained through processes (100) to (400).

As described above, when Hangul characters are recognized, automatic character string extraction and individual character cutting can be performed easily, and a high individual character extraction rate is obtained, which has the advantage of minimizing the misrecognition rate of Hangul characters.

Claims

1. A method for extracting character strings and cutting out individual characters in a Hangul character recognition apparatus having a scanner that scans an original image to generate binary image data, the method comprising: a first process of counting the number of black pixels produced by line scanning in the scanner and comparing the count with predetermined threshold values to extract one character string; a second process of, after the first process, determining the start point and end point positions of each individual character according to the state of the black pixels in the extracted character string, thereby determining individual character cut candidates; a third process of, after the second process, counting the number of individual character cut candidates and determining the estimated pitch of the individual characters and the estimated pitch of the spaces according to the count; and a fourth process of, after the third process, merging or separating the individual character cut candidates using the estimated pitches to cut out characters of one-Hangul-character size.

2. The method of claim 1, wherein the first process comprises: counting the black points of the scan data of the first line and, when the count exceeds a first threshold level Th1, recognizing that line as the start line L₁ of one character string; after recognizing the start line L₁, counting the black points of the scan data received line by line and storing the received scan data in a string memory while the number of black points is greater than a second threshold level; and, during the storing step, when the number of black points of the received line data reaches the second threshold level Th1, adding the start line value L₁ to the last line value Lₙ of the corresponding character string and terminating the character string extraction process when the character string spans at least a predetermined number of lines.
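The line-by-line thresholding recited in claim 2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `extract_text_line`, the parameters `th1`, `th2`, and `min_lines`, the use of two separate thresholds, and the choice to discard a too-short run as noise and keep scanning are all assumptions made for illustration.

```python
def extract_text_line(rows, th1, th2, min_lines):
    """Return (start_row, end_row) of the first text line, or None.

    rows      : list of scan lines, each a list of 0/1 pixels (1 = black)
    th1, th2  : hypothetical start and end thresholds on the black-point count
    min_lines : hypothetical minimum height of a valid character string
    """
    start = None
    for i, row in enumerate(rows):
        black = sum(row)  # black points in this scan line
        if start is not None and black <= th2:
            # The run of dense lines has ended on the previous row.
            if i - start >= min_lines:
                return (start, i - 1)
            start = None  # assumption: a too-short run is treated as noise
        if start is None and black > th1:
            start = i  # candidate start line L1
    # The image ended while still inside a character string.
    if start is not None and len(rows) - start >= min_lines:
        return (start, len(rows) - 1)
    return None
```

Rows between the returned indices correspond to the data that the claim stores in the string memory.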

3. The method of claim 2, wherein the second process comprises: projecting the data of one character string stored in the string memory in the vertical direction to count the black points in each column of the character string; analyzing the state of the black points counted in the vertical direction to obtain the start point (S₁ to Sₙ₊₁) of each individual character in the character string; and, after obtaining the start points (S₁ to Sₙ₊₁), obtaining the end point (e₁ to eₙ₊₁) of each individual character in the character string.
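The vertical projection in claim 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name `find_character_spans` is hypothetical, and a column is taken to belong to a character whenever it contains at least one black point, which is one simple reading of "the state of the black points".

```python
def find_character_spans(line_rows):
    """Return a list of (start_col, end_col) cut-out candidates.

    line_rows : rows of one extracted character string, each a list of
                0/1 pixels (1 = black), all of the same width.
    """
    width = len(line_rows[0])
    # Vertical projection: number of black points in each column.
    profile = [sum(row[x] for row in line_rows) for x in range(width)]

    spans = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x  # start point S of a candidate
        elif count == 0 and start is not None:
            spans.append((start, x - 1))  # end point e of the candidate
            start = None
    if start is not None:  # candidate touching the right edge
        spans.append((start, width - 1))
    return spans
```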

4. The method of claim 3, wherein the third process comprises: obtaining the number (Cn) of individual character candidates and the number (Spn) of spaces using the start points (S₁ to Sₙ₊₁) and end points (e₁ to eₙ₊₁) of the individual characters; obtaining the width of each individual character by sequentially subtracting the start point (S₁ to Sₙ₊₁) of each individual character from the end point (e₁ to eₙ₊₁) of the same character, and summing these widths to obtain the total horizontal size (Pw) of all individual character candidates; obtaining the width of each space between individual characters by sequentially subtracting the end point (e₁ to eₙ₊₁) of the preceding individual character from the start point (S₂ to Sₙ₊₁) of the next individual character, and summing these space widths to obtain the total horizontal size (Spw) of all spaces; calculating the one-character estimated pitch (Pe) by dividing the total size (Pw) by the number of characters (Cn); and obtaining the estimated space pitch (Spe) by dividing the total space size (Spw) by the number of spaces (Spn).
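The arithmetic of claim 4 reduces to Pe = Pw / Cn and Spe = Spw / Spn. The following Python sketch works it through on (start, end) pairs. The function name `estimate_pitches` is hypothetical, and taking the number of spaces as Spn = Cn − 1 (one gap between each pair of neighbouring candidates) is an assumption, since the claim does not spell out how Spn is counted.

```python
def estimate_pitches(spans):
    """Return (Pe, Spe) for a list of (start, end) cut-out candidates."""
    cn = len(spans)  # Cn: number of individual character candidates
    if cn == 0:
        raise ValueError("at least one candidate is required")
    spn = cn - 1  # Spn: assumed to be one space per adjacent pair

    # Pw: sum of character widths, each width = end point - start point.
    pw = sum(e - s for s, e in spans)
    # Spw: sum of space widths, each = next start - preceding end.
    spw = sum(spans[i + 1][0] - spans[i][1] for i in range(spn))

    pe = pw / cn  # one-character estimated pitch
    spe = spw / spn if spn else 0.0  # estimated space pitch
    return pe, spe
```

For example, candidates at (0, 10), (14, 24), and (27, 37) have widths 10, 10, 10 and gaps 4 and 3, giving Pe = 10.0 and Spe = 3.5.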

5. The method of claim 4, wherein the fourth process comprises: merging two individual characters when the width of the succeeding individual character is larger than the width of the preceding individual character and smaller than […]; merging the two individual characters when their combined horizontal size, obtained by subtracting the start point position value of the preceding individual character from the end point position value of the succeeding individual character, is larger than […] and smaller than […] with respect to the one-character estimated pitch (Pe); merging the succeeding individual character with the preceding individual character when the horizontal size of the succeeding individual character is smaller than […] with respect to the one-character estimated pitch (Pe); and separating the two individual characters when their combined horizontal size, obtained by subtracting the start point position value of the preceding individual character from the end point position value of the succeeding individual character, is larger than […] with respect to the one-character estimated pitch (Pe).