KR19980037632A

KR19980037632A - Document recognition device and method for improving English contact character separation function

Info

Publication number: KR19980037632A
Application number: KR1019960056415A
Authority: KR
Inventors: 고동익
Original assignee: 구자홍; 엘지전자 주식회사
Priority date: 1996-11-22
Filing date: 1996-11-22
Publication date: 1998-08-05
Also published as: KR100210492B1

Abstract

The present invention relates to a document recognition device with an improved English contact character separation function, and a method therefor, which, in an apparatus that recognizes documents such as a document editor or an automatic translation system, classify touching English characters by type so that the contact characters can be separated.

The method of the present invention comprises the steps of: dividing document data input through an image input device into regions and then separating individual characters; determining whether each individual character is Korean or English and whether it is a contact character; if an English contact character is present, classifying it by type, setting a cutting region accordingly, and separating the contact character; and recognizing the separated characters or individual characters.

A device for carrying out the method of the present invention comprises: image input means for converting the contents of a document into an electrical information signal and outputting it; control means for separating Korean and English characters from the information signal input from the image input means and, if English contact characters are present, classifying them by type, allocating a cutting region, separating them, and recognizing the separated characters and individual characters; and memory means for storing the recognition result.

Description

Document recognition device and method for improving English contact character separation function

FIG. 1 is a view showing English contact characters.

FIG. 2 is a block diagram of a conventional document recognition device having an English contact character separation function.

FIG. 3 is an operational flowchart of a conventional document recognition method having an English contact character separation function.

FIG. 4 is an operational flowchart of a document recognition method with an improved English contact character separation function according to an embodiment of the present invention.

* Description of reference numerals for the main parts of the drawings *

10: image input unit

20: key input unit

30: control unit

40: monitor

50: document output unit

The present invention relates to a document recognition device with an improved English contact character separation function and a method therefor, and more particularly to a device and method that, in an apparatus for recognizing documents such as a document editor or an automatic translation system, classify touching English characters by type so that the contact characters can be separated.

Word processors, automatic translation systems, and the like require a document recognition device for recognizing characters input through an image input device. Such document recognition devices use recognition techniques such as the deterministic recognition method and the syntactic recognition method. The former handles the structure of a pattern and the relations among its features through feature detection, and no unified formalism has yet been established for it. The latter selects the components of a pattern as primitives and recognizes the structure of the pattern by syntax rules. Document recognition devices for recognizing English and Korean characters are now in practical use to some extent; when English contact characters such as those shown in FIG. 1 are present, these devices separate them before recognition and thereby raise the recognition rate.

Hereinafter, a conventional document recognition device having an English contact character separation function will be described with reference to the accompanying drawings.

FIG. 2 is a block diagram of a conventional document recognition device having an English contact character separation function. As shown in FIG. 2, the conventional device comprises: an image input unit (10) that uses a scanner, fax machine, or the like to convert the contents of a document into an electrical image signal and output it; a key input unit (20) through which a user inputs commands; a control unit (30) that, in accordance with a command input from the key input unit (20), converts the image signal input from the image input unit (10) into data and recognizes Korean or English characters; a monitor (40) that displays the characters recognized by the control unit (30) on a screen; and a document output unit (50) that uses a printer, copier, or the like to produce a hard copy of the characters recognized by the control unit (30) and then edited.

The operation of the conventional document recognition device configured as above will be described with reference to FIG. 3. When a user inputs a document recognition command through the key input unit (20), the control unit (30) detects it and reads the image information of the document from the image input unit (10), which consists of an imaging device such as a scanner, to acquire document data. In this process, the image input unit (10) converts the document data into an electrical signal using a photoelectric conversion element such as a CCD (Charge-Coupled Device), digitizes the signal, and converts it into binarized data. The control unit (30) performs region segmentation on the document data thus obtained to separate it into text regions and image regions; techniques for doing so include RLSA (Run-Length Smoothing Algorithm). Once the regions are segmented, the control unit (30) extracts the character strings in the text region, cuts out and separates the individual characters in each string, and then determines whether each separated character is Korean or English.
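The source names RLSA but gives no details of it. As a minimal sketch, assuming a binarized image stored as rows of 0 (background) and 1 (foreground), a horizontal smoothing pass might look like the following. The function names and the threshold are illustrative and do not come from the source; a full RLSA implementation typically also runs a vertical pass and combines the two results.

```python
def rlsa_horizontal(row, threshold):
    """Fill background gaps of length <= threshold that lie between
    two foreground pixels in one row of a binary image.

    Simplification (assumption): gaps touching the left or right image
    border are left untouched.
    """
    out = list(row)
    last_fg = None  # index of the most recent foreground pixel
    for i, px in enumerate(row):
        if px == 1:
            gap = i - last_fg - 1 if last_fg is not None else 0
            if 0 < gap <= threshold:
                for j in range(last_fg + 1, i):
                    out[j] = 1
            last_fg = i
    return out


def rlsa(image, threshold):
    """Apply the horizontal smoothing pass to every row."""
    return [rlsa_horizontal(row, threshold) for row in image]


if __name__ == "__main__":
    # The 2-pixel gap is bridged; the 4-pixel gap is kept.
    print(rlsa_horizontal([1, 0, 0, 1, 0, 0, 0, 0, 1], threshold=2))
```

With a suitable threshold, characters on the same line merge into solid blocks, so text regions can be told apart from image regions by the shape of the resulting blocks.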

If a separated character is Korean, the control unit (30) determines whether it is a contact character and, if so, separates it. The control unit (30) then separates the graphemes (jaso) from the separated or individual Korean characters and performs Korean recognition. If a separated character is English rather than Korean, the control unit (30) determines whether it is a contact character and, if it is a contact character as shown in FIG. 1, separates it. The control unit (30) then performs English recognition on the separated or individual English characters. When Korean or English recognition is complete, the control unit (30) stores the recognition result in internal memory, displays it on the screen of the monitor (40), and ends the operation.

The user can edit the characters recognized as described above and then produce a hard copy using the document output unit (50).

However, such a conventional document recognition device has the drawback that, when separating contact characters, it searches the entire touching character region indiscriminately. It also sets the cut-candidate decision region from a rough average, without considering the width of each English letter, which can cause serious errors. When several characters touch, determining how many are actually in contact and within which region separation should be attempted has a great influence on contact character separation and on the character recognition rate.

The present invention is intended to solve these drawbacks and problems of the prior art by preventing errors that can occur when separating English contact characters, thereby improving the character recognition rate.

Accordingly, an object of the present invention is to provide a document recognition device with an improved English contact character separation function, and a method therefor, in which English letters are grouped by type and a separation region is defined in advance for each group, so that when a contact character is present its type is determined and the separation region is set from the pre-stored data, thereby raising the recognition rate.

As a means for achieving the above object, the document recognition device with an improved English contact character separation function of the present invention comprises: image input means for converting the contents of a document into an electrical information signal and outputting it; control means for separating Korean and English characters from the information signal input from the image input means and, if English contact characters are present, classifying them by type, allocating a cutting region, separating them, and recognizing the separated characters and individual characters; and memory means for storing the recognition result.

As a further means for achieving the object of the present invention, the document recognition method with an improved English contact character separation function comprises the steps of: dividing document data input through an image input device into regions and then separating individual characters; determining whether each individual character is Korean or English and whether it is a contact character; if an English contact character is present, classifying it by type, setting a cutting region accordingly, and separating the contact character; and recognizing the separated characters or individual characters.

The above and other objects and advantages of the present invention will become more apparent from the following detailed description of embodiments of the present invention.

Hereinafter, the most preferred embodiment of the present invention will be described with reference to the accompanying drawings, in sufficient detail that a person of ordinary skill in the art to which the invention pertains can readily practice it.

In terms of hardware, the configuration of the document recognition device with an improved English contact character separation function according to the embodiment of the present invention is similar to that of the conventional device shown in FIG. 2, so that figure is used here for convenience. The operation of the device according to the embodiment will be described with reference to FIG. 4.

When a user inputs a document recognition command through the key input unit (20), the control unit (30) detects it and reads the image information of the document from the image input unit (10), which consists of an imaging device such as a scanner, to obtain document data. In this process, the image input unit (10) converts the document data into an electrical signal using a photoelectric conversion element such as a CCD, digitizes the signal, and outputs it as binarized data. The control unit (30) performs region segmentation on the document data obtained from the image input unit (10) using a technique such as RLSA (Run-Length Smoothing Algorithm). Once the document data has been segmented into regions, the control unit (30) extracts the character strings in the text region, separates the individual characters in each string, and then determines whether each separated character is Korean or English.
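The source does not say how individual characters are separated from a character string. One common approach, shown here purely as an assumption, is to split a binarized text line wherever a column contains no foreground pixels. The function names below are illustrative and not taken from the source.

```python
def column_profile(image):
    """Count foreground (1) pixels in each column of a binary image
    given as a list of equal-length rows."""
    return [sum(col) for col in zip(*image)]


def split_on_gaps(image):
    """Return (start, end) column ranges, end exclusive, for each run of
    columns that contains at least one foreground pixel."""
    spans = []
    start = None
    for x, count in enumerate(column_profile(image)):
        if count > 0 and start is None:
            start = x                      # a new ink run begins
        elif count == 0 and start is not None:
            spans.append((start, x))       # the run ended at a blank column
            start = None
    if start is not None:                  # run reaches the right border
        spans.append((start, len(image[0])))
    return spans


if __name__ == "__main__":
    line = [
        [1, 1, 0, 1, 1, 1],
        [1, 0, 0, 1, 0, 1],
    ]
    print(split_on_gaps(line))
```

Two characters that touch have no blank column between them, so this kind of splitting returns them as a single span. Such spans are the contact characters that the subsequent steps must detect and separate.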

If a separated character is Korean, the control unit (30) determines whether it is a contact character and, if so, separates it; it then separates the graphemes (jaso) from the separated or individual Korean characters and performs Korean recognition.

If, however, a separated character is English rather than Korean, the control unit (30) determines whether it is a contact character and, if it is a contact character as shown in FIG. 1, determines which type the contact character corresponds to. The type is determined by judging which of the following character groups the contact character belongs to.

1. Uppercase letters

2. Contact characters such as y, p, and q

3. Lowercase letters with a small aspect ratio (character width / character height), such as b, d, h, k, l, and t

4. Lowercase g

5. Characters with a large aspect ratio (character width / character height), such as W, w, M, and m

6. All other letters
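The six groups above can be sketched as a lookup in Python. This is an illustration under stated assumptions, not the source's procedure: the source does not say how a letter hypothesis is obtained for a touching character, nor which group wins when a letter fits two of them (W and M are both uppercase and wide). Here the explicitly named letters take precedence over the generic uppercase group, and only the letters the source names are placed in groups 2, 3, and 5. All identifiers are hypothetical.

```python
from enum import Enum


class ContactGroup(Enum):
    UPPERCASE = 1         # group 1: uppercase letters
    Y_P_Q = 2             # group 2: y, p, q
    NARROW_LOWERCASE = 3  # group 3: small width/height ratio
    LOWERCASE_G = 4       # group 4: lowercase g
    WIDE = 5              # group 5: large width/height ratio
    OTHER = 6             # group 6: everything else


# Only the letters explicitly named in the source are listed.
WIDE_LETTERS = set("WwMm")
Y_P_Q_LETTERS = set("ypq")
NARROW_LETTERS = set("bdhklt")


def classify_letter(ch):
    """Map a single ASCII letter to one of the six groups.

    Precedence is an assumption: named letters are checked before the
    generic uppercase test, so 'W' and 'M' land in WIDE.
    """
    if len(ch) != 1 or not (ch.isascii() and ch.isalpha()):
        raise ValueError("expected a single ASCII letter")
    if ch in WIDE_LETTERS:
        return ContactGroup.WIDE
    if ch == "g":
        return ContactGroup.LOWERCASE_G
    if ch in Y_P_Q_LETTERS:
        return ContactGroup.Y_P_Q
    if ch in NARROW_LETTERS:
        return ContactGroup.NARROW_LOWERCASE
    if ch.isupper():
        return ContactGroup.UPPERCASE
    return ContactGroup.OTHER


if __name__ == "__main__":
    for letter in "AWgybа"[:5]:
        print(letter, classify_letter(letter).name)
```

Keeping the group tables as plain sets makes it easy to move a letter to another group if a different precedence or membership turns out to be intended.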

Once the type of the contact character has been determined by the above criteria, the control unit (30) allocates a cutting region according to the type and separates the English contact character accordingly. The control unit (30) then performs English recognition on the separated or individual English characters.
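The source does not disclose the cutting regions themselves or how the cut position is chosen inside a region. The following Python sketch therefore rests on two assumptions: each group number (1 to 6, as listed above) maps to a cutting region given as a fraction of the width of the touching blob, and the cut is placed at the column with the fewest foreground pixels inside that region. The fractions in `CUT_REGION` are invented placeholders, and all names are hypothetical.

```python
# Hypothetical cutting regions per group, as (start, end) fractions of
# the blob width. These values are placeholders, not from the source.
CUT_REGION = {
    1: (0.40, 0.60),
    2: (0.35, 0.55),
    3: (0.25, 0.45),
    4: (0.35, 0.55),
    5: (0.55, 0.75),
    6: (0.40, 0.60),
}


def find_cut_column(blob, group):
    """Return the column inside the group's cutting region that has the
    fewest foreground pixels (the first such column on ties)."""
    width = len(blob[0])
    lo_frac, hi_frac = CUT_REGION[group]
    lo = int(lo_frac * width)
    hi = min(width, max(lo + 1, int(hi_frac * width)))  # keep region non-empty
    profile = [sum(col) for col in zip(*blob)]
    return min(range(lo, hi), key=lambda x: profile[x])


def split_blob(blob, group):
    """Split a binary blob (list of rows) into left and right parts at
    the cut column chosen for the given group."""
    cut = find_cut_column(blob, group)
    left = [row[:cut] for row in blob]
    right = [row[cut:] for row in blob]
    return left, right


if __name__ == "__main__":
    blob = [
        [1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
        [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    ]
    print(find_cut_column(blob, 5), find_cut_column(blob, 3))
```

Restricting the search to a per-type region, rather than scanning the whole blob or using one average width, is the point the description makes: in the example blob, a wide-type region and a narrow-type region select different cut columns.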

When Korean or English recognition is complete, the control unit (30) stores the recognition result in internal memory, displays it on the screen of the monitor (40), and ends the operation. The user can edit the characters recognized as described above and then output them using the document output unit (50), such as a printer.

According to the present invention as described above, it is possible to provide a document recognition device with an improved English contact character separation function, and a method therefor, which improve reliability and raise the recognition rate by classifying touching English characters by type and separating them in an apparatus for recognizing documents, such as a document editor or an automatic translation system.

From the foregoing, those skilled in the art will appreciate that various changes and modifications are possible without departing from the technical spirit of the present invention. Therefore, the technical scope of the present invention should not be limited to what is described in the detailed description of the specification, but should be determined by the claims.

Claims

A document recognition device with an improved English contact character separation function, comprising:

image input means for converting the contents of a document into an electrical information signal and outputting it;

control means for separating Korean and English characters from the information signal input from the image input means and, if English contact characters are present, classifying them by type, allocating a cutting region, separating them, and recognizing the separated characters and individual characters; and

memory means for storing the recognition result.

The device of claim 1,

further comprising a monitor for displaying the recognition result on a screen.

The device according to claim 1 or 2,

further comprising document output means for producing a hard copy of the recognition result and the edited content.

A document recognition method with an improved English contact character separation function, comprising the steps of:

dividing document data input through an image input device into regions and then separating individual characters;

determining whether each individual character is Korean or English and, if English contact characters are present, classifying them by type, setting a cutting region, and separating the contact characters accordingly; and

recognizing the separated characters or individual characters.

The method of claim 4,

wherein, when the contact characters are classified by type, the type is set using the following character groups:

1. Uppercase letters

2. Contact characters such as y, p, and q

3. Lowercase letters with a small aspect ratio (character width / character height), such as b, d, h, k, l, and t

4. Lowercase g

5. Characters with a large aspect ratio (character width / character height), such as W, w, M, and m

6. All other letters

The method of claim 4,

further comprising the step of editing the recognized characters on a screen.