KR950001059B1

KR950001059B1 - Korean character address recognition method and apparatus

Info

Publication number: KR950001059B1
Application number: KR1019920015673A
Authority: KR
Inventors: 이성환
Original assignee: 이성환
Priority date: 1992-08-29
Filing date: 1992-08-29
Publication date: 1995-02-08
Also published as: KR940004484A

Abstract

The character recognition method and unit recognize handwritten Korean addresses at high speed and with a high recognition rate by comparing each input character with standard characters selected using a hierarchically constructed dictionary of administrative district names. The recognition method comprises extracting candidate characters using the district name dictionary; recognizing the characters; and processing the next part of the address based on the recognized district name. The recognition rate is above 99 percent.

Description

Korean address recognition method and device

FIG. 1 is a structural diagram of the administrative districts of Korea.

FIG. 2 is a block diagram of the administrative district name dictionary.

FIG. 3 is a flowchart showing a method of extracting candidate characters from the administrative district name dictionary while processing administrative district names corresponding to tier 1 (special city, directly governed city, province), tier 2 (gu, gun, si), and tier 3 (exception gu).

FIG. 4 is a flowchart showing a method of extracting candidate characters from the administrative district name dictionary while processing administrative district names corresponding to tier 4 (eup, myeon, dong) and tier 5 (ri, dong).

FIG. 5 is a flowchart showing the post-processing method for tier 1.

FIG. 6 is a flowchart showing the post-processing method for tier 2.

FIG. 7 is a flowchart showing the post-processing method for tier 3.

FIG. 8 is a flowchart showing the post-processing method for tier 4.

FIG. 9 is a flowchart showing the post-processing method for tier 5.

FIG. 10 is a block diagram of a system according to the present invention.

* Explanation of reference numerals for the main parts of the drawings

1: scanner or tablet 2: address segmentation unit

3: candidate character extraction unit 4: administrative district name dictionary

5: character recognition unit 6: post-processing unit

7: address output unit

[Field of Industrial Application]

The present invention relates to a method and apparatus for recognizing Korean (Hangul) addresses. In online or offline recognition of handwritten Hangul addresses, the wide variety of handwriting styles among writers causes considerable confusion and error at the character recognition stage. The invention therefore relates, in particular, to a method that uses an administrative district name dictionary to markedly reduce the number of comparisons made during recognition, and to a post-processing method that efficiently corrects misrecognition results arising from confusion and errors in the character recognizer. Here, "online" means that the user enters data in real time using a tablet and a stylus pen, and "offline" means that a medium such as paper on which the user has written is entered through a scanner. The Hangul address recognition method and apparatus of the present invention achieve high-speed character recognition and, building on the high recognition rate of the character recognizer, apply post-processing for each administrative district name to enable more accurate address recognition. The present invention is of course not limited to handwritten addresses; it also applies to addresses produced by typing, printing, or other mechanical means. Its advantage is greatest, however, for handwritten addresses, which are particularly difficult to recognize.

[Prior Art and Its Problems]

As far as the present inventor has been able to determine, no related research on address recognition has been reported in Korea. Abroad, three studies on address recognition have been published in Japan. The first is the paper "Post-processing of Handwritten Chinese Character Recognition" by Izaki et al., on pages 97-202 of the proceedings of the International Conference on Text Processing with a Large Character Set, held in October 1983. In that paper, the character recognizer extracts features from each character of the input address, compares them with the features of each standard character, and outputs the characters with the smallest differences as candidate characters. The post-processing unit then weights these candidate characters according to their rank and matches them against the administrative district names in a dictionary to find the most similar candidate address. The second is the paper "Selective Correction of Address Recognition Errors" by Suzuki et al., on pages 95-100 of the proceedings of the International Conference on Computer Processing of Chinese and Oriental Languages, held in April 1990. When the character recognizer outputs candidate characters, this method checks the reliability of recognition by considering both the score of the first-ranked candidate and the difference between the scores of the first-ranked and second-ranked candidates. For characters that may be in error, several candidate characters are output and matched against the administrative district names in the dictionary.

In addition, for the street-number portion of the address, which lies outside the administrative district names and for which the district name dictionary cannot be used, that paper introduced a post-processing method that applies an error rule table and a connection matrix. The third is the paper "An Error Correction Algorithm for Handwritten Chinese Character Address Recognition" by Marukawa et al., on pages 916-924 of the proceedings of the International Conference on Document Analysis and Recognition, held in September 1991. To handle continuously written addresses, it introduced automata to segment the address into administrative districts and then matched the characters against an administrative district name dictionary.

As these papers show, conventional address recognition methods have the character recognizer compare each character of the input address with all standard characters and output several candidate characters, which are then used for post-processing.

Given that the recognition target is an address, such methods are inefficient because they involve very broad and unnecessary comparisons. When the conventional approach is applied to handwritten Hangul address recognition, the current KS C 5601 precomposed (Wansung) code contains 2,350 Hangul characters, so each input character requires 2,350 comparisons. Chinese characters form a far larger character set, so the scope for confusion in character recognition is even greater.

To remedy this inefficiency, the present invention uses the administrative district name dictionary so that only characters that can appear in an address, and more specifically only characters that can appear at each character position within the address, are compared.

The present invention also differs from the methods in the above papers in another respect. Post-processing is not deferred until character recognition of the whole address is complete. Instead, recognition and post-processing are first performed for one tier of administrative district to determine the administrative district name for that tier (hereinafter, the "candidate administrative district name"), and the next tier is then processed on the basis of that candidate administrative district name.

[Object of the Invention]

An object of the present invention is to provide a practical method and apparatus for recognizing Hangul addresses, handwritten Hangul addresses in particular. When the character recognizer compares an input character with standard characters, the set of standard characters to be compared is reduced using the administrative district name dictionary, which improves both recognition speed and recognition rate, and an effective post-processing method is applied.

[Detailed Description]

The present invention is described in detail below with reference to the accompanying drawings.

Before examining the format of the administrative district name dictionary used in the present invention, consider the structure of Korea's administrative districts, which form the hierarchy shown in FIG. 1. Tier 1 consists of the special city (Seoul), the directly governed cities, and the provinces (do). Tier 2 consists of gu (in the special city and the directly governed cities) and si and gun (in the provinces). Tier 4 corresponds to eup, myeon, and dong, and tier 5 consists of the lowest-level districts, ri and dong. As an exception, there are a few cities in which the tier below tier 2 is not eup, myeon, or dong (tier 4) but gu (tier 3, hereinafter "exception gu"). Five areas currently fall into this category, including Bucheon-si in Gyeonggi-do and Masan-si in Gyeongsangnam-do. Accordingly, as shown in FIG. 1, Korea's administrative districts can be divided into five tiers.

Based on this structure, the administrative district name dictionary consists of five dictionaries, as shown in FIG. 2. Dictionary 1 covers the special city, directly governed cities, and provinces; dictionary 2 covers gu, gun, and si; and dictionary 3 covers exception gu. Dictionary 4 holds eup, myeon, and dong, and dictionary 5 holds ri and dong. Dictionary 4 as used in the present invention contains building names as well as eup, myeon, and dong. This is because, to simplify mail sorting in Korea, public institutions and relatively large buildings are each assigned a postal code of their own at the tier 4 level, so that such a building is in effect treated as an administrative district name. Accordingly, to handle building names, the present invention records buildings that have postal codes in the administrative district name dictionary, and the dictionary is also structured so that the user can add building names without postal codes, depending on the intended use.
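The five-dictionary layout described above can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the class `DistrictDictionary`, its methods, and the building name used below are hypothetical and do not appear in the specification, which does not prescribe any storage format.

```python
from dataclasses import dataclass, field

# Tier numbers follow the description: 1 = special city / directly governed
# city / province, 2 = gu / gun / si, 3 = exception gu,
# 4 = eup / myeon / dong (plus building names), 5 = ri / dong.
TIERS = (1, 2, 3, 4, 5)


@dataclass
class DistrictDictionary:
    """Hypothetical container holding one set of names per tier."""
    names: dict = field(default_factory=lambda: {t: set() for t in TIERS})

    def add_name(self, tier: int, name: str) -> None:
        if tier not in self.names:
            raise ValueError(f"unknown tier: {tier}")
        self.names[tier].add(name)

    def add_building(self, name: str) -> None:
        # Building names are stored alongside eup/myeon/dong in dictionary 4.
        self.add_name(4, name)


d = DistrictDictionary()
d.add_name(1, "충청북도")
d.add_name(2, "청원군")
d.add_name(4, "강내면")
d.add_name(5, "황탄리")
d.add_building("예시빌딩")  # made-up building name, for illustration only
```

Keeping building names in the same set as the tier 4 names means the later extraction and matching steps need no special case for them, which is one plausible reading of why the description stores them in dictionary 4.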

Allowing building names to be entered in the administrative district dictionary is very convenient, given that people write addresses in different ways. Moreover, since housing is increasingly built as apartment complexes, recording such apartments in the dictionary further improves convenience and usefulness. If entering apartments in dictionary 4 would make the dictionary too large or too difficult to manage, a separate dictionary can be created instead. A separate dictionary built for apartments, buildings without postal codes, and the like may be called a "special gu" dictionary. As the following description will show, special gu can also be handled easily by the processing of the present invention.

An address entered by a writer lists the administrative district names hierarchically, as in the following example.

충청북도 청원군 강내면 황탄리 (Chungcheongbuk-do Cheongwon-gun Gangnae-myeon Hwangtan-ri)

As this example shows, the first administrative district name in an address belongs to tier 1, that is, a special city, directly governed city, or province. The names that follow are arranged in order, starting from tier 1. The present invention therefore assumes that the address to be recognized is written in this format. In the description below, each individual part (one of tiers 1 to 5) of a complete address (an address made up of tiers 1 through 5) is called an "address word". With tier 1 as the highest tier, address recognition proceeds from higher tiers to lower tiers, so it begins with the first address word. Before recognition begins, the candidate character extraction unit extracts the candidate characters that can appear at each position of the address word to be recognized. That is, the candidate characters that can occur at the first position of the first address word are extracted using dictionary 1, which holds the tier 1 administrative district names. This is the point of difference from the conventional method, in which each input character is compared with every standard character.

FIG. 3 is a flowchart showing a method of extracting candidate characters from the administrative district name dictionary while processing administrative district names corresponding to tiers 1 to 3. The dictionary is used to extract the characters that can occur at each character position of an address word. First, the dictionary holding the administrative district names for the address word currently being processed (an address word of tier 1, 2, or 3) is accessed (step 10). Next, the number of characters in that address word is counted (step 12). As shown in step 12, this count is the number of characters to be recognized, and it determines up to which position characters are extracted from each administrative district name searched in the dictionary. For the example address above, the characters that can appear at the position of "충" in "충청북도" (Chungcheongbuk-do) are 15 characters such as "강" (강원도), "경" (경기도, 경상북도, 경상남도), ..., "제" (제주도), where 15 is the number of distinct characters that occur in the first position of the current special city, directly governed city, and province names. The characters that can appear at the position of "청" are "원" (강원도), "상" (경상남도, 경상북도), ..., "주" (광주직할시, 제주도), and so on.

In this way, the candidate characters that can appear at each of the four positions of the address word "충청북도" are extracted using the administrative district name dictionary. As the above shows, to recognize "충", the first character of the tier 1 address word, the character recognizer compares the input not with the 2,350 standard characters of the current KS C 5601 precomposed code but with only 15 standard characters, so the number of comparisons falls to about 15/2,350 of the original (roughly 0.6%). In steps 14 and 16, a DO loop is repeated to extract the candidate characters from the administrative district name dictionary, and the resulting set of candidate characters is stored in an array A[j] (step 18). Array A, which holds the candidate characters, is then passed to the character recognition unit (5 in FIG. 10) (step 20). The character recognizer compares each character of the input address word with the candidate characters extracted by the candidate character extraction unit and outputs the most similar characters in rank order (step 22). To avoid confusion with the candidate characters extracted by the candidate character extraction unit, the characters output at this stage are called "ranked candidate characters".
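The per-position extraction of steps 10 to 18 can be sketched as follows. This is a minimal illustration, not the flowchart of FIG. 3 itself: the function name `extract_candidates` and the list `TIER1_SAMPLE` are hypothetical, and the sample holds only a few tier 1 names, so it yields fewer than the 15 first-position characters mentioned in the description.

```python
def extract_candidates(names, word_length):
    """For each position 0..word_length-1, collect the characters that occur
    at that position in any dictionary name (the array A[j] of step 18)."""
    candidates = [set() for _ in range(word_length)]
    for name in names:
        # Only the first word_length positions of each name are examined.
        for j, ch in enumerate(name[:word_length]):
            candidates[j].add(ch)
    return [sorted(s) for s in candidates]


# Partial, illustrative subset of dictionary 1 (tier 1 names).
TIER1_SAMPLE = ["강원도", "경기도", "경상북도", "경상남도",
                "충청북도", "충청남도", "제주도"]

# The input address word "충청북도" has four characters.
A = extract_candidates(TIER1_SAMPLE, 4)
```

With this sample, `A[0]` holds the four distinct first characters of the listed names, and the recognizer would compare the first input character against those characters only, rather than against all 2,350 standard characters.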

Table 1 shows an example of the ranked candidate characters output. In this example, for the input character "충", the character recognizer selected "충" as the most similar (rank 1), followed by "경" at rank 2, and ranked candidate characters are considered in this way down to rank n. For the second input character "청", "전" was recognized as the most similar at rank 1, and ranked candidate characters are output for the subsequent ranks in the same way as for "충".

[Table 1]

Ranked candidate characters for an address word recognized by the character recognizer

In Table 1, the rank 1 recognition result for the input address word "충청북도" is "충전특도". Since this is not the result the user wants, post-processing is performed using the candidate characters up to rank n output by the character recognizer. FIG. 5 is a flowchart of the post-processing method for tier 1 administrative district names. Post-processing matches the candidate characters shown in Table 1 against the administrative district names in the dictionary. Because the first address word of the input address, "충청북도", belongs to tier 1, its post-processing uses dictionary 1 for character matching.

In the process of FIG. 5, step 60 reads from the administrative district name dictionary the characters that can appear at each character position of the address word being processed (tier 1) and passes them to the character recognizer (3 in FIG. 10). In the next step (62), ranked candidate characters up to rank n are received from the character recognizer, which performs recognition based on the per-position candidate characters extracted by the candidate character extraction unit. How many ranks of candidate characters to extract and receive is decided by considering the cumulative recognition rate of the character recognizer. The character recognizer used in the present invention showed a cumulative recognition rate of 98% or more up to rank 5. This means that, for a given character, the recognizer generally outputs the character the user intended within its top five ranks. The value of n is thus determined by the performance of the character recognizer; when that performance is low, a larger n can be used in post-processing to compensate. In step 64, character matching is performed against the special city, directly governed city, and province names in the administrative district name dictionary.

When matching, only administrative district names with the same number of characters as the input are used as matching targets. The reason is that people write the same district in various forms, such as 서울, 서울시, 서울특별시, 충청북도, or 충북, and it is inefficient to match addresses written in these various forms against only the official administrative district names (서울특별시, 충청북도, and so on). This problem can be solved easily by cataloguing the various written forms of tier 1 administrative district names and reflecting them when dictionary 1 is built. Then, in step 66, the candidate administrative district names up to rank k selected from the dictionary by the matching process are stored together with the differences they obtained in matching. This is needed to introduce the concept of backtracking, which is described in detail later. The value of k, that is, how many ranks of candidate administrative district names to consider, depends on the rank at which the district name the user intended falls among the candidate names selected by matching the ranked candidate characters output by the character recognizer against the dictionary. The value of k is thus determined by the performance of the character recognizer, and k is increased when that performance is low. In the present invention, k is set to 3, so candidate administrative district names up to rank 3 are considered. This completes the post-processing of tier 1.
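The matching of steps 64 and 66 can be sketched as below. The description does not state here how the "difference" is computed, so the measure in this sketch is an assumption: each position contributes the rank index of the dictionary character among the ranked candidates, or a fixed penalty equal to the longest ranked list when the character is absent. The function name `match_top_k` is hypothetical. In the sample data, only the rank 1 characters ("충", "전", "특", "도") and the rank 2 character "경" come from the description of Table 1; the remaining entries are invented for illustration.

```python
def match_top_k(ranked, names, k=3):
    """Match ranked candidate characters against dictionary names.

    ranked[pos] is the recognizer's ranked candidate list for position pos.
    Returns up to k (difference, name) pairs, smallest difference first.
    The difference measure is an assumed one, not taken from the patent.
    """
    penalty = max(len(r) for r in ranked)
    scored = []
    for name in names:
        # Only names with the same number of characters as the input.
        if len(name) != len(ranked):
            continue
        diff = 0
        for pos, ch in enumerate(name):
            diff += ranked[pos].index(ch) if ch in ranked[pos] else penalty
        scored.append((diff, name))
    scored.sort()
    return scored[:k]


# Rank 1 reads "충전특도"; lower ranks are partly illustrative (see lead-in).
ranked = [
    ["충", "경", "강"],
    ["전", "청", "상"],
    ["특", "북", "남"],
    ["도", "별"],
]
names = ["충청북도", "충청남도", "경상북도", "강원도"]
result = match_top_k(ranked, names, k=3)
```

Under these assumptions "충청북도" obtains the smallest difference even though the rank 1 string was "충전특도", and the stored (difference, name) pairs are what a later backtracking step would consult.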

Once a candidate administrative district name has been selected by recognition and post-processing of tier 1, the next address word is processed on the basis of that name. That is, if "충청북도" is selected as the rank 1 candidate administrative district name for tier 1, the next address word is assumed to be a tier 2 administrative district belonging to "충청북도", and dictionary 2 is accessed at the location holding the administrative district names in the tier below "충청북도". Such access can be provided by storing in the dictionary, together with each administrative district name, a pointer linking the upper tier to the lower tier that belongs to it. The lower-tier districts belonging to "충청북도" are "청주시" (Cheongju-si), "충주시" (Chungju-si), "청원군" (Cheongwon-gun), and so on.

An address word of tier 2 is processed in the same way as the tier 1 address word. Using the method described for FIG. 3, the candidate characters that can appear at each character position of the address word are extracted using dictionary 2, and recognition and post-processing are then performed on the basis of the extracted characters. If the resulting rank 1 candidate administrative district name is "청원군", the next address word is regarded as tier 4, and dictionary 4 is accessed at the location holding the administrative district names below "청원군" to process the tier 4 address word.

If recognition and post-processing of the tier 2 address word select an exceptional area such as "부천시" (Bucheon-si) or "마산시" (Masan-si) as the candidate administrative district name, the next address word is regarded as tier 3, and dictionary 3 is accessed at the location holding that area's lower-level administrative district names to process tier 3. Whether processing should proceed from tier 2 to tier 3 and then to tier 4 (that is, whether the district name has exception gu) can be resolved easily by using a pointer for administrative district names that have exception gu; a separate pointer value linking tier 2 to tier 3 is maintained. Candidate characters for tiers 2 and 3 are extracted in this way, following the process of FIG. 3.
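The tier-to-tier descent described above can be sketched as follows. This is a rough illustration under assumptions: the table `ENTRIES`, the flag `has_exception_gu`, and the functions `next_tier` and `next_candidates` are hypothetical, plain Python lists stand in for the pointers, and "OO구" is a placeholder rather than a real district name.

```python
# Hypothetical dictionary entries: each name stores its tier and "pointers"
# (here, plain lists) to the names directly below it.
ENTRIES = {
    "충청북도": {"tier": 1, "children": ["청주시", "충주시", "청원군"]},
    "청원군": {"tier": 2, "children": ["강내면"]},
    "강내면": {"tier": 4, "children": ["황탄리"]},
    "경기도": {"tier": 1, "children": ["부천시"]},
    # A tier 2 city with exception gu carries a separate tier 2 -> tier 3 link.
    "부천시": {"tier": 2, "children": ["OO구"], "has_exception_gu": True},
}


def next_tier(name):
    """Return the tier assumed for the address word that follows `name`."""
    entry = ENTRIES[name]
    tier = entry["tier"]
    if tier == 1:
        return 2
    if tier == 2:
        return 3 if entry.get("has_exception_gu") else 4
    if tier == 3:
        return 4
    if tier == 4:
        return 5
    return None  # tier 5 is the lowest tier


def next_candidates(name):
    """Tier and dictionary names to use for the next address word."""
    return next_tier(name), ENTRIES[name]["children"]
```

For example, `next_candidates("청원군")` yields tier 4 with the names below "청원군", while a city flagged as having exception gu leads to tier 3 first.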

The post-processing for layer 2 is shown in FIG. 6. First, the backtracking count is initialized so that the number of backtracking passes applied later can be checked (step 80). In step 82, the characters that can appear at each character position of the address word being processed (layer 2) are read from the administrative district name dictionary and passed to the character recognizer. In step 84, the ranked candidate characters are received from the character recognizer up to rank n. In step 86, they are character-matched against the gu, gun, and si names stored in the administrative district name dictionary; only district names with the same number of characters as the input, or one character more, are matched. In step 88, the candidate administrative district names selected from the dictionary up to rank k are stored together with the differences they obtained in matching; as noted for FIG. 5, this is needed to support backtracking. In step 90, the difference obtained by the first-rank candidate administrative district name is compared with the threshold. If that difference is larger than the threshold, step 92 checks how many times backtracking has been performed so far in processing layer 2.

In the present invention, backtracking is performed at most k times. The maximum is set to k so that every candidate administrative district name up to rank k, selected by matching the ranked candidate characters against the names in the administrative district name dictionary, can be reached. The number of backtracking passes is thus based on the value of k used in step 88. When the performance of the character recognizer is low, k can be increased, so that more candidate administrative district names are considered and correspondingly more backtracking passes are applied, which gives more accurate address recognition. Step 92 checks whether backtracking has been performed k times; if so, processing proceeds to step 98. From step 90, it is determined whether the first-rank district name is an exceptional district, and processing proceeds to the corresponding branch.

If backtracking has not yet been performed k times in step 92, the backtracking count is incremented (step 94) and backtracking is attempted (step 96). Backtracking takes the next-ranked candidate administrative district name of the upper layer (here, layer 1) stored in step 66, accesses the administrative district name dictionary that holds its lower layer, and repeats steps 82 to 90. The necessity and justification of backtracking in relation to the threshold are discussed in detail later. If backtracking has been performed k times in step 92, processing proceeds to step 98. Step 98 considers the k+1 pairs of upper-layer (layer 1) and lower-layer (layer 2) candidate administrative district names obtained through the k backtracking passes: (the first-rank candidate of layer 1, the first-rank candidate of layer 2), (the second-rank candidate of layer 1, the first-rank candidate of layer 2 produced by the first backtracking pass), ..., (the k-th rank candidate of layer 1, the first-rank candidate of layer 2 produced by the k-th backtracking pass). For each pair, the differences obtained in matching are summed, and the pair with the smallest sum is selected as the candidate administrative district names for the two layers. This corrects for the case in which the difference of the first-rank candidate remains above the threshold throughout matching, and it is discussed again below in connection with backtracking.

FIG. 7 is a flowchart of the post-processing for the exception districts of layer 3. First, the backtracking count is initialized (step 112). In step 114, the characters that can appear at each character position of the address word being processed (layer 3) are read from the administrative district name dictionary and passed to the character recognizer, and in step 116 the ranked candidate characters output by the recognizer are received up to rank n. In step 118, they are character-matched against the exception district names stored in the administrative district name dictionary; again, only district names with the same number of characters as the input, or one character more, are matched. Steps 120 to 130 are the same as in the post-processing of FIG. 6 and are not described again. The only difference from FIG. 6 is the upper layer used in backtracking.

The processing of layer 4 and layer 5 administrative district names is described next. One point needs attention: district names at layer 4 and below can contain the digits 0 to 9, either as a single digit or as two consecutive digits, as in "Sillim 2-dong" (신림2동) or "Sillim 12-dong" (신림12동). Hangul characters are represented in 2 bytes, whereas digits are represented in 1 byte, so the simple candidate character extraction of FIG. 3, which was applied to layers 1 to 3, cannot produce the desired candidates for layer 4 and below. FIG. 3 reads 2 bytes at a time without checking whether it is reading a character or a digit, so candidates are extracted incorrectly whenever a digit occurs. FIG. 4 therefore adds a check: 2 bytes are read for a character and 1 byte for a digit. The method of FIG. 4 could be applied to every layer, but the present invention uses FIG. 3 for address words of layers 1 to 3 and FIG. 4 for address words of layers 4 and 5, because district names at layer 3 and above contain no digits and the check is unnecessary there.

Referring again to FIG. 4, step 30 accesses the administrative district name dictionary that lists the district names that can appear for the address word currently being processed (layer 4 or layer 5). The number of these names is referred to below as the number of district names to be searched in the dictionary. Step 32 counts the characters that make up the address word to be processed (layer 4 or layer 5), referred to below as the number of characters to be recognized. Steps 34 and 38 repeat a DO loop that extracts candidate characters from the administrative district name dictionary; as noted above, digits and characters in a district name must be handled separately (steps 40 to 48). Then, in step 50, the candidate character array A, which holds the characters that can appear at each position of the address word, is passed to the character recognizer, and post-processing begins.
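The variable-width read of FIG. 4 might look like the following minimal sketch. It assumes the dictionary entries are stored in EUC-KR, one encoding in which Hangul takes 2 bytes and digits take 1 byte; the text does not name an encoding. The function names and the entry "봉천동" are illustrative additions, not part of the patent.

```python
from typing import List, Set

ENCODING = "euc-kr"  # assumed 2-byte Hangul encoding; the patent names none


def split_chars(raw: bytes) -> List[str]:
    """Split a dictionary entry into characters: 1 byte per digit, 2 otherwise."""
    chars, i = [], 0
    while i < len(raw):
        width = 1 if 0x30 <= raw[i] <= 0x39 else 2  # ASCII '0'..'9' are digits
        chars.append(raw[i:i + width].decode(ENCODING))
        i += width
    return chars


def extract_candidates(entries: List[bytes], m: int) -> List[Set[str]]:
    """Build candidate array A: the characters seen at each of the m positions."""
    array_a: List[Set[str]] = [set() for _ in range(m)]
    for raw in entries:
        for pos, ch in enumerate(split_chars(raw)[:m]):
            array_a[pos].add(ch)
    return array_a


# Illustrative layer-4 entries; "봉천동" is included only as a second example.
entries = [name.encode(ENCODING) for name in ("신림2동", "신림12동", "봉천동")]
candidates = extract_candidates(entries, 4)
```

Reading a fixed 2 bytes per position, as in FIG. 3, would pair the digit byte with the first byte of the following Hangul character, which is why the per-character check is needed at these layers.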

After layer 4 is processed, any further address word belongs to layer 5, so candidate characters for layer 5 are extracted according to FIG. 4. FIG. 8 is a flowchart of the post-processing for layer 4, and FIG. 9 is a flowchart of the post-processing for layer 5. Both are fundamentally the same as the layer 2 post-processing of FIG. 6 and are not described again. The process of FIG. 8, however, includes a step that determines whether any characters remain to be recognized (step 160) and a step that outputs the administrative districts determined for each layer (step 162). These steps handle cases in which the writer omits layer 5 or cannot write it for lack of space.

The character matching method and the backtracking method proposed in the present invention are now described in detail with reference to Table 2 below.

[Table 2]

Ranked candidate characters and weights used for matching an address word

Here, C₁C₂...Cₘ is an address word consisting of m characters; C₁(1), C₁(2), ..., C₁(n) are the candidate characters for the input character C₁ up to rank n; and W(1), W(2), ..., W(n) are the weights for each rank. If one address word in the dictionary is written S₁S₂...Sₘ, the difference between C₁C₂...Cₘ and S₁S₂...Sₘ is defined as the sum of the differences of the individual characters. For example, if the candidate character C₁(4) matches S₁ (that is, S₁ = C₁(4)), the difference between the input character C₁ and the dictionary character S₁ is W(4). The difference D between an input administrative district name and a dictionary administrative district name can be defined by the following equation:

$$D = \sum_{i=1}^{m} W(a), \qquad S_i = C_i(a)$$

where a = 1, 2, ..., n.

The weight W(α) used in the present invention is defined as W(α) = α - 1, where α is the rank of the candidate character. If S₁ matches none of C₁(1), C₁(2), ..., C₁(n), the weight is W(n + 1).

In this way, the dictionary's candidate administrative district names with the smallest differences from the ranked candidate characters are determined up to rank k. Address recognition then proceeds as described above: the lower layer of the first-rank administrative district name is looked up in the administrative district name dictionary, candidate characters are extracted for the next address word, character recognition is performed on the basis of those candidates, and post-processing is applied to the ranked candidate characters output by the character recognizer.
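The weighting and ranking described above can be sketched as follows. The weights follow the stated definition W(α) = α - 1, so a dictionary character that is not among the top n candidates costs W(n + 1) = n. The function names and the sample recognizer output are assumptions made for illustration.

```python
import heapq
from typing import List, Sequence, Tuple


def char_difference(dict_char: str, ranked: Sequence[str], n: int) -> int:
    """W(a) = a - 1 for a match at rank a; W(n + 1) = n if no match in the top n."""
    top = list(ranked[:n])
    return top.index(dict_char) if dict_char in top else n


def word_difference(dict_word: str, ranked_per_pos: Sequence[Sequence[str]], n: int) -> int:
    """D: sum of the per-character differences over the input positions."""
    # zip stops at the shorter sequence, so a dictionary name that is one
    # character longer than the input is compared on the input positions only.
    return sum(char_difference(s, c, n) for s, c in zip(dict_word, ranked_per_pos))


def top_k(dictionary: List[str], ranked_per_pos: Sequence[Sequence[str]],
          n: int, k: int) -> List[Tuple[int, str]]:
    """Return the k dictionary names with the smallest difference D."""
    scored = [(word_difference(w, ranked_per_pos, n), w) for w in dictionary]
    return heapq.nsmallest(k, scored)


# Made-up recognizer output for a three-character word, ranks 1 to 3 per position.
ranked = [["청", "정", "창"], ["원", "윈", "월"], ["군", "굼", "균"]]
best = top_k(["청원군", "청주시", "창원군"], ranked, n=3, k=2)
```

In this example "청원군" matches at rank 1 in every position (D = 0), "창원군" matches its first character at rank 3 (D = 2), and "청주시" misses two positions (D = 3 + 3 = 6).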

However, fixing the first-rank administrative district and then processing its subordinate districts can lead to wrong results. This happens when two or more dictionary district names obtain the same difference in matching, or when the candidate characters produced by the recognizer match a different district name more closely than the correct one. To solve this problem, a threshold on the difference and backtracking are introduced. The threshold depends on the performance of the character recognizer and on the number of characters in the address word being processed: it is set inversely proportional to the recognition rate of the recognizer and proportional to the number of characters in the address word. As mentioned at the outset, the character recognizer of this system compares each input character only against a restricted character set, so its performance is very high. A very small threshold is therefore chosen, which further raises the accuracy of post-processing. In the present invention the threshold is preferably 2, on the grounds that candidate characters are extracted first and matching is based on them, which naturally raises the recognition rate of the recognizer, and that Korean administrative district names average 3 to 4 characters.

Let the layer currently being processed be layer L. If the difference of the first-rank candidate administrative district name, selected by matching the candidate characters against the dictionary's district names, is at or below the threshold, processing proceeds to layer L+1; otherwise the next rank in layer L-1 is considered and layer L is processed again. This method of searching the upper layer again is called backtracking. The threshold and backtracking introduced above were also mentioned in the paper "Post-processing of Handwritten Chinese Character Recognition," presented by Izaki et al. in October 1983 at the International Conference on Text Processing with a Large Character Set, but they differ somewhat from the method applied in the present invention. Izaki et al. observe that the final address recognition result can be wrong when the input address word is not in the dictionary, or when heavily distorted handwriting means that the characters of the input address word are not among the ranked candidate characters output by the recognizer. They therefore introduce a threshold: if the difference is at or below the threshold, the next layer is processed; otherwise the first-rank candidate characters are output as the candidate administrative district name.

In addition, Izaki et al. introduced backtracking to resolve the confusion that arises in selecting a candidate administrative district name when two or more district names have the same difference for some layer L.

That is, in the method of Izaki et al., when the layer in which the problem occurred is layer L, the lower layer L+1 is processed first, and among the layer L district names that obtained the same difference, the one that is the parent of the first-rank candidate selected in layer L+1 is chosen as the candidate administrative district name for layer L. In the present invention, by contrast, the threshold serves as a criterion for checking whether the upper-layer district name may have been selected wrongly, which is suspected when the difference of the first-rank candidate for layer L is at or above the threshold, and backtracking is applied to carry out this check. Backtracking in the present invention differs from that of the paper in that it does not process the lower layer but instead considers the next rank of the upper layer and processes layer L again.

If the difference of the first-rank candidate administrative district name for layer L remains at or above the threshold even after the threshold check and backtracking are repeated, the differences obtained by the candidate administrative district names for the upper layer L-1 are taken into account as well. After k backtracking attempts, k+1 candidate addresses are available for the two layers. To decide the candidate names for both layers, the differences obtained in character matching are summed for each candidate, and the pair of district names with the smallest sum is fixed as the candidate administrative district names for layer L and layer L-1. Once the district name for layer L has been selected in this way, layer L+1 is processed on that basis.
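The threshold check, the backtracking loop, and the final choice by combined difference can be sketched as below. This is a minimal illustration under stated assumptions, not the flowchart itself: `match_lower` is a hypothetical callback that returns the first-rank layer L candidate under a given layer L-1 name, the names in the example are placeholders, and a candidate is accepted when its difference does not exceed the threshold (the text states this boundary both as "at or below" and as "at or above").

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, int]  # (district name, difference D)


def resolve_two_layers(
    upper_ranked: List[Candidate],
    match_lower: Callable[[str], Candidate],
    k: int,
    threshold: int = 2,
) -> Tuple[str, str]:
    """Pick names for layers L-1 and L, backtracking at most k times.

    upper_ranked: non-empty ranked candidates for layer L-1.
    match_lower:  hypothetical callback giving the first-rank layer L
                  candidate found under a given layer L-1 name.
    """
    tried: List[Tuple[int, str, str]] = []
    for upper_name, upper_diff in upper_ranked[: k + 1]:
        lower_name, lower_diff = match_lower(upper_name)
        if lower_diff <= threshold:
            return upper_name, lower_name  # accept and go on to layer L+1
        tried.append((upper_diff + lower_diff, upper_name, lower_name))
    # Every attempt exceeded the threshold: take the smallest combined difference.
    _, upper_name, lower_name = min(tried, key=lambda t: t[0])
    return upper_name, lower_name


# Placeholder data: the first-rank upper candidate leads to a poor lower match.
upper = [("upper-A", 0), ("upper-B", 1), ("upper-C", 2)]
lower = {"upper-A": ("lower-A", 5), "upper-B": ("lower-B", 1), "upper-C": ("lower-C", 4)}
chosen = resolve_two_layers(upper, lower.__getitem__, k=2)
```

Here the first attempt fails the threshold (5 > 2), one backtracking pass moves to "upper-B", and its lower match (difference 1) is accepted.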

Considering the two layers together in this way leads to the correct result even when two or more dictionary district names obtain the same difference in some layer, and even when a recognizer error causes a wrong district name to be fixed as the candidate for a layer so that the next layer is processed on the basis of that wrong name.

In applying the word matching method above, only district names with the same number of characters as the input were taken as candidate addresses for layer 1, whereas for the lower layers district names with the same number of characters or one character more were selected. The reason is that the habit of omitting the character that designates the type of district (do, si, teukbyeolsi, and so on) can be handled without expanding the dictionary, simply by outputting only as many characters as were input. For layer 1, however, the range of omission is wide and the number of district names is very small, so the dictionary was expanded instead. For example, "서울" (Seoul) and "서울시" (Seoul-si) were added to the dictionary for "서울특별시" (Seoul Teukbyeolsi).
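The length rule above amounts to a simple filter applied before matching. The sketch below is illustrative: the function name is an assumption, and "서귀포시" is added to the sample list only to show a name that is too long for a two-character input.

```python
from typing import List


def eligible_names(dictionary: List[str], m: int, layer: int) -> List[str]:
    """Keep the names that may match an m-character input word.

    Layer 1: same length only (the dictionary already lists the short forms).
    Other layers: same length, or one character longer (suffix omitted).
    """
    allowed = {m} if layer == 1 else {m, m + 1}
    return [name for name in dictionary if len(name) in allowed]


layer1 = ["서울특별시", "서울시", "서울"]
layer2 = ["청원군", "부천시", "마산시", "서귀포시"]

short_form = eligible_names(layer1, 2, layer=1)     # writer wrote "서울"
suffix_dropped = eligible_names(layer2, 2, layer=2)  # e.g. "청원" without "군"
```

For the names that are one character longer than the input, only the first m characters take part in matching, which is how an omitted suffix is tolerated without adding short forms to the lower-layer dictionaries.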

This dictionary lookup shows that the characters that can appear at any particular position of an address are extremely limited. For example, the characters that can appear at the third position or later of a layer 3 district name are "읍" (eup), "면" (myeon), "동" (dong), "가" (ga), the digits 1 to 9, and so on. Within some administrative districts, the third character of layer 3 may even be "동" alone. Character recognition is pointless in such a case, so when only one character is extracted from the administrative district name dictionary, the system according to the present invention places that character directly in the candidate character array without passing through the character recognition unit. This approach also shows that building an administrative district name dictionary that covers the Korean administrative district names without omission is of the utmost importance.
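The single-candidate shortcut can be sketched as a guard in front of the recognizer. The names `rank_candidates` and `fake_recognizer` and the recognizer's call signature are hypothetical and stand in for whatever interface the character recognition unit actually offers.

```python
from typing import Callable, List, Sequence


def rank_candidates(
    candidates: Sequence[str],
    recognizer: Callable[[object, Sequence[str]], List[str]],
    glyph: object,
) -> List[str]:
    """Return ranked candidates, skipping recognition when only one is possible."""
    if len(candidates) == 1:
        return [candidates[0]]  # nothing to decide, so do not call the recognizer
    return recognizer(glyph, candidates)


calls = []


def fake_recognizer(glyph, candidates):
    """Stand-in recognizer: records the call and returns the candidates as given."""
    calls.append(tuple(candidates))
    return list(candidates)


single = rank_candidates(["동"], fake_recognizer, glyph=None)
several = rank_candidates(["읍", "면", "동"], fake_recognizer, glyph=None)
```

After these two calls, `calls` holds a single entry, because the one-candidate position never reached the recognizer.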

FIG. 10 is a schematic block diagram of the address recognition apparatus according to the present invention. In an online address recognition apparatus, Hangul address data is read through a tablet (1) as the input device; in an offline apparatus, it is read by a scanner (1). The Hangul address that has passed through the input device is merely binary data, so it must be split into characters before a character recognition algorithm can be applied. The input binary data is therefore split into characters in the address segmentation unit (2). For an offline apparatus that uses a scanner, the typical segmentation method projects all pixels stored by the scanner onto the x-axis or y-axis and uses the distribution of black pixels. A conventional address recognition apparatus starts character recognition immediately after segmentation. As FIG. 10 shows, the distinctive feature of the present invention is that, before character recognition, the candidate characters that can appear at each position of the address are extracted from the administrative district name dictionary (4) and passed to the character recognition unit. Narrowing the range of possible characters with the candidate character extraction unit in this way makes fast and accurate address recognition possible. Everything up to the next blank is regarded as one administrative district name; the characters that can appear at each position of the input word are extracted by the methods described in the flowcharts of FIG. 3 and FIG. 4, and the word then enters the character recognition unit. The character recognition unit (5) compares the features of each input character with the features of the characters extracted by the candidate character extraction unit (3) and outputs the most similar characters in rank order. The post-processing unit (6) uses the candidate characters output by the character recognition unit (5) to perform character matching against the administrative district name dictionary (4) and retains several candidate administrative district names for that layer.
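The projection-based segmentation mentioned above can be illustrated with a minimal sketch. It assumes the scanned address is a binary image given as rows of 0 (white) and 1 (black) and splits it wherever a column contains no black pixels; the function names and the tiny sample image are illustrative, and a practical segmenter would need extra handling for gaps inside a single Hangul character.

```python
from typing import List, Tuple


def column_projection(image: List[List[int]]) -> List[int]:
    """Project the pixels onto the x-axis: count the black pixels per column."""
    return [sum(column) for column in zip(*image)]


def split_by_projection(image: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, of runs with black pixels."""
    profile = column_projection(image)
    segments, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # a run of ink begins
        elif count == 0 and start is not None:
            segments.append((start, x))    # an empty column ends the run
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments


image = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
]
segments = split_by_projection(image)
```

The same routine applied to the transposed image gives the projection onto the y-axis, which separates lines rather than characters.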

When one layer has been processed, the lower layer is accessed on the basis of the candidate administrative district names produced by the post-processing unit (6). The threshold and backtracking are applied in this post-processing unit. Character matching, the use of the threshold, and the backtracking method were described in detail above, and the differences between the layers in applying post-processing are shown in detail in the flowcharts of FIGS. 5 to 9. Layer 1 has no upper layer, so it simply retains the ranked district names without applying the threshold and moves on to the lower layer. Layer 2, after determining the candidate administrative district name, checks whether that name is an exceptional region such as Bucheon-si in Gyeonggi-do or Ulsan-si in Gyeongsangnam-do; if it is, layer 3 is processed, and otherwise layer 4 is processed. Layer 4 checks whether any address word remains to be recognized, because layer 5 may be missing even in regions where it exists: the writer may not know the layer 5 name, may habitually omit it because the division is too fine, or may have run out of space. When the last word of the input address has been processed in this way, processing ends at the result output unit (7).

The present invention concerns a method and apparatus for online or offline Hangul address recognition, but it can also be applied to spoken address recognition, a branch of speech recognition that has recently attracted attention. This can be done by constructing the dictionary in the form shown in FIG. 10 and then passing the spoken address data input through a speaker through syllable-by-syllable speech recognition.

The effect of the present invention is that, in Hangul address recognition, the range of characters to be compared is narrowed by extracting candidate characters first; the recognized candidates are then character-matched against the administrative district name dictionary to produce candidate administrative district names, which are post-processed in turn. This greatly increases the recognition speed per character and also yields a comparatively high recognition rate. In addition, misrecognitions by the character recognizer are corrected by post-processing that uses a threshold and backtracking, so that an address recognition rate of 99% or more can be achieved for an input address.

Claims

1. A Hangul address recognition method comprising: a candidate character extraction process in which a candidate character extractor extracts candidate characters from administrative district name dictionaries; a post-processing process of character-matching the ranked candidate characters, which a character recognizer recognizes and outputs on the basis of the candidate characters extracted by the candidate character extractor, against the administrative district names listed in the corresponding administrative district name dictionary; and a candidate administrative district name selection process of selecting a candidate administrative district name through recognition, by the character recognizer, of the input address word to be recognized and through the post-processing process, wherein the next address word is processed on the basis of the candidate administrative district name.

2. The Hangul address recognition method of claim 1, wherein the candidate character extraction process extracts the candidate characters from the administrative district names stored in the administrative district name dictionary corresponding to the address word to be recognized, according to the position of each character in that address word, and the post-processing process is a character matching process that uses the ranked character candidates, up to rank n, output by the character recognizer on the basis of the candidate characters extracted in the candidate character extraction process.

3. The Hangul address recognition method of claim 2, wherein the value n is determined in consideration of the cumulative recognition rate of the character recognizer used.

4. The Hangul address recognition method of any one of claims 1 to 3, wherein the administrative district name dictionaries comprise administrative district name dictionary 1 for layer 1, consisting of special cities, directly governed cities, and provinces (do); administrative district name dictionary 2 for layer 2, consisting of districts (gu), counties (gun), and cities (si); administrative district name dictionary 3 for layer 3, consisting of exceptional districts (gu); administrative district name dictionary 4 for layer 4, consisting of eup, myeon, and dong; and administrative district name dictionary 5 for layer 5, consisting of ri.

5. The Hangul address recognition method of claim 4, wherein administrative district name dictionary 4 includes the names of buildings that have a postal code and, depending on the user's purpose, also includes the names of buildings that have no postal code.

6. The Hangul address recognition method of claim 5, wherein administrative district name dictionary 1 for layer 1 holds variant forms of district names in advance, such as alternative written forms of Seoul, so that the post-processing process for layer 1 matches only those administrative district names that have the same number of characters as the layer-1 address word to be recognized.

7. The Hangul address recognition method of claim 6, wherein the post-processing process for layer 1 stores the candidate administrative district names, up to rank k, selected from administrative district name dictionary 1 as a result of the matching process, together with the difference between the ranked candidate characters and each candidate administrative district name, and wherein k is determined by how many ranks of the candidate administrative district names, selected by character-matching the ranked candidate characters output by the character recognizer against the names in administrative district name dictionary 1, are to be retained.

8. The Hangul address recognition method of claim 7, wherein the post-processing process for each of layers 2 to 5 matches only those administrative district names whose number of characters equals, or exceeds by one, the number of characters in the address word to be recognized for that layer.

9. The Hangul address recognition method of claim 8, wherein, if the result of processing the layer-2 address word corresponds to an exceptional region, layer 3 for that exceptional region is processed, and otherwise layer 4 is processed.

10. The Hangul address recognition method of claim 9, wherein each post-processing process for layers 2 to 5 stores the candidate administrative district names, up to rank k, selected from the administrative district name dictionary of the corresponding layer as a result of the matching process, together with the difference between the ranked candidate characters and each candidate administrative district name, and wherein k is determined by how many ranks of the candidate administrative district names, selected by character-matching the ranked candidate characters output by the character recognizer against the names in that dictionary, are to be retained.

11. The Hangul address recognition method of claim 10, wherein, in processing any layer L other than layer 1, each post-processing process for layers 2 to 5 proceeds to layer L+1 if the difference obtained for the first-ranked candidate administrative district name, in matching the ranked candidate characters output by the character recognizer against the names in the corresponding administrative district name dictionary, is less than a threshold, and otherwise uses backtracking to reprocess layer L with the next rank taken into account.

12. The Hangul address recognition method of claim 11, wherein, if the difference obtained for the first-ranked candidate administrative district name in matching the ranked candidate characters against the names in the corresponding administrative district name dictionary remains greater than or equal to the threshold even after repeated backtracking, and backtracking has been performed k times, each post-processing process for layers 2 to 5 considers the k+1 candidate administrative district names for each of the upper and lower layers, adds the differences obtained in the respective matching processes, determines the pair of candidate names with the smallest sum as the candidate administrative district names for layer L and layer L-1, and processes layer L+1 on the basis of the name selected for layer L.

13. A Hangul address recognition apparatus comprising: an address divider that divides an address word input through an input device into character units; administrative district name dictionaries organized according to the hierarchy of administrative districts; a candidate character extractor that extracts candidate characters from the administrative district name dictionaries; a character recognizer that performs recognition using, as matching targets, only the character to be recognized in the address word and the candidate characters extracted by the candidate character extractor; a post-processing unit that selects candidate administrative district names by character-matching the ranked candidate characters against the administrative district names in the corresponding administrative district name dictionary; and a result output unit that outputs the recognized address.

14. The Hangul address recognition apparatus of claim 13, wherein the candidate characters are extracted from the administrative district names held in the administrative district name dictionary corresponding to the address word to be recognized, according to the position of each character in that address word, and the post-processing is character matching that uses the ranked character candidates, up to rank n, output by the character recognizer on the basis of the extracted candidate characters.

15. The Hangul address recognition apparatus of claim 14, wherein the administrative district name dictionaries comprise administrative district name dictionary 1 for layer 1, consisting of special cities, directly governed cities, and provinces (do); administrative district name dictionary 2 for layer 2, consisting of districts (gu), counties (gun), and cities (si); administrative district name dictionary 3 for layer 3, consisting of exceptional districts (gu); administrative district name dictionary 4 for layer 4, consisting of eup, myeon, and dong; and administrative district name dictionary 5 for layer 5, consisting of ri.

16. The Hangul address recognition apparatus of claim 15, wherein administrative district name dictionary 4 includes the names of buildings that have a postal code and, depending on the user's purpose, also includes the names of buildings that have no postal code.

17. The Hangul address recognition apparatus of claim 16, wherein administrative district name dictionary 1 for layer 1 includes variant forms of district names, such as alternative written forms of Seoul.