KR20130024380A

KR20130024380A - Method for searching postal address including error using tree structure and apparatus thereof

Info

Publication number: KR20130024380A
Application number: KR1020110087803A
Authority: KR
Inventors: 설재민; 나동길
Original assignee: 한국전자통신연구원
Priority date: 2011-08-31
Filing date: 2011-08-31
Publication date: 2013-03-08

Abstract

PURPOSE: A method and an apparatus are provided that correct errors in a postal address by calculating its similarity to the nodes of a tree structure and determining the standard postal address based on that similarity. CONSTITUTION: A calculation unit (120) calculates, for each item of an input postal address, the similarity to the character string of each node in a preset tree structure. A selection unit (130) selects the node with the highest similarity. A determination unit (140) determines the standard postal address that includes the character string of the selected node. A storage unit (150) stores the tree structure, which hierarchically represents the items of standard postal addresses as nodes. [Reference numerals] (110) Input unit; (120) Calculation unit; (130) Selection unit; (140) Determination unit; (150) Storage unit

Description

METHOD FOR SEARCHING POSTAL ADDRESS INCLUDING ERROR USING TREE STRUCTURE AND APPARATUS THEREOF

The present invention relates to a postal address search method, and more particularly to a method and apparatus for searching for a postal address that contains errors using a tree structure. The postal address is searched using a tree structure that hierarchically represents the items of standard postal addresses: for each item of the postal address, the similarity to each node of the tree structure is calculated, and the postal address containing the node with the highest similarity is determined on the basis of the calculated similarity.

When government offices and companies manage addresses, it is difficult to handle the errors contained in addresses obtained from customers. Address cleansing software exists to deal with such errors; it parses the address and queries a database (DB) in various ways depending on the parsing result. The problem with this approach is that parsing requires prior knowledge of the error types, and each error type must be handled separately.

Recently, with the rise of smartphones and cloud computing, location-based services have come into the spotlight. In a location-based service, a machine identifies a place by two-dimensional coordinates such as latitude and longitude. A person, on the other hand, identifies a place in a structured and systematic way by its address. Conversion between addresses and coordinates is therefore essential for people and software (machines) to interface efficiently. Converting coordinates to an address is simple when the corresponding coordinate-address matching information is available.

However, geocoding, which converts an address or postal code into geographic coordinates such as latitude and longitude, often produces inaccurate results. The reason is that, even when address-coordinate matching information exists, the addresses are not compared accurately.

Accordingly, to solve these problems of the prior art, an object of the present invention is to provide a method and apparatus for searching for a postal address that contains errors using a tree structure, in which the postal address is searched using a tree structure that hierarchically represents the items of standard postal addresses, the similarity to each node of the tree structure is calculated for each item of the postal address, and the postal address containing the node with the highest similarity is determined on the basis of the calculated similarity.

However, the objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above objects, a method for searching for a postal address containing errors using a tree structure according to one aspect of the present invention may include: calculating, for each item of an input postal address, the similarity to the character string corresponding to each node in a preset tree structure; selecting the node with the highest similarity based on the calculated similarity; and determining a standard postal address that includes the character string corresponding to the selected node.

Preferably, the tree structure hierarchically represents the items of the standard postal address as nodes.

Preferably, the calculating step calculates, for each item of the postal address, the similarity to the character string corresponding to each node from the root node to a leaf node of the preset tree structure.

Preferably, the calculating step calculates, for each item of the postal address, the similarity to the character string corresponding to each node in the preset tree structure in units of jamo (the individual letters that make up a Korean syllable).

Preferably, the similarity S is obtained by the equation $S = (E_{path} - N_{equ}) / E_{path}$, where $E_{path}$ denotes the edit path and $N_{equ}$ denotes the number of mismatched jamo.

Preferably, the determining step determines a standard postal address that includes the character strings corresponding to the nodes on the path from the root node to the selected leaf node of the preset tree structure.

An apparatus for searching for a postal address containing errors using a tree structure according to another aspect of the present invention may include: calculation means for calculating, for each item of an input postal address, the similarity to the character string corresponding to each node in a preset tree structure; selection means for selecting the node with the highest similarity based on the calculated similarity; and determination means for determining a standard postal address that includes the character string corresponding to the selected node.

Preferably, the calculation means calculates, for each item of the postal address, the similarity to the character string corresponding to each node from the root node to a leaf node of the preset tree structure.

Preferably, the calculation means calculates, for each item of the postal address, the similarity to the character string corresponding to each node in the preset tree structure in units of jamo.

Preferably, the determination means determines a standard postal address that includes the character strings corresponding to the nodes on the path from the root node to the selected leaf node of the preset tree structure.

Accordingly, the present invention searches for a postal address using a tree structure that hierarchically represents the items of standard postal addresses, calculates the similarity to each node of the tree structure for each item of the postal address, and determines the postal address containing the node with the highest similarity, which has the effect of correcting errors in the postal address.

In addition, by determining the postal address from the per-item similarity to the nodes of the tree structure, the present invention has the effect of enabling fast postal address searches.

In addition, by determining the postal address from the per-item similarity to the nodes of the tree structure, the present invention has the effect of improving the efficiency of address management work.

FIG. 1 is an exemplary diagram showing an apparatus for searching for a postal address according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram showing a tree structure that hierarchically represents postal addresses according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram showing a method for searching for a postal address according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram showing the process of calculating the similarity shown in FIG. 3.
FIG. 5 is a first exemplary diagram explaining the principle of calculating the edit distance and the edit path.
FIG. 6 is a second exemplary diagram explaining the principle of calculating the edit distance and the edit path.

Hereinafter, a method and apparatus for searching for a postal address containing errors using a tree structure according to an embodiment of the present invention will be described with reference to the accompanying FIGS. 1 to 6. The description focuses on the parts needed to understand the operation and effects of the present invention. Throughout the specification, the same reference numerals in the drawings denote the same elements.

The present invention proposes a scheme in which a postal address is searched using a tree structure that hierarchically represents the items of standard postal addresses, for example province (do), city (si), district (gu), neighborhood (dong), and lot number (beonji). For each item of the postal address, the similarity to each node of the tree structure is calculated, and the standard postal address containing the node with the highest similarity is determined on the basis of the calculated similarity.

FIG. 1 is an exemplary diagram showing an apparatus for searching for a postal address according to an embodiment of the present invention.

As shown in FIG. 1, the apparatus for searching for a postal address according to the present invention may include input means (110), calculation means (120), selection means (130), determination means (140), and storage means (150).

The input means (110) receives a postal address from the user. Here, a postal address is expressed in terms of province, city, district, neighborhood, lot number, and so on, for example "서울특별시 성북구 돈암2동 524번지" (524 Donam 2-dong, Seongbuk-gu, Seoul) or "강원도 강릉시 옥천동 168번지" (168 Okcheon-dong, Gangneung-si, Gangwon-do).

The calculation means (120) calculates, for each item of the input postal address, the similarity to the character string corresponding to each node in the preset tree structure. That is, for each item of the postal address, it calculates the similarity to the character string of each node from the root node to a leaf node of the preset tree structure.

Specifically, the calculation means (120) decomposes each item of the postal address, and the character string of each corresponding node in the preset tree structure, into jamo. It then calculates the edit distance and the edit path between the decomposed item and the decomposed node string.

The calculation means (120) then calculates the similarity from the calculated edit distance and edit path. The details of this operation are described below.

The selection means (130) selects the node with the highest similarity based on the calculated similarity. Among the nodes at the same level, it selects the node with the highest similarity.

The determination means (140) determines a standard postal address that includes the character string corresponding to the selected node. That is, it determines a standard postal address that includes the character strings of all nodes on the path that starts at the root node of the preset tree structure and ends at the last selected leaf node.

The storage means (150) stores the tree structure, which hierarchically represents the items of standard postal addresses as nodes. In practice, when Seoul addresses are processed, the full list of about 3.6 million addresses is split into administrative district, building name, and lot number units to build the tree. This tree structure is called a 'semantic tree'.

FIG. 2 is an exemplary diagram showing a tree structure that hierarchically represents postal addresses according to an embodiment of the present invention.

As shown in FIG. 2, the tree structure according to the present invention hierarchically represents the items of standard postal addresses as nodes. A trie is the structure generally used to make data retrieval easier, but the present invention uses a semantic tree. A trie creates a node simply at the point where the characters diverge, whereas a semantic tree creates a node at the point where the meaning changes, such as at a subordinate administrative district.

For example, given addresses beginning with '서울시 강남구' (Seoul-si Gangnam-gu) and '서울시 강서구' (Seoul-si Gangseo-gu), a trie takes '서울시 강' as the prefix and creates '남구' and '서구' as child nodes. In the semantic tree, because Gangnam-gu and Gangseo-gu both belong to Seoul, '강남구' and '강서구' are made child nodes of '서울시'.
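The semantic tree described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Node` class, its `extra` field, the `add_address` method, and the artificial empty root are all hypothetical names introduced here, and the input is assumed to be already split into address items.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One item of a standard postal address (hypothetical structure)."""
    name: str
    children: dict = field(default_factory=dict)  # child name -> Node
    extra: dict = field(default_factory=dict)     # e.g. coordinates or former names

    def add_address(self, items):
        """Insert one address given as a list of items, one node per item."""
        node = self
        for item in items:
            # A node is created per address item, not per diverging character.
            node = node.children.setdefault(item, Node(item))
        return node


root = Node("")  # assumption: an artificial root above the top-level regions
root.add_address(["서울시", "강남구"])
root.add_address(["서울시", "강서구"])

seoul = root.children["서울시"]
print(list(seoul.children))  # ['강남구', '강서구']
```

Unlike a trie, no node for the shared prefix '서울시 강' is created; the two districts are siblings under '서울시'.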

In addition, when a name has changed in the past, or when the administrative dong and the legal dong differ, this information can be added to the semantic tree. Depending on the purpose, further information can also be attached to each node, such as coordinate information for address-to-coordinate conversion (geocoding) or new-address information for old-address-to-new-address conversion. Such additional information is not limited to these examples and may take various forms.

In this tree structure, the topmost node, which has no parent, is called the root node, and the lowest nodes are called leaf nodes. Nodes other than leaf nodes, that is, nodes that have children, are called internal nodes. The node connected directly above a node is its parent node, and the nodes connected directly below it are its child nodes.

The topmost node is at level 1, its child nodes are at level 2, its grandchild nodes are at level 3, and so on.

FIG. 3 is an exemplary diagram showing a method for searching for a postal address according to an embodiment of the present invention.

As shown in FIG. 3, when the postal address search apparatus according to the present invention receives a postal address, for example '서울시 강서구 등촌동 주공APT 1001동 103호' (Seoul-si, Gangseo-gu, Deungchon-dong, Jugong APT, Building 1001, Unit 103), it calculates, for each item of the input postal address, the similarity to the character string corresponding to each node in the preset tree structure (S310). This is described in detail with reference to FIG. 4.

FIG. 4 is an exemplary diagram showing the process of calculating the similarity shown in FIG. 3.

As shown in FIG. 4, the postal address search apparatus decomposes each item of the postal address, and the character string of each corresponding node in the preset tree structure, into jamo (S311).

For example, the first item of the postal address, '서울시', is decomposed into 'ㅅㅓㅇㅜㄹㅅㅣ', and the tree node '서울특별시' is decomposed into 'ㅅㅓㅇㅜㄹㅌㅡㄱㅂㅕㄹㅅㅣ'.
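The decomposition in step S311 can be sketched with the arithmetic layout of precomposed Hangul syllables in Unicode (U+AC00 to U+D7A3). The function name `decompose` and the three tables are introduced here for illustration; treating compound vowels and compound final consonants as single jamo is an assumption, chosen because it reproduces the 7 and 13 jamo shown in the example above.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"                      # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"                 # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # none + 27 finals


def decompose(text):
    """Split each Hangul syllable into jamo; other characters pass through."""
    jamo = []
    for ch in text:
        index = ord(ch) - 0xAC00
        if 0 <= index < 11172:  # 19 * 21 * 28 precomposed syllables
            jamo.append(CHOSEONG[index // 588])           # 588 = 21 * 28
            jamo.append(JUNGSEONG[(index % 588) // 28])
            if index % 28:                                # 0 means no final consonant
                jamo.append(JONGSEONG[index % 28])
        else:
            jamo.append(ch)  # digits, Latin letters, spaces
    return jamo


print("".join(decompose("서울시")))  # ㅅㅓㅇㅜㄹㅅㅣ
```

Non-Hangul characters such as the digits in '등촌1동' or the letters in 'APT' are kept as single units, so they still take part in the later comparison.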

Next, the postal address search apparatus calculates the edit distance and the edit path between each jamo-decomposed item of the postal address and the jamo-decomposed string of each corresponding node in the preset tree structure (S312). This is described with reference to FIGS. 5 and 6.

FIG. 5 is a first exemplary diagram explaining the principle of calculating the edit distance and the edit path, and FIG. 6 is a second exemplary diagram explaining the same principle.

FIG. 5 shows the principle of calculating the edit distance and the edit path between '서울시' and '서울특별시'. Here, the edit distance is the minimum number of operations (insertions, deletions, and substitutions) needed to transform string A into string B, and the edit path is the edit distance plus the number of matching jamo.

The table in panel (a) is built as follows. The first row is filled with 0 through 13 and the first column with 0 through 7, in order. From the second row on, each cell looks at the values in the upper-left diagonal, left, and upper cells: if the two jamo match, the smallest of these values is entered; if they do not match, the smallest value plus 1 is entered.

The 'λ' in the first cell denotes the empty string; cell (1,2) means that one insertion is needed to turn the empty string into 'ㅅ'.

For example, to determine the value of (2,2): the smallest of the values in (2,1), (1,1), and (1,2) is 0, and because changing 'ㅅ' to 'ㅅ' requires no edit, that smallest value is entered as is.

To determine the value of (7,6): the 'ㅌ' of the row and the 'ㄹ' of the column differ, so 1 is added to 0, the minimum of the values in (6,6), (7,5), and (6,5), and the value 1 is entered.

In the table built this way, the value 6 in the lower-right cell is the edit distance.

As in panel (b), the positions where the two strings match are then collected, and the edit path is 13: the edit distance 6 obtained above plus the number of matching jamo, 7.

FIG. 6 shows the principle of calculating the edit distance and the edit path between '서울시' and '강원도'. When the table is built in the same way as described for FIG. 5, the value 7 in the lower-right cell of the table in panel (a) is the edit distance.

Based on this table, the positions where the strings match are collected as in panel (b), and the edit path is 8: the edit distance 7 obtained above plus the number of matching jamo, 1.
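The two quantities of step S312 can be sketched as below. This is an illustration under stated assumptions rather than the patent's procedure: it uses the standard Levenshtein recurrence (on a match the upper-left diagonal value is carried over unchanged), and it counts matches by backtracking through one optimal alignment, preferring the diagonal step. The function name `edit_distance_and_path` is hypothetical, and the inputs are strings that are already decomposed into jamo.

```python
def edit_distance_and_path(a, b):
    """Return (edit distance, edit path) for two jamo sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # first column: 0..len(a)
    for j in range(cols):
        d[0][j] = j          # first row: 0..len(b)
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match or substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion

    # Backtrack from the lower-right cell and count the matching jamo.
    i, j, matches = len(a), len(b), 0
    while i > 0 and j > 0:
        cost = 0 if a[i - 1] == b[j - 1] else 1
        if d[i][j] == d[i - 1][j - 1] + cost:
            matches += 1 - cost
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1

    distance = d[-1][-1]
    return distance, distance + matches  # edit path = distance + matches


print(edit_distance_and_path("ㅅㅓㅇㅜㄹㅅㅣ", "ㅅㅓㅇㅜㄹㅌㅡㄱㅂㅕㄹㅅㅣ"))  # (6, 13)
print(edit_distance_and_path("ㅅㅓㅇㅜㄹㅅㅣ", "ㄱㅏㅇㅇㅝㄴㄷㅗ"))            # (7, 8)
```

Both calls reproduce the values given for FIG. 5 and FIG. 6. When several optimal alignments exist, the number of matches can depend on which one is followed, so the diagonal-first preference is a design choice of this sketch.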

Next, the postal address search apparatus calculates the similarity from the calculated edit distance and edit path (S313). The similarity is defined as in [Equation 1] below.

[Equation 1]

$$S = \frac{E_{path} - N_{equ}}{E_{path}}$$

Here, $E_{path}$ denotes the edit path and $N_{equ}$ denotes the number of mismatched jamo.

For example, referring to FIGS. 5 and 6, the similarity between '서울시' and '서울특별시' is S = (13 - 6)/13 = 7/13, since the edit path is 13 and 6 jamo do not match, and the similarity between '서울시' and '강원도' is S = (8 - 7)/8 = 1/8.
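The similarity can also be written in terms of the number of matching jamo. The short derivation below is a sketch that assumes the number of mismatched jamo equals the edit distance $d$; the symbol $N_{match}$ (number of matching jamo) is introduced here and does not appear in the original text.

```latex
% Assumptions: N_{equ} = d (edit distance) and E_{path} = d + N_{match}.
\begin{align*}
S &= \frac{E_{path} - N_{equ}}{E_{path}}
   = \frac{(d + N_{match}) - d}{d + N_{match}}
   = \frac{N_{match}}{E_{path}} \\[4pt]
\text{FIG. 6:}\quad S &= \frac{8 - 7}{8} = \frac{1}{8},
   \qquad d = 7,\; N_{match} = 1,\; E_{path} = 8
\end{align*}
```

Under this reading, S ranges from 0 (no jamo in common) to 1 (identical strings).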

Next, the postal address search apparatus selects the node with the highest similarity based on the calculated similarity (S320). That is, at level 1 the apparatus selects '서울특별시', the node with the higher similarity of '서울특별시' and '강원도'.

The postal address search apparatus repeats this process from the root node down to a leaf node, selecting at each level the character string of the node with the highest similarity, as follows.

For '강서구' (Gangseo-gu), the second item of '서울시 강서구 등촌동 주공APT 1001동 103호', the similarities to '용산구' (Yongsan-gu) and '강서구', the child nodes of '서울특별시' in the tree structure, are 0.5 and 1.0, respectively. The node with the highest similarity, '강서구', is selected.

Before the similarity between an item of the postal address and the string of each node is calculated, the strings themselves may be compared directly, because a plain string comparison is faster. That is, the strings may be compared first, and the similarity calculated afterwards depending on the result of that comparison.

For '등촌동' (Deungchon-dong), the third item of '서울시 강서구 등촌동 주공APT 1001동 103호', the similarities to '방화1동', '방화3동', '등촌1동', and '등촌3동', the child nodes of '강서구' in the tree structure, are 4/9, 4/9, 9/10, and 9/10, respectively. The nodes with the highest similarity, '등촌1동' and '등촌3동', are both selected.

Because '등촌1동' and '등촌3동' have the same similarity, both must be followed down to their leaf nodes.

Using the two nodes '주공아파트 1001 103', which belongs to '등촌3동', and '코오롱정보센터' (Kolon Information Center), which belongs to '등촌1동', the retrieved postal addresses are '등촌3동 주공아파트 1001 103' and '등촌1동 코오롱정보센터'. The similarity of each retrieved postal address is then calculated, and '등촌3동 주공아파트 1001 103', which has the highest similarity, is selected.

Next, the postal address search apparatus determines the standard postal address that includes the character strings corresponding to the selected nodes (S330). That is, the apparatus determines '서울특별시 강서구 등촌3동 주공아파트 1001 103' as the standard postal address.
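Steps S310 to S330 can be put together in a compact sketch. Everything here is illustrative and rests on assumptions: the tree is a nested `dict` holding a small hypothetical subset of the nodes named in the example, the input is pre-split into items with the building part kept as one item, ties at a level are all followed, and tied candidates are resolved by comparing each full root-to-leaf path with the whole input. The similarity takes the number of mismatched jamo to be the edit distance. The names `jamo`, `similarity`, `search`, and `TREE` are hypothetical.

```python
from fractions import Fraction

CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def jamo(text):
    """Decompose Hangul syllables into jamo; keep other characters as is."""
    out = []
    for ch in text:
        k = ord(ch) - 0xAC00
        if 0 <= k < 11172:
            out += [CHO[k // 588], JUNG[(k % 588) // 28], JONG[k % 28]]
        else:
            out.append(ch)
    return [j for j in out if j]  # drop the empty "no final consonant" slot


def similarity(x, y):
    """S = (E_path - d) / E_path, with d the jamo-level edit distance."""
    a, b = jamo(x), jamo(y)
    d = [[i + j if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = int(a[i - 1] != b[j - 1])
            d[i][j] = min(d[i - 1][j - 1] + cost, d[i - 1][j] + 1, d[i][j - 1] + 1)
    i, j, matches = len(a), len(b), 0
    while i and j:  # backtrack along one optimal alignment
        cost = int(a[i - 1] != b[j - 1])
        if d[i][j] == d[i - 1][j - 1] + cost:
            matches, i, j = matches + 1 - cost, i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    path = d[-1][-1] + matches
    return Fraction(matches, path) if path else Fraction(1)


def search(tree, items):
    """Descend level by level, keeping every best-scoring child (ties included)."""
    frontier, finished, depth = [([], tree)], [], 0
    while frontier:
        item = items[depth] if depth < len(items) else ""
        nxt = []
        for path, children in frontier:
            if not children:          # reached a leaf
                finished.append(path)
                continue
            scores = {name: similarity(item, name) for name in children}
            best = max(scores.values())
            nxt += [(path + [n], children[n]) for n, s in scores.items() if s == best]
        frontier, depth = nxt, depth + 1
    query = " ".join(items)
    return max(finished, key=lambda p: similarity(query, " ".join(p)))


# Hypothetical miniature tree built from the nodes named in the example.
TREE = {
    "서울특별시": {
        "용산구": {},
        "강서구": {
            "방화1동": {}, "방화3동": {},
            "등촌1동": {"코오롱정보센터": {}},
            "등촌3동": {"주공아파트 1001 103": {}},
        },
    },
    "강원도": {},
}

result = search(TREE, ["서울시", "강서구", "등촌동", "주공APT 1001동 103호"])
print(" ".join(result))
```

`Fraction` is used so that the tie between '등촌1동' and '등촌3동' (both 9/10) is detected exactly rather than through floating-point comparison.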

Those of ordinary skill in the art to which the method and apparatus for searching for a postal address containing errors using a tree structure according to the present invention pertain will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed herein are intended to describe, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be interpreted as falling within the scope of rights of the present invention.

110: input means
120: calculation means
130: selection means
140: determination means
150: storage means

Claims

A method for searching for a postal address containing errors using a tree structure, the method comprising:
calculating, for each item of an input postal address, the similarity to the character string corresponding to each node in a preset tree structure;
selecting the node with the highest similarity based on the calculated similarity; and
determining a standard postal address that includes the character string corresponding to the selected node.