KR20000045107A

KR20000045107A - Method for Korean similar-word search using a phoneme-level fitness function

Info

Publication number: KR20000045107A
Application number: KR1019980061635A
Authority: KR
Inventors: 송주석; 이영훈
Original assignee: 이계철; 한국전기통신공사
Priority date: 1998-12-30
Filing date: 1998-12-30
Publication date: 2000-07-15

Abstract

PURPOSE: A similar-word search method using a phoneme-level fitness function is disclosed. In systems such as the ATTACH-based Directory Information System (ADAS), it analyzes the query in units of jamo (individual Hangul letters) rather than syllables, checks the possible match types in several stages, and returns the results in ranked order, so that data can be found without knowing it exactly. CONSTITUTION: The method is applied to a system composed of an information search server (13), a voice response device (12), and a telephone (11). ADAS is an automatic telephone number directory system reached from ordinary telephones (11) over the public switched telephone network. Input is a Korean character code (ATTACH code) entered on the telephone keypad; output is delivered by an Automatic Response System (ARS) device as voice, or as text to terminals that can display it. ATTACH code lets the telephone buttons enter Korean, English, special symbols, numbers, and control codes, and it is easy to learn and master. The search method is implemented in the information search server (13), and results are output through the voice response device (12) to the telephone (11).

Description

Korean similar-word retrieval method using a phoneme-level fitness function

The present invention relates to a Korean similar-word retrieval method and to a computer-readable recording medium storing a program that implements it. In systems such as the direct telephone number search system (ADAS: ATTACH-based Directory Information System), the method splits the query into jamo (individual Hangul letters) rather than syllables, groups the cases that can arise into several stages, and assigns a priority to each stage. The user thus receives suitable search results and can find data without knowing exactly what it is.

Data retrieval compares the words stored in a database with the query entered by the user and finds those that match. Because the compared items must match exactly, however, the user must know precisely what he or she is looking for.

Conventional data retrieval therefore cannot help the user find the desired information when the user does not enter a long word in full, when the user enters all or some syllables of the word in abbreviated form, or when a wrong button press introduces an error.

The present invention was devised to solve the problems described above. Its object is to provide a Korean similar-word retrieval method, and a computer-readable recording medium storing a program that implements it, for systems such as the direct telephone number search system (ADAS). The method splits the query into jamo rather than syllables, groups the cases that can arise into several stages, and assigns a priority to each stage, so that suitable search results are provided and data can be retrieved even when the user does not know it exactly.

FIG. 1 is an example configuration of a direct telephone number search system (ADAS) to which the present invention is applied.

FIG. 2 is a flowchart of one embodiment of the Korean similar-word retrieval method using a phoneme-level fitness function according to the present invention.

* Explanation of reference numerals for the main parts of the drawings *

11: telephone 12: voice response device

13: information search server

To achieve the above object, the present invention provides a Korean similar-word retrieval method applied to a telephone number search system, comprising: a first step of defining search-word comparison criteria according to the cases that can arise when the query is split into jamo rather than syllables; a second step of applying weights to the defined comparison criteria and assigning priorities according to search type; and a third step of, when a query is input, cutting off lower-priority results according to the priorities and retrieving partially matching or best-matching words.

The present invention also provides a computer-readable recording medium storing a program that causes a telephone number search system having a processor to realize: a function of defining search-word comparison criteria according to the cases that can arise when the query is split into jamo rather than syllables; a function of applying weights to the defined comparison criteria and assigning priorities according to search type; and a function of, when a query is input, cutting off lower-priority results according to the priorities and retrieving partially matching or best-matching words.

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 shows an example configuration of a direct telephone number search system (ADAS) to which the present invention is applied.

The direct telephone number search system (ADAS) to which the present invention is applied automatically provides telephone number information to an ordinary telephone (11) over the public switched telephone network (PSTN). As its input means it uses a Hangul code (ATTACH code) that enters Hangul with the keypad of the telephone (11). As its output means it provides voice through a voice response device (ARS: Automatic Response System) (12), or text to terminals capable of displaying characters.

ATTACH code is a carefully designed interface that lets the buttons of the telephone (11) enter not only Hangul but also English letters, special symbols, numbers, and control codes. Pressing two adjacent buttons in sequence enters one Hangul jamo or one English letter, so the code is easy to learn and quick to master.

However, with only twelve buttons available, entering Hangul and English is a burden on the user no matter how well the interface is designed, and mistakes inevitably occur when long words are entered.

In this character input environment of the telephone (11), a method is needed that improves input convenience: the user's input should be as short as possible, abbreviated input should be allowed, and input errors should be corrected automatically.

Accordingly, this notion of input convenience was introduced during the design of the direct telephone number search system. The function of the present invention is implemented in the information search server (13), and the output is sent to the telephone (11) through the voice response device (12).

The Hangul word search method implemented in the direct telephone number search system is now described in detail.

FIG. 2 is a flowchart of one embodiment of the Korean similar-word retrieval method using a phoneme-level fitness function according to the present invention.

The present invention is basically a form of data retrieval: it compares Hangul words stored in a database (personal names, business names, business category names, and so on) with the query entered by the user and finds those that match.

However, data retrieval requires the compared items to match exactly, so the user must know precisely what he or she is looking for. Ordinary data retrieval therefore cannot satisfy the following requirements of this system.

First, the user must be able to find the desired information without entering a long word in full.

Second, the user must be able to retrieve the desired information with abbreviated input, for example by entering only the initial consonant, or only the initial consonant and medial vowel, of each syllable or of some syllables of the word.

Third, the desired information must still be retrievable when a wrong button press introduces an error.

Fourth, when a search by the above methods yields several approximately matching items, the results must be presented in order of match quality, best first. This requirement is very important because results are delivered by voice, and voice output and listening take considerable time.

Fifth, even if the user does not know the exact word, it must be possible to retrieve the desired information by entering only part of the word (for example, an abbreviation). This requirement must be met because the business names and category names registered in the system's telephone number database may differ from the names ordinary users know, and the database has no separate keyword index.

For cases in which the conventional search method fails to return suitable results, the present invention splits the query into jamo rather than syllables, groups the cases that can arise into several stages, and assigns a priority to each stage. The user thus receives suitable search results and can retrieve data without knowing exactly what it is.

To build the relevance function, the following measures are defined.

First, consider the measures for substrings.

In the usual sense, a query Q is a substring of a target string T when Q is a contiguous part of T. When Q is also a substring of another target T', a criterion is needed to decide which of T and T' is the better match. Unfortunately, no single criterion fits every situation.

For example, when the user enters "학교" (school), it cannot be determined whether "대학교" (university) or "고등학교" (high school) was intended. Nevertheless, the following criteria can be set.

First, the shorter the part outside the matched part (counted in syllables), the higher the priority. This criterion reflects voice output time: if several targets partially match, outputting the shorter ones first is likely to reduce the time the user needs to find the intended target. The general argument that such targets carry a relatively smaller amount of uncertain information also holds.

Second, the fewer unmatched syllables that precede the matched part, the higher the priority. This criterion also reflects voice output: listening is easier when the part the user cares about is spoken first.

Under the first and second criteria, when the query "학교" is entered, "대학교" takes priority over "고등학교" on both criteria.

By contrast, comparing "학교사랑" with "대학교", "대학교" takes priority under the first criterion, but "학교사랑" takes priority under the second.

As noted above, however, neither can be judged correct, so it is enough to set suitable weights between these two criteria.
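The first two criteria can be illustrated with a minimal Python sketch. The function name `substring_measures` is hypothetical and not from the patent; the sketch assumes each Hangul syllable is a single precomposed character, so string length equals syllable count.

```python
def substring_measures(query: str, target: str):
    """Return (DIS_KEY, DIS_PRE) for a contiguous match, or None if no match."""
    pos = target.find(query)  # index of the first occurrence, in syllables
    if pos < 0:
        return None
    dis_key = len(target) - len(query)  # syllables outside the matched part
    dis_pre = pos                       # unmatched syllables before the match
    return dis_key, dis_pre


for target in ["대학교", "고등학교", "학교사랑"]:
    print(target, substring_measures("학교", target))
```

For the query "학교", this gives (1, 1) for "대학교", (2, 2) for "고등학교", and (2, 0) for "학교사랑", which reproduces the conflict between the two criteria described above.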

The usual notion of partial match can be extended so that Q counts as a substring of T whenever Q is a part of T (the word "contiguous" is dropped). This reflects the fourth requirement above (presenting results best first when several approximately matching items are found). For example, when the query is "한통", "한국통신" (Korea Telecom) also becomes a partial match. A comparison criterion is needed for this case as well.

Third, the shorter the mismatched part inserted within the partially matched part, the higher the priority.

For example, when the query is "한통", "한국통신" takes priority over "한나라통신". Here "한국통" is regarded as the partially matched part. This criterion therefore does not conflict with the first criterion (a shorter part outside the matched part takes priority) or the second (fewer unmatched syllables before the matched part take priority).
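The third criterion can be sketched as follows. The name `dis_iso` echoes the measure name DIS_ISO used later in the description, but the function is an illustrative assumption. In particular, it picks the tightest span when several non-contiguous matches exist, which the patent does not spell out.

```python
def dis_iso(query: str, target: str):
    """Syllables inserted inside the tightest non-contiguous match, or None."""
    if not query:
        return None
    best = None
    for start, ch in enumerate(target):
        if ch != query[0]:
            continue
        qi, end = 0, None
        # Greedily match the remaining query syllables from this start.
        for ti in range(start, len(target)):
            if target[ti] == query[qi]:
                qi += 1
                if qi == len(query):
                    end = ti
                    break
        if end is not None:
            inserted = (end - start + 1) - len(query)
            if best is None or inserted < best:
                best = inserted
    return best


print(dis_iso("한통", "한국통신"), dis_iso("한통", "한나라통신"))
```

With the query "한통", the matched span in "한국통신" is "한국통" (one inserted syllable), and in "한나라통신" it is "한나라통" (two inserted syllables), so the former ranks first.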

Next, the measures for abbreviated input are described in detail.

To develop a measure for the first requirement above (the user must be able to find the desired information without entering a long word in full), consider how ATTACH code is entered.

ATTACH code uses the two-set (dubeolsik) Hangul input method.

For example, "한" is entered as "ㅎ (buttons 8 and 7)" + "ㅏ (buttons 2 and 3)" + "ㄴ (buttons 2 and 1)".

Therefore, if the user enters "한국" in abbreviated form, any of "ㅎㄱ", "ㅎ국", "하구", and so on is possible. Here "ㅎ" and "하" are partial syllables of "한", and "ㄱ" and "구" are partial syllables of "국". Extending this view, "ㅎㄱ", "ㅎ국", "하구", and the like can all be regarded as substrings of "한국". The substring concept can be applied from this standpoint, and the following comparison criterion can be derived.
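The notion of a partial syllable can be made concrete with the standard Unicode arithmetic for precomposed Hangul syllables. The sketch below is an illustration, not the patent's implementation. The helper names `jamo_of` and `is_partial_syllable` are hypothetical, and lone jamo such as "ㅎ" are assumed to be typed as Hangul compatibility jamo.

```python
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"      # 19 initial consonants
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"   # 21 medial vowels
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 slots


def jamo_of(token: str):
    """Jamo of one precomposed syllable, or (token,) for a lone jamo."""
    index = ord(token) - 0xAC00
    if not 0 <= index < 11172:           # outside the Hangul syllable block
        return (token,)
    initial, rest = divmod(index, 588)   # 588 = 21 medials * 28 finals
    medial, final = divmod(rest, 28)
    jamo = (INITIALS[initial], MEDIALS[medial])
    return jamo + ((FINALS[final],) if final else ())


def is_partial_syllable(part: str, whole: str) -> bool:
    """True if the jamo of `part` are a leading run of the jamo of `whole`."""
    pj, wj = jamo_of(part), jamo_of(whole)
    return wj[:len(pj)] == pj


print(jamo_of("한"), is_partial_syllable("하", "한"), is_partial_syllable("ㅎ", "한"))
```

Under these assumptions, "ㅎ" and "하" are both partial syllables of "한", and "ㄱ" and "구" are partial syllables of "국".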

Fourth, the smaller the per-syllable sum of the lengths of the mismatched parts added to the parts partially matched at syllable level, the higher the priority.

Under this criterion, when the query is "ㅎ국", "호국" takes priority over "한국", because "호" leaves less uncertainty about the entered "ㅎ" than "한" does. Likewise, when the query is "ㄱㄱ", "가구" takes priority over "강국". This criterion is advantageous when a text-display telephone (11) is used, because results requiring less ATTACH code transmission are displayed first.
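A minimal sketch of the fourth measure follows. It assumes the query is written one token per syllable, where a token is either a full syllable or a lone compatibility jamo such as "ㅎ". The function `dis_par` and its sliding-window alignment are illustrative assumptions, not the patent's stated procedure.

```python
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def jamo_of(token: str):
    """Jamo of one precomposed syllable, or (token,) for a lone jamo."""
    index = ord(token) - 0xAC00
    if not 0 <= index < 11172:
        return (token,)
    initial, rest = divmod(index, 588)
    medial, final = divmod(rest, 28)
    jamo = (INITIALS[initial], MEDIALS[medial])
    return jamo + ((FINALS[final],) if final else ())


def dis_par(query: str, target: str):
    """Fewest extra jamo over all syllable-aligned partial matches, or None."""
    best = None
    for offset in range(len(target) - len(query) + 1):
        extra = 0
        for q, t in zip(query, target[offset:]):
            qj, tj = jamo_of(q), jamo_of(t)
            if tj[:len(qj)] != qj:     # query token is not a partial syllable
                break
            extra += len(tj) - len(qj)  # jamo the user left out
        else:
            best = extra if best is None else min(best, extra)
    return best


print(dis_par("ㅎ국", "호국"), dis_par("ㅎ국", "한국"))
print(dis_par("ㄱㄱ", "가구"), dis_par("ㄱㄱ", "강국"))
```

For "ㅎ국" the sketch counts one omitted jamo in "호국" and two in "한국". For "ㄱㄱ" it counts two in "가구" and four in "강국", so the smaller count ranks first in each pair.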

Next, combining the first requirement (the user must be able to find the desired information without entering a long word in full) with the fourth requirement (presenting results best first when several approximately matching items are found), a criterion is needed for the case in which only an initial consonant is entered as a prefix.

For example, to search for "한국통신", the user enters "ㅎ통신". Here "ㅎ" serves as a prefix, and the query means: retrieve every "통신" entry that begins with "ㅎ" (more precisely, any "XX통신" in which some syllable X begins with "ㅎ"). To limit complexity, the prefix is assumed to be a single initial consonant. For this case, the second criterion (fewer unmatched syllables before the matched part take priority) and the third criterion (a shorter mismatched part inserted within the partially matched part takes priority) can be extended to derive the following new criterion.

Fifth, the fewer unmatched syllables that precede the syllable corresponding to the prefix, the higher the priority. This criterion reflects the intuition that the initial consonant of a word's first character is the most natural prefix.
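The prefix case can be sketched as below. The sketch assumes the query is one lone initial consonant followed by a key that must appear contiguously in the target. The function `dis_pos`, its use of the first occurrence of the key, and the sample targets other than "한국통신" are hypothetical.

```python
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"


def initial_of(syllable: str):
    """Initial consonant of a precomposed Hangul syllable, else None."""
    index = ord(syllable) - 0xAC00
    return INITIALS[index // 588] if 0 <= index < 11172 else None


def dis_pos(query: str, target: str):
    """Index of the first syllable matching the prefix, or None if no match."""
    prefix, key = query[0], query[1:]
    key_at = target.find(key)
    if key_at < 0:
        return None
    for pos in range(key_at):           # only syllables before the key
        if initial_of(target[pos]) == prefix:
            return pos                  # unmatched syllables before it
    return None


for target in ["한국통신", "대한통신", "국제통신"]:
    print(target, dis_pos("ㅎ통신", target))
```

Here "한국통신" yields 0 because the prefix matches its first syllable, "대한통신" yields 1, and "국제통신" is rejected because no syllable before "통신" begins with "ㅎ".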

Next, the measure for error correction is described in detail.

Because it is very difficult to develop an error correction mechanism that gives reasonable and consistent results, only the case of a mistyped medial vowel, an error users commonly make, is handled here.

The initial consonant plays a very important role in the criteria proposed above, so errors in the initial consonant are hard to tolerate. The final consonant can be ignored as long as abbreviated input is supported. Limiting error correction to the medial vowel is therefore reasonable.

A typical medial vowel input error is a spelling mistake.

For example, one vowel may be confused with another similar vowel [the two vowel pairs given as examples are missing from this text]. Such confusion is especially common in the spelling of loanwords. The following simple criterion can therefore be set.

Sixth, the fewer approximated medial vowels, the higher the priority.

For example, if "슈퍼마킷" is entered as the query, "수퍼마킷" takes priority over "슈퍼마켓". Once an error has occurred, this criterion leaves little ambiguity, and it has the advantage of still offering candidate targets when an input error does happen.
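Counting approximated medial vowels can be sketched as follows. The patent does not list which vowel pairs count as similar, so this sketch treats any medial vowel difference as an approximation, while requiring the initial and final consonants to agree. The function `dis_apx` and the target "수퍼마켓" are hypothetical.

```python
def split(syllable: str):
    """(initial, medial, final) indices of a precomposed syllable, else None."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 11172:
        return None
    initial, rest = divmod(index, 588)   # 588 = 21 medials * 28 finals
    medial, final = divmod(rest, 28)
    return initial, medial, final


def dis_apx(query: str, target: str):
    """Number of syllables that differ only in the medial vowel, or None."""
    if len(query) != len(target):
        return None
    count = 0
    for q, t in zip(query, target):
        qs, ts = split(q), split(t)
        if qs is None or ts is None:     # non-Hangul must match exactly
            if q != t:
                return None
            continue
        if (qs[0], qs[2]) != (ts[0], ts[2]):
            return None                  # consonant errors are not tolerated
        if qs[1] != ts[1]:
            count += 1
    return count


print(dis_apx("슈퍼마킷", "슈퍼마켓"), dis_apx("슈퍼마킷", "수퍼마켓"))
```

With the query "슈퍼마킷", the sketch counts one approximated vowel for "슈퍼마켓" and two for "수퍼마켓", so the former would rank first under the sixth criterion.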

The relevance function is now described in more detail.

The search-word comparison criteria described above, from the first to the sixth, are summarized in (Table 1).

Description of the measure | Name | Criterion
Number of syllables outside the matched part (first criterion) | DIS_KEY | Smaller is preferred
Number of unmatched syllables before the matched part (second criterion) | DIS_PRE | Smaller is preferred
Number of mismatched syllables inserted within the non-contiguously matched part (third criterion) | DIS_ISO | Smaller is preferred
Number of mismatched jamo added within the part matched non-contiguously at syllable level (fourth criterion) | DIS_PAR | Smaller is preferred
Position of the prefix-matched syllable (fifth criterion) | DIS_POS | Smaller is preferred
Number of approximated medial vowels (sixth criterion) | DIS_APX | Smaller is preferred

As described above, some of these criteria conflict with one another and some can be applied together. The relevance function therefore applies weights to the criteria. To do so, the range of values of each measure must first be determined and normalized.

Let Q be the query and T the target word.

Let L(T) be the number of syllables in word T. Then DIS_KEY, the first measure (the number of syllables outside the matched part), takes integer values from 0 to L(T) - L(Q).

DIS_PRE, the second measure (the number of unmatched syllables before the matched part), also takes integer values from 0 to L(T) - L(Q).

For the third measure (the number of mismatched syllables inserted within the non-contiguously matched part), the non-contiguously matched part can be at most L(T) long, so DIS_ISO also ranges from 0 to at most L(T) - L(Q).

For the fourth measure (the number of mismatched jamo added within the part matched non-contiguously at syllable level), let l(t) be the number of jamo in syllable t. Then DIS_PAR takes integer values up to a maximum of $\sum_{q \in Q} (l(q) - 1)$.

DIS_POS, the fifth measure (the position of the prefix-matched syllable), takes values from 0 to L(Q) - 2, where 0 denotes the first syllable.

DIS_APX, the sixth measure (the number of approximated medial vowels), ranges from 0 to at most L(Q).

After min-max normalization based on these minimum and maximum values, each measure takes values in the range 0 to 1.0.

Writing the normalized value of each measure as N(...), the relevance function is expressed as follows. Let M = {DIS_KEY, DIS_PRE, DIS_ISO, DIS_POS, DIS_APX} be the set of measures, and let W(m) (a real value between 0 and 1) be the weight of measure m. Then $\mathrm{Relevance} = \left\{ 1 - \sum_{m \in M} W(m) \cdot N(m) \right\} \times 100\%$.

This function takes a percentage value from 0 to 100 and equals 100% when Q matches T exactly.
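The normalization and weighting can be sketched in a few lines. The weight values below are hypothetical, since the patent does not publish them. The sketch also assumes the weights sum to at most 1, so that the result stays between 0 and 100.

```python
def normalize(value: float, max_value: float) -> float:
    """Min-max normalize a measure whose minimum is 0 into 0.0..1.0."""
    if max_value <= 0:
        return 0.0  # the measure cannot vary for this query/target pair
    return min(value / max_value, 1.0)


def relevance(normalized: dict, weights: dict) -> float:
    """Relevance in percent: (1 - sum of W(m) * N(m)) * 100."""
    penalty = sum(w * normalized.get(m, 0.0) for m, w in weights.items())
    return (1.0 - penalty) * 100.0


# Hypothetical weights; assumed to sum to at most 1.
WEIGHTS = {"DIS_KEY": 0.2, "DIS_PRE": 0.1}

query, target = "학교", "대학교"
span = len(target) - len(query)          # L(T) - L(Q), the shared maximum
scores = {
    "DIS_KEY": normalize(1, span),       # one syllable outside the match
    "DIS_PRE": normalize(1, span),       # one unmatched syllable in front
}
print(round(relevance(scores, WEIGHTS), 1))
print(relevance({}, WEIGHTS))            # exact match: every measure is 0
```

An exact match has every measure at 0 and therefore scores 100%, in line with the statement above. Any nonzero measure lowers the score in proportion to its weight.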

Finally, when a query returns many results, they must be cut down to no more than a fixed number.

Cutting is not important in general information retrieval (IR), but in a voice or text information service on the telephone (11), where service time is long, it is advantageous to narrow the results as far as possible. One possible cutting method is to discard targets whose relevance value falls below a certain level. A more effective method is used here: priorities are set among the criteria, and low-priority results are cut off. For this purpose the search types are classified as shown in (Table 2).

Type | Description | Priority
RV_EXACT | Exact match | 1
RV_P_KEY | Initial-consonant prefix + substring | 2
RV_KEY | Substring | 3
RV_P_PAR | Initial-consonant prefix + syllable-unit substring | 4
RV_PAR | Syllable-unit substring | 5
RV_P_APX | Initial-consonant prefix + medial vowel approximation | 6
RV_APX | Medial vowel approximation | 7
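The priority-based cutting can be sketched as a sort followed by truncation. The enum values follow Table 2. The function `cut`, the tuple layout, the limit, and the sample relevance values are hypothetical.

```python
from enum import IntEnum


class MatchType(IntEnum):
    """Search types of Table 2; a smaller value means a higher priority."""
    RV_EXACT = 1
    RV_P_KEY = 2
    RV_KEY = 3
    RV_P_PAR = 4
    RV_PAR = 5
    RV_P_APX = 6
    RV_APX = 7


def cut(results, limit):
    """Order by type priority, then by relevance (descending), keep `limit`."""
    ranked = sorted(results, key=lambda r: (r[1], -r[2]))
    return ranked[:limit]


# (word, match type, relevance in percent); the relevance values are made up.
found = [
    ("한나라통신", MatchType.RV_KEY, 60.0),
    ("한국통신", MatchType.RV_KEY, 80.0),
    ("호국", MatchType.RV_PAR, 90.0),
    ("한통", MatchType.RV_EXACT, 100.0),
]
print([word for word, _, _ in cut(found, 3)])
```

In this sketch a result of a lower-priority type is dropped before any result of a higher-priority type, even if its relevance value is higher. That is one reading of cutting off the low-priority results.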

With reference to (Table 1) and (Table 2), the influence of each measure on each search type is shown in (Table 3).

As described above, the Korean similar-word retrieval method using a phoneme-level fitness function according to the present invention defines the search-word comparison criteria according to the cases that can arise when the query is split into jamo rather than syllables, so that data can be retrieved even when it is not known exactly (see Table 1). It then applies weights to the defined criteria and assigns priorities according to search type (see Table 2).

Thereafter, when the word to be found is input (301), the database is searched (302), and the match types are checked sequentially in priority order: exact match (RV_EXACT), initial-consonant prefix + substring (RV_P_KEY), substring (RV_KEY), initial-consonant prefix + syllable-unit substring (RV_P_PAR), syllable-unit substring (RV_PAR), initial-consonant prefix + medial vowel approximation (RV_P_APX), and medial vowel approximation (RV_APX) (203 to 208).

Finally, the search results retrieved according to the priorities are output (209).

As described above, the present invention is a word search method that selects a partial match or the best match rather than only an exact match and is less sensitive to errors. Where exact terms and information are hard to memorize, it lets users reach the desired result using only the information or terms they know. It is therefore applicable across the information and communications industry: database search, searching for words stored in files, spelling correction in word processors, and devices with restricted text input such as telephones and mobile terminals.

A person of ordinary skill in the art to which the present invention pertains can make various substitutions, modifications, and changes without departing from its technical spirit, so the present invention described above is not limited to the foregoing embodiment and the accompanying drawings.

As described above, in systems such as the direct telephone number search system (ADAS), the present invention splits the query into jamo rather than syllables, groups the cases that can arise into several stages, and assigns a priority to each stage to provide the user with suitable search results. Suitable results can therefore be obtained even when the user does not know the word exactly, cannot enter the whole word, or is working with restricted input.

Claims

A Korean similar-word retrieval method using a phoneme-level fitness function, applied to a telephone number search system, the method comprising:

a first step of defining search-word comparison criteria according to the cases that can arise when the query is split into jamo rather than syllables;

a second step of applying weights to the defined search-word comparison criteria and assigning priorities according to search type; and

a third step of, when a query is input, cutting off lower-priority results according to the priorities and retrieving partially matching or best-matching words.

The method of claim 1,

wherein the first step comprises:

a fourth step of giving priority, for substrings, to a shorter part (in syllables) outside the matched part;

a fifth step of giving priority, for substrings, to fewer unmatched syllables before the matched part;

a sixth step of giving priority, for substrings, to a shorter mismatched part inserted within the partially matched part;

a seventh step of giving priority, for abbreviated input, to a smaller per-syllable sum of the lengths of the mismatched parts added to the parts partially matched at syllable level;

an eighth step of giving priority, for abbreviated input, to fewer unmatched syllables before the syllable corresponding to the prefix; and

a ninth step of giving priority, for error correction, to fewer approximated medial vowels.

The method according to claim 1 or 2,

wherein, for the defined search-word comparison criteria, the range of values of each criterion is determined and normalized, weights between the criteria are applied, and priorities are assigned in the order of exact match (RV_EXACT), initial-consonant prefix + substring (RV_P_KEY), substring (RV_KEY), initial-consonant prefix + syllable-unit substring (RV_P_PAR), syllable-unit substring (RV_PAR), initial-consonant prefix + medial vowel approximation (RV_P_APX), and medial vowel approximation (RV_APX).

A computer-readable recording medium storing a program for causing a telephone number search system having a processor to realize:

a function of defining search-word comparison criteria according to the cases that can arise when the query is split into jamo rather than syllables;

a function of applying weights to the defined search-word comparison criteria and assigning priorities according to search type; and

a function of, when a query is input, cutting off lower-priority results according to the priorities and retrieving partially matching or best-matching words.