KR20000036487A - A Database System for Korean-English Translation Using Information Retrieval Techniques

Info

Publication number: KR20000036487A
Application number: KR1020000013599A
Authority: KR
Inventors: 임종태
Original assignee: 임종태
Priority date: 2000-03-17
Filing date: 2000-03-17
Publication date: 2000-07-05
Also published as: KR100376931B1

Abstract

PURPOSE: A database system for Korean-English translation is proposed that returns the English sentence matching, or closest to, the user's query. The system builds on the n-gram method of information retrieval, the ability of a database system to process large volumes of data quickly, and a large stored collection of Korean and English sentences. CONSTITUTION: The Korean-English translation database system using information retrieval techniques consists of a database schema, a database construction step, and a translation step. The database schema comprises several relation schemas, as shown in Figure 1. In the database construction step, a Korean sentence and its corresponding English sentence are entered through the user interface, and the system checks whether the Korean sentence is already registered in the database. The translation step receives the Korean sentence to be translated, removes its punctuation marks, and then decomposes the sentence into two-syllable units using the n-gram technique.

Description

Korean-English Translation Using Information Retrieval Techniques

The present invention relates to a Korean-English translation database system using information retrieval techniques. More specifically, it relates to a system that stores Korean sentences and their corresponding English sentences in a database and translates a Korean sentence into an English sentence using the n-gram technique of information retrieval.

Some translation software has been developed and commercialized using idiom-based translation and methods that determine translation order from the parts of speech of words. Korean-English translation, however, relies on morphological, syntactic, and semantic analysis of Korean to produce English sentences. Because Korean sentences are complex and particles are used in many different ways, this approach does not yield the desired English sentences. For example, Magic Box 99, recently regarded as a good translation program, achieves a translation rate of 60-70% when translating short English sentences into Korean, but its output is not smooth for long English sentences that combine two or more clauses. In particular, it handles conjunctions such as And and But, conditional If, and imperative sentences poorly. For "It is important that people from different cultures come to understand each other and develop mutual trust", it produces "중요하다, 다른 문화들로부터 사람이 서로 이해하는 것을 하게 된다, 그리고 상호의 확신한다는 것을 발전시킨다" (roughly, "is important, people from different cultures come to do understanding each other, and develops that mutual is confident"), which exposes the limits of a sequential translation algorithm.

Magic Box 99 shows an acceptable translation rate for English on the strength of basic English grammar, but its translation rate for Korean-English translation is below 50%. When it was actually used to translate a company profile for a website, the result was word-level translation rather than sentences connecting a subject and a verb. It renders "최고의 경쟁력을 갖춘 기업임을 자신합니다" ("We are confident that we are a company with the highest competitiveness") as "Paramount competition 갖춘 enterprise 자신합니다".

Korean Laid-Open Patent Publication No. 10-1999-047854 describes an intelligent user interface method for metadata-based information retrieval in an intelligent metadata system (IMDS). The user submits a query, and the system extracts the features of the query and converts them into a feature vector in a translation module. A task processing module then traverses the metadata search tree using the query feature vector. If the traversal cannot continue at some feature, the system switches to automatic query analysis mode, in which the user interface manager looks up response cases positively associated with similar earlier queries and determines the attribute of the feature from those cases. If the attribute cannot be determined because of insufficient information, the system switches to interactive mode and the user enters the attribute. The traversal continues until a terminal node of the decision tree is reached. In system learning mode, the user evaluates the relevance of the retrieved responses and feeds that evaluation back to the system.

Korean Laid-Open Patent Publication No. 10-1994-017572 discloses a high-speed data backup method for a real-time database management system, comprising: a first step (10) of storing the physical offset of the database file when the data backup block (DBBG) starts operating; a second step (11 to 13) of, upon receiving a data backup request, opening the file, calculating the physical offset on the raw disk, and adjusting the block boundary within the file system; and a third step (14 to 18) of adjusting the block boundary of the output data and writing the data to the database file.

Korean Laid-Open Patent Publication No. 10-1999-088678 describes a method and apparatus for extracting character strings that characterize the content of a document, a storage medium storing the character string extraction program, a method and apparatus that use them to find, in a document database, documents whose content is similar to that of a user-specified document, and a storage medium storing the similar-document retrieval program.

With the conventional techniques described above, the English-Korean translation programs developed so far are usable at roughly a beginner level, but Korean-English translation programs have hardly reached a usable stage.

To solve the problems described above, the present invention draws on three points: the sentence decomposition method of the n-gram technique of information retrieval; the ability of a database system to process large volumes of data quickly; and the observation that a person who has memorized a very large number of English sentences writes better English than one who merely knows Korean and English grammar well. Its object is to provide a Korean-English translation database system using information retrieval techniques that stores a large number of Korean and English sentences in a database and uses them to return the English sentence that matches, or is closest to, the user's query.

FIG. 1 shows the schema of the Korean-English translation database system.

FIG. 2 shows the parts of the database that change when "나는 교회에 간다" and "I go to church" are added.

To achieve the above object, the present invention provides a Korean-English translation database system using information retrieval techniques, consisting of a database schema structure, a database construction step, and a translation step.

The database schema used in the present invention has the relation (table) schemas shown in FIG. 1. Relation 1 of FIG. 1, the Korean sentence relation (KTB), holds the list of Korean sentences, and relation 2, the English sentence relation (ETB), holds the list of English sentences. Relation 3, the Korean-English matching relation (KEMAT), records the pairing between each Korean sentence and its corresponding English sentence. Relation 4, the syllable table relation (KWTB), consists of the syllable units that can appear in the Korean sentences stored in the database, a number that identifies each unit, and the number of sentences that contain it. Relation 5, the syllable-containing sentence relation (WKMAT), records which sentences contain each two-syllable unit.
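The five relations can be pictured as ordinary SQL tables. The following is a minimal sketch using Python's built-in `sqlite3` module, not the schema of FIG. 1 itself: the relation names and the `Wid` and `Scnt` columns come from the description, while the column names `Kid`, `Ksent`, `Eid`, `Esent`, and `Word`, the key constraints, and the helper functions are assumptions made for illustration.

```python
import sqlite3

# Relation names (KTB, ETB, KEMAT, KWTB, WKMAT) and the Wid/Scnt columns follow
# the description. All other column names and constraints are hypothetical.
SCHEMA = """
CREATE TABLE KTB   (Kid INTEGER PRIMARY KEY, Ksent TEXT NOT NULL);
CREATE TABLE ETB   (Eid INTEGER PRIMARY KEY, Esent TEXT NOT NULL);
CREATE TABLE KEMAT (Kid INTEGER NOT NULL REFERENCES KTB(Kid),
                    Eid INTEGER NOT NULL REFERENCES ETB(Eid),
                    PRIMARY KEY (Kid, Eid));
CREATE TABLE KWTB  (Wid  INTEGER PRIMARY KEY,
                    Word TEXT NOT NULL UNIQUE,   -- the two-syllable unit
                    Scnt INTEGER NOT NULL);      -- sentences containing it
CREATE TABLE WKMAT (Wid INTEGER NOT NULL REFERENCES KWTB(Wid),
                    Kid INTEGER NOT NULL REFERENCES KTB(Kid),
                    PRIMARY KEY (Wid, Kid));
"""


def create_schema(conn):
    """Create the five relations on an open SQLite connection."""
    conn.executescript(SCHEMA)


def table_names(conn):
    """Return the names of all tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    print(table_names(connection))
```

In this sketch WKMAT plays the role of an inverted index from two-syllable units to sentences, which is what lets the translation step find candidate sentences without scanning KTB.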

In the database construction step, a Korean sentence and its corresponding English sentence are entered through the user interface, and the system checks whether the Korean sentence is already registered. Registration is checked by string comparison: the entered English sentence is compared against the English sentences in the English sentence relation (ETB) of FIG. 2. If the new Korean-English sentence pair is not registered in the database, the entered pair is added. For example, for the pair "나는 교회에 간다" and "I go to church", the English sentence is not yet registered, so the pair is added to the database. In FIG. 2, the entries marked with an asterisk (*) are those added or changed. First, the Korean sentence is added to the KTB relation, the English sentence to the ETB relation, and the Korean and English sentence numbers to the KEMAT relation. Next, punctuation (commas, question marks, exclamation marks, and so on) is removed from the Korean sentence and the sentence is decomposed into two-syllable units. Each unit is looked up in the KWTB relation: if it is present, its syllable number is retrieved; if not, the unit is added to KWTB and its syllable number is then obtained. The Scnt value of each existing unit is incremented by 1, and the Scnt value of each new unit is initialized to 1. In the KWTB relation of FIG. 2, the records with Wid 16, 17, and 18 have Scnt initialized to 1, and the records with Wid 10 and 15 have Scnt incremented by 1.

Finally, each of these syllable numbers is added to the WKMAT relation of FIG. 2, paired with the Korean sentence number assigned when the sentence was added to the KTB relation.
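The construction step can be sketched in a few lines of Python. This is an illustrative in-memory version under stated assumptions, not the patented implementation: the class `TranslationDB`, the function `two_syllable_units`, and the sequential numbering of sentences and units are hypothetical; only ASCII punctuation is stripped; spaces are removed as in the "나는교회에간다" example; and each Hangul syllable is assumed to be a single precomposed Unicode character.

```python
import string

PUNCTUATION = set(string.punctuation)  # ASCII punctuation only (assumption)


def two_syllable_units(sentence):
    """Strip punctuation and spaces, then return overlapping 2-character units."""
    cleaned = "".join(
        ch for ch in sentence if ch not in PUNCTUATION and not ch.isspace()
    )
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]


class TranslationDB:
    """Hypothetical in-memory stand-in for the five relations."""

    def __init__(self):
        self.ktb = {}       # Korean sentence number -> Korean sentence
        self.etb = {}       # English sentence number -> English sentence
        self.kemat = []     # (Korean number, English number) pairs
        self.kwtb = {}      # unit -> {"Wid": number, "Scnt": sentence count}
        self.wkmat = set()  # (Wid, Korean sentence number) pairs

    def add_pair(self, korean, english):
        """Register a sentence pair; return its Korean number, or None if known."""
        # Registration check: string comparison against the stored English sentences.
        if english in self.etb.values():
            return None
        kid = len(self.ktb) + 1
        eid = len(self.etb) + 1
        self.ktb[kid] = korean
        self.etb[eid] = english
        self.kemat.append((kid, eid))
        # Count each distinct unit once per sentence, preserving first-seen order.
        for unit in dict.fromkeys(two_syllable_units(korean)):
            if unit in self.kwtb:
                self.kwtb[unit]["Scnt"] += 1   # existing unit: increment
            else:
                self.kwtb[unit] = {"Wid": len(self.kwtb) + 1, "Scnt": 1}
            self.wkmat.add((self.kwtb[unit]["Wid"], kid))
        return kid


if __name__ == "__main__":
    db = TranslationDB()
    db.add_pair("나는 학교에 간다", "I go to school")  # hypothetical earlier entry
    db.add_pair("나는 교회에 간다", "I go to church")
    print(db.kwtb)
```

Adding the second pair increments Scnt for the units the two sentences share and initializes it to 1 for the new ones, mirroring the asterisked changes described for FIG. 2, although the Wid values here will not match those in the figure.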

In the translation step, the system receives the Korean sentence the user wants translated, removes its punctuation (commas, question marks, exclamation marks, and so on), and decomposes it into two-syllable units using the n-gram technique. For each unit, it searches the KWTB relation of FIG. 2 to obtain the syllable number (Wid), and uses that number to obtain sentence numbers from the WKMAT relation. Once the containing sentence numbers have been collected for all units, they are sorted so that the sentences containing the most units come first, and the corresponding English and Korean sentences are retrieved from the KEMAT, ETB, and KTB relations and shown to the user.

For example, to translate the Korean sentence "나는 교회에 간다", the system first strips the punctuation to obtain "나는교회에간다", then decomposes it into the two-syllable units "나는, 는교, 교회, 회에, 에간, 간다", and searches the KWTB relation to obtain the syllable numbers "10, 16, 17, 18, 15". From the WKMAT relation, number 10 yields sentence numbers (100, 101), 16 yields 101, 17 yields 101, 18 yields 101, and 15 yields (100, 101), so sentence 101 is the desired one. Sentence number 101 is then used with KEMAT, KTB, and ETB to obtain English sentence 201, and English sentence 201 and Korean sentence 101 are shown to the user.
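The lookup and ranking in this example amount to counting, for each stored sentence, how many of the query's two-syllable units it contains. The sketch below illustrates that idea under assumptions: the sentence numbers 100, 101, and 201 echo the example, but the content of sentence 100, the English sentence 200, the Wid values, and all function names are hypothetical, and ties are broken by sentence number, which the description does not specify.

```python
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)  # ASCII punctuation only (assumption)


def two_syllable_units(sentence):
    """Strip punctuation and spaces, then return overlapping 2-character units."""
    cleaned = "".join(
        ch for ch in sentence if ch not in PUNCTUATION and not ch.isspace()
    )
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]


# Hypothetical database contents; only the numbers 100, 101, 201 echo the text.
KTB = {100: "나는 학교에 간다", 101: "나는 교회에 간다"}
ETB = {200: "I go to school", 201: "I go to church"}
KEMAT = {100: 200, 101: 201}


def build_index(ktb):
    """Build KWTB (unit -> Wid) and WKMAT (Wid -> set of sentence numbers)."""
    kwtb, wkmat = {}, {}
    for kid, sentence in ktb.items():
        for unit in two_syllable_units(sentence):
            wid = kwtb.setdefault(unit, len(kwtb) + 1)
            wkmat.setdefault(wid, set()).add(kid)
    return kwtb, wkmat


def translate(query, kwtb, wkmat):
    """Return (Korean, English, matched-unit count) tuples, best match first."""
    votes = Counter()
    for unit in dict.fromkeys(two_syllable_units(query)):
        wid = kwtb.get(unit)
        if wid is not None:            # units unknown to KWTB contribute nothing
            votes.update(wkmat[wid])   # one vote per sentence containing the unit
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [(KTB[kid], ETB[KEMAT[kid]], score) for kid, score in ranked]


if __name__ == "__main__":
    kwtb, wkmat = build_index(KTB)
    for korean, english, score in translate("나는 교회에 간다.", kwtb, wkmat):
        print(score, korean, "->", english)
```

Because the ranking is a simple vote count, a query that is not stored verbatim still returns the nearest stored sentences, which is the "closest sentence" behaviour the invention aims for.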

As described above, the present invention uses database and information retrieval techniques to build a database of Korean sentences and their corresponding English sentences, searches that database with the user's Korean query, and translates the query into the matching English sentence. Because query processing will become much faster once gigahertz CPU speeds and terabyte storage become commonplace, the invention can be widely used in every field that requires translation.

Claims

A Korean-English translation database system using information retrieval techniques, comprising: a database construction step; and a step of translating a Korean sentence into an English sentence using the database, in which the queried Korean sentence is divided into two-syllable units using the n-gram technique, sentences containing the resulting units are retrieved from the database, and the retrieved sentences are sorted so that those containing the most units come first and are presented through the user interface.

The system of claim 1, wherein the database construction step receives a Korean sentence and its corresponding English sentence from the user interface and checks whether the Korean sentence is already registered, the check being made by string comparison of the entered English sentence against the English sentences of the English sentence relation (ETB), and, if the new Korean-English sentence pair is not registered in the database, adds the entered pair to the database.

The system of claim 1, wherein the translation step receives the Korean sentence the user wants translated, removes the punctuation (commas, question marks, exclamation marks, and so on) from the sentence, decomposes the sentence into two-syllable units using the n-gram technique, searches the KWTB relation for each unit to obtain its syllable number (Wid), obtains sentence numbers from the WKMAT relation using that syllable number, sorts the sentence numbers collected for all units so that sentences containing the most units come first, and obtains the English and Korean sentences from the KEMAT, ETB, and KTB relations and shows them to the user.