KR20050063990A

KR20050063990A - Translation unit extraction and search device for machine translation and method using it

Info

Publication number: KR20050063990A
Application number: KR1020030095250A
Authority: KR
Inventors: 윤승; 이영직
Original assignee: 한국전자통신연구원
Priority date: 2003-12-23
Filing date: 2003-12-23
Publication date: 2005-06-29
Also published as: KR100511409B1

Abstract

The present invention relates to an apparatus and method for extracting and retrieving translation units for machine translation. Morpheme sequences that serve as the basic units of machine translation are extracted from a corpus and then located in an input sentence to be translated, so that translation can be carried out on the basis of these translation units, which raises the translation success rate.

To this end, the invention comprises: a translation unit candidate extraction module that morphologically analyzes the sentences of a corpus, searches for translation unit candidates based on the analysis results, stores them, and records their frequencies; a basic translation unit selection/storage module that selects basic translation units from those candidates and stores the selected units; and an online translation unit search module that morphologically analyzes an input sentence and finds the basic translation units of that sentence in the basic translation unit database.

Description

TRANSLATION UNIT EXTRACTION AND SEARCH DEVICE FOR MACHINE TRANSLATION AND METHOD USING IT

The present invention relates to a translation unit extraction and retrieval device for machine translation and to a translation unit extraction and retrieval method using the device. More specifically, to raise the translation success rate, the translation units to be used in translation are extracted from a corpus in advance and are then located in an input sentence to be translated, so that they can be used in the subsequent stages of the translation process.

Conventional machine translation systems have mainly adopted the transfer-based approach.

The transfer-based approach produces a target sentence by passing the input sentence through morphological analysis, syntactic analysis, and generation. Simple sentences pose no difficulty, but the longer a sentence becomes, the more candidate structures it admits. This makes a correct parse harder to obtain, and the quality of the target sentence degrades accordingly.

Corpus-based statistical machine translation systems and example-based machine translation systems have also been developed recently. An example-based system, however, usually treats a whole sentence as one example, so the longer the sentence, the harder it is to find a matching example. A statistical system likewise has difficulty reordering words in the translation when a long sentence contains many words.

The present invention was made to solve these problems. Its object is to provide a translation unit extraction/retrieval device and method for machine translation comprising a module that extracts all possible translation unit candidates from a corpus sentence by sentence, a module that computes the frequencies of the extracted candidates and stores the translation units to be used in later translation in a database, and a module that finds those translation units in an input sentence during machine translation, so that machine translation is not affected by sentence length and the translation success rate improves.

To achieve this object, the translation unit extraction/retrieval device for machine translation comprises: a translation unit candidate extraction module that morphologically analyzes the sentences of a corpus, searches for translation unit candidates based on the analysis results, stores them, and records their frequencies; a basic translation unit selection/storage module that selects basic translation units from the candidates of the translation unit candidate extraction module and stores the selected units; and an online translation unit search module that morphologically analyzes an input sentence and finds the basic translation units of that sentence in the basic translation unit database.

To achieve the same object, the translation unit extraction/retrieval method for machine translation comprises the steps of: (a) reading an input sentence to be translated and analyzing its morphemes; (b) reading basic translation units one by one, in order, from the basic translation unit database generated by the basic translation unit selection and storage module; (c) determining whether the search of the basic translation units stored in the basic translation unit database has been completed and, when the search has been completed, determining whether the input sentence contains a part that matches the basic translation unit; and (d) when the input sentence contains a part that matches the basic translation unit, marking the matching part in the sentence.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of the translation unit extraction and retrieval device for machine translation according to the present invention.

As shown in FIG. 1, the device consists broadly of: a translation unit candidate extraction module (20) that morphologically analyzes the sentences of a corpus (10), searches for translation unit candidates based on the analysis results, stores them, and records their frequencies; a basic translation unit selection/storage module (30) that selects basic translation units from the candidates of the translation unit candidate extraction module (20) and stores the selected units; and an online translation unit search module (50) that morphologically analyzes an input sentence and finds the basic translation units of that sentence in the basic translation unit database (33).

First, the translation unit candidate extraction module (20) comprises: a morpheme analysis unit (21) that analyzes the morphemes of each word (eojeol) in the corpus (10) and determines the part of speech of each morpheme; a translation unit candidate search unit (22) that, for each sentence of the analysis result, determines translation unit boundaries based on space characters, calculates the number of cases that can be generated from the sentence on the basis of those boundaries, and generates that many translation unit candidates; and a translation unit candidate storage unit (24) that stores the candidates found by the translation unit candidate search unit (22) in the translation unit candidate database (23) and records their frequencies.

The basic translation unit selection/storage module (30) comprises: a translation unit candidate removing unit (31) that searches the translation unit candidate database (23) stored by the translation unit candidate storage unit (24) and removes invalid candidates; a basic translation unit selection unit (32) that reads the candidates from the translation unit candidate database (23) after the invalid candidates have been removed by the removing unit (31) and selects high-frequency translation units as basic translation units; and a basic translation unit storage unit (34) that stores the high-frequency units selected by the selection unit (32) in the basic translation unit database (33). For Korean, the removing unit (31) removes a candidate when a predicate, predicative case particle, verb, derivational suffix, symbol, or the like is located at the beginning or in the middle of the translation unit, or when a dependent noun is located at the beginning of the translation unit. For English, it removes a candidate when an interrogative determiner, interrogative adverb, or the like is located in the middle or at the end of the translation unit, when a symbol is located at the beginning or in the middle, or when a preposition is located at the end.

When computing the frequencies of the translation unit candidates, the basic translation unit selection unit (32) does not process the entire translation unit candidate database (23) at once. Instead, it groups the translation units by the number of morphemes they contain and computes frequencies within each group.

For example, for the group of longest translation units, roughly the top 5% of the units in that group are included in the basic translation units; for the group of shortest translation units, which contain two morphemes, only roughly the top 2.5% are included, with the proportion reduced step by step between the two.

The cutoff that decides which candidates are included in the basic translation units may of course vary with the size and nature of the corpus (10).
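The per-length selection described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_basic_units` is hypothetical, the proportion is interpolated linearly between the two quoted endpoints (about 2.5% for two-morpheme units, about 5% for the longest units), and the count is rounded up so that every non-empty group contributes at least one unit. The source gives only the two endpoints and says the reduction is stepwise, so the exact schedule is not known.

```python
import math
from collections import defaultdict


def select_basic_units(counts, short_frac=0.025, long_frac=0.05):
    """Pick high-frequency units separately within each length group.

    counts maps a translation unit (tuple of morphemes) to its corpus frequency.
    """
    groups = defaultdict(list)
    for unit, freq in counts.items():
        groups[len(unit)].append((freq, unit))
    if not groups:
        return set()

    shortest, longest = min(groups), max(groups)
    selected = set()
    for length, members in groups.items():
        # Assumption: linear interpolation between the two endpoint fractions.
        if longest == shortest:
            t = 0.0
        else:
            t = (length - shortest) / (longest - shortest)
        frac = short_frac + t * (long_frac - short_frac)
        # Assumption: round up, so a non-empty group keeps at least one unit.
        keep = math.ceil(frac * len(members))
        members.sort(key=lambda m: m[0], reverse=True)
        selected.update(unit for _, unit in members[:keep])
    return selected


# Toy data: 100 two-morpheme units and 50 three-morpheme units.
toy = {("a%d" % i, "b"): i + 1 for i in range(100)}
toy.update({("c%d" % i, "d", "e"): i + 1 for i in range(50)})
basic = select_basic_units(toy)
```

Grouping by length before ranking keeps short units, which naturally occur far more often, from crowding out longer ones.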

The online translation unit search module (50) comprises: a morpheme analysis unit (51) that, when a sentence to be translated is input, analyzes the morphemes of each word (eojeol) in the input sentence and determines the part of speech of each morpheme; and an online basic translation unit search unit (52) that uses the basic translation unit database (33) built by the basic translation unit selection/storage module (30) to search for basic translation units in the input sentence and, when one is present, marks it in the sentence. Because the online basic translation unit search unit (52) marks the basic translation units in the sentence, they can be used in the subsequent translation process.

FIG. 2 is a flowchart showing how the translation unit candidate extraction module according to the present invention generates the translation unit candidate database.

With reference to FIG. 2, the generation of the translation unit candidate database is described in more detail below.

First, the sentences of the corpus (10) are read one at a time, in order, from the first sentence through the last (S100).

It is then determined whether the end of the corpus (10) has been reached (S110). If there are no more sentences to read, the module terminates. Otherwise, the morpheme analysis unit (21) of the translation unit candidate extraction module (20) analyzes the morphemes in the sentence (S120) and counts the total number of morphemes (S130).

For example, if the sentence read from the corpus (10) is <차는 연료 탱크를 가득 채운 상태로 반납해 주십시오> ("Please return the car with the fuel tank full"), the analyzed morphemes are <차/ncn+는/jxt 연료/ncn 탱크/ncn+를/jco 가득/mag 채우/pvg+ㄴ/etm 상태/ncn+로/jca 반납/ncpa+해/xsv 주/px+시/ep+ㅂ시오/ef>.
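To make the notation of the analysis result concrete, the sketch below splits such a string into a flat list of morphemes. It assumes, from the example alone, that spaces separate words (eojeols), that "+" joins morphemes within a word, and that "/" separates the surface form from its part-of-speech tag. The function name `parse_analysis` is hypothetical; the source does not describe the analyzer's actual output format or API.

```python
def parse_analysis(text):
    """Split an analysis string into a flat list of (surface, tag) pairs."""
    morphemes = []
    for eojeol in text.split():            # spaces separate words (eojeols)
        for piece in eojeol.split("+"):    # "+" joins morphemes within a word
            surface, tag = piece.rsplit("/", 1)
            morphemes.append((surface, tag))
    return morphemes


analysis = ("차/ncn+는/jxt 연료/ncn 탱크/ncn+를/jco 가득/mag 채우/pvg+ㄴ/etm "
            "상태/ncn+로/jca 반납/ncpa+해/xsv 주/px+시/ep+ㅂ시오/ef")
morphemes = parse_analysis(analysis)
print(len(morphemes))  # number of morphemes in the example sentence
```

The length of the resulting list is the total morpheme count computed in step S130.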

Next, it is determined whether the total number of morphemes is 2 or more (S140). If it is, the entire morpheme sequence is stored in the translation unit candidate database (23) (S150).

In the example sentence the total number of morphemes is 15, so the whole sequence <차/ncn+는/jxt 연료/ncn 탱크/ncn+를/jco 가득/mag 채우/pvg+ㄴ/etm 상태/ncn+로/jca 반납/ncpa+해/xsv 주/px+시/ep+ㅂ시오/ef> becomes a single translation unit candidate.

If the total number of morphemes is less than 2, the sentence cannot form a translation unit, so processing of the current sentence ends and the next sentence is read.

Next, the morpheme count of the sequence stored in the translation unit candidate database (23) (15 in the example) is reduced by one (leaving 14) (S160), and it is determined whether the remaining count is 2 or more (S170). If it is, all possible translation units of the current length are found (S180).

For the example sentence, two such translation units can be found (the sequences that contain exactly 14 morphemes).

First, translation unit 14-1: <는/jxt 연료/ncn 탱크/ncn+를/jco 가득/mag 채우/pvg+ㄴ/etm 상태/ncn+로/jca 반납/ncpa+해/xsv 주/px+시/ep+ㅂ시오/ef>

Second, translation unit 14-2: <차/ncn+는/jxt 연료/ncn 탱크/ncn+를/jco 가득/mag 채우/pvg+ㄴ/etm 상태/ncn+로/jca 반납/ncpa+해/xsv 주/px+시/ep>
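The count of two follows from simple arithmetic, assuming (as the example suggests) that every contiguous run of morphemes is a candidate. In a sentence of n morphemes, a run of length k can start at n - k + 1 positions, so the number of candidates of each length, and the total over all lengths from 2 to n, are:

```latex
\[
  N_k = n - k + 1,
  \qquad
  \sum_{k=2}^{n} (n - k + 1) = \frac{n(n-1)}{2}.
\]
% Example sentence: n = 15, so N_{14} = 15 - 14 + 1 = 2
% and the total is 15 * 14 / 2 = 105 candidates.
```

For the example sentence (n = 15) this gives 2 candidates of length 14 and 105 candidates in total, before any duplicates are merged or invalid candidates removed.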

It is then determined whether each translation unit found (translation unit 14-1, translation unit 14-2) already exists in the translation unit candidate database (23) (S190). If the same translation unit exists, its occurrence frequency is recorded (S200); if it does not, the translation unit is stored in the translation unit candidate database (23) (S210).

If the example sentence is the first sentence processed, so that nothing identical has yet been recorded in the translation unit candidate database (23), both translation unit 14-1 and translation unit 14-2 are stored.

Then, as long as the remaining number of morphemes is 2 or more (S220), steps S160 through S220 are repeated.
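The loop of steps S140 to S220 can be sketched as below. This is a minimal illustration, not the patented implementation. It assumes an in-memory `collections.Counter` stands in for the translation unit candidate database (23), that every contiguous run of morphemes counts as a candidate, and that the example's "+" joins are flattened into separate tokens. The function name `extract_candidates` is hypothetical.

```python
from collections import Counter


def extract_candidates(morphemes, counts=None):
    """Record every contiguous run of two or more morphemes, longest first."""
    counts = Counter() if counts is None else counts
    n = len(morphemes)
    # S140/S170: a run shorter than two morphemes cannot be a translation unit,
    # so the range is empty when n < 2.
    for length in range(n, 1, -1):            # S160: shrink the length by one
        for start in range(n - length + 1):   # S180: all runs of this length
            unit = tuple(morphemes[start:start + length])
            counts[unit] += 1                 # S190-S210: store new, count repeats
    return counts


sentence = ("차/ncn 는/jxt 연료/ncn 탱크/ncn 를/jco 가득/mag 채우/pvg ㄴ/etm "
            "상태/ncn 로/jca 반납/ncpa 해/xsv 주/px 시/ep ㅂ시오/ef").split()
candidates = extract_candidates(sentence)
fourteen = [unit for unit in candidates if len(unit) == 14]
```

Passing the same counter back in for each subsequent sentence accumulates frequencies across the corpus, which is what the later selection step relies on.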

FIG. 3 is a flowchart showing how the basic translation unit selection/storage module according to the present invention generates the basic translation unit database.

With reference to FIG. 3, the generation of the basic translation unit database (33) is described in more detail below.

First, the translation unit candidates are read from the translation unit candidate database (23) generated by the translation unit candidate extraction module (20) (S300). It is then determined whether the candidates read include any that should be excluded (S310), and if so, those candidates are removed (S320).

In the example sentence, the candidates removed include a translation unit that begins with an ending, such as '시/ep+시오/ef'; a translation unit that begins with an auxiliary particle, such as '는/jxt 연료/ncn'; and a translation unit that ends with an adverb, such as '차/ncn+는/jxt 연료/ncn 탱크/ncn+를/jco 가득/mag' (in Korean, an adverb usually modifies the predicate that follows it).
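The three example exclusions can be expressed as a small filter. This is only a sketch: the tag sets below are taken from the tags that appear in the example sentence and are assumptions, not the full rule set of the removing unit (31), and the names `tag_of` and `is_excluded` are hypothetical.

```python
# Assumed tag sets, limited to tags seen in the example sentence.
ENDING_TAGS = {"ep", "ef", "etm"}   # endings
AUX_PARTICLE_TAGS = {"jxt"}         # auxiliary particles
ADVERB_TAGS = {"mag"}               # adverbs


def tag_of(morpheme):
    """Return the part-of-speech tag of a 'surface/tag' string."""
    return morpheme.rsplit("/", 1)[1]


def is_excluded(unit):
    """True if the unit starts with an ending or auxiliary particle,
    or ends with an adverb."""
    first, last = tag_of(unit[0]), tag_of(unit[-1])
    return (first in ENDING_TAGS
            or first in AUX_PARTICLE_TAGS
            or last in ADVERB_TAGS)


units = [
    ("시/ep", "시오/ef"),
    ("는/jxt", "연료/ncn"),
    ("차/ncn", "는/jxt", "연료/ncn", "탱크/ncn", "를/jco", "가득/mag"),
    ("연료/ncn", "탱크/ncn"),
]
kept = [unit for unit in units if not is_excluded(unit)]
```

Only the last unit survives this filter; the other three correspond to the three excluded examples in the text.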

Next, occurrence frequencies are computed with the translation units grouped by the number of morphemes they contain (S330), and the high-frequency translation units are selected as basic translation units and stored in the basic translation unit database (33) (S340).

As described with reference to FIG. 2, once every sentence in the corpus (10) has been processed, the number of times each translation unit appeared in the corpus (10) has been recorded. These frequencies are used to determine the appropriate high-frequency cutoff for the translation units.

FIG. 4 is a flowchart of the translation unit extraction and retrieval method for machine translation according to the present invention.

With reference to FIG. 4, the translation unit extraction and retrieval method of the present invention is described in detail below.

First, an input sentence to be translated is read (S400), and its morphemes are analyzed (S410).

Then, from the basic translation unit database (33) generated by the basic translation unit selection and storage module (30) as described with reference to FIG. 3 (S420), the basic translation units are read one by one in order (S430).

At this point it is determined whether the search of the basic translation units stored in the basic translation unit database (33) has been completed (S440). If it has, the sentence is output (S450). If it has not, it is determined whether the input sentence contains a part that matches the basic translation unit (S460, S470).

If the input sentence contains a part that matches the basic translation unit, the matching part is marked in the sentence (S480).

Taking the example sentence of the present invention, the parts enclosed in < > below are the parts of the sentence that match units selected as basic translation units.

Example: 차/ncn+는/jxt <연료/ncn 탱크/ncn+를/jco 가득/mag 채우/pvg+ㄴ/etm 상태/ncn>+로/jca <반납/ncpa+해/xsv 주/px>+시/ep+ㅂ시오/ef

Marking the matching parts in this way prevents duplicate searching and thus reduces the complexity of the machine translation process.
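The marking step can be sketched as follows, assuming the basic translation unit database is available as an ordered list of morpheme tuples and that a position already marked is never matched again. The names `mark_units` and `render` are hypothetical, and the example's "+" joins are flattened into spaces.

```python
def mark_units(morphemes, basic_units):
    """Return sorted (start, end) spans of non-overlapping matches."""
    marked = [False] * len(morphemes)
    spans = []
    for unit in basic_units:                        # S430: read units in order
        k = len(unit)
        for start in range(len(morphemes) - k + 1):
            window = tuple(morphemes[start:start + k])
            # S460/S470: match only where nothing has been marked yet.
            if window == tuple(unit) and not any(marked[start:start + k]):
                marked[start:start + k] = [True] * k   # S480: mark the match
                spans.append((start, start + k))
    return sorted(spans)


def render(morphemes, spans):
    """Wrap each marked span in angle brackets, as in the example."""
    out = list(morphemes)
    for start, end in spans:
        out[start] = "<" + out[start]
        out[end - 1] = out[end - 1] + ">"
    return " ".join(out)


sentence = ("차/ncn 는/jxt 연료/ncn 탱크/ncn 를/jco 가득/mag 채우/pvg ㄴ/etm "
            "상태/ncn 로/jca 반납/ncpa 해/xsv 주/px 시/ep ㅂ시오/ef").split()
basic_units = [
    ("연료/ncn", "탱크/ncn", "를/jco", "가득/mag", "채우/pvg", "ㄴ/etm", "상태/ncn"),
    ("반납/ncpa", "해/xsv", "주/px"),
    ("연료/ncn", "탱크/ncn"),   # skipped: its positions are already marked
]
spans = mark_units(sentence, basic_units)
print(render(sentence, spans))
```

Because marked positions are skipped, the shorter unit in the list does not produce a second, overlapping match. Listing longer units first is a design choice of this sketch, not something the source specifies.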

Preferred embodiments of the present invention have been illustrated and described above, but the invention is not limited to those embodiments. Anyone with ordinary skill in the field to which the invention belongs can make various modifications without departing from the gist of the invention as claimed, and such modifications fall within the scope of the claims.

As described above, the present invention provides basic translation units during translation so that translation is carried out on the basis of those units, which reduces the complexity of the machine translation process.

The present invention can also be applied to a variety of machine translation systems to improve their translation success rate.

FIG. 1 is a block diagram of the translation unit extraction and retrieval device for machine translation according to the present invention;

FIG. 2 is a flowchart showing how the translation unit candidate extraction module according to the present invention generates the translation unit candidate database;

FIG. 3 is a flowchart showing how the basic translation unit selection/storage module according to the present invention generates the basic translation unit database.

<Explanation of reference numerals for the main parts of the drawings>

10: corpus
20: translation unit candidate extraction module
21: morpheme analysis unit
22: translation unit candidate search unit
23: translation unit candidate database
24: translation unit candidate storage unit
30: basic translation unit selection/storage module
31: translation unit candidate removing unit
32: basic translation unit selection unit
33: basic translation unit database
34: basic translation unit storage unit
40: sentence to be translated
50: online translation unit search module
51: morpheme analysis unit
52: online basic translation unit search unit

Claims

A translation unit extraction and retrieval device for machine translation, comprising:

a translation unit candidate extraction module that morphologically analyzes the sentences of a corpus, searches for translation unit candidates based on the analysis results, stores them, and records their frequencies;

a basic translation unit selection/storage module that selects basic translation units from the translation unit candidates of the translation unit candidate extraction module and stores the selected basic translation units; and

an online translation unit search module that analyzes the morphemes of an input sentence and finds the basic translation units of the input sentence in a basic translation unit database.

The device of claim 1, wherein the translation unit candidate extraction module comprises:

a morpheme analysis unit that analyzes the morphemes of each word in the corpus and determines the part of speech of each morpheme;

a translation unit candidate search unit that, for each sentence of the morphological analysis result, determines translation unit boundaries based on space characters, calculates the number of cases that can be generated from the sentence on the basis of the translation unit boundaries, and generates that number of translation unit candidates; and

a translation unit candidate storage unit that stores the translation unit candidates found by the translation unit candidate search unit in a translation unit candidate database and records their frequencies.

The device of claim 1, wherein the basic translation unit selection/storage module comprises:

a translation unit candidate removing unit that searches the translation unit candidate database stored by the translation unit candidate storage unit and, when a morpheme that serves as the starting or ending point of a continuous meaning is located at a wrong position in a translation unit, excludes that translation unit candidate from the basic translation units;

a basic translation unit selection unit that reads translation unit candidates from the translation unit candidate database from which the wrong translation unit candidates have been removed by the translation unit candidate removing unit, and selects high-frequency translation units as basic translation units; and

a basic translation unit storage unit that stores the high-frequency translation units selected by the basic translation unit selection unit in a basic translation unit database.

The device of claim 3, wherein the translation unit candidate removing unit,

for Korean, removes a translation unit candidate when a predicate, predicative case particle, verb, derivational suffix, symbol, or the like is located at the beginning or in the middle of the translation unit, or when a dependent noun is located at the beginning of the translation unit.

The device of claim 3, wherein the translation unit candidate removing unit,

for English, removes a translation unit candidate when an interrogative determiner, interrogative adverb, or the like is located in the middle or at the end of the translation unit, when a symbol is located at the beginning or in the middle of the translation unit, or when a preposition is located at the end of the translation unit.

The device of claim 1, wherein the online translation unit search module comprises:

a morpheme analysis unit that analyzes the morphemes of each word in an input sentence to be translated and determines the part of speech of each morpheme; and

an online basic translation unit search unit that uses the basic translation unit database built by the basic translation unit selection/storage module to search for basic translation units in the input sentence to be translated and, when a basic translation unit is present, marks that basic translation unit in the sentence.

A translation unit extraction/retrieval method for machine translation, comprising:

(a) reading an input sentence to be translated and analyzing the morphemes of the input sentence;

(b) reading basic translation units one by one, in order, from the basic translation unit database generated by the basic translation unit selection and storage module;

(c) determining whether the search of the basic translation units stored in the basic translation unit database has been completed and, when the search has been completed, determining whether the input sentence contains a part that matches the basic translation unit; and

(d) when the input sentence contains a part that matches the basic translation unit, marking the matching part in the sentence.

The method of claim 7, wherein the basic translation unit database of step (b) is generated by:

(e) reading translation unit candidates from a translation unit candidate database generated by the translation unit candidate extraction module;

(f) determining whether the translation unit candidates read include any to be excluded and, if so, removing them; and

(g) computing occurrence frequencies based on the number of morphemes in each translation unit, selecting high-frequency translation units as basic translation units, and storing them in the basic translation unit database.

The method of claim 8, wherein the translation unit candidate database of step (e) is generated by:

(h) reading sentences one by one from the corpus, analyzing the morphemes in each sentence, and counting the total number of morphemes;

(i) determining whether the total number of morphemes is 2 or more and, if it is, storing the entire morpheme sequence in a translation unit candidate database;

(j) subtracting one from the total number of morphemes stored in the translation unit candidate database and determining whether the number of remaining morphemes is 2 or more;

(k) if the number of remaining morphemes is 2 or more, finding the translation units of the current number of morphemes;

(l) determining whether each translation unit found already exists in the translation unit candidate database, recording its occurrence frequency if the same translation unit exists, and storing the translation unit in the translation unit candidate database if it does not; and

(m) repeating steps (j) through (m) while the number of remaining morphemes is 2 or more.