KR20080052282A

KR20080052282A - Apparatus and method for unsupervised learning translation relationships among words and phrases in the statistical machine translation system

Info

Publication number: KR20080052282A
Application number: KR1020070076140A
Authority: KR
Inventors: 황영숙; 박상규; 김영길; 김창현; 양성일; 서영애; 홍문표; 윤창호
Original assignee: 한국전자통신연구원
Priority date: 2006-12-05
Filing date: 2007-07-30
Publication date: 2008-06-11
Also published as: KR100911372B1

Abstract

An apparatus and a method for unsupervised learning of translation relationships among words and phrases in a statistical machine translation system are provided. They improve the accuracy of phrase alignment by combining multiple word alignment results and using the word alignments on which those results agree. The apparatus includes a source language sentence preprocessor (101), a target language sentence preprocessor (102), an unsupervised learner (200), a learning termination condition checker (210), a statistical machine translation model parameter extractor (300), and a decoder (500). The source language sentence preprocessor receives a source-language sentence that has been morphologically and syntactically analyzed. The target language sentence preprocessor receives a target-language sentence that has been morphologically and syntactically analyzed. The unsupervised learner receives training sets from the source language sentence preprocessor and the target language sentence preprocessor. The learning termination condition checker causes word alignment, word realignment, phrase alignment, and acquisition of the word and phrase translation dictionary to be repeated until the termination condition is satisfied, namely that the word and phrase translation dictionary no longer changes. The statistical machine translation model parameter extractor extracts parameters for a statistical translation model from the word and phrase alignment results obtained by this learning. The decoder uses the statistical machine translation model, together with a separately trained language model (400), to generate a target-language sentence from an input source-language sentence.

Description

Apparatus and method for unsupervised learning translation relationships among words and phrases in the statistical machine translation system

FIG. 1 is a schematic block diagram of an exemplary machine translation system to which a learning apparatus according to the present invention is applied.

FIG. 2 is a more detailed block diagram of the unsupervised learner shown in FIG. 1.

* Explanation of reference numerals for the main parts of the drawings *

101: source language sentence preprocessor

102: target language sentence preprocessor

200: unsupervised learner

201: word aligner

202: word realigner

203: phrase aligner

204: translation dictionary generator

210: learning termination condition checker

300: statistical machine translation model parameter extractor

400: language model

500: decoder

The present invention relates generally to machine translation systems and, more particularly, to an apparatus, applicable to a machine translation system, that learns translation relationships between words and phrases without supervision on the basis of statistical methods.

A machine translation system receives text composed of source-language sentences, translates it into a target language, and outputs the result. Such a system typically uses a set of sentence-aligned parallel sentences as training data to learn a word translation dictionary (lexicon) and a phrase translation dictionary, and to acquire the other translation knowledge it needs. The usual approach to deriving a translation dictionary from sentence-aligned training data measures the degree of association between source-language words and target-language words in the aligned sentences and builds the dictionary from all word pairs whose association exceeds a threshold. In existing approaches, for example, the degree of association is measured by how often the words co-occur in parallel sentences (or corresponding regions). Association scores are computed for the different word pairs, the pairs are sorted in descending order of score, and translation pairs are selected according to a chosen threshold and added to the translation dictionary. However, because this method computes the association score of each word pair independently, it can establish incorrect translation relationships between the constituent words of compounds. For example, the English translation of the Korean "파일 시스템" is "file system", and the translation of "시스템 파일" is "system files". If such compound pairs appear in many sentences, it becomes more likely that the constituent words will be paired as "파일 (file) / system" and "시스템 (system) / files".
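The failure mode described above can be reproduced with a small sketch. The Dice coefficient used as the association score, the function name `cooccurrence_lexicon`, and the threshold value are assumptions chosen for illustration; the text only says that the score is based on co-occurrence counts and a threshold.

```python
from collections import Counter
from itertools import product


def cooccurrence_lexicon(pairs, threshold):
    """Rank word pairs by an independent co-occurrence score and keep those above a threshold.

    The Dice coefficient is an assumed association measure, not one named in the text.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_set, t_set = set(src), set(tgt)
        src_count.update(s_set)
        tgt_count.update(t_set)
        joint.update(product(s_set, t_set))  # every source word with every target word
    scores = {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in joint.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(pair, score) for pair, score in ranked if score >= threshold]


pairs = [
    (["파일", "시스템"], ["file", "system"]),
    (["시스템", "파일"], ["system", "files"]),
]
lexicon = dict(cooccurrence_lexicon(pairs, threshold=0.6))
```

On this toy data the incorrect pair "파일"/"system" scores higher than the correct pair "파일"/"file", because "system" co-occurs with "파일" in both sentences while "file" appears in only one.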

One attempt to address this problem is Melamed's "Automatic Construction of Clean Broad-Coverage Translation Lexicons" (Second Conference of the Association for Machine Translation in the Americas (AMTA), 1996). When a highly associated word pair is derived from aligned sentences that also contain an even more highly associated pair involving one or both of the same words, Melamed selects the pair with the highest association score as the translation. This method improves the precision of the translation dictionary, but when the training set is small, recall drops because of data sparseness.

Another problem in learning translation relationships between words concerns compounds and multi-word expressions. The prior art assumed that a lexical translation relationship involves only single words, which is clearly false for compounds. For example, "a pool of water" is the translation of "물웅덩이", so four words correspond to one word. One attempt to address this is Melamed's "Automatic Discovery of Non-Compositional Compounds in Parallel Data" (Conference on Empirical Methods in Natural Language Processing (EMNLP-97)). Melamed induced two translation models, a trial model that includes a candidate compound and a base model that does not. If the value of the objective function is higher for the trial model than for the base model, the compound is considered valid; otherwise it is considered invalid. However, because this method works by constructing trial translation models, it is very complex and computationally expensive.

By contrast, studies that use syntactic information have received less attention because of the cost of acquiring parsing knowledge and the low accuracy of parsers. Fixed-phrase translation techniques, however, have been studied relatively widely; they translate a specific kind of phrase (for example, noun phrases) on the assumption that the translation of a source-language phrase is a sequence of adjacent words in the target-language sentence. "Termight: Coordinating Humans and Machines in Bilingual Terminology Acquisition" by Dagan and Church (Machine Translation, 12:89-107, 1997) describes an aid for lexicographers that predicts technical terms in source-language sentences and proposes them to users; for the terms the users approve, it extracts possible translation pairs from a parallel corpus and proposes translations.

In addition, Robert Moore, in "Statistical Method and Apparatus for Learning Translation Relationships among Phrases" (Patent Publication No. 2004-0044176), takes as input aligned sentence pairs containing a source-language phrase identified as the one to be translated and generates candidate phrase translations in the target language. A score is computed for each candidate phrase; it comprises an inside component, based on associations between the words inside the source-language phrase and the target-language candidate phrase, and an outside component, based on associations between the words outside the source-language phrase and the target candidate phrase. In this way the method learns translation relationships between phrases and derives a translation dictionary. However, it sets up target phrase candidates only when the source phrase to be translated has already been fixed (captoids, i.e., word sequences made up of capitalized words), so it is hard to apply to a language such as Korean that has no letter case. Moreover, even for general phrase types, it generates translation candidates for a single phrase in the input sentence and computes association scores between the words inside and outside the source and target phrases, so the association scores of words outside the phrase gain undue influence and the accuracy of translation selection is low.

In Korean, one or more morphemes are agglutinated and inflected to form a word (eojeol). Using the eojeol as the basic unit of translation, as in the prior art described above, can cause a serious data sparseness problem in statistical approaches that rely on a training set. To address this, the present invention aims to mitigate data sparseness by using information segmented into morpheme units.

The present invention also uses part-of-speech or lemma information together with the morpheme in both the source and target languages. This makes it possible to distinguish homographs and differences in syntactic usage that can arise when only surface-form morpheme information is used, and to statistically identify and exploit translation relationships between parts of speech.

Furthermore, the present invention automatically expands the training set aligned in parallel at the sentence or phrase level, which is indispensable for building a statistical machine translation system, by using the phrase translation dictionary derived during learning. It also provides a mechanism by which a high-quality word translation dictionary, which can be used effectively to increase the efficiency and accuracy of unsupervised learning, is itself learned automatically during the learning process and reused.

According to one aspect of the present invention, an apparatus for unsupervised learning of translation relationships between words and phrases in a statistical machine translation system includes: a source language sentence preprocessor that receives a morpho-syntactically analyzed source-language sentence and converts it into a tokenized source-language sentence annotated with morpho-syntactic feature information; a target language sentence preprocessor that receives a morpho-syntactically analyzed target-language sentence and converts it into a tokenized target-language sentence annotated with morpho-syntactic feature information; an unsupervised learner that receives the tokenized source-language sentence and the tokenized target-language sentence and performs word alignment and phrase alignment on the tokenized sentence pair; and a learning termination condition checker that causes the unsupervised learner to repeat the word alignment and phrase alignment.

According to one aspect of the present invention, a method for unsupervised learning of translation relationships between words and phrases in a statistical machine translation system includes: receiving a morpho-syntactically analyzed source-language sentence and a morpho-syntactically analyzed target-language sentence; dividing each of the sentences into morphemes or words and attaching morpho-syntactic feature information to each morpheme or word; tokenizing the annotated morphemes or words as the basic units of translation; aligning the words of the tokenized source-language sentence to the words of the tokenized target-language sentence, and the words of the tokenized target-language sentence to the words of the tokenized source-language sentence, to obtain a plurality of sentence-level word alignment sets according to the morpho-syntactic feature information applied to the source language and to the target language; taking the word alignments common to the plurality of word alignment sets as the initial alignment, performing phrase alignment, computing translation scores for the content words of the unaligned source-language and target-language phrases, and realigning the words by selecting the word with the highest translation score; aligning one or more source phrases with one or more target phrases as translation phrases using the syntactic information; generating word and phrase translation dictionaries from the word realignment results and the phrase alignment results; and repeating the word alignment, word realignment, phrase alignment, and dictionary generation steps until the word and phrase translation dictionaries no longer change.

The present invention preferably uses Korean as the source language and English or Chinese as the target language. The following description, made with reference to the accompanying drawings, focuses on an embodiment in which Korean is the source language and English is the target language.

FIG. 1 is a schematic block diagram of an exemplary machine translation system to which the learning apparatus according to the present invention can be applied. The source language sentence preprocessor (101) receives a morpho-syntactically analyzed source-language sentence, and the target language sentence preprocessor (102) receives a morpho-syntactically analyzed target-language sentence. The source language sentence preprocessor (101) splits the input Korean into morphemes and attaches to each morpheme its lemma, its part of speech, its relative position within its base phrase, and syntactic information (the word with which it is in a dependency relation). Similarly, the target language sentence preprocessor (102) splits the input English into words and attaches to each word its lemma, its part of speech, and its relative position within its base phrase. The Korean and English sentence pairs annotated with this morpho-syntactic feature information are tokenized using one of the following as the basic unit of translation: morpheme with lemma, morpheme with part of speech, lemma with part of speech, lemma alone, or morpheme alone. The Korean and English sentences are each rebuilt accordingly, and a separate training set is constructed for each kind of token. In this way, Korean morphemes and English words annotated with lemma, part-of-speech, and syntactic information are treated as tokens, the basic units of translation, and the annotated Korean and English are given as input to the unsupervised learner (200). Because Korean forms a word (eojeol) by agglutinating and inflecting one or more morphemes, using morpheme-segmented information mitigates the data sparseness that can become serious when the eojeol is used as the unit. In addition, using part-of-speech or lemma information together with the morpheme in both Korean and English makes it possible to distinguish homographs and differences in syntactic usage that can arise when only surface-form morpheme information is used, and to statistically identify and exploit translation relationships between parts of speech.
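The construction of one training set per token type can be sketched as follows. The `Morph` record, the `TOKEN_TYPES` table, the "/" separator, and the Penn-style part-of-speech tags are all hypothetical names and conventions introduced for illustration; the description specifies only which feature combinations serve as tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morph:
    """One analyzed morpheme or word (hypothetical record layout)."""
    surface: str  # the morpheme or word as it appears in the sentence
    lemma: str    # base form
    pos: str      # part-of-speech tag


# One formatter per token type listed in the description.
TOKEN_TYPES = {
    "surface+lemma": lambda m: f"{m.surface}/{m.lemma}",
    "surface+pos": lambda m: f"{m.surface}/{m.pos}",
    "lemma+pos": lambda m: f"{m.lemma}/{m.pos}",
    "lemma": lambda m: m.lemma,
    "surface": lambda m: m.surface,
}


def build_training_views(sentence):
    """Re-tokenize one analyzed sentence once per token type."""
    return {name: [fmt(m) for m in sentence] for name, fmt in TOKEN_TYPES.items()}


english = [Morph("system", "system", "NN"), Morph("files", "file", "NNS")]
views = build_training_views(english)
```

Each entry of `views` would feed a separate word alignment run, so that the later steps can compare the alignments produced under different feature combinations.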

Referring to FIG. 2, the components of the unsupervised learner (200) are shown in more detail. The unsupervised learner (200) receives training sets from the source language sentence preprocessor (101) and the target language sentence preprocessor (102). The word aligner (201) trains IBM Models 1, 2, 3, and 4 in sequence on each received training set to obtain alignments of source-sentence words to target-sentence words, and likewise alignments of target-sentence words to source-sentence words. As a result, a plurality of sentence-level word alignment results is first obtained, one for each combination of the morphological feature information applied to the source language and to the target language.
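A minimal sketch of bidirectional word alignment is given below. It implements only IBM Model 1 trained with EM, without a NULL word, whereas the word aligner described above trains Models 1 through 4 in sequence; the function names and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t(target word | source word) with EM (IBM Model 1, no NULL word)."""
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    t_prob = defaultdict(lambda: 1.0 / len(tgt_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(t_prob[(t, s)] for s in src)
                for s in src:
                    frac = t_prob[(t, s)] / norm  # expected count of the link s -> t
                    count[(t, s)] += frac
                    total[s] += frac
        for (t, s), c in count.items():
            t_prob[(t, s)] = c / total[s]
    return t_prob


def align(src, tgt, t_prob):
    """Link each target position to its most probable source position."""
    return {
        (max(range(len(src)), key=lambda i: t_prob[(t, src[i])]), j)
        for j, t in enumerate(tgt)
    }


def bidirectional_alignments(pairs, iterations=10):
    """Return, per sentence pair, the source-to-target and target-to-source link sets."""
    fwd = train_ibm1(pairs, iterations)
    bwd = train_ibm1([(t, s) for s, t in pairs], iterations)
    results = []
    for src, tgt in pairs:
        s2t = align(src, tgt, fwd)
        # Flip the reverse links so both sets use (source index, target index).
        t2s = {(i, j) for (j, i) in align(tgt, src, bwd)}
        results.append((s2t, t2s))
    return results


pairs = [
    (["파일", "시스템"], ["file", "system"]),
    (["파일"], ["file"]),
    (["시스템"], ["system"]),
]
results = bidirectional_alignments(pairs)
```

Running this once per token type would give the plurality of alignment sets that the word realigner compares in the next step.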

The word realigner (202) receives the initially learned sets of sentence-level word alignments, one for each kind of morphological feature information. It takes the word alignments that these per-feature results have in common as the initial alignment, performs phrase alignment, and computes translation scores for realigning the content words of the source-language and target-language phrases that remain unaligned. The translation score is a weighted combination of the conditional probability that the source-language word translates into the target-language word, the conditional probability that the target-language word translates into the source-language word, and the mutual information (KL-divergence) between the two words. The conditional probabilities and the mutual information are computed separately for each token type (each combination of morphological feature information used) and combined with weights to yield the translation score of a target word for a source word. The word with the highest translation score is selected and the words are realigned. The word realignment result for each sentence pair is given as input to the phrase aligner (203).
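The realignment step can be sketched roughly as follows. The weights `alpha`, `beta`, `gamma`, the per-type weights, the use of pointwise mutual information as the mutual-information term, and all function names are assumptions; the description states only that the three quantities are computed per token type and combined with weights.

```python
import math


def common_links(alignment_sets):
    """Keep only the links on which every alignment set agrees."""
    return set.intersection(*map(set, alignment_sets))


def translation_score(stats_by_type, type_weights, alpha=0.4, beta=0.4, gamma=0.2):
    """Weighted combination of P(t|s), P(s|t) and pointwise mutual information.

    stats_by_type maps a token type to counts: joint, src, tgt, and corpus size n.
    All weights are illustrative assumptions.
    """
    score = 0.0
    for token_type, st in stats_by_type.items():
        p_t_given_s = st["joint"] / st["src"]
        p_s_given_t = st["joint"] / st["tgt"]
        pmi = math.log((st["joint"] / st["n"]) / ((st["src"] / st["n"]) * (st["tgt"] / st["n"])))
        score += type_weights[token_type] * (alpha * p_t_given_s + beta * p_s_given_t + gamma * pmi)
    return score


def realign(initial_links, unaligned_src, unaligned_tgt, score_fn):
    """Link each unaligned source content word to its best-scoring unaligned target word."""
    links = set(initial_links)
    for i in unaligned_src:
        if unaligned_tgt:
            links.add((i, max(unaligned_tgt, key=lambda j: score_fn(i, j))))
    return links


# Two alignment sets for the same sentence pair, from two token types.
sets = [{(0, 0), (1, 2), (2, 1)}, {(0, 0), (1, 3), (2, 1)}]
initial = common_links(sets)  # links both sets agree on
toy_scores = {(1, 2): 0.3, (1, 3): 0.8}  # hypothetical precomputed translation scores
final = realign(initial, [1], [2, 3], lambda i, j: toy_scores[(i, j)])

example = translation_score(
    {"lemma": {"joint": 8, "src": 10, "tgt": 8, "n": 100}},
    {"lemma": 1.0},
)
```

In this sketch the caller is expected to pass only content-word positions in `unaligned_src` and `unaligned_tgt`, matching the restriction to content words in the description.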

The phrase aligner (203) uses the syntactic information attached to each input sentence. One or more source phrases may be aligned with one or more target phrases as translation phrases, and translation relationships between a discontinuous phrase and a continuous phrase, or between two discontinuous phrases, are also possible.

The translation dictionary generator (204) generates a word translation dictionary and a phrase translation dictionary, each including frequency information, from the word realignment results of the word realigner (202) and the phrase alignment results of the phrase aligner (203). The frequency information in the word translation dictionary comprises the frequency of the source-language word, the frequency of the target-language word, and the co-occurrence frequency of the two words; the frequency information of the phrase translation dictionary is built in the same way. The generated dictionaries are not used as they are. They pass through a filtering step based on a reliability measure, so that only translation pairs at or above a given reliability remain and pairs below the reliability threshold are removed. The resulting word translation dictionary is reused to speed up word alignment learning and improve its accuracy. The resulting phrase translation dictionary is reused to expand the training set for learning the statistical translation model through word and phrase alignment, and to restrict the range of word alignment, which improves alignment speed and accuracy.
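Dictionary generation with frequency information and reliability filtering might look like the sketch below. The description does not define the reliability measure, so the Dice coefficient, the threshold of 0.5, the entry field names, and the choice to count co-occurrence over aligned links are assumptions.

```python
from collections import Counter


def build_dictionary(aligned_pairs, min_reliability=0.5):
    """Collect frequency information per translation pair and drop unreliable pairs.

    aligned_pairs holds (source tokens, target tokens, links) triples, where links
    is a set of (source index, target index) pairs.
    """
    src_freq, tgt_freq, co_freq = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens, links in aligned_pairs:
        src_freq.update(src_tokens)
        tgt_freq.update(tgt_tokens)
        co_freq.update((src_tokens[i], tgt_tokens[j]) for i, j in links)
    dictionary = {}
    for (s, t), c in co_freq.items():
        reliability = 2 * c / (src_freq[s] + tgt_freq[t])  # Dice coefficient (assumed)
        if reliability >= min_reliability:
            dictionary[(s, t)] = {
                "src_freq": src_freq[s],
                "tgt_freq": tgt_freq[t],
                "co_freq": c,
                "reliability": reliability,
            }
    return dictionary


aligned = [
    (["파일", "시스템"], ["file", "system"], {(0, 0), (1, 1)}),
    (["파일"], ["file"], {(0, 0)}),
    (["시스템", "파일"], ["system", "files"], {(0, 0), (1, 0)}),  # second link is an alignment error
]
dictionary = build_dictionary(aligned)
```

Here the erroneous pair "파일"/"system" is supported by a single link and falls below the threshold, so only the two correct pairs survive filtering.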

The learning termination condition checker (210) causes word alignment, word realignment, phrase alignment, and the acquisition of the word and phrase translation dictionaries to be repeated on the training set until the termination condition is satisfied, namely that the dictionaries obtained from the training set no longer change. This repetition improves both the size and the quality of the dictionaries. Once the dictionaries stop changing, the statistical machine translation model parameter extractor (300) extracts the parameters of the statistical translation model from the word and phrase alignments obtained by the learning. The statistical machine translation model is then used in the decoder (500), together with a separately trained language model (400), to generate a target-language sentence from an input source-language sentence.
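The termination check amounts to a fixed-point loop over the four steps. In the sketch below the four steps are passed in as callables with hypothetical signatures, and `max_rounds` is a safety cap added for illustration that is not part of the description; the stub callables exist only to make the example runnable.

```python
def learn_until_stable(training_set, align, realign, align_phrases, build_dictionary, max_rounds=50):
    """Repeat the four learning steps until the dictionary stops changing.

    Returns the final dictionary and the number of rounds executed.
    """
    dictionary = {}
    for round_no in range(1, max_rounds + 1):
        word_links = align(training_set, dictionary)  # may reuse the previous dictionary
        word_links = realign(training_set, word_links)
        phrase_links = align_phrases(training_set, word_links)
        new_dictionary = build_dictionary(word_links, phrase_links)
        if new_dictionary == dictionary:  # termination condition: no change
            return dictionary, round_no
        dictionary = new_dictionary
    return dictionary, max_rounds


# Stub steps: the "alignment" is just a number that grows until it saturates at 3.
final_dictionary, rounds = learn_until_stable(
    training_set=None,
    align=lambda ts, d: min(len(d) + 1, 3),
    realign=lambda ts, w: w,
    align_phrases=lambda ts, w: w,
    build_dictionary=lambda w, p: {k: k for k in range(w)},
)
```

With these stubs the dictionary grows for three rounds and the fourth round detects that nothing changed, which is when parameter extraction would begin.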

The word and phrase realignment and the word and phrase translation dictionary generation through iterative unsupervised learning provided by the present invention combine various kinds of morpho-syntactic feature information to make realignment decisions, and generate the dictionaries using a statistically based reliability measure. This improves the accuracy of word and phrase alignment over the prior art. In addition, unlike existing methods, phrase alignment uses phrase boundary information and is based on the word alignments on which multiple word alignment results agree, which improves the accuracy of phrase alignment.

According to the present invention, the word and phrase translation dictionaries that seed the learning are not built by hand but are acquired automatically during learning, and translation entries below the reliability threshold are filtered out. This minimizes the errors that arise in unsupervised learning, increases learning efficiency, and allows the translation knowledge needed for automatic translation to be acquired effectively. The acquired dictionaries are useful translation knowledge that can be used not only in statistical translation systems but also in rule-based and pattern-based machine translation systems. Furthermore, statistical machine translation systems require large sentence-aligned and phrase-aligned corpora, which are time-consuming and costly to build. By feeding the high-quality phrase translation information obtained during iterative unsupervised learning back into the training set, the present invention gradually enlarges the training set and mitigates the data sparseness problem that arises in unsupervised learning.

Claims

An apparatus for unsupervised learning of translation relationships between words and phrases in a statistical machine translation system, comprising:

a source language sentence preprocessor that receives a morpho-syntactically analyzed source-language sentence and converts it into a tokenized source-language sentence annotated with morpho-syntactic feature information;

a target language sentence preprocessor that receives a morpho-syntactically analyzed target-language sentence and converts it into a tokenized target-language sentence annotated with morpho-syntactic feature information;

an unsupervised learner that receives the tokenized source-language sentence and the tokenized target-language sentence and performs word alignment and phrase alignment on the tokenized sentence pair; and

a learning termination condition checker that causes the unsupervised learner to repeat the word alignment and phrase alignment.

The apparatus of claim 1,

wherein the source language is Korean and the target language is English or Chinese, and the morpho-syntactic feature information includes the lemma of a morpheme or word, its part of speech, its relative position within a base phrase, and syntactic information.

The apparatus of claim 2,

wherein the unsupervised learner includes a word aligner that aligns the words of the tokenized source-language sentence to the words of the tokenized target-language sentence, and the words of the tokenized target-language sentence to the words of the tokenized source-language sentence, to obtain a plurality of sentence-level word alignment sets according to the morpho-syntactic feature information applied to the source language and the morpho-syntactic feature information applied to the target language.

The apparatus of claim 3,

wherein the autonomous learner further includes a word realigner that takes the word alignment result common to the plurality of word alignment sets as an initial alignment value to perform phrase alignment, calculates translation scores for the content words of the unaligned source language and target language phrases, and realigns the words by selecting the word having the highest translation score.
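The word realignment recited above can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `common_links` and `realign`, the representation of an alignment as a set of index pairs, the `score` lookup table, and the `is_content` predicate are all assumptions introduced for illustration, and the phrase alignment step is omitted.

```python
from typing import Callable, Dict, List, Set, Tuple

Link = Tuple[int, int]  # (source token index, target token index); hypothetical representation


def common_links(alignment_sets: List[Set[Link]]) -> Set[Link]:
    """Return the links shared by every word alignment set (the initial alignment)."""
    if not alignment_sets:
        return set()
    return set.intersection(*alignment_sets)


def realign(
    src: List[str],
    tgt: List[str],
    initial: Set[Link],
    score: Dict[Tuple[str, str], float],  # assumed translation-score table
    is_content: Callable[[str], bool],    # assumed content-word test
) -> Set[Link]:
    """Link each unaligned source content word to its best-scoring unaligned target content word."""
    aligned_src = {i for i, _ in initial}
    aligned_tgt = {j for _, j in initial}
    result = set(initial)
    for i, s in enumerate(src):
        if i in aligned_src or not is_content(s):
            continue
        candidates = [
            (score.get((s, t), 0.0), j)
            for j, t in enumerate(tgt)
            if j not in aligned_tgt and is_content(t)
        ]
        if not candidates:
            continue
        best_score, best_j = max(candidates)
        if best_score > 0.0:  # skip words with no scored candidate
            result.add((i, best_j))
            aligned_tgt.add(best_j)
    return result
```

Starting from the links on which all alignment sets agree keeps the initial alignment conservative; the scoring pass then only has to decide the words the alignment sets left unaligned or disagreed on.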

The apparatus of claim 4,

wherein the autonomous learner further includes a phrase aligner that uses the phrase information to align one or more source phrases with one or more target phrases as translation phrases.

The apparatus of claim 5,

wherein the autonomous learner further includes a translation dictionary generator that generates a word and phrase translation dictionary from the word realignment result and the phrase alignment result.

The apparatus of claim 6,

wherein the translation dictionary generator performs filtering based on a reliability measurement, so that only translation pairs having at least a predetermined reliability remain in the word and phrase translation dictionary and translation pairs below the threshold reliability are removed from the word and phrase translation dictionary.
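The reliability filtering recited above could look like the following Python sketch. The claim does not specify how reliability is measured, so the relative frequency used here, the function name `build_dictionary`, and the default threshold of 0.3 are assumptions chosen only to make the example concrete.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_dictionary(
    aligned_pairs: List[Tuple[str, str]],
    threshold: float = 0.3,  # assumed value; the claim only requires "a predetermined reliability"
) -> Dict[str, Dict[str, float]]:
    """Keep only translation pairs whose reliability reaches the threshold."""
    pair_counts = Counter(aligned_pairs)
    src_counts = Counter(s for s, _ in aligned_pairs)
    dictionary: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (s, t), count in pair_counts.items():
        # Assumed reliability measure: how often s was aligned to t,
        # relative to all alignments observed for s.
        reliability = count / src_counts[s]
        if reliability >= threshold:
            dictionary[s][t] = reliability
    return dict(dictionary)
```

With three observations of one translation and a single observation of another for the same source word, the rare pair has a relative frequency of 0.25 and is dropped at the assumed threshold.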

The apparatus of claim 7,

wherein the learning termination condition checker causes the autonomous learner to repeat the word alignment, word realignment, phrase alignment, and word and phrase translation dictionary generation until the word and phrase translation dictionary no longer changes.

The apparatus of claim 1,

further comprising a statistical machine translation model parameter extractor that extracts parameters for a statistical translation model from the word alignment result and the phrase alignment result produced by the autonomous learner.

A method for autonomously learning translation relationships between words and phrases in a statistical machine translation system, the method comprising:

(a) receiving a morphologically and syntactically analyzed source language sentence and a morphologically and syntactically analyzed target language sentence;

(b) converting the analyzed source language sentence into a tokenized source language sentence to which morpho-syntactic feature information is attached, and converting the analyzed target language sentence into a tokenized target language sentence to which morpho-syntactic feature information is attached;

(c) performing word alignment and phrase alignment on the pair of the tokenized source language sentence and the tokenized target language sentence; and

(d) generating a word and phrase translation dictionary from the word alignment result and the phrase alignment result.

The method of claim 10,

wherein step (b) includes dividing each of the analyzed source language sentence and the analyzed target language sentence into morphemes or words and attaching morpho-syntactic feature information to each morpheme or word; and

tokenizing each morpheme or word to which the morpho-syntactic feature information is attached as a basic unit of translation.
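As an illustration of the tokenization step, the Python sketch below attaches the feature types named in the claims (base form, part of speech, relative position within a base phrase, and phrase label) to each morpheme or word. The `Token` class, the `B`/`I`/`E`/`S` position codes, the slash-separated string form, and the input layout are assumptions, not details taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    lemma: str     # morpheme or base form of the word
    pos: str       # part of speech
    position: str  # assumed codes: B(egin), I(nside), E(nd), S(ingle) within the base phrase
    phrase: str    # label of the enclosing base phrase

    def __str__(self) -> str:
        # Assumed surface form of a token used as the basic unit of translation.
        return f"{self.lemma}/{self.pos}/{self.position}/{self.phrase}"


def tokenize(chunks: List[Tuple[str, List[Tuple[str, str]]]]) -> List[Token]:
    """Turn analyzed base phrases [(label, [(lemma, pos), ...]), ...] into feature-bearing tokens."""
    tokens: List[Token] = []
    for label, words in chunks:
        last = len(words) - 1
        for k, (lemma, pos) in enumerate(words):
            if last == 0:
                position = "S"
            elif k == 0:
                position = "B"
            elif k == last:
                position = "E"
            else:
                position = "I"
            tokens.append(Token(lemma, pos, position, label))
    return tokens
```

Encoding the features in the token itself means that two occurrences of the same surface word with different parts of speech or phrase roles are treated as distinct units during alignment.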

The method of claim 11,

wherein step (c) includes aligning the words of the tokenized source language sentence to the words of the tokenized target language sentence and the words of the tokenized target language sentence to the words of the tokenized source language sentence, thereby obtaining, for each sentence, a plurality of word alignment sets according to the morpho-syntactic feature information applied to the source language and the morpho-syntactic feature information applied to the target language;

taking the word alignment result common to the plurality of word alignment sets as an initial alignment value to perform phrase alignment, calculating translation scores for the content words of the unaligned source language and target language phrases, and realigning the words by selecting the word having the highest translation score; and

using the phrase information to align one or more source phrases with one or more target phrases as translation phrases.

The method of claim 12,

wherein step (d) includes keeping in the word and phrase translation dictionary only translation pairs having at least a predetermined reliability and removing translation pairs below the threshold reliability from the word and phrase translation dictionary.

The method of claim 13,

further comprising repeating steps (c) and (d) until the word and phrase translation dictionary no longer changes.
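The termination condition of repeating steps (c) and (d) until the dictionary stops changing can be sketched in Python as a fixed-point loop. The function name `learn_until_stable`, the callable parameters `align` and `build_dictionary`, and the `max_iterations` safety cap are assumptions; the claims state only the no-change condition.

```python
from typing import Any, Callable, Dict, Tuple


def learn_until_stable(
    corpus: Any,
    align: Callable[[Any, Dict], Any],         # stands in for step (c); hypothetical interface
    build_dictionary: Callable[[Any], Dict],   # stands in for step (d); hypothetical interface
    max_iterations: int = 50,                  # assumed safety cap, not part of the claims
) -> Tuple[Dict, int]:
    """Repeat alignment and dictionary generation until the dictionary no longer changes."""
    dictionary: Dict = {}
    for iteration in range(1, max_iterations + 1):
        alignments = align(corpus, dictionary)
        new_dictionary = build_dictionary(alignments)
        if new_dictionary == dictionary:
            # Termination condition: no further change in the dictionary.
            return dictionary, iteration
        dictionary = new_dictionary
    return dictionary, max_iterations
```

Passing the current dictionary back into `align` reflects the idea that each round of alignment can use the translation pairs learned in the previous round.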