KR20170108621A

KR20170108621A - Statistical and learning translation apparatus using monolingual corpus

Info

Publication number: KR20170108621A
Application number: KR1020160032815A
Authority: KR
Inventors: 권오욱
Original assignee: 한국전자통신연구원 (Electronics and Telecommunications Research Institute, ETRI)
Priority date: 2016-03-18
Filing date: 2016-03-18
Publication date: 2017-09-27

Abstract

A statistical and learning-based translation apparatus using monolingual corpora is disclosed. The apparatus includes a translation decoder unit that generates target-language words or phrases, in source-language word order, that best fit the context of a source-language input sentence, and a monolingual learning-based translation unit that turns the sequence of target-language words and phrases in source-language word order produced by the translation decoder unit into a translation sentence with target-language word order and expressions. An optimal translation result can thereby be provided.

Description

STATISTICAL AND LEARNING TRANSLATION APPARATUS USING MONOLINGUAL CORPUS

The present invention relates to a statistical and learning-based translation apparatus using monolingual corpora, and more particularly to a statistical and learning-based translation apparatus that, without a bilingual parallel corpus, uses a bilingual dictionary together with monolingual corpora of the two languages, which are easy to obtain in large volume and keep growing.

Translation devices currently on the market use rule-based machine translation (RBMT) and statistical machine translation (SMT) methods.

Recently, statistical methods, which learn (compile statistics on) the translation relation between two languages from a bilingual parallel corpus translated in advance by people, have come to be used more widely than rule-based methods, which require translation experts to build the translation knowledge, because of their advantages in development cost and time. In particular, methods that learn the translation relation between two languages by applying deep learning to a bilingual parallel corpus have also appeared.

Conventional statistical and learning-based translation requires a large bilingual parallel corpus for high-quality translation. For many language pairs and specific domains, however, a bilingual parallel corpus is unavailable or insufficient. Statistical translation typically uses a bilingual parallel corpus of at least about 2 million parallel sentences and as many as tens of millions or more. Building such a corpus requires people to professionally translate and review a large volume of sentences, which takes a great deal of cost and time.

To overcome the problems described above, an object of the present invention is to provide a statistical and learning-based translation apparatus that uses monolingual corpora, which can easily be obtained in large volume, and a bilingual translation dictionary, thereby solving the problems of the conventional statistical and learning-based translation techniques.

To achieve the above object, a statistical and learning-based translation apparatus using monolingual corpora according to an embodiment of the present invention includes a translation decoder unit that generates target-language words or phrases, in source-language word order, that best fit the context of a source-language input sentence, and a monolingual learning-based translation unit that turns the sequence of target-language words and phrases in source-language word order produced by the translation decoder unit into a translation sentence with target-language word order and expressions.

According to the statistical and learning-based translation apparatus using monolingual corpora described above, a statistical and learning-based translation apparatus can be built from large monolingual corpora, which are easy to obtain, without a bilingual parallel corpus, which is difficult to secure in large volume.

In addition, unlike a bilingual parallel corpus, which must be built deliberately, monolingual corpora keep growing naturally and quickly reflect recent language phenomena, so performance can be improved continuously and contemporary usage can be reflected on an ongoing basis.

In addition, because a translation dictionary can be built, user-specific and domain-specific dictionaries can be introduced, so the apparatus can easily be adapted to a particular user or field.

FIG. 1 is a block diagram of a statistical and learning-based translation apparatus using monolingual corpora according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the detailed configuration of the word-order conversion unit of FIG. 1.
FIG. 3 is a flowchart illustrating a statistical and learning-based translation method using monolingual corpora according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a method of generating a target-language corpus in source-language word order from a target-language corpus according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail.

This is not intended to limit the present invention to particular embodiments, however, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope.

The terms used in this application are used only to describe particular embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, elements, components, or combinations thereof described in the specification, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof. In this application, the term "connect" should be understood to cover not only a physical connection between the elements described in the specification but also an electrical connection, a network connection, and the like.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To aid overall understanding, the same reference numerals are used for the same elements in the drawings, and duplicate descriptions of the same elements are omitted.

FIG. 1 is a block diagram of a statistical and learning-based translation apparatus using monolingual corpora according to an embodiment of the present invention.

Referring to FIG. 1, the statistical and learning-based translation apparatus using monolingual corpora of the present invention includes a translation decoder unit (translation decoder) 101, a translation dictionary 102, a source-ordered target-language knowledge DB 103, a source-ordered target-language knowledge learning unit 104, a monolingual learning-based translation unit (learning based monolingual translation) 105, a monolingual translation knowledge DB 106, a monolingual translation knowledge learning unit 107, a target-language corpus 108, a source-ordered target-language corpus 109, and a word-order conversion unit (ordering transfer) 110. Here, "source-ordered target language" means target-language text arranged in source-language word order.

The translation decoder unit 101 receives an input sentence in the source language and, using the translation dictionary 102 and the source-ordered target-language knowledge DB 103, selects for the words and phrases of the input sentence the target-language words or phrases that best fit the context of the input sentence.

To find the translation-equivalent words or phrases that best fit the context of the input sentence, the translation decoder unit 101 uses the occurrence probabilities of target-language words and phrases in the source-ordered target-language knowledge DB 103, together with a language model that represents the surrounding context in the target language. It selects the combination of translation-equivalent words or phrases for the input sentence that is most likely to occur as target language in source-language word order.
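The selection step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes a bigram language model with add-one smoothing and an exhaustive search over candidate combinations, and the names `train_bigram_counts`, `log_prob`, and `decode`, as well as the toy dictionary and corpus, are hypothetical.

```python
import itertools
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count unigrams and bigrams over a source-ordered target-language corpus."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a token sequence (assumed model)."""
    vocab = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
        for prev, cur in zip(padded, padded[1:])
    )


def decode(source_tokens, dictionary, unigrams, bigrams):
    """Pick the combination of dictionary candidates that the model scores highest.

    Source-language order is kept; words missing from the dictionary pass through.
    """
    candidate_lists = [dictionary.get(tok, [tok]) for tok in source_tokens]
    best = max(
        itertools.product(*candidate_lists),
        key=lambda combo: log_prob(list(combo), unigrams, bigrams),
    )
    return list(best)


# Toy data (hypothetical): English tokens arranged in Korean word order.
corpus = [["I", "bank", "went"], ["he", "bank", "went"], ["I", "shore", "walked"]]
dictionary = {"나는": ["I"], "은행에": ["bank", "shore"], "갔다": ["went"]}
unigrams, bigrams = train_bigram_counts(corpus)
print(decode(["나는", "은행에", "갔다"], dictionary, unigrams, bigrams))
# -> ['I', 'bank', 'went']
```

Exhaustive search is used only to keep the sketch short; with many candidates per word, a dynamic-programming or beam search over the same scores would be the usual choice.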

The translation dictionary 102 may be a bilingual dictionary built by people, or one extracted automatically from bilingual parallel corpora, monolingual corpora, various bilingual dictionaries, and the like. In the translation dictionary 102, a single source-language word or phrase may have multiple translation-equivalent words or phrases.

The source-ordered target-language knowledge DB 103 stores the occurrence frequencies of words and of source-ordered target-language phrases, along with language model information, which is n-gram information on the target language in source-language word order. This information is produced by the source-ordered target-language knowledge learning unit 104 from the source-ordered target-language corpus 109, which the word-order conversion unit 110 generates by converting the target-language corpus 108 into source-language word order.

The source-ordered target-language knowledge learning unit 104 generates, from the source-ordered target-language corpus 109 that the word-order conversion unit 110 produces from the target-language corpus 108, the occurrence frequencies of words and of source-ordered target-language phrases, and the language model information, which is n-gram information on the target language in source-language word order.

The monolingual learning-based translation unit 105 uses the monolingual translation knowledge DB 106 to turn the sequence of target-language words and phrases in source-language word order produced by the translation decoder unit 101 into a translation sentence with natural target-language word order and expressions.

The monolingual learning-based translation unit 105 takes the source-ordered target-language sentence that the translation decoder unit 101 generates for the input sentence and, using the phrase conversion probabilities and the target-language-order language model information in the monolingual translation knowledge DB 106, generates the translation sentence best suited to target-language word order and expression. The monolingual learning-based translation unit 105 works in the same way as existing statistical and learning-based translation methods, with one difference: the source-language side of the bilingual parallel corpus used by those methods is replaced by target language in source-language word order.

The monolingual learning-based translation unit 105 uses translation knowledge learned from the target-language corpus and the source-ordered target-language corpus.

The monolingual translation knowledge DB 106 stores the information that the monolingual learning-based translation unit 105 needs in order to turn the sequence of target-language words and phrases in source-language word order produced by the translation decoder unit 101 into a translation sentence with natural target-language word order and expressions.

The monolingual translation knowledge DB 106 stores phrase conversion information and target-language language model information extracted by the monolingual translation knowledge learning unit 107, which treats the target-language corpus 108 and the source-ordered target-language corpus 109 as a parallel corpus and applies the same procedure as existing statistical and learning-based translation methods.

The monolingual translation knowledge learning unit 107 receives the target-language corpus 108 and the source-ordered target-language corpus 109, which the word-order conversion unit 110 generates by converting the target-language corpus 108 into source-language word order. From these it produces the information that the monolingual learning-based translation unit 105 needs in order to turn the sequence of target-language words and phrases in source-language word order produced by the translation decoder unit 101 into a translation sentence with natural target-language word order and expressions.
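The pairing of the two corpora can be illustrated with a short sketch. It assumes that each target-language sentence is passed through some reordering function and paired with its original, so that the reordered text plays the role that the source-language side plays in a conventional parallel corpus. The names `build_pseudo_parallel` and `move_second_to_end` are hypothetical, and the toy reordering rule merely stands in for the word-order conversion unit 110.

```python
def build_pseudo_parallel(target_corpus, reorder_fn):
    """Pair each reordered variant of a sentence with the original sentence.

    reorder_fn returns a list of variants, since one sentence may yield more
    than one source-ordered form. Each pair is (source-ordered, target-ordered).
    """
    pairs = []
    for sentence in target_corpus:
        for variant in reorder_fn(sentence):
            pairs.append((variant, sentence))
    return pairs


def move_second_to_end(tokens):
    """Toy stand-in for word-order conversion: move the second token to the end."""
    if len(tokens) < 3:
        return [list(tokens)]
    return [tokens[:1] + tokens[2:] + tokens[1:2]]


pairs = build_pseudo_parallel([["I", "ate", "apples"]], move_second_to_end)
print(pairs)
# -> [(['I', 'apples', 'ate'], ['I', 'ate', 'apples'])]
```

The resulting pairs could then be fed to any standard phrase-based or neural training pipeline in place of a human-translated parallel corpus.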

The word-order conversion unit 110 converts the target-language corpus 108 into source-language word order to generate the source-ordered target-language corpus. The source-ordered target-language corpus is arranged in the word order of the source-language syntactic relations closest to the syntactic relations of the target language. This syntactic word-order information is produced by the word-order conversion unit 110 from the results of parsing the target-language and source-language corpora.

If the word orders of the source language and the target language are very similar, the source-ordered target-language corpus 109 will be nearly identical to the target-language corpus 108, so the target-language corpus 108 can be used in place of the source-ordered target-language corpus 109. In that case the monolingual learning-based translation unit 105 and the word-order conversion unit 110 would have little effect, so both can be omitted from the configuration.

FIG. 2 is a block diagram showing the detailed configuration of the word-order conversion unit of FIG. 1.

Referring to FIG. 2, the word-order conversion unit 110 of the present invention includes a source-language corpus 111, a source-language morpheme part-of-speech tagging unit 112, a source-language dependency parsing unit 113, a source-language word-order DB 114, a target-language morpheme part-of-speech tagging unit 115, a target-language dependency parsing unit 116, and a source-language word-order adjustment unit 117.

The source-language morpheme part-of-speech tagging unit 112 performs source-language morpheme part-of-speech tagging on the sentences of the source-language corpus 111.

The source-language dependency parsing unit 113 performs source-language dependency parsing using the tagging information from the source-language morpheme part-of-speech tagging unit 112, so that the head part of speech, the dependent part of speech, and the relative order of head and dependent for each syntactic relation are stored as source-language word-order information.

The source-language word-order DB 114 stores, for each head part of speech, dependent part of speech, and syntactic relation, the word-order information of head and dependent. This information is obtained by tagging the sentences of the source-language corpus 111 in the source-language morpheme part-of-speech tagging unit 112 and then parsing them in the source-language dependency parsing unit 113 using that tagging information. Where the target-language part-of-speech classes and syntactic relation types do not match those of the source language, mapping information between the two is kept so that entries can still be stored and looked up.
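One way to picture the contents of such a word-order DB is as a table of counts keyed by (head part of speech, dependent part of speech, relation). The sketch below is an assumption-laden illustration: the `Token` structure, the `build_order_db` function, and the Universal Dependencies style labels (`VERB`, `NOUN`, `nsubj`, `obj`) are hypothetical and are not specified in the source, and the tagset mapping mentioned above is left out.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # surface form
    pos: str    # part-of-speech tag
    head: int   # index of the head token in the sentence, -1 for the root
    rel: str    # dependency relation to the head


def build_order_db(parsed_sentences):
    """Count, per (head POS, dependent POS, relation), how often the head
    precedes or follows its dependent in the parsed source-language corpus."""
    db = defaultdict(lambda: {"head_first": 0, "dep_first": 0})
    for sent in parsed_sentences:
        for i, tok in enumerate(sent):
            if tok.head < 0:
                continue  # the root has no head to be ordered against
            key = (sent[tok.head].pos, tok.pos, tok.rel)
            side = "head_first" if tok.head < i else "dep_first"
            db[key][side] += 1
    return dict(db)


# Toy parsed Korean sentence (hypothetical annotation): "나는 사과를 먹었다".
korean = [
    Token("나는", "NOUN", 2, "nsubj"),
    Token("사과를", "NOUN", 2, "obj"),
    Token("먹었다", "VERB", -1, "root"),
]
order_db = build_order_db([korean])
print(order_db[("VERB", "NOUN", "obj")])
# -> {'head_first': 0, 'dep_first': 1}
```

In this toy case the verb follows both of its dependents, which is what the counts record for a verb-final source language.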

The target-language morpheme part-of-speech tagging unit 115 receives the target-language sentences of the target-language corpus 108, splits them into words, and assigns the best morpheme part of speech to each word.

The target-language dependency parsing unit 116 uses the tagging results from the target-language morpheme part-of-speech tagging unit 115 to find each grammatical head and its dependents and to set the grammatical relation between them.

The source-language word-order adjustment unit 117 looks up, in the source-language word-order DB 114, the source-language dependency relation most similar to each analyzed target-language dependency relation (head part of speech, dependent part of speech, and grammatical relation). It then adjusts the order of the target-language head and dependent to match the order of head and dependent in the source language, thereby generating the sentences of the source-ordered target-language corpus 109.

The source-language word-order adjustment unit 117 decides whether a head comes before or after its dependent by following the information in the source-language word-order DB 114, which records the position used in the source language for the same head part of speech, dependent part of speech, and syntactic relation. If, for the same head part of speech, dependent part of speech, and syntactic relation in the source language, the likelihood of the head preceding the dependent differs only slightly from the likelihood of it following, both orderings are generated. Dependents that take the same position relative to the same head keep their target-language order among themselves.
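The reordering rule can be sketched as a recursive linearization of the target-language dependency tree. This is a simplified illustration under stated assumptions: the `Token` structure, the `reorder` function, the shape of `order_db`, and the Universal Dependencies style labels are all hypothetical; the lookup is by exact key rather than by "most similar" relation; and the near-tie case, where the source describes generating both orderings, is reduced here to keeping the target-language order.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # surface form
    pos: str    # part-of-speech tag
    head: int   # index of the head token in the sentence, -1 for the root
    rel: str    # dependency relation to the head


def reorder(sent, order_db):
    """Linearize a target-language dependency tree in source-language order.

    Assumes exactly one root. order_db maps (head POS, dependent POS, relation)
    to {"head_first": n, "dep_first": m} counts from the source language.
    """
    children = {i: [] for i in range(len(sent))}
    root = None
    for i, tok in enumerate(sent):
        if tok.head < 0:
            root = i
        else:
            children[tok.head].append(i)  # kept in target-language order

    def head_first(h, d):
        counts = order_db.get((sent[h].pos, sent[d].pos, sent[d].rel))
        if counts is None or counts["head_first"] == counts["dep_first"]:
            return h < d  # unseen or tied: keep the target-language order
        return counts["head_first"] > counts["dep_first"]

    def linearize(h):
        before = [d for d in children[h] if not head_first(h, d)]
        after = [d for d in children[h] if head_first(h, d)]
        out = []
        for d in before:
            out.extend(linearize(d))
        out.append(h)
        for d in after:
            out.extend(linearize(d))
        return out

    return [sent[i].form for i in linearize(root)]


# Toy example (hypothetical annotation): English reordered to verb-final order.
english = [
    Token("I", "PRON", 1, "nsubj"),
    Token("ate", "VERB", -1, "root"),
    Token("apples", "NOUN", 1, "obj"),
]
order_db = {
    ("VERB", "PRON", "nsubj"): {"head_first": 0, "dep_first": 5},
    ("VERB", "NOUN", "obj"): {"head_first": 1, "dep_first": 9},
}
print(reorder(english, order_db))
# -> ['I', 'apples', 'ate']
```

Because each head's dependents are collected in their original order and only split into a "before" and an "after" group, dependents on the same side of a head keep their target-language order, as the text requires.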

FIG. 3 is a flowchart illustrating a statistical and learning-based translation method using monolingual corpora according to an embodiment of the present invention.

Referring to FIG. 3, the statistical and learning-based translation apparatus using monolingual corpora of the present invention receives an input sentence in the source language (S310).

Next, the words and phrases of the source-language input sentence are looked up in the translation dictionary to retrieve translation-equivalent words or phrases in the target language (S320). In the translation dictionary, a single source-language word or phrase may have multiple translation-equivalent words or phrases.

Next, from the multiple target-language words or phrases retrieved from the translation dictionary, the target-language words or phrases that best fit the context of the input sentence are selected (S330). To find them, the occurrence probabilities of target-language words and phrases in the source-ordered target-language knowledge DB are used together with a language model that represents the surrounding context in the target language, and the combination of translation-equivalent words or phrases for the input sentence that is most likely to occur as target language in source-language word order is selected.

Next, the sequence of target-language words and phrases in source-language word order is turned into a translation sentence with natural target-language word order and expressions (S340).

FIG. 4 is a flowchart illustrating a method of generating a target-language corpus in source-language word order from a target-language corpus according to an embodiment of the present invention.

Referring to FIG. 4, the word-order conversion unit of the present invention receives the target-language sentences of the target-language corpus, splits them into words, and tags each word with its best morpheme part of speech (S410).

Next, the part-of-speech tagging results are used to find each grammatical head and its dependents and to set the grammatical relation between them (S420).

Next, for each analyzed target-language dependency relation (head part of speech, dependent part of speech, and grammatical relation), the most similar source-language dependency relation is retrieved (S430).

Next, the order of each target-language head and dependent is adjusted to match the order of head and dependent in the source language, generating the sentences of the source-ordered target-language corpus (S440). Whether a head comes before or after its dependent can be decided by following the information in the source-language word-order DB, which records the position used in the source language for the same head part of speech, dependent part of speech, and syntactic relation. If, for the same head part of speech, dependent part of speech, and syntactic relation in the source language, the likelihood of the head preceding the dependent differs only slightly from the likelihood of it following, both orderings are generated. Dependents that take the same position relative to the same head keep their target-language order among themselves.

Although the invention has been described with reference to the embodiments above, those skilled in the art will understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the following claims.

101: Translation decoder unit
102: Translation dictionary
103: Source-ordered target-language knowledge DB
104: Source-ordered target-language knowledge learning unit
105: Monolingual learning-based translation unit
106: Monolingual translation knowledge DB
107: Monolingual translation knowledge learning unit
108: Target-language corpus
109: Source-ordered target-language corpus
110: Word-order conversion unit

Claims

A statistical and learning-based translation apparatus using a monolingual corpus, comprising:
a translation decoder unit that generates target-language words or phrases, in source-language word order, that best fit the context of a source-language input sentence; and
a monolingual learning-based translation unit that turns the sequence of target-language words and phrases in source-language word order generated by the translation decoder unit into a translation sentence with target-language word order and expressions.