KR100512541B1

KR100512541B1 - Machine translation apparatus, system, and method

Info

Publication number: KR100512541B1
Application number: KR10-2000-0010247A
Authority: KR
Inventors: 김영택; 장정호; 김성동
Original assignee: 김영택; 김성동; 장정호
Priority date: 2000-02-29
Filing date: 2000-02-29
Publication date: 2005-09-06
Also published as: KR20010103151A

Abstract

The present invention provides a machine translation method comprising: an input step of receiving an input sentence written in a source language; a lexical analysis step of morphologically analyzing the input sentence; a syntactic analysis step of determining the syntactic structure of the input sentence; a semantic analysis step of analyzing the meaning of the input sentence; a target sentence generation step of generating an output sentence translated into a target language based on the syntactic analysis and the semantic analysis; and an output step of outputting the output sentence. After the lexical analysis step, the method further comprises: a rule application step of applying sentence segmentation rules to the morphologically analyzed input sentence; a position selection step of selecting candidate segmentation positions by applying the rules; an interpretation step of obtaining a probability value for each selected position from a probability distribution; and a decision step of determining the sentence segmentation based on the probability values. The invention also provides a machine translation apparatus and system that use this method.

Description

Machine translation apparatus, system, and method

The present invention relates to a machine translation apparatus, system, and method that translate an input sentence written in a source language (first language) and output an output sentence written in a target language (second language). In particular, it relates to a machine translation apparatus, system, and method that reduce the time and space complexity of syntactic and semantic analysis by segmenting the input sentence.

Referring to FIG. 1, a conventional machine translation method has an input step (10) of receiving an input sentence written in a source language, a lexical analysis step (20) of morphologically analyzing the input sentence, a syntactic analysis step (40) of determining the syntactic structure of the input sentence, a semantic analysis step (50) of analyzing the meaning of the input sentence, a target sentence generation step (60) of generating an output sentence translated into a target language based on the syntactic and semantic analyses, and an output step (90) of outputting the output sentence.

Conventional machine translation of this kind (English-Korean machine translation) translates between two languages (Korean and English) that belong to different cultural spheres. Because these languages differ considerably in structure and word order, idiom-based analysis has mainly been used to obtain accurate analysis and natural translation. Idiom-based analysis recognizes idioms before parsing and treats each idiom as a single unit of analysis. However, because idioms are ambiguous, many candidate idiom units can be generated, which considerably increases the complexity of sentence analysis.

Various sentence segmentation methods have been proposed to reduce this complexity. One of them is partial parsing, which divides a sentence into chunks and analyzes each chunk. A chunk is a structure corresponding to the performance structure proposed by Gee and Grosjean [3]. Chunks correspond to phrase structures such as NP (noun phrase), VP (verb phrase), and PP (prepositional phrase), and each consists of one head word combined with several function words. Such chunks can be analyzed with the context-free grammar commonly used in natural language analysis. However, because a performance structure corresponds to the part of a sentence that a person utters at one time, that is, to a prosodic pattern, the relationships between chunks based on it are difficult to analyze with a context-free grammar.

Another proposed method uses pattern rules to segment long English sentences. Segmentation pattern rules are written by hand, sentences matching a rule are segmented, each segment is analyzed independently, and the analysis results are combined to produce the structure of the whole sentence. This method reduces the parsing complexity for sentences to which a pattern rule applies, but it is impractical because it is impossible for a person to write rules covering all patterns of long sentences.

Another proposed method uses multi-layered pattern matching to segment long Japanese sentences. It likewise requires long sentence patterns to be constructed, and only sentences that match a pattern can be segmented. In addition, when a segmented short sentence has no subject, an additional algorithm is needed to find that subject.

Another method that has been studied segments sentences by exploiting the fact that an English declarative sentence almost always consists of three consecutive parts (pre-subject part + subject + predicate). This method attempts to reduce parsing complexity by using the pattern matching capability of a neural network to divide a sentence into three parts. However, it is applicable mainly to simple sentences and is difficult to apply to compound (coordinate) or complex sentences that have multiple subjects and predicates.

Another proposed method segments sentences using sentence patterns. It analyzes each segment separately and combines the analysis results as directed by the pattern to produce a single sentence structure. However, this method also requires a person to construct patterns for long sentences, that is, for compound and complex sentences, and to segment sentences with them.

To solve these problems of the prior art, an object of the present invention is to provide a machine translation apparatus, system, and method that reduce human effort by generating classification rules for segmentable positions through learning.

Another object of the present invention is to provide a machine translation apparatus, system, and method that determine segmentation positions with a maximum entropy probability model and thereby achieve safe segmentation with an accuracy sufficient for practical use.

Another object of the present invention is to provide a machine translation apparatus, system, and method that maintain at least a certain level of coverage and accuracy even when segmenting sentences from a domain different from that of the training data.

Another object of the present invention is to provide a machine translation apparatus, system, and method that perform practical machine translation in real time by improving the efficiency of syntactic analysis through sentence segmentation.

To achieve the above objects, the present invention provides a machine translation apparatus having: an input unit for receiving an input sentence written in a source language; a central processing means equipped with a lexical analysis module that morphologically analyzes the input sentence, a syntactic analysis module that determines the syntactic structure of the input sentence, a semantic analysis module that analyzes the meaning of the input sentence, and a target sentence generation module that generates an output sentence translated into a target language based on the syntactic and semantic analyses; and an output unit for outputting the output sentence. The apparatus further has a classification rule DB in which classification rules for segmentable positions are stored and a probability distribution DB in which a probability distribution over segmentable positions is stored. The central processing means is further equipped with a sentence segmentation module that selects segmentable positions by applying the classification rules stored in the classification rule DB to the input sentence, obtains a probability value for each selected position using the probability distribution stored in the probability distribution DB, and determines the segmentation of the input sentence.

The present invention also provides a machine translation system that has a terminal, a server, and an open network connecting the terminal and the server, and that translates an input sentence written in a source language into a target language and outputs it as an output sentence. The terminal has an input unit for receiving the input sentence written in the source language and an output unit for outputting the output sentence translated into the target language. The server has a classification rule DB in which classification rules for segmentable positions are stored, a probability distribution DB in which a probability distribution over segmentable positions is stored, and a central processing means equipped with a lexical analysis module that morphologically analyzes the input sentence, a syntactic analysis module that determines the syntactic structure of the input sentence, a semantic analysis module that analyzes the meaning of the input sentence, a target sentence generation module that generates an output sentence translated into the target language based on the syntactic and semantic analyses, and a sentence segmentation module that selects segmentable positions by applying the classification rules stored in the classification rule DB to the input sentence, obtains a probability value for each selected position using the probability distribution stored in the probability distribution DB, and determines the segmentation of the input sentence.

The present invention also provides a machine translation apparatus and system in which the contents of the classification rule DB are produced by building a corpus of many sentences annotated with segmentation positions, generating training data from it, and learning the concept of a segmentable position from the training data to generate the classification rules for segmentable positions.

The present invention also provides a machine translation apparatus and system in which the contents of the probability distribution DB are produced by building a corpus of many sentences annotated with segmentation positions, generating training data from it, generating rules and/or features from the training data to build a probability model, and generating the probability distribution from the probability model.

The present invention also provides a machine translation apparatus and system in which the sentence segmentation module additionally stores the output sentence in the classification rule DB as segmentable positions, and modifies the probability model and stores it in the probability distribution DB.

The present invention also provides a machine translation method having an input step of receiving an input sentence written in a source language, a lexical analysis step of morphologically analyzing the input sentence, a syntactic analysis step of determining the syntactic structure of the input sentence, a semantic analysis step of analyzing the meaning of the input sentence, a target sentence generation step of generating an output sentence translated into a target language based on the syntactic and semantic analyses, and an output step of outputting the output sentence. After the lexical analysis step, the method further has a rule application step of applying sentence segmentation rules to the morphologically analyzed input sentence, a position selection step of selecting segmentable positions by applying the rules, an interpretation step of obtaining a probability value for each selected position from a probability distribution, and a decision step of determining the sentence segmentation based on the probability values.

The present invention also provides a machine translation method in which the syntactic analysis step divides the input sentence into segments according to the segmentation determined in the decision step, analyzes each segment, and combines the segments to determine the syntactic structure of the input sentence.

The present invention also provides a machine translation method in which the sentence segmentation rules applied in the rule application step are generated by building a corpus of many sentences annotated with segmentation positions, generating training data from it, and learning the concept of a segmentable position from the training data.

The present invention also provides a machine translation method in which the probability distribution applied in the interpretation step is generated by building a corpus of many sentences annotated with segmentation positions, generating training data from it, generating rules and/or features from the training data to build a probability model, and using the probability model.

The present invention also provides a machine translation method that further has, after the interpretation step, a storage step of storing the probability values obtained in the interpretation step, and in which the decision step segments the sentence at the position having the maximum probability value among the stored probability values.
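The decision described above, choosing the stored position with the largest probability value, can be illustrated with a minimal Python sketch. The function name `pick_best_position` and the dictionary layout (word index mapped to probability) are assumptions made for illustration and do not appear in the patent.

```python
def pick_best_position(stored_probs):
    """Return the candidate position with the largest stored probability.

    stored_probs is a hypothetical mapping from a candidate word index
    to the segmentation probability saved in the storage step.
    Returns None when no candidate position was stored.
    """
    if not stored_probs:
        return None
    # max over the keys, compared by their stored probability value
    return max(stored_probs, key=stored_probs.get)
```

A caller would fill `stored_probs` during the interpretation step and call the function once per sentence (or once per remaining span if the sentence is segmented repeatedly).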

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

Referring to FIG. 2, the machine translation apparatus according to the present invention has an input unit (1), an output unit (2), a central processing unit (3), a classification rule DB (4), and a probability distribution DB (5).

The input unit (1) is an input device, such as a keyboard, mouse, or scanner, for entering an input sentence written in the source language. The output unit (2) is an output device, such as a monitor or printer, that outputs the output sentence in the target language.

The central processing unit (3) has a lexical analysis module (6A) that morphologically analyzes the input sentence from the input unit (1), a sentence segmentation module (6B) that divides the input sentence into segments, a syntactic analysis module (6C) that determines the syntactic structure of the input sentence, a semantic analysis module (6D) that analyzes the meaning of the input sentence, and a target sentence generation module (6E) that generates an output sentence translated into the target language based on the syntactic and semantic analyses.

The classification rule DB (4) stores classification rules for segmentable positions. These rules are generated by building a corpus of many sentences annotated with segmentation positions, generating training data from it, and learning the concept of a segmentable position from the training data.

The probability distribution DB (5) stores a probability distribution. The distribution is generated by building a corpus of many sentences annotated with segmentation positions, generating training data from it, generating rules and/or features from the training data to build a probability model, and using the probability model to generate the distribution.

The sentence segmentation module (6B) of the central processing unit (3) selects segmentable positions by applying the classification rules stored in the classification rule DB (4) to the input sentence entered through the input unit (1). The sentence segmentation module (6B) then obtains a probability value for each selected position using the probability distribution stored in the probability distribution DB (5) and determines the segmentation of the input sentence.

The sentence segmentation module (6B) also additionally stores the output sentence in the classification rule DB (4) as segmentable positions, and modifies the probability model and stores it in the probability distribution DB (5). Through this additional storage of segmentable positions and modification of the probability model, the classification rule DB (4) and the probability distribution DB (5) become more accurate and richer.

The machine translation apparatus according to the present invention has been described above, but the present invention is not limited to it.

The present invention may be a machine translation system having a client PC with the input unit (1) and the output unit (2), a server with the classification rule DB (4), the probability distribution DB (5), and the central processing means (3) equipped with the above modules, and an Internet network connecting the client PC and the server. Such a machine translation system can be used to translate results retrieved on the Internet into a desired language in real time.

Referring to FIG. 3, the machine translation method according to the present invention is the same as the conventional machine translation method in that it has an input step (10) of receiving an input sentence written in a source language, a lexical analysis step (20) of morphologically analyzing the input sentence, a syntactic analysis step (40) of determining the syntactic structure of the input sentence, a semantic analysis step (50) of analyzing the meaning of the input sentence, a target sentence generation step (60) of generating an output sentence translated into a target language based on the syntactic and semantic analyses, and an output step (90) of outputting the output sentence.

The machine translation method according to the present invention further has a sentence segmentation step (30) after the lexical analysis step (20).

FIG. 4 is a detailed flowchart of the sentence segmentation step (30) of the machine translation method of FIG. 3. Referring to FIGS. 3 and 4, the sentence segmentation step (30) first applies sentence segmentation rules to the morphologically analyzed input sentence (31).

FIG. 5 is a flowchart of the method for generating the classification rules used in the sentence segmentation rule application step (31) of FIG. 4. Referring to FIG. 5, the sentence segmentation rules are generated as follows. First, a corpus of many sentences annotated with segmentation positions is built manually (102). The corpus is a set of sentences in which a person has marked appropriate segmentation positions while reading each sentence. The person who builds the corpus preferably has some knowledge of the grammar of the source language (English). The segmentation positions marked in the corpus carry the characteristics that a person takes into account when segmenting a sentence.

Next, training data (lexical contexts) are generated from this corpus and stored (104). Each training example (lexical context) for a word w_i in a sentence consists of 12 attributes in total (see Table 1): whether w_i is a segmentable position (1 attribute); the word w_i itself (1); the two words to the left and the two words to the right of w_i (4); the parts of speech of the two words to the left and the two words to the right of w_i (4); and the subcategorization information of the two words to the left of w_i (2), each a binary value indicating whether the word is a verb that can take a clause as its object.

Table 1. Attributes of the training data (lexical context)
s_pos | word_i | w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2} | p_{i-2}, p_{i-1}, p_{i+1}, p_{i+2} | s_cat_{i-2} | s_cat_{i-1}
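As an illustration of the 12 attributes listed above, the following Python sketch builds one lexical context for the word at index `i`. It is a sketch under stated assumptions: the function name, the dictionary keys, the padding token `PAD`, and the input layout (parallel lists of words, part-of-speech tags, and clause-taking flags) are hypothetical and are not taken from the patent.

```python
PAD = "<pad>"  # hypothetical filler for positions outside the sentence


def lexical_context(words, tags, takes_clause, i, is_split):
    """Build the 12-attribute lexical context for the word at index i.

    words, tags, takes_clause are parallel lists (word, part of speech,
    and a flag saying whether the word is a verb that can take a clause
    as its object). is_split is the annotated s_pos value for this word.
    """
    def at(seq, j, default):
        # Safe lookup that pads beyond the sentence boundaries.
        return seq[j] if 0 <= j < len(seq) else default

    return {
        "s_pos": is_split,
        "word_i": words[i],
        "w_i-2": at(words, i - 2, PAD),
        "w_i-1": at(words, i - 1, PAD),
        "w_i+1": at(words, i + 1, PAD),
        "w_i+2": at(words, i + 2, PAD),
        "p_i-2": at(tags, i - 2, PAD),
        "p_i-1": at(tags, i - 1, PAD),
        "p_i+1": at(tags, i + 1, PAD),
        "p_i+2": at(tags, i + 2, PAD),
        "s_cat_i-2": at(takes_clause, i - 2, False),
        "s_cat_i-1": at(takes_clause, i - 1, False),
    }
```

For example, for the sentence "I think that he left" with the word "that" marked as a segmentation position, the context would record "think" as the left neighbor and flag it as a clause-taking verb.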

Next, the concept that defines a segmentable position is obtained by learning from the training data (106).

Next, concept learning of the segmentable position is performed on the training data to generate a compact representation of the active lexical contexts in the form of a version graph, and rules expressed as paths between the generalization boundary and the specialization boundary on the version graph are obtained (108). The classification rules obtained in this way are stored in the classification rule DB (4) of FIG. 2.

Referring again to FIGS. 3 and 4, the sentence segmentation step selects the segmentable positions of the input sentence by applying the classification rules stored in the classification rule DB (4) (32). The probability value of each selected segmentable position is then calculated using the probability distribution stored in the probability distribution DB (5) (33). This segmentation probability value expresses the degree of belief that the position will be a segmentation position.
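Steps 32 and 33 can be sketched in Python as follows. This is only an illustration under assumptions: a rule is modeled as a dictionary of required attribute values (attributes not mentioned act as wildcards), which is a simplification of the version-graph paths described in the text, and `prob_fn` stands in for a lookup in the probability distribution DB. All names are hypothetical.

```python
def select_candidates(contexts, rules):
    """Step 32 (sketch): indices whose lexical context matches any rule.

    contexts is a list of attribute dictionaries, one per word.
    Each rule is a dictionary attribute -> required value.
    """
    def matches(rule, ctx):
        return all(ctx.get(key) == value for key, value in rule.items())

    return [i for i, ctx in enumerate(contexts)
            if any(matches(rule, ctx) for rule in rules)]


def score_candidates(candidates, contexts, prob_fn):
    """Step 33 (sketch): map each candidate index to its probability.

    prob_fn is a hypothetical callable returning the segmentation
    probability for one lexical context.
    """
    return {i: prob_fn(contexts[i]) for i in candidates}
```

The two functions are kept separate so that the rule filter, which is cheap, limits how many positions the probability model has to score.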

FIG. 6 is a flowchart of the method for generating the probability distribution used in the probability interpretation step (33) of FIG. 4. Referring to FIG. 6, the probability distribution is generated based on the maximum entropy principle. As in the process of FIG. 5 for generating the classification rules, a corpus annotated with segmentation positions is built (112), training data (lexical contexts) are generated (114), and rules are generated (116).

Next, several factors, referred to as candidate features, are considered so that the determination of segmentation positions both improves efficiency substantially and produces safe segments (118). The candidate features are (1) the lexical contextual features of a word, (2) the regional feature of a word (position information), and (3) the first-segmentation-position feature (whether another segmentation position precedes it). (1) The lexical contextual features are extracted from the rules (110) for classifying segmentable positions. The classification rules express the attributes of the lexical context of segmentable positions, and attributes that occur frequently are strong evidence for determining a segmentation position. A classification rule is defined as a path between the generalization boundary and the specialization boundary on the version graph, and the features are extracted by tracing this path. FIG. 7 shows the algorithm for extracting lexical contextual features. (2) The regional feature of a word is used because even the same word has a different preference for being chosen as a segmentation position depending on where it occurs in the sentence. In a sentence of n words, the region value (position value) of the i-th word is given by Equation 1.

Here, R denotes the number of regions into which the sentence is divided.
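The formula for the region value is not reproduced in this text, so the following Python sketch rests on an explicit assumption: that the sentence is divided into R regions of equal width and the i-th word (1-based) of an n-word sentence falls in region ceil(i * R / n). This equal-width mapping and the name `region_value` are illustrative guesses, not the patent's definition.

```python
def region_value(i, n, num_regions):
    """Region index (1..num_regions) of the i-th word, 1-based.

    ASSUMPTION: equal-width regions, region = ceil(i * R / n).
    The patent's own formula may differ.
    """
    if not 1 <= i <= n:
        raise ValueError("word index i must satisfy 1 <= i <= n")
    # Integer ceiling division avoids floating-point rounding.
    return (i * num_regions + n - 1) // n
```

Under this assumption, words near the start of the sentence share a low region value and words near the end share a high one, which lets the model learn position-dependent preferences for the same word.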

(3) The first-segmentation-position feature, that is, whether another segmentation position precedes the current one, is considered for the sake of safe segmentation. For example, the first segmentation position in a sentence is where a person first divides the sentence, so it is relatively safe compared with other segmentation positions.

The probability distribution is generated from the given features based on the maximum entropy principle (120). By the maximum entropy principle, the probability distribution p* is expressed by Equation 2.

To generate the probability distribution, the random variables X and Y are assumed to be exponentially distributed, and the probability distribution is expressed by a probability model from the family of conditional exponential functions, as in Equation 3.
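The patent's own equation is not reproduced in this text. As a hedged illustration, the sketch below evaluates the standard conditional exponential (maximum entropy) form p(y | x) = exp(sum_k lambda_k f_k(x, y)) / Z(x), which is the usual shape of such models in the literature; whether it matches the patent's equation symbol for symbol is an assumption. The function name and argument layout are hypothetical.

```python
import math


def conditional_prob(x, y, features, weights, labels=(0, 1)):
    """p(y | x) under a conditional exponential model (standard form).

    features is a list of functions f_k(x, y) returning 0 or 1,
    weights is the list of matching lambda_k values, and labels
    enumerates the possible y values (0 = no split, 1 = split).
    """
    def unnormalized(label):
        total = sum(w * f(x, label) for f, w in zip(features, weights))
        return math.exp(total)

    z = sum(unnormalized(label) for label in labels)  # normalizer Z(x)
    return unnormalized(y) / z
```

With all weights at zero the model is uniform over the labels, which is the maximum entropy distribution when no constraint is active.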

The weights of this probability model are computed with the GIS (Generalized Iterative Scaling) algorithm (see FIG. 8).
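FIG. 8 is not reproduced in this text, so the following is only a minimal Python sketch of textbook Generalized Iterative Scaling for a binary conditional model, not the patent's own listing. The name train_gis, the feature-function interface, and the default iteration count are hypothetical. The slack feature is the usual device for making every feature vector sum to the constant C that GIS requires.

```python
import math


def train_gis(events, features, labels=(0, 1), iterations=50):
    """Fit log-linear weights with Generalized Iterative Scaling.

    events   -- list of (x, y) training pairs, y in labels
    features -- list of functions f(x, y) returning 0 or 1 (hypothetical API)
    Returns (weights, prob) where prob(x) lists p(y | x) for each label.
    """
    # GIS needs every feature vector to sum to the same constant C.
    C = max(1, max(sum(f(x, y) for f in features)
                   for x, _ in events for y in labels))

    def fvec(x, y):
        values = [f(x, y) for f in features]
        values.append(C - sum(values))  # slack (correction) feature
        return values

    n = len(events)
    k = len(features) + 1
    weights = [0.0] * k

    # Empirical expectation of each feature over the training events.
    empirical = [0.0] * k
    for x, y in events:
        for i, v in enumerate(fvec(x, y)):
            empirical[i] += v / n

    def prob(x):
        scores = [math.exp(sum(w * v for w, v in zip(weights, fvec(x, y))))
                  for y in labels]
        z = sum(scores)
        return [s / z for s in scores]

    for _ in range(iterations):
        # Expectation of each feature under the current model.
        expected = [0.0] * k
        for x, _ in events:
            for y, p_y in zip(labels, prob(x)):
                for i, v in enumerate(fvec(x, y)):
                    expected[i] += p_y * v / n
        # Multiplicative GIS update, written in log space.
        # Simplification: features never observed in training are skipped.
        for i in range(k):
            if empirical[i] > 0 and expected[i] > 0:
                weights[i] += math.log(empirical[i] / expected[i]) / C

    return weights, prob
```

With labels (0, 1), prob(x)[1] is the model's probability that the position described by x is a segmentation position.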

As the final step of the probabilistic analysis (33) of segmentation position selection, the probability distribution is generated on the basis of the maximum likelihood principle (122). The maximum likelihood principle states that finding the probability model with maximum likelihood is equivalent to finding the probability model with maximum entropy. The maximum likelihood principle is expressed by Equation 4 below.
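For orientation only, the maximum likelihood criterion for a conditional model is commonly written as the log-likelihood of the training data under the model. The form below is the standard one and is an assumption, since the patent's own equation is not reproduced in this text; \tilde{p} and p_\lambda are notation for the empirical distribution and the parametric model.

```latex
L_{\tilde{p}}(p_{\lambda}) = \sum_{x, y} \tilde{p}(x, y)\, \log p_{\lambda}(y \mid x),
\qquad
p^{*} = \arg\max_{\lambda} L_{\tilde{p}}(p_{\lambda})
```

For the conditional exponential family, the maximizer of this log-likelihood coincides with the maximum entropy solution, which is the equivalence the paragraph refers to.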

Considering in the probability model only those candidate features (118) that are useful for determining segmentation positions reduces the computation time for model generation while maintaining the accuracy of segmentation position determination with the generated model. To consider only the useful candidate features, the incremental feature selection (IFS) algorithm of FIG. 9 is used, although conventional frequency-based feature selection (FFS) may be used instead. The probability distribution generated by the above process is expressed by Equation 5 below.

Here, x denotes the context (information) considered in determining a segmentation position, and y takes the value 0 or 1 to indicate whether the position is a segmentation position.
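The two feature selection strategies named above can be sketched as follows. FIG. 9 is not reproduced in this text, so this is a generic illustration under stated assumptions, not the patent's algorithm: the function names, the min_count and min_gain parameters, and the log_likelihood callback (assumed to train a model on a feature subset and return its training log-likelihood) are all hypothetical.

```python
from collections import Counter


def select_by_frequency(events, candidates, min_count=5):
    """FFS sketch: keep features that fire at least min_count times.

    events     -- list of (x, y) training pairs
    candidates -- dict mapping a feature name to a function f(x, y) -> bool
    """
    counts = Counter()
    for x, y in events:
        for name, f in candidates.items():
            if f(x, y):
                counts[name] += 1
    return [name for name in candidates if counts[name] >= min_count]


def select_incrementally(candidates, log_likelihood, min_gain=1e-3):
    """IFS sketch: greedily add the feature with the largest likelihood gain.

    log_likelihood(selected) is an assumed callback that scores a feature
    subset; selection stops when the best gain falls below min_gain.
    """
    selected = []
    remaining = list(candidates)
    best = log_likelihood(selected)
    while remaining:
        gain, choice = max(
            ((log_likelihood(selected + [c]) - best, c) for c in remaining),
            key=lambda pair: pair[0],
        )
        if gain < min_gain:
            break
        selected.append(choice)
        remaining.remove(choice)
        best += gain
    return selected
```

The greedy loop evaluates every remaining candidate in each round, which is one reason an incremental method costs far more training time than a single frequency count while usually ending with fewer features.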

Referring again to FIG. 4, once the segmentation probability values are determined by the features of the context of each segmentable position, the segmentation probability of each segmentable position is computed and the segmentation position is determined (34). The algorithm for determining the segmentation position is shown in FIG. 10. Here, set A is the set of segmentable positions whose segmentation probability is greater than the threshold of the corresponding word, and set B is the set of the remaining segmentable positions. The threshold is expressed by Equation 6 below.

This threshold is the expected value that a particular word becomes a segmentation position; a position whose probability exceeds this value is regarded as appropriate as a segmentation position. Among the positions whose segmentation probability exceeds the threshold, the segmentable position with the highest probability is chosen as the segmentation position. If the segmentation probabilities of all segmentable positions are below the threshold, their reliability as segmentation positions is low. In that case, the size of the segments resulting from the segmentation is also taken into account. The segmentation probability takes a value from 0 to 1, whereas the segment size is an integer greater than 1. To give the two factors equal influence on the determination, the segment size is normalized to a value from 0 to 1. The score of a segmentation position is the sum of these two values, and the position with the highest score is chosen as the segmentation position. Because the sentence must be divided into segments short enough to parse easily, segmentation continues until no segment longer than a given length remains. Of the factors considered in determining a segmentation position, the position information and whether another segmentation position exists earlier change each time the sentence is segmented, so the segmentation probability values change as well. FIG. 11 shows the algorithm for dividing a sentence into segments no longer than a given length.
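The decision procedure described above can be sketched in Python as follows. FIGS. 10 and 11 are not reproduced in this text, so this is an illustration under assumptions rather than the patent's listing: the names choose_split, segment and propose, the max_len default, and in particular the way segment size is normalized (here, how balanced the two resulting segments are) are hypothetical choices.

```python
def choose_split(candidates, n_words):
    """Pick one split position from (position, probability, threshold) triples.

    position is the number of words left of the split (0 < position < n_words).
    """
    # Set A: candidates whose split probability exceeds their own threshold.
    set_a = [c for c in candidates if c[1] > c[2]]
    if set_a:
        return max(set_a, key=lambda c: c[1])[0]

    # Otherwise every candidate is in set B, so a segment-size term is added.
    # ASSUMPTION: size is scored by how balanced the two segments are, scaled
    # into [0, 1]; the source only says the size is normalized to that range.
    def score(c):
        position, probability, _ = c
        balance = min(position, n_words - position) / (n_words / 2)
        return probability + balance

    return max(candidates, key=score)[0]


def segment(words, propose, max_len=12):
    """Split recursively until no segment is longer than max_len words.

    propose(words) is an assumed callback returning the candidate triples
    for one segment; max_len=12 is an arbitrary illustrative limit.
    """
    if len(words) <= max_len:
        return [words]
    # Candidates are proposed again for every segment because the position
    # features, and hence the probabilities, change after each split.
    candidates = [c for c in propose(words) if 0 < c[0] < len(words)]
    if not candidates:
        return [words]  # no segmentable position; leave the segment as is
    position = choose_split(candidates, len(words))
    return (segment(words[:position], propose, max_len)
            + segment(words[position:], propose, max_len))
```

Re-running propose on each sub-segment mirrors the remark that the position information and the presence of an earlier segmentation position, and therefore the probabilities, differ after every split.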

Referring again to FIG. 3, after sentence segmentation, the syntax analysis step (40) analyzes each segment and then combines the segments to analyze the syntax of the whole sentence. After the syntax analysis step (40), the target sentence is generated (60) through semantic analysis (50). The target sentence is then checked for acceptability; if it is not satisfactory, a corrected sentence segmentation is entered manually (70) and the syntax analysis step (40) is repeated. If it is satisfactory, the classification rule used for the segmentation is added to the classification rule DB (4), and the probability distribution stored in the probability distribution DB is updated (80). This last step may be omitted.

Example

To generate the rules for classifying segmentable positions, 3,000 sentences of length 15 or more containing no commas were extracted from 10,300 Wall Street Journal sentences, and the training data was built by having a person mark the segmentation positions. For the translation test, 300 comma-free sentences were extracted from each of 2,640 sentences of high school English textbooks, 1,000 sentences of Byte magazine in the computer field, and 1,200 sentences of Washington Post political articles.

From the 3,000 sentences of the training data, 5,375 active lexical contexts and 40,236 inactive lexical contexts were generated. Learning the segmentable-position concept produced 360 version graphs with a total of 9,002 nodes, from which 5,851 classification rules for segmentable positions were generated. From the generated rules, 6,596 candidate features were generated and used to build the probability model. Of the feature selection methods, FFS took 12 minutes of generation time and yielded 2,866 features, whereas IFS took 1,115 minutes and yielded 1,853 features. IFS requires considerably more time than FFS, but it was advantageous for probability computation because it provides segmentation probabilities from a smaller number of useful features.

The application rate, accuracy, and segmentation error sentences (a segmentation contribution value computed from the number of sentences without segmentation errors and the total number of sentences subject to segmentation) of the machine translation method according to the present invention were compared with those of a rule-based method that segments according to hand-made segmentation rules. In the rule-based method, a person built the rules defining segmentable positions by observing the context-free grammar used for analyzing long sentences. To determine segmentation positions, the segmentable positions were classified by type and a segmentation priority was assigned to each type. The segmentation position was determined by considering the segmentation priority of each segmentable position and the size of the segments produced by the segmentation.

Table 2 compares the segmentation performance of sentence segmentation according to the present invention.

Method | Application rate (%) | Accuracy (%) | Segmentation error sentences
Baseline method | 100 | 77.6 | 0.776
Rule-based method | 85.2 | 86.5 | 0.703
FFS | 98.3 | 88.2 | 0.865
IFS | 98.3 | 91.2 | 0.895

Examining the application rate, accuracy, and segmentation performance of IFS by sentence length, the segmentation error sentence value was 0.913 for test sentences drawn from the same domain as the training data and 0.906 for the high school textbook sentences. It was 0.883 for the Byte magazine sentences in the computer domain and 0.87 for the Washington Post political sentences. The overall average was 0.895, indicating that about 90% of the sentences yielded the correct syntactic structure with low analysis complexity through the sentence segmentation of the machine translation method according to the present invention.

As for parsing efficiency, without sentence segmentation the analysis of sentences of 20 words or more often did not terminate. Table 3 shows the time and space required for analysis with and without segmentation, and the resulting efficiency improvement.

Corpus | Measure | Analysis with segmentation | Analysis without segmentation | Efficiency improvement (%)
Wall Street Journal | Time (s) | 4.8 | 22.1 | 77.4
Wall Street Journal | Space (MB) | 1.1 | 3.9 | 71.8
High school English textbook | Time (s) | 4.6 | 19.6 | 76.5
High school English textbook | Space (MB) | 0.9 | 3.4 | 73.5
Byte magazine | Time (s) | 5.4 | 25.1 | 78.5
Byte magazine | Space (MB) | 1.1 | 3.7 | 70.3
Washington Post | Time (s) | 5.1 | 29 | 82.4
Washington Post | Space (MB) | 1.1 | 4.3 | 74.4
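The efficiency improvement figures in Table 3 appear to be the relative reduction with respect to analysis without segmentation. The patent does not state the formula, so the following small Python check is an assumption, and the helper name improvement_percent is hypothetical. It uses the high school English textbook values from Table 3.

```python
def improvement_percent(with_segmentation, without_segmentation):
    """Relative reduction, in percent, versus analysis without segmentation."""
    return (without_segmentation - with_segmentation) / without_segmentation * 100.0


# High school English textbook row of Table 3.
time_gain = improvement_percent(4.6, 19.6)   # time in seconds
space_gain = improvement_percent(0.9, 3.4)   # space in MB
print(round(time_gain, 1), round(space_gain, 1))  # 76.5 73.5
```

Under this assumed formula the computed values round to the 76.5 and 73.5 listed for that corpus.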

With sentence segmentation according to the present invention, efficiency improvements of 30.9% in time and 57.8% in space were obtained compared with analysis without sentence segmentation. The method could also be applied regardless of the type or length of the sentences, and it showed an application rate of about 98% and a segmentation accuracy of about 90%.

The machine translation apparatus, system, and method according to the present invention reduce human effort by generating the classification rules for segmentable positions through learning, and achieve safe segmentation with accuracy sufficient for practical use by determining segmentation positions with a maximum entropy probability model.

In addition, the machine translation apparatus, system, and method according to the present invention maintain an application rate and accuracy above a certain level even when segmenting sentences from domains different from the training data, and enable practical machine translation in real time through the improved efficiency of syntax analysis brought by sentence segmentation.

FIG. 1 is a flowchart of a conventional machine translation method.

FIG. 2 is a conceptual diagram of a machine translation apparatus according to the present invention.

FIG. 3 is a flowchart of a machine translation method according to the present invention.

FIG. 4 is a detailed flowchart of the sentence segmentation in the machine translation method of FIG. 3.

FIG. 5 is a flowchart of the classification rule generation method used in applying the sentence segmentation rules of FIG. 4.

FIG. 6 is a flowchart of the probability distribution generation method used in the probabilistic analysis of position selection of FIG. 4.

FIG. 7 shows the algorithm for extracting lexical contextual features in the probability distribution generation method.

FIG. 8 shows the GIS algorithm in the probability distribution generation method.

FIG. 9 shows the incremental feature selection algorithm in the probability distribution generation method.

FIG. 10 shows the segmentation position determination algorithm in the probability distribution generation method.

FIG. 11 shows the sentence segmentation algorithm in the probability distribution generation method.

Claims

A machine translation apparatus having an input unit for inputting an input sentence written in a source language; central processing means equipped with a lexical analysis module for morphologically analyzing the input sentence, a syntax analysis module for identifying the syntactic structure of the input sentence, a semantic analysis module for analyzing the meaning of the input sentence, and a target sentence generation module for generating an output sentence translated into a target language based on the syntax analysis and the semantic analysis; and an output unit for outputting the output sentence, the apparatus further comprising:

a classification rule DB storing classification rules for segmentable positions of a sentence; and

a probability distribution DB storing a probability distribution of sentence segmentation, the probability distribution being generated by building a plurality of corpora marked with segmentation positions to produce training data, generating rules and/or features from the training data to create a probability model, and using the probability model,

wherein the central processing means further includes a sentence segmentation module that selects segmentable positions by applying the classification rules stored in the classification rule DB to the input sentence, obtains probability values for the selected segmentable positions using the probability distribution stored in the probability distribution DB, and determines the sentence segmentation of the input sentence.

The machine translation apparatus of claim 1,

wherein the classification rule DB stores classification rules for segmentable positions that are generated by building a plurality of corpora marked with segmentation positions to produce training data and learning the concept of a segmentable position from the training data.

(Deleted)

The machine translation apparatus of claim 2,

wherein the sentence segmentation module additionally stores the output sentence in the classification rule DB as a segmentable position, and modifies the probability model and stores it in the probability distribution DB.

A machine translation system having a terminal, a server, and an open network connecting the terminal and the server, the system translating an input sentence written in a source language into a target language and outputting it as an output sentence, wherein:

the terminal has an input unit for inputting the input sentence written in the source language and an output unit for outputting the output sentence translated into the target language; and

the server includes a classification rule DB storing classification rules for segmentable positions of a sentence;

a probability distribution DB storing a probability distribution of sentence segmentation, the probability distribution being generated by building a plurality of corpora marked with segmentation positions to produce training data, generating rules and/or features from the training data to create a probability model, and using the probability model; and

a central processing unit equipped with a lexical analysis module for morphologically analyzing the input sentence, a syntax analysis module for identifying the syntactic structure of the input sentence, a semantic analysis module for analyzing the meaning of the input sentence, a target sentence generation module for generating an output sentence translated into the target language based on the syntax analysis and the semantic analysis, and a sentence segmentation module that selects segmentable positions by applying the classification rules stored in the classification rule DB to the input sentence, obtains probability values for the selected segmentable positions using the probability distribution stored in the probability distribution DB, and determines the sentence segmentation of the input sentence.

The machine translation system of claim 5,

wherein the classification rule DB stores classification rules for segmentable positions that are generated by building a plurality of corpora marked with segmentation positions to produce training data and learning the concept of a segmentable position from the training data.

(Deleted)

The machine translation system of claim 7,

wherein the sentence segmentation module additionally stores the output sentence in the classification rule DB as a segmentable position, and modifies the probability model and stores it in the probability distribution DB.

A machine translation method having an input step of inputting an input sentence written in a source language; a lexical analysis step of morphologically analyzing the input sentence; a syntax analysis step of identifying the syntactic structure of the input sentence; a semantic analysis step of analyzing the meaning of the input sentence; a target sentence generation step of generating an output sentence translated into a target language based on the syntax analysis and the semantic analysis; and an output step of outputting the output sentence, the method further comprising,

after the lexical analysis step, a rule applying step of applying sentence segmentation rules to the input sentence morphologically analyzed in the lexical analysis step; a position selection step of selecting segmentable positions of the sentence by applying the rules; an analysis step of obtaining probability values for the selected positions through a probability distribution; and a decision step of determining the sentence segmentation based on the probability values,

wherein the probability distribution is generated by building a plurality of corpora marked with segmentation positions to produce training data, generating rules and/or features from the training data to create a probability model, and using the probability model.

The method of claim 9,

wherein, in the syntax analysis step, the input sentence is divided into segments according to the sentence segmentation determined in the decision step, and the segments are combined to synthesize the syntactic structure of the input sentence.

The method of claim 10,

wherein the sentence segmentation rules applied in the rule applying step are generated by building a plurality of corpora marked with segmentation positions to produce training data and learning the concept of a segmentable position from the training data.

(Deleted)

The method according to any one of claims 9 to 12,

further comprising, after the analysis step, a storage step of storing the probability values obtained in the analysis step,

wherein the decision step determines the sentence segmentation at the position having the maximum probability value among the stored probability values.