KR100198959B1

KR100198959B1 - Language translation method

Info

Publication number: KR100198959B1
Application number: KR1019960007606A
Authority: KR
Inventors: 최운천
Original assignee: 정선종; 한국전자통신연구원
Priority date: 1996-03-20
Filing date: 1996-03-20
Publication date: 1999-06-15
Also published as: KR970066941A

Abstract

The present invention relates to a multilingual translation system using a token separator. To make each concept correspond to a single word, words that strongly influence the definition of a concept are defined as reference words, and a token separator that uses these reference words is placed in front of the parser. The method comprises: a first step (31, 32) in which, when a source-language sentence is input, the token separator splits the input sentence into tokens using the reference words; a second step (33) of interpreting the source language using a concept structure; and a third step (34, 35) of translating the interpretation result into the target language, generating the target-language text, and outputting it. As a result, concepts become clearer, the analysis and generation grammars become markedly smaller, system processing speed improves, and grammar writers can write grammars much more easily and quickly.

Description

Source language translation method using token separator

FIG. 1 is a block diagram of the hardware to which the present invention is applied.

FIG. 2 is a processing flowchart of a conventional multilingual translation system.

FIG. 3 is a flowchart of a foreign-language translation system according to the present invention.

FIG. 4 is a flowchart of a token separation method according to the present invention.

* Explanation of reference numerals for the main parts of the drawings

11: character input device    12: central processing unit

13: reference word dictionary    14: character output device

The present invention relates to a method of translating a source language using a token separator.

A multilingual translation system uses a computer to translate one language into several languages at once. For example, such a system translates Korean into English, Japanese, German, and so on, and English into Korean, Japanese, German, and so on.

A machine translation system generally translates one language into another; the input language is called the source language and the language produced by translation is called the target language. A typical machine translation method consists of an analysis step that interprets the source language, a transfer step that converts it into the target language, and a generation step that produces the target language. Translation is based on linguistic knowledge; that is, the part of speech of each word and its role in the sentence are what matter. To this end, a sentence is translated in six stages: morphological analysis, syntactic analysis, semantic analysis, transfer, syntactic/semantic generation, and morphological generation.

By contrast, machine translation (language translation) using a concept structure does not translate stage by stage around linguistic knowledge; it handles everything at once using a meaning-based concept structure. This approach is well suited to conversational sentences, which contain many ungrammatical elements. Multilingual translation using concept structures is currently under development at Carnegie Mellon University in the United States. Language translation using a concept structure consists of interpreting the source language to build a concept structure and then using that concept structure to translate into and generate the target language.

Interpretation (parsing), the first stage of language translation using a concept structure, reads the input text sentence, finds the concepts the sentence expresses, and builds a tree of concepts. The unit of processing in this stage is the word. This principle fits English well but not Korean: in English a word usually corresponds to one concept, whereas in Korean a single word can carry two concepts.

Because a conventional multilingual translation system using concept structures parses word by word (by eojeol, the space-delimited unit, in Korean), it does not suit Korean, in which a content morpheme and a functional morpheme combine to form a single word.

That is, with word-based parsing the concepts become unclear and the analysis and generation grammars grow larger than necessary. For analysis suited to Korean, each eojeol must therefore be split so that one concept corresponds to one word.

Accordingly, an object of the present invention is to provide a method that, in order to make each concept correspond to a single word, defines words that strongly influence the definition of a concept as reference words, separates the input sentence into tokens using these reference words, and translates the source language into the target language.

To achieve the above object, the present invention provides a source language translation method using a concept structure, comprising: a first step of, when a source-language sentence is input, separating the input sentence word by word into token units using a reference word dictionary built by concept; a second step of interpreting the source language separated into token units; and a third step of outputting the interpreted result as the target language.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of the hardware to which the present invention is applied, in which 11 denotes a character input device, 12 a central processing unit, 13 a reference word dictionary, and 14 a character output device.

The character input device (11) inputs the multilingual text to be translated, and the character output device (14) outputs the translated multilingual text. The central processing unit (12) is where the present invention is loaded and executed, and the reference word dictionary (13) is the database required for integrated processing of single-meaning words.

FIG. 2 is a processing flowchart of a conventional multilingual translation system.

First, when a source-language sentence or text is input (21), the source language is interpreted using the concept structure (22). The interpreted result is then translated into the target language, thereby generating the target language (23). The generated target-language sentence or text is displayed immediately (24).

FIG. 3 is a flowchart of a multilingual translation system according to the present invention.

First, when a source-language sentence or text is input (31), the token separator splits it into token units and passes the result on to the next step, interpretation (32). The token separator, which is the core of the present invention, uses reference words to do this.

Next, the source language is interpreted using the concept structure (33), and the interpreted result is translated into the target language, thereby generating the target language (34). The generated sentence or text is displayed immediately (35).

[Table 1] shows that writing the analysis grammar with the present invention markedly reduces the size of the grammar compared with writing it by the conventional method.

With the conventional method, describing the grammar for the concept of day-of-week name (week_name) requires listing a total of 280 words in the grammar, since there are 7 day names and about 40 variant forms of each. With the present invention, the grammar can be described with only 50 words: 3 non-terminal (non_terminal) expressions for grouping, 7 day names, and 40 variant expressions. This is roughly one fifth of the words needed by the conventional method.

FIG. 4 is a flowchart of a token separation method according to the present invention.

When a Korean sentence is input (41), each word (eojeol) is looked up in the reference word dictionary (13) to search for reference words (42). It is then determined whether a reference word is present (43). If so, the token is split at the reference word, the reference word is cut out (44), and the reference word search (42) is repeated on the remainder, because a single word may contain several reference words.

If no further reference word is found, the separated token list is output (45), and it is determined whether tokens have been separated for all input sentences (46). If not, the process repeats from the reference word search (42) for the next word.
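The loop of FIG. 4 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `split_word` and `separate_tokens`, the tiny `REFERENCE_WORDS` set, and the matching policy (earliest position first, then longest match) are all assumptions, since the description does not say how competing matches are resolved.

```python
def split_word(word, reference_words):
    """Split one space-delimited word (eojeol) at the reference words it contains."""
    tokens = []
    rest = word
    while rest:
        # Steps 42/43: search the remainder for a reference word.
        # Assumed policy: earliest position wins, longer match breaks ties.
        best = None
        for ref in reference_words:
            if not ref:
                continue  # ignore empty dictionary entries
            pos = rest.find(ref)
            if pos == -1:
                continue
            if best is None or (pos, -len(ref)) < (best[0], -len(best[1])):
                best = (pos, ref)
        if best is None:
            tokens.append(rest)  # no reference word left in this word
            break
        pos, ref = best
        if pos > 0:
            tokens.append(rest[:pos])  # text preceding the reference word
        tokens.append(ref)  # step 44: cut out the reference word
        rest = rest[pos + len(ref):]  # repeat the search on the remainder
    return tokens


def separate_tokens(sentence, reference_words):
    """Step 45: return the token list for a whole sentence."""
    tokens = []
    for word in sentence.split():  # step 46: continue until every word is done
        tokens.extend(split_word(word, reference_words))
    return tokens


# Hypothetical dictionary holding only the reference words of the example sentence.
REFERENCE_WORDS = {"사일", "오일", "부터", "까지"}

print(separate_tokens("삼월 사일부터 오일까지", REFERENCE_WORDS))
# ['삼월', '사일', '부터', '오일', '까지']
```

A linear scan over the dictionary keeps the sketch short; a real dictionary of any size would more plausibly use a trie or similar index.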

The reference word dictionary is a collection of the words that serve as the basis for defining concepts. In the scheduling domain, for example, it includes date-related expressions, types of meeting, place-related expressions, and so on. The stems of predicates, which take many different endings, are also included in order to simplify the analysis grammar.

The present invention, configured and operating as described above, has the following effects.

The first effect is that concepts become clearer.

For example, in the sentence 삼월 사일부터 오일까지 ("from the 4th to the 5th of March"), 사일부터 ("from the 4th") and 오일까지 ("to the 5th") are two words but, conceptually, four units. The conventional method can treat them only as two concepts, not four: a start point (start_point), 사일부터, and an end point (end_point), 오일까지, both sub-concepts of the larger concept of range. On closer inspection, nothing identifies 사일 (the 4th) and 오일 (the 5th) as dates, so a clear interpretation is impossible.

With the present invention, however, 사일, 오일, 부터, and 까지 are reference words, so the original sentence is separated into 삼월 사일 부터 오일 까지. As the concept parser originally intended, 부터 ("from") identifies the start point (start_point), whose date is 사일 (the 4th), and 까지 ("to") identifies the end point (end_point), whose date is 오일 (the 5th).
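To illustrate why the separated tokens make the concepts recoverable, the sketch below builds a range concept from the token list of the example. It is only an illustration under stated assumptions: `DAY_WORDS`, `MARKERS`, `parse_range`, and the nested-dictionary output are hypothetical and do not reproduce the concept parser or concept-structure format used by the invention.

```python
# Hypothetical lookup tables for the example sentence only.
DAY_WORDS = {"사일": 4, "오일": 5}                      # day-of-month expressions
MARKERS = {"부터": "start_point", "까지": "end_point"}  # range markers


def parse_range(tokens):
    """Attach each date token to the range marker that follows it."""
    concept = {}
    pending_day = None
    for tok in tokens:
        if tok in DAY_WORDS:
            pending_day = DAY_WORDS[tok]  # remember the date until its marker arrives
        elif tok in MARKERS and pending_day is not None:
            concept[MARKERS[tok]] = {"date": pending_day}
            pending_day = None
    return {"range": concept}


print(parse_range(["삼월", "사일", "부터", "오일", "까지"]))
# {'range': {'start_point': {'date': 4}, 'end_point': {'date': 5}}}
```

With unsplit input such as 사일부터, neither lookup table would match, which mirrors the problem the description attributes to word-based parsing.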

The second effect is a marked reduction in the size of the analysis and generation grammars.

For example, with the conventional method, defining the concept week_name, which denotes a day-of-week name, requires every word (eojeol) containing a day name to be included in the grammar. Taking 월요일 (Monday) as an example, every word containing it must be listed: 월요일은, 월요일이, 월요일에는, 월요일인데요, 월요일부터, 월요일만, and so on. Counting the various particles and ending changes, more than 40 particles or endings can follow a day name. Even taking the minimum of 40 variants, 7 day names multiplied by 40 variants gives 280 words needed to write the week_name grammar.

With the present invention, however, the 7 day names are included among the reference words, so each of the expressions above is separated into a day name and a variant. Listing day names and variants separately in the grammar therefore takes only 47 words in total (7 day names plus 40 variants). Moreover, once the 40 variants are collected into a group, other concepts such as month name (month_name) need only refer to that group, so the grammar size is reduced markedly. This carries over to the generation grammar, whose size can be reduced in the same proportion as the analysis grammar.
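The word counts quoted in the description can be checked with a few lines of Python. The function names are hypothetical, and the numbers (7 day names, 40 variants, 3 grouping non-terminals) are taken from the text above.

```python
def conventional_size(n_names, n_variants):
    """Every name/variant combination is listed as its own word."""
    return n_names * n_variants


def separated_size(n_names, n_variants, n_non_terminals=0):
    """Names and variants are listed once each, plus optional grouping symbols."""
    return n_names + n_variants + n_non_terminals


before = conventional_size(7, 40)        # 280 words
after = separated_size(7, 40)            # 47 words
with_groups = separated_size(7, 40, 3)   # 50 words including 3 non-terminals

print(before, after, with_groups, round(with_groups / before, 2))
# 280 47 50 0.18
```

The ratio 50/280 is about 0.18, which matches the "roughly one fifth" figure given for the week_name grammar.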

The third effect is improved system processing speed.

At the system level, the smaller grammar shortens parsing time, so processing becomes much faster.

The fourth effect is that grammar writers can write grammars much more easily and quickly.

By adding the words that matter for separating concepts to the reference word list, the grammar writer needs to enter only 47 words instead of 280, as in the example above, and can therefore write the grammar more efficiently.

The present invention described above is not limited to the foregoing embodiment and the accompanying drawings, since those of ordinary skill in the art to which the invention pertains can make various substitutions, modifications, and changes without departing from its technical spirit.

Claims

A source language translation method using a concept structure, comprising: a first step of, when a source-language sentence is input, separating the input sentence word by word into token units using a reference word dictionary built by concept; a second step of interpreting the source language separated into token units; and a third step of outputting the interpreted result as a target language.

The method of claim 1, wherein the first step comprises: a fourth step of searching, for each word (eojeol) of the input sentence, whether a reference word recorded in the reference word dictionary built by concept is present; a fifth step of, if the search in the fourth step finds a reference word recorded in the dictionary, separating the word at the reference word and repeating from the fourth step on the remaining part; and a sixth step of outputting the separated token list when no further reference word is found in the fourth step, and repeating from the fourth step for all input sentences.