KR20000056245A

KR20000056245A - Translation example selection method using similarity reflecting discriminative value in example-based machine translation

Info

Publication number: KR20000056245A
Application number: KR1019990005383A
Authority: KR
Inventors: 이재원; 권철중
Original assignee: 윤종용; 삼성전자 주식회사
Priority date: 1999-02-18
Filing date: 1999-02-18
Publication date: 2000-09-15

Abstract

PURPOSE: A translation example selection method is provided that selects the example sentence most similar to an input sentence in example-based machine translation (EBMT) by using a similarity measure that reflects the discriminating power of words. CONSTITUTION: The method comprises the steps of converting a word with lexical selection ambiguity in an input sentence into a translation example expression (110); selecting, from a translation example database, the translation examples related to the ambiguous word (120); calculating the similarity between the input sentence and each selected example with the similarity calculation method (130); and selecting the target word of the most similar example as the translation of the word (140). The translation example expression used in step 110 describes a tree structure reflecting the sentence structure, in addition to the lexical items and semantic codes of the words appearing in context.

Description

Translation example selection method using similarity reflecting discriminative value in example-based machine translation

The present invention relates to English-Korean translation and, more particularly, to a method of selecting a translation example in example-based machine translation that uses a similarity measure reflecting the discriminating power of words, in order to select the translation example most similar to an input sentence.

Example-Based Machine Translation (EBMT) is based on the idea of translating by imitating similar translation examples. The main tasks in such a system are to build a large database of translation examples and to search that database for the translation example most similar to the input sentence to be translated.

Existing research mainly evaluates similarity by calculating the semantic distance between the words appearing in a translation example and the words appearing in the input sentence. The English word CASE can be translated as '경우' (instance), '상자' (box), '환자' (patient), and so on, and translation examples such as the following are possible.

[경우 (instance)] It was a case of bad judgment.

[상자 (box)] The porter will carry your cases up to your room.

[환자 (patient)] He is a new case of influenza.

If the input sentence is "He is a new case of cholera.", the word 'CHOLERA' in the context of 'CASE' is the name of a disease and is semantically very close to 'INFLUENZA', so 'CASE' should here be translated as '환자' (patient). However, the words that appear in the context of the word in a translation example may contribute to the choice of target word to different degrees. In the translation example for '환자', the word 'HE' can equally appear in translation examples for other target-language words, so it contributes very little to selecting the target-language word. It is therefore difficult to select a suitable translation example merely by calculating the semantic distances of the words appearing in the context.

The following shows translation examples for 'AREA' in abbreviated form, listing only the words appearing in the context and their semantic codes.

[지역 (region)] residential:Dd65, city:Ce80, England:Ce78, no-smoking:Ed88

[분야 (field)] education:Gb35, mathematics:Jb30, economy:Jf141

[면적 (surface area)] circle:Jb44, land:Ce78, square-mile:Jc66, square-kilometer:Jc66

As the above example shows, some words appearing in the translation examples have only a weak influence on the choice of target word. A word with the meaning Ce80 (city, land) can be used with both of the target-language words '지역' (region) and '면적' (surface area), and a word with the meaning Jb30 (mathematics, circle) can be translated as either '분야' (field) or '면적' (surface area). Words with these meanings therefore have weak discriminating power when selecting the target-language word for 'AREA'.

By contrast, a word with the meaning Gb35 (education) does not appear in translation examples for any target-language word other than '분야' (field), so the word is very likely to be translated as '분야'. In this way, some words appear predominantly in the translation examples for one particular target-language word and thereby provide a decisive clue for selecting it. That is, they are words with high discriminating power. When selecting the translation example to use, it is therefore effective not only to calculate the semantic similarity between the input sentence and the translation example but also to reflect the discriminating power carried by the meanings of the words appearing in the translation example.
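The observation above can be checked mechanically. The following is a minimal sketch, not part of the patent, that tallies which target-language words each semantic code co-occurs with in the 'AREA' listing. The dictionary layout and the function name `targets_per_code` are illustrative assumptions. The tally follows the codes exactly as printed in the listing, so the code it reports as shared may differ from the codes named in the prose.

```python
from collections import defaultdict

# Context words and semantic codes copied from the 'AREA' listing above.
AREA_EXAMPLES = {
    "지역": {"residential": "Dd65", "city": "Ce80", "England": "Ce78", "no-smoking": "Ed88"},
    "분야": {"education": "Gb35", "mathematics": "Jb30", "economy": "Jf141"},
    "면적": {"circle": "Jb44", "land": "Ce78", "square-mile": "Jc66", "square-kilometer": "Jc66"},
}


def targets_per_code(examples):
    """Map each semantic code to the set of target words whose examples contain it."""
    seen = defaultdict(set)
    for target, context in examples.items():
        for code in context.values():
            seen[code].add(target)
    return dict(seen)


code_targets = targets_per_code(AREA_EXAMPLES)
# Codes tied to a single target word are strong clues; shared codes are weak ones.
exclusive = sorted(c for c, t in code_targets.items() if len(t) == 1)
shared = sorted(c for c, t in code_targets.items() if len(t) > 1)
```

A code that lands in `exclusive`, such as Gb35, points to one target word only, which is the property the description calls high discriminating power.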

Conventional methods generally evaluate similarity by calculating the semantic distance between the words appearing in the context of the input sentence and the words appearing in the translation example. If the words appearing in the context of the word to be translated in the input sentence are $I_1, I_2, \ldots, I_m$ and the words appearing in the translation example are $E_{ij1}, E_{ij2}, \ldots, E_{ijm}$, the similarity between the input sentence and the translation example can be calculated as the sum of the individual semantic distances, as in Equation 1.

Input: $I = (I_1, I_2, \ldots, I_m)$

Example: $E_{ij} = (E_{ij1}, E_{ij2}, \ldots, E_{ijm})$

$\mathrm{Dist}(I, E_{ij}) = d\big((I_1, I_2, \ldots, I_m), (E_{ij1}, E_{ij2}, \ldots, E_{ijm})\big) = \sum_{k=1}^{m} d(I_k, E_{ijk})$

Conventional methods have therefore concentrated on finding the semantic distance between words, that is, the function $d(I_k, E_{ijk})$. As discussed above, however, words at the same semantic distance can differ in how much they help discriminate between target-language words, so a target word that is semantically close but unsuitable may be selected.

Sumita and Iida used as a weight the degree to which a word appearing in the context influences the choice of target-language word ("Experiments and Prospects of Example-based Machine Translation"; ACL91). However, because a fixed weight is applied to each context word, the weight does not reflect the discriminating power that the word has for each individual target-language word candidate. In other words, the distribution of context words for each target-language word candidate is not modeled.

The technical problem addressed by the present invention is to provide a method for selecting a translation example in example-based machine translation which, in order to select from the translation examples stored in a database the one most similar to the input sentence, calculates a similarity in which the discriminating power of the meanings of co-occurring context words is applied as a weight in the semantic distance calculation.

FIG. 1 is a flowchart illustrating the method according to the present invention for selecting a translation example using similarity reflecting discriminating power in example-based machine translation.

FIG. 2 is a flowchart illustrating the similarity calculation method used in the present invention.

To solve the above technical problem, the method according to the present invention for selecting a translation example using similarity reflecting discriminating power in example-based machine translation comprises: (a) an input sentence conversion step of converting a word with lexical selection ambiguity in the input sentence into the translation example representation, which reflects syntactic relations; (b) selecting, from a translation example database built with that representation, the translation examples related to the input sentence converted in step (a); (c) calculating, for each translation example selected in step (b), its similarity to the input sentence; and (d) selecting the translation example with the greatest similarity calculated in step (c) and choosing the target word (TARGET_WORD) of that example as the target-language word corresponding to the source-language word.

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a flowchart of the method according to the present invention for selecting a translation example using similarity reflecting discriminating power in example-based machine translation. The method consists of an input sentence conversion step (110), a translation example selection step (120), a similarity calculation step (130), and a target-language word selection step (140).

The input sentence conversion step (110) converts a word with lexical selection ambiguity in the input sentence into the translation example representation.

The translation example selection step (120) selects, from a previously built translation example database, the translation examples related to the ambiguous word in the input sentence.

The similarity calculation step (130) calculates, with the similarity calculation method, the similarity between the input sentence and each translation example selected in the translation example selection step (120).

The target-language word selection step (140) selects the most similar translation example and chooses the TARGET_WORD of that example as the target-language word corresponding to the source-language word.

The present invention will now be described in detail step by step.

A translation example is written so that it contains several kinds of knowledge: morpho-syntactic information and tree structure in addition to lexical and semantic information. Describing the morpho-syntactic information and a tree structure that reflects syntactic relations makes it possible to exclude target-language words that are morpho-syntactically unsuitable.

The kinds of knowledge contained in a translation example are as follows.

a. Tree structure (Parent, Child, Sibling): describes the position in the syntactic structure tree of the sentence.

b. Morpho-syntactic information (Constraints): describes morpho-syntactic constraints such as article, number, and countability.

c. Lexical item: describes the word itself as it appears in the context.

d. Semantic information: describes the semantic code of the word appearing in the context.

The translation example representation (a representation that reflects syntactic relations) takes the following form (step 110).

% SOURCE_WORD 〈Source_Language_Word〉

% TARGET_WORD 〈Target_Language_Word〉

% PARENT 〈Lexical_Item:Semantic_Code:Constraints〉

% CHILD 〈Lexical_Item:Semantic_Code:Constraints〉

% SIBLING 〈Lexical_Item:Semantic_Code:Constraints〉

The translation example "He is a new case of influenza" for 'CASE' is described as follows. Here 'PREP:OF' inside '[ ]' is morpho-syntactic information stating that 'case' and 'influenza' are connected by the preposition 'of', and N1, B140, ... are the semantic codes of the lexical items 'be', 'influenza', and so on.

% SOURCE_WORD 〈CASE〉

% TARGET_WORD 〈환자〉

% PARENT 〈BE:N1〉

% CHILD 〈INFLUENZA:B140[PREP:OF], NEW:L241〉

% SIBLING 〈HE:G280〉
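As an illustration only, the record above could be held in memory as follows. This is a minimal sketch; the class names `ContextItem` and `TranslationExample`, the field names, and the use of a dictionary for constraints are assumptions and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContextItem:
    """One 〈Lexical_Item:Semantic_Code:Constraints〉 entry."""
    lexical_item: str
    semantic_code: str
    # Morpho-syntactic constraints, e.g. {"PREP": "OF"}; empty if none are given.
    constraints: Dict[str, str] = field(default_factory=dict)


@dataclass
class TranslationExample:
    """One translation example, with context grouped by tree position."""
    source_word: str
    target_word: str
    parent: List[ContextItem] = field(default_factory=list)
    child: List[ContextItem] = field(default_factory=list)
    sibling: List[ContextItem] = field(default_factory=list)


# "He is a new case of influenza" as described in the record above.
case_patient = TranslationExample(
    source_word="CASE",
    target_word="환자",
    parent=[ContextItem("BE", "N1")],
    child=[
        ContextItem("INFLUENZA", "B140", {"PREP": "OF"}),
        ContextItem("NEW", "L241"),
    ],
    sibling=[ContextItem("HE", "G280")],
)
```

Keeping parent, child, and sibling entries in separate lists preserves the tree position, so a context word of the input is compared only with example words in the same syntactic relation.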

The translation examples related to the input sentence are selected from the translation example database built with the representation described above (step 120).

For each translation example selected in the translation example selection step (120), the similarity to the input sentence is calculated with the similarity calculation method (step 130).

The similarity is calculated from three kinds of information: semantic distance, constraints, and discriminating power. The calculation is given by Equation 2.

Input: $I = (I_1, I_2, \ldots, I_m)$
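Only the first line of Equation 2 appears in this text. Based on steps 230 to 260 of FIG. 2 and step (c4) of the claims, which multiply the semantic distance by the penalty and the discriminating power and accumulate the result, one plausible reading of the full equation is the following. It is a reconstruction under that assumption, not the patent's own formula.

```latex
\mathrm{Dist}(I, E_{ij}) \;=\; \sum_{k=1}^{m} d(I_k, E_{ijk}) \cdot \mathrm{Penalty}(I_k, E_{ijk}) \cdot W_k
```

Here $d$ is the semantic distance of Equation 3, $\mathrm{Penalty}$ is the constraint factor of Equation 4, and $W_k$ is the discriminating-power weight of Equation 5.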

The semantic distance is calculated according to how far apart the meanings of two words lie in a hierarchical thesaurus. The general method of calculating the semantic distance is expressed by Equation 3.

$$
d(I_k, E_{ijk}) =
\begin{cases}
0 & \text{if } \mathrm{Lexical\_Item}(I_k) = \mathrm{Lexical\_Item}(E_{ijk}) \\
0.125 & \text{if } \mathrm{Semantic\_Code}(I_k) \text{ and } \mathrm{Semantic\_Code}(E_{ijk}) \text{ are at the same level of the thesaurus} \\
0.25 & \text{if } \mathrm{Semantic\_Code}(I_k) \text{ and } \mathrm{Semantic\_Code}(E_{ijk}) \text{ differ by 1 level in the thesaurus} \\
0.5 & \text{if } \mathrm{Semantic\_Code}(I_k) \text{ and } \mathrm{Semantic\_Code}(E_{ijk}) \text{ differ by 2 levels in the thesaurus} \\
1.0 & \text{otherwise}
\end{cases}
$$

The closer two words are in the hierarchy, the smaller the semantic distance assigned to them. A small semantic distance between two words thus means that the two words are highly similar.
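Equation 3 can be sketched as a small lookup. The patent does not say how the level difference is derived from two semantic codes, so this sketch takes it as an input. The parameter `level_diff` (0 for the same level, `None` for unrelated codes) and the case-insensitive comparison of lexical items are assumptions made for illustration.

```python
def semantic_distance(lex_input, lex_example, level_diff):
    """Semantic distance following the cases of Equation 3.

    lex_input, lex_example: lexical items of the input and example words.
    level_diff: assumed to be the number of thesaurus levels separating the
        two semantic codes (0 = same level), or None if they are unrelated.
    """
    # Identical lexical items are at distance 0 (case-insensitive by assumption).
    if lex_input.lower() == lex_example.lower():
        return 0.0
    # Each extra level of separation doubles the distance, capped at 1.0.
    by_level = {0: 0.125, 1: 0.25, 2: 0.5}
    return by_level.get(level_diff, 1.0)
```

For the running example, if the thesaurus places 'cholera' and 'influenza' at the same level, `semantic_distance("cholera", "influenza", 0)` gives 0.125.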

The constraints concern morpho-syntactic information and are expressed by Equation 4.

$$
\mathrm{Penalty}(I_k, E_{ijk}) =
\begin{cases}
1 & \text{if } I_k \text{ satisfies } \mathrm{Constraint}(E_{ijk}) \\
2 & \text{otherwise}
\end{cases}
$$

The word in the input sentence is checked against the morpho-syntactic condition described in the translation example. If the condition is not satisfied, the semantic distance calculated above is multiplied by 2, which penalizes the similarity. Multiplying by 2 has the effect of lowering the matching level by one step in the semantic distance calculation.
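The penalty of Equation 4 can be sketched as below. Representing constraints and the input word's features as dictionaries, and the name `input_features`, are assumptions. The patent only states that the factor is 1 when the constraint is satisfied and 2 otherwise.

```python
def penalty(input_features, example_constraints):
    """Penalty factor following Equation 4.

    input_features: assumed dict of morpho-syntactic features of the input word.
    example_constraints: assumed dict of constraints on the example word.
    Returns 1 if every constraint is met by the input word, otherwise 2.
    """
    satisfied = all(
        input_features.get(name) == value
        for name, value in example_constraints.items()
    )
    return 1 if satisfied else 2


# A same-level match (0.125) that violates a constraint is scored like a
# one-level difference (0.25), which is the one-step demotion described above.
demoted = 0.125 * penalty({"PREP": "IN"}, {"PREP": "OF"})
```

An example word with no constraints always yields a factor of 1, so it never changes the distance.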

Discriminating power is high when a meaning appears predominantly in the translation examples for one particular target-language word. Equation 5 calculates the discriminating power with respect to selecting the target-language word. The value is calculated so that it is lower the higher the discriminating power is. This is because the discriminating power is a factor multiplied into the distance value, and a smaller distance means a higher similarity.

$W_k = \mathrm{disc\_value}(\mathrm{semantic}(E_{ijk}))$

Here, "freq(SM) in $E_{ijk}$ for all i, j" is the total number of times a specific semantic code (SM) appears in the translation examples for the ambiguous source-language word, and "freq(SM) in $E_{ijk}$ for all j" is the number of times that semantic code appears in those translation examples that belong to a specific target-language word i. The larger the share taken by "freq(SM) in $E_{ijk}$ for all j", the higher the discriminating power.
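The formula for disc_value itself is not reproduced in this text. The sketch below assumes the ratio suggested by the claims: the total frequency of the semantic code divided by its frequency for the specific target word. This gives the smallest value, 1.0, when the code occurs only with that target word, consistent with the statement that higher discriminating power yields a lower value. The function signature and the `CODE_COUNTS` layout are illustrative assumptions; the counts are taken from the 'AREA' listing as printed.

```python
from collections import Counter

# Semantic-code counts per target word, from the 'AREA' listing as printed.
CODE_COUNTS = {
    "지역": Counter(["Dd65", "Ce80", "Ce78", "Ed88"]),
    "분야": Counter(["Gb35", "Jb30", "Jf141"]),
    "면적": Counter(["Jb44", "Ce78", "Jc66", "Jc66"]),
}


def disc_value(code, target, code_counts):
    """Assumed weight: total frequency of `code` over its frequency for `target`.

    Lower is better: 1.0 means the code occurs only with this target word.
    """
    for_target = code_counts[target][code]
    if for_target == 0:
        raise ValueError(f"{code} does not occur in examples for {target}")
    total = sum(counts[code] for counts in code_counts.values())
    return total / for_target


w_education = disc_value("Gb35", "분야", CODE_COUNTS)  # code exclusive to one target
w_land = disc_value("Ce78", "지역", CODE_COUNTS)       # code shared by two targets
```

Under this assumption an exclusive code leaves the distance unchanged, while a code spread over several target words inflates it.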

FIG. 2 is a flowchart of the process described above for calculating the similarity between the input sentence and a translation example.

First, the variable Distance is assigned its initial value "0", and the variable K, which represents the number of translation examples to be processed, is set to "0" (step 210). It is checked whether a translation example remains to be processed, and if none remains the procedure ends (step 220). The semantic distance between the word of the translation example and the word of the input sentence is calculated (step 230). Whether the word of the input sentence satisfies the constraint is checked to compute Penalty, which is multiplied by the semantic distance (step 240). The discriminating power of the meaning of the word in the translation example is calculated and multiplied by the value obtained in step 240 to give the similarity (step 250). The similarity calculated in step 250 is added to the variable Distance (step 260), and steps 220 to 260 are repeated for the translation examples still to be processed, which yields the similarity between the input sentence and the translation examples (step 130).
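The accumulation in FIG. 2 can be sketched as a single loop. This sketch assumes the loop runs over aligned pairs of context words and that the three factors for each pair have already been computed. The function name and the tuple layout are illustrative, not from the patent.

```python
def example_distance(factors):
    """Accumulate the weighted distance for one translation example.

    factors: iterable of (semantic_distance, penalty, weight) triples, one per
        aligned pair of input word and example word (an assumed layout).
    """
    distance = 0.0                      # step 210: start from 0
    for d, pen, weight in factors:      # step 220: loop while items remain
        # steps 230-250: semantic distance x penalty x discriminating power
        distance += d * pen * weight    # step 260: add to the running total
    return distance


# Two context words: one exact match (d = 0) and one same-level match that
# violates a constraint (0.125 x 2) with weight 1.0.
total = example_distance([(0.0, 1, 1.0), (0.125, 2, 1.0)])
```

Because every factor is non-negative, the running total only grows, so a smaller final value always indicates a closer match.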

From the results of the similarity calculation step (130), the translation example with the greatest similarity (that is, the smallest accumulated distance) is selected, and the target word (TARGET_WORD) of that example is chosen as the target-language word corresponding to the source-language word (step 140).
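Step 140 then reduces to picking the candidate with the smallest accumulated distance. The sketch below is illustrative only: the function name, the pair layout, and the distance values in the usage line are hypothetical and are not results from the patent.

```python
def select_target_word(candidates):
    """Return the target word of the closest translation example.

    candidates: list of (target_word, distance) pairs, one per translation
        example (an assumed layout). The smallest distance wins; on a tie
        the earliest candidate is kept.
    """
    if not candidates:
        raise ValueError("no translation examples to choose from")
    target_word, _ = min(candidates, key=lambda pair: pair[1])
    return target_word


# Hypothetical distances for the three 'CASE' examples.
chosen = select_target_word([("경우", 1.5), ("상자", 2.0), ("환자", 0.125)])
```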

Word | Test sentences | Model A | Model B
Noun | 42 | 27 (64.3%) | 27 (64.3%)
Verb | 46 | 33 (71.7%) | 30 (65.2%)
Adjective | 15 | 11 (73.3%) | 9 (60.0%)
Adverb | 6 | 4 (66.6%) | 2 (33.3%)
Total | 109 | 75 (68.8%) | 68 (62.4%)

Table 1 shows an embodiment of the present invention in which about 720 translation examples were built for 33 words, and a model that uses discriminating power (A) was compared with a model that does not (B). The test set consisted of 109 randomly extracted sentences. Accuracy improved from 62.4% with model B to 68.8% with model A. This result shows that discriminating power can be useful for lexical selection.
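The percentages in Table 1 follow directly from the counts, as this short check shows. The dictionary layout and the helper `accuracy` are illustrative. The counts are those given in the table.

```python
# (test sentences, correct with model A, correct with model B) from Table 1.
RESULTS = {
    "noun": (42, 27, 27),
    "verb": (46, 33, 30),
    "adjective": (15, 11, 9),
    "adverb": (6, 4, 2),
}


def accuracy(correct, total):
    """Accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


n_total = sum(n for n, _, _ in RESULTS.values())
a_total = sum(a for _, a, _ in RESULTS.values())
b_total = sum(b for _, _, b in RESULTS.values())

acc_a = accuracy(a_total, n_total)
acc_b = accuracy(b_total, n_total)
# Difference between the two models in percentage points.
gain = round(acc_a - acc_b, 1)
```

Rounding gives 66.7% for adverbs under model A, while the table prints 66.6%, which looks like truncation rather than a different count.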

According to the present invention, the discriminating power that the meaning of a word in a translation example has for target-language word selection is reflected as a weight in the similarity calculation, which makes it more likely than with conventional methods that a suitable translation example is selected. In other words, accuracy increases when selecting the target-language word appropriate for a source-language word in machine translation.

In addition, because discriminating power is reflected alongside the meanings of the words appearing in the translation examples, building the translation examples becomes easier.

Claims

A method for selecting a translation example using similarity reflecting discriminating power in example-based machine translation, the method comprising:

(a) an input sentence conversion step of converting a word with lexical selection ambiguity in an input sentence into a translation example representation that reflects syntactic relations;

(b) selecting, from a translation example database built with the translation example representation, the translation examples related to the input sentence converted in step (a);

(c) calculating, for each translation example selected in step (b), its similarity to the input sentence; and

(d) selecting the translation example with the greatest similarity calculated in step (c) and choosing the target word (TARGET_WORD) of the selected translation example as the target-language word corresponding to the source-language word.

The method of claim 1, wherein the translation example representation in step (a)

describes morpho-syntactic information and a tree structure reflecting syntactic relations, in addition to the lexical items of the words appearing in the context and the semantic codes of those words.

The method of claim 1, wherein step (c) comprises:

(c1) calculating the semantic distance between a word of the translation example and a word of the input sentence;

(c2) calculating a penalty by checking whether the word of the input sentence satisfies a constraint;

(c3) calculating the discriminating power of the meaning of the word in the translation example; and

(c4) multiplying the semantic distance by the penalty and the discriminating power.

The method of claim 3, wherein step (c1)

calculates the semantic distance according to how far apart the meaning of the word in the translation example and the meaning of the word in the input sentence lie in a hierarchical thesaurus.

The method of claim 3, wherein step (c2)

checks whether the word of the input sentence satisfies the morpho-syntactic condition described in the translation example and, if it does not, penalizes the similarity by lowering the matching level by one step in the semantic distance calculation.

The method of claim 3, wherein step (c3)

calculates the discriminating power by dividing the total frequency with which a specific semantic code appears in the translation examples for the ambiguous source-language word

by the frequency with which that semantic code appears in the translation examples for a specific target-language word.