KR20030056655A

KR20030056655A - Similar sentence retrieval method for translation aid

Info

Publication number: KR20030056655A
Application number: KR1020010086929A
Authority: KR
Inventors: 이기영; 노윤형; 김창현; 최승권; 김영길; 서영애; 양성일; 류철; 홍문표
Original assignee: 한국전자통신연구원
Priority date: 2001-12-28
Filing date: 2001-12-28
Publication date: 2003-07-04
Also published as: US20030125928A1; US7333927B2; KR100453227B1

Abstract

PURPOSE: A method for retrieving similar sentences in a translation support system is provided. It improves system accuracy by defining a similarity measure between sentences, using that measure to retrieve the most similar example from a translation memory, and outputting the retrieved source sentence together with its translation. CONSTITUTION: A translation memory (105) is built by running a morphological analyzer (102) over an unprocessed parallel corpus (101); it holds each source sentence, the morphological analysis result of the source sentence, and the translation. An inverted index file (104) is formed by extracting index words from each source sentence in the translation memory (105). A filtering unit (106) extracts from the input source sentence only the morphemes that are nouns, verbs, or adjectives, which serve as index words. A search unit (107) looks up the candidate sentences supplied by the filtering unit (106) in the translation memory (105) and loads the results. A similarity calculation unit (108) calculates the similarity of each candidate sentence using an edit distance method with part-of-speech weights.

Description

Method for retrieving similar sentences in a translation support system {SIMILAR SENTENCE RETRIEVAL METHOD FOR TRANSLATION AID}

The present invention relates to similar-sentence retrieval in a translation support system and, more particularly, to the construction of a translation memory made up of pairs of source sentences and their translations, and to a method for retrieving, from the source sentences in the translation memory, the sentence that is grammatically and structurally most similar to an input sentence.

General machine translation systems cannot currently produce natural-sounding translations. The reason is that source-language analysis techniques are not yet perfect, and the transfer techniques that convert the source language into the target language are also still inadequate.

By contrast, a translation support system does not provide fully automatic translation, but it retrieves from the translation memory the sentence most similar to the one the user wants to translate, together with its translation, which greatly helps the user translate. In this respect it is far more practical than current machine translation systems.

However, most translation support systems simply apply string matching to the words that appear in the input sentence, so they can output only sentences that match on the surface.

Therefore, to solve this problem and obtain better results that take into account the structural and grammatical components of a sentence, a technique is needed that extracts similar sentences not only by matching surface words but also by using morphological analysis results and applying different weights according to part of speech.

The present invention was conceived to meet this need. Its object is to provide a method for retrieving similar sentences in a translation support system that uses a translation memory as a resource, provides a measure of inter-sentence similarity for retrieving from the translation memory the example sentence that is grammatically and structurally most similar to the input sentence presented by the user, and outputs the retrieved source sentence together with its translation, thereby achieving more accurate system performance.

To achieve this object, the present invention provides a method for retrieving similar sentences in a translation support system, comprising: a first step of constructing a translation memory from a parallel corpus made up of pairs of source sentences and their translations, and constructing an inverted file of index words for fast access to the translation memory; a second step of preliminarily filtering candidate sentences, before similarity calculation, by applying the information in the inverted index file and matching weights for nouns, verbs, and adjectives; and a third step of loading the filtered candidate sentences from the translation memory, calculating the similarity between the input sentence and each candidate sentence according to a defined similarity measure, and outputting the source sentences and their translations in order of similarity.

FIG. 1 is a block diagram of a translation support system to which the method according to the present invention is applied;

FIG. 2 is a diagram illustrating the filtering process of a similar sentence retrieval method according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating the similarity calculation process of a similar sentence retrieval method according to another embodiment of the present invention.

<Description of reference numerals for the main parts of the drawings>

101: parallel corpus    102: morphological analyzer

103: input unit    104: inverted index file

105: translation memory    106: filtering unit

107: search unit    108: similarity calculation unit

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

Before the description, the terms used in the present invention are defined as follows.

First, a translation memory consists of source sentences, the morphological analysis results of the source sentences, and their translations. The larger the translation memory, the higher the probability of finding a sentence similar to the input sentence.

An inverted index file is built by extracting index words from the morphological analysis results of all sentences in the translation memory. Each index word serves as a key, and its entry holds the numbers of the sentences in which the word appears and the word's position within each sentence.

Inter-sentence similarity is calculated according to the defined similarity measure and indicates how grammatically and semantically similar two sentences are.

The following embodiments are described using a translation support system that uses a translation memory as an example. Note, however, that the embodiments are not limited to a translation support apparatus that uses a translation memory.

FIG. 1 is a block diagram of a translation support system to which the method according to the present invention is applied.

First, the translation memory (105) is built by running the morphological analyzer (102) over the unprocessed parallel corpus (101), and consists of sets of three elements: the source sentence, the morphological analysis result of the source sentence, and the translation. During this process, index words are extracted from each source sentence in the translation memory (105) to form a separate inverted index file (104). For each noun, verb, and adjective used as an index word in the morphological analysis results of all source sentences in the translation memory (105), the inverted index file (104) records which morpheme of which sentence the word is, that is, the sentence number and the morpheme position.
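The structure of the inverted index file described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the tag names (`noun`, `verb`, `adjective`, `pronoun`), the function name `build_inverted_index`, and the toy English sentences standing in for analyzed source sentences are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical tag names; the source only says nouns, verbs and adjectives are indexed.
INDEX_TAGS = {"noun", "verb", "adjective"}

def build_inverted_index(analyzed_sentences):
    """Map each index word to a list of (sentence number, morpheme position)."""
    index = defaultdict(list)
    for sent_no, morphemes in enumerate(analyzed_sentences):
        for position, (word, tag) in enumerate(morphemes):
            if tag in INDEX_TAGS:
                index[word].append((sent_no, position))
    return dict(index)

# Toy morphological analysis results for two source sentences (assumed data).
analyzed_sources = [
    [("I", "pronoun"), ("read", "verb"), ("book", "noun")],
    [("she", "pronoun"), ("buy", "verb"), ("new", "adjective"), ("book", "noun")],
]
inverted_index = build_inverted_index(analyzed_sources)
print(inverted_index["book"])
```

Keying the index by word means a lookup touches only the sentences that actually contain that word, which is what makes the later filtering step cheap.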

A source sentence entered through the input unit (103) passes through the morphological analyzer (102), which analyzes the morphemes of each word in the sentence and at the same time determines each word's part of speech.

The filtering unit (106) extracts from the input sentence only the morphemes that are nouns, verbs, or adjectives, which are used as index words. For these extracted index words, it consults the inverted index file (104) and virtually constructs candidate sentences that contain index words of the input sentence. These virtual candidate sentences are not the sentences actually stored in the translation memory (105); they consist of index words only. The filtering unit then applies noun, verb, and adjective weights to the virtual candidate sentences, calculates the matching rate between the input sentence and each virtual candidate, and performs a first-pass filtering.

This filtering is needed to reduce the overhead of similarity calculation: instead of calculating the similarity between the input sentence and every source sentence in the translation memory (105), only a limited set of candidate sentences is extracted first.

The search unit (107) looks up the candidate sentences supplied by the filtering unit (106) in the translation memory (105) and loads the results.

The similarity calculation unit (108) calculates the similarity of each candidate source sentence from the translation memory (105) using an "Edit Distance" technique with part-of-speech weights.

Finally, according to the similarity calculated by the similarity calculation unit (108), the sentences in the translation memory (105) are ranked starting from the most similar and are output to a printing device (111) or a display device (112) via a printing unit (109) or a display control unit (110).

The process of retrieving similar sentences according to a preferred embodiment of the present invention, using the configuration described above, is now described in more detail with reference to FIGS. 2 and 3.

First, FIG. 2 illustrates the filtering process of the similar sentence retrieval method according to an embodiment of the present invention.

As shown in FIG. 2, the inverted index file contents (201) are obtained, as described above, by morphologically analyzing each source sentence in the translation memory (105) and extracting only the nouns, verbs, and adjectives used as index words; for each word they hold the sentence number and the position within that sentence.

As the first step of the filtering unit (106), when an input sentence (202) arrives, the morphological analyzer (102) determines the part of speech of each word in it, and only the nouns, verbs, and adjectives, which are used as index words, are extracted. These appear as the index words (203) extracted from the input sentence (202) in FIG. 2.

Next, for the extracted index words of the input sentence, the inverted index file (104) is consulted, and the sentence number and morpheme position information are used to generate reconstructed candidate sentences (204) containing the index words.

For the candidate sentences (204) constructed in this way, matching weights for nouns, verbs, and adjectives are applied to obtain the matching weight between the input sentence and each candidate sentence. These matching weights are shown at (205) in FIG. 2.

The matching weight used here can be expressed as Equation 1 below.

[Equation 1]

Matching weight = (number of matched nouns × noun weight) + (number of matched predicates × predicate weight)

Here, a predicate means a verb or an adjective.

Using the matching weights obtained in this way, only those translation memory sentences that are likely candidates are extracted.
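The filtering step can be sketched roughly as below, assuming an inverted index that maps each index word to (sentence number, position) pairs. The weight values `NOUN_WEIGHT` and `PREDICATE_WEIGHT`, the `top_k` cutoff, the choice to count each distinct matched word once, and all names are assumptions for illustration; the source gives no concrete values for the filtering weights.

```python
from collections import defaultdict

NOUN_WEIGHT = 1.0       # assumed value; the source gives no number
PREDICATE_WEIGHT = 2.0  # assumed value for verbs and adjectives

def filter_candidates(index_words, inverted_index, top_k=10):
    """Score sentences by the matching weight and keep the top_k sentence numbers."""
    matched = defaultdict(lambda: {"noun": set(), "predicate": set()})
    for word, tag in index_words:
        kind = "noun" if tag == "noun" else "predicate"
        for sent_no, _position in inverted_index.get(word, []):
            matched[sent_no][kind].add(word)
    scores = {
        sent_no: len(kinds["noun"]) * NOUN_WEIGHT
        + len(kinds["predicate"]) * PREDICATE_WEIGHT
        for sent_no, kinds in matched.items()
    }
    # Highest matching weight first; ties broken by sentence number.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]

# Toy inverted index (assumed data): word -> [(sentence number, position)].
toy_index = {
    "read": [(0, 1)],
    "book": [(0, 2), (1, 3)],
    "buy": [(1, 1)],
    "new": [(1, 2)],
}
query_index_words = [("read", "verb"), ("book", "noun")]
candidates = filter_candidates(query_index_words, toy_index)
print(candidates)
```

Only the sentence numbers returned here would need to be loaded from the translation memory for the full similarity calculation.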

This filtering considerably reduces the burden of the inter-sentence similarity calculation described below, which would otherwise cover every sentence in the translation memory (105).

FIG. 3 illustrates the similarity calculation process of the similar sentence retrieval method according to another embodiment of the present invention.

In this process, the results of the filtering unit (106) are used to load the actual candidate sentences from the translation memory (105), and the inter-sentence similarity between each candidate sentence and the input sentence is calculated.

Before describing FIG. 3, note that the "Edit Distance" used in this embodiment expresses the difference between two sentences quantitatively. That is, when sentence A is changed into sentence B, the "Edit Distance" is the sum of the number of words deleted from A and the number of words inserted. However, if a translation support tool uses this "Edit Distance" as is, only surface-form matches are considered, so the results can differ considerably from what the user wants. The present invention therefore applies an "Edit Distance" that takes the following points into account.

First, not only surface words but also their parts of speech are considered, and a different matching weight is assigned according to the part of speech of each matched word. This is done so that the structural similarity of sentences is also taken into account, that is, so that candidate sentences in the translation memory whose surface words differ but whose structure is similar can also be presented as results. For example, if the matching weights of verbs, case particles, and endings are larger than that of nouns, sentences structurally similar to the input sentence are more likely to be returned.

Second, match, insert, and delete operations are used as the basic operations of the "Edit Distance". The insert and delete operations act as a kind of normalization for candidate sentences of different lengths. That is, even if a long candidate sentence A and a short candidate sentence B have the same number of matching words, the insert and delete operations allow candidate B to be judged more similar than candidate A.

In summary, this embodiment uses a weighted "Edit Distance" technique.

Table 1 below shows the types of edit operations used in the present invention and their weights.

[Table 1]
Edit operation | Weight
Match | -10
Insert | 2
Delete | 1

Table 2 below shows the weights used in the present invention according to the part of speech of a matched word.

[Table 2]
Part of speech | Weight
Verb | 10
Case particle, ending | 8
Noun head word | 5
Other | 3

The per-operation weights and per-part-of-speech matching weights shown in Tables 1 and 2 are determined heuristically.

The above concerns the weighted "Edit Distance" that takes part of speech into account. The following is an additional factor considered when calculating inter-sentence similarity.

First, the distribution of words in the input sentence is considered. Suppose, for example, that the input sentence contains a compound noun, that one candidate sentence contains the compound noun in the same form, and that another candidate sentence contains it in decomposed form. Both candidates have the same number of matching words, but when the word distribution of the input sentence is considered, the candidate that contains the compound noun in the same form can be judged more similar.
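The source does not state how the word distribution is measured. One possible reading, shown here purely as an assumption, is to count how many adjacent word pairs of the input sentence also appear adjacent in the candidate; the function name `shared_adjacent_pairs` and the toy word lists are hypothetical.

```python
def shared_adjacent_pairs(source_words, candidate_words):
    """Count adjacent word pairs of the source that are also adjacent in the candidate."""
    candidate_pairs = set(zip(candidate_words, candidate_words[1:]))
    source_pairs = zip(source_words, source_words[1:])
    return sum(1 for pair in source_pairs if pair in candidate_pairs)

# Toy data (assumed): the input contains the compound "machine translation system".
source_words = ["machine", "translation", "system", "use"]
intact_candidate = ["machine", "translation", "system", "buy"]
split_candidate = ["machine", "for", "translation", "of", "system"]

print(shared_adjacent_pairs(source_words, intact_candidate))
print(shared_adjacent_pairs(source_words, split_candidate))
```

Both candidates match three words of the input, but only the first keeps the compound intact, so a measure of this kind would separate them where a plain match count would not.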

The inter-sentence similarity used in the present invention can be expressed as Equation 2 below.

[Equation 2]

Inter-sentence similarity =
    Σ over each matched word (part-of-speech weight of the matched word × 'match' operation weight)
    + ('insert' operation weight × number of 'insert' operations performed)
    + ('delete' operation weight × number of 'delete' operations performed)

The inter-sentence similarity value obtained from Equation 2 can be viewed as a cost: the lower the value, the greater the similarity.
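Equation 2 with the weights of Tables 1 and 2 can be sketched as a dynamic-programming alignment that minimizes the cost. Several points are assumptions rather than statements from the source: two morphemes match only when both word and tag are identical, a delete removes a morpheme of the input and an insert adds a morpheme of the candidate, every noun is treated as a head word, and the tag names and toy sentences are hypothetical.

```python
MATCH, INSERT, DELETE = -10, 2, 1  # operation weights from Table 1
POS_WEIGHT = {"verb": 10, "particle": 8, "ending": 8, "noun": 5}  # Table 2
OTHER_WEIGHT = 3  # Table 2, "Other"

def match_cost(tag):
    """Part-of-speech weight times the 'match' operation weight."""
    return POS_WEIGHT.get(tag, OTHER_WEIGHT) * MATCH

def similarity_cost(source, candidate):
    """Lowest total cost of turning source into candidate; lower means more similar."""
    n, m = len(source), len(candidate)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + DELETE
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + INSERT
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j] + DELETE, cost[i][j - 1] + INSERT)
            if source[i - 1] == candidate[j - 1]:  # same word and same tag
                best = min(best, cost[i - 1][j - 1] + match_cost(source[i - 1][1]))
            cost[i][j] = best
    return cost[n][m]

# Toy morpheme sequences (assumed data).
source = [("he", "other"), ("book", "noun"), ("read", "verb")]
short_candidate = [("she", "other"), ("book", "noun"), ("read", "verb")]
long_candidate = [("she", "other"), ("old", "other"), ("book", "noun"),
                  ("slowly", "other"), ("read", "verb")]

print(similarity_cost(source, short_candidate))  # -150 for matches, +1 delete, +2 insert
print(similarity_cost(source, long_candidate))   # -150 for matches, +1 delete, +6 inserts
```

Both candidates match the same noun and verb, but the longer one needs two more inserts, so its cost is higher. This is the normalizing effect of the insert and delete operations on sentence length.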

As shown in FIG. 3, given the morphological analysis result (301) of the input sentence (202) and the morphological analysis result (302) of candidate sentence 1 in the translation memory (105), applying Equation 2 to calculate the similarity between the input sentence (202) and candidate sentence 1 (302) yields a value of '-329'.

In the morphological analysis result (302) of candidate sentence 1, the parts shown in bold and underlined are those that match words of the input sentence.

Although the present invention has been described in detail on the basis of embodiments, it is not limited to these embodiments, and various modifications are possible without departing from its gist. For example, it may of course be implemented as a program and stored on a computer-readable recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, etc.).

As described above, according to the present invention, by performing filtering and inter-sentence similarity calculation, a translation support system retrieves from the translation memory and presents the example translation most similar to the sentence the user wants to translate, which improves the performance of the translation support system.

Claims

A method for retrieving, in a translation support system, sentences similar to a source sentence entered by a user from a translation memory, the method comprising:

a first step of constructing the translation memory and an inverted file of index words from a parallel corpus;

a second step of filtering candidate sentences likely to have high similarity by comparing the input sentence presented by the user with the sentences of the constructed translation memory; and

a third step of calculating, for the filtered candidate sentences, the similarity to the actual input sentence, and outputting the source sentences and their translations in order of similarity.

The method of claim 1,

wherein the first step comprises:

performing morphological analysis on each source sentence and translation in the parallel corpus, and constructing a translation memory in the form of source sentence, morphological analysis result of the source sentence, and translation; and

extracting, from the morphologically analyzed source sentences, the words that are nouns or verbs used as index words, and constructing an inverted index file in which each such word is a key associated with the numbers of the sentences in which the word appears and its morpheme position in each sentence.

The method of claim 1,

wherein the second step comprises:

extracting, from the words of the input source sentence, only the nouns and verbs used as index words;

extracting from the inverted index file, for the index words of the input sentence extracted in the preceding step, the numbers of the sentences that contain those words; and

virtually constructing, from the result of the sentence number extraction, sentences that contain only index words of the input sentence, and extracting only the numbers of likely candidate sentences by applying matching weights for nouns and verbs to the virtual sentences.

The method of claim 1,

wherein the third step comprises:

calculating inter-sentence similarity using a weighted "Edit Distance" method and word distribution.