KR20120060666A

KR20120060666A - Apparatus and method for extracting noun-phrase translation pairs of statistical machine translation

Info

Publication number: KR20120060666A
Application number: KR1020100122282A
Authority: KR
Inventors: 김상범; 윤창호; 황영숙; 임해창; 김민정
Original assignee: 에스케이플래닛 주식회사; 고려대학교 산학협력단
Priority date: 2010-12-02
Filing date: 2010-12-02
Publication date: 2012-06-12
Also published as: KR101753708B1

Abstract

PURPOSE: An apparatus and a method for extracting noun-phrase translation pairs are provided to automatically extract translation pairs from a parallel corpus. CONSTITUTION: A source language noun phrase extracting unit (140) extracts noun phrases from the parsing result of a source language sentence. A target language noun phrase candidate extracting unit (130) extracts target language noun phrase candidates from the morphological analysis result of a target language sentence. A translation pair score calculating unit (150) calculates the alignment probability between the source noun phrases and the target candidates. A translation pair extracting unit (160) extracts the translation pair with the highest score.

Description

Apparatus and method for extracting noun-phrase translation pairs for statistical machine translation {APPARATUS AND METHOD FOR EXTRACTING NOUN-PHRASE TRANSLATION PAIRS OF STATISTICAL MACHINE TRANSLATION}

The present invention relates to statistical machine translation. More specifically, it relates to an apparatus and method for extracting noun-phrase translation pairs in which noun phrase candidates are extracted for each language, using part-of-speech information that can form a noun phrase, from a corpus in which source language sentences and target language sentences have been word-aligned, and in which the candidate pairs with a high alignment probability are extracted as noun-phrase translation pairs. Given a parallel corpus for the source and target languages, noun phrases are thus extracted from it automatically, and translation pairs can be extracted without depending on the word alignment.

Automatic translation technology is software technology that automatically converts one language into another. Research on it began in the United States in the mid-20th century for military purposes, and today many research institutes and private companies worldwide are actively pursuing it to broaden access to information and to improve human interfaces.

In its early stages, automatic translation was built on bilingual dictionaries compiled manually by experts and on rules for converting one language into another. Since the beginning of the 21st century, however, with the rapid growth of computing power, statistical translation technology that automatically learns translation algorithms from large amounts of data has been actively developed.

Statistical translation has been one of the major pillars of machine translation, alongside rule-based and example-based machine translation. In particular, as statistical machine translation expanded from word-based models to phrase-based models, the boundary between example-based and statistical machine translation blurred, and rule-based machine translation, together with statistical machine translation, became the main axes of automatic machine translation.

Two important problems in phrase-based statistical machine translation are how to delimit phrase boundaries and how to find translation phrases. The method mainly used today finds phrases from the word alignment result, according to the definition shown in Equation 1 below.

The translation phrase pairs found in this way are phrases consistent with the word alignment, and their boundaries can be delimited as shown in Table 1 below.

In the figure in Table 1, the hatched cells are word-aligned, and the rectangular regions boxed with bold lines indicate phrase boundaries. As with no ↔ did not in the example, when a word is aligned to several words, all the words that can be paired are grouped into the same phrase, and when a word of one language belongs to a given phrase, the words of the other language aligned to it are also included in the translation pair. In addition, only phrases containing at least one word alignment link are extracted.

In Table 1, (a) shows phrase boundaries that satisfy all of the above conditions; (b) shows phrase boundaries that violate them because not all the words aligned to a word are included; and (c) shows phrase boundaries that violate the condition that, when a word belongs to a given phrase, the words of the other language aligned to it must also be included in the translation pair.
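The conditions described for Table 1 can be illustrated with a minimal sketch. It assumes the word alignment is given as a set of (source index, target index) links and that a phrase is an inclusive index span on each side; the function name `is_consistent` and this data layout are hypothetical and do not come from the specification.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (source word index, target word index); assumed layout
Span = Tuple[int, int]  # inclusive (first index, last index)


def is_consistent(alignment: Set[Link], src_span: Span, tgt_span: Span) -> bool:
    """Check the conventional phrase-pair conditions against a word alignment."""
    s_lo, s_hi = src_span
    t_lo, t_hi = tgt_span
    links_inside = 0
    for s, t in alignment:
        s_in = s_lo <= s <= s_hi
        t_in = t_lo <= t <= t_hi
        # A link with only one end inside the box violates the conditions,
        # as in cases (b) and (c) of Table 1.
        if s_in != t_in:
            return False
        if s_in:
            links_inside += 1
    # Only phrases containing at least one alignment link are extracted.
    return links_inside >= 1


# Hypothetical alignment: source word 1 is aligned to target words 1 and 2.
alignment = {(0, 0), (1, 1), (1, 2)}
print(is_consistent(alignment, (1, 1), (1, 2)))  # both aligned words included
print(is_consistent(alignment, (1, 1), (1, 1)))  # one aligned word left out
```

Under this check, a single wrong link (such as 'safety' linked to a distant Korean word) forces the box to grow until it covers both ends, which is the error propagation discussed in the text that follows.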

Statistical machine translation based on translation phrases extracted in this way uses no linguistic information and shows relatively stable performance across a variety of language pairs.

However, for language pairs with poor word alignment performance, such as English-Korean, it has the drawback that alignment errors propagate.

FIG. 1 shows, as an example, the word alignment result of an English-Korean sentence pair.

Suppose the regions (A, B) that group several words are the noun phrases to be extracted. In the word alignment result, '⑧safety' in the English sentence is linked to '①당연히' ("naturally") in the Korean sentence. If translation phrases are extracted with the conventional method, which places a word and all the words aligned to it in the same phrase, 'the cause of the safety accidents' and '당연히 안전 사고의 원인' ("naturally, the cause of the safety accident") are extracted as a translation pair. Likewise, when the translation phrase of 'the safety accidents' or 'the cause' is extracted, '④의' (the genitive particle) in the Korean sentence is linked to '④the' and '⑦the' in the English sentence, so '안전 사고의' or '의 원인' is extracted as the translation.

Thus, extracting phrase-based translation pairs with the conventional method fails to find appropriate translation phrases.

The present invention was devised to solve the above problem. Its object is to provide an apparatus and method that extract noun phrase candidates for each language, using part-of-speech information that can form a noun phrase, from a corpus in which source and target language sentences have been word-aligned, and that extract as noun-phrase translation pairs those candidate pairs with a high alignment probability, so that, given a parallel corpus for the source and target languages, noun phrases are extracted from it automatically and translation pairs can be extracted without depending on the word alignment.

To this end, according to a first aspect of the present invention, the apparatus of the present invention is an apparatus for extracting translation pairs in statistical machine translation, comprising: a source language noun phrase extractor that extracts noun phrases from the result of parsing a source language sentence; a target language noun phrase candidate extractor that extracts, from the result of analyzing a target language sentence into morphemes, noun phrase candidates that can correspond to the source language noun phrases; a translation pair score calculator that calculates a translation pair score by computing the alignment probability between the source language noun phrases extracted by the source language noun phrase extractor and the target language noun phrase candidates extracted by the target language noun phrase candidate extractor; and a translation pair extractor that extracts the translation pair with the highest score among the scores calculated by the translation pair score calculator.

According to a second aspect of the present invention, a terminal device for extracting translation pairs in statistical machine translation comprises: a memory that stores the parsing result of a source language sentence and the morphological analysis result of a target language sentence; a source language noun phrase extractor that extracts source language noun phrases from the parsing result stored in the memory; a target language noun phrase candidate extractor that extracts, from the morphological analysis result stored in the memory, target language noun phrase candidates that can correspond to the source language noun phrases; a translation pair score calculator that calculates a translation pair score by computing the alignment probability between the source language noun phrases extracted by the source language noun phrase extractor and the target language noun phrase candidates extracted by the target language noun phrase candidate extractor; and a translation pair extractor that extracts the translation pair with the highest score among the scores calculated by the translation pair score calculator.

According to a third aspect of the present invention, the target language noun-phrase translation pair extractor attaches part-of-speech information to a target language corpus analyzed into morphemes, and extracts noun phrase candidates from that corpus using the part-of-speech information.

According to a fourth aspect of the present invention, the method of the present invention is a method for extracting translation pairs in statistical machine translation, comprising: (a) extracting noun phrases from the result of parsing a source language sentence; (b) extracting, from the result of analyzing a target language sentence into morphemes, noun phrase candidates that can correspond to the source language noun phrases; (c) calculating a translation pair score by computing the alignment probability between the source language noun phrases and the target language noun phrase candidates; and (d) extracting the translation pair with the highest score among the calculated translation pair scores.

According to a fifth aspect of the present invention, the noun-phrase translation pair extraction method attaches part-of-speech information to a target language corpus analyzed into morphemes, and extracts noun phrase candidates, using the part-of-speech information, from the result of word-aligning that corpus.

According to the present invention, noun-phrase translation pairs are extracted automatically whenever a parallel corpus for the source and target languages exists, so translation pairs for newly coined words absent from existing dictionaries can be extracted automatically from the parallel corpus.

In addition, in conventional translation pair extraction, an error in the word alignment causes errors from the candidate extraction stage onward. Because the present method does not depend on the word alignment, translation pairs can be extracted with less influence from word alignment errors.

Furthermore, the present invention can extract translation pairs using only a parser for the source language, so parsers for both languages are not needed. As a result, it is less affected by parser performance than an approach that uses parsers for both languages.

FIG. 1 is an exemplary diagram showing the word alignment result of an English-Korean sentence pair.
FIG. 2 is an exemplary diagram showing the result of parsing an English sentence.
FIG. 3 is a diagram showing an apparatus capable of extracting noun-phrase translation pairs according to an embodiment of the present invention.
FIG. 4 is a flowchart showing a noun-phrase translation pair extraction method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The configuration of the present invention and its effects will be clearly understood from the following detailed description. Note that the same components are denoted by the same reference numerals wherever possible, even when they appear in different drawings, and that detailed descriptions of well-known configurations are omitted where they could obscure the gist of the present invention.

In statistical machine translation using a parallel corpus, the present invention uses the word alignment results obtained from the parallel corpus, but extracts noun phrase candidates from those results and bases its analysis on them. This provides a configuration in which analysis for machine translation is less affected by word alignment errors.

In the following, a source sentence (or source language sentence) is a sentence in the original language to be translated, and a target sentence (or target language sentence) is a sentence in the target language produced by translating the source sentence into the desired language.

FIG. 2 shows an example of the result of parsing an English sentence. For ease of understanding, FIG. 2 uses the same example as FIG. 1.

In the result of parsing the English sentence of FIG. 2, the words or phrases labeled NP (noun phrase) or BNP in the source tree can be regarded as noun phrases.

That is, in the English sentence, 'the cause', 'the safety accidents', and 'the cause of the safety accidents' are noun phrases. The pronoun 'it' could also be counted as a noun phrase, but because it is a single-word noun phrase it is omitted in the present invention.

The English noun phrases are determined from this parsing result, and the corresponding noun phrases of the target language sentence are extracted using morphological analysis, part-of-speech information, word alignment information, and the like. In particular, when the target language is Korean, which is an agglutinative language, the data sparseness problem can become severe without morphological analysis, so the morphological analysis result is always used for Korean. Moreover, considering every possible morpheme sequence in Korean could produce far too many noun phrase candidates, so part-of-speech information is used to extract the candidates.

The apparatus of the present invention that implements this is described in detail below.

FIG. 3 shows an apparatus capable of extracting noun-phrase translation pairs according to an embodiment of the present invention.

Referring to FIG. 3, the apparatus according to an embodiment of the present invention includes a target morphological analysis corpus DB (110), a source parsing corpus DB (120), a target language noun phrase candidate extractor (130), a source language noun phrase extractor (140), a noun-phrase translation pair score calculator (150), a translation pair extractor (160), and a noun-phrase translation dictionary DB (170).

The target morphological analysis corpus DB (110), the source parsing corpus DB (120), and the noun-phrase translation dictionary DB (170) are stored on a storage medium such as a hard disk or external memory when the apparatus is implemented as a stand-alone system in a computer, and on a server when it is implemented on a network basis. When the apparatus is a high-performance, high-capacity portable terminal, they may be stored in the memory of the terminal.

The target morphological analysis corpus DB (110) stores corpus information obtained by analyzing target language sentences into morphemes, and the source parsing corpus DB (120) stores corpus information obtained by parsing source language sentences.

The target morphological analysis corpus DB (110) and the source parsing corpus DB (120) are databases that hold the source and target language sentences after preprocessing, so that noun phrases can be extracted from the sentences of each language.

The target morphological analysis corpus DB (110) is a component specific to the case where the target language, like Korean, necessarily requires morphological analysis. The invention is not limited to this case: when part-of-speech information is used, a corpus in which part-of-speech information is attached to each word may be stored. In that case, a target POS-tagged corpus DB (not shown), which stores target language sentences with part-of-speech information attached, may additionally be included separately from the target morphological analysis corpus DB (110).

The source language noun phrase extractor (140) receives the parsed source corpus from the source parsing corpus DB (120) and extracts the noun phrases in each source language sentence. As already described with reference to FIG. 2, this can be done by grouping the words classified under BNP or NP in the parse tree.
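As a rough illustration of this step, the following sketch collects the word sequences under NP or BNP nodes of a parse tree. The nested-tuple tree encoding, the function name `extract_noun_phrases`, the part-of-speech labels, and the bracketing of the example fragment are all assumptions made for illustration; the specification does not give the tree of FIG. 2 in text form. Single-word noun phrases are skipped, following the remark that 'it' is omitted.

```python
def extract_noun_phrases(tree, labels=("NP", "BNP"), min_words=2):
    """Collect the word strings of all subtrees whose label is in `labels`.

    `tree` is a nested tuple (label, child, child, ...) whose leaves are
    plain word strings. This encoding is a hypothetical choice.
    """
    phrases = []

    def walk(node):
        if isinstance(node, str):  # leaf: a single word
            return [node]
        label, *children = node
        words = []
        for child in children:
            words.extend(walk(child))
        # Skip single-word noun phrases such as the pronoun 'it'.
        if label in labels and len(words) >= min_words:
            phrases.append(" ".join(words))
        return words

    walk(tree)
    return phrases


# Illustrative fragment only; the bracketing is assumed, not taken from FIG. 2.
fragment = (
    "NP",
    ("BNP", ("DT", "the"), ("NN", "cause")),
    ("PP",
     ("IN", "of"),
     ("BNP", ("DT", "the"), ("NN", "safety"), ("NNS", "accidents"))),
)
print(extract_noun_phrases(fragment))
```

Nested noun phrases are all returned, inner ones first, which matches the three English noun phrases named for the example sentence.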

The target language noun phrase candidate extractor (130) receives the target corpus analyzed into morphemes from the target morphological analysis corpus DB (110) and extracts the noun phrase candidates of each target language sentence.

For an English-Korean corpus, the target language noun phrase candidate extractor (130) applies the union heuristic, which takes into account the word alignments from both the English-to-Korean and the Korean-to-English directions, and then applies the word-alignment-based phrase translation pair extraction method described above to the result. This selects all Korean phrases corresponding to each English noun phrase. Then, from the selected phrases, the parts of speech that can form a noun phrase are identified on the basis of statistical information, and the phrases that can be composed of those parts of speech are taken as noun phrase candidates.

As shown in Table 2 below, the parts of speech that can form a noun phrase may include those allowed only at the start of a noun phrase, those allowed only at the end, those that cannot start or end a noun phrase but are allowed in the middle of a noun phrase of three or more morphemes, and those allowed at any position.

Table 2
Morphemes that can appear at the start and end of a noun phrase: nouns, numerals, pronouns
Morphemes that can appear only at the end of a noun phrase: suffix (-들)
Morphemes that can appear only in the middle of a noun phrase: conjunctive particles (-와/과), adnominal case particle (-의), adnominal endings
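The positional constraints of Table 2 can be sketched as a filter over contiguous morpheme spans. The tag names (`NOUN`, `GEN_PARTICLE`, and so on) and the function names are hypothetical; a real morphological analyzer would use its own tag set. The sketch covers only the part-of-speech filter and reads Table 2 literally, so it is not the full candidate extraction, which also relies on the union word alignment.

```python
# Hypothetical tag names standing in for the categories of Table 2.
ANYWHERE = {"NOUN", "NUMERAL", "PRONOUN"}
END_ONLY = {"SUFFIX"}  # e.g. the plural suffix -들
MIDDLE_ONLY = {"CONJ_PARTICLE", "GEN_PARTICLE", "ADNOMINAL_ENDING"}


def is_np_candidate(tags):
    """Return True if a tag sequence satisfies the positional constraints."""
    n = len(tags)
    if n == 0:
        return False
    for i, tag in enumerate(tags):
        if tag in ANYWHERE:
            continue
        if tag in END_ONLY and i == n - 1 and i > 0:
            continue
        # Strictly inside the span, which implies at least three morphemes.
        if tag in MIDDLE_ONLY and 0 < i < n - 1:
            continue
        return False
    return True


def candidate_spans(tagged):
    """Enumerate inclusive (start, end) spans that pass the filter."""
    tags = [tag for _, tag in tagged]
    return [
        (i, j)
        for i in range(len(tags))
        for j in range(i, len(tags))
        if is_np_candidate(tags[i:j + 1])
    ]


# Morphemes of '안전 사고의 원인' with assumed tags.
sentence = [("안전", "NOUN"), ("사고", "NOUN"), ("의", "GEN_PARTICLE"), ("원인", "NOUN")]
print(candidate_spans(sentence))
```

Spans such as '안전 사고의' or '의 원인', which start or end with the particle, are rejected by this filter, while '안전 사고의 원인' is kept.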

In this way, the target language noun phrase candidate extractor (130) extracts, for the sentences of each language, the source language noun phrases and the target language noun phrase candidates, and suitable noun-phrase translation pairs are then extracted from among all possible pairs.

The noun-phrase translation pair score calculator (150) analyzes the alignment probability for every possible pairing of the target language noun phrase candidates extracted by the target language noun phrase candidate extractor (130) with the source language noun phrases extracted by the source language noun phrase extractor (140). The resulting alignment probability value is called the score.

The score is calculated by summing one or more of the items shown in Table 3 below.

Table 3
1. Number of links between a source noun and a target noun in the intersection alignment
2. Number of links between a source noun and a target noun in the union alignment
3. Number of nouns in the target language noun phrase candidate
4. Number of nouns in the source language noun phrase
5. Number of words in the target language noun phrase candidate
6. Number of words in the source language noun phrase
7. Number of word alignment links with only one language side inside the noun-phrase translation pair candidate
8. Number of noun alignment links with only one language side inside the noun-phrase translation pair candidate
9. Number of intersection alignment links in the whole sentence / number of union alignment links
10. Number of noun phrases in the source sentence / number of noun phrase candidates in the target sentence

In addition, when analyzing the alignment probability for every possible pairing of source language noun phrases and target language noun phrase candidates, the noun-phrase translation pair score calculator (150) extracts the possible pairs on the basis of the intersection word alignment corpus (182) or the union word alignment corpus (184) stored in the word alignment corpus (180). The intersection alignment contains the links present in both the English-to-Korean and the Korean-to-English word alignment results, and the union alignment contains every link present in at least one of the two.
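The intersection and union alignments, and the first two items of Table 3, can be sketched as follows. This is only an illustration under stated assumptions: links are index pairs, noun positions are given as index sets, and the score is the plain sum of items 1 and 2. Which items are summed, and any weighting, are not specified in the text, and all function names here are hypothetical.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (source word index, target morpheme index); assumed layout
Span = Tuple[int, int]  # inclusive (first index, last index)


def symmetrize(src_to_tgt: Set[Link], tgt_to_src: Set[Link]):
    """Return (intersection, union) of two directional word alignments.

    `tgt_to_src` holds links as (target index, source index).
    """
    flipped = {(s, t) for (t, s) in tgt_to_src}
    return src_to_tgt & flipped, src_to_tgt | flipped


def noun_links(links, src_span: Span, tgt_span: Span, src_nouns, tgt_nouns) -> int:
    """Count noun-to-noun links that fall inside both spans."""
    (s_lo, s_hi), (t_lo, t_hi) = src_span, tgt_span
    return sum(
        1
        for s, t in links
        if s_lo <= s <= s_hi and t_lo <= t <= t_hi
        and s in src_nouns and t in tgt_nouns
    )


def pair_score(inter, union, src_span, tgt_span, src_nouns, tgt_nouns) -> int:
    """Sum items 1 and 2 of Table 3 (an assumed choice of items)."""
    return (noun_links(inter, src_span, tgt_span, src_nouns, tgt_nouns)
            + noun_links(union, src_span, tgt_span, src_nouns, tgt_nouns))


# Hypothetical directional alignments for a three-word sentence pair.
e2k = {(0, 0), (1, 1), (2, 1)}
k2e = {(0, 0), (1, 1), (2, 2)}
inter, union = symmetrize(e2k, k2e)
print(sorted(inter), sorted(union))
print(pair_score(inter, union, (1, 2), (1, 2), {1, 2}, {1, 2}))
```

Links in the intersection are counted under both items, so candidates supported by both alignment directions score higher than those supported by only one.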

For each source language noun phrase, the translation pair extractor (160) extracts the single target language noun phrase candidate with the highest score from the noun-phrase translation pair score calculator (150). If several target candidates share the highest score, all of them are extracted. The extracted noun-phrase translation pairs are stored and updated in the noun-phrase translation dictionary DB (170) and can serve as base data for building a noun-phrase translation dictionary.
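The selection rule, including the handling of ties, can be sketched as below. The dictionary layout and the name `best_candidates` are assumptions for illustration, and the scores in the example are made up.

```python
def best_candidates(scores):
    """For each source noun phrase, keep every candidate with the top score.

    `scores` maps a source noun phrase to a dict of
    {target candidate: score}; this layout is a hypothetical choice.
    """
    result = {}
    for source_np, candidates in scores.items():
        if not candidates:
            continue  # no candidate was proposed for this noun phrase
        top = max(candidates.values())
        # Ties are all kept, as described in the text.
        result[source_np] = sorted(c for c, s in candidates.items() if s == top)
    return result


# Made-up scores for illustration only.
scores = {
    "the cause": {"원인": 3, "의 원인": 1},
    "the safety accidents": {"안전 사고": 4, "사고": 4, "안전": 2},
}
print(best_candidates(scores))
```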

Thus, although the present invention uses word alignment information, it does not apply the condition that all words of the other language aligned to a word must be included in the same phrase. In the example of FIG. 1, therefore, '당연히' aligned with 'safety', or '의' aligned with 'the', is not necessarily included in the same phrase.

As a result, the present invention can extract more noun phrase pairs despite word alignment errors, but the number of Korean noun phrase candidates per sentence can become so large that it strains the whole system. To improve efficiency, the present invention therefore excludes a pair of a source language noun phrase and a target language noun phrase candidate in the following three cases.

- Exclude the pair when there is not a single union alignment link between the source language noun phrase (English noun phrase) and the target language noun phrase candidate (Korean noun phrase candidate).

- Exclude the pair when the number of nouns on one side is three or more times the number of nouns on the other side.

- Exclude the pair when the length (number of words or morphemes) of one side is five or more times that of the other side.

In the second and third conditions, the unit of counting is the word for English and the morpheme for Korean.
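The three exclusion conditions above can be written as a single predicate. The function name `keep_pair` and its parameters are hypothetical, and the counts are assumed to have been computed beforehand (words for English, morphemes for Korean). How a side with zero nouns should be treated is not stated in the text, so the literal reading used here is an assumption.

```python
def keep_pair(union_links_inside: int,
              src_nouns: int, tgt_nouns: int,
              src_len: int, tgt_len: int) -> bool:
    """Return False when any of the three exclusion conditions applies."""
    # 1. No union alignment link inside the candidate pair.
    if union_links_inside == 0:
        return False
    # 2. One side has three or more times as many nouns as the other.
    #    Assumption: a side with zero nouns is excluded by this literal reading.
    if max(src_nouns, tgt_nouns) >= 3 * min(src_nouns, tgt_nouns):
        return False
    # 3. One side is five or more times as long as the other.
    if max(src_len, tgt_len) >= 5 * min(src_len, tgt_len):
        return False
    return True


print(keep_pair(1, 2, 2, 3, 4))  # similar sizes, one link inside
print(keep_pair(2, 1, 3, 3, 4))  # three times as many nouns on one side
```

Applying the cheap checks before scoring keeps the number of candidate pairs per sentence manageable, which is the efficiency gain described in the text.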

With these constraints applied, the system became more efficient than before, while the extracted noun phrase pairs remained largely unchanged.

A method of extracting noun-phrase translation pairs using the above apparatus is now described with reference to FIG. 4.

First, as preprocessing of the source language sentences, part-of-speech information is attached to each word of the source corpus (12) stored in the parallel corpus DB (10), and the corpus is parsed by a parser on the basis of this information (S110). The parsing result is stored in the source parsing/POS-tagged corpus DB (120) to build a database (S140).

이와 동시에, 병렬 말뭉치 DB(10)에 저장된 타겟 말뭉치(14)를 형태소 분석하고 분석한 형태소에 대하여 품사 정보를 부착한다(S120). 그리고, 그 결과를 타겟 형태소 분석/품사 부착 말뭉치 DB(110)에 저장하여 데이터베이스를 구축한다(S140).At the same time, the target corpus 14 stored in the parallel corpus DB 10 is morphologically analyzed and the part-of-speech information is attached to the analyzed morphemes (S120). Then, the result is stored in the target morphological analysis / POS attached corpus DB 110 to build a database (S140).

또한, 또 다른 전처리 과정으로, 병렬 말뭉치 DB(10)를 단어 정렬하여 소스-타겟 단어 정렬 DB(180)을 구축한다(S130, S140). 여기서, 소스-타겟 단어 정렬 DB(180)는 소스 언어-타겟 언어에 대하여 어느 한 방향에 대한 정렬 결과를 모두 합집합 한 합집합(union) 단어 정렬과, 소스 언어와 타겟 언어의 양 방향에 대한 정렬 결과에 모드 있는 교집합 한 교집합(intersection) 단어 정렬로 구분될 수 있다. In addition, as another preprocessing process, the parallel corpus DB 10 is word aligned to build a source-target word alignment DB 180 (S130 and S140). Here, the source-target word alignment DB 180 is a union word sort that combines all sorting results in one direction with respect to the source language-target language, and a sorting result for both directions of the source language and the target language. Intersections in mode can be divided into one intersection word order.
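The union and intersection alignments can be illustrated with plain set operations. In this sketch the function name `symmetrize` and the link representation are assumptions: the source-to-target run is taken to yield `(source_index, target_index)` pairs and the target-to-source run `(target_index, source_index)` pairs.

```python
from typing import Iterable, Set, Tuple

Link = Tuple[int, int]  # (source_index, target_index)


def symmetrize(src_to_tgt: Iterable[Link],
               tgt_to_src: Iterable[Tuple[int, int]]) -> Tuple[Set[Link], Set[Link]]:
    """Return (union, intersection) of two directional word alignments."""
    forward = set(src_to_tgt)
    # Flip the reverse-direction links so both sets use (source, target) order.
    backward = {(s, t) for (t, s) in tgt_to_src}
    return forward | backward, forward & backward


union, intersection = symmetrize({(0, 0), (1, 2)}, {(0, 0), (1, 1)})
print(sorted(union))         # [(0, 0), (1, 1), (1, 2)]
print(sorted(intersection))  # [(0, 0)]
```

The union favours recall (any link proposed by either direction survives), while the intersection favours precision (only links both directions agree on survive).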

The databases (110, 120, 180) and the corpus DB (10) built in this way are stored on a storage medium such as a hard disk or external memory when the apparatus of the present invention is implemented stand-alone within a computer, and on a server when it is implemented on a network basis. When the apparatus is a high-performance, high-capacity portable terminal, they may be stored in the memory of the portable terminal.

Next, noun phrases are extracted from the corpus stored in the source parsing / part-of-speech tagged corpus DB (120) (S150). They can be obtained easily from the result of parsing the source corpus.

Noun phrase candidates of the target language are extracted from the corpus stored in the target morphological analysis / part-of-speech tagged corpus DB (110) (S160). To do so, all target-language phrases that could correspond to a source-language noun phrase are first selected; the parts of speech that can form a noun phrase are then derived from the selected phrases on the basis of statistical information; and phrases that can be composed of those parts of speech are extracted as noun phrase candidates.

That is, the parts of speech of morphemes are classified into those that can appear at the beginning and end of a noun phrase, those that can appear only at the end, and those that can appear only in the middle, and every phrase satisfying these conditions is extracted.
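The positional part-of-speech test can be sketched as an enumeration of morpheme spans. Everything specific here is an assumption for illustration: the tag sets (`START_END`, `END_ONLY`, `MIDDLE_ONLY`) are placeholders rather than the statistically derived sets the text refers to, and `max_len` is a hypothetical cap added only to keep the enumeration small.

```python
from typing import List, Sequence, Tuple

# Placeholder tag sets (assumptions, not taken from the patent).
START_END = {"NNG", "NNP"}   # may begin or end a noun phrase
END_ONLY = {"XSN"}           # may appear only at the end
MIDDLE_ONLY = {"JKG"}        # may appear only in the middle


def is_candidate(tags: Sequence[str]) -> bool:
    """Check the positional part-of-speech constraints for one span."""
    if tags[0] not in START_END:
        return False
    if tags[-1] not in START_END | END_ONLY:
        return False
    # Interior morphemes must be tags allowed in the middle of a noun phrase.
    return all(t in START_END | MIDDLE_ONLY for t in tags[1:-1])


def extract_np_candidates(tagged: Sequence[Tuple[str, str]],
                          max_len: int = 5) -> List[Tuple[int, int]]:
    """Return every (start, end) span, end exclusive, that passes is_candidate."""
    tags = [tag for _, tag in tagged]
    spans = []
    for i in range(len(tags)):
        for j in range(i + 1, min(len(tags), i + max_len) + 1):
            if is_candidate(tags[i:j]):
                spans.append((i, j))
    return spans


sentence = [("한국", "NNP"), ("의", "JKG"), ("수도", "NNG"), ("는", "JX")]
print(extract_np_candidates(sentence))  # [(0, 1), (0, 3), (2, 3)]
```

Enumerating all qualifying spans deliberately over-generates, which is why the exclusion rules and the scoring step are needed afterwards.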

Next, a score reflecting the alignment probability is calculated for every possible pairing of the source-language noun phrases and the target-language noun phrase candidates extracted above (S170).

That is, the score is calculated from factors such as the number of nouns/words/noun phrases in the source-language noun phrase, the number of nouns/words/noun phrases in the target-language noun phrase candidate, the number of links aligning a source noun with a target noun under the union or intersection word alignment, and the number of word/noun alignment links for which only one language side falls within the noun phrase translation pair candidate.
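The link-based factors can be sketched as counts over an alignment. The passage lists the factors but does not state how they are combined, so the `toy_score` formula below is purely an assumed combination for illustration (links inside the pair count for it, links that cross its boundary count against it, normalized by the total span length); it also counts all word links rather than noun links only. Spans are `(start, end)` with `end` exclusive.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]
Link = Tuple[int, int]  # (source_index, target_index)


def link_counts(src: Span, tgt: Span, links: Iterable[Link]) -> Tuple[int, int]:
    """Count links fully inside the pair and links with only one end inside."""
    inside = crossing = 0
    for s, t in links:
        in_src = src[0] <= s < src[1]
        in_tgt = tgt[0] <= t < tgt[1]
        if in_src and in_tgt:
            inside += 1
        elif in_src or in_tgt:
            crossing += 1
    return inside, crossing


def toy_score(src: Span, tgt: Span,
              union_links: Iterable[Link],
              inter_links: Iterable[Link]) -> float:
    """Assumed combination of the factors; not the patent's formula."""
    u_in, u_cross = link_counts(src, tgt, union_links)
    i_in, i_cross = link_counts(src, tgt, inter_links)
    size = (src[1] - src[0]) + (tgt[1] - tgt[0])
    return (u_in + i_in - u_cross - i_cross) / size
```

For example, with `src=(0, 2)`, `tgt=(0, 3)`, union links `{(0, 0), (1, 2), (1, 4)}` and intersection links `{(0, 0)}`, two union links and one intersection link lie inside the pair and one union link crosses its boundary.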

Next, the pairs are sorted by the calculated score, and the single noun phrase translation pair with the highest score is selected (S180).
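The selection step (S180) can be sketched as follows. The `best_pairs` name and the `(pair, score)` input format are assumptions. The function returns a list so that it can also cover the variant stated in the claims, in which all pairs are extracted when several share the highest score.

```python
from typing import Hashable, List, Sequence, Tuple


def best_pairs(scored: Sequence[Tuple[Hashable, float]]) -> List[Hashable]:
    """Return every candidate pair whose score equals the maximum score."""
    if not scored:
        return []
    top = max(score for _, score in scored)
    return [pair for pair, score in scored if score == top]


candidates = [(("capital city", "수도"), 0.8),
              (("capital city", "한국 의 수도"), 0.5),
              (("capital city", "도시"), 0.8)]
print(best_pairs(candidates))
# [('capital city', '수도'), ('capital city', '도시')]
```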

Next, the selected noun phrase translation pair is stored in the noun phrase translation dictionary, updating it, so that the dictionary is built up (S190).

With this procedure, noun phrase translation pairs can be extracted automatically from a parallel corpus of the source and target languages. Translation pairs can therefore be extracted with less sensitivity to word alignment errors, and translation pairs for neologisms (unregistered words) absent from existing dictionaries can be extracted automatically from the parallel corpus.

Meanwhile, the present invention can be applied to various playback devices by implementing the word alignment method described above as a software program and recording it on a computer-readable recording medium.

The playback devices may be PCs, notebook computers, portable terminals, and the like.

For example, the recording medium may be built into the playback device, such as a hard disk, flash memory, RAM, or ROM, or may be external, such as an optical disc (CD-R, CD-RW), a CompactFlash card, SmartMedia, a Memory Stick, or a MultiMediaCard.

In this case, the program recorded on the computer-readable recording medium may be executed, as described above, as a process that includes: extracting noun phrases from the result of parsing source language sentences; extracting noun phrase candidates that can correspond to the source-language noun phrases from the result of analyzing target language sentences in morpheme units; calculating translation pair scores by computing alignment probabilities between the source-language noun phrases and the target-language noun phrase candidates; and extracting the translation pair with the highest score among the calculated translation pair scores.

In addition, a program for extracting Korean noun phrases may be executed as a process of attaching part-of-speech information to a Korean corpus analyzed in morpheme units and extracting noun phrase candidates from the result of word-aligning that corpus.

The foregoing description is merely illustrative of the present invention, and those of ordinary skill in the art may make various modifications without departing from its technical spirit. The embodiments disclosed in this specification therefore do not limit the present invention. The scope of the present invention should be construed according to the claims below, and all techniques within an equivalent scope should be construed as falling within the scope of the present invention.

In a statistical machine translation system, translation pair extraction in the prior art depends on word alignment, so an error in word alignment causes errors from the candidate extraction stage onward. The present invention instead uses part-of-speech information to extract noun phrase candidates for each language from a parallel corpus of the source and target languages, and then extracts, from the candidate pairs of the two languages, the pairs with a high alignment probability. Phrase translation pairs can therefore be extracted readily even when the word alignment contains errors, and translation pairs for neologisms (unregistered words) absent from existing dictionaries can be extracted automatically from the parallel corpus. This can improve translation quality through automatic translators or automatic dictionary construction.

110: target morphological analysis corpus DB
120: source parsing corpus DB
130: target language noun phrase candidate extractor
140: source language noun phrase extractor
150: noun phrase translation pair score calculator
160: translation pair extractor
170: noun phrase translation dictionary DB
180: word alignment DB
182: intersection word alignment corpus
184: union word alignment corpus

Claims

An apparatus for extracting noun phrase translation pairs in statistical machine translation, comprising:
a source language noun phrase extractor that extracts a noun phrase of the source language from a result of parsing a source language sentence;
a target language noun phrase candidate extractor that extracts a noun phrase candidate of the target language corresponding to the noun phrase of the source language from a result of analyzing a target language sentence in morpheme units;
a translation pair score calculator that calculates a translation pair score by computing an alignment probability between the noun phrase of the source language extracted by the source language noun phrase extractor and the noun phrase candidate of the target language extracted by the target language noun phrase candidate extractor; and
a translation pair extractor that extracts the translation pair having the highest score among the scores calculated by the translation pair score calculator.

The apparatus of claim 1,
wherein the target language noun phrase candidate extractor
extracts the noun phrase candidate by extracting parts of speech that can form a noun phrase using the part-of-speech information of the target language sentence.

The apparatus of claim 1,
wherein the translation pair score calculator
calculates the score by summing one or more of: the number of nouns/words/noun phrases in the noun phrase of the source language; the number of nouns/words/noun phrases in the noun phrase candidate of the target language; the number of links by which nouns of the source language and nouns of the target language are aligned under the union or intersection word alignment; and the number of word/noun alignment links for which only one language side falls within the noun phrase translation pair candidate.

The apparatus of claim 1,
wherein the translation pair extractor
extracts all of the translation pairs when several translation pairs share the highest score.

The apparatus of claim 1,
wherein the result of morphological analysis of the target language sentence includes a result of word alignment by an alignment algorithm.

The apparatus of claim 1,
wherein the translation pair score calculator
removes the noun phrase candidate of the target language when there is no union alignment within the noun phrase of the source language and the noun phrase candidate of the target language, when the number of nouns in one language is at least three times the number of nouns in the other language, or when the length (number of words or morphemes) of one language is at least five times that of the other language.

The apparatus of claim 1,
wherein the source language is English and the target language is Korean.

A terminal for extracting noun phrase translation pairs in statistical machine translation, comprising:
a memory that stores a result of parsing a source language sentence and a result of morphological analysis of a target language sentence;
a source language noun phrase extractor that extracts a noun phrase of the source language from the result of parsing the source language sentence stored in the memory;
a target language noun phrase candidate extractor that extracts a noun phrase candidate of the target language corresponding to the noun phrase of the source language from the result of morphological analysis of the target language sentence stored in the memory;
a translation pair score calculator that calculates a translation pair score by computing an alignment probability between the noun phrase of the source language extracted by the source language noun phrase extractor and the noun phrase candidate of the target language extracted by the target language noun phrase candidate extractor; and
a translation pair extractor that extracts the translation pair having the highest score among the scores calculated by the translation pair score calculator.

The terminal of claim 8,
wherein the target language noun phrase candidate extractor
extracts the noun phrase candidate by extracting parts of speech that can form a noun phrase using the part-of-speech information of the target language sentence.

The terminal of claim 8,
wherein the translation pair score calculator
calculates the score by summing one or more of: the number of nouns/words/noun phrases in the noun phrase of the source language; the number of nouns/words/noun phrases in the noun phrase candidate of the target language; the number of links by which nouns of the source language and nouns of the target language are aligned under the union or intersection word alignment; and the number of word/noun alignment links for which only one language side falls within the noun phrase translation pair candidate.

The terminal of claim 8,
wherein the translation pair extractor
extracts all of the translation pairs when several translation pairs share the highest score.

The terminal of claim 8,
wherein the result of morphological analysis of the target language sentence includes a result of word alignment by an alignment algorithm.

The terminal of claim 8,
wherein the translation pair score calculator
removes the noun phrase candidate of the target language when there is no union alignment within the noun phrase of the source language and the noun phrase candidate of the target language, when the number of nouns in one language is at least three times the number of nouns in the other language, or when the length (number of words or morphemes) of one language is at least five times that of the other language.

An apparatus for extracting noun phrase translation pairs, which attaches part-of-speech information to a target language corpus analyzed in morpheme units and extracts noun phrase candidates from the target language corpus analyzed in morpheme units.

The apparatus of claim 14,
wherein the noun phrase candidates are generated by extracting all parts of speech that can form a noun phrase from the result of word-aligning the target language corpus.

The apparatus of claim 15,
wherein the parts of speech that can form a noun phrase include a part of speech that can occur only at the beginning of a noun phrase, a part of speech that can occur only at the end of a noun phrase, and a part of speech that can occur in the middle of a noun phrase consisting of three or more morphemes.

The apparatus of claim 14,
wherein the noun phrase candidate of the target language is removed when there is no union alignment between the noun phrase of the other language and the noun phrase candidate of the target language, when the number of nouns in the target language is at least three times the number of nouns in the other language, or when the number of morphemes of the target language is at least five times that of the other language.

The apparatus of claim 14,
wherein the target language is Korean.

A method for extracting noun phrase translation pairs in statistical machine translation, comprising:
(a) extracting a noun phrase from a result of parsing a source language sentence;
(b) extracting noun phrase candidates corresponding to the noun phrase of the source language from a result of analyzing a target language sentence in morpheme units;
(c) calculating a translation pair score by computing an alignment probability between the noun phrase of the source language and the noun phrase candidate of the target language; and
(d) extracting the translation pair having the highest score among the calculated translation pair scores.

The method of claim 19,
wherein step (b)
extracts the noun phrase candidate by extracting parts of speech that can form a noun phrase using the part-of-speech information of the morphologically analyzed target language sentence.

The method of claim 19,
wherein step (c)
calculates the score by summing one or more of: the number of nouns/words/noun phrases in the noun phrase of the source language; the number of nouns/words/noun phrases in the noun phrase candidate of the target language; the number of links by which nouns of the source language and nouns of the target language are aligned under the union or intersection word alignment; and the number of word/noun alignment links for which only one language side falls within the noun phrase translation pair candidate.

The method of claim 19,
wherein step (d)
extracts all of the translation pairs when several translation pairs share the highest score.

The method of claim 19,
wherein step (c)
removes the noun phrase candidate of the target language when there is no union alignment within the noun phrase of the source language and the noun phrase candidate of the target language, when the number of nouns in one language is at least three times the number of nouns in the other language, or when the length (number of words or morphemes) of one language is at least five times that of the other language.

A method for extracting noun phrase translation pairs, comprising attaching part-of-speech information to a target language corpus analyzed in morpheme units, and extracting target language noun phrase candidates using the part-of-speech information from a result of word-aligning the target language corpus analyzed in morpheme units.

The method of claim 24,
wherein the target language noun phrase candidates are extracted based on part-of-speech information that can constitute a target language noun phrase, and
the part-of-speech information that can constitute a target language noun phrase includes a part of speech that can occur only at the beginning of a noun phrase, a part of speech that can occur only at the end of a noun phrase, and a part of speech that can occur at any position in the middle of a noun phrase consisting of three or more morphemes.

The method of claim 24,
wherein the noun phrase candidate of the target language is removed when there is no union alignment between the noun phrase of the other language and the noun phrase candidate of the target language, when the number of nouns in the target language is at least three times the number of nouns in the other language, or when the number of morphemes of the target language is at least five times that of the other language.

A computer-readable recording medium having recorded thereon a program for executing the process according to any one of claims 14 to 26.