KR100910275B1

KR100910275B1 - Method and apparatus for automatic extraction of transliteration pairs in dual language documents

Info

Publication number: KR100910275B1
Application number: KR1020070107661A
Authority: KR
Inventors: 방정민; 진청궈; 남상협; 김성일
Original assignee: 방정민; 진청궈; 남상협; 김성일
Priority date: 2007-10-25
Filing date: 2007-10-25
Publication date: 2009-08-03
Also published as: KR20090041897A

Abstract

The present invention relates to a method and apparatus for automatically extracting transliteration pairs from bilingual documents. The apparatus comprises: a bilingual document extraction module that extracts bilingual documents found on the Internet and separates each into monolingual documents, one per language; a transliteration candidate word extraction module that extracts transliteration candidate words from one of the separated monolingual documents; and a transliteration pair extraction module that selects a candidate word to be transliterated from the extracted candidates and then uses a dynamic window technique or a tokenizer technique to automatically extract, from the other monolingual document, the transliteration pair corresponding to the selected candidate word. By automatically extracting transliteration pairs from the large number of bilingual documents on the Web, a large-scale dictionary of unregistered (out-of-vocabulary) words can be built and applied to cross-language search or machine translation services to improve their performance.

Transliteration, Statistical transliteration model, Dynamic window technique, Tokenizer technique, Bilingual document, Transliteration pair, English, Chinese

Description

Method and apparatus for automatic extraction of transliteration pairs in dual language documents {METHOD AND APPARATUS FOR AUTOMATIC EXTRACTION OF TRANSLITERATION PAIRS IN DUAL LANGUAGE DOCUMENTS}

The present invention relates to a method and apparatus for automatically extracting transliteration pairs from bilingual documents, and more particularly to a method and apparatus that automatically extract transliteration pairs from the large volume of bilingual documents on the Web to build a large-scale unregistered-word dictionary DB, which can then be applied to cross-language search, machine translation services, and the like to improve their performance.

As the Internet has developed and cultural exchange between countries has increased, many new loanwords are being coined. Most of these loanwords are transliterations, and they cause serious problems in language processing. Various methods have been proposed to handle such transliterations. They fall broadly into the following two categories.

The first is to generate transliterations automatically {Reference: J.-H. Oh, K.-S. Choi, "An English Korean transliteration model using pronunciation and contextual rules", in: Proceedings of the 19th International Conference on Computational Linguistics (COLING), Taipei, Taiwan, pp. 758-764, (2002)}.

Automatic transliteration generation means automatically producing the transliteration that corresponds to a given foreign word. For example, given the English word "Clinton", the Korean transliteration "클린턴" (Keullinteon) is generated automatically.

However, this approach does not achieve high performance, because transliterations are diverse and people follow many different habits when creating them. For example, the English word "Scofield" would properly be transliterated as "스코필드" (Seukopildeu) or "스커우필드" (Seukeoupildeu).

In practice, however, people render it as "석호필" (Seokhopil). Generating "석호필" with a conventional transliteration generation method is almost impossible, because current computer technology does not yet have artificial intelligence capable of capturing human habits.

The problem is more severe in languages such as Chinese, where transliteration takes into account not only the sound but also the meaning of the Chinese characters, so automatic transliteration generation performs very poorly.

The second is to extract transliteration pairs automatically from bilingual documents {Reference: Richard Sproat, Tao Tao, ChengXiang Zhai, "Named Entity Transliteration with Comparable Corpora", in: Proceedings of the 21st International Conference on Computational Linguistics (2006)}.

Automatic extraction of transliteration pairs generally performs better than automatic transliteration generation, but its performance is still not satisfactory.

Most conventional methods for automatic transliteration pair extraction first extract transliteration candidates from each of the two languages and then compute the phonetic similarity between the candidates to extract transliteration pairs.

In this conventional scheme, the result depends heavily on how well the transliteration candidates are extracted. Candidate extraction works best for English, because English capitalizes the first letter of proper nouns and separates words with spaces.
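The point about English can be illustrated with a minimal sketch. It assumes a deliberately naive rule (any capitalized word that is not sentence-initial is a candidate); the function name `extract_candidates` and the sample sentence are hypothetical and are not taken from the patent.

```python
import re


def extract_candidates(sentence: str) -> list[str]:
    """Return capitalized words, skipping the sentence-initial token.

    Naive illustration only: real proper-noun extraction would need
    more than a capitalization check.
    """
    # Spaces and punctuation separate English words, so a simple
    # letter-run pattern is enough to tokenize.
    tokens = re.findall(r"[A-Za-z]+", sentence)
    # The first token is capitalized regardless of its word class.
    return [t for t in tokens[1:] if t[0].isupper()]


candidates = extract_candidates("Yesterday the president met Clinton in Beijing.")
```

The same rule has nothing to hold on to in Chinese text, which has neither capital letters nor word spacing.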

Chinese, by contrast, has neither word spacing nor capital letters, so transliteration candidate extraction remains a difficult problem and does not achieve high performance. Candidates extracted from such a language contain many errors, so good transliteration pair extraction cannot be expected from the conventional methods.

To address this problem, a method has been proposed that extracts transliteration candidates only from the English document and then, using a statistical transliteration model, extracts the corresponding transliteration from the Chinese document on the basis of each candidate {Reference: C.-J. Lee, J.S. Chang, J.-S.R. Jang, "Extraction of transliteration pairs from parallel corpora using a statistical transliteration model", in: Information Sciences 176, 67-90 (2006)}.

This method greatly improved performance over the earlier conventional methods, but its performance degrades as Chinese sentences become longer, and its rule-based post-processing can handle only specific situations.

The present invention has been made to solve the problems described above. Its object is to provide a method and apparatus for automatically extracting transliteration pairs from bilingual documents, in which transliteration pairs are extracted automatically from the large volume of bilingual documents on the Web to build a large-scale unregistered-word dictionary DB, which is then applied to cross-language search, machine translation services, and the like to improve their performance.

To achieve the above object, a first aspect of the present invention provides a method for automatically extracting transliteration pairs from bilingual documents, comprising: (a) extracting bilingual documents found on the Internet and separating each into monolingual documents, one per language; (b) extracting transliteration candidate words from one of the separated monolingual documents; and (c) selecting a candidate word to be transliterated from the extracted candidate words and then, on the basis of the selected candidate word, using a dynamic window technique to automatically extract from the other monolingual document the transliteration pair corresponding to the selected candidate word.

A second aspect of the present invention provides a method for automatically extracting transliteration pairs from bilingual documents, comprising: (a') extracting bilingual documents found on the Internet and separating each into monolingual documents, one per language; (b') extracting transliteration candidate words from one of the separated monolingual documents; and (c') selecting a candidate word to be transliterated from the extracted candidate words and then applying a tokenizer technique based on word segmentation to automatically extract from the other monolingual document the transliteration pair corresponding to the selected candidate word.

Preferably, the tokenizer technique splits each sentence of the other monolingual document into several segments at characters that are not used in transliteration, computes a score for each segment using a statistical transliteration model, and backtracks to the character string with the highest score, thereby automatically extracting the transliteration pair corresponding to the selected candidate word.

A third aspect of the present invention provides a method for automatically extracting transliteration pairs from bilingual documents, comprising: (a") extracting bilingual documents found on the Internet and separating each into monolingual documents, one per language; (b") extracting transliteration candidate words from one of the separated monolingual documents; (c") selecting a candidate word to be transliterated from the extracted candidate words and then splitting each sentence of the other monolingual document into several segments at characters that are not used in transliteration; and (d") applying a dynamic window technique to each segment to automatically extract from the other monolingual document the transliteration pair corresponding to the selected candidate word.

Preferably, the dynamic window technique may comprise: setting, based on the length of the selected candidate word, the range of window lengths within which a transliteration in the other language can fall; moving each window within the set length range forward and obtaining, for the selected candidate word and the character string covered by the current window, the maximum phonetic similarity probability; and backtracking to the character string corresponding to the largest of the maximum phonetic similarity probabilities obtained for the windows, thereby automatically extracting the transliteration pair.

Preferably, when the selected candidate word is an English word of length L and the other language is Chinese, the window length range may be set from L/3 to L.
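The window search described above can be sketched as follows. This is only an illustration under stated assumptions: the phonetic similarity probability of the statistical transliteration model is replaced by `difflib.SequenceMatcher` as a stand-in scorer, the lower bound L/3 is rounded up, and the Chinese sentence is given as a hypothetical list of Pinyin syllables (one per character). The names `similarity` and `best_window` are not from the patent.

```python
import math
from difflib import SequenceMatcher


def similarity(word, syllables):
    """Stand-in score: string similarity between the word and joined Pinyin."""
    return SequenceMatcher(None, word.lower(), "".join(syllables)).ratio()


def best_window(word, syllables, score=similarity):
    """Slide windows of every allowed length and keep the best-scoring span."""
    # Window length range L/3 .. L, with L the English word length
    # (rounding up is an assumption; the text only gives "L/3 to L").
    shortest = max(1, math.ceil(len(word) / 3))
    longest = min(len(word), len(syllables))
    best_score, best_span = float("-inf"), None
    for size in range(shortest, longest + 1):
        for start in range(len(syllables) - size + 1):
            current = score(word, syllables[start:start + size])
            if current > best_score:
                best_score, best_span = current, (start, start + size)
    # The span indices let the caller recover the matching characters.
    return best_span, best_score


# Hypothetical sentence, one Pinyin syllable per Chinese character.
sentence = ["wo", "men", "huan", "ying", "ke", "lin", "dun"]
span, score_value = best_window("Clinton", sentence)
```

With this stand-in scorer the window covering "ke", "lin", "dun" wins; a trained transliteration model would be passed in through the `score` parameter instead.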

Preferably, the characters not used in transliteration may comprise at least one of punctuation marks, digits, spaces, and English letters.

Preferably, when the other monolingual document is in Chinese, the characters not used in transliteration may comprise particles.
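The splitting step can be sketched as below. The sketch assumes that every character outside the basic CJK ideograph range (punctuation, digits, spaces, English letters) is a delimiter, and it uses a small hypothetical set of particles ("的", "了"); the patent does not list which particles are meant. The function name `split_segments` and the sample sentence are likewise illustrative.

```python
def split_segments(sentence, particles=frozenset("的了")):
    """Split a Chinese sentence at characters not used in transliteration."""
    segments, current = [], []
    for ch in sentence:
        # Basic CJK Unified Ideographs block; anything else is a delimiter.
        is_han = "\u4e00" <= ch <= "\u9fff"
        if is_han and ch not in particles:
            current.append(ch)
        elif current:
            # A delimiter closes the segment collected so far.
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments


# Invented example sentence containing punctuation, digits and a particle.
segments = split_segments("美国前总统克林顿，2007年访问了北京。")
```

Each resulting segment is shorter than the full sentence, which limits the number of windows or candidate strings that have to be scored.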

Preferably, the monolingual document from which the transliteration candidate words are extracted is an English document, and the other monolingual document is a Chinese document.

Preferably, the transliteration candidate words are proper nouns.

Preferably, the bilingual documents may include parallel corpora or comparable corpora in any language pair, including the Chinese-English pair.

Preferably, the method may further comprise registering the extracted transliteration pairs in a pre-built unregistered-word dictionary DB.

Preferably, the method may further comprise performing a cross-language search service using the transliteration pairs registered in the unregistered-word dictionary DB.

Preferably, the method may further comprise performing a machine translation service using the transliteration pairs registered in the unregistered-word dictionary DB.

A fourth aspect of the present invention provides a computer-readable recording medium on which is recorded a program for executing on a computer the above-described method for automatically extracting transliteration pairs from bilingual documents.

A fifth aspect of the present invention provides an apparatus for automatically extracting transliteration pairs from bilingual documents, comprising: a bilingual document extraction module for extracting bilingual documents found on the Internet and separating each into monolingual documents, one per language; a transliteration candidate word extraction module for extracting transliteration candidate words from one of the monolingual documents separated by the bilingual document extraction module; and a transliteration pair extraction module for selecting a candidate word to be transliterated from the candidate words extracted by the transliteration candidate word extraction module and then using a dynamic window technique or a tokenizer technique to automatically extract from the other monolingual document the transliteration pair corresponding to the selected candidate word.

A sixth aspect of the present invention provides an apparatus for automatically extracting transliteration pairs from bilingual documents, comprising: a bilingual document extraction module for extracting bilingual documents found on the Internet and separating each into monolingual documents, one per language; a transliteration candidate word extraction module for extracting transliteration candidate words from one of the monolingual documents separated by the bilingual document extraction module; and a transliteration pair extraction module for selecting a candidate word to be transliterated from the candidate words extracted by the transliteration candidate word extraction module, splitting each sentence of the other monolingual document into several segments at characters that are not used in transliteration, and applying a dynamic window technique to each segment to automatically extract from the other monolingual document the transliteration pair corresponding to the selected candidate word.

Preferably, the apparatus may further comprise an unregistered-word dictionary DB for storing the transliteration pairs extracted by the transliteration pair extraction module.

Preferably, the unregistered-word dictionary DB may be used for a cross-language search service for queries entered over the Internet.

Preferably, the unregistered-word dictionary DB may be used in performing machine translation service tasks.

According to the method and apparatus for automatically extracting transliteration pairs from bilingual documents described above, transliteration pairs are extracted automatically from the large volume of bilingual documents on the Web to build a large-scale unregistered-word dictionary DB, which can be applied to cross-language search, machine translation services, and the like to improve their performance.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The embodiments illustrated below may be modified in various ways, and the scope of the present invention is not limited to them. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art.

FIG. 1 is an overall block diagram of an apparatus for automatically extracting transliteration pairs from bilingual documents according to an embodiment of the present invention.

Referring to FIG. 1, the apparatus according to an embodiment of the present invention broadly comprises a bilingual document extraction module (100), a transliteration candidate word extraction module (200), and a transliteration pair extraction module (300).

The bilingual document extraction module (100) extracts bilingual documents, written in two languages, from the large volume of Web pages published over the Internet, and separates each into monolingual documents, one per language (e.g., an English document and a Chinese document).

The bilingual documents may include, for example, parallel corpora or comparable corpora in any language pair, including the Chinese-English pair.

The transliteration candidate word extraction module (200) extracts transliteration candidate words (e.g., proper nouns) from one of the monolingual documents (e.g., the English document) separated by the bilingual document extraction module (100).

The transliteration pair extraction module (300) selects a candidate word to be transliterated from the candidate words extracted by the transliteration candidate word extraction module (200) and then uses the dynamic window technique or the tokenizer technique, both described later, to automatically extract from the other monolingual document (e.g., the Chinese document) the transliteration pair corresponding to the selected candidate word.

Alternatively, after selecting the candidate word to be transliterated, the transliteration pair extraction module (300) may split each sentence of the other monolingual document (e.g., the Chinese document) into several segments at characters that are not used in transliteration (e.g., punctuation marks, digits, spaces, English letters, particles) and apply the dynamic window technique to each segment to automatically extract the transliteration pair corresponding to the selected candidate word.

In addition, a large-scale unregistered-word dictionary DB (400) may be provided to store the transliteration pairs extracted by the transliteration pair extraction module (300).

The unregistered-word dictionary DB (400) may be used for a cross-language search service (e.g., for Chinese queries) entered from a user terminal (PC) connected via the Internet.

The unregistered-word dictionary DB (400) may also be used, for example, by a machine translation service for real-time translation on the Web.
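How such a dictionary DB might serve lookups in both directions can be sketched with an in-memory structure. Everything here is an assumption for illustration: the class name `UnregisteredWordDictionary`, its methods, and the sample pair are hypothetical, and a real deployment would presumably use persistent storage.

```python
class UnregisteredWordDictionary:
    """In-memory stand-in for the unregistered-word dictionary DB (400)."""

    def __init__(self):
        self._source_to_target = {}
        self._target_to_source = {}

    def register(self, source_word, target_word):
        """Store an extracted transliteration pair, indexed in both directions."""
        key = source_word.lower()
        self._source_to_target.setdefault(key, set()).add(target_word)
        self._target_to_source.setdefault(target_word, set()).add(source_word)

    def lookup(self, query):
        """Return the counterparts of a query given in either language."""
        found = self._source_to_target.get(query.lower(), set())
        found = found | self._target_to_source.get(query, set())
        return sorted(found)


db = UnregisteredWordDictionary()
db.register("Clinton", "克林顿")  # sample pair chosen for illustration
```

A cross-language search front end could call `lookup` on a Chinese query to obtain the English term to search for, and a translation service could call it in the other direction.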

To achieve the above object, transliteration pair extraction that is close to 100% accurate is required. The transliteration pair extraction module (300) of the present invention extracts candidate words only from the language in which candidate extraction works well (e.g., the English document) and, based on each candidate word, extracts the corresponding transliteration pair from the document in the other language, where candidate extraction works poorly (e.g., the Chinese document), using a technique such as the dynamic window technique or the tokenizer technique. This raises performance from the less than about 90% of existing methods to about 99%, so that the previously manual construction of unregistered-word dictionary DBs can be fully automated, large-scale dictionary DBs can be built, and the performance of cross-language search, machine translation services, and the like can be greatly improved.

The following describes in detail, taking Chinese and English as an example, how transliteration pairs are extracted from bilingual documents.

In the method of the present invention for automatically extracting transliteration pairs from bilingual (English-Chinese) documents, the transliteration candidate word extraction module (200) is first applied to the English sentences of an English-Chinese parallel corpus to extract transliteration candidate words, that is, proper nouns. From these, only the English words to be transliterated are selected, and the transliteration pair is extracted from the corresponding Chinese sentence.

Next, in Chinese transliteration pair extraction, the romanization of Chinese characters, that is, Pinyin, is generally used for comparison with English words. For example, the Chinese word for "Clinton" is first converted into its Pinyin, "KeLinDun", and then the phonetic similarity between the English word "Clinton" and the Chinese Pinyin "KeLinDun" is computed and compared.
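The conversion to Pinyin can be sketched with a toy lookup table. The table `PINYIN` below contains only three entries chosen for this illustration (the characters are an assumed simplified-script spelling of the Chinese word for "Clinton"), and the function name `to_pinyin` is hypothetical; a practical system would need a full character-to-Pinyin resource.

```python
# Toy character-to-Pinyin table; the entries are illustrative only.
PINYIN = {"克": "ke", "林": "lin", "顿": "dun"}


def to_pinyin(word):
    """Romanize a Chinese word, capitalizing each syllable."""
    return "".join(PINYIN[ch].capitalize() for ch in word)


romanized = to_pinyin("克林顿")
```

The romanized string can then be compared with the English word by whatever phonetic similarity measure the extraction step uses.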

In the present invention, E denotes English, C denotes Chinese, and TU (Transliteration Unit) denotes the unit of transliteration. The conditional probability P(C|E), here the probability of the Chinese word for "Clinton" given "Clinton", can then be recast as the problem of computing the probability P(KeLinDun|Clinton).

In addition, in the present invention English uses unigrams, bigrams, and trigrams as TUs, while Chinese may use the first syllable, the last syllable, or the whole of the Pinyin as a TU. Based on this definition of the TU, the probability of the Chinese word given "Clinton" can be approximated by splitting it into several TUs, as in Equation 1.
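The idea of approximating the word-level probability by a product over TUs can be sketched as follows. The alignment and every probability value below are made up for illustration; only the term P(ke|C) is mentioned in the text, and the exact form of Equation 1 is not reproduced here. The names `tu_prob` and `word_probability` are hypothetical.

```python
import math

# Hypothetical TU translation probabilities P(chinese_tu | english_tu).
tu_prob = {
    ("C", "ke"): 0.5,
    ("lin", "lin"): 0.8,
    ("ton", "dun"): 0.4,
}


def word_probability(alignment, table):
    """Multiply TU probabilities along one alignment, working in log space."""
    log_p = sum(math.log(table[pair]) for pair in alignment)
    return math.exp(log_p)


# One assumed alignment of "Clinton" with "KeLinDun".
alignment = [("C", "ke"), ("lin", "lin"), ("ton", "dun")]
p = word_probability(alignment, tu_prob)
```

Summing logarithms instead of multiplying raw probabilities is a common way to avoid numerical underflow when words are split into many TUs.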

FIG. 2 illustrates the transliteration alignment model applied in an embodiment of the present invention, showing how an English word and Chinese Pinyin are aligned.

Referring to FIG. 2, the present invention adds further information, called the match type (M), to Equation 1. The match type (M) may be defined by the size of the English TU and the size of the Chinese TU.

For example, the match type of P(ke|C) in Equation 1 is "2-1", because the English TU "C" has size 1 and the Chinese TU "ke" has size 2. The match type (M) compensates for new parameters that were not learned during parameter estimation, so better performance can be expected when the "statistical transliteration model" is applied without a pronunciation dictionary, as in the prior art described above.
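The match type defined above is simple to compute. In this minimal sketch the function name `match_type` is hypothetical, and TU size is taken to be the number of letters, which agrees with the "2-1" example in the text.

```python
def match_type(english_tu, chinese_tu):
    """Return the match type as '<Chinese TU size>-<English TU size>'."""
    return f"{len(chinese_tu)}-{len(english_tu)}"


m = match_type("C", "ke")
```

Because many different TU pairs share the same match type, a probability attached to the match type can still be estimated for a pair that never appeared in training.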

An embodiment of the present invention estimates the parameters automatically by applying a conventional statistical transliteration model without a pronunciation dictionary, and adds match type (M) information to that model.

That is, adding match type (M) information to Equation 1 yields Equations 2 and 3 below.

Here, u and v denote an English TU and a Chinese TU, respectively, and m denotes the match type of u and v.
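Equations 2 and 3 are not reproduced in this text, so their exact form is not known here. The Python sketch below is only one plausible reading: it assumes that each aligned TU pair (u, v) contributes a TU probability and a match-type probability, and that the contributions are summed in log space. The tables `P_TU` and `P_MATCH`, the `floor` value for unseen entries, and the function name are all made up for illustration.

```python
import math
from typing import Dict, Iterable, Tuple


def alignment_log_score(
    aligned_tus: Iterable[Tuple[str, str]],
    p_tu: Dict[Tuple[str, str], float],
    p_match: Dict[str, float],
    floor: float = 1e-6,
) -> float:
    """Sum log P(v|u) + log P(m) over aligned (English TU, Chinese TU) pairs.

    Assumed form only; unseen entries fall back to a small `floor` probability.
    """
    total = 0.0
    for u, v in aligned_tus:
        m = f"{len(v)}-{len(u)}"  # match type of the pair
        total += math.log(p_tu.get((u, v), floor))
        total += math.log(p_match.get(m, floor))
    return total


# Toy probability tables (invented numbers).
P_TU = {("C", "ke"): 0.5}
P_MATCH = {"2-1": 0.5, "3-3": 0.25}

seen = alignment_log_score([("C", "ke")], P_TU, P_MATCH)
# Both pairs below are unseen as TU pairs, but the first has a known match type.
unseen_known_type = alignment_log_score([("K", "ke")], P_TU, P_MATCH)
unseen_unknown_type = alignment_log_score([("K", "kelin")], P_TU, P_MATCH)
```

In this toy setup the two unseen TU pairs receive different scores only because of the match-type term, which is the kind of compensation for unlearned parameters that the text attributes to the match type.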

FIG. 3 illustrates the process of extracting a transliteration pair from a sentence using the statistical transliteration model applied in an embodiment of the present invention. It is an example of finding the correct transliteration "KeLinDun" for the English word "Clinton" in the corresponding Chinese sentence.

Referring to FIG. 3, when a conventional statistical transliteration model is used to extract transliteration pairs, errors occur frequently if a sentence contains several Chinese strings that sound similar to the given English word. An embodiment of the present invention uses a dynamic window technique and a tokenizer technique to resolve such errors.

FIG. 4 illustrates the theoretical basis of the dynamic window technique applied in an embodiment of the present invention, and FIG. 5 illustrates the process of extracting the correct transliteration using the dynamic window technique.

Referring to FIGS. 4 and 5, the dynamic window technique does not search the Chinese sentence for an optimal path in a single pass. Instead, it sets a range of window lengths covering the possible sizes of the Chinese transliteration of the given English word, and slides a window of each length in that range forward through the sentence while searching for the transliteration pair.

If the actual length of the Chinese transliteration were known and used as the window size, very high performance could be achieved. For example, as shown in FIG. 4, the English word "Clinton" was aligned with the correct transliteration "KeLinDun", with "KeLinYiDun", which contains one extra character, and with "LinDun", which is missing one character. The score was highest for the alignment with the correct transliteration.

This is because the more accurate the transliteration, the better the English TUs and Chinese TUs align. This property is not specific to Chinese; it is common to other languages as well.

However, the exact size of the Chinese transliteration is difficult to predict, so an embodiment of the present invention predicts the possible range of transliteration sizes by analyzing the distribution of English word lengths against Chinese word lengths in the training data.

The dynamic window is therefore applied as follows. First, a window length range is predicted from the given English word. Then a window of each length in the predicted range is slid forward, and Equation 3 is used to compute the phonetic similarity probability between the given English word and the Chinese string covered by the current window.

When the English word has length L, the window length range is preferably set from L/3 to L.

In this way the window size is increased step by step, the Chinese string with the highest probability is found, and the transliteration pair is extracted by backtracking through that string. As shown in FIG. 5, scores generally decrease as the window grows, so scores from different windows are normalized by window size before they are compared. Applying the dynamic window technique in this way effectively resolves most of the errors that arise when the statistical transliteration model is applied.
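The procedure above can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the patented implementation: the Chinese sentence is represented as a list of Pinyin syllables, the lower bound L/3 is rounded up, the scorer is passed in as a function returning a log score, and normalization is a plain division by window size. The toy scorer and its numbers are invented stand-ins for the statistical transliteration model.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence


def window_length_range(english_word: str) -> range:
    """Window sizes from L/3 (rounded up, an assumption) to L inclusive."""
    length = len(english_word)
    return range(max(1, math.ceil(length / 3)), length + 1)


def dynamic_window(
    english_word: str,
    syllables: Sequence[str],
    log_score: Callable[[str, List[str]], float],
) -> Optional[List[str]]:
    """Slide a window of every allowed size and keep the best normalized score."""
    best = None  # (normalized score, window contents)
    for size in window_length_range(english_word):
        for start in range(len(syllables) - size + 1):
            window = list(syllables[start:start + size])
            # Larger windows score lower overall, so normalize by window size.
            normalized = log_score(english_word, window) / size
            if best is None or normalized > best[0]:
                best = (normalized, window)
    return None if best is None else best[1]


# Toy scorer: invented per-syllable log scores; unknown syllables are penalized.
TOY_LOGP: Dict[str, float] = {"Ke": -0.5, "Lin": -0.2, "Dun": -0.4}


def toy_log_score(english_word: str, window: List[str]) -> float:
    return sum(TOY_LOGP.get(s, -5.0) for s in window)


sentence = ["Mei", "Guo", "Zong", "Tong", "Ke", "Lin", "Dun", "Shuo"]
match = dynamic_window("Clinton", sentence, toy_log_score)
```

With a real model, `log_score` would be the alignment score of the English word against the window contents; the search loop itself would stay the same.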

FIG. 6 illustrates the process of extracting the correct transliteration using the tokenizer technique applied in an embodiment of the present invention.

Referring to FIG. 6, the tokenizer technique, which relies on splitting the sentence into segments, first divides the Chinese sentence into several parts at characters that are never used in Chinese transliteration, and then applies the statistical transliteration model to each part to extract the transliteration pair.

Specifically, Chinese has a set of characters that are frequently used in transliteration, such as "施(shi), 德(de), 勒(le), 赫(he), ...", and another set of characters with similar pronunciations that are never used in transliteration, such as "是(shi), 的(de), 了(le), 和(he), ...".

These characters are usually particles and often appear next to proper nouns, so they frequently combine with the correct transliteration and cause errors. For example, "David" is transliterated as "DaWei", omitting the final "d" sound.

If such a noun is followed by a particle such as "的" (De), whose pronunciation resembles the omitted "d", the string may be misrecognized as "DaWeiDe". The conventional technique described above addresses this problem to some extent through rule-based post-processing that removes infrequently used characters found at either end of the extracted transliteration.

However, such post-processing still cannot resolve an error like the one in FIG. 6. There, the particle "是" (Shi) combines with another character, so that "者是" (ZheShi), which is phonetically similar to the English word "Jacey", is recognized instead of the correct transliteration "杰西" (JieXi).

Even if post-processing removes "是", the remaining "者" is not the correct transliteration. In the tokenizer technique of the present invention, characters such as "是" (Shi) belong to the set of characters never used in transliteration, so they are removed in advance.

The phonetic similarity between the remaining character "者" (Zhe) and "Jacey" is then much lower than that between "杰西" (JieXi) and "Jacey", so the correct transliteration pair can be extracted. In addition, dividing the whole sentence into several parts with the tokenizer technique greatly reduces time complexity.
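The splitting step can be sketched in Python as follows. The stop set contains only the four example characters named in the text; a real set would be much larger and is not given here. The function name `tokenize` and the test string "者是杰西" are hypothetical, the latter built only from characters mentioned above rather than taken from FIG. 6.

```python
from typing import List, Set

# Example characters the text lists as never used in transliteration.
NON_TRANSLITERATION_CHARS: Set[str] = set("是的了和")


def tokenize(sentence: str, stop_chars: Set[str] = NON_TRANSLITERATION_CHARS) -> List[str]:
    """Split `sentence` at stop characters and drop the stop characters themselves."""
    parts: List[str] = []
    current: List[str] = []
    for ch in sentence:
        if ch in stop_chars:
            if current:  # close the segment collected so far
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


# Hypothetical fragment: "是" separates "者" from "杰西".
segments = tokenize("者是杰西")
```

Because the transliteration model then only sees "者" and "杰西" separately, the misleading candidate "者是" can no longer be formed.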

The characters never used in transliteration are, taking Korean as an example, punctuation marks, digits, spaces, English letters, and the like; taking Chinese as an example, they are particles and the like.

As described above, the dynamic window technique and the tokenizer technique solve different problems, so applying the two together not only yields higher performance but also greatly reduces time complexity.

That is, when the two techniques are applied together, the transliteration pair extraction module (300, see FIG. 1) selects a candidate word to be transliterated from the candidate words extracted by the transliteration candidate word extraction module (200, see FIG. 1), divides each sentence in the other monolingual document (e.g., the Chinese document) into several parts at characters not used in transliteration (e.g., particles), and applies the dynamic window technique to each part to automatically extract, from the other monolingual document, the transliteration pair corresponding to the selected candidate word.
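A compact Python sketch of the combined procedure is given below, with the same caveats as any toy example: the stop set holds only the example particles from the text, the scorer is an invented stand-in for the statistical transliteration model, L/3 is rounded up, segments shorter than the minimum window are simply skipped, and the sentence "者是杰西" is a made-up string. The toy scores are chosen so that the wrong candidate wins without the tokenizer, mirroring the kind of error described for FIG. 6.

```python
import math
from typing import Callable, List, Optional, Tuple

Scorer = Callable[[str, str], float]

# Example characters the text lists as never used in transliteration.
STOP_CHARS = set("是的了和")


def tokenize(sentence: str) -> List[str]:
    """Split the sentence at stop characters, discarding them."""
    parts, current = [], ""
    for ch in sentence:
        if ch in STOP_CHARS:
            if current:
                parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def best_window(word: str, text: str, log_score: Scorer) -> Optional[Tuple[float, str]]:
    """Dynamic window search over one segment; returns (normalized score, string)."""
    lowest = max(1, math.ceil(len(word) / 3))  # rounding up is an assumption
    best = None
    for size in range(lowest, len(word) + 1):
        for start in range(len(text) - size + 1):
            candidate = text[start:start + size]
            normalized = log_score(word, candidate) / size
            if best is None or normalized > best[0]:
                best = (normalized, candidate)
    return best


def extract_pair(word: str, sentence: str, log_score: Scorer,
                 use_tokenizer: bool = True) -> Optional[Tuple[str, str]]:
    """Tokenize (optionally), search each segment, and keep the best candidate."""
    parts = tokenize(sentence) if use_tokenizer else [sentence]
    results = [r for r in (best_window(word, p, log_score) for p in parts) if r]
    if not results:
        return None
    return (word, max(results, key=lambda r: r[0])[1])


# Invented per-character log scores in which "者是" looks better than "杰西".
TOY = {"者": -0.2, "是": -0.2, "杰": -0.4, "西": -0.4}


def toy_score(word: str, candidate: str) -> float:
    return sum(TOY.get(ch, -5.0) for ch in candidate)


sentence = "者是杰西"
without_tok = extract_pair("Jacey", sentence, toy_score, use_tokenizer=False)
with_tok = extract_pair("Jacey", sentence, toy_score, use_tokenizer=True)
```

Searching each short segment separately also means fewer window positions are scored than when the whole sentence is scanned, which is in line with the reduction in time complexity mentioned above.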

Although an embodiment of the present invention uses Chinese and English as an example, the invention is not limited to this pair and can be applied to any language pair.

The effectiveness of the dynamic window and tokenizer techniques applied in an embodiment of the present invention was verified experimentally, as described below.

For the experiment, 300 sentences containing various transliteration pairs, such as place names, personal names, and product names, were selected from an English-Chinese parallel corpus. The training data consisted of 860 English-Chinese transliteration word pairs.

As shown in Table 1 below, applying the dynamic window and tokenizer techniques improved performance by about 12% over the existing method, reaching a high accuracy of about 99%.

Table 1
Method | Word accuracy | Character accuracy | Character recall
Statistical transliteration model (STM) | 75.33% | 86.65% | 91.11%
STM + dynamic window (DW) + tokenizer (TOK) | 99.00% | 99.78% | 99.72%
STM + existing method | 87.99% | 90.17% | 91.11%

As described above, the method and apparatus for automatically extracting transliteration pairs from bilingual documents according to an embodiment of the present invention make it possible to build a large unregistered-word (out-of-vocabulary) dictionary DB automatically, which will play a significant role in improving the performance of cross-language search and machine translation systems.

For example, Google currently provides a cross-language search service between Korean and English. When the Korean word "구글" (Google) is entered as a query, it is translated into the English words "nine writing" and the search is run on that translation.

In this case the results differ entirely from what the user intended. If the unregistered-word dictionary DB built by the method and apparatus of the present invention is used, "구글" can be translated correctly as "Google", so accurate search results can be obtained.

As another example, translating the Korean sentence "나는 구글을 좋아한다" ("I like Google") into English with Google's machine translation service yields the meaningless result "I like nine writings".

If the present invention is applied here, the sentence can be translated correctly as "I like Google". Until now, such unregistered-word dictionary DBs have been built manually at a large cost in labor and time. The present invention fully automates this manual process and can therefore build a large unregistered-word dictionary DB of practical value. This will play a significant role in fields such as cross-language search and machine translation.

The method for automatically extracting transliteration pairs from bilingual documents according to an embodiment of the present invention can also be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system.

Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, hard disks, floppy disks, removable storage devices, non-volatile memory (flash memory), and optical data storage devices, as well as implementations in the form of a carrier wave (for example, transmission over the Internet).

The computer-readable recording medium can also be distributed over computer systems connected by a computer network, so that the computer-readable code is stored and executed in a distributed manner.

Preferred embodiments of the method and apparatus for automatically extracting transliteration pairs from bilingual documents according to the present invention have been described above, but the present invention is not limited to them. It can be practiced with various modifications within the scope of the claims, the detailed description, and the accompanying drawings, and such modifications also fall within the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an overall block diagram of an apparatus for automatically extracting transliteration pairs from bilingual documents according to an embodiment of the present invention.

FIG. 2 illustrates the transliteration alignment model applied in an embodiment of the present invention.

FIG. 3 illustrates the process of extracting a transliteration pair from a sentence using the statistical transliteration model applied in an embodiment of the present invention.

FIG. 4 illustrates the theoretical basis of the dynamic window technique applied in an embodiment of the present invention.

FIG. 5 illustrates the process of extracting the correct transliteration using the dynamic window technique applied in an embodiment of the present invention.

FIG. 6 illustrates the process of extracting the correct transliteration using the tokenizer technique applied in an embodiment of the present invention.

Claims

delete

A method for automatically extracting transliteration pairs from bilingual documents, the method comprising:

extracting bilingual documents, written in two languages, that exist on the Internet and separating each into monolingual documents written in the respective single languages;

extracting transliteration candidate words from one of the separated monolingual documents;

selecting a candidate word to be transliterated from the extracted transliteration candidate words, and then dividing each sentence in the other monolingual document into several parts based on characters that are not used in transliteration; and

automatically extracting, from the other monolingual document, a transliteration pair corresponding to the selected transliteration candidate word by applying a dynamic window technique to each divided part,

wherein the dynamic window technique comprises:

setting a range of window lengths that the transliteration may take in the other single language, based on the length of the selected transliteration candidate word;

obtaining maximum phonetic similarity probability values between the selected transliteration candidate word and the character string covered by the current window, while moving each window within the set window length range forward; and

automatically extracting the transliteration pair by backtracking the character string corresponding to the largest of the maximum phonetic similarity probability values of the windows,

and wherein, when the selected transliteration candidate word is an English word, the other single language is Chinese, and the length of the English word is L, the window length range is set from L/3 to L.

The method of claim 6, wherein the phonetic similarity probability value is obtained by Equation 4 below.

Here, C is the character string covered by the current window, E is the selected transliteration candidate word, u and v are transliteration units (TUs) of the selected transliteration candidate word and of the character string covered by the current window, respectively, and m is the match type of u and v.

The method of claim 6, wherein the characters not used in transliteration comprise at least one of punctuation marks, digits, spaces, or English letters.

The method of claim 6, wherein, when the other monolingual document is in Chinese, the characters not used in transliteration are particles.

The method of claim 6, wherein the monolingual document from which the transliteration candidate words are extracted is an English document and the other monolingual document is a Chinese document.

The method of claim 6, wherein the transliteration candidate words are proper nouns.

The method of claim 6, wherein the bilingual documents comprise a parallel corpus or a comparable corpus in any language pair, including the Chinese-English language pair.

The method of claim 6, further comprising registering the extracted transliteration pair in a pre-built unregistered-word dictionary DB.

The method of claim 13, further comprising performing a cross-language search service using the transliteration pairs registered in the unregistered-word dictionary DB.

The method of claim 13, further comprising performing a machine translation service using the transliteration pairs registered in the unregistered-word dictionary DB.

A computer-readable recording medium having recorded thereon a program capable of executing the method of any one of claims 6 to 15.

delete

An apparatus for automatically extracting transliteration pairs from bilingual documents, the apparatus comprising:

a bilingual document extraction module that extracts bilingual documents, written in two languages, that exist on the Internet and separates each into monolingual documents written in the respective single languages;

a transliteration candidate word extraction module that extracts transliteration candidate words from one of the monolingual documents separated by the bilingual document extraction module; and

a transliteration pair extraction module that selects a candidate word to be transliterated from the transliteration candidate words extracted by the transliteration candidate word extraction module, divides each sentence in the other monolingual document into several parts based on characters not used in transliteration, and automatically extracts, from the other monolingual document, a transliteration pair corresponding to the selected transliteration candidate word by applying a dynamic window technique to each divided part,

wherein the dynamic window technique sets a range of window lengths that the transliteration may take in the other single language, based on the length of the selected transliteration candidate word, obtains maximum phonetic similarity probability values between the selected transliteration candidate word and the character string covered by the current window while moving each window within the set window length range forward, and automatically extracts the transliteration pair by backtracking the character string corresponding to the largest of the maximum phonetic similarity probability values of the windows,

and wherein, when the selected transliteration candidate word is an English word, the other single language is Chinese, and the length of the English word is L, the window length range is set from L/3 to L.

The apparatus of claim 20, wherein the phonetic similarity probability value is obtained by Equation 5 below.

delete

The apparatus of claim 20, wherein the monolingual document from which the transliteration candidate words are extracted is an English document and the other monolingual document is a Chinese document.

The apparatus of claim 20, wherein the transliteration candidate words are proper nouns.

The apparatus of claim 20, wherein the bilingual documents comprise a parallel corpus or a comparable corpus in any language pair, including the Chinese-English language pair.

The apparatus of claim 20, further comprising an unregistered-word dictionary DB that stores the transliteration pairs extracted by the transliteration pair extraction module.

The apparatus of claim 28, wherein the unregistered-word dictionary DB is used for a cross-language search service for queries entered over the Internet.

The apparatus of claim 28, wherein the unregistered-word dictionary DB is used to perform a machine translation service.

extracting transliteration candidate words from one of the separated monolingual documents; and

selecting a candidate word to be transliterated from the extracted transliteration candidate words, and then automatically extracting, from the other monolingual document, a transliteration pair corresponding to the selected transliteration candidate word by using a dynamic window technique based on the selected transliteration candidate word,

The dynamic window technique,

The method of claim 31, wherein the phonetic similarity probability value is calculated by Equation 6 below.

a transliteration pair extraction module that selects a candidate word to be transliterated from the transliteration candidate words extracted by the transliteration candidate word extraction module, and then automatically extracts, from the other monolingual document, a transliteration pair corresponding to the selected transliteration candidate word by using a dynamic window technique or a tokenizer technique,

The dynamic window technique,

The apparatus of claim 33, wherein the phonetic similarity probability value is obtained by Equation 7 below.