KR20110062867A

KR20110062867A - Apparatus and method for constructing terms list of source language-target language

Info

Publication number: KR20110062867A
Application number: KR1020090119720A
Authority: KR
Inventors: 김운; 권오욱; 오영순; 김창현; 서영애; 양성일; 황금하; 최승권; 이기영; 노윤형; 박은진; 김영길; 박상규
Original assignee: 한국전자통신연구원
Priority date: 2009-12-04
Filing date: 2009-12-04
Publication date: 2011-06-10

Abstract

PURPOSE: An apparatus and method for constructing a source language-target language term list are provided, which build the list by mining search results obtained through an intermediate (pivot) language. CONSTITUTION: A document search unit (105) collects search results that contain the source language. A data mining unit (107) extracts source language-intermediate language parallel sentences or words. A morphological analysis unit (109) analyzes the extracted source language-intermediate language parallel sentences. A word alignment unit (111) aligns the parallel sentences word by word. A term generation unit (113) generates the source language-target language term list.

Description

APPARATUS AND METHOD FOR CONSTRUCTING TERMS LIST OF SOURCE LANGUAGE-TARGET LANGUAGE

The present invention relates to an apparatus and method for constructing a source language-target language term list, and more particularly to an apparatus and method that build a term list, in which each source language term is matched with its target language equivalent, by mining search results obtained using an intermediate language (pivot language).

The present invention is derived from research conducted as part of the IT Growth Engine Technology Development Program of the Ministry of Knowledge Economy [Project management number: 2009-S-034-01, Project title: Development of Korean-Chinese-English automatic translation technology for conversational speech and corporate documents].

In recent years, as technology has advanced rapidly, technical exchange across languages has intensified. Intensified technical exchange between particular languages drives demand for automatic translation technology in specialized fields; the demand for automatic translation of patents and papers is one example. In specialized-field translation, however, the list of technical terms, which largely determines translation quality, is difficult to build, and the frequent appearance of new technical terms degrades translation performance.

In the prior art, two main approaches have been used to improve the performance of machine translation systems.

First, a source language corpus is collected, new words not found in the dictionary (including technical terms) are extracted from it with a new-word extraction tool, and professional translators assign target language equivalents to the extracted words, thereby building a technical term list. More recently, new words have also been extracted from parallel (bilingual) corpora consisting of source language and target language pairs. These methods suit domains where documents are easy to collect, such as news, and in specialized technical fields they presuppose that source language documents can be collected. Specialized documents, however, are hard to collect because they are technically protected, and collecting specialized documents in a language such as Chinese is harder still. Moreover, even for a translator fluent in both languages, selecting technical terms and assigning their equivalents requires considerable domain expertise, so building a high-quality term list is difficult.

Second, the target language equivalent of a new word can be estimated through web mining. This approach, attempted in cross-language information retrieval, runs a web search for an already known new word and estimates and extracts its equivalent from patterns of the form "source language word (target language equivalent)" in the search results. However, the source language new word must be known in advance, and an equivalent can be estimated and extracted only if such a pattern actually appears in the web search results.

The present invention is proposed to solve these problems of the prior art. By mining search results obtained using an intermediate language to build a source language-target language term list, it makes it easy to build a list of technical terms missing from the dictionary, particularly for specialized-field automatic translation systems between non-English languages (e.g., Chinese and Korean).

According to a first aspect of the present invention, an apparatus for constructing a source language-target language term list may include: a term extraction unit that extracts intermediate language terms from an intermediate language-target language term list; a document search unit that collects, from the results of a data search using an extracted intermediate language term as the search query, the search results that contain the source language; a data mining unit that mines the search results to extract source language-intermediate language parallel sentences or words; a morphological analysis unit that performs source language morphological analysis and intermediate language morphological analysis, respectively, on the extracted parallel sentences or words; a word alignment unit that aligns the morphologically analyzed parallel sentences or words word by word, thereby aligning each intermediate language term in the search results with the source language equivalent that co-occurs with it; and a term generation unit that selects source language-intermediate language term pairs from the aligned terms according to preset selection criteria, pairs each selected source language term with the corresponding target language equivalent, and generates the source language-target language term list.

Here, when an intermediate language sentence in a search result contains the search query, the data mining unit may extract the source language sentence together with the intermediate language sentence containing the search query.

The data mining unit may estimate, using sentence length or punctuation, whether the source language sentence and the intermediate language sentence correspond to each other, and may filter out search results in which they do not.

When the search query appears inside a source language sentence in a search result, the data mining unit may extract the word immediately preceding it as the presumed corresponding source language term.

The term generation unit may use, as the selection criteria, at least one of: the frequency of the source language term, whether the source language term contains unnecessary particles, and the difference in word size between the intermediate language term and the source language term.

According to a second aspect of the present invention, a method for constructing a source language-target language term list may include: extracting intermediate language terms from an intermediate language-target language term list; collecting, from the results of a data search using an extracted intermediate language term as the search query, the search results that contain the source language; mining the search results to extract source language-intermediate language parallel sentences or words; performing source language morphological analysis and intermediate language morphological analysis, respectively, on the extracted parallel sentences or words; aligning the morphologically analyzed parallel sentences or words word by word, thereby aligning each intermediate language term in the search results with the source language equivalent that co-occurs with it; and selecting source language-intermediate language term pairs from the aligned terms according to preset selection criteria, pairing each selected source language term with the corresponding target language equivalent, and generating the source language-target language term list.

Here, in the step of extracting the source language-intermediate language parallel sentences or words, when an intermediate language sentence in a search result contains the search query, the source language sentence may be extracted together with the intermediate language sentence containing the search query.

In the step of extracting the source language-intermediate language parallel sentences or words, whether the source language sentence and the intermediate language sentence correspond to each other may be estimated using sentence length or punctuation, and search results in which they do not correspond may be filtered out.

In the step of extracting the source language-intermediate language parallel sentences or words, when the search query appears inside a source language sentence in a search result, the word immediately preceding it may be extracted as the presumed corresponding source language term.

In the step of generating the source language-target language term list, at least one of the following may be used as the selection criteria: the frequency of the source language term, whether the source language term contains unnecessary particles, and the difference in word size between the intermediate language term and the source language term.

According to embodiments of the present invention, building a source language-target language term list by mining search results obtained using an intermediate language makes it easy to build term lists for automatic translation systems. In particular, a list of technical terms missing from the dictionary can be built easily for specialized-field automatic translation systems between non-English languages (e.g., Chinese and Korean), which improves translation performance.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the invention belongs is fully informed of its scope, which is defined only by the claims. Like reference numerals refer to like elements throughout the specification.

In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention. The terms used below are defined in consideration of their functions in the embodiments and may vary according to the intentions or customs of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

Combinations of the blocks of the accompanying block diagram and the steps of the flowchart may be performed by computer program instructions. Because these computer program instructions may be loaded onto a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, the instructions executed by that processor create means for performing the functions described in each block of the block diagram or each step of the flowchart. Because the computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, the instructions stored in that memory can produce an article of manufacture containing instruction means that perform the functions described in each block of the block diagram or each step of the flowchart. Because the computer program instructions may also be loaded onto a computer or other programmable data processing equipment, a series of operational steps can be performed on that equipment to create a computer-executed process, so that the instructions executed on the equipment provide steps for carrying out the functions described in each block of the block diagram and each step of the flowchart.

In addition, each block or step may represent a module, segment, or portion of code that includes one or more executable instructions for performing the specified logical function(s). It should also be noted that in some alternative embodiments the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially concurrently, or they may sometimes be performed in reverse order, depending on the functionality involved.

The apparatus and method for constructing a source language-target language term list according to the present invention can be applied not only to building term lists for automatic translation systems between non-English languages (e.g., Chinese and Korean) but also to building term lists between any pair of languages, regardless of the languages involved. To aid understanding, the following describes an embodiment that uses English as the intermediate language to build a Chinese-Korean term list. That is, in this embodiment English is the intermediate language, Chinese is the source language, and Korean is the target language.

FIG. 1 is a block diagram of an apparatus for constructing a source language-target language term list according to an embodiment of the present invention.

As shown in the figure, the apparatus for constructing a source language-target language term list includes an English-Korean term dictionary (101), an English term extraction unit (103), a document search unit (105), a data mining unit (107), a morphological analysis unit (109), a word alignment unit (111), a Chinese-Korean term generation unit (113), a Chinese-Korean term dictionary (115), and the like.

The English term extraction unit (103) extracts English terms from the English-Korean term list stored in the English-Korean term dictionary (101) and provides them to the document search unit (105).

The document search unit (105) collects, from the results of a data search that uses an English term extracted by the English term extraction unit (103) as the search query, the search results that contain Chinese, and provides them to the data mining unit (107).

The data mining unit (107) mines the search results from the document search unit (105), extracts Chinese-English parallel sentences or words, and provides them to the morphological analysis unit (109). When an English sentence in a search result contains the search query, it extracts the Chinese sentence together with the English sentence containing the query. It estimates, using sentence length or punctuation, whether the Chinese sentence and the English sentence correspond, and filters out search results in which they do not. When the search query appears inside a Chinese sentence in a search result, it extracts the word immediately preceding the query as the presumed corresponding Chinese term.

The morphological analysis unit (109) performs Chinese morphological analysis and English morphological analysis, respectively, on the Chinese-English parallel sentences or words extracted by the data mining unit (107), and provides the results to the word alignment unit (111).

The word alignment unit (111) aligns the morphologically analyzed Chinese-English parallel sentences or words word by word, thereby aligning each English term in the search results with the Chinese equivalent that co-occurs with it.

The Chinese-Korean term generation unit (113) takes the English terms and co-occurring Chinese equivalents aligned by the word alignment unit (111), selects Chinese-English term pairs according to preset selection criteria, pairs each selected Chinese term with the corresponding Korean equivalent, and generates the Chinese-Korean term list. The selection criteria for Chinese-English terms include the frequency of the Chinese term, whether the Chinese term contains unnecessary particles, and the difference in word size between the English term and the Chinese term.

FIG. 2 is a flowchart illustrating a method for constructing a source language-target language term list according to an embodiment of the present invention.

As shown in the figure, the method for constructing a source language-target language term list includes: extracting English terms from an English-Korean term list (S201); collecting, from the results of a data search using an extracted English term as the search query, the search results that contain Chinese (S203, S205); mining the search results to extract Chinese-English parallel sentences or words (S207 to S211); performing Chinese morphological analysis and English morphological analysis, respectively, on the extracted Chinese-English parallel sentences or words (S213); aligning the morphologically analyzed parallel sentences or words word by word, thereby aligning each English term in the search results with the Chinese equivalent that co-occurs with it (S215); and selecting Chinese-English term pairs from the aligned terms according to preset selection criteria, pairing each selected Chinese term with the corresponding Korean equivalent, and generating the Chinese-Korean term list (S217, S219).

The process by which the apparatus according to an embodiment of the present invention builds a Chinese-Korean term list is described below with reference to FIGS. 1 and 2.

First, the English term extraction unit (103) extracts English terms from the previously built English-Korean term dictionary (101) and passes them to the document search unit (105) (S201). Here, the English-Korean term dictionary (101) is an electronic collection of a term list in which each English term is matched with its Korean equivalent.

The document search unit (105) performs an online data search using an English term received from the English term extraction unit (103) as the English query (S203), collects the search result documents that contain Chinese, and passes them to the data mining unit (107) (S205). For example, when an English query is searched on an academic site whose main language is Chinese (e.g., http://scholar.google.com), Chinese academic documents containing the English term are collected. English is used as the query language because most technical terms originate in English, and because academic papers in non-English languages such as Chinese and Korean typically give the title and abstract in both English and the local language, or write technical terms in the form "local language (English equivalent)". Most Chinese academic documents likewise include English in the title and abstract, and technical documents written in Chinese often present particular technical terms in a "Chinese (English)" pattern, so Chinese documents containing the English term can be retrieved.
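The collection step (S203, S205) can be illustrated with a minimal sketch that keeps only results mentioning the English query and containing Chinese characters. The function names and the use of the basic CJK Unified Ideographs range as the test for "contains Chinese" are assumptions made for illustration; the patent does not specify how Chinese text is detected, and the actual search call is left out.

```python
import re

# Assumption: a character in the basic CJK Unified Ideographs block
# is taken as evidence that the text contains Chinese.
HAN = re.compile(r"[\u4e00-\u9fff]")


def contains_chinese(text: str) -> bool:
    """Return True if the text has at least one Han character."""
    return HAN.search(text) is not None


def collect_chinese_results(query: str, results: list[str]) -> list[str]:
    """Keep results that mention the English query and contain Chinese."""
    q = query.lower()
    return [r for r in results if q in r.lower() and contains_chinese(r)]


results = [
    "机器翻译 (machine translation) 是计算语言学的一个分支",
    "Machine translation is a subfield of computational linguistics",
    "数据挖掘方法综述",
]
kept = collect_chinese_results("machine translation", results)
```

Only the first result survives here: the second has no Chinese and the third does not mention the query.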

The data mining unit (107) mines the search result documents collected by the document search unit (105). If an English sentence in a search result document contains the English query, that is, if the words surrounding the query are English, it extracts the Chinese sentence of the document together with the English sentence containing the query. Whether the Chinese sentence corresponds to the English sentence is estimated from the lengths of the two sentences, their punctuation, and the like. If the condition that a Chinese sentence corresponding to the English sentence exists is not satisfied (S207), that is, if there is no Chinese sentence corresponding to the English sentence, the search result document is removed by filtering (S209).
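A rough sketch of the correspondence check (S207, S209) follows. The patent only says that sentence length and punctuation are used; the specific length ratio bounds, the punctuation sets, and the tolerance below are hypothetical values chosen for illustration.

```python
import re

HAN = re.compile(r"[\u4e00-\u9fff]")
ZH_PUNCT = "，、；："  # assumed set of Chinese clause-level punctuation
EN_PUNCT = ",;:"      # assumed set of English clause-level punctuation


def is_probable_pair(zh: str, en: str,
                     min_ratio: float = 0.3, max_ratio: float = 1.5,
                     max_punct_diff: int = 2) -> bool:
    """Guess whether a Chinese and an English sentence are translations.

    Compares English word count to Chinese character count, then checks
    that clause-level punctuation counts are close. All thresholds are
    illustrative assumptions, not values from the patent.
    """
    zh_len = len(HAN.findall(zh))
    en_len = len(en.split())
    if zh_len == 0 or en_len == 0:
        return False
    ratio = en_len / zh_len
    if not (min_ratio <= ratio <= max_ratio):
        return False
    zh_p = sum(zh.count(c) for c in ZH_PUNCT)
    en_p = sum(en.count(c) for c in EN_PUNCT)
    return abs(zh_p - en_p) <= max_punct_diff


zh = "本文提出了一种基于中间语言的术语抽取方法。"
en = "This paper proposes a term extraction method based on a pivot language."
ok = is_probable_pair(zh, en)          # 12 words vs 20 characters
rejected = is_probable_pair(zh, "Abstract")  # far too short to correspond
```

Pairs that fail the check would be dropped, which mirrors the filtering in S209.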

In addition, if the English query appears inside a Chinese sentence in a search result document, the data mining unit (107) extracts the word immediately preceding it as the presumed corresponding Chinese term. Because Chinese is written without spaces between words, it is hard to tell how much of the preceding Chinese text corresponds to the term. However, since most Chinese words consist of two or three characters, the preceding word is extracted with the extraction size set to, for example, three characters. The Chinese-English parallel sentences or words extracted in this way by the data mining unit (107) are provided to the morphological analysis unit (109) (S211).
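The fixed-size extraction of the preceding Chinese text can be sketched as follows. The function name, the handling of an opening parenthesis or quotation mark before the query, and the Han character range are assumptions; the patent only states that the immediately preceding word is taken with an extraction size of, for example, three characters.

```python
import re

HAN_RUN_AT_END = re.compile(r"[\u4e00-\u9fff]+$")


def preceding_chinese_candidate(sentence: str, query: str, size: int = 3):
    """Return up to `size` Han characters directly before the English query.

    Returns None if the query is absent or is not preceded by Chinese text.
    """
    m = re.search(re.escape(query), sentence, re.IGNORECASE)
    if m is None:
        return None
    # Assumption: skip spaces, opening brackets and quotes before the query.
    prefix = sentence[:m.start()].rstrip(" (（\"“")
    run = HAN_RUN_AT_END.search(prefix)
    if run is None:
        return None
    return run.group(0)[-size:]


candidate = preceding_chinese_candidate("采用语料库(corpus)进行训练", "corpus")
```

With a four-character term such as 数据挖掘, a size of three would cut off the first character, which is one reason the later morphological analysis, alignment, and selection steps are needed to refine the candidates.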

The morphological analysis unit (109) performs Chinese morphological analysis and English morphological analysis, respectively, on the Chinese-English parallel sentences or words received from the data mining unit (107), and provides the results to the word alignment unit (111) (S213). This is done because Chinese sentences are not segmented into words and, for English, in order to extract morphemes. It also raises the alignment rate of the word-level alignment performed afterwards.
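To show why segmentation matters for Chinese, here is a minimal sketch of forward maximum matching, a simple dictionary-based segmentation technique. It is a stand-in chosen for illustration: the patent does not name the morphological analyzer it uses, and the lexicon and `max_len` value below are hypothetical.

```python
def fmm_segment(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Segment text by greedily taking the longest lexicon entry at each position.

    Characters not covered by any lexicon entry are emitted one at a time.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


lexicon = {"数据", "挖掘", "数据挖掘", "方法"}
tokens = fmm_segment("数据挖掘的方法", lexicon)
```

The output of a step like this gives the word alignment stage discrete Chinese tokens to pair with English words.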

The word alignment unit (111) performs word-level alignment on the morphologically analyzed Chinese and English sentences or words. The alignment is carried out with a tool commonly used for aligning different languages, such as GIZA++. When the text is aligned word by word in this way, each English term in the search result documents and the Chinese equivalent that co-occurs with it are aligned with high frequency and provided to the Chinese-Korean term generation unit (113) (S215).
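The idea that co-occurring terms end up aligned with high frequency can be illustrated with a much simpler stand-in than GIZA++: scoring word pairs by the Dice coefficient of their sentence-level co-occurrence. This is not how GIZA++ works (it trains statistical alignment models), and the function name and toy data are assumptions for illustration only.

```python
from collections import Counter
from itertools import product


def dice_align(pairs):
    """Map each English token to its best-scoring Chinese token.

    pairs: list of (zh_tokens, en_tokens) for parallel sentences.
    Score is the Dice coefficient 2*c(z,e) / (c(z) + c(e)), counted
    over sentences. Returns {en_token: (zh_token, score)}.
    """
    zh_count, en_count, joint = Counter(), Counter(), Counter()
    for zh, en in pairs:
        zh_set, en_set = set(zh), set(en)
        zh_count.update(zh_set)
        en_count.update(en_set)
        joint.update(product(zh_set, en_set))
    best = {}
    for (z, e), c in joint.items():
        score = 2 * c / (zh_count[z] + en_count[e])
        if e not in best or score > best[e][1]:
            best[e] = (z, score)
    return best


pairs = [
    (["语料库", "很", "大"], ["corpus", "is", "large"]),
    (["语料库", "训练"], ["corpus", "training"]),
    (["模型", "很", "大"], ["model", "is", "large"]),
]
best = dice_align(pairs)
```

In this toy data "corpus" and 语料库 always appear together, so they receive the maximum score of 1.0, while pairs that co-occur only by chance score lower.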

The Chinese-Korean term generation unit (113) selects word-level Chinese-English term pairs from the English terms and Chinese equivalents aligned by the word alignment unit (111), extracts from the English-Korean term dictionary (101) the Korean term that matches the English term of each selected Chinese equivalent (S217), and finally builds the Chinese-Korean term dictionary (115) (S219). In selecting the final Chinese terms, the unit mainly checks whether the Chinese term occurs with high frequency, whether it contains unnecessary particles, and/or whether its word size differs too much from that of the corresponding English term.
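The selection and pivot steps (S217, S219) can be sketched as a filter followed by a dictionary join through English. The particle list, the frequency threshold, and the way word size is compared (Chinese character count against roughly two characters per English word) are hypothetical choices; the patent names the three criteria but gives no concrete values.

```python
from collections import Counter

# Assumption: a small illustrative list of Chinese function characters.
PARTICLES = ("的", "了", "和")


def build_zh_ko_terms(aligned, en_ko, min_freq=2, max_len_diff=4):
    """Build a Chinese-Korean term dict by pivoting through English.

    aligned: iterable of (en_term, zh_candidate) pairs from word alignment.
    en_ko:   English-Korean term dictionary.
    """
    freq = Counter(aligned)
    best = {}
    for (en, zh), n in freq.items():
        if n < min_freq:                       # criterion 1: frequency
            continue
        if zh.startswith(PARTICLES) or zh.endswith(PARTICLES):
            continue                           # criterion 2: stray particles
        if abs(len(zh) - 2 * len(en.split())) > max_len_diff:
            continue                           # criterion 3: size mismatch
        if en not in en_ko:
            continue
        if en not in best or n > best[en][1]:
            best[en] = (zh, n)
    # Pivot: replace each English term with its Korean equivalent.
    return {zh: en_ko[en] for en, (zh, _) in best.items()}


aligned = ([("data mining", "数据挖掘")] * 3 + [("data mining", "的数据")] * 2
           + [("corpus", "语料库")] * 2 + [("corpus", "库")])
en_ko = {"data mining": "데이터 마이닝", "corpus": "말뭉치"}
zh_ko = build_zh_ko_terms(aligned, en_ko)
```

The candidate 的数据 is dropped for its leading particle and 库 for low frequency, leaving one Chinese term per English pivot term.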

FIG. 1 is a block diagram of an apparatus for constructing a source language-target language term list according to an embodiment of the present invention;

FIG. 2 is a block diagram of a method for constructing a source language-target language term list according to an embodiment of the present invention.

<Explanation of symbols for the main parts of the drawings>

101: English-Korean term dictionary 103: English term extraction unit

105: document search unit 107: data mining unit

109: morpheme analysis unit 111: word alignment unit

113: Chinese-Korean term generation unit 115: Chinese-Korean term dictionary

Claims

A source language-target language term list construction apparatus, comprising:

a term extraction unit which extracts intermediate language terms from an intermediate language-target language term list;

a document search unit which, using the extracted intermediate language terms as search queries, collects those data search results that contain the source language;

a data mining unit which mines the collected search results to extract source language-intermediate language parallel sentences or words;

a morpheme analysis unit which performs source language morphological analysis and intermediate language morphological analysis, respectively, on the extracted source language-intermediate language parallel sentences or words;

a word alignment unit which aligns, word by word, the source language-intermediate language parallel sentences or words on which the morphological analysis has been performed, so that the intermediate language terms in the search results are aligned with the source language equivalents that co-occur with them; and

a term generation unit which selects source language-intermediate language terms by applying a predetermined selection criterion to the aligned intermediate language terms and source language equivalents, and generates a source language-target language term list by pairing each selected source language term with the corresponding target language equivalent.

The apparatus of claim 1,

wherein, when an intermediate language sentence included in the search results contains the search query, the data mining unit extracts the source language sentence and the intermediate language sentence that contains the search query.

The apparatus of claim 2,

wherein the data mining unit estimates, using sentence length or punctuation, whether the source language sentence and the intermediate language sentence correspond to each other, and removes the search results that do not correspond through filtering.

The apparatus of claim 2,

wherein, when a source language sentence included in the search results contains the search query, the data mining unit estimates the word immediately preceding the search query to be the corresponding source language term and extracts it.

The apparatus of claim 4,

wherein the term generation unit sets, as the selection criterion, at least one of the frequency of the source language term, whether the source language term contains an unnecessary particle, and the difference in word size between the intermediate language term and the source language term.

A method of constructing a source language-target language term list, comprising:

extracting intermediate language terms from an intermediate language-target language term list;

collecting, using the extracted intermediate language terms as search queries, those data search results that contain the source language;

mining the collected search results to extract source language-intermediate language parallel sentences or words;

performing source language morphological analysis and intermediate language morphological analysis, respectively, on the extracted source language-intermediate language parallel sentences or words;

aligning, word by word, the source language-intermediate language parallel sentences or words on which the morphological analysis has been performed, so that the intermediate language terms in the search results are aligned with the source language equivalents that co-occur with them; and

selecting source language-intermediate language terms by applying a predetermined selection criterion to the aligned intermediate language terms and source language equivalents, and generating a source language-target language term list by pairing each selected source language term with the corresponding target language equivalent.

The method of claim 6,

wherein the extracting of the source language-intermediate language parallel sentences or words comprises, when an intermediate language sentence included in the search results contains the search query, extracting the source language sentence and the intermediate language sentence that contains the search query.

The method of claim 7,

wherein the extracting of the source language-intermediate language parallel sentences or words comprises estimating, using sentence length or punctuation, whether the source language sentence and the intermediate language sentence correspond to each other, and removing the search results that do not correspond through filtering.

The method of claim 7,

wherein the extracting of the source language-intermediate language parallel sentences or words comprises, when a source language sentence included in the search results contains the search query, estimating the word immediately preceding the search query to be the corresponding source language term and extracting it.

The method of claim 9,

wherein the generating of the source language-target language term list comprises setting, as the selection criterion, at least one of the frequency of the source language term, whether the source language term contains an unnecessary particle, and the difference in word size between the intermediate language term and the source language term.