KR100792204B1

KR100792204B1 - Apparatus for automatic translation customized for restrictive domain documents, and method thereof

Info

Publication number: KR100792204B1
Application number: KR1020060056203A
Authority: KR
Inventors: 이기영; 노윤형; 최승권; 권오욱; 박상규; 김영길; 김창현; 서영애; 양성일; 류철
Original assignee: 한국전자통신연구원
Priority date: 2005-12-05
Filing date: 2006-06-22
Publication date: 2008-01-08
Also published as: KR20070058950A

Abstract

The present invention relates to a method and apparatus that take patent documents as the translation target, extract translation knowledge customized for the patent domain, and automatically translate patent documents using that knowledge. Technical terms are extracted in bulk from patent documents and assigned target words (translation equivalents). Expressions that occur with high frequency in patent documents, as opposed to general-domain documents, are extracted and paired with target-language expressions. Translation is performed with the knowledge built in this way. In addition, the excessively long sentences found in patent documents are handled by applying sentence patterns, recognizing parallel structures, and splitting sentences on syntactic clues, so that parseable translation units are extracted and structurally analyzed.

Machine Translation, Automatic Translation, Knowledge Extraction, Patent Translation

Description

Apparatus for automatic translation customized for restrictive domain documents, and method

FIG. 1 is a diagram showing the overall configuration of an automatic translation apparatus specialized for patent documents according to the present invention.

FIG. 2 is a flowchart illustrating an automatic translation method specialized for patent documents according to the present invention.

FIG. 3 is a flowchart showing the method of building technical terms within the automatic translation method specialized for patent documents according to the present invention.

FIG. 4 is a flowchart showing the method of building and refining target words within the automatic translation method specialized for patent documents according to the present invention.

FIG. 5 is a flowchart showing the method of building and refining patent-specific sentence or phrase patterns within the automatic translation method specialized for patent documents according to the present invention.

FIG. 6 is a flowchart showing the method of generating the final parsing result within the automatic translation method specialized for patent documents according to the present invention.

FIG. 7 is a flowchart showing the method of converting the structural analysis result of the source language into the structure of the target language within the automatic translation method specialized for patent documents according to the present invention.

* Explanation of reference numerals for main parts of the drawings *

10: sentence division rules
20: morphological analysis dictionary
30: lexical part-of-speech context probability information
40: lexical probability information
50: target-language patent corpus
60: conversion patterns
70: single-word and compound-noun dictionary
100: knowledge extraction unit
110: terminology construction unit
120: target-word construction and refinement unit
130: sentence/phrase pattern construction unit
200: translation unit
210: preprocessing unit
220: morphological analysis and tagging unit
230: structure analysis unit
240: structure and vocabulary conversion unit
250: generation unit

The present invention relates to automatic translation systems and, more particularly, to an automatic translation apparatus and method specialized for documents in a restricted domain.

Machine translation, or automatic translation, is a field whose study began with the invention of the computer, and by the standards of computing its history is far from short.

Despite this long history of development, and judging from the current market, there is almost no automatic translation system that delivers translation quality users find satisfactory in the general domain.

The reason is that, as the web developed, conventional automatic translation systems (web document translation systems, for example) targeted documents containing a wide variety of words and expressions. As a result, the dictionary vocabulary and the conversion rules or patterns that form the foundation of automatic translation could not be built completely, owing to the nature of language.

This led to serious errors such as unregistered (out-of-dictionary) words, sentences falling outside the coverage of the analysis rules, and missing conversion information, and the resulting translation quality fell well short of a commercially viable level. These problems became a major obstacle to commercializing automatic translation systems.

The problems that arise in an unrestricted domain naturally prompted attempts to narrow the scope of automatic translation to restricted domains. For a system aimed at commercialization, a restricted domain is a very realistic target given current automatic translation technology.

The patent domain is a representative restricted domain. The number of patent documents filed and registered worldwide grows rapidly every year, and in a global era interest in foreign patents is as high as interest in domestic ones. At present most patent translation is done by specialist patent translators, and for individuals who do not belong to a company the difficulty of searching and drafting patents across languages remains a serious problem. For companies, too, the cost and time spent on patent translation keep increasing.

The problems that arise when documents in a restricted domain, such as patents, are translated with existing general-domain knowledge are as follows.

First, the most important knowledge in automatic translation generally consists of the lexical dictionary, the analysis rules/patterns, and the conversion rules/patterns. If such existing knowledge is used to translate patent-domain documents, the first problem to appear is that of unregistered words. Patents make extremely heavy use of technical terms from many fields, such as electricity, electronics, chemistry, physics, and computing, and even general vocabulary is very often used in patent documents with a meaning different from its general-domain meaning.

Second, patent documents contain expressions that are used with extremely high frequency only in the patent domain and are seldom used in the general domain. Existing general-purpose syntactic rules or patterns therefore suffer from coverage problems.

Third, in automatic translation, structural ambiguity grows explosively as sentences get longer, so analysis time rises sharply and structural analysis accuracy falls, and patent documents frequently contain sentences of several hundred words or more. Such long sentences are very hard to analyze and translate without special handling.

The present invention has been devised to solve the above problems. Its object is to provide an automatic translation apparatus and method that, in translating documents belonging to a restricted domain, extract knowledge specialized for that domain and use the extracted knowledge to perform automatic translation effectively.

Another object of the present invention is to provide an automatic translation apparatus and method that use the specialized knowledge extracted for the domain to extract parseable analysis ranges from long sentences that general analysis cannot handle, and perform structural analysis on them.

Still another object of the present invention is to provide an automatic translation apparatus and method that build translation knowledge in a reasonable quantity that does not degrade translation quality, and that build analysis rules or translation patterns in advance for the high-frequency expressions often used in the domain, so as to produce natural translations.

According to one aspect of the present invention for achieving the above objects, there is provided an automatic translation apparatus comprising: a knowledge extraction unit that extracts the translation knowledge needed to translate the sentences making up documents in a restricted domain; and a translation unit that generates a translation of an input sentence based on dictionaries and conversion patterns to which the extracted translation knowledge has been applied. The knowledge extraction unit includes a terminology construction unit that builds a domain corpus from the restricted-domain documents through morphological analysis and tagging and extracts and builds technical terms from it; a target-word construction and refinement unit that applies weights to extract high-frequency expressions in a longest-first manner, refines sentence/phrase patterns, and builds target words for the constructed terms; and a sentence/phrase pattern construction unit that, based on the corpus, builds phrase translation patterns and sentence translation patterns for the domain from high-frequency repeated word strings and their usage examples.

Preferably, the knowledge extraction unit includes: a terminology construction unit that builds a domain corpus from the restricted-domain documents through morphological analysis and tagging and extracts and builds technical terms from it; a target-word construction and refinement unit that applies weights to extract high-frequency expressions in a longest-first manner, refines sentence/phrase patterns, and builds target words for the constructed terms; and a sentence/phrase pattern construction unit that, based on the corpus, builds phrase translation patterns and sentence translation patterns for the domain from high-frequency repeated word strings and their usage examples.

Preferably, the weights are applied according to the frequency of each word in the restricted domain and the closeness of its association with co-occurring words.

Preferably, the translation unit includes: a preprocessing unit that splits the input into sentences, separates the words of each sentence into tokens, and classifies the tokens as symbols, formulas, or words; a morphological analysis and tagging unit that analyzes the morphemes of the tokens and performs statistical part-of-speech tagging using a hidden Markov model lexicalized through the knowledge extraction unit (a lexicalized HMM); a structure analysis unit that applies sentence splitting based on sentence patterns and phrase patterns to the morphologically analyzed and tagged sentence and parses each node of the pattern to generate a final parsing result; a structure and vocabulary conversion unit that uses conversion patterns derived from the translation knowledge extracted by the knowledge extraction unit to convert the structure of the parsing result into the structure of the target language, and uses dictionaries to convert each individual word; and a generation unit that generates the final target-language sentence from the converted structure and vocabulary output by the structure and vocabulary conversion unit.

To achieve the above objects, an automatic translation method specialized for restricted-domain documents according to the present invention includes:
(a) building a domain-specific corpus from documents written in the source language through morphological analysis and tagging, and extracting technical terms from it;
(b) applying weights specific to the restricted domain to extract high-frequency expressions in a longest-first manner, refining sentence/phrase patterns, and building target words for the constructed terms;
(c) building phrase translation patterns and sentence translation patterns based on the domain-specific corpus built in step (a);
(d) applying sentence-pattern-based sentence splitting to a morphologically analyzed and tagged sentence and parsing each node of the sentence pattern to generate a parsing result;
(e) using the phrase and sentence translation patterns built in step (c) to convert the structure of the parsing result into the structure of the target language, and converting each individual word; and
(f) generating a target-language sentence from the converted structure and vocabulary.

Preferably, step (a) includes:
(a1) taking as input a large document corpus built from documents written in the source language, splitting it into sentences, and separating the words of each sentence into tokens;
(a2) performing morphological analysis and attaching every possible part of speech to each token;
(a3) performing statistical part-of-speech tagging, which assigns a specific part of speech to each word using predefined lexical part-of-speech context probability information and lexical probability information, to build a domain-specific corpus annotated with the assigned parts of speech; and
(a4) extracting technical terms from the domain-specific corpus.

Preferably, step (a1) includes, when the input is a long sentence, dividing it into several sentences according to long-sentence splitting rules.

Preferably, each token in step (a1) is defined as one of a symbol, a formula, or a word.

Preferably, in step (a4), a word is extracted as a technical term when it satisfies at least one of the following conditions.

Condition 1) Unknown word: a word that is not in the general-domain dictionary.

Condition 2) A word w_i that satisfies <Equation 1>:

<Equation 1>

where
f(w_i): total frequency of word w_i in the general domain,
f(w_i, t_ij): frequency with which word w_i appears with part of speech t_ij in the general domain,
f'(w_i): total frequency of word w_i in the domain concerned,
f'(w_i, t_ij): frequency with which word w_i appears with part of speech t_ij in the domain concerned,
α: sum threshold value (0.15 in the present invention),
β: maximum threshold value (0.1 in the present invention).
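The two conditions can be sketched in code. Condition 1 is taken directly from the text. Equation 1 itself is not reproduced in this text, so the test used for Condition 2 below is only an assumed reading of the listed symbols: it compares the relative part-of-speech frequencies f(w_i, t_ij)/f(w_i) and f'(w_i, t_ij)/f'(w_i), and flags the word when the summed difference exceeds α or the largest single difference exceeds β. The function names and data layout are hypothetical.

```python
from typing import Dict, Set, Tuple

ALPHA = 0.15  # sum threshold stated in the description
BETA = 0.10   # maximum threshold stated in the description


def pos_shift(general: Dict[str, int], domain: Dict[str, int]) -> Tuple[float, float]:
    """Return (sum, max) of |f(w,t)/f(w) - f'(w,t)/f'(w)| over all POS tags t.

    ASSUMPTION: this comparison is one plausible form of Equation 1,
    not the formula from the patent.
    """
    f = sum(general.values())
    f_prime = sum(domain.values())
    if f == 0 or f_prime == 0:
        return 0.0, 0.0
    diffs = [
        abs(general.get(tag, 0) / f - domain.get(tag, 0) / f_prime)
        for tag in set(general) | set(domain)
    ]
    return sum(diffs), max(diffs)


def is_term_candidate(
    word: str,
    general_dictionary: Set[str],
    general_pos: Dict[str, Dict[str, int]],
    domain_pos: Dict[str, Dict[str, int]],
) -> bool:
    # Condition 1: the word is unknown to the general-domain dictionary.
    if word not in general_dictionary:
        return True
    # Condition 2 (assumed form): the POS usage shifts between the domains.
    total, largest = pos_shift(general_pos.get(word, {}), domain_pos.get(word, {}))
    return total > ALPHA or largest > BETA
```

Under this reading, a word such as "said", which is mostly a verb in general text but mostly a modifier in patent text, would be flagged even though it is in the general dictionary.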

Preferably, step (b) includes:
(b1) calculating, from the constructed domain-specific corpus, the frequency of occurrence of each target word of each dictionary entry;
(b2) extracting the words that co-occur with each target word and calculating the mutual information between the words;
(b3) calculating a usage weight for each target word from the values obtained from the word frequencies and the co-occurring words;
(b4) applying the usage weights to refine the target words of each dictionary entry according to their importance of use in the specific domain; and
(b5) building target words for the constructed terms based on the refined sentence/phrase patterns and the defined weights.

Preferably, the weight in step (b) is a value corresponding to the frequency of each word in the restricted domain and the closeness of its association with co-occurring words.
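Steps (b1) to (b4) can be illustrated with a small sketch. The text does not give the weighting formula, so the combination used here (relative frequency multiplied by one plus the summed positive pointwise mutual information with a set of context words) is an assumption, as are the function names and the sentence-level co-occurrence window.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple


def build_counts(sentences: Sequence[Sequence[str]]) -> Tuple[Counter, Counter]:
    """Count, per sentence, which words occur and which pairs co-occur."""
    word_count: Counter = Counter()
    pair_count: Counter = Counter()
    for sentence in sentences:
        words = set(sentence)
        word_count.update(words)
        pair_count.update(frozenset(p) for p in combinations(sorted(words), 2))
    return word_count, pair_count


def pmi(x: str, y: str, word_count: Counter, pair_count: Counter, n: int) -> float:
    """Pointwise mutual information log2(p(x,y) / (p(x) p(y))), 0.0 if never together."""
    joint = pair_count[frozenset((x, y))]
    if joint == 0:
        return 0.0
    return math.log2(joint * n / (word_count[x] * word_count[y]))


def rank_target_words(
    candidates: List[str], context: List[str], sentences: Sequence[Sequence[str]]
) -> List[str]:
    """Order candidate target words by an ASSUMED usage weight."""
    word_count, pair_count = build_counts(sentences)
    n = len(sentences)
    total = sum(word_count[c] for c in candidates) or 1
    scores = {}
    for c in candidates:
        frequency = word_count[c] / total                      # step (b1)
        association = sum(                                     # step (b2)
            max(0.0, pmi(c, w, word_count, pair_count, n)) for w in context
        )
        scores[c] = frequency * (1.0 + association)            # step (b3), assumed form
    return sorted(candidates, key=lambda c: scores[c], reverse=True)  # step (b4)
```

With a target-language patent corpus in which "claim" is frequent and co-occurs with "recites", the candidate "claim" would rank ahead of "demand" for the same source word.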

Preferably, step (c) includes:
(c1) extracting the most frequent strings from the domain-specific corpus built in step (a), and generating high-frequency repeated word strings and usage examples from them;
(c2) determining whether each generated high-frequency repeated word string and usage example is a candidate phrase pattern or a candidate sentence pattern;
(c3) if it is determined to be a candidate phrase pattern, checking whether each node is a phrase start/end node or a part-of-speech node and building a specific phrase translation pattern; and
(c4) if it is determined to be a candidate sentence pattern, building a specific sentence translation pattern covering the whole sentence.
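Step (c1) can be sketched as n-gram counting with a longest-first filter. The text does not spell out the extraction procedure, so the thresholds, the n-gram range, and the rule of discarding a shorter string that is contained in an already accepted longer one are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Gram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> List[Gram]:
    """All contiguous n-token strings of a tokenized sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def contains(longer: Gram, shorter: Gram) -> bool:
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def frequent_strings(
    sentences: Sequence[Sequence[str]],
    min_count: int = 2,   # hypothetical frequency threshold
    min_n: int = 2,
    max_n: int = 6,
) -> List[Tuple[Gram, int]]:
    """Return repeated word strings, preferring the longest ones."""
    counts: Counter = Counter()
    for tokens in sentences:
        for n in range(min_n, max_n + 1):
            counts.update(ngrams(tokens, n))
    accepted: List[Tuple[Gram, int]] = []
    # Longest first, then most frequent, then alphabetical for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-len(kv[0]), -kv[1], kv[0]))
    for gram, count in ordered:
        if count < min_count:
            continue
        if any(contains(longer, gram) for longer, _ in accepted):
            continue
        accepted.append((gram, count))
    return accepted
```

The accepted strings would then go to step (c2), where each is judged to be a phrase-pattern or sentence-pattern candidate.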

Preferably, step (d) includes:
(d1) determining which kind of pattern applies to the morphologically analyzed and tagged sentence;
(d2) if a phrase pattern applies, recognizing parallel structures, splitting the sentence through parallel-node parsing, and performing syntactic node parsing;
(d3) if a sentence pattern applies, performing syntactic node parsing for each node of the sentence pattern; and
(d4) treating the results of the syntactic node parsing as a single chart and parsing the whole sentence again to generate the final structural analysis result.

Preferably, the parallel structure recognition includes: generating parallel structure candidates from phrase patterns when the partial sentence to be parsed as a syntactic node is longer than a specific length; and selecting a parallel structure by applying parallel-node recognition means and syntactic-node constraints to each of the parallel structure candidates.

Preferably, step (e) includes:
(e1) using the phrase and sentence translation patterns built in step (c) to structurally convert the input source-language document into the sentence structure of the target language;
(e2) using the single-word and compound-noun dictionary to select the best target word for each source word of the source-language document, thereby performing conversion at the lexical level; and
(e3) generating a converted data structure from the results of the structural and lexical conversion.

Preferably, the structural conversion of step (e1) is performed in units of sentences, clauses, and phrases.

Other objects, features, and advantages of the invention will become apparent from the detailed description of embodiments given with reference to the accompanying drawings.

A preferred embodiment of the automatic translation apparatus and method specialized for restricted-domain documents according to the present invention is described below with reference to the accompanying drawings. For convenience of description only, this specification describes an embodiment in which the restricted-domain documents are patent-domain documents. Patent documents are merely one preferred embodiment, however, and the restricted domain is not limited to them.

FIG. 1 is a diagram showing the overall configuration of an automatic translation apparatus specialized for patent documents according to the present invention.

Referring to FIG. 1, the automatic translation apparatus consists of a knowledge extraction unit 100, which extracts from restricted-domain documents the translation knowledge needed to translate the sentences making up documents in that domain, and a translation unit 200, which applies the extracted translation knowledge to its dictionaries and conversion patterns and generates a translation of each input sentence.

The knowledge extraction unit 100 consists of: a terminology construction unit 110 that takes documents written in the source language as input, builds a patent corpus for the patent domain through morphological analysis and tagging, and extracts and builds technical terms; a target-word construction and refinement unit 120 that applies weights corresponding to per-word frequency in the patent domain and closeness of association with co-occurring words, extracts high-frequency expressions in a longest-first manner, refines sentence/phrase patterns, and builds target words for the constructed terms; and a sentence/phrase pattern construction unit 130 that, based on the patent corpus, builds phrase translation patterns and sentence translation patterns for the domain from high-frequency repeated word strings and their usage examples.

The translation unit 200 consists of: a preprocessing unit 210 that splits the input text into sentences using the sentence division rules, separates the words of each sentence into tokens, and classifies the tokens as symbols, formulas, words, and so on; a morphological analysis and tagging unit 220 that analyzes the morphemes of the preprocessed tokens using the morphological analysis dictionary and performs statistical part-of-speech tagging using a hidden Markov model lexicalized through the knowledge extraction unit (a lexicalized HMM); a structure analysis unit 230 that applies sentence splitting based on sentence patterns and phrase patterns to the morphologically analyzed and tagged sentence and parses each node of the sentence pattern to generate a final parsing result; a structure and vocabulary conversion unit 240 that uses conversion patterns derived from the sentence/phrase patterns built by the sentence/phrase pattern construction unit 130 of the knowledge extraction unit 100 to convert the structure of the parsing result into the structure of the target language, and then uses dictionaries to convert each individual word; and a generation unit 250 that generates the final target-language sentence from the converted structure and vocabulary output by the structure and vocabulary conversion unit 240.
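The data flow through units 210 to 250 can be sketched as a chain of five stages. This is only a structural illustration: the class name, the stage signatures, and the toy stand-in stages are hypothetical and perform no real linguistic analysis.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class TranslationUnit:
    """Chains the five stages of the translation unit (200)."""

    preprocess: Callable[[str], List[List[str]]]  # 210: text -> tokenized sentences
    tag: Callable[[List[str]], Any]               # 220: tokens -> tagged tokens
    parse: Callable[[Any], Any]                   # 230: tagged tokens -> source structure
    transfer: Callable[[Any], Any]                # 240: source structure -> target structure
    generate: Callable[[Any], str]                # 250: target structure -> sentence

    def translate(self, text: str) -> List[str]:
        results = []
        for tokens in self.preprocess(text):
            source_structure = self.parse(self.tag(tokens))
            results.append(self.generate(self.transfer(source_structure)))
        return results


# Toy stand-ins that only show how data moves between the stages.
toy = TranslationUnit(
    preprocess=lambda text: [s.split() for s in text.split(".") if s.strip()],
    tag=lambda tokens: [(t, "UNK") for t in tokens],
    parse=lambda tagged: tagged,
    transfer=lambda tree: [(word.upper(), pos) for word, pos in tree],
    generate=lambda tree: " ".join(word for word, _ in tree),
)
print(toy.translate("a b. c d."))
```

Keeping each stage behind a plain callable mirrors the description, in which the knowledge extraction unit 100 supplies dictionaries and patterns to the stages without changing the order in which they run.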

An automatic translation method specialized for patent documents according to the present invention, using the apparatus configured as above, is described in detail below with reference to the accompanying drawings.

FIG. 2 is a flowchart illustrating an automatic translation method specialized for patent documents according to the present invention.

Referring to FIG. 2, in the first step, documents written in the source language are taken as input, a patent corpus is built through morphological analysis and tagging, and technical terms are built from that patent corpus (S100).

FIG. 3 is a flowchart showing the method of building technical terms within the automatic translation method specialized for patent documents according to the present invention, and that method is described in detail with reference to it.

First, a large corpus of patent documents written in the source language is received as input (S110), and preprocessing is performed in which the input corpus is split into individual sentences using sentence-splitting rules (10) and the words appearing in each sentence are separated into tokens (S120).

The tokens are classified as symbols, formulas, words, and the like. In addition, when an input sentence is long, the sentence-splitting rules (10) apply long-sentence splitting rules that divide it into several sentences.

Specifically, a long-sentence splitting rule is a regular expression whose tokens are words, sentence-initial symbols, sentence-final symbols, and sentence-splitting symbols. The sentence-splitting regular expression consists of a <condition part> and a <sentence separation part>. The <condition part> is a sequence of input words and sentence-initial/sentence-final symbols, and the <sentence separation part> describes the form of the split, including the sentence-splitting symbol.

Accordingly, when an input sentence satisfies the <condition part> of a long-sentence splitting rule, the sentence is split as expressed in the <sentence separation part>.

For example, if the <condition part> is "including:" and the <sentence separation part> is "including as follow:\n", then whenever the expression "including:" is encountered in an input sentence, it is replaced with "including as follow:" and the sentence is split at that point.
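The "including:" example can be sketched as follows. This is a minimal illustration only: it uses plain string replacement rather than the token-level regular expressions described above, and the names `SplitRule` and `apply_split_rules` are hypothetical, not taken from the patent.

```python
from typing import List, NamedTuple


class SplitRule(NamedTuple):
    condition: str  # <condition part>: the expression to look for
    separator: str  # <sentence separation part>: replacement; "\n" marks the split point


def apply_split_rules(sentence: str, rules: List[SplitRule]) -> List[str]:
    """Apply each rule in turn, then cut the sentence at the inserted split marks."""
    for rule in rules:
        sentence = sentence.replace(rule.condition, rule.separator)
    return [part.strip() for part in sentence.split("\n") if part.strip()]


# The rule from the example in the text.
rules = [SplitRule("including:", "including as follow:\n")]
parts = apply_split_rules("A device including: a case and a chassis.", rules)
print(parts)
```

A real rule set would also need the sentence-initial and sentence-final symbols mentioned above, which this sketch leaves out.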

When this preprocessing is finished, the morphological analysis dictionary (20) is searched to analyze morphemes, and every possible part of speech is attached to each token (S130). Words that do not appear in the morphological analysis dictionary are treated as unknown words.

Then, to determine which of a word's possible parts of speech is actually used in the sentence, statistical part-of-speech tagging is performed, assigning the optimal part of speech to each word using predefined lexical part-of-speech context probability information (30) and lexical probability information (40) (S140). Tagging is preferably performed using a lexicalized hidden Markov model (Lexicalized HMM).
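The tagging step can be illustrated with a simplified sketch. It assumes a plain bigram HMM decoded with the Viterbi algorithm, not the lexicalized HMM the text prefers, and every name and probability below (`viterbi_tag`, `lexical_prob`, `context_prob`, the tag set) is a hypothetical placeholder for the lexical probability and context probability resources.

```python
import math
from typing import Dict, List, Tuple


def viterbi_tag(
    words: List[str],
    candidates: Dict[str, List[str]],
    lexical_prob: Dict[Tuple[str, str], float],
    context_prob: Dict[Tuple[str, str], float],
    floor: float = 1e-6,
) -> List[str]:
    """Pick the most probable tag sequence among each word's candidate tags."""
    if not words:
        return []
    # best[i][tag] = (log score of the best path ending in tag, previous tag)
    best = [{}]
    for tag in candidates[words[0]]:
        score = math.log(context_prob.get(("<s>", tag), floor))
        score += math.log(lexical_prob.get((words[0], tag), floor))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in candidates[words[i]]:
            emit = math.log(lexical_prob.get((words[i], tag), floor))
            prev, score = max(
                ((p, s + math.log(context_prob.get((p, tag), floor)))
                 for p, (s, _) in best[i - 1].items()),
                key=lambda item: item[1],
            )
            best[i][tag] = (score + emit, prev)
    # Backtrace from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    tags = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        tags.append(tag)
    return list(reversed(tags))


# Hypothetical toy resources.
candidates = {"the": ["DT"], "present": ["JJ", "VB", "NN"], "invention": ["NN"]}
lexical_prob = {("the", "DT"): 1.0, ("present", "JJ"): 0.5, ("present", "VB"): 0.3,
                ("present", "NN"): 0.2, ("invention", "NN"): 1.0}
context_prob = {("<s>", "DT"): 0.6, ("DT", "JJ"): 0.3, ("DT", "NN"): 0.5,
                ("DT", "VB"): 0.01, ("JJ", "NN"): 0.6, ("NN", "NN"): 0.2,
                ("VB", "NN"): 0.3}
tags = viterbi_tag(["the", "present", "invention"], candidates, lexical_prob, context_prob)
print(tags)
```

Only the candidate tags attached in step S130 are scored, which mirrors the two-stage design of attaching all possible parts of speech first and disambiguating statistically afterwards.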

Through this series of operations, an automatically tagged patent corpus is built in which the optimal part of speech is assigned to each word of the input patent document corpus (S150).

Terminology is then extracted from the patent corpus (S160), and the extracted terminology is stored in a DB (S170). Here, the terminology of a domain consists of words that do not appear in the general domain, or that do appear there but with a part of speech very different from the one used in the domain. Therefore, a word is extracted as terminology of the domain if it satisfies at least one of the following conditions.

To find words satisfying these conditions, the automatically tagged patent corpus built above is used.

Condition 1) Unknown word: a word that is not in the morphological analysis dictionary.

Condition 2) A word w_i that satisfies Equation 1 below:

where

f(w_i): the total frequency of word w_i in the general domain,

f'(w_i): the total frequency of word w_i in the automatically tagged patent corpus (209),

f'(w_i, t_ij): the frequency with which word w_i appears with part of speech t_ij in the automatically tagged patent corpus (209), and

β: the maximum threshold value (0.1 is used in the present invention).

Of the values in Equation 1, the general-domain values f(w_i) and f(w_i, t_ij) are taken from the lexical probability information (208), which was built in advance from a tagged general-domain corpus for statistical part-of-speech tagging.
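The quantities used by the two conditions can be collected from a tagged corpus as sketched below. The sketch only counts f'(w_i) and f'(w_i, t_ij) and applies Condition 1; it does not implement Equation 1, whose form is not given in this text. The function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

TaggedSentence = List[Tuple[str, str]]  # (word, part-of-speech tag) pairs


def count_frequencies(tagged_corpus: Iterable[TaggedSentence]):
    """Return f'(w) and f'(w, t) counted over an automatically tagged corpus."""
    word_freq = Counter()
    word_tag_freq = Counter()
    for sentence in tagged_corpus:
        for word, tag in sentence:
            word_freq[word] += 1
            word_tag_freq[(word, tag)] += 1
    return word_freq, word_tag_freq


def unknown_words(word_freq: Counter, dictionary: Set[str]) -> List[str]:
    """Condition 1: words absent from the morphological analysis dictionary."""
    return sorted(w for w in word_freq if w not in dictionary)


# Hypothetical toy corpus and dictionary.
corpus = [
    [("the", "DT"), ("reactor", "NN"), ("claims", "NNS")],
    [("he", "PRP"), ("claims", "VBZ"), ("the", "DT"), ("reactor", "NN")],
]
word_freq, word_tag_freq = count_frequencies(corpus)
terms = unknown_words(word_freq, dictionary={"the", "he", "claims"})
print(word_freq["claims"], word_tag_freq[("claims", "NNS")], terms)
```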

In the second step, weights corresponding to per-word frequency in the patent domain and closeness to co-occurring words are applied, high-frequency expressions are extracted by a longest-match-first method to refine the sentence/phrase patterns, and target words (translation equivalents) are built for the terminology constructed above (S200).

Dictionaries and terminology dictionaries built for the general domain list target words whose frequency of use may differ in a specific domain such as patents. In other words, when documents of a specific domain such as patents are translated with an existing general-domain dictionary, the usage weights of the target words differ, so the target words frequently used in that domain are not selected. As a result, even if the structural conversion is performed correctly, the overall meaning is not conveyed properly.

Therefore, a method is introduced for refining the target words of each entry in the existing general-domain dictionary on the basis of a monolingual corpus written in the target language.

FIG. 4 is a flowchart illustrating the method of building and refining target words within the automatic translation method specialized for patent documents according to the present invention, and the method of building and refining target words for the constructed terminology is described in detail with reference to it.

First, the target words of each entry in the general-domain dictionary are extracted (S210), and the frequency of occurrence of each extracted target word is calculated (S220).

Next, the per-word frequencies in the patent-domain corpus (50) written in the target language are calculated (S220), and the relatedness between each target word and the words that co-occur with it is calculated on the basis of mutual information (S230).

Then, the usage weight of each target word is calculated using a weight function obtained from the per-word frequencies and the relatedness of co-occurring words (S250).

Using this weight function over per-word frequency and closeness to co-occurring words, it is determined how important each target word is in the specific domain, and the sentence/phrase patterns are refined accordingly (S260).

Through this process, the target words of the existing general and terminology dictionaries, which were built for the general domain, are re-weighted by the newly defined weight function so that they are specialized for the new domain, and target words are thereby built for the terminology constructed above (S270).
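The two ingredients of the weight function, per-word frequency and mutual-information relatedness, can be sketched as below. The sketch assumes sentence-level co-occurrence and pointwise mutual information, which is one common reading of "mutual information". The text does not specify how the two are combined into a weight, so no combining function is shown. All names and data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List


def cooccurrence_stats(sentences: List[List[str]]):
    """Count, per sentence, which words occur and which pairs co-occur."""
    word_count = Counter()
    pair_count = Counter()
    for tokens in sentences:
        unique = sorted(set(tokens))
        word_count.update(unique)
        pair_count.update(combinations(unique, 2))
    return word_count, pair_count, len(sentences)


def pmi(x: str, y: str, word_count: Counter, pair_count: Counter, n: int) -> float:
    """Pointwise mutual information of x and y over n sentences."""
    joint = pair_count[tuple(sorted((x, y)))]
    if joint == 0:
        return float("-inf")
    return math.log2(joint * n / (word_count[x] * word_count[y]))


# Hypothetical target-language corpus, shown with English placeholder tokens.
sentences = [
    ["device", "structure", "drawing"],
    ["device", "structure"],
    ["method", "drawing"],
    ["method", "step"],
]
word_count, pair_count, n = cooccurrence_stats(sentences)
score = pmi("device", "structure", word_count, pair_count, n)
print(word_count["device"], score)
```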

In the third step, phrase translation patterns and sentence translation patterns for the domain are built from high-frequency repeated lexical strings and their usage examples, based on the patent corpus built in the first step (S100) (S300).

FIG. 5 is a flowchart illustrating the method of building patent-specific sentence and phrase patterns within the automatic translation method specialized for patent documents according to the present invention, and the method of building the phrase translation patterns and sentence translation patterns of the patent domain is described in detail with reference to it.

First, taking as input the large automatically tagged patent corpus built in the first step, boundary conditions are checked and the most frequent lexical strings and usage examples are extracted (S320); from these, high-frequency repeated lexical strings and usage examples are generated (S330).
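Extraction of high-frequency repeated lexical strings can be sketched as n-gram counting with one stored usage example per string. The boundary-condition check is not detailed in the text, so it is replaced here by a simple hypothetical filter that drops a shorter string when a longer string containing it occurs equally often. The function name and parameters are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Gram = Tuple[str, ...]


def contains(longer: Gram, shorter: Gram) -> bool:
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def frequent_strings(sentences: List[List[str]], min_n: int = 2, max_n: int = 3,
                     min_count: int = 2) -> Dict[Gram, Tuple[int, str]]:
    """Map each retained word string to (frequency, first usage example)."""
    counts = Counter()
    examples = {}
    for tokens in sentences:
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                counts[gram] += 1
                examples.setdefault(gram, " ".join(tokens))
    frequent = {g: c for g, c in counts.items() if c >= min_count}
    # Prefer the longest string: drop g if a longer, equally frequent string covers it.
    return {
        g: (c, examples[g])
        for g, c in frequent.items()
        if not any(len(h) > len(g) and frequent[h] == c and contains(h, g)
                   for h in frequent)
    }


corpus = [
    "the present invention relates to a transformer in accordance with claim 1".split(),
    "a reactor in accordance with claim 2".split(),
]
result = frequent_strings(corpus)
print(sorted(result))
```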

Next, each generated high-frequency repeated lexical string and usage example is judged to be either a phrase pattern candidate or a sentence pattern candidate (S340).

If it is judged to be a phrase pattern candidate (S340), a patent phrase pattern is built by checking whether the words at both ends of the candidate correspond to the parts of speech of the start/end nodes of a phrase in the structural analysis rules, or whether the words adjacent to the candidate correspond to part-of-speech nodes lying between phrase nodes within a rule (S350). The patent phrase patterns are then used to build a patent phrase translation pattern DB (S360).

If it is judged to be a sentence pattern candidate (S340), a patent sentence pattern is built for the whole sentence (S370), and a patent sentence translation pattern DB is built (S380).

The following shows an embodiment in which a patent phrase pattern (S350) and a patent sentence pattern (S370) are built from usage examples in patent documents, as described above.

1) Example of building a patent phrase translation pattern (S350)

Extracted high-frequency repeated lexical string and usage example:

in_accordance_with 20063 The present invention relates to a DC transformer / reactor in accordance with the introductory part of claim 1 .

Constructed patent phrase translation pattern: in accordance! with -> 에_따른!

2) Example of building a patent sentence translation pattern (S370)

relates_to 20063 The present invention relates to a DC transformer / reactor in accordance with the introductory part of claim 1 .

Constructed patent sentence translation pattern: NP1 relate to NP2 -> NP1:[는] NP2:[에] 관한 것이!

In the fourth step, sentence patterns and syntax-pattern-based sentence splitting are applied to the morphologically analyzed and tagged sentence, and each node of the sentence pattern is parsed to generate the final syntactic analysis result (S400).

FIG. 6 is a flowchart illustrating the method of generating the final syntactic analysis result within the automatic translation method specialized for patent documents according to the present invention, and the method is described in detail with reference to it.

First, when a morphologically analyzed and tagged sentence is input (S410), sentence pattern recognition is performed on it (S420); if a sentence pattern applies, phrase-node parsing is performed for each node of the sentence pattern (S430).

Here, a sentence pattern is a pattern whose scope is the entire sentence, and it consists of lexical items and phrase nodes. Phrase nodes mainly correspond to noun phrases (NP), verb phrases (VP), sentences (S), and the like, and two phrase nodes cannot appear consecutively. Sentence patterns are recognized by chart parsing, with the sentence patterns used as the rules of the chart parser. When a phrase node is encountered, the tagging result is scanned from the word at the current position, and the range up to the point where the lexical item following the phrase node in the pattern is matched is recognized as the phrase node. A simple condition check is then made on the recognized phrase, and only if the condition is satisfied is the phrase node created and added to the inactive chart.
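The span-finding rule for phrase nodes can be sketched as follows. This is a greedy, first-match simplification and not a chart parser: it takes the first occurrence of the lexical item that follows a phrase node and omits the condition check on the recognized phrase. The function name, the `is_slot` convention, and the use of the surface form "relates" instead of a lemma are all assumptions.

```python
from typing import Callable, List, Optional, Tuple


def match_sentence_pattern(
    pattern: List[str],
    tokens: List[str],
    is_slot: Callable[[str], bool],
) -> Optional[List[Tuple[str, List[str]]]]:
    """Return (phrase node, covered tokens) pairs, or None if the pattern fails."""
    spans = []
    pos = 0
    for k, element in enumerate(pattern):
        if not is_slot(element):
            # Lexical item: must match the token at the current position.
            if pos >= len(tokens) or tokens[pos] != element:
                return None
            pos += 1
            continue
        if k + 1 == len(pattern):
            end = len(tokens)  # a final phrase node covers the rest of the sentence
        else:
            # Phrase nodes are never consecutive, so the next element is lexical.
            try:
                end = tokens.index(pattern[k + 1], pos + 1)
            except ValueError:
                return None
        if end <= pos:
            return None  # a phrase node must cover at least one token
        spans.append((element, tokens[pos:end]))
        pos = end
    return spans if pos == len(tokens) else None


pattern = ["NP1", "relates", "to", "NP2"]
tokens = "The present invention relates to a DC transformer".split()
spans = match_sentence_pattern(pattern, tokens, lambda e: e.startswith("NP"))
print(spans)
```

In a long claim sentence the first comma or the first "wherein" is often not the right boundary, which is presumably why the text relies on chart parsing plus a condition check rather than a single greedy pass.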

If the unit of phrase-node parsing is longer than a certain length, parallel structures are recognized (S440) and the unit is divided into parallel nodes through syntax-pattern-based sentence splitting (S450).

Parsing is then attempted with each recognized parallel node as the parsing unit (S460).

Parallel structure recognition first identifies the possible ranges of parallel structures using syntactic information. Because a parallel construction in English takes the form X -> X, X, .., and X, all such possible ranges are recognized as parallel structure candidates. For each candidate, the parallel nodes are identified using the tagging result. Parallel nodes are broadly classified as NP (noun phrase), VP (verb phrase), VPG (gerund), S (sentence), or SG (absolute participial construction). This classification is made without parsing, using the following heuristics.

1) If there is at least one main verb and a noun, pronoun, or numeral precedes the main verb, the node is S; otherwise it is VP.

2) If a participial form of a verb is present and a noun, pronoun, or numeral precedes it, the node is NP/SG; otherwise it is VPG.

3) Otherwise, the node is NP.

Two constraints are then checked against this recognition result.

1) All parallel nodes must have the same phrase-node type.

2) If a verb or preposition precedes the parallel structure, only NP/VPG is possible.

Among the parallel structures that satisfy these conditions, the longest one is selected.
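The heuristics and constraints above can be sketched over part-of-speech tags. The sketch assumes Penn Treebank style tags (VBZ/VBP/VBD as main verbs, VBG/VBN as participles, NN*/PRP/CD as noun, pronoun, and numeral), which the patent does not specify, and it treats the ambiguous label NP/SG as compatible with NP under the second constraint. All function names are hypothetical.

```python
from typing import List, Optional

NOMINAL_PREFIXES = ("NN", "PRP", "CD")  # noun, pronoun, numeral (assumed tag set)
MAIN_VERB_TAGS = {"VBZ", "VBP", "VBD"}
PARTICIPLE_TAGS = {"VBG", "VBN"}


def has_nominal(tags: List[str]) -> bool:
    return any(tag.startswith(NOMINAL_PREFIXES) for tag in tags)


def classify_node(tags: List[str]) -> str:
    """Classify one parallel node from its tag sequence, without parsing."""
    for i, tag in enumerate(tags):
        if tag in MAIN_VERB_TAGS:  # heuristic 1
            return "S" if has_nominal(tags[:i]) else "VP"
    for i, tag in enumerate(tags):
        if tag in PARTICIPLE_TAGS:  # heuristic 2
            return "NP/SG" if has_nominal(tags[:i]) else "VPG"
    return "NP"  # heuristic 3


def is_valid_parallel(node_types: List[str], preceding_tag: Optional[str] = None) -> bool:
    """Constraint 1: same type everywhere. Constraint 2: NP/VPG after a verb or preposition."""
    if len(set(node_types)) != 1:
        return False
    if preceding_tag is not None and (preceding_tag.startswith("VB") or preceding_tag == "IN"):
        return node_types[0] in {"NP", "NP/SG", "VPG"}
    return True


def select_longest(candidates: List[List[List[str]]]) -> List[List[str]]:
    """Among valid candidates, keep the one covering the most tags."""
    return max(candidates, key=lambda nodes: sum(len(tags) for tags in nodes))


# "a case being formed with a plurality of through holes"
print(classify_node(["DT", "NN", "VBG", "VBN", "IN", "DT", "NN", "IN", "NN", "NNS"]))
```

For the nodes following "comprising" in the worked example, the heuristic yields NP/SG, and the preceding verb then narrows the reading to NP.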

If the parsing unit still exceeds a certain length even after parallel-node parsing has been attempted, the sentence is split according to predetermined syntax patterns, and split-sentence parsing is performed on the resulting segments. In this syntax-pattern-based splitting, the sentence is split unconditionally at commas.
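The fallback split can be sketched as below. The length threshold is not stated in the text, so `max_len` is a hypothetical parameter, as is the function name.

```python
from typing import List


def split_long_unit(tokens: List[str], max_len: int) -> List[List[str]]:
    """Split at every comma, but only when the unit exceeds max_len tokens."""
    if len(tokens) <= max_len:
        return [tokens]
    segments, current = [], []
    for token in tokens:
        if token == ",":
            if current:
                segments.append(current)
            current = []
        else:
            current.append(token)
    if current:
        segments.append(current)
    return segments


unit = "the cable is folded back , and the chassis has a prevention piece".split()
print(split_long_unit(unit, max_len=8))
```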

Finally, the partial parsing results obtained so far are treated as a single chart and the whole sentence is parsed again, generating the final structural analysis result (S470).

The fourth step (S400), generation of the final syntactic analysis result, is illustrated below with an example sentence from an English patent.

Example

[Input sentence]: "Construction of fixing a flexible sheet for use in an electronic device comprising a case being formed with a plurality of through holes, a chassis being accommodated in an interior of the case, a flexible sheet being disposed on a surface of the chassis and having a plurality of flexible switches arranged thereon, a circuit board being provided below the chassis and having a connector fixed thereon, and a plurality of manual buttons being provided above each flexible switch and being exposed from the through holes of the case to the outside of the case, the construction of fixing the flexible sheet wherein the flexible sheet comprises a flat plate portion being in close contact with the chassis and a flat cable portion which projects on an edge of the flat plate portion and with which a connecting terminal portion is provided on its end, the flat cable portion is folded back to the chassis to have the connecting terminal portion connected to the connector, and the chassis is provided with a lift-up prevention piece to prevent a part of the flat plate portion of the flexible sheet from being lifted up from a surface of the chassis."

[Pattern applied]: S -> S:[vg], NP wherein S , S

[Pattern recognition result]:

(S:[vg] Construction of fixing a flexible sheet for use in an electronic device comprising a case being formed with a plurality of through holes, a chassis being accommodated in an interior of the case, a flexible sheet being disposed on a surface of the chassis and having a plurality of flexible switches arranged thereon, a circuit board being provided below the chassis and having a connector fixed thereon, and a plurality of manual buttons being provided above each flexible switch and being exposed from the through holes of the case to the outside of the case), (NP the construction of fixing the flexible sheet) wherein (S the flexible sheet comprises a flat plate portion being in close contact with the chassis and a flat cable portion which projects on an edge of the flat plate portion and with which a connecting terminal portion is provided on its end), (S the flat cable portion is folded back to the chassis to have the connecting terminal portion connected to the connector, and the chassis is provided with a lift-up prevention piece to prevent a part of the flat plate portion of the flexible sheet from being lifted up from a surface of the chassis.)

[Parsing of each recognized phrase node]

Phrase parsing is performed on the S:[vg], NP, and S nodes.

[Parallel structure recognition]

(S:[vg] Construction of fixing a flexible sheet for use in an electronic device comprising (NP a case being formed with a plurality of through holes), (NP a chassis being accommodated in an interior of the case), (NP a flexible sheet being disposed on a surface of the chassis and having a plurality of flexible switches arranged thereon), (NP a circuit board being provided below the chassis and having a connector fixed thereon), and a plurality of manual buttons being provided above each flexible switch and being exposed from the through holes of the case to the outside of the case)

The last node above is not bracketed as a phrase node, because the end of its range is not known.

[Parsing of each parallel node]

After each recognized NP is parsed, a tree is constructed for the parallel structure.

If no sentence pattern exists, parallel structure recognition is performed first and each phrase node is parsed; then, when the whole sentence is to be parsed again and its length exceeds a certain length, it is split at commas.

( Construction of fixing a flexible sheet for use in an electronic device comprising (NP a case being formed with a plurality of through holes), (NP a chassis being accommodated in an interior of the case), (NP a flexible sheet being disposed on a surface of the chassis and having a plurality of flexible switches arranged thereon), (NP a circuit board being provided below the chassis and having a connector fixed thereon), and a plurality of manual buttons being provided above each flexible switch and being exposed from the through holes of the case to the outside of the case), ( the construction of fixing the flexible sheet wherein the flexible sheet comprises a flat plate portion being in close contact with the chassis and a flat cable portion which projects on an edge of the flat plate portion and with which a connecting terminal portion is provided on its end), ( the flat cable portion is folded back to the chassis to have the connecting terminal portion connected to the connector, and the chassis is provided with a lift-up prevention piece to prevent a part of the flat plate portion of the flexible sheet from being lifted up from a surface of the chassis.)

In the fifth step, the syntactic analysis result is structurally converted into the structure of the target language using the conversion patterns (60) based on the sentence/phrase patterns built above, and each individual word is then converted using the single-word and compound-noun dictionary (S500).

FIG. 7 is a flowchart illustrating the method of converting the structural analysis result of the source language into the structure of the target language within the automatic translation method specialized for patent documents according to the present invention, and the method is described in detail with reference to it.

First, using the conversion patterns (60) built in the third step (S300), structural conversion is performed on the structural analysis result of the input source-language sentence (S510). The structural conversion is performed at the sentence, clause, and phrase levels, and the conversion pattern (60) that best matches the structural analysis result of the fourth step (S400) is selected.

Once conversion to the target-language sentence structure is complete, each individual word is converted using the single-word and compound-noun dictionary (70) (S520). During this lexical conversion, a target word selection function is performed to resolve target word selection ambiguity, that is, to select the optimal target word for a source word with more than one meaning. Because patent documents contain a large amount of terminology, the complex target word selection used in the general domain is not implemented; instead, selection based mainly on frequency of use is applied.
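Frequency-based target word selection can be sketched as picking the candidate with the highest count in the target-language patent corpus. The dictionary entry, the frequency values, and the function name below are hypothetical illustrations, not data from the patent.

```python
from typing import Dict, List, Optional


def select_target_word(
    source_word: str,
    dictionary: Dict[str, List[str]],
    target_freq: Dict[str, int],
) -> Optional[str]:
    """Return the candidate target word used most often in the domain corpus."""
    candidates = dictionary.get(source_word, [])
    if not candidates:
        return None
    # On ties, max() keeps the candidate listed first in the dictionary.
    return max(candidates, key=lambda word: target_freq.get(word, 0))


# Hypothetical entry: "claim" as an assertion (주장) or as a patent claim (청구항).
dictionary = {"claim": ["주장", "청구항"]}
target_freq = {"주장": 12, "청구항": 950}
chosen = select_target_word("claim", dictionary, target_freq)
print(chosen)
```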

When the structural and lexical conversion is finished, a conversion data structure is created to pass the results to the generation unit (S530).

In the sixth and final step, the final target-language sentence is generated from the converted structure and vocabulary (S600).

The preferred embodiment of the present invention has been disclosed above through the detailed description and drawings. The terms used herein serve only to describe the invention and are not intended to limit its meaning or the scope of the invention set forth in the claims. Those of ordinary skill in the art will therefore understand that various modifications and equivalent embodiments are possible. Accordingly, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

이상에서 설명한 바와 같은 본 발명에 따른 제한적인 도메인의 문서를 대상으로 특화된 자동 번역 장치 및 방법은 다음과 같은 효과가 있다.As described above, the automatic translation apparatus and method specialized for the document of the limited domain according to the present invention have the following effects.

First, the patent domain is set as a narrower, specific translation domain, translation knowledge customized for that domain is built and used, and the long sentences of patent documents are split before the documents are translated automatically. As a result, the quality of automatic translation can be improved to a level that is of direct use in actual patent work.

Second, now that advances in wired and wireless communication technology have made it possible to share patents worldwide, patents of other countries can be consulted more easily and more economically. This lowers the likelihood of future patent disputes and substantially reduces the cost that has been spent on searching and referring to foreign patents.

Claims

A terminology construction unit that builds a corpus from restricted-domain documents through morphological analysis and tagging, and extracts and constructs terminology from it;

a target-word construction and refinement unit that applies weights to extract high-frequency expressions on a longest-first basis, refines sentence/phrase patterns, and constructs target words for the constructed terminology;

a knowledge extraction unit that extracts the translation knowledge required to translate the sentences of a restricted-domain document, and that includes, together with the above units, a sentence/phrase pattern construction unit that constructs the domain's phrase translation patterns and sentence translation patterns from the high-frequency repeated lexical strings and usage examples found in the corpus; and

a translation unit that generates a translation of an input sentence based on the dictionary and the conversion patterns to which the extracted translation knowledge has been applied.

(deleted)

The apparatus of claim 1,

wherein the weight is applied according to the frequency of each word in the restricted domain and its closeness to co-occurring words.

The apparatus of claim 1, wherein the translation unit comprises:

a pre-processing unit that splits the input into sentences, splits the words appearing in each sentence into tokens, and classifies the tokens as symbols, formulas, or words;

a morphological analysis and tagging unit that analyzes the morphemes of the tokens and performs statistical part-of-speech tagging using a lexicalized hidden Markov model (Lexicalized HMM) through the knowledge extraction unit;

a structural analysis unit that applies sentence segmentation by sentence pattern and phrase pattern to the morphologically analyzed and tagged sentences, parses each node of the pattern, and generates a final syntactic analysis result;

a structure and lexical conversion unit that converts the generated syntactic analysis result into the structure of the target language using conversion patterns based on the translation knowledge extracted by the knowledge extraction unit, and converts each word using a dictionary; and

a generation unit that generates the final target-language sentence from the converted structure and vocabulary output by the structure and lexical conversion unit.
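Statistical part-of-speech tagging with a hidden Markov model is usually decoded with the Viterbi algorithm. The sketch below shows that decoding step for a plain bigram HMM; it does not implement the lexicalized variant named in the claim, and the function name `viterbi`, the probability tables, and the `floor` smoothing value are assumptions made for illustration.

```python
import math

def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for tokens under a bigram HMM.

    start_p[t]    : probability that a sentence starts with tag t
    trans_p[s][t] : probability of tag t following tag s
    emit_p[t][w]  : probability of tag t emitting word w
    floor         : assumed minimum probability, avoids log(0) for unseen events
    """
    if not tokens:
        return []

    def lp(p):
        return math.log(max(p, floor))

    # best[i][t] = (log-probability of the best path ending in t at position i, previous tag)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(tokens[0], 0)), None)
             for t in tags}]
    for tok in tokens[1:]:
        prev = best[-1]
        column = {}
        for t in tags:
            score, back = max((prev[s][0] + lp(trans_p[s].get(t, 0)), s) for s in tags)
            column[t] = (score + lp(emit_p[t].get(tok, 0)), back)
        best.append(column)

    # Trace back from the best final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

# Toy model with made-up probabilities
tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"claims": 0.4, "device": 0.6}, "VERB": {"claims": 0.9, "device": 0.1}}
print(viterbi(["device", "claims"], tags, start_p, trans_p, emit_p))
```

Working in log space keeps long sentences, which are common in patent text, from underflowing to zero.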

(a) building a domain-specific corpus for a restricted domain from documents written in a source language through morphological analysis and tagging, and extracting terminology from it;

(b) applying weights for the restricted domain to extract high-frequency expressions on a longest-first basis, refining sentence/phrase patterns, and constructing target words for the constructed terminology;

(c) constructing phrase translation patterns and sentence translation patterns based on the domain-specific corpus built in step (a);

(d) applying sentence segmentation by sentence pattern to the morphologically analyzed and tagged sentences, and parsing each node of the sentence pattern to generate a parsing result;

(e) performing, on the generated parsing result, structure conversion into the target language and conversion of each individual word, using the phrase and sentence translation patterns constructed in step (c); and

(f) generating a target-language sentence from the converted structure and vocabulary.

The method of claim 5, wherein step (a) comprises:

(a1) splitting a large document corpus built from documents written in the source language into sentences, and splitting the words of each sentence into tokens;

(a2) analyzing morphemes and attaching all possible parts of speech to each of the tokens;

(a3) building a domain-specific corpus in which each word is assigned a single part of speech, by performing statistical part-of-speech tagging using predefined lexical part-of-speech probability information and lexical probability information; and

(a4) extracting terminology from the constructed domain-specific corpus.

The method of claim 6,

wherein step (a1) comprises, if the input is a long sentence, splitting it into several sentences according to long-sentence splitting rules.

The method of claim 6,

wherein each token of step (a1) is defined as one of a symbol, a formula, or a word.
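The three token classes can be illustrated with a small tokenizer. This is only a sketch: the text does not define what counts as a formula, so the `FORMULA` pattern (alphanumeric operands joined by one of `= + * ^ < >`) is an assumption, as are the function names. Hyphens and slashes are deliberately left out of the operator set so that ordinary compounds are not mistaken for formulas.

```python
import re

# Assumed definition of a formula: operands joined by at least one operator
OPERAND = r"[A-Za-z0-9]+"
FORMULA = re.compile(rf"^{OPERAND}(?:[=+*^<>]{OPERAND})+$")
WORD = re.compile(r"^\w+$")
# Try the formula alternative first so "x=a+b" is not split at its operators
TOKEN = re.compile(rf"{OPERAND}(?:[=+*^<>]{OPERAND})+|\w+|[^\w\s]")

def classify_token(token):
    """Label a token as 'formula', 'word', or 'symbol'."""
    if FORMULA.match(token):
        return "formula"
    if WORD.match(token):
        return "word"
    return "symbol"

def tokenize(sentence):
    """Split a sentence into tokens and pair each token with its class."""
    return [(t, classify_token(t)) for t in TOKEN.findall(sentence)]

print(tokenize("x=a+b holds, where a>0."))
```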

The automatic translation method of claim 6, wherein step (a4) extracts words that satisfy at least one of the following conditions.

Condition 1) Unknown word: a word that is not in the general-domain dictionary.

Condition 2) A word w_i that satisfies <Condition 1>:

where f(w_i): the total frequency of the word w_i in the general domain,

f(w_i, t_ij): the frequency with which the word w_i appears with part of speech t_ij in the general domain,

f'(w_i): the total frequency of the word w_i in the restricted domain,

f'(w_i, t_ij): the frequency with which the word w_i appears with part of speech t_ij in the restricted domain,

α: total threshold value (0.15 in the present invention),

β: maximum threshold value (0.1 in the present invention).
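The frequency quantities defined for this claim, and the unknown-word test of Condition 1, can be computed from tagged corpora as sketched below. The expression that Condition 2 compares against α and β does not appear in the text, so it is not implemented here. The function names and the corpus format (sentences as lists of `(word, pos)` pairs) are assumptions.

```python
from collections import Counter

def frequency_tables(tagged_corpus):
    """Count words and (word, part-of-speech) pairs in a tagged corpus.

    tagged_corpus: iterable of sentences, each a list of (word, pos) pairs.
    Run on a general-domain corpus this gives f(w_i) and f(w_i, t_ij);
    run on the restricted-domain corpus it gives f'(w_i) and f'(w_i, t_ij).
    """
    word_freq, word_pos_freq = Counter(), Counter()
    for sentence in tagged_corpus:
        for word, pos in sentence:
            word_freq[word] += 1
            word_pos_freq[(word, pos)] += 1
    return word_freq, word_pos_freq

def unknown_words(domain_word_freq, general_dictionary):
    """Condition 1: domain words that are absent from the general-domain dictionary."""
    return sorted(w for w in domain_word_freq if w not in general_dictionary)

# Toy restricted-domain corpus, for illustration only
domain_corpus = [
    [("anode", "NOUN"), ("claims", "VERB")],
    [("anode", "NOUN"), ("claims", "NOUN")],
]
f_dom, f_dom_pos = frequency_tables(domain_corpus)
print(f_dom["anode"], f_dom_pos[("claims", "VERB")])
print(unknown_words(f_dom, {"claims"}))
```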

The method of claim 5, wherein step (b) comprises:

(b1) calculating in advance, from the constructed domain-specific corpus, the frequency of occurrence of each target word of each entry;

(b2) extracting the words that co-occur with each target word and calculating the mutual information between them;

(b3) calculating a usage weight for each target word from the values obtained for word frequency and co-occurring words;

(b4) applying the usage weight of each target word to refine the target words of each dictionary entry according to their importance of use in the specific domain; and

(b5) constructing target words for the constructed terminology based on the refined sentence/phrase patterns and the defined weights.
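The co-occurrence and weighting steps of this claim can be sketched as follows. The sketch uses pointwise mutual information over sentence-level co-occurrence, which is one common way to measure mutual information between two words. The claim does not state how frequency and mutual information are combined, so `usage_weight` and its `lam` parameter are a purely hypothetical combination.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count in how many sentences each word, and each unordered word pair, occurs."""
    word_count, pair_count = Counter(), Counter()
    for sentence in sentences:
        words = sorted(set(sentence))
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    return word_count, pair_count

def pmi(x, y, word_count, pair_count, n_sentences):
    """Pointwise mutual information (base 2) of x and y sharing a sentence."""
    joint = pair_count[tuple(sorted((x, y)))]
    if joint == 0:
        return float("-inf")
    return math.log2(joint * n_sentences / (word_count[x] * word_count[y]))

def usage_weight(freq, total_freq, pmi_values, lam=0.5):
    """Hypothetical weight: interpolate relative frequency with mean positive PMI."""
    rel = freq / total_freq if total_freq else 0.0
    finite = [max(v, 0.0) for v in pmi_values if v != float("-inf")]
    assoc = sum(finite) / len(finite) if finite else 0.0
    return lam * rel + (1 - lam) * assoc

# Toy target-language sentences, for illustration only
sentences = [["fuel", "cell", "stack"], ["fuel", "cell"], ["prison", "cell"], ["fuel", "tank"]]
wc, pc = cooccurrence_counts(sentences)
print(pmi("cell", "prison", wc, pc, len(sentences)))
print(usage_weight(3, 4, [1.0, float("-inf")]))
```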

The method of claim 5,

wherein the weight of step (b) is a value corresponding to the frequency of each word in the restricted domain and its closeness to co-occurring words.

The method of claim 5, wherein step (c) comprises:

(c1) extracting the highest-frequency strings from the domain-specific corpus built in step (a), and generating high-frequency repeated lexical strings and usage examples from the extracted strings;

(c2) determining whether each generated high-frequency repeated lexical string and its usage is a phrase pattern candidate or a sentence pattern candidate;

(c3) if the result of the determination is a phrase pattern candidate, constructing a domain-specific phrase translation pattern by checking whether it is a start/end node or a part-of-speech node; and

(c4) if the result of the determination is a sentence pattern candidate, constructing a domain-specific sentence translation pattern for the entire sentence.
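One way to extract high-frequency repeated lexical strings on a longest-first basis is to count word n-grams and then discard any shorter string that never occurs outside a longer string already selected. The sketch below follows that idea; the n-gram formulation, the `min_freq` and `max_n` parameters, and the function names are assumptions, not details taken from the claims.

```python
from collections import Counter

def ngram_counts(sentences, max_n):
    """Count every word n-gram of length 2..max_n in tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def contains(longer, shorter):
    """True if shorter appears as a contiguous run inside longer."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def frequent_strings_longest_first(sentences, max_n=5, min_freq=2):
    """Select frequent n-grams, preferring the longest covering string."""
    counts = ngram_counts(sentences, max_n)
    frequent = [(g, c) for g, c in counts.items() if c >= min_freq]
    # Longest first; ties broken by higher frequency, then by the words themselves
    frequent.sort(key=lambda gc: (-len(gc[0]), -gc[1], gc[0]))
    selected = []
    for gram, count in frequent:
        # Skip a shorter string that is no more frequent than a selected longer one
        if any(contains(longer, gram) and count <= c for longer, c in selected):
            continue
        selected.append((gram, count))
    return selected

# Toy corpus, for illustration only
corpus = [
    "the present invention relates to a battery".split(),
    "the present invention relates to a display".split(),
    "the present invention provides a method".split(),
]
for gram, count in frequent_strings_longest_first(corpus, max_n=6):
    print(count, " ".join(gram))
```

On the toy corpus the six-word opening survives as one string, and "the present invention" is kept separately because it also occurs outside that longer string.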

The method of claim 5, wherein step (d) comprises:

(d1) determining the pattern of the morphologically analyzed and tagged sentence;

(d2) if the result of the determination is a phrase pattern, recognizing parallel structures and parsing the sentence through parallel-node parsing;

(d3) if the result of the determination is a sentence pattern, parsing the node of each sentence of the sentence pattern; and

(d4) generating the final structural analysis result by re-parsing the entire sentence, treating the parsed result of each node as one chart.

The method of claim 13, wherein the parallel structure recognition comprises:

generating parallel structure candidates based on the phrase pattern when, as a result of the determination, the partial sentence to be parsed is at least a certain length; and

selecting a parallel structure for each parallel structure candidate through parallel node recognition means and phrase node constraints.

The method of claim 5, wherein step (e) comprises:

(e1) performing structure conversion of the input source-language document using the phrase and sentence translation patterns constructed in step (c), converting it into the sentence structure of the target language;

(e2) performing word-level conversion by selecting the best target word for each source word of the source-language document using a single-word and compound-noun dictionary; and

(e3) generating a conversion data structure based on the results of the structure conversion and lexical conversion.

The method of claim 15,

wherein the structure conversion of step (e1) is performed in units of sentences, clauses, and phrases.