KR102593447B1

KR102593447B1 - Device and method for generating of training data for quality estimation in machine translation

Info

Publication number: KR102593447B1
Application number: KR1020210156657A
Authority: KR
Inventors: 임희석; 어수경; 박찬준; 문현석
Original assignee: 고려대학교 산학협력단
Priority date: 2021-11-15
Filing date: 2021-11-15
Publication date: 2023-10-25
Also published as: KR20230071825A

Abstract

Disclosed is a training data generation device that generates training data used to build a quality evaluation model for evaluating the quality of machine translation. The training data generation device includes a data collection unit that collects basic data comprising a plurality of sentences by web crawling; an input sentence generation unit that generates, based on the basic data, input sentence pairs used to train the quality evaluation model, each pair comprising a source sentence and a machine translation of that source sentence; and a TER calculation unit that calculates the TER (Translation Error Rate) of the machine translation.

Description

Device and method for generating training data for machine translation quality estimation {DEVICE AND METHOD FOR GENERATING OF TRAINING DATA FOR QUALITY ESTIMATION IN MACHINE TRANSLATION}

The present invention relates to a method of generating training data for machine translation quality estimation and, more specifically, to a device and method for automatically generating the training data used in building a quality estimation model that estimates the quality of machine translations.

Quality Estimation (QE) of Machine Translation (MT) means estimating the quality of a machine translation from only the source sentence and the machine translation inferred by a machine translation model, without consulting a reference sentence.

In general, judging the quality of a machine translation requires comparing it with a reference sentence, but reference sentences are available only in very limited cases. In addition, users of machine translation often do not know the source language or the target language well, which makes it difficult for them to judge whether a translation result is of good or poor quality. These problems have increased the need for research on QE, which can estimate translation quality automatically without a reference sentence.

In QE, the quality of a machine translation is expressed through quality annotations such as numeric scores or error tags. These annotations can be used to select which of several machine translation systems produces the best output, or to rank the outputs. For a low-quality machine-translated sentence, the quality annotations attached to each word make it possible to correct only the low-quality words, which improves the efficiency of post-editing. Because QE can be applied this broadly in machine translation, its importance is growing.

A QE task can estimate quality from only the source sentence and the machine translation, without a reference sentence. However, building the training data for a QE model requires labels annotated directly by translation experts, which demands a great deal of expert effort.

To remove this problem and make QE applicable even to low-resource languages (LRL), the present invention proposes a method of automatically generating training data (which may be referred to as pseudo-QE training data).

Republic of Korea Patent Publication No. 2021-0030238 (published March 17, 2021)
Republic of Korea Patent Publication No. 2017-0053527 (published May 16, 2017)
Republic of Korea Patent Publication No. 2015-0029931 (published March 19, 2015)
Republic of Korea Patent Publication No. 2021-0070891 (published June 15, 2021)

Ranasinghe, T., Orasan, C., and Mitkov, R. (2020). TransQuest: Translation quality estimation with cross-lingual transformers. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5070-5081, Barcelona, Spain (Online). International Committee on Computational Linguistics.

The technical problem to be solved by the present invention is to provide a method and device for automatically generating training data for training a QE (Quality Estimation) model.

A training data generation device according to an embodiment of the present invention includes a data collection unit that collects basic data comprising a plurality of sentences by web crawling; an input sentence generation unit that generates, based on the basic data, input sentence pairs used to train a quality evaluation model for machine translation, each pair comprising a source sentence and a machine translation of that source sentence; and a TER calculation unit that calculates the TER (Translation Error Rate) of the machine translation.

A machine translation quality prediction model generation device according to an embodiment of the present invention includes a data collection unit that collects basic data comprising a plurality of sentences by web crawling; an input sentence generation unit that generates, based on the basic data, input sentence pairs used to train a quality evaluation model for machine translation, each pair comprising a source sentence and a machine translation of that source sentence; a TER calculation unit that calculates the TER (Translation Error Rate) of the machine translation; and a model generation unit that generates the quality evaluation model by training on the input sentence pairs and the TER.

A training data generation method according to an embodiment of the present invention is performed by a training data generation device that includes at least a processor, and includes the steps of: collecting basic data comprising a plurality of sentences by web crawling; generating, based on the basic data, input sentence pairs used to train a quality evaluation model for machine translation, each pair comprising a source sentence and a machine translation of that source sentence; and calculating the TER (Translation Error Rate) of the machine translation.

A machine translation quality evaluation model generation method according to an embodiment of the present invention is performed by a machine translation quality evaluation model generation device that includes at least a processor, and includes the steps of: collecting basic data comprising a plurality of sentences by web crawling; generating, based on the basic data, input sentence pairs used to train a quality evaluation model for machine translation, each pair comprising a source sentence and a machine translation of that source sentence; calculating the TER (Translation Error Rate) of the machine translation; and generating the quality evaluation model by training on the input sentence pairs and the TER.

The method and device for automatically generating training data for a QE model according to embodiments of the present invention make it possible to generate the training data needed to build a QE model automatically, without the effort of translation experts.

A brief description of each drawing is provided so that the drawings cited in the detailed description of the present invention may be more fully understood.
FIG. 1 is a functional block diagram of a training data generation device according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram illustrating a training data generation method or a quality evaluation model generation method performed by the training data generation device shown in FIG. 1.
FIG. 3 is a flowchart illustrating a training data generation method or a quality evaluation model generation method performed by the training data generation device shown in FIG. 1.

The specific structural or functional descriptions of the embodiments according to the concept of the present invention disclosed in this specification are given only for the purpose of explaining those embodiments. Embodiments according to the concept of the present invention may be implemented in various forms and are not limited to the embodiments described herein.

Because embodiments according to the concept of the present invention may be modified in various ways and may take various forms, the embodiments are illustrated in the drawings and described in detail in this specification. This is not intended to limit the embodiments to the specific forms disclosed, and the embodiments include all changes, equivalents, and substitutes that fall within the spirit and technical scope of the present invention.

Terms such as first or second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of rights according to the concept of the present invention, a first component may be named a second component, and similarly a second component may be named a first component.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other component exists in between. Other expressions describing the relationship between components, such as "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

The terms used in this specification are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described herein, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the technical field to which the present invention pertains. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art and, unless explicitly defined in this specification, are not to be interpreted in an idealized or overly formal sense.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings attached to this specification. However, the scope of the patent application is not restricted or limited by these embodiments. The same reference numerals in the drawings denote the same members.

The present invention addresses a problem of QE (Quality Estimation): although QE can measure translation quality without a reference sentence, building the training data for QE requires more effort than producing the translations themselves. To mitigate this problem, a technique and device that can generate the training data automatically are proposed.

FIG. 1 is a functional block diagram of a training data generation device according to an embodiment of the present invention.

Referring to FIG. 1, the training data generation device 10 can generate training data used to train (or build) a quality estimation model for machine translations, and may be implemented as a computing device that includes at least a processor and/or a memory. The training data generation device 10 may include at least one of a data collection unit 110, an input sentence generation unit 120, a TER calculation unit 130, a model generation unit 140, and a storage unit 150.

Depending on the embodiment, the training data generation device 10 may include the model generation unit 140, in which case the training data generation device 10 may also be called a quality estimation model generation device.

The data collection unit 110 may collect basic data consisting of a plurality of sentences through any collection technique. For example, the data collection unit 110 may collect the basic data by web crawling. The basic data includes first basic data and/or second basic data. The first basic data may include a plurality of sentences in the target language (a monolingual corpus), and the second basic data may include a parallel corpus. The parallel corpus may include source sentences in the source language and target sentences in the target language, each target sentence corresponding to one of the source sentences.
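The two kinds of basic data can be pictured as two small containers. The following is a minimal sketch; the class names `MonolingualCorpus` and `ParallelCorpus`, the `add` helper, and the sample sentences are illustrative assumptions and do not appear in the patent.

```python
from dataclasses import dataclass, field


@dataclass
class MonolingualCorpus:
    """First basic data: sentences in the target language only (hypothetical name)."""
    target_sentences: list[str] = field(default_factory=list)


@dataclass
class ParallelCorpus:
    """Second basic data: aligned (source, target) sentence pairs (hypothetical name)."""
    pairs: list[tuple[str, str]] = field(default_factory=list)

    def add(self, source: str, target: str) -> None:
        # Keep only pairs where both sides are non-empty after trimming.
        source, target = source.strip(), target.strip()
        if source and target:
            self.pairs.append((source, target))


mono = MonolingualCorpus(["The cat sat on the mat."])
para = ParallelCorpus()
para.add("고양이가 매트 위에 앉았다.", "The cat sat on the mat.")
para.add("", "orphan target")  # dropped: the source side is empty
```

Filtering out half-empty pairs is a design choice of this sketch, on the assumption that crawled data may contain misaligned lines; the patent does not prescribe any filtering.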

The basic data collected by the data collection unit 110 may be stored in the storage unit 150. If the basic data has already been collected and stored in the storage unit 150, however, the data collection unit 110 may be omitted from the training data generation device 10.

The input sentence generation unit 120 may generate the source sentences and/or machine translations that form part of the training data used in training (or building) the quality evaluation model for machine translations. The sentence generation process performed by the input sentence generation unit 120 may differ depending on the type of basic data.

First, when input sentences are generated from the first basic data, the input sentence generation unit 120 machine translates a target-language sentence included in the first basic data (backward translation) to generate a source sentence (which may be called a pseudo-source sentence), and then translates the generated source sentence again (forward translation) to generate a machine translation (MT output). Performing this operation on each of the sentences included in the first basic data yields training data for the quality estimation model. This operation of the input sentence generation unit 120 is a technique for generating input sentence pairs based on RTT (Round-Trip Translation). The input sentences generated by the input sentence generation unit 120 may be stored in the storage unit 150.
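The round-trip procedure above can be sketched as follows. This is only an illustration under stated assumptions: the function name `round_trip_pairs` is hypothetical, and the two lookup tables stand in for real backward and forward machine translation systems, which the patent does not specify.

```python
from typing import Callable

Translator = Callable[[str], str]


def round_trip_pairs(target_sentences, backward: Translator, forward: Translator):
    """Build (pseudo_source, mt_output, reference) triples via round-trip translation."""
    triples = []
    for reference in target_sentences:
        pseudo_source = backward(reference)  # target -> source (backward translation)
        mt_output = forward(pseudo_source)   # source -> target (forward translation)
        triples.append((pseudo_source, mt_output, reference))
    return triples


# Toy stand-ins for real MT systems (hypothetical lookup tables).
TOY_BACKWARD = {"I like apples": "나는 사과를 좋아한다"}
TOY_FORWARD = {"나는 사과를 좋아한다": "I love apples"}

triples = round_trip_pairs(
    ["I like apples"],
    backward=lambda s: TOY_BACKWARD[s],
    forward=lambda s: TOY_FORWARD[s],
)
```

The original target-language sentence is carried along in each triple because it later serves as the reference when the TER label is computed; only the pseudo-source and the MT output form the input sentence pair.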

Next, when input sentences are generated from the second basic data, the input sentence generation unit 120 may generate a machine translation (MT output) by machine translating a source sentence included in the second basic data (forward translation). Here, the source sentence included in the second basic data and its corresponding machine translation are the training data used to train the quality estimation model. Performing this operation on each of the sentence pairs included in the second basic data yields training data for the quality estimation model. The input sentences generated by the input sentence generation unit 120 may be stored in the storage unit 150.
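For a parallel corpus only the forward step is needed, since a human-written source already exists. A minimal sketch, assuming a hypothetical `forward_pairs` helper and a toy dictionary lookup in place of a real MT system:

```python
def forward_pairs(parallel_pairs, forward):
    """Translate each source sentence; keep the human target as the reference."""
    return [(src, forward(src), tgt) for src, tgt in parallel_pairs]


# Hypothetical MT stand-in: a dictionary lookup instead of a real translation model.
toy_forward = {"좋은 아침입니다": "Good morning it is"}.get

examples = forward_pairs([("좋은 아침입니다", "Good morning")], toy_forward)
```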

The TER calculation unit 130 may calculate, for each pair of source sentence and machine translation in the training data used to train the quality estimation model, the TER (Translation Edit Rate, also called Translation Error Rate) of the machine translation. When the machine translation is generated from the first basic data, the TER may be calculated using the target-language sentence included in the first basic data as the reference sentence. When the machine translation is generated from the second basic data, the TER may be calculated using the target sentence included in the second basic data as the reference sentence. The TER may take a real value greater than or equal to 0 and less than or equal to 1. Because methods of calculating TER are already widely known, a detailed description is omitted.
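As a rough illustration of how such a label could be computed, the sketch below divides a word-level edit distance by the reference length. It is a simplification and not the full TER algorithm: standard TER also counts block shifts as single edits, which this sketch omits. The function names are hypothetical, and clipping the ratio to 1 is an assumption made here so that the value stays in the range from 0 to 1 stated above.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h from the hypothesis
                            curr[j - 1] + 1,      # insert r into the hypothesis
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_ter(mt_output: str, reference: str) -> float:
    """Edits divided by reference length, clipped to [0, 1]; no shift operation."""
    hyp, ref = mt_output.split(), reference.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return min(1.0, word_edit_distance(hyp, ref) / len(ref))


# One substitution ("love" for "like") over a three-word reference.
score = approx_ter("I love apples", "I like apples")
```

In practice an established TER implementation would be preferable to this approximation, since whitespace tokenization and the missing shift operation both affect the score.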

The model generation unit 140 may generate a quality estimation model that estimates the quality of machine translations by training on the source sentences and machine translations generated by the input sentence generation unit 120 together with the TER values calculated by the TER calculation unit 130. In one embodiment, the model generation unit 140 may generate the quality estimation model by training TransQuest (Ranasinghe et al., 2020, https://github.com/TharinduDR/TransQuest), an open-source framework. Ranasinghe et al. proposed two architectures, MonoTransQuest and SiameseTransQuest; in the present invention, the quality estimation model was generated by training MonoTransQuest only. However, the scope of the present invention is not limited to this, and the model used for training may be changed.

The storage unit 150 may store the basic data collected by the data collection unit 110, the input sentences generated by the input sentence generation unit 120, the TER values calculated by the TER calculation unit 130, the quality estimation model for machine translations generated by the model generation unit 140, and the like.

The components of the training data generation device shown in FIG. 1 are drawn separately to show that they can be separated functionally and logically. A person of average skill in the technical field of the present invention will readily understand that this does not necessarily mean that each component is a separate physical device or is written as separate code.

In addition, in this specification a "unit" may mean a functional and structural combination of hardware for carrying out the technical idea of the present invention and software for driving that hardware. For example, a unit may mean a logical unit of predetermined code and the hardware resources on which that code is executed, and does not necessarily mean physically connected code or a single type of hardware.

FIG. 2 is a conceptual diagram illustrating a training data generation method or a quality evaluation model generation method performed by the training data generation device shown in FIG. 1, and FIG. 3 is a flowchart illustrating the same method. In describing the training data generation method or the quality evaluation model generation method, content that overlaps with the preceding description is omitted.

Referring to FIGS. 1 to 3, the data collection unit 110 included in the training data generation device 10 may collect basic data (S110). The basic data may include first basic data and second basic data.

The input sentence generation unit 120 included in the training data generation device 10 may generate the training data used to build the quality estimation model, that is, input sentence pairs (S120). An input sentence pair means a source sentence and the machine translation of that source sentence.

The TER calculation unit 130 included in the training data generation device 10 may calculate the TER of the machine translation in each input sentence pair used to build the quality estimation model (S130). The training data therefore consists of the input sentence pairs and a plurality of TER values, each corresponding to one of the input sentence pairs.
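The resulting training data, one TER label per input sentence pair, could be held and serialized as in the sketch below. The record type `QEExample`, the column names, and the tab-separated layout are assumptions chosen for illustration; the patent does not define a storage format.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class QEExample:
    source: str     # source (or pseudo-source) sentence
    mt_output: str  # machine translation of the source
    ter: float      # TER label computed against the reference


def to_tsv(examples) -> str:
    """Serialize examples as tab-separated rows with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "mt_output", "ter"])
    for ex in examples:
        writer.writerow([ex.source, ex.mt_output, f"{ex.ter:.4f}"])
    return buffer.getvalue()


dataset = [QEExample("나는 사과를 좋아한다", "I love apples", 1 / 3)]
tsv_text = to_tsv(dataset)
```

The reference sentence is deliberately left out of the record: it is needed only to compute the label, and a QE model is meant to see just the source sentence and the machine translation.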

Depending on the embodiment, the model generation unit 140 included in the training data generation device 10 may use the generated training data to generate a quality estimation model that estimates the quality of machine translations (S140).

Through the process described above, the training data used to build a model that evaluates (or estimates) the quality of machine translations without a reference sentence can be generated, and a quality evaluation model can be generated from that training data.

The device described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an ALU (Arithmetic Logic Unit), a digital signal processor, a microcomputer, an FPA (Field Programmable Array), a PLU (Programmable Logic Unit), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described as being used, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

소프트웨어는 컴퓨터 프로그램(Computer Program), 코드(Code), 명령(Instruction), 또는 이들 중 하나 이상의 조합을 포함할 수 있으며, 원하는 대로 동작하도록 처리 장치를 구성하거나 독립적으로 또는 결합적으로(Collectively) 처리 장치를 명령할 수 있다. 소프트웨어 및/또는 데이터는, 처리 장치에 의하여 해석되거나 처리 장치에 명령 또는 데이터를 제공하기 위하여, 어떤 유형의 기계, 구성 요소(Component), 물리적 장치, 가상 장치(Virtual Equipment), 컴퓨터 저장 매체 또는 장치, 또는 전송되는 신호 파(Signal Wave)에 영구적으로, 또는 일시적으로 구체화(Embody)될 수 있다. 소프트웨어는 네트워크로 연결된 컴퓨터 시스템 상에 분산되어서, 분산된 방법으로 저장되거나 실행될 수도 있다. 소프트웨어 및 데이터는 하나 이상의 컴퓨터 판독 가능 기록 매체에 저장될 수 있다.Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing unit to operate as desired, or may be processed independently or collectively. You can command the device. Software and/or data may be used on any type of machine, component, physical device, virtual equipment, computer storage medium or device to be interpreted by or to provide instructions or data to a processing device. , or may be permanently or temporarily embodied in a transmitted signal wave. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely illustrative, and those skilled in the art will understand that various modifications and other equivalent embodiments are possible therefrom. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents. Therefore, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

10: learning data generation device
110: data collection unit
120: input sentence generation unit
130: TER calculation unit
140: model generation unit
150: storage unit

Claims

A machine translation quality evaluation model generation device comprising:
a data collection unit that collects basic data including a plurality of sentences using web crawling;
an input sentence generation unit that generates, based on the basic data, an input sentence pair that is used to train a quality evaluation model for machine translation and that includes a source sentence and a machine translation of the source sentence;
a TER calculation unit that calculates a TER (Translation Error Rate) of the machine translation; and
a model generation unit that generates the quality evaluation model for machine translation by training on training data including the input sentence pair and the TER,
wherein the basic data includes sentences in a target language,
wherein the input sentence generation unit generates the source sentence by back-translating a sentence in the target language and generates the machine translation by translating the source sentence, and
wherein the TER calculation unit calculates the TER between the sentence in the target language and the machine translation.

(deleted)

A machine translation quality evaluation model generation method performed by a machine translation quality evaluation model generation device including at least a processor, the method comprising:
collecting basic data including a plurality of sentences using web crawling;
generating, based on the basic data, an input sentence pair that is used to train a quality evaluation model for machine translation and that includes a source sentence and a machine translation of the source sentence;
calculating a TER (Translation Error Rate) of the machine translation; and
generating the quality evaluation model for machine translation by training on training data including the input sentence pair and the TER,
wherein the basic data includes sentences in a target language,
wherein, in the generating of the input sentence pair, the source sentence is generated by back-translating a sentence in the target language and the machine translation is generated by translating the source sentence, and
wherein, in the calculating of the TER, the TER between the sentence in the target language and the machine translation is calculated.

(deleted)