KR20230034468A

KR20230034468A - Device and method for generating optimal translation subtitle using quality estimation

Info

Publication number: KR20230034468A
Application number: KR1020210117011A
Authority: KR
Inventors: 임희석; 박찬준
Original assignee: 고려대학교 산학협력단
Priority date: 2021-09-02
Filing date: 2021-09-02
Publication date: 2023-03-10

Abstract

Disclosed are a device and method for generating translated sentences. The method for generating translated sentences, performed by a computing device including at least a processor, comprises the steps of: performing speech recognition on speech data in a first language to generate a text sentence in the first language; generating, from the text sentence, a plurality of text sentences having different lengths; performing machine translation on each of the plurality of text sentences to generate a plurality of translated sentences in a second language; and performing machine translation quality prediction on each of the plurality of translated sentences, and selecting one of the plurality of translated sentences as the translated sentence corresponding to the speech data based on a result of the machine translation quality prediction. Accordingly, optimal translated subtitles can be generated by using machine translation quality prediction.

Description

Apparatus and method for generating optimal translated subtitles using machine translation quality prediction {DEVICE AND METHOD FOR GENERATING OPTIMAL TRANSLATION SUBTITLE USING QUALITY ESTIMATION}

The present invention relates to a method for generating optimal translated subtitles using machine translation quality prediction, and in particular to an apparatus and method for generating a subtitle file with optimal translated subtitle quality by performing quality prediction on translated subtitles produced from speech recognition results and selecting the translated subtitle that shows the highest quality.

Machine translation quality prediction (Quality Estimation) refers to a technique for predicting the translation quality of a machine-translated sentence without referring to a reference (correct) translation. That is, the quality of a machine translation can be predicted from the source sentence and the target sentence alone. Automatic speech recognition (hereinafter, speech recognition) is a technology that uses a computer to convert speech into text. With the emergence of deep learning, speech recognition has recently achieved a rapid improvement in recognition rate. One application of speech recognition takes a video as input and outputs the subtitles of that video.

One of the most difficult problems in speech recognition is EPD (End Point Detection). EPD refers to a technique by which a machine automatically determines where an utterance ends; the problem is currently addressed by adding a dictionary of sentence-final endings or by machine learning. However, its performance is not yet satisfactory. Because of this technical limitation of EPD, performing machine translation on the output of automatic subtitle generation is quite difficult. Ordinary machine translation receives complete sentences as input, whereas machine translation on automatically generated subtitles receives sentences segmented in EPD units rather than complete sentences. This leads to a decrease in the quality of the translated subtitles.

The present invention proposes a method of using machine translation to generate translated subtitles from extracted subtitles.

Korean Patent Application Publication No. 2017-0053527 (published May 16, 2017)
Korean Patent Application Publication No. 2015-0029931 (published March 19, 2015)
Korean Patent Application Publication No. 2021-0070891 (published June 15, 2021)

A technical problem to be solved by the present invention is to provide an apparatus and method that perform quality prediction on the machine translation of a speech recognition result and generate optimal translated subtitles based on the result of the quality prediction.

A method for generating a translated sentence according to an embodiment of the present invention is performed by a computing device including at least a processor, and includes: generating a text sentence in a first language by performing speech recognition on speech data in the first language; generating, from the text sentence, a plurality of text sentences having different lengths; generating a plurality of translated sentences in a second language by performing machine translation on each of the plurality of text sentences; and performing machine translation quality prediction on each of the plurality of translated sentences, and selecting one of the plurality of translated sentences as the translated sentence corresponding to the speech data based on a result of the machine translation quality prediction.

An apparatus for generating translated subtitles according to an embodiment of the present invention includes: a speech recognition unit that performs speech recognition on speech data in a first language to generate a text sentence in the first language; a machine translation unit that generates, from the text sentence, a plurality of text sentences having different lengths and performs machine translation on each of the plurality of text sentences to generate a plurality of translated sentences in a second language; and a translated subtitle generation unit that performs machine translation quality prediction on each of the plurality of translated sentences and selects one of the plurality of translated sentences as the translated sentence corresponding to the speech data based on a result of the machine translation quality prediction.

The apparatus and method for generating optimal translated subtitles using machine translation quality prediction according to an embodiment of the present invention have the effect of generating optimal translated subtitles by using machine translation quality prediction, without performing EPD (End Point Detection) on the speech recognition result.

A brief description of each drawing is provided so that the drawings cited in the detailed description of the present invention may be more fully understood.
FIG. 1 is a functional block diagram of an apparatus for generating translated subtitles according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method for generating translated subtitles performed by the apparatus shown in FIG. 1.

The specific structural or functional descriptions of the embodiments according to the concept of the present invention disclosed in this specification are given only for the purpose of explaining those embodiments. The embodiments according to the concept of the present invention may be implemented in various forms and are not limited to the embodiments described herein.

Because the embodiments according to the concept of the present invention may be modified in various ways and may take various forms, the embodiments are illustrated in the drawings and described in detail in this specification. However, this is not intended to limit the embodiments according to the concept of the present invention to the specific disclosed forms, and the embodiments include all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of rights according to the concept of the present invention, a first component may be termed a second component, and similarly a second component may be termed a first component.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening component is present. Other expressions describing the relationship between components, such as "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

The terms used in this specification are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in this specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless explicitly defined in this specification, should not be interpreted in an idealized or excessively formal sense.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. However, the scope of the patent application is not restricted or limited by these embodiments. Like reference numerals in the drawings denote like elements.

FIG. 1 is a functional block diagram of an apparatus for generating translated subtitles according to an embodiment of the present invention, and FIG. 2 is a flowchart illustrating a method for generating translated subtitles performed by the apparatus shown in FIG. 1.

Referring to FIGS. 1 and 2, a translated subtitle generating device 10 (which may also be referred to as a translated sentence generating device), which may be implemented as a computing device including at least a processor and/or a memory, may include at least one of a data receiving unit 110, a speech recognition unit 120, a machine translation unit 130, a translated subtitle generation unit 140, and a storage unit 150. For speech data in a predetermined language (a first language), the translated subtitle generating device 10 may generate translated subtitles (translated sentences) in a language different from the predetermined language (a second language). In addition, at least some of the steps constituting the translated subtitle generating method may be understood as operations performed by the computing device or processor constituting the translated subtitle generating device 10.

The data receiving unit 110 may receive, through a predetermined interface or a wired or wireless communication network, the data to be subjected to speech recognition or translation. The data may be speech data (a speech file) or video data (a video file) that includes speech data. The data received by the data receiving unit 110 may be stored in the storage unit 150.

Depending on the embodiment, the data may be stored in the storage unit 150 in advance. In this case, the translated subtitle generating device 10 may not include the data receiving unit 110.

The speech recognition unit 120 performs speech recognition on the data received by the data receiving unit 110 or the data stored in the storage unit 150. That is, the speech recognition unit 120 may perform speech recognition on speech data in the first language and convert it into text data in the first language. The speech recognition unit 120 may perform the speech recognition operation by using a known speech recognizer (speech recognition model).

The EPD quality and WER (Word Error Rate) of the speech recognition result, that is, the text data in the first language (which may mean a text sentence, a subtitle file, and so on), vary widely with the performance of the speech recognizer (speech recognition model). When machine translation is performed with a subtitle file generated in this way as input, the output is a translated subtitle based on automatic subtitle generation. However, such translated subtitles are highly likely to be of degraded quality, because the input to machine translation is not a complete sentence. Therefore, before the final translated subtitle (translated sentence) is generated, a variety of candidate translated subtitles (translated sentences) must be generated according to different sentence length values (or alignment values).

The machine translation unit 130 may perform machine translation on the result of the speech recognition performed by the speech recognition unit 120. The machine translation unit 130 may perform the machine translation operation by using a predetermined machine translator (machine translation model).

Specifically, the machine translation unit 130 may generate a plurality of sentences of different lengths from the speech recognition result (text data). For example, when the text data generated by speech recognition is 'I am a boy', the machine translation unit 130 may generate at least one of 'I', 'I am', 'I am a', and 'I am a boy'. The sentences of different lengths generated by the machine translation unit 130 may serve as source sentences for machine translation. That is, by performing machine translation on each of the sentences of different lengths, the machine translation unit 130 may generate a plurality of translated sentence candidates, each corresponding to one of those sentences. Here, sentences having different lengths may mean that the sentences contain different numbers of words.
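The 'I am a boy' example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `prefix_candidates` is hypothetical, words are assumed to be separated by whitespace, and every word-level prefix is emitted, whereas the description only requires that at least one such sentence be generated.

```python
def prefix_candidates(text: str) -> list[str]:
    """Return every word-level prefix of `text`, shortest first.

    Assumption: words are whitespace-separated, so "length" is the
    number of words, as in the description.
    """
    words = text.split()
    return [" ".join(words[:n]) for n in range(1, len(words) + 1)]


if __name__ == "__main__":
    # Each printed line is one candidate source sentence for machine translation.
    for candidate in prefix_candidates("I am a boy"):
        print(candidate)
```

Each returned string would then be passed to the machine translator as a separate source sentence.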

To this end, the machine translation unit 130 may generate the plurality of sentences by dividing the alignment of the sentence that results from speech recognition (the text data) into arbitrary units (e.g., 100 to 1000), and may generate a plurality of translated sentence candidates by performing machine translation on each of the plurality of sentences. The translated sentence candidates generated by the machine translation unit 130 may be stored in the storage unit 150.

The translated subtitle generation unit 140 may perform machine translation quality prediction on each of the translated sentence candidates generated by the machine translation unit 130 and then generate the translated sentence by selecting the candidate that shows the highest quality. The selected translated sentence may be used as the translated subtitle corresponding to the speech data or video data. To this end, the translated subtitle generation unit 140 may perform the quality prediction operation by using a predetermined machine translation quality predictor (machine translation quality prediction model). Machine translation quality prediction does not take a reference translation into account; only the source sentence and the target sentence are used.
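A minimal sketch of this selection step follows. The description does not name a particular quality prediction model, so the scorer is passed in as a callable that takes only a source sentence and its translation; `select_best_translation` and `toy_qe` are hypothetical names, and `toy_qe` is a stand-in heuristic, not a real quality estimator.

```python
from typing import Callable, Sequence


def select_best_translation(
    sources: Sequence[str],
    translations: Sequence[str],
    qe_score: Callable[[str, str], float],
) -> tuple[str, float]:
    """Score each (source, translation) pair and return the best translation.

    `qe_score` is assumed to return a higher value for better quality and
    to need no reference translation.
    """
    if not sources or len(sources) != len(translations):
        raise ValueError("sources and translations must be non-empty and aligned")
    scored = [(qe_score(s, t), t) for s, t in zip(sources, translations)]
    best_score, best_translation = max(scored, key=lambda pair: pair[0])
    return best_translation, best_score


def toy_qe(source: str, translation: str) -> float:
    """Placeholder scorer (assumption): prefers similar word counts."""
    return -abs(len(source.split()) - len(translation.split()))
```

In practice `toy_qe` would be replaced by a trained quality prediction model; the selection logic itself does not change.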

Specifically, the translated subtitle generation unit 140 may evaluate the quality of the translated sentence candidates based on at least one of Pearson's correlation, MAE (Mean Absolute Error), and RMSE (Root Mean Squared Error). That is, the translated subtitle generation unit 140 may generate the translated sentence by selecting the candidate that shows the best performance (quality). Machine translation quality prediction is usually performed at the document, sentence, word-phrase (eojeol), word, or morpheme level; in the present invention, it may be performed at at least one of these levels.

Among the performance measures described above, a higher Pearson's correlation score and lower MAE and RMSE scores indicate a better translation. According to the scores of the performance measures, the alignment that has the optimal machine translation quality prediction value among the translated subtitles (translated sentences) may be determined, and the optimal translated subtitle may be generated using the determined alignment value.
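For reference, the three measures can be computed as in the sketch below, using their standard textbook definitions. The description does not state which two series are compared, so the argument names `predicted` and `reference` are assumptions (for example, predicted quality scores against known quality labels).

```python
import math
from typing import Sequence


def mae(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute error: lower is better."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean squared error: lower is better."""
    return math.sqrt(
        sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)
    )


def pearson(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Pearson's correlation coefficient in [-1, 1]: higher is better."""
    mean_p = sum(predicted) / len(predicted)
    mean_r = sum(reference) / len(reference)
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_p * var_r)
```

Note the opposite directions: Pearson's correlation rewards agreement in trend, while MAE and RMSE penalize the size of the error.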

Depending on the embodiment, the quality evaluation (quality prediction) of a translated sentence may be based on a plurality of performance measure values. In this case, the quality evaluation may be carried out by applying a predetermined weight to each of the measures and then summing (or averaging) the weighted values. Because Pearson's correlation and MAE (including RMSE) point in opposite directions, the reciprocal of one of them may be used. When the reciprocal of Pearson's correlation is used, a lower sum (or average) indicates better quality; when the reciprocal of MAE is used, a higher sum (or average) indicates better quality.
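One possible reading of this weighted combination is sketched below, taking the variant in which the reciprocals of MAE and RMSE are used so that a higher total means better quality. The function name, the default weights, and the small `eps` guard against division by zero are all assumptions; the description gives no concrete values.

```python
def combined_quality(
    pearson_r: float,
    mae_value: float,
    rmse_value: float,
    weights: tuple[float, float, float] = (1.0, 1.0, 1.0),
    eps: float = 1e-9,
) -> float:
    """Weighted sum of Pearson's r and the reciprocals of MAE and RMSE.

    Higher is better. `eps` (an assumption, not from the description)
    avoids dividing by zero when an error measure is exactly 0.
    """
    w_pearson, w_mae, w_rmse = weights
    return (
        w_pearson * pearson_r
        + w_mae / (mae_value + eps)
        + w_rmse / (rmse_value + eps)
    )
```

Because reciprocals of small errors grow quickly, the weights would need tuning so that one term does not dominate the total.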

First, according to an embodiment of the present invention, the machine-translated subtitle (machine-translated sentence) with the highest quality can be derived. In addition, because quality prediction does not require a reference translation, it can be regarded as the most suitable quality assessment method for generating translated subtitles from automatically generated, speech-recognition-based subtitles. In general, the BLEU score (Bilingual Evaluation Understudy score) is used to measure machine translation quality and performance. Because BLEU requires a reference translation, it is a time-consuming and costly metric when translated subtitles are generated from automatically generated subtitles, since a person has to produce every reference translation by hand. The present invention therefore also offers significant advantages in terms of time and cost.

As described above, the present invention uses machine translation quality prediction to address the degradation in machine translation quality that arises from the technical limitations of generating translated subtitles from automatically generated subtitles.
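Putting the steps together, the overall flow (speech recognition, candidate generation, machine translation, quality prediction, selection) might be sketched as follows. The three models are injected as plain callables because the description does not specify them; all names here are hypothetical, and candidates are generated as word-level prefixes, following the 'I am a boy' example.

```python
from typing import Callable


def generate_translated_subtitle(
    audio: bytes,
    recognize: Callable[[bytes], str],
    translate: Callable[[str], str],
    qe_score: Callable[[str, str], float],
) -> str:
    """Return the candidate translation with the highest predicted quality."""
    # Step 1: speech recognition yields text in the first language.
    text = recognize(audio)
    words = text.split()
    if not words:
        raise ValueError("speech recognition produced no text")

    # Step 2: build source sentences of different lengths (word-level prefixes).
    sources = [" ".join(words[:n]) for n in range(1, len(words) + 1)]

    # Step 3: translate every candidate into the second language.
    translations = [translate(source) for source in sources]

    # Step 4: predict quality from (source, translation) only, then select.
    scores = [qe_score(s, t) for s, t in zip(sources, translations)]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return translations[best_index]
```

Injecting the models keeps the selection logic independent of any particular recognizer, translator, or quality predictor.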

The storage unit 150 may store the data received by the data receiving unit 110, the result of speech recognition by the speech recognition unit 120, the result of machine translation by the machine translation unit 130, the result of machine translation quality prediction by the translated subtitle generation unit 140, the selected translated sentence, and the like. The storage unit 150 may also store the speech recognition model used for speech recognition, the machine translation model used for machine translation, the machine translation quality prediction model used for quality prediction, and the like.

The device described above may be implemented as a hardware component, a software component, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will appreciate that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiment, and vice versa.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents. Therefore, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

10: translated subtitle generation apparatus
110: data receiving unit
120: speech recognition unit
130: machine translation unit
140: translated subtitle generation unit
150: storage unit

Claims

A method for generating a translated sentence, performed by a computing device including at least a processor, the method comprising:
generating a text sentence in a first language by performing speech recognition on speech data in the first language;
generating, from the text sentence, a plurality of text sentences having different lengths;
generating a plurality of translated sentences in a second language by performing machine translation on each of the plurality of text sentences; and
performing machine translation quality prediction on each of the plurality of translated sentences, and selecting, based on a result of the machine translation quality prediction, one of the plurality of translated sentences as the translated sentence corresponding to the speech data.
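
As a non-limiting illustration, the candidate generation, translation, quality prediction, and selection steps recited above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the function names are hypothetical, the speech recognition output is taken as an input string, candidates of different lengths are formed as leading word spans (one possible reading of "different lengths"), and the translation and quality prediction models are passed in as callables.

```python
from typing import Callable, List, Tuple


def make_candidates(sentence: str, lengths: List[int]) -> List[str]:
    """Build text candidates of different lengths.

    Assumption for illustration only: each candidate is the first n words
    of the recognized sentence. The source does not fix this strategy here.
    """
    words = sentence.split()
    candidates: List[str] = []
    for n in lengths:
        candidate = " ".join(words[:n])
        if candidate and candidate not in candidates:
            candidates.append(candidate)
    return candidates


def select_translation(
    recognized_sentence: str,
    lengths: List[int],
    translate: Callable[[str], str],
    estimate_quality: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Translate every candidate and return the one with the highest predicted quality."""
    candidates = make_candidates(recognized_sentence, lengths)
    if not candidates:
        raise ValueError("no candidate text sentences could be generated")
    scored = []
    for source in candidates:
        translation = translate(source)  # hypothetical MT model
        score = estimate_quality(source, translation)  # hypothetical QE model
        scored.append((translation, score))
    # Select the translated sentence with the highest quality prediction value.
    return max(scored, key=lambda pair: pair[1])


# Toy stand-ins so the sketch runs without any real models.
def toy_translate(text: str) -> str:
    return text.upper()


def toy_quality(source: str, translation: str) -> float:
    # Pretend that four-word segments translate best.
    return -abs(len(source.split()) - 4)


best, best_score = select_translation("a b c d e f", [2, 4, 6], toy_translate, toy_quality)
```

With the toy stand-ins, the four-word candidate receives the highest predicted score and is selected. In practice the two callables would wrap whatever translation and quality prediction models are available.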

The method of claim 1,
wherein the machine translation quality prediction uses at least one of Pearson's correlation, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) as an evaluation criterion.
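
For reference, the three evaluation criteria named above have standard definitions, sketched below in plain Python. This is an illustrative sketch only: the function names are hypothetical, and it assumes the criteria compare predicted quality scores against reference (for example, human-assigned) scores, which is the usual way such criteria are applied to a quality prediction model.

```python
import math
from typing import Sequence


def pearson(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Pearson's correlation between predicted and reference scores."""
    n = len(pred)
    if n == 0 or n != len(gold):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0 or var_g == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_g)


def mae(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Mean absolute error (MAE)."""
    if len(pred) == 0 or len(pred) != len(gold):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)


def rmse(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Root mean squared error (RMSE)."""
    if len(pred) == 0 or len(pred) != len(gold):
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(pred))


# Made-up scores, for illustration only.
predicted = [1.0, 2.0, 3.0]
reference = [2.0, 4.0, 6.0]
r = pearson(predicted, reference)
err_mae = mae(predicted, reference)
err_rmse = rmse(predicted, reference)
```

A higher Pearson's correlation indicates closer agreement with the reference scores, whereas lower MAE and RMSE values indicate smaller prediction error.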

The method of claim 1,
wherein the selecting of the translated sentence comprises
selecting, from among the plurality of translated sentences, the translated sentence having the highest machine translation quality prediction value.

An apparatus for generating translated subtitles, the apparatus comprising:
a speech recognition unit configured to generate a text sentence in a first language by performing speech recognition on speech data in the first language;
a machine translation unit configured to generate, from the text sentence, a plurality of text sentences having different lengths, and to generate a plurality of translated sentences in a second language by performing machine translation on each of the plurality of text sentences; and
a translated subtitle generation unit configured to perform machine translation quality prediction on each of the plurality of translated sentences, and to select, based on a result of the machine translation quality prediction, one of the plurality of translated sentences as a translated sentence corresponding to the speech data.

The apparatus of claim 4,
wherein the machine translation quality prediction uses at least one of Pearson's correlation, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) as an evaluation criterion.

The apparatus of claim 4,
wherein the translated subtitle generation unit selects, from among the plurality of translated sentences, the translated sentence having the highest machine translation quality prediction value.