KR102457822B1

KR102457822B1 - apparatus and method for automatic speech interpretation

Info

Publication number: KR102457822B1
Application number: KR1020200091907A
Authority: KR
Inventors: 윤승; 김상훈; 이민규
Original assignee: 한국전자통신연구원
Priority date: 2019-08-14
Filing date: 2020-07-23
Publication date: 2022-10-24
Also published as: KR20210020774A

Abstract

The automatic interpretation method of the present invention is performed by a counterpart terminal communicating with a speaker terminal and includes: receiving, by a communicator, from the speaker terminal, an automatic translation result obtained by automatically translating into a target language a voice uttered by a speaker in a source language, together with voice feature information of the speaker; and performing, by a voice synthesizer, voice synthesis based on the automatic translation result and the voice feature information, and outputting a personalized synthesized voice as an automatic interpretation result. Here, the voice feature information of the speaker includes a hidden variable, which contains a first additional voice feature and a voice feature parameter extracted from the speaker's voice, and a second additional voice feature.

Description

Automatic interpretation device and method {apparatus and method for automatic speech interpretation}

The present invention relates to automatic interpretation technology.

Automatic interpretation technology converts a voice uttered by a speaker in one language into another language through processes such as voice recognition and automatic translation, and outputs the result either as text subtitles or as a synthesized voice.

Recently, as interest in voice synthesis, one of the component technologies of automatic interpretation, has grown, research has moved beyond simple message delivery toward 'personalized voice synthesis'.

Personalized voice synthesis refers to a technology that outputs the target language, converted (or translated) from the source language through voice recognition and automatic translation, as a synthesized voice close to the speaker's vocal tone (or speaking style).

Meanwhile, now that most users own a personal terminal such as a smartphone and overseas travel has become commonplace, personal terminals with a built-in automatic interpretation function and a variety of automatic interpretation apps are being released.

Such a personal terminal of the speaker (hereinafter, the 'speaker terminal') can automatically translate a voice uttered by the speaker in the source language into the target language and then, through a personalized voice synthesis process, play the target-language result as a personalized synthesized voice close to the speaker's own vocal tone.

However, when another user's terminal (hereinafter, the 'counterpart terminal') is to play a personalized synthesized voice close to the speaker's voice as an automatic interpretation result, the speaker terminal must provide the speaker's original voice file to the counterpart terminal, and the counterpart terminal must analyze that file to extract information related to the speaker's voice features.

The counterpart terminal then performs voice synthesis by combining the extracted voice feature information with the translation produced at the counterpart terminal, thereby playing a personalized synthesized voice similar to the speaker's voice as the automatic interpretation result.

Because the counterpart terminal must extract voice features from the original voice file provided by the speaker terminal in order to play such a personalized synthesized voice, the processing time needed for this extraction degrades the real-time performance of automatic interpretation.

In addition, a transmission delay caused by the communication environment may occur while the speaker terminal transmits the original voice file to the counterpart terminal, and this delay likewise degrades the real-time performance of automatic interpretation.

Furthermore, the voice synthesis process of conventional automatic interpretation technology cannot convert the personalized synthesized voice into a tone desired by the user.

To solve the above problems, an object of the present invention is to provide an automatic interpretation apparatus and method that allow the counterpart terminal to output a personalized synthesized voice close to the speaker's voice (or tone) as an automatic interpretation result without the speaker terminal transmitting the speaker's original voice file itself to the counterpart terminal.

Another object of the present invention is to provide an automatic interpretation apparatus and method capable of freely adjusting and converting the tone of the personalized synthesized voice.

The above and other objects, advantages, and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below in conjunction with the accompanying drawings.

An automatic interpretation method according to one aspect of the present invention for achieving the above objects is performed by a counterpart terminal communicating with a speaker terminal and includes: receiving, by a communicator, from the speaker terminal, an automatic translation result obtained by automatically translating into a target language a voice uttered by a speaker in a source language, together with voice feature information of the speaker; and performing, by a voice synthesizer, voice synthesis based on the automatic translation result and the voice feature information, and outputting a personalized synthesized voice as an automatic interpretation result. Here, the voice feature information of the speaker includes a hidden variable, which contains a first additional voice feature and a voice feature parameter extracted from the speaker's voice, and a second additional voice feature.

An automatic interpretation method according to another aspect of the present invention is performed by a speaker terminal communicating with a counterpart terminal and includes: extracting, by a first voice feature extractor, a hidden variable containing a first additional voice feature and a voice feature parameter from a voice uttered by a speaker; extracting, by a second voice feature extractor, a second additional voice feature from the voice; performing, by a voice recognizer, voice recognition on the voice to obtain a voice recognition result; performing, by an automatic translator, automatic translation on the voice recognition result to obtain an automatic translation result; and transmitting, by a communicator, the automatic translation result, the hidden variable, and the second additional voice feature to the counterpart terminal.
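
The order of the speaker-side steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `InterpretationPayload` type, the function `speaker_side_interpret`, and the callables passed to it are hypothetical placeholders standing in for the first and second voice feature extractors, the voice recognizer, the automatic translator, and the communicator.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class InterpretationPayload:
    """What the speaker terminal sends (hypothetical container)."""
    translation: str                       # automatic translation result
    hidden_variable: List[float]           # holds first additional feature + voice feature parameter
    second_additional_feature: List[float]


def speaker_side_interpret(
    audio: Sequence[float],
    extract_hidden: Callable[[Sequence[float]], List[float]],
    extract_second: Callable[[Sequence[float]], List[float]],
    recognize: Callable[[Sequence[float]], str],
    translate: Callable[[str], str],
    send: Callable[[InterpretationPayload], None],
) -> InterpretationPayload:
    # Steps 1 and 2: both extractors work on the uttered voice itself.
    hidden = extract_hidden(audio)
    second = extract_second(audio)
    # Step 3: voice recognition yields source-language text.
    source_text = recognize(audio)
    # Step 4: automatic translation yields target-language text.
    target_text = translate(source_text)
    # Step 5: only text and features are transmitted, never the audio.
    payload = InterpretationPayload(target_text, hidden, second)
    send(payload)
    return payload
```

In this sketch the raw `audio` never reaches `send`, which mirrors the point that the counterpart terminal receives feature information rather than a voice file.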

An automatic interpretation apparatus according to yet another aspect of the present invention is included in a counterpart terminal communicating with a speaker terminal and includes: a communicator that receives, from the speaker terminal, an automatic translation result obtained by automatically translating into a target language a voice uttered by a speaker in a source language, together with voice feature information of the speaker; and a voice synthesizer that performs voice synthesis based on the automatic translation result and the voice feature information and outputs a personalized synthesized voice as an automatic interpretation result. Here, the voice feature information of the speaker includes a hidden variable, which contains a first additional voice feature and a voice feature parameter extracted from the speaker's voice, and a second additional voice feature.

According to the present invention, the speaker terminal does not transmit to the counterpart terminal the voice file of the personalized synthesized voice that is the automatic interpretation result. Instead, it transmits standardized voice feature information extracted from the voice uttered by the user of the speaker terminal, for example, an additional voice feature and a hidden variable containing it. The counterpart terminal can therefore output, as the automatic interpretation result, a personalized synthesized voice similar to the voice of the user of the speaker terminal.

In addition, because the voice feature information can be freely adjusted by the user, personalized synthesized voices with the various tones the user desires can be generated.

FIG. 1 is a block diagram showing the overall system configuration for automatic interpretation according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the internal configuration of the speaker terminal shown in FIG. 1.
FIG. 3 is a block diagram showing the internal configuration of the voice synthesizer shown in FIG. 2.
FIG. 4 is a block diagram showing the internal configuration of the counterpart terminal shown in FIG. 1.
FIG. 5 is a block diagram showing the internal configuration of the voice synthesizer shown in FIG. 4.
FIG. 6 is a flowchart illustrating an automatic interpretation method performed by the counterpart terminal according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating an automatic interpretation method performed by the speaker terminal according to an embodiment of the present invention.

The specific structural or functional descriptions of the embodiments according to the concept of the present invention disclosed in this specification are given only to explain those embodiments. Embodiments according to the concept of the present invention may be implemented in various forms and are not limited to the embodiments described herein.

Because embodiments according to the concept of the present invention may be modified in various ways and take various forms, the embodiments are illustrated in the drawings and described in detail in this specification. This is not intended to limit the embodiments to the specific forms disclosed; they include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

The terms used in this specification serve only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. The scope of the patent application is not, however, restricted or limited by these embodiments. The same reference numerals in the drawings denote the same members.

FIG. 1 is a block diagram showing the overall system configuration for automatic interpretation according to an embodiment of the present invention.

Referring to FIG. 1, the overall system 300 for automatic interpretation according to an embodiment of the present invention includes a speaker terminal 100 and a counterpart terminal 200.

The speaker terminal 100 and the counterpart terminal 200 each perform the automatic interpretation process according to the present invention. To do so, each of the speaker terminal 100 and the counterpart terminal 200 may be a computing device having a data processing function.

The computing device may be a portable mobile device such as a smartphone, wearable device, headset, notebook computer, PDA, tablet, smart glasses, or smart watch, but is not limited thereto and may also be a stationary device such as a desktop computer.

In addition to the data processing function, each of the speaker terminal 100 and the counterpart terminal 200 has a wired and/or wireless communication function. The speaker terminal 100 and the counterpart terminal 200 use wired and/or wireless communication to exchange the information each needs for automatic interpretation.

Wireless communication includes, for example, short-range wireless communication, mobile communication (e.g., 3G, 4G, and 5G communication), and wireless Internet communication. Examples of short-range wireless communication include Wi-Fi and Bluetooth.

Although FIG. 1 shows one speaker terminal 100 communicating with one counterpart terminal 200, one speaker terminal 100 may also communicate with a plurality of counterpart terminals.

The automatic interpretation process (automatic interpretation or automatic speech interpretation) performed by the speaker terminal 100 and the counterpart terminal 200 comprises the following component technologies: voice recognition (also called automatic speech recognition or speech-to-text), automatic translation (also called automatic speech translation), and voice synthesis (also called speech synthesis).

Accordingly, the information exchanged between the speaker terminal 100 and the counterpart terminal 200 for automatic interpretation includes information needed for voice recognition, information needed for automatic translation, and information needed for voice synthesis.

As described below, the information needed for voice synthesis according to the present invention includes information on additional voice features and information on hidden variables (also called latent variables) related to personal voice features.

The counterpart terminal 200 according to the present invention uses the speaker's intrinsic voice characteristics received from the speaker terminal 100, that is, the additional voice features and the hidden variable related to the personal voice feature, as the information needed for voice synthesis, and outputs (plays) a personalized synthesized voice close to the speaker's voice as the automatic interpretation result.

Thus, in the present invention, when the counterpart terminal 200 is to play a personalized synthesized voice close to the speaker's voice as the automatic interpretation result, it receives from the speaker terminal 100 not the original voice file itself but the normalized additional voice features related to the speaker's intrinsic voice characteristics and the hidden variable related to the personal voice feature. The counterpart terminal 200 therefore does not need to extract the speaker's voice features from a voice file of the speaker in order to perform voice synthesis.
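
To make the contrast with sending a raw voice file concrete, the following minimal sketch serializes only the translation result and the voice feature information. The field names, the use of JSON, and the helper names `encode_message` and `decode_message` are assumptions chosen for illustration; the specification does not define a wire format.

```python
import json
from typing import List, Tuple


def encode_message(
    translation: str,
    hidden_variable: List[float],
    additional_features: List[float],
) -> bytes:
    """Pack the items the speaker terminal transmits (hypothetical format)."""
    message = {
        "translation": translation,              # target-language text
        "hidden_variable": hidden_variable,      # latent vector for the personal voice feature
        "additional_features": additional_features,
    }
    # Note: there is deliberately no field carrying the audio waveform.
    return json.dumps(message, ensure_ascii=False).encode("utf-8")


def decode_message(raw: bytes) -> Tuple[str, List[float], List[float]]:
    """Unpack a received message at the counterpart terminal."""
    message = json.loads(raw.decode("utf-8"))
    return (
        message["translation"],
        message["hidden_variable"],
        message["additional_features"],
    )
```

The counterpart side can pass the decoded values straight to its synthesizer, with no feature extraction step in between.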

Accordingly, because the counterpart terminal 200 neither receives the speaker's original voice file nor extracts the speaker's voice features from it, the counterpart terminal 200 can process automatic interpretation in real time.

In addition, when the voice synthesizer of the speaker terminal 100 and that of the counterpart terminal 200 are built on different neural network structures, the personalized synthesized voice played through the voice synthesis process of the counterpart terminal 200 may differ from the speaker's voice, or voice synthesis at the counterpart terminal 200 may be impossible.

In the present invention, however, the speaker terminal 100 provides the counterpart terminal 200 with the speaker's voice features, that is, the normalized additional voice features and the hidden variable related to the personal voice feature, and the counterpart terminal 200 trains its own voice synthesizer based on them. For real-time automatic interpretation, voice synthesis may also be performed in real time without such training.

Accordingly, the present invention can significantly reduce the difference between the tone of the personalized synthesized voice played by the speaker terminal 100 and that played by the counterpart terminal 200, and can reduce dependence on the specification (neural network structure) of the voice synthesizer installed in each terminal.

The internal configurations of the speaker terminal and the counterpart terminal for achieving the above technical effects are described in detail below.

Before the description, note that the speaker terminal 100 and the counterpart terminal 200 shown in FIG. 1 are designed identically so as to perform the same automatic interpretation process.

The counterpart terminal 200 receives the information needed for voice synthesis (the additional voice features and the hidden variable) from the speaker terminal 100 and performs voice synthesis based on it. Conversely, the speaker terminal 100 may perform voice synthesis based on the additional voice features and the hidden variable related to the personal voice feature received from the counterpart terminal 200.

That is, depending on which terminal receives the information needed for voice synthesis, the speaker terminal may act as the counterpart terminal, and the counterpart terminal may act as the speaker terminal.

In the present invention, it is assumed that the terminal transmitting the information needed for voice synthesis does not perform voice synthesis and that the terminal receiving the information does. Naturally, when the transmitting terminal itself receives such information from another terminal, it performs voice synthesis.
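
The role symmetry described above can be pictured with a small sketch in which one class plays both roles. The `Terminal` class, its method names, and the string standing in for synthesized audio are hypothetical; the sketch only shows that synthesis happens on whichever side receives the information.

```python
from typing import List


class Terminal:
    """One device design that can act as speaker or counterpart (illustrative)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.played: List[str] = []  # stand-in for synthesized audio output

    def speak(self, translation: str, voice_info: str, peer: "Terminal") -> None:
        # Sender role: transmit text and voice feature info, no synthesis here.
        peer.receive(translation, voice_info)

    def receive(self, translation: str, voice_info: str) -> None:
        # Receiver role: synthesize using the sender's voice feature info.
        self.played.append(f"{translation} [voice={voice_info}]")
```

Calling `a.speak(..., peer=b)` makes `b` the synthesizing side; reversing the call reverses the roles without any change to the class.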

FIG. 2 is a block diagram showing the internal configuration of the speaker terminal shown in FIG. 1.

Referring to FIG. 2, the speaker terminal 100 may be a dedicated interpretation terminal for automatic interpretation or a terminal that includes an automatic interpretation apparatus.

For automatic interpretation, the speaker terminal 100 includes a voice collector 101, a voice recognizer 103, an automatic translator 105, a voice synthesizer 107, a voice output unit 109, and a communicator 111, and additionally includes an additional voice feature extractor 113, a personal voice feature encoder 115, a storage unit 117, an additional voice feature adjuster 119, a personal voice feature adjuster 121, and a learner 123.

The speaker terminal 100 may further include a processor 125 that controls and executes the operations of the above components (101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 121, and 123), and a memory 127 that temporarily stores intermediate and result data processed by the processor 125 or provides an execution space for programs or software modules related to automatic interpretation executed by the processor 125.

The processor 125 may be referred to as a central processing unit. It is a hardware unit that includes at least one arithmetic logic unit (ALU) and processing registers and, based on them, provides data processing functions such as data format conversion.

Voice Collector (101)

The voice collector 101 collects the voice (voice signal) uttered by the speaker in the source language and may be, for example, a microphone. The processing applied to the analog voice collected by the voice collector 101, such as noise removal, amplification, frequency conversion, sampling, and conversion into digital data, is not a core feature of the present invention, so a detailed description is left to known techniques.

Voice Recognizer (103)

The voice recognizer 103 may be a hardware module controlled by the processor 125 or a digital circuit embedded in the processor 125. Alternatively, the voice recognizer 103 may be a software module loaded into the memory 127 and executed by the processor 125. The software module may be a program written in a particular programming language.

The voice recognizer 103 performs a voice recognition process to convert the voice collected by the voice collector 101 into a sentence (or character string). For this process, for example, a probabilistic-statistical acoustic model, a language model, or an end-to-end voice recognition architecture may be used.

Here, the acoustic model may be, for example, a Gaussian Mixture Model (GMM) or a Deep Neural Network (DNN), which is one of the deep learning architectures, and the language model may be, for example, an N-gram model or an RNN (Recursive Neural Network). Alternatively, the acoustic model and the language model may be integrated into a single end-to-end structure.
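
As an illustration of the N-gram language model mentioned above, the following minimal sketch builds a bigram (N = 2) model with add-one smoothing from a toy corpus. The function names, the whitespace tokenization, and the choice of smoothing are assumptions for illustration only; the specification does not prescribe a particular N-gram implementation.

```python
from collections import Counter
from typing import List, Tuple

BigramModel = Tuple[Counter, Counter, int]


def train_bigram(sentences: List[str]) -> BigramModel:
    """Count bigrams and their left contexts over whitespace tokens."""
    contexts: Counter = Counter()
    bigrams: Counter = Counter()
    vocab = {"</s>"}  # words that can be predicted, including end-of-sentence
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])                 # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams, len(vocab)


def bigram_prob(prev: str, word: str, model: BigramModel) -> float:
    """P(word | prev) with add-one (Laplace) smoothing."""
    contexts, bigrams, vocab_size = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
```

A recognizer can use such probabilities to prefer word sequences that are more plausible in the source language when the acoustic evidence is ambiguous.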

The voice recognition result obtained by the voice recognizer 103 may include, in addition to the source-language sentence (or character string) uttered by the speaker, information on sentence boundaries.
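
One possible shape for such a recognition result is sketched below. The `RecognitionResult` type and the representation of sentence boundaries as character offsets are assumptions made for illustration; the specification only states that boundary information may accompany the recognized text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecognitionResult:
    text: str              # recognized source-language text
    boundaries: List[int]  # end offset (exclusive) of each sentence in `text`


def split_sentences(result: RecognitionResult) -> List[str]:
    """Cut the recognized text into sentences using the boundary offsets."""
    sentences = []
    start = 0
    for end in result.boundaries:
        sentences.append(result.text[start:end].strip())
        start = end
    return sentences
```

Splitting on these boundaries lets a translator process one sentence at a time instead of one long character string.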

Automatic Translator (105)

The automatic translator 105 may be a hardware module controlled by the processor 125 or a digital circuit embedded in the processor 125. Alternatively, the automatic translator 105 may be a software module loaded into the memory 127 and executed by the processor 125.

The automatic translator 105 converts the voice recognition result input from the voice recognizer 103, that is, a sentence in the source language (hereinafter, the 'source language sentence'), into a sentence (or character string) in the target language (hereinafter, the 'target language sentence').

To convert the source language sentence into the target language sentence, for example, Rule-Based Machine Translation (RBMT), Corpus-Based Machine Translation (CBMT), Statistical Based Machine Translation (SBMT), or Neural Machine Translation may be used.

Although the voice recognizer 103 and the automatic translator 105 are shown as separate blocks in FIG. 1, they may be integrated into one. That is, the voice recognition process and the automatic translation process may be integrated into a single process. The form in which the voice recognizer 103 and the automatic translator 105 are integrated into one is called 'end-to-end automatic interpretation'. The present invention can naturally also be applied to an end-to-end automatic interpretation device.

As will be described below, the process of extracting the additional voice feature, performed by the additional voice feature extractor 113, and the process of extracting the hidden variable related to the personal voice feature, performed by the personal voice feature encoder 115, may also be included in the voice recognition process performed by the voice recognizer 103.

Voice synthesizer (107)

The voice synthesizer 107 may be a hardware module controlled by the processor 125 or a digital circuit embedded in the processor 125. Alternatively, the voice synthesizer 107 may be a software module loaded into the memory 127 and executed by the processor 125.

The voice synthesizer 107 generates a personalized synthesized sound by synthesizing the target language sentence input from the automatic translator 105 with the voice features (23 and 25 in FIG. 4) of the counterpart speaker of the counterpart terminal 200. To generate the personalized synthesized sound, the voice synthesizer 107 may perform a neural-network-based voice synthesis process.

The neural network structure of the voice synthesizer 107 may be, for example, a neural network model such as an encoder-decoder model in which the encoder and the decoder are each configured as a recurrent neural network (RNN). In this case, each of the encoder and the decoder may be composed of a plurality of memory cells or a plurality of RNN cells.

Such a neural-network-based voice synthesizer 107 may be trained in advance by the learner 123. For example, the learner 123 may train the voice synthesizer 107 using, as training data, the input texts of multiple speakers and the additional voice features of multiple speakers, including the counterpart speaker of the counterpart terminal 200.

Hereinafter, the voice synthesizer 107 will be described in more detail with reference to FIG. 3.

FIG. 3 is a block diagram showing the internal configuration of the voice synthesizer shown in FIG. 2.

Referring to FIG. 3, the voice synthesizer 107 may be broadly configured to include an encoder 107A, a dimension normalizer 107B, a decoder 107C, and a vocoder 107D.

The encoder 107A is a neural network, such as an RNN composed of a plurality of RNN cells, and encodes the automatic translation result 10 input from the automatic translator 105, the additional voice feature 23 provided from the counterpart terminal 200 through the communicator 111, and the hidden variable 25 related to the personal voice feature provided from the counterpart terminal 200 through the communicator 111. Here, the additional voice feature 23 and the hidden variable 25 are information related to the voice features of the counterpart speaker, extracted by the counterpart terminal 200. The automatic translation result 10 may also be replaced with the automatic translation result (30 in FIG. 4) provided from the counterpart terminal 200 through the communicator 111.

The encoding result output from the encoder 107A is configured to include linguistic content obtained from the automatic translation result 10 and acoustic content (acoustic features) obtained from the additional voice feature 23.

The linguistic content may be, for example, information configured to include the character string (text) obtained from the automatic translation result 10 (the target language sentence) and the phonemes extracted from it.

The acoustic content may be, for example, information related to the tone (intonation, intensity, pitch, and so on) of the counterpart speaker, obtained from the additional voice feature 23 provided from the counterpart terminal 200.

The dimension normalizer 107B normalizes the encoding result 60 input from the encoder 107A and the hidden variable 25 related to the personal voice feature provided from the counterpart terminal 200 to the same data dimension so that the two can be combined.

The personal voice feature hidden variable is provided from the personal voice feature encoder 215 in the counterpart terminal 200. The personal voice feature encoder 215 in the counterpart terminal 200 has a neural network structure. If the neural network structure of the personal voice feature encoder 215 differs from that of the encoder 107A, the personal voice feature hidden variable 25 and the encoding result 60 from the encoder 107A may have different data dimensions.

If the personal voice feature hidden variable 25 and the encoding result 60, which have different data dimensions, are input to the decoder 107C described below, the decoder 107C outputs an inaccurate decoding result.

To obtain an accurate decoding result, the dimension normalizer 107B normalizes the hidden variable 25 and the encoding result 60 to the same data dimension so that their data can be combined.

Of course, if the personal voice feature encoder 215 in the counterpart terminal has a neural network structure identical or similar to that of the encoder 107A, the dimension normalization process performed by the dimension normalizer 107B may be skipped.
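The specification does not state how the dimension normalizer 107B brings the two vectors to a common dimension. As one possible reading, the sketch below projects the hidden variable onto the dimension of the encoding result with a weight matrix and skips the projection when the dimensions already match. The function names, the use of a linear projection, and the element-wise sum used as the "combination" are all assumptions of this sketch.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def project(vec: Vector, weights: Matrix) -> Vector:
    """Map vec (length n) to length m using an m x n weight matrix."""
    if any(len(row) != len(vec) for row in weights):
        raise ValueError("each weight row must match the input length")
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def normalize_dimensions(hidden: Vector, encoding: Vector,
                         weights: Matrix) -> Tuple[Vector, Vector]:
    """Return (hidden, encoding) with equal lengths; no-op if already equal."""
    if len(hidden) == len(encoding):
        return hidden, encoding
    projected = project(hidden, weights)
    if len(projected) != len(encoding):
        raise ValueError("projection does not match the encoding dimension")
    return projected, encoding


def combine(hidden: Vector, encoding: Vector) -> Vector:
    """Element-wise sum, which requires equal dimensions (assumed combination)."""
    return [h + e for h, e in zip(hidden, encoding)]


# Hypothetical 2-dimensional hidden variable and 3-dimensional encoding result.
hidden_25 = [1.0, 2.0]
encoding_60 = [0.5, 0.5, 0.5]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # would be learned in practice

hidden_n, encoding_n = normalize_dimensions(hidden_25, encoding_60, weights)
combined = combine(hidden_n, encoding_n)
print(combined)
```

In a trained system the weight matrix would be learned together with the decoder; the fixed values here only make the sketch executable.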

The decoder 107C decodes the hidden variable 25 related to the personal voice feature provided from the counterpart terminal 200 and the encoding result 60. This decoding process may be a process of generating parameters that determine the voice of the counterpart speaker. Here, the parameters may be, for example, a spectrogram-based feature vector.

The vocoder 107D generates a personalized synthesized sound as the automatic interpretation result, based on the parameters input from the decoder 107C. If the decoder 107C directly generates the decoding result as a personalized synthesized sound, the vocoder 107D may be omitted from the design.
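The data flow through the four blocks of FIG. 3 can be summarized by the following sketch. It only shows the order in which the stages are chained, including the two optional steps (skipping normalization when dimensions match, and omitting the vocoder). The function `run_synthesizer` and the `toy_*` stand-ins are hypothetical; real stages would be trained neural networks, and the toy functions carry no acoustic meaning.

```python
from typing import Callable, List, Optional

Vector = List[float]


def run_synthesizer(
    translation: str,
    additional_feature: Vector,
    hidden_variable: Vector,
    encode: Callable[[str, Vector], Vector],
    normalize: Callable[[Vector, int], Vector],
    decode: Callable[[Vector, Vector], Vector],
    vocode: Optional[Callable[[Vector], Vector]] = None,
) -> Vector:
    """Chain encoder, dimension normalizer, decoder and optional vocoder."""
    encoded = encode(translation, additional_feature)        # encoder 107A
    if len(hidden_variable) != len(encoded):                 # normalizer 107B
        hidden_variable = normalize(hidden_variable, len(encoded))
    params = decode(encoded, hidden_variable)                # decoder 107C
    return vocode(params) if vocode is not None else params  # vocoder 107D


# Toy stand-ins so that the sketch can be executed.
def toy_encode(text: str, feature: Vector) -> Vector:
    return [float(len(text))] + list(feature)


def toy_normalize(vec: Vector, dim: int) -> Vector:
    return (list(vec) + [0.0] * dim)[:dim]  # zero-pad or truncate


def toy_decode(encoded: Vector, hidden: Vector) -> Vector:
    return [e + h for e, h in zip(encoded, hidden)]


def toy_vocode(params: Vector) -> Vector:
    return [p * 0.5 for p in params]


with_vocoder = run_synthesizer("hello", [1.0, 2.0], [0.5],
                               toy_encode, toy_normalize, toy_decode, toy_vocode)
without_vocoder = run_synthesizer("hello", [1.0, 2.0], [0.5],
                                  toy_encode, toy_normalize, toy_decode)
print(with_vocoder, without_vocoder)
```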

As described above, the voice synthesizer 107 generates a personalized synthesized sound using as inputs the automatic translation result (10, or 30 in FIG. 4), the additional voice feature 23 provided from the counterpart terminal 200, and the hidden variable 25 related to the personal voice feature provided from the counterpart terminal 200.

Depending on the design, the voice synthesizer 107 may perform voice synthesis on the automatic translation result (10, or 30 in FIG. 4) using only one of the additional voice feature 23 and the hidden variable 25 provided from the counterpart terminal 200.

As will be described below, the voice feature carried by the hidden variable contains more information about the tone of the counterpart speaker than the voice feature carried by the additional voice feature. The voice synthesizer 107 may therefore perform voice synthesis on the automatic translation result (10, or 30 in FIG. 4) using only the hidden variable 25 provided from the counterpart terminal 200. In this case, the encoder 107A in the voice synthesizer 107 encodes only the automatic translation result (10, or 30 in FIG. 4).

Of course, the voice synthesizer 107 may also perform voice synthesis on the automatic translation result (10, or 30 in FIG. 4) using only the additional voice feature 23 provided from the counterpart terminal 200. In this case, however, the quality of the resulting personalized synthesized sound may be somewhat lower.

In addition, the voice synthesizer 107 may perform voice synthesis on the automatic translation result (10, or 30 in FIG. 4) using the converted (updated or modified) additional voice feature 23 from the counterpart terminal 200 and/or the converted (updated or modified) hidden variable 25 from the counterpart terminal 200. In this case, the voice synthesizer 107 generates a personalized synthesized sound converted (updated) into the tone desired by the counterpart speaker of the counterpart terminal 200 or by the speaker of the speaker terminal 100.

When the voice synthesizer 107 is to generate a personalized synthesized sound converted (updated) into the tone desired by the speaker, as will be described below, the additional voice feature adjuster 119 and the personal voice feature adjuster 121 in the speaker terminal 100 respectively convert the additional voice feature (23 in FIG. 4) and the hidden variable (25 in FIG. 4) received from the counterpart terminal 200 through the communicator 111, and pass them to the voice synthesizer 107.

Meanwhile, when the voice synthesizer 107 generates a synthesized sound using only the automatic translation result (10, or 30 in FIG. 4) as input, the generated synthesized sound is not a personalized synthesized sound.
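The input combinations described above (both features, hidden variable only, additional voice feature only, or neither) can be tabulated with a small helper. This is a sketch under two assumptions that go beyond the text: the additional voice feature is taken to enter at the encoder and the hidden variable at the decoder, and the names `SynthesisPlan` and `plan_synthesis` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class SynthesisPlan:
    encoder_inputs: List[str]
    decoder_inputs: List[str]
    personalized: bool


def plan_synthesis(additional_feature: Optional[Sequence[float]],
                   hidden_variable: Optional[Sequence[float]]) -> SynthesisPlan:
    """Decide which inputs each stage receives for the available features."""
    encoder_inputs = ["translation"]
    decoder_inputs = ["encoding"]
    if additional_feature is not None:
        encoder_inputs.append("additional_feature")
    if hidden_variable is not None:
        decoder_inputs.append("hidden_variable")
    # Without any voice feature the output is an ordinary synthesized sound.
    personalized = additional_feature is not None or hidden_variable is not None
    return SynthesisPlan(encoder_inputs, decoder_inputs, personalized)


hidden_only = plan_synthesis(None, [0.1, -0.3])
text_only = plan_synthesis(None, None)
print(hidden_only)
print(text_only)
```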

Voice output unit (109)

Referring again to FIG. 1, the voice output unit 109 plays back (outputs) the personalized synthesized sound generated by the voice synthesizer 107 as the automatic interpretation result, and may be, for example, a loudspeaker.

Communicator (111)

The communicator 111 communicates with the counterpart terminal 200 in a wired or wireless manner to exchange information between the speaker terminal and the counterpart terminal.

The communicator 111 transmits the additional voice feature 13 and the personal voice feature hidden variable 15 stored in the storage unit 117 to the counterpart terminal 200. It may additionally transmit the automatic translation result 10 input from the automatic translator 105 to the counterpart terminal 200.

Conversely, the communicator 111 receives the additional voice feature 23 and the hidden variable 25 from the counterpart terminal 200 and passes them to the voice synthesizer 107. The communicator 111 may additionally receive from the counterpart terminal 200 the automatic translation result (30 in FIG. 4) produced by the counterpart terminal 200 and pass it to the voice synthesizer 107.

In addition, the communicator 111 transmits to the counterpart terminal 200 the additional voice feature 19 converted (updated) by the additional voice feature adjuster 119 and/or the hidden variable 21 converted (updated) by the personal voice feature adjuster 121.

Conversely, the communicator 111 may receive from the counterpart terminal 200 the additional voice feature 29 converted (updated) by the additional voice feature adjuster 219 in the counterpart terminal 200 and/or the hidden variable 31 converted (updated) by the personal voice feature adjuster 221, and pass them to the voice synthesizer 107.
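As an illustration of what the communicator exchanges, the sketch below packs the optional items (additional voice feature, hidden variable, translation result) into one message and serializes it as JSON. The class `VoiceFeatureMessage`, its field names, and the choice of JSON are assumptions of this sketch; the specification does not define a wire format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class VoiceFeatureMessage:
    """Hypothetical payload exchanged between the two terminals."""

    additional_feature: Optional[List[float]] = None  # 13, or 19 if adjusted
    hidden_variable: Optional[List[float]] = None     # 15, or 21 if adjusted
    translation: Optional[str] = None                 # automatic translation result 10
    adjusted: bool = False                            # True when 19/21 are sent

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def from_json(cls, raw: str) -> "VoiceFeatureMessage":
        return cls(**json.loads(raw))


sent = VoiceFeatureMessage(
    additional_feature=[0.4, 0.7],
    hidden_variable=[0.1, -0.3, 0.9],
    translation="Where is the station?",
)
received = VoiceFeatureMessage.from_json(sent.to_json())
print(received)
```

Every field is optional so that the same message type covers the cases in which only one of the two voice features, or no translation result, is transmitted.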

Additional voice feature extractor (113)

The additional voice feature extractor 113 may be a hardware module controlled by the processor 125 or a digital circuit embedded in the processor 125. Alternatively, the additional voice feature extractor 113 may be a software module loaded into the memory 127 and executed by the processor 125.

The additional voice feature extractor 113 extracts the additional voice feature 13 (also called an additional voice characteristic) from the speaker's voice (or voice signal) input from the voice collector 101.

The additional voice feature 13 is transmitted to the counterpart terminal 200 through the communicator 111, and the voice synthesizer 207 of the counterpart terminal 200 uses the additional voice feature 13 as information needed for voice synthesis.

The additional voice feature extractor 113 extracts the additional voice feature 13 using, for example, a non-neural-network-based algorithm. Here, the non-neural-network-based algorithm may be an algorithm that analyzes characteristic patterns of waveforms that appear repeatedly in the speaker's voice signal (hereinafter, 'waveform analysis algorithm').

A method of extracting the second additional voice feature 13 without a neural network includes, for example, converting the speaker's voice signal into a digitized waveform, setting a specific period in the digitized waveform, and analyzing the amplitude and period of the waveform with the waveform analysis algorithm to extract a characteristic pattern within the set period.
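The three steps above (digitize, select a period, analyze amplitude and period) can be sketched without any neural network as follows. The specification does not name a particular waveform analysis algorithm; plain autocorrelation is used here as a stand-in, and the function names, the 16-bit quantization, and the 60 to 400 Hz search range are assumptions of this sketch.

```python
import math
from typing import List, Tuple


def digitize(signal: List[float], bits: int = 16) -> List[int]:
    """Quantize samples in [-1.0, 1.0] to signed integers."""
    scale = 2 ** (bits - 1) - 1
    return [int(round(max(-1.0, min(1.0, s)) * scale)) for s in signal]


def analyze_frame(frame: List[int], sample_rate: int,
                  min_hz: float = 60.0, max_hz: float = 400.0) -> Tuple[int, float]:
    """Return (peak amplitude, estimated fundamental frequency in Hz)."""
    peak = max(abs(s) for s in frame)
    min_lag = max(1, int(sample_rate / max_hz))
    max_lag = min(int(sample_rate / min_hz), len(frame) - 1)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation: how well the frame matches itself shifted by lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return peak, sample_rate / best_lag


# Synthetic test signal standing in for a voiced segment: a 200 Hz tone.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
waveform = digitize(tone)            # step 1: digitized waveform
frame = waveform[:400]               # step 2: a 50 ms analysis period
peak, f0 = analyze_frame(frame, sample_rate)  # step 3: amplitude and period
print(peak, f0)
```

The peak value relates to intensity and the estimated period to pitch, two of the additional voice features named in this description.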

Although the additional voice feature extractor 113 is shown as an independent block in FIG. 1, it may be integrated into the voice recognizer 103. In this case, the voice recognition process performed by the voice recognizer 103 further includes the process of extracting the additional voice feature.

The additional voice feature may be a voice feature related to the tone or style of the speaker's voice, representing emotion (calm, politeness, fear, happiness, positive, negative, and so on), intensity, intonation, pitch, speed, duration, and the like.

The additional voice feature extractor 113 maps the additional voice features extracted from the speaker's voice signal to the voice signal and classifies the mapping results according to predetermined time intervals.

The additional voice feature extractor 113 then quantifies the classified mapping results according to a predetermined rule and stores the quantified result in the storage unit 117, in database form 117A, as the additional voice feature 13. Here, the quantified result may take the form of an integer value, a real value, a percentage, or the like.
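One way to picture the classification by time interval and the subsequent quantification is shown below. Measurements are grouped into fixed-length intervals, averaged, and expressed as a percentage of an assumed value range. The averaging rule, the range limits, and the name `quantify_by_interval` are assumptions of this sketch; the "predetermined rule" itself is not given in the text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def quantify_by_interval(
    measurements: List[Tuple[float, float]],  # (time in seconds, raw value)
    interval: float,
    low: float,
    high: float,
) -> Dict[int, float]:
    """Average raw values per time interval and scale them to 0-100 percent."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for time, value in measurements:
        buckets[int(time // interval)].append(value)
    result: Dict[int, float] = {}
    for index in sorted(buckets):
        mean = sum(buckets[index]) / len(buckets[index])
        clipped = max(low, min(high, mean))
        result[index] = round(100.0 * (clipped - low) / (high - low), 1)
    return result


# Hypothetical pitch measurements in Hz, mapped to their time stamps.
pitch_track = [(0.1, 100.0), (0.4, 140.0), (1.2, 200.0), (1.7, 260.0)]
pitch_percent = quantify_by_interval(pitch_track, interval=1.0, low=60.0, high=400.0)
print(pitch_percent)
```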

Meanwhile, the additional voice feature extractor 213 in the counterpart terminal 200 has the same configuration and function as the additional voice feature extractor 113. The only difference is that the additional voice feature extractor 113 extracts the additional voice feature 13 from the speaker's voice, whereas the additional voice feature extractor 213 extracts the additional voice feature (23 in FIG. 4) from the counterpart speaker's voice.

Personal voice feature encoder (115)

The personal voice feature encoder 115 may be a hardware module controlled by the processor 125 or a digital circuit embedded in the processor 125. Alternatively, the personal voice feature encoder 115 may be a software module loaded into the memory 127 and executed by the processor 125.

The personal voice feature encoder 115 performs an encoding process on the speaker's voice signal input from the voice collector 101 to generate the hidden variable 15 containing the personal voice feature.

The personal voice feature may include the additional voice feature and a voice feature parameter, or may be information combining them. Here, the voice feature parameter may be, for example, a feature vector based on Mel-frequency cepstral coefficients (MFCC).

The hidden variable 15, together with the additional voice feature 13 extracted by the additional voice feature extractor 113, is used as information needed by the voice synthesizer 107 to perform the voice synthesis process.

Because the hidden variable 15 contains, in addition to an additional voice feature similar to the additional voice feature 13, a voice feature parameter such as MFCC, it may carry more information about the speaker's voice features (utterance tone, utterance style, and so on) than the additional voice feature 13. In this case, the voice synthesizer 207 in the counterpart terminal 200 may receive from the speaker terminal 100 only the hidden variable 15 extracted by the speaker terminal 100 and use only the hidden variable 15 as the information needed for voice synthesis.

To generate such a hidden variable 15, the personal voice feature encoder 115 may be built as a pre-trained neural network structure.

The neural network structure of the personal voice feature encoder 115 may be the same as or different from that of the voice synthesizer (207 in FIG. 4) in the counterpart terminal 200. As described above, when the neural network structures differ, the voice synthesizer (207 in FIG. 4) in the counterpart terminal 200 is configured to further include a processing block, such as a dimension normalizer that normalizes the data dimension, although this is not shown.

The voice synthesizer 207 in the counterpart terminal 200 may generate a personalized synthesized sound using only the additional voice feature 13 provided from the additional voice feature extractor 113. To further increase the accuracy of the personalized synthesized sound, however, it may perform voice synthesis using the hidden variable 15 provided from the personal voice feature encoder 115 as primary information and the additional voice feature 13 provided from the additional voice feature extractor 113 as auxiliary information.

Conversely, the voice synthesizer 207 in the counterpart terminal 200 may perform voice synthesis using the additional voice feature 13 provided from the additional voice feature extractor 113 as primary information and the hidden variable 15 provided from the personal voice feature encoder 115 as auxiliary information.

The neural network structure of the personal voice feature encoder 115 may be trained into variously modified neural network structures according to predetermined rules or the characteristics of the training data (for example, differing data dimensions).

When the neural network structure of the personal voice feature encoder 115 is trained into various structures according to the characteristics of the training data in this way, the neural network structure of the personal voice feature encoder 115 of the speaker terminal 100 may differ from that of the encoder in the voice synthesizer 207 of the counterpart terminal 200.

Therefore, it is desirable to perform the dimension normalization process so that the data dimension of the hidden variable 15 generated by the personal voice feature encoder 115 matches the data dimension of the encoding result generated by the encoder in the voice synthesizer 207 of the counterpart terminal 200.

Meanwhile, the encoding result generated by the personal voice feature encoder 115, that is, the hidden variable 15, is stored in the storage unit 117 in database form 117B.

The personal voice feature encoder 115 and the personal voice feature encoder 215 in the counterpart terminal 200 have the same configuration and function. The only difference is that the personal voice feature encoder 115 extracts the hidden variable 15 related to the speaker's personal voice feature, whereas the personal voice feature encoder 215 extracts the hidden variable 25 related to the counterpart speaker's personal voice feature.

Storage unit (117)

The storage unit 117 temporarily or permanently stores, in database form (117A, 117B), the additional voice feature 13 and the personal voice feature hidden variable 15 output from the additional voice feature extractor 113 and the personal voice feature encoder 115, respectively, and may be implemented with volatile and non-volatile storage media.

The additional voice feature 13 and the hidden variable 15 stored in the storage unit 117 may be updated in real time with a new additional voice feature and a new hidden variable obtained from voice signals newly collected by the voice collector 101.

Additional voice feature adjuster (119)

The additional voice feature adjuster 119 may be a hardware module controlled by the processor 125 or a digital circuit embedded in the processor 125. Alternatively, the additional voice feature adjuster 119 may be a software module loaded into the memory 127 and executed by the processor 125.

At the speaker's request, the additional voice feature adjuster 119 modifies a specific value of the additional voice feature 13 extracted by the additional voice feature extractor 113, or a specific value of the additional voice feature 23 provided from the counterpart terminal 200, thereby converting it into an updated additional voice feature.

Adjusting the additional voice feature may mean, for example, when the voice pitch of the speaker or the counterpart speaker is to be changed, adjusting the specific value corresponding to that pitch.

This adjustment of the additional voice feature may be performed by the speaker. For example, when the speaker enters the specific value through a user interface (not shown), the user interface passes the value to the additional voice feature adjuster 119, which converts the additional voice feature 13, or the additional voice feature 23 provided from the counterpart terminal 200, based on the value entered.

Alternatively, when the speaker terminal 100 transmits the additional voice feature 13 to the counterpart terminal 200 through the communicator 111, the additional voice feature adjuster 219 in the counterpart terminal 200 may convert the received additional voice feature 13 by adjusting its specific value.

If the additional voice feature 13 received by the counterpart terminal 200 is the additional voice feature 19 already updated by the additional voice feature adjuster 119 of the speaker terminal 100, the additional voice feature adjuster 219 in the counterpart terminal 200 may update (convert) the additional voice feature 19 updated by the speaker terminal 100 once more, into the tone desired by the counterpart speaker of the counterpart terminal 200.
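The two-stage adjustment described above, first by the adjuster 119 in the speaker terminal and then by the adjuster 219 in the counterpart terminal, can be sketched as follows. Representing the additional voice feature as a dictionary of named percentages, and the function name `adjust_feature`, are assumptions of this sketch.

```python
from typing import Dict


def adjust_feature(feature: Dict[str, float], name: str, value: float) -> Dict[str, float]:
    """Return a copy of the feature set with one named value replaced."""
    if name not in feature:
        raise KeyError(f"unknown additional voice feature: {name}")
    updated = dict(feature)  # leave the stored original untouched
    updated[name] = value
    return updated


# Hypothetical quantified values (percentages) of the additional voice feature 13.
feature_13 = {"pitch": 40.0, "intensity": 55.0, "speed": 50.0}

# Speaker terminal: adjuster 119 applies the value entered by the speaker.
feature_19 = adjust_feature(feature_13, "pitch", 60.0)

# Counterpart terminal: adjuster 219 updates the received feature once more.
feature_29 = adjust_feature(feature_19, "speed", 45.0)
print(feature_29)
```

Returning a copy instead of modifying the input keeps the originally extracted feature available in the storage unit.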

Personal voice feature adjuster (121)

The personal voice feature adjuster 121 may be a hardware module controlled by the processor 125 or a digital circuit embedded in the processor 125. Alternatively, the personal voice feature adjuster 121 may be a software module loaded into the memory 127 and executed by the processor 125.

개인 음성 특징 조정기(121)는 개인 음성 특징 인코더(115)에 의해 인코딩된 은닉 변수(15)의 특정값을 변경하여 은닉 변수를 업데이트할 수 있다.The personal speech feature adjuster 121 may update the hidden variable by changing a specific value of the hidden variable 15 encoded by the personal speech feature encoder 115 .

The additional voice feature and the voice feature parameter (e.g., MFCC) contained in the hidden variable (15) are information concealed in the hidden variable during neural network processing, so it cannot be known directly which additional voice feature of the speaker's voice a given hidden variable relates to. Therefore, before the speaker can change the hidden variable, preliminary work is needed to identify the relationship between the additional voice features and the hidden variable.

The preliminary work may be, for example, analyzing which specific value of the hidden variable changes when the pitch of the speaker's voice is changed.

Alternatively, the preliminary work may be changing the hidden variable, performing voice synthesis based on the changed hidden variable, and analyzing which additional voice feature of the resulting personalized synthesized sound changed and how.

This preliminary work can be carried out through neural network training. Once it confirms the relationship between a specific value of the hidden variable and a specific value of an additional voice feature, that relationship is organized into a mapping table.
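One way to picture the second kind of preliminary work (change the hidden variable, then observe which additional voice feature moves) is a simple perturbation probe. The sketch below is an assumption-laden illustration, not the method of the description: `measure_features` is a toy linear stand-in for "synthesize and then measure the features", and the table layout (feature name mapped to hidden index and gain) is hypothetical.

```python
def measure_features(hidden):
    """Toy stand-in for synthesizing a voice and measuring its additional features."""
    return {
        "pitch": 100.0 + 50.0 * hidden[0],
        "speed": 1.0 + 0.5 * hidden[2],
    }


def build_mapping_table(hidden, delta=0.1, threshold=1e-6):
    """Perturb each hidden element and record the element each feature responds to most."""
    baseline = measure_features(hidden)
    table = {}
    for i in range(len(hidden)):
        probe = list(hidden)
        probe[i] += delta
        changed = measure_features(probe)
        for name, base_value in baseline.items():
            gain = (changed[name] - base_value) / delta
            best_gain = table.get(name, (None, 0.0))[1]
            if abs(gain) > threshold and abs(gain) > abs(best_gain):
                table[name] = (i, gain)
    return table


# Maps each feature name to (hidden index, approximate gain).
mapping_table = build_mapping_table([0.2, -0.1, 0.4])
```

In a real system the relationship is unlikely to be this cleanly linear or one-to-one, which is presumably why the description assigns the task to neural network training.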

The personal voice feature adjuster (121) receives, from a user interface (not shown), the specific value of the additional voice feature of the personalized synthesized sound that the speaker wishes to change and, referring to the mapping table, changes (updates) the specific value of the hidden variable that corresponds to that value.
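The table lookup performed by the adjuster can be sketched as below. The table contents, the linear gain model, and the names `MAPPING_TABLE` and `update_hidden_variable` are all assumptions made for illustration; the description only states that a mapping table relates feature values to hidden variable values.

```python
# Hypothetical mapping table: feature name -> (hidden variable index, gain).
MAPPING_TABLE = {"pitch": (0, 50.0), "speed": (2, 0.5)}


def update_hidden_variable(hidden, current, requested, table=MAPPING_TABLE):
    """Shift the mapped hidden elements so each requested feature value is reached.

    Assumes each feature depends linearly on exactly one hidden element.
    """
    updated = list(hidden)
    for name, target in requested.items():
        index, gain = table[name]
        updated[index] += (target - current[name]) / gain
    return updated


hidden_15 = [0.2, -0.1, 0.4]                   # hidden variable before adjustment
current = {"pitch": 110.0, "speed": 1.2}       # feature values it currently yields
hidden_21 = update_hidden_variable(hidden_15, current, {"pitch": 120.0})
```

Only the element mapped to the requested feature changes; the remaining elements of the hidden variable are passed through untouched.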

The changed (updated) hidden variable (21) is transmitted to the counterpart terminal (200) through the communicator (111), where the voice synthesizer (207) uses it as information needed for voice synthesis.

In addition, the personal voice feature adjuster (221) in the counterpart terminal (200) may change (update) the hidden variable (21) received from the speaker terminal (100) again, to a value desired by the counterpart speaker of the counterpart terminal (200).

Learner (123)

The learner (123) may be a hardware module controlled by the processor (125) or a digital circuit embedded in the processor (125). Alternatively, it may be a software module loaded into the memory (127) and executed by the processor (125).

The learner (123) may be a component that trains the voice synthesizer (107).

The learning method performed by the learner (123) may be machine learning that includes supervised learning and/or unsupervised learning.

The learner (123) may train the voice synthesizer (107) using, as training data, the additional voice feature (23 in FIG. 4), the hidden variable (25 in FIG. 4), the changed additional voice feature (29 in FIG. 4), and the changed hidden variable (31 in FIG. 4), which are received from the counterpart terminal (200) and represent the voice characteristics of the counterpart speaker.

FIG. 4 is a block diagram showing the internal configuration of the counterpart terminal shown in FIG. 1, and FIG. 5 is a block diagram showing the internal configuration of the voice synthesizer shown in FIG. 4.

Referring to FIG. 4, the counterpart terminal (200) includes a voice collector (201), a voice recognizer (203), an automatic translator (205), a voice synthesizer (207), a voice output unit (209), a communicator (211), an additional voice feature extractor (213), a personal voice feature encoder (215), a storage unit (217), an additional voice feature adjuster (219), a personal voice feature adjuster (221), and a learner (223).

These components (201, 203, 205, 207, 209, 211, 213, 215, 217, 219, 221, and 223) have the same structure and function as the components (101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 121, and 123) shown in FIG. 2.

Accordingly, the description of each of the components (201, 203, 205, 207, 209, 211, 213, 215, 217, 219, 221, and 223) is replaced by the description of the corresponding component (101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 121, and 123) shown in FIG. 2.

For the case in which the counterpart terminal (200) receives the automatic translation result (10), the additional voice feature (13), and the hidden variable (15) from the speaker terminal (100), however, the components involved in processing the automatic translation result (10), the additional voice feature (13), and the hidden variable (15) are briefly described below.

First, the communicator (211) receives the automatic translation result (10), the additional voice feature (13), and the hidden variable (15) from the speaker terminal (100). The communicator (211) may also receive only the additional voice feature (13) and the hidden variable (15) related to the personal voice feature, without receiving the automatic translation result (10).

When the communicator (211) of the counterpart terminal (200) receives the automatic translation result (10), the additional voice feature (13), and the hidden variable (15) related to the personal voice feature from the speaker terminal (100), it passes the automatic translation result (10), the additional voice feature (13), and the hidden variable (15) to the voice synthesizer (207).

The voice synthesizer (207) may use the automatic translation result (10), the additional voice feature (13), and the hidden variable (15) passed from the communicator (211) to perform voice synthesis on the automatic translation result (10) transmitted from the speaker terminal (100), and may reproduce (output) the resulting personalized synthesized sound.

Alternatively, when the counterpart terminal (200) receives only the additional voice feature (13) and the hidden variable (15) from the speaker terminal (100), the voice synthesizer (207) may use the received additional voice feature (13) and hidden variable (15) to perform voice synthesis on the automatic translation result (30) input from the automatic translator (205).

Meanwhile, the additional voice feature (13) and the hidden variable (15) received from the speaker terminal (100) may be updated to the tone desired by the counterpart speaker by the additional voice feature adjuster (219) and the personal voice feature adjuster (221), respectively.

The voice synthesizer (207) may also use the updated additional voice feature and the updated hidden variable to perform voice synthesis on the automatic translation result (10) received from the speaker terminal (100) or on the automatic translation result (30) provided by the automatic translator (205).

The voice synthesizer (207) may likewise use the additional voice feature (19) and the hidden variable (21) already updated in the speaker terminal to perform voice synthesis on the automatic translation result (10) received from the speaker terminal (100) or on the automatic translation result (30) provided by the automatic translator (205).

The learner (223) may train the voice synthesizer (207) using, as training data, the additional voice feature (13) and the hidden variable (15) received from the speaker terminal (100).

The learner (223) may also train the voice synthesizer (207) using, as training data, the updated additional voice feature (19) and the updated hidden variable (21) received from the speaker terminal (100).

Further, when the additional voice feature adjuster (219) and the personal voice feature adjuster (221) in the counterpart terminal (200) update the additional voice feature (13) and the hidden variable (15) received from the speaker terminal (100), respectively, the learner (223) may train the voice synthesizer (207) using, as training data, the updated additional voice feature (29) and the updated hidden variable (31).

In addition, reference numeral 23 denotes the additional voice feature (23) that the additional voice feature extractor (213) generates from the voice signal collected by the voice collector (201), and reference numeral 25 denotes the hidden variable (25), related to the personal voice feature, that the personal voice feature encoder (215) generates from the voice signal collected by the voice collector (201).

Reference numeral 30 denotes the automatic translation result that the automatic translator (205) produces for the voice recognition result input from the voice recognizer (203).

The neural network structure of the personal voice feature encoder (215) in the counterpart terminal (200) and that of the personal voice feature encoder (115) in the speaker terminal (100) may be the same or different.

Likewise, the neural network structure of the personal voice feature encoder (215) in the counterpart terminal (200) and that of the encoder (107A in FIG. 3) provided in the voice synthesizer (107) of the speaker terminal (100) may be the same or different.

Likewise, the neural network structure of the encoder (207A) provided in the voice synthesizer (207) of the counterpart terminal (200) and that of the encoder (107A in FIG. 3) provided in the voice synthesizer (107) of the speaker terminal (100) may be the same or different.

Because the neural network structures of the encoders (107A, 207A) may thus differ between terminals, the decoder (107C or 207C) in the voice synthesizer of the speaker terminal (100) or the counterpart terminal (200) may have to decode encoding results of different data dimensions, which can cause problems.

In the present invention, however, the voice synthesizer in the speaker terminal (100) or the counterpart terminal (200) normalizes the data dimensions to a common dimension, which minimizes decoding errors caused by mismatched data dimensions.
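The description does not say how the dimensions are normalized. As one possible illustration only, the sketch below brings two vectors of different lengths to a common length by linear interpolation; the function names and the choice of interpolation are assumptions, and a trained projection layer would be an equally plausible reading.

```python
def resample(vector, target_dim):
    """Linearly interpolate a non-empty vector to target_dim elements."""
    if target_dim < 1 or not vector:
        raise ValueError("vector must be non-empty and target_dim positive")
    if len(vector) == 1 or target_dim == 1:
        return [vector[0]] * target_dim
    step = (len(vector) - 1) / (target_dim - 1)
    out = []
    for k in range(target_dim):
        pos = k * step
        lo = min(int(pos), len(vector) - 2)   # left neighbour, clamped at the end
        frac = pos - lo
        out.append(vector[lo] * (1 - frac) + vector[lo + 1] * frac)
    return out


def normalize_dimensions(encoding_result, hidden_variable, target_dim):
    """Bring both inputs to the same dimension before decoding."""
    return resample(encoding_result, target_dim), resample(hidden_variable, target_dim)


encoding_n, hidden_n = normalize_dimensions([0.0, 1.0], [2.0, 4.0, 6.0, 8.0, 10.0], 3)
```

Whatever the actual technique, the point carried by the text is that the decoder always sees inputs of one agreed dimension, regardless of which terminal's encoder produced them.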

In other words, the dependence on the specification (neural network structure) of the voice synthesizer (107 or 207) installed in each terminal is reduced. As a result, when the speaker terminal (100) and the counterpart terminal (200) each reproduce a personalized synthesized sound based on the speaker's voice characteristics, both terminals can provide, as the automatic interpretation result, a personalized synthesized sound based on the voice characteristics of the same speaker.

The internal components (207A, 207B, 207C, 207D) of the voice synthesizer (207) shown in FIG. 5 have the same functions as the internal components (107A, 107B, 107C, 107D), respectively, of the voice synthesizer (107) provided in the speaker terminal (100) shown in FIG. 3. Accordingly, the description of each internal component of the voice synthesizer (207) is replaced by the description of the internal components (107A, 107B, 107C, 107D) of the voice synthesizer (107) shown in FIG. 3.

The only difference is that the voice synthesizer (207) of the counterpart terminal performs voice synthesis on the automatic translation result (10) based on the automatic translation result (10), the hidden variable (15), and the additional voice feature (13) received from the speaker terminal (100), whereas the voice synthesizer (107) of the speaker terminal (100) performs voice synthesis on the automatic translation result (30) based on the automatic translation result (30), the hidden variable (25), and the additional voice feature (23) received from the counterpart terminal (200).

FIG. 6 is a flowchart illustrating an automatic interpretation method performed by the counterpart terminal according to an embodiment of the present invention.

To distinguish the terms clearly, the additional voice feature contained in the hidden variable (15 or 25) obtained by the personal voice feature encoder (115 or 215) shown in FIGS. 2 and 4 is called the "first additional voice feature", and the additional voice feature (13 or 23) extracted by the additional voice feature extractor (113 or 213) is called the "second additional voice feature".

In addition, the personal voice feature encoder (115, 215) shown in FIGS. 2 and 4 is called the "first voice feature extractor", and the additional voice feature extractor (113, 213) is called the "second voice feature extractor".

Similarly, the personal voice feature adjuster (121 or 221) shown in FIGS. 2 and 4 is called the "first voice feature adjuster", and the additional voice feature adjuster (119 or 219) is called the "second voice feature adjuster".

Referring to FIG. 6, first, in S510, the communicator (211) of the counterpart terminal (200) receives, from the speaker terminal (100), the automatic translation result (10) and the voice characteristic information extracted from the voice uttered by the speaker.

Next, in S520, the voice synthesizer (207) of the counterpart terminal (200) performs voice synthesis based on the automatic translation result (10) and the voice characteristic information (13 and 15), and outputs a personalized synthesized sound as the automatic interpretation result.

The voice characteristic information provided by the speaker terminal (100) includes the hidden variable (15 in FIG. 2) and the second additional voice feature (13 in FIG. 2).

The hidden variable (15 in FIG. 2) is information extracted by the first voice feature extractor (115 in FIG. 2) in the speaker terminal (100) using a neural network algorithm, and includes the first additional voice feature and a voice feature parameter. The voice feature parameter may be, for example, a feature vector based on Mel-Frequency Cepstral Coefficients (MFCC).

The second additional voice feature (13 in FIG. 2) may be information extracted by the second voice feature extractor (113 in FIG. 2) in the speaker terminal (100) using a non-neural-network algorithm. The non-neural-network algorithm may be an algorithm that analyzes waveform features appearing repeatedly in the speaker's voice.
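A classic non-neural way to analyze a waveform feature that repeats in a voice is autocorrelation-based pitch estimation. The description does not name a specific algorithm, so the sketch below is only one plausible example; `estimate_pitch`, its search range, and the synthetic test tone are assumptions.

```python
import math


def estimate_pitch(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) from the strongest autocorrelation lag."""
    min_lag = max(int(sample_rate / fmax), 1)
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the signal and a copy of itself shifted by `lag` samples.
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


sample_rate = 8000
# Synthetic 200 Hz tone standing in for a voiced speech frame.
tone = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(800)]
pitch = estimate_pitch(tone, sample_rate)
```

The lag at which the waveform best matches a shifted copy of itself corresponds to its period, which is why a repeating waveform pattern translates directly into a pitch value.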

The second additional voice feature (13) and the first additional voice feature contained in the hidden variable (15) are information related to the intensity, intonation, pitch, speed, and the like of the speaker's voice, and may be features related to the tone of the voice uttered by the speaker.

The step of outputting the personalized synthesized sound in S520 proceeds as follows.

First, the encoder (207A in FIG. 5) encodes the automatic translation result (10) and the second additional voice feature and outputs the resulting encoding result (70 in FIG. 5).

Next, the dimension normalizer (207B in FIG. 5) normalizes the data dimension of the encoding result (70 in FIG. 5) and the data dimension of the hidden variable (15) to the same data dimension.

Next, the decoder (207C) decodes the hidden variable and the encoding result, both normalized to the same data dimension, to generate the personalized synthesized sound.

Alternatively, when the decoder (207C) does not generate the personalized synthesized sound itself but instead generates a parameter that determines the speaker's voice, a further step may be added in which the vocoder (207D in FIG. 5) outputs the personalized synthesized sound based on the parameter input from the decoder (207C in FIG. 5). Here, the parameter may be, for example, a spectrogram-based feature vector.
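The order of the sub-steps in S520 (encode, normalize, decode, and optionally run a vocoder) can be sketched as a small pipeline. The `synthesize` function and every `toy_*` stand-in are hypothetical; they return simple lists so the control flow is runnable and do not model a real encoder, decoder, or vocoder.

```python
def synthesize(translation, second_feature, hidden,
               encoder, normalizer, decoder, vocoder=None):
    """Run the S520 sub-steps: encode, normalize dimensions, decode, optional vocoder."""
    encoding = encoder(translation, second_feature)
    encoding_n, hidden_n = normalizer(encoding, hidden)
    decoded = decoder(encoding_n, hidden_n)
    # When the decoder yields parameters rather than audio, a vocoder finishes the job.
    return vocoder(decoded) if vocoder is not None else decoded


# Toy stand-ins so the sketch runs end to end.
def toy_encoder(text, feature):
    return [float(len(text)), feature["pitch"]]


def toy_normalizer(encoding, hidden):
    dim = min(len(encoding), len(hidden))   # truncate both to the shorter length
    return encoding[:dim], hidden[:dim]


def toy_decoder(encoding, hidden):
    return [e + h for e, h in zip(encoding, hidden)]


def toy_vocoder(parameters):
    return {"waveform": parameters}


args = ("hello", {"pitch": 2.0}, [0.5, 0.5, 0.5], toy_encoder, toy_normalizer, toy_decoder)
direct = synthesize(*args)
via_vocoder = synthesize(*args, vocoder=toy_vocoder)
```

Passing the vocoder as an optional argument mirrors the text: the extra step exists only when the decoder emits voice parameters instead of the synthesized sound itself.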

Meanwhile, when the counterpart speaker of the counterpart terminal (200) wishes to change the tone of the voice uttered by the speaker to another tone, the automatic interpretation method performed by the counterpart terminal (200) may further include a step in which the first voice feature adjuster (221 in FIG. 4) updates the hidden variable (15 in FIG. 2) provided by the speaker terminal (100) by adjusting a specific value of that hidden variable, and a step in which the second voice feature adjuster (219 in FIG. 4) updates the second additional voice feature (13 in FIG. 2) provided by the speaker terminal (100) by adjusting a specific value of that feature.

When the hidden variable (15 in FIG. 2) and the second additional voice feature (13 in FIG. 2) provided by the speaker terminal (100) are updated in this way, the voice synthesizer (207 in FIG. 4) in the counterpart terminal (200) may perform voice synthesis based on the updated hidden variable and the updated second additional voice feature, and output a personalized synthesized sound having the other tone desired by the counterpart speaker.

FIG. 7 is a flowchart illustrating an automatic interpretation method performed by the speaker terminal according to an embodiment of the present invention.

Referring to FIG. 7, first, in S710, the first voice feature extractor (115 in FIG. 2) extracts, from the voice uttered by the speaker, the hidden variable (15 in FIG. 2) that includes the first additional voice feature and the voice feature parameter. A neural-network-based algorithm may be used to extract the hidden variable (15 in FIG. 2).

Next, in S720, the second voice feature extractor (113 in FIG. 2) extracts the second additional voice feature (13 in FIG. 2) from the voice. A non-neural-network-based algorithm may be used to extract the second additional voice feature (13 in FIG. 2). The non-neural-network-based algorithm may be, for example, an algorithm that analyzes waveform features appearing repeatedly in the speaker's voice.

Next, in S730, the voice recognizer (103 in FIG. 2) performs voice recognition on the voice to obtain a voice recognition result.

Next, in S740, the automatic translator (105 in FIG. 2) performs automatic translation on the voice recognition result to obtain an automatic translation result.

Next, in S750, the communicator (111 in FIG. 2) transmits the automatic translation result (10), the hidden variable (15), and the second additional voice feature (13) to the counterpart terminal (200), so that the counterpart terminal (200) can perform voice synthesis on the automatic translation result based on the hidden variable and the second additional voice feature.

Optionally, the automatic interpretation method performed by the speaker terminal may further include a step in which the first voice feature adjuster (121 in FIG. 2) updates the hidden variable (15) by adjusting a specific value of the hidden variable (15), and a step in which the second voice feature adjuster (119 in FIG. 2) updates the second additional voice feature (13) by adjusting a specific value of the second additional voice feature (13).

When the hidden variable (15) and the second additional voice feature (13) have been updated, the communicator (111 in FIG. 2) may transmit the automatic translation result (10), the updated hidden variable (15), and the updated second additional voice feature (13) to the counterpart terminal (200), so that the counterpart terminal (200) can perform voice synthesis on the automatic translation result (10) based on the updated hidden variable and the updated second additional voice feature.

The automatic translation of S740 may instead be performed in the counterpart terminal (200). In that case, the communicator (111 in FIG. 2) may transmit the voice recognition result obtained in S730 to the counterpart terminal (200), so that the counterpart terminal (200) performs automatic translation on it.
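The data sent in S750, including the variant in which the voice recognition result is sent instead of the translation, can be pictured as a small message type. The `InterpretationPayload` class, its field names, and the JSON wire format are assumptions for illustration; the description does not define a message format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class InterpretationPayload:
    """Hypothetical message sent from the speaker terminal to the counterpart terminal."""
    hidden_variable: List[float]
    second_additional_feature: Dict[str, float]
    translation_result: Optional[str] = None    # present when S740 ran on the speaker side
    recognition_result: Optional[str] = None    # present when the counterpart translates

    def __post_init__(self):
        if self.translation_result is None and self.recognition_result is None:
            raise ValueError("either a translation or a recognition result is required")


def encode_payload(payload):
    """Serialize the payload to UTF-8 JSON bytes."""
    return json.dumps(asdict(payload), ensure_ascii=False).encode("utf-8")


def decode_payload(raw):
    """Rebuild the payload from UTF-8 JSON bytes."""
    return InterpretationPayload(**json.loads(raw.decode("utf-8")))


sent = InterpretationPayload(
    hidden_variable=[0.2, -0.1, 0.4],
    second_additional_feature={"pitch": 180.0, "speed": 1.0},
    translation_result="Hello",
)
received = decode_payload(encode_payload(sent))
```

On the receiving side, a missing `translation_result` would signal that the counterpart terminal must run its own automatic translator on `recognition_result` before synthesis.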

The present invention has been described above with a focus on its embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that it can be implemented in variously changed or modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered illustrative rather than restrictive. The scope of the present invention is indicated by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

Claims

An automatic interpretation method performed by a counterpart terminal communicating with a speaker terminal, the method comprising:
receiving, by a communicator, from the speaker terminal, an automatic translation result obtained by automatically translating a voice uttered by a speaker in a source language into a target language, and voice characteristic information of the speaker; and
performing, by a voice synthesizer, voice synthesis based on the automatic translation result and the voice characteristic information, and outputting a personalized synthesized sound as an automatic interpretation result,
wherein the voice characteristic information of the speaker comprises
a hidden variable including a first additional voice feature and a voice feature parameter extracted from the speaker's voice, and a second additional voice feature, and
wherein the method further comprises, in order for a counterpart speaker of the counterpart terminal to change the tone of the voice uttered by the speaker to another tone, updating, by a first voice feature adjuster, the hidden variable by adjusting a specific value of the hidden variable, and updating, by a second voice feature adjuster, the second additional voice feature by adjusting a specific value of the second additional voice feature.

The automatic interpretation method of claim 1,
wherein the hidden variable is
extracted in the speaker terminal based on a neural network algorithm.

The automatic interpretation method of claim 1,
wherein the second additional voice feature is
extracted in the speaker terminal based on a non-neural-network algorithm.

The automatic interpretation method of claim 3,
wherein the non-neural-network algorithm is
an algorithm that analyzes waveform features appearing repeatedly in the speaker's voice.

The automatic interpretation method of claim 1,
wherein each of the first and second additional voice features is
a voice feature related to the tone or style of the user's voice, representing the intensity, intonation, pitch, and speed of the user's voice.

The automatic interpretation method of claim 1,
wherein outputting the personalized synthesized sound as the automatic interpretation result comprises:
outputting, by an encoder, an encoding result obtained by encoding the automatic translation result and the second additional voice feature;
normalizing, by a dimension normalizer, a data dimension of the encoding result and a data dimension of the hidden variable to the same data dimension; and
decoding, by a decoder, the hidden variable and the encoding result normalized to the same data dimension to generate the personalized synthesized sound.

delete

The automatic interpretation method of claim 1,
wherein outputting the personalized synthesized sound as the automatic interpretation result comprises:
performing voice synthesis based on the updated hidden variable and the updated second additional voice feature, and outputting, as the automatic interpretation result, a personalized synthesized sound having the other tone desired by the counterpart speaker.

An automatic interpretation method performed by a speaker terminal communicating with a counterpart terminal, the method comprising:
extracting, by a first voice feature extractor, a hidden variable including a first additional voice feature and a voice feature parameter from a voice uttered by a speaker;
extracting, by a second voice feature extractor, a second additional voice feature from the voice;
performing, by a voice recognizer, voice recognition on the voice to obtain a voice recognition result;
obtaining, by an automatic translator, an automatic translation result by performing automatic translation on the voice recognition result; and
transmitting, by a communicator, the automatic translation result, the hidden variable, and the second additional voice feature to the counterpart terminal,
wherein, when the speaker of the speaker terminal wants to change the tone of the voice uttered by the speaker to another tone, the method further comprises updating, by a first voice feature adjuster, the hidden variable by adjusting a specific value of the hidden variable, and updating, by a second voice feature adjuster, the second additional voice feature by adjusting a specific value of the second additional voice feature.
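
The adjustment of "a specific value" by the two voice feature adjusters can be pictured as replacing one entry of each feature while leaving the rest unchanged. The sketch below assumes, purely for illustration, that the hidden variable is a tuple of floats and that the second additional voice feature is a name-to-value mapping; the class name, field names, and the keys "pitch" and "speed" are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class VoiceFeatures:
    """Hypothetical container for the voice feature information sent to the other terminal."""
    hidden_variable: Tuple[float, ...]   # output of the neural extractor
    second_feature: Dict[str, float]     # output of the non-neural extractor


def adjust_hidden(features: VoiceFeatures, index: int, value: float) -> VoiceFeatures:
    """First adjuster: return a copy with one hidden-variable entry replaced."""
    hidden = list(features.hidden_variable)
    hidden[index] = value
    return replace(features, hidden_variable=tuple(hidden))


def adjust_second(features: VoiceFeatures, name: str, value: float) -> VoiceFeatures:
    """Second adjuster: return a copy with one named feature value replaced."""
    if name not in features.second_feature:
        raise KeyError(name)
    return replace(features, second_feature={**features.second_feature, name: value})


original = VoiceFeatures((0.1, 0.2, 0.3), {"pitch": 180.0, "speed": 1.0})
updated = adjust_second(adjust_hidden(original, 1, 0.9), "pitch", 220.0)
```

Returning updated copies keeps the originally extracted features available, so the terminal can still synthesize with the unmodified voice if no tone change is requested.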

delete

In claim 9,
The automatic interpretation method, further comprising transmitting, by the communicator, the updated hidden variable and the updated second additional voice feature to the counterpart terminal, so that the counterpart terminal performs voice synthesis on the automatic translation result based on the updated hidden variable and the updated second additional voice feature.

In claim 9,
The automatic interpretation method, wherein extracting the hidden variable comprises extracting the hidden variable from the voice based on a neural network-based algorithm.

In claim 9,
The automatic interpretation method, wherein extracting the second additional voice feature comprises extracting the second additional voice feature from the voice based on a non-neural network-based algorithm.

In claim 9,
The automatic interpretation method, wherein transmitting to the counterpart terminal comprises transmitting, by the communicator, the voice recognition result to the counterpart terminal instead of the automatic translation result when the automatic translation of the voice recognition result is performed by the counterpart terminal.
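
The choice between sending the automatic translation result and sending the voice recognition result can be sketched as a small payload builder. The JSON layout, the field names, and the `counterpart_translates` flag are assumptions made for this example; the claims do not define a message format.

```python
import json
from typing import Dict, List


def build_payload(recognition_result: str,
                  translation_result: str,
                  hidden_variable: List[float],
                  second_feature: Dict[str, float],
                  counterpart_translates: bool) -> str:
    """Serialize what the speaker terminal sends to the counterpart terminal."""
    if counterpart_translates:
        # Translation happens on the counterpart side, so send the recognized text.
        text_field = {"recognition_result": recognition_result}
    else:
        text_field = {"translation_result": translation_result}
    message = {**text_field,
               "hidden_variable": hidden_variable,
               "second_feature": second_feature}
    return json.dumps(message, ensure_ascii=False)


payload = json.loads(build_payload("안녕하세요", "Hello", [0.1, 0.2],
                                   {"pitch": 180.0}, counterpart_translates=True))
```

In either case the voice feature information travels with the text, so the receiving side has what it needs for personalized synthesis.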

A counterpart terminal communicating with a speaker terminal, the counterpart terminal including an automatic interpretation device,
wherein the automatic interpretation device includes:
a communicator that receives, from the speaker terminal, an automatic translation result obtained by automatically translating a voice uttered by a speaker in a source language into a target language, and voice feature information of the speaker; and
a voice synthesizer that performs voice synthesis based on the automatic translation result and the voice feature information and outputs a personalized synthesized sound as an automatic interpretation result,
wherein the voice feature information of the speaker includes
a hidden variable, which includes a first additional voice feature and a voice feature parameter extracted from the speaker's voice, and a second additional voice feature, and
wherein the automatic interpretation device further includes, for when the counterpart speaker of the counterpart terminal wants to change the tone of the voice uttered by the speaker to another tone, a first voice feature adjuster that updates the hidden variable by adjusting a specific value of the hidden variable, and a second voice feature adjuster that updates the second additional voice feature by adjusting a specific value of the second additional voice feature.

delete

In claim 15,
The automatic interpretation device, wherein the voice synthesizer performs voice synthesis based on the automatic translation result, the hidden variable updated by the first voice feature adjuster, and the second additional voice feature updated by the second voice feature adjuster, and outputs the personalized synthesized sound as the automatic interpretation result.

In claim 15,
The automatic interpretation device, wherein the voice synthesizer includes:
an encoder that outputs an encoding result obtained by encoding the automatic translation result and the second additional voice feature;
a dimension normalizer that normalizes the data dimension of the encoding result and the data dimension of the hidden variable to the same data dimension; and
a decoder that generates the personalized synthesized sound by decoding the hidden variable and the encoding result normalized to the same data dimension.

In claim 15,
The automatic interpretation device, further comprising:
a first voice feature extractor that extracts, based on a neural network algorithm, a hidden variable C including a first additional voice feature A indicating a tone characteristic of the counterpart speaker and a voice feature parameter B, from a voice uttered by the counterpart speaker of the counterpart terminal;
a second voice feature extractor that extracts, based on a non-neural network algorithm, a second additional voice feature D indicating the tone characteristic of the counterpart speaker from the voice uttered by the counterpart speaker;
a voice recognizer that performs voice recognition on the voice uttered by the counterpart speaker to obtain a voice recognition result; and
an automatic translator that obtains an automatic translation result E by performing automatic translation on the voice recognition result,
wherein the communicator transmits the hidden variable C, the second additional voice feature D, and the automatic translation result E to the speaker terminal, so that the speaker terminal performs voice synthesis based on the hidden variable C, the second additional voice feature D, and the automatic translation result E.