KR20230021395A

KR20230021395A - Simultaneous interpretation service device and method for generating simultaneous interpretation results being applied with user needs

Info

Publication number: KR20230021395A
Application number: KR1020210103219A
Authority: KR
Inventors: 김익재; 김학섭; 조정현; 최희승; 남기표; 박해솔
Original assignee: 한국과학기술연구원
Priority date: 2021-08-05
Filing date: 2021-08-05
Publication date: 2023-02-14

Abstract

Embodiments relate to a simultaneous interpretation device and method. The method includes the steps of: receiving content in a first language; receiving a user request to be reflected in content in a second language and generating a corresponding control command; and generating, with the user request corresponding to the control command reflected, audio of the second-language content obtained by simultaneously interpreting the first-language content.

Description

SIMULTANEOUS INTERPRETATION SERVICE DEVICE AND METHOD FOR GENERATING SIMULTANEOUS INTERPRETATION RESULTS BEING APPLIED WITH USER NEEDS

Embodiments of the present application relate to technology for generating a result of simultaneously interpreting content in a first language into content in a second language, and more particularly to an apparatus and method for generating and serving a simultaneous interpretation result that reflects a user request.

As is well known, an automatic interpretation device enables people who speak different languages to communicate in their own native languages. It receives a voice signal in a first language, performs speech recognition, automatically translates the result into a second language, and outputs the translation as speech.

With recent advances in artificial intelligence technology, attempts to apply artificial intelligence to automatic interpretation devices are increasing.

However, conventional simultaneous interpretation services go no further than delivering every translation result uniformly, in the voice of one particular voice actor or in a machine voice.

Accordingly, there is a demand for simultaneous interpretation services that reflect diverse user requests.

Patent Publication No. 10-2014-0120560 (2014.10.14.)

Embodiments of the present invention aim to provide an apparatus and method for generating and serving a simultaneous interpretation result that reflects a user request.

According to one aspect of the present application, a simultaneous interpretation method for generating a simultaneous interpretation result that reflects a user request is performed by a simultaneous interpretation apparatus including a translation unit and a synthesis unit. The method includes: receiving the voice of a speaker uttering content in a first language; converting the received speaker voice of the first-language content into first-language text; extracting the speaker's voice characteristics from the speaker voice; receiving a user request to be reflected in the simultaneous interpretation result and generating a corresponding control command; and generating, with the user request corresponding to the control command reflected, audio of content in a second language obtained by simultaneously interpreting the first-language content. Generating the audio of the simultaneously interpreted second-language content includes: translating the first-language content into second-language content; and generating audio of the second-language content.

In one embodiment, receiving the user request and generating the corresponding control command may include at least one of: receiving a situation request as the user request, generating a first control command, and transmitting it to the translation unit; and receiving a speaker voice request, a specific-person voice request, or a reference voice request as the user request, generating a second control command, and transmitting it to the synthesis unit. The first control command causes the translation unit to translate the first-language content into second-language content under a content situation. The second control command causes the synthesis unit to synthesize the second-language content with the speaker voice, a specific person's voice, or a sample machine voice associated with a reference.
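The routing of user requests to control commands described in this embodiment can be sketched as follows. This is only an illustrative sketch: the names `ControlCommand`, `Target`, `generate_control_command`, and the request labels are hypothetical, since the specification names the requests and commands but does not define how they are encoded.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    """Unit that receives a control command."""
    TRANSLATION_UNIT = "translation unit"
    SYNTHESIS_UNIT = "synthesis unit"


@dataclass(frozen=True)
class ControlCommand:
    name: str       # "first" or "second", following the embodiment
    target: Target  # unit the command is transmitted to
    request: str    # kind of user request that produced the command
    detail: str     # e.g. the chosen situation or a voice identifier


# Hypothetical labels for the three voice-related requests.
VOICE_REQUESTS = frozenset(
    {"speaker_voice", "specific_person_voice", "reference_voice"}
)


def generate_control_command(request: str, detail: str = "") -> ControlCommand:
    """Map one user request to a control command and its destination unit."""
    if request == "situation":
        # A situation request steers translation, so it goes to the translation unit.
        return ControlCommand("first", Target.TRANSLATION_UNIT, request, detail)
    if request in VOICE_REQUESTS:
        # Voice requests steer synthesis, so they go to the synthesis unit.
        return ControlCommand("second", Target.SYNTHESIS_UNIT, request, detail)
    raise ValueError(f"unsupported user request: {request!r}")
```

For example, `generate_control_command("situation", "business")` yields a first control command addressed to the translation unit, while `generate_control_command("reference_voice", "news_anchor")` yields a second control command addressed to the synthesis unit.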

In one embodiment, generating the audio of the second-language content may include: upon receiving a speaker voice request, generating a 2-1 control command that causes the synthesis unit to synthesize the speaker voice with the second-language content translated by the translation unit, and transmitting the 2-1 control command to the synthesis unit, the 2-1 control command including speaker voice characteristic information; and generating, based on the speaker voice characteristics of the 2-1 control command and the second-language content, a speaker synthesized voice in which the speaker voice is synthesized with the second-language content translated by the translation unit.

In one embodiment, generating the audio of the second-language content may include: upon receiving a specific-person voice request, generating a 2-2 control command that causes the synthesis unit to synthesize the specific person's voice with the second-language content translated by the translation unit, and transmitting the 2-2 control command to the synthesis unit, the 2-2 control command including pre-stored specific-person voice characteristic information; and generating, based on the specific-person voice characteristics of the 2-2 control command and the second-language content, a specific-person synthesized voice in which the pre-stored specific-person voice is synthesized with the second-language content translated by the translation unit.

In one embodiment, the synthesis unit may generate the specific-person synthesized voice of the second-language content using a specific-person voice model that includes a basic voice model and a second tuning model. The basic voice model is trained to utter second-language text in a basic voice according to the pronunciation rules of the second language. The second tuning model is trained to apply pre-stored specific-person voice characteristics to the basic voice to generate the specific-person synthesized voice.

In one embodiment, generating the audio of the second-language content may further include providing an interface screen for selecting one of one or more stored pieces of specific-person voice characteristic information. The synthesis unit is configured to use one or more second tuning models, one for each of one or more specific persons, and uses the second tuning model for the selected specific person to generate a specific-person synthesized voice in which that person's voice characteristics are applied to the basic voice of the second-language content.

In one embodiment, generating the audio of the second-language content may include: upon receiving a reference voice request, generating a 2-3 control command that causes the sample machine voice associated with the reference of the reference voice request to be synthesized with the second-language content translated by the translation unit, and transmitting the 2-3 control command to the synthesis unit, the 2-3 control command including pre-stored sample machine voice characteristic information; and generating, based on the sample machine voice characteristics of the 2-3 control command and the second-language content, a reference synthesized voice in which the pre-stored sample machine voice is synthesized with the second-language content translated by the translation unit.

In one embodiment, the synthesis unit may generate the reference synthesized voice of the second-language content using a reference voice model that includes a basic voice model and a third tuning model. The basic voice model is trained to utter second-language text in a basic voice according to the pronunciation rules of the second language. The third tuning model is trained to apply pre-stored sample machine voice characteristics to the basic voice to generate the reference synthesized voice.

In one embodiment, generating the audio of the second-language content may further include providing an interface screen for selecting one of one or more references associated with one or more stored pieces of sample machine voice characteristic information. The synthesis unit is configured to use one or more third tuning models, one for each sample machine voice associated with the one or more references, and may use the third tuning model for the sample machine voice associated with the selected reference to generate a reference synthesized voice in which that sample machine voice's characteristics are applied to the basic voice of the second language.
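The two-stage structure in the preceding embodiments (a basic voice model followed by a tuning model chosen per specific person or per reference) can be illustrated with a small sketch. Everything here is an assumption made for illustration: real models would be trained speech synthesizers, whereas this sketch stands in for them with plain functions over a dictionary of characteristics, and the names `SynthesizedVoice`, `make_tuning_model`, `TUNING_MODELS`, and the stored entries are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class SynthesizedVoice:
    text: str                        # second-language text being uttered
    characteristics: Dict[str, str]  # stand-in for timbre, vocal range, tone, ...


def basic_voice_model(text: str) -> SynthesizedVoice:
    """Stand-in for the basic voice model: utters text in a neutral basic voice."""
    return SynthesizedVoice(text, {"timbre": "neutral", "range": "mid", "tone": "flat"})


def make_tuning_model(stored: Dict[str, str]) -> Callable[[SynthesizedVoice], SynthesizedVoice]:
    """Build a tuning model that applies pre-stored characteristics to a basic voice."""
    def tune(voice: SynthesizedVoice) -> SynthesizedVoice:
        # Stored characteristics override the basic voice; others are kept.
        return SynthesizedVoice(voice.text, {**voice.characteristics, **stored})
    return tune


# One tuning model per selectable entry (hypothetical identifiers and values).
TUNING_MODELS = {
    "person_a": make_tuning_model({"timbre": "warm", "range": "low"}),
    "news_anchor": make_tuning_model({"tone": "formal"}),
}


def synthesize(text: str, selection: str) -> SynthesizedVoice:
    """Apply the tuning model for the selected entry to the basic voice."""
    return TUNING_MODELS[selection](basic_voice_model(text))
```

The selection made on the interface screen corresponds to the `selection` key here, so adding another selectable voice only requires registering another tuning model; the basic voice model stays shared.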

In one embodiment, the voice characteristics are extracted by analyzing an input voice signal in the time domain or the frequency domain, and may include one or more of timbre, vocal range, tone, and gender.

In one embodiment, the content situation may include one or more of a business situation, an academic situation, and a daily-life situation.

In one embodiment, the sample machine voice characteristic information may be stored as a reference template for each associated reference.

In one embodiment, generating the audio of the second-language content may further include modulating the generated audio of the second-language content. The modulated audio may be generated by adjusting a voice characteristic value of the generated audio of the second-language content to an input modulation value.

According to another aspect of the present application, a simultaneous interpretation method for generating a simultaneous interpretation result that reflects a user request is performed by a simultaneous interpretation apparatus including a translation unit and a synthesis unit. The method includes: receiving text of content in a first language; receiving a speaker voice uttering an expression different from the first-language content; extracting speaker voice characteristics from the speaker voice; receiving a user request to be reflected in the simultaneous interpretation result and generating a corresponding control command; and generating, with the user request corresponding to the control command reflected, audio of content in a second language obtained by simultaneously interpreting the first-language content. Generating the audio of the simultaneously interpreted second-language content includes: translating the first-language content into second-language content; and generating audio of the second-language content.

According to yet another aspect of the present application, a computer-readable recording medium may record a program for performing the simultaneous interpretation method according to the embodiments described above.

According to yet another aspect of the present application, a simultaneous interpretation apparatus may include: a first input unit that acquires information in voice form; a second input unit that acquires information in non-voice form; a text generation unit that, when a voice signal is input to the second input unit, converts the voice signal into text; a feature extraction unit that extracts voice characteristics from the voice signal; a control module that, when a user request is input, generates a control command for producing a simultaneous interpretation result reflecting that request; and a simultaneous interpretation module that generates, with the user request corresponding to the control command reflected, audio of second-language content obtained by simultaneously interpreting first-language content. The simultaneous interpretation module includes: a translation unit that translates the first-language content acquired by the first input unit or the second input unit into second-language content; and a synthesis unit that generates audio of the second-language content.

An apparatus according to one aspect of the present invention can control the simultaneous interpretation service so that user characteristics are reflected, thereby providing a simultaneous interpretation service customized for each user.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the claims.

To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings are intended to illustrate the embodiments of this specification, not to limit them. For clarity of description, some elements in the drawings may be shown with modifications such as exaggeration or omission.
FIG. 1 is a schematic block diagram of a simultaneous interpretation apparatus according to one aspect of the present application.
FIGS. 2A and 2B are schematic diagrams of a process of generating second-language content reflecting a content situation, according to an embodiment of the present application.
FIG. 3 is a schematic diagram of a process of generating second-language content reflecting a speaker voice request, according to an embodiment of the present application.
FIG. 4 is a schematic diagram of a process of generating a speaker voice model, according to an embodiment of the present application.
FIG. 5 is a schematic diagram of a process of generating second-language content reflecting a specific-person voice request, according to an embodiment of the present application.
FIG. 6 is a schematic diagram of a process of generating a specific-person voice model, according to an embodiment of the present application.
FIG. 7 is a schematic diagram of a process of generating second-language content reflecting a reference voice request, according to an embodiment of the present application.
FIG. 8 is a schematic diagram of a process of generating a reference voice model, according to an embodiment of the present application.
FIG. 9 is a flowchart of a simultaneous interpretation method according to one aspect of the present application.
FIG. 10 is a flowchart of a simultaneous interpretation method when first-language content is acquired as voice, according to an embodiment of the present application.
FIG. 11 is a flowchart of a simultaneous interpretation method when second-language content is acquired as non-voice, according to an embodiment of the present application.
FIG. 12 is a flowchart of a simultaneous interpretation method when second-language content is acquired as non-voice, according to yet another embodiment of the present application.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings.

However, this is not intended to limit the present disclosure to particular embodiments; the disclosure should be understood to cover various modifications, equivalents, and/or alternatives of its embodiments. In the description of the drawings, similar reference numerals may be used for similar elements.

In this specification, expressions such as "has," "may have," "includes," or "may include" indicate the presence of the corresponding feature (e.g., a numerical value, function, operation, step, part, element, and/or component) and do not exclude the presence or addition of further features.

When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to that other element, but it should be understood that intervening elements may also be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no intervening element is present.

Expressions such as "first," "second," "primary," or "secondary" used in the various embodiments may modify various elements regardless of order and/or importance, and do not limit those elements. These expressions may be used to distinguish one element from another. For example, a first element and a second element may denote different elements regardless of order or importance.

As used in this specification, the expression "configured to (or set to)" may, depending on the context, be used interchangeably with, for example, "suitable for," "having the capacity to," "designed to," "adapted to," "made to," or "capable of." The term "configured to (or set to)" does not necessarily mean only "specifically designed to" in hardware. Instead, in some contexts, the expression "a device configured to" may mean that the device is "capable of" an operation together with other devices or parts. For example, the phrase "a processor configured (or set) to perform A, B, and C" may mean a dedicated processor (e.g., an embedded processor) for performing those operations, or a general-purpose processor (e.g., a CPU or an application processor) that can perform those operations by executing one or more software programs stored in a memory device.

FIG. 1 is a schematic block diagram of a simultaneous interpretation apparatus according to one aspect of the present application.

Referring to FIG. 1, an apparatus for providing a simultaneous interpretation service that reflects user requests (hereinafter, "simultaneous interpretation apparatus 1") includes a first input unit 11, a text generation unit 20, a feature extraction unit 30, a control module 40, a simultaneous interpretation module 50, and an output unit 60. The simultaneous interpretation module 50 includes a translation unit 51 and a synthesis unit 55. In some embodiments, the simultaneous interpretation apparatus 1 may further include a second input unit 12.
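As a rough illustration of how the components of FIG. 1 relate to one another, the sketch below wires stand-ins for the text generation unit 20, the feature extraction unit 30, and the simultaneous interpretation module 50 (translation unit 51 followed by synthesis unit 55) into one call chain. The class names, method names, and the stub behaviors are all hypothetical; audio is represented by `bytes` in and a plain string out purely for demonstration, and the control module 40 is reduced to two arguments.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

VoiceFeatures = Dict[str, float]


@dataclass
class SimultaneousInterpretationModule:
    """Stand-in for module 50: translation unit 51, then synthesis unit 55."""
    translate: Callable[[str, Optional[str]], str]
    synthesize: Callable[[str, Optional[VoiceFeatures]], str]

    def run(self, text: str, situation: Optional[str],
            voice: Optional[VoiceFeatures]) -> str:
        return self.synthesize(self.translate(text, situation), voice)


@dataclass
class SimultaneousInterpreter:
    """Stand-in for apparatus 1, limited to the voice-input path."""
    text_generator: Callable[[bytes], str]               # unit 20
    feature_extractor: Callable[[bytes], VoiceFeatures]  # unit 30
    module: SimultaneousInterpretationModule             # module 50

    def interpret_voice(self, audio: bytes, situation: Optional[str] = None,
                        use_speaker_voice: bool = False) -> str:
        text = self.text_generator(audio)
        # Voice features are only passed on when the speaker's voice is requested.
        features = self.feature_extractor(audio) if use_speaker_voice else None
        return self.module.run(text, situation, features)


# Stub components, for demonstration only.
demo = SimultaneousInterpreter(
    text_generator=lambda audio: audio.decode("utf-8"),
    feature_extractor=lambda audio: {"pitch_hz": 120.0},
    module=SimultaneousInterpretationModule(
        translate=lambda text, situation: f"[{situation or 'default'}] {text.upper()}",
        synthesize=lambda text, voice: f"audio<{text}>" + ("+speaker" if voice else ""),
    ),
)
```

Passing the units in as callables keeps each one replaceable, which mirrors the specification's point that each unit may be hardware, software, or a combination of the two.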

The simultaneous interpretation apparatus 1 according to the embodiments may be entirely hardware, entirely software, or partly hardware and partly software. For example, the simultaneous interpretation apparatus 1 may collectively refer to hardware with data processing capability and the operating software that drives it. In this specification, terms such as "unit," "module," "device," or "system" are intended to refer to a combination of hardware and the software driven by that hardware. For example, the hardware may be a computing device capable of data processing that includes a central processing unit (CPU), a graphics processing unit (GPU), or another processor. The software may refer to a running process, an object, an executable, a thread of execution, a program, and the like.

The first input unit 11 is an input device for acquiring information or user input in voice form. The first input unit 11 receives information or user input as a voice signal. The first input unit 11 may include a microphone that receives a sound wave signal and converts it into an electrical signal.

The second input unit 12 is an input device for acquiring information or user input in non-voice form. The second input unit 12 may include a touch panel, a keyboard, a button, a mouse, a dial, a switch, and/or a stick.

The simultaneous interpretation apparatus 1 acquires the information to be simultaneously interpreted through the first input unit 11 and/or the second input unit 12. The information to be simultaneously interpreted may be referred to as content.

The content is information expressed in a first language and may include, for example, a conversation, an advertisement, or technical literature. The content includes a set of text made up of words, sentences, and/or paragraphs. For example, the content may be a sentence.

The simultaneous interpretation apparatus 1 may acquire the voice of the first-language content through the first input unit 11. The simultaneous interpretation apparatus 1 supplies the acquired voice of the first-language content to the text generation unit 20.

The simultaneous interpretation apparatus 1 may acquire the text of the first-language content through the second input unit 12. When the first-language content is input through the second input unit 12 via a non-voice medium such as typing or pointing, the second input unit 12 may generate an electrical signal corresponding to the input made on the non-voice medium and thereby generate the text of the first-language content. For example, when a user types the first-language content on a keyboard, text data of the first-language content is generated from the electrical signal of each key. The simultaneous interpretation apparatus 1 supplies the text of the first-language content from the second input unit 12 to the translation unit 51.
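The two acquisition paths described above can be summarized in a short sketch: voice content from the first input unit 11 passes through speech-to-text before translation, while text content from the second input unit 12 is handed on unchanged. The types `VoiceContent` and `TextContent` and the function `first_language_text` are hypothetical names introduced for illustration, and the speech-to-text step is injected as a callable rather than tied to any particular recognizer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Union


@dataclass(frozen=True)
class VoiceContent:
    """First-language content acquired as a voice signal (first input unit 11)."""
    samples: Sequence[float]


@dataclass(frozen=True)
class TextContent:
    """First-language content acquired as text (second input unit 12)."""
    text: str


Content = Union[VoiceContent, TextContent]


def first_language_text(
    content: Content,
    speech_to_text: Callable[[Sequence[float]], str],
) -> str:
    """Return the first-language text that would be supplied for translation."""
    if isinstance(content, VoiceContent):
        # Voice input is converted to text first (role of text generation unit 20).
        return speech_to_text(content.samples)
    if isinstance(content, TextContent):
        # Text input needs no recognition step.
        return content.text
    raise TypeError(f"unsupported content type: {type(content).__name__}")
```

Either path ends with the same kind of value, first-language text, so the translation step does not need to know which input unit the content came from.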

또한, 상기 동시 통역 장치(1)는 상기 제1 입력부(11) 및/또는 제2 입력부(12)에 의해 사용자 요구를 반영한 동시 통역 결과를 생성하기 위한 사용자 입력을 수신한다. 이러한 사용자 입력에 따른 동시 통역 결과를 생성하는 과정은 아래의 도 2 내지 도 8을 참조하여 보다 상세하게 서술한다. Also, the simultaneous interpretation apparatus 1 receives user input through the first input unit 11 and/or the second input unit 12 to generate a simultaneous interpretation result reflecting a user request. A process of generating simultaneous interpretation results according to such user input will be described in detail with reference to FIGS. 2 to 8 below.

텍스트 생성부(20)는 제1 입력부(11)에 의해 획득된 음성 신호로부터 콘텐츠의 텍스트 데이터를 생성한다. 텍스트 생성부(20)는 발화자의 입으로부터 나온 음성 신호를 자동으로 인식하여 문자열로 변환해 주는 과정을 수행한다. The text generation unit 20 generates text data of content from the voice signal acquired by the first input unit 11 . The text generation unit 20 performs a process of automatically recognizing and converting a voice signal from a speaker's mouth into a character string.

특정 실시예들에서, 상기 텍스트 생성부(20)는 ASR(Automatic Speech Recognition) 또는 STT(Speech-to-Text) 프로그램을 포함할 수도 있다. 제1 언어의 콘텐츠가 음성 매개체를 통해 음성 정보 형태로 제1 입력부(11)에 의해 획득되면, 상기 텍스트 생성부(20)는 상기 제1 언어의 콘텐츠의 음성 신호를 STT 프로그램을 통해 변환하여 제1 언어의 콘텐츠의 텍스트 데이터를 생성한다. In certain embodiments, the text generator 20 may include an Automatic Speech Recognition (ASR) or Speech-to-Text (STT) program. When the contents of the first language are obtained by the first input unit 11 in the form of voice information through an audio medium, the text generator 20 converts the audio signal of the contents of the first language through the STT program and outputs Creates text data of content in one language.

텍스트 생성부(20)는 제1 언어의 텍스트 데이터를 동시 통역 모듈(50)로 공급한다. The text generation unit 20 supplies text data of the first language to the simultaneous interpretation module 50 .

When a voice signal is input to the first input unit 11, the feature extraction unit 30 extracts voice characteristics from the input voice signal. The voice characteristics are extracted either from the voice of the first-language content to be interpreted or from a voice other than that of the first-language content. When the first-language content is input through the first input unit 11 in the voice of its speaker, the feature extraction unit 30 may extract the voice characteristics of that speaker.

The feature extraction unit 30 may also extract voice characteristics from the voice signal of a specific person other than the speaker, for example a celebrity such as an entertainer or an actor.

In one embodiment, the voice characteristics may include timbre, vocal range, tone, and/or gender.

The feature extraction unit 30 analyzes the input voice signal in the time domain or the frequency domain to extract its voice characteristics. Here, the signal information related to the input voice signal may include, but is not limited to, waveform, frequency, period, and/or intensity.

The feature extraction unit 30 supplies the extracted voice characteristics to the control module 40.
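As an illustration of the time-domain analysis mentioned above, the following is a minimal sketch, not the disclosed implementation. It assumes the voice signal is available as a list of amplitude samples; the function name `extract_voice_features`, the zero-crossing frequency estimate, and the synthetic test tone are all assumptions made for this example.

```python
import math


def extract_voice_features(samples, sample_rate):
    """Return intensity (RMS) and a rough frequency estimate from raw samples.

    Hypothetical helper: the patent only states that waveform, frequency,
    period, and/or intensity may be analyzed, not how.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    # Intensity: root-mean-square amplitude of the waveform.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Frequency: upward zero crossings per second (a crude time-domain estimate).
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    duration = len(samples) / sample_rate
    return {"intensity": rms, "frequency_hz": crossings / duration}


# Synthetic 200 Hz tone, one second at 8 kHz, standing in for a voice signal.
rate = 8000
wave = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(rate)]
features = extract_voice_features(wave, rate)
```

A zero-crossing count is only reliable for clean, near-periodic signals; a practical extractor would more likely use autocorrelation or a spectral method.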

The control module 40 is configured to perform the overall control operations for generating a simultaneous interpretation result that reflects a user request.

In one embodiment, the control module 40 may generate a control command for producing a simultaneous interpretation result that reflects the requirement the user wants applied, that is, the user request. When a user request is input to the simultaneous interpretation device 1, the control module 40 generates a control command for reflecting that request.

When the simultaneous interpretation device 1 is configured to reflect one or more user requests, the control module 40 may generate one or more control commands.

The control module 40 supplies these control commands to the simultaneous interpretation module 50. In certain embodiments, the control module 40 may supply a control command to the translation unit 51 or the synthesis unit 55.

The control commands of the control module 40 are described in more detail below with reference to FIGS. 2 to 8.

The simultaneous interpretation module 50 translates the first-language content into second-language content and generates a voice of the translated second-language content as the simultaneous interpretation result.

The translation unit 51 of the simultaneous interpretation module 50 translates the first-language content into second-language content. The second-language content has the same context as the first-language content. The translation unit 51 translates the first-language text into second-language text.

The translation unit 51 translates the first-language text into second-language text using various translation methods. The translation methods may include a rule-based method and/or an artificial-intelligence-based method.

The rule-based method interprets using language rules established by linguists. In this case, the simultaneous interpretation module may include an interpretation program programmed on the basis of those language rules.

The artificial-intelligence-based method interprets using a neural network model. The neural network model is configured to convert first-language text data into second-language text data having the same meaning. The neural network model may be implemented with, but is not limited to, the structure of any one of a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory (LSTM) model, a generative adversarial network (GAN), or a combination thereof.

The second-language content produced by the translation unit 51 is supplied to the synthesis unit 55.

The synthesis unit 55 generates voice data for the second-language content. This voice data reflects the user request.

User requests include context-side requests and/or voice-side requests. A context-side request is a request to improve translation accuracy. A voice-side request is a request to receive the simultaneous interpretation result in a desired voice.

The synthesis unit 55 may generate a synthesized voice of the second-language content by synthesizing second-language content, translated so as to reflect a context-side request, with a machine voice, and/or by synthesizing the translated second-language content with a voice that reflects a voice-side request.

In one embodiment, the context-side request may include a request to reflect the content situation.

A single expression in the first language can be translated into multiple expressions in the second language. Which second-language expression is appropriate for a given first-language expression depends on the content situation. The same word or sentence may yield different translations depending on the content situation, such as the mood of the conversation or the speaker's purpose. When the content situation of the translation target is selected from among various predefined situations, the simultaneous interpretation device 1 translates the first-language content into second-language content under the selected content situation.

The content situation relates to the purpose of the content, the type of content, the speaker's intention, and the background of the utterance. The content situation may include, but is not limited to, a business situation, an academic situation, and/or a daily-life situation.

FIGS. 2A and 2B are schematic diagrams of a process of generating second-language content in which a content situation is reflected, according to an embodiment of the present application.

Referring to FIGS. 2A and 2B, the simultaneous interpretation device 1 is configured to receive, as a user request, the content situation to be reflected in the translation step of simultaneous interpretation. For clarity, this user request is referred to below as a situation request.

When a situation request is input, the simultaneous interpretation device 1 obtains situation information about the first-language content. The control module 40 then designates the input content situation as the content situation of the first-language content.

When a situation request is input, the control module 40 generates a first control command that causes the translation unit 51 to translate the first-language content into second-language content under the designated content situation. The first control command is supplied to the translation unit 51.

In one embodiment, the first control command may include content situation information for the first-language content. The translation unit 51 obtains the content situation information of the user request through the first control command. The translation unit 51 then translates the first-language content into second-language content under the content situation given in the first control command.

A single text may be interpreted in different contexts depending on the situation. The translation unit 51 is configured to translate a single first-language text into one of several second-language texts, each corresponding to one of a plurality of situations. Based on the content situation information in the first control command, the translation unit 51 uses the second-language text for that content situation as the translated text.

For example, as shown in FIG. 2A, when a user request indicating that the content situation of the first-language content is a business situation is input to the simultaneous interpretation device 1, the control module 40 designates the content situation of the first-language content as a business situation. The control module 40 then generates a first control command containing the content situation information for the first-language content and supplies it to the translation unit 51. Based on the first control command, the translation unit 51 translates the first-language content, under the business situation, into second-language content having the same context as the first-language content.

Alternatively, as shown in FIG. 2B, when a user request indicating that the content situation of the first-language content is an academic situation is input to the simultaneous interpretation device 1, the control module 40 designates the content situation of the first-language content as an academic situation. The control module 40 then generates a first control command containing the content situation information for the first-language content and supplies it to the translation unit 51. Based on the first control command, the translation unit 51 translates the first-language content, under the academic situation, into second-language content having the same context as the first-language content.
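The selection of a translation by content situation can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed translation method: the class `FirstControlCommand`, the lookup table `CANDIDATES`, and the English-to-Korean example word are hypothetical and stand in for whatever rule-based or neural translator the translation unit 51 actually uses.

```python
from dataclasses import dataclass

# Predefined content situations named in the description.
SITUATIONS = ("business", "academic", "daily_life")


@dataclass(frozen=True)
class FirstControlCommand:
    """Hypothetical container for the content situation information."""
    content_situation: str


# Illustrative table only: one first-language text, one candidate per situation.
CANDIDATES = {
    "paper": {
        "business": "서류",    # documents
        "academic": "논문",    # research paper
        "daily_life": "종이",  # sheet of paper
    },
}


def make_first_control_command(situation_request: str) -> FirstControlCommand:
    """Control-module side: designate the content situation from the request."""
    if situation_request not in SITUATIONS:
        raise ValueError(f"unknown content situation: {situation_request}")
    return FirstControlCommand(content_situation=situation_request)


def translate(text: str, command: FirstControlCommand) -> str:
    """Translation-unit side: pick the candidate for the commanded situation."""
    return CANDIDATES[text][command.content_situation]


business = translate("paper", make_first_control_command("business"))
academic = translate("paper", make_first_control_command("academic"))
```

The same source text yields different output only because the command carries a different situation, which mirrors the flow of FIGS. 2A and 2B.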

The simultaneous interpretation device 1 may convert the text data of the second-language content into voice data and provide it to the user. That is, it may provide a simultaneous interpretation result that reflects only a context-side request. This is described in more detail below with reference to FIG. 10.

In one embodiment, the voice-side request may include a speaker voice request, a specific-person voice request, and/or a reference voice request.

A speaker voice request is a user request to provide the simultaneous interpretation result in the speaker's voice. When a speaker voice request is input, the simultaneous interpretation device 1 provides the second-language content in the voice of the speaker who utters the first-language content.

A specific-person voice request is a user request to provide the simultaneous interpretation result in the voice of a specific person. Here, the specific person may be someone other than the speaker who utters the content to be interpreted, for example a celebrity such as an entertainer or an actor. When a specific-person voice request is input as a user request, the simultaneous interpretation device 1 provides the second-language content in the voice of the specific person.

A reference voice request is a user request to provide the simultaneous interpretation result in a sample machine voice that the user prefers. In a reference voice request, the reference indicates a direction, preference, or pattern for the voice the user wants. References may be classified by voice characteristic. In this case, a reference may be a label that designates a value or group of particular voices for each voice characteristic.

When the voice characteristics include vocal range, tone, timbre, and/or gender, the references may include a reference for vocal range, a reference for tone, and a reference for gender.

The reference for each voice characteristic may include one or more sub-references. The voices may include first to n-th reference voices for each vocal range, or first to n-th reference voices for each tone. For example, the simultaneous interpretation device 1 may hold sample machine voice characteristic information corresponding to each of three references, “high”, “mid”, and “low”, for the vocal-range characteristic.

The references may be determined as an average or statistically, based on responses from people who use the first language or the second language.

When a user request for a specific reference is input as a reference voice request, the simultaneous interpretation device 1 provides the result in the sample machine voice corresponding to that reference.

The characteristic information of the sample machine voice corresponding to a reference is stored as a reference template. When a reference voice request is input, the simultaneous interpretation device 1 generates a synthesized voice using the voice characteristic information of the corresponding reference template.

The sample machine voice may be a specific voice, or the voice of a group, determined as an average or statistically. The sample machine voice may be a representative voice for a group.

For example, the vocal-range values of a number of voice samples rated as low-pitched may be averaged or statistically analyzed to calculate the characteristic value of the sample machine voice corresponding to the “low” reference. The simultaneous interpretation device 1 generates and stores a template for “low” based on the calculated characteristic value of the machine voice.
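The averaging step in this example can be sketched as below. It is a minimal illustration that assumes the vocal-range value of each sample is a single pitch figure in hertz; the function name `build_reference_template`, the template fields, and the sample values are made up for the example and do not come from the disclosure.

```python
from statistics import mean


def build_reference_template(label, pitch_values_hz):
    """Average the pitch of samples rated under one reference label.

    Hypothetical helper: returns a template holding the characteristic value
    of the sample machine voice for that reference.
    """
    if not pitch_values_hz:
        raise ValueError("at least one rated sample is required")
    return {"reference": label, "vocal_range_hz": mean(pitch_values_hz)}


# Made-up pitch values for voice samples that raters judged to be low-pitched.
low_samples_hz = [98.0, 110.0, 104.0, 92.0]

# Template store keyed by reference label, filled here only for "low".
templates = {"low": build_reference_template("low", low_samples_hz)}
```

A stored template of this kind is what a later reference voice request would look up by label before synthesis.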

In some embodiments, a reference may consist of a plurality of sub-references for the same voice characteristic. For example, the simultaneous interpretation device 1 may include a low-pitch reference as a reference for the vocal-range characteristic and, as sub-references of that low-pitch reference, sample machine voice characteristics corresponding respectively to a first low pitch, …, and an n-th low pitch.

FIG. 3 is a schematic diagram of a process of generating second-language content in which a speaker voice request is reflected, according to an embodiment of the present application.

Referring to FIG. 3, when the simultaneous interpretation device 1 receives a speaker voice request, the control module 40 generates a 2-1 control command that causes the synthesis unit 55 to synthesize the speaker's voice with the second-language content translated by the translation unit 51. The 2-1 control command is supplied to the synthesis unit 55.

In one embodiment, the 2-1 control command may include the speaker's voice characteristic information extracted by the feature extraction unit 30. The synthesis unit 55 obtains information related to the speaker's voice through the 2-1 control command.

The synthesis unit 55 then generates a speaker-synthesized voice that reflects the speaker voice characteristics in the 2-1 control command. The speaker-synthesized voice is the second-language content synthesized with the speaker's voice, that is, voice data in which the speaker's voice characteristics are reflected in the voice of the second-language content.

In one embodiment, the synthesis unit 55 may generate the speaker-synthesized voice of the second-language content using a speaker voice model 540.

The simultaneous interpretation device 1 may generate the speaker-synthesized voice of the second-language content by inputting the second-language content generated by the translation unit 51 and the speaker voice characteristics extracted by the feature extraction unit 30 into the speaker voice model 540.

To this end, the simultaneous interpretation device 1 stores the speaker voice model 540 in advance, before generating speaker voice data. The simultaneous interpretation device 1 either generates the speaker voice model 540 itself or obtains it from an external source and stores it.

FIG. 4 is a schematic diagram of a speaker voice model according to an embodiment of the present application.

Referring to FIG. 4, the speaker voice model 540 may include a basic voice model 501 and a first tuning model 545. Here, including a model means including at least part of that model, or having the same parameter values and/or structure as at least part of that model.

When the second-language content from the translation unit 51 is input, the speaker voice model 540 feeds that content to the basic voice model 501 to generate basic voice data of the second-language content.

The basic voice model 501 is a model trained to utter second-language text in a basic voice according to the pronunciation rules of the second language. Because the basic voice model outputs voice data of the people who use the second language, it may be regarded as a second-country voice model.

The basic voice may be a representative voice, such as one obtained by statistically analyzing or averaging the voices of people who use the second language, or a machine voice. For example, the basic voice may be the average of the voices in the plurality of training samples described below.

The basic voice model 501 is trained with a plurality of training samples so that, when text written in the second language (e.g., second-language text data) is input, it outputs the result of pronouncing that text in the second language (e.g., second-language voice data).

Each of the plurality of training samples for the basic voice model 501 is a pair of second-language text data and the second-language voice data matching it. Here, voice data matching text data means the second-language text uttered in a second-language voice. The training samples may be generated from voice data in which each of a number of people utters a single second-language text in their own voice. That is, the plurality of training samples may contain one or more training samples for a single text.

The first tuning model 545 reflects the speaker voice characteristics in the basic voice of the second-language content to generate the speaker-synthesized voice of the second-language content.

In one embodiment, the first tuning model 545 may include a conditional auto-regressive model.

The first tuning model 545 is trained to reflect speaker voice characteristics in the basic voice. The basic voice and the speaker voice are each realized by parameters of voice characteristics. The first tuning model 545 is trained to adjust the basic voice characteristic values to the speaker voice characteristic values that are input.

In one embodiment, the first tuning model 545 may be trained through cross-lingual learning to reflect speaker voice characteristics in the basic voice of the second language. The cross-lingual learning method may include, for example, the Selvy deep method, but is not limited thereto.

In this way, when the control module 40 supplies the 2-1 control command, generated in response to a speaker voice request, to the synthesis unit 55, the synthesis unit 55 may generate speaker-synthesized voice data of the second-language content using the speaker voice model 540.
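The data flow through the speaker voice model 540 can be sketched as follows. This only illustrates the two-stage composition (basic voice model, then tuning model) and is not a working synthesizer: the `Voice` record, the parameter names `pitch_hz` and `timbre`, and the fixed default values are assumptions, and in the disclosure both stages are trained models that produce audio rather than small records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    """Hypothetical stand-in for voice data, reduced to a few parameters."""
    text: str
    pitch_hz: float
    timbre: str


def basic_voice_model(text_l2: str) -> Voice:
    """Stand-in for basic voice model 501: second-language text to basic voice."""
    return Voice(text=text_l2, pitch_hz=150.0, timbre="neutral")


def first_tuning_model(base: Voice, speaker_features: dict) -> Voice:
    """Stand-in for tuning model 545: move basic values to the speaker's values."""
    return Voice(
        text=base.text,
        pitch_hz=speaker_features.get("pitch_hz", base.pitch_hz),
        timbre=speaker_features.get("timbre", base.timbre),
    )


def speaker_voice_model(text_l2: str, speaker_features: dict) -> Voice:
    """Stand-in for model 540: basic voice first, then speaker tuning."""
    return first_tuning_model(basic_voice_model(text_l2), speaker_features)


# Speaker features as they might arrive in the 2-1 control command.
result = speaker_voice_model("Hello", {"pitch_hz": 118.0, "timbre": "warm"})
```

Keeping the basic voice model separate from the tuning stage is what lets the same basic model be shared with the specific-person voice model described later.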

FIG. 5 is a schematic diagram of a process of generating second-language content in which a specific-person voice request is reflected, according to an embodiment of the present application.

Referring to FIG. 5, when the simultaneous interpretation device 1 receives a specific-person voice request, the control module 40 generates a 2-2 control command that causes the synthesis unit 55 to synthesize the specific person's voice with the second-language content translated by the translation unit 51. The 2-2 control command is supplied to the synthesis unit 55.

In one embodiment, the 2-2 control command may include pre-stored voice characteristic information of the specific person. This information may be extracted from the specific person's voice by the feature extraction unit 30 and stored in advance.

The synthesis unit 55 then generates a specific-person synthesized voice that reflects the specific person's voice information in the 2-2 control command. The specific-person synthesized voice is the second-language content synthesized with the specific person's voice, that is, voice data in which the specific person's voice characteristics are reflected in the voice of the second-language content.

In one embodiment, the synthesis unit 55 may generate the specific-person synthesized voice of the second-language content using a specific-person voice model 560.

The simultaneous interpretation device 1 may generate the specific-person synthesized voice of the second-language content by inputting the second-language content generated by the translation unit 51 and the pre-stored voice characteristic information of the specific person into the specific-person voice model 560.

To this end, the simultaneous interpretation device 1 stores the specific-person voice model 560 in advance, before generating the specific-person synthesized voice. The simultaneous interpretation device 1 either generates the specific-person voice model 560 itself or obtains it from an external source and stores it.

FIG. 6 is a schematic diagram of a process of generating a specific-person voice model according to an embodiment of the present application.

Referring to FIG. 6, the specific-person voice model 560 may include the basic voice model 501 and a second tuning model 565.

When the second-language content from the translation unit 51 is input, the specific-person voice model 560 feeds that content to the basic voice model 501 to generate basic voice data of the second-language content.

The basic voice model 501 is a model trained to output the result of uttering second-language text as voice according to the pronunciation rules of the second language. It has been described above with reference to FIG. 4, so a detailed description is omitted here.

제2 튜닝 모델(565)은 제2 언어의 콘텐츠의 기본 음성에 특정인 음성 특성을 반영하여 제2 언어의 콘텐츠의 특정인 합성 음성을 생성한다. The second tuning model 565 reflects voice characteristics specific to the basic voice of the content in the second language to generate synthesized voice specific to the content in the second language.

일 실시예에서, 제2 튜닝 모델(565)은 CNN(Convolution Neural Network), RNN(Recurrent Neural Network), LSTM(Long Short-Term Memory models) 또는 GAN(Generative Adversarial Network) 을 포함할 수도 있다. 이 외에도, 상기 제2 튜닝 모델(565)은 다른 구조의 신경망을 포함할 수도 있다. In one embodiment, the second tuning model 565 may include a Convolution Neural Network (CNN), a Recurrent Neural Network (RNN), Long Short-Term Memory models (LSTM), or a Generative Adversarial Network (GAN). In addition to this, the second tuning model 565 may include a neural network having a different structure.

상기 제2 튜닝 모델(565)은 기본 음성에 특정인 음성 특성을 반영하도록 학습된다. 기본 음성 및 특정인 음성은 음성 특성의 파라미터에 의해 구현된다. 제2 튜닝 모델(565)은 기본 음성 특성 값을 입력되는 특정인 음성 특성 값으로 조절하도록 학습된다. The second tuning model 565 is learned to reflect voice characteristics specific to the base voice. The basic voice and the specific voice are implemented by parameters of voice characteristics. The second tuning model 565 is trained to adjust basic voice characteristic values to input specific voice characteristic values.

상기 제2 튜닝 모델(565)은, CNN(Convolution Neural Network), RNN(Recurrent Neural Network), LSTM(Long Short-Term Memory models), GAN(Generative Adversarial Network) 등과 같은 신경망 구조에 활용 가능한, 학습 알고리즘을 사용하여 제2 언어의 콘텐츠의 기본 음성 데이터에 특정인 음성 특성을 반영하도록 학습된다. The second tuning model 565 is a learning algorithm that can be used for neural network structures such as convolution neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory models (LSTMs), and generative adversarial networks (GANs). It is learned to reflect speech characteristics specific to the basic speech data of the content of the second language using .

In this way, when the control module 40 supplies the 2-2 control command corresponding to a specific-person voice request to the synthesis unit 55, the synthesis unit 55 may use the specific-person voice model to generate specific-person synthesized voice data of the second-language content.

In one embodiment, the simultaneous interpretation device 1 may receive a user input selecting one of a plurality of specific persons, and may synthesize the second-language content generated by the translation unit 51 with the selected person's voice to generate a specific-person synthesized voice of the second-language content.

The simultaneous interpretation device 1 may store voice characteristic information of one or more specific persons. The device 1 may prompt a user input by providing an interface screen M2 for selecting one of the stored items of specific-person voice characteristic information.

The simultaneous interpretation device 1 generates the selected person's synthesized voice of the second-language content based on the second-language content and the selected person's voice characteristics.

In some embodiments, the simultaneous interpretation device 1 may include a plurality of second tuning models 565, one for each of a plurality of specific persons. For example, for n specific persons, the device 1 may include a 2-1 tuning model 565-1, ..., and a 2-n tuning model 565-n. Each second tuning model 565 is trained to adjust the voice characteristic values of an input voice (e.g., the basic voice) to the voice characteristic values of the corresponding specific person.
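The arrangement of one tuning model per specific person can be pictured as a lookup keyed by the selected person. The sketch below is only illustrative: the dictionary representation of a voice, the characteristic names, and the `make_tuning_model` stand-in (which simply overwrites values instead of running a trained network) are assumptions and are not taken from the specification.

```python
from typing import Callable, Dict

# Hypothetical representation: a voice is a dict of characteristic name -> value.
VoiceCharacteristics = Dict[str, float]
TuningModel = Callable[[VoiceCharacteristics], VoiceCharacteristics]


def make_tuning_model(target: VoiceCharacteristics) -> TuningModel:
    """Stand-in for a trained tuning model 565-k (not a real neural network)."""
    def tune(base: VoiceCharacteristics) -> VoiceCharacteristics:
        tuned = dict(base)    # leave the caller's basic voice untouched
        tuned.update(target)  # characteristics absent from the target stay as-is
        return tuned
    return tune


# One tuning model per stored specific person (identifiers and values are made up).
tuning_models: Dict[str, TuningModel] = {
    "person_1": make_tuning_model({"pitch_hz": 110.0, "tone": 0.3}),
    "person_2": make_tuning_model({"pitch_hz": 220.0, "tone": 0.7}),
}


def synthesize_for(person_id: str, base_voice: VoiceCharacteristics) -> VoiceCharacteristics:
    """Pick the tuning model of the selected person and apply it to the basic voice."""
    return tuning_models[person_id](base_voice)
```

Selecting a person on a screen such as M2 would then amount to choosing the key passed to `synthesize_for`.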

FIG. 7 is a schematic diagram of a process of generating second-language content reflecting a reference voice request, according to an embodiment of the present application.

Referring to FIG. 7, when the simultaneous interpretation device 1 receives a reference voice request, the control module 40 generates a 2-3 control command that causes the synthesis unit 55 to synthesize a sample machine voice with the second-language content translated by the translation unit 51. The 2-3 control command is supplied to the synthesis unit 55.

In one embodiment, the 2-3 control command may include pre-stored sample machine voice characteristic information. As described above, this sample machine voice characteristic information may be characteristic information obtained by averaging or statistically analyzing the voices of many people.

In some embodiments, the simultaneous interpretation device 1 may store the sample machine voice characteristic information in advance as a reference template. The reference template is generated using many speakers of the second language as samples, and includes the corresponding sample machine voice characteristic information.

In one embodiment, the synthesis unit 55 may use a reference voice model 580 to generate a reference synthesized voice of the second-language content. The reference synthesized voice is voice data obtained by synthesizing the second-language content with the sample machine voice associated with the reference named in the reference voice request.

The simultaneous interpretation device 1 may input the second-language content generated by the translation unit 51 and the pre-stored sample machine voice characteristic information (e.g., a reference template) into the reference voice model 580 to generate the reference synthesized voice of the second-language content.

To this end, the simultaneous interpretation device 1 stores the reference voice model 580 in advance, before generating the reference synthesized voice. The device 1 either generates the reference voice model 580 itself or obtains it from an external source and stores it.

FIG. 8 is a schematic diagram of a process of generating a reference voice model, according to an embodiment of the present application.

Referring to FIG. 8, the reference voice model 580 may include the basic voice model 501 and a third tuning model 585.

When the second-language content generated by the translation unit 51 is input, the reference voice model 580 feeds that content to the basic voice model 501 to generate basic voice data of the second-language content.

The third tuning model 585 applies the sample machine voice characteristics to the basic voice of the second-language content to generate the reference synthesized voice of the second-language content.

In one embodiment, the third tuning model 585 may include a Gaussian mixture model (GMM), a restricted Boltzmann machine (RBM), a deep neural network (DNN), CycleGAN, or StarGAN.

In another embodiment, the third tuning model 585 may, like the second tuning model 565, have a convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory (LSTM), or generative adversarial network (GAN) structure.

The third tuning model 585 is trained to apply the sample machine voice characteristics to the basic voice. The basic voice and the sample machine voice are each represented by voice characteristic parameters. The third tuning model 585 is trained to adjust the basic voice characteristic values to the input sample machine voice characteristic values.

The third tuning model 585 is trained, using a learning algorithm applicable to neural network structures such as a CNN, RNN, LSTM, or GAN, to apply the sample machine voice characteristics to the basic voice data of the second-language content.

In this way, when the control module 40 supplies the 2-3 control command corresponding to the reference voice request to the synthesis unit 55, the synthesis unit 55 may use the reference voice model to generate sample-machine synthesized voice data of the second-language content.

In one embodiment, the simultaneous interpretation device 1 may receive a user input selecting one of a plurality of sample machine voices, and may synthesize the second-language content generated by the translation unit 51 with the selected sample machine voice to generate a reference synthesized voice of the second-language content.

The simultaneous interpretation device 1 may store one or more items of sample machine voice characteristic information. The device 1 may store one or more reference templates, each associated with one of the sample machine voices. The device 1 may prompt a user input by providing an interface screen M2 for selecting one of the stored items of sample machine voice characteristic information.

The simultaneous interpretation device 1 generates the reference synthesized voice of the second-language content based on the second-language content and the reference template associated with the selected sample machine voice.

In some embodiments, the simultaneous interpretation device 1 may include a plurality of third tuning models 585, one for each of a plurality of sample machine voices. For example, for n sample machine voices, the device 1 may include a 3-1 tuning model 585-1, ..., and a 3-n tuning model 585-n. Each third tuning model 585 is trained to adjust the voice characteristic values of an input voice (e.g., the basic voice) to the voice characteristic values of the corresponding sample machine voice.

In addition, the simultaneous interpretation device 1 may further modulate the generated synthesized voice.

In certain embodiments, the simultaneous interpretation device 1 may modulate the synthesized voice by adjusting its voice characteristic values. When a voice characteristic value for the modulated synthesized voice (i.e., a modulation value) is input, the device 1 adjusts the voice characteristic value to that modulation value.

The parameter value of a voice characteristic of the synthesized voice, that is, of the already generated voice data, is the initial value; the parameter value of that voice characteristic in the final voice data is the modulation value.

In one embodiment, the simultaneous interpretation device 1 may provide the user with an interface screen M3 for fine control, which displays the voice characteristic values of the generated synthesized voice. The interface screen M3 may be configured to obtain a modulation value from a relative position, or to obtain the modulation value directly as a number.
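Obtaining a modulation value from a relative position can be read as a linear mapping from a control's position to the allowed range of a voice characteristic. The following is a minimal sketch under that assumption; the function name, the normalized position in [0, 1], and the example range are hypothetical and are not stated in the specification.

```python
def modulation_from_position(position: float, low: float, high: float) -> float:
    """Map a relative control position in [0, 1] onto the range [low, high]."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0 and 1")
    if low > high:
        raise ValueError("low must not exceed high")
    # Linear interpolation: 0 gives the lower bound, 1 gives the upper bound.
    return low + position * (high - low)
```

For example, with an assumed pitch range of 80 Hz to 320 Hz, a control placed at the midpoint would yield a modulation value of 200 Hz.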

For example, when the simultaneous interpretation device 1 has generated a speaker synthesized voice, it may provide the interface screen M3 of FIG. 3, which displays the values of at least some of the voice characteristics of the generated speaker synthesized voice.

When the simultaneous interpretation device 1 has generated a specific-person synthesized voice, it may provide the interface screen M3 of FIG. 5, which displays the values of at least some of the voice characteristics of the generated specific-person synthesized voice.

When the simultaneous interpretation device 1 has generated a reference synthesized voice, it may provide the interface screen M3 of FIG. 7, which displays the values of at least some of the voice characteristics of the generated reference synthesized voice.

Through this interface screen M3, the simultaneous interpretation device 1 interacts with the user to obtain the modulation values of the voice characteristics.

The control module 40 transmits an adjustment command including modulation value information to the synthesis unit 55. For example, when the interface screen M3 of FIG. 3, which displays the voice characteristic values of the synthesized voice, is shown, the control module 40 generates an adjustment command including the input modulation value information. The modulation value information includes the voice characteristic to be adjusted and the parameter value to which that voice characteristic is to be adjusted.

The synthesis unit 55 adjusts the voice characteristic values of the generated synthesized voice based on the adjustment command. The synthesis unit 55 thereby modulates the synthesized voice into a voice corresponding to the modulation values.
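The adjustment command described above carries two things: which voice characteristic to adjust and the value it should take. A minimal sketch of that data flow follows; the `AdjustCommand` type, the field names, and the dictionary form of the voice characteristics are assumptions made for illustration, and the actual signal-level modulation is not modeled.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AdjustCommand:
    """Hypothetical adjustment command sent from control module 40 to synthesis unit 55."""
    characteristic: str      # voice characteristic to be adjusted
    modulation_value: float  # parameter value it is to be adjusted to


def apply_adjustment(voice: Dict[str, float], command: AdjustCommand) -> Dict[str, float]:
    """Return a copy of the voice whose named characteristic is set to the modulation value."""
    if command.characteristic not in voice:
        raise KeyError(f"unknown voice characteristic: {command.characteristic}")
    modulated = dict(voice)  # the initial values stay available to the caller
    modulated[command.characteristic] = command.modulation_value
    return modulated
```

Returning a copy keeps the initial values and the modulation values distinct, which mirrors the distinction the text draws between the two.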

In one embodiment, the synthesis unit 55 may modulate the synthesized voice using a modulation algorithm. The modulated voice is represented by voice characteristic parameters that take the adjusted modulation values.

The modulation algorithm may include a first modulation algorithm that uses parallel data or a second modulation algorithm that uses non-parallel data. Here, parallel data means that the voice data before and after modulation originate from the voice of the same person.

In one embodiment, the first modulation algorithm may perform a frame-by-frame alignment operation and then, when a modulation value is input, adjust the parameter value of the voice characteristic of the synthesized voice to that modulation value.

For example, the first modulation algorithm may be a Dynamic Time Warping (DTW) algorithm or an INCA (Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment) algorithm.
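To illustrate the frame-by-frame alignment that DTW provides, the sketch below implements the textbook dynamic-programming form of DTW on two sequences of per-frame scalar features. Using one scalar per frame and an absolute-difference distance are simplifying assumptions; the specification does not state which features or distance measure are used.

```python
from typing import List, Sequence, Tuple


def dtw_align(a: Sequence[float], b: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the DTW cost and the frame alignment path between sequences a and b."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j]: minimal accumulated distance aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = abs(a[i - 1] - b[j - 1])
            cost[i][j] = dist + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor cell.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path
```

Each pair `(i, j)` in the returned path says that frame `i` of the first sequence is matched with frame `j` of the second, so a frame may be matched more than once when one utterance is slower than the other.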

In another embodiment, the first modulation algorithm may use a learning model to adjust the initial parameter value of a voice characteristic to the modulation value.

This learning model may be similar to the third tuning model 585. For example, the learning model of the first modulation algorithm may include a Gaussian mixture model (GMM), a restricted Boltzmann machine (RBM), or a deep neural network (DNN). The learning model of the first modulation algorithm is trained using parallel data.

In one embodiment, the second modulation algorithm may use a learning model to adjust the initial parameter value of a voice characteristic to the modulation value.

This learning model may be similar to the third tuning model 585. For example, the learning model of the second modulation algorithm may include CycleGAN or StarGAN. The learning model of the second modulation algorithm is trained using non-parallel data.

The learning model of the second modulation algorithm that includes CycleGAN is trained to modulate the voices of two people in both directions, and thereby learns to adjust the initial parameter values of voice characteristics to modulation values.

The learning model of the second modulation algorithm that includes StarGAN is trained to output multiple kinds of voice data simultaneously using a mask vector, and thereby learns to adjust the initial parameter values of voice characteristics to various modulation values. StarGAN is trained using a training sample set covering a plurality of domains. The training sample set is divided into subsets by domain. Each training sample in a domain's subset includes voice data, a domain label, and a mask vector. The domain label is used so that the learning model learns to transform voice data under the target domain. The mask vector specifies the domain to focus on during training: labels that are not specified are ignored during training, and labels that are specified are focused on.
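One way to picture how a mask vector tells the model which labels to attend to is the unified label used in the published StarGAN formulation: the label vectors of all data sets are concatenated, the ones that are unknown for a given sample are zero-filled, and a one-hot mask marks the data set whose label is actually specified. The sketch below follows that layout as an assumption; the specification does not spell out the vector format, and the function and parameter names are hypothetical.

```python
from typing import List, Sequence


def unified_label(label: Sequence[float], dataset_index: int,
                  label_sizes: Sequence[int]) -> List[float]:
    """Build [c_1, ..., c_n, m]: per-data-set labels followed by a one-hot mask vector."""
    if not 0 <= dataset_index < len(label_sizes):
        raise ValueError("dataset_index out of range")
    if len(label) != label_sizes[dataset_index]:
        raise ValueError("label length does not match its data set")
    vector: List[float] = []
    for k, size in enumerate(label_sizes):
        # Only the sample's own data set contributes a real label; the rest are zero-filled.
        vector.extend(label if k == dataset_index else [0.0] * size)
    # The mask marks which block of the vector holds a specified label.
    mask = [1.0 if k == dataset_index else 0.0 for k in range(len(label_sizes))]
    return vector + mask
```

With two data sets whose labels have 3 and 2 entries, a sample from the second data set with label `[0.0, 1.0]` yields `[0, 0, 0, 0, 1, 0, 1]`: the first block is ignored and the second is the one the mask points to.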

In one embodiment, a domain in the training sample set for a plurality of domains may be a voice. That is, the learning model of the second modulation algorithm is trained using voice data subsetted per person as individual data sets. Because a mask vector is added to each training sample, the learning model of the second modulation algorithm can be trained jointly across the domains of different data sets.

In addition, the synthesis unit 55 may use another voice modulation technique to adjust the initial parameter values of the voice characteristics of the synthesized voice to the modulation values and thereby generate a modulated synthesized voice.

The output unit 60 outputs, as sound, the voice data for the second-language content in which the user request is reflected. The output unit 60 may include a device, such as a speaker, that converts an electrical signal into a sound signal and outputs it.

The output unit 60 may output the voice of the second-language content in which a context request is reflected.

Alternatively, the output unit 60 may output a synthesized voice of the second-language content in which a speaker voice request, a specific-person voice request, or a reference voice request is reflected.

The output unit 60 may also output the modulated synthesized voice.

It will be apparent to those skilled in the art that the simultaneous interpretation device 1 may include other components not described in this specification. For example, it may further include a data input device, an output device such as a display and/or a printer, a storage device such as a memory, a network, a network interface, and/or a protocol.

A simultaneous interpretation method according to another aspect of the present invention is performed by a computing device including a processor. The computing device may be the simultaneous interpretation device 1 or another data processing device such as a computer. For clarity of explanation, the method is described below in more detail on the basis of embodiments in which it is performed by the simultaneous interpretation device 1 of FIG. 1.

FIG. 9 is a flowchart of a simultaneous interpretation method according to one aspect of the present application.

Referring to FIG. 9, the simultaneous interpretation method includes: receiving first-language content (S10); receiving a user request to be reflected in the second-language content and generating a corresponding control command (S40); and generating, with the user request corresponding to the control command reflected, the voice of the second-language content obtained by simultaneously interpreting the first-language content (S50). The simultaneous interpretation method also includes outputting the voice of the second-language content (S60).
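The order of steps S10, S40, S50, and S60 can be sketched as a small driver function. This is only an illustration of the control flow: the function name and the three callables standing in for the control module, the interpretation stage, and the output unit are hypothetical, and no real translation or speech synthesis is performed.

```python
from typing import Callable, TypeVar

Command = TypeVar("Command")
Audio = TypeVar("Audio")


def run_interpretation(
    first_language_content: str,                      # S10: content already received
    user_request: str,
    make_command: Callable[[str], Command],           # stands in for control module 40
    generate_audio: Callable[[str, Command], Audio],  # stands in for translation + synthesis
    output: Callable[[Audio], None],                  # stands in for output unit 60
) -> Audio:
    """Run the steps of FIG. 9 in order and return the generated audio."""
    command = make_command(user_request)                     # S40: request -> control command
    audio = generate_audio(first_language_content, command)  # S50: generate second-language voice
    output(audio)                                            # S60: output the voice
    return audio
```

Passing the stages in as callables keeps the sketch independent of how the request is collected or how the audio is produced.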

This simultaneous interpretation method generates the voice of the second-language content, obtained by simultaneously interpreting the first-language content, according to how the first-language content is input and according to the user request.

FIG. 10 is a flowchart of the simultaneous interpretation method when the first-language content is acquired as voice, according to an embodiment of the present application.

Referring to FIG. 10, the simultaneous interpretation method includes: receiving (e.g., by the first input unit 11) the speaker's voice of the first-language content uttered by the speaker (S1010); converting the received speaker's voice of the first-language content into first-language text (S1020); receiving a user request to be reflected in the simultaneous interpretation result and generating a corresponding control command (S1040); and generating, with the user request corresponding to the control command reflected, the voice of the second-language content obtained by simultaneously interpreting the first-language content (S1050). The voice of the second-language content may be a voice in which the user request is reflected.

Step S1050 includes: translating the first-language content into second-language content (S1051); and generating voice data of the second-language content (S1055).

In certain embodiments, the simultaneous interpretation method of FIG. 10 may further include extracting (e.g., by the feature extraction unit 30) the speaker's voice characteristics from the speaker's voice (S1030).

In steps S1010 and S1020, the speaker's voice and the first-language content information are obtained.

In step S1030, signal information related to the input voice signal in the time domain or the frequency domain is analyzed to extract the voice characteristics of the input voice signal. The signal information related to the input voice signal may include, but is not limited to, waveform, frequency, period, and/or intensity.
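As a rough time-domain illustration of the kind of signal information named above, the sketch below estimates intensity as a root-mean-square value and frequency from the rate of upward zero crossings. Both estimators are assumptions chosen for simplicity; the specification does not say how the feature extraction unit 30 performs its analysis, and the zero-crossing estimate is only meaningful for clean, roughly periodic signals.

```python
import math
from typing import Dict, Sequence


def extract_signal_info(samples: Sequence[float], sample_rate: int) -> Dict[str, float]:
    """Estimate intensity (RMS) and frequency (upward zero crossings per second)."""
    if len(samples) < 2 or sample_rate <= 0:
        raise ValueError("need at least two samples and a positive sample rate")
    # Intensity: root-mean-square amplitude over the whole signal.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Frequency: one upward zero crossing per cycle of a periodic signal.
    crossings = sum(1 for prev, cur in zip(samples, samples[1:]) if prev < 0.0 <= cur)
    duration_s = len(samples) / sample_rate
    return {"intensity_rms": rms, "frequency_hz": crossings / duration_s}
```

For a one-second 100 Hz sine wave of unit amplitude, the function should report a frequency close to 100 Hz and an RMS intensity close to 0.707.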

In one embodiment, the voice characteristics may include timbre, vocal range, tone, and/or gender.

In one embodiment, step S1040 may include: receiving a context request as the user request before step S1051, generating a first control command, and transmitting it to the translation unit 51 (S1041); and/or receiving a speaker voice request, a specific-person voice request, or a reference voice request as the user request before step S1055, generating a second control command, and transmitting it to the synthesis unit 55 (S1045).

When a context request is input in step S1041, the first-language content is translated into second-language content based on the content context of the context request.

In one embodiment, the step of translating into second-language text (S1051) may include translating the text of the first-language content into the text of the second-language content by a rule-based method and/or an artificial-intelligence-based method.

The text of the second-language content is converted into the voice of the second-language content (S1055).

In step S1041, the context request may be received through a user input selecting one of a plurality of pre-stored content contexts.

In one embodiment, step S1041 may include: providing an interface screen configured to allow selection of one of a plurality of pre-stored content contexts; and, when an input selecting one content context is received, generating a first control command including the selected content context information and transmitting the first control command to the translation unit 51. The first control command causes the first-language content to be translated into second-language content under the selected content context. The selected content context indicates the context of the first-language content.

The translation unit 51 then translates the text of the first-language content obtained in step S1020 into the text of the second-language content under that content context (S1051). The translation result is specialized for the designated content context.

In step S1045, the speaker voice request, specific-person voice request, or reference voice request may be received through a user input selecting one of the speaker voice request, the specific-person voice request, and the reference voice request as the user request.

In one embodiment, step S1045 may include: providing an interface screen M1 configured to allow selection of one of the speaker voice request, the specific-person voice request, and the reference voice request as the user request; when an input selecting the speaker voice request is received (e.g., by the control module 40), generating a 2-1 control command including the speaker's voice characteristic information; when an input selecting the specific-person voice request is received, generating a 2-2 control command including the pre-stored voice characteristic information of the selected specific person; when an input selecting the reference voice request is received, generating a 2-3 control command including the pre-stored sample machine voice information; and transmitting the generated 2-1, 2-2, or 2-3 control command to the synthesis unit 55.
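The choice among the three second control commands can be sketched as a dispatch on the selected request type, with each command carrying a different source of voice characteristic information. The request strings, the `SecondControlCommand` type, and the dictionary form of the characteristic information are hypothetical names introduced for this illustration only.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

Characteristics = Mapping[str, float]


@dataclass(frozen=True)
class SecondControlCommand:
    """Hypothetical command for synthesis unit 55: code is "2-1", "2-2", or "2-3"."""
    code: str
    voice_characteristics: Characteristics


def build_second_command(
    request: str,
    speaker_features: Characteristics,                   # extracted from the speaker's voice
    stored_persons: Mapping[str, Characteristics],       # pre-stored specific-person info
    reference_template: Characteristics,                 # pre-stored sample machine voice info
    person_id: Optional[str] = None,
) -> SecondControlCommand:
    """Map the selected voice request to the matching second control command."""
    if request == "speaker":
        return SecondControlCommand("2-1", speaker_features)
    if request == "specific_person":
        if person_id is None or person_id not in stored_persons:
            raise KeyError("a stored specific person must be selected")
        return SecondControlCommand("2-2", stored_persons[person_id])
    if request == "reference":
        return SecondControlCommand("2-3", reference_template)
    raise ValueError(f"unsupported voice request: {request}")
```

The specific-person branch needs a second selection, which corresponds to the additional screen used to pick one of the stored persons.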

The 2-1 control command causes the translated second-language content to be synthesized with the speaker's voice, so as to provide a simultaneous interpretation result in which the speaker's voice characteristics are reflected. The speaker's voice characteristic information is extracted from the voice signal of the first-language content uttered by the speaker, and is obtained from the feature extraction unit 30.

The 2-2 control command causes the translated second-language content to be synthesized with the voice of the selected specific person, so that the simultaneous interpretation result reflects that person's voice characteristics. The specific person's voice characteristic information is pre-stored in the simultaneous interpretation apparatus 1 and is extracted from the voice signals of one or more specific persons.

The simultaneous interpretation method may also be used to reflect the voice of a specific person selected from among a plurality of specific persons.

In one embodiment, step S1041 may further include: after providing the interface screen M1 configured so that any one of the speaker voice request, the specific person's voice request, and the reference voice request can be selected as the user request, providing an interface screen M2 that prompts selection of one of a plurality of pre-stored selectable specific persons when a user input selecting the specific person's voice request is obtained. For example, when the specific person's voice request is selected on the interface screen M1 of FIG. 5, the interface screen M2 displaying the specific persons whose voices can be selected may be provided.

The 2-3 control command causes the translated second-language content to be synthesized with the selected sample machine voice, so that the simultaneous interpretation result reflects the sample machine voice characteristics. The sample machine voice characteristic information is pre-stored in the simultaneous interpretation apparatus 1 and is voice characteristic information extracted from the voice signals of one or more specific persons.

The simultaneous interpretation method may also be used to reflect a sample machine voice selected from among a plurality of sample machine voices.

In one embodiment, step S1041 may further include: after providing the interface screen M1 configured so that any one of the speaker voice request, the specific person's voice request, and the reference voice request can be selected as the user request, providing an interface screen M2 that prompts selection of one of a plurality of pre-stored selectable sample machine voices when a user input selecting the specific person's voice request is obtained. For example, when the specific person's voice request is selected on the interface screen M1 of FIG. 7, the interface screen M2 displaying the specific persons whose voices can be selected may be provided.

When there is no voice-related user request (e.g., when there is only a first control command), the voice for the second-language content is generated from the text of the second-language content using the basic voice model 501 (S1055).

When there is a voice-related user request (e.g., when there is a second control command), the voice for the second-language content (i.e., a synthesized voice) is generated from the text of the second-language content using the corresponding model (540, 560, 580) (S1055). The second control command causes the synthesis unit to synthesize the second-language content with the speaker's voice, a specific person's voice, or a sample machine voice associated with a reference.

For example, the synthesis unit 55 obtains information related to the speaker's voice, including the speaker voice characteristic information, through the 2-1 control command. The synthesis unit 55 generates a speaker-synthesized voice of the second-language content based on the second-language content of step S1051 and the speaker voice characteristic information (S1055).

The 2-1 control command and the speaker voice model 540, which generates the speaker-synthesized voice based on that command, were described above with reference to FIGS. 3 and 4, so a detailed description is omitted.

Alternatively, the synthesis unit 55 obtains information related to the specific person's voice, including the specific person's voice characteristic information, through the 2-2 control command. The synthesis unit 55 generates a specific-person synthesized voice of the second-language content based on the second-language content of step S1051 and the specific person's voice characteristic information (S1055).

The process of generating the 2-2 control command and the specific person voice model 560, which generates the specific person's voice data based on the 2-2 control command, were described above with reference to FIGS. 5 and 6, so a detailed description is omitted.

Alternatively, the synthesis unit 55 obtains information related to the specific person's voice, including the sample machine voice characteristic information, through the 2-3 control command. The synthesis unit 55 generates a reference synthesized voice of the second-language content based on the second-language content of step S1051 and the sample machine voice characteristic information (S1055). As described above, the reference synthesized voice of the second-language content is voice data obtained by synthesizing the second-language content with the sample machine voice having the voice characteristics carried by the 2-3 control command.

In one embodiment, the synthesis unit 55 may generate the reference synthesized voice of the second-language content using the reference voice model 580.

The 2-3 control command and the reference voice model 580, which generates the reference synthesized voice based on that command, were described above with reference to FIGS. 7 and 8, so a detailed description is omitted.
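
The model selection in step S1055 can be pictured with the short sketch below. It is only an illustration: `basic_voice_model`, `tuned`, `MODELS`, and `synthesize` are hypothetical names, and strings stand in for audio. Building each voice model as a tuning step on top of the basic voice follows the claims for the specific person and reference voice models; applying the same shape to the speaker voice model is an assumption.

```python
from typing import Callable, Dict, Optional


def basic_voice_model(text: str) -> str:
    # Stand-in for the basic voice model 501; returns a label instead of audio.
    return f"basic({text})"


def tuned(label: str) -> Callable[[str], str]:
    # Builds a stand-in model that applies a tuning step on top of the basic voice.
    return lambda text: f"{label}({basic_voice_model(text)})"


# Hypothetical mapping from second control command kind to voice model.
MODELS: Dict[str, Callable[[str], str]] = {
    "2-1": tuned("speaker_540"),
    "2-2": tuned("person_560"),
    "2-3": tuned("reference_580"),
}


def synthesize(text: str, command_kind: Optional[str] = None) -> str:
    """Use the basic voice unless a second control command selects a model."""
    if command_kind is None:
        return basic_voice_model(text)
    return MODELS[command_kind](text)
```

Keeping the dispatch in a single table makes the "no voice request" case an explicit default rather than a fourth model entry.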

The simultaneous interpretation method includes outputting the voice of the second-language content generated in step S1050 (S1060).

The simultaneous interpretation method includes outputting (e.g., by the output unit 60) the voice data of step S1050 as a voice signal (S1060).

The simultaneous interpretation method may further include storing the voice data of step S1050 (S970).

The step of outputting as a voice signal (S1060) may further include previewing the voice data of the second-language content. Here, previewing means outputting a partial section of the voice data as a voice signal. The partial section may be the initial section of the entire synthesized voice, but is not limited thereto.
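
As a rough sketch of the preview described above, the function below cuts a partial section out of a list of audio samples. The function name `preview` and its parameters are hypothetical; the application does not specify how the section is chosen beyond noting that it may be the initial section.

```python
from typing import List, Sequence


def preview(
    samples: Sequence[float],
    sample_rate: int,
    seconds: float,
    start_seconds: float = 0.0,
) -> List[float]:
    """Return a partial section of the audio; defaults to the initial section."""
    if sample_rate <= 0 or seconds < 0 or start_seconds < 0:
        raise ValueError("sample_rate must be positive and times non-negative")
    start = int(start_seconds * sample_rate)
    end = start + int(seconds * sample_rate)
    # Slicing clamps to the end of the data, so a long request is safe.
    return list(samples[start:end])
```

Passing a non-zero `start_seconds` covers the case where the previewed section is not the initial one.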

Step S1050 may further include modulating the voice (e.g., the synthesized voice) of the second-language content generated in step S1055 before it is output as a voice signal in step S1060 (S1057).

When an adjustment value for a voice characteristic of the second-language content is input, a voice of the second-language content having the input adjustment value is generated (S1057). For example, if an adjustment value is input after the synthesized voice of the second-language content has been generated according to the second control command, a modulated synthesized voice of the second-language content is generated (S1057).

In one embodiment, step S1050 may further include providing an interface screen that prompts input of an adjustment value (S1056). For example, the interface screen M3 of FIGS. 3, 5, and 7 may be provided for inputting the adjustment value (S1056).

The process of generating the modulated voice (e.g., the modulated synthesized voice) was described above with reference to FIGS. 3 to 8, so a detailed description is omitted.
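
The modulation of step S1057 can be sketched as replacing selected voice characteristic values with the input adjustment values. This is a simplified illustration: `SynthVoice`, `modulate`, and the numeric characteristic names used in the usage note are hypothetical, and a real system would re-render audio rather than update a dictionary.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class SynthVoice:
    text: str
    characteristics: Dict[str, float] = field(default_factory=dict)


def modulate(voice: SynthVoice, adjustments: Dict[str, float]) -> SynthVoice:
    """Return a new voice whose characteristics take the input adjustment values."""
    unknown = set(adjustments) - set(voice.characteristics)
    if unknown:
        raise KeyError(f"unknown characteristics: {sorted(unknown)}")
    # Adjustment values override the generated values; the original is untouched.
    merged = {**voice.characteristics, **adjustments}
    return replace(voice, characteristics=merged)
```

For instance, `modulate(SynthVoice("hello", {"tone": 0.5, "vocal_range": 0.3}), {"tone": 0.8})` would yield a voice whose `tone` is the adjusted value while `vocal_range` keeps the generated value.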

FIG. 11 is a flowchart of a simultaneous interpretation method for the case where the second-language content is obtained in non-voice form, according to an embodiment of the present application.

Since the simultaneous interpretation method of FIG. 11 is similar to that of FIG. 10, the description focuses on the differences.

Referring to FIG. 10, the simultaneous interpretation method includes: receiving (e.g., by the second input unit 12) text of the first-language content (S1110); receiving a user request to be reflected in the second-language content and generating a corresponding control command (S1140); and generating a voice of the second-language content obtained by simultaneously interpreting the first-language content while reflecting the user request corresponding to the control command (S1150). In certain embodiments, the simultaneous interpretation method may also include outputting the voice of the second-language content generated in step S1150 (S1160).

In step S1110, information on the first-language content is obtained. For example, text data of the first-language content is obtained in step S1110.

Step S1140 includes: receiving a context request as the user request before step S1150, generating a first control command, and transmitting it to the translation unit 51 (S1141); and/or receiving a specific person's voice request or a reference voice request as the user request before step S1155, generating a second control command, and transmitting it to the synthesis unit 55 (S1145).

Step S1150 includes: translating the first-language content into the second-language content (S1151); and generating voice data of the second-language content (S1155).

When a context request is input (S1141), the first-language content is translated into the second-language content under the content context of that request (S1151).

In step S1155, the voice of the second-language content is generated.

When a specific person's voice request is input, a synthesized voice may be generated by synthesizing that person's voice with the second-language content (S1155).

When a reference voice request is input, a synthesized voice may be generated by synthesizing the corresponding sample machine voice with the second-language content (S1155).

When both a contextual request and a voice-related request are input, a synthesized voice of the second-language content translated under the content context may be generated (S1155).

In this way, when only the text of the first-language content is input, a simultaneous interpretation result reflecting a specific person's voice or a sample machine voice may be provided.

When only a contextual request is input, a voice of the second-language content translated under the content context may be generated (S1155). Here, the text of the second-language content is converted into the voice that is set for the case where no voice-related request is input. The set voice may be the basic voice output using the basic voice model 501. The second-language content translated under the content context is then output in the basic voice (S1160).
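
The text-only flow of FIG. 11 can be summarized in a small sketch. All names here (`translate`, `interpret_text`, `ALLOWED_VOICES`) are hypothetical, and `translate` merely tags the text with the context instead of performing a real translation. The sketch highlights two points from the description: the voice falls back to the basic voice when no voice-related request is given, and a speaker voice request is not among the options when only text is input.

```python
from typing import Optional, Tuple

# Voice requests available for text-only input (S1145), per the description.
ALLOWED_VOICES = {"specific_person", "reference"}


def translate(text: str, context: Optional[str]) -> str:
    # Placeholder for the translation unit 51; tags the text instead of translating.
    return f"[{context or 'default'}] {text}"


def interpret_text(
    text: str,
    context: Optional[str] = None,
    voice: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (translated text, voice label) for text-only input."""
    if voice is not None and voice not in ALLOWED_VOICES:
        raise ValueError("text-only input has no speaker voice to reflect")
    translated = translate(text, context)  # S1151, under the content context
    return translated, voice or "basic"    # S1155, basic voice by default
```

A call such as `interpret_text("hi", context="business")` returns the context-tagged text paired with the `"basic"` voice label.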

FIG. 12 is a flowchart of a simultaneous interpretation method for the case where the second-language content is obtained in non-voice form, according to another embodiment of the present application.

Since the simultaneous interpretation method of FIG. 12 is similar to a combination of at least part of the method of FIG. 10 and at least part of the method of FIG. 11, the description focuses on the differences.

The simultaneous interpretation method includes: receiving (e.g., by the second input unit 12) text of the first-language content (S1210); inputting (e.g., by the first input unit 11) a speaker's voice (S1211); extracting (e.g., by the feature extractor 30) the speaker's voice characteristics from the speaker's voice (S1230); receiving a user request to be reflected in the second-language content and generating a corresponding control command (S1240); and generating voice data in which the first-language content is simultaneously interpreted into the second-language content while reflecting the user request corresponding to the control command (S1250). In certain embodiments, the simultaneous interpretation method may also include outputting the voice of the second-language content generated in step S1250 (S1260).

In step S1210, information on the first-language content is obtained, and in step S1211, the speaker's voice is obtained. That is, the content to be interpreted and the speaker's voice are input through different input means.

The speaker's voice of step S1211 is different from a voice in which the speaker utters the entire first-language content. It may be a voice uttering part of the first-language content, or a voice uttering other expressions not included in the first-language content.

Step S1240 includes: receiving a context request as the user request before step S1250, generating a first control command, and transmitting it to the translation unit 51 (S1241); and/or receiving a speaker voice request, a specific person's voice request, or a reference voice request as the user request before step S1255, generating a second control command, and transmitting it to the synthesis unit 55 (S1245).

Since the step of generating the voice of the second-language content in FIG. 12 (S1250) is similar to the step of generating the voice of the second-language content in FIG. 10 (S950), a detailed description is omitted.

In this way, when the text of the first-language content and the speaker's voice are input, a simultaneous interpretation result reflecting the speaker's voice may be provided.
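
To illustrate how a separately recorded speaker voice could supply voice characteristics (S1230) independently of the text to be interpreted, the sketch below estimates a single time-domain feature. It is an assumption-laden example: the application does not state which algorithm the feature extractor 30 uses, the zero-crossing pitch estimate is a swapped-in simple technique, and `estimate_pitch_hz`, `extract_speaker_features`, and the `pitch_hz` key are hypothetical names.

```python
import math
from typing import Dict, Sequence


def estimate_pitch_hz(samples: Sequence[float], sample_rate: int) -> float:
    """Rough pitch estimate from upward zero crossings in the time domain."""
    if sample_rate <= 0 or len(samples) < 2:
        return 0.0
    # Count transitions from a negative sample to a non-negative one.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)


def extract_speaker_features(samples: Sequence[float], sample_rate: int) -> Dict[str, float]:
    # Hypothetical feature set; a real extractor would cover more characteristics.
    return {"pitch_hz": estimate_pitch_hz(samples, sample_rate)}


# A one-second synthetic 200 Hz tone stands in for the speaker voice of S1211.
RATE = 8000
tone = [math.sin(2 * math.pi * 200 * n / RATE) for n in range(RATE)]
features = extract_speaker_features(tone, RATE)
```

Because the features depend only on the recorded voice, the utterance does not need to match the first-language text, which is the point made for step S1211.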

The operations of the simultaneous interpretation apparatus 1 and method according to the embodiments described above may be implemented at least in part as a computer program and recorded on a computer-readable recording medium. For example, they may be implemented with a program product comprising a computer-readable medium containing program code, which may be executed by a processor to perform any or all of the steps, operations, or processes described.

The computer may be a computing device such as a desktop computer, laptop computer, notebook, smartphone, or the like, or any device into which such a computing device may be integrated. A computer is a device having one or more alternative and special-purpose processors, memory, storage, and networking components (either wireless or wired). The computer may run an operating system such as, for example, an operating system compatible with Microsoft Windows, Apple OS X or iOS, a Linux distribution, or Google's Android OS.

The computer-readable recording medium includes all types of recording devices in which data readable by a computer is stored. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed over computer systems connected through a network, so that computer-readable code is stored and executed in a distributed manner. In addition, functional programs, code, and code segments for implementing the present embodiment will be readily understood by those skilled in the art to which the present embodiment belongs.

The present invention has been described above with reference to the embodiments shown in the drawings, but these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and variations of the embodiments are possible. Such modifications should nevertheless be regarded as falling within the technical protection scope of the present invention. Therefore, the true technical protection scope of the present invention should be determined by the technical spirit of the appended claims.

Claims

A simultaneous interpretation method for generating a simultaneous interpretation result reflecting a user request, performed by a simultaneous interpretation apparatus including a translation unit and a synthesis unit, the method comprising:
receiving a voice of a speaker uttering content in a first language;
converting the received speaker's voice of the content in the first language into text in the first language;
extracting the speaker's voice characteristics from the speaker's voice;
receiving a user request to be reflected in the simultaneous interpretation result and generating a corresponding control command; and
generating audio of content in a second language obtained by simultaneously interpreting the content in the first language while reflecting the user request corresponding to the control command,
wherein generating the audio of the simultaneously interpreted content in the second language includes:
translating the content in the first language into the content in the second language; and
generating audio of the content in the second language.

The method according to claim 1, wherein receiving the user request and generating the corresponding control command comprises at least one of:
receiving a context request as the user request, and generating and transmitting a first control command to the translation unit; and
receiving a speaker voice request, a specific person's voice request, or a reference voice request as the user request, and generating and transmitting a second control command to the synthesis unit,
wherein the first control command causes the translation unit to translate the content in the first language into the content in the second language under a content context, and
the second control command causes the synthesis unit to synthesize the content in the second language with the speaker's voice, a specific person's voice, or a sample machine voice associated with a reference.

The method according to claim 2, wherein generating the audio of the content in the second language comprises:
when the speaker voice request is received, generating a 2-1 control command that causes the synthesis unit to synthesize the speaker's voice with the content in the second language translated by the translation unit, and transmitting the 2-1 control command to the synthesis unit, the 2-1 control command including speaker voice characteristic information; and
generating a speaker-synthesized voice in which the speaker's voice is synthesized with the content in the second language translated by the translation unit, based on the speaker voice characteristics of the 2-1 control command and the content in the second language.

The method according to claim 2, wherein generating the audio of the content in the second language comprises:
when the specific person's voice request is received, generating a 2-2 control command that causes the synthesis unit to synthesize the specific person's voice with the content in the second language translated by the translation unit, and transmitting the 2-2 control command to the synthesis unit, the 2-2 control command including pre-stored voice characteristic information of the specific person; and
generating a specific-person synthesized voice in which the pre-stored specific person's voice is synthesized with the content in the second language translated by the translation unit, based on the specific person's voice characteristics of the 2-2 control command and the content in the second language.

The method of claim 4,
wherein the synthesis unit generates the specific-person synthesized voice of the content in the second language using a specific person voice model including a basic voice model and a second tuning model,
the basic voice model is a model trained to utter text in the second language in a basic voice according to the pronunciation rules of the second language, and
the second tuning model is a model trained to generate the specific-person synthesized voice by reflecting the pre-stored voice characteristics of the specific person onto the basic voice.

The method of claim 5,
wherein generating the audio of the content in the second language further comprises providing an interface screen for selecting any one of one or more pieces of stored voice characteristic information of specific persons, and
wherein the synthesis unit is configured to use one or more second tuning models, one for each of one or more specific persons, and generates the specific-person synthesized voice by reflecting the voice characteristics of the selected specific person onto the basic voice of the content in the second language using the second tuning model for the selected specific person.

The method according to claim 2, wherein generating the audio of the content in the second language comprises:
when the reference voice request is input, generating a 2-3 control command that causes the synthesis unit to synthesize the sample machine voice associated with the reference of the reference voice request with the content in the second language translated by the translation unit, and transmitting the 2-3 control command to the synthesis unit, the 2-3 control command including pre-stored sample machine voice characteristic information; and
generating a reference synthesized voice in which the pre-stored sample machine voice is synthesized with the content in the second language translated by the translation unit, based on the content in the second language and the sample machine voice characteristics of the 2-3 control command.

The method of claim 7,
wherein the synthesis unit generates the reference synthesized voice of the content in the second language using a reference voice model including a basic voice model and a third tuning model,
the basic voice model is a model trained to utter text in the second language in a basic voice according to the pronunciation rules of the second language, and
the third tuning model is a model trained to generate the reference synthesized voice by reflecting the pre-stored sample machine voice characteristics onto the basic voice.

The method of claim 5,
wherein generating the audio of the content in the second language further comprises providing an interface screen for selecting any one of one or more references associated with one or more pieces of stored sample machine voice characteristic information, and
wherein the synthesis unit is configured to use one or more third tuning models, one for each of the sample machine voices associated with the one or more references, and generates the reference synthesized voice by reflecting the corresponding sample machine voice characteristics onto the basic voice of the content in the second language using the third tuning model for the sample machine voice associated with the selected reference.

The method of claim 1,
wherein the voice characteristics are extracted by analyzing an input voice signal in a time domain or a frequency domain, and include at least one of timbre, vocal range, tone, and gender.

The method of claim 2,
wherein the content context includes at least one of a business context, an academic context, and a daily-life context.

The method of claim 1,
wherein the sample machine voice characteristic information is stored as a reference template for each associated reference.

The method according to claim 1, wherein generating the audio of the content in the second language comprises:
modulating the generated voice of the content in the second language,
wherein the modulated voice of the content in the second language is generated by adjusting a voice characteristic value of the generated voice of the content in the second language to an input modulation value.

A simultaneous interpretation method for generating a simultaneous interpretation result reflecting a user request, performed by a simultaneous interpretation apparatus including a translation unit and a synthesis unit, the method comprising:
receiving text of content in a first language;
receiving a speaker's voice uttering an expression different from the content in the first language;
extracting the speaker's voice characteristics from the speaker's voice;
receiving a user request to be reflected in the simultaneous interpretation result and generating a corresponding control command; and
generating audio of content in a second language obtained by simultaneously interpreting the content in the first language while reflecting the user request corresponding to the control command,
wherein generating the audio of the simultaneously interpreted content in the second language includes:
translating the content in the first language into the content in the second language; and
generating audio of the content in the second language.

A computer-readable recording medium recording a program for performing the simultaneous interpretation method according to any one of claims 1 to 14.

a first input unit for obtaining information in the form of voice;
a second input unit for acquiring information in non-speech form;
a text generation unit that converts the audio signal into text when the audio signal is input to the second input unit;
a feature extractor extracting voice features from the voice signal;
a control module that, when a user request is input, generates a control command for generating a simultaneous interpretation result reflecting the user request; and
a simultaneous interpretation module that generates audio of content in a second language obtained by simultaneously interpreting content in a first language while reflecting the user request corresponding to the control command,
wherein the simultaneous interpretation module includes:
a translation unit that translates the content in the first language obtained by the first input unit or the second input unit into the content in the second language; and
a synthesis unit that generates audio of the content in the second language.