KR20140028336A

KR20140028336A - Voice conversion apparatus and method for converting voice thereof

Info

Publication number: KR20140028336A
Application number: KR1020120094360A
Authority: KR
Inventors: 김일환; 김은경
Original assignee: 삼성전자주식회사
Priority date: 2012-08-28
Filing date: 2012-08-28
Publication date: 2014-03-10

Abstract

Provided are a voice conversion apparatus and a voice conversion method thereof. The voice conversion method includes: selecting, by a user command, one of a plurality of voice types for outputting an auditory user interface (AUI); searching for output AUI text corresponding to the selected voice type when an event for outputting a specific AUI occurs; and converting the retrieved output AUI text into a timbre corresponding to the selected voice type by using a personalized text-to-speech (TTS) model corresponding to the selected voice type. Because distinct output AUI text is provided for each of the plurality of voice types, the present invention can provide a more intuitive and active AUI to the user. [Reference numerals] (110) User input unit; (120) Text search unit; (130) Voice converting unit; (140) Output unit; (151) Persona type storage unit; (152) Personalized TTS model; (200) Registering unit

Description

Voice conversion apparatus and method for converting voice thereof

The present invention relates to a voice conversion apparatus and a voice conversion method thereof, and more particularly, to a voice conversion apparatus that outputs a different AUI depending on the voice type selected by the user, and a voice conversion method thereof.

In general, AUI technology provides audible feedback on the various functions performed on an electronic device at the user's request, and on the tasks that consequently run inside the device, so that the user can clearly recognize the situation he or she selected and the execution state of the task.

In particular, a conventional AUI has been provided through a single sound, or through one voice selected by the user from among a plurality of voices. Even when the AUI is provided through one of a plurality of voices, the electronic device merely converts the timbre of the voice and does not change the output text itself according to the voice type.

As a result, when the output text and the timbre of the voice do not suit each other (for example, the output AUI text uses informal speech while the timbre of the AUI is polite), an awkward AUI is provided to the user.

The present invention has been devised to solve the above problem. An object of the present invention is to provide a voice conversion apparatus and a voice conversion method thereof that can provide a more intuitive and active AUI to the user by providing different output AUI text for each of a plurality of voice types.

To achieve the above object, a voice conversion method according to an embodiment of the present invention includes: selecting, by a user command, one of a plurality of voice types for outputting an auditory user interface (AUI); searching for output AUI text corresponding to the selected voice type when an event for outputting a specific AUI occurs; and converting the retrieved output AUI text into a timbre corresponding to the selected voice type by using a personalized text-to-speech (TTS) model corresponding to the selected voice type, and outputting the result.

The searching may include: determining a persona type corresponding to the selected voice type; and searching, among a plurality of output AUI texts for the specific AUI, for the output AUI text corresponding to the persona type.

The plurality of output AUI texts for the specific AUI may be stored in advance for each persona type.

The method may further include registering a personalized TTS model and a persona type corresponding to a voice type for outputting the AUI.

The registering may include: receiving, from the outside, video data including audio data and caption data; when the voice type that the user wishes to register is selected by a user input from among a plurality of voices included in the audio data, extracting from the audio data the voice data corresponding to the voice type to be registered; generating a personalized TTS model corresponding to the voice type to be registered by using the extracted voice data and the caption data; determining a persona type for the voice type to be registered by using the extracted voice data and the caption data; and matching and registering the personalized TTS model and the persona type of the voice type to be registered.

The extracting of the voice data corresponding to the voice type to be registered may include: dividing the audio data into speech sections and non-speech sections and extracting the speech sections; synchronizing the speech sections with the caption data; and extracting, from the voice data included in the speech sections, the voice data corresponding to the voice type to be registered by using a speaker diarization algorithm and the caption data.

The generating of the personalized TTS model corresponding to the voice type to be registered may include: extracting voice parameters of the voice data corresponding to the voice type to be registered; converting the caption data into phone transcripts; generating a transformation matrix by using the voice parameters, the phone transcripts, and a speaker adaptation algorithm; and generating the personalized TTS model by using the generated transformation matrix and an average model.

The determining of the persona type for the voice type to be registered may determine the persona type by analyzing at least one of: the sentence-final endings of the caption data synchronized with the voice data corresponding to the voice type to be registered; and the frequency and tone variation of that voice data.

Meanwhile, to achieve the above object, a voice conversion apparatus according to an embodiment of the present invention includes: a user input unit that receives a user selection of one of a plurality of voice types for outputting an auditory user interface (AUI); a text search unit that searches for output AUI text corresponding to the selected voice type when an event for outputting a specific AUI occurs; a timbre conversion unit that converts the retrieved output AUI text into a timbre corresponding to the selected voice type by using a personalized text-to-speech (TTS) model corresponding to the selected voice type; and an output unit that outputs the specific AUI in the converted timbre.

The text search unit may determine a persona type corresponding to the selected voice type and search, among a plurality of output AUI texts for the specific AUI, for the output AUI text corresponding to the persona type.

The apparatus may further include a storage unit that stores, in advance and for each persona type, the plurality of output AUI texts for the specific AUI.

The apparatus may further include a registration unit that registers a personalized TTS model and a persona type corresponding to a voice type for outputting the AUI.

The registration unit may include: a video receiving unit that receives, from the outside, video data including audio data and caption data; a voice data extraction unit that, when the voice type that the user wishes to register is selected through the user input unit from among a plurality of voices included in the audio data, extracts from the audio data the voice data corresponding to the voice type to be registered; a TTS model generation unit that generates a personalized TTS model corresponding to the voice type to be registered by using the extracted voice data and the caption data; and a persona type determination unit that determines a persona type for the voice type to be registered by using the extracted voice data and the caption data. The apparatus may further include a storage unit that matches and stores the personalized TTS model and the persona type of the voice type to be registered.

The voice data extraction unit may include: a speech section extraction unit that divides the audio data into speech sections and non-speech sections and extracts the speech sections; a synchronization unit that synchronizes the speech sections with the caption data; and a target voice extraction unit that extracts, from the voice data included in the speech sections, the voice data corresponding to the voice type to be registered by using a speaker diarization algorithm and the caption data.

The TTS model generation unit may include: a voice parameter extraction unit that extracts voice parameters of the voice data corresponding to the voice type to be registered; a caption data conversion unit that converts the caption data into phone transcripts; a transformation matrix generation unit that generates a transformation matrix by using the voice parameters, the phone transcripts, and a speaker adaptation algorithm; and a personalized TTS model generation unit that generates the personalized TTS model by using the generated transformation matrix and an average model.

The persona type determination unit may determine the persona type for the voice type to be registered by analyzing at least one of: the sentence-final endings of the caption data synchronized with the voice data corresponding to the voice type to be registered; and the frequency and tone variation of that voice data.

According to the various embodiments of the present invention described above, different output AUI text is provided for each of a plurality of voice types, so that a more intuitive and active AUI can be provided to the user. In addition, the user can register a voice of his or her choice as the output voice of the AUI.

FIG. 1 is a block diagram showing the configuration of a voice conversion apparatus according to an embodiment of the present invention;
FIG. 2 is a block diagram showing the configuration of a registration unit according to an embodiment of the present invention;
FIG. 3 is a block diagram showing the configuration of a voice data extraction unit according to an embodiment of the present invention;
FIG. 4 is a diagram showing speech sections and non-speech sections extracted by a speech section extraction unit according to an embodiment of the present invention;
FIG. 5 is a diagram showing per-speaker speech sections extracted by a speaker diarization algorithm according to an embodiment of the present invention;
FIG. 6 is a block diagram showing the configuration of a TTS model generation unit according to an embodiment of the present invention;
FIG. 7 is a diagram showing output AUI text stored in a storage unit for each persona type according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating a voice conversion method according to an embodiment of the present invention; and
FIG. 9 is a flowchart illustrating a method of registering a voice type for outputting an AUI according to an embodiment of the present invention.

Hereinafter, the present invention will be described in more detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a voice conversion apparatus 100 according to an embodiment of the present invention. As shown in FIG. 1, the voice conversion apparatus 100 includes a user input unit 110, a text search unit 120, a timbre conversion unit 130, an output unit 140, a storage unit 150, and a registration unit 200. The storage unit 150 includes a persona type storage unit 151 and a personalized text-to-speech (TTS) model 152.

The voice conversion apparatus 100 according to an embodiment of the present invention may be implemented as a smart TV. This is only one embodiment, however, and the apparatus may be implemented as various other electronic devices such as a navigation device, a smartphone, a tablet PC, a notebook PC, a desktop PC, or a PDA.

The user input unit 110 receives user commands for controlling the voice conversion apparatus 100. In particular, the user input unit 110 may receive a user command that selects one of a plurality of voice types for AUI output. For example, the user may select announcer "A" or comedian "B" as the voice type for AUI output.

When an event for outputting a specific AUI occurs, the text search unit 120 searches for the output AUI text corresponding to the selected voice type. Specifically, the text search unit 120 may determine the persona type corresponding to the selected voice type and search, among a plurality of output AUI texts for the specific AUI, for the output AUI text corresponding to that persona type. Here, the persona type refers to the personality of the speaker of the voice type selected by the user; examples include friendliness, stiffness, confidence, and politeness. For example, when the user selects announcer "A" as the voice type for AUI output through the user input unit 110, the text search unit 120 may search for the output AUI text corresponding to "politeness", the persona type of announcer "A". When the user selects comedian "B" as the voice type for AUI output through the user input unit 110, the text search unit 120 may search for the output AUI text corresponding to "confidence", the persona type of comedian "B".

As shown in FIG. 7, the plurality of output AUI texts for a specific AUI may be stored in advance for each persona type.
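The persona-keyed lookup described above can be sketched as a nested table. This is only an illustration: the event names, persona labels, and example sentences below are hypothetical and are not taken from FIG. 7, and the fallback to a default persona is an assumption.

```python
# Hypothetical table: AUI event -> persona type -> output AUI text.
# The entries are illustrative only; they are not the contents of FIG. 7.
AUI_TEXTS = {
    "volume_up": {
        "polite": "The volume has been raised.",
        "confident": "Volume up. Done.",
        "friendly": "I turned it up for you!",
    },
    "power_off": {
        "polite": "The device will now turn off.",
        "confident": "Powering off.",
    },
}


def find_aui_text(event, persona, default_persona="polite"):
    """Return the output AUI text stored for this event and persona type."""
    variants = AUI_TEXTS[event]
    # Assumption: fall back to a default persona when no text is stored
    # for the requested persona type.
    return variants.get(persona, variants[default_persona])
```

Keeping the text variants per event, rather than per voice, means a newly registered voice only needs a persona label to reuse the existing texts.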

The timbre conversion unit 130 converts the retrieved output AUI text into a timbre corresponding to the selected voice type by using the personalized TTS model corresponding to the selected voice type. The personalized TTS model is a model for converting the output AUI into the timbre of the voice type selected by the user, and may be registered in advance by the user. For example, when the user selects announcer "A" as the voice type for AUI output through the user input unit 110, the timbre conversion unit 130 may convert the output AUI text corresponding to "politeness", the persona type of announcer "A", into the timbre of announcer "A" by using the personalized TTS model of announcer "A". When the user selects comedian "B" as the voice type for AUI output through the user input unit 110, the timbre conversion unit 130 may convert the output AUI text corresponding to "confidence", the persona type of comedian "B", into the timbre of comedian "B" by using the personalized TTS model of comedian "B".

The output unit 140 outputs the AUI converted by the timbre conversion unit 130. The output unit 140 may be implemented as a speaker, but this is only one embodiment; it may also be implemented as a headphone output terminal or an S/PDIF output terminal.
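The flow through the user input unit, text search unit, timbre conversion unit, and output unit can be sketched end to end as follows. This is a minimal sketch under stated assumptions: `VoiceType`, `VoiceConverter`, and `fake_tts` are hypothetical names, and the synthesizer is a stand-in callable rather than a real personalized TTS model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class VoiceType:
    persona: str                        # persona type registered for this voice
    synthesize: Callable[[str], bytes]  # stand-in for the personalized TTS model


def fake_tts(tag: str) -> Callable[[str], bytes]:
    """Hypothetical placeholder synthesizer: tags the text instead of producing audio."""
    return lambda text: f"[{tag}] {text}".encode("utf-8")


class VoiceConverter:
    def __init__(self, voices: Dict[str, VoiceType],
                 texts: Dict[str, Dict[str, str]]):
        self.voices = voices          # voice type name -> persona + TTS model
        self.texts = texts            # AUI event -> persona type -> output text
        self.selected: Optional[str] = None

    def select_voice(self, name: str) -> None:
        """User input step: choose one of the registered voice types."""
        if name not in self.voices:
            raise KeyError(f"unregistered voice type: {name}")
        self.selected = name

    def on_event(self, event: str) -> bytes:
        """Search the persona-specific text, then render it with the selected voice."""
        if self.selected is None:
            raise RuntimeError("no voice type selected")
        voice = self.voices[self.selected]
        text = self.texts[event][voice.persona]   # text search step
        return voice.synthesize(text)             # timbre conversion step
```

The returned bytes would be handed to whatever audio sink plays the role of the output unit; that part is left out because it depends on the target device.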

The storage unit 150 stores various data and programs for controlling the voice conversion apparatus 100. In particular, the storage unit 150 may store, for each voice type, the corresponding persona type (in the persona type storage unit 151) matched with the corresponding personalized TTS model 152.

In addition, as shown in FIG. 7, the storage unit 150 may store the output AUI text for each persona type according to the AUI output event.

In response to a user command, the registration unit 200 registers the persona type and the personalized TTS model corresponding to the voice type that the user wishes to register. The registration unit 200 will be described in detail with reference to FIG. 2.

FIG. 2 is a block diagram showing the configuration of the registration unit 200 according to an embodiment of the present invention. As shown in FIG. 2, the registration unit 200 includes a video receiving unit 210, a voice data extraction unit 220, a caption data extraction unit 230, a TTS model generation unit 240, and a persona type determination unit 250.

The video receiving unit 210 receives video data including audio data and caption data. The video receiving unit 210 may receive broadcast data from an external broadcasting station, may receive video data from an external device (for example, a DVD player or a set-top box), or may receive video data stored internally.

Through the user input unit 110, the user may select the voice type that he or she wishes to register from among the plurality of voices included in the audio data of the video data. For example, the user may select the voice type of movie actor "C" from among the plurality of voices included in the audio data of the video data.

The voice data extraction unit 220 may extract, from the audio data included in the video data, the voice data corresponding to the voice type that the user wishes to register. The way in which the voice data extraction unit 220 extracts this voice data will be described in detail with reference to FIG. 3.

As shown in FIG. 3, the voice data extraction unit 220 includes a speech section extraction unit 221, a synchronization unit 222, and a target voice extraction unit 223.

The speech section extraction unit 221 analyzes the input audio data to distinguish speech sections from non-speech sections. Specifically, the speech section extraction unit 221 converts the audio data into frames in the frequency domain, sets up a speech model and a plurality of non-speech models in the frequency domain, and initializes or updates these models. It then uses the initialized or updated speech model and non-speech models to calculate a speech absence probability for each non-speech model, and thereby identifies the speech sections of the input audio data.

For example, as shown in FIG. 4, the input audio data may include not only speech sections containing human voices, but also non-speech sections containing audio other than human voices (for example, music or background sound effects) and mixed sections in which human voices and other audio are output simultaneously. The speech section extraction unit 221 may divide the audio data into speech sections, non-speech sections, and mixed sections by the method described above.

The method of distinguishing speech sections from non-speech sections described above is only one embodiment; embodiments that divide the audio data into speech and non-speech sections by other methods also fall within the technical idea of the present invention.
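As an illustration of one such alternative, the sketch below segments audio by frame energy. It is deliberately much simpler than the model-based method described above and is not that method: it cannot tell speech from music, and the frame length and threshold are arbitrary assumptions.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def active_segments(samples, frame_len=160, threshold=0.01):
    """Return (start_frame, end_frame) pairs, end exclusive, of high-energy runs.

    Simplifying assumption: any frame above the energy threshold counts as
    active audio. A real system would need a classifier to separate speech
    from music or sound effects.
    """
    energies = frame_energies(samples, frame_len)
    segments = []
    start = None
    for idx, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = idx                      # a run of active frames begins
        elif energy <= threshold and start is not None:
            segments.append((start, idx))    # the run ended at the previous frame
            start = None
    if start is not None:                    # audio ended while still active
        segments.append((start, len(energies)))
    return segments
```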

The synchronization unit 222 may synchronize the speech sections extracted by the speech section extraction unit 221 with the caption data. Specifically, the synchronization unit 222 may use a phone recognizer to generate a phone transcription of the speech sections extracted from the audio data. Because the synchronization unit 222 is used only to synchronize speech with captions, it can use a simple phone recognizer rather than a high-performance automatic speech recognition (ASR) system. The synchronization unit 222 may then synchronize the speech sections with the caption data by using a correlation value between the phone transcription and the caption data. Because the input caption data lags slightly behind the audio, the synchronization unit 222 may calculate the correlation value while shifting the caption corresponding to a speech section by up to several seconds.
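The delay search can be sketched as trying candidate caption delays and keeping the one with the highest total similarity. This is only a sketch under assumptions: both inputs are taken to be already converted to phone strings, `difflib.SequenceMatcher` stands in for whatever correlation measure is actually used, and the function name, search range, step, and tolerance are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1], used here as a stand-in correlation value."""
    return SequenceMatcher(None, a, b).ratio()


def estimate_caption_delay(segments, captions, max_delay=5.0, step=0.5,
                           tolerance=0.25):
    """Estimate how many seconds the captions lag behind the speech.

    segments: list of (start_sec, phone_string) from the phone recognizer.
    captions: list of (display_sec, phone_string) derived from the caption data.
    """
    best_delay, best_score = 0.0, -1.0
    for k in range(int(max_delay / step) + 1):
        delay = k * step
        score = 0.0
        for cap_time, cap_phones in captions:
            target = cap_time - delay        # where the speech should start
            for seg_time, seg_phones in segments:
                if abs(seg_time - target) <= tolerance:
                    score += similarity(seg_phones, cap_phones)
        if score > best_score:               # keep the first best-scoring delay
            best_delay, best_score = delay, score
    return best_delay
```

Once the delay is known, each caption can be paired with the speech section that starts near its display time minus that delay.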

The target voice extraction unit 223 uses a speaker diarization algorithm and the caption data to extract, from the voice data included in the speech sections, the voice data corresponding to the voice type to be registered.

Specifically, the target voice extraction unit 223 performs speaker change detection and speaker clustering by using the speaker diarization algorithm. For example, as shown in FIG. 5, the target voice extraction unit 223 may use the speaker diarization algorithm to distinguish a plurality of speakers in the speech sections of the audio data. The target voice extraction unit 223 may also use the caption data synchronized with the speech sections to extract information such as the end of an utterance and a change of speaker. For example, the target voice extraction unit 223 may extract speaker change information through text analysis (for example, when "안녕하세요" ("Hello") appears in the caption data). The target voice extraction unit 223 may also extract speaker change information by using a speaker separator (for example, "/") included in the caption data.

As described above, the target voice extraction unit 223 uses the speaker diarization algorithm and an analysis of the caption data to divide the speech sections of the audio data among a plurality of speakers, and may extract, from the voices of the plurality of speakers included in the speech sections, the target voice corresponding to the voice type selected by the user.
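The caption-side cues mentioned above (a speaker separator such as "/" and a greeting in the text) can be sketched as follows. The cue list and function names are hypothetical, English captions are used purely for illustration, and the diarization of the audio itself is not shown.

```python
# Hypothetical list of greeting cues that hint at a new speaker.
GREETING_CUES = ("hello",)


def split_turns(caption, separator="/"):
    """Split one caption line into per-speaker utterances at the separator."""
    return [part.strip() for part in caption.split(separator) if part.strip()]


def speaker_change_cues(captions):
    """Return the indices of caption lines that suggest a change of speaker.

    A line is flagged when it contains the speaker separator or one of the
    greeting cues. These hints would be combined with the diarization output
    rather than used alone.
    """
    cues = []
    for index, line in enumerate(captions):
        has_separator = len(split_turns(line)) > 1
        has_greeting = any(cue in line.lower() for cue in GREETING_CUES)
        if has_separator or has_greeting:
            cues.append(index)
    return cues
```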

Returning to FIG. 2, the caption data extraction unit 230 extracts the caption data from the video data input to the video receiving unit 210. As described above, the extracted caption data may be used for synchronization with the voice data corresponding to the voice type that the user wishes to register.

The TTS model generation unit 240 generates a personalized TTS model corresponding to the voice type to be registered by using the voice data extracted by the voice data extraction unit 220 and the caption data extracted by the caption data extraction unit 230. The TTS model generation unit 240 will be described in detail with reference to FIG. 6.

As shown in FIG. 6, the TTS model generation unit 240 includes a voice parameter extraction unit 241, a caption data conversion unit 242, a transformation matrix generation unit 243, and a personalized TTS model generation unit 246.

The voice parameter extraction unit 241 extracts the voice parameters of the voice data corresponding to the voice type to be registered. The voice parameters may include at least one of the spectrum and the pitch of the voice data.

The caption data conversion unit 242 may convert the caption data into phone transcripts.

The transformation matrix generator 243 generates a transformation matrix using the input voice parameters, the phone transcripts, and a speaker adaptation algorithm 244. The speaker adaptation algorithm 244 used by the transformation matrix generator 243 may be one of Maximum Likelihood Linear Regression (MLLR) and Maximum A Posteriori Linear Regression (MAPLR).

The personalized TTS model generator 246 then generates the personalized TTS model using the transformation matrix generated by the transformation matrix generator 243 and an average model 245. The average model 245 is a speaker-independent acoustic model (Speaker Independent Sound Model) that is not biased toward any particular speaker.
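
The way a transformation matrix and an average model combine can be illustrated with the mean update commonly used in MLLR-style adaptation, in which each mean vector mu of the speaker-independent model is mapped to A*mu + b. The following is a minimal sketch under that assumption, not the implementation of this disclosure: the function name `adapt_means`, the list-based model representation, and the example values are hypothetical, and the estimation of A and b from the voice parameters and phone transcripts is not shown.

```python
def adapt_means(average_means, A, b):
    """Apply an affine transform (A, b) to every mean vector of an average model.

    average_means: list of mean vectors from a speaker-independent model (assumed layout).
    A: square matrix as a list of rows; b: bias vector.
    Returns the adapted mean vectors A*mu + b.
    """
    adapted = []
    for mu in average_means:
        adapted.append([
            sum(a_ij * m_j for a_ij, m_j in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)
        ])
    return adapted


# Illustrative values only; a real transform would be estimated from speech data.
A = [[1.0, 0.0], [0.0, 2.0]]
b = [0.5, -1.0]
personalized = adapt_means([[1.0, 1.0], [2.0, 3.0]], A, b)
```

Because a single transform is shared by many mean vectors, this style of adaptation can work with a comparatively small amount of target-speaker data, which fits the scenario of extracting a voice from broadcast content.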

Referring again to FIG. 2, the persona type determination unit 250 determines the persona type for the voice type that the user wants to register, using the extracted voice data and caption data. As described above, the persona type refers to the personality of the speaker of the voice type to be registered, examples of which include friendly, stiff, confident, and polite.

Specifically, the persona type determination unit 250 may determine the persona type for the voice type to be registered by analyzing at least one of the sentence endings of the caption data synchronized with the voice data corresponding to that voice type, and the frequency and tone change of that voice data.

For example, when the sentence endings of the caption data synchronized with the voice data include "~예요", "~께요", or "~해요", the persona type determination unit 250 may determine the persona type to be "affectionate". As another example, when the sentence endings include "~입니다", "~합니다", or "~습니다", the persona type determination unit 250 may determine the persona type to be "polite". As yet another example, when the sentence endings include "~임", "~이야", or "~해", the persona type determination unit 250 may determine the persona type to be "stiff".
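
The sentence-ending rule above can be sketched as a simple suffix lookup. This is an illustrative sketch only: the table `ENDING_TO_PERSONA` is built from the examples in the description, while the function name, the English persona labels, and the majority vote over several sentences are assumptions that the description does not specify.

```python
# Ending-to-persona table taken from the examples in the description.
ENDING_TO_PERSONA = [
    (("예요", "께요", "해요"), "affectionate"),
    (("입니다", "합니다", "습니다"), "polite"),
    (("임", "이야", "해"), "stiff"),
]


def persona_from_captions(sentences):
    """Return the persona whose endings occur most often, or None if none match.

    Majority voting is an assumption made for this sketch.
    """
    counts = {}
    for sentence in sentences:
        text = sentence.strip().rstrip(".!?~ ")  # drop trailing punctuation
        for endings, persona in ENDING_TO_PERSONA:
            if text.endswith(endings):  # str.endswith accepts a tuple of suffixes
                counts[persona] = counts.get(persona, 0) + 1
                break
    if not counts:
        return None
    return max(counts, key=counts.get)


persona = persona_from_captions(["만나서 반갑습니다.", "오늘은 맑습니다."])
```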

Alternatively, the persona type determination unit 250 may determine the persona type by analyzing the frequency and tone change of the voice data corresponding to the voice type to be registered. For example, the persona type determination unit 250 may determine the persona type to be "excited" when the tone of the voice data changes abruptly, and "calm" when it does not.
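
One way to make "abrupt tone change" concrete is to measure the average frame-to-frame pitch jump. The sketch below assumes a pitch contour in Hz (with 0 marking unvoiced frames) has already been extracted; the function name, the semitone measure, and the 2-semitone threshold are hypothetical choices for illustration and do not come from the description.

```python
import math


def persona_from_pitch(pitch_hz, threshold_semitones=2.0):
    """Classify a pitch contour as "excited" or "calm".

    pitch_hz: per-frame fundamental frequency in Hz; values <= 0 are treated
    as unvoiced and skipped. The threshold is an illustrative assumption.
    """
    voiced = [f for f in pitch_hz if f > 0]
    if len(voiced) < 2:
        return None  # not enough voiced frames to measure change
    # Absolute pitch step between consecutive voiced frames, in semitones.
    steps = [abs(12 * math.log2(b / a)) for a, b in zip(voiced, voiced[1:])]
    mean_step = sum(steps) / len(steps)
    return "excited" if mean_step > threshold_semitones else "calm"


label = persona_from_pitch([200.0, 201.0, 0.0, 199.0, 200.0])
```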

As described above, the registration unit 200 may match and register the personalized TTS model and the persona type of the voice type that the user wants to select, and may store the registration information in the storage unit 150.

With the voice conversion apparatus 100 described above, different output AUI text is provided depending on the voice type, so that a more intuitive and active AUI can be provided to the user. In addition, the user can register a voice of his or her choice as the output voice of the AUI.

Hereinafter, a voice conversion method and a voice type registration method will be described with reference to FIGS. 8 and 9.

FIG. 8 is a flowchart illustrating a voice conversion method according to an embodiment of the present invention.

First, the voice conversion apparatus 100 selects, by a user command, one of a plurality of voice types for AUI output (S810). For example, the voice conversion apparatus 100 may select singer "D" as the voice type for AUI output.

The voice conversion apparatus 100 then determines whether an event for outputting a specific AUI has occurred (S820). For example, the voice conversion apparatus 100 may determine whether an AUI output event for announcing the weather has occurred.

If it is determined that an event for outputting a specific AUI has occurred (S820), the voice conversion apparatus 100 searches for output AUI text corresponding to the selected voice type (S830). Here, the voice conversion apparatus 100 may determine the persona type corresponding to the selected voice type and search, among a plurality of output AUI texts for the specific AUI, for the output AUI text corresponding to that persona type. For example, the voice conversion apparatus 100 may look up "friendly" as the persona type corresponding to the selected voice type, singer "D", and retrieve "오늘 날씨는 맑음이야" ("The weather is sunny today"), the output AUI text corresponding to "friendly", as the output AUI text for the specific AUI.
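
The two-stage lookup of step S830 (voice type to persona type, then persona type to output AUI text) can be sketched with two dictionaries. The table names, keys, and the "polite" text are hypothetical; only the "friendly" weather sentence is taken from the description.

```python
# Hypothetical tables standing in for the stored registration and text data.
PERSONA_BY_VOICE = {"singer_D": "friendly", "announcer_X": "polite"}
TEXTS_BY_EVENT = {
    "weather_sunny": {
        "friendly": "오늘 날씨는 맑음이야",   # example text from the description
        "polite": "오늘 날씨는 맑습니다",     # illustrative entry, not from the source
    },
}


def search_output_text(voice_type, event):
    """Return the output AUI text for an event, chosen by the voice's persona type."""
    persona = PERSONA_BY_VOICE[voice_type]   # persona type of the selected voice
    return TEXTS_BY_EVENT[event][persona]    # text pre-stored per persona type


text = search_output_text("singer_D", "weather_sunny")
```

In this sketch an unknown voice type or event raises `KeyError`; how a real device would fall back in that case is not addressed by the description.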

The voice conversion apparatus 100 then uses the personalized TTS model to convert the retrieved output AUI text into a tone corresponding to the selected voice, and outputs it (S840). For example, the voice conversion apparatus 100 may use the personalized TTS model corresponding to the voice type of the selected singer "D" to convert the output AUI text "오늘 날씨는 맑음이야" into the voice of singer "D" and output it.

With the voice conversion method described above, different output AUI text is provided to the user depending on the voice type, so that a more intuitive and entertaining AUI can be provided to the user.

FIG. 9 is a flowchart illustrating a method of registering a voice type for outputting an AUI according to an embodiment of the present invention.

First, the voice conversion apparatus 100 receives broadcast data (S910). The received broadcast data may include audio data and caption data. For example, the voice conversion apparatus 100 may receive movie data that includes audio data and caption data.

The voice conversion apparatus 100 then determines whether a command selecting the voice type that the user wants to register has been input through the user input unit 110 (S920). For example, the voice conversion apparatus 100 may receive a user command selecting the voice of movie actor "E", from the audio data included in the movie data, as the voice type to be registered.

When a voice type selection command is input to the voice conversion apparatus 100 (S920-Y), the voice conversion apparatus 100 extracts the voice data corresponding to the voice type that the user wants to register (S930). For example, the voice conversion apparatus 100 may use the voice data extractor 220 shown in FIG. 3 to extract the voice data of movie actor "E" from the audio data included in the movie data.

The voice conversion apparatus 100 then generates a personalized TTS model corresponding to the voice that the user wants to register (S940). For example, the voice conversion apparatus 100 may use the TTS model generator 240 shown in FIG. 6 to generate a personalized TTS model corresponding to the tone of movie actor "E".

The voice conversion apparatus 100 then determines the persona type corresponding to the voice that the user wants to register (S950). For example, the voice conversion apparatus 100 may analyze the voice data and caption data of movie actor "E" and determine the persona type of movie actor "E" to be "rude".

The voice conversion apparatus 100 then matches the personalized TTS model of the voice type to be registered with the persona type and stores them (S960). For example, the voice conversion apparatus 100 may match the personalized TTS model corresponding to the voice type of movie actor "E" with "rude", the persona type of movie actor "E", and store them.
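
The matching and storing of step S960 amounts to keeping one record per voice type that ties the personalized TTS model to the persona type. The following sketch assumes an in-memory store; the class names, field names, and the placeholder model object are hypothetical and only illustrate the pairing.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class VoiceRegistration:
    """One registered voice type: its personalized TTS model and persona type."""
    voice_type: str
    tts_model: Any        # placeholder for whatever the model generator produces
    persona_type: str


class RegistrationStore:
    """Minimal in-memory stand-in for a storage unit (an assumption of this sketch)."""

    def __init__(self) -> None:
        self._entries: Dict[str, VoiceRegistration] = {}

    def register(self, voice_type: str, tts_model: Any, persona_type: str) -> None:
        # Store the model and persona together so later lookups return both.
        self._entries[voice_type] = VoiceRegistration(voice_type, tts_model, persona_type)

    def lookup(self, voice_type: str) -> VoiceRegistration:
        return self._entries[voice_type]


store = RegistrationStore()
store.register("actor_E", {"means": []}, "rude")
entry = store.lookup("actor_E")
```

Keeping both items in a single record means that selecting a voice type later yields the persona type for text search and the model for synthesis in one lookup.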

With the voice type registration method described above, the user can register a voice of his or her choice as the output voice of the AUI.

Program code for performing the voice conversion method according to the various embodiments described above may be stored in a non-transitory computer readable medium. A non-transitory readable medium is not a medium that stores data for a short moment, such as a register, cache, or memory, but a medium that stores data semi-permanently and can be read by a device. Specifically, the various applications or programs described above may be stored and provided on a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB, memory card, or ROM.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Various modifications may be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or prospect of the present invention.

110: user input unit 120: text search unit
130: tone conversion unit 140: output unit
150: storage unit 160: registration unit

Claims

A voice conversion method comprising:
selecting, by a user command, one of a plurality of voice types for outputting an auditory user interface (AUI);
searching for output AUI text corresponding to the selected voice type when an event for outputting a specific AUI occurs; and
converting the retrieved output AUI text into a tone corresponding to the selected voice type by using a personalized text-to-speech (TTS) model corresponding to the selected voice type, and outputting the converted tone.

The method of claim 1,
wherein the searching comprises:
determining a persona type corresponding to the selected voice type; and
searching for output AUI text corresponding to the persona type among a plurality of output AUI texts for the specific AUI.

The method of claim 2,
wherein the plurality of output AUI texts for the specific AUI are stored in advance for each persona type.

The method of claim 2,
further comprising registering a personalized TTS model and a persona type corresponding to a voice type for outputting the AUI.

The method of claim 4,
wherein the registering comprises:
receiving image data including audio data and caption data from the outside;
extracting, when a voice type that the user wants to register is selected by a user input from among a plurality of voices included in the audio data, voice data corresponding to the voice type to be registered from the audio data;
generating a personalized TTS model corresponding to the voice type to be registered using the extracted voice data and the caption data;
determining a persona type for the voice type to be registered using the extracted voice data and the caption data; and
matching and registering the personalized TTS model of the voice type to be registered with the persona type.

The method of claim 5,
wherein the extracting of the voice data corresponding to the voice type to be registered comprises:
extracting a voice section by distinguishing the voice section from a non-voice section included in the audio data;
synchronizing the voice section with the caption data; and
extracting the voice data corresponding to the voice type to be registered from among the voice data included in the voice section by using a speaker diarization algorithm and the caption data.

The method of claim 6,
wherein the generating of the personalized TTS model corresponding to the voice type to be registered comprises:
extracting voice parameters of the voice data corresponding to the voice type to be registered;
converting the caption data into phone transcripts;
generating a transformation matrix using the voice parameters, the phone transcripts, and a speaker adaptation algorithm; and
generating the personalized TTS model using the generated transformation matrix and an average model.

The method of claim 6,
wherein the determining of the persona type for the voice type to be registered comprises
analyzing at least one of a sentence ending of the caption data synchronized with the voice data corresponding to the voice type to be registered, and a frequency and a tone change of the voice data corresponding to the voice type to be registered, to determine the persona type for the voice type to be registered.

A voice conversion apparatus comprising:
a user input unit configured to receive a user command selecting one of a plurality of voice types for outputting an auditory user interface (AUI);
a text search unit configured to search for output AUI text corresponding to the selected voice type when an event for outputting a specific AUI occurs;
a tone conversion unit configured to convert the retrieved output AUI text into a tone corresponding to the selected voice type by using a personalized text-to-speech (TTS) model corresponding to the selected voice type; and
an output unit configured to output the specific AUI in the converted tone.

The voice conversion apparatus of claim 9,
wherein the text search unit
determines a persona type corresponding to the selected voice type, and searches for output AUI text corresponding to the persona type among a plurality of output AUI texts for the specific AUI.

The voice conversion apparatus of claim 10,
further comprising a storage unit configured to store in advance the plurality of output AUI texts for the specific AUI for each persona type.

The voice conversion apparatus of claim 10,
further comprising a registration unit configured to register a personalized TTS model and a persona type corresponding to a voice type for outputting the AUI.

The voice conversion apparatus of claim 12,
wherein the registration unit comprises:
an image receiver configured to receive image data including audio data and caption data from the outside;
a voice data extractor configured to extract, when a voice type that the user wants to register is selected through the user input unit from among a plurality of voices included in the audio data, voice data corresponding to the voice type to be registered from the audio data;
a TTS model generator configured to generate a personalized TTS model corresponding to the voice type to be registered using the extracted voice data and the caption data; and
a persona type determination unit configured to determine a persona type for the voice type to be registered using the extracted voice data and the caption data,
and wherein the voice conversion apparatus
further comprises a storage unit configured to match and store the personalized TTS model of the voice type to be registered and the persona type.

The voice conversion apparatus of claim 13,
wherein the voice data extractor comprises:
a voice section extractor configured to extract a voice section by distinguishing the voice section from a non-voice section included in the audio data;
a synchronization unit configured to synchronize the voice section with the caption data; and
a target voice extraction unit configured to extract the voice data corresponding to the voice type to be registered from among the voice data included in the voice section by using a speaker diarization algorithm and the caption data.

The voice conversion apparatus of claim 14,
wherein the TTS model generator comprises:
a voice parameter extractor configured to extract voice parameters of the voice data corresponding to the voice type to be registered;
a caption data converter configured to convert the caption data into phone transcripts;
a transformation matrix generator configured to generate a transformation matrix using the voice parameters, the phone transcripts, and a speaker adaptation algorithm; and
a personalized TTS model generator configured to generate the personalized TTS model using the generated transformation matrix and an average model.

The voice conversion apparatus of claim 14,
wherein the persona type determination unit
analyzes at least one of a sentence ending of the caption data synchronized with the voice data corresponding to the voice type to be registered, and a frequency and a tone change of the voice data corresponding to the voice type to be registered, to determine the persona type for the voice type to be registered.