KR20220170330A - Electronic device and method for controlling thereof

Info

Publication number: KR20220170330A
Application number: KR1020210194532A
Authority: KR
Inventors: 박상준; 주기현
Original assignee: 삼성전자주식회사
Priority date: 2021-06-22
Filing date: 2021-12-31
Publication date: 2022-12-29

Abstract

Disclosed are an electronic device and a method for controlling the same. A method for controlling an electronic device, according to the present disclosure, comprises the steps of: obtaining text; inputting the text into a first neural network model so as to obtain acoustic feature information corresponding to the text and alignment information formed by matching each frame of the acoustic feature information with each phoneme included in the text; identifying an utterance speed of the acoustic feature information on the basis of the obtained alignment information; identifying a reference utterance speed for each phoneme included in the acoustic feature information on the basis of the text and the acoustic feature information; obtaining utterance speed adjustment information on the basis of the utterance speed of the acoustic feature information and the reference utterance speed; and inputting the acoustic feature information into a second neural network model set on the basis of the obtained utterance speed adjustment information so as to obtain voice data corresponding to the text. According to the present disclosure, it is possible to obtain voice data with an improved utterance speed.

Description

Electronic device and its control method {ELECTRONIC DEVICE AND METHOD FOR CONTROLLING THEREOF}

The present disclosure relates to an electronic device and a control method thereof, and more particularly, to an electronic device that performs speech synthesis using an artificial intelligence model and a control method thereof.

With advances in electronic technology, various types of devices are being developed and distributed, and devices that perform speech synthesis in particular are becoming common.

Speech synthesis is a technology for producing a human voice from text and is also called text to speech (TTS). Recently, neural TTS, which uses neural network models, has been developed.

Neural TTS may include, for example, a prosody neural network model and a neural vocoder neural network model. The prosody neural network model may receive text and output acoustic feature information, and the neural vocoder neural network model may receive the acoustic feature information and output voice data (a waveform).

Meanwhile, in a TTS model, the prosody neural network model carries the voice characteristics of the speaker whose data was used for training. That is, the output of the prosody neural network model is acoustic feature information that includes the voice characteristics and the speech rate characteristics of a specific speaker.

Conventionally, with the development of artificial intelligence models, personalized TTS models that output voice data having the voice characteristics of the user of an electronic device have been proposed. A personalized TTS model is a TTS model that is trained on the speech data of an individual user and outputs voice data carrying the voice characteristics and speech rate characteristics of that user.

However, the sound quality of the individual user's speech data used to train a personalized TTS model is generally lower than that of the data used to train a general TTS model, so the voice data output by the personalized TTS model may exhibit speech rate problems.

The present disclosure has been made to solve the above problem, and an object of the present disclosure is to provide an adaptive speech rate control method for a TTS model.

According to an embodiment of the present disclosure, a method for controlling an electronic device includes: obtaining text; inputting the text into a first neural network model to obtain acoustic feature information corresponding to the text and alignment information that matches each frame of the acoustic feature information with each phoneme included in the text; identifying a speech rate of the acoustic feature information based on the obtained alignment information; identifying a reference speech rate for each phoneme included in the acoustic feature information based on the text and the acoustic feature information; obtaining speech rate adjustment information based on the speech rate of the acoustic feature information and the reference speech rate; and inputting the acoustic feature information into a second neural network model that is set based on the obtained speech rate adjustment information to obtain voice data corresponding to the text.
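As an illustration of the step that identifies a speech rate from the alignment information, the following is a minimal sketch and not the disclosed implementation. It assumes the alignment is given as one phoneme index per frame, that the speech rate of a phoneme is expressed in phonemes per second, and that the frame shift is 10 ms. The function names and the `frame_shift_sec` parameter are hypothetical.

```python
from collections import Counter


def frames_per_phoneme(alignment):
    """Count the frames matched to each phoneme position.

    `alignment` holds one entry per acoustic-feature frame: the index of the
    phoneme (within the text) that the frame is matched to. This encoding is
    an assumption made for the sketch.
    """
    if not alignment:
        return []
    counts = Counter(alignment)
    return [counts[i] for i in range(max(alignment) + 1)]


def phoneme_speech_rates(alignment, frame_shift_sec=0.010):
    """Speech rate per phoneme, taken here as phonemes per second.

    A phoneme that spans n frames lasts n * frame_shift_sec seconds, so its
    rate is the reciprocal of that duration. Both the unit and the default
    frame shift are illustrative assumptions.
    """
    rates = []
    for index, n_frames in enumerate(frames_per_phoneme(alignment)):
        if n_frames == 0:
            raise ValueError(f"phoneme {index} has no aligned frames")
        rates.append(1.0 / (n_frames * frame_shift_sec))
    return rates


# Three phonemes spanning 3, 2 and 5 frames respectively.
example_alignment = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
example_rates = phoneme_speech_rates(example_alignment)
```

In this sketch a phoneme that occupies more frames has a lower rate, which is the quantity later compared against the reference speech rate.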

The identifying of the speech rate of the acoustic feature information may include identifying, based on the obtained alignment information, a speech rate corresponding to a first phoneme included in the acoustic feature information. The identifying of the reference speech rate may include: identifying the first phoneme included in the acoustic feature information based on the acoustic feature information; and identifying a reference speech rate corresponding to the first phoneme based on the text.

The identifying of the reference speech rate may include obtaining a first reference speech rate corresponding to the first phoneme based on the text and sample data used to train the first neural network model.

The identifying of the reference speech rate may further include: obtaining evaluation information on the sample data used to train the first neural network model; and identifying a second reference speech rate corresponding to the first phoneme based on the first reference speech rate and the evaluation information, wherein the evaluation information is obtained by a user of the electronic device.

The method may further include identifying the reference speech rate corresponding to the first phoneme based on one of the first reference speech rate and the second reference speech rate.

The identifying of the speech rate corresponding to the first phoneme may further include identifying an average speech rate corresponding to the first phoneme based on the speech rate corresponding to the first phoneme and the speech rate corresponding to each of at least one phoneme preceding the first phoneme in the acoustic feature information. The obtaining of the speech rate adjustment information may include obtaining speech rate adjustment information corresponding to the first phoneme based on the average speech rate corresponding to the first phoneme and the reference speech rate corresponding to the first phoneme.
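The averaging described above can be sketched as follows. This is only one possible reading: it assumes the average is taken over the current phoneme and all phonemes before it (a cumulative mean), and that the adjustment information is the average rate divided by the reference rate. The text does not fix the averaging window or the direction of the ratio, so both are assumptions, and the function names are hypothetical.

```python
def running_average_rates(rates):
    """Cumulative mean: entry i averages rates[0..i] (current and preceding phonemes)."""
    averages = []
    total = 0.0
    for count, rate in enumerate(rates, start=1):
        total += rate
        averages.append(total / count)
    return averages


def rate_adjustment(average_rate, reference_rate):
    """Adjustment value as average rate over reference rate (direction assumed).

    A value above 1.0 means the synthesized speech is faster than the
    reference; a value below 1.0 means it is slower.
    """
    if reference_rate <= 0:
        raise ValueError("reference_rate must be positive")
    return average_rate / reference_rate


# Per-phoneme rates in phonemes per second (illustrative numbers only).
averages = running_average_rates([20.0, 40.0, 30.0])
adjustment_for_last = rate_adjustment(averages[-1], reference_rate=20.0)
```

Averaging over preceding phonemes smooths out the large rate differences between individual phonemes, so the adjustment value changes gradually rather than jumping at every phoneme.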

The second neural network model may include an encoder that receives the acoustic feature information and a decoder that receives vector information output from the encoder. The obtaining of the voice data may further include: while at least one frame of the acoustic feature information corresponding to the first phoneme is being input to the second neural network model, identifying a loop count of the decoder included in the second neural network model based on the speech rate adjustment information corresponding to the first phoneme; and, based on the at least one frame corresponding to the first phoneme being input to the second neural network model, obtaining first voice data in a number corresponding to the at least one frame and the identified loop count, wherein the first voice data is voice data corresponding to the first phoneme.

When one of the at least one frame of the acoustic feature information corresponding to the first phoneme is input to the second neural network model, second voice data may be obtained in a number corresponding to the loop count.
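The relationship between the decoder loop count and the amount of voice data obtained per frame can be illustrated with a toy loop. This is a sketch under strong simplifying assumptions: `decoder_step` is a hypothetical stand-in that returns one piece of voice data per call, whereas an actual neural vocoder decoder would be a trained, stateful network fed by the encoder output.

```python
def decode_frames(frames, loops_per_frame, decoder_step):
    """Run the decoder loop a given number of times for each input frame.

    frames          : per-frame inputs (stand-ins for encoder output vectors)
    loops_per_frame : loop count identified for each frame
    decoder_step    : hypothetical callable producing one piece of voice data
    """
    if len(frames) != len(loops_per_frame):
        raise ValueError("one loop count is required per frame")
    voice_data = []
    for frame, loops in zip(frames, loops_per_frame):
        for _ in range(loops):
            voice_data.append(decoder_step(frame))
    return voice_data


# Two frames of the same phoneme; the second is stretched by a larger loop count.
output = decode_frames([0.1, 0.2], [2, 3], decoder_step=lambda frame: frame)
```

Raising the loop count for the frames of a phoneme produces more voice data for that phoneme, which lengthens it in the output; lowering the count shortens it.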

The decoder may obtain voice data at a first frequency (kHz) based on acoustic feature information whose shift size is a first time interval (sec). When the value of the speech rate adjustment information is a reference value, one frame included in the acoustic feature information is input to the second neural network model, and a number of pieces of voice data corresponding to the product of the first time interval and the first frequency is obtained.
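A short worked example of this product may help. The values below (a 10 ms shift and a 24 kHz output frequency) are illustrative assumptions, since the passage only names a first time interval and a first frequency. Scaling the per-frame count by the adjustment value is likewise one possible way to set the loop count, not a rule stated in the text. The frequency is passed in Hz so that seconds times Hz gives a plain count.

```python
def voice_data_per_frame(shift_sec, frequency_hz):
    """Pieces of voice data per frame at the reference adjustment value."""
    return round(shift_sec * frequency_hz)


def adjusted_loop_count(shift_sec, frequency_hz, adjustment=1.0):
    """Loop count scaled by the adjustment value (the scaling rule is assumed).

    With adjustment == 1.0 (taken here as the reference value) the result
    equals the plain product of shift size and frequency.
    """
    base = voice_data_per_frame(shift_sec, frequency_hz)
    return max(1, round(base * adjustment))


base_count = voice_data_per_frame(0.010, 24_000)           # 0.010 s x 24000 Hz
slower_count = adjusted_loop_count(0.010, 24_000, 1.5)     # stretch each frame
faster_count = adjusted_loop_count(0.010, 24_000, 0.5)     # compress each frame
```

With these assumed values one frame yields 240 pieces of voice data at the reference value, 360 when stretched by 1.5, and 120 when compressed by 0.5.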

The speech rate adjustment information may be information on the ratio between the speech rate of the acoustic feature information and the reference speech rate.

Meanwhile, according to an embodiment of the present disclosure, an electronic device includes: a memory that stores at least one instruction; and a processor that controls the electronic device by executing the at least one instruction stored in the memory. The processor obtains text; inputs the text into a first neural network model to obtain acoustic feature information corresponding to the text and alignment information that matches each frame of the acoustic feature information with each phoneme included in the text; identifies a speech rate of the acoustic feature information based on the obtained alignment information; identifies a reference speech rate for each phoneme included in the acoustic feature information based on the text and the acoustic feature information; obtains speech rate adjustment information based on the speech rate of the acoustic feature information and the reference speech rate; and inputs the acoustic feature information into a second neural network model that is set based on the obtained speech rate adjustment information to obtain voice data corresponding to the text.

The processor may identify, based on the obtained alignment information, a speech rate corresponding to a first phoneme included in the acoustic feature information, identify the first phoneme included in the acoustic feature information based on the acoustic feature information, and identify a reference speech rate corresponding to the first phoneme based on the text.

The processor may obtain a first reference speech rate corresponding to the first phoneme based on the text and sample data used to train the first neural network model.

The processor may obtain evaluation information on the sample data used to train the first neural network model and identify a second reference speech rate corresponding to the first phoneme based on the first reference speech rate and the evaluation information, wherein the evaluation information is obtained by a user of the electronic device.

The processor may identify the reference speech rate corresponding to the first phoneme based on one of the first reference speech rate and the second reference speech rate.

The processor may identify an average speech rate corresponding to the first phoneme based on the speech rate corresponding to the first phoneme and the speech rate corresponding to each of at least one phoneme preceding the first phoneme in the acoustic feature information, and may obtain speech rate adjustment information corresponding to the first phoneme based on the average speech rate corresponding to the first phoneme and the reference speech rate corresponding to the first phoneme.

The second neural network model may include an encoder that receives the acoustic feature information and a decoder that receives vector information output from the encoder. While at least one frame of the acoustic feature information corresponding to the first phoneme is being input to the second neural network model, the processor may identify a loop count of the decoder included in the second neural network model based on the speech rate adjustment information corresponding to the first phoneme, and, based on the at least one frame corresponding to the first phoneme being input to the second neural network model, may obtain first voice data in a number corresponding to the at least one frame and the identified loop count, wherein the first voice data is voice data corresponding to the first phoneme.

According to the various embodiments described above, the electronic device can adjust the speech rate for each phoneme corresponding to the acoustic feature information input to the neural vocoder neural network model of the TTS model, and can thereby obtain voice data with an improved speech rate.

FIG. 1 is a block diagram for explaining the configuration of an electronic device according to an embodiment of the present disclosure.
FIG. 2 is a block diagram for explaining the configuration of a TTS model according to an embodiment of the present disclosure.
FIG. 3 is a block diagram for explaining the configuration of a second neural network model (e.g., a neural vocoder neural network model) in a TTS model according to an embodiment of the present disclosure.
FIG. 4 is a diagram for explaining a method of obtaining voice data with an improved speech rate according to an embodiment of the present disclosure.
FIG. 5 is a diagram for explaining alignment information that matches each frame of acoustic feature information with each phoneme included in text, according to an embodiment of the present disclosure.
FIG. 6 is a diagram for explaining a method of identifying a reference speech rate for each phoneme included in acoustic feature information according to a first embodiment of the present disclosure.
FIG. 7 is a diagram for explaining a method of identifying a reference speech rate for each phoneme included in acoustic feature information according to a second embodiment of the present disclosure.
FIG. 8 is a diagram for explaining a method for identifying a reference speech rate according to an embodiment of the present disclosure.
FIG. 9 is a flowchart illustrating an operation of an electronic device according to an embodiment of the present disclosure.
FIG. 10 is a block diagram for explaining the configuration of an electronic device according to an embodiment of the present disclosure.

Hereinafter, the present disclosure will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram for explaining the configuration of an electronic device according to an embodiment of the present disclosure.

Referring to FIG. 1, an electronic device 100 may include a memory 110 and a processor 120. The electronic device 100 according to the present disclosure may be implemented as various types of electronic devices, such as a smartphone, AR glasses, a tablet PC, a mobile phone, a video phone, an e-book reader, a TV, a desktop PC, a laptop PC, a netbook computer, a workstation, a camera, a smart watch, and a server.

The memory 110 may store at least one instruction or data related to at least one other component of the electronic device 100. In particular, the memory 110 may be implemented as a non-volatile memory, a volatile memory, a flash memory, a hard disk drive (HDD), or a solid state drive (SSD). The memory 110 is accessed by the processor 120, and the processor 120 may read, write, modify, delete, and update data in it.

In the present disclosure, the term memory may include the memory 110, a ROM (not shown) or a RAM (not shown) in the processor 120, or a memory card (not shown) mounted in the electronic device 100 (e.g., a micro SD card or a memory stick).

As described above, the memory 110 may store at least one instruction. Here, the instruction may be for controlling the electronic device 100. For example, the memory 110 may store an instruction related to a function for changing an operation mode according to the user's conversation situation. Specifically, the memory 110 may include a plurality of components (or modules) for changing the operation mode according to the user's conversation situation according to the present disclosure, which will be described later.

The memory 110 may store data, which is information in units of bits or bytes that can represent characters, numbers, images, and the like. For example, the memory 110 may store a first neural network model 10 and a second neural network model 20. Here, the first neural network model may be a prosody neural network model, and the second neural network model may be a neural vocoder neural network model.

The processor 120 may be electrically connected to the memory 110 to control the overall operations and functions of the electronic device 100.

According to an embodiment, the processor 120 may be implemented as a digital signal processor (DSP), a microprocessor, or a time controller (TCON). However, the processor 120 is not limited thereto and may include, or be defined by the term for, one or more of a central processing unit (CPU), a micro controller unit (MCU), a micro processing unit (MPU), a controller, an application processor (AP), a communication processor (CP), and an ARM processor. In addition, the processor 120 may be implemented as a system on chip (SoC) or large scale integration (LSI) with a built-in processing algorithm, or in the form of a field programmable gate array (FPGA).

One or more processors control input data to be processed according to a predefined operation rule or an artificial intelligence model stored in the memory 110. The predefined operation rule or artificial intelligence model is created through training. Here, being created through training means that a predefined operation rule or an artificial intelligence model with desired characteristics is created by applying a learning algorithm to a plurality of training data. Such training may be performed on the device itself on which the artificial intelligence according to the present disclosure runs, or through a separate server or system.

An artificial intelligence model may be composed of a plurality of neural network layers. Each layer has a plurality of weight values and performs its operation using the operation result of the previous layer and the plurality of weight values. Examples of neural networks include a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), and deep Q-networks. The neural network in the present disclosure is not limited to these examples unless otherwise specified.
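For readers unfamiliar with the layer operation mentioned above, a minimal sketch of one fully connected layer follows. It is a generic illustration rather than a layer of the disclosed models: it assumes the operation is a weighted sum of the previous layer's outputs and omits bias terms and activation functions. The function name is hypothetical.

```python
def layer_forward(previous_output, weights):
    """Compute one layer's output from the previous layer's output.

    previous_output : list of floats produced by the previous layer
    weights         : one row of weight values per output unit
    Each output unit is the weighted sum of the previous layer's outputs.
    """
    for row in weights:
        if len(row) != len(previous_output):
            raise ValueError("each weight row must match the input length")
    return [
        sum(w * x for w, x in zip(row, previous_output))
        for row in weights
    ]


# Two inputs feeding two output units (illustrative numbers only).
hidden = layer_forward([1.0, 2.0], [[1.0, 0.0], [0.5, 0.5]])
```

Stacking several such operations, each consuming the result of the one before it, gives the multi-layer structure described in the paragraph.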

The processor 120 may run an operating system or an application program to control hardware or software components connected to the processor 120 and may perform various data processing and operations. In addition, the processor 120 may load commands or data received from at least one of the other components into a volatile memory, process them, and store various data in a non-volatile memory.

In particular, the processor 120 may provide an adaptive speech rate control function when synthesizing voice data. As shown in FIG. 1, the adaptive speech rate control function according to the present disclosure may include a text acquisition module 121, an acoustic feature information acquisition module 122, a speech rate acquisition module 123, a reference speech rate acquisition module 124, a speech rate adjustment information acquisition module 125, and a voice data acquisition module 126, and each module may be stored in the memory 110. For example, the adaptive speech rate control function may adjust the speech rate by adjusting the loop count of the second neural network model 20 included in the text to speech (TTS) model 200 shown in FIG. 2.

FIG. 2 is a block diagram for explaining the configuration of a TTS model according to an embodiment of the present disclosure.

The TTS model 200 shown in FIG. 2 may include a first neural network model 10 and a second neural network model 20.

The first neural network model 10 may be a component that receives text 210 and outputs acoustic feature information 220 corresponding to the text 210. For example, the first neural network model 10 may be implemented as a prosody neural network model.

The prosody neural network model may be a neural network model that has learned the relationship between a plurality of sample texts and a plurality of pieces of sample acoustic feature information corresponding to the respective sample texts. Specifically, the prosody neural network model learns the relationship between one sample text and the sample acoustic feature information obtained from the sample voice data corresponding to that sample text, and the model is trained by repeating this process over the plurality of sample texts. The prosody neural network model may also include, for example, a language processing unit to improve performance, and the language processing unit may include a text normalization module, a grapheme-to-phoneme (G2P) conversion module, and the like. Meanwhile, the acoustic feature information 220 output from the first neural network model 10 may include the voice characteristics of the speaker used to train the first neural network model 10. That is, the acoustic feature information 220 output from the first neural network model 10 may have the voice characteristics of a specific speaker (the speaker corresponding to the data used to train the first neural network model).

The second neural network model 20 is a neural network model for converting the acoustic feature information 220 into voice data 230 and may be implemented, for example, as a neural vocoder neural network model. The neural vocoder neural network model according to the present disclosure may receive the acoustic feature information 220 output from the first neural network model 10 and output voice data 230 corresponding to the acoustic feature information 220. Specifically, the second neural network model 20 may be a neural network model that has learned the relationship between a plurality of pieces of sample acoustic feature information and the sample voice data corresponding to each of them.

In addition, as shown in FIG. 3, the second neural network model 20 may include an encoder 20-1 that receives the acoustic feature information 220 and a decoder 20-2 that receives the vector information output from the encoder 20-1 and outputs the voice data 230. The second neural network model 20 is described later with reference to FIG. 3.

Referring back to FIG. 1, a plurality of modules 121 to 126 may be loaded into a memory (e.g., a volatile memory) included in the processor 120 in order to perform the adaptive speech rate adjustment function. That is, to perform the adaptive speech rate adjustment function, the processor 120 may load the plurality of modules 121 to 126 from a non-volatile memory into a volatile memory and execute the respective functions of the plurality of modules 121 to 126. Loading refers to an operation of reading data stored in a non-volatile memory into a volatile memory and storing it there so that the processor 120 can access it.

In one embodiment according to the present disclosure, the adaptive speech rate adjustment function may be implemented through the plurality of modules 121 to 126 stored in the memory 110, as shown in FIG. 1. However, the disclosure is not limited thereto, and the adaptive speech rate adjustment function may be implemented through an external device connected to the electronic device 100.

Each of the plurality of modules 121 to 126 according to the present disclosure may be implemented as separate software, but the disclosure is not limited thereto, and some modules may be implemented as a combination of hardware and software. In another embodiment, the plurality of modules 121 to 126 may be implemented as a single piece of software. In addition, some modules may be implemented within the electronic device 100, and other modules may be implemented in an external device.

The text acquisition module 121 is a module for obtaining text to be converted into voice data. For example, the text obtained by the text acquisition module 121 may be text corresponding to a response to a user's voice command. For example, the text may be text being displayed on the display of the electronic device 100. For example, the text may be text input by a user of the electronic device 100. For example, the text may be text provided by a voice recognition system (e.g., Bixby). For example, the text may be text received from an external server. That is, the text according to the present disclosure may be any of various texts to be converted into voice data.

The acoustic feature information acquisition module 122 is a component for obtaining acoustic feature information corresponding to the text obtained by the text acquisition module 121.

The acoustic feature information acquisition module 122 may input the text obtained by the text acquisition module 121 into the first neural network model 10 and output acoustic feature information corresponding to the input text.

The acoustic feature information according to the present disclosure may be information that includes the voice characteristics of a specific speaker (e.g., pitch information, prosody information, and speech rate information). When this acoustic feature information is input into the second neural network model 20, described later, voice data corresponding to the text may be output.

Here, the acoustic feature information refers to static characteristics within a short section (frame) of voice data, and may be obtained for each section after short-time analysis of the voice data. The frame of the acoustic feature information may be set to 10 to 20 msec, but may also be set to any other time interval. Examples of acoustic feature information include spectrum, mel-spectrum, cepstrum, pitch lag, and pitch correlation, and these may be used individually or in combination.

For example, the acoustic feature information may be set as a 257-dimensional spectrum, an 80-dimensional mel-spectrum, or cepstrum (20 dimensions) + pitch lag (1 dimension) + pitch correlation (1 dimension). More specifically, when the shift size is 10 msec and an 80-dimensional mel-spectrum is used as the acoustic feature information, acoustic feature information of dimension [100, 80] can be obtained from 1 second of voice data, where [T, D] has the following meaning.

[T, D]: T frames, D-dimensional acoustic feature information
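The [T, D] arithmetic above can be illustrated with a minimal sketch. It assumes one frame is produced per shift interval and ignores any padding at the edges of the signal; the function name feature_shape is hypothetical and is not part of the disclosure.

```python
def feature_shape(duration_ms: int, shift_ms: int, feature_dim: int) -> tuple:
    """Return (T, D): number of frames and feature dimension.

    Assumption: exactly one frame per shift interval, no edge padding.
    """
    num_frames = duration_ms // shift_ms  # T
    return (num_frames, feature_dim)      # (T, D)


# 1 second of voice data, 10 msec shift, 80-dimensional mel-spectrum
shape = feature_shape(duration_ms=1000, shift_ms=10, feature_dim=80)
print(shape)
```

With these inputs the sketch yields the [100, 80] shape given in the example above.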

In addition, the acoustic feature information acquisition module 122 may obtain alignment information in which each frame of the acoustic feature information output from the first neural network model 10 is matched with each phoneme included in the input text. Specifically, by inputting the text into the first neural network model 10, the acoustic feature information acquisition module 122 obtains the acoustic feature information corresponding to the text and also obtains the alignment information that matches each frame of the acoustic feature information with each phoneme included in the text input into the first neural network model 10.

The alignment information according to the present disclosure may be matrix information for the alignment between the input and output sequences in a sequence-to-sequence model. Specifically, the alignment information indicates from which input each time step of the output sequence was predicted. The alignment information obtained from the first neural network model 10 according to the present disclosure may be alignment information that matches the "phonemes" corresponding to the text input into the first neural network model 10 with the "frames of acoustic feature information" output from the first neural network model 10. The alignment information is described later with reference to FIG. 5.
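As a rough illustration of how such an alignment matrix can be read, the following sketch assigns each output frame to the input phoneme with the largest weight. The matrix layout (rows are frames, columns are phonemes), the example values, and the function name frame_to_phoneme are assumptions made for illustration; the disclosure does not specify them.

```python
def frame_to_phoneme(alignment):
    """Map each output frame to the index of its best-matching phoneme.

    alignment[t][n] is the (assumed) weight linking output frame t
    to input phoneme n.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in alignment]


# Hypothetical alignment for 4 frames and 3 phonemes
alignment = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.3, 0.7],
]
print(frame_to_phoneme(alignment))
```

Here the first two frames are attributed to phoneme 0, the third to phoneme 1, and the fourth to phoneme 2.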

The speech rate acquisition module 123 is a component for identifying the speech rate of the acoustic feature information obtained by the acoustic feature information acquisition module 122, based on the alignment information obtained by the acoustic feature information acquisition module 122.

Based on the alignment information obtained by the acoustic feature information acquisition module 122, the speech rate acquisition module 123 may identify the speech rate corresponding to each phoneme included in the acoustic feature information obtained by the acoustic feature information acquisition module 122.

Specifically, the speech rate acquisition module 123 may identify the speech rate of each phoneme included in the acoustic feature information obtained by the acoustic feature information acquisition module 122, based on the alignment information obtained by that module. Because the alignment information according to the present disclosure matches the "phonemes" corresponding to the text input into the first neural network model 10 with the "frames of acoustic feature information" output from the first neural network model 10, the more frames of acoustic feature information correspond to a first phoneme among the phonemes included in the alignment information, the more slowly the first phoneme is uttered. For example, if the alignment information indicates that three frames of acoustic feature information correspond to the first phoneme and five frames correspond to a second phoneme, the speech rate of the first phoneme is relatively faster than that of the second phoneme.
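The frame-counting idea above can be sketched as follows. The sketch assumes that each frame has already been assigned to a single phoneme index and that the per-phoneme rate is expressed as the reciprocal of the phoneme's duration (phonemes per second); the function name phoneme_rates and the 10 msec shift are illustrative assumptions, not details taken from the disclosure.

```python
from collections import Counter


def phoneme_rates(frame_phonemes, num_phonemes, shift_sec=0.01):
    """Return a per-phoneme rate in phonemes/sec (None if no frames).

    frame_phonemes: phoneme index assigned to each frame (assumed input).
    More frames for a phoneme means a longer duration and a lower rate.
    """
    counts = Counter(frame_phonemes)
    rates = []
    for n in range(num_phonemes):
        frames = counts.get(n, 0)
        rates.append(1.0 / (frames * shift_sec) if frames else None)
    return rates


# Phoneme 0 spans 3 frames, phoneme 1 spans 5 frames
frame_phonemes = [0, 0, 0, 1, 1, 1, 1, 1]
rates = phoneme_rates(frame_phonemes, num_phonemes=2)
print(rates)
```

Under these assumptions the three-frame phoneme gets a higher rate than the five-frame phoneme, matching the comparison in the example above.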

Then, once the speech rate for each phoneme included in the text has been obtained, the speech rate acquisition module 123 may obtain the average speech rate for a specific phoneme by considering the speech rate of that phoneme together with the speech rate of at least one phoneme preceding it. For example, the speech rate acquisition module 123 may identify the average speech rate corresponding to a first phoneme based on the speech rate corresponding to the first phoneme included in the text and the speech rate corresponding to each of at least one phoneme preceding the first phoneme.

However, because the speech rate of a single phoneme is a rate over a short section, predicting the speech rate over too short a section reduces the difference in length between phonemes and may produce unnatural results. In addition, when the speech rate is predicted over too short a section, the predicted value changes too quickly along the time axis, which may also produce unnatural results. Accordingly, the present disclosure may identify an average speech rate for a phoneme that also takes into account the speech rates of the preceding phonemes, and may use the identified average speech rate as the speech rate of that phoneme.

However, if the average speech rate is predicted over too long a section, it may be difficult to reflect cases where slow and fast utterances occur together within the text. In addition, in a streaming structure, the identified speech rate is a prediction for speech that has already been output, so the speech rate adjustment may be delayed. A method for measuring the average speech rate over an appropriate section is therefore needed.

According to an embodiment, the average speech rate may be identified by a simple moving average method or an exponential moving average (EMA) method, and the details are described later with reference to FIGS. 6 and 7.
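For orientation only, the two averaging methods named above can be sketched in their generic textbook form. The window size, the smoothing factor alpha, and the function names are assumptions; the specific formulation used in the disclosure is the one described with reference to FIGS. 6 and 7 and may differ.

```python
def simple_moving_average(rates, window):
    """Average each rate with up to (window - 1) preceding rates."""
    out = []
    for i in range(len(rates)):
        start = max(0, i - window + 1)
        chunk = rates[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def exponential_moving_average(rates, alpha):
    """Blend each new rate with the running average using weight alpha."""
    out = []
    avg = None
    for r in rates:
        avg = r if avg is None else alpha * r + (1 - alpha) * avg
        out.append(avg)
    return out


per_phoneme_rates = [10.0, 20.0, 30.0]  # hypothetical phonemes/sec values
print(simple_moving_average(per_phoneme_rates, window=2))
print(exponential_moving_average(per_phoneme_rates, alpha=0.5))
```

Both functions only look at the current and earlier phonemes, which is consistent with the streaming constraint discussed above; a larger window or smaller alpha smooths more but reacts more slowly.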

The reference speech rate acquisition module 124 is a component for identifying a reference speech rate for each phoneme included in the acoustic feature information. The reference speech rate according to the present disclosure may mean the optimal speech rate that is perceived as appropriate for each phoneme included in the acoustic feature information.

In a first embodiment, the reference speech rate acquisition module 124 may obtain a first reference speech rate corresponding to a first phoneme included in the acoustic feature information, based on the sample data (e.g., sample text and sample voice data) used to train the first neural network model 10.

For example, when the phoneme sequence containing the first phoneme has many vowels, the first reference speech rate corresponding to the first phoneme may be relatively slow. When the phoneme sequence containing the first phoneme has many consonants, the first reference speech rate corresponding to the first phoneme may be relatively fast. In addition, when the word containing the first phoneme is a word that should be emphasized, that word should be uttered slowly, so the first reference speech rate corresponding to the first phoneme may be relatively slow.

For example, the reference speech rate acquisition module 124 may obtain the first reference speech rate corresponding to the first phoneme by using a third neural network model that estimates the reference speech rate. Specifically, the reference speech rate acquisition module 124 may identify the first phoneme from the alignment information obtained by the acoustic feature information acquisition module 122. The reference speech rate acquisition module 124 may then input the information on the identified first phoneme and the text obtained by the text acquisition module 121 into the third neural network model to obtain the first reference speech rate corresponding to the first phoneme.

For example, the third neural network model may be trained based on the sample data (e.g., sample text and sample voice data) used to train the first neural network model 10. That is, the third neural network model may be trained to estimate the section-average speech rate of sample acoustic feature information, based on the sample acoustic feature information and the sample text corresponding to it. Here, the third neural network model may be implemented as a statistical model capable of estimating a section-average speech rate, such as a hidden Markov model (HMM) or a deep neural network (DNN) model. The data used to train the third neural network model is described later with reference to FIG. 8.

In the embodiment above, the first reference speech rate corresponding to the first phoneme is described as being obtained using the third neural network model, but the present disclosure is not limited thereto. That is, the reference speech rate acquisition module 124 may obtain the first reference speech rate corresponding to the first phoneme by using, instead of the third neural network model, a rule-based prediction method or a decision-based prediction method.

In a second embodiment, the reference speech rate acquisition module 124 may obtain a second reference speech rate, which is a speech rate subjectively determined by a user listening to the voice data. Specifically, the reference speech rate acquisition module 124 may obtain evaluation information on the sample data used to train the first neural network model 10. For example, the reference speech rate acquisition module 124 may obtain a user's evaluation information on the sample voice data used to train the first neural network model 10. Here, the evaluation information may be an evaluation of the speed subjectively perceived by a user who has listened to the sample voice data. For example, the evaluation information may be obtained by receiving a user input through a UI displayed on the display of the electronic device 100.

For example, when a user who has listened to the sample voice data finds its speech rate slightly slow, the reference speech rate acquisition module 124 may obtain from the user first evaluation information (e.g., 1.1 times) for setting the speech rate of the sample voice data faster. For example, when a user who has listened to the sample voice data finds its speech rate slightly fast, the reference speech rate acquisition module 124 may obtain from the user second evaluation information (e.g., 0.95 times) for setting the speech rate of the sample voice data slower.

The reference speech rate acquisition module 124 may then obtain the second reference speech rate by applying the evaluation information to the first reference speech rate corresponding to the first phoneme. For example, when the first evaluation information is obtained, the reference speech rate acquisition module 124 may identify a speech rate equal to 1.1 times the first reference speech rate corresponding to the first phoneme as the second reference speech rate corresponding to the first phoneme. For example, when the second evaluation information is obtained, the reference speech rate acquisition module 124 may identify a speech rate equal to 0.95 times the first reference speech rate corresponding to the first phoneme as the second reference speech rate corresponding to the first phoneme.

In a third embodiment, the reference speech rate acquisition module 124 may obtain a third reference speech rate based on evaluation information on reference sample data. Here, the reference sample data may include a plurality of sample texts and a plurality of pieces of sample voice data in which a reference speaker utters each of the sample texts. For example, first reference sample data may include a plurality of pieces of sample voice data in which a specific voice actor utters each of a plurality of sample texts, and second reference sample data may include a plurality of pieces of sample voice data in which another voice actor utters each of the sample texts. The reference speech rate acquisition module 124 may then obtain the third reference speech rate based on the user's evaluation information on the reference sample data. For example, when the first evaluation information is obtained for the first reference sample data, the reference speech rate acquisition module 124 may identify a rate equal to 1.1 times the speech rate of the first phoneme in the first reference sample data as the third reference speech rate corresponding to the first phoneme. For example, when the second evaluation information is obtained for the first reference sample data, the reference speech rate acquisition module 124 may identify a rate equal to 0.95 times the speech rate of the first phoneme in the first reference sample data as the third reference speech rate corresponding to the first phoneme.

The reference speech rate acquisition module 124 may then identify one of the first reference speech rate, the second reference speech rate, and the third reference speech rate corresponding to the first phoneme as the reference speech rate corresponding to the first phoneme.

The speech rate adjustment information acquisition module 125 is a component for obtaining speech rate adjustment information based on the speech rate corresponding to the first phoneme obtained through the speech rate acquisition module 123 and the reference speech rate corresponding to the first phoneme obtained through the reference speech rate acquisition module 124.

Specifically, if the speech rate corresponding to the n-th phoneme obtained through the speech rate acquisition module 123 is Xn and the reference speech rate corresponding to the n-th phoneme obtained through the reference speech rate acquisition module 124 is Xrefn, the speech rate adjustment information Sn corresponding to the n-th phoneme may be defined as (Xrefn / Xn). For example, when the currently predicted speech rate X1 corresponding to the first phoneme is 20 (phonemes/sec) and the reference speech rate Xref1 corresponding to the first phoneme is 18 (phonemes/sec), the speech rate adjustment information S1 corresponding to the first phoneme may be 0.9.
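The ratio defined above can be written as a short sketch. The function name speech_rate_adjustment and the guard against a non-positive current rate are illustrative assumptions; only the ratio Xrefn / Xn itself comes from the text.

```python
def speech_rate_adjustment(current_rate: float, reference_rate: float) -> float:
    """Return Sn = Xrefn / Xn for one phoneme.

    current_rate:   Xn, the predicted speech rate (phonemes/sec)
    reference_rate: Xrefn, the reference speech rate (phonemes/sec)
    """
    if current_rate <= 0:
        # Assumed guard; the disclosure does not discuss this case.
        raise ValueError("current_rate must be positive")
    return reference_rate / current_rate


# Values from the example: X1 = 20, Xref1 = 18
s1 = speech_rate_adjustment(current_rate=20.0, reference_rate=18.0)
print(s1)
```

For the example values this evaluates to 0.9, the same S1 given above.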

The voice data acquisition module 126 is a component for obtaining voice data corresponding to the text.

Specifically, the voice data acquisition module 126 may obtain the voice data corresponding to the text by inputting the acoustic feature information corresponding to the text, obtained by the acoustic feature information acquisition module 122, into the second neural network model 20, which is set based on the speech rate adjustment information.

While at least one frame of the acoustic feature information 220 corresponding to the first phoneme is being input into the second neural network model 20, the voice data acquisition module 126 may identify the number of loops of the decoder 20-2 in the second neural network model 20 based on the speech rate adjustment information corresponding to the first phoneme. The voice data acquisition module 126 may then obtain, from the decoder 20-2, a plurality of pieces of first voice data corresponding to the number of loops while the at least one frame corresponding to the first phoneme is being input into the second neural network model 20.

When one of the at least one frame of the acoustic feature information corresponding to the first phoneme is input into the second neural network model 20, a number of pieces of second voice sample data corresponding to the number of loops may be obtained. The set of second voice sample data obtained by inputting each of the at least one frame corresponding to the first phoneme into the second neural network model 20 may be the first voice data. Here, the plurality of pieces of first voice data may be the voice data corresponding to the first phoneme.

That is, adjusting the number of loops of the decoder 20-2 adjusts the number of samples of the output voice data, so the speech rate of the voice data can be adjusted by adjusting the number of loops of the decoder 20-2. The method of adjusting the speech rate through the second neural network model 20 is described later with reference to FIG. 3.

The voice data acquisition module 126 may then obtain the voice data corresponding to the text by inputting each of the plurality of phonemes included in the acoustic feature information into the second neural network model 20, in which the number of loops of the decoder 20-2 is set based on the speech rate adjustment information corresponding to each of the plurality of phonemes.

FIG. 3 is a block diagram illustrating the configuration of the second neural network model (e.g., a neural vocoder neural network model) in the TTS model 200 according to an embodiment of the present disclosure.

Referring to FIG. 3, the encoder 20-1 of the second neural network model 20 may receive the acoustic feature information 220 and output vector information 225 corresponding to the acoustic feature information 220. Because the vector information 225 is, from the viewpoint of the second neural network model 20, data output from a hidden layer, it may be referred to as a hidden representation.

While at least one frame of the acoustic feature information 220 corresponding to the first phoneme is being input into the second neural network model 20, the voice data acquisition module 126 may identify the number of loops of the decoder 20-2 based on the speech rate adjustment information corresponding to the first phoneme. The voice data acquisition module 126 may then obtain, from the decoder 20-2, a plurality of pieces of first voice data corresponding to the identified number of loops while the at least one frame corresponding to the first phoneme is being input into the second neural network model 20.

That is, when one of the at least one frame of the acoustic feature information corresponding to the first phoneme is input into the second neural network model 20, a number of pieces of second voice data corresponding to the number of loops may be obtained. For example, when one of the at least one frame of the acoustic feature information 220 corresponding to the first phoneme is input into the encoder 20-1 of the second neural network model 20, the corresponding vector information may be output. The vector information is then input into the decoder 20-2, and the decoder 20-2 runs N loops, that is, N loops per frame of the acoustic feature information 220, and may output N pieces of voice data.

The set of second voice data obtained by inputting each of the at least one frame corresponding to the first phoneme to the second neural network model 20 may be the first voice data. Here, the plurality of pieces of first voice data may be voice data corresponding to the first phoneme.

In an embodiment in which the decoder 20-2 obtains voice data of a first frequency (kHz) based on acoustic feature information whose shift size is a first time interval (sec), when the value of the utterance speed adjustment information is a reference value (e.g., 1), one frame included in the acoustic feature information is input to the second neural network model 20, the decoder 20-2 operates for a number of loops corresponding to (first time interval × first frequency), and that number of pieces of voice data may be obtained. For example, when the decoder 20-2 obtains 24 kHz voice data based on acoustic feature information with a shift size of 10 msec and the value of the utterance speed adjustment information is the reference value (e.g., 1), one frame included in the acoustic feature information is input to the second neural network model 20, the decoder 20-2 operates for 240 loops, and 240 pieces of voice data may be obtained.

Also, in an embodiment in which the decoder 20-2 obtains voice data of a first frequency (kHz) based on acoustic feature information whose shift size is a first time interval (sec), one frame included in the acoustic feature information is input to the second neural network model 20, the decoder 20-2 operates for a number of loops corresponding to (first time interval × first frequency × utterance speed adjustment information), and that number of pieces of voice data may be obtained. For example, when the decoder 20-2 obtains 24 kHz voice data based on acoustic feature information with a shift size of 10 msec and the value of the utterance speed adjustment information is 1.1, one frame included in the acoustic feature information is input to the second neural network model 20, the decoder 20-2 operates for 264 loops, and 264 pieces of voice data may be obtained.

Here, the number of pieces of voice data obtained when the value of the utterance speed adjustment information is 1.1 (e.g., 264) may be greater than the number obtained when the value is the reference value (e.g., 240). That is, when the value of the utterance speed adjustment information is adjusted to 1.1, voice data that previously corresponded to 10 msec is output over 11 msec, so the utterance speed may be adjusted to be slower than when the value is the reference value.
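The arithmetic in the two examples above can be checked with a short sketch. It assumes, as those paragraphs do, that the loop count per frame is the shift size multiplied by the output frequency and scaled by the adjustment value. The function name `decoder_loops_per_frame` and its parameters are hypothetical and are not part of the disclosure; rounding to a whole number of loops is also an assumption.

```python
def decoder_loops_per_frame(shift_ms: int, sample_rate_khz: int,
                            adjustment: float = 1.0) -> int:
    """Number of decoder loops (one output sample per loop) for one frame.

    shift_ms * sample_rate_khz gives samples per frame (ms * kHz = samples).
    The adjustment value scales that count, following the multiplicative
    convention of the surrounding paragraphs (hypothetical helper).
    """
    base_loops = shift_ms * sample_rate_khz
    return round(base_loops * adjustment)


print(decoder_loops_per_frame(10, 24))       # 240
print(decoder_loops_per_frame(10, 24, 1.1))  # 264
```

With these numbers, each 10 msec frame yields 264 samples at 24 kHz, which plays back over 11 msec.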

That is, when the reference value of the utterance speed adjustment information is 1 and the value of the utterance speed adjustment information is S, the number of loops N' of the decoder 20-2 may be expressed as Equation 1 below.

$$N'_n = N \times \frac{1}{S_n} \tag{1}$$

In Equation 1, $N'_n$ denotes the number of loops of the decoder 20-2 at the n-th phoneme for utterance speed adjustment, and $N$ denotes the basic number of loops of the decoder 20-2. $S_n$ is the value of the utterance speed adjustment information at the n-th phoneme; when $S_n$ is 1.1, voice data uttered 10% faster can be obtained.

As Equation 1 shows, the utterance speed adjustment information is set differently for each phoneme included in the acoustic feature information 220 input to the second neural network model 20. That is, based on Equation 1, the present disclosure may obtain voice data whose utterance speed is adjusted in real time by using an adaptive utterance speed adjustment method that adjusts the utterance speed differently for each phoneme included in the acoustic feature information 220.

FIG. 4 is a diagram for describing a method by which an electronic device obtains voice data with an improved utterance speed, according to an embodiment of the present disclosure.

Referring to FIG. 4, the electronic device 100 may obtain text 210. Here, the text 210 is text to be converted into voice data, and there is no limitation on how it is obtained. That is, the text 210 may include various kinds of text, such as text input by a user of the electronic device 100, text provided by a voice recognition system (e.g., Bixby) of the electronic device 100, and text received from an external server.

The electronic device 100 may then input the text 210 to the first neural network model 10 to obtain acoustic feature information 220 and alignment information 400. Here, the acoustic feature information 220 may be information that includes the voice characteristics and utterance speed characteristics, corresponding to the text 210, of a specific speaker (the specific speaker corresponding to the first neural network model). The alignment information 400 may be alignment information in which the phonemes included in the text 210 are matched with the respective frames of the acoustic feature information 220.

The electronic device 100 may obtain, through the utterance speed acquisition module 123, an utterance speed 410 corresponding to the acoustic feature information 220 based on the alignment information 400. Here, the utterance speed 410 may be information on the actual utterance speed that results when the acoustic feature information 220 is converted into voice data 230. The utterance speed 410 may include utterance speed information for each phoneme included in the acoustic feature information 220.

The electronic device 100 may obtain, through the utterance speed adjustment information acquisition module 125, a reference utterance speed 420 based on the text 210 and the alignment information 400. Here, the reference utterance speed 420 may mean the optimal utterance speed for the phonemes included in the text 210. The reference utterance speed 420 may include reference utterance speed information for each phoneme included in the acoustic feature information 220.

The electronic device 100 may obtain, through the utterance speed adjustment information acquisition module 125, utterance speed adjustment information 430 based on the utterance speed 410 and the reference utterance speed 420. Here, the utterance speed adjustment information 430 may be information for adjusting the utterance speed of each phoneme included in the acoustic feature information 220. For example, when the utterance speed 410 for the m-th phoneme is 20 (phoneme/sec) and the reference utterance speed 420 for the m-th phoneme is 18 (phoneme/sec), the utterance speed adjustment information 430 for the m-th phoneme may be identified as 0.9 (18 / 20).
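As a minimal illustration of the ratio in the example above, the sketch below divides the reference utterance speed by the current utterance speed. The function name `adjustment_info` is hypothetical, and the guard against non-positive speeds is an added assumption rather than something the disclosure states.

```python
def adjustment_info(current_speed: float, reference_speed: float) -> float:
    """Utterance speed adjustment value for one phoneme.

    Computed as reference speed / current speed, both in phonemes per second.
    (Hypothetical helper; the positivity check is an assumption.)
    """
    if current_speed <= 0:
        raise ValueError("current_speed must be positive")
    return reference_speed / current_speed


print(adjustment_info(20.0, 18.0))  # 0.9
```

A value below 1 indicates that the current utterance is faster than the reference and should be slowed down.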

The electronic device 100 may then obtain voice data 230 corresponding to the text 210 by inputting the acoustic feature information 220 to the second neural network model 20, which is set based on the utterance speed adjustment information 430.

In one embodiment, while at least one frame corresponding to the m-th phoneme in the acoustic feature information 220 is input to the encoder 20-1 of the second neural network model 20, the electronic device 100 may identify the number of loops of the decoder 20-2 of the second neural network model 20 based on the utterance speed adjustment information 430 corresponding to the m-th phoneme. For example, if the utterance speed adjustment information 430 for the m-th phoneme is 0.9, the number of loops of the decoder 20-2 while a frame corresponding to the m-th phoneme in the acoustic feature information 220 is input to the encoder 20-1 may be (basic number of loops / utterance speed adjustment information corresponding to the m-th phoneme). That is, if the basic number of loops is 240, the number of loops of the decoder 20-2 while the frame corresponding to the m-th phoneme is input to the encoder 20-1 may be approximately 267 (240 / 0.9).

Once the number of loops is identified, the electronic device 100 may operate the decoder 20-2 for the number of loops corresponding to the m-th phoneme while a frame of the acoustic feature information 220 corresponding to the m-th phoneme is input to the decoder 20-2, thereby obtaining, per frame of the acoustic feature information 220, as many pieces of voice data as the number of loops corresponding to the m-th phoneme. The electronic device 100 may perform this process for all phonemes included in the text 210 to obtain the voice data 230 corresponding to the text 210.
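The per-phoneme procedure described in this embodiment can be sketched as follows. The sketch assumes the division convention stated here (basic number of loops divided by the adjustment value) and rounds to a whole number of loops, which the text does not specify. The names `loops_per_frame` and `total_samples`, and the sample inputs, are hypothetical.

```python
def loops_per_frame(base_loops: int, adjustment: float) -> int:
    """Decoder loops for one frame of a phoneme: base_loops / adjustment.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    if adjustment <= 0:
        raise ValueError("adjustment must be positive")
    return round(base_loops / adjustment)


def total_samples(frames_per_phoneme, adjustments, base_loops=240):
    """Total output samples when every frame of phoneme n runs the decoder
    for loops_per_frame(base_loops, adjustments[n]) loops."""
    total = 0
    for n_frames, adjustment in zip(frames_per_phoneme, adjustments):
        total += n_frames * loops_per_frame(base_loops, adjustment)
    return total


print(loops_per_frame(240, 1.0))          # 240
print(loops_per_frame(240, 0.8))          # 300
print(total_samples([2, 3], [1.0, 0.8]))  # 2*240 + 3*300 = 1380
```

Because the loop count is chosen per phoneme, frames of different phonemes can be stretched or compressed by different amounts within the same utterance.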

FIG. 5 is a diagram for describing alignment information in which each frame of the acoustic feature information is matched with each phoneme included in the text, according to an embodiment of the present disclosure.

Referring to FIG. 5, the alignment information in which each frame of the acoustic feature information is matched with each phoneme included in the text may have a size of (N, T). Here, N may denote the total number of phonemes included in the text 210, and T may denote the number of frames of the acoustic feature information 220 corresponding to the text 210.

If $A_{n,t}$ is defined as the weight of the n-th phoneme at the t-th frame in the acoustic feature information 220, then $\sum_{n=1}^{N} A_{n,t} = 1$ may hold.

In the alignment information, the phoneme $P_t$ mapped to the t-th frame may be given by Equation 2 below, where $A_{n,t}$ is the weight of the n-th phoneme at the t-th frame.

$$P_t = \arg\max_{n} A_{n,t} \tag{2}$$

That is, referring to Equation 2, the phoneme $P_t$ mapped to the t-th frame may be the phoneme whose weight $A_{n,t}$ at the t-th frame is the largest.

The length of the phoneme corresponding to $P_t$, the phoneme mapped to the t-th frame, can then be identified from the frames mapped to it. That is, if the length of the n-th phoneme is defined as $d_n$, the length of the n-th phoneme may be given by Equation 3.

$$d_n = \sum_{t=1}^{T} \mathbf{1}[P_t = n] \tag{3}$$

That is, referring to Equation 2, in the alignment information of FIG. 5, the length $d_1$ of the first phoneme is 2 and the length $d_2$ of the second phoneme may be 3.
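The mapping and length computation described above can be sketched in a few lines. The sketch assumes the alignment information is an N × T matrix of weights, that each frame is mapped to the phoneme with the largest weight, and that a phoneme's length is the number of frames mapped to it. The matrix `toy` is invented for illustration and is not the data of FIG. 5; the function names are hypothetical, and phoneme indices are 0-based in the code.

```python
def frame_to_phoneme(alignment):
    """For each frame t, return the index n of the phoneme with the largest
    weight alignment[n][t] (argmax over phonemes)."""
    n_phonemes = len(alignment)
    n_frames = len(alignment[0])
    return [max(range(n_phonemes), key=lambda n: alignment[n][t])
            for t in range(n_frames)]


def phoneme_lengths(alignment):
    """Length of each phoneme = number of frames mapped to it."""
    lengths = [0] * len(alignment)
    for n in frame_to_phoneme(alignment):
        lengths[n] += 1
    return lengths


# Invented 3-phoneme x 5-frame alignment; each column sums to 1.
toy = [
    [0.9, 0.8, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.9, 0.7, 0.6],
    [0.0, 0.0, 0.0, 0.3, 0.4],
]
print(frame_to_phoneme(toy))  # [0, 0, 1, 1, 1]
print(phoneme_lengths(toy))   # [2, 3, 0]
```

In this toy matrix the third phoneme never has the largest weight in any frame, so it ends up with length 0, which is the kind of unmapped phoneme that needs separate handling.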

Meanwhile, there may be phonemes that are not mapped by the max value, as in the boxed region of FIG. 5. For example, in a TTS model that uses the first neural network model 10, special symbols may be used as phonemes. Such special symbols sometimes produce a pause, but in other cases they only affect the preceding and following prosody and are not actually uttered. In such cases, there may be phonemes that are not mapped to any frame, as in the boxed region of FIG. 5.

In this case, the length of each unmapped phoneme may be assigned as in Equation 4. That is, between the frames in question, the lengths of the consecutive phonemes starting from the n-th phoneme may be as given in Equation 4. Here, the number of phonemes spanned may be a value greater than 1.

Referring to Equation 4, the lengths of two unmapped phonemes in the alignment information of FIG. 5 may each be 0.5.

As described above, the lengths of the phonemes included in the acoustic feature information 220 can be identified from the alignment information, and the utterance speed of each phoneme can be identified from the phoneme lengths.

Specifically, the utterance speed $X_n$ at the n-th phoneme included in the acoustic feature information 220 may be given by Equation 5, where $d_n$ is the length of the n-th phoneme in frames.

$$X_n = \frac{1}{r \times \text{frame-length} \times d_n} \tag{5}$$

In Equation 5, r may be the reduction factor of the first neural network model 10. For example, when r is 1 and frame-length is 10 ms, $X_1$ is 50 and $X_2$ may be 33.3.
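The figures in this example can be reproduced with a short sketch. It assumes the utterance speed of a phoneme is the reciprocal of its duration in seconds, that is, 1 / (r × frame length × length in frames), which matches the stated values of 50 and 33.3 for lengths of 2 and 3 frames. The function name `utterance_speed` and its parameter names are hypothetical.

```python
def utterance_speed(length_frames: float,
                    frame_length_sec: float = 0.010,
                    reduction_factor: int = 1) -> float:
    """Utterance speed in phonemes per second for a phoneme that spans
    length_frames frames (reciprocal of its duration in seconds)."""
    if length_frames <= 0:
        raise ValueError("length_frames must be positive")
    return 1.0 / (reduction_factor * frame_length_sec * length_frames)


print(round(utterance_speed(2), 1))  # 50.0
print(round(utterance_speed(3), 1))  # 33.3
```

A phoneme that occupies more frames therefore has a lower utterance speed.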

However, since the utterance speed for a single phoneme is a speed over a short section, predicting the utterance speed over too short a section reduces the length differences between phonemes and may produce an unnatural result. Predicting over too short a section also makes the predicted utterance speed change too quickly along the time axis, which may likewise produce an unnatural result. Conversely, when the average utterance speed is predicted over too long a section, it may be difficult to reflect a text in which slow and fast utterances occur together. In addition, in a streaming structure, the identified utterance speed is a speed estimate for an utterance that has already been output, so utterance speed adjustment may be delayed. A method for measuring the average utterance speed over an appropriate section is therefore needed, and is described below with reference to FIGS. 6 and 7.

FIG. 6 is a diagram for describing a method of identifying an average utterance speed for each phoneme included in the acoustic feature information, according to a first embodiment of the present disclosure.

Referring to embodiment 610 of FIG. 6, the electronic device 100 may calculate the average of the utterance speeds of the most recent M phonemes included in the acoustic feature information 220. For example, when n < M, the average utterance speed may be calculated by averaging only the elements that are available.

Also, for example, when M = 5, as in embodiment 620 of FIG. 6, the average utterance speed $\bar{X}_3$ for the third phoneme may be calculated as the average of $X_1$, $X_2$, and $X_3$. Likewise, the average utterance speed $\bar{X}_5$ for the fifth phoneme may be calculated as the average of $X_1$ through $X_5$.

The method of calculating the average utterance speed for each phoneme according to embodiments 610 and 620 of FIG. 6 may be referred to as the Simple Moving Average method.
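A minimal sketch of the Simple Moving Average described above follows. It averages the utterance speeds of the most recent M phonemes and, when fewer than M phonemes are available, averages only those. The function name and the sample speeds are invented for illustration.

```python
def simple_moving_average(speeds, window):
    """Average utterance speed per phoneme over the most recent `window`
    phonemes; early phonemes use only the values available so far."""
    averages = []
    for n in range(len(speeds)):
        start = max(0, n - window + 1)
        recent = speeds[start:n + 1]
        averages.append(sum(recent) / len(recent))
    return averages


speeds = [50.0, 30.0, 40.0, 20.0, 10.0, 60.0]  # invented per-phoneme speeds
print(simple_moving_average(speeds, 5))
# [50.0, 40.0, 40.0, 35.0, 30.0, 32.0]
```

With M = 5, the third value averages the first three speeds and the fifth value averages the first five, as in the example in the text; the sixth value drops the first speed from the window.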

FIG. 7 is a diagram for describing a method of identifying an average utterance speed for each phoneme included in the acoustic feature information, according to a second embodiment of the present disclosure.

FIG. 7 shows a formula for describing an embodiment in which the average utterance speed for each phoneme is identified by an Exponential Moving Average (EMA) method, according to an embodiment of the present disclosure.

That is, according to the EMA method shown in the formula of FIG. 7, the weight applied to an utterance speed decreases exponentially the farther its phoneme is from the current phoneme, so that the average over a section of appropriate length can be calculated.

Here, the larger the value of α in FIG. 7, the shorter the section over which the average utterance speed is calculated, and the smaller the value of α, the longer the section over which it is calculated. Accordingly, the electronic device 100 may calculate the current average utterance speed in real time by selecting a value of α suited to the situation.
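The formula of FIG. 7 is not reproduced in the text, so the sketch below uses the standard recursive form of an exponential moving average as an assumption: the average at the current phoneme is α times the current utterance speed plus (1 − α) times the previous average. This form has the property described above, since a larger α weights recent phonemes more heavily. The function name and sample values are invented.

```python
def exponential_moving_average(speeds, alpha):
    """Exponential moving average of per-phoneme utterance speeds.

    Standard recursive form (an assumption; FIG. 7 may differ in detail):
        avg[0] = speeds[0]
        avg[n] = alpha * speeds[n] + (1 - alpha) * avg[n - 1]
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    averages = []
    for speed in speeds:
        if not averages:
            averages.append(speed)
        else:
            averages.append(alpha * speed + (1.0 - alpha) * averages[-1])
    return averages


print(exponential_moving_average([50.0, 30.0, 40.0], 0.5))
# [50.0, 40.0, 40.0]
```

Because each update needs only the previous average and the current speed, this form suits a streaming setting in which phonemes arrive one at a time.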

FIG. 8 is a diagram for describing a method of identifying a reference utterance speed, according to an embodiment of the present disclosure.

FIG. 8 illustrates a method of training a third neural network model that obtains a reference utterance speed corresponding to each phoneme included in the acoustic feature information 220, according to an embodiment of the present disclosure.

For example, the third neural network model may be trained based on sample data (e.g., sample text and sample voice data). For example, the sample data may be the sample data used to train the first neural network model 10.

Acoustic feature information corresponding to the sample voice data may be extracted from the sample voice data, and the utterance speed of each phoneme included in the sample voice data may be identified as shown in FIG. 8. The third neural network model may then be trained based on the sample text and the utterance speed of each phoneme included in the sample voice data.

That is, the third neural network model may be trained to estimate the section average utterance speed of sample acoustic feature information based on the sample acoustic feature information and the sample text corresponding to it. Here, the third neural network model may be implemented as a statistical model capable of estimating the section average utterance speed, such as a Hidden Markov Model (HMM) or a Deep Neural Network (DNN) model.

The electronic device 100 may then identify the reference utterance speed for each phoneme included in the acoustic feature information 220 by using the trained third neural network model, the text 210, and the alignment information 400.

FIG. 9 is a flowchart for describing an operation of an electronic device, according to an embodiment of the present disclosure.

Referring to FIG. 9, the electronic device 100 may obtain text (S910). Here, the text may include various kinds of text, such as text input by a user of the electronic device 100, text provided by a voice recognition system (e.g., Bixby) of the electronic device, and text received from an external server.

The electronic device 100 may input the text to the first neural network model to obtain acoustic feature information corresponding to the text and alignment information in which each frame of the acoustic feature information is matched with each phoneme included in the text (S920). For example, the alignment information may be matrix information having a size of (N, T), as described with reference to FIG. 5.

The electronic device 100 may identify the utterance speed of the acoustic feature information based on the obtained alignment information (S930). Specifically, the electronic device 100 may identify the utterance speed of each phoneme included in the acoustic feature information based on the obtained alignment information. Here, the utterance speed of a phoneme may be the utterance speed corresponding to that single phoneme, but is not limited thereto. That is, the utterance speed of a phoneme may be an average utterance speed that also takes into account the utterance speeds of at least one phoneme preceding it.

The electronic device 100 may identify a reference utterance speed for each phoneme included in the acoustic feature information based on the text and the acoustic feature information (S940). Here, the reference utterance speed may be identified in various ways, as described with reference to FIG. 1.

For example, the electronic device 100 may obtain a first reference utterance speed for each phoneme included in the acoustic feature information based on the obtained text and the sample data used to train the first neural network model.

For example, the electronic device 100 may obtain evaluation information on the sample data used to train the first neural network model. For example, the electronic device 100 may provide voice data from the sample data to a user and then receive evaluation information as feedback on it. The electronic device 100 may then obtain a second reference utterance speed for each phoneme included in the acoustic feature information based on the first reference utterance speed and the evaluation information.

The electronic device 100 may then identify the reference utterance speed for each phoneme included in the acoustic feature information based on at least one of the first reference utterance speed and the second reference utterance speed.

The electronic device 100 may obtain utterance speed adjustment information based on the utterance speed of the acoustic feature information and the reference utterance speed (S950). Specifically, if the utterance speed corresponding to the n-th phoneme is Xn and the reference utterance speed corresponding to the n-th phoneme is Xrefn, the utterance speed adjustment information Sn corresponding to the n-th phoneme may be defined as (Xrefn / Xn).

The electronic device 100 may then obtain voice data corresponding to the text by inputting the acoustic feature information to the second neural network model, which is set based on the obtained utterance speed adjustment information (S960).

Specifically, the second neural network model may include an encoder that receives the acoustic feature information and a decoder that receives the vector information output from the encoder and outputs voice data. While at least one frame corresponding to a specific phoneme included in the acoustic feature information is input to the second neural network model, the electronic device 100 may identify the number of loops of the decoder included in the second neural network model based on the utterance speed adjustment information corresponding to that phoneme. Then, based on the at least one frame corresponding to that phoneme being input to the second neural network model, the decoder operates for the identified number of loops, and the electronic device 100 may obtain first voice data corresponding to the number of loops.

Specifically, when one of the at least one frame corresponding to a specific phoneme in the acoustic feature information is input to the second neural network model, a number of pieces of second voice data corresponding to the identified number of loops may be obtained. The set of the plurality of pieces of second voice data obtained through the at least one frame corresponding to the specific phoneme may be the first voice data corresponding to that phoneme. That is, the second voice data is voice data corresponding to one frame of the acoustic feature information, and the first voice data may be voice data corresponding to one specific phoneme.

For example, the decoder may obtain voice data of a first frequency (kHz) based on acoustic feature information whose shift size is a first time interval (sec). When the value of the utterance speed adjustment information is the reference value, one frame included in the acoustic feature information is input to the second neural network model, and a number of pieces of second voice data corresponding to the product of the first time interval and the first frequency may be obtained.

FIG. 10 is a block diagram for describing a configuration of an electronic device, according to an embodiment of the present disclosure. Referring to FIG. 10, the electronic device 100 may include a memory 110, a processor 120, a microphone 130, a display 140, a speaker 150, a communication interface 160, and a user interface 170. Since the memory 110 and the processor 120 shown in FIG. 10 are the same as the memory 110 and the processor 120 described with reference to FIG. 1, redundant descriptions are omitted. Depending on the implementation of the electronic device 100, some of the components of FIG. 10 may be removed or other components may be added.

The microphone 130 is a component through which the electronic device 100 receives a voice signal. Specifically, the microphone 130 may receive an external voice signal and convert it into electrical voice data. The microphone 130 may then transfer the processed voice data to the processor 120.

The display 140 is a component through which the electronic device 100 provides information visually. The electronic device 100 may include one or more displays 140 and may display, through the display 140, text to be converted into voice data, a UI for obtaining evaluation information from a user, and the like. The display 140 may be implemented as a liquid crystal display (LCD), a plasma display panel (PDP), organic light emitting diodes (OLED), a transparent OLED (TOLED), a micro LED, or the like. The display 140 may also be implemented as a touch screen capable of detecting a user's touch input, or as a flexible display that can be folded or bent. In particular, the display 140 may visually provide a response corresponding to a command included in a voice signal.

The speaker 150 is a component through which the electronic device 100 provides information aurally. The electronic device 100 may include one or more speakers 150 and may output voice data obtained according to the present disclosure as an audio signal through the speaker 150. Although the component for outputting an audio signal may be implemented as the speaker 150, this is merely one embodiment, and the component may instead be implemented as an output terminal.

The communication interface 160 is a component capable of communicating with an external device. A communication connection between the communication interface 160 and an external device may include communication through a third device (e.g., a repeater, a hub, an access point, a server, or a gateway). Wireless communication may include, for example, cellular communication using at least one of LTE, LTE-A (LTE-Advanced), CDMA (code division multiple access), WCDMA (wideband CDMA), UMTS (universal mobile telecommunications system), WiBro (Wireless Broadband), or GSM (Global System for Mobile Communications). According to an embodiment, wireless communication may include, for example, at least one of WiFi (wireless fidelity), Bluetooth, Bluetooth Low Energy (BLE), Zigbee, NFC (near field communication), Magnetic Secure Transmission, radio frequency (RF), or a body area network (BAN). Wired communication may include, for example, at least one of USB (universal serial bus), HDMI (high definition multimedia interface), RS-232 (recommended standard 232), power line communication, or POTS (plain old telephone service). The network over which wireless or wired communication is performed may include at least one telecommunication network, for example, a computer network (e.g., a LAN or WAN), the Internet, or a telephone network.

In particular, the communication interface 160 may communicate with an external server so that the electronic device 100 can provide a voice recognition function. However, the present disclosure is not limited thereto, and the electronic device 100 may provide the voice recognition function on the device itself without communicating with an external server.

The user interface 170 is a component for receiving a user command for controlling the electronic device 100. In particular, the user interface 170 may be implemented as a device such as a button, a touch pad, a mouse, or a keyboard, or as a touch screen that can perform both the display function described above and an input function. Here, the button may be any of various types of buttons, such as a mechanical button, a touch pad, or a wheel, formed in an arbitrary area of the exterior of the main body of the electronic device 100, such as the front, side, or rear.

The present disclosure should be understood to include various modifications, equivalents, and/or alternatives of the embodiments described herein. In the description of the drawings, like reference numerals may be used for like elements.

In this document, expressions such as "has," "may have," "includes," or "may include" indicate the presence of a corresponding feature (e.g., a numerical value, a function, an operation, or a component such as a part) and do not preclude the presence of additional features.

In this document, expressions such as "A or B," "at least one of A or/and B," or "one or more of A or/and B" may include all possible combinations of the items listed together. For example, "A or B," "at least one of A and B," or "at least one of A or B" may refer to any of the following cases: (1) including at least one A, (2) including at least one B, or (3) including both at least one A and at least one B. Expressions such as "first" and "second" used in this document may modify various elements regardless of order and/or importance; they are used only to distinguish one element from another and do not limit the elements.

When an element (e.g., a first element) is referred to as being "(operatively or communicatively) coupled with/to" or "connected to" another element (e.g., a second element), it should be understood that the element may be connected to the other element directly or through yet another element (e.g., a third element). In contrast, when an element (e.g., a first element) is referred to as being "directly coupled" or "directly connected" to another element (e.g., a second element), it may be understood that no other element (e.g., a third element) exists between the two.

The expression "configured to" used in this document may, depending on the context, be used interchangeably with, for example, "suitable for," "having the capacity to," "designed to," "adapted to," "made to," or "capable of." The term "configured to" does not necessarily mean only "specifically designed to" in hardware. Instead, in some contexts, the expression "a device configured to" may mean that the device is "capable of" an operation together with other devices or components. For example, the phrase "a sub-processor configured to perform A, B, and C" may mean a dedicated processor (e.g., an embedded processor) for performing the corresponding operations, or a general-purpose processor (e.g., a CPU or an application processor) capable of performing the corresponding operations by executing one or more software programs stored in a memory device.

The term "unit" or "module" used in the present disclosure includes a unit composed of hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, part, or circuit. A "unit" or "module" may be an integrally formed part, or a minimum unit or a portion thereof that performs one or more functions. For example, a module may be configured as an application-specific integrated circuit (ASIC).

Various embodiments of the present disclosure may be implemented as software including instructions stored in a machine-readable storage medium, that is, a storage medium readable by a machine (e.g., a computer). The machine is a device that can call the stored instructions from the storage medium and operate according to the called instructions, and may include the stacked display device according to the disclosed embodiments. When an instruction is executed by a processor, the processor may perform the function corresponding to the instruction directly, or by using other components under the control of the processor. An instruction may include code generated or executed by a compiler or an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, "non-transitory" means only that the storage medium does not include a signal and is tangible; it does not distinguish between data being stored semi-permanently and data being stored temporarily in the storage medium.

According to an embodiment, the method according to the various embodiments disclosed in this document may be provided as part of a computer program product. A computer program product may be traded between a seller and a buyer as a commodity. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., compact disc read only memory (CD-ROM)) or online through an application store (e.g., Play Store™). In the case of online distribution, at least part of the computer program product may be at least temporarily stored, or temporarily created, in a storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

Each component (e.g., a module or a program) according to the various embodiments may consist of a single entity or a plurality of entities, and some of the sub-components described above may be omitted, or other sub-components may be further included in the various embodiments. Alternatively or additionally, some components (e.g., modules or programs) may be integrated into a single entity that performs, in the same or a similar manner, the functions performed by each of the corresponding components before integration. Operations performed by a module, a program, or another component according to the various embodiments may be executed sequentially, in parallel, repetitively, or heuristically; at least some operations may be executed in a different order or omitted; or other operations may be added.

100: electronic device
110: memory
120: processor

Claims

A method for controlling an electronic device, the method comprising:
obtaining text;
inputting the text to a first neural network model to obtain acoustic feature information corresponding to the text and alignment information that matches each frame of the acoustic feature information with each phoneme included in the text;
identifying an utterance speed of the acoustic feature information based on the obtained alignment information;
identifying a reference utterance speed for each phoneme included in the acoustic feature information, based on the text and the acoustic feature information;
obtaining utterance speed adjustment information based on the utterance speed of the acoustic feature information and the reference utterance speed; and
obtaining voice data corresponding to the text by inputting the acoustic feature information to a second neural network model set based on the obtained utterance speed adjustment information.

The method of claim 1,
wherein identifying the utterance speed of the acoustic feature information comprises identifying an utterance speed corresponding to a first phoneme included in the acoustic feature information based on the obtained alignment information, and
wherein identifying the reference utterance speed comprises:
identifying the first phoneme included in the acoustic feature information based on the acoustic feature information; and
identifying a reference utterance speed corresponding to the first phoneme based on the text.

The method of claim 2,
wherein identifying the reference utterance speed comprises
obtaining a first reference utterance speed corresponding to the first phoneme based on the text and sample data used for training the first neural network model.

The method of claim 3,
wherein identifying the reference utterance speed further comprises:
obtaining evaluation information on the sample data used for training the first neural network model; and
identifying a second reference utterance speed corresponding to the first phoneme based on the first reference utterance speed and the evaluation information,
wherein the evaluation information is obtained from a user of the electronic device.

The method of claim 4,
further comprising identifying the reference utterance speed corresponding to the first phoneme based on one of the first reference utterance speed and the second reference utterance speed.

The method of claim 2,
wherein identifying the utterance speed corresponding to the first phoneme further comprises
identifying an average utterance speed corresponding to the first phoneme based on the utterance speed corresponding to the first phoneme and an utterance speed corresponding to each of at least one phoneme preceding the first phoneme in the acoustic feature information, and
wherein obtaining the utterance speed adjustment information comprises
obtaining utterance speed adjustment information corresponding to the first phoneme based on the average utterance speed corresponding to the first phoneme and the reference utterance speed corresponding to the first phoneme.

The method of claim 2,
wherein the second neural network model includes an encoder that receives the acoustic feature information and a decoder that receives vector information output from the encoder, and
wherein obtaining the voice data further comprises:
identifying, while at least one frame of the acoustic feature information corresponding to the first phoneme is input to the second neural network model, a number of loops of the decoder included in the second neural network model based on the utterance speed adjustment information corresponding to the first phoneme; and
obtaining, based on the at least one frame corresponding to the first phoneme being input to the second neural network model, a number of pieces of first voice data corresponding to the at least one frame and the identified number of loops,
wherein the first voice data is voice data corresponding to the first phoneme.

The method of claim 7,
wherein, when one of the at least one frame of the acoustic feature information corresponding to the first phoneme is input to the second neural network model, a number of pieces of second voice data corresponding to the number of loops is obtained.

The method of claim 7,
wherein the decoder obtains voice data at a first frequency (kHz) based on acoustic feature information whose shift size is a first time interval (sec), and
wherein, when the value of the utterance speed adjustment information is a reference value, one frame included in the acoustic feature information is input to the second neural network model and a number of pieces of voice data corresponding to the product of the first time interval and the first frequency is obtained.

The method of claim 1,
wherein the utterance speed adjustment information is information on a ratio between the utterance speed of the acoustic feature information and the reference utterance speed.

An electronic device comprising:
a memory storing at least one instruction; and
a processor configured to control the electronic device by executing the at least one instruction stored in the memory,
wherein the processor is configured to:
obtain text;
input the text to a first neural network model to obtain acoustic feature information corresponding to the text and alignment information that matches each frame of the acoustic feature information with each phoneme included in the text;
identify an utterance speed of the acoustic feature information based on the obtained alignment information;
identify a reference utterance speed for each phoneme included in the acoustic feature information, based on the text and the acoustic feature information;
obtain utterance speed adjustment information based on the utterance speed of the acoustic feature information and the reference utterance speed; and
obtain voice data corresponding to the text by inputting the acoustic feature information to a second neural network model set based on the obtained utterance speed adjustment information.

The electronic device of claim 11,
wherein the processor is configured to:
identify an utterance speed corresponding to a first phoneme included in the acoustic feature information based on the obtained alignment information;
identify the first phoneme included in the acoustic feature information based on the acoustic feature information; and
identify a reference utterance speed corresponding to the first phoneme based on the text.

The electronic device of claim 12,
wherein the processor is configured to
obtain a first reference utterance speed corresponding to the first phoneme based on the text and sample data used for training the first neural network model.

The electronic device of claim 13,
wherein the processor is configured to:
obtain evaluation information on the sample data used for training the first neural network model; and
identify a second reference utterance speed corresponding to the first phoneme based on the first reference utterance speed and the evaluation information,
wherein the evaluation information is obtained from a user of the electronic device.

The electronic device of claim 14,
wherein the processor is configured to
identify the reference utterance speed corresponding to the first phoneme based on one of the first reference utterance speed and the second reference utterance speed.

The electronic device of claim 12,
wherein the processor is configured to:
identify an average utterance speed corresponding to the first phoneme based on the utterance speed corresponding to the first phoneme and an utterance speed corresponding to each of at least one phoneme preceding the first phoneme in the acoustic feature information; and
obtain utterance speed adjustment information corresponding to the first phoneme based on the average utterance speed corresponding to the first phoneme and the reference utterance speed corresponding to the first phoneme.

The electronic device of claim 12,
wherein the second neural network model includes an encoder that receives the acoustic feature information and a decoder that receives vector information output from the encoder, and
wherein the processor is configured to:
identify, while at least one frame of the acoustic feature information corresponding to the first phoneme is input to the second neural network model, a number of loops of the decoder included in the second neural network model based on the utterance speed adjustment information corresponding to the first phoneme; and
obtain, based on the at least one frame corresponding to the first phoneme being input to the second neural network model, a number of pieces of first voice data corresponding to the at least one frame and the identified number of loops,
wherein the first voice data is voice data corresponding to the first phoneme.

The electronic device of claim 17,
wherein, when one of the at least one frame of the acoustic feature information corresponding to the first phoneme is input to the second neural network model, a number of pieces of second voice data corresponding to the number of loops is obtained.

The electronic device of claim 17,
wherein the decoder obtains voice data at a first frequency (kHz) based on acoustic feature information whose shift size is a first time interval (sec), and
wherein, when the value of the utterance speed adjustment information is a reference value, one frame included in the acoustic feature information is input to the second neural network model and a number of pieces of voice data corresponding to the product of the first time interval and the first frequency is obtained.

The electronic device of claim 11,
wherein the utterance speed adjustment information is information on a ratio between the utterance speed of the acoustic feature information and the reference utterance speed.