WO2019139430A1

WO2019139430A1 - Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium

Info

Publication number: WO2019139430A1
Application number: PCT/KR2019/000512
Authority: WO
Inventors: 김태수; 이영근
Original assignee: 네오사피엔스 주식회사
Priority date: 2018-01-11
Filing date: 2019-01-11
Publication date: 2019-07-18

Abstract

The present disclosure relates to a method for synthesizing speech from text and an apparatus for reproducing the synthesized speech. A text-to-speech synthesis method using machine learning comprises the steps of: generating a single artificial neural network text-to-speech synthesis model by performing machine learning on the basis of multiple learning texts and speech data corresponding to the multiple learning texts; receiving input text; receiving an utterer's articulatory characteristics; and generating output speech data which corresponds to the input text and reflects the utterer's articulatory characteristics, by inputting the utterer's articulatory characteristics to the single artificial neural network text-to-speech synthesis model.

Description

Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium

The present disclosure relates to a method for receiving input text and synthesizing speech for the input text, and to an apparatus for reproducing the synthesized speech.

Speech is one of the most basic and effective tools for conveying a person's intentions. Voice-based communication provides intuitive and convenient services to users, and some devices use a voice user interface that allows interaction by voice. A simple way to implement a voice response in a conventional voice user interface is audio recording, but this has the limitation that only recorded speech can be used. Such a device is inflexible because it cannot provide a response for speech that has not been recorded. For example, AI agents such as Apple Siri and Amazon Alexa must be able to generate a variety of sentences to answer user queries, because those queries may be arbitrary. Recording every possible response for such applications would require significant time and expense. In this environment, many researchers are trying to create natural and fast speech synthesis models. Text-to-speech synthesis (TTS), which generates speech from text, has been studied extensively.

In general, TTS technology includes various speech synthesis methods such as concatenative TTS and parametric TTS. For example, concatenative TTS cuts speech into very short units such as phonemes, stores the units in advance, and synthesizes speech by combining the units that make up the sentence to be synthesized. Parametric TTS expresses the characteristics of speech as parameters and uses a vocoder to synthesize the parameters that represent the speech features of the sentence into the corresponding speech.

Recently, speech synthesis methods based on artificial neural networks (for example, deep neural networks) have been actively studied, and speech synthesized by these methods has more natural characteristics than speech produced by existing methods. However, to provide a speech synthesis service for a new speaker using an artificial-neural-network-based method, a large amount of data corresponding to that speaker's voice is required, and the artificial neural network model must be retrained using that data.

The method and apparatus according to the present disclosure provide output speech data for input text that reflects the vocal characteristics of a new speaker, without requiring a large amount of data or information about the new speaker. In addition, the method and apparatus of the present disclosure can extend a speech synthesis service to a new speaker without additional machine learning.

A text-to-speech synthesis method using machine learning according to an embodiment of the present disclosure includes: generating a single artificial neural network text-to-speech synthesis model by performing machine learning based on a plurality of training texts and speech data corresponding to the plurality of training texts; receiving input text; receiving a speaker's vocal characteristics; and inputting the speaker's vocal characteristics into the single artificial neural network text-to-speech synthesis model to generate output speech data for the input text that reflects the speaker's vocal characteristics.

In the text-to-speech synthesis method using machine learning according to an embodiment of the present disclosure, the step of receiving the speaker's vocal characteristics may include receiving a speech sample and extracting, from the speech sample, an embedding vector representing the speaker's vocal characteristics.

In the text-to-speech synthesis method using machine learning according to an embodiment of the present disclosure, the step of extracting the embedding vector representing the speaker's vocal characteristics from the speech sample may include extracting a first sub-embedding vector representing a prosodic feature of the speaker, where the prosodic feature includes at least one of information on speech rate, information on pronunciation stress, information on pause intervals, or information on pitch. The step of generating the output speech data for the input text that reflects the speaker's vocal characteristics may include inputting the first sub-embedding vector representing the prosodic feature into the single artificial neural network text-to-speech synthesis model to generate output speech data for the input text that reflects the speaker's prosodic feature.

In the text-to-speech synthesis method using machine learning according to an embodiment of the present disclosure, the step of extracting the embedding vector representing the speaker's vocal characteristics from the speech sample may include extracting a second sub-embedding vector representing an emotion feature of the speaker, where the emotion feature includes information about the emotion contained in the speaker's utterance. The step of generating the output speech data for the input text that reflects the speaker's vocal characteristics may include inputting the second sub-embedding vector representing the emotion feature into the single artificial neural network text-to-speech synthesis model to generate output speech data for the input text that reflects the speaker's emotion feature.

In the text-to-speech synthesis method using machine learning according to an embodiment of the present disclosure, the step of extracting the embedding vector representing the speaker's vocal characteristics from the speech sample may include extracting a third sub-embedding vector representing features of the speaker's tone color and pitch. The step of generating the output speech data for the input text that reflects the speaker's vocal characteristics may include inputting the third sub-embedding vector representing the features of the speaker's tone color and pitch into the single artificial neural network text-to-speech synthesis model to generate output speech data for the input text that reflects the features of the speaker's tone color and pitch.

In the text-to-speech synthesis method using machine learning according to an embodiment of the present disclosure, the step of generating the output speech data for the input text that reflects the speaker's vocal characteristics may include receiving an additional input for the output speech data, modifying the embedding vector representing the speaker's vocal characteristics based on the additional input, and inputting the modified embedding vector into the single artificial neural network text-to-speech synthesis model to convert the output speech data into speech data for the input text that reflects the information contained in the additional input.

In the text-to-speech synthesis method using machine learning according to an embodiment of the present disclosure, the additional input for the output speech data may include at least one of information on gender, information on age, information on regional intonation, information on speech rate, information on the pitch of the utterance, or information on the loudness of the utterance.

In the text-to-speech synthesis method using machine learning according to an embodiment of the present disclosure, the step of receiving the speech sample may include receiving, in real time, speech input from the speaker within a predetermined time period as the speech sample.

In the text-to-speech synthesis method using machine learning according to an embodiment of the present disclosure, the step of receiving the speech sample may include receiving, from a speech database, speech that was input by the speaker within a predetermined time period.

In addition, a program for implementing the text-to-speech synthesis method using the above-described machine learning may be recorded in a computer-readable recording medium.

Further, apparatuses and technical means related to the text-to-speech synthesis method using the above-described machine learning can also be disclosed.

FIG. 1 is a diagram of a text-to-speech synthesis terminal according to an embodiment of the present disclosure.

FIG. 2 is a block diagram of a text-to-speech synthesizer according to an embodiment of the present disclosure.

FIG. 3 is a flow diagram illustrating a text-to-speech synthesis method in accordance with one embodiment of the present disclosure.

FIG. 4 is a block diagram of a text-to-speech synthesizer according to an embodiment of the present disclosure.

FIG. 5 is a diagram showing a configuration of a text-to-speech synthesizer based on an artificial neural network.

FIG. 6 is a diagram illustrating a configuration of a text-to-speech synthesizer based on an artificial neural network according to an embodiment of the present disclosure.

FIG. 7 is a diagram illustrating a network for extracting embedding vectors representing vocal characteristics that can distinguish each of a plurality of speakers in accordance with one embodiment of the present disclosure.

FIG. 8 is a diagram illustrating a configuration of a text-to-speech synthesizer based on an artificial neural network according to an embodiment of the present disclosure.

FIG. 9 is a flowchart illustrating an operation of a vocal characteristic adjusting unit according to an embodiment of the present disclosure.

FIG. 10 illustrates an example of a user interface that alters the characteristics of the output speech in accordance with one embodiment of the present disclosure.

FIG. 11 is a block diagram of a text-to-speech synthesis system in accordance with one embodiment of the present disclosure.

The advantages and features of the disclosed embodiments, and the methods of achieving them, will become apparent from the embodiments described below with reference to the accompanying drawings. However, the present disclosure is not limited to the embodiments disclosed below and may be embodied in many different forms; these embodiments are provided only so that the present disclosure is complete and fully conveys the scope of the invention to a person of ordinary skill in the art.

The terms used in this specification will be briefly described, and the disclosed embodiments will be described in detail.

The terms used in this specification are to be understood as they are understood by those of ordinary skill in the art and are not intended to limit the scope of the present invention. In certain cases a term may have been selected arbitrarily by the applicant, in which case its meaning is described in detail in the corresponding part of the description. Accordingly, the terms used in this disclosure should be defined based on their meaning and on the content of the present disclosure as a whole, rather than on their names alone.

The singular expressions herein include plural referents unless the context clearly dictates otherwise. Also, plural expressions include singular expressions unless the context clearly dictates otherwise.

Throughout the specification, when a part is said to "include" an element, this means that the part may include other elements as well.

In addition, the term "part" used in the specification means a software or hardware component. However, "part" is not limited to software or hardware. A "part" may be configured to reside on an addressable storage medium and may be configured to run on one or more processors. Thus, by way of example and not limitation, a "part" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functions provided in the components and "parts" may be combined into a smaller number of components and "parts" or further separated into additional components and "parts".

In accordance with one embodiment of the present disclosure, a "part" may be embodied as a processor and a memory. The term "processor" should be broadly interpreted to include a general-purpose processor, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a controller, a microcontroller, and the like. In some circumstances, a "processor" may refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field-programmable gate array (FPGA), and the like. The term "processor" may also refer to a combination of processing devices, such as a combination of a DSP and a microprocessor, a combination of a plurality of microprocessors, or a combination of one or more microprocessors in conjunction with a DSP core.

The term "memory" should be broadly interpreted to include any electronic component capable of storing electronic information. The term "memory" may refer to various types of processor-readable media such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), flash memory, magnetic or optical data storage devices, registers, and the like. A memory is said to be in electronic communication with a processor if the processor can read information from and/or write information to the memory. A memory integrated in a processor is in electronic communication with the processor.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily carry out the present invention. In order to clearly explain the present disclosure in the drawings, portions not related to the description will be omitted.

FIG. 1 is a diagram of a text-to-speech synthesis terminal 100 in accordance with an embodiment of the present disclosure.

The text-to-speech synthesis terminal 100 may include at least one processor and a memory. For example, the text-to-speech synthesis terminal 100 may be implemented in a smartphone, a computer, a mobile phone, or the like. The text-to-speech synthesis terminal 100 may include a communication unit and communicate with an external device (e.g., a server device).

The text-to-speech synthesis terminal 100 may receive text input and a specific speaker input from the user 110. For example, as shown in FIG. 1, the text-to-speech synthesis terminal 100 may receive "How are you?" as text input. Also, the text-to-speech synthesis terminal 100 may receive "Person 1" as a speaker input. Here, "Person 1" may represent the utterance characteristics of a preset speaker, i.e., "Person 1". The text-to-speech synthesis terminal 100 may be configured to preset at least one utterance characteristic (e.g., "Person 1") among the utterance characteristics of a plurality of people. For example, the utterance characteristics of the plurality of people may be received from an external device, such as a server device, through the communication unit. FIG. 1 shows a user interface for specifying a preset speaker, but the present disclosure is not limited thereto. The user may provide speech for a specific text to the text-to-speech synthesis terminal 100, and the text-to-speech synthesis terminal 100 may extract the utterance characteristics of the received speech and display them so that the user's utterance characteristics can be selected for speech synthesis. For example, the utterance characteristics may be extracted from the received speech and represented by an embedding vector.

The text-to-speech synthesis terminal 100 may be configured to output speech data for the input text in which the utterance characteristics of the designated speaker are reflected. For example, as shown in FIG. 1, when output speech data is generated for the input text "How are you?", the utterance characteristics of the selected "Person 1" may be reflected in the output speech data. Here, the utterance characteristics of a specific speaker may include not only imitation of the speaker's voice but also at least one of various factors that can make up the utterance, such as style, prosody, emotion, tone color, and pitch. To generate such output speech data, the text-to-speech synthesis terminal 100 may provide the input text and the designated speaker to a text-to-speech synthesis apparatus and may receive synthesized speech data from the text-to-speech synthesis apparatus (e.g., speech data for "How are you?" in which the utterance characteristics of "Person 1" are reflected). The text-to-speech synthesis apparatus is described in more detail below. The text-to-speech synthesis terminal 100 can output the synthesized speech data to the user 110. Alternatively, the text-to-speech synthesis terminal 100 may be configured to include the text-to-speech synthesis apparatus.

FIG. 2 is a block diagram of a text-to-speech synthesis apparatus 200 according to an embodiment of the present disclosure.

The data learning unit (not shown) and the data recognition unit (not shown) used by the text-to-speech synthesis apparatus 200 of FIG. 2 may have the same or a similar configuration as those of the text-to-speech synthesis apparatus 1100. The text-to-speech synthesis apparatus 200 includes an utterance feature extraction unit 210, an utterance characteristic adjuster 220, a voice database 230, an encoder 240, a decoder 250, a post-processor 260, and a communication unit 270.

According to one embodiment, the utterance feature extraction unit 210 may be configured to receive a speaker's speech signal (e.g., a voice sample) and extract the speaker's utterance characteristics from the received speech signal. Here, the received speech signal or sample may include speech spectral data representing information related to the speaker's utterance characteristics. Any known, appropriate feature extraction method capable of extracting utterance characteristics from a speaker's speech signal can be used. For example, a speech processing method such as the mel-frequency cepstrum (MFC) can be used to extract utterance characteristics from the received speech signal or sample. Alternatively, speech samples may be input to a trained utterance feature extraction model (e.g., an artificial neural network) to extract the utterance characteristics. For example, the extracted utterance characteristics of the speaker can be represented by an embedding vector. According to another embodiment, the utterance feature extraction unit 210 can receive at least one of text and video, and can be configured to extract the speaker's utterance characteristics from the received text and video. The extracted utterance characteristics of the speaker may be provided to at least one of the encoder 240 or the decoder 250.
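As a rough illustration of the mel-frequency cepstrum option mentioned above, the following Python sketch computes simplified mel-frequency cepstral coefficients per frame and averages them over time into one fixed-length vector. It is only a sketch under stated assumptions: the function names (`mfcc_frame`, `utterance_embedding`), the frame length, the filter count, and the time-averaging step are hypothetical choices made for illustration, and the disclosure does not specify them. Windowing and pre-emphasis are omitted for brevity.

```python
import math
from typing import List


def hz_to_mel(f: float) -> float:
    """Convert a frequency in Hz to the mel scale (common 2595*log10 variant)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame: List[float]) -> List[float]:
    """Naive DFT power spectrum for bins 0..N/2 (fine for short frames)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec


def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> List[List[float]]:
    """Triangular filters spaced evenly on the mel scale from 0 Hz to Nyquist."""
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(i * top / (n_filters + 1)) for i in range(n_filters + 2)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * sample_rate / n_fft
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc_frame(frame: List[float], sample_rate: int,
               n_filters: int = 8, n_coeffs: int = 4) -> List[float]:
    """Log mel-filterbank energies followed by a DCT-II (simplified MFCC)."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    log_e = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10) for row in bank]
    return [
        sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters) for j, e in enumerate(log_e))
        for c in range(n_coeffs)
    ]


def utterance_embedding(samples: List[float], sample_rate: int,
                        frame_len: int = 64) -> List[float]:
    """Hypothetical fixed-length vector: mean of per-frame coefficients."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if not frames:
        raise ValueError("speech sample is shorter than one frame")
    coeffs = [mfcc_frame(f, sample_rate) for f in frames]
    return [sum(c[i] for c in coeffs) / len(coeffs) for i in range(len(coeffs[0]))]
```

A trained extraction network, which the paragraph above names as the alternative, would replace `utterance_embedding` while keeping the same interface: a speech sample goes in, a fixed-length vector comes out.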

According to one embodiment, the speaker's utterance characteristics extracted by the utterance feature extraction unit 210 may be stored in a storage medium (e.g., the voice database 230) or an external storage device. Accordingly, when speech is synthesized for the input text, the utterance characteristics of one or more speakers can be selected or designated from among the utterance characteristics of a plurality of speakers stored in advance in the storage medium, and the selected or designated utterance characteristics can be used for speech synthesis.
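The store-then-select behavior described above can be sketched as a small in-memory lookup. The class name `SpeakerEmbeddingStore` and its methods are hypothetical; the disclosure only says that extracted characteristics are kept in a storage medium and later selected or designated by speaker.

```python
from typing import Dict, List


class SpeakerEmbeddingStore:
    """Hypothetical in-memory stand-in for the storage medium (e.g., voice database 230)."""

    def __init__(self) -> None:
        self._embeddings: Dict[str, List[float]] = {}

    def save(self, speaker_id: str, embedding: List[float]) -> None:
        # Store a copy so later changes by the caller do not alter the stored vector.
        self._embeddings[speaker_id] = list(embedding)

    def select(self, speaker_id: str) -> List[float]:
        # Return the stored utterance characteristics for the designated speaker.
        if speaker_id not in self._embeddings:
            raise KeyError(f"no stored utterance characteristics for {speaker_id!r}")
        return list(self._embeddings[speaker_id])

    def speakers(self) -> List[str]:
        # List the speakers that can be offered for selection.
        return sorted(self._embeddings)
```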

The utterance characteristic adjuster 220 may be configured to adjust the utterance characteristics of the speaker. According to one embodiment, the utterance characteristic adjuster 220 may receive information for adjusting the speaker's utterance characteristics. For example, the information for adjusting the speaker's utterance characteristics may be input by a user to the utterance characteristic adjuster 220. Based on the information received from the user, the utterance characteristic adjuster 220 can adjust the speaker's utterance characteristics extracted by the utterance feature extraction unit 210.
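One possible way to picture such an adjustment, assuming the utterance characteristics are held as an embedding vector, is to shift the vector along a per-attribute direction by a user-chosen amount. The sketch below is an assumption made for illustration only: the `directions` table, the attribute names, and the additive update are hypothetical and are not stated in this section.

```python
from typing import Dict, List


def adjust_embedding(embedding: List[float],
                     directions: Dict[str, List[float]],
                     controls: Dict[str, float]) -> List[float]:
    """Shift an embedding along hypothetical attribute directions.

    directions maps an attribute name (e.g., "speech_rate") to a direction vector;
    controls maps the same names to the amount requested by the user.
    """
    adjusted = list(embedding)
    for name, amount in controls.items():
        if name not in directions:
            raise KeyError(f"unknown adjustable attribute: {name!r}")
        direction = directions[name]
        if len(direction) != len(adjusted):
            raise ValueError("direction and embedding must have the same length")
        adjusted = [v + amount * d for v, d in zip(adjusted, direction)]
    return adjusted
```

For example, `adjust_embedding([0.0, 1.0], {"speech_rate": [1.0, 0.0]}, {"speech_rate": 0.5})` moves only the first component, leaving the original vector untouched.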

According to one embodiment, the voice database 230 may store learning texts and speech corresponding to a plurality of learning texts. The learning texts may be written in at least one language and may include at least one of words, phrases, and sentences that a person can understand. In addition, the speech stored in the voice database 230 may include speech data in which a plurality of speakers read the learning texts. The learning texts and speech data may be stored in advance in the voice database 230 or may be received from the communication unit 270. At least one of the encoder 240 and the decoder 250 may include or generate a single artificial neural network text-to-speech synthesis model based on the learning texts and speech stored in the voice database 230. For example, the encoder 240 and the decoder 250 may together constitute a single artificial neural network text-to-speech synthesis model.

According to one embodiment, the voice database 230 may be configured to store the utterance characteristics of one or more speakers extracted by the utterance feature extraction unit 210. The stored utterance characteristics (e.g., an embedding vector representing a speaker's utterance characteristics) may be provided to at least one of the encoder 240 or the decoder 250 during speech synthesis.

In addition, the encoder 240 can receive the input text and can be configured to convert the input text into character embeddings. The character embeddings may be input into the single artificial neural network text-to-speech synthesis model (e.g., pre-net, CBHG module, DNN, CNN + DNN, etc.) to generate the hidden states of the encoder 240. According to one embodiment, the encoder 240 further receives the speaker's utterance characteristics from at least one of the utterance feature extraction unit 210 or the utterance characteristic adjuster 220, and inputs the character embeddings and the speaker's utterance characteristics into the single artificial neural network text-to-speech synthesis model (e.g., pre-net, CBHG module, DNN, CNN + DNN, etc.) to generate the hidden states of the encoder 240. The hidden states of the encoder 240 generated in this way may be provided to the decoder 250.
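The input side of the encoder described above can be sketched as a character lookup followed by attaching the speaker's embedding to every character embedding. This is a minimal sketch under assumptions: the randomly initialized table stands in for learned character embeddings, concatenation is one possible way of combining the two inputs, and the names `build_char_table` and `encoder_inputs` are hypothetical. The network that turns these rows into hidden states (pre-net, CBHG, and so on) is not shown.

```python
import random
from typing import Dict, List


def build_char_table(alphabet: str, dim: int, seed: int = 0) -> Dict[str, List[float]]:
    """Stand-in for a learned character-embedding table (random values here)."""
    rng = random.Random(seed)
    return {ch: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for ch in alphabet}


def encoder_inputs(text: str,
                   char_table: Dict[str, List[float]],
                   speaker_embedding: List[float]) -> List[List[float]]:
    """Return one row per character: character embedding + speaker embedding.

    Raises KeyError for characters that are missing from the table.
    """
    return [char_table[ch] + list(speaker_embedding) for ch in text]
```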

The decoder 250 may be configured to receive the speaker's utterance characteristics. The decoder 250 can receive the speaker's utterance characteristics from at least one of the utterance feature extraction unit 210 and the utterance characteristic adjuster 220. However, the present disclosure is not limited thereto, and the decoder 250 can also receive the speaker's utterance characteristics from the communication unit 270 or an input/output unit (I/O unit; not shown).

The decoder 250 may receive hidden states corresponding to the input text from the encoder 240. According to one embodiment, the decoder 250 may include an attention module configured to determine from which part of the input text speech is to be generated at the current time-step.
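The role of the attention module can be illustrated with a single attention step: the decoder's current state is scored against every encoder hidden state, the scores are normalized into weights, and the weights indicate which part of the input text the current time-step focuses on. The sketch below uses dot-product scoring with a softmax as a stand-in, since this section does not specify the scoring function; `attention_step` is a hypothetical name.

```python
import math
from typing import List, Tuple


def attention_step(query: List[float],
                   encoder_states: List[List[float]]) -> Tuple[List[float], List[float]]:
    """One attention step: return (weights over input positions, context vector)."""
    # Score each encoder hidden state against the decoder's current state.
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    # Numerically stable softmax over the input positions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the encoder hidden states.
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context
```

The position with the largest weight is the part of the input text that contributes most to the speech generated at this time-step.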

The decoder 250 may generate output speech data corresponding to the input text by inputting the speaker's utterance characteristics and the input text into the single artificial neural network text-to-speech synthesis model. Such output speech data may include synthesized speech data that reflects the speaker's utterance characteristics. According to one embodiment, output speech data that sounds as if a first speaker is reading the input text may be generated based on preset utterance characteristics of the first speaker. For example, the output speech data may be represented by a mel-spectrogram. However, the present disclosure is not limited thereto, and the output speech data may be represented by a linear spectrogram. The output speech data may be output to at least one of a loudspeaker, the post-processor 260, and the communication unit 270.

According to one embodiment, the post-processor 260 may be configured to convert the output speech data generated at the decoder 250 into speech that can be output from a loudspeaker. For example, the converted speech can be represented by a waveform. The post-processor 260 may be configured to operate only when the output speech data generated at the decoder 250 is unsuitable for output from the loudspeaker. That is, if the output speech data generated at the decoder 250 is suitable for output from the loudspeaker, the output speech data can be output directly to the loudspeaker without passing through the post-processor 260. Thus, although the post-processor 260 is shown in FIG. 2 as being included in the text-to-speech synthesizer 200, the post-processor 260 may also be configured not to be included in the text-to-speech synthesizer 200.

According to one embodiment, the post-processor 260 may be configured to convert the output speech data represented by the mel-spectrogram generated in the decoder 250 into a waveform in the time domain. In addition, the post-processor 260 may amplify the output speech data if the magnitude of its signal does not reach a predetermined reference level. The post-processor 260 may output the converted output speech data to at least one of the loudspeaker or the communication unit 270.
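The amplification step can be sketched as follows, assuming that the "magnitude" of the signal is measured as its peak absolute sample value and that the waveform has already been converted to the time domain. Both the peak-based measure and the name `amplify_to_reference` are assumptions made for illustration; the spectrogram-to-waveform conversion itself is not shown.

```python
from typing import List


def amplify_to_reference(samples: List[float], reference_peak: float = 0.5) -> List[float]:
    """Scale a waveform up so its peak reaches reference_peak.

    Signals that already reach the reference level (or are silent) are returned unchanged.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0 or peak >= reference_peak:
        return list(samples)
    gain = reference_peak / peak
    return [s * gain for s in samples]
```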

The communication unit 270 may be configured so that the text-to-speech synthesizer 200 can transmit signals or data to, and receive signals or data from, an external device. The external device may include the text-to-speech synthesis terminal 100. Alternatively, the external device may include another text-to-speech synthesizer. Alternatively, the external device may be any device that includes a voice database.

According to one embodiment, the communication unit 270 can be configured to receive text from an external device. Here, the text may include a learning text to be used for learning of a single artificial neural network text-speech synthesis model. Alternatively, the text may include input text received from a user terminal. This text may be provided to at least one of the encoder 240 or the decoder 250.

In one embodiment, the communication unit 270 can receive the speech characteristics of the speaker from an external device. The communication unit 270 can receive the speech signal or sample of the speaker from the external device and transmit the speech signal to the speech feature extraction unit 210.

The communication unit 270 may receive information input from the user terminal. For example, the communication unit 270 may receive input information for adjusting the speaker's utterance characteristics and provide the received input information to the utterance characteristic adjuster 220.

The communication unit 270 can transmit any signal or data to the external device. For example, the communication unit 270 can transmit information related to the generated output voice, that is, output voice data to an external device. Also, the generated single artificial neural network text-to-speech synthesis model may be transmitted to the text-to-speech synthesis terminal 100 or another text-to-speech synthesis apparatus through the communication unit 270.

According to one embodiment, the text-to-speech synthesizer 200 may further include an input / output unit (not shown). The input / output unit can receive the input directly from the user. Also, the input / output unit may output at least one of voice, image, and text to the user.

First, in step 310, the text-to-speech synthesis apparatus 200 may generate a single artificial neural network text-to-speech synthesis model by performing machine learning based on a plurality of learning texts and speech data corresponding to the plurality of learning texts. In step 320, the text-to-speech synthesis apparatus 200 may receive the input text. In step 330, the text-to-speech synthesis apparatus 200 may receive the utterance characteristics of the speaker. The text-to-speech synthesis apparatus 200 may then generate output speech data for the input text that reflects the speaker's utterance characteristics by inputting the speaker's utterance characteristics into the pre-trained text-to-speech synthesis model.

Hereinafter, the text-speech synthesis method will be described in more detail with reference to FIG.

FIG. 4 is a diagram of a text-to-speech synthesizer 400 according to an embodiment of the present disclosure. The text-to-speech synthesizer 400 of FIG. 4 may have the same or a similar configuration as the text-to-speech synthesizer 200 of FIG. 2. The text-to-speech synthesizer 400 may include an utterance feature extraction unit 410, a voice database 430, a communication unit 470, an encoder 440, and a decoder 450. The utterance feature extraction unit 410 of FIG. 4 may have the same or a similar configuration as the utterance feature extraction unit 210 of FIG. 2. The voice database 430 of FIG. 4 may have the same or a similar configuration as the voice database 230 of FIG. 2. The communication unit 470 of FIG. 4 may have the same or a similar configuration as the communication unit 270 of FIG. 2. The encoder 440 of FIG. 4 may have the same or a similar configuration as the encoder 240 of FIG. 2. The decoder 450 of FIG. 4 may have the same or a similar configuration as the decoder 250 of FIG. 2. Descriptions that overlap between the text-to-speech synthesizer 200 of FIG. 2 and the text-to-speech synthesizer 400 of FIG. 4 are omitted.

According to one embodiment, the text-to-speech synthesizer 400 may receive speech samples or signals of a speaker. For example, the speech samples may be received from a user terminal via the communication unit 470. As another example, the speaker's speech samples or signals may be received from a text-to-speech synthesis terminal that includes a voice database. The speaker's speech samples or signals may be provided to the utterance feature extraction unit 410. The speaker's speech sample or signal may include speech data input from the speaker within a predetermined time period. For example, the predetermined time period may be a relatively short time (e.g., several seconds, tens of seconds, or tens of minutes) in which the speaker can input his or her voice.

According to one embodiment, the text-to-speech synthesizer 400 may be configured to receive the input text that is the subject of speech synthesis. For example, the input text may be received from the user terminal via the communication unit 470. Alternatively, the text-to-speech synthesizer 400 may include an input/output device (not shown) to receive the input text. The received input text may be provided to the utterance feature extraction unit 410.

According to one embodiment, the speech database 430 may be configured to store speech samples or signals of one or more speakers. The speech samples or signals of these speakers may be provided to the speech feature extraction unit 410.

The utterance feature extraction unit 410 may extract an embedding vector representing the speaker's utterance characteristics from the speech sample or signal. The utterance feature extraction unit 410 may include a prosodic feature extraction unit 412, an emotion feature extraction unit 414, and a tone color and pitch extraction unit 416. Although FIG. 4 shows the utterance feature extraction unit 410 as including the prosodic feature extraction unit 412, the emotion feature extraction unit 414, and the tone color and pitch extraction unit 416, the utterance feature extraction unit 410 may include at least one of the prosodic feature extraction unit 412, the emotion feature extraction unit 414, and the tone color and pitch extraction unit 416.

The prosodic feature extraction unit 412 may be configured to extract a first sub-embedding vector that represents a prosodic feature of the speaker. Here, the prosodic feature may include at least one of information on the speaking speed, information on the pronunciation stress, information on the pause interval, and information on the pitch. The extracted first sub-embedding vector representing the prosodic feature of the speaker may be provided to at least one of the encoder 440 or the decoder 450. According to one embodiment, the encoder 440 and the decoder 450 may input the first sub-embedding vector representing the prosodic feature into the single artificial neural network text-to-speech synthesis model to generate output speech data for the input text that reflects the prosodic feature of the speaker.

The emotion feature extraction unit 414 may be configured to extract a second sub-embedding vector representing the emotion feature of the speaker. Here, the emotion feature may include information on the emotion inherent in the utterance contents of the speaker. For example, the emotion feature is not limited to a predetermined emotion, but may include information such as the degree of each of one or more emotions inherent in the speaker's voice and/or a combination of those emotions. The extracted second sub-embedding vector representing the emotion feature of the speaker may be provided to at least one of the encoder 440 or the decoder 450. According to one embodiment, the encoder 440 and the decoder 450 may input the second sub-embedding vector representing the emotion feature into the single artificial neural network text-to-speech synthesis model to generate output speech data for the input text that reflects the emotion feature of the speaker.

The tone color and pitch extraction unit 416 may be configured to extract a third sub-embedding vector representing the tone color and pitch features of the speaker. The extracted third sub-embedding vector representing the tone color and pitch features of the speaker may be provided to at least one of the encoder 440 or the decoder 450. According to one embodiment, the encoder 440 and the decoder 450 may input the third sub-embedding vector representing the tone color and pitch features of the speaker into the single artificial neural network text-to-speech synthesis model to generate output speech data for the input text in which the tone color and pitch features of the speaker are reflected.

According to one embodiment, the encoder 440 may receive the extracted embedding vector representing the utterance characteristics of the speaker. The encoder 440 may generate or update the single artificial neural network text-to-speech synthesis model based on previously machine-learned embedding vectors representing the utterance characteristics of one or more speakers and the received embedding vector representing the utterance characteristics of the speaker, so that the speaker's speech can be synthesized.

In FIG. 4, speech is synthesized by extracting at least one of the emotion feature, the prosodic feature, or the tone color and pitch from a speech sample or signal of one speaker, but the present disclosure is not limited thereto. In another embodiment, at least one of the emotion feature, the prosodic feature, or the tone color and pitch may be extracted from the speech samples or signals of a different speaker. For example, the utterance feature extraction unit 410 may receive a speech sample or signal of a first speaker and extract emotion and prosodic features from it, while also receiving a speech sample or signal of a second speaker (e.g., the voice of a celebrity) and extracting tone color and pitch features from it. The utterance features of the two speakers thus extracted may be provided to at least one of the encoder 440 or the decoder 450 during speech synthesis. Accordingly, the synthesized speech reflects the emotion and prosody of the first speaker, who uttered the speech contained in the first speaker's speech sample or signal, while reflecting the tone color and pitch of the second speaker (e.g., a celebrity), who uttered the speech contained in the second speaker's speech sample or signal.

According to one embodiment, the encoder 510 may be configured to convert text into pronunciation information. The generated pronunciation information may be provided to a decoder 520 including an attention module, and the decoder 520 may be configured to convert this pronunciation information into speech.

The encoder 510 may convert the input text into character embeddings. In the encoder 510, the generated character embeddings may be passed through a pre-net including a fully-connected layer. In addition, the encoder 510 may provide the output of the pre-net to the CBHG module to output the encoder hidden states e_i, as shown in the figure. For example, the CBHG module may include a 1D convolution bank, max pooling, a highway network, and a bidirectional gated recurrent unit (GRU).

The decoder 520 may include a pre-net composed of fully connected layers, an attention RNN (recurrent neural network) including a gated recurrent unit (GRU), and a decoder RNN including a residual GRU. For example, the output from the decoder 520 may be represented as a mel-scale spectrogram.

The attention RNN and decoder RNN of the decoder 520 may receive information corresponding to the speaker of the voice. For example, the decoder 520 may receive the one-hot speaker ID vector 521. The decoder 520 may generate the speaker embedding vector 522 based on the one-hot speaker ID vector 521. The attention RNN and decoder RNN of the decoder 520 may receive the speaker embedding vector 522 and update the single artificial neural network text-to-speech synthesis model so that output speech data may be generated differently for different speakers.
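The step from the one-hot speaker ID vector 521 to the speaker embedding vector 522 can be illustrated with a minimal sketch. It assumes the embedding is a plain lookup table; the names `NUM_SPEAKERS`, `EMBED_DIM`, `speaker_table`, and `speaker_embedding` are hypothetical, and the table is filled with random values here, whereas in a trained model it would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_SPEAKERS = 4   # hypothetical number of speakers in the training data
EMBED_DIM = 8      # hypothetical embedding size

# One row per known speaker. Random stand-in values; learned in practice.
speaker_table = rng.normal(size=(NUM_SPEAKERS, EMBED_DIM))


def one_hot(speaker_id: int, num_speakers: int = NUM_SPEAKERS) -> np.ndarray:
    """Build the one-hot speaker ID vector."""
    vec = np.zeros(num_speakers)
    vec[speaker_id] = 1.0
    return vec


def speaker_embedding(one_hot_id: np.ndarray) -> np.ndarray:
    """Multiplying a one-hot vector by the table selects a single row."""
    return one_hot_id @ speaker_table


emb = speaker_embedding(one_hot(2))
```

Because the table has exactly one row per training speaker, this sketch has no row to select for a speaker outside the training data.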

The decoder 520 may also use a database in which input text, speaker-related information, and the speech signal corresponding to the input text exist as sets, in order to create or update the single artificial neural network text-to-speech synthesis model. The decoder 520 can learn with the input text and the speaker-related information as the inputs of the artificial neural network and the speech signal corresponding to the input text as the correct answer. The decoder 520 may apply the input text and the speaker-related information to the updated single artificial neural network text-to-speech synthesis model to output the speech of the speaker.

The output of the decoder 520 may also be provided to the post-processor 530. The CBHG of the post-processor 530 may be configured to convert the mel-scale spectrogram of the decoder 520 into a linear-scale spectrogram. For example, the output signal of the CBHG of the post-processor 530 may include a magnitude spectrogram. The phase of the output signal of the CBHG of the post-processor 530 may be recovered through the Griffin-Lim algorithm, and the signal may then be subjected to an inverse short-time Fourier transform. The post-processor 530 may output a speech signal in the time domain.
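The phase recovery step can be sketched as follows. This is a minimal Griffin-Lim loop with a hand-rolled STFT and inverse STFT, not the post-processor's actual implementation; the frame size `N_FFT`, the hop `HOP`, the Hann window, the iteration count, and the sine-wave test signal are all assumptions chosen for illustration.

```python
import numpy as np

N_FFT, HOP = 256, 64           # assumed frame length and hop size
WINDOW = np.hanning(N_FFT)


def stft(signal):
    """Windowed frames -> complex spectrogram of shape (frames, bins)."""
    frames = [signal[i:i + N_FFT] * WINDOW
              for i in range(0, len(signal) - N_FFT + 1, HOP)]
    return np.fft.rfft(np.array(frames), axis=1)


def istft(spec, length):
    """Overlap-add inverse with squared-window normalization."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    signal = np.zeros(length)
    norm = np.zeros(length)
    for k, frame in enumerate(frames):
        start = k * HOP
        signal[start:start + N_FFT] += frame
        norm[start:start + N_FFT] += WINDOW ** 2
    return signal / np.maximum(norm, 1e-8)


def griffin_lim(magnitude, length, iterations=30, seed=0):
    """Start from random phase and alternate between time and frequency."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(iterations):
        signal = istft(magnitude * phase, length)
        rebuilt = stft(signal)
        # Keep the target magnitude, adopt the phase of the rebuilt signal.
        phase = rebuilt / np.maximum(np.abs(rebuilt), 1e-8)
    return istft(magnitude * phase, length)


t = np.arange(4096) / 16000.0
original = np.sin(2 * np.pi * 440.0 * t)       # toy stand-in for speech
magnitude = np.abs(stft(original))             # phase is discarded here
recovered = griffin_lim(magnitude, len(original))
```

The recovered waveform generally differs from the original in phase; the loop only looks for a time-domain signal whose magnitude spectrogram is close to the given one.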

The artificial neural network-based text-to-speech synthesizer can be trained by using a large-capacity database in which text and speech signals exist as pairs. A loss function can be defined by comparing the output for the input text with the corresponding correct speech signal. The text-to-speech synthesizer minimizes the loss function through an error back propagation algorithm and finally obtains a single artificial neural network text-to-speech synthesis model that produces the desired speech output when arbitrary text is input.
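The training principle described above (compare the output with the correct signal, then reduce the loss by propagating the error back to the parameters) can be shown on a toy problem. The sketch below assumes a single linear layer and a mean squared error loss, both chosen only to keep the gradient explicit; `X`, `Y`, `W`, and `LR` are hypothetical stand-ins and not part of the disclosed model.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))     # stand-in for encoded text features
true_W = rng.normal(size=(5, 3))
Y = X @ true_W                   # stand-in for the correct speech frames

W = np.zeros((5, 3))             # model parameters to be learned
LR = 0.5                         # assumed learning rate
losses = []
for _ in range(200):
    pred = X @ W                              # forward pass
    err = pred - Y                            # compare with the correct answer
    losses.append(float(np.mean(err ** 2)))   # loss value at this step
    grad = 2.0 * X.T @ err / err.size         # gradient of the loss w.r.t. W
    W -= LR * grad                            # parameter update
```

In a multi-layer network the gradient is obtained layer by layer with the chain rule, which is what the error back propagation algorithm automates.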

In FIG. 6, the contents overlapping with those described in FIG. 5 are omitted. The decoder 620 of FIG. 6 may receive the hidden states e_i of the encoder from the encoder 610. In addition, unlike the decoder 520 of FIG. 5, the decoder 620 of FIG. 6 can receive the speaker's speech data 621. Here, the speech data 621 may include data representing speech input from the speaker within a predetermined time period (a short time period, for example, several seconds, tens of seconds, or tens of minutes). For example, the speech data 621 of the speaker may include speech spectrogram data (e.g., a log-mel-spectrogram). The decoder 620 may obtain a speech feature embedding vector 622 of the speaker that represents the utterance characteristics of the speaker based on the speech data of the speaker. The decoder 620 may provide the speech feature embedding vector 622 of the speaker to the attention RNN and the decoder RNN.

The text-to-speech synthesis system shown in FIG. 5 uses a speaker ID as information indicating a speaker's utterance characteristic, and the speaker ID can be expressed as a one-hot vector. However, such a one-hot speaker ID vector cannot easily be extended to a new speaker who is not in the training data. Because the text-to-speech synthesis system learns embeddings only for the speakers represented by one-hot vectors, there is no way to obtain an embedding for a new speaker. To generate a new speaker's voice, the entire TTS model must be retrained or the embedding layer of the TTS model must be fine-tuned. This is a time-consuming process even when GPU-equipped equipment is used. In contrast, the text-to-speech synthesis system shown in FIG. 6 provides a TTS system that can instantly generate a new speaker's voice without additionally training the TTS model or manually searching for a speaker embedding vector in order to generate a new speaker vector. That is, the text-to-speech synthesis system can generate speech that is adaptively changed for a plurality of speakers.

According to one embodiment, in the speech synthesis for the input text, the speech feature embedding vector 622 of a first speaker extracted from the speech data 621 of the first speaker is input to the decoder RNN and the attention RNN, and the one-hot speaker ID vector 521 of a second speaker shown in FIG. 5 may also be input to the decoder RNN and the attention RNN. For example, the first speaker associated with the speech feature embedding vector 622 and the second speaker associated with the one-hot speaker ID vector 521 may be the same. As another example, the first speaker associated with the speech feature embedding vector 622 and the second speaker associated with the one-hot speaker ID vector 521 may be different. Accordingly, when the speech feature embedding vector 622 of the first speaker and the one-hot speaker ID vector 521 of the second speaker are input to the decoder RNN and the attention RNN together during speech synthesis for the input text, a synthesized speech can be generated in which at least one of the prosodic feature, the emotion feature, or the tone color and pitch feature included in the speech feature embedding vector 622 of the first speaker is reflected in the voice of the second speaker corresponding to the input text. That is, a synthesized speech can be generated in which at least one of the first speaker's utterance features, namely the prosodic feature, the emotion feature, or the tone color and pitch feature, is reflected in the voice of the second speaker associated with the one-hot speaker ID vector 521.

FIG. 7 is a diagram illustrating a network that extracts an embedding vector 622 that can distinguish each of a plurality of speakers, in accordance with one embodiment of the present disclosure.

According to one embodiment, the network shown in FIG. 7 includes a convolutional network and max-over-time pooling, receives a log-mel-spectrogram as the speech sample or speech signal, and extracts a fixed-dimension speaker embedding vector. Here, the speech sample or speech signal does not need to be speech data corresponding to the input text, and an arbitrarily selected speech signal may be used.

Because this network places no restriction on which spectrogram is used, any spectrogram can be input to it. In addition, through this instant adaptation of the network, it is possible to generate an embedding vector 622 that represents the utterance characteristics of a new speaker. An input spectrogram can have various lengths; the max-over-time pooling layer located at the end of the convolution layers reduces it to a fixed-dimension vector whose length along the time axis is 1.
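How a convolution followed by max-over-time pooling yields a vector of the same size for inputs of different lengths can be sketched as follows. This is an illustrative single-layer version with random, untrained kernels; the sizes `N_MELS`, `CHANNELS`, and `WIDTH` and the function names are assumptions, and the actual network of FIG. 7 may have more layers.

```python
import numpy as np


def conv1d_relu(spec, kernels, bias):
    """1D convolution over time. spec: (T, n_mels); kernels: (width, n_mels, channels)."""
    width = kernels.shape[0]
    steps = spec.shape[0] - width + 1
    out = np.empty((steps, kernels.shape[2]))
    for t in range(steps):
        window = spec[t:t + width]                       # (width, n_mels)
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)                          # ReLU


def speaker_vector(spec, kernels, bias):
    """Max-over-time pooling collapses the time axis to length 1."""
    features = conv1d_relu(spec, kernels, bias)          # (T', channels)
    return features.max(axis=0)                          # (channels,)


rng = np.random.default_rng(0)
N_MELS, CHANNELS, WIDTH = 80, 16, 3                      # assumed sizes
kernels = rng.normal(scale=0.1, size=(WIDTH, N_MELS, CHANNELS))
bias = np.zeros(CHANNELS)

short_clip = rng.normal(size=(50, N_MELS))    # stand-in log-mel-spectrograms
long_clip = rng.normal(size=(400, N_MELS))
vec_short = speaker_vector(short_clip, kernels, bias)
vec_long = speaker_vector(long_clip, kernels, bias)
```

Both clips produce a vector with `CHANNELS` entries, so downstream layers never see the original clip length.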

Although FIG. 7 illustrates a network including a convolutional network and a max over time pooling, a network including various layers can be constructed to extract a speaker's utterance characteristic. For example, if the speech characteristic pattern changes over time, such as the intonation of the speaker's speech characteristics, the network can be implemented to extract features using the RNN (Recurrent Neural Network).

FIG. 8 is a diagram illustrating a configuration of a text-to-speech synthesizer based on an artificial neural network according to an embodiment of the present disclosure. Description of the text-to-speech synthesizer of FIG. 8 that overlaps with the description of the text-to-speech synthesizer of FIG. 5 or 6 is omitted.

In FIG. 8, the encoder 810 may receive the input text. For example, the input text to the encoder 810 may be in multiple languages. According to one embodiment, the input text may include at least one of words, phrases, or sentences used in one or more languages. For example, a Korean sentence such as "Hello" or an English sentence such as "How are you?" can be input. When the input text is received, the encoder 810 can separate the received input text into letter, character, and phoneme units. According to another embodiment, the encoder 810 may receive input text that has already been separated into letter, character, and phoneme units. According to yet another embodiment, the encoder 810 may receive the character embedding for the input text.

If the encoder 810 receives the input text or the separated input text, the encoder 810 may be configured to include at least one embedding layer. According to one embodiment, at least one embedding layer of the encoder 810 may generate character embeddings based on the input text separated into letter, character, and phoneme units. For example, the encoder 810 may use an already trained machine learning model (e.g., a probabilistic model or an artificial neural network) to obtain character embeddings based on the separated input text. Further, the encoder 810 may update the machine learning model while performing machine learning. If the machine learning model is updated, the character embeddings for the separated input text can also change.
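The separation of input text into units and the lookup of an embedding per unit can be sketched as below. The example splits at the character level only and builds its inventory from one sentence; `vocab`, `char_to_id`, `embedding_table`, and `EMBED_DIM` are hypothetical names, and the table holds random values in place of learned ones.

```python
import numpy as np

text = "how are you?"
vocab = sorted(set(text))                        # inventory from this text only
char_to_id = {ch: i for i, ch in enumerate(vocab)}

rng = np.random.default_rng(0)
EMBED_DIM = 6                                    # assumed embedding size
# Random stand-in for a learned embedding layer: one row per character.
embedding_table = rng.normal(size=(len(vocab), EMBED_DIM))


def embed(sentence: str) -> np.ndarray:
    """Separate the text into characters and look up one vector per character."""
    ids = [char_to_id[ch] for ch in sentence]
    return embedding_table[ids]                  # (len(sentence), EMBED_DIM)


char_embeddings = embed(text)
```

If the table is updated during training, the same text maps to different vectors afterwards, which matches the remark above that the character embeddings change when the model is updated.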

The encoder 810 may pass the character embedding to a Deep Neural Network (DNN) module configured as a fully-connected layer. DNNs may include a general feedforward layer or a linear layer.

The encoder 810 may provide the output of the DNN to a module including at least one of a convolutional neural network (CNN) or a recurrent neural network (RNN). The encoder 810 may also receive, from the decoder 820, the speech feature embedding vector s of the speaker generated based on the speaker's speech data. A CNN can capture local characteristics according to the convolution kernel size, while an RNN can capture long-term dependencies. The encoder 810 may input the output of the DNN and the speech feature embedding vector s of the speaker into at least one of the CNN or the RNN to output the hidden states h of the encoder 810.

Decoder 820 can receive speech data of the speaker. The decoder 820 may generate the speech feature embedding vector s of the speaker based on the speaker speech data. The embedding layer can receive speech data of the speaker. The embedding layer can generate the speech characteristics of the speaker based on the speech data of the speaker. Here, the speaker's utterance characteristic may have different characteristics for each individual. The embedding layer may, for example, distinguish speaker perceptual features based on machine learning. For example, the embedding layer may generate a speech feature embedding vector (s) of the speaker that represents the speech feature of the speaker. According to one embodiment, the decoder 820 may use the already learned machine learning model to transform the speaker's speech characteristics into the speaker's speech feature embedding vector s. The decoder can update the machine learning model while performing machine learning. When the machine learning model is updated, the speech characteristic embedding vector (s) of the speaker representing the speech characteristic of the speaker can also be changed. For example, the utterance characteristic of the speaker can be extracted from the voice data of the speaker received using the voice extracting network of Fig. 7 described above.

The speaker's vocal feature embedding vector s may be output to at least one of the CNN or RNN of the encoder 810. Also, the speech characteristic embedding vector (s) of the speaker can be output to the decoder RNN and the attention RNN of the decoder.

The attention module of the decoder 820 may receive the hidden states h of the encoder from the encoder 810. The hidden states h may represent the results from the machine learning model of the encoder 810. For example, the hidden states h may include some elements of a single artificial neural network text-to-speech synthesis model according to one embodiment of the present disclosure. The attention module of the decoder 820 may also receive information from the attention RNN. The information received from the attention RNN may include information on what speech the decoder 820 has generated up to the previous time-step. The attention module of the decoder 820 can output a context vector based on the information received from the attention RNN and the information of the encoder. The information of the encoder 810 may include information on the input text for which speech is to be generated. The context vector may include information for determining from which portion of the input text speech is to be generated at the current time-step. For example, the attention module of the decoder 820 may output information such that speech is generated based on the beginning of the input text at the start of speech generation and based on progressively later parts of the input text as speech generation proceeds.
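How a context vector selects a portion of the input text can be illustrated with a small sketch. It uses dot-product scoring as a simplifying assumption; the disclosure does not state which scoring function the attention module uses. The `query` stands in for the information received from the attention RNN, and the toy hidden states are an identity matrix so that the result is easy to read.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attend(query, hidden_states):
    """query: (d,), hidden_states: (N, d) -> (context vector, attention weights)."""
    scores = hidden_states @ query        # one score per text position
    weights = softmax(scores)             # which part of the text to focus on
    context = weights @ hidden_states     # weighted sum of encoder hidden states
    return context, weights


h = np.eye(4)                 # toy encoder hidden states for 4 text positions
query = 5.0 * h[1]            # decoder state that matches the 2nd position
context, weights = attend(query, h)
```

As the query changes from one time-step to the next, the peak of `weights` can move along the text, which corresponds to generating speech from progressively later parts of the input.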

The decoder 820 may configure the structure of the artificial neural network so that the speech feature embedding vector s of the speaker is input to the attention RNN and the decoder RNN and decoding is performed differently for each speaker. According to one embodiment, the text-to-speech synthesis system can use a database in which text, the speech feature embedding vector s of the speaker, and speech signals exist as sets in order to train the artificial neural network. In another embodiment, as described above, the database may be constructed using a one-hot vector instead of the speech feature embedding vector s of the speaker, which represents the speaker's utterance characteristics. Alternatively, the speech feature embedding vector s of the speaker and the one-hot vector may be used together to construct the database.

The dummy frames are frames that are input to the decoder when no previous time-step exists. The RNN can perform machine learning in an autoregressive manner. That is, the r frames output at the immediately preceding time-step 822 may be the input of the current time-step 823. At the initial time-step 821, since there is no immediately preceding time-step, the decoder 820 can input the dummy frames into the machine learning network of the initial time-step.

According to one embodiment, the decoder 820 may include a DNN configured as a fully-connected layer. The DNN may include at least one of a general feedforward layer or a linear layer.

In one embodiment, decoder 820 may include an attention RNN configured with a GRU. Attention RNN is a layer that outputs information to be used in Attention. Attention is already described above, so a detailed explanation is omitted.

Decoder 820 may include a decoder RNN configured with a residual GRU. The decoder RNN may receive location information of the input text from the Attention. That is, the location information may include information about which location of the input text the decoder 820 is converting to speech. The decoder RNN may receive information from the Attention RNN. The information received from the Attention RNN may include information on what speech the decoder 820 has generated up to the previous time-step. The decoder RNN can generate the next output speech that will follow the speech generated so far. For example, the output speech may have a mel-spectrogram shape, and the output speech may include r frames.

The operation of the DNN, the Attention RNN and the Decoder RNN may be repeatedly performed for text-to-speech synthesis. For example, the r frames obtained in the initial time-step 821 may be the inputs of the next time-step 822. Also, the r frames output in the time-step 822 may be the inputs of the next time-step 823.
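The repeated, autoregressive operation can be sketched as a loop in which each step consumes the frames produced by the previous step and the first step consumes dummy frames. The `decoder_step` function below is a hypothetical single-layer stand-in for the DNN, attention RNN, and decoder RNN; all sizes and the random weights `W` are assumptions.

```python
import numpy as np

N_MELS, R = 80, 3     # assumed mel bins and number of frames per time-step


def decoder_step(prev_frames, context, speaker_vec, W):
    """Hypothetical stand-in for DNN + attention RNN + decoder RNN."""
    x = np.concatenate([prev_frames[-1], context, speaker_vec])
    return np.tanh(W @ x).reshape(R, N_MELS)      # r new frames


def synthesize(contexts, speaker_vec, W):
    frames = np.zeros((R, N_MELS))     # dummy frames for the initial time-step
    outputs = []
    for context in contexts:           # one iteration per time-step
        frames = decoder_step(frames, context, speaker_vec, W)
        outputs.append(frames)         # output here is the next step's input
    return np.concatenate(outputs, axis=0)        # frames in chronological order


rng = np.random.default_rng(0)
CTX_DIM, SPK_DIM, STEPS = 16, 8, 4
W = rng.normal(scale=0.1, size=(R * N_MELS, N_MELS + CTX_DIM + SPK_DIM))
contexts = rng.normal(size=(STEPS, CTX_DIM))      # stand-in context vectors
speaker_vec = rng.normal(size=SPK_DIM)            # stand-in embedding vector s
mel = synthesize(contexts, speaker_vec, W)
```

Feeding a different `speaker_vec` through the same weights changes every generated frame, which is the sense in which decoding differs per speaker.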

Through the above-described process, speech for all units of text can be generated. According to one embodiment, the text-to-speech synthesis system may concatenate the mel-spectrograms for each time-step in chronological order to obtain the mel-spectrogram speech for the entire text. The mel-spectrogram speech for the entire text can be output to the vocoder 830.

The CNN or RNN of the vocoder 830 in accordance with an embodiment of the present disclosure may be similar to the CNN or RNN of the encoder 810. That is, the CNN or RNN of the vocoder 830 can capture local characteristics and long-term dependencies. Accordingly, the CNN or RNN of the vocoder 830 may output a linear-scale spectrogram. For example, a linear-scale spectrogram may include a magnitude spectrogram. The vocoder 830 can predict the phase of the spectrogram through the Griffin-Lim algorithm, as shown in the figure. The vocoder 830 may output a time-domain speech signal using the inverse short-time Fourier transform.

A vocoder in accordance with another embodiment of the present disclosure may generate a speech signal from a mel-spectrogram based on a machine learning model. The machine learning model can include a model that has machine-learned the correlation between the mel-spectrogram and the speech signal. For example, an artificial neural network model such as WaveNet or WaveGlow may be used.

The artificial neural network-based speech synthesizer is trained by using a large-capacity database in which text in one or more languages and speech signals exist as pairs. According to one embodiment, the speech synthesizer can receive the text and compare the output speech signal with the correct speech signal to define a loss function. The speech synthesizer minimizes the loss function through an error back propagation algorithm and finally obtains an artificial neural network that produces the desired speech output when arbitrary text is input.

In this artificial neural network-based speech synthesizer, text, the utterance characteristics of a speaker, and the like can be input to the artificial neural network, and a speech signal can be output. The text-to-speech synthesizer learns by comparing the output speech signal with the correct speech signal, so that when text and the utterance characteristics of a speaker are input, it can generate output speech data in which the text is read in the voice of that speaker.

FIG. 9 is a flowchart illustrating an operation of the utterance feature adjuster 900 according to an embodiment of the present disclosure.

The utterance feature adjuster 900 of FIG. 9 may have a configuration that is the same as or similar to that of the utterance feature adjuster 220 of FIG. 2. The description overlapping with FIG. 2 is omitted.

The utterance feature adjuster 900 may receive an embedding vector indicating speaker information. According to one embodiment, such an embedding vector may include an embedding vector for the speech feature of the speaker. For example, the embedding vector for the speaker information can be expressed as a weighted sum of a plurality of sub-embedding vectors orthogonal to each other among the speaker's utterance characteristics.

The utterance feature adjuster 900 may separate the elements embedded in the received embedding vector for the speaker information. For example, the utterance feature adjuster 900 may obtain a plurality of unit embedding vectors that are orthogonal to each other based on the embedding vector for the speaker information. According to one embodiment, the method of separating the elements embedded in the embedding vector may include independent component analysis (ICA), independent vector analysis (IVA), sparse coding, independent factor analysis (IFA), independent subspace analysis (ISA), nonnegative matrix factorization (NMF), and the like. The text-to-speech synthesizer can perform regularization on its learning equation when learning the embedding vector for the speaker information so that the elements embedded in the embedding vector can be separated. When the text-to-speech synthesizer performs machine learning with regularization applied to the learning equation, the embedding vector can be learned as a sparse vector. Accordingly, the text-to-speech synthesizer can accurately separate the embedded elements from the embedding vector learned as a sparse vector by using principal component analysis (PCA).
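Of the separation methods listed above, principal component analysis is the simplest to sketch. The example below assumes a matrix holding one embedding vector per speaker (random stand-ins here) and obtains mutually orthogonal unit vectors from a singular value decomposition; the function name `principal_axes` and all sizes are hypothetical.

```python
import numpy as np


def principal_axes(embeddings, k):
    """Return k mutually orthogonal unit vectors (rows) found by PCA."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are orthonormal directions ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 8))    # stand-in: 100 speakers, 8 dimensions
units = principal_axes(embeddings, 3)     # three unit embedding vectors
```

Whether a given axis corresponds to an interpretable attribute such as gender or age depends on the trained embeddings and is not guaranteed by PCA itself.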

According to one embodiment, the utterance feature adjuster 900 may be configured to receive additional input to the output voice data. The utterance feature adjuster 900 may modify an embedding vector that indicates a speaker's utterance characteristics based on additional input. For example, the utterance feature adjuster 900 may change the weights for the plurality of unit embedding vectors based on the additional input.

In one embodiment, the utterance feature adjuster 900 may be configured to modify an embedding vector that indicates a speaker's utterance characteristics based on the received additional input. For example, the utterance characteristic adjuster 900 may re-synthesize an embedding vector for speaker information by multiplying a plurality of unit embedding vectors by a modified weight according to an additional input. The utterance characteristic adjuster 900 may output an embedding vector for the changed speaker information. The text-to-speech synthesizer can input the modified embedding vector into a single artificial neural network text-to-speech synthesis model, and convert the output speech data into speech data for the input text in which the information included in the additional input is reflected.
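The re-synthesis of the embedding vector from modified weights can be sketched as follows, assuming the unit embedding vectors are orthonormal so that each weight is a simple projection. The names `decompose` and `adjust`, the identity-like `units` matrix, and the weight values are hypothetical.

```python
import numpy as np

# Three orthonormal unit embedding vectors (rows) in an 8-dimensional space.
units = np.eye(3, 8)
weights = np.array([0.5, -1.0, 2.0])
embedding = weights @ units               # weighted sum of the unit vectors


def decompose(embedding, units):
    """Recover the weight of each unit vector by projection."""
    return units @ embedding


def adjust(embedding, units, index, new_weight):
    """Change one weight and rebuild the embedding vector."""
    w = decompose(embedding, units)
    residual = embedding - w @ units      # part not covered by the unit vectors
    w[index] = new_weight                 # weight chosen by the additional input
    return w @ units + residual


adjusted = adjust(embedding, units, index=1, new_weight=0.25)
```

Only the component along the selected unit vector changes; the other weights and any residual part of the embedding are carried over unchanged.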

The text-to-speech synthesizer can receive text entered by the user into the text window. When the reproduction button shown in FIG. 10 is selected (for example, touched or pressed), the text-to-speech synthesizer generates output speech data corresponding to the input text and provides it to the user terminal.

The text-to-speech synthesizer may receive additional input from the user. Additional inputs to the output voice data may include at least one of information about gender, information about age, information about the intonation by region, information about the speed of utterance, or information about the pitch height and the size of the utterance.

According to one embodiment, the text-to-speech synthesizer can transmit the utterance feature of the currently selected or designated speaker to the user terminal through the communication unit, and the features of the current voice can be displayed on the display unit in visual form (e.g., lines, polygons, circles, and the like). The user can change at least one of the information on gender, the information on age, the information on regional intonation, the information on the speed of utterance, or the information on the pitch and loudness of utterance by using the input unit, and the changed output voice can be output based on the user's input. For example, the user can select a gender close to female, an age close to 10, and the intonation of the Chungcheong region, as shown in the figure. The features of the current voice are changed according to the selected input, and a synthesized voice in which the changed features are reflected can be output to the user terminal.

As described above, various embodiments change the characteristics of the voice by changing one or more of the elements embedded in the embedding vector for the speaker information. However, the present disclosure is not limited thereto. According to an embodiment, the elements embedded in the embedding vector may be changed by expressing them as attributes of a speech synthesis markup language (SSML). For example, they can be expressed as SSML attributes such as <gender value = "6"> <region value = "3,4,5">.
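Reading such attribute values out of marked-up text could look like the sketch below. It is a hypothetical illustration only: the tags `gender` and `region` are taken from the example above, they are not elements of the standard SSML vocabulary, and because the example tags are not closed, a regular expression is used here instead of an XML parser.

```python
import re

# Matches tags of the form <name value = "..."> as written in the example.
TAG_PATTERN = re.compile(r'<\s*(\w+)\s+value\s*=\s*"([^"]*)"\s*/?\s*>')


def parse_attributes(markup: str) -> dict:
    """Map each tag name to the list of integer values it carries."""
    result = {}
    for name, raw in TAG_PATTERN.findall(markup):
        result[name] = [int(v) for v in raw.split(",")]
    return result


attrs = parse_attributes('<gender value = "6"> <region value = "3,4,5">')
```

The parsed values could then serve as the additional input that sets the weights of the corresponding unit embedding vectors; how values map to weights is not specified in the text.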

FIG. 11 is a block diagram of a text-to-speech synthesis system 1100 in accordance with one embodiment of the present disclosure.

Referring to FIG. 11, the text-to-speech synthesis system 1100 according to an embodiment may include a data learning unit 1110 and a data recognition unit 1120. The data learning unit 1110 can receive data as input and acquire a machine learning model. The data recognition unit 1120 can apply data to the machine learning model to generate output speech. The text-to-speech synthesis system 1100 as described above may include a processor and a memory.

The data learning unit 1110 can learn speech for text. The data learning unit 1110 can learn a criterion as to which speech to output according to the text. In addition, the data learning unit 1110 can learn a criterion as to which speech feature should be used to output the speech. The speech feature may include at least one of the pronunciation of phonemes, the tone of the user, intonation, or accent. The data learning unit 1110 acquires data to be used for learning and applies the acquired data to a data learning model, which will be described later, so as to learn speech according to the text.

The data recognizing unit 1120 can output a voice for the text based on the text. The data recognizing unit 1120 can output speech from a predetermined text using the learned data learning model. The data recognizing unit 1120 can acquire predetermined text (data) according to a preset reference by learning. Further, the data recognizing unit 1120 can output a voice based on predetermined data by using the data learning model with the obtained data as an input value. Further, the resultant value output by the data learning model with the obtained data as an input value can be used to update the data learning model.

At least one of the data learning unit 1110 or the data recognition unit 1120 may be manufactured in the form of at least one hardware chip and mounted on an electronic device. For example, at least one of the data learning unit 1110 and the data recognition unit 1120 may be fabricated in the form of a dedicated hardware chip for artificial intelligence (AI), or may be fabricated as part of an existing general-purpose processor (e.g., an application processor) or a graphics processor (e.g., a GPU) and mounted on the various electronic devices already described.

Further, the data learning unit 1110 and the data recognition unit 1120 may be mounted on separate electronic devices. For example, one of the data learning unit 1110 and the data recognition unit 1120 may be included in the electronic device, and the other may be included in a server. The data learning unit 1110 and the data recognition unit 1120 may be connected via a wired or wireless network, so that model information constructed by the data learning unit 1110 is provided to the data recognition unit 1120, and data input to the data recognition unit 1120 may be provided to the data learning unit 1110 as additional learning data.

At least one of the data learning unit 1110 and the data recognition unit 1120 may be implemented as a software module. When at least one of the data learning unit 1110 and the data recognition unit 1120 is implemented as a software module (or a program module including instructions), the software module may be stored in a memory or in non-transitory computer-readable media. Also, in this case, the at least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, some of the at least one software module may be provided by an operating system (OS), and the rest may be provided by a predetermined application.

The data learning unit 1110 according to an embodiment of the present disclosure may include a data acquisition unit 1111, a preprocessing unit 1112, a learning data selection unit 1113, a model learning unit 1114, and a model evaluation unit 1115.

The data acquisition unit 1111 can acquire the data necessary for machine learning. Since learning requires a large amount of data, the data acquisition unit 1111 can receive a plurality of texts and the voices corresponding to them.

The preprocessing unit 1112 can preprocess the acquired data so that the acquired data can be used for machine learning to determine the psychological state of the user. The preprocessing unit 1112 can process the acquired data into a predetermined format so that it can be used by the model learning unit 1114 to be described later. For example, the preprocessing unit 1112 may perform morphological analysis on the text and speech to obtain morpheme embeddings.
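As an illustration of processing acquired text into a predetermined format, the following is a minimal sketch. It is not the disclosed preprocessing: it assumes character-level IDs in place of the morpheme analysis mentioned above, and the names `build_vocab` and `preprocess_text` are hypothetical.

```python
from typing import Dict, List


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Map each character seen in the learning texts to an integer ID.

    ID 0 is reserved for characters that were never seen (assumption).
    """
    vocab = {"<unk>": 0}
    for text in texts:
        for ch in text.lower():
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab


def preprocess_text(text: str, vocab: Dict[str, int]) -> List[int]:
    """Normalize one text and convert it to a list of integer IDs."""
    normalized = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return [vocab.get(ch, 0) for ch in normalized]


vocab = build_vocab(["hello world"])
ids = preprocess_text("Hello   World", vocab)
print(ids)
```

A fixed integer format of this kind is one common way to hand text to a model learning stage; a morpheme-based variant would replace the character loop with the output of a morphological analyzer.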

The learning data selection unit 1113 can select the data necessary for learning from the preprocessed data. The selected data may be provided to the model learning unit 1114. The learning data selection unit 1113 can select the data necessary for learning from among the preprocessed data according to a predetermined criterion. The learning data selection unit 1113 can also select data according to a criterion established through learning by the model learning unit 1114, which will be described later.
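Selection according to a predetermined criterion can be pictured with the short sketch below. The criterion used here (non-empty text and an audio duration within a range) and the field names `text` and `duration_sec` are assumptions for illustration only; the disclosure does not specify them.

```python
from typing import Dict, List


def select_learning_data(pairs: List[Dict], min_sec: float = 1.0,
                         max_sec: float = 10.0) -> List[Dict]:
    """Keep only text/speech pairs that satisfy the assumed criterion."""
    selected = []
    for pair in pairs:
        has_text = bool(pair["text"].strip())
        in_range = min_sec <= pair["duration_sec"] <= max_sec
        if has_text and in_range:
            selected.append(pair)
    return selected


candidates = [
    {"text": "Good morning.", "duration_sec": 1.8},
    {"text": "", "duration_sec": 2.0},            # rejected: empty text
    {"text": "Too long.", "duration_sec": 42.0},  # rejected: out of range
]
kept = select_learning_data(candidates)
print(len(kept))
```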

The model learning unit 1114 can learn, based on the learning data, a criterion as to which voice to output for a given text. Also, the model learning unit 1114 can use the learning data to train a learning model that outputs a voice according to text. In this case, the data learning model may include a pre-built model. For example, the data learning model may include a model built in advance from basic learning data (e.g., a sample image, etc.).

The data learning model can be constructed in consideration of the application field of the learning model, the purpose of learning, or the computing performance of the device. The data learning model may include, for example, a model based on a neural network. For example, models such as a Deep Neural Network (DNN), a Recurrent Neural Network (RNN), a Long Short-Term Memory model (LSTM), a Bidirectional Recurrent Deep Neural Network (BRDNN), and a Convolutional Neural Network (CNN) may be used as the data learning model, but the data learning model is not limited thereto.

According to various embodiments, when there are a plurality of pre-built data learning models, the model learning unit 1114 can determine, as the data learning model to be learned, a data learning model whose basic learning data is highly relevant to the input learning data. In this case, the basic learning data may be pre-classified according to the type of data, and the data learning model may be pre-built for each data type. For example, the basic learning data may be pre-classified by various criteria such as the area where the learning data was generated, the time at which the learning data was generated, the size of the learning data, the genre of the learning data, the creator of the learning data, and the like.
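One way to picture choosing among pre-built models by relevance is sketched below. Relevance is approximated here by the number of shared classification tags, which is an assumption; the disclosure does not define a relevance measure, and `pick_prebuilt_model` and the tag names are hypothetical.

```python
from typing import Dict, Set


def pick_prebuilt_model(prebuilt: Dict[str, Set[str]],
                        input_tags: Set[str]) -> str:
    """Return the pre-built model whose basic learning data shares the
    most classification tags with the input learning data."""
    if not prebuilt:
        raise ValueError("no pre-built models available")
    return max(prebuilt, key=lambda name: len(prebuilt[name] & input_tags))


# Each model is described by the tags of the basic learning data it was built on.
prebuilt_models = {
    "news_reader": {"news", "formal"},
    "audiobook": {"fiction", "narration", "long_form"},
}
chosen = pick_prebuilt_model(prebuilt_models, {"fiction", "narration"})
print(chosen)
```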

In addition, the model learning unit 1114 can learn a data learning model using, for example, a learning algorithm including an error back-propagation method or a gradient descent method.
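The gradient descent method mentioned above can be illustrated on a deliberately tiny model. The sketch below fits a one-variable linear model by gradient descent on the mean squared error; the gradients are obtained with the chain rule, which is the single-layer case of error back-propagation. The model, the data, and the hyperparameters are assumptions chosen for illustration and are unrelated to the speech synthesis model of the disclosure.

```python
from typing import List, Tuple


def train_linear(xs: List[float], ys: List[float],
                 lr: float = 0.05, epochs: int = 2000) -> Tuple[float, float]:
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Forward pass: predictions with the current parameters.
        preds = [w * x + b for x in xs]
        # Backward pass: gradients of the mean squared error.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # Parameter update: step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Data generated from y = 2x + 1, so training should recover w near 2 and b near 1.
w, b = train_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(w, 3), round(b, 3))
```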

Also, the model learning unit 1114 can learn the data learning model through supervised learning using, for example, learning data as input values. In addition, the model learning unit 1114 can learn the data learning model through unsupervised learning, in which it learns by itself, without any particular guidance, the types of data necessary for situation determination and thereby finds a criterion for determining the situation. Also, the model learning unit 1114 can learn the data learning model through reinforcement learning using, for example, feedback as to whether the result of a situation determination based on learning is correct.

Further, when the data learning model has been learned, the model learning unit 1114 can store the learned data learning model. In this case, the model learning unit 1114 can store the learned data learning model in the memory of the electronic device that includes the data recognition unit 1120. Alternatively, the model learning unit 1114 may store the learned data learning model in the memory of a server connected to the electronic device via a wired or wireless network.
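Storing a learned model and reading it back can be sketched as follows. The JSON file format, the parameter names `w` and `b`, and the functions `save_model` and `load_model` are assumptions for illustration; the disclosure does not prescribe a storage format.

```python
import json
import os
import tempfile
from typing import Dict


def save_model(params: Dict[str, float], path: str) -> None:
    """Write the learned parameters to a file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(params, f)


def load_model(path: str) -> Dict[str, float]:
    """Read the learned parameters back, e.g. on the recognizing side."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.json")
    save_model({"w": 2.0, "b": 1.0}, model_path)
    restored = load_model(model_path)

print(restored)
```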

In this case, the memory in which the learned data learning model is stored may also store instructions or data associated with, for example, at least one other component of the electronic device. The memory may also store software and/or programs. The program may include, for example, a kernel, middleware, an application programming interface (API), and/or an application program (or "application").

The model evaluation unit 1115 can input evaluation data to the data learning model and, when the result output for the evaluation data does not satisfy a predetermined criterion, can cause the model learning unit 1114 to learn again. In this case, the evaluation data may include predetermined data for evaluating the data learning model.

For example, when, among the results of the learned data learning model for the evaluation data, the number or ratio of evaluation data for which the recognition result is incorrect exceeds a predetermined threshold value, the model evaluation unit 1115 can evaluate that the predetermined criterion is not satisfied. For example, when the predetermined criterion is defined as a ratio of 2%, and the learned data learning model outputs an incorrect recognition result for more than 20 out of a total of 1000 evaluation data items, the model evaluation unit 1115 can evaluate the learned data learning model as inappropriate.
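The 2% example above amounts to a simple ratio check, sketched below. The function name `is_model_inappropriate` and its parameters are hypothetical.

```python
def is_model_inappropriate(num_incorrect: int, num_total: int,
                           max_error_ratio: float = 0.02) -> bool:
    """Return True when the share of incorrect results exceeds the threshold."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return num_incorrect / num_total > max_error_ratio


# With 1000 evaluation items and a 2% criterion, 20 errors still pass
# and 21 errors fail.
print(is_model_inappropriate(20, 1000))
print(is_model_inappropriate(21, 1000))
```

Whether the boundary case (exactly 2%) counts as passing is a design choice; the sketch follows the wording "exceeding 20 out of 1000" and treats exactly 20 errors as acceptable.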

On the other hand, when there are a plurality of learned data learning models, the model evaluation unit 1115 can evaluate whether each of the learned data learning models satisfies a predetermined criterion, and can determine a model satisfying the predetermined criterion as the final data learning model. In this case, when a plurality of models satisfy the predetermined criterion, the model evaluation unit 1115 can determine, as the final data learning model, any one model or a preset number of models in descending order of evaluation score.
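Choosing the final model or models from several candidates can be sketched as a filter followed by a sort. The score scale, the threshold, and the name `select_final_models` are assumptions for illustration.

```python
from typing import Dict, List


def select_final_models(scores: Dict[str, float], min_score: float,
                        max_models: int = 1) -> List[str]:
    """Keep models meeting the criterion, best evaluation score first."""
    passing = [(name, s) for name, s in scores.items() if s >= min_score]
    passing.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in passing[:max_models]]


evaluation_scores = {"model_a": 0.97, "model_b": 0.99, "model_c": 0.90}
print(select_final_models(evaluation_scores, min_score=0.95))
print(select_final_models(evaluation_scores, min_score=0.95, max_models=2))
```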

At least one of the data acquisition unit 1111, the preprocessing unit 1112, the learning data selection unit 1113, the model learning unit 1114, or the model evaluation unit 1115 in the data learning unit 1110 may be manufactured in the form of at least one hardware chip and mounted on an electronic device. For example, at least one of the data acquisition unit 1111, the preprocessing unit 1112, the learning data selection unit 1113, the model learning unit 1114, or the model evaluation unit 1115 may be manufactured in the form of a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as part of a conventional general-purpose processor (e.g., a CPU or an application processor) or a graphics-only processor (e.g., a GPU), and mounted on the various electronic devices described above.

The data acquisition unit 1111, the preprocessing unit 1112, the learning data selection unit 1113, the model learning unit 1114, and the model evaluation unit 1115 may be mounted on one electronic device, or may be mounted on separate electronic devices, respectively. For example, some of the data acquisition unit 1111, the preprocessing unit 1112, the learning data selection unit 1113, the model learning unit 1114, and the model evaluation unit 1115 may be included in the electronic device, and the rest may be included in a server.

At least one of the data acquisition unit 1111, the preprocessing unit 1112, the learning data selection unit 1113, the model learning unit 1114, and the model evaluation unit 1115 may be implemented as a software module. When at least one of the data acquisition unit 1111, the preprocessing unit 1112, the learning data selection unit 1113, the model learning unit 1114, or the model evaluation unit 1115 is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer-readable medium. Also, in this case, the at least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, some of the at least one software module may be provided by an operating system (OS), and the rest may be provided by a predetermined application.

The data recognition unit 1120 according to an embodiment of the present invention may include a data acquisition unit 1121, a preprocessing unit 1122, a recognition data selection unit 1123, a recognition result providing unit 1124, and a model updating unit 1125.

The data acquisition unit 1121 can acquire the text necessary for outputting a voice. Conversely, the data acquisition unit 1121 can acquire the voice necessary for outputting text. The preprocessing unit 1122 can preprocess the acquired data so that it can be used for outputting voice or text. The preprocessing unit 1122 can process the acquired data into a predetermined format so that the recognition result providing unit 1124, which will be described later, can use it for outputting voice or text.

The recognition data selection unit 1123 can select, from the preprocessed data, the data necessary for outputting voice or text. The selected data may be provided to the recognition result providing unit 1124. The recognition data selection unit 1123 can select some or all of the preprocessed data according to a predetermined criterion for outputting voice or text. The recognition data selection unit 1123 can also select data according to a criterion established through learning by the model learning unit 1114.

The recognition result providing unit 1124 can output the voice or text by applying the selected data to the data learning model. The recognition result providing unit 1124 can apply the selected data to the data learning model by using the data selected by the recognition data selecting unit 1123 as an input value. In addition, the recognition result can be determined by the data learning model.

The model updating unit 1125 can cause the data learning model to be updated based on an evaluation of the recognition result provided by the recognition result providing unit 1124. For example, the model updating unit 1125 may provide the recognition result from the recognition result providing unit 1124 to the model learning unit 1114, so that the model learning unit 1114 updates the data learning model.
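The flow through units 1121 to 1125 can be pictured with the sketch below. It is only a structural illustration under stated assumptions: the class `DataRecognizer`, its method names, the whitespace tokenization, and the stand-in `toy_model` are all hypothetical, and the stand-in returns token lengths rather than speech.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class DataRecognizer:
    # Stand-in for the learned data learning model (assumption).
    model: Callable[[List[str]], List[float]]
    # Results collected for a later model update.
    feedback: List[Tuple[List[str], List[float]]] = field(default_factory=list)

    def acquire(self, raw: str) -> str:                 # role of unit 1121
        return raw

    def preprocess(self, text: str) -> List[str]:       # role of unit 1122
        return text.lower().split()

    def select(self, tokens: List[str]) -> List[str]:   # role of unit 1123
        return [t for t in tokens if t.isalpha()]

    def provide_result(self, tokens: List[str]) -> List[float]:  # role of unit 1124
        return self.model(tokens)

    def queue_update(self, tokens: List[str], output: List[float]) -> None:
        # Role of unit 1125: keep the result so the learning side can use it.
        self.feedback.append((tokens, output))

    def run(self, raw: str) -> List[float]:
        tokens = self.select(self.preprocess(self.acquire(raw)))
        output = self.provide_result(tokens)
        self.queue_update(tokens, output)
        return output


def toy_model(tokens: List[str]) -> List[float]:
    """Placeholder model: one number per token."""
    return [float(len(t)) for t in tokens]


recognizer = DataRecognizer(model=toy_model)
result = recognizer.run("Hello 123 World")
print(result)
```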

At least one of the data acquisition unit 1121, the preprocessing unit 1122, the recognition data selection unit 1123, the recognition result providing unit 1124, or the model updating unit 1125 in the data recognition unit 1120 may be manufactured in the form of at least one hardware chip and mounted on an electronic device. For example, at least one of the data acquisition unit 1121, the preprocessing unit 1122, the recognition data selection unit 1123, the recognition result providing unit 1124, and the model updating unit 1125 may be manufactured in the form of a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as part of a conventional general-purpose processor (e.g., a CPU or an application processor) or a graphics-only processor (e.g., a GPU), and mounted on the various electronic devices described above.

The data acquisition unit 1121, the preprocessing unit 1122, the recognition data selection unit 1123, the recognition result providing unit 1124, and the model updating unit 1125 may be mounted on one electronic device, or may be mounted on separate electronic devices, respectively. For example, some of the data acquisition unit 1121, the preprocessing unit 1122, the recognition data selection unit 1123, the recognition result providing unit 1124, and the model updating unit 1125 may be included in the electronic device, and the rest may be included in a server.

At least one of the data acquisition unit 1121, the preprocessing unit 1122, the recognition data selection unit 1123, the recognition result providing unit 1124, and the model updating unit 1125 may be implemented as a software module. When at least one of the data acquisition unit 1121, the preprocessing unit 1122, the recognition data selection unit 1123, the recognition result providing unit 1124, or the model updating unit 1125 is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer-readable medium. Also, in this case, the at least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, some of the at least one software module may be provided by an operating system (OS), and the rest may be provided by a predetermined application.

Various embodiments have been described above. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the disclosed embodiments should be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the appended claims rather than by the foregoing description, and all differences within the scope of equivalents thereof should be construed as being included in the present invention.

The embodiments of the present invention described above can be written as a program executable on a computer, and can be implemented on a general-purpose digital computer that runs the program from a computer-readable recording medium. The computer-readable recording medium includes storage media such as magnetic storage media (e.g., ROM, floppy disk, hard disk, etc.) and optically readable media (e.g., CD-ROM, DVD, etc.).

Claims

A method for text-to-speech synthesis using machine learning, the method comprising:

generating a single artificial neural network text-to-speech synthesis model by performing machine learning based on a plurality of learning texts and speech data corresponding to the plurality of learning texts;

receiving an input text;

receiving an utterance characteristic of a speaker; and

inputting the utterance characteristic of the speaker into the single artificial neural network text-to-speech synthesis model, and generating output speech data for the input text in which the utterance characteristic of the speaker is reflected.

The method according to claim 1,

wherein the step of receiving the utterance characteristic of the speaker comprises:

receiving a speech sample; and

extracting, from the speech sample, an embedding vector representing the utterance characteristic of the speaker.

The method according to claim 2,

wherein the step of extracting the embedding vector representing the utterance characteristic of the speaker from the speech sample comprises extracting a first sub-embedding vector representing a prosodic characteristic of the speaker, the prosodic characteristic including at least one of information on pronunciation stress, information on pause intervals, or information on pitch, and

wherein the step of generating the output speech data for the input text in which the utterance characteristic of the speaker is reflected comprises inputting the first sub-embedding vector representing the prosodic characteristic into the single artificial neural network text-to-speech synthesis model, and generating output speech data for the input text in which the prosodic characteristic of the speaker is reflected.

The method according to claim 2,

wherein the step of extracting the embedding vector representing the utterance characteristic of the speaker from the speech sample comprises extracting a second sub-embedding vector representing an emotion characteristic of the speaker, the emotion characteristic including information about the emotion underlying the speaker's utterance, and

wherein the step of generating the output speech data for the input text in which the utterance characteristic of the speaker is reflected comprises inputting the second sub-embedding vector representing the emotion characteristic into the single artificial neural network text-to-speech synthesis model, and generating output speech data for the input text in which the emotion characteristic of the speaker is reflected.

The method according to claim 2,

wherein the step of extracting the embedding vector representing the utterance characteristic of the speaker from the speech sample comprises extracting a third sub-embedding vector representing characteristics of the timbre and pitch of the speaker's voice, and

wherein the step of generating the output speech data for the input text in which the utterance characteristic of the speaker is reflected comprises inputting the third sub-embedding vector representing the characteristics of the timbre and pitch of the speaker's voice into the single artificial neural network text-to-speech synthesis model, and generating output speech data for the input text in which the characteristics of the timbre and pitch of the speaker's voice are reflected.

The method according to claim 2,

wherein the step of generating the output speech data for the input text in which the utterance characteristic of the speaker is reflected comprises:

receiving an additional input for the output speech data;

modifying the embedding vector representing the utterance characteristic of the speaker based on the additional input; and

inputting the modified embedding vector into the single artificial neural network text-to-speech synthesis model, and converting the output speech data into speech data for the input text in which the information included in the additional input is reflected.

The method according to claim 6,

wherein the information included in the additional input for the output speech data includes at least one of information on gender, information on age, information on regional intonation, or information on utterance speed.

The method according to claim 2,

wherein the step of receiving the speech sample comprises:

receiving, in real time, speech input from the speaker within a predetermined time period as the speech sample.

The method according to claim 2,

wherein the step of receiving the speech sample comprises:

receiving, from a speech database, speech input from the speaker within a predetermined time period.

11. A computer-readable storage medium having stored thereon instructions for performing the respective steps according to the method for text-to-speech synthesis using machine learning of claim 1.