WO2020190054A1

WO2020190054A1 - Speech synthesis apparatus and method therefor

Info

Publication number: WO2020190054A1
Application number: PCT/KR2020/003768
Authority: WO
Inventors: 이자룡; 박중배
Original assignee: 휴멜로 주식회사
Priority date: 2019-03-19
Filing date: 2020-03-19
Publication date: 2020-09-24
Also published as: KR102057927B1

Abstract

Provided is a speech synthesis apparatus capable of synthesizing emotional speech that reflects emotional information. A speech synthesis apparatus according to some embodiments of the present disclosure can include: a pre-processing unit for pre-processing inputted text; and a speech synthesis unit, which inputs the pre-processed text and the emotional information into a neural network-based speech synthesis model, so as to synthesize, for the inputted text, emotional speech that reflects emotional information, wherein the speech synthesis model includes an encoder neural network and a decoder neural network, and the emotional speech that reflects the emotional information can be outputted by using emotion embedding vectors for the emotional information as an input into the encoder neural network.

Description

Speech synthesis device and method thereof

The present disclosure relates to a speech synthesis apparatus and method thereof. In more detail, the present invention relates to an apparatus for synthesizing an emotional speech reflecting emotion information using a neural network-based speech synthesis model, a speech synthesis method performed by the apparatus, and a method of constructing the speech synthesis model.

Speech synthesis technology is a technology that synthesizes a sound similar to a human speaking sound from an input text, and is commonly known as a text-to-speech (TTS) technology. In recent years, as personal portable devices such as smart phones, e-book readers, and vehicle navigation have been actively developed and distributed, the demand for speech synthesis technology for voice output is rapidly increasing.

As the demand for speech synthesis technology increases, the requirements are also subdivided. Recently, demand for a technology capable of synthesizing various emotional voices containing human emotions from a specific text has been continuously raised.

Conventionally, an audio post-processing method was mainly used to synthesize emotional voices. The audio post-processing method synthesizes a voice for the input text and then modifies the audio signal of the synthesized voice according to the desired emotion. Because this method artificially modifies the audio signal, it has the problem that the naturalness of the voice is lost.

In addition, a method has been proposed in which voice fragments for each emotion are stored in the voice DB in advance in units of text tokens, and the previously stored voice fragments are extracted and synthesized according to the desired emotion. However, even in such a method, there is a problem in that a non-smooth voice is generated due to a connection problem between voice fragments, and above all, it takes a lot of time and cost to build a massive voice DB.

Accordingly, there is a need for a voice synthesis method capable of generating emotional voices containing various and continuous emotions.

A technical problem to be solved through some embodiments of the present disclosure is to provide an apparatus capable of synthesizing emotional voices containing various emotions for a given text and a method performed by the apparatus.

Another technical problem to be solved through some embodiments of the present disclosure is to provide an apparatus capable of constructing a neural network-based speech synthesis model capable of synthesizing an emotional voice for a given text, and a method performed by the apparatus.

Another technical problem to be solved through some embodiments of the present disclosure is to provide an apparatus capable of synthesizing emotional voices containing various emotions for each speaker with respect to a given text, and a method performed by the apparatus.

Another technical problem to be solved through some embodiments of the present disclosure is to provide an apparatus capable of constructing a neural network-based speech synthesis model capable of synthesizing emotional voices containing various emotions for each speaker for a given text, and a method performed by the apparatus.

The technical problems of the present disclosure are not limited to the technical problems mentioned above, and other technical problems that are not mentioned will be clearly understood by those skilled in the art from the following description.

In order to solve the above technical problem, a speech synthesis apparatus according to some embodiments of the present disclosure includes a preprocessor that performs preprocessing on an input text, and a speech synthesis unit that inputs the preprocessed text and emotion information into a neural network-based speech synthesis model to synthesize, for the input text, an emotional voice reflecting the emotion information. The speech synthesis model may include a character embedding module that converts the preprocessed text into a character embedding vector, an emotion embedding module that converts the emotion information into an emotion embedding vector, an encoder neural network that receives an input sequence consisting of the character embedding vector and the emotion embedding vector and outputs an encoded vector, and a decoder neural network that receives the encoded vector and outputs an output sequence associated with the emotional voice.

In some embodiments, the speech synthesis model may further include an attention module positioned between the encoder neural network and the decoder neural network and determining a portion to be focused by the decoder neural network in the encoded vector.

In some embodiments, the output sequence is composed of data in the form of a spectrogram, and the speech synthesis unit may further include a vocoder unit for converting the output sequence into the emotional speech.

In some embodiments, the decoder neural network may further receive the emotion embedding vector and output the output sequence.

In some embodiments, the output sequence is composed of spectrogram-type data, and the speech synthesis unit may train the speech synthesis model by inputting training text preprocessed by the preprocessor into the speech synthesis model, comparing the spectrogram data obtained as a result with correct answer spectrogram data to calculate an error value, and back-propagating the calculated error value.

In some embodiments, the speech synthesis model further includes a speaker embedding module for converting speaker information into a speaker embedding vector, and the speech synthesis unit may input the speaker information to the speech synthesis model so that a voice reflecting the emotion information for the specific speaker indicated by the speaker information is output as the emotional voice.

In order to solve the above technical problem, a speech synthesis apparatus according to some embodiments of the present disclosure includes a preprocessor for performing preprocessing on an input text, and a speech synthesis unit that inputs the preprocessed text and emotion information into a neural network-based speech synthesis model to synthesize, for the input text, an emotional voice reflecting the emotion information. The speech synthesis model may include a character embedding module that converts the preprocessed text into a character embedding vector, an emotion embedding module that converts the emotion information into an emotion embedding vector, an encoder neural network that receives an input sequence consisting of the character embedding vector and outputs an encoded vector, and a decoder neural network that receives the encoded vector and the emotion embedding vector and outputs an output sequence associated with the emotional voice.

A method of constructing a speech synthesis model according to some embodiments of the present disclosure for solving the above technical problem is a method, performed in a computing device, of constructing a speech synthesis model that includes an encoder neural network and a decoder neural network in order to synthesize emotional speech. The method may include: embedding training text to convert it into a character embedding vector; embedding training emotion information to convert it into an emotion embedding vector; receiving the character embedding vector and the emotion embedding vector in the encoder neural network and outputting an encoded vector; receiving the encoded vector in the decoder neural network and outputting predicted spectrogram data; and updating the speech synthesis model by back-propagating an error between correct answer spectrogram data and the predicted spectrogram data.

FIG. 1 is a diagram for explaining input and output of a speech synthesis apparatus according to some embodiments of the present disclosure.

FIG. 2 is an exemplary block diagram illustrating a speech synthesis apparatus according to some embodiments of the present disclosure.

FIG. 3 is an exemplary diagram for describing an operation of a preprocessor according to some embodiments of the present disclosure.

FIG. 4 is an exemplary block diagram illustrating a speech synthesis unit according to some embodiments of the present disclosure.

FIGS. 5 and 6 are diagrams for explaining a neural network structure of a speech synthesis model according to some embodiments of the present disclosure.

FIGS. 7 and 8 are exemplary diagrams for explaining emotion information that may be referred to in various embodiments of the present disclosure.

FIG. 9 is an exemplary diagram illustrating an LSTM recurrent neural network that can be used in a speech synthesis model according to some embodiments of the present disclosure.

FIG. 10 is an exemplary diagram for explaining a learning operation for a speech synthesis model according to some embodiments of the present disclosure.

FIGS. 11 to 15 are diagrams for explaining a neural network structure of a modified speech synthesis model according to various embodiments of the present disclosure.

FIG. 16 is an exemplary flowchart illustrating a speech synthesis method according to some embodiments of the present disclosure.

FIG. 17 is an exemplary flowchart illustrating a method of constructing a speech synthesis model according to some embodiments of the present disclosure.

FIG. 18 is a diagram illustrating an exemplary computing device capable of implementing a speech synthesis apparatus according to some embodiments of the present disclosure.

Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. Advantages and features of the present disclosure, and methods of achieving them, will become apparent with reference to the embodiments described below in detail together with the accompanying drawings. However, the technical idea of the present disclosure is not limited to the following embodiments and may be implemented in various different forms. The following embodiments are provided only to make the technical idea of the present disclosure complete and to fully inform those of ordinary skill in the art to which the present disclosure belongs of the scope of the present disclosure, and the technical idea of the present disclosure is defined only by the scope of the claims.

In adding reference numerals to the elements of each drawing, it should be noted that the same elements are assigned the same numerals as far as possible, even when they appear in different drawings. In addition, in describing the present disclosure, if it is determined that a detailed description of a related known configuration or function may obscure the subject matter of the present disclosure, the detailed description will be omitted.

Unless otherwise defined, all terms (including technical and scientific terms) used in the present specification may be used as meanings that can be commonly understood by those of ordinary skill in the art to which this disclosure belongs. In addition, terms defined in a commonly used dictionary are not interpreted ideally or excessively unless explicitly defined specifically. The terms used in the present specification are for describing exemplary embodiments and are not intended to limit the present disclosure. In this specification, the singular form also includes the plural form unless specifically stated in the phrase.

In addition, in describing the constituent elements of the present disclosure, terms such as first, second, A, B, (a) and (b) may be used. These terms are used only to distinguish one component from another, and the nature, sequence, or order of the component is not limited by the term. When a component is described as being "connected", "coupled", or "joined" to another component, the component may be directly connected or joined to that other component, but it should be understood that yet another component may also be "connected", "coupled", or "joined" between the components.

As used in the specification, "comprises" and/or "comprising" does not exclude the presence or addition of one or more other components, steps, actions and/or elements in addition to the components, steps, actions and/or elements mentioned.

Prior to the description of the present specification, some terms used in the present specification will be clarified.

In the present specification, emotional speech (or emotional voice) means, literally, synthesized speech that contains human emotion.

In this specification, an instruction refers to a series of computer-readable instructions grouped on a function basis, which is a component of a computer program and executed by a processor.

Hereinafter, some embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary diagram illustrating inputs and outputs of a speech synthesis apparatus 10 according to some embodiments of the present disclosure.

As shown in FIG. 1, the speech synthesis device 10 is a computing device that receives text 1 and emotion information 3, and synthesizes and outputs an emotional voice 7 corresponding thereto. Here, the emotional voice 7 refers to a voice reflecting the emotion information 3.

The computing device may be a notebook computer, a desktop computer, a laptop computer, etc., but is not limited thereto and may include all types of devices equipped with a computing function. For an example of the computing device, refer to FIG. 18 further.

FIG. 1 illustrates, as an example, that the speech synthesis device 10 is implemented as a single computing device, but a first function of the speech synthesis device 10 may be implemented in a first computing device and a second function may be implemented in a second computing device.

In some embodiments, the speech synthesis apparatus 10 may further receive speaker information 5 and synthesize and output the emotional voice 7 of a specific speaker indicated by the speaker information 5. In this case, the emotional voice 7 means a voice of the specific speaker that reflects the emotion information 3.

According to various embodiments of the present disclosure, the speech synthesis apparatus 10 constructs a neural network-based speech synthesis model in order to synthesize speech reflecting various and continuous emotions, and may synthesize the emotional voice 7 through the speech synthesis model. The neural network structure and learning method of the speech synthesis model will be described in detail with reference to FIG. 2 and the subsequent drawings.

FIG. 2 is an exemplary block diagram illustrating a speech synthesis apparatus 10 according to some embodiments of the present disclosure.

As shown in FIG. 2, the speech synthesis apparatus 10 may include an input unit 21, a preprocessor 23, a storage unit 25, and a speech synthesis unit 27. However, only the components related to the embodiment of the present disclosure are shown in FIG. 2. Accordingly, those of ordinary skill in the art to which the present disclosure pertains may recognize that other general-purpose components may be further included in addition to the components illustrated in FIG. 2. In addition, it should be noted that each of the constituent elements of the speech synthesis apparatus 10 shown in FIG. 2 represents a functionally divided element, and a plurality of constituent elements may be implemented in a form integrated with each other in an actual physical environment. Hereinafter, each component will be described in detail.

The input unit 21 receives text, emotion information, speaker information, and the like. Among the input information, the text may be provided to the preprocessor 23 for preprocessing, and the remaining information (e.g. emotion information, speaker information, etc.) may be provided to the speech synthesis unit 27.

In addition, the input unit 21 may receive a training data set including training text, training emotion information, training speaker information, and correct answer voice data for training the speech synthesis model 43. A description of each data will be described later.

Next, the preprocessor 23 performs preprocessing on the input text. The preprocessing may be performed in various ways, such as dividing the input text into sentences, parsing sentence-level text into units such as words, characters, and phonemes, and converting numbers and special characters into characters, and the specific preprocessing method may vary according to the embodiment. Some examples of the preprocessing process are shown in FIG. 3.

As shown in FIG. 3, the preprocessor 23 may convert the numbers in the input text 31 into characters to generate text 33 in character form, and may convert the text 33 into text 35 in phoneme units. However, this is only an example for describing the operation of the preprocessor 23, and the preprocessor 23 may perform a natural language preprocessing function in various ways.
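
The preprocessing steps described above (sentence splitting, number-to-character conversion, and splitting into smaller units) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the English digit names, the digit-by-digit reading of numbers, and the use of characters as the final unit are not taken from the disclosure, and a practical system would use a language-specific normalizer and a grapheme-to-phoneme step instead.

```python
import re

# Hypothetical digit names; a real system would use a language-specific normalizer.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def spell_out_numbers(sentence):
    """Replace each run of digits with its digits read out one by one."""
    def repl(match):
        return " ".join(DIGIT_WORDS[int(d)] for d in match.group())
    return re.sub(r"\d+", repl, sentence)


def to_symbols(sentence):
    """Lower-case the sentence and split it into character tokens."""
    return list(sentence.lower())


def preprocess(text):
    """Return one list of character tokens per sentence."""
    return [to_symbols(spell_out_numbers(s)) for s in split_sentences(text)]
```

For instance, `preprocess("Room 7.")` yields a single sentence whose tokens are the characters of "room seven.".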

In some embodiments, the preprocessor 23 may further perform not only a text preprocessing function, but also a preprocessing function of converting voice data (e.g. wav format audio) into spectrogram format data. For example, the preprocessor 23 may perform Short Time Fourier Transform (STFT) signal processing to convert voice data into STFT spectrogram data or transform the STFT spectrogram data into mel-scale. The spectrogram data may be used to train the speech synthesis model 43.
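
As a rough illustration of the STFT conversion mentioned above, the following Python sketch computes a magnitude spectrogram with a direct (slow) discrete Fourier transform. The frame length, hop size, Hann window, and function names are assumptions chosen for illustration; the disclosure does not specify them, and the mel-scale conversion is not shown. A practical implementation would use an FFT library.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def stft_magnitude(signal, frame_len=64, hop=32):
    """Return a list of frames, each holding frame_len // 2 + 1 magnitudes."""
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Cut one frame out of the signal and taper it with the window.
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = []
        for k in range(n_bins):
            # Direct DFT of the windowed frame at frequency bin k.
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames
```

Each row of the result corresponds to one time frame and each column to one frequency bin, which is the frame-by-frame layout that the later description of the decoder output assumes.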

Referring back to FIG. 2, the storage unit 25 stores and manages various data such as text, emotion information, speaker information, voice data, and spectrogram data. For effective data management, the storage unit 25 may manage the various types of data in a database. The various data may be used as training data for constructing the speech synthesis model 43, but the technical scope of the present disclosure is not limited thereto.

Next, the speech synthesis unit 27 receives the preprocessed text, emotion information, and speaker information, and generates (synthesizes) the emotional voice of a specific speaker indicated by the speaker information. That is, the speech synthesis unit 27 may generate voices of different speakers for the same text or may generate emotional voices reflecting different emotions. For example, when first speaker information is input, the speech synthesis unit 27 may synthesize and output the voice of the first speaker, and when first emotion information is input, the speech synthesis unit 27 may synthesize and output an emotional voice containing the first emotion.

As illustrated in FIG. 4, the speech synthesis unit 27 according to some embodiments may include a learning unit 41, a speech synthesis model 43, a synthesis unit 45, and a vocoder unit 47. Hereinafter, detailed components of the speech synthesis unit 27 will be described in detail.

The learning unit 41 trains the speech synthesis model 43 using the training data set. That is, the learning unit 41 may construct the speech synthesis model 43 by updating the weight of the speech synthesis model 43 so that the prediction error of the speech synthesis model 43 is minimized using the training data set. The training data set may be provided from the storage unit 25, but the technical scope of the present disclosure is not limited thereto. In order to provide ease of understanding, the structure of the neural network of the speech synthesis model 43 will be first described, and then the operation of the learning unit 41 will be described in detail.

The speech synthesis model 43 is a neural network-based model that receives preprocessed text, emotion information, and/or speaker information and synthesizes emotional speech corresponding thereto. As shown in FIG. 5, the speech synthesis model 43 according to some embodiments of the present disclosure may include an embedding module 51, an encoder neural network 53, an attention module 55, and a decoder neural network 57.

The embedding module 51 is a module that embeds input information and converts it into vector data. As illustrated in FIG. 6, the embedding module 51 may include a character embedding module 61, an emotion embedding module 63, and a speaker embedding module 65.

The character embedding module 61 is a module that embeds preprocessed text information and converts it into a character embedding vector. For example, the character embedding module 61 may generate a character embedding vector using a fastText embedding technique, an auto-encoder embedding technique, a self-attention embedding technique, or the like, but the technical scope of the present disclosure is not limited thereto.

Next, the emotion embedding module 63 is a module that embeds emotion information and converts it into an emotion embedding vector.

In some embodiments, the emotion embedding module 63 may be implemented as a specific layer of the speech synthesis model 43. For example, the emotion embedding module 63 may be implemented as a fully connected layer or a fully connected network located in front of the encoder neural network 53 and/or the decoder neural network 57. In this case, the emotion embedding module 63 and the other modules (e.g. 53 to 57) are configured as one organic neural network, so that end-to-end learning and speech synthesis may be performed. That is, all components 63, 53 to 57 of the speech synthesis model 43 may be learned at once through error backpropagation. According to this embodiment, various advantages of the end-to-end method can be secured compared to a conventional speech synthesis model implemented by integrating a plurality of independent modules. For example, the problem that model performance degrades as the losses of individual modules accumulate is resolved, learning becomes easier, and a high-performance speech synthesis model can be built with a smaller training dataset.

Meanwhile, the specific form of the emotion information may vary according to embodiments.

In some embodiments, the emotion information may be an emotion vector indicating a probability of one or more emotions. For example, when the emotion classes and the vector indices are defined as shown in the table 71 on the left side of FIG. 7, emotion information in which one emotion dominates and is mixed with a very small amount of another emotion can be expressed by the emotion vector 73 on the right side. According to the present embodiment, since detailed emotion information is used for learning, a speech synthesis model capable of finer emotion control and of generating complex emotional voices can be constructed.

In some other embodiments, the emotion information may be label information indicating a specific emotion. For example, when a label value corresponding to an emotion class is defined as shown in the table 81 shown on the left side of FIG. 8, emotion information indicating an emotion (happy) may be expressed as an emotion label 83 on the right. According to the present embodiment, since emotion label information that can be easily secured or generated is used as learning data, time and human cost required to secure the learning data can be reduced.
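
The two forms of emotion information described above, an emotion vector and an emotion label, can be sketched in Python as follows. The class list, its ordering, and the function names are hypothetical; the actual classes and indices are those defined in the tables of FIGS. 7 and 8, which are not reproduced here.

```python
# Hypothetical emotion classes; the real classes are defined in FIGS. 7 and 8.
EMOTION_CLASSES = ["neutral", "happy", "sad", "angry"]


def emotion_vector(weights):
    """Normalize per-class weights (a dict) into a probability vector."""
    raw = [float(weights.get(name, 0.0)) for name in EMOTION_CLASSES]
    total = sum(raw)
    if total <= 0:
        raise ValueError("at least one emotion weight must be positive")
    return [v / total for v in raw]


def emotion_label(name):
    """Return the integer label assigned to an emotion class name."""
    return EMOTION_CLASSES.index(name)


def one_hot(label):
    """Expand an integer label into a one-hot vector over the classes."""
    return [1.0 if i == label else 0.0 for i in range(len(EMOTION_CLASSES))]
```

Under these assumptions, a mostly happy voice with a small amount of sadness would be written as `emotion_vector({"happy": 9, "sad": 1})`, while the label form carries only a single class index.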

Meanwhile, according to some embodiments of the present disclosure, emotion information such as an emotion vector or an emotion label may be automatically generated by a machine learning model for classifying emotion classes. The machine learning model is a model that receives voice data or spectrogram data and outputs an emotion class. In this case, the emotion vector may be generated based on a confidence score for each emotion class output by the machine learning model, and the emotion label may be generated based on the final classification result of the machine learning model. According to the present embodiment, since emotion information is automatically generated, the time and human cost required to generate a training data set can be reduced.
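
One possible way to derive both forms of emotion information from the per-class scores of such a classifier is sketched below. The classifier itself is not shown, and the use of a softmax to turn raw scores into an emotion vector is an assumption made for illustration; the disclosure states only that the vector is based on confidence scores and the label on the final classification result.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def emotion_info_from_scores(scores):
    """Return (emotion_vector, emotion_label) for one utterance."""
    vector = softmax(scores)
    # The label is the index of the most probable class.
    label = max(range(len(vector)), key=vector.__getitem__)
    return vector, label
```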

With reference to FIG. 6 again, other components of the embedding module 51 will be described.

The speaker embedding module 65 is a module that embeds speaker information and converts it into a speaker embedding vector. In this case, the speaker information may be label information (refer to FIG. 9) indicating a specific speaker, but the technical scope of the present disclosure is not limited thereto.

In some embodiments, the speaker embedding module 65 may be implemented as a specific layer of the speech synthesis model 43. For example, the speaker embedding module 65 may be implemented as a fully connected layer or fully connected network located in front of the encoder neural network 53 and/or the decoder neural network 57. In this case, the speaker embedding module 65 and the other modules (e.g. 53 to 57) are configured as one organic neural network, so that end-to-end learning and speech synthesis can be performed. That is, all components 65, 53 to 57 of the speech synthesis model 43 may be learned at once through error backpropagation.

In some other embodiments, at least some of the above-described embedding modules 61 to 65 may be implemented as separate modules that independently perform an embedding function. That is, at least some of the embedding modules 61 to 65 may be implemented as a separately trained embedding module that is not affected by the learning of the speech synthesis model 43, or as a module that performs embedding through a mathematical algorithm without needing to be trained.

As shown in FIG. 6, the output vectors of the embedding modules 61 to 65 are input to the encoder neural network 53. At this time, at least some of the output vectors (e.g. a character embedding vector, an emotion embedding vector, and a speaker embedding vector) may be merged into a single vector and input to the encoder neural network 53. For example, a vector generated by concatenating an emotion and/or speaker embedding vector to a character embedding vector may be input to the encoder neural network 53. Of course, each of the output vectors may instead be input to the encoder neural network 53 independently, and this may vary according to how the input layer of the encoder neural network 53 is implemented.
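
The concatenation example above can be sketched in a few lines of Python. The function name and the choice to append the same emotion and speaker vectors to every character embedding in the sequence are assumptions for illustration; the disclosure leaves the exact merging scheme open.

```python
def concat_condition(char_embeddings, emotion_embedding, speaker_embedding=None):
    """Append the conditioning vectors to every character embedding."""
    condition = list(emotion_embedding)
    if speaker_embedding is not None:
        condition += list(speaker_embedding)
    # Each element of the returned sequence is one encoder input vector.
    return [list(char_vec) + condition for char_vec in char_embeddings]
```

With two 2-dimensional character embeddings, a 1-dimensional emotion embedding, and a 2-dimensional speaker embedding, the encoder input would be a sequence of two 5-dimensional vectors.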

With reference to FIG. 5 again, other components of the speech synthesis model 43 will be described.

The encoder neural network 53 is a neural network that receives an input sequence composed of one or more character embedding vectors, an emotion embedding vector and/or a speaker embedding vector, encodes input information, and outputs the encoded vector. As the learning progresses, the encoder neural network 53 understands the context according to the input sequence, the emotion embedding vector, and the speaker embedding vector, and outputs an encoded vector representing the understood context. The encoded vector may be referred to as a context vector in the art.

In some embodiments, the encoder neural network 53 and the decoder neural network 57 may be implemented as a recurrent neural network (RNN) so as to be suitable for receiving and outputting a sequence. For example, the encoder neural network 53 and the decoder neural network 57 may be implemented as a Long Short-Term Memory (LSTM) neural network 90 as shown in FIG. 9. However, the present disclosure is not limited thereto, and at least some of the encoder neural network 53 and the decoder neural network 57 may be implemented through self-attention, a transformer network, or the like. Those skilled in the art will be able to clearly understand self-attention and transformer networks, and detailed descriptions of these techniques will be omitted.

Referring back to FIG. 5, the attention module 55 is a module that provides attention information indicating which part of the encoded vector the decoder neural network 57 should focus on (or not focus on) when learning/predicting the output sequence. As the learning progresses, the attention module 55 may learn a mapping relationship between the encoded vector and the output sequence, and thereby provide attention information indicating the portions to be focused on and the portions not to be focused on during decoding. The attention information may be provided in the form of a weight vector (or weight matrix), but the technical scope of the present disclosure is not limited thereto. Those skilled in the art will be able to clearly understand the attention mechanism, and a detailed description thereof will be omitted.
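
For readers less familiar with the mechanism, attention information in the form of a weight vector can be sketched as follows. Dot-product scoring followed by a softmax is used here purely as an assumed example; the disclosure does not state which attention variant the attention module 55 uses, and the function and parameter names are hypothetical.

```python
import math


def attention(decoder_state, encoder_outputs):
    """Return (weights, context) for one decoding step."""
    # Score each encoder position by its dot product with the decoder state.
    scores = [sum(q * k for q, k in zip(decoder_state, enc))
              for enc in encoder_outputs]
    # A softmax turns the scores into attention weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the weighted sum of the encoder outputs.
    dim = len(encoder_outputs[0])
    context = [sum(w * enc[d] for w, enc in zip(weights, encoder_outputs))
               for d in range(dim)]
    return weights, context
```

A large weight marks an encoder position that the decoder should focus on at the current step, and a weight near zero marks a position to ignore.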

The decoder neural network 57 receives the encoded vector and the attention information and outputs an output sequence corresponding to the encoded vector. More specifically, the decoder neural network 57 predicts an output sequence associated with the emotional voice of a specific speaker using the encoded vector and the attention information. In this case, the output sequence may be composed of spectrogram data in units of frames, but the technical scope of the present disclosure is not limited thereto.

When the decoder neural network 57 is implemented as a recurrent neural network, the decoder neural network 57 may further receive spectrogram data of a previous frame as input and sequentially output spectrogram data of a current frame to construct the output sequence.
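
The frame-by-frame operation described above amounts to a simple loop in which each predicted frame is fed back as the next input. The sketch below assumes a hypothetical `step_fn` standing in for one step of the decoder neural network, an all-zero initial frame, and a fixed number of frames; none of these details are specified in the disclosure.

```python
def decode(step_fn, encoded, n_frames, n_bins):
    """Generate n_frames spectrogram frames, feeding each one back as input."""
    prev = [0.0] * n_bins  # assumed all-zero initial frame
    frames = []
    for _ in range(n_frames):
        # step_fn(previous_frame, encoded_vector) -> current frame
        prev = step_fn(prev, encoded)
        frames.append(prev)
    return frames


def toy_step(prev_frame, encoded):
    """Toy stand-in for the decoder: adds the encoded vector to the last frame."""
    return [p + e for p, e in zip(prev_frame, encoded)]
```

Calling `decode(toy_step, [1.0, 2.0], 3, 2)` produces three frames, each depending on the one before it, which is the dependency structure the paragraph describes.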

The spectrogram data is data representing a spectrogram of a voice signal, and may be STFT spectrogram data or mel-spectrogram data, but the technical scope of the present disclosure is not limited thereto.

For reference, the reason why the decoder neural network 57 is configured to output spectrogram data instead of a speech signal is that when learning is performed with spectrogram data, a prediction error can be calculated more accurately than that of a speech signal. In addition, since accurate prediction error calculation is possible, a speech synthesis model with superior performance can be constructed.

So far, a neural network structure and operation principle of the speech synthesis model 43 according to some embodiments of the present disclosure have been described with reference to FIGS. 5 to 9. Hereinafter, a process of learning the speech synthesis model 43 by the learning unit 41 will be described with reference to FIG. 10 based on the above description.

As shown in FIG. 10, each learning data 100 may include text 101, emotion information 102, speaker information 103, and correct answer voice data 104. At this time, the correct answer voice data 104 is voice data of a specific speaker (e.g. wav format audio) indicated by the speaker information 103, and corresponds to the text 101 and reflects the emotion information 102. Before learning is performed, the correct answer voice data 104 is converted into correct answer spectrogram data 106 through the preprocessor 23, and the text 101 is subjected to appropriate preprocessing by the preprocessor 23.

The process of learning the speech synthesis model 43 by the learning unit 41 is as follows. First, the preprocessed text 101 is input to the character embedding module 61, and the emotion information 102 and the speaker information 103 are input to the emotion embedding module 63 and the speaker embedding module 65, respectively. As a result, spectrogram data 105 predicted by the decoder neural network 57 is output.

The learning unit 41 compares the predicted spectrogram data 105 with the correct answer spectrogram data 106 to calculate a prediction error 107, and backpropagates the prediction error 107 to update the weights of the speech synthesis model 43. In this case, the weights of the encoder neural network 53, the attention module 55, and the decoder neural network 57 may be updated together through the backpropagation. When the embedding module 51 is implemented as some layers of a neural network, the weights of the embedding module 51 may also be updated. The learning unit 41 may build the speech synthesis model 43 by repeating this learning process for a plurality of training data.
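The compare-then-update cycle described above can be illustrated with a deliberately tiny stand-in. In this sketch the "model" has a single weight (`gain`), the error measure is the mean squared error, and the update is plain gradient descent; all three are assumptions made for illustration, since the disclosure does not fix the error measure or the optimizer.

```python
from typing import List

Spectrogram = List[List[float]]  # frames x frequency bins


def mean_squared_error(predicted: Spectrogram, target: Spectrogram) -> float:
    """Prediction error between two spectrograms of the same shape."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(predicted, target):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def predict(gain: float, features: Spectrogram) -> Spectrogram:
    """Toy model with one weight: scale every input value by `gain`."""
    return [[gain * x for x in frame] for frame in features]


def gradient_step(gain: float, features: Spectrogram,
                  target: Spectrogram, lr: float) -> float:
    """One gradient-descent update of `gain` on the mean squared error."""
    grad, count = 0.0, 0
    for f_frame, t_frame in zip(features, target):
        for x, t in zip(f_frame, t_frame):
            grad += 2.0 * (gain * x - t) * x  # d/d(gain) of (gain*x - t)^2
            count += 1
    return gain - lr * grad / count


features = [[1.0, 2.0], [3.0, 4.0]]
target = [[2.0, 4.0], [6.0, 8.0]]  # produced by a gain of 2.0
gain = 0.0
loss_before = mean_squared_error(predict(gain, features), target)
gain = gradient_step(gain, features, target, lr=0.05)
loss_after = mean_squared_error(predict(gain, features), target)
```

In the actual model the same error signal would flow back through the decoder, the attention module, and the encoder, so that all of their weights move in one update.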

So far, the learning unit 41 and the speech synthesis model 43 have been described with reference to FIGS. 5 to 10. In the following, description of other components of the speech synthesis unit 25 will be continued with reference to FIG. 4 again.

The synthesis unit 45 predicts and outputs spectrogram data using the speech synthesis model 43 learned by the learning unit 41. More specifically, the synthesis unit 45 inputs text for synthesis (for which no correct answer voice data exists), emotion information for synthesis, and speaker information for synthesis into the speech synthesis model 43 and, as a result, predicts the output sequence of the specific speaker indicated by the speaker information for synthesis. As described above, the output sequence may consist of, for example, frame-by-frame predicted spectrogram data. Here, the speaker information for synthesis may be label information indicating the specific speaker whose voice is to be synthesized, and the emotion information for synthesis may take the form of an emotion vector or an emotion label describing the emotion that the specific speaker is to express.
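Because the emotion information may arrive either as a label or as a vector, an implementation might normalize both forms into a vector before embedding. The following is a minimal sketch under that assumption; the emotion class set `EMOTIONS` and the function name are hypothetical.

```python
from typing import List, Sequence, Union

EMOTIONS = ("neutral", "happy", "sad", "angry")  # hypothetical class set


def to_emotion_vector(emotion: Union[str, Sequence[float]]) -> List[float]:
    """Return an emotion vector for either an emotion label or a vector."""
    if isinstance(emotion, str):
        if emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion label: {emotion!r}")
        # A label becomes a one-hot vector over the known classes.
        return [1.0 if name == emotion else 0.0 for name in EMOTIONS]
    vector = [float(v) for v in emotion]
    if len(vector) != len(EMOTIONS):
        raise ValueError("emotion vector has the wrong length")
    return vector
```

For example, `to_emotion_vector("sad")` and `to_emotion_vector([0.0, 0.0, 1.0, 0.0])` describe the same emotion in this sketch.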

Next, the vocoder unit 47 converts the predicted spectrogram data included in the output sequence into emotional voice data (e.g. wav format audio). The vocoder unit 47 may be implemented in any manner as long as it can perform this conversion. For example, the vocoder unit 47 may be implemented with one or more vocoder modules well known in the art (e.g. WaveNet, Griffin-Lim). In order not to obscure the subject matter of the present disclosure, further description of the vocoder unit 47 will be omitted.

Meanwhile, it should be noted that not all of the components shown in FIG. 2 or 4 may be essential components for implementing the speech synthesis apparatus 10. That is, the speech synthesis apparatus 10 according to some other embodiments of the present disclosure may be implemented by some of the components illustrated in FIG. 2 or 4.

Each component shown in FIG. 2 or 4 may mean software or hardware such as a Field Programmable Gate Array (FPGA) or an Application-Specific Integrated Circuit (ASIC). However, the components are not limited to software or hardware, and may be configured to reside in an addressable storage medium or to be executed by one or more processors. The functions provided by the above components may be implemented by more subdivided components, or may be implemented as one component that performs a specific function by combining a plurality of components.

So far, the speech synthesis apparatus 10 according to some embodiments of the present disclosure has been described with reference to FIGS. 2 to 10. As described above, since a neural network-based speech synthesis model is constructed by learning emotion information and emotion voice data, an emotion speech reflecting the emotion information may be synthesized through the speech synthesis model. In this method, a natural emotional voice can be generated, since the emotional voice is not synthesized by performing audio post-processing or combining voice fragments. In addition, a speech synthesis function capable of controlling emotion may be provided by changing emotion information input to the speech synthesis model. For example, the emotion contained in the synthesized voice may be adjusted by adjusting the type and/or strength and weakness of the emotion in the emotion information.
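As an illustration of adjusting the strength of an emotion, the sketch below interpolates between a neutral emotion vector and a target emotion vector. Linear interpolation is an assumption made here; the disclosure only states that the type and strength of the emotion in the emotion information can be adjusted, and the names used are hypothetical.

```python
from typing import List, Sequence


def scale_emotion(emotion: Sequence[float], neutral: Sequence[float],
                  strength: float) -> List[float]:
    """Blend `neutral` (strength 0.0) with `emotion` (strength 1.0)."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    return [(1.0 - strength) * n + strength * e
            for n, e in zip(neutral, emotion)]


neutral = [1.0, 0.0, 0.0]
happy = [0.0, 1.0, 0.0]
slightly_happy = scale_emotion(happy, neutral, strength=0.25)
```

The resulting vector would then be passed to the speech synthesis model in place of the original emotion information.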

Furthermore, since a speech synthesis model is constructed by learning speaker information and emotional voice data, emotional voices of a plurality of speakers can be synthesized through one speech synthesis model.

Hereinafter, a neural network structure of a modified speech synthesis model according to various embodiments of the present disclosure will be described with reference to FIGS. 11 to 15. In the following description of the embodiments, a description of a portion overlapping with the above-described speech synthesis model 43 will be omitted.

FIG. 11 illustrates a neural network structure of a modified speech synthesis model 110 according to the first embodiment of the present disclosure.

As illustrated in FIG. 11, the speech synthesis model 110 further includes an emotion embedding module 115 and a speaker embedding module 116 for the decoder neural network 114. That is, the decoder neural network 114 further receives output vectors (ie, the emotion embedding vector and the speaker embedding vector) of the emotion embedding module 115 and the speaker embedding module 116.

In some embodiments, the emotion embedding module 115 and the speaker embedding module 116 may be implemented as a specific layer (e.g. a fully connected layer) located in front of the decoder neural network 114.

In some other embodiments, the emotion embedding module 115 and the speaker embedding module 116 do not exist, and the speech synthesis model 110 may be implemented in a form in which the output vectors of the emotion embedding module and the speaker embedding module included in the embedding module 111 are input to the decoder neural network 114.

The embedding module 111 may include a character embedding module, an emotion embedding module, and a speaker embedding module, and the functions of the embedding module 111, the encoder neural network 112, the attention module 113, and the decoder neural network 114 are similar to those described above. The difference is that the decoder neural network 114 further receives an emotion embedding vector and a speaker embedding vector when outputting the output sequence.

FIG. 12 illustrates a neural network structure of a modified speech synthesis model 120 according to a second embodiment of the present disclosure.

As shown in FIG. 12, in the second embodiment, only the character embedding vector output from the character embedding module 121 is input to the encoder neural network 122, and the output vectors of the emotion embedding module 125 and the speaker embedding module 126 (i.e., the emotion embedding vector and the speaker embedding vector) are input to the decoder neural network 124.

In some embodiments, the emotion embedding module 125 and the speaker embedding module 126 may be implemented as a specific layer located in front of the decoder neural network 124.

The overall structure of the speech synthesis model 120 and the operation of each of the modules 121 to 126 are similar to those of the speech synthesis model 110 according to the first embodiment described above, except that the encoder neural network 122 uses only the character embedding vector as input.

FIG. 13 illustrates a neural network structure of a modified speech synthesis model 130 according to a third embodiment of the present disclosure.

As shown in FIG. 13, since the speech synthesis model 130 according to the third embodiment is a model for a single speaker, it does not include a speaker embedding module. Accordingly, the encoder neural network 133 uses only the output vectors (ie, the character embedding vector and the emotion embedding vector) of the character embedding module 131 and the emotion embedding module 132 as input values.

The overall structure of the speech synthesis model 130 and the operation of each of the modules 131 to 135 are similar to those of the above-described embodiments.

FIG. 14 illustrates a neural network structure of a modified speech synthesis model 140 according to a fourth embodiment of the present disclosure.

As shown in Fig. 14, the speech synthesis model 140 according to the fourth embodiment is also a model for a single speaker, similar to the third embodiment described above. Accordingly, the speech synthesis model 140 also does not include a speaker embedding module. However, in the fourth embodiment, the emotion embedding vector is further input to the decoder neural network 145.

In some embodiments, the emotion embedding module 146 may be implemented as a specific layer (e.g. a fully connected layer) located in front of the decoder neural network 145.

In some other embodiments, the emotion embedding module 146 does not exist, and the speech synthesis model 140 may be implemented in a form in which the emotion embedding vector of the emotion embedding module 142 is input to the decoder neural network 145.

In some other embodiments, the emotion embedding module 142 may be omitted. That is, in this embodiment, similar to the above-described second embodiment, only character embedding vectors are inputted to the encoder neural network 143, and emotion embedding vectors can be inputted only to the decoder neural network 145.

Meanwhile, the speech synthesis models 130 and 140 described with reference to FIGS. 13 and 14 may be constructed for each speaker. For example, as shown in FIG. 15, a first speech synthesis model 150-1 for synthesizing the voice of a first speaker, a second speech synthesis model 150-2 for synthesizing the voice of a second speaker, and so on up to an n-th speech synthesis model 150-n for synthesizing the voice of an n-th speaker may each be constructed separately.

So far, modified speech synthesis models 110 to 140 and 150-1 to 150-n according to various embodiments of the present disclosure have been described with reference to FIGS. 11 to 15. Various speech synthesis models (e.g. 43, 110 to 140, 150-1 to 150-n) have been described so far, but the effects achieved according to the configuration of each model may vary.

First, a model in which emotion information is input to an encoder neural network may more accurately control the speed of a synthesized speech according to the emotion information. In addition, the model in which the emotion information is input to the decoder neural network can more accurately adjust the tone or pitch of the synthesized speech according to the emotion information. In addition, in a model in which emotion information is input to an encoder and a decoder neural network, since the speed, tone, and pitch of the voice can all be accurately adjusted, a voice containing natural emotions can be synthesized as if a real person speaks.

In addition, since a model that further receives speaker information can synthesize speech for multiple speakers, the overall cost of model construction (e.g. computing cost for learning) can be reduced compared to building a separate speech synthesis model for each speaker. Moreover, because a synergy effect arises when learning is performed across a large number of speakers, a relatively high-performance speech synthesis model can be built even when the amount of learning data per speaker is small, and the cost of building the learning data can also be reduced.
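The differences among the models discussed so far can be summarized as a small configuration table. The sketch below encodes, for the default form of each model, which embedding vectors reach the encoder and which reach the decoder, as read from the description above. The class and field names are hypothetical, and the mapping from inputs to controllable aspects simply restates the effects described in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ConditioningConfig:
    emotion_to_encoder: bool
    emotion_to_decoder: bool
    speaker_to_encoder: bool
    speaker_to_decoder: bool


# Keys are the reference numerals of the speech synthesis models.
VARIANTS = {
    "43": ConditioningConfig(True, False, True, False),
    "110": ConditioningConfig(True, True, True, True),
    "120": ConditioningConfig(False, True, False, True),
    "130": ConditioningConfig(True, False, False, False),  # single speaker
    "140": ConditioningConfig(True, True, False, False),   # single speaker
}


def expected_control(cfg: ConditioningConfig) -> List[str]:
    """Aspects of the voice the description says are controlled more accurately."""
    aspects: List[str] = []
    if cfg.emotion_to_encoder:
        aspects.append("speed")
    if cfg.emotion_to_decoder:
        aspects += ["tone", "pitch"]
    return aspects
```

For instance, `expected_control(VARIANTS["120"])` lists only tone and pitch, because in the second embodiment the emotion embedding vector is input only to the decoder neural network.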

Hereinafter, a speech synthesis method according to some embodiments of the present disclosure will be described in detail with reference to FIGS. 16 and 17.

Each step of the speech synthesis method may be performed by a computing device. In other words, each step of the speech synthesis method may be implemented with one or more instructions executed by a processor of a computing device. All the steps included in the speech synthesis method may be performed by one physical computing device; alternatively, first steps of the method may be performed by a first computing device and second steps of the method may be performed by a second computing device. In the following, description will be continued on the assumption that each step of the speech synthesis method is performed by the speech synthesis device 10. However, for convenience of explanation, the description of the operation subject of each step included in the speech synthesis method may be omitted.

FIG. 16 is an exemplary flowchart illustrating a speech synthesis method according to some embodiments of the present disclosure. However, this is only a preferred embodiment for achieving the object of the present disclosure, and of course, some steps may be added or deleted as necessary.

As shown in FIG. 16, the speech synthesis method includes a learning process of constructing a speech synthesis model and a synthesis process of synthesizing speech using the speech synthesis model.

The learning process starts in step S100 of acquiring a learning dataset. At this time, each learning data included in the learning dataset is composed of text for learning, emotion information for learning, speaker information for learning, and correct answer voice data. Of course, in the case of constructing a speech synthesis model for a single speaker, the training speaker information may be excluded from the training dataset.

In step S200, a neural network-based speech synthesis model is constructed using the training dataset. Since the structure of the speech synthesis model has already been described above, further description will be omitted, and details of this step S200 will be described later with reference to FIG. 17.

The synthesis process starts in step S300 of obtaining data for synthesis. The synthesis data is composed of text for synthesis, emotion information for synthesis, and speaker information for synthesis. Of course, in the case of synthesizing speech for a single speaker, speaker information may be excluded from the synthesis data.

As described above, the emotion information for synthesis may be an emotion vector or an emotion label.

In step S400, an emotional voice of a specific speaker with respect to the text for synthesis is output using a speech synthesis model. In this case, the emotional voice means a voice in which the emotional information for synthesis is reflected.

More specifically, an output sequence composed of spectrogram data may be output from a speech synthesis model, and the emotional voice may be output by vocoding the output sequence.

In this step S400, the emotion of the synthesized emotion voice may be adjusted by adjusting the type of emotion or the strength of emotion on the emotion information. Also, by changing the speaker information, emotional voices of different speakers can be synthesized.

For reference, among the above-described steps S100 to S400, steps S100 and S200 may be performed by the input unit 21, the preprocessor 23, and the learning unit 41, and steps S300 and S400 may be performed by the input unit 21, the preprocessor 23, the synthesis unit 45, and the vocoder unit 47.
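The synthesis process (steps S300 and S400) can be sketched as a short pipeline of preprocessing, spectrogram prediction, and vocoding. Everything named below (`synthesize`, `dummy_model`, `dummy_vocoder`, and the use of `str.strip` as preprocessing) is a hypothetical stand-in; the learning process of steps S100 and S200 is not shown.

```python
from typing import Callable, List, Sequence

Spectrogram = List[List[float]]
Model = Callable[[str, Sequence[float], int], Spectrogram]
Vocoder = Callable[[Spectrogram], List[float]]


def synthesize(model: Model, vocoder: Vocoder, text: str,
               emotion: Sequence[float], speaker_id: int,
               preprocess: Callable[[str], str] = str.strip) -> List[float]:
    """Preprocess the text, predict a spectrogram, then vocode it."""
    clean_text = preprocess(text)
    spectrogram = model(clean_text, emotion, speaker_id)
    return vocoder(spectrogram)


def dummy_model(text: str, emotion: Sequence[float],
                speaker_id: int) -> Spectrogram:
    """Stand-in for a trained model: one single-bin frame per character."""
    return [[float(len(text))] for _ in text]


def dummy_vocoder(spectrogram: Spectrogram) -> List[float]:
    """Stand-in for a vocoder: flatten the frames into a sample list."""
    return [value for frame in spectrogram for value in frame]


waveform = synthesize(dummy_model, dummy_vocoder, "  hi ", [0.0, 1.0], 0)
```

Keeping the model and the vocoder as separate callables mirrors the division between the synthesis unit 45 and the vocoder unit 47, so either can be replaced independently.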

So far, a speech synthesis method according to some embodiments of the present disclosure has been described with reference to FIG. 16. Hereinafter, a method of constructing a speech synthesis model that can be performed in step S200 will be described in more detail with reference to FIG. 17.

FIG. 17 is an exemplary flowchart illustrating a method of constructing a speech synthesis model according to some embodiments of the present disclosure. However, this is only a preferred embodiment for achieving the object of the present disclosure, and of course, some steps may be added or deleted as necessary.

As shown in FIG. 17, the method of constructing the speech synthesis model begins in step S210 of performing pre-processing on the training text and the correct answer voice data. Since the contents of the pre-processing are the same as described above, further description will be omitted.

In step S220, the preprocessed text is converted into a character embedding vector, the learning emotion information is converted into an emotion embedding vector, and the learning speaker information is converted into a speaker embedding vector. The conversion process may be performed in an embedding module (e.g. 51 in FIG. 5) constituting the speech synthesis model (e.g. 43 in FIG. 5), or may be performed in a separate embedding module.

In step S230, the character embedding vector and the emotion embedding vector are input to an encoder neural network (e.g. 53 in FIG. 5) of the speech synthesis model, which encodes them and outputs the encoded vector. More precisely, an input sequence composed of the character embedding vector may be input to the encoder neural network.
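One simple way to combine the two kinds of embedding vectors into a single input sequence is to append the emotion embedding vector to every character embedding vector. Concatenation at each position is an assumption made for this sketch; the passage above does not specify how the vectors are combined, and the names and values are hypothetical.

```python
from typing import List

Vector = List[float]


def build_encoder_input(char_embeddings: List[Vector],
                        emotion_embedding: Vector) -> List[Vector]:
    """Append the same emotion embedding to every character embedding."""
    return [char_vec + emotion_embedding for char_vec in char_embeddings]


chars = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three characters, dimension 2
emotion = [1.0, 0.0]                          # emotion embedding, dimension 2
sequence = build_encoder_input(chars, emotion)
```

The sequence keeps one element per character, and each element grows by the dimension of the emotion embedding vector.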

In step S240, the encoded vector is input to a decoder neural network (e.g. 57 in FIG. 5) of the speech synthesis model, and predicted spectrogram data is output. In this case, the decoder neural network may further receive attention information from an attention module (e.g. 55 in FIG. 5) located between the encoder neural network and the decoder neural network. In addition, the decoder neural network may receive the predicted spectrogram data of a previous frame and further use it to output the predicted spectrogram data of a current frame. The predicted spectrogram data output sequentially in this way corresponds to the output sequence.
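To make the role of the attention information concrete, the sketch below computes attention weights over the encoder outputs and a weighted summary (context) of them for one decoder step. Dot-product scoring followed by a softmax is an assumption chosen for brevity; the disclosure does not specify the scoring function, and the names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state: Vector,
           encoded: List[Vector]) -> Tuple[List[float], Vector]:
    """Return (weights, context) for dot-product attention."""
    scores = [sum(q * k for q, k in zip(decoder_state, vec)) for vec in encoded]
    weights = softmax(scores)
    dim = len(encoded[0])
    context = [sum(w * vec[i] for w, vec in zip(weights, encoded))
               for i in range(dim)]
    return weights, context


encoded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 2.0], encoded)
```

The weights indicate which positions of the encoded sequence the decoder focuses on at the current frame; here the third position receives the largest weight because it has the highest score.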

In step S250, the weight of the speech synthesis model is updated by backpropagating the error between the correct answer spectrogram data and the predicted spectrogram data. In this case, weights of the encoder neural network and the decoder neural network may be updated at once through the error backpropagation. If an embedding module is included in the speech synthesis model, the weight of the embedding module may be updated as well.

As the above-described steps S210 to S250 are performed on a plurality of training data, a speech synthesis model may be constructed. In addition, the above-described steps S210 to S250 may be performed by the learning unit 41 and the speech synthesis model 43.

So far, a method of constructing a speech synthesis model according to some embodiments of the present disclosure has been described with reference to FIG. 17. According to the above-described method, a speech synthesis model capable of controlling emotion and capable of synthesizing speech for a plurality of speakers can be constructed. Hereinafter, an exemplary computing device 180 capable of implementing the speech synthesis device 10 according to some embodiments of the present disclosure will be described.

FIG. 18 is a hardware configuration diagram illustrating an exemplary computing device 180 capable of implementing the speech synthesis device 10 according to some embodiments of the present disclosure.

Referring to FIG. 18, the computing device 180 may include one or more processors 181, a bus 183, a communication interface 184, a memory 182 for loading a computer program executed by the processor 181, and a storage 185 for storing the computer program 186. However, only components related to the embodiment of the present disclosure are shown in FIG. 18. Accordingly, those of ordinary skill in the art to which the present disclosure pertains will recognize that other general-purpose components may be further included in addition to the components illustrated in FIG. 18.

The processor 181 controls the overall operation of each component of the computing device 180. The processor 181 may include a CPU (Central Processing Unit), an MPU (Micro Processor Unit), an MCU (Micro Controller Unit), a GPU (Graphic Processing Unit), or any type of processor well known in the art of the present disclosure. Also, the processor 181 may perform an operation on at least one application or program for executing the method according to the embodiments of the present disclosure. The computing device 180 may include one or more processors.

The memory 182 stores various types of data, commands and/or information. The memory 182 may load one or more programs 186 from the storage 185 in order to execute the speech synthesis method according to embodiments of the present disclosure. For example, when the computer program 186 is loaded in the memory 182, a module as shown in FIG. 2 may be implemented on the memory 182. The memory 182 may be implemented as a volatile memory such as RAM, but the technical scope of the present disclosure is not limited thereto.

The bus 183 provides communication functions between components of the computing device 180. The bus 183 may be implemented as various types of buses such as an address bus, a data bus, and a control bus.

The communication interface 184 supports wired/wireless Internet communication of the computing device 180. In addition, the communication interface 184 may support various communication methods other than Internet communication. To this end, the communication interface 184 may be configured to include a communication module well known in the technical field of the present disclosure.

According to some embodiments, the communication interface 184 may be omitted.

The storage 185 may non-temporarily store the one or more programs 186 and various data. For example, if the speech synthesis device 10 is implemented through the computing device 180, the various types of data may include data managed by the storage unit 25.

The storage 185 may include a nonvolatile memory such as a Read Only Memory (ROM), an Erasable Programmable ROM (EPROM), an Electrically Erasable Programmable ROM (EEPROM), or a flash memory, a hard disk, a removable disk, or any form of computer-readable recording medium well known in the technical field to which the present disclosure belongs.

Computer program 186 may include one or more instructions that when loaded into memory 182 cause processor 181 to perform a method/operation in accordance with various embodiments of the present disclosure. That is, the processor 181 may perform methods/operations according to various embodiments of the present disclosure by executing the one or more instructions.

For example, the computer program 186 may include instructions for performing an operation of acquiring a training dataset, an operation of constructing a speech synthesis model using the training dataset, an operation of acquiring synthesis data, and an operation of synthesizing an emotional voice for the synthesis data using the speech synthesis model.

Alternatively, the computer program 186 may include instructions for performing an operation of embedding the text for learning to convert it into a character embedding vector, an operation of embedding the emotion information for learning to convert it into an emotion embedding vector, an operation of inputting the character embedding vector and the emotion embedding vector to the encoder neural network and outputting an encoded vector, an operation of inputting the encoded vector to the decoder neural network and outputting predicted spectrogram data, and an operation of updating the speech synthesis model by backpropagating the error between the correct answer spectrogram data and the predicted spectrogram data.

In the above case, the speech synthesis apparatus 10 according to some embodiments of the present disclosure may be implemented through the computing device 180.

An exemplary computing device 180 capable of implementing the speech synthesis device 10 according to an embodiment of the present disclosure has been described so far with reference to FIGS. 1 to 18.

So far, various embodiments of the present disclosure and effects according to the embodiments have been mentioned with reference to FIGS. 1 to 18. The effects according to the technical idea of the present disclosure are not limited to the above-mentioned effects, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

The technical idea of the present disclosure described with reference to FIGS. 1 to 18 so far may be implemented as computer-readable code on a computer-readable medium. The computer-readable recording medium may be, for example, a removable recording medium (CD, DVD, Blu-ray disk, USB storage device, removable hard disk) or a fixed recording medium (ROM, RAM, hard disk built into a computer). The computer program recorded in the computer-readable recording medium may be transmitted to another computing device through a network such as the Internet and installed in the other computing device, thereby being used in the other computing device.

In the above, even if all the constituent elements constituting the embodiments of the present disclosure have been described as being combined into one or operating in combination, the technical idea of the present disclosure is not necessarily limited to these embodiments. That is, within the scope of the object of the present disclosure, all of the components may be selectively combined with one or more to operate.

Although the operations are illustrated in a specific order in the drawings, it should not be understood that the operations must be executed in the specific order shown or in a sequential order, or that all illustrated operations must be executed to obtain a desired result. In certain situations, multitasking and parallel processing may be advantageous. Moreover, the separation of the various components in the above-described embodiments should not be understood as necessitating such separation, and it should be understood that the program components and systems described may generally be integrated together into a single software product or packaged into multiple software products.

Although the embodiments of the present disclosure have been described with reference to the accompanying drawings, those of ordinary skill in the art will understand that the present disclosure may be implemented in other specific forms without changing its technical spirit or essential features. Therefore, it should be understood that the embodiments described above are illustrative in all respects and not limiting. The scope of protection of the present disclosure should be interpreted by the following claims, and all technical ideas within the scope equivalent thereto should be construed as being included in the scope of the technical ideas defined by the present disclosure.

Claims

A speech synthesis device comprising:

a preprocessor that performs preprocessing on text; and

a speech synthesis unit that inputs the preprocessed text and emotion information into a neural network-based speech synthesis model to synthesize, for the text, an emotional voice reflecting the emotion information,

wherein the speech synthesis model comprises an encoder neural network that encodes an input sequence constructed using the preprocessed text and the emotion information and outputs an encoded vector.
The speech synthesis device of claim 1,

wherein the speech synthesis model further comprises:

a decoder neural network that receives the encoded vector and outputs an output sequence associated with the emotional voice; and

an attention module positioned between the encoder neural network and the decoder neural network and configured to determine the portion of the encoded vector on which the decoder neural network is to focus.
The speech synthesis device of claim 1,

wherein the speech synthesis model further comprises a decoder neural network that receives the encoded vector and outputs an output sequence associated with the emotional voice, and

wherein the encoder neural network and the decoder neural network are implemented based on a recurrent neural network (RNN) or a self-attention technique.
The speech synthesis device of claim 1,

wherein the emotion information is an emotion vector indicating a probability for each of one or more emotions.
The speech synthesis device of claim 1,

wherein the emotion information is label information indicating a specific emotion class.
The speech synthesis device of claim 1,

wherein the speech synthesis model further comprises a decoder neural network that receives the encoded vector and outputs an output sequence associated with the emotional voice,

wherein the output sequence is composed of data in the form of a spectrogram, and

wherein the speech synthesis unit further comprises a vocoder unit that converts the output sequence into the emotional voice.
The speech synthesis device of claim 1,

wherein the speech synthesis model further comprises a decoder neural network that receives the encoded vector and outputs an output sequence associated with the emotional voice, and

wherein the decoder neural network further receives the emotion information as input when outputting the output sequence.
The speech synthesis device of claim 1,

wherein the speech synthesis model further comprises a decoder neural network that receives the encoded vector and outputs an output sequence associated with the emotional voice,

wherein the output sequence is composed of spectrogram data, and

wherein the speech synthesis unit trains the speech synthesis model by inputting training text preprocessed by the preprocessor into the speech synthesis model, comparing the resulting spectrogram data with correct answer spectrogram data to calculate an error value, and backpropagating the calculated error value.
The speech synthesis device of claim 8,

wherein the weights of the encoder neural network and the weights of the decoder neural network are updated together through the backpropagation.
The speech synthesis device of claim 1,

wherein the speech synthesis model further comprises a speaker embedding module that converts speaker information into a speaker embedding vector, and

wherein the speech synthesis unit inputs the speaker information into the speech synthesis model and outputs, as the emotional voice, a voice of a specific speaker indicated by the speaker information in which the emotion information is reflected.
The speech synthesis device of claim 10,

wherein the speaker embedding vector is input to the encoder neural network.
The speech synthesis device of claim 10,

wherein the speech synthesis model further comprises a decoder neural network that receives the encoded vector and outputs an output sequence associated with the emotional voice, and

wherein the speaker embedding vector is input to the decoder neural network.
A speech synthesis method comprising:

performing preprocessing on text; and

inputting the preprocessed text and emotion information into a neural network-based speech synthesis model to synthesize, for the text, an emotional voice reflecting the emotion information,

wherein synthesizing the emotional voice comprises encoding an input sequence constructed using the preprocessed text and the emotion information and outputting an encoded vector.
The speech synthesis method of claim 13,

wherein synthesizing the emotional voice further comprises:

receiving the encoded vector and outputting an output sequence associated with the emotional voice; and

determining, at a position between the encoder neural network and the decoder neural network, the portion of the encoded vector on which the decoder neural network is to focus.
The speech synthesis method of claim 13,

further comprising receiving the encoded vector and outputting an output sequence associated with the emotional voice,

wherein outputting the encoded vector and outputting the output sequence are implemented based on a recurrent neural network (RNN) or a self-attention technique.
The speech synthesis method of claim 13,

wherein the emotion information is an emotion vector indicating a probability for each of one or more emotions.
The speech synthesis method of claim 13,

wherein the emotion information is label information indicating a specific emotion class.