WO2023033237A1

WO2023033237A1 - Multi-style speech synthesis system capable of prosody control using style tag described in natural language

Info

Publication number: WO2023033237A1
Application number: PCT/KR2021/015743
Authority: WO
Inventors: 김남수; 김민찬
Original assignee: 서울대학교산학협력단
Priority date: 2021-08-31
Filing date: 2021-11-03
Publication date: 2023-03-09
Also published as: KR102486106B1

Abstract

According to the multi-style speech synthesis system capable of prosody control using style tags described in natural language proposed in the present invention, when synthesizing styled speech, a user can intuitively and easily control the style of the speech with a style tag, free of the limit on the number of expressible styles imposed by existing style labels and of the hassle of finding and inputting a reference speech for every utterance.

Description

A multi-style speech synthesis system that can adjust prosody using style tags described in natural language

The present invention relates to a multi-style speech synthesis system, and more specifically to a multi-style speech synthesis system capable of prosody control using style tags described in natural language, which allow a user to adjust the style of a voice intuitively and conveniently.

Speech synthesis is a technology used in various fields such as audiobooks, video editing, and AI speakers. Markets and Markets estimated the speech synthesis market at about $1.3 billion as of 2017 and predicted that it would reach $3.03 billion in 2022 at an annual growth rate of 15.2%. As the market grows, demand is also increasing for style speech synthesis technology that goes beyond simple speech synthesis and speaks with various emotions and voice tones according to the situation.

Recently, deep-learning-based end-to-end speech synthesis systems trained on datasets of paired text and voice have shown quality comparable to that of a real human voice. Such an end-to-end speech synthesis system mainly consists of an acoustic model that generates a mel-spectrogram, a frequency-domain representation of voice, from text, and a vocoder that generates voice from the mel-spectrogram. The acoustic model is mainly implemented with a deep-learning-based generative model. A representative example is an autoregressive model using an attention mechanism, such as Tacotron. Because an autoregressive model generates the mel-spectrogram sequentially, frame by frame, its generation speed is relatively slow.

Recently, non-autoregressive models, which stretch the text to the length of the speech and then generate all frames at once, have been actively studied. Representative non-autoregressive speech synthesis models include FastSpeech and Glow-TTS. Because such a model generates all frames simultaneously, its generation speed is very fast. Such models initially had shortcomings relative to autoregressive models, but these have been mitigated as deep-learning-based generative models have improved, and models that outperform autoregressive ones have recently appeared.

Style speech synthesis refers to a synthesis technique in which the style of the voice to be synthesized can be adjusted. Here, style means elements that convey additional information independently of the content of the speech, such as emotion, tone of voice, intention, and speaker. Style speech synthesis requires a style-related input to control the speech style. However, because style is not a concept that can be clearly delineated, defining this input is difficult, and a style label or a reference voice is therefore mainly used. When style labels are used, every item must be labeled at data collection time with the style of the corresponding voice.

When a reference voice is used, on the other hand, a reference encoder extracts a reference embedding from the reference voice. The speech synthesis model is conditioned on this reference embedding so that it can reflect the corresponding style during synthesis, and the reference encoder is trained to extract the style information needed for synthesis. During training, the target voice to be generated is mainly used as the reference voice; during generation, a voice having the desired style is used as the reference voice. For style speech synthesis of this kind, various end-to-end speech synthesis models can be modified and used.

Thus, the style input for style speech synthesis is mainly a style label or a reference voice. A style label has the advantage that a desired style can be selected conveniently, but it is limited to styles within a predetermined set of categories, which is a major limitation in expressing diverse speech styles. A very large set of categories alleviates this problem somewhat, but because the user must choose within the set, deciding becomes difficult when there are many options. Conversely, a reference voice has the advantage of being able to express any style of voice, but it is inconvenient because a reference voice must be selected for every generation, which involves checking the reference voice. It is also difficult to guarantee that, among the various properties of the reference voice, exactly the property the user wants is extracted.

FIG. 1 is a diagram schematically showing the configuration of a conventional end-to-end speech synthesis system, FIG. 2 is a diagram showing the configuration of a conventional dataset consisting of text and voice, and FIG. 3 is a diagram showing the configuration of a style input made with a conventional style label or reference voice.

The present invention is proposed to solve the above problems of the previously proposed methods. An object of the present invention is to provide a multi-style speech synthesis system capable of prosody control using style tags described in natural language, which includes a style tag encoder that receives a style tag as input and extracts and outputs a style embedding; an end-to-end speech synthesizer that, as a model for extracting a mel-spectrogram from text, generates a mel-spectrogram reflecting style information using the style embedding input from the style tag encoder; and a vocoder that extracts voice from the mel-spectrogram reflecting the style information input from the end-to-end speech synthesizer. With this configuration, when synthesizing styled speech, the user can intuitively and easily adjust the style of the voice with a style tag, free of the limit on the number of expressible styles caused by existing style labels and of the hassle of finding and inputting a reference voice for every utterance.

Another object of the present invention is to provide a multi-style speech synthesis system capable of prosody control using style tags described in natural language, which uses a style tag given as text as the style input for style speech synthesis, extracts a semantically meaningful embedding from the style tag with a pre-trained language model, and uses that embedding as the style input of the speech synthesizer, thereby providing intuitive and convenient style speech synthesis, and which, through the generalization ability of the language model, can also extract meaningful information from style tags not used during training so that their style is reflected.

Yet another object of the present invention is to provide a multi-style speech synthesis system capable of prosody control using style tags described in natural language, in which the style tag encoder further includes a reference encoder so that the embedding extracted from the reference speech and the embedding extracted from the style tag are modeled in the same space and, once training is complete, either the reference speech or the style tag can be used. This amounts to adding a new style interface to the existing method, and the system can be applied to various existing speech synthesis services in an upward-compatible manner, further improving convenience and efficiency of use.

To achieve the above objects, a multi-style speech synthesis system capable of prosody control using style tags described in natural language according to a feature of the present invention includes:

a style tag encoder that receives a style tag as input and extracts and outputs a style embedding;

an end-to-end speech synthesizer that, as a model for extracting a mel-spectrogram from text, generates a mel-spectrogram reflecting style information using the style embedding input from the style tag encoder; and

a vocoder that extracts voice from the mel-spectrogram reflecting the style information input from the end-to-end speech synthesizer.

Preferably, the style tag encoder may include:

a neural-network-based language model trained on a large amount of text data, which maps input text to a meaningful embedding space; and

an adaptive layer that receives the embedding extracted by the language model, transforms it into a form suitable for style speech synthesis, and outputs the resulting style embedding to the end-to-end speech synthesizer.

More preferably, the language model may map texts with similar meanings to adjacent regions of the embedding space and, through this mapping characteristic, enable synthesis of a voice having the corresponding style even when a style tag not used in training is input.

Even more preferably, the language model may be implemented as a SentenceBERT (SBERT) model, which maps input sentences to a meaningful embedding space so that sentences with similar meanings are located adjacently.

More preferably, the adaptive layer may have a multi-layer perceptron (MLP) network structure that receives the embedding extracted by the language model and maps it to a style embedding in a form suitable for style speech synthesis.

Preferably, the style tag encoder may further include a reference encoder that receives a reference voice as input and outputs a style embedding.

More preferably, the end-to-end speech synthesizer may include:

a text encoder that converts the input text into a text embedding and, using the duration information of each phonetic symbol of the input text, extends the text embedding to the length of the mel-spectrogram; and

a mel decoder that synthesizes and outputs a mel-spectrogram using the text embedding lengthened by the text encoder and the style embedding extracted by the style tag encoder.

According to the multi-style speech synthesis system capable of prosody control using style tags described in natural language proposed in the present invention, the system includes a style tag encoder that receives a style tag as input and extracts and outputs a style embedding; an end-to-end speech synthesizer that, as a model for extracting a mel-spectrogram from text, generates a mel-spectrogram reflecting style information using the style embedding input from the style tag encoder; and a vocoder that extracts voice from the mel-spectrogram reflecting the style information input from the end-to-end speech synthesizer. As a result, when synthesizing styled speech, the user can intuitively and easily adjust the style of the voice with a style tag, free of the limit on the number of expressible styles caused by existing style labels and of the hassle of finding and inputting a reference voice for every utterance.

In addition, according to the multi-style speech synthesis system capable of prosody control using style tags described in natural language of the present invention, a style tag given as text is used as the style input for style speech synthesis: a pre-trained language model extracts a semantically meaningful embedding from the style tag, and that embedding is used as the style input of the speech synthesizer, providing intuitive and convenient style speech synthesis. Through the generalization ability of the language model, meaningful information can also be extracted from style tags not used during training so that their style is reflected.

In addition, according to the multi-style speech synthesis system capable of prosody control using style tags described in natural language of the present invention, the style tag encoder further includes a reference encoder, so that the embedding extracted from the reference speech and the embedding extracted from the style tag are modeled in the same space and, once training is complete, either the reference speech or the style tag can be used. This amounts to adding a new style interface to the existing method, and the system can be applied to various existing speech synthesis services in an upward-compatible manner, further improving convenience and efficiency of use.

FIG. 1 is a diagram schematically showing the configuration of a conventional end-to-end speech synthesis system.

FIG. 2 is a diagram showing the configuration of a conventional dataset consisting of text and voice.

FIG. 3 is a diagram showing the configuration of a style input made with a conventional style label or reference voice.

FIG. 4 is a diagram showing, as functional blocks, the configuration of a multi-style speech synthesis system capable of prosody control using style tags described in natural language according to an embodiment of the present invention.

FIG. 5 is a diagram showing, as functional blocks, the configuration of the style tag encoder of the multi-style speech synthesis system according to an embodiment of the present invention.

FIG. 6 is a diagram showing, as functional blocks, the configuration of the end-to-end speech synthesizer of the multi-style speech synthesis system according to an embodiment of the present invention.

FIG. 7 is a diagram schematically showing the overall configuration of the multi-style speech synthesis system according to an embodiment of the present invention.

FIG. 8 is a diagram showing the detailed configurations of the style tag encoder and the end-to-end speech synthesizer of the multi-style speech synthesis system according to an embodiment of the present invention.

FIG. 9 is a diagram showing the style embedding space of the trained model of the multi-style speech synthesis system according to an embodiment of the present invention.

100: Multi-style voice synthesis system according to an embodiment of the present invention

110: style tag encoder

111: language model

112: adaptive layer

113: reference encoder

120: end-to-end speech synthesizer

121: text encoder

122: Mel decoder

130: Vocoder

Hereinafter, preferred embodiments will be described in detail so that those skilled in the art can easily practice the present invention with reference to the accompanying drawings. However, in describing a preferred embodiment of the present invention in detail, if it is determined that a detailed description of a related known function or configuration may unnecessarily obscure the gist of the present invention, the detailed description will be omitted. In addition, the same reference numerals are used throughout the drawings for parts having similar functions and actions.

In addition, throughout the specification, when a part is said to be 'connected' to another part, this includes not only the case where it is 'directly connected' but also the case where it is 'indirectly connected' with another element in between. Likewise, 'including' a certain component means that other components may be further included, rather than excluded, unless otherwise specified.

FIG. 4 shows, as functional blocks, the configuration of a multi-style speech synthesis system capable of prosody control using style tags described in natural language according to an embodiment of the present invention; FIG. 5 shows, as functional blocks, the configuration of the style tag encoder of the system; FIG. 6 shows, as functional blocks, the configuration of the end-to-end speech synthesizer of the system; FIG. 7 schematically shows the overall configuration of the system; FIG. 8 shows the detailed configurations of the style tag encoder and the end-to-end speech synthesizer; and FIG. 9 shows the style embedding space of the trained model. As shown in FIGS. 4 to 9, the multi-style speech synthesis system 100 capable of prosody control using style tags described in natural language according to an embodiment of the present invention may include a style tag encoder 110, an end-to-end speech synthesizer 120, and a vocoder 130.

The style tag encoder 110 receives a style tag as input and extracts and outputs a style embedding. As shown in FIG. 5, the style tag encoder 110 may include a language model 111, which is a neural-network-based language model trained on a large amount of text data and maps input text to a meaningful embedding space, and an adaptive layer 112, which receives the embedding extracted by the language model 111, transforms it into a form suitable for style speech synthesis, and outputs the resulting style embedding to the end-to-end speech synthesizer 120. Here, a style tag is a word or short text phrase that expresses a speech style, such as the emotion of the voice to be synthesized (e.g., #cheerful, #gloomy voice).

The language model 111 maps texts with similar meanings to adjacent regions of the embedding space and, through this mapping characteristic, can enable synthesis of a voice having the corresponding style even for style tags that were not used during training. Here, the language model 111 may be implemented as a SentenceBERT (SBERT) model, which maps input sentences to a meaningful embedding space so that sentences with similar meanings are located adjacently.
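As a non-limiting illustration of why adjacency in the embedding space helps with unseen style tags, the following minimal sketch compares an unseen tag with seen tags by cosine similarity. The three-dimensional vectors, the tag names, and the helper `nearest_seen_tag` are hypothetical stand-ins chosen for readability; they are not outputs of an actual SBERT model, and the embodiment does not perform an explicit nearest-neighbor lookup.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings for tags seen during training.
# A real sentence encoder would produce much higher-dimensional vectors.
SEEN_TAG_EMBEDDINGS = {
    "cheerful": [0.9, 0.1, 0.0],
    "gloomy": [-0.8, 0.2, 0.1],
    "angry": [0.0, -0.9, 0.3],
}

def nearest_seen_tag(embedding, seen=SEEN_TAG_EMBEDDINGS):
    """Return the seen tag whose embedding is most similar to `embedding`."""
    return max(seen, key=lambda tag: cosine_similarity(embedding, seen[tag]))

# Hypothetical embedding of an unseen tag such as "joyful".
joyful = [0.85, 0.2, 0.05]
print(nearest_seen_tag(joyful))  # cheerful
```

Because the unseen tag lands near a seen tag with a similar meaning, a downstream network that is smooth over the embedding space can be expected to produce a similar style for it.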

The adaptive layer 112 receives the embedding extracted by the language model 111 and outputs a style embedding transformed into a form suitable for style speech synthesis. It may have a multi-layer perceptron (MLP) network structure that maps the output of the language model 111 to the style embedding.
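For illustration only, an MLP of this kind can be sketched as a stack of affine maps with a nonlinearity between them. The layer sizes, the ReLU activation, the random initialization, and the names `init_layer` and `adaptive_layer` are assumptions made for this sketch; the specification does not state the number of layers, their widths, or the activation function.

```python
import random

def linear(vector, weights, bias):
    """Affine map: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in vector]

def init_layer(in_dim, out_dim, rng):
    """Small random weights and zero bias; a trained system would learn these."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def adaptive_layer(sentence_embedding, layers):
    """Apply each affine layer, with ReLU between layers but not after the last."""
    hidden = sentence_embedding
    for index, (weights, bias) in enumerate(layers):
        hidden = linear(hidden, weights, bias)
        if index < len(layers) - 1:
            hidden = relu(hidden)
    return hidden

rng = random.Random(0)
SENTENCE_DIM, HIDDEN_DIM, STYLE_DIM = 8, 16, 4  # hypothetical sizes
layers = [init_layer(SENTENCE_DIM, HIDDEN_DIM, rng),
          init_layer(HIDDEN_DIM, STYLE_DIM, rng)]

# Stand-in for a language-model output.
sentence_embedding = [rng.uniform(-1.0, 1.0) for _ in range(SENTENCE_DIM)]
style_embedding = adaptive_layer(sentence_embedding, layers)
print(len(style_embedding))  # 4
```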

The style tag encoder 110 may further include a reference encoder 113 that receives a reference voice as input and outputs a style embedding. The reference encoder 113 is a network that extracts a style embedding from a reference speech, and is a module used in existing style speech synthesis technology based on unsupervised learning. During model training, the style embedding output from the reference encoder 113 is input to the end-to-end speech synthesizer 120, and the style tag embedding is trained to be close to the reference embedding. In this way, the style tag embedding learns the average characteristics of the reference voices that share the same style tag. This method also allows a reference voice, instead of a style tag, to be used for synthesis when necessary.

The end-to-end speech synthesizer 120, as a model for extracting a mel-spectrogram from text, generates a mel-spectrogram reflecting style information using the style embedding input from the style tag encoder 110. As shown in FIG. 6, the end-to-end speech synthesizer 120 may include a text encoder 121, which converts the input text into a text embedding and, using the duration information of each phonetic symbol of the input text, extends the text embedding to the length of the mel-spectrogram, and a mel decoder 122, which synthesizes and outputs a mel-spectrogram using the text embedding lengthened by the text encoder 121 and the style embedding extracted by the style tag encoder 110.

The vocoder 130 extracts voice from the mel-spectrogram reflecting the style information input from the end-to-end speech synthesizer 120. Like the end-to-end speech synthesizer 120, the vocoder 130 may use various deep-learning-based models. Because the style tag given as text serves as the style input, the voice output by the vocoder 130 reflects that style.
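The data flow among the three components described above can be sketched as follows. This is only a structural illustration: the function `synthesize` and the three toy stages are hypothetical placeholders with arbitrary arithmetic, not neural networks, and they only show which output feeds which input.

```python
from typing import Callable, List

Vector = List[float]
Mel = List[Vector]  # frames x mel bins

def synthesize(text: str,
               style_tag: str,
               style_tag_encoder: Callable[[str], Vector],
               speech_synthesizer: Callable[[str, Vector], Mel],
               vocoder: Callable[[Mel], List[float]]) -> List[float]:
    """Style tag -> style embedding -> styled mel-spectrogram -> voice."""
    style_embedding = style_tag_encoder(style_tag)
    mel = speech_synthesizer(text, style_embedding)
    return vocoder(mel)

# Toy placeholder stages (assumptions for illustration only).
def toy_style_tag_encoder(tag: str) -> Vector:
    return [float(len(tag)), 1.0]

def toy_speech_synthesizer(text: str, style: Vector) -> Mel:
    # One frame per character, each frame shifted by the style embedding.
    return [[style[0] + i, style[1]] for i, _ in enumerate(text)]

def toy_vocoder(mel: Mel) -> List[float]:
    # One output sample per frame.
    return [sum(frame) for frame in mel]

waveform = synthesize("hi", "#cheerful",
                      toy_style_tag_encoder,
                      toy_speech_synthesizer,
                      toy_vocoder)
print(waveform)  # [10.0, 11.0]
```

Passing the stages as callables mirrors the point made in the description that the synthesizer and the vocoder may each be replaced by various deep-learning-based models.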

FIG. 8 shows the detailed configurations of the style tag encoder and the end-to-end speech synthesizer of the multi-style speech synthesis system capable of prosody control using style tags described in natural language according to an embodiment of the present invention, and FIG. 9 shows the style embedding space of the trained model of the system. Hereinafter, a specific example of the multi-style speech synthesis system according to an embodiment of the present invention will be described with reference to the accompanying drawings.

For the experiment on the multi-style speech synthesis system capable of prosody control using style tags described in natural language according to an embodiment of the present invention, a Korean speech dataset composed of voice, text, and style tags was collected. The dataset contains about 26 hours of speech from a single female speaker and about 327 style tags, which represent emotions, intentions, and voice tones.

The model consists of the style tag encoder 110 and the end-to-end speech synthesizer 120. The SentenceBERT (SBERT) model is used as the language model 111 of the style tag encoder 110; this model maps input sentences to a meaningful embedding space so that sentences with similar meanings are located adjacently. The language model 111 is an SBERT model pre-trained on a large amount of text data and is not trained further when the speech synthesis system is built. The adaptive layer 112 of the style tag encoder 110 is a network that maps the output of the language model 111 to the style embedding and has a multi-layer perceptron (MLP) structure.

In the experiment of this embodiment, the reference encoder 113, which takes a reference voice, was also used in the style tag encoder 110. The reference encoder 113 is a network that extracts a style embedding from a reference speech, and is a module used in existing style speech synthesis technology based on unsupervised learning. During training, the style embedding output from the reference encoder 113 is input to the end-to-end speech synthesizer 120, and the style tag embedding is trained to be close to the reference embedding. In this way, the style tag embedding can learn the average characteristics of the reference voices that share the same style tag. This method also has the advantage that a reference voice, instead of a style tag, can be used for synthesis as needed.

As the end-to-end speech synthesizer 120, a non-autoregressive speech synthesizer of the kind that has recently been actively researched was used. The model structure was newly devised for the experiment and consists largely of a text encoder 121 and a mel decoder 122. The text encoder 121 is a module that converts the input text into a text embedding. The text embedding is extended to the length of the mel-spectrogram using the duration information of each phonetic symbol. This duration information is obtained through the monotonic alignment search (MAS) algorithm during training and from a duration predictor during generation. The mel decoder 122 then synthesizes the mel-spectrogram using the lengthened text embedding and the previously extracted style embedding.
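The lengthening of the text embedding by per-symbol duration can be illustrated with the minimal sketch below, in which each phonetic-symbol embedding is repeated for as many frames as its duration. The function name `expand_by_duration` and the example values are hypothetical; in the embodiment the durations come from the alignment algorithm during training or from a duration predictor during generation, neither of which is modeled here.

```python
def expand_by_duration(symbol_embeddings, durations):
    """Repeat each symbol embedding `duration` times to reach frame length."""
    if len(symbol_embeddings) != len(durations):
        raise ValueError("one duration is required per phonetic symbol")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative frame counts")
    frames = []
    for embedding, duration in zip(symbol_embeddings, durations):
        # Copy the vector so that frames do not share the same list object.
        frames.extend(list(embedding) for _ in range(duration))
    return frames

# Three phonetic symbols with hypothetical 2-dimensional embeddings.
symbol_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]  # frames per symbol (hypothetical values)

frames = expand_by_duration(symbol_embeddings, durations)
print(len(frames))  # 6, the sum of the durations
```

The resulting sequence has one vector per mel-spectrogram frame, which is what allows a non-autoregressive decoder to generate all frames at once.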

The model is trained to reduce the L1 distance between the mel-spectrogram that the model outputs for the input text and style tag and the ground-truth mel-spectrogram, with an added objective term that reduces the L2 distance between the style tag embedding and the reference embedding.
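A minimal sketch of such a two-term objective is given below. Several details are assumptions made for illustration because the specification does not state them: the L1 term is averaged over all spectrogram bins, the L2 term is the unsquared Euclidean distance, and the two terms are combined with a hypothetical weight `style_weight` that defaults to 1.

```python
import math

def l1_distance(predicted_mel, target_mel):
    """Mean absolute difference over all frames and mel bins."""
    diffs = [abs(p - t)
             for pred_frame, target_frame in zip(predicted_mel, target_mel)
             for p, t in zip(pred_frame, target_frame)]
    return sum(diffs) / len(diffs)

def l2_distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def training_objective(predicted_mel, target_mel,
                       tag_embedding, reference_embedding,
                       style_weight=1.0):
    """Reconstruction term plus a term pulling the tag embedding
    toward the reference embedding (weighting is an assumption)."""
    reconstruction = l1_distance(predicted_mel, target_mel)
    style = l2_distance(tag_embedding, reference_embedding)
    return reconstruction + style_weight * style

predicted_mel = [[1.0, 2.0], [3.0, 4.0]]
target_mel = [[1.5, 2.0], [2.0, 4.0]]
tag_embedding = [0.0, 0.0]
reference_embedding = [3.0, 4.0]

loss = training_objective(predicted_mel, target_mel,
                          tag_embedding, reference_embedding)
print(loss)  # 0.375 + 5.0 = 5.375
```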

The style embedding space of the trained model is visualized in FIG. 9 using t-SNE, where each point is a style tag. When the four subregions are enlarged, it can be seen that tags with similar properties are placed adjacent to one another and, in particular, that style tags not seen during training are also mapped appropriately.

In addition, a listening evaluation with 18 participants was conducted to check whether the style tag was appropriately reflected in the synthesized speech. The Tacotron2-GST model was used as the baseline. Because Tacotron2-GST is a style speech synthesizer that takes a reference voice, the average of the reference embeddings sharing a given style tag was used as its input corresponding to that style tag. A Comparative Mean Opinion Score (CMOS) test was then conducted, in which listeners heard the synthesized speech of the two models and rated how well each reflected the given style tag. Ratings were given on a scale of -5 to +5 (positive values favor the model of the invention, negative values favor the baseline), and the result was +1.37. The experiment thus confirmed that the style tags form a well-structured embedding space and that the model reflects style well in actual listening.
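The averaging of reference embeddings that share a style tag, used above to give the baseline an input corresponding to a tag, can be sketched as follows. The function name `mean_embedding_per_tag`, the tag names, and the two-dimensional vectors are hypothetical; real reference embeddings would be produced by the baseline's reference encoder.

```python
from collections import defaultdict

def mean_embedding_per_tag(samples):
    """Average the reference embeddings of all utterances sharing a tag.

    `samples` is an iterable of (style_tag, reference_embedding) pairs.
    """
    grouped = defaultdict(list)
    for tag, embedding in samples:
        grouped[tag].append(embedding)
    return {
        tag: [sum(column) / len(embeddings) for column in zip(*embeddings)]
        for tag, embeddings in grouped.items()
    }

# Hypothetical reference embeddings for three utterances.
samples = [
    ("cheerful", [1.0, 2.0]),
    ("cheerful", [3.0, 4.0]),
    ("gloomy", [0.0, 1.0]),
]

tag_inputs = mean_embedding_per_tag(samples)
print(tag_inputs["cheerful"])  # [2.0, 3.0]
```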

In this way, the multi-style speech synthesis system 100 capable of prosody control using style tags described in natural language according to an embodiment of the present invention uses style tags described in natural language as the style input. A style tag has the advantage that the user can intuitively input a speech style suited to the purpose without being limited to a specific set of categories. In particular, because natural language processing technology makes it possible to generate styles corresponding to style tags that were not used during training, users can freely enter styles without being bound to specific ones. Since the system retains the functions of existing speech synthesis technologies while providing additional convenience to users, it can be widely adopted as a replacement in the existing speech synthesis market.

In addition, the present invention can be applied to various speech synthesis services, and because the speech style can be adjusted with natural language that humans intuitively understand, it can provide much greater user convenience than other existing speech synthesis systems. The present invention can be widely used for various services such as AI assistants, audiobooks, and entertainment.

As described above, the multi-style speech synthesis system capable of prosody control using style tags described in natural language according to an embodiment of the present invention includes a style tag encoder that receives a style tag as input and extracts and outputs a style embedding; an end-to-end speech synthesizer that, as a model for extracting a mel-spectrogram from text, generates a mel-spectrogram reflecting style information using the style embedding input from the style tag encoder; and a vocoder that extracts voice from the mel-spectrogram reflecting the style information input from the end-to-end speech synthesizer. As a result, when synthesizing styled speech, users can intuitively and easily adjust the style of the voice with style tags, free of the limit on the number of expressible styles caused by existing style labels and of the hassle of finding and inputting a reference voice for every utterance. In particular, a style tag given as text is used as the style input for style speech synthesis: a pre-trained language model extracts a semantically meaningful embedding from the style tag, and that embedding is used as the style input of the speech synthesizer, providing intuitive and convenient style speech synthesis. Through the generalization ability of the language model, meaningful information can also be extracted from style tags not used during training so that their style is reflected.

In addition, by further including a reference encoder in the style tag encoder, the embedding extracted from the reference speech and the embedding extracted from the style tag can be modeled in the same space, and once training is complete, either the reference speech or the style tag can be used. This amounts to adding a new style interface to the existing method, and the system can be applied to various existing speech synthesis services in an upward-compatible manner, further improving convenience and efficiency of use.

The present invention described above can be variously modified or applied by those skilled in the art to which the present invention belongs, and the scope of the technical idea according to the present invention should be defined by the claims below.

Claims

1. A multi-style speech synthesis system 100 capable of prosody control using style tags described in natural language, comprising:

a style tag encoder 110 that receives a style tag as input and extracts and outputs a style embedding;

an end-to-end speech synthesizer 120 that, as a model for extracting a mel-spectrogram from text, generates a mel-spectrogram reflecting style information using the style embedding input from the style tag encoder 110; and

a vocoder 130 that extracts voice from the mel-spectrogram reflecting the style information input from the end-to-end speech synthesizer 120.

2. The system of claim 1, wherein the style tag encoder 110 comprises:

a language model 111, which is a neural-network-based language model trained on a large amount of text data and maps input text to a meaningful embedding space; and

an adaptive layer 112 that receives the embedding extracted by the language model 111, transforms it into a form suitable for style speech synthesis, and outputs the resulting style embedding to the end-to-end speech synthesizer 120.

3. The system of claim 2, wherein the language model 111 maps texts with similar meanings to adjacent regions of the embedding space and, through this mapping characteristic, enables synthesis of a voice having the corresponding style even when a style tag not used in training is input.

4. The system of claim 3, wherein the language model 111 is implemented as a SentenceBERT (SBERT) model that maps input sentences to a meaningful embedding space so that sentences with similar meanings are located adjacently.

5. The system of claim 2, wherein the adaptive layer 112 has a multi-layer perceptron (MLP) network structure that receives the embedding extracted by the language model 111, outputs a style embedding transformed into a form suitable for style speech synthesis, and maps the output of the language model 111 to the style embedding.

6. The system of any one of claims 1 to 5, wherein the style tag encoder 110 further comprises a reference encoder 113 that receives a reference voice as input and outputs a style embedding.

7. The system of claim 6, wherein the end-to-end speech synthesizer 120 comprises:

a text encoder 121 that converts the input text into a text embedding and, using the duration information of each phonetic symbol of the input text, extends the text embedding to the length of the mel-spectrogram; and

a mel decoder 122 that synthesizes and outputs a mel-spectrogram using the text embedding lengthened by the text encoder 121 and the style embedding extracted by the style tag encoder 110.