WO2023157066A1 - Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program - Google Patents

Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program Download PDF

Info

Publication number
WO2023157066A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
acoustic feature
text
model
speech
Prior art date
Application number
PCT/JP2022/005903
Other languages
French (fr)
Japanese (ja)
Inventor
裕紀 金川
勇祐 井島
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority to PCT/JP2022/005903 priority Critical patent/WO2023157066A1/en
Publication of WO2023157066A1 publication Critical patent/WO2023157066A1/en

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation

Definitions

  • the present invention relates to a speech synthesis learning method, a speech synthesis method, a speech synthesis learning device, a speech synthesis device, and a program.
  • TTS: text-to-speech synthesis
  • DNN: deep neural network
  • the quality of synthesized speech has improved dramatically (Non-Patent Document 1).
  • statistical modeling by DNN acquires the correspondence between input and output only from data, so a large amount of training data is required to train a TTS model that synthesizes speech with high quality. If a TTS model is constructed using only a target speaker with a small amount of data, the model overfits the training data, so there are cases where desired utterance content and quality cannot be obtained when unknown text is input.
  • fMLLR: feature space maximum likelihood linear regression
  • Non-Patent Document 4 As an example of unsupervised adaptation, which is another adaptive approach, there is a method of manipulating information including speaker information that is put together with the text according to the target speaker (Non-Patent Document 4).
  • a TTS model is trained with a large number of speakers in advance using speaker expression vectors based on one-hot. Separately, a model is prepared to identify the training speaker from the input speech, and the speech of the target speaker is input to the model. Then, a vector (speaker posterior probability) indicating how much the target speaker resembles who among the large number of speakers is obtained.
  • a vector (speaker posterior probability)
  • Non-Patent Document 2 In semi-supervised learning represented by Non-Patent Document 2, adapting speech synthesis requires generating pseudo-labels, so a speech recognition model must be prepared separately. The learning cost is therefore very high, and the accuracy of the pseudo-labels depends on the speech recognition model.
  • Non-Patent Document 4 When unsupervised adaptation is performed by the approach of Non-Patent Document 4, not only is a separate speaker recognition model necessary, but also, because the equivalent of the TTS one-hot vector is predicted, the speakers recognized by the speaker recognition model must match those of the TTS model. Furthermore, if the acoustic characteristics of the target speaker differ significantly from those of the many speakers constituting the TTS model's training data, the quality of the synthesized speech is severely degraded. Since there are no pseudo-labels, it is also impossible to reduce the mismatch between the target speaker and the model by fine-tuning.
  • the present invention has been made in view of the above points, and aims to enable adaptation of a TTS model by fine-tuning from acoustic features of the target speaker's speech even when there is no text corresponding to that speech.
  • a computer executes a first learning procedure of learning a second model by updating a first model, which receives a speaker vector indicating a speaker, a text, and a first acoustic feature of the speech in which the speaker utters the text, based on a loss between a first predicted acoustic feature output by the first model and the first acoustic feature;
  • FIG. 2 is a diagram showing the configuration of the large-scale TTS model learning phase in the first embodiment.
  • FIG. 3 is a diagram showing the configuration of the unsupervised adaptation phase in the first embodiment.
  • FIG. 4 is a diagram showing the configuration of the inference phase for speech synthesis in the first embodiment.
  • FIG. 5 is a diagram showing the configuration of the inference phase for voice quality conversion in the first embodiment.
  • FIG. 6 is a diagram showing the configuration of the large-scale TTS model learning phase in the second embodiment.
  • FIG. 7 is a diagram showing the configuration of the large-scale TTS model learning phase in the third embodiment.
  • FIG. 8 is a diagram showing the configuration of the large-scale TTS model learning phase in the fourth embodiment.
  • FIG. 9 is a diagram showing the configuration of the inference phase for speech synthesis in the fourth embodiment.
  • this embodiment utilizes not only text but also acoustic features obtained from speech as input for the TTS model.
  • the DNN modules that convert the text into an intermediate representation and the intermediate representation into acoustic features are called a text encoder 112 and a decoder 114, respectively.
  • an acoustic feature encoder 113 that converts acoustic features into an intermediate representation is newly prepared so that the acoustic features, which are the output of the TTS model, can be reconstructed from the input text and the input acoustic features.
  • the intermediate representation originally obtained through the text encoder 112 can also be obtained from the acoustic features.
  • FIG. 1 is a diagram showing a hardware configuration example of a speech synthesizer 10 according to an embodiment of the present invention.
  • the speech synthesizer 10 of FIG. 1 has a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, an interface device 105, etc., which are interconnected by a bus B, respectively.
  • a program that implements processing in the speech synthesizer 10 is provided by a recording medium 101 such as a CD-ROM.
  • a recording medium 101 such as a CD-ROM.
  • the program is installed from the recording medium 101 to the auxiliary storage device 102 via the drive device 100 .
  • the program does not necessarily need to be installed from the recording medium 101, and may be downloaded from another computer via the network.
  • the auxiliary storage device 102 stores installed programs, as well as necessary files and data.
  • the memory device 103 reads and stores the program from the auxiliary storage device 102 when a program activation instruction is received.
  • the processor 104 is a CPU or a GPU (Graphics Processing Unit), or a CPU and a GPU, and executes functions related to the speech synthesizer 10 according to programs stored in the memory device 103 .
  • the interface device 105 is used as an interface for connecting to a network.
  • [First embodiment] FIGS. 2, 3, 4, and 5 show configuration examples of the large-scale TTS model learning phase, the unsupervised adaptation phase, the inference phase for speech synthesis, and the inference phase for voice quality conversion in the first embodiment, respectively.
  • the speech synthesizer 10 includes a TTS model ⁇ , a loss calculator 115 for acoustic feature O, and a TTS model learner 116 .
  • the TTS model ⁇ includes speaker vector encoder 111 , text encoder 112 , acoustic feature encoder 113 and decoder 114 . Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute.
  • a plurality of sets of learning data are prepared, each consisting of a set of speaker vectors S, texts L, and acoustic features O.
  • the speaker vector S is a continuous expression such as i-vector or x-vector indicating the speaker who uttered the speech, and is obtained by inputting the speech into the speaker vector extractor.
  • the text L is information indicating the content of the voice (content of the utterance).
  • a raw text, a sequence of phonemes and accents, or a linguistic feature vectorized from them can be used.
  • the acoustic feature quantity O is the acoustic feature quantity of the speech.
  • Acoustic features include mel-spectrogram, mel-cepstrum, fundamental frequency, etc., which are information necessary for reconstructing speech waveforms.
  • X ⁇ (X is an arbitrary symbol) in the text indicates a symbol with ⁇ added above X in the drawing.
  • the speaker of each training data may be different, and the text L may also be different. Any training data speaker may be the target speaker in the unsupervised adaptation phase described below.
  • the speaker vector encoder 111 inputs the speaker vector S, calculates and outputs an intermediate representation of the speaker vector S (hereinafter referred to as "intermediate representation of the speaker vector S").
  • the text encoder 112 receives the text L, computes an intermediate representation hL of the text L, and outputs it.
  • the acoustic feature quantity encoder 113 receives the acoustic feature quantity O, and calculates and outputs an intermediate representation hO of the acoustic feature quantity O.
  • the decoder 114 receives the intermediate representation of the speaker vector S, the intermediate representation hL, and the intermediate representation hO. However, the intermediate representation hL and the intermediate representation hO are input to the decoder 114 at different timings. That is, for one piece of training data, the decoder 114 executes two phases: a phase in which the intermediate representation of the speaker vector S and the intermediate representation hL are input (hereinafter the "first phase"), and a phase in which the intermediate representation of the speaker vector S and the intermediate representation hO are input (hereinafter the "second phase"). In the unsupervised adaptation described later there is no text for the unknown speaker, so the TTS model λ is constructed in this way so that the predicted acoustic feature O^ can be output whether the text L or the acoustic feature O is input.
  • the decoder 114 inputs the intermediate representation of the speaker vector S and the intermediate representation hL , and calculates and outputs the predicted acoustic feature O ⁇ .
  • the loss calculation unit 115 for O receives the predicted acoustic feature O^ and the acoustic feature O, and calculates and outputs the loss Lo, which is the error between the acoustic feature O and the predicted acoustic feature O^.
  • for the loss Lo, an index of the error between vectors of the same dimension, such as the mean squared error or the mean absolute error, can be used.
  • the TTS model learning unit 116 receives the TTS model λ and the loss Lo, and learns the TTS model λ~ by updating the model parameters of the TTS model λ based on the loss Lo.
  • X ⁇ (X is an arbitrary symbol) in the text indicates a symbol in which “-” is added above X in the drawings.
  • the TTS model learning unit 116 updates the TTS model ⁇ so as to minimize the loss Lo .
  • Model parameters that reduce the loss Lo can be obtained by executing error backpropagation with the help of the gradient information when the predicted acoustic feature O ⁇ was generated.
  • the decoder 114 inputs the intermediate representation of the speaker vector S and the intermediate representation hO , and calculates and outputs the predicted acoustic feature O ⁇ .
  • the TTS model learning unit 116 learns the TTS model ⁇ ⁇ by updating the TTS model ⁇ as in the first phase.
  • the TTS model ⁇ ⁇ was obtained during the large-scale TTS model training phase of FIG.
  • a plurality of sets of learning data are prepared, each of which consists of one target speaker's acoustic feature O' and the target speaker's speaker vector S'. Therefore, the speaker of each training data is common. However, the voice indicated by the acoustic feature O' of each learning data is different.
  • the TTS model ⁇ ⁇ receives the acoustic feature O′ of the target speaker and the speaker vector S′ of the target speaker, and calculates and outputs the predicted acoustic feature ⁇ ′.
  • as in FIG. 2, the TTS model learning unit 116 updates the TTS model λ~ so as to minimize the loss Lo, which is the error between the predicted acoustic feature O^' and the acoustic feature O', thereby learning the TTS model λ~'.
  • that is, with the configuration of FIG. 2, even though no text of the unknown speaker is input to the TTS model λ~, the TTS model λ~' can be trained by substituting the unknown speaker's acoustic feature O'. This enables adaptation (≈ fine-tuning) using the acoustic feature O'.
  • in this adaptation phase, both the input and output of the TTS model λ~' are acoustic features, so the TTS model λ~' is equivalent to an autoencoder. Since the adaptation data contain no text, the acoustic feature encoder 113 may overfit, and the intermediate representation hO may no longer predict the information corresponding to the intermediate representation hL in FIG. 2. Therefore, by freezing the acoustic feature encoder 113 (fixing its model parameters) and updating only the decoder 114, the model can be adapted to the target speaker while avoiding the risk that the text content, which is a prerequisite of the TTS model, collapses.
  • the trained TTS model λ~' receives an arbitrary text L' to be synthesized and the speaker vector S' of the target speaker, and calculates (estimates) and outputs the predicted acoustic feature O^'. Since the TTS model λ~' has been adapted in the phase described with reference to FIG. 3, speech can be synthesized without significant loss of quality even for a target speaker not included in the training data of FIG. 2.
  • since the TTS model λ~ of the first embodiment is configured to predict acoustic features from either text or acoustic features as input, it can also be used for voice quality conversion by replacing the speaker vector with that of a different speaker.
  • in the inference phase for voice quality conversion, the TTS model λ~' receives an acoustic feature O'' and a speaker vector S'' of a speaker different from the speaker of O'', and thereby predicts (outputs) the acoustic feature O^'' corresponding to the speaker vector S''.
  • unsupervised adaptation by fine-tuning of the TTS model is possible only from the acoustic features without using the target speaker's text.
  • it is possible to eliminate the need to annotate the speech of the target speaker, thereby reducing both the time and cost required to construct the TTS model.
  • the second embodiment is described focusing on the points that differ from the first embodiment; points not specifically mentioned in the second embodiment may be the same as in the first embodiment.
  • FIG. 6 is a diagram showing the configuration of the large-scale TTS model learning phase in the second embodiment.
  • the same parts as those in FIG. 2 are denoted by the same reference numerals, and the description thereof will be omitted as appropriate.
  • the speech synthesizer 10 further has a loss calculator 117 and a loss weighter 118 for the intermediate representation h. Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute.
  • the TTS model ⁇ receives the speaker vector S, the text L, and the acoustic feature O, and outputs the predicted acoustic feature ⁇ .
  • a loss calculator 115 relating to O receives the acoustic feature quantity O and the predicted acoustic feature quantity ⁇ and outputs a loss Lo . Note that, as described with reference to FIG. 2, for each of the first phase related to the text L and the second phase related to the acoustic feature O, the predicted acoustic feature ⁇ and the loss Lo are output.
  • the loss calculation unit 117 for h further receives the intermediate representation hL output by the text encoder 112 and the intermediate representation hO output by the acoustic feature encoder 113, and calculates and outputs the loss Lh between hL and hO.
  • for the index of the loss Lh, not only the mean squared error or the mean absolute error but also the cosine distance or the like is used to constrain the error between hL and hO to be small.
  • the loss weighting unit 118 receives, for each of the first and second phases, the loss Lo output by the loss calculation unit 115 for O and the loss Lh output by the loss calculation unit 117 for h, and calculates and outputs the weighted loss (the weighted sum of Lo and Lh).
  • the weighting coefficient may be fixed, or may be a learning target.
  • the TTS model learning unit 116 learns the TTS model ⁇ ⁇ by updating the model parameters of the TTS model ⁇ so as to minimize the weighted loss for each of the first and second phases. By doing so, in preparation for unsupervised adaptation, it is possible to increase the possibility of predicting the output of the text encoder 112 from the acoustic feature quantity encoder 113 as well.
  • the processing procedure after the unsupervised adaptation phase may be the same as in the first embodiment.
  • in the configuration of the first embodiment there is no constraint that makes the acoustic features carry text information, so the intermediate representation hO from the acoustic feature encoder 113 does not necessarily resemble the intermediate representation hL from the text encoder 112.
  • this problem can be reduced by restricting hO to a vector similar to hL in the course of learning.
  • FIG. 7 is a diagram showing the configuration of the large-scale TTS model learning phase in the third embodiment.
  • the speech synthesizer 10 further includes a speaker identity removal unit 119, which is a module for removing speaker identity from the intermediate representation hO produced by the acoustic feature encoder 113, a loss calculation unit 120 for s, and a loss weighting unit 121. Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute.
  • a speaker ID is data that identifies a speaker in a form different from the speaker vector.
  • the TTS model ⁇ receives the speaker vector S, the text L, and the acoustic feature O, and outputs the predicted acoustic feature ⁇ .
  • a loss calculator 115 for O receives the acoustic feature quantity O and the predicted acoustic feature quantity ⁇ , and outputs a loss Lo . Note that, as described with reference to FIG. 2, for each of the first phase related to the text L and the second phase related to the acoustic feature O, the predicted acoustic feature ⁇ and the loss Lo are output.
  • the speaker characteristics removal unit 119 receives the intermediate representation hO output by the acoustic feature encoder 113, and calculates and outputs the intermediate representation h′O from which the speaker characteristics are removed.
  • the intermediate representation h'O from which the speaker's characteristic is removed is an intermediate representation obtained by removing the voice features of the speaker from the intermediate representation hO .
  • the speaker adversarial learning device or the like proposed in Patent Document 3 can be used for the speaker identity removal unit 119 .
  • the loss calculation unit 120 for s receives the intermediate representation h′O with speaker characteristics removed and the true speaker ID s, and calculates and outputs the loss L s .
  • the loss L s is an index that takes a larger value as the probability that h′ O corresponds to speaker s is lower.
  • the loss Ls can be an index for solving a discrimination problem, such as cross-entropy.
  • the loss weighting unit 121 receives, for each of the first and second phases, the loss Lo output by the loss calculation unit 115 for O and the loss Ls output by the loss calculation unit 120 for s, and calculates and outputs the weighted loss (the weighted sum of Lo and Ls). Note that the weighting coefficients may be fixed or may be learned.
  • the TTS model learning unit 116 learns the TTS model ⁇ ⁇ by updating the model parameters of the TTS model ⁇ to minimize the weighted loss.
  • the processing procedure after the unsupervised adaptation phase may be the same as in the first embodiment.
  • the output of the text encoder 112 does not include speaker characteristics, whereas the output of the acoustic feature quantity encoder 113 includes speaker characteristics. A mismatch between the two causes deterioration of TTS performance. According to the third embodiment, it is possible to reduce speaker characteristics from the intermediate representation h O by the acoustic feature encoder 113 .
  • the third embodiment may be used together with the second embodiment.
  • in that case, the loss weighting unit 121 may receive the loss Lo, the loss Ls, and the loss Lh, and output a weighted loss.
  • FIG. 8 is a diagram showing the configuration of the large-scale TTS model learning phase in the fourth embodiment.
  • the text L n indicates the text according to the language n of the utterance.
  • a plurality of sets of learning data each including a speaker vector S, a text Ln , and an acoustic feature O are prepared.
  • Acoustic feature O is an acoustic feature of speech in which text Ln is uttered in language n.
  • Language n of each of the plurality of learning data is one of 1 to N, and learning data for each language of 1 to N are prepared. The meaning of the text Ln of each learning data may be different.
  • the processing of the speaker vector encoder 111 and acoustic feature quantity encoder 113 is the same as in FIG.
  • the text encoder 112-n corresponding to the language n of the input training data text Ln calculates and outputs the intermediate representation hLn (a minimal routing sketch for such per-language encoders appears after this list).
  • the decoder 114 receives the intermediate representation hLn and the speaker vector S in the first phase, receives the intermediate representation hO and the speaker vector S in the second phase, and outputs the predicted acoustic feature O^. Thereafter, the TTS model λ is updated and the TTS model λ~ is learned in the same procedure as in the first embodiment.
  • the flow for inputting the acoustic feature quantity O to the TTS model ⁇ is the same as in the first embodiment.
  • since the unsupervised adaptation phase does not depend on text, the adapted TTS model λ~' is learned as in FIG. 3 of the first embodiment.
  • FIG. 9 is a diagram showing the configuration of the inference phase for speech synthesis in the fourth embodiment.
  • the predicted acoustic feature O^' is predicted in the same procedure as in FIG. 4.
  • the speech synthesizer 10 is also an example of a speech synthesis learning device.
  • the TTS model ⁇ is an example of the first model.
  • the TTS model ⁇ ⁇ is an example of the second model.
  • the predicted acoustic feature O^ is an example of the first predicted acoustic feature.
  • the predicted acoustic feature O^' is an example of the second predicted acoustic feature.
  • the acoustic feature quantity encoder 113 is an example of a first encoder.
  • Text encoder 112 is an example of a second encoder.
  • Reference signs: 10 Speech synthesizer; 100 Drive device; 101 Recording medium; 102 Auxiliary storage device; 103 Memory device; 104 Processor; 105 Interface device; 111 Speaker vector encoder; 112 Text encoder; 113 Acoustic feature encoder; 114 Decoder; 115 Loss calculation unit for O; 116 TTS model learning unit; 117 Loss calculation unit for h; 118 Loss weighting unit; 119 Speaker identity removal unit; 120 Loss calculation unit for s; 121 Loss weighting unit; B Bus
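For the fourth embodiment above, in which per-language text encoders 112-1 through 112-N feed a shared decoder, the following is a minimal sketch of routing text through the encoder of its language; the ModuleDict-based routing and the layer choices are assumptions, not the patent's implementation.

```python
import torch.nn as nn

class MultilingualTextEncoders(nn.Module):
    """Hypothetical per-language text encoders 112-n feeding one shared decoder."""
    def __init__(self, vocab_sizes, hidden=256):
        super().__init__()
        # vocab_sizes: dict mapping a language name to its symbol-vocabulary size.
        self.embeddings = nn.ModuleDict(
            {lang: nn.Embedding(size, hidden) for lang, size in vocab_sizes.items()})
        self.encoders = nn.ModuleDict(
            {lang: nn.LSTM(hidden, hidden, batch_first=True) for lang in vocab_sizes})

    def forward(self, text_ids, lang):
        h, _ = self.encoders[lang](self.embeddings[lang](text_ids))
        return h   # intermediate representation h_Ln for the utterance's language n
```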

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A computer executes: a first learning procedure of learning a second model by updating a first model, which receives a speaker vector representing a speaker, a text, and a first acoustic feature of speech in which the speaker utters the text, on the basis of a loss between a first predicted acoustic feature output by the first model and the first acoustic feature; and a second learning procedure of updating the second model, which receives a speaker vector of a target speaker and a second acoustic feature of speech uttered by the target speaker, on the basis of a loss between a second predicted acoustic feature output by the second model and the second acoustic feature. Thereby, even if there is no text corresponding to the speech of the target speaker, adaptation through fine-tuning of a TTS model from the acoustic feature of that speech is made possible.

Description

Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program

TECHNICAL FIELD
The present invention relates to a speech synthesis learning method, a speech synthesis method, a speech synthesis learning device, a speech synthesis device, and a program.
In text-to-speech synthesis (TTS), which predicts speech from text, the mainstream in recent years has been statistical parametric speech synthesis. This is a method of modeling the correspondence between texts serving as training data and the corresponding speech. By using a deep neural network (DNN) as the modeling technique, the quality of synthesized speech has improved dramatically (Non-Patent Document 1). However, because statistical modeling with a DNN acquires the correspondence between input and output only from data, a large amount of training data is required to train a TTS model that synthesizes speech with high quality. If a TTS model is built using only a target speaker with little data, the model overfits the training data, and the desired utterance content and quality may not be obtained when unknown text is input.

To deal with this problem, there is a technique that starts from a model trained with a large amount of data from many speakers and fine-tunes it to fit a target speaker with little data. This is called adaptation, and it has long been studied in speech recognition and synthesis.

Among adaptation settings, the case where speech of the target speaker exists but the corresponding text (hereinafter "label") does not can be broadly divided into two approaches: semi-supervised learning and unsupervised learning.

In feature-space maximum likelihood linear regression (fMLLR), one method of semi-supervised learning, the target speaker's speech is first recognized by a speaker-independent model to obtain pseudo-labels corresponding to the speech. Treating these labels as supervision, linear regression coefficients for the speech features are estimated so as to fill in the mismatch between the target speaker's speech and the model (Non-Patent Document 2, Patent Document 1). Another method of filling the mismatch with pseudo-labels takes the confidence of the pseudo-labels into account and fine-tunes the DNN to fit the target speaker, for example by excluding pseudo-labels with low confidence (Non-Patent Document 3).

As an example of unsupervised adaptation, the other adaptation approach, there is a method that manipulates the information fed in together with the text, including the speaker information, according to the target speaker (Non-Patent Document 4). Based on Patent Document 2, a TTS model is first trained in advance on many speakers using one-hot speaker representation vectors. Separately, a model that identifies the training speakers from input speech is prepared, and the target speaker's speech is input to that model. This yields a vector (speaker posterior probability) indicating how much the target speaker resembles each of the many speakers. By feeding this vector into the TTS model as the speaker vector instead of the one-hot vector, synthesized speech resembling the target speaker can be obtained without obtaining pseudo-labels.
U.S. Patent Application Publication No. 20120173240 (Patent Document 1)
Japanese Patent No. 6680933 (Patent Document 2)
U.S. Patent No. 10347241 (Patent Document 3)
In semi-supervised learning represented by Non-Patent Document 2, adapting speech synthesis requires generating pseudo-labels, so a speech recognition model must be prepared separately. The learning cost is therefore very high, and the accuracy of the pseudo-labels depends on the speech recognition model.

When unsupervised adaptation is performed with the approach of Non-Patent Document 4, not only is a separate speaker recognition model required, but also, because the equivalent of the TTS one-hot vector is predicted, the speakers recognized by the speaker recognition model must match those of the TTS model. Furthermore, if the acoustic characteristics of the target speaker differ significantly from those of the many speakers constituting the TTS model's training data, the quality of the synthesized speech is severely degraded. Since there are no pseudo-labels, it is also impossible to reduce the mismatch between the target speaker and the model by fine-tuning.

The present invention has been made in view of the above points, and aims to enable adaptation of a TTS model by fine-tuning from acoustic features of the target speaker's speech even when there is no text corresponding to that speech.

To solve the above problem, a computer executes: a first learning procedure of learning a second model by updating a first model, which receives a speaker vector indicating a speaker, a text, and a first acoustic feature of speech in which the speaker utters the text, based on a loss between a first predicted acoustic feature output by the first model and the first acoustic feature; and a second learning procedure of updating the second model, which receives a speaker vector of the target speaker and a second acoustic feature of speech uttered by the target speaker, based on a loss between a second predicted acoustic feature output by the second model and the second acoustic feature.

Even if there is no text corresponding to the target speaker's speech, adaptation of the TTS model by fine-tuning from the acoustic features of that speech thus becomes possible.
FIG. 1 is a diagram showing a hardware configuration example of the speech synthesizer 10 according to an embodiment of the present invention.
FIG. 2 is a diagram showing the configuration of the large-scale TTS model learning phase in the first embodiment.
FIG. 3 is a diagram showing the configuration of the unsupervised adaptation phase in the first embodiment.
FIG. 4 is a diagram showing the configuration of the inference phase for speech synthesis in the first embodiment.
FIG. 5 is a diagram showing the configuration of the inference phase for voice quality conversion in the first embodiment.
FIG. 6 is a diagram showing the configuration of the large-scale TTS model learning phase in the second embodiment.
FIG. 7 is a diagram showing the configuration of the large-scale TTS model learning phase in the third embodiment.
FIG. 8 is a diagram showing the configuration of the large-scale TTS model learning phase in the fourth embodiment.
FIG. 9 is a diagram showing the configuration of the inference phase for speech synthesis in the fourth embodiment.
Unlike Non-Patent Document 4, this embodiment uses not only text but also acoustic features obtained from speech as input to the TTS model for unsupervised adaptation.

In the TTS model, the DNN modules that convert text into an intermediate representation and the intermediate representation into acoustic features are called the text encoder 112 and the decoder 114, respectively. In this embodiment, an acoustic feature encoder 113 that converts acoustic features into an intermediate representation is newly prepared so that the acoustic features, which are the output of the TTS model, can be reconstructed from either the input text or the input acoustic features. As a result, the intermediate representation originally obtained through the text encoder 112 can also be obtained from the acoustic features.
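As a concrete picture of this module layout, the following is a minimal PyTorch-style sketch. It is an illustration only: the layer types and sizes, the concatenation-based speaker conditioning, and the omission of the attention or duration model that would align text-rate representations with frame-rate acoustic features are all assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class TTSModel(nn.Module):
    """Hypothetical sketch of the TTS model lambda: a text encoder (112) and an
    acoustic feature encoder (113) feed intermediate representations into one
    shared decoder (114), conditioned on the speaker vector via encoder 111."""

    def __init__(self, vocab_size=100, spk_dim=512, n_mels=80, hidden=256):
        super().__init__()
        self.speaker_vector_encoder = nn.Linear(spk_dim, hidden)              # 111
        self.text_embedding = nn.Embedding(vocab_size, hidden)
        self.text_encoder = nn.LSTM(hidden, hidden, batch_first=True)         # 112
        self.acoustic_feature_encoder = nn.LSTM(n_mels, hidden,
                                                batch_first=True)             # 113
        self.decoder = nn.LSTM(2 * hidden, n_mels, batch_first=True)          # 114

    def encode_text(self, text_ids):
        h_L, _ = self.text_encoder(self.text_embedding(text_ids))
        return h_L                                  # intermediate representation h_L

    def encode_acoustic(self, feats):
        h_O, _ = self.acoustic_feature_encoder(feats)
        return h_O                                  # intermediate representation h_O

    def decode(self, h, spk_vec):
        s = self.speaker_vector_encoder(spk_vec)        # intermediate rep. of S
        s = s.unsqueeze(1).expand(-1, h.size(1), -1)    # broadcast over time steps
        out, _ = self.decoder(torch.cat([h, s], dim=-1))
        return out                                  # predicted acoustic features O^
```

In this sketch the decoder simply runs at whatever time resolution its input h has; a real system would upsample h_L to the acoustic frame rate with an attention mechanism or a duration model.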
As speaker information, a continuous representation such as an i-vector or x-vector obtained with a speaker vector extractor is used instead of a discrete representation such as a one-hot vector. This allows the training speakers of the TTS model and of the speaker recognizer to differ, and a wide variety of speaker characteristics can be covered by increasing the training data of the speaker vector extractor, whose annotation cost is comparatively small.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a diagram showing a hardware configuration example of the speech synthesizer 10 according to an embodiment of the present invention. The speech synthesizer 10 of FIG. 1 has a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, an interface device 105, and the like, which are interconnected by a bus B.

A program that implements the processing of the speech synthesizer 10 is provided on a recording medium 101 such as a CD-ROM. When the recording medium 101 storing the program is set in the drive device 100, the program is installed from the recording medium 101 into the auxiliary storage device 102 via the drive device 100. However, the program does not necessarily need to be installed from the recording medium 101, and may instead be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed program as well as necessary files and data.

The memory device 103 reads the program from the auxiliary storage device 102 and stores it when an instruction to start the program is received. The processor 104 is a CPU or a GPU (Graphics Processing Unit), or a CPU and a GPU, and executes the functions of the speech synthesizer 10 according to the programs stored in the memory device 103. The interface device 105 is used as an interface for connecting to a network.
[First embodiment]
FIGS. 2, 3, 4, and 5 show configuration examples of the large-scale TTS model learning phase, the unsupervised adaptation phase, the inference phase for speech synthesis, and the inference phase for voice quality conversion in the first embodiment, respectively.

In this embodiment, the speech synthesizer 10 includes a TTS model λ, a loss calculation unit 115 for the acoustic feature O, and a TTS model learning unit 116. The TTS model λ includes a speaker vector encoder 111, a text encoder 112, an acoustic feature encoder 113, and a decoder 114. Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute.

[Overall flow]
The flow of the large-scale TTS model learning phase of the first embodiment is described with reference to FIG. 2.

In the large-scale TTS model learning phase, multiple sets of training data are prepared, each consisting of a speaker vector S, a text L, and an acoustic feature O. The speaker vector S is a continuous representation, such as an i-vector or x-vector, indicating the speaker who uttered the speech, and is obtained by inputting the speech into a speaker vector extractor. The text L is information indicating the content of the speech (the content of the utterance). As the text L, raw text, a sequence of phonemes and accents, or linguistic features obtained by vectorizing them can be used. The acoustic feature O is the acoustic feature of the speech. Acoustic features such as the mel-spectrogram, mel-cepstrum, and fundamental frequency, which carry the information necessary to reconstruct the speech waveform, are used. Note that X^ (X is an arbitrary symbol) in the text denotes the symbol with ^ placed above X in the drawings. The speaker of each piece of training data may differ, and the text L may also differ. The speaker of any piece of training data may be the target speaker of the unsupervised adaptation phase described later.
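One hypothetical way to assemble such a training tuple (S, L, O) is sketched below with a torchaudio mel-spectrogram front end. The sampling rate and spectrogram settings are assumed values, and `speaker_vector_extractor` is an abstract placeholder standing in for an i-vector/x-vector extractor.

```python
import torch
import torchaudio

# Illustrative mel-spectrogram front end; all parameter values are assumptions.
mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80)

def make_training_example(wav_path, phoneme_ids, speaker_vector_extractor):
    """Return one (S, L, O) tuple: speaker vector, text, acoustic features."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != 22050:                                  # resample to the assumed rate
        waveform = torchaudio.functional.resample(waveform, sr, 22050)
    O = mel_extractor(waveform).squeeze(0).transpose(0, 1)   # (frames, n_mels)
    S = speaker_vector_extractor(waveform)           # e.g. an x-vector or i-vector
    L = torch.tensor(phoneme_ids, dtype=torch.long)  # phoneme/accent ID sequence
    return S, L, O
```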
The speaker vector encoder 111 receives the speaker vector S, and calculates and outputs an intermediate representation of the speaker vector S (hereinafter, the "intermediate representation of the speaker vector S").

The text encoder 112 receives the text L, and calculates and outputs an intermediate representation hL of the text L.

The acoustic feature encoder 113 receives the acoustic feature O, and calculates and outputs an intermediate representation hO of the acoustic feature O.

The decoder 114 receives the intermediate representation of the speaker vector S, the intermediate representation hL, and the intermediate representation hO. However, the intermediate representation hL and the intermediate representation hO are input to the decoder 114 at different timings. That is, for one piece of training data, the decoder 114 executes two phases: a phase in which the intermediate representation of the speaker vector S and the intermediate representation hL are input (hereinafter, the "first phase"), and a phase in which the intermediate representation of the speaker vector S and the intermediate representation hO are input (hereinafter, the "second phase"). In the unsupervised adaptation described later there is no text for the unknown speaker, so the TTS model λ is constructed in this way so that the predicted acoustic feature O^ can be output whether the text L or the acoustic feature O is input.

First, the first phase is described.

The decoder 114 receives the intermediate representation of the speaker vector S and the intermediate representation hL, and calculates and outputs the predicted acoustic feature O^.

Next, the loss calculation unit 115 for O receives the predicted acoustic feature O^ and the acoustic feature O, and calculates and outputs the loss Lo, which is the error between the acoustic feature O and the predicted acoustic feature O^. For the loss Lo, an index of the error between vectors of the same dimension, such as the mean squared error or the mean absolute error, can be used.

The TTS model learning unit 116 then receives the TTS model λ and the loss Lo, and learns the TTS model λ~ by updating the model parameters of the TTS model λ based on the loss Lo. Note that X~ (X is an arbitrary symbol) in the text denotes the symbol with a bar placed above X in the drawings.

When the TTS model λ is composed of DNNs, the TTS model learning unit 116 updates the TTS model λ so as to minimize the loss Lo. Model parameters that reduce the loss Lo can be obtained by running error backpropagation using the gradient information from the computation of the predicted acoustic feature O^.

Next, the second phase is described.

The decoder 114 receives the intermediate representation of the speaker vector S and the intermediate representation hO, and calculates and outputs the predicted acoustic feature O^.

The TTS model learning unit 116 learns the TTS model λ~ by updating the TTS model λ in the same way as in the first phase.

Since the first phase and the second phase are executed for each piece of training data, the TTS model λ is updated twice per piece of training data. By training the model to obtain the predicted acoustic feature O^ through the decoder 114 from the two kinds of intermediate representations hL and hO, information equivalent to the text can be obtained from the acoustic features even when no text is available.
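A minimal sketch of this two-phase update follows, continuing the assumptions of the earlier model sketch; the Adam optimizer and the crude truncation to a common length (in place of a proper attention or duration alignment) are assumptions made to keep the example short.

```python
import torch
import torch.nn.functional as F

def large_scale_training_step(model, optimizer, S, L, O):
    """Two updates per training example: one through h_L, one through h_O."""
    def update(h):
        optimizer.zero_grad()
        O_hat = model.decode(h, S.unsqueeze(0))
        T = min(O_hat.size(1), O.size(1))       # crude length alignment (assumption)
        loss_o = F.mse_loss(O_hat[:, :T], O.unsqueeze(0)[:, :T])   # loss L_o
        loss_o.backward()
        optimizer.step()
        return loss_o.item()

    loss_first = update(model.encode_text(L.unsqueeze(0)))       # first phase (h_L)
    loss_second = update(model.encode_acoustic(O.unsqueeze(0)))  # second phase (h_O)
    return loss_first, loss_second

# Usage sketch: optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```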
Next, the flow of the unsupervised adaptation phase of the first embodiment is described with reference to FIG. 3. In FIG. 3, the same parts as in FIG. 2 are given the same names. The TTS model λ~ is the one obtained in the large-scale TTS model learning phase of FIG. 2.

In the unsupervised adaptation phase, multiple sets of training data are prepared, each consisting of an acoustic feature O' of a single target speaker and the target speaker's speaker vector S'. The speaker of every piece of training data is therefore the same, but the speech represented by the acoustic feature O' differs from one piece of training data to another.

The TTS model λ~ receives the target speaker's acoustic feature O' and the target speaker's speaker vector S', and calculates and outputs the predicted acoustic feature O^'. As in FIG. 2, the TTS model learning unit 116 updates the TTS model λ~ so as to minimize the loss Lo, which is the error between the predicted acoustic feature O^' and the acoustic feature O', thereby learning the TTS model λ~'. That is, with the configuration of FIG. 2, even though no text of the unknown speaker is input to the TTS model λ~, the TTS model λ~' can be trained by substituting the unknown speaker's acoustic feature O'. This enables adaptation (≈ fine-tuning) using the acoustic feature O'.

In this adaptation phase, both the input and output of the TTS model λ~' are acoustic features, so the TTS model λ~' is equivalent to an autoencoder. Since the adaptation data contain no text, the acoustic feature encoder 113 may overfit, and the intermediate representation hO may no longer predict the information corresponding to the intermediate representation hL in FIG. 2. Therefore, by freezing the acoustic feature encoder 113 (fixing its model parameters) and updating only the decoder 114, the model can be adapted to the target speaker while avoiding the risk that the text content, which is a prerequisite of the TTS model, collapses.
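A sketch of this adaptation step under the same assumptions; freezing exactly the acoustic feature encoder and optimizing only the decoder is one reading of the paragraph above, not the only possible configuration.

```python
import torch
import torch.nn.functional as F

def build_adaptation_optimizer(model, lr=1e-4):
    """Freeze the acoustic feature encoder 113 and optimize only the decoder 114."""
    for p in model.acoustic_feature_encoder.parameters():
        p.requires_grad = False
    return torch.optim.Adam(model.decoder.parameters(), lr=lr)

def unsupervised_adaptation_step(model, optimizer, S_target, O_target):
    """One fine-tuning step from a (S', O') pair of the target speaker, no text."""
    optimizer.zero_grad()
    h_O = model.encode_acoustic(O_target.unsqueeze(0))
    O_hat = model.decode(h_O, S_target.unsqueeze(0))
    loss_o = F.mse_loss(O_hat, O_target.unsqueeze(0))   # autoencoder-style loss L_o
    loss_o.backward()
    optimizer.step()
    return loss_o.item()
```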
Next, the flow of the inference phase for speech synthesis in the first embodiment is described with reference to FIG. 4. The model used is the one obtained in the unsupervised adaptation phase of FIG. 3.

The trained TTS model λ~' receives an arbitrary text L' to be synthesized and the target speaker's speaker vector S', and calculates (estimates) and outputs the predicted acoustic feature O^'. Since the TTS model λ~' has been adapted in the phase described with reference to FIG. 3, speech can be synthesized without significant loss of quality even for a target speaker not included in the training data of FIG. 2.

Because the TTS model λ~ of the first embodiment is configured to predict acoustic features from either text or acoustic features as input, it can also be used for voice quality conversion by replacing the speaker vector with that of another speaker. In the inference phase for voice quality conversion shown in FIG. 5, the TTS model λ~' receives an acoustic feature O'' and a speaker vector S'' of a speaker different from the speaker of O'', and thereby predicts (outputs) the acoustic feature O^'' corresponding to the speaker vector S''.
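Continuing the same hypothetical sketch, inference for speech synthesis and for voice quality conversion differ only in which encoder is used and whose speaker vector is supplied; the helper names below and the separate vocoder are assumptions.

```python
import torch

def synthesize(model, text_ids, spk_vec):
    """Speech synthesis: arbitrary text L' plus the adapted target speaker's S'."""
    model.eval()
    with torch.no_grad():
        return model.decode(model.encode_text(text_ids.unsqueeze(0)),
                            spk_vec.unsqueeze(0))

def convert_voice(model, source_feats, target_spk_vec):
    """Voice quality conversion: acoustic features O'' of one speaker plus the
    speaker vector S'' of a different speaker."""
    model.eval()
    with torch.no_grad():
        return model.decode(model.encode_acoustic(source_feats.unsqueeze(0)),
                            target_spk_vec.unsqueeze(0))

# A separate vocoder (outside the scope of this sketch) would turn the predicted
# acoustic features back into a waveform.
```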
As described above, according to the first embodiment, unsupervised adaptation of the TTS model by fine-tuning is possible from the acoustic features alone, without using any text of the target speaker. As a result, annotation of the target speaker's speech becomes unnecessary, which reduces both the time and the monetary cost of building a TTS model.

[Second embodiment]
Next, the second embodiment is described, focusing on the points that differ from the first embodiment. Points not specifically mentioned in the second embodiment may be the same as in the first embodiment.

FIG. 6 is a diagram showing the configuration of the large-scale TTS model learning phase in the second embodiment. In FIG. 6, the same parts as in FIG. 2 are given the same reference numerals, and their description is omitted as appropriate. In FIG. 6, the speech synthesizer 10 further has a loss calculation unit 117 and a loss weighting unit 118 for the intermediate representation h. Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute.

[Overall flow]
The flow of the large-scale TTS model learning phase of the second embodiment is described with reference to FIG. 6. As in FIG. 2, the TTS model λ receives the speaker vector S, the text L, and the acoustic feature O, and outputs the predicted acoustic feature O^. The loss calculation unit 115 for O receives the acoustic feature O and the predicted acoustic feature O^ and outputs the loss Lo. As described with reference to FIG. 2, the predicted acoustic feature O^ and the loss Lo are output for each of the first phase related to the text L and the second phase related to the acoustic feature O.

In the second embodiment, the loss calculation unit 117 for h additionally receives the intermediate representation hL output by the text encoder 112 and the intermediate representation hO output by the acoustic feature encoder 113, and calculates and outputs the loss Lh between hL and hO. For the index of the loss Lh, not only the mean squared error or the mean absolute error but also the cosine distance or the like is used to constrain the error between hL and hO to be small.

The loss weighting unit 118 receives, for each of the first and second phases, the loss Lo output by the loss calculation unit 115 for O and the loss Lh output by the loss calculation unit 117 for h, and calculates and outputs the weighted loss (the weighted sum of Lo and Lh). The weighting coefficients may be fixed or may themselves be learned. The TTS model learning unit 116 learns the TTS model λ~ by updating the model parameters of the TTS model λ so as to minimize the weighted loss for each of the first and second phases. In this way, in preparation for unsupervised adaptation, the likelihood that something equivalent to the output of the text encoder 112 can also be predicted from the acoustic feature encoder 113 is increased.
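A sketch of this weighted loss for the second embodiment; the cosine-distance form of Lh, the fixed weight values, and the truncation used in place of a proper alignment between hL and hO are assumptions.

```python
import torch.nn.functional as F

def weighted_loss_with_lh(O_hat, O, h_L, h_O, w_o=1.0, w_h=0.1):
    """Weighted sum of the reconstruction loss L_o and the representation loss L_h."""
    loss_o = F.mse_loss(O_hat, O)
    # L_h: cosine distance between the two intermediate representations,
    # crudely truncated to a common number of time steps.
    T = min(h_L.size(1), h_O.size(1))
    loss_h = (1.0 - F.cosine_similarity(h_L[:, :T], h_O[:, :T], dim=-1)).mean()
    return w_o * loss_o + w_h * loss_h   # minimized by the TTS model learning unit
```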
The processing after the unsupervised adaptation phase may be the same as in the first embodiment.

In the configuration of the first embodiment, there is no constraint that makes the acoustic features carry text information, so the intermediate representation hO from the acoustic feature encoder 113 does not necessarily resemble the intermediate representation hL from the text encoder 112. In the second embodiment, this problem can be reduced by constraining hO to become a vector similar to hL during training.
 [第3の実施の形態]
 次に、第3の実施の形態について説明する。第3の実施の形態では第1の実施の形態と異なる点について説明する。第3の実施の形態において特に言及されない点については、第1の実施の形態と同様でもよい。
[Third embodiment]
Next, a third embodiment will be described. In the third embodiment, differences from the first embodiment will be explained. Points not particularly mentioned in the third embodiment may be the same as those in the first embodiment.
 図7は、第3の実施の形態における大規模TTSモデル学習フェーズの構成を示す図である。図7中、図2と同一部分には同一符号を付し、その説明は適宜省略する。図7において、音声合成装置10は、音響特徴量エンコーダ113による中間表現hから話者性を抜くモジュールである話者性除去部119と、sに関する損失計算部120と、損失重みづけ部121とを更に有する。これら各部は、音声合成装置10にインストールされた1以上のプログラムが、プロセッサ104に実行させる処理により実現される。話者IDは、話者ベクトルとは異なる形式で話者を識別するデータである。 FIG. 7 is a diagram showing the configuration of the large-scale TTS model learning phase in the third embodiment. In FIG. 7, the same parts as in FIG. 2 are denoted by the same reference numerals, and the description thereof will be omitted as appropriate. In FIG. 7, the speech synthesizer 10 includes a speaker identity remover 119, which is a module for removing the speaker identity from the intermediate representation hO by the acoustic feature encoder 113, a loss calculator 120 for s, and a loss weighter 121. and Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute. A speaker ID is data that identifies a speaker in a form different from the speaker vector.
[Overall flow]
 The flow of the large-scale TTS model learning phase in the third embodiment will be described with reference to FIG. 7. As in FIG. 2, the TTS model λ receives the speaker vector S, the text L, and the acoustic feature O, and outputs the predicted acoustic feature O^. The loss calculation unit 115 for O receives the acoustic feature O and the predicted acoustic feature O^ and outputs the loss Lo. As described with reference to FIG. 2, the predicted acoustic feature O^ and the loss Lo are output for each of the first phase relating to the text L and the second phase relating to the acoustic feature O.
 In the third embodiment, the speaker identity removal unit 119 receives the intermediate representation hO output by the acoustic feature encoder 113 and calculates and outputs an intermediate representation h'O from which speaker characteristics have been removed, that is, an intermediate representation in which the features of the speaker's voice have been stripped from hO. For the speaker identity removal unit 119, the speaker-adversarial learner proposed in Patent Document 3 or the like can be used.
 The loss calculation unit 120 for s receives the speaker-stripped intermediate representation h'O and the true speaker ID s, and calculates and outputs a loss Ls. The loss Ls is a measure that takes a larger value the less likely h'O is to correspond to speaker s; for example, cross-entropy or another index for solving classification problems can be used for Ls.
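 One common way to realize such a speaker-adversarial arrangement, sketched below purely as an assumption and not as the exact learner of the cited Patent Document 3, is a gradient-reversal layer followed by a speaker classifier trained with cross-entropy against the true speaker ID s; the module and variable names are hypothetical.

```python
# Hedged sketch of a speaker-adversarial speaker-identity removal module:
# a gradient-reversal layer plus a speaker classifier. Minimizing L_s trains
# the classifier normally, while the reversed gradients push the upstream
# acoustic feature encoder to discard speaker cues. Shapes and names are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale: float = 1.0):
        ctx.scale = scale
        return x.clone()                       # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.scale * grad_out, None     # flip the gradient sign going upstream

class SpeakerIdentityRemover(nn.Module):
    def __init__(self, dim: int, n_speakers: int):
        super().__init__()
        self.classifier = nn.Linear(dim, n_speakers)   # assumed single linear layer

    def forward(self, h_O: torch.Tensor, speaker_id: torch.Tensor):
        h_rev = GradReverse.apply(h_O)                 # stands in for h'_O
        logits = self.classifier(h_rev.mean(dim=1))    # pool over frames (assumption)
        L_s = F.cross_entropy(logits, speaker_id)      # large when h'_O does not look like speaker s
        return h_rev, L_s
```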
 The loss weighting unit 121 receives, for each of the first phase and the second phase, the loss Lo output by the loss calculation unit 115 for O and the loss Ls output by the loss calculation unit 120 for s, and calculates and outputs a weighted loss (a weighted sum of Lo and Ls). The weighting coefficients may be fixed or may themselves be learned. The TTS model learning unit 116 learns the TTS model λ by updating its model parameters so as to minimize the weighted loss.
 The processing procedure from the unsupervised adaptation phase onward may be the same as in the first embodiment.
 In the configuration of the first embodiment, the output of the text encoder 112 contains no speaker characteristics, whereas the output of the acoustic feature encoder 113 does. This mismatch causes degradation of TTS performance. According to the third embodiment, speaker characteristics can be reduced in the intermediate representation hO produced by the acoustic feature encoder 113.
 The third embodiment may also be used together with the second embodiment. In that case, the loss weighting unit 121 receives the losses Lo, Lh, and Ls and outputs a weighted loss, as sketched below. Combining the two embodiments enables more stable unsupervised adaptation by constraining hO toward hL while also removing speaker characteristics from hO.
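 A very small sketch of the combined weighting, with coefficient values that are illustrative assumptions only, is:

```python
def combined_weighted_loss(L_o, L_h, L_s, w_o=1.0, w_h=0.1, w_s=0.1):
    # weighted sum of the reconstruction loss Lo, the representation-matching
    # loss Lh, and the speaker loss Ls; the coefficients are assumed values
    return w_o * L_o + w_h * L_h + w_s * L_s
```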
[Fourth embodiment]
 Next, a fourth embodiment will be described, focusing on the points that differ from the first embodiment. Points not specifically mentioned in the fourth embodiment may be the same as in the first embodiment.
 FIG. 8 is a diagram showing the configuration of the large-scale TTS model learning phase in the fourth embodiment. In FIG. 8, parts identical to those in FIG. 2 are given the same reference numerals, and their description is omitted as appropriate. In FIG. 8, the speech synthesizer 10 has a text encoder 112-n (n = 1, ..., N) for each language. Each text encoder 112-n is selectively used according to the input text Ln, where Ln denotes the text corresponding to the language n of the utterance.
[Overall flow]
 The flow of the large-scale TTS model learning phase in the fourth embodiment will be described with reference to FIG. 8.
 In the large-scale TTS model learning phase, multiple sets of training data are prepared, each set consisting of a speaker vector S, a text Ln, and an acoustic feature O. The acoustic feature O is the acoustic feature of speech in which the text Ln is uttered in language n. The language n of each training data set is one of 1 to N, and training data are prepared for every language from 1 to N. The content expressed by the text Ln may differ between training data sets.
 The processing of the speaker vector encoder 111 and the acoustic feature encoder 113 is the same as in FIG. 2.
 Meanwhile, the text encoder 112-n corresponding to the text Ln of the input training data calculates and outputs an intermediate representation hLn.
 The decoder 114 receives the intermediate representation hLn and the speaker vector S in the first phase, and the intermediate representation hO and the speaker vector S in the second phase, and outputs a predicted acoustic feature O^ in each phase. Thereafter, the TTS model λ is updated and learned in the same procedure as in the first embodiment. The flow when the acoustic feature O is input to the TTS model λ is also the same as in the first embodiment.
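 A minimal sketch of switching between per-language text encoders 112-n while sharing the decoder 114, under the assumption of a PyTorch ModuleDict keyed by language and hypothetical class and argument names, could look like this:

```python
# Hedged sketch of the fourth embodiment's per-language text encoders with a
# shared decoder. Only the idea of selecting the text encoder by language n
# comes from the text; everything else is an assumption.
import torch
import torch.nn as nn

class MultilingualTTS(nn.Module):
    def __init__(self, text_encoders: dict, decoder: nn.Module):
        super().__init__()
        # one text encoder 112-n per language, e.g. {"ja": ..., "en": ...}
        self.text_encoders = nn.ModuleDict(text_encoders)
        self.decoder = decoder                       # shared decoder 114

    def forward(self, text_n, language: str, speaker_vec, h_O=None, use_text=True):
        # first phase: encode the text with the encoder of its language n;
        # second phase: bypass the text and feed the acoustic representation h_O
        h = self.text_encoders[language](text_n) if use_text else h_O
        return self.decoder(h, speaker_vec)          # predicted acoustic features O^
```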
 The unsupervised adaptation phase does not depend on text, and the adapted TTS model λ' is learned in the same manner as in FIG. 3 of the first embodiment.
 FIG. 9 is a diagram showing the configuration of the inference phase for speech synthesis in the fourth embodiment. Using the TTS model λ' obtained by unsupervised adaptation, the predicted acoustic feature O^' is predicted in the same procedure as in FIG. 8.
 In the configuration of the first embodiment there is only one text encoder 112, so speech can be synthesized in only one language. According to the fourth embodiment, a text encoder 112 is prepared for each language, so multilingual speech synthesis is possible with a single TTS model. As in the first embodiment, unsupervised adaptation using acoustic features is also possible, and the fourth embodiment can further be combined with the second and third embodiments.
[Effects of each embodiment]
 With the configurations of the above embodiments, the TTS model can be adapted by fine-tuning from acoustic features alone, without text from the target speaker. Not requiring the target speaker's text reduces the cost of annotating the speech, which is advantageous in terms of both the time and the expense required to build a TTS model.
 In each of the above embodiments, the speech synthesizer 10 is also an example of a speech synthesis learning device. The TTS model λ is an example of the first model, and the adapted TTS model λ' is an example of the second model. The predicted acoustic feature O^ is an example of the first acoustic feature, and the predicted acoustic feature O^' is an example of the second acoustic feature. The acoustic feature encoder 113 is an example of the first encoder, and the text encoder 112 is an example of the second encoder.
 Although embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.
10 Speech synthesizer
100 Drive device
101 Recording medium
102 Auxiliary storage device
103 Memory device
104 Processor
105 Interface device
111 Speaker vector encoder
112 Text encoder
113 Acoustic feature encoder
114 Decoder
115 Loss calculation unit for O
116 TTS model learning unit
117 Loss calculation unit for h
118 Loss weighting unit
119 Speaker identity removal unit
120 Loss calculation unit for s
121 Loss weighting unit
B Bus

Claims (8)

  1.  A speech synthesis learning method characterized in that a computer executes:
     a first learning procedure of learning a second model by updating a first model based on a loss between a first acoustic feature and a first predicted acoustic feature output by the first model, which receives a speaker vector indicating a speaker, a text, and the first acoustic feature relating to speech in which the speaker uttered the text; and
     a second learning procedure of updating the second model based on a loss between a second acoustic feature and a second predicted acoustic feature output by the second model, which receives a speaker vector of a target speaker and the second acoustic feature relating to speech uttered by the target speaker.
  2.  The speech synthesis learning method according to claim 1, wherein
     the first model includes:
     a first encoder that receives an acoustic feature and outputs an intermediate representation of the acoustic feature; and
     a second encoder that receives a text and outputs an intermediate representation of the text, and
     the first learning procedure updates the first model based on the loss between the first acoustic feature and the predicted acoustic feature output by the first model, which receives the speaker vector indicating the speaker, the text, and the first acoustic feature relating to speech in which the speaker uttered the text, and on a loss between the intermediate representation output by the first encoder receiving the first acoustic feature and the intermediate representation output by the second encoder receiving the text.
  3.  The speech synthesis learning method according to claim 1 or 2, wherein
     the first model includes:
     a first encoder that receives an acoustic feature and outputs an intermediate representation of the acoustic feature; and
     a speaker identity removal unit that receives the intermediate representation and outputs an intermediate representation from which speaker characteristics have been removed, and
     the first learning procedure updates the first model based on the loss between the first acoustic feature and the predicted acoustic feature output by the first model, which receives the speaker vector indicating the speaker, the text, and the first acoustic feature relating to speech in which the speaker uttered the text, and on a loss between a true speaker ID and the intermediate representation output by the speaker identity removal unit based on the intermediate representation output by the first encoder receiving the first acoustic feature.
  4.  The speech synthesis learning method according to any one of claims 1 to 3, wherein
     the first model includes, for each language, a second encoder that receives a text in that language and outputs an intermediate representation of the text, and
     the first learning procedure learns the second model by updating the first model based on the loss between the first acoustic feature and the first predicted acoustic feature output, using the second encoder corresponding to the language of the text, by the first model, which receives the speaker vector indicating the speaker, the text, and the first acoustic feature relating to speech in which the speaker uttered the text.
  5.  A speech synthesis method characterized in that a computer executes an estimation procedure of inputting a speaker vector and a text to the second model trained by the speech synthesis learning method according to any one of claims 1 to 4, and estimating an acoustic feature corresponding to the speaker vector and the text.
  6.  A speech synthesis learning device comprising:
     a first learning unit configured to learn a second model by updating a first model based on a loss between a first acoustic feature and a first predicted acoustic feature output by the first model, which receives a speaker vector indicating a speaker, a text, and the first acoustic feature relating to speech in which the speaker uttered the text; and
     a second learning unit configured to update the second model based on a loss between a second acoustic feature and a second predicted acoustic feature output by the second model, which receives a speaker vector of a target speaker and the second acoustic feature relating to speech uttered by the target speaker.
  7.  A speech synthesizer comprising an estimation unit configured to input a speaker vector and a text to the second model trained by the speech synthesis learning method according to any one of claims 1 to 4, and to estimate an acoustic feature corresponding to the speaker vector and the text.
  8.  A program that causes a computer to execute the speech synthesis learning method according to any one of claims 1 to 4 or the speech synthesis method according to claim 5.
PCT/JP2022/005903 2022-02-15 2022-02-15 Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program WO2023157066A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/005903 WO2023157066A1 (en) 2022-02-15 2022-02-15 Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program

Publications (1)

Publication Number Publication Date
WO2023157066A1 true WO2023157066A1 (en) 2023-08-24

Family

ID=87577741


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017032839A (en) * 2015-08-04 2017-02-09 日本電信電話株式会社 Acoustic model learning device, voice synthesis device, acoustic model learning method, voice synthesis method, and program
JP2017058513A (en) * 2015-09-16 2017-03-23 株式会社東芝 Learning device, speech synthesis device, learning method, speech synthesis method, learning program, and speech synthesis program
WO2019044401A1 (en) * 2017-08-29 2019-03-07 大学共同利用機関法人情報・システム研究機構 Computer system creating speaker adaptation without teacher in dnn-based speech synthesis, and method and program executed in computer system
JP2020034883A (en) * 2018-08-27 2020-03-05 日本放送協会 Voice synthesizer and program
JP2020060633A (en) * 2018-10-05 2020-04-16 日本電信電話株式会社 Acoustic model learning device, voice synthesizer and program
JP2020160319A (en) * 2019-03-27 2020-10-01 Kddi株式会社 Voice synthesizing device, method and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22926970

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE