WO2023157066A1 - Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program - Google Patents

Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program Download PDF

Info

Publication number
WO2023157066A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
acoustic feature
text
model
speech
Prior art date
Application number
PCT/JP2022/005903
Other languages
French (fr)
Japanese (ja)
Inventor
裕紀 金川
勇祐 井島
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority to PCT/JP2022/005903 priority Critical patent/WO2023157066A1/en
Publication of WO2023157066A1 publication Critical patent/WO2023157066A1/en

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation

Definitions

  • the present invention relates to a speech synthesis learning method, a speech synthesis method, a speech synthesis learning device, a speech synthesis device, and a program.
  • TTS: text-to-speech synthesis
  • DNN: deep neural network
  • the quality of synthesized speech has improved dramatically (Non-Patent Document 1).
  • statistical modeling by DNN acquires the correspondence between input and output only from data, so a large amount of training data is required to train a TTS model that synthesizes speech with high quality. If a TTS model is constructed using only a target speaker with a small amount of data, the model overfits the training data, so there are cases where desired utterance content and quality cannot be obtained when unknown text is input.
  • fMLLR: feature space maximum likelihood linear regression
  • Non-Patent Document 4 As an example of unsupervised adaptation, which is another adaptive approach, there is a method of manipulating information including speaker information that is put together with the text according to the target speaker (Non-Patent Document 4).
  • a TTS model is trained with a large number of speakers in advance using speaker expression vectors based on one-hot. Separately, a model is prepared to identify the training speaker from the input speech, and the speech of the target speaker is input to the model. Then, a vector (speaker posterior probability) indicating how much the target speaker resembles who among the large number of speakers is obtained.
  • a vector (speaker posterior probability)
  • Non-Patent Document 2 In semi-supervised learning represented by Non-Patent Document 2, adapting speech synthesis requires generating pseudo-labels, so a speech recognition model must be prepared separately. The learning cost is therefore very high, and the accuracy of the pseudo-labels depends on the speech recognition model.
  • Non-Patent Document 4 When unsupervised adaptation is performed by the approach of Non-Patent Document 4, not only is a separate speaker recognition model necessary, but also, because the equivalent of the TTS one-hot vector is predicted, the speakers recognized by the speaker recognition model must match those of the TTS model. Furthermore, if the acoustic characteristics of the target speaker differ significantly from those of the many speakers constituting the TTS model's training data, the quality of the synthesized speech is severely degraded. Since there are no pseudo-labels, it is also impossible to reduce the mismatch between the target speaker and the model by fine-tuning.
  • the present invention has been made in view of the above points, and aims to enable adaptation of a TTS model by fine-tuning from acoustic features of the target speaker's speech even when there is no text corresponding to that speech.
  • a computer executes a first learning procedure of learning a second model by updating a first model, which receives a speaker vector indicating a speaker, a text, and a first acoustic feature of the speech in which the speaker utters the text, based on a loss between a first predicted acoustic feature output by the first model and the first acoustic feature;
  • FIG. 2 is a diagram showing the configuration of the large-scale TTS model learning phase in the first embodiment.
  • FIG. 3 is a diagram showing the configuration of the unsupervised adaptation phase in the first embodiment.
  • FIG. 4 is a diagram showing the configuration of the inference phase for speech synthesis in the first embodiment.
  • FIG. 5 is a diagram showing the configuration of the inference phase for voice quality conversion in the first embodiment.
  • FIG. 6 is a diagram showing the configuration of the large-scale TTS model learning phase in the second embodiment.
  • FIG. 7 is a diagram showing the configuration of the large-scale TTS model learning phase in the third embodiment.
  • FIG. 8 is a diagram showing the configuration of the large-scale TTS model learning phase in the fourth embodiment.
  • FIG. 9 is a diagram showing the configuration of the inference phase for speech synthesis in the fourth embodiment.
  • this embodiment utilizes not only text but also acoustic features obtained from speech as input for the TTS model.
  • the DNN modules that convert the text into an intermediate representation and the intermediate representation into acoustic features are called a text encoder 112 and a decoder 114, respectively.
  • an acoustic feature encoder 113 that converts acoustic features into an intermediate representation is newly prepared so that the acoustic features, which are the output of the TTS model, can be reconstructed from the input text and the input acoustic features.
  • the intermediate representation originally obtained through the text encoder 112 can also be obtained from the acoustic features.
  • FIG. 1 is a diagram showing a hardware configuration example of a speech synthesizer 10 according to an embodiment of the present invention.
  • the speech synthesizer 10 of FIG. 1 has a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, an interface device 105, etc., which are interconnected by a bus B, respectively.
  • a program that implements processing in the speech synthesizer 10 is provided by a recording medium 101 such as a CD-ROM.
  • a recording medium 101 such as a CD-ROM.
  • the program is installed from the recording medium 101 to the auxiliary storage device 102 via the drive device 100 .
  • the program does not necessarily need to be installed from the recording medium 101, and may be downloaded from another computer via the network.
  • the auxiliary storage device 102 stores installed programs, as well as necessary files and data.
  • the memory device 103 reads and stores the program from the auxiliary storage device 102 when a program activation instruction is received.
  • the processor 104 is a CPU or a GPU (Graphics Processing Unit), or a CPU and a GPU, and executes functions related to the speech synthesizer 10 according to programs stored in the memory device 103 .
  • the interface device 105 is used as an interface for connecting to a network.
  • [First embodiment] FIGS. 2, 3, 4, and 5 show configuration examples of the large-scale TTS model learning phase, the unsupervised adaptation phase, the inference phase for speech synthesis, and the inference phase for voice quality conversion in the first embodiment, respectively.
  • the speech synthesizer 10 includes a TTS model ⁇ , a loss calculator 115 for acoustic feature O, and a TTS model learner 116 .
  • the TTS model ⁇ includes speaker vector encoder 111 , text encoder 112 , acoustic feature encoder 113 and decoder 114 . Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute.
  • a plurality of sets of learning data are prepared, each consisting of a set of speaker vectors S, texts L, and acoustic features O.
  • the speaker vector S is a continuous expression such as i-vector or x-vector indicating the speaker who uttered the speech, and is obtained by inputting the speech into the speaker vector extractor.
  • the text L is information indicating the content of the voice (content of the utterance).
  • a raw text, a sequence of phonemes and accents, or a linguistic feature vectorized from them can be used.
  • the acoustic feature quantity O is the acoustic feature quantity of the speech.
  • Acoustic features include mel-spectrogram, mel-cepstrum, fundamental frequency, etc., which are information necessary for reconstructing speech waveforms.
  • X ⁇ (X is an arbitrary symbol) in the text indicates a symbol with ⁇ added above X in the drawing.
  • the speaker of each training data may be different, and the text L may also be different. Any training data speaker may be the target speaker in the unsupervised adaptation phase described below.
  • the speaker vector encoder 111 inputs the speaker vector S, calculates and outputs an intermediate representation of the speaker vector S (hereinafter referred to as "intermediate representation of the speaker vector S").
  • the text encoder 112 receives the text L, computes an intermediate representation hL of the text L, and outputs it.
  • the acoustic feature quantity encoder 113 receives the acoustic feature quantity O, and calculates and outputs an intermediate representation hO of the acoustic feature quantity O.
  • the decoder 114 receives the intermediate representation of the speaker vector S, the intermediate representation hL, and the intermediate representation hO. However, the intermediate representation hL and the intermediate representation hO are input to the decoder 114 at different timings. That is, for one piece of training data, the decoder 114 executes two phases: a phase in which the intermediate representation of the speaker vector S and the intermediate representation hL are input (hereinafter the "first phase"), and a phase in which the intermediate representation of the speaker vector S and the intermediate representation hO are input (hereinafter the "second phase"). In the unsupervised adaptation described later there is no text for the unknown speaker, so the TTS model λ is constructed in this way so that the predicted acoustic feature O^ can be output whether the text L or the acoustic feature O is input.
  • the decoder 114 inputs the intermediate representation of the speaker vector S and the intermediate representation hL , and calculates and outputs the predicted acoustic feature O ⁇ .
  • the loss calculation unit 115 for O receives the predicted acoustic feature O^ and the acoustic feature O, and calculates and outputs the loss Lo, which is the error between the acoustic feature O and the predicted acoustic feature O^.
  • for the loss Lo, an index of the error between vectors of the same dimension, such as the mean squared error or the mean absolute error, can be used.
  • the TTS model learning unit 116 receives the TTS model λ and the loss Lo, and learns the TTS model λ~ by updating the model parameters of the TTS model λ based on the loss Lo.
  • X ⁇ (X is an arbitrary symbol) in the text indicates a symbol in which “-” is added above X in the drawings.
  • the TTS model learning unit 116 updates the TTS model ⁇ so as to minimize the loss Lo .
  • Model parameters that reduce the loss Lo can be obtained by executing error backpropagation with the help of the gradient information when the predicted acoustic feature O ⁇ was generated.
  • the decoder 114 inputs the intermediate representation of the speaker vector S and the intermediate representation hO , and calculates and outputs the predicted acoustic feature O ⁇ .
  • the TTS model learning unit 116 learns the TTS model ⁇ ⁇ by updating the TTS model ⁇ as in the first phase.
  • the TTS model ⁇ ⁇ was obtained during the large-scale TTS model training phase of FIG.
  • a plurality of sets of learning data are prepared, each of which consists of one target speaker's acoustic feature O' and the target speaker's speaker vector S'. Therefore, the speaker of each training data is common. However, the voice indicated by the acoustic feature O' of each learning data is different.
  • the TTS model ⁇ ⁇ receives the acoustic feature O′ of the target speaker and the speaker vector S′ of the target speaker, and calculates and outputs the predicted acoustic feature ⁇ ′.
  • as in FIG. 2, the TTS model learning unit 116 updates the TTS model λ~ so as to minimize the loss Lo, which is the error between the predicted acoustic feature O^' and the acoustic feature O', thereby learning the TTS model λ~'.
  • that is, with the configuration of FIG. 2, even though no text of the unknown speaker is input to the TTS model λ~, the TTS model λ~' can be trained by substituting the unknown speaker's acoustic feature O'. This enables adaptation (≈ fine-tuning) using the acoustic feature O'.
  • in this adaptation phase, both the input and output of the TTS model λ~' are acoustic features, so the TTS model λ~' is equivalent to an autoencoder. Since the adaptation data contain no text, the acoustic feature encoder 113 may overfit, and the intermediate representation hO may no longer predict the information corresponding to the intermediate representation hL in FIG. 2. Therefore, by freezing the acoustic feature encoder 113 (fixing its model parameters) and updating only the decoder 114, the model can be adapted to the target speaker while avoiding the risk that the text content, which is a prerequisite of the TTS model, collapses.
  • the trained TTS model λ~' receives an arbitrary text L' to be synthesized and the speaker vector S' of the target speaker, and calculates (estimates) and outputs the predicted acoustic feature O^'. Since the TTS model λ~' has been adapted in the phase described with reference to FIG. 3, speech can be synthesized without significant loss of quality even for a target speaker not included in the training data of FIG. 2.
  • since the TTS model λ~ of the first embodiment is configured to predict acoustic features from either text or acoustic features as input, it can also be used for voice quality conversion by replacing the speaker vector with that of a different speaker.
  • in the inference phase for voice quality conversion, the TTS model λ~' receives an acoustic feature O'' and a speaker vector S'' of a speaker different from the speaker of O'', and thereby predicts (outputs) the acoustic feature O^'' corresponding to the speaker vector S''.
  • unsupervised adaptation by fine-tuning of the TTS model is possible only from the acoustic features without using the target speaker's text.
  • it is possible to eliminate the need to annotate the speech of the target speaker, thereby reducing both the time and cost required to construct the TTS model.
  • the second embodiment is described focusing on the points that differ from the first embodiment; points not specifically mentioned in the second embodiment may be the same as in the first embodiment.
  • FIG. 6 is a diagram showing the configuration of the large-scale TTS model learning phase in the second embodiment.
  • the same parts as those in FIG. 2 are denoted by the same reference numerals, and the description thereof will be omitted as appropriate.
  • the speech synthesizer 10 further has a loss calculator 117 and a loss weighter 118 for the intermediate representation h. Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute.
  • the TTS model ⁇ receives the speaker vector S, the text L, and the acoustic feature O, and outputs the predicted acoustic feature ⁇ .
  • a loss calculator 115 relating to O receives the acoustic feature quantity O and the predicted acoustic feature quantity ⁇ and outputs a loss Lo . Note that, as described with reference to FIG. 2, for each of the first phase related to the text L and the second phase related to the acoustic feature O, the predicted acoustic feature ⁇ and the loss Lo are output.
  • the loss calculation unit 117 for h further receives the intermediate representation hL output by the text encoder 112 and the intermediate representation hO output by the acoustic feature encoder 113, and calculates and outputs the loss Lh between hL and hO.
  • for the index of the loss Lh, not only the mean squared error or the mean absolute error but also the cosine distance or the like is used to constrain the error between hL and hO to be small.
  • the loss weighting unit 118 receives, for each of the first and second phases, the loss Lo output by the loss calculation unit 115 for O and the loss Lh output by the loss calculation unit 117 for h, and calculates and outputs the weighted loss (the weighted sum of Lo and Lh).
  • the weighting coefficient may be fixed, or may be a learning target.
  • the TTS model learning unit 116 learns the TTS model ⁇ ⁇ by updating the model parameters of the TTS model ⁇ so as to minimize the weighted loss for each of the first and second phases. By doing so, in preparation for unsupervised adaptation, it is possible to increase the possibility of predicting the output of the text encoder 112 from the acoustic feature quantity encoder 113 as well.
  • the processing procedure after the unsupervised adaptation phase may be the same as in the first embodiment.
  • in the configuration of the first embodiment there is no constraint that makes the acoustic features carry text information, so the intermediate representation hO from the acoustic feature encoder 113 does not necessarily resemble the intermediate representation hL from the text encoder 112.
  • this problem can be reduced by restricting hO to a vector similar to hL in the course of learning.
  • FIG. 7 is a diagram showing the configuration of the large-scale TTS model learning phase in the third embodiment.
  • the speech synthesizer 10 further includes a speaker identity removal unit 119, which is a module for removing speaker identity from the intermediate representation hO produced by the acoustic feature encoder 113, a loss calculation unit 120 for s, and a loss weighting unit 121. Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute.
  • a speaker ID is data that identifies a speaker in a form different from the speaker vector.
  • the TTS model ⁇ receives the speaker vector S, the text L, and the acoustic feature O, and outputs the predicted acoustic feature ⁇ .
  • a loss calculator 115 for O receives the acoustic feature quantity O and the predicted acoustic feature quantity ⁇ , and outputs a loss Lo . Note that, as described with reference to FIG. 2, for each of the first phase related to the text L and the second phase related to the acoustic feature O, the predicted acoustic feature ⁇ and the loss Lo are output.
  • the speaker characteristics removal unit 119 receives the intermediate representation hO output by the acoustic feature encoder 113, and calculates and outputs the intermediate representation h′O from which the speaker characteristics are removed.
  • the intermediate representation h'O from which the speaker's characteristic is removed is an intermediate representation obtained by removing the voice features of the speaker from the intermediate representation hO .
  • the speaker adversarial learning device or the like proposed in Patent Document 3 can be used for the speaker identity removal unit 119 .
  • the loss calculation unit 120 for s receives the intermediate representation h′O with speaker characteristics removed and the true speaker ID s, and calculates and outputs the loss L s .
  • the loss L s is an index that takes a larger value as the probability that h′ O corresponds to speaker s is lower.
  • the loss Ls can be an index for solving a discrimination problem, such as cross-entropy.
  • the loss weighting unit 121 receives, for each of the first and second phases, the loss Lo output by the loss calculation unit 115 for O and the loss Ls output by the loss calculation unit 120 for s, and calculates and outputs the weighted loss (the weighted sum of Lo and Ls). Note that the weighting coefficients may be fixed or may be learned.
  • the TTS model learning unit 116 learns the TTS model ⁇ ⁇ by updating the model parameters of the TTS model ⁇ to minimize the weighted loss.
  • the processing procedure after the unsupervised adaptation phase may be the same as in the first embodiment.
  • the output of the text encoder 112 does not include speaker characteristics, whereas the output of the acoustic feature quantity encoder 113 includes speaker characteristics. A mismatch between the two causes deterioration of TTS performance. According to the third embodiment, it is possible to reduce speaker characteristics from the intermediate representation h O by the acoustic feature encoder 113 .
  • the third embodiment may be used together with the second embodiment.
  • in that case, the loss weighting unit 121 may receive the loss Lo, the loss Ls, and the loss Lh, and output a weighted loss.
  • FIG. 8 is a diagram showing the configuration of the large-scale TTS model learning phase in the fourth embodiment.
  • the text L n indicates the text according to the language n of the utterance.
  • a plurality of sets of learning data each including a speaker vector S, a text Ln , and an acoustic feature O are prepared.
  • Acoustic feature O is an acoustic feature of speech in which text Ln is uttered in language n.
  • Language n of each of the plurality of learning data is one of 1 to N, and learning data for each language of 1 to N are prepared. The meaning of the text Ln of each learning data may be different.
  • the processing of the speaker vector encoder 111 and acoustic feature quantity encoder 113 is the same as in FIG.
  • the text encoder 112-n corresponding to the language n of the input training data text Ln calculates and outputs the intermediate representation hLn (a minimal routing sketch for such per-language encoders appears after this list).
  • the decoder 114 receives the intermediate representation hLn and the speaker vector S in the first phase, receives the intermediate representation hO and the speaker vector S in the second phase, and outputs the predicted acoustic feature O^. Thereafter, the TTS model λ is updated and the TTS model λ~ is learned in the same procedure as in the first embodiment.
  • the flow for inputting the acoustic feature quantity O to the TTS model ⁇ is the same as in the first embodiment.
  • since the unsupervised adaptation phase does not depend on text, the adapted TTS model λ~' is learned as in FIG. 3 of the first embodiment.
  • FIG. 9 is a diagram showing the configuration of the inference phase for speech synthesis in the fourth embodiment.
  • the predicted acoustic feature O^' is predicted in the same procedure as in FIG. 4.
  • the speech synthesizer 10 is also an example of a speech synthesis learning device.
  • the TTS model ⁇ is an example of the first model.
  • the TTS model ⁇ ⁇ is an example of the second model.
  • the predicted acoustic feature O^ is an example of the first predicted acoustic feature.
  • the predicted acoustic feature O^' is an example of the second predicted acoustic feature.
  • the acoustic feature quantity encoder 113 is an example of a first encoder.
  • Text encoder 112 is an example of a second encoder.
  • Reference signs: 10 Speech synthesizer; 100 Drive device; 101 Recording medium; 102 Auxiliary storage device; 103 Memory device; 104 Processor; 105 Interface device; 111 Speaker vector encoder; 112 Text encoder; 113 Acoustic feature encoder; 114 Decoder; 115 Loss calculation unit for O; 116 TTS model learning unit; 117 Loss calculation unit for h; 118 Loss weighting unit; 119 Speaker identity removal unit; 120 Loss calculation unit for s; 121 Loss weighting unit; B Bus
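For the fourth embodiment above, in which per-language text encoders 112-1 through 112-N feed a shared decoder, the following is a minimal sketch of routing text through the encoder of its language; the ModuleDict-based routing and the layer choices are assumptions, not the patent's implementation.

```python
import torch.nn as nn

class MultilingualTextEncoders(nn.Module):
    """Hypothetical per-language text encoders 112-n feeding one shared decoder."""
    def __init__(self, vocab_sizes, hidden=256):
        super().__init__()
        # vocab_sizes: dict mapping a language name to its symbol-vocabulary size.
        self.embeddings = nn.ModuleDict(
            {lang: nn.Embedding(size, hidden) for lang, size in vocab_sizes.items()})
        self.encoders = nn.ModuleDict(
            {lang: nn.LSTM(hidden, hidden, batch_first=True) for lang in vocab_sizes})

    def forward(self, text_ids, lang):
        h, _ = self.encoders[lang](self.embeddings[lang](text_ids))
        return h   # intermediate representation h_Ln for the utterance's language n
```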

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A computer executes: a first learning procedure of learning a second model by updating a first model, which receives a speaker vector representing a speaker, a text, and a first acoustic feature of speech in which the speaker utters the text, on the basis of a loss between a first predicted acoustic feature output by the first model and the first acoustic feature; and a second learning procedure of updating the second model, which receives a speaker vector of a target speaker and a second acoustic feature of speech uttered by the target speaker, on the basis of a loss between a second predicted acoustic feature output by the second model and the second acoustic feature. Thereby, even if there is no text corresponding to the speech of the target speaker, adaptation through fine-tuning of a TTS model from the acoustic feature of that speech is made possible.

Description

Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program

TECHNICAL FIELD
The present invention relates to a speech synthesis learning method, a speech synthesis method, a speech synthesis learning device, a speech synthesis device, and a program.
In text-to-speech synthesis (TTS), which predicts speech from text, the mainstream in recent years has been statistical parametric speech synthesis. This is a method of modeling the correspondence between texts serving as training data and the corresponding speech. By using a deep neural network (DNN) as the modeling technique, the quality of synthesized speech has improved dramatically (Non-Patent Document 1). However, because statistical modeling with a DNN acquires the correspondence between input and output only from data, a large amount of training data is required to train a TTS model that synthesizes speech with high quality. If a TTS model is built using only a target speaker with little data, the model overfits the training data, and the desired utterance content and quality may not be obtained when unknown text is input.

To deal with this problem, there is a technique that starts from a model trained with a large amount of data from many speakers and fine-tunes it to fit a target speaker with little data. This is called adaptation, and it has long been studied in speech recognition and synthesis.

Among adaptation settings, the case where speech of the target speaker exists but the corresponding text (hereinafter "label") does not can be broadly divided into two approaches: semi-supervised learning and unsupervised learning.

In feature-space maximum likelihood linear regression (fMLLR), one method of semi-supervised learning, the target speaker's speech is first recognized by a speaker-independent model to obtain pseudo-labels corresponding to the speech. Treating these labels as supervision, linear regression coefficients for the speech features are estimated so as to fill in the mismatch between the target speaker's speech and the model (Non-Patent Document 2, Patent Document 1). Another method of filling the mismatch with pseudo-labels takes the confidence of the pseudo-labels into account and fine-tunes the DNN to fit the target speaker, for example by excluding pseudo-labels with low confidence (Non-Patent Document 3).

As an example of unsupervised adaptation, the other adaptation approach, there is a method that manipulates the information fed in together with the text, including the speaker information, according to the target speaker (Non-Patent Document 4). Based on Patent Document 2, a TTS model is first trained in advance on many speakers using one-hot speaker representation vectors. Separately, a model that identifies the training speakers from input speech is prepared, and the target speaker's speech is input to that model. This yields a vector (speaker posterior probability) indicating how much the target speaker resembles each of the many speakers. By feeding this vector into the TTS model as the speaker vector instead of the one-hot vector, synthesized speech resembling the target speaker can be obtained without obtaining pseudo-labels.
U.S. Patent Application Publication No. 20120173240 (Patent Document 1)
Japanese Patent No. 6680933 (Patent Document 2)
U.S. Patent No. 10347241 (Patent Document 3)
In semi-supervised learning represented by Non-Patent Document 2, adapting speech synthesis requires generating pseudo-labels, so a speech recognition model must be prepared separately. The learning cost is therefore very high, and the accuracy of the pseudo-labels depends on the speech recognition model.

When unsupervised adaptation is performed with the approach of Non-Patent Document 4, not only is a separate speaker recognition model required, but also, because the equivalent of the TTS one-hot vector is predicted, the speakers recognized by the speaker recognition model must match those of the TTS model. Furthermore, if the acoustic characteristics of the target speaker differ significantly from those of the many speakers constituting the TTS model's training data, the quality of the synthesized speech is severely degraded. Since there are no pseudo-labels, it is also impossible to reduce the mismatch between the target speaker and the model by fine-tuning.

The present invention has been made in view of the above points, and aims to enable adaptation of a TTS model by fine-tuning from acoustic features of the target speaker's speech even when there is no text corresponding to that speech.

To solve the above problem, a computer executes: a first learning procedure of learning a second model by updating a first model, which receives a speaker vector indicating a speaker, a text, and a first acoustic feature of speech in which the speaker utters the text, based on a loss between a first predicted acoustic feature output by the first model and the first acoustic feature; and a second learning procedure of updating the second model, which receives a speaker vector of the target speaker and a second acoustic feature of speech uttered by the target speaker, based on a loss between a second predicted acoustic feature output by the second model and the second acoustic feature.

Even if there is no text corresponding to the target speaker's speech, adaptation of the TTS model by fine-tuning from the acoustic features of that speech thus becomes possible.
FIG. 1 is a diagram showing a hardware configuration example of the speech synthesizer 10 according to an embodiment of the present invention.
FIG. 2 is a diagram showing the configuration of the large-scale TTS model learning phase in the first embodiment.
FIG. 3 is a diagram showing the configuration of the unsupervised adaptation phase in the first embodiment.
FIG. 4 is a diagram showing the configuration of the inference phase for speech synthesis in the first embodiment.
FIG. 5 is a diagram showing the configuration of the inference phase for voice quality conversion in the first embodiment.
FIG. 6 is a diagram showing the configuration of the large-scale TTS model learning phase in the second embodiment.
FIG. 7 is a diagram showing the configuration of the large-scale TTS model learning phase in the third embodiment.
FIG. 8 is a diagram showing the configuration of the large-scale TTS model learning phase in the fourth embodiment.
FIG. 9 is a diagram showing the configuration of the inference phase for speech synthesis in the fourth embodiment.
Unlike Non-Patent Document 4, this embodiment uses not only text but also acoustic features obtained from speech as input to the TTS model for unsupervised adaptation.

In the TTS model, the DNN modules that convert text into an intermediate representation and the intermediate representation into acoustic features are called the text encoder 112 and the decoder 114, respectively. In this embodiment, an acoustic feature encoder 113 that converts acoustic features into an intermediate representation is newly prepared so that the acoustic features, which are the output of the TTS model, can be reconstructed from either the input text or the input acoustic features. As a result, the intermediate representation originally obtained through the text encoder 112 can also be obtained from the acoustic features.
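As a concrete picture of this module layout, the following is a minimal PyTorch-style sketch. It is an illustration only: the layer types and sizes, the concatenation-based speaker conditioning, and the omission of the attention or duration model that would align text-rate representations with frame-rate acoustic features are all assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class TTSModel(nn.Module):
    """Hypothetical sketch of the TTS model lambda: a text encoder (112) and an
    acoustic feature encoder (113) feed intermediate representations into one
    shared decoder (114), conditioned on the speaker vector via encoder 111."""

    def __init__(self, vocab_size=100, spk_dim=512, n_mels=80, hidden=256):
        super().__init__()
        self.speaker_vector_encoder = nn.Linear(spk_dim, hidden)              # 111
        self.text_embedding = nn.Embedding(vocab_size, hidden)
        self.text_encoder = nn.LSTM(hidden, hidden, batch_first=True)         # 112
        self.acoustic_feature_encoder = nn.LSTM(n_mels, hidden,
                                                batch_first=True)             # 113
        self.decoder = nn.LSTM(2 * hidden, n_mels, batch_first=True)          # 114

    def encode_text(self, text_ids):
        h_L, _ = self.text_encoder(self.text_embedding(text_ids))
        return h_L                                  # intermediate representation h_L

    def encode_acoustic(self, feats):
        h_O, _ = self.acoustic_feature_encoder(feats)
        return h_O                                  # intermediate representation h_O

    def decode(self, h, spk_vec):
        s = self.speaker_vector_encoder(spk_vec)        # intermediate rep. of S
        s = s.unsqueeze(1).expand(-1, h.size(1), -1)    # broadcast over time steps
        out, _ = self.decoder(torch.cat([h, s], dim=-1))
        return out                                  # predicted acoustic features O^
```

In this sketch the decoder simply runs at whatever time resolution its input h has; a real system would upsample h_L to the acoustic frame rate with an attention mechanism or a duration model.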
As speaker information, a continuous representation such as an i-vector or x-vector obtained with a speaker vector extractor is used instead of a discrete representation such as a one-hot vector. This allows the training speakers of the TTS model and of the speaker recognizer to differ, and a wide variety of speaker characteristics can be covered by increasing the training data of the speaker vector extractor, whose annotation cost is comparatively small.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a diagram showing a hardware configuration example of the speech synthesizer 10 according to an embodiment of the present invention. The speech synthesizer 10 of FIG. 1 has a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, an interface device 105, and the like, which are interconnected by a bus B.

A program that implements the processing of the speech synthesizer 10 is provided on a recording medium 101 such as a CD-ROM. When the recording medium 101 storing the program is set in the drive device 100, the program is installed from the recording medium 101 into the auxiliary storage device 102 via the drive device 100. However, the program does not necessarily need to be installed from the recording medium 101, and may instead be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed program as well as necessary files and data.

The memory device 103 reads the program from the auxiliary storage device 102 and stores it when an instruction to start the program is received. The processor 104 is a CPU or a GPU (Graphics Processing Unit), or a CPU and a GPU, and executes the functions of the speech synthesizer 10 according to the programs stored in the memory device 103. The interface device 105 is used as an interface for connecting to a network.
[First embodiment]
FIGS. 2, 3, 4, and 5 show configuration examples of the large-scale TTS model learning phase, the unsupervised adaptation phase, the inference phase for speech synthesis, and the inference phase for voice quality conversion in the first embodiment, respectively.

In this embodiment, the speech synthesizer 10 includes a TTS model λ, a loss calculation unit 115 for the acoustic feature O, and a TTS model learning unit 116. The TTS model λ includes a speaker vector encoder 111, a text encoder 112, an acoustic feature encoder 113, and a decoder 114. Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute.

[Overall flow]
The flow of the large-scale TTS model learning phase of the first embodiment is described with reference to FIG. 2.

In the large-scale TTS model learning phase, multiple sets of training data are prepared, each consisting of a speaker vector S, a text L, and an acoustic feature O. The speaker vector S is a continuous representation, such as an i-vector or x-vector, indicating the speaker who uttered the speech, and is obtained by inputting the speech into a speaker vector extractor. The text L is information indicating the content of the speech (the content of the utterance). As the text L, raw text, a sequence of phonemes and accents, or linguistic features obtained by vectorizing them can be used. The acoustic feature O is the acoustic feature of the speech. Acoustic features such as the mel-spectrogram, mel-cepstrum, and fundamental frequency, which carry the information necessary to reconstruct the speech waveform, are used. Note that X^ (X is an arbitrary symbol) in the text denotes the symbol with ^ placed above X in the drawings. The speaker of each piece of training data may differ, and the text L may also differ. The speaker of any piece of training data may be the target speaker of the unsupervised adaptation phase described later.
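One hypothetical way to assemble such a training tuple (S, L, O) is sketched below with a torchaudio mel-spectrogram front end. The sampling rate and spectrogram settings are assumed values, and `speaker_vector_extractor` is an abstract placeholder standing in for an i-vector/x-vector extractor.

```python
import torch
import torchaudio

# Illustrative mel-spectrogram front end; all parameter values are assumptions.
mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80)

def make_training_example(wav_path, phoneme_ids, speaker_vector_extractor):
    """Return one (S, L, O) tuple: speaker vector, text, acoustic features."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != 22050:                                  # resample to the assumed rate
        waveform = torchaudio.functional.resample(waveform, sr, 22050)
    O = mel_extractor(waveform).squeeze(0).transpose(0, 1)   # (frames, n_mels)
    S = speaker_vector_extractor(waveform)           # e.g. an x-vector or i-vector
    L = torch.tensor(phoneme_ids, dtype=torch.long)  # phoneme/accent ID sequence
    return S, L, O
```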
The speaker vector encoder 111 receives the speaker vector S, and calculates and outputs an intermediate representation of the speaker vector S (hereinafter, the "intermediate representation of the speaker vector S").

The text encoder 112 receives the text L, and calculates and outputs an intermediate representation hL of the text L.

The acoustic feature encoder 113 receives the acoustic feature O, and calculates and outputs an intermediate representation hO of the acoustic feature O.

The decoder 114 receives the intermediate representation of the speaker vector S, the intermediate representation hL, and the intermediate representation hO. However, the intermediate representation hL and the intermediate representation hO are input to the decoder 114 at different timings. That is, for one piece of training data, the decoder 114 executes two phases: a phase in which the intermediate representation of the speaker vector S and the intermediate representation hL are input (hereinafter, the "first phase"), and a phase in which the intermediate representation of the speaker vector S and the intermediate representation hO are input (hereinafter, the "second phase"). In the unsupervised adaptation described later there is no text for the unknown speaker, so the TTS model λ is constructed in this way so that the predicted acoustic feature O^ can be output whether the text L or the acoustic feature O is input.

First, the first phase is described.

The decoder 114 receives the intermediate representation of the speaker vector S and the intermediate representation hL, and calculates and outputs the predicted acoustic feature O^.

Next, the loss calculation unit 115 for O receives the predicted acoustic feature O^ and the acoustic feature O, and calculates and outputs the loss Lo, which is the error between the acoustic feature O and the predicted acoustic feature O^. For the loss Lo, an index of the error between vectors of the same dimension, such as the mean squared error or the mean absolute error, can be used.

The TTS model learning unit 116 then receives the TTS model λ and the loss Lo, and learns the TTS model λ~ by updating the model parameters of the TTS model λ based on the loss Lo. Note that X~ (X is an arbitrary symbol) in the text denotes the symbol with a bar placed above X in the drawings.

When the TTS model λ is composed of DNNs, the TTS model learning unit 116 updates the TTS model λ so as to minimize the loss Lo. Model parameters that reduce the loss Lo can be obtained by running error backpropagation using the gradient information from the computation of the predicted acoustic feature O^.

Next, the second phase is described.

The decoder 114 receives the intermediate representation of the speaker vector S and the intermediate representation hO, and calculates and outputs the predicted acoustic feature O^.

The TTS model learning unit 116 learns the TTS model λ~ by updating the TTS model λ in the same way as in the first phase.

Since the first phase and the second phase are executed for each piece of training data, the TTS model λ is updated twice per piece of training data. By training the model to obtain the predicted acoustic feature O^ through the decoder 114 from the two kinds of intermediate representations hL and hO, information equivalent to the text can be obtained from the acoustic features even when no text is available.
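A minimal sketch of this two-phase update follows, continuing the assumptions of the earlier model sketch; the Adam optimizer and the crude truncation to a common length (in place of a proper attention or duration alignment) are assumptions made to keep the example short.

```python
import torch
import torch.nn.functional as F

def large_scale_training_step(model, optimizer, S, L, O):
    """Two updates per training example: one through h_L, one through h_O."""
    def update(h):
        optimizer.zero_grad()
        O_hat = model.decode(h, S.unsqueeze(0))
        T = min(O_hat.size(1), O.size(1))       # crude length alignment (assumption)
        loss_o = F.mse_loss(O_hat[:, :T], O.unsqueeze(0)[:, :T])   # loss L_o
        loss_o.backward()
        optimizer.step()
        return loss_o.item()

    loss_first = update(model.encode_text(L.unsqueeze(0)))       # first phase (h_L)
    loss_second = update(model.encode_acoustic(O.unsqueeze(0)))  # second phase (h_O)
    return loss_first, loss_second

# Usage sketch: optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```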
Next, the flow of the unsupervised adaptation phase of the first embodiment is described with reference to FIG. 3. In FIG. 3, the same parts as in FIG. 2 are given the same names. The TTS model λ~ is the one obtained in the large-scale TTS model learning phase of FIG. 2.

In the unsupervised adaptation phase, multiple sets of training data are prepared, each consisting of an acoustic feature O' of a single target speaker and the target speaker's speaker vector S'. The speaker of every piece of training data is therefore the same, but the speech represented by the acoustic feature O' differs from one piece of training data to another.

The TTS model λ~ receives the target speaker's acoustic feature O' and the target speaker's speaker vector S', and calculates and outputs the predicted acoustic feature O^'. As in FIG. 2, the TTS model learning unit 116 updates the TTS model λ~ so as to minimize the loss Lo, which is the error between the predicted acoustic feature O^' and the acoustic feature O', thereby learning the TTS model λ~'. That is, with the configuration of FIG. 2, even though no text of the unknown speaker is input to the TTS model λ~, the TTS model λ~' can be trained by substituting the unknown speaker's acoustic feature O'. This enables adaptation (≈ fine-tuning) using the acoustic feature O'.

In this adaptation phase, both the input and output of the TTS model λ~' are acoustic features, so the TTS model λ~' is equivalent to an autoencoder. Since the adaptation data contain no text, the acoustic feature encoder 113 may overfit, and the intermediate representation hO may no longer predict the information corresponding to the intermediate representation hL in FIG. 2. Therefore, by freezing the acoustic feature encoder 113 (fixing its model parameters) and updating only the decoder 114, the model can be adapted to the target speaker while avoiding the risk that the text content, which is a prerequisite of the TTS model, collapses.
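A sketch of this adaptation step under the same assumptions; freezing exactly the acoustic feature encoder and optimizing only the decoder is one reading of the paragraph above, not the only possible configuration.

```python
import torch
import torch.nn.functional as F

def build_adaptation_optimizer(model, lr=1e-4):
    """Freeze the acoustic feature encoder 113 and optimize only the decoder 114."""
    for p in model.acoustic_feature_encoder.parameters():
        p.requires_grad = False
    return torch.optim.Adam(model.decoder.parameters(), lr=lr)

def unsupervised_adaptation_step(model, optimizer, S_target, O_target):
    """One fine-tuning step from a (S', O') pair of the target speaker, no text."""
    optimizer.zero_grad()
    h_O = model.encode_acoustic(O_target.unsqueeze(0))
    O_hat = model.decode(h_O, S_target.unsqueeze(0))
    loss_o = F.mse_loss(O_hat, O_target.unsqueeze(0))   # autoencoder-style loss L_o
    loss_o.backward()
    optimizer.step()
    return loss_o.item()
```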
Next, the flow of the inference phase for speech synthesis in the first embodiment is described with reference to FIG. 4. The model used is the one obtained in the unsupervised adaptation phase of FIG. 3.

The trained TTS model λ~' receives an arbitrary text L' to be synthesized and the target speaker's speaker vector S', and calculates (estimates) and outputs the predicted acoustic feature O^'. Since the TTS model λ~' has been adapted in the phase described with reference to FIG. 3, speech can be synthesized without significant loss of quality even for a target speaker not included in the training data of FIG. 2.

Because the TTS model λ~ of the first embodiment is configured to predict acoustic features from either text or acoustic features as input, it can also be used for voice quality conversion by replacing the speaker vector with that of another speaker. In the inference phase for voice quality conversion shown in FIG. 5, the TTS model λ~' receives an acoustic feature O'' and a speaker vector S'' of a speaker different from the speaker of O'', and thereby predicts (outputs) the acoustic feature O^'' corresponding to the speaker vector S''.
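Continuing the same hypothetical sketch, inference for speech synthesis and for voice quality conversion differ only in which encoder is used and whose speaker vector is supplied; the helper names below and the separate vocoder are assumptions.

```python
import torch

def synthesize(model, text_ids, spk_vec):
    """Speech synthesis: arbitrary text L' plus the adapted target speaker's S'."""
    model.eval()
    with torch.no_grad():
        return model.decode(model.encode_text(text_ids.unsqueeze(0)),
                            spk_vec.unsqueeze(0))

def convert_voice(model, source_feats, target_spk_vec):
    """Voice quality conversion: acoustic features O'' of one speaker plus the
    speaker vector S'' of a different speaker."""
    model.eval()
    with torch.no_grad():
        return model.decode(model.encode_acoustic(source_feats.unsqueeze(0)),
                            target_spk_vec.unsqueeze(0))

# A separate vocoder (outside the scope of this sketch) would turn the predicted
# acoustic features back into a waveform.
```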
As described above, according to the first embodiment, unsupervised adaptation of the TTS model by fine-tuning is possible from the acoustic features alone, without using any text of the target speaker. As a result, annotation of the target speaker's speech becomes unnecessary, which reduces both the time and the monetary cost of building a TTS model.

[Second embodiment]
Next, the second embodiment is described, focusing on the points that differ from the first embodiment. Points not specifically mentioned in the second embodiment may be the same as in the first embodiment.

FIG. 6 is a diagram showing the configuration of the large-scale TTS model learning phase in the second embodiment. In FIG. 6, the same parts as in FIG. 2 are given the same reference numerals, and their description is omitted as appropriate. In FIG. 6, the speech synthesizer 10 further has a loss calculation unit 117 and a loss weighting unit 118 for the intermediate representation h. Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute.

[Overall flow]
The flow of the large-scale TTS model learning phase of the second embodiment is described with reference to FIG. 6. As in FIG. 2, the TTS model λ receives the speaker vector S, the text L, and the acoustic feature O, and outputs the predicted acoustic feature O^. The loss calculation unit 115 for O receives the acoustic feature O and the predicted acoustic feature O^ and outputs the loss Lo. As described with reference to FIG. 2, the predicted acoustic feature O^ and the loss Lo are output for each of the first phase related to the text L and the second phase related to the acoustic feature O.

In the second embodiment, the loss calculation unit 117 for h additionally receives the intermediate representation hL output by the text encoder 112 and the intermediate representation hO output by the acoustic feature encoder 113, and calculates and outputs the loss Lh between hL and hO. For the index of the loss Lh, not only the mean squared error or the mean absolute error but also the cosine distance or the like is used to constrain the error between hL and hO to be small.

The loss weighting unit 118 receives, for each of the first and second phases, the loss Lo output by the loss calculation unit 115 for O and the loss Lh output by the loss calculation unit 117 for h, and calculates and outputs the weighted loss (the weighted sum of Lo and Lh). The weighting coefficients may be fixed or may themselves be learned. The TTS model learning unit 116 learns the TTS model λ~ by updating the model parameters of the TTS model λ so as to minimize the weighted loss for each of the first and second phases. In this way, in preparation for unsupervised adaptation, the likelihood that something equivalent to the output of the text encoder 112 can also be predicted from the acoustic feature encoder 113 is increased.
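A sketch of this weighted loss for the second embodiment; the cosine-distance form of Lh, the fixed weight values, and the truncation used in place of a proper alignment between hL and hO are assumptions.

```python
import torch.nn.functional as F

def weighted_loss_with_lh(O_hat, O, h_L, h_O, w_o=1.0, w_h=0.1):
    """Weighted sum of the reconstruction loss L_o and the representation loss L_h."""
    loss_o = F.mse_loss(O_hat, O)
    # L_h: cosine distance between the two intermediate representations,
    # crudely truncated to a common number of time steps.
    T = min(h_L.size(1), h_O.size(1))
    loss_h = (1.0 - F.cosine_similarity(h_L[:, :T], h_O[:, :T], dim=-1)).mean()
    return w_o * loss_o + w_h * loss_h   # minimized by the TTS model learning unit
```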
The processing after the unsupervised adaptation phase may be the same as in the first embodiment.

In the configuration of the first embodiment, there is no constraint that makes the acoustic features carry text information, so the intermediate representation hO from the acoustic feature encoder 113 does not necessarily resemble the intermediate representation hL from the text encoder 112. In the second embodiment, this problem can be reduced by constraining hO to become a vector similar to hL during training.
 [第3の実施の形態]
 次に、第3の実施の形態について説明する。第3の実施の形態では第1の実施の形態と異なる点について説明する。第3の実施の形態において特に言及されない点については、第1の実施の形態と同様でもよい。
[Third embodiment]
Next, a third embodiment will be described. In the third embodiment, differences from the first embodiment will be explained. Points not particularly mentioned in the third embodiment may be the same as those in the first embodiment.
 図7は、第3の実施の形態における大規模TTSモデル学習フェーズの構成を示す図である。図7中、図2と同一部分には同一符号を付し、その説明は適宜省略する。図7において、音声合成装置10は、音響特徴量エンコーダ113による中間表現hから話者性を抜くモジュールである話者性除去部119と、sに関する損失計算部120と、損失重みづけ部121とを更に有する。これら各部は、音声合成装置10にインストールされた1以上のプログラムが、プロセッサ104に実行させる処理により実現される。話者IDは、話者ベクトルとは異なる形式で話者を識別するデータである。 FIG. 7 is a diagram showing the configuration of the large-scale TTS model learning phase in the third embodiment. In FIG. 7, the same parts as in FIG. 2 are denoted by the same reference numerals, and the description thereof will be omitted as appropriate. In FIG. 7, the speech synthesizer 10 includes a speaker identity remover 119, which is a module for removing the speaker identity from the intermediate representation hO by the acoustic feature encoder 113, a loss calculator 120 for s, and a loss weighter 121. and Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute. A speaker ID is data that identifies a speaker in a form different from the speaker vector.
[Overall flow]
 The flow of the large-scale TTS model learning phase in the third embodiment will be described with reference to FIG. 7. As in FIG. 2, the TTS model λ receives the speaker vector S, the text L, and the acoustic feature O, and outputs the predicted acoustic feature O^. The loss calculation unit 115 for O receives the acoustic feature O and the predicted acoustic feature O^ and outputs the loss Lo. As described with reference to FIG. 2, the predicted acoustic feature O^ and the loss Lo are output for each of the first phase relating to the text L and the second phase relating to the acoustic feature O.
 In the third embodiment, the speaker identity removal unit 119 receives the intermediate representation hO output by the acoustic feature encoder 113 and calculates and outputs an intermediate representation h'O from which speaker characteristics have been removed, that is, an intermediate representation in which the features of the speaker's voice have been stripped from hO. For the speaker identity removal unit 119, the speaker-adversarial learner proposed in Patent Document 3 or the like can be used.
 The loss calculation unit 120 for s receives the speaker-stripped intermediate representation h'O and the true speaker ID s, and calculates and outputs a loss Ls. The loss Ls is a measure that takes a larger value the less likely h'O is to correspond to speaker s; for example, cross-entropy or another index for solving classification problems can be used for Ls.
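 One common way to realize such a speaker-adversarial arrangement, sketched below purely as an assumption and not as the exact learner of the cited Patent Document 3, is a gradient-reversal layer followed by a speaker classifier trained with cross-entropy against the true speaker ID s; the module and variable names are hypothetical.

```python
# Hedged sketch of a speaker-adversarial speaker-identity removal module:
# a gradient-reversal layer plus a speaker classifier. Minimizing L_s trains
# the classifier normally, while the reversed gradients push the upstream
# acoustic feature encoder to discard speaker cues. Shapes and names are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale: float = 1.0):
        ctx.scale = scale
        return x.clone()                       # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.scale * grad_out, None     # flip the gradient sign going upstream

class SpeakerIdentityRemover(nn.Module):
    def __init__(self, dim: int, n_speakers: int):
        super().__init__()
        self.classifier = nn.Linear(dim, n_speakers)   # assumed single linear layer

    def forward(self, h_O: torch.Tensor, speaker_id: torch.Tensor):
        h_rev = GradReverse.apply(h_O)                 # stands in for h'_O
        logits = self.classifier(h_rev.mean(dim=1))    # pool over frames (assumption)
        L_s = F.cross_entropy(logits, speaker_id)      # large when h'_O does not look like speaker s
        return h_rev, L_s
```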
 The loss weighting unit 121 receives, for each of the first phase and the second phase, the loss Lo output by the loss calculation unit 115 for O and the loss Ls output by the loss calculation unit 120 for s, and calculates and outputs a weighted loss (a weighted sum of Lo and Ls). The weighting coefficients may be fixed or may themselves be learned. The TTS model learning unit 116 learns the TTS model λ by updating its model parameters so as to minimize the weighted loss.
 The processing procedure from the unsupervised adaptation phase onward may be the same as in the first embodiment.
 In the configuration of the first embodiment, the output of the text encoder 112 contains no speaker characteristics, whereas the output of the acoustic feature encoder 113 does. This mismatch causes degradation of TTS performance. According to the third embodiment, speaker characteristics can be reduced in the intermediate representation hO produced by the acoustic feature encoder 113.
 The third embodiment may also be used together with the second embodiment. In that case, the loss weighting unit 121 receives the losses Lo, Lh, and Ls and outputs a weighted loss, as sketched below. Combining the two embodiments enables more stable unsupervised adaptation by constraining hO toward hL while also removing speaker characteristics from hO.
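 A very small sketch of the combined weighting, with coefficient values that are illustrative assumptions only, is:

```python
def combined_weighted_loss(L_o, L_h, L_s, w_o=1.0, w_h=0.1, w_s=0.1):
    # weighted sum of the reconstruction loss Lo, the representation-matching
    # loss Lh, and the speaker loss Ls; the coefficients are assumed values
    return w_o * L_o + w_h * L_h + w_s * L_s
```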
[Fourth embodiment]
 Next, a fourth embodiment will be described, focusing on the points that differ from the first embodiment. Points not specifically mentioned in the fourth embodiment may be the same as in the first embodiment.
 FIG. 8 is a diagram showing the configuration of the large-scale TTS model learning phase in the fourth embodiment. In FIG. 8, parts identical to those in FIG. 2 are given the same reference numerals, and their description is omitted as appropriate. In FIG. 8, the speech synthesizer 10 has a text encoder 112-n (n = 1, ..., N) for each language. Each text encoder 112-n is selectively used according to the input text Ln, where Ln denotes the text corresponding to the language n of the utterance.
[Overall flow]
 The flow of the large-scale TTS model learning phase in the fourth embodiment will be described with reference to FIG. 8.
 In the large-scale TTS model learning phase, multiple sets of training data are prepared, each set consisting of a speaker vector S, a text Ln, and an acoustic feature O. The acoustic feature O is the acoustic feature of speech in which the text Ln is uttered in language n. The language n of each training data set is one of 1 to N, and training data are prepared for every language from 1 to N. The content expressed by the text Ln may differ between training data sets.
 The processing of the speaker vector encoder 111 and the acoustic feature encoder 113 is the same as in FIG. 2.
 Meanwhile, the text encoder 112-n corresponding to the text Ln of the input training data calculates and outputs an intermediate representation hLn.
 The decoder 114 receives the intermediate representation hLn and the speaker vector S in the first phase, and the intermediate representation hO and the speaker vector S in the second phase, and outputs a predicted acoustic feature O^ in each phase. Thereafter, the TTS model λ is updated and learned in the same procedure as in the first embodiment. The flow when the acoustic feature O is input to the TTS model λ is also the same as in the first embodiment.
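 A minimal sketch of switching between per-language text encoders 112-n while sharing the decoder 114, under the assumption of a PyTorch ModuleDict keyed by language and hypothetical class and argument names, could look like this:

```python
# Hedged sketch of the fourth embodiment's per-language text encoders with a
# shared decoder. Only the idea of selecting the text encoder by language n
# comes from the text; everything else is an assumption.
import torch
import torch.nn as nn

class MultilingualTTS(nn.Module):
    def __init__(self, text_encoders: dict, decoder: nn.Module):
        super().__init__()
        # one text encoder 112-n per language, e.g. {"ja": ..., "en": ...}
        self.text_encoders = nn.ModuleDict(text_encoders)
        self.decoder = decoder                       # shared decoder 114

    def forward(self, text_n, language: str, speaker_vec, h_O=None, use_text=True):
        # first phase: encode the text with the encoder of its language n;
        # second phase: bypass the text and feed the acoustic representation h_O
        h = self.text_encoders[language](text_n) if use_text else h_O
        return self.decoder(h, speaker_vec)          # predicted acoustic features O^
```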
 The unsupervised adaptation phase does not depend on text, and the adapted TTS model λ' is learned in the same manner as in FIG. 3 of the first embodiment.
 FIG. 9 is a diagram showing the configuration of the inference phase for speech synthesis in the fourth embodiment. Using the TTS model λ' obtained by unsupervised adaptation, the predicted acoustic feature O^' is predicted in the same procedure as in FIG. 8.
 In the configuration of the first embodiment there is only one text encoder 112, so speech can be synthesized in only one language. According to the fourth embodiment, a text encoder 112 is prepared for each language, so multilingual speech synthesis is possible with a single TTS model. As in the first embodiment, unsupervised adaptation using acoustic features is also possible, and the fourth embodiment can further be combined with the second and third embodiments.
[Effects of each embodiment]
 With the configurations of the above embodiments, the TTS model can be adapted by fine-tuning from acoustic features alone, without text from the target speaker. Not requiring the target speaker's text reduces the cost of annotating the speech, which is advantageous in terms of both the time and the expense required to build a TTS model.
 In each of the above embodiments, the speech synthesizer 10 is also an example of a speech synthesis learning device. The TTS model λ is an example of the first model, and the adapted TTS model λ' is an example of the second model. The predicted acoustic feature O^ is an example of the first acoustic feature, and the predicted acoustic feature O^' is an example of the second acoustic feature. The acoustic feature encoder 113 is an example of the first encoder, and the text encoder 112 is an example of the second encoder.
 Although embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.
10 Speech synthesizer
100 Drive device
101 Recording medium
102 Auxiliary storage device
103 Memory device
104 Processor
105 Interface device
111 Speaker vector encoder
112 Text encoder
113 Acoustic feature encoder
114 Decoder
115 Loss calculation unit for O
116 TTS model learning unit
117 Loss calculation unit for h
118 Loss weighting unit
119 Speaker identity removal unit
120 Loss calculation unit for s
121 Loss weighting unit
B Bus

Claims (8)

  1.  A speech synthesis learning method characterized in that a computer executes:
     a first learning procedure of learning a second model by updating a first model based on a loss between a first acoustic feature and a first predicted acoustic feature output by the first model, which receives a speaker vector indicating a speaker, a text, and the first acoustic feature relating to speech in which the speaker uttered the text; and
     a second learning procedure of updating the second model based on a loss between a second acoustic feature and a second predicted acoustic feature output by the second model, which receives a speaker vector of a target speaker and the second acoustic feature relating to speech uttered by the target speaker.
  2.  The speech synthesis learning method according to claim 1, wherein
     the first model includes:
     a first encoder that receives an acoustic feature and outputs an intermediate representation of the acoustic feature; and
     a second encoder that receives a text and outputs an intermediate representation of the text, and
     the first learning procedure updates the first model based on the loss between the first acoustic feature and the predicted acoustic feature output by the first model, which receives the speaker vector indicating the speaker, the text, and the first acoustic feature relating to speech in which the speaker uttered the text, and on a loss between the intermediate representation output by the first encoder receiving the first acoustic feature and the intermediate representation output by the second encoder receiving the text.
  3.  The speech synthesis learning method according to claim 1 or 2, wherein
     the first model includes:
     a first encoder that receives an acoustic feature and outputs an intermediate representation of the acoustic feature; and
     a speaker identity removal unit that receives the intermediate representation and outputs an intermediate representation from which speaker characteristics have been removed, and
     the first learning procedure updates the first model based on the loss between the first acoustic feature and the predicted acoustic feature output by the first model, which receives the speaker vector indicating the speaker, the text, and the first acoustic feature relating to speech in which the speaker uttered the text, and on a loss between a true speaker ID and the intermediate representation output by the speaker identity removal unit based on the intermediate representation output by the first encoder receiving the first acoustic feature.
  4.  The speech synthesis learning method according to any one of claims 1 to 3, wherein
     the first model includes, for each language, a second encoder that receives a text in that language and outputs an intermediate representation of the text, and
     the first learning procedure learns the second model by updating the first model based on the loss between the first acoustic feature and the first predicted acoustic feature output, using the second encoder corresponding to the language of the text, by the first model, which receives the speaker vector indicating the speaker, the text, and the first acoustic feature relating to speech in which the speaker uttered the text.
  5.  A speech synthesis method characterized in that a computer executes an estimation procedure of inputting a speaker vector and a text to the second model trained by the speech synthesis learning method according to any one of claims 1 to 4, and estimating an acoustic feature corresponding to the speaker vector and the text.
  6.  A speech synthesis learning device comprising:
     a first learning unit configured to learn a second model by updating a first model based on a loss between a first acoustic feature and a first predicted acoustic feature output by the first model, which receives a speaker vector indicating a speaker, a text, and the first acoustic feature relating to speech in which the speaker uttered the text; and
     a second learning unit configured to update the second model based on a loss between a second acoustic feature and a second predicted acoustic feature output by the second model, which receives a speaker vector of a target speaker and the second acoustic feature relating to speech uttered by the target speaker.
  7.  A speech synthesizer comprising an estimation unit configured to input a speaker vector and a text to the second model trained by the speech synthesis learning method according to any one of claims 1 to 4, and to estimate an acoustic feature corresponding to the speaker vector and the text.
  8.  A program that causes a computer to execute the speech synthesis learning method according to any one of claims 1 to 4 or the speech synthesis method according to claim 5.
PCT/JP2022/005903 2022-02-15 2022-02-15 Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program WO2023157066A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/005903 WO2023157066A1 (en) 2022-02-15 2022-02-15 Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program

Publications (1)

Publication Number Publication Date
WO2023157066A1 true WO2023157066A1 (en) 2023-08-24

Family

ID=87577741


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017032839A (en) * 2015-08-04 2017-02-09 日本電信電話株式会社 Acoustic model learning device, voice synthesis device, acoustic model learning method, voice synthesis method, and program
JP2017058513A (en) * 2015-09-16 2017-03-23 株式会社東芝 Learning device, speech synthesis device, learning method, speech synthesis method, learning program, and speech synthesis program
WO2019044401A1 (en) * 2017-08-29 2019-03-07 大学共同利用機関法人情報・システム研究機構 Computer system creating speaker adaptation without teacher in dnn-based speech synthesis, and method and program executed in computer system
JP2020034883A (en) * 2018-08-27 2020-03-05 日本放送協会 Voice synthesizer and program
JP2020060633A (en) * 2018-10-05 2020-04-16 日本電信電話株式会社 Acoustic model learning device, voice synthesizer and program
JP2020160319A (en) * 2019-03-27 2020-10-01 Kddi株式会社 Voice synthesizing device, method and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22926970

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE