WO2023157066A1 - Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program - Google Patents

Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program

Info

Publication number
WO2023157066A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
acoustic feature
text
model
speech
Prior art date
Application number
PCT/JP2022/005903
Other languages
English (en)
Japanese (ja)
Inventor
裕紀 金川
勇祐 井島
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2022/005903
Publication of WO2023157066A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Definitions

  • the present invention relates to a speech synthesis learning method, a speech synthesis method, a speech synthesis learning device, a speech synthesis device, and a program.
  • in text-to-speech synthesis (TTS) using a deep neural network (DNN), the quality of synthesized speech has improved dramatically (Non-Patent Document 1).
  • statistical modeling by a DNN learns the correspondence between input and output purely from data, so a large amount of training data is required to train a TTS model that synthesizes high-quality speech. If a TTS model is built using only a target speaker for whom little data is available, the model overfits the training data, and the desired utterance content and quality may not be obtained when unknown text is input.
  • one adaptation approach for such cases uses fMLLR (feature-space maximum likelihood linear regression).
  • as an example of unsupervised adaptation, which is another adaptation approach, there is a method of manipulating the speaker information that is supplied together with the text, according to the target speaker (Non-Patent Document 4).
  • in this method, a TTS model is trained in advance on a large number of speakers using one-hot speaker representation vectors. Separately, a model that identifies the training speakers from input speech is prepared, and the speech of the target speaker is input to this model to obtain a vector (speaker posterior probability) indicating how closely the target speaker resembles each of the many training speakers.
  • in semi-supervised learning as represented by Non-Patent Document 2, adapting the speech synthesizer requires generating pseudo-labels, so a separate speech recognition model must be prepared. The learning cost is therefore very high, and the accuracy of the pseudo-labels depends on the speech recognition model.
  • when unsupervised adaptation is performed with the approach of Non-Patent Document 4, not only is a separate speaker recognition model required, but the speech domain handled by the speaker recognition model must also match that of the TTS model in order to predict the equivalent of the TTS one-hot vector. Furthermore, if the acoustic characteristics of the target speaker differ significantly from those of the many speakers that constitute the training data of the TTS model, the quality of the synthesized speech degrades significantly. Since there are no pseudo-labels, the mismatch between the target speaker and the model cannot be reduced by fine-tuning.
  • the present invention has been made in view of the above points, and aims to enable adaptation of a TTS model by fine-tuning from acoustic features of the target speaker's speech even when there is no text corresponding to the target speaker's speech.
  • to this end, a computer executes a first learning procedure for learning a second model by updating a first model, which receives a speaker vector indicating a speaker, a text, and a first acoustic feature of the speech in which the speaker utters the text, based on the loss between a first predicted acoustic feature output by the first model and the first acoustic feature.
  • FIG. 2 is a diagram showing the configuration of the large-scale TTS model learning phase in the first embodiment.
  • FIG. 3 is a diagram showing the configuration of the unsupervised adaptation phase in the first embodiment.
  • FIG. 4 is a diagram showing the configuration of the inference phase in the case of speech synthesis in the first embodiment.
  • FIG. 5 is a diagram showing the configuration of the inference phase in the case of voice quality conversion in the first embodiment.
  • FIG. 6 is a diagram showing the configuration of the large-scale TTS model learning phase in the second embodiment.
  • FIG. 7 is a diagram showing the configuration of the large-scale TTS model learning phase in the third embodiment.
  • FIG. 8 is a diagram showing the configuration of the large-scale TTS model learning phase in the fourth embodiment.
  • FIG. 9 is a diagram showing the configuration of the inference phase in the case of speech synthesis in the fourth embodiment.
  • this embodiment utilizes not only text but also acoustic features obtained from speech as input for the TTS model.
  • the DNN modules that convert the text into an intermediate representation and the intermediate representation into acoustic features are called a text encoder 112 and a decoder 114, respectively.
  • an acoustic feature encoder 113 that converts the acoustic feature into an intermediate representation is newly prepared so that the acoustic feature, which is the output of the TTS model, can be reconstructed from the input text and the acoustic feature.
  • the intermediate representation originally obtained through the text encoder 112 can also be obtained from the acoustic features.
  • FIG. 1 is a diagram showing a hardware configuration example of a speech synthesizer 10 according to an embodiment of the present invention.
  • the speech synthesizer 10 of FIG. 1 has a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, an interface device 105, and the like, which are interconnected by a bus B.
  • a program that implements the processing in the speech synthesizer 10 is provided by a recording medium 101 such as a CD-ROM.
  • the program is installed from the recording medium 101 into the auxiliary storage device 102 via the drive device 100.
  • the program does not necessarily need to be installed from the recording medium 101, and may be downloaded from another computer via the network.
  • the auxiliary storage device 102 stores installed programs, as well as necessary files and data.
  • the memory device 103 reads and stores the program from the auxiliary storage device 102 when a program activation instruction is received.
  • the processor 104 is a CPU, a GPU (Graphics Processing Unit), or a combination of a CPU and a GPU, and executes the functions of the speech synthesizer 10 according to programs stored in the memory device 103.
  • the interface device 105 is used as an interface for connecting to a network.
  • [First embodiment] FIGS. 2, 3, 4, and 5 show configuration examples of the large-scale TTS model learning phase, the unsupervised adaptation phase, the inference phase in the case of speech synthesis, and the inference phase in the case of voice quality conversion in the first embodiment, respectively.
  • the speech synthesizer 10 includes a TTS model θ, a loss calculation unit 115 for the acoustic feature O, and a TTS model learning unit 116.
  • the TTS model θ includes a speaker vector encoder 111, a text encoder 112, an acoustic feature encoder 113, and a decoder 114. Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute.
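  • as an illustration only, the following minimal PyTorch sketch shows one way the four modules above might be wired together; the class name TTSModel, the layer types, and all dimensions are assumptions made for this sketch rather than details taken from the description, and duration or attention modelling is omitted (text inputs are assumed to be frame-aligned phoneme IDs).

```python
import torch
import torch.nn as nn

class TTSModel(nn.Module):
    """Illustrative sketch: speaker vector encoder (111), text encoder (112),
    acoustic feature encoder (113) and decoder (114)."""
    def __init__(self, n_phonemes=80, spk_dim=512, mel_dim=80, hidden_dim=256):
        super().__init__()
        self.speaker_encoder = nn.Linear(spk_dim, hidden_dim)                    # 111
        self.text_embedding = nn.Embedding(n_phonemes, hidden_dim)
        self.text_encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)     # 112
        self.acoustic_encoder = nn.GRU(mel_dim, hidden_dim, batch_first=True)    # 113
        self.decoder_rnn = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)  # 114
        self.decoder_out = nn.Linear(hidden_dim, mel_dim)

    def encode_text(self, text_ids):
        # text_ids: (batch, frames) frame-aligned phoneme IDs -> hL: (batch, frames, hidden)
        h_L, _ = self.text_encoder(self.text_embedding(text_ids))
        return h_L

    def encode_acoustic(self, mel):
        # mel: (batch, frames, mel_dim) -> hO: (batch, frames, hidden)
        h_O, _ = self.acoustic_encoder(mel)
        return h_O

    def decode(self, spk_vec, h):
        # spk_vec: (batch, spk_dim); h is either hL (first phase) or hO (second phase)
        s = self.speaker_encoder(spk_vec).unsqueeze(1).expand(-1, h.size(1), -1)
        d, _ = self.decoder_rnn(torch.cat([h, s], dim=-1))
        return self.decoder_out(d)  # predicted acoustic feature O^
```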
  • a plurality of sets of learning data are prepared, each consisting of a speaker vector S, a text L, and an acoustic feature O.
  • the speaker vector S is a continuous expression such as i-vector or x-vector indicating the speaker who uttered the speech, and is obtained by inputting the speech into the speaker vector extractor.
  • the text L is information indicating the content of the voice (content of the utterance).
  • a raw text, a sequence of phonemes and accents, or a linguistic feature vectorized from them can be used.
  • the acoustic feature quantity O is the acoustic feature quantity of the speech.
  • Acoustic features include mel-spectrogram, mel-cepstrum, fundamental frequency, etc., which are information necessary for reconstructing speech waveforms.
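  • for example, a mel-spectrogram can be computed from a speech waveform with torchaudio as sketched below; the file name and the parameter values (FFT size, hop length, number of mel bins) are arbitrary illustrative choices, not values specified in this description.

```python
import torchaudio

waveform, sr = torchaudio.load("target_speaker.wav")  # hypothetical input file
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80)
mel = to_mel(waveform)              # shape: (channels, n_mels, frames)
O = mel.squeeze(0).transpose(0, 1)  # acoustic feature O as (frames, n_mels)
```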
  • X^ (X is an arbitrary symbol) in the text indicates the symbol with "^" added above X in the drawings.
  • the speaker of each training data may be different, and the text L may also be different. Any training data speaker may be the target speaker in the unsupervised adaptation phase described below.
  • the speaker vector encoder 111 receives the speaker vector S, and calculates and outputs an intermediate representation of the speaker vector S.
  • the text encoder 112 receives the text L, computes an intermediate representation hL of the text L, and outputs it.
  • the acoustic feature quantity encoder 113 receives the acoustic feature quantity O, and calculates and outputs an intermediate representation hO of the acoustic feature quantity O.
  • the decoder 114 receives the intermediate representation of the speaker vector S, the intermediate representation hL, and the intermediate representation hO. However, the intermediate representation hL and the intermediate representation hO are input to the decoder 114 at different timings. That is, for one piece of training data, the decoder 114 executes two phases: one in which the intermediate representation of the speaker vector S and the intermediate representation hL are input (hereinafter referred to as the "first phase"), and one in which the intermediate representation of the speaker vector S and the intermediate representation hO are input (hereinafter referred to as the "second phase"). In the unsupervised adaptation described later there is no text of the unknown speaker, which is why the TTS model θ is constructed in this way.
  • in the first phase, the decoder 114 receives the intermediate representation of the speaker vector S and the intermediate representation hL, and calculates and outputs the predicted acoustic feature O^.
  • the loss calculation unit 115 for O receives the predicted acoustic feature O^ and the acoustic feature O, and calculates and outputs the loss Lo, which is the error between the acoustic feature O and the predicted acoustic feature O^.
  • as the loss Lo, an index indicating the error between vectors of the same dimension, such as the mean squared error or the mean absolute error, can be used.
  • the TTS model learning unit 116 receives the TTS model θ and the loss Lo, and learns the TTS model θ¯ by updating the model parameters of the TTS model θ based on the loss Lo.
  • X¯ (X is an arbitrary symbol) in the text indicates the symbol with "-" added above X in the drawings.
  • the TTS model learning unit 116 updates the TTS model θ so as to minimize the loss Lo.
  • model parameters that reduce the loss Lo can be obtained by executing error backpropagation using the gradient information from when the predicted acoustic feature O^ was generated.
  • in the second phase, the decoder 114 receives the intermediate representation of the speaker vector S and the intermediate representation hO, and calculates and outputs the predicted acoustic feature O^.
  • in the second phase as well, the TTS model learning unit 116 learns the TTS model θ¯ by updating the TTS model θ in the same way as in the first phase.
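  • one training step of this learning phase might then look as follows; this sketch reuses the illustrative TTSModel class shown earlier, and the optimizer choice and the mean squared error loss are assumptions consistent with the indices mentioned for the loss Lo.

```python
import torch
import torch.nn.functional as F

model = TTSModel()  # illustrative class from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def large_scale_training_step(spk_vec, text_ids, mel):
    # first phase: decode from the text intermediate representation hL
    o_hat_text = model.decode(spk_vec, model.encode_text(text_ids))
    loss_first = F.mse_loss(o_hat_text, mel)   # loss Lo (first phase)

    # second phase: decode from the acoustic intermediate representation hO
    o_hat_ac = model.decode(spk_vec, model.encode_acoustic(mel))
    loss_second = F.mse_loss(o_hat_ac, mel)    # loss Lo (second phase)

    optimizer.zero_grad()
    (loss_first + loss_second).backward()      # error backpropagation
    optimizer.step()
```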
  • in the unsupervised adaptation phase, the TTS model θ¯ obtained in the large-scale TTS model learning phase of FIG. 2 is used.
  • a plurality of sets of learning data are prepared, each consisting of an acoustic feature O′ of one target speaker and the speaker vector S′ of that target speaker. The speaker is therefore common to all of the learning data, but the speech indicated by the acoustic feature O′ of each piece of learning data is different.
  • the TTS model θ¯ receives the acoustic feature O′ of the target speaker and the speaker vector S′ of the target speaker, and calculates and outputs the predicted acoustic feature O^′.
  • the TTS model learning unit 116 updates the TTS model θ¯ so as to minimize the loss Lo, which is the error between the predicted acoustic feature O^′ and the acoustic feature O′.
  • in this way, the adapted TTS model θ¯′ is learned. That is, with the configuration of FIG. 2, the TTS model θ¯′ can be learned by substituting the acoustic feature O′ of the unknown speaker even when no text of the unknown speaker is input to the TTS model θ¯. This enables adaptation (fine-tuning) using the acoustic feature O′.
  • in the unsupervised adaptation phase, both the input and the output of the TTS model θ¯′ are acoustic features, so the TTS model θ¯′ becomes equivalent to an autoencoder. Since the adaptation data contain no text, the acoustic feature encoder 113 may overfit, and the intermediate representation hO may no longer predict the information corresponding to the intermediate representation hL of FIG. 2. Therefore, by freezing the acoustic feature encoder 113 (fixing the model parameters of the acoustic feature encoder 113) and updating only the decoder 114, the model can be adapted to the target speaker while avoiding the risk that the text content, which the TTS model must preserve, collapses.
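  • continuing the earlier sketch (with model, torch, and F as defined there), the freezing described above might be realized as follows; the learning rate and optimizer are again assumptions, while the choice of which modules are frozen and which are updated follows the description.

```python
# freeze the acoustic feature encoder 113 and adapt only the decoder 114
for p in model.acoustic_encoder.parameters():
    p.requires_grad = False

decoder_params = list(model.decoder_rnn.parameters()) + list(model.decoder_out.parameters())
adapt_optimizer = torch.optim.Adam(decoder_params, lr=1e-5)

def adaptation_step(spk_vec_target, mel_target):
    # no text is available: reconstruct the target speaker's features from themselves
    o_hat = model.decode(spk_vec_target, model.encode_acoustic(mel_target))
    loss = F.mse_loss(o_hat, mel_target)  # loss Lo in the adaptation phase
    adapt_optimizer.zero_grad()
    loss.backward()
    adapt_optimizer.step()
```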
  • in the inference phase for speech synthesis, the adapted TTS model θ¯′ receives an arbitrary text L′ to be synthesized and the speaker vector S′ of the target speaker, and calculates (estimates) and outputs the predicted acoustic feature O^′. Since the TTS model θ¯′ has been adapted in the phase described with reference to FIG. 3, speech can be synthesized without significant deterioration in quality even for a target speaker not included in the training data of FIG. 2.
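  • inference with the adapted model might then look as sketched below, continuing the earlier example; text_ids_new and spk_vec_target are placeholders for the text L′ to be synthesized and the target speaker's vector S′, and a separate vocoder (outside the scope of this sketch) would turn the predicted features into a waveform.

```python
model.eval()
with torch.no_grad():
    h_L = model.encode_text(text_ids_new)      # intermediate representation of the text L'
    o_hat = model.decode(spk_vec_target, h_L)  # predicted acoustic feature O^'
# o_hat is then passed to a vocoder to reconstruct the speech waveform
```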
  • by inputting an acoustic feature in place of the text and replacing the speaker vector, the model can also be used for voice quality conversion.
  • in that case, the predicted acoustic feature O^″ of the speaker corresponding to the input speaker vector S″ is output.
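  • continuing the earlier sketch, voice quality conversion only changes which encoder path and which speaker vector are used; mel_input and spk_vec_other are placeholders for the acoustic feature of the input speech and the speaker vector S″.

```python
with torch.no_grad():
    h_O = model.encode_acoustic(mel_input)    # acoustic feature instead of text
    o_hat = model.decode(spk_vec_other, h_O)  # predicted acoustic feature O^'' in the voice of S''
```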
  • unsupervised adaptation by fine-tuning of the TTS model is possible only from the acoustic features without using the target speaker's text.
  • it is possible to eliminate the need to annotate the speech of the target speaker, thereby reducing both the time and cost required to construct the TTS model.
  • the second embodiment is described focusing on the points that differ from the first embodiment. Points not specifically mentioned in the second embodiment may be the same as in the first embodiment.
  • FIG. 6 is a diagram showing the configuration of the large-scale TTS model learning phase in the second embodiment.
  • the same parts as those in FIG. 2 are denoted by the same reference numerals, and the description thereof will be omitted as appropriate.
  • in the second embodiment, the speech synthesizer 10 further has a loss calculation unit 117 for the intermediate representation h and a loss weighting unit 118. Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute.
  • the TTS model θ receives the speaker vector S, the text L, and the acoustic feature O, and outputs the predicted acoustic feature O^.
  • the loss calculation unit 115 for O receives the acoustic feature O and the predicted acoustic feature O^, and outputs the loss Lo. Note that, as described with reference to FIG. 2, the predicted acoustic feature O^ and the loss Lo are output for each of the first phase related to the text L and the second phase related to the acoustic feature O.
  • the loss calculation unit 117 for h further receives the intermediate representation hL output by the text encoder 112 and the intermediate representation hO output by the acoustic feature encoder 113, and calculates and outputs the loss Lh between hL and hO.
  • as the index for the loss Lh, not only the mean squared error or the mean absolute error but also the cosine distance or the like can be used, so as to constrain the error between hL and hO to be small.
  • the loss weighting unit 118 receives, for each of the first phase and the second phase, the loss Lo output by the loss calculation unit 115 for O and the loss Lh output by the loss calculation unit 117 for h, and calculates and outputs a weighted loss (the weighted sum of Lo and Lh).
  • the weighting coefficient may be fixed, or may be a learning target.
  • the TTS model learning unit 116 learns the TTS model θ¯ by updating the model parameters of the TTS model θ so as to minimize the weighted loss for each of the first and second phases. By doing so, in preparation for unsupervised adaptation, the possibility that the output of the text encoder 112 can also be predicted from the acoustic feature encoder 113 is increased.
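  • a sketch of this weighted loss, continuing the earlier illustrative example; the weight value and the use of the mean squared error for Lh are arbitrary choices (the description also allows the mean absolute error or the cosine distance for Lh).

```python
def weighted_loss_with_lh(model, spk_vec, text_ids, mel, w_h=0.1):
    h_L = model.encode_text(text_ids)
    h_O = model.encode_acoustic(mel)
    loss_o = (F.mse_loss(model.decode(spk_vec, h_L), mel)     # Lo, first phase
              + F.mse_loss(model.decode(spk_vec, h_O), mel))  # Lo, second phase
    loss_h = F.mse_loss(h_O, h_L)                             # Lh: constrain hO toward hL
    return loss_o + w_h * loss_h                              # weighted sum used for the update
```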
  • the processing procedure after the unsupervised adaptation phase may be the same as in the first embodiment.
  • the intermediate representation hO from the acoustic feature encoder 113 is not necessarily similar to the intermediate representation hL from the text encoder 112.
  • this problem can be reduced by constraining hO to be a vector similar to hL in the course of learning.
  • FIG. 7 is a diagram showing the configuration of the large-scale TTS model learning phase in the third embodiment.
  • the speech synthesizer 10 includes a speaker characteristics removal unit 119, which is a module for removing speaker characteristics from the intermediate representation hO output by the acoustic feature encoder 113, a loss calculation unit 120 for s, and a loss weighting unit 121. Each of these units is implemented by processing that one or more programs installed in the speech synthesizer 10 cause the processor 104 to execute.
  • a speaker ID is data that identifies a speaker in a form different from the speaker vector.
  • the TTS model θ receives the speaker vector S, the text L, and the acoustic feature O, and outputs the predicted acoustic feature O^.
  • the loss calculation unit 115 for O receives the acoustic feature O and the predicted acoustic feature O^, and outputs the loss Lo. Note that, as described with reference to FIG. 2, the predicted acoustic feature O^ and the loss Lo are output for each of the first phase related to the text L and the second phase related to the acoustic feature O.
  • the speaker characteristics removal unit 119 receives the intermediate representation hO output by the acoustic feature encoder 113, and calculates and outputs the intermediate representation h′O from which the speaker characteristics are removed.
  • the intermediate representation h'O from which the speaker's characteristic is removed is an intermediate representation obtained by removing the voice features of the speaker from the intermediate representation hO .
  • the speaker adversarial learning device or the like proposed in Patent Document 3 can be used as the speaker characteristics removal unit 119.
  • the loss calculation unit 120 for s receives the intermediate representation h′O with speaker characteristics removed and the true speaker ID s, and calculates and outputs the loss Ls.
  • the loss Ls is an index that takes a larger value as the probability that h′O corresponds to the speaker s becomes lower.
  • the loss Ls can be an index for solving a discrimination problem, such as cross-entropy.
  • the loss weighting unit 121 receives, for each of the first phase and the second phase, the loss Lo output by the loss calculation unit 115 for O and the loss Ls output by the loss calculation unit 120 for s, and calculates and outputs a weighted loss (the weighted sum of Lo and Ls). Note that the weighting coefficient may be fixed, or may be a learning target.
  • the TTS model learning unit 116 learns the TTS model θ¯ by updating the model parameters of the TTS model θ so as to minimize the weighted loss.
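  • one common way to realize such a speaker characteristics removal module is adversarial training with a gradient reversal layer followed by a speaker classifier trained with cross-entropy; the sketch below follows that generic idea under the same illustrative assumptions as before, and is not taken from Patent Document 3, whose concrete architecture may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class SpeakerRemover(nn.Module):
    """Illustrative speaker characteristics removal (119) with the classifier used for the loss Ls (120)."""
    def __init__(self, hidden_dim=256, n_speakers=100, lam=1.0):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, n_speakers)
        self.lam = lam

    def forward(self, h_O, speaker_id):
        h_rev = GradReverse.apply(h_O, self.lam)     # gradients flowing back into the encoder are reversed
        logits = self.classifier(h_rev.mean(dim=1))  # utterance-level speaker prediction
        return F.cross_entropy(logits, speaker_id)   # loss Ls

# minimizing Lo plus a weighted Ls trains the classifier to identify the speaker while the
# reversed gradient pushes the acoustic feature encoder 113 to remove speaker characteristics from hO
```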
  • the processing procedure after the unsupervised adaptation phase may be the same as in the first embodiment.
  • the output of the text encoder 112 does not include speaker characteristics, whereas the output of the acoustic feature encoder 113 does. A mismatch between the two causes deterioration of TTS performance. According to the third embodiment, it is possible to reduce the speaker characteristics in the intermediate representation hO output by the acoustic feature encoder 113.
  • the third embodiment may be used together with the second embodiment.
  • in that case, the loss weighting unit 121 may receive the loss Lo, the loss Ls, and the loss Lh, and output a weighted loss.
  • FIG. 8 is a diagram showing the configuration of the large-scale TTS model learning phase in the fourth embodiment.
  • the text Ln indicates the text according to the language n of the utterance.
  • a plurality of sets of learning data each including a speaker vector S, a text Ln, and an acoustic feature O are prepared.
  • Acoustic feature O is an acoustic feature of speech in which text Ln is uttered in language n.
  • Language n of each of the plurality of learning data is one of 1 to N, and learning data for each language of 1 to N are prepared. The meaning of the text Ln of each learning data may be different.
  • the processing of the speaker vector encoder 111 and the acoustic feature encoder 113 is the same as in FIG. 2.
  • the text encoder 112-n corresponding to the language of the input training data text Ln calculates and outputs the intermediate representation hLn.
  • the decoder 114 receives the intermediate representation hLn and the speaker vector S in the first phase, receives the intermediate representation hO and the speaker vector S in the second phase, and outputs the predicted acoustic feature O^. Thereafter, the TTS model θ is updated and the TTS model θ¯ is learned in the same procedure as in the first embodiment.
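  • one simple way to hold one text encoder per language in a single model is an nn.ModuleDict keyed by a language tag, as sketched below under the same illustrative assumptions; the language tags, dimensions, and class name are placeholders.

```python
import torch.nn as nn

class MultilingualTextEncoders(nn.Module):
    """Illustrative container for one text encoder 112-n per language n."""
    def __init__(self, languages=("ja", "en"), n_phonemes=80, hidden_dim=256):
        super().__init__()
        self.embeddings = nn.ModuleDict(
            {lang: nn.Embedding(n_phonemes, hidden_dim) for lang in languages})
        self.encoders = nn.ModuleDict(
            {lang: nn.GRU(hidden_dim, hidden_dim, batch_first=True) for lang in languages})

    def forward(self, text_ids, lang):
        # route the text Ln through the encoder of its language n to obtain hLn
        h_Ln, _ = self.encoders[lang](self.embeddings[lang](text_ids))
        return h_Ln
```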
  • the flow for inputting the acoustic feature O to the TTS model θ is the same as in the first embodiment.
  • the unsupervised adaptation phase does not depend on text, and the adapted TTS model θ¯′ is learned as in FIG. 3 of the first embodiment.
  • FIG. 9 is a diagram showing the configuration of the inference phase for speech synthesis in the fourth embodiment.
  • the predicted acoustic feature O^′ is predicted in the same procedure as in FIG. 4 of the first embodiment.
  • the speech synthesizer 10 is also an example of a speech synthesis learning device.
  • the TTS model θ is an example of the first model.
  • the TTS model θ¯ is an example of the second model.
  • the predicted acoustic feature O^ is an example of the first predicted acoustic feature.
  • the predicted acoustic feature O^′ is an example of the second predicted acoustic feature.
  • the acoustic feature quantity encoder 113 is an example of a first encoder.
  • Text encoder 112 is an example of a second encoder.
  • 10 Speech synthesizer, 100 Drive device, 101 Recording medium, 102 Auxiliary storage device, 103 Memory device, 104 Processor, 105 Interface device, 111 Speaker vector encoder, 112 Text encoder, 113 Acoustic feature encoder, 114 Decoder, 115 Loss calculation unit for O, 116 TTS model learning unit, 117 Loss calculation unit for h, 118 Loss weighting unit, 119 Speaker characteristics removal unit, 120 Loss calculation unit for s, 121 Loss weighting unit, B Bus

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A computer executes a first learning procedure for learning a second model by updating a first model, into which a speaker vector representing a speaker, a text, and a first acoustic feature related to the speech in which the speaker utters the text are input, on the basis of the loss between a first predicted acoustic feature generated by the first model and the first acoustic feature, and a second learning procedure for updating the second model, into which a speaker vector of a target speaker and a second acoustic feature related to speech uttered by the target speaker are input, on the basis of the loss of a second predicted acoustic feature generated by the second model. Thus, even when there is no text corresponding to the target speaker's speech, the second acoustic feature enables adaptation of a TTS model by fine-tuning from the acoustic feature related to that speech.
PCT/JP2022/005903 2022-02-15 2022-02-15 Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program WO2023157066A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/005903 WO2023157066A1 (fr) 2022-02-15 2022-02-15 Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/005903 WO2023157066A1 (fr) 2022-02-15 2022-02-15 Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program

Publications (1)

Publication Number Publication Date
WO2023157066A1 true WO2023157066A1 (fr) 2023-08-24

Family

ID=87577741

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/005903 WO2023157066A1 (fr) 2022-02-15 2022-02-15 Speech synthesis learning method, speech synthesis method, speech synthesis learning device, speech synthesis device, and program

Country Status (1)

Country Link
WO (1) WO2023157066A1 (fr)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017032839A (ja) * 2015-08-04 2017-02-09 日本電信電話株式会社 音響モデル学習装置、音声合成装置、音響モデル学習方法、音声合成方法、プログラム
JP2017058513A (ja) * 2015-09-16 2017-03-23 株式会社東芝 学習装置、音声合成装置、学習方法、音声合成方法、学習プログラム及び音声合成プログラム
WO2019044401A1 (fr) * 2017-08-29 2019-03-07 大学共同利用機関法人情報・システム研究機構 Système informatique créant une adaptation de locuteur sans enseignant dans une synthèse de la parole basée sur dnn, et procédé et programme exécutés dans le système informatique
JP2020034883A (ja) * 2018-08-27 2020-03-05 日本放送協会 音声合成装置及びプログラム
JP2020060633A (ja) * 2018-10-05 2020-04-16 日本電信電話株式会社 音響モデル学習装置、音声合成装置、及びプログラム
JP2020160319A (ja) * 2019-03-27 2020-10-01 Kddi株式会社 音声合成装置、方法及びプログラム

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22926970

Country of ref document: EP

Kind code of ref document: A1