WO2019044401A1 - Computer system realizing unsupervised speaker adaptation in DNN-based speech synthesis, and method and program executed in the computer system - Google Patents


Info

Publication number
WO2019044401A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
unknown
acoustic
speakers
information
Prior art date
Application number
PCT/JP2018/029438
Other languages
English (en)
Japanese (ja)
Inventor
山岸 順一
信二 高木
Original Assignee
大学共同利用機関法人情報・システム研究機構
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 大学共同利用機関法人情報・システム研究機構 (Research Organization of Information and Systems)
Priority to JP2018568997A (patent JP6505346B1)
Publication of WO2019044401A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Definitions

  • the present invention relates to a computer system for realizing unsupervised speaker adaptation of DNN speech synthesis, a method implemented in the computer system and a program.
  • supervised speaker adaptation of DNN speech synthesis is known (see, for example, Non-Patent Document 1).
  • estimation of the speaker information of the unknown speaker has been performed based on both the voice data of the unknown speaker and the text input as the teacher data.
  • conventionally, the speaker information of an unknown speaker is a speaker code represented by a vector consisting only of 0s and 1s (for example, a one-hot vector in which only the k-th element is 1 and all the other elements are 0).
  • the conventional supervised speaker adaptation of DNN speech synthesis requires the input of text as teacher data.
  • when the text input as teacher data is manually transcribed from the corresponding audio file, labor costs increase; when the text input as teacher data is instead prepared using a speech recognizer, there has been a problem that the result is affected by recognition errors of the speech recognition device.
  • the present invention has been made to solve this problem, and it is an object of the present invention to provide a computer system for realizing unsupervised speaker adaptation of DNN speech synthesis, and a method and program implemented in the computer system.
  • the computer system according to the present invention is a computer system that outputs synthesized speech of an unknown speaker corresponding to the input text by using an acoustic model of a plurality of speakers represented by a deep neural network (DNN).
  • the acoustic model of the plurality of speakers has been learned using at least a plurality of pieces of speaker information, and each of the plurality of pieces of speaker information includes a speaker code representing, as probabilities, the degree of similarity between the distribution of that speaker's own acoustic features and the distributions of the acoustic features of the other speakers.
  • the computer system includes: a speech analysis unit that generates an acoustic feature of the unknown speaker by analyzing the speech signal of the unknown speaker; a speaker information estimation unit that estimates the speaker information of the unknown speaker based on the acoustic feature of the unknown speaker without requiring the input of text as teacher data, the speaker information of the unknown speaker including a speaker code representing, as probabilities, the similarity between the distribution of the acoustic feature of the unknown speaker and the distributions of the acoustic features of the plurality of known speakers; a text analysis unit that generates a linguistic feature of the input text by analyzing the input text; a synthesized acoustic feature generation unit that generates a synthesized acoustic feature of the unknown speaker based on the linguistic feature of the input text and the speaker information of the unknown speaker, using the acoustic model of the plurality of speakers; and a synthesized speech generation unit that generates synthesized speech of the unknown speaker based on the synthesized acoustic feature of the unknown speaker.
  • the speaker information estimation unit may estimate the speaker information of the unknown speaker using a speaker similarity model, and the speaker similarity model may store a distribution of the acoustic features of each of the plurality of known speakers.
  • the method of the present invention is implemented in a computer system that outputs synthesized speech of an unknown speaker corresponding to an input text using a multi-speaker acoustic model represented by a deep neural network (DNN).
  • the acoustic model of the plurality of speakers has been learned using at least a plurality of pieces of speaker information, and each of the plurality of pieces of speaker information includes a speaker code representing the degree of similarity between the distribution of that speaker's own acoustic features and the distributions of the acoustic features of the other speakers.
  • the method includes estimating the speaker information of the unknown speaker, the speaker information of the unknown speaker including a speaker code representing the degree of similarity between the distribution of the acoustic feature of the unknown speaker and the distributions of the acoustic features of each of the plurality of known speakers; generating a linguistic feature of the input text by analyzing the input text; generating a synthesized acoustic feature of the unknown speaker based on the linguistic feature of the input text and the speaker information of the unknown speaker, using the acoustic model of the plurality of speakers; and generating synthesized speech of the unknown speaker based on the synthesized acoustic feature of the unknown speaker, whereby the above object is achieved.
  • the program of the present invention is executed in a computer system that outputs synthesized speech of an unknown speaker corresponding to the input text using a multi-speaker acoustic model represented by a deep neural network (DNN).
  • the acoustic model of the plurality of speakers has been learned using at least a plurality of pieces of speaker information, and each of the plurality of pieces of speaker information includes a speaker code representing, as probabilities, the similarity between the distribution of that speaker's own acoustic features and the distributions of the acoustic features of the other speakers.
  • the computer system includes a processor unit, and the program, when executed by the processor unit, causes at least the following to be performed: generating the acoustic feature of the unknown speaker by analyzing the speech signal of the unknown speaker; and estimating the speaker information of the unknown speaker based on the acoustic feature of the unknown speaker without requiring the input of text as teacher data, the speaker information of the unknown speaker including a speaker code representing the similarity between the distribution of the acoustic feature of the unknown speaker and the distributions of the acoustic features of the plurality of known speakers.
  • the speech synthesizer uses an acoustic model of a plurality of speakers represented by a deep neural network (DNN) to change the synthesized speech of an unknown speaker corresponding to the input text according to the input speaker information of the unknown speaker.
  • the acoustic model of the plurality of speakers has been learned using at least a plurality of pieces of speaker information, and each of the plurality of pieces of speaker information includes a speaker code representing the degree of similarity between the distribution of that speaker's own acoustic features and the distributions of the acoustic features of the other speakers. The speech synthesizer includes a text analysis unit that generates a linguistic feature of the input text by analyzing the input text, and a synthesized acoustic feature generation unit that receives the input speaker information of the unknown speaker and generates a synthesized acoustic feature of the unknown speaker based on the linguistic feature of the input text and the speaker information of the unknown speaker, using the acoustic model of the plurality of speakers.
  • Brief description of the drawings: a diagram showing an example of a framework for realizing unsupervised speaker adaptation of DNN speech synthesis (FIG. 1); a diagram showing the results of objective evaluation experiments conducted based on the framework shown in FIG. 1; and a diagram showing the results of subjective evaluation of (a) quality and (b) speaker similarity for supervised speaker adaptation (Supervise) and for unsupervised speaker adaptation (GMM, i-vec) using GMM-UBM or i-vector/PLDA.
  • Unknown speaker means a speaker whose speaker information input to the speech synthesizer is unknown when generating synthesized speech.
  • the "known speaker” refers to a speaker whose speaker information to be input to the speech synthesizer is known when generating synthesized speech.
  • DNN is an abbreviation of deep neural network.
  • Unsupervised speaker adaptation refers to performing processing adapted to an unknown speaker without requiring input of teacher data.
  • unsupervised speaker adaptation of DNN speech synthesis refers to constructing a DNN speech synthesis system of an unknown speaker from speech only, without requiring the input of text as teacher data.
  • FIG. 1 shows an example of a framework for implementing unsupervised speaker adaptation of DNN speech synthesis.
  • a multi-speaker acoustic model (DNN) 230 is used to output synthesized speech of an unknown speaker corresponding to the input text.
  • the multi-speaker acoustic model (DNN) 230 has been learned using at least a plurality of pieces of speaker information.
  • Each of the plurality of pieces of speaker information includes a speaker code representing the degree of similarity between the distribution of its own acoustic features and the distributions of the acoustic features of each of the other speakers.
  • This framework is roughly divided into an adaptation part 100 and a synthesis part 200.
  • the adaptation part 100 functions to generate speaker information of the unknown speaker based on the speech signal of the unknown speaker.
  • the flow of processing in the adaptation part 100 will be described below.
  • the speech signal of the unknown speaker from the unknown speaker database 110 is input to the speech analysis unit 120.
  • the speech analysis unit 120 generates an acoustic feature of the unknown speaker by analyzing the speech signal of the unknown speaker.
  • the speaker information estimation unit 130 estimates the speaker information of the unknown speaker based on the acoustic feature of the unknown speaker without requiring the input of the text as the teacher data.
  • the estimation of the speaker information of the unknown speaker is performed using, for example, the speaker similarity model 140.
  • the speaker similarity model 140 is designed to eliminate the need for text.
  • in the speaker similarity model 140, distributions of the acoustic features of a plurality of known speakers are stored.
  • suppose, for example, that the speaker similarity model 140 stores the distributions of the acoustic features of five known speakers.
  • the similarity vector is then expressed as, for example, (0.8, 0.05, 0.05, 0.05, 0.05), meaning that the distribution of the unknown speaker's acoustic features resembles that of the first known speaker with probability 0.8 and each of the remaining four with probability 0.05 (one way such a vector could be computed is sketched below).
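  • as a concrete illustration only (not the method of this document), the following sketch shows one plausible way such a similarity vector could be computed: fit one GMM per known speaker on that speaker's acoustic feature frames, then turn the unknown speaker's average log-likelihood under each GMM into a probability-like vector. It is a simplified stand-in for the GMM-UBM or i-vector/PLDA scoring referred to elsewhere (no universal background model or PLDA step is shown), and scikit-learn plus all function names here are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(features_per_speaker, n_components=8):
    """Fit one GMM per known speaker on that speaker's acoustic feature frames."""
    gmms = []
    for feats in features_per_speaker:          # feats: (n_frames, n_dims) array
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              random_state=0).fit(feats)
        gmms.append(gmm)
    return gmms

def similarity_vector(unknown_feats, gmms):
    """Average log-likelihood of the unknown speaker's frames under each known-speaker
    GMM, normalized into a probability-like similarity vector."""
    loglik = np.array([gmm.score(unknown_feats) for gmm in gmms])  # mean log-likelihood
    loglik -= loglik.max()                       # shift for numerical stability
    probs = np.exp(loglik)
    return probs / probs.sum()                   # e.g. (0.8, 0.05, 0.05, 0.05, 0.05)
```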
  • examples of the acoustic features include, but are not limited to, mel-frequency cepstral coefficients (MFCC) and/or voice height (the fundamental frequency, F0); a minimal extraction sketch follows below.
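  • a minimal sketch of extracting these two kinds of acoustic features from a waveform, assuming the librosa library; the parameter values are illustrative, and the MFCC and F0 tracks may have different frame counts because the two extractors use different default hop lengths.

```python
import librosa

def extract_acoustic_features(wav_path, sr=16000, n_mfcc=13):
    """Extract frame-level MFCCs and F0 (voice height) from a speech file."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)          # (n_mfcc, n_frames)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"), sr=sr)   # (n_frames_f0,)
    return mfcc.T, f0                                               # frames as rows
```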
  • the i-vector/PLDA model is a method used in speaker recognition in which summary statistics of speech are reduced to a low-dimensional representation (i-vector) and compared using probabilistic LDA.
  • the use of the i-vector/PLDA model is expected to enable speaker adaptation even from noisy speech.
  • the similarity vector is described as being generated as a speaker code, but the expression format of the speaker code is not limited to the vector format.
  • the speaker code may be represented in any data format as long as it represents, as probabilities, the similarity between the distribution of the acoustic feature of the unknown speaker and the distributions of the acoustic features of each of the plurality of known speakers.
  • the speaker code may be all of the speaker information or part of the speaker information.
  • the speaker information may include information other than the speaker code (eg, gender code, age code).
  • the speaker information of the unknown speaker is generated based on the voice signal of the unknown speaker without requiring the input of the text as the teacher data.
  • the speaker information of the unknown speaker includes a speaker code representing the degree of similarity between the distribution of the acoustic feature of the unknown speaker and the distribution of the acoustic features of each of the plurality of known speakers as a probability.
  • the speaker code is expressed, for example, in the form of a similarity vector.
  • the synthesis part 200 functions as a speech synthesizer that uses the multi-speaker acoustic model (DNN) 230 to change the synthesized speech of the unknown speaker corresponding to the input text according to the input speaker information of the unknown speaker.
  • the text analysis unit 210 generates the language feature of the input text by analyzing the input text.
  • the linguistic feature quantities of the input text generated by the text analysis unit 210 are input to the synthetic acoustic feature quantity generation unit 220.
  • the speaker information of the unknown speaker is input to the synthetic acoustic feature quantity generation unit 220.
  • the input speaker information of the unknown speaker includes a speaker code representing, in probability, the similarity between the distribution of the acoustic feature of the unknown speaker and the distribution of the acoustic features of each of the plurality of known speakers.
  • the input speaker information of the unknown speaker is, for example, one estimated by the above-described speaker information estimation unit 130.
  • the synthetic acoustic feature quantity generation unit 220 generates a synthetic acoustic feature quantity of the unknown speaker based on the input language feature quantity of the text and the input speaker information of the unknown speaker.
  • Generation of synthesized acoustic feature quantities of unknown speakers is performed using a multi-speaker acoustic model (DNN) 230.
  • the multi-speaker acoustic model (DNN) 230 is a multi-speaker acoustic model represented by a deep neural network (DNN).
  • the multi-speaker acoustic model (DNN) 230 is used after it has been trained.
  • the training of the multi-speaker acoustic model (DNN) 230 uses, for example, the speaker information of the known speakers and/or the acoustic features of the known speakers (a rough training sketch follows this item).
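  • the document does not spell out the training procedure, so the following is only a rough sketch of one conventional way such a model could be trained: frame-level pairs of (linguistic features concatenated with a speaker code) as input and acoustic features as target, fitted with two sigmoid hidden layers as described below. The dimensions, the random placeholder data, and the use of scikit-learn's MLPRegressor are assumptions added here for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Illustrative dimensions (the text calls them T, K, H, S; the values are arbitrary).
T, K, H, S = 300, 112, 256, 187
n_frames = 10000

# Random arrays stand in for real training data from the known speakers:
#   X = [linguistic features | speaker code], Y = acoustic features.
rng = np.random.default_rng(0)
X = np.hstack([rng.random((n_frames, T)), rng.random((n_frames, K))])
Y = rng.random((n_frames, S))

# Two sigmoid hidden layers, matching the structure described below.
model = MLPRegressor(hidden_layer_sizes=(H, H), activation="logistic",
                     max_iter=50, random_state=0).fit(X, Y)
# model.predict(X[:1]) would then return one S-dimensional synthesized acoustic feature vector.
```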
  • the input text is represented by a T-dimensional vector
  • the speaker code contained in the input unknown speaker's speaker information is represented by a K-dimensional vector.
  • the vector of the speaker code is the above-described similarity vector.
  • the information input to the synthetic acoustic feature quantity generation unit 220 is represented by an N-dimensional vector.
  • N = T + K
  • N, T and K are any integers of 1 or more.
  • the information output from the synthetic acoustic feature quantity generation unit 220 (that is, the synthesized acoustic feature quantity) is represented by an S-dimensional vector.
  • each of the first intermediate layer and the second intermediate layer represents an intermediate representation and is represented by an H-dimensional vector. That is, the processing N-dimensional input → matrix operation → sigmoid operation → first intermediate layer (H-dimensional vector) → matrix operation → sigmoid operation → second intermediate layer (H-dimensional vector) → matrix operation → S-dimensional output is performed (see the sketch following this item).
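  • a minimal NumPy sketch of this N-dimensional input → two sigmoid intermediate layers → S-dimensional output computation; the weights and dimensions are random placeholders, and bias terms are omitted as they are in the text. The per-unit weight matrices discussed below correspond to rows (or blocks of columns) of the larger matrices used here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

T, K, H, S = 300, 112, 256, 187          # illustrative values; N = T + K
N = T + K
rng = np.random.default_rng(0)
W_in  = rng.standard_normal((H, N))       # N-dim input -> first intermediate layer
W_hid = rng.standard_normal((H, H))       # first -> second intermediate layer
W_out = rng.standard_normal((S, H))       # second intermediate layer -> S-dim output

x  = np.concatenate([rng.random(T), rng.random(K)])   # linguistic features + speaker code
h1 = sigmoid(W_in @ x)                    # first intermediate layer (H-dimensional)
h2 = sigmoid(W_hid @ h1)                  # second intermediate layer (H-dimensional)
y  = W_out @ h2                           # synthesized acoustic features (S-dimensional)
```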
  • each dimension of this T-dimensional vector is an input instructing which sound is to be generated; for example, the first dimension indicates whether to generate the sound "A" and the second dimension indicates whether to generate the sound "I".
  • An element of a vector being 1 indicates that a sound corresponding to that element is to be generated, and an element of vector being 0 indicates that a sound corresponding to that element is not to be generated.
  • the speaker code is represented by a K-dimensional vector.
  • N = T + K.
  • weight matrix 1 is a 1 ⁇ T matrix
  • weight matrix 2 is a 1 ⁇ K matrix. Therefore, the result of the calculation of (Expression 1) is a scalar value, and the output of the sigmoid function is also a scalar value.
  • the weight matrix 1 ′ is a 1 ⁇ T matrix
  • the weight matrix 2 ′ is a 1 ⁇ K matrix. Therefore, the result of the calculation of (Expression 2) is a scalar value, and the output of the sigmoid function is also a scalar value.
  • the weighting matrix 3 ' is a 1 ⁇ H matrix. Therefore, the result of the calculation of (Expression 4) is a scalar value, and the output of the sigmoid function is also a scalar value.
  • (Equation 5): S-dimensional output vector = weight matrix 4 × second intermediate layer (H-dimensional vector).
  • the weight matrix 4 is a matrix of S ⁇ H. Therefore, the result of the calculation of (Equation 5) is an S-dimensional vector. In this way, it is possible to predict synthesized acoustic features represented by S-dimensional vectors.
  • when the speaker code vector is a similarity vector, not only a specific element (for example, the k-th element) of weight matrix 2 but all elements of weight matrix 2 are always used.
  • because the speaker code vector is a similarity vector, the similarities to all K known speakers can be taken into account. Therefore, the speaker information of the unknown speaker estimated using the similarity vector can be said to be more useful than the speaker information of the unknown speaker estimated using the one-hot vector described below.
  • by contrast, when the speaker code vector is a one-hot vector, only the k-th element is 1 and all the other elements are 0.
  • in that case, the elements of weight matrix 2 other than the k-th element are multiplied by zero and make no contribution. Therefore, when the speaker code vector is a one-hot vector, the speaker information of the unknown speaker is handled considering only the k-th known speaker among the K known speakers, and the speaker information estimated using the one-hot vector can be said to be less useful than that estimated using the similarity vector described above (a toy comparison follows this item).
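  • the following toy computation (with random, purely illustrative weights) shows the point: with a one-hot speaker code only the k-th column of the speaker-code weights contributes to the first intermediate layer, whereas with a similarity vector every column contributes.

```python
import numpy as np

T, K, H = 4, 5, 3
rng = np.random.default_rng(0)
W_text = rng.standard_normal((H, T))    # stacks the rows called "weight matrix 1, 1', 1''..."
W_spk  = rng.standard_normal((H, K))    # stacks the rows called "weight matrix 2, 2', 2''..."
text   = rng.random(T)

def first_layer(spk_code):
    return 1.0 / (1.0 + np.exp(-(W_text @ text + W_spk @ spk_code)))

one_hot    = np.array([0.0, 0.0, 1.0, 0.0, 0.0])      # only column k=2 of W_spk is used
similarity = np.array([0.05, 0.05, 0.8, 0.05, 0.05])  # every column of W_spk contributes

print(first_layer(one_hot), first_layer(similarity))
```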
  • when the multi-speaker acoustic model (DNN) 230 is composed of two intermediate layers, the multi-speaker acoustic model (DNN) 230 stores the weight matrices used in the processing from the N-dimensional input to the first intermediate layer (weight matrix 1, weight matrix 2, weight matrix 1′, weight matrix 2′, weight matrix 1′′, weight matrix 2′′, ..., for a total of H × 2 matrices), the weight matrices used in the processing from the first intermediate layer to the second intermediate layer (weight matrix 3, weight matrix 3′, weight matrix 3′′, ..., for a total of H matrices), and the weight matrix used in the processing from the second intermediate layer to the S-dimensional output (weight matrix 4, one matrix).
  • the multi-speaker acoustic model (DNN) 230 may be composed of any number of layers of two or more, and the same processing as described above can be performed for any such number of layers.
  • the DNN may also have another structure; for example, the same processing as described above can be performed with a convolutional neural network or a recurrent neural network.
  • the present invention is not limited to this.
  • instead of the operation using the sigmoid function, an operation using a rectified linear unit (ReLU) may be performed.
  • the synthesized speech generation unit 240 generates synthesized speech of the unknown speaker based on the synthesized acoustic feature of the unknown speaker.
  • as described above, the synthesis part 200 functions as a speech synthesizer that uses the multi-speaker acoustic model (DNN) 230 to change the synthesized speech of the unknown speaker corresponding to the input text according to the input speaker information of the unknown speaker. In other words, the synthesis part 200 functions as a speech synthesizer that reproduces the speaker characteristics of the unknown speaker.
  • in addition to the usual function of outputting synthesized speech of an unknown speaker corresponding to the input text, this speech synthesizer further has a function of changing the speech to be synthesized in response to the speaker information of the unknown speaker. Therefore, it can be said that this speech synthesizer has means for changing the synthesized speech according to the speaker information of the unknown speaker.
  • the means for changing the speech to be synthesized can change the speech to be synthesized using the multi-speaker acoustic model (DNN). Changing the synthesized speech according to the speaker information of the unknown speaker is achieved, for example, by changing the synthesized acoustic features according to the speaker information of the unknown speaker and then converting the synthesized acoustic features into synthesized speech and outputting it.
  • in this way, it is possible to construct a DNN speech synthesizer for an unknown speaker from only a small amount of speech data for which no text exists (a rough end-to-end sketch follows this item).
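  • tying the pieces together, here is a rough end-to-end sketch of the synthesis part reusing the kind of components sketched above; the text front-end and the vocoder are placeholder stubs because the document does not describe their internals, and every name, weight, and dimension here is an illustrative assumption rather than the actual implementation.

```python
import numpy as np

T, K, H, S = 300, 5, 256, 187            # illustrative dimensions
rng = np.random.default_rng(0)
W_in  = rng.standard_normal((H, T + K))
W_hid = rng.standard_normal((H, H))
W_out = rng.standard_normal((S, H))

def text_to_linguistic_features(text):   # placeholder for text analysis unit 210
    return rng.random(T)

def vocoder(acoustic_features):          # placeholder for synthesized speech generation unit 240
    return np.zeros(16000)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def synthesize(text, speaker_code):
    """Synthesis part 200: linguistic features + speaker code -> acoustic features -> waveform."""
    x = np.concatenate([text_to_linguistic_features(text), speaker_code])
    acoustic = W_out @ sigmoid(W_hid @ sigmoid(W_in @ x))
    return vocoder(acoustic)

# Adaptation part 100 would first estimate the speaker code from speech alone
# (e.g. with similarity_vector() sketched earlier); a fixed vector stands in for it here.
speaker_code = np.array([0.05, 0.05, 0.8, 0.05, 0.05])
waveform = synthesize("Hello world", speaker_code)
```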
  • FIG. 2A shows the results of objective evaluation experiments conducted based on the framework shown in FIG.
  • FIG. 2B shows the results of subjective evaluation experiments on (a) quality and (b) speaker similarity for supervised speaker adaptation (Supervise) and for unsupervised speaker adaptation (GMM, i-vec) using GMM-UBM or i-vector/PLDA.
  • "utt” represents the number of utterances.
  • in the figures, i-vec (MFCC) and i-vec (MFCC + F0) denote unsupervised adaptation using the i-vector/PLDA method, and GMM denotes unsupervised adaptation using the GMM-UBM method.
  • the experimental conditions underlying the results of the experiments shown in FIGS. 2A and 2B are as follows.
  • Training data (training of the DNN for multi-speaker speech synthesis): number of speakers: 112; number of utterances: 11,154 in total (about 100 utterances per speaker).
  • Speaker similarity model training: same data as for the DNN for multi-speaker speech synthesis.
  • Speaker adaptation data: number of speakers: 23; number of utterances: about 100 utterances per speaker.
  • Test data: number of speakers: 23 (same as the adaptation speakers); number of synthesized utterances: 10 utterances per speaker.
  • the second experiment was performed under the same experimental conditions as the experimental conditions on which the results of the experiments shown in FIGS. 2A and 2B are based.
  • a noise database storing noise data for adding noise to speech data and a reverberation database storing reverberation data for adding reverberation to speech data were used, and degraded speech y was created from high-quality speech data without degradation using Equation 6 below.
  • y = x * h1 + α (n * h2)    (Equation 6)
  • x represents high-quality speech
  • n represents noise
  • h1 and h2 represent impulse responses, obtained at different microphone positions, used to apply reverberation (h1 corresponds to a position closer to the speaker than h2)
  • * represents convolution
  • α represents a desired parameter for adjusting the noise intensity (a short sketch of applying Equation 6 follows this item).
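  • a short sketch of how Equation 6 could be applied in practice, assuming NumPy/SciPy; deriving α from a target SNR is an assumption added here for illustration (the document only says that α adjusts the noise intensity).

```python
import numpy as np
from scipy.signal import fftconvolve

def degrade(x, n, h1, h2, snr_db):
    """y = x * h1 + alpha * (n * h2), where * is convolution (Equation 6).
    alpha is chosen so that the reverberant speech-to-noise ratio reaches snr_db."""
    s = fftconvolve(x, h1)                  # reverberant speech
    v = fftconvolve(n, h2)                  # reverberant noise
    length = min(len(s), len(v))
    s, v = s[:length], v[:length]
    alpha = np.sqrt(np.mean(s ** 2) / (np.mean(v ** 2) * 10 ** (snr_db / 10.0)))
    return s + alpha * v
```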
  • FIG. 3A shows the results of an objective evaluation experiment of unsupervised speaker adaptation (mel-cepstral distortion, log F0 RMSE) with speaker models constructed using different acoustic features (MFCC, MFCC + F0) and methods (GMM, i-vec).
  • FIG. 3B shows an objective evaluation experiment result of unsupervised speaker adaptation using speaker adaptation data of different SNRs.
  • SNR refers to the ratio of signal to noise, and the larger the number, the less noise.
  • the unit of SNR is decibel (dB).
  • FIG. 3B is a view showing a more detailed result of FIG. 3A.
  • FIG. 3C shows the results of an objective evaluation experiment comparing unsupervised speaker adaptation when high-quality speech data without degradation is used for learning the speaker similarity model and when degraded speech data is used.
  • the left side of the slash symbol represents the type of speech data for which the speaker similarity model has been learned
  • the right side of the slash symbol represents speech data of the speaker adaptation data.
  • CLEAN / MEETING indicates that high quality speech data without degradation (CLEAN) is used for learning of the speaker similarity model and that the degraded speech data of MEETING is used as data for speaker adaptation.
  • GMM, GMM (F0), i-vec, and i-vec (F0) indicate experiments using the methods described below, respectively.
  • FIG. 3D shows the results of a subjective evaluation experiment on the quality of unsupervised speaker adaptation when high-quality speech data without degradation is used for learning the speaker similarity model and when degraded speech data is used.
  • FIG. 3E shows the results of a subjective evaluation experiment on speaker similarity of unsupervised speaker adaptation when high-quality speech data without degradation is used for learning the speaker similarity model and when degraded speech data is used.
  • FIG. 3D shows the result of subjectively evaluating the quality of the synthesized speech on a 5-point MOS scale, and FIG. 3E shows the result of evaluating speaker similarity by comparing the synthesized speech with reference speech on a 5-point MOS scale.
  • FIG. 4 shows an example of the configuration of the computer system 1 for realizing the framework shown in FIG.
  • the computer system 1 at least includes a memory unit 10 and a processor unit 20. These components are connected to one another. Each of these components may be configured with a single hardware component or may be configured with multiple hardware components.
  • the memory unit 10 stores a program required to execute processing (for example, a program required to execute the processing shown in FIG. 1) and data required to execute the program.
  • the program may be preinstalled in the memory unit 10.
  • the program may be installed in the memory unit 10 by being downloaded via a network such as the Internet, or may be installed in the memory unit 10 via a storage medium such as an optical disc or a USB memory.
  • the processor unit 20 controls the overall operation of the computer system 1.
  • the processor unit 20 reads a program stored in the memory unit 10 and executes the program.
  • the computer system 1 can function as an apparatus configured to perform a desired step or an apparatus provided with means for performing a desired function.
  • for example, the computer system 1 can function as an apparatus including a speech analysis unit 120, a speaker information estimation unit 130, a text analysis unit 210, a synthesized acoustic feature generation unit 220, and a synthesized speech generation unit 240 as means for executing specific functions.
  • the present invention is useful for providing a computer system for realizing unsupervised speaker adaptation of DNN speech synthesis, a method and a program executed in the computer system, and the like.
  • Reference Signs List: 1 computer system; 10 memory unit; 20 processor unit; 100 adaptation part; 110 unknown speaker database; 120 speech analysis unit; 130 speaker information estimation unit; 140 speaker similarity model; 200 synthesis part; 210 text analysis unit; 220 synthesized acoustic feature generation unit; 230 multi-speaker acoustic model (DNN); 240 synthesized speech generation unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a computer system 1 that includes a speaker information estimation unit 130 which estimates the speaker information of an unknown speaker based on the acoustic features of the unknown speaker without requiring the input of text as teacher data. The speaker information of the unknown speaker includes a speaker code that represents, as probabilities, the similarity between the distribution of the acoustic features of the unknown speaker and the distribution of each of the acoustic features of a plurality of known speakers. The computer system 1 further includes: a synthesized acoustic feature generation unit 220 for generating synthesized acoustic features of the unknown speaker based on the linguistic features of an input text and the speaker information of the unknown speaker, using an acoustic model (DNN) 230 of multiple speakers; and a synthesized speech generation unit 240 for generating synthesized speech of the unknown speaker based on the synthesized acoustic features of the unknown speaker.
PCT/JP2018/029438 2017-08-29 2018-08-06 Computer system realizing unsupervised speaker adaptation in DNN-based speech synthesis, and method and program executed in the computer system WO2019044401A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2018568997A JP6505346B1 (ja) 2017-08-29 2018-08-06 Dnn音声合成の教師無し話者適応を実現するコンピュータシステム、そのコンピュータシステムにおいて実行される方法およびプログラム

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017-164267 2017-08-29
JP2017164267 2017-08-29

Publications (1)

Publication Number Publication Date
WO2019044401A1 true WO2019044401A1 (fr) 2019-03-07

Family

ID=65527677

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/029438 WO2019044401A1 (fr) 2017-08-29 2018-08-06 Computer system realizing unsupervised speaker adaptation in DNN-based speech synthesis, and method and program executed in the computer system

Country Status (2)

Country Link
JP (1) JP6505346B1 (fr)
WO (1) WO2019044401A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210241780A1 (en) * 2020-01-31 2021-08-05 Nuance Communications, Inc. Method And System For Speech Enhancement
US11545135B2 (en) * 2018-10-05 2023-01-03 Nippon Telegraph And Telephone Corporation Acoustic model learning device, voice synthesis device, and program
WO2023157066A1 (fr) * 2022-02-15 2023-08-24 日本電信電話株式会社 Procédé d'apprentissage de synthèse vocale, procédé de synthèse vocale, dispositif d'apprentissage de synthèse vocale, dispositif de synthèse vocale et programme

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015057651A (ja) * 2013-08-23 2015-03-26 株式会社東芝 音声処理システム及び方法

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015057651A (ja) * 2013-08-23 2015-03-26 株式会社東芝 音声処理システム及び方法

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DODDIPATLA, RAMA ET AL.: "Speaker adaptation in DNN-based speech synthesis using d-vectors", PROC. OF INTERSPEECH, 20 August 2017 (2017-08-20), pages 3404 - 3408, XP055579842 *
INOUE, KATSUKI ET AL.: "Comparisons on Transplant Emotional Expressions in DNN-based TTS Synthesis", IEICE TECHNICAL REPORT, vol. 117, no. 106, 15 June 2017 (2017-06-15), pages 23 - 28 *
THI LUONG, HIEU ET AL.: "A DNN-based Text-to-Speech Synthesis System using Speaker, Gender and Age Codes", IEICE TECHNICAL REPORT, vol. 116, no. 279, 20 October 2016 (2016-10-20), pages 37 - 42 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11545135B2 (en) * 2018-10-05 2023-01-03 Nippon Telegraph And Telephone Corporation Acoustic model learning device, voice synthesis device, and program
US20210241780A1 (en) * 2020-01-31 2021-08-05 Nuance Communications, Inc. Method And System For Speech Enhancement
US11657828B2 (en) * 2020-01-31 2023-05-23 Nuance Communications, Inc. Method and system for speech enhancement
WO2023157066A1 (fr) * 2022-02-15 2023-08-24 日本電信電話株式会社 Procédé d'apprentissage de synthèse vocale, procédé de synthèse vocale, dispositif d'apprentissage de synthèse vocale, dispositif de synthèse vocale et programme

Also Published As

Publication number Publication date
JPWO2019044401A1 (ja) 2019-11-07
JP6505346B1 (ja) 2019-04-24

Similar Documents

Publication Publication Date Title
EP1515305B1 (fr) Adaptation au bruit pour la reconnaissance de la parole
Deng et al. Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition
JP7018659B2 (ja) 声質変換装置、声質変換方法およびプログラム
JP5842056B2 (ja) 雑音推定装置、雑音推定方法、雑音推定プログラム及び記録媒体
JP6437581B2 (ja) 話者適応型の音声認識
JP2019120841A (ja) スピーチチェイン装置、コンピュータプログラムおよびdnn音声認識・合成相互学習方法
Valentini-Botinhao et al. Speech enhancement of noisy and reverberant speech for text-to-speech
JP6505346B1 (ja) Dnn音声合成の教師無し話者適応を実現するコンピュータシステム、そのコンピュータシステムにおいて実行される方法およびプログラム
JP6783475B2 (ja) 声質変換装置、声質変換方法およびプログラム
EP1457968B1 (fr) Adaptation au bruit d'un modèle de parole, méthode d'adaptation au bruit et programme d'adaptation au bruit pour la reconnaissance de parole
Hwang et al. LP-WaveNet: Linear prediction-based WaveNet speech synthesis
JP6543820B2 (ja) 声質変換方法および声質変換装置
CN110998723B (zh) 使用神经网络的信号处理装置及信号处理方法、记录介质
Mohammadi et al. A Voice Conversion Mapping Function Based on a Stacked Joint-Autoencoder.
Ozerov et al. GMM-based classification from noisy features
JP6721165B2 (ja) 入力音マスク処理学習装置、入力データ処理関数学習装置、入力音マスク処理学習方法、入力データ処理関数学習方法、プログラム
Giacobello et al. Stable 1-norm error minimization based linear predictors for speech modeling
JP6594251B2 (ja) 音響モデル学習装置、音声合成装置、これらの方法及びプログラム
Elshamy et al. DNN-based cepstral excitation manipulation for speech enhancement
JP4964194B2 (ja) 音声認識モデル作成装置とその方法、音声認識装置とその方法、プログラムとその記録媒体
Yanagisawa et al. Noise robustness in HMM-TTS speaker adaptation
Lanchantin et al. Dynamic model selection for spectral voice conversion.
Coto-Jiménez Experimental study on transfer learning in denoising autoencoders for speech enhancement
Song et al. Speaker-adaptive neural vocoders for parametric speech synthesis systems
Nakashika et al. Speaker adaptive model based on Boltzmann machine for non-parallel training in voice conversion

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2018568997

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18852555

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18852555

Country of ref document: EP

Kind code of ref document: A1