WO2023158226A1 - Speech synthesis method and device using an adversarial learning technique - Google Patents

Speech synthesis method and device using an adversarial learning technique

Info

Publication number
WO2023158226A1
WO2023158226A1 (PCT/KR2023/002229)
Authority
WO
WIPO (PCT)
Prior art keywords
adversarial
speech
learning
synthesizing
voice
Prior art date
Application number
PCT/KR2023/002229
Other languages
English (en)
Korean (ko)
Inventor
장준혁
이모아
Original Assignee
한양대학교 산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한양대학교 산학협력단
Publication of WO2023158226A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present invention relates to a speech synthesis method and apparatus using an adversarial learning technique, and more particularly, to a speech synthesis method using an artificial neural network and to a technique for learning to synthesize speech in a non-autoregressive manner using adversarial learning.
  • Speech communication refers to a technology that transmits a user's uttered voice to the other party so that voice communication users can communicate with each other, and it is used in various fields. In voice communication, only the speaker's clear voice signal must be transmitted to convey the intended meaning to the other party. However, when two or more speakers talk at the same time, the previous speaker's speech is fed back into the microphone and the sound played from the loudspeaker is repeatedly picked up again; when this acoustic echo phenomenon occurs, the speaker's voice cannot be transmitted accurately.
  • a deep neural network (DNN), a machine learning technique, has shown excellent performance in various speech enhancement and speech recognition studies.
  • the deep neural network shows excellent performance by effectively modeling the non-linear relationship between the input feature vector and the output feature vector through a plurality of hidden layers and hidden nodes.
  • since the non-autoregressive speech synthesis model synthesizes all frames of the target speech at once, it has the advantage of eliminating unnecessary time delay.
  • however, unlike the autoregressive method, the non-autoregressive model is trained under an assumption of conditional independence between all frames, so its sound quality is somewhat lower than that of the autoregressive method.
  • the monotonic aligner is a hard aligner whose performance has been shown to be limited compared to a soft aligner, and the method using multiple decoders greatly increases the parameters of the model and is inefficient.
  • a speech synthesis method and apparatus using an adversarial learning technique is an invention designed to solve the above-described problems, and its purpose is to provide an end-to-end non-autoregressive speech synthesis method and apparatus based on adversarial learning.
  • a speech synthesis method and apparatus using an adversarial learning technique improves a conventional non-autoregressive speech synthesis model by applying an adversarial learning method in a speech synthesis method using an artificial neural network.
  • the purpose is to reduce the delay that occurs when synthesizing voices compared to the prior art.
  • a voice synthesis method using an adversarial learning technique includes receiving voice data input, learning an adversarial model for synthesizing voice based on the voice data input, and synthesizing frames of a target voice using the adversarial model, wherein synthesizing the frames of the voice may include synthesizing the frames of the target voice in a non-autoregressive manner.
  • all frames of the target speech may be synthesized in a non-autoregressive manner using the adversarial model.
  • a mel-spectrogram signal and text for synthesizing the target voice may be received as input information of the monotonic attention.
  • Learning the adversarial model may include performing adversarial learning using an output value of the monotonic attention.
  • learning the adversarial model may include taking the sum of the reconstruction loss (Loss_recon), the duration prediction loss (Loss_dur), and the adversarial loss as the loss function and learning so that the absolute value of the loss function is minimized.
  • An apparatus for synthesizing speech using an adversarial learning technique may include a memory, an input unit that receives voice data input, and a control unit that learns an adversarial model for synthesizing speech based on the voice data received through the input unit, stores the adversarial model in the memory, and synthesizes the frames of the target speech in a non-autoregressive manner using the adversarial model stored in the memory.
  • the control unit may synthesize entire frames of the target speech in a non-autoregressive manner using the adversarial model.
  • the control unit may control the input unit to receive a mel-spectrogram signal and text for synthesizing the target voice as inputs of the monotonic attention.
  • the controller may perform adversarial learning using the output value of the monotonic attention.
  • the controller may estimate the sequence length of the target speech based on an output of a text encoder for encoding the text included in the speech data input.
  • the control unit may learn the adversarial model so that the absolute value of the loss function is minimized, taking the sum of the reconstruction loss (Loss_recon), the duration prediction loss (Loss_dur), and the adversarial loss as the loss function.
  • in a speech synthesis method and apparatus using an adversarial learning technique, by making the distance between the two latent vectors small, learning is performed so that the latent vector generated from only random noise and text information during actual speech synthesis contains sufficient information; consequently, a speech feature vector can be generated from the latent vector in a non-autoregressive manner.
  • since the target duration can be learned together with the entire model, the extra time required to learn the target duration separately can be reduced, and since the speech feature vector is generated in a non-autoregressive manner, there is an advantage of reducing the unnecessary delay that occurs during actual speech synthesis.
  • FIG. 1 is a flowchart illustrating a voice synthesis method according to an embodiment of the present invention.
  • FIG. 2 is a block diagram showing the components of a voice synthesizer according to an embodiment of the present invention.
  • FIG. 3 is a flowchart illustrating a process of predicting a Mel-spectogram from input data according to an embodiment of the present invention.
  • FIG. 4 is a diagram for explaining the function of an aligner according to an embodiment of the present invention.
  • FIG. 5 is a diagram illustrating a process of extracting an aligner and a target period according to an embodiment of the present invention.
  • FIGS. 6 and 7 are graphs showing the alignment output from the mel decoder when the alignment is given as a target and when it is not.
  • FIG. 8 is a table comparing the actual experimental results of the prior art and the present invention.
  • the embodiments described below relate to a technology for synthesizing voice using a deep neural network (DNN), and are described on the premise that an artificial neural network is configured in the control unit of the voice synthesizer 200.
  • the speech synthesis apparatus 200 may include a learning session for learning an artificial neural network for speech synthesis, an inference session, and the like.
  • FIG. 1 is a flowchart illustrating a voice synthesis method according to an embodiment of the present invention.
  • the voice synthesis method (S100) includes steps S110, S130, and S150; the voice synthesis method is briefly described with reference to FIG. 1, and the specific learning and synthesis methods are described through FIGS. 2 to 5.
  • the voice synthesis apparatus 200 receives various data for voice synthesis (S110).
  • data for speech synthesis may include speech text and a mel-spectrogram.
  • the voice synthesis apparatus 200 learns an adversarial model based on the input data (S130); here, the adversarial model may be an adversarial model for speech synthesis.
  • the voice synthesis apparatus 200 synthesizes frames of the target voice (S150). Specifically, the synthesis device may synthesize voice in a non-autoregressive manner (S151).
  • the speech synthesis apparatus 200 may apply adversarial learning to generate a latent vector from the reference mel-spectrogram, which is the ground truth serving as reference data.
  • the speech synthesis apparatus 200 may learn an adversarial model so that this latent vector and the latent vector generated from the text are projected into the same semantic space. Accordingly, the speech synthesis apparatus may perform parallel decoding of the mel-spectrogram through the learned latent vector.
  • the synthesis apparatus can thereby learn a non-autoregressive speech synthesis model.
  • the latent vector according to the present invention described in the above example may include compressed speech information. Accordingly, the latent vector may be input to the speech feature vector decoder, and the speech synthesis apparatus 200 may generate a target speech feature vector based on the input data.
  • the non-autoregressive speech synthesis model learned in the speech synthesizer 200 can synthesize speech faster than the autoregressive speech synthesis model that sequentially generates frames of speech feature vectors. Accordingly, the adversarial model learned according to the embodiment of the present specification has the advantage of being applicable to a real-time speech synthesis program.
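As a concrete illustration of this speed difference, the sketch below contrasts sequential (autoregressive) frame generation with decoding all frames at once from a frame-level latent sequence. The function names are placeholders for illustration, not the patent's actual implementation:

```python
import numpy as np

def autoregressive_decode(step_fn, num_frames, n_mels=80):
    """Sequential decoding: each frame is conditioned on the previously generated frame."""
    frames, prev = [], np.zeros(n_mels)
    for _ in range(num_frames):          # num_frames serial steps, so latency grows with length
        prev = step_fn(prev)
        frames.append(prev)
    return np.stack(frames)

def non_autoregressive_decode(decoder_fn, latents):
    """Parallel decoding: every frame is produced in a single call over the latent sequence."""
    return decoder_fn(latents)           # one pass, all frames at once
```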
  • in conventional approaches, a method of applying upsampling through a monotonic aligner and a method of iterative refinement using multiple decoders are applied; both are methods of learning so that the latent vector input to the decoder contains as much information as possible, which leads to an improvement in the output speech.
  • however, the conventional monotonic aligner is a hard aligner, and its performance is inferior to that of a soft aligner.
  • the conventional method using multiple decoders has a disadvantage in that it increases parameters of a learning model and is inefficient.
  • a speech synthesis method and apparatus using an adversarial learning technique is an invention designed to solve the above-described problems, and provides an end-to-end non-autoregressive speech synthesis method and system based on adversarial learning. More specifically, the purpose is to reduce the delay occurring when synthesizing a voice compared to the prior art by improving the existing non-autoregressive voice synthesis model through an adversarial learning method.
  • the speech synthesis apparatus 200 using the adversarial learning technique uses an aligner based on Gaussian upsampling to upsample the hidden representation so that its length equals the length of the mel-spectrogram.
  • the speech synthesizer 200 uses this hidden representation as an input to make the learned fake latent vector similar to the real latent vector containing the compressed target mel-spectrogram information.
  • a GAN training process using the mel-spectrogram as a reference input may be used.
  • the speech synthesis apparatus 200 may project the two latent vectors, the fake latent vector obtained from the input text and the real latent vector obtained from the mel-spectrogram, into the same semantic space.
  • after the model learning process of step S130 is completed, the fake latent vector used in the adversarial learning can be trained to contain more information than the latent vector of a model that is not trained through such a GAN process. Accordingly, the speech synthesis apparatus 200 can decode the mel-spectrogram from the latent vector in parallel with high performance, as sketched below.
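The following is a minimal, non-authoritative sketch of one such training step, assuming the module names and data flow described in this specification; the callables and the omission of gradient updates are illustrative assumptions:

```python
import numpy as np

def adversarial_training_step(mel, text_ids, mel_encoder, text_encoder, aligner,
                              generator, discriminator, mel_decoder):
    """One conceptual training step: drive the fake latent toward the real latent.

    All module arguments are assumed callables standing in for trained networks;
    loss computation and parameter updates are omitted (see Equations 4 and 5 below).
    """
    lr = mel_encoder(mel)                  # real latent vector Lr from the reference mel-spectrogram
    h = text_encoder(text_ids)             # text representation H
    u, d_target = aligner(h, mel)          # upsampled representation U and extracted target durations
    noise = np.random.randn(*u.shape)      # random noise N with the same length as U
    lf = generator(u, noise)               # fake latent vector Lf
    scores_real = discriminator(lr)        # discriminator output on the real latent
    scores_fake = discriminator(lf)        # discriminator output on the fake latent
    mel_hat = mel_decoder(lf)              # parallel (non-autoregressive) mel reconstruction
    return scores_real, scores_fake, mel_hat, d_target
```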
  • the synthesis device learns a non-autoregressive speech synthesis model through adversarial learning.
  • the synthesis device learns a latent variable generator for generating compressed speech information from text and random noise through adversarial learning.
  • the synthesis device can train an adversarial model so that, by making the distance between the real latent vector and the fake latent vector small, sufficient information is included in the latent vector generated from only random noise and text information during actual speech synthesis. Accordingly, a speech feature vector can be generated from the latent vector in a non-autoregressive manner.
  • the conventional duration predictor uses a target duration extracted from a pre-trained TTS model or an ASR model; this process is cumbersome, and performance is limited depending on the extracted target duration.
  • the aligner according to the embodiment of the present specification has the advantage of being able to learn the target duration together with the entire model.
  • FIG. 2 is a block diagram showing the components of a speech synthesis apparatus according to an embodiment of the present specification.
  • the apparatus 200 for voice synthesis may include a voice input unit 210 , a controller 220 and a memory 230 .
  • the voice input unit 210 may receive data for voice synthesis and adversarial model learning. Specifically, a mel-spectrogram and text may be used as examples of data for speech synthesis and learning of the adversarial model.
  • the controller 220 learns an adversarial model using the input text and the mel-spectrogram.
  • the controller 220 may train the adversarial model in a non-autoregressive manner.
  • the controller 220 may include a voice feature vector encoder, a text encoder, an aligner, a latent variable generator, a discriminator, and a voice feature vector decoder.
  • the voice feature vector encoder and the text encoder may respectively analyze the input mel spectrogram and text data.
  • the aligner can upsample the text encoder output to equal the length of the mel spectrogram of the target speech.
  • the aligner aligns the text information generated through the text encoder and can pass it, together with random noise of the same length, to the latent variable generator.
  • the speech feature vector encoder and the speech feature vector decoder may be formed as an autoencoder structure.
  • the voice feature vector encoder generates latent variables that compress information of the voice feature vector.
  • the speech feature vector decoder may reconstruct speech feature vectors from latent variables.
  • the latent variable generator can generate latent variables in which speech feature vector information is compressed from random noise and text.
  • the latent variable generator may be learned through adversarial learning by the control unit.
  • the discriminator may discriminate between latent variables generated by the voice feature vector encoder and latent variables generated by the latent variable generator.
  • the discriminator can project two latent variables into the same semantic space.
  • for actual speech synthesis, the control unit 220 may use a text encoder, an aligner, a latent variable generator, and a speech feature vector decoder.
  • the text encoder may analyze the input text data, and the aligner upsamples the text encoder output to be equal to the mel spectrogram length of the target speech.
  • the latent variable generator may generate a latent vector in which speech feature vector information is compressed from text and random noise having the same length as the target speech feature vector length.
  • the speech feature vector decoder may generate a speech feature vector using the latent vector.
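To make the component list concrete, the stubs below mirror the modules described above. The layer choices (plain linear stacks) and dimensions are illustrative assumptions only; the patent does not specify network architectures:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encodes input text tokens into the representation H."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)
    def forward(self, tokens):                         # (T_text,) -> (T_text, dim)
        return torch.relu(self.proj(self.embed(tokens)))

class SpeechFeatureEncoder(nn.Module):
    """Compresses the mel-spectrogram into the real latent vector Lr."""
    def __init__(self, n_mels: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Linear(n_mels, latent_dim)
    def forward(self, mel):                             # (T_mel, n_mels) -> (T_mel, latent_dim)
        return self.proj(mel)

class LatentVariableGenerator(nn.Module):
    """Produces the fake latent vector Lf from the upsampled text U and random noise N."""
    def __init__(self, dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, latent_dim))
    def forward(self, u, noise):
        return self.net(torch.cat([u, noise], dim=-1))

class Discriminator(nn.Module):
    """Scores latent sequences so that real and fake latents are projected into the same space."""
    def __init__(self, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2), nn.Linear(latent_dim, 1))
    def forward(self, latent):
        return self.net(latent)

class SpeechFeatureDecoder(nn.Module):
    """Reconstructs the mel-spectrogram from a latent sequence, all frames in parallel."""
    def __init__(self, latent_dim: int, n_mels: int):
        super().__init__()
        self.proj = nn.Linear(latent_dim, n_mels)
    def forward(self, latent):
        return self.proj(latent)
```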
  • the memory 230 may store the adversarial model learned by the controller 220, and reference data necessary for learning in the controller 220 may be stored.
  • FIG. 3 is a flowchart illustrating a process of predicting a mel-spectrogram from input data.
  • learning can be performed in a direction that minimizes the distance, in the semantic space, between the real latent vector (Lr) generated from the speech feature vector (Y) and the fake latent vector (Lf) generated by inputting the input text (X).
  • the speech synthesis apparatus 200 may learn a real latent vector (Lr) for compressing speech information through an auto-encoder structure and adversarial learning. Specifically, the speech synthesis apparatus 200 receives a speech feature vector (Y) as an input and learns a latent vector (Lr) for compressing speech information through an autoencoder structure composed of a speech feature vector encoder and a speech feature vector decoder.
  • here, Y denotes the speech feature vector and Lr denotes the latent vector.
  • the latent variable generator of the speech synthesizer 200 may generate a fake latent vector Lf from a representation learned from the random noise input N and the input text 302.
  • the discriminator 310 of the speech synthesizer 200 discriminates two latent vectors (Real and Fake, 308 and 309).
  • the discriminator 310 learns to make the two latent vectors similar through adversarial learning. That is, the discriminator 310 projects the real latent vector Lr and the fake latent vector Lf into the same semantic space.
  • through this adversarial training, the fake latent vector can be made to include sufficient information for estimating a speech feature vector from X through parallel decoding.
  • the text-based learned representation (H) is upsampled to be equal to the length of the speech feature vector.
  • the aligner 305 transforms the representation H into the representation U.
  • the aligner can output U by transforming the length of the representation H.
  • the aligner performs learning using the duration predictor 406 and the attention unit 405.
  • the aligner may extract a duration d corresponding to each text token from the alignment A obtained through the attention calculation of the attention unit 405 on the target speech feature vector (Y) and the text (X).
  • the extracted d may be used as a target for learning a duration predictor.
  • the duration predictor learns, with H as an input, to reduce the MSE loss between the predicted duration d̂ and the target duration d.
  • the duration of each token is used to calculate weights for upsampling based on a Gaussian distribution.
  • the synthesis device finds the center position of each token in the output sequence from the per-token duration using Equation 1 below.
  • the speech synthesis apparatus 200 may calculate a weight for upsampling using Equation 2 below, based on a Gaussian distribution with a standard deviation of σ centered at the token center; for example, σ² may be set to 10.0.
  • the speech synthesis apparatus 200 may obtain the upsampled representation through a weighted sum of the weights calculated through Equation 2 and the output representation of the text encoder.
  • the voice synthesizer 200 may calculate the upsampled vector u_t of the t-th frame using Equation 3 below, as a weighted sum of the i-th token representation h_i and the weights calculated through Equation 2 (a hedged sketch of this upsampling follows below).
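Equations 1 to 3 are not reproduced in this text. The sketch below assumes the standard Gaussian upsampling formulation that the description suggests: token centers at the cumulative duration minus half the token's own duration, Gaussian weights with variance σ², and a weighted sum over token representations. Variable names are illustrative:

```python
import numpy as np

def gaussian_upsample(h, d, sigma2=10.0):
    """Upsample token representations h (num_tokens x dim) to frame level.

    h:      per-token representations from the text encoder.
    d:      per-token durations in frames (target d or predicted d-hat).
    sigma2: Gaussian variance; the description suggests 10.0 as an example value.
    """
    d = np.asarray(d, dtype=float)
    num_frames = int(round(float(np.sum(d))))
    # Assumed Equation 1: center of each token within the output sequence
    c = np.cumsum(d) - d / 2.0
    t = np.arange(num_frames) + 0.5
    # Assumed Equation 2: normalized Gaussian weights between frames and tokens
    logits = -((t[:, None] - c[None, :]) ** 2) / (2.0 * sigma2)
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)
    # Assumed Equation 3: upsampled vector u_t as a weighted sum of token representations
    u = w @ np.asarray(h)                       # (num_frames, dim)
    return u, w
```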
  • the speech synthesis apparatus 200 may estimate a desired target speech feature vector by using text X and random noise N as inputs during actual speech synthesis.
  • the speech synthesis apparatus 200 first converts the text X into H through the text encoder, and the aligner upsamples it into U.
  • the aligner 305 may estimate the length of each token from H, which is an output of the text encoder, through a duration predictor.
  • the aligner 305 may determine the length of random noise input to the latent variable generator when synthesizing real speech, and perform Gaussian upsampling.
  • the generator 307 may take U and N as inputs and output a latent vector to be input to a speech feature vector decoder.
  • the generator 307 may pass the latent vector to the speech feature vector decoder, and the decoder may output the speech feature vector.
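Putting this inference-time flow together, the sketch below reuses the gaussian_upsample function above; the module arguments are assumed callables standing in for the trained text encoder, duration predictor, latent variable generator, and speech feature vector decoder:

```python
import numpy as np

def synthesize(text_ids, text_encoder, duration_predictor, generator, mel_decoder, sigma2=10.0):
    """Non-autoregressive synthesis sketch: text -> H -> U -> latent -> mel."""
    h = text_encoder(text_ids)                  # token representations H
    d_hat = duration_predictor(h)               # predicted per-token durations
    u, _ = gaussian_upsample(h, d_hat, sigma2)  # frame-level representation U
    noise = np.random.randn(*u.shape)           # random noise N with the same length as U
    latent = generator(u, noise)                # latent vector with compressed speech information
    return mel_decoder(latent)                  # all mel frames decoded in parallel
```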
  • FIG. 4 is a diagram for explaining the function of an aligner according to an embodiment of the present invention.
  • when the speech synthesis apparatus 200 performs adversarial learning, the loss required for learning the adversarial model is as shown in Equation 4 below.
  • Loss_recon denotes the reconstruction loss.
  • Loss_dur denotes the duration prediction loss.
  • Loss_adv,G and Loss_adv,D denote the adversarial losses of the generator and the discriminator, respectively.
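Equation 4 itself is not reproduced in this text; a form consistent with the terms defined above would be the following (a hedged reconstruction, not the patent's exact notation):

```latex
\mathcal{L}_{total} \;=\; \mathcal{L}_{recon} \;+\; \mathcal{L}_{dur} \;+\; \mathcal{L}_{adv},
\qquad
\mathcal{L}_{recon} \;=\; \lVert Y - \hat{Y} \rVert_{1},
\qquad
\mathcal{L}_{dur} \;=\; \lVert d - \hat{d} \rVert_{2}^{2}
```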
  • the reconstruction loss can be calculated as the L1 loss between the target mel-spectrogram Y and the predicted mel-spectrogram Ŷ output from the mel decoder.
  • the duration prediction loss can be calculated as the MSE loss between the target duration d and the predicted duration d̂.
  • the adversarial loss is calculated using the hinge version of the adversarial loss through Equation 5 below (a hedged sketch follows).
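Equation 5 is likewise not reproduced; the sketch below assumes the standard hinge GAN formulation applied to the discriminator scores on the real latent Lr and the fake latent Lf, combined with the reconstruction and duration terms described above. Function and weight names are illustrative:

```python
import numpy as np

def discriminator_hinge_loss(scores_real, scores_fake):
    """Hinge loss for the discriminator over its scores on real and fake latents."""
    return np.mean(np.maximum(0.0, 1.0 - scores_real)) + np.mean(np.maximum(0.0, 1.0 + scores_fake))

def generator_hinge_loss(scores_fake):
    """Hinge-style adversarial loss for the generator side."""
    return -np.mean(scores_fake)

def total_generator_loss(mel_true, mel_pred, d_true, d_pred, scores_fake,
                         w_recon=1.0, w_dur=1.0, w_adv=1.0):
    """Assumed combination of the loss terms named in Equation 4."""
    loss_recon = np.mean(np.abs(np.asarray(mel_true) - np.asarray(mel_pred)))        # L1 reconstruction
    loss_dur = np.mean((np.asarray(d_true) - np.asarray(d_pred)) ** 2)               # MSE duration loss
    loss_adv = generator_hinge_loss(np.asarray(scores_fake))                          # hinge adversarial term
    return w_recon * loss_recon + w_dur * loss_dur + w_adv * loss_adv
```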
  • FIG. 5 is a diagram illustrating a process of extracting an aligner and a target period according to an embodiment of the present invention.
  • the aligner may calculate attention between the mel encoder output and the text encoder output.
  • the aligner 305 may extract the target duration from the alignment A derived by calculating the attention; the duration predictor can then be trained using the MSE loss between the target duration and the predicted duration, after which the aligner can perform upsampling.
  • the aligner calculates the center c of the output segment of Y corresponding to each token from d (or d̂) using Equation 6 below.
  • the aligner 305 may generate a new attention weight w using Equation 7 below by calculating a Gaussian distribution having a standard deviation of σ around c.
  • the aligner 305 may calculate the upsampled representation through the weighted sum of w and the token representations using Equation 8 below (Equations 6 to 8 correspond to Equations 1 to 3 above; a hedged sketch of the target-duration extraction follows).
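What is new in this passage relative to the upsampling sketch above is extracting the target duration d from the attention alignment A. The patent does not reproduce the corresponding equation here, so the sketch below assumes one common choice: assign each mel frame to its highest-attention token and count the assignments per token:

```python
import numpy as np

def durations_from_alignment(attention):
    """attention: (num_frames, num_tokens) soft alignment A between mel frames and text tokens.

    Returns one duration per token, obtained by assigning every frame to the token
    with the largest attention weight and counting how many frames each token receives.
    """
    num_tokens = attention.shape[1]
    assignments = np.argmax(attention, axis=1)              # best token index per frame
    d = np.bincount(assignments, minlength=num_tokens)      # frames assigned to each token
    return d.astype(np.float64)
```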
  • FIGS. 6 and 7 are graphs showing the alignment output from the mel decoder when the alignment is given as a target and when it is not.
  • FIG. 8 is a table comparing the actual experimental results of the prior art and the present invention.
  • FIG. 6 (a) is a graph showing the alignment generated through the generator
  • FIG. 6 (b) is a graph showing the alignment generated by the Mel decoder when the alignment generated by the generator is not given as a target
  • FIG. 6(c) is a graph showing the alignment generated by the mel decoder when the alignment is given as a target.
  • FIG. 7(a) is a graph showing the alignment generated through the generator
  • FIG. 7(b) is a graph showing the alignment generated by the generator when the alignment generated by the generator is not given as a target.
  • FIG. 7(c) is a graph showing the alignment generated by the generator when the alignment is given as a target.
  • it can be seen that the alignments output from the mel decoder and the generator when the alignment is given as a target (FIG. 6(c) and FIG. 7(c)) are more similar to the reference data (FIG. 6(a) and FIG. 7(a)) than the alignments output when it is not (FIG. 6(b) and FIG. 7(b)); through this, it can be seen that the voice synthesizer according to the present invention can synthesize a voice more similar to a real voice than a voice synthesizer according to the prior art.
  • the first row shows the experimental results in the case of synthesizing speech using the Tacotron2 algorithm
  • the second row shows the experimental results using the FastSpeech2 algorithm
  • the third row shows the experimental results in the case of synthesizing speech using the algorithm according to the present invention.
  • the inference time is shorter when the voice is synthesized according to the present invention than when it is synthesized according to the prior art, so there is an advantage that the delay occurring in synthesizing the voice can be reduced.
  • in a speech synthesis method and apparatus using an adversarial learning technique, by making the distance between the two latent vectors small, learning is performed so that the latent vector generated from only random noise and text information during actual speech synthesis contains sufficient information; consequently, a speech feature vector can be generated from the latent vector in a non-autoregressive manner.
  • since the target duration can be learned together with the entire model, the extra time required to learn the target duration separately can be reduced, and since the speech feature vector is generated in a non-autoregressive manner, there is an advantage of reducing the unnecessary delay that occurs during actual speech synthesis.
  • the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as, for example, a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions.
  • the processing device may run an operating system (OS) and one or more software applications running on the operating system.
  • a processing device may also access, store, manipulate, process, and generate data in response to execution of software.
  • the processing device may include a plurality of processing elements and/or a plurality of types of processing elements.
  • a processing device may include a plurality of processors or a processor and a controller. Other processing configurations are also possible, such as parallel processors.
  • Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may command the processing device independently or collectively.
  • Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device, to be interpreted by the processing device or to provide instructions or data to the processing device.
  • Software may be distributed over networked computer systems and stored or executed in a distributed manner.
  • Software and data may be stored on one or more computer readable media.
  • the method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer readable medium.
  • the computer readable medium may include program instructions, data files, data structures, etc. alone or in combination. Program commands recorded on the medium may be specially designed and configured for the embodiment or may be known and usable to those skilled in computer software.
  • Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.
  • Examples of program instructions include high-level language codes that can be executed by a computer using an interpreter, as well as machine language codes such as those produced by a compiler.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

According to one embodiment, a speech synthesis method using an adversarial learning technique may comprise the steps of: receiving voice data input; training an adversarial model for speech synthesis on the basis of the voice data input; and synthesizing a frame of a target speech using the adversarial model, wherein the step of synthesizing the frame of the speech comprises synthesizing the frame of the target speech using a non-autoregressive method.
PCT/KR2023/002229 2022-02-18 2023-02-15 Speech synthesis method and device using an adversarial learning technique WO2023158226A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020220021354A KR102613030B1 (ko) 2022-02-18 2022-02-18 Speech synthesis method and apparatus using an adversarial learning technique
KR10-2022-0021354 2022-02-18

Publications (1)

Publication Number Publication Date
WO2023158226A1 true WO2023158226A1 (fr) 2023-08-24

Family

ID=87578632

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2023/002229 WO2023158226A1 (fr) 2022-02-18 2023-02-15 Speech synthesis method and device using an adversarial learning technique

Country Status (2)

Country Link
KR (1) KR102613030B1 (fr)
WO (1) WO2023158226A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727290A (zh) * 2024-02-18 2024-03-19 厦门她趣信息技术有限公司 Speech synthesis method, apparatus, device and readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101871604B1 (ko) 2016-12-15 2018-06-27 한양대학교 산학협력단 Method and apparatus for estimating reverberation time based on multi-channel microphones using a deep neural network
KR101988504B1 (ko) 2019-02-28 2019-10-01 아이덴티파이 주식회사 Reinforcement learning method using a virtual environment generated by deep learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200098353A1 (en) * 2018-09-28 2020-03-26 Capital One Services, Llc Adversarial learning framework for persona-based dialogue modeling
KR102275656B1 * 2019-09-26 2021-07-09 국방과학연구소 Robust speech enhancement training method using an adversarial training model, and apparatus therefor

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JAEHYEON KIM; JUNGIL KONG; JUHEE SON: "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech", arXiv.org, Cornell University Library, 11 June 2021 (2021-06-11), XP081988468 *
JUHEON LEE; HYEONG-SEOK CHOI; CHANG-BIN JEON; JUNGHYUN KOO; KYOGU LEE: "Adversarially Trained End-to-end Korean Singing Voice Synthesis System", arXiv.org, Cornell University Library, 6 August 2019 (2019-08-06), XP081456376 *
YANG JINHYEOK; BAE JAE-SUNG; BAK TAEJUN; KIM YOUNG-IK; CHO HOON-YOUNG: "GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis", INTERSPEECH 2021, ISCA, 2021, pages 2202-2206, DOI: 10.21437/Interspeech.2021-971, XP093085648 *

Also Published As

Publication number Publication date
KR102613030B1 (ko) 2023-12-12
KR20230124266A (ko) 2023-08-25

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23756638

Country of ref document: EP

Kind code of ref document: A1