CN112289304A - A Multi-Speaker Speech Synthesis Method Based on Variational Autoencoder - Google Patents

A Multi-Speaker Speech Synthesis Method Based on Variational Autoencoder

Info

Publication number
CN112289304A
CN112289304A (application CN201910671050.5A; granted publication CN112289304B)
Authority
CN
China
Prior art keywords
level
phoneme
duration
frame
normalized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910671050.5A
Other languages
Chinese (zh)
Other versions
CN112289304B (en)
Inventor
张鹏远
蒿晓阳
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN201910671050.5A
Publication of CN112289304A
Application granted
Publication of CN112289304B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/086 Detection of language
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a multi-speaker speech synthesis method based on a variational autoencoder, which comprises the following steps: extracting the phoneme-level duration parameters and frame-level acoustic parameters of a clean utterance from the speaker to be synthesized, inputting the normalized phoneme-level duration parameters into a first variational autoencoder, and outputting a duration speaker tag; inputting the normalized frame-level acoustic parameters into a second variational autoencoder, and outputting an acoustic speaker tag; extracting frame-level linguistic features and phoneme-level linguistic features from the speech material to be synthesized, which contains a plurality of speakers; inputting the duration speaker tag and the normalized phoneme-level linguistic features into a duration prediction network, and outputting the predicted duration of the current phoneme; obtaining the frame-level linguistic features of the phoneme from the predicted duration, inputting these frame-level linguistic features and the acoustic speaker tag into an acoustic parameter prediction network, and outputting the normalized acoustic parameters of the predicted speech; and inputting the normalized acoustic parameters of the predicted speech into a vocoder and outputting the synthesized speech signal.

Description

Multi-speaker speech synthesis method based on a variational autoencoder
Technical Field
The invention relates to speech synthesis methods, and in particular to a multi-speaker speech synthesis method based on a variational autoencoder.
Background
Speech synthesis, the conversion of input text into speech, is an important technology and a major research topic in the field of human-computer interaction.
Traditional speech synthesis requires recording a corpus from a single speaker with comprehensive phoneme coverage so that speech can be synthesized for arbitrary text; this leads to high recording cost, low efficiency, and synthesis limited to that single speaker's voice. Multi-speaker speech synthesis allows speech from different speakers to be recorded in parallel and can synthesize the voices of different speakers. Conventional multi-speaker speech synthesis usually requires the speaker identity of each utterance and a manually assigned speaker tag, such as a one-hot encoding of the speaker; this is supervised learning, and when the number of speakers is large the synthesized speech often exhibits timbre mixing among speakers. The present method introduces a variational autoencoder network and obtains the speaker tag by sampling the network's output.
Disclosure of Invention
The invention aims to solve two problems of traditional multi-speaker speech synthesis methods: the reliance on supervised speaker labels and the timbre mixing that occurs among speakers when the number of speakers is large. To this end, it introduces a variational autoencoder network and obtains speaker tags by sampling the network's output, yielding a multi-speaker speech synthesis method based on a variational autoencoder.
To achieve the above object, the present invention provides a multi-speaker speech synthesis method based on a variational autoencoder, the method comprising:
extracting the phoneme-level duration parameters and frame-level acoustic parameters of a clean utterance from the speaker to be synthesized and normalizing them; inputting the normalized phoneme-level duration parameters into a first variational autoencoder and outputting a duration speaker tag; inputting the normalized frame-level acoustic parameters into a second variational autoencoder and outputting an acoustic speaker tag;
extracting frame-level linguistic features and phoneme-level linguistic features from the speech material to be synthesized, which contains a plurality of speakers, and normalizing them;
inputting the duration speaker tag and the normalized phoneme-level linguistic features into a duration prediction network and outputting the predicted duration of the current phoneme;
obtaining the frame-level linguistic features of the phoneme from the predicted duration of the current phoneme, inputting these frame-level linguistic features and the acoustic speaker tag into an acoustic parameter prediction network, and outputting the normalized acoustic parameters of the predicted speech;
inputting the normalized acoustic parameters of the predicted speech into a vocoder and outputting the synthesized speech signal.
As an improvement of the above method, the first variational autoencoder and the second variational autoencoder each comprise 5 one-dimensional convolution layers, 1 long short-term memory (LSTM) layer, and 1 fully connected layer. The convolution layers have a kernel size of 5, a stride of 2, and 128 channels; the fully connected layer outputs the standard deviation and mean of the predicted Gaussian distribution; the LSTM layer contains 128 neurons; and the activation function of each neuron is the rectified linear unit:
f(x)=max(0,x)
The input of the first variational autoencoder is the normalized phoneme-level duration parameters and the input of the second variational autoencoder is the normalized frame-level acoustic parameters; the output is the mean and standard deviation of a Gaussian distribution. The relative entropy between the predicted encoded distribution and the true distribution is computed as:
L_KL = 1/2 · Σ_{n=1..N} ( σ_n² + u_n² − 1 − ln σ_n² )
where N is the dimension of the Gaussian distribution, σ_n and u_n are the n-th dimensional standard deviation and mean of σ(x) and u(x) of the Gaussian distribution predicted by the variational autoencoder, p_θ(z) denotes the latent-vector distribution predicted by the variational autoencoder, and the true latent-vector distribution is assumed to be a standard Gaussian.
The latent vector is obtained by re-parameterised sampling:
z_N = u(x) + σ(x) · ε_N
where z_N is the latent vector, x is the normalized phoneme-level duration parameter (or the normalized frame-level acoustic parameter), and ε_N ~ N(0, I) is a vector drawn from a standard Gaussian. The gradients back-propagated through the output of the variational autoencoder are computed as:
∂z_N/∂u(x) = e_N,  ∂z_N/∂σ(x) = ε_N
where e_N is the N-dimensional all-ones vector. The latent vector z_N is converted into a 128-dimensional speaker tag by a fully connected layer containing 256 neurons; the output of the first variational autoencoder is the duration speaker tag and the output of the second is the acoustic speaker tag.
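For illustration, a minimal PyTorch sketch of such a reference encoder is given below. The exact layer ordering, the latent dimension, the log-standard-deviation parameterisation, and the class and variable names are assumptions made for the example and are not taken from the patent.

```python
import torch
import torch.nn as nn

class SpeakerVAEEncoder(nn.Module):
    """Reference encoder: 5 x Conv1d (kernel 5, stride 2, 128 channels) -> LSTM(128)
    -> FC predicting the Gaussian statistics -> re-parameterised sample -> speaker tag."""

    def __init__(self, in_dim, latent_dim=16, tag_dim=128):
        super().__init__()
        convs, ch = [], in_dim
        for _ in range(5):
            convs += [nn.Conv1d(ch, 128, kernel_size=5, stride=2, padding=2), nn.ReLU()]
            ch = 128
        self.convs = nn.Sequential(*convs)
        self.lstm = nn.LSTM(128, 128, batch_first=True)        # 1 LSTM layer, 128 units
        self.to_stats = nn.Linear(128, 2 * latent_dim)          # mean and log-std
        self.to_tag = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                    nn.Linear(256, tag_dim))    # 128-dim speaker tag

    def forward(self, x):
        # x: (batch, frames_or_phonemes, in_dim), already normalized
        h = self.convs(x.transpose(1, 2)).transpose(1, 2)
        _, (h_last, _) = self.lstm(h)
        u, log_sigma = self.to_stats(h_last[-1]).chunk(2, dim=-1)
        sigma = log_sigma.exp()
        eps = torch.randn_like(sigma)                           # epsilon ~ N(0, I)
        z = u + sigma * eps                                     # re-parameterisation trick
        # KL divergence to the standard Gaussian prior, averaged over the batch
        kl = 0.5 * (sigma ** 2 + u ** 2 - 1.0 - 2.0 * log_sigma).sum(dim=-1).mean()
        return self.to_tag(z), kl
```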
As an improvement of the above method, the duration prediction network comprises a fully connected layer and one bidirectional long short-term memory (BiLSTM) layer;
the input of the fully connected layer is the duration speaker tag and the normalized phoneme-level linguistic features; the output of the BiLSTM layer is the predicted duration of the current phoneme, and the loss function Loss_MSE is the mean square error between the normalized predicted speech duration parameter and the real speech duration parameter, which is minimized during training:
Loss_MSE = ( d − d̂ )²
where d is the duration parameter of the real speech and d̂ is the normalized duration parameter of the predicted speech;
the fully connected layer and the BiLSTM layer each have 256 neurons; all neurons use the rectified linear unit as activation function:
f(x)=max(0,x).
As an improvement of the above method, the acoustic parameter prediction network comprises a fully connected layer and three bidirectional long short-term memory (BiLSTM) layers;
the input of the fully connected layer is the acoustic speaker tag and the normalized frame-level linguistic features; the output of the three BiLSTM layers is the normalized acoustic parameters of the predicted speech, and the loss function Loss_MSE is the mean square error between the normalized acoustic parameters of the predicted speech and the acoustic parameters of the real speech, which is minimized during training:
Loss_MSE = Σ_j ( x_j − x̂_j )²
where x_j is the j-th dimensional value of the acoustic parameters of the real speech and x̂_j is the j-th dimensional value of the normalized acoustic parameters of the predicted speech;
the fully connected layer and the BiLSTM layers each have 256 neurons; all neurons use the rectified linear unit as activation function:
f(x)=max(0,x).
as an improvement of the above method, the method further comprises: extracting frame-level acoustic parameters, phoneme-level duration parameters, frame-level linguistic features and phoneme-level linguistic features from the recorded voice signals containing a plurality of speakers, and normalizing the frame-level acoustic parameters, the phoneme-level duration parameters, the frame-level linguistic features and the phoneme-level linguistic features respectively;
the frame-level acoustic parameters include: 60-dimensional Mel cepstrum coefficient and its first and second order difference, 1-dimensional fundamental frequency parameter and its first and second order difference, 1-dimensional non-periodic parameter and its first and second order difference, and 1-dimensional vowel consonant decision parameter;
the phoneme-level duration parameter comprises 1-dimensional duration information;
the frame-level linguistic features include 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information; the phone-level linguistic features include: 477-dimensional text pronunciation characteristics, 177-dimensional word segmentation and rhythm characteristics;
the frame-level acoustic parameters and the phoneme-level duration parameters are normalized by adopting a 0-mean value; the frame-level linguistic features and the phoneme-level linguistic features are normalized using a maximum and a minimum value.
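A minimal NumPy sketch of the two normalisation schemes follows; computing the statistics per dimension over the whole training set is an assumption about how the patent applies them.

```python
import numpy as np

def zero_mean_normalize(x, mean, std):
    """Zero-mean normalization for acoustic and duration parameters."""
    return (x - mean) / std

def max_min_normalize(x, x_min, x_max):
    """Max-min normalization for frame- and phoneme-level linguistic features."""
    return (x - x_min) / (x_max - x_min)

# Example: acoustic = zero_mean_normalize(acoustic, acoustic.mean(0), acoustic.std(0))
#          ling     = max_min_normalize(ling, ling.min(0), ling.max(0))
```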
As an improvement of the above method, the method further comprises a step of training the first variational autoencoder and the duration parameter prediction network, which specifically comprises:
inputting the normalized phoneme-level duration parameters into the first variational autoencoder, computing the loss function L_KL of the variational autoencoder, and outputting the duration speaker tag;
inputting the normalized phoneme-level linguistic features and the duration speaker tag into the duration parameter prediction network and computing the loss function Loss_MSE;
forming the optimization function C1 as the weighted sum of the loss function of the first variational autoencoder and the loss function of the duration parameter prediction network:
C1 = Loss_MSE + ω(n) · L_KL
where ω(n) is the weight applied to the encoder loss function and n is the number of passes over the whole training data;
performing gradient back-propagation by decreasing the value of the optimization function C1 and updating the network parameters, to obtain the trained first variational autoencoder and duration parameter prediction network.
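A minimal sketch of this joint training loop is shown below, reusing the encoder and prediction-network sketches above. The Adam optimiser, the learning rate, and the linear ramp used for omega(n) are assumptions; the patent defines ω(n) as a function of n but its exact schedule is not reproduced here.

```python
import torch

def omega_schedule(n, ramp_epochs=10):
    # Assumed annealing of the KL weight over passes through the training data.
    return min(1.0, n / ramp_epochs)

def train_duration_model(encoder, duration_net, loader, epochs=50, lr=1e-3):
    params = list(encoder.parameters()) + list(duration_net.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for n in range(epochs):                       # n: pass over the whole training data
        for dur_params, ling_feats, target_dur in loader:
            tag, kl = encoder(dur_params)         # duration speaker tag and L_KL
            pred = duration_net(ling_feats, tag)  # normalized predicted durations
            mse = ((pred - target_dur) ** 2).mean()
            c1 = mse + omega_schedule(n) * kl     # weighted sum C1
            opt.zero_grad()
            c1.backward()                         # gradient back-propagation
            opt.step()
```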
As an improvement of the above method, the method further comprises a step of training the second variational autoencoder and the acoustic parameter prediction network, which comprises:
inputting the normalized frame-level acoustic parameters into the second variational autoencoder, computing the loss function L_KL of the second variational autoencoder, and outputting the acoustic speaker tag;
inputting the acoustic speaker tag and the normalized frame-level linguistic features into the acoustic parameter prediction network and computing the loss function Loss_MSE;
forming the optimization function C2 as the weighted sum of the loss function of the second variational autoencoder and the loss function of the acoustic parameter prediction network:
C2 = Loss_MSE + ω(n) · L_KL
where ω(n) is the weight applied to the encoder loss function and n is the number of passes over the whole training data;
performing gradient back-propagation by decreasing the value of the optimization function C2 and updating the network parameters, to obtain the trained second variational autoencoder and acoustic parameter prediction network.
As an improvement of the above method, extracting and normalizing the phoneme-level duration parameters and frame-level acoustic parameters of a clean utterance from the speaker to be synthesized specifically comprises:
extracting the phoneme-level duration parameters of a clean utterance from the speaker to be synthesized, the phoneme-level duration parameters comprising 1-dimensional duration information;
extracting the frame-level acoustic parameters of the clean utterance, which include: 60-dimensional Mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional vowel/consonant decision parameter;
the frame-level acoustic parameters and phoneme-level duration parameters are normalized to zero mean.
As an improvement of the above method, extracting and normalizing frame-level linguistic features and phoneme-level linguistic features from the speech material to be synthesized, which contains a plurality of speakers, specifically comprises:
extracting the frame-level linguistic features, which comprise the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information;
extracting the phoneme-level linguistic features, which comprise 477-dimensional text pronunciation features and 177-dimensional word-segmentation and prosody features;
the frame-level linguistic features and phoneme-level linguistic features are normalized using their maximum and minimum values.
As an improvement of the above method, obtaining the frame-level linguistic features of the phoneme from the predicted duration specifically comprises: obtaining the relative position and absolute position of the current frame with respect to the predicted duration, thereby obtaining the frame-level linguistic features of the phoneme.
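A sketch of this frame expansion is given below; the particular choice of the four position features (forward and backward absolute indices and forward and backward relative positions) and the 5 ms frame shift are assumptions consistent with, but not confirmed by, the 4-dimensional frame position information described above.

```python
import numpy as np

def expand_to_frames(phoneme_feats, predicted_durations, frame_shift=0.005):
    """phoneme_feats: (num_phonemes, 624); predicted_durations: seconds per phoneme."""
    frames = []
    for feat, dur in zip(phoneme_feats, predicted_durations):
        n_frames = max(1, int(round(dur / frame_shift)))
        for i in range(n_frames):
            pos = [i,                              # absolute position from the phoneme start
                   n_frames - 1 - i,               # absolute position from the phoneme end
                   i / n_frames,                   # relative position from the start
                   (n_frames - 1 - i) / n_frames]  # relative position from the end
            frames.append(np.concatenate([feat, pos]))
    return np.stack(frames)                        # (total_frames, 628)
```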
The invention has the following advantages:
The speaker tag information is learned without supervision by a variational autoencoder; speech from different speakers yields different latent-vector distributions, and speaker tags obtained by sampling these latent vectors allow the voices of different speakers to be synthesized.
Drawings
FIG. 1 is a flow chart of the multi-speaker speech synthesis method based on a variational autoencoder of the present invention;
FIG. 2 is a block diagram of the variational autoencoder and acoustic parameter prediction network of the present invention;
FIG. 3 is a block diagram of the variational autoencoder and duration parameter prediction network of the present invention.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings.
The invention proposes a multi-speaker speech synthesis method based on a variational autoencoder; the method comprises a training stage and a synthesis stage.
As shown in FIG. 1, the training stage comprises:
Step 101) extracting frame-level acoustic parameters, phoneme-level duration parameters, and frame-level and phoneme-level linguistic features from recorded speech signals containing a plurality of speakers, and normalizing each of them.
The frame-level acoustic parameters have 187 dimensions and comprise: 60-dimensional Mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional vowel/consonant decision parameter. The phoneme-level linguistic features have 624 dimensions and comprise 477-dimensional text pronunciation features and 177-dimensional word-segmentation and prosody features. The frame-level linguistic features comprise the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information. The duration parameters comprise 1-dimensional duration information.
The frame-level and phoneme-level linguistic features are normalized using their maximum and minimum values, computed as:
x̂_i = ( x_i − min_i ) / ( max_i − min_i )
where x̂_i is the normalized value of the i-th dimensional feature, x_i is its value before normalization, and max_i and min_i are the maximum and minimum values of the i-th dimensional feature.
The acoustic parameters and duration parameters are normalized to zero mean, computed as:
x̂_i = ( x_i − u_i ) / σ_i
where x̂_i is the normalized value of the i-th dimensional feature, x_i is its value before normalization, and u_i and σ_i are the mean and standard deviation of the i-th dimensional feature.
Step 102) constructing a variational autoencoder network that takes the normalized frame-level acoustic parameters as input. The encoded distribution is assumed to be Gaussian; the output of the network is the mean and standard deviation of this Gaussian, and the relative entropy to the true distribution is computed and used as the encoder's loss function.
The encoder contains 5 one-dimensional convolution layers, 1 long short-term memory (LSTM) layer, and 1 fully connected layer. The convolution layers have a kernel size of 5, a stride of 2, and 128 channels; the fully connected layer outputs the standard deviation and mean of the Gaussian distribution predicted by the encoder; the LSTM layer contains 128 neurons; and the activation function of each neuron is the rectified linear unit:
f(x)=max(0,x)
The relative entropy between the predicted encoded distribution and the true distribution is computed as:
L_KL = 1/2 · Σ_{n=1..N} ( σ_n² + u_n² − 1 − ln σ_n² )
where N is the dimension of the Gaussian distribution, σ_n and u_n are the n-th dimensional standard deviation and mean of σ(x) and u(x) of the Gaussian distribution predicted by the variational autoencoder, p_θ(z) denotes the latent-vector distribution predicted by the variational autoencoder, and the true latent-vector distribution is assumed to be a standard Gaussian. The relative entropy L_KL is used as the loss function of the encoder.
Step 103) sampling from the distribution of step 102) to obtain a latent vector, which serves as the acoustic speaker tag.
To avoid the problem that direct sampling blocks gradient back-propagation, the latent vector is obtained by re-parameterised sampling:
z_N = u(x) + σ(x) · ε_N
where N is the dimension of the Gaussian distribution, z_N is the latent vector, x is the input normalized frame-level acoustic parameter, and ε_N ~ N(0, I) is a vector drawn from a standard Gaussian. The gradients back-propagated through the encoder output can then be computed as:
∂z_N/∂u(x) = e_N,  ∂z_N/∂σ(x) = ε_N
where e_N is the N-dimensional all-ones vector.
The latent vector z_N is converted into the 128-dimensional acoustic speaker tag by a fully connected layer containing 256 neurons.
Step 104) constructing an acoustic parameter prediction network that takes the normalized frame-level linguistic features and the acoustic speaker tag as input and outputs the normalized acoustic parameters of the predicted speech; the mean square error between the predicted and real acoustic parameters is computed and used as the loss function of this network.
As shown in FIG. 2, the acoustic parameter prediction network comprises a fully connected layer and three bidirectional long short-term memory (BiLSTM) layers.
The input of the fully connected layer is the acoustic speaker tag and the normalized frame-level linguistic features; the output of the three BiLSTM layers is the normalized acoustic parameters of the predicted speech, and the loss function Loss_MSE is the mean square error between the normalized acoustic parameters of the predicted speech and the acoustic parameters of the real speech:
Loss_MSE = Σ_j ( x_j − x̂_j )²
where x_j is the j-th dimensional value of the acoustic parameters of the real speech and x̂_j is the j-th dimensional value of the normalized acoustic parameters of the predicted speech.
The fully connected layer and the BiLSTM layers each have 256 neurons; all neurons use the rectified linear unit as activation function:
f(x)=max(0,x).
Step 105) forming an optimization function as the weighted sum of the loss function of the variational autoencoder of step 102) and the loss function of the acoustic parameter prediction network of step 104), performing gradient back-propagation by decreasing the value of this optimization function, and updating the network parameters to obtain the trained networks.
The optimization function C2 obtained by the weighted sum of the encoder loss and the acoustic parameter prediction network loss is:
C2 = Loss_MSE + ω(n) · L_KL
where ω(n) is the weight applied to the encoder loss function, expressed as a function of n, the number of passes over the training data.
Step 106) constructing another variational autoencoder network with the same structure as in step 102), taking the normalized phoneme-level duration parameters as input and outputting the duration speaker tag, and at the same time constructing a duration prediction network that takes the duration speaker tag and the phoneme-level linguistic features as input and outputs the predicted duration parameter.
As shown in FIG. 3, the duration prediction network comprises a fully connected layer and one bidirectional long short-term memory (BiLSTM) layer.
The input of the fully connected layer is the duration speaker tag and the normalized phoneme-level linguistic features; the output of the BiLSTM layer is the predicted duration of the current phoneme, and the loss function Loss_MSE is the mean square error between the normalized predicted speech duration parameter and the real speech duration parameter:
Loss_MSE = ( d − d̂ )²
where d is the duration parameter of the real speech and d̂ is the normalized duration parameter of the predicted speech.
The fully connected layer and the BiLSTM layer each have 256 neurons; all neurons use the rectified linear unit as activation function:
f(x)=max(0,x).
Step 107) training the variational autoencoder and the duration parameter prediction network of step 106):
the normalized phoneme-level duration parameters are input into the variational autoencoder, its loss function L_KL is computed, and the duration speaker tag is output;
the normalized phoneme-level linguistic features and the duration speaker tag are input into the duration parameter prediction network and its loss function Loss_MSE is computed;
the optimization function C1 is formed as the weighted sum of the loss function of the variational autoencoder and the loss function of the duration parameter prediction network:
C1 = Loss_MSE + ω(n) · L_KL
where ω(n) is the weight applied to the encoder loss function and n is the number of passes over the whole training data;
gradient back-propagation is performed by decreasing the value of the optimization function C1 and the network parameters are updated, yielding the trained variational autoencoder and duration parameter prediction network.
The synthesis stage comprises:
Step 201) extracting the phoneme-level duration parameters and frame-level acoustic parameters of a clean utterance from the speaker to be synthesized and normalizing them; the normalized phoneme-level duration parameters are input into the variational autoencoder trained in step 107) and the duration speaker tag is output; the normalized frame-level acoustic parameters are input into the variational autoencoder trained in step 105) and the acoustic speaker tag is output.
The phoneme-level duration parameters comprise 1-dimensional duration information. The frame-level acoustic parameters include: 60-dimensional Mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional vowel/consonant decision parameter. The frame-level acoustic parameters and phoneme-level duration parameters are normalized to zero mean.
Step 202) extracting frame-level linguistic features and phoneme-level linguistic features from the speech material to be synthesized, which contains a plurality of speakers, and normalizing them.
The frame-level linguistic features comprise the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information; the phoneme-level linguistic features comprise 477-dimensional text pronunciation features and 177-dimensional word-segmentation and prosody features. The frame-level and phoneme-level linguistic features are normalized using their maximum and minimum values.
Step 203) inputting the duration speaker tag and the normalized phoneme-level linguistic features into the duration prediction network trained in step 107) and outputting the predicted duration of the current phoneme.
Step 204) obtaining the frame-level linguistic features of the current phoneme from its predicted duration, inputting these frame-level linguistic features and the acoustic speaker tag into the acoustic parameter prediction network trained in step 105), and outputting the normalized acoustic parameters of the predicted speech.
The relative and absolute positions of the current frame with respect to the predicted duration are obtained from the predicted duration, thereby yielding the frame-level linguistic features of the phoneme.
Step 205) inputting the normalized acoustic parameters of the predicted speech into the vocoder and outputting the synthesized speech signal.
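Tying the pieces together, the following sketch shows how the trained components could be chained at synthesis time. The Vocoder object, the stats dictionary of normalisation statistics, and the frame_feats_fn helper are hypothetical stand-ins introduced only for this example.

```python
import torch

def synthesize(dur_encoder, ac_encoder, duration_net, acoustic_net, vocoder,
               clean_dur_params, clean_ac_params, phoneme_feats, frame_feats_fn, stats):
    """frame_feats_fn: callable that expands phoneme-level features to frame level given
    the de-normalized predicted durations (e.g. a wrapper around expand_to_frames above);
    vocoder: hypothetical WORLD-style wrapper exposing a synthesize() method."""
    with torch.no_grad():
        dur_tag, _ = dur_encoder(clean_dur_params)        # duration speaker tag
        ac_tag, _ = ac_encoder(clean_ac_params)           # acoustic speaker tag

        pred_dur = duration_net(phoneme_feats, dur_tag)   # normalized durations
        pred_dur = pred_dur * stats["dur_std"] + stats["dur_mean"]

        frame_feats = frame_feats_fn(phoneme_feats, pred_dur)
        pred_ac = acoustic_net(frame_feats, ac_tag)       # normalized acoustic parameters
        pred_ac = pred_ac * stats["ac_std"] + stats["ac_mean"]

    return vocoder.synthesize(pred_ac.cpu().numpy())      # synthesized waveform
```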
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A multi-speaker speech synthesis method based on a variational autoencoder, the method comprising:
extracting the phoneme-level duration parameters and frame-level acoustic parameters of a clean utterance from the speaker to be synthesized and normalizing them; inputting the normalized phoneme-level duration parameters into a first variational autoencoder and outputting a duration speaker tag; inputting the normalized frame-level acoustic parameters into a second variational autoencoder and outputting an acoustic speaker tag;
extracting frame-level linguistic features and phoneme-level linguistic features from the speech material to be synthesized, which contains a plurality of speakers, and normalizing them;
inputting the duration speaker tag and the normalized phoneme-level linguistic features into a duration prediction network and outputting the predicted duration of the current phoneme;
obtaining the frame-level linguistic features of the phoneme from the predicted duration of the current phoneme, inputting these frame-level linguistic features and the acoustic speaker tag into an acoustic parameter prediction network, and outputting the normalized acoustic parameters of the predicted speech;
inputting the normalized acoustic parameters of the predicted speech into a vocoder and outputting the synthesized speech signal.
2. The multi-speaker speech synthesis method based on a variational autoencoder according to claim 1, wherein the first variational autoencoder and the second variational autoencoder each comprise 5 one-dimensional convolution layers, 1 long short-term memory layer, and 1 fully connected layer, wherein the convolution layers have a kernel size of 5, a stride of 2, and 128 channels, the fully connected layer outputs the standard deviation and mean of the predicted Gaussian distribution, the long short-term memory layer contains 128 neurons, and the activation function of each neuron is the rectified linear unit:
f(x)=max(0,x)
the input of the first variational autoencoder is the normalized phoneme-level duration parameters and the input of the second variational autoencoder is the normalized frame-level acoustic parameters; the output is the mean and standard deviation of a Gaussian distribution, and the relative entropy between the predicted encoded distribution and the true distribution is computed as:
L_KL = 1/2 · Σ_{n=1..N} ( σ_n² + u_n² − 1 − ln σ_n² )
where N is the dimension of the Gaussian distribution, σ_n and u_n are the n-th dimensional standard deviation and mean of σ(x) and u(x) of the Gaussian distribution predicted by the variational autoencoder, p_θ(z) denotes the latent-vector distribution predicted by the variational autoencoder, and the true latent-vector distribution is assumed to be a standard Gaussian;
the latent vector is obtained by re-parameterised sampling:
z_N = u(x) + σ(x) · ε_N
where z_N is the latent vector, x is the normalized phoneme-level duration parameter or the normalized frame-level acoustic parameter, and ε_N ~ N(0, I) is a vector drawn from a standard Gaussian; the gradients back-propagated through the output of the variational autoencoder are computed as:
∂z_N/∂u(x) = e_N,  ∂z_N/∂σ(x) = ε_N
where e_N is the N-dimensional all-ones vector; the latent vector z_N is converted into a 128-dimensional speaker tag by a fully connected layer containing 256 neurons, and the output of the first variational autoencoder is the duration speaker tag while that of the second is the acoustic speaker tag.
3. The multi-speaker speech synthesis method based on a variational autoencoder according to claim 2, wherein the duration prediction network comprises a fully connected layer and one bidirectional long short-term memory layer;
the input of the fully connected layer is the duration speaker tag and the normalized phoneme-level linguistic features; the output of the bidirectional long short-term memory layer is the predicted duration of the current phoneme, and the loss function Loss_MSE is the mean square error between the normalized predicted speech duration parameter and the real speech duration parameter:
Loss_MSE = ( d − d̂ )²
where d is the duration parameter of the real speech and d̂ is the normalized duration parameter of the predicted speech;
the fully connected layer and the bidirectional long short-term memory layer each have 256 neurons; all neurons use the rectified linear unit as activation function:
f(x)=max(0,x).
4. The multi-speaker speech synthesis method based on a variational autoencoder according to claim 2, wherein the acoustic parameter prediction network comprises a fully connected layer and three bidirectional long short-term memory layers;
the input of the fully connected layer is the acoustic speaker tag and the normalized frame-level linguistic features; the output of the three bidirectional long short-term memory layers is the normalized acoustic parameters of the predicted speech, and the loss function Loss_MSE is the mean square error between the normalized acoustic parameters of the predicted speech and the acoustic parameters of the real speech:
Loss_MSE = Σ_j ( x_j − x̂_j )²
where x_j is the j-th dimensional value of the acoustic parameters of the real speech and x̂_j is the j-th dimensional value of the normalized acoustic parameters of the predicted speech;
the fully connected layer and the bidirectional long short-term memory layers each have 256 neurons; all neurons use the rectified linear unit as activation function:
f(x)=max(0,x).
5. The multi-speaker speech synthesis method based on a variational autoencoder according to claim 3, wherein the method further comprises, beforehand: extracting frame-level acoustic parameters, phoneme-level duration parameters, frame-level linguistic features, and phoneme-level linguistic features from recorded speech signals containing a plurality of speakers, and normalizing each of them;
the frame-level acoustic parameters include: 60-dimensional Mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional vowel/consonant decision parameter;
the phoneme-level duration parameters comprise 1-dimensional duration information;
the frame-level linguistic features comprise the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information; the phoneme-level linguistic features comprise 477-dimensional text pronunciation features and 177-dimensional word-segmentation and prosody features;
the frame-level acoustic parameters and phoneme-level duration parameters are normalized to zero mean; the frame-level linguistic features and phoneme-level linguistic features are normalized using their maximum and minimum values.
6. The multi-speaker speech synthesis method based on a variational autoencoder according to claim 5, wherein the method further comprises, beforehand, a step of training the first variational autoencoder and the duration parameter prediction network, which specifically comprises:
inputting the normalized phoneme-level duration parameters into the first variational autoencoder, computing the loss function L_KL of the variational autoencoder, and outputting the duration speaker tag;
inputting the normalized phoneme-level linguistic features and the duration speaker tag into the duration parameter prediction network and computing the loss function Loss_MSE;
forming the optimization function C1 as the weighted sum of the loss function of the first variational autoencoder and the loss function of the duration parameter prediction network:
C1 = Loss_MSE + ω(n) · L_KL
where ω(n) is the weight applied to the encoder loss function and n is the number of passes over the whole training data;
performing gradient back-propagation by decreasing the value of the optimization function C1 and updating the network parameters, to obtain the trained first variational autoencoder and duration parameter prediction network.
7. The multi-speaker speech synthesis method based on a variational autoencoder according to claim 5, wherein the method further comprises, beforehand, a step of training the second variational autoencoder and the acoustic parameter prediction network, which comprises:
inputting the normalized frame-level acoustic parameters into the second variational autoencoder, computing the loss function L_KL of the second variational autoencoder, and outputting the acoustic speaker tag;
inputting the acoustic speaker tag and the normalized frame-level linguistic features into the acoustic parameter prediction network and computing the loss function Loss_MSE;
forming the optimization function C2 as the weighted sum of the loss function of the second variational autoencoder and the loss function of the acoustic parameter prediction network:
C2 = Loss_MSE + ω(n) · L_KL
where ω(n) is the weight applied to the encoder loss function and n is the number of passes over the whole training data;
performing gradient back-propagation by decreasing the value of the optimization function C2 and updating the network parameters, to obtain the trained second variational autoencoder and acoustic parameter prediction network.
8. The multi-speaker speech synthesis method based on a variational autoencoder according to claim 1, wherein extracting and normalizing the phoneme-level duration parameters and frame-level acoustic parameters of a clean utterance from the speaker to be synthesized specifically comprises:
extracting the phoneme-level duration parameters of a clean utterance from the speaker to be synthesized, the phoneme-level duration parameters comprising 1-dimensional duration information;
extracting the frame-level acoustic parameters of the clean utterance, which include: 60-dimensional Mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional vowel/consonant decision parameter;
the frame-level acoustic parameters and phoneme-level duration parameters are normalized to zero mean.
9. The multi-speaker speech synthesis method based on a variational autoencoder according to claim 1, wherein extracting and normalizing frame-level linguistic features and phoneme-level linguistic features from the speech material to be synthesized, which contains a plurality of speakers, specifically comprises:
extracting the frame-level linguistic features, which comprise the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information;
extracting the phoneme-level linguistic features, which comprise 477-dimensional text pronunciation features and 177-dimensional word-segmentation and prosody features;
the frame-level linguistic features and phoneme-level linguistic features are normalized using their maximum and minimum values.
10. The multi-speaker speech synthesis method based on a variational autoencoder according to claim 1, wherein obtaining the frame-level linguistic features of the phoneme from the predicted duration specifically comprises: obtaining the relative position and absolute position of the current frame with respect to the predicted duration, thereby obtaining the frame-level linguistic features of the phoneme.
CN201910671050.5A 2019-07-24 2019-07-24 A multi-speaker speech synthesis method based on variational autoencoder Active CN112289304B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910671050.5A CN112289304B (en) 2019-07-24 2019-07-24 A multi-speaker speech synthesis method based on variational autoencoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910671050.5A CN112289304B (en) 2019-07-24 2019-07-24 A multi-speaker speech synthesis method based on variational autoencoder

Publications (2)

Publication Number Publication Date
CN112289304A true CN112289304A (en) 2021-01-29
CN112289304B CN112289304B (en) 2024-05-31

Family

ID=74418960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910671050.5A Active CN112289304B (en) 2019-07-24 2019-07-24 A multi-speaker speech synthesis method based on variational autoencoder

Country Status (1)

Country Link
CN (1) CN112289304B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066511A (en) * 2021-03-16 2021-07-02 云知声智能科技股份有限公司 Voice conversion method and device, electronic equipment and storage medium
CN113409764A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113450761A (en) * 2021-06-17 2021-09-28 清华大学深圳国际研究生院 Parallel speech synthesis method and device based on variational self-encoder
CN113488022A (en) * 2021-07-07 2021-10-08 北京搜狗科技发展有限公司 Speech synthesis method and device
CN113707122A (en) * 2021-08-11 2021-11-26 北京搜狗科技发展有限公司 Method and device for constructing voice synthesis model
CN113782045A (en) * 2021-08-30 2021-12-10 江苏大学 A Single-Channel Speech Separation Method Based on Multi-scale Delay Sampling
CN114267331A (en) * 2021-12-31 2022-04-01 达闼机器人有限公司 Speaker coding method, device and multi-speaker voice synthesis system
CN114267329A (en) * 2021-12-24 2022-04-01 厦门大学 Multi-speaker speech synthesis method based on probabilistic generation and non-autoregressive model
CN118430512A (en) * 2024-07-02 2024-08-02 厦门蝉羽网络科技有限公司 Speech synthesis method and device for improving phoneme pronunciation time accuracy

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Many-to-many speech conversion method based on text encoder under the condition of non-parallel text
JP2019109306A (en) * 2017-12-15 2019-07-04 日本電信電話株式会社 Voice conversion device, voice conversion method and program
WO2019138897A1 (en) * 2018-01-10 2019-07-18 ソニー株式会社 Learning device and method, and program
CN110047501A (en) * 2019-04-04 2019-07-23 南京邮电大学 Multi-to-multi phonetics transfer method based on beta-VAE

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019109306A (en) * 2017-12-15 2019-07-04 日本電信電話株式会社 Voice conversion device, voice conversion method and program
WO2019138897A1 (en) * 2018-01-10 2019-07-18 ソニー株式会社 Learning device and method, and program
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Many-to-many speech conversion method based on text encoder under the condition of non-parallel text
CN110047501A (en) * 2019-04-04 2019-07-23 南京邮电大学 Multi-to-multi phonetics transfer method based on beta-VAE

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Chin-Cheng Hsu et al., "Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder", arXiv:1610.04019v1, pages 1-6 *
Yajie Zhang et al., "Learning Latent Representations for Style Control and Transfer in End-to-End Speech Synthesis", arXiv:1812.04342v2, pages 1-5 *
Huang Guojie et al., "Non-parallel Corpus Voice Conversion with an Enhanced Variational Autoencoder", Journal of Signal Processing, vol. 34, no. 10, pages 1246-1251 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066511A (en) * 2021-03-16 2021-07-02 云知声智能科技股份有限公司 Voice conversion method and device, electronic equipment and storage medium
CN113409764A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113409764B (en) * 2021-06-11 2024-04-26 北京搜狗科技发展有限公司 Speech synthesis method and device for speech synthesis
CN113450761A (en) * 2021-06-17 2021-09-28 清华大学深圳国际研究生院 Parallel speech synthesis method and device based on variational self-encoder
CN113450761B (en) * 2021-06-17 2023-09-22 清华大学深圳国际研究生院 Parallel voice synthesis method and device based on variation self-encoder
CN113488022A (en) * 2021-07-07 2021-10-08 北京搜狗科技发展有限公司 Speech synthesis method and device
CN113488022B (en) * 2021-07-07 2024-05-10 北京搜狗科技发展有限公司 Speech synthesis method and device
CN113707122A (en) * 2021-08-11 2021-11-26 北京搜狗科技发展有限公司 Method and device for constructing voice synthesis model
CN113707122B (en) * 2021-08-11 2024-04-05 北京搜狗科技发展有限公司 Method and device for constructing voice synthesis model
CN113782045B (en) * 2021-08-30 2024-01-05 江苏大学 Single-channel voice separation method for multi-scale time delay sampling
CN113782045A (en) * 2021-08-30 2021-12-10 江苏大学 A Single-Channel Speech Separation Method Based on Multi-scale Delay Sampling
CN114267329A (en) * 2021-12-24 2022-04-01 厦门大学 Multi-speaker speech synthesis method based on probabilistic generation and non-autoregressive model
CN114267331A (en) * 2021-12-31 2022-04-01 达闼机器人有限公司 Speaker coding method, device and multi-speaker voice synthesis system
CN118430512A (en) * 2024-07-02 2024-08-02 厦门蝉羽网络科技有限公司 Speech synthesis method and device for improving phoneme pronunciation time accuracy
CN118430512B (en) * 2024-07-02 2024-10-22 厦门蝉羽网络科技有限公司 Speech synthesis method and device for improving phoneme pronunciation time accuracy

Also Published As

Publication number Publication date
CN112289304B (en) 2024-05-31

Similar Documents

Publication Publication Date Title
CN112289304B (en) A multi-speaker speech synthesis method based on variational autoencoder
CN101030369B (en) Embedded Speech Recognition Method Based on Subword Hidden Markov Model
CN112767958A (en) Zero-learning-based cross-language tone conversion system and method
Soleymanpour et al. Text-independent speaker identification based on selection of the most similar feature vectors
CN111081230B (en) Speech recognition method and device
CN113506562A (en) End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features
Lu et al. Automatic speech recognition
CN113297383B (en) Speech Emotion Classification Method Based on Knowledge Distillation
WO2023245389A1 (en) Song generation method, apparatus, electronic device, and storage medium
Gaurav et al. Development of application specific continuous speech recognition system in Hindi
CN117672268A (en) Multi-mode voice emotion recognition method based on relative entropy alignment fusion
Garg et al. Survey on acoustic modeling and feature extraction for speech recognition
Becerra et al. Speech recognition in a dialog system: From conventional to deep processing: A case study applied to Spanish
Sen et al. A convolutional neural network based approach to recognize bangla spoken digits from speech signal
Wang et al. A research on HMM based speech recognition in spoken English
Zhao et al. Research on voice cloning with a few samples
Mei et al. A particular character speech synthesis system based on deep learning
Ajayi et al. Systematic review on speech recognition tools and techniques needed for speech application development
Lee et al. Isolated word recognition using modular recurrent neural networks
Oprea et al. An artificial neural network-based isolated word speech recognition system for the Romanian language
Galatang Syllable-Based Indonesian Automatic Speech Recognition.
Raghudathesh et al. Review of toolkit to build automatic speech recognition models
Bohouta Improving wake-up-word and general speech recognition systems
Ren et al. [Retracted] Articulatory‐to‐Acoustic Conversion Using BiLSTM‐CNN Word‐Attention‐Based Method
CN112951270A (en) Voice fluency detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20241011

Address after: 100190, No. 21 West Fourth Ring Road, Beijing, Haidian District

Patentee after: INSTITUTE OF ACOUSTICS, CHINESE ACADEMY OF SCIENCES

Country or region after: China

Address before: 100190, No. 21 West Fourth Ring Road, Beijing, Haidian District

Patentee before: INSTITUTE OF ACOUSTICS, CHINESE ACADEMY OF SCIENCES

Country or region before: China

Patentee before: BEIJING KEXIN TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right