CN117854473A - Zero-shot speech synthesis method based on local correlation information


Info

Publication number: CN117854473A (granted as CN117854473B)
Application number: CN202410083127.8A
Authority: CN (China)
Prior art keywords: speaker, local, zero-shot, training, speech synthesis
Other languages: Chinese (zh)
Inventors: 岳焕景 (Yue Huanjing), 王嘉玮 (Wang Jiawei), 杨敬钰 (Yang Jingyu)
Current and original assignee: Tianjin University
Application filed 2024-01-19 by Tianjin University; priority to CN202410083127.8A (CN117854473B)
Legal status: Active (application granted)

Classifications

    • G10L13/027 — Concept-to-speech synthesisers; generation of natural phrases from machine-based concepts
    • G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination
    • G10L19/0212 — Analysis-synthesis techniques for redundancy reduction using spectral analysis (transform or subband vocoders) with orthogonal transformation
    • G10L19/173 — Vocoder architecture; transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding
    • G10L25/18 — Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
    • G10L25/24 — Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • Y02T10/10, Y02T10/40 — Climate change mitigation technologies related to transportation: internal combustion engine based vehicles; engine management systems (general cross-sectional tagging)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Complex Calculations (AREA)
  • Stereophonic System (AREA)

Abstract

The invention discloses a zero-shot speech synthesis method based on local reference information and relates to the technical field of speech signal processing. The method includes the following steps: S1, preprocess the text and speech data; S2, build the basic network framework; S3, design a scheme for introducing the target speaker's reference information and build a zero-shot speech synthesis model according to the designed scheme; S4, train the model with the deep learning framework PyTorch; S5, feed the target text and the target speaker's reference speech into the model to obtain the synthesized speech. Using the proposed feature matching and recombination method together with the local speaker encoder module, the invention raises the performance of zero-shot speech synthesis to a new level.

Description

Zero-shot speech synthesis method based on local correlation information

Technical Field

The present invention relates to the technical field of speech signal processing, and in particular to a zero-shot speech synthesis method based on local correlation information.

Background Art

Zero-shot speech synthesis aims to synthesize natural and fluent speech with a target speaker's timbre from text, using only a small amount of that speaker's speech as a reference. It can be applied in many practical scenarios such as audiobooks and voice broadcasting. Zero-shot speech synthesis faces many challenges, such as poor generalization of the synthesis and unnatural prosody. In recent years, zero-shot speech synthesis has shifted from traditional mathematical models to neural-network-based deep learning methods.

Most of these methods adopt a global reference-introduction scheme: the reference spectrogram is encoded into a one-dimensional vector that carries the target speaker's basic timbre information, and this vector is then used to help the model learn the target speaker's voice. However, the same vector is injected at every time frame of the intermediate features, which ignores the local content information of the target sentence. Recently, researchers working on speech synthesis have proposed using the content of the target text to retrieve reference acoustic features with higher informational relevance. Compared with the global approach, this can introduce richer, more useful information for different local segments, making the synthesized speech more natural.

On the other hand, a zero-shot speech synthesis system can usually be adapted into a speaker conversion system. The goal of speaker conversion is to replace the timbre of the source speaker with that of the target speaker while keeping the semantic content unchanged; it is used in scenarios such as digital humans and virtual voice changing. Recently, researchers have proposed converting the acoustic model of a zero-shot speech synthesis system into a speaker conversion model, exploiting its ability to adapt to a target speaker's voice to perform timbre conversion between different speakers. Based on the above, the present invention proposes a zero-shot speech synthesis method based on local correlation information.

Summary of the Invention

The purpose of the present invention is to propose a zero-shot speech synthesis method based on local correlation information that solves the problems raised in the Background Art, so as to generate natural, high-quality speech from text with only a small number of samples.

To achieve the above object, the present invention adopts the following technical solution:

A zero-shot speech synthesis method based on local correlation information comprises the following steps:

S1. Preprocess text and speech data: given raw text and speech data (C_raw, S_raw), perform word-to-phoneme conversion on C_raw to obtain a phoneme sequence c; apply the short-time Fourier transform (STFT) to S_raw and randomly select segments to obtain a reference STFT spectrum r and a ground-truth STFT spectrum x; divide the data into a training set, a validation set, a test set, and an independent reference speech set;

S2. Build the basic network framework: design a Transformer-based prior text encoder, a flow-based multi-layer acoustic decoder, a posterior encoder based on dilated convolution, and a vocoder based on a generative adversarial network; together these structures form a conditional variational autoencoder network framework with joint adversarial training;

S3. Design the scheme and build the model: based on the informational correlation between the target sentence and the reference speech, and combining the preprocessing method and basic network framework described in S1–S2, design a scheme for introducing the reference information, and build a zero-shot speech synthesis model according to the designed scheme; the scheme specifically includes the following:

① Feature matching and recombination: the ground-truth and reference STFT spectrum data pair (x, r) obtained in S1 is fed into the posterior encoder to obtain the target hidden feature z and the reference hidden feature z_r, respectively; then, in each affine coupling layer, the matching and recombination method is applied to z and z_r to obtain a recombined reference feature ẑ_r with a higher degree of information correlation;

② Local speaker encoder: the recombined feature ẑ_r is fed into the basic voiceprint vector generation module to obtain the basic voiceprint vector e; at the same time, ẑ_r is fed into the gated convolution module for information filtering; finally, the filtered feature and e are fed into the local attention module to obtain the local speaker embeddings;

③ Training and inference process design: the training process follows the adversarial training and variational inference described in S2; the inference process differs from the training process in that the affine coupling layers are used in their invertible (reverse) form;

④ Loss function module design: the zero-shot speech synthesis model is jointly optimized with the reconstruction loss and KL divergence loss of the variational inference process, the adversarial loss and feature matching loss of the adversarial training process, and the speaker identity loss of the basic voiceprint vector generation process;

⑤ Speaker conversion method design: the zero-shot speech synthesis model of S3 is modified, retaining the reference information introduction scheme, the posterior encoder, the K affine coupling layers and the vocoder, to realize voice conversion between different speakers;

S4. Train the model: train the model with the deep learning framework PyTorch, traversing all the preprocessed text and speech data of S1; the initial learning rate is set to 2e-4 and decays exponentially at a rate of 0.999; after 500k iterations of training, the model that performs best on the validation set is saved;

S5. Output the result: feed the test-set text data and reference-set speech data preprocessed in S1 into the trained model to obtain the zero-shot speech synthesis result.

Preferably, scheme ① further includes the following:

A1. Within the same affine coupling layer, compute frame-level cosine similarity between the target hidden feature z and the reference hidden feature z_r: for each frame z(t) of z, compute its similarity with every frame of z_r and select the frame with the largest cosine similarity as the matching index of z(t). The process is expressed as:

t̂(t) = argmax_τ ⟨ψ[z(t)], ψ[z_r(τ)]⟩

where ⟨·,·⟩ denotes the inner product and ψ[·] denotes vector normalization;

A2. Reorganize z_r according to the matched index values t̂ to generate a recombined reference feature ẑ_r aligned with z in the time domain; ẑ_r is the combination of reference frame-level features with the highest local correlation to the current target intermediate feature z.

Preferably, scheme ② specifically includes the following:

B1. The basic voiceprint vector generation module first maps the recombined reference feature ẑ_r to the hidden vector of the last layer of a three-layer long short-term memory (LSTM) network, and then maps this hidden vector to the speaker's basic voiceprint vector e through a linear layer;

B2. The gated convolution module first filters the recombined reference feature ẑ_r through an input gate composed of a convolution module and a gated activation unit, and then passes it through a forget gate controlled by the global basic voiceprint vector g to further suppress redundant information;

B3. The local attention module uses the features filtered in B2 to modulate the basic voiceprint vector e. The filtered features are first split into frame groups along the time dimension; then, based on an attention mechanism, each frame group is fused with e to generate a local speaker embedding vector e_j that captures both the target speaker's basic voiceprint information and the local correlation information represented by the corresponding frame group j. The attention mechanism uses e as the query and also as a key and a value; frame group j is not used as a query and only provides keys and values. The process is expressed as:

e_j = Σ_{i=0}^{3} softmax_i( ⟨q_0, k_i⟩ ) v_i

where q_0, k_0 and v_0 denote the query, key and value obtained from e, and k_i, v_i denote the keys and values obtained from the frame group, with i = {1, 2, 3}.

Preferably, scheme ③ specifically includes the following:

The training objective is to maximize the evidence lower bound of the log-likelihood, expressed as:

log p_θ(x|c,r) ≥ E_{q_φ(z|x,r)} [ log p_θ(x|z,r) − log q_φ(z|x,r) + log p_θ(z|c,r) ]

where the log-likelihood term log p_θ(x|z,r) corresponds to the vocoder reconstructing the waveform, and log q_φ(z|x,r) − log p_θ(z|c,r) corresponds to the KL divergence between the approximate posterior distribution and the prior distribution, which in practice is realized by the K affine coupling layers establishing an invertible transform between the text hidden features and the audio hidden features.

During inference, the K affine coupling layers are used in their invertible form, taking the text prior as input and outputting the acoustic hidden feature z.

Preferably, scheme ④ further includes the following:

C1. Project the ground-truth short-time Fourier transform (STFT) spectrum x onto the mel scale, denoted x_mel; convert the waveform y predicted by the vocoder into a mel spectrum y_mel with the same parameters; then use the L1 loss between x_mel and y_mel as the reconstruction loss:

L_recon = ‖ x_mel − y_mel ‖_1

C2. Variational inference is used so that the vocoder restores x more faithfully and the approximate posterior distribution q_φ(z|x,r) approaches the prior distribution p_θ(z|c,r); the KL divergence is defined as:

L_kl = log q_φ(z|x,r) − log p_θ(z|c,r),

z ~ q_φ(z|x,r) = N(z; μ_φ(x,r), σ_φ(x,r))

where [μ_φ(x,r), σ_φ(x,r)] are the statistics of the posterior distribution;

C3. The least-squares loss is used as the adversarial training loss to avoid vanishing gradients:

L_adv(D) = E_x[(D(x) − 1)²] + E_z[D(G(z))²],  L_adv(G) = E_z[(D(G(z)) − 1)²]

C4. The vocoder uses multi-scale and multi-period discriminators, and a multi-layer feature-matching loss is used to improve the stability of adversarial training:

L_fm(G) = E[ Σ_{l=1}^{T} (1/N_l) ‖ D_l(x) − D_l(G(z)) ‖_1 ]

where T is the number of discriminator layers, and D_l and N_l denote the feature map of the l-th layer and the number of features in that map;

The basic idea of the speaker identity loss is to pull the voiceprint vectors of the same speaker closer together and push the voiceprint vectors of different speakers apart, with the similarity terms defined as:

S⁺_{i,m} = ⟨ψ(ê_{i,m}), ψ(e_i)⟩,  S_{i,j,m} = ⟨ψ(ê_{i,m}), ψ(e_j)⟩ (i ≠ j)

where i and j denote sample indices in the batch, m denotes the layer index of the affine coupling layer, ê and e denote the voiceprint vectors generated from the recombined reference feature ẑ_r and from the target feature z respectively, ψ(·) denotes vector normalization, and ⟨·,·⟩ denotes the inner product; S⁺ denotes the similarity between voiceprint vectors of two samples from the same speaker, while S denotes the similarity between voiceprint vectors of two samples from different speakers; the positive real numbers α and β act as multipliers of S and S⁺ respectively.

Preferably, scheme ⑤ further includes the following:

The purpose of speaker conversion is to convert the source timbre into the target timbre while keeping the semantic content unchanged, specifically:

D1. With the help of the source speaker's reference speech s, the source speaker's speech x is converted into a speaker-independent intermediate variable z′ through the posterior encoder Q_φ and the forward affine coupling layers f_dec:

z ~ Q_φ(z|x,s)

z′ = f_dec(z|s)

D2. With the help of the target speaker's reference speech t, the speaker-independent intermediate variable z′ is converted into the target speaker's waveform y through the inverse affine coupling layers f_dec⁻¹ and the vocoder V:

y = V( f_dec⁻¹(z′|t) )

where V denotes the vocoder, f_dec⁻¹ denotes the inverse affine coupling layers, z′ denotes the speaker-independent intermediate variable, and t denotes the target speaker's reference speech.

Compared with the prior art, the present invention provides a zero-shot speech synthesis method based on local correlation information, which has the following beneficial effects:

(1) The present invention proposes a zero-shot speech synthesis method based on local correlation information. Unlike previous approaches that introduce global target reference information, the invention additionally considers the phoneme-level informational correlation between the target sentence and the reference speech, so richer information can be introduced for different local segments, making the synthesis results more natural.

(2) The present invention proposes a frame-level feature matching and recombination method that obtains reference features with a higher degree of information correlation; it also proposes a local speaker encoder that enables the reference features to better express the basic voiceprint characteristics and the local correlation information.

(3) Experiments conducted in the embodiments of the present invention show that the proposed method outperforms current mainstream zero-shot speech synthesis methods; the exploration in this invention can inspire further research on exploiting the local correlation information between the target sentence and the reference speech.

Brief Description of the Drawings

FIG. 1 is an overall framework diagram of the zero-shot speech synthesis method based on local correlation information proposed by the present invention;

FIG. 2 is a structural diagram of the matching and recombination module proposed in Embodiment 1 of the present invention;

FIG. 3 is a structural diagram of the local speaker encoder proposed in Embodiment 1 of the present invention;

FIG. 4 is a diagram of the speaker conversion framework proposed in Embodiment 1 of the present invention.

Detailed Description of the Embodiments

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them.

Embodiment 1:

Referring to FIG. 1, the present invention proposes a zero-shot speech synthesis method based on local correlation information, comprising the following steps:

S1. Preprocess text and speech data: given raw text and speech data (C_raw, S_raw), perform word-to-phoneme conversion on C_raw to obtain a phoneme sequence c; apply the short-time Fourier transform (STFT) to S_raw and randomly select segments to obtain a reference STFT spectrum r and a ground-truth STFT spectrum x; all data are divided into a training set, a validation set, a test set, and an independent reference speech set.
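To make S1 concrete, the following PyTorch sketch computes an STFT spectrum from a waveform and crops a random reference segment. The FFT size, hop length, segment length and sample rate are illustrative assumptions, and the word-to-phoneme step is omitted because the patent does not name a specific tool.

```python
import torch

def preprocess_speech(wav: torch.Tensor, n_fft: int = 1024, hop: int = 256,
                      segment_frames: int = 32):
    """Compute a linear STFT magnitude spectrogram from a waveform S_raw (x)
    and randomly crop a reference segment (r) from it."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True).abs()            # (freq, frames)
    start = torch.randint(0, max(1, spec.shape[-1] - segment_frames), (1,)).item()
    r = spec[:, start:start + segment_frames]                # reference spectrum r
    x = spec                                                 # ground-truth spectrum x
    return x, r

# usage on a dummy 1-second waveform sampled at 22.05 kHz
x, r = preprocess_speech(torch.randn(22050))
print(x.shape, r.shape)
```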

S2. Build the basic network framework: design a Transformer-based prior text encoder, a flow-based multi-layer acoustic decoder, a posterior encoder based on dilated convolution, and a vocoder based on a generative adversarial network; together these structures form a conditional variational autoencoder network framework with joint adversarial training;

S3. Design the scheme and build the model: based on the informational correlation between the target sentence and the reference speech, a scheme for introducing the reference information is designed by combining the preprocessing method and basic network framework described in S1–S2; the scheme specifically includes the following:

S3.1 Matching and Recombination Design

As shown in FIG. 2, local correlation information is measured with frame-level cosine similarity, and the target features are matched against the reference features. Specifically, within the same affine coupling layer, frame-level cosine similarity is computed between the target hidden feature z and the reference hidden feature z_r: for each frame z(t) of z, its similarity with every frame of z_r is computed, and the frame with the largest cosine similarity is selected as the matching index of z(t). This process can be expressed as:

t̂(t) = argmax_τ ⟨ψ[z(t)], ψ[z_r(τ)]⟩

where ⟨·,·⟩ denotes the inner product and ψ[·] denotes vector normalization;

Then, z_r is reorganized according to the matched index values t̂ to generate a recombined reference feature ẑ_r aligned with z in the time domain; this feature is the combination of reference frame-level features with the highest local correlation to the current target intermediate feature z.
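The matching and recombination step can be sketched in PyTorch as follows. Treating z and z_r as (channels, frames) tensors and the channel size 192 are assumptions; the cosine-similarity matching and index-based gathering follow the description above.

```python
import torch
import torch.nn.functional as F

def match_and_recombine(z: torch.Tensor, z_r: torch.Tensor) -> torch.Tensor:
    """Frame-level matching: for every frame z(t), pick the frame of z_r with
    the highest cosine similarity, then gather those frames into a recombined
    reference feature aligned with z along the time axis.

    z   : (C, T)   target hidden feature
    z_r : (C, T_r) reference hidden feature
    """
    z_n = F.normalize(z, dim=0)           # psi[z(t)]
    zr_n = F.normalize(z_r, dim=0)        # psi[z_r(tau)]
    sim = z_n.t() @ zr_n                  # (T, T_r) cosine similarities
    idx = sim.argmax(dim=1)               # matching index for each target frame
    return z_r[:, idx]                    # recombined reference feature, (C, T)

# usage
z, z_r = torch.randn(192, 120), torch.randn(192, 80)
print(match_and_recombine(z, z_r).shape)  # torch.Size([192, 120])
```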

S3.2 Local Speaker Encoder Design

After matching and recombination, a reference feature ẑ_r with higher relevance to the current target sentence is obtained. The purpose of the local speaker encoder is to convert ẑ_r into a representation that captures the target speaker's basic acoustic characteristics and to fuse the local correlation information effectively with this basic voiceprint representation. As shown in FIG. 3, it consists of three parts: the target speaker basic voiceprint vector generation module (right branch), the gated convolution module (left branch), and the final local attention module.

The basic voiceprint vector generation module first maps the recombined reference feature ẑ_r to the hidden vector of the last layer of a three-layer long short-term memory (LSTM) network, and then maps this hidden vector to the speaker's basic voiceprint vector e through a linear layer. To make e better capture the target speaker's basic voiceprint characteristics, a speaker identity verification loss is introduced as a constraint. Its basic idea is to pull the voiceprint vectors of the same speaker closer together and push the voiceprint vectors of different speakers apart; the similarity terms are defined as

S⁺_{i,m} = ⟨ψ(ê_{i,m}), ψ(e_i)⟩,  S_{i,j,m} = ⟨ψ(ê_{i,m}), ψ(e_j)⟩ (i ≠ j)

where i and j denote sample indices in the batch, m denotes the layer index of the affine coupling layer, ê and e denote the voiceprint vectors generated from the recombined reference feature ẑ_r and from the target feature z respectively, ψ(·) denotes vector normalization and ⟨·,·⟩ denotes the inner product. S⁺ denotes the similarity between voiceprint vectors of two samples from the same speaker, while S denotes the similarity between voiceprint vectors of two samples from different speakers. The positive real numbers α and β act as multipliers of S and S⁺ respectively.
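A minimal sketch of the basic voiceprint vector generation module described above. The feature, hidden and embedding dimensions are assumed; the three-layer LSTM followed by a linear projection of the last layer's hidden state follows the text.

```python
import torch
import torch.nn as nn

class BasicVoiceprintEncoder(nn.Module):
    """Maps the recombined reference feature to a basic voiceprint vector e via
    a 3-layer LSTM followed by a linear projection (dimensions assumed)."""

    def __init__(self, in_dim: int = 192, hidden_dim: int = 256, emb_dim: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden_dim, emb_dim)

    def forward(self, z_r_hat: torch.Tensor) -> torch.Tensor:
        # z_r_hat: (B, T, C) recombined reference feature
        _, (h_n, _) = self.lstm(z_r_hat)
        e = self.proj(h_n[-1])             # hidden state of the last LSTM layer
        return e                           # (B, emb_dim)

# usage
enc = BasicVoiceprintEncoder()
print(enc(torch.randn(2, 120, 192)).shape)  # torch.Size([2, 256])
```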

Under different emotional conditions, even the pronunciation of the same phoneme differs, and such variation introduces redundant interference into training. To minimize the influence of this redundant information, the present invention proposes the gated convolution module, which consists of an input gate and a forget gate. The input gate comprises a convolution module and a gated activation unit: the output range of the Sigmoid function is [0,1], so it can act as a gate, and by multiplying the outputs of the Tanh and Sigmoid branches the gated activation unit selectively retains and filters the input information. The forget gate comprises a convolution module and a Sigmoid gating unit controlled by the global speaker embedding vector g. The vector g is obtained from a pre-trained speaker encoder; it carries no semantic content but represents the basic acoustic characteristics of the target speaker. After g is processed by a linear layer and the Sigmoid function, the modulation coefficients of the forget gate are generated to control which input information is retained and which is forgotten. All operations are performed along the frequency dimension, and the output of the gated convolution module is the information-filtered reference feature.
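A sketch of the gated convolution module under stated assumptions: the channel count, kernel width, and the exact way the forget-gate convolution is combined with the g-derived modulation coefficients are not specified in the patent and are chosen here for illustration.

```python
import torch
import torch.nn as nn

class GatedConvModule(nn.Module):
    """Information filter: an input gate (conv + tanh/sigmoid gated activation)
    followed by a forget gate whose modulation coefficients come from the
    global speaker embedding g via a linear layer and a sigmoid."""

    def __init__(self, channels: int = 192, g_dim: int = 256, kernel: int = 5):
        super().__init__()
        pad = kernel // 2
        self.conv_tanh = nn.Conv1d(channels, channels, kernel, padding=pad)
        self.conv_sig = nn.Conv1d(channels, channels, kernel, padding=pad)
        self.forget_conv = nn.Conv1d(channels, channels, kernel, padding=pad)
        self.g_proj = nn.Linear(g_dim, channels)

    def forward(self, z_r_hat: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # z_r_hat: (B, C, T) recombined reference feature, g: (B, g_dim)
        h = torch.tanh(self.conv_tanh(z_r_hat)) * torch.sigmoid(self.conv_sig(z_r_hat))
        coeff = torch.sigmoid(self.g_proj(g)).unsqueeze(-1)   # forget-gate coefficients
        return self.forget_conv(h) * coeff                     # filtered reference feature

# usage
mod = GatedConvModule()
print(mod(torch.randn(2, 192, 120), torch.randn(2, 256)).shape)  # (2, 192, 120)
```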

The role of the local attention module is to effectively fuse the basic voiceprint vector e with the information-filtered reference feature. The filtered features are first split into frame groups along the time dimension; then, based on an attention mechanism, each frame group is fused with e to generate a local speaker embedding vector e_j that captures both the target speaker's basic voiceprint information and the local correlation information represented by the corresponding frame group j. The attention mechanism uses e as the query and also as a key and a value, while frame group j is not used as a query and only provides keys and values. The process can be expressed as:

e_j = Σ_{i=0}^{3} softmax_i( ⟨q_0, k_i⟩ ) v_i

where q_0, k_0 and v_0 denote the query, key and value obtained from e, and k_i, v_i (i = {1, 2, 3}) denote the keys and values obtained from the frame group. To align e_j with frame group j along the time dimension, each e_j is replicated three times along the time axis. Finally, these local embedding vectors are concatenated to form a local embedding combination aligned with z in the time dimension. In other words, each local frame group in z receives a unique local speaker embedding e_j, and the three frames within the same group share the same local speaker embedding.
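A possible realization of the local attention module. The projection dimensions and the scaled dot-product weighting are assumptions; the grouping into three-frame groups, the query taken from e, the keys/values taken from both e and the frame group, and the threefold replication of e_j follow the description above.

```python
import torch
import torch.nn as nn

class LocalAttentionModule(nn.Module):
    """Fuses the basic voiceprint vector e with groups of 3 filtered reference
    frames: e supplies the query q0 and one key/value pair, and each frame of
    the group supplies one key/value pair."""

    def __init__(self, channels: int = 192, emb_dim: int = 256, group: int = 3):
        super().__init__()
        self.group = group
        self.q = nn.Linear(emb_dim, emb_dim)
        self.k_e, self.v_e = nn.Linear(emb_dim, emb_dim), nn.Linear(emb_dim, emb_dim)
        self.k_f, self.v_f = nn.Linear(channels, emb_dim), nn.Linear(channels, emb_dim)

    def forward(self, filtered: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # filtered: (B, C, T) with T divisible by the group size, e: (B, emb_dim)
        B, C, T = filtered.shape
        G = T // self.group
        groups = filtered.view(B, C, G, self.group).permute(0, 2, 3, 1)   # (B, G, 3, C)
        q0 = self.q(e).view(B, 1, 1, -1)                                  # query from e
        k = torch.cat([self.k_e(e).view(B, 1, 1, -1).expand(-1, G, -1, -1),
                       self.k_f(groups)], dim=2)                          # (B, G, 4, D)
        v = torch.cat([self.v_e(e).view(B, 1, 1, -1).expand(-1, G, -1, -1),
                       self.v_f(groups)], dim=2)                          # (B, G, 4, D)
        attn = torch.softmax((q0 * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)
        e_j = (attn.unsqueeze(-1) * v).sum(2)                             # (B, G, D)
        return e_j.repeat_interleave(self.group, dim=1)                   # aligned with z

# usage
att = LocalAttentionModule()
print(att(torch.randn(2, 192, 120), torch.randn(2, 256)).shape)  # (2, 120, 256)
```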

S3.3 Design of the Speaker Conversion Method Based on Local Correlation Information

As shown in FIG. 4, s and t denote the reference speech of the source speaker and of the target speaker, respectively; the hidden reference features and recombined reference features of the source speaker are computed in the k-th affine coupling layer, and those of the target speaker in the k-th inverse affine coupling layer. x and y denote the source speaker's speech and the target speaker's speech, respectively.

The system first converts the source speaker's speech x into a speaker-independent intermediate variable z′ with the help of the source speaker's reference speech s, through the posterior encoder Q_φ and the forward affine coupling layers f_dec:

z ~ Q_φ(z|x,s)

z′ = f_dec(z|s)

Then, with the help of the target speaker's reference speech t, the speaker conversion system converts z′ into the target speaker's waveform y through the inverse affine coupling layers f_dec⁻¹ and the vocoder V:

y = V( f_dec⁻¹(z′|t) )
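The conversion pipeline can be summarized by the sketch below, which only fixes the data flow; the module interfaces (the `ref` and `reverse` arguments of the hypothetical `posterior_encoder`, `flow` and `vocoder` modules) are assumptions, not the patent's API.

```python
import torch

@torch.no_grad()
def convert_speaker(x, s, t, posterior_encoder, flow, vocoder):
    """Voice-conversion sketch: encode the source speech x with the help of the
    source reference s, strip speaker information with the forward flow, then
    re-introduce the target reference t through the inverse flow and vocoder."""
    z = posterior_encoder(x, ref=s)                 # z ~ Q_phi(z | x, s)
    z_prime = flow(z, ref=s)                        # z' = f_dec(z | s), speaker-independent
    z_target = flow(z_prime, ref=t, reverse=True)   # inverse affine coupling layers
    return vocoder(z_target)                        # waveform y with the target timbre
```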

S3.4 Loss Function Design

Five different losses are used during network training: the reconstruction loss and KL divergence loss associated with the conditional variational autoencoder, the generative adversarial loss and multi-layer feature matching loss associated with adversarial training, and the speaker identity verification loss applied in the local speaker encoder. They are detailed below:

The ground-truth STFT spectrum x is projected onto the mel scale, denoted x_mel; the waveform y predicted by the vocoder is converted into a mel spectrum y_mel using the same parameters. The L1 loss between x_mel and y_mel is then used as the reconstruction loss:

L_recon = ‖ x_mel − y_mel ‖_1
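A sketch of the mel-scale L1 reconstruction loss; for brevity both mel spectrograms are computed from waveforms with torchaudio, and the sample rate, FFT size and number of mel bands are assumptions.

```python
import torch
import torchaudio

# Same mel parameters for the ground-truth and the vocoder output (assumed values).
mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80)

def reconstruction_loss(wav_true: torch.Tensor, wav_pred: torch.Tensor) -> torch.Tensor:
    x_mel = mel_fn(wav_true)
    y_mel = mel_fn(wav_pred)
    return torch.nn.functional.l1_loss(y_mel, x_mel)

# usage
print(reconstruction_loss(torch.randn(1, 22050), torch.randn(1, 22050)))
```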

Variational inference expects, on the one hand, that the vocoder restores x more faithfully and, on the other hand, that the approximate posterior distribution q_φ(z|x,r) is as close as possible to the prior distribution p_θ(z|c,r). The KL divergence is therefore defined as:

L_kl = log q_φ(z|x,r) − log p_θ(z|c,r),

z ~ q_φ(z|x,r) = N(z; μ_φ(x,r), σ_φ(x,r))

where [μ_φ(x,r), σ_φ(x,r)] are the statistics of the posterior distribution;

To better avoid vanishing gradients, the least-squares loss is used as the adversarial training loss:

L_adv(D) = E_x[(D(x) − 1)²] + E_z[D(G(z))²],  L_adv(G) = E_z[(D(G(z)) − 1)²]
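The least-squares adversarial losses can be written as below; the patent names the loss family, while the exact averaging over discriminator outputs is an assumption following the common LSGAN formulation.

```python
import torch

def discriminator_adv_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Least-squares GAN loss for the discriminator (real -> 1, generated -> 0)."""
    return torch.mean((d_real - 1) ** 2) + torch.mean(d_fake ** 2)

def generator_adv_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Least-squares GAN loss for the generator (generated outputs pushed to 1)."""
    return torch.mean((d_fake - 1) ** 2)

# usage with dummy discriminator outputs
print(discriminator_adv_loss(torch.randn(4), torch.randn(4)),
      generator_adv_loss(torch.randn(4)))
```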

The vocoder uses multi-scale and multi-period discriminators, so a multi-layer feature-matching loss is used to improve the stability of adversarial training:

L_fm(G) = E[ Σ_{l=1}^{T} (1/N_l) ‖ D_l(x) − D_l(G(z)) ‖_1 ]

where T is the number of discriminator layers, and D_l and N_l denote the feature map of the l-th layer and the number of features in that map;
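A sketch of the multi-layer feature-matching loss; using reduction='mean' in the L1 term plays the role of the 1/N_l normalization, and summing over a flat list of feature maps (rather than per sub-discriminator) is an assumption.

```python
import torch
from typing import List

def feature_matching_loss(feats_real: List[torch.Tensor],
                          feats_fake: List[torch.Tensor]) -> torch.Tensor:
    """L1 distance between discriminator feature maps of real and generated audio,
    accumulated over all layers of the multi-scale and multi-period discriminators."""
    loss = torch.zeros(())
    for d_real, d_fake in zip(feats_real, feats_fake):
        loss = loss + torch.nn.functional.l1_loss(d_fake, d_real.detach())
    return loss

# usage with dummy feature maps from a 3-layer discriminator
fr = [torch.randn(2, 16, 100) for _ in range(3)]
ff = [torch.randn(2, 16, 100) for _ in range(3)]
print(feature_matching_loss(fr, ff))
```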

The basic idea of the speaker identity loss is to pull the voiceprint vectors of the same speaker closer together and push the voiceprint vectors of different speakers apart, with the similarity terms defined as:

S⁺_{i,m} = ⟨ψ(ê_{i,m}), ψ(e_i)⟩,  S_{i,j,m} = ⟨ψ(ê_{i,m}), ψ(e_j)⟩ (i ≠ j)

where i and j denote sample indices in the batch, m denotes the layer index of the affine coupling layer, ê and e denote the voiceprint vectors generated from the recombined reference feature ẑ_r and from the target feature z respectively, ψ(·) denotes vector normalization, and ⟨·,·⟩ denotes the inner product. S⁺ denotes the similarity between voiceprint vectors of two samples from the same speaker, while S denotes the similarity between voiceprint vectors of two samples from different speakers. The positive real numbers α and β act as multipliers of S and S⁺ respectively.
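A contrastive sketch of the speaker identity loss under stated assumptions: the exact aggregation over pairs and coupling layers is not given in the patent, so same-speaker and different-speaker cosine similarities are simply averaged and weighted by β and α.

```python
import torch
import torch.nn.functional as F

def speaker_identity_loss(e_hat: torch.Tensor, e: torch.Tensor,
                          speaker_ids: torch.Tensor,
                          alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Increase cosine similarity between voiceprint vectors of the same speaker
    (e_hat from the recombined reference feature, e from the target feature)
    and decrease it for different speakers.

    e_hat, e    : (B, D) voiceprint vectors
    speaker_ids : (B,) integer speaker labels
    """
    sim = F.normalize(e_hat, dim=-1) @ F.normalize(e, dim=-1).t()    # (B, B)
    same = speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)      # same-speaker mask
    s_pos = sim[same].mean()        # similarity of same-speaker pairs  (S+)
    s_neg = sim[~same].mean()       # similarity of different-speaker pairs (S)
    return alpha * s_neg - beta * s_pos

# usage
print(speaker_identity_loss(torch.randn(4, 256), torch.randn(4, 256),
                            torch.tensor([0, 0, 1, 1])))
```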

Combining the above, the zero-shot speech synthesis model is built according to the designed scheme.

S4. Train the model: the model is trained with the deep learning framework PyTorch, traversing all the preprocessed text and speech data of S1; the initial learning rate is 2e-4 and decays exponentially at a rate of 0.999; after 500k iterations of training, the model that performs best on the validation set is saved;
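The S4 training configuration can be sketched as follows; the learning rate and decay rate come from the text, while the choice of AdamW, the per-epoch decay cadence and the stand-in model are assumptions.

```python
import torch

model = torch.nn.Linear(80, 80)     # stand-in for the synthesis model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

for epoch in range(3):              # 500k iterations in the real setup
    for _ in range(10):             # placeholder batches
        optimizer.zero_grad()
        loss = model(torch.randn(4, 80)).pow(2).mean()   # placeholder loss
        loss.backward()
        optimizer.step()
    scheduler.step()                # exponential decay at a rate of 0.999 (cadence assumed)
```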

S5. Output the result: the test-set text data and reference-set speech data preprocessed in S1 are fed into the trained model to obtain the zero-shot speech synthesis result.

Embodiment 2:

Based on Embodiment 1, but with the following differences:

This embodiment evaluates with reference speech of three different durations: 5 seconds, 20 seconds, and 40 seconds. Three state-of-the-art comparison methods trained on the same VCTK dataset are selected: Attentron ZS, SC-GlowTTS, and YourTTS.

Attentron ZS is an autoregressive zero-shot speech synthesis method based on an attention mechanism and recurrent neural networks; it uses coarse-grained and fine-grained encoders to extract basic voice information and detailed style information respectively, improving the quality of the synthesized speech. SC-GlowTTS is a non-autoregressive zero-shot speech synthesis method based on a flow generative model; it introduces multi-layer dilated convolution modules into the normalizing-flow structure to improve the flow model's feature extraction ability and thus the quality of the synthesized speech. YourTTS is a non-autoregressive zero-shot speech synthesis method based on a conditional variational autoencoder; it adopts a global reference-introduction scheme and injects the reference at multiple layers of the posterior encoder and flow decoder to help improve the quality of the synthesized speech.

For a fairer comparison, this embodiment also reports results generated by the YourTTS model with 40 seconds of reference speech (YourTTS-40s). See Table 1 for details.

Table 1. SECS, and MOS / Sim-MOS evaluation with 95% confidence intervals

As shown in Table 1, the evaluation results on the SECS, MOS and Sim-MOS metrics are reported. SECS measures the similarity of voice timbre, while MOS and Sim-MOS measure the naturalness of the voice; higher is better for all three. The model with 40-second references achieves the best results on SECS and Sim-MOS, while the model with 20-second references achieves the best MOS score and comprehensively outperforms previous zero-shot speech synthesis methods. However, the model with 5-second references performs worse than YourTTS, which indicates that the performance of the proposed method depends on the duration of the reference speech: an overly short reference cannot reach high coverage of the phoneme set and degrades the model's performance.

In addition, this embodiment conducts speaker conversion experiments on the VCTK dataset, selecting 8 speakers (4 male / 4 female) from the VCTK test subset. The present invention is compared with two state-of-the-art speaker conversion models adapted from zero-shot speech synthesis systems. Considering the general differences in voice characteristics between male and female speakers, evaluation results are provided separately for cross-gender and same-gender speaker conversion. See Table 2 for the evaluation results.

Table 2. Speaker conversion evaluation results with 95% confidence intervals

As shown in Table 2, the proposed speaker conversion method achieves the highest MOS and Sim-MOS scores for same-gender conversion, which demonstrates that the method based on local correlation information adapts better to the voice characteristics of different speakers. The results for cross-gender conversion are close to those of the YourTTS-based speaker conversion and worse than those of same-gender conversion, which indicates that the speaker conversion model's ability to handle two markedly different voice characteristics still needs further improvement.

The above is only a preferred specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical solution and inventive concept of the present invention, shall fall within the protection scope of the present invention.

Claims (6)

1. A zero-shot speech synthesis method based on local correlation information, characterized by comprising the following steps:
S1. Preprocessing text and speech data: given raw text C_raw and speech data S_raw, performing word-to-phoneme processing on C_raw to obtain a phoneme sequence c; performing a short-time Fourier transform on S_raw and randomly selecting segments to obtain a reference short-time Fourier transform spectrum r and a ground-truth short-time Fourier transform spectrum x; and dividing the data into a training set, a validation set, a test set and an independent reference speech set;
S2. Constructing a basic network framework: designing a Transformer-based prior text encoder, a flow-based multi-layer acoustic decoder, a posterior encoder based on dilated convolution and a vocoder based on a generative adversarial network, and combining these structures to form a conditional variational autoencoder network framework for joint adversarial training;
S3. Designing the scheme and building the model: based on the informational correlation between the target sentence and the reference speech, a reference information introduction scheme is designed by combining the preprocessing method and the basic network framework of steps S1–S2, and a zero-shot speech synthesis model is built according to the designed scheme, the scheme specifically comprising the following:
(1) Feature matching and recombination: inputting the ground-truth and reference short-time Fourier transform spectrum data pair (x, r) obtained in step S1 into the posterior encoder to obtain a target hidden feature z and a reference hidden feature z_r respectively; then, in each affine coupling layer, performing the matching and recombination method on z and z_r to obtain a recombined reference feature ẑ_r with a higher degree of information correlation;
(2) Local speaker encoder: inputting the recombined feature ẑ_r into the basic voiceprint vector generation module to obtain a basic voiceprint vector e; at the same time, inputting ẑ_r into the gated convolution module for information filtering; and finally inputting the filtered feature and e into the local attention module to obtain the local speaker embeddings;
(3) Training and inference process design: the training process is carried out according to the adversarial training and variational inference described in step S2; the inference process differs from the training process in that the affine coupling layers adopt a reversible form;
(4) Loss function module design: the zero-shot speech synthesis model is jointly optimized through the reconstruction loss and KL divergence loss of the variational inference process, the adversarial loss and feature matching loss of the adversarial training process, and the speaker identity loss of the basic voiceprint vector generation process;
(5) Speaker conversion method design: modifying the zero-shot speech synthesis model of step S3 and retaining the reference information introduction scheme, the posterior encoder, the K affine coupling layers and the vocoder, so as to realize voice conversion between different speakers;
S4. Training the model: training the model with the deep learning framework PyTorch, traversing all the preprocessed text and speech data in S1, setting the initial learning rate to 2e-4 with exponential decay at a rate of 0.999, training for 500k iterations, and saving the model that performs best on the validation set;
S5. Outputting the result: inputting the test-set text data and reference-set speech data preprocessed in step S1 into the trained model to obtain the zero-shot speech synthesis result.
2. The zero-shot speech synthesis method based on local correlation information according to claim 1, wherein scheme (1) further comprises the following:
A1. Within the same affine coupling layer, performing a frame-level cosine similarity calculation between the target hidden feature z and the reference hidden feature z_r, i.e., for each frame z(t) in z, calculating the similarity between this frame and all frames in z_r, and selecting the frame with the largest cosine similarity as the matching index of z(t), the process being expressed as:

t̂(t) = argmax_τ ⟨ψ[z(t)], ψ[z_r(τ)]⟩

wherein ⟨·,·⟩ represents the inner product and ψ[·] represents the normalization of the vector;

A2. Reorganizing z_r according to the matched index values t̂ to generate a recombined reference feature ẑ_r aligned with z in the time domain, the recombined reference feature ẑ_r being the reference frame-level feature combination with the highest local correlation with the current target intermediate feature z.
3. The zero-shot speech synthesis method based on local correlation information according to claim 1, wherein scheme (2) specifically comprises the following:
B1. The basic voiceprint vector generation module first maps the recombined reference feature ẑ_r to the hidden vector of the last layer of a three-layer long short-term memory (LSTM) network, and then maps this hidden vector to the speaker's basic voiceprint vector e through a linear layer;
B2. The gated convolution module first performs information filtering on the recombined reference feature ẑ_r through an input gate composed of a convolution module and a gated activation unit, and then passes it through a forget gate controlled by the global basic voiceprint vector g to further control redundant information;
B3. The local attention module modulates the basic voiceprint vector e using the features filtered in B2: the filtered features are first segmented into different frame groups along the time dimension, and then, based on an attention mechanism, the different frame groups are respectively fused with e to generate a local speaker embedding vector e_j that simultaneously captures the basic voiceprint information of the target speaker and the local correlation information represented by the corresponding frame group j; the attention mechanism uses e as a query while e also serves as a key and a value; frame group j is not used as a query and only serves as keys and values; the above process is expressed as:

e_j = Σ_{i=0}^{3} softmax_i( ⟨q_0, k_i⟩ ) v_i

wherein q_0, k_0 and v_0 respectively represent the query, key and value obtained from e, and k_i, v_i respectively represent the keys and values obtained from the frame group, where i = {1, 2, 3}.
4. The zero-shot speech synthesis method based on local correlation information according to claim 1, wherein scheme (3) specifically comprises the following:
the evidence lower bound of the maximized log-likelihood is taken as the objective of the training process, its function being expressed as:

log p_θ(x|c,r) ≥ E_{q_φ(z|x,r)} [ log p_θ(x|z,r) − log q_φ(z|x,r) + log p_θ(z|c,r) ]

wherein the log-likelihood term log p_θ(x|z,r) represents the flow of the vocoder reconstructing the waveform, and log q_φ(z|x,r) − log p_θ(z|c,r) represents the KL divergence between the approximate posterior distribution and the prior distribution, which in practice corresponds to the K affine coupling layers establishing a reversible transform between the text hidden features and the audio hidden features;

during inference, the K affine coupling layers adopt the reversible form, take the text prior as input, and output the acoustic hidden feature z.
5. The zero-shot speech synthesis method based on local correlation information according to claim 1, wherein scheme (4) further comprises the following:
C1. Projecting the ground-truth short-time Fourier transform spectrum x onto the mel scale, denoted x_mel; for the waveform y predicted by the vocoder, also converting it into a mel spectrum y_mel with the same parameters; then using the L1 loss between x_mel and y_mel as the reconstruction loss, expressed as:

L_recon = ‖ x_mel − y_mel ‖_1

C2. Using variational inference to make the vocoder restore x more faithfully and to make the approximate posterior distribution q_φ(z|x,r) approach the prior distribution p_θ(z|c,r), the KL divergence being defined as:

L_kl = log q_φ(z|x,r) − log p_θ(z|c,r),

z ~ q_φ(z|x,r) = N(z; μ_φ(x,r), σ_φ(x,r))

wherein [μ_φ(x,r), σ_φ(x,r)] represents the statistics of the posterior distribution;

C3. Using the least-squares loss function as the adversarial training loss function to avoid vanishing gradients, expressed as:

L_adv(D) = E_x[(D(x) − 1)²] + E_z[D(G(z))²],  L_adv(G) = E_z[(D(G(z)) − 1)²]

C4. The vocoder uses multi-scale and multi-period discriminators, and a multi-layer feature-matching loss function is used to improve the stability of adversarial training, expressed as:

L_fm(G) = E[ Σ_{l=1}^{T} (1/N_l) ‖ D_l(x) − D_l(G(z)) ‖_1 ]

wherein T represents the number of discriminator layers, and D_l and N_l respectively represent the feature map of the l-th layer and the number of features in that feature map;

the basic idea of pulling the voiceprint vectors of the same speaker closer and pushing the voiceprint vectors of different speakers apart is taken as the speaker identity loss function, with the similarity terms defined as:

S⁺_{i,m} = ⟨ψ(ê_{i,m}), ψ(e_i)⟩,  S_{i,j,m} = ⟨ψ(ê_{i,m}), ψ(e_j)⟩ (i ≠ j)

wherein i or j represents a sample index in the batch, m represents the layer index of the affine coupling layer, ê and e respectively represent the voiceprint vectors generated from the recombined reference feature ẑ_r and from the target feature z, ψ(·) represents the normalization of the vector, and ⟨·,·⟩ represents the inner product; S⁺ represents the similarity between voiceprint vectors of two samples from the same speaker, while S represents the similarity between voiceprint vectors of two samples from different speakers; the positive real numbers α and β serve as multipliers of S and S⁺ respectively.
6. The zero-shot speech synthesis method based on local correlation information according to claim 1, wherein scheme (5) further comprises the following:
the purpose of speaker conversion is to convert a source tone into a target tone on the premise of keeping the same semantic content, specifically:
d1, with the help of the reference speech s of the source speaker, the source speaker's speech x is converted into a speaker-independent intermediate variable z' through the posterior encoder Q_φ and the forward affine coupling layer f_dec, the function of which is expressed as follows:
z' = f_dec(z, s), z ~ Q_φ(z|x, s)
d2, with the help of the reference speech t of the target speaker, the speaker-independent intermediate variable z' is converted into the waveform y of the target speaker through the inverse affine coupling layer f_dec⁻¹ and the vocoder V, the function of which is expressed as follows:
y = V(f_dec⁻¹(z', t))
where V denotes the vocoder; f_dec⁻¹ denotes the reverse affine coupling layer; z' denotes the speaker-independent intermediate variable; and t denotes the reference speech of the target speaker.
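For orientation only, the following PyTorch sketch traces the d1/d2 conversion path of claim 6 with stand-in modules. `PosteriorEncoder`, `CouplingFlow` and `Vocoder` are toy placeholders with assumed interfaces and sizes; only the order of operations (posterior sample → forward coupling with the source reference → inverse coupling with the target reference → vocoder) follows the claim.

```python
# Minimal sketch (PyTorch) of the speaker-conversion path d1/d2.
import torch
import torch.nn as nn

class PosteriorEncoder(nn.Module):
    """Q_phi(z|x, s): maps source speech features x and reference s to a latent sample."""
    def __init__(self, in_dim, latent_dim, ref_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim + ref_dim, 2 * latent_dim)
    def forward(self, x, s):
        mu, logs = self.proj(torch.cat([x, s.expand(x.size(0), -1)], dim=-1)).chunk(2, -1)
        return mu + torch.randn_like(mu) * torch.exp(logs)   # reparameterised sample z

class CouplingFlow(nn.Module):
    """f_dec: speaker-conditioned invertible map between z and the speaker-independent z'."""
    def __init__(self, latent_dim, ref_dim):
        super().__init__()
        self.net = nn.Linear(latent_dim // 2 + ref_dim, latent_dim)
    def forward(self, z, ref, reverse=False):
        za, zb = z.chunk(2, -1)
        log_s, t = self.net(torch.cat([za, ref.expand(za.size(0), -1)], -1)).chunk(2, -1)
        log_s = torch.tanh(log_s)
        zb = (zb + t) * torch.exp(log_s) if not reverse else zb * torch.exp(-log_s) - t
        return torch.cat([za, zb], -1)

class Vocoder(nn.Module):
    """V: maps acoustic latents (plus the reference) to a waveform; hop of 256 is assumed."""
    def __init__(self, latent_dim, ref_dim, hop=256):
        super().__init__()
        self.proj = nn.Linear(latent_dim + ref_dim, hop)
    def forward(self, z, ref):
        frames = self.proj(torch.cat([z, ref.expand(z.size(0), -1)], -1))
        return frames.reshape(-1)                             # flatten frames to one waveform

def convert(x, s, t, enc, flow, voc):
    """d1: z ~ Q_phi(z|x,s), z' = f_dec(z, s);  d2: y = V(f_dec^-1(z', t))."""
    z = enc(x, s)                            # posterior sample from the source utterance
    z_prime = flow(z, s)                     # forward coupling: strip source-speaker identity
    z_tgt = flow(z_prime, t, reverse=True)   # inverse coupling under the target reference
    return voc(z_tgt, t)                     # vocode to the target speaker's waveform

if __name__ == "__main__":
    enc, flow, voc = PosteriorEncoder(80, 16, 8), CouplingFlow(16, 8), Vocoder(16, 8)
    x, s, t = torch.randn(100, 80), torch.randn(1, 8), torch.randn(1, 8)
    print(convert(x, s, t, enc, flow, voc).shape)   # 100 frames * 256 samples
```

The design point mirrored here is that the same coupling flow is run forward with the source reference and inverted with the target reference, so z' carries the content but not the speaker identity.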
CN202410083127.8A 2024-01-19 2024-01-19 Zero-shot speech synthesis method based on local correlation information Active CN117854473B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410083127.8A CN117854473B (en) 2024-01-19 2024-01-19 Zero-shot speech synthesis method based on local correlation information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410083127.8A CN117854473B (en) 2024-01-19 2024-01-19 Zero-shot speech synthesis method based on local correlation information

Publications (2)

Publication Number Publication Date
CN117854473A true CN117854473A (en) 2024-04-09
CN117854473B CN117854473B (en) 2024-08-06

Family

ID=90530366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410083127.8A Active CN117854473B (en) 2024-01-19 2024-01-19 Zero-shot speech synthesis method based on local correlation information

Country Status (1)

Country Link
CN (1) CN117854473B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022243337A2 (en) * 2021-05-17 2022-11-24 Deep Safety Gmbh System for detection and management of uncertainty in perception systems, for new object detection and for situation anticipation
CN114360493A (en) * 2021-12-15 2022-04-15 腾讯科技(深圳)有限公司 Speech synthesis method, apparatus, medium, computer device and program product
US20230206898A1 (en) * 2021-12-23 2023-06-29 Google Llc Neural-Network-Based Text-to-Speech Model for Novel Speaker Generation
CN115035885A (en) * 2022-04-15 2022-09-09 科大讯飞股份有限公司 Voice synthesis method, device, equipment and storage medium
CN115497449A (en) * 2022-08-23 2022-12-20 哈尔滨工业大学(深圳) Zero-sample voice cloning method and device based on audio decoupling and fusion
CN115547293A (en) * 2022-09-27 2022-12-30 杭州电子科技大学 Multi-language voice synthesis method and system based on layered prosody prediction
CN116092474A (en) * 2023-04-07 2023-05-09 北京边锋信息技术有限公司 Speech synthesis method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
岳焕景 et al.: "Cross-domain fusion speech enhancement based on neighborhood-adaptive attention", Journal of Hunan University (Natural Sciences), vol. 50, no. 12, 31 December 2023 (2023-12-31), pages 59-68 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118135990A (en) * 2024-05-06 2024-06-04 厦门立马耀网络科技有限公司 End-to-end text speech synthesis method and system combining autoregressive
CN118135990B (en) * 2024-05-06 2024-11-05 厦门立马耀网络科技有限公司 End-to-end text speech synthesis method and system combining autoregressive
CN118658128A (en) * 2024-08-19 2024-09-17 杭州熠品智能科技有限公司 AI multi-dimensional teaching behavior analysis method and system based on classroom video

Also Published As

Publication number Publication date
CN117854473B (en) 2024-08-06

Similar Documents

Publication Publication Date Title
CN107545903B (en) A voice conversion method based on deep learning
Nachmani et al. Unsupervised singing voice conversion
CN110060690B (en) Many-to-many speaker conversion method based on STARGAN and ResNet
CN111816156A (en) Method and system for many-to-many speech conversion based on speaker style feature modeling
Lian et al. Robust disentangled variational speech representation learning for zero-shot voice conversion
CN117854473A (en) Zero sample speech synthesis method based on local association information
CN111785261A (en) Method and system for cross-language speech conversion based on disentanglement and interpretive representation
CN109767778B (en) A Speech Conversion Method Fusion Bi-LSTM and WaveNet
CN108717856A (en) A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN113724712B (en) A Bird Voice Recognition Method Based on Multi-feature Fusion and Combination Model
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN114141238A (en) Voice enhancement method fusing Transformer and U-net network
CN112259080B (en) Speech recognition method based on neural network model
CN113593588B (en) A multi-singer singing voice synthesis method and system based on generative adversarial network
CN110136686A (en) Many-to-many speaker conversion method based on STARGAN and i-vector
CN111833855A (en) Many-to-many speaker conversion method based on DenseNet STARGAN
CN110060657A (en) Multi-to-multi voice conversion method based on SN
Shah et al. Nonparallel emotional voice conversion for unseen speaker-emotion pairs using dual domain adversarial network & virtual domain pairing
Wang et al. Acoustic-to-articulatory inversion based on speech decomposition and auxiliary feature
Zhang et al. AccentSpeech: Learning accent from crowd-sourced data for target speaker TTS with accents
Gao et al. An experimental study on joint modeling of mixed-bandwidth data via deep neural networks for robust speech recognition
CN117275496A (en) End-to-end speech synthesis method based on global style tokens and singular spectrum analysis
Chandra et al. Towards the development of accent conversion model for (l1) bengali speaker using cycle consistent adversarial network (cyclegan)
CN114626424A (en) Data enhancement-based silent speech recognition method and device
Bhavani et al. A survey on various speech emotion recognition techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant