WO2022126924A1 - Training method and apparatus for speech conversion model based on domain separation - Google Patents

Training method and apparatus for speech conversion model based on domain separation

Info

Publication number
WO2022126924A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature vector
training
speech
classification error
error
Prior art date
Application number
PCT/CN2021/083956
Other languages
English (en)
French (fr)
Inventor
陈闽川
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022126924A1 publication Critical patent/WO2022126924A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G10L 21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L 21/013 Adapting to target pitch
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G10L 21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L 21/013 Adapting to target pitch
    • G10L 2021/0135 Voice conversion or morphing

Definitions

  • the present application relates to speech semantic technology, and in particular, to a method and apparatus for training a speech conversion model based on domain separation.
  • Voice conversion is used to convert the voice of speaker A so that the content of speaker A's speech is output in the voice of speaker B.
  • Speech conversion can be used not only in the back end of speech synthesis, but also for speaker identity confidentiality, dubbing of film and television works, and so on.
  • In the prior art, methods for implementing speech conversion include those based on generative adversarial networks, variational autoencoders, phonetic posteriorgrams, hidden Markov models, and so on; however, the inventors found that when a speech conversion model trained according to the prior art performs voice conversion on audio with an unbalanced corpus, the audio cannot be completely converted, and after conversion the resulting audio has low similarity to the target speaker's timbre.
  • The embodiments of the present application provide a training method and apparatus for a speech conversion model based on domain separation, in which the speech conversion model is trained with a domain separation technique, so that the trained model can not only perform complete voice conversion on unbalanced corpora but also improve voice conversion accuracy.
  • an embodiment of the present application provides a method for training a speech conversion model based on domain separation, which includes:
  • the Mel frequency cepstral coefficients of the training speech are input into the content encoder and the timbre encoder respectively, and the phoneme feature vector and the timbre feature vector of the training speech are obtained;
  • the phoneme feature vector and the timbre feature vector are respectively classified and processed to obtain a first classification error and a second classification error;
  • an embodiment of the present application provides a training device for a voice conversion model based on domain separation, which includes:
  • a feature extraction unit configured to receive a preset training voice and perform feature extraction on the training voice to obtain Mel frequency cepstral coefficients of the training voice
  • a first input unit configured to input the Mel-frequency cepstral coefficients of the training voice into the content encoder and the timbre encoder respectively, to obtain the phoneme feature vector and the timbre feature vector of the training voice;
  • An updating unit configured to calculate the overall loss of the speech conversion model according to the first classification error, the second classification error and the reconstruction error, and update network parameters of the speech conversion model according to the overall loss.
  • the Mel frequency cepstral coefficients of the training speech are input into the content encoder and the timbre encoder respectively, and the phoneme feature vector and the timbre feature vector of the training speech are obtained;
  • embodiments of the present application further provide a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when executed by a processor, the computer program causes the processor to perform the following steps :
  • the embodiments of the present application provide a method and apparatus for training a speech conversion model based on domain separation.
  • The embodiments of the present application use the domain separation technique to train the speech conversion model, so that the trained speech conversion model can not only perform complete voice conversion on unbalanced corpora but also improve voice conversion accuracy.
  • FIG. 1 is a schematic flowchart of a training method for a voice conversion model based on domain separation provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of another sub-flow of a training method for a voice conversion model based on domain separation provided by an embodiment of the present application;
  • FIG. 3 is a schematic block diagram of an apparatus for training a speech conversion model based on domain separation provided by an embodiment of the present application
  • FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • The speech conversion model includes a content encoder, a timbre encoder and a decoder; the first classifier, the second classifier and the ASR system are all used to assist the training of the speech conversion model.
  • After training is completed, voice conversion can be performed with the content encoder, the timbre encoder and the decoder in the voice conversion model.
  • the training method of the voice conversion model based on domain separation will be described in detail below. As shown in FIG. 1, the method includes the following steps S110-S150.
  • the training voice is audio information used for training the voice conversion model
  • the Mel-Frequency Cepstral Coefficients (MFCCs) of the training voice are the voice features of the training voice
  • the Mel-frequency cepstral coefficients of the training speech include phoneme characteristics and timbre characteristics of the speaker of the training speech.
  • The corpus of the training speech can be either a balanced corpus or an unbalanced corpus.
  • In one embodiment, step S110 includes the sub-steps of: acquiring the frequency spectrum of the training speech and inputting the frequency spectrum of the training speech into a preset mel filter bank to obtain the mel spectrum of the training speech; and performing cepstral analysis on the mel spectrum to obtain the Mel frequency cepstral coefficients of the training speech.
  • The sub-step of acquiring the frequency spectrum of the training speech includes: preprocessing the training speech to obtain preprocessed training speech; and performing a fast Fourier transform on the preprocessed training speech to obtain the frequency spectrum of the training speech.
  • the training speech is preprocessed to obtain the preprocessed training speech.
  • The voice signal of the training voice received by the terminal device is generally non-stationary as a whole.
  • Preprocessing makes the training voice tend toward stationarity.
  • the terminal device After receiving the voice signal of the training voice, the terminal device first performs pre-emphasis processing on the voice signal of the training voice, then divides the pre-emphasized voice signal into frames, and finally adds a window to the framed voice signal. After processing, the preprocessed training speech can be obtained.
  • the pre-emphasis processing of the speech signal is mainly to perform pre-emphasis processing on the high-frequency part of the speech signal, thereby removing the influence of lip radiation and increasing the resolution of the high-frequency part of the speech signal;
  • After pre-emphasis, the voice signal is divided into frames, but the beginning and end of each frame of the framed voice signal are discontinuous, which increases the error.
  • Windowing after framing therefore makes the framed voice signal smooth and continuous.
  • Fast Fourier transform is performed on the preprocessed training speech to obtain the frequency spectrum of the training speech.
  • After preprocessing, a voice signal composed of successive frames is obtained, and this frame-by-frame signal describes the preprocessed training voice; a short-time Fourier transform is then performed on each frame of the preprocessed training voice to obtain the frequencies of that frame, and the frequencies of each frame correspond to one time segment of the frequency spectrum of the training voice.
  • the Mel frequency cepstral coefficients of the training speech can be obtained by performing logarithmic operation on the Mel spectrum of the training speech, and performing inverse Fourier transform after the logarithmic operation is completed.
  • S120 Input the Mel frequency cepstral coefficients of the training speech into the content encoder and the timbre encoder respectively, to obtain the phoneme feature vector and the timbre feature vector of the training speech.
  • the Mel frequency cepstral coefficients of the training speech are input into the content encoder and the timbre encoder, respectively, to obtain the phoneme feature vector and the timbre feature vector of the training speech.
  • the content encoder is an encoder for extracting common features
  • the timbre encoder is a source-domain private encoder for extracting private features of source-domain data.
  • The phoneme feature vector of the training voice is used to characterize the content of the training voice, that is, the content of the training voice is its common feature, and the timbre feature vector of the training voice is used to characterize the speaker identity of the training voice.
  • That is, the speaker identity of the training speech is the private feature of the training speech.
  • By inputting the Mel-frequency cepstral coefficients of the training speech into the content encoder, the phoneme feature vector of the training speech is extracted from those coefficients.
  • By inputting the Mel-frequency cepstral coefficients of the training speech into the timbre encoder, the timbre feature vector of the training speech is extracted from those coefficients.
  • the phoneme feature vector and the timbre feature vector are classified respectively according to a preset classification rule to obtain a first classification error and a second classification error.
  • The preset classification rule is rule information used to classify the phoneme feature vector and the timbre feature vector respectively, and thereby obtain the first classification error of the phoneme feature vector and the second classification error of the timbre feature vector.
  • The first classification error is the error generated when the phoneme feature vector is classified by the preset first classifier, and the second classification error is the error generated when the timbre feature vector is classified by the preset second classifier.
  • step S130 includes sub-steps: passing the phoneme feature vector through a preset gradient reversal layer and a preset first classifier in sequence to obtain the first classification error; The feature vector is input into a preset second classifier to obtain the second classification error.
  • the first classification error is obtained by sequentially passing the phoneme feature vector through a preset gradient reversal layer and a preset first classifier.
  • The gradient reversal layer is a connection layer between the content encoder and the preset first classifier and is used to implement adversarial learning between the content encoder and the first classifier.
  • During back-propagation, the first classification error is multiplied by -λ to reverse the gradient, where λ is a positive number, so that the learning objectives of the first classifier and the content encoder are opposite, achieving adversarial learning between the first classifier and the content encoder.
  • the network parameters of the content encoder can be adjusted through the first classification error, that is, the content encoder is trained.
  • the timbre feature vector is input into a preset second classifier to obtain the second classification error.
  • The second classifier is used for label classification of the timbre feature vector, so that the timbre encoder can extract the private features of the training speech; when the timbre feature vector is input into the second classifier, the second classifier produces the second classification error, and the network parameters of the timbre encoder can be adjusted through the second classification error, that is, the timbre encoder is trained.
  • S140 Splicing the phoneme feature vector and the timbre feature vector and inputting the spliced feature vector into a decoder to obtain the reconstruction error of the Mel frequency cepstral coefficient.
  • the phoneme feature vector and the timbre feature vector are spliced and the spliced feature vector is input into the decoder to obtain the reconstruction error of the Mel frequency cepstral coefficient.
  • the dimension of the phoneme feature vector is the same as the dimension of the timbre feature vector.
  • The spliced feature vector includes both the private features extracted by the timbre encoder and the common features extracted by the content encoder; inputting the spliced feature vector into the decoder yields new Mel-frequency cepstral coefficients, and the decoder also produces a reconstruction error for reconstructing the Mel-frequency cepstral coefficients.
  • An overall loss of the speech conversion model is calculated according to the first classification error, the second classification error and the reconstruction error, and network parameters of the speech conversion model are updated according to the overall loss. Specifically, after adding the functions representing the first classification error, the second classification error and the reconstruction error with respective weights, a function representing the overall loss of the speech conversion model can be obtained.
  • step S150 includes sub-steps: calculating the difference loss between the phoneme feature vector and the timbre feature vector according to the Frobenius norm; according to the first classification error, the second The classification error, the reconstruction error, and the disparity loss calculate the overall loss of the speech conversion model.
  • the difference loss between the phoneme feature vector and the timbre feature vector is calculated according to the Frobenius norm.
  • the Frobenius norm is also called the Hilbert-Schmidt norm.
  • When p = 2 in the matrix norm, it is the Frobenius norm, defined as $\|A\|_F = \sqrt{\operatorname{tr}(A^{*}A)} = \sqrt{\sum_i \sigma_i^2}$, where $A^{*}$ denotes the conjugate transpose of A and $\sigma_i$ are the singular values of A.
  • Here A is the product of the transposed matrix corresponding to the phoneme feature vector and the matrix corresponding to the timbre feature vector, so the function representing the difference loss is expressed as $L_{difference} = \|h_c^{\top} h_p\|_F^2$, where $L_{difference}$ denotes the difference loss, $h_c^{\top}$ denotes the transposed matrix corresponding to the phoneme feature vector, and $h_p$ denotes the matrix corresponding to the timbre feature vector.
  • the norm of the vector can be understood as the length of the vector, or the distance from the vector to the zero point, or the distance between the corresponding two points.
  • An overall loss of the speech conversion model is calculated from the first classification error, the second classification error, the reconstruction error, and the disparity loss.
  • The step of calculating the overall loss of the speech conversion model according to the first classification error, the second classification error, the reconstruction error and the difference loss includes: inputting the phoneme feature vector into a preset ASR system for phoneme recognition to obtain a cross-entropy loss; and calculating the overall loss of the speech conversion model according to the first classification error, the second classification error, the reconstruction error, the difference loss and the cross-entropy loss.
  • the phoneme feature vector is input into a preset ASR system for phoneme recognition, and a cross entropy loss is obtained.
  • After the content encoder has extracted the phoneme feature vector from the training speech, the ASR system performs phoneme recognition on the phoneme feature vector to obtain the cross-entropy loss.
  • Adjusting the network parameters of the content encoder with the cross-entropy loss not only improves the accuracy of phoneme feature extraction after the content encoder is trained, but also speeds up the training of the content encoder.
  • Adding the ASR system can also prevent the content encoder from degenerating into an autoencoder network during training.
  • An overall loss of the speech conversion model is calculated from the first classification error, the second classification error, the reconstruction error, the disparity loss, and the cross-entropy loss.
  • steps S160 , S170 and S180 are further included after step S150 .
  • If the first audio of the first speaker is received, the Mel-frequency cepstral coefficients of the first audio are acquired.
  • the first audio of the first speaker is a voice signal that needs to be converted by the voice conversion model that has been trained.
  • the Mel-frequency cepstral coefficients of the first audio can be obtained from the first audio.
  • The identity information can be characterized by the timbre feature vector of the second audio; the timbre feature vector of the second audio is then spliced with the phoneme feature vector extracted from the first audio and input into the decoder to obtain speech output with the identity of the second speaker.
  • the phoneme feature vector of the first audio and the timbre feature vector of the second audio are spliced and input to the decoder to obtain the first audio of the second speaker.
  • the audio content in the first audio of the second speaker is the same as the audio content of the first audio of the first speaker, but the timbre in the first audio of the first speaker is the The timbre of the first speaker, and the timbre in the first audio of the second speaker is the timbre of the second speaker.
  • the spliced feature vector includes both the audio content of the first audio and the timbre information of the second speaker.
  • After the spliced feature vector is decoded by the decoder, the Mel-frequency cepstral coefficients of the first audio can be reconstructed, and the first audio of the second speaker is then obtained from the reconstructed Mel-frequency cepstral coefficients.
  • In the training method for a speech conversion model based on domain separation provided by the embodiments of the present application, preset training speech is received and feature extraction is performed on it to obtain the Mel frequency cepstral coefficients of the training speech;
  • the Mel-frequency cepstral coefficients of the training speech are input into the content encoder and the timbre encoder respectively to obtain the phoneme feature vector and the timbre feature vector of the training speech; the phoneme feature vector and the timbre feature vector are classified to obtain the first classification error and the second classification error;
  • the phoneme feature vector and the timbre feature vector are spliced and the spliced feature vector is input into the decoder to obtain the reconstruction error of the Mel-frequency cepstral coefficients; the overall loss of the speech conversion model is calculated according to the first classification error, the second classification error and the reconstruction error, and the network parameters of the speech conversion model are updated according to the overall loss.
  • The voice conversion model is trained with the domain separation technique, so that the trained voice conversion model can not only perform complete voice conversion on unbalanced corpora but also improve voice conversion accuracy.
  • An embodiment of the present application further provides a training apparatus 100 for a speech conversion model based on domain separation, which is used to execute any of the foregoing embodiments of the training method for a speech conversion model based on domain separation.
  • FIG. 3 is a schematic block diagram of a training apparatus 100 for a speech conversion model based on domain separation provided by an embodiment of the present application.
  • the training device 100 for the speech conversion model based on domain separation includes: a feature extraction unit 110 , a first input unit 120 , a first classification unit 130 , a splicing unit 140 and an updating unit 150 .
  • the feature extraction unit 110 is configured to receive preset training speech and perform feature extraction on the training speech to obtain the Mel frequency cepstral coefficients of the training speech.
  • the feature extraction unit 110 includes: a first acquisition unit and a cepstral analysis unit.
  • the first acquiring unit is configured to acquire the frequency spectrum of the training speech and input the frequency spectrum of the training speech into a preset mel filter bank to obtain the mel frequency spectrum of the training speech.
  • the first obtaining unit includes: a preprocessing unit and a transforming unit.
  • the preprocessing unit is used for preprocessing the training speech to obtain the preprocessed training speech.
  • a transformation unit configured to perform fast Fourier transform on the preprocessed training speech to obtain the frequency spectrum of the training speech.
  • the cepstral analysis unit is configured to perform cepstral analysis on the Mel spectrum of the training speech to obtain the Mel frequency cepstral coefficients of the training speech.
  • the first input unit 120 is configured to input the Mel-frequency cepstral coefficients of the training speech into the content encoder and the timbre encoder, respectively, to obtain the phoneme feature vector and the timbre feature vector of the training speech.
  • the first classification unit 130 is configured to perform classification processing on the phoneme feature vector and the timbre feature vector respectively according to a preset classification rule to obtain a first classification error and a second classification error.
  • the first classification unit 130 includes: a second classification unit and a third classification unit.
  • the third classification unit is configured to input the timbre feature vector into a preset second classifier to obtain the second classification error.
  • the splicing unit 140 is used for splicing the phoneme feature vector and the timbre feature vector and inputting the spliced feature vector into the decoder to obtain the reconstruction error of the Mel frequency cepstral coefficient.
  • the updating unit 150 is configured to calculate the overall loss of the speech conversion model according to the first classification error, the second classification error and the reconstruction error, and update network parameters of the speech conversion model according to the overall loss.
  • the updating unit 150 includes: a first computing unit and a second computing unit.
  • the first calculation unit is configured to calculate the difference loss between the phoneme feature vector and the timbre feature vector according to the Frobenius norm.
  • the second calculation unit is configured to calculate the overall loss of the speech conversion model according to the first classification error, the second classification error, the reconstruction error and the difference loss.
  • the second computing unit includes: a second obtaining unit and a third computing unit.
  • the second obtaining unit is configured to input the phoneme feature vector into a preset ASR system for phoneme identification, and obtain the cross entropy loss.
  • the third calculation unit is configured to calculate the overall loss of the speech conversion model according to the first classification error, the second classification error, the reconstruction error, the difference loss and the cross entropy loss.
  • the apparatus for training a speech conversion model based on domain separation further includes: a receiving unit 160 , a second input unit 170 and a third input unit 180 .
  • the receiving unit 160 is configured to acquire the Mel-frequency cepstral coefficient of the first audio if the first audio of the first speaker is received.
  • the second input unit 170 is configured to obtain the timbre feature vector of the second audio in the second speaker according to the timbre encoder and input the Mel-frequency cepstral coefficients of the first audio into the content encoder, Obtain the phoneme feature vector of the first audio.
  • the third input unit 180 is configured to concatenate the phoneme feature vector of the first audio and the timbre feature vector of the second audio and input them to the decoder to obtain the first audio of the second speaker.
  • The training apparatus 100 for a speech conversion model based on domain separation provided by the embodiments of the present application is configured to: receive preset training speech and perform feature extraction on it to obtain the Mel frequency cepstral coefficients of the training speech;
  • input the Mel frequency cepstral coefficients of the training speech into the content encoder and the timbre encoder respectively to obtain the phoneme feature vector and the timbre feature vector of the training speech;
  • classify the phoneme feature vector and the timbre feature vector to obtain the first classification error and the second classification error;
  • splice the phoneme feature vector and the timbre feature vector and input the spliced feature vector into the decoder to obtain the reconstruction error of the Mel-frequency cepstral coefficients;
  • and calculate the overall loss of the speech conversion model according to the first classification error, the second classification error and the reconstruction error, and update the network parameters of the speech conversion model according to the overall loss.
  • FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • the device 500 includes a processor 502 , a memory and a network interface 505 connected by a system bus 501 , wherein the memory may include a non-volatile storage medium 503 and an internal memory 504 .
  • the nonvolatile storage medium 503 can store an operating system 5031 and a computer program 5032 .
  • the computer program 5032 when executed, can cause the processor 502 to perform a training method for a domain-separated speech conversion model.
  • the processor 502 is used to provide computing and control capabilities to support the operation of the entire device 500 .
  • the internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503. When the computer program 5032 is executed by the processor 502, the processor 502 can execute the training method of the speech conversion model based on domain separation.
  • the network interface 505 is used for network communication, such as providing transmission of data information.
  • The specific device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the processor 502 is configured to run the computer program 5032 stored in the memory, so as to implement any embodiment of the above-mentioned training method of the speech conversion model based on domain separation.
  • the computer-readable storage medium may be a U disk, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk, or an optical disk, and other mediums that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A training method and apparatus (100) for a speech conversion model based on domain separation. The method includes: receiving training speech, performing feature extraction on it, and inputting the resulting mel-frequency cepstral coefficients into a content encoder and a timbre encoder respectively to obtain a phoneme feature vector and a timbre feature vector (S110, S120); classifying the phoneme feature vector and the timbre feature vector separately to obtain a first classification error and a second classification error (S130); splicing the phoneme feature vector and the timbre feature vector and inputting the spliced vector into a decoder to obtain a reconstruction error (S140); and calculating the overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error in order to update the speech conversion model (S150). Based on speech synthesis technology in artificial intelligence, training the speech conversion model with the domain separation technique not only enables complete voice conversion of unbalanced corpora but also improves voice conversion accuracy.

Description

Training method and apparatus for speech conversion model based on domain separation
This application claims priority to the Chinese patent application No. 202011509341.3, filed with the Chinese Patent Office on December 18, 2020 and entitled "Training method and apparatus for speech conversion model based on domain separation", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to speech and semantic technology, and in particular to a training method and apparatus for a speech conversion model based on domain separation.
Background
Voice conversion is used to convert the speech of speaker A so that the content of speaker A's speech is output in the voice of speaker B. Voice conversion can be used not only in the back end of speech synthesis, but also for speaker identity confidentiality, dubbing of film and television works, and so on. In the prior art, methods for implementing voice conversion include those based on generative adversarial networks, variational autoencoders, phonetic posteriorgrams, hidden Markov models, and so on. However, the inventors found that when a speech conversion model trained in the prior art performs voice conversion on audio with an unbalanced corpus, the audio cannot be completely converted, and after conversion the timbre of the resulting audio has low similarity to that of the target speaker.
Summary
In view of the above technical problem, embodiments of the present application provide a training method and apparatus for a speech conversion model based on domain separation. The speech conversion model is trained with a domain separation technique, so that the trained model can not only perform complete voice conversion on unbalanced corpora but also improve voice conversion accuracy.
In a first aspect, an embodiment of the present application provides a training method for a speech conversion model based on domain separation, which includes:
receiving preset training speech and performing feature extraction on the training speech to obtain mel-frequency cepstral coefficients of the training speech;
inputting the mel-frequency cepstral coefficients of the training speech into a content encoder and a timbre encoder respectively to obtain a phoneme feature vector and a timbre feature vector of the training speech;
classifying the phoneme feature vector and the timbre feature vector separately according to a preset classification rule to obtain a first classification error and a second classification error;
splicing the phoneme feature vector and the timbre feature vector and inputting the spliced feature vector into a decoder to obtain a reconstruction error of the mel-frequency cepstral coefficients; and
calculating an overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error, and updating network parameters of the speech conversion model according to the overall loss.
In a second aspect, an embodiment of the present application provides a training apparatus for a speech conversion model based on domain separation, which includes:
a feature extraction unit configured to receive preset training speech and perform feature extraction on the training speech to obtain mel-frequency cepstral coefficients of the training speech;
a first input unit configured to input the mel-frequency cepstral coefficients of the training speech into a content encoder and a timbre encoder respectively to obtain a phoneme feature vector and a timbre feature vector of the training speech;
a first classification unit configured to classify the phoneme feature vector and the timbre feature vector separately according to a preset classification rule to obtain a first classification error and a second classification error;
a splicing unit configured to splice the phoneme feature vector and the timbre feature vector and input the spliced feature vector into a decoder to obtain a reconstruction error of the mel-frequency cepstral coefficients; and
an updating unit configured to calculate an overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error, and to update network parameters of the speech conversion model according to the overall loss.
In a third aspect, an embodiment of the present application further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, performs the following steps:
receiving preset training speech and performing feature extraction on the training speech to obtain mel-frequency cepstral coefficients of the training speech;
inputting the mel-frequency cepstral coefficients of the training speech into a content encoder and a timbre encoder respectively to obtain a phoneme feature vector and a timbre feature vector of the training speech;
classifying the phoneme feature vector and the timbre feature vector separately according to a preset classification rule to obtain a first classification error and a second classification error;
splicing the phoneme feature vector and the timbre feature vector and inputting the spliced feature vector into a decoder to obtain a reconstruction error of the mel-frequency cepstral coefficients; and
calculating an overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error, and updating network parameters of the speech conversion model according to the overall loss.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
receiving preset training speech and performing feature extraction on the training speech to obtain mel-frequency cepstral coefficients of the training speech;
inputting the mel-frequency cepstral coefficients of the training speech into a content encoder and a timbre encoder respectively to obtain a phoneme feature vector and a timbre feature vector of the training speech;
classifying the phoneme feature vector and the timbre feature vector separately according to a preset classification rule to obtain a first classification error and a second classification error;
splicing the phoneme feature vector and the timbre feature vector and inputting the spliced feature vector into a decoder to obtain a reconstruction error of the mel-frequency cepstral coefficients; and
calculating an overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error, and updating network parameters of the speech conversion model according to the overall loss.
Embodiments of the present application provide a training method and apparatus for a speech conversion model based on domain separation. By training the speech conversion model with a domain separation technique, the trained model can not only perform complete voice conversion on unbalanced corpora but also improve voice conversion accuracy.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a training method for a speech conversion model based on domain separation provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of another sub-flow of the training method for a speech conversion model based on domain separation provided by an embodiment of the present application;
FIG. 3 is a schematic block diagram of a training apparatus for a speech conversion model based on domain separation provided by an embodiment of the present application;
FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a training method for a speech conversion model based on domain separation provided by an embodiment of the present application. The training method for a speech conversion model based on domain separation of the embodiments of the present application is applied in a terminal device and is executed by application software installed in the terminal device, where the terminal device is a device with Internet access, such as a desktop computer, a laptop, a tablet computer or a mobile phone. It should be noted that in the specific embodiments of the present application, the speech conversion model includes a content encoder, a timbre encoder and a decoder; the first classifier, the second classifier and the ASR system are all used to assist the training of the speech conversion model. After training of the speech conversion model is completed, voice conversion can be performed with the content encoder, the timbre encoder and the decoder of the speech conversion model.
The training method for a speech conversion model based on domain separation is described in detail below. As shown in FIG. 1, the method includes the following steps S110 to S150.
S110. Receive preset training speech and perform feature extraction on the training speech to obtain mel-frequency cepstral coefficients of the training speech.
Preset training speech is received and feature extraction is performed on it to obtain the mel-frequency cepstral coefficients of the training speech. Specifically, the training speech is the audio information used to train the speech conversion model, and the Mel-Frequency Cepstral Coefficients (MFCCs) of the training speech are its speech features; the MFCCs of the training speech contain the phoneme characteristics and the timbre characteristics of the speaker of the training speech. In the embodiments of the present application, the corpus of the training speech may be either a balanced corpus or an unbalanced corpus.
In another embodiment, step S110 includes the sub-steps of: acquiring the frequency spectrum of the training speech and inputting the frequency spectrum of the training speech into a preset mel filter bank to obtain the mel spectrum of the training speech; and performing cepstral analysis on the mel spectrum of the training speech to obtain the mel-frequency cepstral coefficients of the training speech.
The frequency spectrum of the training speech is acquired and input into the preset mel filter bank to obtain the mel spectrum of the training speech. Specifically, the terminal device receives the training speech as a speech signal; after receiving the training speech, a Fourier transform is applied to each frame of the speech signal of the training speech to obtain the spectrogram describing the training speech.
In another embodiment, the sub-step of acquiring the frequency spectrum of the training speech and inputting it into the preset mel filter bank to obtain the mel spectrum of the training speech includes: preprocessing the training speech to obtain preprocessed training speech; and performing a fast Fourier transform on the preprocessed training speech to obtain the frequency spectrum of the training speech.
The training speech is preprocessed to obtain preprocessed training speech. Specifically, the speech signal of the training speech received by the terminal device is generally non-stationary as a whole, and preprocessing makes the training speech tend toward stationarity. After receiving the speech signal of the training speech, the terminal device first applies pre-emphasis to the speech signal, then divides the pre-emphasized signal into frames, and finally applies a window to the framed signal, which yields the preprocessed training speech. Pre-emphasis mainly boosts the high-frequency part of the speech signal, thereby removing the influence of lip radiation and increasing the resolution of the high-frequency part; after pre-emphasis the signal is split into frames, but the beginning and end of each frame would then be discontinuous, which increases the error, so applying a window after framing makes the framed speech signal smooth and continuous.
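The application does not give concrete parameter values for this preprocessing. Purely as an illustration, a minimal NumPy sketch might look as follows; the frame length of 400 samples, hop of 160 samples and pre-emphasis coefficient of 0.97 are assumed values, not taken from the application:

```python
import numpy as np

def preprocess(signal, frame_len=400, hop_len=160, alpha=0.97):
    """Pre-emphasis, framing and windowing of a raw waveform (assumed parameters)."""
    # Pre-emphasis boosts the high-frequency part: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split the (assumed sufficiently long) signal into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    # Windowing tapers the frame edges so adjacent frames stay smooth and continuous
    return frames * np.hamming(frame_len)
```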
A fast Fourier transform is performed on the preprocessed training speech to obtain the frequency spectrum of the training speech. Specifically, after preprocessing, a speech signal composed of successive frames is obtained, and this frame-by-frame signal describes the preprocessed training speech; a short-time Fourier transform is then applied to each frame of the preprocessed training speech to obtain the frequencies of that frame, and the frequencies of each frame correspond to one time segment of the frequency spectrum of the training speech.
Cepstral analysis is performed on the mel spectrum of the training speech to obtain the mel-frequency cepstral coefficients of the training speech. Specifically, the MFCCs of the training speech can be obtained by taking the logarithm of the mel spectrum and, after the logarithm operation, applying an inverse Fourier transform.
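As a rough sketch of this feature extraction pipeline (STFT, mel filter bank, logarithm, inverse transform realised here as a DCT), one could use librosa as below; the sample rate, FFT size, number of mel bands and number of coefficients are assumptions for illustration, not values specified in the application:

```python
import numpy as np
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Spectrum -> mel filter bank -> log -> DCT, i.e. MFCC extraction."""
    y, sr = librosa.load(wav_path, sr=sr)
    # Power spectrum of each frame via the short-time Fourier transform
    spec = np.abs(librosa.stft(y, n_fft=400, hop_length=160)) ** 2
    # Passing the spectrum through a mel filter bank gives the mel spectrum
    mel = librosa.feature.melspectrogram(S=spec, sr=sr, n_mels=80)
    # Cepstral analysis: logarithm followed by a DCT yields the MFCCs
    return librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=n_mfcc)
```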
S120. Input the mel-frequency cepstral coefficients of the training speech into the content encoder and the timbre encoder respectively to obtain the phoneme feature vector and the timbre feature vector of the training speech.
The mel-frequency cepstral coefficients of the training speech are input into the content encoder and the timbre encoder respectively to obtain the phoneme feature vector and the timbre feature vector of the training speech. Specifically, the content encoder is an encoder for extracting shared (common) features, and the timbre encoder is a source-domain private encoder for extracting private features of the source-domain data. In the embodiments of the present application, the phoneme feature vector of the training speech characterizes its content, that is, the content of the training speech is its common feature, and the timbre feature vector characterizes the speaker identity of the training speech, that is, the speaker identity is the private feature of the training speech. By inputting the MFCCs of the training speech into the content encoder, the phoneme feature vector of the training speech can be extracted from the MFCCs; by inputting the MFCCs of the training speech into the timbre encoder, the timbre feature vector of the training speech can be extracted from the MFCCs.
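The application does not fix the network architecture of the two encoders. Purely as an illustrative sketch, they could be implemented as recurrent networks in PyTorch; the GRU layers, hidden size of 128 and the use of the last hidden state as an utterance-level timbre embedding are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Common-feature encoder: maps MFCC frames to per-frame phoneme feature vectors."""
    def __init__(self, n_mfcc=13, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, dim, num_layers=2, batch_first=True)

    def forward(self, mfcc):              # mfcc: (batch, frames, n_mfcc)
        out, _ = self.rnn(mfcc)
        return out                        # phoneme feature vectors, one per frame

class TimbreEncoder(nn.Module):
    """Source-domain private encoder: maps MFCCs to one timbre feature vector."""
    def __init__(self, n_mfcc=13, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, dim, num_layers=2, batch_first=True)

    def forward(self, mfcc):
        _, h = self.rnn(mfcc)
        return h[-1]                      # utterance-level speaker (timbre) embedding
```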
S130. Classify the phoneme feature vector and the timbre feature vector separately according to a preset classification rule to obtain a first classification error and a second classification error.
The phoneme feature vector and the timbre feature vector are classified separately according to the preset classification rule to obtain the first classification error and the second classification error. Specifically, the preset classification rule is the rule information used to classify the phoneme feature vector and the timbre feature vector separately and thereby obtain the first classification error of the phoneme feature vector and the second classification error of the timbre feature vector. The first classification error is the error produced when the phoneme feature vector is classified by a preset first classifier, and the second classification error is the error produced when the timbre feature vector is classified by a preset second classifier.
In another embodiment, step S130 includes the sub-steps of: passing the phoneme feature vector through a preset gradient reversal layer and a preset first classifier in sequence to obtain the first classification error; and inputting the timbre feature vector into a preset second classifier to obtain the second classification error.
The phoneme feature vector is passed through the preset gradient reversal layer and the preset first classifier in sequence to obtain the first classification error. Specifically, the gradient reversal layer is a connection layer between the content encoder and the preset first classifier and is used to implement adversarial learning between the content encoder and the first classifier. During back-propagation, the first classification error produced by the first classifier is multiplied by -λ to reverse the gradient, where λ is a positive number, so that the learning objectives of the first classifier and the content encoder are opposite, achieving adversarial learning between them. The network parameters of the content encoder can then be adjusted through the first classification error, that is, the content encoder is trained.
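A gradient reversal layer of this kind is commonly written as a custom autograd function that is the identity in the forward pass and multiplies the incoming gradient by -λ in the backward pass. The following PyTorch sketch is only an illustration; the linear first classifier and the number of training speakers (10) are assumptions:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # no gradient for lam itself

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# First classifier: tries to predict the speaker from the phoneme feature vector;
# the reversed gradient pushes the content encoder to discard speaker information.
first_classifier = nn.Linear(128, 10)          # 10 training speakers is an assumption
# first_class_error = nn.CrossEntropyLoss()(first_classifier(grad_reverse(h_c)), speaker_labels)
```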
The timbre feature vector is input into the preset second classifier to obtain the second classification error. Specifically, the second classifier performs label classification on the timbre feature vector so that the timbre encoder learns to extract the private features of the training speech; when the timbre feature vector is input into the second classifier, the second classifier produces the second classification error, and the network parameters of the timbre encoder can be adjusted through the second classification error, that is, the timbre encoder is trained.
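The second classifier is an ordinary speaker-label classifier on the timbre feature vector, with no gradient reversal. A minimal sketch, with an assumed two-layer network and an assumed number of speakers:

```python
import torch.nn as nn

# Plain speaker classification on the timbre feature vector keeps the private
# (speaker) information inside the timbre encoder.
second_classifier = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),                  # 10 training speakers is an assumption
)
cross_entropy = nn.CrossEntropyLoss()
# second_class_error = cross_entropy(second_classifier(h_p), speaker_labels)
```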
S140. Splice the phoneme feature vector and the timbre feature vector and input the spliced feature vector into the decoder to obtain the reconstruction error of the mel-frequency cepstral coefficients.
The phoneme feature vector and the timbre feature vector are spliced and the spliced feature vector is input into the decoder to obtain the reconstruction error of the mel-frequency cepstral coefficients. Specifically, before splicing, the dimension of the phoneme feature vector is the same as that of the timbre feature vector; splicing the two vectors end to end gives the spliced feature vector, which contains both the private features extracted by the timbre encoder and the common features extracted by the content encoder. Inputting the spliced feature vector into the decoder yields new mel-frequency cepstral coefficients, and the decoder also produces the reconstruction error of reconstructing the mel-frequency cepstral coefficients.
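A sketch of the splicing and decoding step, under the same assumptions as the encoder sketch above: per-frame phoneme vectors, a single utterance-level timbre vector that is repeated along the time axis so the two parts can be concatenated, a GRU decoder, and mean-squared error as the reconstruction error. None of these choices is prescribed by the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Rebuilds MFCC frames from the spliced phoneme + timbre feature vector."""
    def __init__(self, dim=128, n_mfcc=13):
        super().__init__()
        self.rnn = nn.GRU(2 * dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_mfcc)

    def forward(self, spliced):
        h, _ = self.rnn(spliced)
        return self.out(h)

def reconstruction_error(decoder, phoneme_vecs, timbre_vec, mfcc):
    # Repeat the timbre vector over time so its dimension matches the phoneme vectors,
    # then splice (concatenate) the two along the feature axis.
    timbre = timbre_vec.unsqueeze(1).expand(-1, phoneme_vecs.size(1), -1)
    spliced = torch.cat([phoneme_vecs, timbre], dim=-1)
    # Reconstruction error between the decoded and the original MFCCs
    return F.mse_loss(decoder(spliced), mfcc)
```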
S150. Calculate the overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error, and update the network parameters of the speech conversion model according to the overall loss.
The overall loss of the speech conversion model is calculated from the first classification error, the second classification error and the reconstruction error, and the network parameters of the speech conversion model are updated according to the overall loss. Specifically, by adding the functions representing the first classification error, the second classification error and the reconstruction error with their respective weights, the function representing the overall loss of the speech conversion model is obtained. The function representing the overall loss is expressed as $L = L_{recon} + b\,L_{class1} + d\,L_{class2}$, where $L$ is the overall loss, $L_{recon}$ is the reconstruction error, $L_{class1}$ is the first classification error, $L_{class2}$ is the second classification error, $b$ is the weight of the first classification error, and $d$ is the weight of the second classification error.
In another embodiment, step S150 includes the sub-steps of: calculating a difference loss between the phoneme feature vector and the timbre feature vector according to the Frobenius norm; and calculating the overall loss of the speech conversion model from the first classification error, the second classification error, the reconstruction error and the difference loss.
The difference loss between the phoneme feature vector and the timbre feature vector is calculated according to the Frobenius norm. Specifically, the Frobenius norm, also called the Hilbert-Schmidt norm, is the matrix norm with $p = 2$ and is defined as

$$\|A\|_F = \sqrt{\operatorname{tr}(A^{*}A)} = \sqrt{\sum_{i} \sigma_i^{2}},$$

where $A^{*}$ denotes the conjugate transpose of $A$ and $\sigma_i$ are the singular values of $A$. In the embodiments of the present application, $A$ is the product of the transposed matrix corresponding to the phoneme feature vector and the matrix corresponding to the timbre feature vector, so the function representing the difference loss is expressed as

$$L_{difference} = \left\| h_c^{\top} h_p \right\|_F^{2},$$

where $L_{difference}$ denotes the difference loss, $h_c^{\top}$ denotes the transposed matrix corresponding to the phoneme feature vector, and $h_p$ denotes the matrix corresponding to the timbre feature vector. The norm of a vector can be understood as the length of the vector, the distance from the vector to the origin, or the distance between two corresponding points. Adding the difference loss further improves the accuracy with which the content encoder extracts the common features of the training speech and the timbre encoder extracts its private features, thereby making the speech characteristics of the converted speaker more prominent.
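The difference loss can be written directly from the formula above. A short PyTorch sketch, assuming h_c and h_p are matrices whose rows are the phoneme and timbre feature vectors of a batch:

```python
import torch

def difference_loss(h_c, h_p):
    """L_difference = ||h_c^T . h_p||_F^2, pushing the common (phoneme) and
    private (timbre) feature spaces toward orthogonality."""
    return torch.norm(h_c.t() @ h_p, p='fro') ** 2
```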
The overall loss of the speech conversion model is calculated from the first classification error, the second classification error, the reconstruction error and the difference loss. In the embodiments of the present application, the function for the overall loss of the speech conversion model is expressed as $L = L_{recon} + b\,L_{class1} + c\,L_{difference} + d\,L_{class2}$, where $L$ is the overall loss, $L_{recon}$ is the reconstruction error, $L_{class1}$ is the first classification error, $L_{difference}$ is the difference loss, $L_{class2}$ is the second classification error, $b$ is the weight of the first classification error, $c$ is the weight of the difference loss, and $d$ is the weight of the second classification error.
In another embodiment, the step of calculating the overall loss of the speech conversion model from the first classification error, the second classification error, the reconstruction error and the difference loss includes: inputting the phoneme feature vector into a preset ASR system for phoneme recognition to obtain a cross-entropy loss; and calculating the overall loss of the speech conversion model from the first classification error, the second classification error, the reconstruction error, the difference loss and the cross-entropy loss.
The phoneme feature vector is input into the preset ASR system for phoneme recognition to obtain the cross-entropy loss. Specifically, after the content encoder has extracted the phoneme feature vector from the training speech, the ASR system performs phoneme recognition on the phoneme feature vector to obtain the cross-entropy loss. Adjusting the network parameters of the content encoder with this cross-entropy loss not only improves the accuracy of phoneme feature extraction after the content encoder is trained, but also speeds up training of the content encoder. In addition, adding the ASR system during training also prevents the content encoder from degenerating into an autoencoder network.
The overall loss of the speech conversion model is calculated from the first classification error, the second classification error, the reconstruction error, the difference loss and the cross-entropy loss. In the embodiments of the present application, the function for the overall loss of the speech conversion model is expressed as $L = L_{recon} + a\,L_{ce} + b\,L_{class1} + c\,L_{difference} + d\,L_{class2}$, where $L$ is the overall loss, $L_{recon}$ is the reconstruction error, $L_{ce}$ is the cross-entropy loss, $L_{class1}$ is the first classification error, $L_{difference}$ is the difference loss, $L_{class2}$ is the second classification error, $a$ is the weight of the cross-entropy loss, $b$ is the weight of the first classification error, $c$ is the weight of the difference loss, and $d$ is the weight of the second classification error.
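Combining the terms, one training step could look like the following sketch; the weight values are placeholders, since the application does not specify numerical values for a, b, c or d:

```python
# Overall loss L = L_recon + a*L_ce + b*L_class1 + c*L_difference + d*L_class2.
# The weights below are placeholder hyperparameters, not values from the application.
a, b, c, d = 1.0, 1.0, 0.05, 1.0

def overall_loss(l_recon, l_ce, l_class1, l_difference, l_class2):
    return l_recon + a * l_ce + b * l_class1 + c * l_difference + d * l_class2

# total = overall_loss(l_recon, l_ce, l_class1, l_difference, l_class2)
# total.backward(); optimizer.step()   # update the network parameters of the model
```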
In another embodiment, as shown in FIG. 2, steps S160, S170 and S180 are further included after step S150.
S160. If first audio of a first speaker is received, acquire the mel-frequency cepstral coefficients of the first audio.
If first audio of a first speaker is received, the mel-frequency cepstral coefficients of the first audio are acquired. Specifically, the first audio of the first speaker is the speech signal to be converted by the trained speech conversion model; after receiving the first audio of the first speaker, the terminal device can obtain the mel-frequency cepstral coefficients of the first audio from the first audio.
S170. Obtain a timbre feature vector of second audio of a second speaker using the timbre encoder, and input the mel-frequency cepstral coefficients of the first audio into the content encoder to obtain a phoneme feature vector of the first audio.
The timbre feature vector of the second audio of the second speaker is obtained using the timbre encoder, and the mel-frequency cepstral coefficients of the first audio are input into the content encoder to obtain the phoneme feature vector of the first audio. Specifically, the second speaker is the person whose voice is to be used to render the first speaker's first audio after voice conversion; that is, the timbre of the speaker in the speech obtained after converting the first speaker's first audio is the voice characteristic of the second speaker, and the second audio is any audio of the second speaker. When the first audio needs to be converted into the second speaker's voice, it suffices to extract from the second speaker's second audio the identity information characterizing the second speaker, which can be represented by the timbre feature vector of the second audio; the timbre feature vector of the second audio is then spliced with the phoneme feature vector extracted from the first audio and input into the decoder, which produces speech output with the identity of the second speaker.
S180. Splice the phoneme feature vector of the first audio with the timbre feature vector of the second audio and input the spliced vector into the decoder to obtain the first audio of the second speaker.
The phoneme feature vector of the first audio and the timbre feature vector of the second audio are spliced and input into the decoder to obtain the first audio of the second speaker. Specifically, the audio content of the second speaker's first audio is the same as that of the first speaker's first audio, but the timbre in the first speaker's first audio is the timbre of the first speaker, while the timbre in the second speaker's first audio is the timbre of the second speaker. After splicing, the spliced feature vector contains both the audio content of the first audio and the timbre information of the second speaker; after the spliced feature vector is decoded by the decoder, the mel-frequency cepstral coefficients of the first audio can be reconstructed, and the first audio of the second speaker is then obtained from the reconstructed mel-frequency cepstral coefficients.
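After training, conversion only needs the content encoder, the timbre encoder and the decoder. A sketch of the inference path described in S160 to S180, under the same assumptions as the sketches above; a vocoder that turns the reconstructed MFCCs back into a waveform is outside the scope of this sketch:

```python
import torch

@torch.no_grad()
def convert(first_mfcc, second_mfcc, content_enc, timbre_enc, decoder):
    """Re-synthesise the first speaker's content with the second speaker's timbre."""
    phoneme = content_enc(first_mfcc)       # content of the first speaker's first audio
    timbre = timbre_enc(second_mfcc)        # identity of the second speaker
    timbre = timbre.unsqueeze(1).expand(-1, phoneme.size(1), -1)
    spliced = torch.cat([phoneme, timbre], dim=-1)
    return decoder(spliced)                 # reconstructed MFCCs in the second speaker's voice
```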
In the training method for a speech conversion model based on domain separation provided by the embodiments of the present application, preset training speech is received and feature extraction is performed on it to obtain the mel-frequency cepstral coefficients of the training speech; the mel-frequency cepstral coefficients are input into the content encoder and the timbre encoder respectively to obtain the phoneme feature vector and the timbre feature vector of the training speech; the phoneme feature vector and the timbre feature vector are classified separately according to the preset classification rule to obtain the first classification error and the second classification error; the phoneme feature vector and the timbre feature vector are spliced and the spliced feature vector is input into the decoder to obtain the reconstruction error of the mel-frequency cepstral coefficients; and the overall loss of the speech conversion model is calculated from the first classification error, the second classification error and the reconstruction error, and the network parameters of the speech conversion model are updated according to the overall loss. By training the speech conversion model with the domain separation technique, the embodiments of the present application enable the trained model not only to perform complete voice conversion on unbalanced corpora but also to improve voice conversion accuracy.
An embodiment of the present application further provides a training apparatus 100 for a speech conversion model based on domain separation, which is used to perform any embodiment of the foregoing training method for a speech conversion model based on domain separation. Specifically, referring to FIG. 3, FIG. 3 is a schematic block diagram of the training apparatus 100 for a speech conversion model based on domain separation provided by an embodiment of the present application.
As shown in FIG. 3, the training apparatus 100 for a speech conversion model based on domain separation includes: a feature extraction unit 110, a first input unit 120, a first classification unit 130, a splicing unit 140 and an updating unit 150.
The feature extraction unit 110 is configured to receive preset training speech and perform feature extraction on the training speech to obtain mel-frequency cepstral coefficients of the training speech.
In other embodiments, the feature extraction unit 110 includes: a first acquisition unit and a cepstral analysis unit.
The first acquisition unit is configured to acquire the frequency spectrum of the training speech and input it into a preset mel filter bank to obtain the mel spectrum of the training speech.
In other embodiments, the first acquisition unit includes: a preprocessing unit and a transform unit.
The preprocessing unit is configured to preprocess the training speech to obtain preprocessed training speech.
The transform unit is configured to perform a fast Fourier transform on the preprocessed training speech to obtain the frequency spectrum of the training speech.
The cepstral analysis unit is configured to perform cepstral analysis on the mel spectrum of the training speech to obtain the mel-frequency cepstral coefficients of the training speech.
The first input unit 120 is configured to input the mel-frequency cepstral coefficients of the training speech into the content encoder and the timbre encoder respectively to obtain the phoneme feature vector and the timbre feature vector of the training speech.
The first classification unit 130 is configured to classify the phoneme feature vector and the timbre feature vector separately according to a preset classification rule to obtain a first classification error and a second classification error.
In other embodiments, the first classification unit 130 includes: a second classification unit and a third classification unit.
The second classification unit is configured to pass the phoneme feature vector through a preset gradient reversal layer and a preset first classifier in sequence to obtain the first classification error.
The third classification unit is configured to input the timbre feature vector into a preset second classifier to obtain the second classification error.
The splicing unit 140 is configured to splice the phoneme feature vector and the timbre feature vector and input the spliced feature vector into the decoder to obtain the reconstruction error of the mel-frequency cepstral coefficients.
The updating unit 150 is configured to calculate the overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error and to update the network parameters of the speech conversion model according to the overall loss.
In other embodiments, the updating unit 150 includes: a first calculation unit and a second calculation unit.
The first calculation unit is configured to calculate the difference loss between the phoneme feature vector and the timbre feature vector according to the Frobenius norm.
The second calculation unit is configured to calculate the overall loss of the speech conversion model from the first classification error, the second classification error, the reconstruction error and the difference loss.
In other embodiments, the second calculation unit includes: a second acquisition unit and a third calculation unit.
The second acquisition unit is configured to input the phoneme feature vector into a preset ASR system for phoneme recognition to obtain a cross-entropy loss.
The third calculation unit is configured to calculate the overall loss of the speech conversion model from the first classification error, the second classification error, the reconstruction error, the difference loss and the cross-entropy loss.
In other embodiments, the training apparatus for a speech conversion model based on domain separation further includes: a receiving unit 160, a second input unit 170 and a third input unit 180.
The receiving unit 160 is configured to acquire the mel-frequency cepstral coefficients of first audio of a first speaker if the first audio is received.
The second input unit 170 is configured to obtain a timbre feature vector of second audio of a second speaker using the timbre encoder and to input the mel-frequency cepstral coefficients of the first audio into the content encoder to obtain a phoneme feature vector of the first audio.
The third input unit 180 is configured to splice the phoneme feature vector of the first audio with the timbre feature vector of the second audio and input the spliced vector into the decoder to obtain the first audio of the second speaker.
The training apparatus 100 for a speech conversion model based on domain separation provided by the embodiments of the present application is configured to perform the above operations of: receiving preset training speech and performing feature extraction on it to obtain the mel-frequency cepstral coefficients of the training speech; inputting the mel-frequency cepstral coefficients into the content encoder and the timbre encoder respectively to obtain the phoneme feature vector and the timbre feature vector of the training speech; classifying the phoneme feature vector and the timbre feature vector separately according to the preset classification rule to obtain the first classification error and the second classification error; splicing the phoneme feature vector and the timbre feature vector and inputting the spliced feature vector into the decoder to obtain the reconstruction error of the mel-frequency cepstral coefficients; and calculating the overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error and updating the network parameters of the speech conversion model according to the overall loss.
Referring to FIG. 4, FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
Referring to FIG. 4, the device 500 includes a processor 502, a memory and a network interface 505 connected via a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 causes the processor 502 to perform the training method for a speech conversion model based on domain separation. The processor 502 provides computing and control capabilities to support the operation of the entire device 500. The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, the processor 502 can perform the training method for a speech conversion model based on domain separation. The network interface 505 is used for network communication, such as transmission of data information. A person skilled in the art will understand that the structure shown in FIG. 4 is only a block diagram of part of the structure related to the solution of the present application and does not limit the device 500 to which the solution is applied; the specific device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory, so as to implement any embodiment of the above training method for a speech conversion model based on domain separation.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware. The computer program may be stored in a storage medium, which may be a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the process steps of the embodiments of the above methods.
Accordingly, the present application further provides a computer-readable storage medium, which may be non-volatile or volatile. The storage medium stores a computer program which, when executed by a processor, implements any embodiment of the above training method for a speech conversion model based on domain separation.
The computer-readable storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disc, or any other medium capable of storing program code.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, and the division of the units is only a division by logical function; there may be other divisions in actual implementation. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus, device and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. The above is only the specific implementation of the present application, but the protection scope of the present application is not limited thereto; any equivalent modification or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A training method for a speech conversion model based on domain separation, comprising the following steps:
    receiving preset training speech and performing feature extraction on the training speech to obtain mel-frequency cepstral coefficients of the training speech;
    inputting the mel-frequency cepstral coefficients of the training speech into a content encoder and a timbre encoder respectively to obtain a phoneme feature vector and a timbre feature vector of the training speech;
    classifying the phoneme feature vector and the timbre feature vector separately according to a preset classification rule to obtain a first classification error and a second classification error;
    splicing the phoneme feature vector and the timbre feature vector and inputting the spliced feature vector into a decoder to obtain a reconstruction error of the mel-frequency cepstral coefficients; and
    calculating an overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error, and updating network parameters of the speech conversion model according to the overall loss.
  2. The training method for a speech conversion model based on domain separation according to claim 1, wherein performing feature extraction on the training speech to obtain the mel-frequency cepstral coefficients of the training speech comprises:
    acquiring a frequency spectrum of the training speech and inputting the frequency spectrum of the training speech into a preset mel filter bank to obtain a mel spectrum of the training speech; and
    performing cepstral analysis on the mel spectrum of the training speech to obtain the mel-frequency cepstral coefficients of the training speech.
  3. The training method for a speech conversion model based on domain separation according to claim 2, wherein acquiring the frequency spectrum of the training speech comprises:
    preprocessing the training speech to obtain preprocessed training speech; and
    performing a fast Fourier transform on the preprocessed training speech to obtain the frequency spectrum of the training speech.
  4. The training method for a speech conversion model based on domain separation according to claim 1, wherein classifying the phoneme feature vector and the timbre feature vector separately according to the preset classification rule to obtain the first classification error and the second classification error comprises:
    passing the phoneme feature vector through a preset gradient reversal layer and a preset first classifier in sequence to obtain the first classification error; and
    inputting the timbre feature vector into a preset second classifier to obtain the second classification error.
  5. The training method for a speech conversion model based on domain separation according to claim 4, wherein calculating the overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error comprises:
    calculating a difference loss between the phoneme feature vector and the timbre feature vector according to a Frobenius norm; and
    calculating the overall loss of the speech conversion model from the first classification error, the second classification error, the reconstruction error and the difference loss.
  6. The training method for a speech conversion model based on domain separation according to claim 5, wherein calculating the overall loss of the speech conversion model from the first classification error, the second classification error, the reconstruction error and the difference loss comprises:
    inputting the phoneme feature vector into a preset ASR system for phoneme recognition to obtain a cross-entropy loss; and
    calculating the overall loss of the speech conversion model from the first classification error, the second classification error, the reconstruction error, the difference loss and the cross-entropy loss.
  7. The training method for a speech conversion model based on domain separation according to any one of claims 1 to 6, further comprising, after updating the network parameters of the speech conversion model according to the overall loss:
    if first audio of a first speaker is received, acquiring mel-frequency cepstral coefficients of the first audio;
    obtaining a timbre feature vector of second audio of a second speaker using the timbre encoder and inputting the mel-frequency cepstral coefficients of the first audio into the content encoder to obtain a phoneme feature vector of the first audio; and
    splicing the phoneme feature vector of the first audio with the timbre feature vector of the second audio and inputting the spliced vector into the decoder to obtain first audio of the second speaker.
  8. A training apparatus for a speech conversion model based on domain separation, comprising:
    a feature extraction unit configured to receive preset training speech and perform feature extraction on the training speech to obtain mel-frequency cepstral coefficients of the training speech;
    a first input unit configured to input the mel-frequency cepstral coefficients of the training speech into a content encoder and a timbre encoder respectively to obtain a phoneme feature vector and a timbre feature vector of the training speech;
    a first classification unit configured to classify the phoneme feature vector and the timbre feature vector separately according to a preset classification rule to obtain a first classification error and a second classification error;
    a splicing unit configured to splice the phoneme feature vector and the timbre feature vector and input the spliced feature vector into a decoder to obtain a reconstruction error of the mel-frequency cepstral coefficients; and
    an updating unit configured to calculate an overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error and to update network parameters of the speech conversion model according to the overall loss.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, performs the following steps:
    receiving preset training speech and performing feature extraction on the training speech to obtain mel-frequency cepstral coefficients of the training speech;
    inputting the mel-frequency cepstral coefficients of the training speech into a content encoder and a timbre encoder respectively to obtain a phoneme feature vector and a timbre feature vector of the training speech;
    classifying the phoneme feature vector and the timbre feature vector separately according to a preset classification rule to obtain a first classification error and a second classification error;
    splicing the phoneme feature vector and the timbre feature vector and inputting the spliced feature vector into a decoder to obtain a reconstruction error of the mel-frequency cepstral coefficients; and
    calculating an overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error, and updating network parameters of the speech conversion model according to the overall loss.
  10. The computer device according to claim 9, wherein performing feature extraction on the training speech to obtain the mel-frequency cepstral coefficients of the training speech comprises:
    acquiring a frequency spectrum of the training speech and inputting the frequency spectrum of the training speech into a preset mel filter bank to obtain a mel spectrum of the training speech; and
    performing cepstral analysis on the mel spectrum of the training speech to obtain the mel-frequency cepstral coefficients of the training speech.
  11. The computer device according to claim 10, wherein acquiring the frequency spectrum of the training speech comprises:
    preprocessing the training speech to obtain preprocessed training speech; and
    performing a fast Fourier transform on the preprocessed training speech to obtain the frequency spectrum of the training speech.
  12. The computer device according to claim 9, wherein classifying the phoneme feature vector and the timbre feature vector separately according to the preset classification rule to obtain the first classification error and the second classification error comprises:
    passing the phoneme feature vector through a preset gradient reversal layer and a preset first classifier in sequence to obtain the first classification error; and
    inputting the timbre feature vector into a preset second classifier to obtain the second classification error.
  13. The computer device according to claim 12, wherein calculating the overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error comprises:
    calculating a difference loss between the phoneme feature vector and the timbre feature vector according to a Frobenius norm; and
    calculating the overall loss of the speech conversion model from the first classification error, the second classification error, the reconstruction error and the difference loss.
  14. The computer device according to claim 13, wherein calculating the overall loss of the speech conversion model from the first classification error, the second classification error, the reconstruction error and the difference loss comprises:
    inputting the phoneme feature vector into a preset ASR system for phoneme recognition to obtain a cross-entropy loss; and
    calculating the overall loss of the speech conversion model from the first classification error, the second classification error, the reconstruction error, the difference loss and the cross-entropy loss.
  15. The computer device according to any one of claims 9 to 14, wherein after the network parameters of the speech conversion model are updated according to the overall loss, the steps further comprise:
    if first audio of a first speaker is received, acquiring mel-frequency cepstral coefficients of the first audio;
    obtaining a timbre feature vector of second audio of a second speaker using the timbre encoder and inputting the mel-frequency cepstral coefficients of the first audio into the content encoder to obtain a phoneme feature vector of the first audio; and
    splicing the phoneme feature vector of the first audio with the timbre feature vector of the second audio and inputting the spliced vector into the decoder to obtain first audio of the second speaker.
  16. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
    receiving preset training speech and performing feature extraction on the training speech to obtain mel-frequency cepstral coefficients of the training speech;
    inputting the mel-frequency cepstral coefficients of the training speech into a content encoder and a timbre encoder respectively to obtain a phoneme feature vector and a timbre feature vector of the training speech;
    classifying the phoneme feature vector and the timbre feature vector separately according to a preset classification rule to obtain a first classification error and a second classification error;
    splicing the phoneme feature vector and the timbre feature vector and inputting the spliced feature vector into a decoder to obtain a reconstruction error of the mel-frequency cepstral coefficients; and
    calculating an overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error, and updating network parameters of the speech conversion model according to the overall loss.
  17. The computer-readable storage medium according to claim 16, wherein performing feature extraction on the training speech to obtain the mel-frequency cepstral coefficients of the training speech comprises:
    acquiring a frequency spectrum of the training speech and inputting the frequency spectrum of the training speech into a preset mel filter bank to obtain a mel spectrum of the training speech; and
    performing cepstral analysis on the mel spectrum of the training speech to obtain the mel-frequency cepstral coefficients of the training speech.
  18. The computer-readable storage medium according to claim 17, wherein acquiring the frequency spectrum of the training speech comprises:
    preprocessing the training speech to obtain preprocessed training speech; and
    performing a fast Fourier transform on the preprocessed training speech to obtain the frequency spectrum of the training speech.
  19. The computer-readable storage medium according to claim 16, wherein classifying the phoneme feature vector and the timbre feature vector separately according to the preset classification rule to obtain the first classification error and the second classification error comprises:
    passing the phoneme feature vector through a preset gradient reversal layer and a preset first classifier in sequence to obtain the first classification error; and
    inputting the timbre feature vector into a preset second classifier to obtain the second classification error.
  20. The computer-readable storage medium according to claim 19, wherein calculating the overall loss of the speech conversion model from the first classification error, the second classification error and the reconstruction error comprises:
    calculating a difference loss between the phoneme feature vector and the timbre feature vector according to a Frobenius norm; and
    calculating the overall loss of the speech conversion model from the first classification error, the second classification error, the reconstruction error and the difference loss.
PCT/CN2021/083956 2020-12-18 2021-03-30 Training method and apparatus for speech conversion model based on domain separation WO2022126924A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011509341.3A CN112634920B (zh) 2020-12-18 2020-12-18 Training method and apparatus for speech conversion model based on domain separation
CN202011509341.3 2020-12-18

Publications (1)

Publication Number Publication Date
WO2022126924A1 true WO2022126924A1 (zh) 2022-06-23

Family

ID=75317416

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083956 WO2022126924A1 (zh) 2020-12-18 2021-03-30 基于域分离的语音转换模型的训练方法及装置

Country Status (2)

Country Link
CN (1) CN112634920B (zh)
WO (1) WO2022126924A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114999447A (zh) * 2022-07-20 2022-09-02 南京硅基智能科技有限公司 Speech synthesis model based on generative adversarial network and training method

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345452B (zh) * 2021-04-27 2024-04-26 北京搜狗科技发展有限公司 Voice conversion method, training method for voice conversion model, apparatus and medium
CN113436609B (zh) * 2021-07-06 2023-03-10 南京硅语智能科技有限公司 Voice conversion model and training method therefor, voice conversion method and system
CN113689868B (zh) * 2021-08-18 2022-09-13 北京百度网讯科技有限公司 Training method and apparatus for voice conversion model, electronic device and medium
CN113689867B (zh) * 2021-08-18 2022-06-28 北京百度网讯科技有限公司 Training method and apparatus for voice conversion model, electronic device and medium
CN113823300B (zh) * 2021-09-18 2024-03-22 京东方科技集团股份有限公司 Speech processing method and apparatus, storage medium and electronic device
CN113782052A (zh) * 2021-11-15 2021-12-10 北京远鉴信息技术有限公司 Timbre conversion method and apparatus, electronic device and storage medium
CN114283825A (zh) * 2021-12-24 2022-04-05 北京达佳互联信息技术有限公司 Speech processing method and apparatus, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005071663A2 (en) * 2004-01-16 2005-08-04 Scansoft, Inc. Corpus-based speech synthesis based on segment recombination
CN105844331A (zh) * 2015-01-15 2016-08-10 富士通株式会社 Neural network system and training method for the neural network system
CN108847249A (zh) * 2018-05-30 2018-11-20 苏州思必驰信息科技有限公司 Voice conversion optimization method and system
CN110427978A (zh) * 2019-07-10 2019-11-08 清华大学 Variational autoencoder network model and apparatus for few-shot learning
CN111883102A (zh) * 2020-07-14 2020-11-03 中国科学技术大学 Sequence-to-sequence speech synthesis method and system with two-layer autoregressive decoding

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101892733B1 (ko) * 2011-11-24 2018-08-29 한국전자통신연구원 Speech recognition apparatus and method based on cepstral feature vectors
RU2765985C2 (ru) * 2014-05-15 2022-02-07 Телефонактиеболагет Лм Эрикссон (Пабл) Classification and coding of audio signals
CN107705802B (zh) * 2017-09-11 2021-01-29 厦门美图之家科技有限公司 Voice conversion method and apparatus, electronic device and readable storage medium
CN107507619B (zh) * 2017-09-11 2021-08-20 厦门美图之家科技有限公司 Voice conversion method and apparatus, electronic device and readable storage medium
CN110600047B (zh) * 2019-09-17 2023-06-20 南京邮电大学 Many-to-many speaker conversion method based on Perceptual STARGAN
WO2021128256A1 (zh) * 2019-12-27 2021-07-01 深圳市优必选科技股份有限公司 Voice conversion method, apparatus, device and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114999447A (zh) * 2022-07-20 2022-09-02 南京硅基智能科技有限公司 Speech synthesis model based on generative adversarial network and training method
CN114999447B (zh) * 2022-07-20 2022-10-25 南京硅基智能科技有限公司 Speech synthesis model based on generative adversarial network and speech synthesis method
US11817079B1 (en) 2022-07-20 2023-11-14 Nanjing Silicon Intelligence Technology Co., Ltd. GAN-based speech synthesis model and training method

Also Published As

Publication number Publication date
CN112634920B (zh) 2024-01-02
CN112634920A (zh) 2021-04-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21904863

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21904863

Country of ref document: EP

Kind code of ref document: A1