CN112767958A - Zero-learning-based cross-language tone conversion system and method - Google Patents

Zero-learning-based cross-language tone conversion system and method

Info

Publication number
CN112767958A
CN112767958A
Authority
CN
China
Prior art keywords
speaker
audio
target
layer
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110217545.8A
Other languages
Chinese (zh)
Other versions
CN112767958B (en)
Inventor
杨镇川
张伟彬
徐向民
邢晓芬
陈艺荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110217545.8A priority Critical patent/CN112767958B/en
Publication of CN112767958A publication Critical patent/CN112767958A/en
Application granted granted Critical
Publication of CN112767958B publication Critical patent/CN112767958B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
        • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
            • G10L 21/003 Changing voice quality, e.g. pitch or formants
                • G10L 21/007 Changing voice quality, e.g. pitch or formants, characterised by the process used
                    • G10L 21/013 Adapting to target pitch
                        • G10L 2021/0135 Voice conversion or morphing
        • G10L 15/00 Speech recognition
            • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
                • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
            • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                • G10L 15/063 Training
                • G10L 15/065 Adaptation
                    • G10L 15/07 Adaptation to the speaker
        • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
            • G10L 19/02 Using spectral analysis, e.g. transform vocoders or subband vocoders
                • G10L 19/0212 Using orthogonal transformation
            • G10L 19/04 Using predictive techniques
                • G10L 19/26 Pre-filtering or post-filtering
        • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
            • G10L 25/03 Characterised by the type of extracted parameters
                • G10L 25/24 The extracted parameters being the cepstrum
            • G10L 25/27 Characterised by the analysis technique
                • G10L 25/30 Using neural networks
            • G10L 25/78 Detection of presence or absence of voice signals
                • G10L 25/87 Detection of discrete points within a voice signal

Abstract

The invention discloses a zero-learning-based (zero-shot) cross-language tone conversion system and method. The system takes the Mel spectrum of a speech signal as its input signal, extracts bottleneck features of the speech through a phoneme recognition module, normalizes these features and passes them to an acoustic model; the Mel spectrum synthesized by the acoustic model is controlled through the speaker reference vector, and the audio is finally synthesized by a vocoder. The system can convert the voice of an ordinary speaker into the timbre of a designated speaker, is applicable to accented corpora that do not appear in the training database, can handle voice conversion for dialects of multiple regions, and has broad application prospects.

Description

Zero-learning-based cross-language tone conversion system and method
Technical Field
The invention belongs to the technical field of voice synthesis, and particularly relates to a zero-learning-based cross-language tone conversion system and method.
Background
Voice Conversion technology aims to convert the timbre of a piece of speech into that of another designated person (B. Sisman, J. Yamagishi, S. King and H. Li, "An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 132-157, 2021, doi: 10.1109/TASLP.2020.3038524). It is a branch of the speech synthesis field, but it is not speech synthesis alone: it also involves techniques from the related fields of speech recognition and speaker recognition. A core problem of tone color conversion is how to decouple the content and the timbre in speech: the integrity of the speech content must be preserved without mixing in the timbre information of the source speaker, and the accuracy of the speaker feature encoder's classification must be ensured so that the extracted speaker information is sufficient to represent the speaker's timbre.
The currently common methods for tone color conversion are based on VAE (Variational AutoEncoder) (Hsu C C, Hwang H T, Wu Y C, et al. Voice conversion from non-parallel corpora using variational auto-encoder [C] // 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2016: 1-6.) and GAN (Generative Adversarial Network); both depend heavily on the training data of the tone conversion task and perform poorly in cross-lingual tasks. Tone conversion models based on PPG (Phonetic PosteriorGrams) can make full use of large amounts of data (Zhou Y, Tian X, Yılmaz E, et al. A modularized neural network with language-specific output layers for cross-lingual voice conversion [C] // 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019: 160-167.) and decouple the pronunciation content at the phoneme level, so that the converted voice is closer to the target speaker in timbre and the conversion effect is more robust.
Existing tone color conversion techniques have difficulty restoring the details of the source speaker's pronunciation habits; in cross-language application scenarios in particular, methods such as VAE and GAN are quite unstable. It is therefore necessary to develop a tone color conversion technique whose conversion effect is more stable and which still performs excellent tone color conversion in cross-language tasks.
Disclosure of Invention
The invention provides a cross-language tone conversion method based on zero-shot (zero) learning, which addresses three common problems of tone conversion models such as GAN and VAE. First, reference methods are trained only on a specific training data set, and the extracted features depend heavily on the data distribution of that training set, so the generalization ability of the trained model is poor. Second, reference methods do not sufficiently decouple meaning and timbre, and the speech content and the speaker's timbre are not explicitly separated. Third, for cross-language and dialect-accented speech, reference methods find it difficult to keep the pronunciation habits of a specific language and transfer poorly across languages. The invention extracts high-level semantic features by means of a phoneme recognition model from which speaker information has been highly stripped, and converts them into the voice of the target speaker by combining the target speaker feature vector and the fundamental frequency. Because the modeling is based on phoneme features, the problem that the training data contains no related language is solved, so the model can convert speech in various dialect accents as well as foreign-language speech.
The invention is realized by at least one of the following technical schemes.
A zero-learning-based cross-language tone conversion system comprises a phoneme recognition module, a tone conversion module, a speaker coding module and a vocoder module. The phoneme recognition module extracts linguistic information from the original speech and sends it to the tone conversion module for voice-conversion inference; combined with the speaker identity information provided by the speaker coding module, the audio features of the new speaker are synthesized, and the vocoder then reconstructs audio with the new speaker's timbre from these features.
Preferably, the phoneme recognition module G_p is a hybrid neural network comprising a time-delay deep neural network (TDNN) and a long short-term memory network (LSTM), with linear rectification functions (ReLU) as the inter-layer activation functions. A [-n, n] context window of the source Mel spectrum M_source is used as input, so the input node dimension is (2n+1)m, where m is the per-frame feature dimension, and each layer of the network has a set number of output nodes. The outputs of the network are the defined phoneme categories and a speaker classification output used for adversarial training. A bottleneck layer is used after the long short-term memory network, the extracted bottleneck features B_n are the values after the ReLU, and the final output G_p(M_source) has the same number of frames as the input features.
Preferably, the tone conversion module G_t comprises an encoder and a decoder. The encoder consists of several time-delay deep neural network layers and a long short-term memory layer; the decoder consists of a self-attention layer and a long short-term memory layer. The output G_t(B_n) has the same number of frames as the input features.
Preferably, the speaker coding module D comprises a multi-layer time-delay deep neural network and a global pooling layer (Global-Pooling). The input feature is M_target, and the hidden layers all have the same number of output nodes. The global pooling layer is placed before the classification layer; its output is the extracted pooled feature D(M), and the number of output-layer nodes equals the number of speaker categories.
Preferably, the method of using the zero-learning-based cross-lingual tone conversion system comprises a training phase and a conversion phase.
Preferably, the training phase comprises the steps of:
1-1) Data processing stage: obtain audio signals of multiple speakers, perform activity detection on the audio and cut out silent segments to obtain non-silent speech, perform short-time Fourier transform and triangular filtering on the audio with a fixed frame shift and a fixed window length to obtain the audio Mel spectrum segments M, and extract the fundamental frequency feature f0;
1-2) Training the phoneme recognition model: the phoneme recognition model is a phoneme recognition model G_p based on a time-delay bidirectional long short-term memory network, whose bottleneck layer outputs a bottleneck feature sequence B_n that is normalized per sample. The Mel spectrum signals of multiple speakers are input into the phoneme recognition model G_p to obtain the phoneme mutual-information loss L_p and the adversarial-learning-based speaker classification loss L_s; the phoneme loss is minimized while the speaker classification loss is maximized, and training proceeds until the model parameters converge, where the total loss function L_total is:

L_total = a·L_p - (1 - a)·L_s
The phoneme mutual-information loss L_p and the adversarial-learning-based speaker classification loss L_s are:

L_p = -Σ_{n=1..N} log [ P(m_n | t_n)·p(t_n) / Σ_{t ∈ T̂} P(m_n | t)·p(t) ]

L_s = -Σ_i s_i·log(ŝ_i)
where a is the coefficient of the loss function; s_i is the i-th real speaker label and ŝ_i is the predicted speaker label, with i indexing the speakers; m_n is a frame of data of the audio Mel spectrum segment M, N is the total number of Mel spectrum frames and n is the frame index; T is the real transcribed phoneme text and T̂ is the predicted transcribed phoneme text, t being a token of the word graph of T̂; P(m_n | t_n) denotes the probability of the Mel spectrum feature m_n given the text t_n, P(t) is the language model score of the text t, and p(t_n) is the language model score of the text t_n;
1-3) Speaker feature extraction: the audio Mel spectrum segment M is passed through the pooling-vector-based speaker encoder D, which uses a global pooling layer at its penultimate layer, to extract the pooled target speaker feature vector (P-vector) as the feature of a speaker;
1-4) Training the tone color conversion model: the tone color conversion model is a self-supervised tone color conversion model G_t based on bottleneck-layer features, comprising an encoder and a decoder. The bottleneck feature sequence B_n from the phoneme recognition model, the target speaker feature pooling vector P-vector and the fundamental frequency feature f0 are concatenated side by side as the input of the tone conversion model. The encoder performs context-window feature extraction on the input, and the decoder, an autoregressive model based on a self-attention mechanism, synthesizes the Mel spectrogram M̂ of the target speaker. During training, the loss L_t between the synthesized Mel spectrum and the real Mel spectrum is minimized and the gradient is back-propagated to the model parameters:

L_t = (1/N) Σ_{n=1..N} || M_n - M̂_n ||²

where N is the number of Mel spectrum frames, M_n is the real Mel spectrogram frame and M̂_n is the synthesized Mel spectrogram frame.
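By way of illustration only, the following Python (PyTorch) sketch shows how the training objectives of steps 1-2) and 1-4) above can be combined; it is not part of the disclosed embodiment. Ordinary frame-level cross-entropy is used as a stand-in for the MMI-style phoneme loss L_p, and the function and variable names are illustrative.

```python
import torch.nn.functional as F

def recognizer_loss(phone_logits, phone_targets, spk_logits, spk_targets, a=0.5):
    """Sketch of L_total = a*L_p - (1 - a)*L_s for the phoneme recognizer.

    phone_logits: (num_frames, num_phones); spk_logits: (num_frames, num_speakers).
    Cross-entropy stands in for the MMI-style phoneme loss; the speaker term is
    subtracted so that minimizing L_total maximizes the speaker classification loss.
    """
    L_p = F.cross_entropy(phone_logits, phone_targets)
    L_s = F.cross_entropy(spk_logits, spk_targets)
    return a * L_p - (1.0 - a) * L_s

def conversion_loss(mel_pred, mel_true):
    """Sketch of L_t: mean frame-wise error between synthesized and real Mel spectra."""
    return F.mse_loss(mel_pred, mel_true)
```

In practice the adversarial speaker branch is usually realized with a gradient reversal layer or alternating updates rather than a literal negative loss term, but the sign convention above matches the formula in the text.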
Preferably, the conversion phase comprises the steps of:
2-1) Record a segment of audio to be converted with a recording device, perform audio activity detection and cut out silent segments to obtain the speech segments, and perform short-time Fourier transform and triangular filtering on the audio with a fixed frame shift and window length to obtain the first audio Mel spectrum M_1; extract the fundamental frequency f0_source;
2-2) Record a segment of audio of the target speaker with a recording device, perform audio activity detection and cut out silent segments to obtain the speech segments, and perform short-time Fourier transform and triangular filtering on the audio with a fixed frame shift and window length to obtain the second audio Mel spectrum M_2; extract the target speaker's feature pooling vector (P-vector) with the pooling-vector-based speaker encoder D, and use it to register the newly added speaker;
2-3) Pass the first audio Mel spectrum M_1 through the phoneme recognition model G_p; normalize the bottleneck-layer features obtained by forward computation to get the bottleneck feature sequence B_n, and apply a linear transformation to f0_source:

f0_target = (f0_source - μ_source) / σ_source · σ_target + μ_target

where μ_source and μ_target are the means of the original audio and the target audio respectively, and σ_source and σ_target are the variances of the original audio and the target audio respectively. The bottleneck features B_n, the target speaker feature pooling vector P-vector and the transformed fundamental frequency f0_target are concatenated side by side and input into the self-supervised tone conversion model G_t, in which the encoder recombines the semantic and timbre features and the decoder autoregressively synthesizes the Mel spectrum M_target of the target timbre:

M_target = G_t(G_p(M_1), f0_target, D(M_2))

where the bottleneck features B_n = G_p(M_1) and D(M_2) is the speaker feature P-vector;
2-4) Use the vocoder to upsample the Mel spectrum M_target of the target timbre and synthesize an audio signal with the style of the target speaker.
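The conversion phase of steps 2-1) to 2-4) reduces to an f0 statistics transform followed by a composition of the three trained modules and the vocoder. The sketch below assumes the modules are available as callables G_p, G_t, D and vocoder; these names, and the packing of the statistics into tuples, are illustrative and not part of the embodiment.

```python
def transform_f0(f0_source, src_stats, tgt_stats):
    """f0_target = (f0_source - mu_source) / sigma_source * sigma_target + mu_target."""
    mu_s, sigma_s = src_stats
    mu_t, sigma_t = tgt_stats
    return (f0_source - mu_s) / sigma_s * sigma_t + mu_t

def convert(mel_source, f0_source, mel_target_ref, G_p, G_t, D, vocoder,
            src_stats, tgt_stats):
    """M_target = G_t(G_p(M_1), f0_target, D(M_2)), then vocoding to a waveform."""
    b_n = G_p(mel_source)                      # bottleneck features (content only)
    p_vector = D(mel_target_ref)               # target speaker P-vector
    f0_target = transform_f0(f0_source, src_stats, tgt_stats)
    mel_target = G_t(b_n, f0_target, p_vector) # Mel spectrum of the target timbre
    return vocoder(mel_target)                 # audio with the target speaker's timbre
```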
Preferably, the audio signal obtained in step 1-1) is subjected to data augmentation: on top of the speed change, reverberation, random white noise and background music are added to the training data.
Preferably, the phoneme recognition model G_p of step 1-2) uses mixed Mandarin Chinese and English corpus data as its input data.
Preferably, the model in step 1-3) is given as input speaker corpora of accented dialects from multiple regions and foreign-language corpora, and finally the original speech is converted into the dialect speech of the target speaker and the corresponding foreign-language speech of the target speaker, respectively.
Compared with the prior art, the invention has the following advantages and effects:
1) The method models phoneme features and extracts the bottleneck-layer features of the phoneme model, so that the pronounced phonemes are modeled more accurately and semantics and timbre are separated more thoroughly when reconstructing the speech;
2) The method can be used in cross-language environments; zero-shot learning effectively reduces the dependence on data of rare languages and dialects and enhances the robustness of tone conversion in complex scenarios;
3) The invention supports using a recording of the target speaker to extract the speaker features, so registering a voice is easy and the system is convenient to use.
Drawings
FIG. 1 is a diagram of the overall structure of the self-supervised, zero-learning-based cross-lingual tone color conversion method and system according to the present embodiment;
FIG. 2 is a block diagram of a tone conversion model based on an encoder and a decoder in the present embodiment;
FIG. 3 is a structural diagram of a phoneme recognition model in the present embodiment;
fig. 4 is a diagram showing a spectrum effect of performing timbre conversion in the present embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A cross-language tone conversion system based on zero learning comprises a phoneme recognition module, a tone conversion module, a speaker coding module and a vocoder module;
the phoneme recognition module GpThe hybrid neural network comprises a 6-layer time delay neural network and a 2-layer long-and-short time memory network, which are connected in series in a TDNN (1-5) -LSTM1-TDNN6-LSTM2 mode, wherein each interlayer activation function adopts RELU, context windows of-20 and 20 are adopted, an input node is 200, the number of output nodes of each layer of a hidden layer is 850, 1024(LSTM1), 256 and 1024(LSTM2), an output layer is an established phoneme type and a speaker-based pair, wherein the 256 output node layers are bottleneck layers for extracting characteristics, the extracted bottleneck layer characteristics are numerical values after RELU, and G is output finallyp(Msource) The number of frames is the same as the number of input feature frames.
The tone conversion module comprises an encoder and a decoder. The encoder is formed by connecting 3 time-delay neural network layers and a long short-term memory layer in series; its input node dimension is 256 and its output node counts are 512, 512, 512 and 512 (LSTM). The decoder consists of a self-attention layer and a long short-term memory layer with output node counts of 512 and 1024, and the output layer has 80 nodes. The output G_t(B_n) has the same number of frames as the input features.
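The corresponding encoder/decoder structure can be sketched as follows. The autoregressive decoding loop is omitted and the decoder is shown as a single parallel pass; the way the bottleneck features, P-vector and f0 are concatenated is an assumption, as the embodiment does not fix the combined input dimension.

```python
import torch
import torch.nn as nn

class ToneConversionModel(nn.Module):
    """Sketch of G_t: encoder (3 TDNN layers + LSTM), decoder (self-attention + LSTM)."""

    def __init__(self, bn_dim=256, spk_dim=256, f0_dim=1, mel_dim=80):
        super().__init__()
        d = bn_dim + spk_dim + f0_dim              # B_n ++ f0 ++ P-vector (assumed)
        self.enc_tdnn = nn.Sequential(
            nn.Conv1d(d, 512, 1), nn.ReLU(),
            nn.Conv1d(512, 512, 1), nn.ReLU(),
            nn.Conv1d(512, 512, 1), nn.ReLU())
        self.enc_lstm = nn.LSTM(512, 512, batch_first=True)
        self.attn = nn.MultiheadAttention(512, num_heads=4, batch_first=True)
        self.dec_lstm = nn.LSTM(512, 1024, batch_first=True)
        self.out = nn.Linear(1024, mel_dim)        # 80-dim Mel output layer

    def forward(self, b_n, f0, spk):
        # b_n: (B, T, 256), f0: (B, T, 1), spk: (B, 256) broadcast over the frames
        spk = spk.unsqueeze(1).expand(-1, b_n.size(1), -1)
        x = torch.cat([b_n, f0, spk], dim=-1)
        x = self.enc_tdnn(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.enc_lstm(x)
        x, _ = self.attn(x, x, x)                  # self-attention over frames
        x, _ = self.dec_lstm(x)
        return self.out(x)                         # (B, T, 80) Mel frames
```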
The speaker coding module D comprises 5 time-delay neural network layers and a global pooling layer (Global-Pooling); the input node dimension is 200 and the numbers of output nodes of the hidden layers are 500, 500, 500, 500, 500 and 256 (Global-Pooling). The output of the global pooling layer is the extracted pooled feature D(M), and the number of nodes of the final output layer equals the number of speaker categories.
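A sketch of the speaker encoder with the layer sizes of this embodiment; mean pooling over time is assumed for the global pooling layer, and the number of speaker classes is a placeholder.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Sketch of D: 5 TDNN layers (500 nodes) -> global mean pooling -> 256-d P-vector."""

    def __init__(self, in_dim=200, num_speakers=1000):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(5):
            layers += [nn.Conv1d(dim, 500, kernel_size=1), nn.ReLU()]
            dim = 500
        self.tdnn = nn.Sequential(*layers)
        self.pool_proj = nn.Linear(500, 256)         # Global-Pooling output: P-vector
        self.classifier = nn.Linear(256, num_speakers)

    def forward(self, mel):                          # mel: (batch, frames, in_dim)
        x = self.tdnn(mel.transpose(1, 2))           # (batch, 500, frames)
        p_vector = self.pool_proj(x.mean(dim=2))     # D(M): pooled speaker feature
        return p_vector, self.classifier(p_vector)   # classifier used only in training
```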
A method of using the zero-learning-based cross-lingual tone color conversion system comprises a training phase and a conversion phase, wherein the training phase comprises the following steps:
1) training a phoneme recognition model and a tone conversion model, comprising the steps of:
1-1) Data processing stage: obtain audio signals of multiple speakers, with no fewer than 100 sentences per speaker; perform activity detection on the audio and cut out silent segments to obtain non-silent speech; perform short-time Fourier transform and triangular filtering on the audio with a 10-millisecond frame shift and a 25-millisecond window length to obtain the audio Mel spectrum segments M, and extract the fundamental frequency feature f0 (a feature-extraction sketch in Python is given after this list of steps);
1-2) Phoneme recognition model training stage: the phoneme recognition model is a phoneme recognition model G_p based on a time-delay bidirectional long short-term memory network, whose bottleneck layer outputs a bottleneck feature sequence B_n that is normalized per sample. The Mel spectrum signals of multiple speakers are input into the phoneme recognition model G_p to obtain the phoneme mutual-information loss L_p and the adversarial-learning-based speaker classification loss L_s; the phoneme loss is minimized while the speaker classification loss is maximized, and training proceeds until the model parameters converge, where the total loss function L_total is:

L_total = a·L_p - (1 - a)·L_s
L_s and L_p are calculated as:

L_p = -Σ_{n=1..N} log [ P(m_n | t_n)·p(t_n) / Σ_{t ∈ T̂} P(m_n | t)·p(t) ]

L_s = -Σ_i s_i·log(ŝ_i)
where a is the coefficient of the loss function; s_i is the i-th real speaker label and ŝ_i is the predicted speaker label, with i indexing the speakers; m_n is a frame of data of the Mel spectrum M, N is the total number of Mel spectrum frames and n is the frame index; T is the real transcribed phoneme text and T̂ is the predicted transcribed phoneme text, t being a token of the word graph of T̂; P(m_n | t_n) denotes the probability of the Mel spectrum feature m_n given the text t_n, P(t) is the language model score of the text t, and p(t_n) is the language model score of the text t_n;
In the speaker feature extraction stage, the Mel spectrum features M are passed through the pooling-vector-based speaker encoder D, which comprises a multi-layer time-delay neural network (as shown in FIG. 3) and uses a global pooling layer at its penultimate layer, to extract the pooled target speaker feature vector (P-vector) as the feature of a speaker;
In the tone color conversion model training stage, the tone color conversion model is a self-supervised tone color conversion model G_t based on bottleneck-layer features, comprising an encoder and a decoder. The bottleneck feature sequence B_n from the phoneme recognition model, the target speaker feature pooling vector P-vector and the fundamental frequency feature f0 are concatenated side by side as the input of the tone conversion model. The encoder performs context-window feature extraction on the input, and the decoder, an autoregressive model based on a self-attention mechanism, synthesizes the Mel spectrogram M̂ of the target speaker. During training, the loss L_t between the synthesized Mel spectrum and the real Mel spectrum is minimized and the gradient is back-propagated to the model parameters:

L_t = (1/N) Σ_{n=1..N} || M_n - M̂_n ||²

where N is the number of Mel spectrum frames, M_n is the real Mel spectrogram frame and M̂_n is the synthesized Mel spectrogram frame;
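As referenced in step 1-1), a feature-extraction sketch in Python is given here. It uses librosa for silence trimming, Mel analysis with a 10 ms frame shift and 25 ms window, and pyin-based f0 extraction; the sampling rate, number of Mel bands, FFT size and f0 search range are assumptions, not values stated in the embodiment.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000, n_mels=80):
    """Step 1-1) sketch: silence removal, log-Mel spectrogram and f0 extraction."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Simple energy-based activity detection: keep only non-silent intervals.
    intervals = librosa.effects.split(y, top_db=30)
    y = np.concatenate([y[s:e] for s, e in intervals])

    hop, win = int(0.010 * sr), int(0.025 * sr)      # 10 ms shift, 25 ms window
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=hop, win_length=win,
                                         n_mels=n_mels)
    log_mel = np.log(mel + 1e-6)

    # Fundamental frequency f0 (unvoiced frames come back as NaN); the frame
    # counts of mel and f0 may need alignment depending on analysis settings.
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=600, sr=sr, hop_length=hop)
    return log_mel, f0
```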
2) the tone conversion stage comprises the following steps:
2-1) Record a segment of audio to be converted with a recording device, perform audio activity detection and cut out silent segments to obtain the speech segments, and perform short-time Fourier transform and triangular filtering on the audio with a fixed frame shift and window length to obtain the first audio Mel spectrum M_1; extract the fundamental frequency f0_source;
2-2) Record a segment of audio of the target speaker with a recording device, perform audio activity detection and cut out silent segments to obtain the speech segments, and perform short-time Fourier transform and triangular filtering on the audio with a fixed frame shift and window length to obtain the second audio Mel spectrum M_2; extract the target speaker's feature pooling vector (P-vector) with the pooling-vector-based speaker encoder D, and use it to register the newly added speaker;
2-3) Pass the first audio Mel spectrum M_1 through the phoneme recognition model G_p; normalize the bottleneck-layer features obtained by forward computation to get the bottleneck feature sequence B_n, and apply a linear transformation to f0_source:

f0_target = (f0_source - μ_source) / σ_source · σ_target + μ_target

where μ_source and μ_target are the means of the original audio and the target audio respectively, and σ_source and σ_target are the variances of the original audio and the target audio respectively. The bottleneck features B_n, the target speaker feature pooling vector P-vector and the transformed fundamental frequency f0_target are concatenated side by side and input into the self-supervised tone conversion model G_t, in which the encoder recombines the semantic and timbre features and the decoder autoregressively synthesizes the Mel spectrum M_target of the target timbre:

M_target = G_t(G_p(M_1), f0_target, D(M_2))

where the bottleneck features B_n = G_p(M_1) and D(M_2) is the speaker feature P-vector;
2-4) Use the vocoder to upsample the Mel spectrum M_target of the target timbre and synthesize an audio signal with the style of the target speaker.
The audio signal obtained in step 1) is subjected to a random speech-rate change within a fixed range, with the pitch kept unchanged; the resulting audio is further augmented by adding reverberation, random white noise and background music to the training data on top of the speed change.
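A rough sketch of the augmentation described above: tempo perturbation with pitch kept unchanged, followed by optional reverberation, additive white noise and background music. The perturbation range, SNR range and mixing level are illustrative, not values from the embodiment.

```python
import numpy as np
import librosa

def augment(y, rir=None, music=None, rng=None):
    """Sketch: speed change (pitch unchanged), reverberation, white noise, music."""
    if rng is None:
        rng = np.random.default_rng()

    # 1) random speech-rate change within a fixed range, pitch kept unchanged
    y = librosa.effects.time_stretch(y, rate=float(rng.uniform(0.9, 1.1)))

    # 2) reverberation: convolve with a room impulse response, if one is given
    if rir is not None:
        y = np.convolve(y, rir)[: len(y)]

    # 3) additive white noise at a random signal-to-noise ratio
    snr_db = rng.uniform(10.0, 30.0)
    noise = rng.standard_normal(len(y))
    noise *= np.sqrt(np.mean(y ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    y = y + noise

    # 4) background music mixed in at a low level, if provided
    if music is not None:
        y = y + 0.1 * np.resize(music, len(y))
    return y
```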
In step 1-2), the phoneme recognition model G_p uses mixed Mandarin Chinese and English corpus data as its input data. In step 1-3), the model is given as input speaker corpora of accented dialects from multiple regions and foreign-language corpora, and finally the original speech is converted into the dialect speech of the target speaker and the corresponding foreign-language speech of the target speaker, respectively.
As shown in the framework diagram of FIG. 1, an original sound signal is obtained with a recording device; through audio activity detection, silent segments are cut out to obtain the speech segments; short-time Fourier transform and triangular filtering are performed on the audio with a fixed frame shift to obtain the audio Mel spectrum segments, and the fundamental frequency f0_source is extracted. The Mel spectrum signal of the source speaker is passed through the phoneme recognition model G_p based on a time-delay bidirectional long short-term memory network; the bottleneck layer of the time-delay model outputs the bottleneck feature sequence B_n, and the features are normalized, as shown in FIG. 3. The audio features of the target speaker are passed through the time-delay-network speaker encoder to extract the speaker-specific pooling vector P-vector, which is input into the model as the speaker reference vector. The bottleneck features, the target speaker feature pooling vector P-vector and the linearly transformed fundamental frequency features are concatenated in parallel as the input of the tone conversion model, which synthesizes the Mel spectrogram of the target speaker, as shown in FIG. 2. Finally, the synthesized Mel spectrum is passed through a vocoder based on a generative adversarial network, and the audio signal of the target speaker is synthesized.
This embodiment is implemented based on the technical scheme of FIG. 1 and involves audio feature extraction, phoneme-model bottleneck-layer feature extraction, speaker pooling-vector extraction, tone conversion model synthesis and vocoder synthesis, giving the effect shown in FIG. 4.
As shown in FIG. 2, the input audio is recorded by the original speaker and its language is English; the registered audio is recorded by a Mandarin speaker, and the speaker's pooled vector features are extracted for registration; the target audio with the Mandarin speaker's timbre is then synthesized by the above scheme, and its text content is the same as that of the original audio. The target audio keeps the rhythm, pauses and content of the original audio while the tone and timbre are converted, and a high quality of spectrum synthesis is ensured.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A zero-learning-based cross-language tone conversion system, characterized by comprising a phoneme recognition module, a tone conversion module, a speaker coding module and a vocoder module, wherein the phoneme recognition module extracts linguistic information from the original speech and sends it to the tone conversion module for voice-conversion inference; combined with the speaker identity information provided by the speaker coding module, the audio features of the new speaker are synthesized, and the vocoder then reconstructs audio with the new speaker's timbre from these features.
2. The system according to claim 1, wherein the phoneme recognition module G_p comprises a time-delay deep neural network (TDNN) and a long short-term memory network (LSTM), with linear rectification functions (ReLU) as the inter-layer activation functions; a [-n, n] context window of the source Mel spectrum M_source is used as input, so the input node dimension is (2n+1)m, and each layer of the network has a set number of output nodes; the outputs of the network are the defined phoneme categories and a speaker classification output used for adversarial training; a bottleneck layer is used after the long short-term memory network, the extracted bottleneck features B_n are the values after the ReLU, and the final output G_p(M_source) has the same number of frames as the input features.
3. The system according to claim 2, wherein the tone conversion module G_t comprises an encoder and a decoder, the encoder consisting of several time-delay deep neural network layers and a long short-term memory layer, and the decoder consisting of a self-attention layer and a long short-term memory layer; the output G_t(B_n) has the same number of frames as the input features.
4. The system of claim 3, wherein the speaker coding module D comprises a multi-layer time-delay deep neural network and a global pooling layer (Global-Pooling); the input feature is M_target, the hidden layers all have the same number of output nodes, the global pooling layer is placed before the classification layer, its output is the extracted pooled feature D(M), and the number of output-layer nodes equals the number of speaker categories.
5. A method of using the zero-learning-based cross-lingual tone color conversion system according to claim 4, characterized by comprising a training phase and a conversion phase.
6. The method according to claim 5, wherein the training phase comprises the following steps:
1-1) Data processing stage: obtain audio signals of multiple speakers, cut out silent segments through activity detection to obtain non-silent speech, perform short-time Fourier transform and triangular filtering on the audio with a fixed frame shift and a fixed window length to obtain the audio Mel spectrum segments M, and extract the fundamental frequency feature f0;
1-2) Training the phoneme recognition model: the phoneme recognition model is a phoneme recognition model G_p based on a time-delay bidirectional long short-term memory network, whose bottleneck layer outputs a bottleneck feature sequence B_n that is normalized per sample; the Mel spectrum signals of multiple speakers are input into the phoneme recognition model G_p to obtain the phoneme mutual-information loss L_p and the adversarial-learning-based speaker classification loss L_s; the phoneme loss is minimized while the speaker classification loss is maximized, and training proceeds until the model parameters converge, where the total loss function L_total is:

L_total = a·L_p - (1 - a)·L_s
the phoneme mutual-information loss L_p and the adversarial-learning-based speaker classification loss L_s are:

L_p = -Σ_{n=1..N} log [ P(m_n | t_n)·p(t_n) / Σ_{t ∈ T̂} P(m_n | t)·p(t) ]

L_s = -Σ_i s_i·log(ŝ_i)
where a is the coefficient of the loss function; s_i is the i-th real speaker label and ŝ_i is the predicted speaker label, with i indexing the speakers; m_n is a frame of data of the audio Mel spectrum segment M, N is the total number of Mel spectrum frames and n is the frame index; T is the real transcribed phoneme text and T̂ is the predicted transcribed phoneme text, t being a token of the word graph of T̂; P(m_n | t_n) denotes the probability of the Mel spectrum feature m_n given the text t_n, P(t) is the language model score of the text t, and p(t_n) is the language model score of the text t_n;
1-3) Speaker feature extraction: the audio Mel spectrum segment M is passed through the pooling-vector-based speaker encoder D, which uses a global pooling layer at its penultimate layer, to extract the pooled target speaker feature vector (P-vector) as the feature of a speaker;
1-4) Training the tone color conversion model: the tone color conversion model is a self-supervised tone color conversion model G_t based on bottleneck-layer features, comprising an encoder and a decoder; the bottleneck feature sequence B_n from the phoneme recognition model, the target speaker feature pooling vector P-vector and the fundamental frequency feature f0 are concatenated side by side as the input of the tone conversion model; the encoder performs context-window feature extraction on the input, and the decoder, an autoregressive model based on a self-attention mechanism, synthesizes the Mel spectrogram M̂ of the target speaker; during training, the loss L_t between the synthesized Mel spectrum and the real Mel spectrum is minimized and the gradient is back-propagated to the model parameters:

L_t = (1/N) Σ_{n=1..N} || M_n - M̂_n ||²

where N is the number of Mel spectrum frames, M_n is the real Mel spectrogram frame and M̂_n is the synthesized Mel spectrogram frame.
7. The method according to claim 6, wherein the conversion phase comprises the following steps:
2-1) Record a segment of audio to be converted with a recording device, perform audio activity detection and cut out silent segments to obtain the speech segments, and perform short-time Fourier transform and triangular filtering on the audio with a fixed frame shift and window length to obtain the first audio Mel spectrum M_1; extract the fundamental frequency f0_source;
2-2) Record a segment of audio of the target speaker with a recording device, perform audio activity detection and cut out silent segments to obtain the speech segments, and perform short-time Fourier transform and triangular filtering on the audio with a fixed frame shift and window length to obtain the second audio Mel spectrum M_2; extract the target speaker's feature pooling vector (P-vector) with the pooling-vector-based speaker encoder D, and use it to register the newly added speaker;
2-3) Pass the first audio Mel spectrum M_1 through the phoneme recognition model G_p; normalize the bottleneck-layer features obtained by forward computation to get the bottleneck feature sequence B_n, and apply a linear transformation to f0_source:

f0_target = (f0_source - μ_source) / σ_source · σ_target + μ_target

where μ_source and μ_target are the means of the original audio and the target audio respectively, and σ_source and σ_target are the variances of the original audio and the target audio respectively;
the bottleneck features B_n, the target speaker feature pooling vector P-vector and the transformed fundamental frequency f0_target are concatenated side by side and input into the self-supervised tone conversion model G_t, in which the encoder recombines the semantic and timbre features and the decoder autoregressively synthesizes the Mel spectrum M_target of the target timbre:

M_target = G_t(G_p(M_1), f0_target, D(M_2))

where the bottleneck features B_n = G_p(M_1) and D(M_2) is the speaker feature P-vector;
2-4) Use the vocoder to upsample the Mel spectrum M_target of the target timbre and synthesize an audio signal with the style of the target speaker.
8. The zero-learning-based cross-lingual tone color conversion method according to claim 7, wherein the audio signal obtained in step 1-1) is subjected to data augmentation, and reverberation, random white noise and background music are added to the training data on top of the speed change.
9. The zero-learning-based cross-lingual tone color conversion method according to claim 8, wherein the phoneme recognition model G_p of step 1-2) uses mixed Mandarin Chinese and English corpus data as its input data.
10. The method as claimed in claim 9, wherein the model of step 1-3) is given as input speaker corpora of accented dialects from multiple regions and foreign-language corpora, and finally the original speech is converted into the dialect speech of the target speaker and the corresponding foreign-language speech of the target speaker, respectively.
CN202110217545.8A 2021-02-26 2021-02-26 Zero-order learning-based cross-language tone conversion system and method Active CN112767958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110217545.8A CN112767958B (en) 2021-02-26 2021-02-26 Zero-order learning-based cross-language tone conversion system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110217545.8A CN112767958B (en) 2021-02-26 2021-02-26 Zero-order learning-based cross-language tone conversion system and method

Publications (2)

Publication Number Publication Date
CN112767958A true CN112767958A (en) 2021-05-07
CN112767958B CN112767958B (en) 2023-12-26

Family

ID=75704215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110217545.8A Active CN112767958B (en) 2021-02-26 2021-02-26 Zero-order learning-based cross-language tone conversion system and method

Country Status (1)

Country Link
CN (1) CN112767958B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327575A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113327627A (en) * 2021-05-24 2021-08-31 清华大学深圳国际研究生院 Multi-factor controllable voice conversion method and system based on feature decoupling
CN113327586A (en) * 2021-06-01 2021-08-31 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113327583A (en) * 2021-05-24 2021-08-31 清华大学深圳国际研究生院 Optimal mapping cross-language tone conversion method and system based on PPG consistency
CN113343924A (en) * 2021-07-01 2021-09-03 齐鲁工业大学 Modulation signal identification method based on multi-scale cyclic spectrum feature and self-attention generation countermeasure network
CN113421585A (en) * 2021-05-10 2021-09-21 云境商务智能研究院南京有限公司 Audio fingerprint database generation method and device
CN113470622A (en) * 2021-09-06 2021-10-01 成都启英泰伦科技有限公司 Conversion method and device capable of converting any voice into multiple voices
CN113611309A (en) * 2021-07-13 2021-11-05 北京捷通华声科技股份有限公司 Tone conversion method, device, electronic equipment and readable storage medium
CN113674735A (en) * 2021-09-26 2021-11-19 北京奇艺世纪科技有限公司 Sound conversion method, device, electronic equipment and readable storage medium
CN113763987A (en) * 2021-09-06 2021-12-07 中国科学院声学研究所 Training method and device of voice conversion model
CN113808573A (en) * 2021-08-06 2021-12-17 华南理工大学 Dialect classification method and system based on mixed domain attention and time sequence self-attention
CN113823300A (en) * 2021-09-18 2021-12-21 京东方科技集团股份有限公司 Voice processing method and device, storage medium and electronic equipment
CN115240630A (en) * 2022-07-22 2022-10-25 山东大学 Method and system for converting Chinese text into personalized voice
CN116778937A (en) * 2023-03-28 2023-09-19 南京工程学院 Speech conversion method based on speaker versus antigen network
CN117809621A (en) * 2024-02-29 2024-04-02 暗物智能科技(广州)有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN113611309B (en) * 2021-07-13 2024-05-10 北京捷通华声科技股份有限公司 Tone conversion method and device, electronic equipment and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018084604A (en) * 2016-11-21 2018-05-31 日本電信電話株式会社 Cross lingual voice synthesis model learning device, cross lingual voice synthesis device, cross lingual voice synthesis model learning method, and program
CN111785261A (en) * 2020-05-18 2020-10-16 南京邮电大学 Cross-language voice conversion method and system based on disentanglement and explanatory representation
CN111951810A (en) * 2019-05-14 2020-11-17 国际商业机器公司 High quality non-parallel many-to-many voice conversion
CN112017644A (en) * 2020-10-21 2020-12-01 南京硅基智能科技有限公司 Sound transformation system, method and application
US20200380952A1 (en) * 2019-05-31 2020-12-03 Google Llc Multilingual speech synthesis and cross-language voice cloning
CN112071325A (en) * 2020-09-04 2020-12-11 中山大学 Many-to-many voice conversion method based on double-voiceprint feature vector and sequence-to-sequence modeling
CN112382308A (en) * 2020-11-02 2021-02-19 天津大学 Zero-order voice conversion system and method based on deep learning and simple acoustic features

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018084604A (en) * 2016-11-21 2018-05-31 日本電信電話株式会社 Cross lingual voice synthesis model learning device, cross lingual voice synthesis device, cross lingual voice synthesis model learning method, and program
CN111951810A (en) * 2019-05-14 2020-11-17 国际商业机器公司 High quality non-parallel many-to-many voice conversion
US20200380952A1 (en) * 2019-05-31 2020-12-03 Google Llc Multilingual speech synthesis and cross-language voice cloning
CN111785261A (en) * 2020-05-18 2020-10-16 南京邮电大学 Cross-language voice conversion method and system based on disentanglement and explanatory representation
CN112071325A (en) * 2020-09-04 2020-12-11 中山大学 Many-to-many voice conversion method based on double-voiceprint feature vector and sequence-to-sequence modeling
CN112017644A (en) * 2020-10-21 2020-12-01 南京硅基智能科技有限公司 Sound transformation system, method and application
CN112382308A (en) * 2020-11-02 2021-02-19 天津大学 Zero-order voice conversion system and method based on deep learning and simple acoustic features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
左国玉 et al., "Research and Progress of Voice Conversion Technology" (声音转换技术的研究与进展), Acta Electronica Sinica (电子学报), vol. 32, no. 7, pages 1165-1172 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421585A (en) * 2021-05-10 2021-09-21 云境商务智能研究院南京有限公司 Audio fingerprint database generation method and device
CN113327627A (en) * 2021-05-24 2021-08-31 清华大学深圳国际研究生院 Multi-factor controllable voice conversion method and system based on feature decoupling
CN113327583A (en) * 2021-05-24 2021-08-31 清华大学深圳国际研究生院 Optimal mapping cross-language tone conversion method and system based on PPG consistency
CN113327627B (en) * 2021-05-24 2024-04-05 清华大学深圳国际研究生院 Multi-factor controllable voice conversion method and system based on feature decoupling
CN113327575A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113327575B (en) * 2021-05-31 2024-03-01 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113327586A (en) * 2021-06-01 2021-08-31 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113327586B (en) * 2021-06-01 2023-11-28 深圳市北科瑞声科技股份有限公司 Voice recognition method, device, electronic equipment and storage medium
CN113343924A (en) * 2021-07-01 2021-09-03 齐鲁工业大学 Modulation signal identification method based on multi-scale cyclic spectrum feature and self-attention generation countermeasure network
CN113343924B (en) * 2021-07-01 2022-05-17 齐鲁工业大学 Modulation signal identification method based on cyclic spectrum characteristics and generation countermeasure network
CN113611309B (en) * 2021-07-13 2024-05-10 北京捷通华声科技股份有限公司 Tone conversion method and device, electronic equipment and readable storage medium
CN113611309A (en) * 2021-07-13 2021-11-05 北京捷通华声科技股份有限公司 Tone conversion method, device, electronic equipment and readable storage medium
CN113808573A (en) * 2021-08-06 2021-12-17 华南理工大学 Dialect classification method and system based on mixed domain attention and time sequence self-attention
CN113808573B (en) * 2021-08-06 2023-11-07 华南理工大学 Dialect classification method and system based on mixed domain attention and time sequence self-attention
CN113763987A (en) * 2021-09-06 2021-12-07 中国科学院声学研究所 Training method and device of voice conversion model
CN113470622B (en) * 2021-09-06 2021-11-19 成都启英泰伦科技有限公司 Conversion method and device capable of converting any voice into multiple voices
CN113470622A (en) * 2021-09-06 2021-10-01 成都启英泰伦科技有限公司 Conversion method and device capable of converting any voice into multiple voices
CN113823300A (en) * 2021-09-18 2021-12-21 京东方科技集团股份有限公司 Voice processing method and device, storage medium and electronic equipment
CN113823300B (en) * 2021-09-18 2024-03-22 京东方科技集团股份有限公司 Voice processing method and device, storage medium and electronic equipment
CN113674735A (en) * 2021-09-26 2021-11-19 北京奇艺世纪科技有限公司 Sound conversion method, device, electronic equipment and readable storage medium
CN115240630A (en) * 2022-07-22 2022-10-25 山东大学 Method and system for converting Chinese text into personalized voice
CN116778937B (en) * 2023-03-28 2024-01-23 南京工程学院 Speech conversion method based on speaker adversarial network
CN116778937A (en) * 2023-03-28 2023-09-19 南京工程学院 Speech conversion method based on speaker adversarial network
CN117809621A (en) * 2024-02-29 2024-04-02 暗物智能科技(广州)有限公司 Speech synthesis method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112767958B (en) 2023-12-26

Similar Documents

Publication Publication Date Title
CN112767958B (en) Zero-order learning-based cross-language tone conversion system and method
WO2018121757A1 (en) Method and system for speech broadcast of text
Womack et al. N-channel hidden Markov models for combined stressed speech classification and recognition
Shahnawazuddin et al. Voice Conversion Based Data Augmentation to Improve Children's Speech Recognition in Limited Data Scenario.
WO2006099467A2 (en) An automatic donor ranking and selection system and method for voice conversion
JPH075892A (en) Voice recognition method
CN110930981A (en) Many-to-one voice conversion system
Huang et al. Dialect/accent classification using unrestricted audio
Gamit et al. Isolated words recognition using mfcc lpc and neural network
Kumar et al. Machine learning based speech emotions recognition system
CN111968622A (en) Attention mechanism-based voice recognition method, system and device
Ali Multi-dialect Arabic speech recognition
CN114550706A (en) Smart campus voice recognition method based on deep learning
Mistry et al. Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann)
BABU PANDIPATI Speech to text conversion using deep learning neural net methods
Zhao et al. Research on voice cloning with a few samples
Fu et al. A survey on Chinese speech recognition
Win et al. Myanmar Text-to-Speech System based on Tacotron (End-to-End Generative Model)
Othmane et al. Enhancement of esophageal speech using voice conversion techniques
Sharma et al. Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art
CN113436607B (en) Quick voice cloning method
Nazir et al. Deep learning end to end speech synthesis: A review
Liu et al. A New Speech Encoder Based on Dynamic Framing Approach.
Chen et al. Phoneme-guided Dysarthric speech conversion With non-parallel data by joint training
CN116403562B (en) Speech synthesis method and system based on semantic information automatic prediction pause

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant