CN112562704A - BLSTM-based frequency division spectrum expansion anti-noise voice conversion method - Google Patents

BLSTM-based frequency division spectrum expansion anti-noise voice conversion method

Info

Publication number
CN112562704A
CN112562704A (application CN202011288173.XA)
Authority
CN
China
Prior art keywords
voice
blstm
frequency
speech
conversion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011288173.XA
Other languages
Chinese (zh)
Other versions
CN112562704B (en)
Inventor
孙蒙
苗晓孔
张雄伟
曹铁勇
郑昌艳
李莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA
Priority to CN202011288173.XA
Publication of CN112562704A
Application granted
Publication of CN112562704B
Active legal status (Current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a BLSTM-based frequency division spectrum expansion anti-noise voice conversion method, which comprises the following specific steps: filtering the source speech and the target speech and extracting the speech feature parameters, which comprise the fundamental frequency, the vocal tract spectrum and the aperiodic components; performing dynamic time warping alignment on the extracted vocal tract spectra of the source speech and the target speech; feeding the aligned source and target vocal tract spectra into the frequency-division conversion BLSTM network model for training to obtain the corresponding feature conversion networks; constructing a global statistical variance consistency filtering model; filtering the speech to be converted, extracting its feature parameters, and preprocessing them; and performing parameterized speech synthesis on the preprocessed feature parameters to generate the final converted speech. The invention designs a brand-new fusion rule and fuses the frequency-division-converted parts, thereby obtaining a vocal tract spectrum closer to the target and further improving the similarity of the converted speech.

Description

BLSTM-based frequency division spectrum expansion anti-noise voice conversion method
Technical Field
The invention belongs to the field of speech signal processing, and in particular relates to a BLSTM-based frequency division spectrum expansion anti-noise voice conversion method.
Background
Voice conversion is a speech-to-speech technique that changes the voice individuality of one speaker (the source speaker) so that it carries the voice individuality of another speaker (the target speaker). Voice conversion can be divided into two categories: non-specific-person voice conversion, which only needs to alter the source speaker's voice and is used in scenarios where the listener must not be able to recognize the speaker's identity; and specific-person voice conversion, which converts the source speaker's voice into the voice of a designated target person and is used in scenarios where the target person's identity is to be impersonated. Specific-person voice conversion meets the technical requirements of personalized speech generation and is one of the main hotspots of current research.
Specific-person voice conversion can be further divided into conversion on parallel corpora and conversion on non-parallel corpora. At present, the systems achieving higher conversion quality and similarity are generally based on parallel-corpus methods. The state of the art is briefly summarized as follows:
voice conversion dates back to the 1950s and 1960s, and has progressed from the classical Gaussian Mixture Model (GMM) to models that can effectively represent high-dimensional sequence data, such as Deep Neural Networks (DNN), Fully Convolutional Networks (FCN), Generative Adversarial Networks (Kaneko T., Kameoka H., Hiramatsu K., Kashino K., Sequence-to-Sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks, Interspeech 2017; Kaneko T., Kameoka H., Hojo N., Kashino K., Generative Adversarial Network-based Postfilter for Statistical Parametric Speech Synthesis, ICASSP 2017), and Bidirectional Long Short-Term Memory networks (Huang Z., Xu W., Yu K., Bidirectional LSTM-CRF Models for Sequence Tagging, arXiv:1508.01991, 2015). With the Voice Conversion Challenge (VCC) being held as an international event in recent years, voice conversion methods have continued to improve, further raising the quality and similarity of the converted speech. Although these voice conversion schemes are reasonable and effective and achieve good conversion results, most voice conversion methods are carried out under laboratory conditions and depend heavily on the size and quality of the training data: the larger and cleaner the training corpus, the better the converted speech; for small-sample and noisy speech data, the conversion ability of the models is limited and the quality of the converted speech drops greatly.
Disclosure of Invention
The invention aims to provide a frequency division spectrum expansion anti-noise voice conversion method based on BLSTM.
The technical solution for realizing the purpose of the invention is as follows: a BLSTM-based frequency division spectrum expansion anti-noise voice conversion method, comprising the following specific steps:
Step 1: filtering the source speech and the target speech, and extracting the speech feature parameters, which comprise the fundamental frequency, the vocal tract spectrum and the aperiodic components; performing dynamic time warping alignment on the extracted vocal tract spectra of the source speech and the target speech;
Step 2: feeding the aligned source and target vocal tract spectra into the frequency-division conversion BLSTM network model for training, respectively, to obtain the corresponding feature conversion networks;
Step 3: constructing a global statistical variance consistency filtering model;
Step 4: filtering the speech to be converted, extracting its feature parameters, and preprocessing the feature parameters;
Step 5: performing parameterized speech synthesis on the preprocessed feature parameters of the speech to be converted to generate the final converted speech.
Preferably, the extracted vocal tract spectral feature is a mel-frequency cepstrum.
Preferably, the frequency-division conversion BLSTM network model comprises two BLSTM networks with the same structure, each consisting of 3 hidden layers with 128, 256 and 128 hidden nodes respectively; one of the BLSTM networks has no dropout layer, while the other has a dropout rate of 0.5.
Preferably, the specific method for constructing the global statistical variance consistency filtering model comprises the following steps:
step 3-1, calculating the mean and variance of each dimension of the Mel cepstrum coefficients of the target sentences;
step 3-2, calculating the mean and variance of each dimension of the Mel cepstrum over all frames of the sentences obtained after the source speech is converted by the frequency-division conversion BLSTM network model;
step 3-3, constructing the primary global statistical variance consistency filter (formula (2), reproduced as an image in the original publication), where σ²_tar denotes the vector formed by the per-dimension Mel-cepstral variances of the target speech, σ²_con denotes the vector formed by the per-dimension Mel-cepstral variances of the sentences obtained after the source speech is converted by the frequency-division conversion BLSTM network model, y denotes the Mel cepstrum of the sentence to be converted, and ȳ denotes the vector formed by the per-dimension Mel-cepstral means over all frames of the source sentence to be converted in the testing stage;
step 3-4, setting an adjustment parameter to obtain the adjusted global statistical variance consistency filter (formula (3), reproduced as an image in the original publication), where ŷ is the Mel cepstrum obtained after filtering, y denotes the Mel cepstrum of the sentence to be converted, and α is the adjustment parameter.
Preferably, the mean and variance of each dimension of the Mel cepstrum coefficients of the target speech are calculated according to formula (1) (reproduced as an image in the original publication), where N denotes the number of target sentences in the training stage, M denotes the number of frames contained in each sentence, T denotes the dimension of the Mel cepstrum, i denotes the index of the Mel-cepstral dimension, μ_tar,i and σ²_tar,i denote respectively the per-dimension Mel-cepstral mean and variance obtained from all frames of all training sentences, and x_i denotes the i-th dimension of the Mel cepstrum.
Preferably, the specific method for preprocessing the feature parameters of the speech to be converted is as follows:
the non-periodic components remain unchanged;
carrying out logarithmic linear transformation on the fundamental frequency;
widening and frequency-dividing the vocal tract spectrum to obtain a high-frequency part and a full-band part; converting the high-frequency part and the full-band part with the frequency-division conversion BLSTM network model; frequency-dividing the converted full-band part again to obtain a high-frequency vocal tract spectrum and a low-frequency vocal tract spectrum; fusing the obtained vocal tract spectra of the different frequency bands through a fusion model; and sending the fused converted vocal tract spectrum features into the global statistical variance consistency filtering model obtained in step 3 for filtering, to obtain the converted and filtered vocal tract spectrum features.
Preferably, the fusion model is specifically:
MC_high,fused = a · mcep_high1 + (1 - a) · mcep_high2
MC_low1 = mcep_low1
MC_full,fused = [mcep_low1 + MC_high,fused]
where a is the fusion coefficient, mcep_high1 is the high-frequency vocal tract spectrum of the full-band vocal tract spectrum, mcep_low1 is the low-frequency vocal tract spectrum of the full-band vocal tract spectrum, and mcep_high2 is the high-frequency part of the vocal tract spectrum.
Preferably, the fusion coefficient is calculated according to formula (5) (reproduced as an image in the original publication), where mcep_high and mcep_low are the frequency-band parameter information of each part obtained by statistics after frequency division.
Compared with the prior art, the invention has the following remarkable advantages: 1) the invention introduces two newly designed filtering modules: the training data are filtered before feature extraction, and the vocal tract spectrum is filtered after feature conversion, so that noise is suppressed through time-frequency filtering and the quality of the converted speech is improved; 2) the method expands the dimensionality of the vocal tract spectrum features and extracts a high-dimensional vocal tract spectrum, improves the BLSTM network, and realizes frequency-division conversion and fusion of the high-dimensional vocal tract spectrum by designing two different BLSTM networks, which alleviates problems such as over-fitting or under-fitting caused by small-sample data and improves the conversion accuracy and adaptability of the model; 3) the invention designs a brand-new fusion rule, applies the fusion rule obtained by statistics in the training stage to the conversion process, and fuses the frequency-division-converted parts, thereby obtaining a vocal tract spectrum closer to the target and further improving the similarity of the converted speech.
The present invention is described in further detail below with reference to the attached drawings.
Drawings
FIG. 1 is a flow chart of a BLSTM-based voice conversion method with spectrum extension and noise immunity.
Fig. 2 is a flow chart of spectral broadening, frequency division, and weight fusion.
Fig. 3 is a schematic diagram of a frequency division converted BLSTM network architecture.
Detailed Description
As shown in fig. 1, a frequency division spectrum expansion anti-noise speech conversion method based on BLSTM includes the following steps:
step 1, filter the source speech and target speech used for training, and extract the speech feature parameters after removing part of the noise; perform dynamic time warping alignment on the extracted vocal tract spectra of the source speech and the target speech; and separately compute the mean u and variance σ² of the logarithmic fundamental frequency (log F0) of the source speech and the target speech, which are used in step 4 to calculate the linear conversion of the logarithmic fundamental frequency;
the speech feature parameters comprise the fundamental frequency, the vocal tract spectrum (the invention mainly uses the Mel cepstrum) and the aperiodic components;
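As an illustration of step 1, the following is a minimal sketch of feature extraction and alignment, assuming the WORLD vocoder (pyworld) for analysis, pysptk for the Mel cepstrum and librosa for dynamic time warping; these library choices, the 16 kHz sampling rate and the warping factor are assumptions not specified by the patent.

```python
# Hypothetical sketch of step 1: feature extraction and DTW alignment.
# Library choices (pyworld, pysptk, librosa), FS and ALPHA are assumptions.
import numpy as np
import pyworld
import pysptk
import librosa

FS = 16000            # assumed sampling rate
MCEP_ORDER = 128      # yields the widened 129-dimensional Mel cepstrum described below
ALPHA = 0.42          # common frequency-warping factor at 16 kHz

def extract_features(wav):
    """Extract fundamental frequency, vocal tract spectrum (Mel cepstrum) and aperiodicity."""
    wav = wav.astype(np.float64)
    f0, sp, ap = pyworld.wav2world(wav, FS)                    # WORLD analysis
    mcep = pysptk.sp2mc(sp, order=MCEP_ORDER, alpha=ALPHA)     # spectral envelope -> Mel cepstrum
    return f0, mcep, ap

def dtw_align(src_mcep, tgt_mcep):
    """Dynamic time warping alignment of source and target vocal tract spectra."""
    _, wp = librosa.sequence.dtw(src_mcep.T, tgt_mcep.T, metric="euclidean")
    wp = wp[::-1]                                              # warping path in forward time order
    return src_mcep[wp[:, 0]], tgt_mcep[wp[:, 1]]
```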
step 2, feed the aligned source and target vocal tract spectra into the frequency-division conversion BLSTM network model of FIG. 3 for training, obtaining the corresponding feature conversion networks;
the frequency-division conversion BLSTM network consists of two BLSTM networks with the same structure, namely BLSTM1 and BLSTM2. Both BLSTM networks are composed of 3 hidden layers, with 128, 256 and 128 hidden nodes respectively. BLSTM1 has no dropout layer, while BLSTM2 uses a dropout rate of 0.5.
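A minimal sketch of the two conversion networks described above, assuming a Keras implementation; the layer sizes (128/256/128) and the dropout rate follow the text, while the input/output dimensionalities, optimizer and loss are illustrative assumptions.

```python
# Hypothetical Keras sketch of BLSTM1 (no dropout) and BLSTM2 (dropout 0.5).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_blstm(input_dim, output_dim, dropout=0.0):
    """3 stacked bidirectional LSTM layers with 128/256/128 hidden nodes."""
    model = models.Sequential([
        layers.Input(shape=(None, input_dim)),   # variable-length sequences of cepstral frames
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(256, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    ])
    if dropout > 0.0:
        model.add(layers.Dropout(dropout))       # only BLSTM2 uses a dropout layer (0.5)
    model.add(layers.Dense(output_dim))          # frame-wise regression to the target features
    model.compile(optimizer="adam", loss="mse")  # assumed training objective
    return model

blstm1 = build_blstm(input_dim=129, output_dim=129)               # full-band branch, no dropout
blstm2 = build_blstm(input_dim=64, output_dim=64, dropout=0.5)    # high-frequency branch (dims assumed)
```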
Step 3, compute the mean and variance statistics of the target-speech vocal tract spectrum features and of the converted vocal tract spectrum features, and construct the global statistical variance consistency filtering model; the global statistical variance consistency filtering model is used to obtain the filtered vocal tract spectrum of the speech to be converted. The specific implementation process is as follows:
step 3-1, calculate the mean and variance of each dimension of the Mel cepstrum coefficients of the target speech according to formula (1) (reproduced as an image in the original publication), where N denotes the number of target sentences in the training phase, M denotes the number of frames contained in each sentence, T denotes the dimension of the Mel cepstrum, i denotes the index of the Mel-cepstral dimension, and μ_tar,i and σ²_tar,i are respectively the per-dimension Mel-cepstral mean and variance obtained from all frames of all training sentences; tar is an abbreviation of the conversion target.
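Formula (1) itself appears only as an image in the original; assuming the standard per-dimension sample statistics over all N·M training frames, it can be written as:

$$ \mu_i^{tar} = \frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M} x_i(n,m), \qquad \left(\sigma_i^{tar}\right)^2 = \frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M}\left(x_i(n,m)-\mu_i^{tar}\right)^2, \qquad i=1,\dots,T $$

where x_i(n, m) denotes the i-th Mel-cepstral dimension of frame m of training sentence n.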
Step 3-2, using formula (1), calculate the per-dimension Mel-cepstral mean μ_con,i and variance σ²_con,i over all frames of the sentences obtained after the source speech in the training data is converted by the frequency-division conversion BLSTM network model, as well as the vector ȳ formed by the per-dimension Mel-cepstral means over all frames of the source sentence to be converted in the testing stage and the corresponding variance vector σ²_y, where con is an abbreviation of convert.
Step 3-3, construct the primary global statistical variance consistency filter and obtain the primary filtered data according to formula (2) (reproduced as an image in the original publication), where σ²_tar denotes the vector formed by the per-dimension Mel-cepstral variances of the target speaker's sentences in the training set, σ²_con denotes the vector formed by the per-dimension Mel-cepstral variances of the sentences obtained by converting the source speech in the training set with the frequency-division conversion BLSTM network model, and y denotes the Mel cepstrum of the sentence to be converted.
Step 3-4, set the parameter α according to formula (3) (reproduced as an image in the original publication; α is usually set according to the actual effect and is set to 0.2 in the experiments) and adjust to obtain the global statistical variance consistency filter, where ŷ is the Mel cepstrum obtained after the final filtering; this parameter is then used to generate the converted speech.
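Since formulas (2) and (3) are printed as images, the following sketch shows one consistent reading of steps 3-1 to 3-4: a per-dimension variance-scaling filter blended with the unfiltered cepstrum by the adjustment parameter α. The exact form, including which term α weights, is an assumption rather than the patent's own equation.

```python
# Hypothetical sketch of the global statistical variance consistency filter (steps 3-1 to 3-4).
import numpy as np

def dimension_stats(mceps):
    """Formula (1): per-dimension mean and variance over all frames of all sentences."""
    frames = np.concatenate(mceps, axis=0)        # (total_frames, T)
    return frames.mean(axis=0), frames.var(axis=0)

def gv_consistency_filter(y, var_tar, var_con, alpha=0.2):
    """Filter the Mel cepstrum y (frames x T) of the sentence to be converted."""
    y_mean = y.mean(axis=0)                                   # per-dimension mean of the test sentence
    scale = np.sqrt(var_tar / np.maximum(var_con, 1e-12))     # match converted variance to target variance
    y_primary = scale * (y - y_mean) + y_mean                 # assumed form of the primary filter (2)
    return alpha * y + (1.0 - alpha) * y_primary              # assumed form of the adjusted filter (3)
```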
Step 4, after filtering the speech to be converted, extract its feature parameters, which comprise the logarithmic fundamental frequency, the vocal tract spectrum and the aperiodic components, and preprocess them. The specific preprocessing is as follows:
the non-periodic components remain unchanged;
the fundamental frequency F0 is transformed log-linearly according to formula (4) (reproduced as an image in the original publication), where p_t^(Y) and p_t^(X) respectively denote the converted log F0 and the original log F0, u^(X) and u^(Y) denote the means of the logarithmic fundamental frequencies of the source speech and the target speech computed in step 1, and σ^(X) and σ^(Y) are the standard deviations of the logarithmic fundamental frequencies of the source speech and the target speech computed in step 1.
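Formula (4) likewise appears only as an image; assuming the standard Gaussian-normalized linear transform of the logarithmic fundamental frequency, consistent with the means and standard deviations defined above, it reads:

$$ p_t^{(Y)} = \frac{\sigma^{(Y)}}{\sigma^{(X)}}\left(p_t^{(X)} - u^{(X)}\right) + u^{(Y)} $$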
Widening and frequency-dividing the vocal tract spectrum yields a high-frequency part and a full-band part; the high-frequency part and the full-band part are converted with the frequency-division conversion BLSTM network model; the converted full-band part is frequency-divided again to obtain a high-frequency vocal tract spectrum and a low-frequency vocal tract spectrum; the obtained vocal tract spectra of the different frequency bands are fused through a fusion model; and the fused converted vocal tract spectrum features are sent into the global statistical variance consistency filtering model obtained in step 3 for filtering, yielding the converted and filtered vocal tract spectrum features;
in a further embodiment, the specific method for widening and frequency-dividing the vocal tract spectrum of the speech to be converted is as follows:
In traditional voice conversion models, in order to compress the model and reduce training time, the vocal tract spectrum is converted into a 24-dimensional or 39-dimensional Mel cepstrum. The invention attempted voice conversion at such low dimensions and found it difficult to resist partial noise interference while ensuring the quality of the converted speech. Therefore, when obtaining vocal tract spectrum parameters such as the Mel cepstrum, the parameters are set so as to directly obtain a widened 129-dimensional high-dimensional Mel cepstrum. The high-dimensional information not only retains more information parameters, but also helps to alleviate problems such as data over-fitting when training on small-sample corpora.
The vocal tract spectrum of the speech to be converted is widened and frequency-divided to obtain a high-frequency part and a full-band part; the high-frequency part and the full-band part are converted with the frequency-division conversion BLSTM network model; and the converted full-band part is frequency-divided again to obtain a high-frequency vocal tract spectrum and a low-frequency vocal tract spectrum.
In a further embodiment, as shown in fig. 2, the obtained vocal tract spectrums of different frequency bands are fused by a fusion model, and the specific fusion parameters involved in the fusion process are extracted as follows:
extracting fusion coefficients
Counting the high-frequency vocal tract spectrum and the low-frequency vocal tract spectrum in the vocal tract spectrum information in the training stage, and then calculating according to a formula (5) to obtain a fusion coefficient a
Figure BDA0002783043930000071
The fusion coefficient obtained in a in the above formula, mcepHeight of and mcepIs low inNamely the parameter information of each partial frequency band obtained after frequency division statistics.
Conversion weight fusion:
Weight fusion is performed according to formula (6) to obtain the final converted vocal tract spectrum:
MC_high,fused = a · mcep_high1 + (1 - a) · mcep_high2
MC_low1 = mcep_low1
MC_full,fused = [mcep_low1 + MC_high,fused]   (6)
where a is the fusion coefficient, mcep_high1 is the high-frequency vocal tract spectrum of the full-band vocal tract spectrum, mcep_low1 is the low-frequency vocal tract spectrum of the full-band vocal tract spectrum, and mcep_high2 is the high-frequency part of the vocal tract spectrum.
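A sketch of the frequency division and weight fusion of FIG. 2 follows. The boundary between the low- and high-frequency cepstral dimensions and the exact statistic behind formula (5) are not recoverable from the text (the formula is an image), so the split point and the coefficient estimate below are illustrative assumptions; the bracketed sum in formula (6) is read here as reassembling the low- and high-frequency bands.

```python
# Hypothetical sketch of frequency division and weight fusion (formulas (5)-(6)).
import numpy as np

SPLIT = 65  # assumed boundary between low- and high-frequency cepstral dimensions

def divide(mcep):
    """Split a (frames x 129) Mel cepstrum into low- and high-frequency parts."""
    return mcep[:, :SPLIT], mcep[:, SPLIT:]

def fusion_coefficient(train_mceps):
    """Assumed form of formula (5): coefficient a from training-stage band statistics."""
    low = np.concatenate([divide(m)[0] for m in train_mceps]).mean()
    high = np.concatenate([divide(m)[1] for m in train_mceps]).mean()
    return abs(high) / (abs(high) + abs(low) + 1e-12)

def fuse(full_band_converted, high_converted, a):
    """Formula (6): blend the two converted high-band estimates and keep the low band."""
    mcep_low1, mcep_high1 = divide(full_band_converted)        # re-divide the converted full band
    mc_high_fused = a * mcep_high1 + (1.0 - a) * high_converted
    # "[mcep_low1 + MC_high,fused]" is interpreted here as band-wise reassembly (assumption)
    return np.concatenate([mcep_low1, mc_high_fused], axis=1)
```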
Fig. 3 shows the specific internal structure of the frequency-division weight-fusion network. The main difference between BLSTM1 and BLSTM2 is that BLSTM2 adds a dropout layer, preventing over-fitting caused by the smaller amount of information in the high-frequency band.
And step 5, perform parameterized speech synthesis on the aperiodic components preprocessed in step 4, the converted and filtered vocal tract spectrum, the log-linearly converted fundamental frequency and the other parameters to generate the final converted speech.
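A minimal sketch of the step-5 parametric synthesis, assuming the WORLD vocoder (pyworld) and pysptk for mapping the filtered Mel cepstrum back to a spectral envelope; the FFT length, sampling rate and warping factor follow the assumptions of the step-1 sketch above.

```python
# Hypothetical sketch of step 5: parameterized speech synthesis with the WORLD vocoder.
import numpy as np
import pyworld
import pysptk

FS, FFT_LEN, ALPHA = 16000, 1024, 0.42   # assumed analysis/synthesis settings

def synthesize(converted_f0, filtered_mcep, ap):
    """converted_f0 is the linear-scale F0 (exp of the log-linearly transformed log F0)."""
    sp = pysptk.mc2sp(filtered_mcep.astype(np.float64), alpha=ALPHA, fftlen=FFT_LEN)
    return pyworld.synthesize(converted_f0.astype(np.float64), sp, ap.astype(np.float64), FS)
```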

Claims (8)

1. A BLSTM-based frequency division spectrum expansion anti-noise voice conversion method, characterized by comprising the following specific steps:
step 1: filtering the source speech and the target speech, and extracting the speech feature parameters, which comprise the fundamental frequency, the vocal tract spectrum and the aperiodic components; performing dynamic time warping alignment on the extracted vocal tract spectra of the source speech and the target speech;
step 2: feeding the aligned source and target vocal tract spectra into the frequency-division conversion BLSTM network model for training, respectively, to obtain the corresponding feature conversion networks;
step 3: constructing a global statistical variance consistency filtering model;
step 4: filtering the speech to be converted, extracting its feature parameters, and preprocessing the feature parameters;
step 5: performing parameterized speech synthesis on the preprocessed feature parameters of the speech to be converted to generate the final converted speech.
2. The BLSTM-based frequency division spectrum expansion anti-noise voice conversion method according to claim 1, wherein the extracted vocal tract spectrum feature is the Mel cepstrum.
3. The BLSTM-based frequency division spectrum expansion anti-noise voice conversion method according to claim 1, wherein the frequency-division conversion BLSTM network model comprises two BLSTM networks with the same structure, each consisting of 3 hidden layers with 128, 256 and 128 hidden nodes respectively; one of the BLSTM networks has no dropout layer, while the other has a dropout rate of 0.5.
4. The BLSTM-based frequency division spectrum expansion anti-noise voice conversion method according to claim 1, wherein the specific method for constructing the global statistical variance consistency filtering model is as follows:
step 3-1, calculating the mean and variance of each dimension of the Mel cepstrum coefficients of the target sentences;
step 3-2, calculating the mean and variance of each dimension of the Mel cepstrum over all frames of the sentences obtained after the source speech is converted by the frequency-division conversion BLSTM network model;
step 3-3, constructing the primary global statistical variance consistency filter (formula (2), reproduced as an image in the original publication), where σ²_tar denotes the vector formed by the per-dimension Mel-cepstral variances of the target speech, σ²_con denotes the vector formed by the per-dimension Mel-cepstral variances of the sentences obtained after the source speech is converted by the frequency-division conversion BLSTM network model, y denotes the Mel cepstrum of the sentence to be converted, and ȳ denotes the vector formed by the per-dimension Mel-cepstral means over all frames of the source sentence to be converted in the testing stage;
step 3-4, setting an adjustment parameter to obtain the adjusted global statistical variance consistency filter (formula (3), reproduced as an image in the original publication), where ŷ is the Mel cepstrum obtained after filtering, y denotes the Mel cepstrum of the sentence to be converted, and α is the adjustment parameter.
5. The BLSTM-based frequency division spectrum expansion anti-noise voice conversion method according to claim 4, wherein the mean and variance of each dimension of the Mel cepstrum coefficients of the target speech are calculated according to formula (1) (reproduced as an image in the original publication), where N denotes the number of target sentences in the training phase, M denotes the number of frames contained in each sentence, T denotes the dimension of the Mel cepstrum, i denotes the index of the Mel-cepstral dimension, μ_tar,i and σ²_tar,i denote respectively the per-dimension Mel-cepstral mean and variance obtained from all frames of all training sentences, and x_i denotes the i-th dimension of the Mel cepstrum.
6. The BLSTM-based frequency division spectrum expansion anti-noise voice conversion method according to claim 1, wherein the specific method for preprocessing the feature parameters of the speech to be converted is as follows:
the aperiodic components remain unchanged;
the fundamental frequency is transformed log-linearly;
the vocal tract spectrum is widened and frequency-divided to obtain a high-frequency part and a full-band part; the high-frequency part and the full-band part are converted with the frequency-division conversion BLSTM network model; the converted full-band part is frequency-divided again to obtain a high-frequency vocal tract spectrum and a low-frequency vocal tract spectrum; the obtained vocal tract spectra of the different frequency bands are fused through a fusion model; and the fused converted vocal tract spectrum features are sent into the global statistical variance consistency filtering model obtained in step 3 for filtering, to obtain the converted and filtered vocal tract spectrum features.
7. The BLSTM-based frequency division spectrum expansion anti-noise voice conversion method according to claim 6, wherein the fusion model is specifically:
MC_high,fused = a · mcep_high1 + (1 - a) · mcep_high2
MC_low1 = mcep_low1
MC_full,fused = [mcep_low1 + MC_high,fused]
where a is the fusion coefficient, mcep_high1 is the high-frequency vocal tract spectrum of the full-band vocal tract spectrum, mcep_low1 is the low-frequency vocal tract spectrum of the full-band vocal tract spectrum, and mcep_high2 is the high-frequency part of the vocal tract spectrum.
8. The BLSTM-based frequency division spectrum expansion anti-noise voice conversion method according to claim 6, wherein the fusion coefficient is calculated according to formula (5) (reproduced as an image in the original publication), where mcep_high and mcep_low are the frequency-band parameter information of each part obtained by statistics after frequency division.
CN202011288173.XA 2020-11-17 2020-11-17 Frequency division topological anti-noise voice conversion method based on BLSTM Active CN112562704B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011288173.XA CN112562704B (en) 2020-11-17 2020-11-17 Frequency division topological anti-noise voice conversion method based on BLSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011288173.XA CN112562704B (en) 2020-11-17 2020-11-17 Frequency division topological anti-noise voice conversion method based on BLSTM

Publications (2)

Publication Number Publication Date
CN112562704A (en) 2021-03-26
CN112562704B CN112562704B (en) 2023-08-18

Family

ID=75043062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011288173.XA Active CN112562704B (en) 2020-11-17 2020-11-17 Frequency division topological anti-noise voice conversion method based on BLSTM

Country Status (1)

Country Link
CN (1) CN112562704B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120077527A (en) * 2010-12-30 2012-07-10 부산대학교 산학협력단 Apparatus and method for feature compensation using weighted auto-regressive moving average filter and global cepstral mean and variance normalization
CN104658547A (en) * 2013-11-20 2015-05-27 大连佑嘉软件科技有限公司 Method for expanding artificial voice bandwidth
US20160322055A1 (en) * 2015-03-27 2016-11-03 Google Inc. Processing multi-channel audio waveforms
US20180012613A1 (en) * 2016-07-11 2018-01-11 The Chinese University Of Hong Kong Phonetic posteriorgrams for many-to-one voice conversion
US10249314B1 (en) * 2016-07-21 2019-04-02 Oben, Inc. Voice conversion system and method with variance and spectrum compensation
CN108986834A (en) * 2018-08-22 2018-12-11 中国人民解放军陆军工程大学 The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network
US10726830B1 (en) * 2018-09-27 2020-07-28 Amazon Technologies, Inc. Deep multi-channel acoustic modeling
CN109767778A (en) * 2018-12-27 2019-05-17 中国人民解放军陆军工程大学 A kind of phonetics transfer method merging Bi-LSTM and WaveNet
CN110473564A (en) * 2019-07-10 2019-11-19 西北工业大学深圳研究院 A kind of multi-channel speech enhancement method based on depth Wave beam forming
CN110648680A (en) * 2019-09-23 2020-01-03 腾讯科技(深圳)有限公司 Voice data processing method and device, electronic equipment and readable storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Bajibabu Bollepalli: "Normal-to-Lombard adaptation of speech synthesis using long short-term memory recurrent neural networks", Speech Communication *
张雄伟: "Research progress and prospects of speech dereverberation technology", Journal of Data Acquisition and Processing, pages 1069-1081 *
张雄伟: "Research status and prospects of voice conversion technology", Journal of Data Acquisition and Processing, pages 753-770 *
曾歆; 张雄伟; 孙蒙; 苗晓孔; 姚琨: "Vocal tract spectrum conversion based on the GMM model and joint LPC-MFCC features", Technical Acoustics, no. 04 *
苗晓孔, 张雄伟, 孙蒙: "Voice conversion method realizing fundamental frequency (F0) fusion transformation based on BLSTM", Science Discovery 2018; 6(4), pages 298-305 *

Also Published As

Publication number Publication date
CN112562704B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
Qian et al. Very deep convolutional neural networks for noise robust speech recognition
CN109767778B (en) Bi-L STM and WaveNet fused voice conversion method
Hu et al. Pitch‐based gender identification with two‐stage classification
Jiang et al. Geometric methods for spectral analysis
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
CN113314140A (en) Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network
CN112259119B (en) Music source separation method based on stacked hourglass network
CN107564543A (en) A kind of Speech Feature Extraction of high touch discrimination
CN111583957B (en) Drama classification method based on five-tone music rhythm spectrogram and cascade neural network
CN103345920B (en) Self-adaptation interpolation weighted spectrum model voice conversion and reconstructing method based on Mel-KSVD sparse representation
CN112382308A (en) Zero-order voice conversion system and method based on deep learning and simple acoustic features
CN105679321A (en) Speech recognition method and device and terminal
CN114141237A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN101178895A (en) Model self-adapting method based on generating parameter listen-feel error minimize
CN114495969A (en) Voice recognition method integrating voice enhancement
CN112562704A (en) BLSTM-based frequency division spectrum expansion anti-noise voice conversion method
CN116364096B (en) Electroencephalogram signal voice decoding method based on generation countermeasure network
CN116013339A (en) Single-channel voice enhancement method based on improved CRN
Alku et al. Linear predictive method for improved spectral modeling of lower frequencies of speech with small prediction orders
CN110619886B (en) End-to-end voice enhancement method for low-resource Tujia language
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement
CN112992157A (en) Neural network noisy line identification method based on residual error and batch normalization
Gonzales et al. Voice conversion of philippine spoken languages using deep neural networks
Kumar et al. Comparative Analysis of Features In a Speech Emotion Recognition System using Convolutional Neural Networks
Liao et al. Acoustic Model for Sichuan Dialect Speech Recognition Based on Deep Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant