CN112562704A - BLSTM-based frequency division spectrum expansion anti-noise voice conversion method - Google Patents
BLSTM-based frequency division spectrum expansion anti-noise voice conversion method
- Publication number
- CN112562704A CN112562704A CN202011288173.XA CN202011288173A CN112562704A CN 112562704 A CN112562704 A CN 112562704A CN 202011288173 A CN202011288173 A CN 202011288173A CN 112562704 A CN112562704 A CN 112562704A
- Authority
- CN
- China
- Prior art keywords
- voice
- blstm
- frequency
- speech
- conversion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000001228 spectrum Methods 0.000 title claims abstract description 85
- 238000006243 chemical reaction Methods 0.000 title claims abstract description 84
- 238000000034 method Methods 0.000 title claims abstract description 31
- 230000004927 fusion Effects 0.000 claims abstract description 32
- 238000001914 filtration Methods 0.000 claims abstract description 30
- 230000001755 vocal effect Effects 0.000 claims abstract description 23
- 238000012549 training Methods 0.000 claims abstract description 21
- 238000007781 pre-processing Methods 0.000 claims abstract description 6
- 230000015572 biosynthetic process Effects 0.000 claims abstract description 5
- 238000003786 synthesis reaction Methods 0.000 claims abstract description 5
- 230000000737 periodic effect Effects 0.000 claims description 6
- 238000012360 testing method Methods 0.000 claims description 3
- 230000009466 transformation Effects 0.000 claims description 2
- 230000000694 effects Effects 0.000 description 5
- 238000013528 artificial neural network Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 230000003595 spectral effect Effects 0.000 description 2
- 230000003044 adaptive effect Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000002457 bidirectional effect Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000007499 fusion processing Methods 0.000 description 1
- 230000036039 immunity Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a BLSTM-based frequency division spectrum expansion anti-noise voice conversion method, which comprises the following specific steps: filtering the source speech and the target speech and extracting the speech feature parameters, namely the fundamental frequency, the vocal tract spectrum and the aperiodic component; aligning the extracted source and target vocal tract spectra by dynamic time warping; inputting the aligned source and target vocal tract spectra into the frequency-division BLSTM network model for training to obtain the corresponding feature conversion networks; constructing a global statistical variance consistency filtering model; filtering the speech to be converted, extracting its feature parameters and preprocessing them; and performing parameterized speech synthesis on the preprocessed feature parameters to generate the final converted speech. The invention designs a new fusion rule that fuses the frequency-division converted parts, thereby obtaining a vocal tract spectrum closer to the target and further improving the similarity of the converted speech.
Description
Technical Field
The invention belongs to the field of speech signal processing, and particularly relates to a BLSTM-based frequency division spectrum expansion anti-noise voice conversion method.
Background
Voice conversion, a speech-to-speech technique, refers to changing the voice individuality of one speaker (the source speaker) so that it takes on the voice individuality of another speaker (the target speaker). Voice conversion can be divided into two categories: non-specific-person voice conversion, which only changes the source speaker's voice and serves scenarios where the speaker's own identity must not be recognized; and specific-person voice conversion, which converts the source speaker's voice into that of a specific target person, for scenarios where the target's identity is impersonated. Specific-person voice conversion meets the technical requirements of personalized speech generation and is one of the main hotspots of current research.
Speech conversion for a specific speaker can further be divided into conversion on parallel corpora and conversion on non-parallel corpora. At present, systems with higher conversion quality and similarity are generally based on parallel-corpus methods. The state of the art is briefly summarized as follows:
Speech conversion dates back to the 1950s and 1960s, and has progressed from the classical Gaussian mixture model (GMM) to deep neural network models that can effectively represent high-dimensional sequence data, such as fully convolutional networks (FCN), generative adversarial networks (Kaneko T., Kameoka H., Hiramatsu K., Kashino K., "Sequence-to-Sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks", Interspeech 2017; Kaneko T., Kameoka H., Hojo N., et al., "Generative adversarial network-based postfilter for statistical parametric speech synthesis", ICASSP 2017), and bidirectional long short-term memory networks (Huang Z., Xu W., Yu K., "Bidirectional LSTM-CRF Models for Sequence Tagging", arXiv:1508.01991, 2015). With the continued running of the international Voice Conversion Challenge (VCC), voice conversion methods have kept improving in recent years, and the quality and similarity of converted speech have further increased. Although these voice conversion schemes are reasonable and effective and achieve good conversion results, most are carried out under laboratory conditions and depend heavily on the size and quality of the training data: the larger the amount of training samples and the cleaner the training corpus, the better the converted speech; for small-sample and noisy speech data, the conversion capability of the models is limited and the quality of the converted speech drops considerably.
Disclosure of Invention
The invention aims to provide a frequency division spectrum expansion anti-noise voice conversion method based on BLSTM.
The technical solution for realizing the purpose of the invention is as follows: a BLSTM-based frequency division spectrum expansion anti-noise voice conversion method, comprising the following specific steps:
Step 1: filtering the source speech and the target speech and extracting the speech feature parameters, namely the fundamental frequency, the vocal tract spectrum and the aperiodic component; aligning the extracted source and target vocal tract spectra by dynamic time warping;
Step 2: inputting the aligned source and target vocal tract spectra into the frequency-division BLSTM network model for training to obtain the corresponding feature conversion networks;
Step 3: constructing a global statistical variance consistency filtering model;
Step 4: filtering the speech to be converted, extracting its feature parameters, and preprocessing the feature parameters;
Step 5: performing parameterized speech synthesis on the preprocessed feature parameters to generate the final converted speech.
Preferably, the extracted vocal tract spectral feature is a mel-frequency cepstrum.
Preferably, the frequency-division BLSTM network model comprises two BLSTM networks with the same structure, each composed of 3 hidden layers with 128, 256 and 128 hidden nodes respectively; one BLSTM network has no dropout layer, and the dropout parameter of the other BLSTM network is 0.5.
Preferably, the specific method for constructing the global statistical variance consistency filtering model comprises the following steps:
Step 3-1: calculating the mean and variance of each dimension of the Mel cepstrum of the target training sentences;
Step 3-2: calculating the mean and variance of each dimension of the Mel cepstrum over all frames of the sentences obtained after converting the source speech with the frequency-division BLSTM network model;
Step 3-3: constructing a global statistical variance consistency primary filter, specifically:

ỹ = sqrt(σ²_tar / σ²_con) · (y − μ_y) + μ_y

wherein σ²_tar is the vector formed by the per-dimension Mel cepstrum variances of the target speech, σ²_con is the vector formed by the per-dimension Mel cepstrum variances of the sentences obtained after converting the source speech with the frequency-division BLSTM network model, y is the Mel cepstrum of the sentence to be converted, μ_y is the vector formed by the per-dimension Mel cepstrum means over all frames of the source sentence to be converted at the testing stage, and all operations are element-wise;
Step 3-4: setting an adjustment parameter to obtain the adjusted global statistical variance consistency filter, specifically:

ŷ = α·y + (1 − α)·ỹ

wherein ŷ is the Mel cepstrum obtained after filtering, y is the Mel cepstrum of the sentence to be converted, ỹ is the output of the primary filter, and α is the adjustment parameter.
Preferably, the specific formula for calculating the per-dimension mean and variance of the Mel cepstrum of the target speech is:

μ_tar,i = (1 / (N·M)) · Σ_{n=1..N} Σ_{m=1..M} x_i(n, m)
σ²_tar,i = (1 / (N·M)) · Σ_{n=1..N} Σ_{m=1..M} (x_i(n, m) − μ_tar,i)²,  i = 1, …, T

wherein N represents the number of target sentences in the training stage, M the number of frames contained in each sentence, T the dimension of the Mel cepstrum, i the index of the Mel cepstrum dimension, μ_tar,i and σ²_tar,i the per-dimension Mel cepstrum mean and variance obtained from all frames of all training sentences, and x_i(n, m) the i-th dimension of the Mel cepstrum of frame m of sentence n.
Preferably, the specific method for preprocessing the feature parameters of the speech to be converted is as follows:
the aperiodic component remains unchanged;
the fundamental frequency undergoes a log-linear transformation;
the vocal tract spectrum is widened and frequency-divided to obtain a high-frequency part and a full-band part; the two parts are converted with the frequency-division BLSTM network model; the converted full-band part is frequency-divided again to obtain a high-frequency and a low-frequency vocal tract spectrum; the vocal tract spectra of the different bands are fused by the fusion model; and the fused converted vocal tract spectrum features are fed into the global statistical variance consistency filtering model of step 3 for filtering, yielding the converted and filtered vocal tract spectrum features.
Preferably, the fusion model is specifically:

MC_high-fuse = a·mcep_high1 + (1 − a)·mcep_high2
MC_low = mcep_low1
MC_full-fuse = [mcep_low1, MC_high-fuse]

wherein a is the fusion coefficient, mcep_high1 and mcep_low1 are the high-frequency and low-frequency parts of the converted full-band vocal tract spectrum, and mcep_high2 is the converted high-frequency part of the vocal tract spectrum.
Preferably, the fusion coefficient a is obtained from the per-band parameter information mcep_high and mcep_low computed by frequency-division statistics in the training stage.
Compared with the prior art, the invention has the following remarkable advantages: 1) two newly designed filtering modules are introduced: the training data is filtered before feature extraction and the vocal tract spectrum is filtered after feature conversion, so that noise is suppressed by time-frequency filtering and the quality of the converted speech is improved; 2) the vocal tract spectrum features are expanded in dimension and a high-dimensional vocal tract spectrum is extracted; the BLSTM network is improved, and frequency-division conversion and fusion of the high-dimensional vocal tract spectrum are realized by designing two different BLSTM networks, which alleviates over-fitting and under-fitting caused by small sample data and improves the conversion precision and adaptability of the model; 3) a new fusion rule is designed: the fusion rule obtained by statistics in the training stage is used in the conversion process to fuse the frequency-division converted parts, thereby obtaining a vocal tract spectrum closer to the target and further improving the similarity of the converted speech.
The present invention is described in further detail below with reference to the attached drawings.
Drawings
FIG. 1 is a flow chart of a BLSTM-based voice conversion method with spectrum extension and noise immunity.
Fig. 2 is a flow chart of spectral broadening, frequency division, and weight fusion.
Fig. 3 is a schematic diagram of a frequency division converted BLSTM network architecture.
Detailed Description
As shown in fig. 1, a frequency division spectrum expansion anti-noise speech conversion method based on BLSTM includes the following steps:
Step 1: filtering the source and target training speech and, after part of the noise has been removed, extracting the speech feature parameters; aligning the extracted source and target vocal tract spectra by dynamic time warping; and computing the mean μ and variance σ² of the log fundamental frequency (log F0) of the source and target speech respectively, for the log-F0 linear conversion of step 4;
the speech feature parameters comprise the fundamental frequency, the vocal tract spectrum (the invention mainly adopts the Mel cepstrum) and the aperiodic component.
the frequency-division converted BLSTM network is composed of two BLSTM networks with the same structure, namely BLSTM1 and BLSTM 2. Two BLSTM networks are constituteed by 3 hidden layers, and the hidden node number of three-layer is respectively: 128, 256, 128. BLSTM1 has no dropout layer, and BLSTM2 has a dropout layer parameter of 0.5.
Step 3, counting the mean variance of the target voice sound channel spectrum characteristics and the mean variance of the converted sound channel spectrum characteristics, and constructing a global statistical variance consistency filtering model; the global statistical variance consistency filtering model is used for obtaining a sound channel spectrum of the voice to be converted, and the specific implementation process is as follows:
Step 3-1: calculating the mean and variance of each dimension of the Mel cepstrum of the target speech, with the calculation formula:

μ_tar,i = (1 / (N·M)) · Σ_{n=1..N} Σ_{m=1..M} x_i(n, m)
σ²_tar,i = (1 / (N·M)) · Σ_{n=1..N} Σ_{m=1..M} (x_i(n, m) − μ_tar,i)²,  i = 1, …, T   (1)

wherein N represents the number of target sentences in the training stage, M the number of frames contained in each sentence, T the dimension of the Mel cepstrum, i the index of the Mel cepstrum dimension, μ_tar,i and σ²_tar,i the per-dimension Mel cepstrum mean and variance obtained from all frames of all training sentences, and x_i(n, m) the i-th dimension of the Mel cepstrum of frame m of sentence n; the subscript tar is an abbreviation of the conversion target.
Step 3-2, calculating the mean value of each dimension Mel cepstrum of all frames of sentences obtained after the BLSTM network model conversion of source speech in the training data through frequency division conversion by using formula (1)Sum varianceAnd a vector formed by all dimensionality Mel cepstrum mean values of all frames of the source speech to-be-converted sentence in the testing stageVector sigma formed by the sum variancey 2Where con is an abbreviation for convert.
Step 3-3, constructing a primary filter with global statistical variance consistency to obtain primary filtered data
In the formula (2), σ2 tarMel cepstrum mean of each dimension of sentence representing target speaker in training setConstructed vector, σ2 conRepresenting each dimension Mel cepstrum mean value of sentences obtained by BLSTM network model conversion of source speech in training set through frequency division conversionThe constructed vector, y, represents the mel cepstrum of the sentence to be converted.
Step 3-4, setting a parameter α (α is usually set according to an actual effect, and is set to 0.2 in the experiment) according to the formula (3), and adjusting to obtain a global statistical variance consistency filter:
wherein ,is the resulting mel cepstrum after the final filtering, and this parameter will be used to generate the converted speech.
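The global statistical variance consistency filter of steps 3-3 and 3-4 amounts to a per-dimension variance rescaling followed by a blend with the unfiltered cepstrum. A minimal NumPy sketch, assuming the reconstructed forms of formulas (2) and (3):

```python
import numpy as np

def gv_filter(y, var_tar, var_con, alpha=0.2):
    """Global statistical variance consistency filter (sketch).

    y       : (frames, dims) Mel cepstrum of the converted utterance
    var_tar : per-dimension variance of the target training cepstra
    var_con : per-dimension variance of the converted training cepstra
    alpha   : adjustment parameter (0.2 in the patent's experiments)

    Formula (2): rescale each dimension about the utterance mean so that
    its variance matches the target's. Formula (3): blend the primary
    filter output with the unfiltered cepstrum using alpha.
    """
    mu_y = y.mean(axis=0)                                   # per-dimension mean
    y_flt = np.sqrt(var_tar / var_con) * (y - mu_y) + mu_y  # formula (2)
    return alpha * y + (1.0 - alpha) * y_flt                # formula (3)
```

With `alpha = 0` the output variance of each dimension is scaled by exactly `var_tar / var_con`; with `alpha = 1` the input passes through unchanged.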
Step 4, after filtering the voice to be converted, extracting the characteristic parameters of the voice to be converted, wherein the characteristic parameters of the voice to be converted comprise logarithmic fundamental frequency, vocal tract spectrum and non-periodic components, and the characteristic parameters of the voice to be converted are preprocessed in the specific preprocessing modes that:
the non-periodic components remain unchanged;
The fundamental frequency F0 is log-linearly transformed according to formula (4):

p_t^(Y) = (σ^(Y) / σ^(X)) · (p_t^(X) − μ^(X)) + μ^(Y)   (4)

wherein p_t^(Y) and p_t^(X) denote the converted and the original log F0 respectively, μ^(X) and μ^(Y) are the means, and σ^(X) and σ^(Y) the standard deviations, of the log fundamental frequencies of the source and target speech computed in step 1.
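Formula (4) can be applied frame-by-frame as below. Skipping unvoiced frames (F0 = 0) is an assumption, since the patent does not say how they are handled:

```python
import numpy as np

def convert_lf0(lf0_src, mu_x, sigma_x, mu_y, sigma_y):
    """Log-F0 linear transform of formula (4): shift and scale the source
    log-F0 statistics to match the target's. Frames with lf0 <= 0 are
    treated as unvoiced and left at zero (an assumption)."""
    lf0 = np.asarray(lf0_src, dtype=float)
    out = np.zeros_like(lf0)
    voiced = lf0 > 0
    out[voiced] = (sigma_y / sigma_x) * (lf0[voiced] - mu_x) + mu_y
    return out
```

A source frame sitting exactly at the source mean maps to the target mean, and deviations are scaled by the ratio of standard deviations.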
The vocal tract spectrum is widened and frequency-divided to obtain a high-frequency part and a full-band part; the two parts are converted with the frequency-division BLSTM network model; the converted full-band part is frequency-divided again to obtain a high-frequency and a low-frequency vocal tract spectrum; the vocal tract spectra of the different bands are fused by the fusion model; and the fused converted vocal tract spectrum features are fed into the global statistical variance consistency filtering model obtained in step 3 for filtering, yielding the converted and filtered vocal tract spectrum features.
In a further embodiment, the specific method for widening and frequency-dividing the vocal tract spectrum of the speech to be converted is as follows:
In traditional voice conversion models, to compress the model and reduce training time, the vocal tract spectrum is converted into a 24- or 39-dimensional Mel cepstrum. The invention experimented with voice conversion at such low dimensions and found that, while the conversion quality can be maintained, the result hardly resists noise interference. Therefore, when extracting vocal tract spectrum parameters such as the Mel cepstrum, the parameters are set so as to directly obtain a widened 129-dimensional high-dimensional Mel cepstrum. The high-dimensional representation not only retains more information, but also helps avoid problems such as over-fitting when training on small-sample corpora.
The vocal tract spectrum of the speech to be converted is widened and frequency-divided to obtain a high-frequency part and a full-band part; the two parts are converted with the frequency-division BLSTM network model, and the converted full-band part is frequency-divided again to obtain a high-frequency and a low-frequency vocal tract spectrum.
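The frequency division of the 129-dimensional Mel cepstrum can be sketched as a split along the coefficient axis. The split index `CUT` is an assumption; the patent does not state where the band boundary lies:

```python
import numpy as np

CUT = 40  # hypothetical boundary between the low and high parts

def divide(mcep):
    """Split a (frames x 129) widened Mel cepstrum into its low and high
    parts at a fixed index (sketch; the real boundary is not given)."""
    return mcep[:, :CUT], mcep[:, CUT:]

def prepare_branch_inputs(mcep):
    """Per the text, the two frequency-division BLSTM branches consume
    the high-frequency part and the full-band spectrum respectively."""
    _, high = divide(mcep)
    return high, mcep
```

After conversion, `divide` is applied again to the converted full-band output to obtain the high- and low-frequency vocal tract spectra that enter the fusion model.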
In a further embodiment, as shown in fig. 2, the obtained vocal tract spectrums of different frequency bands are fused by a fusion model, and the specific fusion parameters involved in the fusion process are extracted as follows:
extracting fusion coefficients
Counting the high-frequency vocal tract spectrum and the low-frequency vocal tract spectrum in the vocal tract spectrum information in the training stage, and then calculating according to a formula (5) to obtain a fusion coefficient a
The fusion coefficient obtained in a in the above formula, mcepHeight of and mcepIs low inNamely the parameter information of each partial frequency band obtained after frequency division statistics.
Weight fusion:
weight fusion is performed according to formula (6) to obtain the final converted vocal tract spectrum:

MC_high-fuse = a·mcep_high1 + (1 − a)·mcep_high2
MC_low = mcep_low1
MC_full-fuse = [mcep_low1, MC_high-fuse]   (6)

wherein a is the fusion coefficient, mcep_high1 and mcep_low1 are the high-frequency and low-frequency parts of the converted full-band vocal tract spectrum, and mcep_high2 is the converted high-frequency part of the vocal tract spectrum.
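The weight fusion of formula (6) blends the two converted high-band spectra and concatenates the result with the low band kept from the full-band branch. A minimal sketch, assuming `[· , ·]` in formula (6) denotes concatenation along the coefficient axis:

```python
import numpy as np

def fuse(mcep_low1, mcep_high1, mcep_high2, a):
    """Weight fusion of formula (6).

    mcep_low1  : low-frequency part of the converted full-band spectrum
    mcep_high1 : high-frequency part of the converted full-band spectrum
    mcep_high2 : output of the dedicated high-frequency branch
    a          : fusion coefficient obtained from training-stage statistics
    """
    mc_high_fuse = a * mcep_high1 + (1.0 - a) * mcep_high2
    # concatenate low band and fused high band back into one spectrum
    return np.concatenate([mcep_low1, mc_high_fuse], axis=-1)
```

With `a = 0.5` the fused high band is simply the average of the two branch outputs, and the output width equals the sum of the low- and high-band widths.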
Fig. 3 shows the specific internal structure of the frequency-division weight fusion network. The main difference between BLSTM1 and BLSTM2 is that BLSTM2 adds a dropout layer to prevent over-fitting caused by the smaller amount of information in the high-frequency band.
Step 5: performing parameterized speech synthesis on the aperiodic component preprocessed in step 4, the converted and filtered vocal tract spectrum, and the log-linearly converted fundamental frequency, generating the final converted speech.
Claims (8)
1. A BLSTM-based frequency division spectrum expansion anti-noise voice conversion method, characterized by comprising the following specific steps:
step 1: filtering the source speech and the target speech and extracting the speech feature parameters, namely the fundamental frequency, the vocal tract spectrum and the aperiodic component; aligning the extracted source and target vocal tract spectra by dynamic time warping;
step 2: inputting the aligned source and target vocal tract spectra into the frequency-division BLSTM network model for training to obtain the corresponding feature conversion networks;
step 3: constructing a global statistical variance consistency filtering model;
step 4: filtering the speech to be converted, extracting its feature parameters, and preprocessing the feature parameters;
step 5: performing parameterized speech synthesis on the preprocessed feature parameters to generate the final converted speech.
2. The BLSTM-based frequency division spectrum expansion anti-noise voice conversion method according to claim 1, wherein the extracted vocal tract spectrum feature is the Mel cepstrum.
3. The BLSTM-based frequency division spectrum anti-noise voice conversion method according to claim 1, wherein the BLSTM network model of the input frequency division conversion comprises two BLSTM networks with the same structure, each of the two BLSTM networks consists of 3 hidden layers, and the number of hidden nodes of the three layers is respectively: 128, 256, 128, one of the BLSTM networks has no dropout layer, and the other BLSTM network has a dropout layer parameter of 0.5.
4. The BLSTM-based frequency division spectrum anti-noise voice conversion method according to claim 1, wherein the specific method for constructing the global statistical variance consistency filtering model is as follows:
step 3-1: calculating the mean and variance of each dimension of the Mel cepstrum of the target sentences;
step 3-2: calculating the mean and variance of each dimension of the Mel cepstrum over all frames of the sentences obtained after converting the source speech with the frequency-division BLSTM network model;
step 3-3: constructing a global statistical variance consistency primary filter, specifically:

ỹ = sqrt(σ²_tar / σ²_con) · (y − μ_y) + μ_y

wherein σ²_tar is the vector formed by the per-dimension Mel cepstrum variances of the target speech, σ²_con is the vector formed by the per-dimension Mel cepstrum variances of the sentences obtained after converting the source speech with the frequency-division BLSTM network model, y is the Mel cepstrum of the sentence to be converted, and μ_y is the vector formed by the per-dimension Mel cepstrum means over all frames of the source sentence to be converted at the testing stage;
step 3-4: setting an adjustment parameter to obtain the adjusted global statistical variance consistency filter, specifically:

ŷ = α·y + (1 − α)·ỹ

wherein ŷ is the Mel cepstrum obtained after filtering and α is the adjustment parameter.
5. The BLSTM-based frequency division spectrum anti-noise voice conversion method according to claim 4, wherein the specific formula for calculating the per-dimension mean and variance of the Mel cepstrum of the target speech cepstrum coefficients is:

μ_tar,i = (1 / (N·M)) · Σ_{n=1..N} Σ_{m=1..M} x_i(n, m)
σ²_tar,i = (1 / (N·M)) · Σ_{n=1..N} Σ_{m=1..M} (x_i(n, m) − μ_tar,i)²,  i = 1, …, T

wherein N represents the number of target sentences in the training stage, M the number of frames contained in each sentence, T the dimension of the Mel cepstrum, i the index of the Mel cepstrum dimension, μ_tar,i and σ²_tar,i the per-dimension Mel cepstrum mean and variance obtained from all frames of all training sentences, and x_i(n, m) the i-th dimension of the Mel cepstrum of frame m of sentence n.
6. The BLSTM-based frequency division spectrum anti-noise voice conversion method according to claim 1, wherein the specific method for preprocessing the characteristic parameters of the voice to be converted is as follows:
the non-periodic components remain unchanged;
carrying out logarithmic linear transformation on the fundamental frequency;
widening and frequency-dividing the vocal tract spectrum to obtain a high-frequency part and a full-band part; converting the two parts with the frequency-division BLSTM network model; frequency-dividing the converted full-band part again to obtain a high-frequency and a low-frequency vocal tract spectrum; fusing the vocal tract spectra of the different bands by the fusion model; and feeding the fused converted vocal tract spectrum features into the global statistical variance consistency filtering model obtained in step 3 for filtering, yielding the converted and filtered vocal tract spectrum features.
7. The BLSTM-based frequency division spectrum anti-noise voice conversion method according to claim 6, wherein the fusion model is specifically:

MC_high-fuse = a·mcep_high1 + (1 − a)·mcep_high2
MC_low = mcep_low1
MC_full-fuse = [mcep_low1, MC_high-fuse]

wherein a is the fusion coefficient, mcep_high1 and mcep_low1 are the high-frequency and low-frequency parts of the converted full-band vocal tract spectrum, and mcep_high2 is the converted high-frequency part of the vocal tract spectrum.
8. The BLSTM-based frequency division spectrum anti-noise voice conversion method according to claim 6, wherein the fusion coefficient a is obtained from the per-band parameter information mcep_high and mcep_low computed by frequency-division statistics.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011288173.XA CN112562704B (en) | 2020-11-17 | 2020-11-17 | Frequency division topological anti-noise voice conversion method based on BLSTM |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112562704A true CN112562704A (en) | 2021-03-26 |
CN112562704B CN112562704B (en) | 2023-08-18 |
Family
ID=75043062
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011288173.XA Active CN112562704B (en) | 2020-11-17 | 2020-11-17 | Frequency division topological anti-noise voice conversion method based on BLSTM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112562704B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20120077527A (en) * | 2010-12-30 | 2012-07-10 | Pusan National University Industry-University Cooperation Foundation | Apparatus and method for feature compensation using weighted auto-regressive moving average filter and global cepstral mean and variance normalization
CN104658547A (en) * | 2013-11-20 | 2015-05-27 | 大连佑嘉软件科技有限公司 | Method for expanding artificial voice bandwidth |
US20160322055A1 (en) * | 2015-03-27 | 2016-11-03 | Google Inc. | Processing multi-channel audio waveforms |
US20180012613A1 (en) * | 2016-07-11 | 2018-01-11 | The Chinese University Of Hong Kong | Phonetic posteriorgrams for many-to-one voice conversion |
CN108986834A (en) * | 2018-08-22 | 2018-12-11 | 中国人民解放军陆军工程大学 | The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network |
US10249314B1 (en) * | 2016-07-21 | 2019-04-02 | Oben, Inc. | Voice conversion system and method with variance and spectrum compensation |
CN109767778A (en) * | 2018-12-27 | 2019-05-17 | 中国人民解放军陆军工程大学 | A kind of phonetics transfer method merging Bi-LSTM and WaveNet |
CN110473564A (en) * | 2019-07-10 | 2019-11-19 | 西北工业大学深圳研究院 | A kind of multi-channel speech enhancement method based on depth Wave beam forming |
CN110648680A (en) * | 2019-09-23 | 2020-01-03 | 腾讯科技(深圳)有限公司 | Voice data processing method and device, electronic equipment and readable storage medium |
US10726830B1 (en) * | 2018-09-27 | 2020-07-28 | Amazon Technologies, Inc. | Deep multi-channel acoustic modeling |
Non-Patent Citations (5)
Title |
---|
BAJIBABU BOLLEPALLI: "Normal-to-Lombard adaptation of speech synthesis using long short-term memory recurrent neural networks", SPEECH COMMUNICATION * |
ZHANG Xiongwei: "Research progress and prospects of speech dereverberation technology", Journal of Data Acquisition and Processing, pages 1069 - 1081 *
ZHANG Xiongwei: "Research status and prospects of voice conversion technology", Journal of Data Acquisition and Processing, pages 753 - 770 *
ZENG Xin; ZHANG Xiongwei; SUN Meng; MIAO Xiaokong; YAO Kun: "Vocal tract spectrum conversion based on GMM model and LPC-MFCC joint features", Technical Acoustics, no. 04 *
MIAO Xiaokong, ZHANG Xiongwei, SUN Meng: "Research on a voice conversion method based on BLSTM with fundamental frequency (F0) fusion transformation", SCIENCE DISCOVERY 2018; 6(4): 298-305, pages 298 - 305 *
Also Published As
Publication number | Publication date |
---|---|
CN112562704B (en) | 2023-08-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Qian et al. | Very deep convolutional neural networks for noise robust speech recognition | |
CN109767778B (en) | Bi-LSTM and WaveNet fused voice conversion method | |
Hu et al. | Pitch‐based gender identification with two‐stage classification | |
Jiang et al. | Geometric methods for spectral analysis | |
CN110648684B (en) | Bone conduction voice enhancement waveform generation method based on WaveNet | |
CN113314140A (en) | Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network | |
CN112259119B (en) | Music source separation method based on stacked hourglass network | |
CN107564543A (en) | A kind of Speech Feature Extraction of high touch discrimination | |
CN111583957B (en) | Drama classification method based on five-tone music rhythm spectrogram and cascade neural network | |
CN103345920B (en) | Self-adaptation interpolation weighted spectrum model voice conversion and reconstructing method based on Mel-KSVD sparse representation | |
CN112382308A (en) | Zero-order voice conversion system and method based on deep learning and simple acoustic features | |
CN105679321A (en) | Speech recognition method and device and terminal | |
CN114141237A (en) | Speech recognition method, speech recognition device, computer equipment and storage medium | |
CN101178895A (en) | Model self-adapting method based on generating parameter listen-feel error minimize | |
CN114495969A (en) | Voice recognition method integrating voice enhancement | |
CN112562704A (en) | BLSTM-based frequency division spectrum expansion anti-noise voice conversion method | |
CN116364096B (en) | Electroencephalogram signal voice decoding method based on generation countermeasure network | |
CN116013339A (en) | Single-channel voice enhancement method based on improved CRN | |
Alku et al. | Linear predictive method for improved spectral modeling of lower frequencies of speech with small prediction orders | |
CN110619886B (en) | End-to-end voice enhancement method for low-resource Tujia language | |
Zheng et al. | Bandwidth extension WaveNet for bone-conducted speech enhancement | |
CN112992157A (en) | Neural network noisy line identification method based on residual error and batch normalization | |
Gonzales et al. | Voice conversion of philippine spoken languages using deep neural networks | |
Kumar et al. | Comparative Analysis of Features In a Speech Emotion Recognition System using Convolutional Neural Networks | |
Liao et al. | Acoustic Model for Sichuan Dialect Speech Recognition Based on Deep Neural Network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||