CN112562704B - Frequency division topological anti-noise voice conversion method based on BLSTM - Google Patents

Frequency division topological anti-noise voice conversion method based on BLSTM

Info

Publication number
CN112562704B
Authority
CN
China
Prior art keywords
voice
frequency
converted
blstm
conversion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011288173.XA
Other languages
Chinese (zh)
Other versions
CN112562704A (en)
Inventor
Sun Meng (孙蒙)
Miao Xiaokong (苗晓孔)
Zhang Xiongwei (张雄伟)
Cao Tieyong (曹铁勇)
Zheng Changyan (郑昌艳)
Li Li (李莉)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA
Priority to CN202011288173.XA
Publication of CN112562704A
Application granted
Publication of CN112562704B
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013: Adapting to target pitch
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique, using neural networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013: Adapting to target pitch
    • G10L2021/0135: Voice conversion or morphing
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks, in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a frequency division topological anti-noise voice conversion method based on BLSTM, comprising the following steps: filtering the source speech and the target speech and extracting speech feature parameters, namely the fundamental frequency, the vocal tract spectrum and the aperiodic components; aligning the extracted source and target vocal tract spectra by dynamic time warping; feeding the aligned source and target vocal tract spectra respectively into a frequency-division BLSTM network model for training to obtain the corresponding feature conversion network; constructing a global statistical variance consistency filtering model; filtering the speech to be converted, then extracting and preprocessing its feature parameters; and performing parameterized speech synthesis on the preprocessed feature parameters to generate the final converted speech. The invention designs a new fusion rule that fuses the frequency-division-converted parts to obtain a vocal tract spectrum closer to the target, thereby improving the similarity of the converted speech.

Description

Frequency division topological anti-noise voice conversion method based on BLSTM
Technical Field
The invention belongs to the field of speech signal processing, and particularly relates to a frequency division topological anti-noise voice conversion method based on BLSTM.
Background
Speech conversion refers to changing the voice individuality of one speaker (the source speaker) so that it carries the voice individuality of another speaker (the target speaker); it is a speech-to-speech technique. Speech conversion falls into two categories: non-specific-person conversion, which merely alters the source speaker's voice so that the listener cannot identify the speaker, and specific-person conversion, which converts the source speaker's voice into that of a particular target person, e.g. to assume the target's identity. Specific-person voice conversion meets the technical demand for personalized speech generation and is one of the main hotspots of current research.
Specific-person speech conversion can further be divided into methods based on parallel corpora and methods based on non-parallel corpora. At present, the systems achieving the highest conversion quality and similarity are generally based on parallel corpora. The state of research on this technology is briefly summarized as follows:
The earliest speech conversion work dates back to the 1950s and 1960s. Models have evolved from the classical Gaussian mixture model (GMM) to today's deep neural networks that efficiently represent high-dimensional sequence data, such as the fully convolutional network (FCN), generative adversarial networks (Kaneko T., Kameoka H., Hiramatsu K., Kashino K., Sequence-to-Sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks, Interspeech 2017; Kaneko T., Kameoka H., Hojo N., Kashino K., Generative Adversarial Network-based Postfilter for Statistical Parametric Speech Synthesis, ICASSP 2017), and the bidirectional long short-term memory network (Huang Z., Xu W. and Yu K., Bidirectional LSTM-CRF Models for Sequence Tagging, https://arxiv.org/abs/1508.01991, 2015). With the recurring international Voice Conversion Challenge (VCC), conversion methods have steadily improved in recent years, further raising the quality and similarity of converted speech. Although these schemes are reasonable and effective and achieve good conversion results, most of them operate under laboratory conditions: the larger the training sample and the cleaner the training corpus, the better the converted speech. For small-sample data and noisy speech data, the conversion performance of such models is limited and the quality of the converted speech drops sharply.
Disclosure of Invention
The invention aims to provide a frequency division topological anti-noise voice conversion method based on BLSTM.
The technical solution for realizing the purpose of the invention is as follows: a frequency division topological anti-noise voice conversion method based on BLSTM comprises the following specific steps:
Step 1: filter the source speech and the target speech and extract speech feature parameters, namely the fundamental frequency, the vocal tract spectrum and the aperiodic components; perform dynamic time warping alignment on the extracted vocal tract spectra of the source and target speech;
Step 2: feed the aligned source and target vocal tract spectra respectively into a frequency-division BLSTM network model for training to obtain the corresponding feature conversion network;
Step 3: construct a global statistical variance consistency filtering model;
Step 4: filter the speech to be converted, then extract and preprocess its feature parameters;
Step 5: perform parameterized speech synthesis on the preprocessed feature parameters of the speech to be converted to generate the final converted speech.
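For illustration, a minimal sketch of how steps 1 and 5 could be realized is given below. The WORLD vocoder (via the pyworld package) and librosa's DTW are assumptions made for the example; the method itself only prescribes which features are extracted, aligned and re-synthesized, not a particular toolkit.

import numpy as np
import pyworld
import librosa

def extract_features(wav, fs, mcep_dim=129):
    # Step 1: fundamental frequency, vocal tract spectrum and aperiodic
    # components; the 129-dimensional mel-cepstral encoding follows the text.
    wav = wav.astype(np.float64)
    f0, t = pyworld.harvest(wav, fs)                 # fundamental frequency
    sp = pyworld.cheaptrick(wav, f0, t, fs)          # vocal tract spectrum
    ap = pyworld.d4c(wav, f0, t, fs)                 # aperiodic components
    mcep = pyworld.code_spectral_envelope(sp, fs, mcep_dim)
    return f0, mcep, ap

def dtw_align(mcep_src, mcep_tgt):
    # Step 1: dynamic time warping alignment of the two vocal tract spectra.
    _, path = librosa.sequence.dtw(mcep_src.T, mcep_tgt.T)
    path = path[::-1]                                # the path is returned end-first
    return mcep_src[path[:, 0]], mcep_tgt[path[:, 1]]

def synthesize(f0, mcep, ap, fs, fft_size=1024):
    # Step 5: parameterized speech synthesis from the converted parameters.
    sp = pyworld.decode_spectral_envelope(np.ascontiguousarray(mcep), fs, fft_size)
    return pyworld.synthesize(f0, sp, ap, fs)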
Preferably, the extracted vocal tract spectrum feature is the mel cepstrum.
Preferably, the frequency-division BLSTM network model comprises two BLSTM networks of identical structure, each composed of 3 hidden layers with 128, 256 and 128 hidden nodes respectively, wherein one BLSTM network has no dropout layer and the dropout parameter of the other BLSTM network is 0.5.
Preferably, the global statistical variance consistency filtering model is constructed as follows:
Step 3-1: calculate the per-dimension mean and variance of the mel cepstra of the target sentences;
Step 3-2: calculate the per-dimension mean and variance of the mel cepstra over all frames of the sentences obtained by converting the source speech with the frequency-division BLSTM network model;
Step 3-3: construct the primary global statistical variance consistency filter:

ỹ = sqrt(σ²_tar / σ²_con) · (y - u_y) + u_y

where σ²_tar is the vector formed by the per-dimension mel-cepstral variances of the target speech, σ²_con is the vector formed by the per-dimension mel-cepstral variances of the sentences obtained by converting the source speech with the frequency-division BLSTM network model, y is the mel cepstrum of the sentence to be converted, u_y is the vector formed by the per-dimension mel-cepstral means over all frames of the source sentence to be converted in the test stage, and all operations are element-wise;
Step 3-4: set an adjustment parameter to obtain the adjusted global statistical variance consistency filter:

ŷ = α·y + (1 - α)·ỹ

where ŷ is the mel cepstrum obtained after filtering, y is the mel cepstrum of the sentence to be converted, and α is the adjustment parameter.
Preferably, the per-dimension mean and variance of the target speech mel cepstra are calculated as:

u_tar(i) = (1/(N·M)) · Σ_{n=1..N} Σ_{m=1..M} x_i(n,m)
σ²_tar(i) = (1/(N·M)) · Σ_{n=1..N} Σ_{m=1..M} (x_i(n,m) - u_tar(i))²,  i = 1, …, T

where N is the number of target sentences in the training stage, M is the number of frames contained in each sentence, T is the dimension of the mel cepstrum, i is the index of the mel-cepstral dimension, u_tar(i) and σ²_tar(i) are the mean and variance of the i-th mel-cepstral dimension over all frames of all training sentences, and x_i(n,m) is the i-th mel-cepstral coefficient of frame m of sentence n.
Preferably, the feature parameters of the speech to be converted are preprocessed as follows:
the aperiodic components remain unchanged;
the fundamental frequency undergoes a log-linear transformation;
the vocal tract spectrum is widened and divided to obtain a high-frequency part and a full-band part; both parts are converted with the frequency-division BLSTM network model; the converted full-band part is divided again into a high-frequency and a low-frequency vocal tract spectrum; the vocal tract spectra of the different bands are fused by a fusion model; and the fused, converted vocal tract spectrum features are sent to the global statistical variance consistency filtering model obtained in step 3 for filtering, yielding the converted and filtered vocal tract spectrum features.
Preferably, the fusion model is:

MC_high_fused = a·mcep_high1 + (1 - a)·mcep_high2
MC_low = mcep_low1
MC_fused = [mcep_low1 ; MC_high_fused]

where a is the fusion coefficient, mcep_high1 is the high-frequency part of the converted full-band vocal tract spectrum, mcep_low1 is the low-frequency part of the converted full-band vocal tract spectrum, mcep_high2 is the separately converted high-frequency vocal tract spectrum, and [ ; ] denotes concatenation along the feature dimension.
Preferably, the fusion coefficient a is computed from mcep_high and mcep_low, the per-band parameter information obtained by frequency-division statistics in the training stage.
Compared with the prior art, the invention has the following notable advantages: 1) two newly designed filtering modules are introduced: the training data are filtered before feature extraction, and global statistical variance consistency filtering is applied to the vocal tract spectra after feature conversion, so that noise is suppressed jointly in time and frequency and the quality of the converted speech is improved; 2) the dimensionality of the vocal tract spectrum features is expanded and a high-dimensional vocal tract spectrum is extracted; the BLSTM network is improved, and two differently configured BLSTM networks realize frequency-division conversion and fusion of the high-dimensional vocal tract spectrum, alleviating the over-fitting and under-fitting caused by small-sample data and improving the conversion accuracy and adaptability of the model; 3) a new fusion rule is designed: the fusion rule obtained by statistics in the training stage is applied during conversion to fuse the frequency-division-converted parts, yielding a vocal tract spectrum closer to the target and thereby improving the similarity of the converted speech.
The present invention will be described in further detail with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow chart of a frequency-divided topology anti-noise speech conversion method based on BLSTM.
Fig. 2 is a flow chart of spectral broadening, frequency division, and weight fusion.
Fig. 3 is a schematic diagram of a frequency-divided converted BLSTM network.
Detailed Description
As shown in FIG. 1, the frequency division topological anti-noise voice conversion method based on BLSTM comprises the following specific steps:
Step 1: filter the source and target speech used for training to remove part of the noise, and extract the speech feature parameters; perform dynamic time warping alignment on the extracted source and target vocal tract spectra, and compute the mean u and variance σ² of the logarithmic fundamental frequency (log F0) of the source and target speech respectively, for use in the log-linear fundamental frequency conversion of step 4;
The speech feature parameters comprise the fundamental frequency, the vocal tract spectrum (the invention mainly adopts the mel cepstrum) and the aperiodic components.
Step 2: feed the aligned source and target vocal tract spectra into the frequency-division BLSTM network model of FIG. 3 for training, obtaining the corresponding feature conversion network;
The frequency-division BLSTM network consists of two BLSTM networks of identical structure, BLSTM1 and BLSTM2. Both consist of 3 hidden layers with 128, 256 and 128 hidden nodes respectively. BLSTM1 has no dropout layer, and the dropout parameter of BLSTM2 is 0.5.
Step 3: compute the mean and variance of the target vocal tract spectrum features and of the converted vocal tract spectrum features, and construct the global statistical variance consistency filtering model, which is applied to the vocal tract spectrum of the speech to be converted. The specific procedure is as follows:
Step 3-1: calculate the per-dimension mean and variance of the target speech mel cepstra according to formula (1):

u_tar(i) = (1/(N·M)) · Σ_{n=1..N} Σ_{m=1..M} x_i(n,m)
σ²_tar(i) = (1/(N·M)) · Σ_{n=1..N} Σ_{m=1..M} (x_i(n,m) - u_tar(i))²,  i = 1, …, T    (1)

where N is the number of target sentences in the training stage, M is the number of frames contained in each sentence, T is the dimension of the mel cepstrum, i is the index of the mel-cepstral dimension, u_tar(i) and σ²_tar(i) are the per-dimension mel-cepstral mean and variance over all frames of all training sentences, x_i(n,m) is the i-th mel-cepstral coefficient of frame m of sentence n, and the subscript tar abbreviates the conversion target.
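In code, formula (1) amounts to stacking the frames of all training sentences and taking per-dimension statistics; a minimal numpy sketch (array shapes are assumptions):

import numpy as np

def dimwise_stats(sentences):
    """sentences: list of (frames, T) mel-cepstrum arrays of one speaker."""
    frames = np.concatenate(sentences, axis=0)        # stack all frames of all sentences
    return frames.mean(axis=0), frames.var(axis=0)    # u(i) and sigma^2(i) per dimension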
Step 3-2: using formula (1), calculate the vector u_con of per-dimension mel-cepstral means and the vector σ²_con of per-dimension variances over all frames of the sentences obtained by converting the source speech of the training data with the frequency-division BLSTM network model; likewise calculate, in the test stage, the vector u_y of per-dimension mel-cepstral means and the vector σ²_y of per-dimension variances over all frames of the source sentence to be converted, where the subscript con abbreviates conversion.
Step 3-3: construct the primary global statistical variance consistency filter to obtain the primarily filtered data ỹ:

ỹ = sqrt(σ²_tar / σ²_con) · (y - u_y) + u_y    (2)

In formula (2), σ²_tar is the vector formed by the per-dimension mel-cepstral variances of the target speaker's sentences in the training set, σ²_con is the vector formed by the per-dimension mel-cepstral variances of the sentences obtained by converting the source speech of the training set with the frequency-division BLSTM network model, y is the mel cepstrum of the sentence to be converted, and all operations are element-wise.
Step 3-4: set a parameter α in formula (3) (α is usually tuned to the actual effect; it is set to 0.2 in the experiments) and adjust the filter to obtain the global statistical variance consistency filter:

ŷ = α·y + (1 - α)·ỹ    (3)

where ŷ is the final mel cepstrum obtained after filtering, which is used to generate the converted speech.
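A numpy sketch of the primary and adjusted filters of formulas (2) and (3); the direction of the α-blend is an assumption consistent with the experimental setting α = 0.2:

import numpy as np

def gv_consistency_filter(y, var_tar, var_con, alpha=0.2):
    """y: converted mel cepstrum of one sentence, shape (frames, T)."""
    u_y = y.mean(axis=0)                          # per-dimension sentence mean
    scale = np.sqrt(var_tar / var_con)            # per-dimension std-dev ratio
    y_primary = scale * (y - u_y) + u_y           # primary filter, formula (2)
    return alpha * y + (1.0 - alpha) * y_primary  # adjusted filter, formula (3)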
Step 4: filter the speech to be converted, then extract its feature parameters, comprising the logarithmic fundamental frequency, the vocal tract spectrum and the aperiodic components, and preprocess them as follows:
The aperiodic components remain unchanged.
The fundamental frequency F0 undergoes the log-linear transformation of formula (4):

p_t^(Y) = (σ^(Y) / σ^(X)) · (p_t^(X) - u^(X)) + u^(Y)    (4)

where p_t^(Y) and p_t^(X) denote the converted and the original log F0 respectively, u^(X) and u^(Y) are the means of the logarithmic fundamental frequencies of the source and target speech computed in step 1, and σ^(X) and σ^(Y) are the corresponding standard deviations.
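A minimal numpy sketch of the log-linear conversion of formula (4); leaving unvoiced frames (F0 = 0) untouched is an added assumption, since the formula applies to voiced frames only:

import numpy as np

def convert_f0(f0, u_src, sigma_src, u_tgt, sigma_tgt):
    f0_conv = np.zeros_like(f0)
    voiced = f0 > 0
    log_f0 = np.log(f0[voiced])
    f0_conv[voiced] = np.exp((sigma_tgt / sigma_src) * (log_f0 - u_src) + u_tgt)
    return f0_conv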
The vocal tract spectrum is widened and divided to obtain a high-frequency part and a full-band part; both parts are converted with the frequency-division BLSTM network model; the converted full-band part is divided again into a high-frequency and a low-frequency vocal tract spectrum; the vocal tract spectra of the different bands are fused by the fusion model; and the fused, converted vocal tract spectrum features are sent to the global statistical variance consistency filtering model obtained in step 3 for filtering, yielding the converted and filtered vocal tract spectrum features.
In a further embodiment, the vocal tract spectrum of the speech to be converted is widened and divided as follows:
In traditional voice conversion models, to compress the model and shorten data training time, the vocal tract spectrum is reduced to a 24- or 39-dimensional mel cepstrum. Such low-dimensional conversion struggles to resist even partial noise interference while maintaining conversion quality. Therefore, when extracting vocal tract parameters such as the mel cepstrum, the parameters are set so that a widened, 129-dimensional high-dimensional mel cepstrum is obtained directly. The high-dimensional representation not only retains more information parameters, but also helps counter over-fitting and related problems when training on small-sample corpora.
The vocal tract spectrum of the speech to be converted is widened and divided to obtain a high-frequency part and a full-band part; both are converted with the frequency-division BLSTM network model, and the converted full-band part is divided again into a high-frequency and a low-frequency vocal tract spectrum. The full-band vocal tract spectrum is divided into its high-frequency and low-frequency parts by splitting at the middle dimension, as in the sketch below.
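A sketch of the division step; the exact split point (65 low-frequency plus 64 high-frequency dimensions of the 129-dimensional mel cepstrum) is an assumption based on the description of splitting at the middle dimension:

def split_bands(mcep, split=65):
    low = mcep[:, :split]     # low-frequency vocal tract spectrum
    high = mcep[:, split:]    # high-frequency vocal tract spectrum
    return low, high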
In a further embodiment, as shown in FIG. 2, the vocal tract spectra of the different bands are fused by the fusion model. The fusion parameters involved in the fusion process are extracted as follows:
Extraction of the fusion coefficient:
The high-frequency and low-frequency vocal tract spectrum parts of the training-stage vocal tract spectrum information are tallied, and the fusion coefficient a is then computed according to formula (5), in which mcep_high and mcep_low denote the per-band parameter information obtained by frequency-division statistics.
Conversion weight fusion
Weight fusion is performed according to formula (6) to obtain the final converted vocal tract spectrum:

MC_high_fused = a·mcep_high1 + (1 - a)·mcep_high2
MC_low = mcep_low1
MC_fused = [mcep_low1 ; MC_high_fused]    (6)

where a is the fusion coefficient, mcep_high1 is the high-frequency part of the converted full-band vocal tract spectrum, mcep_low1 is its low-frequency part, mcep_high2 is the separately converted high-frequency vocal tract spectrum, and [ ; ] denotes concatenation along the feature dimension.
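A numpy sketch of the weight fusion of formula (6); the fusion coefficient a, obtained from the training-stage band statistics of formula (5), is passed in as a precomputed parameter:

import numpy as np

def fuse(mcep_full_conv, mcep_high_conv, a, split=65):
    low1 = mcep_full_conv[:, :split]     # low band of the full-band conversion
    high1 = mcep_full_conv[:, split:]    # high band of the full-band conversion
    high_fused = a * high1 + (1.0 - a) * mcep_high_conv
    return np.concatenate([low1, high_fused], axis=1)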
FIG. 3 shows the specific internal structure of the frequency-division weight fusion network. BLSTM1 and BLSTM2 differ mainly in that BLSTM2 adds a dropout layer, preventing the over-fitting that the partial information of the high frequency band would otherwise cause.
Step 5: perform parameterized speech synthesis on the aperiodic components preprocessed in step 4, the converted and filtered vocal tract spectrum, and the log-linearly converted fundamental frequency, generating the final converted speech.

Claims (7)

1. A frequency division topological anti-noise voice conversion method based on BLSTM is characterized by comprising the following specific steps:
step 1: filtering the source speech and the target speech and extracting speech feature parameters, wherein the speech feature parameters comprise the fundamental frequency, the vocal tract spectrum and the aperiodic components; performing dynamic time warping alignment on the extracted vocal tract spectra of the source and target speech;
step 2: feeding the aligned source and target vocal tract spectra respectively into a frequency-division BLSTM network model for training to obtain the corresponding feature conversion network;
step 3: constructing a global statistical variance consistency filtering model, specifically:
step 3-1: calculating the per-dimension mean and variance of the mel cepstra of the target sentences;
step 3-2: calculating the per-dimension mean and variance of the mel cepstra over all frames of the sentences obtained by converting the source speech with the frequency-division BLSTM network model;
step 3-3: constructing the primary global statistical variance consistency filter:

ỹ = sqrt(σ²_tar / σ²_con) · (y - u_y) + u_y

wherein σ²_tar is the vector formed by the per-dimension mel-cepstral variances of the target speech, σ²_con is the vector formed by the per-dimension mel-cepstral variances of the sentences obtained by converting the source speech with the frequency-division BLSTM network model, y is the mel cepstrum of the sentence to be converted, u_y is the vector formed by the per-dimension mel-cepstral means over all frames of the source sentence to be converted in the test stage, and all operations are element-wise;
step 3-4: setting an adjustment parameter to obtain the adjusted global statistical variance consistency filter:

ŷ = α·y + (1 - α)·ỹ

wherein ŷ is the mel cepstrum obtained after filtering, y is the mel cepstrum of the sentence to be converted, and α is the adjustment parameter;
step 4: filtering the speech to be converted, then extracting and preprocessing its feature parameters;
step 5: performing parameterized speech synthesis on the preprocessed feature parameters of the speech to be converted to generate the final converted speech.
2. The frequency division topological anti-noise voice conversion method based on BLSTM according to claim 1, wherein the extracted vocal tract spectrum feature is the mel cepstrum.
3. The frequency division topological anti-noise voice conversion method based on BLSTM according to claim 1, wherein the frequency-division BLSTM network model comprises two BLSTM networks of identical structure, each composed of 3 hidden layers with 128, 256 and 128 hidden nodes respectively, wherein one BLSTM network has no dropout layer and the dropout parameter of the other BLSTM network is 0.5.
4. The frequency division topological anti-noise voice conversion method based on BLSTM according to claim 1, wherein the per-dimension mean and variance of the target speech mel cepstra are calculated as:

u_tar(i) = (1/(N·M)) · Σ_{n=1..N} Σ_{m=1..M} x_i(n,m)
σ²_tar(i) = (1/(N·M)) · Σ_{n=1..N} Σ_{m=1..M} (x_i(n,m) - u_tar(i))²,  i = 1, …, T

wherein N is the number of target sentences in the training stage, M is the number of frames contained in each sentence, T is the dimension of the mel cepstrum, i is the index of the mel-cepstral dimension, u_tar(i) and σ²_tar(i) are the mean and variance of the i-th mel-cepstral dimension over all frames of all training sentences, and x_i(n,m) is the i-th mel-cepstral coefficient of frame m of sentence n.
5. The frequency division topological anti-noise voice conversion method based on BLSTM according to claim 1, wherein the feature parameters of the speech to be converted are preprocessed as follows:
the aperiodic components remain unchanged;
the fundamental frequency undergoes a log-linear transformation;
the vocal tract spectrum is widened and divided to obtain a high-frequency part and a full-band part; both parts are converted with the frequency-division BLSTM network model; the converted full-band part is divided again into a high-frequency and a low-frequency vocal tract spectrum; the vocal tract spectra of the different bands are fused by a fusion model; and the fused, converted vocal tract spectrum features are sent to the global statistical variance consistency filtering model obtained in step 3 for filtering, yielding the converted and filtered vocal tract spectrum features.
6. The frequency division topological anti-noise voice conversion method based on BLSTM according to claim 5, wherein the fusion model is:

MC_high_fused = a·mcep_high1 + (1 - a)·mcep_high2
MC_low = mcep_low1
MC_fused = [mcep_low1 ; MC_high_fused]

wherein a is the fusion coefficient, mcep_high1 is the high-frequency part of the converted full-band vocal tract spectrum, mcep_low1 is the low-frequency part of the converted full-band vocal tract spectrum, mcep_high2 is the separately converted high-frequency vocal tract spectrum, and [ ; ] denotes concatenation along the feature dimension.
7. The frequency division topological anti-noise voice conversion method based on BLSTM according to claim 6, wherein the fusion coefficient a is computed from mcep_high and mcep_low, the per-band parameter information obtained by frequency-division statistics in the training stage.
CN202011288173.XA 2020-11-17 2020-11-17 Frequency division topological anti-noise voice conversion method based on BLSTM Active CN112562704B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011288173.XA CN112562704B (en) 2020-11-17 2020-11-17 Frequency division topological anti-noise voice conversion method based on BLSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011288173.XA CN112562704B (en) 2020-11-17 2020-11-17 Frequency division topological anti-noise voice conversion method based on BLSTM

Publications (2)

Publication Number Publication Date
CN112562704A CN112562704A (en) 2021-03-26
CN112562704B true CN112562704B (en) 2023-08-18

Family

ID=75043062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011288173.XA Active CN112562704B (en) 2020-11-17 2020-11-17 Frequency division topological anti-noise voice conversion method based on BLSTM

Country Status (1)

Country Link
CN (1) CN112562704B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9697826B2 (en) * 2015-03-27 2017-07-04 Google Inc. Processing multi-channel audio waveforms
US10176819B2 (en) * 2016-07-11 2019-01-08 The Chinese University Of Hong Kong Phonetic posteriorgrams for many-to-one voice conversion

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120077527A (en) * 2010-12-30 2012-07-10 부산대학교 산학협력단 Apparatus and method for feature compensation using weighted auto-regressive moving average filter and global cepstral mean and variance normalization
CN104658547A (en) * 2013-11-20 2015-05-27 大连佑嘉软件科技有限公司 Method for expanding artificial voice bandwidth
US10249314B1 (en) * 2016-07-21 2019-04-02 Oben, Inc. Voice conversion system and method with variance and spectrum compensation
CN108986834A (en) * 2018-08-22 2018-12-11 中国人民解放军陆军工程大学 The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network
US10726830B1 (en) * 2018-09-27 2020-07-28 Amazon Technologies, Inc. Deep multi-channel acoustic modeling
CN109767778A (en) * 2018-12-27 2019-05-17 中国人民解放军陆军工程大学 A kind of phonetics transfer method merging Bi-LSTM and WaveNet
CN110473564A (en) * 2019-07-10 2019-11-19 西北工业大学深圳研究院 A kind of multi-channel speech enhancement method based on depth Wave beam forming
CN110648680A (en) * 2019-09-23 2020-01-03 腾讯科技(深圳)有限公司 Voice data processing method and device, electronic equipment and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on vocal tract spectrum conversion based on GMM model and LPC-MFCC joint features; Zeng Xin; Zhang Xiongwei; Sun Meng; Miao Xiaokong; Yao Kun; Technical Acoustics (04); full text *

Also Published As

Publication number Publication date
CN112562704A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
Qian et al. Very deep convolutional neural networks for noise robust speech recognition
CN109767778B (en) Bi-L STM and WaveNet fused voice conversion method
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN106328123B (en) Method for recognizing middle ear voice in normal voice stream under condition of small database
Zhang et al. Adadurian: Few-shot adaptation for neural text-to-speech with durian
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
CN111210803A (en) System and method for training clone timbre and rhythm based on Bottleneck characteristics
CN107274887A (en) Speaker&#39;s Further Feature Extraction method based on fusion feature MGFCC
CN105679321B (en) Voice recognition method, device and terminal
CN112382308A (en) Zero-order voice conversion system and method based on deep learning and simple acoustic features
Liu et al. Emotional voice conversion with cycle-consistent adversarial network
CN101178895A (en) Model self-adapting method based on generating parameter listen-feel error minimize
CN114283822A (en) Many-to-one voice conversion method based on gamma pass frequency cepstrum coefficient
CN114141237A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
Jie Speech emotion recognition based on convolutional neural network
CN112562704B (en) Frequency division topological anti-noise voice conversion method based on BLSTM
CN116364096B (en) Electroencephalogram signal voice decoding method based on generation countermeasure network
Shen Application of transfer learning algorithm and real time speech detection in music education platform
Wu et al. Rules based feature modification for affective speaker recognition
Gonzales et al. Voice conversion of philippine spoken languages using deep neural networks
Liu et al. Audio bandwidth extension based on ensemble echo state networks with temporal evolution
Xie et al. End-to-End Voice Conversion with Information Perturbation
Akhter et al. An analysis of performance evaluation metrics for voice conversion models
Lv et al. Objective evaluation method of broadcasting vocal timbre based on feature selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant