CN102568476A - Voice conversion method based on self-organizing feature map network clustering and radial basis function network


Info

Publication number: CN102568476A (application); CN102568476B (granted)
Application number: CN2012100388747A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 解伟超 (Xie Weichao), 张玲华 (Zhang Linghua)
Assignee: Nanjing University of Posts and Telecommunications
Legal status: Granted; Expired - Fee Related
Prior art keywords: parameter, fundamental frequency, LSF parameter, cluster, conversion

Abstract

The invention discloses a voice conversion method based on self-organizing feature map (SOM) network clustering and a radial basis function (RBF) network, belonging to the technical field of speech signal processing. The method comprises the following steps: preprocessing, voiced/unvoiced decision, and feature parameter extraction; parameter clustering; establishment of spectral envelope conversion rules; establishment of fundamental frequency conversion rules; conversion of the feature parameters; and speech synthesis. In this method, the feature parameters of the source speech are divided into several clusters, the corresponding target feature parameters are divided into clusters in one-to-one correspondence with the source clusters, and a conversion rule is trained for each cluster. Partitioning the training data in this way reduces the time complexity of training and gives the converted speech good naturalness. When the feature parameters are converted, the fundamental frequency is coupled to the spectral envelope through an explicit conversion relation between the two, overcoming the drawbacks of converting the fundamental frequency in isolation, so that the converted fundamental frequency carries more of the target speaker's characteristics.

Description

Voice conversion method based on self-organizing feature map network clustering and radial basis function network
Technical field
The present invention relates to voice conversion techniques, and in particular to a voice conversion method based on self-organizing feature map (SOM) network clustering and a radial basis function (RBF) network, belonging to the field of speech signal processing.
Background technology
Voice conversion is a research branch of the speech signal processing field that has emerged in recent years. It builds on research in speaker recognition and speech synthesis, and both enriches and extends those two branches, while not being entirely subsumed by either.
The goal of voice conversion is to change the personal characteristics in a source speaker's speech, under the condition that the semantic information remains unchanged, so that the converted speech sounds like the target speaker's voice.
A voice conversion system has a training stage and a conversion stage. In the training stage, the system analyzes parameters extracted from the source and target speakers' speech and establishes conversion rules. In the conversion stage, features are first extracted from the source speech and then converted into target speech features according to the rules obtained during training.
The key problems in voice conversion are the extraction of speaker personal characteristics and the establishment of conversion rules. Two decades of development have produced a large body of results; current research on speech feature parameters mainly concerns spectral envelope parameters and the fundamental frequency. Existing methods for converting spectral envelope parameters include those based on the linear prediction coding model (Linear Prediction Coding, LPC), the Gaussian mixture model (Gaussian Mixture Model, GMM), and the harmonic plus noise model (Harmonic plus Noise Model, HNM). These methods, however, train directly on the extracted parameters and establish a single unified conversion rule. Because a speech signal is time-varying and non-stationary, and the amount of training data is large, a single rule cannot accurately describe the mapping between the feature parameters of the source speech and those of the target speech, and distortion inevitably results. (1. Zad-Issa, M.R., Kabal, P. Smoothing the Evolution of the Spectral Parameters in Linear Prediction of Speech using Target Matching. ICASSP, 1997: vol. 3, 1699-1702. 2. Daojian Zeng, Yibiao Yu. Voice Conversion using Structured Gaussian Mixture Model. ICSP, 2010: 541-544. 3. Hu, H.T., Yu, C., Lin, C.H. HNM parameter transform for voice conversion using a HMM-WDLT framework. ICIMA, 2010: vol. 2, 282-287.)
Existing methods for converting the fundamental frequency include the mean-shift method and the Gaussian model method. All of these methods convert the spectral envelope parameters and the fundamental frequency separately, with no connection between the two conversions. Yet both parameters come from the same speech signal, and a growing body of research shows a close relationship between them; converting the two kinds of parameters independently therefore inevitably affects the quality of the synthesized speech. (1. Lee, K.S., Doh, W., Youn, D.H. Voice conversion using low dimensional vector mapping. IEICE Transactions on Information & Systems, 2002, E85(D): 1297-1305. 2. L.M. Arslan. Speaker Transformation Algorithm using Segmental Codebooks (STASC). Speech Communication, Jul. 1999: vol. 28, no. 3, pp. 211-226.)
Summary of the invention
The object of the present invention is to provide a voice conversion method that, under the condition of parallel text, combines the time-domain characteristics of speech with the speaker's personal characteristics, obtains a more accurate conversion rule, strengthens the target speaker's characteristics in the converted speech, and improves its acoustic quality.
To achieve the above object, the present invention adopts the following technical scheme:
A voice conversion method based on self-organizing feature map (SOM) network clustering and a radial basis function (RBF) network, with the following concrete steps:
The first step: preprocessing, voiced/unvoiced decision, and feature parameter extraction. The input speech signal is pre-emphasized, divided into frames, and windowed; the short-time energy and average zero-crossing rate of each frame are computed to complete the voiced/unvoiced decision; the STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) model is then used to extract the LSF (Line Spectral Frequency) parameters and fundamental frequency of each voiced frame.
The second step: parameter clustering. The extracted source and target LSF parameters are first aligned by dynamic time warping; the SOM network then clusters the source LSF parameters in a self-organizing fashion, while the indices of the source LSF parameters in each class are recorded. The target LSF parameters corresponding to the source LSF parameters of a given cluster thereby also gather into one class, realizing the clustering of the target LSF parameters. Similarly, the alignment indices returned by the dynamic time warping determine the target fundamental frequency corresponding to each time-aligned target LSF parameter, so the recorded source LSF indices also realize the clustering of the target fundamental frequencies.
The third step: establishment of the spectral envelope conversion rules. For each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output to train an RBF (Radial Basis Function) network, establishing a spectral envelope conversion rule for each cluster.
The fourth step: establishment of the fundamental frequency conversion rules. For each cluster, the target LSF parameters are used as input and the corresponding fundamental frequencies as output to train an RBF network, establishing a fundamental frequency conversion rule for each cluster.
The fifth step: conversion of the feature parameters. Each voiced frame of the speech to be converted is first classified; the spectral envelope conversion rule of its class, obtained in the third step, converts its LSF parameters, and the fundamental frequency conversion rule of that class, obtained in the fourth step, then derives the converted fundamental frequency from the converted LSF parameters.
The sixth step: speech synthesis. The LSF parameters and fundamental frequencies obtained in the fourth and fifth steps are fed to the STRAIGHT model to produce the converted speech.
Compared with the prior art, the present invention has the following notable advantages: (1) Guided by the theory of speech feature parameter mapping, the source speech feature parameters are divided into several clusters, the corresponding target feature parameters are divided into classes in one-to-one correspondence with the source clusters, and a conversion rule is established for each cluster separately. Partitioning the training data in this way reduces the time complexity of training and, by exploiting the short-time quasi-periodicity of speech, allows each cluster's rule to reflect its mapping relationship more accurately, so that the converted speech has good naturalness. (2) When the feature parameters are converted, the fundamental frequency is linked to the spectral envelope through an explicit conversion relation between the two, overcoming the present drawback of converting the fundamental frequency in isolation and giving the converted fundamental frequency more of the target speaker's characteristics.
The present invention is described in further detail below with reference to the accompanying drawings.
Description of drawings
Fig. 1 is a schematic diagram of the voice conversion method of the present invention based on self-organizing feature map network clustering and a radial basis function network;
Fig. 2 is a schematic diagram of the clustering of LSF parameters and the establishment of their conversion rules;
Fig. 3 is a schematic diagram of the clustering of fundamental frequencies and the establishment of their conversion rules;
Fig. 4 is a schematic diagram of the parameter conversion and speech synthesis for the i-th voiced frame.
Embodiment
With reference to Fig. 1, the steps of the voice conversion method of the present invention based on self-organizing feature map network clustering and a radial basis function network are as follows:
The first step: in the training stage, preprocessing, voiced/unvoiced decision, and feature parameter extraction are carried out. The input speech signal is pre-emphasized, divided into frames, and windowed; the short-time energy and average zero-crossing rate of each frame are computed to complete the voiced/unvoiced decision; the STRAIGHT model is then used to extract the fundamental frequency and Line Spectral Frequency (LSF) parameters of each voiced frame. The detailed process is as follows:
(1) The speech signal is preprocessed: the pre-emphasis factor is 0.96, frames are 40 ms long with a frame shift of 1 ms, and a Hamming window is applied;
(2) The short-time energy
E_n = Σ_{m=0}^{N-1} x_n(m)^2
and short-time zero-crossing rate
Z_n = (1/2) Σ_{m=1}^{N-1} | sgn(x_n(m)) − sgn(x_n(m−1)) |
are computed frame by frame, where x_n(m) is the n-th windowed frame of the speech signal and N is the frame length; the double-threshold method is then used to make the voiced/unvoiced decision;
(3) Feature parameter extraction: for each voiced frame, the STRAIGHT model extracts the fundamental frequency and the spectral envelope parameters. Because each frame's spectral envelope is 513-dimensional, it must be reduced in dimension and transformed: an IFFT (Inverse Fast Fourier Transform) of the spectral envelope first yields the autocorrelation coefficients, the Levinson-Durbin algorithm then yields the AR (Autoregressive) model parameters, and from the AR parameters the Line Spectral Frequencies (LSF) are derived. The dimension of the LSF parameters is fixed at 16, so the LSF parameters and fundamental frequency of each voiced frame are obtained.
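The framing, short-time energy, zero-crossing rate, and double-threshold decision of step (2) can be sketched as follows. This is a minimal NumPy illustration, not the patent's STRAIGHT-based pipeline; the threshold values `e_hi` and `z_lo` are hypothetical and would be tuned on real speech.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice signal x into overlapping frames and apply a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def short_time_energy(frames):
    """E_n = sum_m x_n(m)^2, per frame."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Z_n = (1/2) sum_m |sgn(x_n(m)) - sgn(x_n(m-1))|, per frame."""
    signs = np.sign(frames)
    return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)

def voiced_mask(frames, e_hi, z_lo):
    """Double-threshold decision: voiced frames have high energy and low ZCR."""
    return (short_time_energy(frames) > e_hi) & (zero_crossing_rate(frames) < z_lo)
```

For example, a loud 100 Hz tone is flagged voiced (high energy, few zero crossings), while quiet high-frequency content is not.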
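The autocorrelation-to-AR step of (3) is the classical Levinson-Durbin recursion, which can be sketched as below. The IFFT of the spectral envelope and the final AR-to-LSF conversion are omitted here; only the recursion itself is shown.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: from autocorrelation r[0..order],
    solve the Toeplitz normal equations for the AR coefficients
    a = [1, a1, ..., a_order]; also return the final prediction-error power."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = float(r[0])
    for i in range(1, order + 1):
        # reflection coefficient from the residual correlation at lag i
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err
```

With the patent's 16-dimensional LSF representation, this would be called with `order=16` on the autocorrelation obtained from the IFFT of the spectral envelope.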
The second step: parameter clustering, as shown in Fig. 2. The extracted source and target LSF parameters are first aligned by dynamic time warping; the self-organizing feature map (SOM) network then clusters the source LSF parameters, and the indices of the source LSF parameters in each class are recorded, so that the time-aligned target LSF parameters gather into the corresponding classes, realizing the clustering of the target LSF parameters. Similarly, as shown in Fig. 3, the alignment indices returned by the dynamic time warping associate each aligned target LSF parameter with its fundamental frequency, and the recorded source LSF indices then realize the clustering of the target fundamental frequencies. The detailed process is as follows:
(1) Dynamic time warping (DTW) is applied to the extracted source and target LSF parameters so that they are time-aligned;
(2) The aligned source and target LSF parameters are arranged into matrices, each row representing one frame of LSF parameters. The SOM network then clusters the source LSF parameters: its input layer has 16 nodes, and its two-dimensional competition layer has 5 rows and 5 columns, 25 neurons in total, so that after training each competition-layer node represents one class. The source LSF parameters are thus divided into 25 classes, and the indices of the source LSF parameters in each class are recorded;
(3) According to the recorded indices of each class, the target LSF parameters with the same indices are gathered into one class, realizing the clustering of the target LSF parameters;
(4) Similarly, the alignment indices of the target LSF parameters returned by the dynamic time warping select the fundamental frequencies of the target voiced frames at the same index positions, so that each aligned target LSF parameter corresponds to a target fundamental frequency; the target fundamental frequencies sharing the indices of each recorded class are then gathered into that class.
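A minimal self-organizing map of the kind used in step (2) can be sketched as follows. This is a didactic NumPy implementation with a 5×5 competition layer as in the patent; the learning-rate and neighborhood decay schedules are hypothetical choices, not taken from the patent.

```python
import numpy as np

def train_som(data, grid=(5, 5), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a minimal SOM. data: (n_samples, dim). Returns the
    weight vectors of the rows*cols competition-layer neurons."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    n_units = rows * cols
    w = rng.standard_normal((n_units, data.shape[1])) * data.std() + data.mean()
    # 2-D grid coordinates of each unit, for the neighborhood function
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    n_steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            lr = lr0 * np.exp(-t / n_steps)          # decaying learning rate
            sigma = sigma0 * np.exp(-t / n_steps)    # shrinking neighborhood
            bmu = np.argmin(np.sum((w - x) ** 2, axis=1))  # best matching unit
            d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))       # Gaussian neighborhood
            w += lr * h[:, None] * (x - w)
            t += 1
    return w

def assign_clusters(data, w):
    """Winning-unit index per frame; frames sharing a winner form one class."""
    return np.argmin(((data[:, None, :] - w[None, :, :]) ** 2).sum(-1), axis=1)
```

Here `assign_clusters` plays the role of the competition-layer classification, both for grouping the training frames and, later, for classifying voiced frames of the speech to be converted.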
The third step: establishment of the spectral envelope conversion rules, as shown in Fig. 2. For each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output to train an RBF network whose basis functions are Gaussian, finding the relation between input and output vectors. For an input vector X_n, the k-th component of the output layer is
y_k(X_n) = Σ_{j=1}^{M} w_{jk} exp( − ||X_n − X_j||^2 / (2σ^2) )
where X_j is the center of the j-th basis function and w_{jk} are the output weights. After training, a spectral envelope conversion rule has been established for each cluster.
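A Gaussian-RBF regressor of this form can be sketched as follows. The patent does not specify how the centers and weights are trained, so this sketch makes one common choice: the centers are taken as the training inputs and the output weights are fit by ridge-regularized least squares.

```python
import numpy as np

def rbf_design(X, centers, sigma):
    """Gaussian design matrix Phi[n, j] = exp(-||x_n - c_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def train_rbf(X, Y, centers, sigma, ridge=1e-6):
    """Fit output weights W so that Phi @ W ~= Y, i.e.
    y_k(x) = sum_j w_jk * exp(-||x - c_j||^2 / (2 sigma^2))."""
    Phi = rbf_design(X, centers, sigma)
    return np.linalg.solve(Phi.T @ Phi + ridge * np.eye(len(centers)), Phi.T @ Y)

def rbf_predict(X, centers, sigma, W):
    """Evaluate the trained RBF network on new inputs."""
    return rbf_design(X, centers, sigma) @ W
```

In the method of the patent, one such network per cluster would map 16-dimensional source LSF vectors to 16-dimensional target LSF vectors.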
The fourth step: establishment of the fundamental frequency conversion rules, as shown in Fig. 3. For each cluster, the target LSF parameters are used as input and the corresponding target fundamental frequencies as output to train an RBF network, establishing the fundamental frequency conversion rule of each cluster. The detailed process is as follows:
(1) The fundamental frequencies of the target voiced frames are first scaled: since the fundamental frequency of a male voice ranges over about 60-200 Hz and that of a female voice over about 60-450 Hz, and the output of the RBF network lies between 0 and 1, the target fundamental frequencies are scaled by dividing by 500;
(2) For each cluster, the target LSF parameters are used as input and the scaled target fundamental frequencies as output; the basis functions are Gaussian, and training the RBF network establishes the fundamental frequency conversion rule of each cluster.
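The scaling in (1), and its inverse applied later in the conversion stage, amount to:

```python
F0_SCALE = 500.0  # covers the male (~60-200 Hz) and female (~60-450 Hz) F0 ranges

def scale_f0(f0_hz):
    """Map F0 in Hz into (0, 1) so the RBF output range can represent it."""
    return f0_hz / F0_SCALE

def unscale_f0(y):
    """Invert the scaling: multiply the network output by 500 to recover Hz."""
    return y * F0_SCALE
```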
The fifth step: conversion of the feature parameters, as shown in Fig. 4. Each voiced frame of the speech to be converted is first classified; the spectral envelope conversion rule of its class, obtained in the third step, converts its LSF parameters, and the fundamental frequency conversion rule of that class, obtained in the fourth step, then derives the converted fundamental frequency from the converted LSF parameters. The detailed process is as follows:
(1) The source speech to be converted is first divided into frames, windowed with a Hamming window, and subjected to the voiced/unvoiced decision. For each voiced frame, the STRAIGHT model extracts the fundamental frequency and LSF parameters; the LSF parameters are fed frame by frame into the trained SOM network, and the winning neuron of the competition identifies the class of the voiced frame, completing its classification;
(2) The spectral envelope conversion rule of the corresponding class is selected; the LSF parameters of the voiced frame are fed as input into the trained spectral envelope RBF network of that class, yielding the converted LSF parameters;
(3) The fundamental frequency conversion rule of the corresponding class is selected; the converted LSF parameters are fed as input into the trained fundamental frequency RBF network of that class, and the output is multiplied by 500 to restore the scale, yielding the converted fundamental frequency.
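The per-frame conversion logic of the fifth step can be sketched as a small orchestration function. The classifier and per-cluster networks are caller-supplied callables here, standing in for the trained SOM and RBF networks.

```python
def convert_voiced_frame(lsf, som_classify, spec_rules, f0_rules, f0_scale=500.0):
    """Convert one voiced frame.
    som_classify: LSF vector -> cluster index (winning SOM neuron)
    spec_rules[k]: source LSF -> converted LSF (per-cluster spectral RBF)
    f0_rules[k]: converted LSF -> scaled F0 in (0, 1) (per-cluster F0 RBF)
    Note that F0 is predicted from the *converted* LSF, which is the
    envelope/F0 coupling this method introduces."""
    k = som_classify(lsf)
    lsf_converted = spec_rules[k](lsf)
    f0_converted = f0_rules[k](lsf_converted) * f0_scale  # undo the /500 scaling
    return lsf_converted, f0_converted
```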
The sixth step: speech synthesis. The LSF parameters and fundamental frequencies obtained in the fourth and fifth steps are fed to the STRAIGHT model to produce the converted speech.
(1) The converted LSF parameters of each voiced frame are first transformed back into AR parameters, and the AR parameters are transformed into an energy spectrum (in dB); the first 513 dimensions of the energy spectrum are taken, and the energy spectrum is converted from dB back to linear values;
(2) For unvoiced frames, the spectral envelope parameters and fundamental frequency of the source speech frame are used directly as the energy spectrum and converted fundamental frequency;
(3) The resulting energy spectra and converted fundamental frequencies are fed into the STRAIGHT model to synthesize the converted speech.
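The AR-to-energy-spectrum step of (1) can be sketched by evaluating the all-pole model's magnitude response on 513 frequency bins, as below. The LSF-to-AR conversion itself is omitted; this only illustrates computing the dB spectrum and converting it back to linear values.

```python
import numpy as np

def ar_to_energy_spectrum_db(a, n_bins=513):
    """All-pole magnitude spectrum of an AR model 1/A(z), in dB,
    sampled at n_bins frequencies over [0, pi] (513 bins as in the patent)."""
    a = np.asarray(a, dtype=float)
    w = np.linspace(0.0, np.pi, n_bins)
    # A(e^{jw}) = sum_k a[k] e^{-jwk}
    A = np.exp(-1j * np.outer(w, np.arange(len(a)))) @ a
    power = 1.0 / np.abs(A) ** 2
    return 10.0 * np.log10(power)

def db_to_linear(spec_db):
    """Undo the dB scaling before handing the spectrum to the synthesizer."""
    return 10.0 ** (spec_db / 10.0)
```

For a low-pass AR model such as A(z) = 1 − 0.9 z^{-1}, the spectrum is highest at DC and falls off toward π, as expected of an all-pole envelope.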

Claims (7)

1. A voice conversion method based on self-organizing feature map network clustering and a radial basis function network, characterized by comprising the following steps:
The first step: preprocessing, voiced/unvoiced decision, and feature parameter extraction; the input speech signal is pre-emphasized, divided into frames, and windowed; the short-time energy and average zero-crossing rate of each frame are computed to complete the voiced/unvoiced decision; the STRAIGHT model is then used to extract the LSF parameters and fundamental frequency of each voiced frame;
The second step: parameter clustering; the extracted source and target LSF parameters are first aligned by dynamic time warping, the self-organizing feature map network then clusters the source LSF parameters while the indices of the source LSF parameters in each class are recorded, so that the target LSF parameters corresponding to the source LSF parameters of a given cluster also gather into one class, realizing the clustering of the target LSF parameters; similarly, the alignment indices returned by the dynamic time warping determine the fundamental frequency corresponding to each time-aligned target LSF parameter, and the recorded source LSF indices realize the clustering of the target fundamental frequencies;
The third step: establishment of the spectral envelope conversion rules; for each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output to train an RBF network, establishing a spectral envelope conversion rule for each cluster;
The fourth step: establishment of the fundamental frequency conversion rules; for each cluster, the target LSF parameters are used as input and the corresponding fundamental frequencies as output to train an RBF network, establishing a fundamental frequency conversion rule for each cluster;
The fifth step: conversion of the feature parameters; each voiced frame of the speech to be converted is first classified, the spectral envelope conversion rule of its class obtained in the third step converts its LSF parameters, and the fundamental frequency conversion rule of that class obtained in the fourth step then derives the converted fundamental frequency from the converted LSF parameters;
The sixth step: speech synthesis; the LSF parameters and fundamental frequencies obtained in the fourth and fifth steps are fed to the STRAIGHT model to produce the converted speech.
2. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the detailed process of preprocessing, voiced/unvoiced decision, and feature parameter extraction is as follows:
The first step: the speech signal is preprocessed; the pre-emphasis factor is 0.96, frames are 40 ms long with a frame shift of 1 ms, and a Hamming window is applied;
The second step: the short-time energy
E_n = Σ_{m=0}^{N-1} x_n(m)^2
and short-time zero-crossing rate
Z_n = (1/2) Σ_{m=1}^{N-1} | sgn(x_n(m)) − sgn(x_n(m−1)) |
are computed frame by frame, where x_n(m) is the n-th windowed frame of the speech signal and N is the frame length; the double-threshold method is used to make the voiced/unvoiced decision;
The third step: feature parameter extraction; for each voiced frame the STRAIGHT model extracts the fundamental frequency and spectral envelope parameters, which are then reduced in dimension and transformed: an IFFT first yields the autocorrelation coefficients, the autocorrelation coefficients yield the AR parameters, and finally the Line Spectral Frequencies (LSF) are derived from the AR parameters; the dimension of the LSF parameters is fixed at 16.
3. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the detailed process of parameter clustering is as follows:
The first step: dynamic time warping is applied to the extracted source and target LSF parameters so that they are time-aligned;
The second step: the self-organizing feature map network classifies the source LSF parameters; its input layer has 16 nodes and its two-dimensional competition layer has 5 rows and 5 columns, 25 neurons in total, so that after clustering the source LSF parameters are divided into 25 classes, and the indices of the source LSF parameters in each class are recorded;
The third step: according to the recorded indices of each class, the target LSF parameters with the same indices are gathered into one class, realizing the clustering of the target LSF parameters;
The fourth step: likewise, the alignment indices returned by the dynamic time warping identify the target fundamental frequency corresponding to each time-aligned target LSF parameter, and the recorded source LSF indices then realize the clustering of the target fundamental frequencies.
4. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the spectral envelope conversion rules are established with an RBF network: for each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output for training, with Gaussian basis functions; for an input vector X_n, the k-th component of the output layer is y_k(X_n) = Σ_{j=1}^{M} w_{jk} exp( − ||X_n − X_j||^2 / (2σ^2) ), where X_j is the center of the j-th basis function and w_{jk} are the output weights; after training, a spectral envelope conversion rule has been established for each cluster.
5. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the fundamental frequency conversion rules are established as follows:
The first step: the fundamental frequencies of the target voiced frames are scaled; since the fundamental frequency of a male voice ranges over about 60-200 Hz and that of a female voice over about 60-450 Hz, and the output of the RBF network lies between 0 and 1, the target fundamental frequencies are scaled by dividing by 500;
The second step: for each cluster, the target LSF parameters are used as input and the scaled target fundamental frequencies as output; the basis functions are Gaussian, and training the RBF network establishes the fundamental frequency conversion rule of each cluster.
6. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the conversion of the feature parameters proceeds as follows:
The first step: the speech to be converted is divided into frames and subjected to the voiced/unvoiced decision; for each voiced frame the STRAIGHT model extracts the fundamental frequency and LSF parameters, the LSF parameters are fed one by one into the self-organizing feature map network trained in the training stage, and the winning neuron of the competition identifies the class of the voiced frame, completing its classification;
The second step: the spectral envelope conversion rule of the corresponding class is selected, and the LSF parameters of the voiced frame are fed as input into the trained spectral envelope RBF network of that class, yielding the converted LSF parameters;
The third step: the fundamental frequency conversion rule of the corresponding class is selected, the converted LSF parameters are fed as input into the trained fundamental frequency RBF network of that class, and the output is multiplied by 500 to restore the scale, yielding the converted fundamental frequency.
7. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the detailed process of speech synthesis is as follows: the converted LSF parameters of each voiced frame are first transformed back into AR parameters, and the AR parameters are transformed into an energy spectrum in dB; the first 513 dimensions of the energy spectrum are taken and converted from dB back to linear values; for unvoiced frames, the spectral envelope parameters and fundamental frequency of the source speech frame are used directly as the energy spectrum and converted fundamental frequency; the resulting energy spectra and converted fundamental frequencies are then fed into the STRAIGHT model to synthesize the converted speech.
CN2012100388747A 2012-02-21 2012-02-21 Voice conversion method based on self-organizing feature map network cluster and radial basis network Expired - Fee Related CN102568476B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100388747A CN102568476B (en) 2012-02-21 2012-02-21 Voice conversion method based on self-organizing feature map network cluster and radial basis network


Publications (2)

Publication Number Publication Date
CN102568476A true CN102568476A (en) 2012-07-11
CN102568476B CN102568476B (en) 2013-07-03

Family

ID=46413732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100388747A Expired - Fee Related CN102568476B (en) 2012-02-21 2012-02-21 Voice conversion method based on self-organizing feature map network cluster and radial basis network

Country Status (1)

Country Link
CN (1) CN102568476B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1185194A (en) * 1997-09-04 1999-03-30 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Voice quality conversion speech synthesis apparatus
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZUO, Guoyu et al.: "Voice conversion based on genetic radial basis function neural network", Journal of Chinese Information Processing, vol. 18, no. 1, 29 February 2004 (2004-02-29), pages 78-84 *
CHEN, Zhi et al.: "Pitch contour conversion algorithm and its application in voice conversion systems", Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), vol. 30, no. 5, 31 October 2010 (2010-10-31), pages 83-87 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364639A (en) * 2013-08-23 2018-08-03 Toshiba Corporation Speech processing system and method
CN105023574B (en) * 2014-04-30 2018-06-15 iFLYTEK Co., Ltd. Method and system for synthesized speech enhancement
CN104464744A (en) * 2014-11-19 2015-03-25 Hohai University Changzhou Campus Cluster voice conversion method and system based on mixture Gaussian random process
CN105390141B (en) * 2015-10-14 2019-10-18 iFLYTEK Co., Ltd. Sound conversion method and device
CN105390141A (en) * 2015-10-14 2016-03-09 iFLYTEK Co., Ltd. Sound conversion method and sound conversion device
CN107545903A (en) * 2017-07-19 2018-01-05 Nanjing University of Posts and Telecommunications Voice conversion method based on deep learning
CN107545903B (en) * 2017-07-19 2020-11-24 Nanjing University of Posts and Telecommunications Voice conversion method based on deep learning
CN108417198A (en) * 2017-12-28 2018-08-17 Central South University Male-female voice conversion method based on spectral envelope and pitch period
CN109448739A (en) * 2018-12-13 2019-03-08 Shandong Computer Science Center (National Supercomputer Center in Jinan) Vocoder line spectral frequency parameter quantization method based on hierarchical clustering
CN109448739B (en) * 2018-12-13 2019-08-23 Shandong Computer Science Center (National Supercomputer Center in Jinan) Vocoder line spectral frequency parameter quantization method based on hierarchical clustering
CN109712634A (en) * 2018-12-24 2019-05-03 Northeastern University Automatic voice conversion method
CN110085255A (en) * 2019-03-27 2019-08-02 Hohai University Changzhou Campus Gaussian process regression modeling method for voice conversion based on deep kernel learning
CN110085255B (en) * 2019-03-27 2021-05-28 Hohai University Changzhou Campus Speech conversion Gaussian process regression modeling method based on deep kernel learning
CN110536215A (en) * 2019-09-09 2019-12-03 TP-Link Technologies Co., Ltd. Audio signal processing method, apparatus, computing device and storage medium
CN113380261A (en) * 2021-05-26 2021-09-10 Terminus Technology Group Co., Ltd. Artificial intelligence voice acquisition processor and method

Also Published As

Publication number Publication date
CN102568476B (en) 2013-07-03

Similar Documents

Publication Publication Date Title
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN102509547B (en) Method and system for voiceprint recognition based on vector quantization
CN105118498B (en) Training method and device for speech synthesis model
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on neural network
CN102231278B (en) Method and system for realizing automatic addition of punctuation marks in speech recognition
CN103065620B (en) Method for receiving text input by a user on a mobile phone or webpage and synthesizing it into personalized speech in real time
CN109272990A (en) Audio recognition method based on convolutional neural networks
Wali et al. Generative adversarial networks for speech processing: A review
CN101777347B (en) Model complementary Chinese accent identification method and system
CN102982803A (en) Isolated word speech recognition method based on HRSF and improved DTW algorithm
CN103021418A (en) Voice conversion method oriented to multi-time-scale prosodic features
CN102237083A (en) Portable interpretation system based on WinCE platform and language recognition method thereof
CN105654942A (en) Speech synthesis method of interrogative sentence and exclamatory sentence based on statistical parameter
CN106782599A (en) Voice conversion method based on Gaussian process with output post-filtering
CN110047504A (en) Speaker recognition method under linear transformation of identity vector (x-vector)
Garg et al. Survey on acoustic modeling and feature extraction for speech recognition
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
Singh et al. Spectral Modification Based Data Augmentation For Improving End-to-End ASR For Children's Speech
Xue et al. Cross-modal information fusion for voice spoofing detection
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
Guo et al. Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training
CN114495969A (en) Voice recognition method integrating voice enhancement
Li et al. Emotion recognition from speech with StarGAN and Dense‐DCNN
Wu et al. A Characteristic of Speaker's Audio in the Model Space Based on Adaptive Frequency Scaling
CN103226946A (en) Speech synthesis method based on restricted Boltzmann machine

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130703

Termination date: 20160221
