CN102568476B - Voice conversion method based on self-organizing feature map network cluster and radial basis network - Google Patents


Info

Publication number
CN102568476B
Authority
CN
China
Prior art keywords: parameter, fundamental frequency, LSF parameter, cluster, conversion
Prior art date
Legal status: Expired - Fee Related
Application number
CN2012100388747A
Other languages
Chinese (zh)
Other versions
CN102568476A (en)
Inventor
解伟超
张玲华
Current Assignee: Nanjing University of Posts and Telecommunications
Original Assignee: Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN2012100388747A
Publication of CN102568476A
Application granted
Publication of CN102568476B
Status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voice conversion method based on self-organizing feature map (SOM) network clustering and a radial basis function (RBF) network, belonging to the technical field of speech signal processing. The method comprises the following steps: preprocessing, voiced/unvoiced decision and feature-parameter extraction; parameter clustering; establishment of spectral-envelope conversion rules; establishment of fundamental-frequency conversion rules; feature-parameter conversion; and speech synthesis. The feature parameters of the source speech are divided into several clusters, the corresponding target feature parameters are divided into clusters in one-to-one correspondence with the source clusters, and a conversion rule is established for each cluster. Partitioning the training data in this way reduces the time complexity of training, so the converted speech has good naturalness. When the feature parameters are converted, the fundamental frequency is tied to the spectral envelope and a conversion relation between the two is established, overcoming the drawbacks of converting the fundamental frequency in isolation and giving the converted fundamental frequency more of the target speaker's characteristics.

Description

Voice conversion method based on self-organizing feature map network clustering and radial basis function network
Technical field
The present invention relates to voice conversion techniques, and in particular to a voice conversion method based on self-organizing feature map (SOM) network clustering and a radial basis function (RBF) network; it belongs to the field of speech signal processing.
Background technology
Voice conversion is an emerging research branch of speech signal processing. It builds on the research foundations of speaker recognition and speech synthesis, and both enriches and extends those two branches, while not being entirely subsumed by either.
The goal of voice conversion is to change the personal characteristics in the source speaker's speech so that it takes on the personal characteristics of the target speaker, while keeping the semantic information unchanged, so that the converted speech sounds like the target speaker's voice.
A voice conversion system comprises a training stage and a conversion stage. In the training stage, the system analyzes the parameters of the source and target speakers' speech and establishes conversion rules. In the conversion stage, features are first extracted from the source speech and then converted into target speech features according to the conversion rules obtained in the training stage.
The key issues in voice conversion are the extraction of the speaker's personal characteristics and the establishment of the conversion rules. Two decades of development have produced a large body of research results; current work on speech feature parameters mainly concerns the spectral-envelope parameters and the fundamental frequency. Existing spectral-envelope conversion methods include those based on the linear prediction coding model (Linear Prediction Coding, LPC), the Gaussian mixture model (Gaussian Mixture Model, GMM) and the harmonic plus noise model (Harmonic plus Noise Model, HNM). However, these methods train directly on the extracted parameters and establish a single unified conversion rule. Because speech signals are time-varying and non-stationary, and the amount of training data is huge, a single conversion rule cannot accurately describe the mapping between the feature parameters of the source speech and those of the target speech, which inevitably causes distortion. (1. Zad-Issa, M.R., Kabal, P. Smoothing the Evolution of the Spectral Parameters in Linear Prediction of Speech using Target Matching. ICASSP, 1997: vol. 3, 1699-1702. 2. Daojian Zeng, Yibiao Yu. Voice Conversion using Structured Gaussian Mixture Model. ICSP, 2010: 541-544. 3. Hu H.T., Yu C., Lin C.H. HNM parameter transform for voice conversion using a HMM-WDLT framework. ICIMA, 2010: vol. 2, 282-287.)
Existing fundamental-frequency conversion methods include the mean-value conversion method and Gaussian-model methods, but they all convert the spectral-envelope parameters and the fundamental frequency separately, with no link between the two conversions. Yet the spectral envelope and the fundamental frequency come from the same speech signal, and a growing body of research shows a close relationship between them, so the traditional practice of converting the two kinds of parameters independently inevitably degrades the quality of the synthesized speech. (1. Lee K.S., Doh W., Youn D.H. Voice conversion using low dimensional vector mapping. IEICE Transactions on Information & Systems, 2002, E85(D): 1297-1305. 2. L.M. Arslan. Speaker Transformation Algorithm using Segmental Codebooks (STASC). Speech Communication, Jul. 1999: vol. 28, no. 3, pp. 211-226.)
Summary of the invention
The object of the present invention is to provide a voice conversion method that combines the time-domain characteristics of speech with the speaker's personal characteristics under the condition of parallel text, obtaining a more accurate conversion rule so that the speaker's personal characteristics in the converted speech are strengthened and its acoustic quality is improved.
To achieve the above object, the present invention adopts the following technical scheme.
A voice conversion method based on self-organizing feature map network clustering and a radial basis function network comprises the following concrete steps:
Step 1: preprocessing, voiced/unvoiced decision and feature-parameter extraction. The input speech signal is pre-emphasized, divided into frames and windowed; the short-time energy and average zero-crossing rate of each frame are computed to make the voiced/unvoiced decision; the STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) model is then used to extract the LSF (Linear Spectral Frequency) parameters and fundamental frequency of each voiced frame;
Step 2: parameter clustering. The extracted source LSF parameters and target LSF parameters are first aligned by dynamic time warping; the self-organizing feature map network then performs self-organized clustering of the source LSF parameters, and the subscripts (frame indices) of the source LSF parameters in each class are recorded. The target LSF parameters corresponding to the source LSF parameters of a given cluster are thereby gathered into a class of their own, realizing the clustering of the target LSF parameters. Similarly, the parameter subscripts returned by the dynamic time warping of the LSF parameters determine the fundamental frequencies corresponding to the time-aligned target LSF parameters, and the recorded source LSF subscripts then realize the clustering of the target fundamental frequencies;
Step 3: establishment of the spectral-envelope conversion rules. For each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output to train an RBF (Radial Basis Function) network, establishing a spectral-envelope conversion rule for each cluster;
Step 4: establishment of the fundamental-frequency conversion rules. For each cluster, the target LSF parameters are used as input and the corresponding fundamental frequencies as output to train an RBF network, establishing a fundamental-frequency conversion rule for each cluster;
Step 5: conversion of the feature parameters. Each voiced frame of the speech to be converted is first classified; its LSF parameters are converted with the spectral-envelope conversion rule of its class obtained in step 3; the converted LSF parameters are then fed to the fundamental-frequency conversion rule of that class obtained in step 4 to obtain the converted fundamental frequency;
Step 6: speech synthesis. The LSF parameters and fundamental frequencies obtained in steps 4 and 5 above are passed to the STRAIGHT model, which finally produces the converted speech.
Compared with the prior art, the present invention has the following notable advantages. (1) Guided by the theory of speech feature-parameter mapping, the source speech feature parameters are divided into several clusters, the corresponding target feature parameters are divided into classes in one-to-one correspondence with the source clusters, and a conversion rule is established for each cluster separately. Partitioning the training data not only reduces the time complexity of training but also, by exploiting the short-time quasi-periodicity of speech, lets each cluster's conversion rule reflect that class's mapping more accurately, so the converted speech has good naturalness. (2) When the feature parameters are converted, the fundamental frequency is linked to the spectral envelope and a conversion relation between the two is established, overcoming the shortcomings of existing isolated fundamental-frequency conversion and giving the converted fundamental frequency more of the target speaker's characteristics.
The present invention is described in further detail below in conjunction with the accompanying drawings.
Description of drawings
Fig. 1 is a schematic diagram of the voice conversion method of the present invention based on self-organizing feature map network clustering and a radial basis function network;
Fig. 2 is a schematic diagram of the clustering of the LSF parameters and the establishment of their conversion rules;
Fig. 3 is a schematic diagram of the clustering of the fundamental frequency and the establishment of its conversion rules;
Fig. 4 is a schematic diagram of the conversion of the parameters of the i-th voiced frame and of speech synthesis.
Embodiment
With reference to Fig. 1, the voice conversion method of the present invention based on self-organizing feature map network clustering and a radial basis function network proceeds as follows:
Step 1: in the training stage, perform preprocessing, voiced/unvoiced decision and feature-parameter extraction. The input speech signal is pre-emphasized, divided into frames and windowed; the short-time energy and average zero-crossing rate of each frame are computed to make the voiced/unvoiced decision; the STRAIGHT model is then used to extract the fundamental frequency and linear spectral frequency (LSF) parameters of each voiced frame. The detailed process is as follows:
(1) Preprocess the speech signal: the pre-emphasis factor is 0.96, the signal is divided into 40 ms frames with a 1 ms frame shift, and each frame is windowed with a Hamming window;
(2) Compute frame by frame the short-time energy
E_n = Σ_{m=0}^{N-1} x_n(m)²
and the short-time zero-crossing rate
Z_n = (1/2) Σ_{m=1}^{N-1} |sgn[x_n(m)] − sgn[x_n(m−1)]|,
where x_n(m) is the n-th speech frame after windowing and N is the frame length; the double-threshold method is used to make the voiced/unvoiced decision;
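For illustration, the preprocessing and voiced/unvoiced decision above can be sketched in NumPy as follows. Only the pre-emphasis factor 0.96, the 40 ms frame, the 1 ms shift and the Hamming window come from the patent; the function names and the energy/ZCR threshold values are our own assumptions:

```python
import numpy as np

def preprocess(x, fs, alpha=0.96, frame_ms=40, hop_ms=1):
    """Pre-emphasis (factor 0.96), 40 ms frames with a 1 ms shift, Hamming
    window -- the settings given in the patent."""
    x = np.append(x[0], x[1:] - alpha * x[:-1])      # pre-emphasis: 1 - 0.96 z^-1
    n = int(frame_ms * fs / 1000)
    hop = int(hop_ms * fs / 1000)
    starts = np.arange(0, len(x) - n + 1, hop)
    return np.stack([x[s:s + n] for s in starts]) * np.hamming(n)

def short_time_energy(frames):
    # E_n = sum_m x_n(m)^2
    return np.sum(frames ** 2, axis=1)

def short_time_zcr(frames):
    # Z_n = 1/2 * sum_m |sgn x_n(m) - sgn x_n(m-1)|
    return 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)

def voiced_mask(frames, energy_thr, zcr_thr):
    """Double-threshold voiced/unvoiced decision: a voiced frame has high
    energy and a low zero-crossing rate (threshold values are not specified
    in the patent and must be tuned)."""
    return (short_time_energy(frames) > energy_thr) & (short_time_zcr(frames) < zcr_thr)
```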
(3) Extract the feature parameters: the STRAIGHT model extracts the fundamental frequency and the spectral parameters of each voiced frame. Since each frame's spectral parameters have 513 dimensions, they must be reduced and transformed: an IFFT (Inverse Fast Fourier Transform) of the spectral parameters first yields the autocorrelation coefficients, the Levinson-Durbin algorithm then yields the AR (Autoregressive) model parameters, and the linear spectral frequencies (LSF) are further derived from the AR parameters, with the LSF dimension set to 16. In this way the LSF parameters and fundamental frequency of each voiced frame are obtained;
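A minimal sketch of this dimensionality-reduction chain (mirrored power spectrum → IFFT → autocorrelation → Levinson-Durbin → AR → LSF). The root-finding route from the AR polynomial to the LSFs is a standard method we assume here, since the patent does not spell it out:

```python
import numpy as np

def levinson(r, order):
    """Levinson-Durbin recursion: autocorrelation r[0..order] -> AR coefficients."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        e *= 1.0 - k * k
    return a

def poly2lsf(a):
    """AR polynomial A(z) -> sorted line spectral frequencies (radians), from the
    roots of P(z) = A(z) + z^-(p+1) A(z^-1) and Q(z) = A(z) - z^-(p+1) A(z^-1)."""
    a1 = np.append(a, 0.0)
    P, Q = a1 + a1[::-1], a1 - a1[::-1]
    ang = np.concatenate([np.angle(np.roots(P)), np.angle(np.roots(Q))])
    return np.sort(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])  # drop roots at z = +/-1

def spectrum_to_lsf(power_spec, order=16):
    """513-bin STRAIGHT power spectrum -> 16-dim LSF (the patent's choice):
    the IFFT of the mirrored spectrum gives the autocorrelation, Levinson-Durbin
    gives the AR model, and its sum/difference polynomials give the LSFs."""
    full = np.concatenate([power_spec, power_spec[-2:0:-1]])  # mirror to a full FFT
    r = np.real(np.fft.ifft(full))[:order + 1]
    return poly2lsf(levinson(r, order))
```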
Step 2: parameter clustering, as shown in Fig. 2. The extracted source and target LSF parameters are first aligned by dynamic time warping; the self-organizing feature map network then clusters the source LSF parameters, and the subscripts of the source LSF parameters in each class are recorded, so the target LSF parameters time-aligned with the source LSF parameters of a given cluster are likewise gathered into a class, realizing the clustering of the target LSF parameters. Similarly, as shown in Fig. 3, the alignment subscripts returned by the dynamic time warping of the LSF parameters identify the fundamental frequencies corresponding to the aligned target LSF parameters, and the recorded source LSF subscripts then realize the clustering of the target fundamental frequencies. The detailed process is as follows:
(1) Apply dynamic time warping (Dynamic Time Warping, DTW) to the extracted source and target LSF parameters so that the source and target LSF parameters are time-aligned;
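Sub-step (1) can be sketched as a textbook DTW with a Euclidean frame distance. The patent does not specify the local path constraints, so the symmetric (diagonal/up/left) recursion below is an assumption:

```python
import numpy as np

def dtw_align(X, Y):
    """Dynamic time warping of two LSF sequences (frames x dims). Returns the
    aligned (source, target) frame-index pairs; the patent keeps exactly these
    subscripts so the target parameters can inherit the source clustering."""
    n, m = len(X), len(Y)
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # frame distances
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:                      # backtrack the optimal warp
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]
```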
(2) Arrange the aligned source and target LSF parameters into matrices, each row representing one frame of LSF parameters. The self-organizing feature map network then clusters the source LSF parameters: its input layer has 16 nodes and its two-dimensional competition layer has 5 rows by 5 columns, 25 neurons in total. After training, each competition-layer node represents one class of parameters, so the source LSF parameters are divided into 25 classes, and the subscripts of the source LSF parameters in each class are recorded;
(3) According to the recorded subscripts of the source LSF parameters of each class, the target LSF parameters with the same subscripts are gathered into a class, realizing the clustering of the target LSF parameters;
(4) Similarly, using the subscripts of the aligned target LSF parameters returned by each dynamic time warping, select the fundamental frequencies of the target voiced frames at the same subscript positions, so that the aligned target LSF parameters correspond to the target fundamental frequencies; then, using the recorded source LSF subscripts of each class, the target fundamental frequencies with the same subscripts are also gathered into a class;
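Sub-steps (2)-(3) can be sketched with a minimal SOM. The 16-dimensional input layer and 5x5 competition layer follow the patent, while the learning-rate and neighborhood decay schedules are our assumptions:

```python
import numpy as np

def train_som(data, rows=5, cols=5, iters=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal self-organizing feature map: a 16-dim input layer and a 5x5
    competition layer (25 nodes), as specified in the patent. The decay
    schedules below are assumptions, not from the patent."""
    rng = np.random.default_rng(seed)
    W = rng.random((rows * cols, data.shape[1]))
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        win = int(np.argmin(np.linalg.norm(W - x, axis=1)))   # competition: winner
        lr = lr0 * np.exp(-t / iters)
        sig = sigma0 * np.exp(-t / iters)
        h = np.exp(-np.sum((grid - grid[win]) ** 2, axis=1) / (2 * sig ** 2))
        W += lr * h[:, None] * (x - W)                        # neighborhood update
    return W

def assign_clusters(data, W):
    """Winning node index = cluster label for every frame; the recorded frame
    subscripts per cluster carry the partition over to the target parameters."""
    return np.argmin(np.linalg.norm(data[:, None, :] - W[None, :, :], axis=2), axis=1)
```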
Step 3: establishment of the spectral-envelope conversion rules, as shown in Fig. 2. For each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output to train an RBF network whose basis functions are Gaussian, learning the relation between the input and output vectors. For an input vector X_n, the k-th component of the output layer is
y_k(X_n) = Σ_{j=1}^{M} w_{jk} exp(−‖X_n − X_j‖² / (2σ_j²)),
where X_j is the center of the j-th basis function, σ_j its width and w_{jk} the corresponding output weight (M is the number of hidden nodes). After training, a spectral-envelope conversion rule is established for each cluster;
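A sketch of one cluster's Gaussian-RBF mapping. The patent fixes the model form but not the training algorithm, so the center selection (drawn from the training inputs), the shared width and the least-squares solve for the output weights are assumptions:

```python
import numpy as np

def train_rbf(X, Y, n_centers=None, seed=0):
    """Gaussian-basis RBF regression, a sketch of a per-cluster LSF->LSF rule.
    Centers are drawn from the training inputs and the output weights are
    solved by linear least squares (both choices are assumptions)."""
    rng = np.random.default_rng(seed)
    m = len(X) if n_centers is None else min(n_centers, len(X))
    C = X[rng.choice(len(X), size=m, replace=False)]
    sigma = np.mean(np.linalg.norm(C[:, None] - C[None, :], axis=2)) + 1e-8
    Phi = np.exp(-np.linalg.norm(X[:, None] - C[None, :], axis=2) ** 2 / (2 * sigma ** 2))
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return C, sigma, W

def rbf_predict(X, C, sigma, W):
    # y_k(x) = sum_j w_jk * exp(-||x - c_j||^2 / (2 sigma^2))
    Phi = np.exp(-np.linalg.norm(X[:, None] - C[None, :], axis=2) ** 2 / (2 * sigma ** 2))
    return Phi @ W
```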
Step 4: establishment of the fundamental-frequency conversion rules, as shown in Fig. 3. For each cluster, the target LSF parameters are used as input and the corresponding target fundamental frequencies as output to train an RBF network, establishing a fundamental-frequency conversion rule for each cluster. The detailed process is as follows:
(1) First scale the fundamental frequencies of the target voiced frames: since the fundamental frequency of a male voice lies roughly in 60~200 Hz and that of a female voice in 60~450 Hz, while the RBF network output lies between 0 and 1, each target fundamental frequency is divided by 500;
(2) For each cluster, the target LSF parameters are used as input and the scaled target fundamental frequencies as output; the basis functions are Gaussian, and the RBF network is trained to establish the fundamental-frequency conversion rule of each cluster;
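Sub-steps (1)-(2) can be sketched as follows. The division by 500 and the Gaussian basis come from the patent; the center choice and least-squares training are assumptions:

```python
import numpy as np

def train_f0_rule(lsf_t, f0_t, seed=0):
    """One cluster's F0 rule: target LSF in, target F0 out, with F0 divided by
    500 so it lies in [0, 1] (male pitch spans roughly 60-200 Hz, female
    60-450 Hz, per the patent). The Gaussian-RBF fit itself is a sketch."""
    rng = np.random.default_rng(seed)
    C = lsf_t[rng.choice(len(lsf_t), size=len(lsf_t), replace=False)]
    sigma = np.mean(np.linalg.norm(C[:, None] - C[None, :], axis=2)) + 1e-8
    Phi = np.exp(-np.linalg.norm(lsf_t[:, None] - C[None, :], axis=2) ** 2 / (2 * sigma ** 2))
    w, *_ = np.linalg.lstsq(Phi, f0_t / 500.0, rcond=None)     # scale F0 into [0, 1]
    return C, sigma, w

def predict_f0(lsf, C, sigma, w):
    Phi = np.exp(-np.linalg.norm(lsf[:, None] - C[None, :], axis=2) ** 2 / (2 * sigma ** 2))
    return 500.0 * (Phi @ w)                                    # undo the scaling
```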
Step 5: conversion of the feature parameters, as shown in Fig. 4. Each voiced frame of the speech to be converted is first classified; its LSF parameters are converted with the spectral-envelope conversion rule of its class obtained in step 3, and the converted LSF parameters are then fed to the fundamental-frequency conversion rule of that class obtained in step 4 to obtain the converted fundamental frequency. The detailed process is as follows:
(1) The source speech to be converted is first divided into frames, windowed with a Hamming window and subjected to the voiced/unvoiced decision; the STRAIGHT model extracts the fundamental frequency and LSF parameters of each voiced frame. The LSF parameters are fed frame by frame into the trained self-organizing feature map network, and the winning neuron of the competition identifies the class of the voiced frame, completing its classification;
(2) Select the spectral-envelope conversion rule of the corresponding class, feed the LSF parameters of the voiced frame as input into that class's trained spectral-envelope RBF network, and obtain the converted LSF parameters;
(3) Select the fundamental-frequency conversion rule of the corresponding class, feed the converted LSF parameters as input into that class's trained fundamental-frequency RBF network, and multiply the output by 500 to restore the converted fundamental frequency;
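The run-time path of step 5 can be sketched as one function. The per-cluster rules are passed in as callables standing in for the trained RBF networks:

```python
import numpy as np

def convert_frame(lsf, som_W, env_rules, f0_rules):
    """Run-time conversion of one voiced frame, following the patent's order:
    classify with the trained SOM, apply that cluster's spectral-envelope rule,
    then feed the converted LSF to that cluster's F0 rule and multiply by 500.
    `env_rules` / `f0_rules` are per-cluster callables (stand-ins for the
    trained RBF networks)."""
    k = int(np.argmin(np.linalg.norm(som_W - lsf, axis=1)))  # winning node = class
    lsf_conv = env_rules[k](lsf)                             # spectral-envelope rule
    f0_conv = 500.0 * f0_rules[k](lsf_conv)                  # F0 rule, de-scaled
    return lsf_conv, f0_conv
```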
Step 6: speech synthesis. The LSF parameters and fundamental frequencies obtained in steps 4 and 5 are passed to the STRAIGHT model, which finally produces the converted speech:
(1) First convert the converted LSF parameters of each voiced frame into AR parameters, then convert the AR parameters into an energy spectrum (in dB); take the first 513 dimensions of the energy spectrum and convert it from dB back to linear values;
(2) For unvoiced frames, the spectral parameters and fundamental frequency of the source speech frame are used directly as the energy spectrum and converted fundamental frequency;
(3) Feed the resulting energy spectrum and converted fundamental frequencies into the STRAIGHT model to synthesize the converted speech.
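The LSF → AR → energy-spectrum chain of sub-step (1) can be sketched as follows. `lsf2poly` uses the standard reconstruction from the interlaced line spectral frequencies, and the overall gain term is an assumption (the patent does not state one):

```python
import numpy as np

def lsf2poly(lsf):
    """Sorted LSFs (radians) -> AR polynomial, by rebuilding the unit-circle
    polynomials P(z) (root at z = -1) and Q(z) (root at z = +1) and averaging."""
    p = len(lsf)
    def build(ws, seed_poly):
        poly = np.array(seed_poly)
        for w in ws:
            poly = np.convolve(poly, [1.0, -2.0 * np.cos(w), 1.0])
        return poly
    P = build(lsf[0::2], [1.0, 1.0])    # lsf[0], lsf[2], ... belong to P
    Q = build(lsf[1::2], [1.0, -1.0])   # lsf[1], lsf[3], ... belong to Q
    return 0.5 * (P + Q)[: p + 1]

def lsf_to_energy_spectrum(lsf, nbins=513, gain=1.0):
    """Converted LSF -> AR polynomial -> 513-point energy spectrum in dB, then
    back to linear magnitude for STRAIGHT resynthesis, as the patent describes;
    the gain term is an assumption."""
    a = lsf2poly(lsf)
    w = np.pi * np.arange(nbins) / (nbins - 1)
    A = np.polyval(a[::-1], np.exp(-1j * w))    # A(e^{jw}) = sum_k a_k e^{-jkw}
    spec_db = 10.0 * np.log10(gain / np.abs(A) ** 2)
    return 10.0 ** (spec_db / 10.0)             # dB -> linear values
```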

Claims (7)

1. A voice conversion method based on self-organizing feature map network clustering and a radial basis function network, characterized by comprising the following steps:
Step 1: preprocessing, voiced/unvoiced decision and feature-parameter extraction. The input speech signal is pre-emphasized, divided into frames and windowed; the short-time energy and average zero-crossing rate of each frame are computed to make the voiced/unvoiced decision; the STRAIGHT model is then used to extract the LSF parameters and fundamental frequency of each voiced frame;
Step 2: parameter clustering. The extracted source LSF parameters and target LSF parameters are first aligned by dynamic time warping; the self-organizing feature map network then performs self-organized clustering of the source LSF parameters, and the subscripts of the source LSF parameters in each class are recorded. The target LSF parameters corresponding to the source LSF parameters of a given cluster are thereby gathered into a class of their own, realizing the clustering of the target LSF parameters. Similarly, the parameter subscripts returned by the dynamic time warping of the LSF parameters determine the fundamental frequencies corresponding to the time-aligned target LSF parameters, and the recorded source LSF subscripts then realize the clustering of the target fundamental frequencies;
Step 3: establishment of the spectral-envelope conversion rules. For each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output to train a radial basis function network, establishing a spectral-envelope conversion rule for each cluster;
Step 4: establishment of the fundamental-frequency conversion rules. For each cluster, the target LSF parameters are used as input and the corresponding fundamental frequencies as output to train a radial basis function network, establishing a fundamental-frequency conversion rule for each cluster;
Step 5: conversion of the feature parameters. Each voiced frame of the speech to be converted is first classified; its LSF parameters are converted with the spectral-envelope conversion rule of its class obtained in step 3; the converted LSF parameters are then fed to the fundamental-frequency conversion rule of that class obtained in step 4 to obtain the converted fundamental frequency;
Step 6: speech synthesis. The LSF parameters and fundamental frequencies obtained in steps 4 and 5 above are passed to the STRAIGHT model, which finally produces the converted speech.
2. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the detailed process of preprocessing, voiced/unvoiced decision and feature-parameter extraction is as follows:
Step 1: preprocess the speech signal: the pre-emphasis factor is 0.96, the signal is divided into 40 ms frames with a 1 ms frame shift, and each frame is windowed with a Hamming window;
Step 2: compute frame by frame the short-time energy
E_n = Σ_{m=0}^{N-1} x_n(m)²
and the short-time zero-crossing rate
Z_n = (1/2) Σ_{m=1}^{N-1} |sgn[x_n(m)] − sgn[x_n(m−1)]|,
where x_n(m) is the n-th speech frame after windowing and N is the frame length; the double-threshold method is used to make the voiced/unvoiced decision;
Step 3: extract the feature parameters: the STRAIGHT model extracts the fundamental frequency and spectral parameters of each voiced frame; the spectral parameters are then reduced and transformed: an IFFT first yields the autocorrelation coefficients, the autocorrelation coefficients then yield the AR parameters, and finally the linear spectral frequencies (LSF) are derived from the AR parameters, with the LSF dimension set to 16.
3. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the detailed process of parameter clustering is as follows:
Step 1: apply dynamic time warping to the extracted source and target LSF parameters so that the source and target LSF parameters are time-aligned;
Step 2: cluster the source LSF parameters with the self-organizing feature map network, whose input layer has 16 nodes and whose two-dimensional competition layer has 5 rows by 5 columns, 25 neurons in total, so that the source LSF parameters are divided into 25 classes; the subscripts of the source LSF parameters in each class are recorded;
Step 3: according to the recorded subscripts of the source LSF parameters of each class, the target LSF parameters with the same subscripts are gathered into a class, realizing the clustering of the target LSF parameters;
Step 4: similarly, using the alignment subscripts returned by the dynamic time warping of the LSF parameters, find the target fundamental frequencies corresponding to the time-aligned target LSF parameters; the recorded source LSF subscripts then realize the clustering of the target fundamental frequencies.
4. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the spectral-envelope conversion rules are established with a radial basis function network: for each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output for training, with Gaussian basis functions; for an input vector X_n, the k-th component of the output layer is y_k(X_n) = Σ_{j=1}^{M} w_{jk} exp(−‖X_n − X_j‖² / (2σ_j²)), where X_j is the center of the j-th basis function; after training, a spectral-envelope conversion rule is established for each cluster.
5. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the establishment of the fundamental-frequency conversion rules proceeds as follows:
Step 1: scale the fundamental frequencies of the target voiced frames: since the fundamental frequency of a male voice lies roughly in 60~200 Hz and that of a female voice in 60~450 Hz, while the radial basis function network output lies between 0 and 1, each target fundamental frequency is divided by 500;
Step 2: for each cluster, the target LSF parameters are used as input and the scaled target fundamental frequencies as output; the basis functions are Gaussian, and the radial basis function network is trained to establish the fundamental-frequency conversion rule of each cluster.
6. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the conversion of the feature parameters proceeds as follows:
Step 1: divide the speech to be converted into frames and make the voiced/unvoiced decision; the STRAIGHT model extracts the fundamental frequency and LSF parameters of each voiced frame; the LSF parameters are fed one by one into the self-organizing feature map network trained in the training stage, and the winning neuron of the competition identifies the class of the voiced frame, completing its classification;
Step 2: select the spectral-envelope conversion rule of the corresponding class, feed the LSF parameters of the voiced frame as input into that class's trained spectral-envelope radial basis function network, and obtain the converted LSF parameters;
Step 3: select the fundamental-frequency conversion rule of the corresponding class, feed the converted LSF parameters as input into that class's trained fundamental-frequency radial basis function network, and multiply the output by 500 to restore the converted fundamental frequency.
7. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the detailed process of speech synthesis is as follows: first convert the converted LSF parameters of each voiced frame into AR parameters, then convert the AR parameters into an energy spectrum in dB; take the first 513 dimensions of the energy spectrum and convert it from dB back to linear values; for unvoiced frames, use the spectral parameters and fundamental frequency of the source speech frame directly as the energy spectrum and converted fundamental frequency; finally, feed the resulting energy spectrum and converted fundamental frequencies into the STRAIGHT model to synthesize the converted speech.
CN2012100388747A 2012-02-21 2012-02-21 Voice conversion method based on self-organizing feature map network cluster and radial basis network Expired - Fee Related CN102568476B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100388747A CN102568476B (en) 2012-02-21 2012-02-21 Voice conversion method based on self-organizing feature map network cluster and radial basis network


Publications (2)

Publication Number Publication Date
CN102568476A CN102568476A (en) 2012-07-11
CN102568476B true CN102568476B (en) 2013-07-03

Family

ID=46413732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100388747A Expired - Fee Related CN102568476B (en) 2012-02-21 2012-02-21 Voice conversion method based on self-organizing feature map network cluster and radial basis network

Country Status (1)

Country Link
CN (1) CN102568476B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2517503B (en) * 2013-08-23 2016-12-28 Toshiba Res Europe Ltd A speech processing system and method
CN105023574B * 2014-04-30 2018-06-15 iFLYTEK Co., Ltd. A method and system for synthesized speech enhancement
CN104464744A * 2014-11-19 2015-03-25 Changzhou Campus of Hohai University Clustered voice conversion method and system based on a Gaussian mixture random process
CN105390141B * 2015-10-14 2019-10-18 iFLYTEK Co., Ltd. Voice conversion method and device
CN107545903B * 2017-07-19 2020-11-24 Nanjing University of Posts and Telecommunications Voice conversion method based on deep learning
CN108417198A * 2017-12-28 2018-08-17 Central South University Male-female voice conversion method based on spectrum envelope and pitch period
CN109448739B * 2018-12-13 2019-08-23 Shandong Computer Science Center (National Supercomputer Center in Jinan) Vocoder line spectral frequency parameter quantization method based on hierarchical clustering
CN109712634A * 2018-12-24 2019-05-03 Northeastern University Automatic voice conversion method
CN110085255B * 2019-03-27 2021-05-28 Changzhou Campus of Hohai University Gaussian process regression modeling method for voice conversion based on deep kernel learning
CN110536215B * 2019-09-09 2021-06-29 TP-Link Technologies Co., Ltd. Method and apparatus for audio signal processing, computing device, and storage medium
CN113380261B * 2021-05-26 2021-12-31 Terminus Technology Group Co., Ltd. Artificial intelligence voice acquisition processor and method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1185194A (en) * 1997-09-04 1999-03-30 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Voice quality conversion speech synthesis apparatus

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JP H11-85194 A, 1999-03-30
Zuo Guoyu et al., "Voice conversion based on genetic radial basis function neural network," Journal of Chinese Information Processing, Feb. 2004, Vol. 18, No. 1, pp. 78-84 *
Chen Zhi et al., "Fundamental frequency trajectory conversion algorithm and its application in voice conversion systems," Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), 2010, Vol. 30, No. 5, pp. 83-87 *

Also Published As

Publication number Publication date
CN102568476A (en) 2012-07-11

Similar Documents

Publication Publication Date Title
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
Palo et al. Wavelet based feature combination for recognition of emotions
CN102509547B Method and system for voiceprint recognition based on vector quantization
CN102800316B Optimal codebook design method for voiceprint recognition system based on neural network
CN102231278B (en) Method and system for realizing automatic addition of punctuation marks in speech recognition
CN101178896B Unit selection speech synthesis method based on acoustic statistical model
CN103871426A Method and system for comparing similarity between user audio and original audio
CN102982803A (en) Isolated word speech recognition method based on HRSF and improved DTW algorithm
CN111210803B System and method for training cloned timbre and prosody based on bottleneck features
CN113539232B Speech synthesis method based on a MOOC speech dataset
CN102237083A (en) Portable interpretation system based on WinCE platform and language recognition method thereof
CN103021418A Voice conversion method oriented to multi-timescale prosodic features
CN106024010A (en) Speech signal dynamic characteristic extraction method based on formant curves
CN114495969A (en) Voice recognition method integrating voice enhancement
Singh et al. Spectral modification based data augmentation for improving end-to-end ASR for children's speech
Xue et al. Cross-modal information fusion for voice spoofing detection
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
Guo et al. Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
CN111081273A (en) Voice emotion recognition method based on glottal wave signal feature extraction
Jin et al. Speaker verification based on single channel speech separation
Djeffal et al. Noise-robust speech recognition: A comparative analysis of LSTM and CNN approaches
Li et al. Emotion recognition from speech with StarGAN and Dense‐DCNN
CN103226946B Voice synthesis method based on restricted Boltzmann machine
Zhao et al. Research on voice cloning with a few samples

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130703

Termination date: 20160221