CN102568476B - Voice conversion method based on self-organizing feature map network cluster and radial basis network - Google Patents


Info

Publication number
CN102568476B
Authority
CN
China
Prior art keywords: parameter, fundamental frequency, LSF parameter, cluster, conversion
Prior art date
Legal status: Expired - Fee Related
Application number
CN2012100388747A
Other languages
Chinese (zh)
Other versions
CN102568476A (en)
Inventor
解伟超
张玲华
Current Assignee: Nanjing University of Posts and Telecommunications
Original Assignee: Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN2012100388747A
Publication of CN102568476A
Application granted
Publication of CN102568476B
Status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voice conversion method based on self-organizing feature map (SOM) network clustering and a radial basis function (RBF) network, belonging to the technical field of speech signal processing. The method comprises the following steps: preprocessing, voiced/unvoiced decision and feature-parameter extraction; parameter clustering; establishment of spectral-envelope conversion rules; establishment of fundamental-frequency conversion rules; feature-parameter conversion; and speech synthesis. The feature parameters of the source speech are divided into several clusters, the corresponding target feature parameters are divided into clusters in one-to-one correspondence with the source clusters, and a conversion rule is established for each cluster. Partitioning the training data in this way reduces the time complexity of training, so the converted speech has good naturalness. When the feature parameters are converted, the fundamental frequency is tied to the spectral envelope and a conversion relation between the two is established, overcoming the drawbacks of converting the fundamental frequency in isolation and giving the converted fundamental frequency more of the target speaker's characteristics.

Description

Voice conversion method based on self-organizing feature map network clustering and radial basis function network
Technical field
The present invention relates to voice conversion techniques, and in particular to a voice conversion method based on self-organizing feature map (SOM) network clustering and a radial basis function (RBF) network; it belongs to the field of speech signal processing.
Background technology
Voice conversion is an emerging research branch of speech signal processing. It builds on the research foundations of speaker recognition and speech synthesis, and both enriches and extends those two branches, while not being entirely subsumed by either.
The goal of voice conversion is to change the personal characteristics in the source speaker's speech so that it takes on the personal characteristics of the target speaker, while keeping the semantic information unchanged, so that the converted speech sounds like the target speaker's voice.
A voice conversion system comprises a training stage and a conversion stage. In the training stage, the system analyzes the parameters of the source and target speakers' speech and establishes conversion rules. In the conversion stage, features are first extracted from the source speech and then converted into target speech features according to the conversion rules obtained in the training stage.
The key issues in voice conversion are the extraction of the speaker's personal characteristics and the establishment of the conversion rules. Two decades of development have produced a large body of research results; current work on speech feature parameters mainly concerns the spectral-envelope parameters and the fundamental frequency. Existing spectral-envelope conversion methods include those based on the linear prediction coding model (Linear Prediction Coding, LPC), the Gaussian mixture model (Gaussian Mixture Model, GMM) and the harmonic plus noise model (Harmonic plus Noise Model, HNM). However, these methods train directly on the extracted parameters and establish a single unified conversion rule. Because speech signals are time-varying and non-stationary, and the amount of training data is huge, a single conversion rule cannot accurately describe the mapping between the feature parameters of the source speech and those of the target speech, which inevitably causes distortion. (1. Zad-Issa, M.R., Kabal, P. Smoothing the Evolution of the Spectral Parameters in Linear Prediction of Speech using Target Matching. ICASSP, 1997: vol. 3, 1699-1702. 2. Daojian Zeng, Yibiao Yu. Voice Conversion using Structured Gaussian Mixture Model. ICSP, 2010: 541-544. 3. Hu H.T., Yu C., Lin C.H. HNM parameter transform for voice conversion using a HMM-WDLT framework. ICIMA, 2010: vol. 2, 282-287.)
Existing fundamental-frequency conversion methods include the mean-value conversion method and Gaussian-model methods, but they all convert the spectral-envelope parameters and the fundamental frequency separately, with no link between the two conversions. Yet the spectral envelope and the fundamental frequency come from the same speech signal, and a growing body of research shows a close relationship between them, so the traditional practice of converting the two kinds of parameters independently inevitably degrades the quality of the synthesized speech. (1. Lee K.S., Doh W., Youn D.H. Voice conversion using low dimensional vector mapping. IEICE Transactions on Information & Systems, 2002, E85(D): 1297-1305. 2. L.M. Arslan. Speaker Transformation Algorithm using Segmental Codebooks (STASC). Speech Communication, Jul. 1999: vol. 28, no. 3, pp. 211-226.)
Summary of the invention
The object of the present invention is to provide a voice conversion method that combines the time-domain characteristics of speech with the speaker's personal characteristics under the condition of parallel text, obtaining a more accurate conversion rule so that the speaker's personal characteristics in the converted speech are strengthened and its acoustic quality is improved.
To achieve the above object, the present invention adopts the following technical scheme.
A voice conversion method based on self-organizing feature map network clustering and a radial basis function network comprises the following concrete steps:
Step 1: preprocessing, voiced/unvoiced decision and feature-parameter extraction. The input speech signal is pre-emphasized, divided into frames and windowed; the short-time energy and average zero-crossing rate of each frame are computed to make the voiced/unvoiced decision; the STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) model is then used to extract the LSF (Linear Spectral Frequency) parameters and fundamental frequency of each voiced frame;
Step 2: parameter clustering. The extracted source LSF parameters and target LSF parameters are first aligned by dynamic time warping; the self-organizing feature map network then performs self-organized clustering of the source LSF parameters, and the subscripts (frame indices) of the source LSF parameters in each class are recorded. The target LSF parameters corresponding to the source LSF parameters of a given cluster are thereby gathered into a class of their own, realizing the clustering of the target LSF parameters. Similarly, the parameter subscripts returned by the dynamic time warping of the LSF parameters determine the fundamental frequencies corresponding to the time-aligned target LSF parameters, and the recorded source LSF subscripts then realize the clustering of the target fundamental frequencies;
Step 3: establishment of the spectral-envelope conversion rules. For each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output to train an RBF (Radial Basis Function) network, establishing a spectral-envelope conversion rule for each cluster;
Step 4: establishment of the fundamental-frequency conversion rules. For each cluster, the target LSF parameters are used as input and the corresponding fundamental frequencies as output to train an RBF network, establishing a fundamental-frequency conversion rule for each cluster;
Step 5: conversion of the feature parameters. Each voiced frame of the speech to be converted is first classified; its LSF parameters are converted with the spectral-envelope conversion rule of its class obtained in step 3; the converted LSF parameters are then fed to the fundamental-frequency conversion rule of that class obtained in step 4 to obtain the converted fundamental frequency;
Step 6: speech synthesis. The LSF parameters and fundamental frequencies obtained in steps 4 and 5 above are passed to the STRAIGHT model, which finally produces the converted speech.
Compared with the prior art, the present invention has the following notable advantages. (1) Guided by the theory of speech feature-parameter mapping, the source speech feature parameters are divided into several clusters, the corresponding target feature parameters are divided into classes in one-to-one correspondence with the source clusters, and a conversion rule is established for each cluster separately. Partitioning the training data not only reduces the time complexity of training but also, by exploiting the short-time quasi-periodicity of speech, lets each cluster's conversion rule reflect that class's mapping more accurately, so the converted speech has good naturalness. (2) When the feature parameters are converted, the fundamental frequency is linked to the spectral envelope and a conversion relation between the two is established, overcoming the shortcomings of existing isolated fundamental-frequency conversion and giving the converted fundamental frequency more of the target speaker's characteristics.
The present invention is described in further detail below in conjunction with the accompanying drawings.
Description of drawings
Fig. 1 is a schematic diagram of the voice conversion method of the present invention based on self-organizing feature map network clustering and a radial basis function network;
Fig. 2 is a schematic diagram of the clustering of the LSF parameters and the establishment of their conversion rules;
Fig. 3 is a schematic diagram of the clustering of the fundamental frequency and the establishment of its conversion rules;
Fig. 4 is a schematic diagram of the conversion of the parameters of the i-th voiced frame and of speech synthesis.
Embodiment
With reference to Fig. 1, the voice conversion method of the present invention based on self-organizing feature map network clustering and a radial basis function network proceeds as follows:
Step 1: in the training stage, perform preprocessing, voiced/unvoiced decision and feature-parameter extraction. The input speech signal is pre-emphasized, divided into frames and windowed; the short-time energy and average zero-crossing rate of each frame are computed to make the voiced/unvoiced decision; the STRAIGHT model is then used to extract the fundamental frequency and linear spectral frequency (LSF) parameters of each voiced frame. The detailed process is as follows:
(1) Preprocess the speech signal: the pre-emphasis factor is 0.96, the signal is divided into 40 ms frames with a 1 ms frame shift, and each frame is windowed with a Hamming window;
(2) Compute frame by frame the short-time energy
E_n = Σ_{m=0}^{N-1} x_n(m)²
and the short-time zero-crossing rate
Z_n = (1/2) Σ_{m=1}^{N-1} |sgn[x_n(m)] − sgn[x_n(m−1)]|,
where x_n(m) is the n-th speech frame after windowing and N is the frame length; the double-threshold method is used to make the voiced/unvoiced decision;
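For illustration, the preprocessing and voiced/unvoiced decision above can be sketched in NumPy as follows. Only the pre-emphasis factor 0.96, the 40 ms frame, the 1 ms shift and the Hamming window come from the patent; the function names and the energy/ZCR threshold values are our own assumptions:

```python
import numpy as np

def preprocess(x, fs, alpha=0.96, frame_ms=40, hop_ms=1):
    """Pre-emphasis (factor 0.96), 40 ms frames with a 1 ms shift, Hamming
    window -- the settings given in the patent."""
    x = np.append(x[0], x[1:] - alpha * x[:-1])      # pre-emphasis: 1 - 0.96 z^-1
    n = int(frame_ms * fs / 1000)
    hop = int(hop_ms * fs / 1000)
    starts = np.arange(0, len(x) - n + 1, hop)
    return np.stack([x[s:s + n] for s in starts]) * np.hamming(n)

def short_time_energy(frames):
    # E_n = sum_m x_n(m)^2
    return np.sum(frames ** 2, axis=1)

def short_time_zcr(frames):
    # Z_n = 1/2 * sum_m |sgn x_n(m) - sgn x_n(m-1)|
    return 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)

def voiced_mask(frames, energy_thr, zcr_thr):
    """Double-threshold voiced/unvoiced decision: a voiced frame has high
    energy and a low zero-crossing rate (threshold values are not specified
    in the patent and must be tuned)."""
    return (short_time_energy(frames) > energy_thr) & (short_time_zcr(frames) < zcr_thr)
```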
(3) Extract the feature parameters: the STRAIGHT model extracts the fundamental frequency and the spectral parameters of each voiced frame. Since each frame's spectral parameters have 513 dimensions, they must be reduced and transformed: an IFFT (Inverse Fast Fourier Transform) of the spectral parameters first yields the autocorrelation coefficients, the Levinson-Durbin algorithm then yields the AR (Autoregressive) model parameters, and the linear spectral frequencies (LSF) are further derived from the AR parameters, with the LSF dimension set to 16. In this way the LSF parameters and fundamental frequency of each voiced frame are obtained;
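A minimal sketch of this dimensionality-reduction chain (mirrored power spectrum → IFFT → autocorrelation → Levinson-Durbin → AR → LSF). The root-finding route from the AR polynomial to the LSFs is a standard method we assume here, since the patent does not spell it out:

```python
import numpy as np

def levinson(r, order):
    """Levinson-Durbin recursion: autocorrelation r[0..order] -> AR coefficients."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        e *= 1.0 - k * k
    return a

def poly2lsf(a):
    """AR polynomial A(z) -> sorted line spectral frequencies (radians), from the
    roots of P(z) = A(z) + z^-(p+1) A(z^-1) and Q(z) = A(z) - z^-(p+1) A(z^-1)."""
    a1 = np.append(a, 0.0)
    P, Q = a1 + a1[::-1], a1 - a1[::-1]
    ang = np.concatenate([np.angle(np.roots(P)), np.angle(np.roots(Q))])
    return np.sort(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])  # drop roots at z = +/-1

def spectrum_to_lsf(power_spec, order=16):
    """513-bin STRAIGHT power spectrum -> 16-dim LSF (the patent's choice):
    the IFFT of the mirrored spectrum gives the autocorrelation, Levinson-Durbin
    gives the AR model, and its sum/difference polynomials give the LSFs."""
    full = np.concatenate([power_spec, power_spec[-2:0:-1]])  # mirror to a full FFT
    r = np.real(np.fft.ifft(full))[:order + 1]
    return poly2lsf(levinson(r, order))
```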
Step 2: parameter clustering, as shown in Fig. 2. The extracted source and target LSF parameters are first aligned by dynamic time warping; the self-organizing feature map network then clusters the source LSF parameters, and the subscripts of the source LSF parameters in each class are recorded, so the target LSF parameters time-aligned with the source LSF parameters of a given cluster are likewise gathered into a class, realizing the clustering of the target LSF parameters. Similarly, as shown in Fig. 3, the alignment subscripts returned by the dynamic time warping of the LSF parameters identify the fundamental frequencies corresponding to the aligned target LSF parameters, and the recorded source LSF subscripts then realize the clustering of the target fundamental frequencies. The detailed process is as follows:
(1) Apply dynamic time warping (Dynamic Time Warping, DTW) to the extracted source and target LSF parameters so that the source and target LSF parameters are time-aligned;
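Sub-step (1) can be sketched as a textbook DTW with a Euclidean frame distance. The patent does not specify the local path constraints, so the symmetric (diagonal/up/left) recursion below is an assumption:

```python
import numpy as np

def dtw_align(X, Y):
    """Dynamic time warping of two LSF sequences (frames x dims). Returns the
    aligned (source, target) frame-index pairs; the patent keeps exactly these
    subscripts so the target parameters can inherit the source clustering."""
    n, m = len(X), len(Y)
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # frame distances
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:                      # backtrack the optimal warp
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]
```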
(2) Arrange the aligned source and target LSF parameters into matrices, each row representing one frame of LSF parameters. The self-organizing feature map network then clusters the source LSF parameters: its input layer has 16 nodes and its two-dimensional competition layer has 5 rows by 5 columns, 25 neurons in total. After training, each competition-layer node represents one class of parameters, so the source LSF parameters are divided into 25 classes, and the subscripts of the source LSF parameters in each class are recorded;
(3) According to the recorded subscripts of the source LSF parameters of each class, the target LSF parameters with the same subscripts are gathered into a class, realizing the clustering of the target LSF parameters;
(4) Similarly, using the subscripts of the aligned target LSF parameters returned by each dynamic time warping, select the fundamental frequencies of the target voiced frames at the same subscript positions, so that the aligned target LSF parameters correspond to the target fundamental frequencies; then, using the recorded source LSF subscripts of each class, the target fundamental frequencies with the same subscripts are also gathered into a class;
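Sub-steps (2)-(3) can be sketched with a minimal SOM. The 16-dimensional input layer and 5x5 competition layer follow the patent, while the learning-rate and neighborhood decay schedules are our assumptions:

```python
import numpy as np

def train_som(data, rows=5, cols=5, iters=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal self-organizing feature map: a 16-dim input layer and a 5x5
    competition layer (25 nodes), as specified in the patent. The decay
    schedules below are assumptions, not from the patent."""
    rng = np.random.default_rng(seed)
    W = rng.random((rows * cols, data.shape[1]))
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        win = int(np.argmin(np.linalg.norm(W - x, axis=1)))   # competition: winner
        lr = lr0 * np.exp(-t / iters)
        sig = sigma0 * np.exp(-t / iters)
        h = np.exp(-np.sum((grid - grid[win]) ** 2, axis=1) / (2 * sig ** 2))
        W += lr * h[:, None] * (x - W)                        # neighborhood update
    return W

def assign_clusters(data, W):
    """Winning node index = cluster label for every frame; the recorded frame
    subscripts per cluster carry the partition over to the target parameters."""
    return np.argmin(np.linalg.norm(data[:, None, :] - W[None, :, :], axis=2), axis=1)
```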
Step 3: establishment of the spectral-envelope conversion rules, as shown in Fig. 2. For each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output to train an RBF network whose basis functions are Gaussian, learning the relation between the input and output vectors. For an input vector X_n, the k-th component of the output layer is
y_k(X_n) = Σ_{j=1}^{M} w_{jk} exp(−‖X_n − X_j‖² / (2σ_j²)),
where X_j is the center of the j-th basis function, σ_j its width and w_{jk} the corresponding output weight (M is the number of hidden nodes). After training, a spectral-envelope conversion rule is established for each cluster;
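A sketch of one cluster's Gaussian-RBF mapping. The patent fixes the model form but not the training algorithm, so the center selection (drawn from the training inputs), the shared width and the least-squares solve for the output weights are assumptions:

```python
import numpy as np

def train_rbf(X, Y, n_centers=None, seed=0):
    """Gaussian-basis RBF regression, a sketch of a per-cluster LSF->LSF rule.
    Centers are drawn from the training inputs and the output weights are
    solved by linear least squares (both choices are assumptions)."""
    rng = np.random.default_rng(seed)
    m = len(X) if n_centers is None else min(n_centers, len(X))
    C = X[rng.choice(len(X), size=m, replace=False)]
    sigma = np.mean(np.linalg.norm(C[:, None] - C[None, :], axis=2)) + 1e-8
    Phi = np.exp(-np.linalg.norm(X[:, None] - C[None, :], axis=2) ** 2 / (2 * sigma ** 2))
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return C, sigma, W

def rbf_predict(X, C, sigma, W):
    # y_k(x) = sum_j w_jk * exp(-||x - c_j||^2 / (2 sigma^2))
    Phi = np.exp(-np.linalg.norm(X[:, None] - C[None, :], axis=2) ** 2 / (2 * sigma ** 2))
    return Phi @ W
```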
Step 4: establishment of the fundamental-frequency conversion rules, as shown in Fig. 3. For each cluster, the target LSF parameters are used as input and the corresponding target fundamental frequencies as output to train an RBF network, establishing a fundamental-frequency conversion rule for each cluster. The detailed process is as follows:
(1) First scale the fundamental frequencies of the target voiced frames: since the fundamental frequency of a male voice lies roughly in 60~200 Hz and that of a female voice in 60~450 Hz, while the RBF network output lies between 0 and 1, each target fundamental frequency is divided by 500;
(2) For each cluster, the target LSF parameters are used as input and the scaled target fundamental frequencies as output; the basis functions are Gaussian, and the RBF network is trained to establish the fundamental-frequency conversion rule of each cluster;
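Sub-steps (1)-(2) can be sketched as follows. The division by 500 and the Gaussian basis come from the patent; the center choice and least-squares training are assumptions:

```python
import numpy as np

def train_f0_rule(lsf_t, f0_t, seed=0):
    """One cluster's F0 rule: target LSF in, target F0 out, with F0 divided by
    500 so it lies in [0, 1] (male pitch spans roughly 60-200 Hz, female
    60-450 Hz, per the patent). The Gaussian-RBF fit itself is a sketch."""
    rng = np.random.default_rng(seed)
    C = lsf_t[rng.choice(len(lsf_t), size=len(lsf_t), replace=False)]
    sigma = np.mean(np.linalg.norm(C[:, None] - C[None, :], axis=2)) + 1e-8
    Phi = np.exp(-np.linalg.norm(lsf_t[:, None] - C[None, :], axis=2) ** 2 / (2 * sigma ** 2))
    w, *_ = np.linalg.lstsq(Phi, f0_t / 500.0, rcond=None)     # scale F0 into [0, 1]
    return C, sigma, w

def predict_f0(lsf, C, sigma, w):
    Phi = np.exp(-np.linalg.norm(lsf[:, None] - C[None, :], axis=2) ** 2 / (2 * sigma ** 2))
    return 500.0 * (Phi @ w)                                    # undo the scaling
```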
Step 5: conversion of the feature parameters, as shown in Fig. 4. Each voiced frame of the speech to be converted is first classified; its LSF parameters are converted with the spectral-envelope conversion rule of its class obtained in step 3, and the converted LSF parameters are then fed to the fundamental-frequency conversion rule of that class obtained in step 4 to obtain the converted fundamental frequency. The detailed process is as follows:
(1) The source speech to be converted is first divided into frames, windowed with a Hamming window and subjected to the voiced/unvoiced decision; the STRAIGHT model extracts the fundamental frequency and LSF parameters of each voiced frame. The LSF parameters are fed frame by frame into the trained self-organizing feature map network, and the winning neuron of the competition identifies the class of the voiced frame, completing its classification;
(2) Select the spectral-envelope conversion rule of the corresponding class, feed the LSF parameters of the voiced frame as input into that class's trained spectral-envelope RBF network, and obtain the converted LSF parameters;
(3) Select the fundamental-frequency conversion rule of the corresponding class, feed the converted LSF parameters as input into that class's trained fundamental-frequency RBF network, and multiply the output by 500 to restore the converted fundamental frequency;
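The run-time path of step 5 can be sketched as one function. The per-cluster rules are passed in as callables standing in for the trained RBF networks:

```python
import numpy as np

def convert_frame(lsf, som_W, env_rules, f0_rules):
    """Run-time conversion of one voiced frame, following the patent's order:
    classify with the trained SOM, apply that cluster's spectral-envelope rule,
    then feed the converted LSF to that cluster's F0 rule and multiply by 500.
    `env_rules` / `f0_rules` are per-cluster callables (stand-ins for the
    trained RBF networks)."""
    k = int(np.argmin(np.linalg.norm(som_W - lsf, axis=1)))  # winning node = class
    lsf_conv = env_rules[k](lsf)                             # spectral-envelope rule
    f0_conv = 500.0 * f0_rules[k](lsf_conv)                  # F0 rule, de-scaled
    return lsf_conv, f0_conv
```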
Step 6: speech synthesis. The LSF parameters and fundamental frequencies obtained in steps 4 and 5 are passed to the STRAIGHT model, which finally produces the converted speech:
(1) First convert the converted LSF parameters of each voiced frame into AR parameters, then convert the AR parameters into an energy spectrum (in dB); take the first 513 dimensions of the energy spectrum and convert it from dB back to linear values;
(2) For unvoiced frames, the spectral parameters and fundamental frequency of the source speech frame are used directly as the energy spectrum and converted fundamental frequency;
(3) Feed the resulting energy spectrum and converted fundamental frequencies into the STRAIGHT model to synthesize the converted speech.
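The LSF → AR → energy-spectrum chain of sub-step (1) can be sketched as follows. `lsf2poly` uses the standard reconstruction from the interlaced line spectral frequencies, and the overall gain term is an assumption (the patent does not state one):

```python
import numpy as np

def lsf2poly(lsf):
    """Sorted LSFs (radians) -> AR polynomial, by rebuilding the unit-circle
    polynomials P(z) (root at z = -1) and Q(z) (root at z = +1) and averaging."""
    p = len(lsf)
    def build(ws, seed_poly):
        poly = np.array(seed_poly)
        for w in ws:
            poly = np.convolve(poly, [1.0, -2.0 * np.cos(w), 1.0])
        return poly
    P = build(lsf[0::2], [1.0, 1.0])    # lsf[0], lsf[2], ... belong to P
    Q = build(lsf[1::2], [1.0, -1.0])   # lsf[1], lsf[3], ... belong to Q
    return 0.5 * (P + Q)[: p + 1]

def lsf_to_energy_spectrum(lsf, nbins=513, gain=1.0):
    """Converted LSF -> AR polynomial -> 513-point energy spectrum in dB, then
    back to linear magnitude for STRAIGHT resynthesis, as the patent describes;
    the gain term is an assumption."""
    a = lsf2poly(lsf)
    w = np.pi * np.arange(nbins) / (nbins - 1)
    A = np.polyval(a[::-1], np.exp(-1j * w))    # A(e^{jw}) = sum_k a_k e^{-jkw}
    spec_db = 10.0 * np.log10(gain / np.abs(A) ** 2)
    return 10.0 ** (spec_db / 10.0)             # dB -> linear values
```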

Claims (7)

1. A voice conversion method based on self-organizing feature map network clustering and a radial basis function network, characterized by comprising the following steps:
Step 1: preprocessing, voiced/unvoiced decision and feature-parameter extraction. The input speech signal is pre-emphasized, divided into frames and windowed; the short-time energy and average zero-crossing rate of each frame are computed to make the voiced/unvoiced decision; the STRAIGHT model is then used to extract the LSF parameters and fundamental frequency of each voiced frame;
Step 2: parameter clustering. The extracted source LSF parameters and target LSF parameters are first aligned by dynamic time warping; the self-organizing feature map network then performs self-organized clustering of the source LSF parameters, and the subscripts of the source LSF parameters in each class are recorded. The target LSF parameters corresponding to the source LSF parameters of a given cluster are thereby gathered into a class of their own, realizing the clustering of the target LSF parameters. Similarly, the parameter subscripts returned by the dynamic time warping of the LSF parameters determine the fundamental frequencies corresponding to the time-aligned target LSF parameters, and the recorded source LSF subscripts then realize the clustering of the target fundamental frequencies;
Step 3: establishment of the spectral-envelope conversion rules. For each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output to train a radial basis function network, establishing a spectral-envelope conversion rule for each cluster;
Step 4: establishment of the fundamental-frequency conversion rules. For each cluster, the target LSF parameters are used as input and the corresponding fundamental frequencies as output to train a radial basis function network, establishing a fundamental-frequency conversion rule for each cluster;
Step 5: conversion of the feature parameters. Each voiced frame of the speech to be converted is first classified; its LSF parameters are converted with the spectral-envelope conversion rule of its class obtained in step 3; the converted LSF parameters are then fed to the fundamental-frequency conversion rule of that class obtained in step 4 to obtain the converted fundamental frequency;
Step 6: speech synthesis. The LSF parameters and fundamental frequencies obtained in steps 4 and 5 above are passed to the STRAIGHT model, which finally produces the converted speech.
2. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the detailed process of preprocessing, voiced/unvoiced decision and feature-parameter extraction is as follows:
Step 1: preprocess the speech signal: the pre-emphasis factor is 0.96, the signal is divided into 40 ms frames with a 1 ms frame shift, and each frame is windowed with a Hamming window;
Step 2: compute frame by frame the short-time energy
E_n = Σ_{m=0}^{N-1} x_n(m)²
and the short-time zero-crossing rate
Z_n = (1/2) Σ_{m=1}^{N-1} |sgn[x_n(m)] − sgn[x_n(m−1)]|,
where x_n(m) is the n-th speech frame after windowing and N is the frame length; the double-threshold method is used to make the voiced/unvoiced decision;
Step 3: extract the feature parameters: the STRAIGHT model extracts the fundamental frequency and spectral parameters of each voiced frame; the spectral parameters are then reduced and transformed: an IFFT first yields the autocorrelation coefficients, the autocorrelation coefficients then yield the AR parameters, and finally the linear spectral frequencies (LSF) are derived from the AR parameters, with the LSF dimension set to 16.
3. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the detailed process of parameter clustering is as follows:
Step 1: apply dynamic time warping to the extracted source and target LSF parameters so that the source and target LSF parameters are time-aligned;
Step 2: cluster the source LSF parameters with the self-organizing feature map network, whose input layer has 16 nodes and whose two-dimensional competition layer has 5 rows by 5 columns, 25 neurons in total, so that the source LSF parameters are divided into 25 classes; the subscripts of the source LSF parameters in each class are recorded;
Step 3: according to the recorded subscripts of the source LSF parameters of each class, the target LSF parameters with the same subscripts are gathered into a class, realizing the clustering of the target LSF parameters;
Step 4: similarly, using the alignment subscripts returned by the dynamic time warping of the LSF parameters, find the target fundamental frequencies corresponding to the time-aligned target LSF parameters; the recorded source LSF subscripts then realize the clustering of the target fundamental frequencies.
4. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the spectral-envelope conversion rules are established with a radial basis function network: for each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output for training, with Gaussian basis functions; for an input vector X_n, the k-th component of the output layer is y_k(X_n) = Σ_{j=1}^{M} w_{jk} exp(−‖X_n − X_j‖² / (2σ_j²)), where X_j is the center of the j-th basis function; after training, a spectral-envelope conversion rule is established for each cluster.
5. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the establishment of the fundamental-frequency conversion rules proceeds as follows:
Step 1: scale the fundamental frequencies of the target voiced frames: since the fundamental frequency of a male voice lies roughly in 60~200 Hz and that of a female voice in 60~450 Hz, while the radial basis function network output lies between 0 and 1, each target fundamental frequency is divided by 500;
Step 2: for each cluster, the target LSF parameters are used as input and the scaled target fundamental frequencies as output; the basis functions are Gaussian, and the radial basis function network is trained to establish the fundamental-frequency conversion rule of each cluster.
6. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the conversion of the feature parameters proceeds as follows:
Step 1: divide the speech to be converted into frames and make the voiced/unvoiced decision; the STRAIGHT model extracts the fundamental frequency and LSF parameters of each voiced frame; the LSF parameters are fed one by one into the self-organizing feature map network trained in the training stage, and the winning neuron of the competition identifies the class of the voiced frame, completing its classification;
Step 2: select the spectral-envelope conversion rule of the corresponding class, feed the LSF parameters of the voiced frame as input into that class's trained spectral-envelope radial basis function network, and obtain the converted LSF parameters;
Step 3: select the fundamental-frequency conversion rule of the corresponding class, feed the converted LSF parameters as input into that class's trained fundamental-frequency radial basis function network, and multiply the output by 500 to restore the converted fundamental frequency.
7. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the detailed process of speech synthesis is as follows: first convert the converted LSF parameters of each voiced frame into AR parameters, then convert the AR parameters into an energy spectrum in dB; take the first 513 dimensions of the energy spectrum and convert it from dB back to linear values; for unvoiced frames, use the spectral parameters and fundamental frequency of the source speech frame directly as the energy spectrum and converted fundamental frequency; finally, feed the resulting energy spectrum and converted fundamental frequencies into the STRAIGHT model to synthesize the converted speech.
CN2012100388747A 2012-02-21 2012-02-21 Voice conversion method based on self-organizing feature map network cluster and radial basis network Expired - Fee Related CN102568476B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100388747A CN102568476B (en) 2012-02-21 2012-02-21 Voice conversion method based on self-organizing feature map network cluster and radial basis network


Publications (2)

Publication Number Publication Date
CN102568476A CN102568476A (en) 2012-07-11
CN102568476B true CN102568476B (en) 2013-07-03

Family

ID=46413732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100388747A Expired - Fee Related CN102568476B (en) 2012-02-21 2012-02-21 Voice conversion method based on self-organizing feature map network cluster and radial basis network

Country Status (1)

Country Link
CN (1) CN102568476B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2517503B (en) * 2013-08-23 2016-12-28 Toshiba Res Europe Ltd A speech processing system and method
CN105023574B * 2014-04-30 2018-06-15 iFLYTEK Co., Ltd. A method and system for synthesized speech enhancement
CN104464744A * 2014-11-19 2015-03-25 Changzhou Campus of Hohai University Clustered voice conversion method and system based on a Gaussian mixture random process
CN105390141B * 2015-10-14 2019-10-18 iFLYTEK Co., Ltd. Voice conversion method and device
CN107545903B * 2017-07-19 2020-11-24 Nanjing University of Posts and Telecommunications Voice conversion method based on deep learning
CN108417198A * 2017-12-28 2018-08-17 Central South University Male-female voice conversion method based on spectrum envelope and pitch period
CN109448739B * 2018-12-13 2019-08-23 Shandong Computer Science Center (National Supercomputer Center in Jinan) Vocoder line spectral frequency parameter quantization method based on hierarchical clustering
CN109712634A * 2018-12-24 2019-05-03 Northeastern University Automatic voice conversion method
CN110085255B * 2019-03-27 2021-05-28 Changzhou Campus of Hohai University Gaussian process regression modeling method for voice conversion based on deep kernel learning
CN110536215B * 2019-09-09 2021-06-29 TP-Link Technologies Co., Ltd. Method and apparatus for audio signal processing, computing device, and storage medium
CN113380261B * 2021-05-26 2021-12-31 Terminus Technology Group Co., Ltd. Artificial intelligence voice acquisition processor and method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1185194A (en) * 1997-09-04 1999-03-30 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Voice quality conversion speech synthesis apparatus

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JP H11-85194 A, 1999-03-30
Zuo Guoyu et al., "Voice conversion based on genetic radial basis function neural network," Journal of Chinese Information Processing, Feb. 2004, Vol. 18, No. 1, pp. 78-84 *
Chen Zhi et al., "Fundamental frequency trajectory conversion algorithm and its application in voice conversion systems," Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), 2010, Vol. 30, No. 5, pp. 83-87 *

Also Published As

Publication number Publication date
CN102568476A (en) 2012-07-11

Similar Documents

Publication Publication Date Title
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
Palo et al. Wavelet based feature combination for recognition of emotions
CN102509547B Method and system for voiceprint recognition based on vector quantization
CN102800316B Optimal codebook design method for voiceprint recognition system based on neural network
CN102231278B (en) Method and system for realizing automatic addition of punctuation marks in speech recognition
CN101178896B Unit selection speech synthesis method based on acoustic statistical model
CN103871426A Method and system for comparing similarity between user audio and original audio
CN102982803A (en) Isolated word speech recognition method based on HRSF and improved DTW algorithm
CN111210803B System and method for training cloned timbre and prosody based on bottleneck features
CN113539232B Speech synthesis method based on a MOOC speech dataset
CN102237083A (en) Portable interpretation system based on WinCE platform and language recognition method thereof
CN103021418A Voice conversion method oriented to multi-timescale prosodic features
CN106024010A (en) Speech signal dynamic characteristic extraction method based on formant curves
CN114495969A (en) Voice recognition method integrating voice enhancement
Singh et al. Spectral modification based data augmentation for improving end-to-end ASR for children's speech
Xue et al. Cross-modal information fusion for voice spoofing detection
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
Guo et al. Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
CN111081273A (en) Voice emotion recognition method based on glottal wave signal feature extraction
Jin et al. Speaker verification based on single channel speech separation
Djeffal et al. Noise-robust speech recognition: A comparative analysis of LSTM and CNN approaches
Li et al. Emotion recognition from speech with StarGAN and Dense‐DCNN
CN103226946B Voice synthesis method based on restricted Boltzmann machine
Zhao et al. Research on voice cloning with a few samples

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130703

Termination date: 20160221