CN102568476A - Voice conversion method based on self-organizing feature map network clustering and radial basis function network


Info

Publication number: CN102568476A (application); CN102568476B (granted)
Application number: CN2012100388747A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 解伟超 (Xie Weichao), 张玲华 (Zhang Linghua)
Assignee: Nanjing University of Posts and Telecommunications
Legal status: Granted; Expired - Fee Related
Prior art keywords: parameter, fundamental frequency, LSF parameter, cluster, conversion

Abstract

The invention discloses a voice conversion method based on self-organizing feature map (SOM) network clustering and a radial basis function (RBF) network, belonging to the technical field of speech signal processing. The method comprises the following steps: preprocessing, voiced/unvoiced decision, and feature parameter extraction; parameter clustering; establishment of spectral envelope conversion rules; establishment of fundamental frequency conversion rules; conversion of the feature parameters; and speech synthesis. In this method, the feature parameters of the source speech are divided into several clusters, the corresponding target feature parameters are divided into clusters in one-to-one correspondence with the source clusters, and a conversion rule is trained for each cluster. Partitioning the training data in this way reduces the time complexity of training and gives the converted speech good naturalness. When the feature parameters are converted, the fundamental frequency is coupled to the spectral envelope through an explicit conversion relation between the two, overcoming the drawbacks of converting the fundamental frequency in isolation, so that the converted fundamental frequency carries more of the target speaker's characteristics.

Description

Voice conversion method based on self-organizing feature map network clustering and radial basis function network
Technical field
The present invention relates to voice conversion techniques, and in particular to a voice conversion method based on self-organizing feature map (SOM) network clustering and a radial basis function (RBF) network, belonging to the field of speech signal processing.
Background technology
Voice conversion is a research branch of the speech signal processing field that has emerged in recent years. It builds on research in speaker recognition and speech synthesis, and both enriches and extends those two branches, while not being entirely subsumed by either.
The goal of voice conversion is to change the personal characteristics in a source speaker's speech, under the condition that the semantic information remains unchanged, so that the converted speech sounds like the target speaker's voice.
A voice conversion system has a training stage and a conversion stage. In the training stage, the system analyzes parameters extracted from the source and target speakers' speech and establishes conversion rules. In the conversion stage, features are first extracted from the source speech and then converted into target speech features according to the rules obtained during training.
The key problems in voice conversion are the extraction of speaker personal characteristics and the establishment of conversion rules. Two decades of development have produced a large body of results; current research on speech feature parameters mainly concerns spectral envelope parameters and the fundamental frequency. Existing methods for converting spectral envelope parameters include those based on the linear prediction coding model (Linear Prediction Coding, LPC), the Gaussian mixture model (Gaussian Mixture Model, GMM), and the harmonic plus noise model (Harmonic plus Noise Model, HNM). These methods, however, train directly on the extracted parameters and establish a single unified conversion rule. Because a speech signal is time-varying and non-stationary, and the amount of training data is large, a single rule cannot accurately describe the mapping between the feature parameters of the source speech and those of the target speech, and distortion inevitably results. (1. Zad-Issa, M.R., Kabal, P. Smoothing the Evolution of the Spectral Parameters in Linear Prediction of Speech using Target Matching. ICASSP, 1997: vol. 3, 1699-1702. 2. Daojian Zeng, Yibiao Yu. Voice Conversion using Structured Gaussian Mixture Model. ICSP, 2010: 541-544. 3. Hu, H.T., Yu, C., Lin, C.H. HNM parameter transform for voice conversion using a HMM-WDLT framework. ICIMA, 2010: vol. 2, 282-287.)
Existing methods for converting the fundamental frequency include the mean-shift method and the Gaussian model method. All of these methods convert the spectral envelope parameters and the fundamental frequency separately, with no connection between the two conversions. Yet both parameters come from the same speech signal, and a growing body of research shows a close relationship between them; converting the two kinds of parameters independently therefore inevitably affects the quality of the synthesized speech. (1. Lee, K.S., Doh, W., Youn, D.H. Voice conversion using low dimensional vector mapping. IEICE Transactions on Information & Systems, 2002, E85(D): 1297-1305. 2. L.M. Arslan. Speaker Transformation Algorithm using Segmental Codebooks (STASC). Speech Communication, Jul. 1999: vol. 28, no. 3, pp. 211-226.)
Summary of the invention
The object of the present invention is to provide a voice conversion method that, under the condition of parallel text, combines the time-domain characteristics of speech with the speaker's personal characteristics, obtains a more accurate conversion rule, strengthens the target speaker's characteristics in the converted speech, and improves its acoustic quality.
To achieve the above object, the present invention adopts the following technical scheme:
A voice conversion method based on self-organizing feature map (SOM) network clustering and a radial basis function (RBF) network, with the following concrete steps:
The first step: preprocessing, voiced/unvoiced decision, and feature parameter extraction. The input speech signal is pre-emphasized, divided into frames, and windowed; the short-time energy and average zero-crossing rate of each frame are computed to complete the voiced/unvoiced decision; the STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) model is then used to extract the LSF (Line Spectral Frequency) parameters and fundamental frequency of each voiced frame.
The second step: parameter clustering. The extracted source and target LSF parameters are first aligned by dynamic time warping; the SOM network then clusters the source LSF parameters in a self-organizing fashion, while the indices of the source LSF parameters in each class are recorded. The target LSF parameters corresponding to the source LSF parameters of a given cluster thereby also gather into one class, realizing the clustering of the target LSF parameters. Similarly, the alignment indices returned by the dynamic time warping determine the target fundamental frequency corresponding to each time-aligned target LSF parameter, so the recorded source LSF indices also realize the clustering of the target fundamental frequencies.
The third step: establishment of the spectral envelope conversion rules. For each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output to train an RBF (Radial Basis Function) network, establishing a spectral envelope conversion rule for each cluster.
The fourth step: establishment of the fundamental frequency conversion rules. For each cluster, the target LSF parameters are used as input and the corresponding fundamental frequencies as output to train an RBF network, establishing a fundamental frequency conversion rule for each cluster.
The fifth step: conversion of the feature parameters. Each voiced frame of the speech to be converted is first classified; the spectral envelope conversion rule of its class, obtained in the third step, converts its LSF parameters, and the fundamental frequency conversion rule of that class, obtained in the fourth step, then derives the converted fundamental frequency from the converted LSF parameters.
The sixth step: speech synthesis. The LSF parameters and fundamental frequencies obtained in the fourth and fifth steps are fed to the STRAIGHT model to produce the converted speech.
Compared with the prior art, the present invention has the following notable advantages: (1) Guided by the theory of speech feature parameter mapping, the source speech feature parameters are divided into several clusters, the corresponding target feature parameters are divided into classes in one-to-one correspondence with the source clusters, and a conversion rule is established for each cluster separately. Partitioning the training data in this way reduces the time complexity of training and, by exploiting the short-time quasi-periodicity of speech, allows each cluster's rule to reflect its mapping relationship more accurately, so that the converted speech has good naturalness. (2) When the feature parameters are converted, the fundamental frequency is linked to the spectral envelope through an explicit conversion relation between the two, overcoming the present drawback of converting the fundamental frequency in isolation and giving the converted fundamental frequency more of the target speaker's characteristics.
The present invention is described in further detail below with reference to the accompanying drawings.
Description of drawings
Fig. 1 is a schematic diagram of the voice conversion method of the present invention based on self-organizing feature map network clustering and a radial basis function network;
Fig. 2 is a schematic diagram of the clustering of LSF parameters and the establishment of their conversion rules;
Fig. 3 is a schematic diagram of the clustering of fundamental frequencies and the establishment of their conversion rules;
Fig. 4 is a schematic diagram of the parameter conversion and speech synthesis for the i-th voiced frame.
Embodiment
With reference to Fig. 1, the steps of the voice conversion method of the present invention based on self-organizing feature map network clustering and a radial basis function network are as follows:
The first step: in the training stage, preprocessing, voiced/unvoiced decision, and feature parameter extraction are carried out. The input speech signal is pre-emphasized, divided into frames, and windowed; the short-time energy and average zero-crossing rate of each frame are computed to complete the voiced/unvoiced decision; the STRAIGHT model is then used to extract the fundamental frequency and Line Spectral Frequency (LSF) parameters of each voiced frame. The detailed process is as follows:
(1) The speech signal is preprocessed: the pre-emphasis factor is 0.96, frames are 40 ms long with a frame shift of 1 ms, and a Hamming window is applied;
(2) The short-time energy
E_n = Σ_{m=0}^{N-1} x_n(m)^2
and short-time zero-crossing rate
Z_n = (1/2) Σ_{m=1}^{N-1} | sgn(x_n(m)) − sgn(x_n(m−1)) |
are computed frame by frame, where x_n(m) is the n-th windowed frame of the speech signal and N is the frame length; the double-threshold method is then used to make the voiced/unvoiced decision;
(3) Feature parameter extraction: for each voiced frame, the STRAIGHT model extracts the fundamental frequency and the spectral envelope parameters. Because each frame's spectral envelope is 513-dimensional, it must be reduced in dimension and transformed: an IFFT (Inverse Fast Fourier Transform) of the spectral envelope first yields the autocorrelation coefficients, the Levinson-Durbin algorithm then yields the AR (Autoregressive) model parameters, and from the AR parameters the Line Spectral Frequencies (LSF) are derived. The dimension of the LSF parameters is fixed at 16, so the LSF parameters and fundamental frequency of each voiced frame are obtained.
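The framing, short-time energy, zero-crossing rate, and double-threshold decision of step (2) can be sketched as follows. This is a minimal NumPy illustration, not the patent's STRAIGHT-based pipeline; the threshold values `e_hi` and `z_lo` are hypothetical and would be tuned on real speech.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice signal x into overlapping frames and apply a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def short_time_energy(frames):
    """E_n = sum_m x_n(m)^2, per frame."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Z_n = (1/2) sum_m |sgn(x_n(m)) - sgn(x_n(m-1))|, per frame."""
    signs = np.sign(frames)
    return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)

def voiced_mask(frames, e_hi, z_lo):
    """Double-threshold decision: voiced frames have high energy and low ZCR."""
    return (short_time_energy(frames) > e_hi) & (zero_crossing_rate(frames) < z_lo)
```

For example, a loud 100 Hz tone is flagged voiced (high energy, few zero crossings), while quiet high-frequency content is not.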
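The autocorrelation-to-AR step of (3) is the classical Levinson-Durbin recursion, which can be sketched as below. The IFFT of the spectral envelope and the final AR-to-LSF conversion are omitted here; only the recursion itself is shown.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: from autocorrelation r[0..order],
    solve the Toeplitz normal equations for the AR coefficients
    a = [1, a1, ..., a_order]; also return the final prediction-error power."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = float(r[0])
    for i in range(1, order + 1):
        # reflection coefficient from the residual correlation at lag i
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err
```

With the patent's 16-dimensional LSF representation, this would be called with `order=16` on the autocorrelation obtained from the IFFT of the spectral envelope.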
The second step: parameter clustering, as shown in Fig. 2. The extracted source and target LSF parameters are first aligned by dynamic time warping; the self-organizing feature map (SOM) network then clusters the source LSF parameters, and the indices of the source LSF parameters in each class are recorded, so that the time-aligned target LSF parameters gather into the corresponding classes, realizing the clustering of the target LSF parameters. Similarly, as shown in Fig. 3, the alignment indices returned by the dynamic time warping associate each aligned target LSF parameter with its fundamental frequency, and the recorded source LSF indices then realize the clustering of the target fundamental frequencies. The detailed process is as follows:
(1) Dynamic time warping (DTW) is applied to the extracted source and target LSF parameters so that they are time-aligned;
(2) The aligned source and target LSF parameters are arranged into matrices, each row representing one frame of LSF parameters. The SOM network then clusters the source LSF parameters: its input layer has 16 nodes, and its two-dimensional competition layer has 5 rows and 5 columns, 25 neurons in total, so that after training each competition-layer node represents one class. The source LSF parameters are thus divided into 25 classes, and the indices of the source LSF parameters in each class are recorded;
(3) According to the recorded indices of each class, the target LSF parameters with the same indices are gathered into one class, realizing the clustering of the target LSF parameters;
(4) Similarly, the alignment indices of the target LSF parameters returned by the dynamic time warping select the fundamental frequencies of the target voiced frames at the same index positions, so that each aligned target LSF parameter corresponds to a target fundamental frequency; the target fundamental frequencies sharing the indices of each recorded class are then gathered into that class.
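A minimal self-organizing map of the kind used in step (2) can be sketched as follows. This is a didactic NumPy implementation with a 5×5 competition layer as in the patent; the learning-rate and neighborhood decay schedules are hypothetical choices, not taken from the patent.

```python
import numpy as np

def train_som(data, grid=(5, 5), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a minimal SOM. data: (n_samples, dim). Returns the
    weight vectors of the rows*cols competition-layer neurons."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    n_units = rows * cols
    w = rng.standard_normal((n_units, data.shape[1])) * data.std() + data.mean()
    # 2-D grid coordinates of each unit, for the neighborhood function
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    n_steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            lr = lr0 * np.exp(-t / n_steps)          # decaying learning rate
            sigma = sigma0 * np.exp(-t / n_steps)    # shrinking neighborhood
            bmu = np.argmin(np.sum((w - x) ** 2, axis=1))  # best matching unit
            d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))       # Gaussian neighborhood
            w += lr * h[:, None] * (x - w)
            t += 1
    return w

def assign_clusters(data, w):
    """Winning-unit index per frame; frames sharing a winner form one class."""
    return np.argmin(((data[:, None, :] - w[None, :, :]) ** 2).sum(-1), axis=1)
```

Here `assign_clusters` plays the role of the competition-layer classification, both for grouping the training frames and, later, for classifying voiced frames of the speech to be converted.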
The third step: establishment of the spectral envelope conversion rules, as shown in Fig. 2. For each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output to train an RBF network whose basis functions are Gaussian, finding the relation between input and output vectors. For an input vector X_n, the k-th component of the output layer is
y_k(X_n) = Σ_{j=1}^{M} w_{jk} exp( − ||X_n − X_j||^2 / (2σ^2) )
where X_j is the center of the j-th basis function and w_{jk} are the output weights. After training, a spectral envelope conversion rule has been established for each cluster.
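A Gaussian-RBF regressor of this form can be sketched as follows. The patent does not specify how the centers and weights are trained, so this sketch makes one common choice: the centers are taken as the training inputs and the output weights are fit by ridge-regularized least squares.

```python
import numpy as np

def rbf_design(X, centers, sigma):
    """Gaussian design matrix Phi[n, j] = exp(-||x_n - c_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def train_rbf(X, Y, centers, sigma, ridge=1e-6):
    """Fit output weights W so that Phi @ W ~= Y, i.e.
    y_k(x) = sum_j w_jk * exp(-||x - c_j||^2 / (2 sigma^2))."""
    Phi = rbf_design(X, centers, sigma)
    return np.linalg.solve(Phi.T @ Phi + ridge * np.eye(len(centers)), Phi.T @ Y)

def rbf_predict(X, centers, sigma, W):
    """Evaluate the trained RBF network on new inputs."""
    return rbf_design(X, centers, sigma) @ W
```

In the method of the patent, one such network per cluster would map 16-dimensional source LSF vectors to 16-dimensional target LSF vectors.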
The fourth step: establishment of the fundamental frequency conversion rules, as shown in Fig. 3. For each cluster, the target LSF parameters are used as input and the corresponding target fundamental frequencies as output to train an RBF network, establishing the fundamental frequency conversion rule of each cluster. The detailed process is as follows:
(1) The fundamental frequencies of the target voiced frames are first scaled: since the fundamental frequency of a male voice ranges over about 60-200 Hz and that of a female voice over about 60-450 Hz, and the output of the RBF network lies between 0 and 1, the target fundamental frequencies are scaled by dividing by 500;
(2) For each cluster, the target LSF parameters are used as input and the scaled target fundamental frequencies as output; the basis functions are Gaussian, and training the RBF network establishes the fundamental frequency conversion rule of each cluster.
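The scaling in (1), and its inverse applied later in the conversion stage, amount to:

```python
F0_SCALE = 500.0  # covers the male (~60-200 Hz) and female (~60-450 Hz) F0 ranges

def scale_f0(f0_hz):
    """Map F0 in Hz into (0, 1) so the RBF output range can represent it."""
    return f0_hz / F0_SCALE

def unscale_f0(y):
    """Invert the scaling: multiply the network output by 500 to recover Hz."""
    return y * F0_SCALE
```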
The fifth step: conversion of the feature parameters, as shown in Fig. 4. Each voiced frame of the speech to be converted is first classified; the spectral envelope conversion rule of its class, obtained in the third step, converts its LSF parameters, and the fundamental frequency conversion rule of that class, obtained in the fourth step, then derives the converted fundamental frequency from the converted LSF parameters. The detailed process is as follows:
(1) The source speech to be converted is first divided into frames, windowed with a Hamming window, and subjected to the voiced/unvoiced decision. For each voiced frame, the STRAIGHT model extracts the fundamental frequency and LSF parameters; the LSF parameters are fed frame by frame into the trained SOM network, and the winning neuron of the competition identifies the class of the voiced frame, completing its classification;
(2) The spectral envelope conversion rule of the corresponding class is selected; the LSF parameters of the voiced frame are fed as input into the trained spectral envelope RBF network of that class, yielding the converted LSF parameters;
(3) The fundamental frequency conversion rule of the corresponding class is selected; the converted LSF parameters are fed as input into the trained fundamental frequency RBF network of that class, and the output is multiplied by 500 to restore the scale, yielding the converted fundamental frequency.
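The per-frame conversion logic of the fifth step can be sketched as a small orchestration function. The classifier and per-cluster networks are caller-supplied callables here, standing in for the trained SOM and RBF networks.

```python
def convert_voiced_frame(lsf, som_classify, spec_rules, f0_rules, f0_scale=500.0):
    """Convert one voiced frame.
    som_classify: LSF vector -> cluster index (winning SOM neuron)
    spec_rules[k]: source LSF -> converted LSF (per-cluster spectral RBF)
    f0_rules[k]: converted LSF -> scaled F0 in (0, 1) (per-cluster F0 RBF)
    Note that F0 is predicted from the *converted* LSF, which is the
    envelope/F0 coupling this method introduces."""
    k = som_classify(lsf)
    lsf_converted = spec_rules[k](lsf)
    f0_converted = f0_rules[k](lsf_converted) * f0_scale  # undo the /500 scaling
    return lsf_converted, f0_converted
```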
The sixth step: speech synthesis. The LSF parameters and fundamental frequencies obtained in the fourth and fifth steps are fed to the STRAIGHT model to produce the converted speech.
(1) The converted LSF parameters of each voiced frame are first transformed back into AR parameters, and the AR parameters are transformed into an energy spectrum (in dB); the first 513 dimensions of the energy spectrum are taken, and the energy spectrum is converted from dB back to linear values;
(2) For unvoiced frames, the spectral envelope parameters and fundamental frequency of the source speech frame are used directly as the energy spectrum and converted fundamental frequency;
(3) The resulting energy spectra and converted fundamental frequencies are fed into the STRAIGHT model to synthesize the converted speech.
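The AR-to-energy-spectrum step of (1) can be sketched by evaluating the all-pole model's magnitude response on 513 frequency bins, as below. The LSF-to-AR conversion itself is omitted; this only illustrates computing the dB spectrum and converting it back to linear values.

```python
import numpy as np

def ar_to_energy_spectrum_db(a, n_bins=513):
    """All-pole magnitude spectrum of an AR model 1/A(z), in dB,
    sampled at n_bins frequencies over [0, pi] (513 bins as in the patent)."""
    a = np.asarray(a, dtype=float)
    w = np.linspace(0.0, np.pi, n_bins)
    # A(e^{jw}) = sum_k a[k] e^{-jwk}
    A = np.exp(-1j * np.outer(w, np.arange(len(a)))) @ a
    power = 1.0 / np.abs(A) ** 2
    return 10.0 * np.log10(power)

def db_to_linear(spec_db):
    """Undo the dB scaling before handing the spectrum to the synthesizer."""
    return 10.0 ** (spec_db / 10.0)
```

For a low-pass AR model such as A(z) = 1 − 0.9 z^{-1}, the spectrum is highest at DC and falls off toward π, as expected of an all-pole envelope.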

Claims (7)

1. A voice conversion method based on self-organizing feature map network clustering and a radial basis function network, characterized by comprising the following steps:
The first step: preprocessing, voiced/unvoiced decision, and feature parameter extraction; the input speech signal is pre-emphasized, divided into frames, and windowed; the short-time energy and average zero-crossing rate of each frame are computed to complete the voiced/unvoiced decision; the STRAIGHT model is then used to extract the LSF parameters and fundamental frequency of each voiced frame;
The second step: parameter clustering; the extracted source and target LSF parameters are first aligned by dynamic time warping, the self-organizing feature map network then clusters the source LSF parameters while the indices of the source LSF parameters in each class are recorded, so that the target LSF parameters corresponding to the source LSF parameters of a given cluster also gather into one class, realizing the clustering of the target LSF parameters; similarly, the alignment indices returned by the dynamic time warping determine the fundamental frequency corresponding to each time-aligned target LSF parameter, and the recorded source LSF indices realize the clustering of the target fundamental frequencies;
The third step: establishment of the spectral envelope conversion rules; for each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output to train an RBF network, establishing a spectral envelope conversion rule for each cluster;
The fourth step: establishment of the fundamental frequency conversion rules; for each cluster, the target LSF parameters are used as input and the corresponding fundamental frequencies as output to train an RBF network, establishing a fundamental frequency conversion rule for each cluster;
The fifth step: conversion of the feature parameters; each voiced frame of the speech to be converted is first classified, the spectral envelope conversion rule of its class obtained in the third step converts its LSF parameters, and the fundamental frequency conversion rule of that class obtained in the fourth step then derives the converted fundamental frequency from the converted LSF parameters;
The sixth step: speech synthesis; the LSF parameters and fundamental frequencies obtained in the fourth and fifth steps are fed to the STRAIGHT model to produce the converted speech.
2. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the detailed process of preprocessing, voiced/unvoiced decision, and feature parameter extraction is as follows:
The first step: the speech signal is preprocessed; the pre-emphasis factor is 0.96, frames are 40 ms long with a frame shift of 1 ms, and a Hamming window is applied;
The second step: the short-time energy
E_n = Σ_{m=0}^{N-1} x_n(m)^2
and short-time zero-crossing rate
Z_n = (1/2) Σ_{m=1}^{N-1} | sgn(x_n(m)) − sgn(x_n(m−1)) |
are computed frame by frame, where x_n(m) is the n-th windowed frame of the speech signal and N is the frame length; the double-threshold method is used to make the voiced/unvoiced decision;
The third step: feature parameter extraction; for each voiced frame the STRAIGHT model extracts the fundamental frequency and spectral envelope parameters, which are then reduced in dimension and transformed: an IFFT first yields the autocorrelation coefficients, the autocorrelation coefficients yield the AR parameters, and finally the Line Spectral Frequencies (LSF) are derived from the AR parameters; the dimension of the LSF parameters is fixed at 16.
3. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the detailed process of parameter clustering is as follows:
The first step: dynamic time warping is applied to the extracted source and target LSF parameters so that they are time-aligned;
The second step: the self-organizing feature map network classifies the source LSF parameters; its input layer has 16 nodes and its two-dimensional competition layer has 5 rows and 5 columns, 25 neurons in total, so that after clustering the source LSF parameters are divided into 25 classes, and the indices of the source LSF parameters in each class are recorded;
The third step: according to the recorded indices of each class, the target LSF parameters with the same indices are gathered into one class, realizing the clustering of the target LSF parameters;
The fourth step: likewise, the alignment indices returned by the dynamic time warping identify the target fundamental frequency corresponding to each time-aligned target LSF parameter, and the recorded source LSF indices then realize the clustering of the target fundamental frequencies.
4. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the spectral envelope conversion rules are established with an RBF network: for each cluster, the source LSF parameters are used as input and the target LSF parameters of the corresponding cluster as output for training, with Gaussian basis functions; for an input vector X_n, the k-th component of the output layer is y_k(X_n) = Σ_{j=1}^{M} w_{jk} exp( − ||X_n − X_j||^2 / (2σ^2) ), where X_j is the center of the j-th basis function and w_{jk} are the output weights; after training, a spectral envelope conversion rule has been established for each cluster.
5. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the fundamental frequency conversion rules are established as follows:
The first step: the fundamental frequencies of the target voiced frames are scaled; since the fundamental frequency of a male voice ranges over about 60-200 Hz and that of a female voice over about 60-450 Hz, and the output of the RBF network lies between 0 and 1, the target fundamental frequencies are scaled by dividing by 500;
The second step: for each cluster, the target LSF parameters are used as input and the scaled target fundamental frequencies as output; the basis functions are Gaussian, and training the RBF network establishes the fundamental frequency conversion rule of each cluster.
6. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the conversion of the feature parameters proceeds as follows:
The first step: the speech to be converted is divided into frames and subjected to the voiced/unvoiced decision; for each voiced frame the STRAIGHT model extracts the fundamental frequency and LSF parameters, the LSF parameters are fed one by one into the self-organizing feature map network trained in the training stage, and the winning neuron of the competition identifies the class of the voiced frame, completing its classification;
The second step: the spectral envelope conversion rule of the corresponding class is selected, and the LSF parameters of the voiced frame are fed as input into the trained spectral envelope RBF network of that class, yielding the converted LSF parameters;
The third step: the fundamental frequency conversion rule of the corresponding class is selected, the converted LSF parameters are fed as input into the trained fundamental frequency RBF network of that class, and the output is multiplied by 500 to restore the scale, yielding the converted fundamental frequency.
7. The voice conversion method based on self-organizing feature map network clustering and a radial basis function network according to claim 1, characterized in that the detailed process of speech synthesis is as follows: the converted LSF parameters of each voiced frame are first transformed back into AR parameters, and the AR parameters are transformed into an energy spectrum in dB; the first 513 dimensions of the energy spectrum are taken and converted from dB back to linear values; for unvoiced frames, the spectral envelope parameters and fundamental frequency of the source speech frame are used directly as the energy spectrum and converted fundamental frequency; the resulting energy spectra and converted fundamental frequencies are then fed into the STRAIGHT model to synthesize the converted speech.
CN2012100388747A 2012-02-21 2012-02-21 Voice conversion method based on self-organizing feature map network cluster and radial basis network Expired - Fee Related CN102568476B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100388747A CN102568476B (en) 2012-02-21 2012-02-21 Voice conversion method based on self-organizing feature map network cluster and radial basis network


Publications (2)

Publication Number Publication Date
CN102568476A true CN102568476A (en) 2012-07-11
CN102568476B CN102568476B (en) 2013-07-03

Family

ID=46413732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100388747A Expired - Fee Related CN102568476B (en) 2012-02-21 2012-02-21 Voice conversion method based on self-organizing feature map network cluster and radial basis network

Country Status (1)

Country Link
CN (1) CN102568476B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1185194A (en) * 1997-09-04 1999-03-30 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Voice quality conversion speech synthesis apparatus
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZUO, Guoyu et al.: "Voice conversion based on genetic radial basis function neural network", Journal of Chinese Information Processing, vol. 18, no. 1, 29 February 2004 (2004-02-29), pages 78-84 *
CHEN, Zhi et al.: "Pitch contour conversion algorithm and its application in voice conversion systems", Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), vol. 30, no. 5, 31 October 2010 (2010-10-31), pages 83-87 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364639A (en) * 2013-08-23 2018-08-03 Toshiba Corporation Speech processing system and method
CN105023574B (en) * 2014-04-30 2018-06-15 iFLYTEK Co., Ltd. Method and system for synthesized speech enhancement
CN104464744A (en) * 2014-11-19 2015-03-25 Hohai University Changzhou Campus Cluster voice conversion method and system based on mixture Gaussian random process
CN105390141B (en) * 2015-10-14 2019-10-18 iFLYTEK Co., Ltd. Sound conversion method and device
CN105390141A (en) * 2015-10-14 2016-03-09 iFLYTEK Co., Ltd. Sound conversion method and sound conversion device
CN107545903A (en) * 2017-07-19 2018-01-05 Nanjing University of Posts and Telecommunications Voice conversion method based on deep learning
CN107545903B (en) * 2017-07-19 2020-11-24 Nanjing University of Posts and Telecommunications Voice conversion method based on deep learning
CN108417198A (en) * 2017-12-28 2018-08-17 Central South University Male-female voice conversion method based on spectral envelope and pitch period
CN109448739A (en) * 2018-12-13 2019-03-08 Shandong Computer Science Center (National Supercomputer Center in Jinan) Vocoder line spectral frequency parameter quantization method based on hierarchical clustering
CN109448739B (en) * 2018-12-13 2019-08-23 Shandong Computer Science Center (National Supercomputer Center in Jinan) Vocoder line spectral frequency parameter quantization method based on hierarchical clustering
CN109712634A (en) * 2018-12-24 2019-05-03 Northeastern University Automatic voice conversion method
CN110085255A (en) * 2019-03-27 2019-08-02 Hohai University Changzhou Campus Gaussian process regression modeling method for voice conversion based on deep kernel learning
CN110085255B (en) * 2019-03-27 2021-05-28 Hohai University Changzhou Campus Speech conversion Gaussian process regression modeling method based on deep kernel learning
CN110536215A (en) * 2019-09-09 2019-12-03 TP-Link Technologies Co., Ltd. Audio signal processing method, apparatus, computing device and storage medium
CN113380261A (en) * 2021-05-26 2021-09-10 Terminus Technology Group Co., Ltd. Artificial intelligence voice acquisition processor and method

Also Published As

Publication number Publication date
CN102568476B (en) 2013-07-03

Similar Documents

Publication Publication Date Title
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN102509547B (en) Method and system for voiceprint recognition based on vector quantization
CN105118498B (en) Training method and device for speech synthesis model
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on neural network
CN102231278B (en) Method and system for realizing automatic addition of punctuation marks in speech recognition
CN103065620B (en) Method for receiving text input by a user on a mobile phone or webpage and synthesizing it into personalized speech in real time
CN109272990A (en) Audio recognition method based on convolutional neural networks
Wali et al. Generative adversarial networks for speech processing: A review
CN101777347B (en) Model complementary Chinese accent identification method and system
CN102982803A (en) Isolated word speech recognition method based on HRSF and improved DTW algorithm
CN103021418A (en) Voice conversion method oriented to multi-time-scale prosodic features
CN102237083A (en) Portable interpretation system based on WinCE platform and language recognition method thereof
CN105654942A (en) Speech synthesis method of interrogative sentence and exclamatory sentence based on statistical parameter
CN106782599A (en) Voice conversion method based on Gaussian process with output post-filtering
CN110047504A (en) Speaker recognition method under linear transformation of identity vector (x-vector)
Garg et al. Survey on acoustic modeling and feature extraction for speech recognition
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
Singh et al. Spectral Modification Based Data Augmentation For Improving End-to-End ASR For Children's Speech
Xue et al. Cross-modal information fusion for voice spoofing detection
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
Guo et al. Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training
CN114495969A (en) Voice recognition method integrating voice enhancement
Li et al. Emotion recognition from speech with StarGAN and Dense‐DCNN
Wu et al. A Characteristic of Speaker's Audio in the Model Space Based on Adaptive Frequency Scaling
CN103226946A (en) Speech synthesis method based on restricted Boltzmann machine

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130703

Termination date: 20160221
