CN104183233A

CN104183233A - Method for improving the extraction quality of periodic components at consonant-vowel junctions in speech

Info

Publication number: CN104183233A
Application number: CN201410457379.9A
Authority: CN
Inventors: 华侃如
Original assignee: Individual
Current assignee: Individual
Priority date: 2014-09-10
Filing date: 2014-09-10
Publication date: 2014-12-03

Abstract

When a sinusoidal model is used to process the periodic components of speech at consonant-vowel junctions, a gradient descent algorithm is used to optimize the sinusoidal model parameters obtained by the short-time Fourier transform or similar methods. The sinusoidal model then fits the speech at the junction more accurately, the periodic components are extracted with higher quality, and the naturalness of consonant-vowel junctions in synthesized speech is effectively improved.

Description

Method for improving the extraction quality of periodic components at consonant-vowel junctions in speech

Technical field

The invention belongs to the field of speech signal processing, and in particular to speech synthesis that takes the sinusoidal model as its technical basis.

Background technology

The sinusoidal model is one of the methods commonly used today to model speech. Sinusoidal model theory holds that any waveform can be represented as a superposition of several sine waves, so that all waveforms can be written in one unified functional form. In practice a cosine function is normally used:

$$s^i(n) = \sum_{k=1}^{N} A_k^i(n)\cos\bigl(\theta_k^i(n)\bigr)$$

where:

$A_k^i(n)$ is the amplitude of the k-th cosine wave in frame i at time n;

$\theta_k^i(n)$ is the instantaneous phase in frame i at time n, which can be written as $\theta_k^i(n) = 2\pi f_k^i n / f_s + \varphi_k^i$, where $f_k^i$ is the instantaneous frequency, $f_s$ is the sampling frequency and $\varphi_k^i$ is the initial phase;

N is the number of sine waves.

Speech synthesis usually adopts the harmonic sinusoidal model, in which $f_k^i = k f_0^i$, with $f_0^i$ the fundamental frequency of frame i.
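As an illustration of the harmonic sinusoidal model described above, the following minimal sketch sums cosines at integer multiples of a fundamental frequency. The function name `synthesize_harmonics`, the constant amplitudes within a frame and the example values are assumptions made for the illustration; they are not taken from the patent.

```python
import math

def synthesize_harmonics(f0, amps, phases, fs, num_samples):
    """Sum of cosines: s(n) = sum_k A_k * cos(2*pi*k*f0*n/fs + phi_k).

    Assumption: amplitudes and frequencies are held constant inside the frame.
    """
    out = []
    for n in range(num_samples):
        sample = 0.0
        # Harmonic k sits at k times the fundamental frequency f0.
        for k, (a, phi) in enumerate(zip(amps, phases), start=1):
            sample += a * math.cos(2.0 * math.pi * k * f0 * n / fs + phi)
        out.append(sample)
    return out

# Example values (hypothetical): 200 Hz fundamental, two harmonics, 8 kHz sampling.
frame = synthesize_harmonics(f0=200.0, amps=[1.0, 0.5], phases=[0.0, 0.0],
                             fs=8000.0, num_samples=40)
```

With these example values one period of the fundamental spans exactly 40 samples, so the frame holds a single period of the synthesized waveform.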

When a sinusoidal model is used to model speech, the more accurate the model parameters, the better the model fits the original speech. The most common way to obtain sinusoidal model parameters is the short-time Fourier transform (STFT). The STFT has an inherent limitation, however: when the speech signal is windowed and analysed, a shorter window gives a less accurate frequency analysis and a longer window gives a more accurate one. Time precision and frequency precision therefore cannot both be achieved. In speech processing the STFT is better suited to signals whose frequency and amplitude change slowly; for signals whose frequency or amplitude changes rapidly it often fails to reach high precision.
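The trade-off described above can be made concrete with a little arithmetic: for a window of N samples at sampling rate fs, the spacing between DFT bins is fs / N, while the window spans N / fs seconds. The sketch below only evaluates these two quantities; the 44100 Hz sampling rate and the window lengths are example values, not values from the patent.

```python
def stft_resolution(fs, window_length):
    """Return (frequency bin spacing in Hz, window duration in seconds)."""
    bin_spacing_hz = fs / window_length
    window_duration_s = window_length / fs
    return bin_spacing_hz, window_duration_s

# Longer windows give finer frequency spacing but cover more time.
for n in (256, 1024, 4096):
    spacing, duration = stft_resolution(44100.0, n)
    print(n, round(spacing, 2), round(duration * 1000.0, 1))
```

Quadrupling the window length divides the bin spacing by four and multiplies the time span by four, which is why a rapidly changing junction cannot be resolved well in both dimensions at once.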

At the consonant-vowel junction, the amplitude of the periodic component of speech changes rapidly, so an analysis of the periodic component based on the short-time Fourier transform often cannot fit that component accurately.

Summary of the invention

The object of the invention is to extract the periodic component of speech at the consonant-vowel junction accurately. The invention uses a gradient descent algorithm to iteratively correct speech sinusoidal model parameters that were obtained by the short-time Fourier transform or other methods and are not very precise, so as to obtain higher-precision sinusoidal model parameters and make the model fit the original speech better.

To achieve this object, the invention adopts the following steps:

Step 1: input the speech signal and analyse the voicing onset and the fundamental frequency curve;

Step 2: perform harmonic-plus-noise analysis on the input speech signal: apply the short-time Fourier transform (STFT) to the whole signal and, guided by the fundamental frequency, detect peaks in the short-time amplitude spectrum, thereby obtaining a preliminary estimate of the frequency and amplitude of each harmonic in the analysis frame; the phase of each harmonic is read from the phase spectrum;

Step 3: starting at the voicing onset and moving toward the beginning of the speech signal, analyse frame by frame in reverse, computing the harmonic amplitudes and phases of each analysis frame accurately with the gradient descent algorithm;

Step 4: adjust the frequency of each harmonic according to the amplitudes and phases computed accurately in Step 3.

Use: in a harmonic-plus-noise model, where the noise (unvoiced consonants and breath sounds) is obtained by removing the periodic component from the speech, the method markedly reduces the vowel residue left in the unvoiced consonants and breath sounds. It overcomes the problems that the auditory character of unvoiced consonants and breath sounds cannot be reproduced well and that the frequency analysis error is large near the voicing onset.

Description of the drawings

Fig. 1 is a block diagram of the algorithm of the invention for accurately extracting the periodic component at the consonant-vowel junction based on the sinusoidal model.

Fig. 2 is a block diagram of the algorithm by which the invention uses gradient descent to optimize the harmonic amplitudes and phases of a given analysis frame i.

Fig. 3 is an example of the input speech time-domain waveform of an analysis frame.

Fig. 4 shows the initial state of the gradient descent fit to the speech waveform. The red line is the target waveform and the green line is the periodic component waveform to be optimized.

Fig. 5 to Fig. 7 compare the periodic component waveform (green) generated by the gradient descent algorithm after 15, 50 and 100 iterations, respectively, with the target waveform (red).

Embodiment

The overall flow of the algorithm used here is shown in Fig. 1.

One, input the speech signal $x(n)$, divide it into frames and apply a rectangular window, producing speech segments $x_i(n)$ such as the one shown in Fig. 3. The window length N is an integer power of 2, and the window hop is an integer power of 2 smaller than N. The best results are obtained when N satisfies a condition that depends on the sampling rate $f_s$, given in hertz.
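The framing step can be sketched as follows. A rectangular window leaves the samples unchanged, so each frame is simply a slice of the signal. The helper names `split_into_frames` and `is_power_of_two`, and the choice to drop a trailing partial frame, are assumptions of this sketch.

```python
def is_power_of_two(value):
    """True when value is a positive integer power of 2."""
    return value > 0 and (value & (value - 1)) == 0

def split_into_frames(signal, window_length, hop_length):
    """Rectangular-window framing: frame i is signal[i*hop : i*hop + window]."""
    if not (is_power_of_two(window_length) and is_power_of_two(hop_length)):
        raise ValueError("window and hop lengths must be powers of 2")
    if hop_length >= window_length:
        raise ValueError("hop length must be smaller than the window length")
    frames = []
    start = 0
    # Assumption: a trailing partial frame is discarded rather than zero-padded.
    while start + window_length <= len(signal):
        frames.append(signal[start:start + window_length])
        start += hop_length
    return frames

frames = split_into_frames(list(range(16)), window_length=8, hop_length=4)
```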

Two, use the YIN algorithm to analyse the fundamental frequency of each frame. The steps are as follows:

(1) For frame i, compute the similarity (difference) function:

$$d_i(n) = \sum_{j} \bigl(x_i(j) - x_i(j+n)\bigr)^2$$

(2) Compute the cumulative mean normalized similarity function:

$$d'_i(n) = \begin{cases} 1, & n = 0 \\ d_i(n) \Big/ \Bigl[\tfrac{1}{n}\sum_{j=1}^{n} d_i(j)\Bigr], & n > 0 \end{cases}$$

(3) Find the lag n that satisfies: 1. $d'_i(n)$ is a local minimum; 2. $d'_i(n) < T$; 3. n is the smallest value meeting the above conditions. The threshold T is best chosen between 0.1 and 0.3. If no n meets these conditions, take the n at which $d'_i(n)$ is the global minimum;

(4) From n and the sampling rate, compute the short-time fundamental frequency: $f_0^i = f_s / n$.

The short-time fundamental frequency of each frame obtained by the YIN algorithm is denoted $f_0^i$ below. For unvoiced frames, which have no periodic component, the YIN algorithm produces an incorrect fundamental frequency estimate, but this does not affect the execution of the method of the invention.
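The four YIN steps above can be sketched in a few lines. This is a simplified illustration of the standard YIN procedure, not the patent's exact implementation: the summation range, the default `max_lag` of half the frame and the function name `yin_f0` are assumptions, and the parabolic interpolation used in full YIN implementations is omitted.

```python
import math

def yin_f0(frame, fs, threshold=0.15, max_lag=None):
    """Estimate the fundamental frequency of one frame with a basic YIN search."""
    n_samples = len(frame)
    if max_lag is None:
        max_lag = n_samples // 2          # assumption: search lags up to half the frame
    span = n_samples - max_lag            # same number of terms for every lag
    # (1) difference function d(lag)
    d = [0.0] * (max_lag + 1)
    for lag in range(1, max_lag + 1):
        d[lag] = sum((frame[j] - frame[j + lag]) ** 2 for j in range(span))
    # (2) cumulative mean normalized difference function
    cmnd = [1.0] * (max_lag + 1)
    running = 0.0
    for lag in range(1, max_lag + 1):
        running += d[lag]
        cmnd[lag] = d[lag] * lag / running if running > 0.0 else 1.0
    # (3) smallest lag that is a local minimum below the threshold
    best = None
    for lag in range(2, max_lag):
        if (cmnd[lag] < threshold and cmnd[lag] <= cmnd[lag - 1]
                and cmnd[lag] <= cmnd[lag + 1]):
            best = lag
            break
    if best is None:                      # fallback: global minimum
        best = min(range(1, max_lag + 1), key=lambda lag: cmnd[lag])
    # (4) lag in samples -> frequency in hertz
    return fs / best

# Example: a 200 Hz cosine sampled at 8 kHz has a period of 40 samples.
test_frame = [math.cos(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(256)]
estimated_f0 = yin_f0(test_frame, 8000.0)
```

The default threshold of 0.15 lies inside the 0.1 to 0.3 range recommended in step (3).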

Three, use $f_0^i$ to make a rough estimate of the voicing onset frame $i_0$: it is the smallest frame index for which the condition on the fundamental frequency holds for j = 1, 2, 3 and 4. The threshold in that condition ranges from 10 to 40 hertz.
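The exact onset condition is not reproduced in this text, so the sketch below rests on an assumption: a frame is taken as the voicing onset when the fundamental frequency changes by less than the threshold between each pair of consecutive frames over the next four frames. The function name `find_voicing_onset` and the 25 Hz default (inside the stated 10 to 40 hertz range) are likewise hypothetical.

```python
def find_voicing_onset(f0_track, threshold_hz=25.0, lookahead=4):
    """Return the first frame index followed by `lookahead` stable f0 steps.

    Assumed condition: |f0[i + j] - f0[i + j - 1]| < threshold_hz for j = 1..lookahead.
    Returns None when no frame satisfies it.
    """
    for i in range(len(f0_track) - lookahead):
        stable = all(
            abs(f0_track[i + j] - f0_track[i + j - 1]) < threshold_hz
            for j in range(1, lookahead + 1)
        )
        if stable:
            return i
    return None

# Erratic estimates in unvoiced frames, then a stable voiced region.
track = [90.0, 400.0, 150.0, 310.0, 205.0, 201.0, 203.0, 202.0, 204.0, 203.0]
onset = find_voicing_onset(track)
```

Because YIN produces erratic values on unvoiced frames, a stability test of this kind can tolerate the incorrect estimates mentioned in Step Two.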

Four, apply a Hanning window to each frame $x_i(n)$ and perform the short-time Fourier transform (STFT) to obtain the amplitude spectrum and the phase spectrum.
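A minimal sketch of this step, using only the standard library, applies a Hann window to one frame and evaluates a direct DFT to get magnitude and phase per bin. The direct O(N²) DFT stands in for an FFT purely to keep the example dependency-free, and the periodic form of the window and the function names are assumptions.

```python
import cmath
import math

def hann_window(length):
    """Periodic Hann window: w(n) = 0.5 - 0.5*cos(2*pi*n/length)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]

def windowed_spectrum(frame):
    """Return (magnitudes, phases) for bins 0..N/2 of the Hann-windowed frame."""
    length = len(frame)
    windowed = [x * w for x, w in zip(frame, hann_window(length))]
    magnitudes, phases = [], []
    for k in range(length // 2 + 1):
        # Direct DFT of bin k; a real implementation would use an FFT.
        acc = sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / length)
                  for n in range(length))
        magnitudes.append(abs(acc))
        phases.append(cmath.phase(acc))
    return magnitudes, phases

# A cosine with exactly 8 cycles in 64 samples peaks at bin 8.
tone = [math.cos(2.0 * math.pi * 8.0 * n / 64.0) for n in range(64)]
mags, phs = windowed_spectrum(tone)
```

For a unit-amplitude cosine centred on a bin, the periodic Hann window yields a peak magnitude of N/4, which is one possible basis for the amplitude normalization mentioned in Step Five.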

Five, starting from the voicing onset frame and incrementing i frame by frame, carry out the forward sinusoidal-model analysis of the periodic component, using the following algorithm to find the harmonics:

(1) Set the harmonic counter h = 1.

(2) Within the frequency range associated with harmonic h, find the maximum of the amplitude spectrum together with the frequency and phase at that maximum. Assign the frequency to $f_h^i$, the amplitude (after normalization) to $A_h^i$, and the phase to $\varphi_h^i$.

(3) Increase h by 1 and repeat steps (2) and (3) until the upper frequency limit is exceeded. This limit is generally greater than 5000 hertz.
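The harmonic search can be sketched as a peak pick around each multiple of the fundamental. Several details are assumptions of this sketch because the corresponding formulas are not reproduced in the text: the search range of half a fundamental on either side of h·f0, the normalization by the Hann peak gain N/4, and the name `pick_harmonics`.

```python
import math

def pick_harmonics(magnitudes, phases, fs, fft_size, f0, f_max=5000.0):
    """Return a list of (frequency, amplitude, phase), one entry per harmonic."""
    window_gain = fft_size / 4.0   # assumed: peak of a unit cosine under a periodic Hann window
    harmonics = []
    h = 1
    while h * f0 <= f_max and (h + 0.5) * f0 < fs / 2.0:
        # Assumed search range: (h - 0.5)*f0 .. (h + 0.5)*f0, converted to bin indices.
        lo = max(1, int(math.ceil((h - 0.5) * f0 * fft_size / fs)))
        hi = min(len(magnitudes) - 1, int(math.floor((h + 0.5) * f0 * fft_size / fs)))
        if lo > hi:
            break
        k = max(range(lo, hi + 1), key=lambda b: magnitudes[b])
        harmonics.append((k * fs / fft_size, magnitudes[k] / window_gain, phases[k]))
        h += 1
    return harmonics

# Synthetic half-spectrum (64-point transform, 6400 Hz sampling, 100 Hz per bin).
example_mags = [0.0] * 33
example_mags[8], example_mags[16] = 16.0, 8.0
example_phases = [0.0] * 33
found = pick_harmonics(example_mags, example_phases, fs=6400.0, fft_size=64,
                       f0=800.0, f_max=2000.0)
```

Restricting each search to a band around h·f0 keeps a strong neighbouring harmonic from being picked twice.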

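Six, starting from the voicing onset frame and decrementing i frame by frame, carry out the reverse sinusoidal-model analysis of the periodic component:

(1) Set the harmonic counter h = 1.

(2) As in step (2) of Step Five, find $f_h^i$, $A_h^i$ and $\varphi_h^i$.

(3) If either of the two frequency conditions is met, the harmonic is treated as unstable and its parameters are reassigned, with its amplitude set to 0. The threshold in this step is best chosen between 30 and 50 hertz.

(4) Increase h by 1 and repeat steps (2) and (3) until the upper frequency limit is exceeded. This limit is generally greater than 5000 hertz.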
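The stability check in the reverse pass might look like the sketch below. The two conditions themselves are not reproduced in the text, so this example assumes a single simplified test: a harmonic is unstable when its frequency differs from the same harmonic in the following (later) frame by more than the threshold. The function name `suppress_unstable` and the data layout are hypothetical.

```python
def suppress_unstable(current, following, threshold_hz=40.0):
    """Zero the amplitude of harmonics whose frequency jumps between adjacent frames.

    current, following: lists of (frequency, amplitude, phase) for frame i and frame i + 1.
    Assumption: instability means |f_h(i) - f_h(i + 1)| > threshold_hz, or that the
    harmonic has no counterpart in the following frame.
    """
    cleaned = []
    for h, (freq, amp, phase) in enumerate(current):
        missing = h >= len(following)
        if missing or abs(freq - following[h][0]) > threshold_hz:
            amp = 0.0
        cleaned.append((freq, amp, phase))
    return cleaned

frame_i = [(200.0, 1.0, 0.0), (460.0, 0.5, 0.1)]
frame_next = [(205.0, 0.9, 0.2), (410.0, 0.4, 0.3)]
result = suppress_unstable(frame_i, frame_next)
```

Setting the amplitude to 0 removes the harmonic from the reconstructed periodic component without changing the number of harmonics tracked per frame.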

Seven, as shown in the flow of Fig. 2, use the gradient descent algorithm to optimize the harmonic amplitudes $A_h^i$ and phases $\varphi_h^i$ of each frame before the voicing onset, iterating the following steps:

(1) Reconstruct the periodic component and compute its squared difference from the input frame, where H is the number of harmonics.

(2) For h = 1, ..., H, compute the partial derivative of the squared difference with respect to each amplitude $A_h^i$.

(3) For h = 1, ..., H, compute the partial derivative of the squared difference with respect to each phase $\varphi_h^i$.

(4) Update $A_h^i$ and $\varphi_h^i$ according to the partial derivatives obtained in steps (2) and (3).

With a step size between 0.2 and 0.5 and 100 iterations, the gradient descent fit gives good results. The initial state and the states after the 15th, 50th and 100th iterations are shown in Fig. 4, 5, 6 and 7, respectively. The red line is the target waveform and the green line is the reconstructed periodic component.
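The iteration above can be sketched as follows. The patent's exact update formulas are not reproduced in this text, so the sketch assumes the model x̂(n) = Σ A_h cos(2π f_h n / fs + φ_h), an error equal to the mean of the squared residual over the frame (the division by the frame length is an assumption that makes step sizes of 0.2 to 0.5 behave sensibly), and a plain gradient step. The function name `refine_by_gradient_descent` is hypothetical.

```python
import math

def refine_by_gradient_descent(frame, fs, freqs, amps, phases, step=0.3, iterations=100):
    """Refine harmonic amplitudes and phases; frequencies are held fixed."""
    n_samples = len(frame)
    amps, phases = list(amps), list(phases)
    for _ in range(iterations):
        # (1) rebuild the periodic component and the residual x - x_hat
        thetas = [[2.0 * math.pi * f * n / fs + p for n in range(n_samples)]
                  for f, p in zip(freqs, phases)]
        model = [sum(a * math.cos(th[n]) for a, th in zip(amps, thetas))
                 for n in range(n_samples)]
        residual = [x - m for x, m in zip(frame, model)]
        # (2) dE/dA_h = -(2/N) * sum(residual * cos(theta_h))
        grad_a = [-2.0 / n_samples * sum(r * math.cos(t) for r, t in zip(residual, th))
                  for th in thetas]
        # (3) dE/dphi_h = (2/N) * A_h * sum(residual * sin(theta_h))
        grad_p = [2.0 / n_samples * a * sum(r * math.sin(t) for r, t in zip(residual, th))
                  for a, th in zip(amps, thetas)]
        # (4) gradient step on both parameter sets
        amps = [a - step * g for a, g in zip(amps, grad_a)]
        phases = [p - step * g for p, g in zip(phases, grad_p)]
    return amps, phases

# Target frame: two harmonics of 200 Hz at 8 kHz, 80 samples (two full periods).
target = [1.0 * math.cos(2.0 * math.pi * 200.0 * n / 8000.0 + 0.3)
          + 0.5 * math.cos(2.0 * math.pi * 400.0 * n / 8000.0 - 0.4)
          for n in range(80)]
fit_amps, fit_phases = refine_by_gradient_descent(
    target, 8000.0, freqs=[200.0, 400.0], amps=[0.8, 0.6], phases=[0.0, 0.0])
```

In this example the initial guesses are deliberately off in both amplitude and phase; after 100 iterations at a step size of 0.3 the parameters move close to the values used to build the target frame.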

Eight, use the optimized phases to correct the harmonic frequencies, and output the frequency, amplitude and phase of each harmonic:

(1) Compute the time interval of one window hop.

(2) Compute the phase change.

(3) Estimate the number of periods n.

(4) Recalculate the phase change.

(5) Correct the frequency.
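The five steps resemble the phase-difference frequency refinement used in phase vocoders, and the sketch below follows that standard technique. It is an assumption that the patent's formulas take this form: the function name `correct_frequency`, the use of two adjacent frames' phases for one harmonic, and the rounding rule for the cycle count are all illustrative choices.

```python
import math

def correct_frequency(freq_estimate, phase_prev, phase_next, hop_samples, fs):
    """Refine a harmonic frequency from its phases in two adjacent frames."""
    hop_time = hop_samples / fs                                   # (1) hop interval in seconds
    delta = phase_next - phase_prev                               # (2) wrapped phase change
    # (3) whole cycles elapsed during the hop, guided by the rough frequency estimate
    cycles = round(freq_estimate * hop_time - delta / (2.0 * math.pi))
    unwrapped = delta + 2.0 * math.pi * cycles                    # (4) unwrapped phase change
    return unwrapped / (2.0 * math.pi * hop_time)                 # (5) corrected frequency in Hz

# Example: the true frequency is 203 Hz but the spectral peak suggested 200 Hz.
fs_example, hop_example = 8000.0, 128
advance = 2.0 * math.pi * 203.0 * hop_example / fs_example
phase_a = 0.5
phase_b = math.atan2(math.sin(phase_a + advance), math.cos(phase_a + advance))
refined = correct_frequency(200.0, phase_a, phase_b, hop_example, fs_example)
```

The rough estimate only has to be good enough to identify the right number of whole cycles; the fractional part then comes from the optimized phases, which is why precise phases from Step Seven translate into precise frequencies.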

Claims

1. A method, based on a sinusoidal model, for improving the extraction quality of periodic components at consonant-vowel junctions in speech, characterized by comprising the following steps:

2. The method for improving the extraction quality of periodic components at consonant-vowel junctions in speech according to claim 1, wherein step 2 is characterized in that: when harmonic frequencies and amplitudes are obtained from the spectrum in reverse from the voicing onset, the amplitude of any harmonic whose frequency is unstable between two adjacent frames is set to 0:

if either of the two frequency conditions is met, the harmonic's parameters are reassigned;

the threshold ranges from 30 to 50 hertz;

where $f_0^i$ is the fundamental frequency of frame i, $f_h^i$ is the frequency of the h-th harmonic of frame i, $A_h^i$ is the amplitude of the h-th harmonic of frame i, $\varphi_h^i$ is the phase of the h-th harmonic of frame i, and $\Phi^i$ is the phase spectrum of frame i.

3. The method for improving the extraction quality of periodic components at consonant-vowel junctions in speech according to claim 1, wherein step 3 is characterized in that: when the harmonic amplitudes and phases of each frame are computed accurately with the gradient descent algorithm, the amplitude and phase of each harmonic are updated using the following formulas:

where H is the number of harmonics, N is the analysis window length in step 2, $f_s$ is the sampling frequency, $x_i$ is the speech signal segment of frame i before windowing, and $\hat{x}_i$ is the regenerated speech signal segment;

the step size ranges from 0.2 to 0.5.

4. The method for improving the extraction quality of periodic components at consonant-vowel junctions in speech according to claim 1 and based on claim 3, wherein step 3 is further characterized in that the harmonic frequencies are corrected according to the phases optimized in step 3, using the following formulas in the order given:

(1)

(2)

(3)

(4)

(5)

where the window hop used in step 2 is expressed as a number of samples.