
CN109346106B - Cepstrum domain pitch period estimation method based on sub-band signal-to-noise ratio weighting

Info

Publication number: CN109346106B
Application number: CN201811035434.XA
Authority: CN
Inventors: 吕勇
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2018-09-06
Filing date: 2018-09-06
Publication date: 2022-12-06
Anticipated expiration: 2038-09-06
Also published as: CN109346106A

Abstract

The invention discloses a cepstrum-domain pitch period estimation method based on subband signal-to-noise ratio (SNR) weighting. Subband weighting coefficients are calculated from the Mel spectrum of the noisy speech; in the log-spectral domain, all log-spectral values of the noisy speech within each Mel subband are weighted by that subband's coefficient; peak detection is then performed in the cepstrum domain to estimate the pitch period of the noisy speech signal. The technical scheme suppresses environmental noise and vocal tract formants simultaneously, yields a more accurate pitch period estimate, and is particularly suited to pitch estimation in low-SNR environments.

Description

Cepstrum domain pitch period estimation method based on sub-band signal-to-noise ratio weighting

Technical Field

The invention belongs to the technical field of speech processing, and in particular relates to a pitch period estimation method that applies subband signal-to-noise ratio weighting to a noisy speech signal in the log-spectral domain and performs peak detection in the cepstrum domain.

Background

During voiced speech, airflow passing through the glottis sets the vocal cords vibrating, producing a quasi-periodic train of airflow pulses that excites the vocal tract to generate sound. The frequency of this vocal cord vibration is called the fundamental frequency, and its reciprocal is called the pitch period. The pitch period is one of the important parameters of a speech signal: it describes a key feature of the excitation source and is widely used in fields such as speaker recognition, speech synthesis and speech coding.

Because the glottal excitation of speech is only quasi-periodic, and the vocal tract formants distort the harmonic structure of the excitation signal, it is difficult to extract the pitch period accurately from a speech signal. Common pitch period estimation methods include the autocorrelation method, the average magnitude difference function (AMDF) method, the parallel processing method and the cepstrum method. These methods work well on clean speech or on noisy speech with a high signal-to-noise ratio. However, speech is inevitably corrupted by environmental noise during transmission, so the pitch period extracted in a noisy environment may deviate considerably from the true value.

Disclosure of Invention

Purpose of the invention: in view of the problems in the prior art, the invention provides a cepstrum-domain pitch period estimation method based on subband signal-to-noise ratio weighting. When estimating the pitch period in a noisy environment, the method accounts for the influence of both environmental noise and vocal tract formants on the excitation signal, thereby improving the robustness of the pitch estimation algorithm.

Technical scheme: a cepstrum-domain pitch period estimation method based on subband signal-to-noise ratio weighting, in which subband weighting coefficients are calculated from the Mel spectrum of the noisy speech, the feature parameters of the noisy speech are subband-weighted in the log-spectral domain, and peak detection is performed in the cepstrum domain to estimate the pitch period of the noisy speech signal.

The method comprises the following specific steps:

(1) Interpolating or decimating the input digital speech so that its sampling frequency is fixed at 8000Hz;

(2) Performing low-pass filtering on the interpolated or decimated digital speech, retaining only frequency components below 1000Hz, and then windowing and framing to obtain frame signals;

(3) Performing Fast Fourier Transform (FFT) on each frame of voice signal to obtain a magnitude spectrum of each frame of signal;

(4) Mel filtering is carried out on the amplitude spectrum of each frame of signal, logarithm is taken, and a sub-band weighting coefficient is calculated according to the signal-to-noise ratio of each Mel sub-band;

(5) Taking logarithm of the magnitude spectrum of each frame signal to obtain a logarithmic spectrum, and carrying out sub-band weighting on the logarithmic spectrum to reduce the influence of additive noise on pitch period estimation;

(6) Carrying out Discrete Cosine Transform (DCT) on the logarithmic spectrum after the subband weighting to obtain cepstrum parameters of the voice signal;

(7) Performing peak detection on the cepstrum parameters of the speech signal, followed by smoothing filtering, to obtain the pitch period estimate of the input speech.

By adopting the technical scheme, the invention has the following beneficial effects:

The technical scheme of the invention suppresses environmental noise and vocal tract formants simultaneously, yields a more accurate pitch period estimate, and is particularly suited to pitch estimation in environments with a low signal-to-noise ratio.

Drawings

Fig. 1 is a general block diagram of the cepstrum-domain pitch period estimation method based on subband SNR weighting according to an embodiment of the present invention.

Detailed Description

The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.

As shown in FIG. 1, the cepstrum-domain pitch period estimation method based on subband signal-to-noise ratio weighting mainly comprises the following steps: interpolation and decimation, preprocessing, FFT, Mel filtering, subband weighting, DCT and pitch estimation.

1. Interpolation and decimation

To simplify back-end processing, the sampling frequency of the input speech is fixed at 8000Hz. If the original sampling frequency of the input speech is higher than 8000Hz, the speech is decimated to 8000Hz; if it is lower than 8000Hz, the speech is interpolated to 8000Hz.
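The rate conversion can be sketched as follows. This is a minimal illustration, not the patent's implementation: the text does not specify the interpolation or decimation filters, so plain linear interpolation is assumed here, and the function name `resample_to_8k` is hypothetical. When decimating real speech, an anti-aliasing low-pass filter would normally be applied first.

```python
TARGET_RATE = 8000  # Hz, the fixed rate required by the method


def resample_to_8k(samples, source_rate):
    """Resample `samples` from `source_rate` Hz to 8000 Hz.

    Assumption: linear interpolation stands in for the unspecified
    interpolation/decimation filters.
    """
    if source_rate == TARGET_RATE or len(samples) < 2:
        return list(samples)
    duration = len(samples) / source_rate
    n_out = int(round(duration * TARGET_RATE))
    step = source_rate / TARGET_RATE  # input samples per output sample
    out = []
    for m in range(n_out):
        pos = m * step
        left = min(int(pos), len(samples) - 2)
        frac = min(pos - left, 1.0)  # clamp so the last sample is never extrapolated
        out.append((1.0 - frac) * samples[left] + frac * samples[left + 1])
    return out
```

For example, one second of speech sampled at 16000Hz (16000 samples) comes back as 8000 samples, and 4000Hz input is doubled in length.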

2. Preprocessing

Because speech energy is concentrated mainly in the low-frequency region, while the high-frequency region carries little energy and is easily affected by noise, preprocessing first low-pass filters the interpolated or decimated digital speech, retaining only frequency components below 1000Hz; the filtered speech is then windowed and framed to obtain frame signals. The window length is 256 samples and the frame shift is 128 samples.
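A possible realization of the preprocessing stage is sketched below. Only the 1000Hz cutoff, the window length of 256 and the frame shift of 128 come from the text; the Hamming-windowed sinc filter, its 101 taps, the Hamming analysis window and all function names are assumptions made for illustration.

```python
import math

FRAME_LEN = 256    # window length given in the text
FRAME_SHIFT = 128  # frame shift given in the text


def lowpass_fir(cutoff_hz, sample_rate=8000, num_taps=101):
    """Hamming-windowed sinc low-pass filter (the filter design is an assumption)."""
    fc = cutoff_hz / sample_rate  # normalized cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [h / gain for h in taps]  # normalize to unity gain at 0 Hz


def fir_filter(signal, taps):
    """Causal FIR filtering: y[n] = sum_k taps[k] * signal[n - k]."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k < 0:
                break  # no earlier input samples are available
            acc += h * signal[n - k]
        out.append(acc)
    return out


def frame_signal(signal, frame_len=FRAME_LEN, shift=FRAME_SHIFT):
    """Split into overlapping frames and apply a Hamming window (window type assumed)."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames
```

With these settings each frame spans 32ms of speech and consecutive frames overlap by half.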

3. FFT

A fast Fourier transform (FFT) is applied to each speech frame, and the modulus of the resulting complex spectrum is taken to obtain the magnitude spectrum of each frame.
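The magnitude spectrum of one frame can be illustrated as below. To keep the sketch free of external dependencies, a direct O(N^2) DFT is used in place of an FFT; the two produce the same values, and an FFT routine would be the practical choice. The function name `magnitude_spectrum` is hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Return |X(k)| for k = 0..N/2 of a real frame, using a direct DFT."""
    n_points = len(frame)
    spectrum = []
    for k in range(n_points // 2 + 1):
        acc = 0j
        for n, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * n / n_points)
        spectrum.append(abs(acc))  # modulus of the complex spectrum
    return spectrum
```

For a 256-sample frame this gives 129 magnitude values, with bin k corresponding to k * 8000 / 256 = 31.25k Hz.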

4. Mel filtering

First, Mel filtering is applied to the magnitude spectrum of each frame to obtain the Mel spectrum; next, the logarithm of the Mel spectrum is taken to obtain the log spectrum of the speech signal; finally, the weighting coefficient of each Mel subband is calculated from the subband signal-to-noise ratios according to formula (1), in which SNR(i) is the signal-to-noise ratio of the i-th Mel subband; SNR_max and SNR_min are respectively the maximum and minimum subband signal-to-noise ratios of the current speech frame; and α(i) is the weighting coefficient of the i-th Mel subband. The subband signal-to-noise ratio SNR(i) is estimated from the noisy-speech energy and the noise energy in the Mel subband, the noise energy being estimated during silent periods.
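The sketch below illustrates this step under several assumptions. Formula (1) is not reproduced in this text, so the min-max normalization α(i) = (SNR(i) - SNR_min) / (SNR_max - SNR_min) used here is only an assumed form that is consistent with the listed symbols. The triangular Mel filters, the subband count passed by the caller, the subtraction-based SNR estimate in dB and all function names are likewise hypothetical.

```python
import math


def hz_to_mel(f):
    """Common Mel-scale mapping from frequency in Hz."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(num_bands, n_fft=256, sample_rate=8000):
    """FFT-bin edges of `num_bands` triangular Mel filters spanning 0..Nyquist."""
    top = hz_to_mel(sample_rate / 2)
    mels = [top * i / (num_bands + 1) for i in range(num_bands + 2)]
    return [int(round(mel_to_hz(m) * n_fft / sample_rate)) for m in mels]


def mel_energies(magnitude, edges):
    """Energy in each triangular Mel subband of a magnitude spectrum."""
    energies = []
    for i in range(1, len(edges) - 1):
        lo, mid, hi = edges[i - 1], edges[i], edges[i + 1]
        total = 0.0
        for k in range(lo, hi + 1):
            if k <= mid:
                w = (k - lo) / (mid - lo) if mid > lo else 1.0
            else:
                w = (hi - k) / (hi - mid) if hi > mid else 1.0
            total += w * magnitude[k] ** 2
        energies.append(total)
    return energies


def subband_weights(noisy_energy, noise_energy, floor=1e-10):
    """Return (SNR(i) in dB, alpha(i)); alpha uses the ASSUMED min-max form."""
    snr = []
    for y, d in zip(noisy_energy, noise_energy):
        speech = max(y - d, floor)  # assumed estimate: noisy energy minus noise energy
        snr.append(10.0 * math.log10(speech / max(d, floor)))
    snr_min, snr_max = min(snr), max(snr)
    if snr_max == snr_min:
        return snr, [1.0] * len(snr)  # all subbands equally reliable
    return snr, [(s - snr_min) / (snr_max - snr_min) for s in snr]
```

Under this assumed form the cleanest subband of a frame receives weight 1 and the noisiest receives weight 0, so noise-dominated regions contribute little to the later cepstrum.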

5. Subband weighting

First, the logarithm of the magnitude spectrum of each frame is taken to obtain the log spectrum; then every noisy-speech log-spectral value x(k) within the i-th Mel subband is weighted by the estimated coefficient α(i):

$$\hat{x}(k) = \alpha(i)\,x(k) \qquad (2)$$

where $\hat{x}(k)$ is the subband-weighted log spectrum, i.e., the estimate of the clean-speech log spectrum.

6. DCT

A discrete cosine transform (DCT) is applied to the subband-weighted log spectrum $\hat{x}(k)$ to obtain the cepstrum parameters of the speech signal.
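The weighting and DCT steps might look like the following sketch. It assumes that weighting means multiplying each log-spectral value by the coefficient of the subband it falls in, that a lookup table `band_of_bin` (hypothetical) assigns every FFT bin to exactly one Mel subband, and that an unnormalized DCT-II is used; the text does not state the DCT type or the bin-to-subband mapping.

```python
import math


def weight_log_spectrum(magnitude, band_of_bin, alpha, floor=1e-10):
    """Weighted log spectrum: alpha(i) * log|X(k)|, with i = band_of_bin[k]."""
    return [alpha[band_of_bin[k]] * math.log(max(m, floor))
            for k, m in enumerate(magnitude)]


def dct_ii(x):
    """Unnormalized DCT-II: c(n) = sum_k x(k) * cos(pi * n * (2k + 1) / (2K))."""
    size = len(x)
    return [sum(x[k] * math.cos(math.pi * n * (2 * k + 1) / (2 * size))
                for k in range(size))
            for n in range(size)]
```

A flat weighted log spectrum puts all of its energy in c(0), whereas a log spectrum with regularly spaced pitch harmonics produces a peak at the cepstral index that matches the harmonic spacing.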

7. Pitch estimation

The clean speech s(n) can be regarded as the result of filtering the glottal excitation signal e(n) with the vocal tract response v(n), i.e.

$$s(n) = e(n) * v(n) \qquad (3)$$

where the symbol "*" denotes convolution.

After the clean speech s(n) undergoes FFT, logarithm, subband weighting and DCT, the excitation signal and the vocal tract response are separated in the cepstrum domain:

$$c_s(n) = c_e(n) + c_v(n) \qquad (4)$$

where $c_s(n)$ denotes the cepstrum parameters of the speech, and $c_e(n)$ and $c_v(n)$ denote the cepstrum parameters of the excitation signal and the vocal tract response, respectively.
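The reason the two components become additive can be outlined in a few lines. This is a sketch of the standard homomorphic argument rather than text from the patent; the names S(k), E(k) and V(k) for the spectra of s(n), e(n) and v(n), and $c_s$, $c_e$, $c_v$ for the corresponding cepstra, are notation assumed here, and the first step holds only approximately for short windowed frames.

```latex
\begin{aligned}
s(n) = e(n) * v(n)
  \;&\Rightarrow\; S(k) = E(k)\,V(k) \\
  &\Rightarrow\; \log|S(k)| = \log|E(k)| + \log|V(k)| \\
  &\Rightarrow\; \alpha(i)\log|S(k)| = \alpha(i)\log|E(k)| + \alpha(i)\log|V(k)| \\
  &\Rightarrow\; c_s(n) = c_e(n) + c_v(n)
\end{aligned}
```

The last step uses the linearity of the DCT: the transform of a sum is the sum of the transforms, so the weighted log spectra of excitation and vocal tract map to separate additive cepstral terms.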

In the cepstrum domain, the excitation cepstrum $c_e(n)$ is impulsive: it is non-zero only at integer multiples of the discrete pitch period $N_p$ and equals 0 at all other values of the index n. The human pitch period $T_p$ ranges from about 2ms to 20ms, and the sampling frequency of the system is 8000Hz, so $N_p$ ranges from about 16 to 160. The vocal tract cepstrum $c_v(n)$, by contrast, usually decays rapidly; its values outside the interval [-16, 16] are already small and can be assumed to be 0. Therefore, pitch estimation only needs to detect the peak of the cepstrum parameters within [16, 160] to obtain the pitch period estimate:

$$\hat{T}_p = \frac{\hat{N}_p}{8000} \qquad (5)$$

where $\hat{N}_p$ is the value of n at the first peak of the cepstrum parameters within [16, 160], and $\hat{T}_p$ is the estimate of the pitch period.
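Peak detection and the conversion to a pitch period can be sketched as follows. The search range [16, 160] and the 8000Hz sampling frequency come from the text. The relative threshold used to decide what counts as the "first peak", the median filter standing in for the unspecified smoothing filter, and all function names are assumptions.

```python
import statistics

SAMPLE_RATE = 8000
N_MIN, N_MAX = 16, 160  # search range from the text (about 2 ms to 20 ms)


def first_peak_index(cepstrum, n_min=N_MIN, n_max=N_MAX, rel_threshold=0.5):
    """Index of the first local maximum in [n_min, n_max] that reaches
    `rel_threshold` times the largest value in that range.

    Assumption: the relative threshold for skipping small ripples is a
    hypothetical choice; the text only refers to the first peak.
    """
    n_max = min(n_max, len(cepstrum) - 2)  # keep cepstrum[n + 1] in bounds
    segment_max = max(cepstrum[n_min:n_max + 1])
    for n in range(n_min, n_max + 1):
        is_local_max = cepstrum[n - 1] < cepstrum[n] >= cepstrum[n + 1]
        if is_local_max and cepstrum[n] >= rel_threshold * segment_max:
            return n
    # Fall back to the largest value when no local maximum qualifies.
    return n_min + cepstrum[n_min:n_max + 1].index(segment_max)


def pitch_period_seconds(n_p, sample_rate=SAMPLE_RATE):
    """Convert a discrete pitch period to seconds: T_p = N_p / f_s."""
    return n_p / sample_rate


def median_smooth(track, width=3):
    """Median-filter the per-frame estimates (smoothing filter type assumed)."""
    half = width // 2
    return [statistics.median(track[max(0, i - half):i + half + 1])
            for i in range(len(track))]
```

A peak at n = 40, for instance, corresponds to a pitch period of 5ms, i.e., a fundamental frequency of 200Hz. The median filter removes isolated outliers such as a single frame that locks onto twice the true period.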

Claims

1. A cepstrum domain pitch period estimation method based on subband signal-to-noise ratio weighting is characterized in that: calculating a sub-band weighting coefficient by using the Mel spectrum of the noisy speech, carrying out sub-band weighting on the characteristic parameters of the noisy speech in a logarithmic spectrum domain, carrying out peak value detection in a cepstrum domain, and estimating the pitch period of the noisy speech signal;

the method specifically comprises the following steps:

(3) Performing fast Fourier transform on each frame of voice signal to obtain an amplitude spectrum of each frame of signal;

(5) Taking logarithm of the magnitude spectrum of each frame signal to obtain a logarithmic spectrum, and carrying out sub-band weighting to reduce the influence of additive noise on pitch period estimation;

(6) Carrying out discrete cosine transform on the logarithmic spectrum after the subband weighting to obtain cepstrum parameters of the voice signal;

(7) Performing peak detection and smoothing filtering on the cepstrum parameters of the speech signal to obtain the pitch period estimate of the input speech.

2. The cepstrum-domain pitch period estimation method based on subband signal-to-noise ratio weighting according to claim 1, characterized in that: in each frame signal, the weighting coefficient α(i) of each Mel subband is calculated by the following formula:

where SNR(i) is the signal-to-noise ratio of the i-th Mel subband; SNR_max and SNR_min are respectively the maximum and minimum subband signal-to-noise ratios of the current speech frame; α(i) is the weighting coefficient of the i-th Mel subband; the subband signal-to-noise ratio SNR(i) is estimated from the noisy-speech energy and the noise energy in the Mel subband, the noise energy being estimated during silent periods.

3. The cepstrum-domain pitch period estimation method based on subband signal-to-noise ratio weighting according to claim 1, characterized in that: the sampling frequency of the input digital speech is fixed at 8000Hz; if the original sampling frequency of the input digital speech is higher than 8000Hz, the speech is decimated to 8000Hz; if the original sampling frequency is lower than 8000Hz, the speech is interpolated to 8000Hz; the interpolated or decimated digital speech is low-pass filtered, retaining only frequency components below 1000Hz; the filtered digital speech is then windowed and framed to obtain frame signals; the window length is 256 and the frame shift is 128.

4. The cepstrum-domain pitch period estimation method based on subband signal-to-noise ratio weighting according to claim 1, characterized in that: first, the logarithm of the magnitude spectrum of each frame signal is taken to obtain the log spectrum; then every noisy-speech log-spectral value x(k) within the i-th Mel subband is weighted by the estimated weighting coefficient α(i):

wherein,