WO2017125840A1

WO2017125840A1 - Method for analysis and synthesis of aperiodic signals

Info

Publication number: WO2017125840A1
Application number: PCT/IB2017/050208
Authority: WO
Inventors: Kanru HUA
Original assignee: Hua Kanru
Priority date: 2016-01-19
Filing date: 2017-01-15
Publication date: 2017-07-27

Abstract

The present invention is a method for analysis and synthesis of the aperiodic component in speech signals. The analysis stage involves spectral envelope estimation and decomposition of the input aperiodic component into a plurality of band-pass filtered signals, from which time-domain envelopes are extracted. The synthesis stage involves multi-band time-domain modulation and spectral modification on a white noise signal. The present invention preserves both temporal and spectral characteristics of the aperiodic component when applied to speech signals.

Description

Method for Analysis and Synthesis of Aperiodic Signals

FIELD OF THE INVENTION This invention relates to a method for analysis and synthesis of aperiodic signals, in particular the analysis and synthesis of the aperiodic component in speech signals.

DESCRIPTION OF THE PRIOR ART Various speech processing technologies have been proposed that decompose a speech signal into periodic and aperiodic components in the analysis stage and recombine the two components in the synthesis stage. In some literature, the periodic component is referred to as the deterministic component and the aperiodic component is referred to as the stochastic component or noise component.

For example, U.S. Patent No. 5029509 discloses a well-known Spectral Modeling Synthesis (SMS) technology in which the periodic component is represented as a series of sinusoids and the aperiodic component is represented as a series of magnitude spectral envelopes. According to Childers, Donald G., and C. K. Lee. "Vocal quality factors: Analysis, synthesis, and perception." the Journal of the Acoustical Society of America 90.5 (1991): 2394-2410, the time-domain envelope of the speech aperiodic component was found to be related to the periodic component. Further, the use of a square-wave-modulated white noise excitation signal synchronized to the periodic excitation in a formant synthesizer was found to produce more natural-sounding voice.

In many applications it is desirable to preserve both the temporal and the frequency-domain characteristics of the aperiodic signal during processing, modification, or parametrization. Such an attempt, for the purpose of pitch shifting, is described in Mehta, Daryush, and Thomas F. Quatieri. "Synthesis, analysis, and pitch modification of the breathy vowel." Applications of Signal Processing to Audio and Acoustics, 2005. IEEE Workshop on. IEEE, 2005, according to which the speech signal is first decomposed into a periodic component and an aperiodic component, and the periodic component is then pitch-shifted by a sinusoidal model. The pitch shifting of the aperiodic component involves whitening the aperiodic signal, estimating the time-domain envelope of the whitened signal, demodulating the whitened signal by the estimated time-domain envelope, resampling the time-domain envelope, re-modulating the demodulated signal by the resampled time-domain envelope, and finally spectrally coloring the re-modulated signal by the spectrum of the original aperiodic signal. However, the short-time energy of the resynthesized aperiodic signal is not guaranteed to match that of the original aperiodic signal, because resampling the time-domain envelope of the whitened aperiodic signal may introduce an energy distortion.

A similar technology is described in Pantazis, Yannis, and Stylianou, Yannis, "Improving the modeling of the noise part in the harmonic plus noise model of speech." International Conference on Acoustics, Speech, and Signal Processing (2008), in which the synthesis stage involves first coloring the white noise signal and then modulating the colored signal by a time-domain envelope. However, the time-domain modulation introduces a frequency-domain distortion that blurs the spectrum of the aperiodic signal, resulting in a degradation in naturalness of the synthesized speech.

Another problem of conventional technologies is the over-simplified assumption that the noise excitation receives the same modulation in all frequency channels. In fact, noise is produced not only near the glottis but also in the vocal tract during phonation, which implies that the aperiodic component exhibits different time-domain characteristics in different frequency regions. For example, the time-domain envelope of the aperiodic component extracted from a sustained vowel has very weak periodicity in the 0-5kHz band, while the time-domain envelope of the same aperiodic component in the 9-12kHz band has stronger periodicity, as shown in Fig. 1.

SUMMARY OF THE INVENTION The present invention is a method for analysis and synthesis of stochastic signals with quasi-periodic time-domain envelopes. The method is primarily designed for high-quality speech processing applications, for example, speech synthesis and audio production.

The present invention, in its analysis stage, involves the following steps. (1) Estimate one or a plurality of spectral envelopes from the input aperiodic signal. (2) Band-pass filter the input aperiodic signal for each designated frequency band. (3) Extract the time-domain envelope from each band-pass filtered signal. (4) Store the analysis results.

The present invention, in its synthesis stage, involves the following steps. (1) Generate a white noise signal. (2) Band-pass filter the white noise signal for each designated frequency band. (3) Modulate each band-pass filtered signal with the input time-domain envelope of the corresponding frequency band. (4) Obtain the full-band excitation signal as the summation of the modulated band-limited signals. (5) Whiten the excitation signal by inverse filtering it with its own spectral envelope. (6) Filter the whitened excitation signal with the input spectral envelope.

BRIEF DESCRIPTION OF THE DRAWINGS Fig. 1 is the plot of an example of an aperiodic signal extracted from a sustained vowel sound /a/ filtered by a 0-5kHz band-pass filter and a 9-12kHz band-pass filter, respectively. The time-domain envelopes of the signal are also plotted in the figure.

Fig. 2 is a flow chart showing the analysis process of this invention.

Fig. 3 is a flow chart showing the synthesis process of this invention.

Fig. 4 is a flow chart showing the analysis process of a speech processing application involving this invention.

Fig. 5 is a flow chart showing the synthesis process of a speech processing application involving this invention.

Fig. 6 is the plot of an example of a magnitude spectrum of the aperiodic signal and its spectral envelope obtained by cepstral smoothing.

DETAILED DESCRIPTION OF THE INVENTION As shown in Fig. 2, the analysis stage of the present invention consists of the following steps.

Step A001, obtain the input aperiodic signal and a predetermined array of one or a plurality of frequency values designating the frequency bands for modeling the time-domain characteristics of the aperiodic signal.

Step A002, perform Short-Time Fourier Transform (STFT) on the input aperiodic signal and obtain a series of magnitude spectra of the input aperiodic signal.

Step A003, for each magnitude spectrum in the series of magnitude spectra obtained in step A002, calculate the corresponding spectral envelope. The result should be a series of spectral envelopes.

The preferred method for calculating the spectral envelope from a magnitude spectrum is to convert the magnitude spectrum into cepstrum, truncate the cepstrum to a designated order, and finally convert the cepstrum back to spectrum. An example of a magnitude spectrum obtained by STFT in step A002 and the corresponding spectral envelope obtained by truncating the cepstrum is shown in Fig. 6.
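The cepstral smoothing described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `cepstral_envelope`, the use of NumPy, the natural logarithm, the small floor that avoids taking the logarithm of zero, and the assumption of a one-sided spectrum from an even-length FFT are all choices made for this example.

```python
import numpy as np

def cepstral_envelope(magnitude, order):
    """Smooth a one-sided magnitude spectrum by truncating its real cepstrum.

    magnitude: magnitudes for bins 0..N/2 of an even-length-N FFT (assumed layout).
    order: number of low-quefrency coefficients kept; must be below N/2.
    """
    log_mag = np.log(np.maximum(magnitude, 1e-12))  # floor avoids log(0)
    cepstrum = np.fft.irfft(log_mag)                # real, even-symmetric, length N
    n = len(cepstrum)
    lifter = np.zeros(n)
    lifter[:order + 1] = 1.0                        # keep quefrencies 0..order
    lifter[n - order:] = 1.0                        # and their mirrored counterparts
    smoothed_log = np.fft.rfft(cepstrum * lifter).real
    return np.exp(smoothed_log)                     # back to a linear magnitude envelope
```

A larger `order` follows the fine structure of the spectrum more closely, while a smaller `order` yields a smoother envelope; the designated order itself is not specified in this description.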

Step A004, for each of the frequency bands designated by the predetermined array of frequency values, band-pass filter the input aperiodic signal to remove the portion of the aperiodic signal outside of the frequency band.

Step A005, extract the time-domain envelope of the band-pass filtered signal obtained in step A004.

The preferred method for time-domain envelope extraction is to low-pass filter the absolute value of the band-pass filtered signal.
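Steps A004 and A005 can be illustrated by the sketch below. Several details are assumptions of this example and are not stated in the description: the array of frequency values is interpreted as consecutive band edges in Hz, the band-pass filter is a brick-wall filter applied in the FFT domain, the low-pass filter is a moving average, and the 2 ms smoothing length and all function names are hypothetical.

```python
import numpy as np

def bandpass_fft(signal, sample_rate, low_hz, high_hz):
    """Zero every FFT bin outside [low_hz, high_hz) (brick-wall filter, an assumption)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[(freqs < low_hz) | (freqs >= high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

def time_envelope(band_signal, sample_rate, smooth_ms=2.0):
    """Low-pass filter the absolute value with a moving average (assumed filter)."""
    width = max(1, int(round(sample_rate * smooth_ms / 1000.0)))
    kernel = np.ones(width) / width
    return np.convolve(np.abs(band_signal), kernel, mode="same")

def analyze_bands(signal, sample_rate, band_edges_hz):
    """Return one time-domain envelope per band between consecutive edges."""
    envelopes = []
    for low_hz, high_hz in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        band = bandpass_fft(signal, sample_rate, low_hz, high_hz)   # step A004
        envelopes.append(time_envelope(band, sample_rate))          # step A005
    return envelopes
```

For instance, `analyze_bands(x, 48000, [0, 5000, 9000, 12000])` would return three envelopes, one for each of the bands 0-5kHz, 5-9kHz and 9-12kHz.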

Step A006, store the analysis results, including the series of spectral envelopes obtained in step A003 and one or a plurality of time-domain envelopes obtained in step A005.

As shown in Fig. 3, the synthesis stage of the present invention consists of the following steps.

Step S001, obtain a series of spectral envelopes describing the frequency-domain characteristics of the aperiodic signal, one or a plurality of time-domain envelopes describing the time-domain characteristics of the aperiodic signal, and a predetermined array of one or a plurality of frequency values designating the frequency bands for modeling the time-domain characteristics of the aperiodic signal.

Step S002, generate a white noise signal with the same duration as the input time-domain envelopes.

Step S003, for each of the frequency bands designated by the predetermined array of frequency values, band-pass filter the white noise signal generated in step S002 to remove the portion of the noise signal outside of the frequency band.

Step S004, for each of the frequency bands designated by the predetermined array of frequency values, multiply the band-pass filtered noise signal obtained in step S003, corresponding to the frequency band, by the time-domain envelope corresponding to the frequency band.

Step S005, calculate the sum of the modulated signals obtained in step S004. The result will be denoted as the noise excitation signal in the rest of this description.

Because the time-domain modulation in step S004 changes the energy of the band-pass filtered signal in each frequency band, the resulting noise excitation signal becomes colored. The spectral envelope of the noise excitation signal should therefore be taken into consideration in the subsequent noise coloring procedure, which is described in steps S007 to S009.
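Steps S002 to S005 can be sketched as below. As assumptions of this example only, the frequency values are treated as consecutive band edges in Hz, the band-pass filter is a brick-wall mask in the FFT domain, the white noise is Gaussian, and the function name `noise_excitation` and its `seed` parameter are hypothetical.

```python
import numpy as np

def noise_excitation(envelopes, band_edges_hz, sample_rate, seed=0):
    """Build the noise excitation from one time-domain envelope per band."""
    rng = np.random.default_rng(seed)
    length = len(envelopes[0])
    white = rng.standard_normal(length)                        # step S002
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(length, d=1.0 / sample_rate)
    excitation = np.zeros(length)
    bands = zip(band_edges_hz[:-1], band_edges_hz[1:])
    for envelope, (low_hz, high_hz) in zip(envelopes, bands):
        mask = (freqs >= low_hz) & (freqs < high_hz)
        band_noise = np.fft.irfft(spectrum * mask, n=length)   # step S003
        excitation += band_noise * envelope                    # steps S004 and S005
    return excitation
```

When every envelope is constant and the bands cover the whole spectrum, the sum reproduces the original white noise; any other set of envelopes changes the per-band energy, which is the coloring effect noted above.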

Step S006, perform STFT on the noise excitation signal and obtain a series of complex spectra of the noise excitation signal. For each complex spectrum, calculate the corresponding magnitude spectrum.

The preferred method for calculating the spectral envelope from a magnitude spectrum is to convert the magnitude spectrum into cepstrum, truncate the cepstrum to a designated order, and finally convert the cepstrum back to spectrum.

Step S007, for each magnitude spectrum in the series of magnitude spectra obtained in step S006, calculate the corresponding spectral envelope. The result should be a series of spectral envelopes.

Step S008, inverse filter the series of complex spectra obtained in step S006 by the series of spectral envelopes obtained in step S007. The inverse filtering can be implemented as dividing each complex spectrum by the corresponding spectral envelope. The result should be a series of complex spectra.

Step S009, filter the series of inverse filtered complex spectra obtained in step S008 by the series of spectral envelopes describing the frequency-domain characteristics of the aperiodic signal. The filtering can be implemented as multiplying each complex spectrum by the corresponding spectral envelope. The result should be a series of complex spectra.

Step S010, perform inverse STFT on the series of complex spectra obtained in step S009. The resulting time-domain signal is the synthesized aperiodic signal.
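Steps S006 to S010 can be sketched as follows. The STFT parameters are assumptions of this example and are not given in the description: a periodic Hann analysis window with 50% overlap and plain overlap-add resynthesis, a frame length of 512 samples, and a cepstral order of 30. The function names are hypothetical, and `target_envelopes` is assumed to hold one one-sided envelope per frame.

```python
import numpy as np

def cepstral_envelope(magnitude, order):
    """Spectral envelope by truncating the real cepstrum to `order` coefficients."""
    log_mag = np.log(np.maximum(magnitude, 1e-12))
    cepstrum = np.fft.irfft(log_mag)
    cepstrum[order + 1:len(cepstrum) - order] = 0.0   # drop high quefrencies
    return np.exp(np.fft.rfft(cepstrum).real)

def recolor(excitation, target_envelopes, frame_len=512, order=30):
    """Whiten each frame by its own envelope, then impose the target envelope."""
    hop = frame_len // 2
    n = np.arange(frame_len)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_len)  # periodic Hann
    output = np.zeros(len(excitation))
    starts = range(0, len(excitation) - frame_len + 1, hop)
    for index, start in enumerate(starts):
        frame = excitation[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)                               # step S006
        own_envelope = cepstral_envelope(np.abs(spectrum), order)   # step S007
        whitened = spectrum / own_envelope                          # step S008
        colored = whitened * target_envelopes[index]                # step S009
        output[start:start + frame_len] += np.fft.irfft(colored, n=frame_len)  # step S010
    return output
```

Because each frame is divided by its own envelope before the target envelope is applied, scaling the excitation by a constant leaves the output unchanged in this sketch; the output level is set by the target envelopes rather than by the energy changes introduced by the modulation.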

The following describes the implementation of an exemplary speech processing application involving this invention, as shown in Fig. 4 and Fig. 5.

Step A101, receive a speech signal from a sound input device, such as a microphone.

Step A102, extract the pitch contour from the input speech signal. The extracted pitch contour is an array of frequency values corresponding to the fundamental frequency of the speech signal at a series of time instants spaced at a fixed time interval (around 5 milliseconds). If the speech is unvoiced at a certain time instant, the frequency value corresponding to that time instant is set to zero.

The preferred pitch extraction method is the YIN algorithm described in De Cheveigne, Alain, and Kawahara, Hideki, "YIN, a fundamental frequency estimator for speech and music." Journal of the Acoustical Society of America 111.4 (2002): 1917-1930.

Step A103, perform STFT analysis on the input speech signal at a series of time instants corresponding to the analysis time instants at which the pitch contour is extracted, and obtain a series of complex spectra of the input speech signal. The preferred window for the STFT analysis is the Blackman window. The length of the window is preferred to be time-varying, namely twice the length of a period of the speech signal around the analysis time instant.
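The pitch-adaptive window of step A103 can be sketched as below. The function name `analysis_window` is hypothetical, and the fixed fallback length used when the pitch value is zero (unvoiced) is an assumption of this example, since the description does not state how unvoiced instants are windowed.

```python
import numpy as np

def analysis_window(f0_hz, sample_rate, default_len=1024):
    """Blackman window spanning two pitch periods around one analysis instant."""
    if f0_hz <= 0.0:
        # Unvoiced instant: no period is defined, so use an assumed fixed length.
        return np.blackman(default_len)
    length = int(round(2.0 * sample_rate / f0_hz))  # two periods, in samples
    return np.blackman(length)
```

For a fundamental frequency of 100 Hz at a 44.1 kHz sampling rate, one period is 441 samples, so the window is 882 samples long.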

Step A104, for each complex spectrum in the series of complex spectra obtained in step A103, calculate the log magnitude spectrum and pick the spectral peaks around each harmonic frequency calculated as the integer multiple of the fundamental frequency at the corresponding time instant. Perform parabolic interpolation at the spectral peaks to obtain a refined estimation of the harmonic amplitudes and harmonic frequencies. Perform linear interpolation at the refined harmonic frequencies on the unwrapped phase spectrum calculated from the complex spectrum to obtain a refined estimation of the phase of the harmonics. Normalize the estimated harmonic amplitudes by dividing the amplitudes by half the sum of the analysis window for the STFT analysis in step A103.
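The parabolic interpolation and the amplitude normalization of step A104 can be sketched as follows. The function names are hypothetical, the natural logarithm is assumed for the log magnitude spectrum, and the phase interpolation is omitted from this example.

```python
import numpy as np

def parabolic_peak(log_mag, k):
    """Fit a parabola through bins k-1, k, k+1 of a log magnitude spectrum.

    Returns the refined peak position (in fractional bins) and the
    interpolated log magnitude at that position.
    """
    left, centre, right = log_mag[k - 1], log_mag[k], log_mag[k + 1]
    denom = left - 2.0 * centre + right
    if denom == 0.0:
        return float(k), centre            # flat neighbourhood: nothing to refine
    offset = 0.5 * (left - right) / denom  # vertex offset from bin k
    return k + offset, centre - 0.25 * (left - right) * offset

def normalized_amplitude(log_peak, window):
    """Divide the peak magnitude by half the sum of the analysis window."""
    return np.exp(log_peak) / (0.5 * np.sum(window))
```

The refined frequency in Hz follows from the fractional bin position multiplied by the sampling rate and divided by the FFT length. The division by half the window sum reflects that a real sinusoid of amplitude A, once windowed, produces a spectral peak of approximately A times half the sum of the window.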

Step A105, generate a plurality of sinusoids with time-varying amplitude and time-varying frequency according to the series of harmonic amplitudes and harmonic phases obtained in step A104 and the pitch contour obtained in step A102. Calculate the sum of the sinusoids. In the rest of this description the sum of the sinusoids is denoted as the extracted periodic component.
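The generation of sinusoids with time-varying amplitude and frequency can be sketched as below. This example is deliberately simplified and departs from step A105 in one labeled respect: the measured harmonic phases are ignored and the phase of each harmonic is obtained only by accumulating the interpolated fundamental frequency. Linear interpolation between analysis instants, the function name, and the array layout are likewise assumptions.

```python
import numpy as np

def synthesize_harmonics(f0_hz, amplitudes, hop, sample_rate):
    """Sum of harmonics from frame-rate pitch and amplitude tracks.

    f0_hz: shape (frames,), amplitudes: shape (frames, harmonics),
    hop: samples between analysis instants (about 5 ms in this description).
    Measured harmonic phases are NOT used here (simplification).
    """
    frames, harmonics = amplitudes.shape
    frame_times = np.arange(frames) * hop
    sample_times = np.arange((frames - 1) * hop + 1)
    f0 = np.interp(sample_times, frame_times, f0_hz)
    # Running phase of the fundamental, starting at zero on the first sample.
    phase = 2.0 * np.pi * (np.cumsum(f0) - f0[0]) / sample_rate
    output = np.zeros(len(sample_times))
    for h in range(harmonics):
        amp = np.interp(sample_times, frame_times, amplitudes[:, h])
        output += amp * np.cos((h + 1) * phase)   # harmonic h+1 of the fundamental
    return output
```

Using the measured phases, as the step specifies, matters for the subtraction in the next step, because the extracted periodic component must be aligned with the input waveform.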

Step A106, subtract the extracted periodic component from the input speech signal. The resulting signal is denoted as the extracted aperiodic component in the rest of this description.

Step A112, perform Short-Time Fourier Transform (STFT) on the extracted aperiodic component signal and obtain a series of magnitude spectra of the extracted aperiodic component signal. The length of the window for the STFT analysis is around 10 milliseconds.

Step A113, for each magnitude spectrum in the series of magnitude spectra obtained in step A112, calculate the corresponding spectral envelope. The result should be a series of spectral envelopes.

Step A114, for each of the frequency bands designated by the predetermined array of frequency values, band-pass filter the extracted aperiodic component signal to remove the portion of the aperiodic signal outside of the frequency band.

Step A115, extract the time-domain envelope of the band-pass filtered signal obtained in step A114.

The preferred method for time-domain envelope extraction is to low-pass filter the absolute value of the signal.

Step A116, store the analysis results, including the series of spectral envelopes obtained in step A113, one or a plurality of time-domain envelopes obtained in step A115, the pitch contour obtained in step A102 and the series of harmonic amplitudes and harmonic phases obtained in step A104.

Step M101, optionally, modify the analysis results. For example, multiply the pitch contour by a constant, accordingly adjust the amplitudes of the harmonics at each time instant and accordingly resample the time-domain envelopes for each frequency band describing the aperiodic component to shift up the pitch.
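The pitch-shifting example of step M101 can be sketched as below for the pitch contour and the time-domain envelopes. The reading of "resample" as a compression of the time axis by the pitch ratio using linear interpolation is an assumption of this example, as are the function names; the adjustment of harmonic amplitudes and the handling of the shortened envelope duration are not covered.

```python
import numpy as np

def scale_pitch(f0_hz, ratio):
    """Multiply the pitch contour by a constant; unvoiced zeros stay zero."""
    return np.asarray(f0_hz, dtype=float) * ratio

def resample_envelope(envelope, ratio):
    """Read the envelope at positions 0, ratio, 2*ratio, ... (assumed scheme).

    A ratio above 1 shortens the quasi-period of the envelope by that factor
    and also shortens the envelope itself.
    """
    envelope = np.asarray(envelope, dtype=float)
    count = int((len(envelope) - 1) / ratio) + 1
    positions = np.arange(count) * ratio
    return np.interp(positions, np.arange(len(envelope)), envelope)
```

With a ratio of 2, an envelope whose quasi-period is 100 samples is mapped to one whose quasi-period is 50 samples, which matches a pitch contour raised by the same factor.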

Step S101, generate a plurality of sinusoids with time-varying amplitude and time-varying frequency according to the series of harmonic amplitudes and harmonic phases obtained in step A104 or step M101 and the pitch contour obtained in step A102 or step M101. Calculate the sum of the sinusoids. In the rest of this description the sum of the sinusoids is denoted as the synthesized periodic signal.

Step S112, generate a white noise signal with the same duration as the time-domain envelopes obtained in step A115 or step M101.

Step S113, for each of the frequency bands designated by the predetermined array of frequency values, band-pass filter the white noise signal generated in step S112 to remove the portion of the noise signal outside of the frequency band.

Step S114, for each of the frequency bands designated by the predetermined array of frequency values, multiply the band-pass filtered noise signal obtained in step S113, corresponding to the frequency band, by the time-domain envelope corresponding to the frequency band.

Step S115, obtain the noise excitation signal by computing the sum of the modulated signals obtained in step S114.

Step S116, perform STFT on the noise excitation signal and obtain a series of complex spectra of the noise excitation signal. For each complex spectrum, calculate the corresponding magnitude spectrum.

Step S117, for each magnitude spectrum in the series of magnitude spectra obtained in step S116, calculate the corresponding spectral envelope. The result should be a series of spectral envelopes.

Step S118, inverse filter the series of complex spectra obtained in step S116 by the series of spectral envelopes obtained in step S117. The inverse filtering can be implemented as dividing each complex spectrum by the corresponding spectral envelope. The result should be a series of complex spectra.

Step S119, filter the series of inverse filtered complex spectra obtained in step S118 by the series of spectral envelopes describing the frequency-domain characteristics of the aperiodic signal. The filtering can be implemented as multiplying each complex spectrum by the corresponding spectral envelope. The result should be a series of complex spectra.

Step S120, perform inverse STFT on the series of complex spectra obtained in step S119. The resulting time-domain signal is the synthesized aperiodic signal.

Step S121, calculate the sum of the synthesized periodic signal obtained in step S101 and the synthesized aperiodic signal obtained in step S120. Send the resulting signal to a sound output device such as a speaker.

Claims

1. A speech processing method that extracts information from an aperiodic signal, consisting of the following steps:

estimate one or a plurality of spectral envelopes from the input aperiodic signal;

band-pass filter the input aperiodic signal for each designated frequency band;

extract time-domain envelope from each band-pass filtered signal;

store the analysis results comprising one or a plurality of spectral envelopes and a time-domain envelope for each designated frequency band.

2. A speech processing method that generates an aperiodic signal from spectral and time-domain envelopes, consisting of the following steps:

generate a white noise signal;

band-pass filter the white noise signal for each designated frequency band;

modulate the band-pass filtered signal with input time-domain envelope;

obtain the full-band excitation signal as the summation of modulated band-limited signals;

whiten the excitation signal by inverse-filtering by its spectral envelope;

filter the whitened excitation signal by one or a plurality of input spectral envelopes.