GB2525438A - A speech processing system

A speech processing system

Info

Publication number
GB2525438A
GB2525438A
Authority
GB
United Kingdom
Prior art keywords
frequency
signal
speech
complex amplitude
sinusoidal
Prior art date
Legal status
Granted
Application number
GB1407337.3A
Other versions
GB201407337D0 (en)
GB2525438B (en)
Inventor
Ioannis Stylianou
Current Assignee
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Application filed by Toshiba Research Europe Ltd
Priority to GB1407337.3A
Publication of GB201407337D0
Publication of GB2525438A
Application granted
Publication of GB2525438B
Status
Expired - Fee Related

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/06 Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band

Abstract

Complex amplitudes of sinusoidal components for a speech synthesiser (eg. a text to speech vocoder) are extracted from an input audio signal by dividing it into frames (eg. windowed, overlapping frames), Fourier transforming each frame into the frequency domain, identifying the peak in each frequency band (eg. each of 21 critical bands), calculating the complex amplitude of a sinusoidal component at this frequency (eg via a MMSE method) and finally assigning this amplitude to a fixed frequency (eg. the centre frequency) in each band (figs. 7A & B). An output signal may then be generated from the sum of these components.

Description

A speech processing system
FIELD
Embodiments of the present invention described herein generally relate to the
field of speech processing and speech synthesis.
BACKGROUND
Speech processors are used in systems such as vocoders to extract parameters from speech signals and then use them, or some modified form of them, to reconstruct speech. Vocoders are applied in a number of applications, such as low-bit rate speech coding analysis, speech synthesis, speech modification and text to speech applications such as electronic games, E-book readers, E-mail readers, satellite navigation, automated telephone systems and automated warning systems.
Speech processors are also used in a wide variety of other applications such as hearing aids and cochlear implants.
BRIEF DESCRIPTION OF THE FIGURES
Embodiments will now be described with reference to the following figures:
Figure 1 is a flow chart of a method of extracting parameters for use in speech synthesis from an input speech signal in accordance with an embodiment;
Figure 2 is a flow chart of a method of synthesizing a speech signal in accordance with an embodiment;
Figure 3 is a speech analysis operation which is a method for extracting parameters for use in speech synthesis from an input speech signal in accordance with an embodiment;
Figure 4 is a speech synthesis operation which is a method of synthesizing a speech signal in accordance with an embodiment;
Figure 5 is a flow chart of a method of extracting speech parameters from an input speech signal in accordance with an embodiment in which dynamic features are not included;
Figure 6 is a flow chart of a method of synthesizing a speech signal in accordance with an embodiment in which dynamic features are not included and further sinusoids are generated having the boundary frequencies of a number of critical bands;
Figure 7a shows a speech magnitude spectrum, with estimated complex amplitudes for the centres of the critical bands indicated;
Figure 7b shows a speech magnitude spectrum, with estimated complex amplitudes for the centres of the critical bands and for the maximum spectral amplitude in each critical band indicated;
Figure 8 is a schematic illustration of a very basic text to speech synthesis system;
Figure 9 is a schematic illustration of the basic architecture of a text to speech system;
Figure 10 is a flow chart of a generalised method of synthesizing speech from text based on unit selection;
Figure 11 is a flow chart of a generalised method of synthesizing speech from text based on statistical modelling using a HMM.
DETAILED DESCRIPTION
According to one embodiment, there is provided a system for extracting parameters for use in speech synthesis from a speech signal, comprising a processor configured to: receive an input speech signal; extract frames of said input speech signal; transform each frame of said input speech signal into the frequency domain; for each frame, identify the frequency at which the maximum spectral amplitude occurs in each of a plurality of frequency bands; calculate the complex amplitudes of sinusoidal components with the identified frequencies; and assign each complex amplitude to a sinusoidal component with a fixed frequency in the corresponding frequency band.
In one embodiment, the frequency bands are critical bands. In human perception, a range of frequencies (a frequency band) may be perceived as the same, because they activate the same area on the basilar membrane of the inner ear. The frequency ranges of the critical bands correspond to the smallest frequency difference that a listener could perceive. In one embodiment, the number of critical bands is 21. In one embodiment, the fixed frequency is the centre frequency of the frequency band. In other embodiments, the fixed frequency is substantially central to the critical band, for example a frequency within 150 Hz of the centre of the critical band. In alternative embodiments, any fixed frequency within the critical band can be chosen for assigning the parameters. By a fixed frequency it is meant that the same frequency is used for each frame of speech analysed. The same frequency can also be used to synthesize a speech signal.
In one embodiment, the frames are overlapping frames. In an alternative embodiment, the frames are contiguous. In one embodiment, the centres of the frames are aligned with the glottal closure instants (GCIs), and pitch synchronous analysis is used. In one embodiment, the frames are windowed frames. In one embodiment, the frames are windowed frames with zero-padding. In one embodiment, the Hann window function is applied to the frames to produce windowed frames. Alternative window functions can be used. In one embodiment, the number of sinusoids for each frame is fixed.
In one embodiment, the transformation of the signal into the frequency domain is a Fourier transform. In one embodiment, an FFT algorithm is used.
In one embodiment, the frequency at which the maximum spectral amplitude occurs is identified using peak picking.
In one embodiment, the complex amplitude of each of the sinusoidal components is calculated using a least squares method, by minimising the error \epsilon between the frame of input speech signal and a signal comprising the sum of the sinusoidal components:

\epsilon = \sum_{n=0}^{N} w^2(n) (s(n) - h(n))^2

where w(n) is the window function, h(n) is the input speech signal, N is the window length, n is the discrete time index and

s(n) = \sum_{k=-B}^{B} a_k^{max} e^{j \omega_k^{max} n}

where \omega_k^{max} is the identified frequency, j is the imaginary unit, B is the number of frequency bands and a_k^{max} is the complex amplitude.
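By way of illustration, this windowed least squares estimate can be posed as a small linear problem per frame. The Python sketch below is one possible implementation under that assumption; the function name, the use of numpy and the real/imaginary stacking are illustrative choices, not the computation prescribed by the embodiment. Only the positive-frequency sinusoids are solved for, since the conjugate terms are implied by the signal being real.

import numpy as np

def estimate_complex_amplitudes(h, w, omegas):
    """Weighted least squares estimate of complex amplitudes a_k so that
    s(n) = sum_k (a_k e^{j w_k n} + conjugate) approximates the frame h(n),
    minimising sum_n w^2(n) (s(n) - h(n))^2.
    h      : real-valued speech frame of length N
    w      : analysis window of length N
    omegas : identified peak frequencies in radians per sample (one per band)
    """
    h = np.asarray(h, dtype=float)
    w = np.asarray(w, dtype=float)
    N = len(h)
    n = np.arange(N)
    E = np.exp(1j * np.outer(n, omegas))                  # N x B complex basis
    W = w[:, None]
    # Real-valued design matrix: columns for Re(a_k) and Im(a_k). The factor 2
    # accounts for the implied negative-frequency (conjugate) sinusoids.
    A = np.hstack([2 * (W * E).real, -2 * (W * E).imag])
    x, *_ = np.linalg.lstsq(A, w * h, rcond=None)
    B = len(omegas)
    return x[:B] + 1j * x[B:]                             # complex amplitudes a_k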
In one embodiment, the complex amplitude is calculated using peak picking.
In one embodiment, the speech processor further comprises a speech synthesizer, configured to: generate a plurality of sinusoidal components, the frequency of each sinusoidal component being the fixed frequency in a frequency band and the complex amplitude of the sinusoidal component being the assigned complex amplitude; and output a signal comprising a sum of said sinusoidal components.
In one embodiment, the output signal s(n) is generated from a sum of sinusoidal components given by:

s(n) = \sum_{k=-B}^{B} a_k^{max} e^{j \omega_k^{cen} n}

where B is the number of frequency bands, a_k^{max} is the complex amplitude assigned to the sinusoidal component with a fixed frequency in the kth frequency band, and \omega_k^{cen} is the fixed frequency in the kth frequency band.
In one embodiment, the processor is further configured to calculate a complex amplitude value for each of a further number of sinusoidal components. The further number of sinusoidal components each has the frequency of the boundary of a frequency band. In one embodiment there are nine further sinusoidal components. In one embodiment, the complex amplitude values are calculated using the least squares method, by minimising the error between the frame of input speech signal and a signal comprising the sum of the further number of sinusoidal components and the sinusoidal components with the identified frequencies:

\epsilon = \sum_{n=0}^{N} w^2(n) (s(n) - h(n))^2

where

s(n) = \sum_{k=-B}^{B} a_k^{max} e^{j \omega_k^{max} n} + \sum_{k=-B_L}^{B_L} a_k^{bou} e^{j \omega_k^{bou} n}

where \omega_k^{bou} is the boundary frequency, B_L is the further number of sinusoidal components and a_k^{bou} are the complex amplitudes.
In one embodiment, the complex amplitude values are calculated using peak picking.
In one embodiment, the output signal s(n) of the speech processor is given by:

s(n) = \sum_{k=-B}^{B} a_k^{max} e^{j \omega_k^{cen} n} + \sum_{k=-B_L}^{B_L} a_k^{bou} e^{j \omega_k^{bou} n}

In one embodiment, dynamic features are included. The complex amplitudes and dynamic slope values of sinusoidal components with the identified frequencies are calculated using a least squares method. The complex amplitudes and dynamic slope values of each sinusoidal component are calculated by minimising the error \epsilon between the input speech signal and a signal comprising the sum of the sinusoidal components:

\epsilon = \sum_{n=0}^{N} w^2(n) (s(n) - h(n))^2

where

s(n) = \sum_{k=-B}^{B} (a_k^{max} + n b_k^{max}) e^{j \omega_k^{max} n}

where b_k^{max} is the dynamic slope.
The complex amplitude and dynamic slope values are assigned to a sinusoidal component with a fixed frequency in the corresponding frequency band.
In one embodiment, the output signal s(n) of the speech processor is generated from a sum of sinusoidal components given by:

s(n) = \sum_{k=-B}^{B} (a_k^{max} + n b_k^{max}) e^{j \omega_k^{cen} n}

where b_k^{max} is the dynamic slope value assigned to the sinusoidal component with a fixed frequency in the kth frequency band.
In one embodiment, the processor is further configured to calculate a complex amplitude value and a dynamic slope value for each of a further number of sinusoidal components. The further number of sinusoidal components each has the frequency of the boundary of a frequency band. In one embodiment there are nine further sinusoidal components. The processor is further configured to calculate a complex amplitude value and a dynamic slope value for each of the further number of sinusoidal components using a least squares method, by minimising the error \epsilon between the frame of input speech signal and a signal comprising the sum of the further number of sinusoidal components and the sinusoidal components with the identified frequencies:

\epsilon = \sum_{n=0}^{N} w^2(n) (s(n) - h(n))^2

where

s(n) = \sum_{k=-B}^{B} (a_k^{max} + n b_k^{max}) e^{j \omega_k^{max} n} + \sum_{k=-B_L}^{B_L} (a_k^{bou} + n b_k^{bou}) e^{j \omega_k^{bou} n}

In one embodiment, the output signal s(n) of the speech processor is given by:

s(n) = \sum_{k=-B}^{B} (a_k^{max} + n b_k^{max}) e^{j \omega_k^{cen} n} + \sum_{k=-B_L}^{B_L} (a_k^{bou} + n b_k^{bou}) e^{j \omega_k^{bou} n}

In one embodiment, the speech synthesizer is further configured to add a noise component to said signal. In one embodiment, the output signal s(n) is generated from a sum of sinusoidal components given by:

s(n) = \sum_{k=-B}^{B} (a_k^{max} + n b_k^{max}) e^{j \omega_k^{cen} n} + \sum_{k=-B_L}^{B_L} (a_k^{bou} + n b_k^{bou}) e^{j \omega_k^{bou} n} + s_H(n)

where s_H(n) is a noise component.
In one embodiment, a first component of the output signal is generated by adding sinusoidal components generated using the complex amplitude values and optionally using the dynamic slope values up to a selected frequency. In one embodiment, the selected frequency is 4000 Hz. In one embodiment, the first component of the output signal is:

s_1(n) = \sum_{k=-17}^{17} a_k^{max} e^{j \omega_k^{cen} n}

In one embodiment, the first component of the output signal is:

s_1(n) = \sum_{k=-17}^{17} (a_k^{max} + n b_k^{max}) e^{j \omega_k^{cen} n}

In one embodiment, the first component of the output signal is:

s_1(n) = \sum_{k=-17}^{17} a_k^{max} e^{j \omega_k^{cen} n} + \sum_{k=-B_L}^{B_L} a_k^{bou} e^{j \omega_k^{bou} n}

In one embodiment, the first component of the output signal is:

s_1(n) = \sum_{k=-17}^{17} (a_k^{max} + n b_k^{max}) e^{j \omega_k^{cen} n} + \sum_{k=-B_L}^{B_L} (a_k^{bou} + n b_k^{bou}) e^{j \omega_k^{bou} n}

In one embodiment, a second component of the output signal is generated. The first component and second component are added to give the output signal. In one embodiment, the second component of the output signal is generated by: generating the sinusoidal components with frequencies above a selected frequency; estimating the envelope of the sum of said sinusoidal components with frequencies above the selected frequency; interpolating the complex amplitude values assigned to the critical bands above the selected frequency to obtain a further plurality of complex amplitude values; generating sinusoidal components with the interpolated complex amplitude values and random phase; and multiplying the sum of the sinusoidal components with interpolated complex amplitude values by the estimated envelope.
In one embodiment, the sum of the sinusoidal components with frequencies above the selected frequency is:

s_{hf}(n) = (a_{18}^{max} + n b_{18}^{max}) e^{j \omega_{18}^{cen} n} + (a_{19}^{max} + n b_{19}^{max}) e^{j \omega_{19}^{cen} n} + (a_{20}^{max} + n b_{20}^{max}) e^{j \omega_{20}^{cen} n} + (a_{21}^{max} + n b_{21}^{max}) e^{j \omega_{21}^{cen} n}

In one embodiment, the envelope of the sum is determined by computing the analytic form of s_{hf}(n) using the Hilbert transform.
In one embodiment, the complex amplitude values a_{17}^{max}, a_{18}^{max}, a_{19}^{max}, a_{20}^{max} and a_{21}^{max} are interpolated to give complex values corresponding to frequencies of every 100 Hz between \omega_{17}^{cen} and \omega_{21}^{cen}. Sinusoids are generated with frequencies every 100 Hz between \omega_{17}^{cen} and \omega_{21}^{cen}, with the complex amplitude values obtained through interpolation and with random phase.
The generated sinusoids are multiplied by the envelope to give the second component of the output signal.
In one embodiment, the speech processor is further configured to modify the output signal. In one embodiment, the speech processor is configured to modify selected sinusoidal components of the signal. For example, the sinusoidal components at the higher frequencies may be amplified by the speech processor.
In one embodiment, the speech processor is a speech processor for use in a cochlear implant. In one embodiment, the speech processor is a speech processor for use in a hearing aid. In one embodiment, the speech processor is a speech processor for use in text to speech synthesis.
According to one embodiment, there is provided a method of extracting parameters for use in speech synthesis from an audio signal, the method comprising: receiving an input speech signal; extracting frames of said input speech signal; transforming each frame of said input speech signal into the frequency domain; for each frame, identifying the frequency at which the maximum spectral amplitude occurs in each of a plurality of frequency bands; calculating the complex amplitudes of sinusoidal components with the identified frequencies; and assigning each complex amplitude to a sinusoidal component with a fixed frequency in the corresponding frequency band.
According to one embodiment, there is provided a method of synthesizing a speech signal, comprising a method of extracting parameters from an audio signal comprising: receiving an input speech signal; extracting frames of said input speech signal; transforming each frame of said input speech signal into the frequency domain; for each frame, identifying the frequency at which the maximum spectral amplitude occurs in each of a plurality of frequency bands; calculating the complex amplitudes of sinusoidal components with the identified frequencies; and assigning each complex amplitude to a sinusoidal component with a fixed frequency in the corresponding frequency band, and further comprising: generating a plurality of sinusoidal components, the frequency of each sinusoidal component being a fixed frequency in one of the frequency bands and the complex amplitude of the sinusoidal component being the assigned complex amplitude; and outputting a signal comprising a sum of said sinusoidal components.
According to one embodiment, there is provided a carrier medium comprising computer readable code configured to cause a computer to perform a method of extracting parameters for use in speech synthesis from an audio signal, the method comprising: receiving an input speech signal; extracting frames of said input speech signal; transforming each frame of said input speech signal into the frequency domain; for each frame, identifying the frequency at which the maximum spectral amplitude occurs in each of a plurality of frequency bands; calculating the complex amplitudes of sinusoidal components with the identified frequencies; and assigning each complex amplitude to a sinusoidal component with a fixed frequency in the corresponding frequency band.
According to one embodiment there is provided a text to speech system, comprising a memory and a processor, configured to receive input text; divide said inputted text into a sequence of acoustic units; determine a sequence of sets of complex amplitude values corresponding to said sequence of acoustic units, wherein each set of complex amplitude values comprises a value of complex amplitude corresponding to each frequency band; generate a speech signal using said complex amplitude values, wherein the speech signal comprises a sum of sinusoidal components each corresponding to a frequency band, wherein the frequency of each sinusoidal component is a fixed frequency in the corresponding frequency band and the complex amplitude is the complex amplitude value corresponding to the frequency band.
In one embodiment, dynamic features are included.
In one embodiment, the text to speech system is based on unit selection. The sets of complex amplitude values are stored complex amplitude values obtained from recorded segments of speech.
In one embodiment, the text to speech system is based on statistical modelling.
Probability distributions relating sets of stored complex amplitude values to each acoustic unit are determined during training. The mean and variance of the sets of complex amplitude values are also determined during training and define the probability distributions. An acoustic model, for example a Hidden Markov Model, is used to determine the sequence of sets of complex amplitude values corresponding to the sequence of acoustic units.
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal.
Figure 1 is a flow chart of a method of extracting parameters for use in speech synthesis from an input speech signal in accordance with an embodiment. The extracted information comprises a set of complex amplitude and dynamic slope values for each frame of the inputted speech signal. Each value of complex amplitude and dynamic slope corresponds to the sinusoidal component of the speech signal with the maximum spectral amplitude for a critical band. These values are then each assigned to correspond to the central frequency of the critical band. For each frame, a complex amplitude value and a dynamic slope value are extracted for each critical band.
These values can be used to generate a speech signal, by generating a sum of sinusoids for each frame of the signal, the frequencies of the sinusoids being the central frequencies of the critical bands, and the sinusoids having the assigned complex amplitudes and dynamic slope values. Frequency information corresponding to each frame is not needed in order to generate the speech signal, because the sinusoids have the same frequency for each frame (the central frequency of the critical bands). In one embodiment, only the complex amplitude and dynamic slope values vary between the frames and only one sinusoidal component is generated for each critical band, corresponding to the smallest frequency difference that a listener can perceive. This means that a small number of frequencies, or channels, are used to represent the speech. This also means that a small, fixed number of parameters are stored and used to generate the speech signal (i.e. 21 complex amplitudes and 21 dynamic slope values corresponding to 21 critical bands). Fewer parameters can be used, corresponding to certain critical bands for example. The small number of parameters means a reasonable processing time for generating a speech signal from the parameters in applications such as text to speech synthesis.
The fact that the number of parameters is fixed and small means that the parameters can be used in statistical speech synthesis. The fact that the sinusoids are generated based on the critical bands means the generated signal is high quality, because the critical bands correspond to the smallest frequency difference that a listener can perceive.
The approach used to analyse and then synthesize the speech uses a perceptually based dynamic sinusoidal model of speech: it is based on a sinusoidal model of speech and combines this with research results regarding the human auditory system. The approach uses information that is relevant for the human hearing system by modelling information relating to critical bands. The approach considers dynamic modelling of the speech signal, although in some embodiments dynamic features are not included. In these embodiments, the complex amplitude values are calculated and used to generate the speech and the dynamic slope values are not calculated. A method that does not consider dynamic modelling is described in relation to Figure 5. A method that does not include dynamic features could be used in a cochlear implant or hearing aid, for example. Hearing aids and cochlear implants can also use methods including dynamic features.
For applications such as text to speech synthesis, modelling the dynamic features of speech results in a high quality synthesized signal. This is because humans are very sensitive to the changes of a signal. Dynamic information is important whenever spectral information changes quickly over time, for example in expressive speech. Using dynamic features means more parameters are extracted from the speech and then used to synthesize speech. For example, where parameters are extracted for 21 critical bands and 9 band boundaries, 30 complex amplitude values are extracted. If dynamic slope values are also extracted, then 60 parameters are extracted in total, and can be used to synthesize the speech.
Extracting dynamic features and then using these features when synthesizing the speech signal means a high quality signal is produced.
In some embodiments, information is also extracted regarding sinusoidal components of the signal with the boundary frequency of a number of the critical bands. This feature is included in the method described below in relation to Figure 1, but not in the method described in relation to Figure 5.
The first step in the method is S101 "receive input speech signal". The input signal may be an electrical sound signal. The input signal may be an audio signal which is converted to an electrical sound signal. The electrical sound signal is sampled. The sampled speech signal is denoted by h(n), where n is the discrete time index. The number of sample points per second is the sampling frequency or sampling rate. In one embodiment, the sample rate is 16 kHz.
In step S102 "window the signal", the sampled electrical sound signal is divided into frames. The number of samples in each frame is N. In one embodiment, N is equal to 320, meaning that there are 320 samples in each frame. This is known as the frame size or window length. In one embodiment, the duration of the frame is 20 ms. Adjacent frames can overlap each other in time. In one embodiment, the frame shift, i.e. the time between the start of a frame and the start of the next frame, is 5 ms.
Each of the frames is windowed. Applying a window involves multiplying the sampled signal by a window function w(n). A window function such as the Hann window can be applied. The Hann window function is:

w(n) = 0.5 \left( 1 - \cos\left( \frac{2 \pi n}{N} \right) \right), \quad n \in [0, N-1]

Alternative window functions can be used.
In step S103 "Perform Fourier transform on each windowed frame of signal", a Fourier transform is applied, which converts the signal to the frequency domain.
The Fourier transform results in a speech magnitude spectrum for each frame (i.e. the spectral amplitude versus frequency information). A fast Fourier transform (FFT) algorithm with zero padding is used. Using zero padding means appending zero valued samples to the end of each windowed frame. This results in an increased resolution in the frequency domain. In one embodiment zero valued samples are added until the total number of samples in each frame is a number which is a power of 2. In one embodiment, the sample rate is 16kHz, the window duration is 2Oms and the number of signal samples in each window is 320.
Applying zero padding means that 192 zero values samples are added to the end of the frame, meaning that the total number of samples is 512 (which is 2 to the power of 9).
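As a rough illustration of steps S102 and S103, the sketch below frames, windows and zero-pads a sampled signal in Python. The function name and the use of numpy's FFT are assumptions for illustration; the parameter values simply mirror the 16 kHz, 20 ms, 5 ms and 512-point example above.

import numpy as np

# Illustrative framing parameters from the embodiment: 16 kHz sampling,
# 20 ms window (320 samples), 5 ms frame shift, FFT length padded to 512.
FS = 16000
FRAME_LEN = 320      # 20 ms at 16 kHz
FRAME_SHIFT = 80     # 5 ms at 16 kHz
NFFT = 512           # 320 samples plus 192 appended zeros

def frame_spectra(h):
    """Split a sampled speech signal into overlapping Hann-windowed frames
    and return the zero-padded FFT of each frame."""
    h = np.asarray(h, dtype=float)
    window = 0.5 * (1.0 - np.cos(2 * np.pi * np.arange(FRAME_LEN) / FRAME_LEN))
    starts = range(0, len(h) - FRAME_LEN + 1, FRAME_SHIFT)
    frames = [h[s:s + FRAME_LEN] * window for s in starts]
    # rfft with n=NFFT appends zeros to each 320-sample frame before transforming
    return [np.fft.rfft(f, n=NFFT) for f in frames]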
The next step is S104, "Perform peak picking to identify frequencies at which maximum spectral amplitude occurs for each critical band". The Fourier transform computes the magnitude spectrum, i.e. the magnitude values of the speech signal corresponding to each frequency. A critical band is a band of frequencies between which no difference can be perceived by a listener. This is because all the frequencies within the critical band activate the same area on the basilar membrane. In one embodiment, 21 critical bands are used. In one embodiment, the critical bands with the frequency ranges given in Table 1 below are used.
Table 1:
Critical Band   Central Frequency (Hz)   Lower Boundary Frequency (Hz)   Upper Boundary Frequency (Hz)
 1                150                      100                             200
 2                250                      200                             300
 3                350                      300                             400
 4                450                      400                             510
 5                570                      510                             630
 6                700                      630                             770
 7                840                      770                             920
 8               1000                      920                            1080
 9               1170                     1080                            1270
10               1370                     1270                            1480
11               1600                     1480                            1720
12               1850                     1720                            2000
13               2150                     2000                            2320
14               2500                     2320                            2700
15               2900                     2700                            3150
16               3400                     3150                            3700
17               4000                     3700                            4400
18               4800                     4400                            5300
19               5800                     5300                            6400
20               7000                     6400                            7300
21               7700                     7300                            8000

Although the above frequency bands are used in the embodiment described, in alternative embodiments other frequency bands, with different ranges, can be used. For example, frequency bands with different ranges which are based on perception criteria or evidence can be used. A different number of frequency bands can be used. There are many theories proposing frequency ranges of critical bands, therefore alternative frequency ranges that are proposed in alternative theories of critical bands can be used.
For a wideband speech signal (0 Hz - 8 kHz), 21 critical bands are assumed to cover the whole spectrum of the speech signal in this embodiment. The method predominantly extracts information which is pertinent for the human auditory system, since it extracts information in relation to each critical band. This means that a small number of parameters can be used to achieve a sufficient representation of speech.
For each frame, the frequency at which the maximum spectral amplitude occurs within each critical band is identified using peak picking. In other words, the local maximum (the maximum for the critical band) of the magnitudes computed by the Fourier transform is calculated for each critical band using peak picking. Peak picking is performed using the log of the magnitude information. A peak picking algorithm is used to detect the maxima in the spectrum for each band. The peak picking algorithm returns a set of 21 frequencies for each frame of the speech signal, each frequency corresponding to a critical band.
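A compact per-band peak-picking sketch is shown below, assuming the frame spectrum comes from a zero-padded real FFT as above. The band-edge list is taken directly from Table 1; the function name and the simple argmax-within-band search are illustrative assumptions rather than the particular peak-picking algorithm of the embodiment.

import numpy as np

# Critical band edges from Table 1 (Hz): the lower boundary of band 1 followed
# by the upper boundary of each of the 21 bands.
BAND_EDGES = [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
              1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7300, 8000]

def peak_frequencies(spectrum, fs=16000, nfft=512):
    """Return, for each critical band, the FFT bin frequency (in radians per
    sample) at which the log magnitude spectrum of one frame is largest."""
    spectrum = np.asarray(spectrum)
    freqs_hz = np.fft.rfftfreq(nfft, d=1.0 / fs)
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    peaks = []
    for lo, hi in zip(BAND_EDGES[:-1], BAND_EDGES[1:]):
        idx = np.where((freqs_hz >= lo) & (freqs_hz < hi))[0]
        best = idx[np.argmax(log_mag[idx])]        # local maximum in the band
        peaks.append(2 * np.pi * freqs_hz[best] / fs)
    return np.array(peaks)                          # 21 peak frequencies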
The next step is S105 "Calculate complex amplitude and dynamic slope values of sinusoidal components with identified frequencies and of sinusoidal components with frequency at boundaries of a number of critical bands". A model of the speech signal s(n) is constructed for each frame:

s(n) = \sum_{k=-B}^{B} (a_k^{max} + n b_k^{max}) e^{j \omega_k^{max} n} + \sum_{k=-B_L}^{B_L} (a_k^{bou} + n b_k^{bou}) e^{j \omega_k^{bou} n}

In this model, \omega_k^{max} is the frequency returned by the peak picking algorithm for the critical band k as the frequency at which the spectral amplitude is maximum. The first term therefore consists of 21 sinusoids, at the frequencies \omega_k^{max}, each having a complex amplitude a_k^{max} and a dynamic slope b_k^{max}. The dynamic slope values are complex slopes, i.e. the values are complex numbers. The slope parameter performs a different role to a static amplitude parameter. The slope parameter b_k^{max} is multiplied by the discrete time index n. This means that b_k^{max} has a meaning in terms of the slope.
In one embodiment, the second term consists of a further 9 sinusoids. In one embodiment, the frequencies of these sinusoids, \omega_k^{bou}, are the frequencies of the boundaries of 8 critical bands (in this case 8 critical bands with frequencies lower than 4000 Hz) and a further sinusoid with a frequency at 6400 Hz. In one embodiment, the frequencies of these sinusoids are those given in Table 2 below.
Table 2:
Frequencies of further 9 sinusoids (Hz)

The 8 lower frequency sinusoids are added because at lower frequencies humans are more sensitive and have a higher resolution. However, since pitch (i.e. the fundamental frequency for a human) does not go below 80 Hz, it is not necessary to add sinusoids corresponding to the boundaries for the bands below 1270 Hz (as the difference between the central frequency and the boundary frequency for these bands is less than 80 Hz). There is therefore a low probability of high energy at these frequencies.
For low frequencies, a strong sinusoidal component at the centre of a critical band will mask all other sinusoidal components in the band. However, this masking effect is not as powerful at the boundaries of the critical bands. Within each critical band, the masking effect decreases further away from the band centre. This implies that the sinusoids at the boundaries of the critical bands can potentially affect perception. In the lower frequencies humans are more sensitive compared to the higher frequencies (in other words we have higher resolution and can distinguish between frequencies), so these sinusoids can be important. In the high frequencies humans are less sensitive and therefore these sinusoids are not as important. Accordingly, the parameters (i.e. the complex amplitude and dynamic slope) required to generate sinusoidal components at the lower critical band boundaries are determined. The terms a_k^{bou} and b_k^{bou} are the complex amplitude and dynamic slope of the sinusoid at the boundary of the kth critical band, which has a frequency of \omega_k^{bou}.
The complex amplitudes a_k^{max} and a_k^{bou} (where a_{-k} = a_k^{*}) and dynamic slope values b_k^{max} and b_k^{bou} are estimated. In one embodiment, the a_k^{max}, a_k^{bou}, b_k^{max} and b_k^{bou} are estimated using a least squares approach. The a_k^{max} values can alternatively be obtained from the Fourier transform and peak picking computation. The a_k^{bou} values can also be obtained from the Fourier transform computation. However, a simple Fourier transform followed by a peak picking algorithm cannot be used to compute the dynamic slope values. Using a least squares approach therefore provides both the complex amplitude and dynamic slope values. The least squares approach also minimizes the error for both the complex amplitude and dynamic slope values at the same time. Furthermore, estimating the parameters using a least squares approach means that it is possible to use short analysis windows (of the order of 20 ms). For a method in which the parameters are estimated using Fourier transform computations, longer windows are used.
In the least squares approach, the parameters are computed for windowed frames by minimizing the error \epsilon between the speech model s(n) and the original speech h(n):

\epsilon = \sum_{n=0}^{N} w^2(n) (s(n) - h(n))^2

where N is the window length and w(n) is the analysis window for each frame (for example the Hann window function).
In some embodiments, the second term (comprising the boundary terms for the 9 critical bands) in the model is not included, and the model is constructed from the sinusoids with the identified frequencies. The complex amplitudes and dynamic slope values of the sinusoids at the identified frequencies are estimated from the least squares approach, resulting in a single set of 21 complex amplitudes and 21 dynamic slope values. In alternative embodiments, more or fewer than the 9 extra critical bands are used.
When the second term is included, for each frame the least squares estimation results in a first set of 21 complex amplitudes and 21 dynamic slope values, each of which corresponds to a sinusoid with a frequency which corresponded to the maximum spectral amplitude for the critical band for that particular frame. The frequencies of the sinusoids corresponding to the first set of values are therefore not the same frequencies for different frames. The least squares estimation also results in a second set of 9 complex amplitudes and 9 dynamic slope values, each of which corresponds to a sinusoid with a frequency of the boundary frequency of one of the 9 critical bands for each frame. The frequencies of the sinusoids corresponding to the second set of values are the same for each frame, i.e. they are always the frequencies of the boundaries of the 9 critical bands. The frequencies of the critical bands do not vary from frame to frame.
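The joint estimate of the two sets of values can be written as one linear least squares problem per frame. The following Python sketch is one way this might look, assuming the peak and boundary frequencies are supplied in radians per sample; the function name and the real/imaginary stacking are illustrative, not the exact computation used in the embodiment.

import numpy as np

def estimate_amplitudes_and_slopes(h, w, peak_omegas, boundary_omegas):
    """Jointly estimate complex amplitudes a_k and dynamic slopes b_k for
    sinusoids at the per-band peak frequencies and at the band boundary
    frequencies, by minimising sum_n w^2(n) (s(n) - h(n))^2 where
    s(n) = sum_k (a_k + n*b_k) exp(j*omega_k*n) plus the conjugate terms."""
    h = np.asarray(h, dtype=float)
    w = np.asarray(w, dtype=float)
    N = len(h)
    n = np.arange(N)
    omegas = np.concatenate([peak_omegas, boundary_omegas])
    E = np.exp(1j * np.outer(n, omegas))           # static basis
    D = n[:, None] * E                             # dynamic (slope) basis
    W = w[:, None]
    blocks = []
    for M in (E, D):                               # amplitudes first, then slopes
        blocks += [2 * (W * M).real, -2 * (W * M).imag]
    A = np.hstack(blocks)
    x, *_ = np.linalg.lstsq(A, w * h, rcond=None)
    K = len(omegas)
    a = x[0 * K:1 * K] + 1j * x[1 * K:2 * K]       # first 21 entries: per-band peaks
    b = x[2 * K:3 * K] + 1j * x[3 * K:4 * K]       # complex dynamic slopes
    return a, b

In step S106 the first 21 of these amplitude and slope values would simply be re-labelled with the band centre frequencies; no further computation is needed for the assignment.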
The next step is S106 "Assign complex amplitudes and dynamic slope values of sinusoidal components with identified frequencies to central frequency of corresponding critical band". The first set of values are assigned to correspond to a central frequency of the critical band. For example, for a particular frame, the complex amplitude and dynamic slope value corresponding to the sinusoidal component with a frequency which corresponded to the maximum spectral amplitude for the first critical band are assigned to a sinusoidal component with a frequency of the centre of the first critical band. The frequencies of the sinusoids corresponding to the first set of values are now the same for each frame, i.e. they are always the frequencies of the centres of the 21 critical bands.
In some embodiments, the frequency that corresponds to the exact centre of each critical band is used. In other embodiments, a frequency that is substantially central to the critical band is used, and the complex amplitudes (and dynamic slope values, if included) are assigned to the frequency that is substantially central to the critical band, for example a frequency within 150 Hz of the centre of the critical band. The same frequency which is chosen as the substantially central frequency is used when analysing each frame of speech. For example, the central frequencies given in Table 1 can be used as the central frequencies of the critical bands to which the parameter values are assigned. In alternative embodiments, any fixed frequency within the critical band can be chosen for assigning the parameters. The frequency is fixed in the sense that the parameters are assigned to the same frequency for each frame of speech analysed. The same frequency can also be used to synthesize the speech signal in the synthesis stage.
In the method, the estimated complex amplitudes for each critical band for the frequency at which the spectral amplitude is maximum are returned. These complex amplitudes can be used to synthesize a speech signal. Sinusoidal components with frequencies equal to the central frequencies of the bands and using the assigned complex amplitudes are generated in order to produce a speech signal. By using the complex amplitudes that were estimated for the frequency at which the amplitude is maximum, a synthesized signal will not sound muffled or attenuated, because most of the energy of the original speech signal is modelled and used to synthesize speech. Loss of energy of the signal is reduced. During synthesis, the speech signal may be synthesized from sinusoids with parameters obtained from frequency shifting (or assignment of the amplitudes from the frequencies where the maximum amplitude occurred to the centre frequency of the band). For higher frequencies, further sinusoids can be used. This is described in more detail in relation to Figure 2. This means that the frequency shifting, or assignment, does not modify the quality of the sound (in terms of phone identity) and speaker identity.
The method extracts information from the input speech signal in the form of a first set of complex amplitude and dynamic slope values and a second set of complex amplitude and dynamic slope values. Each pair of complex amplitude and dynamic slope values from the second set corresponds to a sinusoid with the boundary frequency of one of the 9 critical bands. It is not necessary to store frequency information for each set of values because the frequencies will be the same for each frame. The information stored for each frame is the sets of values, information identifying which band each value corresponds to and whether each value corresponds to the centre or boundary frequency. The number of parameters extracted for each frame is therefore fixed. The number of parameters, in this case 60, is chosen based on knowledge of human perception.
The method described is performed by a system for extracting parameters from an audio signal, comprising a processor. For example, such a system may be a speech processor for a cochlear implant, hearing aid or a component of a vocoder.
A vocoder may be connected to a component such as a microphone in order to receive a speech signal. Alternatively, a recorded speech signal may be inputted.
For a cochlear implant, the processor is located externally, and the device also comprises an external component such as a microphone, configured to detect sound and convert it to an electrical sound signal. The electrical sound signal is transmitted to the speech processor. Parameters are extracted from this received speech signal.
The method used in a hearing aid or a cochlear implant may not include modelling the dynamic features. In this case, the complex amplitudes would be extracted from the inputted speech signal and then used to generate a speech signal. Such a method is described in relation to Figure 5. It should be noted that although the method described in relation to Figure 5 does not include extracting information regarding sinusoids at the boundaries of the critical bands, this feature could of course be included without including dynamic features. Including dynamic features means that the synthesized speech signal quality is good, as the representation is not as parsimonious.
Hearing aids and cochlear implants may use a few channels, for example 5 or 6 channels. The channels correspond to particular frequencies. For example, a cochlear implant may synthesize a speech signal using 5 sinusoids, at five different frequencies. These frequencies are known as channels. In order to implement a method in which speech is synthesized using one sinusoid for each of 21 critical bands (such as is described in relation to Figure 6), the hearing aid or cochlear implant would therefore comprise 21 channels. For a method in which 9 boundary sinusoids are also synthesized, the hearing aid or cochlear implant would comprise 30 channels. For a method such as is described in relation to Figure 2, where further sinusoids at higher frequencies are added, the hearing aid or cochlear implant may have many more channels.
In a cochlear implant, a speech processor extracts the complex amplitude values from a received speech signal. The processor then generates an electrical speech signal using the extracted values, by generating a signal comprising a sum of sinusoids using the complex amplitudes and the central band frequencies. The speech processor may also modify the signal, or modify selected sinusoidal components of the signal. For example, the sinusoidal components at the higher frequencies may be amplified by the speech processor. The electrical speech signal is then sent through a cable to a transmitter. The transmitter transmits the signal to an internal device. In the internal device, the signal is converted to electrical impulses which are sent to the nerves inside the ear.
In a cochlear implant, the input speech signal is received, processed in order to extract the information, and then an electrical speech signal is continuously generated from the information. As the speech signal is received, each frame is extracted and processed, and then a frame of an electrical signal corresponding to the original frame is generated and transmitted. Modifications may be made to the generated signal before it is transmitted, for example amplifying certain frequencies. The general method of synthesizing a frame of a generated speech signal from parameters extracted from a frame of an original speech signal is described in relation to Figure 2.
For other applications, for example text to speech (TTS) applications, the information for each frame can be stored and transferred to another device where the speech synthesis takes place. Methods of synthesizing a speech signal from inputted text using the stored values are described in relation to Figures 10 and 11.
Figure 2 is a flow chart of a method of synthesizing a speech signal in accordance with an embodiment. The method can be performed by a system for extracting parameters from an audio signal such as performs the method described in relation to Figure 1 or Figure 5. Alternatively, the method can be performed by a separate speech synthesizer, into which the values extracted from an audio signal by a system such as described in relation to Figure 1 are inputted or stored.
A sinusoidal model of speech represents speech as a sum of sinusoids. Many acoustic signals, and the human voice and music in particular, can be efficiently modelled as a sum of sinusoids. Speech can therefore be synthesized by generating sinusoidal components and summing them. This approach allows modification of many speech characteristics such as timbre (i.e. the quality of the sound) and duration (i.e. where the speech is slower or faster).
In a sinusoidal based model, a number of non-harmonically related sinusoids can be used to represent speech, for example:

s(n) = \sum_{k=-K}^{K} A_k e^{j (\omega_k n + \theta_k)}

where \theta_k is the phase of the kth sinusoid, A_k is the amplitude of the kth sinusoid and \omega_k is the frequency of the kth sinusoid. As A_k e^{j \theta_k} is invariant, it is also possible to model the speech as:

s(n) = \sum_{k=-K}^{K} a_k e^{j \omega_k n}

where a_k is a complex amplitude (a_{-k} = a_k^{*}). K is the number of sinusoids used for each frame. Speech can therefore be synthesized by generating sinusoids and summing them.
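Because of the conjugate symmetry a_{-k} = a_k^{*}, a real-valued frame can be generated from the positive-frequency amplitudes alone. A small Python sketch of this (the names are illustrative, not the patent's own code) is:

import numpy as np

def synth_frame(a, omegas, N):
    """Generate one frame of a real signal from complex amplitudes a_k and
    frequencies omega_k (radians per sample): s(n) = sum_k a_k e^{j w_k n}
    plus the conjugate terms, which reduces to 2*Re(...) for a real signal."""
    n = np.arange(N)
    return 2.0 * np.real(np.exp(1j * np.outer(n, omegas)) @ np.asarray(a))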
The speech analysis method described in relation to Figure 1 extracts a first set of values, comprising one complex amplitude value and one dynamic slope value for each critical band, and a second set comprising one further complex amplitude value and one further dynamic slope value for a certain number of critical bands.
These values can then be used to synthesize speech by generating sinusoids using these parameters. The number of parameters, and the frequency of the sinusoids that the parameters correspond to, is fixed. The number of parameters is limited by the number of critical bands used, plus an extra set of values for a number of the critical bands. The speech is therefore modelled with one dynamic sinusoidal component per critical band (plus an extra set for a number of the critical bands in this embodiment).
During synthesis therefore, the number of sinusoids generated is also fixed. The sinusoidal components are generated such that there is one sinusoidal component per critical band, with the central frequency of the critical band, and the extra sinusoidal components with the boundary frequencies for a number of critical bands. Using frequencies that correspond to the critical bands means that the signal is high quality, as the critical bands correspond to the smallest frequency difference that a listener could perceive. In human perception, the range of sound sensitivity is broad. The auditory system is more sensitive to lower frequencies than to the higher frequencies. Furthermore, a range of frequencies may be perceived as the same, as they activate the same area on the basilar membrane.
These ranges are the critical bands. Sinusoids in the same critical band may therefore be redundant. By using one sinusoid corresponding to each critical band to generate the signal, the speech is generated with a small number of sinusoids, and the perceptually salient characteristics of a speech signal are retained. The model can be expanded at low frequencies by incorporating sinusoids at the boundaries of the corresponding bands while at the higher frequencies a modulated noise component can be used.
The method shown is that for generating a single frame of a speech signal. Several frames are generated and the frames overlapped and added in order to generate the entire speech signal. In other words, the speech signal is generated using an overlap and add (OLA) procedure.
The first step is S401 "Input values of first set of complex amplitude and dynamic slope values assigned to critical band central frequencies and second set of complex amplitude and dynamic slope values corresponding to critical band boundary frequencies". In order to generate a frame of speech, values of complex amplitude and dynamic slope are inputted, i.e. a_k^{max}, b_k^{max}, a_k^{bou} and b_k^{bou}. For the case of a cochlear implant, the values extracted from the frames of the received speech are inputted frame by frame in order to generate the outputted electrical signal. Alternatively, the values may be stored in a speech synthesizer. When a particular outputted speech signal is required, the frames corresponding to the particular output are selected, and the values stored for these frames are input frame by frame in order to generate the output speech signal.
Step S402 is "Synthesize a first component of signal by generating sinusoids up to selected frequency from first and second set of values". A first component of the output signal is generated comprising a sum of sinusoids up to a certain maximum frequency. In one embodiment, the first component of the speech signal is generated comprising the sinusoids up to 4000 Hz, i.e. up to the 17th critical band.
The component is generated from a sum of sinusoids using the values of the complex amplitude and the dynamic slope. The component is given by:

s_1(n) = \sum_{k=-17}^{17} (a_k^{max} + n b_k^{max}) e^{j \omega_k^{cen} n} + \sum_{k=-B_L}^{B_L} (a_k^{bou} + n b_k^{bou}) e^{j \omega_k^{bou} n}

The values of a_k^{max}, b_k^{max}, a_k^{bou} and b_k^{bou} are those extracted and stored from the speech analysis method described in relation to Figure 1. The frequency \omega_k^{cen} is the central frequency of the kth critical band and \omega_k^{bou} is the boundary frequency of the kth critical band.
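One possible sketch of step S402 in Python is given below; the function and argument names are illustrative, and the frequencies are assumed to be supplied in radians per sample.

import numpy as np

def first_component(a_cen, b_cen, omega_cen, a_bou, b_bou, omega_bou, N):
    """Low-frequency component of one synthesis frame: dynamic sinusoids at
    the centre frequencies of bands 1-17 plus the stored boundary sinusoids.
    All inputs are numpy arrays; frequencies are in radians per sample."""
    n = np.arange(N)[:, None]
    centre = (a_cen + n * b_cen) * np.exp(1j * n * omega_cen)
    bound = (a_bou + n * b_bou) * np.exp(1j * n * omega_bou)
    # The conjugate (negative frequency) terms make the result real-valued.
    return 2.0 * np.real(centre.sum(axis=1) + bound.sum(axis=1))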
In one embodiment, there are four critical bands above 4000 Hz. The critical bands become sparse in the higher frequency range, therefore more sinusoidal components can be added to the synthesized speech in order to produce a good signal quality. Based on the fact that the human auditory system is not as selective (i.e. not as able to distinguish between frequencies) at high frequencies as at the low frequencies, a time and frequency domain modulated noise, covering the high frequencies, is added to the model used to synthesize speech. A random sinusoidal signal is obtained with amplitudes obtained through interpolation every 100 Hz of the amplitudes estimated at the high frequency bands (i.e. k = 18 to 21) and random phase (step S406). No dynamic features are used for this random signal.
The signal is further modulated over time by the time-domain envelope (estimated through the Hilbert transform) of the sinusoidal signal made up of the sinusoidal components for bands 18 to 21 (S404). The speech signal including this high frequency component generated from modulated coloured noise has more sinusoidal components for the high frequency critical bands. The signal quality is therefore good. This also mitigates perceptual distortion. Perceptual distortion can cause a "tube effect", where the synthesised sound sounds as though it has been spoken through a tube, with some frequencies having completely disappeared. This is especially critical for fricative sounds.
The high frequency information generated during synthesis is generated by modulating coloured noise. This means that when the synthesis is used for applications such as speech synthesis from text, the synthetic speech output will not sound buzzy.
The first step in generating the high frequency component is step S403, "Generate sinusoids above selected frequency". The sum of the sinusoids above the selected frequency, s_{hf}(n), in this case the sinusoids for bands 18, 19, 20 and 21, is calculated using the first set of values for those bands:

s_{hf}(n) = (a_{18}^{max} + n b_{18}^{max}) e^{j \omega_{18}^{cen} n} + (a_{19}^{max} + n b_{19}^{max}) e^{j \omega_{19}^{cen} n} + (a_{20}^{max} + n b_{20}^{max}) e^{j \omega_{20}^{cen} n} + (a_{21}^{max} + n b_{21}^{max}) e^{j \omega_{21}^{cen} n}

In step S404, "Estimate envelope of sum of sinusoids", the envelope of the signal s_{hf}(n) is estimated, for example using the Hilbert transform or another method, resulting in a time-domain envelope. Given a signal, i.e. the sum of sinusoids s_{hf}(n), in this step the time domain envelope is determined.
The Hilbert transform is used to compute the time-domain envelope of the signal.
The Hilbert transform method starts with a real signal, in this case the signal s_{hf}(n). The next step is to compute the analytic form of the signal through the Hilbert transform. The analytic form of the signal is given by:

\hat{s}_{hf}(n) = s_{hf}(n) + j \tilde{s}_{hf}(n)

where j is the imaginary unit and \tilde{s}_{hf}(n) is the Hilbert transform of s_{hf}(n), i.e. the convolution of s_{hf}(n) with the function 1/(\pi n). The resultant function \hat{s}_{hf}(n) gives a complex value at every time instant n. The absolute value of that complex signal gives the time-domain envelope f_{en}(n) of the signal s_{hf}(n) at each time instant n.
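In practice the analytic signal and its envelope can be obtained with an FFT-based Hilbert transform. The sketch below uses scipy.signal.hilbert for steps S403 and S404; the function and variable names are illustrative assumptions.

import numpy as np
from scipy.signal import hilbert

def high_band_envelope(a_hi, b_hi, omega_hi, N):
    """Build the sum of the dynamic sinusoids for bands 18-21 (a_hi, b_hi,
    omega_hi as numpy arrays, frequencies in radians per sample) and return
    both the signal and its time-domain envelope, i.e. the magnitude of the
    analytic signal obtained with the Hilbert transform."""
    n = np.arange(N)[:, None]
    s_hf = 2.0 * np.real(((a_hi + n * b_hi) * np.exp(1j * n * omega_hi)).sum(axis=1))
    return s_hf, np.abs(hilbert(s_hf))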
In step S405, "Interpolate complex amplitude values for bands above selected frequency", the values of the complex amplitude for the bands above the selected frequency, i.e. in this case for bands 18, 19, 20 and 21, are interpolated. In other words, the sequence of values a_{17}^{max}, a_{18}^{max}, a_{19}^{max}, a_{20}^{max} and a_{21}^{max} is interpolated in order to determine intermediate estimates of the complex amplitude. In other words, values of the complex amplitude for sinusoids every 100 Hz between \omega_{17}^{cen} and \omega_{21}^{cen} are found. Any interpolation method can be used, for example linear interpolation.
In step S406, "Generate sinusoids with interpolated complex amplitude values and random phase", sinusoids are generated with frequencies every 100 Hz from the centre of the 17th critical band to the centre of the 21st critical band, using the interpolated complex amplitude values and random phase values. The form of these generated sinusoids is given by:

s_i(n) = a_i e^{j \omega_i n} e^{j \theta_i}

where \omega_i is the frequency and \theta_i is the random phase. For example, the frequency of the first sinusoid is \omega_1 = \omega_{17}^{cen} + 100, the frequency of the second sinusoid is \omega_2 = \omega_{17}^{cen} + 200 and so on until the frequency \omega_{18}^{cen} is reached. The frequency of the next generated sinusoid after \omega_{18}^{cen} is reached is \omega_{18}^{cen} + 100 and so on.
The value of a_i is calculated by first determining the number of divisions between the centre frequencies. For the sinusoids between \omega_{17}^{cen} and \omega_{18}^{cen}, the number of divisions is calculated by d_1 = (\omega_{18}^{cen} - \omega_{17}^{cen}) / 100. For the sinusoids between \omega_{18}^{cen} and \omega_{19}^{cen}, the number of divisions is calculated by d_2 = (\omega_{19}^{cen} - \omega_{18}^{cen}) / 100, and so on.
The complex amplitude is then determined by interpolation. For example, for the first sinusoid, a_1 = a_{17}^{max} + \frac{a_{18}^{max} - a_{17}^{max}}{d_1}. For the second sinusoid, a_2 = a_{17}^{max} + 2 \frac{a_{18}^{max} - a_{17}^{max}}{d_1}. For the first sinusoid with a frequency between \omega_{18}^{cen} and \omega_{19}^{cen}, the complex amplitude is determined by a_{18}^{max} + \frac{a_{19}^{max} - a_{18}^{max}}{d_2}, and so on. The value of \theta_i is chosen at random for each sinusoid.
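One way steps S405 and S406 might look in Python is sketched below; the linear interpolation, the function name and the use of numpy's random generator are assumptions for illustration, and the frequencies are handled in Hz.

import numpy as np

def coloured_noise(a_hi, centre_hz, fs=16000, N=320, step_hz=100.0):
    """Generate the random-phase sinusoids whose amplitudes are obtained by
    linearly interpolating the band amplitudes a_17..a_21 (a_hi) every 100 Hz
    between the 17th and 21st band centres (centre_hz, in Hz)."""
    a_hi = np.asarray(a_hi)
    rng = np.random.default_rng()
    n = np.arange(N)
    freqs = np.arange(centre_hz[0] + step_hz, centre_hz[-1] + step_hz, step_hz)
    # Interpolate real and imaginary parts of the complex amplitudes separately.
    amps = (np.interp(freqs, centre_hz, a_hi.real)
            + 1j * np.interp(freqs, centre_hz, a_hi.imag))
    s_cn = np.zeros(N)
    for f, a in zip(freqs, amps):
        theta = rng.uniform(0.0, 2.0 * np.pi)        # random phase per sinusoid
        s_cn += 2.0 * np.real(a * np.exp(1j * (2.0 * np.pi * f / fs * n + theta)))
    return s_cn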
All of these generated sinusoids are then summed to give s_{cn}(n) = \sum_i s_i(n). This summed signal is a coloured noise signal. The signal is referred to as a coloured noise signal because the energies of all the frequencies are not the same.
In step S407 "Multiply sum of sinusoids by time domain envelope to synthesize second component of signal", the noise signal s_{cn}(n) is multiplied by the time domain envelope f_{en}(n) obtained from step S404. This gives the second component of the output speech signal, s_2(n), where s_2(n) = f_{en}(n) s_{cn}(n).
In step S408, the first and second components are summed. A window is applied to the summed components (for example, the sum of the first and second components is multiplied by the Hann window function) to give the output speech signal. Applying the window means that the signal is limited to the length of the frame. Using a window such as a Hann window avoids discontinuities at the frame boundaries. The speech signals for each frame are then overlapped and added. This final speech signal can be outputted as an electrical signal, for example in a cochlear implant, or as an audio signal.
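Steps S407 and S408 combine the two components, window each frame and overlap-add the frames. A brief Python sketch under the same 20 ms / 5 ms framing assumption (names illustrative) follows.

import numpy as np

def assemble_frame(s1, s_cn, f_en, window):
    """S407/S408 for one frame: modulate the coloured noise by the envelope,
    add it to the first component and apply the synthesis window."""
    return window * (s1 + f_en * s_cn)

def overlap_add(frames, frame_shift=80, frame_len=320):
    """Overlap-add the windowed synthesis frames (5 ms shift, 20 ms frames at
    16 kHz in the embodiment above) into the final output signal."""
    out = np.zeros(frame_shift * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        start = i * frame_shift
        out[start:start + frame_len] += frame
    return out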
The order of the steps is not important, although of course S403 and S404, and S405 and S406, must all be performed before step S407 can be performed.
Similarly, S407 and S402 must be performed before S408. However, for example, step S402 can be performed simultaneously with steps S403 to S407, or steps S405 and S406 can be performed first, then steps S403 and S404, then step S407, and finally steps S402 and S408. Variations of this order can be implemented.
The Perceptual Evaluation of Speech Quality (PESQ) was calculated for sentences of speech synthesized using the method described (with the dynamic features included). PESQ compares the original signal with a reconstructed signal synthesized using the method described in Figure 2. Phonetically balanced speech data from 3 male and 4 female English speakers was selected for testing. Five neutral speaking sentences were selected for each speaker, with a 16 kHz sampling rate. The frame shift was set to 5 ms with a window length of 20 ms. A reference implementation of the model was used to create stimuli for a listening test using copy synthesis. PESQ was calculated as an objective error measure. The average PESQ value of all 35 sentences of the seven speakers was 3.2183. The quality of the generated speech was also confirmed with informal listening tests.
The analysis and synthesis processes described in relation to Figures 1 and 2 use a perceptual dynamic sinusoidal model based on critical bands (PDM) for representing speech. One sinusoidal component is used in each critical band. For each band, the sinusoid with the maximum spectral amplitude is selected and the spectral amplitude assigned to the central frequency of the band. Frequency and time-domain envelope modulation of a noise component at the highest frequencies and sinusoidal components at the critical band boundaries for a number of frequencies can be included. These result in removal of the "tube effect".
Figure 3 is a speech analysis operation which is a method for extracting parameters from an audio signal in accordance with an embodiment. The component operations are labelled by the numbers in circles. The speech analysis operation is based on a perceptually motivated dynamical model of speech.
A speech signal is inputted. The speech signal may be a sound signal received by a sensor such as a microphone. The speech signal may be a recorded audio file of speech. The speech signal may be an electrical speech signal.
The signal can be analysed pitch synchronously or asynchronously. The speech is analysed at a fixed rate, which depends on the overlap of the frames (or the frame duration where the frames are contiguous). For the method described in relation to Figure 3, the frame shift is 5ms, therefore the speech is analysed every 5ms.
During speech, the glottis opens and closes; for a female speaker this happens, on average, on the order of 200 times a second. The time instants at which the glottis is closed are called Glottal Closure Instants (GCIs). If the centres of the analysis windows are aligned with the GCIs, this is referred to as pitch synchronous analysis. If the GCIs are not taken into account, this is referred to as pitch asynchronous analysis. In the method described in relation to Figure 3 therefore, in which the windows occur every 5ms without reference to the GCIs, the analysis is pitch asynchronous.
However, the method described in Figure 3 can be used in pitch synchronous analysis by aligning the centre of the windows with the GCIs.
A windowing operation 1 is performed on the signal. The windowing operation 1 first divides the signal into frames. The frames can be contiguous or overlapping.
The windowing operation then involves applying a window function to the signal, by multiplying the signal in each frame by a window function such as the Hann function. The window may have a duration of 20ms, for example. A Fourier transform of each windowed frame is then performed. Using the log of the magnitude information, a peak picking operation 2 is performed. This operation provides the frequencies at which the local maxima are located.
Using this information, and taking into account the specific properties of the perceptually motivated dynamical model approach, which uses one sinusoid per critical band (the one that has the maximum amplitude) as well as a sinusoid located at each boundary between contiguous critical bands, the least squares operation 4 applies a least squares criterion to estimate the complex amplitude (a_k) and dynamic slope (b_k) at the frequencies where the local maxima are located and at the frequencies at the boundaries of the critical bands. A model of the speech using sinusoids at these frequencies is constructed in the operation labelled PDM 3. This model is compared to the input speech signal and a least squares method is used to determine the complex amplitudes and slopes.
Two sets of complex amplitudes and slopes are estimated. The first set comprises 21 complex amplitudes and slopes, one complex amplitude and slope for each critical band, for the frequency at which the maximum amplitude of the band is located. The second set comprises 9 complex amplitudes and slopes at the boundaries of the 9 critical bands.
In the assignment operation 5, the first set of complex amplitudes and slopes are moved to the centre frequency of each band. In this way, no frequency information needs to be saved.
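The analysis operations described above (windowing, Fourier transform, per-band peak picking and assignment to the band centre) can be sketched as follows. This is an illustration only: the band edges shown are a common Bark-scale approximation rather than values taken from the patent, and the function and variable names are assumptions.

```python
import numpy as np

# Approximate Bark-scale critical band edges in Hz (an assumed approximation;
# the patent does not list exact values). 22 edges define 21 bands up to ~7.7 kHz.
BAND_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
              1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700]

def analyse_frame(frame, fs):
    """For each band, return the frequency of the spectral peak, its complex
    value, and the band centre frequency to which the amplitude is assigned."""
    window = np.hanning(len(frame))
    spectrum = np.fft.rfft(window * frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    out = []
    for lo, hi in zip(BAND_EDGES[:-1], BAND_EDGES[1:]):
        idx = np.where((freqs >= lo) & (freqs < hi))[0]
        if len(idx) == 0:
            continue
        k = idx[np.argmax(np.abs(spectrum[idx]))]   # peak picking within the band
        centre = 0.5 * (lo + hi)                    # fixed frequency for assignment
        out.append((freqs[k], spectrum[k], centre))
    return out
```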
Figure 4 is a speech synthesis operation which is a method of synthesizing a speech signal in accordance with an embodiment. During synthesis, the complex amplitudes and slopes from the first and second sets of values determined during speech analysis are used to generate the signal up to a certain maximum frequency. In this embodiment, the maximum frequency is set to 4000 Hz, or up to the 17th critical band. This part of the signal is generated in operation 6, PDM<f, where f is the maximum frequency.
In operation 7, PDM>f, the signal above 4000Hz is synthesized. In operation 8, the envelope of the signal generated in operation 7 is estimated, using for example the Hilbert transform or another method.
In operation 9, new sinusoids are created by interpolating between the estimated complex amplitudes. For example, linear interpolation may be performed to estimate a value of the complex amplitude every 100 Hz; other interpolation schemes can be used. Using these interpolated amplitudes with random numbers as phase, a coloured noise signal is generated in operation 9. The dynamic slope values are not used. The time domain envelope estimated in operation 8 modulates the coloured noise signal by the multiplication operation shown in the figure. This results in a high pass signal above the frequency f, and operation 6 results in the signal below that frequency. The addition of the two signals provides the final reconstructed speech signal.
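A rough sketch of operations 7 to 9 (high-band sinusoids, envelope estimation via the Hilbert transform, and envelope-modulated coloured noise) might look like the following; the function name, the magnitude interpolation on a 100 Hz grid and the array-based interface are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import hilbert

def high_band_component(centre_freqs, amps, f_max, fs, n_samples):
    """Synthesize the band above f_max as envelope-modulated coloured noise."""
    t = np.arange(n_samples) / fs
    hb = centre_freqs > f_max
    # Operation 7: deterministic high-band signal from the stored amplitudes.
    s_high = sum(2 * np.real(a * np.exp(2j * np.pi * f * t))
                 for f, a in zip(centre_freqs[hb], amps[hb]))
    # Operation 8: time-domain envelope of that signal (Hilbert transform).
    envelope = np.abs(hilbert(s_high))
    # Operation 9: interpolate amplitude magnitudes every 100 Hz, random phase.
    grid = np.arange(f_max, centre_freqs[hb].max(), 100.0)
    interp_amp = np.interp(grid, centre_freqs[hb], np.abs(amps[hb]))
    phases = np.random.uniform(0, 2 * np.pi, len(grid))
    noise = sum(a * np.cos(2 * np.pi * f * t + p)
                for f, a, p in zip(grid, interp_amp, phases))
    return envelope * noise   # coloured noise, modulated by the envelope
```

In the described flow, the sum of this high-band component and the low-band sinusoidal component from operation 6 gives the reconstructed frame.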
Figure 5 is a flow chart of a method of extracting speech parameters from an audio signal in accordance with an embodiment in which dynamic features are not included. The choice of the number of sinusoids used for each frame is based on knowledge of human perception: to match the smallest frequency difference that a listener can perceive, one sinusoid is used for each critical band. In an embodiment, 21 sinusoids at the frequencies of the 21 critical band centres are used. For each frame of speech synthesized, the sinusoidal components will have the same frequencies (i.e. the central band frequencies). In one embodiment, frequency information is not extracted or modelled and the deterministic frequency information provided by the critical bands is used. A small number of sinusoids with fixed frequencies are used to model the speech. In other words, the model is fixed and low dimensional.
The method does not rely on any voiced/unvoiced estimation (i.e. estimating the voiced/unvoiced decisions made during speech, voiced being when the vocal folds vibrate as in /a/ and /o/ sounds, unvoiced being sounds like /s/ and /f/ where no vocal fold vibration occurs) or pitch estimation to perform its analysis. It is therefore robust, especially for difficult to model areas of speech such as onsets and offsets (which are associated with the beginning and ending of a word or sentence) or expressive speech.
Steps S301 to S304 are the same as described in relation to Figure 1. Step S305 is "Calculate complex amplitude of sinusoidal components with identified frequencies". A model of the speech signal s(n) is constructed for each frame: s(n) = Σ_{k=-21}^{21} a_k^max e^{j ω_k^max n}. In this model, ω_k^max is the frequency returned by the peak picking algorithm for critical band k as the frequency at which the spectral amplitude is maximum. The model therefore consists of 21 sinusoids at the frequencies ω_k^max, each having a complex amplitude a_k^max.
The complex amplitudes are estimated. The values can simply be those obtained from the Fourier transform computation followed by peak picking. In one embodiment, the a_k^max are estimated using a least squares approach, in which the error is minimised between the speech model s(n) and the original speech h(n): ε = Σ_{n=1}^{N} w²(n)(s(n) − h(n))², where N is the window length and w(n) is the analysis window. The estimation results in a single set of 21 complex amplitudes.
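For illustration, this weighted least squares estimate can be written as a small sketch; a one-sided complex-exponential basis is used here for brevity (the model above sums over positive and negative frequencies), and all names are assumptions.

```python
import numpy as np

def estimate_amplitudes(frame, peak_freqs, fs):
    """Weighted least squares fit of one complex exponential per identified peak."""
    N = len(frame)
    n = np.arange(N) - N // 2                       # time index centred on the window
    w = np.hanning(N)
    omegas = 2 * np.pi * np.asarray(peak_freqs) / fs
    E = np.exp(1j * np.outer(n, omegas))            # N x K basis matrix
    W = np.diag(w ** 2)
    # Minimise sum_n w(n)^2 |s(n) - h(n)|^2 via the normal equations.
    a = np.linalg.solve(E.conj().T @ W @ E, E.conj().T @ W @ frame)
    return a                                        # one complex amplitude per band
```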
The next step is S306 "Assign complex amplitudes of sinusoidal components with identified frequencies to central frequency of corresponding critical band".
Using the extracted parameters a_k^max, a frame of a speech signal can be synthesized from a sum of sinusoids: s(n) = Σ_{k=-21}^{21} a_k^max e^{j ω_k^c n}, where ω_k^c is the central frequency of the kth critical band.
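A minimal synthesis counterpart, assuming the amplitudes have already been assigned to the band centre frequencies (names are illustrative; the factor of two and the real part stand in for the conjugate, negative-frequency terms of the model):

```python
import numpy as np

def synthesize_frame(amps, centre_freqs, fs, n_samples):
    """Reconstruct one frame as a sum of sinusoids at the fixed band centres."""
    t = np.arange(n_samples) / fs
    frame = np.zeros(n_samples)
    for a, f in zip(amps, centre_freqs):
        frame += 2 * np.real(a * np.exp(2j * np.pi * f * t))
    return frame
```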
Figure 6 is a flow chart of a method of synthesizing a speech signal in accordance with an embodiment. In this method, dynamic features are not included; however, sinusoids are generated having the boundary frequencies of a number of the critical bands. In an alternative embodiment, the sinusoids with the boundary frequencies are not generated. The method could be performed by the speech processor in a cochlear implant, for example. The speech processor may first analyse a received frame of the speech signal in order to extract the values of complex amplitude corresponding to the maximum spectral amplitude for each critical band and the values of complex amplitude corresponding to the band boundaries for the number of critical bands. The processor then assigns the values of complex amplitude corresponding to the maximum spectral amplitude to sinusoidal components with the central frequency of the band. For each frame therefore, the processor extracts a first set of complex amplitudes assigned to the central frequencies of the critical bands and a second set of complex amplitudes assigned to the critical band boundaries for a further number of bands. In this embodiment, dynamic features are not included. The speech processor then performs the method shown in Figure 6 in order to generate an electrical speech signal.
The first step is S601 "Input values of first set of complex amplitude and dynamic slope values assigned to critical band central frequencies and second set of complex amplitude and dynamic slope values corresponding to critical band boundary frequencies". In order to generate a particular frame of speech, values of complex amplitude are inputted, i.e. a_k^max and a_k^b.
Step S602 is "Generating sum of sinusoids with central band frequency and complex amplitudes from first set of values". A first component of the output signal is generated comprising a sum of sinusoids given by: s_1(n) = Σ_{k=-21}^{21} a_k^max e^{j ω_k^c n}. Step S603 is "Generating sum of sinusoids with band boundary frequency of a number of bands and complex amplitudes from second set of values". A second component of the output signal is generated comprising a sum of sinusoids given by: s_2(n) = Σ_{k=-9}^{9} a_k^b e^{j ω_k^b n}. In step S604, these components are added together to give the output signal.
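Reusing the hypothetical synthesize_frame sketch above, steps S602 to S604 could be combined as follows (a_centre, a_boundary and the two frequency arrays are assumed inputs):

```python
# First set: amplitudes at the 21 band centres; second set: 9 boundary sinusoids.
s = (synthesize_frame(a_centre, centre_freqs, fs, n_samples)
     + synthesize_frame(a_boundary, boundary_freqs, fs, n_samples))
```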
Figures 7a and 7b show two speech magnitude spectrums, with estimated complex amplitudes indicated. In Figure 7a, the complex amplitudes are estimated for sinusoids with the frequency of the centre of each critical band. In Figure 7b, the complex amplitudes estimated for the sinusoidal components with the maximum spectral amplitude for each critical band are indicated, having been moved to the central frequency of the band. A second set of complex amplitudes, estimated using the same method used to estimate the complex amplitudes for the spectrum in Figure 7a, are also indicated for comparison.
Figure 7a shows a speech magnitude spectrum. The horizontal axis is frequency in Hz. The vertical axis is spectral amplitude, which is shown on a logarithmic scale.
The critical band boundaries are indicated with dashed lines. The frequency spectrum shown is that of a single frame of a speech signal. A Fourier transform has been applied to the speech signal in order to generate the spectrum.
Open circles show the estimated amplitudes corresponding to all the harmonic sinusoidal components. The crosses show estimated amplitudes calculated by constructing a model of the speech: s_cen(n) = Σ_{k=-21}^{21} a_k^cen e^{j ω_k^c n}, where a_k^cen is the complex amplitude at the centre of the kth critical band. A least squares approach is used to determine an estimate for the complex amplitude, by comparing the model with the original speech signal.
Figure 7b shows a second speech magnitude spectrum of a second frame of a speech signal. The horizontal axis is frequency in Hz. The vertical axis is spectral amplitude, which is shown on a logarithmic scale. The critical band boundaries are again indicated with dashed lines.
The dark crosses again show the estimated amplitudes calculated by constructing a model of the speech, s_cen(n) = Σ_{k=-21}^{21} a_k^cen e^{j ω_k^c n}, and using a least squares approach to determine an estimate for the complex amplitude, by comparing the model with the original speech signal.
The lighter crosses show the complex amplitudes estimated by the estimation method described in relation to Figure 5. The frequency at which the maximum spectral amplitude occurs within each critical band is identified using peak picking.
A model of the speech signal s(n) is then constructed for each frame: s(n) = Σ_{k=-21}^{21} a_k^max e^{j ω_k^max n}, where ω_k^max is the frequency returned by the peak picking algorithm for critical band k as the frequency at which the spectral amplitude is maximum. The complex amplitudes a_k^max are estimated using a least squares approach, in which the error is minimised between the model and the original speech. The complex amplitudes are shown on the spectrum as moved to the central frequency of each critical band.
Figure 8 is a schematic illustration of a very basic text to speech synthesis system.
Text is received via unit 10. Unit 10 may be a connection to the internet, a connection to a text output from a processor, an input from a speech to speech language processing module, a mobile phone etc. The unit 10 could be substituted by a memory which contains text data previously saved.
The text signal is then directed into a speech processor 11 which will be described in more detail with reference to Figure 9.
The speech processor 11 takes the text signal and turns it into speech corresponding to the text signal. Many different forms of output are available. For example, the output may be in the form of a direct audio output 12 which outputs to a speaker. This could be implemented on a mobile telephone, satellite navigation system etc. Alternatively, the output could be saved as an audio file 13 and directed to a memory. Also, the output could be in the form of an electronic audio signal 14 which is provided to a further system.
Figure 9 is a schematic illustration of the basic architecture of a text to speech system 15. The text to speech system 15 comprises a processor 11 which executes a program 16. Text to speech system 15 further comprises storage 17. The storage 17 stores data which is used by program 16 to convert text to speech. The text to speech system 15 further comprises an input module 18 and an output module 19. The input module 18 is connected to a text input 10. Text input 10 receives text. The text input 10 may be for example a keyboard. Alternatively, text input 10 may be a means for receiving text data from an external storage medium or a network.
Connected to the output module 19 is an output for audio 12. The audio output 12 is used for outputting a speech signal converted from the text inputted into text input 10.
The audio output 12 may be, for example, a direct audio output, e.g. a speaker, or an output for an audio data file which may be sent to a storage medium, networked, etc. In use, the text to speech system 15 receives text through text input 10. The program 16 executed on processor 11 converts the text into speech data using data stored in the storage 17. The speech is output via the output module 19 to audio output 12.
A text to speech system can be based on unit selection. Figure 10 is a flow chart of a general method of synthesizing speech from text based on unit selection.
Step S1001 is "Input text". The text may be inputted via a keyboard, touch screen, text predictor or the like.
In step S1002, "Convert text into sequence of acoustic units", the text is then converted into a sequence of acoustic units. These acoustic units may be phonemes or graphemes. The units may be context dependent, e.g. triphones which take into account not only the phoneme which has been selected but also the preceding and following phonemes. The text is converted into the sequence of acoustic units.
Step S1003 is "Convert acoustic units into stored complex amplitude values obtained from recorded segments of speech". The text to speech system comprises a model that relates acoustic units to stored complex amplitude values obtained from recorded segments of speech (and dynamic slope values, if included). Such models include, for example, models for relating acoustic units of text to stored recorded segments of speech.
The stored complex amplitude values are values that had previously been extracted from an audio signal using a method such as described in relation to Figures 1, 3 or 5. An audio signal is divided into recorded segments of speech. Each recorded segment of speech corresponds to an acoustic unit. The method described in relation to Figures 1, 3 or 5, for example, can be used to extract 21 complex amplitude values for each frame of the recorded segment of speech (and dynamic slope values, and boundary amplitudes and slope values, if included). The complex amplitudes for all the frames are stored in the database and associated with a label identifying them as those for the particular recorded segment of speech. In step S1003, the model finds the label designating the recorded segment of speech associated with the particular acoustic unit.

Step S1004 is "Synthesize speech signal using stored parameters". The complex amplitude values (and dynamic slope values) for each unit that have been retrieved from the database are used to generate the speech signal. Each frame of the speech signal for each unit is generated using a method such as described in relation to Figures 2, 4 or 6. Sinusoids with the central band frequencies (and other frequencies, if, for example, boundary sinusoids are included) are generated for each frame of the speech vector, using the complex amplitudes (and dynamic slope values) retrieved from the database. The frames are then overlapped and added. The speech signal can be outputted to a direct audio output, e.g. a speaker, or to an output for an audio data file which may be sent to a storage medium, for example.
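A schematic sketch of steps S1003 and S1004 follows, reusing the hypothetical synthesize_frame and overlap_add helpers from the earlier sketches; the database layout (one list of per-frame amplitude sets per unit label) is an assumption made for illustration.

```python
import numpy as np

def unit_selection_synthesis(acoustic_units, database, centre_freqs, fs,
                             frame_len, frame_shift):
    """Look up stored per-frame complex amplitudes for each unit and overlap-add."""
    frames = []
    for unit in acoustic_units:
        for amps in database[unit]:      # one set of complex amplitudes per frame
            frames.append(synthesize_frame(amps, centre_freqs, fs, frame_len))
    return overlap_add(np.array(frames), frame_shift)
```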
A text to speech system can alternatively be based on statistical modelling, for example a Hidden Markov Model (HMM) TTS system. The low number of parameters used, and the fact that the number of parameters does not vary between frames, means that the method can be used for statistical speech synthesis. Figure 11 is a flow chart of a general method of synthesizing speech from text based on statistical modelling using an HMM.
The first step is S1101, "Input text". The text may be inputted via a keyboard, touch screen, text predictor or the like.
The second step is S1102, "Convert text into sequence of acoustic units". Again, the acoustic units may be phonemes or graphemes. The units may be context dependent, e.g. triphones which take into account not only the phoneme which has been selected but also the preceding and following phonemes.
In step S1103, "Look up probability distributions relating acoustic units to sets of complex amplitude values", probability distributions which relate the acoustic units to complex amplitude values are looked up. Many acoustic units are pronounced in a similar manner, are affected by surrounding acoustic units and their location in a word or sentence, or are pronounced differently by different speakers. Thus, each acoustic unit has a certain probability of being related to a particular set of complex amplitude values.
For example, the probability distributions may be Gaussian distributions which are defined by means and variances. Other distributions such as the Poisson, Student-t, Laplacian or Gamma distributions can be used. The Gaussian distribution for a particular acoustic unit gives the probability, for each set of complex amplitude values, that the set of values corresponds to the particular acoustic unit. By set of complex amplitude values, it is meant the 21 complex amplitude values extracted for all of the frames for a particular segment of recorded speech. The sets of dynamic slope values can also be included. Further complex amplitude and slope values corresponding to the band boundaries can also be included.
The shape and position of the Gaussian are defined by its mean and variance.
Therefore, in this case, the mean and variance of the complex amplitude values are used to define the Gaussian. The mean and variance of the complex amplitude values are determined during the training of the system and are used in the acoustic model in step S1104 "Determine sequence of sets of complex amplitude values using an acoustic model". In this example, the acoustic model is a Hidden Markov Model (HMM). However, other models could also be used.
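As a generic illustration of such a distribution (not the patent's training procedure), the log-likelihood of one frame's feature vector under a diagonal-covariance Gaussian defined by its mean and variance could be computed as:

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Diagonal-covariance Gaussian log-likelihood of a feature vector,
    e.g. the 21 amplitude values extracted for one frame."""
    x, mean, var = map(np.asarray, (x, mean, var))
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
```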
In a Hidden Markov Model or other type of acoustic model, the probability of all potential sets of complex amplitude values relating to a specific acoustic unit must be considered. Then the sequence of sets of complex amplitude values which most likely corresponds to the sequence of acoustic units will be taken into account.
This implies a global optimization over all the acoustic units of the sequence, taking into account the way in which two units affect each other. As a result, it is possible that the most likely set of complex amplitude values for a specific acoustic unit is not the best set of complex amplitude values when a sequence of acoustic units is considered.
Once the sequence of sets of complex amplitude values has been determined, the speech signal is output in step S1105. Step S1105 is "Output speech signal". Each frame of the speech signal is generated using a method such as described in relation to Figures 2, 4 or 6. Sinusoids with the central band frequencies (and other frequencies, if, for example, boundary sinusoids are included) are generated for each frame, using the sequence of complex amplitudes (and dynamic slope values) determined by the acoustic model. The frames are then overlapped and added. The speech signal can be outputted to a direct audio output, e.g. a speaker, or to an output for an audio data file which may be sent to a storage medium, for example.
For each frame of speech synthesized, the sinusoidal components will have the same frequencies (i.e. the central band frequency). In one embodiment, frequency information is therefore not extracted or modelled and the deterministic frequency information provided by the critical bands is used. This allows observation of the evolution of the spectrum information over time independently of the frequency information. This may be useful in the statistical modelling of speech for text to speech synthesis.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the novel methods and apparatus described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and apparatus described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms of modifications as would fall within the scope and spirit of the inventions.

Claims (20)

  1. CLAIMS: 1. A system for extracting parameters for use in speech synthesis from a speech signal, comprising a processor configured to: receive an input speech signal; extract frames of said input speech signal; transform each frame of said input speech signal into the frequency domain; for each frame, identify the frequency at which the maximum spectral amplitude occurs in each of a plurality of frequency bands; calculate the complex amplitudes of sinusoidal components with the identified frequencies; assign each complex amplitude to a sinusoidal component with a fixed frequency in the corresponding frequency band.
  2. 2. The system according to claim 1, wherein the fixed frequency is a substantially central frequency of the frequency band.
  3. 3. The system according to claim 1, wherein the frequency bands are critical bands.
  4. 4. The system according to claim 1, wherein the complex amplitude of each of the sinusoidal components is calculated by minimising the error between the frame of input speech signal and a signal comprising the sum of the sinusoidal components.
  5. 5. The system according to claim 1, wherein the frames are overlapping frames.
  6. 6. The system according to claim 5, wherein the frames are windowed before being transformed into the frequency domain.
  7. 7. The system according to claim 1, wherein the frequency at which the maximum spectral amplitude occurs is identified using peak picking.
  8. 8. The system according to claim 1, further configured to calculate the dynamic slope value of each of the sinusoidal components with the identified frequencies; assign each dynamic slope value to a sinusoidal component with a fixed frequency in the corresponding frequency band.
  9. 9. The system according to claim 8, wherein the dynamic slope value of each sinusoidal component is calculated by minimising the error between the input speech signal and a signal comprising the sum of the sinusoidal components.
  10. 10. The system according to claim 9, further configured to calculate a complex amplitude and a dynamic slope value for a further number of sinusoidal components by minimising the error between the frame of input speech signal and a signal comprising the sum of all the sinusoidal components, the further number of sinusoidal components having the frequency of the boundary of a number of the bands.
  11. 11. The system according to claim 10, wherein there are nine further sinusoidal components.
  12. 12. The system of claim 1, further comprising: a speech synthesizer, configured to: generate a plurality of sinusoidal components, the frequency of each sinusoidal component being a fixed frequency in one of the frequency bands and the complex amplitude of the sinusoidal component being the assigned complex amplitude; output a signal comprising a sum of said sinusoidal components.
  13. 13. The system according to claim 12, wherein said output signal s(n) is generated from a sum of sinusoidal components given by s(n) = Σ_{k=-B}^{B} a_k^max e^{j ω_k^c n}, where B is the number of frequency bands, a_k^max is the complex amplitude assigned to the sinusoidal component with a fixed frequency in the kth frequency band, and ω_k^c is the fixed frequency in the kth frequency band.
  14. 14. The system according to claim 12, further configured to: calculate the dynamic slope value of the sinusoidal components with the identified frequencies; assign each dynamic slope value to a sinusoidal component with a fixed frequency in the corresponding frequency band; and wherein said signal s(n) is generated from a sum of sinusoidal components given by s(n) = Σ_{k=-B}^{B} (a_k^max + n b_k^max) e^{j ω_k^c n}, where B is the number of frequency bands, a_k^max is the complex amplitude assigned to the kth frequency band, b_k^max is the dynamic slope value assigned to the sinusoidal component with a fixed frequency in the kth frequency band and ω_k^c is the fixed frequency in the kth frequency band.
  15. 15. The system according to claim 14, further configured to: calculate a complex amplitude and a dynamic slope value for a further number of sinusoidal components by minimising the error between the frame of input speech signal and a signal comprising a sum of all the sinusoidal components, the further number of sinusoidal components having the frequency of the boundary of a number of the frequency bands; wherein said output signal s(n) is generated from a sum of sinusoidal components given by s(n) = Σ_{k=-B}^{B} (a_k^max + n b_k^max) e^{j ω_k^c n} + Σ_{k=-B_L}^{B_L} (a_k^b + n b_k^b) e^{j ω_k^b n}, where B is the number of frequency bands, a_k^max is the complex amplitude assigned to the sinusoidal component with a fixed frequency in the kth frequency band, b_k^max is the dynamic slope value assigned to the sinusoidal component with a fixed frequency in the kth frequency band and ω_k^c is the fixed frequency in the kth frequency band, ω_k^b is the frequency of the boundary of the kth frequency band, a_k^b is the complex amplitude and b_k^b is the dynamic slope value of the sinusoidal component at the boundary of the kth frequency band, and B_L is the further number of frequency bands.
  16. 16. The system according to claim 15, further configured to add a noise component to said signal, wherein said noise component is generated by: generating the sinusoidal components with frequencies above a selected frequency; estimating the envelope of the sum of said sinusoidal components with frequencies above the selected frequency; interpolating the complex amplitude values assigned to the critical bands above the selected frequency to obtain a further plurality of complex amplitude values; generating sinusoidal components with the interpolated complex amplitude values and random phase; and multiplying the sum of the sinusoidal components with interpolated complex amplitude values by the estimated envelope.
  17. 17. A method of extracting parameters for use in speech synthesis from an audio signal, the method comprising: receiving an input speech signal; extracting frames of said input speech signal; transforming each frame of said input speech signal into the frequency domain; for each frame, identifying the frequency at which the maximum spectral amplitude occurs in each of a plurality of frequency bands; calculating the complex amplitudes of sinusoidal components with the identified frequencies; assigning each complex amplitude to a sinusoidal component with a fixed frequency in the corresponding frequency band.
  18. 18. A method of synthesizing a speech signal, comprising the method of extracting parameters from an audio signal of claim 17, further comprising: generating a plurality of sinusoidal components, the frequency of each sinusoidal component being a fixed frequency in one of the frequency bands and the complex amplitude of the sinusoidal component being the assigned complex amplitude; outputting a signal comprising a sum of said sinusoidal components.
  19. 19. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 17.
  20. 20. A text to speech system, comprising a memory and a processor, configured to: receive input text; divide said inputted text into a sequence of acoustic units; determine a sequence of sets of complex amplitude values corresponding to said sequence of acoustic units, wherein each set of complex amplitude values comprises values of complex amplitude each corresponding to a frequency band; generate a speech signal using said complex amplitude values, wherein the speech signal comprises a sum of sinusoidal components each corresponding to a frequency band, wherein the frequency of each sinusoidal component is a fixed frequency in the corresponding frequency band and the complex amplitude is the complex amplitude value corresponding to the frequency band.
GB1407337.3A 2014-04-25 2014-04-25 A speech processing system Expired - Fee Related GB2525438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1407337.3A GB2525438B (en) 2014-04-25 2014-04-25 A speech processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1407337.3A GB2525438B (en) 2014-04-25 2014-04-25 A speech processing system

Publications (3)

Publication Number Publication Date
GB201407337D0 GB201407337D0 (en) 2014-06-11
GB2525438A true GB2525438A (en) 2015-10-28
GB2525438B GB2525438B (en) 2018-06-27

Family

ID=50971896

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1407337.3A Expired - Fee Related GB2525438B (en) 2014-04-25 2014-04-25 A speech processing system

Country Status (1)

Country Link
GB (1) GB2525438B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117116245B (en) * 2023-10-18 2024-01-30 武汉海微科技有限公司 Method, device, equipment and storage medium for generating harmonic wave of sound signal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0982713A2 (en) * 1998-06-15 2000-03-01 Yamaha Corporation Voice converter with extraction and modification of attribute data
US6311158B1 (en) * 1999-03-16 2001-10-30 Creative Technology Ltd. Synthesis of time-domain signals using non-overlapping transforms
WO2013067145A1 (en) * 2011-11-04 2013-05-10 Northeastern University Systems and methods for enhancing place-of-articulation features in frequency-lowered speech

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0982713A2 (en) * 1998-06-15 2000-03-01 Yamaha Corporation Voice converter with extraction and modification of attribute data
US6311158B1 (en) * 1999-03-16 2001-10-30 Creative Technology Ltd. Synthesis of time-domain signals using non-overlapping transforms
WO2013067145A1 (en) * 2011-11-04 2013-05-10 Northeastern University Systems and methods for enhancing place-of-articulation features in frequency-lowered speech

Also Published As

Publication number Publication date
GB201407337D0 (en) 2014-06-11
GB2525438B (en) 2018-06-27

Similar Documents

Publication Publication Date Title
US9368103B2 (en) Estimation system of spectral envelopes and group delays for sound analysis and synthesis, and audio signal synthesis system
US10621969B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
CN107851444A (en) For acoustic signal to be decomposed into the method and system, target voice and its use of target voice
JP4516157B2 (en) Speech analysis device, speech analysis / synthesis device, correction rule information generation device, speech analysis system, speech analysis method, correction rule information generation method, and program
Maia et al. Complex cepstrum for statistical parametric speech synthesis
CN108369803B (en) Method for forming an excitation signal for a parametric speech synthesis system based on a glottal pulse model
CN112185342A (en) Voice conversion and model training method, device and system and storage medium
Rao Real time prosody modification
Talesara et al. A novel Gaussian filter-based automatic labeling of speech data for TTS system in Gujarati language
EP3113180B1 (en) Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal
Kafentzis et al. Pitch modifications of speech based on an adaptive harmonic model
GB2525438A (en) A speech processing system
Toda et al. An evaluation of cost functions sensitively capturing local degradation of naturalness for segment selection in concatenative speech synthesis
Ganapathy et al. Temporal resolution analysis in frequency domain linear prediction
Rao Unconstrained pitch contour modification using instants of significant excitation
US20050131679A1 (en) Method for synthesizing speech
JP5245962B2 (en) Speech synthesis apparatus, speech synthesis method, program, and recording medium
JP6213217B2 (en) Speech synthesis apparatus and computer program for speech synthesis
JP4223416B2 (en) Method and computer program for synthesizing F0 contour
JP3358139B2 (en) Voice pitch mark setting method
Vích et al. Pitch synchronous transform warping in voice conversion
CN116543778A (en) Vocoder training method, audio synthesis method, medium, device and computing equipment
JP2005043828A (en) Creation device of voice data set for perceptual examination, computer program, optimization device of sub-cost function for voice synthesis and voice synthesizer
Lee et al. A source-filter based adaptive harmonic model and its application to speech prosody modification.
Achanta et al. Significance of Maximum Spectral Amplitude in Sub-bands for Spectral Envelope Estimation and Its Application to Statistical Parametric Speech Synthesis

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20230425