WO1998001848A1 - Speech synthesis system - Google Patents

Speech synthesis system Download PDF

Info

Publication number
WO1998001848A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
voiced
pitch
speech
lpc
Prior art date
Application number
PCT/GB1997/001831
Other languages
French (fr)
Inventor
Costas Xydeas
Original Assignee
The Victoria University Of Manchester
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GBGB9614209.6A external-priority patent/GB9614209D0/en
Application filed by The Victoria University Of Manchester filed Critical The Victoria University Of Manchester
Priority to AU34523/97A priority Critical patent/AU3452397A/en
Priority to EP97930643A priority patent/EP0950238B1/en
Priority to AT97930643T priority patent/ATE249672T1/en
Priority to DE69724819T priority patent/DE69724819D1/en
Priority to JP10504943A priority patent/JP2000514207A/en
Publication of WO1998001848A1 publication Critical patent/WO1998001848A1/en

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93Discriminating between voiced and unvoiced parts of speech signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90Pitch determination of speech signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/10Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a multipulse excitation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93Discriminating between voiced and unvoiced parts of speech signals
    • G10L2025/937Signal energy in various frequency bands
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Definitions

  • the present invention relates to speech synthesis systems, and in
  • a speech communication system is to be capable of
  • Unvoiced speech is produced by turbulent air flow at a constriction and does not
  • parameters used to represent a frame are the pitch period, the magnitude and
  • phase function is also defined using linear frequency
  • randomness in the signal is introduced by adding jitter to the amplitude
  • CELP code-excited linear prediction
  • the system employs 20msecs coding frames which are classified
  • a pitch period in a given frame is
  • coefficients are coded using a differential vector quantization scheme.
  • LPC synthesis filter the output of which provides the synthesised voiced speech
  • An amount of randomness can be introduced into voiced speech by
  • Periodic voice excitation signals are mainly represented by the "slowly
  • Phase information is
  • one known system operates on 5msec frames, a pitch period being selected for voiced frames and DFT transformed to yield
  • Unvoiced speech is CELP coded.
  • each frame is converted into a coded signal including a
  • peaks are determined and used to define a pitch estimate.
  • the system avoids undue complexity and may be readily implemented.
  • the pitch estimate is defined using an iterative process.
  • single reference sample may be used, for example centred with respect to the
  • the correlation function may be clipped using a threshold value
  • a predetermined factor for example smaller than 0.9 times the
  • the pitch estimation procedure is based on a least squares
  • the algorithm defines the pitch as a number whose
  • values may be limited to integral numbers which are not consecutive, the
  • each frame is converted into a coded signal including pitch segment magnitude spectral information, a voiced/unvoiced classification, and a mixed
  • a threshold value is
  • Peaks may be located using a second order polynomial.
  • the samples may be
  • the threshold value may be calculated by identifying
  • Peaks may be defined as those values which are greater than
  • a peak may be rejected from consideration if
  • neighbouring peaks are of a similar magnitude, e.g. more than 80% of the
  • a harmonic may be considered as not being associated with a
  • the spectrum may be divided into bands of fixed width and a
  • the frequency range may be divided into two or more bands of variable width,
  • the spectrum may be divided into fixed bands, for example fixed
  • frequency band e.g. 0-500Hz
  • the highest frequency band for example 3500Hz to 4000Hz, may always
  • 3000Hz to 3500Hz may be automatically classified as weakly voiced.
  • the strongly/weakly voiced classification may be determined using a majority
  • alternate bands may be alternately assigned strongly voiced and weakly voiced classifications.
  • excitation signal to recover the speech signal which takes into account this classification. It is an object of the present invention to provide such a system.
  • each frame is defined as voiced or unvoiced, each frame is converted into
  • a coded signal including a pitch period value, a frame voiced/unvoiced
  • the speech signal is reconstructed by generating an
  • the excitation signal is represented by a function which includes a first
  • harmonic frequency component the frequency of which is dependent upon the
  • the random component may be introduced by reducing the amplitude of
  • harmonic oscillators assigned the weakly voiced classification for example by
  • the oscillators producing random signals may be randomised at pitch intervals. Thus for a weakly voiced band, some periodicity remains but the power of the
  • an input speech signal is processed to produce an
  • the discarded magnitude values are represented at
  • magnitude values to be quantized are always the same and predetermined on the
  • each voiced frame is converted into a coded signal including a pitch
  • the pitch segment is DFT transformed, the mean value of the
  • the selected magnitudes are recovered, and each of the
  • the input vector is transformed to a fixed size vector which is then
  • variable input vector is directly quantized with a
  • fixed size codebook vectors are obtained from variable size training vectors and an interpolation
  • the invention is applicable in particular to pitch synchronous low bit rate
  • the interpolation process is linear.
  • the interpolation process is applied to produce from the
  • codebook vectors a set of vectors of that given dimension.
  • the dimension of the input vectors is reduced by taking into
  • the remaining amplitudes i.e. in the region of
  • 3.4kHz to 4 kHz are set to a constant value.
  • the constant value is
  • the backward prediction may be performed on a harmonic basis
  • each frame is converted into a coded signal including an estimated pitch
  • the excitation signal the excitation spectral envelope is shaped according to the
  • the result is a system which is capable of delivering high
  • the invention is based on the observation that
  • the magnitude values may be obtained by spectrally sampling a modified
  • the modified LPC synthesis filter may have reduced feedback gain and
  • the value of the feedback gain may be controlled by the performance of the LPC model such that it is
  • the reproduced speech signal may be equal to the energy of the original speech
  • each frame is converted into a coded signal including LPC filter
  • each pair of excitation signals comprising a first
  • the outputs of the first and second LPC filters are weighted by
  • a window function such as a Hamming window such that the magnitude of
  • the output of the first filter is decreasing with time and the magnitude of the
  • Figure 1 is a general block diagram of the encoding process in accordance with the present invention.
  • Figure 2 illustrates the relationship between coding and matrix
  • Figure 3 is a general block diagram of the decoding process
  • Figure 4 is a block diagram of the excitation synthesis process
  • Figure 5 is a schematic diagram of the overlap and add process
  • Figure 6 is a schematic diagram of the calculation of an instantaneous
  • Figure 7 is a block diagram of the overall voiced/unvoiced classification
  • Figure 8 is a block diagram of the pitch estimation process
  • Figure 9 is a schematic diagram of two speech segments which participate
  • Figure 10 is a schematic diagram of speech segments used in the calculation of the crosscorrelation function value
  • Figure 11 represents the value allocated to a parameter used in the
  • Figure 12 is a block diagram of the process used for calculating the
  • Figure 13 is a flow chart of a pitch estimation algorithm
  • Figure 14 is a flow chart of a procedure used in the pitch estimation
  • Figure 15 is a flow chart of a further procedure used in the pitch
  • Figure 16 is a flow chart of a further procedure used in the pitch
  • Figure 17 is a flow chart of a threshold value selection procedure
  • Figure 18 is a flow chart of the voiced/unvoiced classification process
  • Figure 19 is a schematic diagram of the voiced/unvoiced classification
  • Figure 20 is a flow chart of the procedure used to determine offset values
  • Figure 21 is a flow chart of the pitch estimation algorithm
  • Figure 22 is a flow chart of a procedure used to impose constraints on
  • Figures 23, 24 and 25 represent different portions of a flow chart of a
  • Figure 26 is a general block diagram of the LPC analysis and LPC
  • Figure 27 is a general flow chart of a strongly or weakly voiced
  • Figure 28 is a flow chart of the procedure responsible for the
  • Figure 29 represents a speech waveform obtained from a particular
  • Figure 30 shows frequency tracks obtained for the speech utterance of
  • Figure 31 shows to a larger scale a portion of Figure 30 and represents the
  • Figure 32 shows a magnitude spectrum of a particular speech segment
  • Figure 33 is a general block diagram of a system for representing
  • Figure 34 is a block diagram of an adaptive quantiser shown in Figure 33;
  • Figure 35 is a general block diagram of a quantisation process
  • Figure 36 is a general block diagram of a differential variable size
  • Figure 37 represents the hierarchical structure of a mean gain shape quantiser
  • A system in accordance with the present invention is described below, firstly in general terms and then in greater detail.
  • the system operates on an LPC residual signal on a frame by frame basis.
  • voiced speech depends on the pitch frequency of the signal.
  • a voiced/unvoiced classification process allows the coding of voiced and unvoiced frames to be handled in different ways.
  • Unvoiced frames are modelled in terms of an RMS value and a random time series.
  • voiced frames a pitch period estimate is obtained and used to define a pitch segment which is centred at the middle of the frame.
  • Pitch segments from adjacent frames are DFT transformed and only the resulting pitch segment magnitude information is coded and transmitted.
  • pitch segment magnitude samples are classified as strongly or weakly voiced.
  • the system transmits for every voiced frame the pitch period value, the magnitude spectral information of the pitch segment, the strong/weak voiced classification of the pitch magnitude spectral values, and the LPC coefficients.
  • the information which is transmitted for every voiced frame is, in addition to voiced/unvoiced information, the pitch period value, the magnitude spectral information of the pitch segment, and the LPC filter coefficients.
  • MG_j are decoded pitch segment magnitude values and phase_j(i) is calculated from the integral of the linearly interpolated instantaneous harmonic frequencies ω_j(i).
  • K is the largest value of j for which ω_j^n(i) remains below the Nyquist limit of π.
  • the initial phase for each harmonic is set to zero. Phase continuity is preserved across the boundaries of successive interpolation intervals.
  • the synthesis process is performed twice however, once using the magnitude spectral values MG_j^{n+1} of the pitch segment derived from the current (n+1)th frame and again using the magnitude values MG_j^n of the pitch segment derived in the previous nth frame.
  • the phase function phase_j(i) in each case remains the same.
  • the resulting residual signals Res_n(i) and Res_{n+1}(i) are used as inputs to corresponding LPC synthesis filters calculated for the nth and (n+1)th speech frames.
  • the two LPC synthesised speech waveforms are then weighted by W_{n+1}(i) and W_n(i) to yield the recovered speech signal.
  • the LPC excitation signal is based on a "mixed" excitation model which allows for the appropriate mixing of periodic and random excitation components in voiced frames on a frequency-band basis. This is achieved by operating the system such that the magnitude spectrum of the residual signal is examined, and applying a peak-picking process, near the ω_j resonant frequencies, to detect possible dominant spectral peaks.
  • NRS random components are spaced at 50 Hz intervals symmetrically about ω_j
  • the amplitudes of the NRS random components are set to MG_j / √(2 × NRS). Their initial phases are selected randomly from the [-π, +π] region at
  • the hv_j information must be transmitted to be available at the receiver and, in order to reduce the bit rate allocated to hv_j, the bandwidth of the input signal is divided into a number of fixed size bands BD_k and a "strongly" or "weakly" voiced flag Bhv_k is assigned for each band.
  • a "strongly" or "weakly" voiced flag Bhv_k is assigned for each band.
  • a weakly voiced band a highly periodic signal is reproduced.
  • a signal which combines both periodic and aperiodic components is required.
  • the remaining spectral bands can be strongly or weakly voiced.
  • Figure 1 schematically illustrates processes operated by the system encoder. These processes are referred to in Figure 1 as Processes I to VII and these terms are used throughout this document.
  • a speech signal is input and Processes I, III, IV, VI and VII produce outputs for transmission.
  • Process I Assuming that the first Matrix Quantization analysis frame (MQA) of k×M samples is available, each of the k coding frames within the MQA is classified as voiced or unvoiced (V_n) using Process I.
  • a pitch estimation part of Process I provides a pitch period value P_n only when a coding frame is voiced.
  • k/m is an integer and represents the frame dimension of the matrix quantizer employed in Process III.
  • the quantized coefficients a are used to derive a residual signal R_n(i).
  • P_n is the pitch period value associated with the nth frame. This segment is centred in the middle of the frame.
  • the selected P_n samples are DFT transformed (Process V) to yield (P_n + 1)/2 spectral magnitude values MG_j^n, 1 ≤ j ≤ (P_n + 1)/2
  • the magnitude information is coded (using Process VI) and transmitted.
  • Process IV produces quantized Bhv information, which for voiced frames is multiplexed and transmitted to the receiver together with the voiced/unvoiced decision V_n, the pitch period P_n, the quantized LPC coefficients a of the corresponding LPC frame, and the magnitude values MG_j^n. In unvoiced frames only the quantized energy value and the quantized LPC filter coefficients a are transmitted.
  • Figure 3 schematically illustrates processes operated by the system decoder.
  • the decoder Given the received parameters of the nth coding frame and those of the previous (n-1)th coding frame, the decoder synthesises a speech signal S_n(i) that extends from the middle of the (n-1)th frame to the middle of the nth frame.
  • This synthesis process involves the generation in parallel of two excitation signals Res_n(i) and Res_{n-1}(i) which are used to drive two independent LPC synthesis filters 1/A_n(z) and 1/A_{n-1}(z) the coefficients of which are derived from the transmitted quantized coefficients a.
  • the process commences by considering the voiced/unvoiced status V_k, where k is equal to n or n-1, (see Figure 4).
  • V_k = 0
  • a gaussian random number generator RG(0,1) of zero mean and unit variance provides a time series which is subsequently scaled by the √E value received for this frame. This is effectively the required:
  • the Res_k(i) excitation signal is defined as the summation of a "harmonic" Res_k^h(i) component and a "random" Res_k^r(i) component.
  • the top path of the V_k = 1 part of the synthesis in Figure 4, which provides the harmonic component of this mixed excitation model, always calculates the instantaneous harmonic frequency function ω_j^n(i) which is associated with the interpolation interval that is defined between the middle points of the nth and (n-1)th frames, (i.e. this action is independent of the value of k).
  • ω_j^n(i) is calculated using the pitch frequencies f_j^{n-1}, f_j^n and linear interpolation i.e.
  • the f_j^{n-1} value is calculated during the decoding process of the previous (n-1)th coding frame. hv_j^n is the strongly/weakly voiced classification (0 or 1) of the jth harmonic ω_j^n.
  • P_n and P_{n-1} are the received pitch estimates from the n and n-1 frames.
  • the associated phase value is:
  • the random excitation signal Res_k^r(i) can be generated by the summation of random cosines located 50 Hz apart, where their phase is randomised every Θ samples, with Θ ≤ M, i.e.
  • Res_k^r(i) = Σ_l cos(2π·f_l·i + φ_l), where the f_l are spaced 50 Hz apart and each φ_l is redrawn from RU(-π, +π) every Θ samples (an illustrative sketch of this mixed excitation generation is given in the code example following this list).
  • 1/A_n(z) becomes 1/A_{n+1}(z) with the memory of 1/A_n(z).
  • This is valid in all cases except during an unvoiced to voiced transition, where the memory of the 1/A_{n+1}(z) filter is set to zero.
  • the coefficients of the 1/A_n(z) and 1/A_{n-1}(z) synthesis filters are calculated directly from the nth and (n-1)th coding speech frames respectively, when the LPC analysis frame size L is equal to M samples. However, when L ≠ M (usually L>M) linear interpolation is used on the filter coefficients (defined every L samples) so that the transfer function of the synthesis filter is updated every M samples.
  • the output signals of these filters, denoted as X_{n-1}(i) and X_n(i), are weighted, overlapped and added as schematically illustrated in Figure 5 to yield the recovered signal i.e:
  • PF(z) is the conventional post filter:
  • HP(z) is defined as: HP(z) = b(1 − c·z⁻¹)
  • a scaling factor SC is calculated every LPC frame of L samples.
  • is associated with the middle of the lth LPC frame as illustrated in Figure 6.
  • the filtered samples from the middle of the (l-1)th frame to the middle of the lth frame are then multiplied by SC_l(i) to yield the final output of the system, where:
  • the scaling process introduces an extra half LPC frame delay into the coding-decoding process.
  • the above described energy scaling procedure operates on an LPC frame basis in contrast to both the decoding and PF(z), HP(z) filtering procedures which operate on the basis of a frame of M samples.
  • Process I derives a voiced/unvoiced (V/UV) classification V_n for the nth input coding frame and also assigns a pitch estimate P_n to the middle sample M_n of this frame. This process is illustrated in Figure 7.
  • V/UV voiced/unvoiced
  • the V/UV and pitch estimation analysis frame is centred at the middle M_{n+1} of the (n+1)th coding frame with 237 samples on either side.
  • the pitch estimation algorithm is illustrated in Figure 8, where P represents the output of the pitch estimation process.
  • the 294 input samples are used to calculate a crosscorrelation function CR(d), where d is shown in Figure 9 and 20 ≤ d ≤ 147.
  • Figure 9 shows the two speech segments which participate in the calculation of the crosscorrelation function value at "d" delay.
  • the crosscorrelation function ρ_d(j) is calculated for the segments {X}_d, {R}, as:
  • Figure 12 is a block diagram of the process involving the calculation of the CR function and the selection of its peaks. As illustrated, given CR(d), a threshold th(d) is determined as:
  • th(d) = CR(d_max) × b − (d − d_max) × a − c (28)
  • the algorithm examines the length of the G_0 runs which exist between successive G_s segments (i.e. G_s and G_{s+1}), and when G_0 < 17, then the G_s segment with the max CR(d) value is kept. This procedure yields CR_L(d), which is then examined by the following "peak picking" procedure.
  • CR_L(d) > CR_L(d − 1) and CR_L(d) > CR_L(d + 1)
  • certain peaks can be rejected if: CR_L(loc(k)) < CR_L(loc(k + 1)) × 0.9
  • CR(d) and loc(k) are used as inputs to the following Modified High Resolution Pitch Estimation algorithm (MHRPE) shown in Figure 8, whose output is P_{M_{n+1}}.
  • MHRPE Modified High Resolution Pitch Estimation algorithm
  • FIG 13 The flowchart of this MHRPE procedure is shown in Figure 13, where P is initialised with 0 and, at the end, the estimated P is the requested P_{M_{n+1}}.
  • the main pitch estimation procedure is based on a Least Squares Error (LSE) algorithm which is defined as follows: For each possible pitch value j in the range from 21 to 147 with an increment of 0.1 × j, i.e. j ∈ {21, 23, 25, 27, 30, 33, 36, 40, 44, 48, 53, 58, 64, 70, 77, 84, 92, 101, 111, 122, 134}. (Thus 21 iterations are performed.)
  • LSE Least Squares Error
  • V/UV part of Process I calculates the status V_{M_{n+1}}
  • the flowchart of this part of the algorithm is shown in Figure 18 where "V" represents the output V/UV flag of this procedure. Setting the "V” flag to 1 or 0 indicates voiced or unvoiced classification respectively.
  • the "CR” parameter denotes the maximum value of the CR function which is calculated in the pitch estimation process.
  • a diagrammatic representation of the voiced/unvoiced procedure is given in Figure 19.
  • a multipoint pitch estimation algorithm accepts P_{M_{n+1}}, P_{M_{n+1}+d1}, P_{M_{n+1}+d2}, V_{n-1}, P_{n-1}, V'_n, P'_n to provide a preliminary pitch value P'_{n+1}.
  • the flowchart of this multipoint pitch estimation algorithm is given in Figure 21, where P_1, P_2 and P_0 represent the pitch estimates associated with the M_{n+1+d}
  • V'_{n+1} and P'_{n+1} produced from this section are then used in the next pitch post processing section together with V_{n-1}, V'_n, P_{n-1} and P'_n to yield the final voiced/unvoiced and pitch estimate parameters V_n and P_n for the nth coding frame.
  • This pitch post processing stage is defined in the flowchart of Figures 23, 24 and 25, the output A of Figure 23 being the input to Figure 24, and the output B of Figure 24 being the input to Figure 25.
  • "Pate" and "V n " represent the pitch estimate and voicing flag respectively, which correspond to the nth coding frame prior to post processing (i.e.
  • the LPC analysis process (Process II of Figure 1) can be performed using the Autocorrelation, Stabilised Covariance or Lattice methods. The Burg algorithm was used, although simple autocorrelation schemes could be employed without a noticeable effect in the decoded speech quality.
  • the LPC coefficients are then transformed to an LSP representation. Typical values for the number of coefficients are 10 to 12 and a 10th order filter has been used. LPC analysis processes are well known and described in the literature, for example "Digital Processing of Speech Signals", L.R.
  • LSP coefficients are used to represent the data. These 10 coefficients could be scalar quantized with 37 bits using the following bit allocation pattern [3,4,4,4,4,4,4,4,3,3]. This is a relatively simple process, but the resulting bit rate of 1850 bits/second is unnecessarily high.
  • the LSP coefficients can be Vector Quantised (VQ) using a Split-VQ technique.
  • VQ Vector Quantised
  • the Split-VQ technique an LSP parameter vector of dimension "p" is split into two or more subvectors of lower dimensions and then, each subvector is Vector Quantised separately (when Vector Quantising the subvectors a direct VQ approach is used).
  • the LSP transformed coefficient vector C, which consists of "p" consecutive coefficients (c_1, c_2, ..., c_p), is split into "K" vectors C_k (1 ≤ k ≤ K), with the corresponding dimensions d_k (1 ≤ d_k ≤ p).
  • the Split-VQ becomes equivalent to Scalar Quantisation.
  • the Split-VQ becomes equivalent to Full
  • m(k) represents the spectral dimension of the kth submatrix and N is the SMQ
  • Er(t) is the normalised energy of the prediction error of the (l+t)th frame
  • En(t) is the RMS value of the (l+t)th speech frame
  • Aver(En) is the average RMS value of the N LPC frames used in SMQ.
  • the values of the two constants are set to 0.2 and 0.15 respectively.
  • the overall SMQ quantisation process that yields the quantised LSP coefficient vectors for the l to l+N-1 analysis frames is shown in Figure 26.
  • a 5Hz bandwidth expansion is also included in the inverse quantisation process.
  • Process IV of Figure 1 This process is concerned with the mixed voiced classification of harmonics.
  • the flowchart of Process IV is given in Figure 27.
  • the R_n array of 160 samples is Hamming windowed and augmented to form a 512 size array, which is then FFT processed.
  • the maximum and minimum values MGR_max, MGR_min of the resulting 256 spectral magnitude values are determined, and a threshold TH0 is calculated. TH0 is then used to clip the magnitude spectrum.
  • the clipped MGR array is searched to define peaks MGR(P) satisfying:
  • MGR(P) For each peak, MGR(P), "supported" by the MGR(P+1) and MGR(P-1) values a second order polynomial is fitted and the maximum point of this curve is accepted as MGR(P) with a location loc(MGR(P)). Further constraints are then imposed on these magnitude peaks. In particular a peak MGR(P) is rejected: a) if there are other
  • spectral peaks in the neighbourhood of loc(MGR(P)) (i.e. in the range loc(MGR(P))-fo/2 to loc(MGR(P))+fo/2 where fo is the fundamental frequency in Hz), whose value is larger than 80% of MGR(P), or b) if there are any spectral magnitudes in the same range whose value is larger than MGR(P).
  • two thresholds are defined as follows:
  • TH1 = 0.15 × fo.
  • loc(MGR_d(k)) − loc(MGR_d(k − 1)) is compared to 1.5 × fo + TH2, and if
  • classification hv is zero (weakly voiced). (loc(MGR_d(k)) is the location of the kth dominant
  • loc(k) is the location of the kth
  • loc(MGR_d(k)) = loc(K).
  • the spectrum is divided into bands of 500Hz each and a strongly voiced/weakly voiced flag Bhv is assigned for each band.
  • the Bhv values of the remaining 5 bands are determined using a majority decision rule on the hv j values of the j harmonics which fall within the band under consideration.
  • the hv_j of a specific harmonic j is equal to the Bhv value of the corresponding band.
  • the hv information may be transmitted with 5 bits.
  • the 680 Hz to 3400 Hz range is represented by only two variable size bands.
  • the Fc frequency that separates these two bands can be one of the following:
  • the Fc frequency is selected by examining the three bands sequentially defined by the frequencies in (A) or (B) and by using again a majority rule on the harmonics which fall within a band.
  • Figures 29 and 30 represent respectively an original speech
  • the horizontal axis represents time in terms of frames each of
  • Figure 31 shows to a larger scale a section of Figure 30, and represents
  • Waveform A represents the magnitude
  • Waveforms B, C and D represent the normalised Short-Term magnitude spectrum of the
  • the hybrid model introduces an appropriate amount of randomness where required in the 3-4 kHz region
  • the DFT For a real-valued sequence x(i) of P points the DFT may be expressed as:
  • the P_n point DFT will yield a double-sided spectrum.
  • the magnitude of all the non DC components must be multiplied by a factor of 2.
  • the total number of single side magnitude spectrum values, which are used in the reconstruction process, is equal to
  • MSVSAR modified single value spectral amplitude representation
  • MSVSAR is based on the observation that some of the speech spectrum resonance and anti- resonance information is also present at the residual magnitude spectrum (G.S. Kang and S.S. Everett, "Improvement of the Excitation Source in the Narrow-Band Linear Prediction Vocoder", IEEE Trans. Acoust., Speech and Signal Proc, Vol. ASSP-33, pp.377-386, 1985).
  • LPC inverse filtering can not produce a residual signal of absolutely flat magnitude spectrum mainly due to: a) the "cascade representation" of formants by the LPC filter 1/A(z), which results in the magnitudes of the resonant peaks being dependent upon the pole locations of the 1/A(z) all-pole filter and b) the LPC quantisation noise.
  • G R and G N are defined as follows:
  • x_r^n(i) represents a sequence of 2P_n speech samples centred in the middle of the nth coding frame from which the mean value is calculated and removed
  • the G_N parameter represents a constant whose value is set to 0.25.
  • Equation 32 defines a modified LPC synthesis filter with reduced feedback gain, whose frequency response consists of nearly equalised resonant peaks, the locations of which are very close to the LPC synthesis resonant locations. Furthermore, the value of the feedback gain G R is controlled by the performance of the LPC model (i.e. it is proportional to the normalised LPC prediction error). In addition Equation 34 ensures that the energy of the reproduced speech signal is equal to the energy of the original speech waveform. Robustness is increased by computing the speech RMS value over two pitch periods. Two alternative magnitude spectrum representation techniques are described below, which allow for better coding of the magnitude information and lead to a significant improvement in reconstructed speech quality.
  • the first of the alternative magnitude spectrum representation techniques is referred to below as the "Na-amplitudes system".
  • the basic principle of this MG" quantisation system is to represent accurately those MG" values which correspond to the Na largest speech Short
  • - 1 MG_j^n magnitudes are subjectively more important for accurate quantization. The system subsequently selects MG_j^n, j = lc(1),...,lc(Na), and Vector Quantizes these values. If the minimum pitch value is 17 samples, the number of non-DC MG_j^n amplitudes is equal to 8 and for this reason Na ≤ 8. Two variations of the "Na-amplitudes system" were developed with equivalent performance and their block diagrams are depicted in Figure 33 (a) and (b) respectively.
  • This arrangement for the quantization of g or m extends the dynamic range of the coder to not less than 25 dB.
  • A is either "m” or "g”).
  • the block diagram of the adaptive μ-law quantiser is shown in Figure 34.
  • the second of the alternative magnitude spectrum representation techniques is referred to below as the "Variable Size Spectral Vector Quantisation (VS/SVQ)" system. Coding systems, which employ the general synthesis formula of Equation (1) to recover speech, encounter the problem of coding a variable length, pitch dependent spectral amplitude vector MG^n.
  • the "Na-amplitudes" MG^n quantisation schemes described in Figure 33 avoid this problem by Vector Quantising the minimum expected number of spectral amplitudes and by setting the rest of the MG^n amplitudes to a fixed value.
  • a partially spectrally flat excitation model has limitations in providing high recovered speech quality.
  • the shape of the entire {MG^n} magnitude spectrum should be quantised.
  • Various techniques have been proposed for coding {MG^n}. Originally ADPCM has been used across the MG_j^n values associated with a specific coding frame. Also {MG^n} has been DCT transformed and coded differentially across successive MG^n magnitude spectra.
  • the first VQ method involves the transformation of the input vector to a fixed size vector followed by conventional Vector Quantisation.
  • the inverse transformation on the quantised fixed size vector yields the recovered quantised MG^n vector. Transformation techniques which have been used include Linear Interpolation, Band Limited Interpolation, All Pole modelling and Non-Square transformation. However, the overall distortion produced by this approach is the summation of the VQ noise and a component which is introduced by the transformation process.
  • the second VQ method achieves the direct quantisation of a variable input vector with a fixed size code vector. This is based on selecting only vs_n elements from each codebook vector, to form a distortion measure between a codebook vector and an input MG^n vector. Such a quantisation approach avoids the transformation distortion of the previous techniques mentioned in (i) and results in an overall distortion that is equal to the Vector Quantisation noise.
  • VS/SVQ Variable Size Spectral Vector Quantisation
  • Figure 35 highlights the VS/SVQ process.
  • Interpolation (in this case linear) is used on the S^I codebook vectors to yield vectors of dimension vs_n.
  • this interpolation process is given by:
  • Amplitude vectors obtained from adjacent residual frames exhibit significant redundancy, which can be removed by means of backward prediction. Prediction is performed on a harmonic basis i.e. the amplitude value of each harmonic MG_j^n is predicted from the amplitude value of the same harmonic in previous frames i.e. MG_j^{n-1}.
  • a fixed linear predictor (each MG_j^n predicted as b × MG_j^{n-1}) may be incorporated in the VS/SVQ system, and the resulting DPCM structure is shown in Figure 36 (differential VS/SVQ, (DVS/SVQ)).
  • E denotes the quantised error vector
  • the quantisation of the E_j^n, 1 ≤ j ≤ vs_n, error vector incorporates Mean Removal and Gain Shape Quantisation techniques, using the hierarchical VQ structure of Figure 36.
  • a weighted Mean Square Error is used in the VS/SVQ stage of the system.
  • W is normalised so that:
  • the pdf of the mean value of E^n is very broad and, as a result, the mean value differs widely from one vector to another.
  • This mean value can be regarded as statistically independent of the variation of the shape of the error vector E^n and thus, can be quantised separately without paying a substantial penalty in compression efficiency.
  • the mean value of an error vector is calculated as follows:
  • M is Optimum Scalar Quantised to M̂ and is then removed from the original error vector to form Erm^n = (E^n − M̂).
  • the overall quantization distortion is attributed to the quantization of the "Mean Removed" error vectors (Erm^n), which is performed by a Gain-Shape Vector Quantiser.
  • the objective of the Gain-Shape VQ process is to determine the gain value G and the shape vector S so as to minimise the distortion measure:
  • a gain optimised VQ search method similar to techniques used in CELP systems, is employed to find the optimum G and S.
  • the shape Codebook (CBS) of vectors S is searched first to yield an index I, which maximises the quantity:
  • cbs is the number of codevectors in the CBS.
  • the optimum gain value is defined as:
  • each quantizer i.e. b_k, CBM_k, CBG_k, CBS_k
  • b_k The performance of each quantizer (i.e. b_k, CBM_k, CBG_k, CBS_k) has been evaluated using subjective tests and a LogSegSNR distortion measure, which was found to reflect the subjective performance of the system.
  • Q_i denotes the cluster of Erm_k^n error vectors which are quantised to the S
  • cbs represents the total number of shape quantisation levels
  • J_n represents the CBG_k gain codebook index which encodes the Erm_k^n error vector and 1 ≤ j ≤ vs_n.
  • D_j denotes the cluster of Erm_k^n error vectors which are quantised to the G_k gain quantiser level
  • cbg represents the total number of gain quantisation levels
  • I_n represents the CBS_k shape codebook index which encodes the Erm_k^n error vector and 1 ≤ j ≤ vs_n.
  • Process VII calculates the energy of the residual signal.
  • the LPC analysis performed in Process II provides the prediction coefficients a_i, 1 ≤ i ≤ p, and the reflection coefficients k_i, 1 ≤ i ≤ p.
  • the Voiced/Unvoiced classification performed in Process I provides the short term autocorrelation coefficient for zero delay of the speech signal (RO) for the frame under consideration.
  • the Energy of the residual signal E. The value is given as:
  • Equation (50) gives a good approximation of the residual signal energy with low computational requirements.
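
By way of illustration, the following Python sketch generates a mixed LPC excitation of the kind described in the fragments above: strongly voiced harmonics are synthesised as cosines whose frequencies are linearly interpolated across the interpolation interval, while each weakly voiced harmonic is replaced by a cluster of NRS cosines spaced 50 Hz apart whose phases are re-drawn at roughly pitch-period intervals from a uniform [-π, +π] distribution. Only the 50 Hz spacing, the MG_j / √(2 × NRS) amplitude and the phase range follow the text; the interval length, the randomisation schedule, the sampling rate and all names are illustrative assumptions rather than the patent's actual implementation.

```python
import numpy as np

def mixed_excitation(mg, hv, p_prev, p_cur, fs=8000, nrs=4, seed=0):
    """Illustrative mixed excitation for one interpolation interval.

    mg     : decoded pitch-segment magnitudes, one per harmonic
    hv     : 1 = strongly voiced harmonic, 0 = weakly voiced harmonic
    p_prev : pitch period (samples) of the previous frame
    p_cur  : pitch period (samples) of the current frame
    nrs    : number of 50 Hz spaced random cosines per weakly voiced harmonic
    """
    rng = np.random.default_rng(seed)
    period = int(round((p_prev + p_cur) / 2))
    n = period * 4                                   # interval length (illustrative)
    i = np.arange(n)
    # fundamental frequency interpolated linearly across the interval
    f0 = fs / p_prev + (fs / p_cur - fs / p_prev) * i / max(n - 1, 1)
    res = np.zeros(n)
    for j, (amp, voiced) in enumerate(zip(mg, hv), start=1):
        fj = j * f0                                  # instantaneous harmonic frequency
        keep = fj < fs / 2                           # drop anything above Nyquist
        if voiced:
            phase = 2 * np.pi * np.cumsum(fj / fs)   # phase = integral of frequency
            res += np.where(keep, amp * np.cos(phase), 0.0)
        else:
            # weakly voiced: nrs cosines, 50 Hz apart, centred on the harmonic,
            # phases re-drawn once per (average) pitch period
            for off in (np.arange(nrs) - (nrs - 1) / 2) * 50.0:
                phase = 2 * np.pi * np.cumsum((fj + off) / fs)
                jitter = rng.uniform(-np.pi, np.pi, size=n // period + 1)
                phase += np.repeat(jitter, period)[:n]
                res += np.where(keep, amp / np.sqrt(2 * nrs) * np.cos(phase), 0.0)
    return res

if __name__ == "__main__":
    # four harmonics: the lower two strongly voiced, the upper two weakly voiced
    print(mixed_excitation([1.0, 0.8, 0.5, 0.3], [1, 1, 0, 0], 60, 64).shape)
```

The example synthesises four harmonics, treating the two highest as weakly voiced, and so retains low-band periodicity while spreading the upper-band energy over noise-like components.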

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Aerials With Secondary Devices (AREA)
  • Optical Communication System (AREA)
  • Telephonic Communication Services (AREA)
  • Reduction Or Emphasis Of Bandwidth Of Signals (AREA)
  • Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)

Abstract

A speech synthesis system in which a speech signal is divided into a series of frames, and each frame is converted into a coded signal including a voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered speech segment centred about a reference sample is defined in each frame, a correlation value is calculated for each of a series of candidate pitch estimates as the maximum of multiple crosscorrelation values obtained from variable length speech segments centred about the reference sample, the correlation values are used to form a correlation function defining peaks, and the locations of the peaks are determined and used to define a pitch estimate.

Description

SPEECH SYNTHESIS SYSTEM
The present invention relates to speech synthesis systems, and in
particular to speech coding and synthesis systems which can be used in
speech communication systems operating at low bit rates.
Speech can be represented as a waveform the detailed structure of which
represents the characteristics of the vocal tract and vocal excitation of the person
producing the speech. If a speech communication system is to be capable of
providing an adequate perceived quality, the transmitted information must be
capable of representing that detailed structure. Most of the power in voiced
speech is at relatively low frequencies, for example below 2kHz. Accordingly
good quality speech synthesis can be achieved on the basis of speech waveforms
that have been low pass filtered to reject higher frequency components. The
perceived speech quality is however adversely affected if the frequency range is
restricted much below 4kHz.
Many models have been suggested for defining the characteristics of
speech. The known models rely upon dividing a speech signal into blocks or
frames and deriving parameters to represent the characteristics of the speech
within each frame. Those parameters are then quantized and transmitted to a
receiver. At the receiver the quantization process is reversed to recover the
parameters, and a speech signal is then synthesised on the basis of the recovered
parameters. The common objective of the designers of the known models is to
minimise the volume of data which must be transmitted whilst maximising the
perceived quality of the speech that can be synthesised from the transmitted
data. In some of the models a distinction is made between whether or not a
particular frame is "voiced" or "unvoiced". In the case of voiced speech, speech
is produced by glottal excitation and as a result has a quasi-periodic structure.
Unvoiced speech is produced by turbulent air flow at a constriction and does not
have the "periodic" spectral structure characteristic of voiced speech. Most
models seek to take advantage of the fact that voiced speech signals evolve
relatively slowly in the context of frames the duration of which is typically 10 to
30msecs. Most models also rely upon quantization schemes intended to minimise
the amount of information which must be transmitted without significant loss of
perceived quality. As a result of the work done to date it is now possible to
produce speech synthesis systems capable of operating at bit rates of only a few
thousand bits per second.
One model which has been developed is known as "sinusoidal coding"
(R.J. McAulay and T.F. Quatieri, "Low Rate Speech Coding Based on Sinusoidal
Coding", Advances in Speech Signal Processing, Editors S. Furui and M.M.
Sondhi, Chapter 6, pp. 165-208, Marcel Dekker, New York, 1992). This
approach relies upon an FFT analysis of each input frame to produce a
magnitude spectrum, estimating the pitch period of the input frame from that
spectrum, and defining the amplitudes at the pitch related harmonics, the harmonics being multiples of the fundamental frequency of the frame. An error
measure is calculated in the time domain representing the difference between
harmonic and aharmonic speech spectra and that error measure is used to define
the degree of voicing of the input frame in terms of a frequency value. Thus the
parameters used to represent a frame are the pitch period, the magnitude and
phase values for each harmonic, and the frequency value. Proposals have been
made to operate this system such that phase information is predicted in a
coherent way across successive frames.
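A minimal sketch of the harmonic-amplitude step of such a sinusoidal coder is given below, assuming the pitch period is already known: the frame is windowed, FFT analysed, and the magnitude spectrum is sampled at the bins nearest the pitch harmonics. The window, FFT size and nearest-bin sampling are illustrative choices rather than those of the cited system.

```python
import numpy as np

def harmonic_amplitudes(frame, pitch_period, fs=8000, nfft=512):
    """Sample the FFT magnitude spectrum of a frame at the pitch harmonics."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed, nfft))
    f0 = fs / pitch_period                               # fundamental frequency in Hz
    harmonics = np.arange(1, int((fs / 2) // f0) + 1) * f0
    bins = np.round(harmonics * nfft / fs).astype(int)   # nearest FFT bin per harmonic
    return harmonics, spectrum[bins]

if __name__ == "__main__":
    fs, p = 8000, 80                                     # 100 Hz pitch
    t = np.arange(160) / fs
    frame = sum(np.cos(2 * np.pi * k * (fs / p) * t) / k for k in range(1, 6))
    freqs, amps = harmonic_amplitudes(frame, p, fs)
    print(list(zip(freqs[:5].tolist(), np.round(amps, 1).tolist())))
```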
In another system known as "multiband excitation coding" (D.W. Griffin
and J.S. Lim, "Multiband Excitation Vocoder" IEEE Transaction on Acoustics,
Speech and Signal Processing, vol. 36, pp 1223-1235, 1988 and Digital Voice
Systems Inc, "INMARSA T M Voice Codec, Version 3.0", Voice Coding System
Description, Module 1, Appendix 1, August 1991) the amplitude and phase
functions are determined in a different way from that employed in sinusoidal
coding. The emphasis in this system is placed on dividing a spectrum into bands,
for example up to twelve bands, and evaluating the voiced/unvoiced nature of
each of these bands. Bands that are classified as unvoiced are synthesised using
random signals. Where the difference between the pitch estimates of successive
frames is relatively small, linear interpolation is used to define the required
amplitudes. The phase function is also defined using linear frequency
interpolation but in addition includes a constant displacement which is a random
variable and which depends on the number of unvoiced bands present in the short term spectrum of the input signal. The system works in a way to preserve
phase continuity between successive frames. When the pitch estimates of
successive frames are significantly different, a weighted summation of signals
produced from amplitudes and phases derived for successive frames is formed to
produce the synthesised signal.
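The weighted summation used when successive pitch estimates differ appreciably can be pictured as a simple cross-fade between the two signals synthesised from the parameters of the adjacent frames; the linear weighting below is an assumption, since the cited description does not fix the window shape.

```python
import numpy as np

def crossfade(x_prev, x_cur):
    """Weighted sum of two synthesised versions of the same interval.

    x_prev: signal synthesised with the previous frame's parameters
    x_cur : signal synthesised with the current frame's parameters
    """
    w = np.linspace(0.0, 1.0, len(x_prev))      # rises from 0 to 1 over the interval
    return (1.0 - w) * x_prev + w * x_cur

if __name__ == "__main__":
    n = np.arange(160)
    a = np.cos(2 * np.pi * 100 * n / 8000)      # tone from the previous frame
    b = np.cos(2 * np.pi * 120 * n / 8000)      # tone from the current frame
    print(np.round(crossfade(a, b)[:4], 3))
```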
Thus the common ground between the sinusoidal and multiband systems
referred to above is that both schemes directly model the input speech signal
which is DFT analysed, and both systems are at least partially based on the same
fundamental relationship for representing speech to be synthesised. The systems
differ however in terms of the way in which amplitudes and phase are estimated
and quantized, the way in which different interpolation methods are used to
define the necessary phase relationships, and the way in which "randomness" is
introduced in the recovered speech.
Various versions of the multiband excitation coding system have been
proposed, for example an enhanced multiband excitation speech coder (A. Das
and A. Gersho, "Variable-Dimension Spectral Coding of Speech at 2400 bps and
below with phonetic classification", IEEE Proc. ICASSP-95, pp. 492-495, May
1995) in which input frames are classified into four types, that is noise, unvoiced,
fully voiced and mixed voiced, and a variable dimension vector quantization
process for spectral magnitude is introduced, the bi-harmonic spectral modelling
system (C. Garcia-Matteo., J. L. Alba-Castro and Eduardo R. Banga, "Speech
Coding Using Bi-Harmonic Spectral Modelling", Proc. EUSIPCO-94, Edinburgh, Vol. 2, pp. 391-394, September 1994) in which the short term
magnitude spectrum is divided into two bands and a separate pitch frequency is
calculated for each band, the spectral excitation coding system (V. Cuperman, P.
Lupini and B. Bhattacharya, "Spectral Excitation Coding of Speech at 2.4 kb/s",
IEEE Proc. ICASSP-95, pp. 504-507, Detroit, May 1995) which applies
sinusoidal based coding in the linear predictive coding (LPC) residual domain
where the synthesised residual signal is the summation of pitch harmonic
oscillators with appropriate amplitude and phase functions and amplitudes are
quantized using a non-square transformation, the band-widened harmonic
vocoder (G. Yang, G. Zanellato and H. Leich, "Band Widened Harmonic Vocoder
at 2 to 4 kbps", IEEE Proc. ICASSP-95, pp. 504-507, Detroit, May 1995) in which
randomness in the signal is introduced by adding jitter to the amplitude
information on a per band basis, pitch synchronous multiband coding (H. Yang,
S. N. Koh and P. Sivaprakasapilai, "Pitch Synchronous Multi-Band (PSMB)
Speech Coding", IEEE Proc. 1CASSP-95, pp. 516-519, Detroit, May 1995) in
which a CELP (code-excited linear prediction) based coding scheme is used to
encode speech period segments, multi band LPC coding (S. Yeldener, M. Kondoz
and G. Evans, "High Quality Multiband LPC Coding of Speech at 2.4 kbits/s",
Electronics Letters, pp. 1287-1289, Vol. 27, No 14, 4th July 1991) in which a single
amplitude value is allocated to each frame to in effect specify a "flat" residual
spectrum, and harmonic and noise coding (M. Nishiguchi and J. Matsumoto,
"Harmonic and Noise Coding of LPC Residuals with Classified Vector Quantisation", IEEE Proc. ICASSP-95, pp. 484-487, Detroit, May 1995) with
classified vector quantization which operates in the LPC residual domain, an
input signal being classified as voiced or unvoiced and being full band modelled.
A further type of coding system exists, that is the prototype interpolation
coding system. This relies upon the use of pitch period segments or prototypes
which are spaced apart in time and reiteration/interpolation techniques to
synthesise the signal between two prototypes. Such a system was described as
early as 1971 (J.S. Severwight, "Interpolation Reiterations Techniques for
Efficient Speech Transmission", Ph.D. Thesis, Loughborough University,
Department of Electrical Engineering, 1971). More sophisticated systems of the
same general class have been described more recently, for example in the paper
by W.B. Kleijn, "Continuous Representations in Linear Predictive Coding, Proc.
ICASSP-91, pp201-204, May 1991. The same author has published a series of
related papers. The system employs 20msecs coding frames which are classified
as voiced or unvoiced. Unvoiced frames are effectively CELP coded. Pitch
prototype segments are defined in adjacent voiced frames, in the LPC residual
signal, in a way which ensures maximum alignment (correlation) of the
prototypes and defines the prototype so that the main pitch excitation pulse is
not near to either of the ends of the prototype. A pitch period in a given frame is
considered to be a cycle of an artificial periodic signal from which the prototype
for the frame is obtained. The prototypes which have been appropriately selected from adjacent frames are Fourier transformed and the resulting
coefficients are coded using a differential vector quantization scheme.
With this scheme, during synthesis of voiced frames, the decoded
prototype Fourier representations for adjacent frames are used to reconstruct
the missing signal waveform between the two prototype segments using linear
interpolation. Thus the residual signal is obtained which is then presented to an
LPC synthesis filter the output of which provides the synthesised voiced speech
signal. An amount of randomness can be introduced into voiced speech by
injecting noise at frequencies larger than 2kHz, the amplitude of the noise
increasing with frequency. In addition, the periodicity of synthesised voiced
speech is controlled during the quantization of prototype parameters in
accordance with a long-term signal-to-change ratio measure that reflects the
similarity which exists between the prototypes of adjacent frames in the residual
excitation signal.
The known prototype interpolation coding systems rely upon a Fourier
Series synthesis equation which involves a linear-with-time-interpolation process.
The assumption is that the pitch estimates for successive frames are linearly
interpolated to provide a pitch function and an associated instant fundamental
frequency. The instant phase used in the cosine and sine terms of the Fourier
series synthesis equation is the integral of the instantaneous harmonic
frequencies. This synthesis arrangement allows for the linear evolution of the instantaneous pitch and the non-linear evolution of the instantaneous harmonic
frequencies.
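A minimal Python sketch of this linear-with-time interpolation synthesis follows: the fundamental is interpolated linearly between the pitch values of adjacent frames, each harmonic phase is formed as the running sum (the discrete counterpart of the integral) of its instantaneous frequency, and the final phases are carried into the next interval so that phase continuity is preserved. Symbol names, amplitudes and interval lengths are illustrative assumptions.

```python
import numpy as np

def synth_interval(p_start, p_end, amps, n, phi0):
    """Fourier-series synthesis over one interpolation interval.

    p_start, p_end : pitch periods (samples) at the interval boundaries
    amps           : harmonic amplitudes (held fixed here for simplicity)
    phi0           : harmonic phases carried over from the previous interval
    """
    i = np.arange(n)
    f0 = 1.0 / p_start + (i / max(n - 1, 1)) * (1.0 / p_end - 1.0 / p_start)
    out, phi_end = np.zeros(n), np.copy(phi0)
    for j, a in enumerate(amps, start=1):
        phase = phi0[j - 1] + 2 * np.pi * j * np.cumsum(f0)   # integral of j*f0
        out += a * np.cos(phase)
        phi_end[j - 1] = phase[-1]          # carried forward: phase continuity
    return out, phi_end

if __name__ == "__main__":
    amps = [1.0, 0.6, 0.3]
    phi = np.zeros(len(amps))               # initial phases set to zero
    seg1, phi = synth_interval(60, 64, amps, 160, phi)
    seg2, phi = synth_interval(64, 58, amps, 160, phi)
    print(np.round(np.concatenate([seg1, seg2])[158:162], 3))  # smooth at the join
```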
A development of this system is described by W.B. Kleijn and J. Haagen,
"A Speech Coder Based on Decomposition of Characteristic Waveforms", Proc.
ICASSP-95, pp508-511, Detroit, May 1995. In the described system the Fourier
series coefficients are low pass filtered over time, with a cut-off frequency of
20Hz, to provide a "slowly evolving" waveform component for the LPC
excitation signal. The difference between this low pass component and the
original parameters provides the "rapidly evolving" components of the excitation
signal. Periodic voice excitation signals are mainly represented by the "slowly
evolving" component, whereas random unvoiced excitation signals are
represented by the "rapidly evolving" component in this dual decomposition of
the Fourier series coefficients. This removes effectively the need for treating
voiced and unvoiced frames separately. Furthermore, the rate of quantization and transmission of the two components is different. The "slowly evolving"
signal is sampled at relatively long intervals of 25msecs, but the parameters are
quantized quite accurately on the basis of spectral magnitude information. In
contrast, the spectral magnitude of the "rapidly evolving" signal is sampled
frequently, every 4msecs, but is quantized less accurately. Phase information is
randomised every 2msecs.
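The decomposition can be pictured as low-pass filtering each harmonic's Fourier-coefficient track over time and taking the remainder as the rapidly evolving part. The sketch below uses a crude moving-average filter in place of a proper 20Hz low-pass, together with an assumed track update rate, purely for illustration.

```python
import numpy as np

def decompose_tracks(coeff_tracks, track_rate_hz=400.0, cutoff_hz=20.0):
    """Split per-harmonic coefficient tracks into slowly and rapidly evolving parts.

    coeff_tracks: array (num_updates, num_harmonics), one row per prototype update.
    A moving average of roughly track_rate_hz / cutoff_hz points stands in for the
    20 Hz low-pass filter; the rapidly evolving part is whatever it removes.
    """
    span = max(int(round(track_rate_hz / cutoff_hz)), 1)
    kernel = np.ones(span) / span
    slow = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"),
                               0, coeff_tracks)
    return slow, coeff_tracks - slow

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    tracks = (np.linspace(1.0, 2.0, 200)[:, None]        # slow drift of 8 harmonics
              + 0.3 * rng.standard_normal((200, 8)))      # fast fluctuation
    sew, rew = decompose_tracks(tracks)
    print(sew.shape, rew.shape)
```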
Other developments of the prototype interpolation coding system have
been proposed. For example one known system operates on 5msec frames, a pitch period being selected for voiced frames and DFT transformed to yield
prototype spectral magnitude values. These values are quantized and the
quantized values for adjacent frames are linearly interpolated. Phase
information is defined in a manner which does not satisfy any frequency
restrictions at the interpolation boundaries. This causes problems of
discontinuity at frame boundaries. At the receiver the excitation signal is
synthesised using a decoded magnitude and estimated phase values, via an
inverse DFT process. The resulting signal is filtered by a following LPC
synthesis filter. This model is purely periodic during voiced speech, and this is
why a very short duration frame is used. Unvoiced speech is CELP coded.
The wide range of speech synthesis models currently being proposed, only
some of which are described above, and the range of alternative approaches
proposed to implement those models, indicates the interest in such systems and
the lack of any consensus as to which system provides the most advantageous
performance.
It is an object of the present invention to provide an improved low bit rate
speech synthesis system.
In known systems in which it is necessary to obtain an estimate of the
pitch of a frame of a speech signal, it has been thought necessary, if high quality
of synthesised speech is to be achieved, to obtain high resolution non-integer
pitch period estimates. This requires complex processes, and it would be highly desirable to reduce the complexity of the pitch estimation process in a manner
which did not result in degraded quality.
According to a first aspect of the present invention, there is provided a
speech synthesis system in which a speech signal is divided into a series of
frames, and each frame is converted into a coded signal including a
voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered
speech segment centred about a reference sample is defined in each frame, a
correlation value is calculated for each of a series of candidate pitch estimates as
the maximum of multiple crosscorrelation values obtained from variable length
speech segments centred about the reference sample, the correlation values are
used to form a correlation function defining peaks, and the locations of the
peaks are determined and used to define a pitch estimate.
The result of the above system is that an integer pitch period value is
obtained. The system avoids undue complexity and may be readily implemented.
Preferably the pitch estimate is defined using an iterative process. A
single reference sample may be used, for example centred with respect to the
respective frame, or alternatively multiple pitch estimates may be derived for
each frame using different reference samples, the multiple pitch estimates being
combined to define a combined pitch estimate for the frame. The pitch estimate
may be modified by reference to a voiced/unvoiced status and/or pitch estimates
of adjacent frames to define a final pitch estimate. The correlation function may be clipped using a threshold value,
remaining peaks being rejected if they are adjacent to larger peaks. Peaks are
initially selected and can be rejected if they are smaller than a following peak by
more than a predetermined factor, for example smaller than 0.9 times the
following peak.
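A minimal sketch of this clipping and peak-picking stage is given below. The clipping threshold used here is a simple stand-in for the threshold function defined later in the description, while the 0.9 rejection factor and the local-maximum test follow the text; delays, array sizes and names are illustrative.

```python
import numpy as np

def correlation_peaks(cr, d_min=20, clip_ratio=0.4, reject_ratio=0.9):
    """Pick candidate peak delays from a correlation function CR(d).

    cr[k] holds CR(d_min + k).  Values below a threshold are clipped to zero,
    local maxima are kept, and a peak is dropped when it is smaller than
    reject_ratio times the following peak.
    """
    th = clip_ratio * cr.max()                       # stand-in threshold
    clipped = np.where(cr > th, cr, 0.0)
    locs = [k for k in range(1, len(clipped) - 1)    # local maxima of clipped CR
            if clipped[k] > clipped[k - 1] and clipped[k] > clipped[k + 1]]
    kept = []
    for i, k in enumerate(locs):
        if i + 1 < len(locs) and clipped[k] < reject_ratio * clipped[locs[i + 1]]:
            continue                                 # much smaller than the next peak
        kept.append(d_min + k)                       # report the actual delay
    return kept

if __name__ == "__main__":
    d = np.arange(20, 148)                           # candidate delays 20..147
    cr = np.cos(2 * np.pi * d / 42.0) ** 2           # peaks roughly every 21 samples
    print(correlation_peaks(cr))
```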
Preferably the pitch estimation procedure is based on a least squares
error algorithm. Preferably the algorithm defines the pitch as a number whose
multiples best fit the correlation function peak locations. Initial possible pitch
values may be limited to integral numbers which are not consecutive, the
increment between two successive numbers being proportional to a constant
multiplied by the lower of those two numbers.
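The least squares selection can then be sketched as follows: candidate pitch values are drawn from a non-consecutive grid whose increment is proportional to the candidate itself, and the candidate whose integer multiples best fit the measured peak locations is retained. The miss penalty that discourages sub-multiples, and the 0.2 tolerance, are our own additions needed to make the toy example behave sensibly.

```python
import numpy as np

def lse_pitch(peak_locs, p_min=21, p_max=147, step_factor=0.1):
    """Pick the pitch whose integer multiples best fit the correlation-peak
    locations, searching a non-consecutive candidate grid in which each
    candidate is the previous one plus step_factor times itself."""
    peaks = np.asarray(peak_locs, dtype=float)
    candidates, p = [], p_min
    while p <= p_max:
        candidates.append(p)
        p += max(int(round(step_factor * p)), 1)
    best_p, best_err = None, np.inf
    for cand in candidates:
        multiples = np.round(peaks / cand) * cand            # nearest multiples
        fit = np.mean((peaks - multiples) ** 2)
        expected = np.arange(cand, p_max + 1, cand)          # multiples in range
        miss = sum(np.min(np.abs(peaks - m)) > 0.2 * cand for m in expected)
        err = fit + miss * cand                              # penalise missing support
        if err < best_err:
            best_p, best_err = cand, err
    return best_p                    # a finer local search could refine this value

if __name__ == "__main__":
    print(lse_pitch([48, 96, 145]))  # peaks from a signal with period close to 48
```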
It is well known from the prior art to classify individual frames as voiced
or unvoiced and to process those frames in accordance with that classification.
Unfortunately such a simple classification process does not accurately reflect the
true characteristics of speech. It is often the case that individual frames are made up of both periodic (voiced) and aperiodic (unvoiced) components. Prior
attempts to address this problem have not proved particularly effective.
It is an object of the present invention to provide an improved voiced or
unvoiced classification system.
According to a second aspect of the present invention there is provided a
speech synthesis system in which a speech signal is divided into a series of
frames, and each frame is converted into a coded signal including pitch segment magnitude spectral information, a voiced/unvoiced classification, and a mixed
voiced classification which classifies harmonics in the magnitude spectrum of
voiced frames as strongly voiced or weakly voiced, wherein a series of samples
centred on the middle of the frame are windowed to form a data array which is
Fourier transformed to produce a magnitude spectrum, a threshold value is
calculated and used to clip the magnitude spectrum, the clipped data is searched
to define peaks, the locations of peaks are determined, constraints are applied to
define dominant peaks, and harmonics not associated with a dominant peak are
classified as weakly voiced.
Peaks may be located using a second order polynomial. The samples may
be Hamming windowed. The threshold value may be calculated by identifying
the maximum and minimum magnitude spectrum values and defining the
threshold as a constant multiplied by the difference between the maximum and
minimum values. Peaks may be defined as those values which are greater than
the two adjacent values. A peak may be rejected from consideration if
neighbouring peaks are of a similar magnitude, e.g. more than 80% of the
magnitude, or if there are spectral magnitudes of greater magnitude in the same
range. A harmonic may be considered as not being associated with a
dominant peak if the difference between two adjacent peaks is greater than a
predetermined threshold value.
The spectrum may be divided into bands of fixed width and a
strongly/weakly voiced classification assigned for each band. Alternatively the frequency range may be divided into two or more bands of variable width,
adjacent bands being separated at a frequency selected by reference to the
strongly/weakly voiced classification of harmonics.
Thus, the spectrum may be divided into fixed bands, for example fixed
bands each of 500Hz, or variable width bands selected in dependence upon the
strongly/weakly voiced status of harmonic components of the excitation signal. A
strongly/weakly voiced classification is then assigned to each band. The lowest
frequency band, e.g. 0-500Hz, may always be regarded as strongly voiced,
whereas the highest frequency band, for example 3500Hz to 4000Hz, may always
be regarded as weakly voiced. In the event that a current frame is voiced, and
the previous frame is unvoiced, other bands within the current frame, e.g.
3000Hz to 3500Hz may be automatically classified as weakly voiced. Generally
the strongly/weakly voiced classification may be determined using a majority
decision rule on the strongly/weakly voiced classification of those harmonics
which fall within the band in question. If there is no majority, alternate bands may be alternately assigned strongly voiced and weakly voiced classifications.
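A minimal sketch of such a majority decision rule is given below (Python; the band edges, the tie-breaking choice and the data layout are illustrative assumptions rather than a definitive implementation):

    def classify_bands(hv, harmonic_hz, band_edges_hz):
        # hv[j] is 1 (strongly voiced) or 0 (weakly voiced) for harmonic j,
        # harmonic_hz[j] is its frequency; returns one flag per band.
        flags = []
        for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
            votes = [hv[j] for j, f in enumerate(harmonic_hz) if lo <= f < hi]
            strong = sum(votes)
            weak = len(votes) - strong
            if strong > weak:
                flags.append(1)
            elif weak > strong:
                flags.append(0)
            else:
                # tie (or empty band): alternate with respect to the previous band
                flags.append(1 - flags[-1] if flags else 1)
        return flags

    # Example: 500 Hz wide bands over a 4 kHz bandwidth (illustrative)
    edges = list(range(0, 4500, 500))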
Given the classification of a voiced frame such that harmonics are
classified as either strongly or weakly voiced, it is necessary to generate an
excitation signal to recover the speech signal which takes into account this classification. It is an object of the present invention to provide such a system.
According to a third aspect of the present invention, there is provided a
speech synthesis system in which a speech signal is divided into a series of frames, each frame is defined as voiced or unvoiced, each frame is converted into
a coded signal including a pitch period value, a frame voiced/unvoiced
classification and, for each voiced frame, a mixed voiced spectral band
classification which classifies harmonics within spectral bands as either strongly
or weakly voiced, and the speech signal is reconstructed by generating an
excitation signal in respect of each frame and applying the excitation signal to a
filter, wherein for each weakly voiced spectral band, an excitation signal is
generated which includes a random component in the form of a function which is
dependent upon the respective pitch period value.
Thus for each frame which has a spectral band that is classified as weakly
voiced, the excitation signal is represented by a function which includes a first
harmonic frequency component, the frequency of which is dependent upon the
pitch period value appropriate to that frame, and a second random component
which is superimposed upon the first component.
The random component may be introduced by reducing the amplitude of
harmonic oscillators assigned the weakly voiced classification, for example by
reducing the power of the harmonics by 50%, while disturbing the oscillator
frequencies, for example by shifting the oscillators randomly in frequency in the
range of 0 to 30 Hz such that the frequency is no longer a multiple of the
fundamental frequency, and then adding further random signals. The phase of
the oscillators producing random signals may be randomised at pitch intervals. Thus for a weakly voiced band, some periodicity remains but the power of the
periodic component is reduced and then combined with a random component.
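The following sketch illustrates one possible reading of this weakly voiced excitation model (Python with NumPy); the function name, the number of random components and the exact amplitude scaling are assumptions chosen to match the figures quoted above:

    import numpy as np

    def weakly_voiced_harmonic(amp, f_harm, n_samples, fs=8000, n_random=4, rng=None):
        # amp, f_harm: amplitude and frequency (Hz) of one weakly voiced harmonic.
        rng = rng or np.random.default_rng()
        t = np.arange(n_samples) / fs
        # periodic part: half the power (amplitude / sqrt(2)), frequency dithered by 0..30 Hz
        f_dithered = f_harm + rng.uniform(0.0, 30.0)
        periodic = (amp / np.sqrt(2)) * np.cos(2 * np.pi * f_dithered * t)
        # random part: cosines spaced 50 Hz apart around the harmonic, random phases
        random_part = np.zeros(n_samples)
        for q in range(n_random):
            f_q = f_harm + (q - n_random / 2) * 50.0 + 25.0   # illustrative spacing
            phase = rng.uniform(-np.pi, np.pi)
            random_part += (amp / np.sqrt(2 * n_random)) * np.cos(2 * np.pi * f_q * t + phase)
        return periodic + random_part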
In a speech synthesis system in which a speech signal is represented in
part by spectral information in the form of harmonic magnitude values, it is
possible to process an input speech signal to produce a series of spectral
magnitude values and then to use all of those magnitude values at harmonic
locations in subsequent processing steps. In many circumstances however at
least some of the magnitude values contain little information which is useful in
the recovery of the input speech signal. Accordingly when magnitude values are
quantized for transmission to a receiver it is sensible to discard magnitude values
which contain little useful information.
In one known system an input speech signal is processed to produce an
LPC residual signal which in turn is processed to provide harmonic magnitude
values, but only a fixed number of those magnitude values is vector quantized for
transmission to a receiver. The discarded magnitude values are represented at
the receiver as identical constant values. This known system reduces
redundancy but is inflexible in that the locations of the fixed number of
magnitude values to be quantized are always the same and predetermined on the
basis of assumptions that may be inappropriate in particular circumstances.
It is an object of the present invention to provide an improved magnitude
value quantization system. According to a fourth aspect of the present invention, there is provided a
speech synthesis system in which a speech signal is divided into a series of
frames, and each voiced frame is converted into a coded signal including a pitch
period value, LPC coefficients, and pitch segment spectral magnitude
information, wherein the spectral magnitude information is quantized by
sampling the LPC short term magnitude spectrum at harmonic frequencies, the
locations of the largest spectral samples are determined to identify which of the
magnitudes are relatively more important for accurate quantization, and the
magnitudes so identified are selected and vector quantized.
Thus rather than relying upon a simple location selection strategy of a
fixed number of magnitude values for quantization and transmission, for
example the "low part" of the magnitude spectrum, the invention selects only
those values which make a significant contribution according to the subjectively
important LPC magnitude spectrum, thereby reducing redundancy without
compromising quality.
In one arrangement in accordance with the invention a pitch segment of
Pn LPC residual samples is obtained, where Pn is the pitch period value of the
nth frame, the pitch segment is DFT transformed, the mean value of the
resultant spectral magnitudes is calculated, the mean value is quantized and used
as a normalisation factor for the selected magnitudes, and the resulting
normalised amplitudes are quantized. Alternatively, the RMS value of the pitch segment is calculated, the RMS
value is quantized and used as a normalisation factor for the selected
magnitudes, and the resulting normalised amplitudes are quantized.
At the receiver, the selected magnitudes are recovered, and each of the
other magnitude values is reproduced as a constant value.
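A simple sketch of the selection and normalisation steps might look as follows (Python with NumPy); the value of n_keep and the omission of the actual quantisers are illustrative simplifications:

    import numpy as np

    def select_and_normalise(lpc_env_at_harmonics, residual_magnitudes, n_keep=8):
        # lpc_env_at_harmonics: LPC short term magnitude spectrum sampled at harmonics;
        # residual_magnitudes: pitch segment spectral magnitudes to be quantized.
        env = np.asarray(lpc_env_at_harmonics, dtype=float)
        mags = np.asarray(residual_magnitudes, dtype=float)
        keep = np.argsort(env)[-n_keep:]      # locations of the largest envelope samples
        keep.sort()
        mean_val = mags.mean()                # would itself be quantized and transmitted
        selected = mags[keep] / mean_val      # normalised amplitudes passed to the VQ
        return keep, mean_val, selected

    # At the receiver the unselected harmonics would be reproduced as a constant value.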
Interpolation coding systems which employ a pitch-related synthesis
formula to recover speech generally encounter the problem of coding a variable
length, pitch-dependent spectral amplitude vector. The quantization scheme
referred to above in which only the magnitudes of relatively greater importance
are quantized avoids this problem by quantizing only a fixed number of
magnitude values and setting the rest of the magnitude values to a constant
value. Thus at the receiver a fixed length vector can be recovered. Such a
solution to the problem however may result in a relatively spectrally flat
excitation model which has limitations in providing high recovered speech
quality.
In an ideal world output speech quality would be maximised by
quantizing the entire shape of the magnitude spectrum, and various approaches
have been proposed for coding the entire magnitude spectrum. In one approach,
the spectrum is DFT transformed and coded differentially across successive
spectra. This and similar coding schemes are rather inefficient however and
operate with relatively high bit rates. The introduction of vector quantization allowed for the development of sinusoidal and prototype interpolation systems
which operate at lower bit rates, typically around 2.4Kbits/sec.
Two vector quantization methodologies have been reported which
quantize a variable size input vector with a fixed size code vector. In a first
approach, the input vector is transformed to a fixed size vector which is then
conventionally vector quantized. An inverse transform of the quantized fixed
size vector yields the recovered quantized vector. Transformation techniques
which have been used include linear interpolation, band limited interpolation, all
pole modelling and non-square transformation. This approach however
produces an overall distortion which is the summation of the vector quantization
noise and a component which is introduced by the transformation process. In a
second known approach, a variable input vector is directly quantized with a
fixed size code vector. This approach is based on selecting only a limited number
of elements from each codebook vector to form a distortion measure between a
codebook vector and an input vector. Such a quantization approach avoids the
transformation distortion of the alternative technique mentioned above and
results in an overall distortion that is equal to the vector quantization noise, but
this noise is itself significant.
It is an object of the present invention to provide an improved variable
sized spectral vector quantization scheme.
According to a fifth aspect of the present invention, there is provided a
speech synthesis system in which a variable size input vector of coefficients to be transmitted to a receiver for the reconstruction of a speech signal is vector
quantized using a codebook defined by vectors of fixed size, the codebook vectors
of fixed size are obtained from variable size training vectors and an interpolation
technique which is an integral part of the codebook generation process, codebook
vectors are compared to the variable sized input vector using the interpolation
process, and an index associated with the codebook entry with the smallest
difference from the comparison is transmitted, the index being used to address a
further codebook at the receiver and thereby derive an associated fixed size
codebook vector, and the interpolation process being used to recover from the
derived fixed sized codebook vector an approximation of the variable sized input
vector.
The invention is applicable in particular to pitch synchronous low bit rate
coders of the type described in this document and takes advantage of the
underlying principle of such coders which means that the shape of the magnitude
spectrum is represented by a relatively small number of equally spaced samples.
Preferably the interpolation process is linear. For an input vector of given dimension, the interpolation process is applied to produce from the
codebook vectors a set of vectors of that given dimension. A distortion measure
is then derived to compare the interpolated set of vectors and the input vector
and the codebook vector which yields the minimum distortion is selected.
Preferably the dimension of the input vectors is reduced by taking into
account only the harmonic amplitudes within the input bandwidth range, for example 0 to 3.4kHz. Preferably the remaining amplitudes i.e. in the region of
3.4kHz to 4 kHz are set to a constant value. Preferably, the constant value is
equal to the mean value of the quantized amplitudes.
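As an illustration, a codebook search using linear interpolation of fixed size codebook entries to the dimension of the variable size input vector could be sketched as follows (Python with NumPy); codebook generation is not shown, and the plain linear resampling and squared-error measure used here are assumptions:

    import numpy as np

    def interpolate_to(vec, target_len):
        # linearly resample a fixed size codebook vector to the input dimension
        x_src = np.linspace(0.0, 1.0, len(vec))
        x_dst = np.linspace(0.0, 1.0, target_len)
        return np.interp(x_dst, x_src, vec)

    def search_codebook(input_vec, codebook):
        # codebook: (n_entries, fixed_dim) array; returns the index of the best entry
        best_idx, best_err = -1, np.inf
        for idx, entry in enumerate(codebook):
            cand = interpolate_to(entry, len(input_vec))
            err = np.sum((cand - np.asarray(input_vec, dtype=float)) ** 2)
            if err < best_err:
                best_idx, best_err = idx, err
        return best_idx

The decoder would interpolate the addressed codebook vector back to the transmitted harmonic count to approximate the original variable size vector.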
Amplitude vectors obtained from adjacent residual frames exhibit
significant amounts of redundancy which can be removed by means of backward
prediction. The backward prediction may be performed on a harmonic basis
such that the amplitude value of each harmonic of one frame is predicted from the amplitude value of the same harmonic in the previous frame or frames. A
fixed linear predictor may be incorporated in the system, together with mean
removal and gain shape quantization processes which operate on a resulting
error magnitude vector.
Although the above described variable sized vector quantization scheme
provides advantageous characteristics, and in particular provides for good
perceived signal quality at a bit rate of for example 2.4Kbits/sec, in some
environments a lower bit rate would be highly desirable even at the loss of some
quality. It would be possible for example to rely upon a single value
representation and quantization strategy on the assumption that the magnitude
spectrum of the pitch segment in the residual domain has an approximately flat
shape. Unfortunately systems based on this assumption have a rather poor
decoded speech quality.
It is an object of the present invention to overcome the above limitation in
lower bit rate systems. According to a sixth aspect of the present invention, there is provided a
speech synthesis system in which a speech signal is divided into a series of
frames, each frame is converted into a coded signal including an estimated pitch
period, an estimate of the energy of a speech segment the duration of which is a
function of the estimated pitch period, and LPC filter coefficients defining an
LPC spectral envelope, and a speech signal of related power to the power of the
input speech signal is reconstructed by generating an excitation signal using
spectral amplitudes which are defined from a modified LPC spectral envelope
sampled at the harmonic frequencies defined by the pitch period.
Thus, although a single value is used to represent the spectral envelope of
the excitation signal, the excitation spectral envelope is shaped according to the
LPC spectral envelope. The result is a system which is capable of delivering high
quality speech at 1.5Kbits/sec. The invention is based on the observation that
some of the speech spectrum resonance and anti-resonance information is also
present in the residual magnitude spectrum, since LPC inverse filtering cannot
produce a residual signal of absolutely flat magnitude spectrum. As a
consequence, the LPC residual signal is itself highly intelligible.
The magnitude values may be obtained by spectrally sampling a modified
LPC synthesis filter characteristic at the harmonic locations related to the pitch
period. The modified LPC synthesis filter may have reduced feedback gain and
a frequency response which consists of equalised resonant peaks, the locations of
which are close to the LPC synthesis resonant locations. The value of the feedback gain may be controlled by the performance of the LPC model such that it is
for example proportional to the normalised LPC prediction error. The energy of
the reproduced speech signal may be equal to the energy of the original speech
waveform.
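The following sketch indicates how excitation magnitudes might be obtained by sampling a modified LPC synthesis filter, with a reduced feedback gain g, at the harmonic frequencies (Python with NumPy); the sign convention assumed for the LPC coefficients and the fixed value of g are illustrative assumptions:

    import numpy as np

    def modified_lpc_magnitudes(lpc_coeffs, pitch_period, fs=8000, g=0.8):
        # lpc_coeffs: a[1..p] of A(z) = 1 - sum a_k z^-k  (assumed sign convention)
        a = np.asarray(lpc_coeffs, dtype=float)
        f0 = fs / pitch_period                               # fundamental frequency in Hz
        harmonics = np.arange(1, int((fs / 2) // f0) + 1) * f0
        mags = []
        for f in harmonics:
            w = 2 * np.pi * f / fs
            # evaluate 1 / (1 - sum a_k (g e^{-jw})^k): feedback gain reduced by g
            denom = 1.0 - np.sum(a * (g * np.exp(-1j * w)) ** np.arange(1, len(a) + 1))
            mags.append(1.0 / abs(denom))
        return harmonics, np.array(mags)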
It is well known that in prototype interpolation coding speech synthesis
systems there are often substantial similarities between the prototypes of
adjacent frames in the residual excitation signals. This has been used in various
systems to improve perceived speech quality by ensuring that there is a smooth
evolution of the speech signal over time.
It is an object of the present invention to provide an improved speech
synthesis system in which the excitation and vocal tract dynamics are
substantially preserved in the recovered speech signal.
According to a seventh aspect of the present invention, there is provided a
speech synthesis system in which a speech signal is divided into a series of
frames, each frame is converted into a coded signal including LPC filter
coefficients and at least one parameter associated with a pitch segment
magnitude, and the speech signal is reconstructed by generating two excitation
signals in respect of each frame, each pair of excitation signals comprising a first
excitation signal generated on the basis of the pitch segment magnitude
parameter or parameters of one frame and a second excitation signal generated
on the basis of the pitch segment magnitude parameter or parameters of a
second frame which follows and is adjacent to the said one frame, applying the first excitation signal to a first LPC filter the characteristics of which are
determined by the LPC filter coefficients of the said one frame and applying the
second excitation signal to a second LPC filter the characteristics of which are
determined by the LPC filter coefficients of the said second frame, and weighting
and combining the outputs of the first and second LPC filters to produce one
frame of a synthesised speech signal.
Preferably the first and second excitation signals include the same phase
function and different phase contributions from the two LPC filters involved in
the above double synthesis process. This reduces the degree of pitch periodicity
in the recovered signals. This and the combination of the first and second LPC
filter outputs ensures an effective smooth evolution of the speech spectral
envelope on a sample by sample basis.
Preferably the outputs of the first and second LPC filters are weighted by
half a window function such as a Hamming window such that the magnitude of
the output of the first filter is decreasing with time and the magnitude of the
output of the second filter is increasing with time.
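A minimal sketch of this double synthesis and cross-fading arrangement is given below (Python with NumPy and SciPy); half Hanning windows are used purely as an example of a suitable half window, and the generation of the two excitation signals themselves is not shown:

    import numpy as np
    from scipy.signal import lfilter

    def double_synthesis(excitation_prev, excitation_curr, a_prev, a_curr):
        # a_prev / a_curr: denominator coefficients [1, -a1, ..., -ap] of 1/A(z)
        n = len(excitation_prev)
        x_prev = lfilter([1.0], a_prev, excitation_prev)   # previous-frame LPC filter
        x_curr = lfilter([1.0], a_curr, excitation_curr)   # current-frame LPC filter
        fade = 0.5 - 0.5 * np.cos(np.pi * np.arange(n) / n)  # rises from 0 towards 1
        # first filter output decreases with time, second increases with time
        return x_prev * (1.0 - fade) + x_curr * fade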
According to an eighth aspect of the present invention, there is provided a
speech coding system which operates on a frame by frame basis, and in which
information is transmitted which represents each frame as either voiced or
unvoiced and, for each voiced frame, represents that frame by a pitch period
value, quantized magnitude spectral information, and LPC filter coefficients, the
received pitch period value and magnitude spectral information being used to generate residual signals at the receiver which are applied to LPC speech
synthesis filters the characteristics of which are determined by the transmitted
filter coefficients, wherein each residual signal is synthesised according to a
sinusoidal mixed excitation synthesis process, and a recovered speech signal is
derived from the residual signals.
Embodiments of the present invention will now be described, by way of
example, with reference to the accompanying drawings, in which:
Figure 1 is a general block diagram of the encoding process in accordance with the present invention;
Figure 2 illustrates the relationship between coding and matrix
quantisation frames;
Figure 3 is a general block diagram of the decoding process;
Figure 4 is a block diagram of the excitation synthesis process;
Figure 5 is a schematic diagram of the overlap and add process;
Figure 6 is a schematic diagram of the calculation of an instantaneous
scaling factor;
Figure 7 is a block diagram of the overall voiced/unvoiced classification
and pitch estimation process;
Figure 8 is a block diagram of the pitch estimation process;
Figure 9 is a schematic diagram of two speech segments which participate
in the calculation of a crosscorrelation function value;
Figure 10 is a schematic diagram of speech segments used in the calculation of the crosscorrelation function value;
Figure 11 represents the value allocated to a parameter used in the
calculation of the crosscorrelation function value for different delays;
Figure 12 is a block diagram of the process used for calculating the
crosscorrelation function and the selection of its peaks;
Figure 13 is a flow chart of a pitch estimation algorithm; Figure 14 is a flow chart of a procedure used in the pitch estimation
process;
Figure 15 is a flow chart of a further procedure used in the pitch
estimation process;
Figure 16 is a flow chart of a further procedure used in the pitch
estimation process.
Figure 17 is a flow chart of a threshold value selection procedure;
Figure 18 is a flow chart of the voiced/unvoiced classification process;
Figure 19 is a schematic diagram of the voiced/unvoiced classification
process with respect to parameters generated during the pitch estimation
process;
Figure 20 is a flow chart of the procedure used to determine offset values;
Figure 21 is a flow chart of the pitch estimation algorithm;
Figure 22 is a flow chart of a procedure used to impose constraints on
output pitch estimates to ensure smooth evolution of pitch values with time;
Figures 23, 24 and 25 represent different portions of a flow chart of a
pitch post processing procedure;
Figure 26 is a general block diagram of the LPC analysis and LPC
quantisation process;
Figure 27 is a general flow chart of a strongly or weakly voiced
classification process; Figure 28 is a flow chart of the procedure responsible for the
strongly/weakly voiced classification.
Figure 29 represents a speech waveform obtained from a particular
speech utterance;
Figure 30 shows frequency tracks obtained for the speech utterance of
Figure 29;
Figure 31 shows to a larger scale a portion of Figure 30 and represents the
difference between strongly and weakly voiced classifications;
Figure 32 shows a magnitude spectrum of a particular speech segment
and the corresponding LPC spectral envelope and the normalised short term
magnitude spectra of the corresponding residual segment, excitation segment
obtained using a binary excitation model and an excitation segment obtained
using the strongly/weakly voiced model;
Figure 33 is a general block diagram of a system for representing and
quantising magnitude information;
Figure 34 is a block diagram of an adaptive quantiser shown in Figure 33;
Figure 35 is a general block diagram of a quantisation process;
Figure 36 is a general block diagram of a differential variable size
spectral vector quantiser; and
Figure 37 represents the hierarchical structure of a mean gain shape
quantiser. A system in accordance with the present invention is described below, firstly in general terms and then in greater detail. The system operates on an LPC residual signal on a frame by frame basis.
Speech is synthesised using the following general expression:

    s(i) = Σ (k=0 to K) Ak(i) cos(ϑk(i) + φk)    (1)

where i is the sampling instant and Ak(i) represents the amplitude value of the kth cosine term cos(θk(i)) (with θk(i) = ϑk(i) + φk) as a function of i. In voiced speech ϑk(i) depends on the pitch frequency of the signal.
A voiced/unvoiced classification process allows the coding of voiced and unvoiced frames to be handled in different ways. Unvoiced frames are modelled in terms of an RMS value and a random time series. In voiced frames a pitch period estimate is obtained and used to define a pitch segment which is centred at the middle of the frame. Pitch segments from adjacent frames are DFT transformed and only the resulting pitch segment magnitude information is coded and transmitted. Furthermore, pitch segment magnitude samples are classified as strongly or weakly voiced. Thus the information which is transmitted for every voiced frame is, in addition to the voiced/unvoiced information, the pitch period value, the magnitude spectral information of the pitch segment, the strong/weak voiced classification of the pitch magnitude spectral values, and the LPC filter coefficients.
At the receiver a synthesis process, that includes interpolation, is used to reconstruct the waveform between the middle points of the current (n+1)th and previous nth frames. The basic synthesis equation for the residual signal is:

    Res(i) = Σ (j=0 to K) MGj cos(phasej(i))    (2)

where MGj are decoded pitch segment magnitude values and phasej(i) is calculated from the integral of the linearly interpolated instantaneous harmonic frequencies ωj(i). K is the largest value of j for which ωj^n(i) < π.
In the transitions from unvoiced to voiced, the initial phase for each harmonic is set to zero. Phase continuity is preserved across the boundaries of successive interpolation intervals.
The synthesis process is performed twice however, once using the magnitude spectral values MGj^(n+1) of the pitch segment derived from the current (n+1)th frame and again using the magnitude values MGj^n of the pitch segment derived in the previous nth frame. The phase function phasej(i) in each case remains the same. The resulting residual signals Resn(i) and Resn+1(i) are used as inputs to corresponding LPC synthesis filters calculated for the nth and (n+1)th speech frames. The two LPC synthesised speech waveforms are then weighted by Wn+1(i) and Wn(i) to yield the recovered speech signal.
Thus the overall synthesis process, for successive voiced frames, can be described by:

    S(i) = Wn(i) Σ (j=0 to K) H^n(ωj^n(i)) MGj^n cos[phasej^n(i) + φ^n(ωj^n(i))]
         + Wn+1(i) Σ (j=0 to K) H^(n+1)(ωj^n(i)) MGj^(n+1) cos[phasej^n(i) + φ^(n+1)(ωj^n(i))]

where H^n(ωj^n(i)) is the frequency response of the nth frame LPC synthesis filter calculated at the ωj^n(i) harmonic frequency function at the ith instant, and φ^n(ωj^n(i)) is the associated phase response of this filter. ωj^n(i) and phasej^n(i) are the frequency and phase functions defined for the sampling instants i, with i covering the middle of the nth frame to the middle of the (n+1)th frame segments. K is the largest value of j for which ωj^n(i) < π. The above speech synthesis process introduces two "phase dispersion" terms, i.e. φ^n(ωj^n(i)) and φ^(n+1)(ωj^n(i)), which effectively reduce the degree of pitch periodicity in the recovered signal. In addition, this "double synthesis" arrangement followed by an overlap-add process ensures an effective smooth evolution of the speech spectral envelope (LPC) on a sample by sample basis.
The LPC excitation signal is based on a "mixed" excitation model which allows for the appropriate mixing of periodic and random excitation components in voiced frames on a frequency-band basis. This is achieved by operating the system such that the magnitude spectrum of the residual signal is examined, and applying a peak-picking process, near the ωj resonant frequencies, to detect possible dominant spectral peaks. A peak associated with a frequency ωj indicates a high degree of voicing (represented by hvj=1) for that harmonic. The absence of an adjacent spectral peak, on the other hand, indicates a certain degree of randomness (represented by hvj=0). When hvj=1 (to indicate "strong" voicing) the contribution of the jth harmonic to the synthesis process is MGj cos(phasej(i)). However, when hvj=0 (to indicate "weak" voicing) the frequency of the jth harmonic is slightly dithered, its magnitude MGj is reduced to MGj/√2 and random cosine terms are added symmetrically alongside the jth harmonic ωj. The terms "strong" and "weak" are used in this sense below. The number NRS of these random terms is

    NRS = 2 × ⌈ ωj / (4π × (50/fs)) ⌉

where ⌈ ⌉ indicates rounding off to the next larger integer value. Furthermore, the NRS random components are spaced at 50 Hz intervals symmetrically about ωj, ωj being located in the middle of such a 50 Hz interval. The amplitudes of the NRS random components are set to MGj / √(2 × NRS). Their initial phases are selected randomly from the [-π, +π] region at
pitch period intervals. The hvj information must be transmitted to be available at the receiver and, in order to reduce the bit rate allocated to hvj, the bandwidth of the input signal is divided into a number of fixed size bands BDk and a "strongly" or "weakly" voiced flag Bhvk is assigned for each band. In a "strongly" voiced band, a highly periodic signal is reproduced. In a "weakly" voiced band, a signal which combines both periodic and aperiodic components is required. These bands are classified as strongly voiced (Bhvk=1) or weakly voiced (Bhvk=0) using a majority decision rule approach on the hvj classification values of the harmonics ωj contained within each frequency band.
Further restrictions can be imposed on the strongly/weakly voiced profiles resulting from the classification of bands. For example, the first λ bands may always be strongly voiced, i.e. hvj=1 for BDk with k=1,2,...,λ, and λ being a variable. The remaining spectral bands can be strongly or weakly voiced.
Figure 1 schematically illustrates processes operated by the system encoder. These processes are referred to in Figure 1 as Processes I to VII and these terms are used throughout this document. Figure 2 represents the relationship between analysis/coding frame sizes employed. These are M samples per coding frame, e.g. 160 samples per frame, and k frames are analysed in a block, for example k=4. This block size is used for matrix quantization. A speech signal is input and Processes I, III, IV, VI and VII produce outputs for transmission.
Assuming that the first Matrix Quantization analysis frame (MQA) of k×M samples is available, each of the k coding frames within the MQA is classified as voiced or unvoiced (Vn) using Process I. A pitch estimation part of Process I provides a pitch period value Pn only when a coding frame is voiced. Process II operates in parallel on the input speech samples and estimates p LPC filter coefficients a (for example p=10) every L samples (L is a multiple of M, i.e. L=m×M, and m may be equal to, for example, 2). In addition, k/m is an integer and represents the frame dimension of the matrix quantizer employed in Process III. Thus the LPC filter coefficients are quantized, using Process III, and transmitted. The quantized coefficients a are used to derive a residual signal Rn(i).
When an input coding frame is unvoiced, the energy En of the residual obtained for this frame is calculated (Process VII). √En is then quantized and transmitted.
When the nth coding frame is classified as voiced, a segment of Pn residual samples is obtained (Pn is the pitch period value associated with the nth frame). This segment is centred in the middle of the frame. The selected Pn samples are DFT transformed (Process V) to yield ⌈(Pn + 1)/2⌉ spectral magnitude values MGj^n, 0 ≤ j < ⌈(Pn + 1)/2⌉, and ⌊(Pn + 1)/2⌋ phase values. The phase information is neglected. The magnitude information is coded (using Process VI) and transmitted. In addition a segment of 20 msecs, which is centred in the middle of the nth coding frame, is obtained from the residual signal Rn(i). This is input to Process IV, together with Pn, to provide the strongly/weakly voiced classification parameters hvj^n of the harmonics ωj^n. Process IV produces quantized Bhv information, which for voiced frames is multiplexed and transmitted to the receiver together with the voiced/unvoiced decision Vn, the pitch period Pn, the quantized LPC coefficients a of the corresponding LPC frame, and the magnitude values MGj^n. In unvoiced frames only the quantized √En value and the quantized LPC filter coefficients a are transmitted.
Figure 3 schematically illustrates processes operated by the system decoder. In general terms, given the received parameters of the nth coding frame and those of the previous (n-1)th coding frame, the decoder synthesises a speech signal Sn(i) that extends from the middle of the (n-1)th frame to the middle of the nth frame. This synthesis process involves the generation in parallel of two excitation signals Resn(i) and Resn-1(i) which are used to drive two independent LPC synthesis filters 1/An(z) and 1/An-1(z) the coefficients of which are derived from the transmitted quantized coefficients a. The outputs Xn(i) and Xn-1(i) of these synthesis filters are weighted and added to provide a speech segment which is then post filtered to yield the recovered speech Sn(i). The excitation synthesis process used in both paths of Figure 3 is shown in more detail in Figure 4.
The process commences by considering the voiced/unvoiced status Vk, where k is equal to n or n-1 (see Figure 4). When the frame is unvoiced, i.e. Vk=0, a gaussian random number generator RG(0,1) of zero mean and unit variance provides a time series which is subsequently scaled by the √Ek value received for this frame. This is effectively the required

    Resk(i) = √Ek × RG(0,1)    (5)

signal which is then presented to the corresponding LPC synthesis filter 1/Ak(z), k=n or n-1. Performance could be increased if the √Ek value was calculated, quantized and transmitted every 5 msecs. Thus, provided that bits are available when coding unvoiced speech, four √Ek,ξ, ξ=0,...,3, values are transmitted for every unvoiced frame of 20 msecs duration (160 samples).
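A sketch of this unvoiced excitation branch, assuming either one RMS value per frame or four values at 5 msec intervals, is given below (Python with NumPy; the function and argument names are illustrative):

    import numpy as np

    def unvoiced_excitation(sqrt_energies, frame_len=160, rng=None):
        # sqrt_energies: a single sqrt(E) value for the 20 msec frame, or four
        # values (one per 5 msec sub-segment) when extra bits are available.
        rng = rng or np.random.default_rng()
        gains = np.atleast_1d(np.asarray(sqrt_energies, dtype=float))
        noise = rng.standard_normal(frame_len)              # RG(0,1) time series
        return noise * np.repeat(gains, frame_len // len(gains))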
In the case where Vk=1, the Resk(i) excitation signal is defined as the summation of a "harmonic" Resk^h(i) component and a "random" Resk^r(i) component. The top path of the Vk=1 part of the synthesis in Figure 4, which provides the harmonic component of this mixed excitation model, always calculates the instantaneous harmonic frequency function ωj^n(i) which is associated with the interpolation interval that is defined between the middle points of the nth and (n-1)th frames (i.e. this action is independent of the value of k). Thus, when decoding the nth frame, ωj^n(i) is calculated using the pitch frequencies fj^(1,n), fj^(2,n) and linear interpolation, i.e.

    ωj^n(i) = 2π [ fj^(1,n) + (fj^(2,n) - fj^(1,n)) × i/M ]    (6)

with 0 ≤ j < ⌈(Pmax + 1)/2⌉, 0 ≤ i ≤ M and Pmax = max[Pn, Pn-1].
The frequencies fj^(1,n) and fj^(2,n) are defined as follows:

I) When both the nth and (n-1)th coding frames are voiced, i.e. Vn=1 and Vn-1=1, the pitch frequencies are estimated as follows:

a) If

    |Pn - Pn-1| < 0.2 × (Pn + Pn-1)    (7)

which means that the pitch values of the nth and (n-1)th coding frames are rather similar, then:

    fj^(2,n) = j/Pn + (1 - hvj^n) × RU(-a,+a)    (8)

    fj^(1,n) = j/Pn-1 + (1 - hvj^(n-1)) × RU(-a,+a)  when |Pn-1 - Pn-2| < 0.2 × (Pn-1 + Pn-2), and fj^(1,n) = fj^(2,n-1) otherwise    (9)

The fj^(2,n-1) value is calculated during the decoding process of the previous (n-1)th coding frame. hvj^n is the strongly/weakly voiced classification (0 or 1) of the jth harmonic ωj^n. Pn and Pn-1 are the received pitch estimates of the nth and (n-1)th frames. RU(-a,+a) indicates the output of a random number generator with uniform pdf within the -a to +a range (a=0.00375).

b) If

    |Pn - Pn-1| > 0.2 × (Pn + Pn-1)    (10)
then

    fj^(2,n) = j/Pn + (1 - hvj^n) × RU(-a,+a)    (11)

and fj^(1,n) = fj^(2,n-1) + b × j, where b is defined by equation (12) as a term which takes the sign of (j/Pn - fj^(2,n-1)) and whose magnitude is limited.

Notice that in case (b), which applies for significantly different Pn and Pn-1 pitch estimates, equations 11 and 12 ensure that the rate of change of the ωj^n(i) function is restricted to a prescribed maximum.

II) When one of the two coding frames (i.e. n or n-1) is unvoiced, one of the following two definitions is applicable:

a) For Vn-1=0 and Vn=1:
    fj^(2,n) = j/Pn, for 0 ≤ j < ⌈(Pn + 1)/2⌉

and fj^(1,n) is given by Equation (8).

b) For Vn-1=1 and Vn=0, fj^(1,n) is set to the fj^(2,n-1) value, which has been calculated during the decoding process of the previous (n-1)th coding frame, and fj^(2,n) = fj^(1,n).
Given ωj^n(i), the instantaneous phase function phasej^n(i) is calculated by:

    phasej^n(i) = 2π (fj^(2,n) - fj^(1,n)) i²/(2M) + 2π fj^(1,n) i + phasej^(n-1)(M)    (13)

for 0 ≤ j < ⌈(Pmax + 1)/2⌉ and 0 ≤ i ≤ M.
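As a numerical illustration of equation (13), the phase track of one harmonic over an interpolation interval of M samples can be computed directly from the two end-point frequencies (Python with NumPy; the normalised frequencies and the example values below are assumptions):

    import numpy as np

    def harmonic_phase_track(f1, f2, M, phase_prev_end=0.0):
        # f1, f2: end-point harmonic frequencies normalised to the sampling rate;
        # the quadratic term is the discrete integral of the linearly swept frequency.
        i = np.arange(M + 1, dtype=float)
        return 2 * np.pi * (f2 - f1) * i ** 2 / (2 * M) + 2 * np.pi * f1 * i + phase_prev_end

    # Example: first harmonic moving from 100 Hz to 110 Hz at fs = 8000 Hz
    track = harmonic_phase_track(100 / 8000, 110 / 8000, M=160)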
Furthermore, the "harmonic" component Resk^h(i) of the residual signal is given by:

    Resk^h(i) = Σ (j=0 to ⌈(Pk + 1)/2⌉ - 1) Cj(i) MGj^k cos(phasej^n(i))    (14)

where k=n or n-1,

    Cj(i) = 0 if ωj^n(i) > π, and Cj(i) = 1 if ωj^n(i) ≤ π

and MGj^k, j = 0, ..., ⌈(Pk + 1)/2⌉ - 1, are the received magnitude values of the kth coding frame, with k=n or k=n-1.
The second path of the Vk=1 case in Figure 4 provides the random excitation component Resk^r(i). In particular, given the recovered strongly/weakly voiced classification values hvj^k, the system calculates for those harmonics with hvj^k=0 the number NRS of random sinusoidal components which are used to randomise the corresponding harmonic. This is:

    NRS = 2 × ⌈ ωj^k / (4π × (50/fs)) ⌉    (15)

where fs is the sampling frequency. Notice that the NRS random sinusoidal components are located symmetrically about the corresponding harmonic ωj^k and they are spaced 50 Hz apart.
The instantaneous frequency of the qth random component, q=0,1,...,NRS-1, for the jth harmonic ωj^k is calculated by:

    ωj,q^k(i) = ωj^k(i) + 2π × (25/fs) + (q - NRS/2) × 2π × (50/fs)    (16)

for 0 ≤ q < NRS and 0 ≤ i ≤ M.
The associated phase value is:

    Phj,q^k(i) = ωj,q^k(i) × i + φ    (17)

for 0 ≤ q < NRS and 0 ≤ i ≤ M, where φ = RU(-π,+π). In addition, the Phj,q^k(i) function is randomised at pitch intervals (i.e. when the phase of the fundamental harmonic component is a multiple of 2π, i.e. mod(phase0^n(i), 2π) = 0).
Given Phj,q^k(i), the random excitation component Resk^r(i) is calculated as follows:

    Resk^r(i) = Σ (j: hvj^k=0) Σ (q=0 to NRS-1) Cj,q(i) × (MGj^k / √(2 × NRS)) × cos(Phj,q^k(i))    (18)

where

    Cj,q(i) = 0 if ωj,q^k(i) > π, and Cj,q(i) = 1 if ωj,q^k(i) < π.

Thus for Vk=1 voiced coding frames, the mixed excitation residual is formed as:

    Resk(i) = Resk^h(i) + Resk^r(i)    (19)
Notice that when Vk=0, instead of using Equation 5, the random excitation signal Resk(i) can be generated by the summation of random cosines located 50 Hz apart, where their phase is randomised every λ samples, and λ<M, i.e.

    Resk(i) = Σ_μ cos(2π μ (50/fs) i + δ(i - λ×ξ - ζ) × RU(-π,+π))    (20)

where the summation index μ runs over the cosines spaced 50 Hz apart, ξ = 0,1,2,..., 0 ≤ i < M, and ζ is defined so as to ensure that the phase of the cos terms is randomised every λ samples across frame boundaries. The resulting Resn(i) and Resn-1(i) excitation sequences, see Figure 4, are processed by the corresponding 1/An(z) and 1/An-1(z) LPC synthesis filters. When coding the next (n+1)th frame, 1/An-1(z) becomes 1/An(z) (including the memory) and 1/An(z) becomes 1/An+1(z) with the memory of 1/An(z). This is valid in all cases except during an unvoiced to voiced transition, where the memory of the 1/An+1(z) filter is set to zero. The coefficients of the 1/An(z) and 1/An-1(z) synthesis filters are calculated directly from the nth and (n-1)th coding speech frames respectively, when the LPC analysis frame size L is equal to M samples. However, when L≠M (usually L>M) linear interpolation is used on the filter coefficients (defined every L samples) so that the transfer function of the synthesis filter is updated every M samples. The output signals of these filters, denoted as Xn-1(i) and Xn(i), are weighted, overlapped and added as schematically illustrated in Figure 5 to yield X̄n(i), i.e.:
    X̄n(i) = Xn-1(i) Wn-1(i) + Xn(i) Wn(i)

where Wn-1(i) (equation 21) is a half-window weighting which decreases with i over the M-sample interval and Wn(i) (equation 22) is the complementary half-window weighting which increases with i.
X̄n(i) is then filtered via a PF(z) post filter and a high pass filter HP(z) to yield the speech segment S'n(i). PF(z) is the conventional post filter:

    PF(z) = (1 - μ z⁻¹) × An(z/b) / An(z/c)    (23)

with b=0.5, c=0.8 and μ = 0.5 k1^n, where k1^n is the first reflection coefficient of the nth coding frame. HP(z) is defined as:

    HP(z) = (b1 - c1 z⁻¹) / (1 - a1 z⁻¹)    (24)

with b1 = c1 = 0.9807 and a1 = 0.961481.
In order to ensure that the energy of the recovered S(i) signal is preserved, as compared to that of the X̄(i) sequence, a scaling factor SC is calculated every LPC frame of L samples:

    SCl = √(E'l / El)    (25)

where E'l = Σ (i=0 to L-1) X̄l(i)² and El = Σ (i=0 to L-1) S'l(i)².

SCl is associated with the middle of the lth LPC frame as illustrated in Figure 6. The filtered samples from the middle of the (l-1)th frame to the middle of the lth frame are then multiplied by SCl(i) to yield the final output of the system,

    Sl(i) = S'l(i) × SCl(i)

where:

    SCl(i) = SCl Wl(i) + SCl-1 Wl-1(i),  0 ≤ i < L    (26)

and

    Wl(i) = 0.5 - 0.5 cos(π i/L),  0 ≤ i < L
    Wl-1(i) = 0.5 + 0.5 cos(π i/L),  0 ≤ i < L
The scaling process introduces an extra half LPC frame delay into the coding-decoding process.
The above described energy scaling procedure operates on an LPC frame basis in contrast to both the decoding and PF(z), HP(z) filtering procedures which operate on the basis of a frame of M samples.
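The scaling arithmetic of equations (25) and (26) can be illustrated as follows (Python with NumPy); the raised-cosine interpolation windows follow the expressions above and the function names are illustrative:

    import numpy as np

    def scale_factor(x_frame, s_frame):
        # square root of the energy ratio between the overlap-add output and
        # the postfiltered signal for one LPC frame (equation 25)
        e_x = np.sum(np.asarray(x_frame, dtype=float) ** 2)
        e_s = np.sum(np.asarray(s_frame, dtype=float) ** 2)
        return np.sqrt(e_x / e_s) if e_s > 0 else 1.0

    def interpolate_gain(sc_prev, sc_curr, L):
        # cross-fade between successive scaling factors (equation 26)
        i = np.arange(L)
        w_up = 0.5 - 0.5 * np.cos(np.pi * i / L)
        return sc_prev * (1.0 - w_up) + sc_curr * w_up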
Details of the coding processes represented in Figure 1 will now be described. Process I derives a voiced/unvoiced (V/UV) classification Vn for the nth input coding frame and also assigns a pitch estimate Pn to the middle sample Mn of this frame. This process is illustrated in Figure 7.
The V/UV and pitch estimation analysis frame is centred at the middle Mn+1 of the (n+1)th coding frame with 237 samples on either side. The signal x(i) in the above analysis frame is low pass filtered with a cut off frequency fc=1.45 kHz and the resulting (-147, 147) samples centred about Mn+1 are used in a pitch estimation algorithm, which yields an estimate P_Mn+1. The pitch estimation algorithm is illustrated in Figure 8, where P represents the output of the pitch estimation process. The 294 input samples are used to calculate a crosscorrelation function CR(d), where d is shown in Figure 9 and 20<d<147. Figure 9 shows the two speech segments which participate in the calculation of the crosscorrelation function value at "d" delay. In particular, for a given value of d, the crosscorrelation function ρd(j) is calculated for the segments {xL}d, {xR}d as:

    ρd(j) = Σi (xL^d(i) - x̄L^d)(xR^d(i) - x̄R^d) / √[ Σi (xL^d(i) - x̄L^d)² × Σi (xR^d(i) - x̄R^d)² ]    (27)

where xL^d(i) = x(Mn+1 - d + j + i), xR^d(i) = x(Mn+1 + j + i), for 0 ≤ i ≤ d-j-1 and j=0,1,...,f(d). Figure 10 schematically represents the xL^d and xR^d speech segments used in the calculation of the value CR(d), and the non linear relationship between d and f(d) is given in Figure 11. x̄L^d and x̄R^d represent the mean values of the {xL}d and {xR}d sequences respectively.
The algorithm then selects max[ρd(j)] over 0 ≤ j ≤ f(d) and defines CR(d) = max (0≤j≤f(d)) [ρd(j)], 20<d<147.
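A sketch of this crosscorrelation measure is given below (Python with NumPy); the indexing follows equation (27), but the f(d) relationship of Figure 11 is supplied by the caller since it is defined graphically:

    import numpy as np

    def rho(x, centre, d, j):
        # mean-removed, normalised crosscorrelation of the two (d - j)-sample segments
        length = d - j
        left = np.array(x[centre - d + j: centre - d + j + length], dtype=float)
        right = np.array(x[centre + j: centre + j + length], dtype=float)
        left = left - left.mean()
        right = right - right.mean()
        denom = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
        return float(np.dot(left, right) / denom) if denom > 0 else 0.0

    def cr(x, centre, d, f_of_d):
        # CR(d) is the maximum of rho over the allowed offsets j
        return max(rho(x, centre, d, j) for j in range(f_of_d + 1))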
In addition to CR(d), the box in Figure 8 labelled "Calculation of CR function and selection of its peaks", whose detailed diagram is shown in Figure 12, also provides the locations loc(k) of the peaks of the CR(d) function, where k=1,2,...,Np and Np is the number of peaks in a CR(d) function.
Figure 12 is a block diagram of the process involving the calculation of the CR function and the selection of its peaks. As illustrated, given CR(d), a threshold th(d) is determined as:
    th(d) = CR(dmax) - b - (d - dmax) × a - c    (28)

where dmax is the delay at which CR(d) is maximum, c=0.08 when (V'n = 1) AND (d > 0.875 × P'n) AND (d < 1.125 × P'n), and c=0 elsewhere; a and b are constants defined for the algorithm. Using this threshold the CR(d) function is clipped to CRL(d), i.e. CRL(d) = 0 for CR(d) < th(d) and CRL(d) = CR(d) otherwise.
CRL(d) contains segments Gs, s=1,2,3,..., of positive values separated by G0 runs of zero values. The algorithm examines the length of the G0 runs which exist between successive Gs segments (i.e. Gs and Gs+1), and when G0 < 17, then the Gs segment with the max CRL(d) value is kept. This procedure yields CRL(d), which is then examined by the following "peak picking" procedure. In particular those CRL(d) values are selected for which: CRL(d) > CRL(d-1) and CRL(d) > CRL(d+1). However certain peaks can be rejected if: CRL(loc(k)) ≤ CRL(loc(k+1)) × 0.9. This ensures that the final CRL(loc(k)), k=1,...,Np, does not contain spurious low level CRL(d) peaks. The locations d of the above defined CRL(d) peaks are given by loc(k), k=1,2,...,Np.
CR(d) and loc(k) are used as inputs to the following Modified High Resolution Pitch Estimation algorithm (MHRPE) shown in Figure 8, whose output is P_Mn+1. The flowchart of this MHRPE procedure is shown in Figure 13, where P is initialised with 0 and, at the end, the estimated P is the requested P_Mn+1. In Figure 13 the main pitch estimation procedure is based on a Least Squares Error (LSE) algorithm which is defined as follows. For each possible pitch value j in the range from 21 to 147 with an increment of 0.1 × j, i.e. j ∈ {21,23,25,27,30,33,36,40,44,48,53,58,64,70,77,84,92,101,111,122,134} (thus 21 iterations are performed):
1) Form the multiplication factor vector: uj = round(loc / j), element by element, where loc is the vector of peak locations.

2) Reject possible pitch j and go back to (1) if: a) the same element occurs in uj twice; or b) the elements of uj have a prime number as a common factor.

3) Form the following error quantity:

    Ej = locᵀ loc - 2 pj ujᵀ loc + pj² ujᵀ uj,  where pj = (ujᵀ loc) / (ujᵀ uj)

4) Select the pjs value for which the associated error quantity Ejs is minimum (i.e. js: Ejs ≤ Ej for all j ∈ {21,23,...,134}). Set P = pjs.

The next two general conditions, "Reject Highest Delay" loc(Np) and "Reject Lowest Delay" loc(1), are included in order to reject false pitch, "double" or "half" values and in general to provide constraints on the pitch estimates of the system. The "Reject Highest Delay" condition involves 3 constraints: i) if P=0 then reject loc(Np); ii) if loc(Np) > 100 then find the local maximum CR(dlm) in CR(d) in the vicinity of the estimated pitch P (i.e. 0.8×P to 1.2×P) and compare this with th(dlm), which is determined as in Equation 28; reject loc(Np) when CR(dlm) < th(dlm) - 0.02; iii) if the error Ejs of the LSE algorithm is larger than 50 and u(Np) = Np with Np > 2 then reject loc(Np). The flowchart of this is given in Figure 14.
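The least squares error step can be illustrated as follows (Python with NumPy); only steps 1, 3 and 4 are shown, with a simplified duplicate-multiple test standing in for the full rejection rules of step 2:

    import numpy as np

    def lse_pitch(loc, candidates):
        # loc: peak locations of the clipped correlation function;
        # candidates: possible pitch values, e.g. the non-uniform grid above.
        loc = np.asarray(loc, dtype=float)
        best_p, best_err = 0.0, np.inf
        for j in candidates:
            u = np.rint(loc / j)                       # multiplication factor vector
            if np.any(u < 1) or len(set(u)) != len(u):
                continue                               # duplicate multiples: skip
            p_j = float(u @ loc) / float(u @ u)        # least squares pitch for these multiples
            e_j = float(loc @ loc) - 2 * p_j * float(u @ loc) + p_j ** 2 * float(u @ u)
            if e_j < best_err:
                best_p, best_err = p_j, e_j
        return best_p, best_err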
The "Reject Lowest Delay" general condition, whose flowchart is given in Figure 15, rejects loc(1) when the following three constraints are simultaneously satisfied: i) the density of detection of the peaks of the correlation coefficient function is less than or equal to 0.75, i.e. Np ≤ 0.75 × u(Np); ii) if the location of the first peak (i.e. loc(1)) is neglected, then the remaining locations exhibit a common factor; iii) the value of the correlation coefficient function at the locations of the missing peaks is relatively small compared to adjacent detected peaks, i.e. if u(k+1) - u(k) > 1 for some k=1,...,Np, then for i = u(k)+1 to u(k+1)-1: a) find the local maximum CR(dlm) in the range from (i-0.1)×loc(1) to (i+0.1)×loc(1); b) if CR(dlm) < 0.97 × the CR value at the adjacent detected peak, then Reject Lowest Delay and end, else continue.

This concludes the pitch estimation procedure of Figure 7, whose output is P_Mn+1. As is also illustrated in Figure 7 however, in parallel to the pitch estimation, Process I obtains 160 samples centred at the middle of the (n+1)th coding frame, removes their mean value, and then calculates R0, R1 and the average Rav of the energies of the previous K non-silence coding frames. K is fixed to 50 for the first 50 non-silence coding frames, increases from 50 to 100 with the next 50 non-silence coding frames, and then remains constant at the value of 100. The flowchart of the procedure that calculates Rav, R1, R0 and updates the Rav buffer is shown in Figure 16, where "Count" represents the number of non-silence speech frames, and "++" denotes increase by one. Notice that TH is an adaptive threshold that is representative of a silence (non speech) frame and is defined as in Figure 17. CR in this case is equal to
the maximum value of the CR function calculated for the Mn+1 analysis frame.

Given R0, R1, Rav and this CR value, the V/UV part of Process I calculates the status V_Mn+1 of the (n+1)th frame. The flowchart of this part of the algorithm is shown in Figure 18, where "V" represents the output V/UV flag of this procedure. Setting the "V" flag to 1 or 0 indicates voiced or unvoiced classification respectively. The "CR" parameter denotes the maximum value of the CR function which is calculated in the pitch estimation process. A diagrammatic representation of the voiced/unvoiced procedure is given in Figure 19.
Having the V_Mn+1 value, the P_Mn+1 estimate and the V'n and P'n estimates which have been produced from Process I operating on the previous nth coding frame, as illustrated in Figure 7, part b, two further locations Mn+1+d1 and Mn+1+d2 are estimated and the corresponding [-147,147] segments of filtered speech samples are obtained as illustrated in Figure 7, part b. These additional two analysis frames are used as input to the "Pitch Estimation process" of Figure 8 to yield P_Mn+1+d1 and P_Mn+1+d2. The procedure for calculating d1 and d2 is given in the flowchart of Figure 20.
The final step in part (a) of Process I of Figure 7 involves the previous V/UV classification procedure of Figure 8, with inputs R0, R1, Rav and the maximum CR value obtained from the three analysis frames, to yield a preliminary voiced/unvoiced value for the (n+1)th frame.

In addition, a multipoint pitch estimation algorithm accepts P_Mn+1, P_Mn+1+d1, P_Mn+1+d2, Vn-1, Pn-1, V'n and P'n to provide a preliminary pitch value for the (n+1)th frame. The flowchart of this multipoint pitch estimation algorithm is given in Figure 21, where P1, P2 and P0 represent the pitch estimates associated with the Mn+1+d1, Mn+1+d2 and Mn+1 points respectively, and P denotes the output pitch estimate of the process for the (n+1)th frame.
Finally, part (b) of Process I of Figure 7 imposes constraints on these preliminary voicing and pitch estimates in order to ensure a smooth evolution for the pitch parameter. The flowchart of this section is given in Figure 22. At the start of this process "V" and "P" represent the voicing flag and pitch estimate values before constraints are applied, whereas at the end of the process "V" and "P" represent the voicing flag and pitch estimate values after the constraints have been applied (V'n+1 and P'n+1). The V'n+1 and P'n+1 produced from this section are then used in the next pitch post-processing section together with Vn-1, V'n, Pn-1 and P'n to yield the final voiced/unvoiced and pitch estimate parameters Vn and Pn for the nth coding frame. This pitch post-processing stage is defined in the flowchart of Figures 23, 24 and 25, the output A of Figure 23 being the input to Figure 24, and the output B of Figure 24 being the input to Figure 25. At the start of this procedure "Pn" and "Vn" represent the pitch estimate and voicing flag respectively which correspond to the nth coding frame prior to post-processing, whereas at the end of the procedure "Pn" and "Vn" represent the final pitch estimate and voicing flag associated with the nth frame (i.e. Pn, Vn).

The LPC analysis process (Process II of Figure 1) can be performed using the Autocorrelation, Stabilised Covariance or Lattice methods. The Burg algorithm was used, although simple autocorrelation schemes could be employed without a noticeable effect on the decoded speech quality. The LPC coefficients are then transformed to an LSP representation. Typical values for the number of coefficients are 10 to 12 and a 10th order filter has been used. LPC analysis processes are well known and described in the literature, for example "Digital Processing of Speech Signals", L.R. Rabiner and R.W. Schafer, Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1978. Similarly, LSP representations are well known, for example from "Line Spectrum Pair and Speech Data Compression", F. Soong and B.H. Juang, Proc. ICASSP-84, pp 1.10.1-1.10.4, 1984. Accordingly these processes and representations will not be described further in this document.
In Process II, ten LSP coefficients are used to represent the data. These 10 coefficients could be scalar quantized using 37 bits with the following bit allocation pattern [3,4,4,4,4,4,4,4,3,3]. This is a relatively simple process, but the resulting bit rate of 1850 bits/second is unnecessarily high. Alternatively the LSP coefficients can be Vector Quantised (VQ) using a Split-VQ technique. In the Split-VQ technique an LSP parameter vector of dimension "p" is split into two or more subvectors of lower dimensions and then each subvector is Vector Quantised separately (when Vector Quantising the subvectors a direct VQ approach is used). In effect, the LSP transformed coefficient vector C, which consists of "p" consecutive coefficients (c1, c2, ..., cp), is split into "K" vectors Ck (1 ≤ k ≤ K), with corresponding dimensions dk (1 ≤ dk ≤ p) which sum to p.

In particular, when "K" is set to "p" (i.e. when C is partitioned into "p" elements) the Split-VQ becomes equivalent to Scalar Quantisation. On the other hand, when "K" is set to unity (K=1, dk=p) the Split-VQ becomes equivalent to Full
Search VQ. The above Split-VQ approach leads to an LPC filter bit rate of the order of 1.3 to 1.4 Kbits/sec. In order to minimize further the bit rate of the voice coding system described in this document a Split Matrix VQ (SMQ) has been developed in the University of Manchester and reported in "Efficient Coding of LSP Parameters using Split Matrix Quantisation", C. Xydeas and C. Papanastasiou, Proc. ICASSP-95, pp 740-743, 1995. This method results in transparent LPC quantisation at 900 bits/sec and offers a flexible way to obtain, for a given quantisation accuracy, the required memory/complexity characteristics for Process III. An important feature of SMQ is a new weighted Euclidean distance which is defined in detail as follows.
    D(Lk(l), L'k(l)) = Σ (t=0 to N-1) Σ (s=0 to m(k)-1) (LSP_(S(k)+s)^(l+t) - LSP'_(S(k)+s)^(l+t))² ws(s,t)² wt(t)    (29)

where L'k(l) represents the kth (k=1,...,K) quantized submatrix and LSP'_(S(k)+s)^(l+t) are its elements. m(k) represents the spectral dimension of the kth submatrix and N is the SMQ frame dimension. Note also that S(k) = Σ (j=0 to k-1) m(j), m(0) = 1 and Σ (k=1 to K) m(k) = p.

    wt(t) = (1 - Er(t)) × [En(t) / Aver(En)]^α  for transmission frames 0 ≤ t ≤ N-1    (30)

when the N LPC frames consist of both voiced and unvoiced frames, and

    wt(t) = En(t)^α1  otherwise

where Er(t) is the normalised energy of the prediction error of the (l+t)th frame, En(t) is the RMS value of the (l+t)th speech frame and Aver(En) is the average RMS value of the N LPC frames used in SMQ. The values of the constants α and α1 are set to 0.2 and 0.15 respectively.

Also:

    ws(s,t) = | P(LSP_(S(k)+s)^(l+t)) |^β    (31)

where P(LSP_(S(k)+s)^(l+t)) is the value of the power envelope spectrum of the (l+t)th speech frame at the LSP_(S(k)+s)^(l+t) frequency, and β is equal to 0.15.
The overall SMQ quantisation process that yields the quantised LSP coefficient vectors for the l to l+N-1 analysis frames is shown in Figure 26. This figure also includes the inverse process, which accepts the above quantised vectors, i=0,...,N-1, and provides the corresponding LPC coefficient vectors a^l to a^(l+N-1). The a^(l+i), i=0,...,N-1, coefficient vectors are modified, prior to the LPC to LSP transformation, by a 10 Hz bandwidth expansion as indicated in Figure 26. A 5 Hz bandwidth expansion is also included in the inverse quantisation process.
Process IV of Figure 1 will now be described. This process is concerned with the mixed voiced classification of harmonics. When the nth coding frame is classified as voiced, the residual signal Rn(i) of length 160 samples centred at the middle Mn of the nth coding frame and the pitch period Pn for that frame are used to determine the strongly voiced (hvj=1)/weakly voiced (hvj=0) classification associated with the jth harmonic ωj^n. The flowchart of Process IV is given in Figure 27. The Rn array of 160 samples is Hamming windowed and augmented to form a 512 size array, which is then FFT processed. The maximum and minimum values MGRmax, MGRmin of the resulting 256 spectral magnitude values are determined, and a threshold TH0 is calculated. TH0 is then used to clip the magnitude spectrum. The clipped MGR array is searched to define peaks MGR(P) satisfying:
    MGR(P) > MGR(P+1) and MGR(P) > MGR(P-1)
For each peak MGR(P), "supported" by the MGR(P+1) and MGR(P-1) values, a second order polynomial is fitted and the maximum point of this curve is accepted as MGR(P) with a location loc(MGR(P)). Further constraints are then imposed on these magnitude peaks. In particular peaks are rejected: a) if there are spectral peaks in the neighbourhood of loc(MGR(P)) (i.e. in the range loc(MGR(P))-fo/2 to loc(MGR(P))+fo/2, where fo is the fundamental frequency in Hz) whose value is larger than 80% of MGR(P), or b) if there are any spectral magnitudes in the same range whose value is larger than MGR(P). After applying these two constraints the remaining spectral peaks are characterised as "dominant" peaks. The objective of the remaining part of the process is to examine if there is a "dominant" peak near a given harmonic j×ω0, in which case the harmonic is classified as strongly voiced and hvj=1, otherwise hvj=0. In particular, two thresholds are defined as follows:
THl=0.15χfo. TH2=(L5/Pn)χfo with fo=(l/Pn)xfs and fs is the sampling frequency.
The difference (loc(MGRd(k)) - loc(MGRd(k-1))) is compared to 1.5×fo+TH2, and if it is larger the related harmonic is not associated with a "dominant" peak and the corresponding classification hvj is zero (weakly voiced). loc(MGRd(k)) is the location of the kth dominant peak, k=1,...,D, where D is the number of dominant peaks. This procedure is described in detail in Figure 28, in which it should be noted that the harmonic index j does not always correspond to the magnitude spectrum peak index k, and loc(k) is the location of the kth dominant peak, i.e. loc(MGRd(k)) = loc(k).
In order to minimise the bit rate associated with the transmission of the hvj information, two schemes have been employed which coarsely represent hvj.
Scheme I
The spectrum is divided into bands of 500 Hz each and a strongly voiced/weakly voiced flag Bhv is assigned to each band. The first and last 500 Hz bands, i.e. 0 to 500 Hz and 3500 to 4000 Hz, are always regarded as strongly voiced (Bhv=1) and weakly voiced (Bhv=0) respectively. When Vn=1 and Vn-1=1 the 500 to 1000 Hz band is classified as voiced, i.e. Bhv=1. Furthermore, when Vn=1 and Vn-1=0 the 3000 to 3500 Hz band is classified as weakly voiced, i.e. Bhv=0. The Bhv values of the remaining 5 bands are determined using a majority decision rule on the hvj values of the j harmonics which fall within the band under consideration. When the number of harmonics for a given band is even and no clear majority can be established, i.e. the number of harmonics with hvj=1 is equal to the number of harmonics with hvj=0, then the value of Bhv for that band is set to the opposite of the value assigned to the immediately preceding band. At the decoding process the hvj of a specific harmonic j is equal to the Bhv value of the corresponding band. Thus the hv information may be transmitted with 5 bits.
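A minimal Python sketch of Scheme I is given below; it is illustrative only (all names are assumptions) and assumes eight 500 Hz bands covering 0 to 4000 Hz.

```python
def scheme_1_band_flags(hv, fo_hz, vn, vn_prev):
    """Scheme I: map per-harmonic flags hv[j] to eight 500 Hz band flags Bhv.

    hv          : strongly(1)/weakly(0) voiced flags for harmonics 1..len(hv)
    fo_hz       : fundamental frequency in Hz
    vn, vn_prev : voiced/unvoiced decisions of the current and previous frames
    """
    bhv = [None] * 8
    bhv[0] = 1                    # 0-500 Hz always strongly voiced
    bhv[7] = 0                    # 3500-4000 Hz always weakly voiced
    if vn == 1 and vn_prev == 1:
        bhv[1] = 1                # 500-1000 Hz
    if vn == 1 and vn_prev == 0:
        bhv[6] = 0                # 3000-3500 Hz

    # Remaining bands: majority rule over the harmonics falling inside each band
    for b in range(8):
        if bhv[b] is not None:
            continue
        flags = [hv[j] for j in range(len(hv))
                 if b * 500 <= (j + 1) * fo_hz < (b + 1) * 500]
        ones = sum(flags)
        zeros = len(flags) - ones
        if ones > zeros:
            bhv[b] = 1
        elif zeros > ones:
            bhv[b] = 0
        else:
            bhv[b] = 1 - bhv[b - 1]   # tie: opposite of the preceding band
    return bhv
```

The five flags that are not fixed in advance are the 5 bits transmitted to the decoder.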
Scheme II
In this case the 680 Hz to 3400 Hz range is represented by only two variable size bands. When Vn=1 and Vn-1=0 the Fc frequency that separates these two bands can be one of the following:
(A) 680, 1360, 2040, 2720, whereas, when Vn=1 and Vn-1=1, Fc can be one of the following frequencies:
(B) 1360, 2040, 2720, 3400.
Furthermore, the 0 to 680 and 3400 to 4000 Hz bands are always represented with Bhv=1 and Bhv=0 respectively. The Fc frequency is selected by examining the three bands sequentially defined by the frequencies in (A) or (B) and by using again a majority rule on the harmonics which fall within a band. When a band with a mixed voiced classification Bhv=0 is found, i.e. the number of harmonics with hvj=0 is larger than the number of harmonics with hvj=1, then Fc is set to the lower boundary of this band and the remaining spectral region is classified as Bhv=0. In this case only 2 bits are allocated to define Fc. The lower band is strongly voiced with Bhv=1, whereas the higher band is weakly voiced with Bhv=0.
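The following Python fragment is an illustrative sketch of the Scheme II boundary selection; the names and the default used when no weakly voiced band is found are assumptions, since the text specifies only the candidate Fc values and the majority rule.

```python
def scheme_2_fc(hv, fo_hz, vn_prev):
    """Scheme II: choose the boundary Fc between a strongly voiced lower band (Bhv=1)
    and a weakly voiced upper band (Bhv=0) in the 680-3400 Hz range."""
    # Candidate boundaries depend on the previous frame's voicing decision
    edges = [680, 1360, 2040, 2720] if vn_prev == 0 else [1360, 2040, 2720, 3400]

    # Examine the three bands delimited by the candidate frequencies in turn
    fc = edges[-1]                     # assumed default if every band is majority voiced
    for lo, hi in zip(edges[:-1], edges[1:]):
        flags = [hv[j] for j in range(len(hv)) if lo <= (j + 1) * fo_hz < hi]
        zeros = len(flags) - sum(flags)
        if zeros > sum(flags):         # band is predominantly weakly voiced
            fc = lo                    # Fc is the lower boundary of this band
            break
    return fc                          # coded with 2 bits (index into the 4 candidates)
```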
To illustrate the effect of the mixed voiced classification on the speech synthesised from the transmitted information, Figures 29 and 30 represent respectively an original speech
waveform obtained for the utterance "Industrial shares were mostly a" and frequency tracks
obtained for that utterance. The horizontal axis represents time in terms of frames each of
20msec duration. Figure 31 shows to a larger scale a section of Figure 30, and represents
frequency tracks by full lines for the case when the voiced frames are all deemed to be
strongly voiced (hv=l) and by dashed lines when the strongly/weakly voiced classification is
taken into account so as to introduce random perturbations when hv=0.
Figure 32 shows four waveforms A, B, C and D. Waveform A represents the magnitude
spectrum of a speech segment and the corresponding LPC spectral envelope (log10 domain).
Waveforms B, C and D represent the normalised Short-Term magnitude spectrum of the
corresponding residual segment (B), the excitation segment obtained using the binary
(voiced/unvoiced) excitation model (C), and the excitation segment obtained using the
strongly voiced/weakly voiced/unvoiced hybrid excitation model (D). It will be noted that
the hybrid model introduces an appropriate amount of randomness where required in the 3π/4
to π range such that curve D is a much closer approximation to curve B than curve C.
Process V of Figure 1 will now be described. Once the residual signal has been derived, a segment of Pn samples is obtained in the residual signal domain. The magnitude spectrum of the segment, which contains excitation source information, is derived by applying a Pn-point DFT. An alternative solution, in order to avoid the computational complexity of the Pn-point DFT, is to apply a fixed-length FFT (128 points) and to find the value of the magnitude spectrum at the desired points using linear interpolation.
For a real-valued sequence x(i) of P points the DFT may be expressed as:
X(k) = Σ_{i=0}^{P-1} x(i)·e^{-j2πik/P},   k = 0,...,P-1
The Pn-point DFT will yield a double-sided spectrum. Thus, in order to represent the excitation signal as a superposition of sinusoidal signals, the magnitude of all the non-DC components must be multiplied by a factor of 2. The total number of single-side magnitude spectrum values, which are used in the reconstruction process, is equal to ⌈(Pn+1)/2⌉.
Process VI of Figure 1 will now be described. The DFT (Process V) applied on the Pn samples of a pitch segment in the residual domain yields ⌈(Pn+1)/2⌉ spectral magnitudes (MGj^n, 0 ≤ j < ⌈(Pn+1)/2⌉) and ⌈(Pn+1)/2⌉ phase values. The phase information is neglected. However, the continuity of the phase between adjacent voiced frames is preserved. Moreover, the contribution of the DC magnitude component is assumed to be negligible and thus MG0^n is set to 0. In this way, the non-DC magnitude spectrum is assumed to contain all the perceptually important information.
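As an informal illustration of Processes V and VI, the sketch below (assumed names, numpy implementation, and an assumed 1/Pn amplitude scaling convention) computes the single-side magnitude values MGj from a pitch segment of Pn residual samples, doubling the non-DC components and zeroing the DC term as described above.

```python
import numpy as np

def pitch_segment_magnitudes(segment):
    """Single-side magnitude values MGj of a Pn-sample pitch segment (sketch)."""
    segment = np.asarray(segment, float)
    pn = len(segment)
    # Pn-point DFT; rfft returns Pn//2 + 1 non-negative-frequency bins, the same
    # count as the ceil((Pn+1)/2) values referred to in the text.
    spectrum = np.fft.rfft(segment) / pn     # the 1/Pn scaling is an assumed convention
    mg = np.abs(spectrum)
    mg[1:] *= 2.0                            # double the non-DC components (single-side spectrum)
    mg[0] = 0.0                              # DC magnitude MG0 assumed negligible and set to 0
    return mg
```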
Based on the assumption of an "approximately" flat shape magnitude spectrum for the pitch residual segment, various methods could be used to represent the entire magnitude spectrum with a single value. Specifically, a modified single value spectral amplitude representation (MSVSAR) technique is described below.
MSVSAR is based on the observation that some of the speech spectrum resonance and anti-resonance information is also present in the residual magnitude spectrum (G.S. Kang and S.S. Everett, "Improvement of the Excitation Source in the Narrow-Band Linear Prediction Vocoder", IEEE Trans. Acoust., Speech and Signal Proc., Vol. ASSP-33, pp. 377-386, 1985). LPC inverse filtering cannot produce a residual signal of absolutely flat magnitude spectrum, mainly due to: a) the "cascade representation" of formants by the LPC filter 1/A(z), which results in the magnitudes of the resonant peaks being dependent upon the pole locations of the 1/A(z) all-pole filter, and b) the LPC quantisation noise. As a consequence, the LPC residual signal is itself highly intelligible. Based on this observation the MGj^n magnitudes are obtained by spectral sampling, at the harmonic locations ωj^n, j=1,...,⌈(Pn+1)/2⌉, of a modified LPC synthesis filter that is defined as follows:
MP(z) = GN / (1 - GR·Σ_{i=1}^{p} āi^n·z^{-i})   (32)

where āi^n, i=1,...,p represent the p quantised LPC coefficients of the nth coding frame, and GR and GN are defined as follows:
GR = GK·Π_{i=1}^{p} (1 - (Ki^n)²)   (33)

and

GN = …   (34)
where Ki^n, i=1,...,p are the reflection coefficients of the nth coding frame, xn(i) represents a sequence of 2Pn speech samples centred in the middle of the nth coding frame from which the mean value is calculated and removed, and MP(ωj^n) and H(ωj^n) represent the frequency responses of the MP(z) and 1/A(z) filters respectively at the ωj^n frequency. Notice that the MP(ωj^n) values are calculated assuming GN=1. The GK parameter represents a constant whose value is set to 0.25.
Equation 32 defines a modified LPC synthesis filter with reduced feedback gain, whose frequency response consists of nearly equalised resonant peaks, the locations of which are very close to the LPC synthesis resonant locations. Furthermore, the value of the feedback gain GR is controlled by the performance of the LPC model (i.e. it is proportional to the normalised LPC prediction error). In addition Equation 34 ensures that the energy of the reproduced speech signal is equal to the energy of the original speech waveform. Robustness is increased by computing the speech RMS value over two pitch periods. Two alternative magnitude spectrum representation techniques are described below, which allow for better coding of the magnitude information and lead to a significant improvement in reconstructed speech quality.
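Before turning to those alternative representations, the MSVSAR computation can be sketched in Python as follows. This is an illustration under stated assumptions: the product form of GR and the energy-matching choice of GN are assumptions consistent with the description above, and all names are hypothetical.

```python
import numpy as np

def msvsar_magnitudes(a_q, refl, harmonics, speech_2pn, gk=0.25):
    """Sketch of MSVSAR: sample a modified LPC synthesis filter at the harmonics.

    a_q        : quantised LPC coefficients a_1..a_p of the frame
    refl       : reflection coefficients k_1..k_p
    harmonics  : harmonic frequencies omega_j in radians, j = 1..J
    speech_2pn : 2*Pn mean-removed speech samples centred on the frame
    gk         : the constant GK (0.25 in the text)
    """
    # Reduced feedback gain, tied to the normalised LPC prediction error (assumption)
    gr = gk * np.prod(1.0 - np.asarray(refl, float) ** 2)

    def all_pole_mag(coeffs, gain, w):
        # |1 / (1 - gain * sum_i a_i e^{-jwi})| evaluated at frequency w
        denom = 1.0 - gain * sum(c * np.exp(-1j * w * (i + 1))
                                 for i, c in enumerate(coeffs))
        return 1.0 / abs(denom)

    mp = np.array([all_pole_mag(a_q, gr, w) for w in harmonics])   # |MP(w)| with GN = 1
    h = np.array([all_pole_mag(a_q, 1.0, w) for w in harmonics])   # |H(w)| = |1/A(w)|

    # GN chosen so that the energy of the sinusoidal reconstruction matches the
    # speech energy over two pitch periods (an assumed realisation of Equation (34))
    speech_power = np.mean(np.asarray(speech_2pn, float) ** 2)
    synth_power = 0.5 * np.sum((mp * h) ** 2)
    gn = np.sqrt(speech_power / synth_power) if synth_power > 0 else 0.0

    return gn * mp      # the MGj magnitudes
```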
The first of the alternative magnitude spectrum representation techniques is referred to below as the "Na-amplitude system". The basic principle of this MGj^n quantisation system is to represent accurately those MGj^n values which correspond to the Na largest speech Short
Term (ST) spectral envelope values. In particular, given the LPC coefficients of the nth coding frame, the ST magnitude spectrum envelope is calculated (i.e. sampled) at the harmonic frequencies ωj^n and the locations lc(j), j=1,...,Na, of the largest Na spectral samples are determined. These locations indicate effectively which of the ⌈(Pn+1)/2⌉-1 MGj^n magnitudes are subjectively more important for accurate quantization. The system subsequently selects MGj^n, j=lc(1),...,lc(Na), and Vector Quantizes these values. If the minimum pitch value is 17 samples, the number of non-DC MGj^n amplitudes is equal to 8 and for this reason Na≤8. Two variations of the "Na-amplitudes system" were developed with equivalent performance and their block diagrams are depicted in Figure 33 (a) and (b) respectively.
i) Na-amplitudes system with Mean Normalization Factor. In this variation, a pitch segment of Pn residual samples Rn(i), centered about the middle Mn of the nth coding frame, is obtained and DFT transformed. The mean value of the spectral magnitudes MGj^n, j=1,...,⌈(Pn+1)/2⌉-1, is calculated as:

m = (1/(⌈(Pn+1)/2⌉-1))·Σ_{j=1}^{⌈(Pn+1)/2⌉-1} MGj^n   (35)

m is quantized and then used as the normalization factor of the Na selected amplitudes MGj^n, j=lc(1),...,lc(Na). The resulting Na amplitudes are then vector quantized to M̄Gj^n.
ii) Na-amplitudes system with RMS Normalization Factor. In this variation the RMS value of the pitch segment centered about the middle Mn of the nth coding frame is calculated as:

g = sqrt( (1/Pn)·Σ_{i=0}^{Pn-1} Rn²(i) )   (36)

g is quantized and then used as the normalization factor of the Na selected amplitudes MGj^n, j=lc(1),...,lc(Na). These normalized amplitudes are then Vector Quantised to M̄Gj^n. Notice that the Pn-point DFT operation can be avoided in this case, since the magnitude spectrum of the pitch segment is calculated only at the Na selected harmonic frequencies ωj^n, j=lc(1),...,lc(Na).
In both cases the quantisation of the m and g factors, used to normalize the MGj^n values, is performed using an adaptive μ-law quantiser with a non-linear characteristic:

c(x) = xmax·( ln(1 + μ|x|/xmax) / ln(1 + μ) )·sgn(x),   with μ = 255   (37)
This arrangement for the quantization of g or m extends the dynamic range of the coder to not less than 25 dB.
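The μ-law companding characteristic of Equation (37) can be sketched in Python as follows; the adaptation of the maximum level xmax is only indicated, not specified by this passage, and the inverse characteristic is included for completeness.

```python
import numpy as np

def mu_law_compress(x, x_max, mu=255.0):
    """Non-linear mu-law characteristic applied before uniform quantisation (sketch)."""
    x = np.clip(x, -x_max, x_max)
    return x_max * np.log1p(mu * np.abs(x) / x_max) / np.log1p(mu) * np.sign(x)

def mu_law_expand(c, x_max, mu=255.0):
    """Inverse characteristic used at the decoder."""
    return x_max / mu * np.expm1(np.abs(c) * np.log1p(mu) / x_max) * np.sign(c)
```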
At the receiver end the decoder recovers the MGj^n magnitudes as M̄Gj^n = M̄G′j^n × A, j=lc(1),...,lc(Na). The remaining ⌈(Pn+1)/2⌉ - Na - 1 MGj^n values are set to a constant value A (where A is either m or g). The block diagram of the adaptive μ-law quantiser is shown in Figure 34.

The second of the alternative magnitude spectrum representation techniques is referred to below as the "Variable Size Spectral Vector Quantisation (VS/SVQ)" system. Coding systems which employ the general synthesis formula of Equation (1) to recover speech encounter the problem of coding a variable length, pitch dependent spectral amplitude vector MG^n. The "Na-amplitudes" MGj^n quantisation schemes described in Figure 33 avoid this problem by Vector Quantising the minimum expected number of spectral amplitudes and by setting the rest of the MGj^n amplitudes to a fixed value. However, such a partially spectrally flat excitation model has limitations in providing high recovered speech quality. Thus, in order to improve the output speech quality, the shape of the entire {MGj^n} magnitude spectrum should be quantised. Various techniques have been proposed for coding {MGj^n}. Originally ADPCM has been used across the MGj^n values associated with a specific coding frame. Also {MGj^n} has been DCT transformed and coded differentially across successive MGj^n magnitude spectra. However, these coding schemes are rather inefficient and operate with relatively high bit rates. The introduction of Vector Quantisation on the {MGj^n} spectral amplitude vectors allowed for the development of Sinusoidal and Prototype Interpolation systems which operate at around 2.4 kbits/sec. Two known {MGj^n} VQ methods are described below which quantise a variable size (vsn) input vector with a fixed size (fxs) codevector.
i) The first VQ method involves the transformation of the input vector to a fixed size vector followed by conventional Vector Quantisation. The inverse transformation on the quantised fixed size vector yields the recovered quantised MG^n vector. Transformation techniques which have been used include Linear Interpolation, Band Limited Interpolation, All Pole modelling and Non-Square transformation. However, the overall distortion produced by this approach is the summation of the VQ noise and a component which is introduced by the transformation process.

ii) The second VQ method achieves the direct quantisation of a variable size input vector with a fixed size codevector. This is based on selecting only vsn elements from each codebook vector, to form a distortion measure between a codebook vector and an input MG^n vector. Such a quantisation approach avoids the transformation distortion of the techniques mentioned in (i) and results in an overall distortion that is equal to the Vector Quantisation noise.
An improved VQ method will now be described which is referred to below as the Variable Size Spectral Vector Quantisation (VS/SVQ) scheme. This scheme was developed to take advantage of the underlying principle that the actual shape of the {MGj^n} magnitude spectrum is defined by a minimum number, ⌈(Pn+1)/2⌉, of equally spaced samples. If we consider the maximum expected pitch estimate Pmax, then any {MGj^n} spectral shape can be represented adequately by ⌈(Pmax+1)/2⌉ samples. This suggests that the fixed size fxs of the codebook vectors S^i representing the MG^n shapes should not be larger than ⌈(Pmax+1)/2⌉. Of course this also implies that, given the ⌈(Pmax+1)/2⌉ samples of a codebook vector, the complete spectral shape, defined at any frequency, is obtained via an interpolation process.
Figure 35 highlights the VS/SVQ process. The codebook CBS, having cbs fixed fxs-dimension vectors Sj^i, j=1,...,fxs and i=1,...,cbs, where fxs is ⌈(Pmax+1)/2⌉, is used to quantise an input vector MGj^n, j=1,...,vsn, of dimension vsn. Interpolation (in this case linear) is used on the S^i vectors to yield S̃^i vectors of dimension vsn. The S^i to S̃^i interpolation process is given by:

S̃j^i = (linear interpolation of the fxs samples of S^i, evaluated at the jth of the vsn harmonic frequencies)   (38)

for i=1,...,cbs and j=1,...,vsn. This process effectively defines S̃^i spectral shapes at the ωj^n frequencies of the MG^n vector. A distortion measure D(S̃^i, MG^n) is then defined between the S̃^i and MG^n vectors, and the codebook vector S^I that yields the minimum distortion is selected and its index I is transmitted. Of course, in the receiver, Equation (38) is used to define M̄G^n from the selected codevector S^I.
If we assume that Pmax≈120 then fxs=60. However this value can be reduced to 50 without significant degradation by low pass filtering the signal synthesised from Equation (1). This is achieved by setting to zero all the harmonics MGj^n in the region of 3.4 to 4.0 kHz, in which case:

vsn = (3400×Pn)/fs   if vsn < 50
vsn = 50             otherwise   (39)

and vsn ≤ fxs.
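The core VS/SVQ operation, quantising a variable-size amplitude vector with fixed-size codevectors via linear interpolation, can be sketched in Python as follows. The interpolation grid and the weighted distortion are written out explicitly and are assumptions; the exact form of Equation (38) is not reproduced here.

```python
import numpy as np

def vs_svq_search(mg, weights, codebook):
    """Quantise a variable-size amplitude vector with fixed-size codevectors (sketch).

    mg       : input amplitude vector MG_1..MG_vsn (vsn varies with the pitch)
    weights  : perceptual weights W_1..W_vsn
    codebook : shape codebook of size (cbs, fxs) with fxs >= vsn
    """
    mg = np.asarray(mg, float)
    weights = np.asarray(weights, float)
    codebook = np.asarray(codebook, float)
    vsn, fxs = len(mg), codebook.shape[1]

    # Common normalised grid: the fxs codevector points and the vsn harmonic positions
    grid_in = np.linspace(0.0, 1.0, fxs)
    grid_out = np.linspace(0.0, 1.0, vsn)

    best_index, best_dist, best_shape = -1, np.inf, None
    for i, s in enumerate(codebook):
        s_tilde = np.interp(grid_out, grid_in, s)             # codevector interpolated to vsn points
        dist = float(np.sum(weights * (mg - s_tilde) ** 2))   # weighted MSE distortion
        if dist < best_dist:
            best_index, best_dist, best_shape = i, dist, s_tilde
    return best_index, best_shape   # the index is transmitted; the decoder repeats the interpolation
```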
Amplitude vectors obtained from adjacent residual frames exhibit significant redundancy, which can be removed by means of backward prediction. Prediction is performed on a harmonic basis, i.e. the amplitude value of each harmonic MGj^n is predicted from the amplitude value of the same harmonic in previous frames, i.e. M̄Gj^{n-1}. A fixed linear predictor M̃Gj^n = b·M̄Gj^{n-1} may be incorporated in the VS/SVQ system, and the resulting DPCM structure is shown in Figure 36 (differential VS/SVQ, DVS/SVQ). In particular, error vectors are formed as the difference between the original spectral amplitudes MGj^n and their predicted ones M̃Gj^n, i.e. Ej^n = MGj^n - M̃Gj^n for j=1,...,vsn,
where the predicted spectral amplitudes M̃Gj^n are given as:

M̃Gj^n = b × M̄Gj^{n-1}   for 1 ≤ j ≤ vs_{n-1}   (40)

and

M̃Gj^n = (1/vs_{n-1}) × Σ_{k=1}^{vs_{n-1}} M̄Gk^{n-1}   for vs_{n-1} < j ≤ vsn   (41)
Furthermore the quantised spectral amplitudes M̄Gj^n are given as:

M̄Gj^n = M̃Gj^n + Ēj^n   (42)

where Ēj^n denotes the quantised error vector.
The quantisation of the Ej^n, 1 ≤ j ≤ vsn, error vector incorporates Mean Removal and Gain Shape Quantisation techniques, using the hierarchical VQ structure of Figure 36.
A weighted Mean Square Error is used in the VS/SVQ stage of the system. The weighting function is defined as the frequency response of the filter W(z) = 1/An(z/γ), where An(z) is the short-term linear prediction filter and γ is a constant, defined as γ=0.93. Such a weighting function, which is proportional to the short-term envelope spectrum, results in substantially improved decoded speech quality. The weighting function Wj^n is normalised so that:
Σ_{j=1}^{vsn} Wj^n = 1   (43)
The pdf of the mean value of E^n is very broad and, as a result, the mean value differs widely from one vector to another. This mean value can be regarded as statistically independent of the variation of the shape of the error vector E^n and thus can be quantised separately without paying a substantial penalty in compression efficiency. The mean value of an error vector is calculated as follows:
M = (1/vsn)·Σ_{j=1}^{vsn} Ej^n   (44)
M is Optimum Scalar Quantised to M̄ and is then removed from the original error vector to form Erm^n = (E^n - M̄). The overall quantization distortion is attributed to the quantization of the "Mean Removed" error vectors (Erm^n), which is performed by a Gain-Shape Vector Quantiser.
The objective of the Gain-Shape VQ process is to determine the gain value G and the shape vector S so as to minimise the distortion measure:
D(Erm^n, G×S̃) = Σ_{j=1}^{vsn} Wj^n·(Ermj^n - G×S̃j)²   (45)
A gain optimised VQ search method, similar to techniques used in CELP systems, is employed to find the optimum G and S. The shape codebook (CBS) of vectors S is searched first to yield an index I, which maximises the quantity:
Q(i) = ( Σ_{j=1}^{vsn} Wj^n·Ermj^n·S̃j^i )² / ( Σ_{j=1}^{vsn} Wj^n·(S̃j^i)² )   for i=1,...,cbs   (46)
where cbs is the number of codevectors in the CBS. The optimum gain value is defined as:
G = ( Σ_{j=1}^{vsn} Wj^n·Ermj^n·S̃j^I ) / ( Σ_{j=1}^{vsn} Wj^n·(S̃j^I)² )   (47)
and is Optimum Scalar Quantised to Ḡ.
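Under the gain-optimised search of Equations (46) and (47) above, the shape selection and the optimum gain can be sketched as follows (illustrative Python only; names are assumptions).

```python
import numpy as np

def gain_shape_search(erm, weights, shapes):
    """Gain-optimised search of the shape codebook (sketch of Equations 46-47).

    erm     : mean-removed error vector Erm_1..Erm_vsn
    weights : normalised weights W_1..W_vsn
    shapes  : candidate shape vectors already interpolated to vsn points,
              array of shape (cbs, vsn)
    """
    erm = np.asarray(erm, float)
    weights = np.asarray(weights, float)
    shapes = np.asarray(shapes, float)

    num = shapes @ (weights * erm)        # sum_j W_j Erm_j S_j   for each codevector
    den = (shapes ** 2) @ weights         # sum_j W_j S_j^2       for each codevector
    q = num ** 2 / den                    # Equation (46)
    index = int(np.argmax(q))             # index I transmitted to the decoder
    gain = num[index] / den[index]        # optimum gain, Equation (47)
    return index, gain
```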
During shape quantisation the principles of VS/SVQ are employed, in the sense that the S̃^i, vsn-size vectors are produced using Linear Interpolation on fxs-size codevectors S^i. Both trained and randomly generated shape CBS codebooks were investigated. Although Erm^n has noise-like characteristics, systems using randomly generated shape codebooks resulted in unsatisfactory muffled decoded speech and were inferior to systems employing trained shape codebooks. A closed-loop joint predictor and VQ design process was employed to design the CBS codebook, the optimum scalar quantisers CBM and CBG of the mean M and gain G values respectively, and also to define the prediction coefficient b of Figure 36. In particular, the following steps take place in the design process.
STEP A0 (k=0). Given a training sequence of MGj^n, the predictor b^0 is calculated in an open loop fashion (i.e. M̃Gj^n = b×M̄Gj^{n-1} for 1 ≤ j ≤ ⌈(Pn+1)/2⌉ when Vn-1=1, or M̃Gj^n = 0 elsewhere). Furthermore, the CBM^0 mean, CBG^0 gain and CBS^0 shape codebooks are designed independently and again in an open loop fashion using unquantized E^n. In particular:
a) Given a training sequence of error vectors E^(n,0), the mean value of each E^(n,0) is calculated and used in the training process of an Optimum Scalar Quantiser (CBM^0).
b) Given a training sequence of error vectors E^(n,0) and the CBM^0 mean quantiser, the mean value of each error vector is calculated, quantised using the CBM^0 quantiser and removed from the original error vectors E^(n,0) to yield a sequence of "Mean Removed" training vectors Erm^(n,0).
c) Given a training sequence of Erm^(n,0) vectors, each "Mean Removed" training vector is normalised to unit power (i.e. is divided by the factor G = sqrt(Σ_j Wj^n·(Ermj^n)²)), linearly interpolated to fxs points, and then used in the training process of a conventional Vector Quantiser of fxs dimension (CBS^0).
d) Given a training sequence of Erm^(n,0) vectors and the CBS^0 shape codebook, each "Mean Removed" training vector is encoded using Equations 46 and 47 and the value G of Equation 47 is used in the training process of an Optimum Scalar Quantiser (CBG^0). k is set to 1 (k=1).
STEP A1 Given a training sequence of MGj^n and the mean, gain and shape codebooks of the previous k-1 iteration (i.e. CBM^(k-1), CBG^(k-1), CBS^(k-1)), the optimum prediction coefficient b^k is calculated.
STEP A2 Given a training sequence of MGj^n, an optimum prediction coefficient b^k and CBM^(k-1), CBG^(k-1), CBS^(k-1), a training sequence of error vectors E^(n,k) is formed, which is then used for the design of new mean, gain and shape codebooks (i.e. CBM^k, CBG^k, CBS^k).
STEP A3 The performance of the kth iteration quantization system (i.e. b^k, CBM^k, CBG^k, CBS^k) is evaluated and compared against the quantization system of the previous iteration (i.e. b^(k-1), CBM^(k-1), CBG^(k-1), CBS^(k-1)). If the quantization distortion converges to a minimum, the quantization design process stops. Otherwise, k=k+1 and Steps A1, A2 and A3 are repeated.
The performance of each quantizer (i.e. b^k, CBM^k, CBG^k, CBS^k) has been evaluated using subjective tests and a LogSegSNR distortion measure, which was found to reflect the subjective performance of the system.
The design for the Mean-Shape-Gain Quantiser used in STEP A2 is performed using the following two steps :
STEP B1 Given a training sequence of error vectors E^(n,k), the mean value of each E^(n,k) is calculated and used in the training process of an Optimum Scalar Quantiser (CBM^k).
STEP B2 Given a training sequence of error vectors E^(n,k) and the CBM^k mean quantizer, the mean value of each residual vector is calculated, quantized and removed from the original residual vectors E^(n,k) to yield a sequence of "Mean Removed" training vectors Erm^(n,k), which are then used as the training data in the design of an optimum Gain Shape Quantizer (CBG^k and CBS^k). This involves steps C1 - C4 below. (The quantization design process is performed under the assumption of an independent gain shape quantiser structure, i.e. an input error vector Erm^(n,k) can be represented by any possible combination of S^i codebook shape vectors and G gain quantizer levels.)
STEP C1 (v=0). Given a training sequence of vectors Erm^(n,k) and initial CBG^(k,0) and CBS^(k,0) gain and shape codebooks respectively, compute the overall average distortion distance D^(k,0) as in Equation 44. Set v equal to 1 (v=1).
STEP C2 Given a training sequence of vectors Erm^(n,k) and the CBG^(k,v-1) gain codebook from the previous iteration, compute the new shape codebook CBS^(k,v) which minimises the VQ distortion measure. Notice that the optimum CBS^(k,v) shape codebook is obtained when the distortion measure of Equation (44) is a minimum and this is achieved in M1^(k,v) iterations.
STEP C3 Given a training sequence of vectors Erm^(n,k) and the CBS^(k,v) shape codebook, compute a new gain quantiser CBG^(k,v) which minimises the distortion measure of Equation (44). This optimum CBG^(k,v) gain quantiser is obtained when the distortion measure of Equation (44) is a minimum and this is achieved in M2^(k,v) iterations.
STEP C4 Given a training sequence of vectors Erm^(n,k) and the shape and gain codebooks CBS^(k,v) and CBG^(k,v), compute the average overall distortion measure. If (D^(k,v-1) - D^(k,v))/D^(k,v) < ε stop. Otherwise, v=v+1 and go back to STEP C2.
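The alternation of STEPs C1 to C4 is, in essence, a Lloyd-type iteration between shape and gain updates. The sketch below shows that outer structure only, with simple weighted least-squares centroid updates standing in for the full update formulae given in the following paragraphs; it is an assumption-laden illustration, not the patent's procedure, and works on fixed-length training vectors for simplicity.

```python
import numpy as np

def design_gain_shape(train, weights, cbs_init, cbg_init, eps=1e-4, max_iter=50):
    """Alternating refinement of shape and gain codebooks (STEP C1-C4 style sketch)."""
    w = np.asarray(weights, float)
    shapes = np.array(cbs_init, dtype=float)     # (cbs, dim) shape codebook
    gains = np.array(cbg_init, dtype=float)      # (cbg,) gain levels
    vectors = [np.asarray(x, float) for x in train]
    prev_d = np.inf

    for _ in range(max_iter):
        # C1/C4: encode every training vector and accumulate the overall distortion
        enc, total_d = [], 0.0
        for x in vectors:
            num = shapes @ (w * x)
            den = (shapes ** 2) @ w
            i = int(np.argmax(num ** 2 / den))                    # best shape (cf. Eq. 46)
            l = int(np.argmin(np.abs(gains - num[i] / den[i])))   # nearest gain level
            enc.append((i, l))
            total_d += float(np.sum(w * (x - gains[l] * shapes[i]) ** 2))

        # C2-style shape update: weighted least-squares centroid of each cluster
        for i in range(len(shapes)):
            cluster = [(x, gains[l]) for x, (j, l) in zip(vectors, enc) if j == i]
            if cluster:
                num = sum(g * w * x for x, g in cluster)
                den = sum(g * g * w for _, g in cluster)
                shapes[i] = num / den

        # C3-style gain update: scalar centroid of each gain cluster
        for li in range(len(gains)):
            cluster = [(x, shapes[j]) for x, (j, l) in zip(vectors, enc) if l == li]
            if cluster:
                num = sum(float(np.sum(w * x * s)) for x, s in cluster)
                den = sum(float(np.sum(w * s ** 2)) for _, s in cluster)
                gains[li] = num / den

        # C4-style convergence test on the relative distortion decrease
        if np.isfinite(prev_d) and (prev_d - total_d) / max(total_d, 1e-12) < eps:
            break
        prev_d = total_d
    return shapes, gains
```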
The centroids Su^(k,v,m,i), i=1,...,cbs and u=1,...,fxs, of the shape codebook CBS^(k,v,m) are updated during the mth iteration performed in STEP C2 (m=1,...,M1^(k,v)) as follows:

Su^(k,v,m,i) = …   (48)

where Equation (48) is the weighted centroid that minimises the distortion of Equation (45) over the cluster Qi, expressed in terms of the weights Wj^n, the quantised gains G^(Jn) and the linear interpolation factors of Equation (38). Qi denotes the cluster of Erm^(n,k) error vectors which are quantised to the Si^(k,v,m-1) codebook shape vector, cbs represents the total number of shape quantisation levels, Jn represents the CBG^(k,v-1) gain codebook index which encodes the Erm^(n,k) error vector, and 1 ≤ j ≤ vsn.
The gain centroids Gl^(k,v,m), l=1,...,cbg, of the CBG^(k,v,m) gain quantiser, which are computed during the mth iteration in STEP C3 (m=1,...,M2^(k,v)), are given as:

Gl^(k,v,m) = ( Σ_{n∈Dl} Σ_{j=1}^{vsn} Wj^n·Ermj^(n,k)·S̃j^(In) ) / ( Σ_{n∈Dl} Σ_{j=1}^{vsn} Wj^n·(S̃j^(In))² )   (49)

where Dl denotes the cluster of Erm^(n,k) error vectors which are quantised to the Gl^(k,v,m-1) gain quantiser level, cbg represents the total number of gain quantisation levels, In represents the CBS^(k,v) shape codebook index which encodes the Erm^(n,k) error vector, and 1 ≤ j ≤ vsn.
The above design process is applied to obtain the optimum shape codebook CBS, the optimum gain and mean quantizers CBG and CBM, and the optimum prediction coefficient b, which was finally set to b=0.35.
Process VII calculates the energy of the residual signal. The LPC analysis performed in Process II provides the prediction coefficients ai, 1 ≤ i ≤ p, and the reflection coefficients ki, 1 ≤ i ≤ p. On the other hand, the Voiced/Unvoiced classification performed in Process I provides the short term autocorrelation coefficient for zero delay of the speech signal (R0) for the frame under consideration. Hence, the energy of the residual signal En is given as:
En = R0·Π_{i=1}^{p} (1 - ki²)   (50)
The above expression represents the minimum prediction error as it is obtained from the Linear Prediction process. However, because of quantization distortion the parameters of the LPC filter used in the coding-decoding process are slightly different from the ones that achieve minimum prediction error. Thus, Equation (50) gives a good approximation of the residual signal energy with low computational requirements. The accurate E„ value can be given as:
En = (1/M)·Σ_{i=0}^{M-1} Rn²(i)   (51)
The resulting √En is then Scalar Quantised using an adaptive μ-law quantiser arrangement similar to the one depicted in Figure 34. In the case where more than one √En value is used in the system, i.e. the energy En is calculated for a number of subframes, then En,ξ is given by the general equation:

En,ξ = (1/Ms)·Σ_{i=0}^{Ms-1} Rn²(ξ·Ms + i),   0 ≤ ξ ≤ Ξ-1   (52)
Notice that when Ξ=1, Ms=M and for Ξ=4, Ms=M/4.
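An illustrative Python fragment for Process VII follows; the residual-energy approximation uses Equation (50) above, the subframe variant mirrors Equation (52), and the exact normalisation and all names are assumptions.

```python
import numpy as np

def residual_energy_approx(r0, refl):
    """Approximate residual energy from R0 and the reflection coefficients (cf. Eq. 50)."""
    return r0 * np.prod(1.0 - np.asarray(refl, float) ** 2)

def subframe_energies(residual, n_sub):
    """Mean-square residual energy per subframe (cf. Eq. 52)."""
    r = np.asarray(residual, float)
    ms = len(r) // n_sub                      # Ms = M / Xi
    return np.array([np.mean(r[k * ms:(k + 1) * ms] ** 2) for k in range(n_sub)])

# The square roots of these energies would then be scalar quantised with the
# adaptive mu-law arrangement of Figure 34 (see the compressor sketched earlier).
```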

Claims

1. A speech synthesis system in which a speech signal is divided into a series
of frames, and each frame is converted into a coded signal including a
voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered
speech segment centred about a reference sample is defined in each frame, a
correlation value is calculated for each of a series of candidate pitch estimates as
the maximum of multiple crosscorrelation values obtained from variable length
speech segments centred about the reference sample, the correlation values are
used to form a correlation function defining peaks, and the locations of the
peaks are determined and used to define a pitch estimate.
2. A system according to claim 1, wherein the pitch estimate is defined using
an iterative process.
3. A system according to claim 1 or 2, wherein a single reference sample may
be used, centred with respect to the respective frame.
4. A system according to claim 1 or 2, wherein multiple pitch estimates are
derived for each frame using different reference samples, the multiple pitch
estimates being combined to define a combined pitch estimate for the frame.
5. A system according to any preceding claim, wherein the pitch estimate is
modified by reference to a voiced/unvoiced status and/or pitch estimates of
adjacent frames to define a final pitch estimate.
6. A system according to any preceding claim, wherein the correlation
function is clipped using a threshold value, remaining peaks being rejected if
they are adjacent to larger peaks.
7. A system according to claim 6, wherein peaks are selected which are
larger than either adjacent peak and peaks are rejected if they are smaller than a
following peak by more than a predetermined factor.
8. A system according to any preceding claim, wherein the pitch estimation
procedure is based on a least squares error algorithm.
9. A system according to claim 8, wherein the pitch estimation algorithm
defines the pitch value as a number whose multiples best fit the correlation
function peak locations.
10. A system according to any preceding claim, wherein possible pitch values
are limited to integral numbers which are not consecutive, the increment between two successive numbers being proportional to a constant multiplied by
the lower of those two numbers.
11. A speech synthesis system in which a speech signal is divided into a series
of frames, and each frame is converted into a coded signal including pitch
segment magnitude spectral information, a voiced/unvoiced classification, and a
mixed voiced classification which classifies harmonics in the magnitude spectrum
of voiced frames as strongly voiced or weakly voiced, wherein a series of samples
centred on the middle of the frame are windowed to form a data array which is
Fourier transformed to produce a magnitude spectrum, a threshold value is
calculated and used to clip the magnitude spectrum, the clipped data is searched
to define peaks, the locations of peaks are determined, constraints are applied to
define dominant peaks, and harmonics not associated with a dominant peak are
classified as weakly voiced.
12. A system according to claim 11, wherein peaks are located using a second
order polynomial.
13. A system according to claim 11 or 12, wherein the samples are Hamming
windowed.
14. A system according to claim 11, 12 or 13, wherein the threshold value is
calculated by identifying the maximum and minimum magnitude spectrum
values and defining the threshold as a constant multiplied by the difference
between the maximum and minimum values.
15. A system according to any one of claims 11 to 14, wherein peaks are
defined as those values which are greater than the two adjacent values, a peak
being rejected from consideration if neighbouring peaks are of a similar
magnitude or if there are spectral magnitudes in the same range of greater
magnitude.
16. A system according to any one of claims 11 to 15, wherein a harmonic is
considered as not being associated with a dominant peak if the difference
between two adjacent peaks is greater than a predetermined threshold value.
17. A system according to any one of claims 11 to 16, wherein the spectrum is
divided into bands of fixed width and a strongly/weakly voiced classification is
assigned for each band.
18. A system according to any one of claims 11 to 17, wherein the frequency
range is divided into two or more bands of variable width, adjacent bands being separated at a frequency selected by reference to the strongly/weakly voiced
classification of harmonics.
19. A system according to claim 17 or 18, wherein the lowest frequency band
is regarded as strongly voiced, whereas the highest frequency band is regarded
as weakly voiced.
20. A system according to claim 19, wherein, in the event that a current frame is
voiced, and the following frame is unvoiced, further bands within the current
frame will be automatically classified as weakly voiced.
21. A system according to claim 19 or 20, wherein the strongly/weakly voiced
classification is determined using a majority decision rule on the strongly/weakly
voiced classification of those harmonics which fall within the band in question.
22. A system according to claim 21, wherein, if there is no majority, alternate
bands are alternately assigned strongly voiced and weakly voiced classifications.
23. A speech synthesis system in which a speech signal is divided into a series
of frames, each frame is defined as voiced or unvoiced, each frame is converted
into a coded signal including a pitch period value, a frame voiced/unvoiced
classification and, for each voiced frame, a mixed voiced spectral band classification which classifies harmonics within spectral bands as either strongly
or weakly voiced, and the speech signal is reconstructed by generating an
excitation signal in respect of each frame and applying the excitation signal to a
filter, wherein for each weakly voiced spectral band, an excitation signal is
generated which includes a random component in the form of a function which is
dependent upon the respective pitch period value.
24. A system according to claim 23, wherein the spectrum is divided into
bands and a strongly/weakly voiced classification is assigned to each band.
25. A system according to claim 23 or 24, wherein the random component is
introduced by reducing the amplitude of harmonic oscillators assigned the
weakly voiced classification, disturbing the oscillator frequencies such that the
frequency is no longer a multiple of the fundamental frequency, and then adding
further random signals.
26. A system according to claim 25, wherein the phase of the oscillators is
randomised.
27. A speech synthesis system in which a speech signal is divided into a series
of frames, and each voiced frame is converted into a coded signal including a
pitch period value, LPC coefficients and pitch segment spectral magnitude information, wherein the spectral magnitude information is quantized by
sampling the LPC short term magnitude spectrum at harmonic frequencies, the
locations of the largest spectral samples are determined to identify which of the
magnitudes are relatively more important for accurate quantization, and the
magnitudes so identified are selected and vector quantized.
28. A system according to claim 27, wherein a pitch segment of Pn LPC
residual samples is obtained, where Pn is the pitch period value of the nth frame,
the pitch segment is DFT transformed, the mean value of the resultant spectral
magnitudes is calculated, the mean value is quantized and used as a
normalisation factor for the selected magnitudes, and the resulting normalised
amplitudes are quantized.
29. A system according to claim 27, wherein the RMS value of the pitch
segment is calculated, the RMS value is quantized and used as a normalisation
factor for the selected magnitudes, and the resulting normalised amplitudes are
quantized.
30. A system according to any one of claims 27 to 29, wherein, at the receiver,
the selected magnitudes are recovered, and each of the other magnitude values is
reproduced as a constant value.
31. A speech synthesis system in which a variable size input vector of
coefficients to be transmitted to a receiver for the reconstruction of a speech
signal is vector quantized using a codebook defined by vectors of fixed size, the
codebook vectors of fixed size are obtained from variable sized training vectors
and an interpolation technique which is an integral part of the codebook
generation process, codebook vectors are compared to the variable sized input
vector using the interpolation process, and an index associated with the codebook
entry with the smallest difference from the comparison is transmitted, the index
being used to address a further codebook at the receiver and thereby derive an
associated fixed size codebook vector, and the interpolation process being used to
recover from the derived fixed sized codebook vector an approximation of the
variable sized input vector.
32. A system according to claim 31 , wherein the interpolation process is
linear, and for an input vector of given dimension, the interpolation process is
applied to produce from the codebook vectors a set of vectors of that given
dimension, a distortion measure is then derived to compare the interpolated set
of vectors and the input vector, and the codebook vector is selected which yields
the minimum distortion.
33. A system according to claim 32, wherein the dimension of the vectors is
reduced by taking into account only the harmonic amplitudes within an input
bandwidth range.
34. A system according to claim 33, wherein the remaining amplitudes are set
to a constant value.
35. A system according to claim 34, wherein the constant value is equal to the
mean value of the quantized amplitudes.
36. A system according to any one of claims 31 to 35, wherein redundancy
between amplitude vectors obtained from adjacent residual frames is removed
by means of backward prediction.
37. A system according to claim 36, wherein the backward prediction is
performed on a harmonic basis such that the amplitude value of each harmonic
of one frame is predicted from the amplitude value of the same harmonic in the
previous frame or frames.
38. A speech synthesis system in which a speech signal is divided into a series of frames, each frame is converted into a coded signal including an estimated
pitch period, an estimate of the energy of a speech segment the duration of which is a function of the estimated pitch period, and LPC filter coefficients defining an
LPC spectral envelope, and a speech signal of related power to the power of the
input speech signal is reconstructed by generating an excitation signal using
spectral amplitudes which are defined from a modified LPC spectral envelope
sampled at harmonic frequencies defined by the pitch period.
39. A system according to claim 38, wherein the magnitude values are
obtained by spectrally sampling a modified LPC synthesis filter characteristic at
the harmonic locations related to the pitch period.
40. A system according to claim 39, wherein the modified LPC synthesis filter
has reduced feedback gain and a frequency response which consists of equalised
resonant peaks, the locations of which are close to the LPC synthesis resonant
locations.
41. A system according to claim 40, wherein the value of the feedback gain is
controlled by the performance of the LPC model such that it is related to the
normalised LPC prediction error.
42. A system according to any one of claims 38 to 41, wherein the energy of
the reproduced speech signal is equal to the energy of the original speech
waveform.
43. A speech synthesis system in which a speech signal is divided into a series
of frames, each frame is converted into a coded signal including LPC filter
coefficients and at least one parameter associated with a pitch segment
magnitude, and the speech signal is reconstructed by generating two excitation
signals in respect of each frame, each pair of excitation signals comprising a first
excitation signal generated on the basis of the pitch segment magnitude
parameter or parameters of one frame and a second excitation signal generated
on the basis of the pitch segment magnitude parameter or parameters of a
second frame which follows and is adjacent to the said one frame, applying the
first excitation signal to a first LPC filter the characteristics of which are
determined by the LPC filter coefficients of the said one frame and applying the
second excitation signal to a second LPC filter the characteristics of which are
determined by the LPC filter coefficients of the said second frame, and weighting
and combining the outputs of the first and second LPC filters to produce one
frame of a synthesised speech signal.
44. A system according to claim 43, wherein the first and second excitation
signals include the same phase function and different phase contributions from
the two LPC filters.
45. A system according to claim 44, wherein the outputs of the first and
second LPC filters are weighted by half a window function such that the
magnitude of the output of the first filter is decreasing with time and the
magnitude of the output of the second filter is increasing with time.
46. A speech coding system which operates on a frame by frame basis, and in
which information is transmitted which represents each frame as either voiced or
unvoiced and, for each voiced frame, represents that frame by a pitch period
value, quantized magnitude spectral information, and LPC filter coefficients, the
received pitch period value and magnitude spectral information being used to
generate residual signals at the receiver which are applied to LPC speech
synthesis filters the characteristics of which are determined by the transmitted
filter coefficients, wherein each residual signal is synthesised according to a
sinusoidal mixed excitation synthesis process, and a recovered speech signal is
derived from the residual signals.
47. A speech synthesis system substantially as hereinbefore described with reference to the accompanying drawings.
PCT/GB1997/001831 1996-07-05 1997-07-07 Speech synthesis system WO1998001848A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
AU34523/97A AU3452397A (en) 1996-07-05 1997-07-07 Speech synthesis system
EP97930643A EP0950238B1 (en) 1996-07-05 1997-07-07 Speech coding and decoding system
AT97930643T ATE249672T1 (en) 1996-07-05 1997-07-07 VOICE CODING AND DECODING SYSTEM
DE69724819T DE69724819D1 (en) 1996-07-05 1997-07-07 VOICE CODING AND DECODING SYSTEM
JP10504943A JP2000514207A (en) 1996-07-05 1997-07-07 Speech synthesis system

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB9614209.6 1996-07-05
GBGB9614209.6A GB9614209D0 (en) 1996-07-05 1996-07-05 Speech synthesis system
US2181596P 1996-07-16 1996-07-16
US021,815 1996-07-16

Publications (1)

Publication Number Publication Date
WO1998001848A1 true WO1998001848A1 (en) 1998-01-15

Family

ID=26309651

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1997/001831 WO1998001848A1 (en) 1996-07-05 1997-07-07 Speech synthesis system

Country Status (7)

Country Link
EP (1) EP0950238B1 (en)
JP (1) JP2000514207A (en)
AT (1) ATE249672T1 (en)
AU (1) AU3452397A (en)
CA (1) CA2259374A1 (en)
DE (1) DE69724819D1 (en)
WO (1) WO1998001848A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2784218B1 (en) * 1998-10-06 2000-12-08 Thomson Csf LOW-SPEED SPEECH CODING METHOD
DE102004007191B3 (en) 2004-02-13 2005-09-01 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio coding
DE102004007200B3 (en) 2004-02-13 2005-08-11 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Device for audio encoding has device for using filter to obtain scaled, filtered audio value, device for quantizing it to obtain block of quantized, scaled, filtered audio values and device for including information in coded signal
DE102004007184B3 (en) * 2004-02-13 2005-09-22 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Method and apparatus for quantizing an information signal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0490740A1 (en) * 1990-12-11 1992-06-17 Thomson-Csf Method and apparatus for pitch period determination of the speech signal in very low bitrate vocoders
EP0703565A2 (en) * 1994-09-21 1996-03-27 International Business Machines Corporation Speech synthesis method and system
WO1996027870A1 (en) * 1995-03-07 1996-09-12 British Telecommunications Public Limited Company Speech synthesis

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2357683A (en) * 1999-12-24 2001-06-27 Nokia Mobile Phones Ltd Voiced/unvoiced determination for speech coding
US6915257B2 (en) 1999-12-24 2005-07-05 Nokia Mobile Phones Limited Method and apparatus for speech coding with voiced/unvoiced determination
GB2398981A (en) * 2003-02-27 2004-09-01 Motorola Inc Speech communication unit and method for synthesising speech therein
GB2398981B (en) * 2003-02-27 2005-09-14 Motorola Inc Speech communication unit and method for synthesising speech therein
CN114519996A (en) * 2022-04-20 2022-05-20 北京远鉴信息技术有限公司 Method, device and equipment for determining voice synthesis type and storage medium

Also Published As

Publication number Publication date
AU3452397A (en) 1998-02-02
JP2000514207A (en) 2000-10-24
EP0950238B1 (en) 2003-09-10
EP0950238A1 (en) 1999-10-20
ATE249672T1 (en) 2003-09-15
DE69724819D1 (en) 2003-10-16
CA2259374A1 (en) 1998-01-15


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH HU IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW AM AZ BY KG KZ MD RU TJ TM

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH KE LS MW SD SZ UG ZW AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
ENP Entry into the national phase

Ref document number: 2259374

Country of ref document: CA

Ref country code: CA

Ref document number: 2259374

Kind code of ref document: A

Format of ref document f/p: F

WWE Wipo information: entry into national phase

Ref document number: 1997930643

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 09214308

Country of ref document: US

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWP Wipo information: published in national office

Ref document number: 1997930643

Country of ref document: EP

WWG Wipo information: grant in national office

Ref document number: 1997930643

Country of ref document: EP