WO1998001848A1 - Speech synthesis system - Google Patents
Speech synthesis system Download PDFInfo
- Publication number
- WO1998001848A1 WO1998001848A1 PCT/GB1997/001831 GB9701831W WO9801848A1 WO 1998001848 A1 WO1998001848 A1 WO 1998001848A1 GB 9701831 W GB9701831 W GB 9701831W WO 9801848 A1 WO9801848 A1 WO 9801848A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frame
- voiced
- pitch
- speech
- lpc
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
- G10L25/90—Pitch determination of speech signals
- G10L19/10—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a multipulse excitation
- G10L2025/937—Signal energy in various frequency bands
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Definitions
- the present invention relates to speech synthesis systems, and in
- a speech communication system is to be capable of
- Unvoiced speech is produced by turbulent air flow at a constriction and does not
- parameters used to represent a frame are the pitch period, the magnitude and
- phase function is also defined using linear frequency
- randomness in the signal is introduced by adding jitter to the amplitude
- CELP code-excited linear prediction
- the system employs 20msecs coding frames which are classified
- a pitch period in a given frame is
- coefficients are coded using a differential vector quantization scheme.
- LPC synthesis filter the output of which provides the synthesised voiced speech
- An amount of randomness can be introduced into voiced speech by
- Periodic voice excitation signals are mainly represented by the "slowly
- Phase information is
- one known system operates on 5msec frames, a pitch period being selected for voiced frames and DFT transformed to yield
- Unvoiced speech is CELP coded.
- each frame is converted into a coded signal including a
- peaks are determined and used to define a pitch estimate.
- the system avoids undue complexity and may be readily implemented.
- the pitch estimate is defined using an iterative process.
- single reference sample may be used, for example centred with respect to the
- the correlation function may be clipped using a threshold value
- a predetermined factor for example smaller than 0.9 times the
- the pitch estimation procedure is based on a least squares
- the algorithm defines the pitch as a number whose
- values may be limited to integral numbers which are not consecutive, the
- each frame is converted into a coded signal including pitch segment magnitude spectral information, a voiced/unvoiced classification, and a mixed
- a threshold value is
- Peaks may be located using a second order polynomial.
- the samples may be
- the threshold value may be calculated by identifying
- Peaks may be defined as those values which are greater than
- a peak may be rejected from consideration if
- neighbouring peaks are of a similar magnitude, e.g. more than 80% of the
- a harmonic may be considered as not being associated with a
- the spectrum may be divided into bands of fixed width and a
- the frequency range may be divided into two or more bands of variable width,
- the spectrum may be divided into fixed bands, for example fixed
- frequency band e.g. 0-500Hz
- the highest frequency band for example 3500Hz to 4000Hz, may always
- 3000Hz to 3500Hz may be automatically classified as weakly voiced.
- the strongly/weakly voiced classification may be determined using a majority
- bands may be alternately assigned strongly voiced and weakly voiced classifications.
- excitation signal to recover the speech signal which takes into account this classification. It is an object of the present invention to provide such a system.
- each frame is defined as voiced or unvoiced, each frame is converted into
- a coded signal including a pitch period value, a frame voiced/unvoiced
- the speech signal is reconstructed by generating an
- the excitation signal is represented by a function which includes a first
- harmonic frequency component the frequency of which is dependent upon the
- the random component may be introduced by reducing the amplitude of
- harmonic oscillators assigned the weakly voiced classification for example by
- the oscillators producing random signals may be randomised at pitch intervals. Thus for a weakly voiced band, some periodicity remains but the power of the
- an input speech signal is processed to produce an
- the discarded magnitude values are represented at
- magnitude values to be quantized are always the same and predetermined on the
- each voiced frame is converted into a coded signal including a pitch
- the pitch segment is DFT transformed, the mean value of the
- the selected magnitudes are recovered, and each of the
- the input vector is transformed to a fixed size vector which is then
- variable input vector is directly quantized with a
- codebook vectors are obtained from variable size training vectors and an interpolation
- the invention is applicable in particular to pitch synchronous low bit rate
- the interpolation process is linear.
- the interpolation process is applied to produce from the
- codebook vectors a set of vectors of that given dimension.
- the dimension of the input vectors is reduced by taking into
- the remaining amplitudes i.e. in the region of
- 3.4kHz to 4 kHz are set to a constant value.
- the constant value is
- the backward prediction may be performed on a harmonic basis
- each frame is converted into a coded signal including an estimated pitch
- the excitation signal the excitation spectral envelope is shaped according to the
- the result is a system which is capable of delivering high
- the invention is based on the observation that
- the magnitude values may be obtained by spectrally sampling a modified
- the modified LPC synthesis filter may have reduced feedback gain and
- the value of the feedback gain may be controlled by the performance of the LPC model such that it is
- the reproduced speech signal may be equal to the energy of the original speech
- each frame is converted into a coded signal including LPC filter
- each pair of excitation signals comprising a first
- the outputs of the first and second LPC filters are weighted by
- a window function such as a Hamming window such that the magnitude of
- the output of the first filter is decreasing with time and the magnitude of the
- Figure 1 is a general block diagram of the encoding process in accordance with the present invention.
- Figure 2 illustrates the relationship between coding and matrix
- Figure 3 is a general block diagram of the decoding process
- Figure 4 is a block diagram of the excitation synthesis process
- Figure 5 is a schematic diagram of the overlap and add process
- Figure 6 is a schematic diagram of the calculation of an instantaneous
- Figure 7 is a block diagram of the overall voiced/unvoiced classification
- Figure 8 is a block diagram of the pitch estimation process
- Figure 9 is a schematic diagram of two speech segments which participate
- Figure 10 is a schematic diagram of speech segments used in the calculation of the crosscorrelation function value
- Figure 11 represents the value allocated to a parameter used in the
- Figure 12 is a block diagram of the process used to calculate the
- Figure 13 is a flow chart of a pitch estimation algorithm
- Figure 14 is a flow chart of a procedure used in the pitch estimation
- Figure 15 is a flow chart of a further procedure used in the pitch
- Figure 16 is a flow chart of a further procedure used in the pitch
- Figure 17 is a flow chart of a threshold value selection procedure
- Figure 18 is a flow chart of the voiced/unvoiced classification process
- Figure 19 is a schematic diagram of the voiced/unvoiced classification
- Figure 20 is a flow chart of the procedure used to determine offset values
- Figure 21 is a flow chart of the pitch estimation algorithm
- Figure 22 is a flow chart of a procedure used to impose constraints on
- Figures 23, 24 and 25 represent different portions of a flow chart of a
- Figure 26 is a general block diagram of the LPC analysis and LPC
- Figure 27 is a general flow chart of a strongly or weakly voiced
- Figure 28 is a flow chart of the procedure responsible for the
- Figure 29 represents a speech waveform obtained from a particular
- Figure 30 shows frequency tracks obtained for the speech utterance of
- Figure 31 shows to a larger scale a portion of Figure 30 and represents the
- Figure 32 shows a magnitude spectrum of a particular speech segment
- Figure 33 is a general block diagram of a system for representing
- Figure 34 is a block diagram of an adaptive quantiser shown in Figure 33;
- Figure 35 is a general block diagram of a quantisation process
- Figure 36 is a general block diagram of a differential variable size
- Figure 37 represents the hierarchical structure of a mean gain shape quantiser.
- A system in accordance with the present invention is described below, firstly in general terms and then in greater detail.
- the system operates on an LPC residual signal on a frame by frame basis.
- voiced speech depends on the pitch frequency of the signal.
- a voiced/unvoiced classification process allows the coding of voiced and unvoiced frames to be handled in different ways.
- Unvoiced frames are modelled in terms of an RMS value and a random time series.
- voiced frames a pitch period estimate is obtained and used to define a pitch segment which is centred at the middle of the frame.
- Pitch segments from adjacent frames are DFT transformed and only the resulting pitch segment magnitude information is coded and transmitted.
- pitch segment magnitude samples are classified as strongly or weakly voiced.
- the system transmits for every voiced frame the pitch period value, the magnitude spectral information of the pitch segment, the strong/weak voiced classification of the pitch magnitude spectral values, and the LPC coefficients.
- the information which is transmitted for every voiced frame is, in addition to voiced/unvoiced information, the pitch period value, the magnitude spectral information of the pitch segment, and the LPC filter coefficients.
- MG_j are the decoded pitch segment magnitude values and phase_j(i) is calculated from the integral of the linearly interpolated instantaneous harmonic frequencies ω_j(i).
- K is the largest value of j for which ω_j^n(i) ≤ π.
- the initial phase for each harmonic is set to zero. Phase continuity is preserved across the boundaries of successive interpolation intervals.
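The phase computation described above — the discrete integral of a linearly interpolated instantaneous frequency, with continuity carried across interval boundaries — can be sketched as follows (a minimal illustration, not the patent's implementation; the frame length and frequency values are arbitrary):

```python
import numpy as np

def harmonic_phase(w_start, w_end, M, phase0=0.0):
    """Integrate a linearly interpolated instantaneous harmonic
    frequency (radians/sample) over an M-sample interpolation
    interval; phase0 carries continuity from the previous interval."""
    i = np.arange(M)
    w = w_start + (w_end - w_start) * i / M   # linear interpolation
    return phase0 + np.cumsum(w)              # discrete integral

# continuity: the final phase of one interval seeds the next
p1 = harmonic_phase(0.20, 0.25, 160)
p2 = harmonic_phase(0.25, 0.30, 160, phase0=p1[-1])
```

Seeding each interval with the previous interval's final phase is what preserves continuity at the boundary.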
- the synthesis process is performed twice, however: once using the magnitude spectral values MG_j^{n+1} of the pitch segment derived from the current (n+1)th frame and again using the magnitude values MG_j^n of the pitch segment derived in the previous nth frame.
- the phase function phase_j(i) in each case remains the same.
- the resulting residual signals Res_n(i) and Res_{n+1}(i) are used as inputs to corresponding LPC synthesis filters calculated for the nth and (n+1)th speech frames.
- the two LPC synthesised speech waveforms are then weighted by W_{n+1}(i) and W_n(i) to yield the recovered speech signal.
- the LPC excitation signal is based on a "mixed" excitation model which allows for the appropriate mixing of periodic and random excitation components in voiced frames on a frequency-band basis. This is achieved by operating the system such that the magnitude spectrum of the residual signal is examined, and applying a peak-picking process near the resonant frequencies to detect possible dominant spectral peaks.
- NRS random components are spaced at 50 Hz intervals symmetrically about the harmonic frequency ω_j.
- the amplitudes of the NRS random components are set to MG_j / √(2 × NRS). Their initial phases are selected randomly from the [-π, +π] region at
- the hv_j information must be transmitted to be available at the receiver and, in order to reduce the bit rate allocated to hv_j, the bandwidth of the input signal is divided into a number of fixed size bands BD_k and a "strongly" or "weakly" voiced flag Bhv_k is assigned for each band.
- for a strongly voiced band a highly periodic signal is reproduced.
- for a weakly voiced band a signal which combines both periodic and aperiodic components is required.
- the remaining spectral bands can be strongly or weakly voiced.
- Figure 1 schematically illustrates processes operated by the system encoder. These processes are referred to in Figure 1 as Processes I to VII and these terms are used throughout this document.
- a speech signal is input and Processes I, III, IV, VI and VII produce outputs for transmission.
- Assuming that the first Matrix Quantization analysis frame (MQA) of k×M samples is available, each of the k coding frames within the MQA is classified as voiced or unvoiced (V_n) using Process I.
- a pitch estimation part of Process I provides a pitch period value P_n only when a coding frame is voiced.
- k/m is an integer and represents the frame dimension of the matrix quantizer employed in Process III.
- the quantized coefficients a are used to derive a residual signal R_n(i).
- P_n is the pitch period value associated with the nth frame. This segment is centred in the middle of the frame.
- the selected P_n samples are DFT transformed (Process V) to yield (P_n + 1)/2 spectral magnitude values MG_j^n, 1 ≤ j ≤ (P_n + 1)/2.
- the magnitude information is coded (using Process VI) and transmitted.
- Process IV produces quantized Bhv information, which for voiced frames is multiplexed and transmitted to the receiver together with the voiced/unvoiced decision V_n, the pitch period P_n, the quantized LPC coefficients a of the corresponding LPC frame, and the magnitude values MG_j^n. In unvoiced frames only the quantized RMS value and the quantized LPC filter coefficients a are transmitted.
- Figure 3 schematically illustrates processes operated by the system decoder.
- Given the received parameters of the nth coding frame and those of the previous (n-1)th coding frame, the decoder synthesises a speech signal S_n(i) that extends from the middle of the (n-1)th frame to the middle of the nth frame.
- This synthesis process involves the generation in parallel of two excitation signals Res_n(i) and Res_{n-1}(i) which are used to drive two independent LPC synthesis filters 1/A_n(z) and 1/A_{n-1}(z), the coefficients of which are derived from the transmitted quantized coefficients a.
- the process commences by considering the voiced/unvoiced status V_k, where k is equal to n or n-1 (see Figure 4).
- When V_k = 0 (unvoiced),
- a gaussian random number generator RG(0,1) of zero mean and unit variance provides a time series which is subsequently scaled by the RMS value received for this frame. This is effectively the required:
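The unvoiced branch described above can be sketched in a few lines (the seed and the 160-sample frame length are illustrative assumptions, not values from the patent):

```python
import numpy as np

def unvoiced_excitation(rms, M, seed=0):
    """Unvoiced frame excitation: a zero-mean, unit-variance Gaussian
    time series RG(0,1), scaled by the RMS value received for the frame."""
    rng = np.random.default_rng(seed)
    return rms * rng.standard_normal(M)

exc = unvoiced_excitation(0.5, 160)
```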
- the Res_k(i) excitation signal is defined as the summation of a "harmonic" Res_k^h(i) component and a "random" Res_k^r(i) component.
- the top path of the V_k = 1 part of the synthesis in Figure 4, which provides the harmonic component of this mixed excitation model, always calculates the instantaneous harmonic frequency function ω_j^n(i) which is associated with the interpolation interval that is defined between the middle points of the nth and (n-1)th frames (i.e. this action is independent of the value of k).
- ω_j^n(i) is calculated using the pitch frequencies f_j^{1,n}, f_j^{2,n} and linear interpolation, i.e.
- the f_j^{1,n} value is calculated during the decoding process of the previous (n-1)th coding frame. hv_j^n is the strongly/weakly voiced classification (0 or 1) of the jth harmonic ω_j^n.
- P_n and P_{n-1} are the received pitch estimates from the nth and (n-1)th frames.
- the associated phase value is:
- the random excitation signal Res_k^r(i) can be generated by the summation of random cosines located 50 Hz apart, where their phase is randomised every β samples, with β ≤ M, i.e.
- Res_k^r(i) = Σ_l cos(2π i (l × 50)/f_s + Θ_l(⌊i/β⌋)), where each phase Θ_l(·) is drawn from RU(-π, +π)
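The random-cosine generator described above can be sketched as follows; the 8 kHz sampling rate, frame length M = 160 and randomisation interval β = 40 are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def random_excitation(n_comp, fs=8000.0, M=160, beta=40, seed=0):
    """Sum of cosines spaced 50 Hz apart whose phases are redrawn
    from a uniform RU(-pi, +pi) every beta samples (beta <= M)."""
    rng = np.random.default_rng(seed)
    i = np.arange(M)
    n_seg = -(-M // beta)                       # ceil(M / beta) phase segments
    res = np.zeros(M)
    for l in range(1, n_comp + 1):
        f = 50.0 * l / fs                       # normalised frequency of component l
        phase = np.repeat(rng.uniform(-np.pi, np.pi, n_seg), beta)[:M]
        res += np.cos(2.0 * np.pi * f * i + phase)
    return res

exc = random_excitation(10)
```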
- 1/A_n(z) becomes 1/A_{n+1}(z) with the memory of 1/A_n(z).
- This is valid in all cases except during an unvoiced to voiced transition, where the memory of the 1/A_{n+1}(z) filter is set to zero.
- the coefficients of the 1/A_n(z) and 1/A_{n-1}(z) synthesis filters are calculated directly from the nth and (n-1)th coding speech frames respectively, when the LPC analysis frame size L is equal to M samples. However, when L ≠ M (usually L > M) linear interpolation is used on the filter coefficients (defined every L samples) so that the transfer function of the synthesis filter is updated every M samples.
- the output signals of these filters, denoted as X_{n-1}(i) and X_n(i), are weighted, overlapped and added as schematically illustrated in Figure 5, i.e.:
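The weighted overlap-add step can be sketched with a simple linear ramp in place of the W_n(i)/W_{n+1}(i) windows (the ramp shape here is an assumption for illustration; the patent defines its own weighting):

```python
import numpy as np

def overlap_add(x_prev, x_curr):
    """Cross-fade the two LPC-synthesised waveforms over the
    interpolation interval: x_prev fades out while x_curr fades in."""
    M = len(x_prev)
    w = np.arange(M) / (M - 1.0)          # ramp 0 -> 1 across the interval
    return (1.0 - w) * x_prev + w * x_curr

y = overlap_add(np.ones(160), np.zeros(160))
```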
- PF(z) is the conventional post filter:
- HP(z) is defined as: b · (1 - c·z^-1)
- a scaling factor SC is calculated every LPC frame of L samples.
- is associated with the middle of the lth LPC frame as illustrated in Figure 6.
- the filtered samples from the middle of the (l-1)th frame to the middle of the lth frame are then multiplied by SC_l(i) to yield the final output of the system, where:
- the scaling process introduces an extra half LPC frame delay into the coding-decoding process.
- the above described energy scaling procedure operates on an LPC frame basis in contrast to both the decoding and PF(z), HP(z) filtering procedures which operate on the basis of a frame of M samples.
- Process I derives a voiced/unvoiced (V/UV) classification V_n for the nth input coding frame and also assigns a pitch estimate P_n to the middle sample M_n of this frame. This process is illustrated in Figure 7.
- V/UV voiced/unvoiced
- the V/UV and pitch estimation analysis frame is centred at the middle M_{n+1} of the (n+1)th coding frame with 237 samples on either side.
- the pitch estimation algorithm is illustrated in Figure 8, where P represents the output of the pitch estimation process.
- the 294 input samples are used to calculate a crosscorrelation function CR(d), where d is shown in Figure 9 and 20 ≤ d ≤ 147.
- Figure 9 shows the two speech segments which participate in the calculation of the crosscorrelation function value at "d" delay.
- the crosscorrelation function ρ_d(j) is calculated for the segments {X}_d, {R}, as:
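A generic normalised crosscorrelation of the kind used for CR(d) can be sketched as follows (the segment length N and alignment are illustrative, not the patent's exact windowing):

```python
import numpy as np

def cross_corr(x, d, N):
    """Normalised crosscorrelation between the N-sample segment ending
    at the reference point and the segment a delay d earlier."""
    a = x[-N:]
    b = x[-N - d:-d]
    return float(np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))
```

For a periodic signal the function peaks (near 1) when d matches the pitch period, which is what the peak-picking stage exploits.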
- Figure 12 is a block diagram of the process involving the calculation of the CR function and the selection of its peaks. As illustrated, given CR(d), a threshold th(d) is determined as:
- th(d) = CR(d_max) × b - (d - d_max) × a - c   (28)
- the algorithm examines the length of the G_0 runs which exist between successive G_s segments (i.e. G_s and G_{s+1}), and when G_0 < 17, the G_s segment with the max CR(d) value is kept. This procedure yields CR_L(d), which is then examined by the following "peak picking" procedure.
- CR_L(d) > CR_L(d - 1) and CR_L(d) > CR_L(d + 1)
- certain peaks can be rejected if: CR_L(loc(k)) < CR_L(loc(k + 1)) × 0.9
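The local-maximum test and the 0.9-rejection rule above can be sketched directly (the search range and the handling of the last peak are assumptions):

```python
def pick_peaks(cr, start=20):
    """Local maxima of the clipped correlation function CR_L(d); a peak
    at loc(k) is rejected when it falls below 0.9 times the next peak."""
    locs = [d for d in range(start + 1, len(cr) - 1)
            if cr[d] > cr[d - 1] and cr[d] > cr[d + 1]]
    kept = [d for i, d in enumerate(locs)
            if i + 1 >= len(locs) or cr[d] >= 0.9 * cr[locs[i + 1]]]
    return kept
```

The rejection rule discourages selecting a submultiple-delay peak when a stronger peak exists at a larger delay.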
- CR(d) and loc(k) are used as inputs to the following Modified High Resolution Pitch Estimation algorithm (MHRPE) shown in Figure 8, whose output is P_{Mn+1}.
- MHRPE Modified High Resolution Pitch Estimation algorithm
- The flowchart of this MHRPE procedure is shown in Figure 13, where P is initialised to 0 and, at the end, the estimated P is the requested P_{Mn+1}.
- the main pitch estimation procedure is based on a Least Squares Error (LSE) algorithm which is defined as follows: For each possible pitch value j in the range from 21 to 147 with an increment of 0.1 × j, i.e. j ∈ {21, 23, 25, 27, 30, 33, 36, 40, 44, 48, 53, 58, 64, 70, 77, 84, 92, 101, 111, 122, 134}. (Thus 21 iterations are performed.)
- LSE Least Squares Error
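The 10%-increment candidate grid can be generated as below; the exact rounding behind the patent's 21-value set differs slightly from simple truncation, so only the geometric spacing and the candidate count are reproduced here:

```python
def pitch_candidates(lo=21, hi=147):
    """Candidate pitch grid: each value grows by 10% (increment 0.1*j)
    and is truncated to an integer, giving roughly logarithmic spacing
    over the 21..147 sample range."""
    j, out = float(lo), []
    while int(j) <= hi:
        out.append(int(j))
        j += 0.1 * j
    return out
```

Geometric spacing keeps the relative pitch-resolution roughly constant, so low and high pitch values are searched with comparable accuracy.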
- the V/UV part of Process I calculates the status V_{Mn+1}
- the flowchart of this part of the algorithm is shown in Figure 18 where "V" represents the output V/UV flag of this procedure. Setting the "V” flag to 1 or 0 indicates voiced or unvoiced classification respectively.
- the "CR” parameter denotes the maximum value of the CR function which is calculated in the pitch estimation process.
- a diagrammatic representation of the voiced/unvoiced procedure is given in Figure 19.
- a multipoint pitch estimation algorithm accepts P_{Mn+1}, P_{Mn+1+d1}, P_{Mn+1+d2}, V_{n-1}, P_{n-1}, V'_n, P'_n to provide a preliminary pitch value P'_{n+1}.
- the flowchart of this multipoint pitch estimation algorithm is given in Figure 21, where P_1, P_2 and P_0 represent the pitch estimates associated with the M_{n+1+d}
- V'_{n+1} and P'_{n+1} produced from this section are then used in the next pitch post processing section together with V_{n-1}, V'_n, P_{n-1} and P'_n to yield the final voiced/unvoiced and pitch estimate parameters V_n and P_n for the nth coding frame.
- This pitch post processing stage is defined in the flowchart of Figures 23, 24 and 25, the output A of Figure 23 being the input to Figure 24, and the output B of Figure 24 being the input to Figure 25.
- "P_n" and "V_n" represent the pitch estimate and voicing flag respectively, which correspond to the nth coding frame prior to post processing (i.e.
- the LPC analysis process (Process II of Figure 1) can be performed using the Autocorrelation, Stabilised Covariance or Lattice methods. The Burg algorithm was used, although simple autocorrelation schemes could be employed without a noticeable effect in the decoded speech quality.
- the LPC coefficients are then transformed to an LSP representation. Typical values for the number of coefficients are 10 to 12 and a 10th order filter has been used. LPC analysis processes are well known and described in the literature, for example "Digital Processing of Speech Signals", L.R.
- LSP coefficients are used to represent the data. These 10 coefficients could be scalar quantized using 37 bits with the following bit allocation pattern [3,4,4,4,4,4,4,4,3,3]. This is a relatively simple process, but the resulting bit rate of 1850 bits/second is unnecessarily high.
- the LSP coefficients can be Vector Quantised (VQ) using a Split-VQ technique.
- VQ Vector Quantised
- In the Split-VQ technique an LSP parameter vector of dimension "p" is split into two or more subvectors of lower dimensions and then each subvector is Vector Quantised separately (when Vector Quantising the subvectors a direct VQ approach is used).
- the LSP transformed coefficient vector C, which consists of "p" consecutive coefficients (c_1, c_2, ..., c_p), is split into "K" vectors C_k (1 ≤ k ≤ K), with the corresponding dimensions d_k (1 ≤ d_k ≤ p).
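A nearest-neighbour Split-VQ step can be sketched as follows (the codebooks and subvector dimensions are illustrative; codebook training is not shown):

```python
import numpy as np

def split_vq(x, codebooks):
    """Split-VQ sketch: x is cut into consecutive subvectors, each
    quantised independently by nearest-neighbour search in its own
    codebook (an (N, d_k) array of codevectors)."""
    out, idx, pos = [], [], 0
    for cb in codebooks:
        sub = x[pos:pos + cb.shape[1]]
        k = int(np.argmin(((cb - sub) ** 2).sum(axis=1)))  # nearest codevector
        idx.append(k)
        out.append(cb[k])
        pos += cb.shape[1]
    return np.concatenate(out), idx
```

Splitting trades a small loss in coding efficiency for codebooks that are exponentially smaller to search and store than a full-dimension VQ.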
- when K = p, the Split-VQ becomes equivalent to Scalar Quantisation.
- when K = 1, the Split-VQ becomes equivalent to Full Vector Quantisation.
- m(k) represents the spectral dimension of the kth submatrix and N is the SMQ
- Er(t) is the normalised energy of the prediction error of the (l+t)th frame
- En(t) is the RMS value of the (l+t)th speech frame
- Aver(En) is the average RMS value of the N LPC frames used in SMQ.
- the values of the two constants are set to 0.2 and 0.15 respectively.
- the overall SMQ quantisation process that yields the quantised LSP coefficient vectors for the l to l+N-1 analysis frames is shown in Figure 26.
- a 5Hz bandwidth expansion is also included in the inverse quantisation process.
- Process IV of Figure 1 is concerned with the mixed voiced classification of harmonics.
- the flowchart of Process IV is given in Figure 27.
- the R" array of 160 samples is Hamming windowed and augmented to form a 512 size array, which is then FFT processed.
- the maximum and minimum values MGR_max, MGR_min of the resulting 256 spectral magnitude values are determined, and a threshold TH0 is calculated. TH0 is then used to clip the magnitude spectrum.
- the clipped MGR array is searched to define peaks MGR(P) satisfying:
- For each peak MGR(P), "supported" by the MGR(P+1) and MGR(P-1) values, a second order polynomial is fitted and the maximum point of this curve is accepted as MGR(P) with a location loc(MGR(P)). Further constraints are then imposed on these magnitude peaks. In particular, a peak MGR(P) is rejected: a) if there are
- spectral peaks in the neighbourhood of loc(MGR(P)) (i.e. in the range loc(MGR(P))-fo/2 to loc(MGR(P))+fo/2, where fo is the fundamental frequency in Hz) whose value is larger than 80% of MGR(P), or b) if there are any spectral magnitudes in the same range whose value is larger than MGR(P).
- two thresholds are defined as follows:
- TH1 = 0.15 × fo.
- loc(MGR_d(k)) - loc(MGR_d(k-1)) is compared to 1.5 × fo + TH2, and if
- classification hv is zero (weakly voiced). (loc(MGR_d(k)) is the location of the kth dominant
- loc(k) is the location of the kth
- loc(MGR_d(k)) = loc(k).
- the spectrum is divided into bands of 500Hz each and a strongly voiced/weakly voiced flag Bhv is assigned for each band.
- the Bhv values of the remaining 5 bands are determined using a majority decision rule on the hv j values of the j harmonics which fall within the band under consideration.
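The majority decision rule for a band's Bhv flag can be sketched in a few lines (resolving ties as weakly voiced is an assumption; the patent does not specify tie handling here):

```python
def band_flag(hv_in_band):
    """Majority decision rule: a band is strongly voiced (1) when a
    strict majority of the harmonic hv_j flags falling in it are 1;
    ties resolve to weakly voiced (0)."""
    return 1 if sum(hv_in_band) * 2 > len(hv_in_band) else 0
```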
- the hv_j of a specific harmonic j is equal to the Bhv value of the corresponding band.
- the hv information may be transmitted with 5 bits.
- the 680 Hz to 3400 Hz range is represented by only two variable size bands.
- the Fc frequency that separates these two bands can be one of the following:
- the Fc frequency is selected by examining the three bands sequentially defined by the frequencies in (A) or (B) and by using again a majority rule on the harmonics which fall within a band.
- Figures 29 and 30 represent respectively an original speech
- the horizontal axis represents time in terms of frames each of
- Figure 31 shows to a larger scale a section of Figure 30, and represents
- Waveform A represents the magnitude
- Waveforms B, C and D represent the normalised Short-Term magnitude spectrum of the
- the hybrid model introduces an appropriate amount of randomness where required.
- For a real-valued sequence x(i) of P_n points the DFT may be expressed as X(k) = sum over i from 0 to P_n-1 of x(i)·e^(-j2πki/P_n), for 0 ≤ k ≤ P_n-1.
- the P_n point DFT will yield a double-sided spectrum.
- the magnitude of all the non DC components must be multiplied by a factor of 2.
- the total number of single side magnitude spectrum values, which are used in the reconstruction process, is equal to
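The conversion from the double-sided DFT to the single-sided magnitude spectrum used in reconstruction can be sketched as below. The direct O(N^2) DFT and the 1/P_n normalisation are choices made for clarity; the patent does not specify a normalisation convention.

```python
import cmath

def single_side_magnitudes(x):
    """Single-sided magnitude spectrum of a real sequence x via a direct
    DFT. Every non-DC magnitude is doubled, as stated in the text, so that
    the single side carries the full component amplitude (sketch; the
    1/P_n scaling is an assumption)."""
    p = len(x)
    mags = []
    for k in range(p // 2 + 1):
        xk = sum(x[i] * cmath.exp(-2j * cmath.pi * k * i / p) for i in range(p))
        scale = 1.0 if k == 0 else 2.0  # double all non-DC components
        mags.append(scale * abs(xk) / p)
    return mags
```

For a P_n-point transform this yields floor(P_n/2) + 1 values, i.e. the DC term plus one magnitude per positive-frequency bin.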
- MSVSAR: Modified Single Value Spectral Amplitude Representation
- MSVSAR is based on the observation that some of the speech spectrum resonance and anti- resonance information is also present at the residual magnitude spectrum (G.S. Kang and S.S. Everett, "Improvement of the Excitation Source in the Narrow-Band Linear Prediction Vocoder", IEEE Trans. Acoust., Speech and Signal Proc, Vol. ASSP-33, pp.377-386, 1985).
- LPC inverse filtering cannot produce a residual signal with an absolutely flat magnitude spectrum, mainly due to: a) the "cascade representation" of formants by the LPC filter 1/A(z), which causes the magnitudes of the resonant peaks to depend on the pole locations of the 1/A(z) all-pole filter, and b) the LPC quantisation noise.
- G R and G N are defined as follows:
- x_r^n(i) represents a sequence of 2P_n speech samples, centred in the middle of the nth coding frame, from which the mean value is calculated and removed
- the G_N parameter represents a constant whose value is set to 0.25.
- Equation 32 defines a modified LPC synthesis filter with reduced feedback gain, whose frequency response consists of nearly equalised resonant peaks, the locations of which are very close to the LPC synthesis resonant locations. Furthermore, the value of the feedback gain G R is controlled by the performance of the LPC model (i.e. it is proportional to the normalised LPC prediction error). In addition Equation 34 ensures that the energy of the reproduced speech signal is equal to the energy of the original speech waveform. Robustness is increased by computing the speech RMS value over two pitch periods. Two alternative magnitude spectrum representation techniques are described below, which allow for better coding of the magnitude information and lead to a significant improvement in reconstructed speech quality.
- the first of the alternative magnitude spectrum representation techniques is referred to below as the "Na-amplitudes system".
- the basic principle of this MG^n quantisation system is to represent accurately those MG^n values which correspond to the Na largest speech Short-Term magnitude spectrum values, since these MG^n magnitudes are subjectively more important for accurate quantization. The system subsequently selects the MG^n values at locations lc(1),...,lc(Na) and Vector Quantizes these values. If the minimum pitch value is 17 samples, the number of non-DC MG^n amplitudes is equal to 8, and for this reason Na ≤ 8. Two variations of the "Na-amplitudes system" were developed with equivalent performance, and their block diagrams are depicted in Figures 33(a) and (b) respectively.
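The selection step of the Na-amplitudes system can be sketched as follows; the indexing convention (index 0 treated as the DC term and excluded) is an assumption for illustration.

```python
def select_na_largest(mg, na=8):
    """Return the sorted locations lc(1..Na) of the Na largest non-DC
    amplitudes of MG^n together with their values; the remaining
    amplitudes would later be set to a fixed value (sketch; index 0 is
    assumed to be the DC term and is skipped)."""
    order = sorted(range(1, len(mg)), key=lambda i: mg[i], reverse=True)
    locs = sorted(order[:na])
    return locs, [mg[i] for i in locs]
```

The location set lc(1),...,lc(Na) is what identifies the perceptually important amplitudes that are then Vector Quantised.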
- This arrangement for the quantization of g or m extends the dynamic range of the coder to not less than 25 dB.
- A is either "m” or "g”).
- the block diagram of the adaptive ⁇ -law quantiser is shown in Figure 34.
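A generic adaptive μ-law quantiser of the kind referenced above can be sketched as below. The companding law is the standard μ-law characteristic; the adaptation of the overload value x_max, the number of levels, and μ = 255 are all assumptions, since Figure 34's details are not reproduced in the text.

```python
import math

def mu_law_quantise(x, x_max, levels=32, mu=255.0):
    """Compand x with a mu-law characteristic scaled to the current
    overload value x_max, quantise uniformly in the companded domain, and
    expand the midpoint back (generic sketch; the patent's adaptation
    rule for x_max is not specified here)."""
    x = max(-x_max, min(x_max, x))
    compressed = math.copysign(math.log1p(mu * abs(x) / x_max) / math.log1p(mu), x)
    step = 2.0 / levels
    index = min(levels - 1, int((compressed + 1.0) / step))
    # Reconstruct at the interval midpoint, then apply the inverse law.
    mid = -1.0 + (index + 0.5) * step
    return index, math.copysign(x_max * math.expm1(abs(mid) * math.log1p(mu)) / mu, mid)
```

The logarithmic companding is what yields the wide (not less than 25 dB) dynamic range with few quantiser levels.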
- the second of the alternative magnitude spectrum representation techniques is referred to below as the "Variable Size Spectral Vector Quantisation (VS/SVQ)" system. Coding systems which employ the general synthesis formula of Equation (1) to recover speech encounter the problem of coding a variable length, pitch-dependent spectral amplitude vector MG^n.
- the "Na- amplitudes" MG" quantisation schemes described in Figure 33 avoid this problem by Vector Quantising the minimum expected number of spectral amplitudes and by setting the rest of the MG" amplitudes to a fixed value.
- a partially spectrally flat excitation model has limitations in providing high recovered speech quality.
- the shape of the entire ⁇ MG" ⁇ magnitude spectrum should be quantised.
- Various techniques have been proposed for coding {MG^n}. Originally, ADPCM was used across the MG^n values associated with a specific coding frame. Also, {MG^n} has been DCT transformed and coded differentially across successive MG^n magnitude spectra.
- the first VQ method involves the transformation of the input vector to a fixed size vector followed by conventional Vector Quantisation.
- the inverse transformation on the quantised fixed size vector yields the recovered quantised MG^n vector. Transformation techniques which have been used include Linear Interpolation, Band Limited Interpolation, All Pole modelling and Non-Square transformation. However, the overall distortion produced by this approach is the sum of the VQ noise and a component introduced by the transformation process.
- the second VQ method achieves the direct quantisation of a variable size input vector with a fixed size code vector. This is based on selecting only vs_n elements from each codebook vector to form a distortion measure between a codebook vector and an input MG^n vector. Such a quantisation approach avoids the transformation distortion of the techniques mentioned in (i) and results in an overall distortion that is equal to the Vector Quantisation noise.
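The direct variable-size search described above can be sketched as follows. Taking the first vs_n elements of each codebook vector and using a plain squared-error distortion are assumptions for illustration; the patent's actual system also applies interpolation and weighting.

```python
def vssvq_search(codebook, target):
    """Direct VS/SVQ-style search: compare the variable-size input vector
    against only the first vs_n elements of each fixed-size codebook
    vector, so no transformation distortion is introduced (sketch with an
    unweighted squared-error measure)."""
    vs_n = len(target)
    best_index, best_dist = -1, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - t) ** 2 for c, t in zip(code[:vs_n], target))
        if dist < best_dist:
            best_index, best_dist = idx, dist
    return best_index, best_dist
```

Because the same fixed-size codebook serves every frame, the quantiser handles any pitch-dependent vector length without a pre-transformation step.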
- VS/SVQ: Variable Size Spectral Vector Quantisation
- Figure 35 highlights the VS/SVQ process.
- Interpolation (in this case linear) is used on the S vectors to yield S^n vectors of dimension vs_n.
- the S to S^n interpolation process is given by:
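Since the interpolation formula itself is not reproduced in the text, the sketch below shows one plausible linear resizing of a fixed-size shape vector to dimension vs_n; the endpoint-to-endpoint mapping is an assumption.

```python
def resize_linear(s, vs_n):
    """Linearly interpolate a fixed-size shape vector S to a vector of
    dimension vs_n (sketch; the mapping of the first and last elements of
    S onto the first and last output positions is an assumption)."""
    if vs_n == 1:
        return [s[0]]
    out = []
    for j in range(vs_n):
        pos = j * (len(s) - 1) / (vs_n - 1)
        i = min(int(pos), len(s) - 2)
        frac = pos - i
        out.append(s[i] * (1.0 - frac) + s[i + 1] * frac)
    return out
```

When vs_n equals the codebook dimension the vector passes through unchanged, so the resizing only acts for frames whose harmonic count differs from the stored size.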
- Amplitude vectors obtained from adjacent residual frames exhibit significant redundancy, which can be removed by means of backward prediction. Prediction is performed on a harmonic basis, i.e. the amplitude value of each harmonic MG_k^n is predicted from the amplitude value of the same harmonic in previous frames, i.e. MG_k^(n-1).
- a fixed linear predictor (predicting MG_k^n as b × MG_k^(n-1)) may be incorporated in the VS/SVQ system, and the resulting DPCM structure is shown in Figure 36 (differential VS/SVQ, (DVS/SVQ)).
- E denotes the quantised error vector
- the quantisation of the E_j^n (1 ≤ j ≤ vs_n) error vector incorporates Mean Removal and Gain-Shape Quantisation techniques, using the hierarchical VQ structure of Figure 36.
- a weighted Mean Square Error is used in the VS/SVQ stage of the system.
- W is normalised so that:
- the pdf of the mean value of E^n is very broad and, as a result, the mean value differs widely from one vector to another.
- This mean value can be regarded as statistically independent of the variation of the shape of the error vector E^n and thus can be quantised separately without paying a substantial penalty in compression efficiency.
- the mean value of an error vector is calculated as follows:
- M is Optimum Scalar Quantised to M̂, which is then removed from the original error vector to form Erm^n = (E^n - M̂).
- the overall quantization distortion is attributed to the quantization of the "Mean Removed" error vectors ( Erm” ), which is performed by a Gain-Shape Vector Quantiser.
- the objective of the Gain-Shape VQ process is to determine the gain value G and the shape vector S so as to minimise the distortion measure:
- a gain-optimised VQ search method, similar to techniques used in CELP systems, is employed to find the optimum G and S.
- the shape Codebook (CBS) of vectors S is searched first to yield an index I, which maximises the quantity:
- cbs is the number of codevectors in the CBS.
- the optimum gain value is defined as:
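The maximised quantity and the optimum gain formula are not reproduced in the text, so the sketch below uses the standard gain-optimised forms from CELP practice: the shape S maximising (E·S)^2 / (S·S), with optimum gain G = (E·S) / (S·S). Treating these as the patent's Equations is an assumption, though they are the conventional result of the minimisation.

```python
def gain_shape_search(cbs, erm):
    """Gain-optimised shape search over the shape codebook CBS: pick the
    index maximising <E,S>^2 / <S,S> computed over the first vs_n
    elements, then set G = <E,S> / <S,S> (standard CELP-style forms,
    assumed to match the patent's unreproduced equations)."""
    vs_n = len(erm)
    best_i, best_q = -1, -1.0
    for i, s in enumerate(cbs):
        cross = sum(e * sj for e, sj in zip(erm, s[:vs_n]))
        energy = sum(sj * sj for sj in s[:vs_n])
        q = cross * cross / energy if energy > 0 else 0.0
        if q > best_q:
            best_i, best_q = i, q
    s = cbs[best_i][:vs_n]
    cross = sum(e * sj for e, sj in zip(erm, s))
    return best_i, cross / sum(sj * sj for sj in s)
```

Searching on the normalised quantity first means the gain never needs to be quantised inside the inner loop, which is the usual reason this factorisation is used in CELP coders.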
- The performance of each quantizer (i.e. b_k, CBM_k, CBG_k, CBS_k) has been evaluated using subjective tests and a LogSegSNR distortion measure, which was found to reflect the subjective performance of the system.
- Q_i denotes the cluster of Erm^n_k error vectors which are quantised to the S_i shape codevector
- cbs represents the total number of shape quantisation levels
- J_n represents the CBG_k gain codebook index which encodes the Erm^n_k error vector, and 1 ≤ j ≤ vs_n
- D_j denotes the cluster of Erm^n_k error vectors which are quantised to the G_j gain quantiser level
- cbg represents the total number of gain quantisation levels
- I_n represents the CBS_k shape codebook index which encodes the Erm^n_k error vector, and 1 ≤ j ≤ vs_n
- Process VII calculates the energy of the residual signal.
- the LPC analysis performed in Process II provides the prediction coefficients a_i, 1 ≤ i ≤ p, and the reflection coefficients k_i, 1 ≤ i ≤ p.
- the Voiced/Unvoiced classification performed in Process I provides the short term autocorrelation coefficient for zero delay of the speech signal (RO) for the frame under consideration.
- the Energy of the residual signal E is given as:
- Equation (50) gives a good approximation of the residual signal energy with low computational requirements.
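Equation (50) itself is not reproduced in the text, but given the quantities listed (R0 and the reflection coefficients), the standard LPC identity relating residual energy to the reflection coefficients is a natural reading; treating it as Equation (50) is an assumption.

```python
def residual_energy(r0, reflection):
    """Approximate the residual signal energy from the zero-delay
    autocorrelation R0 and the reflection coefficients k_i using the
    standard LPC identity E = R0 * prod(1 - k_i^2) (assumed to
    correspond to the patent's Equation (50))."""
    e = r0
    for k in reflection:
        e *= (1.0 - k * k)
    return e
```

This costs only p multiplications per frame, consistent with the low computational requirements claimed for the approximation.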
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Aerials With Secondary Devices (AREA)
- Optical Communication System (AREA)
- Telephonic Communication Services (AREA)
- Reduction Or Emphasis Of Bandwidth Of Signals (AREA)
- Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)
Abstract
Description
Claims
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU34523/97A AU3452397A (en) | 1996-07-05 | 1997-07-07 | Speech synthesis system |
EP97930643A EP0950238B1 (en) | 1996-07-05 | 1997-07-07 | Speech coding and decoding system |
AT97930643T ATE249672T1 (en) | 1996-07-05 | 1997-07-07 | VOICE CODING AND DECODING SYSTEM |
DE69724819T DE69724819D1 (en) | 1996-07-05 | 1997-07-07 | VOICE CODING AND DECODING SYSTEM |
JP10504943A JP2000514207A (en) | 1996-07-05 | 1997-07-07 | Speech synthesis system |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB9614209.6 | 1996-07-05 | ||
GBGB9614209.6A GB9614209D0 (en) | 1996-07-05 | 1996-07-05 | Speech synthesis system |
US2181596P | 1996-07-16 | 1996-07-16 | |
US021,815 | 1996-07-16 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1998001848A1 true WO1998001848A1 (en) | 1998-01-15 |
Family
ID=26309651
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/GB1997/001831 WO1998001848A1 (en) | 1996-07-05 | 1997-07-07 | Speech synthesis system |
Country Status (7)
Country | Link |
---|---|
EP (1) | EP0950238B1 (en) |
JP (1) | JP2000514207A (en) |
AT (1) | ATE249672T1 (en) |
AU (1) | AU3452397A (en) |
CA (1) | CA2259374A1 (en) |
DE (1) | DE69724819D1 (en) |
WO (1) | WO1998001848A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2784218B1 (en) * | 1998-10-06 | 2000-12-08 | Thomson Csf | LOW-SPEED SPEECH CODING METHOD |
DE102004007191B3 (en) | 2004-02-13 | 2005-09-01 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Audio coding |
DE102004007200B3 (en) | 2004-02-13 | 2005-08-11 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Device for audio encoding has device for using filter to obtain scaled, filtered audio value, device for quantizing it to obtain block of quantized, scaled, filtered audio values and device for including information in coded signal |
DE102004007184B3 (en) * | 2004-02-13 | 2005-09-22 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Method and apparatus for quantizing an information signal |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0490740A1 (en) * | 1990-12-11 | 1992-06-17 | Thomson-Csf | Method and apparatus for pitch period determination of the speech signal in very low bitrate vocoders |
EP0703565A2 (en) * | 1994-09-21 | 1996-03-27 | International Business Machines Corporation | Speech synthesis method and system |
WO1996027870A1 (en) * | 1995-03-07 | 1996-09-12 | British Telecommunications Public Limited Company | Speech synthesis |
-
1997
- 1997-07-07 DE DE69724819T patent/DE69724819D1/en not_active Expired - Lifetime
- 1997-07-07 CA CA002259374A patent/CA2259374A1/en not_active Abandoned
- 1997-07-07 EP EP97930643A patent/EP0950238B1/en not_active Expired - Lifetime
- 1997-07-07 AU AU34523/97A patent/AU3452397A/en not_active Abandoned
- 1997-07-07 AT AT97930643T patent/ATE249672T1/en not_active IP Right Cessation
- 1997-07-07 WO PCT/GB1997/001831 patent/WO1998001848A1/en active IP Right Grant
- 1997-07-07 JP JP10504943A patent/JP2000514207A/en active Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2357683A (en) * | 1999-12-24 | 2001-06-27 | Nokia Mobile Phones Ltd | Voiced/unvoiced determination for speech coding |
US6915257B2 (en) | 1999-12-24 | 2005-07-05 | Nokia Mobile Phones Limited | Method and apparatus for speech coding with voiced/unvoiced determination |
GB2398981A (en) * | 2003-02-27 | 2004-09-01 | Motorola Inc | Speech communication unit and method for synthesising speech therein |
GB2398981B (en) * | 2003-02-27 | 2005-09-14 | Motorola Inc | Speech communication unit and method for synthesising speech therein |
CN114519996A (en) * | 2022-04-20 | 2022-05-20 | 北京远鉴信息技术有限公司 | Method, device and equipment for determining voice synthesis type and storage medium |
Also Published As
Publication number | Publication date |
---|---|
AU3452397A (en) | 1998-02-02 |
JP2000514207A (en) | 2000-10-24 |
EP0950238B1 (en) | 2003-09-10 |
EP0950238A1 (en) | 1999-10-20 |
ATE249672T1 (en) | 2003-09-15 |
DE69724819D1 (en) | 2003-10-16 |
CA2259374A1 (en) | 1998-01-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP1576585B1 (en) | Method and device for robust predictive vector quantization of linear prediction parameters in variable bit rate speech coding | |
EP3039676B1 (en) | Adaptive bandwidth extension and apparatus for the same | |
RU2389085C2 (en) | Method and device for introducing low-frequency emphasis when compressing sound based on acelp/tcx | |
KR101604774B1 (en) | Multi-reference lpc filter quantization and inverse quantization device and method | |
US6871176B2 (en) | Phase excited linear prediction encoder | |
US7039581B1 (en) | Hybrid speed coding and system | |
US7222070B1 (en) | Hybrid speech coding and system | |
WO2007083933A1 (en) | Apparatus and method for encoding and decoding signal | |
US7139700B1 (en) | Hybrid speech coding and system | |
EP0950238B1 (en) | Speech coding and decoding system | |
Champion et al. | High-order allpole modelling of the spectral envelope | |
US20050065786A1 (en) | Hybrid speech coding and system | |
Drygajilo | Speech Coding Techniques and Standards | |
Villette | Sinusoidal speech coding for low and very low bit rate applications | |
So et al. | Multi-frame GMM-based block quantisation of line spectral frequencies | |
Bhaskar et al. | Low bit-rate voice compression based on frequency domain interpolative techniques | |
CA2511516C (en) | Method and device for robust predictive vector quantization of linear prediction parameters in variable bit rate speech coding | |
Lee et al. | An Efficient Segment-Based Speech Compression Technique for Hand-Held TTS Systems | |
Papanastasiou | LPC-Based Pitch Synchronous Interpolation Speech Coding | |
Yang et al. | A 5.4 kbps speech coder based on multi-band excitation and linear predictive coding | |
Zhang | Speech transform coding using ranked vector quantization | |
EP1212750A1 (en) | Multimode vselp speech coder | |
Lupini | Harmonic coding of speech at low bit rates | |
Balint | Excitation modeling in CELP speech coders [articol] | |
Ilk | Low Bit Rate DCT Prototype Interpolation Speech Coding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH HU IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW AM AZ BY KG KZ MD RU TJ TM |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH KE LS MW SD SZ UG ZW AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL |
|
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
ENP | Entry into the national phase |
Ref document number: 2259374 Country of ref document: CA Ref country code: CA Ref document number: 2259374 Kind code of ref document: A Format of ref document f/p: F |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1997930643 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 09214308 Country of ref document: US |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
WWP | Wipo information: published in national office |
Ref document number: 1997930643 Country of ref document: EP |
|
WWG | Wipo information: grant in national office |
Ref document number: 1997930643 Country of ref document: EP |