EP0449043A2 - Method and apparatus for speech digitizing - Google Patents

Method and apparatus for speech digitizing

Info

Publication number
EP0449043A2
Authority
EP
European Patent Office
Prior art keywords
segments
signal
filter
speech
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP91103907A
Other languages
German (de)
English (en)
Other versions
EP0449043A3 (en)
Inventor
Arthur Schaub
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ascom Zelcom AG
Original Assignee
Ascom Zelcom AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ascom Zelcom AG filed Critical Ascom Zelcom AG
Publication of EP0449043A2
Publication of EP0449043A3

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/06Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients

Definitions

  • The 100 percent quality corresponds to the well-known logarithmic pulse code modulation with a bit rate of 64 kilobits per second, which lies at the upper end of the range important for radio and telephony, namely 2.4 to 64 kbit/s.
  • Logarithmic pulse code modulation belongs to the class of so-called waveform coders, whose principle is to approximate each individual sample as closely as possible.
  • The coding of the samples can be done in different ways, namely such that the coding depends on the previous sample or on parameters derived from the previous samples, so that advantage can be taken of characteristics of the speech signals, which makes it possible to improve the effectiveness of the processing method and to reduce the bit rate. Knowing the correlation function of a speech signal section, one can calculate an optimal filter that provides the best estimates for predicting a sample from previous samples. This filter is used in a feedback loop in order to obtain quantization noise with a flat spectrum, that is to say without speech modulation.
  • In contrast to waveform coding there is so-called source coding, which in connection with speech coding is called vocoding.
  • The only aim here is to generate a signal during playback that sounds as similar as possible to the original, but in which the signal curve itself, i.e. the individual samples, can differ greatly from the original.
  • The signal is analyzed using a model of speech production to derive parameters for a speech reproduction. These parameters are transmitted digitally to the receiving end, where they control a synthesis device that corresponds to the model used for the analysis.
  • Source coding already achieves 60 to 75% of the full speech quality at 2.4 kilobits per second, but it cannot exceed the saturation value of 75% even if the bit rate is increased arbitrarily. This reduced quality is mainly noticeable in a not entirely natural sound and in difficult speaker recognition. The reason for this lies in the overly simple model of speech synthesis.
  • the bit rate can be reduced from 64 kilobits to approximately 12 kilobits per second while maintaining the full speech quality, although the complexity of the coding algorithms increases accordingly.
  • the speech quality of the waveform coding declines rapidly below 12 kilobits per second.
  • The present invention now relates to a method for speech digitization using waveform coding, with an encoder for digitization and a decoder for the reconstruction of the speech signal, in which the speech signal is divided into segments in the encoder and processed with the closest possible approximation of the samples, an estimated value for each upcoming new sample being calculated from known samples.
  • The invention is intended to close the gap between waveform and source coding in the range from approximately 3.6 to 12 kilobits per second; in other words, a coding method is to be specified in which the speech quality is 100% from approximately 6 kbit/s upward, and for which the moderate computational effort customary for waveform coding suffices.
  • This object is achieved according to the invention in that the calculation of the estimated value takes place only in part of the segments, while in the other part of the segments only parameters for a speech simulation in the sense of source coding are derived, and in that the individual signal segments are processed with a variable bit rate, the bit rates being assigned to different operating modes and each signal segment being classified into one of the operating modes.
  • The individual speech segments are thus coded with more or fewer bits as required, and a hybrid coding method is obtained in which the methods of source coding and waveform coding are combined.
  • the segment-wise processing with different bit rates together with the signal processing steps upstream and downstream of the signal quantization leads to an average bit rate of about 6 kilobits per second and to a voice quality that is 100% of that in telephony transmission.
  • the corresponding sampling rate is 7200 Hz, the bandwidth is 3400 Hz.
  • the length of the voice segments is 20 milliseconds, so that a segment comprises 144 samples.
  • the invention further relates to a device for performing the above method with an encoder and a decoder.
  • The device is characterized in that the encoder contains an adaptive near-prediction filter for calculating the estimated value for the imminent new sample in one part of the segments, an adaptive remote prediction filter for use in voiced signal segments, and means for examining the signal segments and assigning them to the individual operating modes.
  • the structure of the speech coder according to the invention with a variable bit rate is thus based on the one hand on the principle of adaptive-predictive coding (APC) and on the other hand on that of the linear predictive coding of the classic LPC vocoder with a bit rate of 2.4 kilobits per second.
  • APC: adaptive-predictive coding
  • the typical data rates of the source coding enable a sufficiently high quality reproduction for many signal segments. This applies first of all to the clearly perceptible pauses between words and sentences, but also to the short pauses before plosive sounds (p, t, k, b, d and g). The latter are pauses within individual words, for example in the word "father" between a and t. Such signal intervals are referred to below as quiet segments and are assigned to a first operating mode, mode I. They are encoded with 24 bits, which results in a data rate of 1200 bits / s.
  • the hissing sounds can also be adequately reproduced with a low data rate of preferably 2400 bit / s.
  • These sounds have the common property that a continuous stream of air flows from the lungs through the trachea, pharynx and oral cavity, and that at a certain point a narrowing produces air turbulence. The different sibilants differ by the location of this narrowing: with s it is the narrowing between the upper and lower teeth, with f that between the upper teeth and lower lip, and with sch that between the tip of the tongue and the palate. In any case, it is a noise that receives a slightly different spectral coloring according to the geometric arrangement of the speech organs.
  • the corresponding signal intervals are referred to below as fricative segments and are assigned to a second operating mode, mode II. They are encoded with 48 bits, which results in the aforementioned data rate of 2400 bits / s.
  • the normal segments have no signal properties that would allow particularly economical coding, such as the quiet and the fricative segments.
  • The normal segments also do not show anything special that would require additional coding, as the voiced segments do; the latter are explained immediately below as the last operating mode.
  • the normal segments are assigned to a third operating mode, mode III, and encoded with 192 bits, which results in a data rate of 9600 bit / s.
  • The voiced sounds include all vowels (a, e, i, o, u, ä, ö, ü and y) and diphthongs (au, ei and eu) as well as the nasal sounds (m, n and ng).
  • Their common property is the activity of the vocal cords, which modulate the air flow from the lungs by delivering periodic air blasts. This results in a quasi-periodic waveform.
  • The different voiced sounds are characterized by different geometrical arrangements of the speech organs, which leads to different spectral colorings. A satisfactory reproduction of the voiced sounds is only possible if the approximate periodicity is taken into account in addition to the coding method for the normal segments. For the voiced sounds, assigned to a fourth operating mode, mode IV, this results in a data volume increased to 216 bits per segment and from this a data rate of 10800 bit/s.
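The bit budgets of the four operating modes are consistent with the 20 ms segment length stated above. A minimal sketch (Python; the mode labels are descriptive only, not from the patent) recomputing the per-mode data rates:

```python
# Bits per 20 ms segment for each operating mode, as given in the text.
SEGMENT_DURATION_S = 0.020  # one segment: 144 samples at 7200 Hz

MODES = {
    "I (quiet)": 24,
    "II (fricative)": 48,
    "III (normal)": 192,
    "IV (voiced)": 216,
}

for name, bits in MODES.items():
    print(f"Mode {name}: {bits:3d} bits/segment -> {bits / SEGMENT_DURATION_S:5.0f} bit/s")
# -> 1200, 2400, 9600 and 10800 bit/s, matching the rates quoted above
```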
  • a prerequisite for the use of the different operating modes with the respective data rate is a signal analysis which classifies each signal segment into one of the operating modes Mode I to Mode IV and initiates the appropriate signal processing.
  • the structure of the encoder with variable bit rate is based on the principle of adaptive-predictive coding (APC) on the one hand and on the other hand on that of the classic 2.4 kbit / s LPC vocoder.
  • A detailed description of adaptive-predictive coding can be found in the book "Digital Coding of Waveforms" by N.S. Jayant and P. Noll, Prentice Hall, Inc., Englewood Cliffs, New Jersey, 1984; Chapter 6: Differential PCM, pp. 252-350; Chapter 7: Noise Feedback Coding, pp. 351-371.
  • the encoder contains at its input a high-pass filter HP and at its output an adaptive filter 1 with the transfer function 1-A (z) and a stage 2 called near correlation.
  • The signal path leads from the output of filter 1 to an adaptive pre-filter 3 with the transfer function 1/(1-A(z/γ)), to a stage 4 designated remote correlation, to an adaptive filter 5 with the transfer function 1-B(z), and to a level calculation stage 6.
  • The circuit further contains a multiplexer 7, four summation points, an adaptive quantizer 8, a filter 9 with the transfer function A(z/γ) and a filter 10 with the transfer function B(z).
  • The decoder contains a demultiplexer 11, a decoder/quantizer 12, a noise source 13, three summation points, a filter 9 with the transfer function A(z/γ), a filter 10 with the transfer function B(z) and an adaptive post-filter 14 with the transfer function (1-A(z/δ))/(1-A(z)).
  • Table 2 below shows which basic algorithmic elements the encoder contains and which of the circuit elements shown in FIGS. 1 and 2 perform these functions:
  • The near-prediction filter, which is also referred to as the predictor of linear predictive coding (LPC predictor), calculates an estimate for the imminent new sample value based on a few already known sample values.
  • the transfer function of the near-prediction filter is usually referred to as A (z).
  • The filter works segment by segment with a transfer function adapted to the signal curve; because the waveform of a speech signal is constantly changing, new filter coefficients have to be calculated for each signal segment. This calculation is carried out in the near correlation stage labeled 2.
  • a residual signal results which consists of the linearly unpredictable signal components.
  • the transfer function of this filtering is 1-A (z). Due to its unpredictability, the residual signal has properties of a random process, which can be seen in its approximately flat spectrum.
  • The adaptive filter 1 thus has the remarkable property of smoothing out the resonances specific to the sound, that is to say the so-called formants.
  • the filtering 1-A (z) with the filter 1 arranged at the input of the encoder takes place in each of the four operating modes (Table 1). Different filter orders are used for the different operating modes; the filter has order three for the quiet segments (mode I) and order eight for the other operating modes.
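As a minimal illustration of this step, one segment can be run through the prediction-error filter 1-A(z); the coefficients a_1...a_m are assumed to be already computed, and filter state is not carried across segment boundaries in this sketch:

```python
import numpy as np
from scipy.signal import lfilter

def prediction_residual(segment, a):
    """Apply the prediction-error filter 1 - A(z) to one signal segment.

    `a` holds the prediction coefficients a_1..a_m of A(z), so the FIR
    taps are [1, -a_1, ..., -a_m]. The output is the residual, i.e. the
    linearly unpredictable signal component with a roughly flat spectrum.
    """
    taps = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return lfilter(taps, [1.0], segment)
```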
  • The prediction coefficients are also required in the further course of the coding, by the filters 3 and 9, which is symbolized in FIG. 1 by the broad arrows characterizing the data flow. Likewise, the prediction coefficients are used in the decoding of FIG. 2 for the filters 9 and 14. However, since the prediction filter can only be calculated in the encoder, where the signal segment is present, the calculated coefficients must be encoded and stored together with further digital information so that the decoder can reconstruct the signal.
  • This coding of the coefficients is conceived in FIG. 1 as a component of the near correlation stage 2. Their storage is symbolized by the data arrow to the multiplexer 7. The prediction coefficients then pass from the demultiplexer 11 in FIG. 2 along the drawn-in data arrows to the filters 9 and 14.
  • The adaptive remote prediction filter is also referred to as the pitch predictor, in accordance with the English name for the fundamental frequency of the periodic excitation signal present in voiced sounds. Its use only makes sense in voiced segments (mode IV), and the actual filtering is always preceded by a signal analysis that decides for or against its use. This analysis takes place in the remote correlation stage 4. Further tasks of this stage are the calculation and coding of the coefficients of the remote prediction filter, which, like those of the near-prediction filter, must be stored as part of the digital information so that the decoder can reconstruct the waveform in voiced segments.
  • the transfer function of the remote prediction filter is designated B (z). It is implemented as a transversal filter; its filter order is three. In contrast to the near-prediction filter, it does not work on the immediately preceding signal values, but on those at intervals of a basic period M of the periodic excitation signal.
  • The determination of M, also referred to as the pitch period, is a further task of the remote correlation stage 4.
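A minimal sketch of this three-tap remote prediction, assuming the pitch period M and the coefficients (written β-1, β0 and β+1 further below) are already known; the index convention is an assumption of the sketch:

```python
def pitch_prediction(s, n, M, beta):
    """Three-tap remote prediction (pitch predictor) B(z).

    Estimates sample s[n] from the samples one basic period M back:
    beta = (b_m1, b_0, b_p1) weights s[n-M+1], s[n-M] and s[n-M-1].
    """
    b_m1, b_0, b_p1 = beta
    return b_m1 * s[n - M + 1] + b_0 * s[n - M] + b_p1 * s[n - M - 1]
```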
  • The adaptive quantizer (Table 2) is composed of the level calculation 6 and the quantizer 8. Its mode of operation is similar to that of a conventional analog/digital converter, with the difference that the adaptive quantizer does not work with a constant maximum signal amplitude, but uses a variable value that is periodically determined anew in the level calculation 6.
  • The level calculation, which is carried out in all operating modes, divides each signal segment into sub-segments and calculates a new level value adapted to the signal curve for each sub-segment. Quiet segments are divided into two, the rest into three sub-segments. The level values are also encoded and saved.
  • The quantization and coding of the individual signal values takes place in the quantizer 8 with only a single bit per signal value, a positive signal value being coded as 1 and a negative one as 0. These data therefore have the meaning of sign bits.
  • The signal values at the output of quantizer 8 are the positive current level value for code word 1 and the negative current level value for code word 0.
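In effect this is a one-bit quantizer whose output magnitude is the current level value; a minimal sketch:

```python
def quantize_sample(x):
    """One-bit quantization: only the sign of the sample is kept."""
    return 1 if x >= 0.0 else 0

def reconstruct_sample(bit, level):
    """Code word 1 yields +level, code word 0 yields -level."""
    return level if bit == 1 else -level
```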
  • the quantization of the individual signal values only takes place in the normal and voiced segments. This leads to the remarkably low data rates of the quiet and fricative signal elements.
  • the decoder / quantizer 12 receives the sign bits for the reconstruction of the individual signal values only in the normal and voiced segments.
  • In the quiet and fricative segments, the noise source 13 is active instead; it supplies a pseudo-random signal of constant power, the values of which are multiplied by the current level value. This locally generated signal enables a qualitatively adequate reproduction of the quiet and fricative segments.
  • ΔPCM loop: The signal paths in FIG. 1 with the quantizer 8, the predictors 9 and 10 and the four summation points are collectively referred to as the ΔPCM loop.
  • In classical APC, the incoming speech signal goes directly to the ΔPCM loop, i.e. without passing through filters 1 and 3, and in the ΔPCM loop the near-prediction filter A(z) is used instead of the filter 9 with the transfer function A(z/γ).
  • A prediction value, which in voiced segments is composed of the near and the remote prediction value, is subtracted from the signal value at the output of the high-pass filter HP.
  • the remote prediction filter makes no contribution in non-voiced segments.
  • the difference value is quantized in both cases, and at the output of the quantizer 8, the prediction value is added to the quantized difference value. This addition results in a quantized speech signal value that approximates the non-quantized speech signal value fed into the ⁇ PCM loop. In the decoder of FIG. 2, this approximate value is reconstructed using the stored digital information.
  • In classical APC, the quantized speech signal likewise goes directly to the loudspeaker, without passing through the filter 14.
  • It is essential that the predictors use the quantized speech signal as their input signal and that they are arranged in a feedback loop. From FIG. 1 it can also be seen that the two predictors work in series: the output signal of the near-prediction filter is subtracted from the quantized speech signal, and this difference reaches the remote prediction filter.
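Collecting these elements, a minimal per-segment sketch of the loop. The callables near_predict and far_predict stand in for the predictors 9 and 10, and the per-sample levels array simplifies the sub-segment level scheme; all three are assumptions of this sketch. In non-voiced segments far_predict would simply return zero:

```python
import numpy as np

def dpcm_loop(x, near_predict, far_predict, levels):
    """Sketch of the quantization loop for one segment.

    near_predict(y_hist) and far_predict(d_hist) return the near and
    remote predictions from the histories they are given.
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)  # quantized speech signal
    d = np.zeros_like(x)  # y minus near prediction, input to the pitch predictor
    bits = []
    for n in range(len(x)):
        p_near = near_predict(y[:n])
        p_far = far_predict(d[:n])
        e = x[n] - (p_near + p_far)      # prediction error
        bit = 1 if e >= 0.0 else 0       # one-bit quantization (sign bit)
        q = levels[n] if bit else -levels[n]
        y[n] = p_near + p_far + q        # approximates x[n]
        d[n] = y[n] - p_near             # the predictors work in series
        bits.append(bit)
    return bits, y
```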
  • the quantized difference value differs from the non-quantized one by a slight rounding error.
  • the signal of the successive rounding errors is uncorrelated in this case and shows a flat spectrum.
  • This so-called quantization noise is included in the quantized speech signal. Its spectrum is composed of the spectrum of the original, non-quantized speech signal and the flat spectrum of the quantization noise. With fine quantization, the signal-to-noise ratio is so large that the quantization noise is barely perceptible.
  • With coarse quantization, on the other hand, the signal-to-noise ratio is so small that the quantization noise is perceived as disturbing.
  • A look at the frequency domain shows that the quantization noise covers those parts of the speech signal spectrum that lie in the frequency intervals between the formants. The formants themselves protrude from the quantization noise like mountain peaks.
  • As a remedy, the speech signal is processed before the ΔPCM loop in such a way that the formants are less pronounced.
  • The quantized signal must then undergo an inverse shaping before playback so that it returns to the original sound.
  • The quantization noise then increases in the frequency intervals occupied by formants; there is therefore a redistribution of the quantization noise among the individual frequency intervals. The shaping described is therefore referred to as spectral shaping of the quantization noise (Table 2).
  • the signal-to-noise ratio in the formants may be reduced somewhat compared to the conditions in APC, but only moderately.
  • the ideal compromise is given when the quantization noise between the formants comes just below the level of the speech signal and still remains well below the signal spectrum in the formants. In this case, the quantized speech signal is perceived as practically free of interference (so-called masking effect).
  • The spectral shaping of the quantization noise thus amounts to moderately weakening the formants of the speech signal before it is fed into the ΔPCM loop and amplifying them again to the same extent after decoding. This is done in the encoder by the successive filters 1 and 3; in the ΔPCM loop the prediction filter 9 is used, because its transfer function is matched to the spectrally shaped signal. It has already been mentioned that the filter 1 smoothes the formants present in a signal segment; the inverse filter with the transfer function 1/(1-A(z)) is consequently able to impress the corresponding formants again on a flat spectrum, and a single filter parameter γ, which lies between zero and one, is sufficient to weaken the formants in a controlled manner.
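Replacing z by z/γ in A(z) amounts to scaling each coefficient a_i by γ^i (bandwidth expansion). A minimal sketch of the encoder-side cascade of filters 1 and 3, with γ = 0.8 as a purely illustrative value, since the text only states that the parameter lies between zero and one:

```python
import numpy as np
from scipy.signal import lfilter

def shape_spectrum(segment, a, gamma=0.8):
    """Moderately weaken the formants before the loop (filters 1 and 3).

    1 - A(z) flattens the formants completely; 1/(1 - A(z/gamma))
    re-impresses them only partially, since A(z/gamma) has the
    coefficients a_i * gamma**i.
    """
    a = np.asarray(a, dtype=float)
    a_gamma = a * gamma ** np.arange(1, len(a) + 1)
    flat = lfilter(np.r_[1.0, -a], [1.0], segment)     # filter 1: 1 - A(z)
    return lfilter([1.0], np.r_[1.0, -a_gamma], flat)  # filter 3: 1/(1 - A(z/gamma))
```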
  • The filter 14 for inverse spectral shaping in the decoder should actually have the transfer function (1-A(z/γ))/(1-A(z)), but instead of γ it has the filter parameter δ, which lies between zero and γ; this means that the frequency intervals with a better signal-to-noise ratio are slightly amplified compared to those with a poorer one.
  • The filter 1-A(z/δ) does not smooth the quantized signal completely flat, while the subsequent filter 1/(1-A(z)) impresses the formants to their full extent on a signal with a flat spectrum. Since the formants are partially present in the input signal of the latter filter, they are overemphasized by the filtering, as desired, in comparison with the non-quantized speech signal.
  • An adaptive volume control is designated by g (see also FIG. 9); it is calculated from the k values of the filter and serves to compensate for volume fluctuations caused by the different filter parameters γ and δ.
  • the filters 1, 3 for spectral shaping in the encoder and 14 in the decoder are active in all operating modes, whereby these measures which are essential for the subjectively perceived speech quality do not cause any additional data for storage.
  • The values once selected for the filter parameters γ and δ remain constant during use.
  • processing begins with the calculation of the autocorrelation coefficients; the subsequent decision separates the processing of the quiet from that of the other segments.
  • The autocorrelation coefficient r(0) serves as a measure of the energy contained in a segment; the decision as to whether it is a quiet segment is made by comparison with an adaptively tracked threshold. If a fraction of the autocorrelation coefficient exceeds the threshold, the threshold is raised to the value of that fraction. The decision for a quiet segment is made when the signal power falls below the current threshold.
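A minimal sketch of this decision logic; the fraction and the slow decay of the threshold between segments are illustrative assumptions, since the text only specifies the raise-and-compare behavior:

```python
def quiet_decision(r0, state, fraction=0.01, decay=0.999):
    """Classify a segment as quiet using an adaptively tracked threshold.

    r0 is the autocorrelation coefficient r(0) of the segment, i.e. its
    energy. The threshold is raised whenever fraction * r0 exceeds it;
    the segment is quiet when its power lies below the current threshold.
    """
    state["threshold"] *= decay  # assumed slow decay keeps the threshold adaptive
    if fraction * r0 > state["threshold"]:
        state["threshold"] = fraction * r0
    return r0 < state["threshold"]
```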
  • the processing of the quiet segments comprises the calculation and coding of the coefficients of the near-prediction filter, the filtering 1-A (z) by the filter 1 (FIG. 1) and the calculation and coding of the quantization levels.
  • The filter 1 shown in FIG. 4 is implemented as a so-called lattice filter, the coefficients of which are the so-called reflection coefficients k1, ..., km.
  • Structure and properties of lattice filters are described in the book "Adaptive Filters" by C.F.N. Cowan and P.M. Grant, Prentice Hall, Inc., Englewood Cliffs, New Jersey, 1985, Chapter 5: Recursive Least-Squares Estimation and Lattice Filters, pp. 91-144. Since the filter order in the quiet segments is three, only three reflection coefficients are calculated and the remaining ones are set to zero.
  • the calculation is based on the autocorrelation coefficients that have already been determined, whereby any of the known methods (Durbin-Levinson, Schur, Le Roux - Gueguen) can be used. It is of practical importance that monitoring of the filter stability is included: If the calculation for a reflection coefficient yields a value greater than one, then this and all higher-order coefficients are set to zero.
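A minimal sketch of such a recursion (Durbin-Levinson form) including the stability monitoring just described; the autocorrelation values r(0)..r(order) are assumed given:

```python
import numpy as np

def reflection_coefficients(r, order):
    """Durbin-Levinson recursion on autocorrelation values r[0..order].

    Returns the reflection coefficients k_1..k_order. If a coefficient
    of magnitude >= 1 appears, it and all higher-order coefficients
    remain zero, guaranteeing a stable filter.
    """
    r = np.asarray(r, dtype=float)
    a = np.zeros(order + 1)  # direct-form predictor coefficients
    k = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        ki = acc / err
        if abs(ki) >= 1.0:  # stability monitoring
            break
        k[i - 1] = ki
        a_new = a.copy()
        a_new[i] = ki
        a_new[1:i] = a[1:i] - ki * a[i - 1:0:-1]
        a = a_new
        err *= 1.0 - ki * ki
    return k
```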
  • In a first step, the calculated values are limited to value ranges that are relevant in practice, namely intervals into which 99% of all values in an extensive speech sample fall. If a calculated coefficient falls below the minimum or exceeds the maximum value, the tabulated extreme value is processed in its place. This limitation is not shown in the flowchart of FIG. 3, but it results in a more efficient use of the bits available for coding the coefficients.
  • the further steps include the calculation of the so-called log area ratio and the linear quantization / coding of these values. These two steps have the effect that the finite number of discrete values for each reflection coefficient which are possible as a result of the coding are distributed so sensibly over the value ranges mentioned that the rounding errors which result when the coefficients are quantized have as little noticeable effect on the reproduction signal as possible.
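A minimal sketch of the log-area-ratio step for a single reflection coefficient; the range limit and the uniform grid used here are illustrative, not the tabulated values:

```python
import numpy as np

def lar_quantize(k, bits, k_max=0.9999):
    """Quantize one reflection coefficient via its log area ratio.

    LAR = log((1+k)/(1-k)) stretches the sensitive region near |k| = 1,
    so a linear quantizer distributes its levels sensibly. Returns the
    code word and the decoded coefficient (inverse LAR is tanh(LAR/2)).
    """
    k = float(np.clip(k, -k_max, k_max))
    lar = np.log((1.0 + k) / (1.0 - k))
    lar_max = np.log((1.0 + k_max) / (1.0 - k_max))
    step = 2.0 * lar_max / (2 ** bits - 1)
    code = int(round((lar + lar_max) / step))
    return code, float(np.tanh((code * step - lar_max) / 2.0))
```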
  • the quantized filter coefficients, and thus identical filters, are used in the encoder and decoder, which is essential for high signal quality.
  • two quantization levels are calculated for the quiet segments, the first level being valid for the first 10 ms and the second level being valid for the second 10 ms of the segment which has a total of 144 samples.
  • The quantization levels result from the mean absolute values of the signal values in the sub-segments. Four bits are available for coding each level. A square-root quantization characteristic is used, which results in a finer resolution for weak signals than for the louder signal elements.
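A minimal sketch of this level coding; the full-scale value is an assumption:

```python
import numpy as np

def encode_level(subsegment, bits=4, full_scale=1.0):
    """Level = mean absolute value, coded with a square-root characteristic.

    The square-root compression gives finer resolution for weak signals
    than for loud ones. Returns the code word and the decoded level.
    """
    level = float(np.mean(np.abs(subsegment)))
    x = np.sqrt(min(level, full_scale) / full_scale)      # compress
    code = int(round(x * (2 ** bits - 1)))
    decoded = (code / (2 ** bits - 1)) ** 2 * full_scale  # expand
    return code, decoded
```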
  • FIG. 5 illustrates the data format with which the parameters of a quiet segment are stored.
  • the background is covered with stripes, the width of which corresponds to one bit.
  • The log area ratios of the first and second reflection coefficients k1 and k2 are encoded with five bits each, that of the third reflection coefficient k3 with four bits.
  • the two quantization levels q1 and q2 are also coded with four bits each, so that the total amount of data amounts to 24 bits.
  • the data formats of the remaining segments are selected as integer multiples of 24; it is an adaptation to the word width of the Motorola signal processor DSP 56000.
  • the fricative segments are processed if the pitch analysis following filtering 1-A (z) does not detect a voiced signal curve and the autocorrelation coefficient r (1) is less than zero. This latter condition means that there is more energy in the higher-frequency part of the short-term spectrum than in the part with the lower frequencies, which in turn means that it is a hissing sound or breathing noises.
  • The processing of the fricative segments differs from that of the quiet segments in two ways: On the one hand, the filter 1-A(z) has a higher filter order, namely eight, as with the normal and voiced segments. On the other hand, the number of quantization levels in the adaptive quantization is three, likewise in accordance with the conditions in the normal and voiced segments.
  • the processing of the eight reflection coefficients comprises the steps already explained for the quiet segments: limitation of the value ranges, calculation of the log area ratio, quantization with linear characteristic and back calculation.
  • a difference to the quiet segments is that the first three coefficients are encoded with a higher resolution.
  • the three quantization levels are then calculated; they are coded in the same way as for the quiet segments.
  • the data format of the fricative segments is shown in FIG. 6.
  • the coding of the first four reflection coefficients k1 to k4 is carried out with seven, six, five and four bits, that of the last four k5 to k8 with three bits each. Together with the code word for the operating mode and with the three quantization levels, this results in a data volume of 48 bits.
  • The processing of the normal segments likewise only takes place after a pitch examination that has not detected a voiced signal curve.
  • the class of normal segments then includes all those segments that do not meet the condition r (1) less than zero for a fricative segment.
  • the processing of normal segments differs from that of fricative segments in that the sign bits of the individual signal values are determined and saved in the ⁇ PCM loop.
  • Before the ΔPCM loop, the spectral shaping of the input signal is completed by the filtering 1/(1-A(z/γ)) (filter 3, FIG. 1).
  • The filter 3 (FIG. 7) is again a lattice filter, but with the structure complementary to that of filter 1 (FIG. 4), the filter parameter γ being placed before each delay element z⁻¹.
  • FIG. 8 shows the structure of the near-prediction filter 9 (FIG. 1) in the ΔPCM loop. It is again a lattice filter with a structure similar to that of filter 1 (FIG. 4).
  • As with filter 1, the input signal on the upper signal path arrives at the output without delay and without scaling, so that the component A(z) corresponds to the sum of the partial signals passing from the lower to the upper signal path.
  • the prediction filter of FIG. 8 forms the estimated values.
  • The filter parameter γ is again implemented as a multiplier before each delay element z⁻¹.
  • the data format of the normal segments is an extension of the data format of the fricative segments, with the sign bits determined in the ⁇ PCM loop being added as additional data. According to the subdivision of the segments into three sub-segments, these are combined in three groups of 48 bits each, which results in a total data amount of 192 bits.
  • The starting point for the detection of the voiced segments is the calculation of the correlation coefficients (pitch analysis, FIG. 3), the squared coefficient being calculated so that the square root can be dispensed with in the signal processor.
  • the possible pitch periods are limited to 14 to 141 sampling intervals, i.e. to 128 possible values, which leads to a 7-bit code word for the pitch period.
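A minimal sketch of this search; the normalization shown is an assumption, the text stating only that the squared correlation is used so that the square root can be avoided:

```python
import numpy as np

def pitch_analysis(x, lags=range(14, 142)):
    """Search the 128 admissible lags for the strongest correlation.

    Returns the best lag M, the squared normalized correlation and the
    sign of the raw correlation (needed for the voiced decision).
    """
    x = np.asarray(x, dtype=float)
    best_M, best_rho2, best_sign = 14, -1.0, 0.0
    for M in lags:
        a, b = x[M:], x[:-M]
        c = float(np.dot(a, b))
        denom = float(np.dot(a, a) * np.dot(b, b))
        rho2 = c * c / denom if denom > 0.0 else 0.0
        if rho2 > best_rho2:
            best_M, best_rho2, best_sign = M, rho2, np.sign(c)
    return best_M, best_rho2, best_sign
```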
  • The decision for a voiced segment depends on three conditions: first, the squared value of the largest correlation coefficient must exceed a certain minimum value; then it must be a positive correlation; and finally the quotient corresponding to the coefficient of a first-order prediction filter must not exceed a maximum value of 1.3. This last condition prevents the use of a prediction filter with very large gain, such as can sometimes result in voiced segments, and thereby protects the coding algorithm from possible instability.
  • A decision for a voiced segment made in the manner described is only preliminary and means that in the next step the prediction coefficients β-1, β0 and β+1 are calculated for a transversal pitch filter B(z). Following the calculation of the filter coefficients, the final decision for or against processing as a voiced segment is made.
  • When calculating the coefficients of the remote prediction filter or pitch predictor, it is assumed that the basic period M of the quasi-periodic excitation of the voiced sounds is already known from the pitch examination. The filter coefficients sought then result as the solution of a familiar optimization task in which the sum of the squared errors is minimized. Owing to the symmetrical structure of the matrix appearing in the equation, the solution can be calculated efficiently using the so-called Cholesky decomposition.
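A minimal sketch of this least-squares fit for the three taps, assuming the basic period M from the pitch examination; the normal-equation setup and the explicit Cholesky solve follow the description above:

```python
import numpy as np

def pitch_coefficients(x, M):
    """Fit the taps beta_-1, beta_0, beta_+1 of B(z) by least squares.

    Minimizes sum_n (x[n] - sum_j beta_j * x[n - M - j])^2 over
    j in {-1, 0, +1}; the symmetric 3x3 normal-equation matrix is
    solved via its Cholesky decomposition.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(M + 1, len(x))                      # indices with full history
    P = np.stack([x[n - M - j] for j in (-1, 0, 1)])  # rows: lags M-1, M, M+1
    A = P @ P.T                                       # symmetric 3x3 matrix
    b = P @ x[n]
    L = np.linalg.cholesky(A)                         # A = L @ L.T
    return np.linalg.solve(L.T, np.linalg.solve(L, b))
```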
  • the filter coefficients are quantized using the previous conversions, extreme value limits and resolution according to Table 3. In exceptional cases, if the sum of the three filter coefficients is less than the tabulated minimum value of 0.1, the previous decision in favor of a voiced segment is dropped, but otherwise definitely confirmed.
  • the processing of the voiced segments differs from that of the normal segments by the additional use of the remote prediction filter in the ⁇ PCM loop.
  • the effect of the additional predictor must be taken into account appropriately, which is done by the previous filtering 1-B (z) of the signal that is otherwise used directly for the calculation.
  • the quantization levels are calculated in the manner indicated in the flowchart in FIG. 3, and their coding is carried out as in the other segments.
  • the coding of the pitch period and the coefficients of the remote prediction filter results in an additional 24 bits in addition to the data amount of the normal segments.
  • The decoder (FIG. 2) contains, in addition to parts which the coder also contains in terms of function, two special elements that do not occur in the coder: the noise source 13 and the filter 14.
  • The noise source is a 24-bit linear feedback shift register that generates a maximum-length sequence of length 2²⁴-1, in which the individual bits appear in pseudo-random order.
  • The definition of the shift register, that is, the arrangement of the XOR feedback, is taken from the book "Error-Correcting Codes" by W.W. Peterson and E.J. Weldon, MIT Press, Cambridge, Massachusetts, 1972; Appendix C: Tables of Irreducible Polynomials over GF(2), pp. 472-492.
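A minimal sketch of such a generator. The tap set (24, 23, 22, 17) used here is one commonly tabulated maximal-length choice and is an assumption of this sketch; the patent's actual feedback arrangement is the one taken from the tables of Peterson and Weldon:

```python
def lfsr24(state=0x5A5A5A):
    """24-bit maximal-length LFSR; yields one pseudo-random bit per step.

    With a maximal-length tap set the state runs through all 2**24 - 1
    nonzero values before repeating. The seed must be nonzero.
    """
    while True:
        # XOR of the tapped bits (tap positions 24, 23, 22 and 17, 1-indexed)
        bit = ((state >> 23) ^ (state >> 22) ^ (state >> 21) ^ (state >> 16)) & 1
        state = ((state << 1) | bit) & 0xFFFFFF
        yield bit
```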
  • The mean absolute value of the successive random numbers is ½. Multiplication by the quantization level, which in turn was calculated as a mean absolute value, results in a synthetic excitation signal that is systematically too low by 6 dB; this sensibly compensates for the effects of the fixed high-pass pre-filter and the adaptive formant overemphasis, which doubly reinforce the fricative segments. Furthermore, this reduction in signal power in the quiet segments is subjectively perceived as an increase in quality.
  • The adaptive filter 14, the structure of which is shown in FIG. 9, is used for the inverse spectral shaping and the overemphasis of the formants. It is a series connection of the two filter structures shown in FIGS. 4 and 7. If the parameter δ in the first sub-filter is given a slightly smaller value than the parameter γ in the encoder, the formants partially present in the decoded speech signal are not completely smoothed out. The subsequent second sub-filter can impress the formants contained in the original signal to their full extent on a signal with a flat spectrum. Its application to the signal with a not completely flat spectrum brings about the desired overemphasis of the dominant signal components.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
EP19910103907 1990-03-22 1991-03-14 Method and apparatus for speech digitizing Ceased EP0449043A3 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CH956/90 1990-03-22
CH95690A CH680030A5 (fr) 1990-03-22 1990-03-22

Publications (2)

Publication Number Publication Date
EP0449043A2 true EP0449043A2 (fr) 1991-10-02
EP0449043A3 EP0449043A3 (en) 1992-04-29

Family

ID=4199089

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19910103907 Ceased EP0449043A3 (en) 1990-03-22 1991-03-14 Method and apparatus for speech digitizing

Country Status (3)

Country Link
EP (1) EP0449043A3 (fr)
CH (1) CH680030A5 (fr)
FI (1) FI911010A (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5933803A (en) * 1996-12-12 1999-08-03 Nokia Mobile Phones Limited Speech encoding at variable bit rate
EP0588932B1 (fr) * 1991-06-11 2001-11-14 QUALCOMM Incorporated Vocodeur a vitesse variable

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ICASSP '88 (1988 International Conference on Acoustics, Speech, and Signal Processing, 11-14 April 1988), Vol. 1, pp. 631-634, IEEE, New York, US; D.J. Zarkadis et al.: "A 16 kb/s APC system with adaptive postfilter and evaluation of its performance" *
ICASSP '89 (1989 International Conference on Acoustics, Speech, and Signal Processing, Glasgow, 23-26 May 1989), Vol. 1, pp. 156-159, IEEE, New York, US; T. Taniguchi et al.: "Multimode coding: Application to CELP" *
ICC '87 (IEEE International Conference on Communications '87, Seattle, Washington, 7-10 June 1987), Vol. 1, pp. 418-424, IEEE, New York, US; Y. Yatsuzuka et al.: "Hardware implementation of 9.6/16 kbit/s APC-MLQ speech codec and its applications for mobile satellite communications" *

Also Published As

Publication number Publication date
FI911010A (fi) 1991-09-23
FI911010A0 (fi) 1991-02-28
CH680030A5 (fr) 1992-05-29
EP0449043A3 (en) 1992-04-29


Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE DE DK ES FR GB IT NL SE

PUAL Search report despatched

Free format text: ORIGINAL CODE: 0009013

AK Designated contracting states

Kind code of ref document: A3

Designated state(s): AT BE DE DK ES FR GB IT NL SE

17P Request for examination filed

Effective date: 19920817

GRAG Despatch of communication of intention to grant

Free format text: ORIGINAL CODE: EPIDOS AGRA

17Q First examination report despatched

Effective date: 19951213

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 19960624