EP0449043A2 - Procédé et dispositif pour la numérisation de la parole (Method and device for the digitization of speech) - Google Patents
- Publication number: EP0449043A2 (application EP91103907A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- segments
- signal
- filter
- speech
- coding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/06—Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
Definitions
- The 100 percent quality corresponds to the well-known logarithmic pulse code modulation with a bit rate of 64 kilobits per second, which lies at the upper end of the range important for radio and telephony, 2.4 to 64 kbit/s.
- Logarithmic pulse code modulation belongs to the class of so-called waveform encoders, whose principle is to approximate each individual sample as closely as possible.
- The coding of the samples can be done in different ways, for example such that it depends on the previous sample or on parameters derived from the previous samples. In this way one can exploit characteristics of the speech signal, improve the effectiveness of the processing, and reduce the bit rate. Knowing the correlation function of a speech signal section, one can calculate an optimal filter that provides the best estimates for predicting a sample from previous samples. This filter is used in a feedback loop in order to obtain a quantization noise with a flat spectrum, that is to say without speech modulation.
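The step from the correlation function to the optimal predictor can be sketched in plain Python. This is a minimal illustration, not the patent's implementation: the helper names are hypothetical, and the optimal coefficients are obtained here by solving the normal equations with ordinary Gaussian elimination.

```python
import math
import random

def autocorr(x, max_lag):
    """Autocorrelation coefficients r(0)..r(max_lag) of one segment."""
    return [sum(x[i] * x[i - k] for i in range(k, len(x))) for k in range(max_lag + 1)]

def solve_normal_equations(r, order):
    """Solve the normal equations R a = r for the optimal predictor
    (plain Gaussian elimination; a Toeplitz solver would be faster)."""
    A = [[r[abs(i - j)] for j in range(order)] + [r[i + 1]] for i in range(order)]
    for col in range(order):                      # forward elimination
        for row in range(col + 1, order):
            f = A[row][col] / A[col][col]
            for j in range(col, order + 1):
                A[row][j] -= f * A[col][j]
    a = [0.0] * order
    for i in reversed(range(order)):              # back substitution
        a[i] = (A[i][order] - sum(A[i][j] * a[j] for j in range(i + 1, order))) / A[i][i]
    return a

# A strongly correlated 144-sample test segment: damped sinusoid plus a little noise.
rng = random.Random(0)
x = [math.exp(-0.01 * n) * math.sin(0.3 * n) + 0.01 * rng.gauss(0.0, 1.0) for n in range(144)]

r = autocorr(x, 8)
a = solve_normal_equations(r, 8)

# Residual e(n) = x(n) - sum_k a_k x(n-k): what the filtering 1-A(z) leaves over.
e = [x[n] - sum(a[k] * x[n - 1 - k] for k in range(8)) for n in range(8, len(x))]
power = lambda s: sum(v * v for v in s) / len(s)
print(power(e) / power(x))   # residual power is far below the signal power
```

The residual carries only the unpredictable part of the segment, which is why it can be quantized much more coarsely than the signal itself.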
- In contrast to waveform coding there is the so-called source coding, which in connection with speech coding is called vocoding in English.
- The only concern here is to generate a signal during playback that sounds as similar as possible to the original, while the waveform itself, i.e. the individual samples, may differ greatly from the original.
- The signal is analyzed using a replica of the speech production process in order to derive parameters for a speech reproduction. These parameters are digitally transmitted to the receiving end, where they control a synthesis device that corresponds to the model used for the analysis.
- Source coding already reaches 60 to 75% of the full speech quality at 2.4 kilobits per second, but it cannot exceed this saturation value of about 75% even if the bit rate is increased arbitrarily. The reduced quality is mainly noticeable in a not entirely natural sound and in difficult speaker recognition. The reason lies in the overly simple model of speech synthesis.
- The bit rate can be reduced from 64 kilobits to approximately 12 kilobits per second while maintaining the full speech quality, although the complexity of the coding algorithms increases accordingly.
- Below 12 kilobits per second the speech quality of waveform coding declines rapidly.
- The present invention relates to a method for speech digitization using waveform coding, with an encoder for digitization and a decoder for the reconstruction of the speech signal, in which the speech signal is divided into segments in the encoder and processed with the closest possible approximation of the samples, an estimate for each upcoming new sample being calculated from already known samples.
- The invention is intended to close the gap between waveform and source coding in the range from approximately 3.6 to 12 kilobits per second; in other words, a coding method is to be specified whose speech quality is 100% from approximately 6 kbit/s upward, and for which the moderate computational effort customary for waveform coding is sufficient.
- This object is achieved according to the invention in that the estimate is calculated only in one part of the segments, while in the other segments only parameters for a speech simulation in the sense of source coding are derived; furthermore, the individual signal segments are processed with a variable bit rate, the different bit rates being assigned to different operating modes and each signal segment being classified into one of these operating modes.
- The individual speech segments are thus coded with more or fewer bits as required, and a hybrid coding method is obtained in which the methods of source coding and waveform coding are combined.
- The segment-wise processing with different bit rates, together with the signal processing steps upstream and downstream of the signal quantization, leads to an average bit rate of about 6 kilobits per second and to a speech quality that is 100% of that of telephony transmission.
- The corresponding sampling rate is 7200 Hz, the bandwidth 3400 Hz.
- The length of the speech segments is 20 milliseconds, so that a segment comprises 144 samples.
- The invention further relates to a device for performing the above method, with an encoder and a decoder.
- The device is characterized in that the encoder contains an adaptive near-prediction filter for calculating the estimate of the imminent new sample in one part of the segments, an adaptive remote prediction filter for use in voiced signal segments, and means for examining the signal segments and assigning them to the individual operating modes.
- The structure of the speech coder according to the invention with a variable bit rate is thus based on the one hand on the principle of adaptive-predictive coding (APC) and on the other hand on that of the linear predictive coding of the classic LPC vocoder with a bit rate of 2.4 kilobits per second.
- APC: adaptive-predictive coding
- The typical data rates of source coding enable a sufficiently high-quality reproduction for many signal segments. This applies first of all to the clearly perceptible pauses between words and sentences, but also to the short pauses before plosive sounds (p, t, k, b, d and g). The latter are pauses within individual words, for example in the word "Vater" (father) between the a and the t. Such signal intervals are referred to below as quiet segments and are assigned to a first operating mode, mode I. They are encoded with 24 bits, which results in a data rate of 1200 bit/s.
- The sibilants can also be adequately reproduced with a low data rate, preferably 2400 bit/s.
- These sounds have the common property that a continuous stream of air flows from the lungs through the trachea, pharynx and oral cavity, and that at a certain point a narrowing produces air turbulence, the different sibilants differing in the location of this narrowing: for s it is the narrowing between the upper and lower teeth, for f that between the upper teeth and lower lip, and for sch that between the tip of the tongue and the palate. In each case the result is a noise that receives a slightly different spectral coloring according to the geometric arrangement of the speech organs.
- The corresponding signal intervals are referred to below as fricative segments and are assigned to a second operating mode, mode II. They are encoded with 48 bits, which results in the aforementioned data rate of 2400 bit/s.
- The normal segments have no signal properties that would allow particularly economical coding, as the quiet and the fricative segments do.
- On the other hand, the normal segments show nothing that requires additional coding effort, as do the voiced segments, which are explained as the last operating mode immediately below.
- The normal segments are assigned to a third operating mode, mode III, and encoded with 192 bits, which results in a data rate of 9600 bit/s.
- The voiced sounds include all vowels (a, e, i, o, u, ä, ö, ü and y) and diphthongs (au, ei and eu) as well as the nasal sounds (m, n and ng).
- Their common property is the activity of the vocal cords, which modulate the air flow from the lungs by delivering periodic bursts of air. This results in a quasi-periodic waveform.
- The different voiced sounds are characterized by different geometric arrangements of the speech organs, which leads to different spectral colorings. A satisfactory reproduction of the voiced sounds is only possible if, in addition to the coding method for the normal segments, the approximate periodicity is also taken into account. For the voiced sounds, assigned to a fourth operating mode, mode IV, this results in a data volume increased to 216 bits per segment and hence a data rate of 10800 bit/s.
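The per-mode data rates quoted above follow directly from the bit budgets and the 20 ms segment length; a quick arithmetic check:

```python
# Bits per 20 ms segment for the four operating modes described above.
SEGMENT_SECONDS = 0.020
MODE_BITS = {"I (quiet)": 24, "II (fricative)": 48, "III (normal)": 192, "IV (voiced)": 216}

for mode, bits in MODE_BITS.items():
    print(f"Mode {mode:>15}: {bits:3d} bits -> {bits / SEGMENT_SECONDS:.0f} bit/s")

# 7200 Hz sampling and 20 ms segments give 144 samples per segment.
print(round(7200 * SEGMENT_SECONDS), "samples per segment")
```

This reproduces the 1200, 2400, 9600 and 10800 bit/s figures of modes I to IV and the 144 samples per segment.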
- A prerequisite for the use of the different operating modes with their respective data rates is a signal analysis which classifies each signal segment into one of the operating modes I to IV and initiates the appropriate signal processing.
- The structure of the encoder with variable bit rate is based on the one hand on the principle of adaptive-predictive coding (APC) and on the other hand on that of the classic 2.4 kbit/s LPC vocoder.
- A detailed description of adaptive-predictive coding can be found in the book "Digital Coding of Waveforms" by N.S. Jayant and P. Noll, Prentice Hall, Inc., Englewood Cliffs, New Jersey 1984; Chapter 6: Differential PCM, pp. 252-350; Chapter 7: Noise Feedback Coding, pp. 351-371.
- The encoder contains at its input a high-pass filter HP and at its output an adaptive filter 1 with the transfer function 1-A(z) and a stage 2 designated near correlation.
- The signal path leads from the output of filter 1 to an adaptive pre-filter 3 with the transfer function 1/(1-A(z/γ)), then to a stage 4 designated remote correlation, to an adaptive filter 5 with the transfer function 1-B(z), and to a stage 6 designated level calculation.
- The circuit further contains a multiplexer 7, four summation points, an adaptive quantizer 8, a filter 9 with the transfer function A(z/γ) and a filter 10 with the transfer function B(z).
- The decoder contains a demultiplexer 11, a decoder/quantizer 12, a noise source 13, three summation points, a filter 9 with the transfer function A(z/γ), a filter 10 with the transfer function B(z) and an adaptive post-filter 14 with the transfer function (1-A(z/β))/(1-A(z)).
- Table 2 below shows which basic algorithmic elements the encoder contains and which of the circuit elements shown in FIGS. 1 and 2 perform these functions:
- The near-prediction filter, which is also referred to as the predictor of linear predictive coding (LPC predictor), calculates an estimate for the imminent new sample from a few already known samples.
- The transfer function of the near-prediction filter is usually denoted A(z).
- The filter works segment by segment with a different transfer function adapted to the signal curve; because the waveform of a speech signal is constantly changing, new filter coefficients have to be calculated for each signal segment. This calculation is carried out in the near correlation stage labeled 2.
- A residual signal results which consists of the linearly unpredictable signal components.
- The transfer function of this filtering is 1-A(z). Owing to its unpredictability, the residual signal has properties of a random process, which can be seen in its approximately flat spectrum.
- The adaptive filter 1 thus has the remarkable property of smoothing out the characteristic resonances of the sound, that is to say the so-called formants.
- The filtering 1-A(z) with the filter 1 arranged at the input of the encoder takes place in each of the four operating modes (Table 1). Different filter orders are used for the different operating modes: the filter has order three for the quiet segments (mode I) and order eight for the other operating modes.
- The prediction coefficients are also required in the further course of the coding, by the filters 3 and 9, which is symbolized in FIG. 1 by the broad arrows characterizing the data flow. Likewise, the prediction coefficients are used in the decoding of FIG. 2 for the filters 9 and 14. However, since the prediction filter can only be calculated in the encoder when the signal segment is present, the calculated coefficients must be encoded and stored together with further digital information so that the decoder can reconstruct the signal.
- This coding of the coefficients is intended in FIG. 1 as a component of the near correlation stage 2. Their storage is symbolized by the data arrow to the multiplexer 7. The prediction coefficients then pass from the demultiplexer 11 in FIG. 2 along the drawn-in data arrows to the filters 9 and 14.
- The adaptive remote prediction filter is also referred to as the pitch predictor, in accordance with the English name for the fundamental frequency of the periodic excitation signal present in voiced sounds. Its use only makes sense in voiced segments (mode IV), and the actual filtering is always preceded by a signal analysis that decides for or against its use. This analysis takes place in the remote correlation stage 4. Further tasks of this stage are the calculation and coding of the coefficients of the remote prediction filter, which, like those of the near-prediction filter, must be stored as part of the digital information so that the decoder can reconstruct the waveform in voiced segments.
- The transfer function of the remote prediction filter is designated B(z). It is implemented as a transversal filter; its filter order is three. In contrast to the near-prediction filter, it does not operate on the immediately preceding signal values, but on those at a distance of one basic period M of the periodic excitation signal.
- The determination of M, also referred to as the pitch period, is a further task of the remote correlation stage 4.
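The idea of a pitch predictor operating one basic period away can be sketched as follows. The helper is hypothetical and simplified (a plain normalized-correlation search over the 128 candidate lags used later in the text); the patent's own search and coefficient calculation differ in detail.

```python
import math

def estimate_pitch_period(x, lo=14, hi=141):
    """Search the candidate lags lo..hi (128 values, hence a 7-bit code word)
    for the lag with the largest normalized correlation. Illustrative helper."""
    best_m, best_c = lo, -1.0
    for m in range(lo, hi + 1):
        seg, lagd = x[m:], x[:len(x) - m]
        num = sum(p * q for p, q in zip(seg, lagd))
        den = math.sqrt(sum(p * p for p in seg) * sum(q * q for q in lagd))
        c = num / den if den else 0.0
        if c > best_c:
            best_m, best_c = m, c
    return best_m, best_c

# A strictly periodic "voiced" test signal with pitch period M = 80 samples.
x = [math.sin(2 * math.pi * n / 80) + 0.3 * math.sin(4 * math.pi * n / 80) for n in range(400)]
M, c = estimate_pitch_period(x)
print(M)   # 80

# A three-tap pitch predictor B(z) acts on samples one pitch period away;
# here with illustrative taps (0, 1, 0), i.e. the prediction is simply x(n - M).
b = (0.0, 1.0, 0.0)
e = [x[n] - (b[0] * x[n - M + 1] + b[1] * x[n - M] + b[2] * x[n - M - 1])
     for n in range(M + 1, len(x))]
```

For an exactly periodic signal the pitch-prediction residual is essentially zero; for real voiced speech it merely becomes much smaller than the signal.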
- The adaptive quantizer (Table 2) is composed of the level calculation 6 and the quantizer 8. Its mode of operation is similar to that of a conventional analog/digital converter, with the difference that the adaptive quantizer does not work with a constant maximum signal amplitude, but uses a variable value that is periodically determined anew in the level calculation 6.
- The level calculation, which is carried out in all operating modes, divides each signal segment into sub-segments and calculates a new level value adapted to the signal curve for each sub-segment. Quiet segments are divided into two sub-segments, the rest into three. The level values are also encoded and stored.
- The quantization and coding of the individual signal values takes place in the quantizer 8 with only a single bit per signal value, a positive signal value being coded with 1 and a negative signal value with 0. These data thus have the meaning of sign bits.
- The signal values at the output of quantizer 8 are the positive current level value for code word 1 and the negative current level value for code word 0.
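The one-bit quantization just described reduces to a sign test on encoding and a choice between plus and minus the current level value on decoding; a minimal sketch:

```python
def encode_signs(residual):
    """One bit per sample: 1 for a non-negative value, 0 for a negative one."""
    return [1 if v >= 0 else 0 for v in residual]

def decode_signs(bits, level):
    """Reconstruct each sample as plus or minus the current quantization level."""
    return [level if b == 1 else -level for b in bits]

residual = [0.4, -1.2, 0.9, -0.1, 2.0]
level = 0.8                          # level value from the level calculation (stage 6)
bits = encode_signs(residual)
print(bits)                          # [1, 0, 1, 0, 1]
print(decode_signs(bits, level))     # [0.8, -0.8, 0.8, -0.8, 0.8]
```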
- The quantization of the individual signal values takes place only in the normal and voiced segments. This leads to the remarkably low data rates of the quiet and fricative signal segments.
- Accordingly, the decoder/quantizer 12 receives sign bits for the reconstruction of the individual signal values only in the normal and voiced segments.
- In the quiet and fricative segments the noise source 13 is active instead; it supplies a pseudo-random signal of constant power, the values of which are multiplied by the current level value. This locally generated signal enables a qualitatively adequate reproduction of the quiet and fricative segments.
- ΔPCM loop: The signal paths in Fig. 1 with the quantizer 8, the predictors 9 and 10, and the four summation points are collectively referred to as the ΔPCM loop.
- The incoming speech signal goes directly to the ΔPCM loop, i.e. without passing through the filters 1 and 3, and in the ΔPCM loop the near-prediction filter A(z) is used instead of the filter 9 with the transfer function A(z/γ).
- A prediction value is subtracted from the signal value at the output of the high-pass filter HP; in voiced segments this prediction value is composed of the near and the remote prediction value.
- In non-voiced segments the remote prediction filter makes no contribution.
- In both cases the difference value is quantized, and at the output of the quantizer 8 the prediction value is added to the quantized difference value. This addition yields a quantized speech signal value that approximates the non-quantized speech signal value fed into the ΔPCM loop. In the decoder of FIG. 2, this approximate value is reconstructed using the stored digital information.
- The quantized speech signal then goes directly to the loudspeaker without passing through the filter 14.
- It can be seen from Fig. 1 that the predictors use the quantized speech signal as their input signal and that they are arranged in a feedback loop. It can also be seen that the two predictors work in series: the output signal of the near-prediction filter is subtracted from the quantized speech signal, and this difference reaches the remote prediction filter.
- The quantized difference value differs from the non-quantized one by a slight rounding error.
- The signal of the successive rounding errors is uncorrelated in this case and shows a flat spectrum.
- This so-called quantization noise is contained in the quantized speech signal, whose spectrum is composed of the spectrum of the original, non-quantized speech signal and the flat spectrum of the quantization noise. With fine quantization the signal-to-noise ratio is so large that the quantization noise is barely perceptible.
- With coarse quantization, however, the signal-to-noise ratio is so small that the quantization noise is perceived as disturbing.
- A view in the frequency domain shows that the quantization noise covers parts of the speech signal spectrum, namely the frequency intervals between the formants. The formants themselves protrude from the quantization noise like mountain peaks.
- As a remedy, the speech signal is processed before the ΔPCM loop in such a way that the formants are less pronounced.
- The quantized signal must then undergo an inverse shaping before playback so that it regains the original sound.
- The quantization noise then increases in the frequency intervals occupied by formants; there is therefore a redistribution of the quantization noise among the individual frequency intervals. The shaping described is therefore referred to as spectral shaping of the quantization noise (Table 2).
- The signal-to-noise ratio in the formants may be reduced somewhat compared to the conditions in APC, but only moderately.
- The ideal compromise is reached when the quantization noise between the formants comes just below the level of the speech signal and still remains well below the signal spectrum in the formants. In this case the quantized speech signal is perceived as practically free of interference (the so-called masking effect).
- The spectral shaping of the quantization noise consists in moderately flattening the formants of the speech signal before it is fed into the ΔPCM loop and amplifying them again to the same extent after decoding. This is done in the encoder by the successive filters 1 and 3; in the ΔPCM loop the prediction filter 9 is used because its transfer function is matched to the spectrally shaped signal. It has already been mentioned that the filter 1 smooths the formants present in a signal segment; the inverse filter with the transfer function 1/(1-A(z)) is consequently able to impress the corresponding formants again on a flat spectrum. A single filter parameter γ, which lies between zero and one, is sufficient to weaken the formants in a controlled manner.
- The filter 14 for inverse spectral shaping in the decoder should actually have the transfer function (1-A(z/γ))/(1-A(z)), but instead of γ it has the filter parameter β, which lies between zero and γ. This means that the frequency intervals with a better signal-to-noise ratio are slightly amplified compared with those with a poorer ratio.
- The filter 1-A(z/β) does not smooth the quantized signal completely flat, whereas the subsequent filter 1/(1-A(z)) presupposes a signal with a flat spectrum, at whose output the formants would then be present to the full extent. Since the formants are still partially present in the input signal of the latter filter, they are, as desired, somewhat overemphasized by the filtering in comparison with the non-quantized speech signal.
- An adaptive volume control is designated g (see also FIG. 9); it is calculated from the k values of the filter and serves to compensate for volume fluctuations caused by the different filter coefficients γ and β.
- The filters 1 and 3 for spectral shaping in the encoder and 14 in the decoder are active in all operating modes; these measures, which are essential for the subjectively perceived speech quality, do not generate any additional data for storage.
- The values once selected for the filter parameters γ and β remain constant during operation.
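Evaluating A(z/γ) instead of A(z) amounts to scaling the k-th direct-form coefficient by γ^k, the standard bandwidth-expansion trick behind the shaping filters above; a one-line sketch:

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/γ): each direct-form coefficient a_k (k = 1..order)
    is scaled by gamma**k, pulling the filter poles toward the origin and
    thereby broadening (weakening) the formant resonances."""
    return [coef * gamma ** (k + 1) for k, coef in enumerate(a)]

a = [1.8, -0.9]                       # hypothetical second-order predictor A(z)
print(bandwidth_expand(a, 0.8))       # a1 scaled by 0.8, a2 by 0.64
```

With γ = 1 the filter is unchanged; with γ = 0 it is completely flat, so γ continuously controls how strongly the formants are weakened.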
- Processing begins with the calculation of the autocorrelation coefficients; the subsequent decision separates the processing of the quiet segments from that of the other segments.
- The autocorrelation coefficient r(0) serves as a measure of the energy contained in a segment; the decision as to whether it is a quiet segment is made by comparison with an adaptively tracked threshold Θ. If a fraction of the autocorrelation coefficient exceeds the threshold, the threshold is raised to the value of that fraction. A quiet segment is decided upon when the signal power falls below the current threshold.
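The threshold tracking just described can be sketched as follows. The fraction and the downward drift of the threshold are assumptions for illustration; the patent does not give these constants here.

```python
def classify_quiet(r0_values, fraction=0.01, decay=0.999):
    """Quiet-segment decision against an adaptively tracked threshold.
    `fraction` and `decay` are hypothetical constants, not from the patent."""
    theta, decisions = 0.0, []
    for r0 in r0_values:                 # r0 = autocorrelation coefficient r(0)
        if fraction * r0 > theta:
            theta = fraction * r0        # raise threshold on loud segments
        else:
            theta *= decay               # assumed slow downward drift
        decisions.append(r0 < theta)     # quiet when power falls below threshold
    return decisions

energies = [100.0, 120.0, 0.5, 0.4, 110.0]
print(classify_quiet(energies))          # flags the two low-energy segments as quiet
```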
- The processing of the quiet segments comprises the calculation and coding of the coefficients of the near-prediction filter, the filtering 1-A(z) by the filter 1 (FIG. 1), and the calculation and coding of the quantization levels.
- The filter 1 shown in Fig. 4 is implemented as a so-called lattice filter, the coefficients of which are the so-called reflection coefficients k1, ..., km.
- Structure and properties of lattice filters are described in the book "Adaptive Filters" by C.F.N. Cowan and P.M. Grant, Prentice Hall, Inc., Englewood Cliffs, New Jersey, 1985; Chapter 5: Recursive Least-Squares Estimation and Lattice Filters, pp. 91-144. Since the filter order in the quiet segments is three, only three reflection coefficients are calculated and the remaining ones are set to zero.
- The calculation is based on the autocorrelation coefficients that have already been determined, whereby any of the known methods (Durbin-Levinson, Schur, Le Roux-Gueguen) can be used. Of practical importance is that monitoring of the filter stability is included: if the calculation yields a reflection coefficient with a value greater than one, then this and all higher-order coefficients are set to zero.
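A sketch of the Durbin-Levinson recursion with the stability monitoring described above (here the check is on the magnitude of the reflection coefficient, an assumption on our part):

```python
def levinson_reflection(r, order):
    """Durbin-Levinson recursion returning reflection coefficients k_1..k_order.
    If a reflection coefficient reaches magnitude one, this and all higher-order
    coefficients are set to zero, as the text describes (magnitude test assumed)."""
    a, err, ks = [], r[0], []
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - 1 - j] for j in range(len(a)))
        k = acc / err if err > 0 else 0.0
        if abs(k) >= 1.0:                          # stability monitoring
            ks.extend([0.0] * (order - len(ks)))
            break
        ks.append(k)
        a = [a[j] - k * a[i - 2 - j] for j in range(len(a))] + [k]
        err *= (1.0 - k * k)
    return ks

# Autocorrelation of a hypothetical first-order (AR(1)-like) segment: r(k) = 0.9**k.
r = [0.9 ** k for k in range(4)]
print(levinson_reflection(r, 3))   # first coefficient 0.9, the others essentially zero
```

For a first-order signal only k1 is non-zero, which illustrates why an order-three filter suffices for the quiet segments.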
- In a first step the calculated values are limited to value ranges that are relevant in practice, these being intervals into which 99% of all values in an extensive speech sample fall. If a calculated coefficient falls below the minimum or exceeds the maximum value, the tabulated extreme value is processed in its place. This limitation is not shown in the flowchart of FIG. 3, but it results in a more efficient use of the bits available for coding the coefficients.
- The further steps comprise the calculation of the so-called log area ratio and the linear quantization/coding of these values. These two steps have the effect that the finite number of discrete values possible for each reflection coefficient as a result of the coding is distributed so sensibly over the value ranges mentioned that the rounding errors arising when the coefficients are quantized affect the reproduced signal as little as possible.
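The log-area-ratio step can be sketched as follows; the quantization range used here is an assumption for illustration, not the patent's tabulated range.

```python
import math

def lar(k):
    """Log area ratio of a reflection coefficient, |k| < 1."""
    return math.log((1.0 + k) / (1.0 - k))

def quantize_uniform(value, lo, hi, bits):
    """Linear quantization of the LAR value to 2**bits - 1 steps over [lo, hi]
    (range assumed for illustration)."""
    levels = 2 ** bits - 1
    idx = round((min(max(value, lo), hi) - lo) / (hi - lo) * levels)
    return idx, lo + idx * (hi - lo) / levels

k = 0.7
code, lar_hat = quantize_uniform(lar(k), -4.0, 4.0, 5)
k_hat = (math.exp(lar_hat) - 1.0) / (math.exp(lar_hat) + 1.0)   # back-calculation
print(round(k, 3), round(k_hat, 3))   # small rounding error on k
```

Because the LAR transform expands the region near |k| = 1, a uniform quantizer in the LAR domain spends its resolution where the filter is most sensitive to rounding errors.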
- The quantized filter coefficients, and thus identical filters, are used in the encoder and in the decoder, which is essential for high signal quality.
- Two quantization levels are calculated for the quiet segments, the first being valid for the first 10 ms and the second for the second 10 ms of the segment, which comprises a total of 144 samples.
- The quantization levels result from the mean absolute values of the signal values in the sub-segments. Four bits are available for coding each level. A square-rooted quantization characteristic is used, which results in a finer resolution for weak signals than for the louder signal elements.
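A square-rooted level characteristic can be sketched as follows; the maximum level is a hypothetical normalization, not a value from the patent.

```python
import math

def encode_level(level, max_level=1.0, bits=4):
    """4-bit coding of a quantization level with a square-root characteristic:
    the square root compresses the range, so weak levels get finer resolution.
    max_level is an assumed normalization constant."""
    codes = 2 ** bits - 1
    return round(math.sqrt(min(level, max_level) / max_level) * codes)

def decode_level(code, max_level=1.0, bits=4):
    codes = 2 ** bits - 1
    return (code / codes) ** 2 * max_level

for level in (0.01, 0.04, 0.5):
    c = encode_level(level)
    print(level, c, decode_level(c))
```

The code values are spaced linearly in the square-root domain, so the reconstruction steps grow quadratically toward loud levels, matching the finer resolution for weak signals described above.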
- FIG. 5 illustrates the data format in which the parameters of a quiet segment are stored.
- The background is covered with stripes whose width corresponds to one bit.
- The log area ratios of the first and second reflection coefficients k1 and k2 are encoded with five bits each, that of the third reflection coefficient k3 with four bits.
- The two quantization levels q1 and q2 are also coded with four bits each, so that the total amount of data comes to 24 bits.
- The data formats of the remaining segments are selected as integer multiples of 24 bits; this is an adaptation to the word width of the Motorola signal processor DSP56000.
- The fricative segments are processed if the pitch analysis following the filtering 1-A(z) does not detect a voiced signal curve and the autocorrelation coefficient r(1) is less than zero. This latter condition means that there is more energy in the higher-frequency part of the short-term spectrum than in the lower-frequency part, which in turn indicates a sibilant or a breathing noise.
- The processing of the fricative segments differs from that of the quiet segments in two ways: on the one hand, the filter 1-A(z) has a higher filter order, namely eight, as with the normal and voiced segments. On the other hand, the number of quantization levels in the adaptive quantization is three, likewise in accordance with the conditions in the normal and voiced segments.
- The processing of the eight reflection coefficients comprises the steps already explained for the quiet segments: limitation of the value ranges, calculation of the log area ratio, quantization with a linear characteristic, and back-calculation.
- A difference from the quiet segments is that the first three coefficients are encoded with a higher resolution.
- The three quantization levels are then calculated; they are coded in the same way as for the quiet segments.
- The data format of the fricative segments is shown in FIG. 6.
- The coding of the first four reflection coefficients k1 to k4 is carried out with seven, six, five and four bits, that of the last four k5 to k8 with three bits each. Together with the code word for the operating mode and the three quantization levels, this results in a data volume of 48 bits.
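The 48-bit budget only closes if the operating-mode code word is two bits wide (plausible for four modes, but an inference on our part, as the patent does not state the width here):

```python
# Bit budget of a fricative segment as described: k1..k4 with 7, 6, 5, 4 bits,
# k5..k8 with 3 bits each, three 4-bit quantization levels, plus the mode word.
reflection_bits = [7, 6, 5, 4] + [3] * 4
level_bits = [4] * 3
mode_bits = 2          # assumed width of the operating-mode code word (four modes)
total = sum(reflection_bits) + sum(level_bits) + mode_bits
print(total)           # 48
```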
- The processing of the normal segments likewise takes place only after a pitch examination that has not detected a voiced signal curve.
- The class of normal segments then comprises all those segments that do not meet the condition r(1) less than zero for a fricative segment.
- The processing of the normal segments differs from that of the fricative segments in that the sign bits of the individual signal values are determined in the ΔPCM loop and stored.
- In addition, the spectral shaping of the input signal is completed with the filtering 1/(1-A(z/γ)) (filter 3, Fig. 1).
- the filter 3 (FIG. 7) is again a grating filter, but with the structure complementary to the filter 1 (FIG. 4), the filter parameter ⁇ being prepended to each delay element z ⁇ 1.
- Fig. 8 shows the structure of the near-prediction filter 9 (Fig. 1) in the ⁇ PCM loop. It is again a grating filter with a structure similar to filter 1 (FIG. 5).
- filter 1 the input signal on the upper signal path arrives at the output without delay and without scaling, so that the component A (z) corresponds to the sum of the partial signal coming from the lower to the upper signal path.
- the prediction filter of Fig. 8, by contrast, forms only the estimated values.
- the filter parameter γ is again implemented as a multiplier preceding each delay element z⁻¹.
- the data format of the normal segments is an extension of the data format of the fricative segments, with the sign bits determined in the ΔPCM loop added as further data. According to the subdivision of the segments into three sub-segments, these are combined into three groups of 48 bits each, which results in a total data amount of 192 bits.
- the starting point for the detection of the voiced segments is the calculation of the correlation coefficients (pitch analysis, Fig. 3); the squared coefficient is calculated so that the square-root operation can be dispensed with in the signal processor.
- the possible pitch periods are limited to 14 to 141 sampling intervals, i.e. to 128 possible values, which leads to a 7-bit code word for the pitch period.
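The 7-bit code word follows directly from an offset mapping of the 128 possible periods; the sketch below shows this mapping (the actual bit packing in the patent's data format is not specified here and is an assumption):

```python
PITCH_MIN, PITCH_MAX = 14, 141      # pitch period limits in sampling intervals

def encode_pitch(period):
    """Map a pitch period of 14..141 sampling intervals to a 7-bit code word."""
    assert PITCH_MIN <= period <= PITCH_MAX
    return period - PITCH_MIN       # 141 - 14 + 1 = 128 values -> exactly 7 bits

def decode_pitch(code):
    """Inverse mapping: 7-bit code word back to the pitch period."""
    return code + PITCH_MIN
```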
- the decision for a voiced segment depends on three conditions: first, the squared value of the largest correlation coefficient must exceed a threshold; second, the correlation must be positive; and finally, the quotient corresponding to the coefficient of a first-order prediction filter must not exceed a maximum value of 1.3. The last condition prevents the use of a prediction filter with very high gain, which occasionally produces unnatural-sounding voiced segments, and thereby protects the coding algorithm from possible instability.
- the decision for a voiced segment made in the manner described is only preliminary and means that in the next step the prediction coefficients for the lags -1, 0 and +1 are calculated for a transversal pitch filter B(z). Following the calculation of the filter coefficients, the final decision for or against processing as a voiced segment is made.
- when calculating the coefficients of the remote prediction filter, or pitch predictor, it is assumed that the basic period M of the quasi-periodic excitation of voiced sounds is already known from the pitch analysis. The filter coefficients sought then result as the solution of a familiar optimization task in which the sum of the squared errors is minimized. Owing to the symmetric structure of the matrix appearing in the equation, the solution can be calculated efficiently using the so-called Cholesky decomposition.
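The optimization task above can be sketched as the usual 3x3 normal equations of a three-tap pitch predictor, solved by Cholesky decomposition. This is a sketch under assumptions: the windowing and exact summation limits used in the patent are not given, and the function name is illustrative.

```python
import math

def pitch_predictor_coeffs(x, M):
    """Least-squares taps (lags -1, 0, +1) of the pitch predictor B(z).

    x -- speech samples, M -- basic period from the pitch analysis.
    Minimizes sum_n (x[n] - sum_a beta_a * x[n - M + a])^2 over a in {-1,0,+1}.
    """
    lags = (-1, 0, 1)
    n0, n1 = M + 1, len(x)
    # symmetric normal-equation matrix R and right-hand side r
    R = [[sum(x[n - M + a] * x[n - M + b] for n in range(n0, n1)) for b in lags]
         for a in lags]
    r = [sum(x[n] * x[n - M + a] for n in range(n0, n1)) for a in lags]
    # Cholesky factorization R = L L^T (R is symmetric positive definite)
    L = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(i + 1):
            s = R[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # forward substitution L y = r, then back substitution L^T beta = y
    y = [0.0] * 3
    for i in range(3):
        y[i] = (r[i] - sum(L[i][m] * y[m] for m in range(i))) / L[i][i]
    beta = [0.0] * 3
    for i in reversed(range(3)):
        beta[i] = (y[i] - sum(L[m][i] * beta[m] for m in range(i + 1, 3))) / L[i][i]
    return beta
```

Because the matrix is only 3x3, the Cholesky route costs a handful of operations per segment, which is what makes it attractive on a signal processor.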
- the filter coefficients are quantized using the conversions, limit values and resolutions given in Table 3. In the exceptional case that the sum of the three filter coefficients is less than the tabulated minimum value of 0.1, the preliminary decision in favor of a voiced segment is revoked; otherwise it is definitively confirmed.
- the processing of the voiced segments differs from that of the normal segments by the additional use of the remote prediction filter in the ΔPCM loop.
- the effect of the additional predictor must be taken into account appropriately; this is done by the prior filtering 1-B(z) of the signal that is otherwise used directly for the calculation.
- the quantization levels are calculated in the manner indicated in the flowchart in FIG. 3, and their coding is carried out as in the other segments.
- the coding of the pitch period and the coefficients of the remote prediction filter results in an additional 24 bits in addition to the data amount of the normal segments.
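The bit counts stated for the three segment formats add up as follows. The 14-bit remainder for the operating-mode code word plus the three quantization levels, and the internal split of the 24 extra voiced-segment bits, are inferred from the stated totals rather than given explicitly:

```python
# Reflection coefficients k1..k8: 7, 6, 5, 4 bits, then 3 bits each
coeff_bits = [7, 6, 5, 4] + [3] * 4

# mode word + three quantization levels: 48 - 34 = 14 bits (inferred from the total)
mode_and_levels = 48 - sum(coeff_bits)

fricative = sum(coeff_bits) + mode_and_levels   # 48 bits
normal = fricative + 3 * 48                     # + three sub-segments of 48 sign bits
voiced = normal + 24                            # + 7-bit pitch period and B(z) coefficients
```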
- the decoder (Fig. 2) contains, in addition to parts that functionally also occur in the coder, two special elements not present in the coder: the noise source 13 and the filter 14.
- the noise source is a 24-bit linear feedback shift register that generates a maximum-length sequence of length 2²⁴ - 1, in which the individual bits appear in pseudo-random order.
- the definition of the shift register, that is, the arrangement of the XOR feedback, is taken from the book "Error-Correcting Codes" by W.W. Peterson and E.J. Weldon, MIT Press, Cambridge, Massachusetts, 1972; Appendix C: Tables of Irreducible Polynomials over GF(2), pp. 472-492.
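The maximum-length property of such a register can be demonstrated with a small Fibonacci LFSR. For brevity, the demonstration uses the degree-4 primitive polynomial x⁴ + x³ + 1 (so the full period fits in a quick run); the codec's 24-bit register works identically, with degree-24 feedback taps taken from the cited tables.

```python
def lfsr_states(taps, nbits, state=1):
    """Fibonacci LFSR over GF(2): the XOR of the tapped bits (1-based,
    bit 1 = LSB) is shifted in at the MSB. With a primitive feedback
    polynomial the register visits all 2**nbits - 1 nonzero states.
    Returns the full state sequence of one period."""
    states = []
    for _ in range((1 << nbits) - 1):
        states.append(state)
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = (state >> 1) | (fb << (nbits - 1))
    return states

# taps [1, 4] realize the recurrence s(n+4) = s(n+3) + s(n),
# i.e. the primitive polynomial x^4 + x^3 + 1
demo = lfsr_states([1, 4], 4)
```

Taking the low-order bit of each state as the output yields the pseudo-random bit sequence used as the synthetic excitation.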
- the mean absolute value of the successive random numbers is ½. Multiplication by the quantization level, which in turn was calculated as a mean absolute value, results in a synthetic excitation signal that is systematically too low by 6 dB; this sensibly compensates for the effects of the fixed high-pass pre-filter and the adaptive formant overemphasis, which doubly amplify fricative segments. Furthermore, this reduction in signal power in the quiet segments is subjectively perceived as an increase in quality.
- the adaptive filter 14, the structure of which is shown in Fig. 9, is used for inverse spectral shaping and for overemphasis of the formants. It is a series connection of the two filter structures shown in Figs. 4 and 7. If the parameter γ in the first sub-filter is given a slightly smaller value than in the encoder, the formants partially present in the decoded speech signal are not completely smoothed out. The subsequent second sub-filter can impress the full extent of the formants contained in the original signal on a signal with a flat spectrum; its application to the signal with a not completely flat spectrum brings about the desired overemphasis of the dominant signal components.
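The series connection described above can be sketched in direct form: the reflection coefficients are converted to predictor taps (Levinson step-up recursion), the taps are bandwidth-expanded with two different parameters g1 &lt; g2, and the cascade (1 - A(z/g1)) / (1 - A(z/g2)) is applied. The direct-form realization and the numeric parameter values are illustrative assumptions; the patent uses lattice structures.

```python
def step_up(k):
    """Convert reflection coefficients to direct-form predictor taps a1..ap."""
    a = []
    for i, ki in enumerate(k):
        # Levinson step-up: a_j <- a_j - k_i * a_{i-j}, new tap a_i = k_i
        a = [a[j] - ki * a[i - 1 - j] for j in range(i)] + [ki]
    return a

def postfilter(x, k, g1=0.7, g2=0.8):
    """Cascade of 1 - A(z/g1) and 1/(1 - A(z/g2)) for formant overemphasis."""
    a = step_up(k)
    num = [c * g1 ** (i + 1) for i, c in enumerate(a)]   # taps of A(z/g1)
    den = [c * g2 ** (i + 1) for i, c in enumerate(a)]   # taps of A(z/g2)
    y, mem_x, mem_y = [], [0.0] * len(a), [0.0] * len(a)
    for s in x:
        v = s - sum(c * m for c, m in zip(num, mem_x))    # FIR part: 1 - A(z/g1)
        out = v + sum(c * m for c, m in zip(den, mem_y))  # IIR part: 1/(1 - A(z/g2))
        mem_x = [s] + mem_x[:-1]
        mem_y = [out] + mem_y[:-1]
        y.append(out)
    return y
```

Because g1 &lt; g2, the zeros cancel the poles only partially, leaving a mild peak at each formant, which is the "overemphasis of the dominant signal components" the text describes.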
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CH956/90 | 1990-03-22 | ||
CH95690A CH680030A5 (fr) | 1990-03-22 | 1990-03-22 |
Publications (2)
Publication Number | Publication Date |
---|---|
EP0449043A2 true EP0449043A2 (fr) | 1991-10-02 |
EP0449043A3 EP0449043A3 (en) | 1992-04-29 |
Family
ID=4199089
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP19910103907 Ceased EP0449043A3 (en) | 1990-03-22 | 1991-03-14 | Method and apparatus for speech digitizing |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP0449043A3 (fr) |
CH (1) | CH680030A5 (fr) |
FI (1) | FI911010A (fr) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5933803A (en) * | 1996-12-12 | 1999-08-03 | Nokia Mobile Phones Limited | Speech encoding at variable bit rate |
EP0588932B1 (fr) * | 1991-06-11 | 2001-11-14 | QUALCOMM Incorporated | Vocodeur a vitesse variable |
- 1990-03-22 CH CH95690A patent/CH680030A5/de not_active IP Right Cessation
- 1991-02-28 FI FI911010A patent/FI911010A/fi not_active Application Discontinuation
- 1991-03-14 EP EP19910103907 patent/EP0449043A3/de not_active Ceased
Also Published As
Publication number | Publication date |
---|---|
FI911010A (fi) | 1991-09-23 |
FI911010A0 (fi) | 1991-02-28 |
CH680030A5 (fr) | 1992-05-29 |
EP0449043A3 (en) | 1992-04-29 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PUAI | Public reference made under article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012
 | AK | Designated contracting states | Kind code of ref document: A2; Designated state(s): AT BE DE DK ES FR GB IT NL SE
 | PUAL | Search report despatched | Free format text: ORIGINAL CODE: 0009013
 | AK | Designated contracting states | Kind code of ref document: A3; Designated state(s): AT BE DE DK ES FR GB IT NL SE
 | 17P | Request for examination filed | Effective date: 19920817
 | GRAG | Despatch of communication of intention to grant | Free format text: ORIGINAL CODE: EPIDOS AGRA
 | 17Q | First examination report despatched | Effective date: 19951213
 | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED
 | 18R | Application refused | Effective date: 19960624