US20010001320A1 - Method and device for speech coding - Google Patents

Method and device for speech coding

Info

Publication number
US20010001320A1
Authority
US
United States
Prior art keywords
speech
zero
differ
vector elements
encoding
Prior art date
Legal status
Abandoned
Application number
US09/725,345
Inventor
Stefan Heinen
Wen Xu
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Publication of US20010001320A1

Classifications

    • H04L 1/0057: Block codes (forward error control for detecting or preventing errors in the received information)
    • G10L 19/005: Correction of errors induced by the transmission channel, if related to the coding algorithm
    • G10L 19/10: Determination or coding of the excitation function, the excitation function being a multipulse excitation
    • G10L 19/18: Vocoders using multiple modes
    • G10L 19/002: Dynamic bit allocation
    • G10L 2019/0008: Algebraic codebooks

Abstract

In the novel speech coding method and device, speech signals are coded by a combination of speech parameters and excitation signals. The speech parameters or excitation signals are described with vectors. The vectors are formed by superposing at least two tracks, wherein at least one track has at least two vector elements different from zero. The algebraic signs of the vector elements that differ from zero are coded independently of one another and independently of the positions of the vector elements that differ from zero.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • 1. This is a continuation of copending International Application PCT/EP99/03766, filed May 31, 1999, which designated the United States.
  • BACKGROUND OF THE INVENTION
  • 2. 1. Field of the Invention
  • 3. The invention lies in the communications field. More specifically, the invention relates to methods and devices for speech coding in which the speech signal is encoded by a combination of speech parameters and excitation signals, in particular on the CELP (coded excited linear predictive) principle.
  • 4. Source signals or source information such as voice, sound, picture, and video signals almost always contain statistical redundancy, that is redundant information. This redundancy can be greatly reduced by source encoding, so that efficient transmission or storage of the source signal is made possible. This reduction in redundancy eliminates, prior to transmission, redundant signal contents that are based on the prior knowledge of, for example, statistical parameters of the signal variation. The bit rate of the source-encoded information is also referred to as the encoding rate or source bit rate. Following transmission, in the source decoding, these component parts are once more added to the signal, so that objectively and/or subjectively there is no ascertainable loss in quality.
  • 5. On the other hand, it is customary in signal transmission for redundancy to be specifically added again by channel encoding, in order to eliminate largely the influencing of the transmission by channel interference. Additional redundant bits enable the receiver or decoder to detect errors and possibly also correct them. The bit rate of the channel-encoded information is also referred to as the gross bit rate.
  • 6. To allow information, in particular speech data, picture data or other useful data, to be transmitted as efficiently as possible by means of the limited transmission capacities of a transmission medium, in particular an air interface, this information to be transmitted is consequently compressed prior to the transmission by a source encoding and protected against channel errors by a channel encoding. Different methods are known for these procedures. For example, in the GSM (Global System for Mobile Communication) system, speech can be encoded by means of a full rate speech codec, a half rate speech codec or an enhanced full rate speech codec (EFR).
  • 7. Within the scope of this description, a method of encoding and/or corresponding decoding, which may also comprise source encoding and/or channel encoding is also referred to as a speech codec.
  • 8. As part of the further development of the European GSM mobile radio standard and the development of new mobile radio systems which are based on a CDMA (code division multiple access) method, such as the UMTS (Universal Mobile Telecommunications System) in the process of being standardized, new methods are being developed for encoded speech transmission, making it possible for the entire data rate, and also the dividing of the data rate between the source encoding and channel encoding, to be set adaptively according to the channel state and network conditions (system load). Instead of the speech codecs described above, having a fixed source bit rate, new speech codecs, able to be operated in different codec modes, are to be used here, the codec modes differing with regard to their source bit rate (encoding rate).
  • 9. The main objects of such AMR (adaptive multirate) speech codecs with variable source bit rate or variable encoding rate are to achieve fixed network quality of the speech under different channel conditions and to ensure optimum distribution of the channel capacity with certain network parameters taken into account.
  • 10. 2. Summary of the Invention
  • 11. The object of the invention is to provide a method and a device for speech coding which overcome the above-noted deficiencies and disadvantages of the prior art devices and methods of this kind, and which make it possible to encode speech signals robustly to resist transmission errors and with relatively little expenditure.
  • 12. With the above and other objects in view there is provided, in accordance with the invention, a speech coding method, which comprises:
  • 13. coding a speech signal by a combination of speech parameters and excitation signals;
  • 14. describing the speech parameters or excitation signals by vectors;
  • 15. forming the vectors by superposing at least two tracks, wherein at least one track has at least two vector elements different from zero; and
  • 16. coding algebraic signs of the vector elements that differ from zero independently of one another and independently of the positions of the vector elements that differ from zero.
  • 17. In accordance with an added feature of the invention, the speech parameters or the excitation signals are formed from the speech signals on the CELP principle.
  • 18. In accordance with an additional feature of the invention, the positions of the vector elements that differ from zero are encoded together to form an index value.
  • 19. With the above and other objects in view there is also provided, in accordance with the invention, a device for speech coding. The device includes a processor unit receiving speech signals and being configured to:
  • 20. encode a speech signal by a combination of speech parameters and excitation signals;
  • 21. describe the speech parameters or excitation signals by vectors;
  • 22. form the vectors by superposing at least two tracks;
  • 23. define at least one track with at least two vector elements that differ from zero; and
  • 24. encode an algebraic sign of the vector elements that differ from zero independently of one another and independently of positions of the vector elements that differ from zero.
  • 25. In accordance with again an added feature of the invention, the processor unit is programmed to obtain the speech parameters and excitation signals from the speech signals on the CELP principle.
  • 26. In accordance with a concomitant feature of the invention, the processor unit is programmed to encode the positions of the vector elements that differ from zero together to form an index value.
  • 27. In other words, the invention is premised on the idea of encoding positions of certain predetermined vector elements which differ from zero and the algebraic signs of these vector elements separately from one another for the encoding of vectors for describing speech parameters or excitation signals.
  • 28. The problem is also solved by devices for speech coding in which a digital signal processor is in each case set up in such a way that positions of certain predetermined vector elements which differ from zero and the algebraic signs of these vector elements are encoded separately from one another for the encoding of vectors for describing speech parameters or excitation signals.
  • 29. The invention relates quite generally to methods for speech coding in which the speech signal is encoded by a combination of speech parameters and excitation signals. The speech parameters include parameters or characteristic variables of a statistical model on which the speech production is based, for example LPC or LTP filter coefficients, and the excitation signals are signals of the exciting processes of this model. These processes may be modeled either statistically or deterministically. Examples of statistical modeling are vocoder methods in which the excitation signals are generated by noise sources. In deterministic modeling, the excitation signals are obtained with the aid of the underlying model from the speech signal and are quantized. Examples of this are RPE/LTP (GSM Full Rate Codec), VSELP (GSM Half Rate Codec) and ACELP (GSM Enhanced Full Rate Codec).
  • 30. Other features which are considered as characteristic for the invention are set forth in the appended claims.
  • 31. Although the invention is illustrated and described herein as embodied in a method and device for speech coding, it is nevertheless not intended to be limited to the details shown, since various modifications and structural changes may be made therein without departing from the spirit of the invention and within the scope and range of equivalents of the claims.
  • 32. The construction and method of operation of the invention, however, together with additional objects and advantages thereof will be best understood from the following description of specific embodiments when read in connection with the accompanying drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • 33.FIG. 1 is a block diagram of essential elements in a telecommunications transmission chain;
  • 34.FIG. 2 is a block diagram of an AMR encoder based on the CELP principle; and
  • 35.FIG. 3 is a schematic block diagram of a processor unit.
  • DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • 36. Referring now to the figures of the drawing in detail and first, particularly, to FIG. 1 thereof, there is seen a source Q, which generates source signals qs, which are compressed by a source encoder QE, such as the GSM full rate speech coder, to form symbol sequences comprising symbols. In parametric source encoding methods, the source signals qs generated by the source Q (for example speech) are divided into blocks (for example time frames) and these are processed separately. The source encoder QE generates quantized parameters (for example speech parameters or speech coefficients), which are also referred to hereafter as symbols of a symbol sequence, and which reflect the properties of the source in the current block in a certain way (for example spectrum of the speech in the form of filter coefficients, amplitude factors, excitation vectors). After the quantization, these symbols have a certain symbol value.
  • 37. The symbols of the symbol sequence or the corresponding symbol values are mapped by a binary mapping (allocation specification), which is frequently described as part of the source encoding QE, onto a sequence of binary code words, which in each case have a plurality of bit positions. If these binary code words are, for example, further processed one after the other as a sequence of binary code words, a sequence of source-encoded bit positions which may be embedded in a frame structure is produced. After source encoding carried out in this way, source bits or data bits db with a source bit rate (encoding rate) dependent on the type of source encoding are thus obtained in a structured form in the frame.
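As a minimal illustration of such a binary mapping (a sketch under assumed symbol values and code word lengths, not a mapping prescribed by the patent), quantized symbol values can be written as fixed-length code words and concatenated into the frame's source-encoded bit positions:

```python
def pack_symbols(symbols, bit_widths):
    """Map quantized symbol values onto a sequence of source-encoded bit
    positions (most significant bit first), concatenated as a simple frame."""
    assert len(symbols) == len(bit_widths)
    bits = []
    for value, width in zip(symbols, bit_widths):
        if not 0 <= value < (1 << width):
            raise ValueError("symbol value does not fit its code word")
        bits.extend((value >> shift) & 1 for shift in range(width - 1, -1, -1))
    return bits

# Example: three quantized parameters with 3-, 5- and 4-bit code words -> 12 data bits db.
frame_bits = pack_symbols([5, 19, 7], [3, 5, 4])
```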
  • 38. The term codebook is to be understood here as describing a table with all the quantization representatives. The entries of the table may be both scalar and vectorial quantities.
  • 39. Scalar codebooks can be used for example for the quantization of amplitude factors, since these are generally scalar quantities. Examples of the use of vectorial codebooks are the quantization of LSF (line spectrum frequencies) and the quantization of the stochastic excitation.
  • 40. Referring now to FIG. 2, there is shown a basic representation of a special variant of a source encoder. The exemplary embodiment is a speech coder, namely an AMR encoder based on a CELP (coded excited linear predictive) principle.
  • 41. The CELP principle concerns a method of analysis by synthesis. In this case, a filter structure obtained from the current portion of speech is excited by excitation vectors (code vectors) taken one after the other from a codebook. The output signal of the filter is compared by means of a suitable error criterion with the current portion of speech and the error-minimizing excitation vector is selected. The representation of the filter structure and the position number of the selected excitation vector are transmitted to the receiver.
  • 42. A specific variant of a CELP method uses an algebraic codebook, which is often also referred to as a sparse algebraic code. It is a multipulse codebook which is filled with binary (+/−1) or ternary pulses (0, +/−1). Within the excitation vectors, only a few positions are respectively occupied by pulses. After the selection of the positions, the entire vector is weighted with an amplitude factor. A codebook of this type has several advantages. On the one hand, it does not take up any storage space, since the positions allowed for the pulses are determined by an algebraic computing rule; on the other hand, it can be searched through very efficiently for the best pulse positions on account of the way it is structured.
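A rough sketch of such a sparse multipulse excitation vector (illustrative only; the pulse positions, signs, and gain below are made up, and the patent's algebraic computing rule for the allowed positions is not reproduced here):

```python
def build_excitation(length, pulses, gain):
    """Build a sparse multipulse excitation vector: `pulses` is a list of
    (position, sign) pairs with sign in {+1, -1}; all other samples stay 0.
    The whole vector is finally weighted with the amplitude factor `gain`."""
    vec = [0.0] * length
    for position, sign in pulses:
        vec[position] = float(sign)
    return [gain * x for x in vec]

# Two ternary pulses in a 40-sample subframe, weighted with an amplitude factor of 0.8.
excitation = build_excitation(40, [(5, +1), (23, -1)], gain=0.8)
```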
  • 43. A configurational variant of a conventional CELP encoder is first described below with reference to FIG. 2. A target signal to be approximated is reproduced by searching through two codebooks. A distinction is drawn here between an adaptive codebook (a2), the task of which is the reproduction of the harmonic speech components, and a stochastic codebook (a4), which serves for the synthesis of the speech components which cannot be obtained by prediction. The adaptive codebook (a2) changes according to the speech signal, while the stochastic codebook (a4) is invariant over time. The search for the best excitation code vectors takes place not by searching jointly, i.e. simultaneously, in the codebooks, as would be necessary for an optimum selection of the excitation code vectors, but, for expenditure-related reasons, by initially searching through the adaptive codebook (a2). Once the excitation code vector that is best according to the error criterion has been found, its contribution to the reconstructed target signal is subtracted from the target vector (target signal) and the part of the target signal still to be reconstructed by means of a vector from the stochastic codebook (a4) is obtained. The search in the individual codebooks takes place on the same principle. In both cases, the quotient of the square of the correlation of the filtered excitation code vector with the target vector and the energy of the filtered excitation code vector is computed for all the excitation code vectors. That excitation code vector which maximizes this quotient is regarded as the best excitation code vector, which minimizes the error criterion (a5). The preceding error weighting (a6) weights the error according to the characteristics of human hearing. The position of the found excitation code vector within the excitation codebook is transmitted to the decoder.
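The search criterion can be sketched as follows (a simplified brute-force loop, not the efficient structured search mentioned above; `synth_filter` stands in for the weighted synthesis filtering, which is assumed rather than taken from the patent):

```python
def search_codebook(codebook, target, synth_filter):
    """Analysis-by-synthesis search: return the index of the excitation code
    vector maximizing (correlation with target)^2 / (energy of filtered vector),
    together with the implicitly determined amplitude factor (gain)."""
    best_index, best_quotient, best_gain = -1, float("-inf"), 0.0
    for index, code_vector in enumerate(codebook):
        y = synth_filter(code_vector)                    # filtered excitation code vector
        corr = sum(t * yi for t, yi in zip(target, y))   # correlation with the target vector
        energy = sum(yi * yi for yi in y)                # energy of the filtered vector
        if energy == 0.0:
            continue
        quotient = corr * corr / energy
        if quotient > best_quotient:
            best_index, best_quotient, best_gain = index, quotient, corr / energy
    return best_index, best_gain
```

The by-product `best_gain = corr / energy` corresponds to the implicitly determined amplitude factor discussed in the next paragraph.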
  • 44. The computation of the afore-mentioned quotient has the effect of implicitly determining the correct (codebook) amplitude factor (amplification 1, amplification 2) for each excitation code vector. Once the best candidate has been determined from the two codebooks, the quality-reducing influence of the sequentially performed codebook search can be reduced by a joint optimization of the amplification. This involves re-specifying the original target vector and computing the best amplifications matching the excitation code vectors now selected. These amplifications usually differ slightly from those which were determined during the codebook search.
  • 45. In the case of the CELP principle, each candidate vector can be individually filtered (a3) and compared with the target signal for finding the best excitation code vector.
  • 46. Finally, filter parameters, amplitude factors, and excitation code vectors are converted into binary signals and, embedded in a fixed structure, are transmitted in frames. The filter parameters may be LPC (linear predictive coding) coefficients, LTP (long term prediction) indices, or LTP (long term prediction) amplitude factors.
  • 47. The LPC residual signal or excitation signal is vectorially quantized. To keep the ROM requirement small, an algebraic codebook is used. In other words, the group of quantization representatives is not explicitly present in a table, but instead the various representatives can be determined from an index value with the aid of an algebraic computing rule. This additionally has complexity-related advantages in the codebook search.
  • 48. The algebraic computing rule for the determination of the excitation code vectors from an index value uses a division of the vector space into so-called tracks. Within one track (vector), only components (vector elements) which lie on a track-specific grid can assume values which differ from zero.
  • 49. Positions which differ from zero are referred to as pulse positions. Depending on the codebook, the pulses may have either a binary (−1/+1) or ternary (−1/0/+1) value range. The superposing of individual tracks finally supplies the excitation code vector.
  • 50. This is explained on the basis of the following simple example: an algebraic codebook of dimension 20, formed from two tracks.
  • 51. The symbols are as follows:
  • 52. 0: allowed position in the track, no pulse
  • 53. #: unallowed position in the track
  • 54. +: positive pulse
  • 55. −: negative pulse
    Position:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
    Track1:    0  #  0  #  0  #  0  #  0  #  0  #  +  #  0  #  0  #  0  #
    Track2:    #  0  #  0  #  0  #  0  #  0  #  0  #  0  #  0  #  0  #  0
    Excit:     0  0  0  0  0  0  0  0  0  0  0  0  +  0  0  0  0  0  0  0
  • 56. To encode an entry of this codebook, accordingly one of 10 possible positions and the algebraic sign of the set pulse have to be encoded per track. A more accurate quantization of the excitation signal is achieved with codebooks which have more than one pulse per track, since then the superposing of two pulses is also possible.
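The two-track example above can be reproduced with a short sketch (the grid rule and helper names are illustrative assumptions; the patent's actual algebraic computing rule may differ):

```python
def track_grid(track, dimension=20, num_tracks=2):
    """Track-specific grid: track t may place pulses only at positions
    t, t + num_tracks, t + 2*num_tracks, ... (assumed interleaved grid)."""
    return list(range(track, dimension, num_tracks))

def superpose_tracks(track_pulses, dimension=20):
    """Superpose the tracks: each track contributes its pulses as (position, sign) pairs."""
    excitation = [0] * dimension
    for pulses in track_pulses:
        for position, sign in pulses:
            excitation[position] += sign
    return excitation

print(track_grid(0))                          # [0, 2, ..., 18]: the 10 allowed positions of track 1
print(superpose_tracks([[(12, +1)], []]))     # positive pulse at position 12, track 2 left empty
```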
  • 57. A track (vector) is consequently described by pulses (vector elements which differ from zero). It is possible for the pulses to be described by their position (vector element index) and their algebraic sign. The total set of possible pulse position combinations is described with the aid of an overall index.
  • 58. The code word length required for encoding the excitation vector found is made up of the number of bits required for encoding the pulse positions and algebraic signs. If only one pulse is set per track, not only the number of bits for the encoding of its position but also a further bit for the encoding of its algebraic sign are required.
  • 59. Efficient encoding of the algebraic signs with likewise only one bit for the case in which two pulses are set per track is already used in the GSM-EFR codec. Here, the information of the position sequence is utilized. If two pulses are located in the position grid of the same track, in the case in which both pulses have like algebraic signs, that pulse which assumes the lower position within the grid is encoded first. In the case of different algebraic signs, the pulse which occupies the higher position is encoded first. Only the sign bit of the pulse encoded first is transmitted. The algebraic sign of the second pulse is determined on the decoder side by analysis of the encoding sequence of the pulse positions.
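A toy sketch of this ordering rule (hypothetical helper functions, not the GSM-EFR reference implementation; the degenerate case of opposite-sign pulses at the same position is ignored):

```python
def encode_two_pulses(pos_a, sign_a, pos_b, sign_b):
    """Order the two pulse positions so that the second sign is implied:
    equal signs -> lower position first, different signs -> higher position first.
    Only the sign of the first-encoded pulse is transmitted."""
    if sign_a == sign_b:
        first, second = sorted([(pos_a, sign_a), (pos_b, sign_b)])
    else:
        first, second = sorted([(pos_a, sign_a), (pos_b, sign_b)], reverse=True)
    return first[0], second[0], 0 if first[1] > 0 else 1   # pos1, pos2, sign bit

def decode_two_pulses(pos1, pos2, sign_bit):
    """Recover both algebraic signs from the single sign bit and the position order."""
    sign1 = +1 if sign_bit == 0 else -1
    sign2 = sign1 if pos1 <= pos2 else -sign1
    return (pos1, sign1), (pos2, sign2)

# Round trip: pulses at 3 (+) and 9 (-) are sent as positions (9, 3) plus one sign bit.
print(decode_two_pulses(*encode_two_pulses(3, +1, 9, -1)))
```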
  • 60. This principle is to be illustrated on the basis of a codebook with three pulses per track. In this case, the sign information of the pulses of a track can also be encoded with a single bit, in that the principle of the algebraic sign encoding is extended by deliberate changing of the encoding sequence. This is shown by the following estimate: if one sign bit is transmitted, then with P_T = 3 pulses per track there remain N_VZ = 2^(P_T)/2 = 2^3/2 = 4 possible sign combinations to be conveyed by the encoding sequence.
  • 61. The number N_perm of possible permutations of the pulse positions is N_perm = P_T! = 3! = 6 > N_VZ.
  • 62. As long as the number of possible permutations N_perm is greater than the number of possible sign combinations N_VZ, the sign information of a track can be transmitted by deliberate changing of the encoding sequence of the pulse positions. If, for instance, a codebook which provides four pulses per track were drawn up, the permutation of the encoding sequence alone would be sufficient for the sign encoding; an additional sign bit would not be required.
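A quick numerical check of this estimate (assuming, as above, that one transmitted sign bit halves the number of sign combinations that the position order has to convey):

```python
from math import factorial

# For P_T pulses per track: with one transmitted sign bit, 2**(P_T - 1) sign
# combinations remain to be conveyed by permuting the encoding order; with no
# sign bit at all, 2**P_T combinations would have to be covered.
for p_t in range(1, 6):
    print(p_t, factorial(p_t) >= 2 ** (p_t - 1), factorial(p_t) >= 2 ** p_t)
# P_T = 3: 3! = 6 >= 4 (one sign bit suffices) but 6 < 8;
# P_T = 4: 4! = 24 >= 16, so no sign bit is required at all.
```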
  • 63. This efficient encoding of the sign information of an excitation vector is accomplished, however, at the expense of an increased susceptibility to interference on the transmission channel. In the worst case, the interference of a sign bit of a track causes the reversal of all the algebraic signs in this track. Similarly, the interference of the parameter of a pulse position may affect the algebraic signs of all the pulses of the same track.
  • 64. For this reason, the invention describes a significantly more robust sign encoding. In this case, the overall set of possible pulse positions is addressed by a suitable algebraic method with the aid of a single index. Independently of this, the algebraic signs of the pulses are respectively encoded with a bit.
  • 65. The improved method may be explained by the example of the algebraic codebook for the rate of 9.5 kbit/s. In the case of this codebook, two pulses are set at 14 possible positions. For the encoding scheme with permutation encoding, one bit is required for the first pulse sign and 4 bits are required for each of the two pulse positions, that is a total of 9 bits.
  • 66. The robust encoding method encodes the possible pulse positions independently of the algebraic signs. Since both pulses may also lie at the same position, these are combinations with repetition. It is known from combinatorics that in this case the number of possibilities is (14 + 2 - 1 choose 2) = (15 choose 2) = 105 < 2^7.
  • 67. Since this number is less than 2^7 = 128, seven bits are sufficient for the encoding of the positions. The two algebraic signs are respectively encoded with one bit. In this way, a decoupling of algebraic signs and pulse positions is achieved without increasing the bit rate required for the encoding of the excitation vectors.
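One possible way to realize such a joint position index is an enumerative ranking of the unordered position pair; the ranking below is an illustrative assumption, since the patent does not spell out the exact algebraic computing rule:

```python
def position_index(p1, p2, num_positions=14):
    """Enumerative index of an unordered pair of pulse positions with repetition
    allowed: 0 <= index < C(num_positions + 1, 2) = 105 < 2**7, so 7 bits suffice."""
    a, b = sorted((p1, p2))
    # Pairs are ranked by their smaller position first, then by the larger one.
    return sum(num_positions - k for k in range(a)) + (b - a)

def encode_track(p1, s1, p2, s2):
    """7-bit position index plus one independent sign bit per pulse: 9 bits in total."""
    index = position_index(p1, p2)
    sign_bits = (0 if s1 > 0 else 1, 0 if s2 > 0 else 1)
    return index, sign_bits

print(position_index(13, 13))      # 104, the largest of the 105 indices
print(encode_track(2, +1, 7, -1))  # (index, (0, 1))
```

A single bit error in one of the sign bits now affects only the corresponding pulse, while a corrupted position index leaves the algebraic signs untouched.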
  • 68. In simulations, individual bit positions of the codebook indices were in each case subjected to interference with a 100% error rate and the resulting speech SNR was measured after resynthesis. In these simulations, the more robust encoding reduced the error sensitivity of the algebraic signs by about 3 dB.
  • 69. In a configurational variant of the invention, the different encoding rates mean that different codec modes generally have different frame sizes, and therefore also different structures in which the bit positions serving for describing the filter parameters, amplitude factors or excitation code vectors are embedded.
  • 70. To realize a variable encoding rate, the changing of the encoding rate may be realized by a corresponding changing of the number of bit positions for describing an excitation code vector taken from a stochastic codebook. Switching over of the codec modes consequently leads, as a result of switching over to a stochastic codebook corresponding to the new codec mode, to the selection of excitation code vectors contained in this codebook, for the description of which a correspondingly changed number of bit positions is required. Consequently, there are different stochastic codebooks available according to the number of different codec modes.
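A hedged sketch of such mode-dependent codebook selection (the mode names and the second entry's figures are hypothetical; only the first entry reflects the 9.5 kbit/s example given above, with 7 position bits plus 2 sign bits per track):

```python
from dataclasses import dataclass

@dataclass
class CodebookConfig:
    """Per-mode configuration of the stochastic (algebraic) codebook."""
    pulses_per_track: int
    positions_per_track: int
    position_bits: int   # bits for the joint position index of a track
    sign_bits: int       # one independent bit per pulse

CODEC_MODES = {
    "mode_9.5": CodebookConfig(pulses_per_track=2, positions_per_track=14,
                               position_bits=7, sign_bits=2),
    "mode_low": CodebookConfig(pulses_per_track=1, positions_per_track=10,
                               position_bits=4, sign_bits=1),   # hypothetical low-rate mode
}

def bits_per_track(mode):
    """Number of bit positions needed to describe one track of the excitation code vector."""
    cfg = CODEC_MODES[mode]
    return cfg.position_bits + cfg.sign_bits

print(bits_per_track("mode_9.5"))   # 9
```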
  • 71. Referring now to FIG. 3, there is shown a processor unit PE, which may be contained in particular in a communication device, such as a base station BS or mobile station MS. The unit PE contains a control device STE, which essentially comprises a program-controlled microcontroller, and a processing device VE, which comprises a processor, in particular a digital signal processor, which can both gain writing and reading access to memory chips SPE.
  • 72. The microcontroller controls and monitors all the major elements and functions of a functional unit which includes the processor unit PE. The digital signal processor, part of the digital signal processor or a specific processor is responsible for carrying out the speech coding or speech decoding. The selection of a speech codec may also be performed by the microcontroller or the digital signal processor itself.
  • 73. An input/output interface I/O serves for the input/output of useful or control data, for example to a man-machine interface MMI, which may include a keyboard and/or a display. The individual elements of the processor unit may be connected to one another by a digital bus system BUS.
  • 74. It will be understood by those of skill in the pertinent art that, on the basis of the foregoing description, the invention can also apply to encoding methods other than the CELP encoding method explained in the application.

Claims (6)

We claim:
1. A speech coding method, which comprises:
coding a speech signal by a combination of speech parameters and excitation signals;
describing the speech parameters or excitation signals by vectors;
forming the vectors by superposing at least two tracks, wherein at least one track has at least two vector elements different from zero; and
coding algebraic signs of the vector elements that differ from zero independently of one another and independently of the positions of the vector elements that differ from zero.
2. The method according to claim 1, which comprises forming one of the speech parameters and the excitation signals from the speech signals on the CELP principle.
3. The method according to claim 1, which comprises coding the positions of the vector elements that differ from zero together to form an index value.
4. A device for speech coding, comprising an input for receiving a speech signal, and a processor unit connected to said input and configured to:
code a speech signal by a combination of speech parameters and excitation signals;
describe the speech parameters or excitation signals by vectors;
form the vectors by superposing at least two tracks;
define at least one track with at least two vector elements that differ from zero; and
code an algebraic sign of the vector elements that differ from zero independently of one another and independently of positions of the vector elements that differ from zero.
5. The device according to claim 4, wherein said processor unit is programmed to obtain the speech parameters and excitation signals from the speech signals on the CELP principle.
6. The device according to claim 4, wherein said processor unit is programmed to code the positions of the vector elements that differ from zero together to form an index value.
US09/725,345 1998-05-29 2000-11-29 Method and device for speech coding Abandoned US20010001320A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
EP98109868 1998-05-29
EP98109868.4 1998-05-29
PCT/EP1999/003766 WO1999063522A1 (en) 1998-05-29 1999-05-31 Method and device for voice encoding

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP1999/003766 Continuation WO1999063522A1 (en) 1998-05-29 1999-05-31 Method and device for voice encoding

Publications (1)

Publication Number Publication Date
US20010001320A1 (en) 2001-05-17

Family

ID=8232031

Family Applications (2)

Application Number Title Priority Date Filing Date
US09/725,345 Abandoned US20010001320A1 (en) 1998-05-29 2000-11-29 Method and device for speech coding
US09/725,347 Expired - Lifetime US6567949B2 (en) 1998-05-29 2000-11-29 Method and configuration for error masking

Family Applications After (1)

Application Number Title Priority Date Filing Date
US09/725,347 Expired - Lifetime US6567949B2 (en) 1998-05-29 2000-11-29 Method and configuration for error masking

Country Status (5)

Country Link
US (2) US20010001320A1 (en)
EP (2) EP1093690B1 (en)
CN (2) CN1134764C (en)
DE (2) DE59913231D1 (en)
WO (3) WO1999063520A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE19932943A1 (en) * 1999-07-14 2001-01-18 Siemens Ag Method and device for decoding source signals
US8219940B2 (en) * 2005-07-06 2012-07-10 Semiconductor Insights Inc. Method and apparatus for removing dummy features from a data structure
US7957701B2 (en) * 2007-05-29 2011-06-07 Alcatel-Lucent Usa Inc. Closed-loop multiple-input-multiple-output scheme for wireless communication based on hierarchical feedback
US8207875B2 (en) * 2009-10-28 2012-06-26 Motorola Mobility, Inc. Encoder that optimizes bit allocation for information sub-parts
US9275644B2 (en) * 2012-01-20 2016-03-01 Qualcomm Incorporated Devices for redundant frame coding and decoding
US9700276B2 (en) * 2012-02-28 2017-07-11 Siemens Healthcare Gmbh Robust multi-object tracking using sparse appearance representation and online sparse appearance dictionary update
KR102132522B1 (en) * 2014-02-27 2020-07-09 텔레폰악티에볼라겟엘엠에릭슨(펍) Method and apparatus for pyramid vector quantization indexing and de-indexing of audio/video sample vectors

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5247579A (en) * 1990-12-05 1993-09-21 Digital Voice Systems, Inc. Methods for speech transmission
FI94810C (en) * 1993-10-11 1995-10-25 Nokia Mobile Phones Ltd A method for identifying a poor GSM speech frame
US5970444A (en) * 1997-03-13 1999-10-19 Nippon Telegraph And Telephone Corporation Speech coding method
US6161089A (en) * 1997-03-14 2000-12-12 Digital Voice Systems, Inc. Multi-subframe quantization of spectral parameters

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050065788A1 (en) * 2000-09-22 2005-03-24 Jacek Stachurski Hybrid speech coding and system
US7363219B2 (en) * 2000-09-22 2008-04-22 Texas Instruments Incorporated Hybrid speech coding and system
US20050228653A1 (en) * 2002-11-14 2005-10-13 Toshiyuki Morii Method for encoding sound source of probabilistic code book
US7577566B2 (en) * 2002-11-14 2009-08-18 Panasonic Corporation Method for encoding sound source of probabilistic code book
US8463604B2 (en) 2009-01-06 2013-06-11 Skype Speech encoding utilizing independent manipulation of signal and noise spectrum
US8433563B2 (en) 2009-01-06 2013-04-30 Skype Predictive speech signal coding
US20100174534A1 (en) * 2009-01-06 2010-07-08 Koen Bernard Vos Speech coding
US20100174547A1 (en) * 2009-01-06 2010-07-08 Skype Limited Speech coding
US20100174537A1 (en) * 2009-01-06 2010-07-08 Skype Limited Speech coding
US20100174538A1 (en) * 2009-01-06 2010-07-08 Koen Bernard Vos Speech encoding
US10026411B2 (en) 2009-01-06 2018-07-17 Skype Speech encoding utilizing independent manipulation of signal and noise spectrum
US8392178B2 (en) 2009-01-06 2013-03-05 Skype Pitch lag vectors for speech encoding
US8396706B2 (en) * 2009-01-06 2013-03-12 Skype Speech coding
US20100174532A1 (en) * 2009-01-06 2010-07-08 Koen Bernard Vos Speech encoding
US9530423B2 (en) 2009-01-06 2016-12-27 Skype Speech encoding by determining a quantization gain based on inverse of a pitch correlation
US20100174541A1 (en) * 2009-01-06 2010-07-08 Skype Limited Quantization
US8639504B2 (en) 2009-01-06 2014-01-28 Skype Speech encoding utilizing independent manipulation of signal and noise spectrum
US8670981B2 (en) 2009-01-06 2014-03-11 Skype Speech encoding and decoding utilizing line spectral frequency interpolation
US8849658B2 (en) 2009-01-06 2014-09-30 Skype Speech encoding utilizing independent manipulation of signal and noise spectrum
US9263051B2 (en) 2009-01-06 2016-02-16 Skype Speech coding by quantizing with random-noise signal
US8452606B2 (en) 2009-09-29 2013-05-28 Skype Speech encoding using multiple bit rates
US20110077940A1 (en) * 2009-09-29 2011-03-31 Koen Bernard Vos Speech encoding

Also Published As

Publication number Publication date
EP1080464B1 (en) 2003-01-29
EP1080464A1 (en) 2001-03-07
US20020007473A1 (en) 2002-01-17
CN1312937A (en) 2001-09-12
EP1093690A1 (en) 2001-04-25
US6567949B2 (en) 2003-05-20
WO1999063520A1 (en) 1999-12-09
CN1134764C (en) 2004-01-14
DE59904164D1 (en) 2003-03-06
WO1999063523A1 (en) 1999-12-09
CN1303508A (en) 2001-07-11
EP1093690B1 (en) 2006-03-15
CN1143470C (en) 2004-03-24
DE59913231D1 (en) 2006-05-11
WO1999063522A1 (en) 1999-12-09

Similar Documents

Publication Publication Date Title
USRE49363E1 (en) Variable bit rate LPC filter quantizing and inverse quantizing device and method
US20010001320A1 (en) Method and device for speech coding
US7778827B2 (en) Method and device for gain quantization in variable bit rate wideband speech coding
Salami et al. ITU-T G. 729 Annex A: reduced complexity 8 kb/s CS-ACELP codec for digital simultaneous voice and data
JP3114197B2 (en) Voice parameter coding method
EP0802524B1 (en) Speech coder
US20050065785A1 (en) Indexing pulse positions and signs in algebraic codebooks for coding of wideband signals
CN101494055A (en) Method and device for CDMA wireless systems
JP2002526798A (en) Encoding and decoding of multi-channel signals
US20040024594A1 (en) Fine granularity scalability speech coding for multi-pulses celp-based algorithm
US7302387B2 (en) Modification of fixed codebook search in G.729 Annex E audio coding
Kataoka et al. An 8-bit/s speech coder based on conjugate structure CELP
Kataoka et al. An 8-kb/s conjugate structure CELP (CS-CELP) speech coder
US6704703B2 (en) Recursively excited linear prediction speech coder
Ohmuro et al. Coding of LSP parameters using interframe moving average prediction and multi-stage vector quantization
KR100465316B1 (en) Speech encoder and speech encoding method thereof
Mano et al. Design of a pitch synchronous innovation CELP coder for mobile communications
US20060080090A1 (en) Reusing codebooks in parameter quantization
US6973424B1 (en) Voice coder
Xydeas et al. Theory and Real Time Implementation of a CELP Coder at 4.8 and 6.0 kbits/second Using Ternary Code Excitation
EP1859441B1 (en) Low-complexity code excited linear prediction encoding
AU682505B2 (en) Vector coding process, especially for voice signals
Akamine et al. CELP coding with an adaptive density pulse excitation model
KR100389898B1 (en) Method for quantizing linear spectrum pair coefficient in coding voice
Miki et al. Pitch synchronous innovation code excited linear prediction (PSI‐CELP)

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION