DATA TRANSMISSION OVER A COMPRESSED SPEECH CHANNEL
The present invention relates to data transmission over telecommunications networks, and in particular to the transmission of digital data over speech channels of such networks.
Using speech channels for the transmission of data over networks can have advantages under some circumstances because they are generally given a high priority and are subject to relatively small delays. This is necessary to ensure that normal speech can be transmitted without unacceptable loss of clarity. Data channels are generally susceptible to greater delay.
There exist a number of methods for transforming digital data into a form suitable for transmission over analogue channels, such as the channel between a fixed-line telephony service subscriber and his local exchange. These are collectively referred to as modulation techniques and generally involve modifying a sinusoidal carrier signal in such a way as to represent a time-varying digital bit stream. The corresponding process at the receiver end is known as demodulation, while a device which carries out both tasks is called a modem. Typical characteristics of a carrier that are varied in order to superimpose the digital data include its amplitude, frequency and phase. These schemes are designed to be robust to errors in the transmission medium as well as to transmit the greatest number of bits in the available bandwidth. Moreover, in the case of modems as used by personal computers to transmit information over the telephone network, the modulation/demodulation techniques must be resilient to the conversion of the analogue signals to digital PCM (Pulse Code Modulation) or DPCM (Differential PCM) format for transmission between different exchanges. Current standardised techniques allow for up to
56 kbit/s (using V.90 modems) to be transmitted over a standard telephone line.
These data modems cannot, however, be used for transmission over systems that employ speech compression technologies, such as digital mobile or satellite channels. The main reason for this is that speech compression systems remove unwanted redundancy in speech by employing a speech production model which is designed to represent efficiently the characteristics of the input speech. These algorithms, known as speech coders, aim to characterise the speech by means of a number of parameters, each describing a different aspect of the voice production model employed. The main objective of the speech compression algorithms is therefore to reduce the number of bits required to represent these parameters, whilst still retaining an acceptable speech quality level when the parameters are input to the voice synthesis model at the decoder. This maintains optimal voice quality at the targeted or stipulated bit rate. A side-effect of this approach is that the resulting synthesised speech, whilst perceptually similar to the input speech, i.e. sounding very similar to the original, may have a quite different waveform on a sample-by-sample basis. Thus, notwithstanding the fact that the output speech may be audibly clear and intelligible, the decoded signal viewed as a waveform against time may have undergone significant change.
This distortion prevents most data modems from operating over channels which employ speech compression systems. The frequency, phase and amplitude characteristics of a standard modem output used to represent the transmitted data are not retained with sufficient quality to allow any meaningful bit rate to be sustained. This problem is compounded by the fact that in many networks, and in particular mobile communication
systems, the speech signal may undergo more than one set of compression/decompression stages, a phenomenon known as tandeming.
Therefore according to a first aspect of the present invention there is provided a method of transmitting a block of data over a speech channel comprising the steps of converting the block of data into a speech-like signal having at least one speech-like characteristic defined by at least one parameter which is determined by the data, and transmitting the speech-like signal over the speech channel.
The use of speech-like signals to transmit the data can help to ensure that during its transmission over the network transcoding and filtering processes will not severely affect the data. This is because those processes are designed to retain the perceptibly important characteristics of the speech for the normal effective transmission of speech signals.
The speech-like signal may be a digital signal defining a speech-like waveform having said at least one speech-like characteristic. Alternatively the speech-like signal can be an analogue signal having said at least one speech-like characteristic.
Preferably the data is converted to the speech-like signal using a table which maps each possible form of at least part of the data block to a unique form of said at least one parameter.
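The table-based mapping described above can be sketched as follows. This is a minimal illustration in Python; the function name, the choice of a pitch parameter and the 35 to 66 sample range are assumptions for the sake of example, not taken from the specification.

```python
def build_parameter_table(num_bits, values):
    """Map each possible form of a num_bits-wide section of the data block
    to a unique parameter value; requires exactly 2**num_bits distinct values."""
    assert len(values) == 2 ** num_bits
    assert len(set(values)) == len(values)  # the mapping must be one-to-one
    return dict(enumerate(values))

# Example: a 5-bit data section mapped to 32 distinct pitch lags (illustrative).
pitch_table = build_parameter_table(5, list(range(35, 67)))
```

Because every bit pattern maps to a unique parameter form, the receiver can invert the table unambiguously when it identifies the parameter.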
Preferably the method further comprises the step of converting the speech-like signal to a compressed digital format in which said at least one parameter is defined by at least one section of data.
In some circumstances it may be desirable to avoid altogether the conversion of data to an analogue signal. Because many voice channels
are designed to transmit speech in digitally compressed format, the data can simply be converted straight to a digital signal having the same format as digitally compressed speech so that it can be transmitted.
Therefore, according to a second aspect of the present invention there is provided a method of transmitting a block of digital data over a speech channel comprising the steps of converting the data to a digital compressed speech format such that at least one section of the data defines at least one parameter of a speech-like signal, and transmitting the converted data over the speech channel.
Since data will often need to be transmitted as a continuous or semi-continuous stream, a third aspect of the present invention provides a method of transmitting a data stream over a speech channel comprising dividing the data into a plurality of blocks, and transmitting each of the blocks using the method of the first or second aspect of the invention.
According to a fourth aspect of the present invention there is further provided a method of receiving a block of data over a speech channel comprising the steps of receiving a frame of speech-like signal having at least one speech-like characteristic defined by at least one parameter, analysing the received signal to determine said at least one parameter, and reproducing the data from the at least one parameter.
Preferably the received speech-like signal is converted to the data using a table which maps each valid form of said at least one parameter to a unique form of at least part of the data block.
Preferably the speech-like signal is obtained by receiving a signal in digital compressed speech format representing at least one block of data, and decompressing the received signal to produce a speech-like signal
having a speech-like characteristic defined by at least one parameter determined by said at least one block of data.
According to a fifth aspect of the present invention there is further provided a method of receiving a block of data over a speech channel comprising the steps of receiving a signal in compressed speech format in which at least one group of bits represents at least one parameter defining at least one characteristic of a speech-like signal, and reproducing the data from the bits.
According to a sixth aspect of the present invention there is further provided a method of communicating data over a speech channel comprising transmitting the data according to the first, second or third aspects of the invention and receiving data according to the fourth or fifth aspects of the invention.
Preferably the at least one characteristic includes the fundamental frequency of the speech-like signal, a characteristic of the spectral envelope of the speech-like signal, or the energy level of the speech-like signal.
More preferably said at least one speech-like parameter comprises a plurality of parameters, and the or each block of data determines or is derived from the parameters. Preferably the or each block of data has a plurality of sections each of which determines or is derived from a respective one of the parameters.
The data may represent a voice message, which is preferably compressed, in particular an encrypted voice message.
Preferred embodiments of the present invention will now be described by way of example only with reference to the accompanying drawings in which:
Figure 1 is a diagrammatic representation of a data communication system according to the invention;
Figure 2 is a diagrammatic representation of part of a conversion process of data in the system of Figure 1;
Figure 3 is a diagrammatic representation of a further part of the conversion process of data in the system of Figure 1 ;
Figure 4 is a diagrammatic representation of the conversion process of synthesised speech to data in the system of Figure 1; and
Figure 5 shows the effect of speech compression on the synthesised speech signals used in the system of Figure 1.
Referring to Figure 1, a data communication system comprises a transmitter 10, in the form of a first mobile terminal, arranged to transmit data over a communications network 12, and a receiver 14, in the form of a second mobile terminal, arranged to receive data over the network. The transmitter 10 comprises an input 16 for receiving digital data 17, a data modulator 18 for converting the data signal to a speech-like signal 19, a speech compression module 20 in the form of a GSM speech encoder for converting the speech-like signal to a compressed speech signal 23, and an output 22 for transmitting the compressed speech signal 23 onto the network 12. The receiver 14 comprises an input 24 for receiving compressed speech signals 25 from the network 12, a speech decompression module 26 for converting the compressed speech signal 25
to a speech-like signal 27, a demodulator 28 for converting the speech-like signal 27 to a data signal 29, and an output 30 for the data signal 29.
For a conventional mobile network the output 22 of the transmitter 10 will transmit the signal as a radio signal, and the network 12 will include a number of base stations for transmitting and receiving the radio signals and a telephone network to which the base stations are connected.
At the input stage, the data to be transmitted over the network 12 is input into the data modulator 18. This modulator 18 is essentially a voice synthesiser. The input bit stream is converted by means of an appropriate synthetic voice production model, as will be described below, into a speech-like waveform. This waveform bears no resemblance to the input data (the input being a binary data stream, the output a PCM-encoded waveform). There is, however, an important relationship between the two, namely that the output waveform can be uniquely generated by a set of parameters corresponding to a portion of the input bit stream. This process is carried out on a discrete frame-by-frame basis, with a predetermined number of bits resulting in the formation of an output waveform of fixed duration.
These parameters and their corresponding waveforms have two important properties. First, notwithstanding the distortion processes introduced by the network, on receiving a given waveform the demodulator 28 must be able to identify correctly the original descriptive parameters corresponding to the input data. This is possible because the parameters selected represent the most perceptually important characteristics of a human speech signal. It is these characteristics which are maintained by all voice telephony systems, since, if they were modified, there would be significant distortion of the resulting speech, degrading its intelligibility and subjective quality.
The second important property of the selected waveform-parameters is that there should be a large distance between them. This means that each of the selected waveforms should bear as little resemblance as possible to each of the other waveforms in the valid set. The modulator 18 generates an output bit rate that is considerably lower than that of the lowest-rate speech compression system in the transmission link. This allows for a certain amount of redundancy to counteract the distortion introduced in the transmission network.
Current GSM mobile networks employ speech coders such as EFR (Enhanced Full Rate) and AMR (Adaptive Multi-Rate), which compress speech from 64 kbit/s down to 12.2 kbit/s or below, depending on the operating mode of the speech coder. Only a fraction of this throughput can be efficiently used by the system to transmit user data, typically 1 to 2 kbit/s. This embodiment operates at a rate of 1 kbit/s.
Speech coders used in GSM systems operate on frames of 20 ms of speech, sampled at 8 kHz. The modulator 18 therefore also generates speech-like waveforms in frames of 20 ms, with an 8 kHz sampling rate. These frames are input to the GSM speech encoder 20. For a 1 kbit/s modem throughput, this corresponds to a block of 20 bits being mapped to each 20 ms frame of artificial speech.
Referring to Figure 2, details of the modulator 18 of this embodiment will now be described. The data modulator 18 comprises an input buffer 32 arranged to receive the input data bitstream 33, divide it into blocks of 20 bits each, and then divide each block into a plurality of sections, each of predetermined length and at a predetermined position within the block, and send each section to a respective index generator which generates, from the section of data it receives, an index relating to a characteristic of the speech-like signal to be produced. In this case 10 bits are sent to an LSF
(line spectral frequency) index generator 34 which converts the 10 bits into an LSF index 35, 5 bits are sent to an energy index generator 36 which converts them to an energy index 37, and 5 bits are sent to a pitch index generator which converts them to a pitch index 39. Each index 35, 37, 39 will have a unique value for each possible form of the section of data used to generate it.
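The division of each 20-bit block into 10/5/5-bit sections, each read as a codebook index, might be sketched as follows. This is an illustrative reading of the scheme above; the function name and bit ordering are assumptions, not taken from the specification.

```python
def split_block(bits):
    """Split a 20-bit block into LSF (10 bits), energy (5 bits) and pitch
    (5 bits) sections, each interpreted as an integer codebook index."""
    assert len(bits) == 20
    lsf_index = int("".join(map(str, bits[0:10])), 2)      # range 0..1023
    energy_index = int("".join(map(str, bits[10:15])), 2)  # range 0..31
    pitch_index = int("".join(map(str, bits[15:20])), 2)   # range 0..31
    return lsf_index, energy_index, pitch_index

lsf, energy, pitch = split_block([1] * 10 + [0] * 5 + [1] * 5)
# lsf == 1023, energy == 0, pitch == 31
```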
The modulator 18 has stored in its memory a set of look-up tables, also referred to as codebooks, one for each of the parameters. These comprise an LSF codebook 40, an energy codebook 42, and a pitch codebook 44. Each codebook relates a unique form of the relevant parameter to each value of the relevant index 35, 37, 39. Therefore in this case there are 1024 forms of the LSF parameter 41, and 32 forms of each of the energy parameter 43 and the pitch parameter 45.
Referring to Figure 3 the three parameters 41, 43, 45 obtained from the codebooks 40, 42, 44 for the block of data are then converted to a speech-like signal or synthetic speech by means of a DFT-based speech synthesizer which receives the three parameters 41, 43, 45 as inputs, generates a pulse train signal and then performs filtering and scaling on it to generate a 20ms digital synthetic speech-like signal 19 defining a waveform containing pitch cycles of the desired length, energy and spectral shape. The speech-like signal waveform therefore has a number of speech-like characteristics which are defined by the parameters used to produce it.
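A heavily simplified sketch of the pulse-train generation and scaling step is given below. It realises only the pitch and energy parameters; the DFT-based synthesiser described above additionally applies a spectral-shaping filter derived from the LSF parameter, which is omitted here, and all names are illustrative.

```python
import math

def synthesise_frame(pitch_lag, energy, frame_len=160):
    """Generate one 20 ms frame (160 samples at 8 kHz): a pulse train with
    one pulse per pitch cycle, scaled so the frame RMS equals `energy`."""
    frame = [0.0] * frame_len
    for n in range(0, frame_len, pitch_lag):
        frame[n] = 1.0  # one pulse at the start of each pitch cycle
    rms = math.sqrt(sum(x * x for x in frame) / frame_len)
    return [x * energy / rms for x in frame]

frame = synthesise_frame(pitch_lag=40, energy=1000.0)
# 160 samples with pulses at positions 0, 40, 80 and 120
```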
Referring to Figure 4, the demodulator 28 in the receiver will now be described. The demodulator comprises an input buffer 50 which receives the synthetic speech 27 and divides it into 20ms frames. Each speech frame is then input to extraction modules 52, 54, 56 for the LSF, pitch lag and energy parameters respectively. These modules analyse the
synthetic speech and output estimated values for each of the three parameters. Various techniques can be used for this purpose; in this example, the first stage is to employ an autocorrelation-based detector for the pitch lag. The energy of each pitch cycle can then be extracted for the detected pitch value using a root-mean-square calculation, and the spectral envelope parameters are then extracted using well-known signal processing techniques. These estimated values 53, 55, 57 are then compared with those in LSF, pitch and energy codebooks 58, 60, 62, corresponding to the codebooks in the transmitter 10, to find a best match, which is then input to the codebooks to select an index value 59, 61, 63 for each parameter. The index values for the three parameters are then each converted to a number of bits of data, which are then combined in the correct order, in the reverse of the process of the transmitter, to produce the block of decoded data 29.
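The extraction techniques named above (autocorrelation pitch detection, root-mean-square energy, nearest-codebook matching) can be illustrated minimally as follows. This is a sketch under simplifying assumptions, not the actual demodulator implementation; the lag range is the example range discussed later in the description.

```python
import math

def detect_pitch(frame, lo=35, hi=66):
    """Autocorrelation-based pitch lag detector over a restricted lag range."""
    best_lag, best_corr = lo, float("-inf")
    for lag in range(lo, hi + 1):
        corr = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

def rms_energy(frame):
    """Root-mean-square energy of a frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def nearest_index(estimate, codebook):
    """Index of the codebook entry closest to the estimated parameter."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - estimate))
```

For example, a frame containing a pulse every 40 samples yields a detected lag of 40, and an estimated pitch of 37.2 matched against entries 35, 40 and 45 selects index 0.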
One 20 bit block of data is transmitted and received every 20 ms, resulting in a total bit rate of 1 kb/s in this example. The number of bits and the frame rate can be adapted depending on the network conditions and the required performance.
Since the various speech coders used in the network 12 attempt to represent correctly human speech, the speech parameters suffer relatively little distortion. Provided the distortion introduced by the channel stays sufficiently low, the indices of the codebook entries in the receiver for each synthetic speech frame correspond to those of the frame sent by the encoder 20, and the bitstream transmitted can then be recovered.
Figure 5 compares the original 'synthetic speech' signals produced by the modulator 18 with the corresponding signals after one EFR tandem. Each 20ms frame includes 160 samples, so the characteristics of the waveform change every 160 samples as can be seen in Figure 5. While there are
minor differences between the original and the compressed and decompressed signals, it can be seen that the signals retain their basic form, including pitch, energy, and spectral shape.
It will be appreciated that the performance of the system depends heavily on the choice of the codebooks. The various elements in the codebook must be selected so as to try to maximise the distance between them. If two elements are too close, the distortion caused by the network may result in an index being wrongly decoded. Ideally, the codebook vectors must be chosen with regard to the distortion introduced by the speech coder itself, so that easy decoding is possible.
For example, the autocorrelation technique used at the demodulator 28 is sensitive to pitch doubling or halving, i.e. the pitch may be incorrectly decoded as having either half or twice the original pitch value. It is therefore preferable to use only a range of pitch values in which no such doubling or halving can occur. For a 5-bit codebook, an example is to use all integer pitch values from 35 samples to 66 samples. The energy codebook 42 should also make use of the characteristics of the speech coders, which generally use exponential quantisers for the energy parameter. An exponential codebook should therefore be used for the energy parameter, with the maximum and minimum values chosen so as not to cause saturation and not be detected as silence by a Voice Activity Detector in the network respectively.
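The two codebook designs described above can be sketched as follows. The pitch range is the example given in the text; the energy bounds are illustrative placeholders (high enough to escape a Voice Activity Detector's silence threshold, low enough to avoid saturation), not values from the specification.

```python
# 5-bit pitch codebook: all integer lags from 35 to 66 samples (32 entries).
# The range is chosen so that no entry is the double (or half) of another,
# avoiding pitch-doubling/halving errors at the autocorrelation detector.
pitch_codebook = list(range(35, 67))
assert not any(2 * p in pitch_codebook for p in pitch_codebook)

# 5-bit exponential energy codebook between illustrative bounds, matching
# the exponential quantisers typically used by speech coders.
e_min, e_max = 50.0, 8000.0
ratio = (e_max / e_min) ** (1.0 / 31)
energy_codebook = [e_min * ratio ** i for i in range(32)]
```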
Finally, the vectors in the LSF codebook 40 must be as separated as possible. This is made more complex by the fact that LSFs are defined as vectors, and the problem is therefore multi-dimensional. Codebook design algorithms such as LBG (see Y. Linde, A. Buzo and R.M. Gray: "An Algorithm for Vector Quantizer Design", IEEE Transactions on Communications, Vol. 28, No. 1, 1980, pp. 84-95) can be used to obtain
an initial oversized codebook. This codebook can then be refined by using the modulator/demodulator and by finding out which vectors get wrongly decoded. These vectors are then removed from the codebook 40 until all remaining vectors in the codebook are decoded correctly when passed through the modem.
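The refinement procedure described above amounts to an iterative pruning loop, shown schematically below. Here `channel_decode` is a hypothetical placeholder standing in for a full modulate/compress/decompress/demodulate pass; the toy channel in the example is purely illustrative.

```python
def refine_codebook(codebook, channel_decode):
    """Pass every codebook vector through the modem chain and remove any
    vector that is wrongly decoded, repeating until all survivors decode
    correctly."""
    codebook = list(codebook)
    while True:
        bad = [v for v in codebook if channel_decode(v, codebook) != v]
        if not bad:
            return codebook
        codebook = [v for v in codebook if v not in bad]

# Toy channel: entries below 10 are corrupted and mis-decoded, so the loop
# prunes them from an oversized starting codebook of 20 entries.
decoded = refine_codebook(
    range(0, 100, 5),
    lambda v, cb: v if v >= 10 else v + 5,
)
```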
Mobile networks often operate under difficult channel conditions, and therefore the characteristics of the received speech may vary from those of the original signal. Although error concealment techniques are used in the network to reduce this effect, some parameters may be decoded wrongly, introducing bit errors. In order to minimise this effect, it is preferable to order the codebooks in such a way that codebook entries which are similar also have indices close to each other in terms of Euclidean distance. This technique is known as Index Assignment (IA). If an error occurs during the parameter decoding, it is likely that the selected set will have been fairly close to the original set. Using IA, the decoded index will then only differ slightly from the original, reducing the overall bit error rate on the data channel. Index assignment will be most effective when used in conjunction with a block-based error correction scheme.
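For a scalar codebook, the index-assignment idea described above reduces to ordering the entries by value, as in this minimal sketch (illustrative only; the patent's actual assignment for vector codebooks is more involved):

```python
def index_assign(scalar_codebook):
    """Order a scalar codebook so that similar entries receive adjacent
    indices: a small error in the decoded parameter then shifts the decoded
    index, and hence the recovered data bits, only slightly."""
    return sorted(scalar_codebook)

# After ordering, a pitch estimate that lands on a neighbouring entry
# changes the index by 1 instead of jumping to an arbitrary table position.
ordered = index_assign([66, 35, 50, 41])
```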
In a second embodiment of the invention the data modulation and speech compression units of the first embodiment are combined to form a single, unified data mapping scheme. In this embodiment, the input data is mapped directly onto the syntax of the GSM EFR speech compression output, without the use of an intermediate interface between the modulator and the speech compression algorithms.
In the GSM EFR compression algorithm the parameters used for characterising the speech are the spectral (LSF) information, the pitch delay, the pitch gain, and the codebook excitation and its gain.
The procedures for identifying the optimal speech characteristics for representing the input data, and for codebook design, are identical to those used in the first embodiment. The parameters chosen to transmit data are those which are most robust to coding and decoding on the network; in this case the spectral information, pitch delay, and pitch gain are used. Although these parameters give a total data throughput of 4200 bit/s, the data rate that can actually be supported by them will be somewhat lower than this. Instead of producing a synthetic waveform, the data from each block of input data are mapped directly onto the GSM EFR speech compression parameters. Moreover, parameters which are not used to transmit data (such as the codebook excitation) may nevertheless be selected so as to enhance the spectral characteristics of the speech. For example, the excitation could be set to be in the form of regular pulses aligned with the fundamental pitch pulses generated by the speech synthesiser. This has two main benefits. First, since the spectral characteristics of a pulse are flat, there will be minimal distortion to the information stored in the spectral envelope characteristics of the synthetic speech. Secondly, such an alignment can enhance the definition of the individual pitch pulses generated by the synthesiser.
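Schematically, the direct mapping of this second embodiment amounts to slicing each block of data bits straight into the parameter fields of a compressed-speech frame, with no intermediate waveform. The field widths used below (38 + 30 + 16 = 84 bits per 20 ms frame, i.e. 4200 bit/s) are assumptions for illustration, not figures confirmed by the specification.

```python
# Hypothetical per-frame field widths for the data-carrying parameters
# (assumed for illustration, not taken from the specification).
FIELDS = {"spectral": 38, "pitch_delay": 30, "pitch_gain": 16}

def pack_frame(bits):
    """Slice a block of data bits directly into the compressed-speech
    parameter fields, bypassing any waveform synthesis."""
    assert len(bits) == sum(FIELDS.values())  # 84 bits per 20 ms frame
    frame, pos = {}, 0
    for name, width in FIELDS.items():
        frame[name] = bits[pos:pos + width]
        pos += width
    return frame

frame = pack_frame([0] * 84)
```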
Whereas such an implementation has the disadvantage of being tied to one particular speech coder, it has the advantage that the characteristics of the selected waveform-parameter pairs can be chosen so as to provide optimal performance for that particular compression scheme.