WO2003071521A1 - Data transmission over a compressed speech channel - Google Patents

Data transmission over a compressed speech channel

Info

Publication number
WO2003071521A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
data
signal
parameter
block
Prior art date
Application number
PCT/GB2003/000721
Other languages
French (fr)
Inventor
Stephane Pierre Villette
Simon N. Fabri
Ahmet M. Kondoz
Original Assignee
The University Of Surrey
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The University Of Surrey filed Critical The University Of Surrey
Priority to AU2003205902A priority Critical patent/AU2003205902A1/en
Publication of WO2003071521A1 publication Critical patent/WO2003071521A1/en

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis


Abstract

A data transmission system arranged to transmit data over a speech transmission system comprises a transmitter (10) having a modulator (18) which converts the data to a speech-like signal having at least one speech-like characteristic defined by at least one parameter, and a speech compression module (20) which converts the speech-like signal to digital compressed speech format. A receiver (14) includes a speech decompression module (26) and a demodulator (28) which converts the speech-like signals back to the original data by analysing the parameters. If a digital speech compression link is used on either the end uplink or end downlink part of the communications path, the data may be mapped directly onto the compressed speech bit stream or recovered from the bit stream respectively.

Description

DATA TRANSMISSION OVER A COMPRESSED SPEECH CHANNEL
The present invention relates to data transmission over telecommunications networks, and in particular to the transmission of digital data over speech channels of such networks.
Using speech channels for the transmission of data over networks can have advantages under some circumstances because they are generally given a high priority and are subject to relatively small delays. This is necessary to ensure that normal speech can be transmitted without unacceptable loss of clarity. Data channels are generally susceptible to greater delay.
There exist a number of methods for transforming digital data into a form suitable for transmission over analogue channels such as between a fixed-line telephony service subscriber and his local exchange. These are collectively referred to as modulation techniques and generally involve modifying a sinusoidal carrier signal in such a way as to represent a time-varying digital bit stream. The corresponding process at the receiver end is known as demodulation, while a device which carries out both tasks is called a modem. Typical characteristics of a carrier that are varied in order to superimpose the digital data include its amplitude, frequency and phase. These schemes are designed to be robust to errors in the transmission medium as well as to be able to transmit the greatest number of bits in the available bandwidth. Moreover, in the case of modems used by personal computers to transmit information over the telephone network, the modulation/demodulation techniques must be resilient to the conversion of the analogue signals to digital D/PCM (Differential/Pulse Code Modulation) format for transmission between different exchanges. Current standardised techniques allow for up to 56 kbit/s (using V.90 modems) to be transmitted over a standard telephone line.
These data modems cannot, however, be used for transmission over systems that employ speech compression technologies, such as digital mobile or satellite channels. The main reason for this is that speech compression systems remove the unwanted redundancy in speech by employing a speech production model which is designed to efficiently represent the characteristics of the speech being input. These algorithms, known as speech coders, aim to characterise the speech by means of a number of parameters, each describing different aspects of the voice production model employed. The main objective of the speech compression algorithms is therefore to reduce the number of bits required to represent these parameters, whilst still retaining an acceptable speech quality level when the parameters are input to the voice synthesis model at the decoder. This ensures that optimal voice quality is maintained at the targeted or stipulated bit rate. A side-effect of this approach is that the resulting synthesised speech, whilst perceptually similar to the input speech, i.e. sounding very similar to the original, may have a fairly different waveform on a sample-by-sample basis. The waveforms as seen with respect to time will be visibly different to the original input speech, so, notwithstanding the fact that the output speech may be audibly clear and intelligible, the actual decoded signal may have undergone significant change.
This distortion prevents most data modems from operating over channels which employ speech compression systems. The frequency, phase and amplitude characteristics of a standard modem output used to represent the transmitted data are not retained with sufficient quality to allow for any meaningful bit rate to be sustained. This problem is compounded by the fact that in many networks, and in particular, mobile communication systems, the speech signal may undergo more than one set of compression/decompression stages, a phenomenon known as tandeming.
Therefore according to a first aspect of the present invention there is provided a method of transmitting a block of data over a speech channel comprising the steps of converting the block of data into a speech-like signal having at least one speech-like characteristic defined by at least one parameter which is determined by the data, and transmitting the data over the speech channel.
The use of speech-like signals to transmit the data can help to ensure that during its transmission over the network transcoding and filtering processes will not severely affect the data. This is because those processes are designed to retain the perceptibly important characteristics of the speech for the normal effective transmission of speech signals.
The speech-like signal may be a digital signal defining a speech-like waveform having said at least one speech-like characteristic. Alternatively, the speech-like signal can be an analogue signal having said at least one speech-like characteristic.
Preferably the data is converted to the speech-like signal using a table which maps each possible form of at least part of the data block to a unique form of said at least one parameter.
Preferably the method further comprises the step of converting the speech-like signal to a compressed digital format in which said at least one parameter is defined by at least one section of data.
In some circumstances it may be desirable to avoid altogether the conversion of data to an analogue signal. Because many voice channels are designed to transmit speech in digitally compressed format, the data can simply be converted straight to a digital signal having the same format as digitally compressed speech so that it can be transmitted.
Therefore, according to a second aspect of the present invention there is provided a method of transmitting a block of digital data over a speech channel comprising the steps of converting the data to a digital compressed speech format such that at least one section of the data defines at least one parameter of a speech-like signal, and transmitting the converted data over the speech channel.
Since data will often need to be transmitted as a continuous or semi-continuous stream, a third aspect of the present invention provides a method of transmitting a data stream over a speech channel comprising dividing the data into a plurality of blocks, and transmitting each of the blocks using the method of the first or second aspect of the invention.
According to a fourth aspect of the present invention there is further provided a method of receiving a block of data over a speech channel comprising the steps of receiving a frame of speech-like signal having at least one speech-like characteristic defined by at least one parameter, analysing the received signal to determine said at least one parameter, and reproducing the data from the at least one parameter.
Preferably the received speech-like signal is converted to the data using a table which maps each valid form of said at least one parameter to a unique form of at least part of the data block.
Preferably the speech-like signal is obtained by receiving a signal in digital compressed speech format representing at least one block of data, and decompressing the received signal to produce a speech-like signal having a speech-like characteristic defined by at least one parameter determined by said at least one block of data.
According to a fifth aspect of the present invention there is further provided a method of receiving a block of data over a speech channel comprising the steps of receiving a signal in compressed speech format in which at least one group of bits represents at least one parameter defining at least one characteristic of a speech-like signal, and reproducing the data from the bits.
According to a sixth aspect of the present invention there is further provided a method of communicating data over a speech channel comprising transmitting the data according to the first, second or third aspects of the invention and receiving data according to the fourth or fifth aspects of the invention.
Preferably the at least one characteristic includes the fundamental frequency of the speech-like signal, a characteristic of the spectral envelope of the speech-like signal, or the energy level of the speech-like signal.
More preferably said at least one speech-like parameter comprises a plurality of parameters, and the or each block of data determines or is derived from the parameters. Preferably the or each block of data has a plurality of sections each of which determines or is derived from a respective one of the parameters.
The data may represent a voice message, which is preferably compressed, in particular an encrypted voice message. Preferred embodiments of the present invention will now be described by way of example only with reference to the accompanying drawings in which:
Figure 1 is a diagrammatic representation of a data communication system according to the invention;
Figure 2 is a diagrammatic representation of part of a conversion process of data in the system of Figure 1;
Figure 3 is a diagrammatic representation of a further part of the conversion process of data in the system of Figure 1;
Figure 4 is a diagrammatic representation of the conversion process of synthesized speech to data in the system of Figure 1; and
Figure 5 shows the effect of speech compression on the synthesized speech signals used in the system of Figure 1.
Referring to Figure 1, a data communication system comprises a transmitter 10, in the form of a first mobile terminal, arranged to transmit data over a communications network 12, and a receiver 14, in the form of a second mobile terminal, arranged to receive data over the network. The transmitter 10 comprises an input 16 for receiving digital data 17, a data modulator 18 for converting the data signal to a speech-like signal 19, a speech compression module 20 in the form of a GSM speech encoder for converting the speech-like signal to a compressed speech signal 23, and an output 22 for transmitting the compressed speech signal 23 onto the network 12. The receiver 14 comprises an input 24 for receiving compressed speech signals 25 from the network 12, a speech decompression module 26 for converting the compressed speech signal 25 to a speech-like signal 27, a demodulator 28 for converting the speech-like signal 27 to a data signal 29, and an output 30 for the data signal 29.
For a conventional mobile network the output 22 of the transmitter 10 will transmit the signal as a radio signal, and the network 12 will include a number of base stations for transmitting and receiving the radio signals and a telephone network to which the base stations are connected.
At the input stage, the data to be transmitted over the network 12 is input into the data modulator 18. This modulator 18 is essentially a voice synthesiser. The input bit stream is converted by means of an appropriate synthetic voice production model, as will be described below, into a speech-like waveform. This waveform bears no resemblance to the input data (the input being a binary data stream, the output being a PCM-encoded waveform). There is however an important relationship between the two, namely that the output waveform can be uniquely generated by a set of parameters corresponding to a portion of the input bit stream. This process is carried out on a discrete frame-by-frame basis, with a predetermined number of bits resulting in the formation of an output waveform of fixed duration.
These parameters and their corresponding waveforms have two important properties. First, notwithstanding the distortion processes introduced by the network, on receiving a given waveform, the demodulator 28 correctly identifies the original descriptive parameters corresponding to the input data. This is because the parameters selected represent the most perceptually important characteristics of a human speech signal. It is these characteristics which are maintained by all voice telephony systems, since, if they are modified there will be significant distortion of the resulting speech, degrading its intelligibility and subjective quality. The second important property of the selected waveform-parameters is that there should be a large distance between them. This means that each of the selected waveforms should bear as little resemblance as possible to each of the other waveforms in the valid set. The modulator 18 generates an output bit rate that is considerably lower than that of the lowest-rate speech compression system in the transmission link. This allows for a certain amount of redundancy to counteract the distortion introduced in the transmission network.
Current GSM mobile networks employ speech coders such as EFR (Enhanced Full Rate) and AMR (Adaptive Multi-Rate), which compress speech from 64 kbit/s down to 12.2 kb/s or below, depending on the operating mode of the speech coder. Only a fraction of this throughput can be efficiently used by the system to transmit user data, typically 1 to 2 kb/s. This embodiment operates at a rate of 1 kb/s.
Speech coders used in GSM systems operate on frames of 20 ms of speech, sampled at 8 kHz. The modulator 18 therefore also generates speech-like waveforms in frames of 20 ms, with an 8 kHz sampling rate. This is input to the GSM speech encoder 20. For a 1 kb/s modem throughput, this corresponds to a block of 20 bits being mapped to a 20 ms frame of artificial speech.
Referring to Figure 2, details of the modulator 18 of this embodiment will now be described. The data modulator 18 comprises an input buffer 32 arranged to receive the input data bitstream 33, divide it into blocks of 20 bits each, and then divide each block into a plurality of sections, each of predetermined length and at a predetermined position within the block, and send each section to a respective index generator which generates, from the section of data it receives, an index relating to a characteristic of the speech-like signal to be produced. In this case 10 bits are sent to an LSF (line spectral frequency) index generator 34 which converts the 10 bits into an LSF index 35, five bits are sent to an energy index generator 36 which converts them to an energy index 37, and five bits are sent to a pitch index generator which converts them to a pitch index 39. Each index 35, 37, 39 will have a unique value for each possible form of the section of data used to generate it.
The modulator 18 has stored in its memory a set of look-up tables, also referred to as codebooks, one for each of the parameters. These therefore comprise an LSF codebook 40, an energy codebook 42, and a pitch codebook 44. Each codebook relates a unique form of the relevant parameter to each value of the relevant index 35, 37, 39. Therefore in this case there are 1024 forms of the LSF parameter 41, and 32 forms of each of the energy parameter 43 and the pitch parameter 45.
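By way of illustration only, the following Python sketch shows how such a mapping from a 20-bit block to the three codebook indices and parameters might be coded. The codebook contents here are arbitrary placeholders (the real entries are the result of the codebook design process described later), and the 10/5/5-bit split follows the example above.

import numpy as np

# Hypothetical codebooks: 1024 LSF vectors, 32 energy levels, 32 pitch lags.
# Their contents are placeholders, not the values used in the patent.
rng = np.random.default_rng(0)
LSF_CODEBOOK = np.sort(rng.uniform(100.0, 3800.0, size=(1024, 10)), axis=1)  # Hz
ENERGY_CODEBOOK = np.geomspace(50.0, 8000.0, num=32)   # exponentially spaced energy levels
PITCH_CODEBOOK = np.arange(35, 67)                      # integer pitch lags, 35..66 samples

def modulate_block(bits):
    """Map one 20-bit block to (LSF vector, energy, pitch lag) via the codebooks."""
    assert len(bits) == 20
    lsf_index = int("".join(map(str, bits[0:10])), 2)      # 10 bits -> index 0..1023
    energy_index = int("".join(map(str, bits[10:15])), 2)  # 5 bits  -> index 0..31
    pitch_index = int("".join(map(str, bits[15:20])), 2)   # 5 bits  -> index 0..31
    return LSF_CODEBOOK[lsf_index], ENERGY_CODEBOOK[energy_index], PITCH_CODEBOOK[pitch_index]

lsf, energy, pitch = modulate_block([1, 0] * 10)   # one example 20-bit block
print(pitch, energy)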
Referring to Figure 3 the three parameters 41, 43, 45 obtained from the codebooks 40, 42, 44 for the block of data are then converted to a speech-like signal or synthetic speech by means of a DFT-based speech synthesizer which receives the three parameters 41, 43, 45 as inputs, generates a pulse train signal and then performs filtering and scaling on it to generate a 20ms digital synthetic speech-like signal 19 defining a waveform containing pitch cycles of the desired length, energy and spectral shape. The speech-like signal waveform therefore has a number of speech-like characteristics which are defined by the parameters used to produce it.
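The patent does not set out the synthesiser in code, but a minimal sketch of the kind of frame generation described above might look as follows (Python, assuming an 8 kHz sampling rate and a 160-sample frame). The FIR shaping filter stands in for the spectral-envelope filtering derived from the LSF parameter; its coefficients are an arbitrary assumption.

import numpy as np

FRAME_LEN = 160  # 20 ms at 8 kHz

def synthesize_frame(pitch_lag, target_rms, shaping_coeffs):
    """Pulse train at the chosen pitch lag, spectrally shaped, scaled to a target energy."""
    excitation = np.zeros(FRAME_LEN)
    excitation[::pitch_lag] = 1.0                                 # one pulse per pitch cycle
    shaped = np.convolve(excitation, shaping_coeffs)[:FRAME_LEN]  # impose a spectral envelope
    rms = np.sqrt(np.mean(shaped ** 2))
    return shaped * (target_rms / rms) if rms > 0 else shaped     # scale to the requested energy

frame = synthesize_frame(pitch_lag=50, target_rms=1000.0,
                         shaping_coeffs=np.array([1.0, 0.8, 0.4, 0.1]))
print(frame.shape)  # (160,)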
Referring to Figure 4, the demodulator 28 in the receiver will now be described. The demodulator comprises an input buffer 50 which receives the synthetic speech 27 and divides it into 20 ms frames. Each speech frame is then input to extraction modules 52, 54, 56 for the LSF, pitch lag and energy parameters respectively. These modules analyse the synthetic speech and output estimated values for each of the three parameters. Various techniques can be used for this purpose, and in this example, the first stage is to employ an autocorrelation-based detector for the pitch lag. The energy of each pitch cycle can then be extracted for the detected pitch value using a root-mean-square calculation, and the spectral envelope parameters are then extracted using well-known signal processing techniques. These estimated values 53, 55, 57 are then compared with those in LSF, pitch and energy codebooks 58, 60, 62 corresponding to the codebooks in the transmitter 10 to find a best match which is then input to the codebooks to select an index value 59, 61, 63 for each parameter. The index values for the three parameters are then each converted to a number of bits of data, which are then combined in the correct order, in the reverse of the process of the transmitter, to produce the block of decoded data 29.
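A correspondingly simple sketch of the receiver-side analysis, again in Python and making the same placeholder-codebook assumption as the earlier sketch, is given below; the LSF extraction step (e.g. via LPC analysis) is omitted for brevity.

import numpy as np

def detect_pitch(frame, min_lag=35, max_lag=66):
    """Autocorrelation-based pitch lag estimate restricted to the allowed lag range."""
    corr = [np.dot(frame[lag:], frame[:-lag]) for lag in range(min_lag, max_lag + 1)]
    return min_lag + int(np.argmax(corr))

def nearest_index(value, codebook):
    """Index of the codebook entry closest to an estimated parameter value."""
    codebook = np.asarray(codebook, dtype=float)
    if codebook.ndim == 1:
        return int(np.argmin(np.abs(codebook - value)))
    return int(np.argmin(np.linalg.norm(codebook - np.asarray(value), axis=1)))

def demodulate_frame(frame, energy_codebook, pitch_codebook):
    """Recover the energy and pitch bits of one frame (LSF handling omitted)."""
    frame = np.asarray(frame, dtype=float)
    pitch_lag = detect_pitch(frame)
    energy = np.sqrt(np.mean(frame ** 2))                       # root-mean-square energy
    energy_bits = format(nearest_index(energy, energy_codebook), "05b")
    pitch_bits = format(nearest_index(pitch_lag, pitch_codebook), "05b")
    return energy_bits + pitch_bits   # the last 10 bits of the original 20-bit block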
One 20 bit block of data is transmitted and received every 20 ms, resulting in a total bit rate of 1 kb/s in this example. The number of bits and the frame rate can be adapted depending on the network conditions and the required performance.
Since the various speech coders used in the network 12 attempt to represent human speech correctly, the speech parameters suffer relatively little distortion. Provided the distortion introduced by the channel stays sufficiently low, the indices of the codebook entries in the receiver for each synthetic speech frame correspond to those of the frame sent by the encoder 20, and the bitstream transmitted can then be recovered.
Figure 5 compares the original 'synthetic speech' signals produced by the modulator 18 with the corresponding signals after one EFR tandem. Each 20ms frame includes 160 samples, so the characteristics of the waveform change every 160 samples as can be seen in Figure 5. While there are minor differences between the original and the compressed and decompressed signals, it can be seen that the signals retain their basic form, including pitch, energy, and spectral shape.
It will be appreciated that the performance of the system depends heavily on the choice of the codebooks. The various elements in the codebook must be selected so as to try to maximise the distance between them. If two elements are too close, the distortion caused by the network may result in an index being wrongly decoded. Ideally, the codebook vectors must be chosen with regard to the distortion introduced by the speech coder itself, so that easy decoding is possible.
For example, the autocorrelation technique used at the demodulator 28 is sensitive to pitch doubling or halving, i.e. the pitch may be incorrectly decoded as having either half or twice the original pitch value. It is therefore preferable to use only a range of pitch values in which no such doubling or halving can occur. For a 5-bit codebook, an example is to use all integer pitch values from 35 samples to 66 samples. The energy codebook 42 should also make use of the characteristics of the speech coders, which generally use exponential quantisers for the energy parameter. An exponential codebook should therefore be used for the energy parameter, with the maximum value chosen so as not to cause saturation and the minimum value chosen so that it is not detected as silence by a Voice Activity Detector in the network.
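Constructing these two codebooks is straightforward; the short Python sketch below builds a 32-entry integer pitch codebook over the 35 to 66 sample range and a 32-entry exponential (logarithmically spaced) energy codebook. The energy end-points are illustrative assumptions rather than values specified in the patent.

import numpy as np

# 5-bit pitch codebook: the 32 integer lags from 35 to 66 samples, a range chosen so
# that doubling or halving any valid lag falls outside the codebook.
PITCH_CODEBOOK = np.arange(35, 67)

# 5-bit exponential energy codebook: logarithmically spaced between an assumed floor
# (high enough not to be treated as silence by a Voice Activity Detector) and an
# assumed ceiling (low enough to avoid saturation).
E_MIN, E_MAX = 50.0, 8000.0
ENERGY_CODEBOOK = np.geomspace(E_MIN, E_MAX, num=32)

print(len(PITCH_CODEBOOK), len(ENERGY_CODEBOOK))  # 32 32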
Finally, the vectors in the LSF codebook 40 must be as separated as possible. This is made more complex by the fact that LSFs are defined as vectors, and the problem is therefore multi-dimensional. Codebook design algorithms such as LBG (see Y. Linde, A. Buzo and R.M. Gray: "An Algorithm for Vector Quantizer Design", IEEE Transactions on Communications, Vol. 28, No. 1, 1980, pp 84-96) can be used to obtain an initial oversized codebook. This codebook can then be refined by using the modulator/demodulator and by finding out which vectors get wrongly decoded. These vectors are then removed from the codebook 40 until all remaining vectors in the codebook are decoded correctly when passed through the modem.
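The pruning step described above can be sketched as the following loop (Python). Here pass_through_channel stands in for the whole synthesis, GSM encode/decode and analysis chain, and decode_to_index for nearest-neighbour matching against the current codebook; both are assumed callables supplied by the surrounding system, not functions defined in the patent.

def refine_codebook(codebook, pass_through_channel, decode_to_index):
    """Prune an oversized codebook until every remaining vector survives the channel.

    Each entry is passed through the modulator/channel/demodulator chain; entries that
    are decoded to the wrong index are removed, and the survivors are tested again.
    """
    codebook = list(codebook)
    while True:
        bad = {i for i, vector in enumerate(codebook)
               if decode_to_index(pass_through_channel(vector), codebook) != i}
        if not bad:
            return codebook
        codebook = [v for i, v in enumerate(codebook) if i not in bad]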
Mobile networks often operate under difficult channel conditions, and therefore the characteristics of the received speech may vary from those of the original signal. Although error concealment techniques are used in the network to reduce this effect, some parameters may be decoded wrongly, introducing bit errors. In order to minimize this effect, it is preferable to order the codebooks in such a way that codebook entries which are similar also have indices close to each other in terms of Euclidean distance. This technique is known as Index Assignment (IA). If an error occurs during the parameter decoding, it is likely that the selected set will have been fairly close to the original set. Using IA, the decoded index will then only differ slightly from the original, reducing the overall bit error rate on the data channel. Index assignment will be most effective when used in conjunction with a block-based error correction scheme.
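One simple, purely illustrative way to approximate such an index assignment is a greedy ordering that places mutually similar codebook entries at neighbouring indices, as in the Python sketch below; the patent does not prescribe a particular assignment algorithm.

import numpy as np

def index_assign(codebook):
    """Greedy ordering: neighbouring indices hold similar entries, so a decoding error
    that lands on a nearby entry produces only a small index (and hence data) error."""
    cb = np.asarray(codebook, dtype=float)
    if cb.ndim == 1:
        cb = cb[:, None]                      # treat scalar codebooks as 1-D vectors
    remaining = list(range(len(cb)))
    order = [remaining.pop(0)]                # start arbitrarily from entry 0
    while remaining:
        last = cb[order[-1]]
        nearest = min(remaining, key=lambda i: float(np.linalg.norm(cb[i] - last)))
        order.append(nearest)
        remaining.remove(nearest)
    return order                              # order[k] = original index assigned to new index k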
In a second embodiment of the invention the data modulation and speech compression units of the first embodiment are combined to form a single, unified data mapping scheme. In this embodiment, the input data is mapped directly onto the syntax of the GSM EFR speech compression output, without the use of an intermediate interface between the modulator and the speech compression algorithms.
In the GSM EFR compression algorithm the parameters used for characterising the speech are:
[Table of the GSM EFR speech coding parameters and their bit allocations, reproduced as images in the original publication.]
The procedures for identifying the optimal speech characteristics for representing the input data, and for codebook design, are identical to those used in the first embodiment. The parameters which are chosen to transmit data are those which are most robust to coding and decoding on the network. In this case the spectral information, pitch delay, and pitch gain are used. Although these parameters give a total data throughput of 4200 bit/s, the expected data rate that can be supported by these parameters will be somewhat lower than this. Instead of producing a synthetic waveform, the data from each block of input data are mapped directly onto the GSM EFR speech compression parameters. Moreover, parameters which are not necessarily used to transmit data (such as the codebook excitation) may nevertheless be selected so as to enhance the spectral characteristics of the speech. For example, the excitation could be set to be in the form of regular pulses aligned with the fundamental pitch pulses generated by the speech synthesiser. This will have two main benefits. First, since the spectral characteristics of a pulse are flat, there will be minimum distortion to the information stored in the spectral envelope characteristics of the synthetic speech. Secondly, such an alignment can enhance the definition of the individual pitch pulses generated by the synthesiser.
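As a rough illustration of where the 4200 bit/s raw figure comes from, the Python fragment below totals the EFR bit allocations for the parameter fields named above (38 bits of spectral information, 30 bits of pitch delay and 16 bits of pitch gain per 20 ms frame). These per-field allocations are taken from the published EFR bit allocation and are stated here as an assumption rather than quoted from the patent.

# Raw capacity of the EFR parameter fields assumed to carry data (per 20 ms frame).
LSF_BITS = 38          # spectral information (LSF quantisation)
PITCH_DELAY_BITS = 30  # adaptive-codebook (pitch) delays over four subframes
PITCH_GAIN_BITS = 16   # pitch gains over four subframes

bits_per_frame = LSF_BITS + PITCH_DELAY_BITS + PITCH_GAIN_BITS   # 84 bits
frames_per_second = 1000 // 20                                   # 50 frames of 20 ms per second
print(bits_per_frame * frames_per_second, "bit/s raw capacity")  # 4200 bit/s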
Although such an implementation has the disadvantage of being tied to one particular speech coder, it has the advantage that the characteristics of the selected waveform-parameter pairs can be chosen so as to provide optimal performance for that particular compression scheme.

Claims

1. A method of transmitting a block of data (17) over a speech channel characterised in that it comprises the steps of converting the block of data (17) into a speech-like signal (19) having at least one speech-like characteristic defined by at least one parameter which is determined by the data (17) , and transmitting the converted data over the speech channel.
2. A method according to claim 1 wherein the speech-like signal (19) is a digital signal defining a speech-like waveform having said at least one speech-like characteristic.
3. A method according to claim 1 or claim 2 wherein the data (17) is converted to the speech-like signal (19) using a table (40, 42, 44) which maps each possible form of at least part of the data block (33) to a unique form of said at least one parameter.
4. A method according to any foregoing claim further comprising the step of converting the speech-like signal (19) to a compressed digital format in which said at least one parameter is defined by at least one section of data.
5. A method of transmitting a block of digital data (17) over a speech channel characterised in that it comprises the steps of converting the data to a digital compressed speech format such that at least one section of the data defines at least one parameter of a speech-like signal (19), and transmitting the converted data over the speech channel.
6. A method of transmitting a data stream (33) over a speech channel characterised in that it comprises dividing the data into a plurality of blocks, and transmitting each of the blocks using the method of any foregoing claim.
7. A method of receiving a block of data (29) over a speech channel characterised in that it comprises the steps of receiving a frame of speechlike signal (27) having at least one speech-like characteristic defined by at least one parameter, analysing the received signal to determine said at least one parameter, and reproducing the data (29) from the at least one parameter.
8. A method according to claim 7 wherein the received speech-like signal (27) is converted to the data using a table which maps each valid form of said at least one parameter to a unique form of at least part of the data block.
9. A method according to claim 7 or claim 8 wherein the speech-like signal (27) is obtained by receiving a signal (25) in digital compressed speech format representing at least one block of data, and decompressing the received signal to produce a speech-like signal (27) having a speech- like characteristic defined by at least one parameter determined by said at least one block of data (29) .
10. A method of receiving a block of data (29) over a speech channel characterised in that it comprises the steps of receiving a signal (25) in compressed speech format in which at least one group of bits represents at least one parameter defining a speech-like characteristic of a speech-like signal, and reproducing the data (29) from the bits.
11. A method of receiving data over a speech channel characterised in that it comprises dividing a received speech-like signal into a plurality of frames, and converting each of the frames to data (29) using the method of any of claims 7 to 10.
12. A method of communicating data (17) over a speech channel comprising transmitting the data (17) according to the method of any of claims 1 to 6 and receiving the data (29) according to the method of any of claims 7 to 11.
13. A method according to any foregoing claim wherein the at least one speech-like characteristic includes the fundamental frequency of the speech-like signal (19, 27) .
14. A method according to any foregoing claim wherein the at least one characteristic includes a characteristic of the spectral envelope of the speech-like signal (19, 27) .
15. A method according to any foregoing claim wherein the at least one characteristic includes the energy level of the speech-like signal (19, 27).
16. A method according to any foregoing claim wherein said at least one parameter comprises a plurality of parameters, and the or each block of data (17, 29) determines or is derived from the parameters.
17. A method according to any foregoing claim wherein the data (17, 29) represents a voice message.
18. A method according to claim 16 wherein the voice message is encrypted in the data.
19. Apparatus for transmitting a block of data (17) over a speech channel comprising converting means (18) arranged to convert the block of data into a speech-like signal (19) having at least one speech-like characteristic defined by at least one parameter which is determined by the data, and transmitting means (22) arranged to transmit the data over the speech channel.
20. Apparatus according to claim 19 wherein the speech-like signal (19) is a digital signal defining a speech-like waveform having said at least one speech like characteristic.
21. Apparatus according to claim 19 or claim 20 wherein the converting means (18) is arranged to convert the data to the speech-like signal using a table (40, 42, 44) which maps each possible form of at least part of the data block to a unique form of said at least one parameter.
22. Apparatus according to any of claims 19 to 21 further comprising compression means (20) arranged to convert the speech-like signal (19) to a compressed digital format (23) .
23. Apparatus for transmitting a block of data (17) over a speech channel comprising converting means (18) arranged to convert the data to a digital compressed speech format such that at least one section of the data determines at least one parameter which defines a speech-like characteristic of a speech-like signal (19) , and transmitting means (22) arranged to transmit the converted data over the speech channel.
24. Apparatus according to any of claims 19 to 23 further comprising input means (32) arranged to receive data and to divide the data into a plurality of blocks for inputting to the converting means.
25. Apparatus for receiving data over a speech channel comprising receiving means (24) for receiving a speech-like signal (27) representing a block of data (29) and having at least one speech-like characteristic defined by at least one parameter, analysing means (28) arranged to analyse the received signal to determine said at least one parameter, and converting means (28) arranged to reproduce the data (29) from the at least one determined parameter.
26. Apparatus according to claim 25 wherein the converting means (28) is arranged to reproduce the data using a table (58, 60, 62) which maps each valid form of said at least one parameter to a unique form of at least part of the data block.
27. Apparatus according to claim 25 or claim 26 further comprising decompression means (26) arranged to receive the data in digital compressed speech format (25) and to convert it to the speech-like signal (27) .
28. Apparatus for receiving data over a speech channel comprising receiving means (24) arranged to receive the data in digital compressed speech format in which at least one group of bits represents at least one parameter defining a speech-like characteristic of a speech-like signal, and converting means (28) arranged to reproduce the data from the bits.
29. Apparatus according to any of claims 25 to 28 wherein the receiving means (24, 50) is arranged to divide the speech-like signal into a plurality of frames, and the converting means (28) is arranged to convert each of the frames to a block of data.
30. Apparatus according to any of claims 19 to 29 wherein the at least one characteristic includes the fundamental frequency of the speech-like signal (27) .
31. Apparatus according to any of claims 19 to 30 wherein the at least one characteristic includes a characteristic of the spectral envelope of the speech-like signal (27).
32. Apparatus according to any of claims 19 to 31 wherein the at least one characteristic includes the energy level of the speech-like signal (27).
33. Apparatus according to any of claims 19 to 32 wherein said at least one parameter comprises a plurality of parameters, and the or each block of data (29) has a plurality of sections each of which determines or is derived from a respective one of the parameters.
34. Apparatus according to any of claims 19 to 33 wherein the data (29) represents a voice message.
35. Apparatus according to claim 34 wherein the voice message is encrypted in the data (29) .
36. A data communication system for communicating data over an analogue speech channel comprising apparatus according to any of claims 19 to 24 or 30 to 35 for transmitting the data and apparatus according to any of claims 25 to 29 or 30 to 35 for receiving the data.
37. A method of transmitting data over a speech channel substantially as hereinbefore described with reference to Figures 1 to 3 of the accompanying drawings.
38. A method of receiving data over a speech channel substantially as hereinbefore described with reference to Figures 1 and 4 of the accompanying drawings.
39. A method of communicating data over a speech channel substantially as hereinbefore described with reference to Figures 1 to 4 of the accompanying drawings.
40. Apparatus for transmitting data over a speech channel substantially as hereinbefore described with reference to Figures 1 to 3 of the accompanying drawings.
41. Apparatus for receiving data over a speech channel substantially as hereinbefore described with reference to Figures 1 and 4 of the accompanying drawings.
42. A data communications system substantially as hereinbefore described with reference to Figures 1 to 4 of the accompanying drawings.
PCT/GB2003/000721 2002-02-19 2003-02-19 Data transmission over a compressed speech channel WO2003071521A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003205902A AU2003205902A1 (en) 2002-02-19 2003-02-19 Data transmission over a compressed speech channel

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0203834.7 2002-02-19
GB0203834A GB0203834D0 (en) 2002-02-19 2002-02-19 Data transmission

Publications (1)

Publication Number Publication Date
WO2003071521A1 true WO2003071521A1 (en) 2003-08-28

Family

ID=9931313

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2003/000721 WO2003071521A1 (en) 2002-02-19 2003-02-19 Data transmission over a compressed speech channel

Country Status (3)

Country Link
AU (1) AU2003205902A1 (en)
GB (1) GB0203834D0 (en)
WO (1) WO2003071521A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005109923A1 (en) * 2004-05-08 2005-11-17 University Of Surrey Data transmission
WO2007080474A2 (en) * 2006-01-09 2007-07-19 Solectron Invotronics Modem for communicating data over a voice channel of a communications system
FR2935212A1 (en) * 2008-08-19 2010-02-26 Sagem Defense Securite Data signal i.e. voice signal, transmission method for telephonic network in e.g. hotel, involves decoding voice signal, at level of receiver, by extracting structural component and comparing component with look-up table to retrieve data
WO2011023208A1 (en) * 2009-08-26 2011-03-03 Continental Automotive Gmbh In-band modem signals for use on a cellular telephone voice channel
US9031833B2 (en) 2010-07-07 2015-05-12 Electronics And Telecommunications Research Institute Communication apparatus
CN107786574A (en) * 2017-11-10 2018-03-09 芯盾(北京)信息技术有限公司 The voice communication Source Encryption system of mobile terminal
CN109256141A (en) * 2018-09-13 2019-01-22 芯盾(北京)信息技术有限公司 The method carried out data transmission using voice channel

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999031895A1 (en) * 1997-12-15 1999-06-24 Telefonaktiebolaget Lm Ericsson (Publ) Data transmission over a coded voice channel
WO2001067671A2 (en) * 2000-03-06 2001-09-13 Meyer Thomas W Data embedding in digital telephone signals
WO2001072067A1 (en) * 2000-03-21 2001-09-27 Airbiquity Inc. Voiceband modem for data communications over digital wireless networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999031895A1 (en) * 1997-12-15 1999-06-24 Telefonaktiebolaget Lm Ericsson (Publ) Data transmission over a coded voice channel
WO2001067671A2 (en) * 2000-03-06 2001-09-13 Meyer Thomas W Data embedding in digital telephone signals
WO2001072067A1 (en) * 2000-03-21 2001-09-27 Airbiquity Inc. Voiceband modem for data communications over digital wireless networks

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005109923A1 (en) * 2004-05-08 2005-11-17 University Of Surrey Data transmission
AU2005242023B2 (en) * 2004-05-08 2009-08-13 Mulsys Ltd Data transmission
US8472508B2 (en) 2004-05-08 2013-06-25 Mulsys Ltd Data transmission
WO2007080474A2 (en) * 2006-01-09 2007-07-19 Solectron Invotronics Modem for communicating data over a voice channel of a communications system
WO2007080474A3 (en) * 2006-01-09 2007-10-18 Solectron Invotronics Modem for communicating data over a voice channel of a communications system
US7864717B2 (en) 2006-01-09 2011-01-04 Flextronics Automotive Inc. Modem for communicating data over a voice channel of a communications system
DE112007000123B4 (en) * 2006-01-09 2011-09-08 Solectron Invotronics A modem for communicating data over a voice channel of a communication system
FR2935212A1 (en) * 2008-08-19 2010-02-26 Sagem Defense Securite Data signal i.e. voice signal, transmission method for telephonic network in e.g. hotel, involves decoding voice signal, at level of receiver, by extracting structural component and comparing component with look-up table to retrieve data
WO2011023208A1 (en) * 2009-08-26 2011-03-03 Continental Automotive Gmbh In-band modem signals for use on a cellular telephone voice channel
US9031833B2 (en) 2010-07-07 2015-05-12 Electronics And Telecommunications Research Institute Communication apparatus
CN107786574A (en) * 2017-11-10 2018-03-09 芯盾(北京)信息技术有限公司 The voice communication Source Encryption system of mobile terminal
CN109256141A (en) * 2018-09-13 2019-01-22 芯盾(北京)信息技术有限公司 The method carried out data transmission using voice channel

Also Published As

Publication number Publication date
AU2003205902A1 (en) 2003-09-09
GB0203834D0 (en) 2002-04-03

Similar Documents

Publication Publication Date Title
EP1222659B1 (en) Lpc-harmonic vocoder with superframe structure
JP3542610B2 (en) Audio signal processing apparatus and audio information data / frame processing method
CN1326415C (en) Method for conducting code conversion to audio-frequency signals code converter, network unit, wivefree communication network and communication system
KR100923891B1 (en) Method and apparatus for interoperability between voice transmission systems during speech inactivity
KR100594670B1 (en) Automatic speech/speaker recognition over digital wireless channels
KR101061404B1 (en) How to encode and decode audio at variable rates
AU2005242023B2 (en) Data transmission
JPH02123828A (en) Sub-band coding method and device
US6301265B1 (en) Adaptive rate system and method for network communications
JPS60116000A (en) Voice encoding system
EP0152430A1 (en) Apparatus and methods for coding, decoding, analyzing and synthesizing a signal.
JP2009541797A (en) Vocoder and associated method for transcoding between mixed excitation linear prediction (MELP) vocoders of various speech frame rates
JP2002055699A (en) Device and method for encoding voice
Katugampala et al. Secure voice over GSM and other low bit rate systems
WO2003071521A1 (en) Data transmission over a compressed speech channel
EP1020848A2 (en) Method for transmitting auxiliary information in a vocoder stream
US6028890A (en) Baud-rate-independent ASVD transmission built around G.729 speech-coding standard
KR920007093B1 (en) Spectrally efficient method for communicating an information signal
Choi et al. Effects of packet loss on 3 toll quality speech coders
Zhang et al. Embedded RPE based on multistage coding
Campanella VOICE PROCESSING TECHNIQUES
KR940008741B1 (en) Voice encoding/decoding method
JP2000206999A (en) Voice code transmission device
Zarkadis et al. Performance considerations of a 9.6 kb/s RELP coder for mobile telephony
KR20050059572A (en) Apparatus for changing audio level and method thereof

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP