GB2598104A - Discontinuous transmission operation for spatial audio parameters - Google Patents

Discontinuous transmission operation for spatial audio parameters

Info

Publication number
GB2598104A
Authority
GB
United Kingdom
Prior art keywords
spatial
samples
audio
group
discontinuous transmission
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB2012787.4A
Other versions
GB202012787D0 (en)
Inventor
Anssi Sakari Rämö
Lasse Juhani Laaksonen
Adriana Vasilache
Tapani Johannes Pihlajakuja
Mikko-Ville Ilari Laitinen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB2012787.4A
Publication of GB202012787D0
Priority to EP21857821.9A
Priority to PCT/FI2021/050540
Publication of GB2598104A


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/012: Comfort noise or silence coding
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L 19/16: Vocoder architecture
    • G10L 19/18: Vocoders using multiple modes
    • G10L 19/22: Mode decision, i.e. based on audio signal content versus external parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Stereophonic System (AREA)

Abstract

An Immersive Voice Audio Service (IVAS) spatial audio coder 207 capable of operating in active and discontinuous transmission (DTX) modes obtains an audio signal and spatial metadata 206 from which two spatial direction parameters are generated at different times and used to decide whether to terminate Discontinuous Transmission Mode 213. The direction parameters may be converted from spherical to Cartesian co-ordinates and weighted by average sample energy.

Description

DISCONTINUOUS TRANSMISSION OPERATION FOR SPATIAL AUDIO PARAMETERS
Field
The present application relates to apparatus and methods for terminating the discontinuous transmission operation at an encoder for spatial audio parameter encoding and audio signal encoding.
Background
Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the Immersive Voice and Audio Services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network, including use in immersive services such as, for example, immersive voice and audio for virtual reality (VR). This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
Voice Activity Detection (VAD), also known as speech activity detection or more generally as signal activity detection is a technique used in various speech processing algorithms, most notably speech codecs, for detecting the presence or absence of human speech. It can be generalized to detection of active signal, i.e., a sound source other than background noise. Based on a VAD decision, it is possible to utilize, e.g., a certain encoding mode in a speech encoder.
Discontinuous Transmission (DTX) is a technique utilizing VAD intended to temporarily shut off parts of active signal processing (such as speech coding according to certain modes) and the frame-by-frame transmission of encoded audio. For example, rather than transmitting normal encoded frames, a simplified update frame is sent to drive a comfort noise generator (CNG) at the decoder. The use of DTX can help with reducing interference and/or preserving/reallocating capacity in a practical mobile network. Furthermore, the use of DTX can also help with battery life of the device. Especially in low-bit-rate operation, use of DTX can in some cases also enhance the user experience by use of comfort noise instead of poorly encoded background noise for audio presentation.
Comfort Noise Generation (CNG) is a technique for creating a synthetic background noise to fill silence periods that would otherwise be observed. For example comfort noise generation can be implemented under a DTX operation.
Silence Descriptor (SID) frames can be sent during speech inactivity to keep the receiver CNG reasonably well aligned with the background noise level at the sender side. This is of particular importance at the onset of each new talk spurt; thus, SID frames should not be too old when speech starts again. Commonly, SID frames are sent regularly, e.g., every 8th frame, but some codecs also allow variable-rate SID updates. SID frames are typically quite small: e.g., a 2.4 kbit/s SID bitrate equals 48 bits per frame for a typical 20 ms frame size.
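As a worked check of those figures (the per-frame size follows directly from the bitrate and frame length; the average overhead assumes the every-8th-frame update interval mentioned above):

$$2400\ \text{bit/s} \times 0.020\ \text{s} = 48\ \text{bits per SID frame}$$

$$\frac{48\ \text{bits}}{8 \times 0.020\ \text{s}} = 300\ \text{bit/s average SID overhead}$$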
Summary
There is provided according to a first aspect an apparatus comprising means configured to: obtain at least one audio signal and audio spatial parameters associated with the at least one audio signal; generate a spatial direction parameter associated with the audio spatial parameters at a first point in time associated with a discontinuous transmission mode of operation of the apparatus; generate a further spatial direction parameter associated with the audio spatial parameters at a second point in time during the discontinuous transmission mode of operation of the apparatus; and determine whether to terminate the discontinuous transmission mode of operation of the apparatus dependent upon a comparison of the further spatial direction parameter and the spatial direction parameter.
The spatial direction parameter may be generated before the further spatial direction parameter.
The first point in time may comprise at least one of: at a start or substantially at a start of the discontinuous transmission mode of operation of the apparatus; at an audio frame preceding a start of the discontinuous transmission mode of operation of the apparatus; and at an audio frame after or at the time of a silence descriptor update.
The apparatus may further comprise means configured to determine an energy of a first group of samples of the one or more audio signals and an energy of at least a second group of samples of the one or more audio signals at the first point in time.
The means configured to generate the spatial direction parameter may comprise means configured to: convert a first spherical direction vector associated with the first group of samples to a first cartesian direction vector, wherein the first cartesian direction vector comprises an x-axis component, a y-axis component and a z-axis component and, for each single component in turn, weight the component of the cartesian direction vector by the energy of the first group of samples and a direct to total energy ratio calculated for the first group of samples; convert a second spherical direction vector associated with the second group of samples to a second cartesian direction vector, wherein the second cartesian direction vector comprises an x-axis component, a y-axis component and a z-axis component and, for each single component in turn, weight the component of the cartesian direction vector by the energy of the second group of samples and a direct to total energy ratio calculated for the second group of samples; and divide the sum of the first cartesian direction vector and the at least second cartesian direction vector by the sum of the energy of the first group of samples and the energy of the second group of samples.
The first group of samples may comprise time domain samples of a subframe and frequency domain samples of a sub band, and wherein the at least second group of samples comprises time domain samples of a further subframe and frequency domain samples of a further sub band.
The apparatus having the means configured to determine whether to terminate the discontinuous transmission mode of operation of the apparatus dependent upon a comparison of the further spatial direction parameter and the spatial direction parameter may further comprise means configured to: determine whether to terminate the discontinuous transmission mode of operation of the apparatus dependent upon a comparison of a difference between the spatial direction parameter and the further spatial direction parameter to a threshold value.
According to a second aspect there is provided a method comprising: obtaining at least one audio signal and audio spatial parameters associated with the at least one audio signal; generating a spatial direction parameter associated with the audio spatial parameters at a first point in time associated with a discontinuous transmission mode of operation of the apparatus; generating a further spatial direction parameter associated with the audio spatial parameters at a second point in time during the discontinuous transmission mode of operation of the apparatus; and determining whether to terminate the discontinuous transmission mode of operation of the apparatus dependent upon a comparison of the further spatial direction parameter and the spatial direction parameter.
The spatial direction parameter may be generated before the further spatial direction parameter.
The first point in time may comprise at least one of: at a start or substantially at a start of the discontinuous transmission mode of operation of the apparatus; at an audio frame preceding a start of the discontinuous transmission mode of operation of the apparatus; and at an audio frame after or at the time of a silence descriptor update.
The method may further comprise determining an energy of a first group of samples of the one or more audio signals and an energy of at least a second group of samples of the one or more audio signals at the first point in time.
Generating a spatial direction parameter may comprise: converting a first spherical direction vector associated with the first group of samples to a first cartesian direction vector, wherein the first cartesian direction vector comprises an x-axis component, a y-axis component and a z-axis component and, for each single component in turn, weighting the component of the cartesian direction vector by the energy of the first group of samples and a direct to total energy ratio calculated for the first group of samples; converting a second spherical direction vector associated with the second group of samples to a second cartesian direction vector, wherein the second cartesian direction vector comprises an x-axis component, a y-axis component and a z-axis component and, for each single component in turn, weighting the component of the cartesian direction vector by the energy of the second group of samples and a direct to total energy ratio calculated for the second group of samples; and dividing the sum of the first cartesian direction vector and the at least second cartesian direction vector by the sum of the energy of the first group of samples and the energy of the second group of samples.
The first group of samples comprises time domain samples of a subframe and frequency domain samples of a sub band, and wherein the at least second group of samples comprises time domain samples of a further subframe and frequency domain samples of a further sub band.
Determining whether to terminate the discontinuous transmission mode of operation of the apparatus dependent upon a comparison of the further spatial direction parameter and the spatial direction parameter may further comprise: determining whether to terminate the discontinuous transmission mode of operation of the apparatus dependent upon a comparison of a difference between the spatial direction parameter and the further spatial direction parameter to a threshold value.
According to a third aspect there is provided an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain at least one audio signal and audio spatial parameters associated with the at least one audio signal; generate a spatial direction parameter associated with the audio spatial parameters at a first point in time associated with a discontinuous transmission mode of operation of the apparatus; generate a further spatial direction parameter associated with the audio spatial parameters at a second point in time during the discontinuous transmission mode of operation of the apparatus; and determine whether to terminate the discontinuous transmission mode of operation of the apparatus dependent upon a comparison of the further spatial direction parameter and the spatial direction parameter.
According to a fourth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus to perform at least the following: obtain at least one audio signal and audio spatial parameters associated with the at least one audio signal; generate a spatial direction parameter associated with the audio spatial parameters at a first point in time associated with a discontinuous transmission mode of operation of the apparatus; generate a further spatial direction parameter associated with the audio spatial parameters at a second point in time during the discontinuous transmission mode of operation of the apparatus; and determine whether to terminate the discontinuous transmission mode of operation of the apparatus dependent upon a comparison of the further spatial direction parameter and the spatial direction parameter.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings, in which:
Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments;
Figure 2 shows schematically an example IVAS encoder according to some embodiments;
Figure 3 shows schematically parts of a spatial audio parameter encoder according to some embodiments;
Figure 4 shows a flow diagram of encoding according to some embodiments; and
Figure 5 shows schematically an example device suitable for implementing the apparatus shown.
Embodiments of the Application
The concept as discussed in the embodiments of the invention relates to speech and audio codecs and in particular immersive audio codecs supporting a multitude of operating points ranging from a low bit rate operation to transparency as well as a range of service capabilities, e.g., from mono to stereo to fully immersive audio encoding/decoding/rendering. An example of such a codec is the 3GPP IVAS codec discussed above. The input signals are presented to the IVAS encoder in one of the supported formats (and in some allowed combinations of the formats). Similarly, it is expected that the decoder can output the audio in a number of supported formats. A pass-through mode has been proposed, where the audio could be provided in its original format after transmission (encoding/decoding).
For example, a mono audio signal (without metadata) may be encoded using an Enhanced Voice Service (EVS) encoder. Other input formats may utilize new IVAS encoding tools. One input format proposed for IVAS is the Metadata-assisted spatial audio (MASA) format, where the encoder may utilize, e.g., a combination of mono and stereo encoding tools and metadata encoding tools for efficient transmission of the format. MASA is a parametric spatial audio format suitable for spatial audio processing. Parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound (or sound scene) is described using a set of parameters. For example, in parametric spatial audio capture from microphone arrays, it is a typical and effective choice to estimate from the microphone array signals a set of parameters such as directions of the sound in frequency bands, and the relative energies of the directional and non-directional parts of the captured sound in frequency bands, expressed for example as a direct-to-total ratio or an ambient-to-total energy ratio in frequency bands. These parameters are known to well describe the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers, or to other formats, such as Ambisonics.
For example, there can be two channels (stereo) of audio signals and spatial metadata. The spatial metadata may furthermore define parameters such as: Direction index, describing a direction of arrival of the sound at a time-frequency parameter interval; level/phase differences; Direct-to-total energy ratio, describing an energy ratio for the direction index; Diffuseness; Coherences such as Spread coherence, describing a spread of energy for the direction index; Diffuse-to-total energy ratio, describing an energy ratio of non-directional sound over surrounding directions; Surround coherence, describing a coherence of the non-directional sound over the surrounding directions; Remainder-to-total energy ratio, describing an energy ratio of the remainder (such as microphone noise) sound energy to fulfil the requirement that the sum of energy ratios is 1; Distance, describing a distance of the sound originating from the direction index in meters on a logarithmic scale; covariance matrices related to a multi-channel loudspeaker signal, or any data related to these covariance matrices; other parameters for guiding or controlling a specific decoder, e.g., VAD/DTX/CNG/SID parameters. Any of these parameters can be determined in frequency bands.
As discussed above Voice Activity Detection (VAD) may be employed in such a codec to control Discontinuous Transmission (DTX), Comfort Noise Generation (CNG) and Silence Descriptor (SID) frames.
Furthermore, as discussed above CNG is a technique for creating a synthetic background noise to fill silence periods that would otherwise be observed, e.g., under the DTX operation. A complete silence can be confusing or annoying to a receiving user. For example, the listener could judge that the transmission may have been lost and then unnecessarily say "hello, are you still there?" to confirm or simply hang up. On the other hand, sudden changes in sound level (from total silence to active background and speech or vice versa) could also be very annoying. Thus, CNG is applied to prevent a sudden silence or sudden change. Typically, the CNG audio signal output is based on a highly simplified transmission of noise parameters.
Legacy mono codecs, such as EVS, do not provide spatial audio DTX, CNG and SID implementations. In particular, the implementation of DTX operation for spatial audio is likely to change the "feel" and level of the background noise that will be observed by the user. These changes in the spatialization of the background noise may be perceived by the listener as being annoying or confusing. For example, in some embodiments as discussed herein the background noise is provided such that it is experienced as coming from the same direction(s) both during active and inactive speech periods.
For example, a user is talking in front of a spatial capture device with a busy road behind the device. The spatial audio capture then has a constant traffic hum (that is somewhat diffuse), specific traffic noises (e.g., car horns) coming mainly from behind, and of course the talker's voice coming from the front. When DTX is active and the user is not talking, both the N channels and the spatial metadata transmission can be shut off to save transmission bandwidth (DTX active). In the absence of regular spatial audio coding, CNG provides a static background hum that is not too different, e.g., in terms of overall level and spectral shape, from the original captured background noise. Crude background noise description (SID) updates may be transmitted during the inactive period (with EVS this is mono, and with IVAS typically a stereo SID/CNG could be considered), the purpose of which is to keep the signal properties (spectrum and energy) aligned between encoder and decoder. However, for conversational spatial audio coding under DTX conditions a problem exists whereby the spatial characteristics between the receiving side and sending side may significantly diverge from each other. The embodiments as discussed herein attempt to measure perceptually relevant spatial changes such that if there is any significant divergence during the DTX period it may be terminated. This may be performed at the encoder in addition to the spectral and energy changes a typical VAD algorithm may use in order to determine the DTX/active signal state. The DTX/active signal state detector algorithm may determine a measure of spatial change using the downmixed audio signal and accompanying spatial metadata at multiple points in time during a DTX period to provide a snapshot of the stability of the spatial audio during the DTX period. The stability of the spatial audio signal may be used to determine whether there are large swings in the spatial aspect of the audio signal at the encoder which would not necessarily be catered for at the decoder during a DTX period. This information may be used to abort any DTX operation at the encoder.
As such the embodiments as described herein attempt to provide an optimal DTX operation for updating CNG / SID parameters for parametric spatial audio such as MASA.
Thus, some embodiments comprise an IVAS apparatus or system configured to implement a DTX / CNG system where the parametric spatial audio CNG is based either on a spatial audio DTX or a mono/stereo DTX. This means that spatial audio parameters are updated substantially synchronously with the core audio DTX. Also, the embodiments as discussed herein are configured to implement CNG and possible SID updates such that they can work substantially synchronously with the core audio codec.
The proposed apparatus is configured such that it may be capable of meeting the backward interoperability constraints expected of IVAS. Having interoperability with EVS is an important feature. The full EVS codec algorithm shall be part of the IVAS codec solution. EVS bit-exact processing shall be used when the input to the IVAS codec is a simple mono signal without spatial metadata and should also be applied whenever possible. When multiple mono audio channels without spatial metadata are negotiated, they shall all be bit-exact with EVS.
In particular the IVAS codec as implemented in some embodiments herein may be configured to support certain stereo modes of operation which include an embedded bit-exact EVS mono downmix bitstream at the bit rates from 9.6 kbit/s to 24.4 kbit/s SWB (9.6/13.2/16.4/24.4 kbit/s).
This requirement for an embedded bit-exact EVS mono downmix bitstream delivered as part of certain stereo modes of operation is such that, for such stereo modes of operation, an EVS mono encoding with some additional separate encoded data needs to provide stereo audio playback for a stereo input. The additional separate encoded data can be removed/stripped/ignored to recover an EVS mono bitstream. By bit-exact operation it is generally meant that the EVS mono bitstream needs to fully comply with one encoded and decoded with an external "legacy" EVS (i.e., the EVS standard).
The embodiments may furthermore be configured such that when the embedded stereo/spatial IVAS codec backwards compatible mode of operation is in DTX operation, it is configured to be compliant with EVS DTX operation. Thus, any additional DTX data should be at minimal cost for optimal operation. Transmitting DTX data twice will complicate the system.
The embodiments as discussed herein relate to an encoder-side spatial DTX update system for a spatial audio codec utilizing a parametric spatial audio description. The embodiments may be implemented within or apply to codecs such as IVAS and its MASA input or MASA coding, when DTX functionality is used.
The embodiments as described herein have the additional advantage that they can be utilized for embedded spatial extensions of mono/stereo codecs with or without additional SID updates for the spatial part. For example, some embodiments as described herein may be implemented within an embedded stereo codec.
In some embodiments a MASA input to the IVAS encoder comprises a suitable number of audio signals (for example 1 to 4 audio signals) and metadata. It should be noted that MASA encoding can be an efficient representation also for other spatial inputs besides a dedicated MASA input. For example, channel-based inputs or Ambisonics (FOA, HOA) inputs could be transformed into a MASA format representation inside the audio encoder.
Figure 1 presents a high-level overview of a suitable system or apparatus for IVAS coding and decoding which is suitable for implementing embodiments as described herein. The system 100 is shown comprising an (IVAS) input 111. The IVAS input 111 can comprise at least one of any number of suitable input formats, combinations of which are currently being defined by the Third Generation Partnership Project (3GPP). For example, as shown in Figure 1 there is shown a mono audio signal input 112. The mono audio signal input 112 may in some embodiments be passed to the encoder 121 and specifically to an Enhanced Voice Services (EVS) encoder 123. Furthermore there is shown a stereo and binaural audio signal input 113. The stereo and binaural audio signal input 113 in some embodiments is passed to the encoder 121 and specifically to the (IVAS) spatial audio encoder 125. Figure 1 also shows a Metadata-Assisted Spatial Audio (MASA) signal input 114. The Metadata-Assisted Spatial Audio (MASA) signal input in some embodiments is passed to the encoder 121. Specifically the audio component of the MASA input is passed to the (IVAS) spatial audio encoder 125 and the metadata component is passed to a metadata quantizer/encoder 127. Another input format shown in Figure 1 is an ambisonic audio signal, which may comprise a first order Ambisonics (FOA) and/or higher order Ambisonics (HOA) audio signal 115. The first order Ambisonics (FOA) and/or higher order Ambisonics (HOA) audio signal 115 in some embodiments is passed to the encoder 121 and specifically to the (IVAS) spatial audio encoder 125. Furthermore shown in Figure 1 is a channel-based audio signal input 116. This may be any suitable input audio channel format, for example 5.1 channel format, 7.1 channel format etc. The channel-based audio signal input 116 in some embodiments is passed to the encoder 121 and specifically to the (IVAS) spatial audio encoder 125. The final example input shown in Figure 1 is an object (or audio object) signal input 117. The object signal input in some embodiments is passed to the encoder 121. Specifically the audio component of the object signal input is passed to the (IVAS) spatial audio encoder 125 and the metadata component is passed to a metadata quantizer/encoder 127.
Figure 1 furthermore shows an (IVAS) encoder 121. The (IVAS) encoder 121 is configured to receive the audio signal from the input and encode it to produce a suitable format encoded bitstream 131. The (IVAS) encoder 121 in some embodiments as shown in Figure 1 comprises an EVS encoder 123 configured to receive any mono audio signals 112 and encode them according to an EVS codec definition.
Furthermore the (IVAS) encoder 121 is shown comprising an (IVAS) spatial audio encoder 125. The (IVAS) spatial audio encoder 125 is configured to receive the audio signals or audio signal components and encode the audio signals based on a suitable definition or coding mechanism. In some embodiments the spatial audio encoder 125 is configured to reduce the number of audio signals being encoded before the signals are encoded. For example in some embodiments the spatial audio encoder is configured to combine or otherwise downmix the input audio signals.
In some embodiments, for example when the input type is MASA signals, the spatial audio encoder is configured to encode the audio signals as a mono or stereo (downmix) signal.
The spatial audio encoder 125 may comprise an audio encoder core which is configured to receive the downmix or the audio signals directly and generate a suitable encoding of these audio signals. The encoder 121 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
In some embodiments the encoder 121 comprises a metadata quantizer/encoder 127. The metadata quantizer/encoder 127 is configured to receive the metadata, for example from the MASA input or the objects, and generate a suitable quantized and/or encoded metadata bitstream suitable for being combined with or associated with the encoded audio signal bitstream and being output as part of the (IVAS) bitstream 131.
Furthermore, as shown in Figure 1 there is shown a (IVAS) decoder 141. The decoder 141 in some embodiments comprises a metadata dequantizer/decoder 147. The metadata dequantizer/decoder 147 is configured to receive the encoded metadata, for example from the IVAS bitstream 131 and generate a metadata bitstream suitable for rendering the audio signals within the stereo and spatial audio decoder 145.
Figure 1 furthermore shows the (IVAS) decoder 141 comprising an EVS decoder 143. The EVS decoder 143 is configured to receive the EVS encoded mono audio signals as part of the IVAS bitstream 131 and decode them to generate a suitable mono audio signal which can be passed to an internal renderer (for example the stereo and spatial decoder) or suitable external renderer.
Additionally, in some embodiments the (IVAS) decoder 141 comprises a stereo and spatial audio signal decoder 145. The stereo and spatial audio signal decoder 145 in some embodiments is configured to receive the encoded audio signals and generate a suitable decoded spatial audio signal which can be rendered internally (for example by the stereo and spatial audio signal decoder) or suitable external renderer.
Therefore, in summary, first the system is configured to receive a suitable audio signal format. In some embodiments the system is configured to generate audio signals (a downmix, or more generally transport audio signals). The system is then configured to encode for storage/transmission the audio signals. After this the system may store/transmit the encoded audio signals and metadata.
The system may retrieve/receive the encoded audio signals and metadata. Then the system is configured to extract the audio signals and metadata from encoded audio signals and metadata parameters, for example demultiplex and decode the encoded audio signals and metadata parameters.
The system may furthermore be configured to synthesize an output multi-channel audio signal based on the extracted audio signals and metadata.
Figure 2 depicts an example apparatus and system for implementing embodiments of the application. The system 200 is shown with an 'analysis' part 221. The 'analysis' part 221 is the part from receiving the multi-channel loudspeaker signals up to an encoding of the metadata and downmix signal.
The input to the system 200 and the 'analysis' part 221 is the multi-channel signals 202. In the following examples a microphone channel signal input is described; however, any suitable input (or synthetic multi-channel) format may be implemented in other embodiments. For example, in some embodiments the spatial analyser and the spatial analysis may be implemented external to the encoder. For example, in some embodiments the spatial metadata associated with the audio signals may be provided to an encoder as a separate bit-stream. In some embodiments the spatial metadata may be provided at least as a set of directions which may for example be represented as spatial direction index values. These are examples of a metadata-based audio input format.
The multi-channel signals are passed to a transport signal generator 203 and to an analysis processor 205.
In some embodiments the transport signal generator 203 is configured to receive the multi-channel signals and generate a suitable transport signal comprising a determined number of channels and output the transport signals 204. For example, the transport signal generator 203 may be configured to generate a 2-audio channel downmix of the multi-channel signals. The determined number of channels may be any suitable number of channels. The transport signal generator in some embodiments is configured to otherwise select or combine, for example, by beamforming techniques the input audio signals to the determined number of channels and output these as transport signals.
In some embodiments the transport signal generator 203 is optional and the multi-channel signals are passed unprocessed to an encoder 207 in the same manner as the transport signals are in this example.
In some embodiments the analysis processor 205 is also configured to receive the multi-channel signals and analyse the signals to produce metadata 206 associated with the multi-channel signals and thus associated with the transport signals 204. The analysis processor 205 may be configured to generate the metadata which may comprise, for each time-frequency analysis interval, a direction parameter 208 and an energy ratio parameter 210 and a coherence parameter 212 (and in some embodiments a diffuseness parameter). The direction, energy ratio and coherence parameters may in some embodiments be considered to be spatial audio parameters. In other words, the spatial audio parameters comprise parameters which aim to characterize the sound-field created/captured by the multi-channel signals (or two or more audio signals in general).
In some embodiments the parameters generated may differ from frequency band to frequency band. Thus, for example in band X all of the parameters are generated and transmitted, whereas in band Y only one of the parameters is generated and transmitted, and furthermore in band Z no parameters are generated or transmitted. A practical example of this may be that for some frequency bands such as the highest band some of the parameters are not required for perceptual reasons. The transport signals 204 and the metadata 206 may be passed to an encoder 207.
The encoder 207 may comprise an audio encoder core 209 which is configured to receive the transport (for example downmix) signals 204 and generate a suitable encoding of these audio signals. The encoder 207 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
The encoding may be implemented using any suitable scheme. The encoder 207 may furthermore comprise a metadata encoder/quantizer 211 which is configured to receive the metadata and output an encoded or compressed form of the information. In some embodiments the encoder 207 may further interleave, multiplex to a single data stream or embed the metadata within the encoded downmix signals before transmission or storage, shown in Figure 1 as the bitstream 131. The multiplexing may be implemented using any suitable scheme.
Also shown in Figure 2 is a DTX determiner 213 which may be used to determine the start of a DTX period by using the transport signal 204. For those embodiments which deploy spatial audio encoding, the spatial DTX determiner 307, which can be embedded within the DTX determiner 213, may use the spatial audio metadata as determined by the analysis processor 205. Also, some variants of the DTX determiner may use the multichannel audio signals 202 and/or the transport signals 204 in addition to the spatial audio metadata 206. In embodiments the DTX determiner 213 may be based around a typical VAD algorithm which may use changes in the audio signal spectrum and audio signal energy (from 202 or 204) in addition to changes in the spatial audio data 206. It is to be understood that the following description focuses on how the changes in the spatial audio data may be used to influence the DTX period determination. The change in the spatial audio data may be used in conjunction with other VAD determination parameters in order to provide a decision matrix in which a DTX period can be determined both in terms of when DTX is activated and the length of the DTX period. The output from the DTX determiner can be an Active/DTX signal 215 which indicates whether the current frame is part of a DTX period or whether the current frame is in normal active operation. This Active/DTX signal 215 may then be fed to the audio encoder 209, whereby the audio encoder 209 may use the signal in order to determine whether the encoder operates in DTX mode or active mode.
Therefore, in summary first the system (analysis part) is configured to receive multichannel audio signals.
Then the system (analysis part) is configured to generate a suitable transport audio signal (for example by selecting or downmixing some of the audio signal channels) and the spatial audio parameters as metadata.
The system is then configured to encode for storage/transmission the transport signal and the metadata.
After this the system may store/transmit the encoded transport signal and metadata.
The system may retrieve/receive the encoded transport signal and metadata.
Then the system is configured to extract the transport signal and metadata from encoded transport signal and metadata parameters, for example demultiplex and decode the encoded transport signal and metadata parameters.
The system (synthesis part) is configured to synthesize an output multi-channel audio signal based on extracted transport audio signals and metadata.
With respect to Figure 3 an example analysis processor 205 (as shown in Figure 2) according to some embodiments is described in further detail. Included in Figure 3 is the spatial DTX determiner functional block 307 which may form part of the overall DTX determiner 213 and consequently contribute to the overall DTX signal determination process. The spatial DTX determiner 307 depicts how a spatial component to an overall DTX determination algorithm may be integrated into the DTX determiner 213 for an IVAS type codec. For instance, the spatial DTX determiner 307 may form part of the analysis processor 205 as depicted in Figure 3 and which then may feed into the DTX determiner 213 to contribute to the decision matrix of the overall DTX signal generation step. Alternatively, the spatial DTX determiner 307 may be integrated into the overall DTX determiner 213, in which case the DTX determiner 307 would have as inputs the estimated energies and spatial audio parameters (as metadata) 206.
The analysis processor 205 in some embodiments comprises a time-frequency domain transformer 301.
In some embodiments the time-frequency domain transformer 301 is configured to receive the multi-channel signals 202 and apply a suitable time to frequency domain transform such as a Short Time Fourier Transform (STFT) in order to convert the input time domain signals into suitable time-frequency signals. These time-frequency signals may be passed to a spatial analyser 303.
Thus for example, the time-frequency signals 302 may be represented in the time-frequency domain as $S(i,b,n)$, where $b$ is the frequency bin index, $n$ is the time-frequency block (frame) index and $i$ is the channel index. In another expression, $n$ can be considered as a time index with a lower sampling rate than that of the original time-domain signals. These frequency bins can be grouped into sub bands that group one or more of the bins into a sub band of a band index $k = 0, \ldots, K-1$. Each sub band $k$ has a lowest bin $b_{k,low}$ and a highest bin $b_{k,high}$, and the sub band contains all bins from $b_{k,low}$ to $b_{k,high}$. The widths of the sub bands can approximate any suitable distribution, for example the Equivalent Rectangular Bandwidth (ERB) scale or the Bark scale.
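As an illustration of such a grouping, the following minimal Python sketch builds sub band limits whose widths grow with frequency; the geometric spacing and the bin/band counts are our own stand-ins, not the actual ERB/Bark tables of any codec.

```python
import numpy as np

def subband_limits(n_bins: int, n_bands: int):
    """Group STFT bins 0..n_bins-1 into n_bands sub bands whose widths
    grow with frequency (a crude ERB/Bark-like distribution)."""
    edges = np.geomspace(1, n_bins + 1, n_bands + 1).round().astype(int)
    # Force strictly increasing edges so no sub band is empty.
    edges = np.maximum.accumulate(np.maximum(edges, np.arange(n_bands + 1) + 1)) - 1
    # Each sub band k spans bins b_k_low .. b_k_high inclusive.
    return [(int(lo), int(hi - 1)) for lo, hi in zip(edges[:-1], edges[1:])]

# e.g. 480 STFT bins grouped into 24 sub bands of (b_k_low, b_k_high)
bands = subband_limits(n_bins=480, n_bands=24)
```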
A time frequency (TF) tile (or block) is thus a specific sub band within a subframe of the frame.
It can be appreciated that the number of bits required to represent the spatial audio parameters may be dependent at least in part on the TF (time-frequency) tile resolution (i.e., the number of TF subframes or tiles). For example, a 20 ms audio frame may be divided into 4 time-domain subframes of 5 ms apiece, and each time-domain subframe may have up to 24 frequency subbands divided in the frequency domain according to a Bark scale, an approximation of it, or any other suitable division. In this particular example the audio frame may be divided into 96 TF subframes/tiles, in other words 4 time-domain subframes with 24 frequency subbands.
Returning to Figure 3, the time-frequency signals 302 may be passed to an energy estimator 305, whereby the energy of each frequency sub band k may be determined for all channels i of the time-frequency signals 302. In embodiments this operation may be expressed as

$$E(k,n) = \sum_{i} \sum_{b=b_{k,low}}^{b_{k,high}} |S(i,b,n)|^{2}$$

where the time-frequency audio signals are denoted as $S(i,b,n)$, $i$ is the channel index, $b$ is the frequency bin index, $n$ is the temporal sub-frame index, $b_{k,low}$ is the lowest bin of the band $k$ and $b_{k,high}$ is the highest bin.
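A minimal sketch of such an estimator, assuming a complex STFT array S of shape (channels, bins, subframes) and the (b_k_low, b_k_high) limits from the earlier sketch; summing over channels is our reading of "for all channels i":

```python
import numpy as np

def estimate_band_energies(S, band_limits):
    """E(k, n): sum of |S(i, b, n)|^2 over all channels i and over the
    bins b_k_low..b_k_high of each sub band k, per subframe n."""
    E = np.zeros((len(band_limits), S.shape[2]))
    for k, (b_low, b_high) in enumerate(band_limits):
        E[k] = np.sum(np.abs(S[:, b_low:b_high + 1, :]) ** 2, axis=(0, 1))
    return E
```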
The energies of each sub band k within a time sub frame n may then be passed on to the spatial DTX determiner 307.
In embodiments the analysis processor 205 may comprise a spatial analyser 303. The spatial analyser 303 may be configured to receive the time-frequency signals 302 and based on these signals estimate direction parameters 208. The direction parameters may be determined based on any audio based 'direction' determination.
For example, in some embodiments the spatial analyser 303 is configured to estimate the direction of a sound source with two or more signal inputs.
The spatial analyser 303 may thus be configured to provide at least one azimuth and elevation for each frequency band and temporal time-frequency block within a frame of an audio signal, denoted as azimuth $\varphi(k,n)$ and elevation $\theta(k,n)$. The direction parameters 208 for the time sub frame may also be passed to the spatial DTX determiner 307.
The spatial analyser 303 may also be configured to determine an energy ratio parameter 210. The energy ratio may be considered to be a determination of the energy of the audio signal which can be considered to arrive from a direction. The direct-to-total energy ratio r(k,n) can be estimated, e.g., using a stability measure of the directional estimate, or using any correlation measure, or any other suitable method to obtain a ratio parameter. Each direct-to-total energy ratio corresponds to a specific spatial direction and describes how much of the energy comes from the specific spatial direction compared to the total energy. This value may also be represented for each time-frequency tile separately. The spatial direction parameters and direct-to-total energy ratio describe how much of the total energy for each time-frequency tile is coming from the specific direction. In general, a spatial direction parameter can also be thought of as the direction of arrival (DOA).
In embodiments the direct-to-total energy ratio parameter can be estimated based on the normalized cross-correlation parameter $cor'(k,n)$ between a microphone pair at band $k$, the value of the cross-correlation parameter lying between -1 and 1. The direct-to-total energy ratio parameter $r(k,n)$ can be determined by comparing the normalized cross-correlation parameter to a diffuse-field normalized cross-correlation parameter $cor'_{D}(k,n)$ as

$$r(k,n) = \frac{cor'(k,n) - cor'_{D}(k,n)}{1 - cor'_{D}(k,n)}$$

The direct-to-total energy ratio is explained further in PCT publication WO2017/005978, which is incorporated herein by reference. The energy ratio may be passed to the spatial DTX determiner 307.
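A minimal sketch of that ratio computation; the clamp to [0, 1] and the guard against a zero denominator are our additions, since the raw expression is not bounded:

```python
import numpy as np

def direct_to_total_ratio(cor, cor_diffuse):
    """r(k,n) = (cor'(k,n) - cor'_D(k,n)) / (1 - cor'_D(k,n)), clamped
    to [0, 1]. cor and cor_diffuse may be arrays over (k, n)."""
    r = (cor - cor_diffuse) / np.maximum(1.0 - cor_diffuse, 1e-12)
    return np.clip(r, 0.0, 1.0)
```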
Additionally, the spatial analyser 303 may furthermore be configured to determine a number of coherence parameters 212, which may include surrounding coherence $\gamma(k,n)$ and spread coherence $\zeta(k,n)$, both analysed in the time-frequency domain.
Therefore, for each sub band k there will be a collection of spatial audio parameters associated with the sub band. In this instance each sub band k may have the following spatial parameters associated with it: at least one azimuth and elevation, denoted as azimuth $\varphi(k,n)$ and elevation $\theta(k,n)$, surrounding coherence $\gamma(k,n)$, spread coherence $\zeta(k,n)$ and a direct-to-total energy ratio parameter $r(k,n)$.
In embodiments the DTX determiner 213 may determine the start of a DTX period. As stated above this may be performed by algorithms deployed by VAD or SAD (signal activity detection) determination units. These kinds of algorithms may be used to drive the initial DTX determination signal, which denotes the start of a DTX period. Once a DTX period has been started the DTX determiner may then deploy the spatial DTX determiner 307 in order to monitor changes to the spatial alignment during the DTX period. On first entering a DTX period the spatial DTX determiner 307 may take a reference spatial check or snapshot of the spatial audio parameters including: direction parameters 208, energy ratio parameters 210 and the estimated energy for each sub band 310. This snapshot in time may be termed the reference spatial direction parameter as it is the first upon start of the DTX period. In some embodiments the reference spatial direction parameter may record the value of the spatial audio parameters associated with the audio frame on entry into the DTX state. In other embodiments the spatial direction parameter used as the reference spatial snapshot may be averaged over the course of a number of audio time frames before an initial value is settled on. This was found to improve the performance of the spatial DTX determiner 307, especially in noisy conditions.
Figure 4 depicts the processing steps of the spatial DTX determiner 307 from the point in time when the DTX determiner 213 has determined the start of the DTX period. Step 401 in Figure 4 depicts the processing step, on entry into the DTX period, of taking or recording the reference snapshot of spatial audio parameter values (direction parameters 208, energy ratio parameters 210 and estimated energy for each sub band 310) to produce the reference spatial direction parameter. In this case the average of the spatial audio parameter values over a plurality of audio frames (within the DTX period) is taken as the reference snapshot.
It is to be appreciated that the processing steps of Figure 4 may also be related to the condition of a SID update. In this case the impetus to start the processing loop of Figure 4 may be a SID update event rather than the start of a DTX period.
The spatial DTX determiner 307 may then be arranged to have the further processing step of determining a further spatial direction parameter snapshot value of the spatial audio parameters (direction parameters 208, energy ratio parameters 210 and estimated energy for each sub band 310) at a further point in time as specified by the number of audio frames from when the reference spatial direction parameter snapshot is calculated. As before, the further spatial direction parameter snapshot may be an average value taken over the course of a plurality of audio time frames around the further point in time (within the DTX period).
This processing step of taking a further spatial direction parameter snapshot is shown as processing step 403 in Figure 4.
The spatial DTX determiner 307 may then be arranged to compare the further spatial direction parameter snapshot value of the spatial audio parameters against the reference spatial direction parameter snapshot value in order to determine if there has been a significant deviation of spatial parameter values between the point in time associated with the reference snapshot value and the point in time associated with the further snapshot value. This comparison step within the spatial DTX determiner 307 may be structured as a difference between the further spatial direction parameter snapshot value and the reference spatial direction parameter snapshot value which is compared to a threshold. This comparison step is depicted in Figure 4 as the step 405.
Depending on the size of the deviation, the spatial DTX determiner 307 may either decide to take another further spatial direction parameter snapshot value at another point in time, in other words after another set number of audio frames (going forward within the DTX period), or alternatively the spatial DTX determiner 307 may decide to terminate the DTX state.
If the deviation is large enough (as depicted by the path 411) the spatial DTX determiner 307 may be arranged to terminate the DTX state. This decision step is shown as the processing step 409 in Figure 4.
However, if the deviation is within a predetermined threshold the spatial DTX determiner 307 may be so arranged as to remain in the DTX state and be configured to take another spatial direction parameter snapshot value at a further point in time.
This decision step is shown as the loopback step of 407 in Figure 4.
It is to be appreciated that the loopback processing steps of 403, 405, 407 may continue until either the processing step 409 is performed or the DTX period is terminated via other means within the DTX determiner 213.
In some embodiments the spatial direction parameter snapshot may be determined by using the energies $E(k,n)$, the spatial direction components azimuth $\varphi(k,n)$ and elevation $\theta(k,n)$, and the direct-to-total energy ratios $r(k,n)$.
The spatial DTX determiner 307 may perform the above spatial direction parameter snapshot by initially taking the azimuth $\varphi(k,n)$ and elevation $\theta(k,n)$ spherical direction components for each sub band k and subframe n, and then converting each direction component to its respective cartesian coordinate vector. Each cartesian coordinate vector for the sub band k and subframe n may then be weighted by the respective energy $E(k,n)$ (from the energy estimator 305) and the direct-to-total energy ratio parameter $r(k,n)$ for the sub band k and subframe n.
The conversion operation for an azimuth $\varphi(k,n)$ and elevation $\theta(k,n)$ direction component of the sub band k gives the X-axis direction component as

$$x(k,n) = E(k,n)\, r(k,n) \cos\varphi(k,n) \cos\theta(k,n)$$

the Y-axis component as

$$y(k,n) = E(k,n)\, r(k,n) \sin\varphi(k,n) \cos\theta(k,n)$$

and the Z-axis component as

$$z(k,n) = E(k,n)\, r(k,n) \sin\theta(k,n)$$

The X, Y and Z components can be arranged into a vector $v(k,n)$:

$$v(k,n) = [\, x(k,n),\, y(k,n),\, z(k,n) \,]^{T}$$

The above operation may be performed for a range of sub bands k = 0 to K-1 and a range of subframes N1 to N2, and then summed over these respective ranges to give

$$V_{N12} = \sum_{n=N_1}^{N_2} \sum_{k=0}^{K-1} v(k,n)$$

The energies may be summed over the same period to give

$$E_{N12} = \sum_{n=N_1}^{N_2} \sum_{k=0}^{K-1} E(k,n)$$

The spatial DTX determiner 307 may then determine a spatial direction parameter snapshot as a spatial average vector given by

$$\bar{v}_{N12} = \frac{V_{N12}}{E_{N12}}$$

With respect to Figure 4, the time period over which $\bar{v}_{N12}$ is calculated (subframes N1 to N2) may be viewed as the spatial snapshot associated with the reference snapshot, in other words the first spatial snapshot value recorded at processing step 401.
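A minimal sketch of the snapshot computation above, assuming the per-tile arrays E, r, azi and ele are all of shape (K, N) with angles in radians; the function name and the zero-energy guard are our own:

```python
import numpy as np

def spatial_snapshot(E, r, azi, ele, n1, n2):
    """Energy- and ratio-weighted average direction vector over
    subframes n1..n2 (inclusive) and all sub bands k."""
    sl = slice(n1, n2 + 1)
    w = E[:, sl] * r[:, sl]                      # E(k,n) * r(k,n)
    x = np.sum(w * np.cos(azi[:, sl]) * np.cos(ele[:, sl]))
    y = np.sum(w * np.sin(azi[:, sl]) * np.cos(ele[:, sl]))
    z = np.sum(w * np.sin(ele[:, sl]))
    E_sum = np.sum(E[:, sl])                     # E_N12
    return np.array([x, y, z]) / max(E_sum, 1e-12)
```

For example, the reference snapshot of step 401 would be spatial_snapshot(E, r, azi, ele, N1, N2) and a later snapshot spatial_snapshot(E, r, azi, ele, N3, N4).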
A further spatial snapshot may be taken at a further point in time denoted by the subframe range N3 to N4. This will result in a further spatial direction parameter snapshot value, which may be associated with the spatial snapshot value recorded during processing step 403.
$$\bar{v}_{N34} = \frac{V_{N34}}{E_{N34}}$$

The spatial DTX determiner 307 may then use these two spatial snapshot values $\bar{v}_{N34}$ and $\bar{v}_{N12}$ to determine the spatial snapshot difference measure associated with processing step 405. In embodiments the vector norm difference measure may be used:

$$d = \lVert \bar{v}_{N34} - \bar{v}_{N12} \rVert$$

where $\lVert \cdot \rVert$ denotes the vector norm. With reference to Figure 4, should the loopback 407 be executed, this will result in the next further spatial direction parameter snapshot being recorded at processing step 403. Using the above nomenclature this can result in a spatial snapshot value of

$$\bar{v}_{N56} = \frac{V_{N56}}{E_{N56}}$$

and the vector norm difference measure may be given as

$$d = \lVert \bar{v}_{N56} - \bar{v}_{N12} \rVert$$

In some variations of the above embodiment the threshold to which the difference measure is compared may be adaptive, whereby the threshold may be varied according to the spatial audio content. For instance, a static audio scene may warrant a less frequent update rate of the threshold value, whereas a more dynamic audio scene may justify a more rapid update rate.
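A minimal sketch of the difference test of step 405, reusing the snapshots computed above; the threshold value is an illustrative placeholder, as the source does not specify one:

```python
import numpy as np

def dtx_should_terminate(v_ref, v_now, threshold=0.5):
    """Compare ||v_now - v_ref|| against a threshold: True means the
    spatial scene has moved enough to terminate DTX (step 409),
    False means stay in DTX and loop back for another snapshot (407)."""
    return np.linalg.norm(v_now - v_ref) > threshold
```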
It is to be noted that in some embodiments the above averaging process may take the form of an infinite impulse response (IIR) filter type averaging process, in which a running tally is maintained by computing a new parameter value every sub frame and adding the new parameter value (multiplied by a damping factor) to previous values (from previous subframe calculations).
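A minimal sketch of such an IIR-style running average, updated once per subframe; the damping factor alpha is an illustrative placeholder:

```python
def iir_update(prev_avg, new_value, alpha=0.9):
    """One-pole IIR running average: avg <- alpha*avg + (1-alpha)*new."""
    return alpha * prev_avg + (1.0 - alpha) * new_value
```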
Further embodiments may deploy an additional parameter called the Remainder-to-total energy ratio, which is defined as the energy ratio of the remainder (such as microphone noise) sound energy, to fulfil the requirement that the sum of energy ratios is 1.
This may be calculated as the energy of the remainder sound divided by the total energy, and has a range of values of [0.0, 1.0]. In these embodiments the spatial DTX determiner 307 may take this parameter as an additional input which may be compared to a further threshold value. If the value of the remainder-to-total energy ratio is large enough, i.e. close to one (for example above a threshold value of 0.8), then the DTX period may be continued. If, however, the remainder-to-total energy ratio is determined to be small in value, then there may be a further stage in which the audio spatial parameter values (such as the direction values) may be inspected in order to decide whether the DTX period is terminated.
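Putting the two gates together, a hypothetical sketch of the continuation decision might look as follows; the 0.8 ratio threshold is the example value given in the text, while the direction threshold of 0.1 and the function shape are illustrative assumptions:

```python
import numpy as np

def continue_dtx(remainder_ratio, v_ref, v_new,
                 ratio_threshold=0.8, direction_threshold=0.1):
    """Decide whether the DTX period continues for the current frame."""
    if remainder_ratio >= ratio_threshold:
        # Mostly remainder energy (e.g. microphone noise): continue DTX
        # without inspecting the direction values.
        return True
    # Otherwise inspect the spatial parameters: a large enough change in
    # the spatial snapshot terminates the DTX period.
    d = np.linalg.norm(np.asarray(v_new) - np.asarray(v_ref))
    return d <= direction_threshold
```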
With respect to Figure 5 an example electronic device which may be used as the analysis or synthesis device is shown. The device may be any suitable electronic device or apparatus. For example, in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes, such as the methods described herein.
In some embodiments the device 1400 comprises a memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore, in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example, in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IRDA).
The transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD, the data variants thereof, and CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California, and Cadence Design, of San Jose, California, automatically route conductors and locate components on a semiconductor chip using well-established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like), may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. All such and similar modifications of the teachings of this invention will, however, still fall within the scope of this invention as defined in the appended claims.

Claims (14)

CLAIMS:

  1. An apparatus comprising means configured to: obtain at least one audio signal and audio spatial parameters associated with the at least one audio signal; generate a spatial direction parameter associated with the audio spatial parameters at a first point in time associated with a discontinuous transmission mode of operation of the apparatus; generate a further spatial direction parameter associated with the audio spatial parameters at a second point in time during the discontinuous transmission mode of operation of the apparatus; and determine whether to terminate the discontinuous transmission mode of operation of the apparatus dependent upon a comparison of the further spatial direction parameter and the spatial direction parameter.
  2. The apparatus as claimed in Claim 1, wherein the spatial direction parameter is generated before the further spatial direction parameter.
  3. The apparatus as claimed in Claims 1 and 2, wherein the first point in time comprises at least one of: at a start or substantially at a start of the discontinuous transmission mode of operation of the apparatus; at an audio frame preceding a start of the discontinuous transmission mode of operation of the apparatus; and at an audio frame after or at the time of a silence descriptor update.
  4. The apparatus as claimed in Claims 1 to 3, wherein the apparatus further comprises means configured to determine an energy of a first group of samples of the one or more audio signals and an energy of at least a second group of samples of the one or more audio signals at the first point in time.
  5. The apparatus as claimed in Claim 4, wherein the means configured to generate the spatial direction parameter comprises means configured to: convert a first spherical direction vector associated with the first group of samples to a first cartesian direction vector, wherein the first cartesian direction vector comprises an x-axis component, a y-axis component and a z-axis component, and, for each single component in turn, weight the component of the cartesian direction vector by the energy of the first group of samples and a direct-to-total energy ratio calculated for the first group of samples; convert a second spherical direction vector associated with the second group of samples to a second cartesian direction vector, wherein the second cartesian direction vector comprises an x-axis component, a y-axis component and a z-axis component, and, for each single component in turn, weight the component of the cartesian direction vector by the energy of the second group of samples and a direct-to-total energy ratio calculated for the second group of samples; and divide the sum of the first cartesian direction vector and the at least second cartesian direction vector by the sum of the energy of the first group of samples and the energy of the second group of samples.
  6. The apparatus as claimed in Claims 4 and 5, wherein the first group of samples comprises time domain samples of a subframe and frequency domain samples of a sub band, and wherein the at least second group of samples comprises time domain samples of a further subframe and frequency domain samples of a further sub band.
  7. The apparatus as claimed in Claims 1 to 6, wherein the means configured to determine whether to terminate the discontinuous transmission mode of operation of the apparatus dependent upon a comparison of the further spatial direction parameter and the spatial direction parameter further comprises means configured to: determine whether to terminate the discontinuous transmission mode of operation of the apparatus dependent upon a comparison of a difference between the spatial direction parameter and the further spatial direction parameter to a threshold value.
  8. A method comprising: obtaining at least one audio signal and audio spatial parameters associated with the at least one audio signal; generating a spatial direction parameter associated with the audio spatial parameters at a first point in time associated with a discontinuous transmission mode of operation of an apparatus; generating a further spatial direction parameter associated with the audio spatial parameters at a second point in time during the discontinuous transmission mode of operation of the apparatus; and determining whether to terminate the discontinuous transmission mode of operation of the apparatus dependent upon a comparison of the further spatial direction parameter and the spatial direction parameter.
  9. The method as claimed in Claim 8, wherein the spatial direction parameter is generated before the further spatial direction parameter.
  10. The method as claimed in Claims 8 and 9, wherein the first point in time comprises at least one of: at a start or substantially at a start of the discontinuous transmission mode of operation of the apparatus; at an audio frame preceding a start of the discontinuous transmission mode of operation of the apparatus; and at an audio frame after or at the time of a silence descriptor update.
  11. The method as claimed in Claims 8 to 10, wherein the method further comprises determining an energy of a first group of samples of the one or more audio signals and an energy of at least a second group of samples of the one or more audio signals at the first point in time.
  12. The method as claimed in Claim 11, wherein generating a spatial direction parameter comprises: converting a first spherical direction vector associated with the first group of samples to a first cartesian direction vector, wherein the first cartesian direction vector comprises an x-axis component, a y-axis component and a z-axis component, and, for each single component in turn, weighting the component of the cartesian direction vector by the energy of the first group of samples and a direct-to-total energy ratio calculated for the first group of samples; converting a second spherical direction vector associated with the second group of samples to a second cartesian direction vector, wherein the second cartesian direction vector comprises an x-axis component, a y-axis component and a z-axis component, and, for each single component in turn, weighting the component of the cartesian direction vector by the energy of the second group of samples and a direct-to-total energy ratio calculated for the second group of samples; and dividing the sum of the first cartesian direction vector and the at least second cartesian direction vector by the sum of the energy of the first group of samples and the energy of the second group of samples.
  13. The method as claimed in Claims 11 and 12, wherein the first group of samples comprises time domain samples of a subframe and frequency domain samples of a sub band, and wherein the at least second group of samples comprises time domain samples of a further subframe and frequency domain samples of a further sub band.
  14. The method as claimed in Claims 8 to 13, wherein determining whether to terminate the discontinuous transmission mode of operation of the apparatus dependent upon a comparison of the further spatial direction parameter and the spatial direction parameter further comprises: determining whether to terminate the discontinuous transmission mode of operation of the apparatus dependent upon a comparison of a difference between the spatial direction parameter and the further spatial direction parameter to a threshold value.
GB2012787.4A 2020-08-17 2020-08-17 Discontinuous transmission operation for spatial audio parameters Withdrawn GB2598104A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB2012787.4A GB2598104A (en) 2020-08-17 2020-08-17 Discontinuous transmission operation for spatial audio parameters
EP21857821.9A EP4196980A1 (en) 2020-08-17 2021-07-27 Discontinuous transmission operation for spatial audio parameters
PCT/FI2021/050540 WO2022038307A1 (en) 2020-08-17 2021-07-27 Discontinuous transmission operation for spatial audio parameters


Publications (2)

Publication Number Publication Date
GB202012787D0 GB202012787D0 (en) 2020-09-30
GB2598104A true GB2598104A (en) 2022-02-23

Family

ID=72615368


Country Status (3)

Country Link
EP (1) EP4196980A1 (en)
GB (1) GB2598104A (en)
WO (1) WO2022038307A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024103163A1 (en) * 2022-11-18 2024-05-23 Voiceage Corporation Method and device for discontinuous transmission in an object-based audio codec

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014143582A1 (en) * 2013-03-14 2014-09-18 Dolby Laboratories Licensing Corporation Spatial comfort noise
WO2019193149A1 (en) * 2018-04-05 2019-10-10 Telefonaktiebolaget Lm Ericsson (Publ) Support for generation of comfort noise, and generation of comfort noise

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11463833B2 (en) * 2016-05-26 2022-10-04 Telefonaktiebolaget Lm Ericsson (Publ) Method and apparatus for voice or sound activity detection for spatial audio
AU2018368589B2 (en) * 2017-11-17 2021-10-14 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for encoding or decoding directional audio coding parameters using quantization and entropy coding


Also Published As

Publication number Publication date
GB202012787D0 (en) 2020-09-30
EP4196980A1 (en) 2023-06-21
WO2022038307A1 (en) 2022-02-24

Similar Documents

Publication Publication Date Title
JP5081838B2 (en) Audio encoding and decoding
US20150248889A1 (en) Layered approach to spatial audio coding
KR20150064079A (en) Apparatus and method for providing enhanced guided downmix capabilities for 3d audio
CN112567765B (en) Spatial audio capture, transmission and reproduction
WO2019105575A1 (en) Determination of spatial audio parameter encoding and associated decoding
GB2580899A (en) Audio representation and associated rendering
US11096002B2 (en) Energy-ratio signalling and synthesis
GB2576769A (en) Spatial parameter signalling
EP3923280A1 (en) Adapting multi-source inputs for constant rate encoding
US20230215445A1 (en) Methods and devices for encoding and/or decoding spatial background noise within a multi-channel input signal
US20230335141A1 (en) Spatial audio parameter encoding and associated decoding
WO2022038307A1 (en) Discontinuous transmission operation for spatial audio parameters
US20240029745A1 (en) Spatial audio parameter encoding and associated decoding
US20230377587A1 (en) Quantisation of audio parameters
WO2021255328A1 (en) Decoder spatial comfort noise generation for discontinuous transmission operation
EP4396814A1 (en) Silence descriptor using spatial parameters
WO2020201619A1 (en) Spatial audio representation and associated rendering

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)