GB2595891A - Adapting multi-source inputs for constant rate encoding - Google Patents


Info

Publication number
GB2595891A
GB2595891A (application GB2008767.2A)
Authority
GB
United Kingdom
Prior art keywords
audio signal
signal inputs
coding mode
encoding
discontinuous transmission
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB2008767.2A
Other versions
GB202008767D0 (en)
Inventor
Lasse Juhani Laaksonen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB2008767.2A priority Critical patent/GB2595891A/en
Publication of GB202008767D0 publication Critical patent/GB202008767D0/en
Priority to EP21169284.3A priority patent/EP3923280A1/en
Publication of GB2595891A publication Critical patent/GB2595891A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/012 Comfort noise or silence coding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/002 Dynamic bit allocation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Abstract

An immersive Metadata Assisted Spatial Audio (MASA) codec determines signal activity within two separate audio signal inputs 801 (i.e. signal or voice activity detection, VAD) and determines a coding mode from three or more coding modes based on the activity, one of the modes comprising an adaptive discontinuous transmission (DTX) coding mode 821. This mode can turn off processing for particular frames, and inserts Silence Descriptor (SID) frames to align comfort noise with background noise.

Description

ADAPTING MULTI-SOURCE INPUTS FOR CONSTANT RATE ENCODING
Field
The present application relates to apparatus and methods for encoding multi-source inputs for discontinuous transmission operation, but not exclusively to encoding multi-source inputs for discontinuous transmission operation within an immersive or spatial audio codec.
Background
Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the Immersive Voice and Audio Services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network including use in such immersive services as for example immersive voice and audio for virtual reality (VR). This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
Voice Activity Detection (VAD), also known as speech activity detection or more generally as signal activity detection, is a technique used in various speech processing algorithms, most notably speech codecs, for detecting the presence or absence of human speech. It can be generalized to detection of an active signal, i.e., a sound source other than background noise. Based on a VAD decision, it is possible to utilize, e.g., a certain encoding mode in a speech encoder. Discontinuous Transmission (DTX) is a technique utilizing VAD intended to temporarily shut off parts of active signal processing (such as speech coding according to certain modes) and the frame-by-frame transmission of encoded audio. For example, rather than transmitting normal encoded frames, infrequent simplified update frames are sent to drive a comfort noise generator (CNG) at the decoder. The use of DTX can help with reducing interference and/or preserving/reallocating capacity in a practical mobile network. Furthermore, the use of DTX can also help with the battery life of the device, e.g., by turning off the radio when not transmitting.
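As an illustration of the VAD/DTX interplay described above, here is a minimal energy-threshold VAD sketch in Python. The threshold, frame layout, and hangover counter are assumptions for illustration (real codec VADs use considerably more sophisticated features); the hangover mimics the common practice of trailing active frames so word endings are not clipped before DTX kicks in.

```python
import numpy as np

def frame_energy_db(frame):
    """Mean energy of one audio frame in dB (frame: float samples in [-1, 1])."""
    return 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)

def simple_vad(frames, threshold_db=-50.0, hangover=8):
    """Energy-threshold VAD with a hangover counter.

    Returns one boolean per frame: True = active (transmit normally),
    False = inactive (DTX may suppress the frame or send only SID updates).
    """
    decisions = []
    hang = 0
    for frame in frames:
        if frame_energy_db(frame) > threshold_db:
            hang = hangover          # speech detected: reset the hangover
            decisions.append(True)
        elif hang > 0:
            hang -= 1                # trail off gradually to avoid clipping word ends
            decisions.append(True)
        else:
            decisions.append(False)  # background noise only: candidate for DTX
    return decisions
```

Under DTX, the False stretches are where normal frame transmission stops and only occasional SID updates are sent.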
Comfort Noise Generation (CNG) is a technique for creating a synthetic background noise at the decoder to fill silence periods that would otherwise be observed. For example, comfort noise generation can be implemented under DTX operation.
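A minimal sketch of the decoder-side CNG idea, assuming the only parameter signalled is a noise level in dB (real codecs additionally signal spectral shape): the decoder synthesizes white noise at the level last received, filling the silence period.

```python
import numpy as np

def comfort_noise(frame_len, level_db, rng):
    """Generate one frame of comfort noise at the level last signalled by a SID frame.

    frame_len: samples per frame; level_db: target RMS level in dBFS;
    rng: a numpy random Generator, so the noise sequence is reproducible.
    """
    amp = 10.0 ** (level_db / 20.0)          # convert dB level to linear amplitude
    return amp * rng.standard_normal(frame_len)
```

In a real decoder this runs for every frame of a DTX silence period, with `level_db` refreshed from each incoming SID update so the comfort noise tracks the sender-side background.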
Silence Descriptor (SID) frames can be sent during speech inactivity to keep the receiver CNG reasonably well aligned with the background noise level at the sender side. This can be of particular importance at the onset of each new talk spurt; thus, SID frames should not be too old when speech starts again. Commonly SID frames are sent regularly, e.g. every 8th frame, but some codecs also allow variable rate SID updates. SID frames are typically quite small: e.g. a 2.4 kbit/s SID bitrate equals 48 bits per frame.
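The arithmetic behind the 48-bit figure is straightforward once a frame length is fixed; 20 ms is assumed here, as is typical for speech codecs in this family:

```python
FRAME_MS = 20        # assumed frame length in milliseconds (typical for speech codecs)
SID_BITRATE = 2400   # bit/s, the 2.4 kbit/s SID bitrate quoted above

# Bits carried by one SID frame: 2400 bit/s * 0.020 s = 48 bits.
sid_bits_per_frame = SID_BITRATE * FRAME_MS // 1000

# With one SID update every 8th frame, the average rate during inactivity
# drops to 48 bits per 160 ms, i.e. 300 bit/s.
avg_inactive_bitrate = sid_bits_per_frame * 1000 // (8 * FRAME_MS)
```

This is the capacity saving DTX targets: inactive periods cost roughly 300 bit/s instead of the full active-speech bitrate.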
Summary
There is provided according to a first aspect an apparatus comprising means configured to: obtain two or more separate audio signal inputs; determine signal activity within the two or more separate audio signal inputs; determine a coding mode from three or more coding modes, wherein at least one of the three or more coding modes comprises at least one adaptive discontinuous transmission coding mode and the coding mode is determined based on the signal activity within the two or more separate audio signal inputs; and encode the two or more audio signal inputs based on the determined coding mode.
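One plausible realization of the mode-selection step above, sketched in Python. The application does not specify the decision logic; only the three-plus coding modes (including an adaptive DTX mode) and the per-input activity flags are taken from the text, and the policy below is a hypothetical example.

```python
from enum import Enum

class CodingMode(Enum):
    DTX_OFF = 0        # encode every frame of every input normally
    DTX_ON = 1         # classic DTX: inactive inputs send only SID updates
    DTX_ADAPTIVE = 2   # adaptive DTX: reallocate bits/channels from inactive inputs

def select_coding_mode(activity):
    """Pick a coding mode from per-input VAD flags (one bool per audio input)."""
    if all(activity):
        return CodingMode.DTX_OFF       # everything active: DTX offers no gain
    if not any(activity):
        return CodingMode.DTX_ON        # all inactive: plain SID / comfort noise
    return CodingMode.DTX_ADAPTIVE      # mixed activity: adapt rate per input
```

The interesting case is the mixed one: some inputs are active while others are not, which is where the adaptive DTX modes of this application (externally visible or invisible) come into play.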
The at least one adaptive discontinuous transmission coding mode may comprise at least one externally visible adaptive discontinuous transmission coding mode, and the means configured to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally visible adaptive discontinuous transmission coding mode may be configured to adaptively encode the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to encode inactive signal activity within the two or more separate audio signal inputs as silence descriptor elements.
The means configured to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally visible adaptive discontinuous transmission coding mode may be configured to adaptively encode one of the two or more audio signal inputs based on signal activity within the one of the two or more separate audio signal inputs and at least one of: signal activity within another one of the two or more separate audio signal inputs; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is kept constant; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is variable; and a number of encoded channels to be output from the encoded others of the two or more audio signal inputs.
The means configured to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally visible adaptive discontinuous transmission coding mode may be configured to control at least one of: a number of encoded channels to be output from the encoded one of the two or more audio signal inputs; and a bit rate of the encoded one of the two or more audio signal inputs.
The means configured to control the number of encoded channels to be output from the encoded one of the two or more audio signal inputs may be configured to control the number of encoded channels to be output from the encoded one of the two or more audio signal inputs such that a total number of channels output from encoding all of the two or more audio signal inputs is constant.
The at least one adaptive discontinuous transmission coding mode may comprise at least one externally invisible adaptive discontinuous transmission coding mode, and the means configured to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally invisible adaptive discontinuous transmission coding mode may be configured to adaptively encode the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance may be configured to encode inactive signal activity within the two or more separate audio signal inputs as silence descriptor elements, but maintain a constant number of output channels and/or a constant output bitrate. The means configured to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally invisible adaptive discontinuous transmission coding mode may be configured to adaptively encode one of the two or more audio signal inputs based on signal activity within the one of the two or more separate audio signal inputs and at least one of: signal activity within another one of the two or more separate audio signal inputs; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is kept constant; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is variable; and a number of encoded channels to be output from the encoded others of the two or more audio signal inputs.
The means configured to encode the two or more audio signal inputs based on the determined coding mode when the discontinuous transmission coding mode is an externally invisible adaptive discontinuous transmission coding mode may be configured to control at least one of: a number of encoded channels to be output from the encoded one of the two or more audio signal inputs to maintain a constant number of output channels, and a bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate.
The means configured to control the bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate may be configured to apply zero padding to the encoded audio signal inputs.
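The zero-padding option above can be sketched in a few lines: when one input collapses to a small SID frame, the encoded payload is padded with zero bytes up to a fixed frame size so the output bitrate stays constant. The function name and byte-level framing are illustrative assumptions, not the application's actual bitstream format.

```python
def pad_to_constant_size(payload: bytes, frame_bytes: int) -> bytes:
    """Zero-pad an encoded frame so every transmitted frame has the same size.

    Keeps the output bitrate constant even when an input was encoded as a
    small SID frame; the decoder simply ignores the trailing padding.
    """
    if len(payload) > frame_bytes:
        raise ValueError("encoded frame exceeds the fixed frame size")
    return payload + bytes(frame_bytes - len(payload))
```

This makes the DTX operation externally invisible at the transport level, at the cost of giving up the bitrate saving that visible DTX would provide.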
The means configured to control the bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate may be configured to apply an adaptive discontinuous transmission coding mode for a single source of the two or more audio signal inputs, resulting in a maximum of a single encoded audio signal input implementing discontinuous transmission encoding assistance.
The means configured to control the bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate may be configured to apply an adaptive discontinuous transmission coding mode for bit rate allocation and transport signal selection for the encoded one of the two or more audio signal inputs.
The three or more coding modes may further comprise an off mode wherein the means configured to encode the two or more audio signal inputs based on the determined coding mode may be configured to encode the two or more audio signal inputs without any discontinuous transmission encoding assistance.
The three or more coding modes may further comprise an on mode wherein the means configured to encode the two or more audio signal inputs based on the determined coding mode may be configured to encode the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to encode inactive signal activity within the two or more separate audio signal inputs as silence descriptor elements.
The three or more coding modes may further comprise an on mode wherein the means configured to encode the two or more audio signal inputs based on the determined coding mode may be configured to encode the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to adaptively individually encode one or more of the two or more separate audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to encode inactive signal activity within the one or more of the two or more separate audio signal inputs as silence descriptor elements and others of the two or more audio signal inputs without any discontinuous transmission encoding assistance.
According to second aspect there is provided a method comprising: obtaining two or more separate audio signal inputs; determining signal activity within the two or more separate audio signal inputs; determining a coding mode from three or more coding modes, wherein at least one of the three or more coding modes comprises at least one adaptive discontinuous transmission coding mode and the coding mode is determined based on the signal activity within the two or more separate audio signal inputs; and encoding the two or more audio signal inputs based on the determined coding mode.
The at least one adaptive discontinuous transmission coding mode may comprise at least one externally visible adaptive discontinuous transmission coding mode, and encoding the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally visible adaptive discontinuous transmission coding mode may comprise adaptively encoding the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to encode inactive signal activity within the two or more separate audio signal inputs as silence descriptor elements.
Encoding the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally visible adaptive discontinuous transmission coding mode may comprise adaptively encoding one of the two or more audio signal inputs based on signal activity within the one of the two or more separate audio signal inputs and at least one of: signal activity within another one of the two or more separate audio signal inputs; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is kept constant; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is variable; and a number of encoded channels to be output from the encoded others of the two or more audio signal inputs.
Encoding the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally visible adaptive discontinuous transmission coding mode may comprise controlling at least one of: a number of encoded channels to be output from the encoded one of the two or more audio signal inputs; and a bit rate of the encoded one of the two or more audio signal inputs.
Controlling the number of encoded channels to be output from the encoded one of the two or more audio signal inputs may comprise controlling the number of encoded channels to be output from the encoded one of the two or more audio signal inputs such that a total number of channels output from encoding all of the two or more audio signal inputs is constant.
The at least one adaptive discontinuous transmission coding mode may comprise at least one externally invisible adaptive discontinuous transmission coding mode, and encoding the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally invisible adaptive discontinuous transmission coding mode may comprise adaptively encoding the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance may be configured to encode inactive signal activity within the two or more separate audio signal inputs as silence descriptor elements, but maintain a constant number of output channels and/or a constant output bitrate.
Encoding the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally invisible adaptive discontinuous transmission coding mode may comprise adaptively encoding one of the two or more audio signal inputs based on signal activity within the one of the two or more separate audio signal inputs and at least one of: signal activity within another one of the two or more separate audio signal inputs; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is kept constant; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is variable; and a number of encoded channels to be output from the encoded others of the two or more audio signal inputs.
Encoding the two or more audio signal inputs based on the determined coding mode when the discontinuous transmission coding mode is an externally invisible adaptive discontinuous transmission coding mode may comprise controlling at least one of: a number of encoded channels to be output from the encoded one of the two or more audio signal inputs to maintain a constant number of output channels; and a bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate.
Controlling the bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate may comprise applying zero padding to the encoded audio signal inputs.
Controlling the bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate may comprise applying an adaptive discontinuous transmission coding mode for a single source of the two or more audio signal inputs, resulting in a maximum of a single encoded audio signal input implementing discontinuous transmission encoding assistance.
Controlling the bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate may comprise applying an adaptive discontinuous transmission coding mode for bit rate allocation and transport signal selection for the encoded one of the two or more audio signal inputs.
The three or more coding modes may further comprise an off mode wherein encoding the two or more audio signal inputs based on the determined coding mode may comprise encoding the two or more audio signal inputs without any discontinuous transmission encoding assistance.
The three or more coding modes may further comprise an on mode wherein the encoding the two or more audio signal inputs based on the determined coding mode may comprise encoding the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to encode inactive signal activity within the two or more separate audio signal inputs as silence descriptor elements.
The three or more coding modes may further comprise an on mode wherein encoding the two or more audio signal inputs based on the determined coding mode may comprise encoding the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to adaptively individually encode one or more of the two or more separate audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to encode inactive signal activity within the one or more of the two or more separate audio signal inputs as silence descriptor elements and others of the two or more audio signal inputs without any discontinuous transmission encoding assistance.
According to a third aspect there is provided an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain two or more separate audio signal inputs; determine signal activity within the two or more separate audio signal inputs; determine a coding mode from three or more coding modes, wherein at least one of the three or more coding modes comprises at least one adaptive discontinuous transmission coding mode and the coding mode is determined based on the signal activity within the two or more separate audio signal inputs; and encode the two or more audio signal inputs based on the determined coding mode.
The at least one adaptive discontinuous transmission coding mode may comprise at least one externally visible adaptive discontinuous transmission coding mode, and the apparatus caused to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally visible adaptive discontinuous transmission coding mode may be caused to adaptively encode the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to encode inactive signal activity within the two or more separate audio signal inputs as silence descriptor elements.
The apparatus caused to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally visible adaptive discontinuous transmission coding mode may be caused to adaptively encode one of the two or more audio signal inputs based on signal activity within the one of the two or more separate audio signal inputs and at least one of: signal activity within another one of the two or more separate audio signal inputs; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is kept constant; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is variable; and a number of encoded channels to be output from the encoded others of the two or more audio signal inputs.
The apparatus caused to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally visible adaptive discontinuous transmission coding mode may be caused to control at least one of: a number of encoded channels to be output from the encoded one of the two or more audio signal inputs; and a bit rate of the encoded one of the two or more audio signal inputs.
The apparatus caused to control the number of encoded channels to be output from the encoded one of the two or more audio signal inputs may be caused to control the number of encoded channels to be output from the encoded one of the two or more audio signal inputs such that a total number of channels output from encoding all of the two or more audio signal inputs is constant.
The at least one adaptive discontinuous transmission coding mode may comprise at least one externally invisible adaptive discontinuous transmission coding mode, and the apparatus caused to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally invisible adaptive discontinuous transmission coding mode may be caused to adaptively encode the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance may be caused to encode inactive signal activity within the two or more separate audio signal inputs as silence descriptor elements, but maintain a constant number of output channels and/or a constant output bitrate.
The apparatus caused to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally invisible adaptive discontinuous transmission coding mode may be caused to adaptively encode one of the two or more audio signal inputs based on signal activity within the one of the two or more separate audio signal inputs and at least one of: signal activity within another one of the two or more separate audio signal inputs; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is kept constant; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is variable; and a number of encoded channels to be output from the encoded others of the two or more audio signal inputs.
The apparatus caused to encode the two or more audio signal inputs based on the determined coding mode when the discontinuous transmission coding mode is an externally invisible adaptive discontinuous transmission coding mode may be caused to control at least one of: a number of encoded channels to be output from the encoded one of the two or more audio signal inputs to maintain a constant number of output channels, and a bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate.
The apparatus caused to control the bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate may be caused to apply zero padding to the encoded audio signal inputs.
The apparatus caused to control the bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate may be caused to apply an adaptive discontinuous transmission coding mode for a single source of the two or more audio signal inputs, resulting in a maximum of a single encoded audio signal input implementing discontinuous transmission encoding assistance.
The apparatus caused to control the bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate may be caused to apply an adaptive discontinuous transmission coding mode for bit rate allocation and transport signal selection for the encoded one of the two or more audio signal inputs.
The three or more coding modes may further comprise an off mode wherein the apparatus caused to encode the two or more audio signal inputs based on the determined coding mode may be caused to encode the two or more audio signal inputs without any discontinuous transmission encoding assistance.
The three or more coding modes may further comprise an on mode wherein the apparatus caused to encode the two or more audio signal inputs based on the determined coding mode may be caused to encode the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to encode inactive signal activity within the two or more separate audio signal inputs as silence descriptor elements.
The three or more coding modes may further comprise an on mode wherein the apparatus caused to encode the two or more audio signal inputs based on the determined coding mode may be caused to encode the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to adaptively individually encode one or more of the two or more separate audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to encode inactive signal activity within the one or more of the two or more separate audio signal inputs as silence descriptor elements and others of the two or more audio signal inputs without any discontinuous transmission encoding assistance.
According to a fourth aspect there is provided an apparatus comprising: obtaining circuitry configured to obtain two or more separate audio signal inputs; determining circuitry configured to determine signal activity within the two or more separate audio signal inputs; determining circuitry configured to determine a coding mode from three or more coding modes, wherein at least one of the three or more coding modes comprises at least one adaptive discontinuous transmission coding mode and the coding mode is determined based on the signal activity within the two or more separate audio signal inputs; and encoding circuitry configured to encode the two or more audio signal inputs based on the determined coding mode.
According to a fifth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus to perform at least the following: obtain two or more separate audio signal inputs; determine signal activity within the two or more separate audio signal inputs; determine a coding mode from three or more coding modes, wherein at least one of the three or more coding modes comprises at least one adaptive discontinuous transmission coding mode and the coding mode is determined based on the signal activity within the two or more separate audio signal inputs; and encode the two or more audio signal inputs based on the determined coding mode.
According to a sixth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtain two or more separate audio signal inputs; determine signal activity within the two or more separate audio signal inputs; determine a coding mode from three or more coding modes, wherein at least one of the three or more coding modes comprises at least one adaptive discontinuous transmission coding mode and the coding mode is determined based on the signal activity within the two or more separate audio signal inputs; and encode the two or more audio signal inputs based on the determined coding mode.
According to a seventh aspect there is provided an apparatus comprising: means for obtaining two or more separate audio signal inputs; means for determining signal activity within the two or more separate audio signal inputs; means for determining a coding mode from three or more coding modes, wherein at least one of the three or more coding modes comprises at least one adaptive discontinuous transmission coding mode and the coding mode is determined based on the signal activity within the two or more separate audio signal inputs; and means for encoding the two or more audio signal inputs based on the determined coding mode.

According to an eighth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtain two or more separate audio signal inputs; determine signal activity within the two or more separate audio signal inputs; determine a coding mode from three or more coding modes, wherein at least one of the three or more coding modes comprises at least one adaptive discontinuous transmission coding mode and the coding mode is determined based on the signal activity within the two or more separate audio signal inputs; and encode the two or more audio signal inputs based on the determined coding mode.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically an example spatial audio communications system suitable for implementing some embodiments;
Figures 2, 3 and 4 show schematically differences in audio activity in spatial communications based on multi-source encoder input;
Figure 5 shows schematically an encoder system of apparatus suitable for implementing some embodiments;
Figure 6 shows schematically an example encoder configured with mode selection for multi-source inputs according to some embodiments;
Figure 7 shows schematically an example flow diagram implementing an adaptive DTX operation;
Figure 8 shows schematically an example flow diagram implementing an internal adaptive DTX operation; and
Figure 9 shows schematically an example device suitable for implementing the apparatus shown.
Embodiments of the Application

The concept as discussed in the embodiments of the invention relates to speech and audio codecs and in particular immersive audio codecs supporting a multitude of operating points ranging from a low bit rate operation to transparency as well as a range of service capabilities, e.g., from mono to stereo to fully immersive audio encoding/decoding/rendering. An example of such a codec is the 3GPP IVAS codec discussed above.
The input signals are presented to the IVAS encoder in one of the supported formats (and in some allowed combinations of the formats). Similarly, it is expected that the decoder can output the audio in a number of supported formats. A pass-through mode has been proposed, where the audio could be provided in its original format after transmission (encoding/decoding), which can, e.g., allow for external rendering of the transmitted audio.
For example a mono audio signal (without metadata) may be encoded using an Enhanced Voice Service (EVS) encoder. Other input formats may utilize new IVAS encoding tools. One input format proposed for IVAS is the Metadata-assisted spatial audio (MASA) format, where the encoder may utilize, e.g., a combination of mono and stereo encoding tools and metadata encoding tools for efficient transmission of the format. MASA is a parametric spatial audio format suitable for spatial audio processing. Parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound (or sound scene) is described using a set of parameters. For example, in parametric spatial audio capture from microphone arrays, it is a typical and effective choice to estimate from the microphone array signals a set of parameters such as directions of the sound in frequency bands, and the relative energies of the directional and non-directional parts of the captured sound in frequency bands, expressed for example as a direct-to-total ratio or an ambient-to-total energy ratio in frequency bands. These parameters are known to well describe the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers, or to other formats, such as Ambisonics.
For example, there can be two channels (stereo) of audio signals and spatial metadata. The spatial metadata may furthermore define parameters such as: a direction index, describing a direction of arrival of the sound at a time-frequency parameter interval; level/phase differences; a direct-to-total energy ratio, describing an energy ratio for the direction index; diffuseness; coherences, such as a spread coherence describing a spread of energy for the direction index; a diffuse-to-total energy ratio, describing an energy ratio of non-directional sound over surrounding directions; a surround coherence describing a coherence of the non-directional sound over the surrounding directions; a remainder-to-total energy ratio, describing an energy ratio of the remainder (such as microphone noise) sound energy to fulfil the requirement that the sum of energy ratios is 1; a distance, describing a distance of the sound originating from the direction index in meters on a logarithmic scale; covariance matrices related to a multi-channel loudspeaker signal, or any data related to these covariance matrices; and other parameters for guiding or controlling a specific decoder, e.g., VAD/DTX/CNG/SID parameters. Any of these parameters can be determined in frequency bands.
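To make the parameter grouping above concrete, a per time-frequency-tile metadata record might be sketched as follows. This is a minimal illustrative sketch only: the field names and value ranges are assumptions for illustration and do not reproduce the normative MASA metadata layout.

```python
from dataclasses import dataclass

@dataclass
class MasaTileMetadata:
    """Illustrative per time-frequency-tile spatial parameters.
    Field names are examples only, not the normative MASA definition."""
    direction_index: int        # quantized direction of arrival
    direct_to_total: float      # energy ratio for the direction index, 0..1
    spread_coherence: float     # spread of energy for the direction index
    diffuse_to_total: float     # non-directional energy ratio
    surround_coherence: float   # coherence of the non-directional sound
    remainder_to_total: float   # e.g. microphone-noise energy ratio

    def ratios_sum(self) -> float:
        # The energy ratios are defined such that they sum to 1.
        return (self.direct_to_total + self.diffuse_to_total
                + self.remainder_to_total)

# One example tile: a mostly directional sound with some ambience.
tile = MasaTileMetadata(direction_index=123, direct_to_total=0.6,
                        spread_coherence=0.2, diffuse_to_total=0.3,
                        surround_coherence=0.1, remainder_to_total=0.1)
```

In a full frame such a record would exist for each time-frequency tile, with the ratios quantized before transmission.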
As discussed above Voice Activity Detection (VAD) may be employed in such a codec to control Discontinuous Transmission (DTX), Comfort Noise Generation (CNG) and Silence Descriptor (SID) frames.
Furthermore as discussed above CNG is a technique for creating a synthetic background noise to fill silence periods that would otherwise be observed, e.g., under the DTX operation. However, complete silence can be confusing or annoying to a receiving user. For example, the listener could judge that the transmission may have been lost and then unnecessarily say "hello, are you still there?" to confirm, or simply hang up. On the other hand, sudden changes in sound level (from total silence to active background and speech or vice versa) could also be very annoying. Thus, CNG is applied to prevent a sudden silence or sudden change. Typically, the CNG audio signal output is based on a highly simplified transmission of noise parameters.
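The "highly simplified transmission of noise parameters" can be illustrated with a deliberately crude sketch, in which the encoder reduces an inactive frame to a single energy value and the decoder fills the gap with level-matched noise. The frame length, parameterization and function names below are illustrative assumptions, not the actual EVS/IVAS CNG algorithm.

```python
import math
import random

def sid_params(frame):
    """Encoder side: reduce an inactive frame to a single energy value
    (a deliberately crude stand-in for a real SID parameter set)."""
    return sum(s * s for s in frame) / len(frame)

def cng_frame(energy, frame_len=160, rng=None):
    """Decoder side: synthesize comfort noise whose level matches the
    transmitted energy, avoiding both dead silence and level jumps."""
    rng = rng or random.Random(0)
    gain = math.sqrt(energy)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(frame_len)]

# An illustrative stationary background signal in one inactive frame.
background = [0.05 * math.sin(0.3 * i) for i in range(160)]
energy = sid_params(background)
noise = cng_frame(energy)
```

A real implementation would transmit a spectral envelope rather than one scalar, but the structure (sparse SID updates in, synthetic noise out) is the same.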
There are currently no proposed spatial audio DTX, CNG and SID implementations. In particular, the implementation of DTX operation for spatial audio is likely to change the "feel" and level of the background noise that will be observed by the user. These changes in the spatialization of the background noise may be perceived by the listener as being annoying or confusing. For example in some embodiments as discussed herein the background noise is provided such that it is experienced as coming from the same direction(s) both during active and inactive speech periods.
For example, in one scenario, a user is talking in front of a spatial capture device with a busy road behind the device. The spatial audio capture then has constant traffic hum (that is somewhat diffuse), and specific traffic noises (e.g., car horns) coming mainly from behind, and of course the talker's voice coming from the front. When DTX is active and the user is not talking, both the N channels and the spatial metadata transmission can be shut off to save transmission bandwidth (DTX active). In the absence of regular spatial audio coding, CNG provides the static background hum that is not too different from the original captured background noise.
The embodiments as discussed herein attempt to generate spatial metadata during inactive periods. This avoids the situation where reusing the most recent received values results in a repeating and annoying "stuck" spatial image.
Crude background noise description (SID) updates may be transmitted during inactive periods (with EVS this is mono and with IVAS, e.g., mono or stereo SID/CNG) to keep the signal properties (spectrum and energy) aligned between encoder and decoder. The embodiments as discussed herein attempt to define how to transmit spatial image SID updates.
Furthermore, in the example above, upon local VAD indicating a speech onset, the user voice returns, and so do the traffic hum and other traffic noises; they again get regular updates, as the audio and spatial metadata will be sent at the normal bit rate. The listener thus hears a significant change in the spatial reproduction. The embodiments as described herein are configured to consider the spatial dimension of background noise during the CNG periods and SID updates. Thus in such embodiments the DTX operation is made as transparent and pleasant to the user as possible.
As such the embodiments as described herein attempt to provide an optimal DTX / CNG / SID system for parametric spatial audio such as MASA. Additionally the embodiments as described herein are configured to provide an optimal CNG system for parametric spatial audio such as MASA based on a mono or stereo DTX system.
Thus, some embodiments comprise an IVAS apparatus or system configured to implement a DTX / CNG system where the parametric spatial audio CNG is based either on a spatial audio DTX or a mono/stereo DTX. This means that spatial audio parameters are updated substantially synchronously with the core audio DTX.
For example as shown in Figure 1 there is shown an example spatial audio communications system. The system comprises apparatus or device 1 121, which may be any suitable spatial audio capture apparatus being configured to capture audio within the environment El 100 including the user U1 101. The system further comprises apparatus or device 2 141, which may be a further suitable spatial audio capture apparatus being configured to capture audio within the environment E2 110 including the user U2 111. The spatial audio communications are shown as link 123 from apparatus or device 1 121 to apparatus or device 2 141 and link 143 from apparatus or device 2 141 to apparatus or device 1 121 both of which are via a suitable network 131. In this example there are shown two devices but it would be understood that the communication may be between more than two parties in some embodiments.
By spatial audio communications it is meant that at least one upstream is a spatial audio, i.e., more than mono audio. In some embodiments the (two) upstreams (link 123 and link 143) may be different configurations.
For example Figure 2 shows differing audio signal content types being obtained, encoded and then transmitted from the apparatus 1 121 to apparatus 2 141 and obtained, encoded and then transmitted from apparatus 2 141 to apparatus 1 121. In this example, the apparatus 1 121 audio input is a mono voice audio signal 201 and spatial ambience audio signal 203 (illustrated as a stereo audio signal), while the apparatus 2 141 audio input is shown as a stereo voice audio signal.
Additionally Figure 2 shows an important aspect relating to DTX operation. There are significant silent periods in both upstreams when considering the user voice signals.
This for example is shown in Figure 3. Figure 3 shows the apparatus 1 mono voice audio signal 201 and a voice activity detection (VAD) indication or signal 301. Here it can be seen that within the VAD indication/signal there are two periods of activity, a first early period 311 and a second late period 312. Additionally as shown in Figure 3 there is the apparatus 2 stereo voice audio signal 205 and a voice activity detection (VAD) indication or signal 303. The VAD indication/signal associated with the apparatus 2 stereo voice audio signal 205 shows two periods of activity, a first early-mid period 321 and a second late-mid period 322.
In the example apparatus input audio signals shown in Figure 2 there are significant periods of inactivity which provide the justification for implementing DTX in communications systems. However, the spatial ambience can be very active during the silent periods. This for example is shown in Figure 4. Figure 4 shows the apparatus 1 mono voice audio signal 201 and voice activity detection (VAD) indication or signal 301, with the two periods of activity, a first early period 311 and a second late period 312. Figure 4 also shows the apparatus 1 spatial ambience audio signal 203 (illustrated as a stereo audio signal) and associated voice activity detection (VAD) indication or signal 401. The VAD indication/signal 401 associated with the apparatus 1 spatial ambience audio signal 203 also shows two long periods of activity. The first long period is a first early-to-mid period 411 which significantly overlaps with the first early period 311 and the second long period is a second mid-to-late period 412 which overlaps the second late period 312. It would be appreciated that in some examples and embodiments there need not be such an overlap.
The ambience/background signals can in many use cases be almost as important as the voice signal or, indeed, include voice signals themselves. For example, in a multi-user audio capture scenario, one of the audio signals could be the primary user voice (for example the user 1 voice as shown in Figure 2 and Figure 3), while the spatial background signal can carry the voice of one or more other local participants (not shown in Figure 2/Figure 3).
As shown by the above examples the multi-source audio input presents a problem for the discontinuous transmission (DTX) operation. The multiple audio inputs can show very different activity characteristics and can be independent of each other.
The concept as discussed in the embodiments herein is enabling discontinuous transmission in multi-source input encoding. In other words, it improves the efficiency of audio encoding for multi-source inputs based on the session-level VAD/DTX properties.
The embodiments therefore describe a method for efficient multi-source input encoding utilizing an (internal) adaptive DTX operation. The adaptive DTX operation is introduced as an extension of the well-known DTX operation.
In some embodiments, instead of the binary DTX selection, there can be utilized a multi-step model. The model can in some embodiments be visible externally or be provided as a codec-internal operation, and is based on multi-source input handling and encoding.
In some embodiments the method is configured to maintain the number of transport signals in constant bit rate operation, where inactive sources are internally handled using DTX.
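A sketch of this idea, keeping the externally visible frame budget constant while inactive sources are internally reduced to SID-style parameter updates, might look as follows. The bit numbers and the flat split over active sources are invented for illustration and are not part of the described method.

```python
def allocate_bits(total_bits, active_flags, sid_bits=50):
    """Constant-bit-rate allocation sketch: inactive sources are handled
    internally with minimal SID-style parameter bits, and the freed budget
    is given to the active sources, so the total frame size never changes.
    (The numbers and the flat split are illustrative only.)"""
    # Inactive sources get only a small SID-style budget.
    alloc = [0 if a else sid_bits for a in active_flags]
    active = [i for i, a in enumerate(active_flags) if a]
    remaining = total_bits - sum(alloc)
    # Spread the remaining budget over the active sources; if every
    # source is inactive, spread it over all of them (e.g. as padding).
    targets = active if active else list(range(len(active_flags)))
    per, extra = divmod(remaining, len(targets))
    for k, i in enumerate(targets):
        alloc[i] += per + (1 if k < extra else 0)
    return alloc

# One active voice object and one inactive ambience stream in a frame
# of 264 bits (13.2 kbps at 50 frames per second):
frame_alloc = allocate_bits(264, [True, False])
```

The externally observed frame size stays at `total_bits` in every case, which is the point of the "internal DTX" handling.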
The following examples are shown with respect to a 3GPP IVAS encoder.
Specifically, IVAS is foreseen to require support for multi-source input encoding.
For example, a combination of one audio object input and one MASA audio input can be provided to the encoder.
The embodiments as discussed herein provide apparatus configured to provide DTX-capability derived adaptation of the encoding for efficient conversational multi-input voice and audio.
Some embodiments may also be employed in other multi-input audio codecs. Some embodiments may be useful in low to mid bit-rate operations such as conversational spatial codecs. Particularly, in some embodiments it may be possible to optimize constant bit rate codec operation. The embodiments therefore in summary may, as shown in the examples herein, relate to immersive voice and audio codecs and specifically to efficient encoding of multi-source immersive voice and audio inputs under a) DTX conditions and/or b) constant bit rate transmission conditions. However, they may be applied without further inventive effort to other codecs with similar functionalities.
Furthermore, while the EVS codec provides extensive bit rate switching capabilities, all current commercial EVS service launches utilize only a single bit rate, typically 13.2 kbps or 24.4 kbps. A similar approach is fairly likely for at least the initial launches of an IVAS service. Therefore, as shown herein in the embodiments, optimizing constant bit rate (CBR) operation is a desired goal.
Figure 5 presents a high-level overview of a suitable system or apparatus for IVAS coding and decoding which is suitable for implementing embodiments as described herein. The system 500 is shown comprising an (IVAS) input 511. The IVAS input 511 can comprise one or more of any suitable input format. For example as shown in Figure 5 there is shown a mono audio signal input 512. The mono audio signal input 512 may in some embodiments be passed to the encoder 521 and specifically to an Enhanced Voice Services (EVS) encoder 523. Furthermore there is shown a stereo and binaural audio signal input 513. The stereo and binaural audio signal input 513 in some embodiments is passed to the encoder 521 and specifically to the (IVAS) spatial audio encoder 525. Figure 5 also shows a Metadata-Assisted Spatial Audio (MASA) signal input 514. The Metadata-Assisted Spatial Audio (MASA) signal input in some embodiments is passed to the encoder 521. Specifically the audio component of the MASA input is passed to the (IVAS) spatial audio encoder 525 and the metadata component is passed to a metadata quantizer/encoder 527. Another input format shown in Figure 5 is an ambisonic audio signal, which may comprise a first order Ambisonics (FOA) and/or higher order Ambisonics (HOA) audio signal 515. The first order Ambisonics (FOA) and/or higher order Ambisonics (HOA) audio signal 515 in some embodiments is passed to the encoder 521 and specifically to the (IVAS) spatial audio encoder 525. Furthermore shown in Figure 5 is a channel-based audio signal input 516. This may be any suitable input audio channel format, for example 5.1 channel format, 7.1 channel format etc. The channel-based audio signal input 516 in some embodiments is passed to the encoder 521 and specifically to the (IVAS) spatial audio encoder 525. The final example input shown in Figure 5 is an object (or audio object) signal input 517. The object signal input in some embodiments is passed to the encoder 521.
Specifically the audio component of the object signal input is passed to the (IVAS) spatial audio encoder 525 and the metadata component passed to a metadata quantizer/encoder 527. In some embodiments, the mono audio signal corresponding to the audio object signal input 517 can be passed to the EVS encoder 523, while the corresponding metadata component is passed to a metadata quantizer/encoder 527.
Figure 5 furthermore shows an (IVAS) encoder 521. The (IVAS) encoder 521 is configured to receive the audio signal from the input and encode it to produce a suitable format encoded bitstream 531. The (IVAS) encoder 521 in some embodiments as shown in Figure 5 comprises an EVS encoder 523 configured to receive any mono audio signals 512 and encode them according to an EVS codec definition.
Furthermore the (IVAS) encoder 521 is shown comprising an (IVAS) spatial audio encoder 525. The (IVAS) spatial audio encoder 525 is configured to receive the audio signals or audio signal components and encode the audio signals based on a suitable definition or coding mechanism. In some embodiments the spatial audio encoder 525 is configured to reduce the number of audio signals being encoded before the signals are encoded. For example in some embodiments the spatial audio encoder is configured to combine or otherwise downmix the input audio signals. Such audio signal reduction can in some embodiments include, e.g., a spatial audio analysis, which can result in metadata in addition to the reduced number of audio signals. For example, such spatial audio analysis can transform, e.g., a channel based audio input 516 into an internal MASA representation, which can be encoded similarly as a MASA input 514.
In some embodiments, for example when the input type is MASA signals, the spatial audio encoder is configured to encode the audio signals as a mono or stereo signal.
The spatial audio encoder 525 may comprise an audio encoder core which is configured to receive the downmix or the audio signals directly and generate a suitable encoding of these audio signals. The encoder 525 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
In some embodiments the encoder 521 comprises a metadata quantizer/encoder 527. The metadata quantizer/encoder 527 is configured to receive the metadata, for example from the MASA input or the objects, and generate a suitable quantized and/or encoded metadata bitstream suitable for being combined with or associated with the encoded audio signal bitstream and being output as part of the (IVAS) bitstream 531.
Furthermore as shown in Figure 5 there is shown a (IVAS) decoder 541. The decoder 541 in some embodiments comprises a metadata dequantizer/decoder 547. The metadata dequantizer/decoder 547 is configured to receive the encoded metadata, for example from the IVAS bitstream 531 and generate a metadata bitstream suitable for rendering the audio signals within the stereo and spatial audio decoder 545.
Figure 5 furthermore shows the (IVAS) decoder 541 comprising an EVS decoder 543. The EVS decoder 543 is configured to receive the EVS encoded mono audio signals as part of the IVAS bitstream 531 and decode them to generate a suitable mono audio signal which can be passed to an internal renderer (for example the stereo and spatial decoder) or suitable external renderer.
Additionally in some embodiments the (IVAS) decoder 541 comprises a stereo and spatial audio signal decoder 545. The stereo and spatial audio signal decoder 545 in some embodiments is configured to receive the encoded audio signals and generate a suitable decoded spatial audio signal which can be rendered internally (for example by the stereo and spatial audio signal decoder) or suitable external renderer.
Therefore in summary first the system is configured to receive a suitable audio signal format or any combination of suitable audio signal formats. In some embodiments the system is configured to generate (a downmix or more generally known as transport audio signals) audio signals. The system is then configured to encode for storage/transmission the audio signals. After this the system may store/transmit the encoded audio signals and metadata. The system may retrieve/receive the encoded audio signals and metadata. Then the system is configured to extract the audio signals and metadata from encoded audio signals and metadata parameters, for example demultiplex and decode the encoded audio signals and metadata parameters.
The system may furthermore be configured to synthesize an output multichannel audio signal based on the extracted audio signals and metadata.
With respect to Figure 6 is shown the encoder 521 with respect to some embodiments in further detail. In Figure 6, the upper part shows that the inputs are, for example, a MASA input format 514 and an object input format 517. These can be passed to the (IVAS) encoder 521 for encoding. It would be understood that any suitable input format may be used in some embodiments and these are examples of input formats used to demonstrate the embodiments herein only.
The lower part of Figure 6 shows the encoder shown in the upper part in further detail. Thus the MASA input format 514 is shown comprising a stereo audio signal 601 and associated MASA metadata 603. The stereo audio signal 601 is an example of a suitable audio signal as part of a MASA input format only, and other numbers of channels or time-frequency domain input audio signals may be employed in some embodiments.
The stereo audio signal 601 is passed to the encoder 521 and in some embodiments is passed to a first signal activity detector 621 within the encoder 521. The associated MASA metadata 603 is passed to the encoder 521 and in some embodiments to a metadata quantizer/encoder 653 within the encoder 521.
The object input format 517 is shown comprising a mono audio signal 611 and associated metadata 613. The mono audio signal 611 is passed to the encoder 521 and in some embodiments is passed to a second signal activity detector 623 within the encoder 521. The associated metadata 613 is passed to the encoder 521 and in some embodiments to the metadata quantizer/encoder 653 within the encoder 521.
The encoder 521 can in some embodiments comprise a first signal activity detector (SAD) 621 configured to determine whether the stereo audio signals 601 comprise activity. This information (with the stereo audio signals 601) can be passed to a coding mode selector 631.
Furthermore the encoder 521 in some embodiments comprises a second signal activity detector (SAD) 623 configured to determine whether the mono audio signal 611 comprises activity. This information (with the mono audio signal 611) can also be passed to a coding mode selector 631.
The encoder 521 in some embodiments comprises a coding mode selector 631 configured to receive the SAD information associated with the MASA or stereo audio signal input (and may also receive the stereo audio signals 601) and the SAD information associated with the object or mono audio signal input (and may also receive the mono audio signal 611), and is configured to determine a coding mode for coding the stereo audio signals, the mono audio signal and the metadata associated with each.
The mode selection or determination can be passed to a stereo encoder 641, a mono audio encoder 643 and the metadata quantizer/encoder 653.
In some embodiments the encoder 521 comprises a stereo encoder 641. The stereo encoder 641 is configured to receive the stereo audio signals 601 and the determined or selected coding mode and encode the stereo audio signals 601 based on the coding mode. The encoded stereo audio signals can then be output. Furthermore in some embodiments the encoder 521 comprises a mono encoder 643. The mono encoder 643 is configured to receive the mono audio signal 611 and the determined or selected coding mode and encode the mono audio signal 611 based on the coding mode. The encoded mono audio signal can then be output.
Furthermore in some embodiments the encoder 521 comprises a metadata quantizer/encoder 653. The metadata quantizer/encoder 653 is configured to receive the metadata (the metadata 603 associated with the MASA input and the metadata 613 associated with the object input) and the determined or selected coding mode and quantize and/or encode the metadata based on the coding mode. The quantized and/or encoded metadata can then be output.
As shown herein in some embodiments multi-input encoding can be implemented in such a manner as to encode the inputs separately. This may, for example, be implemented at least for bit rates outside the very lowest bit rates. In some embodiments where the very lowest bit rates are required, the input formats may be transformed or suitable downmixing implemented to simplify the audio signals (e.g., by reducing the number of audio signals) to be encoded.
Separate encoding of the input formats may be implemented because it enables different core encoding algorithms to be used for different input types. The multiple input types can be from multiple sources and may be uncorrelated. As such, little is lost by using separate encoders, since encoding gains from correlation between the audio signals would not be available.
In some embodiments the implementation of a joint encoder may require format conversion and rely on correlated signals in order to produce significant improvements (the downmixed scene for joint encoding is generally not fully separable). A joint encoder may also have to determine whether the joint encoding has added correlation artefacts to the component signals.
In some embodiments the encoders do not operate entirely separately. For example, in order to maintain constant bit rate (CBR) operation, it is beneficial to employ common bit rate allocation, in other words, vary the bit rate for encoding of each component signal. In some embodiments this may employ a Discontinuous transmission (DTX) operation based on the output of the detector (VAD/SAD) analysis of the individual inputs. The encoders such as shown in Figure 6 by the stereo encoder 641 and mono encoder 643 may implement a Discontinuous transmission (DTX) operation.
Conventionally a DTX operation is characterized as being either on or off. This means that DTX operation is either used to limit the upstream transmission activity and bandwidth (bit rate) or the codec is operated without DTX operation such that it provides a constant or variable bit rate. Thus, in DTX operation, the bit rate varies, and the transmission frequency varies. In non-DTX operation, the bit rate may vary (if VBR), but the transmission frequency is constant (i.e., all frames are transmitted).
In the embodiments discussed herein the encoders, such as the stereo encoder 641 and the mono encoder 643 are configured to implement a multi-step DTX operation in multi-input encoding. The additional capability can be explicit to the outside of the codec or the DTX property can, e.g., for sake of simplicity remain on/off for codec negotiation and interfacing purposes.
For example in some embodiments the coding mode selector 631 is configured to determine one of the following modes or steps and control the separate encoders in the following manner:
1. Off - No DTX operation, characterized in minimum by all frames being transmitted
2. Adaptive - At least "internal DTX operation" is activated, all frames may be transmitted
3. On - DTX operation is active, inactive frames are being transmitted as SID updates

In some embodiments the coding mode selector 631 is configured to determine one of the following alternative modes:
1. Off - No DTX operation
2. Adaptive internal - DTX operation not visible externally, all frames transmitted
3. Adaptive external - DTX operation is visible externally, limited use of SID updates, activity decision of one stream influences encoding of another stream(s), constant or variable rate encoding for transmitted frames
4. On - DTX operation used
a. Single stream, limited use of SID updates, variable rate encoding for transmitted frames
b. Multi-stream, each multi-input source transmitted in their own stream, varying SID characteristics per stream

The adaptive DTX is employed for multi-input encoding as follows.
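The first, three-step selection can be sketched as follows. The policy shown here (full DTX only when every input is inactive, the adaptive mode when activity is mixed) is one possible interpretation for illustration, not a prescribed algorithm, and the names are invented.

```python
from enum import Enum

class DtxMode(Enum):
    OFF = 0        # no DTX operation: all frames transmitted
    ADAPTIVE = 1   # at least "internal DTX": all frames may be transmitted
    ON = 2         # DTX active: inactive frames sent as SID updates

def select_mode(dtx_negotiated, sad_decisions):
    """Pick a per-frame mode from the per-input SAD decisions.
    sad_decisions is a list of booleans, one per audio input."""
    if not dtx_negotiated:
        return DtxMode.OFF
    if not any(sad_decisions):
        # Every input is inactive: full DTX, SID updates only.
        return DtxMode.ON
    if not all(sad_decisions):
        # Mixed activity: handle inactive inputs internally while
        # keeping the external transmission frequency constant.
        return DtxMode.ADAPTIVE
    return DtxMode.OFF  # all inputs active: regular encoding

mode = select_mode(True, [True, False])
```

The four-step alternative would add an internal/external split to the adaptive branch, but the decision inputs are the same.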
In some embodiments the MASA audio input is spatial ambience audio signals and the object audio input a voice audio signal. The coding mode selector can in some embodiments be configured to determine a coding mode selection based (at least) on individual audio input VAD/SAD decisions.
For example as shown in Figure 6 a first SAD decision is derived at the first signal activity detector (SAD) 621 from the stereo audio signal input (part of the MASA input) and a second SAD decision is derived at the second signal activity detector (SAD) 623 from the mono audio input (part of the object input). The coding mode selector 631 then may be configured to allocate audio signals and bit rates for stereo encoding, mono encoding, as well as bit rate for quantization of the associated input metadata (e.g., MASA metadata and object metadata).
Thus, when the coding mode selector 631 determines that the mode is DTX off, the encoder can be configured to implement a suitable (regular) encoding operation that may utilize a fixed bit rate allocation for each component or some form of variable-rate coding over the components.
In addition, when the coding mode selector 631 determines that the mode is DTX on, the encoder can be configured to implement separate DTX decisions or a combined DTX decision for the individual components.
In some embodiments the coding mode selector 631 can be configured to determine a coding mode on a frame-by-frame basis such that the encoders (and internal encoding) are configured to maintain a constant bit rate as viewed externally, or to limit certain parameters (for example the number of transport signals) in a way that reduces the computational complexity in the encoder and/or decoder. This may, for example, in some embodiments be achieved by a sequential combination of the individual decisions.
An example is now considered for the object + MASA combination input as shown in Figure 6. In this example, we specifically consider MASA that is stereo-based or includes both mono and stereo in the input (in other words a 3-channel MASA configuration with mono and stereo and metadata). Stereo-based MASA can be seen as the most common MASA configuration due to its excellent properties of having two naturally incoherent signals in addition to the spatial metadata providing the "description of the 3D audio image". A mono-based MASA can be trivially upmixed to stereo-based MASA by duplicating the mono channel. Thus, mono-based MASA can utilize stereo-based MASA encoding if needed.
In some embodiments the encoding is performed in an adaptive DTX mode. Separate SAD decisions are derived for the inputs. In some embodiments, the SAD decision for the object is utilized to drive not only the bit rate allocation between the inputs but also the transport signal transmission and the corresponding encoding modes. This is achieved, in some embodiments, based on the following logic:

if SAD_object == 1
{
    /* object activity limits MASA ambience encoding */
    objectEncode();
    if SAD_MASA == 1
    {
        if (N_chanMASA > 1)
        {
            MASAdownmix2mono(); /* select or reduce channels to 1 */
        }
        MASAmonoEncode();
    }
    else
    {
        MASADTX();
    }
}
else
{
    /* during object DTX operation, MASA ambience can use more channels */
    objectDTX();
    if SAD_MASA == 1
    {
        if (N_chanMASA > 1)
        {
            MASAstereoEncode();
        }
        else
        {
            MASAmonoEncode();
        }
    }
    else
    {
        MASADTX();
    }
}

The above mode selection results in the following transmission configuration for a mono object + stereo-based MASA input combination:

                    MASA active                  MASA inactive
Object active       1 object + mono MASA         1 object (+ MASA DTX)
Object inactive     Stereo MASA (+ object DTX)   (Object DTX + MASA DTX)

Thus in this example at most two active channels are encoded. This behaviour may be preferable for efficient spatial audio encoding and transmission due to the following properties:
- a constant external bit rate can be maintained, and quality optimized for it (when no DTX is used);
- quality can be optimized for a constant maximum bit rate (when DTX is used);
- decoding load is maintained at a maximum of two active channels, reducing peak and average complexity and thus saving battery life;
- the object is prioritized (highest importance, as it generally carries voice in this combination);
- MASA spatial quality is enhanced when the object is inactive (stereo-based spatialization is better than mono-based spatialization due to the availability of two naturally incoherent prototypes).
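The selection logic above can also be expressed as a runnable sketch (Python is used for illustration only; the function and mode names are assumptions and do not correspond to any actual codec API):

```python
def select_modes(sad_object: int, sad_masa: int, n_chan_masa: int):
    """Per-frame mode selection for a mono object + MASA input combination.

    Returns (object_mode, masa_mode) as strings; the names are illustrative.
    """
    if sad_object == 1:
        # object activity limits MASA ambience encoding
        object_mode = "encode"
        if sad_masa == 1:
            # downmix or select channels to 1 if needed, then mono encode
            masa_mode = "mono_encode"
        else:
            masa_mode = "dtx"
    else:
        # during object DTX operation, MASA ambience can use more channels
        object_mode = "dtx"
        if sad_masa == 1:
            masa_mode = "stereo_encode" if n_chan_masa > 1 else "mono_encode"
        else:
            masa_mode = "dtx"
    return object_mode, masa_mode

# The four cells of the transmission configuration table:
assert select_modes(1, 1, 2) == ("encode", "mono_encode")  # 1 object + mono MASA
assert select_modes(1, 0, 2) == ("encode", "dtx")          # 1 object (+ MASA DTX)
assert select_modes(0, 1, 2) == ("dtx", "stereo_encode")   # stereo MASA (+ object DTX)
assert select_modes(0, 0, 2) == ("dtx", "dtx")             # object DTX + MASA DTX
```

Note that every case yields at most two actively encoded channels, matching the transmission configuration table.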
Figure 7 shows an example flow diagram of the operation of the encoder (as shown in Figure 6) according to some embodiments.
The audio object input is received as shown in Figure 7 by step 701.
A SAD determination on the audio object input is shown in Figure 7 by step 703. The check of whether the SAD determination is active is shown in Figure 7 by step 705.
Where the audio object SAD determination indicates that it is not active then the encoder can be configured to implement a DTX encoding for the audio object audio signals (and the associated metadata) as shown in Figure 7 by step 707.
Where the audio object SAD determination indicates that it is active then the encoder can be configured to implement an audio object encoding for the audio object audio signals (and the associated metadata) as shown in Figure 7 by step 715.
The MASA input is received as shown in Figure 7 by step 702.
A SAD determination on the MASA input is shown in Figure 7 by step 704. The check of whether the SAD determination is active is shown in Figure 7 by step 706.
Where the MASA SAD determination indicates that it is not active then the encoder can be configured to implement a DTX encoding for the MASA audio signals (and the associated metadata) as shown in Figure 7 by step 713.
Where the MASA SAD determination indicates that it is active and furthermore when the decision is made from the audio object SAD determination (that there is activity with respect to the audio object) then the encoder can be configured to implement a mono MASA encoding for the MASA audio signals (and the associated metadata) as shown in Figure 7 by step 717.
Where the MASA SAD determination indicates that it is active and furthermore when the decision is made from the audio object SAD determination (that there is no activity with respect to the audio object) then the encoder can be configured to implement a stereo MASA encoding for the MASA audio signals (and the associated metadata) as shown in Figure 7 by step 719.
Then the encoder may be configured to determine a bitstream output based on at most two active channels and corresponding metadata as shown in Figure 7 by step 721.
In such embodiments the above processing results in an externally visible DTX operation. This is acceptable when DTX operation activation is desirable. It may, however, be beneficial to implement a DTX operation which is not externally visible, and specifically a constant bit rate operation. This can be achieved according to the adaptive DTX operation scheme in some embodiments by implementing zero padding to maintain a desired total bit rate. In these embodiments internal adaptive DTX operation is carried out according to the above examples, yet all frames are then zero padded such that the transmitted frame is of a fixed size. (This may be implemented without significant processing and results in the same audio quality characteristics as regular DTX operation with its bandwidth advantage.)
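The zero-padding step can be sketched as follows (Python for illustration; the fixed frame size is an assumed parameter, not a value from the disclosure):

```python
def zero_pad_frame(payload: bytes, fixed_frame_size: int) -> bytes:
    """Pad an (internally DTX-reduced) encoded payload with zero bytes so
    that every transmitted frame has the same fixed size, keeping the
    external bit rate constant and the DTX operation externally invisible."""
    if len(payload) > fixed_frame_size:
        raise ValueError("encoded payload exceeds the fixed frame size")
    return payload + bytes(fixed_frame_size - len(payload))
```

For example, a short SID-style payload of 8 bytes padded to an assumed 60-byte frame still occupies the full frame size, so the transmission rate seen outside the codec does not change between active and inactive frames.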
In some embodiments internal adaptive DTX operations may implement adaptive DTX operation only for a single source. In some embodiments an internal adaptive DTX operation is carried out according to a process described below, resulting in a maximum of one source under DTX.
In some embodiments SAD is used for bit rate allocation and transport signal selection only. In these embodiments a process is implemented in a manner similar to those described above; however, no internal DTX operation as such is used, and active noise encoding is used in place of any frame-internal DTX.
Figure 8 shows an example flow diagram of an encoder such as shown in Figure 6 where no externally visible DTX happens. In other words the encoder is configured to transmit every frame. In this example there is constant bit rate encoding. In this example a DTX operation is included (as shown by the "object DTX encoding and MASA DTX encoding" step). The example operates on the combined audio object + MASA input, however, in various examples a similar approach can be utilized for other input format combinations.
The audio inputs (the audio object and MASA input) are received or obtained as shown in Figure 8 by step 801.
Signal activity detection is then performed for both of the inputs as shown in Figure 8 by step 803.
The encoder is then configured to update encoding allocation states based on the current and previous signal activities (and thus selected states) as shown in Figure 8 by step 805. Based on the state update, an encoding mode is then selected. The mode selection step is shown in Figure 8 by step 807.
In some embodiments the states and mode selection may be the following: internal adaptive DTX with a constant number of transport channels. In such embodiments, 2 transport channels are always sent for a combination of audio object input and MASA input.
* In this state the possible modes are
- Object encoding and Mono MASA encoding as shown in Figure 8 by step 809
- Object encoding and Mono MASA active noise encoding as shown in Figure 8 by step 811
- Object DTX encoding and Stereo MASA encoding as shown in Figure 8 by step 815
- Object DTX encoding and MASA active noise encoding as shown in Figure 8 by step 817
- Object active noise encoding and MASA active noise encoding as shown in Figure 8 by step 819
* In some embodiments the coding mode selection may be determined based on the following pseudo code:

codObjprev = ACTIVE; codMasaprev = ACTIVE_MONO; // initialize states

// for each new frame:
if SAD_obj == 1
    codObjcur = ACTIVE;
    if SAD_MASA == 1
        codMasacur = ACTIVE_MONO;
    else
        codMasacur = ACTIVE_MONO_NOISE;
    end
else if SAD_MASA == 1 & N_chanMASA > 1
    codObjcur = INACTIVE_DTX;
    codMasacur = ACTIVE_STEREO;
else if SAD_MASA == 1
    codObjcur = ACTIVE_NOISE;
    codMasacur = ACTIVE_MONO;
else // SAD_MASA == 0
    if codObjprev == INACTIVE_DTX & N_chanMASA > 1
        codObjcur = INACTIVE_DTX;
        codMasacur = ACTIVE_STEREO_NOISE;
    else
        codObjcur = ACTIVE_NOISE;
        codMasacur = ACTIVE_MONO_NOISE;
    end
end

// at the end of each frame:
codObjprev = codObjcur; codMasaprev = codMasacur; // update states

This may be summarized as: if the object input is active, it is actively encoded, and MASA encoding enters mono MASA encoding for either active signal or active noise signal encoding depending on MASA input signal activity. In other words, encoding is implemented according to either step 809 or 811.
If the object input is inactive, it can enter DTX as long as MASA is active and stereo-based. In this case MASA is encoded according to stereo MASA encoding. In other words, encoding is implemented according to step 815.
If object input is inactive and MASA is active and mono-based, active noise signal encoding is used for the object input. MASA is encoded according to mono MASA encoding. In other words implement encoding according to step 819.
If MASA however is also inactive, the algorithm checks previous states, and if the object input was previously inactive and encoded using DTX, the MASA channel configuration may be used to determine the need for updating the object coding mode. It will thus either remain in DTX or switch to active noise encoding. MASA noise encoding will be stereo-based or mono-based, accordingly. In other words, encoding is implemented according to either step 817 or 819. Otherwise, active noise encoding is used for both the object input and the MASA input. In other words, encoding is implemented according to step 819.
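The per-frame state update described above can be sketched as a runnable function (Python for illustration; the state names follow the pseudo code, but the function itself and its signature are assumptions):

```python
def update_coding_modes(sad_obj: int, sad_masa: int, n_chan_masa: int,
                        cod_obj_prev: str):
    """One frame of coding mode selection with a constant number of (two)
    transport channels; returns (cod_obj_cur, cod_masa_cur)."""
    if sad_obj == 1:
        # active object is always actively encoded
        cod_obj = "ACTIVE"
        cod_masa = "ACTIVE_MONO" if sad_masa == 1 else "ACTIVE_MONO_NOISE"
    elif sad_masa == 1 and n_chan_masa > 1:
        # inactive object may enter DTX while stereo-based MASA is active
        cod_obj = "INACTIVE_DTX"
        cod_masa = "ACTIVE_STEREO"
    elif sad_masa == 1:
        # mono-based MASA active: object uses active noise encoding
        cod_obj = "ACTIVE_NOISE"
        cod_masa = "ACTIVE_MONO"
    else:
        # both inputs inactive: check the previous object state
        if cod_obj_prev == "INACTIVE_DTX" and n_chan_masa > 1:
            cod_obj = "INACTIVE_DTX"
            cod_masa = "ACTIVE_STEREO_NOISE"
        else:
            cod_obj = "ACTIVE_NOISE"
            cod_masa = "ACTIVE_MONO_NOISE"
    return cod_obj, cod_masa
```

A caller would feed the returned object state back in as cod_obj_prev for the next frame, mirroring the state update at the end of the pseudo code.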
In some embodiments the states and mode selection may be an internal adaptive DTX with no constraints on number of transport channels.
In these embodiments all of the encoding modes shown in Figure 8 may be considered.
- Object encoding and Mono MASA encoding as shown in Figure 8 by step 809
- Object encoding and Mono MASA active noise encoding as shown in Figure 8 by step 811
- Object encoding and MASA DTX encoding as shown in Figure 8 by step 813
- Object DTX encoding and Stereo MASA encoding as shown in Figure 8 by step 815
- Object DTX encoding and MASA active noise encoding as shown in Figure 8 by step 817
- Object active noise encoding and MASA active noise encoding as shown in Figure 8 by step 819
- Object DTX encoding and MASA DTX encoding as shown in Figure 8 by step 821.
In some embodiments the coding mode selection operation is similar as for the above example. However in these embodiments the transmission can now be limited to one active transport channel. This choice can be implementation specific and depend, for example, on the bit rate or signal type.
In some embodiments where both and/or all audio inputs are inactive the apparatus can be configured to select elements or data for transmission. The selection can in some embodiments be implementation specific. In some embodiments, as it can generally be beneficial to send active noise modelling for the background signal rather than the voice object (which is silent), the background signal modelling information is transmitted.
In some embodiments when the second / last active input becomes inactive while the first / other inputs stay in inactive state the DTX operation is maintained for any source that was already in DTX, and active noise encoding is selected for the second / last input.
In some embodiments DTX may be implemented using an adaptive DTX operation. In such embodiments this is shown by the Object DTX encoding and MASA DTX encoding in Figure 8 by step 821, and may be similar to that shown in Figure 7 where SID-only transmission is allowed.
Thus the previously presented table may be modified as follows, when no external DTX is performed and a transport channel constraint is in place:

                    MASA active                  MASA inactive
Object active       1 object + mono MASA         1 object + mono MASA (noise)
Object inactive     Stereo MASA (+ object DTX)   Stereo MASA (noise) (+ object DTX) OR 1 object (noise) + mono MASA (noise)

With respect to Figure 9 an example electronic device which may be used as the analysis or synthesis device is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes such as the methods described herein.
In some embodiments the device 1400 comprises a memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407.
In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IRDA).
The transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDS II, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

Claims (15)

  1. CLAIMS: 1. An apparatus comprising means configured to: obtain two or more separate audio signal inputs; determine signal activity within the two or more separate audio signal inputs; determine a coding mode from three or more coding modes, wherein at least one of the three or more coding modes comprises at least one adaptive discontinuous transmission coding mode and the coding mode is determined based on the signal activity within the two or more separate audio signal inputs; and encode the two or more audio signal inputs based on the determined coding mode.
  2. 2. The apparatus as claimed in claim 1, wherein the at least one adaptive discontinuous transmission coding mode comprises at least one externally visible adaptive discontinuous transmission coding mode, and the means configured to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally visible adaptive discontinuous transmission coding mode is configured to adaptively encode the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to encode inactive signal activity within the two or more separate audio signal inputs as silence descriptor elements.
  3. 3. The apparatus as claimed in claim 2, wherein the means configured to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally visible adaptive discontinuous transmission coding mode is configured to adaptively encode one of the two or more audio signal inputs based on signal activity within the one of the two or more separate audio signal inputs and at least one of: signal activity within another one of the two or more separate audio signal inputs; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is kept constant; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is variable; and a number of encoded channels to be output from the encoded others of the two or more audio signal inputs.
  4. 4. The apparatus as claimed in any of claims 2 or 3, wherein the means configured to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally visible adaptive discontinuous transmission coding mode is configured to control at least one of: a number of encoded channels to be output from the encoded one of the two or more audio signal inputs; a bit rate of the encoded one of the two or more audio signal inputs.
  5. 5. The apparatus as claimed in claim 4, wherein the means configured to control the number of encoded channels to be output from the encoded one of the two or more audio signal inputs is configured to control the number of encoded channels to be output from the encoded one of the two or more audio signal inputs such that a total number of channels output from encoding all of the two or more audio signal inputs is constant.
  6. 6. The apparatus as claimed in claim 1, wherein the at least one adaptive discontinuous transmission coding mode comprises at least one externally invisible adaptive discontinuous transmission coding mode, and the means configured to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally invisible adaptive discontinuous transmission coding mode is configured to adaptively encode the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to encode inactive signal activity within the two or more separate audio signal inputs as silence descriptor elements, but maintain a constant number of output channels and/or a constant output bitrate.
  7. 7. The apparatus as claimed in claim 6, wherein the means configured to encode the two or more audio signal inputs based on the determined coding mode when the coding mode is an externally invisible adaptive discontinuous transmission coding mode is configured to adaptively encode one of the two or more audio signal inputs based on signal activity within the one of the two or more separate audio signal inputs and at least one of: signal activity within another one of the two or more separate audio signal inputs; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is kept constant; a determined output bit rate and an encoding rate of the others of the two or more separate audio signal inputs, such that a combined bit rate for encoding the two or more audio signal inputs is variable; and a number of encoded channels to be output from the encoded others of the two or more audio signal inputs.
  8. 8. The apparatus as claimed in any of claims 6 or 7, wherein the means configured to encode the two or more audio signal inputs based on the determined coding mode when the discontinuous transmission coding mode is an externally invisible adaptive discontinuous transmission coding mode is configured to control at least one of: a number of encoded channels to be output from the encoded one of the two or more audio signal inputs to maintain a constant number of output channels; and a bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate.
  9. 9. The apparatus as claimed in claim 8, wherein the means configured to control the bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate is configured to apply zero padding to the encoded audio signal inputs.
  10. 10. The apparatus as claimed in claim 8, wherein the means configured to control the bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate is configured to apply an adaptive discontinuous transmission coding mode for a single source of the two or more audio signal inputs, resulting in a maximum of a single encoded audio signal input implementing discontinuous transmission encoding assistance.
  11. 11. The apparatus as claimed in claim 8, wherein the means configured to control the bit rate of the encoded one of the two or more audio signal inputs to maintain a constant output bitrate is configured to apply an adaptive discontinuous transmission coding mode for bit rate allocation and transport signal selection for the encoded one of the two or more audio signal inputs.
  12. 12. The apparatus as claimed in any of claims 1 to 11, wherein the three or more coding modes further comprises an off mode wherein the means configured to encode the two or more audio signal inputs based on the determined coding mode is configured to encode the two or more audio signal inputs without any discontinuous transmission encoding assistance.
  13. 13. The apparatus as claimed in any of claims 1 to 12, wherein the three or more coding modes further comprise an on mode wherein the means configured to encode the two or more audio signal inputs based on the determined coding mode is configured to encode the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to encode inactive signal activity within the two or more separate audio signal inputs as silence descriptor elements.
  14. 14. The apparatus as claimed in any of claims 1 to 12, wherein the three or more coding modes further comprise an on mode wherein the means configured to encode the two or more audio signal inputs based on the determined coding mode is configured to encode the two or more audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to adaptively individually encode one or more of the two or more separate audio signal inputs with discontinuous transmission encoding assistance, the discontinuous transmission encoding assistance configured to encode inactive signal activity within the one or more of the two or more separate audio signal inputs as silence descriptor elements and others of the two or more audio signal inputs without any discontinuous transmission encoding assistance.
  15. 15. A method for an apparatus, the method comprising: obtaining two or more separate audio signal inputs; determining signal activity within the two or more separate audio signal inputs; determining a coding mode from three or more coding modes, wherein at least one of the three or more coding modes comprises at least one adaptive discontinuous transmission coding mode and the coding mode is determined based on the signal activity within the two or more separate audio signal inputs; and encoding the two or more audio signal inputs based on the determined coding mode.
GB2008767.2A 2020-06-10 2020-06-10 Adapting multi-source inputs for constant rate encoding Withdrawn GB2595891A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB2008767.2A GB2595891A (en) 2020-06-10 2020-06-10 Adapting multi-source inputs for constant rate encoding
EP21169284.3A EP3923280A1 (en) 2020-06-10 2021-04-20 Adapting multi-source inputs for constant rate encoding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2008767.2A GB2595891A (en) 2020-06-10 2020-06-10 Adapting multi-source inputs for constant rate encoding

Publications (2)

Publication Number Publication Date
GB202008767D0 GB202008767D0 (en) 2020-07-22
GB2595891A true GB2595891A (en) 2021-12-15

Family

ID=71616120

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2008767.2A Withdrawn GB2595891A (en) 2020-06-10 2020-06-10 Adapting multi-source inputs for constant rate encoding

Country Status (2)

Country Link
EP (1) EP3923280A1 (en)
GB (1) GB2595891A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024051954A1 (en) * 2022-09-09 2024-03-14 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Encoder and encoding method for discontinuous transmission of parametrically coded independent streams with metadata

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014096279A1 (en) * 2012-12-21 2014-06-26 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Generation of a comfort noise with high spectro-temporal resolution in discontinuous transmission of audio signals
EP2793227A1 (en) * 2011-12-30 2014-10-22 Huawei Technologies Co., Ltd. Audio data processing method, device and system
US20190267014A1 (en) * 2013-02-22 2019-08-29 Telefonaktiebolaget Lm Ericsson (Publ) Methods and apparatuses for dtx hangover in audio coding

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5753540B2 (en) * 2010-11-17 2015-07-22 Panasonic Intellectual Property Corporation of America Stereo signal encoding device, stereo signal decoding device, stereo signal encoding method, and stereo signal decoding method
CN105247610B (en) * 2013-05-31 2019-11-08 索尼公司 Code device and method, decoding apparatus and method and recording medium
US20160323425A1 (en) * 2015-04-29 2016-11-03 Qualcomm Incorporated Enhanced voice services (evs) in 3gpp2 network
EP4047601A3 (en) * 2018-04-05 2022-12-21 Telefonaktiebolaget LM Ericsson (publ) Support for generation of comfort noise, and generation of comfort noise


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024051954A1 (en) * 2022-09-09 2024-03-14 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Encoder and encoding method for discontinuous transmission of parametrically coded independent streams with metadata
WO2024052450A1 (en) * 2022-09-09 2024-03-14 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Encoder and encoding method for discontinuous transmission of parametrically coded independent streams with metadata

Also Published As

Publication number Publication date
GB202008767D0 (en) 2020-07-22
EP3923280A1 (en) 2021-12-15

Similar Documents

Publication Publication Date Title
US10504527B2 (en) Audio signal decoder, audio signal encoder, method for providing an upmix signal representation, method for providing a downmix signal representation, computer program and bitstream using a common inter-object-correlation parameter value
EP3692524B1 (en) Multi-stream audio coding
JP5281575B2 (en) Audio object encoding and decoding
JP5081838B2 (en) Audio encoding and decoding
WO2019010033A1 (en) Multi-stream audio coding
US20150371643A1 (en) Stereo audio signal encoder
US8391513B2 (en) Stream synthesizing device, decoding unit and method
US20220383885A1 (en) Apparatus and method for audio encoding
CN114207714A (en) MASA with embedded near-far stereo for mobile devices
EP3923280A1 (en) Adapting multi-source inputs for constant rate encoding
US20200402521A1 (en) Performing psychoacoustic audio coding based on operating conditions
US11096002B2 (en) Energy-ratio signalling and synthesis
CN114008704A (en) Encoding scaled spatial components
JP2024510205A (en) Audio codec with adaptive gain control of downmixed signals
GB2598104A (en) Discontinuous transmission operation for spatial audio parameters
GB2596138A (en) Decoder spatial comfort noise generation for discontinuous transmission operation
WO2023031498A1 (en) Silence descriptor using spatial parameters
KR20230084232A (en) Quantization of audio parameters
TW202411984A (en) Encoder and encoding method for discontinuous transmission of parametrically coded independent streams with metadata
CN116982109A (en) Audio codec with adaptive gain control of downmix signals

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)