GB2593672A - Switching between audio instances - Google Patents


Info

Publication number
GB2593672A
GB2593672A GB2004184.4A GB202004184A GB2593672A GB 2593672 A GB2593672 A GB 2593672A GB 202004184 A GB202004184 A GB 202004184A GB 2593672 A GB2593672 A GB 2593672A
Authority
GB
United Kingdom
Prior art keywords
groups
instances
active
audio
instance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB2004184.4A
Other versions
GB202004184D0 (en)
Inventor
Lasse Juhani Laaksonen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB2004184.4A priority Critical patent/GB2593672A/en
Publication of GB202004184D0 publication Critical patent/GB202004184D0/en
Priority to EP21774239.4A priority patent/EP4128821A4/en
Priority to PCT/FI2021/050135 priority patent/WO2021191493A1/en
Publication of GB2593672A publication Critical patent/GB2593672A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • G10L19/167Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S7/00Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/40Visual indication of stereophonic sound image
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S7/00Indicating arrangements; Control arrangements, e.g. balance control

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Stereophonic System (AREA)

Abstract

In a spatial audio Immersive Voice Audio Service (IVAS) encoder, an audio stream 206 is analysed for instances of, e.g., audio tracks or objects 200, which can be initialised into groups and set as active or inactive according to a flag or gain parameter, with only the active groups being processed, encoded and transmitted 218 to a receiver 231 for processing and rendering. This allows a user 281 to independently manipulate audio objects in their rendered audio scene, e.g. by setting all the users in Room A to the right and the user in Room B to the left.

Description

SWITCHING BETWEEN AUDIO INSTANCES
Field
The present application relates to apparatus and methods for switching between audio instances in sound-field related audio representation and rendering, but not exclusively for switching between audio instances in sound-field related audio representation for an audio decoder.
Background
Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the immersive voice and audio services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network. Such immersive services include uses for example in immersive voice and audio for applications such as virtual reality (VR), augmented reality (AR) and mixed reality (MR) as well as spatial voice communication including teleconferencing. This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
The input signals are presented to the IVAS encoder in one of the supported formats (and in some allowed combinations of the formats). Similarly, it is expected that the decoder can output the audio in supported formats. A pass-through mode has been proposed, where the audio could be provided in its original format after transmission (encoding/decoding). This provision may, for example, be used for user manipulation of audio sources in the audio scene and relates to an object-based audio format.
Summary
There is provided according to a first aspect an apparatus comprising means configured to: initialize two or more instances of groups within at least one audio signal data stream, wherein each of the instances of groups comprise at least one member and wherein at least one of the instances of groups is set active and at least one of the instances of groups is set inactive, such that members of the active group instances can be processed; and encode the at least one of the instances of groups set active for storage and/or transmission.
The means may be further configured to control the transmission of the encoded at least one of the instances of groups set active to at least one further apparatus.
The means may be further configured to control the storage of the encoded at least one of the instances of groups set active to at least one further apparatus.
The means may be further configured to: receive, from at least one further apparatus, a request to modify which of the at least one of the instances of groups is set active and which at least one of the instances of groups is set inactive; and modify which of the at least one of the instances of groups is set active and which at least one of the instances of groups is set inactive based on the request.
The means may be configured to: set a parameter associated with each of the instances of groups, wherein a value of the parameter associated with each of the instances of groups defines which of the at least one of the instances of groups is active and which of the at least one of the instances of groups is inactive.
The parameter may comprise at least one of: a gain parameter; and a flag parameter, wherein a value of 1 defines the instance of groups is active and a value of 0 defines the instance of groups is inactive.
The gain parameter may be configured such that a value of 1 defines the instance of groups is active, a value of 0 defines the instance of groups is inactive, and a value greater than 0 and less than 1 defines that the instance of groups is active but diminished.
The members may comprise at least one of: an audio track; and an audio object.
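The first-aspect behaviour above (initialise group instances that each hold at least one member, mark some active and some inactive, then encode only the active ones) can be sketched as follows. This is a minimal illustration: `GroupInstance`, `encode_active_groups`, and the placeholder "encoding" strings are invented for the example and are not part of the IVAS codec.

```python
from dataclasses import dataclass

# Illustrative sketch only: group instances with at least one member each,
# an active flag, and "encoding" restricted to members of active groups.
# GroupInstance and encode_active_groups are invented names, not IVAS APIs.

@dataclass
class GroupInstance:
    members: list        # e.g. audio tracks or audio objects
    active: bool = False

def encode_active_groups(groups):
    """Return placeholder encodings for members of active group instances only."""
    encoded = []
    for group in groups:
        if group.active:
            encoded.extend(f"enc({member})" for member in group.members)
    return encoded

groups = [
    GroupInstance(members=["talker_1", "talker_2"], active=True),
    GroupInstance(members=["ambience"], active=False),
]
encoded = encode_active_groups(groups)  # only the active group is encoded
```

Inactive groups simply contribute nothing to the encoded output, which is the storage/transmission saving the aspect describes.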
According to a second aspect there is provided an apparatus comprising means configured to: obtain at least one audio signal data stream; obtain from the at least one audio signal data stream at least one of instance of groups set active, wherein the at least one of the instance of groups set active is at least one of two or more instances of groups and each of the instances of groups comprise at least one member; and process and render a spatial audio signal from the at least one of instance of groups set active.
The means configured to obtain at least one audio signal data stream may be further configured to: receive the at least one audio signal data stream from at least one further apparatus; and retrieve the at least one audio signal data stream from at least one memory.
The means may be further configured to: transmit to the at least one further apparatus, a request to modify which of the at least one of the instances of groups is set active, wherein the at least one further apparatus may be configured to modify which of the at least one of the instances of groups is set active and furthermore set at least one of the instances of groups inactive based on the request.
The means may be configured to: obtain a parameter associated with each of the instances of groups, wherein a value of the parameter associated with each of the instances of groups defines which of the at least one of the instances of groups is active.
The parameter may comprise at least one of: a gain parameter; and a flag parameter, wherein a value of 1 defines the instance of groups is active and a value of 0 defines the instance of groups is inactive.
The gain parameter may be configured such that a value of 1 defines the instance of groups is active, a value of 0 defines the instance of groups is inactive, and a value greater than 0 and less than 1 defines that the instance of groups is active but diminished.
The members may comprise at least one of: an audio track; and an audio object.
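The gain-parameter semantics described above (1 means active, 0 means inactive, a value between 0 and 1 means active but diminished) can be expressed as a small interpretation function. This is a sketch only; the function and state names are illustrative, not taken from the codec.

```python
def group_state(gain: float) -> str:
    """Map a per-group gain parameter to an activity state (illustrative names)."""
    if gain <= 0.0:
        return "inactive"            # gain 0: group is inactive
    if gain >= 1.0:
        return "active"              # gain 1: group is fully active
    return "active-diminished"       # 0 < gain < 1: active but diminished
```

A flag parameter is the degenerate case where only the values 0 and 1 occur.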
According to a third aspect there is provided a method for an apparatus comprising: initializing two or more instances of groups within at least one audio signal data stream, wherein each of the instances of groups comprise at least one member and wherein at least one of the instances of groups is set active and at least one of the instances of groups is set inactive, such that members of the active group instances can be processed; and encoding the at least one of the instances of groups set active for storage and/or transmission.
The method may further comprise controlling the transmission of the encoded at least one of the instances of groups set active to at least one further apparatus.
The method may further comprise controlling the storage of the encoded at least one of the instances of groups set active to at least one further apparatus.
The method may further comprise: receiving, from at least one further apparatus, a request to modify which of the at least one of the instances of groups is set active and which at least one of the instances of groups is set inactive; and modifying which of the at least one of the instances of groups is set active and which at least one of the instances of groups is set inactive based on the request.
The method may further comprise: setting a parameter associated with each of the instances of groups, wherein a value of the parameter associated with each of the instances of groups defines which of the at least one of the instances of groups is active and which of the at least one of the instances of groups is inactive.
The parameter may comprise at least one of: a gain parameter; and a flag parameter, wherein a value of 1 defines the instance of groups is active and a value of 0 defines the instance of groups is inactive.
The gain parameter may be configured such that a value of 1 defines the instance of groups is active, a value of 0 defines the instance of groups is inactive, and a value greater than 0 and less than 1 defines that the instance of groups is active but diminished.
The members may comprise at least one of: an audio track; and an audio object.
According to a fourth aspect there is provided a method for an apparatus comprising: obtaining at least one audio signal data stream; obtaining from the at least one audio signal data stream at least one of instance of groups set active, wherein the at least one of the instance of groups set active is at least one of two or more instances of groups and each of the instances of groups comprise at least one member; and processing and rendering a spatial audio signal from the at least one of instance of groups set active.
Obtaining at least one audio signal data stream may comprise one of: receiving the at least one audio signal data stream from at least one further apparatus; and retrieving the at least one audio signal data stream from at least one memory.
The method may comprise: transmitting to the at least one further apparatus, a request to modify which of the at least one of the instances of groups is set active, wherein the at least one further apparatus may be configured to modify which of the at least one of the instances of groups is set active and furthermore set at least one of the instances of groups inactive based on the request.
The method may further comprise: obtaining a parameter associated with each of the instances of groups, wherein a value of the parameter associated with each of the instances of groups may define which of the at least one of the instances of groups is active.
The parameter may comprise at least one of: a gain parameter; and a flag parameter, wherein a value of 1 defines the instance of groups is active and a value of 0 defines the instance of groups is inactive.
The gain parameter may be configured such that a value of 1 defines the instance of groups is active, a value of 0 defines the instance of groups is inactive, and a value greater than 0 and less than 1 defines that the instance of groups is active but diminished.
The members may comprise at least one of: an audio track; and an audio object.
According to a fifth aspect there is provided an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: initialize two or more instances of groups within at least one audio signal data stream, wherein each of the instances of groups comprise at least one member and wherein at least one of the instances of groups is set active and at least one of the instances of groups is set inactive, such that members of the active group instances can be processed; and encode the at least one of the instances of groups set active for storage and/or transmission.
The apparatus may be further caused to control the transmission of the encoded at least one of the instances of groups set active to at least one further apparatus.
The apparatus may be further caused to control the storage of the encoded at least one of the instances of groups set active to at least one further apparatus.
The apparatus may be further caused to: receive, from at least one further apparatus, a request to modify which of the at least one of the instances of groups is set active and which at least one of the instances of groups is set inactive; and modify which of the at least one of the instances of groups is set active and which at least one of the instances of groups is set inactive based on the request.
The apparatus may be further caused to: set a parameter associated with each of the instances of groups, wherein a value of the parameter associated with each of the instances of groups defines which of the at least one of the instances of groups is active and which of the at least one of the instances of groups is inactive.
The parameter may comprise at least one of: a gain parameter; and a flag parameter, wherein a value of 1 defines the instance of groups is active and a value of 0 defines the instance of groups is inactive.
The gain parameter may be configured such that a value of 1 defines the instance of groups is active, a value of 0 defines the instance of groups is inactive, and a value greater than 0 and less than 1 defines that the instance of groups is active but diminished.
The members may comprise at least one of: an audio track; and an audio object.
According to a sixth aspect there is provided an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain at least one audio signal data stream; obtain from the at least one audio signal data stream at least one of instance of groups set active, wherein the at least one of the instance of groups set active is at least one of two or more instances of groups and each of the instances of groups comprise at least one member; and process and render a spatial audio signal from the at least one of instance of groups set active.
The apparatus caused to obtain at least one audio signal data stream may be further caused to: receive the at least one audio signal data stream from at least one further apparatus; and retrieve the at least one audio signal data stream from at least one memory.
The apparatus may be further caused to: transmit to the at least one further apparatus, a request to modify which of the at least one of the instances of groups is set active, wherein the at least one further apparatus may be configured to modify which of the at least one of the instances of groups is set active and furthermore set at least one of the instances of groups inactive based on the request.
The apparatus may be further caused to: obtain a parameter associated with each of the instances of groups, wherein a value of the parameter associated with each of the instances of groups defines which of the at least one of the instances of groups is active.
The parameter may comprise at least one of: a gain parameter; and a flag parameter, wherein a value of 1 defines the instance of groups is active and a value of 0 defines the instance of groups is inactive.
The gain parameter may be configured such that a value of 1 defines the instance of groups is active, a value of 0 defines the instance of groups is inactive, and a value greater than 0 and less than 1 defines that the instance of groups is active but diminished.
The members may comprise at least one of: an audio track; and an audio object.
According to a seventh aspect there is provided an apparatus comprising initializing circuitry configured to initialize two or more instances of groups within at least one audio signal data stream, wherein each of the instances of groups comprise at least one member and wherein at least one of the instances of groups is set active and at least one of the instances of groups is set inactive, such that members of the active group instances can be processed; and encoding circuitry configured to encode the at least one of the instances of groups set active for storage and/or transmission.
According to an eighth aspect there is provided an apparatus comprising obtaining circuitry configured to obtain at least one audio signal data stream; decoding circuitry configured to obtain from the at least one audio signal data stream at least one of instance of groups set active, wherein the at least one of the instance of groups set active is at least one of two or more instances of groups and each of the instances of groups comprise at least one member; and processing and rendering circuitry configured to process and render a spatial audio signal from the at least one of instance of groups set active.
According to a ninth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus to perform at least the following: initialize two or more instances of groups within at least one audio signal data stream, wherein each of the instances of groups comprise at least one member and wherein at least one of the instances of groups is set active and at least one of the instances of groups is set inactive, such that members of the active group instances can be processed; and encode the at least one of the instances of groups set active for storage and/or transmission.
According to a tenth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus to perform at least the following: obtain at least one audio signal data stream; obtain from the at least one audio signal data stream at least one of instance of groups set active, wherein the at least one of the instance of groups set active is at least one of two or more instances of groups and each of the instances of groups comprise at least one member; and process and render a spatial audio signal from the at least one of instance of groups set active.
According to an eleventh aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: initialize two or more instances of groups within at least one audio signal data stream, wherein each of the instances of groups comprise at least one member and wherein at least one of the instances of groups is set active and at least one of the instances of groups is set inactive, such that members of the active group instances can be processed; and encode the at least one of the instances of groups set active for storage and/or transmission.
According to a twelfth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtain at least one audio signal data stream; obtain from the at least one audio signal data stream at least one of instance of groups set active, wherein the at least one of the instance of groups set active is at least one of two or more instances of groups and each of the instances of groups comprise at least one member; and process and render a spatial audio signal from the at least one of instance of groups set active.
According to a thirteenth aspect there is provided an apparatus comprising: means for initializing two or more instances of groups within at least one audio signal data stream, wherein each of the instances of groups comprise at least one member and wherein at least one of the instances of groups is set active and at least one of the instances of groups is set inactive, such that members of the active group instances can be processed; and means for encoding the at least one of the instances of groups set active for storage and/or transmission.
According to a fourteenth aspect there is provided an apparatus comprising: means for obtaining at least one audio signal data stream; means for obtaining from the at least one audio signal data stream at least one of instance of groups set active, wherein the at least one of the instance of groups set active is at least one of two or more instances of groups and each of the instances of groups comprise at least one member; and means for processing and rendering a spatial audio signal from the at least one of instance of groups set active.
According to a fifteenth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus to perform at least the following: initialize two or more instances of groups within at least one audio signal data stream, wherein each of the instances of groups comprise at least one member and wherein at least one of the instances of groups is set active and at least one of the instances of groups is set inactive, such that members of the active group instances can be processed; and encode the at least one of the instances of groups set active for storage and/or transmission.
According to a sixteenth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtain at least one audio signal data stream; obtain from the at least one audio signal data stream at least one of instance of groups set active, wherein the at least one of the instance of groups set active is at least one of two or more instances of groups and each of the instances of groups comprise at least one member; and process and render a spatial audio signal from the at least one of instance of groups set active.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figures 1a and 1b show example server and peer-to-peer teleconferencing systems within which embodiments may be implemented;
Figure 2 shows schematically an example encoder-decoder configuration for a server based teleconferencing system as shown in Figure 1a according to some embodiments;
Figure 3 shows schematically a further example encoder-decoder configuration for a server based teleconferencing system as shown in Figure 1a according to some embodiments;
Figure 4 shows schematically a user interface configuration example for a decoder;
Figure 5 shows an example encoder-decoder configuration for a server based teleconferencing system implementing the input from the user interface as shown in Figure 4 according to some embodiments;
Figure 6 shows a flow diagram showing a method of operation for the user interface and encoder-decoder configuration as shown in Figures 4 and 5 according to some embodiments; and
Figure 7 shows an example device suitable for implementing the apparatus shown.
Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for the provision of efficient switching between (IVAS) audio instances.
As such the embodiments discussed herein are concerned with a codec, for example an IVAS codec, configured to support a multi-input mode of operation wherein the codec is configured to provide a framework for decoding/rendering of multiple input streams, each typically originating from a different encoder.
According to these embodiments there is provided at least one codec input audio stream which can comprise multiple audio inputs.
In some embodiments the audio inputs within the signal can be allocated, for example into separate encoder instances based on a parameter (for example a track-group allocation parameter). This parameter may furthermore be encoded as metadata and be transmitted/stored with the audio signal.
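The allocation step above can be sketched as follows. This is a minimal illustration assuming a per-input `track_group` parameter and a JSON metadata serialisation; neither the field name nor the representation is mandated by the text.

```python
import json

# Illustrative sketch: allocate audio inputs to encoder instances by a
# track-group allocation parameter and serialise the allocation as metadata.
# The "track_group" field name and JSON representation are assumptions.

inputs = [
    {"id": "object_A", "track_group": 0},
    {"id": "object_B", "track_group": 0},
    {"id": "object_C", "track_group": 1},
]

def allocate_by_track_group(inputs):
    """Group input identifiers by their track-group parameter."""
    allocation = {}
    for item in inputs:
        allocation.setdefault(item["track_group"], []).append(item["id"])
    return allocation

allocation = allocate_by_track_group(inputs)
metadata = json.dumps({str(group): ids for group, ids in allocation.items()})
```

The serialised `metadata` stands in for the track-group parameter that would be transmitted or stored alongside the audio signal.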
In some embodiments, and as discussed in the following examples, audio scene manipulation or processing employed by a 'receiving' apparatus is fixed and/or the same for audio elements that belong to the same track group. In other words, a listener would be allowed to rotate, e.g., audio objects belonging to track group 0 to a different rendering position relative to the listener independently of audio objects belonging to track group 1 and vice versa. However, in such embodiments a listener would not be allowed to rotate audio object A independently of audio object B, where the two audio objects belong to the same track group, e.g., to track group 0.
In some embodiments as discussed herein an advantage offered to the user (listener) is the ability to control the audio scene presentation. Such control may also be called presentation manipulation. For example, a user may wish to adjust the volume of an audio source "talker's voice" relative to the transmitted ambience (background sounds). A further example may be one where the user may wish to rotate an audio source "Linda" from right to front center of the scene while maintaining a further audio source "Mark" on the left-hand side. This is generally possible, for example when the (IVAS) encoder is configured to encode individual audio elements (e.g., objects, MASA, 5.1) such that they are separable at the (IVAS) decoder and renderer. This can be called pass-through operation.
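The "rotate Linda to front centre while Mark stays on the left" manipulation above can be sketched as a per-source azimuth update on separable audio sources. Everything here is illustrative: the angle convention (degrees, 0 = front, positive = listener's left), the source names, and the function name are assumptions for the example.

```python
# Illustrative sketch of presentation manipulation on separable audio sources:
# rotate one source to front centre while another keeps its position.
# Azimuth convention (degrees, 0 = front, positive = left) is an assumption.

scene = {
    "Linda": {"azimuth": -90.0},  # right of the listener
    "Mark": {"azimuth": 90.0},    # left of the listener
}

def rotate_source(scene, name, new_azimuth):
    """Return a copy of the scene with one source moved to a new azimuth."""
    updated = {source: dict(params) for source, params in scene.items()}
    updated[name]["azimuth"] = new_azimuth
    return updated

manipulated = rotate_source(scene, "Linda", 0.0)  # Linda to front centre
```

The point is that such per-source control requires the sources to remain separable at the decoder/renderer, i.e. the pass-through operation named above.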
The embodiments as described herein attempt to enable such manipulation of individual audio sources to be implemented even where the encoding comprises the audio sources associated or grouped together as a single track group.
An example system within which embodiments may be implemented is shown in Figures 1a and 1b.
Figure 1a, for example, shows a teleconferencing system within which some embodiments can be implemented. In this example there is shown three sites or rooms, Room A 101, Room B 103, and Room C 105. Room A 101 comprises three 'talkers', Talker 1 111, Talker 2 113, and Talker 3 115. Room B 103 comprises one 'talker', Talker 4 121. Room C 105 comprises one 'talker', Talker 5 131.
In the following example within room A is a suitable teleconference apparatus 102 configured to spatially capture and encode the audio environment and furthermore is configured to render a spatial audio signal to the room. Within each of the other rooms may be a suitable teleconference apparatus 104, 106 configured to render a spatial audio signal to the room and furthermore is configured to capture and encode at least a mono audio and optionally configured to spatially capture and encode the audio environment. In the following examples each room is provided with the means to spatially capture, encode spatial audio signals, receive spatial audio signals and render these to a suitable listener. It would be understood that there may be other embodiments where the system comprises some apparatus configured to only capture and encode audio signals (in other words the apparatus is a 'transmit' only apparatus), and other apparatus configured to only receive and render audio signals (in other words the apparatus is a 'receive' only apparatus). In such embodiments the system within which embodiments may be implemented may comprise apparatus with varying abilities to capture/render audio signals.
The teleconference apparatus (for each site or room) 102, 104, 106 is further configured to call into a teleconference controlled by and implemented over a server or multipoint control unit (MCU) 107.
Figure 1b shows a further (peer-to-peer) teleconferencing system within which some embodiments can be implemented. In this example there is shown three sites or rooms, Room A 101, Room B 103, and Room C 105. Room A 101 comprises three 'talkers', Talker 1 111, Talker 2 113, and Talker 3 115. Room B 103 comprises one 'talker', Talker 4 121. Room C 105 comprises one 'talker', Talker 5 131. Within each of the rooms is a suitable teleconference apparatus 102, 104, 106 configured to spatially capture and encode the audio environment and furthermore is configured to render a spatial audio signal to the room. The teleconference apparatus (for each site or room) 102, 104, 106 is further configured to communicate with each other to implement a teleconference function. As shown in Figure 1b, the IVAS decoder/renderer for each of the teleconference apparatus 102, 104, 106 can be configured to handle multiple input streams that may each originate from a different encoder. For example, the apparatus 106 in room C 105 is configured to decode/render simultaneously audio streams from room A 101 and room B 103.
In some embodiments there can be advantages in terms of complexity of the decoding/rendering and synchronization of the presentation of the streams when these simultaneous decoding/rendering can be done using the same (IVAS) decoder/renderer instance. Alternatively, in some embodiments the handling of multiple input streams that may each originate from a different encoder could be implemented by two or more (IVAS) decoder instances and the audio outputs from the instances mixed in the rendering operations. In the latter case, it may be preferable to use external rendering operation instead or in addition to the integrated (IVAS) rendering in order to allow for the manipulation of the audio outputs relative to each other.
Figure 2 shows the system of Figure 1a where the MCU is implementing an encoding for the downstream of one RX user 281 (within room C 105). In this example each talker is represented as a separate audio object for the (IVAS) encoder. For example, each user 'talker' may be wearing a lavalier microphone for individual voice pick-up.
Thus for example as shown in Figure 2, the talkers for room A 101 could be talker 1 111 which using a first lavalier microphone generates audio object 200, talker 2 113 which using a second lavalier microphone generates audio object 202, and talker 3 115 which using a third lavalier microphone generates audio object 204. The teleconference apparatus can comprise an (IVAS) encoder 201 which is configured to receive the audio objects and encode the objects based on a suitable encoding to generate a bitstream 206. The bitstream 206 is passed to the MCU 221.
Additionally for example as shown in Figure 2, the talkers for room B 103 could be talker 4 121 which using a fourth lavalier microphone generates audio object 210. The teleconference apparatus can comprise a further (IVAS) encoder 211 which is configured to receive the audio objects from room B 103 and encode the objects based on a suitable encoding to generate a bitstream 216. The bitstream 216 is passed to the MCU 221.
The MCU (conferencing server) 221 comprises a (IVAS) decoder / (IVAS) encoder and is configured to decode the two (IVAS) bitstreams from rooms A (bitstream 206) and B (bitstream 216). The MCU may then be configured to perform some mixing of the two streams (for example based on audio activity or by any suitable means) and then encode a downmix bitstream 218 for the RX user 281 in room C 105.
The room C 105 may comprise apparatus comprising a (IVAS) decoder 231 which is configured to receive the downmix bitstream 218, decode the bitstream 218 and render a suitable spatial audio signal output to the RX user 281.
One problem with the approach in Figure 2 may be that the decoding-encoding operation, or (self-)tandeming, generally deteriorates the perceptual quality unless substantially high bit rates can be used.
Figure 3 shows a further example of the system of Figure 1a where the MCU is implementing an encoding for the downstream of one RX user 281 (within room C 105). In this example the MCU operation is simplified.
As shown in Figure 3, the talkers for room A 101 could be talker 1 111 which using a first lavalier microphone generates audio object 200, talker 2 113 which using a second lavalier microphone generates audio object 202, and talker 3 115 which using a third lavalier microphone generates audio object 204. The teleconference apparatus can comprise an (IVAS) encoder 201 which is configured to receive the audio objects and encode the objects based on a suitable encoding to generate a bitstream 206. The bitstream 206 is passed to the MCU 221.
Similar to Figure 2, the talkers for room B 103 could be talker 4 121 which using a fourth lavalier microphone generates audio object 210. The teleconference apparatus can comprise a further (IVAS) encoder 211 which is configured to receive the audio objects from room B 103 and encode the objects based on a suitable encoding to generate a bitstream 216. The bitstream 216 is passed to the MCU 221.
The MCU (conferencing server) 221 may comprise a (IVAS) decoder 323 configured to decode the two (IVAS) bitstreams from rooms A (bitstream 206) and B (bitstream 216).
Furthermore the MCU codec in some embodiments supports the concept as described hereafter of track groups. In such embodiments the MCU 221 comprises a (IVAS) codec track group forwarder 325. The codec track group forwarder 325 can thus in some embodiments significantly reduce the complexity of the operation of the MCU 221, by not having to implement a codec (IVAS) encoding but instead selecting/forwarding the multiple inputs as individual track groups to room C 105. The codec may be configured to perform some re-packetization but not decoding-encoding. For example, the codec may be configured to perform processing of the streams where the bitstreams 206 and 216 are de-packetized and then re-packed (without decoding-encoding) for transmission in common RTP packets.
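The re-packetization idea above can be sketched as follows. This is a minimal illustration only: the helper names (`depacketize`, `repack`) and the toy packet layout are assumptions, not part of the IVAS or RTP specifications.

```python
# Sketch of the track-group forwarder idea: incoming bitstreams are
# de-packetized and re-packed into one common outgoing packet per frame,
# with no decode/re-encode step (no tandeming). All names are hypothetical.

def depacketize(rtp_packet):
    """Strip a toy header and return (track_group_id, payload)."""
    track_group_id, payload = rtp_packet
    return track_group_id, payload

def repack(rtp_packets):
    """Combine payloads from several encoders into one common packet,
    keeping each payload intact as its own track group."""
    return [depacketize(p) for p in rtp_packets]

# One frame from room A (bitstream 206) and one from room B (bitstream 216):
common_packet = repack([(206, b"roomA-frame"), (216, b"roomB-frame")])
```

The point of the design is that each payload survives byte-for-byte, so the perceptual quality loss of a decode-encode cycle at the MCU is avoided.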
In some embodiments the MCU may optionally be configured to perform some mixing of the two streams (for example based on audio activity or by any suitable means) and then encode a downmix bitstream 218 for the RX user 281 in room C 105. The decoder-encoder in this case may be employed if there is no other voice activity indication (e.g., metadata) available. Thus in these embodiments the MCU could decode the bitstreams in order to see whether there is voice activity. In addition, the MCU in some embodiments is configured to derive other metadata (such as spatial position, energy ratio etc.).
Furthermore in some embodiments the MCU is configured to comprise an encoder-decoder in order to support many different codecs and may thus perform the tandeming operation. For example, it could receive an AMR audio data stream from a legacy user and encode it to a compatible (for example IVAS) bitstream.
Thus in some embodiments the MCU 221 may comprise a (IVAS) decoder 323 configured to derive information such as the metadata parameters (e.g., position and/or rotation of each subscene; control metadata for user manipulation, etc.) for each track group. In some embodiments there may be provided one or more bitstreams 318 to the RX user 131 within room C 105. For example, the multiple inputs may be re-packetized together, or separate bitstreams (e.g., RTP) may be sent to the RX user 131.
The room C 105 may comprise apparatus comprising a (IVAS) decoder 231 which is configured to receive the downmix bitstream 318, decode the bitstream 318 and render a suitable spatial audio signal output to the RX user 281. In some embodiments the encoder 201 for room A 101 is configured to send the audio objects 200, 202, 204 as a single track group, in other words not individually.
In some embodiments by sending a track group, all the individual audio inputs (part of the track group) are encoded with a single instance of the (IVAS) encoder such that the (IVAS) encoder can downmix the audio inputs by any suitable means. For example, in some embodiments the IVAS decoder may not have to separate the audio inputs that were originally present in the track group. Rather, only the overall percept as present in the original track group is preserved in some embodiments. In other words, no pass-through operation is present/required.
In the following embodiments the RX user 281 (room C 105) is thus able to rotate the audio object associated with talker 4 121 in room B 103 independently of all other audio objects. Furthermore in some embodiments the RX user 281 (room C 105) is able to rotate the audio objects associated with the talker 1 111 object 200, talker 2 113 object 202, and talker 3 115 object 204 relative to other objects.
For example in some embodiments the group (track group) as a whole can be rotated while maintaining the position of the object associated with talker 4 static. Furthermore in some embodiments the RX user 281 (room C 105) is able to control the volume of an audio object associated with a talker relative to other audio objects.
For example the RX user 281 (room C 105) can be configured to control the volume of the audio objects associated with talkers 1, 2, and 3 individually. This overcomes the problem of being able to only control the group (track group) overall volume, for example controlling the audio volume relative to the audio object of talker 4 volume.
In other words the concepts as addressed in the embodiments herein are those associated with how to allow for full user control / presentation manipulation if track groups are used in (IVAS) encoder operation.
The invention relates to efficient encoding of IVAS spatial audio utilizing track groups while maintaining full user control of the spatial audio scene presentation. A flexible switching is performed between at least two track-group instances based on input parameter update.
In some embodiments, where there is track-group-based control, the (IVAS) encoder operation when using track groups is configured as follows:
instead of one track-group instance, at least two instances are initialized;
at least one track-group instance is set active and at least one instance is set inactive according to a parameter (which can be defined as the gainTrackGroup or activeTrackGroup metadata parameter);
upon a listener manipulation request for a track group member, a change of parameter value for the gainTrackGroup or activeTrackGroup metadata can be signaled for at least two track-group instances;
after the parameter value update, an apparatus (user) is configured to manipulate the track group member.
In some embodiments, where there is track group member-based control, the (IVAS) encoder operation when using track groups is configured as follows:
instead of one track-group instance, at least two instances are initialized;
at least one member of at least one track-group instance is set active and at least one member of at least one other track-group instance is set inactive according to a new gainTrackGroup or activeTrackGroup metadata;
upon a listener manipulation request for a track group member, a change of parameter value for the gainTrackGroup or activeTrackGroup metadata is signaled for at least one member of at least two track-group instances;
after the parameter value update, an apparatus (user) is configured to manipulate the track group member.
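Both control variants above share the same switching core: at least two instances exist, and a parameter update swaps which one is active. The following sketch uses the gainTrackGroup and activeTrackGroup names from the metadata above, but the class and helper names are otherwise hypothetical.

```python
# Sketch of instance switching: at least two track-group instances are
# initialized, one active and one inactive; a signaled parameter update
# changes which instance is encoded/transmitted. Names are illustrative.

class TrackGroupInstance:
    def __init__(self, name, members, active=False):
        self.name = name
        self.members = members  # e.g. audio objects belonging to this group
        self.activeTrackGroup = 1 if active else 0
        self.gainTrackGroup = 1.0 if active else 0.0

def switch_active(instances, make_active):
    """Signal a parameter value update for at least two instances."""
    for inst in instances:
        active = inst.name in make_active
        inst.activeTrackGroup = 1 if active else 0
        inst.gainTrackGroup = 1.0 if active else 0.0

# Two instances: the joint group is active by default.
instances = [TrackGroupInstance("group0", ["t1", "t2", "t3"], active=True),
             TrackGroupInstance("group1", ["t1"])]

# Listener requests manipulation of talker 1: swap the active instance.
switch_active(instances, make_active={"group1"})
```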
In such embodiments an (IVAS) codec implementation may be able to use track groups for higher efficiency and lower latency, e.g., in server-based conferencing use cases while maintaining user control of the spatial audio scene. This is achieved by the signalling and adaptation mechanism that is invisible to the user. User experience and system efficiency are therefore improved in some embodiments.
The embodiments as discussed above may be implemented within an immersive voice and audio encoding, decoding/rendering, and presentation control system. In particular, the embodiments relate to adjusting of encoder inputs and processing based on listener preference and signalling of this information. In particular these embodiments are suitable for implementation within the 3GPP IVAS standard, although the concept can be implemented in embodiments within any other suitable codec with similar functionalities.
In some embodiments the implementation of the Track group parameter allows an encoder to combine and jointly encode audio inputs in a way where the available bit rate is efficiently allocated and utilized. Track groups can thus be used to indicate to the encoder that a set of tracks/audio inputs are to be subjected to the same transformation(s) at the rendering. Track groups can be set, e.g., according to session and defined as:
numTrackGroups (data type: unsigned integer; default: 0): numTrackGroups = number of track groups present - 1. The default value of 0 indicates 1 group is present.
For each track group there is provided, for example, encoding information including bit rate, signal bandwidth (NB, WB, SWB, FB), and DTX flag (on/off). In addition, in some embodiments descriptive information can be specified, such as an indication of whether headphone spatialization (binauralization) has been applied to the (stereo) input or not. Furthermore, the track group metadata set can define the audio format (e.g., channel-based audio, etc.). For different audio formats the track group information is provided as additional metadata according to the format requirements.
In some embodiments, a gain or activity metadata field for the track group is also defined or provided. For example, these may be defined as at least one of:
gainTrackGroup (data type: unsigned fractional; default: 1.0): provides a gain adjustment for a track group. Value range [0.0, 1.0].
activeTrackGroup:
1 (default): indicates that the track group is active, i.e., it is encoded and transmitted.
0: indicates that the track group is not active, i.e., it is not encoded and transmitted. (An indication that the track group is provided at the encoder and that it may become active is/can be transmitted.)
In some embodiments, the gain and/or activity metadata can be provided on a track-by-track basis for a track group.
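Assuming the metadata is held in a simple per-session structure, the defaults above can be illustrated as follows; the dict layout and function name are illustrative, not taken from any specification.

```python
# Illustrative sketch of the track-group metadata fields, using the stated
# defaults: numTrackGroups = 0 means one group is present, gainTrackGroup
# defaults to 1.0 and activeTrackGroup to 1 (encoded and transmitted).

def default_track_group_metadata(num_groups=1):
    return {
        "numTrackGroups": num_groups - 1,      # number of groups present - 1
        "gainTrackGroup": [1.0] * num_groups,  # gain per group, range [0.0, 1.0]
        "activeTrackGroup": [1] * num_groups,  # 1 = encoded and transmitted
    }

meta = default_track_group_metadata()
```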
In some embodiments this information is controlled by the encoder via any suitable listener-side mechanism. For example the control may be signaled from decoder/renderer to encoder. The signaling may be any suitable signaling, for example using Real-time transport protocol (RTP) or Session description protocol (SDP) messages.
For example, a (IVAS) device user interface (UI) may indicate to the user what aspects the user is currently able to control and what aspects the user is able to request control of. If the user indicates that they wish to control an audio source (for example a track group member) that is currently not under listener control, the user request is signaled to the encoder. The encoder can then in some embodiments be configured to modify at least one parameter relating to the request, for example the parameter related to the requested track group: gainTrackGroup or activeTrackGroup.
In some embodiments all control options (those currently available and those available per request) can be provided to the listener and without transparency of their current status. In such embodiments the signaling is performed as needed according to any user control. It is understood that in such embodiments the response time to a user control (for example scene manipulation) may be shorter.
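A minimal sketch of the listener-side request path described above: the UI checks whether a source is already under listener control and, if not, signals a control request toward the encoder. The message shape is a stand-in for the actual RTP/SDP signaling, and all names are hypothetical.

```python
# Sketch of the listener-side control request. If the source is already
# controllable, no signaling is needed; otherwise a request is built to be
# sent from the decoder/renderer side toward the encoder.

def build_control_request(source_id, controllable_now):
    if source_id in controllable_now:
        return None  # already under listener control: no signaling needed
    return {"type": "control-request", "source": source_id}

# Talker 4 is already controllable; talker 1 is locked inside a track group.
msg = build_control_request("talker1", controllable_now={"talker4"})
```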
The following describes some embodiments using the parameters gainTrackGroup or activeTrackGroup control for track groups at the encoder input.
It is first provided a use case example.
For example Figure 4 presents an example user interface (UI) 400 on a suitable electronic (IVAS) device 499. The electronic device 499 shown is a user equipment that receives (IVAS) RTP packets and is able to decode and render the depacketized (IVAS) encoded signals.
The UI shows a constellation view of the audio scene that is presented to the listener. This as shown in Figure 4 shows representations of all talker positions relative to the representation of the listener or user of the electronic device. The example shown in Figure 4 shows the UI representation of the listener 499 of the electronic device relatively central in the UI. Surrounding the representation of the listener 499 are shown the representations of the talkers: the representation of talker 1 401 who is to the left and rear of the listener, the representation of talker 2 403 who is to the left of the listener, the representation of talker 3 405 who is to the front of the listener, and the representation of talker 4 407 who is to the right and rear of the listener.
Although in this example all of the talkers are shown with separate representations, in some embodiments all of the talkers (or audio objects) associated with a room or track group are shown as a single representation, for example a representation of a room, as they are not individually manipulatable due to the use of a track group. In some further embodiments the UI may show the talkers in a single track group as individual representations but show that they are linked or associated, for example by being indicated by a common colour/shading or image representation.
In some embodiments the UI can be configured to provide a suitable indication to assist the listener operating the device 499 to understand what they can/cannot currently control. In this example, this is shown on the bottom half of Figure 4 by the left representation showing the relative positions of talker 1 411, talker 2 413 and talker 3 415 in grey, showing that they are linked and cannot be controlled individually, and talker 4 417 in white.
In this example the listener operating the device 499 wishes to control the playback position of talker 1 but not the positions of talkers 2 or 3. As discussed above it is shown to the listener that this control is currently not available (as talkers 1, 2, 3 are all part of the same track group).
In the bottom left part of Figure 4 is shown where the listener operating the device is configured to provide a user input to request the control of the audio representation of talker 1. This request is shown by input 412.
Having requested this the representation of talker 1 changes colour indicating that the control of the audio representation of talker 1 has become available. This is shown in the bottom centre part of Figure 4 by the white colour representation 421.
The user may then be configured to apply a user input 422 to perform the manipulation they desire. This is shown in the bottom right part of Figure 4 where the position of the audio representation of talker 1 is moved 439 to the right of the listener to a new position as represented on the UI by representation 431.
After the manipulation, in some embodiments, the system can be arranged such that it is able to configure the audio object associated with talker 1 to remain manipulatable. This can in some embodiments be time-dependent or limited. For example the talker 1 can be manipulatable for some pre-determined duration after which it again becomes non-manipulatable relative to talkers 2 and 3. In some embodiments talker 1 may remain manipulatable until the manipulated parameter has been reset.
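The time-limited manipulability above can be sketched as a simple timestamp check; the window length and names are illustrative assumptions, not values from the embodiments.

```python
# Sketch of time-dependent manipulability: after a manipulation, the audio
# object remains manipulatable for a pre-determined duration, after which it
# again becomes non-manipulatable relative to the rest of its track group.

MANIPULATION_WINDOW = 10.0  # seconds; hypothetical pre-determined duration

def is_manipulatable(last_manipulation_time, now):
    """True while the object is still within its manipulation window."""
    return (now - last_manipulation_time) <= MANIPULATION_WINDOW
```

An alternative, also mentioned above, is to keep the object manipulatable until the manipulated parameter has been explicitly reset, in which case a flag rather than a timer would be tracked.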
With respect to Figure 5 it is shown how to change the track grouping to allow for the RX user to reposition talker 1 in the scene. Furthermore Figure 6 shows a flow diagram detailing the operations of the system.
In some embodiments the (IVAS) encoder(s) are initialized with as many track groups as desired to achieve the user manipulation requirements for the current service, use case, or session. This is shown for example in Figure 6 by step 601.
For example, with respect to room A, a first configuration would be for the encoder to determine one track group that includes all audio objects (talkers), and one individual track group for each object (talker):
Track group 0: talkers/objects 1, 2, 3
Track group 1: talker/object 1
Track group 2: talker/object 2
Track group 3: talker/object 3
Alternatively in some embodiments the encoder could generate an initialization of the following track groups for room A:
Track group 0a: talkers/objects 1, 2, 3
Track group 1a: talker/object 1
Track group 1b: talkers/objects 2, 3
Track group 2a: talker/object 2
Track group 2b: talkers/objects 1, 3
Track group 3a: talker/object 3
Track group 3b: talkers/objects 1, 2
Figure 5 for example shows a system comprising these groupings. The example shown in Figure 5 is similar to that shown in Figure 3 but showing the grouping of track groups 0, 1, 2, and 3. For example as shown in Figure 5 there are shown the track groups as the group 0 500 comprising the object 1 200, object 2 202 and object 3 204, the group 1 comprising the object 1 200, the group 2 comprising the object 2 202 and the group 3 comprising the object 3 204.
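The two initialization alternatives for room A can be written out as plain mappings from track-group identifier to member objects. This is an illustration only; the identifiers follow the listing above, and the dict layout is an assumption.

```python
# The first configuration: one joint group plus one group per talker/object.
first_configuration = {
    "0": [1, 2, 3],
    "1": [1],
    "2": [2],
    "3": [3],
}

# The alternative: each per-talker group "Na" is paired with a complement
# group "Nb" containing the remaining talkers, so activating the pair
# (Na, Nb) replaces the joint group "0a" without losing any talker.
alternative_configuration = {
    "0a": [1, 2, 3],
    "1a": [1], "1b": [2, 3],
    "2a": [2], "2b": [1, 3],
    "3a": [3], "3b": [1, 2],
}
```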
It would be appreciated that the intention would be not to encode and transmit all of these options simultaneously but to enable control at the encoder (and in some embodiments based on inputs from the decoder). Thus it is not wasteful in terms of encoding and transmission, and a specific track group or set of track groups may be configured to replace another one(s) when a manipulation request is received from the decoder/renderer (based on a listener user input).
For example, by default the encoder is configured to carry out the transmission of track groups 0/1/2/3 as follows:
gainTrackGroup[0] = 1.0;
gainTrackGroup[1] = 0.0;
gainTrackGroup[2] = 0.0;
gainTrackGroup[3] = 0.0;
Or:
activeTrackGroup[0] = 1;
activeTrackGroup[1] = 0;
activeTrackGroup[2] = 0;
activeTrackGroup[3] = 0;
The configuration of parameters to 'select' a track group configuration to be sent, for example by setting the activeTrackGroup and/or gainTrackGroup parameters, is shown in Figure 6 by step 603.
The encoder is configured to receive information indicating a request to manipulate an object. For example a request to change the position of talker 1. The operation of receiving the request is shown in Figure 6 by step 605.
Furthermore the encoder is then configured to update the parameters based on the obtained/received requests. For example as shown in Figure 4, in response to the request to change the position of talker 1 the encoder is configured to modify the activeTrackGroup and/or gainTrackGroup parameters:
gainTrackGroup[0] = 0.0;
gainTrackGroup[1] = 1.0;
gainTrackGroup[2] = 1.0;
gainTrackGroup[3] = 1.0;
Or:
activeTrackGroup[0] = 0;
activeTrackGroup[1] = 1;
activeTrackGroup[2] = 1;
activeTrackGroup[3] = 1;
In such a manner, the audio objects associated with all three talkers are made active (their voices will be heard) and at least the audio object associated with talker 1 is able to be manipulated (moved to a new position). The updating of the parameters is shown in Figure 6 by step 607.
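Steps 603 and 607 of Figure 6 can be sketched together as a switch over the activeTrackGroup values given above; the function names are illustrative, not from the specification.

```python
# Sketch of the default selection (step 603) and the update on a
# manipulation request (step 607): by default only the joint track group 0
# is transmitted; on a request to manipulate talker 1, group 0 is
# deactivated and the per-object groups 1-3 are activated instead.

def default_state():
    return {"activeTrackGroup": [1, 0, 0, 0]}

def apply_manipulation_request(state):
    state["activeTrackGroup"] = [0, 1, 1, 1]
    return state

state = apply_manipulation_request(default_state())
```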
The encoder, which in Figure 5 is shown as encoder 201, may further be configured to receive or otherwise obtain the manipulation of the audio object (for example the modification of the position of the audio object). In such embodiments the spatial metadata for the audio object (for example the audio object associated with talker 1 which has the position modified) may be updated allowing track group 0 to take this into account. Thus, for example, after the listener no longer wishes to manipulate the position of the audio object, the original track group settings of at least one of gainTrackGroup or activeTrackGroup can be reset.
The resetting of the track group settings is shown in Figure 6 by step 609. In some embodiments the parameter gainTrackGroup can be configured to provide further control over the parameter activeTrackGroup. In some embodiments these parameters could also be provided on a track-by-track basis. This track-by-track control can in some embodiments be achieved by implementing the following changes:
gainTrackGroup[0,:] = 1.0; --> gainTrackGroup[0,0] = 0.0;
gainTrackGroup[1,:] = 0.0; --> gainTrackGroup[1,:] = 1.0;
Or:
activeTrackGroup[0,:] = 1; --> activeTrackGroup[0,0] = 0;
activeTrackGroup[1,:] = 0; --> activeTrackGroup[1,:] = 1;
It is here denoted by [0,:] that all values corresponding to track group 0 are accessed, while [0,0] means that only the first value corresponding to track group 0 (i.e., the value corresponding to talker 1) is accessed.
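The track-by-track change above can be illustrated with nested lists standing in for the codec's [group, member] metadata store; the helper name is hypothetical.

```python
# Sketch of track-by-track control: activeTrackGroup is indexed per
# [group, member]. [0, :] addresses all members of group 0 and [0, 0] only
# the member corresponding to talker 1, so only that member is handed over
# to the single-member group 1.

active = [[1, 1, 1],   # track group 0: talkers 1, 2, 3 (all active)
          [0]]         # track group 1: talker 1 alone (inactive)

def release_member(active, group, member, alt_group):
    """Deactivate one member of a group and activate its alternative group."""
    active[group][member] = 0                         # activeTrackGroup[0,0] = 0
    active[alt_group] = [1] * len(active[alt_group])  # activeTrackGroup[1,:] = 1
    return active

active = release_member(active, group=0, member=0, alt_group=1)
```

As the text notes, this keeps track group 0 in transmission for talkers 2 and 3, so no additional track groups need to start being sent.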
The track-by-track control operations simplify the transmission since track group 0 is not stopped from being sent and there is no need to begin sending track groups 2 and 3 (or any alternative track groups corresponding to talkers 2 and 3 and not talker 1).
In such a manner any user manipulation of audio objects can be efficiently handled by implementing a flexible switching between at least two track-group instances based on input metadata so that track groups can be utilized for efficient encoding and transmission of (IVAS) audio inputs.
The embodiments as discussed herein can furthermore be used in one-to-one communications, server-based conferencing, and peer-to-peer conferencing.
The applicability of the proposed methods thus covers all communications scenarios relevant for any suitable communications codec such as the IVAS codec. With respect to Figure 7 an example electronic device is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1700 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. In some embodiments the device 1700 comprises at least one processor or central processing unit 1707. The processor 1707 can be configured to execute various program codes such as the methods such as described herein.
In some embodiments the device 1700 comprises a memory 1711. In some embodiments the at least one processor 1707 is coupled to the memory 1711. The memory 1711 can be any suitable storage means. In some embodiments the memory 1711 comprises a program code section for storing program codes implementable upon the processor 1707. Furthermore in some embodiments the memory 1711 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1707 whenever needed via the memory-processor coupling.
In some embodiments the device 1700 comprises a user interface 1705. The user interface 1705 can be coupled in some embodiments to the processor 1707. In some embodiments the processor 1707 can control the operation of the user interface 1705 and receive inputs from the user interface 1705. In some embodiments the user interface 1705 can enable a user to input commands to the device 1700, for example via a keypad. In some embodiments the user interface 1705 can enable the user to obtain information from the device 1700. For example the user interface 1705 may comprise a display configured to display information from the device 1700 to the user. The user interface 1705 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1700 and further displaying information to the user of the device 1700.
In some embodiments the device 1700 comprises an input/output port 1709.
The input/output port 1709 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1707 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
The transceiver input/output port 1709 may be configured to receive the signals and in some embodiments obtain the focus parameters as described herein.
In some embodiments the device 1700 may be employed to generate a suitable audio signal using the processor 1707 executing suitable code. The input/output port 1709 may be coupled to any suitable audio output, for example to a multichannel speaker system and/or headphones (which may be head-tracked or non-tracked headphones) or similar.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

Claims

CLAIMS:

1. An apparatus comprising means configured to: initialize two or more instances of groups within at least one audio signal data stream, wherein each of the instances of groups comprises at least one member and wherein at least one of the instances of groups is set active and at least one of the instances of groups is set inactive, such that members of the active group instances can be processed; and encode the at least one of the instances of groups set active for storage and/or transmission.

2. The apparatus as claimed in claim 1, wherein the means is further configured to control the transmission of the encoded at least one of the instances of groups set active to at least one further apparatus.

3. The apparatus as claimed in claim 1, wherein the means is further configured to control the storage of the encoded at least one of the instances of groups set active to at least one further apparatus.

4. The apparatus as claimed in any of claims 1 to 3, wherein the means is further configured to: receive, from at least one further apparatus, a request to modify which of the at least one of the instances of groups is set active and which at least one of the instances of groups is set inactive; and modify which of the at least one of the instances of groups is set active and which at least one of the instances of groups is set inactive based on the request.

5. The apparatus as claimed in any of claims 1 to 4, wherein the means is configured to: set a parameter associated with each of the instances of groups, wherein a value of the parameter associated with each of the instances of groups defines which of the at least one of the instances of groups is active and which of the at least one of the instances of groups is inactive.

6. The apparatus as claimed in claim 5, wherein the parameter comprises at least one of: a gain parameter; and a flag parameter, wherein a value of 1 defines that the instance of groups is active and a value of 0 defines that the instance of groups is inactive.

7. The apparatus as claimed in claim 6, wherein the gain parameter is configured such that a value of 1 defines that the instance of groups is active, a value of 0 defines that the instance of groups is inactive, and a value greater than 0 and less than 1 defines that the instance of groups is active but diminished.

8. The apparatus as claimed in any of claims 1 to 7, wherein the members comprise at least one of: an audio track; and an audio object.

9. An apparatus comprising means configured to: obtain at least one audio signal data stream; obtain from the at least one audio signal data stream at least one instance of groups set active, wherein the at least one instance of groups set active is at least one of two or more instances of groups and each of the instances of groups comprises at least one member; and process and render a spatial audio signal from the at least one instance of groups set active.

10. The apparatus as claimed in claim 9, wherein the means configured to obtain at least one audio signal data stream is further configured to perform one of: receive the at least one audio signal data stream from at least one further apparatus; and retrieve the at least one audio signal data stream from at least one memory.

11. The apparatus as claimed in any of claims 9 or 10, wherein the means is further configured to: transmit, to the at least one further apparatus, a request to modify which of the at least one of the instances of groups is set active, wherein the at least one further apparatus is configured to modify which of the at least one of the instances of groups is set active and furthermore set at least one of the instances of groups inactive based on the request.

12. The apparatus as claimed in any of claims 9 to 11, wherein the means is configured to: obtain a parameter associated with each of the instances of groups, wherein a value of the parameter associated with each of the instances of groups defines which of the at least one of the instances of groups is active.

13. The apparatus as claimed in claim 12, wherein the parameter comprises at least one of: a gain parameter; and a flag parameter, wherein a value of 1 defines that the instance of groups is active and a value of 0 defines that the instance of groups is inactive.

14. The apparatus as claimed in claim 13, wherein the gain parameter is configured such that a value of 1 defines that the instance of groups is active, a value of 0 defines that the instance of groups is inactive, and a value greater than 0 and less than 1 defines that the instance of groups is active but diminished.

15. The apparatus as claimed in any of claims 9 to 13, wherein the members comprise at least one of: an audio track; and an audio object.

16. A method for an apparatus comprising: initializing two or more instances of groups within at least one audio signal data stream, wherein each of the instances of groups comprises at least one member and wherein at least one of the instances of groups is set active and at least one of the instances of groups is set inactive, such that members of the active group instances can be processed; and encoding the at least one of the instances of groups set active for storage and/or transmission.

17. A method for an apparatus comprising: obtaining at least one audio signal data stream; obtaining from the at least one audio signal data stream at least one instance of groups set active, wherein the at least one instance of groups set active is at least one of two or more instances of groups and each of the instances of groups comprises at least one member; and processing and rendering a spatial audio signal from the at least one instance of groups set active.
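The claims above describe group instances carrying an activity parameter (per claims 6–7, a gain where 1 means active, 0 means inactive, and an intermediate value means active but diminished), with only active instances encoded and a further apparatus able to request a switch. The following is a minimal illustrative sketch of that mechanism, not an implementation from the patent; the class and method names (`GroupInstance`, `AudioStream`, `encode_active`, `handle_switch_request`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GroupInstance:
    """One instance of groups. The gain doubles as the activity parameter:
    1.0 = active, 0.0 = inactive, 0 < gain < 1 = active but diminished
    (per claims 6-7). Names here are illustrative, not from the patent."""
    name: str
    members: list  # e.g. audio tracks or audio objects (claim 8)
    gain: float = 0.0

    @property
    def active(self) -> bool:
        return self.gain > 0.0


class AudioStream:
    """Holds two or more group instances within one audio signal data stream,
    at least one set active and at least one set inactive (claim 1)."""

    def __init__(self, instances):
        if len(instances) < 2:
            raise ValueError("two or more instances of groups are required")
        if not any(i.active for i in instances) or all(i.active for i in instances):
            raise ValueError("at least one instance must be active and one inactive")
        self.instances = {i.name: i for i in instances}

    def encode_active(self):
        """Stand-in for encoding: collect only members of active instances
        for storage and/or transmission."""
        return [(i.name, i.gain, i.members)
                for i in self.instances.values() if i.active]

    def handle_switch_request(self, activate, deactivate):
        """Modify which instances are active on request from a further
        apparatus (claims 4 and 11), by rewriting the gain parameters."""
        self.instances[activate].gain = 1.0
        self.instances[deactivate].gain = 0.0
```

A usage sketch: initialize two instances, encode (only the active one is emitted), then switch on request.

```python
stream = AudioStream([
    GroupInstance("dialog_en", ["track_en"], gain=1.0),
    GroupInstance("dialog_fi", ["track_fi"], gain=0.0),
])
print([name for name, _, _ in stream.encode_active()])  # only the active instance
stream.handle_switch_request(activate="dialog_fi", deactivate="dialog_en")
```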
GB2004184.4A 2020-03-23 2020-03-23 Switching between audio instances Withdrawn GB2593672A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB2004184.4A GB2593672A (en) 2020-03-23 2020-03-23 Switching between audio instances
EP21774239.4A EP4128821A4 (en) 2020-03-23 2021-02-24 Switching between audio instances
PCT/FI2021/050135 WO2021191493A1 (en) 2020-03-23 2021-02-24 Switching between audio instances

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2004184.4A GB2593672A (en) 2020-03-23 2020-03-23 Switching between audio instances

Publications (2)

Publication Number Publication Date
GB202004184D0 GB202004184D0 (en) 2020-05-06
GB2593672A true GB2593672A (en) 2021-10-06

Family

ID=70546684

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2004184.4A Withdrawn GB2593672A (en) 2020-03-23 2020-03-23 Switching between audio instances

Country Status (3)

Country Link
EP (1) EP4128821A4 (en)
GB (1) GB2593672A (en)
WO (1) WO2021191493A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007011157A1 (en) * 2005-07-19 2007-01-25 Electronics And Telecommunications Research Institute Virtual source location information based channel level difference quantization and dequantization method
US20160232901A1 (en) * 2013-10-22 2016-08-11 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Method for decoding and encoding a downmix matrix, method for presenting audio content, encoder and decoder for a downmix matrix, audio encoder and audio decoder
WO2020037280A1 (en) * 2018-08-17 2020-02-20 Dts, Inc. Spatial audio signal decoder

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008011902A1 (en) * 2006-07-28 2008-01-31 Siemens Aktiengesellschaft Method for carrying out an audio conference, audio conference device, and method for switching between encoders
US10999693B2 (en) * 2018-06-25 2021-05-04 Qualcomm Incorporated Rendering different portions of audio data using different renderers

Also Published As

Publication number Publication date
GB202004184D0 (en) 2020-05-06
WO2021191493A1 (en) 2021-09-30
EP4128821A4 (en) 2024-03-20
EP4128821A1 (en) 2023-02-08

Similar Documents

Publication Publication Date Title
WO2019229299A1 (en) Spatial audio parameter merging
US20230370803A1 (en) Spatial Audio Augmentation
US20220254355A1 (en) MASA with Embedded Near-Far Stereo for Mobile Devices
US20230232182A1 (en) Spatial Audio Capture, Transmission and Reproduction
CN113678198A (en) Audio codec extension
US20230085918A1 (en) Audio Representation and Associated Rendering
US11483669B2 (en) Spatial audio parameters
US11729574B2 (en) Spatial audio augmentation and reproduction
GB2593672A (en) Switching between audio instances
US20240071394A1 (en) Enhanced Orientation Signalling for Immersive Communications
US11627429B2 (en) Providing spatial audio signals
US20230188924A1 (en) Spatial Audio Object Positional Distribution within Spatial Audio Communication Systems
US20230123809A1 (en) Method and Apparatus for Efficient Delivery of Edge Based Rendering of 6DOF MPEG-I Immersive Audio
GB2587371A (en) Presentation of premixed content in 6 degree of freedom scenes

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)