GB2580899A - Audio representation and associated rendering - Google Patents
Audio representation and associated rendering
- Publication number
- GB2580899A (application GB1900871.3A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- audio signal
- mono
- signal
- metadata
- evs
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
- G10L19/167—Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/0017—Lossless audio signal coding; Perfect reconstruction of coded audio signal by transmission of coding error
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/03—Aspects of down-mixing multi-channel audio to configurations with lower numbers of playback channels, e.g. 7.1 -> 5.1
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used in stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/03—Application of parametric coding in stereophonic audio systems
Abstract
An input format is obtained for generating an encoded mono audio signal and/or a multichannel signal. The input format comprises a mono signal 205 and an associated metadata signal 227 that is configured to enable generation of the encoded multichannel signal 207 from the mono signal. The encoded mono audio signal may be bit-exact 223 (i.e. lossless). The multichannel signal may be a stereo signal. The encoded mono signal may be an enhanced voice system (EVS) encoded mono signal. The encoded multichannel audio signal may be an EVS encoded multichannel signal, or an Immersive Voice and Audio Services (IVAS) multichannel signal 225. The input format may enable backwards-compatible voice conferencing between devices of varying capabilities (Figs. 6-9) by stripping spatial metadata from a bitstream prior to transmitting to a legacy device, thus reducing the number of re-encodings or differing bitstreams.
Description
AUDIO REPRESENTATION AND ASSOCIATED RENDERING
Field
The present application relates to apparatus and methods for sound-field related audio representation and associated rendering, but not exclusively for audio representation for an audio encoder and decoder.
Background
Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the immersive voice and audio services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network. Such immersive services include uses for example in immersive voice and audio for applications such as virtual reality (VR), augmented reality (AR) and mixed reality (MR). This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
Furthermore, parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound is described using a set of parameters. For example, in parametric spatial audio capture from microphone arrays, it is a typical and effective choice to estimate from the microphone array signals a set of parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to describe well the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized in synthesis of the spatial sound accordingly: binaurally for headphones, for loudspeakers, or for other formats, such as Ambisonics.
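As a rough illustration only (not the codec's analysis algorithm), the following sketch estimates a direction-like cue and a direct-to-total energy ratio for one frequency band of a two-channel capture, using inter-channel level difference and coherence; the panning-law mapping and the band handling are assumptions.

```python
# Toy per-band parametric analysis: level difference -> azimuth-like cue,
# inter-channel coherence -> direct-to-total energy ratio. Illustrative only.
import numpy as np

def analyse_band(left_band: np.ndarray, right_band: np.ndarray):
    """left_band/right_band: complex STFT bins of one band for one frame."""
    e_l = np.sum(np.abs(left_band) ** 2)
    e_r = np.sum(np.abs(right_band) ** 2)
    cross = np.abs(np.sum(left_band * np.conj(right_band)))
    # Coherence between the channels approximates the directional (non-diffuse) part.
    ratio = cross / max(np.sqrt(e_l * e_r), 1e-12)          # direct-to-total in [0, 1]
    # Level difference mapped to an azimuth in [-90, +90] degrees (positive towards right).
    azimuth = np.degrees(np.arctan2(np.sqrt(e_r) - np.sqrt(e_l),
                                    np.sqrt(e_r) + np.sqrt(e_l))) * 2.0
    return float(azimuth), float(ratio)
```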
Summary
There is provided according to a first aspect an apparatus comprising means for: obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.
The input format may further comprise a definition configured to control an encoder.
The means for may be further for: encoding a mono audio signal based on the mono audio signal; encoding a multichannel audio signal based on the mono audio signal and the metadata signal associated with the mono signal.
The encoding a mono audio signal based on the mono audio signal may be based on the definition configured to control an encoder.
The input format may further comprise a multichannel audio signal, wherein the means for may be further for: encoding a mono audio signal based on the mono audio signal; and encoding a multichannel audio signal based on the multichannel audio signal.
The multichannel audio signal may be a stereo audio signal.
The encoded mono audio signal may be an enhanced voice system encoded mono audio signal.
The encoded multichannel audio signal may be one of: an enhanced voice system encoded multichannel audio signal; and an Immersive Voice and Audio Services multichannel audio signal.
The metadata signal may comprise: two directional parameters for each time-frequency tile, wherein the direction parameters are limited to a single plane of elevation; and direct-to-total energy ratios associated with the two directional parameters, wherein the sum of the direct-to-total energy ratios for the two directions is 1.
According to a second aspect there is provided a method comprising obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.
The input format may further comprise a definition configured to control an encoding.
The method may further comprise: encoding a mono audio signal based on the mono audio signal; encoding a multichannel audio signal based on the mono audio signal and the metadata signal associated with the mono signal.
Encoding a mono audio signal based on the mono audio signal may further comprise encoding based on the definition.
The input format may further comprise a multichannel audio signal, wherein the method may further comprise: encoding a mono audio signal based on the mono audio signal; and encoding a multichannel audio signal based on the multichannel audio signal.
The multichannel audio signal may be a stereo audio signal.
The encoded mono audio signal may be an enhanced voice system encoded mono audio signal.
The encoded multichannel audio signal may be one of: an enhanced voice system encoded multichannel audio signal; and an Immersive Voice and Audio Services multichannel audio signal.
The metadata signal may comprise: two directional parameters for each time-frequency tile, wherein the direction parameters are limited to a single plane of elevation; and direct-to-total energy ratios associated with the two directional parameters, wherein the sum of the direct-to-total energy ratios for the two directions is 1.
According to a third aspect there is provided an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.
The input format may further comprise a definition configured to control an encoding.
The apparatus may further be caused to: encode a mono audio signal based on the mono audio signal; encode a multichannel audio signal based on the mono audio signal and the metadata signal associated with the mono signal.
The apparatus caused to encode a mono audio signal based on the mono audio signal may further be caused to encode based on the definition.
The input format may further comprise a multichannel audio signal, wherein the apparatus may further be caused to: encode a mono audio signal based on the mono audio signal; and encode a multichannel audio signal based on the multichannel audio signal.
The multichannel audio signal may be a stereo audio signal.
The encoded mono audio signal may be an enhanced voice system encoded mono audio signal.
The encoded multichannel audio signal may be one of: an enhanced voice system encoded multichannel audio signal; and an Immersive Voice and Audio Services multichannel audio signal.
The metadata signal may comprise: two directional parameters for each time-frequency tile, wherein the direction parameters are limited to a single plane of elevation; and direct-to-total energy ratios associated with the two directional parameters, wherein the sum of the direct-to-total energy ratios for the two directions is 1.
According to a fourth aspect there is provided an apparatus comprising obtaining circuitry configured to obtain an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.
According to a fifth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus to perform at least the following: obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.
According to a sixth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.
According to a seventh aspect there is provided an apparatus comprising: means for obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.
According to an eighth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.
Encoding a mono audio signal may be encoding a bit-exact mono audio signal.
The encoded mono audio signal may be an encoded bit-exact mono audio signal.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments;
Figure 2 shows schematically a system of apparatus for an IVAS encoder architecture of Figure 1 including a mono signal input;
Figure 3 shows schematically a first example IVAS encoder architecture of Figure 1 including a mono signal input according to some embodiments;
Figure 4 shows schematically a second example IVAS encoder architecture of Figure 1 including a mono signal input according to some embodiments;
Figure 5 shows example bit distributions for bitstream examples based on the first example IVAS encoder architecture shown in Figure 3;
Figures 6 to 9 show example voice conference systems employing some embodiments;
Figure 10 shows a third example IVAS encoder architecture of Figure 1 including a mono signal input according to some embodiments;
Figure 11 shows an example voice conference system employing the third example IVAS encoder architecture as shown in Figure 10 according to some embodiments; and
Figure 12 shows an example device suitable for implementing the apparatus shown.
Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for the provision of efficient representation of audio in immersive systems which implement an embedded stereo (or spatial) mode of operation and which further support a bit-exact enhanced voice services (EVS) compatible mono downmix bitstream in an efficient way. These examples enable the immersive encoding system to produce an EVS mono downmix bitstream that remains bit-exact against a standalone EVS implementation despite possible differences in mono and stereo (or spatial) pre-processing within the immersive encoding system. For example, there may be different filtering prior to encoding (including the stereo-to-mono downmix) between the immersive system and the standalone EVS. Such pre-processing operations typically introduce a delay into the signal path, which may affect the outcome of the coding, e.g., due to different framing.
The above may assume a well-understood downmix from stereo to mono. For example, 'Mono = 0.5 x L + 0.5 x R' is a well-understood downmix. However, a practical stereo encoding may utilize an adaptive, smart downmix, for example to compensate inter-channel time/phase differences, which may maintain quality and produce a mono signal that reproduces the original stereo signal as faithfully as possible.
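The contrast between a passive downmix and a smarter, adaptive one can be sketched as follows. The alignment method shown (a simple cross-correlation lag search with a circular shift) is an illustrative assumption, not the downmix mandated by EVS or IVAS.

```python
# Passive vs. time-aligned ("smart") stereo-to-mono downmix, for illustration only.
import numpy as np

def passive_downmix(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    return 0.5 * left + 0.5 * right            # Mono = 0.5 x L + 0.5 x R

def aligned_downmix(left: np.ndarray, right: np.ndarray, max_lag: int = 32) -> np.ndarray:
    # Estimate the lag that best aligns R to L, then mix the aligned channels.
    lags = range(-max_lag, max_lag + 1)
    best = max(lags, key=lambda d: np.dot(left, np.roll(right, d)))
    return 0.5 * left + 0.5 * np.roll(right, best)
```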
Although the examples shown herein are described with respect to the IVAS codec, any other codec or coding can implement some embodiments as described herein. For example an embedded bit-exact stereo or spatial extension can be of great interest for interoperability and conformance reasons (particularly in managed services with quality of service (QoS) requirements).
The concept as discussed in further detail hereafter is able in some embodiments to allow an embedded stereo (or spatial) extension to feature bit-exact legacy mono operation in an embedded encoding structure while providing freedom of high-quality stereo-to-mono (or spatial-to-mono) downmix. Additionally the embodiments can extend Metadata-assisted spatial audio (MASA), which may be understood (at least) as an ingest format intended for the 3GPP IVAS audio codec, for (embedded) stereo encoding in a "spatial" MASA compatible way.
The embodiments may thus define an EETU-MASA (Embedded EVS Stereo Using Metadata-Assisted Spatial Audio) method allowing embedded stereo (and, by regular MASA extension, immersive/spatial) operation on top of the legacy EVS codec in a way where the mono downmix can be guaranteed to be bit-exact with EVS operation as specified by standards such as TS 26.445, TS 26.442, and TS 26.444. This allows for straightforward conformance in 3GPP services (e.g., MTSI).
In some embodiments EETU-MASA is implemented with stereo as a 1-channel + metadata input: in such an example the MASA format is configured with a channel configuration option 'stereo using mono input + MASA metadata'. This configuration information can be provided for example via a specific metadata field (e.g., called 'Channel configuration' or 'Channel audio format'). An IVAS encoder, on receiving this input, can be configured to select EVS as the core coding of the mono stream (treating the input as mono without metadata), while any MASA metadata is fed into the MASA metadata encoder. The bitstream from the IVAS encoder can in some embodiments include a bit-exact EVS mono and additional MASA metadata providing a stereo extension. This can be transmitted to a suitable IVAS decoder as is. The IVAS recipient receives the IVAS mono stream and decodes stereo (or spatial audio). Alternatively, in some embodiments a suitable network element (such as an MCU) can, for a legacy EVS recipient, drop or strip the additional metadata. Thus, the network element performs a transcoding from IVAS to EVS which is lossless for the mono part. In such embodiments the EVS recipient receives the EVS mono stream and decodes mono.
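An illustrative sketch of the network-element behaviour described above follows: for a legacy EVS recipient the metadata part of the payload is dropped, leaving the embedded bit-exact EVS mono bitstream untouched (lossless for the mono part). The payload layout assumed here (EVS bytes followed by metadata bytes) is for illustration only and is not the standardized IVAS packetization.

```python
# Dropping the stereo/spatial extension for a legacy EVS recipient (sketch).

def strip_to_evs(ivas_payload: bytes, evs_len: int) -> bytes:
    """Return only the embedded EVS mono bitstream for a legacy recipient."""
    return ivas_payload[:evs_len]

def forward(payload: bytes, evs_len: int, recipient_supports_ivas: bool) -> bytes:
    # IVAS-capable recipients get the full payload; legacy EVS recipients get
    # the embedded mono stream with the stereo/spatial extension stripped.
    return payload if recipient_supports_ivas else strip_to_evs(payload, evs_len)
```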
The stereo mode of the MASA spatial metadata can be achieved using the definition below:
- two direction parameters are provided per time-frequency (TF) tile;
- the direction index is limited to Left and Right with no elevation;
- the sum of the direct-to-total energy ratios for the two Directions is always 1.0; and
- all the other parameters can be omitted or set to zero/default.
The above definition may be particularly useful in situations with independent stereo channels, for example in a multipoint control unit (MCU) stereo audio use case.
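As an illustration only, the stereo-mode metadata defined above could be represented roughly as follows; the field names and the Left/Right azimuth convention are assumptions, not the normative MASA metadata syntax.

```python
# One time-frequency tile of the stereo-mode metadata: two fixed directions
# (Left, Right, no elevation) and ratios that always sum to 1.0. Sketch only.
from dataclasses import dataclass

@dataclass
class StereoTile:
    azimuth_deg: tuple = (90.0, -90.0)     # Left, Right; elevation implicitly 0
    ratio_left: float = 0.5                # direct-to-total ratio for the Left direction
    # ratio_right is implied as 1.0 - ratio_left, so the two ratios always sum to 1.0

def stereo_tile_from_energies(e_left: float, e_right: float) -> StereoTile:
    total = max(e_left + e_right, 1e-12)
    return StereoTile(ratio_left=e_left / total)
```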
In such embodiments the capture and audio processing system before the (IVAS) encoder, in other words the part of the end-to-end system that creates the (MASA) input for the encoder, is free to apply any stereo-to-mono downmix that is suitable for best possible quality. The (EVS/IVAS) encoder in such embodiments is configured to see the single mono downmix, and bit-exactness of the mono signal is therefore maintained.
The methods as described in the embodiments hereafter can furthermore be used for core codecs other than EVS and for extensions/immersive codecs other than IVAS.
In some embodiments an additional advantage may be that a bit-exact embedded stereo is allowed also on top of the EVS adaptive multi-rate wideband (AMR-WB) interoperable (IO) modes (and therefore not only the EVS primary modes).
In some embodiments there may be defined an EETU-MASA with stereo input as 3-channel + metadata example. In such embodiments, the MASA format comprises a channel configuration option of 'stereo using combination of mono input and stereo input + MASA metadata'. This configuration information can be provided for example via a specific metadata field (e.g., called 'Channel configuration' or 'Channel audio format'). In an (IVAS) encoder, this input can configure the encoder to select EVS as the core coding of the mono stream (treating the input as mono without metadata), while at least the stereo stream with the MASA metadata is fed into the IVAS MASA encoder (including the metadata encoding). This mode of operation is thus a parallel stereo/spatial mono downmix encoding for bit-exact backwards interoperability.
The bitstream from the (IVAS) encoder will comprise a bit-exact (EVS) mono and an additional (IVAS) bitstream with (MASA) metadata providing a stereo extension. This can be transmitted to an (IVAS) decoder as is (or with the EVS payload dropped). The (IVAS) recipient receives the (IVAS) stream and decodes stereo (or spatial audio). Alternatively in some embodiments a suitable network element (such as an MCU) can drop or strip, for a legacy EVS recipient, everything beyond the EVS bitstream. Thus, the network element can be configured to perform a transcoding from IVAS to EVS which is lossless for the mono part. The EVS recipient can be configured to receive the EVS mono stream and decode the mono signals.
In some embodiments, the IVAS encoder (or any stereo/spatial encoder) can provide the EVS bitstream (or any mono bitstream) and the IVAS bitstream (or any stereo/spatial bitstream) as separate packets.
Before discussing the embodiments further we initially discuss the systems for obtaining and rendering spatial audio signals which may be used in some embodiments.
With respect to Figure 1 is shown an example apparatus and system for obtaining and encoding an audio signal (in the form of audio capture in this example) and rendering the encoded audio signals.
The system 100 is shown with an 'analysis' part 121 and a 'synthesis' part 131. The 'analysis' part 121 is the part from receiving the multi-channel signals up to an encoding of the metadata and transport signal and the 'synthesis' part 131 is the part from a decoding of the encoded metadata and transport signal to the presentation of the re-generated signal (for example in multi-channel loudspeaker form).
The input to the system 100 and the 'analysis' part 121 is the multi-channel signals 102. In the following examples a microphone channel signal input is described, however any suitable input (or synthetic multi-channel) format may be implemented in other embodiments. For example in some embodiments the spatial analyser and the spatial analysis may be implemented external to the encoder. For example in some embodiments the spatial metadata associated with the audio signals may be provided to an encoder as a separate bit-stream. In some embodiments the spatial metadata may be provided as a set of spatial (direction) index values.
The multi-channel signals are passed to a transport signal generator 103 and to an analysis processor 105.
In some embodiments the transport signal generator 103 is configured to receive the multi-channel signals and generate a suitable transport signal comprising a determined number of channels and output the transport signals 104. For example the transport signal generator 103 may be configured to generate a 2-audio-channel downmix of the multi-channel signals. The determined number of channels may be any suitable number of channels. The transport signal generator in some embodiments is configured otherwise to select or combine the input audio signals, for example by beamforming techniques, into the determined number of channels and output these as transport signals.
In some embodiments the transport signal generator 103 is optional and the multi-channel signals are passed unprocessed to an encoder 107 in the same manner as the transport signals are in this example.
In some embodiments the analysis processor 105 is also configured to receive the multi-channel signals and analyse the signals to produce metadata 106 associated with the multi-channel signals and thus associated with the transport signals 104. The analysis processor 105 may be configured to generate the metadata which may comprise, for each time-frequency analysis interval, a direction parameter 108, an energy ratio parameter 110 and a coherence parameter 112 (and in some embodiments a diffuseness parameter). The direction, energy ratio and coherence parameters (and diffuseness parameter) may in some embodiments be considered to be spatial audio parameters. In other words the spatial audio parameters comprise parameters which aim to characterize the sound-field created by the multi-channel signals (or two or more playback audio signals in general).
In some embodiments the parameters generated may differ from frequency band to frequency band. Thus for example in band X all of the parameters are generated and transmitted, whereas in band Y only one of the parameters is generated and transmitted, and furthermore in band Z no parameters are generated or transmitted. A practical example of this may be that for some frequency bands such as the highest band some of the parameters are not required for perceptual reasons. The transport signals 104 and the metadata 106 may be passed to an encoder 107.
In some embodiments, the spatial audio parameters may be grouped or separated into directional and non-directional (such as, e.g., diffuse) parameters.
The encoder 107 may comprise an audio encoder core 109 which is configured to receive the transport (for example downmix) signals 104 and generate a suitable encoding of these audio signals. The encoder 107 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs. The encoding may be implemented using any suitable scheme. The encoder 107 may furthermore comprise a metadata encoder/quantizer 111 which is configured to receive the metadata and output an encoded or compressed form of the information. In some embodiments the encoder 107 may further interleave, multiplex to a single data stream or embed the metadata within encoded downmix signals before transmission or storage shown in Figure 1 by the dashed line. The multiplexing may be implemented using any suitable scheme.
On the decoder side, the received or retrieved data (stream) may be received by a decoder/demultiplexer 133. The decoder/demultiplexer 133 may demultiplex the encoded streams and pass the audio encoded stream to a transport extractor 135 which is configured to decode the audio signals to obtain the transport signals. Similarly the decoder/demultiplexer 133 may comprise a metadata extractor 137 which is configured to receive the encoded metadata and generate metadata. The decoder/demultiplexer 133 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
The decoded metadata and transport audio signals may be passed to a synthesis processor 139.
The system 100 'synthesis' part 131 further shows a synthesis processor 139 configured to receive the transport signals and the metadata and re-create in any suitable format a synthesized spatial audio in the form of multi-channel signals 110 (these may be in multichannel loudspeaker format or in some embodiments any suitable output format such as binaural signals for headphone listening or Ambisonics signals, depending on the use case) based on the transport signals and the metadata.
Therefore in summary first the system (analysis part) is configured to receive multi-channel audio signals.
Then the system (analysis part) is configured to generate a suitable transport audio signal (for example by selecting or downmixing some of the audio signal channels).
The system is then configured to encode for storage/transmission the transport signal and the metadata.
After this the system may store/transmit the encoded transport and metadata.
The system may retrieve/receive the encoded transport and metadata.
Then the system is configured to extract the transport and metadata from encoded transport and metadata parameters, for example demultiplex and decode the encoded transport and metadata parameters.
The system (synthesis part) is configured to synthesize an output multi-channel audio signal based on extracted transport audio signals and metadata.
In some embodiments the apparatus and methods can be implemented as part of a MASA format definition, encoder functionality, and bitstream format (including, e.g., RTP header). These embodiments are relevant for the audio codec standard as well as various network functionalities (e.g., MCU operation).
With respect to Figure 2 is shown a high-level view of an example IVAS encoder including the various inputs which may, as non-exclusive examples, be expected for the codec. The underlying idea is that mono signals are handled by a bit-exact implementation of the EVS codec, while any stereo, spatial or immersive input is handled by the IVAS core tools complemented in some cases by a metadata encoder.
The system as shown in Figure 2 can comprise input format generators 203. The input format generators 203 may be considered in some examples to be the same as the transport signal generator 103 and the analysis processor 105 from Figure 1. The input format generators 203 may be configured to generate suitable audio signals and metadata for capturing the audio and spatial audio qualities of the input signals, which may originate from microphone capture, some other source (such as a file) or a combination thereof. For example, a relevant microphone capture may be a multi-microphone audio capture on a mobile device (such as a smartphone), while a relevant other source may be a channel-based music file (such as a 5.1 music mix file). Any other suitable microphone array capture or source can also be used.
The input format generators 203 can comprise a mono audio signal generator 205 configured to generate a suitable mono audio signal.
The input format generators 203 can also comprise a multichannel or spatial format generator 207.
The multichannel or spatial format generator 207 in some embodiments comprises a metadata-assisted spatial audio generator 209. The metadata-assisted spatial audio generator 209 is configured to generate audio signals (such as the transport audio signals in the form of a stereo-channel audio signal) and metadata associated with the audio signals.
The multichannel or spatial format generator 207 in some embodiments comprises a multichannel format generator 211 configured to generate suitable multichannel audio signals (for example stereo channel format audio signals and/or 5.1 channel format audio signals).
The multichannel or spatial format generator 207 in some embodiments comprises an ambisonics generator 213 configured to generate a suitable ambisonics format audio signal (which may comprise first order ambisonics and/or higher order ambisonics).
The multichannel or spatial format generator 207 in some embodiments can comprise an independent mono streams with metadata generator 215 configured to generate mono audio signals and metadata.
In some embodiments the apparatus comprises encoders 221. The encoders are configured to receive the output of the input format generators 203 and encode these into a suitable format for storage and/or transmission. The encoders may be considered to be the same as the encoder 107.
The encoders 221 may comprise a bit-exact EVS encoder 223. The bit-exact EVS encoder 223 may be configured to receive the mono audio signal from the input format generators 203 and generate a bit-exact EVS mono audio signal.
In some embodiments the encoders 221 may comprise IVAS core encoder 225. The IVAS core encoder 225 may be configured to receive the audio signals generated by the input format generators 203 and encode these according to the IVAS standard.
In some embodiments the encoders comprise a metadata encoder 227.
The metadata encoder is configured to receive the spatial metadata and encode it or compress it in any suitable manner.
The encoders 221 in some embodiments can be configured to combine or multiplex the datastreams generated by the encoders prior to being transmitted and/or stored.
The system furthermore comprises a transmitter configured to transmit or store the bitstream 231.
With respect to Figure 3 furthermore it is shown how an embedded EVS stereo generated signal can be implemented within the system shown in Figure 2.
Thus in this example there is a mono input 301 and a stereo (and immersive audio) input 303. The mono input 301 is passed to the encoder 311 and the bit-exact EVS encoder 317 in the same manner as shown in Figure 2.
The stereo and immersive audio input 303 is passed to the encoder 311 and a pre-processor 315. The encoder 311 in some embodiments comprises a pre-processor 315. The pre-processor 315 may be configured to receive the stereo and immersive inputs and pre-process the signal before being passed to the downmixer 313 and to the IVAS core encoder 319. The metadata output of the pre-processor 315 can be passed to the metadata encoder 321.
The encoder 311 furthermore comprises a downmixer 313. The downmixer 313 is configured to process the pre-processed audio signal and output a downmixed or mono channel audio signal to the bit-exact EVS encoder 317. The downmixer 313 in some embodiments is further configured to output metadata associated with the downmixed audio signal to the metadata encoder 321.
The encoder 311 may comprise a bit-exact EVS encoder 317. The bit-exact EVS encoder 317 may be configured to receive the mono audio signal from the mono input 301 and the downmixer 313 and generate an EVS mono audio signal. In some embodiments the encoder 311 may comprise the IVAS core encoder 319. The IVAS core encoder 319 may be configured to receive the audio signals generated by the pre-processor 315 and encode these according to the IVAS standard.
In some embodiments the encoders comprise the metadata encoder 321.
The metadata encoder 321 is configured to receive the spatial metadata from the downmixer 313 and pre-processor 315 and encode it or compress it in any suitable manner.
With respect to Figure 4 is shown how an embedded EVS stereo generated signal can be implemented within the system shown in Figure 2 according to a first example embodiment. This example improves over the example shown in Figure 3 in that, although the apparatus in Figure 3 implements an embedded EVS stereo, it is not a bit-exact output when compared to a mono downmix of the same stereo signal into a legacy EVS mono encoder. This is because there is a signal delay due to pre-processing (such as any highpass or lowpass filtering) affecting, among other things, the exact framing of the signal encoding. For example, if the input framing in the encoder is changed even by introducing a one-sample delay, the resulting bitstream will be different. In addition, the pre-processing itself can change the signal characteristics (such as removal of low-frequency or high-frequency components). Another example is if an active downmix is performed to deal with a certain time/phase alignment effect, and this downmix processing differs from the downmix performed outside the codec. Although the apparatus in Figure 3 may be modified such that the pre-processing is skipped when the embedded stereo mode is used, this complicates the apparatus and introduces mode switching issues.
The embodiments further improve over the apparatus as shown in Figure 3 in that the downmix inside the codec is not limited to a simple downmix to be able to produce the same downmix outside the codec and inside the codec (as could be required for any managed system conformance test, where the requirement to be tested is providing "an embedded bit-exact EVS mono downmix bitstream").
The example shown in Figure 4 features the same mono input 301 and a stereo (and immersive audio) input 303. The mono input 301 is passed to the encoder 311 and the bit exact EVS encoder 317 in the same manner as shown in Figure 3.
The stereo and immersive audio input 303 is passed to the encoder 311 and a pre-processor 315.
The encoder 311 in some embodiments comprises the pre-processor 315. The pre-processor 315 may be configured to receive the stereo and immersive inputs and pre-process the signal before being passed to the IVAS core encoder 319. The metadata output of the pre-processor 315 can be passed to the metadata encoder 321.
As shown in Figure 4 the apparatus differs from the example shown in Figure 3 in that the format generator/inputs include a further input. In this example the further input is designated the Embedded EVS stereo using MASA (EETU-MASA) input 401. In this example a mono-downmixed parametric stereo representation of the stereo input is thus used which removes the need for passing the stereo or other multichannel audio signals through the pre-processor, for the inclusion of the downmixer prior to the EVS encoder, and allows the use of the metadata encoding as is.
The mono-downmixed parametric stereo representation in some embodiments is an extension of the MASA format. The extension is compatible with the MASA format parameter set. In principle, it is straightforward to allow encoding mode switching with this input, however, in some embodiments the mode is primarily used for the embedded bit-exact EVS stereo operation.
In some embodiments, the EETU-MASA input can be defined as (or additionally support) the following:
- one or two Direction parameters per time-frequency (TF) tile;
- a direction index limited to a planar front sector (left-front-right) or any equivalent sector;
- a sum of the direct-to-total energy ratios for the two Directions ≤ 1.0; and
- other parameters may also have non-zero values.
The stereo-to-mono downmix may be determined based on a capture or device implementation preference.
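A hedged validity check for the tile constraints listed above could look as follows; the sector bounds, the field layout and the tolerance are assumptions for illustration, not part of the format specification.

```python
# Check one EETU-MASA tile: at most two directions in the planar front sector
# (elevation 0), with direct-to-total energy ratios summing to at most 1.0.

def is_valid_eetu_masa_tile(directions, ratios,
                            azimuth_min=-90.0, azimuth_max=90.0) -> bool:
    """directions: list of (azimuth_deg, elevation_deg); ratios: list of floats."""
    if not 1 <= len(directions) <= 2 or len(ratios) != len(directions):
        return False
    for azimuth, elevation in directions:
        if elevation != 0.0 or not azimuth_min <= azimuth <= azimuth_max:
            return False                      # outside the assumed planar front sector
    # Ratios may sum to less than 1.0, leaving headroom for e.g. diffuse energy.
    return sum(ratios) <= 1.0 + 1e-6
```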
The EETU-MASA input is configured to pass the audio signal 441 to the bit-exact EVS encoder 317 and to pass the metadata 443 to the metadata encoder 321.
The encoder 311 may comprise a bit-exact EVS encoder 317. The bit-exact EVS encoder 317 may be configured to receive the mono audio signal from the mono input 301 and the EETU-MASA input 401 and attempt to generate a bit-exact EVS mono audio signal.
In some embodiments the encoder 311 may comprise the IVAS core encoder 319. The IVAS core encoder 319 may be configured to receive the audio signals generated by the pre-processor 315 and encode these according to the IVAS standard.
In some embodiments the encoders comprise the metadata encoder 321. The metadata encoder 321 is configured to receive the spatial metadata from the EETU-MASA input 401 and the pre-processor 315 and encode it or compress it in any suitable manner.
In some embodiments the rendering at the decoder is configured to provide a stereo signal. It is understood this stereo is preferably a head-locked stereo (in other words no head-tracking is needed and should not affect the rendering).
It is possible to implement the two above modes in a switching system, where a mode selection selects, based on relevant criteria, one of the two modes for each frame of audio. Typically, fluctuation from one mode to another and back on a frame-to-frame basis would be avoided. The mode selection in this case is part of the front-end processing and seen on the format level by the audio encoder. In some embodiments the EETU-MASA format comprises a channel configuration parameter which may be defined as a channel configuration specifying 'stereo input as mono + restricted MASA metadata'. In some embodiments this configuration information, when detected by the encoder 411, configures the EVS encoder 317 to automatically trigger EVS mono encoding and configures the metadata encoder 321 to generate a separate metadata (stream) encoding for the stereo extension.
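One possible reading of this configuration-driven behaviour is sketched below: when the input signals the EETU-MASA channel configuration, the mono channel goes to the (bit-exact) EVS core and the restricted MASA metadata to a separate metadata encoder. The configuration string, dictionary keys and encoder callables are assumptions for illustration only.

```python
# Route an input to the EVS core + metadata encoder, or to the IVAS core, based
# on a 'Channel audio format' style configuration field. Sketch only.

def encode_input(input_format, evs_encode, metadata_encode, ivas_core_encode) -> bytes:
    """input_format: dict with 'channel_audio_format', 'mono', 'metadata' keys (assumed)."""
    if input_format.get("channel_audio_format") == "stereo input as mono + restricted MASA metadata":
        evs_payload = evs_encode(input_format["mono"])                 # bit-exact EVS mono encoding
        metadata_payload = metadata_encode(input_format["metadata"])   # stereo extension stream
        return evs_payload + metadata_payload
    # Other configurations fall back to the regular IVAS core encoding path.
    return ivas_core_encode(input_format)
```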
Example outputs from the encoder are shown in Figure 5. Figure 5a (the upper block) shows an example where the full IVAS payload is allocated between the EVS BE bitstream and the stereo (spatial) extension metadata. Thus, for example, where the available bitrate is 13.2 kbps, the EVS BE allowance may be 9.6 kbps and the metadata 3.6 kbps; where the available bitrate is 16.4 kbps, the EVS BE allowance may be 13.2 kbps and the metadata 3.2 kbps; where the available bitrate is 24.4 kbps, the EVS BE allowance may be 16.4 kbps and the metadata 8.0 kbps; and where the available bitrate is 32.0 kbps, the EVS BE allowance may be 24.4 kbps and the metadata 7.6 kbps.
Figure 5b (the middle block) illustrates an option where the extension bit rate is reduced to allow the first bit in the IVAS payload to indicate the extension usage, as shown by the small 0.05 kbps block preceding the EVS BE blocks. Thus, for example, where the available bitrate is 13.2 kbps, the extension usage is 0.05 kbps, the EVS BE allowance may be 9.6 kbps and the metadata 3.55 kbps; where the available bitrate is 16.4 kbps, the extension usage is 0.05 kbps, the EVS BE allowance may be 13.2 kbps and the metadata 3.15 kbps; where the available bitrate is 24.4 kbps, the extension usage is 0.05 kbps, the EVS BE allowance may be 16.4 kbps and the metadata 7.95 kbps; and where the available bitrate is 32.0 kbps, the extension usage is 0.05 kbps, the EVS BE allowance may be 24.4 kbps and the metadata 7.55 kbps.
Figure 5c (the lower block) shows a further illustration for a 32-kbps packet, which is similar to the middle block but utilizes the first bit of each embedded stream for increased packet flexibility. In this example the 32 kbps packet can be divided into extension usage of 4x 0.05 kbps, 9.6 kbps EVS BE, 3.55 kbps metadata, 3.15 kbps metadata, 7.95 kbps metadata and 7.55 kbps metadata. The 32 kbps packet can also be divided into extension usage of 3x 0.05 kbps, 13.2 kbps EVS BE, 3.15 kbps metadata, 7.95 kbps metadata and 7.55 kbps metadata. The 32 kbps packet can also be divided into extension usage of 2x 0.05 kbps, 16.4 kbps EVS BE, 7.95 kbps metadata and 7.55 kbps metadata. Additionally it is shown that the 32 kbps packet can be divided into extension usage of 1x 0.05 kbps, 24.4 kbps EVS BE and 7.55 kbps metadata. This illustrates the flexibility of the embedded packetization.
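The split shown in Figure 5b can be expressed as a small allocation rule: the embedded bit-exact EVS part, an optional 1-bit extension flag (0.05 kbps at a 20 ms frame length), and the remaining stereo/spatial metadata. Treating the figure's splits as a lookup table here is an illustrative simplification, not the codec's bit allocation algorithm.

```python
# Embedded bit allocation per Figure 5 (sketch). One flag bit per 20 ms frame = 0.05 kbps.
FRAME_MS = 20
EVS_BE_FOR_TOTAL = {13.2: 9.6, 16.4: 13.2, 24.4: 16.4, 32.0: 24.4}   # kbps, from Figure 5

def allocate(total_kbps: float, with_flag: bool = False) -> dict:
    evs_kbps = EVS_BE_FOR_TOTAL[total_kbps]
    flag_kbps = 1.0 / FRAME_MS if with_flag else 0.0                  # 0.05 kbps when enabled
    metadata_kbps = total_kbps - evs_kbps - flag_kbps
    return {"flag": flag_kbps, "evs_be": evs_kbps, "metadata": round(metadata_kbps, 2)}

# Example: allocate(13.2, with_flag=True) -> {'flag': 0.05, 'evs_be': 9.6, 'metadata': 3.55}
```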
In the examples shown in Figure 5, part of the bits used for the metadata can in some embodiments be used for residual coding, a differential extra layer on top of the core EVS coded downmix. The difference can be applied on top of sub-blocks of the core codec, for example Algebraic code-excited linear prediction (ACELP) sub-blocks, TCX sub-blocks, etc. These methods can in some embodiments extend the usage of the methods to non-bit-exact embedded mono encoding systems.
In these embodiments the EETU-MASA input is a straightforward new extension of the MASA metadata definition providing an additional audio representation/mode based on a limitation applied on parameter usage (parameters used and allowed values). It is designed to be fully compatible with MASA format.
In some embodiments the EETU-MASA enables IVAS stereo operation with an embedded bit-exact EVS mono downmix bitstream. According to some embodiments the IVAS operation can also be a spatial operation with an embedded bit-exact EVS mono downmix bitstream. The embodiments furthermore allow a switching between stereo and spatial IVAS operation based on the input metadata while providing an embedded bit-exact EVS mono downmix bitstream.
With respect to Figures 6 to 9 are shown a series of example use cases implementing embodiments. Figure 6 presents a first voice conferencing scenario between three participants with a wide range of device capabilities. The system shows a legacy EVS upstream 602 implementation on user equipment 601 with mono capture and playback via earpiece (user A), an IVAS upstream 604 implementation on user equipment 603 with spatial audio capture and playback via headphones (user B), and an IVAS upstream 606 implementation on a conference room setup 605 using stereo audio capture and multi-channel loudspeaker presentation (user C).
In this example the common codec that can be negotiated between these users is the EVS codec (either legacy EVS for all or with two users using EVS in IVAS). However, two users would have full IVAS capability with the first of them being able to provide spatial audio upstream (IVAS MASA) with preference for stereo/binaural downstream/presentation and with the second of the two users being able to provide stereo audio upstream (IVAS stereo) with preference for multichannel spatial/immersive audio playback.
As the legacy EVS user 601 requires an EVS mono downstream, it seems there are two ways to handle the downstream audio (when the legacy EVS user 601 is silent). Where there is no mixing, the two ways are: to produce a single EVS mono downstream for all participants; or to produce an EVS mono downstream for the legacy EVS user and a suitable IVAS downstream for the other users.
For this use case, it is understood that an embedded mode can be very desirable.
This can be further shown via Figure 7, which presents the same scenario as Figure 6 for the downstream, and Figure 8, which adds an additional fourth user D using user equipment 807 who is always in a listening-only mode. For example, fourth user D has joined the audio conference through a separate number or link allowing user D only to listen in.
As shown in Figure 7, each user is delivered an audio representation that is relevant to the user equipment with a reduced number of re-encodings and differing bitstreams. Thus, for example, a transmitting user (for example user equipment 603 or 605) sends an IVAS payload consisting of 'EVS mono + stereo metadata' to the network. For receiving user equipment associated with user A (user equipment 601) and user B (user equipment 603), the spatial metadata is stripped and legacy EVS is delivered. Thus for example the MCU may be configured to transmit EVS mono with stereo/spatial metadata stripped out 702 to the user equipment 601 and may be further configured to transmit EVS mono with stereo metadata (with any spatial metadata stripped out) 704 to the user equipment 603. Immersive participants, for example user C operating user equipment 605, may be configured to receive from the MCU 607 an EVS mono and spatial metadata downlink 706. Furthermore, as shown in Figure 8, user D operating user equipment 807 may be configured to receive from the MCU 607 an EVS mono and spatial metadata downlink 808.
In such an example it is possible to see that the overall delivery load in the network is reduced, where a single bitstream is suitable for all receiving users (or for as many users as possible).
It is understood that in the examples of Figures 6 to 8 user B 603 could, at least in some embodiments, also receive a bitstream describing EVS mono and spatial metadata instead of EVS mono and stereo metadata. This is because a spatial signal can be presented over headphones, e.g., by means of binauralization.
With respect to Figure 9 is shown a further example wherein user B (user equipment 903) uploads or transmits to an MCU node 915 of a network an IVAS payload consisting of EVS mono 906 and stereo metadata 904. The MCU nodes 915, 917 are shown passing the IVAS payload (EVS mono 906 and stereo metadata 904). The receiving user equipment associated with receiver 1 (user equipment 901) is configured to receive from the MCU node 915 signals where the metadata is stripped and legacy EVS 906 is delivered. Similarly the MCU node 917 may be configured to transmit EVS mono 906 with stereo/spatial metadata stripped out to the user equipment 905.
Immersive participants, for example receiver 3 (user equipment 907), may be configured to receive from the MCU 917 an IVAS payload (in the form of an EVS mono 906 and spatial metadata 904) downlink.
In some embodiments the 3-channel input for the IVAS encoder can be created for example based on a mixing of at least two audio streams (e.g., an operation on an MCU). At least in some embodiments, any delay incurred by the mono downmix can be taken into account for the stereo signal of the 3-channel input. The mono and stereo audio signals can thus be fully aligned in time.
In such an example as shown in Figure 10, the mono channel of the 3-channel input is fed (without metadata) into a bit-exact EVS encoder. In some embodiments the EVS codec may be instructed to encode the signal at a fixed bit rate, in other words the EVS encoding can be without bit rate switching, which produces a fixed-bit-rate EVS bitstream. The stereo (+ metadata) is encoded using the IVAS core encoding (and metadata encoding). In some embodiments the mono stream may be fed also to the IVAS core encoder (and may in some embodiments be always provided to the EVS encoder).
The substantially simultaneously encoded EVS and IVAS frames (both of length 20 ms, although they may have a slight relative offset due to potential mismatch in core encoder lookahead) are packed together in some embodiments into a common package for transmission.
For example, there may be various coding and transmission modes with package-embedded EVS or embedded scalable EVS enabled, such as:

| Bit rate | EVS | IVAS |
|---|---|---|
| 64 kbps | 13.2 kbps | 50.8 kbps stereo (e.g., 2 x 25.4 kbps) |
| 64 kbps | 13.2 kbps | 50.8 kbps spatial (e.g., mono or stereo MASA) |
| 64 kbps | 13.2 kbps | 3.2 kbps metadata layer for EVS-based stereo and 47.6 kbps spatial independent of EVS |
| 64 kbps | 13.2 or 24.4 kbps | 50.8 kbps or 39.6 kbps; fixed 39.6 kbps (with zero-padding if needed) |
| 96 kbps | 13.2 kbps | 64 kbps and 18.8 kbps of non-audio data (the IVAS package is here 13.2 + 64 = 77.2 kbps) |

Above, by "package-embedded" it is understood that the EVS bitstream is part of the IVAS package in a special operation mode. The EVS bitstream can be provided to a legacy EVS user. However, when IVAS is being decoded, the EVS bitstream may be simply discarded. The first two examples may be implemented in this way. By "embedded scalable" it is understood "regular" embedded operation, for example resembling Figure 5. The third example may be implemented in such a manner. This package in some embodiments includes three separate encodings: EVS at 13.2 kbps, EVS-based IVAS stereo at 16.4 kbps, and a 47.6 kbps IVAS encoding (that may be, for example, a high-quality stereo or a spatial signal).
Figure 11 presents a further use example associated with the further example shown in Figure 10. In this example a fixed packet size 1115 may be used to communicate between MCUs such as MCU 1121 and MCU 1111. While it may seem wasteful to deliver several encodings in a single package, there may be systems where a fixed prioritized packet size (e.g., a 64-kbps channel or some other size channel) for voice communication is implemented. The "package-embedded" delivery can in this case be used to provide various levels of service, e.g., to conference call participants with different capabilities.
Thus for example an IVAS mobile device 1101 and user may establish an IVAS connection 1102 (for example with a MASA input) with the MCU 1121. A legacy EVS mobile device 1105 and user may establish an EVS-only connection 1106 with the MCU 1121. A further legacy EVS mobile device 1103 and user may establish an EVS-only connection 1104 with the MCU 1111.
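The fixed-size, package-embedded delivery can be sketched as follows: a 20 ms EVS frame and the simultaneously encoded IVAS frame(s) are carried in one fixed-size package, zero-padded when the encodings do not fill the channel. The byte counts follow from the kbps figures above; the container layout itself is an assumption for illustration, not a specified format.

```python
# Pack one 20 ms EVS frame plus IVAS frame(s) into a fixed-size package (sketch).

def pack_frames(evs_frame: bytes, ivas_frames: list, package_kbps: float = 64.0) -> bytes:
    package_bytes = int(package_kbps * 1000 * 0.020 / 8)      # e.g. 64 kbps -> 160 bytes per 20 ms
    body = evs_frame + b"".join(ivas_frames)
    if len(body) > package_bytes:
        raise ValueError("encodings exceed the fixed package size")
    return body + b"\x00" * (package_bytes - len(body))       # zero-pad to the fixed size

# Example: 13.2 kbps EVS (33 bytes) plus 50.8 kbps IVAS stereo (127 bytes) in a 64 kbps package.
packed = pack_frames(b"\x01" * 33, [b"\x02" * 127], package_kbps=64.0)
assert len(packed) == 160
```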
Also as shown in Figure 11 a fixed line device 1107 and user may additionally establish a fixed package size 1108 (for example 64 kbps) connection with the MCU 1111.

With respect to Figure 12 an example electronic device which may be used as the analysis or synthesis device is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes such as the methods described herein.

In some embodiments the device 1400 comprises a memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or is to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IRDA).
The transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
In some embodiments the device 1400 may be employed as at least part of the synthesis device. As such the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code. The input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones (which may be headtracked or non-tracked headphones) or similar.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, and CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.
Claims (15)
- CLAIMS: 1. An apparatus comprising means for: obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.
- 2. The apparatus as claimed in claim 1, wherein the input format further comprises a definition configured to control an encoder.
- 3. The apparatus as claimed in any of claims 1 to 2, wherein the means for is further for: encoding a bit-exact mono audio signal based on the mono audio signal; encoding a multichannel audio signal based on the mono audio signal and the metadata signal associated with the mono signal.
- 4. The apparatus as claimed in claim 3 when dependent on claim 2, wherein the encoding a bit-exact mono audio signal based on the mono audio signal is based on the definition configured to control an encoder.
- 5. The apparatus as claimed in any of claims 1 to 2, the input format further comprising a multichannel audio signal, wherein the means for is further for: encoding a mono audio signal based on the mono audio signal; and encoding a multichannel audio signal based on the multichannel audio signal.
- 6. The apparatus as claimed in any of claims 1 to 5, wherein the multichannel audio signal is a stereo audio signal.
- 7. The apparatus as claimed in any of claims 1 to 6, wherein the encoded mono audio signal is an enhanced voice system encoded mono audio signal.
- 8. The apparatus as claimed in any of claims 1 to 7, wherein the encoded multichannel audio signal is one of: an enhanced voice system encoded multichannel audio signal; and an Immersive Voice and Audio Services multichannel audio signal.
- 9. The apparatus as claimed in any of claims 1 to 8, wherein the encoded mono audio signal is an encoded mono audio signal.
- 10. The apparatus as claimed in any of claims 1 to 9, wherein the metadata signal comprises: two directional parameters for each time-frequency tile, wherein the direction parameters are limited to a single plane of elevation; and direct-to-total energy ratios associated with the two directional parameters, wherein the sum of direct-to-energy ratios for the two directions is 1.
- 11. A method comprising obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.
- 12. The method as claimed in claim 11, wherein the input format further comprises a definition configured to control an encoding.
- 13. The method as claimed in any of claims 11 to 12, further comprising: encoding a mono audio signal based on the mono audio signal; encoding a multichannel audio signal based on the mono audio signal and the metadata signal associated with the mono signal.
- 14. The method as claimed in claim 13 when dependent on claim 12, wherein encoding a mono audio signal based on the mono audio signal further comprising encoding based on the definition.
- 15. The method as claimed in any of claims 11 to 12, wherein the input format further comprises a multichannel audio signal, wherein the method further comprises: encoding a mono audio signal based on the mono audio signal; and encoding a multichannel audio signal based on the multichannel audio signal.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1900871.3A GB2580899A (en) | 2019-01-22 | 2019-01-22 | Audio representation and associated rendering |
PCT/FI2020/050014 WO2020152394A1 (en) | 2019-01-22 | 2020-01-09 | Audio representation and associated rendering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1900871.3A GB2580899A (en) | 2019-01-22 | 2019-01-22 | Audio representation and associated rendering |
Publications (2)
Publication Number | Publication Date |
---|---|
GB201900871D0 GB201900871D0 (en) | 2019-03-13 |
GB2580899A true GB2580899A (en) | 2020-08-05 |
Family
ID=65656028
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB1900871.3A Withdrawn GB2580899A (en) | 2019-01-22 | 2019-01-22 | Audio representation and associated rendering |
Country Status (2)
Country | Link |
---|---|
GB (1) | GB2580899A (en) |
WO (1) | WO2020152394A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB202002900D0 (en) * | 2020-02-28 | 2020-04-15 | Nokia Technologies Oy | Audio representation and associated rendering |
JP7491395B2 (en) | 2020-11-05 | 2024-05-28 | 日本電信電話株式会社 | Sound signal refining method, sound signal decoding method, their devices, programs and recording media |
JP7491393B2 (en) | 2020-11-05 | 2024-05-28 | 日本電信電話株式会社 | Sound signal refining method, sound signal decoding method, their devices, programs and recording media |
WO2022097238A1 (en) | 2020-11-05 | 2022-05-12 | 日本電信電話株式会社 | Sound signal refining method, sound signal decoding method, and device, program, and recording medium therefor |
CN116830193A (en) * | 2023-04-11 | 2023-09-29 | 北京小米移动软件有限公司 | Audio code stream signal processing method, device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030191635A1 (en) * | 2000-09-15 | 2003-10-09 | Minde Tor Bjorn | Multi-channel signal encoding and decoding |
US20120035918A1 (en) * | 2009-04-07 | 2012-02-09 | Telefonaktiebolaget Lm Ericsson (Publ) | Method and arrangement for providing a backwards compatible payload format |
US20150248889A1 (en) * | 2012-09-21 | 2015-09-03 | Dolby International Ab | Layered approach to spatial audio coding |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190096410A1 (en) * | 2016-03-03 | 2019-03-28 | Nokia Technologies Oy | Audio Signal Encoder, Audio Signal Decoder, Method for Encoding and Method for Decoding |
GB2559199A (en) * | 2017-01-31 | 2018-08-01 | Nokia Technologies Oy | Stereo audio signal encoder |
GB2559200A (en) * | 2017-01-31 | 2018-08-01 | Nokia Technologies Oy | Stereo audio signal encoder |
GB2565747A (en) * | 2017-04-20 | 2019-02-27 | Nokia Technologies Oy | Enhancing loudspeaker playback using a spatial extent processed audio signal |
- 2019-01-22 GB GB1900871.3A patent/GB2580899A/en not_active Withdrawn
- 2020-01-09 WO PCT/FI2020/050014 patent/WO2020152394A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2020152394A1 (en) | 2020-07-30 |
GB201900871D0 (en) | 2019-03-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
RU2759160C2 (en) | Apparatus, method, and computer program for encoding, decoding, processing a scene, and other procedures related to dirac-based spatial audio encoding | |
JP6045696B2 (en) | Audio signal processing method and apparatus | |
RU2460155C2 (en) | Encoding and decoding of audio objects | |
WO2020152394A1 (en) | Audio representation and associated rendering | |
GB2574238A (en) | Spatial audio parameter merging | |
CN111819863A (en) | Representing spatial audio with an audio signal and associated metadata | |
CN112567765B (en) | Spatial audio capture, transmission and reproduction | |
JP2013137563A (en) | Stream synthesizing device, decoding device, stream synthesizing method, decoding method, and computer program | |
KR20220084113A (en) | Apparatus and method for audio encoding | |
JP2024512953A (en) | Combining spatial audio streams | |
Multrus et al. | Immersive Voice and Audio Services (IVAS) codec-The new 3GPP standard for immersive communication | |
AU2022233430A1 (en) | Audio codec with adaptive gain control of downmixed signals | |
EP4196980A1 (en) | Discontinuous transmission operation for spatial audio parameters | |
WO2022223133A1 (en) | Spatial audio parameter encoding and associated decoding | |
US20230188924A1 (en) | Spatial Audio Object Positional Distribution within Spatial Audio Communication Systems | |
US20230197087A1 (en) | Spatial audio parameter encoding and associated decoding | |
WO2020201619A1 (en) | Spatial audio representation and associated rendering | |
GB2608406A (en) | Creating spatial audio stream from audio objects with spatial extent | |
WO2023179846A1 (en) | Parametric spatial audio encoding | |
WO2023088560A1 (en) | Metadata processing for first order ambisonics | |
Fug et al. | An Introduction to MPEG-H 3D Audio | |
KR20240152893A (en) | Parametric spatial audio rendering | |
WO2023066456A1 (en) | Metadata generation within spatial audio | |
CA3208666A1 (en) | Transforming spatial audio parameters | |
CN116982109A (en) | Audio codec with adaptive gain control of downmix signals |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |