GB2577045A - Determination of spatial audio parameter encoding - Google Patents

Determination of spatial audio parameter encoding

Info

Publication number
GB2577045A
Authority
GB
United Kingdom
Prior art keywords
frames
audio
key frame
frame
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1814705.8A
Other versions
GB201814705D0 (en)
Inventor
Miikka Tapani Vilermo
Juha Petteri Ojanperä
Lasse Juhani Laaksonen
Anssi Sakari Rämö
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB1814705.8A priority Critical patent/GB2577045A/en
Publication of GB201814705D0 publication Critical patent/GB201814705D0/en
Publication of GB2577045A publication Critical patent/GB2577045A/en
Withdrawn legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/43074Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of additional data with content streams on the same device, e.g. of EPG data or interactive icon with a TV program
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/235Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H04N21/2353Processing of additional data, e.g. scrambling of additional data or processing content descriptors specifically adapted to content descriptors, e.g. coding, compressing or processing of metadata
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • G10L19/167Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/236Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • H04N21/2368Multiplexing of audio and video streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/434Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams, extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
    • H04N21/4341Demultiplexing of audio and video streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8106Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8455Structuring of content, e.g. decomposing content into time segments involving pointers to the content, e.g. pointers to the I-frames of the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S7/00Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30Control circuits for electronic adaptation of the sound field

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

An immersive voice and audio services (IVAS) codec which codes spatial audio parameters for e.g. Virtual Reality (VR) applications retains static metadata by determining key frame information 704 for video frames 702 at determined times (e.g. once per second) and inserting static audio metadata 707 into encoded audio frames 706 based on the key frame information (e.g. into frames overlapping with, preceding or following a key video frame). This allows multimedia files to be cut at the container level without quality loss due to transcoding.

Description

(71) Applicant(s):
Nokia Technologies Oy, IPR Department, Karaportti 3, 02610 Espoo, Finland
(72) Inventor(s):
Miikka Tapani Vilermo
Juha Petteri Ojanpera
Lasse Juhani Laaksonen
Anssi Sakari Ramo
(74) Agent and/or Address for Service:
Page White & Farrer, Bedford House, John Street, London, WC1N 2BF, United Kingdom
(51) INT CL:
G10L 19/008 (2013.01); H04N 21/233 (2011.01); H04S 7/00 (2006.01)
(56) Documents Cited:
EP 3171605 A1; US 20020103553 A1; https://medium.com/@LSVR_Matt/how-to-create-highquality-ambisonic-deliverables-for-youtube-vrb5e5cb213afa
(58) Field of Search:
INT CL G10L, H04N, H04S; Other: EPODOC, WPI, INSPEC, Internet
(54) Title of the Invention: Determination of spatial audio parameter encoding
Abstract Title: Spatial audio parameter coding using video key frames
(57) An immersive voice and audio services (IVAS) codec which codes spatial audio parameters for e.g. Virtual Reality (VR) applications retains static metadata by determining key frame information 704 for video frames 702 at determined times (e.g. once per second) and inserting static audio metadata 707 into encoded audio frames 706 based on the key frame information (e.g. into frames overlapping with, preceding or following a key video frame). This allows multimedia files to be cut at the container level without quality loss due to transcoding.
[Drawing sheets 1/9 to 9/9: figures only. Recoverable annotation from sheet 4/9: "... with a video key frame + one frame before and after" (reference 538).]
DETERMINATION OF SPATIAL AUDIO PARAMETER ENCODING
Field
The present application relates to apparatus and methods for sound-field related parameter encoding, but not exclusively for time-frequency domain direction related parameter encoding for an audio encoder and decoder.
Background
Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the immersive voice and audio services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network. Such immersive services include uses for example in immersive voice and audio for applications such as virtual reality (VR), augmented reality (AR) and mixed reality (MR). This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
Furthermore parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound is described using a set of parameters. For example, in parametric spatial audio capture from microphone arrays, it is a typical and an effective choice to estimate from the microphone array signals a set of parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to well describe the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers, or to other formats, such as Ambisonics.
The current approaches typically generate and transmit/store video frames, audio frames and spatial audio (and other) metadata as a multimedia bitstream or file. Typically these frames are classified as static (which can be used by themselves to generate the image or audio) or instantaneous (which contain information indicating a difference between the current frame and any previous or future static frames - also known as difference or differential frames) and which require information from at least one static frame in order to decode and generate the image or audio data.
When a multimedia file with static audio metadata only at defined frames of the file is edited (cut), the static metadata may be lost and the file cannot be decoded properly. The same effect occurs when a multimedia stream is viewed from the middle and not from the beginning.
Summary
There is provided according to a first aspect an apparatus comprising means for: obtaining a video signal; encoding a video signal as video frames, the video frames comprising key frames at determined times; determining key frame information associated with the encoded video signal; obtaining an audio signal; encoding the audio signal as frames; and inserting static audio metadata information into the frames based on the key frame information.
The means for determining key frame information associated with the encoded video signal may be further for determining from the encoded video frames at least one time associated with at least one key frame of the encoded video frames.
The means for determining key frame information associated with the encoded video signal may be further for obtaining from a user input at least one time associated with at least one key frame of the encoded video frames.
The means for inserting static audio metadata information into frames based on the key frame information may be further for: determining at least one frame overlapping a video frame identified as a key frame based on the key frame information; and inserting static audio metadata information into the at least one frame overlapping a video frame identified as a key frame based on the key frame information.
The means for inserting static audio metadata information into frames based on the key frame information may be further for inserting static audio metadata information into at least one frame preceding and at least one frame succeeding the at least one frame overlapping a video frame identified as a key frame based on the key frame information.
The at least one frame preceding the at least one frame overlapping a video frame identified as a key frame based on the key frame information may be a first determined number of frames.
The at least one frame succeeding the at least one frame overlapping a video frame identified as a key frame based on the key frame information may be a second determined number of frames.
The first determined number may be at least one of: the second determined number; and different from the second determined number.
The means for inserting static audio metadata information into frames based on the key frame information may be further for: determining at least one frame starting substantially at the same time as a video frame identified as a key frame based on the key frame information; and inserting static audio metadata information into the at least one frame starting substantially at the same time as a video frame identified as a key frame based on the key frame information.
The means for encoding the audio signal as frames may be further for encoding the audio signal as audio frames and audio metadata frames, and the means for inserting static audio metadata information into the frames based on the key frame information may be further for inserting static audio metadata information into the audio metadata frames based on the key frame information.
The means for encoding the audio signal as frames may be further for encoding the audio signal as combined audio and audio metadata frames, and the means for inserting static audio metadata information into the frames based on the key frame information may be further for inserting static audio metadata information into the combined audio and audio metadata frames based on the key frame information.
According to a second aspect there is provided a method comprising: obtaining a video signal; encoding a video signal as video frames, the video frames comprising key frames at determined times; determining key frame information associated with the encoded video signal; obtaining an audio signal; encoding the audio signal as frames; and inserting static audio metadata information into the frames based on the key frame information.
Determining key frame information associated with the encoded video signal may further comprise determining from the encoded video frames at least one time associated with at least one key frame of the encoded video frames.
Determining key frame information associated with the encoded video signal may further comprise obtaining from a user input at least one time associated with at least one key frame of the encoded video frames.
Inserting static audio metadata information into frames based on the key frame information may further comprise: determining at least one frame overlapping a video frame identified as a key frame based on the key frame information; and inserting static audio metadata information into the at least one frame overlapping a video frame identified as a key frame based on the key frame information.
Inserting static audio metadata information into frames based on the key frame information may further comprise: inserting static audio metadata information into at least one frame preceding and at least one frame succeeding the at least one frame overlapping a video frame identified as a key frame based on the key frame information.
The at least one frame preceding the at least one frame overlapping a video frame identified as a key frame based on the key frame information may be a first determined number of frames.
The at least one frame succeeding the at least one frame overlapping a video frame identified as a key frame based on the key frame information may be a second determined number of frames.
The first determined number may be at least one of: the second determined number; and different from the second determined number.
Inserting static audio metadata information into frames based on the key frame information may further comprise: determining at least one frame starting substantially at the same time as a video frame identified as a key frame based on the key frame information; and inserting static audio metadata information into the at least one frame starting substantially at the same time as a video frame identified as a key frame based on the key frame information.
Encoding the audio signal as frames may further comprise encoding the audio signal as audio frames and audio metadata frames, and inserting static audio metadata information into the frames based on the key frame information may further comprise inserting static audio metadata information into the audio metadata frames based on the key frame information.
Encoding the audio signal as frames may further comprise encoding the audio signal as combined audio and audio metadata frames, and inserting static audio metadata information into the frames based on the key frame information may further comprise inserting static audio metadata information into the combined audio and audio metadata frames based on the key frame information.

According to a third aspect there is provided an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a video signal; encode a video signal as video frames, the video frames comprising key frames at determined times; determine key frame information associated with the encoded video signal; obtain an audio signal; encode the audio signal as frames; and insert static audio metadata information into the frames based on the key frame information.
The apparatus caused to determine key frame information associated with the encoded video signal may further be caused to determine from the encoded video frames at least one time associated with at least one key frame of the encoded video frames.
The apparatus caused to determine key frame information associated with the encoded video signal may further be caused to obtain from a user input at least one time associated with at least one key frame of the encoded video frames.
The apparatus caused to insert static audio metadata information into frames based on the key frame information may further be caused to: determine at least one frame overlapping a video frame identified as a key frame based on the key frame information; and insert static audio metadata information into the at least one frame overlapping a video frame identified as a key frame based on the key frame information.
The apparatus caused to insert static audio metadata information into frames based on the key frame information may further be caused to: insert static audio metadata information into at least one frame preceding and at least one frame succeeding the at least one frame overlapping a video frame identified as a key frame based on the key frame information.
The at least one frame preceding the at least one frame overlapping a video frame identified as a key frame based on the key frame information may be a first determined number of frames.
The at least one frame succeeding the at least one frame overlapping a video frame identified as a key frame based on the key frame information may be a second determined number of frames.
The first determined number may be at least one of: the second determined number; and different from the second determined number.
The apparatus caused to insert static audio metadata information into frames based on the key frame information may further be caused to: determine at least one frame starting substantially at the same time as a video frame identified as a key frame based on the key frame information; and insert static audio metadata information into the at least one frame starting substantially at the same time as a video frame identified as a key frame based on the key frame information.
The apparatus caused to encode the audio signal as frames may further be caused to encode the audio signal as audio frames and audio metadata frames, and the apparatus caused to insert static audio metadata information into the frames based on the key frame information may further be caused to insert static audio metadata information into the audio metadata frames based on the key frame information.
The apparatus caused to encode the audio signal as frames may further be caused to encode the audio signal as combined audio and audio metadata frames, and the apparatus caused to insert static audio metadata information into the frames based on the key frame information may further be caused to insert static audio metadata information into the combined audio and audio metadata frames based on the key frame information.
According to a fourth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus to perform at least the following: obtaining a video signal; encoding a video signal as video frames, the video frames comprising key frames at determined times; determining key frame information associated with the encoded video signal; obtaining an audio signal; encoding the audio signal as frames; and inserting static audio metadata information into the frames based on the key frame information.
Determining key frame information associated with the encoded video signal may further cause the apparatus to perform determining from the encoded video frames at least one time associated with at least one key frame of the encoded video frames.
Determining key frame information associated with the encoded video signal may further cause the apparatus to perform obtaining from a user input at least one time associated with at least one key frame of the encoded video frames.
Inserting static audio metadata information into frames based on the key frame information may further cause the apparatus to perform: determining at least one frame overlapping a video frame identified as a key frame based on the key frame information; and inserting static audio metadata information into the at least one frame overlapping a video frame identified as a key frame based on the key frame information.
Inserting static audio metadata information into frames based on the key frame information may further cause the apparatus to perform: inserting static audio metadata information into at least one frame preceding and at least one frame succeeding the at least one frame overlapping a video frame identified as a key frame based on the key frame information.
The at least one frame preceding the at least one frame overlapping a video frame identified as a key frame based on the key frame information may be a first determined number of frames.
The at least one frame succeeding the at least one frame overlapping a video frame identified as a key frame based on the key frame information may be a second determined number of frames.
The first determined number may be at least one of: the second determined number; and different from the second determined number.
Inserting static audio metadata information into frames based on the key frame information may further cause the apparatus to perform: determining at least one frame starting substantially at the same time as a video frame identified as a key frame based on the key frame information; and inserting static audio metadata information into the at least one frame starting substantially at the same time as a video frame identified as a key frame based on the key frame information.
Encoding the audio signal as frames may further cause the apparatus to perform encoding the audio signal as audio frames and audio metadata frames, and inserting static audio metadata information into the frames based on the key frame information may further cause the apparatus to perform inserting static audio metadata information into the audio metadata frames based on the key frame information.
Encoding the audio signal as frames may further cause the apparatus to perform encoding the audio signal as combined audio and audio metadata frames, and inserting static audio metadata information into the frames based on the key frame information may further cause the apparatus to perform inserting static audio metadata information into the combined audio and audio metadata frames based on the key frame information.
According to a fifth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtaining a video signal; encoding a video signal as video frames, the video frames comprising key frames at determined times; determining key frame information associated with the encoded video signal; obtaining an audio signal; encoding the audio signal as frames; and inserting static audio metadata information into the frames based on the key frame information.
Determining key frame information associated with the encoded video signal may further cause the apparatus to perform determining from the encoded video frames at least one time associated with at least one key frame of the encoded video frames.
Determining key frame information associated with the encoded video signal may further cause the apparatus to perform obtaining from a user input at least one time associated with at least one key frame of the encoded video frames.
Inserting static audio metadata information into frames based on the key frame information may further cause the apparatus to perform: determining at least one frame overlapping a video frame identified as a key frame based on the key frame information; and inserting static audio metadata information into the at least one frame overlapping a video frame identified as a key frame based on the key frame information.
Inserting static audio metadata information into frames based on the key frame information may further cause the apparatus to perform: inserting static audio metadata information into at least one frame preceding and at least one frame succeeding the at least one frame overlapping a video frame identified as a key frame based on the key frame information.
The at least one frame preceding the at least one frame overlapping a video frame identified as a key frame based on the key frame information may be a first determined number of frames.
The at least one frame succeeding the at least one frame overlapping a video frame identified as a key frame based on the key frame information may be a second determined number of frames.
The first determined number may be at least one of: the second determined number; and different from the second determined number.
Inserting static audio metadata information into frames based on the key frame information may further cause the apparatus to perform: determining at least one frame starting substantially at the same time as a video frame identified as a key frame based on the key frame information; and inserting static audio metadata information into the at least one frame starting substantially at the same time as a video frame identified as a key frame based on the key frame information.
Encoding the audio signal as frames may further cause the apparatus to perform encoding the audio signal as audio frames and audio metadata frames, and inserting static audio metadata information into the frames based on the key frame information may further cause the apparatus to perform inserting static audio metadata information into the audio metadata frames based on the key frame information.
Encoding the audio signal as frames may further cause the apparatus to perform encoding the audio signal as combined audio and audio metadata frames, and inserting static audio metadata information into the frames based on the key frame information may further cause the apparatus to perform inserting static audio metadata information into the combined audio and audio metadata frames based on the key frame information.
According to a sixth aspect there is provided an apparatus comprising: video obtaining circuitry configured to obtain a video signal; video encoding circuitry configured to encode a video signal as video frames, the video frames comprising key frames at determined times; key frame information determining circuitry configured to determine key frame information associated with the encoded video signal; audio signal obtaining circuitry configured to obtain an audio signal; audio encoding circuitry configured to encode the audio signal as frames; and metadata inserting circuitry configured to insert static audio metadata information into the frames based on the key frame information.
According to a seventh aspect there is provided a computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtaining a video signal; encoding a video signal as video frames, the video frames comprising key frames at determined times; determining key frame information associated with the encoded video signal; obtaining an audio signal; encoding the audio signal as frames; and inserting static audio metadata information into the frames based on the key frame information.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically apparatus suitable for implementing some embodiments;
Figure 2 shows schematically further apparatus suitable for implementing some embodiments;
Figure 3 shows an example video/audio format;
Figure 4 shows an example video/audio format according to some embodiments;
Figure 5 shows a further example video/audio format according to some embodiments;
Figure 6 shows another example video/audio format according to some embodiments;
Figure 7 shows a method for implementing the example video/audio format as shown in Figure 4 according to some embodiments;
Figure 8 shows a method for implementing the further example video/audio format as shown in Figure 5 according to some embodiments;
Figure 9 shows a method for implementing the another example video/audio format as shown in Figure 6 according to some embodiments; and
Figure 10 shows schematically an example device suitable for implementing the apparatus shown.
Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for the provision of audio codec formats which enable editing of a multimedia file with effective spatial audio decoding.
As described herein some third party software packages (such as mp4box) may be able to edit a multimedia file before video key frames without transcoding audio or video by only performing editing at the mp4 container format level. Multimedia files that contain both audio and video tracks are typically edited (without transcoding) at video key frames because if the first frame of the video is not a key frame, then the video cannot be decoded. This results in audio tracks being edited (cut) at the location of a video key frame. The editors typically assume that audio and audio metadata tracks can be cut anywhere, but this is not the case where the multimedia file comprises static audio metadata only at defined frames of the file, as the edited (cut) point may not contain the static metadata required to correctly decode the audio signal.
There are several benefits to cutting multimedia files at the container level instead of cutting them with transcoding. Firstly, every transcoding stage (both video and audio) loses quality. Secondly, it is faster. Thirdly, files can be cut without understanding all tracks, such as proprietary metadata tracks.
A first concept as discussed in further detail herein is apparatus and a method for selecting audio metadata track key frame locations around the time codes where a video track has a key frame, so that the overall bitrate is not significantly increased but the multimedia file/bitstream can be cut without affecting the decoder's ability to decode the audio with the metadata.
A further concept as also discussed herein is apparatus and a method for reducing the number of frames into which static audio metadata is inserted (i.e. the number of audio metadata key frames). This may be implemented by checking whether an audio or metadata frame starts at the same time as a video key frame. In such cases static metadata is not inserted into several frames around the video key frame, but only into the aforementioned audio or metadata frame.
With respect to Figure 1 is shown an example apparatus as implemented in some embodiments.
The apparatus is shown with an ‘analysis’ or generator part 151 and a ‘synthesis’ or renderer part 153. The generator part 151 is the part from receiving the audio signals 104 and video signals 100 up to an output of encoded video, audio and metadata and the renderer part 153 is the part from a decoding of the encoded video, audio and metadata data to the presentation of the video and audio signals.
The input to the generator part 151 may be the video signal 100, for example captured from a camera or cameras or generated from an image generator (not shown) and passed to a video encoder 101.
The video encoder 101 may be any suitable video encoder configured to receive the video signal 100 and generate a suitable frame based encoded video signal and pass this to a multiplexer 105. The video encoder 101 may further identify and indicate key frame information 102 to an audio encoder 103.
Furthermore the generator part 151 may comprise a suitable audio encoder 103. The audio encoder 103 may be any suitable frame based encoder configured to receive the audio signal 104 and the key frame information 102 and generate encoded audio signals (in a frame format) and encoded spatial audio metadata (in a frame format which is aligned with the audio signal frames) to the multiplexer 105.
The multiplexer 105 may be configured to receive the encoded video and encoded audio and encoded audio metadata and multiplex these to generate a suitable bitstream and/or file. The multiplexer 105 may in some embodiments be configured to store and/or transmit the bitstream/file such that it can be received or retrieved by the renderer part 153.
The renderer part 153 is configured to receive or retrieve the bit stream or file. The renderer part 153 in some embodiments comprises a demultiplexer 107 configured to receive the bit stream or file and demultiplex this into encoded video data which can be passed to a video decoder 109 and encoded audio and encoded spatial metadata which can be passed to an audio decoder 111.
The renderer part 153 in some embodiments comprises a video decoder 109 configured to receive the encoded video data and decode this to generate a suitable video signal which may be passed to a video player 113 for outputting on a suitable display apparatus (for example a touchscreen display on a mobile device, or a display on a virtual reality/augmented reality/mixed reality headset).
The renderer part 153 in some embodiments further comprises an audio decoder 111 configured to receive the encoded audio data and encoded spatial metadata and decode the encoded audio data with spatial metadata to generate a suitable audio signal which may be passed to an audio player 115 for outputting on a suitable audio output apparatus (for example headphones, multichannel loudspeakers or earbuds linked to a virtual reality/augmented reality/mixed reality headset).
With respect to Figure 2 is shown a further example apparatus as implemented in some embodiments. The main difference between the examples shown in Figure 1 and Figure 2 is that in the example shown in Figure 2 the key frame information 202 is provided to the video encoder 201 and the multiplexer 205 and not determined in the video encoder 201.
The apparatus is shown with an ‘analysis’ or generator part 251 and a ‘synthesis’ or renderer part 253. The generator part 251 is the part from receiving the audio signals 204 and video signals 200 up to an output of encoded video, audio and metadata and the renderer part 253 is the part from a decoding of the encoded video, audio and metadata data to the presentation of the video and audio signals.
The input to the generator part 251 may be the video signal 200, for example captured from a camera or cameras or generated from an image generator (not shown) and passed to a video encoder 201.
The video encoder 201 may be any suitable video encoder configured to receive the video signal 200 and generate a suitable frame based encoded video signal and pass this to a multiplexer 205. The video encoder 201 may further be provided with key frame information 202 which is also provided to the multiplexer 205.
Furthermore the generator part 251 may comprise a suitable audio encoder 203. The audio encoder 203 may be any suitable frame based encoder configured to receive the audio signal 204 and generate encoded audio signals (in a frame format) and encoded spatial audio metadata (in a frame format which is aligned with the audio signal frames) to the multiplexer 205.
The multiplexer 205 may be configured to receive the encoded video and encoded audio and encoded audio metadata and the key frame information 202 and multiplex the video, audio and metadata to generate a suitable bitstream and/or file. The multiplexer 205 may in some embodiments be configured to store and/or transmit the bitstream/file such that it can be received or retrieved by the renderer part 253.
The renderer part 253 is configured to receive or retrieve the bit stream or file. The renderer part 253 in some embodiments comprises a demultiplexer 207 configured to receive the bit stream or file and demultiplex this into encoded video data which can be passed to a video decoder 209 and encoded audio and encoded spatial metadata which can be passed to an audio decoder 211.
The renderer part 253 in some embodiments comprises a video decoder 209 configured to receive the encoded video data and decode this to generate a suitable video signal which may be passed to a video player 213 for outputting on a suitable display apparatus (for example a touchscreen display on a mobile device, or a display on a virtual reality/augmented reality/mixed reality headset).
The renderer part 253 in some embodiments further comprises an audio decoder 211 configured to receive the encoded audio data and metadata and decode the encoded audio data with metadata to generate a suitable audio signal which may be passed to an audio player 215 for outputting on a suitable audio output apparatus (for example headphones, multichannel loudspeakers or earbuds linked to a virtual reality/augmented reality/mixed reality headset).
The use of the multiplexer in the examples of Figures 1 and 2 is now described in further detail.
Figure 3 shows an example of a typical multimedia file and comprises a video track 301 which comprises video frames. The video frames shown in Figure 3 are video frame 1 302 which is a key frame and comprises all of the information required to regenerate it at the decoder, and video frame 2 304 and video frame 3 306 which may comprise instantaneous (or difference or differential) information and which require information from a key frame, such as video frame 1 302 in order to decode.
Additionally the file comprises an audio track 303 which comprises audio frames. The audio frames shown in Figure 3 are audio frame 1 312, audio frame 2 314, audio frame 3 316, audio frame 4 318, and audio frame 5 320.
Additionally the file comprises an audio metadata track 305 which comprises audio metadata frames. The audio metadata frames shown in Figure 3 are audio metadata frame 1 322, audio metadata frame 2 324, audio metadata frame 3 326, audio metadata frame 4 328, and audio metadata frame 5 330. In the example shown in Figure 3 the length of the audio frame and audio metadata frame is the same, however in some embodiments the lengths may differ. In the example shown in Figure 3 audio metadata frame 1 322 is a ‘key’ audio metadata frame in that it comprises static and instantaneous metadata, but the other frames comprise only the instantaneous metadata.
In some embodiments the audio metadata including the static audio metadata is included in the audio track and there is no separate track for the metadata.
As can be seen in Figure 3 the length of each audio frame (and therefore the audio metadata frame) is shorter than the video frame. Therefore, although playback of the file can be started from the first frame (because it contains the static audio metadata and the video frame is a key frame), playback cannot be started from any other frame.
In some examples a multimedia file video track has a key frame approximately once per second. The multimedia file can therefore be cut or edited just before the key frame and the video in the edited file can play because in the edited file the first video frame is a key frame.
Similarly, an edited file which does not comprise the static audio metadata frame will not be able to play the audio and will have to mask or mute the audio until the next audio metadata frame comprising static metadata is decoded.
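To make the cut position concrete, the following minimal sketch (an illustration only, not taken from the patent; the Frame tuple and the millisecond time base are assumptions) shows how an editor might snap a requested cut time forward to the next video key frame so that the edited file starts with a decodable frame:

```python
from collections import namedtuple

Frame = namedtuple("Frame", "start_ms is_key")

def container_level_cut(video_frames, cut_time_ms):
    """Snap a requested cut time forward to the next video key frame and keep
    the frames from there on, so the edited file starts with a key frame.
    No transcoding takes place; frames are copied as-is."""
    key_starts = [f.start_ms for f in video_frames if f.is_key]
    cut_at = min(t for t in key_starts if t >= cut_time_ms)
    return [f for f in video_frames if f.start_ms >= cut_at]

# 25 fps video (40 ms frames) with a key frame at the start of each second.
frames = [Frame(t, t % 1000 == 0) for t in range(0, 3000, 40)]
print(container_level_cut(frames, 1500)[0])  # Frame(start_ms=2000, is_key=True)
```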
In some embodiments the audio encoder 103 as shown in Figure 1 or the multiplexer 205 as shown in Figure 2 may be configured to use the key frame information 102, 202 to modify the audio metadata frames.
For example in some embodiments static audio metadata is inserted into metadata frames that are located around the video key frame, i.e. audio metadata key frames are created at suitable locations.
In some embodiments the static metadata is inserted to audio metadata frames that overlap in time a video key frame.
This for example is shown in the schematic file example shown in Figure 4. Figure 4 shows an example multimedia file and comprises a video track 401 which comprises video frames. The video frames shown in Figure 4 are video frame N-1 398, which may comprise instantaneous (or difference or differential) information and which requires information from a key frame in order to decode and possibly information from non-key frames; video frame N 400, which is a key frame and comprises all of the information required to regenerate it at the decoder; and video frame N+1 402, which may comprise instantaneous (or difference or differential) information and which requires information from a key frame, such as video frame N 400, in order to decode.
Additionally the file comprises an audio track 403 which comprises audio frames. The audio frames shown in Figure 4 are audio frame M-1 408, audio frame M 410, audio frame M+1 412, audio frame M+2 414, and audio frame M+3 416.
Additionally the file comprises an audio metadata track 405 which comprises audio metadata frames. The audio metadata frames shown in Figure 4 are audio metadata frame M-1 418, audio metadata frame M 420, audio metadata frame M+1 422, audio metadata frame M+2 424, and audio metadata frame M+3 426.
In some embodiments therefore the location of the key frame for the video is determined (from the key frame information determined directly from the video encoder or otherwise) and then any audio frames overlapping the video key frame are identified and static audio metadata is stored in these overlapping frames.
Thus for the example file shown in Figure 4 where the video frame N is determined as the key frame, the audio metadata frames M, M+1 and M+2 are then identified as the overlapping audio frames and static audio metadata inserted 428.
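As a rough sketch of this overlap test (the millisecond time base and the 40 ms/20 ms frame durations are illustrative assumptions, not values from the patent), the frames M, M+1 and M+2 of Figure 4 could be found as follows:

```python
def frames_overlapping_keyframe(kf_start_ms, kf_end_ms, audio_frame_ms):
    """Indices m of audio metadata frames [m*L, (m+1)*L) ms that overlap the
    video key frame interval [kf_start_ms, kf_end_ms) ms."""
    first = kf_start_ms // audio_frame_ms        # frame containing the start
    last = (kf_end_ms - 1) // audio_frame_ms     # frame containing the end
    return list(range(first, last + 1))

# A 40 ms key frame that is offset against the 20 ms audio grid overlaps
# three audio metadata frames, as in Figure 4 (frames M, M+1 and M+2).
print(frames_overlapping_keyframe(1005, 1045, 20))  # [50, 51, 52]
```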
With respect to Figure 7 an example flow diagram showing the operations in implementing the overlapping insertion embodiment as shown in Figure 4 and as implemented by the apparatus as shown in Figure 1 or 2 is described in further detail.
The video signal may be received or obtained as shown in Figure 7 by step 701.
The video signal may be encoded as shown in Figure 7 by step 702.
The key frame information may be determined (for example as shown by the dashed line from the encoded video signal) as shown in Figure 7 by step 704.
Additionally the audio signal may be received or obtained as shown in Figure 7 by step 703.
The audio signal may be encoded as shown in Figure 7 by step 706.
Then based on the key frame information the static audio metadata may be inserted into the audio metadata frames which overlap in time with the video key frame as shown in Figure 7 by step 707.
Then the video and audio and audio metadata frames may be multiplexed as shown in Figure 7 by step 709.
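The following sketch ties the steps of Figure 7 together (plain dictionaries stand in for real metadata frames, frames_overlapping_keyframe is the helper from the previous sketch, and the frame durations are again assumed):

```python
def mark_static_metadata(md_frames, video_keyframe_starts_ms,
                         video_frame_ms=40, audio_frame_ms=20):
    """Steps 704 and 707: for each video key frame, flag the overlapping
    audio metadata frames so that the multiplexer (step 709) writes the
    static audio metadata into them."""
    for kf_start in video_keyframe_starts_ms:
        overlap = frames_overlapping_keyframe(
            kf_start, kf_start + video_frame_ms, audio_frame_ms)
        for m in overlap:
            if 0 <= m < len(md_frames):
                md_frames[m]["static"] = True

md_frames = [{"static": False} for _ in range(150)]  # 3 s of 20 ms frames
mark_static_metadata(md_frames, [0, 1000, 2000])     # key frame every second
print(sum(f["static"] for f in md_frames))           # 6 frames flagged
```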
In some embodiments static metadata is inserted to audio metadata frames that overlap in time a video key frame and additionally to a determined number of frames before and after. For example the determined number may be one or two frames, and in some embodiments the determined number may be different for the frames before and the frames after. This is because multimedia editors function differently and some include a different amount of audio frames into cut multimedia than others.
This for example is shown in the schematic file example shown in Figure 5. Figure 5 shows an example multimedia file and comprises a video track 501 which comprises video frames. The video frames shown in Figure 5 are video frame N-1 498 which may comprise instantaneous (or difference or differential) information and which requires information from a key frame in order to decode, video frame N 500 which is a key frame and comprises all of the information required to regenerate it at the decoder and video frames N+1 502 and N+2 504 which may comprise instantaneous (or difference or differential) information and which require information from a key frame, such as video frame N 500 in order to decode.
Additionally the file comprises an audio track 503 which comprises audio frames. The audio frames shown in Figure 5 are audio frames M-2 506, M-1 508, M 510, M+1 512, M+2 514, M+3 516 and M+4 518.
The file also comprises an audio metadata track 505 which comprises audio metadata frames. The audio metadata frames shown in Figure 5 are audio metadata frame M-2 517, M-1 518, M 520, M+1 522, M+2 524, M+3 526 and M+4 528.
Thus in some embodiments therefore the location of the key frame for the video is determined (from the key frame information determined directly from the video encoder or otherwise) and then any audio frames overlapping the video key frame are identified and static audio metadata is stored in these overlapping frames and one frame before and after.
Thus for the example file shown in Figure 5 where the video frame N is determined as the key frame, the audio metadata frames M, M+1 and M+2 are then identified as the overlapping audio frames and static audio metadata inserted 528 into audio metadata frames M-1 518, M 520, M+1 522, M+2 524, and M+3 526.
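Building on the earlier sketch, the Figure 5 variant simply widens the overlap set by a margin on each side (the one-frame margin matches the example above; other values are possible):

```python
def frames_to_mark(kf_start_ms, kf_end_ms, audio_frame_ms,
                   margin_before=1, margin_after=1):
    """Overlapping audio metadata frames plus a safety margin on each side,
    to tolerate editors that keep a differing number of audio frames around
    the cut point."""
    core = frames_overlapping_keyframe(kf_start_ms, kf_end_ms, audio_frame_ms)
    first = max(0, core[0] - margin_before)
    return list(range(first, core[-1] + margin_after + 1))

# Matches Figure 5: frames M-1 to M+3 around the key frame.
print(frames_to_mark(1005, 1045, 20))  # [49, 50, 51, 52, 53]
```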
With respect to Figure 8 an example flow diagram showing the operations in implementing the overlapping insertion embodiment as shown in Figure 5 and as implemented by the apparatus as shown in Figure 1 or 2 is described in further detail.
The video signal may be received or obtained as shown in Figure 8 by step 701.
The video signal may be encoded as shown in Figure 8 by step 702.
The key frame information may be determined (for example as shown by the dashed line from the encoded video signal) as shown in Figure 8 by step 704.
Additionally the audio signal may be received or obtained as shown in Figure 8 by step 703.
The audio signal may be encoded as shown in Figure 8 by step 706.
Then based on the key frame information the static audio metadata may be inserted into the audio metadata frames which overlap in time with the video key frame and also one frame before and after (or any determined number before and after) as shown in Figure 8 by step 807.
Then the video and audio and audio metadata frames may be multiplexed as shown in Figure 8 by step 709.
In some embodiments when an audio metadata frame start time (and the audio frame start time) is the same as a video key frame start time, static audio metadata may be inserted into only that single audio metadata frame.
This may be advantageous because when the start times are the same, most multimedia editors may edit (cut) the multimedia file audio, video and metadata tracks similarly from the beginning of the video key frame.
This for example is shown in the schematic file example shown in Figure 6. Figure 6 shows an example multimedia file and comprises a video track 601 which comprises video frames. The video frames shown in Figure 6 are video frame N-1 598, which may comprise instantaneous (or difference or differential) information and which requires information from a key frame in order to decode; video frame N 600, which is a key frame and comprises all of the information required to regenerate it at the decoder; and video frame N+1 602, which may comprise instantaneous (or difference or differential) information and which requires information from a key frame, such as video frame N 600, in order to decode.
Additionally the file comprises an audio track 603 which comprises audio frames. The audio frames shown in Figure 6 are audio frames M-1 608, M 610, M+1 612, M+2 614, and M+3 616.
The file also comprises an audio metadata track 605 which comprises audio metadata frames. The audio metadata frames shown in Figure 6 are audio metadata frame M-1 618, M 620, M+1 622, M+2 624, and M+3 626.
Thus in some embodiments therefore the location of the key frame for the video is determined (from the key frame information determined directly from the video encoder or otherwise). Furthermore it is determined that the start of the key video frame and an audio frame are aligned. Then static audio metadata is inserted 628 or stored in the aligned frame 620.
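A sketch of this Figure 6 rule, under the same assumptions as the earlier snippets (an exact modulo test stands in for "substantially at the same time"; a real implementation might allow a small tolerance):

```python
def frames_for_static_metadata(kf_start_ms, kf_end_ms, audio_frame_ms):
    """If an audio metadata frame starts exactly at the video key frame
    start, only that single frame needs the static metadata; otherwise
    fall back to the overlap-plus-margin rule of Figure 5."""
    if kf_start_ms % audio_frame_ms == 0:
        return [kf_start_ms // audio_frame_ms]
    return frames_to_mark(kf_start_ms, kf_end_ms, audio_frame_ms)

print(frames_for_static_metadata(1000, 1040, 20))  # aligned: [50]
print(frames_for_static_metadata(1005, 1045, 20))  # offset: [49, ..., 53]
```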
With respect to Figure 9 an example flow diagram showing the operations in implementing the overlapping insertion embodiment as shown in Figure 6 and as implemented by the apparatus as shown in Figure 1 or 2 is described in further detail.
The video signal may be received or obtained as shown in Figure 9 by step 701.
The video signal may be encoded as shown in Figure 9 by step 702.
The key frame information may be determined (for example as shown by the dashed line from the encoded video signal) as shown in Figure 9 by step 704.
Additionally the audio signal may be received or obtained as shown in Figure 9 by step 703.
The audio signal may be encoded as shown in Figure 9 by step 706.
Then based on the key frame information the static audio metadata may be inserted into the audio metadata frame that starts at the same time as a video key frame as shown in Figure 9 by step 907.
Then the video and audio and audio metadata frames may be multiplexed as shown in Figure 9 by step 709.
In some embodiments audio and audio metadata frames are of the same length in time and their start times are the same. However, in some embodiments they may be of different lengths and/or have different start times.
The embodiments such as shown with respect to Figures 5 and 8 may be more suitable when the audio, audio metadata, and video frames differ in size and their frame boundaries only meet every now and then.
For example typical video codecs may have frame sizes such that there is a frame boundary at every second (typically 24, 25 or 30 frames per second) and audio codecs match this by having a 20 ms long frame (50 frames per second).
A video codec may implement a key frame at the beginning of each second and the audio codec could match that by inserting static metadata to the audio metadata frame at the beginning of each second.
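A quick check of this arithmetic with illustrative numbers (25 fps video, 20 ms audio frames):

```python
video_fps, audio_frame_ms = 25, 20
for second in range(3):
    # Both tracks have a frame boundary at every whole second, so one
    # audio metadata frame per second can carry the static metadata.
    video_key_frame = second * video_fps              # frames 0, 25, 50, ...
    audio_md_frame = second * 1000 // audio_frame_ms  # frames 0, 50, 100, ...
    print(f"t={second}s: video key frame {video_key_frame}, "
          f"audio metadata key frame {audio_md_frame}")
```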
Examples of static metadata comprise:
Audio track type: [microphone signals/stereo/binaural];
(If microphone signals) the locations of microphones on the capture device;
A capture device name and model;
Equalization settings;
Dithering settings;
Limits of frequency bands that correspond to directional information.
Examples of instantaneous metadata:
Direction;
Diffuseness;
Energy;
Energy ratio;
Direct-to-ambient ratio;
Distance.
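Purely as an illustrative grouping of the two lists above (the field names and types are assumptions, not a normative metadata syntax):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class StaticAudioMetadata:
    # Carried only in audio metadata key frames.
    track_type: str                                  # microphone/stereo/binaural
    mic_locations: List[Tuple[float, float, float]]  # if microphone signals
    device_name: str                                 # capture device name/model
    equalization: Dict[str, float]
    dithering: Dict[str, float]
    band_limits_hz: List[float]    # frequency bands for the directional data

@dataclass
class InstantaneousAudioMetadata:
    # Carried in every audio metadata frame.
    direction: Tuple[float, float]   # e.g. azimuth, elevation per band
    diffuseness: float
    energy: float
    energy_ratio: float
    direct_to_ambient_ratio: float
    distance: float
```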
In the embodiments shown herein any suitable codec may be used. For example the video codec may be H.264, H.265, or any other. The audio codec may be mp3, AAC, AC-3, AMR-WB, AMR-WB+, 3GPP IVAS or any other.
With respect to Figure 10 an example electronic device is shown which may be used as the device shown in Figures 1 and 2. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1900 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
In some embodiments the device 1900 comprises at least one processor or central processing unit 1907. The processor 1907 can be configured to execute various program codes such as the methods such as described herein.
In some embodiments the device 1900 comprises a memory 1911. In some embodiments the at least one processor 1907 is coupled to the memory 1911. The memory 1911 can be any suitable storage means. In some embodiments the memory 1911 comprises a program code section for storing program codes implementable upon the processor 1907. Furthermore in some embodiments the memory 1911 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1907 whenever needed via the memory-processor coupling.
In some embodiments the device 1900 comprises a user interface 1905. The user interface 1905 can be coupled in some embodiments to the processor 1907. In some embodiments the processor 1907 can control the operation of the user interface 1905 and receive inputs from the user interface 1905. In some embodiments the user interface 1905 can enable a user to input commands to the device 1900, for example via a keypad. In some embodiments the user interface 1905 can enable the user to obtain information from the device 1900. For example the user interface 1905 may comprise a display configured to display information from the device 1900 to the user. The user interface 1905 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1900 and further displaying information to the user of the device 1900. In some embodiments the user interface 1905 may be the user interface for communicating with the position determiner as described herein.
In some embodiments the device 1900 comprises an input/output port 1909. The input/output port 1909 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1907 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
The transceiver input/output port 1909 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1907 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
In some embodiments the device 1900 may be employed as at least part of the synthesis device. As such the input/output port 1909 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1907 executing suitable code. The input/output port 1909 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, and CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or fab for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. Nevertheless, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

Claims (15)

CLAIMS:
1. An apparatus comprising means for:
obtaining a video signal;
encoding a video signal as video frames, the video frames comprising key frames at determined times;
determining key frame information associated with the encoded video signal; obtaining an audio signal;
encoding the audio signal as frames; and inserting static audio metadata information into the frames based on the key frame information.
2. The apparatus as claimed in claim 1, wherein the means for determining key frame information associated with the encoded video signal is further for determining from the encoded video frames at least one time associated with at least one key frame of the encoded video frames.
3. The apparatus as claimed in claim 1, wherein the means for determining key frame information associated with the encoded video signal is further for obtaining from a user input at least one time associated with at least one key frame of the encoded video frames.
4. The apparatus as claimed in any of claims 1 to 3, wherein the means for inserting static audio metadata information into frames based on the key frame information is further for:
determining at least one frame overlapping a video frame identified as a key frame based on the key frame information; and inserting static audio metadata information into the at least one frame overlapping a video frame identified as a key frame based on the key frame information.
5. The apparatus as claimed in claim 4, wherein the means for inserting static audio metadata information into frames based on the key frame information is further for inserting static audio metadata information into at least one frame preceding and at least one frame succeeding the at least one frame overlapping a video frame identified as a key frame based on the key frame information.
6. The apparatus as claimed in claim 5, wherein the at least one frame preceding the at least one frame overlapping a video frame identified as a key frame based on the key frame information is a first determined number of frames.
7. The apparatus as claimed in claim 6, wherein the at least one frame succeeding the at least one frame overlapping a video frame identified as a key frame based on the key frame information is a second determined number of frames.
8. The apparatus as claimed in claim 7, wherein the first determined number is at least one of:
the second determined number; and different from the second determined number.
9. The apparatus as claimed in any of claims 1 to 8, wherein the means for inserting static audio metadata information into frames based on the key frame information is further for:
determining at least one frame starting substantially at the same time as a video frame identified as a key frame based on the key frame information; and inserting static audio metadata information into the at least one frame starting substantially at the same time as a video frame identified as a key frame based on the key frame information.
10. The apparatus as claimed in any of the claims 1 to 9, wherein the means for encoding the audio signal as frames is further for encoding the audio signal as audio frames and audio metadata frames, and the means for inserting static audio metadata information into the frames based on the key frame information is further for inserting static audio metadata information into the audio metadata frames based on the key frame information.
11. The apparatus as claimed in any of the claims 1 to 9, wherein the means for encoding the audio signal as frames is further for encoding the audio signal as combined audio and audio metadata frames, and the means for inserting static audio metadata information into the frames based on the key frame information is further for inserting static audio metadata information into the combined audio and audio metadata frames based on the key frame information.
12. A method comprising:
obtaining a video signal;
encoding a video signal as video frames, the video frames comprising key frames at determined times;
determining key frame information associated with the encoded video signal; obtaining an audio signal;
encoding the audio signal as frames; and inserting static audio metadata information into the frames based on the key frame information.
13. The method as claimed in claim 12, wherein determining key frame information associated with the encoded video signal further comprises determining from the encoded video frames at least one time associated with at least one key frame of the encoded video frames.
14. The method as claimed in any of claims 12 and 13, wherein determining key frame information associated with the encoded video signal further comprises obtaining from a user input at least one time associated with at least one key frame of the encoded video frames.
15. The method as claimed in any of claims 12 to 14, wherein inserting static audio metadata information into frames based on the key frame information further comprises:
determining at least one frame overlapping a video frame identified as a key frame based on the key frame information; and inserting static audio metadata information into the at least one frame overlapping a video frame identified as a key frame based on the key frame information.

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1814705.8A GB2577045A (en) 2018-09-11 2018-09-11 Determination of spatial audio parameter encoding

Publications (2)

Publication Number Publication Date
GB201814705D0 GB201814705D0 (en) 2018-10-24
GB2577045A true GB2577045A (en) 2020-03-18

Family

ID=63921137

Country Status (1)

Country Link
GB (1) GB2577045A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020103553A1 (en) * 2001-02-01 2002-08-01 Phillips Michael E. Specifying a point of origin of a sound for audio effects using displayed visual information from a motion picture
EP3171605A1 (en) * 2014-07-18 2017-05-24 Sony Corporation Transmission device, transmission method, reception device, and reception method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
https://medium.com/@LSVR_Matt/how-to-create-high-quality-ambisonic-deliverables-for-youtube-vr-b5e5cb213afa *

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)