WO2025166300A1 - Method for generating an audio-visual media stream
Info
- Publication number
- WO2025166300A1 (PCT/US2025/014205)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- user device
- audio
- microphones
- channel signals
- capturing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers
- H04R3/005—Circuits for transducers for combining the signals of two or more microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2420/00—Details of connection covered by H04R, not provided for in its groups
- H04R2420/07—Applications of wireless loudspeakers or wireless microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2499/00—Aspects covered by H04R or H04S not otherwise provided for in their subgroups
- H04R2499/10—General applications
- H04R2499/11—Transducers incorporated or for use in hand-held devices, e.g. mobile phones, PDA's, camera's
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R5/00—Stereophonic arrangements
- H04R5/027—Spatial or constructional arrangements of microphones, e.g. in dummy heads
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R5/00—Stereophonic arrangements
- H04R5/033—Headphones for stereophonic communication
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/11—Positioning of individual sound objects, e.g. moving airplane, within a sound field
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/15—Aspects of sound capture and related signal processing for recording or reproduction
Definitions
- the present invention relates to a method, a device and a system for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format.
- a headset comprising a left- and a right-channel microphone at the left and right earpieces of the headset may be used as a binaural capture device, thus capturing the sound at each respective ear of the user wearing the binaural capture device. Accordingly, binaural capture devices are generally good at capturing the voice of the user or the sound as perceived by the user. Binaural capturing devices are hence a convenient choice for recording podcasts, interviews, conferences, and the like.
- a method for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format.
- the method comprises capturing an initial media stream by a mobile user device operated by a user and a head-mounted binaural capturing device worn by the user and coupled to the user device.
- the initial media stream comprises: a video stream captured by a camera of the user device, a first audio stream comprising a set of N ≥ 2 first channel signals captured by a set of N microphones of the user device, and a second audio stream comprising a pair of second channel signals captured by a left- and a right-channel microphone of the binaural capturing device.
- while capturing the initial media stream, the user device is held in front of the binaural capturing device in an orientation corresponding to a landscape or a portrait mode.
- the method further comprises processing the channel signals of the first and second audio streams to extract a set of audio objects.
- the method further comprises obtaining orientation data indicative of whether, during the capturing of the initial media stream, the user device is in the landscape or the portrait mode.
- the method further comprises estimating spatial information comprising, for each audio object, a horizontal direction of arrival estimated based on a set of at least three channel signals of the first and second channel signals and on the orientation data.
- the method further comprises panning each audio object in accordance with the spatial information to one or more channels of the multichannel format to generate the upmixed audio stream.
- the method further comprises combining the video stream and the upmixed audio stream to generate the audio-visual media stream.
- the first aspect of the present invention is based on the insight that a more immersive audio-visual media stream may be generated by, simultaneously with capturing a video stream by a mobile user device (e.g. a mobile phone or a tablet computer), capturing both a first and a second audio stream using a set of microphones of the user device and a pair of microphones of a head-worn binaural capturing device (which e.g. may be embodied by a headset), and subsequently using the two captured audio streams to generate an upmixed audio stream in a target multichannel format, e.g. an immersive multichannel format such as 5.1, 7.1, 5.1.2, 7.1.4, or a First Order Ambisonics (FOA) format.
- the method subjects the channel signals of the first and second audio streams to audio object extraction, and subsequently pans the audio objects to the channels of the multichannel format.
- the panning of each audio object is based on spatial information comprising (at least) a horizontal direction of arrival for each audio object.
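As a concrete illustration of such spatial panning, the sketch below pans one audio object channel signal to a 5.1 bed using a pairwise constant-power law. This is a minimal sketch under stated assumptions: the channel layout, azimuth angles, and function names are illustrative, and the claims do not prescribe a particular panning law or loudspeaker layout.

```python
import numpy as np

# Assumed nominal azimuths (degrees) of a 5.1 horizontal bed; 0 = front,
# positive = left. Illustrative only - the claims leave the layout open.
CHANNEL_AZIMUTHS = {"Rs": -110.0, "R": -30.0, "C": 0.0, "L": 30.0, "Ls": 110.0}

def pan_object(obj_signal: np.ndarray, doa_deg: float) -> dict:
    """Constant-power pan of one audio object channel signal to the pair of
    channels whose nominal azimuths bracket the estimated horizontal DOA."""
    names = sorted(CHANNEL_AZIMUTHS, key=CHANNEL_AZIMUTHS.get)
    azimuths = np.array([CHANNEL_AZIMUTHS[n] for n in names])
    # Simplification: DOAs behind the listener are clipped to the surrounds.
    doa = float(np.clip(doa_deg, azimuths[0], azimuths[-1]))
    hi = min(int(np.searchsorted(azimuths, doa)), len(azimuths) - 1)
    lo = max(hi - 1, 0)
    if lo == hi:
        gains = {names[lo]: 1.0}           # DOA coincides with a channel
    else:
        frac = (doa - azimuths[lo]) / (azimuths[hi] - azimuths[lo])
        gains = {names[lo]: np.cos(frac * np.pi / 2),   # power-preserving pair
                 names[hi]: np.sin(frac * np.pi / 2)}
    out = {n: np.zeros_like(obj_signal) for n in names}
    for name, g in gains.items():
        out[name] = out[name] + g * obj_signal
    return out
```

A full implementation would additionally pan any residual signals as fixed audio beds, as described further below.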
- the estimation of the spatial information is in turn enabled by the simultaneous capturing of the sound from the distributed microphone locations on the user device (held by the user in front of the binaural capturing device) and the binaural capturing device (worn on the head of the user). This allows for robust, yet flexible, panning of the captured audio to the channels of the multichannel format.
- Each horizontal direction of arrival estimate is based on (at least) three channel signals of the captured channel signals such that channel signals from microphones of both the user device and the binaural capturing device are used. That is, the set of microphone signals comprises at least one of the first channel signals and at least one of the second channel signals. This enables left-right as well as front-back discrimination of the direction of arrival for each audio object. Since mobile devices typically allow capturing of videos both in a landscape and a portrait mode, the method further takes orientation data on the user device into account for the estimation of the spatial information, thereby contributing to both the robustness and flexibility of the method.
- a layout of the set of microphones on the user device is such that the N microphones of the user device comprises microphones being horizontally separated when the user device is in a first mode and microphones being vertically separated when the user device is in a second mode, wherein the first mode is the landscape mode and the second mode is the portrait mode or vice versa.
- the horizontal direction of arrival for each audio object may be estimated based on a set of at least three channel signals of the first channel signals captured by the horizontally separated microphones of the user device and the second channel signals.
- the method may accordingly select which microphone signals to use for the estimation of the horizontal direction of arrival of the audio objects, or use the microphone signals from the microphones of both the user device and the binaural capturing device.
- the horizontal direction of arrival for each audio object is estimated based on: at least one first channel signal captured by at least one of the horizontally separated microphones of the user device and the pair of second channel signals, or a pair of first channel signals captured by a pair of the horizontally separated microphones of the user device and at least one of the pair of second channel signals.
- the spatial information further comprises, for each audio object, height information estimated based on a pair of first channel signals captured by a pair of the vertically separated microphones of the user device.
- the horizontal direction of arrival for each audio object may be estimated based on: the pair of second channel signals captured by the microphones of the binaural capturing device and at least one first channel signal captured by at least one of the microphones of the user device (e.g. at least one of the vertically separated microphones).
- the first mode is the landscape mode and the second mode is the portrait mode.
- in the landscape mode, two pairs of horizontally separated microphones are hence available: the pair of microphones of the binaural capturing device and (at least) a pair of horizontally separated microphones of the user device.
- the method may accordingly select which pair of microphone signals to use for the estimation of the horizontal direction of arrival of the audio objects, or use both for redundancy and improved robustness.
- in the portrait mode, the user device instead provides the pair of vertically separated microphones, which thus enables estimating height information in the form of a direction of arrival in a vertical plane for each audio object.
- Typical mobile user devices have an elongated shape, i.e. a height dimension exceeding a width dimension.
- the landscape mode and the portrait mode hence tend to be associated with a horizontal orientation of the height dimension and a vertical orientation of the height dimension, respectively.
- the elongated shape further typically implies that a microphone separation tends to be greater along the height dimension than along the width dimension of the device. The method may hence take advantage of this by estimating the height information when the user device is in the portrait mode.
- the user device has a height dimension and a width dimension, and wherein the horizontally separated microphones and the vertically separated microphones are separated along the height dimension of the user device.
- the horizontally and vertically separated microphones of the user device have at least one microphone in common. Together with the microphones of the binaural capturing device, this allows estimation of both horizontal direction of arrival and height information by a user device with a configuration of three microphones (if one microphone is common) or only two microphones (if two microphones are in common).
- the horizontally and vertically separated microphones may refer to the same microphones of the user device, e.g. a first and a second microphone separated along the height dimension of the user device. The first microphone may be positioned at a bottom portion of the user device and the second microphone may be positioned at a top portion of the user device.
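To make the landscape/portrait role swap of such a layout concrete, a minimal sketch is given below. The data structure and names are hypothetical; they merely encode that the height-dimension pair serves as the horizontally separated pair in landscape mode and as the vertically separated pair in portrait mode.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MicLayout:
    """Indices of user-device microphones separated along the height and
    (optionally) the width dimension. For the two-microphone layout above,
    only the height-dimension pair exists."""
    along_height: Tuple[int, int] = (0, 1)   # e.g. bottom mic, top mic
    along_width: Optional[Tuple[int, int]] = None

def separated_pairs(layout: MicLayout, mode: str):
    """Return (horizontally_separated, vertically_separated) microphone
    index pairs for the given orientation mode."""
    if mode == "landscape":
        # The height dimension lies in the horizontal plane.
        return layout.along_height, layout.along_width
    if mode == "portrait":
        # The same height-dimension pair now provides vertical separation,
        # i.e. height cues; any width-dimension pair becomes horizontal.
        return layout.along_width, layout.along_height
    raise ValueError(f"unknown orientation mode: {mode!r}")
```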
- the processing of the channel signals of the first and second audio streams comprises extracting, for each audio object of the set of audio objects, a representation of the audio object from each of the first and second channel signals, and wherein the method further comprises, for each audio object, selecting, among the representations of the audio object, one representation of the audio object to be used for the panning process, wherein the selection is based on the spatial information estimated for said audio object. This enables a single one of the available representations of an extracted object to be selected and used for the panning process, wherein the selection may be aided by the spatial information.
- the representation of the audio object extracted from the channel signal captured by the microphone closest to the direction of arrival estimated for the audio object may be selected to be used for the panning process. It may be expected that the audio signal captured by the microphone closest to the direction of arrival of the audio object comprises the highest quality audio data relating to the audio object.
- a computer program product comprising computer program code to perform, when executed on a computer, the method according to the first aspect or any of the embodiments thereof.
- a mobile user device for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format.
- the user device comprises: a camera and a set of N ≥ 2 microphones, and an input interface configured to receive an initial media stream comprising: a video stream captured by the camera of the user device, a first audio stream comprising a set of N first channel signals captured by the set of microphones, and a second audio stream comprising a pair of second channel signals captured by a left- and a right-channel microphone of a binaural capturing device for mounting on a head of a user.
- the initial media stream is captured while the binaural capturing device is worn by the user and the user device is held in front of the binaural capturing device in an orientation corresponding to a landscape or a portrait mode.
- the device further comprises an upmixer configured to: process the channel signals of the first and second audio streams to extract a set of audio objects; obtain orientation data indicating whether, during the capturing of the initial media stream, the user device is in the landscape or the portrait mode; estimate spatial information comprising, for each audio object, a horizontal direction of arrival estimated based on a set of at least three channel signals of the first and second channel signals and on the orientation data; and pan each audio object in accordance with the spatial information to one or more channels of the multichannel format to generate the upmixed audio stream.
- the device further comprises a combiner configured to combine the video stream and the upmixed audio stream to generate the audio-visual media stream.
- a system for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format, the system comprising the user device according to the third aspect or any embodiments thereof; and a binaural capturing device for mounting on a head of a user and comprising a left- and a right-channel microphone.
- the invention according to the second, third and fourth aspects features the same or equivalent benefits as the invention according to the first aspect. Any functions described in relation to the first aspect may have corresponding features in a system, and vice versa.
- Figure 1 depicts various capturing scenarios involving a user, a user device and a binaural capturing device.
- Figure 2 shows a user device from the perspective of a user.
- Figure 3a-b show in greater detail a capturing process conducted with the user device in a landscape mode and a portrait mode, respectively.
- Figure 4 and 5 show block diagrams of example implementations of a system for processing an initial media stream and of an upmixer for generating the upmixed audio stream.
- Figure 6a-b schematically depict estimation of spatial information comprising a horizontal direction of arrival for an audio object.
- Figure 7 schematically depicts estimating of spatial information comprising height information for an audio object.
- Figure 8 depicts a block diagram of an example implementation of a panning block.
- Figure 9a-b depict in a top-down view and a rear-side view, respectively, a user device according to a further example implementation.
- Figure 10 is a flow chart of a method for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format.
- Fig. 1 depicts a user 1 performing a simultaneous capturing of an audio-visual media stream (initial audio-visual media stream).
- the audio-visual media stream is captured by a system (capturing system) comprising a mobile (i.e. portable) user device 10, in the shape of a mobile phone (smartphone), and a binaural capturing device 20, in the shape of a headset worn on the head of the user 1.
- the audio-visual media stream comprises a video stream captured by a camera 14a of the user device 10.
- the camera 14a is provided on a rear- or backside of the user device 10 and may in the following also be referred to as the main camera 14a of the user device 10.
- the audio-visual media stream further comprises a first audio stream comprising a set of first channel signals captured by a corresponding number of microphones of the user device 10. More specifically, and as further described herein, the first audio stream comprises a set of N ≥ 2 channel signals (termed first channel signals) captured by a set of N microphones of the user device 10.
- the audio-visual media stream further comprises a second audio stream comprising a pair of second channel signals captured by a left- and a right-channel microphone 20a, 20b of the binaural capturing device 20.
- the microphones 20a, 20b are provided in a respective earpiece of the headset.
- the binaural capturing device 20 may thus record a binaural audio stream of two channel signals; a left-channel signal and a right-channel signal.
- the channel signals of the second / binaural audio stream may in the following be referred to as the second channel signals.
- the binaural capturing device 20 may be connected to the user device 10 wirelessly (e.g. by means of a Bluetooth communication link, or employing any other suitable conventional wireless communication protocol) or by wires.
- the user device 10 may accordingly receive the second audio stream from the binaural capturing device 20, e.g. in real-time.
- the initial media stream is captured while the user device 10 is held by the user 1 in front of the binaural capturing device 20 with the camera 14a facing in a substantially horizontal shooting direction.
- the user device 10 is more specifically held with an orientation corresponding to a landscape mode. That is, the height dimension of the user device 10 (which corresponds to the longitudinal dimension of the user device 10) is oriented substantially parallel to the horizontal plane.
- the user 1 holds the user device 10 by means of a holding or extension device (e.g. a grip pole or monopod, also called “selfie stick”).
- the right-hand side of Fig. 1 shows a corresponding scenario of a user 1’ capturing an initial audio-visual media stream by a user device 10’ and a headset-type binaural capturing device 20’ (of which only one earpiece is visible, the other one being hidden from view by the head of the user 1’).
- the capturing scenario for user 1’ however differs in that the user device 10’ is in the shape of a tablet computer and is held with an orientation corresponding to a portrait mode, i.e. the height dimension is oriented substantially vertically (transverse to the horizontal plane).
- a user device being “held” or “handheld” by a user is intended to cover the user device being held either directly (as in the scenario of user 1’ and user device 10’) or “indirectly” using a holding device (as in the scenario of user 1 and user device 10).
- the capturing processes depicted in Fig. 1 may be performed for the purpose of recording a podcast, an interview with one or more other persons than the user 1, recording an event or a scene, e.g. for personal moment sharing on social media, etc. It is noted that these examples merely are illustrative and should not be construed as limiting.
- Fig. 2 schematically shows a closer view of the user device 10, from the perspective of the user 1 in Fig. 1, wherein the user device 10 by way of example is held directly in the user’s hand 1a.
- the user device 10 is held in front of the user 1, with the camera 14a (hidden from view in Fig. 2) facing in a substantially horizontal shooting direction away from the user 1, i.e. in a front direction relative the user 1.
- the front-side of the user device 10, which as shown may be provided with a screen 16, is hence facing the user 1.
- Fig. 2 further indicates the height dimension H and a width dimension W of the user device 10.
- Fig. 3a-b show in greater detail a capturing scenario wherein the user device 10 is in the landscape mode and the portrait mode, respectively.
- the video stream is captured by the main camera (14a in Fig. 1) and V denotes the field of view of the main camera.
- the shooting direction S of the camera is hence in both cases directed away from the user 1, in a front or forward direction relative the user 1.
- the left and right designations of the microphones thus correspond to the left and right lateral sides of the shooting direction S.
- the user device 10 comprises a pair of a first microphone 12a and a second microphone 12b (i.e. N = 2).
- the layout of the microphones 12a-b on the user device 10 is such that the first microphone 12a is positioned at a bottom portion of the user device 10 and the second microphone 12b is positioned at a top portion of the user device 10.
- the first and second microphones 12a-b are separated along the height dimension H of the user device 10.
- the first and second microphones 12a-b are horizontally separated (i.e. separated along the horizontal plane) while in Fig. 3b the first and second microphones 12a-b are vertically separated (i.e. separated along a vertical plane).
- the first and second microphones 12a-b may be referred to as a pair of horizontally separated microphones of the user device 10 and in Fig. 3b, the first and second microphones 12a-b may be referred to as a pair of vertically separated microphones of the user device 10.
- first and second combinations of microphones of the user device may in some implementations have at least one microphone in common (e.g. as in the case of microphones 12a-c of the user device 10’ in Fig. 9a-b). In some implementations the first and second combinations of microphones may refer to the same combination of microphones of the user device (e.g. as in the case of microphones 12a-b of the user device 10 in Fig. 3a-b).
- the left- and right-channel microphones 20a-b of the binaural capturing device 20 are in both Fig. 3a and 3b horizontally separated.
- the microphones 12a-b and 20a-b may be omnidirectional or directional microphones.
- the set of microphones 12a-b of the user device 10 together with the microphones 20a-b of the binaural capturing device 20 hence define a microphone array with a variable spatial configuration. Therefore, an upmixing process, as further described below, will take orientation data indicative of the orientation mode of the user device 10 into account for estimating spatial information for extracted audio objects.
- Figs. 4 and 5 show block diagrams of example implementations of a system 100 for processing an initial captured audio-visual media stream M1 to generate an audio-visual media stream M2 comprising an upmixed audio stream A3 in a multichannel format, and an upmixer 110 for generating the upmixed audio stream A3.
- the initial media stream M1 is received as input by the system 100.
- the initial media stream M1 comprises as shown a video stream V, a first audio stream A1 comprising a set of N ≥ 2 first channel signals, and a second audio stream A2 comprising a pair of second channel signals.
- the upmixer 110 is configured to receive and process the first and second audio streams A1, A2, and generate the upmixed audio stream A3.
- the system 100 further comprises a combiner 120 configured to combine the video stream V and the upmixed audio stream A3 to generate the audio-visual media stream M2.
- the initial media stream M1 may e.g. be captured by the above-described capturing system comprising the user device 10 and the binaural capturing device 20. Accordingly, the video stream V may be captured by the main camera 14a of the user device 10, the first audio stream A1 may be captured by the first and second microphones 12a-b of the user device 10, and the second audio stream A2 be captured by the microphones 20a-b of the binaural capturing device 20. Analogous to the preceding discussion, the initial media stream M1 is captured while the user 1 holds the user device 10 in front of the binaural capturing device 20 in an orientation corresponding to a landscape or a portrait mode, and with the camera facing in a substantially horizontal shooting direction, e.g. as shown in Fig. 3a and 3b.
- the system 100, including the upmixer 110 and the combiner 120, may for example be implemented in the user device 10, such that the upmixing and combining steps are performed entirely at the side of the user device 10.
- the system 100 is generally not dependent on any particular configuration or form factor of the implementing device.
- the system 100 may be implemented by a device separate from the user device 10.
- the system 100 may as an example be implemented in a remote server, wherein the initial media stream M1 may be uploaded (e.g. by the user device 10) to the remote server, wherein the server may generate the upmixed audio stream A3 and combine the same with the video stream V to generate the media stream M2.
- a distributed implementation of the system 100 is also possible, wherein the upmixer 110 is implemented in the remote server while the combiner 120 is implemented by the user device 10.
- the first and second audio streams A1, A2 of the initial media stream M1 may be uploaded (e.g. by the user device 10) to the server, wherein the server may generate the upmixed audio stream A3.
- the uploading device (e.g. the user device 10) may then download the upmixed audio stream A3 from the server and combine the same with the video stream V to generate the media stream M2.
- the upmixer 110 may as an initial step in the processing chain comprise a synchronization block 1102 configured to synchronize the channel signals of the first and second audio streams A1, A2 in time. That is, the synchronization aims at temporally aligning the channel signals of the first audio stream A1 and the channel signals of the second audio stream A2 with respect to a common time basis (common clock reference).
- the synchronized first and second audio streams A1, A2 output by the synchronization block 1102 are in Fig. 5 commonly designated A’. Synchronization may be needed if the binaural capturing device (e.g. binaural capturing device 20) is wirelessly coupled to the user device (e.g. the user device 10).
- jitter may be present between a clock of the binaural capturing device (or respective clocks of the left and right earpieces) and a clock of the user device.
- the jitter may otherwise obscure the relative acoustic delays between the audio signals captured by the microphones of the binaural capturing device and the microphones of the user device.
- the synchronization may be implemented by comparing time stamps recorded in frames of the first and second audio signals. Implementations for clock synchronization are as such known in the art and are found for instance in the Simple Network Time Protocol (SNTP) and the Precision Time Protocol (PTP). Further examples of clock synchronization methods include Reference Broadcast Synchronization (RBS).
- while the synchronization block 1102 in Fig. 5 is shown to form part of the upmixer 110, it is to be noted that in case of a wireless coupling between the user device and binaural capturing device, the synchronization may instead be performed by wireless communication circuits maintaining the wireless link. For instance, if the user device and the binaural capturing device are coupled via a Bluetooth link, the synchronization may be provided by a Broadcast Synchronization over Bluetooth (BSB) method implemented by the Bluetooth circuits. As further may be understood, synchronization may be omitted in case the channel signals of the first audio stream A1 and the second audio stream A2 are sufficiently synchronized already upon receipt by the system 100.
- the channel signals of the first and second audio streams A1, A2 are sufficiently synchronized (be it by means of the synchronization block 1102 or due to absence of any appreciable timing errors between the user device and the binaural capturing device) to allow for resolving the relative acoustic delays between the audio signals captured by the microphones of the binaural capturing device and the microphones of the user device.
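As a simple illustration of this alignment requirement, a constant inter-stream offset can be estimated by cross-correlating one channel of each stream. This is only a sketch under the assumption of a fixed offset and float-valued signals; time-varying jitter calls for the timestamp-based protocols mentioned above, and the function name is hypothetical.

```python
import numpy as np

def estimate_offset_samples(ref: np.ndarray, other: np.ndarray,
                            max_lag: int) -> int:
    """Estimate a constant lag of `other` relative to `ref` (in samples) by
    exhaustive cross-correlation over +/- max_lag. A positive result means
    `other` is delayed and should be advanced by that amount to align."""
    n = min(len(ref), len(other))
    ref, other = ref[:n], other[:n]
    lags = range(-max_lag, max_lag + 1)
    scores = [float(np.dot(ref[max(0, -l): n - max(0, l)],
                           other[max(0, l): n - max(0, -l)]))
              for l in lags]
    return int(np.argmax(scores)) - max_lag
```

In practice one would apply this to a first channel signal and a second channel signal after a coarse timestamp-based alignment.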
- the upmixer 110 may as a (further) initial step of the processing chain comprise a leveling and/or equalization (EQ) block 1104.
- the leveling and/or EQ block 1104 may as shown receive as input the output A’ of the synchronization block 1102, comprising the synchronized first and second audio streams A1, A2. In the absence of a synchronization block 1102, the leveling and EQ block 1104 may receive the first and second audio streams A1, A2 as input without any preceding synchronization performed by the upmixer 110.
- the leveling and/or EQ block 1104 may apply leveling and/or equalization to the channel signals of the (synchronized) first and second audio streams Al, A2.
- the output of the leveling and/or EQ block 1104 is denoted A” in Fig. 5.
- the term “input audio stream” and label A” will be used to refer to the collective audio stream comprising the first and second channel signals of the first and second audio streams A1, A2, which may or may not have been subjected to one or more of synchronization, leveling and EQ.
- the upmixer 110 is further configured to process the first and second channel signals of the input audio stream A” to extract a set of audio objects.
- the audio object extraction is implemented by an object extraction block 1106 of the upmixer 110.
- the term “audio object” is used herein to refer to sources or elements of sound captured in the input audio stream A”.
- An audio object may for instance correspond to a sound from a human, an animal, a vehicle or any other object or process being the source of the sound which is captured in the first and second audio streams.
- An audio object may be dynamic, e.g. have a limited temporal duration and/or present time- varying characteristics (such as energy, envelope, spectrum).
- An audio object may also be static, e.g. have a duration coextensive with a duration of the audio streams, and a substantially stationary energy, envelope and spectrum.
- the number of audio objects captured in the input audio stream A” may vary between different capturing scenarios. In general, one or more different audio objects may be extracted. In some instances, like a monologue in a podcast, the input audio stream A” may comprise an audio object corresponding to only a single speaker, possibly together with a background or residual. In other instances, like an interview setting, the input audio stream A” may comprise audio objects corresponding to respective speakers. If the capturing process is conducted in a setting such as a cafeteria, at a busy street or in a park, the input audio stream A” may comprise a number of different audio objects corresponding respectively to speakers, cars driving by, bird chirps, and other sound events typical for such settings.
- An extracted audio object may comprise, or be defined by, an audio object channel signal (corresponding to the actual audio content or audio data) and/or metadata allowing the actual audio content or audio data of the audio object to be derived from the input audio stream A”.
- the metadata may for instance indicate the subbands comprising (occupied by) the audio object, and, optionally, the time of appearance and/or a duration of the audio object in the channel signals of the input audio stream A”.
- the metadata may additionally or alternatively comprise a soft mask (e.g. a soft gain mask) defined such that a representation of the (common) audio object may be derived from each respective channel signal by applying the soft mask to the respective channel signal.
- a representation of an audio object may be used to refer to the representation of the audio object in a respective channel signal.
- common audio object may be used to refer to the audio object to which the representations correspond.
- an extracted (common) audio object may be defined by the set of N+2 representations of the audio object extracted from each of the N+2 channel signals of the input audio stream A”.
- the object extraction block 1106 may in some implementations process each of the first and second channel signals individually, and thus extract from each of the first and second channel signals a respective representation of each (common) audio object.
- the output of the object extraction block 1106 (denoted O in Fig. 5) may thus comprise a set of one or more common audio objects, wherein each common audio object of the set is defined by a respective set of N+2 representations of the respective common audio object, extracted from the channel signals.
- a representation of an audio object extracted from a channel signal may be defined by an audio object channel signal comprising a component of the (common) audio object extracted from the channel signal.
- the output O of the object extraction block 1106 may comprise a set of one or more common audio objects, wherein each common audio object (in turn) is defined by a respective set of N+2 audio object channel signals extracted from the N+2 channel signals.
- a representation of an audio object extracted from a channel signal may be defined by metadata allowing an audio object channel signal comprising a component of the (common) audio object to be derived from the channel signal.
- the output O of the object extraction block 1106 may comprise the N+2 channel signals of the input audio stream A” and metadata allowing the N+2 audio object channel signals of each (common) audio object to be extracted from the N+2 channel signals.
- the metadata may comprise separate (i.e. individual) metadata for each representation of each audio object. For instance, a separate soft mask may be output for each representation of each (common) audio object.
- the metadata may also comprise, for each respective (common) audio object, shared metadata allowing the N+2 audio object channel signals of the (common) audio object to be derived from the N+2 channel signals.
- a shared soft mask may be output for each (common) audio object.
- the output O of the object extraction block 1106 may comprise the individual channel signals of the input audio stream A”, and a set of shared soft masks for each common audio object, or a set of N+2 individual soft masks for each common audio object. In either case, the output of the object extraction block 1106 allows the respective audio object channel signals of each audio object to be derived by applying the respective soft mask to the respective channel signals.
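The shared-mask variant can be illustrated with a short STFT-domain sketch: one soft gain mask, applied to each of the N+2 channel signals, yields the audio object channel signals of one common audio object. The parameter values and function name are assumptions for illustration.

```python
import numpy as np
from scipy.signal import stft, istft

def derive_object_channels(channels, mask, fs=48000, nperseg=1024):
    """Apply one shared soft gain mask (frequency bins x frames, values in
    [0, 1]) to every channel signal to derive the per-channel audio object
    representations. The mask shape must match the STFT grid."""
    object_channels = []
    for x in channels:
        _, _, X = stft(x, fs=fs, nperseg=nperseg)    # complex spectrogram
        _, y = istft(X * mask, fs=fs, nperseg=nperseg)
        object_channels.append(y[: len(x)])          # trim synthesis padding
    return object_channels
```

With individual masks, the same loop would simply index a per-channel mask instead of reusing the shared one.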
- Associated audio object representations may hence be grouped (e.g. labelled) as a set of audio object representations relating to a common audio object (e.g. the same source).
- the object extraction block 1106 may be configured to apply a frequency domain transform already prior to the audio object extraction, wherein the audio object extraction may be performed on the frequency domain representations of the channel signals.
- a (further) frequency domain transform of the extracted audio objects may in this case be skipped.
- the extracted audio objects may in a subsequent step of the upmixing process (e.g. in connection with the step of estimating the spatial information and/or the panning step) be subjected to an inverse transform (e.g. inverse-STFT) to transform the extracted audio objects back into the time domain.
- the portions of the channel signals (audio data) not being extracted as (e.g. belonging to or being associated with) an audio object may be referred to as residual signals.
- any residual signals may be treated as audio beds during the panning process, i.e. be panned to channels in accordance with predetermined fixed panning rules.
- the residual signals are in Fig. 5 denoted R.
- the audio object extraction may be implemented by machine learning (ML)-based algorithms or models like convolutional neural networks (CNNs) or recurrent neural networks (RNNs), or by digital signal processing (DSP)-based algorithms, or combinations thereof.
- the object extraction block 1106 may be configured to apply an ML-based noise reduction algorithm trained to distinguish sound events such as speech, music and/or bird chirps in an input channel signal from an acoustic background (residual).
- the noise-reduced output signals may be taken as the extracted representations of the audio objects.
- the representations may then be associated (e.g. grouped) using the approach outlined above.
- Many other ML-based audio object extraction approaches are known in the art, e.g. neural network-based models trained to distinguish (separate) sound sources from an audio signal, and may be used for implementing the audio object extraction.
- the object extraction block 1106 may be configured to implement a correlation-based DSP-algorithm to extract the audio objects.
- the channel signals of the input audio stream may be divided into several frequency bands (i.e. after applying a frequency domain transform like STFT). Correlations may then be calculated for each frequency band across all channel signals. Bands with a sufficiently strong correlation across time and input channels (e.g. time-frequency tiles with a correlation exceeding a threshold) may be grouped to define a respective audio object.
- each audio object is defined by correlated frequency bands (e.g. correlated time-frequency tiles) of the channel signals.
- the object extraction block 1106 may process the channel signals in a frequency domain (e.g. the STFT of the channel signals), and generate a soft mask corresponding to each audio object detected in each channel signal. Similar soft masks derived from different channel signals may optionally then be averaged to generate a shared soft mask for each audio object. The representations of each audio object may then be extracted by applying the respective shared soft masks to each channel signal.
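A minimal sketch of this correlation-based variant follows: per frequency band and per block of STFT frames, a normalised cross-correlation against a reference channel is computed, and tiles correlated above a threshold across all channels survive in a shared soft mask. Block size, threshold, and the exact correlation measure are illustrative assumptions, not prescribed by the text.

```python
import numpy as np
from scipy.signal import stft

def coherence_soft_mask(channels, fs=48000, nperseg=1024, block=16, thr=0.6):
    """Shared soft mask from inter-channel correlation: a time-frequency tile
    is attributed to a (common) audio object when its normalised correlation
    with the reference channel exceeds `thr` for every channel signal."""
    specs = [stft(x, fs=fs, nperseg=nperseg)[2] for x in channels]
    ref = specs[0]
    n_bands, n_frames = ref.shape
    mask = np.ones((n_bands, n_frames))
    for spec in specs[1:]:
        score = np.zeros_like(mask)
        for start in range(0, n_frames, block):
            sl = slice(start, min(start + block, n_frames))
            # Normalised cross-spectrum magnitude per band, averaged over the
            # block of frames (bounded by 1 via the Cauchy-Schwarz inequality).
            num = np.abs(np.mean(ref[:, sl] * np.conj(spec[:, sl]), axis=1))
            den = np.sqrt(np.mean(np.abs(ref[:, sl]) ** 2, axis=1)
                          * np.mean(np.abs(spec[:, sl]) ** 2, axis=1)) + 1e-12
            # Soft threshold: ramp from 0 at `thr` to 1 at perfect correlation.
            score[:, sl] = (((num / den) - thr) / (1.0 - thr)).clip(0, 1)[:, None]
        mask = np.minimum(mask, score)   # require correlation across ALL channels
    return mask
```

The resulting mask could then feed a mask-application step such as the derive_object_channels sketch above.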
- the object extraction is performed on each of the channel signals of the first and second audio streams A1, A2, e.g. the first channel signals captured by the microphones 12a-b of the user device 10 and the second channel signals captured by the microphones 20a-b of the binaural capturing device 20.
- the upmixer 110 is further configured to estimate spatial information for each of the audio objects O.
- the estimation of spatial information is implemented by a spatial information estimation block 1108 (for conciseness termed “spatial block 1108” in the following).
- the spatial information output by the spatial block 1108 is denoted S in Fig. 5.
- the spatial information comprises at least a horizontal direction of arrival (DOA) for each audio object, i.e. the DOA for the sound source corresponding to the audio object, as seen in a horizontal plane.
- a DOA may in principle be estimated using only the pair of channel signals captured by a binaural capturing device, e.g. employing techniques based on a head-related transfer function (HRTF).
- however, a horizontal DOA may in some cases be estimated more robustly based on a set of three or more channel signals captured by a corresponding set of three or more microphones with a separation in the horizontal plane.
- the spatial block 1108 is configured to estimate at least horizontal spatial information in the form of a horizontal DOA for each audio object based on a set of at least three channel signals of the first and second channel signals. More specifically, the horizontal DOA for each audio object may be estimated based on the representations of the audio object extracted from three (or more) of the first and second channel signals.
- where it is stated that the estimation of spatial information for an audio object is based on or uses three or more of the first and second channel signals (or microphone signals), references to these channel signals may be understood as references to the audio object channel signals extracted or derived therefrom and comprising the audio object.
- the spatial block 1108 may thus receive, for each audio object, a respective set of N+2 audio object channel signals, or the N+2 channel signals of the input audio stream A” and metadata (e.g. soft masks, individual or shared) allowing N+2 audio object channel signals to be derived from the N+2 channel signals.
- the spatial block 1108 may be configured to, as an initial step, derive the N+2 audio object channel signals for each audio object from the N+2 channel signals of the input audio stream A” using the metadata.
- a capturing process employing a mobile user device may be performed in both a landscape and a portrait mode.
- the physical configuration of the user device microphones in space (i.e. the spatial locations of the microphones relative to the physical surroundings) hence differs between the landscape and the portrait mode.
- the algorithm for estimating the spatial information implemented by the spatial block 1108 is adaptable in the sense that it further is based on orientation data indicative of whether the user device is in a landscape or portrait mode.
- the orientation data allows the spatial block 1108 to adapt the spatial information estimation process in accordance with the orientation of the user device, and thus to the spatial (physical) locations of the microphones (e.g. 12a-b) of the user device (e.g. 10), during the capturing of the initial media stream M1.
- the orientation data may be obtained by an orientation sensor (e.g. based on a gyroscope or accelerometer) of the user device and may be indicative of whether, during the capturing of the initial media stream M1, the user device is in the landscape or the portrait mode.
- the orientation data may be included in a metadata stream of the initial media stream M1 and be provided as input to the upmixer 110, together with the first and second audio streams A1, A2.
- the orientation data may also be received separately from the initial media stream M1, e.g. obtained directly from the orientation sensor.
- the orientation data may for instance indicate an actual orientation angle of the user device.
- the orientation data may in a more basic example simply indicate the orientation mode of the user device, i.e. landscape mode or portrait mode (e.g. a binary indication).
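As an illustration of the binary variant, the orientation mode can be classified from a gravity-dominated accelerometer sample expressed in the device frame. The axis convention (x along the width dimension W, y along the height dimension H) and the upright-holding assumption are illustrative.

```python
def orientation_mode(accel_xyz) -> str:
    """Classify landscape vs portrait from one accelerometer sample
    (device frame: x along width W, y along height H, z out of the screen),
    assuming the device is held roughly upright while filming."""
    gx, gy, _ = accel_xyz
    # Gravity mostly along the height axis => H is vertical => portrait.
    return "portrait" if abs(gy) > abs(gx) else "landscape"
```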
- the spatial block 1108 may expressly take information or data indicative of the layout of the set of microphones on the user device into account for the purpose of estimating the spatial information for the audio objects.
- the layout of the set of microphones on the user device may for instance be provided as predetermined layout information, e.g. retrieved from a device database or look-up-table comprising layout data for various models of user devices.
- a layout may for instance indicate the relative locations of the microphones (e.g. 12a-b) in a frame of reference fixed to the user device. According to a more basic example, the layout may simply indicate a separation between the microphones (e.g. 12a-b) along the height dimension H and/or the width dimension W of the user device.
- the orientation data and the layout information may as shown in Fig. 5 (represented by reference sign 1109) be provided as input to the spatial block 1108.
- the spatial information for the extracted audio objects may be estimated by the spatial block 1108 based on the at least three channel signals (e.g. the audio object signals extracted or derived therefrom) and the spatial locations of the microphones of the user device and the binaural capturing device capturing the at least three channel signals (e.g. the at least three audio object signals for each common audio object).
- the spatial locations of the microphones of the user device may be determined based on the orientation data and the layout of the set of microphones on the user device.
- the microphone locations of the user device and the binaural capturing device may be expressed in the form of coordinates in a common frame of reference, e.g. a frame of reference in which the spatial information is estimated.
- a convenient choice of origin for a frame of reference would be the user device or the binaural capturing device, although other choices are also possible.
- it may suffice to express the microphone locations as relative microphone locations, e.g. in the form of respective distances (horizontally and/or vertically) between the microphones.
- the spatial block 1108 may, based on the orientation data, transform relative locations of the microphones on the user device, indicated in the predetermined layout information, to locations (absolute or relative) in a frame of reference in which the spatial information is estimated.
- if the orientation data indicates a landscape mode, the relative microphone locations in the user device frame of reference may be transformed to coordinates (absolute or relative) in a horizontal plane.
- if the orientation data indicates a portrait mode, the relative microphone locations in the user device frame of reference may be transformed to coordinates or distances in a vertical plane.
- the predetermined layout information may comprise two sets of relative microphone locations, one corresponding to the landscape mode and one corresponding to the portrait mode. The spatial block 1108 may thus, based on the orientation data, select which set of microphone locations from the predetermined layout information to use.
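A sketch of such an orientation-dependent transform is given below: microphone offsets specified in the device frame are mapped to a capture frame with a horizontal forward axis, under the assumption (stated above) that the camera points horizontally. The frame conventions and function name are hypothetical.

```python
import numpy as np

def device_mic_positions(layout_xy: np.ndarray, mode: str) -> np.ndarray:
    """Map microphone offsets given in the device frame (x along the width
    dimension W, y along the height dimension H, in metres) to a capture
    frame (x right, y forward/shooting direction, z up)."""
    out = np.zeros((len(layout_xy), 3))
    if mode == "landscape":
        out[:, 0] = layout_xy[:, 1]   # height axis lies horizontally
        out[:, 2] = layout_xy[:, 0]   # width axis points up
    elif mode == "portrait":
        out[:, 0] = layout_xy[:, 0]   # width axis lies horizontally
        out[:, 2] = layout_xy[:, 1]   # height axis points up
    else:
        raise ValueError(f"unknown orientation mode: {mode!r}")
    return out
```

The binaural microphone positions would then be placed behind the device along the negative forward axis, at the assumed or estimated device-to-head distance.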
- the microphone locations for the binaural capturing device may be determined based on a spatial relationship between the user device and the binaural capturing device.
- the spatial relationship may comprise a distance between the user device and the binaural capturing device.
- the distance may be a predetermined distance or be obtained from sensor data (e.g. focus data obtained by a front-facing camera 14b as discussed below).
- the microphone locations for the binaural capturing device may further be based on a distance between the left- and a right-channel microphones of the binaural capturing device.
- the distance may be a predetermined (e.g. assumed) distance.
- Example implementations of the spatial block 1108 will now be described with reference to the user device 10 and the binaural capturing device 20 and Fig. 6a-b.
- Fig. 6a again depicts the user device 10 of Fig. 1 capturing the initial media stream M1.
- the user device 10 is held in an orientation corresponding to the landscape mode.
- Reference sign 30 schematically indicates an audio object (common audio object) corresponding to a source, extracted from the channel signals of the first and second audio streams A1, A2 by the object extraction block 1106. While Fig. 6a depicts only a single audio object 30 it is noted that the object extraction block 1106 may extract audio objects corresponding to more than one source and that the following description is applicable to each such extracted audio object.
- the pair of microphones 12a-b are positioned on the user device 10 such that the microphones 12a-b are horizontally separated when the user device 10 is in the landscape mode.
- the microphones 12a-b and 20a-b thus define a horizontal non-linear arrangement (array) of four microphones spaced apart in the horizontal plane. As indicated in Fig. 6a, their respective locations may approximately correspond to corners of a rectangle.
- the different locations of the microphones 12a-b and 20a-b in the horizontal plane allow a horizontal direction of arrival θ for the audio object 30 (i.e. the sound emitted by the source corresponding to the audio object 30) to be estimated.
- the terms “horizontal DOA” and “DOA” may in the following be used interchangeably.
- the reference direction may, as shown in Fig. 6a, correspond to or coincide with the horizontal shooting direction S of the camera 14a of the user device 10.
- the reference direction S is assumed to point away from the user 1 (e.g. the user 1 is facing the screen of the user device 10) and coincide with the inter-aural axis of the user 1, located substantially mid-way between the microphones 20a-b of the binaural capturing device 20 (and thus the ears of the user 1).
- if the audio object 30 is located to the left of the reference direction / shooting direction S, it will be perceived as being to the left from the viewpoint of the user device 10 and the user 1, whereas if the audio object 30 is located to the right of the shooting direction S it will be perceived as being to the right from the viewpoint of the user device 10 and the user 1.
- estimation of the horizontal DOA θ for the audio object 30 may in some implementations be separated into two sub-steps: determining an initial DOA estimate resolving the DOA in a left-right sense but comprising a front-back ambiguity; and determining the final DOA θ by resolving the front-back ambiguity.
- the initial DOA may thus be estimated using the two first microphone signals from the microphones 12a-b of the user device 10 (i.e. the pair of horizontally separated microphones), whereafter the front-back ambiguity may be resolved using at least one of the second channel signals from the binaural capturing device 20.
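A sketch of the second sub-step is given below: the mirrored candidate of the initial estimate is discarded using the sign of the arrival-time difference between a device microphone and a binaural microphone. The sign convention and function name are assumptions for illustration.

```python
def resolve_front_back(theta_deg: float, tau_device_minus_binaural: float) -> float:
    """Resolve the front-back ambiguity of an initial left-right DOA estimate.
    `theta_deg` is the initial estimate in (-90, 90], 0 = shooting direction S.
    `tau_device_minus_binaural` is the arrival-time difference (seconds)
    between a user-device microphone (in front of the user) and a binaural
    microphone: assumed negative when the sound reaches the device first,
    i.e. when the source is in front."""
    if tau_device_minus_binaural < 0:      # source in front: keep the estimate
        return theta_deg
    # Source behind: mirror the estimate across the left-right axis.
    return 180.0 - theta_deg if theta_deg >= 0 else -180.0 - theta_deg
```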
- Fig. 6b depicts the user device 10 capturing the initial media stream M1 while held in an orientation corresponding to the portrait mode.
- the pair of microphones 12a-b are positioned on the user device 10 such that the microphones 12a-b are vertically separated when the user device 10 is in the portrait mode.
- the microphones 12a-b and 20a-b thus define a non-linear arrangement (array) of three microphones spaced apart in the horizontal plane (since the vertically separated microphones 12a-b have the same location in the horizontal plane).
- their respective locations may approximately correspond to corners of a triangle, e.g. an isosceles triangle.
- the spatial block 1108 may accordingly estimate the DOA θ based on a selected set of three channel signals selected among the first and second channel signals such that the selected set of channel signals comprises at least one of the first channel signals and at least one of the second channel signals, wherein the selection is based on the orientation data.
- in case the orientation data is indicative of the portrait mode, the spatial block 1108 may estimate the DOA θ based on the pair of second channel signals from the microphones 20a-b of the binaural capturing device 20 and one of the first channel signals from one of the microphones 12a-b of the user device 10.
- in case the orientation data is indicative of the landscape mode, the spatial block 1108 may estimate the DOA θ based on the pair of first channel signals from the microphones 12a-b of the user device 10 and one of the second channel signals from the microphones 20a-b of the binaural capturing device 20.
- the selected set of channel signals may hence correspond to a (strict) sub-set of the first and second channel signals.
- the spatial block 1108 may be configured to, when the user device 10 is in the landscape mode, estimate the DOA θ based on the pair of first channel signals from the microphones 12a-b of the user device 10 and one of the second channel signals from the microphones 20a-b of the binaural capturing device 20.
- the selection in the case of a landscape orientation may hence be “preconfigured” such that the spatial block 1108 defaults to this selection.
- given an estimated time-of-arrival difference between a pair of channel signals, a DOA θ (with front-back ambiguity) for the sound may be estimated from θ = arcsin(c·τ/d), where c is the speed of sound, τ is the estimated time-of-arrival difference, and d is the distance between the pair of microphones.
- the distance d may be a predetermined (i.e. assumed) distance. If the pair of microphones are the microphones 12a-b of the user device 10, the distance d may be determined based on the orientation data and the layout of the microphones 12a-b on the user device 10 (e.g. the separation between the microphones along the height dimension H).
- the distance d may be a predetermined distance, set by assuming that the user 1 typically holds the user device 10 at a certain distance in front of the face (e.g. 0.3-0.4 m) during the capturing process.
- the distance information may be supplied as input 1109 to the spatial block 1108 (see Fig. 5).
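The delay-and-geometry estimate above can be sketched in a few lines: the time-of-arrival difference is found via generalized cross-correlation (here with PHAT weighting, a common choice not mandated by the text), and the far-field formula θ = arcsin(c·τ/d) then yields a DOA with front-back ambiguity.

```python
import numpy as np

def tdoa_gcc_phat(x: np.ndarray, y: np.ndarray, fs: int, max_tau: float) -> float:
    """Time-of-arrival difference (seconds) of x relative to y, estimated by
    generalized cross-correlation with PHAT weighting. Positive result:
    x arrives later than y."""
    n = 2 * max(len(x), len(y))
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    cross = X * np.conj(Y)
    r = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    max_shift = int(max_tau * fs)
    r = np.concatenate((r[-max_shift:], r[: max_shift + 1]))  # lags -max..+max
    return float(np.argmax(np.abs(r)) - max_shift) / fs

def doa_from_tdoa(tau: float, d: float, c: float = 343.0) -> float:
    """Far-field DOA in degrees (front-back ambiguous) for a microphone pair
    separated by d metres: theta = arcsin(c * tau / d)."""
    return float(np.degrees(np.arcsin(np.clip(c * tau / d, -1.0, 1.0))))
```

For the landscape case of Fig. 6a, d would be the microphone separation along the height dimension H; the remaining front-back ambiguity can then be resolved as sketched earlier.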
- the spatial block 1108 may in some implementations be configured to, in case the orientation data is indicative of the landscape mode, estimate the DOA θ based on each of the first channel signals from the microphones 12a-b of the user device 10 and each of the second channel signals from the microphones 20a-b of the binaural capturing device 20.
- correspondingly, the spatial block 1108 may be configured to, in case the orientation data is indicative of the portrait mode, estimate the DOA θ based on each of the first channel signals from the microphones 12a-b of the user device 10 and each of the second channel signals from the microphones 20a-b of the binaural capturing device 20. It is contemplated that basing the estimation of the DOA θ on audio data from additional (e.g. in a sense redundant) channel signals may increase the robustness and accuracy of the estimation.
- DOA estimation algorithms may, given microphone signals (i.e. audio object channel signals) from a horizontal non-linear arrangement of three (or more) microphones, directly estimate a front-back resolved horizontal DOA θ.
- other example algorithms that may be used to estimate the horizontal DOA θ include DSP-based algorithms based on Generalized Cross Correlation (GCC) or Steered Response Power (SRP), as well as ML-based algorithms trained to estimate (predict) a DOA based on an input set of audio object channel signals and the locations (e.g. coordinates) of the microphones used to capture the channel signals.
- some algorithms may further allow estimation of horizontal information comprising not only the horizontal DOA θ, but also the locations of the audio objects (e.g. horizontal plane coordinates of the audio objects).
- the difference in time-of-arrival of the audio object 30 at the microphones may be (expressly) estimated by comparing the audio object channel signals of the audio object 30 from three (or more) of the microphones (e.g. by searching for the inter-channel time delays maximizing the correlation between the audio object channel signals).
- the DOA 0 may then be estimated using the estimated time-of-arrival differences together with the spatial locations of the microphones capturing the respective channel signals.
- the horizontal DOA 0 may be estimated directly from relative time delays between at least three of the captured channel signals.
- the relative time delays may correspond to time-of-arrival differences of the sound of the audio object at the respective microphones (e.g. the time-of- arrival differences of the audio object 30 between three or more of the microphones 12a-b and 20a-b).
- the time delays may be estimated by comparing the channel signals from the three (or more) microphones e.g. by searching for the inter-channel time delays maximizing the correlation between the channel signals or between the audio object channel signals extracted therefrom.
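A minimal sketch of such a correlation-based delay search, assuming SciPy is available and that the channel signals are already synchronized and sampled at a common rate fs (the function name and sign convention are ours):

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

def estimate_delay(x: np.ndarray, y: np.ndarray, fs: float, max_lag: int) -> float:
    """Inter-channel time delay (seconds) found as the lag maximizing the
    cross-correlation, with the search restricted to physically plausible lags."""
    corr = correlate(y, x, mode="full")
    lags = correlation_lags(len(y), len(x), mode="full")
    keep = np.abs(lags) <= max_lag
    best = lags[keep][np.argmax(corr[keep])]
    return best / fs  # sign convention follows SciPy's correlation_lags
```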
- the DOA θ may then be estimated from the estimated time delays using a mapping function relating estimated time delays between the channel signals to a DOA θ estimate.
- the mapping function may be a predetermined mapping function established for instance in a measurement procedure comprising playing back a test sound in an anechoic room and measuring the relative time delays between the microphones of the user device and the binaural capturing device for a plurality of different directions of arrival of the test sound.
- the user device and the binaural capturing device may for instance be positioned on a turntable to allow precise control over the angle of the microphones relative to the test sound. This process can be performed for the user device both in a landscape mode and a portrait mode. Thereby, for any given angle, the expected time delay between the microphones for the landscape mode and the portrait mode, respectively, may be established and captured in a mapping function to be used to estimate the DOA θ in the upmixing process.
- the mapping function may for instance be realized as a look-up table, or as a mathematical function fitted to the measurement data.
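A minimal look-up-table sketch of such a mapping function; the table entries below are placeholders standing in for measured turntable data (roughly consistent with a 0.15 m baseline, purely for illustration), not values from the text:

```python
import numpy as np

# Placeholder (delay, DOA) pairs; in practice one table per orientation mode
# would be measured as described above.
measured_delays = np.array([-4.4e-4, -2.2e-4, 0.0, 2.2e-4, 4.4e-4])  # seconds
measured_doas = np.radians([-90.0, -30.0, 0.0, 30.0, 90.0])          # radians

def map_delay_to_doa(tau: float) -> float:
    # Linear interpolation in the look-up table; delays outside the measured
    # range are clamped to the endpoint values.
    return float(np.interp(tau, measured_delays, measured_doas))
```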
- An analogous approach and measurement procedure could be used to establish respective mapping functions for the user device and the binaural capturing device.
- an initial DOA θ may be estimated from the estimated time delays between (at least) two channel signals captured by (at least) two microphones of the user device or the binaural capturing device, after which a front-back ambiguity may be resolved using a channel signal captured by a microphone of the other device.
- the spatial block 1108 may thus select which of the mapping functions to use in accordance with the orientation data.
- the spatial block 1108 may accordingly be configured to estimate the DOA θ (and optionally the location) based on differences in time-of-arrival (i.e. time delays) of the sound of the audio object 30 at the respective microphones (e.g. three or more of the microphones 12a-b and 20a-b).
- the DOA θ for an audio object may be estimated based on time delays between a set of at least three of the first and second channel signals and based on the orientation data.
- estimation of the DOA θ for the audio object may further be based on the spatial locations of the microphones capturing the set of at least three channel signals.
- the microphone locations for the user device may be determined based on the orientation data and on information indicative of the layout of the microphones on the user device.
- the microphone locations for the binaural capturing device may be determined based on a distance (predetermined or assumed) between the left- and right-channel microphones, and a distance (predetermined, assumed or estimated) between the binaural capturing device and the user device.
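A hedged sketch of how the microphone coordinates might be assembled from the orientation data and the assumed distances; the coordinate frame, spacings and default distances are illustrative assumptions, not values from the text:

```python
import numpy as np

def microphone_positions(orientation: str, mic_spacing: float = 0.15,
                         ear_spacing: float = 0.17, d_ub: float = 0.35) -> np.ndarray:
    """Return (x, y, z) positions in metres for [device mic 1, device mic 2,
    left ear mic, right ear mic]; x: right, y: front, z: up; head at origin."""
    if orientation == "landscape":  # device microphones horizontally separated
        dev = [(-mic_spacing / 2, d_ub, 0.0), (mic_spacing / 2, d_ub, 0.0)]
    else:                           # portrait: device microphones vertically separated
        dev = [(0.0, d_ub, -mic_spacing / 2), (0.0, d_ub, mic_spacing / 2)]
    ears = [(-ear_spacing / 2, 0.0, 0.0), (ear_spacing / 2, 0.0, 0.0)]
    return np.array(dev + ears)
```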
- the DOA θ for an audio object may instead be estimated based on time delays between a set of at least three of the first and second channel signals and based on the orientation data, and using a mapping function mapping the estimated time delays to an estimated DOA θ for the orientation mode indicated in the orientation data.
- the upmixer 110 may further be configured to estimate spatial information in the form of height information for each audio object if the user device is in the portrait mode.
- Fig. 7 depicts by way of example the user device 10 (for illustrational clarity omitting the binaural capturing device 20 and the user 1) in the portrait mode. Due to the layout of the microphones 12a-b on the user device 10, the microphones 12a-b are vertically separated (i.e. separated in a vertical plane). The vertical separation allows the spatial block 1108 to, in addition to the horizontal DOA θ, estimate height information for the audio object 30 based on the first channel signals from the microphones 12a-b. The height information may for instance be estimated in the form of an elevation angle φ_u as shown in Fig. 7.
- the elevation angle φ_u or coordinate may be estimated in an analogous manner to the horizontal DOA θ, e.g. based on differences in time-of-arrival (i.e. time delays) of the sound of the audio object 30 at the respective microphones 12a-b.
- a mapping function may be used to map time delays to an elevation angle φ_u, analogous to the mapping function discussed above for estimating a horizontal DOA θ.
- the front-back ambiguity may as discussed above be resolved during the estimation of the horizontal DOA θ and hence need not be separately considered for the height information estimation.
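Under the same far-field reasoning as for the horizontal case, the elevation estimate admits an analogous sketch (names and the assumed speed of sound are ours):

```python
import numpy as np

def elevation_from_tdoa(tau_v: float, d_v: float, c: float = 343.0) -> float:
    """Elevation angle phi_u (radians) from the delay tau_v (seconds) between a
    vertically separated microphone pair with spacing d_v (metres)."""
    return float(np.arcsin(np.clip(c * tau_v / d_v, -1.0, 1.0)))
```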
- the spatial block 1108 may implement an algorithm (e.g. a GCC-, SRP- or ML-based algorithm as discussed above) for estimating the spatial information.
- the above description of the spatial block 1108 has been based on an assumption that the relative positions of the user device 10 and the binaural capturing device 20 are constant during the capturing process.
- the upmixing process may be adapted to accommodate dynamic capturing conditions wherein the spatial relationship between the user device 10 and the binaural capturing device 20 changes during the capturing of the initial media stream.
- the upmixer 110 may in this case be configured to obtain sensor data indicative of a spatial relationship between the user device 10 and the binaural capturing device 20 during the capturing of the initial media stream M1, wherein the estimating of the spatial information is further based on the sensor data.
- the sensor data may form part of the input 1109 shown in Fig. 5.
- the sensor data may be of various types, and in particular include non-acoustic data, such as data from orientation sensors and motion sensors (e.g. based on gyroscopes or accelerometers) of the user device 10 and the binaural capturing device 20.
- based on such sensor data, the spatial block 1108 may adapt the assumed distance between the binaural capturing device 20 and the user device 10 and adapt the estimation of the spatial information accordingly.
- the user device 10 may further comprise a front-side camera 14b (see Fig. 2), wherein a distance between the user device 10 and the binaural capturing device 20 (e.g. d_u-b in Fig. 6a-b) may be estimated from a focusing distance of the front-side camera 14b during the capturing of the initial media stream.
- the upmixer 110 is further configured to pan each audio object extracted from the input audio stream A″ in accordance with the spatial information S to one or more channels of a multichannel format.
- the upmixer 110 may further be configured to pan any residual signal R to one or more channels of the multichannel format. While the audio objects are panned in accordance with the spatial information S, the residual signals R are not associated with any spatial information and are therefore panned to one or more predetermined channels of the multichannel format, e.g. in accordance with a predetermined panning rule.
- the output of the panning process is the upmixed audio stream A3.
- the multichannel format may be a speaker channel-based format such as 3.1.2, 5.1, 7.1, 5.1.2, 7.1.4, or a speaker-independent channel-based format such as First Order Ambisonics (FOA).
- the panning process is implemented by a panning block 1110, of which an implementation example is illustrated in Fig. 8.
- the panning block 1110 receives as input the extracted audio objects O, the spatial information S and any residual signals R.
- the panning block 1110 may receive, for each audio object, a respective set of N+2 audio object channel signals, or the N+2 channel signals of the input audio stream A″ and metadata (e.g. soft masks, individual or shared) allowing N+2 audio object channel signals to be derived from the N+2 channel signals.
- the panning block 1110 may be configured to, prior to panning, derive the N+2 audio object channel signals for each audio object from the N+2 channel signals of the input audio stream A″ using the metadata.
- the panning block 1110 comprises a first sub-block 1112 (audio object panning sub-block) implementing the panning of the audio objects O.
- the audio objects O may be panned to speaker channels of the multichannel format using common panning laws. For instance, if the multichannel format is a speaker channel-based immersive multi-channel format, the panning block 1110 may for each (common) audio object, based on the horizontal DOA θ and (if available) the height information (e.g. elevation angle φ_u), locate the nearest two speakers in the (known) speaker layout and pan the audio object to these two speakers according to the respective distances to the speakers.
- if no height information is available, the audio object may be assumed to be located in the horizontal plane and not panned to any height channel. If not already done earlier in the processing chain, the audio objects may as part of the panning process be subjected to an inverse transform (e.g. inverse-STFT) to transform the extracted audio objects back into the time domain (assuming the audio objects input to the panning process are represented in the frequency domain).
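A simplified stand-in for the nearest-two-speaker panning described above; the text does not prescribe a specific panning law, so this sketch uses a basic pairwise weighting with constant-power normalization:

```python
import numpy as np

def pan_to_pair(theta: float, speaker_azimuths: np.ndarray) -> np.ndarray:
    """Gains panning an object at azimuth theta (radians) to the two speakers
    nearest in angle, weighted by angular distance."""
    diffs = np.angle(np.exp(1j * (speaker_azimuths - theta)))  # wrapped to (-pi, pi]
    order = np.argsort(np.abs(diffs))
    a, b = order[0], order[1]  # indices of the two nearest speakers
    gains = np.zeros(len(speaker_azimuths))
    da, db = abs(diffs[a]), abs(diffs[b])
    gains[a] = db / (da + db + 1e-12)  # the closer speaker gets the larger weight
    gains[b] = 1.0 - gains[a]
    return gains / (np.linalg.norm(gains) + 1e-12)  # constant-power normalization
```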
- the panning block 1110 (e.g. the first sub-block 1112) may be configured to select, for each common audio object, one representation among the N+2 representations of the common audio object to be used for the panning process.
- the other non-selected representations may be discarded. That is, for each common audio object, only the audio object channel signal of the selected representation will be panned.
- the selection of the representation may be based on the spatial information estimated for the corresponding common audio object. More specifically, the selected representation of the common audio object may be the representation extracted from the channel signal captured by the microphone closest to the direction of arrival estimated for the common audio object. To illustrate, considering the example in Fig. 6a, the microphone 12b of the user device 10 is closer to the audio object 30 than any of the other microphones 12a, 20a-b.
- the selected representation may be the representation extracted from the channel signal captured by the microphone closest to the estimated location of the audio object.
- the selected representation may be the representation extracted from the channel signal with the highest signal-to-noise ratio.
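A sketch of this selection logic under an assumed data layout in which reps[i] is the object representation extracted from channel i and mic_azimuths[i] is the corresponding microphone direction (all names are ours):

```python
import numpy as np

def select_representation(reps, mic_azimuths, doa=None, snrs=None):
    """Pick the representation from the microphone closest to the estimated
    DOA; fall back to the highest-SNR channel if no DOA estimate exists."""
    if doa is not None:
        diffs = np.abs(np.angle(np.exp(1j * (np.asarray(mic_azimuths) - doa))))
        return reps[int(np.argmin(diffs))]
    return reps[int(np.argmax(snrs))]  # caller must supply snrs in this case
```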
- certain audio objects may be panned according to predetermined panning rules.
- for instance, the speech of the wearer of the binaural capturing device (e.g. the user 1 wearing the binaural capturing device 20) may be panned according to such a predetermined panning rule.
- Speech of the user 1 may be identified using known techniques such as machine-learning based voice recognition algorithms, trained to recognize the voice of the user.
- the panning block 1110 further comprises a second and third sub-block 1116, 1118 (residual signal panning blocks) configured to pan each residual signal R to one or more channels of the multichannel format.
- the second sub-block 1116 may pan the residual signals R of the first channel signals to a first set of channels (e.g. one or more height channels) of the multichannel format
- the third sub-block 1118 may pan the residual signals R of the second channel signals to a second set of channels of the multichannel format (e.g. one or more channels different from the first set).
- decorrelation may be applied to the respective residual signals R prior to panning.
- the first channel signals from the microphones of the user device may be beamformed (optional beamforming block 1114) to enhance height objects prior to decorrelation and panning.
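The text does not prescribe a decorrelation method; one common realization is a random-phase all-pass filter, sketched here (the function name and seeding scheme are ours):

```python
import numpy as np

def decorrelate(x: np.ndarray, seed: int) -> np.ndarray:
    """All-pass decorrelation: randomize the phase of each frequency bin while
    keeping its magnitude, so copies of a residual routed to several channels
    become mutually decorrelated."""
    rng = np.random.default_rng(seed)
    X = np.fft.rfft(x)
    phases = np.exp(1j * rng.uniform(-np.pi, np.pi, size=X.shape))
    phases[0] = 1.0   # keep the DC bin real
    phases[-1] = 1.0  # keep the Nyquist bin real (even-length signals)
    return np.fft.irfft(X * phases, n=len(x))
```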
- the panned audio objects (i.e. the panned representations of the common audio objects) and any residual signals R are subsequently summed, channel-by-channel, by the mixing block 1120 to generate the upmixed audio stream A3 in the speaker channel-based multichannel format.
- if the multichannel format of the upmixed audio stream A3 is a speaker-independent channel-based format such as FOA, the panning may be implemented by first panning the audio objects O as set out above, and thereafter panning each channel signal s_i of the speaker channel-based multichannel format, having known azimuth θ_i and elevation φ_i, to the FOA channels [W, X, Y, Z], e.g. as W = Σ_i s_i/√2, X = Σ_i s_i cos θ_i cos φ_i, Y = Σ_i s_i sin θ_i cos φ_i, Z = Σ_i s_i sin φ_i.
- any residual signals R may be added to channel W.
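A sketch of this FOA encode; a FuMa-style W gain of 1/√2 is assumed here since the text does not specify a normalization:

```python
import numpy as np

def encode_foa(speaker_signals, azimuths, elevations, residual=None):
    """Encode speaker-channel signals s_i with known azimuth theta_i and
    elevation phi_i into FOA channels [W, X, Y, Z]; residuals go to W only."""
    W = sum(s / np.sqrt(2.0) for s in speaker_signals)
    X = sum(s * np.cos(t) * np.cos(p)
            for s, t, p in zip(speaker_signals, azimuths, elevations))
    Y = sum(s * np.sin(t) * np.cos(p)
            for s, t, p in zip(speaker_signals, azimuths, elevations))
    Z = sum(s * np.sin(p) for s, p in zip(speaker_signals, elevations))
    if residual is not None:
        W = W + residual
    return np.stack([W, X, Y, Z])
```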
- the capturing and upmixing process have mainly been described with reference to the user device 10 comprising two microphones 12a-b.
- the horizontally and vertically separated microphones have two microphones in common, i.e. the horizontally and vertically separated microphones of the user device 10 refer to the same microphones 12a-b.
- the present disclosure is however not limited to a user device with such a two-microphone layout, but is more generally applicable to user devices with other microphone layouts.
- the user device 10’ comprises a third microphone 12c positioned centrally at a rear-side of the user device 10’, e.g. adjacent to the camera 14a.
- as seen in Fig. 9a-b, the third microphone 12c is separated from each of the first and second microphones 12a-b along the width dimension W.
- hence, also in the landscape mode, the user device 10’ comprises a pair of vertically separated microphones (e.g. 12a and 12c, or 12b and 12c).
- this enables the upmixer 110 (e.g. the spatial block 1108) to estimate height information also when the user device 10’ is in the landscape mode.
- the horizontal DOA may be estimated based on at least one first channel signal captured by at least one of the first subset of microphones (e.g. microphones 12a-b) together with one or both of the second channel signals, as described above.
- height information may be estimated based on a first channel signal captured by a microphone of the first subset of microphones and a first channel signal captured by a microphone of the further subset of microphones (the microphone of the first subset and the microphone of the further subset defining a pair of the vertically separated microphones, e.g. 12a and 12c or 12b and 12c).
- as shown in Fig. 9a-b, the third microphone 12c may further be separated from one or both of the microphones 12a and 12b along the height dimension H.
- the third microphone 12c may optionally form part also of the first subset of microphones and be used to estimate the horizontal DOA in the landscape orientation of user device 10’.
- Fig. 10 is a flow chart of an implementation of a method for generating an audio-visual media stream (e.g. M2) comprising an upmixed audio stream in a multichannel format (e.g. A3).
- an initial media stream (e.g. M1) is captured by a mobile user device (e.g. user device 10) operated by a user (e.g. user 1) and a head-mounted binaural capturing device (e.g. binaural capturing device 20) worn by the user and coupled to the user device.
- the initial media stream comprises a video stream (e.g. video stream V) captured by a camera of the user device (e.g. camera 14a or 14b of the user device 10).
- the initial media stream further comprises a first audio stream (e.g. A1) comprising a set of N ≥ 2 first channel signals captured by a set of N microphones of the user device (e.g. microphones 12a-b).
- the initial media stream further comprises a second audio stream (e.g. A2) comprising a pair of second channel signals captured by a left- and a right-channel microphone of the binaural capturing device (e.g. microphones 20a-b of the binaural capturing device 20).
- the initial media stream is captured while the user device is held in front of the binaural capturing device in an orientation corresponding to a landscape or a portrait mode.
- the camera may in particular face in a substantially horizontal shooting direction.
- the channel signals of the first and second audio streams may be subjected to a synchronization process (e.g. by synchronization block 1102).
- the (synchronized) channel signals of the first and second audio streams may be subjected to at least one of leveling and EQ (e.g. by leveling and/or EQ block 1104).
- the channel signals of the first and second audio streams are processed to extract a set of audio objects (e.g. O).
- the extraction may be performed by object extraction block 1106.
- the processing may further comprise extracting a set of residual signals (e.g. R).
- at step S5, orientation data is obtained indicative of whether, during the capturing of the initial media stream, the user device is in the landscape or the portrait mode.
- at step S6, spatial information (e.g. S) is estimated, comprising, for each audio object, a horizontal DOA (e.g. θ) estimated based on a set of at least three channel signals of the first and second channel signals and on the orientation data. Additionally, depending on the orientation mode of the user device, height information (e.g. φ_u) may be estimated for each audio object. The spatial information may be estimated by the spatial block 1108. [115] At step S7, each audio object is panned in accordance with the spatial information to one or more channels of the multichannel format to generate the upmixed audio stream.
- at step S8, the video stream and the upmixed audio stream are combined (e.g. by the combiner 120) to generate the audio-visual media stream.
- as discussed above, the third microphone 12c of the user device 10’ may optionally also form part of the first subset of microphones and be used to estimate the horizontal DOA in the landscape orientation of the user device 10’.
- the horizontal DOA may be estimated based on at least one first channel signal captured by at least one of the microphones 12a-c and the pair of second channel signals captured by the binaural capturing device, or based on at least two first channel signals captured by at least two of the microphones 12a-c (e.g., 12a and 12b, or 12a and 12c, or 12b and 12c) of the user device 10’ and at least one of the pair of second channel signals captured by the binaural capturing device.
- height information for each audio object may be estimated based on a pair of first channel signals captured by any pair of microphones of the user device 10’ being vertically separated in the portrait orientation, e.g., 12a and 12b, or 12a and 12c, or 12b and 12c.
- the capturing and upmixing process have been described mainly in relation to a user device comprising a set of microphones with a layout on the user device such that the user device comprises horizontally separated microphones when the user device is in a landscape mode (first mode) and vertically separated microphones when the user device is in a portrait mode (second mode).
- other microphone layouts are also possible such as a layout where the first mode is the portrait mode and the second mode is the landscape mode.
- a user device may comprise a pair of microphones separated along a width dimension of the user device, and thus positioned on the user device such that the pair of microphones are horizontally separated when the user device is in a first mode being a portrait mode and vertically separated when the user device is in a second mode being a landscape mode.
- the spatial information estimation and the spatial block 1108 may in this case proceed as set out above, but with the difference that the horizontal DOA may be estimated for the portrait mode and the height information may be estimated for the landscape mode.
- Such a user device may for example comprise a first pair of microphones positioned like the aforementioned first and second microphones 12a-b (e.g. a first subset of microphones), and a second pair of microphones formed of a third and fourth microphone (e.g. a further subset of microphones) different from the first and second microphones 12a-b.
- the second pair (further subset) of microphones may be positioned on the user device to be separated along the width dimension of the user device such that the second pair of microphones are vertically separated when the user device is in the landscape mode, wherein the microphone signals captured by the second pair of microphones may be used by the spatial block 1108 to estimate the height information for extracted audio objects if the orientation data is indicative of the landscape mode.
- one or more of the first microphone signals captured by one or more of the second pair (further subset) of microphones may be used by the spatial block 1108, together with the second channel signal(s) captured by the binaural capturing device, to estimate the horizontal DOA for extracted audio objects if the orientation data is indicative of the portrait mode.
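A sketch of how the microphone-subset selection might be switched on the orientation data for this alternative layout; the microphone labels (m1-m4 for the device, 20a-b for the headset) and the dictionary keys are hypothetical:

```python
def select_mic_subsets(orientation: str) -> dict:
    """Which microphones feed the horizontal-DOA and height estimates, for a
    device whose further subset (m3, m4) is separated along the width dimension."""
    first_subset = ("m1", "m2")    # separated along the height dimension
    further_subset = ("m3", "m4")  # separated along the width dimension
    binaural = ("20a", "20b")
    if orientation == "landscape":
        # first subset is horizontally separated, further subset vertically
        return {"horizontal_doa": first_subset + binaural, "height": further_subset}
    # portrait: the roles flip
    return {"horizontal_doa": further_subset + binaural, "height": first_subset}
```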
- Systems and methods disclosed in the present application may be implemented as software, firmware, hardware or a combination thereof.
- the division of tasks does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation.
- the computer hardware may for example be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, an AR/VR wearable, an automotive infotainment system, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that computer hardware.
- implementations may comprise one or more processors that accept computer-readable (also called machine-readable) code containing a set of instructions that, when executed by one or more of the processors, carry out at least one of the methods described herein.
- Any processor capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken is included.
- Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit.
- the processing system further may include a memory subsystem including a hard drive, SSD, RAM and/or ROM.
- a bus subsystem may be included for communicating between the components.
- the software may reside in the memory subsystem and/or within the processor during execution thereof by the computer system.
- the one or more processors may operate as a standalone device or may be connected, e.g., networked to other processor(s).
- a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
- the software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media).
- computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
- Computer storage media includes, but is not limited to, physical (non-transitory) storage media in various forms, such as EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
- communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
- EEE 1 A method for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format, the method comprising: capturing an initial media stream by a mobile user device operated by a user and a head-mounted binaural capturing device worn by the user and coupled to the user device, the initial media stream comprising: a video stream captured by a camera of the user device, a first audio stream comprising a set of
- N ≥ 2 first channel signals captured by a set of N microphones of the user device, and a second audio stream comprising a pair of second channel signals captured by a left- and a right-channel microphone of the binaural capturing device, wherein while capturing the initial media stream, the user device is held in front of the binaural capturing device in an orientation corresponding to a landscape or a portrait mode; processing the channel signals of the first and second audio streams to extract a set of audio objects; obtaining orientation data indicative of whether, during the capturing of the initial media stream, the user device is in the landscape or the portrait mode; estimating spatial information comprising, for each audio object, a horizontal direction of arrival estimated based on a set of at least three channel signals of the first and second channel signals and on the orientation data; panning each audio object in accordance with the spatial information to one or more channels of the multichannel format to generate the upmixed audio stream; and combining the video stream and the upmixed audio stream to generate the audio-visual media stream.
- EEE 2 The method according to EEE 1, wherein a layout of the set of microphones on the user device is such that the N microphones of the user device comprises microphones being horizontally separated when the user device is in a first mode and microphones being vertically separated when the user device is in a second mode, wherein the first mode is the landscape mode and the second mode is the portrait mode or vice versa, wherein, if the orientation data is indicative of the first mode, the horizontal direction of arrival for each audio object is estimated based on a set of at least three channel signals of: the first channel signals captured by the horizontally separated microphones of the user device and the second channel signals, and wherein, if the orientation data is indicative of the second mode, the spatial information further comprises, for each audio object, height information estimated based on a pair of first channel signals captured by a pair of the vertically separated microphones of the user device.
- EEE 3 The method according to EEE 2, wherein, if the orientation data is indicative of the second mode, the horizontal direction of arrival for each audio object is estimated based on: the pair of second channel signals captured by the microphones of the binaural capturing device and at least one first channel signal captured by at least one of the microphones of the user device.
- EEE 4 The method according to EEE 3, wherein said first mode is the landscape mode and said second mode is the portrait mode.
- EEE 5 The method according to EEE 4, wherein the user device has a height dimension and a width dimension, and wherein the horizontally separated microphones and the vertically separated microphones are separated along the height dimension of the user device.
- EEE 6 The method according to EEE 5, wherein the horizontally and vertically separated microphones of the user device each comprise a first microphone positioned at a bottom portion of the user device and a second microphone positioned at a top portion of the user device.
- EEE 7 The method according to any one of EEEs 3-6, wherein the horizontally and vertically separated microphones have at least one microphone in common.
- EEE 8 The method according to EEE 7, wherein the horizontally and vertically separated microphones are the same microphones.
- EEE 9 The method according to any one of the preceding EEEs, wherein, for each audio object of the set of audio objects, a representation of the audio object is extracted from each of the first and second channel signals, and wherein the method further comprises, for each audio object, selecting, among the representations of the audio object, one representation of the audio object to be used for the panning process, wherein the selection is based on the spatial information estimated for said audio object.
- EEE 10 The method according to EEE 9, wherein for each audio object, the representation of the audio object extracted from the channel signal captured by the microphone closest to the direction of arrival estimated for the audio object is selected to be used for the panning process.
- EEE 11 The method according to any one of the preceding EEEs, wherein processing the channel signals of the first and second audio streams further comprises extracting a set of residual signals, wherein the method further comprises panning each residual signal to one or more channels of the multichannel format, wherein each residual signal is panned to one or more predetermined channels of the multichannel format.
- EEE 12 The method according to EEE 11, further comprising decorrelating the residual signals prior to the panning.
- EEE 13 The method according to EEE 12, wherein the residual signals extracted from the first channel signals are panned to a first set of one or more channels of the multichannel format, and the residual signals extracted from the second channel signals are panned to a second set of one or more channels of the multichannel format different from the first set.
- EEE 14 The method according to any one of the preceding EEEs, wherein the set of audio objects is extracted using a machine learning algorithm, such as a neural network, or a digital signal processing algorithm, or a combination thereof.
- EEE 15 The method according to any one of the preceding EEEs, further comprising obtaining sensor data indicative of a spatial relationship between the user device and the binaural capturing device during the capturing of the initial media stream, wherein the estimating of the spatial information is further based on the sensor data.
- EEE 16 The method according to EEE 15, wherein the sensor data indicates a distance between the user device and the binaural capturing device, wherein the user device comprises a front-side camera, and wherein the distance is obtained from a focusing distance of the front-side camera during the capturing of the initial media stream.
- EEE 17 The method according to any one of the preceding EEEs, wherein the processing of the channel signals of the first and second audio streams comprises synchronizing the channel signals of the first and second audio streams prior to extracting the set of audio objects.
- EEE 18 The method according to any one of the preceding EEEs, wherein the processing of the channel signals of the first and second audio streams comprises applying at least one of leveling and equalization to channel signals of the first and second audio streams prior to extracting the set of audio objects.
- EEE 19 The method according to any one of the preceding EEEs, wherein the multichannel format is one of 5.1, 7.1, 5.1.2, 7.1.4, or First Order Ambisonics (FOA).
- EEE 20 The method according to any one of the preceding EEEs, wherein the method is performed by the user device and wherein the binaural audio signal is received from the binaural capturing device via a wired or wireless connection.
- EEE 21 The method according to any one of the preceding EEEs, wherein the camera capturing the video stream is a rear-side camera or a front-side camera.
- EEE 22 The method according to any one of the preceding EEEs, wherein the user device is a mobile phone or a tablet computer, and wherein the binaural capturing device is a headset.
- EEE 23 A computer program product comprising computer program code to perform, when executed by a processing device, the method according to any of EEEs 1-22.
- EEE 24 A mobile user device for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format, the user device comprising: a camera and a set of N ≥ 2 microphones; an input interface configured to receive an initial media stream comprising: a video stream captured by the camera of the user device, a first audio stream comprising a set of N first channel signals captured by the set of microphones, and a second audio stream comprising a pair of second channel signals captured by a left- and a right-channel microphone of a binaural capturing device for mounting on a head of a user, wherein the initial media stream is captured while the binaural capturing device is worn by a user of the user device and the user device is held in front of the binaural capturing device in an orientation corresponding to a landscape or a portrait mode; an upmixer configured to: process the channel signals of the first and second audio streams to extract a set of audio objects; obtain orientation data indicating whether, during the capturing of the initial media stream, the user device is in the landscape or the portrait mode; estimate spatial information comprising, for each audio object, a horizontal direction of arrival estimated based on a set of at least three channel signals of the first and second channel signals and on the orientation data; and pan each audio object in accordance with the spatial information to one or more channels of the multichannel format to generate the upmixed audio stream; and a combiner configured to combine the video stream and the upmixed audio stream to generate the audio-visual media stream.
- EEE 25 A system for generating an audio- visual media stream comprising an upmixed audio stream in a multichannel format, the system comprising: the user device according to EEE 24; and a binaural capturing device for mounting on a head of a user and comprising a left- and a right-channel microphone.
Abstract
The present disclosure relates to a method for generating an audio-visual media stream. An initial media stream is captured by a mobile user device and a binaural capturing device. The initial media stream comprises: a video stream, a first audio stream comprising N ≥ 2 channel signals captured by the user device, and a second audio stream comprising a pair of channel signals captured by the binaural capturing device. A set of audio objects are extracted and spatial information is estimated comprising, for each audio object, a horizontal direction of arrival estimated based on at least three of the channel signals, and on orientation data indicative of a landscape or portrait mode of the user device. Each audio object is panned in accordance with the spatial information to channels of a multichannel format to generate an upmixed audio stream for the audio-visual media stream.
Description
METHOD FOR GENERATING AN AUDIO-VISUAL MEDIA STREAM
[001] This application claims the benefit of priority from European Patent Application No. 24173857.4, filed 2 May 2024, U.S. Provisional Application No. 63/558779, filed 28 February 2024, and International Patent Application No. PCT/CN2024/075472, filed 2 February 2024, each of which is incorporated by reference herein in its entirety.
TECHNICAL FIELD
[002] The present invention relates to a method, a device and a system for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format.
BACKGROUND
[003] Audio-visual media in the form of video and audio recorded by mobile user devices
(e.g. hand-held electronic devices such as mobile phones or tablet computers) is an increasingly popular form of user-generated content, e.g. for personal moment sharing. The fast-paced development of the technical capabilities of mobile devices has enabled capturing of videos of high quality, both in terms of resolution and image quality. It is not uncommon for current mobile devices to have the capability to capture stereo audio concurrently with the video.
[004] Meanwhile the use of headsets (wired or wireless) together with mobile devices has become ubiquitous. A headset comprising a left- and a right-channel microphone at the left and a right earpiece of the headset may be used as a binaural capture device, thus capturing the sound at each respective ear of the user wearing the binaural capture device. Accordingly, binaural capture devices are generally good at capturing the voice of the user or the sound as perceived by the user. Binaural capturing devices is hence a convenient choice for recording podcasts, interviews, conferences, and the like.
SUMMARY
[005] In view of the above, it is nowadays common for users to have access to both a mobile device and a binaural capturing device, thus offering multiple microphones for capturing audio, in addition to the video capturing functionality of the mobile device. However, there is still a need for robust and user-friendly techniques enabling these capturing devices to be used together to facilitate creation of audio- visual media with increased immersiveness. It is therefore an object of the present invention to provide a method, device and system addressing this need.
[006] According to a first aspect of the present invention, there is provided a method for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format. The method comprises capturing an initial media stream by a mobile user device operated by a user and a head-mounted binaural capturing device worn by the user and coupled to the user device. The initial media stream comprises: a video stream captured by a camera of the user device, a first audio stream comprising a set of N ≥ 2 first channel signals captured by a set of N microphones of the user device, and a second audio stream comprising a pair of second channel signals captured by a left- and a right-channel microphone of the binaural capturing device. While capturing the initial media stream, the user device is held in front of the binaural capturing device in an orientation corresponding to a landscape or a portrait mode.
[007] The method further comprises processing the channel signals of the first and second audio streams to extract a set of audio objects. The method further comprises obtaining orientation data indicative of whether, during the capturing of the initial media stream, the user device is in the landscape or the portrait mode. The method further comprises estimating spatial information comprising, for each audio object, a horizontal direction of arrival estimated based on a set of at least three channel signals of the first and second channel signals and on the orientation data. The method further comprises panning each audio object in accordance with the spatial information to one or more channels of the multichannel format to generate the upmixed audio stream. The method further comprises combining the video stream and the upmixed audio stream to generate the audio- visual media stream.
[008] The first aspect of the present invention is based on the insight that a more immersive audio-visual media stream may be generated by, simultaneous to capturing a video stream by a mobile user device (e.g. a mobile phone or a tablet computer), capturing both a first and a second audio stream using a set of microphones of the user device and a pair of microphones of a head-worn binaural capturing device (which e.g. may be embodied by a headset), and subsequently using the two captured audio streams to generate an upmixed audio stream in a target multichannel format, e.g. an immersive multichannel format such as 5.1, 7.1, 5.1.2, 7.1.4, or a First Order Ambisonics (FOA) format.
[009] Rather than directly panning the channel signals of the first and second audio streams to channels of the target multichannel format (e.g. based on a predetermined mapping between the channels of the first and second audio streams and the channels of the multichannel format) the method subjects the channel signals of the first and second audio streams to audio object extraction, and subsequently pans the audio objects to the channels of the multichannel format. The panning of each audio object is based on spatial information comprising (at least) a horizontal direction of arrival for each audio object. The estimation of the spatial information is
in turn enabled by the simultaneous capturing of the sound from the distributed microphone locations on the user device (held by the user in front of the binaural capturing device) and the binaural capturing device (worn on the head of the user). This allows for robust, yet flexible, panning of the captured audio to the channels of the multichannel format.
[010] Each horizontal direction of arrival estimate is based on (at least) three channel signals of the captured channel signals such that channel signals from microphones of both the user device and the binaural capturing device are used. That is, the set of microphone signals comprises at least one of the first channel signals and at least one of the second channel signals. This enables left-right as well as front-back discrimination of the direction of arrival for each audio object. Since mobile devices typically allow capturing of videos both in a landscape and a portrait mode, the method further takes orientation data on the user device into account for the estimation of the spatial information, thereby contributing to both the robustness and flexibility of the method.
[011] According to some embodiments, a layout of the set of microphones on the user device is such that the N microphones of the user device comprises microphones being horizontally separated when the user device is in a first mode and microphones being vertically separated when the user device is in a second mode, wherein the first mode is the landscape mode and the second mode is the portrait mode or vice versa. Hence, when capturing the video in the first mode (be it landscape or portrait mode) there are at least two pairs of horizontally separated microphones available for estimating the horizontal direction of arrival, the pair of microphones of the binaural capturing device and a pair of microphones of the horizontally separated microphones of the user device.
[012] If the orientation data is indicative of the first mode, the horizontal direction of arrival for each audio object may be estimated based on a set of at least three channel signals of the first channel signals captured by the horizontally separated microphones of the user device and the second channel signals. The method may accordingly select which microphone signals to use for the estimation of the horizontal direction of arrival of the audio objects, or use the microphone signals from the microphones of both the user device and the binaural capturing device. That is, if the orientation data is indicative of the first mode, the horizontal direction of arrival for each audio object is estimated based on: at least one first channel signal captured by at least one of the horizontally separated microphones of the user device and the pair of second channel signals, or a pair of first channel signals captured by a pair of the horizontally separated microphones of the user device and at least one of the pair of second channel signals.
[013] According to some embodiments, if the orientation data is indicative of the second mode, the spatial information further comprises, for each audio object, height information
estimated based on a pair of first channel signals captured by a pair of the vertically separated microphones of the user device. Hence, when capturing the video in the second mode (be it landscape or portrait mode) there is a pair of vertically separated microphones of the user device which thus enables estimating height information in the form of a direction of arrival in a vertical plane for each audio object. Further, if the orientation data is indicative of the second mode, the horizontal direction of arrival for each audio object may be estimated based on: the pair of second channel signals captured by the microphones of the binaural capturing device and at least one first channel signal captured by at least one of the microphones of the user device (e.g. at least one of the vertically separated microphones).
[014] According to some embodiments, the first mode is the landscape mode and the second mode is the portrait mode. Hence, when capturing the video in the landscape mode there are two pairs of horizontally separated microphones available, the pair of microphones of the binaural capturing device and (at least) a pair of horizontally separated microphones of the user device. The method may accordingly select which pair of microphone signals to use for the estimation of the horizontal direction of arrival of the audio objects, or use both for redundancy and improved robustness. Meanwhile, when capturing the video in the portrait mode there is a pair of vertically separated microphones of the user device which thus enables estimating height information in the form of a direction of arrival in a vertical plane for each audio object.
[015] Typical mobile user devices have an elongated shape, i.e. a height dimension exceeding a width dimension. The landscape mode and the portrait mode hence tend to be associated with a horizontal orientation of the height dimension and a vertical orientation of the height dimension, respectively. The elongated shape further typically implies that a separation of the microphones tends to be greater along the height dimension than along the width dimension of the device. The method may hence take advantage of this by estimating the height information when the user device is in the portrait mode.
[016] Accordingly, in some embodiments the user device has a height dimension and a width dimension, and the horizontally separated microphones and the vertically separated microphones are separated along the height dimension of the user device.
[017] According to some embodiments, the horizontally and vertically separated microphones of the user device have at least one microphone in common. Together with the microphones of the binaural capturing device, this allows estimation of both horizontal direction of arrival and height information by a user device with a configuration of three microphones (if one microphone is common) or only two microphones (if two microphones are in common). According to some embodiments, the horizontally and vertically separated microphones may refer to the same microphones of the user device, e.g. a first and a second microphone separated
along the height dimension of the user device. The first microphone may be positioned at a bottom portion of the user device and the second microphone may be positioned at a top portion of the user device.
[018] According to some embodiments, the processing of the channel signals of the first and second audio streams comprises extracting, for each audio object of the set of audio objects, a representation of the audio object from each of the first and second channel signals, and wherein the method further comprises, for each audio object, selecting, among the representations of the audio object, one representation of the audio object to be used for the panning process, wherein the selection is based on the spatial information estimated for said audio object. This enables a single one of the available representations of an extracted object to be selected and used for the panning process, wherein the selection may be aided by the spatial information. For example, the representation of the audio object extracted from the channel signal captured by the microphone closest to the direction of arrival estimated for the audio object may be selected to be used for the panning process. It may be expected that the audio signal captured by the microphone closest to the direction of arrival of the audio object comprises the highest quality audio data relating to the audio object.
[019] According to a second aspect of the present invention, there is provided a computer program product comprising computer program code to perform, when executed on a computer, the method according to the first aspect or any of the embodiments thereof.
[020] According to a third aspect of the present invention, there is provided a mobile user device for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format. The user device comprises: a camera and a set of N ≥ 2 microphones, and an input interface configured to receive an initial media stream comprising: a video stream captured by the camera of the user device, a first audio stream comprising a set of N first channel signals captured by the set of microphones, and a second audio stream comprising a pair of second channel signals captured by a left- and a right-channel microphone of a binaural capturing device for mounting on a head of a user. The initial media stream is captured while the binaural capturing device is worn by the user and the user device is held in front of the binaural capturing device in an orientation corresponding to a landscape or a portrait mode.
[021] The device further comprises an upmixer configured to: process the channel signals of the first and second audio streams to extract a set of audio objects; obtain orientation data indicating whether, during the capturing of the initial media stream, the user device is in the landscape or the portrait mode; estimate spatial information comprising, for each audio object, a horizontal direction of arrival estimated based on a set of at least three channel signals of the first and second channel signals and on the orientation data; and pan each audio object in accordance
with the spatial information to one or more channels of the multichannel format to generate the upmixed audio stream.
[022] The device further comprises a combiner configured to combine the video stream and the upmixed audio stream to generate the audio-visual media stream.
According to a fourth aspect of the present invention, there is provided a system for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format, the system comprising the user device according to the third aspect or any embodiments thereof; and a binaural capturing device for mounting on a head of a user and comprising a left- and a right-channel microphone.
[023] The invention according to the second, third and fourth aspects features the same or equivalent benefits as the invention according to the first aspect. Any functions described in relation to the first aspect may have corresponding features in a system and vice versa.
BRIEF DESCRIPTION OF THE DRAWINGS
[024] The present invention will be described in more detail with reference to the appended drawings, showing embodiments of the invention.
[025] Figure 1 depicts various capturing scenarios involving a user, a user device and a binaural capturing device.
[026] Figure 2 shows a user device from the perspective of a user.
[027] Figure 3a-b show in greater detail a capturing process conducted with the user device in a landscape mode and a portrait mode, respectively.
[028] Figure 4 depicts a block diagram of an example implementation of a system for processing an initial captured audio-visual media stream.
[029] Figure 5 depicts a block diagram of an example implementation of an upmixer 110.
[030] Figure 6a-b schematically depict estimation of spatial information comprising a horizontal direction of arrival for an audio object.
[031] Figure 7 schematically depicts estimating of spatial information comprising height information for an audio object.
[032] Figure 8 depicts a block diagram of an example implementation of a panning block.
[033] Figure 9a-b depict in a top-down view and a rear-side view, respectively, a user device according to a further example implementation.
[034] Fig. 10 is a flow chart of a method for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format.
DETAILED DESCRIPTION
[035] Fig. 1 depicts a user 1 performing a simultaneous capturing of an audio-visual media stream (initial audio-visual media stream). The audio-visual media stream is captured by a system (capturing system) comprising a mobile (i.e. portable) user device 10, in the shape of a mobile phone (smartphone), and a binaural capturing device 20, in the shape of a headset worn on the head of the user 1. The audio-visual media stream comprises a video stream captured by a camera 14a of the user device 10. The camera 14a is provided on a rear- or backside of the user device 10 and may in the following also be referred to as the main camera 14a of the user device 10. The audio-visual media stream further comprises a first audio stream comprising a set of first channel signals captured by a corresponding number of microphones of the user device 10. More specifically, and as further described herein, the first audio stream comprises a set of N ≥ 2 channel signals (termed first channel signals) captured by a set of N microphones of the user device 10. The audio-visual media stream further comprises a second audio stream comprising a pair of second channel signals captured by a left- and a right-channel microphone 20a, 20b of the binaural capturing device 20. The microphones 20a, 20b are provided in a respective earpiece of the headset. The binaural capturing device 20 may thus record a binaural audio stream of two channel signals: a left-channel signal and a right-channel signal. The channel signals of the second / binaural audio stream may in the following be referred to as the second channel signals. The binaural capturing device 20 may be connected to the user device 10 wirelessly (e.g. by means of a Bluetooth communication link, or employing any other suitable conventional wireless communication protocol) or by wires. The user device 10 may accordingly receive the second audio stream from the binaural capturing device 20, e.g. in real-time.
[036] As shown in Fig. 1, the initial media stream is captured while the user device 10 is held by the user 1 in front of the binaural capturing device 20 with the camera 14a facing in a substantially horizontal shooting direction. The user device 10 is more specifically held with an orientation corresponding to a landscape mode. That is, the height dimension of the user device 10 (which corresponds to the longitudinal dimension of the user device 10) is oriented substantially parallel to the horizontal plane.
[037] In the illustrated example of Fig. 1, the user 1 holds the user device 10 by means of a holding or extension device (e.g. a grip pole or monopod, also called “selfie stick”). The right-hand side of Fig. 1 shows a corresponding scenario of a user 1’ capturing an initial audio-visual media stream by a user device 10’ and a headset-type binaural capturing device 20’ (of which only one earpiece is visible, the other one being hidden from view by the head of the user 1’). The capturing scenario for user 1’ however differs in that the user device 10’ is in the shape of a tablet computer and is held with an orientation corresponding to a portrait mode, i.e. the height
dimension is oriented substantially vertically (transverse to the horizontal plane). Additionally, the user device 10’ is held directly in the hand of the user 1’. Accordingly, as used herein, a user device being “held” or “handheld” by a user is intended to cover the user device being held either directly (as in the scenario of user 1’ and user device 10’) or “indirectly” using a holding device (as in the scenario of user 1 and user device 10).
[038] As a few examples, the capturing processes depicted in Fig. 1 may be performed for the purpose of recording a podcast, an interview with one or more other persons than the user 1, recording an event or a scene, e.g. for personal moment sharing on social media, etc. It is noted that these examples are merely illustrative and should not be construed as limiting.
[039] Fig. 2 schematically shows a closer view of the user device 10, from the perspective of the user 1 in Fig. 1, wherein the user device 10 by way of example is held directly in the user’s hand 1a. As in Fig. 1, the user device 10 is held in front of the user 1, with the camera 14a (hidden from view in Fig. 2) facing in a substantially horizontal shooting direction away from the user 1, i.e. in a front direction relative to the user 1. The front-side of the user device 10, which as shown may be provided with a screen 16, is hence facing the user 1. Fig. 2 further indicates the height dimension H and a width dimension W of the user device 10.
[040] Fig. 3a-b show in greater detail a capturing scenario wherein the user device 10 is in the landscape mode and the portrait mode, respectively. The video stream is captured by the main camera (14a in Fig. 1) and V denotes the field of view of the main camera. As shown, the shooting direction S of the camera is hence in both cases directed away from the user 1, in a front or forward direction relative to the user 1. The left and right designations of the microphones thus correspond to the left and right lateral sides of the shooting direction S. The user device 10 comprises a pair of a first microphone 12a and a second microphone 12b (i.e. N = 2). The layout of the microphones 12a-b on the user device 10 is such that the first microphone 12a is positioned at a bottom portion of the user device 10 and the second microphone 12b is positioned at a top portion of the user device 10. The first and second microphones 12a-b are separated along the height dimension H of the user device 10. Thus, in Fig. 3a the first and second microphones 12a-b are horizontally separated (i.e. separated along the horizontal plane) while in Fig. 3b the first and second microphones 12a-b are vertically separated (i.e. separated along a vertical plane). Hence, in Fig. 3a, the first and second microphones 12a-b may be referred to as a pair of horizontally separated microphones of the user device 10 and in Fig. 3b, the first and second microphones 12a-b may be referred to as a pair of vertically separated microphones of the user device 10.
[041] The terms “horizontally separated microphones” and “vertically separated microphones” (such as in connection with the first and second microphones 12a-b) are herein used
to refer respectively to a combination of microphones of a user device which are horizontally separated when the user device is in a first mode and a combination of microphones of the user device which are vertically separated when the user device is in a second mode. As will become apparent from the following, the first mode may be a landscape mode and the second mode may be a portrait mode, or vice versa. As further set out herein, the first and second combinations of microphones of the user device may in some implementations have at least one microphone in common (e.g. as in the case of microphones 12a-c of the user device 10’ in Fig. 9a-b). In some implementations the first and second combinations of microphones may refer to the same combination of microphones of the user device (e.g. as in the case of microphones 12a-b of the user device 10 in Fig. 3a-b).
[042] Meanwhile, the left- and right-channel microphones 20a-b of the binaural capturing device 20 are in both Fig. 3a and 3b horizontally separated. The microphones 12a-b and 20a-b may be omnidirectional or directional microphones. In either case, considered as a whole, the set of microphones 12a-b of the user device 10 together with the microphones 20a-b of the binaural capturing device 20 hence define a microphone array with a variable spatial configuration. Therefore, an upmixing process, as further described below, will take orientation data indicative of the orientation mode of the user device 10 into account for estimating spatial information for extracted audio objects.
[043] Fig. 4 and 5 show block diagrams of example implementations of a system 100 for processing an initial captured audio-visual media stream M1 to generate an audio-visual media stream M2 comprising an upmixed audio stream A3 in a multichannel format, and an upmixer 110 for generating the upmixed audio stream A3.
[044] The initial media stream M1 is received as input by the system 100. The initial media stream M1 comprises as shown a video stream V, a first audio stream A1 comprising a set of N ≥ 2 first channel signals, and a second audio stream A2 comprising a pair of second channel signals. The upmixer 110 is configured to receive and process the first and second audio streams A1, A2, and generate the upmixed audio stream A3. The system 100 further comprises a combiner 120 configured to combine the video stream V and the upmixed audio stream A3 to generate the audio-visual media stream M2.
[045] The initial media stream M1 may e.g. be captured by the above-described capturing system comprising the user device 10 and the binaural capturing device 20. Accordingly, the video stream V may be captured by the main camera 14a of the user device 10, the first audio stream A1 may be captured by the first and second microphones 12a-b of the user device 10, and the second audio stream A2 may be captured by the microphones 20a-b of the binaural capturing device 20. Analogous to the preceding discussion, the initial media stream M1 is captured while
the user 1 holds the user device 10 in front of the binaural capturing device 20 in an orientation corresponding to a landscape or a portrait mode, and with the camera facing in a substantially horizontal shooting direction, e.g. as shown in Fig. 3a and 3b.
[046] The system 100, including the upmixer 110 and the combiner 120, may for example be implemented in the user device 10, such that the upmixing- and combining-steps are performed entirely at the side of the user device 10. However, other implementations are also possible, and the system 100 is generally not dependent on any particular configuration or form factor of the implementing device. For instance, the system 100 may be implemented by a device separate from the user device 10. The system 100 may as an example be implemented in a remote server, wherein the initial media stream M1 may be uploaded (e.g. by the user device 10) to the remote server, wherein the server may generate the upmixed audio stream A3 and combine the same with the video stream V to generate the media stream M2. A distributed implementation of the system 100 is also possible, wherein the upmixer 110 is implemented in the remote server while the combiner 120 is implemented by the user device 10. The first and second audio streams A1, A2 of the initial media stream M1 may be uploaded (e.g. by the user device 10) to the server, wherein the server may generate the upmixed audio stream A3. The uploading device (e.g. the user device 10) may then download the upmixed audio stream A3 from the server and combine the same with the video stream V to generate the media stream M2. These merely represent a few non-limiting examples and other implementations and distributions of functionality are compatible with the present disclosure.
[047] Example implementations of the upmixer 110 will now be discussed with reference to Fig. 5.
[048] The upmixer 110 may as an initial step in the processing chain comprise a synchronization block 1102 configured to synchronize the channel signals of the first and second audio streams A1, A2 in time. That is, the synchronization aims at temporally aligning the channel signals of the first audio stream A1 and the channel signals of the second audio stream A2 with respect to a common time basis (common clock reference). The synchronized first and second audio streams A1, A2 output by the synchronization block 1102 are in Fig. 5 commonly designated A’. Synchronization may be needed if the binaural capturing device (e.g. binaural capturing device 20) is wirelessly coupled to the user device (e.g. user device 10), wherein jitter may be present between a clock of the binaural capturing device (or respective clocks of the left and right earpieces) and a clock of the user device. The jitter may otherwise obscure the relative acoustic delays between the audio signals captured by the microphones of the binaural capturing device and the microphones of the user device. For instance, the synchronization may be implemented by comparing time stamps recorded in frames of the first and second audio signals.
Implementations for clock synchronization are as such known in the art and are found for instance in the Simple Network Time Protocol (SNTP) and the Precision Time Protocol (PTP). Further examples of clock synchronization methods include Reference Broadcast Synchronization (RBS).
[049] While the synchronization block 1102 in Fig. 5 is shown to form part of the upmixer 110, it is to be noted that in case of a wireless coupling between the user device and binaural capturing device, the synchronization may instead be performed by wireless communication circuits maintaining the wireless link. For instance, if the user device and the binaural capturing device are coupled via a Bluetooth link, the synchronization may be provided by a Broadcast Synchronization over Bluetooth (BSB) method implemented by the Bluetooth circuits. As further may be understood, synchronization may be omitted in case the channel signals of the first audio stream A1 and the second audio stream A2 are sufficiently synchronized already upon receipt by the system 100. For instance, in case of a wired coupling between the user device and the binaural capturing device, synchronization may not be needed. In either case, it is assumed in the following that the channel signals of the first and second audio streams A1, A2 are sufficiently synchronized (be it by means of the synchronization block 1102 or due to absence of any appreciable timing errors between the user device and the binaural capturing device) to allow for resolving the relative acoustic delays between the audio signals captured by the microphones of the binaural capturing device and the microphones of the user device.
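As a non-limiting illustration, a timestamp-based alignment of the kind performed by the synchronization block 1102 might be sketched as follows in Python. The function name, array layout and the availability of per-stream capture timestamps on a common clock (e.g. established via SNTP or PTP) are assumptions made for the example:

```python
import numpy as np

def align_streams(first, second, t0_first, t0_second, fs):
    """Align the binaural (second) stream to the user-device (first)
    stream using the capture time of each stream's first sample,
    expressed in seconds on a common clock reference."""
    # Offset, in samples, of the second stream relative to the first.
    offset = int(round((t0_second - t0_first) * fs))
    if offset > 0:
        # Second stream started later: pad its start with silence.
        second = np.pad(second, ((0, 0), (offset, 0)))
    elif offset < 0:
        # Second stream started earlier: drop its leading samples.
        second = second[:, -offset:]
    # Trim both (channels, samples) arrays to a common length.
    n = min(first.shape[1], second.shape[1])
    return first[:, :n], second[:, :n]
```

Finer, sub-sample clock drift would in practice be handled by the clock synchronization methods named above; the sketch only removes the coarse offset.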
[050] The upmixer 110 may as a (further) initial step of the processing chain comprise a leveling and/or equalization (EQ) block 1104. The leveling and/or EQ block 1104 may as shown receive as input the output A’ of the synchronization block 1102, comprising the synchronized first and second audio streams A1, A2. In absence of a synchronization block 1102, the leveling and/or EQ block 1104 may receive the first and second audio streams A1, A2 as input without any preceding synchronization performed by the upmixer 110. The leveling and/or EQ block 1104 may apply leveling and/or equalization to the channel signals of the (synchronized) first and second audio streams A1, A2. The output of the leveling and/or EQ block 1104 is denoted A” in Fig. 5. To streamline the following description, the term “input audio stream” and label A” will be used to refer to the collective audio stream comprising the first and second channel signals of the first and second audio streams A1, A2, which may or may not have been subjected to one or more of synchronization, leveling and EQ.
[051] The upmixer 110 is further configured to process the first and second channel signals of the input audio stream A” to extract a set of audio objects. The audio object extraction is implemented by an object extraction block 1106 of the upmixer 110.
[052] The term “audio object” is used herein to refer to sources or elements of sound captured in the input audio stream A”. An audio object may for instance correspond to a sound from a human, an animal, a vehicle or any other object or process being the source of the sound which is captured in the first and second audio streams. An audio object may be dynamic, e.g. have a limited temporal duration and/or present time-varying characteristics (such as energy, envelope, spectrum). An audio object may also be static, e.g. have a duration coextensive with a duration of the audio streams, and a substantially stationary energy, envelope and spectrum.
[053] As may be appreciated, the number of audio objects captured in the input audio stream A” may vary between different capturing scenarios. In general, one or more different audio objects may be extracted. In some instances, like a monologue in a podcast, the input audio stream A” may comprise an audio object corresponding to only a single speaker, possibly together with a background or residual. In other instances, like an interview setting, the input audio stream A” may comprise audio objects corresponding to respective speakers. If the capturing process is conducted in a setting such as a cafeteria, at a busy street or in a park, the input audio stream A” may comprise a number of different audio objects corresponding respectively to speakers, cars driving by, bird chirps, and other sound events typical for such settings.
[054] An extracted audio object may comprise, or be defined by, an audio object channel signal (corresponding to the actual audio content or audio data) and/or metadata allowing the actual audio content or audio data of the audio object to be derived from the input audio stream A”. The metadata may for instance indicate the subbands comprising (occupied by) the audio object, and, optionally, the time of appearance and/or a duration of the audio object in the channel signals of the input audio stream A”. The metadata may additionally or alternatively comprise a soft mask (e.g. a soft gain mask) defined such that a representation of the (common) audio object may be derived from each respective channel signal by applying the soft mask to the respective channel signal.
[055] In a capturing process, it is envisaged that a representation of an audio object (e.g. a sound originating from a source) will be captured in more than one, typically each, channel signal of the input audio stream A”. Thus, the term “representation of an audio object” may be used to refer to the representation of the audio object in a respective channel signal. The term “common” audio object may be used to refer to the audio object to which the representations correspond.
[056] Hence, an extracted (common) audio object may be defined by the set of N+2 representations of the audio object extracted from each of the N+2 channel signals of the input audio stream A”. Thus, the object extraction block 1106 may in some implementations process
each of the first and second channel signals individually, and thus extract from each of the first and second channel signals a respective representation of each (common) audio object. The output of the object extraction block 1106 (denoted O in Fig. 5) may thus comprise a set of one or more common audio objects, wherein each common audio object of the set is defined by a respective set of N+2 representations of the respective common audio object, extracted from the channel signals.
[057] In some implementations, a representation of an audio object extracted from a channel signal may be defined by an audio object channel signal comprising a component of the (common) audio object extracted from the channel signal. In this case, the output O of the object extraction block 1106 may comprise a set of one or more common audio objects, wherein each common audio object (in turn) is defined by a respective set of N+2 audio object channel signals extracted from the N+2 channel signals.
[058] In some implementations, a representation of an audio object extracted from a channel signal may be defined by metadata allowing an audio object channel signal comprising a component of the (common) audio object to be derived from the channel signal. In this case, the output O of the object extraction block 1106 may comprise the N+2 channel signals of the input audio stream A” and metadata allowing the N+2 audio object channel signals of each (common) audio object to be extracted from the N+2 channel signals. The metadata may comprise separate (i.e. individual) metadata for each representation of each audio object. For instance, a separate soft mask may be output for each representation of each (common) audio object. The metadata may also comprise, for each respective (common) audio object, shared metadata allowing the N+2 audio object channel signals of the (common) audio object to be derived from the N+2 channel signals. For instance, a shared soft mask may be output for each (common) audio object.
[059] In the case of a soft mask-based approach (shared or individual), the output O of the object extraction block 1106 may comprise the individual channel signals of the input audio stream A”, and a set of shared soft masks for each common audio object, or a set of N+2 individual soft masks for each common audio object. In either case, the output of the object extraction block 1106 allows the respective audio object channel signals of each audio object to be derived by applying the respective soft mask to the respective channel signals.
[060] The object extraction block 1106 may further be configured to compare the representations of the audio objects extracted from the respective channel signals, to determine a correspondence between the extracted representations of the audio objects. That is, the extracted representations corresponding to (associated with) the same audio object (i.e. a common audio object) may be identified. The object extraction block 1106 may compare the audio object representations (e.g. the audio object channel signals) extracted from each channel signal in a
frequency domain. As an example, frequency domain representations of the audio object representations may be obtained by applying a suitable frequency transform, such as the Short-Time Fourier Transform (STFT), to the audio object representations, whereafter a correlation between the audio object representations may be computed across all channels, and audio object representations with a sufficiently strong correlation (e.g. a correlation exceeding a similarity threshold) in time and in one or more frequency bands may be associated. Associated audio object representations may hence be grouped (e.g. labelled) as a set of audio object representations relating to a common audio object (e.g. the same source).
[061] In some implementations the object extraction block 1106 may be configured to apply a frequency domain transform already prior to the audio object extraction, wherein the audio object extraction may be performed on the frequency domain representations of the channel signals. A (further) frequency domain transform of the extracted audio objects may in this case be skipped. The extracted audio objects may in a subsequent step of the upmixing process (e.g. in connection with the step of estimating the spatial information and/or the panning step) be subjected to an inverse transform (e.g. inverse-STFT) to transform the extracted audio objects back into the time domain.
[062] The portions of the channel signals (audio data) not being extracted as (e.g. belonging to or being associated with) an audio object may be referred to as residual signals. As further described below, any residual signals may be treated as audio beds during the panning process, i.e. be panned to channels in accordance with predetermined fixed panning rules. The residual signals are in Fig. 5 denoted R.
[063] The audio object extraction may be implemented by machine learning (ML)-based algorithms or models like convolutional neural networks (CNNs) or recurrent neural networks (RNNs), or by digital signal processing (DSP)-based algorithms, or combinations thereof.
[064] In one example implementation, the object extraction block 1106 may be configured to apply an ML-based noise reduction algorithm trained to distinguish sound events such as speech, music and/or bird chirps in an input channel signal from an acoustic background (residual). The noise-reduced output signals may be taken as the extracted representations of the audio objects. The representations may then be associated (e.g. grouped) using the approach outlined above. Many other ML-based audio object extraction approaches are known in the art, e.g. neural network-based models trained to distinguish (separate) sound sources from an audio signal, and may be used for implementing the audio object extraction.
[065] In another example implementation, the object extraction block 1106 may be configured to implement a correlation-based DSP-algorithm to extract the audio objects. The channel signals of the input audio stream (e.g. A1 and A2 of A”) may be divided into several
frequency bands (i.e. after applying a frequency domain transform like STFT). Correlations may then be calculated for each frequency band across all channel signals. Bands with a sufficiently strong correlation across time and input channels (e.g. time-frequency tiles with a correlation exceeding a threshold) may be grouped to define a respective audio object. Hence, each audio object is defined by correlated frequency bands (e.g. correlated time-frequency tiles) of the channel signals.
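As a non-limiting sketch of this correlation-based approach, the following Python fragment flags frequency bands whose magnitude envelopes are strongly correlated across all channel signals; the threshold and parameter values are assumptions for the example, not values prescribed by the disclosure:

```python
import numpy as np
from scipy.signal import stft

def correlated_bands(channels, fs, threshold=0.7, nperseg=1024):
    """Flag frequency bands whose magnitude envelopes correlate strongly
    across every pair of channels; flagged bands are candidates for a
    common audio object, the rest is left to the residual.

    channels: (n_ch, n_samples) array of synchronized channel signals
    """
    _, _, Z = stft(channels, fs=fs, nperseg=nperseg)  # (n_ch, n_freq, n_frames)
    mag = np.abs(Z)
    n_ch, n_freq, _ = mag.shape
    mask = np.zeros(n_freq, dtype=bool)
    for band in range(n_freq):
        env = mag[:, band, :]          # band envelope per channel
        corr = np.corrcoef(env)        # pairwise correlation matrix
        off_diag = corr[~np.eye(n_ch, dtype=bool)]
        # Require every channel pair to correlate above the threshold
        # (assumes non-silent envelopes in the analyzed excerpt).
        mask[band] = off_diag.min() > threshold
    return mask
```

A fuller implementation would operate on time-frequency tiles rather than whole bands, as described above, but the grouping principle is the same.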
[066] In yet another example implementation, the object extraction block 1106 may process the channel signals in a frequency domain (e.g. the STFT of the channel signals), and generate a soft mask corresponding to each audio object detected in each channel signal. Similar soft masks derived from different channel signals may optionally then be averaged to generate a shared soft mask for each audio object. The representations of each audio object may then be extracted by applying the respective shared soft masks to each channel signal.
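A minimal sketch of the shared soft mask variant, assuming the per-channel masks are already given (e.g. produced by an ML model) and aligned with the STFT grid:

```python
import numpy as np
from scipy.signal import stft, istft

def apply_shared_mask(channels, masks, fs, nperseg=1024):
    """Average per-channel soft masks into one shared mask and apply it
    to every channel to obtain the object's representations.

    channels: (n_ch, n_samples) time-domain channel signals
    masks   : (n_ch, n_freq, n_frames) per-channel soft gains in [0, 1]
    """
    _, _, Z = stft(channels, fs=fs, nperseg=nperseg)
    shared = masks.mean(axis=0)                    # (n_freq, n_frames)
    _, representations = istft(Z * shared, fs=fs, nperseg=nperseg)
    return representations                         # (n_ch, n_samples)
```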
[067] It is noted that in each of the above example implementations, the object extraction is performed on each of the channel signals of the first and second audio streams A1, A2, e.g. the first channel signals captured by the microphones 12a-b of the user device 10 and the second channel signals captured by the microphones 20a-b of the binaural capturing device 20.
[068] The upmixer 110 is further configured to estimate spatial information for each of the audio objects O. The estimation of spatial information is implemented by a spatial information estimation block 1108 (for conciseness termed “spatial block 1108” in the following). The spatial information output by the spatial block 1108 is denoted S in Fig. 5.
[069] While a horizontal direction of arrival (DOA) for an audio object (i.e. the DOA for the sound source corresponding to the audio object, as seen in a horizontal plane) in principle may be estimated based on only channel signals captured by a binaural capturing device, e.g. employing techniques based on a head-related transfer function (HRTF), such techniques may be sensitive to anatomical differences between heads of different users, and may further be comparably complex to implement. While the present disclosure does not preclude such techniques, a horizontal DOA may in some cases be estimated more robustly based on a set of three or more channel signals captured by a corresponding set of three or more microphones with a separation in the horizontal plane.
[070] Accordingly, the spatial block 1108 is configured to estimate at least horizontal spatial information in the form of a horizontal DOA for each audio object based on a set of at least three channel signals of the first and second channel signals. More specifically, the horizontal DOA for each audio object may be estimated based on the representations of the audio object extracted from three (or more) of the first and second channel signals. Thus, while the following description for simplicity and conciseness may state that the estimation of spatial
information for an audio object (horizontal DOA or height information) may be based on or uses three or more of the first and second channel signals or microphone signals, references to these channel signals may be understood as references to the audio object channel signals extracted or derived therefrom and comprising the audio object.
[071] In view of the above, depending on the form of the output O of the object extraction block 1106, the spatial block 1108 may thus receive, for each audio object, a respective set of N+2 audio object channel signals, or the N+2 channel signals of the input audio stream A” and metadata (e.g. soft masks, individual or shared) allowing N+2 audio object channel signals to be derived from the N+2 channel signals. In the latter case, the spatial block 1108 may be configured to, as an initial step, derive the N+2 audio object channel signals for each audio object from the N+2 channel signals of the input audio stream A ” using the metadata.
[072] As already discussed, a capturing process employing a mobile user device may be performed in both a landscape and a portrait mode. Assuming the user device comprises two spaced apart microphones, the physical configuration of the user device microphones in space (i.e. the spatial locations of the microphones, relative to the physical surroundings) will thus be different when the user device is in the landscape mode and the portrait mode. Therefore, the algorithm for estimating the spatial information implemented by the spatial block 1108 is adaptable in the sense that it is further based on orientation data indicative of whether the user device is in a landscape or portrait mode. For a given layout of the set of microphones on the user device, the orientation data allows the spatial block 1108 to adapt the spatial information estimation process in accordance with the orientation of the user device, and thus to the spatial (physical) locations of the microphones (e.g. 12a-b) of the user device (e.g. 10), during the capturing of the initial media stream M1.
[073] The orientation data may be obtained by an orientation sensor (e.g. based on a gyroscope or accelerometer) of the user device and may be indicative of whether, during the capturing of the initial media stream M1, the user device is in the landscape or the portrait mode. The orientation data may be included in a metadata stream of the initial media stream M1 and be provided as input to the upmixer 110, together with the first and second audio streams A1, A2. The orientation data may also be received separately from the initial media stream M1, e.g. obtained directly from the orientation sensor. The orientation data may for instance indicate an actual orientation angle of the user device. The orientation data may in a more basic example simply indicate the orientation mode of the user device, i.e. landscape mode or portrait mode (e.g. a binary indication).
[074] In some implementations, the spatial block 1108 may expressly take information or data indicative of the layout of the set of microphones on the user device into account for the
purpose of estimating the spatial information for the audio objects. The layout of the set of microphones on the user device may for instance be provided as predetermined layout information, e.g. retrieved from a device database or look-up-table comprising layout data for various models of user devices. A layout may for instance indicate the relative locations of the microphones (e.g. 12a-b) in a frame of reference fixed to the user device. According to a more basic example, the layout may simply indicate a separation between the microphones (e.g. 12a-b) along the height dimension H and/or the width dimension W of the user device.
[075] The orientation data and the layout information may as shown in Fig. 5 (represented by reference sign 1109) be provided as input to the spatial block 1108.
[076] As will be further discussed in the following, the spatial information for the extracted audio objects (comprising at least horizontal DOA and optionally height information) may be estimated by the spatial block 1108 based on the at least three channel signals (e.g. the audio object signals extracted or derived therefrom) and the spatial locations of the microphones of the user device and the binaural capturing device capturing the at least three channel signals (e.g. the at least three audio object signals for each common audio object). As mentioned above, the spatial locations of the microphones of the user device may be determined based on the orientation data and the layout of the set of microphones on the user device. The microphone locations of the user device and the binaural capturing device may be expressed in the form of coordinates (e.g. Cartesian coordinates) for each microphone relative to a common frame of reference (e.g. a frame of reference in which the spatial information is estimated). A convenient choice of origin for a frame of reference would be the user device or the binaural capturing device, although other choices are also possible. Also, in some cases it may suffice to express the microphone locations as relative microphone locations, e.g. in the form of respective distances (horizontally and/or vertically) between the microphones.
[077] For example, the spatial block 1108 may, based on the orientation data, transform relative locations of the microphones on the user device, indicated in the predetermined layout information, to locations (absolute or relative) in a frame of reference in which the spatial information is estimated. As an example, if the orientation data indicates a landscape mode the relative microphone locations in the user device frame of reference may be transformed to coordinates (absolute or relative) in a horizontal plane. If the orientation data indicates a portrait mode the relative microphone locations in the user device frame of reference may be transformed to coordinates or distances in a vertical plane. As a further example, the predetermined layout information may comprise two sets of relative microphone locations, one corresponding to the landscape mode and one corresponding to the portrait mode. The spatial
block 1108 may thus, based on the orientation data, select which set of microphone locations from the predetermined layout information to use.
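The following Python sketch illustrates one possible orientation-dependent transform of the microphone layout, assuming a two-dimensional layout specification (offsets along the height dimension H and width dimension W) and a binary landscape/portrait indication; the coordinate conventions are illustrative only:

```python
import numpy as np

def device_mic_positions(layout_hw, orientation):
    """Map microphone offsets from the device frame to a world frame
    used for spatial estimation (x = lateral, y = along the shooting
    direction S, z = up), given the orientation mode.

    layout_hw  : (n_mics, 2) offsets along (H, W) of the device, meters
    orientation: "landscape" or "portrait", from the orientation data
    """
    positions = []
    for h, w in layout_hw:
        if orientation == "landscape":
            # Height dimension lies in the horizontal plane.
            positions.append((h, 0.0, w))
        else:
            # Portrait: height dimension is vertical.
            positions.append((w, 0.0, h))
    return np.array(positions)  # (n_mics, 3) world-frame coordinates
```

For the two-microphone layout of the user device 10 (separation along H only), this yields horizontally separated coordinates in the landscape mode and vertically separated coordinates in the portrait mode, consistent with Fig. 3a-b.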
[078] Meanwhile, the microphone locations for the binaural capturing device may be determined based on a spatial relationship between the user device and the binaural capturing device. The spatial relationship may comprise a distance between the user device and the binaural capturing device. The distance may be a predetermined distance or be obtained from sensor data (e.g. focus data obtained by a front-facing camera 14b as discussed below). The microphone locations for the binaural capturing device may further be based on a distance between the left- and right-channel microphones of the binaural capturing device. The distance may be a predetermined (e.g. assumed) distance.
[079] Example implementations of the spatial block 1108 will now be described with reference to the user device 10 and the binaural capturing device 20 and Fig. 6a-b.
[080] Fig. 6a again depicts the user device 10 of Fig. 1 capturing the initial media stream M1. The user device 10 is held in an orientation corresponding to the landscape mode. Reference sign 30 schematically indicates an audio object (common audio object) corresponding to a source, extracted from the channel signals of the first and second audio streams A1, A2 by the object extraction block 1106. While Fig. 6a depicts only a single audio object 30 it is noted that the object extraction block 1106 may extract audio objects corresponding to more than one source and that the following description is applicable to each such extracted audio object.
[081] The pair of microphones 12a-b are positioned on the user device 10 such that the microphones 12a-b are horizontally separated when the user device 10 is in the landscape mode. The microphones 12a-b and 20a-b thus define a horizontal non-linear arrangement (array) of four microphones spaced apart in the horizontal plane. As indicated in Fig. 6a, their respective locations may approximately correspond to corners of a rectangle. The different locations of the microphones 12a-b and 20a-b in the horizontal plane allow a horizontal direction of arrival θ for the audio object 30 (i.e. the sound emitted by the source corresponding to the audio object 30) to be estimated.
[082] The horizontal DOA θ for the audio object 30 may be estimated either from the viewpoint of the binaural capturing device 20 (equivalently the user 1) (i.e. θ = θb) or the user device 10 (i.e. θ = θu). In either case the horizontal DOA θ may be represented as the azimuthal angle to the audio object 30 (in the horizontal plane) relative to a reference direction. For brevity, the terms “horizontal DOA” and “DOA” may in the following be used interchangeably. The reference direction may, as shown in Fig. 6a, correspond to or coincide with the horizontal shooting direction S of the camera 14a of the user device 10. The reference direction S is assumed to point away from the user 1 (e.g. the user 1 is facing the screen of the user device 10)
and coincide with the inter-aural axis of the user 1, located substantially mid-way between the microphones 20a-b of the binaural capturing device 20 (and thus the ears of the user 1). Thus, if the audio object 30 is located to the left of the reference direction / shooting direction S it will be perceived as being to the left from the viewpoint of the user device 10 and the user 1, whereas if the audio object 30 is located to the right of the shooting direction S it will be perceived as being to the right from the viewpoint of the user device 10 and the user 1.
[083] Estimation of the horizontal DOA θ for the audio object 30 may in some implementations be separated into two sub-steps: determining an initial DOA estimate resolving the DOA in a left-right sense but comprising a front-back ambiguity; and determining the final DOA θ by resolving the front-back ambiguity. For the microphone arrangement shown in Fig. 6a, the initial DOA may thus be estimated using the two first microphone signals from the microphones 12a-b of the user device 10 (i.e. using the audio object channel signals extracted or derived from the first channel signals), and obtaining the final DOA θ = θu by resolving the front-back ambiguity using one of the second microphone signals from the microphones 20a-b of the binaural capturing device 20 (i.e. using one of the audio object channel signals extracted or derived from the second channel signals). The initial DOA may alternatively be estimated using the two second microphone signals from the microphones 20a-b of the binaural capturing device 20, and obtaining the final DOA θ = θb by resolving the front-back ambiguity using one of the first microphone signals from the microphones 12a-b of the user device 10.
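The second sub-step may be illustrated with the following Python sketch for the case where the initial DOA stems from a left-right pair such as 20a-b, so that the mirror candidate is π − θ. The far-field assumption, the sign conventions and the function names are assumptions made for the example:

```python
import numpy as np

C = 343.0  # speed of sound in m/s (assumed)

def unit_vector(azimuth):
    # Horizontal-plane unit vector; azimuth measured from the shooting
    # direction S (x = front, y = left).
    return np.array([np.cos(azimuth), np.sin(azimuth)])

def predicted_delay(azimuth, mic_a, mic_b):
    # Far-field arrival-time difference (mic_a minus mic_b) for a plane
    # wave from `azimuth`; microphone positions in meters, same frame.
    d = unit_vector(azimuth)
    return (np.dot(d, mic_b) - np.dot(d, mic_a)) / C

def resolve_front_back(initial_doa, mic_ref, mic_extra, measured_delay):
    # The mirror image of a DOA from a left-right pair is pi - doa.
    # Keep the candidate whose predicted delay to an extra, forward-
    # displaced microphone best matches the measured delay.
    candidates = (initial_doa, np.pi - initial_doa)
    errors = [abs(predicted_delay(a, mic_ref, mic_extra) - measured_delay)
              for a in candidates]
    return candidates[int(np.argmin(errors))]
```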
[084] Fig. 6b depicts the user device 10 capturing the initial media stream M1 while held in an orientation corresponding to the portrait mode. The pair of microphones 12a-b are positioned on the user device 10 such that the microphones 12a-b are vertically separated when the user device 10 is in the portrait mode. The microphones 12a-b and 20a-b thus define a non-linear arrangement (array) of three microphones spaced apart in the horizontal plane (since the microphones 12a-b have the same location in the horizontal plane). As indicated in Fig. 6b, their respective locations may approximately correspond to corners of a triangle, e.g. an isosceles triangle. For the microphone arrangement shown in Fig. 6b, the initial DOA may thus be estimated using the two second microphone signals from the microphones 20a-b of the binaural capturing device 20, and then obtaining the final DOA θ = θb by resolving the front-back ambiguity using one of the first microphone signals from the microphones 12a-b of the user device 10.
[085] In view of the capturing scenarios in Fig. 6a-b, the spatial block 1108 may accordingly estimate the DOA θ based on a selected set of three channel signals selected among the first and second channel signals such that the selected set of channel signals comprises at least one of the first channel signals and at least one of the second channel signals, wherein the selection is based on the orientation data. More specifically, since the layout of the microphones 12a-b on the user device 10 is such that the microphones 12a-b are horizontally separated when the user device is in the landscape mode, in case the orientation data is indicative of the landscape mode, the spatial block 1108 may estimate the DOA θ based on the pair of second channel signals from the microphones 20a-b of the binaural capturing device 20 and one of the first channel signals from one of the microphones 12a-b of the user device 10. Alternatively, the spatial block 1108 may estimate the DOA θ based on the pair of first channel signals from the microphones 12a-b of the user device 10 and one of the second channel signals from the microphones 20a-b of the binaural capturing device 20. On the other hand, in case the orientation data is indicative of the portrait mode, the spatial block 1108 may estimate the DOA θ based on the pair of second channel signals from the microphones 20a-b of the binaural capturing device 20 and one of the first channel signals from one of the microphones 12a-b of the user device 10. The selected set of channel signals may hence correspond to a (strict) sub-set of the first and second channel signals.
[086] It is envisaged that, without a priori knowledge of the size and shape of the head of the user 1, the separation between the microphones 12a-b of the user device 10 may be known with a greater precision than an (assumed) separation between the microphones 20a-b of the binaural capturing device 20. Hence, in some implementations the spatial block 1108 may be configured to, when the user device 10 is in the landscape mode, estimate the DOA θ based on the pair of first channel signals from the microphones 12a-b of the user device 10 and one of the second channel signals from the microphones 20a-b of the binaural capturing device 20. The selection in the case of a landscape orientation may hence be “preconfigured” such that the spatial block 1108 defaults to this selection.
[087] Considering as an illustrative example a pair of microphones located in a horizontal plane, a DOA θ (with front-back ambiguity) for the sound may be estimated from:

θ = arcsin(c·τ / d)

where c is the speed of sound, τ is the estimated time-of-arrival difference, and d is the distance between the pair of microphones. In view of the preceding discussion, if the pair of microphones are the microphones 20a-b of the binaural capturing device 20, the distance d may be a predetermined (i.e. assumed) distance. If the pair of microphones are the microphones 12a-b of the user device 10, the distance d may be determined based on the orientation data and the layout of the microphones 12a-b on the user device 10 (e.g. provided as predetermined layout information).
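By way of example, the relation above may be evaluated as in the following Python sketch, where the time-of-arrival difference τ is found as the lag maximizing a plain cross-correlation between the two (audio object) channel signals; windowing, sub-sample interpolation and more robust estimators are omitted for brevity:

```python
import numpy as np

C = 343.0  # speed of sound in m/s (assumed)

def estimate_delay(sig_a, sig_b, fs, max_delay):
    """Time-of-arrival difference (seconds) of sig_b relative to sig_a,
    searched only over physically possible lags."""
    max_lag = int(max_delay * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.dot(sig_a[max(0, -lag):len(sig_a) - max(0, lag)],
                   sig_b[max(0, lag):len(sig_b) - max(0, -lag)])
            for lag in lags]
    return lags[int(np.argmax(corr))] / fs

def doa_from_pair(sig_a, sig_b, fs, d):
    """DOA theta (with front-back ambiguity) for one horizontal pair of
    microphones separated by distance d, via theta = arcsin(c*tau/d)."""
    tau = estimate_delay(sig_a, sig_b, fs, max_delay=d / C)
    # Clip the argument: noise can push the delay slightly past d/c.
    return np.arcsin(np.clip(C * tau / d, -1.0, 1.0))
```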
[088] If the specific algorithm used to estimate the DOA θ is dependent on the distance between the user device 10 and the binaural capturing device (denoted du-b in Fig. 6a-b), the distance du-b may be a predetermined distance, set by assuming that the user 1 typically holds the user device 10 at a certain distance in front of the face (e.g. 0.3-0.4 m) during the capturing process. The distance information may be supplied as input 1109 to the spatial block 1108 (see Fig. 5).
[089] Estimating the DOA θ based on channel signals from only three of the microphones 12a-b and 20a-b is sufficient and limits the amount of audio data to process. However, the spatial block 1108 may in some implementations be configured to, in case the orientation data is indicative of the landscape mode, estimate the DOA θ based on each of the first channel signals from the microphones 12a-b of the user device 10 and each of the second channel signals from the microphones 20a-b of the binaural capturing device 20. Additionally, the spatial block 1108 may be configured to, in case the orientation data is indicative of the portrait mode, estimate the DOA θ based on each of the first channel signals from the microphones 12a-b of the user device 10 and each of the second channel signals from the microphones 20a-b of the binaural capturing device 20. It is contemplated that basing the estimation of the DOA θ on audio data from additional (e.g. in a sense redundant) channel signals may increase the robustness and accuracy of the estimation.
[090] While it is possible to estimate the horizontal DOA θ by separately resolving the DOA in a left-right sense and the front-back ambiguity, this is by no means necessary and DOA estimation algorithms may, given microphone signals (i.e. audio object channel signals) from a horizontal non-linear arrangement of three (or more) microphones, directly estimate a front-back resolved horizontal DOA θ. Other example algorithms that may be used to estimate the horizontal DOA θ include DSP-based algorithms based on Generalized Cross-Correlation (GCC) or Steered Response Power (SRP), as well as ML-based algorithms trained to estimate (predict) a DOA based on an input set of audio object channel signals and the locations (e.g. coordinates) of the microphones used to capture the channel signals. Some algorithms (e.g. the SRP algorithm and ML-based algorithms) may further allow estimation of horizontal information comprising not only the horizontal DOA θ, but the locations of the audio objects (e.g. horizontal plane coordinates of the audio objects).
[091] In some implementations, the difference in time-of-arrival of the audio object 30 at the microphones may be (expressly) estimated by comparing the audio object channel signals of the audio object 30 from three (or more) of the microphones (e.g. by searching for the inter-channel time delays maximizing the correlation between the audio object channel signals).
The DOA θ may then be estimated using the estimated time-of-arrival differences together with the spatial locations of the microphones capturing the respective channel signals.
[092] Moreover, some DOA estimation algorithms do not rely on the spatial locations (coordinates) of the microphones of the user device and/or the binaural capturing device. Instead, the horizontal DOA θ may be estimated directly from relative time delays between at least three of the captured channel signals. The relative time delays may correspond to time-of-arrival differences of the sound of the audio object at the respective microphones (e.g. the time-of-arrival differences of the audio object 30 between three or more of the microphones 12a-b and 20a-b). The time delays may be estimated by comparing the channel signals from the three (or more) microphones, e.g. by searching for the inter-channel time delays maximizing the correlation between the channel signals or between the audio object channel signals extracted therefrom. The DOA θ may then be estimated from the estimated time delays using a mapping function relating estimated time delays between the channel signals to a DOA θ estimate.
[093] The mapping function may be a predetermined mapping function established for instance in a measurement procedure comprising playing back a test sound in an anechoic room and measuring the relative time delay between the microphones of the user device and the binaural capturing device for a plurality of different directions of arrival of the test sound. The user device and the binaural capturing device may for instance be positioned on a turntable to allow precise control over the angle of the microphones relative to the test sound. This process can be performed for the user device both in a landscape mode and a portrait mode. Thereby, for any given angle, the expected time delay between the microphones for the landscape mode and the portrait mode, respectively, may be established and captured in a mapping function to be used to estimate the DOA θ in the upmixing process. The mapping function may for instance be realized as a look-up table, or as a mathematical function fitted to the measurement data. An analogous approach and measurement procedure could be used to establish respective mapping functions for the user device and the binaural capturing device. Thereby, an initial DOA θ may be estimated from the estimated time delays between (at least) two channel signals captured by (at least) two microphones of the user device or the binaural capturing device, whereafter a front-back ambiguity may be resolved using a channel signal captured by a microphone of the other device. In an upmixing process, the spatial block 1108 may thus select which of the mapping functions to use in accordance with the orientation data.
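A look-up-table realization of such a mapping function might look as follows; the grid spacing and the delay values are placeholders standing in for data measured in the calibration procedure described above:

```python
import numpy as np

# Hypothetical calibration tables: expected inter-microphone delay (s)
# over a grid of DOAs, one table per orientation mode of the user device.
DOA_GRID = np.radians(np.arange(-90, 91, 5))
DELAY_TABLE = {
    "landscape": 4.0e-4 * np.sin(DOA_GRID),  # placeholder values
    "portrait": 5.2e-4 * np.sin(DOA_GRID),   # placeholder values
}

def doa_from_mapping(measured_delay, orientation):
    """Map a measured inter-channel delay to an initial DOA estimate
    using the table matching the orientation mode (look-up with linear
    interpolation; np.interp needs increasing x-values, which sin()
    provides over the [-90, 90] degree grid)."""
    table = DELAY_TABLE[orientation]
    return np.interp(measured_delay, table, DOA_GRID)
```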
[094] In view of the above, regardless of the specific type of estimation algorithm, the spatial block 1108 may accordingly be configured to estimate the DOA θ (and optionally the location) based on differences in time-of-arrival (i.e. time delays) of the sound of the audio object 30 at the respective microphones (e.g. three or more of the microphones 12a-b and 20a-b).
In some implementations, the DOA θ for an audio object may be estimated based on time delays between a set of at least three of the first and second channel signals and based on the orientation data. In some implementations, estimation of the DOA θ for the audio object may further be based on the spatial locations of the microphones capturing the set of at least three channel signals. The microphone locations for the user device may be determined based on the orientation data and on information indicative of the layout of the microphones on the user device. The microphone locations for the binaural capturing device may be determined based on a distance (predetermined or assumed) between the left- and right-channel microphones, and a distance (predetermined, assumed or estimated) between the binaural capturing device and the user device. In some implementations, the DOA θ for an audio object may instead be estimated based on time delays between a set of at least three of the first and second channel signals and based on the orientation data, and using a mapping function mapping the estimated time delays to an estimated DOA θ for the orientation mode indicated in the orientation data.
[095] According to an extended implementation, the upmixer 110 may further be configured to estimate spatial information in the form of height information for each audio object if the user device is in the portrait mode. Fig. 7 depicts by way of example the user device 10 (for illustrational clarity omitting the binaural capturing device 20 and the user 1) in the portrait mode. Due to the layout of the microphones 12a-b on the user device 10, the microphones 12a-b are vertically separated (i.e. separated in a vertical plane). The vertical separation allows the spatial block 1108 to, in addition to the horizontal DOA θ, estimate height information for the audio object 30 based on the first channel signals from the microphones 12a-b. The height information may for instance be estimated in the form of an elevation angle φu as shown in Fig. 7, or optionally as a vertical coordinate representing a height of the audio object above a horizontal plane. The elevation angle φu or coordinate may be estimated in an analogous manner to the horizontal DOA θ, e.g. based on differences in time-of-arrival (i.e. time delays) of the sound of the audio object 30 at the respective microphones 12a-b. For instance, a mapping function may be used to map time delays to an elevation angle φu, analogous to the mapping function discussed above for estimating a horizontal DOA θ. The front-back ambiguity may as discussed above be resolved during the estimation of the horizontal DOA θ and need hence not be separately considered for the height information estimation. It is further noted that the spatial block 1108 may implement an algorithm (e.g. GCC, SRP or an ML-based algorithm) simultaneously estimating the horizontal DOA θ and the height information (optionally the horizontal and vertical plane coordinates) for the audio object based on the first channel signals from the microphones 12a-b of the user device 10 and the second channel signals from the binaural microphones 20a-b of the binaural capturing device 20.
[096] The above example implementations of the spatial block 1108 have been based on an assumption that the relative positions of the user device 10 and the binaural capturing device 20 are constant during the capturing process. According to an extended example implementation, the upmixing process may be adapted to accommodate dynamic capturing conditions wherein the spatial relationship between the user device 10 and the binaural capturing device 20 changes during the capturing of the initial media stream. The upmixer 110 may in this case be configured to obtain sensor data indicative of a spatial relationship between the user device 10 and the binaural capturing device 20 during the capturing of the initial media stream M1, wherein the estimating of the spatial information is further based on the sensor data. The sensor data may form part of the input 1109 shown in Fig. 5. The types of sensor data may be various, and in particular include non-acoustic data, such as data from orientation sensors and motion sensors (e.g. based on gyroscopes or accelerometers) of the user device 10 and the binaural capturing device 20. When changes from the initial spatial relationship are detected, the spatial block 1108 may adapt the distance between the binaural capturing device 20 and the user device 10 and adapt the estimation of the spatial information accordingly. In one example implementation, the user device 10 may further comprise a front-side camera 14b (see Fig. 2), wherein a distance between the user device 10 and the binaural capturing device 20 (e.g. du-b in Fig. 6a-b) may be estimated from a focusing distance of the front-side camera 14b during the capturing of the initial media stream.
[097] Referring again to Fig. 5, the upmixer 110 is further configured to pan each audio object extracted from the input audio stream A” in accordance with the spatial information S to one or more channels of a multichannel format. The upmixer 110 may further be configured to pan any residual signal R to one or more channels of the multichannel format. While the audio objects are panned in accordance with the spatial information S, the residual signals R are not associated with any spatial information and are therefore panned to one or more predetermined channels of the multichannel format, e.g. in accordance with a predetermined panning rule. The output of the panning process is the upmixed audio stream A3. The multichannel format may be a speaker channel-based format such as 3.1.2, 5.1, 7.1, 5.1.2, 7.1.4, or a speaker-independent channel-based format such as First Order Ambisonics (FOA).
[098] The panning process is implemented by a panning block 1110, of which an implementation example is illustrated in Fig. 8.
[099] The panning block 1110 receives as input the extracted audio objects O, the spatial information S and any residual signals R. In view of the above, depending on the form of the output O of the object extraction block 1106, the panning block 1110 may receive, for each audio object, a respective set of N+2 audio object channel signals, or the N+2 channel signals of the input audio stream A” and metadata (e.g. soft masks, individual or shared) allowing N+2 audio object channel signals to be derived from the N+2 channel signals. In the latter case, the panning block 1110 may be configured to, prior to panning, derive the N+2 audio object channel signals for each audio object from the N+2 channel signals of the input audio stream A” using the metadata.
[100] The panning block 1110 comprises a first sub-block 1112 (audio object panning sub-block) implementing the panning of the audio objects O. The audio objects O may be panned to speaker channels of the multichannel format using common panning laws. For instance, if the multichannel format is a speaker channel-based immersive multi-channel format, the panning block 1110 may for each (common) audio object, based on the horizontal DOA θ and (if available) the height information (e.g. elevation angle φu), locate the nearest two speakers in the (known) speaker layout and pan the audio object to these two speakers according to the respective distances to the speakers. If no height information is available for an audio object, the audio object may be assumed to be located on the horizontal plane and not panned to any height channel. If not already done earlier in the processing chain, the audio objects may as part of the panning process be subjected to an inverse transform (e.g. inverse-STFT) to transform the extracted audio objects back into the time domain (assuming the audio objects input to the panning process are represented in the frequency domain).
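A common panning law of this kind may be sketched as follows for the horizontal-only case; the constant-power weighting and the nearest-pair selection are choices made for the example rather than mandated by the disclosure:

```python
import numpy as np

def pan_to_nearest_pair(doa, speaker_angles):
    """Constant-power pan of an audio object to the two speakers nearest
    its horizontal DOA; angles in radians, one gain per speaker."""
    # Wrapped angular distance from the object to every speaker.
    diff = np.angle(np.exp(1j * (np.asarray(speaker_angles) - doa)))
    order = np.argsort(np.abs(diff))
    a, b = order[0], order[1]                  # two nearest speakers
    gains = np.zeros(len(speaker_angles))
    # Weight each speaker by the distance to the *other* one, then
    # normalize so the squared gains sum to one (constant power).
    wa, wb = abs(diff[b]), abs(diff[a])
    norm = np.hypot(wa, wb)
    if norm == 0.0:
        gains[a] = 1.0                         # degenerate: coincident
    else:
        gains[a], gains[b] = wa / norm, wb / norm
    return gains
```

A production implementation would typically use vector base amplitude panning (VBAP) over speaker pairs or, with height channels, speaker triplets.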
[101] As discussed above, a representation of each common audio object may be extracted from each channel signal. Therefore, the panning block 1110 (e.g. the first sub-block
1112) may be configured to select, for each common audio object, one representation among the N+2 representations of the common audio object to be used for the panning process. The other non-selected representations may be discarded. That is, for each common audio object, only the audio object channel signal of the selected representation will be panned. In some implementations, the selection of the representation may be based on the spatial information estimated for the corresponding common audio object. More specifically, the selected representation of the common audio object may be the representation extracted from the channel signal captured by the microphone closest to the direction of arrival estimated for the common audio object. To illustrate, considering the example in Fig. 6a, the microphone 12b of the user device 10 is closer to the audio object 30 than any of the other microphones 12a, 20a-b. If the estimated spatial information indicates the locations of the audio objects (e.g. horizontal plane coordinates of the audio objects), the selected representation may be the representation extracted from the channel signal captured by the microphone closest to the estimated location of the audio object. However, other implementations are also possible. For instance, the selected
representation may be the representation extracted from the channel signal with the highest signal-to-noise ratio.
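The closest-microphone selection rule may be illustrated with the following sketch, which picks the channel whose microphone position projects furthest onto the estimated direction of arrival; names and coordinate conventions are illustrative:

```python
import numpy as np

def select_representation(mic_positions, doa_direction):
    """Index of the channel whose microphone lies closest to the
    estimated DOA, taken as the largest projection of the microphone
    position onto the DOA unit vector.

    mic_positions: (n_mics, 3) world-frame microphone coordinates
    doa_direction: (3,) unit vector toward the audio object
    """
    projections = mic_positions @ doa_direction
    return int(np.argmax(projections))
```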
[102] Optionally, certain audio objects may be panned according to predetermined panning rules. For instance, the speech of the wearer of the binaural capturing device (e.g. the user 1 wearing the binaural capturing device 20) may be panned to one or more predetermined channels in the speaker layout, such as a center front channel or left and right rear channels (e.g. depending on a user preference). Speech of the user 1 may be identified using known techniques such as machine-learning based voice recognition algorithms, trained to recognize the voice of the user.
[103] The panning block 1110 further comprises a second and third sub-block 1116, 1118 (residual signal panning blocks) configured to pan each residual signal R to one or more channels of the multichannel format. As illustrated in Fig. 8, the residual signals R of the first channel signals (from the microphones of the user device) and the second channel signals (from the microphones of the binaural capturing device) may be panned separately. For instance, the second sub-block 1116 may pan the residual signals R of the first channel signals to a first set of channels (e.g. one or more height channels) of the multichannel format and the third sub-block 1118 may pan the residual signals R of the second channel signals to a second set of channels of the multichannel format (e.g. to the center front channel, or evenly to the available horizontal channels). As further indicated in Fig. 8, decorrelation may be applied to the respective residual signals R prior to panning. Optionally, the first channel signals from the microphones of the user device may be beamformed (optional beamforming block 1114) to enhance height objects prior to decorrelation and panning.
[104] The panned audio objects (i.e. the panned representations of the common audio objects) and any residual signals R are subsequently summed, channel-by-channel, by the mixing block 1120 to generate the upmixed audio stream A3 in the speaker channel-based multichannel format. If the multichannel format of the upmixed audio stream A3 is a speaker-independent channel-based format such as FOA, the panning may be implemented by first panning the audio objects O as set out above, and thereafter panning each channel of the speaker channel-based multichannel format using an FOA panner providing channels [W, X, Y, Z]:
W = S/√2
X = S·cosθ·cosφ
Y = S·sinθ·cosφ
Z = S·sinφ

where θ is the azimuth angle, and φ is the elevation angle of the speaker in the speaker channel-based multichannel format. However, it is also possible to directly pan the audio objects and the residual using an FOA panner as set out above, but where instead θ is the azimuth angle, and φ is the elevation angle for each audio object. In either case, any residual signals R may be added to channel W.
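The FOA equations above translate directly into code; the following sketch encodes one panned signal S into the four FOA channels, with the normalization taken exactly as stated in the equations:

```python
import numpy as np

def foa_pan(signal, azimuth, elevation):
    """Encode a time-domain signal into first-order Ambisonics channels
    [W, X, Y, Z] for a source at the given azimuth and elevation
    (radians), per the panning equations above."""
    w = signal / np.sqrt(2.0)
    x = signal * np.cos(azimuth) * np.cos(elevation)
    y = signal * np.sin(azimuth) * np.cos(elevation)
    z = signal * np.sin(elevation)
    return np.stack([w, x, y, z])  # shape: (4, n_samples)
```

Residual signals without spatial information would, per the above, be added to the W channel only.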
[105] In the above, the capturing and upmixing processes have mainly been described with reference to the user device 10 comprising two microphones 12a-b. Thus, the horizontally and vertically separated microphones have two microphones in common, i.e. the horizontally and vertically separated microphones of the user device 10 refer to the same microphones 12a-b. The present disclosure is however not limited to a user device with such a two-microphone layout, but is more generally applicable also to a user device with other microphone layouts.
[106] Figs. 9a-b depict, in a top-down view and a rear-side view, respectively, a user device 10' of a further example implementation, comprising N = 3 microphones 12a-c. Hence, in addition to a first microphone 12a positioned at a bottom portion of the user device 10' and a second microphone 12b positioned at a top portion of the user device 10', the user device 10' comprises a third microphone 12c positioned centrally at a rear side of the user device 10', e.g. adjacent to the camera 14a. As seen in the figures, the third microphone 12c is separated from each of the first and second microphones 12a-b along the width dimension W. Hence, even when the user device 10' is in the landscape orientation, the user device 10' comprises a pair of vertically separated microphones (e.g. 12a and 12c, or 12b and 12c). As may be understood from the preceding discussion, this enables the upmixer 110 (e.g. the spatial block 1108) to estimate height information also when the user device 10' is in the landscape mode.
[107] Accordingly, Figs. 9a-b represent an implementation of a user device wherein the layout of the set of N (e.g. N = 3) microphones on the user device is such that the N microphones comprise at least two microphones (a "first subset of microphones") separated along a height dimension of the user device and at least one microphone (a "further subset of microphones") separated from at least one of the first subset of microphones along a width dimension of the user device. Hence, in the landscape orientation of the user device, the horizontal DOA may be estimated based on at least one first channel signal captured by at least one of the first subset of microphones (e.g. 12a and/or 12b) and the pair of second channel signals captured by the binaural capturing device, or based on at least two first channel signals captured by at least two of the first subset of microphones (e.g. 12a and 12b) and at least one of the pair of second channel signals captured by the binaural capturing device. Further, in the landscape orientation, height information may also be estimated based on a first channel signal captured by a microphone of the first subset of microphones and a first channel signal captured by a microphone of the further subset of microphones (the microphone of the first subset and the microphone of the further subset defining a pair of the vertically separated microphones, e.g. 12a and 12c, or 12b and 12c). As shown in Fig. 9b, the third microphone 12c may further be separated from one or both of the microphones 12a and 12b along the height dimension H. Thus, the third microphone 12c may optionally form part also of the first subset of microphones and be used to estimate the horizontal DOA in the landscape orientation of the user device 10'.
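One way to express the orientation-dependent choice of microphone pairs for the three-microphone layout of Figs. 9a-b is sketched below; the microphone labels are illustrative stand-ins for the microphones 12a-c:

```python
def microphone_roles(orientation):
    """Return which microphone pairs of the three-mic device support height
    estimation and which microphones contribute to horizontal DOA.

    "bottom", "top" and "rear_center" stand in for microphones 12a, 12b
    and 12c respectively; the mapping follows the discussion above.
    """
    if orientation == "portrait":
        return {
            # 12a and 12b are vertically separated in the portrait orientation.
            "height_pairs": [("bottom", "top")],
            # Any device microphone may pair with the headset channels.
            "horizontal_mics": ["bottom", "top", "rear_center"],
        }
    # Landscape: 12c provides the vertical baseline against 12a or 12b,
    # while 12a/12b (optionally together with 12c) serve the horizontal DOA.
    return {
        "height_pairs": [("bottom", "rear_center"), ("top", "rear_center")],
        "horizontal_mics": ["bottom", "top", "rear_center"],
    }
```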
[108] Fig. 12 is a flow chart of an implementation of a method for generating an audio-visual media stream (e.g. M2) comprising an upmixed audio stream in a multichannel format (e.g. A3).
[109] At step S1, an initial media stream (e.g. M1) is captured by a mobile user device (e.g. user device 10) operated by a user (e.g. user 1) and a head-mounted binaural capturing device (e.g. binaural capturing device 20) worn by the user and coupled to the user device. The initial media stream comprises a video stream (e.g. video stream V) captured by a camera of the user device (e.g. camera 14a or 14b of the user device 10). The initial media stream further comprises a first audio stream (e.g. A1) comprising a set of N ≥ 2 first channel signals captured by a set of N microphones of the user device (e.g. microphones 12a-b). The initial media stream further comprises a second audio stream (e.g. A2) comprising a pair of second channel signals captured by a left- and a right-channel microphone of the binaural capturing device (e.g. microphones 20a-b of the binaural capturing device 20). The initial media stream is captured while the user device is held in front of the binaural capturing device in an orientation corresponding to a landscape or a portrait mode. The camera may in particular face in a substantially horizontal shooting direction.
[110] At step S2, the channel signals of the first and second audio streams may be subjected to a synchronization process (e.g. by synchronization block 1102).
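The synchronization of step S2 may, for example, be realized by cross-correlation. A minimal sketch, assuming a common sample rate and a bounded lag search window:

```python
import numpy as np

def estimate_lag(reference, other, max_lag):
    """Delay of `other` relative to `reference` (in samples), taken as the
    peak of the cross-correlation restricted to plausible lags."""
    corr = np.correlate(other, reference, mode="full")
    lags = np.arange(-len(reference) + 1, len(other))
    mask = np.abs(lags) <= max_lag
    return int(lags[mask][np.argmax(corr[mask])])

def synchronize(reference, other, max_lag=4800):
    """Shift `other` to align with `reference`, zero-padding the ends."""
    lag = estimate_lag(reference, other, max_lag)
    if lag > 0:   # `other` arrives late: advance it
        return np.concatenate([other[lag:], np.zeros(lag)])
    return np.concatenate([np.zeros(-lag), other[:len(other) + lag]])
```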
[111] At step S3, the (synchronized) channel signals of the first and second audio streams may be subjected to at least one of leveling and EQ (e.g. by leveling and/or EQ block 1104).
[112] At step S4, the channel signals of the first and second audio streams are processed to extract a set of audio objects (e.g. O). The extraction may be performed by object extraction block 1106. The processing may further comprise extracting a set of residual signals (e.g. R).
[113] At step S5, orientation data is obtained indicative of whether, during the capturing of the initial media stream, the user device is in the landscape or the portrait mode.
[114] At step S6, spatial information (e.g. S) is estimated, comprising, for each audio object, a horizontal DOA (e.g. θ) estimated based on a set of at least three channel signals of the first and second channel signals and on the orientation data. Additionally, depending on the orientation mode of the user device, height information (e.g. an elevation angle φ) may be estimated for each audio object. The spatial information may be estimated by the spatial block 1108.
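For a single microphone pair, the geometry underlying the horizontal DOA estimate can be sketched as a far-field time-difference-of-arrival computation. The actual spatial block 1108 fuses at least three channels and the orientation data; this simplified pairwise sketch omits that fusion:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, at room temperature

def pairwise_doa(sig_a, sig_b, mic_distance, sample_rate=48000):
    """Far-field azimuth from one microphone pair: sin(θ) = c·τ / d,
    with τ the time difference of arrival between the two signals."""
    max_lag = int(np.ceil(mic_distance / SPEED_OF_SOUND * sample_rate))
    corr = np.correlate(sig_b, sig_a, mode="full")
    lags = np.arange(-len(sig_a) + 1, len(sig_b))
    mask = np.abs(lags) <= max_lag
    tau = lags[mask][np.argmax(corr[mask])] / sample_rate
    ratio = np.clip(SPEED_OF_SOUND * tau / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(ratio))  # angle relative to the pair's broadside
```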
[115] At step S7, each audio object is panned in accordance with the spatial information to one or more channels of the multichannel format to generate the upmixed audio stream.
[116] At step S8, the video stream and the upmixed audio stream are combined (e.g. by the combiner 120) to generate the audio-visual media stream.
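Taken together, steps S2-S8 may be summarized as a processing skeleton; the stage callables below are hypothetical placeholders, as the present disclosure does not prescribe a particular API:

```python
def generate_audio_visual_stream(video, first_chs, second_chs, orientation, blocks):
    """Skeleton of steps S2-S8 applied to an already captured stream (S1).

    `blocks` bundles hypothetical stage implementations: synchronize,
    level_eq, extract, estimate_spatial, pan and mux.
    """
    chs = blocks["synchronize"](first_chs, second_chs)               # S2
    chs = blocks["level_eq"](chs)                                    # S3
    objects, residuals = blocks["extract"](chs)                      # S4
    spatial = blocks["estimate_spatial"](objects, chs, orientation)  # S5-S6
    upmixed = blocks["pan"](objects, spatial, residuals)             # S7
    return blocks["mux"](video, upmixed)                             # S8
```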
[117] The person skilled in the art realizes that the present invention by no means is limited to the embodiments described above. On the contrary, many modifications and variations are possible within the scope of the appended claims. For example, while reference in the above has been made to capturing the video stream using the main camera 14a on the backside of the user device 10, a video stream may also be captured using a front-side camera 14b (i.e. a "face" or "selfie" camera, shown in Fig. 2) of the user device 10. The above description applies correspondingly to such a capturing scenario, albeit with the difference that the front-back directions, and optionally the left-right directions (e.g. depending on user preferences and whether the user device stores mirrored video data or not), respectively, will be mirrored or flipped. Moreover, as discussed above in connection with Fig. 9b, the third microphone 12c of the user device 10' may optionally form part also of the first subset of microphones and be used to estimate the horizontal DOA in the landscape orientation of the user device 10'. Hence, in the landscape orientation of the user device 10' of Figs. 9a-b, the horizontal DOA may be estimated based on at least one first channel signal captured by at least one of the microphones 12a-c and the pair of second channel signals captured by the binaural capturing device, or based on at least two first channel signals captured by at least two of the microphones 12a-c (e.g., 12a and 12b, or 12a and 12c, or 12b and 12c) of the user device 10' and at least one of the pair of second channel signals captured by the binaural capturing device. Further, when the user device 10' is in the portrait orientation, height information for each audio object may be estimated based on a pair of first channel signals captured by any pair of microphones of the user device 10' being vertically separated in the portrait orientation, e.g., 12a and 12b, or 12a and 12c, or 12b and 12c. In either case, the height information (e.g., an elevation angle φ) may be estimated in a manner analogous to the discussion of Fig. 7, e.g., based on the first microphone signals from the microphones 12a and 12b, from the microphones 12a and 12c, or from the microphones 12b and 12c. Additionally, in the above, the capturing and upmixing process has been described mainly in relation to a user device comprising a set of microphones with a layout on the user device such that the user device comprises horizontally separated microphones when the user device is in a landscape mode (first mode) and vertically separated microphones when the user device is in a portrait mode (second mode). However, other microphone layouts are also possible, such as a layout where the first mode is the portrait mode and the second mode is the landscape mode. For instance, a user device may comprise a pair of microphones separated along a width dimension
of the user device, and thus positioned on the user device such that the pair of microphones are horizontally separated when the user device is in a first mode being a portrait mode and vertically separated when the user device is in a second mode being a landscape mode. The spatial information estimation and the spatial block 1108 may in this case proceed as set out above, but with the difference that the horizontal DOA may be estimated for the portrait mode and the height information may be estimated for the landscape mode. The present disclosure is further applicable to a user device comprising more than three microphones, such as four (N = 4) microphones. Such a user device may for example comprise a first pair of microphones positioned like the aforementioned first and second microphones 12a-b (e.g. a first subset of microphones), and a second pair of microphones formed of a third and a fourth microphone (e.g. a further subset of microphones) different from the first and second microphones 12a-b. The second pair (further subset) of microphones may be positioned on the user device to be separated along the width dimension of the user device such that the second pair of microphones are vertically separated when the user device is in the landscape mode, wherein the microphone signals captured by the second pair of microphones may be used by the spatial block 1108 to estimate the height information for extracted audio objects if the orientation data is indicative of the landscape mode. Conversely, one or more of the first microphone signals captured by one or more of the second pair (further subset) of microphones may be used by the spatial block 1108, together with the second channel signal(s) captured by the binaural capturing device, to estimate the horizontal DOA for extracted audio objects if the orientation data is indicative of the portrait mode.
[118] Systems and methods disclosed in the present application may be implemented as software, firmware, hardware or a combination thereof. In a hardware implementation, the division of tasks does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation.
[119] The computer hardware may for example be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, an AR/VR wearable, an automotive infotainment system, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that computer hardware. Further, the present disclosure shall relate to any collection of computer hardware that individually or jointly execute instructions to perform any one or more of the concepts discussed herein.
[120] Certain or all components may be implemented by one or more processors that accept computer-readable (also called machine-readable) code containing a set of instructions
that when executed by one or more of the processors carry out at least one of the methods described herein. Any processor capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken is included. Thus, one example is a typical processing system (e.g., computer hardware) that includes one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system further may include a memory subsystem including a hard drive, SSD, RAM and/or ROM. A bus subsystem may be included for communicating between the components. The software may reside in the memory subsystem and/or within the processor during execution thereof by the computer system.
[121] The one or more processors may operate as a standalone device or may be connected, e.g., networked to other processor(s). Such a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
[122] The software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to a person skilled in the art, the term computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, physical (non-transitory) storage media in various forms, such as EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Further, it is well known to the skilled person that communication media (transitory) typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
[123] Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the disclosure discussions utilizing terms such as “processing”, “computing”, “calculating”, “determining”, “analyzing” or the like, refer to the action and/or processes of a computer hardware or computing system, or similar electronic computing devices, that manipulate and/or transform data represented as physical, such as electronic, quantities into other data similarly represented as physical quantities.
[124] Various aspects of the present disclosure may be appreciated from the following Enumerated Example Embodiments (EEEs):
[125] EEE 1. A method for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format, the method comprising: capturing an initial media stream by a mobile user device operated by a user and a head-mounted binaural capturing device worn by the user and coupled to the user device, the initial media stream comprising: a video stream captured by a camera of the user device, a first audio stream comprising a set of
N ≥ 2 first channel signals captured by a set of N microphones of the user device, and a second audio stream comprising a pair of second channel signals captured by a left- and a right-channel microphone of the binaural capturing device, wherein while capturing the initial media stream, the user device is held in front of the binaural capturing device in an orientation corresponding to a landscape or a portrait mode; processing the channel signals of the first and second audio streams to extract a set of audio objects; obtaining orientation data indicative of whether, during the capturing of the initial media stream, the user device is in the landscape or the portrait mode; estimating spatial information comprising, for each audio object, a horizontal direction of arrival estimated based on a set of at least three channel signals of the first and second channel signals and on the orientation data; panning each audio object in accordance with the spatial information to one or more channels of the multichannel format to generate the upmixed audio stream; and combining the video stream and the upmixed audio stream to generate the audio-visual media stream.
[126] EEE 2. The method according to EEE 1, wherein a layout of the set of microphones on the user device is such that the N microphones of the user device comprises microphones being horizontally separated when the user device is in a first mode and microphones being vertically separated when the user device is in a second mode, wherein the first mode is the landscape mode and the second mode is the portrait mode or vice versa, wherein, if the orientation data is indicative of the first mode, the horizontal direction of arrival for each audio object is estimated based on a set of at least three channel signals of: the first channel signals captured by the horizontally separated microphones of the user device and the second channel signals, and wherein, if the orientation data is indicative of the second mode, the spatial information further comprises, for each audio object, height information estimated based on a pair of first channel signals captured by a pair of the vertically separated microphones of the user device.
[127] EEE 3. The method according to EEE 2, wherein, if the orientation data is indicative of the second mode, the horizontal direction of arrival for each audio object is estimated based on: the pair of second channel signals captured by the microphones of the binaural capturing device and at least one first channel signal captured by at least one of the microphones of the user device.
[128] EEE 4. The method according to EEE 3, wherein said first mode is the landscape mode and said second mode is the portrait mode.
[129] EEE 5. The method according to EEE 4, wherein the user device has a height dimension and a width dimension, and wherein the horizontally separated microphones and the vertically separated microphones are separated along the height dimension of the user device.
[130] EEE 6. The method according to EEE 5, wherein the horizontally and vertically separated microphones of the user device each comprises a first microphone positioned at a bottom portion of the user device and a second microphone positioned at a top portion of the user device.
[131] EEE 7. The method according to any one of EEEs 3-6, wherein the horizontally and vertically separated microphones have at least one microphone in common.
[132] EEE 8. The method according to EEE 7, wherein the horizontally and vertically separated microphones are the same microphones.
[133] EEE 9. The method according to any one of the preceding EEEs, wherein, for each audio object of the set of audio objects, a representation of the audio object is extracted from each of the first and second channel signals, and wherein the method further comprises, for each audio object, selecting, among the representations of the audio object, one representation of the audio object to be used for the panning process, wherein the selection is based on the spatial information estimated for said audio object.
[134] EEE 10. The method according to EEE 9, wherein for each audio object, the representation of the audio object extracted from the channel signal captured by the microphone closest to the direction of arrival estimated for the audio object is selected to be used for the panning process.
[135] EEE 11. The method according to any one of the preceding EEEs, wherein processing the channel signals of the first and second audio streams further comprises extracting a set of residual signals, wherein the method further comprises panning each residual signal to one or more channels of the multichannel format, wherein each residual signal is panned to one or more predetermined channels of the multichannel format.
[136] EEE 12. The method according to EEE 11, further comprising decorrelating the residual signals prior to the panning.
[137] EEE 13. The method according to EEE 12, wherein the residual signals extracted from the first channel signals are panned to a first set of one or more channels of the multichannel format, and the residual signals extracted from the second channel signals are panned to a second set of one or more channels of the multichannel format different from the first set.
[138] EEE 14. The method according to any one of the preceding EEEs, wherein the set of audio objects is extracted using a machine learning algorithm, such as a neural network, or a digital signal processing algorithm, or a combination thereof.
[139] EEE 15. The method according to any one of the preceding EEEs, further comprising obtaining sensor data indicative of a spatial relationship between the user device and the binaural capturing device during the capturing of the initial media stream, wherein the estimating of the spatial information is further based on the sensor data.
[140] EEE 16. The method according to EEE 15, wherein the sensor data indicates a distance between the user device and the binaural capturing device, wherein the user device comprises a front-side camera, and wherein the distance is obtained from a focusing distance of the front-side camera during the capturing of the initial media stream.
[141] EEE 17. The method according to any one of the preceding EEEs, wherein the processing of the channel signals of the first and second audio streams comprises synchronizing the channel signals of the first and second audio streams prior to extracting the set of audio objects.
[142] EEE 18. The method according to any one of the preceding EEEs, wherein the processing of the channel signals of the first and second audio streams comprises applying at least one of leveling and equalization to channel signals of the first and second audio streams prior to extracting the set of audio objects.
[143] EEE 19. The method according to any one of the preceding EEEs, wherein the multichannel format is one of 5.1, 7.1, 5.1.2, 7.1.4, or First Order Ambisonics (FOA).
[144] EEE 20. The method according to any one of the preceding EEEs, wherein the method is performed by the user device and wherein the binaural audio signal is received from the binaural capturing device via a wired or wireless connection.
[145] EEE 21. The method according to any one of the preceding EEEs, wherein the camera capturing the video stream is a rear-side camera or a front-side camera.
[146] EEE 22. The method according to any one of the preceding EEEs, wherein the user device is a mobile phone or a tablet computer, and wherein the binaural capturing device is a headset.
[147] EEE 23. A computer program product comprising computer program code to perform, when executed by a processing device, the method according to any of EEEs 1-22.
[148] EEE 24. A mobile user device for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format, the user device comprising: a camera and a set of N ≥ 2 microphones; an input interface configured to receive an initial media stream comprising: a video stream captured by the camera of the user device, a first audio stream comprising a set of N first channel signals captured by the set of microphones, and a second audio stream comprising a pair of second channel signals captured by a left- and right-channel microphone of a binaural capturing device for mounting on a head of a user, wherein the initial media stream is captured while the binaural capturing device is worn by a user of the user device and the user device is held in front of the binaural capturing device in an orientation corresponding to a landscape or a portrait mode; an upmixer configured to: process the channel signals of the first and second audio streams to extract a set of audio objects; obtain orientation data indicating whether, during the capturing of the initial media stream, the user device is in the landscape or the portrait mode; estimate spatial information comprising, for each audio object, a horizontal direction of arrival estimated based on a set of at least three channel signals of the first and second channel signals and on the orientation data; and pan each audio object in accordance with the spatial information to one or more channels of the multichannel format to generate the upmixed audio stream; and a combiner configured to combine the video stream and the upmixed audio stream to generate the audio-visual media stream.
[149] EEE 25. A system for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format, the system comprising: the user device according to EEE 24; and a binaural capturing device for mounting on a head of a user and comprising a left- and a right-channel microphone.
[150] It should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention. Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention, and form different embodiments, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.
[151] Furthermore, some of the embodiments are described herein as a method or combination of elements of a method that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with instructions for
carrying out such a method or element of a method forms a means for carrying out the method or element of a method. Note that when the method includes several elements, e.g., several steps, no ordering of such elements is implied, unless specifically stated. Furthermore, an element described herein of an apparatus embodiment is an example of a means for carrying out the function performed by the element for the purpose of carrying out the embodiments of the invention. In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
[152] Thus, while specific embodiments of the invention have been described, those skilled in the art will recognize that other and further modifications may be made thereto, and it is intended to claim all such changes and modifications as falling within the scope of the invention.
Claims
1. A method for generating an audio-visual media stream comprising an upmixed audio stream in a multichannel format, the method comprising: capturing an initial media stream by a mobile user device operated by a user and a head-mounted binaural capturing device worn by the user and coupled to the user device, the initial media stream comprising: a video stream captured by a camera of the user device, a first audio stream comprising a set of N ≥ 2 first channel signals captured by a set of N microphones of the user device, and a second audio stream comprising a pair of second channel signals captured by a left- and a right-channel microphone of the binaural capturing device, wherein while capturing the initial media stream, the user device is held in front of the binaural capturing device in an orientation corresponding to a landscape or a portrait mode; processing the channel signals of the first and second audio streams to extract a set of audio objects; obtaining orientation data indicative of whether, during the capturing of the initial media stream, the user device is in the landscape or the portrait mode; estimating spatial information comprising, for each audio object, a horizontal direction of arrival estimated based on a set of at least three channel signals of the first and second channel signals and on the orientation data; panning each audio object in accordance with the spatial information to one or more channels of the multichannel format to generate the upmixed audio stream; and combining the video stream and the upmixed audio stream to generate the audio-visual media stream.
2. The method according to claim 1, wherein a layout of the set of microphones on the user device is such that the N microphones of the user device comprises microphones being horizontally separated when the user device is in a first mode and microphones being vertically separated when the user device is in a second mode, wherein the first mode is the landscape mode and the second mode is the portrait mode or vice versa, wherein, if the orientation data is indicative of the first mode, the horizontal direction of arrival for each audio object is estimated based on a set of at least three channel signals of: the
first channel signals captured by the horizontally separated microphones of the user device and the second channel signals, and wherein, if the orientation data is indicative of the second mode, the spatial information further comprises, for each audio object, height information estimated based on a pair of first channel signals captured by a pair of the vertically separated microphones of the user device.
3. The method according to claim 2, wherein, if the orientation data is indicative of the second mode, the horizontal direction of arrival for each audio object is estimated based on: the pair of second channel signals captured by the microphones of the binaural capturing device and at least one first channel signal captured by at least one of the microphones of the user device.
4. The method according to claim 3, wherein said first mode is the landscape mode and said second mode is the portrait mode.
5. The method according to claim 4, wherein the user device has a height dimension and a width dimension, and wherein the horizontally separated microphones are separated along the height dimension of the user device and the vertically separated microphones are separated along the height dimension of the user device.
6. The method according to claim 5, wherein the horizontally separated microphones of the user device and the vertically separated microphones of the user device each comprises a first microphone positioned at a bottom portion of the user device and a second microphone positioned at a top portion of the user device.
7. The method according to any one of claims 3-6, wherein the horizontally separated microphones of the user device and the vertically separated microphones of the user device have at least one microphone in common.
8. The method according to any one of the preceding claims, wherein, for each audio object of the set of audio objects, a representation of the audio object is extracted from each of the first and second channel signals, and wherein the method further comprises, for each audio object, selecting, among the representations of the audio object, one representation of the audio object to be used for the panning process, wherein the selection is based on the spatial information estimated for said audio object.
9. The method according to any one of the preceding claims, wherein processing the channel signals of the first and second audio streams further comprises extracting a set of residual signals, wherein the method further comprises panning each residual signal to one or more channels of the multichannel format, wherein each residual signal is panned to one or more predetermined channels of the multichannel format.
10. The method according to claim 9, wherein the residual signals extracted from the first channel signals are panned to a first set of one or more channels of the multichannel format, and the residual signals extracted from the second channel signals are panned to a second set of one or more channels of the multichannel format different from the first set.
11. The method according to any one of the preceding claims, wherein the set of audio objects is extracted using a machine learning algorithm, such as a neural network, or a digital signal processing algorithm, or a combination thereof.
12. The method according to any one of the preceding claims, further comprising obtaining sensor data indicative of a spatial relationship between the user device and the binaural capturing device during the capturing of the initial media stream, wherein the estimating of the spatial information is further based on the sensor data.
13. The method according to claim 12, wherein the sensor data indicates a distance between the user device and the binaural capturing device, wherein the user device comprises a front-side camera, and wherein the distance is obtained from a focusing distance of the front- side camera during the capturing of the initial media stream.
14. The method according to any one of the preceding claims, wherein the processing of the channel signals of the first and second audio streams comprises synchronizing the channel signals of the first and second audio streams prior to extracting the set of audio objects.
15. The method according to any one of the preceding claims, wherein the processing of the channel signals of the first and second audio streams comprises applying at least one of leveling and equalization to channel signals of the first and second audio streams prior to extracting the set of audio objects.
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN PCT/CN2024/075472 | 2024-02-02 | | |
| US 63/558,779 | 2024-02-28 | 2024-02-28 | |
| EP 24173857.4 | 2024-05-02 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025166300A1 (en) | 2025-08-07 |
Family
ID=94733114
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2025/014205 (WO2025166300A1, pending) | Method for generating an audio-visual media stream | 2024-02-02 | 2025-01-31 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025166300A1 (en) |