EP4037339A1 - Selection of audio channels based on prioritization - Google Patents

Selection of audio channels based on prioritization

Info

Publication number
EP4037339A1
Authority
EP
European Patent Office
Prior art keywords
audio
audio channels
channels
channel
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP21154652.8A
Other languages
German (de)
French (fr)
Inventor
Lasse Juhani Laaksonen
Mikko-Ville Laitinen
Arto Juhani Lehtiniemi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to EP21154652.8A
Publication of EP4037339A1
Legal status: Withdrawn


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 3/00 Systems employing more than two channels, e.g. quadraphonic
    • H04S 3/008 Systems employing more than two channels, e.g. quadraphonic, in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/03 Aspects of down-mixing multi-channel audio to configurations with lower numbers of playback channels, e.g. 7.1 -> 5.1
    • H04S 2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field

Definitions

  • Embodiments of the present disclosure relate to audio. Some enable the distribution of common content for rendering to both advanced audio output devices and less advanced audio output devices.
  • Advanced audio output devices are capable of rendering multiple received audio channels as different spatially positioned audio sources.
  • the spatial separation of audio sources can aid hearing when the sources simultaneously provide sound.
  • Content that is suitable for rendering spatial audio via an advanced audio output device may be unsuitable for a less advanced audio output device and content that is suitable for rendering by a less advanced audio output device may under-utilize the spatial audio capabilities of an advanced audio output device.
  • an apparatus comprising means for:
  • the apparatus comprises means for: automatically controlling mixing of the N audio channels to produce at least the output audio channel, in dependence upon time-variation of content of one or more of the N audio channels.
  • the N audio channels are N spatial audio channels where each of the N spatial audio channels can be rendered as a differently positioned audio source.
  • N is at least two and M is one, the output audio channel being a monophonic audio output channel.
  • the apparatus comprises means for analyzing the N audio channels to adapt a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels.
  • prioritization depends upon one or more of:
  • controlling mixing of the N audio channels to produce at least an output audio channel comprises:
  • the apparatus comprises means for controlling mixing of the N audio channels to produce M audio channels in response to a communication bandwidth for receiving the audio channels or for providing output audio signals falling beneath a threshold value.
  • the apparatus comprises means for controlling mixing of the N audio channels to produce M audio channels when there is conflict between a first audio channel of the N audio channels and a second audio channel of the N audio channels, wherein the first audio channel is included within the M audio channels and the second audio channel is not included within the M audio channels, wherein over-talking is an example of conflict.
  • the audio channels of the N audio channels that are not the selected M audio channels are available for later rendering.
  • the apparatus comprises a user input interface for controlling prioritization of the N audio channels.
  • the apparatus comprises a user input interface, wherein the user input interface provides a spatial representation of the N audio channels and indicates which of the N audio channels are comprised in the sub-set of M audio channels.
  • a multi-party, live communication system that enables live audio communication between multiple remote participants using at least the N audio channels wherein different ones of the multiple remote participants provide audio input for different ones of the N audio channels, wherein the system comprises the apparatus.
  • the set of N audio channels is referenced using reference number 20.
  • Each audio channel of the set of N audio channels is referenced using reference number 20 i , where i is 1, 2,...N-1, N.
  • the apparatus 10 comprises means for receiving at least N audio channels 20 where each of the N audio channels 20 i can be rendered as a different audio source.
  • the apparatus 10 comprises means 40, 50 for controlling selection and mixing of the N audio channels 20 to produce at least an output audio channel 52.
  • a selector 40 selects for mixing (to produce the output audio channel 52) a sub-set 30 of M audio channels from the N audio channels 20.
  • the selection is dependent upon prioritization 32 of the N audio channels 20.
  • the prioritization 32 is adaptive depending at least upon a changing content 34 of one or more of the N audio channels 20.
  • the sub-set 30 of M audio channels is referenced using reference number 30.
  • Each audio channel of the sub-set of M audio channels is referenced using reference number 20 j , where j is any M of the N values of i.
  • the sub-set 30 can, for example, be varied by changing the value of M and/or by changing which audio channels 20 j are used to comprise the M audio channels of the sub-set 30.
  • different sub-sets 30 can, in some examples, be differentiated using the same reference 30 with different numeric sub-scripts.
  • a mixer 50 mixes the sub-set 30 of M audio channels to produce the output audio channel 52 which is suitable for rendering.
  • An advanced spatial audio output device (an example is illustrated at FIG 11A ) can render the N audio channels 20 as multiple different spatially positioned audio sources.
  • a less advanced audio output device (an example is illustrated at FIG 11B ) can render the output audio channel 52.
  • the apparatus 10 therefore allows a common content, the N audio channels 20, to provide audio output at both the advanced spatial audio output device and the less advanced audio output device.
  • FIG. 1 illustrates an example of an apparatus 10 for providing an output audio channel 52 for rendering.
  • the rendering of the output audio channel 52 can occur at the apparatus 10 or can occur at some other device.
  • the apparatus 10 receives at least N audio channels 20.
  • An audio channel 20 i of the N audio channels 20 can be rendered as a distinct audio source.
  • the apparatus 10 comprises a mixer 50 for mixing a sub-set 30 of M audio channels, selected from the N audio channels 20, to produce at least an output audio channel 52.
  • a selector 40 selects for mixing, at mixer 50, the sub-set 30 of M audio channels from the N audio channels 20.
  • the selection, by the selector 40 is dependent upon prioritization 32 of the N audio channels 20.
  • the prioritization 32 is adaptive depending at least upon a changing content 34 of one or more of the N audio channels 20.
  • the apparatus 10 provides, from the mixer 50, the output audio channel 52 for rendering.
  • the sub-set 30 of M audio channels has fewer audio channels than the N audio channels 20, that is, M is less than N.
  • N is at least two and in at least some examples is greater than 2.
  • M is one and the output audio channel 52 is a monophonic audio output channel.
  • the prioritization 32 is adaptive. The prioritization 32 depends at least on a changing content 34 of one or more of the N audio channels 20.
  • the apparatus 10 is configured to automatically control the mixing of the N audio channels 20 to produce at least the output audio channel 52, in dependence upon time-variation of content 34 of one or more of the N audio channels 20.
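  • As an illustration only (not taken from the patent text), the select-then-mix behaviour of FIG. 1 can be sketched in a few lines of Python; the function name select_and_mix, the use of numpy arrays and the equal-gain downmix are assumptions made for the example.

    import numpy as np

    def select_and_mix(channels, priorities, m=1):
        # channels: list of N numpy arrays (the N audio channels 20)
        # priorities: one adaptive priority value per channel (the prioritization 32)
        # m: number of channels in the selected sub-set 30
        order = np.argsort(priorities)[::-1]          # highest priority first
        selected = [channels[i] for i in order[:m]]   # sub-set 30 of M audio channels
        mix = np.mean(selected, axis=0)               # mixer 50: equal-gain downmix
        peak = float(np.max(np.abs(mix)))
        return mix / peak if peak > 0 else mix        # output audio channel 52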
  • FIG. 2 illustrates an example of an apparatus 10 in which an analyzer 60 is configured to analyze the N audio channels 20 to adapt the prioritization 32 of the N audio channels 20 in dependence upon, at least, changing content 34 of one or more of the N audio channels 20.
  • the analysis can be performed before (or simultaneously with) the previously mentioned selection.
  • the analyzer 60 is configured to process metadata associated with the N audio channels 20. Additionally or alternatively, in some examples, the analyzer 60 is configured to process the audio content of the audio channels 20. This processing could, for example, comprise voice activity detection, voice recognition processing, spectral analysis, semantic processing of speech or other processing including machine learning and artificial intelligence processing used to identify characteristics of the content 34 of one or more of the N audio channels 20.
  • the prioritization 32 can depend upon one or more parameters of the content 34.
  • the prioritization 32 depends upon timing of content 34 i of an audio channel 20 i relative to timing of content 34 j of an audio channel 20 j .
  • the audio channel 20 that first satisfies a trigger condition has temporal priority.
  • the trigger condition may be that the audio channel 20 has activity above a threshold, and/or has activity above a threshold in a particular spectral range, and/or has voice activity, and/or has voice activity associated with a specific person, and/or the voice activity comprises semantic content including a particular keyword or phrase.
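  • A hedged sketch of the temporal-priority rule above: the channel that first (in time) satisfies a simple trigger condition, here frame energy above a threshold, gains priority. The frame layout, threshold value and function name are illustrative assumptions; a voice-activity or keyword detector could replace the energy test.

    import numpy as np

    def temporal_priority(channel_frames, energy_threshold=1e-3):
        # channel_frames: numpy array of shape (n_channels, n_frames, frame_len)
        n_channels, n_frames, _ = channel_frames.shape
        for t in range(n_frames):                     # scan frames in time order
            for ch in range(n_channels):
                energy = float(np.mean(channel_frames[ch, t] ** 2))
                if energy > energy_threshold:
                    return ch                         # first channel to satisfy the trigger
        return None                                   # no channel satisfied the trigger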
  • An initial prioritization 32 can cause an initial selection of a first sub-set 30 1 of audio channels 20 that are mixed to form the output audio channel 52.
  • a change in prioritization 32 can cause a new selection of a second different sub-set 30 2 of audio channels 20 that are mixed to form a new, different output audio channel 52.
  • the first sub-set 30 1 and the second sub-set 30 2 are not equal sets.
  • apparatus 10 can prioritize one or more of the N audio channels 20 as a sub-set 30 until a new selection by the selector 40 based on a new prioritization 32 changes the sub-set 30.
  • For example, a person may be speaking in a first audio channel; that channel may be prioritized ahead of a second audio channel. However, if the person speaking in the first audio channel stops speaking then the prioritization 32 of the audio channels can change and there can be a consequential reselection at the selector 40 of the sub-set 30 of M audio channels provided for mixing to produce the output audio channel 52.
  • the apparatus 10 can flag at least one input audio channel 20 corresponding to a first active talker, or generally active content 34, during a selection period and prioritize this selection over other audio channels 20.
  • the apparatus 10 can determine whether the active talker continues before introducing content 34 from non-prioritized channels to the mixed output audio channel 52. The introduction of such additional content 34 from non-prioritized channels is controlled by the selector 40 during a following selection period.
  • non-prioritized audio channels 20 can be completely omitted from the mixed output audio channel 52 and thus the mixed output audio channel 52 will contain only the prioritized channel(s).
  • the non-prioritized channels can be mixed with a lower gain or higher attenuation than the prioritized channel and/or with other suitable processing to produce the output audio channel 52.
  • a history of content 34 of at least one of the N audio channels 20 can be used to control the prioritization 32.
  • the selector 40, in selecting which of the N audio channels 20 to mix to produce the output audio channel 52, can, for example, use decision thresholds for selection.
  • a decision threshold can be changed over time and can be dependent upon a history of the content 34.
  • different decision thresholds can be used for different audio channels 20.
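  • One possible (assumed, not claimed) realisation of history-dependent, per-channel decision thresholds is simple hysteresis: the currently selected channel keeps a lower activity threshold so that the selection does not flip on a brief pause.

    def per_channel_thresholds(base_threshold, selected_channel, n_channels, hysteresis=0.5):
        # Lower the decision threshold for the currently selected channel (hysteresis);
        # keep the base threshold for every other channel.
        return [base_threshold * hysteresis if ch == selected_channel else base_threshold
                for ch in range(n_channels)]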
  • the prioritization 32 can be dependent upon mapping to a particular person an identified voice in content 34 of at least one of the N audio channels 20.
  • the analyzer 60 can for example perform voice recognition based upon the content 34 of one or more of the N audio channels 20.
  • the analyzer 60 can identify a particular person based upon metadata comprised within the content 34 of at least one of the N audio channels 20. It may therefore be possible to identify a particular one of the N audio channels 20 as relating to a person whose contribution it is particularly important to hear such as, for example, a chairman of a meeting.
  • the analyzer 60 is configured to adapt the prioritization 32 when the presence of voice content is detected within the content 34 of at least one of the N audio channels 20.
  • the analyzer 60 is able to prioritize the spoken word within the output audio channel 52. It is also possible to adapt the analyzer 60 to prioritize other types of content.
  • the analyzer 60 is configured to adapt the prioritization 32 based upon detection that content 34 of at least one of the N audio channels 20 comprises an identified keyword.
  • the analyzer 60 can, for example, listen to the content 34 and identify within the stream of content a keyword or identify semantic meaning within the stream of content. This can be used to modify the prioritization 32. For example, it may be desirable for a consumer of the output audio channel 52 to have that output audio channel 52 personalized so that if one of the N audio channels 20 comprises content 34 that includes the consumer's name or other information associated with the consumer then that audio channel 20 is prioritized by the analyzer 60.
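  • A minimal sketch, assuming speech-to-text output is available for each channel, of the keyword-based personalisation described above: any channel whose transcript contains one of the consumer's keywords (for example their name) receives a priority boost. The names and the boost value are illustrative only.

    def boost_for_keywords(priorities, transcripts, keywords, boost=10.0):
        # priorities: current priority per channel; transcripts: recognised text per channel
        boosted = list(priorities)
        for i, text in enumerate(transcripts):
            if any(keyword.lower() in text.lower() for keyword in keywords):
                boosted[i] += boost                   # favour channels that mention a keyword
        return boosted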
  • the N audio channels 20 can represent live content.
  • the analysis by the analyzer 60, the selection by the selector 40 and the mixing by the mixer 50 can occur in real time such that the output audio channel 52 is also live.
  • FIG. 3 illustrates an example of the apparatus of FIG. 1 in more detail.
  • the mixing is a weighted mixing in which different sub-sets of the sub-set 30 of selected audio channels are weighted with different attenuation/gain before being finally mixed to produce the output audio channel 52.
  • the selector 40 selects a first sub-set SS1 of the M audio channels to be mixed to provide background audio B and selects a second sub-set SS2 of the M audio channels 20 to be mixed to provide foreground audio F that is for rendering at greater loudness than the background audio B.
  • the selection of the first sub-set SS1 and the selection of the second sub-set SS2 is dependent upon the prioritization 32 of the N audio channels 20.
  • the first sub-set SS1 of audio channels 20 is mixed 50 1 to provide background audio B which is then amplified/attenuated G1 to adjust the loudness of the background audio before it is provided to the mixer 50 3 for mixing to produce the output audio channel 52.
  • the second sub-set SS2 of audio channels 20 is mixed 50 2 to provide foreground audio F which is then amplified/attenuated G2 to adjust the loudness of the foreground audio before it is provided to the mixer 50 3 for mixing to produce the output audio channel 52.
  • the gain/attenuation G2 applied to the foreground audio F makes it significantly louder than the background audio B in the output audio channel 52. In some situations, the foreground audio F is naturally louder than background audio B. Thus, it can be but need not be that G2 > G1.
  • the gain/attenuation G1, G2 can, in some examples, vary with frequency.
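  • The weighted foreground/background mixing of FIG. 3 could be sketched as below; the gain values and the equal-gain sub-mixes are assumptions, and in practice G1 and G2 could be frequency-dependent filters rather than scalars.

    import numpy as np

    def foreground_background_mix(background_channels, foreground_channels, g1=0.2, g2=1.0):
        # mixer 50 1: mix the first sub-set SS1 to background audio B
        b = np.mean(background_channels, axis=0) if background_channels else 0.0
        # mixer 50 2: mix the second sub-set SS2 to foreground audio F
        f = np.mean(foreground_channels, axis=0) if foreground_channels else 0.0
        # gains G1, G2 and mixer 50 3 produce the output audio channel 52
        return g1 * b + g2 * f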
  • FIG. 4 illustrates an example of a multi-party, live communication system 200 that enables live audio communication between multiple remote participants A i , B, C, D i using at least the N audio channels 20. Different ones of the multiple remote participants A i , B, C, D i provide audio input for different ones of the N audio channels 20.
  • the system 200 comprises input end-points 206 for capturing audio channels 20.
  • the system 200 comprises output end-points 204 for rendering audio channels.
  • One or more output end-points 204 s are configured for rendering spatial audio as distinct rendered audio sources.
  • One or more output end-points 204 m are not configured for rendering spatial audio.
  • the N audio channels 20 are N spatial audio channels where each of the N spatial audio channels is captured as a differently positioned captured audio source, and can be rendered using spatial audio as a differently positioned rendered audio source.
  • the captured audio source can be fixed or can move, for example, with movement of the input end-point 206.
  • the rendered audio source can either be fixed or can move, for example, in a manner corresponding to the moving input end-point 206.
  • the system 200 is for enabling immersive teleconferencing or telepresence for remote terminals.
  • the different terminals have varying device capabilities and different (and possibly variable) network conditions.
  • Spatial/immersive audio refers to audio that typically has a three-dimensional space representation or is presented (rendered) to a participant with the intention of the participant being able to hear a specific audio source from a specific direction.
  • Some of the participants share a room. For example, participants A 1 , A 2 , A 3 , A 4 share the room A and the participants D 1 , D 2 , D 3 , D 4 , D 5 share the room D.
  • Some of the terminals can be characterized as "advanced spatial audio output devices" that have an output end-point 204 s that is configured for spatial audio. However, some of the terminals are less advanced audio output devices that have an output end-point 204 m that is not configured for spatial audio.
  • the voices of the participants A i , B, C, D i are spatially separated.
  • the voices may, for example, have fixed spatial positions relative to each other or the directions may be adaptive, for example, according to participant movements, conference bridge settings or based upon inputs by participants.
  • a similar experience is available to the participants who are using the output end-points 204 s and they have the ability to interact much more naturally than in traditional voice calls and voice conferencing. For example, they can talk at the same time and still understand each other thanks to effects such as the well-known cocktail party effect.
  • each of the respective participants A i , D i has a personal input end-point 206 which captures a personal captured audio source as a personal audio channel 20.
  • the personal input end-point 206 can, for example, be provided by a directional microphone or by a Lavalier microphone.
  • the participants B and C each have a single personal input end-point 206 which captures a personal audio channel 20.
  • the output end-points 204 s are configured for spatial audio.
  • each room can have a surround sound system as an output end-point 204 s .
  • An output end point 204 s is configured to render each captured sound source represented by an audio channel 20 as a rendered sound source.
  • each participant A i , B, C has a personal output audio channel 20.
  • Each personal output audio channel 20 is rendered from a different location as a different rendered audio source.
  • the collection of rendered audio sources associated with the participants A i creates a virtual room A.
  • each participant D i , B, C has a personal output audio channel 20.
  • Each personal output audio channel 20 is rendered from a different location as a different rendered sound source.
  • the collection of the rendered audio sources associated with the participants D i creates a virtual room D.
  • the output end-point 204 s is configured for spatial audio.
  • An output end point 204 s is configured to render each captured sound source represented by an audio channel 20 as a rendered sound source.
  • the participant C has an output end-point 204 s that is configured for spatial audio.
  • the participant C is using a headset configured for binaural spatial audio that is suitable for virtual reality (VR).
  • Binauralization methods can be used to render personal audio channels 20 as spatially positioned rendered audio sources,
  • Each participant Ai, Di, B has a personal output audio channel 20.
  • Each personal output audio channel 20 is or can be rendered from a different location as a different rendered sound source.
  • the participant B has an output end-point 204 m that is not configured for spatial audio. In this example it is a monophonic output end-point.
  • the participant B is using a mobile device (e.g. a mobile phone) to provide the input end-point 206 and the output end-point 204 m .
  • the mobile device has a single output end-point 204 m which provides the output audio channel 52 as previously described.
  • the processing to produce the output audio channel 52 can be performed at the mobile device of the participant B or at the server 202.
  • the mono-capability limitation of participant B can, for example, be caused by the device, for example because it is only configured for decoding of mono audio, or because of the available audio output facilities such as a mono-only earpiece or headset.
  • Each of the input end-points 206 is rendered in spatial audio as a spatially distinct rendered audio source. However, in other examples multiple ones of the input end-points 206 may be mixed together to produce a single rendered audio source. This can be used to reduce the number of rendered audio sources using spatial audio. Therefore, in some examples, a spatial audio device may render multiple ones of output audio channels 52.
  • In FIG. 4 a star topology similar to that illustrated in FIG. 5A is used.
  • the central server 202 interconnects the input end-points 206 and the output end-points 204.
  • the input end-points 206 provide the N audio channels 20 to a central server 202 which produces the output audio channel 52 as previously described to the output end-point 204 m .
  • the apparatus 10 is located in the central server 202, however, in other examples the apparatus 10 is located at the output end-point 204 m .
  • FIG. 5B illustrates an alternative topology in which there is no centralized architecture but a peer-to-peer architecture.
  • the apparatus 10 is located at the output end-point 204m.
  • the 3GPP IVAS codec is an example of a voice and audio communications codec for spatial audio.
  • the IVAS codec is an extension of the 3GPP EVS codec and is intended for new immersive voice and audio services over 4G and 5G.
  • Such immersive services include, for example, immersive voice and audio for virtual reality (VR).
  • the multi-purpose audio codec is expected to handle encoding, decoding and rendering of speech, music and generic audio. It is expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. It is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
  • the audio channels 20 can, for example, be coded/decoded using the 3GPP IVAS codec.
  • the spatial audio channels 20 can, for example, be provided as metadata-assisted spatial audio (MASA), object-based audio, channel-based audio (5.1, 7.1+4), non-parametric scene-based audio (e.g. First Order Ambisonics, Higher Order Ambisonics) and any combination of these formats. These audio formats can be binauralized for headset listening such that a participant can hear the audio sources outside their head.
  • metadata-assisted spatial audio (MASA)
  • object-based audio
  • channel-based audio (e.g. 5.1, 7.1+4)
  • non-parametric scene-based audio (e.g. First Order Ambisonics, Higher Order Ambisonics)
  • the apparatus 10 provides a better experience, including improved intelligibility for a mono user participating in a spatial audio teleconference with several potentially overlapping spatial audio inputs.
  • the apparatus 10 means that it is not necessary, in some cases, to simplify the spatial audio conference experience for the spatial audio users due to having a mono-audio participant.
  • a mono user can participate in a spatial audio conference without compromising the experience of the other users.
  • FIGS 6, 7, 8 and 9A illustrate examples of an apparatus 10 that comprises a controller 70.
  • the controller receives N audio channels 20 and performs control processing to select the sub-set 30 of M audio channels.
  • the controller 70 comprises the selector 40 and, optionally, the analyzer 60.
  • the mixer 50 is present but not illustrated.
  • the controller 70 is configured to control mixing of the N audio channels 20 to produce the sub-set 30 of M audio channels when a conflict between a first audio channel of the N audio channels 20 and a second audio channel of the N audio channels occurs.
  • the control can cause the first audio channel 20 to be included within the sub-set 30 of M audio channels and cause the second audio channel 20 not to be included within the sub-set 30 of M audio channels.
  • the second audio channel is included within the sub-set 30 of M audio channels.
  • One example of when there is conflict between audio channels is when there is simultaneous activity from different prioritized sound sources.
  • over-talking (simultaneous speech) is an example of such a conflict.
  • the prioritization 32 used for the selection of audio channels to form the sub-set 30 of M audio channels depends upon timing of content 34 of at least one of the N audio channels 20 relative to timing of content 34 of at least another one of the N audio channels 20.
  • the later speech by participants 4 and 5 is not selected for inclusion within the sub-set 30 of audio channels used to form the output audio channel 52.
  • the audio channel 20 3 preferentially remains prioritized and remains included within the output audio channel 52, while there is voice activity in the audio channel 20 3 , whereas the audio channels 20 4 , 20 5 are excluded. If voice activity is no longer detected in the audio channel 20 3 then in some examples a selection process may immediately change the identity of the audio channel 20 selected for inclusion within the output audio channel 52. However, in other examples there can be a selection grace period. During this grace period, there can be a greater likelihood of selection/reselection of the original selected audio channel 20 3 . Thus, during the grace period prioritization 32 is biased in favor of the previously selected audio channel.
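  • The selection grace period described above could, for example, be implemented as a time-limited bias toward the previously selected channel, as in this assumed sketch (the class name, timing source and bias values are illustrative, not the patent's method).

    import time

    class GracePeriodSelector:
        def __init__(self, grace_period=2.0, bias=5.0):
            self.grace_period = grace_period          # seconds of bias after activity ends
            self.bias = bias                          # extra priority for the previous selection
            self.last_selected = None
            self.last_active_time = 0.0

        def select(self, priorities, active):
            # priorities: priority per channel; active: voice-activity flag per channel
            now = time.monotonic()
            biased = list(priorities)
            if self.last_selected is not None:
                if active[self.last_selected]:
                    self.last_active_time = now
                if now - self.last_active_time < self.grace_period:
                    biased[self.last_selected] += self.bias   # favour the previous channel
            if not any(active):
                return self.last_selected if self.last_selected is not None else 0
            self.last_selected = max(range(len(biased)),
                                     key=lambda i: biased[i] if active[i] else float("-inf"))
            return self.last_selected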
  • prioritization 32 used for the selection depends upon a history of content 34 of at least one of the N audio channels 20.
  • the prioritization 32 used for the selection can depend upon mapping to a particular person (an identifiable human), an identified voice in content 34 of at least one of the N audio channels 20.
  • a voice can be identified using metadata or by analysis of the content 34. The prioritization 32 would more favorably select the particular person's audio channel 20 for inclusion within the output audio channel 52.
  • the particular person could, for example, be based upon service policy.
  • a teleconference service may have a moderator or chairman role and this participant may for example be made audible to all participants or may be able to force themselves to be audible to all participants.
  • the particular person could for example be indicated by a user consuming the output audio channel 52. That consumer could for example indicate which of the other participants' content 34 or audio channels 20 they wish to consume. This audio channel 20 could then be included, or be more likely to be included, within the output audio channel 52.
  • the inclusion of the user-selected audio channel 20 can for example be dependent upon voice activity within the audio channel 20, that is, the user-selected audio channel 20 is only included if there is active voice activity within that audio channel 20.
  • the prioritization 32 used for the selection therefore strongly favors the user-selected audio channel 20.
  • the selection by the consumer of the output audio channel 52 of a particular audio channel 20 can for example be based upon an identity of the participant who is speaking or should speak in that audio channel. Alternatively, it could be based upon a user-selection of that audio channel because of the content 34 rendered within that audio channel.
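  • Under the assumption of a simple numeric priority scheme, the user-selected priority described above might look like the sketch below: the channel chosen by the consumer is strongly boosted, but only while it has voice activity.

    def apply_user_selection(priorities, voice_active, user_choice, strong_boost=100.0):
        # user_choice: index of the audio channel picked by the consumer, or None
        boosted = list(priorities)
        if user_choice is not None and voice_active[user_choice]:
            boosted[user_choice] += strong_boost      # strongly favour the user-selected channel
        return boosted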
  • FIG. 7 illustrates an example similar to FIG. 6 .
  • the audio channels 20 include a mixture of different audio types.
  • the audio channel 20 3 associated with participant3 is predominantly a voice channel.
  • the audio channels 20 4 , 20 5 associated with participants 4 and 5 are predominantly instrumental/music channels.
  • the selection of which of the audio channels 20 is to be included within the output audio channel 52 can be based upon the audio type present within the audio channel 20.
  • the detection of the audio type within the audio channel 20 can for example be achieved using metadata or, alternatively, by analyzing the content 34 of the audio channel 20.
  • the prioritization 32 used for selection can be dependent upon detection that content 34 of at least one of the N audio channels 20 is voice content.
  • the output audio channel 52 can switch between the inclusion of different audio channels 20 in dependence upon which of them includes active voice content. In this way priority can be given to spoken language.
  • the other channels for example the music channels 20 4 , 20 5 may optionally be included, for example as background audio as previously described with relation to FIG. 3 .
  • the apparatus 10 deliberately loses information by excluding (or diminishing) audio channels 20 with respect to the output audio channel 52.
  • Information is generally lost by the selective downmixing which is required to maintain or guarantee intelligibility. It is, however, possible for there to be two simultaneously important audio channels 20, only one of which is selected for inclusion in the output audio channel 52.
  • the apparatus illustrated in FIG. 8 addresses this issue.
  • the apparatus 10 illustrated is similar to that illustrated in FIGS 6 and 7 . However, it additionally comprises a memory 82 for storage of a further sub-set 80 of the N audio channels 20 that is different to the sub-set 30 of M audio channels.
  • the later rendering may be at a faster playback rate and that playback may be fixed or may be adaptive.
  • the sub-set 80 of audio channels is mixed to form an alternative audio output channel for storage in the memory 82.
  • At least some of the audio channels of the N audio channels that are not selected to be in the sub-set 30 of M audio channels are stored in memory 82 for later rendering.
  • there is selection of a first sub-set 30 of M audio channels from the N audio channels 20 based upon prioritization 32 of the N audio channels.
  • the first sub-set 30 of M audio channels is mixed to produce a first output audio channel 52.
  • the second sub-set 80 of audio channels is mixed to produce a second output audio channel for storage.
  • the audio channel 20 3 includes content 34 comprising voice content from a single participant, and it is selected for inclusion within the sub-set 30 of audio channels. It is used to produce the output audio channel 52.
  • the audio channels 20 4 , 20 5 which have not been included within the output audio channel 52, or included only as background (as described with reference to FIG. 3 ), are selected for mixing to produce the second output audio signal that is stored in memory 82.
  • FIG. 10 illustrates an example of how such an indication may be provided to the consumer of the output audio channel 52. Fig 10 is described in detail later.
  • An apparatus 10 may switch to the stored audio channel and play that back at a higher speed. For example, the apparatus 10 can monitor the typical length of inactivity in the preferred output audio channel 52 and adjust the speed of playback for the stored audio channel such that the relevant portions can be played back during a typical inactive period.
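  • A small example of the adaptive catch-up playback mentioned above, assuming the apparatus tracks how much audio has been stored and how long a typical pause in the primary channel lasts; the clamping limits are arbitrary.

    def catch_up_rate(stored_seconds, typical_pause_seconds, min_rate=1.0, max_rate=2.0):
        # Choose a playback rate so the stored audio roughly fits a typical inactive period.
        if typical_pause_seconds <= 0:
            return max_rate
        rate = stored_seconds / typical_pause_seconds
        return min(max(rate, min_rate), max_rate)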
  • FIG. 9A illustrates an example in which the apparatus 10 detects that content 34 of at least one of the N audio channels 20 comprises an identified keyword and adapts the prioritization 32 accordingly.
  • the prioritization 32 in turn controls selection of which of the audio channels 20 are included in the sub-set 30 and the output audio channel 52 (and, if implemented, the stored alternative audio channel).
  • the participant 'User 3' is speaking first and has priority. Therefore, the audio channel 20 3 associated with the User 3 is initially selected as the priority audio channel and is included within the sub-set 30 used to produce the output audio channel 52. Even though the participant 'User 5' begins to talk, the prioritization is not changed and the audio channel 20 3 remains the priority audio channel included within the sub-set 30 and the output audio channel 52.
  • an identified keyword is then detected within the content 34 of the audio channel 20 5 associated with the User 5, for example the name of the consumer of the output audio channel 52.
  • This event causes a switch in the prioritization of the audio channels 20 3 , 20 5 such that the audio channel 20 5 becomes prioritized and included in the sub-set 30 and the output audio channel 52 and the audio channel 20 3 becomes de-prioritized and excluded from the sub-set 30 and the output audio channel 52.
  • the consumer of the output audio channel 52 can via user input settings control the likelihood of a switch when a keyword is mentioned within an audio channel 20.
  • the consumer of the output audio channel 52 can, for example, require a switch if a keyword is detected.
  • the likelihood of a switch can be increased.
  • the occurrence of a keyword can increase the prioritization of an audio channel 20 such that it is stored, for example as described in relation to FIG. 8 .
  • the detection of a keyword may provide an option to the consumer of the output audio channel 52, to enable the consumer to cause a change in the audio channel 20 included within the sub-set 30 and the output audio channel 52. For example, if the name of the consumer of the output audio channel 52 is included within an audio channel 20 that is not being rendered, as a priority, within the output audio channel 52 then the consumer of the output audio channel 52 can be presented with an option to change prioritization 32 and switch to using a sub-set 30 and output audio channel 52 that includes the audio channel 20 in which their name was detected.
  • the new output audio channel 52 based on the detected keyword may be played back from the occurrence of the detected keyword.
  • the playback is at a faster rate to allow a catch-up with real time.
  • FIG. 10 illustrates an example in which a consumer of the output audio channel 52 is provided with information to allow that consumer to make an informed decision to switch audio channels 20 included within the sub-set 30 and the output audio channel 52.
  • some form of indication is given to indicate a change in activity status. For example, if a particular participant begins to talk or there is a second separate discussion ongoing, the consumer of the original output audio channel 52 is made aware of this.
  • a suitable indicator could for example be an audible indicator that is added to the output audio channel 52.
  • each participant may have an associated different tone and a beep with a particular tone may indicate which participant has begun to speak.
  • an indicator could be a visual indicator in a user input interface.
  • the background audio is adapted to provide an audible indication.
  • the consumer listening to the output audio channel 52 hears the audio channel 20 1 associated with a first participant's voice (User A voice).
  • a second audio channel 20 is mixed with the audio channel 20 1 , then it may, for example, be an audio channel 20 2 that captures the ambient audio of the first participant (User A ambience).
  • a second participant, User B begins to talk. This does not initiate a switch of prioritization 32 sufficient to change the sub-set 30.
  • the primary audio channel 20 in the sub-set 30 and the output audio channel 52 remains the audio channel 20 1 .
  • an indication is provided to indicate to the consumer of the output audio channel 52 that there is an alternative, available, audio channel 20 3 .
  • the indication is provided by mixing the primary audio channel 20 1 with an additional audio channel 20 associated with the User B.
  • the additional audio channel 20 can be an attenuated version of the audio channel 20 3 or can be an ambient audio channel 20 4 for the User B (User B ambience).
  • the second audio channel 20 2 is replaced by the additional audio channel 20 4 .
  • the consumer of the output audio channel 52 can then decide whether or not they wish to cause a change in the prioritization 32 to prioritize the audio channel 20 3 associated with the User B above the audio channel 20 1 associated with the User A. If this change in prioritization occurs then there is a switch in the primary audio channel within the sub-set 30 and the output audio channel 52 from being the audio channel 20 1 to being the audio channel 20 3 . In the example illustrated, the consumer does not make this switch. The switch does however occur automatically when the User A stops talking at time T2.
  • the background audio B can be included and/or varied as an indication to the consumer of the output audio channel 52 that an alternative audio channel 20 is available for selection.
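  • The audible indication of FIG. 10 can be sketched as mixing a strongly attenuated version of the newly active channel (or of its ambience channel) into the output; the gain value below is an assumption.

    import numpy as np

    def mix_with_indication(primary, alternative=None, indication_gain=0.15):
        # primary: the prioritized audio channel; alternative: a newly active channel or its ambience
        output = np.copy(primary)
        if alternative is not None:
            output = output + indication_gain * alternative   # quiet cue that another channel is available
        return output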
  • FIG. 11A schematically illustrates audio rendered to a participant (User 5) at an output end-point 204 s of the system 200 (not illustrated) that is configured for rendering spatial audio.
  • the audio output at the end-point 204 s has multiple rendered sound sources associated with audio channels 20 1 , 20 2 , 20 3 , 20 4 at different locations.
  • FIG. 11A illustrates that even with the presence in the system 200 (not illustrated) of an output end-point 204 m ( FIG 11B ) that is not configured for spatial audio rendering, there may be no need to reduce the immersive capabilities or experience at the output end-points 204 s of the system 200 that are configured for rendering spatial audio.
  • FIG. 11B schematically illustrates audio rendered to a participant (User 1) at an output end-point 204 m of the system 200 (not illustrated) that is not configured for rendering spatial audio.
  • the audio output at the end-point 204 m provided by the output audio channel 52 has a single monophonic output audio channel 52 that is based on the sub-set 30 of selected audio channels 20 and has good intelligibility.
  • the audio channel 20 2 is the primary audio channel that is included in the sub-set 30 and the output audio channel 52.
  • the apparatus 10 can be configured to automatically switch the composition of the audio channels 20 mixed to form the output audio channel 52 in dependence upon an adaptive prioritization 32. Additionally or alternatively, in some examples, the switching can be effected manually by the consumer at the end-point 204 m using a user interface which includes a user input interface 90.
  • the device at the output end-point 204 m , which in some examples may be the apparatus 10, comprises a user input interface 90 for controlling prioritization 32 of the N audio channels 20.
  • the user input interface 90 can be configured to highlight or label selected ones of the N audio channels 20 for selection.
  • the user input interface 90 can be used to control if and to what extent manual or automatic switching occurs to produce the output audio channel 52 from selected ones of the audio channels 20.
  • An adaptation of the prioritization 32 can cause an automatic switching or can cause a prompt to a consumer for manual switching.
  • the user input interface 90 can control if and the extent to which prioritization 32 depends upon one or more of timing of content 34 of at least one of the N audio channels 20 relative to timing of content 34 of at least another one of the N audio channels 20; history of content 34 of at least one of the N audio channels 20; mapping to a particular person an identified voice in content 34 of at least one of the N audio channels 20; detection that content 34 of at least one of the N audio channels 20 is voice content; and/or detection that content 34 of at least one of the N audio channels comprises an identified word.
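  • The prioritization controls listed above could, for example, be modelled as a small settings object edited through the user input interface 90; the field names below are assumptions, not the patent's interface.

    from dataclasses import dataclass

    @dataclass
    class PrioritizationSettings:
        use_relative_timing: bool = True       # timing of content relative to other channels
        use_content_history: bool = True       # history of content of a channel
        use_speaker_identity: bool = False     # mapping an identified voice to a particular person
        use_voice_detection: bool = True       # detection that content is voice content
        keywords: tuple = ()                   # identified words that raise a channel's priority
        automatic_switching: bool = True       # switch automatically or prompt the consumer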
  • an option 91 4 that allows the participant, User 1, to select the audio channel 20 4 as a replacement primary audio channel that is included in the sub-set 30 and the output audio channel 52 instead of the audio channel 20 2 .
  • an option 91 3 that allows User 1 to select the audio channel 20 3 as a replacement primary audio channel that is included in the sub-set 30 and the output audio channel 52 instead of the audio channel 20 2 .
  • the user input interface 90 can provide a visual spatial representation of the N audio channels 20 and indicate which of the N audio channels 20 are comprised in the sub-set 30 of M audio channels.
  • the user input interface 90 can also indicate which of the N audio channels are not comprised in the sub-set 30 of M audio channels and which, if any, of these are active.
  • the user input interface 90 may provide textual information about an audio channel 20 that is active and available for selection.
  • speech-to-text algorithms may be utilized to convert speech within that audio channel 20 into an alert displayed at the user input interface 90.
  • the apparatus 10 may be configured to cause the user input interface 90 to provide an option to a consumer of the output audio channel 52 that enables that consumer to switch audio channels 20 included within the sub-set 30 and output audio channel 52.
  • the keyword is "Dave” and the textual output provided by the user input interface 90 could, for example, say "option to switch to User 5 who addressed you and said: 'In our last teleco Dave made an interesting'".
  • the sub-set 30 and the output audio channel 52 then includes the audio channel 20 5 from the User 5 and starts from the position "In our last teleco Dave made an interesting".
  • a memory 82 could be used to store the audio channel 20 5 from the User 5.
  • the apparatus 10 can be permanently operational to perform the selection of the sub-set 30 of audio channels 20 used to produce the output audio channel 52.
  • the apparatus 10 has a state in which it is operational in this way and a state in which it is not operational in this way, and it can transition between these states, for example when a trigger event is or is not detected.
  • the apparatus 10 can be configured to control a mixer 50 to mix the N audio channels 20 to produce M audio channels in response to a trigger event.
  • One example of a trigger event is conflict between audio channels 20.
  • An example of detecting conflict would be when there is overlapping speech in audio channels 20.
  • a trigger event is a reduction in communication bandwidth for receiving the audio channels 20 below a threshold value.
  • the value of M can be dependent upon the available bandwidth.
  • a trigger event is a reduction in communication bandwidth for providing the output audio channel 52 beneath a threshold value.
  • the value of M can be dependent upon the available bandwidth.
  • the apparatus 10 can also be configured to control the transmission of audio channels 20 to it, and reduce the number of audio channels received from N to M (a reduction of N-M), wherein only the M audio channels that may be required for mixing to produce the output audio channel 52 are received.
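  • As a rough illustration of making M depend on the available bandwidth, one could budget a fixed bitrate per audio channel; the 24 kbit/s figure is purely an assumed example.

    def channels_for_bandwidth(bandwidth_bps, n_channels, per_channel_bps=24_000):
        # Return M: how many of the N channels fit the available bandwidth (at least one).
        m = int(bandwidth_bps // per_channel_bps)
        return max(1, min(m, n_channels))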
  • FIG. 12 illustrates an example of a method 100 that can for example be performed by the apparatus 10.
  • the method comprises, at block 102, receiving at least N audio channels 20 where each of the N audio channels 20 can be rendered as a different audio source.
  • the method 100 comprises, at block 104, controlling mixing of the N audio channels 20 to produce at least an output audio channel 52, wherein the mixer 50 selects a sub-set 30 of at least M audio channels from the N audio channels 20 in dependence upon prioritization 32 of the N audio channels 20, wherein the prioritization 32 is adaptive and depends at least upon a content 34 of one or more of the N audio channels 20.
  • the method 100 further comprises, at block 106, causing rendering of at least the output audio channel 52.
  • FIG. 13 illustrates a method 110 for producing the output audio channel 52. This method broadly corresponds to the method previously described with reference to FIG. 6 .
  • the method 110 comprises obtaining spatial audio signals from at least two sources as distinct audio channels 20.
  • the method 110 comprises determining temporal activity of each of the spatial audio signals (of the two audio channels 20) and selecting at least one spatial audio signal (audio channel 20) for mono downmix (for inclusion within the sub-set 30 and the output audio channel 52) for duration of its activity.
  • the method 110 comprises determining a content-based priority for at least one of the spatial audio signals (audio channels 20) for temporarily altering a previous selection.
  • the method 110 comprises determining a first mono downmix (sub-set 30 and output audio channel 52) based on at least one of the prioritized spatial audio signals (audio channels 20).
  • the output audio channel 52 is based upon the selected sub-set M which is in turn based upon the prioritization 32. Then at block 120, the method 110 provides the first mono downmix (the output audio channel 52) to the participant for listening. That is, it provides the output audio channel 52 for rendering.
  • the prioritization 32 determined at block 116 is used to adaptively adjust selection of the sub-set 30 of M audio channels 20 used to produce the output audio channel 52.
  • FIG. 14 illustrates an example in which the audio channel 20 3 is first selected, based on prioritization, as the primary audio channel in the output audio channel 52.
  • the output audio channel 52 does not comprise the audio channel 20 4 or 20 5 .
  • the audio channel 20 3 remains prioritized. There is no change to the selection of the sub-set 30 of M audio channels until the activity in the audio channel 20 3 ends.
  • a new selection process can occur based upon the prioritization 32 of other channels. In this example there is a selection grace period after the end of activity in the audio channel 20 3 .
  • the audio channel 20 3 will be re-selected as the primary channel to be included in the sub-set 30 and the output audio channel 52.
  • the audio channel 20 3 can have a higher prioritization and be selected if it becomes active. After the selection grace period expires, the prioritization of the audio channel 20 3 can be decreased.
  • FIG. 15 illustrates an example of a method 130 that broadly corresponds to the method previously described in relation to FIG. 8 .
  • the method 130 comprises obtaining spatial audio signals (audio channels 20) from at least two sources. This corresponds to the receiving of at least two audio channels 20.
  • the method 130 determines a first mono downmix (sub-set 30 and output audio channel 52) based on at least one of the spatial audio signals (audio channels 20).
  • the method 130 comprises determining at least one second mono downmix (sub-set 80 and additional audio channel) based on at least one of the spatial audio signals (audio channels 20) not present in the first mono downmix.
  • the first mono downmix is provided to a participant for listening as the output audio channel 52.
  • the second mono downmix is provided to a memory for storage.
  • if an audio channel 20 associated with a particular input end-point 206 is selected for inclusion within the sub-set 30 of audio channels used to create the output audio channel 52 at a particular output end-point 204, then this information may be provided as a feedback at an output end-point 204 associated with that included input end-point 206.
  • an audio channel 20 associated with a particular input end-point 206 is not selected for inclusion within the sub-set 30 of audio channels used to create the output audio channel 52 at a particular output end point 204, then this information may be provided as a feedback at an output end-point 204 associated with that excluded input end-point 206.
  • the information can for example identify the input end-points 206 not selected for inclusion for rendering at a particular identified output end-point 204.
  • FIG. 16 illustrates an example of a controller 70.
  • Implementation of a controller 70 may be as controller circuitry.
  • the controller 70 may be implemented in hardware alone, have certain aspects in software including firmware alone or can be a combination of hardware and software (including firmware).
  • the controller 70 may be implemented using instructions that enable hardware functionality, for example, by using executable instructions of a computer program 76 in a general-purpose or special-purpose processor 72 that may be stored on a computer readable storage medium (disk, memory etc) to be executed by such a processor 72.
  • the processor 72 is configured to read from and write to the memory 74.
  • the processor 72 may also comprise an output interface via which data and/or commands are output by the processor 72 and an input interface via which data and/or commands are input to the processor 72.
  • the memory 74 stores a computer program 76 comprising computer program instructions (computer program code) that controls the operation of the apparatus when loaded into the processor 72.
  • the computer program instructions of the computer program 76 provide the logic and routines that enable the apparatus to perform the methods previously illustrated and/or described.
  • the processor 72 by reading the memory 74 is able to load and execute the computer program 76.
  • the apparatus 10 therefore comprises:
  • the computer program 76 may arrive at the apparatus 10 via any suitable delivery mechanism 78.
  • the delivery mechanism 78 may be, for example, a machine readable medium, a computer-readable medium, a non-transitory computer-readable storage medium, a computer program product, a memory device, a record medium such as a Compact Disc Read-Only Memory (CD-ROM) or a Digital Versatile Disc (DVD) or a solid state memory, an article of manufacture that comprises or tangibly embodies the computer program 76.
  • the delivery mechanism may be a signal configured to reliably transfer the computer program 76.
  • the apparatus 10 may propagate or transmit the computer program 76 as a computer data signal.
  • Computer program instructions for causing an apparatus to perform at least the following or for performing at least the following:
  • the computer program instructions may be comprised in a computer program, a non-transitory computer readable medium, a computer program product, a machine readable medium. In some but not necessarily all examples, the computer program instructions may be distributed over more than one computer program.
  • Although the memory 74 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/dynamic/cached storage.
  • Although the processor 72 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable.
  • the processor 72 may be a single core or multi-core processor.
  • references to 'computer-readable storage medium', 'computer program product', 'tangibly embodied computer program' etc. or a 'controller', 'computer', 'processor' etc. should be understood to encompass not only computers having different architectures such as single /multi- processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry.
  • References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc.
  • circuitry may refer to one or more or all of the following:
  • the blocks illustrated in the preceding Figs may represent steps in a method and/or sections of code in the computer program 76.
  • the illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks, and the order and arrangement of the blocks may be varied. Furthermore, it may be possible for some blocks to be omitted.
  • the above described examples find application as enabling components of: automotive systems; telecommunication systems; electronic systems including consumer electronic products; distributed computing systems; media systems for generating or rendering media content including audio, visual and audio visual content and mixed, mediated, virtual and/or augmented reality; personal systems including personal health systems or personal fitness systems; navigation systems; user interfaces also known as human machine interfaces; networks including cellular, non-cellular, and optical networks; ad-hoc networks; the internet; the internet of things; virtualized networks; and related software and services.
  • a property of the instance can be a property of only that instance or a property of the class or a property of a sub-class of the class that includes some but not all of the instances in the class. It is therefore implicitly disclosed that a feature described with reference to one example but not with reference to another example, can where possible be used in that other example as part of a working combination but does not necessarily have to be used in that other example.
  • 'a' or 'the' is used in this document with an inclusive not an exclusive meaning. That is, any reference to X comprising a/the Y indicates that X may comprise only one Y or may comprise more than one Y unless the context clearly indicates the contrary. If it is intended to use 'a' or 'the' with an exclusive meaning then it will be made clear in the context. In some circumstances the use of 'at least one' or 'one or more' may be used to emphasize an inclusive meaning but the absence of these terms should not be taken to infer any exclusive meaning.
  • the presence of a feature (or combination of features) in a claim is a reference to that feature or (combination of features) itself and also to features that achieve substantially the same technical effect (equivalent features).
  • the equivalent features include, for example, features that are variants and achieve substantially the same result in substantially the same way.
  • the equivalent features include, for example, features that perform substantially the same function, in substantially the same way to achieve substantially the same result.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

An apparatus comprising means for:
receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon prioritization of the N audio channels, and wherein the prioritization is adaptive depending at least upon a changing content of one or more of the N audio channels; and
providing for rendering at least the output audio channel.

Description

    TECHNOLOGICAL FIELD
  • Embodiments of the present disclosure relate to audio. Some enable the distribution of common content for rendering to both advanced audio output devices and less advanced audio output devices.
  • BACKGROUND
  • Advanced audio output devices are capable of rendering multiple received audio channels as different spatially positioned audio sources. The spatial separation of audio sources (spatial audio) can aid hearing when the sources simultaneously provide sound.
  • Less advanced audio output devices are perhaps only capable of rendering one monophonic audio channel. They cannot render multiple received audio channels as different spatially positioned audio sources.
  • Content that is suitable for rendering spatial audio via an advanced audio output device may be unsuitable for a less advanced audio output device and content that is suitable for rendering by a less advanced audio output device may under-utilize the spatial audio capabilities of an advanced audio output device.
  • BRIEF SUMMARY
  • According to various, but not necessarily all, embodiments there is provided an apparatus comprising means for:
    • receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
    • controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon prioritization of the N audio channels, and wherein the prioritization is adaptive depending at least upon a changing content of one or more of the N audio channels; and
    • providing for rendering at least the output audio channel.
  • In some but not necessarily all examples, the apparatus comprises means for: automatically controlling mixing of the N audio channels to produce at least the output audio channel, in dependence upon time-variation of content of one or more of the N audio channels.
  • In some but not necessarily all examples, the N audio channels are N spatial audio channels where each of the N spatial audio channels can be rendered as a differently positioned audio source.
  • In some but not necessarily all examples, N is at least two and M is one, the output audio channel being a monophonic audio output channel.
  • In some but not necessarily all examples, the apparatus comprises means for analyzing the N audio channels to adapt a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels.
  • In some but not necessarily all examples, prioritization depends upon one or more of:
    • timing of content of at least one of the N audio channels relative to timing of content of at least another one of the N audio channels;
    • history of content of at least one of the N audio channels;
    • mapping to a particular person, an identified voice in content of at least one of the N audio channels;
    • detection that content of at least one of the N audio channels is voice content;
    • detection that content of at least one of the N audio channels comprises an identified word.
  • In some but not necessarily all examples, controlling mixing of the N audio channels to produce at least an output audio channel, comprises:
    • selecting a first sub-set of the N audio channels to be mixed to provide background audio;
    • selecting a second sub-set of the N audio channels to be mixed to provide foreground audio that is for rendering at greater loudness than the background audio, wherein the selection of the first sub-set and selection of the second sub-set is dependent upon the prioritization of the N audio channels; and
    • mixing the background audio and the foreground audio to produce the output audio channel.
  • In some but not necessarily all examples, the apparatus comprises means for controlling mixing of the N audio channels to produce M audio channels in response to a communication bandwidth for receiving the audio channels or for providing output audio signals falling beneath a threshold value.
  • In some but not necessarily all examples, the apparatus comprises means for controlling mixing of the N audio channels to produce M audio channels when there is conflict between a first audio channel of the N audio channels and a second audio channel of the N audio channels, wherein the first audio channel is included within the M audio channels and the second audio channel is not included within the M audio channels, wherein over-talking is an example of conflict.
  • In some but not necessarily all examples, the audio channels of the N audio channels that are not the selected M audio channels are available for later rendering.
  • In some but not necessarily all examples, the apparatus comprises a user input interface for controlling prioritization of the N audio channels.
  • In some but not necessarily all examples, the apparatus comprises a user input interface, wherein the user input interface provides a spatial representation of the N audio channels and indicates which of the N audio channels are comprised in the sub-set of M audio channels.
  • According to various, but not necessarily all, embodiments there is provided a multi-party, live communication system that enables live audio communication between multiple remote participants using at least the N audio channels wherein different ones of the multiple remote participants provide audio input for different ones of the N audio channels, wherein the system comprises the apparatus.
  • According to various, but not necessarily all, embodiments there is provided a method comprising:
    • receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
    • controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels; and
    • rendering at least the output audio channel.
  • According to various, but not necessarily all, embodiments there is provided a computer program that when run on one or more processors enables:
    • controlling mixing of N received audio channels, where each of the N audio channels can be rendered as a different audio source, to produce at least an output audio channel for rendering,
    • wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels.
  • According to various, but not necessarily all, embodiments there is provided an apparatus comprising means for:
    • receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
    • adapting a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels; and
    • controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon the prioritization; and
    • providing for rendering at least the output audio channel.
  • According to various, but not necessarily all, embodiments there is provided an apparatus comprising means for:
    • receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
    • analyzing the N audio channels to adapt a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels; and
    • controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon the prioritization; and
    • providing for rendering at least the output audio channel.
  • According to various, but not necessarily all, embodiments there is provided examples as claimed in the appended claims.
  • BRIEF DESCRIPTION
  • Some examples will now be described with reference to the accompanying drawings in which:
    • FIG. 1 illustrates an example of an apparatus for providing an output audio channel for rendering;
    • FIG. 2 illustrates an example of an apparatus in which an analyzer is configured to analyze the N audio channels to adapt the prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels;
    • FIG. 3 illustrates another example of the apparatus;
    • FIG. 4 illustrates an example of a multi-party, live communication system comprising the apparatus;
    • FIG. 5A and 5B illustrate alternative topologies of the system;
    • FIG. 6 illustrates an example of prioritization based on timing of content;
    • FIG. 7 illustrates an example of prioritization based on content type;
    • FIG. 8 illustrates an example of storage of unselected audio channels;
    • FIG. 9A, 9B, 9C illustrate examples of prioritization based on keywords in content;
    • FIG. 10 illustrates an example of informing a consumer of the output audio channel of an option to change the audio channels included within the output audio channel;
    • FIG. 11A illustrates an example of spatial audio rendered, based on the N audio channels, at an output end-point configured for rendering spatial audio;
    • FIG. 11B illustrates an example of audio rendered, based on the output audio channel, at an output end-point that is not configured for rendering spatial audio;
    • FIG. 12, 13, 15 illustrate examples of a method;
    • FIG. 14 illustrates an example of changing prioritization based on timing of content;
    • FIG. 16 illustrates an example of a controller; and
    • FIG. 17 illustrates an example of a computer program.
    DETAILED DESCRIPTION
  • The following description and the attached drawings describe various examples of an apparatus 10 that receives at least N audio channels 20 and enables the rendering of one or more output audio channels 52.
  • The set of N audio channels is referenced using reference number 20. Each audio channel of the set of N audio channels is referenced using reference number 20i, where i is 1, 2,...N-1, N.
  • The apparatus 10 comprises means for receiving at least N audio channels 20 where each of the N audio channels 20i can be rendered as a different audio source.
  • The apparatus 10 comprises means 40, 50 for controlling selection and mixing of the N audio channels 20 to produce at least an output audio channel 52.
  • A selector 40 selects for mixing (to produce the output audio channel 52) a sub-set 30 of M audio channels from the N audio channels 20. The selection is dependent upon prioritization 32 of the N audio channels 20. The prioritization 32 is adaptive depending at least upon a changing content 34 of one or more of the N audio channels 20.
  • The sub-set 30 of M audio channels is referenced using reference number 30. Each audio channel of the sub-set of M audio channels is referenced using reference number 20j, where j is any M of the N values of i. The sub-set 30 can, for example, be varied by changing the value of M and/or by changing which audio channels 20j are used to comprise the M audio channels of the sub-set 30. In the description, different sub-sets 30 can, in some examples, be differentiated using the same reference 30 with different numeric subscripts.
  • A mixer 50 mixes the sub-set 30 of M audio channels to produce the output audio channel 52 which is suitable for rendering.
  • An advanced spatial audio output device (an example is illustrated at FIG 11A) can render the N audio channels 20 as multiple different spatially positioned audio sources. A less advanced audio output device (an example is illustrated at FIG 11B) can render the output audio channel 52.
  • The apparatus 10 therefore allows a common content, the N audio channels 20, to provide audio output at both the advanced spatial audio output device and the less advanced audio output device.
  • FIG. 1 illustrates an example of an apparatus 10 for providing an output audio channel 52 for rendering. The rendering of the output audio channel 52 can occur at the apparatus 10 or can occur at some other device.
  • The apparatus 10 receives at least N audio channels 20. An audio channel 20i of the N audio channels 20 can be rendered as a distinct audio source.
  • The apparatus 10 comprises a mixer 50 for mixing a sub-set 30 of M audio channels, selected from the N audio channels 20, to produce at least an output audio channel 52.
  • A selector 40 selects for mixing, at mixer 50, the sub-set 30 of M audio channels from the N audio channels 20. The selection, by the selector 40, is dependent upon prioritization 32 of the N audio channels 20. The prioritization 32 is adaptive depending at least upon a changing content 34 of one or more of the N audio channels 20. The apparatus 10 provides, from the mixer 50, the output audio channel 52 for rendering.
  • The sub-set 30 of M audio channels has fewer audio channels than the N audio channels 20, that is, M is less than N. N is at least two and in at least some examples is greater than two. In at least some examples M is one and the output audio channel 52 is a monophonic audio output channel.
  • The prioritization 32 is adaptive. The prioritization 32 depends at least on a changing content 34 of one or more of the N audio channels 20.
  • In some but not necessarily all examples, the apparatus 10 is configured to automatically control the mixing of the N audio channels 20 to produce at least the output audio channel 52, in dependence upon time-variation of content 34 of one or more of the N audio channels 20.
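  • As a purely illustrative aid, and not part of the claimed apparatus, the overall select-and-mix behaviour described above can be sketched in a few lines of Python. The channel identifiers, the priority scores and the equal-gain downmix below are assumptions made only for this sketch.

```python
import numpy as np

def select_and_mix(channels, priorities, m=1):
    """Select the M highest-priority audio channels from the N received
    channels and downmix them to a single output audio channel.

    channels   : dict of channel id -> 1-D numpy array of samples (equal length)
    priorities : dict of channel id -> adaptive priority score (higher = more important)
    m          : size of the selected sub-set (M < N)
    """
    ranked = sorted(channels, key=lambda cid: priorities[cid], reverse=True)
    subset = ranked[:m]                                                   # sub-set of M selected channels
    output = sum(channels[cid] for cid in subset) / max(len(subset), 1)   # simple equal-gain downmix
    return subset, output

# Example: three channels; channel "3" currently has the highest priority.
fs = 48000
channels = {cid: 0.01 * np.random.randn(fs) for cid in ("1", "2", "3")}
priorities = {"1": 0.2, "2": 0.1, "3": 0.9}
subset, output = select_and_mix(channels, priorities, m=1)
print(subset)  # ['3'] -> only this channel is mixed into the output audio channel
```

  • As the priorities change with the content 34, repeated calls to a function of this kind yield a different sub-set 30 and hence a different output audio channel 52.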
  • FIG. 2 illustrates an example of an apparatus 10 in which an analyzer 60 is configured to analyze the N audio channels 20 to adapt the prioritization 32 of the N audio channels 20 in dependence upon, at least, changing content 34 of one or more of the N audio channels 20.
  • The analysis can be performed before (or simultaneously with) the aforementioned selection.
  • In some examples, the analyzer 60 is configured to process metadata associated with the N audio channels 20. Additionally or alternatively, in some examples, the analyzer 60 is configured to process the audio content of the audio channels 20. This processing could, for example, comprise voice activity detection, voice recognition processing, spectral analysis, semantic processing of speech or other processing including machine learning and artificial intelligence processing used to identify characteristics of the content 34 of one or more of the N audio channels 20.
  • The prioritization 32 can depend upon one or more parameters of the content 34.
  • In one example, the prioritization 32 depends upon timing of content 34i of an audio channel 20i relative to timing of content 34j of an audio channel 20j. Thus, the audio channel 20 that first satisfies a trigger condition has temporal priority. In some examples the trigger condition may be that the audio channel 20 has activity above a threshold, and/or has activity above a threshold in a particular spectral range and/or has voice activity and/or has voice activity associated with a specific person and/or the voice activity comprises semantic content including a particular keyword or phrase.
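  • A minimal sketch of such a temporal trigger condition is given below. It is illustrative only: the frame layout and the energy threshold are assumptions, and a real implementation could equally use voice activity detection or keyword spotting as the trigger.

```python
import numpy as np

def first_active_channel(frames, threshold=1e-3):
    """Return the id of the channel that first satisfies the trigger condition
    (here: frame energy above a threshold), scanning frame by frame.

    frames : dict of channel id -> 2-D array of shape (num_frames, frame_len)
    """
    num_frames = next(iter(frames.values())).shape[0]
    for f in range(num_frames):
        for cid, data in frames.items():            # order within one frame is arbitrary here
            if np.mean(data[f] ** 2) > threshold:   # simple activity detector
                return cid, f                       # this channel gains temporal priority
    return None, None
```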
  • An initial prioritization 32 can cause an initial selection of a first sub-set 301 of audio channels 20 that are mixed to form the output audio channel 52. A change in prioritization 32 can cause a new selection of a second different sub-set 302 of audio channels 20 that are mixed to form a new, different output audio channel 52. The first sub-set 301 and the second sub-set 302 are not equal sets. Thus, the apparatus 10 can prioritize one or more of the N audio channels 20 as a sub-set 30 until a new selection by the selector 40 based on a new prioritization 32 changes the sub-set 30.
  • If a person is speaking first in a particular audio channel 20, that channel may be prioritized ahead of a second audio channel. However, if the person speaking in the first audio channel stops speaking then the prioritization 32 of the audio channels can change and there can be a consequential reselection at the selector 40 of the sub-set 30 of M audio channels provided for mixing to produce the output audio channel 52.
  • The apparatus 10 can flag at least one input audio channel 20 corresponding to a first active talker, or generally active content 34, during a selection period and prioritize this selection over other audio channels 20. The apparatus 10 can determine whether the active talker continues before introducing content 34 from non-prioritized channels to the mixed output audio channel 52. The introduction of such additional content 34 from non-prioritized channels is controlled by the selector 40 during a following selection period.
  • In some examples, non-prioritized audio channels 20 can be completely omitted from the mixed output audio channel 52 and thus the mixed output audio channel 52 will contain only the prioritized channel(s). However, in other examples, the non-prioritized channels can be mixed with a lower gain or higher attenuation than the prioritized channel and/or with other suitable processing to produce the output audio channel 52.
  • It will therefore be appreciated that in at least some examples, a history of content 34 of at least one of the N audio channels 20 can be used to control the prioritization 32. For example, it may be possible to vary the "inertia" of the system, that is, control the rate of change of the prioritization 32. It is therefore possible to make the apparatus 10 more or less responsive to short term variations in the content 34 of one or more of the N audio channels 20.
  • The selector 40 in making a selection of which of the N audio channels 20 to select for mixing to produce the output audio channel 52 can, for example, use decision thresholds for selection. A decision threshold can be changed over time and can be dependent upon a history of the content 34. In addition, different decision thresholds can be used for different audio channels 20.
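  • One possible, purely illustrative way to realize such history-dependent decision thresholds is an exponentially smoothed activity score per channel, as sketched below; the smoothing factor and the single shared threshold are assumptions chosen only for the example (per-channel thresholds would be a straightforward extension).

```python
class ChannelPriority:
    """Adaptive prioritization with 'inertia': a smoothed activity estimate per
    channel, so that short-term variations in content do not immediately change
    which channels are selected."""

    def __init__(self, channel_ids, smoothing=0.9, threshold=0.5):
        self.score = {cid: 0.0 for cid in channel_ids}
        self.smoothing = smoothing          # higher value -> more inertia
        self.threshold = threshold          # decision threshold for selection

    def update(self, activity):
        """activity: dict of instantaneous activity per channel (e.g. 0..1 voice likelihood)."""
        for cid, a in activity.items():
            self.score[cid] = self.smoothing * self.score[cid] + (1.0 - self.smoothing) * a

    def selected(self):
        """Channels whose smoothed score currently exceeds the decision threshold."""
        return [cid for cid, s in self.score.items() if s > self.threshold]
```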
  • In some examples, the prioritization 32 can be dependent upon mapping, to a particular person, an identified voice in content 34 of at least one of the N audio channels 20. The analyzer 60 can, for example, perform voice recognition based upon the content 34 of one or more of the N audio channels 20. Alternatively, the analyzer 60 can identify a particular person based upon metadata comprised within the content 34 of at least one of the N audio channels 20. It may therefore be possible to identify a particular one of the N audio channels 20 as relating to a person whose contribution it is particularly important to hear such as, for example, a chairman of a meeting.
  • In some examples, the analyzer 60 is configured to adapt the prioritization 32 when the presence of voice content is detected within the content 34 of at least one of the N audio channels 20. Thus, the analyzer 60 is able to prioritize the spoken word within the output audio channel 52. It is also possible to adapt the analyzer 60 to prioritize other types of content.
  • In some, but not necessarily all, examples, the analyzer 60 is configured to adapt the prioritization 32 based upon detection that content 34 of at least one of the N audio channels 20 comprises an identified keyword. The analyzer 60 can, for example, listen to the content 34 and identify within the stream of content a keyword or identify semantic meaning within the stream of content. This can be used to modify the prioritization 32. For example, it may be desirable for a consumer of the output audio channel 52 to have that output audio channel 52 personalized so that if one of the N audio channels 20 comprises content 34 that includes the consumer's name or other information associated with the consumer then that audio channel 20 is prioritized by the analyzer 60.
  • In some, but not necessarily all, examples, the N audio channels 20 can represent live content. In this example, the analysis by the analyzer 60, the selection by the selector 40 and the mixing by the mixer 50 can occur in real time such that the output audio channel 52 is also live.
  • FIG. 3 illustrates an example of the apparatus of FIG. 1 in more detail. In this example one possible operation of the mixer 50 is illustrated in more detail. In this example, the mixing is a weighted mixing in which different sub-sets of the sub-set 30 of selected audio channels are weighted with different attenuation/gain before being finally mixed to produce the output audio channel 52.
  • In the illustrated example, the selector 40, based upon the prioritization 32, selects a first sub-set SS1 of the M audio channels to be mixed to provide background audio B and selects a second sub-set SS2 of the M audio channels 20 to be mixed to provide foreground audio F that is for rendering at greater loudness than the background audio B. The selection of the first sub-set SS1 and the selection of the second sub-set SS2 is dependent upon the prioritization 32 of the N audio channels 20. The first sub-set SS1 of audio channels 20 is mixed 501 to provide background audio B which is then amplified/attenuated G1 to adjust the loudness of the background audio before it is provided to the mixer 503 for mixing to produce the output audio channel 52. The second sub-set SS2 of audio channels 20 is mixed 502 to provide foreground audio F which is then amplified/attenuated G2 to adjust the loudness of the foreground audio before it is provided to the mixer 503 for mixing to produce the output audio channel 52.
  • The gain/attenuation G2 applied to the foreground audio F makes it significantly louder than the background audio B in the output audio channel 52. In some situations, the foreground audio F is naturally louder than background audio B. Thus, it can be but need not be that G2 > G1.
  • The gain/attenuation G1, G2 can, in some examples, vary with frequency.
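  • The weighted foreground/background mixing of FIG. 3 can be sketched as follows. This is an illustrative example only; the gain values and the peak-normalisation safeguard are assumptions and not part of the description above.

```python
import numpy as np

def mix_foreground_background(foreground, background, g2=1.0, g1=0.25):
    """Mix a foreground sub-set SS2 and a background sub-set SS1 into one
    output audio channel, rendering the foreground louder (here g2 > g1).

    foreground, background : lists of 1-D numpy arrays of equal length
    """
    fg = sum(foreground) if foreground else 0.0
    bg = sum(background) if background else 0.0
    out = g2 * fg + g1 * bg
    peak = np.max(np.abs(out)) if np.ndim(out) else 1.0   # simple safeguard against clipping
    return out / peak if peak > 1.0 else out
```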
  • FIG. 4 illustrates an example of a multi-party, live communication system 200 that enables live audio communication between multiple remote participants Ai, B, C, Di using at least the N audio channels 20. Different ones of the multiple remote participants Ai, B, C, Di provide audio input for different ones of the N audio channels 20.
  • The system 200 comprises input end-points 206 for capturing audio channels 20. The system 200 comprises output end-points 204 for rendering audio channels. One or more output end-points 204s (spatial output end-points) are configured for rendering spatial audio as distinct rendered audio sources. One or more output end-points 204m (mono output end-points) are not configured for rendering spatial audio.
  • The N audio channels 20 are N spatial audio channels where each of the N spatial audio channels is captured as a differently positioned captured audio source, and can be rendered using spatial audio as a differently positioned rendered audio source. In some examples the captured audio source (input end-point 206) has a fixed and stationary position. However, in other examples it can vary in position. When such an input end-point 206 is rendered as a rendered audio source at an output end-point 204 using spatial audio, then the rendered audio source can either be fixed or can move, for example, in a manner corresponding to the moving input end-point 206.
  • In this example, the system 200 is for enabling immersive teleconferencing or telepresence for remote terminals. The different terminals have varying device capabilities and different (and possibly variable) network conditions.
  • Spatial/immersive audio refers to audio that typically has a three-dimensional space representation or is presented (rendered) to a participant with the intention of the participant being able to hear a specific audio source from a specific direction. In the specific example illustrated there is a multi-participant audio/visual conference call between remote participants. Some of the participants share a room. For example, participants A1, A2, A3, A4 share the room A and the participants D1, D2, D3, D4, D5 share the room D.
  • Some of the terminals can be characterized as "advanced spatial audio output devices" that have an output end-point 204s that is configured for spatial audio. However, some of the terminals are less advanced audio output devices that have an output end-point 204m that is not configured for spatial audio.
  • In a spatial audio experience, the voices of the participants Ai, B, C, Di are spatially separated. The voices may, for example, have fixed spatial positions relative to each other or the directions may be adaptive, for example, according to participant movements, conference bridge settings or based upon inputs by participants. A similar experience is available to the participants who are using the output end-points 204s and they have the ability to interact much more naturally than traditional voice calls and voice conferencing. For example, they can talk at the same time and still understand each other thanks to effects such as the well-known cocktail party effect.
  • In rooms A and D, each of the respective participants Ai, Di has a personal input end-point 206 which captures a personal captured audio source as a personal audio channel 20. The personal input end-point 206 can, for example, be provided by a directional microphone or by a Lavalier microphone.
  • The participants B and C each have a single personal input end-point 206 which captures a personal audio channel 20.
  • In rooms A and D, the output end-points 204s are configured for spatial audio. For example, each room can have a surround sound system as an output end-point 204s.
    An output end point 204s is configured to render each captured sound source represented by an audio channel 20 as a rendered sound source.
  • In room D, each participant Ai, B, C has a personal output audio channel 20. Each personal output audio channel 20 is rendered from a different location as a different rendered audio source. The collection of rendered audio sources associated with the participants Ai creates a virtual room A.
  • In room A, each participant Di, B, C has a personal output audio channel 20. Each personal output audio channel 20 is rendered from a different location as a different rendered sound source. The collection of the rendered audio sources associated with the participants Di creates a virtual room D.
  • The participant C has an output end-point 204s that is configured for spatial audio. In this example, the participant C is using a headset configured for binaural spatial audio that is suitable for virtual reality (VR). Binauralization methods can be used to render personal audio channels 20 as spatially positioned rendered audio sources. Each participant Ai, Di, B has a personal output audio channel 20. Each personal output audio channel 20 is or can be rendered from a different location as a different rendered sound source.
  • The participant B has an output end-point 204m that is not configured for spatial audio. In this example it is a monophonic output end-point. In the example illustrated, the participant B is using a mobile device (e.g. a mobile phone) to provide the input end-point 206 and the output end-point 204m. The mobile device has a single output end-point 204m which provides the output audio channel 52 as previously described. The processing to produce the output audio channel 52 can be performed at the mobile device of the participant B or at the server 202.
  • The mono-capability limitation of participant B can, for example, be caused by the device, for example because it is only configured for decoding of mono audio, or because of the available audio output facilities such as a mono-only earpiece or headset.
  • In the preceding examples the spatial audio has been described at a high resolution. Each of the input end-points 206 is rendered in spatial audio as a spatially distinct rendered audio source. However, in other examples multiple ones of the input end-points 206 may be mixed together to produce a single rendered audio source. This can be used to reduce the number of rendered audio sources using spatial audio. Therefore, in some examples, a spatial audio device may render multiple ones of output audio channels 52.
  • In the example illustrated in FIG. 4, a star topology similar to that illustrated in FIG. 5A is used. The central server 202 interconnects the input end-points 206 and the output end-points 204. In the example of FIG. 5A, the input end-points 206 provide the N audio channels 20 to a central server 202 which produces the output audio channel 52 as previously described to the output end-point 204m. In this example, the apparatus 10 is located in the central server 202, however, in other examples the apparatus 10 is located at the output end-point 204m.
  • FIG. 5B illustrates an alternative topology in which there is no centralized architecture but a peer-to-peer architecture. In this example, the apparatus 10 is located at the output end-point 204m.
  • The 3GPP IVAS codec is an example of a voice and audio communications codec for spatial audio. The IVAS codec is an extension of the 3GPP EVS codec and is intended for new immersive voice and audio services over 4G and 5G. Such immersive services include, for example, immersive voice and audio for virtual reality (VR). The multi-purpose audio codec is expected to handle encoding, decoding and rendering of speech, music and generic audio. It is expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. It is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions. The audio channels 20 can, for example, be coded/decoded using the 3GPP IVAS codec.
  • The spatial audio channels 20 can, for example, be provided as metadata-assisted spatial audio (MASA), objective-based audio, channel-based audio (5.1, 7.1+4), non-parametric scene-based audio (e.g. First Order Ambisonics, High Order Ambisonics) and any combination of these formats. These audio formats can be binauralized for headset listening such that a participant can hear the audio sources outside their head.
  • It will therefore be appreciated from the foregoing that the apparatus 10 provides a better experience, including improved intelligibility for a mono user participating in a spatial audio teleconference with several potentially overlapping spatial audio inputs. The apparatus 10 means that it is not necessary, in some cases, to simplify the spatial audio conference experience for the spatial audio users due to having a mono-audio participant. Thus, a mono user can participate in a spatial audio conference without compromising the experience of the other users.
  • FIGS 6, 7, 8 and 9A illustrate examples of an apparatus 10 that comprises a controller 70. The controller receives N audio channels 20 and performs control processing to select the sub-set 30 of M audio channels. In the examples previously described, the controller 70 comprises the selector 40 and, optionally, the analyzer 60. In these examples, the mixer 50 is present but not illustrated.
  • In at least some of these examples, the controller 70 is configured to control mixing of the N audio channels 20 to produce the sub-set 30 of M audio channels when a conflict between a first audio channel of the N audio channels 20 and a second audio channel of the N audio channels occurs. For example, the control can cause the first audio channel 20 to be included within the sub-set 30 of M audio channels and cause the second audio channel 20 not to be included within the sub-set 30 of M audio channels.
  • In some examples, at a later time, when there is no longer conflict between the first audio channel and the second audio channel, the second audio channel is included within the sub-set 30 of M audio channels.
  • One example of when there is conflict between audio channels is when there is simultaneous activity from different prioritized sound sources. For example, overtalking (simultaneous speech) associated with different audio channels 20 can be an example of conflict.
  • In the example illustrated in FIG. 6, the prioritization 32 used for the selection of audio channels to form the sub-set 30 of M audio channels depends upon timing of content 34 of at least one of the N audio channels 20 relative to timing of content 34 of at least another one of the N audio channels 20.
  • In this example, the participant 3 speaks first and the audio channel 203 associated with the participant 3 is selected as a 'priority' for inclusion within the sub-set 30 of M=1 audio channels used to form the output audio channel 52. The later speech by participants 4 and 5 is not selected for inclusion within the sub-set 30 of audio channels used to form the output audio channel 52.
  • The audio channel 203 preferentially remains prioritized and remains included within the output audio channel 52, while there is voice activity in the audio channel 203, whereas the audio channels 204, 205 are excluded. If voice activity is no longer detected in the audio channel 203 then in some examples a selection process may immediately change the identity of the audio channel 20 selected for inclusion within the output audio channel 52. However, in other examples there can be a selection grace period. During this grace period, there can be a greater likelihood of selection/reselection of the original selected audio channel 203. Thus, during the grace period prioritization 32 is biased in favor of the previously selected audio channel.
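  • The selection grace period can be sketched, purely for illustration, as a bias added to the previously selected channel while the grace period has not expired; the frame count and bias value below are assumptions made only for the example.

```python
def reselect_primary(priorities, previous, frames_since_inactive,
                     grace_frames=50, bias=0.3):
    """Choose the primary channel for the output audio channel, biasing the
    previously selected channel while the selection grace period lasts."""
    biased = dict(priorities)
    if previous is not None and frames_since_inactive <= grace_frames:
        biased[previous] = biased.get(previous, 0.0) + bias   # favour re-selection
    return max(biased, key=biased.get)
```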
  • It will therefore be appreciated that in at least some examples, prioritization 32 used for the selection depends upon a history of content 34 of at least one of the N audio channels 20.
  • In some examples, the prioritization 32 used for the selection can depend upon mapping to a particular person (an identifiable human), an identified voice in content 34 of at least one of the N audio channels 20. A voice can be identified using metadata or by analysis of the content 34. The prioritization 32 would more favorably select the particular person's audio channel 20 for inclusion within the output audio channel 52.
  • The particular person could, for example, be based upon service policy. A teleconference service may have a moderator or chairman role and this participant may for example be made audible to all participants or may be able to force themselves to be audible to all participants. In other examples, the particular person could for example be indicated by a user consuming the output audio channel 52. That consumer could for example indicate which of the other participants' content 34 or audio channels 20 they wish to consume. This audio channel 20 could then be included, or be more likely to be included, within the output audio channel 52. The inclusion of the user-selected audio channel 20 can for example be dependent upon voice activity within the audio channel 20, that is, the user-selected audio channel 20 is only included if there is active voice activity within that audio channel 20. The prioritization 32 used for the selection therefore strongly favors the user-selected audio channel 20. The selection by the consumer of the output audio channel 52 of a particular audio channel 20 can for example be based upon an identity of the participant who is speaking or should speak in that audio channel. Alternatively, it could be based upon a user-selection of that audio channel because of the content 34 rendered within that audio channel.
  • FIG. 7 illustrates an example similar to FIG. 6. In this example, the audio channels 20 include a mixture of different audio types. The audio channel 203 associated with participant 3 is predominantly a voice channel. The audio channels 204, 205 associated with participants 4 and 5 are predominantly instrumental/music channels. In this example, the selection of which of the audio channels 20 is to be included within the output audio channel 52 can be based upon the audio type present within the audio channel 20. The detection of the audio type within the audio channel 20 can for example be achieved using metadata or, alternatively, by analyzing the content 34 of the audio channel 20. Thus, the prioritization 32 used for selection can be dependent upon detection that content 34 of at least one of the N audio channels 20 is voice content. In such a voice-centric case, natural pauses in the active content 34 allow for changes in the mono downmix. That is, the output audio channel 52 can switch between the inclusion of different audio channels 20 in dependence upon which of them includes active voice content. In this way priority can be given to spoken language. The other channels, for example the music channels 204, 205, may optionally be included, for example as background audio as previously described in relation to FIG. 3.
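  • A content-type based prioritization of this kind could, purely illustratively, be scored as below; the weights and the "voice"/"music" labels are assumptions, and in practice the labels could come from metadata or from the analyzer 60.

```python
def content_type_priority(activity, content_type, voice_weight=1.0, music_weight=0.1):
    """Score each channel so that active voice content is prioritized ahead of
    active instrumental/music content; natural pauses (low activity) in the
    voice channel lower its score and allow the downmix to switch."""
    scores = {}
    for cid, kind in content_type.items():
        weight = voice_weight if kind == "voice" else music_weight
        scores[cid] = weight * activity.get(cid, 0.0)
    return scores
```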
  • In the examples illustrated in FIGS 6 and 7, the apparatus 10 deliberately loses information by excluding (or diminishing) audio channels 20 with respect to the output audio channel 52. Information is generally lost by the selective downmixing which is required to maintain or guarantee intelligibility. It is, however, possible for there to be two simultaneously important audio channels 20, only one of which is selected for inclusion in the output audio channel 52. The apparatus illustrated in FIG. 8 addresses this issue.
  • The apparatus 10 illustrated is similar to that illustrated in FIGS 6 and 7. However, it additionally comprises a memory 82 for storage of a further sub-set 80 of the N audio channels 20 that is different to the sub-set 30 of N audio channels 20. Thus, in this example at least some of the audio channels of the N audio channels 20 that are not selected for inclusion in the sub-set 30 of M audio channels, are stored as sub-set 80 and are available for later rendering. In some examples, the later rendering may be at a faster playback rate and that playback may be fixed or may be adaptive. In some examples, the sub-set 80 of audio channels is mixed to form an alternative audio output channel for storage in the memory 82.
  • In the specific example illustrated at least some of the audio channels of the N audio channels that are not selected to be in the sub-set 30 of M audio channels are stored in memory 82 for later rendering.
  • In the particular illustrated example, there is selection of a first sub-set 30 of M audio channels from the N audio channels based upon prioritization 32 of the N audio channels. The first sub-set 30 of M audio channels are mixed to produce a first output audio channel 52. There is selection of a different second sub-set 80 of audio channels from the N audio channels based upon prioritization 32 of the N audio channels. The second sub-set 80 of audio channels are mixed to produce a second output audio channel for storage.
  • In the example illustrated in FIG. 8, the audio channel 203 includes content 34 comprising voice content from a single participant, and it is selected for inclusion within the sub-set 30 of audio channels. It is used to produce the output audio channel 52. The audio channels 204, 205, which have not been included within the output audio channel 52, or included only as background (as described with reference to FIG. 3), are selected for mixing to produce the second output audio signal that is stored in memory 82.
  • When there is storage of a second sub-set 80 of audio channels as a second audio signal, it is desirable to let the consumer of the output audio channel 52 know of the existence of the stored audio signal. This can for example facilitate user control of switching from rendering the output audio channel 52 to rendering the stored audio channel.
  • FIG. 10 illustrates an example of how such an indication may be provided to the consumer of the output audio channel 52. Fig 10 is described in detail later.
  • In some examples, it may be possible to automatically switch from rendering the output audio channel 52 to rendering the stored audio channel. For example, there may be automatic switching during periods of inactivity of the output audio channel 52. An apparatus 10 may switch to the stored audio channel and play that back at a higher speed. For example, the apparatus 10 can monitor the typical length of inactivity in the preferred output audio channel 52 and adjust the speed of playback for the stored audio channel such that the relevant portions can be played back during a typical inactive period.
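  • Purely as an illustration of the adaptive playback speed mentioned above, a playback rate could be derived from the typical inactive period as sketched below; the cap on the rate is an assumption made only for the example.

```python
def catch_up_rate(stored_duration_s, typical_gap_s, max_rate=2.0):
    """Choose a playback rate for the stored (unselected) audio so that it can
    be played back within a typical period of inactivity in the output channel."""
    if typical_gap_s <= 0:
        return max_rate
    return min(max(stored_duration_s / typical_gap_s, 1.0), max_rate)
```

  • For example, 6 seconds of stored audio and a typical 4 second pause would give a playback rate of 1.5.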
  • FIG. 9A illustrates an example in which the apparatus 10 detects that content 34 of at least one of the N audio channels 20 comprises an identified keyword and adapts the prioritization 32 accordingly. The prioritization 32 in turn controls selection of which of the audio channels 20 are included in the sub-set 30 and the output audio channel 52 (and, if implemented, the stored alternative audio channel).
  • In the example illustrated in FIG. 9B, the participant 'User 3' is speaking first and has priority. Therefore, the audio channel 203 associated with the User 3 is initially selected as the priority audio channel and is included within the output audio channel 52. Even though the participant 'User 5' begins to talk, the prioritization is not changed and the audio channel 203 remains the priority audio channel included within the output audio channel 52. At time T1 it is detected that User 5 says a keyword, in this example the name of the consumer of the output audio channel 52 (Dave). While this event increases the likelihood of a switch in the prioritization of the audio channels 203, 205 such that the audio channel 205 becomes prioritized and included in the output audio channel 52, in this example there is insufficient cause to change the prioritization 32 and consequently change which of the audio channels 20 is included within the output audio channel 52.
  • In the example illustrated in FIG. 9C, the participant 'User 3' is speaking first and has priority. Therefore, the audio channel 203 associated with the User 3 is initially selected as the priority audio channel and is included within the sub-set 30 used to produce the output audio channel 52. Even though the participant 'User 5' begins to talk, the prioritization is not changed and the audio channel 203 remains the priority audio channel included within the sub-set 30 and the output audio channel 52. At time T1 it is detected that User 5 says a keyword, in this example the name of the consumer of the output audio channel 52 (Dave). This event causes a switch in the prioritization of the audio channels 203, 205 such that the audio channel 205 becomes prioritized and included in the sub-set 30 and the output audio channel 52 and the audio channel 203 becomes de-prioritized and excluded from the sub-set 30 and the output audio channel 52.
  • In some examples, the consumer of the output audio channel 52 can via user input settings control the likelihood of a switch when a keyword is mentioned within an audio channel 20. For example, the consumer of the output audio channel 52 can, for example, require a switch if a keyword is detected. Alternatively, the likelihood of a switch can be increased.
  • In other examples, the occurrence of a keyword can increase the prioritization of an audio channel 20 such that it is stored, for example as described in relation to FIG. 8.
  • In other examples, the detection of a keyword may provide an option to the consumer of the output audio channel 52, to enable the consumer to cause a change in the audio channel 20 included within the sub-set 30 and the output audio channel 52. For example, if the name of the consumer of the output audio channel 52 is included within an audio channel 20 that is not being rendered, as a priority, within the output audio channel 52 then the consumer of the output audio channel 52 can be presented with an option to change prioritization 32 and switch to using a sub-set 30 and output audio channel 52 that includes the audio channel 20 in which their name was detected.
  • Where a detected keyword causes a switch in the audio channels included in the sub-set 30 and output audio channel 52, the new output audio channel 52 based on the detected keyword may be played back from the occurrence of the detected keyword. In some examples the playback is at a faster rate to allow a catch-up with real time.
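  • The keyword-triggered change of prioritization described with reference to FIGs. 9A to 9C can be sketched as follows. This is illustrative only: the transcripts are assumed to come from a speech-to-text stage, and whether a keyword forces a switch or merely raises the likelihood of one is a configurable assumption.

```python
def keyword_priority_switch(transcripts, keywords, current, priorities,
                            boost=1.0, force_switch=True):
    """If an identified keyword (e.g. the listener's name) occurs in a
    non-selected channel, raise that channel's priority and optionally
    switch the primary channel to it."""
    for cid, text in transcripts.items():
        if cid != current and any(k.lower() in text.lower() for k in keywords):
            priorities[cid] = priorities.get(cid, 0.0) + boost
            if force_switch:
                return cid, priorities          # switch, cf. the FIG. 9C behaviour
    return current, priorities                  # no switch, cf. the FIG. 9B behaviour
```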
  • FIG. 10 illustrates an example in which a consumer of the output audio channel 52 is provided with information to allow that consumer to make an informed decision to switch audio channels 20 included within the sub-set 30 and the output audio channel 52.
  • In some examples, some form of indication is given to indicate a change in activity status. For example, if a particular participant begins to talk or there is a second separate discussion ongoing, the consumer of the original output audio channel 52 is made aware of this.
  • A suitable indicator could for example be an audible indicator that is added to the output audio channel 52. In some examples, each participant may have an associated different tone and a beep with a particular tone may indicate which participant has begun to speak. Alternatively, an indicator could be a visual indicator in a user input interface.
  • In the example illustrated in FIG. 10, the background audio is adapted to provide an audible indication. Initially, the consumer listening to the output audio channel 52 hears the audio channel 201 associated with a first participant's voice (User A voice). If a second audio channel 20 is mixed with the audio channel 201, then it may, for example, be an audio channel 202 that captures the ambient audio of the first participant (User A ambience). At time T1 a second participant, User B, begins to talk. This does not initiate a switch of prioritization 32 sufficient to change the sub-set 30. The primary audio channel 20 in the sub-set 30 and the output audio channel 52 remains the audio channel 201. However, an indication is provided to indicate to the consumer of the output audio channel 52 that there is an alternative, available, audio channel 203. The indication is provided by mixing the primary audio channel 201 with an additional audio channel 20 associated with the User B. For example, the additional audio channel 20 can be an attenuated version of the audio channel 203 or can be an ambient audio channel 204 for the User B (User B ambience). In this example, the second audio channel 202 is replaced by the additional audio channel 204.
  • The consumer of the output audio channel 52 can then decide whether or not they wish to cause a change in the prioritization 32 to prioritize the audio channel 203 associated with the User B above the audio channel 201 associated with the User A. If this change in prioritization occurs then there is a switch in the primary audio channel within the sub-set 30 and the output audio channel 52 from being the audio channel 201 to being the audio channel 203. In the example illustrated, the consumer does not make this switch. The switch does however occur automatically when the User A stops talking at time T2.
  • In the example of FIG. 10, referring back to the example of FIG. 3, the background audio B can be included and/or varied as an indication to the consumer of the output audio channel 52 that an alternative audio channel 20 is available for selection.
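  • Purely for illustration, the audible indication of FIG. 10 could be produced by mixing a quiet cue derived from the newly active channel (its ambience or an attenuated copy of it) into the output audio channel; the cue gain below is an assumption made only for the example.

```python
import numpy as np

def add_availability_cue(primary, cue, cue_gain=0.15):
    """Mix a quiet cue (e.g. another participant's ambience or an attenuated
    copy of their channel) into the output so the listener is made aware that
    an alternative audio channel has become active."""
    return np.asarray(primary) + cue_gain * np.asarray(cue)
```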
  • FIG. 11A schematically illustrates audio rendered to a participant (User 5) at an output end-point 204s of the system 200 (not illustrated) that is configured for rendering spatial audio. In accordance with the preceding examples, the audio output at the end-point 204s has multiple rendered sound sources associated with audio channels 201, 202, 203, 204 at different locations. FIG. 11A illustrates that even with the presence in the system 200 (not illustrated) of an output end-point 204m (FIG 11B) that is not configured for spatial audio rendering, there may be no need to reduce the immersive capabilities or experience at the output end-points 204s of the system 200 that are configured for rendering spatial audio.
  • FIG. 11B schematically illustrates audio rendered to a participant (User 1) at an output end-point 204m of the system 200 (not illustrated) that is not configured for rendering spatial audio. In accordance with the preceding examples, the audio output at the end-point 204m provided by the output audio channel 52 has a single monophonic output audio channel 52 that is based on the sub-set 30 of selected audio channels 20 and has good intelligibility. In the example illustrated, the audio channel 202 is the primary audio channel that is included in the sub-set 30 and the output audio channel 52.
  • The apparatus 10 can be configured to automatically switch the composition of the audio channels 20 mixed to form the output audio channel 52 in dependence upon an adaptive prioritization 32. Additionally or alternatively, in some examples, the switching can be effected manually by the consumer at the end-point 204m using a user interface which includes a user input interface 90.
  • In the example illustrated in FIG. 11B, the device at the output end-point 204m, which in some examples may be the apparatus 10, comprises a user input interface 90 for controlling prioritization 32 of the N audio channels 20. For example, the user input interface 90 can be configured to highlight or label selected ones of the N audio channels 20 for selection. The user input interface 90 can be used to control if and to what extent manual or automatic switching occurs to produce the output audio channel 52 from selected ones of the audio channels 20. An adaptation of the prioritization 32 can cause an automatic switching or can cause a prompt to a consumer for manual switching.
  • In some examples, the user input interface 90 can control if and the extent to which prioritization 32 depends upon one or more of timing of content 34 of at least one of the N audio channels 20 relative to timing of content 34 of at least another one of the N audio channels 20; history of content 34 of at least one of the N audio channels 20; mapping to a particular person an identified voice in content 34 of at least one of the N audio channels 20; detection that content 34 of at least one of the N audio channels 20 is voice content; and/or detection that content 34 of at least one of the N audio channels comprises an identified word.
  • In the example illustrated, within the user input interface 90, there is an option 914 that allows the participant, User 1, to select the audio channel 204 as a replacement primary audio channel that is included in the sub-set 30 and the output audio channel 52 instead of the audio channel 202. There is also an option 913 that allows User 1 to select the audio channel 203 as a replacement primary audio channel that is included in the sub-set 30 and the output audio channel 52 instead of the audio channel 202.
  • In some but not necessarily all examples, the user input interface 90 can provide a visual spatial representation of the N audio channels 20 and indicate which of the N audio channels 20 are comprised in the sub-set 30 of M audio channels.
  • The user input interface 90 can also indicate which of the N audio channels are not comprised in the sub-set 30 of M audio channels and which, if any, of these are active.
  • In some, but not necessarily all, examples, the user input interface 90 may provide textual information about an audio channel 20 that is active and available for selection. For example, speech-to-text algorithms may be utilized to convert speech within that audio channel 20 into an alert displayed at the user input interface 90. Referring back to the example illustrated in FIG. 9A, the apparatus 10 may be configured to cause the user input interface 90 to provide an option to a consumer of the output audio channel 52 that enables that consumer to switch audio channels 20 included within the sub-set 30 and output audio channel 52. In this example, the keyword is "Dave" and the textual output provided by the user input interface 90 could, for example, say "option to switch to User 5 who addressed you and said: 'In our last teleco Dave made an interesting'". If the consumer, Dave, then selects the option to switch, the sub-set 30 and the output audio channel 52 then includes the audio channel 205 from the User 5 and starts from the position "In our last teleco Dave made an interesting...". A memory 82 (not illustrated in the FIG) could be used to store the audio channel 205 from the User 5.
  • In the preceding examples, the apparatus 10 can be permanently operational to perform the selection of the sub-set 30 of audio channels 20 used to produce the output audio channel 52. However, in other examples the apparatus 10 has a state in which it is operational in this way and a state in which it is not operational in this way, and it can transition between these states, for example when a trigger event is or is not detected.
    The apparatus 10 can be configured to control mixing, by the mixer 50, of the N audio channels 20 to produce M audio channels in response to a trigger event.
  • One example of a trigger event is conflict between audio channels 20. An example of detecting conflict would be when there is overlapping speech in audio channels 20.
  • Another example of a trigger event is a reduction in communication bandwidth for receiving the audio channels 20 below a threshold value. In this example, the value of M can be dependent upon the available bandwidth.
  • Another example of a trigger event is a reduction in communication bandwidth for providing the output audio channel 52 beneath a threshold value. In this example, the value of M can be dependent upon the available bandwidth.
  • In some examples, the apparatus 10 can also be configured to control the transmission of audio channels 20 to it and to reduce the number of audio channels received, by N-M, from N to M, wherein only the M audio channels that may be required for mixing to produce the output audio channel 52 are received.
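  • The bandwidth-dependent choice of M can be sketched, purely illustratively, as below; the per-channel bit rate and the minimum of one monophonic channel are assumptions made only for the example.

```python
def channels_for_bandwidth(available_kbps, per_channel_kbps=24.0, n=8, minimum=1):
    """Decide how many audio channels (M) can be received or forwarded given
    the currently available communication bandwidth, never exceeding N and
    never falling below a minimum (e.g. one monophonic channel)."""
    m = int(available_kbps // per_channel_kbps)
    return max(min(m, n), minimum)
```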
  • FIG. 12 illustrates an example of a method 100 that can for example be performed by the apparatus 10. The method comprises, at block 102, receiving at least N audio channels 20 where each of the N audio channels 20 can be rendered as a different audio source.
  • The method 100 comprises, at block 104, controlling mixing of the N audio channels 20 to produce at least an output audio channel 52, wherein the mixer 50 selects a sub-set 30 of at least M audio channels from the N audio channels 20 in dependence upon prioritization 32 of the N audio channels 20, wherein the prioritization 32 is adaptive and depends at least upon a content 34 of one or more of the N audio channels 20. The method 100 further comprises, at block 106, causing rendering of at least the output audio channel 52.
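  • A compact, non-limiting sketch of method 100 is given below; the energy-plus-keyword priority measure, the keyword itself and the simple averaging mix are assumptions chosen for the example and are not mandated by the method.

      # Minimal sketch (assumed priority measure) of blocks 102-106 of method 100.
      import numpy as np

      def priority(samples, transcript, keyword="Dave"):
          """Adaptive, content-dependent priority: signal energy plus a bonus if a keyword is present."""
          activity = float(np.mean(samples ** 2))
          bonus = 1.0 if keyword.lower() in transcript.lower() else 0.0
          return activity + bonus

      def mix_output(channels, transcripts, m=1):
          """Select the M highest-priority channels (the sub-set) and mix them into one output channel."""
          scores = [priority(c, t) for c, t in zip(channels, transcripts)]
          selected = sorted(range(len(channels)), key=lambda i: scores[i], reverse=True)[:m]
          output = sum(channels[i] for i in selected) / len(selected)
          return output, selected

      # Example: N = 3 channels of test audio; only the highest-priority one reaches the output.
      rng = np.random.default_rng(0)
      channels = [rng.standard_normal(480) * gain for gain in (0.1, 0.8, 0.3)]
      output, selected = mix_output(channels, ["", "hello everyone", ""], m=1)
      print("selected channel indices:", selected)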
  • FIG. 13 illustrates a method 110 for producing the output audio channel 52. This method broadly corresponds to the method previously described with reference to FIG. 6.
  • At block 112, the method 110 comprises obtaining spatial audio signals from at least two sources as distinct audio channels 20. At block 114, the method 110 comprises determining temporal activity of each of the spatial audio signals (of the two audio channels 20) and selecting at least one spatial audio signal (audio channel 20) for mono downmix (for inclusion within the sub-set 30 and the output audio channel 52) for the duration of its activity. At block 116, the method 110 comprises determining a content-based priority for at least one of the spatial audio signals (audio channels 20) for temporarily altering a previous selection. At block 118, the method 110 comprises determining a first mono downmix (sub-set 30 and output audio channel 52) based on at least one of the prioritized spatial audio signals (audio channels 20). The output audio channel 52 is based upon the selected sub-set M, which is in turn based upon the prioritization 32. Then, at block 120, the method 110 provides the first mono downmix (the output audio channel 52) to the participant for listening. That is, it provides the output audio channel 52 for rendering.
  • It will therefore be appreciated that the prioritization 32 determined at block 116 is used to adaptively adjust selection of the sub-set 30 of M audio channels 20 used to produce the output audio channel 52.
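  • One way in which a content-based priority could temporarily override an activity-based selection is sketched below; the override threshold and the scalar content-priority values are assumptions of the sketch.

      # Minimal sketch (assumed threshold) of block 116: a content-based priority
      # temporarily overriding the activity-based selection of the primary channel.
      def select_primary(current, activity, content_priority, override_threshold=0.9):
          """Keep `current` while it is active, unless another channel's content-based
          priority exceeds the threshold, in which case that channel is selected temporarily."""
          best = max(range(len(content_priority)), key=lambda i: content_priority[i])
          if best != current and content_priority[best] >= override_threshold:
              return best                          # temporary, content-driven override
          if activity[current]:
              return current                       # ongoing activity keeps the previous selection
          active = [i for i, is_active in enumerate(activity) if is_active]
          return active[0] if active else current

      # Example: channel 0 is talking, but channel 2 addresses the listener by name.
      print(select_primary(current=0, activity=[True, False, True], content_priority=[0.2, 0.0, 0.95]))   # -> 2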
  • FIG. 14 illustrates an example in which the audio channel 203 is first selected, based on prioritization, as the primary audio channel in the output audio channel 52. In this example, at this time, the output audio channel 52 does not comprise the audio channel 204 or 205. Until the activity in the selected audio channel 203 ends, the audio channel 203 remains prioritized. There is no change to the selection of the sub-set 30 of M audio channels until the activity in the audio channel 203 ends. When the activity in the audio channel 203 ends then a new selection process can occur based upon the prioritization 32 of other channels. In this example there is a selection grace period after the end of activity in the audio channel 203. If there is resumed activity in the audio channel 203 during this selection grace period then the audio channel 203 will be re-selected as the primary channel to be included in the sub-set 30 and the output audio channel 52. Thus during the selection grace period the audio channel 203 can have a higher prioritization and be selected if it becomes active. After the selection grace period expires, the prioritization of the audio channel 203 can be decreased.
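  • The grace-period behaviour can be sketched as follows; the two-second grace period and the simple activity bookkeeping are assumptions made for the example, not values defined above.

      # Minimal sketch (assumed grace-period length) of the re-selection behaviour of FIG. 14.
      def reselect(previous, now, last_active_time, activity, grace_period=2.0):
          """Re-select the previous primary channel if it resumes activity within the grace period
          after its activity ended; otherwise select another active channel, if any."""
          within_grace = (now - last_active_time[previous]) <= grace_period
          if activity[previous] and within_grace:
              return previous                               # prioritized re-selection
          candidates = [i for i, is_active in enumerate(activity) if is_active and i != previous]
          return candidates[0] if candidates else previous

      # Example: the previously selected channel stopped 1.5 s ago and resumes, so it is re-selected.
      print(reselect(previous=2, now=11.5, last_active_time={2: 10.0}, activity=[False, False, True, True]))   # -> 2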
  • FIG. 15 illustrates an example of a method 130 that broadly corresponds to the method previously described in relation to FIG. 8. At block 132, the method 130 comprises obtaining spatial audio signals (audio channels 20) from at least two sources. This corresponds to the receiving of at least two audio channels 20. At block 134, the method 130 determines a first mono downmix (sub-set 30 and output audio channel 52) based on at least one of the spatial audio signals (audio channels 20). Next, at block 136, the method 130 comprises determining at least one second mono downmix (sub-set 80 and additional audio channel) based on at least one of the spatial audio signals (audio channels 20) not present in the first mono downmix. At block 138, the first mono downmix is provided to a participant for listening as the output audio channel 52. At block 140, the second mono downmix is provided to a memory for storage.
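  • A minimal sketch of method 130 is given below, assuming a simple averaging downmix; in practice the choice of which channels form the first downmix would follow the prioritization 32, and the list used for storage merely stands in for the memory of block 140.

      # Minimal sketch (assumed averaging) of method 130: a first mono downmix for
      # listening and a second mono downmix of the remaining channels for storage.
      import numpy as np

      def two_downmixes(channels, selected_indices):
          """Split the N channels into a rendered downmix and a stored downmix of the remainder."""
          selected = [channels[i] for i in selected_indices]
          rest = [c for i, c in enumerate(channels) if i not in selected_indices]
          first = np.mean(selected, axis=0) if selected else np.zeros_like(channels[0])
          second = np.mean(rest, axis=0) if rest else np.zeros_like(channels[0])
          return first, second

      stored = []                                   # stands in for the memory used at block 140
      channels = [np.full(4, gain) for gain in (1.0, 2.0, 3.0)]
      first_downmix, second_downmix = two_downmixes(channels, selected_indices=[1])
      stored.append(second_downmix)                 # block 140: the second downmix goes to storage
      print(first_downmix, second_downmix)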
  • In any of the examples, when an audio channel 20 associated with a particular input end-point 206 is selected for inclusion within the sub-set 30 of audio channels used to create the output audio channel 52, then this information may be provided as feedback at an output end-point 204 associated with that included input end-point 206.
  • In any of the examples, when an audio channel 20 associated with a particular input end-point 206 is not selected for inclusion within the sub-set 30 of audio channels used to create the output audio channel 52 at a particular output end-point 204, then this information may be provided as feedback at an output end-point 204 associated with that excluded input end-point 206. The information can, for example, identify the input end-points 206 not selected for inclusion for rendering at a particular identified output end-point 204.
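  • By way of example only, such inclusion/exclusion feedback could be generated as sketched below; the end-point identifiers and the message wording are assumptions for the sketch.

      # Minimal sketch (assumed message format) of feedback to input end-points about
      # whether their audio is rendered at an identified output end-point.
      def feedback_messages(included_inputs, all_inputs, output_endpoint):
          """Tell each input end-point whether its audio is mixed into the identified output end-point."""
          messages = {}
          for endpoint in all_inputs:
              state = "included in" if endpoint in included_inputs else "not rendered at"
              messages[endpoint] = f"your audio is {state} output end-point {output_endpoint}"
          return messages

      print(feedback_messages(included_inputs={"input-1"}, all_inputs=["input-1", "input-2"], output_endpoint="output-A"))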
  • FIG. 16 illustrates an example of a controller 70. Implementation of a controller 70 may be as controller circuitry. The controller 70 may be implemented in hardware alone, have certain aspects in software including firmware alone or can be a combination of hardware and software (including firmware).
  • As illustrated in FIG. 16 the controller 70 may be implemented using instructions that enable hardware functionality, for example, by using executable instructions of a computer program 76 in a general-purpose or special-purpose processor 72 that may be stored on a computer readable storage medium (disk, memory etc) to be executed by such a processor 72.
  • The processor 72 is configured to read from and write to the memory 74. The processor 72 may also comprise an output interface via which data and/or commands are output by the processor 72 and an input interface via which data and/or commands are input to the processor 72.
  • The memory 74 stores a computer program 76 comprising computer program instructions (computer program code) that controls the operation of the apparatus when loaded into the processor 72. The computer program instructions, of the computer program 76, provide the logic and routines that enable the apparatus to perform the methods previously illustrated and/or described. The processor 72, by reading the memory 74, is able to load and execute the computer program 76.
  • The apparatus 10 therefore comprises:
    • at least one processor 72; and
    • at least one memory 74 including computer program code
    • the at least one memory 74 and the computer program code configured to, with the at least one processor 72, cause the apparatus 10 at least to perform:
      • receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
      • controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels; and
      • causing rendering of at least the output audio channel.
  • As illustrated in FIG. 17, the computer program 76 may arrive at the apparatus 10 via any suitable delivery mechanism 78. The delivery mechanism 78 may be, for example, a machine readable medium, a computer-readable medium, a non-transitory computer-readable storage medium, a computer program product, a memory device, a record medium such as a Compact Disc Read-Only Memory (CD-ROM) or a Digital Versatile Disc (DVD) or a solid state memory, an article of manufacture that comprises or tangibly embodies the computer program 76. The delivery mechanism may be a signal configured to reliably transfer the computer program 76. The apparatus 10 may propagate or transmit the computer program 76 as a computer data signal.
  • Computer program instructions for causing an apparatus to perform at least the following or for performing at least the following:
    • receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
    • controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels; and
    • causing rendering of at least the output audio channel.
  • The computer program instructions may be comprised in a computer program, a non-transitory computer readable medium, a computer program product, a machine readable medium. In some but not necessarily all examples, the computer program instructions may be distributed over more than one computer program.
  • Although the memory 74 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/dynamic/cached storage.
  • Although the processor 72 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable. The processor 72 may be a single core or multi-core processor.
  • References to 'computer-readable storage medium', 'computer program product', 'tangibly embodied computer program' etc. or a 'controller', 'computer', 'processor' etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc.
  • As used in this application, the term 'circuitry' may refer to one or more or all of the following:
    (a) hardware-only circuitry implementations (such as implementations in only analog and/or digital circuitry) and
    (b) combinations of hardware circuits and software, such as (as applicable):
      (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and
      (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions and
    (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g. firmware) for operation, but the software may not be present when it is not needed for operation.
    This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
  • The blocks illustrated in the preceding Figs may represent steps in a method and/or sections of code in the computer program 76. The illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks and the order and arrangement of the blocks may be varied. Furthermore, it may be possible for some blocks to be omitted.
  • Where a structural feature has been described, it may be replaced by means for performing one or more of the functions of the structural feature whether that function or those functions are explicitly or implicitly described.
  • The above described examples find application as enabling components of:
    automotive systems; telecommunication systems; electronic systems including consumer electronic products; distributed computing systems; media systems for generating or rendering media content including audio, visual and audio visual content and mixed, mediated, virtual and/or augmented reality; personal systems including personal health systems or personal fitness systems; navigation systems; user interfaces also known as human machine interfaces; networks including cellular, non-cellular, and optical networks; ad-hoc networks; the internet; the internet of things; virtualized networks; and related software and services.
  • The term 'comprise' is used in this document with an inclusive not an exclusive meaning. That is any reference to X comprising Y indicates that X may comprise only one Y or may comprise more than one Y. If it is intended to use 'comprise' with an exclusive meaning then it will be made clear in the context by referring to "comprising only one ..." or by using "consisting".
  • In this description, reference has been made to various examples. The description of features or functions in relation to an example indicates that those features or functions are present in that example. The use of the term 'example' or 'for example' or 'can' or 'may' in the text denotes, whether explicitly stated or not, that such features or functions are present in at least the described example, whether described as an example or not, and that they can be, but are not necessarily, present in some of or all other examples. Thus 'example', 'for example', 'can' or 'may' refers to a particular instance in a class of examples. A property of the instance can be a property of only that instance or a property of the class or a property of a sub-class of the class that includes some but not all of the instances in the class. It is therefore implicitly disclosed that a feature described with reference to one example but not with reference to another example, can where possible be used in that other example as part of a working combination but does not necessarily have to be used in that other example.
  • Although examples have been described in the preceding paragraphs with reference to various examples, it should be appreciated that modifications to the examples given can be made without departing from the scope of the claims.
  • Features described in the preceding description may be used in combinations other than the combinations explicitly described above.
  • Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not.
  • Although features have been described with reference to certain examples, those features may also be present in other examples whether described or not.
  • The term 'a' or 'the' is used in this document with an inclusive not an exclusive meaning. That is any reference to X comprising a/the Y indicates that X may comprise only one Y or may comprise more than one Y unless the context clearly indicates the contrary. If it is intended to use 'a' or 'the' with an exclusive meaning then it will be made clear in the context. In some circumstances the use of 'at least one' or 'one or more' may be used to emphasize an inclusive meaning but the absence of these terms should not be taken to infer any exclusive meaning.
  • The presence of a feature (or combination of features) in a claim is a reference to that feature or (combination of features) itself and also to features that achieve substantially the same technical effect (equivalent features). The equivalent features include, for example, features that are variants and achieve substantially the same result in substantially the same way. The equivalent features include, for example, features that perform substantially the same function, in substantially the same way to achieve substantially the same result.
  • In this description, reference has been made to various examples using adjectives or adjectival phrases to describe characteristics of the examples. Such a description of a characteristic in relation to an example indicates that the characteristic is present in some examples exactly as described and is present in other examples substantially as described.
  • Whilst endeavoring in the foregoing specification to draw attention to those features believed to be of importance it should be understood that the Applicant may seek protection via the claims in respect of any patentable feature or combination of features hereinbefore referred to and/or shown in the drawings whether or not emphasis has been placed thereon.

Claims (15)

  1. An apparatus comprising means for:
    receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
    controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon prioritization of the N audio channels, and wherein the prioritization is adaptive depending at least upon a changing content of one or more of the N audio channels; and
    providing for rendering at least the output audio channel.
  2. An apparatus as claimed in claim 1, comprising means for: automatically controlling mixing of the N audio channels to produce at least the output audio channel, in dependence upon time-variation of content of one or more of the N audio channels.
  3. An apparatus as claimed in claim 1 or 2, wherein the N audio channels are N spatial audio channels where each of the N spatial audio channels can be rendered as a differently positioned audio source.
  4. An apparatus as claimed in any preceding claim, wherein N is at least two and wherein M is one, the output audio channel being a monophonic audio output channel.
  5. An apparatus as claimed in any preceding claim, comprising means for analyzing the N audio channels to adapt a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels.
  6. An apparatus as claimed in any preceding claim, wherein prioritization depends upon one or more of:
    timing of content of at least one of the N audio channels relative to timing of content of at least another one of the N audio channels;
    history of content of at least one of the N audio channels;
    mapping to a particular person, an identified voice in content of at least one of the N audio channels;
    detection that content of at least one of the N audio channels is voice content;
    detection that content of at least one of the N audio channels comprises an identified word.
  7. An apparatus as claimed in any preceding claim, wherein controlling mixing of the N audio channels to produce at least an output audio channel, comprises:
    selecting a first sub-set of the N audio channels to be mixed to provide background audio;
    selecting a second sub-set of the N audio channels to be mixed to provide foreground audio that is for rendering at greater loudness than the background audio, wherein the selection of the first sub-set and selection of the second sub-set is dependent upon the prioritization of the N audio channels; and
    mixing the background audio and the foreground audio to produce the output audio channel.
  8. An apparatus as claimed in any preceding claim, comprising means for controlling mixing of the N audio channels to produce M audio channels in response to a communication bandwidth for receiving the audio channels or for providing output audio signals falling beneath a threshold value.
  9. An apparatus as claimed in any preceding claim, comprising means for controlling mixing of the N audio channels to produce M audio channels when there is conflict between a first audio channel of the N audio channels and a second audio channel of the N audio channels, wherein the first audio channel is included within the M audio channels and the second audio channel is not included within the M audio channels, wherein over-talking is an example of conflict.
  10. An apparatus as claimed in any preceding claim, wherein the audio channels of the N audio channels that are not the selected M audio channels are available for later rendering.
  11. An apparatus as claimed in any preceding claim, comprising a user input interface for controlling prioritization of the N audio channels.
  12. An apparatus as claimed in any preceding claim, comprising a user input interface, wherein the user input interface provides a spatial representation of the N audio channels and indicates which of the N audio channels are comprised in the sub-set of M audio channels.
  13. A multi-party, live communication system that enables live audio communication between multiple remote participants using at least the N audio channels wherein different ones of the multiple remote participants provide audio input for different ones of the N audio channels, wherein the system comprises the apparatus as claimed in any of claims 1 to 12.
  14. A method comprising:
    receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
    controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels; and
    rendering at least the output audio channel.
  15. A computer program that when run on one or more processors enables:
    controlling mixing of N received audio channels, where each of the N audio channels can be rendered as a different audio source, to produce at least an output audio channel for rendering, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels.
EP21154652.8A 2021-02-02 2021-02-02 Selecton of audio channels based on prioritization Withdrawn EP4037339A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP21154652.8A EP4037339A1 (en) 2021-02-02 2021-02-02 Selecton of audio channels based on prioritization


Publications (1)

Publication Number Publication Date
EP4037339A1 true EP4037339A1 (en) 2022-08-03

Family

ID=74505017

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21154652.8A Withdrawn EP4037339A1 (en) 2021-02-02 2021-02-02 Selecton of audio channels based on prioritization

Country Status (1)

Country Link
EP (1) EP4037339A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110040397A1 (en) * 2009-08-14 2011-02-17 Srs Labs, Inc. System for creating audio objects for streaming
US20150049868A1 (en) * 2012-03-23 2015-02-19 Dolby Laboratories Licensing Corporation Clustering of Audio Streams in a 2D / 3D Conference Scene
US20180190300A1 (en) * 2017-01-03 2018-07-05 Nokia Technologies Oy Adapting A Distributed Audio Recording For End User Free Viewpoint Monitoring


Similar Documents

Publication Publication Date Title
US10574828B2 (en) Method for carrying out an audio conference, audio conference device, and method for switching between encoders
EP3282669B1 (en) Private communications in virtual meetings
US9237238B2 (en) Speech-selective audio mixing for conference
CN110072021B (en) Method, apparatus and computer readable medium in audio teleconference mixing system
US20140218464A1 (en) User interface control in a multimedia conference system
EP3111627B1 (en) Perceptual continuity using change blindness in conferencing
EP2378768A1 (en) Multi-channel audio signal processing method, device and system
US20220165281A1 (en) Audio codec extension
US11115444B2 (en) Private communications in virtual meetings
WO2022124040A1 (en) Teleconference system, communication terminal, teleconference method, and program
EP4037339A1 (en) Selecton of audio channels based on prioritization
EP4078998A1 (en) Rendering audio
CN111951821B (en) Communication method and device
JP2009027239A (en) Telecommunication conference apparatus
US11562761B2 (en) Methods and apparatus for enhancing musical sound during a networked conference
EP3031048B1 (en) Encoding of participants in a conference setting
EP4354841A1 (en) Conference calls
JPH1188513A (en) Voice processing unit for inter-multi-point communication controller
WO2021123495A1 (en) Providing a translated audio object
JP2022076189A (en) Voice processing device, voice processing method, voice processing system, and terminal
JP2022093326A (en) Communication terminal, remote conference method, and program

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20230204