EP4037339A1 - Selection of audio channels based on prioritization - Google Patents

Selection of audio channels based on prioritization

Info

Publication number
EP4037339A1
Authority
EP
European Patent Office
Prior art keywords
audio
audio channels
channels
channel
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP21154652.8A
Other languages
German (de)
French (fr)
Inventor
Lasse Juhani Laaksonen
Mikko-Ville Laitinen
Arto Juhani Lehtiniemi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to EP21154652.8A
Publication of EP4037339A1
Legal status: Withdrawn


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 3/00 Systems employing more than two channels, e.g. quadraphonic
    • H04S 3/008 Systems employing more than two channels, e.g. quadraphonic, in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/03 Aspects of down-mixing multi-channel audio to configurations with lower numbers of playback channels, e.g. 7.1 -> 5.1
    • H04S 2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field

Definitions

  • Embodiments of the present disclosure relate to audio. Some enable the distribution of common content for rendering to both advanced audio output devices and less advanced audio output devices.
  • Advanced audio output devices are capable of rendering multiple received audio channels as different spatially positioned audio sources.
  • the spatial separation of audio sources can aid hearing when the sources simultaneously provide sound.
  • Content that is suitable for rendering spatial audio via an advanced audio output device may be unsuitable for a less advanced audio output device and content that is suitable for rendering by a less advanced audio output device may under-utilize the spatial audio capabilities of an advanced audio output device.
  • an apparatus comprising means for:
  • the apparatus comprises means for: automatically controlling mixing of the N audio channels to produce at least the output audio channel, in dependence upon time-variation of content of one or more of the N audio channels.
  • the N audio channels are N spatial audio channels where each of the N spatial audio channels can be rendered as a differently positioned audio source.
  • N is at least two and M is one, the output audio channel being a monophonic audio output channel.
  • the apparatus comprises means for analyzing the N audio channels to adapt a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels.
  • prioritization depends upon one or more of:
  • controlling mixing of the N audio channels to produce at least an output audio channel comprises:
  • the apparatus comprises means for controlling mixing of the N audio channels to produce M audio channels in response to a communication bandwidth for receiving the audio channels or for providing output audio signals falling beneath a threshold value.
  • the apparatus comprises means for controlling mixing of the N audio channels to produce M audio channels when there is conflict between a first audio channel of the N audio channels and a second audio channel of the N audio channels, wherein the first audio channel is included within the M audio channels and the second audio channel is not included within the M audio channels, wherein over-talking is an example of conflict.
  • the audio channels of the N audio channels that are not the selected M audio channels are available for later rendering.
  • the apparatus comprises a user input interface for controlling prioritization of the N audio channels.
  • the apparatus comprises a user input interface, wherein the user input interface provides a spatial representation of the N audio channels and indicates which of the N audio channels are comprised in the sub-set of M audio channels.
  • a multi-party, live communication system that enables live audio communication between multiple remote participants using at least the N audio channels wherein different ones of the multiple remote participants provide audio input for different ones of the N audio channels, wherein the system comprises the apparatus.
  • the set of N audio channels is referenced using reference number 20.
  • Each audio channel of the set of N audio channels is referenced using reference number 20 i , where i is 1, 2,...N-1, N.
  • the apparatus 10 comprises means for receiving at least N audio channels 20 where each of the N audio channels 20 i can be rendered as a different audio source.
  • the apparatus 10 comprises means 40, 50 for controlling selection and mixing of the N audio channels 20 to produce at least an output audio channel 52.
  • a selector 40 selects for mixing (to produce the output audio channel 52) a sub-set 30 of M audio channels from the N audio channels 20.
  • the selection is dependent upon prioritization 32 of the N audio channels 20.
  • the prioritization 32 is adaptive depending at least upon a changing content 34 of one or more of the N audio channels 20.
  • the sub-set 30 of M audio channels is referenced using reference number 30.
  • Each audio channel of the sub-set of M audio channels is referenced using reference number 20 j , where j is any M of the N values of i.
  • the sub-set 30 can, for example, be varied by changing the value of M and/or by changing which audio channels 20 j are used to comprise the M audio channels of the sub-set 30.
  • different sub-sets 30 can, in some examples, be differentiated using the same reference 30 with different numeric sub-scripts.
  • a mixer 50 mixes the sub-set 30 of M audio channels to produce the output audio channel 52 which is suitable for rendering.
  • An advanced spatial audio output device (an example is illustrated at FIG 11A ) can render the N audio channels 20 as multiple different spatially positioned audio sources.
  • a less advanced audio output device (an example is illustrated at FIG 11B ) can render the output audio channel 52.
  • the apparatus 10 therefore allows a common content, the N audio channels 20, to provide audio output at both the advanced spatial audio output device and the less advanced audio output device.
  • FIG. 1 illustrates an example of an apparatus 10 for providing an output audio channel 52 for rendering.
  • the rendering of the output audio channel 52 can occur at the apparatus 10 or can occur at some other device.
  • the apparatus 10 receives at least N audio channels 20.
  • An audio channel 20 i of the N audio channels 20 can be rendered as a distinct audio source.
  • the apparatus 10 comprises a mixer 50 for mixing a sub-set 30 of M audio channels, selected from the N audio channels 20, to produce at least an output audio channel 52.
  • a selector 40 selects for mixing, at mixer 50, the sub-set 30 of M audio channels from the N audio channels 20.
  • the selection, by the selector 40 is dependent upon prioritization 32 of the N audio channels 20.
  • the prioritization 32 is adaptive depending at least upon a changing content 34 of one or more of the N audio channels 20.
  • the apparatus 10 provides, from the mixer 50, the output audio channel 52 for rendering.
  • the sub-set 30 of M audio channels has fewer audio channels than the N audio channels 20, that is, M is less than N.
  • N is at least two and in at least some examples is greater than 2.
  • M is one and the output audio channel 52 is a monophonic audio output channel.
  • the prioritization 32 is adaptive. The prioritization 32 depends at least on a changing content 34 of one or more of the N audio channels 20.
  • the apparatus 10 is configured to automatically control the mixing of the N audio channels 20 to produce at least the output audio channel 52, in dependence upon time-variation of content 34 of one or more of the N audio channels 20.
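  • As an illustration only (not taken from the patent text), the select-then-mix behaviour of FIG. 1 can be sketched in a few lines of Python; the function name select_and_mix, the use of numpy arrays and the equal-gain downmix are assumptions made for the example.

    import numpy as np

    def select_and_mix(channels, priorities, m=1):
        # channels: list of N numpy arrays (the N audio channels 20)
        # priorities: one adaptive priority value per channel (the prioritization 32)
        # m: number of channels in the selected sub-set 30
        order = np.argsort(priorities)[::-1]          # highest priority first
        selected = [channels[i] for i in order[:m]]   # sub-set 30 of M audio channels
        mix = np.mean(selected, axis=0)               # mixer 50: equal-gain downmix
        peak = float(np.max(np.abs(mix)))
        return mix / peak if peak > 0 else mix        # output audio channel 52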
  • FIG. 2 illustrates an example of an apparatus 10 in which an analyzer 60 is configured to analyze the N audio channels 20 to adapt the prioritization 32 of the N audio channels 20 in dependence upon, at least, changing content 34 of one or more of the N audio channels 20.
  • the analysis can be performed before (or simultaneously with) the previously mentioned selection.
  • the analyzer 60 is configured to process metadata associated with the N audio channels 20. Additionally or alternatively, in some examples, the analyzer 60 is configured to process the audio content of the audio channels 20. This processing could, for example, comprise voice activity detection, voice recognition processing, spectral analysis, semantic processing of speech or other processing including machine learning and artificial intelligence processing used to identify characteristics of the content 34 of one or more of the N audio channels 20.
  • the prioritization 32 can depend upon one or more parameters of the content 34.
  • the prioritization 32 depends upon timing of content 34 i of an audio channel 20 i relative to timing of content 34 j of an audio channel 20 j .
  • the audio channel 20 that first satisfies a trigger condition has temporal priority.
  • the trigger condition may be that the audio channel 20 has activity above a threshold, and/or has activity above a threshold in a particular spectral range, and/or has voice activity, and/or has voice activity associated with a specific person, and/or the voice activity comprises semantic content including a particular keyword or phrase.
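  • A hedged sketch of the temporal-priority rule above: the channel that first (in time) satisfies a simple trigger condition, here frame energy above a threshold, gains priority. The frame layout, threshold value and function name are illustrative assumptions; a voice-activity or keyword detector could replace the energy test.

    import numpy as np

    def temporal_priority(channel_frames, energy_threshold=1e-3):
        # channel_frames: numpy array of shape (n_channels, n_frames, frame_len)
        n_channels, n_frames, _ = channel_frames.shape
        for t in range(n_frames):                     # scan frames in time order
            for ch in range(n_channels):
                energy = float(np.mean(channel_frames[ch, t] ** 2))
                if energy > energy_threshold:
                    return ch                         # first channel to satisfy the trigger
        return None                                   # no channel satisfied the trigger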
  • An initial prioritization 32 can cause an initial selection of a first sub-set 30 1 of audio channels 20 that are mixed to form the output audio channel 52.
  • a change in prioritization 32 can cause a new selection of a second different sub-set 30 2 of audio channels 20 that are mixed to form a new, different output audio channel 52.
  • the first sub-set 30 1 and the second sub-set 30 2 are not equal sets.
  • apparatus 10 can prioritize one or more of the N audio channels 20 as a sub-set 30 until a new selection by the selector 40 based on a new prioritization 32 changes the sub-set 30.
  • For example, a person may be speaking in a first audio channel; that channel may be prioritized ahead of a second audio channel. However, if the person speaking in the first audio channel stops speaking then the prioritization 32 of the audio channels can change and there can be a consequential reselection at the selector 40 of the sub-set 30 of M audio channels provided for mixing to produce the output audio channel 52.
  • the apparatus 10 can flag at least one input audio channel 20 corresponding to a first active talker, or generally active content 34, during a selection period and prioritize this selection over other audio channels 20.
  • the apparatus 10 can determine whether the active talker continues before introducing content 34 from non-prioritized channels to the mixed output audio channel 52. The introduction of such additional content 34 from non-prioritized channels is controlled by the selector 40 during a following selection period.
  • non-prioritized audio channels 20 can be completely omitted from the mixed output audio channel 52 and thus the mixed output audio channel 52 will contain only the prioritized channel(s).
  • the non-prioritized channels can be mixed with a lower gain or higher attenuation than the prioritized channel and/or with other suitable processing to produce the output audio channel 52.
  • a history of content 34 of at least one of the N audio channels 20 can be used to control the prioritization 32.
  • the selector 40, in selecting which of the N audio channels 20 to mix to produce the output audio channel 52, can, for example, use decision thresholds for selection.
  • a decision threshold can be changed over time and can be dependent upon a history of the content 34.
  • different decision thresholds can be used for different audio channels 20.
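  • One possible (assumed, not claimed) realisation of history-dependent, per-channel decision thresholds is simple hysteresis: the currently selected channel keeps a lower activity threshold so that the selection does not flip on a brief pause.

    def per_channel_thresholds(base_threshold, selected_channel, n_channels, hysteresis=0.5):
        # Lower the decision threshold for the currently selected channel (hysteresis);
        # keep the base threshold for every other channel.
        return [base_threshold * hysteresis if ch == selected_channel else base_threshold
                for ch in range(n_channels)]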
  • the prioritization 32 can be dependent upon mapping to a particular person an identified voice in content 34 of at least one of the N audio channels 20.
  • the analyzer 60 can for example perform voice recognition based upon the content 34 of one or more of the N audio channels 20.
  • the analyzer 60 can identify a particular person based upon metadata comprised within the content 34 of at least one of the N audio channels 20. It may therefore be possible to identify a particular one of the N audio channels 20 as relating to a person whose contribution it is particularly important to hear such as, for example, a chairman of a meeting.
  • the analyzer 60 is configured to adapt the prioritization 32 when the presence of voice content is detected within the content 34 of at least one of the N audio channels 20.
  • the analyzer 60 is able to prioritize the spoken word within the output audio channel 52. It is also possible to adapt the analyzer 60 to prioritize other types of content.
  • the analyzer 60 is configured to adapt the prioritization 32 based upon detection that content 34 of at least one of the N audio channels 20 comprises an identified keyword.
  • the analyzer 60 can, for example, listen to the content 34 and identify within the stream of content a keyword or identify semantic meaning within the stream of content. This can be used to modify the prioritization 32. For example, it may be desirable for a consumer of the output audio channel 52 to have that output audio channel 52 personalized so that if one of the N audio channels 20 comprises content 34 that includes the consumer's name or other information associated with the consumer then that audio channel 20 is prioritized by the analyzer 60.
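  • A minimal sketch, assuming speech-to-text output is available for each channel, of the keyword-based personalisation described above: any channel whose transcript contains one of the consumer's keywords (for example their name) receives a priority boost. The names and the boost value are illustrative only.

    def boost_for_keywords(priorities, transcripts, keywords, boost=10.0):
        # priorities: current priority per channel; transcripts: recognised text per channel
        boosted = list(priorities)
        for i, text in enumerate(transcripts):
            if any(keyword.lower() in text.lower() for keyword in keywords):
                boosted[i] += boost                   # favour channels that mention a keyword
        return boosted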
  • the N audio channels 20 can represent live content.
  • the analysis by the analyzer 60, the selection by the selector 40 and the mixing by the mixer 50 can occur in real time such that the output audio channel 52 is also live.
  • FIG. 3 illustrates an example of the apparatus of FIG. 1 in more detail.
  • the mixing is a weighted mixing in which different sub-sets of the sub-set 30 of selected audio channels are weighted with different attenuation/gain before being finally mixed to produce the output audio channel 52.
  • the selector 40 selects a first sub-set SS1 of the M audio channels to be mixed to provide background audio B and selects a second sub-set SS2 of the M audio channels 20 to be mixed to provide foreground audio F that is for rendering at greater loudness than the background audio B.
  • the selection of the first sub-set SS1 and the selection of the second sub-set SS2 is dependent upon the prioritization 32 of the N audio channels 20.
  • the first sub-set SS1 of audio channels 20 is mixed 50 1 to provide background audio B which is then amplified/attenuated G1 to adjust the loudness of the background audio before it is provided to the mixer 50 3 for mixing to produce the output audio channel 52.
  • the second sub-set SS2 of audio channels 20 is mixed 50 2 to provide foreground audio F which is then amplified/attenuated G2 to adjust the loudness of the foreground audio before it is provided to the mixer 50 3 for mixing to produce the output audio channel 52.
  • the gain/attenuation G2 applied to the foreground audio F makes it significantly louder than the background audio B in the output audio channel 52. In some situations, the foreground audio F is naturally louder than background audio B. Thus, it can be but need not be that G2 > G1.
  • the gain/attenuation G1, G2 can, in some examples, vary with frequency.
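  • The weighted foreground/background mixing of FIG. 3 could be sketched as below; the gain values and the equal-gain sub-mixes are assumptions, and in practice G1 and G2 could be frequency-dependent filters rather than scalars.

    import numpy as np

    def foreground_background_mix(background_channels, foreground_channels, g1=0.2, g2=1.0):
        # mixer 50 1: mix the first sub-set SS1 to background audio B
        b = np.mean(background_channels, axis=0) if background_channels else 0.0
        # mixer 50 2: mix the second sub-set SS2 to foreground audio F
        f = np.mean(foreground_channels, axis=0) if foreground_channels else 0.0
        # gains G1, G2 and mixer 50 3 produce the output audio channel 52
        return g1 * b + g2 * f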
  • FIG. 4 illustrates an example of a multi-party, live communication system 200 that enables live audio communication between multiple remote participants A i , B, C, D i using at least the N audio channels 20. Different ones of the multiple remote participants A i , B, C, D i provide audio input for different ones of the N audio channels 20.
  • the system 200 comprises input end-points 206 for capturing audio channels 20.
  • the system 200 comprises output end-points 204 for rendering audio channels.
  • One or more output end-points 204 s are configured for rendering spatial audio as distinct rendered audio sources.
  • One or more output end-points 204 m are not configured for rendering spatial audio.
  • the N audio channels 20 are N spatial audio channels where each of the N spatial audio channels is captured as a differently positioned captured audio source, and can be rendered using spatial audio as a differently positioned rendered audio source.
  • the captured audio source can be fixed or can move, for example, with movement of the input end-point 206.
  • the rendered audio source can either be fixed or can move, for example, in a manner corresponding to the moving input end-point 206.
  • the system 200 is for enabling immersive teleconferencing or telepresence for remote terminals.
  • the different terminals have varying device capabilities and different (and possibly variable) network conditions.
  • Spatial/immersive audio refers to audio that typically has a three-dimensional space representation or is presented (rendered) to a participant with the intention of the participant being able to hear a specific audio source from a specific direction.
  • Some of the participants share a room. For example, participants A 1 , A 2 , A 3 , A 4 share the room A and the participants D 1 , D 2 , D 3 , D 4 , D 5 share the room D.
  • Some of the terminals can be characterized as "advanced spatial audio output devices" that have an output end-point 204 s that is configured for spatial audio. However, some of the terminals are less advanced audio output devices that have an output end-point 204 m that is not configured for spatial audio.
  • the voices of the participants A i , B, C, D i are spatially separated.
  • the voices may, for example, have fixed spatial positions relative to each other or the directions may be adaptive, for example, according to participant movements, conference bridge settings or based upon inputs by participants.
  • a similar experience is available to the participants who are using the output end-points 204 s and they have the ability to interact much more naturally than in traditional voice calls and voice conferencing. For example, they can talk at the same time and still understand each other thanks to effects such as the well-known cocktail party effect.
  • each of the respective participants A i , D i has a personal input end-point 206 which captures a personal captured audio source as a personal audio channel 20.
  • the personal input end-point 206 can, for example, be provided by a directional microphone or by a Lavalier microphone.
  • the participants B and C each have a single personal input end-point 206 which captures a personal audio channel 20.
  • the output end-points 204 s are configured for spatial audio.
  • each room can have a surround sound system as an output end-point 204 s .
  • An output end point 204 s is configured to render each captured sound source represented by an audio channel 20 as a rendered sound source.
  • each participant A i , B, C has a personal output audio channel 20.
  • Each personal output audio channel 20 is rendered from a different location as a different rendered audio source.
  • the collection of rendered audio sources associated with the participants A i creates a virtual room A.
  • each participant D i , B, C has a personal output audio channel 20.
  • Each personal output audio channel 20 is rendered from a different location as a different rendered sound source.
  • the collection of the rendered audio sources associated with the participants D i creates a virtual room D.
  • the output end-point 204 s is configured for spatial audio.
  • An output end point 204 s is configured to render each captured sound source represented by an audio channel 20 as a rendered sound source.
  • the participant C has an output end-point 204 s that is configured for spatial audio.
  • the participant C is using a headset configured for binaural spatial audio that is suitable for virtual reality (VR).
  • Binauralization methods can be used to render personal audio channels 20 as spatially positioned rendered audio sources,
  • Each participant Ai, Di, B has a personal output audio channel 20.
  • Each personal output audio channel 20 is or can be rendered from a different location as a different rendered sound source.
  • the participant B has an output end-point 204 m that is not configured for spatial audio. In this example it is a monophonic output end-point.
  • the participant B is using a mobile device (e.g. a mobile phone) to provide the input end-point 206 and the output end-point 204 m .
  • the mobile device has a single output end-point 204 m which provides the output audio channel 52 as previously described.
  • the processing to produce the output audio channel 52 can be performed at the mobile device of the participant B or at the server 202.
  • the mono-capability limitation of participant B can, for example, be caused by the device, for example because it is only configured for decoding of mono audio, or because of the available audio output facilities such as a mono-only earpiece or headset.
  • Each of the input end-points 206 is rendered in spatial audio as a spatially distinct rendered audio source. However, in other examples multiple ones of the input end-points 206 may be mixed together to produce a single rendered audio source. This can be used to reduce the number of rendered audio sources using spatial audio. Therefore, in some examples, a spatial audio device may render multiple ones of output audio channels 52.
  • In FIG. 4 a star topology similar to that illustrated in FIG. 5A is used.
  • the central server 202 interconnects the input end-points 206 and the output end-points 204.
  • the input end-points 206 provide the N audio channels 20 to a central server 202 which produces the output audio channel 52 as previously described to the output end-point 204 m .
  • the apparatus 10 is located in the central server 202, however, in other examples the apparatus 10 is located at the output end-point 204 m .
  • FIG. 5B illustrates an alternative topology in which there is no centralized architecture but a peer-to-peer architecture.
  • the apparatus 10 is located at the output end-point 204m.
  • the 3GPP IVAS codec is an example of a voice and audio communications codec for spatial audio.
  • the IVAS codec is an extension of the 3GPP EVS codec and is intended for new immersive voice and audio services over 4G and 5G.
  • Such immersive services include, for example, immersive voice and audio for virtual reality (VR).
  • the multi-purpose audio codec is expected to handle encoding, decoding and rendering of speech, music and generic audio. It is expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. It is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
  • the audio channels 20 can, for example, be coded/decoded using the 3GPP IVAS codec.
  • the spatial audio channels 20 can, for example, be provided as metadata-assisted spatial audio (MASA), object-based audio, channel-based audio (5.1, 7.1+4), non-parametric scene-based audio (e.g. First Order Ambisonics, Higher Order Ambisonics) and any combination of these formats. These audio formats can be binauralized for headset listening such that a participant can hear the audio sources outside their head.
  • metadata-assisted spatial audio (MASA)
  • object-based audio
  • channel-based audio (e.g. 5.1, 7.1+4)
  • non-parametric scene-based audio (e.g. First Order Ambisonics, Higher Order Ambisonics)
  • the apparatus 10 provides a better experience, including improved intelligibility for a mono user participating in a spatial audio teleconference with several potentially overlapping spatial audio inputs.
  • the apparatus 10 means that it is not necessary, in some cases, to simplify the spatial audio conference experience for the spatial audio users due to having a mono-audio participant.
  • a mono user can participate in a spatial audio conference without compromising the experience of the other users.
  • FIGS 6, 7, 8 and 9A illustrate examples of an apparatus 10 that comprises a controller 70.
  • the controller receives N audio channels 20 and performs control processing to select the sub-set 30 of M audio channels.
  • the controller 70 comprises the selector 40 and, optionally, the analyzer 60.
  • the mixer 50 is present but not illustrated.
  • the controller 70 is configured to control mixing of the N audio channels 20 to produce the sub-set 30 of M audio channels when a conflict between a first audio channel of the N audio channels 20 and a second audio channel of the N audio channels occurs.
  • the control can cause the first audio channel 20 to be included within the sub-set 30 of M audio channels and cause the second audio channel 20 not to be included within the sub-set 30 of M audio channels.
  • the second audio channel is included within the sub-set 30 of M audio channels.
  • One example of when there is conflict between audio channels is when there is simultaneous activity from different prioritized sound sources.
  • over-talking (simultaneous speech) is an example of such a conflict.
  • the prioritization 32 used for the selection of audio channels to form the sub-set 30 of M audio channels depends upon timing of content 34 of at least one of the N audio channels 20 relative to timing of content 34 of at least another one of the N audio channels 20.
  • the later speech by participants 4 and 5 is not selected for inclusion within the sub-set 30 of audio channels used to form the output audio channel 52.
  • the audio channel 20 3 preferentially remains prioritized and remains included within the output audio channel 52, while there is voice activity in the audio channel 20 3 , whereas the audio channels 20 4 , 20 5 are excluded. If voice activity is no longer detected in the audio channel 20 3 then in some examples a selection process may immediately change the identity of the audio channel 20 selected for inclusion within the output audio channel 52. However, in other examples there can be a selection grace period. During this grace period, there can be a greater likelihood of selection/reselection of the original selected audio channel 20 3 . Thus, during the grace period prioritization 32 is biased in favor of the previously selected audio channel.
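  • The selection grace period described above could, for example, be implemented as a time-limited bias toward the previously selected channel, as in this assumed sketch (the class name, timing source and bias values are illustrative, not the patent's method).

    import time

    class GracePeriodSelector:
        def __init__(self, grace_period=2.0, bias=5.0):
            self.grace_period = grace_period          # seconds of bias after activity ends
            self.bias = bias                          # extra priority for the previous selection
            self.last_selected = None
            self.last_active_time = 0.0

        def select(self, priorities, active):
            # priorities: priority per channel; active: voice-activity flag per channel
            now = time.monotonic()
            biased = list(priorities)
            if self.last_selected is not None:
                if active[self.last_selected]:
                    self.last_active_time = now
                if now - self.last_active_time < self.grace_period:
                    biased[self.last_selected] += self.bias   # favour the previous channel
            if not any(active):
                return self.last_selected if self.last_selected is not None else 0
            self.last_selected = max(range(len(biased)),
                                     key=lambda i: biased[i] if active[i] else float("-inf"))
            return self.last_selected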
  • prioritization 32 used for the selection depends upon a history of content 34 of at least one of the N audio channels 20.
  • the prioritization 32 used for the selection can depend upon mapping to a particular person (an identifiable human), an identified voice in content 34 of at least one of the N audio channels 20.
  • a voice can be identified using metadata or by analysis of the content 34. The prioritization 32 would more favorably select the particular person's audio channel 20 for inclusion within the output audio channel 52.
  • the particular person could, for example, be based upon service policy.
  • a teleconference service may have a moderator or chairman role and this participant may for example be made audible to all participants or may be able to force themselves to be audible to all participants.
  • the particular person could for example be indicated by a user consuming the output audio channel 52. That consumer could for example indicate which of the other participants' content 34 or audio channels 20 they wish to consume. This audio channel 20 could then be included, or be more likely to be included, within the output audio channel 52.
  • the inclusion of the user-selected audio channel 20 can for example be dependent upon voice activity within the audio channel 20, that is, the user-selected audio channel 20 is only included if there is active voice activity within that audio channel 20.
  • the prioritization 32 used for the selection therefore strongly favors the user-selected audio channel 20.
  • the selection by the consumer of the output audio channel 52 of a particular audio channel 20 can for example be based upon an identity of the participant who is speaking or should speak in that audio channel. Alternatively, it could be based upon a user-selection of that audio channel because of the content 34 rendered within that audio channel.
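  • Under the assumption of a simple numeric priority scheme, the user-selected priority described above might look like the sketch below: the channel chosen by the consumer is strongly boosted, but only while it has voice activity.

    def apply_user_selection(priorities, voice_active, user_choice, strong_boost=100.0):
        # user_choice: index of the audio channel picked by the consumer, or None
        boosted = list(priorities)
        if user_choice is not None and voice_active[user_choice]:
            boosted[user_choice] += strong_boost      # strongly favour the user-selected channel
        return boosted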
  • FIG. 7 illustrates an example similar to FIG. 6 .
  • the audio channels 20 include a mixture of different audio types.
  • the audio channel 20 3 associated with participant3 is predominantly a voice channel.
  • the audio channels 20 4 , 20 5 associated with participants 4 and 5 are predominantly instrumental/music channels.
  • the selection of which of the audio channels 20 is to be included within the output audio channel 52 can be based upon the audio type present within the audio channel 20.
  • the detection of the audio type within the audio channel 20 can for example be achieved using metadata or, alternatively, by analyzing the content 34 of the audio channel 20.
  • the prioritization 32 used for selection can be dependent upon detection that content 34 of at least one of the N audio channels 20 is voice content.
  • the output audio channel 52 can switch between the inclusion of different audio channels 20 in dependence upon which of them includes active voice content. In this way priority can be given to spoken language.
  • the other channels for example the music channels 20 4 , 20 5 may optionally be included, for example as background audio as previously described with relation to FIG. 3 .
  • the apparatus 10 deliberately loses information by excluding (or diminishing) audio channels 20 with respect to the output audio channel 52.
  • Information is generally lost by the selective downmixing which is required to maintain or guarantee intelligibility. It is, however, possible for there to be two simultaneously important audio channels 20, only one of which is selected for inclusion in the output audio channel 52.
  • the apparatus illustrated in FIG. 8 addresses this issue.
  • the apparatus 10 illustrated is similar to that illustrated in FIGS 6 and 7 . However, it additionally comprises a memory 82 for storage of a further sub-set 80 of the N audio channels 20 that is different to the sub-set 30 of M audio channels.
  • the later rendering may be at a faster playback rate and that playback may be fixed or may be adaptive.
  • the sub-set 80 of audio channels is mixed to form an alternative audio output channel for storage in the memory 82.
  • At least some of the audio channels of the N audio channels that are not selected to be in the sub-set 30 of M audio channels are stored in memory 82 for later rendering.
  • there is selection of a first sub-set 30 of M audio channels from the N audio channels 20 based upon prioritization 32 of the N audio channels.
  • the first sub-set 30 of M audio channels is mixed to produce a first output audio channel 52.
  • the second sub-set 80 of audio channels is mixed to produce a second output audio channel for storage.
  • the audio channel 20 3 includes content 34 comprising voice content from a single participant, and it is selected for inclusion within the sub-set 30 of audio channels. It is used to produce the output audio channel 52.
  • the audio channels 20 4 , 20 5 which have not been included within the output audio channel 52, or included only as background (as described with reference to FIG. 3 ), are selected for mixing to produce the second output audio signal that is stored in memory 82.
  • FIG. 10 illustrates an example of how such an indication may be provided to the consumer of the output audio channel 52. Fig 10 is described in detail later.
  • An apparatus 10 may switch to the stored audio channel and play that back at a higher speed. For example, the apparatus 10 can monitor the typical length of inactivity in the preferred output audio channel 52 and adjust the speed of playback for the stored audio channel such that the relevant portions can be played back during a typical inactive period.
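  • A small example of the adaptive catch-up playback mentioned above, assuming the apparatus tracks how much audio has been stored and how long a typical pause in the primary channel lasts; the clamping limits are arbitrary.

    def catch_up_rate(stored_seconds, typical_pause_seconds, min_rate=1.0, max_rate=2.0):
        # Choose a playback rate so the stored audio roughly fits a typical inactive period.
        if typical_pause_seconds <= 0:
            return max_rate
        rate = stored_seconds / typical_pause_seconds
        return min(max(rate, min_rate), max_rate)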
  • FIG. 9A illustrates an example in which the apparatus 10 detects that content 34 of at least one of the N audio channels 20 comprises an identified keyword and adapts the prioritization 32 accordingly.
  • the prioritization 32 in turn controls selection of which of the audio channels 20 are included in the sub-set 30 and the output audio channel 52 (and, if implemented, the stored alternative audio channel).
  • the participant 'User 3' is speaking first and has priority. Therefore, the audio channel 20 3 associated with the User 3 is initially selected as the priority audio channel and is included within the sub-set 30 used to produce the output audio channel 52. Even though the participant 'User 5' begins to talk, the prioritization is not changed and the audio channel 20 3 remains the priority audio channel included within the sub-set 30 and the output audio channel 52.
  • an identified keyword is then detected within the content 34 of the audio channel 20 5 associated with the User 5, for example the name of the consumer of the output audio channel 52.
  • This event causes a switch in the prioritization of the audio channels 20 3 , 20 5 such that the audio channel 20 5 becomes prioritized and included in the sub-set 30 and the output audio channel 52 and the audio channel 20 3 becomes de-prioritized and excluded from the sub-set 30 and the output audio channel 52.
  • the consumer of the output audio channel 52 can via user input settings control the likelihood of a switch when a keyword is mentioned within an audio channel 20.
  • the consumer of the output audio channel 52 can, for example, require a switch if a keyword is detected.
  • the likelihood of a switch can be increased.
  • the occurrence of a keyword can increase the prioritization of an audio channel 20 such that it is stored, for example as described in relation to FIG. 8 .
  • the detection of a keyword may provide an option to the consumer of the output audio channel 52, to enable the consumer to cause a change in the audio channel 20 included within the sub-set 30 and the output audio channel 52. For example, if the name of the consumer of the output audio channel 52 is included within an audio channel 20 that is not being rendered, as a priority, within the output audio channel 52 then the consumer of the output audio channel 52 can be presented with an option to change prioritization 32 and switch to using a sub-set 30 and output audio channel 52 that includes the audio channel 20 in which their name was detected.
  • the new output audio channel 52 based on the detected keyword may be played back from the occurrence of the detected keyword.
  • the playback is at a faster rate to allow a catch-up with real time.
  • FIG. 10 illustrates an example in which a consumer of the output audio channel 52 is provided with information to allow that consumer to make an informed decision to switch audio channels 20 included within the sub-set 30 and the output audio channel 52.
  • some form of indication is given to indicate a change in activity status. For example, if a particular participant begins to talk or there is a second separate discussion ongoing, the consumer of the original output audio channel 52 is made aware of this.
  • a suitable indicator could for example be an audible indicator that is added to the output audio channel 52.
  • each participant may have an associated different tone and a beep with a particular tone may indicate which participant has begun to speak.
  • an indicator could be a visual indicator in a user input interface.
  • the background audio is adapted to provide an audible indication.
  • the consumer listening to the output audio channel 52 hears the audio channel 20 1 associated with a first participant's voice (User A voice).
  • a second audio channel 20 is mixed with the audio channel 20 1 , then it may, for example, be an audio channel 20 2 that captures the ambient audio of the first participant (User A ambience).
  • a second participant, User B begins to talk. This does not initiate a switch of prioritization 32 sufficient to change the sub-set 30.
  • the primary audio channel 20 in the sub-set 30 and the output audio channel 52 remains the audio channel 20 1 .
  • an indication is provided to indicate to the consumer of the output audio channel 52 that there is an alternative, available, audio channel 20 3 .
  • the indication is provided by mixing the primary audio channel 20 1 with an additional audio channel 20 associated with the User B.
  • the additional audio channel 20 can be an attenuated version of the audio channel 20 3 or can be an ambient audio channel 20 4 for the User B (User B ambience).
  • the second audio channel 20 2 is replaced by the additional audio channel 20 4 .
  • the consumer of the output audio channel 52 can then decide whether or not they wish to cause a change in the prioritization 32 to prioritize the audio channel 20 3 associated with the User B above the audio channel 20 1 associated with the User A. If this change in prioritization occurs then there is a switch in the primary audio channel within the sub-set 30 and the output audio channel 52 from being the audio channel 20 1 to being the audio channel 20 3 . In the example illustrated, the consumer does not make this switch. The switch does however occur automatically when the User A stops talking at time T2.
  • the background audio B can be included and/or varied as an indication to the consumer of the output audio channel 52 that an alternative audio channel 20 is available for selection.
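  • The audible indication of FIG. 10 can be sketched as mixing a strongly attenuated version of the newly active channel (or of its ambience channel) into the output; the gain value below is an assumption.

    import numpy as np

    def mix_with_indication(primary, alternative=None, indication_gain=0.15):
        # primary: the prioritized audio channel; alternative: a newly active channel or its ambience
        output = np.copy(primary)
        if alternative is not None:
            output = output + indication_gain * alternative   # quiet cue that another channel is available
        return output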
  • FIG. 11A schematically illustrates audio rendered to a participant (User 5) at an output end-point 204 s of the system 200 (not illustrated) that is configured for rendering spatial audio.
  • the audio output at the end-point 204 s has multiple rendered sound sources associated with audio channels 20 1 , 20 2 , 20 3 , 20 4 at different locations.
  • FIG. 11A illustrates that even with the presence in the system 200 (not illustrated) of an output end-point 204 m ( FIG 11B ) that is not configured for spatial audio rendering, there may be no need to reduce the immersive capabilities or experience at the output end-points 204 s of the system 200 that are configured for rendering spatial audio.
  • FIG. 11B schematically illustrates audio rendered to a participant (User 1) at an output end-point 204 m of the system 200 (not illustrated) that is not configured for rendering spatial audio.
  • the audio output at the end-point 204 m provided by the output audio channel 52 has a single monophonic output audio channel 52 that is based on the sub-set 30 of selected audio channels 20 and has good intelligibility.
  • the audio channel 20 2 is the primary audio channel that is included in the sub-set 30 and the output audio channel 52.
  • the apparatus 10 can be configured to automatically switch the composition of the audio channels 20 mixed to form the output audio channel 52 in dependence upon an adaptive prioritization 32. Additionally or alternatively, in some examples, the switching can be effected manually by the consumer at the end-point 204 m using a user interface which includes a user input interface 90.
  • the device at the output end-point 204 m , which in some examples may be the apparatus 10, comprises a user input interface 90 for controlling prioritization 32 of the N audio channels 20.
  • the user input interface 90 can be configured to highlight or label selected ones of the N audio channels 20 for selection.
  • the user input interface 90 can be used to control if and to what extent manual or automatic switching occurs to produce the output audio channel 52 from selected ones of the audio channels 20.
  • An adaptation of the prioritization 32 can cause an automatic switching or can cause a prompt to a consumer for manual switching.
  • the user input interface 90 can control if and the extent to which prioritization 32 depends upon one or more of timing of content 34 of at least one of the N audio channels 20 relative to timing of content 34 of at least another one of the N audio channels 20; history of content 34 of at least one of the N audio channels 20; mapping to a particular person an identified voice in content 34 of at least one of the N audio channels 20; detection that content 34 of at least one of the N audio channels 20 is voice content; and/or detection that content 34 of at least one of the N audio channels comprises an identified word.
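  • The prioritization controls listed above could, for example, be modelled as a small settings object edited through the user input interface 90; the field names below are assumptions, not the patent's interface.

    from dataclasses import dataclass

    @dataclass
    class PrioritizationSettings:
        use_relative_timing: bool = True       # timing of content relative to other channels
        use_content_history: bool = True       # history of content of a channel
        use_speaker_identity: bool = False     # mapping an identified voice to a particular person
        use_voice_detection: bool = True       # detection that content is voice content
        keywords: tuple = ()                   # identified words that raise a channel's priority
        automatic_switching: bool = True       # switch automatically or prompt the consumer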
  • an option 91 4 that allows the participant, User 1, to select the audio channel 20 4 as a replacement primary audio channel that is included in the sub-set 30 and the output audio channel 52 instead of the audio channel 20 2 .
  • an option 91 3 that allows User 1 to select the audio channel 20 3 as a replacement primary audio channel that is included in the sub-set 30 and the output audio channel 52 instead of the audio channel 20 2 .
  • the user input interface 90 can provide a visual spatial representation of the N audio channels 20 and indicate which of the N audio channels 20 are comprised in the sub-set 30 of M audio channels.
  • the user input interface 90 can also indicate which of the N audio channels are not comprised in the sub-set 30 of M audio channels and which, if any, of these are active.
  • the user input interface 90 may provide textual information about an audio channel 20 that is active and available for selection.
  • speech-to-text algorithms may be utilized to convert speech within that audio channel 20 into an alert displayed at the user input interface 90.
  • the apparatus 10 may be configured to cause the user input interface 90 to provide an option to a consumer of the output audio channel 52 that enables that consumer to switch audio channels 20 included within the sub-set 30 and output audio channel 52.
  • the keyword is "Dave” and the textual output provided by the user input interface 90 could, for example, say "option to switch to User 5 who addressed you and said: 'In our last teleco Dave made an interesting'".
  • the sub-set 30 and the output audio channel 52 then includes the audio channel 20 5 from the User 5 and starts from the position "In our last teleco Dave made an interesting".
  • a memory 82 could be used to store the audio channel 20 5 from the User 5.
  • the apparatus 10 can be permanently operational to perform the selection of the sub-set 30 of audio channels 20 used to produce the output audio channel 52.
  • the apparatus 10 has a state in which it is operational in this way and a state in which it is not operational in this way, and it can transition between these states, for example when a trigger event is or is not detected.
  • the apparatus 10 can be configured to control a mixer 50 to mix the N audio channels 20 to produce M audio channels in response to a trigger event.
  • One example of a trigger event is conflict between audio channels 20.
  • An example of detecting conflict would be when there is overlapping speech in audio channels 20.
  • a trigger event is a reduction in communication bandwidth for receiving the audio channels 20 below a threshold value.
  • the value of M can be dependent upon the available bandwidth.
  • a trigger event is a reduction in communication bandwidth for providing the output audio channel 52 beneath a threshold value.
  • the value of M can be dependent upon the available bandwidth.
  • the apparatus 10 can also be configured to control the transmission of audio channels 20 to it, and reduce the number of audio channels received from N to M (a reduction of N-M), wherein only the M audio channels that may be required for mixing to produce the output audio channel 52 are received.
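  • As a rough illustration of making M depend on the available bandwidth, one could budget a fixed bitrate per audio channel; the 24 kbit/s figure is purely an assumed example.

    def channels_for_bandwidth(bandwidth_bps, n_channels, per_channel_bps=24_000):
        # Return M: how many of the N channels fit the available bandwidth (at least one).
        m = int(bandwidth_bps // per_channel_bps)
        return max(1, min(m, n_channels))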
  • FIG. 12 illustrates an example of a method 100 that can for example be performed by the apparatus 10.
  • the method comprises, at block 102, receiving at least N audio channels 20 where each of the N audio channels 20 can be rendered as a different audio source.
  • the method 100 comprises, at block 104, controlling mixing of the N audio channels 20 to produce at least an output audio channel 52, wherein the mixer 50 selects a sub-set 30 of at least M audio channels from the N audio channels 20 in dependence upon prioritization 32 of the N audio channels 20, wherein the prioritization 32 is adaptive and depends at least upon a content 34 of one or more of the N audio channels 20.
  • the method 100 further comprises, at block 106, causing rendering of at least the output audio channel 52.
  • FIG. 13 illustrates a method 110 for producing the output audio channel 52. This method broadly corresponds to the method previously described with reference to FIG. 6 .
  • the method 110 comprises obtaining spatial audio signals from at least two sources as distinct audio channels 20.
  • the method 110 comprises determining temporal activity of each of the spatial audio signals (of the two audio channels 20) and selecting at least one spatial audio signal (audio channel 20) for mono downmix (for inclusion within the sub-set 30 and the output audio channel 52) for duration of its activity.
  • the method 110 comprises determining a content-based priority for at least one of the spatial audio signals (audio channels 20) for temporarily altering a previous selection.
  • the method 110 comprises determining a first mono downmix (sub-set 30 and output audio channel 52) based on at least one of the prioritized spatial audio signals (audio channels 20).
  • the output audio channel 52 is based upon the selected sub-set M which is in turn based upon the prioritization 32. Then at block 120, the method 110 provides the first mono downmix (the output audio channel 52) to the participant for listening. That is, it provides the output audio channel 52 for rendering.
  • the prioritization 32 determined at block 116 is used to adaptively adjust selection of the sub-set 30 of M audio channels 20 used to produce the output audio channel 52.
  • FIG. 14 illustrates an example in which the audio channel 20 3 is first selected, based on prioritization, as the primary audio channel in the output audio channel 52.
  • the output audio channel 52 does not comprise the audio channel 20 4 or 20 5 .
  • the audio channel 20 3 remains prioritized. There is no change to the selection of the sub-set 30 of M audio channels until the activity in the audio channel 20 3 ends.
  • a new selection process can occur based upon the prioritization 32 of other channels. In this example there is a selection grace period after the end of activity in the audio channel 20 3 .
  • the audio channel 20 3 will be re-selected as the primary channel to be included in the sub-set 30 and the output audio channel 52.
  • the audio channel 20 3 can have a higher prioritization and be selected if it becomes active. After the selection grace period expires, the prioritization of the audio channel 20 3 can be decreased.
  • FIG. 15 illustrates an example of a method 130 that broadly corresponds to the method previously described in relation to FIG. 8 .
  • the method 130 comprises obtaining spatial audio signals (audio channels 20) from at least two sources. This corresponds to the receiving of at least two audio channels 20.
  • the method 130 determines a first mono downmix (sub-set 30 and output audio channel 52) based on at least one of the spatial audio signals (audio channels 20).
  • the method 130 comprises determining at least one second mono downmix (sub-set 80 and additional audio channel) based on at least one of the spatial audio signals (audio channels 20) not present in the first mono downmix.
  • the first mono downmix is provided to a participant for listening as the output audio channel 52.
  • the second mono downmix is provided to a memory for storage.
  • if an audio channel 20 associated with a particular input end-point 206 is selected for inclusion within the sub-set 30 of audio channels used to create the output audio channel 52 at a particular output end-point 204, then this information may be provided as a feedback at an output end-point 204 associated with that included input end-point 206.
  • an audio channel 20 associated with a particular input end-point 206 is not selected for inclusion within the sub-set 30 of audio channels used to create the output audio channel 52 at a particular output end point 204, then this information may be provided as a feedback at an output end-point 204 associated with that excluded input end-point 206.
  • the information can for example identify the input end-points 206 not selected for inclusion for rendering at a particular identified output end-point 204.
  • FIG. 16 illustrates an example of a controller 70.
  • Implementation of a controller 70 may be as controller circuitry.
  • the controller 70 may be implemented in hardware alone, have certain aspects in software including firmware alone or can be a combination of hardware and software (including firmware).
  • the controller 70 may be implemented using instructions that enable hardware functionality, for example, by using executable instructions of a computer program 76 in a general-purpose or special-purpose processor 72 that may be stored on a computer readable storage medium (disk, memory etc) to be executed by such a processor 72.
  • the processor 72 is configured to read from and write to the memory 74.
  • the processor 72 may also comprise an output interface via which data and/or commands are output by the processor 72 and an input interface via which data and/or commands are input to the processor 72.
  • the memory 74 stores a computer program 76 comprising computer program instructions (computer program code) that controls the operation of the apparatus when loaded into the processor 72.
  • the computer program instructions of the computer program 76 provide the logic and routines that enable the apparatus to perform the methods previously illustrated and/or described.
  • the processor 72 by reading the memory 74 is able to load and execute the computer program 76.
  • the apparatus 10 therefore comprises:
  • the computer program 76 may arrive at the apparatus 10 via any suitable delivery mechanism 78.
  • the delivery mechanism 78 may be, for example, a machine readable medium, a computer-readable medium, a non-transitory computer-readable storage medium, a computer program product, a memory device, a record medium such as a Compact Disc Read-Only Memory (CD-ROM) or a Digital Versatile Disc (DVD) or a solid state memory, an article of manufacture that comprises or tangibly embodies the computer program 76.
  • the delivery mechanism may be a signal configured to reliably transfer the computer program 76.
  • the apparatus 10 may propagate or transmit the computer program 76 as a computer data signal.
  • Computer program instructions for causing an apparatus to perform at least the following or for performing at least the following:
  • the computer program instructions may be comprised in a computer program, a non-transitory computer readable medium, a computer program product, a machine readable medium. In some but not necessarily all examples, the computer program instructions may be distributed over more than one computer program.
  • Although the memory 74 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/dynamic/cached storage.
  • Although the processor 72 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable.
  • the processor 72 may be a single core or multi-core processor.
  • references to 'computer-readable storage medium', 'computer program product', 'tangibly embodied computer program' etc. or a 'controller', 'computer', 'processor' etc. should be understood to encompass not only computers having different architectures such as single /multi- processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry.
  • References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc.
  • circuitry may refer to one or more or all of the following:
  • the blocks illustrated in the preceding Figs may represent steps in a method and/or sections of code in the computer program 76.
  • the illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks, and the order and arrangement of the blocks may be varied. Furthermore, it may be possible for some blocks to be omitted.
  • the above described examples find application as enabling components of: automotive systems; telecommunication systems; electronic systems including consumer electronic products; distributed computing systems; media systems for generating or rendering media content including audio, visual and audio visual content and mixed, mediated, virtual and/or augmented reality; personal systems including personal health systems or personal fitness systems; navigation systems; user interfaces also known as human machine interfaces; networks including cellular, non-cellular, and optical networks; ad-hoc networks; the internet; the internet of things; virtualized networks; and related software and services.
  • a property of the instance can be a property of only that instance or a property of the class or a property of a sub-class of the class that includes some but not all of the instances in the class. It is therefore implicitly disclosed that a feature described with reference to one example but not with reference to another example, can where possible be used in that other example as part of a working combination but does not necessarily have to be used in that other example.
  • 'a' or 'the' is used in this document with an inclusive not an exclusive meaning. That is, any reference to X comprising a/the Y indicates that X may comprise only one Y or may comprise more than one Y unless the context clearly indicates the contrary. If it is intended to use 'a' or 'the' with an exclusive meaning then it will be made clear in the context. In some circumstances the use of 'at least one' or 'one or more' may be used to emphasize an inclusive meaning but the absence of these terms should not be taken to infer any exclusive meaning.
  • the presence of a feature (or combination of features) in a claim is a reference to that feature or (combination of features) itself and also to features that achieve substantially the same technical effect (equivalent features).
  • the equivalent features include, for example, features that are variants and achieve substantially the same result in substantially the same way.
  • the equivalent features include, for example, features that perform substantially the same function, in substantially the same way to achieve substantially the same result.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

An apparatus comprising means for:
receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon prioritization of the N audio channels, and wherein the prioritization is adaptive depending at least upon a changing content of one or more of the N audio channels; and
providing for rendering at least the output audio channel.

Description

    TECHNOLOGICAL FIELD
  • Embodiments of the present disclosure relate to audio. Some enable the distribution of common content for rendering to both advanced audio output devices and less advanced audio output devices.
  • BACKGROUND
  • Advanced audio output devices are capable of rendering multiple received audio channels as different spatially positioned audio sources. The spatial separation of audio sources (spatial audio) can aid hearing when the sources simultaneously provide sound.
  • Less advanced audio output devices are perhaps only capable of rendering one monophonic audio channel. They cannot render multiple received audio channels as different spatially positioned audio sources.
  • Content that is suitable for rendering spatial audio via an advanced audio output device may be unsuitable for a less advanced audio output device and content that is suitable for rendering by a less advanced audio output device may under-utilize the spatial audio capabilities of an advanced audio output device.
  • BRIEF SUMMARY
  • According to various, but not necessarily all, embodiments there is provided an apparatus comprising means for:
    • receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
    • controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon prioritization of the N audio channels, and wherein the prioritization is adaptive depending at least upon a changing content of one or more of the N audio channels; and
    • providing for rendering at least the output audio channel.
  • In some but not necessarily all examples, the apparatus comprises means for: automatically controlling mixing of the N audio channels to produce at least the output audio channel, in dependence upon time-variation of content of one or more of the N audio channels.
  • In some but not necessarily all examples, the N audio channels are N spatial audio channels where each of the N spatial audio channels can be rendered as a differently positioned audio source.
  • In some but not necessarily all examples, N is at least two and M is one, the output audio channel being a monophonic audio output channel.
  • In some but not necessarily all examples, the apparatus comprises means for analyzing the N audio channels to adapt a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels.
  • In some but not necessarily all examples, prioritization depends upon one or more of:
    • timing of content of at least one of the N audio channels relative to timing of content of at least another one of the N audio channels;
    • history of content of at least one of the N audio channels;
    • mapping to a particular person, an identified voice in content of at least one of the N audio channels;
    • detection that content of at least one of the N audio channels is voice content;
    • detection that content of at least one of the N audio channels comprises an identified word.
  • In some but not necessarily all examples, controlling mixing of the N audio channels to produce at least an output audio channel, comprises:
    • selecting a first sub-set of the N audio channels to be mixed to provide background audio;
    • selecting a second sub-set of the N audio channels to be mixed to provide foreground audio that is for rendering at greater loudness than the background audio, wherein the selection of the first sub-set and selection of the second sub-set is dependent upon the prioritization of the N audio channels; and
    • mixing the background audio and the foreground audio to produce the output audio channel.
  • In some but not necessarily all examples, the apparatus comprises means for controlling mixing of the N audio channels to produce M audio channels in response to a communication bandwidth for receiving the audio channels or for providing output audio signals falling beneath a threshold value.
  • In some but not necessarily all examples, the apparatus comprises means for controlling mixing of the N audio channels to produce M audio channels when there is conflict between a first audio channel of the N audio channels and a second audio channel of the N audio channels, wherein the first audio channel is included within the M audio channels and the second audio channel is not included within the M audio channels, wherein over-talking is an example of conflict.
  • In some but not necessarily all examples, the audio channels of the N audio channels that are not the selected M audio channels are available for later rendering.
  • In some but not necessarily all examples, the apparatus comprises a user input interface for controlling prioritization of the N audio channels.
  • In some but not necessarily all examples, the apparatus comprises a user input interface, wherein the user input interface provides a spatial representation of the N audio channels and indicates which of the N audio channels are comprised in the sub-set of M audio channels.
  • According to various, but not necessarily all, embodiments there is provided a multi-party, live communication system that enables live audio communication between multiple remote participants using at least the N audio channels wherein different ones of the multiple remote participants provide audio input for different ones of the N audio channels, wherein the system comprises the apparatus.
  • According to various, but not necessarily all, embodiments there is provided a method comprising:
    • receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
    • controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels; and
    • rendering at least the output audio channel.
  • According to various, but not necessarily all, embodiments there is provided a computer program that when run on one or more processors enables:
    • controlling mixing of N received audio channels, where each of the N audio channels can be rendered as a different audio source, to produce at least an output audio channel for rendering,
    • wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels.
  • According to various, but not necessarily all, embodiments there is provided an apparatus comprising means for:
    • receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
    • adapting a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels; and
    • controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon the prioritization; and
    • providing for rendering at least the output audio channel.
  • According to various, but not necessarily all, embodiments there is provided an apparatus comprising means for:
    • receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
    • analyzing the N audio channels to adapt a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels; and
    • controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon the prioritization; and
    • providing for rendering at least the output audio channel.
  • According to various, but not necessarily all, embodiments there is provided examples as claimed in the appended claims.
  • BRIEF DESCRIPTION
  • Some examples will now be described with reference to the accompanying drawings in which:
    • FIG. 1 illustrates an example of an apparatus for providing an output audio channel for rendering;
    • FIG. 2 illustrates an example of an apparatus in which an analyzer is configured to analyze the N audio channels to adapt the prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels;
    • FIG. 3 illustrates another example of the apparatus;
    • FIG. 4 illustrates an example of a multi-party, live communication system comprising the apparatus;
    • FIG. 5A and 5B illustrate alternative topologies of the system;
    • FIG. 6 illustrates an example of prioritization based on timing of content;
    • FIG. 7 illustrates an example of prioritization based on content type;
    • FIG. 8 illustrates an example of storage of unselected audio channels;
    • FIG. 9A, 9B, 9C illustrate examples of prioritization based on keywords in content;
    • FIG. 10 illustrates an example of informing a consumer of the output audio channel of an option to change the audio channels included within the output audio channel;
    • FIG. 11A illustrates an example of spatial audio rendered, based on the N audio channels, at an output end-point configured for rendering spatial audio;
    • FIG. 11B illustrates an example of audio rendered, based on the output audio channel, at an output end-point that is not configured for rendering spatial audio;
    • FIG. 12, 13, 15 illustrate examples of a method;
    • FIG. 14 illustrates an example of changing prioritization based on timing of content;
    • FIG. 16 illustrates an example of a controller; and
    • FIG. 17 illustrates an example of a computer program.
    DETAILED DESCRIPTION
  • The following description and the attached drawings describe various examples of an apparatus 10 that receives at least N audio channels 20 and enables the rendering of one or more output audio channels 52.
  • The set of N audio channels is referenced using reference number 20. Each audio channel of the set of N audio channels is referenced using reference number 20i, where i is 1, 2,...N-1, N.
  • The apparatus 10 comprises means for receiving at least N audio channels 20 where each of the N audio channels 20i can be rendered as a different audio source.
  • The apparatus 10 comprises means 40, 50 for controlling selection and mixing of the N audio channels 20 to produce at least an output audio channel 52.
  • A selector 40 selects for mixing (to produce the output audio channel 52) a sub-set 30 of M audio channels from the N audio channels 20. The selection is dependent upon prioritization 32 of the N audio channels 20. The prioritization 32 is adaptive depending at least upon a changing content 34 of one or more of the N audio channels 20.
  • The sub-set 30 of M audio channels is referenced using reference number 30. Each audio channel of the sub-set of M audio channels is referenced using reference number 20j, where j is any M of the N values of i. The sub-set 30 can, for example, be varied by changing the value of M and/or by changing which audio channels 20j are used to comprise the M audio channels of the sub-set 30. In the description, different sub-sets 30 can, in some examples, be differentiated using the same reference 30 with different numeric subscripts.
  • A mixer 50 mixes the sub-set 30 of M audio channels to produce the output audio channel 52 which is suitable for rendering.
  • An advanced spatial audio output device (an example is illustrated at FIG 11A) can render the N audio channels 20 as multiple different spatially positioned audio sources. A less advanced audio output device (an example is illustrated at FIG 11B) can render the output audio channel 52.
  • The apparatus 10 therefore allows a common content, the N audio channels 20, to provide audio output at both the advanced spatial audio output device and the less advanced audio output device.
  • FIG. 1 illustrates an example of an apparatus 10 for providing an output audio channel 52 for rendering. The rendering of the output audio channel 52 can occur at the apparatus 10 or can occur at some other device.
  • The apparatus 10 receives at least N audio channels 20. An audio channel 20i of the N audio channels 20 can be rendered as a distinct audio source.
  • The apparatus 10 comprises a mixer 50 for mixing a sub-set 30 of M audio channels, selected from the N audio channels 20, to produce at least an output audio channel 52.
  • A selector 40 selects for mixing, at mixer 50, the sub-set 30 of M audio channels from the N audio channels 20. The selection, by the selector 40, is dependent upon prioritization 32 of the N audio channels 20. The prioritization 32 is adaptive depending at least upon a changing content 34 of one or more of the N audio channels 20. The apparatus 10 provides, from the mixer 50, the output audio channel 52 for rendering.
  • The sub-set 30 of M audio channels has fewer audio channels than the N audio channels 20, that is, M is less than N. N is at least two and in at least some examples is greater than two. In at least some examples M is one and the output audio channel 52 is a monophonic audio output channel.
  • The prioritization 32 is adaptive. The prioritization 32 depends at least on a changing content 34 of one or more of the N audio channels 20.
  • In some but not necessarily all examples, the apparatus 10 is configured to automatically control the mixing of the N audio channels 20 to produce at least the output audio channel 52, in dependence upon time-variation of content 34 of one or more of the N audio channels 20.
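  • As a purely illustrative aid, and not part of the claimed apparatus, the overall select-and-mix behaviour described above can be sketched in a few lines of Python. The channel identifiers, the priority scores and the equal-gain downmix below are assumptions made only for this sketch.

```python
import numpy as np

def select_and_mix(channels, priorities, m=1):
    """Select the M highest-priority audio channels from the N received
    channels and downmix them to a single output audio channel.

    channels   : dict of channel id -> 1-D numpy array of samples (equal length)
    priorities : dict of channel id -> adaptive priority score (higher = more important)
    m          : size of the selected sub-set (M < N)
    """
    ranked = sorted(channels, key=lambda cid: priorities[cid], reverse=True)
    subset = ranked[:m]                                                   # sub-set of M selected channels
    output = sum(channels[cid] for cid in subset) / max(len(subset), 1)   # simple equal-gain downmix
    return subset, output

# Example: three channels; channel "3" currently has the highest priority.
fs = 48000
channels = {cid: 0.01 * np.random.randn(fs) for cid in ("1", "2", "3")}
priorities = {"1": 0.2, "2": 0.1, "3": 0.9}
subset, output = select_and_mix(channels, priorities, m=1)
print(subset)  # ['3'] -> only this channel is mixed into the output audio channel
```

  • As the priorities change with the content 34, repeated calls to a function of this kind yield a different sub-set 30 and hence a different output audio channel 52.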
  • FIG. 2 illustrates an example of an apparatus 10 in which an analyzer 60 is configured to analyze the N audio channels 20 to adapt the prioritization 32 of the N audio channels 20 in dependence upon, at least, changing content 34 of one or more of the N audio channels 20.
  • The analysis can be performed before (or simultaneously with) the aforementioned selection.
  • In some examples, the analyzer 60 is configured to process metadata associated with the N audio channels 20. Additionally or alternatively, in some examples, the analyzer 60 is configured to process the audio content of the audio channels 20. This processing could, for example, comprise voice activity detection, voice recognition processing, spectral analysis, semantic processing of speech or other processing including machine learning and artificial intelligence processing used to identify characteristics of the content 34 of one or more of the N audio channels 20.
  • The prioritization 32 can depend upon one or more parameters of the content 34.
  • In one example, the prioritization 32 depends upon timing of content 34i of an audio channel 20i relative to timing of content 34j of an audio channel 20j. Thus, the audio channel 20 that first satisfies a trigger condition has temporal priority. In some examples the trigger condition may be that the audio channel 20 has activity above a threshold, and/or has activity above a threshold in a particular spectral range and/or has voice activity and/or has voice activity associated with a specific person and/or the voice activity comprises semantic content including a particular keyword or phrase.
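  • A minimal sketch of such a temporal trigger condition is given below. It is illustrative only: the frame layout and the energy threshold are assumptions, and a real implementation could equally use voice activity detection or keyword spotting as the trigger.

```python
import numpy as np

def first_active_channel(frames, threshold=1e-3):
    """Return the id of the channel that first satisfies the trigger condition
    (here: frame energy above a threshold), scanning frame by frame.

    frames : dict of channel id -> 2-D array of shape (num_frames, frame_len)
    """
    num_frames = next(iter(frames.values())).shape[0]
    for f in range(num_frames):
        for cid, data in frames.items():            # order within one frame is arbitrary here
            if np.mean(data[f] ** 2) > threshold:   # simple activity detector
                return cid, f                       # this channel gains temporal priority
    return None, None
```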
  • An initial prioritization 32 can cause an initial selection of a first sub-set 301 of audio channels 20 that are mixed to form the output audio channel 52. A change in prioritization 32 can cause a new selection of a second different sub-set 302 of audio channels 20 that are mixed to form a new, different output audio channel 52. The first sub-set 301 and the second sub-set 302 are not equal sets. Thus, the apparatus 10 can prioritize one or more of the N audio channels 20 as a sub-set 30 until a new selection by the selector 40 based on a new prioritization 32 changes the sub-set 30.
  • If a person is speaking first in a particular audio channel 20, that channel may be prioritized ahead of a second audio channel. However, if the person speaking in the first audio channel stops speaking then the prioritization 32 of the audio channels can change and there can be a consequential reselection at the selector 40 of the sub-set 30 of M audio channels provided for mixing to produce the output audio channel 52.
  • The apparatus 10 can flag at least one input audio channel 20 corresponding to a first active talker, or generally active content 34, during a selection period and prioritize this selection over other audio channels 20. The apparatus 10 can determine whether the active talker continues before introducing content 34 from non-prioritized channels to the mixed output audio channel 52. The introduction of such additional content 34 from non-prioritized channels is controlled by the selector 40 during a following selection period.
  • In some examples, non-prioritized audio channels 20 can be completely omitted from the mixed output audio channel 52 and thus the mixed output audio channel 52 will contain only the prioritized channel(s). However, in other examples, the non-prioritized channels can be mixed with a lower gain or higher attenuation than the prioritized channel and/or with other suitable processing to produce the output audio channel 52.
  • It will therefore be appreciated that in at least some examples, a history of content 34 of at least one of the N audio channels 20 can be used to control the prioritization 32. For example, it may be possible to vary the "inertia" of the system, that is, control the rate of change of the prioritization 32. It is therefore possible to make the apparatus 10 more or less responsive to short term variations in the content 34 of one or more of the N audio channels 20.
  • The selector 40 in making a selection of which of the N audio channels 20 to select for mixing to produce the output audio channel 52 can, for example, use decision thresholds for selection. A decision threshold can be changed over time and can be dependent upon a history of the content 34. In addition, different decision thresholds can be used for different audio channels 20.
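  • One possible, purely illustrative way to realize such history-dependent decision thresholds is an exponentially smoothed activity score per channel, as sketched below; the smoothing factor and the single shared threshold are assumptions chosen only for the example (per-channel thresholds would be a straightforward extension).

```python
class ChannelPriority:
    """Adaptive prioritization with 'inertia': a smoothed activity estimate per
    channel, so that short-term variations in content do not immediately change
    which channels are selected."""

    def __init__(self, channel_ids, smoothing=0.9, threshold=0.5):
        self.score = {cid: 0.0 for cid in channel_ids}
        self.smoothing = smoothing          # higher value -> more inertia
        self.threshold = threshold          # decision threshold for selection

    def update(self, activity):
        """activity: dict of instantaneous activity per channel (e.g. 0..1 voice likelihood)."""
        for cid, a in activity.items():
            self.score[cid] = self.smoothing * self.score[cid] + (1.0 - self.smoothing) * a

    def selected(self):
        """Channels whose smoothed score currently exceeds the decision threshold."""
        return [cid for cid, s in self.score.items() if s > self.threshold]
```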
  • In some examples, the prioritization 32 can be dependent upon mapping, to a particular person, an identified voice in content 34 of at least one of the N audio channels 20. The analyzer 60 can, for example, perform voice recognition based upon the content 34 of one or more of the N audio channels 20. Alternatively, the analyzer 60 can identify a particular person based upon metadata comprised within the content 34 of at least one of the N audio channels 20. It may therefore be possible to identify a particular one of the N audio channels 20 as relating to a person whose contribution it is particularly important to hear such as, for example, a chairman of a meeting.
  • In some examples, the analyzer 60 is configured to adapt the prioritization 32 when the presence of voice content is detected within the content 34 of at least one of the N audio channels 20. Thus, the analyzer 60 is able to prioritize the spoken word within the output audio channel 52. It is also possible to adapt the analyzer 60 to prioritize other types of content.
  • In some, but not necessarily all, examples, the analyzer 60 is configured to adapt the prioritization 32 based upon detection that content 34 of at least one of the N audio channels 20 comprises an identified keyword. The analyzer 60 can, for example, listen to the content 34 and identify within the stream of content a keyword or identify semantic meaning within the stream of content. This can be used to modify the prioritization 32. For example, it may be desirable for a consumer of the output audio channel 52 to have that output audio channel 52 personalized so that if one of the N audio channels 20 comprises content 34 that includes the consumer's name or other information associated with the consumer then that audio channel 20 is prioritized by the analyzer 60.
  • In some, but not necessarily all, examples, the N audio channels 20 can represent live content. In this example, the analysis by the analyzer 60, the selection by the selector 40 and the mixing by the mixer 50 can occur in real time such that the output audio channel 52 is also live.
  • FIG. 3 illustrates an example of the apparatus of FIG. 1 in more detail. In this example one possible operation of the mixer 50 is illustrated in more detail. In this example, the mixing is a weighted mixing in which different sub-sets of the sub-set 30 of selected audio channels are weighted with different attenuation/gain before being finally mixed to produce the output audio channel 52.
  • In the illustrated example, the selector 40, based upon the prioritization 32, selects a first sub-set SS1 of the M audio channels to be mixed to provide background audio B and selects a second sub-set SS2 of the M audio channels 20 to be mixed to provide foreground audio F that is for rendering at greater loudness than the background audio B. The selection of the first sub-set SS1 and the selection of the second sub-set SS2 is dependent upon the prioritization 32 of the N audio channels 20. The first sub-set SS1 of audio channels 20 is mixed 501 to provide background audio B which is then amplified/attenuated G1 to adjust the loudness of the background audio before it is provided to the mixer 503 for mixing to produce the output audio channel 52. The second sub-set SS2 of audio channels 20 is mixed 502 to provide foreground audio F which is then amplified/attenuated G2 to adjust the loudness of the foreground audio before it is provided to the mixer 503 for mixing to produce the output audio channel 52.
  • The gain/attenuation G2 applied to the foreground audio F makes it significantly louder than the background audio B in the output audio channel 52. In some situations, the foreground audio F is naturally louder than background audio B. Thus, it can be but need not be that G2 > G1.
  • The gain/attenuation G1, G2 can, in some examples, vary with frequency.
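  • The weighted foreground/background mixing of FIG. 3 can be sketched as follows. This is an illustrative example only; the gain values and the peak-normalisation safeguard are assumptions and not part of the description above.

```python
import numpy as np

def mix_foreground_background(foreground, background, g2=1.0, g1=0.25):
    """Mix a foreground sub-set SS2 and a background sub-set SS1 into one
    output audio channel, rendering the foreground louder (here g2 > g1).

    foreground, background : lists of 1-D numpy arrays of equal length
    """
    fg = sum(foreground) if foreground else 0.0
    bg = sum(background) if background else 0.0
    out = g2 * fg + g1 * bg
    peak = np.max(np.abs(out)) if np.ndim(out) else 1.0   # simple safeguard against clipping
    return out / peak if peak > 1.0 else out
```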
  • FIG. 4 illustrates an example of a multi-party, live communication system 200 that enables live audio communication between multiple remote participants Ai, B, C, Di using at least the N audio channels 20. Different ones of the multiple remote participants Ai, B, C, Di provide audio input for different ones of the N audio channels 20.
  • The system 200 comprises input end-points 206 for capturing audio channels 20. The system 200 comprises output end-points 204 for rendering audio channels. One or more output end-points 204s (spatial output end-points) are configured for rendering spatial audio as distinct rendered audio sources. One or more output end-points 204m (mono output end-points) are not configured for rendering spatial audio.
  • The N audio channels 20 are N spatial audio channels where each of the N spatial audio channels is captured as a differently positioned captured audio source, and can be rendered using spatial audio as a differently positioned rendered audio source. In some examples the captured audio source (input end-point 206) has a fixed and stationary position. However, in other examples it can vary in position. When such an input end-point 206 is rendered as a rendered audio source at an output end-point 204 using spatial audio, then the rendered audio source can either be fixed or can move, for example, in a manner corresponding to the moving input end-point 206.
  • In this example, the system 200 is for enabling immersive teleconferencing or telepresence for remote terminals. The different terminals have varying device capabilities and different (and possibly variable) network conditions.
  • Spatial/immersive audio refers to audio that typically has a three-dimensional space representation or is presented (rendered) to a participant with the intention of the participant being able to hear a specific audio source from a specific direction. In the specific example illustrated there is a multi-participant audio/visual conference call between remote participants. Some of the participants share a room. For example, participants A1, A2, A3, A4 share the room A and the participants D1, D2, D3, D4, D5 share the room D.
  • Some of the terminals can be characterized as "advanced spatial audio output devices" that have an output end-point 204s that is configured for spatial audio. However, some of the terminals are less advanced audio output devices that have an output end-point 204m that is not configured for spatial audio.
  • In a spatial audio experience, the voices of the participants Ai, B, C, Di are spatially separated. The voices may, for example, have fixed spatial positions relative to each other or the directions may be adaptive, for example, according to participant movements, conference bridge settings or based upon inputs by participants. A similar experience is available to the participants who are using the output end-points 204s and they have the ability to interact much more naturally than traditional voice calls and voice conferencing. For example, they can talk at the same time and still understand each other thanks to effects such as the well-known cocktail party effect.
  • In rooms A and D, each of the respective participants Ai, Di has a personal input end-point 206 which captures a personal captured audio source as a personal audio channel 20. The personal input end-point 206 can, for example, be provided by a directional microphone or by a Lavalier microphone.
  • The participants B and C each have a single personal input end-point 206 which captures a personal audio channel 20.
  • In rooms A and D, the output end-points 204s are configured for spatial audio. For example, each room can have a surround sound system as an output end-point 204s.
    An output end point 204s is configured to render each captured sound source represented by an audio channel 20 as a rendered sound source.
  • In room D, each participant Ai, B, C has a personal output audio channel 20. Each personal output audio channel 20 is rendered from a different location as a different rendered audio source. The collection of rendered audio sources associated with the participants Ai creates a virtual room A.
  • In room A, each participant Di, B, C has a personal output audio channel 20. Each personal output audio channel 20 is rendered from a different location as a different rendered sound source. The collection of the rendered audio sources associated with the participants Di creates a virtual room D.
  • The participant C has an output end-point 204s that is configured for spatial audio. In this example, the participant C is using a headset configured for binaural spatial audio that is suitable for virtual reality (VR). Binauralization methods can be used to render personal audio channels 20 as spatially positioned rendered audio sources. Each participant Ai, Di, B has a personal output audio channel 20. Each personal output audio channel 20 is or can be rendered from a different location as a different rendered sound source.
  • The participant B has an output end-point 204m that is not configured for spatial audio. In this example it is a monophonic output end-point. In the example illustrated, the participant B is using a mobile device (e.g. a mobile phone) to provide the input end-point 206 and the output end-point 204m. The mobile device has a single output end-point 204m which provides the output audio channel 52 as previously described. The processing to produce the output audio channel 52 can be performed at the mobile device of the participant B or at the server 202.
  • The mono-capability limitation of participant B can, for example, be caused by the device, for example because it is only configured for decoding of mono audio, or because of the available audio output facilities such as a mono-only earpiece or headset.
  • In the preceding examples the spatial audio has been described at a high resolution. Each of the input end-points 206 is rendered in spatial audio as a spatially distinct rendered audio source. However, in other examples multiple ones of the input end-points 206 may be mixed together to produce a single rendered audio source. This can be used to reduce the number of rendered audio sources using spatial audio. Therefore, in some examples, a spatial audio device may render multiple ones of output audio channels 52.
  • In the example illustrated in FIG. 4, a star topology similar to that illustrated in FIG. 5A is used. The central server 202 interconnects the input end-points 206 and the output end-points 204. In the example of FIG. 5A, the input end-points 206 provide the N audio channels 20 to a central server 202 which produces the output audio channel 52 as previously described to the output end-point 204m. In this example, the apparatus 10 is located in the central server 202, however, in other examples the apparatus 10 is located at the output end-point 204m.
  • FIG. 5B illustrates an alternative topology in which there is no centralized architecture but a peer-to-peer architecture. In this example, the apparatus 10 is located at the output end-point 204m.
  • The 3GPP IVAS codec is an example of a voice and audio communications codec for spatial audio. The IVAS codec is an extension of the 3GPP EVS codec and is intended for new immersive voice and audio services over 4G and 5G. Such immersive services include, for example, immersive voice and audio for virtual reality (VR). The multi-purpose audio codec is expected to handle encoding, decoding and rendering of speech, music and generic audio. It is expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. It is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions. The audio channels 20 can, for example, be coded/decoded using the 3GPP IVAS codec.
  • The spatial audio channels 20 can, for example, be provided as metadata-assisted spatial audio (MASA), objective-based audio, channel-based audio (5.1, 7.1+4), non-parametric scene-based audio (e.g. First Order Ambisonics, High Order Ambisonics) and any combination of these formats. These audio formats can be binauralized for headset listening such that a participant can hear the audio sources outside their head.
  • It will therefore be appreciated from the foregoing that the apparatus 10 provides a better experience, including improved intelligibility for a mono user participating in a spatial audio teleconference with several potentially overlapping spatial audio inputs. The apparatus 10 means that it is not necessary, in some cases, to simplify the spatial audio conference experience for the spatial audio users due to having a mono-audio participant. Thus, a mono user can participate in a spatial audio conference without compromising the experience of the other users.
  • FIGS 6, 7, 8 and 9A illustrate examples of an apparatus 10 that comprises a controller 70. The controller receives N audio channels 20 and performs control processing to select the sub-set 30 of M audio channels. In the examples previously described, the controller 70 comprises the selector 40 and, optionally, the analyzer 60. In these examples, the mixer 50 is present but not illustrated.
  • In at least some of these examples, the controller 70 is configured to control mixing of the N audio channels 20 to produce the sub-set 30 of M audio channels when a conflict between a first audio channel of the N audio channels 20 and a second audio channel of the N audio channels occurs. For example, the control can cause the first audio channel 20 to be included within the sub-set 30 of M audio channels and cause the second audio channel 20 not to be included within the sub-set 30 of M audio channels.
  • In some examples, at a later time, when there is no longer conflict between the first audio channel and the second audio channel, the second audio channel is included within the sub-set 30 of M audio channels.
  • One example of when there is conflict between audio channels is when there is simultaneous activity from different prioritized sound sources. For example, overtalking (simultaneous speech) associated with different audio channels 20 can be an example of conflict.
  • In the example illustrated in FIG. 6, the prioritization 32 used for the selection of audio channels to form the sub-set 30 of M audio channels depends upon timing of content 34 of at least one of the N audio channels 20 relative to timing of content 34 of at least another one of the N audio channels 20.
  • In this example, the participant 3 speaks first and the audio channel 203 associated with the participant 3 is selected as a 'priority' for inclusion within the sub-set 30 of M=1 audio channels used to form the output audio channel 52. The later speech by participants 4 and 5 is not selected for inclusion within the sub-set 30 of audio channels used to form the output audio channel 52.
  • The audio channel 203 preferentially remains prioritized and remains included within the output audio channel 52, while there is voice activity in the audio channel 203, whereas the audio channels 204, 205 are excluded. If voice activity is no longer detected in the audio channel 203 then in some examples a selection process may immediately change the identity of the audio channel 20 selected for inclusion within the output audio channel 52. However, in other examples there can be a selection grace period. During this grace period, there can be a greater likelihood of selection/reselection of the original selected audio channel 203. Thus, during the grace period prioritization 32 is biased in favor of the previously selected audio channel.
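  • The selection grace period can be sketched, purely for illustration, as a bias added to the previously selected channel while the grace period has not expired; the frame count and bias value below are assumptions made only for the example.

```python
def reselect_primary(priorities, previous, frames_since_inactive,
                     grace_frames=50, bias=0.3):
    """Choose the primary channel for the output audio channel, biasing the
    previously selected channel while the selection grace period lasts."""
    biased = dict(priorities)
    if previous is not None and frames_since_inactive <= grace_frames:
        biased[previous] = biased.get(previous, 0.0) + bias   # favour re-selection
    return max(biased, key=biased.get)
```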
  • It will therefore be appreciated that in at least some examples, prioritization 32 used for the selection depends upon a history of content 34 of at least one of the N audio channels 20.
  • In some examples, the prioritization 32 used for the selection can depend upon mapping to a particular person (an identifiable human), an identified voice in content 34 of at least one of the N audio channels 20. A voice can be identified using metadata or by analysis of the content 34. The prioritization 32 would more favorably select the particular person's audio channel 20 for inclusion within the output audio channel 52.
  • The particular person could, for example, be based upon service policy. A teleconference service may have a moderator or chairman role and this participant may for example be made audible to all participants or may be able to force themselves to be audible to all participants. In other examples, the particular person could for example be indicated by a user consuming the output audio channel 52. That consumer could for example indicate which of the other participants' content 34 or audio channels 20 they wish to consume. This audio channel 20 could then be included, or be more likely to be included, within the output audio channel 52. The inclusion of the user-selected audio channel 20 can for example be dependent upon voice activity within the audio channel 20, that is, the user-selected audio channel 20 is only included if there is active voice activity within that audio channel 20. The prioritization 32 used for the selection therefore strongly favors the user-selected audio channel 20. The selection by the consumer of the output audio channel 52 of a particular audio channel 20 can for example be based upon an identity of the participant who is speaking or should speak in that audio channel. Alternatively, it could be based upon a user-selection of that audio channel because of the content 34 rendered within that audio channel.
  • FIG. 7 illustrates an example similar to FIG. 6. In this example, the audio channels 20 include a mixture of different audio types. The audio channel 203 associated with participant 3 is predominantly a voice channel. The audio channels 204, 205 associated with participants 4 and 5 are predominantly instrumental/music channels. In this example, the selection of which of the audio channels 20 is to be included within the output audio channel 52 can be based upon the audio type present within the audio channel 20. The detection of the audio type within the audio channel 20 can for example be achieved using metadata or, alternatively, by analyzing the content 34 of the audio channel 20. Thus, the prioritization 32 used for selection can be dependent upon detection that content 34 of at least one of the N audio channels 20 is voice content. In such a voice-centric case, natural pauses in the active content 34 allow for changes in the mono downmix. That is, the output audio channel 52 can switch between the inclusion of different audio channels 20 in dependence upon which of them includes active voice content. In this way priority can be given to spoken language. The other channels, for example the music channels 204, 205, may optionally be included, for example as background audio as previously described in relation to FIG. 3.
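  • A content-type based prioritization of this kind could, purely illustratively, be scored as below; the weights and the "voice"/"music" labels are assumptions, and in practice the labels could come from metadata or from the analyzer 60.

```python
def content_type_priority(activity, content_type, voice_weight=1.0, music_weight=0.1):
    """Score each channel so that active voice content is prioritized ahead of
    active instrumental/music content; natural pauses (low activity) in the
    voice channel lower its score and allow the downmix to switch."""
    scores = {}
    for cid, kind in content_type.items():
        weight = voice_weight if kind == "voice" else music_weight
        scores[cid] = weight * activity.get(cid, 0.0)
    return scores
```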
  • In the examples illustrated in FIGS 6 and 7, the apparatus 10 deliberately loses information by excluding (or diminishing) audio channels 20 with respect to the output audio channel 52. Information is generally lost by the selective downmixing which is required to maintain or guarantee intelligibility. It is, however, possible for there to be two simultaneously important audio channels 20, only one of which is selected for inclusion in the output audio channel 52. The apparatus illustrated in FIG. 8 addresses this issue.
  • The apparatus 10 illustrated is similar to that illustrated in FIGS 6 and 7. However, it additionally comprises a memory 82 for storage of a further sub-set 80 of the N audio channels 20 that is different to the sub-set 30 of N audio channels 20. Thus, in this example at least some of the audio channels of the N audio channels 20 that are not selected for inclusion in the sub-set 30 of M audio channels, are stored as sub-set 80 and are available for later rendering. In some examples, the later rendering may be at a faster playback rate and that playback may be fixed or may be adaptive. In some examples, the sub-set 80 of audio channels is mixed to form an alternative audio output channel for storage in the memory 82.
  • In the specific example illustrated at least some of the audio channels of the N audio channels that are not selected to be in the sub-set 30 of M audio channels are stored in memory 82 for later rendering.
  • In the particular illustrated example, there is selection of a first sub-set 30 of M audio channels from the N audio channels based upon prioritization 32 of the N audio channels. The first sub-set 30 of M audio channels are mixed to produce a first output audio channel 52. There is selection of a different second sub-set 80 of audio channels from the N audio channels based upon prioritization 32 of the N audio channels. The second sub-set 80 of audio channels are mixed to produce a second output audio channel for storage.
  • In the example illustrated in FIG. 8, the audio channel 203 includes content 34 comprising voice content from a single participant, and it is selected for inclusion within the sub-set 30 of audio channels. It is used to produce the output audio channel 52. The audio channels 204, 205, which have not been included within the output audio channel 52, or included only as background (as described with reference to FIG. 3), are selected for mixing to produce the second output audio signal that is stored in memory 82.
  • When there is storage of a second sub-set 80 of audio channels as a second audio signal, it is desirable to let the consumer of the output audio channel 52 know of the existence of the stored audio signal. This can for example facilitate user control of switching from rendering the output audio channel 52 to rendering the stored audio channel.
  • FIG. 10 illustrates an example of how such an indication may be provided to the consumer of the output audio channel 52. Fig 10 is described in detail later.
  • In some examples, it may be possible to automatically switch from rendering the output audio channel 52 to rendering the stored audio channel. For example, there may be automatic switching during periods of inactivity of the output audio channel 52. An apparatus 10 may switch to the stored audio channel and play that back at a higher speed. For example, the apparatus 10 can monitor the typical length of inactivity in the preferred output audio channel 52 and adjust the speed of playback for the stored audio channel such that the relevant portions can be played back during a typical inactive period.
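  • Purely as an illustration of the adaptive playback speed mentioned above, a playback rate could be derived from the typical inactive period as sketched below; the cap on the rate is an assumption made only for the example.

```python
def catch_up_rate(stored_duration_s, typical_gap_s, max_rate=2.0):
    """Choose a playback rate for the stored (unselected) audio so that it can
    be played back within a typical period of inactivity in the output channel."""
    if typical_gap_s <= 0:
        return max_rate
    return min(max(stored_duration_s / typical_gap_s, 1.0), max_rate)
```

  • For example, 6 seconds of stored audio and a typical 4 second pause would give a playback rate of 1.5.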
  • FIG. 9A illustrates an example in which the apparatus 10 detects that content 34 of at least one of the N audio channels 20 comprises an identified keyword and adapts the prioritization 32 accordingly. The prioritization 32 in turn controls selection of which of the audio channels 20 are included in the sub-set 30 and the output audio channel 52 (and, if implemented, the stored alternative audio channel).
  • In the example illustrated in FIG. 9B, the participant 'User 3' is speaking first and has priority. Therefore, the audio channel 203 associated with the User 3 is initially selected as the priority audio channel and is included within the output audio channel 52. Even though the participant 'User 5' begins to talk, the prioritization is not changed and the audio channel 203 remains the priority audio channel included within the output audio channel 52. At time T1 it is detected that User 5 says a keyword, in this example the name of the consumer of the output audio channel 52 (Dave). While this event increases the likelihood of a switch in the prioritization of the audio channels 203, 205 such that the audio channel 205 becomes prioritized and included in the output audio channel 52, in this example there is insufficient cause to change the prioritization 32 and consequently change which of the audio channels 20 is included within the output audio channel 52.
  • In the example illustrated in FIG. 9C, the participant 'User 3' is speaking first and has priority. Therefore, the audio channel 203 associated with the User 3 is initially selected as the priority audio channel and is included within the sub-set 30 used to produce the output audio channel 52. Even though the participant 'User 5' begins to talk, the prioritization is not changed and the audio channel 203 remains the priority audio channel included within the sub-set 30 and the output audio channel 52. At time T1 it is detected that User 5 says a keyword, in this example the name of the consumer of the output audio channel 52 (Dave). This event causes a switch in the prioritization of the audio channels 203, 205 such that the audio channel 205 becomes prioritized and included in the sub-set 30 and the output audio channel 52 and the audio channel 203 becomes de-prioritized and excluded from the sub-set 30 and the output audio channel 52.
  • In some examples, the consumer of the output audio channel 52 can via user input settings control the likelihood of a switch when a keyword is mentioned within an audio channel 20. For example, the consumer of the output audio channel 52 can, for example, require a switch if a keyword is detected. Alternatively, the likelihood of a switch can be increased.
  • In other examples, the occurrence of a keyword can increase the prioritization of an audio channel 20 such that it is stored, for example as described in relation to FIG. 8.
  • In other examples, the detection of a keyword may provide an option to the consumer of the output audio channel 52, to enable the consumer to cause a change in the audio channel 20 included within the sub-set 30 and the output audio channel 52. For example, if the name of the consumer of the output audio channel 52 is included within an audio channel 20 that is not being rendered, as a priority, within the output audio channel 52 then the consumer of the output audio channel 52 can be presented with an option to change prioritization 32 and switch to using a sub-set 30 and output audio channel 52 that includes the audio channel 20 in which their name was detected.
  • Where a detected keyword causes a switch in the audio channels included in the sub-set 30 and output audio channel 52, the new output audio channel 52 based on the detected keyword may be played back from the occurrence of the detected keyword. In some examples the playback is at a faster rate to allow a catch-up with real time.
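  • The keyword-triggered change of prioritization described with reference to FIGs. 9A to 9C can be sketched as follows. This is illustrative only: the transcripts are assumed to come from a speech-to-text stage, and whether a keyword forces a switch or merely raises the likelihood of one is a configurable assumption.

```python
def keyword_priority_switch(transcripts, keywords, current, priorities,
                            boost=1.0, force_switch=True):
    """If an identified keyword (e.g. the listener's name) occurs in a
    non-selected channel, raise that channel's priority and optionally
    switch the primary channel to it."""
    for cid, text in transcripts.items():
        if cid != current and any(k.lower() in text.lower() for k in keywords):
            priorities[cid] = priorities.get(cid, 0.0) + boost
            if force_switch:
                return cid, priorities          # switch, cf. the FIG. 9C behaviour
    return current, priorities                  # no switch, cf. the FIG. 9B behaviour
```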
  • FIG. 10 illustrates an example in which a consumer of the output audio channel 52 is provided with information to allow that consumer to make an informed decision to switch audio channels 20 included within the sub-set 30 and the output audio channel 52.
  • In some examples, some form of indication is given to indicate a change in activity status. For example, if a particular participant begins to talk or there is a second separate discussion ongoing, the consumer of the original output audio channel 52 is made aware of this.
  • A suitable indicator could for example be an audible indicator that is added to the output audio channel 52. In some examples, each participant may have an associated different tone and a beep with a particular tone may indicate which participant has begun to speak. Alternatively, an indicator could be a visual indicator in a user input interface.
  • In the example illustrated in FIG. 10, the background audio is adapted to provide an audible indication. Initially, the consumer listening to the output audio channel 52 hears the audio channel 201 associated with a first participant's voice (User A voice). If a second audio channel 20 is mixed with the audio channel 201, then it may, for example, be an audio channel 202 that captures the ambient audio of the first participant (User A ambience). At time T1 a second participant, User B, begins to talk. This does not initiate a switch of prioritization 32 sufficient to change the sub-set 30. The primary audio channel 20 in the sub-set 30 and the output audio channel 52 remains the audio channel 201. However, an indication is provided to indicate to the consumer of the output audio channel 52 that there is an alternative, available, audio channel 203. The indication is provided by mixing the primary audio channel 201 with an additional audio channel 20 associated with the User B. For example, the additional audio channel 20 can be an attenuated version of the audio channel 203 or can be an ambient audio channel 204 for the User B (User B ambience). In this example, the second audio channel 202 is replaced by the additional audio channel 204.
  • The consumer of the output audio channel 52 can then decide whether or not they wish to cause a change in the prioritization 32 to prioritize the audio channel 203 associated with the User B above the audio channel 201 associated with the User A. If this change in prioritization occurs then there is a switch in the primary audio channel within the sub-set 30 and the output audio channel 52 from being the audio channel 201 to being the audio channel 203. In the example illustrated, the consumer does not make this switch. The switch does however occur automatically when the User A stops talking at time T2.
  • In the example of FIG. 10, referring back to the example of FIG. 3, the background audio B can be included and/or varied as an indication to the consumer of the output audio channel 52 that an alternative audio channel 20 is available for selection.
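  • Purely for illustration, the audible indication of FIG. 10 could be produced by mixing a quiet cue derived from the newly active channel (its ambience or an attenuated copy of it) into the output audio channel; the cue gain below is an assumption made only for the example.

```python
import numpy as np

def add_availability_cue(primary, cue, cue_gain=0.15):
    """Mix a quiet cue (e.g. another participant's ambience or an attenuated
    copy of their channel) into the output so the listener is made aware that
    an alternative audio channel has become active."""
    return np.asarray(primary) + cue_gain * np.asarray(cue)
```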
  • FIG. 11A schematically illustrates audio rendered to a participant (User 5) at an output end-point 204s of the system 200 (not illustrated) that is configured for rendering spatial audio. In accordance with the preceding examples, the audio output at the end-point 204s has multiple rendered sound sources associated with audio channels 201, 202, 203, 204 at different locations. FIG. 11A illustrates that even with the presence in the system 200 (not illustrated) of an output end-point 204m (FIG 11B) that is not configured for spatial audio rendering, there may be no need to reduce the immersive capabilities or experience at the output end-points 204s of the system 200 that are configured for rendering spatial audio.
  • FIG. 11B schematically illustrates audio rendered to a participant (User 1) at an output end-point 204m of the system 200 (not illustrated) that is not configured for rendering spatial audio. In accordance with the preceding examples, the audio output at the end-point 204m provided by the output audio channel 52 has a single monophonic output audio channel 52 that is based on the sub-set 30 of selected audio channels 20 and has good intelligibility. In the example illustrated, the audio channel 202 is the primary audio channel that is included in the sub-set 30 and the output audio channel 52.
  • The apparatus 10 can be configured to automatically switch the composition of the audio channels 20 mixed to form the output audio channel 52 in dependence upon an adaptive prioritization 32. Additionally or alternatively, in some examples, the switching can be effected manually by the consumer at the end-point 204m using a user interface which includes a user input interface 90.
  • In the example illustrated in FIG. 11B, the device at the output end-point 204m, which in some examples may be the apparatus 10, comprises a user input interface 90 for controlling prioritization 32 of the N audio channels 20. For example, the user input interface 90 can be configured to highlight or label selected ones of the N audio channels 20 for selection. The user input interface 90 can be used to control if and to what extent manual or automatic switching occurs to produce the output audio channel 52 from selected ones of the audio channels 20. An adaptation of the prioritization 32 can cause an automatic switching or can cause a prompt to a consumer for manual switching.
  • In some examples, the user input interface 90 can control if and the extent to which prioritization 32 depends upon one or more of timing of content 34 of at least one of the N audio channels 20 relative to timing of content 34 of at least another one of the N audio channels 20; history of content 34 of at least one of the N audio channels 20; mapping to a particular person an identified voice in content 34 of at least one of the N audio channels 20; detection that content 34 of at least one of the N audio channels 20 is voice content; and/or detection that content 34 of at least one of the N audio channels comprises an identified word.
  • In the example illustrated, within the user input interface 90, there is an option 914 that allows the participant, User 1, to select the audio channel 204 as a replacement primary audio channel that is included in the sub-set 30 and the output audio channel 52 instead of the audio channel 202. There is also an option 913 that allows User 1 to select the audio channel 203 as a replacement primary audio channel that is included in the sub-set 30 and the output audio channel 52 instead of the audio channel 202.
  • In some but not necessarily all examples, the user input interface 90 can provide a visual spatial representation of the N audio channels 20 and indicate which of the N audio channels 20 are comprised in the sub-set 30 of M audio channels.
  • The user input interface 90 can also indicate which of the N audio channels are not comprised in the sub-set 30 of M audio channels and which, if any, of these are active.
  • In some, but not necessarily all, examples, the user input interface 90 may provide textual information about an audio channel 20 that is active and available for selection. For example, speech-to-text algorithms may be utilized to convert speech within that audio channel 20 into an alert displayed at the user input interface 90. Referring back to the example illustrated in FIG. 9A, the apparatus 10 may be configured to cause the user input interface 90 to provide an option to a consumer of the output audio channel 52 that enables that consumer to switch audio channels 20 included within the sub-set 30 and output audio channel 52. In this example, the keyword is "Dave" and the textual output provided by the user input interface 90 could, for example, say "option to switch to User 5 who addressed you and said: 'In our last teleco Dave made an interesting'". If the consumer, Dave, then selects the option to switch, the sub-set 30 and the output audio channel 52 then includes the audio channel 205 from the User 5 and starts from the position "In our last teleco Dave made an interesting...". A memory 82 (not illustrated in the FIG) could be used to store the audio channel 205 from the User 5.
  • In the preceding examples, the apparatus 10 can be permanently operational to perform the selection of the sub-set 30 of audio channels 20 used to produce the output audio channel 52. However, in other examples the apparatus 10 has a state in which it is operational in this way and a state in which it is not operational in this way, and it can transition between these states, for example when a trigger event is or is not detected.
    The apparatus 10 can be configured to control mixing, by the mixer 50, of the N audio channels 20 to produce M audio channels in response to a trigger event.
  • One example of a trigger event is conflict between audio channels 20. An example of detecting conflict would be when there is overlapping speech in audio channels 20.
  • Another example of a trigger event is a reduction in communication bandwidth for receiving the audio channels 20 below a threshold value. In this example, the value of M can be dependent upon the available bandwidth.
  • Another example of a trigger event is a reduction in communication bandwidth for providing the output audio channel 52 beneath a threshold value. In this example, the value of M can be dependent upon the available bandwidth.
  • In some examples, the apparatus 10 can also be configured to control the transmission of audio channels 20 to it and to reduce the number of audio channels received, by N-M, from N to M, wherein only the M audio channels that may be required for mixing to produce the output audio channel 52 are received.
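  • The bandwidth-dependent choice of M can be sketched, purely illustratively, as below; the per-channel bit rate and the minimum of one monophonic channel are assumptions made only for the example.

```python
def channels_for_bandwidth(available_kbps, per_channel_kbps=24.0, n=8, minimum=1):
    """Decide how many audio channels (M) can be received or forwarded given
    the currently available communication bandwidth, never exceeding N and
    never falling below a minimum (e.g. one monophonic channel)."""
    m = int(available_kbps // per_channel_kbps)
    return max(min(m, n), minimum)
```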
  • FIG. 12 illustrates an example of a method 100 that can for example be performed by the apparatus 10. The method comprises, at block 102, receiving at least N audio channels 20 where each of the N audio channels 20 can be rendered as a different audio source.
  • The method 100 comprises, at block 104, controlling mixing of the N audio channels 20 to produce at least an output audio channel 52, wherein the mixer 50 selects a sub-set 30 of at least M audio channels from the N audio channels 20 in dependence upon prioritization 32 of the N audio channels 20, wherein the prioritization 32 is adaptive and depends at least upon a content 34 of one or more of the N audio channels 20. The method 100 further comprises, at block 106, causing rendering of at least the output audio channel 52.
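  • A compact, non-limiting sketch of method 100 is given below; the energy-plus-keyword priority measure, the keyword itself and the simple averaging mix are assumptions chosen for the example and are not mandated by the method.

      # Minimal sketch (assumed priority measure) of blocks 102-106 of method 100.
      import numpy as np

      def priority(samples, transcript, keyword="Dave"):
          """Adaptive, content-dependent priority: signal energy plus a bonus if a keyword is present."""
          activity = float(np.mean(samples ** 2))
          bonus = 1.0 if keyword.lower() in transcript.lower() else 0.0
          return activity + bonus

      def mix_output(channels, transcripts, m=1):
          """Select the M highest-priority channels (the sub-set) and mix them into one output channel."""
          scores = [priority(c, t) for c, t in zip(channels, transcripts)]
          selected = sorted(range(len(channels)), key=lambda i: scores[i], reverse=True)[:m]
          output = sum(channels[i] for i in selected) / len(selected)
          return output, selected

      # Example: N = 3 channels of test audio; only the highest-priority one reaches the output.
      rng = np.random.default_rng(0)
      channels = [rng.standard_normal(480) * gain for gain in (0.1, 0.8, 0.3)]
      output, selected = mix_output(channels, ["", "hello everyone", ""], m=1)
      print("selected channel indices:", selected)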
  • FIG. 13 illustrates a method 110 for producing the output audio channel 52. This method broadly corresponds to the method previously described with reference to FIG. 6.
  • At block 112, the method 110 comprises obtaining spatial audio signals from at least two sources as distinct audio channels 20. At block 114, the method 110 comprises determining temporal activity of each of the spatial audio signals (of the two audio channels 20) and selecting at least one spatial audio signal (audio channel 20) for mono downmix (for inclusion within the sub-set 30 and the output audio channel 52) for the duration of its activity. At block 116, the method 110 comprises determining a content-based priority for at least one of the spatial audio signals (audio channels 20) for temporarily altering a previous selection. At block 118, the method 110 comprises determining a first mono downmix (sub-set 30 and output audio channel 52) based on at least one of the prioritized spatial audio signals (audio channels 20). The output audio channel 52 is based upon the selected sub-set M, which is in turn based upon the prioritization 32. Then, at block 120, the method 110 provides the first mono downmix (the output audio channel 52) to the participant for listening. That is, it provides the output audio channel 52 for rendering.
  • It will therefore be appreciated that the prioritization 32 determined at block 116 is used to adaptively adjust selection of the sub-set 30 of M audio channels 20 used to produce the output audio channel 52.
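  • One way in which a content-based priority could temporarily override an activity-based selection is sketched below; the override threshold and the scalar content-priority values are assumptions of the sketch.

      # Minimal sketch (assumed threshold) of block 116: a content-based priority
      # temporarily overriding the activity-based selection of the primary channel.
      def select_primary(current, activity, content_priority, override_threshold=0.9):
          """Keep `current` while it is active, unless another channel's content-based
          priority exceeds the threshold, in which case that channel is selected temporarily."""
          best = max(range(len(content_priority)), key=lambda i: content_priority[i])
          if best != current and content_priority[best] >= override_threshold:
              return best                          # temporary, content-driven override
          if activity[current]:
              return current                       # ongoing activity keeps the previous selection
          active = [i for i, is_active in enumerate(activity) if is_active]
          return active[0] if active else current

      # Example: channel 0 is talking, but channel 2 addresses the listener by name.
      print(select_primary(current=0, activity=[True, False, True], content_priority=[0.2, 0.0, 0.95]))   # -> 2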
  • FIG. 14 illustrates an example in which the audio channel 203 is first selected, based on prioritization, as the primary audio channel in the output audio channel 52. In this example, at this time, the output audio channel 52 does not comprise the audio channel 204 or 205. Until the activity in the selected audio channel 203 ends, the audio channel 203 remains prioritized. There is no change to the selection of the sub-set 30 of M audio channels until the activity in the audio channel 203 ends. When the activity in the audio channel 203 ends then a new selection process can occur based upon the prioritization 32 of other channels. In this example there is a selection grace period after the end of activity in the audio channel 203. If there is resumed activity in the audio channel 203 during this selection grace period then the audio channel 203 will be re-selected as the primary channel to be included in the sub-set 30 and the output audio channel 52. Thus during the selection grace period the audio channel 203 can have a higher prioritization and be selected if it becomes active. After the selection grace period expires, the prioritization of the audio channel 203 can be decreased.
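  • The grace-period behaviour can be sketched as follows; the two-second grace period and the simple activity bookkeeping are assumptions made for the example, not values defined above.

      # Minimal sketch (assumed grace-period length) of the re-selection behaviour of FIG. 14.
      def reselect(previous, now, last_active_time, activity, grace_period=2.0):
          """Re-select the previous primary channel if it resumes activity within the grace period
          after its activity ended; otherwise select another active channel, if any."""
          within_grace = (now - last_active_time[previous]) <= grace_period
          if activity[previous] and within_grace:
              return previous                               # prioritized re-selection
          candidates = [i for i, is_active in enumerate(activity) if is_active and i != previous]
          return candidates[0] if candidates else previous

      # Example: the previously selected channel stopped 1.5 s ago and resumes, so it is re-selected.
      print(reselect(previous=2, now=11.5, last_active_time={2: 10.0}, activity=[False, False, True, True]))   # -> 2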
  • FIG. 15 illustrates an example of a method 130 that broadly corresponds to the method previously described in relation to FIG. 8. At block 132, the method 130 comprises obtaining spatial audio signals (audio channels 20) from at least two sources. This corresponds to the receiving of at least two audio channels 20. At block 134, the method 130 determines a first mono downmix (sub-set 30 and output audio channel 52) based on at least one of the spatial audio signals (audio channels 20). Next, at block 136, the method 130 comprises determining at least one second mono downmix (sub-set 80 and additional audio channel) based on at least one of the spatial audio signals (audio channels 20) not present in the first mono downmix. At block 138, the first mono downmix is provided to a participant for listening as the output audio channel 52. At block 140, the second mono downmix is provided to a memory for storage.
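  • A minimal sketch of method 130 is given below, assuming a simple averaging downmix; in practice the choice of which channels form the first downmix would follow the prioritization 32, and the list used for storage merely stands in for the memory of block 140.

      # Minimal sketch (assumed averaging) of method 130: a first mono downmix for
      # listening and a second mono downmix of the remaining channels for storage.
      import numpy as np

      def two_downmixes(channels, selected_indices):
          """Split the N channels into a rendered downmix and a stored downmix of the remainder."""
          selected = [channels[i] for i in selected_indices]
          rest = [c for i, c in enumerate(channels) if i not in selected_indices]
          first = np.mean(selected, axis=0) if selected else np.zeros_like(channels[0])
          second = np.mean(rest, axis=0) if rest else np.zeros_like(channels[0])
          return first, second

      stored = []                                   # stands in for the memory used at block 140
      channels = [np.full(4, gain) for gain in (1.0, 2.0, 3.0)]
      first_downmix, second_downmix = two_downmixes(channels, selected_indices=[1])
      stored.append(second_downmix)                 # block 140: the second downmix goes to storage
      print(first_downmix, second_downmix)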
  • In any of the examples, when an audio channel 20 associated with a particular input end-point 206 is selected for inclusion within the sub-set 30 of audio channels used to create the output audio channel 52, then this information may be provided as feedback at an output end-point 204 associated with that included input end-point 206.
  • In any of the examples, when an audio channel 20 associated with a particular input end-point 206 is not selected for inclusion within the sub-set 30 of audio channels used to create the output audio channel 52 at a particular output end-point 204, then this information may be provided as feedback at an output end-point 204 associated with that excluded input end-point 206. The information can, for example, identify the input end-points 206 not selected for inclusion for rendering at a particular identified output end-point 204.
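  • By way of example only, such inclusion/exclusion feedback could be generated as sketched below; the end-point identifiers and the message wording are assumptions for the sketch.

      # Minimal sketch (assumed message format) of feedback to input end-points about
      # whether their audio is rendered at an identified output end-point.
      def feedback_messages(included_inputs, all_inputs, output_endpoint):
          """Tell each input end-point whether its audio is mixed into the identified output end-point."""
          messages = {}
          for endpoint in all_inputs:
              state = "included in" if endpoint in included_inputs else "not rendered at"
              messages[endpoint] = f"your audio is {state} output end-point {output_endpoint}"
          return messages

      print(feedback_messages(included_inputs={"input-1"}, all_inputs=["input-1", "input-2"], output_endpoint="output-A"))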
  • FIG. 16 illustrates an example of a controller 70. Implementation of a controller 70 may be as controller circuitry. The controller 70 may be implemented in hardware alone, have certain aspects in software including firmware alone or can be a combination of hardware and software (including firmware).
  • As illustrated in FIG. 16 the controller 70 may be implemented using instructions that enable hardware functionality, for example, by using executable instructions of a computer program 76 in a general-purpose or special-purpose processor 72 that may be stored on a computer readable storage medium (disk, memory etc) to be executed by such a processor 72.
  • The processor 72 is configured to read from and write to the memory 74. The processor 72 may also comprise an output interface via which data and/or commands are output by the processor 72 and an input interface via which data and/or commands are input to the processor 72.
  • The memory 74 stores a computer program 76 comprising computer program instructions (computer program code) that controls the operation of the apparatus when loaded into the processor 72. The computer program instructions, of the computer program 76, provide the logic and routines that enable the apparatus to perform the methods previously illustrated and/or described. The processor 72, by reading the memory 74, is able to load and execute the computer program 76.
  • The apparatus 10 therefore comprises:
    • at least one processor 72; and
    • at least one memory 74 including computer program code
    • the at least one memory 74 and the computer program code configured to, with the at least one processor 72, cause the apparatus 10 at least to perform:
      • receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
      • controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels; and
      • causing rendering of at least the output audio channel.
  • As illustrated in FIG. 17, the computer program 76 may arrive at the apparatus 10 via any suitable delivery mechanism 78. The delivery mechanism 78 may be, for example, a machine readable medium, a computer-readable medium, a non-transitory computer-readable storage medium, a computer program product, a memory device, a record medium such as a Compact Disc Read-Only Memory (CD-ROM) or a Digital Versatile Disc (DVD) or a solid state memory, an article of manufacture that comprises or tangibly embodies the computer program 76. The delivery mechanism may be a signal configured to reliably transfer the computer program 76. The apparatus 10 may propagate or transmit the computer program 76 as a computer data signal.
  • Computer program instructions for causing an apparatus to perform at least the following or for performing at least the following:
    • receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
    • controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels; and
    • causing rendering of at least the output audio channel.
  • The computer program instructions may be comprised in a computer program, a non-transitory computer readable medium, a computer program product, a machine readable medium. In some but not necessarily all examples, the computer program instructions may be distributed over more than one computer program.
  • Although the memory 74 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/dynamic/cached storage.
  • Although the processor 72 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable. The processor 72 may be a single core or multi-core processor.
  • References to 'computer-readable storage medium', 'computer program product', 'tangibly embodied computer program' etc. or a 'controller', 'computer', 'processor' etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc.
  • As used in this application, the term 'circuitry' may refer to one or more or all of the following:
    (a) hardware-only circuitry implementations (such as implementations in only analog and/or digital circuitry) and
    (b) combinations of hardware circuits and software, such as (as applicable):
      (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and
      (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions and
    (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g. firmware) for operation, but the software may not be present when it is not needed for operation.
    This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
  • The blocks illustrated in the preceding Figs may represent steps in a method and/or sections of code in the computer program 76. The illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks and the order and arrangement of the blocks may be varied. Furthermore, it may be possible for some blocks to be omitted.
  • Where a structural feature has been described, it may be replaced by means for performing one or more of the functions of the structural feature whether that function or those functions are explicitly or implicitly described.
  • The above described examples find application as enabling components of:
    automotive systems; telecommunication systems; electronic systems including consumer electronic products; distributed computing systems; media systems for generating or rendering media content including audio, visual and audio visual content and mixed, mediated, virtual and/or augmented reality; personal systems including personal health systems or personal fitness systems; navigation systems; user interfaces also known as human machine interfaces; networks including cellular, non-cellular, and optical networks; ad-hoc networks; the internet; the internet of things; virtualized networks; and related software and services.
  • The term 'comprise' is used in this document with an inclusive not an exclusive meaning. That is any reference to X comprising Y indicates that X may comprise only one Y or may comprise more than one Y. If it is intended to use 'comprise' with an exclusive meaning then it will be made clear in the context by referring to "comprising only one ..." or by using "consisting".
  • In this description, reference has been made to various examples. The description of features or functions in relation to an example indicates that those features or functions are present in that example. The use of the term 'example' or 'for example' or 'can' or 'may' in the text denotes, whether explicitly stated or not, that such features or functions are present in at least the described example, whether described as an example or not, and that they can be, but are not necessarily, present in some of or all other examples. Thus 'example', 'for example', 'can' or 'may' refers to a particular instance in a class of examples. A property of the instance can be a property of only that instance or a property of the class or a property of a sub-class of the class that includes some but not all of the instances in the class. It is therefore implicitly disclosed that a feature described with reference to one example but not with reference to another example, can where possible be used in that other example as part of a working combination but does not necessarily have to be used in that other example.
  • Although examples have been described in the preceding paragraphs with reference to various examples, it should be appreciated that modifications to the examples given can be made without departing from the scope of the claims.
  • Features described in the preceding description may be used in combinations other than the combinations explicitly described above.
  • Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not.
  • Although features have been described with reference to certain examples, those features may also be present in other examples whether described or not.
  • The term 'a' or 'the' is used in this document with an inclusive not an exclusive meaning. That is any reference to X comprising a/the Y indicates that X may comprise only one Y or may comprise more than one Y unless the context clearly indicates the contrary. If it is intended to use 'a' or 'the' with an exclusive meaning then it will be made clear in the context. In some circumstances the use of 'at least one' or 'one or more' may be used to emphasize an inclusive meaning but the absence of these terms should not be taken to infer any exclusive meaning.
  • The presence of a feature (or combination of features) in a claim is a reference to that feature or (combination of features) itself and also to features that achieve substantially the same technical effect (equivalent features). The equivalent features include, for example, features that are variants and achieve substantially the same result in substantially the same way. The equivalent features include, for example, features that perform substantially the same function, in substantially the same way to achieve substantially the same result.
  • In this description, reference has been made to various examples using adjectives or adjectival phrases to describe characteristics of the examples. Such a description of a characteristic in relation to an example indicates that the characteristic is present in some examples exactly as described and is present in other examples substantially as described.
  • Whilst endeavoring in the foregoing specification to draw attention to those features believed to be of importance it should be understood that the Applicant may seek protection via the claims in respect of any patentable feature or combination of features hereinbefore referred to and/or shown in the drawings whether or not emphasis has been placed thereon.

Claims (15)

  1. An apparatus comprising means for:
    receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
    controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon prioritization of the N audio channels, and wherein the prioritization is adaptive depending at least upon a changing content of one or more of the N audio channels; and
    providing for rendering at least the output audio channel.
  2. An apparatus as claimed in claim 1, comprising means for: automatically controlling mixing of the N audio channels to produce at least the output audio channel, in dependence upon time-variation of content of one or more of the N audio channels.
  3. An apparatus as claimed in claim 1 or 2, wherein the N audio channels are N spatial audio channels where each of the N spatial audio channels can be rendered as a differently positioned audio source.
  4. An apparatus as claimed in any preceding claim, wherein N is at least two and wherein M is one, the output audio channel being a monophonic audio output channel.
  5. An apparatus as claimed in any preceding claim, comprising means for analyzing the N audio channels to adapt a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels.
  6. An apparatus as claimed in any preceding claim, wherein prioritization depends upon one or more of:
    timing of content of at least one of the N audio channels relative to timing of content of at least another one of the N audio channels;
    history of content of at least one of the N audio channels;
    mapping to a particular person, an identified voice in content of at least one of the N audio channels;
    detection that content of at least one of the N audio channels is voice content;
    detection that content of at least one of the N audio channels comprises an identified word.
  7. An apparatus as claimed in any preceding claim, wherein controlling mixing of the N audio channels to produce at least an output audio channel, comprises:
    selecting a first sub-set of the N audio channels to be mixed to provide background audio;
    selecting a second sub-set of the N audio channels to be mixed to provide foreground audio that is for rendering at greater loudness than the background audio, wherein the selection of the first sub-set and selection of the second sub-set is dependent upon the prioritization of the N audio channels; and
    mixing the background audio and the foreground audio to produce the output audio channel.
  8. An apparatus as claimed in any preceding claim, comprising means for controlling mixing of the N audio channels to produce M audio channels in response to a communication bandwidth for receiving the audio channels or for providing output audio signals falling beneath a threshold value.
  9. An apparatus as claimed in any preceding claim, comprising means for controlling mixing of the N audio channels to produce M audio channels when there is conflict between a first audio channel of the N audio channels and a second audio channel of the N audio channels, wherein the first audio channel is included within the M audio channels and the second audio channel is not included within the M audio channels, wherein over-talking is an example of conflict.
  10. An apparatus as claimed in any preceding claim, wherein the audio channels of the N audio channels that are not the selected M audio channels are available for later rendering.
  11. An apparatus as claimed in any preceding claim, comprising a user input interface for controlling prioritization of the N audio channels.
  12. An apparatus as claimed in any preceding claim, comprising a user input interface, wherein the user input interface provides a spatial representation of the N audio channels and indicates which of the N audio channels are comprised in the sub-set of M audio channels.
  13. A multi-party, live communication system that enables live audio communication between multiple remote participants using at least the N audio channels wherein different ones of the multiple remote participants provide audio input for different ones of the N audio channels, wherein the system comprises the apparatus as claimed in any of claims 1 to 12.
  14. A method comprising:
    receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
    controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels; and
    rendering at least the output audio channel.
  15. A computer program that when run on one or more processors enables:
    controlling mixing of N received audio channels, where each of the N audio channels can be rendered as a different audio source, to produce at least an output audio channel for rendering, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels.
EP21154652.8A 2021-02-02 2021-02-02 Selecton of audio channels based on prioritization Withdrawn EP4037339A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP21154652.8A EP4037339A1 (en) 2021-02-02 2021-02-02 Selecton of audio channels based on prioritization


Publications (1)

Publication Number Publication Date
EP4037339A1 true EP4037339A1 (en) 2022-08-03

Family

ID=74505017

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21154652.8A Withdrawn EP4037339A1 (en) 2021-02-02 2021-02-02 Selecton of audio channels based on prioritization

Country Status (1)

Country Link
EP (1) EP4037339A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110040397A1 (en) * 2009-08-14 2011-02-17 Srs Labs, Inc. System for creating audio objects for streaming
US20150049868A1 (en) * 2012-03-23 2015-02-19 Dolby Laboratories Licensing Corporation Clustering of Audio Streams in a 2D / 3D Conference Scene
US20180190300A1 (en) * 2017-01-03 2018-07-05 Nokia Technologies Oy Adapting A Distributed Audio Recording For End User Free Viewpoint Monitoring


Similar Documents

Publication Publication Date Title
US10574828B2 (en) Method for carrying out an audio conference, audio conference device, and method for switching between encoders
EP3282669B1 (en) Private communications in virtual meetings
US9237238B2 (en) Speech-selective audio mixing for conference
CN110072021B (en) Method, apparatus and computer readable medium in audio teleconference mixing system
US20140218464A1 (en) User interface control in a multimedia conference system
EP3111627B1 (en) Perceptual continuity using change blindness in conferencing
EP2378768A1 (en) Multi-channel audio signal processing method, device and system
US20220165281A1 (en) Audio codec extension
US11115444B2 (en) Private communications in virtual meetings
WO2022124040A1 (en) Teleconference system, communication terminal, teleconference method, and program
EP4037339A1 (en) Selecton of audio channels based on prioritization
EP4078998A1 (en) Rendering audio
CN111951821B (en) Communication method and device
JP2009027239A (en) Telecommunication conference apparatus
US11562761B2 (en) Methods and apparatus for enhancing musical sound during a networked conference
EP3031048B1 (en) Encoding of participants in a conference setting
EP4354841A1 (en) Conference calls
JPH1188513A (en) Voice processing unit for inter-multi-point communication controller
WO2021123495A1 (en) Providing a translated audio object
JP2022076189A (en) Voice processing device, voice processing method, voice processing system, and terminal
JP2022093326A (en) Communication terminal, remote conference method, and program

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20230204