WO2024083906A1 - Method and apparatus for hosting a conference call - Google Patents

Method and apparatus for hosting a conference call

Info

Publication number
WO2024083906A1
Authority
WO
WIPO (PCT)
Prior art keywords
user device
audio
location
user devices
devices
Prior art date
Application number
PCT/EP2023/078967
Other languages
French (fr)
Inventor
Matthew Scheybeler
Abigail BETLEY
Samuel SINGLEWOOD
Original Assignee
Riff Technologies Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Riff Technologies Limited filed Critical Riff Technologies Limited
Publication of WO2024083906A1 publication Critical patent/WO2024083906A1/en


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/56Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/002Applications of echo suppressors or cancellers in telephonic connections
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/56Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • H04M3/569Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants using the instant speaker's algorithm

Definitions

  • the present disclosure relates generally to a method and apparatus for hosting or participating in a conference call.
  • Typical conference call systems are designed for a single participant in a room to connect to a conference call with other remote participants. When multiple participants connect to the same conference call from the same room, audio issues may occur.
  • the method comprises connecting first, second and third user devices (or 'clients', or 'endpoints', or 'participant devices') to the conference call.
  • the method may comprise receiving a respective location indication indicative of a location of each of the first, second and third user devices.
  • the location indication of a particular user device comprises: an indication that a signal emitted by another device has been captured by the particular user device; or a signal emitted by another device and captured by the particular user device.
  • the signal emitted by the other device is an audio signal played by the other device.
  • the signal emitted by the other device is an ultrasound signal.
  • the location indication of a particular user device comprises a list of other devices detected by the particular user device.
  • the other devices are detected using a wireless personal area network (WPAN) technology.
  • the WPAN technology is Bluetooth (RTM) or Institute of Electrical and Electronics Engineers (IEEE) 802.15.1.
  • the list of other devices is a list of devices connected to a same local area network (LAN) as the particular user device.
  • the location indication of a particular user device comprises one or more coordinates of the particular user device.
  • the location indication of a particular user device comprises a network address of the particular user device.
  • the location indication of a particular user device comprises a text string identifying the location of the particular user device, such as 'In the office' or 'At HQ' or 'At home' or 'Working from home'.
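The location-indication variants listed above can be viewed as a tagged union. Below is a minimal TypeScript sketch of such a structure; the type and field names are illustrative assumptions, not part of the disclosure.

```typescript
// Hypothetical shape of the location indication a user device might send
// to the server; each variant mirrors one of the options listed above.
type LocationIndication =
  | { kind: 'signal-detected'; emitterDeviceId: string }         // another device's signal was captured
  | { kind: 'captured-signal'; samples: Float32Array }           // the captured signal itself
  | { kind: 'nearby-devices'; deviceIds: string[] }              // e.g. WPAN/Bluetooth scan or same-LAN list
  | { kind: 'coordinates'; latitude: number; longitude: number }
  | { kind: 'network-address'; address: string }
  | { kind: 'label'; text: string };                             // e.g. 'In the office' or 'At home'
```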
  • the method may comprise, based on the respective location indications, determining that the location of the first user device and the location of the second user device are the same and that the location of the third user device is different to that of the first and second user devices.
  • the first and second user devices are close enough to each other for a user of the first user device to overhear sound played by the second user device or for the first user device to capture sound played by the second user device.
  • the determination may be made with a (slightly) higher or lower level of precision, and therefore the determination that the locations of the first and second user devices are the same may instead be a determination that they are in a same room, on a same floor, in a same building, or outside but within a predetermined distance of each other, for example.
  • the method comprises receiving an audio frame captured by at least one of the first, second and third user devices and, optionally, a timestamp for the audio frame.
  • the method may comprise, responsive to determining that the location of the first user device and the location of the second user device are the same, determining, for each of the first and second user devices, a respective playback instruction for the audio frame.
  • the respective playback instruction may be determined to reduce unsynchronised playback of the audio frame at the first and second user devices.
  • the playback instruction for a particular audio frame and a particular user device comprises one or more of: a delay to apply to playback of the particular audio frame on the particular user device; an instruction to play the particular audio frame at the particular user device; an instruction not to play the particular audio frame at the particular user device; an instruction to reduce or increase a playback volume of the particular audio frame at the particular user device; or an instruction to buffer the particular audio frame at the particular user device and, optionally, a period during which to buffer before beginning playback at the particular user device.
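As an illustration of how such a playback instruction might be encoded, here is a minimal TypeScript sketch; the message shape and field names are assumptions, not taken from the disclosure.

```typescript
// Hypothetical per-frame playback instruction the server could attach to
// each forwarded audio frame (sent in the same message or separately).
interface PlaybackInstruction {
  frameId: string;           // which audio frame the instruction applies to
  action: 'play' | 'skip';   // play the frame, or do not play it
  delayMs?: number;          // delay to apply before playing the frame
  volumeScale?: number;      // e.g. 0.5 to reduce volume, 1.5 to increase it
  bufferMs?: number;         // period to buffer before beginning playback
}
```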
  • the method may comprise sending, to each of the first and second user devices, the audio frame captured by the third user device and the respective playback instruction for the audio frame.
  • the audio frame and playback instruction may be sent in a same message, or separately.
  • the method further comprises, subsequent to receiving the audio frame, receiving one or more subsequent audio frames captured by the third user device until a predetermined number of subsequent audio frames have been received.
  • the method further comprises repeating the determining of the respective playback instructions for the one or more subsequent audio frames.
  • the method further comprises sending, to each of the first and second user devices, the one or more subsequent audio frames and the respective playback instruction for the one or more subsequent audio frames.
  • the method further comprises receiving a respective network connection quality indication indicative of a network connection quality of each of the first and second user devices.
  • the determining, for each of the first and second user devices, of the respective playback instruction for the audio frame is based on the respective network connection quality indication of the first and second user devices.
  • the network connection quality indication comprises an indication of one or more of: a playout buffer status; packet loss; trip time; jitter; or latency.
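A minimal TypeScript sketch of such a quality report follows; the field names and units are assumptions based on the quantities listed above.

```typescript
// Hypothetical network connection quality indication sent by each client.
interface NetworkConnectionQuality {
  playoutBufferMs: number;    // current playout buffer depth
  packetLossFraction: number; // e.g. 0.02 for 2% packet loss
  roundTripTimeMs: number;    // trip time to the server and back
  jitterMs: number;           // variation in packet arrival times
  latencyMs: number;          // one-way transport latency estimate
}
```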
  • the method further comprises, subsequent to receiving the respective network connection quality indications, receiving a respective subsequent network connection quality indication indicative of the network connection quality of each of the first and second user devices.
  • the repeating of the determining of the respective playback instructions for at least one of the one or more subsequent audio frames is based on the subsequent network connection quality indications.
  • the method may comprise, responsive to determining that the location of the first user device and the location of the second user device are the same, selecting one of the first and second user devices as a preferred user device based on the respective audio frames.
  • the selecting may be based on the respective timestamps for the respective audio frames.
  • the method may comprise, responsive to determining that the location of the third user device is different to that of the first and second user devices, sending, to the third user device, the respective audio frame captured by the preferred user device.
  • the method further comprises, subsequent to receiving the respective audio frames, receiving one or more respective subsequent audio frames captured by each of the first and second user devices until a predetermined number of subsequent audio frames have been received.
  • the selecting of the preferred user device is further based on the one or more respective subsequent audio frames.
  • the method further comprises sending, to the third user device, the one or more respective subsequent audio frames captured by the preferred user device.
  • the method further comprises, subsequent to receiving the respective audio frames, receiving one or more respective subsequent audio frames captured by each of the first and second user devices.
  • the method further comprises selecting a different one of the first and second user devices as a subsequent preferred user device based on the one or more respective subsequent audio frames.
  • the method further comprises sending, to the third user device, the one or more respective subsequent audio frames captured by the subsequent preferred user device.
  • the respective audio frame and the one or more respective subsequent audio frames are faded together.
  • the method further comprises sending, to the third user device, an instruction to fade the respective audio frame and the one or more respective subsequent audio frames together.
  • the method further comprises connecting a fourth user device to the conference call.
  • the method further comprises receiving a respective location indication indicative of a location of the fourth user device.
  • the method further comprises, based on the location indications of the third and fourth user devices, determining that the location of the third user device and the location of the fourth user device are the same.
  • the method further comprises, responsive to determining that the location of the third user device and the location of the fourth user device are the same, determining, for each of the third and fourth user devices, a respective playback instruction for the audio frame captured by the preferred user device.
  • the method further comprises sending, to the third user device, the respective playback instruction for the audio frame.
  • the method further comprises sending, to the fourth user device, the audio frame captured by the preferred user device and the respective playback instruction for the audio frame.
  • the method further comprises receiving a respective network connection quality indication indicative of a network connection quality of each of the third and fourth user devices.
  • the determining, for each of the third and fourth user devices, of the respective playback instruction for the audio frame is based on the respective network connection quality indication of the third and fourth user devices.
  • selecting one of the first and second user devices as a preferred user device comprises: determining, based on the respective audio frames and, optionally, the respective subsequent audio frames, a respective audio signal captured by each of the first and second user devices over one or more given time intervals; and determining, for each of the first and second user devices, one or more respective loudnesses of the respective audio signals over each of the one or more given time intervals.
  • selecting one of the first and second user devices as a preferred user device further comprises determining a number of times each of the one or more respective loudnesses exceeds a predetermined loudness threshold.
  • the loudness for a particular time interval is based on a root mean square (RMS) of the respective audio signal over the particular time interval.
  • the loudness is weighted over time, with higher weighting applied to more recent samples.
  • the weighting is exponentially higher.
  • the respective audio signal captured by each of the first and second user devices over one or more given time intervals is determined based on a respective server transport latency for each of the first and second user devices.
  • selecting one of the first and second user devices as a preferred user device comprises determining which of the respective audio frames and, optionally, the respective subsequent audio frames, contain speech.
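The selection logic described in the preceding bullets can be sketched as follows. This is a minimal TypeScript illustration; the smoothing factor, loudness threshold and 'first'/'second' labels are assumed values not specified in the disclosure.

```typescript
// Root mean square of one interval of audio samples.
function rms(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

const ALPHA = 0.3;               // assumed weighting for more recent intervals
const LOUDNESS_THRESHOLD = 0.05; // assumed loudness threshold

// Exponentially weighted loudness over successive intervals, plus a count
// of how many intervals exceeded the threshold.
function scoreDevice(intervals: Float32Array[]): { loudness: number; exceedances: number } {
  let smoothed = 0;
  let exceedances = 0;
  for (const interval of intervals) {
    const value = rms(interval);
    smoothed = ALPHA * value + (1 - ALPHA) * smoothed;
    if (value > LOUDNESS_THRESHOLD) exceedances++;
  }
  return { loudness: smoothed, exceedances };
}

// The preferred device is the one whose recent audio was loud more often,
// falling back to the weighted loudness on a tie.
function selectPreferred(first: Float32Array[], second: Float32Array[]): 'first' | 'second' {
  const a = scoreDevice(first);
  const b = scoreDevice(second);
  if (a.exceedances !== b.exceedances) {
    return a.exceedances > b.exceedances ? 'first' : 'second';
  }
  return a.loudness >= b.loudness ? 'first' : 'second';
}
```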
  • the method further comprises: receiving a video frame captured by one of the first, second or third user devices; and sending, to another one of the first, second or third devices, the video frame captured by the one of the first, second or third devices.
  • the respective playback instruction is for the audio frame and the video frame.
  • the method is performed on a server.
  • the method is performed by a computing system.
  • a computing system configured to perform any of the methods described herein.
  • an apparatus comprising one or more processors and a memory in communication with the one or more processors, the memory storing instructions which, when executed by the one or more processors, cause the one or more processors to perform any of the methods described herein.
  • a computer readable medium comprising computer readable instructions which, when executed by one or more processors of a computing system, cause the computing system to perform any of the methods described herein.
  • the medium is non-transitory.
  • Fig. 1 is a block diagram of an example system architecture;
  • Fig. 2 is an example sequence diagram showing remote participants connecting to a conference call;
  • Fig. 3 is an example sequence diagram showing a first office participant connecting to a conference call;
  • Fig. 4 is an example sequence diagram showing a second office participant connecting to a conference call;
  • Fig. 5 is an example sequence diagram showing how a server can select one of two office participants' audio streams;
  • Figs. 6 and 7 are first and second parts of an example sequence diagram showing how a server can synchronise playout of audio streams for the two office participants;
  • Fig. 8 is an example sequence diagram showing how a server can respond to increased latency for one of the office participants;
  • Fig. 9 is an example flowchart of a first method of hosting a conference call;
  • Fig. 10 is an example flowchart of a method of participating in the conference call of Fig. 9;
  • Fig. 11 is an example flowchart of a second method of hosting a conference call;
  • Fig. 12 is an example flowchart of a method of participating in the conference call of Fig. 11;
  • Fig. 13 is an example flowchart of a third method of hosting a conference call; and
  • Fig. 14 is a block diagram of an example apparatus for implementing any of the methods described herein.
  • Participant device responsiveness can also degrade the experience by causing slow and/or inconsistent audio processing time.
  • the present disclosure provides techniques to mitigate the above issues by reducing unsynchronised playback and capture at participant devices that are located in a same room.
  • FIG. 1 is a block diagram of an example system architecture 100 for performing the methods described herein.
  • the system 100 comprises clients (or 'user devices') used by conference participants.
  • the user devices may include a first user device 105, 'Alice', a second user device 110, 'Bob', a third user device 115, 'Carl', and a fourth user device 120, 'Denise'.
  • the system 100 further comprises a server 125 that interacts with the user devices to provide the best audio experience for all conference participants.
  • the server 125 may, in particular, perform office echo suppression, selection of office audio and office playout synchronization.
  • the server 125 may perform any of the methods or steps described herein, even if such methods or steps are described as being performed by the system 100.
  • the participants connect to the server and join a conference room to send their audio and video streams, and receive audio and video streams from the other participants.
  • the system may allow office participants to connect to the server without receiving each other's audio streams, which immediately stops the initial continuous echo and feedback loops. This is explained in more detail in 'Office echo suppression' below.
  • the system may also analyse the audio streams from office participants and decide on the best audio stream to send to other remote participants to represent the audio within the office. If any original audio streams are used, the system synchronises these accordingly to stop any audio glitches or artefacts. This is explained in more detail in 'Selection of office audio stream' below.
  • the system may also continuously monitor participants' network conditions and decide on a best strategy to allow synchronised playout of remote participants' audio on office participants' devices. This is explained in more detail in 'Office playout synchronisation' below.
  • Carl and Denise are remote participants in different physical locations.
  • In-office participants Alice and Bob are in the same physical location.
  • Alice and Bob are likely to have the same internet connection and so are likely to have network issues at the same time.
  • Carl and Denise have separate internet connections and so are unlikely to have network issues at the same time.
  • while Fig. 1 refers to the user devices using the names 'Alice', 'Bob', 'Carl' and 'Denise', it will be understood that these names are provided merely for the purpose of illustration, and that any reference to 'Alice' and 'Bob' should be taken as a reference to user devices that are in a same location, and that any reference to user devices 'Carl' and 'Denise' should be taken as a reference to user devices that are not in a same location as each other or as any other user devices on the conference call.
  • while Fig. 1 shows four user devices, with two user devices in an office and two remote user devices, it will be understood that the present disclosure is flexible and applicable to conference calls with any number of user devices in any number of locations.
  • Fig. 2 is an example sequence diagram showing remote participants connecting to a conference call.
  • Carl subscribes to Denise's audio and video streams.
  • the server 125 starts sending Denise's audio and optionally video to Carl (207, 208).
  • Denise subscribes to Carl's audio and video streams.
  • the server 125 starts sending Carl's audio and optionally video to Denise (209, 210).
  • the system 100 may employ ultrasound. Each device may listen to the microphone stream and also send out an ultrasonic (outside of human hearing) pulse. Each pulse uses a unique sequence of frequencies that identifies a device. Therefore devices can keep a list of other nearby devices and provide this to the server 125 as a location indication of each listed device.
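As a concrete illustration of the ultrasonic-pulse idea, the following TypeScript sketch uses the standard Web Audio API to emit a near-ultrasonic pulse whose frequency sequence encodes a device identifier. The band (18-20 kHz), symbol duration and encoding scheme are assumptions; real speakers, microphones and codecs may attenuate these frequencies.

```typescript
// Emit a short pulse encoding a 12-bit device ID as four frequency symbols.
function emitIdPulse(ctx: AudioContext, deviceId: number): void {
  const osc = ctx.createOscillator();
  const gain = ctx.createGain();
  gain.gain.value = 0.05;              // keep the pulse quiet
  osc.connect(gain).connect(ctx.destination);

  const SYMBOL_SEC = 0.05;             // 50 ms per symbol (assumption)
  const BASE_HZ = 18000;               // bottom of the assumed band
  const STEP_HZ = 250;                 // 8 distinguishable symbols below 20 kHz
  let t = ctx.currentTime;
  for (let i = 0; i < 4; i++) {
    const digit = (deviceId >> (3 * i)) & 0x7;   // next octal digit of the ID
    osc.frequency.setValueAtTime(BASE_HZ + digit * STEP_HZ, t);
    t += SYMBOL_SEC;
  }
  osc.start(ctx.currentTime);
  osc.stop(t);
}
```

A listening device would run a matching detector (for example, per-symbol FFT peaks in the same band) over its microphone stream and report the decoded IDs to the server as a location indication.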
  • the system 100 may additionally or alternatively make use of available systems such as Bonjour or Bluetooth to find devices within Bluetooth range and nearby as well as using audible audio clips to determine whether echoes, distortion and/or feedback loops would be generated.
  • the system 100 may additionally or alternatively allow participants to set their location manually to control their behaviour on the system.
  • the system 100 may additionally or alternatively use ultrasound to determine a device audio latency which measures device audio capture and playout delays for that device on connecting to the system. This device delay is then considered in further delay calculations.
  • Fig. 3 is an example sequence diagram showing a first office participant, Bob, connecting to a conference call.
  • Bob's client connects (301) to the conference call at the server 125.
  • Bob starts sending audio (303) and optionally video (304) to the server 125.
  • the server 125 starts sending audio (307, 305) and optionally video (308, 306) to Bob from both Carl and Denise.
  • the server 125 informs (313) Bob of the best playout delay for both audio and video from Carl.
  • the server 125 informs (313) Bob of the best playout delay to use for both audio and video from Denise.
  • the server 125 starts sending audio (311, 309) and optionally video (312, 310) from Bob to both Carl and Denise.
  • the server requests that Bob's client sends periodic details on all incoming audio and/or video streams it is receiving from Carl and Denise. This includes information for each incoming stream about the current network latency, packet loss, audio buffer sizes, target audio buffer sizes, etc.
  • Fig. 4 is an example sequence diagram showing a second office participant, Alice, connecting to a conference call.
  • Alice's client connects (401) to the conference call at the server 125, informing (402) the server that it is in the location 'Office' .
  • Alice starts sending audio (403) and optionally video (404) to the server 125.
  • the server 125 starts forwarding audio (407, 405) and optionally video (408, 406) from Carl and Denise to Alice.
  • the server 125 informs (415) Alice of the best playout delay for use for both audio and video from Carl.
  • the server 125 informs (415) Alice of the best playout delay to use for both audio and video from Denise.
  • the server 125 starts forwarding (413) only video from Alice to Bob.
  • the server 125 starts forwarding (411, 409) Alice's audio to Carl and Denise.
  • the server 125 optionally starts forwarding (412, 410, 413) Alice's video to Carl and Denise and Bob.
  • the server 125 optionally starts forwarding (414) Bob's video to Alice.
  • the server 125 requests that Alice's client sends periodic details on all incoming audio and/or video streams it is receiving from Carl and Denise. This includes information for each incoming stream about the current network latency, packet loss, audio buffer sizes, and target audio buffer sizes.
  • the location of 'Office' is provided to the server 125 from the user device 105, 110.
  • the location of 'Office' is provided to the user device via a user interface, allowing users to set their location manually to control their behaviour on the system.
  • the location of 'Office' may automatically be sent to the server without user input, for example, if the user device is connected to office WiFi.
  • a coordinate may be provided to the server, for example a Global Positioning System (GPS) coordinate.
  • another nearby device may instead provide a location indication of the user device to the server 125.
  • the server 125 may use an active speaker detection algorithm using a number of attributes to determine the best audio to send to remote participants for participants from the same physical location/room. This generally correlates to the microphone closest to the speaker and with the strongest signal energy. In addition, the system 100 may highlight the speaking participant and make this information available to the remote participants.
  • the system uses network and device delay information to make sure playout of any audio stream is synchronised appropriately during playout on any remote participant devices.
  • the system 100 uses an audio window and RMS value for this window and determines the office participant with the loudest signal and forwards only this audio stream.
  • the system 100 can use audio signal processing to create a more continuous audio signal for remote participants as well as focussing on the relevant audio signal.
  • the system 100 may employ Network Time Protocol (NTP) timestamps to determine the offset of the audio signals received from each of the office participants and process this information accordingly when considering which audio stream is most appropriate.
  • Playout of the office audio on the client is synchronised using the NTP timestamps so that, when the system 100 switches between audio streams, there are no artefacts.
  • the system 100 also employs fading of streams to further reduce glitching and artefacts on switching. This can be performed on the client via a control message from the server 125 or entirely on the client.
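One way to realise the fade on switching is a simple linear crossfade over one frame, sketched below in TypeScript; the frame-level granularity and the linear ramp are assumptions.

```typescript
// Linearly crossfade from the outgoing stream's frame to the incoming
// stream's frame to avoid a discontinuity at the switch point.
function crossfade(outgoing: Float32Array, incoming: Float32Array): Float32Array {
  const n = Math.min(outgoing.length, incoming.length);
  const mixed = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    const w = n > 1 ? i / (n - 1) : 1; // ramp weight from 0 to 1
    mixed[i] = (1 - w) * outgoing[i] + w * incoming[i];
  }
  return mixed;
}
```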
  • the server 125 analyses (505) Bob's and Alice's audio streams to determine the best audio stream and/or combination of audio streams to forward to Carl and Denise from the location 'Office'.
  • the server 125 may determine the best audio stream/s by looking at the RMS of the audio signal over a certain time interval to determine the loudest signal at that time.
  • the server 125 may use a weighted value determined by how much of the window has an RMS signal over a certain threshold.
  • the server 125 may use an exponentially smoothed RMS value giving greater importance to closer signal values.
  • the server 125 may use any of the above techniques in combination with a voice activity detection algorithm.
  • the server 125 may use any of the above techniques in combination with a speech/non-speech classifier .
  • Fig. 5 is an example sequence diagram showing how a server 125 can select one of the two office participants' audio streams.
  • Alice sends (501) audio with RMS over duration to the server 125.
  • Alice sends (502) a network connection quality indication, such as information on transport link, to the server 125.
  • Bob sends (503) audio with RMS over duration to the server 125.
  • Bob sends (504) a network connection quality indication, such as information on transport link, to the server 125.
  • the server 125, using RMS over duration and the network connection quality indications, changes (505) the output device for the office from Alice to Bob.
  • the server 125 mutes (506) Carl's audio for Alice.
  • the server 125 unmutes (507) Carl's audio for Bob.
  • the server 125 sends (508) playout delays required for sync to Denise.
  • the server 125 mutes (509) Denise's audio for Alice.
  • the server 125 mutes (510) Denise's audio for Bob.
  • the server 125 sends (511) playout delays required for sync to Carl.
  • Playing out of audio is affected by network congestion, bursts and packet loss, which can cause audio glitching and/or artefacts. Algorithms exist to account for this by stretching, squashing and/or concealing the audio to provide the user with the best possible audio experience during the call at the lowest possible latency.
  • audio preferably needs to be played out at as close to the same time as possible from participant device speakers in the same location to avoid an unpleasant audio experience.
  • the participants in the same location may experience echoes/disjointed/distorted audio when this is not the case, whilst remote participants are likely to hear their own voice being echoed back to them.
  • the system 100 is able to solve both issues within one algorithm.
  • Office participant clients send information on their audio stack, playout buffers, and network status, including packet loss, trip time and latency, to the server.
  • the server 125 uses combined information from each office participant client to work out the likelihood that one of the clients will run out of audio to play due to current conditions.
  • the server 125 uses this information to decide on a common action for all office participant clients to play out the remote participant audio.
  • the server 125 sends a series of control messages to each client in the same location to equalise for network latency, device latency, and audio disruption across all devices.
  • Office participant clients can be on any network including over the internet.
  • the system 100 produces playout of audio across multiple devices within the same physical location at best latency with minimal audio issues.
  • the server 125 decides the best option would be to add a set playout delay to remote participant streams being played out on office participants' devices.
  • This playout delay buffers audio for a set amount of time in order for office participants' playout audio streams to be in sync.
  • a best/desired playout delay is set, which the server tries to keep all office participants' playout audio streams close to.
  • the playout delay is set for each remote participant and can be different across remote participants but will be the same value for the same remote participant on different office participants' devices. Where required, the playout delay is increased to allow for network interruptions and then allowed to correct once the network interruption has lapsed.
  • the best/desired playout adapts to a suitable value to reduce system volatility.
  • Best performance is produced when all office participant devices are playing out audio at real time with minimal stretching and squashing.
  • Adapting to average network conditions over a time window maximises playout at real time across devices.
  • the system 100 detects that at least one connected device has a consistently slower latency, most likely due to physical location, and a larger best/desired playout delay should be used.
  • the system 100 detects that at least one connected device has considerable jitter on packet arrival causing system volatility and a larger best/desired playout delay should be used.
  • the server 125 decides that the best option would be to set a playout delay and/or completely mute playout of one or more remote participant streams on one or more office participant devices.
  • the server 125 decides the best option would be to set a playout delay and/or reduce the volume of one or more remote participant streams on one or more office participant devices.
  • Figs. 6 and 7 are first and second parts of an example sequence diagram showing how a server can synchronise playout of audio streams for the two office participants.
  • the server continually receives information on the receiving, processing and playout of audio and video streams from Carl and Denise for both Alice and Bob. Using this information, the server calculates the best playout delay for Carl and Denise on both Alice and Bob.
  • the server uses an adaptive best/desired playout value, typically less than 200 ms, which is the default playout value for clients in the same location.
  • This small buffer of audio allows the playout of audio to be real-time and thus remain in sync between Bob and Alice.
  • the server monitors the information received from Alice and Bob and identifies any network issues which mean one or more of the audio streams being received on Alice and/or Bob require a playout delay larger than 200 ms.
  • the server identifies the smallest playout that will satisfy the current network interruptions for each remote participant audio stream being received by each office participant and sets it accordingly.
  • the client and server both use WebRTC to form a peer connection.
  • the client monitors the peer connection status using the WebRTC GetStats function and sends the "current buffer size" and "target buffer size" for each audio stream it is receiving to the server.
  • the "current buffer size" value indicates the current delay in seconds from receiving audio to playing out that audio.
  • the "target buffer size" value is the delay in seconds, from the audio being sent by the sender to it being played out by the client, that the client is aiming to achieve.
  • This "target buffer size" correlates to the "playout delay" setting that can be set using the WebRTC API on each receiving audio stream.
  • the server is able to set "playout delay" for each remote participant receiving stream and monitor the corresponding "target buffer size" value. If the audio receiver encounters network interruptions or device audio issues which mean it is no longer able to achieve the desired "playout delay", it will increase its "target buffer size" to the minimum value it is able to achieve under the current conditions whilst still enabling smooth playout of audio.
  • the server monitors this value and adjusts its "playout delay" values for each remote participant audio stream on each office participant accordingly.
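The client-side monitoring described above might look like the following TypeScript sketch. 'jitterBufferDelay' and 'jitterBufferEmittedCount' are real WebRTC inbound-rtp statistics whose ratio approximates the current buffer delay in seconds; the reporting message shape, and the use of the receiver's 'jitterBufferTarget' attribute (the standardised successor to Chromium's 'playoutDelayHint') to apply a server-chosen playout delay, are assumptions.

```typescript
// Report the current audio buffer delay for each incoming stream.
async function reportBufferSizes(
  pc: RTCPeerConnection,
  send: (msg: object) => void,
): Promise<void> {
  const stats = await pc.getStats();
  stats.forEach((report) => {
    if (report.type === 'inbound-rtp' && report.kind === 'audio') {
      const currentBufferSec =
        report.jitterBufferEmittedCount > 0
          ? report.jitterBufferDelay / report.jitterBufferEmittedCount
          : 0;
      send({ streamId: report.id, currentBufferSec });
    }
  });
}

// Apply a server-chosen playout delay (ms) to every receiving audio stream.
function applyPlayoutDelay(pc: RTCPeerConnection, delayMs: number): void {
  for (const receiver of pc.getReceivers()) {
    if (receiver.track && receiver.track.kind === 'audio') {
      // Cast: older TypeScript DOM typings may not declare this attribute.
      (receiver as any).jitterBufferTarget = delayMs;
    }
  }
}
```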
  • the server continually receives information on the transport latency from all participants, including both those situated remotely and in the office, and decides on an optimal best/desired playout delay.
  • This information contains participant-to-server transport latency as well as jitter on transport packet arrival and is analysed over a time window to provide a consistent estimate.
  • the server calculates an optimal best/desired playout that equates to the maximum total transport time from office participant to remote participant plus an additional padding amount, quantized to a multiple of the transport packet size.
  • the server monitors transport packet arrival jitter and increases the best/desired playout delay when this increases over a set of thresholds.
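A minimal TypeScript sketch of that calculation follows, with an assumed packet duration and padding amount:

```typescript
const PACKET_MS = 20;   // assumed transport packet duration (e.g. one Opus frame)
const PADDING_MS = 40;  // assumed additional padding

// Best/desired playout delay: maximum office-to-remote transport time plus
// padding, quantized up to a multiple of the packet duration.
function desiredPlayoutDelayMs(
  officeToServerMs: number[],
  serverToRemoteMs: number[],
): number {
  const maxTransport = Math.max(...officeToServerMs) + Math.max(...serverToRemoteMs);
  return Math.ceil((maxTransport + PADDING_MS) / PACKET_MS) * PACKET_MS;
}
```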
  • Fig. 8 is an example sequence diagram showing how a server can respond to increased latency for one of the office participants.
  • An effect of the present disclosure is the provision of an approach which does not require bespoke user devices, but allows colocated users to participate in conference calls, retaining full functionality of their devices without interfering with audio quality.
  • An effect of the present disclosure is the provision of an approach which requires no additional signal processing on the client or server.
  • An effect of the present disclosure is the provision of an approach which requires no additional data to be sent over a local network. It may not always be possible to send data over a local network, because local networks often block data sent between two local peers. The approach works wherever WebRTC works.
  • An effect of the present disclosure is the provision of an approach in which, because streams are not downmixed, all participants experience the best quality and latency, not the quality and latency of the worst remote participant.
  • An example flowchart of a first method S100a of hosting a conference call (or 'call', or 'videoconference', or 'audio call', or 'video call') is shown in Fig. 9.
  • the method may be performed by a computing system including, for example, any of the components of system 100 in Fig. 1.
  • the method may be performed by a server such as the server 125.
  • the method comprises first, second and third user devices (or 'clients', or 'endpoints') 105, 110, 115 connecting to the conference call.
  • the method comprises receiving a respective location indication indicative of a location of each of the first, second and third user devices 105, 110, 115.
  • the location indication of a particular user device 105 may comprise an indication that a signal emitted by another device 110 has been captured by the particular user device 105, or a signal emitted by another device 110 and captured by the particular user device 105.
  • the signal emitted by the other device 110 may be an audio signal played by the other device 110.
  • the signal emitted by the other device 110 may be an ultrasound signal, or may be audible.
  • the signal emitted by the other device 110 may contain a unique signature of the other device 110.
  • the location indication of a particular user device 105 may comprise a list of other devices 110 detected by the particular user device.
  • the other devices 110 may be detected using a wireless personal area network (WPAN) technology.
  • the WPAN technology may be Bluetooth (RTM) or Institute of Electrical and Electronics Engineers (IEEE) 802.15.1.
  • the list of other devices 110 may be a list of devices connected to a same local area network (LAN) as the particular user device 105.
  • the location indication of a particular user device 105 may comprise one or more of: one or more coordinates of the particular user device 105; a network address of the particular user device 105; or a text string identifying the location of the particular user device 105, such as 'In the office' or 'At HQ' or 'At home' or 'Working from home'.
  • the method comprises, based on the respective location indications, determining that the location of the first user device 105 and the location of the second user device 110 are the same and that the location of the third user device 115 is different to that of the first and second user devices 105, 110.
  • the method comprises receiving an audio frame captured by the third user device 115.
  • the audio frame may be received from the third user device 115.
  • the method may comprise receiving a respective network connection quality indication (or 'connection quality indication', or 'quality indication', or 'health indication', or 'network health indication') indicative of a network connection quality (or 'connection quality', or 'health') of each of the first and second user devices 105, 110.
  • the respective network connection quality indication may be indicative of a network connection quality between the system or device performing the method (such as the server 125) and the respective one of the first and second user devices 105, 110.
  • the network connection quality indication may comprise an indication of one or more of: a playout buffer status; packet loss; trip time; jitter; or latency.
  • the packet loss, trip time, jitter or latency may be between the system or device performing the method and the respective one of the first and second user devices 105, 110.
  • the packet loss, trip time, jitter or latency may be in both directions, or in one direction (from the system or device performing the method to the respective one of the first and second user devices 105, 110, or vice versa).
  • the trip time or latency may be a total trip time or latency for a transmission from the system or device performing the method to the respective one of the first and second user devices 105, 110, and back (from the respective one of the first and second user devices 105, 110 to the system or device performing the method).
  • the method comprises, responsive to determining that the location of the first user device 105 and the location of the second user device 110 are the same, determining, for each of the first and second user devices 105, 110, a respective playback instruction for the audio frame.
  • the respective playback instruction may be determined to reduce unsynchronised playback of the audio frame at the first and second user devices 105, 110.
  • unsynchronised playback of the audio frame can be reduced: by synchronising playback of the audio frame, and/or by reducing the volume of unsynchronised playback of the audio frame.
  • the determining, for each of the first and second user devices 105, 110, of the respective playback instruction for the audio frame may be based on the respective network connection quality indication of the first and second user devices 105, 110.
  • the playback instruction for a particular audio frame and a particular user device 105, 110, 115, 120 may comprise one or more of: a delay to apply to playback of the particular audio frame on the particular user device 105, 110, 115, 120; an instruction to play the particular audio frame at the particular user device 105, 110, 115, 120; an instruction not to play the particular audio frame at the particular user device 105, 110, 115, 120; an instruction to reduce or increase a playback volume of the particular audio frame at the particular user device 105, 110, 115, 120; or an instruction to buffer the particular audio frame at the particular user device 105, 110, 115, 120 and, optionally, a period during which to buffer before beginning playback at the particular user device 105, 110, 115, 120.
  • the method may comprise sending, to each of the first and second user devices 105, 110, the audio frame captured by the third user device 115 and the respective playback instruction for the audio frame.
  • the audio frame and playback instruction may be sent in a same message, or separately.
  • the method S100a may be repeated one or more times so that audio and, optionally, video transmission continues for the duration of the conference call.
  • the method S100a may further comprise: subsequent to receiving the audio frame in S130, receiving one or more subsequent audio frames captured by the third user device 115 until a predetermined number of subsequent audio frames have been received; repeating the determining S150 of the respective playback instructions for the one or more subsequent audio frames; and sending, to each of the first and second user devices 105, 110, the one or more subsequent audio frames and the respective playback instruction for the one or more subsequent audio frames.
  • the method S100a may also further comprise, subsequent to receiving the respective network connection quality indications, receiving a respective subsequent network connection quality indication indicative of the network connection quality of each of the first and second user devices.
  • the repeating of the determining of the respective playback instructions for at least one of the one or more subsequent audio frames may then be based on the subsequent network connection quality indications.
  • Fig. 10 shows an example flowchart of a method S100b of participating in a conference call, such as the conference call of method S100a shown in Fig. 9.
  • the method may be performed by a computing device such as any of the user devices 105, 110, 115, 120 of Fig. 1.
  • a request to join a conference call is sent.
  • the request may be sent from the user device 105, 110, 115, 120 performing the method to the device hosting the call, such as server 125.
  • the location indication may be sent to the device hosting the call, such as server 125.
  • the audio frame may be sent to the device hosting the call, such as server 125.
  • a network connection quality indication may be sent.
  • the network connection quality indication may be sent to the device hosting the call, such as server 125.
  • an audio frame captured by another user device 105, 110, 115, 120 in the conference call is received.
  • a playback instruction for the audio frame may also be received.
  • the audio frame and playback instruction may be sent in a same message, or separately.
  • An example flowchart of a second method S200a of hosting a conference call is shown in Fig. 11.
  • the method may be performed by a computing system including, for example, any of the components of Fig. 1.
  • the method may be performed by a server such as the server 125.
  • the method comprises, at S205, first, second and third user devices 105, 110, 115 connecting to the conference call.
  • the method comprises receiving a respective location indication indicative of a location of each of the first, second and third user devices 105, 110, 115.
  • the method comprises, based on the respective location indications, determining that the location of the first user device 105 and the location of the second user device 110 are the same and that the location of the third user device 115 is different to that of the first and second user devices 105, 110.
  • the method comprises receiving a respective audio frame captured by each of the first and second user devices 105, 110 and, optionally, a respective timestamp for each of the respective audio frames.
  • the method comprises, responsive to determining that the location of the first user device 105 and the location of the second user device 110 are the same, selecting one of the first and second user devices 105, 110 as a preferred user device based on the respective audio frames.
  • Selecting one of the first and second user devices 105, 110 as a preferred user device may further be based on the respective timestamps for each of the respective audio frames.
  • Selecting one of the first and second user devices 105, 110 as a preferred user device may comprise: determining, based on the respective audio frames and, optionally, any respective subsequent audio frames, a respective audio signal captured by each of the first and second user devices 105, 110 over one or more given time intervals; and determining, for each of the first and second user devices 105, 110, one or more respective loudnesses of the respective audio signals over each of the one or more given time intervals.
  • the selected preferred user device may be the user device for which the respective loudness, or the average of the respective loudnesses over each of the one or more given time intervals, is highest.
  • Selecting one of the first and second user devices 105, 110 as a preferred user device may further comprise determining a number of times each of the one or more respective loudnesses exceeds a predetermined loudness threshold.
  • the selected preferred user device may be the user device for which the number of times is highest.
  • the loudness for a particular time interval may be based on a root mean square (RMS) of the respective audio signal over the particular time interval.
  • the loudness may be weighted over time, with higher weighting applied to more recent samples.
  • the weighting may be exponentially higher.
  • the respective audio signal captured by each of the first and second user devices 105, 110 over one or more given time intervals may be determined based on a respective server transport latency for each of the first and second user devices 105, 110.
  • Selecting one of the first and second user devices 105, 110 as a preferred user device may comprise determining which of the respective audio frames and, optionally, any respective subsequent audio frames, contain speech.
  • the method comprises, responsive to determining that the location of the third user device 115 is different to that of the first and second user devices 105, 110, sending, to the third user device 115, the respective audio frame captured by the preferred user device.
  • the method may further comprise refraining from sending, to the third user device 115, the respective audio frame(s) captured by the user devices other than the preferred user device - in other words, discarding the respective audio frame(s) captured by the user devices other than the preferred user device.
  • the method S200a may be repeated one or more times.
  • the method S200a may further comprise: subsequent to receiving the respective audio frames, receiving one or more respective subsequent audio frames captured by each of the first and second user devices 105, 110 until a predetermined number of subsequent audio frames have been received, the selecting of the preferred user device being further based on the one or more respective subsequent audio frames; and sending, to the third user device, the one or more respective subsequent audio frames captured by the preferred user device.
  • the method S200a may further comprise: subsequent to receiving the respective audio frames, receiving one or more respective subsequent audio frames captured by each of the first and second user devices 105, 110; selecting a different one of the first and second user devices 105, 110 as a subsequent preferred user device based on the one or more respective subsequent audio frames; and sending, to the third user device 115, the one or more respective subsequent audio frames captured by the subsequent preferred user device.
  • the respective audio frame and the one or more respective subsequent audio frames may be faded together.
  • the method may additionally or alternatively further comprise sending, to the third user device 115, an instruction to fade the respective audio frame and the one or more respective subsequent audio frames together.
  • Fig. 12 shows an example flowchart of a method S200b of participating in a conference call, such as the conference call of method S200a shown in Fig. 11.
  • the method may be performed by a computing device such as any of the user devices 105, 110, 115, 120 of Fig. 1.
  • a request to join a conference call is sent.
  • the request may be sent from the user device 105, 110, 115, 120 performing the method to the device hosting the call, such as server 125.
  • a location indication of the user device 105, 110, 115, 120 is sent.
  • the location indication may be sent to the device hosting the call, such as server 125.
  • an audio frame captured by the user device 105, 110, 115, 120 is sent.
  • the audio frame may be sent to the device hosting the call, such as server 125.
  • an audio frame captured by another user device 105, 110, 115, 120 in the conference call is received.
  • a playback instruction for the audio frame may also be received.
  • the audio frame and playback instruction may be sent in a same message, or separately.
  • methods S100a and S200a may be combined.
  • An example flowchart of such a combined method S300 of hosting a conference call is shown in Fig. 13.
  • the method may be performed by a computing system including, for example, any of the components of Fig. 1.
  • the method may be performed by a server such as the server 125.
  • the method comprises the first, second, third and fourth user devices 105, 110, 115, 120 connecting to the conference call.
  • the method comprises receiving a respective location indication indicative of a location of each of the first, second, third and fourth user devices 105, 110, 115, 120.
  • the method comprises, based on the respective location indications, determining that the location of the first user device 105 and the location of the second user device 110 are the same, a first location, and that the location of the third user device 115 and the location of the fourth user device 120 are the same, a second location.
  • the second location is different to the first location.
  • the method comprises receiving a respective audio frame captured by each of the first and second user devices 105, 110.
  • the method may comprise receiving a respective network connection quality indication indicative of a network connection quality of each of the third and fourth user devices 115, 120.
  • the method comprises, responsive to determining that the location of the third user device 115 and the location of the fourth user device 120 are the same, determining, for each of the third and fourth user devices 115, 120, a respective playback instruction for the audio frame captured by the preferred user device.
  • the determining, for each of the third and fourth user devices, of the respective playback instruction for the audio frame may then be based on the respective network connection quality indication of the third and fourth user devices 115, 120.
  • the method comprises, responsive to determining that the location of the first user device 105 and the location of the second user device 110 are the same, selecting one of the first and second user devices 105, 110 as a preferred user device based on the respective audio frames.
  • the method comprises, responsive to determining that the location of the third user device 115 is different to that of the first and second user devices 105, 110, sending, to each of the third and fourth user devices 115, 120, the respective audio frame captured by the preferred user device and the respective playback instruction.
  • in any of methods S100a, S100b, S200a, S200b and S300, video frames may be sent or received in addition to audio frames.
  • any of methods S100a, S200a and S300 may further comprise receiving a video frame captured by one 105 of the first, second, third or fourth user devices 105, 110, 115, 120 and sending, to another one 110, 115, 120 of the first, second, third or fourth devices 105, 110, 115, 120, the video frame captured by the one 105 of the first, second, third or fourth devices 105, 110, 115, 120.
  • the playback instruction may then apply to the video frame as well as the relevant audio frame.
  • any of methods S100b and S200b may further comprise sending a video frame captured by the user device 105, 110, 115, 120.
  • the video frame may be sent to the device hosting the call, such as server 125.
  • the steps of the methods of Figs. 2 to 13 may be split between multiple devices; at each device, only some steps of these methods need to be performed.
  • the steps of these methods that are performed at each of these devices are to be understood as forming methods in their own right.
  • a method may be performed comprising only steps S305, S315, S320, S330, S340, S360 and S370 if, for example, step S350 is performed at another device.
  • as the devices of Fig. 1 may be manufactured (and sold) separately, the user devices 105, 110, 115, 120 and the server 125 should be understood as being disclosed both separately and in combination with each other.
  • A block diagram of an example apparatus 1400 for implementing any of the methods described herein, or any portion thereof, such as method S100a, S100b, S200a, S200b or S300, is shown in Fig. 14.
  • the exemplary apparatus 1400 may, for example, be used to implement any of user devices 105, 110, 115 or 120, or server 125.
  • the apparatus 1400 comprises a processor 1410 (e.g., a digital signal processor, or a multipurpose processor) arranged to execute computer-readable instructions as may be provided to the apparatus 1400 via one or more of a memory 1420, a communication interface 1430, or an input interface 1450.
  • the memory 1420, for example a random-access memory (RAM), is arranged to be able to retrieve, store, and provide to the processor 1410, instructions and data that have been stored in the memory 1420.
  • the communication interface 1430 is arranged to enable the processor 1410 to communicate with other devices (e.g., to receive filters) and/or to communicate with a communications network, such as the Internet.
  • the communication interface 1430 may be the external data communication interface described above.
  • the input interface 1450 is arranged to receive user inputs provided via an input device (not shown) such as a mouse, a keyboard, or a touchscreen.
  • the processor 1410 may further be coupled to a display adapter 1440, which is in turn coupled to a display device (not shown) .
  • the processor 1410 may further be coupled to an audio interface 1460 which may be used to capture and/or output audio signals, e.g., when the apparatus 1400 is used to implement any of user devices 105, 110, 115 or 120.
  • the audio interface 1460 may comprise a digital-to-analog converter (DAC) (not shown), e.g., for use with audio devices with analog input(s).
  • the approaches described herein may be embodied on a computer-readable medium, which may be a non-transitory computer-readable medium.
  • the computer-readable medium carries computer-readable instructions arranged for execution upon a processor so as to make the processor carry out any or all of the methods described herein.
  • the steps of the methods described herein may be performed concurrently.
  • the audio frame of S130 and the connection quality indication of S140 may be received concurrently.
  • portions of the second iteration of the method may be performed whilst the first iteration of the method is still being performed.
  • a second audio frame may be received in S130 while the first audio frame received in S130 is still being sent in S160.
  • any of the features of the present disclosure may be combined, even if such a combination is not explicitly recited.
  • any of the features described in 'Office echo suppression' and 'Office playout synchronisation' may be used as part of methods S100a, S100b, S200a, S200b, and S300, and in particular as part of S150, S160, S350 or S370.
  • any of the features described in 'Selection of office audio stream' may be used as part of methods S100a, S100b, S200a, S200b, and S300, and in particular as part of S250 or S360.
  • any of the features described in 'Detection of same location' may be used as part of methods S100a, S100b, S200a, S200b, and S300, and in particular as part of S120, S220 or S320.
  • any of the features described as part of S100a, S100b, S200a, S200b may be used as part of method S300 (for example, features described as part of S105 or S205 may be used as part of S305, features described as part of S115 or S215 may be used as part of S315, features described as part of S120 or S220 may be used as part of S320, features described as part of S130 or S230 may be used as part of S330, features described as part of S140 may be used as part of S340, features described as part of S150 may be used as part of S350, features described as part of S250 may be used as part of S360, features described as part of S160 or S260 may be used as part of S370, etc.).
  • Non-volatile media may include, for example , optical or magnetic disks .
  • Volatile media may include dynamic memory .
  • Exemplary forms of storage medium include a floppy disk, a flexible disk, a hard disk, a solid state drive, a magnetic tape or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with one or more patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, and any other memory chip or cartridge.

Abstract

There is provided a method of hosting a conference call, comprising: connecting first, second and third user devices to the conference call; receiving a respective location indication indicative of a location of each of the first, second and third user devices; based on the respective location indications, determining that the location of the first user device and the location of the second user device are the same and that the location of the third user device is different to that of the first and second user devices; receiving an audio frame captured by the third user device; responsive to determining that the location of the first user device and the location of the second user device are the same, determining, for each of the first and second user devices, a respective playback instruction for the audio frame; and sending, to each of the first and second user devices, the audio frame captured by the third user device and the respective playback instruction for the audio frame.

Description

METHOD AND APPARATUS FOR HOSTING A CONFERENCE CALL
Field
[0001] The present disclosure relates generally to a method and apparatus for hosting or participating in a conference call.
Background
[0002] Typical conference call systems are designed for a single participant in a room to connect to a conference call with other remote participants. When multiple participants connect to the same conference call from the same room, audio issues may occur.
Summary
[0003] Aspects of the present disclosure are defined in the accompanying independent claims.
Overview of disclosure
[0004] There is disclosed a method of hosting a conference call (or 'call', or 'videoconference' , or 'audio call' , or 'video call' ) .
[0005] The method comprises connecting first, second and third user devices (or 'clients', or 'endpoints', or 'participant devices' ) to the conference call.
[0006] The method may comprise receiving a respective location indication indicative of a location of each of the first, second and third user devices.
[0007] Optionally, the location indication of a particular user device comprises : an indication that a signal emitted by another device has been captured by the particular user device; or a signal emitted by another device and captured by the particular user device.
[0008] Optionally, the signal emitted by the other device is an audio signal played by the other device. [0009] Optionally, the signal emitted by the other device is an ultrasound signal.
[0010] Optionally, the location indication of a particular user device comprises a list of other devices detected by the particular user device.
[0011] Optionally, the other devices are detected using a wireless personal area network (WPAN) technology. Optionally, the WPAN technology is Bluetooth (RTM) or Institute of Electrical and Electronics Engineers (IEEE) 802.15.1.
[0012] Optionally, the list of other devices is a list of devices connected to a same local area network (LAN) as the particular user device .
[0013] Optionally, the location indication of a particular user device comprises one or more coordinates of the particular user device .
[0014] Optionally, the location indication of a particular user device comprises a network address of the particular user device.
[0015] Optionally, the location indication of a particular user device comprises a text string identifying the location of the particular user device, such as 'In the office' or 'At HQ' or 'At home' or 'Working from home' .
[0016] The method may comprise, based on the respective location indications, determining that the location of the first user device and the location of the second user device are the same and that the location of the third user device is different to that of the first and second user devices.
By 'the same', it may be meant that the first and second user devices are close enough to each other for a user of the first user device to overhear sound played by the second user device or for the first user device to capture sound played by the second user device. However, the determination may be made with a (slightly) higher or lower level of precision, and therefore the determination that the locations of the first and second user devices are the same may instead be a determination that they are in a same room, on a same floor, in a same building, or outside but within a predetermined distance of each other, for example.
[ 0018 ] The method comprises receiving an audio frame captured by at least one of the first , second and third user devices and, optionally, a timestamp for the audio frame .
[ 0019 ] The method may comprise, responsive to determining that the location of the first user device and the location of the second user device are the same , determining, for each of the first and second user devices , a respective playback instruction for the audio frame .
[ 0020 ] The respective playback instruction may be determined to reduce unsynchronised playback of the audio frame at the first and second user devices .
[0021] Optionally, the playback instruction for a particular audio frame and a particular user device comprises one or more of: a delay to apply to playback of the particular audio frame on the particular user device; an instruction to play the particular audio frame at the particular user device; an instruction not to play the particular audio frame at the particular user device; an instruction to reduce or increase a playback volume of the particular audio frame at the particular user device; or an instruction to buffer the particular audio frame at the particular audio device and, optionally, a period during which to buffer before beginning playback at the particular audio device.
[0022] The method may comprise sending, to each of the first and second user devices, the audio frame captured by the third user device and the respective playback instruction for the audio frame. The audio frame and playback instruction may be sent in a same message, or separately. [0023] Optionally, the method further comprises, subsequent to receiving the audio frame, receiving one or more subsequent audio frames captured by the third user device until a predetermined number of subsequent audio frames have been received. Optionally, the method further comprises repeating the determining of the respective playback instructions for the one or more subsequent audio frames. Optionally, the method further comprises sending, to each of the first and second user devices, the one or more subsequent audio frames and the respective playback instruction for the one or more subsequent audio frames.
[ 0024 ] Optionally, the method further comprises receiving a respective network connection quality indication indicative of a network connection quality of each of the first and second user devices . Optionally, the determining, for each of the first and second user devices , of the respective playback instruction for the audio frame is based on the respective network connection quality indication of the first and second user devices .
[0025] Optionally, the network connection quality indication comprises an indication of one or more of: a playout buffer status; packet loss; trip time; jitter; or latency.
[ 0026 ] Optionally, the method further comprises , subsequent to receiving the respective network connection quality indications , receiving a respective subsequent network connection quality indication indicative of the network connection quality of each of the first and second user devices . Optionally, the repeating of the determining of the respective playback instructions for at least one of the one or more subsequent audio frames is based on the subsequent network connection quality indications . [ 0027 ] The method may comprise, responsive to determining that the location of the first user device and the location of the second user device are the same , selecting one of the first and second user devices as a preferred user device based on the respective audio frames .
[ 0028 ] The selecting may be based on the respective timestamps for the respective audio frames .
[0029] The method may comprise, responsive to determining that the location of the third user device is different to that of the first and second user devices, sending, to the third user device, the respective audio frame captured by the preferred user device.
[ 0030 ] Optionally, the method further comprises , subsequent to receiving the respective audio frames , receiving one or more respective subsequent audio frames captured by each of the first and second user devices until a predetermined number of subsequent audio frames have been received . Optionally, the selecting of the preferred user device is further based on the one or more respective subsequent audio frames . Optionally, the method further comprises sending, to the third user device, the one or more respective subsequent audio frames captured by the preferred user device .
[ 0031 ] Optionally, the method further comprises , subsequent to receiving the respective audio frames , receiving one or more respective subsequent audio frames captured by each of the first and second user devices . Optionally, the method further comprises selecting a different one of the first and second user devices as a subsequent preferred user device based on the one or more respective subsequent audio frames . Optionally, the method further comprises sending, to the third user device, the one or more respective subsequent audio frames captured by the subsequent preferred user device .
[ 0032 ] Optionally, the respective audio frame and the one or more respective subsequent audio frames are faded together . [ 0033 ] Optionally, the method further comprises sending, to the third user device, an instruction to fade the respective audio frame and the one or more respective subsequent audio frames together .
[ 0034 ] Optionally, the method further comprises connecting a fourth user device to the conference call . Optionally, the method further comprises receiving a respective location indication indicative of a location of the fourth user device . Optionally, the method further comprises , based on the location indications of the third and fourth user devices , determining that the location of the third user device and the location of the fourth user device are the same . Optionally, the method further comprises , responsive to determining that the location of the third user device and the location of the fourth user device are the same , determining, for each of the third and fourth user devices , a respective playback instruction for the audio frame captured by the preferred user device . Optionally, the method further comprises sending, to the third user device , the respective playback instruction for the audio frame . Optionally, the method further comprises sending, to the fourth user device, the audio frame captured by the preferred user device and the respective playback instruction for the audio frame .
[ 0035 ] Optionally, the method further comprises receiving a respective network connection quality indication indicative of a network connection quality of each of the third and fourth user devices . Optionally, the determining, for each of the third and fourth user devices , of the respective playback instruction for the audio frame is based on the respective network connection quality indication of the third and fourth user devices .
[ 0036 ] Optionally, selecting one of the first and second user devices as a preferred user device comprises : determining, based on the respective audio frames and, optionally, the respective subsequent audio frames , a respective audio signal captured by each of the first and second user devices over one or more given time intervals ; and determining, for each of the first and second user devices, one or more respective loudnesses of the respective audio signals over each of the one or more given time intervals.
[0037] Optionally, selecting one of the first and second user devices as a preferred user device further comprises determining a number of times each of the one or more respective loudnesses exceeds a predetermined loudness threshold.
[0038] Optionally, the loudness for a particular time interval is based on a root mean square (RMS) of the respective audio signal over the particular time interval.
[0039] Optionally, the loudness is weighted over time, with higher weighting applied to more recent samples. Optionally, the weighting is exponentially higher.
[0040] Optionally, the respective audio signal captured by each of the first and second user devices over one or more given time intervals is determined based on a respective server transport latency for each of the first and second user devices.
[0041] Optionally, selecting one of the first and second user devices as a preferred user device comprises determining which of the respective audio frames and, optionally, the respective subsequent audio frames, contain speech.
[0042] Optionally, the method further comprises: receiving a video frame captured by one of the first, second or third user devices; and sending, to another one of the first, second or third devices, the video frame captured by the one of the first, second or third devices .
[0043] Optionally, the respective playback instruction is for the audio frame and the video frame.
[0044] Optionally, the method is performed on a server.
[0045] Optionally, the method is performed by a computing system. [ 0046 ] There is disclosed an apparatus configured to perform any of the methods described herein .
[ 0047 ] There is disclosed an apparatus comprising one or more processors configured to perform any of the methods described herein .
[0048] There is disclosed an apparatus comprising one or more processors and a memory in communication with the one or more processors, the memory storing instructions which, when executed by the one or more processors, cause the one or more processors to perform any of the methods described herein.
[ 0049 ] There is disclosed a computer system configured to perform any of the methods described herein .
[ 0050 ] There is disclosed a computer system comprising a processor configured to perform any of the methods described herein .
[0051] There is disclosed a computer program comprising instructions which, when executed by one or more processors of a computing system, cause the computing system to perform any of the methods described herein.
[0052] There is disclosed a computer readable medium comprising computer readable instructions which, when executed by one or more processors of a computing system, cause the computing system to perform any of the methods described herein. Optionally, the medium is non-transitory.
Brief description of the drawings
[0053] Examples of the present disclosure will now be explained with reference to the accompanying drawings, in which:
Fig. 1 is a block diagram of an example system architecture;
Fig. 2 is an example sequence diagram showing remote participants connecting to a conference call;
Fig. 3 is an example sequence diagram showing a first office participant connecting to a conference call;
Fig. 4 is an example sequence diagram showing a second office participant connecting to a conference call;
Fig. 5 is an example sequence diagram showing how a server can select one of two office participants' audio streams;
Figs. 6 and 7 are first and second parts of an example sequence diagram showing how a server can synchronise playout of audio streams for the two office participants;
Fig. 8 is an example sequence diagram showing how a server can respond to increased latency for one of the office participants;
Fig. 9 is an example flowchart of a first method of hosting a conference call;
Fig. 10 is an example flowchart of a method of participating in the conference call of Fig. 9;
Fig. 11 is an example flowchart of a second method of hosting a conference call;
Fig. 12 is an example flowchart of a method of participating in the conference call of Fig. 11;
Fig. 13 is an example flowchart of a third method of hosting a conference call; and
Fig. 14 is a block diagram of an example apparatus for implementing any of the methods described herein.
[0054] Throughout the description and the drawings, like reference numerals refer to like parts.
Detailed description
Context
[0055] When two user devices that are close to each other join a conference call, an immediate unpleasant audio feedback loop occurs. This is due to remote participants' audio being played out of the speakers of other nearby participants' user devices and then being picked up again by the participant's microphone. It is possible in most systems to mute one's microphone when not speaking, but the user also has to mute the speaker audio from the conference call to stop this feedback loop. This is not always possible and so users need to reduce the volume of their entire device, which is far from ideal. For this user to then speak and hear in the call, they need to unmute both their microphone and speaker volumes, which is cumbersome for the user. In addition to unpleasant feedback loops, the experience is also degraded due to
• participants' microphones being unmuted and sending the same audio signal, causing a chorus effect as well as unpleasant echoes and distortion when the signals are not fully in sync;
• participants' microphones picking up each other's speaker audio, causing remote far participants to hear their own voice being echoed back at them; and
• audio playing out from multiple participants' devices at different times and not in sync, causing echoes/distorted audio and an unpleasant out of sync stereo effect.
[0056] Poor network conditions may further accentuate the degraded experience .
[0057] Participant device responsiveness can also degrade the experience by causing slow and/or inconsistent audio processing time .
[0058] In overview, the present disclosure provides techniques to mitigate the above issues by reducing unsynchronised playback and capture at participant devices that are located in a same room.
System architecture
[0059] Fig. 1 is a block diagram of an example system architecture 100 for performing the methods described herein.
[0060] The system 100 comprises clients (or 'user devices' ) used by conference participants. For example, the user devices may include a first user device 105, 'Alice' , a second user device 110, 'Bob' , a third user device 115, 'Carl', and a fourth user device 120, 'Denise' . The system 100 further comprises a server 125 that interacts to provide the best audio experience for all conference participants. The server 125 may, in particular, perform office echo suppression, selection of office audio and office playout synchronization. In some implementations, the server 125 may perform any of the methods or steps described herein, even if such methods or steps are described as being performed by the system 100.
[0061] The participants connect to the server and join a conference room to send their audio and video streams, and receive audio and video streams from the other participants.
[0062] The system may allow office participants to connect to the server without receiving each other' s audio streams which immediately stops the initial continuous echo and feedback loops. This is explained in more detail in 'Office echo suppression' below.
[0063] The system may also analyse the audio streams from office participants and decide on the best audio stream to send to other remote participants to represent the audio within the office. If any original audio streams are used, the system synchronises these accordingly to stop any audio glitches or artefacts. This is explained in more detail in 'Selection of office audio stream' below.
[0064] The system may also continuously monitor participants' network conditions and decide on a best strategy to allow synchronised playout of remote participants' audio on office participants' devices. This is explained in more detail in 'Office playout synchronisation' below.
[0065] In the example system architecture of Fig. 1, remote participants Carl and Denise are both remote in different physical locations. In-office participants Alice and Bob are in the same physical location. As participants in the same location, Alice and Bob are likely to have the same internet connection and so are likely to have network issues at the same time. As remote participants in different locations, Carl and Denise have separate internet connections and so are unlikely to have network issues at the same time.
[0066] Although Fig. 1 refers to the user devices using the names 'Alice' , 'Bob' , 'Carl' and 'Denise' , it will be understood that these names are provided merely for the purpose of illustration, and that any reference to 'Alice' and 'Bob' should be taken as a reference to user devices that are in a same location, and that any reference to user devices 'Carl' and 'Denise' should be taken as a reference to user devices that are not in a same location as each other or as any other user devices on the conference call.
[0067] Although Fig. 1 shows four user devices with two user devices in an office and two remote user devices, it will be understood that the present disclosure is flexible and applicable to conference calls with any number of user devices in any number of locations .
Call setup
[0068] Fig. 2 is an example sequence diagram showing remote participants connecting to a conference call.
[0069] Carl and Denise both connect to the video conferencing system server 125 (201, 204) .
[0070] Carl and Denise both send (or 'publish' ) audio to the server 125 (202, 205) . In some examples, Carl and Denise also send video to the server 125 (203, 206) .
[0071] Carl subscribes to Denise's audio and video streams. The server 125 starts sending Denise's audio and optionally video to Carl (207, 208) .
[0072] Denise subscribes to Carl's audio and video streams. The server 125 starts sending Carl's audio and optionally video to Denise (209, 210) .
Detection of same location
[0073] To determine if two devices are proximate and can overhear each other (so therefore will cause feedback if a conference call is started) the system 100 may employ ultrasound. Each device may listen to the microphone stream and also send out an ultrasonic (outside of human hearing) pulse. Each pulse uses a unique sequence of frequencies that identifies a device. Therefore devices can keep a list of other nearby devices and provide this to the server 125 as a location indication of each listed device.
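By way of illustration only, a minimal browser-side sketch of emitting such an ultrasonic identity pulse is given below, using the Web Audio API. The frequency alphabet, symbol duration and id width are illustrative assumptions; the disclosure does not specify an encoding.

```typescript
// Sketch: emit a device-identifying pulse as a sequence of near-ultrasonic
// tones. Band, symbol duration and id width are assumptions, not values
// taken from the disclosure.
const SYMBOL_HZ = [18000, 18500, 19000, 19500]; // 4-tone alphabet, 2 bits/symbol
const SYMBOL_MS = 80;

function emitUltrasonicId(ctx: AudioContext, deviceId: number): void {
  // Encode a 16-bit id as eight base-4 digits, one tone per digit.
  const digits: number[] = [];
  for (let v = deviceId, i = 0; i < 8; i++, v >>= 2) digits.push(v & 0b11);

  const gain = ctx.createGain();
  gain.gain.value = 0.05; // keep the pulse quiet
  gain.connect(ctx.destination);

  const osc = ctx.createOscillator();
  const t0 = ctx.currentTime;
  digits.forEach((d, i) =>
    osc.frequency.setValueAtTime(SYMBOL_HZ[d], t0 + (i * SYMBOL_MS) / 1000)
  );
  osc.connect(gain);
  osc.start(t0);
  osc.stop(t0 + (digits.length * SYMBOL_MS) / 1000);
}
```

A receiving device would run a short-time frequency analysis over its microphone stream and match detected tone sequences against known device ids.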
[0074] The system 100 may additionally or alternatively make use of available systems such as Bonjour or Bluetooth to find devices within Bluetooth range and nearby as well as using audible audio clips to determine whether echoes, distortion and/or feedback loops would be generated.
[0075] The system 100 may additionally or alternatively allow participants to set their location manually to control their behaviour on the system.
[0076] The system 100 may additionally or alternatively use ultrasound to determine a device audio latency which measures device audio capture and playout delays for that device on connecting to the system. This device delay is then considered in further delay calculations .
Office echo suppression
[0077] Participants that are located in the same room/physical location will cause infinite audio feedback loops unless they mute each other's conference audio (both input and output). System clients send control messages to the server 125 to report their physical location; the server then sends control messages back to the clients to control other office participants' audio streams, including unsubscribing where required.
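The disclosure does not define a wire format for these control messages; the following TypeScript shapes are one hypothetical encoding of the client-to-server location report and the server-to-client stream-control messages (all field names illustrative).

```typescript
// Hypothetical control-message shapes; not part of the disclosure.
type ClientToServer =
  | { kind: 'location'; location: string } // e.g. 'Office'
  | { kind: 'streamStats'; streamId: string; latencyMs: number;
      packetLossFraction: number; bufferMs: number; targetBufferMs: number };

type ServerToClient =
  | { kind: 'unsubscribeAudio'; fromParticipant: string } // echo suppression
  | { kind: 'muteAudio'; fromParticipant: string; muted: boolean }
  | { kind: 'playoutDelay'; fromParticipant: string; delayMs: number };
```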
[0078] Fig. 3 is an example sequence diagram showing a first office participant, Bob, connecting to a conference call.
[0079] Bob's client connects (301) to the conference call at the server 125 and informs (302) the server 125 that it is in the location 'Office'. [0080] Bob starts sending audio (303) and optionally video (304) to the server 125.
[0081] The server 125 starts sending audio (307, 305) and optionally video (308, 306) to Bob from both Carl and Denise.
[0082] The server 125 informs (313) Bob of the best playout delay for both audio and video from Carl.
[0083] The server 125 informs (313) Bob of the best playout delay to use for both audio and video from Denise.
[0084] The server 125 starts sending audio (311, 309) and optionally video (312, 310) from Bob to both Carl and Denise.
[0085] The server 125 requests that Bob's client send periodic details on all incoming audio and/or video streams it is receiving from Carl and Denise. This includes information for each incoming stream about the current network latency, packet loss, audio buffer sizes, and target audio buffer sizes, etc.
[0086] Fig. 4 is an example sequence diagram showing a second office participant, Alice, connecting to a conference call.
[0087] Alice's client connects (401) to the conference call at the server 125, informing (402) the server that it is in the location 'Office' .
[0088] Alice starts sending audio (403) and optionally video (404) to the server 125.
[0089] The server 125 starts forwarding audio (407, 405) and optionally video (408, 406) from Carl and Denise to Alice.
[0090] The server 125 informs (415) Alice of the best playout delay for use for both audio and video from Carl.
[0091] The server 125 informs (415) Alice of the best playout delay to use for both audio and video from Denise.
[0092] The server 125 starts forwarding (413) only video from Alice to Bob. [0093] The server 125 starts forwarding (411, 409) Alice's audio to Carl and Denise.
[0094] The server 125 optionally starts forwarding (412, 410, 413) Alice's video to Carl and Denise and Bob.
[0095] The server 125 optionally starts forwarding (414) Bob's video to Alice.
[0096] The server 125 requests that Alice's client send periodic details on all incoming audio and/or video streams it is receiving from Carl and Denise. This includes information for each incoming stream about the current network latency, packet loss, audio buffer sizes, and target audio buffer sizes.
[0097] In the example described with reference to Figs. 3 and 4, the location of 'Office' is provided to the server 125 from the user device 105, 110. In some implementations, the location of office is provided to the user device via a user interface, allowing users to set their location manually to control their behaviour on the system.
[0098] In other implementations, the location of office may automatically be sent to the server without user input, for example, if the user device is connected to office WiFi. In some examples, a coordinate may be provided to the server, for example a Global Positioning System (GPS) coordinate.
[0099] In some examples, instead of a user device providing a location indication to the server 125, another nearby device may instead provide a location indication of the user device to the server 125.
Selection of office audio stream
[0100] The server 125 may use an active speaker detection algorithm using a number of attributes to determine the best audio to send to remote participants for participants from the same physical location/room. This generally correlates to the microphone closest to the speaker and with the strongest signal energy. In addition, the system 100 may highlight the speaking participant and make this information available to the remote participants.
[0101] The system uses network and device delay information to make sure playout of any audio stream is synchronised appropriately during playout on any remote participant devices.
[0102] In one implementation, the system 100 uses an audio window and RMS value for this window and determines the office participant with the loudest signal and forwards only this audio stream.
[0103] In addition, the system 100 can use audio signal processing to create a more continuous audio signal for remote participants as well as focussing on the relevant audio signal.
[0104] In addition, the system 100 may employ Network Time Protocol (NTP) timestamps to determine the offset of the audio signals received from each of the office participants and process this information accordingly when considering which audio stream is most appropriate .
[0105] Playout of the office audio on the client is synchronised using the NTP timestamps so that, when the system 100 switches between audio streams, there are no artefacts. The system 100 also employs fading of streams to further reduce glitching and artefacts on switching. This can be performed on the client via a control message from the server 125 or entirely on the client.
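A minimal sketch of such a fade is given below, assuming mono Float32 frames; the linear ramp is an illustrative choice rather than a prescribed one.

```typescript
// Sketch: linear crossfade between the tail of the outgoing office stream
// and the head of the incoming one when the server switches audio streams.
function crossfade(outgoing: Float32Array, incoming: Float32Array): Float32Array {
  const n = Math.min(outgoing.length, incoming.length);
  const mixed = new Float32Array(n);
  const denom = Math.max(1, n - 1);
  for (let i = 0; i < n; i++) {
    const w = i / denom; // ramps 0 -> 1 across the frame
    mixed[i] = (1 - w) * outgoing[i] + w * incoming[i];
  }
  return mixed;
}
```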
[0106] The server 125 analyses (505) Bob's and Alice's audio streams to determine the best audio stream and/or combination of audio streams to forward to Carl and Denise from the location Office .
[0107] The server 125 may determine the best audio stream/s by looking at the RMS of the audio signal over a certain time interval to determine the loudest signal at that time.
[0108] The server 125 may make sure it is evaluating the same real-time window by calculating and allowing for any difference in server transport latency. [0109] In addition, the server 125 may use a weighted value determined by how much of the window has an RMS signal over a certain threshold.
[0110] Additionally or alternatively, the server 125 may use an exponentially smoothed RMS value giving greater importance to closer signal values.
[0111] Additionally or alternatively, the server 125 may use any of the above techniques in combination with a voice activity detection algorithm.
[0112] Additionally or alternatively, the server 125 may use any of the above techniques in combination with a speech/non-speech classifier .
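The heuristics above can be combined as in the following sketch, which keeps, per office participant, an exponentially smoothed RMS and a count of windows exceeding a loudness threshold. The smoothing factor, threshold and score combination are illustrative assumptions, not values from the disclosure.

```typescript
// Sketch: per-participant loudness score for selecting the office stream.
class SpeakerScore {
  private smoothedRms = 0;
  private loudWindows = 0;
  constructor(private alpha = 0.6, private threshold = 0.02) {}

  update(samples: Float32Array): void {
    let sumSq = 0;
    for (const s of samples) sumSq += s * s;
    const rms = Math.sqrt(sumSq / samples.length);
    // Exponential smoothing: recent windows dominate the score.
    this.smoothedRms = this.alpha * rms + (1 - this.alpha) * this.smoothedRms;
    if (rms > this.threshold) this.loudWindows++;
  }

  score(): number {
    // Illustrative combination of the two heuristics.
    return this.smoothedRms + 0.001 * this.loudWindows;
  }
}
```

The server would keep one such score per office participant, update it with latency-aligned audio windows, and forward only the stream whose score is currently highest, optionally gated by voice activity detection.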
[0113] Fig. 5 is an example sequence diagram showing how a server 125 can select one of the two office participants' audio streams.
[0114] Alice sends (501) audio with RMS over duration to the server 125. Alice sends (502) a network connection quality indication, such as information on transport link, to the server 125. Bob sends (503) audio with RMS over duration to the server 125. Bob sends (504) a network connection quality indication, such as information on transport link, to the server 125. The server 125, using RMS over duration and the network connection quality indication, changes (505) the output device for the office from Alice to Bob. The server 125 mutes (506) Carl's audio for Alice. The server 125 unmutes (507) Carl's audio for Bob. The server 125 sends (508) playout delays required for sync to Denise. The server 125 mutes (509) Denise's audio for Alice. The server 125 mutes (510) Denise's audio for Bob. The server 125 sends (511) playout delays required for sync to Carl.
Office playout synchronisation
[0115] Playing out of audio is affected by network congestion, burst and packet loss, which can cause audio glitching and/or artefacts. Algorithms exist to account for this by stretching, squashing and/or concealing the audio to provide the user with the most optimal audio experience during the call at the lowest possible latency.
[0116] In addition, audio preferably needs to be played out as close to the same time as possible from participant device speakers in the same location to avoid an unpleasant audio experience. The participants in the same location may experience echoes/disjointed/distorted audio when this is not the case, whilst remote participants are likely to hear their own voice being echoed back to them.
[0117] The server 125 is able to solve both issues within one algorithm.
[0118] Office participant clients send information on their audio stack, playout buffers, and network status, including packet loss, trip time and latency, to the server.
[0119] The server 125 uses combined information from each office participant client to work out the likelihood one of the clients will run out of audio to play due to current conditions.
[0120] The server 125 uses this information to decide on a common action for all office participant clients to playout the remote participant audio.
[0121] The server 125 sends a series of control messages to each client in the same location to equalise for network latency, device latency, and audio disruption across all devices.
[0122] Office participant clients can be on any network including over the internet. The system 100 produces playout of audio across multiple devices within the same physical location at best latency with minimal audio issues.
[0123] In one implementation, the server 125 decides the best option would be to add a set playout delay to remote participant streams being played out on office participants' devices. This playout delay buffers audio for a set amount of time in order for office participants' playout audio streams to be in sync. A best/desired playout is set which the server tries to keep all office participants' playout audio streams close to. The playout delay is set for each remote participant and can be different across remote participants but will be the same value for the same remote participant on different office participants' devices. Where required, the playout delay is increased to allow for network interruptions and then allowed to correct once the network interruption has lapsed.
[0124] In some cases, after repeated or consistent network interruption, the best/desired playout adapts to a suitable value to reduce system volatility. Best performance is produced when all office participant devices are playing out audio at real time with minimal stretching and squashing. Adapting to average network conditions over a time window maximises playout at real time across devices.
[0125] In one instance, the system 100 detects that at least one connected device has a consistently slower latency, most likely due to physical location, and a larger best/desired playout delay should be used.
[0126] In another instance, the system 100 detects that at least one connected device has considerable jitter on packet arrival, causing system volatility, and a larger best/desired playout delay should be used.
[0127] In another instance, the server 125 decides that the best option would be to set a playout delay and/or completely mute playout of one or more remote participant streams on one or more office participant devices.
[0128] In another instance, the server 125 decides the best option would be to set a playout delay and/or reduce the volume of one or more remote participant streams on one or more office participant devices.
[0129] In addition, the server 125 also calculates whether it is optimal for all office participants in the room to reset their playout audio streams to achieve synchronisation similar to the start of the video conference/call. [0130] Figs. 6 and 7 are first and second parts of an example sequence diagram showing how a server can synchronise playout of audio streams for the two office participants.
[0131] The server continually receives information on the receiving, processing and playout of audio and video streams from Carl and Denise for both Alice and Bob. Using this information, the server calculates the best playout delay for Carl and Denise on both Alice and Bob.
[0132] The server uses an adaptive best/desired playout value, typically less than ~200 ms, which is the default playout value for clients in the same location. This small buffer of audio allows the playout of audio to be real-time and thus remain in sync between Bob and Alice.
[0133] The server monitors the information received from Alice and Bob and identifies any network issues which mean one or more of the audio streams being received on Alice and/or Bob require a playout delay larger than 200 ms.
[0134] In this case the server identifies the smallest playout delay that will satisfy the current network interruptions for each remote participant audio stream being received by each office participant and sets it accordingly.
[0135] The stream identified as having an interruption is left to correct itself to the best/desired delay and all other subscribers of the same audio stream follow the best value for that audio stream.
[0136] If any other audio streams are identified as requiring a longer playout delay, the server switches to following that particular audio stream.
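One hypothetical realisation of this follow-and-correct behaviour is sketched below: the shared delay for a remote stream jumps to the largest delay any office participant currently needs, then decays back toward the best/desired value. The decay step is an assumption.

```typescript
// Sketch: next shared playout delay for one remote participant's stream.
function nextPlayoutDelayMs(
  reportedTargetBuffersMs: number[], // one report per office participant
  currentDelayMs: number,
  desiredDelayMs: number,
  decayStepMs = 20 // illustrative correction rate
): number {
  const requiredMs = Math.max(...reportedTargetBuffersMs);
  if (requiredMs > currentDelayMs) return requiredMs; // interruption: jump up
  return Math.max(desiredDelayMs, currentDelayMs - decayStepMs); // correct down
}
```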
[0137] In one instance, the client and server both use WebRTC to form a peer connection. The client monitors the peer connection status using the WebRTC getStats() function and sends the "current buffer size" and "target buffer size" for each audio stream it is receiving to the server. The "current buffer size" value indicates the current delay in seconds from receiving audio to playing out that audio, whilst the "target buffer size" value is the delay in seconds, from the audio being sent by the sender to being played out by the client, that the client is aiming to achieve. This "target buffer size" correlates to the "playout delay" setting that can be set using the WebRTC API on each receiving audio stream. The server is able to set the "playout delay" for each remote participant receiving stream and monitor the corresponding "target buffer size" value. If the audio receiver encounters network interruptions or device audio issues which mean it is no longer able to achieve the desired "playout delay", it will increase its "target buffer size" to the minimum value it is able to achieve under the current conditions whilst still enabling smooth playout of audio. The server monitors this value and adjusts its "playout delay" values for each remote participant audio stream on each office participant accordingly.
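A client-side monitoring loop along these lines might look as follows. The "current buffer size" is approximated here from the standard inbound-rtp jitter-buffer counters, and the playout delay knob is accessed as Chromium's non-standard playoutDelayHint attribute; exact field availability varies by browser, so treat this strictly as a sketch. sendToServer is an assumed signalling helper.

```typescript
// Sketch: report per-stream buffer sizes to the server and apply its
// "playout delay" decisions.
async function reportBufferSizes(
  pc: RTCPeerConnection,
  sendToServer: (msg: object) => void
): Promise<void> {
  const stats = await pc.getStats();
  stats.forEach((report: any) => {
    if (report.type === 'inbound-rtp' && report.kind === 'audio') {
      // Average time each emitted sample spent in the jitter buffer (seconds).
      const currentBufferS =
        report.jitterBufferDelay / Math.max(1, report.jitterBufferEmittedCount);
      sendToServer({
        kind: 'streamStats',
        streamId: report.trackIdentifier,
        currentBufferS,
        targetBufferS: report.jitterBufferTargetDelay ?? null, // where exposed
      });
    }
  });
}

function setPlayoutDelay(receiver: RTCRtpReceiver, delayS: number): void {
  // Non-standard Chromium attribute, hence the cast.
  (receiver as any).playoutDelayHint = delayS;
}
```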
[0138] To calculate an adaptive best/desired playout delay, the server continually receives information on the transport latency from all participants, including both those situated remotely and in the office, and decides on an optimal best/desired playout delay. This information contains participant-to-server transport latency as well as jitter on transport packet arrival, and is analysed over a time window to provide a consistent estimate. In one instance, the server calculates an optimal best/desired playout that equates to the maximum of the total transport time from office participant to remote participant plus an additional padding amount, quantized to a multiple of the transport packet size. Additionally, the server monitors transport packet arrival jitter and increases the best/desired playout delay when this increases over a set of thresholds.
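The calculation described in [0138] might be sketched as follows; the packet duration, padding amount and jitter thresholds are illustrative values, not figures from the disclosure.

```typescript
// Sketch: adaptive best/desired playout delay.
function desiredPlayoutMs(
  officeToServerMs: number[], // per office participant transport latency
  serverToRemoteMs: number[], // per remote participant transport latency
  jitterMs: number,           // smoothed packet-arrival jitter
  packetMs = 20,              // assumed transport packet duration
  paddingMs = 40              // assumed padding
): number {
  // Maximum office-to-remote total transport time.
  const transportMs =
    Math.max(...officeToServerMs) + Math.max(...serverToRemoteMs);
  // Quantize up to a multiple of the packet duration.
  let targetMs = Math.ceil((transportMs + paddingMs) / packetMs) * packetMs;
  // Illustrative jitter thresholds.
  if (jitterMs > 30) targetMs += packetMs;
  if (jitterMs > 60) targetMs += packetMs;
  return targetMs;
}
```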
[0139] Fig. 8 is an example sequence diagram showing how a server can respond to increased latency for one of the office participants.
Effects of the disclosure
[0140] An effect of the present disclosure is the provision of an approach which does not require bespoke user devices, but allows co-located users to participate in conference calls, retaining full functionality of their devices without interfering with audio quality. [0141] An effect of the present disclosure is the provision of an approach which requires no additional signal processing on the client or server.
[0142] An effect of the present disclosure is the provision of an approach which requires no additional data to be sent over a local network. It may not always be possible to send data over a local network, because local networks often block data sent directly between two local peers. The approach works wherever WebRTC works.
[0143] An effect of the present disclosure is the provision of an approach in which, because streams are not downmixed, all participants experience the best quality and latency, not the quality and latency of the worst remote participant.
Further implementations
[0144] An example flowchart of a first method S100a of hosting a conference call (or 'call', or 'videoconference', or 'audio call', or 'video call') is shown in Fig. 9. The method may be performed by a computing system including, for example, any of the components of system 100 in Fig. 1. In particular, the method may be performed by a server such as the server 125.
[0145] At S105, the method comprises first, second and third user devices (or 'clients', or 'endpoints' ) 105, 110, 115 connecting to the conference call.
[0146] At S115, the method comprises receiving a respective location indication indicative of a location of each of the first, second and third user devices 105, 110, 115.
[0147] The location indication of a particular user device 105 may comprise an indication that a signal emitted by another device 110 has been captured by the particular user device 105, or a signal emitted by another device 110 and captured by the particular user device 105.
[0148] The signal emitted by the other device 110 may be an audio signal played by the other device 110. The signal emitted by the other device 110 may be an ultrasound signal, or may be audible. The signal emitted by the other device 110 may contain a unique signature of the other device 110.
[0149] The location indication of a particular user device 105 may comprise a list of other devices 110 detected by the particular user device. The other devices 110 may be detected using a wireless personal area network (WPAN) technology. The WPAN technology may be Bluetooth (RTM) or Institute of Electrical and Electronics Engineers (IEEE) 802.15.1. The list of other devices 110 may be a list of devices connected to a same local area network (LAN) as the particular user device 105.
[0150] Additionally or alternatively, the location indication of a particular user device 105 may comprise one or more of: one or more coordinates of the particular user device 105; a network address of the particular user device 105; or a text string identifying the location of the particular user device 105, such as 'In the office' or 'At HQ' or 'At home' or 'Working from home' .
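The location indication forms listed above could, for example, be carried as the following hypothetical union type (all names illustrative, not part of the disclosure).

```typescript
type LocationIndication =
  | { type: 'signalCaptured'; emitterId: string }    // heard another device's signal
  | { type: 'nearbyDevices'; deviceIds: string[] }   // WPAN/LAN neighbour list
  | { type: 'coordinates'; latitude: number; longitude: number }
  | { type: 'networkAddress'; address: string }
  | { type: 'label'; text: string };                 // e.g. 'In the office'
```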
[0151] At S120, the method comprises, based on the respective location indications, determining that the location of the first user device 105 and the location of the second user device 110 are the same and that the location of the third user device 115 is different to that of the first and second user devices 105, 110.
[0152] By 'the same', it may be meant that the first and second user devices 105, 110 are close enough to each other for a user of the first user device 105 to overhear sound played by the second user device 110 or for the first user device 105 to capture sound played by the second user device 110. However, the determination may be made with a (slightly) higher or lower level of precision, and therefore the determination that the locations of the first and second user devices 105, 110 are the same may instead be a determination that they are in a same room or building, for example. [0153] At S130, the method comprises receiving an audio frame captured by the third user device 115. The audio frame may be received from the third user device 115.
[0154] At S140, the method may comprise receiving a respective network connection quality indication (or 'connection quality indication' , or 'quality indication' , or 'health indication' , or 'network health indication' ) indicative of a network connection quality (or 'connection quality' , or 'health' ) of each of the first and second user devices 105, 110. The respective network connection quality indication may be indicative of a network connection quality between the system or device performing the method (such as the server 125) and the respective one of the first and second user devices 105, 110.
[0155] The network connection quality indication may comprise an indication of one or more of: a playout buffer status; packet loss; trip time; jitter; or latency. The packet loss, trip time, jitter or latency may be between the system or device performing the method and the respective one of the first and second user devices 105, 110. The packet loss, trip time, jitter or latency may be in both directions, or in one direction (from the system or device performing the method to the respective one of the first and second user devices 105, 110, or vice versa) . The trip time or latency may be a total trip time or latency for a transmission from the system or device performing the method to the respective one of the first and second user devices 105, 110, and back (from the respective one of the first and second user devices 105, 110 to the system or device performing the method) .
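As a sketch, the indication might be carried as the following hypothetical structure; field names are illustrative, and all fields are optional since the indication comprises one or more of them.

```typescript
interface NetworkConnectionQualityIndication {
  playoutBufferMs?: number;    // playout buffer status
  packetLossFraction?: number; // 0..1
  tripTimeMs?: number;         // round trip or one-way, as measured
  jitterMs?: number;
  latencyMs?: number;
}
```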
[0156] At S150, the method comprises, responsive to determining that the location of the first user device 105 and the location of the second user device 110 are the same, determining, for each of the first and second user devices 105, 110, a respective playback instruction for the audio frame.
[0157] The respective playback instruction may be determined to reduce unsynchronised playback of the audio frame at the first and second user devices 105, 110. There are two ways in which unsynchronised playback of the audio frame can be reduced: by synchronising playback of the audio frame, and/or by reducing the volume of unsynchronised playback of the audio frame.
[0158] The determining, for each of the first and second user devices 105, 110, of the respective playback instruction for the audio frame may be based on the respective network connection quality indication of the first and second user devices 105, 110.
[0159] The playback instruction for a particular audio frame and a particular user device 105, 110, 115, 120 may comprise one or more of: a delay to apply to playback of the particular audio frame on the particular user device 105, 110, 115, 120; an instruction to play the particular audio frame at the particular user device 105, 110, 115, 120; an instruction not to play the particular audio frame at the particular user device 105, 110, 115, 120; an instruction to reduce or increase a playback volume of the particular audio frame at the particular user device 105, 110, 115, 120; or an instruction to buffer the particular audio frame at the particular audio device 105, 110, 115, 120 and, optionally, a period during which to buffer before beginning playback at the particular audio device 105, 110, 115, 120.
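One hypothetical encoding of these playback instruction variants is sketched below (names illustrative).

```typescript
type PlaybackInstruction =
  | { action: 'play' }
  | { action: 'skip' }                       // do not play this frame
  | { action: 'delay'; delayMs: number }     // delay playback
  | { action: 'volume'; gain: number }       // <1 reduces, >1 increases volume
  | { action: 'buffer'; bufferMs?: number }; // optional buffering period
```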
[0160] At S160, the method may comprise sending, to each of the first and second user devices 105, 110, the audio frame captured by the third user device 115 and the respective playback instruction for the audio frame. The audio frame and playback instruction may be sent in a same message, or separately.
[0161] The method S100a may be repeated one or more times so that audio and, optionally, video transmission continues for the duration of the conference call. As a result, the method S100a may further comprise: subsequent to receiving the audio frame in S130, receiving one or more subsequent audio frames captured by the third user device 115 until a predetermined number of subsequent audio frames have been received; repeating the determining S150 of the respective playback instructions for the one or more subsequent audio frames; and sending, to each of the first and second user devices 105, 110, the one or more subsequent audio frames and the respective playback instruction for the one or more subsequent audio frames.
[0162] The method S100a may also further comprise, subsequent to receiving the respective network connection quality indications, receiving a respective subsequent network connection quality indication indicative of the network connection quality of each of the first and second user devices. The repeating of the determining of the respective playback instructions for at least one of the one or more subsequent audio frames may then be based on the subsequent network connection quality indications.
[0163] Fig. 10 shows an example flowchart of a method S100b of participating in a conference call, such as the conference call of method S100a shown in Fig. 9. The method may be performed by a computing device such as any of the user devices 105, 110, 115, 120 of Fig. 1.
[0164] At S102, a request to join a conference call is sent. The request may be sent from the user device 105, 110, 115, 120 performing the method to the device hosting the call, such as server 125.
[0165] At S110, a location indication of the user device 105, 110,
115, 120 is sent. The location indication may be sent to the device hosting the call, such as server 125.
[0166] At S125, an audio frame captured by the user device 105,
110, 115, 120 is sent. The audio frame may be sent to the device hosting the call, such as server 125. [0167] At S135, a network connection quality indication may be sent. The network connection quality indication may be sent to the device hosting the call, such as server 125.
[0168] At S165, an audio frame captured by another user device 105, 110, 115, 120 in the conference call is received. A playback instruction for the audio frame may also be received. The audio frame and playback instruction may be sent in a same message, or separately .
[0169] An example flowchart of a second method S200a of hosting a conference call is shown in Fig. 11. The method may be performed by a computing system including, for example, any of the components of Fig. 1. In particular, the method may be performed by a server such as the server 125.
[0170] As in method S100a, the method comprises, at S205, first, second and third user devices 105, 110, 115 connecting to the conference call. At S215, the method comprises receiving a respective location indication indicative of a location of each of the first, second and third user devices 105, 110, 115. At S220, the method comprises, based on the respective location indications, determining that the location of the first user device 105 and the location of the second user device 110 are the same and that the location of the third user device 115 is different to that of the first and second user devices 105, 110.
[0171] At S230, the method comprises receiving a respective audio frame captured by each of the first and second user devices 105, 110 and, optionally, a respective timestamp for each of the respective audio frames .
[0172] At S250, the method comprises, responsive to determining that the location of the first user device 105 and the location of the second user device 110 are the same, selecting one of the first and second user devices 105, 110 as a preferred user device based on the respective audio frames. [0173] Selecting one of the first and second user devices 105, 110 as a preferred user device may further be based on the respective timestamps for each of the respective audio frames.
[0174] Selecting one of the first and second user devices 105, 110 as a preferred user device may comprise: determining, based on the respective audio frames and, optionally, any respective subsequent audio frames, a respective audio signal captured by each of the first and second user devices 105, 110 over one or more given time intervals; and determining, for each of the first and second user devices 105, 110, one or more respective loudnesses of the respective audio signals over each of the one or more given time intervals. The selected preferred user device may be the user device for which the respective loudness, or the average of the respective loudnesses over each of the one or more given time intervals, is highest.
[0175] Selecting one of the first and second user devices 105, 110 as a preferred user device may further comprise determining a number of times each of the one or more respective loudnesses exceeds a predetermined loudness threshold. The selected preferred user device may be the user device for which the number of times is highest.
[0176] The loudness for a particular time interval may be based on a root mean square (RMS) of the respective audio signal over the particular time interval.
[0177] The loudness may be weighted over time, with higher weighting applied to more recent samples . The weighting may be exponentially higher.
[0178] The respective audio signal captured by each of the first and second user devices 105, 110 over one or more given time intervals may be determined based on a respective server transport latency for each of the first and second user devices 105, 110.
[0179] Selecting one of the first and second user devices 105, 110 as a preferred user device may comprise determining which of the respective audio frames and, optionally, any respective subsequent audio frames, contain speech.
[0180] At S260, the method comprises, responsive to determining that the location of the third user device 115 is different to that of the first and second user devices 105, 110, sending, to the third user device 115, the respective audio frame captured by the preferred user device. At S260, the method may further comprise refraining from sending, to the third user device 115, the respective audio frame(s) captured by the user devices other than the preferred user device - in other words, discarding the respective audio frame(s) captured by the user devices other than the preferred user device.
[0181] The method S200a may be repeated one or more times. As a result, the method S200a may further comprise: subsequent to receiving the respective audio frames, receiving one or more respective subsequent audio frames captured by each of the first and second user devices 105, 110 until a predetermined number of subsequent audio frames have been received, the selecting of the preferred user device being further based on the one or more respective subsequent audio frames; and sending, to the third user device, the one or more respective subsequent audio frames captured by the preferred user device.
[0182] Additionally or alternatively, the method S200a may further comprise: subsequent to receiving the respective audio frames, receiving one or more respective subsequent audio frames captured by each of the first and second user devices 105, 110; selecting a different one of the first and second user devices 105, 110 as a subsequent preferred user device based on the one or more respective subsequent audio frames; and sending, to the third user device 115, the one or more respective subsequent audio frames captured by the subsequent preferred user device. [0183] The respective audio frame and the one or more respective subsequent audio frames may be faded together. The method may additionally or alternatively further comprise sending, to the third user device 115, an instruction to fade the respective audio frame and the one or more respective subsequent audio frames together.
[0184] Fig. 12 shows an example flowchart of a method S200b of participating in a conference call, such as the conference call of method S200a shown in Fig. 11. The method may be performed by a computing device such as any of the user devices 105, 110, 115, 120 of Fig. 1.
[0185] At S202, a request to join a conference call is sent. The request may be sent from the user device 105, 110, 115, 120 performing the method to the device hosting the call, such as server 125.
[0186] At S210, a location indication of the user device 105, 110, 115, 120 is sent. The location indication may be sent to the device hosting the call, such as server 125.
[0187] At S225, an audio frame captured by the user device 105, 110, 115, 120 is sent. The audio frame may be sent to the device hosting the call, such as server 125.
[0188] At S265, an audio frame captured by another user device 105, 110, 115, 120 in the conference call is received. A playback instruction for the audio frame may also be received. The audio frame and playback instruction may be sent in a same message, or separately.
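As a hedged sketch only: the disclosure does not prescribe a wire format, but the S202/S210/S225/S265 exchange could be carried in messages such as the hypothetical JSON layout below, in which an audio frame and its playback instruction may share a message or be split across two. All field names are invented for the example.

```python
import json

# Hypothetical wire format; the disclosure does not prescribe one.
def join_message(call_id: str, device_id: str) -> bytes:
    """S202: request to join a conference call."""
    return json.dumps({"type": "join", "call": call_id,
                       "device": device_id}).encode()

def location_message(device_id: str, detected: list[str]) -> bytes:
    """S210: location indication, here as a list of detected devices."""
    return json.dumps({"type": "location", "device": device_id,
                       "detected_devices": detected}).encode()

def parse_downlink(payload: bytes) -> tuple[bytes, dict]:
    """S265: split a downlink message into the audio frame and, if
    bundled, its playback instruction; an empty dict means the
    instruction travels in a separate message."""
    message = json.loads(payload)
    return bytes.fromhex(message["frame_hex"]), message.get("playback", {})
```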
[0189] Any steps of methods S100a and S200a of hosting a conference call and methods S100b and S200b of participating in a conference call may be combined.
[0190] In particular, methods S100a and S200a may be combined. An example flowchart of such a combined method S300 of hosting a conference call is shown in Fig. 13. The method may be performed by a computing system including, for example, any of the components of Fig. 1. In particular, the method may be performed by a server such as the server 125.
[0191] At S305, the method comprises the first, second, third and fourth user devices 105, 110, 115, 120 connecting to the conference call.
[0192] At S315, the method comprises receiving a respective location indication indicative of a location of each of the first, second, third and fourth user devices 105, 110, 115, 120.
[0193] At S320, the method comprises, based on the respective location indications, determining that the location of the first user device 105 and the location of the second user device 110 are the same, first location, and that the location of the third user device 115 and the location of the fourth user device 120 are the same, second location. The second location is different to the first location.
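Purely as an illustration of S320: when each location indication reduces to a comparable key (for example the network address of claim 28 or the text string of claim 29), the grouping is a simple dictionary inversion; richer indications such as captured ultrasound signals or WPAN scan lists would need a matching step first. The function name and types are assumptions.

```python
from collections import defaultdict

def group_by_location(indications: dict[str, str]) -> dict[str, list[str]]:
    """Invert {device: location-key} into {location-key: [devices]} so
    that co-located devices fall into the same group."""
    rooms: dict[str, list[str]] = defaultdict(list)
    for device, location in indications.items():
        rooms[location].append(device)
    return dict(rooms)

# e.g. group_by_location({"d105": "room-A", "d110": "room-A",
#                         "d115": "room-B", "d120": "room-B"})
# -> {"room-A": ["d105", "d110"], "room-B": ["d115", "d120"]}
```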
[0194] At S330, the method comprises receiving a respective audio frame captured by each of the first and second user devices 105, 110.
[0195] At S340, the method may comprise receiving a respective network connection quality indication indicative of a network connection quality of each of the third and fourth user devices 115, 120.
[0196] At S350, the method comprises, responsive to determining that the location of the third user device 115 and the location of the fourth user device 120 are the same, determining, for each of the third and fourth user devices 115, 120, a respective playback instruction for the audio frame captured by the preferred user device. The determining, for each of the third and fourth user devices, of the respective playback instruction for the audio frame may then be based on the respective network connection quality indication of the third and fourth user devices 115, 120.
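As one hedged example of S340/S350 (the claims allow richer instructions, such as volume or buffering directives), a per-device playback delay can be derived from reported latencies so that co-located devices stay in sync; the function name and message shape are assumptions.

```python
def playback_instructions(latencies_ms: dict[str, float]) -> dict[str, dict]:
    """Delay devices on faster network paths so that all co-located
    devices begin playback together; the slowest path gets no delay."""
    slowest = max(latencies_ms.values())
    return {device: {"delay_ms": slowest - latency}
            for device, latency in latencies_ms.items()}
```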
[0197] At step S360, the method comprises, responsive to determining that the location of the first user device 105 and the location of the second user device 110 are the same, selecting one of the first and second user devices 105, 110 as a preferred user device based on the respective audio frames.
[0198] At step S370, the method comprises, responsive to determining that the location of the third user device 115 is different to that of the first and second user devices 105, 110, sending, to each of the third and fourth user devices 115, 120, the respective audio frame captured by the preferred user device and the respective playback instruction.
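Tying S305 to S370 together, one hypothetical server-side iteration might look as follows; it reuses the `playback_instructions` helper sketched above, and the scoring, data shapes and transport callback are assumptions rather than the claimed method itself.

```python
import numpy as np

def rms(frame: bytes) -> float:
    # Assumes well-formed 16-bit mono PCM; empty frames score zero.
    samples = np.frombuffer(frame, dtype=np.int16).astype(np.float64)
    return float(np.sqrt(np.mean(np.square(samples)))) if samples.size else 0.0

def route_audio(rooms: dict[str, list[str]],
                frames: dict[str, bytes],
                latencies_ms: dict[str, float],
                send) -> None:
    """One iteration: per room, pick the loudest capture device (S360),
    then fan its frame out to every other room (S370) with per-device
    playback instructions (S350)."""
    preferred = {room: max(devs, key=lambda d: rms(frames[d]))
                 for room, devs in rooms.items()}
    for src_room, src_dev in preferred.items():
        for dst_room, dst_devs in rooms.items():
            if dst_room == src_room:
                continue  # never send a room's own audio back into it
            instructions = playback_instructions(
                {d: latencies_ms[d] for d in dst_devs})
            for device in dst_devs:
                send(device, frames[src_dev], instructions[device])
```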
[0199] In any of methods S100a, S100b, S200a, S200b and S300, video frames may be sent or received in addition to audio frames. Thus, any of methods S100a, S200a and S300 may further comprise receiving a video frame captured by one 105 of the first, second, third or fourth user devices 105, 110, 115, 120 and sending, to another one 110, 115, 120 of the first, second, third or fourth devices 105, 110, 115, 120, the video frame captured by the one 105 of the first, second, third or fourth devices 105, 110, 115, 120. The playback instruction may then apply to the video frame as well as the relevant audio frame. However, because the audio and video frame rates may differ, the video frames may not be sent alongside the audio frames. Similarly, any of methods S100b and S200b may further comprise sending a video frame captured by the user device 105, 110, 115, 120. The video frame may be sent to the device hosting the call, such as server 125.
[0200] It will be understood that, because the steps of the methods of Figs. 2 to 13 may be split between multiple devices, at each device, only some steps of these methods need to be performed. The steps of these methods that are performed at each of these devices are to be understood as forming methods in their own right. Thus, as but one example, a method may comprise only steps S305, S315, S320, S330, S340, S360 and S370 if, for example, step S350 is performed at another device. Similarly, as the devices of Fig. 1 may be manufactured (and sold) separately, the user devices 105, 110, 115, 120 and the server 125 should be understood as being disclosed both separately and in combination with each other.

[0201] A block diagram of an example apparatus 1400 for implementing any of the methods described herein, or any portion thereof, such as method S100a, S100b, S200a, S200b or S300, is shown in Fig. 14. The exemplary apparatus 1400 may, for example, be used to implement any of user devices 105, 110, 115 or 120, or server 125.
[0202] The apparatus 1400 comprises a processor 1410 (e.g., a digital signal processor, or a multipurpose processor) arranged to execute computer-readable instructions as may be provided to the apparatus 1400 via one or more of a memory 1420, a communication interface 1430, or an input interface 1450.
[0203] The memory 1420, for example a random-access memory (RAM), is arranged to be able to retrieve, store, and provide to the processor 1410, instructions and data that have been stored in the memory 1420. The communication interface 1430 is arranged to enable the processor 1410 to communicate with other devices (e.g., to receive filters) and/or to communicate with a communications network, such as the Internet. The communication interface 1430 may be the external data communication interface described above. The input interface 1450 is arranged to receive user inputs provided via an input device (not shown) such as a mouse, a keyboard, or a touchscreen. The processor 1410 may further be coupled to a display adapter 1440, which is in turn coupled to a display device (not shown). The processor 1410 may further be coupled to an audio interface 1460 which may be used to capture and/or output audio signals, e.g., when the apparatus 1400 is used to implement any of user devices 105, 110, 115 or 120. The audio interface 1460 may comprise a digital-to-analog converter (DAC) (not shown), e.g., for use with audio devices with analog input(s).
[0204] The approaches described herein may be embodied on a computer-readable medium, which may be a non-transitory computer-readable medium. The computer-readable medium carries computer-readable instructions arranged for execution upon a processor so as to make the processor carry out any or all of the methods described herein.

Interpretation
[0205] Section titles are provided above to ease understanding of the disclosure, and are not to be construed as limiting the scope of the disclosure.
[0206] It will be appreciated that, although various approaches above may be implicitly or explicitly described as optimal, engineering involves trade-offs and so an approach which is optimal from one perspective may not be optimal from another. Furthermore, approaches which are slightly suboptimal may nevertheless be useful. As a result, both optimal and suboptimal approaches should be considered as being within the scope of the present disclosure.
[0207] It will be appreciated that the steps of the methods described herein may be performed concurrently. For example, the audio frame of S130 and the connection quality indication of S140 may be received concurrently. Furthermore, when any of the methods described herein are repeated, portions of the second iteration of the method may be performed whilst the first iteration of the method is still being performed. For example, a second audio frame may be received in S130 while the first audio frame received in S130 is still being sent in S160.
[0208] It will also be appreciated that, unless otherwise indicated (either explicitly or due to the dependencies of a particular step), the steps of the methods described herein may be performed in any order. As but one example, S140 may be performed before S130.
[0209] It will also be appreciated that, unless otherwise indicated, any of the features of the present disclosure may be combined, even if such a combination is not explicitly recited. For example, any of the features described in 'Office echo suppression' and 'Office playout synchronisation' may be used as part of methods S100a, S100b, S200a, S200b, and S300, and in particular as part of S150 or S350. As another example, any of the features described in 'Selection of office audio stream' may be used as part of methods S100a, S100b, S200a, S200b, and S300, and in particular as part of S250 or S360. As yet another example, any of the features described in 'Detection of same location' may be used as part of methods S100a, S100b, S200a, S200b, and S300, and in particular as part of S120, S220 or S320. As yet another example, any of the features described as part of S100a, S100b, S200a, S200b may be used as part of method S300 (for example, features described as part of S105 or S205 may be used as part of S305, features described as part of S115 or S215 may be used as part of S315, features described as part of S120 or S220 may be used as part of S320, features described as part of S130 or S230 may be used as part of S330, features described as part of S140 may be used as part of S340, features described as part of S150 may be used as part of S350, features described as part of S250 may be used as part of S360, features described as part of S160 or S260 may be used as part of S370, etc.).
[0210] The term "computer-readable medium" as used herein refers to any medium that stores data and/or instructions for causing a processor to operate in a specific manner. Such storage medium may comprise non-volatile media and/or volatile media. Non-volatile media may include, for example, optical or magnetic disks. Volatile media may include dynamic memory. Exemplary forms of storage medium include a floppy disk, a flexible disk, a hard disk, a solid state drive, a magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with one or more patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, and any other memory chip or cartridge.
[0211] Those skilled in the art will recognise that a wide variety of modifications, alterations, and combinations can be made with respect to the examples described herein without departing from the scope of the disclosed concepts. Those skilled in the art will thus recognise that the scope of the invention is not limited by the examples described herein, but is instead defined by the appended claims.

Claims

1. A method of hosting a conference call, the method comprising: connecting first, second and third user devices to the conference call; receiving a respective location indication indicative of a location of each of the first, second and third user devices; based on the respective location indications, determining that the location of the first user device and the location of the second user device are the same and that the location of the third user device is different to that of the first and second user devices; receiving an audio frame captured by the third user device; responsive to determining that the location of the first user device and the location of the second user device are the same, determining, for each of the first and second user devices, a respective playback instruction for the audio frame; and sending, to each of the first and second user devices, the audio frame captured by the third user device and the respective playback instruction for the audio frame.
2. The method of claim 1, further comprising: subsequent to receiving the audio frame, receiving one or more subsequent audio frames captured by the third user device until a predetermined number of subsequent audio frames have been received; repeating the determining of the respective playback instructions for the one or more subsequent audio frames; and sending, to each of the first and second user devices, the one or more subsequent audio frames and the respective playback instruction for the one or more subsequent audio frames.
3. The method of any of claims 1 to 2, further comprising: receiving a respective network connection quality indication indicative of a network connection quality of each of the first and second user devices, wherein the determining, for each of the first and second user devices, of the respective playback instruction for the audio frame is based on the respective network connection quality indication of the first and second user devices.
4. The method of claim 3 when dependent on claim 2, further comprising: subsequent to receiving the respective network connection quality indications, receiving a respective subsequent network connection quality indication indicative of the network connection quality of each of the first and second user devices, wherein the repeating of the determining of the respective playback instructions for at least one of the one or more subsequent audio frames is based on the subsequent network connection quality indications.
5. A method of hosting a conference call, the method comprising: connecting first, second and third user devices to the conference call; receiving a respective location indication indicative of a location of each of the first, second and third user devices; based on the respective location indications, determining that the location of the first user device and the location of the second user device are the same and that the location of the third user device is different to that of the first and second user devices; receiving a respective audio frame captured by each of the first and second user devices; responsive to determining that the location of the first user device and the location of the second user device are the same, selecting one of the first and second user devices as a preferred user device based on the respective audio frames; responsive to determining that the location of the third user device is different to that of the first and second user devices, sending, to the third user device, the respective audio frame captured by the preferred user device.
6. The method of claim 5, further comprising: subsequent to receiving the respective audio frames, receiving one or more respective subsequent audio frames captured by each of the first and second user devices until a predetermined number of subsequent audio frames have been received, wherein the selecting of the preferred user device is further based on the one or more respective subsequent audio frames; and sending, to the third user device, the one or more respective subsequent audio frames captured by the preferred user device.
7. The method of claim 5, further comprising: subsequent to receiving the respective audio frames, receiving one or more respective subsequent audio frames captured by each of the first and second user devices; selecting a different one of the first and second user devices as a subsequent preferred user device based on the one or more respective subsequent audio frames; and sending, to the third user device, the one or more respective subsequent audio frames captured by the subsequent preferred user device.
8. The method of claim 7, wherein the respective audio frame and the one or more respective subsequent audio frames are faded together, or further comprising sending, to the third user device, an instruction to fade the respective audio frame and the one or more respective subsequent audio frames together.
9. The method of any of claims 5 to 8, further comprising: connecting a fourth user device to the conference call; receiving a respective location indication indicative of a location of the fourth user device; based on the location indications of the third and fourth user devices, determining that the location of the third user device and the location of the fourth user device are the same; responsive to determining that the location of the third user device and the location of the fourth user device are the same, determining, for each of the third and fourth user devices, a respective playback instruction for the audio frame captured by the preferred user device; sending, to the third user device, the respective playback instruction for the audio frame; and sending, to the fourth user device, the audio frame captured by the preferred user device and the respective playback instruction for the audio frame.
10. The method of claim 9, further comprising: receiving a respective network connection quality indication indicative of a network connection quality of each of the third and fourth user devices, wherein the determining, for each of the third and fourth user devices, of the respective playback instruction for the audio frame is based on the respective network connection quality indication of the third and fourth user devices.
11. The method of any of claims 1 to 4 or 9 to 10, wherein the playback instruction for a particular audio frame and a particular user device comprises one or more of: a delay to apply to playback of the particular audio frame on the particular user device; an instruction to play the particular audio frame at the particular user device; an instruction not to play the particular audio frame at the particular user device; an instruction to reduce or increase a playback volume of the particular audio frame at the particular user device; or an instruction to buffer the particular audio frame at the particular user device and, optionally, a period during which to buffer before beginning playback at the particular user device.
12. The method of any of claims 3, 4 or 10, wherein the network connection quality indication comprises an indication of one or more of: a playout buffer status; packet loss; trip time; jitter; or latency.
13. The method of any of claims 5 to 10, wherein selecting one of the first and second user devices as a preferred user device comprises: determining, based on the respective audio frames and, optionally, the respective subsequent audio frames, a respective audio signal captured by each of the first and second user devices over one or more given time intervals; and determining, for each of the first and second user devices, one or more respective loudnesses of the respective audio signals over each of the one or more given time intervals.
14. The method of claim 13, wherein selecting one of the first and second user devices as a preferred user device further comprises determining a number of times each of the one or more respective loudnesses exceeds a predetermined loudness threshold.
15. The method of any of claims 13 to 14, wherein the loudness for a particular time interval is based on a root mean square (RMS) of the respective audio signal over the particular time interval.
16. The method of any of claims 13 to 15, wherein the loudness is weighted over time, with higher weighting applied to more recent samples, optionally wherein the weighting is exponentially higher.
17. The method of any of claims 13 to 16, wherein the respective audio signal captured by each of the first and second user devices over one or more given time intervals is determined based on a respective server transport latency for each of the first and second user devices.
18. The method of any of claims 5 to 10 or 13 to 17, wherein selecting one of the first and second user devices as a preferred user device comprises determining which of the respective audio frames and, optionally, the respective subsequent audio frames, contain speech.
19. The method of any preceding claim, further comprising: receiving a video frame captured by one of the first, second or third user devices; and sending, to another one of the first, second or third devices, the video frame captured by the one of the first, second or third devices.
20. The method of claim 19 when dependent on any of claims 1 or 9, wherein the respective playback instruction is for the audio frame and the video frame.
21. The method of any preceding claim, wherein the location indication of a particular user device comprises: an indication that a signal emitted by another device has been captured by the particular user device; or a signal emitted by another device and captured by the particular user device.
22. The method of claim 21, wherein the signal emitted by the other device is an audio signal played by the other device.
23. The method of any of claims 21 to 22, wherein the signal emitted by the other device is an ultrasound signal.
24. The method of any preceding claim, wherein the location indication of a particular user device comprises a list of other devices detected by the particular user device.
25. The method of claim 24, wherein the other devices are detected using a wireless personal area network (WPAN) technology, the WPAN technology optionally being Bluetooth (RTM) or Institute of Electrical and Electronics Engineers (IEEE) 802.15.1.
26. The method of any of claims 24 to 25, wherein the list of other devices is a list of devices connected to a same local area network (LAN) as the particular user device.
27. The method of any preceding claim, wherein the location indication of a particular user device comprises one or more coordinates of the particular user device.
28. The method of any preceding claim, wherein the location indication of a particular user device comprises a network address of the particular user device.
29. The method of any preceding claim, wherein the location indication of a particular user device comprises a text string identifying the location of the particular user device.
30. The method of any preceding claim, wherein the method is performed on a server.
31. Apparatus configured to perform the method of any preceding claim.
32. A computer program comprising instructions which, when executed by one or more processors of a computing system, cause the computing system to perform the method of any of claims 1 to 30.
PCT/EP2023/078967 2022-10-20 2023-10-18 Method and apparatus for hosting a conference call WO2024083906A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2215561.8 2022-10-20
GBGB2215561.8A GB202215561D0 (en) 2022-10-20 2022-10-20 Conference calling system

Publications (1)

Publication Number Publication Date
WO2024083906A1 true WO2024083906A1 (en) 2024-04-25

Family

ID=84818620

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2023/078967 WO2024083906A1 (en) 2022-10-20 2023-10-18 Method and apparatus for hosting a conference call

Country Status (2)

Country Link
GB (1) GB202215561D0 (en)
WO (1) WO2024083906A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3049949A1 (en) * 2013-11-07 2016-08-03 Microsoft Technology Licensing, LLC Call handling
EP3253039A1 (en) * 2016-05-31 2017-12-06 Vonage Business Inc. Systems and methods for mitigating and/or avoiding feedback loops during communication sessions
US20210058517A1 (en) * 2019-08-19 2021-02-25 Teamport Inc. Multiple device conferencing with improved destination playback

Also Published As

Publication number Publication date
GB202215561D0 (en) 2022-12-07

Similar Documents

Publication Publication Date Title
US11695875B2 (en) Multiple device conferencing with improved destination playback
US10708436B2 (en) Normalization of soundfield orientations based on auditory scene analysis
US10356362B1 (en) Controlling focus of audio signals on speaker during videoconference
JP4672701B2 (en) How to adjust co-located teleconference endpoints to avoid feedback
EP2735120A2 (en) Processing audio signals
US9749474B2 (en) Matching reverberation in teleconferencing environments
TW201711453A (en) Methods and systems for virtual conference system using personal communication devices
WO2018045453A1 (en) Method, apparatus and computer-readable media utilizing positional information to derive agc output parameters
EP2441072A1 (en) Audio processing
US10297266B1 (en) Adaptive noise cancellation for multiple audio endpoints in a shared space
US9014058B2 (en) Enhancement of audio conference productivity through gain biasing
US9325853B1 (en) Equalization of silence audio levels in packet media conferencing systems
CN104580764A (en) Ultrasound pairing signal control in teleconferencing system
CN111951813A (en) Voice coding control method, device and storage medium
WO2024083906A1 (en) Method and apparatus for hosting a conference call
US10985850B1 (en) Media distribution between electronic devices for low-latency applications
US20220415299A1 (en) System for dynamically adjusting a soundmask signal based on realtime ambient noise parameters while maintaining echo canceller calibration performance
US10789935B2 (en) Mechanical touch noise control
US20240121342A1 (en) Conference calls
EP4300918A1 (en) A method for managing sound in a virtual conferencing system, a related system, a related acoustic management module, a related client device
WO2024160496A1 (en) Proximity-based audio conferencing
EP4256814A1 (en) Insertion of forced gaps for pervasive listening
EP4292272A1 (en) Echo reference generation and echo reference metric estimation according to rendering information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23793321

Country of ref document: EP

Kind code of ref document: A1