GB2623548A - Hybrid Teleconference platform

Info

Publication number
GB2623548A
Authority
GB
United Kingdom
Prior art keywords
audio
cluster
participant
participants
component
Prior art date
Legal status
Pending
Application number
GB2215456.1A
Other versions
GB202215456D0 (en)
Inventor
Aambo Fosstveit Havar
Bovens Andreas
Tyra Andrew
Current Assignee
Whereby AS
Original Assignee
Whereby AS
Priority date
Filing date
Publication date
Application filed by Whereby AS
Priority to GB2215456.1A
Publication of GB202215456D0
Publication of GB2623548A
Legal status: Pending

Classifications

    • H04M 3/002: Applications of echo suppressors or cancellers in telephonic connections
    • H04M 3/56: Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M 3/567: Multimedia conference systems
    • H04M 3/568: Audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • H04M 3/569: Audio processing specific to telephonic conferencing using the instant speaker's algorithm
    • H04M 2203/5072: Aspects of automatic or semi-automatic exchanges related to audio conference; multiple active speakers
    • H04L 65/403: Arrangements for multi-party communication, e.g. for conferences
    • H04L 65/611: Network streaming of media packets for multicast or broadcast

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Teleconferencing between a plurality of participants (P1-P6) comprises assigning each participant to one of a plurality of clusters (12, 14, 16), wherein at least one audio capture component and/or at least one audio output component is associated with the clusters and/or participants. An audio stream associated with one of the clusters is captured using at least one of the audio capture components and is broadcast to at least one of the other clusters using at least one of the audio output components. The captured audio stream is not broadcast from any of the audio output components associated with the cluster from which it was captured. A further aspect of the invention provides identifying a single audio capture component from amongst the audio capture components of a cluster as a primary audio capture component based on the amplitudes of the audio streams within a predetermined frequency range, the range having a lower bound above 2 kHz, and broadcasting the audio stream captured by the primary audio capture component. Another aspect of the invention assigns the participants of the teleconference to clusters indicative of the participants' physical proximity to one another.

Description

HYBRID TELECONFERENCE PLATFORM
The present invention relates to hybrid teleconferencing, and particularly to hybrid videoconferencing.
Hybrid teleconferencing refers to the situation where some participants of a teleconference are located physically close to one another, and one or more other participants are located remotely.
In one example, with reference to Figure 1, participants P1, P2 and P3 are located at a first physical location, participants P4 and P5 are located at a second physical location, and participant P6 is located at a third physical location. However, all six participants wish to communicate with one another by teleconference. Historically, this is a situation that has caused great difficulty.
Figure 16 illustrates some of these difficulties in a situation where each participant is individually connected to the teleconference, i.e. each participant having their own microphone and loudspeaker. Figure 16 illustrates an example in which participant P1 is speaking.
The solid, purple arrows illustrate that participants P2 and P3 can each directly hear participant P1 speaking. However, the dotted blue lines illustrate that participants P2 and P3 can also hear participant P1 speaking via the teleconference, i.e. captured by the microphone of participant P1 and broadcast by the loudspeakers of participants P2 and P3. This will create an echo within the room for each of participants P2 and P3.
The microphones of participants P2 and P3 may also capture participant P1 speaking. This will then be broadcast to other members of the teleconference as indicated by the yellow dotted lines, in addition to the audio captured by the microphone of participant P1 indicated by the blue dotted line. This will create an echo within the teleconference, as each of participants P2 and P3 will transmit a delayed version of participant P1 speaking.
The microphones of participants P2 and P3 may further capture the audio broadcast by the loudspeakers of one another. Similarly, the microphones of participants P4 and P5 may capture the audio broadcast by the loudspeakers of one another. This will create a further echo within the teleconference, as each of participants P2, P3, P4 and P5 will transmit a delayed version of participant P1 speaking.
Finally, the microphone of participant P1 may capture the audio broadcast by the loudspeakers of each of participants P2 and P3. This will create an echo within the audio captured by the microphone of participant P1.
As can be seen, these various interactions can generate a significant amount of noise and sometimes even a feedback loop during a hybrid teleconference situation, making it difficult for all members of the teleconference to clearly understand what is being said. This noise is of course amplified in the situation where multiple participants are speaking.
Figure 17 shows a common solution to the problem of hybrid teleconferencing. In this example, a single microphone and loudspeaker is used for each physical location. The single microphone captures speech from all participants within that location, and the single loudspeaker broadcasts audio for all participants within that location. This solution can be effective, but still causes difficulties.
Firstly, the solution usually requires a dedicated teleconferencing facility with specialist equipment; often, a mobile phone or laptop computer does not have a suitable microphone and/or loudspeaker for this purpose. Thus, this solution is both expensive and requires preparation of the meeting space before the teleconference.
Additionally, it can still cause audio issues within the meeting. For example, the participants may be located at different distances from the microphone, meaning that they appear to be speaking at different volumes for other participants. Additionally, some participants may simply speak more loudly or softly than one another. However, as all participants use the same microphone, their speech volumes cannot be individually moderated.
Likewise, if the participants are located at different distances from the loudspeaker, then those participants close to the loudspeaker will hear a speaking participant more loudly than those located more remotely from the loudspeaker. Thus, it may be difficult to select a suitable volume for the single loudspeaker that facilitates comfortable teleconferencing of all participants at a particular location.
Other solutions to the problem of hybrid teleconferences have also been proposed. For example, with reference again to Figure 16, it has been suggested to utilise machine learning or the like to identify and remove feedback signals. However, to date, attempted machine learning solutions have not been sufficient to remedy the feedback problem arising in hybrid teleconference situations.
Additionally, attempts to utilise such machine learning solutions have introduced significant latency into the teleconference, which makes natural communication between participants difficult. This is particularly pronounced in hybrid teleconference situations where some participants are in the same physical location and so can communicate directly, i.e. with zero latency.
A need therefore exists for an improved teleconferencing platform.
In a first aspect, the present invention provides a method of teleconferencing between a plurality of participants, the method comprising: assigning each participant to one of a plurality of clusters, wherein a first cluster of the plurality of clusters includes at least two participants, each having an audio capture component and an audio output component, and wherein a second, different cluster of the plurality of clusters includes at least one participant having an audio output component; capturing an audio stream associated with the first cluster using at least one of the audio capture components associated with the first cluster; and broadcasting the audio stream associated with the first cluster from each audio output component associated with the second cluster; wherein the audio stream associated with the first cluster is not broadcast from any of the audio output components associated with the first cluster.
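As a minimal sketch of this routing rule, assuming a hypothetical server-side data model in which each participant record carries a cluster identifier (the names `Participant` and `selectBroadcastTargets` are illustrative and not part of the disclosure), the selection of receiving audio output components might look like this:

```typescript
// Illustrative data model: each participant belongs to exactly one cluster.
interface Participant {
  id: string;
  clusterId: string;
  hasAudioOutput: boolean;
}

// Decide which participants should receive (and broadcast) an audio stream
// that was captured in `sourceClusterId`. The stream is never routed back to
// audio output components in the cluster it was captured from.
function selectBroadcastTargets(
  participants: Participant[],
  sourceClusterId: string,
): Participant[] {
  return participants.filter(
    (p) => p.hasAudioOutput && p.clusterId !== sourceClusterId,
  );
}

// Example: audio captured in cluster "first" is sent only to clusters
// "second" and "third"; P1-P3 in "first" hear the speaker directly.
const targets = selectBroadcastTargets(
  [
    { id: "P1", clusterId: "first", hasAudioOutput: true },
    { id: "P2", clusterId: "first", hasAudioOutput: true },
    { id: "P4", clusterId: "second", hasAudioOutput: true },
    { id: "P6", clusterId: "third", hasAudioOutput: true },
  ],
  "first",
);
console.log(targets.map((p) => p.id)); // ["P4", "P6"]
```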
Capturing the audio stream associated with the first cluster may comprise: identifying a single audio capture component from amongst the audio capture components associated with a first cluster as a primary audio capture component; and capturing the audio stream associated with the first cluster using only the primary audio capture component.
Which audio capture component is the primary audio capture component may be periodically re-assessed, for example at least every second, or at least every 500 ms, or at least every 300 ms, or at least every 200 ms, or at least every 100 ms.
The primary audio capture component may be identified based on an amplitude of an audio stream captured by each of the audio capture components associated with a first cluster.
The primary audio capture component may be selected based on the audio stream having the maximum peak and/or average amplitude within a period of time.
The primary audio capture component may be identified based on the amplitudes of the audio streams within a predetermined frequency range, wherein the predetermined frequency range has a lower bound above 1.5 kHz, or above 2 kHz, or above 2.2 kHz, or above 2.3 kHz. The predetermined frequency range may have no upper bound, or may have an upper bound below 30 kHz, or below 20 kHz, or below 10 kHz.
The second cluster may comprise at least two participants, each having an audio output component. The audio stream associated with the first cluster may be broadcast from each audio output component associated with the second cluster at a time delay associated with that audio output component. The time delays may be such that the audio stream is broadcast from each of the audio output components associated with the second cluster at substantially the same time. For example, a delay or drift between the broadcast audio streams may be below 50 ms, or below 25 ms, or below 15 ms.
At least one participant in the second cluster may have an audio capture component. The method may further comprise, for each audio output component associated with the second cluster: generating a synchronising audio signal using the respective audio output component; detecting the synchronising audio signal using the at least one audio capture component associated with the second cluster; and determining a time delay associated with the respective audio output component based on a time when the synchronising audio signal is detected.
The synchronising audio signal may be substantially outside of the human audible frequency range. For example, the synchronising audio signal may comprise a frequency above 15 kHz, or above 18 kHz, or above 20 kHz, or above 22 kHz. The synchronising audio signal may include substantially no sound at frequencies below 10 kHz, or below 15 kHz, or below 18 kHz, or below 20 kHz.
The audio streams broadcast from the audio capture components may be filtered to remove any synchronising audio signals. For example, the method may further comprise: capturing an audio stream associated with the second cluster using the at least one audio capture component associated with the second cluster; applying a filter to the audio stream associated with the second cluster to create a filtered audio stream associated with the second cluster that does not include the synchronising audio signal; and broadcasting the filtered audio stream associated with the second cluster from each audio output component associated with the first cluster. Optionally, the filtering may be applied only when a synchronising audio signal is generated. Alternatively, the filtering may be applied continuously during the teleconference.
Each participant may have an audio capture component and an audio output component. Assigning each participant to one of a plurality of clusters may comprise, for each participant: generating an identifying audio signal using the audio output component associated with the respective participant.
When the identifying audio signal is not detected by the audio capture components associated with any of the plurality of participants other than the respective participant, the respective participant may be assigned to a new cluster.
When the identifying audio signal is detected by an audio capture component associated with one or more of the plurality of participants other than the respective participant, the respective participant may be assigned to an existing cluster including the one or more of the plurality of participants that detected the identifying audio signal.
The assigning process may be performed automatically. The assigning process may be performed automatically when the respective participant joins the teleconference. The assigning process may be automatically performed intermittently, and optionally periodically. For example, the process may be performed at least every minute, at least every 30 seconds, or at least every 10 seconds.
The assigning process may be performed in response to a command from the respective participant, or from another participant of the teleconference, such as a participant having administrative rights in respect of the teleconference.
The identifying audio signal optionally may be substantially outside of the human audible frequency range. For example, the identifying audio signal may comprise a frequency above 15 kHz, or above 18 kHz, or above 20 kHz, or above 22 kHz. The identifying audio signal may include substantially no sound at frequencies below 10 kHz, or below 15 kHz, or below 18 kHz, or below 20 kHz.
The identifying audio signal may uniquely identify the participant or the audio output component of the participant. This may be achieved by the timing of the broadcasting of the identifying audio signal, i.e. such that only a single identifying audio signal is generated at a time. Alternatively, a frequency of the identifying audio signal or audio data contained in the identifying audio signal may uniquely identify the participant or the audio output component of the participant.
Additionally, or alternatively, assigning each participant to one of a plurality of clusters may comprise, for each participant: receiving an input from the respective participant.
When the input indicates the respective participant wishes to join a new cluster or indicates the respective participant is not in the same location as any of the plurality of participants other than the respective participant, the respective participant may be assigned to a new cluster.
Alternatively, when the input indicates the respective participant is not in the same location as any of the plurality of participants other than the respective participant, the respective participant may be designated as a solitary participant.
When the input indicates the respective participant wishes to join an indicated existing cluster or indicates the respective participant is in the same location as an indicated one of the plurality of participants, the respective participant may be assigned to the indicated existing cluster or to an existing cluster including the indicated participant.
The assigning process may be performed automatically when the respective participant joins the teleconference. The assigning process may be performed in response to a command from a participant of the teleconference, such as a participant having administrative rights in respect of the teleconference.
Assigning each participant to one of a plurality of clusters may be based on an input received from a participant having administrative rights in respect of the teleconference.
When the input indicates the respective participant should be assigned to a new cluster or indicates the respective participant is not in the same location as any of the plurality of participants other than the respective participant, the respective participant may be assigned to a new cluster.
Alternatively, when the input indicates the respective participant is not in the same location as any of the plurality of participants other than the respective participant, the respective participant may be designated as a solitary participant.
When the input indicates the respective participant should be assigned to an indicated existing cluster or indicates the respective participant is in the same location as an indicated one of the plurality of participants, the respective participant may be assigned to the indicated existing cluster or to an existing cluster including the indicated participant.
Each participant may have an audio capture component. One or more of the audio capture components may comprise a microphone. Each participant may have an audio output component. One or more of the audio output components may comprise a loudspeaker or headphones.
One or more or each participant may have a user device. The user device may be a networked computing device. The user device may be a portable device. The user device may be a personal device in the possession of the user, i.e. not a communal device that is shared between other people. The user device may be a mobile phone, a laptop, a desktop computer. The user device may comprise (e.g. include as an integrated component) or be in communication with (e.g. a wired or wireless connection to a peripheral component) an audio capture component and an audio output component. Optionally, the user device may comprise or be in communication with a video capture component and/or a video output component.
The teleconference may be a videoconference, or may be an augmented reality teleconference or a virtual reality teleconference. Thus, one or more or each participant may have a video display device. Additionally, one or more or each participant may have a video or motion capture device.
Video or motion data captured for each participant may be transmitted to each other participant. The video or motion data may be displayed on the video display devices of the other participants. Optionally, only the video or motion data for a subset of the participants may be transmitted and/or displayed, for example if the recipient participant selects to limit the number of participants displayed, or to hide one or more participants, or to display only specific participants, e.g. a speaking participant or only participants in a specific cluster.
The following second to fourth aspects of the invention may include any features of the first aspect, and/or any other optional features described in respect of the first aspect.
In a second aspect, the present invention provides a method of teleconferencing between a plurality of participants, the method comprising: assigning each participant to one of a plurality of clusters, wherein a first cluster of the plurality of clusters includes at least two participants, each having an audio capture component, and wherein a second, different cluster of the plurality of clusters includes at least one participant having an audio output component; capturing an audio stream using each of the audio capture components associated with the first cluster; identifying a single audio capture component from amongst the audio capture components associated with the first cluster as a primary audio capture component based on an amplitude of the audio streams within a predetermined frequency range, wherein the predetermined frequency range has a lower bound above 2 kHz; and broadcasting the audio stream captured by the primary audio capture component from each audio output component associated with the second cluster.
In a third aspect, the present invention provides a method of teleconferencing between a plurality of participants, the method comprising: assigning each participant to one of a plurality of clusters, wherein a first cluster of the plurality of clusters includes at least one participant having an audio capture component, and wherein a second, different cluster of the plurality of clusters includes a plurality of participants, at least two having an audio output component and at least one having an audio capture component; for each audio output component associated with the second cluster: generating a synchronising audio signal using the respective audio output component, wherein the synchronising audio signal is substantially outside of the human audible frequency range; detecting the synchronising audio signal using the at least one audio capture component associated with the second cluster; and determining a time delay associated with the respective audio output component based on a time when the synchronising audio signal is detected; and broadcasting an audio stream associated with the first cluster from each audio output component associated with the second cluster, wherein the audio stream is broadcast from each audio output component associated with the second cluster at the respective time delay associated with that audio output component, such that the audio stream is broadcast from each audio output component at the same time.
A delay or drift between the broadcast audio streams may be below 50 ms, or below 25 ms, or below 15 ms.
The synchronising audio signal may be substantially outside of the human audible frequency range. For example, the synchronising audio signal may comprise a frequency above 15 kHz, or above 18 kHz, or above 20 kHz, or above 22 kHz. The synchronising audio signal may include substantially no sound at frequencies below 10 kHz, or below 15 kHz, or below 18 kHz, or below 20 kHz.
Which audio capture component is the primary audio capture component may be intermittently or periodically re-assessed, for example at least every second, or at least every 500 ms, or at least every 300 ms, or at least every 200 ms, or at least every 100 ms.
Which audio capture component is the primary audio capture component may be re-assessed in response to a command from a participant of the teleconference, such as a participant having administrative rights in respect of the teleconference.
A single audio signal may provide both the identifying audio signal and the synchronising audio signal for a participant.
The method may further comprise: capturing an audio stream associated with the second cluster using the at least one audio capture component associated with the second cluster; applying a filter to the audio stream associated with the second cluster to create a filtered audio stream associated with the second cluster that does not include the synchronising audio signal; and broadcasting the filtered audio stream associated with the second cluster from each audio output component associated with the first cluster.
In a fourth aspect, the present invention provides a method of assigning participants of a teleconference to a cluster indicative of physical proximity, each participant having an audio capture component and an audio output component, the method comprising: for each participant: generating an identifying audio signal using the audio output component associated with the respective participant; when the identifying audio signal is not detected by the audio capture components associated with any of the plurality of participants other than the respective participant, assigning the respective participant to a new cluster; and when the identifying audio signal is detected by an audio capture component associated with one or more of the plurality of participants other than the respective participant, assigning the respective participant to an existing cluster including the one or more of the plurality of participants that detected the identifying audio signal.
The identifying audio signal may be substantially outside of the human audible frequency range. For example, the identifying audio signal may comprise a frequency above 15 kHz, or above 18 kHz, or above 20 kHz, or above 22 kHz. The identifying audio signal may include substantially no sound at frequencies below 10 kHz, or below 15 kHz, or below 18 kHz, or below 20 kHz.
In a fifth aspect, the present invention provides a system configured to implement any of the methods described above.
The system may comprise an internet browser, which may be based on the W3C standards that define an Open Web Platform. The internet browser may operate on a user device, which may be associated with a participant of the teleconference. The system may comprise a plurality of such user devices, each including an internet browser operating thereon.
The system may comprise a server, such as a cloud server. The server may be in communication with each user device, for example via the internet.
The methods described in the first to fourth aspects may be implemented using computer programs composed of computer-readable instructions for execution by a computer. The computer programs may be configured for execution on a server, which may be capable of communication with one or more remote user devices, for example via internet browsers on the user device or via dedicated teleconference applications on the user device.
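By way of a non-authoritative illustration only, a browser-based client of such a server might capture its local audio and open a signalling connection roughly as follows; the WebSocket URL, the message format and the `joinConference` helper are assumptions made for the sketch, not details taken from this disclosure:

```typescript
// Minimal browser-side sketch (illustrative only): capture local audio and
// notify a conference server that the participant is joining.
async function joinConference(roomId: string, participantId: string) {
  // Audio capture component: the device microphone, via the browser API.
  const micStream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // Signalling channel to the (e.g. cloud) server; the URL is a placeholder.
  const ws = new WebSocket("wss://example.invalid/conference");
  ws.addEventListener("open", () => {
    ws.send(JSON.stringify({ type: "join", roomId, participantId }));
  });

  // Media transport (e.g. WebRTC) would be negotiated over this channel;
  // omitted here for brevity.
  return { micStream, ws };
}
```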
The following sixth to ninth aspects of the invention may include any features of the first to fourth aspects, respectively, and/or any other optional features described in respect of those aspects.
In a sixth aspect, the present invention provides a computer program or a tangible computer-readable medium storing a computer program, wherein the computer program is for teleconferencing between a plurality of participants, the computer program comprising instructions that, when executed, will cause a networked computer to perform a method comprising: communicating with a plurality of local device applications, each operating on a local device associated with a participant of the teleconference; assigning each local device to one of a plurality of clusters, wherein a first cluster of the plurality of clusters includes at least two local devices, each having an audio capture component and an audio output component, and wherein a second, different cluster of the plurality of clusters includes at least one local device having an audio output component; receiving an audio stream associated with the first cluster, the audio stream having been captured using at least one of the audio capture components associated with the first cluster; and transmitting the audio stream associated with the first cluster for broadcasting from each audio output component associated with the second cluster; wherein the audio stream associated with the first cluster is not transmitted for broadcasting from any of the audio output components associated with the first cluster.
In a seventh aspect, the present invention provides a computer program or a tangible computer-readable medium storing a computer program, wherein the computer program is for teleconferencing between a plurality of participants, the computer program comprising instructions that, when executed, will cause a networked computer to perform a method comprising: communicating with a plurality of local device applications, each operating on a local device associated with a participant of the teleconference; assigning each local device to one of a plurality of clusters, wherein a first cluster of the plurality of clusters includes at least two local devices, each having an audio capture component, and wherein a second, different cluster of the plurality of clusters includes at least one local device having an audio output component; receiving audio streams captured using each of the audio capture components associated with the first cluster; identifying a single audio capture component from amongst the audio capture components associated with the first cluster as a primary audio capture component based on an amplitude of the audio streams within a predetermined frequency range, wherein the predetermined frequency range has a lower bound above 2 kHz; transmitting the audio stream captured by the primary audio capture component for broadcasting from each audio output component associated with the second cluster.
In an eighth aspect, the present invention provides a computer program or a tangible computer-readable medium storing a computer program, wherein the computer program is for teleconferencing between a plurality of participants, the computer program comprising instructions that, when executed, will cause a networked computer to perform a method comprising: communicating with a plurality of local device applications, each operating on a local device associated with a participant of the teleconference; assigning each local device to one of a plurality of clusters, wherein a first cluster of the plurality of clusters includes at least one local device having an audio capture component, and wherein a second, different cluster of the plurality of clusters includes a plurality of local devices, at least two having an audio output component and at least one having an audio capture component; for each audio output component associated with the second cluster: commanding the respective local device to generate a synchronising audio signal using the audio output component, wherein the synchronising audio signal is substantially outside of the human audible frequency range; receiving an audio stream captured by each of the at least one audio capture component associated with the second cluster; identifying the synchronising audio signal in the audio stream or streams captured using the at least one audio capture component associated with the second cluster; and determining a time delay associated with the respective audio output component based on a time when the synchronising audio signal is detected; and transmitting an audio stream associated with the first cluster for broadcasting from each audio output component associated with the second cluster at the respective time delay associated with that audio output component, such that the audio stream is broadcast from each audio output component at the same time.
In a ninth aspect, the present invention provides a computer program or a tangible computer-readable medium storing a computer program, wherein the computer program is for assigning participants of a teleconference to a cluster indicative of physical proximity, the computer program comprising instructions that, when executed, will cause a networked computer to perform a method comprising: communicating with a plurality of local device applications, each operating on a local device associated with a participant of the teleconference, and each local device having an audio capture component and an audio output component; for each local device: commanding the local device to generate an identifying audio signal using the audio output component associated with the respective local device; when the identifying audio signal is not detected by the audio capture components associated with any of the plurality of local devices other than the respective local device, assigning the respective local device to a new cluster; and when the identifying audio signal is detected by an audio capture component associated with one or more of the plurality of local devices other than the respective local device, assigning the respective local device to an existing cluster including the one or more of the plurality of local devices that detected the identifying audio signal.
Certain preferred embodiments of the present invention will now be described in greater detail, by way of example only and with reference to the accompanying drawings, in which: Figure 1 shows a physical configuration of participants of an exemplary videoconference; Figure 2 shows the participants assigned to virtual clusters based on their physical locations; Figure 3 shows how audio travels when a participant in a first cluster speaks; Figure 4 shows how audio travels when a participant in a second cluster speaks; Figures 5 to 7 show participants being automatically assigned to clusters based on generation and detection of an identifying audio signal; Figures 8 and 9 show an interface for a participant to manually assign themselves to a cluster when joining a videoconference; Figure 10 shows an interface for a participant to manually re-assign themselves to a new cluster during a videoconference; Figure 11 shows audio data captured by two microphones in physical proximity to one another; Figure 12 shows the audio data of Figure 11 after passing through a 2.3 kHz high-pass filter; Figure 13 shows the inherent time delays of each of the audio output components within the first cluster when broadcasting audio data; Figure 14 shows an artificial delay applied to each of the audio output components within the first cluster when broadcasting audio data; Figure 15 shows synchronisation of the audio output components within the first cluster using a synchronising audio signal; Figure 16 shows how speech travels between participants in a first prior art hybrid teleconference system; and Figure 17 shows how speech travels between participants in a second prior art hybrid teleconference system.
Figure 1 shows schematically a physical configuration of six participants of an exemplary videoconference.
The six participants, numbered from P1 to P6, are located in three different physical locations 2, 4, 6. Participants P1, P2 and P3 are located in a first physical location 2. Participants P4 and P5 are located in a second physical location 4. Participant P6 is located in a third physical location 6.
The distance between the locations 2, 4, 6 is not relevant. However, the locations should be sufficiently remote that participants in one location, e.g. participants P1-P3 in the first location 2, cannot directly hear participants speaking in the other locations, e.g. participants P4 and P5 in the second location 4 and participant P6 in the third location 6.
In the illustrated embodiment, each participant has their own audio capture component (e.g. a microphone), audio output component (e.g. a loudspeaker), video capture component (e.g. a camera) and video display component (e.g. a display screen). In many situations, all of these components would be integrated into a local device 8, i.e. local to the respective participant, such as a mobile phone, a laptop or a headset. However, one or more of the components may be a standalone component, such as an external microphone or loudspeaker, or a virtual- or augmented-reality headset, connected to a local device 8 such as a desktop computer.
In some embodiments, certain participants may not individually have all four components. For example, in some situations, multiple participants may share one or more components, such as sharing a common video display component or audio output component. It will be readily apparent how the following techniques may be applied in such situations.
Additionally, whilst the illustrated embodiment is described in the context of a videoconference, the principles described herein apply principally to the capture, transmission and broadcasting of audio. Therefore, these techniques may be applied also to other types of teleconferences, such as audio-only teleconferences. In other examples, these techniques may be applied to teleconferences where one or more of the participants are represented by virtual avatars, rather than by a video of the participant themselves. This could be employed, for example, on platforms such as Meta Horizon or Roblox.
As discussed previously, in hybrid teleconference situations where multiple audio capture components and/or audio output components are located proximate one another, feedback and echoes can occur, which make it difficult to clearly hear a speaking participant.
The following technique seeks to reduce this problem. The technique is based on the premise that remote participants only want to hear audio from one source at a time from the in-person room, and that participants in the room only want to hear audio from remote participants (because they can hear a person speaking within the room directly).
To enable this, participants joining a videoconference are assigned to, or select, a virtual cluster containing the other participants with whom they share a physical location. Remote participants will hear the audio feed only of the individual(s) in the room actively speaking. In-person participants can leave their devices connected and their microphones on without fear of audio feedback loops, and can hear remote participants equally well, as their audio plays simultaneously from all of the devices in the room. Thus, all participants can much more easily stay engaged, as individual video and audio feeds mean in-person participants can interact naturally in the room, and remote participants have an experience similar to a fully-remote meeting.
This is illustrated further, with reference to Figures 2 to 4.
Figure 2 illustrates three virtual clusters 12, 14, 16 to which the participants P1 to P6 have been assigned, based on their physical locations 2, 4, 6.
Participants P1, P2 and P3, who are located in the first location 2, are assigned to the first cluster 12. Participants P4 and P5, who are located in the second location 4, are assigned to the second cluster 14. Participant P6, who is located in the third location 6, is assigned to a third cluster 16.
The process by which the participants P1 to P6 are assigned to the clusters 12, 14, 16 will be discussed in greater detail later.
With reference to Figure 3, when participant P1 in the first cluster 12 speaks, their voice is captured as audio data by their respective audio capture component. This audio data is transmitted to the participants in the other clusters, i.e. to participants P4 and P5 in the second cluster 14, and to participant P6 in the third cluster 16, and the audio data is broadcast in a synchronised manner from each of their respective audio output components.
The process for synchronisation of the outputs of the audio output components will be discussed in greater detail later.
As the participants within the first cluster, i.e. participants P2 and P3, are located in the first physical location 2 together with participant P1, they are able to hear participant P1 directly. Therefore, the audio data captured by the audio capture component of participant P1 is not transmitted to participants P2 and P3 for broadcast from their respective audio output components.
The audio data transmitted to the second and third clusters 14, 16 contains audio data captured using only a single, primary audio capture component within the first cluster 12, i.e. that of participant P1 in this case. Thus, even though the audio capture components of participants P2 and P3 may also capture speech from participant P1, these audio streams will not be transmitted to the other clusters.
The process for selection of the primary audio capture component will be discussed in greater detail later.
Turning now to Figure 4, when participant P4 in the second cluster 14 speaks, their voice is captured as audio data by their respective audio capture component. This audio data is again transmitted to the participants in the other clusters, i.e. to participants P1 to P3 in the first cluster 12, and to participant P6 in the third cluster 16, and the audio data is broadcast from each of their respective audio output components.
As above, the audio data transmitted to the first and third clusters 12, 16 contains audio data captured using only a single, primary audio capture component within the second cluster 14, i.e. that of participant P4 in this case. Thus, even though the audio capture component of participant P5 may also capture speech from participant P4, this audio stream will not be transmitted to the other clusters.
Figures 5 to 7 schematically illustrate a first process by which the participants of the videoconference may be automatically assigned to the clusters 12, 14, 16.
The following example will illustrate the process where participants P1, P2, P3 and P4 join the videoconference in that order. In the illustrated example, each participant has an audio capture component 22, and an audio output component 24.
When participant P1 joins the videoconference, they are the first participant. Therefore, they will automatically be assigned to the first cluster 12.
With reference to Figure 5, when participant P2 joins the videoconference, their audio output component 24P2 will be commanded to generate an identifying audio signal 26P2 for participant P2. Whilst this may be an audio signal that is audible to humans, it is more preferable that the identifying audio signal 26P2 is substantially outside of the range of human hearing (typically considered 20 Hz to 20 kHz), for example an ultrasonic audio signal. Audio signals in this frequency range, above around 20 kHz, can commonly be generated and detected by audio output components and audio capture components of the type commonly used in laptop computers and mobile phones.
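As a rough sketch of how such a near-ultrasonic identifying signal could be generated in a browser using the Web Audio API (the 20.5 kHz frequency, 300 ms duration and gain level are illustrative assumptions rather than values taken from this disclosure):

```typescript
// Emit a short tone above the human audible range from the participant's
// audio output component, usable as an identifying (or synchronising) signal.
function playIdentifyingTone(ctx: AudioContext, frequencyHz = 20500, durationS = 0.3) {
  const osc = ctx.createOscillator();
  osc.type = "sine";
  osc.frequency.value = frequencyHz; // e.g. ~20.5 kHz, above typical hearing

  // Keep the level modest to avoid stressing small loudspeakers.
  const gain = ctx.createGain();
  gain.gain.value = 0.05;

  osc.connect(gain).connect(ctx.destination);
  osc.start();
  osc.stop(ctx.currentTime + durationS);
}

// Usage (in a browser): playIdentifyingTone(new AudioContext());
```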
The identifying audio signal 26P2 of participant P2 is detected by the audio capture component 22P1 of participant P1, which indicates that participants P1 and P2 are located in the same physical location, i.e. the first location 2. Therefore, participant P2 will be assigned to the same cluster as participant P1, i.e. the first cluster 12.
With reference to Figure 6, when participant P3 joins the videoconference, their audio output component 24P3 will be commanded to generate an identifying audio signal 26P3 for participant P3. Again, this may be an ultrasonic audio signal.
The identifying audio signal 26P3 of participant P3 is detected by the audio capture components 22P1, 22P2 of both participants P1 and P2, which indicates that participant P3 is located in the same physical location as participants P1 and P2, i.e. the first location 2. Therefore, participant P3 will also be assigned to the same cluster as participants P1 and P2, i.e. the first cluster 12.
With reference to Figure 7, when participant P4 joins the videoconference, their audio output component 24P4 will be commanded to generate an identifying audio signal 26P4 for participant P4. Again, this may be an ultrasonic audio signal.
The identifying audio signal 26P4 of participant P4 is not detected by any of the audio capture components 22P1, 22P2, 22P3 of participants P1, P2 or P3, because participant P4 is located in a different physical location to them, i.e. the second location 4. Therefore, participant P4 will be assigned to a new cluster, i.e. the second cluster 14 will be created.
This process is repeated as each new participant joins the videoconference. Any number of clusters may be generated, and the identifying audio signal 26 of a new participant only needs to be detected by the audio capture components 22 of a single other participant to be assigned to their cluster.
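The assignment decision itself can be summarised by the sketch below, which assumes the conferencing service already knows which existing participants' audio capture components 22 detected the new participant's identifying audio signal 26; the data structures and the `assignToCluster` name are illustrative only:

```typescript
interface ClusterState {
  // clusterId -> participant ids currently assigned to it
  clusters: Map<string, string[]>;
}

// Assign a newly joined participant: if at least one existing participant
// heard the identifying signal, join that participant's cluster; otherwise
// create a new cluster for the newcomer.
function assignToCluster(
  state: ClusterState,
  newParticipantId: string,
  detectedBy: string[], // ids of participants whose microphones heard the tone
): string {
  for (const [clusterId, members] of state.clusters) {
    if (members.some((id) => detectedBy.includes(id))) {
      members.push(newParticipantId);
      return clusterId;
    }
  }
  const newClusterId = `cluster-${state.clusters.size + 1}`;
  state.clusters.set(newClusterId, [newParticipantId]);
  return newClusterId;
}
```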
Optionally, the audio output components 24 of the participants may be periodically instructed to generate an identifying audio signal 26 throughout the videoconference, for example once every 10 seconds, and the clusters to which each participant is assigned may be updated based on which audio capture components 22 detect the respective identifying audio signals 26. Thus, for example, a participant who leaves a physical location during the videoconference can be automatically re-assigned to the appropriate cluster.
Whilst the above example comprises the generation of an identifying audio signal 26 by the audio output component 24 of the new participant, it will be appreciated that the location of a new participant may alternatively be determined by generating identifying audio signals 26 using the audio output components 24 of at least one existing participant of each cluster, and determining which (if any) of the identifying audio signals 26 are detected by the audio capture component 22 of the new participant.
The audio signals captured by each audio capture component 22 may be low-pass filtered, for example with a cut-off frequency of about 20 kHz (corresponding to the upper bound of human hearing), prior to broadcasting from the audio output component 24 in other clusters. This prevents other audio output components 24 from inadvertently repeating the identifying audio signals 26 in other clusters.
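In a browser-based implementation, one possible way to apply such a low-pass filter to the outgoing microphone stream is via the Web Audio API, as sketched below; the `lowPassOutgoingAudio` helper and the exact node wiring are assumptions made for illustration:

```typescript
// Remove near-ultrasonic identifying/synchronising tones from an outgoing
// microphone stream by low-pass filtering it in the Web Audio graph.
function lowPassOutgoingAudio(ctx: AudioContext, micStream: MediaStream): MediaStream {
  const source = ctx.createMediaStreamSource(micStream);

  const lowpass = ctx.createBiquadFilter();
  lowpass.type = "lowpass";
  lowpass.frequency.value = 20000; // cut-off around the upper bound of hearing

  const destination = ctx.createMediaStreamDestination();
  source.connect(lowpass).connect(destination);

  // The filtered stream is what gets transmitted to participants in other clusters.
  return destination.stream;
}
```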
Figures 8 and 9 show an interface for a participant to manually assign themselves to a cluster when joining a videoconference. This may be used in addition or as an alternative to the automatic cluster assignment technique described above.
When a new participant joins a videoconference, they may be presented with a location-selection window 30. A first exemplary location-selection window 30a is shown in Figure 8, and a second exemplary location-selection window 30b is shown in Figure 9.
With reference to Figure 8, the location-selection window 30a contains an existing-cluster option 32 that will assign the participant to a corresponding, existing cluster of the videoconference. Additionally, it contains a new-cluster option 34 that will create a new cluster of the videoconference and will assign the participant to the new cluster. Optionally, the participant may be invited to name the new cluster when it is created.
In some embodiments, the location-selection window 30a may contain a solitary-participant option 36. When selecting this option, the participant is not placed in a cluster with other participants, but is instead treated as a solitary participant in the meeting. Other participants cannot join a cluster with a solitary participant. However, the solitary participant may later choose to join an existing cluster or to create a new cluster. The solitary participant receives audio and video in the same manner as the other clusters, for example see the third cluster 16 described above. Likewise, audio from the solitary participant is transmitted to all other clusters in the same manner as the primary audio streams from the other clusters.
With reference to Figure 9, as more participants join the call, the location-selection window 30b will contain an existing-cluster option 32a-32c for each cluster that is not a solitary-participant cluster, in addition to the new-cluster option 34 and the optional solitary-participant option 36.
Selecting one of the existing-cluster options 32a-32c will assign the participant to the respective existing cluster.
This arrangement facilitates simple assignment of participants to the correct cluster. Additionally, it will work even in situations where a participant does not have their own audio capture component and/or audio output component, or where those components have been incorrectly configured.
Figure 10 shows an interface 40 of the videoconference, which is presented to a participant of the videoconference. The interface 40 includes a cluster control 42 by which the participant can manually re-assign their cluster during the videoconference.
The interface 40 may be used in combination with any of the cluster assignment techniques described above.
The cluster control 42 shows the clusters that are present in the videoconference, and optionally the participants present within each cluster. In the illustrated example, the videoconference includes three clusters. Additionally, the cluster control 42 includes an indication 44 of the current cluster to which the participant is assigned.
The cluster control 42 comprises an existing-cluster option 46a, 46b for each existing cluster to which the participant is not currently assigned. Selection of one of these options 46a, 46b by the participant will assign the participant to the respective existing cluster.
The cluster control 42 also comprises a new-cluster option 48 that will create a new cluster of the videoconference and will assign the participant to the new cluster. Optionally, the participant may be invited to name the new cluster when it is created.
The illustrated existing-cluster options 46a, 46b and new-cluster option 48 are merely one exemplary interface for assignment of the user to a cluster via the cluster control 42. In other implementations, for example, a drag-and-drop interface may be used, whereby dragging a participant's name to a cluster acts to select the respective existing-cluster options 46a, 46b, or the existing-cluster options 46a, 46b and/or new-cluster option 48 may be selected from a drop-down list.
Turning now to Figure 11, this Figure shows two audio signals 52, 54 captured by two different audio capture components 22 located within the same physical location. The graphs of the audio signals 52, 54 shown in Figure 11 plot amplitude of the audio signals 52, 54 against time.
In this example, the audio signals are treated as those captured by the audio capture components 22p4, 22p5 of participants P4 and P5, who are located within the same location, e.g. the second location 4 in Figure 1, and are assigned to the same cluster, e.g. the second cluster 14 in Figure 2.
As discussed above, for each cluster, a primary audio capture component 22 is selected for that cluster, and only an audio stream captured by that primary audio capture component 22 is broadcast from the audio output components 24 of the participants in the other clusters.
It is desirable to identify the primary audio capture component 22 as the audio capture component 22 within the physical location that provides the clearest audio recording of a participant that is speaking within that location. It was envisaged that this might be determined based on selecting the audio signal 52, 54 having the maximum amplitude. However, where the two participants P4, P5 are located within the same location, it has been found that amplitude alone may be insufficient to correctly identify the best audio signal.
The first audio signal 52 is captured by the audio capture component 22P4 of participant P4. The second audio signal 54 is captured by the audio capture component 22P5 of participant P5. The audio signals 52, 54 represent the same period of time, which includes a first period of time 56 during which participant P4 speaks, and a second period of time 58 during which participant P5 speaks.
It would be expected that the first audio stream 52, captured by the audio capture component 22P4 of participant P4, would provide the best audio capture of the first period of time 56 during which participant P4 speaks, and that the second audio stream 54, captured by the audio capture component 22P5 of participant P5, would provide the best audio capture of the second period of time 58 during which participant P5 speaks.
As can be seen, amplitude provides a good distinction between the audio streams 52, 54 for participant P5. However, for participant P4, the amplitudes of the audio streams 52, 54 are very similar. Nevertheless, when heard by a human listener, the first audio stream 52 is significantly clearer than the second audio stream 54.
The inventors have identified that, within a physical location, certain frequencies present in human speech can create standing waves within the space. This can result in a more distant audio capture component 22 capturing a louder audio signal than an audio capture component 22 that is closer to the speaking participant, even though the more distant audio capture component 22 may capture a lower quality audio signal.
The inventors have identified that this effect is particularly prominent at lower frequencies, whereas it is the higher frequency components of speech that create a crisper and clearer audio stream.
Turning now to Figure 12, this Figure shows two audio signals 52', 54' derived by passing the two audio signals 52, 54 through a high-pass filter having a 2.3 kHz cut-off frequency.
As can be seen, the higher frequency audio signals 52', 54' permit much clearer distinction between the audio capture components 22P4, 22P5 to identify which is adjacent to the speaking participant.
Consequently, it is proposed that the primary audio capture component 22 for a cluster be selected based on the amplitudes of the filtered audio signals 52', 54' captured by the audio capture components 22P4, 22P5 associated with the participants of that cluster.
It will be appreciated that this filtering is only used for determining the primary audio capture component 22, and the audio signal transmitted to other clusters for broadcasting is not filtered in this way.
The primary audio capture component 22 is periodically re-evaluated for each cluster, for example every 200 ms. This ensures that only a single audio capture component 22 transmits from the cluster at any one time, avoiding echoes, whilst still quickly changing the primary audio capture component 22 when the speaking participant changes.
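One way this selection might be approximated in a browser is to compare each candidate microphone's spectral energy above the 2.3 kHz cut-off using an AnalyserNode, as sketched below. In practice the per-microphone measurements would be gathered wherever the comparison is performed (e.g. reported to the server); the helper names are illustrative only:

```typescript
// Sum the spectral magnitude above `cutoffHz` for one microphone's analyser.
function highBandEnergy(analyser: AnalyserNode, sampleRate: number, cutoffHz = 2300): number {
  const bins = new Float32Array(analyser.frequencyBinCount);
  analyser.getFloatFrequencyData(bins); // dB values per frequency bin

  const hzPerBin = sampleRate / analyser.fftSize;
  let energy = 0;
  for (let i = 0; i < bins.length; i++) {
    if (i * hzPerBin >= cutoffHz) {
      energy += Math.pow(10, bins[i] / 10); // convert dB to linear power
    }
  }
  return energy;
}

// Pick the primary audio capture component for a cluster: the one with the
// most energy above the cut-off. Intended to be re-run periodically, e.g.
// every 200 ms, as described above.
function selectPrimaryCapture(
  analysers: Map<string, AnalyserNode>, // participantId -> analyser on their mic
  sampleRate: number,
): string | null {
  let bestId: string | null = null;
  let bestEnergy = -Infinity;
  for (const [participantId, analyser] of analysers) {
    const e = highBandEnergy(analyser, sampleRate, 2300);
    if (e > bestEnergy) {
      bestEnergy = e;
      bestId = participantId;
    }
  }
  return bestId;
}
```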
Figure 13 shows the output of an audio signal 62 against time for each of the audio output components 24 within the first cluster 12.
In this example, all three audio output components 24P1, 24P2, 24P3 are instructed to broadcast the same audio signal 62 (e.g. received from participant P4 in the example of Figure 4). However, due to various factors, an inherent time delay 64 occurs between the sending of the instruction and the broadcasting of the audio signal 62. As illustrated in Figure 13, each of the inherent time delays 64P1, 64P2, 64P3 is different.
This inherent time delay can arise from a variety of factors, such as the route of data transmission, the network speed, the processing speed of the local device, etc. However, as the inherent time delay 64 is often different for each audio output component 24, this can cause echoes or interference when multiple audio output components 24 broadcast within the same physical location.
With reference to Figure 14, in order to overcome this issue, an artificial time delay 66 is applied to the audio signal 62 broadcast from each audio output component 24P1, 24P2, 24P3 in the cluster, so as to cause the audio signals 62 to be broadcast at the same time, i.e. in sync with one another.
An artificial time delay 66 is determined that is specific to each audio output component 24. The artificial time delay 66 may be applied at the local device or audio output component 24 when broadcasting the audio signal 62, or the audio signal 62 being transmitted to each audio output component 24 for broadcast may be delayed by the respective artificial time delay 66 before transmission to that audio output component 24.
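One way of realising such a delay, shown purely as a sketch (the helper name and the use of leading silence are assumptions), is to shift each output component's copy of the signal before it is played out; in practice the same effect could equally be achieved by scheduling playback later at the local device.

```python
import numpy as np

def apply_artificial_delay(signal: np.ndarray,
                           delay_seconds: float,
                           sample_rate: int = 48_000) -> np.ndarray:
    """Delay a broadcast signal by prepending silence of the required length."""
    pad = int(round(delay_seconds * sample_rate))
    return np.concatenate([np.zeros(pad, dtype=signal.dtype), signal])

# Hypothetical artificial delays (in seconds) for the three output components
# of one cluster; the same received signal is delayed differently for each.
received_signal = np.zeros(48_000, dtype=np.float32)  # placeholder audio
artificial_delays = {"P1": 0.012, "P2": 0.004, "P3": 0.0}
per_component_output = {pid: apply_artificial_delay(received_signal, d)
                        for pid, d in artificial_delays.items()}
```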
Figure 15 illustrates an example of how the artificial time delay 66 is determined for the audio output component 24P1 of participant P1.
The audio output component 24P1 of participant P1 is commanded to output a synchronising audio signal 68P1. Whilst this may be an audio signal that is audible to humans, it is preferable that the synchronising audio signal 68P1 is substantially outside of the range of human hearing (typically considered to be 20 Hz to 20 kHz), for example an ultrasonic audio signal.
The synchronising audio signal 68P1 is received by each of the audio capture components 22P2, 22P3 of the other participants at the same physical location, i.e. participants P2 and P3.
Based on a measurement of the delay between commanding the audio output component 24P1 to output the synchronising audio signal 68P1, and receiving the synchronising audio signal 68P1 at the audio capture components 22P2, 22P3 of the other participants, it is possible to calculate the inherent time delay 64P1 for the audio output component 24P1.
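A minimal sketch of this measurement is given below, assuming that the command timestamp and the capture timestamps come from a common clock and that the synchronising signal is located in the captured audio by cross-correlation; the function name and these assumptions are illustrative only, and a real system might additionally compensate for the acoustic flight time between devices.

```python
import numpy as np

def estimate_inherent_delay(command_time: float,
                            capture: np.ndarray,
                            capture_start_time: float,
                            sync_signal: np.ndarray,
                            sample_rate: int = 48_000) -> float:
    """Estimate the inherent delay of an audio output component.

    `sync_signal` is the known synchronising signal (e.g. a short ultrasonic
    tone burst) that the component was commanded to play at `command_time`.
    `capture` is audio recorded by another participant's capture component in
    the same room, starting at `capture_start_time`.
    """
    # Find where the synchronising signal appears in the captured audio.
    correlation = np.correlate(capture, sync_signal, mode="valid")
    offset_samples = int(np.argmax(np.abs(correlation)))
    detected_time = capture_start_time + offset_samples / sample_rate
    # The inherent delay is the time from the command to actual playback.
    return detected_time - command_time
```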
The artificial time delay 66P1 for the audio output component 24P1 is then determined based on the artificial time delays 66P2, 66P3 of the other audio output components 24P2, 24P3 within the cluster 12. Typically, one of the artificial time delays 66 within the cluster 12 will be set to zero; in the illustrated example, this is the artificial time delay 66P3 for the audio output component 24P3 of participant P3. The other artificial time delays 66 are adjusted accordingly to ensure that audio is broadcast in sync with the minimum time delay.
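Assuming that the component given the zero artificial delay is the one with the largest inherent delay (which keeps the total added latency to a minimum), the artificial delays could be derived from the measured inherent delays as in this sketch; the figures in the example are hypothetical.

```python
def compute_artificial_delays(inherent_delays: dict[str, float]) -> dict[str, float]:
    """Derive per-component artificial delays for one cluster.

    The component with the largest inherent delay receives an artificial
    delay of zero; every other component is delayed by the difference, so
    that all components broadcast in sync.
    """
    slowest = max(inherent_delays.values())
    return {pid: slowest - delay for pid, delay in inherent_delays.items()}

# Example with hypothetical measured inherent delays (seconds):
# compute_artificial_delays({"P1": 0.020, "P2": 0.028, "P3": 0.032})
# returns {"P1": 0.012, "P2": 0.004, "P3": 0.0}
```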
The artificial time delays 66 of the audio output components 24 are periodically recalculated, for example approximately every 60 seconds.
In one example, the identifying audio signal 26 may also be used as the synchronising audio signal 68.
The techniques described above are implemented using software. The software comprises server-side software hosted on a server accessible by the local devices 8 of the videoconference participants, such as a cloud server accessible via the Internet. The server-side software interacts with client-side software operating on the local devices 8 of the participants.
The client-side software may comprise a dedicated application, or may comprise an internet browser. Thus, in some embodiments, the local devices 8 of the participants may require no particular modification in order to implement the techniques disclosed herein. In one embodiment the local client is based on an Open Web Platform, such as a browser or media player following W3C standards.
The server-side software is configured to cause the client-side software to transmit audio captured using the audio capture component 22 of the respective local device 8. The server-side software is also configured to cause the client-side software to broadcast audio from the audio output component 24 of the respective local device 8.
Likewise, the server-side software is configured to cause the client-side software to transmit video captured using the video capture component of the respective local device 8, and to cause the client-side software to display video using the video display component of the respective local device 8.
In the preferred embodiment, the techniques described above are implemented primarily using the server-side software, and the client-side software merely captures and outputs the audio or video data in accordance with commands received from the server. However, it will be appreciated that one or more steps of the process may be implemented in the client-side software instead.
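As a purely illustrative sketch of the server-side routing rule described above (all names and data structures here are hypothetical), the server might forward each cluster's captured stream to the members of every other cluster, while never returning a stream to the cluster in which it was captured:

```python
def route_audio(cluster_of: dict[str, str],
                captured: dict[str, bytes]) -> dict[str, list[bytes]]:
    """Decide which captured audio streams each participant should receive.

    `cluster_of` maps a participant identifier to a cluster identifier, and
    `captured` maps a cluster identifier to the audio captured by that
    cluster's primary audio capture component.  A cluster's own audio is not
    sent back to its members, as they can already hear the speaker directly.
    """
    return {
        participant: [audio for cluster, audio in captured.items()
                      if cluster != own_cluster]
        for participant, own_cluster in cluster_of.items()
    }

# Example: P1-P3 share one room (cluster "A"), P4 and P5 share another ("B"),
# and P6 joins remotely ("C"); P6 receives the streams from "A" and "B" only.
streams = route_audio(
    {"P1": "A", "P2": "A", "P3": "A", "P4": "B", "P5": "B", "P6": "C"},
    {"A": b"\x00", "B": b"\x00", "C": b"\x00"},
)
```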

Claims (22)

  1. A method of teleconferencing between a plurality of participants, the method comprising: assigning each participant to one of a plurality of clusters, wherein a first cluster of the plurality of clusters includes at least two participants, each having an audio capture component and an audio output component, and wherein a second, different cluster of the plurality of clusters includes at least one participant having an audio output component; capturing an audio stream associated with the first cluster using at least one of the audio capture components associated with the first cluster; and broadcasting the audio stream associated with the first cluster from each audio output component associated with the second cluster; wherein the audio stream associated with the first cluster is not broadcast from any of the audio output components associated with the first cluster.
  2. A method according to claim 1, wherein capturing the audio stream associated with the first cluster comprises: identifying a single audio capture component from amongst the audio capture components associated with the first cluster as a primary audio capture component; and capturing the audio stream associated with the first cluster using only the primary audio capture component.
  3. A method according to claim 2, wherein which audio capture component is the primary audio capture component is periodically re-assessed, for example at least every 500 ms.
  4. A method according to claim 2 or 3, wherein the primary audio capture component is identified based on an amplitude of an audio stream captured by each of the audio capture components associated with the first cluster.
  5. A method according to claim 4, wherein the primary audio capture component is identified based on the amplitudes of the audio streams within a predetermined frequency range, wherein the predetermined frequency range has a lower bound above 2 kHz.
  6. A method according to any preceding claim, wherein the second cluster comprises at least two participants, each having an audio output component, and wherein the audio stream associated with the first cluster is broadcast from each audio output component associated with the second cluster at a time delay associated with that audio output component, such that the audio stream is broadcast from each of the audio output components associated with the second cluster at substantially the same time.
  7. A method according to claim 6, wherein at least one participant in the second cluster has an audio capture component, the method further comprising, for each audio output component associated with the second cluster: generating a synchronising audio signal using the respective audio output component, the synchronising audio signal optionally being substantially outside of the human audible frequency range; detecting the synchronising audio signal using the at least one audio capture component associated with the second cluster; and determining a time delay associated with the respective audio output component based on a time when the synchronising audio signal is detected.
  8. A method according to claim 7, the method further comprising: capturing an audio stream associated with the second cluster using the at least one audio capture component associated with the second cluster; applying a filter to the audio stream associated with the second cluster to create a filtered audio stream associated with the second cluster that does not include the synchronising audio signal; and broadcasting the filtered audio stream associated with the second cluster from each audio output component associated with the first cluster.
  9. A method according to any preceding claim, wherein each participant has an audio capture component and an audio output component, and wherein assigning each participant to one of a plurality of clusters comprises, for each participant: generating an identifying audio signal using the audio output component associated with the respective participant, the identifying audio signal optionally being substantially outside of the human audible frequency range; when the identifying audio signal is not detected by the audio capture components associated with any of the plurality of participants other than the respective participant, assigning the respective participant to a new cluster; and when the identifying audio signal is detected by an audio capture component associated with one or more of the plurality of participants other than the respective participant, assigning the respective participant to an existing cluster including the one or more of the plurality of participants that detected the identifying audio signal.
  10. A method according to any one of claims 1 to 8, wherein assigning each participant to one of a plurality of clusters comprises, for each participant: receiving an input from the respective participant; when the input indicates the respective participant wishes to join a new cluster or indicates the respective participant is not in the same location as any of the plurality of participants other than the respective participant, assigning the respective participant to a new cluster; and when the input indicates the respective participant wishes to join an indicated existing cluster or indicates the respective participant is in the same location as an indicated one of the plurality of participants, assigning the respective participant to the indicated existing cluster or to an existing cluster including the indicated participant.
  11. A method according to any one of claims 1 to 8, wherein assigning each participant to one of a plurality of clusters is based on an input received from a participant having administrative rights in respect of the teleconference.
  12. A method of teleconferencing between a plurality of participants, the method comprising: assigning each participant to one of a plurality of clusters, wherein a first cluster of the plurality of clusters includes at least two participants, each having an audio capture component, and wherein a second, different cluster of the plurality of clusters includes at least one participant having an audio output component; capturing an audio stream using each of the audio capture components associated with the first cluster; identifying a single audio capture component from amongst the audio capture components associated with the first cluster as a primary audio capture component based on an amplitude of the audio streams within a predetermined frequency range, wherein the predetermined frequency range has a lower bound above 2 kHz; and broadcasting the audio stream captured by the primary audio capture component from each audio output component associated with the second cluster.
  13. A method of teleconferencing between a plurality of participants, the method comprising: assigning each participant to one of a plurality of clusters, wherein a first cluster of the plurality of clusters includes at least one participant having an audio capture component, and wherein a second, different cluster of the plurality of clusters includes a plurality of participants, at least two having an audio output component and at least one having an audio capture component; for each audio output component associated with the second cluster: generating a synchronising audio signal using the respective audio output component, wherein the synchronising audio signal is substantially outside of the human audible frequency range; detecting the synchronising audio signal using the at least one audio capture component associated with the second cluster; and determining a time delay associated with the respective audio output component based on a time when the synchronising audio signal is detected; and broadcasting an audio stream associated with the first cluster from each audio output component associated with the second cluster, wherein the audio stream is broadcast from each audio output component associated with the second cluster at the respective time delay associated with that audio output component, such that the audio stream is broadcast from each audio output component at the same time.
  14. A method according to claim 13, wherein which audio capture component is the primary audio capture component is periodically re-assessed, for example at least every 500 ms.
  15. A method according to claim 13 or 14, the method further comprising: capturing an audio stream associated with the second cluster using the at least one audio capture component associated with the second cluster; applying a filter to the audio stream associated with the second cluster to create a filtered audio stream associated with the second cluster that does not include the synchronising audio signal; and broadcasting the filtered audio stream associated with the second cluster from each audio output component associated with the first cluster.
  16. A method of assigning participants of a teleconference to a cluster indicative of physical proximity, each participant having an audio capture component and an audio output component, the method comprising: for each participant: generating an identifying audio signal using the audio output component associated with the respective participant; when the identifying audio signal is not detected by the audio capture components associated with any of the plurality of participants other than the respective participant, assigning the respective participant to a new cluster; and when the identifying audio signal is detected by an audio capture component associated with one or more of the plurality of participants other than the respective participant, assigning the respective participant to an existing cluster including the one or more of the plurality of participants that detected the identifying audio signal.
  17. A method according to claim 16, wherein the identifying audio signal is substantially outside of the human audible frequency range.
  18. A system having an internet browser based on the W3C standards that define an Open Web Platform, wherein the system is configured to implement a method according to any one of claims 1 to 17.
  19. A computer program or a tangible computer-readable medium storing a computer program, wherein the computer program is for teleconferencing between a plurality of participants, the computer program comprising instructions that, when executed, will cause a networked computer to perform a method comprising: communicating with a plurality of local device applications, each operating on a local device associated with a participant of the teleconference; assigning each local device to one of a plurality of clusters, wherein a first cluster of the plurality of clusters includes at least two local devices, each having an audio capture component and an audio output component, and wherein a second, different cluster of the plurality of clusters includes at least one local device having an audio output component; receiving an audio stream associated with the first cluster, the audio stream having been captured using at least one of the audio capture components associated with the first cluster; and transmitting the audio stream associated with the first cluster for broadcasting from each audio output component associated with the second cluster; wherein the audio stream associated with the first cluster is not transmitted for broadcast from any of the audio output components associated with the first cluster.
  20. A computer program or a tangible computer-readable medium storing a computer program, wherein the computer program is for teleconferencing between a plurality of participants, the computer program comprising instructions that, when executed, will cause a networked computer to perform a method comprising: communicating with a plurality of local device applications, each operating on a local device associated with a participant of the teleconference; assigning each local device to one of a plurality of clusters, wherein a first cluster of the plurality of clusters includes at least two local devices, each having an audio capture component, and wherein a second, different cluster of the plurality of clusters includes at least one local device having an audio output component; receiving audio streams captured using each of the audio capture components associated with the first cluster; identifying a single audio capture component from amongst the audio capture components associated with the first cluster as a primary audio capture component based on an amplitude of the audio streams within a predetermined frequency range, wherein the predetermined frequency range has a lower bound above 2 kHz; and transmitting the audio stream captured by the primary audio capture component for broadcasting from each audio output component associated with the second cluster.
  21. A computer program or a tangible computer-readable medium storing a computer program, wherein the computer program is for teleconferencing between a plurality of participants, the computer program comprising instructions that, when executed, will cause a networked computer to perform a method comprising: communicating with a plurality of local device applications, each operating on a local device associated with a participant of the teleconference; assigning each local device to one of a plurality of clusters, wherein a first cluster of the plurality of clusters includes at least one local device having an audio capture component, and wherein a second, different cluster of the plurality of clusters includes a plurality of local devices, at least two having an audio output component and at least one having an audio capture component; for each audio output component associated with the second cluster: commanding the respective local device to generate a synchronising audio signal using the audio output component, wherein the synchronising audio signal is outside of the human audible frequency range; receiving an audio stream captured by each of the at least one audio capture component associated with the second cluster; identifying the synchronising audio signal in the audio stream or streams captured using the at least one audio capture component associated with the second cluster; and determining a time delay associated with the respective audio output component based on a time when the synchronising audio signal is detected; and transmitting an audio stream associated with the first cluster for broadcasting from each audio output component associated with the second cluster at the respective time delay associated with that audio output component, such that the audio stream is broadcast from each audio output component at the same time.
  22. A computer program or a tangible computer-readable medium storing a computer program, wherein the computer program is for assigning participants of a teleconference to a cluster indicative of physical proximity, the computer program comprising instructions that, when executed, will cause a networked computer to perform a method comprising: communicating with a plurality of local device applications, each operating on a local device associated with a participant of the teleconference, and each local device having an audio capture component and an audio output component; for each local device: commanding the local device to generate an identifying audio signal using the audio output component associated with the respective local device; when the identifying audio signal is not detected by the audio capture components associated with any of the plurality of local devices other than the respective local device, assigning the respective local device to a new cluster; and when the identifying audio signal is detected by an audio capture component associated with one or more of the plurality of local devices other than the respective local device, assigning the respective local device to an existing cluster including the one or more of the plurality of local devices that detected the identifying audio signal.
GB2215456.1A 2022-10-19 2022-10-19 Hybrid Teleconference platform Pending GB2623548A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB2215456.1A GB2623548A (en) 2022-10-19 2022-10-19 Hybrid Teleconference platform

Publications (2)

Publication Number Publication Date
GB202215456D0 GB202215456D0 (en) 2022-11-30
GB2623548A true GB2623548A (en) 2024-04-24

Family

ID=84818426

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021119090A1 (en) * 2019-12-09 2021-06-17 Vowel, Inc. Leveraging a network of microphones for inferring room location and speaker identity for more accurate transcriptions and semantic context across meetings
