GB2581518A - System and method for teleconferencing exploiting participants' computing devices - Google Patents


Info

Publication number
GB2581518A
Authority
GB
United Kingdom
Prior art keywords
audio
devices
conference
received
stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1902435.5A
Other versions
GB201902435D0 (en)
Inventor
Douglas Blair Christopher
Laurence Heap Richard
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Software Hothouse Ltd
Original Assignee
Software Hothouse Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Software Hothouse Ltd filed Critical Software Hothouse Ltd
Priority to GB1902435.5A (patent GB2581518A)
Publication of GB201902435D0
Priority to PCT/US2019/056400 (WO2020081614A1)
Publication of GB2581518A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2207/00 Type of exchange or network, i.e. telephonic medium, in which the telephonic communication takes place
    • H04M2207/18 Type of exchange or network, i.e. telephonic medium, in which the telephonic communication takes place wireless networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2207/00 Type of exchange or network, i.e. telephonic medium, in which the telephonic communication takes place
    • H04M2207/20 Type of exchange or network, i.e. telephonic medium, in which the telephonic communication takes place hybrid systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/567 Multimedia conference systems

Abstract

Audio connectivity is provided between several local participants 1 and a remote participant 11 by transmitting, to the remote participant, a single resultant audio stream that is formed by selectively merging audio received from microphones in several of the local participants’ devices (2-6) such as smartphones or laptops. The local participants are within earshot of each other, and sound may be generated by one device and received by others to determine their ability to participate in a call or to determine their relative position. The devices’ contributions to the single resultant stream may be determined by audio level or quality. Speech recognition algorithms 16 may be used to output a putative transcript and confidence score for each word or phrase, and the transcript and score may be used to determine which audio stream to add to the resultant audio stream. This method has applications to teleconferencing and to conference calls.

Description

System and Method for Teleconferencing Exploiting Participants' Computing Devices

This invention relates to a means of conducting teleconferences with a plurality of participants at one location in which the participants' smartphones, tablets or computers act as a distributed set of microphones.
Background
Special "conference phones" have long been used in meeting rooms. These typically provide louder and better quality audio output than a standard office phone -so that all in the room can hear the distant end on a telephone call. They also usually provide better microphones than are found on a standard office phone. Many also support additional microphones, typically on one or two flying-leads, that can be spread around the table so as to pick up more clearly the voices of those seated furthest from the conference phone.
These devices are typically expensive; many are only infrequently used and some only work with one vendor's telephony system. The limited number of microphones in or connected to the device often requires participants to move closer to them when speaking or for the microphones to be repeatedly moved around the table to be closer to the person speaking.
Most participants in such a call will already have a smartphone that includes speakerphone capabilities. Whilst this can be used with more than one person sitting around the phone, the volume and loudspeaker quality is limited and the microphone pickup can be quite directional and therefore not good at picking up the voices of all those sitting around a table. Most modern smartphones now include multiple microphones allowing them to perform background noise suppression as well as beam-forming to identify the source of noise and use the most appropriate microphone(s). They typically also provide Automated Speech Recognition (ASR) capabilities, locally and/or via a network connected service.
Some participants may also have a tablet computer and/or laptop computer with them. Again, these generally have some microphone and speaker capabilities. When more than one individual in a room connects into a multimedia conference using these, the multiple audio paths frequently result in echo and/or feedback. It is therefore common practice to mute (or not even connect) the microphones and loudspeakers in these devices -instead sharing a single voice path via a dedicated conference phone in the middle of the table.
Sometimes a completely separate telephone conference bridge is used for the audio rather than the audio capabilities of the multimedia conference that is being used to display images and share desktops or drawings. This is not only time consuming to set up, it is also more expensive than a single bridge and can seriously hamper efforts to record the conference -as voice and other streams are in separate, unsynchronized systems.
Most smartphones, tablets and laptops also support wireless connectivity -typically via Bluetooth® -to loudspeakers which can be used when higher volume and/or quality is required. Such loudspeakers are readily available and many cost very little. Many are battery powered and therefore not restricted to being used near a power (or "Power over Ethernet") outlet as is the case with most conference phones.
In any phone call involving several individuals around the same table, there will therefore be almost as many, if not more, devices with microphones in the room than there are people. These microphones are already positioned close to the individuals that own them and hence can pick up their voices well. However, rather than contributing positively to the call, these devices often interrupt and/or degrade the conference call (as unrelated calls cause them to ring for example).
There is therefore an opportunity to use one or more cheap Bluetooth connected loudspeakers and/or some or all of the smartphones, laptops and tablet computers in the room as microphones and/or speakers. These can supplement the dedicated conference phone if one exists or completely avoid the need for one. When a battery powered speaker is used, this allows any space to be used as a "conference room" without the need for any fixed wiring or connection points.
Statement of Invention
The present invention lets a plurality of co-located users participate in an audio interaction with one or more remote participants without requiring a dedicated "conference phone". It does this by using the participants' smartphones, tablets and/or laptop computers as a distributed array of microphones for audio input and the loudspeaker(s) of one or more of said devices and/or a separate loudspeaker for audio output.
Introduction to the Drawings
Figure 1 shows the major components of an exemplary system.

Detail of the Invention

A plurality of individuals within a room (1) need to participate in a conference call, that includes at least an audio path, with at least one externally connected person and/or service such as a speech recognition (16) or transcription service.
A number of computing devices with audio capabilities (microphone and/or loudspeaker) are also present in the room (1). These may be brought into the room by participants or be part of the room's infrastructure. They typically include smartphones (2, 3, 4), tablet computers (5), laptops or personal computers (6).
There is also typically at least one desk phone (8) and/or conference phone present in the room. Optionally, one or more loudspeaker devices such as a Bluetooth speaker (9) may be present. Audio may, optionally, be sent to said speaker (9) from at least one of the devices - for example, smartphone (2) - in the room (1).
Devices within the room are typically able to communicate with each other and with those outside the room via a Wi-Fi (7) and/or wired (typically ethernet) network. Those with mobile network access may also be able to use a public cellular network for data communication as well as voice. Even in the absence of any pre-existing network (7) in the room, many such devices are able to use peer-to-peer wireless networking to communicate with each other.
Prior to this invention, one individual would typically use the desk or conference phone (8) to dial whichever conference provider is hosting the overall conference. This results in a phone call (over the WAN, LAN, PSTN, mobile network(s) and/or Internet) to a conference bridge (10). This is frequently routed via the company's Private Branch Exchange (PBX) (13). Other participants (11) outside the room also dial in or connect via their browser or other application to this conference bridge (10).
Note that this invention also works when there is a single remote party (11) and no external conference bridge (10). The issues around managing the audio pickup within the room (1) are the same regardless.
The conference bridge (10) may be providing video and/or screen-sharing, whiteboarding, chat and other data sharing as well as recording facilities. Alternatively, these interaction mechanisms may be provided by a completely separate conferencing service to which the participants connect independently of this audio connection.
In either case, users within the room (1) will typically also connect via their browsers or dedicated applications to view and contribute to the visual streams if these are available. Although this connection method usually also offers an audio path, this is not a good option. If there is more than one audio path between the room (1) and the conference bridge (10) echo typically occurs as the microphone on one device picks up the audio output by another device in the same room. Even if users wear headsets, their microphones typically still pick up one another's speech.
Furthermore, it is usually easier to position the desk or conference phone (8) centrally than to use any one of the personal devices. This phone (8) often has better speakers and, especially if it is a dedicated conference phone, also has better audio pickup than the personal devices.
However, the volume and quality of audio picked up from each participant depends on their location in the room and the quality and orientation of the microphone(s) in or connected to the (single) device being used for the audio path.
Should the participants instead choose to (or have to, in the absence of phone (8)) use one of their smartphones (2, 3, 4) to connect to the conference bridge (10), this can incur significant extra costs. Whereas the PBX (13) typically has access to low-cost routing even if the conference bridge (10) is overseas, any time a mobile phone is used to call a foreign number or is used while roaming outside its home network, costs can be exorbitant. Conference calls often last for hours - incurring huge costs.
This invention builds on the system described in UK patent application GB1816697.5 -which describes, in detail, a system by which an application on each employee's smartphone interacts with the company's telephony infrastructure via a "Mobile Access Point" and an application on their smartphone. The conferencing features described in this patent application can be provided as additional functionality within that framework. In this case, the Conference Room Process (CRP) (15) runs inside the Mobile Access Point and the Conference Participant Application (CPA) (17) is part of the overall application running on the employees' smartphones.
Naturally, the scenario described applies not just where at least some of the participants are employees of one organisation but also to any set of individuals, at least some of whom have this (or a compatible) application (17) installed on their devices and access to a common server process (15) functionally equivalent to the MAP described above for use within a single business. In such cases, the central process (15) is typically "in the cloud" -a service accessed via the Internet.
Using this invention, any or all of the devices in the room that are capable of running at least the Conference Participant Application (CPA) (17) should have this installed and run it to participate in the audio part of the conference -even if another application is providing the video, chat and other data streams. Alternatively, this CPA functionality may be embedded in such multimedia conferencing applications.
For the invention to work, at least one of the devices in the room must be able to communicate with a CRP (15). This may be located within the company's network or in the cloud/internet. This component will receive audio streams from the personal devices (2, 3, 4, 5, 6) and, optionally, the desk/conference phone (8) if present. It typically accesses the latter via an internal phone call using the company's PBX (13).
The CRP (15) is responsible for establishing and maintaining a single audio connection to the conference bridge (10)-thus appearing as a single (audio) participant in the overall conference.
The CRP (15) also receives the (single) audio stream from the conference bridge (10) and routes it either directly to the desk/conference phone (8) or, optionally, to one or more participating devices (2, 3,4, 5, 6). The device receiving this stream may output the audio directly via its own loudspeaker(s) or via a paired Bluetooth speaker or physically connected speaker.
The CRP (15) receives audio streams from all of the personal devices that have joined the conference. It processes and compares these audio streams with each other and with the incoming audio stream from the conference bridge (10) to determine which, if any streams, it will mix into the single audio stream it transmits to the conference bridge (10).
This processing may include squelch level (do not send if level below a threshold); noise reduction, echo cancellation (within the room and with the remote parties) and/or automatic speech recognition algorithms.
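As an illustration of the squelch step, the sketch below gates each device's audio frame on a simple energy threshold; the threshold value, frame format and function names are assumptions for illustration, not taken from the application.

```python
import numpy as np

def frame_energy_db(frame: np.ndarray) -> float:
    """Mean-square level of a PCM frame (samples in [-1, 1]), in dB."""
    rms = np.sqrt(np.mean(frame ** 2))
    return 20.0 * np.log10(max(rms, 1e-10))

def should_mix(frame: np.ndarray, squelch_db: float = -45.0) -> bool:
    """Squelch gate: only frames above the threshold are considered
    for mixing into the single stream sent to the conference bridge."""
    return frame_energy_db(frame) > squelch_db
```

In practice the threshold would be adapted per device, since microphone sensitivity varies.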
The process by which devices join the conference is described below.
Each device running the CPA establishes a data communication path to the CRP (15) and, preferably, reports its location and other information it can provide that will assist the CRP (15) to determine where it is and hence which conference it is most likely to be involved in. This information may include, but is not limited to: wireless network characteristics (such as base station address; signal strength; other networks visible; Wi-Fi SSID, BSSID and signal strength). Peer-to-peer networking can also be used to determine whether any other devices running this application are within range.
One of the users of the CPA (17) application will initiate a conference by selecting the "Start a Conference Call" button. They are then presented with any or all of:
i) Dialpad for manual (or from clipboard) entry of a phone number
ii) Contacts list(s)
iii) List of "well known" conference bridges (including typically those the company itself uses)
They are also typically shown a set of options for "Audio Output Device". This set may include a list of meeting rooms and their phone numbers - preferably filtered and ranked based on their current location such that the phone (8) in the room they are in is at the top of the list.
If they select one of these phones, the CRP (15) immediately calls that phone (8), typically via the corporate PBX (13), preferably with a high quality connection (sampled at 16KHz or above and uncompressed or compressed with a higher quality codec than is normally used for PSTN or cellular calls) and someone in the room should answer it. This will serve as the default audio output in the room and, assuming it has speakerphone capability, one possible audio input stream that is now being received by the CRP (15).
Once the initial audio path has been established, it is beneficial to identify the other participants in the room and, if possible to ensure that their phones are silenced with regard to other incoming calls but their microphones are available to assist with the conference. To this end, the conference initiator's phone (2) advertises, (preferably via peer-to-peer Wi-Fi and/or Bluetooth) a specific service associated with this conferencing application.
Where the operating system of the other phones in the room permits service detection from background mode, this is programmed to wake the application (17). Where this is not possible, if the application (17) has been advising the central process (15) of each phone's location, a "push" notification may be sent to the devices that are potentially within earshot of this conference - waking the application (17) and thus allowing it to scan for the presence of this service. This alerts other phones near the initiator's to the presence of a conference - the set of devices belonging to potential participants. However, not all of these may be in the room as the radio signals may be picked up in nearby rooms and/or the reported locations may not be accurate enough to determine which room each phone is in.
The challenge is therefore to identify the set of devices that are present in the room and should be part of the conference. If a conference is being held in an open space where others may overhear the content, that is also of interest.
The system therefore attempts to identify which devices are within earshot of the conference. It does this by alerting the previously identified said set of devices that a test audio signal is to be transmitted. This test signal typically contains a spoken component (such as "Checking for potential participants") and, optionally, a variable identifier that is easily recognisable in a received audio stream (such as a few DTMF digits or a sequence of single tones). This variable identifier is sufficiently complex that it cannot be guessed or spoofed by an attacker but does not need to be overly complex as it only needs to be uniquely identifiable during the brief period of participant discovery. In other words, if a neighbouring table were setting up a similar conference at the same time, it is important to be able to distinguish between the two conferences.
The application (17) on each of the devices that has been alerted to said discovery phase listens via any available microphone. For security reasons and to preserve bandwidth, each device analyses the audio it hears locally in preference to sending it to the central process (15).
Knowing the generic signal to expect (the speech) each can report to the central process (15) whether or not it heard this in the few seconds following the alert and, if so, what volume level and signal to noise ratio it picked up. It can also report the identifying signal (DTMF or tones) that accompanied the spoken signal.
The Conference Room Process (15) may be managing many conferences but staggers the discovery phase transmissions so that only one is in progress at a time - so as to avoid any possibility of confusion across conferences. This typically provides sufficient security that the use of identifying tones is unnecessary. There is a window of, typically, less than a second in which a report of hearing the discovery signal could be valid. Using several variants of the wording and/or speaker further enhances the level of security, making it very difficult for someone not in the room to know what audio to spoof, and exactly when, to fool the system into thinking they are in earshot of a specific conference.
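The on-device check for an identifying tone can be done cheaply with a single-bin Goertzel filter rather than a full FFT. The following sketch illustrates the idea; the sample rate, frequencies and margin are illustrative assumptions, not values from the application.

```python
import numpy as np

def goertzel_power(samples, sample_rate, freq):
    """Signal power at one frequency bin (Goertzel algorithm) --
    cheap enough for each device to run locally during discovery."""
    n = len(samples)
    k = int(0.5 + n * freq / sample_rate)  # nearest DFT bin to freq
    w = 2.0 * np.pi * k / n
    coeff = 2.0 * np.cos(w)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

def tone_present(samples, sample_rate, freq, reference_freq, margin=10.0):
    """Crude presence test: the candidate tone must carry far more
    power than a nearby reference frequency."""
    return goertzel_power(samples, sample_rate, freq) > \
           margin * goertzel_power(samples, sample_rate, reference_freq)
```

A real DTMF detector would also apply the standard timing and twist tolerances, but the principle is the same.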
Those devices that picked up the audio signal are therefore a subset of the aforementioned potential devices. This "within earshot" subset is preferably shown to the conference initiator -who may accept the full set of participants with a single confirmatory touch or other action.
Conversely, the initiator may reject any of the set of devices shown -but has at least been alerted to the fact that these devices (and presumably, therefore, their owners) can hear what's going to be said in the conference.
Should the "within earshot" subset of devices not include all of the people that the initiator can see in the room, this is an indication that not everyone will be able to hear the conference -given their current location relative to the audio output device and the volume it is producing.
The initiator is therefore offered the options of increasing the volume of the output device and/or testing other devices as potential output devices. In the former case, they adjust the volume and the discovery signal is repeated. In the latter case, a discovery signal is played out of each of the "within earshot" set of devices in the hope that others beyond them pick it up. Thus additional devices may be added to the "within earshot" set -but their reception characteristics are noted relative to the device(s) that were playing the sound they detected.
Having identified a potential participant and the initiator accepting them into the conference, a further audio signal (for example "Alex, joining") may also be sent in the opposite direction -being played by the new participant's device (2) and hence picked up by the desk-phone (8) and/or the other participants.
Optionally, more sophisticated signalling can be employed to assess the distances between devices. For example, a "background" signal consisting of some music or tones may be played via, say, fixed phone (8) at (nominally) the same time as playing a greeting ("Alex joining") at a specific phone (2). The actual time at which each of these two audio signals plays will vary because of jitter and delays in the system.
However, all microphones picking up the resultant audio in the room will be hearing the same thing -albeit from slightly different locations within the room. Again, the timestamps that they report back will not be directly comparable -because of jitter and delays in their own audio path. Within each received audio stream, however, the relative volumes of the two components of the received signal and, crucially, their relative offset from each other can be measured precisely.
In a simple example, suppose the audio played by device A contains a single pure tone at N Hertz for 100ms and the second, played by device B, contains a single pure tone at M Hertz for 100ms. Each of the devices in the room can report the relative volumes at which it heard the two tones and the time difference between the centre of the burst of N Hertz and the centre of the burst of M Hertz. So, for example, if device C hears the N Hertz tone t milliseconds before the M Hertz tone, but at device D the N Hertz tone arrives 2 ms later (relative to the M Hertz tone) than it did at C, one can infer that (D to A) - (D to B) is approximately 0.7m more than (C to A) - (C to B).
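The arithmetic behind the 0.7 m figure is simply the speed of sound multiplied by the timing shift; a one-line check (assuming roughly 343 m/s at room temperature):

```python
SPEED_OF_SOUND = 343.0  # metres per second at roughly 20 degrees C

def path_difference_m(delta_t_s: float) -> float:
    """Extra path-length difference implied by a shift of delta_t_s
    seconds in the relative arrival times of the two tones."""
    return SPEED_OF_SOUND * delta_t_s
```

A 2 ms shift gives 343 x 0.002 = 0.686 m, i.e. the "approximately 0.7m" quoted above.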
By disabling echo suppression during this phase, each device also hears its own audio output thus providing further contributions to the overall set of simultaneous equations that have to be solved to deduce the relative positions of each device.
By transmitting simultaneously, or even in quick succession, from more than two devices, this discovery phase can be reduced to a second or two even with many participants.
Optionally, repeating the above test with the tones reversed (N Hertz played at device A and M Hertz at device B) allows any variation in the frequency response of the audio paths to be eliminated -by taking the average of the volume ratios across the two tests.
By repeating this test for each pair of devices, a map of the locations of each device can be determined. The relative levels detected can also be used to infer how effective each microphone is at receiving audio from each of the other devices and hence a model built of which microphone(s) to use and what delay to apply to each in order to "beam form" the audio -to pick out individual speakers wherever they are situated in the room.
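A minimal delay-and-sum beamformer built from such a model might look like the following sketch. Whole-sample delays are an assumption made for brevity; a practical implementation would use fractional delays and per-channel gains.

```python
import numpy as np

def delay_and_sum(streams, delays):
    """Average the microphone streams after removing each one's
    estimated arrival delay (in samples) for the target speaker,
    so that the speaker's contributions add coherently."""
    out_len = min(len(s) - d for s, d in zip(streams, delays))
    aligned = [np.asarray(s, dtype=float)[d:d + out_len]
               for s, d in zip(streams, delays)]
    return np.mean(aligned, axis=0)
```

The delays themselves come from the device-position model built during the discovery phase.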
The audio level and time-delay between the audio transmitted from a particular device and that received at each other device can be used to infer characteristics of the two devices and their distance from each other. A complete lack of correlation between transmitted and received sound is used to infer that the devices are not close enough to each other to be part of the same conference. This is a useful security measure that can help stop unauthorised listening in by those not in the room.
The locally received audio during these exchanges (that picked up by desk-phone 8 while it is playing a predetermined sound and that picked up by the smartphone (2) while it is playing a pre-determined sound) may also be analysed to determine how good the local echo-suppression capability is at each device.
In parallel with this, data packets are exchanged between each participating device (2) and CRP (15) to monitor and measure the suitability of the network path between them. If this is poor, the user may be prompted to select a phone (8) in the room for the audio path rather than or as a backup to that via his device.
Should the user not wish to -or not be able to -use a desk-phone (8), the user may select "This device" as their audio output -in which case the CRP will stream the audio from the remote party (11) or conference bridge (10) to this device and the CPA (17) will play it via the device's loudspeaker(s).
Alternatively, the user may select a paired "Bluetooth Speaker" (9) as their audio output path - in which case the CRP (15) will stream audio to their smartphone (2) but this will be played via the paired Bluetooth speaker (9) rather than the internal loudspeaker.
Each of the newly invited participants' devices also prompts them at this time to silence their devices (or does so automatically where the operating system permits this).
During the interaction, automated speech recognition (ASR) may be performed at any or all of the devices running the CPA (17) and/or the CRP (15). The latter preferably analyses each received media stream separately and may also analyse the differences between pairs of said audio streams. Preferably, said differences are calculated having first time-shifted one of the signals so as to maximize the correlation between the two -hence identifying and compensating for any time lag caused by the physical distance between the two microphones and the dominant sound source and the network links between the two devices and the CRP (15).
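The correlation-maximizing time shift can be found by brute force over a small lag window; a sketch, assuming sampled streams as NumPy arrays and a bounded search range:

```python
import numpy as np

def best_lag(a, b, max_lag):
    """Return the shift k (in samples) maximizing sum over n of
    a[n + k] * b[n] -- the lag that best aligns stream a with stream b.
    If b is a d-sample delayed copy of a, this returns -d."""
    best_k, best_score = 0, -np.inf
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            x, y = a[k:], b
        else:
            x, y = a, b[-k:]
        n = min(len(x), len(y))
        score = float(np.dot(x[:n], y[:n]))
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

For long windows an FFT-based cross-correlation would be preferred, but the brute-force form makes the idea explicit.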
The output of said ASR -including the confidence level it assigns the transcript can be used to infer which audio stream has the "clearest" audio signal (steady stream of transcript with high confidence level) and preferentially transmit that stream to the remote bridge (10).
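The selection heuristic described here can be sketched as follows; the per-stream metrics (transcript rate and mean confidence) and their weighting are illustrative assumptions, not specified in the application.

```python
def clearest_stream(asr_metrics):
    """asr_metrics maps a device id to (words_per_second, mean_confidence)
    measured over a recent window. The 'clearest' stream is taken to be
    the one producing a steady flow of high-confidence transcript."""
    return max(asr_metrics,
               key=lambda dev: asr_metrics[dev][0] * asr_metrics[dev][1])
```

The CRP would re-evaluate this continuously so the transmitted stream follows whoever is speaking.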
Optionally, if the interaction is to be recorded, not only the audio stream transmitted to the remote bridge (10) is recorded but also some or all of the individual input streams from the various devices running the CPA (17). This multi-channel recording can be made available at replay time -to users and/or further ASR/transcription applications. By altering the gain of each channel and offering differences between channels, preferably automatically adjusting for time lag between each (estimated by time-shifted correlations) the listener/application can hear more clearly what individual speakers were saying even if the resultant audio transmitted to the far end included multiple speakers talking over each other or had muted them as not coming from the strongest signal.
In addition to the actual audio from each source being recorded, preferably a reduced bandwidth "summary" track is also recorded. This, for example, will typically include the volume (actually often "energy level" - proportional to volume squared) every 50 ms or so; the output of any ASR; and the signal-to-noise ratio within that period.
An overall merged summary "track" can also be derived from these, showing who was speaking in a given time window and their transcript.
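Deriving the merged track from per-device summaries might look like the following sketch; the data layout is an assumption for illustration:

```python
def merge_summaries(energy, transcript):
    """For each time window, take the device with the highest energy as the
    active speaker and attach that device's transcript for the window."""
    windows = min(len(track) for track in energy.values())
    merged = []
    for i in range(windows):
        speaker = max(energy, key=lambda d: energy[d][i])
        merged.append((speaker, transcript[speaker][i]))
    return merged
```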
Preferably, interruptions are minimized by suppressing audio alerts from other incoming calls and other notifications on the users' devices (where the operating system allows it and/or calls can be routed via the CPA (17) or CRP (15)). Where this cannot be achieved automatically, the user is reminded via the screen that they should mute or block such interruptions.

Claims (15)

  1. A system providing audio connectivity between a plurality of individuals within earshot of each other and at least one remote participant characterised in that audio is received via microphones in a plurality of said individuals' smartphones, smartwatches, tablet computers, laptops, personal computing devices and selectively merged by a single controller to form a single resultant audio stream that is transmitted to the remote participant(s).
  2. A system of Claim 1 in which sound generated by at least one of said devices is received via at least one other of said devices in order to determine their ability to participate in a shared audio interaction and/or to determine their relative positions.
  3. A system of Claim 1 further characterised in that the audio stream from the remote participant(s) is output via a desk telephone, a conference phone, one or more of said personal computing devices or one or more loudspeakers connected physically or wirelessly to any of these devices.
  4. A system of Claim 1 in which the contribution to said resultant audio stream is determined by the audio level and/or quality received at each such microphone.
  5. A system of Claim 1 in which any of the orientation, gain, quality of and/or relative distances between said microphones and individual currently speaking are inferred from any of the relative volume, spectrum, signal-to-noise ratio and time shift of audio received by them.
  6. A system of Claim 1 in which any of the orientation, gain, quality of and/or distance of said microphones from any of said audio output devices is inferred from any of the volume, spectrum, signal-to-noise ratio and/or time shift of audio received at said microphone in response to a specific audio signal being played at one or more of said output devices.
  7. A system of Claim 1 in which the audio from each microphone and/or the differences between pairs of audio streams are processed via automatic speech recognition algorithms so as to output a putative transcript and associated confidence level for each word or phrase.
  8. A system of Claim 7 in which said differences are calculated from derivatives of the received audio streams which include time-shifts that maximise correlation between said streams.
  9. A system of Claim 7 in which said transcript and/or confidence levels are used to determine which of said audio streams or differences between said streams is added to said resultant audio stream.
  10. A system of Claim 7 in which said transcript is used to access and show related information in real time on a shared display and/or on the individuals' personal computing devices.
  11. A system of Claim 1 in which one or more of the individual microphones' audio streams is recorded along with the resultant audio stream such that each such stream, combinations of and/or differences between such streams can be replayed as required with each stream optionally automatically time-shifted so as to minimise echo.
  12. A system of Claim 7 in which the strength of audio received at each microphone and/or the outputs of said speech recognition is noted and used to determine which individual was speaking at a given time.
  13. A system of Claim 1 in which said audio connectivity is part of a multimedia conference between said individuals and said remote participant(s).
  14. A system of Claim 1 in which said personal computing devices, on joining said audio connection, automatically mute and/or prompt the user to mute their call alerts and/or block other calls from interrupting said shared audio connection.
  15. A system of Claim 1 in which said personal computing devices join said shared audio connection via peer-to-peer data messages and/or audio signals sent between them that, on being received, identify at least a subset of the potential participants and the shared connection which they wish to join.
GB1902435.5A 2018-10-14 2019-02-22 System and method for teleconferencing exploiting participants' computing devices Withdrawn GB2581518A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1902435.5A GB2581518A (en) 2019-02-22 2019-02-22 System and method for teleconferencing exploiting participants' computing devices
PCT/US2019/056400 WO2020081614A1 (en) 2018-10-14 2019-10-15 Systems and method for control of telephone calls over cellular networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1902435.5A GB2581518A (en) 2019-02-22 2019-02-22 System and method for teleconferencing exploiting participants' computing devices

Publications (2)

Publication Number Publication Date
GB201902435D0 GB201902435D0 (en) 2019-04-10
GB2581518A true GB2581518A (en) 2020-08-26

Family

ID=65999045

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1902435.5A Withdrawn GB2581518A (en) 2018-10-14 2019-02-22 System and method for teleconferencing exploiting participants' computing devices

Country Status (1)

Country Link
GB (1) GB2581518A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022142984A1 (en) * 2020-12-29 2022-07-07 北京字节跳动网络技术有限公司 Voice processing method, apparatus and system, smart terminal and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1973320A1 (en) * 2007-03-19 2008-09-24 Avaya Technology Llc Conference system with adaptive mixing based on collocation of participants
US8606249B1 (en) * 2011-03-07 2013-12-10 Audience, Inc. Methods and systems for enhancing audio quality during teleconferencing
EP2894825A2 (en) * 2014-01-14 2015-07-15 Mitel Networks Corporation Conferencing system including a remote microphone and method of using the same
US20180279063A1 (en) * 2015-02-03 2018-09-27 Dolby Laboratories Licensing Corporation Scheduling playback of audio in a virtual acoustic space

Also Published As

Publication number Publication date
GB201902435D0 (en) 2019-04-10

Similar Documents

Publication Publication Date Title
US10142484B2 (en) Nearby talker obscuring, duplicate dialogue amelioration and automatic muting of acoustically proximate participants
US7742587B2 (en) Telecommunications and conference calling device, system and method
US8379076B2 (en) System and method for displaying a multipoint videoconference
US7283154B2 (en) Systems and methods for videoconference and/or data collaboration initiation
US8867721B2 (en) Automatic mute detection
US8334891B2 (en) Multipoint conference video switching
US7848738B2 (en) Teleconferencing system with multiple channels at each location
US8243631B2 (en) Detecting devices in overlapping audio space
EP1616433B1 (en) Automatic speak-up indication for conference call attendees
US9088695B2 (en) Video conferencing systems
US20080160976A1 (en) Teleconferencing configuration based on proximity information
US20040116130A1 (en) Wireless teleconferencing system
US7983406B2 (en) Adaptive, multi-channel teleconferencing system
EP2755368B1 (en) Teleconferencing system comprising Master Communication Device for mixing audio and connecting to neighbouring devices
US8265240B2 (en) Selectively-expandable speakerphone system and method
US7881447B1 (en) Conference call text messaging protocol using caller ID screen
US20160112574A1 (en) Audio conferencing system for office furniture
GB2581518A (en) System and method for teleconferencing exploiting participants' computing devices
US20060062366A1 (en) Overlapped voice conversation system and method
US8526589B2 (en) Multi-channel telephony
US20060136224A1 (en) Communications devices including positional circuits and methods of operating the same
Härmä Ambient telephony: Scenarios and research challenges
GB2591557A (en) Audio conferencing in a room
GB2491872A (en) Automatic setup of audio/computer teleconference
JPH066470A (en) Private branch exchange telephone system

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)