GB2578121A - System and method for hands-free advanced control of real-time data stream interactions - Google Patents

System and method for hands-free advanced control of real-time data stream interactions

Info

Publication number
GB2578121A
GB2578121A GB1816863.3A GB201816863A GB2578121A GB 2578121 A GB2578121 A GB 2578121A GB 201816863 A GB201816863 A GB 201816863A GB 2578121 A GB2578121 A GB 2578121A
Authority
GB
United Kingdom
Prior art keywords
call
further characterised
party
stream
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1816863.3A
Other versions
GB201816863D0 (en)
Inventor
Douglas Blair Christopher
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Software Hothouse Ltd
Original Assignee
Software Hothouse Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Software Hothouse Ltd filed Critical Software Hothouse Ltd
Priority to GB1816863.3A priority Critical patent/GB2578121A/en
Publication of GB201816863D0 publication Critical patent/GB201816863D0/en
Priority to PCT/US2019/056400 priority patent/WO2020081614A1/en
Publication of GB2578121A publication Critical patent/GB2578121A/en
Withdrawn legal-status Critical Current

Classifications

    • H04L 12/1813 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H04L 65/1066 Session management
    • H04L 51/046 Interoperability with other network applications or services
    • H04L 51/18 Commands or executable codes
    • H04L 65/1053 IP private branch exchange [PBX] functionality entities or arrangements
    • H04L 65/1069 Session establishment or de-establishment
    • H04L 65/403 Arrangements for multi-party communication, e.g. for conferences
    • H04L 65/80 Responding to QoS
    • H04M 3/527 Centralised call answering arrangements not requiring operator intervention
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06F 3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • H04M 3/42221 Conversation recording systems
    • H04M 3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04N 7/15 Conference systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A system using real-time data streams, such as a telephone or video conference, detects trigger phrases (e.g. ‘Let me just…’) or visual cues in the stream which signal commands/controls, analyses them and identifies the related commands. The commands may be contextual or lead to complex branching procedures. The commands may include hold, retrieve, transfer, hang up, leave message, record (start, stop, pause, resume) or play announcement/text to speech. The system may make use of an address book/directory database. The system may be incorporated into a PBX (private branch exchange) and may provide interactive voice response (IVR) functionality. The system may be implemented as a mobile access point (MAP).

Description

System and Method for Hands-free Advanced Control of Real-time Data Stream Interactions
This invention relates to a means of controlling real-time interactions, such as telephone calls, without requiring the complex set of buttons typically provided on a business telephone set.
Background
Many businesses do (or would like to) allow employees to select, purchase and use their own mobile phone with which to conduct business calls as well as personal calls. Patent Application GB1816697.5 describes a number of the challenges facing businesses and details a system architecture in which calls to and/or from the employee's mobile phone may be routed via a "Mobile Access Point" within the business's telephony infrastructure. It also highlights a number of possible features that could be provided during the call. These make using the business line more productive and hence more attractive to both the business and the employee.
The aforementioned patent application also describes a number of novel aspects of a "Corporate Dialler" application that can be deployed on the employees' phones. Such applications have been available for several years and therefore predate the explosion of voice assistant devices now found in many homes and on most smartphones. People are now comfortable and confident commanding a device to perform actions using the spoken word.
Many business telephone providers offer mobile phone applications that extend a subset of an employee's office phone capabilities to their mobile phone. However, that office phone typically has several dedicated buttons ("Hold", "Transfer", "Conference", "Voicemail" for example). While these can be replicated on a smartphone's screen, they are inconvenient and/or dangerous to use if the phone is at your ear or on the passenger seat respectively.
Public concern, and hence legislation, is taking an increasingly dim view of using a mobile phone whilst driving. Hands-free operation is generally allowed, but pressing buttons and, most certainly, typing, whilst still widespread, is gradually becoming socially unacceptable as well as illegal.
Business phone systems frequently have Interactive Voice Response (IVR) systems connected to them. These are increasingly being used not just to answer calls and direct the caller to an appropriate number on the basis of a dialled digit ("Press 1 for...") but also to automate simple interactions, often through speech recognition. These systems are optimized for telephony-quality speech and the limited vocabulary expected at each stage of a controlled dialog.
Such IVR systems are also increasingly used as real-time assistants, saving employees (mainly call centre agents) time by, for example, reciting a pre-prepared greeting or set of terms and conditions (often recorded in the employee's own voice). This allows the employee to work more effectively on their keyboard during the announcement.
Those who do not use a multi-button, multi-line office phone, and many of those who do, struggle with the command sequences needed to access most of the hundreds of features the phone system may perform. These features were designed in the days before softphones and certainly before voice control. Complex sequences of button presses and the dialling of "magic numbers" are needed to access many features.
There is therefore a requirement to improve the productivity of employees making business calls from their mobile phone by giving them easy access to (at least) the features they have at their disposal on their office phone. Some of these features would also be of interest to small businesses and individuals, if they were easy to use.
Statement of Invention
The present invention lets a user access a number of advanced features during a telephone call without having to interrupt the call. It does this by analysing the audio stream from (and, optionally to) the user at one or more nodes in the call's path.
Introduction to the Drawings
Figure 1 shows the major components (1, 14) of an exemplary system and the networks between them and infrastructure around them.
Figure 2 shows the relevant functional components within the mobile phone (1) and the service with which it interacts (14).
Figure 3 shows how outgoing calls are made from the mobile phone (1) and a subset of this (starting at 322) is also used for inbound calls.
Figure 4 shows how inbound calls to the service are handled -and as an outbound call from the mobile becomes a special case of an inbound call at the service, also covers that scenario.
Detail of the Invention
Figure 1 shows the main elements of an exemplary implementation of the invention.
Mobile phone (1) (assumed to be a "smartphone") supports voice connections to telephones (4) on the public switched telephone network (3) and voice and (in some cases) video connections to others on mobile phones, tablets, laptops or other computers which are connected via one or more mobile networks (2) and/or data networks such as a local Wi-fi network (5), the Internet (6) or a private network (7).
Figure 1 shows a number of components that may be present within such a network. These may be physically present in a building; distributed around the world; physical or virtual machines; owned by the company; hosted or "in the cloud".
The network joining them may be, for example, a Local Area Network (LAN), Wide Area Network (WAN), a Virtual Private Network (VPN) or directly on the internet. Functional units discussed may be provided as physical servers or as services running elsewhere. The important factor is that the required components can communicate with each other and are configured and permissioned to do so.
Where the user of mobile phone (1) is an employee or otherwise works for a business, private network (7) represents that business's corporate I.T. infrastructure. This typically includes a Private Branch Exchange (PBX) (8) and a plurality of internal phone numbers which may be mapped to physical phone sets (13) and/or applications running on laptops, tablets or desktop computers. Frequently, there is an Interactive Voice Response (IVR) system (9) and a corporate voicemail service (10). Optionally, voice recording capability (12) is present and, increasingly, a Speech Analysis server/service (11). These typically perform phonetic analysis, speech recognition, emotion detection and/or biometric analysis of live speech and/or recordings. Any of these servers or services may be combined into systems supporting multiple functions and/or exist as one or more separate systems/services.
One or more Mobile Access Points (MAPs) (14) are provided as part of this invention. Patent application GB1816697.5 describes how these are used to allow control of mobile phone calls by routing the call via said MAP (14). This allows it to access the audio (and video if present) passing between the parties on the call and to manage each leg of the call. Ideally, media stream processing, including bridging as needed, occurs in the MAP (14) rather than in an external conference bridge, giving it access to the audio from each party separately and enabling it to combine, fork, block and inject audio to and from each party as needed for this invention. Hence the MAP (14) can be considered a "stream management node" that controls how the call is handled.
Where the user of mobile phone (1) is not an employee -or is making a personal call that is not related to their work for the aforementioned business, then (7) may be the infrastructure of a publicly available service to which the user may subscribe. This allows individuals to access many of the same features that were previously only available to business users.
Also note that many mobile phones have a speech recognition capability. Some use a remote service (15) and hence require a data path to it in order to perform speech recognition. Others include a local capability (17) allowing them to perform at least some speech recognition when offline.
Figure 1 does not represent the user interface presented on mobile phone (1) but rather the presence of services and applications. In addition to the (optional) speech recognition services (17) there is typically a voice assistant service (18) -which may be always listening for key phrases (such as "Hey xxxx!" ) which trigger it to try and respond to spoken commands.
Figure 2 shows the components involved in managing a call so as to allow the provision of advanced calling functions optionally controlled by spoken command during the call.
Mobile (smart) phone (1) has the "LetMeJust" application (16) installed on it. This includes an overall CallManager (206) component that takes user commands from the touch display (210) and, optionally, headset or other peripherals such as a keyboard. It also displays call status information and optionally tips, hints and instructions on said display (210).
Audio from the microphone(s) (201) and, optionally, video from camera(s) (208) is received by TXHandler (204). This may also invoke speech recognition and/or keyword/phrase detection services on this audio stream and thus receive notification of what is being said and can identify one or more spoken commands. It can also fork a copy of the audio to one or more local or remote voice assistant (222), voice recording (219) and/or archiving services, or any arbitrary service that needs to use the stream.
Preferably, the TXHandler (204) transmits audio and, optionally, video out to the connected party(ies) over one or more networks (203). This is not always what it receives from the microphone (201). It also has access to additional audio sources such as pre-recorded audio, internally generated tones, text to speech and other incoming streams. What is transmitted over the network connection(s) can therefore be any combination of these, each processed, modified, supplemented or filtered and/or mixed at a specified volume. For example, a recording tone may be mixed into the outgoing audio; the microphone may be muted; audio may be fed to a translation service and the output of that transmitted instead.
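The transmit-side behaviour described above can be pictured with a minimal sketch. The class below is a hypothetical, much-simplified TXHandler: it holds a set of audio sources (microphone, tones, text-to-speech output), mixes whichever are not muted at their configured volumes, and forks a copy of the result to any registered observers such as a recorder or analyser. The class and function names, and the frame-of-floats audio model, are assumptions made for illustration only, not the actual implementation.

```python
# Minimal sketch of the transmit-side mixing/forking described above.
# Audio is modelled as lists of float samples per frame; names are illustrative.

class SimpleTXHandler:
    def __init__(self):
        self.sources = {}      # name -> (callable returning a frame, volume 0.0-1.0)
        self.muted = set()     # source names currently muted (e.g. the microphone)
        self.observers = []    # callables that receive a copy of the outgoing frame

    def add_source(self, name, frame_fn, volume=1.0):
        self.sources[name] = (frame_fn, volume)

    def mute(self, name):
        self.muted.add(name)

    def unmute(self, name):
        self.muted.discard(name)

    def fork_to(self, observer):
        # e.g. a recording service or speech analyser tap
        self.observers.append(observer)

    def next_outgoing_frame(self, frame_len=160):
        # Mix all un-muted sources at their configured volumes.
        mixed = [0.0] * frame_len
        for name, (frame_fn, volume) in self.sources.items():
            if name in self.muted:
                continue
            frame = frame_fn(frame_len)
            for i in range(frame_len):
                mixed[i] += volume * frame[i]
        for observer in self.observers:
            observer(list(mixed))          # forked copy, not the original buffer
        return mixed


# Example: microphone plus a quiet recording tone, with a tap for an analyser.
def fake_microphone(n):  # stand-in for the real capture path
    return [0.1] * n

def recording_tone(n):
    return [0.5] * n

tx = SimpleTXHandler()
tx.add_source("microphone", fake_microphone, volume=1.0)
tx.add_source("recording_tone", recording_tone, volume=0.1)
tx.fork_to(lambda frame: None)             # e.g. feed a speech recognition service
outgoing = tx.next_outgoing_frame()
```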
In some cases, particularly where a "traditional" phone call is being made over a mobile network, it may not be possible to intercept the audio from the microphone that is being transmitted over the network. In this case, a call may be routed via a Mobile Access Point (MAP) (14) so that the above functions can be carried out there instead.
The received data stream(s) are handled by the RXHandler (205). This also has the same suite of mixing, blocking, forking, injection, processing and analysis capabilities available to it as the TXHandler (204) does. One analysis performed here that is not required in the TXHandler (204) is tone detection (such as Dual Tone Multi-Frequency (DTMF) detection) to detect in-band signalling arriving in the received stream.
An RXHandler may, in the general case, receive one or more media streams and one or more signalling or control streams. It may also process any of said media streams to extract signalling/control information from them, such as DTMF tones or spoken commands. There is therefore a call-back mechanism whereby the RXHandler (205) can notify the CallManager (206) of control information, whether received via an "out of band" signalling path or "in band" within the media flowing.
This latter mechanism is also used for metadata passed within some media coding schemes (MP3, MP4 for example). Problems with the received or transmitted media stream can also generate such call-backs. For example, packet loss rates exceeding a threshold may result in a call-back to the CallManager (206) warning it of deteriorating connection quality; RTCP packets received may trigger a call-back warning of problems in the opposite direction.
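A sketch of this call-back mechanism might look as follows. It assumes a hypothetical RXHandler that tracks packet loss and watches for DTMF events and notifies a CallManager through registered call-backs; the threshold, event names and method signatures are illustrative only, not the actual interface.

```python
# Illustrative call-back plumbing between an RX handler and its call manager.

class SimpleRXHandler:
    def __init__(self, loss_threshold=0.05):
        self.loss_threshold = loss_threshold
        self.packets_expected = 0
        self.packets_lost = 0
        self.callbacks = []          # callables taking (event_name, detail)

    def on_event(self, callback):
        self.callbacks.append(callback)

    def _notify(self, event, detail):
        for cb in self.callbacks:
            cb(event, detail)

    def packet_received(self, lost_before_it=0):
        # Track loss and warn the call manager if quality deteriorates.
        self.packets_expected += 1 + lost_before_it
        self.packets_lost += lost_before_it
        loss_rate = self.packets_lost / max(self.packets_expected, 1)
        if loss_rate > self.loss_threshold:
            self._notify("quality_warning", {"loss_rate": round(loss_rate, 3)})

    def dtmf_detected(self, digit):
        # In-band signalling found in the media is reported the same way.
        self._notify("dtmf", {"digit": digit})


def call_manager_callback(event, detail):
    print(f"CallManager notified: {event} {detail}")

rx = SimpleRXHandler()
rx.on_event(call_manager_callback)
rx.dtmf_detected("5")
for _ in range(20):
    rx.packet_received(lost_before_it=1)   # heavy loss triggers quality warnings
```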
It will be appreciated that this same architecture can be applied to an application running on a tablet, laptop or desktop computer -either as a standalone application, a web browser plug-in or a remote service accessed via a browser.
Communication with the other party or parties (215) on the call may occur directly via one or more networks (203) or be routed via a MAP (14).
Where connections are direct to endpoints (215), the mobile phone (1) may need the more sophisticated multi-party TX/RX handler approach of the MAP (14) and/or there may be an additional connection to a MAP (14) allowing it to provide a subset of services on the call even if the audio/video stream(s) do not all pass through it.
Connections may be via VoIP (typically using SIP/SIPS and RTP/SRTP over a data connection) or virtual circuits over telephony networks. However, many phones only support one telephony network call at a time -and sometimes block data networks during such a call. The MAP (14) thus allows complex call scenarios with multiple counterparties to be established via it even if only one connection to the phone (1) is possible.
Each MAP (14) hosts a number of concurrent calls. Each of these is controlled by a MAPCallManager (210). This communicates with the CallManager (206) on the mobile phone (1) preferably via a data network, typically using HTTPS and the native Push Notification mechanism of a given mobile phone (1).
For each party on the call (apart from the MAP (14) itself) the MAP (14) instantiates a TXHandler and RXHandler with the same capabilities as those on the phone (204, 205). The mobile phone (1) is hereafter referred to as Party 0 on the call, so TX0Handler (211) and RX0Handler (212) process the streams to and from it respectively. Additional handlers (213, 214) are created for each additional party added to the call, resulting in handlers TX0...TXN and RX0...RXN if the call has had N+1 connections to date. Hereafter an arbitrary handler is referred to as TXnHandler or RXnHandler where 0 <= n <= N. As with the RXHandler (205) in the phone, these RX Handlers not only process their respective incoming media stream(s), they also alert the MAPCallManager (210) to signalling/control events they detect in-band or out of band.
Within the MAP (14), each TX Handler (211, 213) has access to the incoming streams from all of the RX Handlers (212, 214) should it need them. This allows each to construct the required stream for transmission to the specific party it handles -regardless of what is being sent to any of the others.
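One way to realise "each TX Handler has access to all incoming streams" is the classic mix-minus arrangement sketched below: the stream sent to party n is the sum of every other party's received audio, so nobody hears their own voice echoed back. The frame representation and function name are assumptions for illustration, not the MAP's actual media code.

```python
# Mix-minus sketch: build the outbound frame for each party from everyone
# else's inbound frames, as a MAP bridging N+1 parties might do.

def mix_minus(inbound_frames):
    """inbound_frames: dict mapping party index -> list of float samples."""
    parties = list(inbound_frames)
    frame_len = len(next(iter(inbound_frames.values())))
    outbound = {}
    for p in parties:
        mixed = [0.0] * frame_len
        for other, frame in inbound_frames.items():
            if other == p:
                continue                   # exclude the party's own audio
            for i in range(frame_len):
                mixed[i] += frame[i]
        outbound[p] = mixed
    return outbound


# Three-party example: party 0 is the mobile, parties 1 and 2 are counterparties.
frames = {0: [0.1, 0.1], 1: [0.2, 0.2], 2: [0.3, 0.3]}
out = mix_minus(frames)
assert out[0] == [0.5, 0.5]                # party 0 hears parties 1 and 2 only
```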
To allow some enhanced functionality even when the mobile phone cannot access a data path, basic, low bandwidth signalling between the MAPCallManager (210) and CallManager (206) can be achieved by instructing the TXHandler at either end (204, 211) to inject a sequence of DTMF tones. These can be identified by the corresponding RXHandler (212, 205 respectively) and thus used to convey basic instructions between the CallManager (206) and MAPCallManager (210).
Advantageously, the DTMF tones are transmitted at a low level and the RXHandlers (212, 214) suppress them further if the incoming audio does not contain significant other content during these bursts of tones. Preferably said suppression consists of injecting a signal similar to the background noise level on the call rather than complete silence.
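The low-bandwidth fall-back channel could be as simple as the encoding sketched below, where each instruction is given a short digit sequence framed by "*" and "#" so the receiving handler can pick it out of a stream of detected DTMF digits. The instruction table and framing characters are assumptions made for illustration, not a defined protocol.

```python
# Hypothetical encoding of basic CallManager <-> MAPCallManager instructions
# as DTMF sequences, framed "*<code>#" so the receiver can find them in-band.

INSTRUCTIONS = {          # illustrative instruction table
    "hold": "11",
    "retrieve": "12",
    "start_recording": "21",
    "pause_recording": "22",
}
CODES_TO_INSTRUCTIONS = {v: k for k, v in INSTRUCTIONS.items()}

def encode(instruction):
    return "*" + INSTRUCTIONS[instruction] + "#"

def decode(digit_stream):
    """Yield instructions found in a stream of detected DTMF digits."""
    buffer = ""
    collecting = False
    for digit in digit_stream:
        if digit == "*":
            collecting, buffer = True, ""
        elif digit == "#" and collecting:
            collecting = False
            if buffer in CODES_TO_INSTRUCTIONS:
                yield CODES_TO_INSTRUCTIONS[buffer]
        elif collecting:
            buffer += digit

print(encode("hold"))                      # *11#
print(list(decode("5*21#3*99#*12#")))      # ['start_recording', 'retrieve']
```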
MAPCallManager (210) has access to a wide range of services that can be used to enhance the interaction established between mobile phone (1) and the remote endpoint(s) (215). For example, speech analysis services may be available within the business (217) and/or externally (221). These latter may include Voice Assistant services (222) that not only recognise the words but also interpret commands, typically involving spoken responses, confirmations and further clarification. They can therefore be thought of as yet another participant in all or part of the call, with TX and RX handlers established to route commands to them and receive responses from them. Note that said responses can be injected into the stream being sent to the mobile phone (1) without necessarily being injected into the stream to any other party (215) on the call.
Telephony Services (216) include corporate PBX services for internal calls which can be used to exploit the corporate telephony network which may include sophisticated "least cost routing" schemes. These services also include SIP/SIPS or similar connectivity allowing VoIP calls to be established to anywhere via the internet and/or corporate network. As additional connections are established, so these data streams are connected to newly instantiated TX and RX handlers.
Many telephone connections provide user to user signalling -allowing arbitrary data to be passed in the signalling channel as part of the call. This can be received via the Telephony Services (216) and passed to that party's RX Handler (214).
IVR systems (9) are also typically accessed via these telephony services (216). An IVR port typically appears as an internal telephone number (or pool of ports behind a shared number) and can be accessed by, for example, calling that number. Thus the IVR port becomes another party on the call and a TX and RX Handler are instantiated for it -allowing audio to be passed to it for automated handling and audio from the assigned IVR port -such as prompts, confirmation and dialog -to be injected to any or all of the other parties on the call as required.
There is typically a data connection to the IVR (9) as well -allowing the MAPCallManager (210) to direct the interaction and to receive the results of the interaction (choices made, digits entered etc.). A common use for this is when processing credit card payments over the phone. The IVR (9) interacts with one party on the call only and the others do not hear and do not record that interaction.
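The "one party only" interaction can be pictured as a routing change applied for the duration of the payment dialog. The sketch below assumes a simple routing table describing which parties hear which sources and which streams feed the recorder; the structure and names are illustrative, not the actual MAP implementation.

```python
# Sketch of re-routing during an IVR payment segment: the paying party talks
# to the IVR privately, the other parties hear hold music, and the recorder
# is disconnected from the payment audio.

def normal_routing(parties):
    # Every party hears every other party; the recorder hears everyone.
    return {
        "hearing": {p: [q for q in parties if q != p] for p in parties},
        "recorder_inputs": list(parties),
    }

def payment_routing(parties, paying_party, ivr="IVR"):
    hearing = {}
    for p in parties:
        if p == paying_party:
            hearing[p] = [ivr]             # private dialog with the IVR
        else:
            hearing[p] = ["hold_music"]
    hearing[ivr] = [paying_party]          # IVR hears only the payer
    return {"hearing": hearing, "recorder_inputs": []}   # nothing recorded

parties = ["John Doe", "Jane Smith"]
print(normal_routing(parties))
print(payment_routing(parties, paying_party="Jane Smith"))
```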
Telephone calls are gradually being replaced by calls via instant messaging, which typically use VoIP-based services. This component (218) allows the MAPCallManager (210) to use the Application Programming Interfaces (APIs) of these services to establish connections with counterparties via alternatives to the PSTN.
Recording Services (219) may be within the MAP (14), writing to files locally or on a file-share, or streaming in real-time to a separate recording service on the corporate network, via the internet and/or VPN, or "in the cloud". Again, in the latter case, the recording service becomes a party on the call and one or more TX Handlers (213) are instantiated to feed the appropriate audio to it. In this case the RX Handler (214) is largely redundant (though it can pass on events from the recording system, such as "unable to record" or "pause recording"), but there may be multiple TX and/or RX Handlers (213, 214), one for each separate media stream where these are to be recorded separately (or as half of a stereo pair of channels in a file). This also allows recording to be paused and resumed, stopped and started during the call, as may be required for regulatory compliance.
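Pause/resume behaviour of the kind described can be sketched as a small state machine that either stores a frame or substitutes a masking tone while paused. The class, its states and the masking approach are assumptions for illustration; local regulations would dictate the real behaviour.

```python
# Minimal recording controller sketch supporting start/stop/pause/resume,
# with paused periods masked by a fixed tone rather than simply omitted.

class RecordingController:
    STOPPED, RECORDING, PAUSED = "stopped", "recording", "paused"

    def __init__(self, sink):
        self.sink = sink                   # callable receiving frames to store
        self.state = self.STOPPED

    def start(self):  self.state = self.RECORDING
    def stop(self):   self.state = self.STOPPED
    def pause(self):  self.state = self.PAUSED
    def resume(self): self.state = self.RECORDING

    def offer_frame(self, frame):
        if self.state == self.RECORDING:
            self.sink(frame)
        elif self.state == self.PAUSED:
            self.sink([0.3] * len(frame))  # masking tone stands in for the audio


stored = []
rec = RecordingController(stored.append)
rec.start()
rec.offer_frame([0.1, 0.1])               # recorded as-is
rec.pause()
rec.offer_frame([0.9, 0.9])               # masked: sensitive content not stored
rec.resume()
rec.offer_frame([0.2, 0.2])
print(len(stored), stored[1])              # 3 frames, middle one is the mask
```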
Similarly, announcement services (220) are used to play specific audio under the control of the MAPCallManager (210). Thus the MAPCallManager (210) can use a combination of Speech Services (217), tone detection within the RX Handlers (213) and Announcement Services (220) as an alternative to IVR ports.
Using the Announcement services (220) results in an RX Handler (214) being instantiated. This typically receives its audio (and/or video) stream from a file, an internal announcement service or a text-to-speech engine rather than as a live stream over the network. The TX Handler in this case is normally redundant as nothing is transmitted back to the announcement service. However, a text to speech service may use a TX Handler (213) to manage the flow of text to it, for example.
A further service that may result in an additional party joining the call is that of a concierge or private assistant service (223), who may be provided with a copy of some or all of the call content and/or metadata and some instructions, spoken or otherwise, during and/or after the call.
Thus the MAPCallManager (210) can control complex and sophisticated call scenarios under the control of in band signalling such as spoken commands from potentially any party; DTMF signals; out of band messages from IVRs (9), PBX (8), remote services (222) and so forth.
Figure 3 shows a flowchart of how an outbound call through this system is managed.
The user opens or selects (301) the application (16)-possibly via voice command ("Hey xxx! Make a business call" for example). Preferably this application (16) replaces or at least sits alongside the in-built telephony dialler application on the phone -encouraging or enforcing the use of said application (16).
A background task immediately uses the phone's location, schedule and/or other preferences/history that are available to determine the most appropriate MAP (14) to use from those known (303). Alternatively, a locator service may be used. This responds with details of the MAP (14) to be used and, preferably, one or more fall-back alternatives.
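Selection of "the most appropriate MAP" could be a simple scoring exercise over the candidates the application already knows about, as sketched below; the scoring weights, candidate fields and fall-back ordering are all assumptions made for illustration.

```python
# Illustrative MAP selection: score known MAPs by distance from the phone's
# current location and by recent reachability, then keep fall-backs in order.

import math

def score(map_info, phone_lat, phone_lon):
    distance = math.hypot(map_info["lat"] - phone_lat, map_info["lon"] - phone_lon)
    penalty = 0.0 if map_info["reachable_recently"] else 5.0
    return distance + penalty              # lower is better

def choose_map(known_maps, phone_lat, phone_lon):
    ranked = sorted(known_maps, key=lambda m: score(m, phone_lat, phone_lon))
    return ranked[0], ranked[1:]           # chosen MAP plus fall-back list

known_maps = [
    {"name": "map-london", "lat": 51.5, "lon": -0.1, "reachable_recently": True},
    {"name": "map-newyork", "lat": 40.7, "lon": -74.0, "reachable_recently": True},
    {"name": "map-paris", "lat": 48.9, "lon": 2.3, "reachable_recently": False},
]
chosen, fallbacks = choose_map(known_maps, phone_lat=51.4, phone_lon=-0.2)
print(chosen["name"], [m["name"] for m in fallbacks])
```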
Communication is established (304) with a MAP (14) via an available data network (e.g. 4G or Wi-fi) allowing it to allocate resources ready for a call. A VoIP channel is established (308), typically using SIP/SIPS, but media need not flow immediately. This thread continues to pass user actions and commands interpreted from the media streams to the MAP (14) and acts on commands incoming from the MAP (14) until the call ends. Typically this thread will actually maintain its connection with the MAP (14) for as long as a call is in progress or the application (16) is in the foreground.
Meanwhile, on the user interface thread(s), the user selects (via touch or speech command) an existing contact or group of contacts or enters a phone number/address of a party (215) they wish to call. Selection may imply immediate connection (e.g. a heavy/long press, pressing a phone icon or "Call Now" button, or a spoken command) or may simply select the entry, allowing others to be added. In the latter case, the MAP (14) is advised of the selection (305) and may choose to initiate connection while others are being added to the call. As this implies a conference call, which will almost certainly go via the MAP, it may, for example, start or extend its probing of the VoIP channel (308) so as to understand its quality as the potential voice path to/from the MAP (14).
Having selected the set of initial participant(s) (215) of the call, this set is examined to determine whether or not the call should be placed directly (307) (normally only an option for a single counterparty) or via the MAP (14) (for example: multiple parties; recording required; business call; international call requiring least cost routing; and/or advanced features may be required). In this latter case, a voice path is established (315) to the MAP (14), preferably over the VoIP channel (308) if viable. If the MAP (14) is providing speech recognition services, there may be no need to do so at the phone (1) as well; otherwise, a speech recognition service optionally taps (318) into the audio from the microphone (201).
Other analyses of media streams can also be added. For example, image processing such as supported by OpenCV or similar may be used to, for example, detect visual commands (such as a wave or putting one's hand up). This can be done on inputs that may not even be transmitted. For example, image processing may be applied to a stream from the camera (208) even if the call is voice only -in which case the TXHandler (204) is not passing that stream on, merely analysing it.
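As a rough illustration of visual-cue detection on a stream that is analysed but not transmitted, the sketch below flags sustained motion in the upper part of the frame (a crude proxy for a raised hand or wave) using simple frame differencing with OpenCV. The thresholds, the "upper region" heuristic and the synthetic test frames are assumptions for illustration; a real system would use a proper gesture model.

```python
# Crude visual-cue sketch: flag sustained motion in the top half of the frame
# as a possible "hand up" gesture, using frame differencing (OpenCV + NumPy).

import cv2
import numpy as np

def detect_hand_up(frames, diff_threshold=25, pixel_fraction=0.02, hold_frames=5):
    """frames: iterable of BGR images (H x W x 3 uint8). Yields True when the
    upper half of the image has shown significant motion for several frames."""
    previous_gray = None
    consecutive = 0
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if previous_gray is not None:
            top = gray[: gray.shape[0] // 2]
            prev_top = previous_gray[: gray.shape[0] // 2]
            diff = cv2.absdiff(top, prev_top)
            _, moving = cv2.threshold(diff, diff_threshold, 255, cv2.THRESH_BINARY)
            fraction = cv2.countNonZero(moving) / moving.size
            consecutive = consecutive + 1 if fraction > pixel_fraction else 0
        previous_gray = gray
        yield consecutive >= hold_frames


# Synthetic test: static frames, then frames with a moving bright patch up top.
def synthetic_frames():
    for i in range(20):
        frame = np.zeros((120, 160, 3), dtype=np.uint8)
        if i >= 8:                          # "hand" appears and moves after frame 8
            frame[10:40, 10 + 5 * i : 40 + 5 * i] = 255
        yield frame

print(any(detect_hand_up(synthetic_frames())))
```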
Where a call is placed directly (307) to the end party (215), the MAP (14) is advised of the identity of the called party (309) and a speech recognition service (17) is tapped into the microphone stream on the phone. This allows commands to be recognised and acted on from this point forwards -including before the call has been answered (e.g. "Let me just leave a message" should the user give up waiting for the call to be answered). If the far end answers, the audio paths are connected but the system continues to listen (313) for spoken commands from the user and/or instructions from the touch screen, headset or other peripherals.
If the call is not answered after a timeout (or earlier if a spoken or UI command to abandon the call attempt is given) the call is abandoned (312) and torn down. As with all state changes, the MAP (14) is advised (320).
Not shown is an optional scenario whereby a call originally called directly (307), subsequently requires services only available at the MAP (14). In this case, the phone (1) establishes a second voice path, to the MAP (14) over a network that allows the existing connection to remain in place. By selectively conferencing and/or forking these, some (but not all) features of the MAP (14) can be provided. For example, a copy of the audio can be streamed to the MAP (14) for recording and/or remote analysis.
Figure 4 shows the preferred inbound call handling approach. Someone places a call to this mobile (1). Note that they may not have dialled (or even know) the actual mobile phone's (1) public number. They may dial a number printed on the phone's owner's business card which his employer associates with this mobile (1) via a MAP (14) or other redirection mechanism.
Inbound calls to the mobile phone (1) are therefore preferably arranged to route to a unique phone number that terminates on a MAP (14) rather than taken directly on the mobile phone's (1) own PSTN number. This can be done, for example, by applying a "divert all calls" feature or by advertising a different number in the first place (as described above).
Whether the call is routed directly to a MAP (14) or via a PBX (8), a call alerts (402) on the MAP (14), causing (403) a TX0Handler (211) and RX0Handler (212) to be instantiated ready for communication with the phone (1); a further TX and RX Handler (213, 214) to be instantiated in preparation for terminating the stream; and a MAPCallManager (210) to control the call.
Thereafter, the MAPCallManager (210) starts by advising (404) the mobile (1) of the call details via a data network.
A media task tries (405) to connect the TX0Handler (211) and RX0Handler (212) to the mobile phone (1). In many cases a second pair of handlers is created, allowing PSTN and VoIP call attempts to be made in parallel, with the first one to succeed being used (as long as it appears to be of adequate quality). The other channel may be dropped or maintained as a fall-back.
Optionally, a further TX and RX Handler pair (213, 214) may be instantiated and a call initiated to an IVR (9) port or pool. This is needed if the IVR (9) is to be used in assistive mode (e.g. playing pre-recorded messages) and/or to take control of the call at any point (e.g. to take credit card details).
Optionally (not shown) a further TX and RX Handler (213, 214) may be assigned, ready to play an announcement (220) or read some text via text to speech. Likewise, for recording services if required.
The original call may or may not be answered immediately (403). There are reasons to adopt each approach. For example, as soon as the call is answered, the far end may incur charges. The likelihood of a charge being incurred may be easily inferred in some cases. For example, a normal UK geographic number answering a call from a landline (first two digits not "07") is very likely going to result in the caller incurring a call setup charge, which can be significant. This call may therefore be allowed to continue alerting until the call to the mobile (1) has been answered or a decision is taken to respond with answering machine or voicemail (10) capability, at which point the MAP (14) may provide such capability internally and/or route the call to an existing service (10). Other calls, determined likely to incur zero or minimal charges, or to which other rules applied to the call, destination, source, time or other parameters are relevant, may be answered immediately (403) so that progress notification tones and/or announcements can be played to the caller rather than basic ring tone.
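The answer-now-or-keep-alerting decision described above could be driven by a small rule function like the sketch below. The prefix rules and the default are assumptions for illustration (UK-style numbering), not a statement of actual tariffs.

```python
# Illustrative rule: decide whether to answer an inbound leg immediately,
# based on a crude guess at whether answering will make the caller pay.

def answer_immediately(caller_number, rules=None):
    """caller_number: calling number in national format, e.g. '01632960123'.
    Returns True if the call can be answered at once (caller unlikely to be
    charged for an answered call), False to keep it alerting."""
    rules = rules or [
        ("07", True),    # mobile caller: typically bundled minutes, answer now
        ("080", True),   # freephone caller: no charge, answer now
        ("01", False),   # geographic landline caller: setup charge likely, wait
        ("02", False),
    ]
    for prefix, answer_now in rules:
        if caller_number.startswith(prefix):
            return answer_now
    return False                            # unknown prefix: be conservative

print(answer_immediately("07700900123"))    # True
print(answer_immediately("01632960123"))    # False
```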
Before the connection to the mobile (1) has been established, if speech commands are to be supported from either party on the call, speech recognition services (217) are tapped (407, 408) into the appropriate RX Handler (214). Note that these may have different language, speaker and/or vocabulary models for the caller and called party.
The various TX and RX Handlers (211, 212, 213, 214) route, fork, mute, mix, process, filter, analyse and generate audio, video or other data stream content (409) as instructed by the MAPCallManager (210) throughout the call until it is terminated. Any events they detect from signalling or in-band analysis of the media stream are passed to the MAPCallManager (210) for processing. These may result in changes to how media is flowing and/or connection/disconnection of streams.
Processing at the mobile for this incoming call scenario is essentially a subset of that for outbound calling. Whether a push notification or an inbound call from a MAP (14) via the mobile network occurs first, the user is alerted to an incoming (enhanced) call as normal.
The voice connection to the MAP (14) is completed immediately or when the user chooses to answer the call (depending on tariff details and user preferences) -joining the flowchart of Figure 3 at 322. As at the MAP, the TX/RX Handlers route media as instructed and advise of signalling/control events while the CallManager task (206) is advised of and acts on events coming in from the MAP (14) and the mobile phone's (1) User Interface (210) and/or peripherals.
Note that the processing at the MAP (14) end for an outbound call from the mobile (1) is very similar, as this also results in an inbound call from the mobile (1) to the MAP (14). When creating the handlers, the MAPCallManager (210) recognizes the calling party as a supported mobile phone (1) (preferably it has recently been alerted to that by a data message over a data network (303)) and has already started preparing for the call. In this case, the inbound call from the mobile (1) is answered immediately and the counterparty(ies) (215) is/are called instead of the mobile (1).
A subset of the features described can be provided to callers from regular phones or mobile phones that do not have the application (16) present.
With the various media streams established and appropriate analysers processing them, we now turn to the enhanced feature set that can be provided to the user within this system.
The overall goal is to provide easy, non-intrusive access to at least the features that users of dedicated phone terminals (19) and "agent desktop" interfaces make frequent use of in advanced call centres. These typically require a complex user interface, in the form of a business telephone set, a "softphone" or "agent desktop" and often ancillary controls such as agent initiated recording controls, an auto-dialler user interface and so forth.
Voice assistant devices are now in many homes and their speech interface, triggered typically by a "wake" word or phrase is well proven and understood. Furthermore, these devices have accompanying Software Development Kits (SDKs) and Application Programming Interfaces (APIs) making it easy to construct sets of commands -with varying degrees of complexity of dialog beyond the initial voice utterance that triggers a command sequence.
Mobile phones can access these services directly. Most run as services over the internet with audio being transmitted to them and the response coming back. They can therefore be connected into a call via a TXHandler/RXHandler (213, 214) pair as described above.
It is now commonplace to initiate phone calls via voice commands -especially using hands-free mode when driving. However, there is scope for significant extra functionality that can easily be added given the architecture described above.
This needs to be done with minimal disruption to the flow of the call. Luckily, there are some very common phrases used ahead of most telephony operations, because as soon as these operations are performed, the audio path to the customer is often lost. Today's voice assistants are designed to have a command immediately follow a wake word or phrase. This is ideal for use on a call. For example (using British English phrasing):
* "Let me just put you on hold for a minute or two."
* "Let me just transfer you to sales."
* "Let me just conference in my supervisor."
In all these cases, the intent is clearly stated after a common phrase, "Let me just", that does not sound at all out of place or deliberately aimed at a voice assistant, even though it can be. Other regions and languages have similar phrases that can be used there.
There is also a natural pause (at least on the part of this speaker on the call) giving a clear demarcation at the end of the command. The other party often expects to hear silence -if only briefly -at this point. This is an ideal opportunity for this speaker to conduct the remaining dialog in private.
In this system, it is straightforward to route the responses from the voice assistant that is tapped into the caller's audio stream back to that caller only. The other party therefore does not hear "Sorry, I don't know that one". Preferably, this phrase is replaced by a short but easily recognisable tone so as to distract the caller less. A brief rising tone indicating "huh?" (or, literally, "Huh?") is all that is required should a command go unrecognised or the wake phrase be used in more general conversation without a valid telephony command following it.
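The wake-phrase handling described here can be pictured as a thin layer over the recognised transcript: spot the phrase, treat what follows as the command, and send any assistant response (or the short "huh?" tone) only to the speaker's own leg. The sketch below assumes a simple command table, plain-text transcripts and an injected send_to_party function; all names are illustrative.

```python
# Sketch: extract "Let me just <command>" from a recognised utterance and
# route the assistant's reply to the commanding party only.

WAKE_PHRASE = "let me just"
COMMANDS = {                               # illustrative command table
    "put you on hold": "hold",
    "retrieve that call": "retrieve",
    "start recording": "start_recording",
    "hang up": "hang_up",
}

def extract_command(utterance):
    text = utterance.lower().strip()
    if WAKE_PHRASE not in text:
        return None
    candidate = text.split(WAKE_PHRASE, 1)[1].strip(" .,!?")
    for phrase, action in COMMANDS.items():
        if candidate.startswith(phrase):
            return action
    return "unrecognised"

def handle_utterance(utterance, speaker, send_to_party):
    """send_to_party(party, audio_or_text): injects audio into that leg only."""
    action = extract_command(utterance)
    if action is None:
        return None                        # ordinary conversation, do nothing
    if action == "unrecognised":
        send_to_party(speaker, "<huh? tone>")   # private, not heard by others
    else:
        send_to_party(speaker, f"<confirmation for {action}>")
    return action

sent = []
handle_utterance("Let me just put you on hold for a minute or two.",
                 speaker="John Doe",
                 send_to_party=lambda p, msg: sent.append((p, msg)))
print(sent)   # [('John Doe', '<confirmation for hold>')]
```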
Some commands have serious consequences and hence should be confirmed before they are acted upon. Again, the beauty of the existing telephony system is that the other party already expects the line to go dead after many of these sentences. It is easy to play the confirmation or follow-up dialog only to the user giving the command. For example, "Are you sure you want to hang up?" -to which a rarely mistaken "Yes" or "No" is given -and that response goes solely to the voice assistant, not to the other party on the phone call who may already be listening to an announcement or music on hold.
In the example dialogs below, the spoken responses and questions are deliberately terse. Preferably at least two options are provided to the end user from sets of utterances that could be characterised as "terse", "polite" and "verbose".
Users typically start with "verbose" -which can include tips and hints explaining commands while waiting (e.g. while listening to ring tone). Users can then move to "polite" or "terse" if their time is more valuable than how they are seen to converse with their telephony voice assistant.
Different command sets are used during the three phases of: call setup; once the call has connected and after the call has ended. Within these stages of the call (or sub-call such as a consultation call within the overall interaction), each command may be enabled or disabled through corporate and/or personal preferences. Those occurring during the call may optionally be provided to the other party or parties on the call as well as to the user of the mobile phone (1).
In the dialog examples below, the words spoken by the user of mobile phone (1) ("John Doe") are shown in normal typeface and those of the voice assistant's responses in italics. The counterparty is referred to as "Jane Smith".
Synonyms and alternative phrasing may obviously be added to improve command recognition accuracy.
Note that initial responses/confirmations are, preferably, unique for each command. This allows very short phrases to be used but the user can still be confident that the appropriate command has been understood.
Many of these will be designed and configured by the business or even the individual user. Catalogues of command dialogs may be made available for users to pick and choose functionality from and, if they wish, assign to command phrases of their choice.
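A catalogue of this kind could be no more than a declarative mapping that the business curates and each user can re-bind to their preferred phrases, along the lines of the sketch below. The field names and example entries are assumptions made for illustration.

```python
# Illustrative command catalogue: the business defines actions and default
# phrases; an individual user may re-bind an action to a phrase of their own.

CATALOGUE = {
    "hold":            {"default_phrase": "put you on hold",  "confirm": False},
    "transfer":        {"default_phrase": "transfer you to",  "confirm": True},
    "hang_up":         {"default_phrase": "hang up",          "confirm": True},
    "start_recording": {"default_phrase": "start recording",  "confirm": False},
}

def build_user_commands(catalogue, user_overrides):
    """Return phrase -> (action, confirm) for one user, applying overrides."""
    commands = {}
    for action, entry in catalogue.items():
        phrase = user_overrides.get(action, entry["default_phrase"])
        commands[phrase] = (action, entry["confirm"])
    return commands

john = build_user_commands(CATALOGUE, {"hold": "pop you on hold"})
print(john["pop you on hold"])   # ('hold', False)
print(john["hang up"])           # ('hang_up', True)
```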
Each user may choose from a wide range of tasks that the business can accumulate and share between employees. As with the in-call commands, buttons, text fields etc. may be presented on the user interface of his phone as an alternative to spoken responses, or just to let him correct any errors in what has been interpreted from his spoken responses.
Note that call status and recording state (on, off, paused...) is also shown continuously on John Doe's screen (210) throughout the call and can be controlled by pressing buttons thereon. This provides a fall-back mechanism for the (increasingly rare) cases where the voice assistant cannot interpret his commands correctly.
During call setup, assuming the call reaches a valid endpoint (215) (has not been misdialled or a wrong number/address used), the call will ring the far end (215) until it is either answered or the caller gives up ("abandons"). During this period of alerting, a number of commands, including but not limited to the following, are supported. In this phase, responses are mixed with (a reduced-volume copy of) the ongoing ring tone so that if the call is answered before a valid selection is made, the user is immediately aware; the call is connected and the partially completed action abandoned.
"Let me just hang up." Confirmation question asked ("End Call?") played over ongoing ring tone. Call terminates after confirmation. If call is answered before positive confirmation, action is abandoned, call is connected.
"Let me just leave a message." Dependent on the contact details available for the counterparty (215) being called, user is asked to choose from appropriate set of options (for example "Record an email for Jane Smith, have me transcribe a spoken message or record a message for me to call her with later?"). If a valid option is not heard, assistant plays "Sorry, didn't recognise that option" or "I didn't hear your choice" and abandons the action.
"Let me just try again later." This then uses standard scheduling/alarm/reminder dialog patterns to determine when to try again ("In an hour", "At 3pm tomorrow", "daily at 9 AM", "Noon on 31st) ...). The dialog can include options such as "Shall I call you first and then Jane Smith or only call you once I have her on the line?". In the latter case, a scheduled call is made to the counterparty (215) who, on answering, are then played "John Doe has been trying to reach you. I'm just trying to connect him into the call now." "Let me just ask Fred Bloggs to call them instead." "Fred Bloggs" is looked up in the user's contacts list with the same or similar dialog and search approach used for voice controlled calling. Assuming a candidate party is found, an appropriate set of options is built from the contact details available. For example "Shall I email, text or call Fred Bloggs asking them to call Jane Smith?" then, on a valid option being selected, "Would you like to record a message to go with that request"? "Let me just give up on them." Acts as for hang up (and requires confirmation) but also removes this counterparty (215) from a pre-specified list of parties (as may be done with an auto-dialler dialling list).
Once the call has been established between mobile phone (1) and at least one counterparty (215) or backend service such as a transcription or concierge service (223), several more commonly used phrases can be listened for and acted on, much as a business phone would do on having the corresponding button pressed. For example:
"Let me just put you on hold." Call immediately placed on hold with only a very brief positive "doodle-doop" tone or "OK". This action is reversible so there is no need for explicit confirmation. Actually, there is no need to do anything in the PBX (8) unless call or data costs can be saved by stopping media streaming within other networks. Typically the MAPCallManager simply stops John Doe's voice stream from being added into the call, optionally replacing it with music or announcement(s). Normally this will also stop their voice being added to recording streams (though there are exceptions).
"Let me just retrieve that call." If there is a call on hold, it is retrieved after a brief positive confirmation tone or acknowledgment word/phrase as above. Again, this is reversible so no need for explicit confirmation. If as noted above, the hold did not involve actions outside the MAP (14) then neither does this action as it simply reverses the actions take when putting the call on hold.
"Let me just transfer you to ABC." A phrase such as this is used to request an (implicitly) "blind" transfer -in which the original call will very shortly hear the new party ringing (or announcements played on alerting) until they answer.
The call is placed on hold and a confirmation played and responded to ("Blind transfer to ABC?" "Yes") to ensure the destination has been understood correctly. In this case, not only the user's Contacts should be consulted for a match but also the corporate directory.
On confirmation, the call leg with the counterparty (215) hears ABC alerting immediately, ending the call as far as the MAP (14) is concerned. Depending on the PBX (8) used, this may require the call to be maintained via the MAP (14), or the call on which the counterparty (215) is connected to the MAP (14) may be transferred off to the new party, allowing this MAPCallManager (210) to terminate. Note that if the counterparty (215) is to be offered any of the in-call services offered by the MAP (14) then the transferred call must actually be made as two separate legs through the PBX (8) and the MAP (14) must remain in circuit.
Note that the contact details held for ABC may include addresses on systems and networks other than the telephone network. The connection to this new party may therefore be attempted via whatever instant messaging, VoIP or telephony service(s) are available and for which an address is known or can be determined for that party. This just influences the type of TX and RX Handler (213, 214) instantiated and the service used to establish the connections to and/or from them.
"Let me just hand you over to ABC." This phrase may be configured to invoke an implicit "consultative transfer". After confirmation that the destination has been correctly identified ("Consult ABC ahead of transfer?' "Yes"), a call is initiated to ABC (new TX and RX Handler (213, 214) pair). John Doe hears that line ringing -during which period a subset of the pre-connection commands applies.
Once connected to ABC, John Doe may leave the call (see below), in which case ABC is then connected to Jane Smith. Alternatively, John Doe may retrieve the original call (see above) resulting in a 3-way conference.
"Let me just consult with ABC".
Original call is "held" within the MAP. After confirmation ("Consult ABC?" "Yes"), an additional connection to ABC is added as for consultative transfer but, in this case, should John Doe leave the consultation call, the original call may remain on hold until explicitly retrieved (often personal preference).
"Let me just conference in ABC." This is a "single-step conference" or "fast conference" so, after confirmation that the destination has been correctly identified, a connection to ABC (new TX and RX Handler (213, 214) pair) is added to the call immediately. John Doe and Jane Smith hear that line ringing -during which period a subset of the pre-connection commands applies.
Once connected to ABC, John Doe may leave the call (see below), in which case ABC remains connected to Jane Smith.
"Let me just ask you to give your card details in secret" A brief but distinct confirmation ("Taking payment details. OK?") is played to John Doe (who gave the command). For efficiency's sake, this may played concurrently with a "How would you like to pay?' dialog starting with Jane Smith. Should the command be misinterpreted, no harm is done in the few seconds it takes to cancel it with a "No".
Given the credit card industry standards, there are several IVR approaches to this. The call may therefore be temporarily connected to an existing such service or the dialog may all occur within the voice assistant framework that is handling the "Let me just...." commands.
"Let me just get someone to help with that." Can conference in an internal or external resource pool -such as a transcription or concierge service (223) or an automated personal assistant. The response may request selection from a range of pre-configured resource options. This may or may not be audible to the other party ("Jane Smith").
"Let me just send you a recording of this call." Response may request confirmation of which party(ies) on the call are to be sent the recording. May include the above resource pool.
"Let me just send you a transcription." Response may request confirmation of which party(ies) on the call are to be sent the transcription and/or options of how to transcribe it (automatically, manually internally, externally...).
"Let me just tell you XYZ" For example "our terms for this offer". Results in a pre-recorded announcement or text-to-speech output being played to the counterparty. As this is non-destructive, it can start playing immediately if "XYZ" is recognised as a valid choice. Otherwise a "huh?" tone/word can be played to John Doe (only).
"Let me just stop that" Aborts anything being played as the result of "Let me just tell you...". Used if "XYZ" was misheard, is no longer relevant or can be truncated.
"Let me just mark this call as XXX" Adds metadata "XXX" to the call (same as "tags it with XXX"). May be confirmed or not according to how important the tag is.
"Let me just set the XXX to YYY" Sets field with key XXX to value YYY as is commonly done with "user defined fields" in call recording systems. For example, "Set the priority to high." May be confirmed or not according to how important the tag is. In this case, the confirmation may be played to the counterparty as well -giving both parties confidence that the action has been confirmed.
"Let me just add some details to the call" Results in "Go ahead" -and subsequent speech from the user is not transmitted to the counterparty (215) but recorded for internal purposes. May or not be sent to the main recording system.
"Let me just add some notes to the call" Results in "Recording notes" -and subsequent speech from the user is not transmitted to the counterparty (215) but recorded for internal purposes. May or not be sent to the main recording system. Terminates on another wake phrase in "Let me just return to the call".
"Let me just start recording" Appropriate after John Doe (or pre-recorded announcement) has explained to Jane Smith that this portion of the call needs to be recorded for contractual reasons (or other GDPR compliant reason).
A short positive confirmation tone played to John Doe (only) gives him immediate confidence that recording has started. A background recording beep tone is typically then injected into the audio stream played to Jane Smith (varies according to local regulations).
"Let me just pause recording" Appropriate ahead of sensitive information being disclosed. Stops or masks recording with a fixed tone. A brief distinctive tone gives John Doe immediate confidence that recording has been paused.
"Let me just resume recording" Ends the masking caused by the pause command above. A brief distinctive tone (different from the paused tone) gives John Doe immediate confidence that recording has been paused.
There are many other mid-call commands that could be useful in specific scenarios. For example: "Let me just flag this call to ABC", "Let me add a voice memo to this call" etc. There are also several commands that result in John Doe leaving the call.
"Let me just leave you with ABC" The parameter ABC is (unusually) irrelevant; this is merely an instruction to the system to clear John Doe's connection but explicitly not to proactively clear the remaining parties.
"Let me just drop off the call then." Will clear John Doe's connection and, if there is only one remaining real party, the entire call.
"Let me wrap up the call then." Will force clear all parties from the call even if there are two or more other parties who could otherwise continue to converse.
Should the counterparty, Jane Smith, leave the call first, this should be announced to John Doe rather than simply tearing the call down, as the latter action is difficult to distinguish from a failed connection between John Doe's phone (1) and the MAP (14).
Optionally, a connection to John Doe may be maintained after the call to the counterparty (215) has dropped. "Completion codes" are a common requirement. For example, on clearing the counterparty from the call, the voice assistant could ask "Please state the outcome of this call"; "Shall I add Jane Smith to your prospect list?"; "Shall I schedule a follow-up call with Jane Smith?" etc. Some commands may also be offered to the counterparty (215) during all or part of the call.
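A post-call wrap-up dialog of this kind could be driven by a short list of prompts played over the retained connection to John Doe, as in the following sketch; the ask() hook standing in for prompt playback and speech recognition is hypothetical.

```python
# Sketch under assumptions: a simple post-call wrap-up ("completion code")
# dialog run once the counterparty has dropped from the call.

WRAP_UP_PROMPTS = [
    ("outcome", "Please state the outcome of this call."),
    ("add_prospect", "Shall I add Jane Smith to your prospect list?"),
    ("follow_up", "Shall I schedule a follow-up call with Jane Smith?"),
]

def run_wrap_up(ask):
    """ask(prompt) plays the prompt and returns the recognised reply (assumed hook)."""
    results = {}
    for key, prompt in WRAP_UP_PROMPTS:
        results[key] = ask(prompt)
    return results

# Example with canned answers standing in for speech recognition:
answers = iter(["sale agreed", "yes", "no"])
print(run_wrap_up(lambda prompt: next(answers)))
```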
For example, while John Doe has Jane Smith "on hold", this is an ideal opportunity to play her some instructions, such as "John Doe has placed this call on hold. If you need me to attract his attention, just say 'Excuse me'. If you need to leave the call, please tell me. You can leave him a message if you need to drop off the call." Note that this does not disclose the wake phrase ("Let me just" in this example). This is partly because "Excuse me" is easier to remember and partly because the system does not necessarily want to draw the counterparty's attention to the wake phrase.
Other commands can be listened for without explicit instructions having been given. For example, "Excuse me!" said while a pre-recorded announcement is being played may terminate it with a "Sorry, can I help?", preferably in John Doe's voice rather than that of an assistant that the customer may not have been aware was "present" on the call.
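One possible shape for such a listener is sketched below (assumed hooks only): a transcript watcher on the counterparty's stream that, on hearing "excuse me", stops any playback, replies with a prompt pre-recorded in John Doe's voice and alerts John Doe.

```python
# Minimal sketch (invented interfaces): barge-in handling for the counterparty
# while she is on hold or an announcement is playing.

class CounterpartyListener:
    def __init__(self, stop_playback, play_to_counterparty, notify_user):
        self.stop_playback = stop_playback
        self.play_to_counterparty = play_to_counterparty
        self.notify_user = notify_user

    def on_transcript(self, text: str) -> None:
        if "excuse me" in text.lower():
            self.stop_playback()
            # Reply in John Doe's own voice, not an obviously separate assistant.
            self.play_to_counterparty("prompts/john_doe_sorry_can_i_help.wav")
            self.notify_user("Jane Smith is asking for your attention")

listener = CounterpartyListener(
    stop_playback=lambda: print("playback stopped"),
    play_to_counterparty=print,
    notify_user=print,
)
listener.on_transcript("Excuse me! Are you still there?")
```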
Advantageously, the system may "scrape" the configuration of the company PBX (8) via, for example, an administration interface or API. By reading all of the phone numbers/handles and other addresses and their associated names/descriptions where present it can stay up to date with the names and numbers that may be mentioned in voice commands.
Similarly, if the employees are configured in a human resource system, workforce optimisation suite, recording system, private branch exchange system, corporate directory, public directory, domain server or active directory, reading the current configuration and subsequent updates from there can also reduce the need for ongoing detailed configuration of the possible numbers/addresses and associated job functions and employee names.
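For illustration, such a refresh could be as simple as periodically reading directory entries and rebuilding the map from spoken names and job functions to dialable numbers; the fetch_directory() source below is a stand-in for whichever PBX administration interface, LDAP/Active Directory or HR system is actually available, and the entries shown are invented.

```python
# Illustrative sketch only: refreshing the vocabulary of names, job functions
# and numbers available to voice commands from an existing directory source.

def fetch_directory():
    # Stand-in for a PBX admin API / LDAP query returning current entries.
    return [
        {"name": "Fred Smith", "role": "Credit Controller", "number": "2041"},
        {"name": "Jane Jones", "role": "Sales Manager", "number": "2042"},
    ]

def build_command_grammar(entries):
    """Map every spoken form (name or job function) to a dialable address."""
    grammar = {}
    for entry in entries:
        grammar[entry["name"].lower()] = entry["number"]
        grammar[entry["role"].lower()] = entry["number"]
    return grammar

grammar = build_command_grammar(fetch_directory())
print(grammar.get("credit controller"))   # -> "2041", usable in a transfer command
```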

Claims (25)

  1. A system providing real-time data stream exchange or interaction between a plurality of parties in which control over aspects of said interactions is achieved by the deliberate insertion of and subsequent analysis and identification of one or more pre-determined phrases and/or visual cues within one or more of said real-time data streams.
  2. A system of claim 1 further characterised in that said pre-determined phrases begin with a pre-determined word or phrase.
  3. A system of claim 1 further characterised in that said control varies according to the state of the connections to the parties on the call.
  4. A system of claim 3 further characterised in that said control affects the establishment of an as yet unconnected data stream and/or the alternative actions to be performed should said stream not connect.
  5. A system of claim 1 further characterised in that at least some of said data streams are routed not directly between said parties but via a stream management node that variably derives the data transmitted onwards in any of said data streams from the data streams arriving at it from each party and/or from sources within or connected to said management node.
  6. A system of claim 1 further characterised in that at least some of said control is performed by a private branch exchange controlled by commands from said stream management node.
  7. A system of claim 1 further characterised in that at least some of said phrases are chosen from the set of those commonly used within telephone calls to advise the counterparty of an imminent action likely to affect said call.
  8. A system of claim 5 further characterised in that data regarding said interaction is transmitted across a second network connection to said node before or in parallel with the establishment of said data stream between the initiator of said interaction and said node.
  9. A system of claim 5 further characterised in that, in the absence of a viable data connection between the initiating party and said stream management node, that data is transmitted in-band via the real-time data stream established between said initiator and said node.
  10. A system of claim 1 further characterised in that said analysis is performed by an Interactive Voice Response system accessed via a telephony system.
  11. A system of claim 5 further characterised in that connection of two or more real-time data streams is attempted and optionally maintained between said stream management node and one or more of said parties.
  12. A system of claim 1 further characterised in that inbound calls intended for a specific party are instead routed to a stream management node that then establishes a real-time data stream connection to said specific party.
  13. A system of claim 12 further characterised in that said real-time data stream connection to said specific party is a circuit switched call over a telephony network and is initiated by said specific party in response to an instruction from said stream management node sent over a packet data network.
  14. A system of claim 1 further characterised in that said phrases initiate a spoken dialog response between the speaker of said phrase and the server or service providing said analysis in which elements of said dialog are selectively transmitted to or concealed from one or more of the other parties in the interaction.
  15. A system of claim 1 further characterised in that said control includes one or more of the standard telephony functions: hold, retrieve, blind-transfer, consultative-transfer, conferenced transfer, consult, conference, abandon, hang-up, call.
  16. A system of claim 1 further characterised in that said control includes one or more standard auto-dialler functions including but not limited to: reschedule, reassign, change next contact attempt method, leave a message, store call outcome code.
  17. A system of claim 1 further characterised in that said control includes one or more standard recording control functions including but not limited to: start, stop, pause, resume, tag, transcribe.
  18. A system of claim 1 further characterised in that said control includes one or more standard agent assistive functions including but not limited to: play pre-recorded announcement, play text to speech output, take payment or other details in secret.
  19. A system of claim 1 further characterised in that said control includes connection to and/or transfer to one of the following services while the interaction continues: transcription, concierge, automated personal assistant.
  20. A system of claim 5 further characterised in that a subset of said controls may be performed by a party on the call other than the party that resulted in said stream management node being inserted into the interaction.
  21. A system of claim 20 further characterised in that a further subset of said subset of controls may be given and acted upon whilst the content of the data stream or streams transmitted by said party is not being forwarded to any of the other parties in the interaction.
  22. A system of claim 19 further characterised in that said subset of controls includes but is not limited to any of: attract other party's attention, advise other party of this party's intended departure from the interaction, record a message for the other party ahead of disconnecting.
  23. A system of claim 1 further characterised in that one or more existing databases containing employee, counterparty and/or contact details are read in order to populate the parameters for control dialogs involving employees, job functions and their associations with telephone numbers or other messaging service addresses.
  24. A system of claim 23 further characterised in that said databases include those in one or more of: human resource system, workforce optimisation suite, recording system, private branch exchange system, corporate directory, public directory, domain server, active directory.
  25. A method of providing real-time data stream exchanges between a plurality of parties in which control over aspects of said interactions is achieved by the deliberate insertion of one or more pre-determined audio and/or visual cues within one or more of said real-time data streams.
GB1816863.3A 2018-10-14 2018-10-16 System and method for hands-free advanced control of real-time data stream interactions Withdrawn GB2578121A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1816863.3A GB2578121A (en) 2018-10-16 2018-10-16 System and method for hands-free advanced control of real-time data stream interactions
PCT/US2019/056400 WO2020081614A1 (en) 2018-10-14 2019-10-15 Systems and method for control of telephone calls over cellular networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1816863.3A GB2578121A (en) 2018-10-16 2018-10-16 System and method for hands-free advanced control of real-time data stream interactions

Publications (2)

Publication Number Publication Date
GB201816863D0 GB201816863D0 (en) 2018-11-28
GB2578121A true GB2578121A (en) 2020-04-22

Family

ID=64397491

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1816863.3A Withdrawn GB2578121A (en) 2018-10-14 2018-10-16 System and method for hands-free advanced control of real-time data stream interactions

Country Status (1)

Country Link
GB (1) GB2578121A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113765939B (en) * 2021-10-21 2023-08-01 杭州网易智企科技有限公司 Calling method, device, equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080165937A1 (en) * 2007-01-04 2008-07-10 Darryl Moore Call re-directed based on voice command
US20120216151A1 (en) * 2011-02-22 2012-08-23 Cisco Technology, Inc. Using Gestures to Schedule and Manage Meetings
US20130141516A1 (en) * 2011-12-06 2013-06-06 At&T Intellectual Property I, Lp In-call command control
EP3358838A1 (en) * 2017-02-02 2018-08-08 Mitel Networks, Inc. Voice command processing for conferencing

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022079705A1 (en) * 2020-10-13 2022-04-21 Echo Smartlab Gmbh System and method for providing voice communication sessions between communication devices
US11381681B2 (en) 2020-10-13 2022-07-05 Echo Smartlab Gmbh System and method for providing voice communication sessions between communication devices
US11854553B2 (en) 2020-12-23 2023-12-26 Optum Technology, Inc. Cybersecurity for sensitive-information utterances in interactive voice sessions
US11900927B2 (en) 2020-12-23 2024-02-13 Optum Technology, Inc. Cybersecurity for sensitive-information utterances in interactive voice sessions using risk profiles
US12003575B2 (en) 2022-02-22 2024-06-04 Optum, Inc. Routing of sensitive-information utterances through secure channels in interactive voice sessions

Also Published As

Publication number Publication date
GB201816863D0 (en) 2018-11-28

Similar Documents

Publication Publication Date Title
US9948775B2 (en) Techniquest for bypassing call screening in a call messaging system
US8805688B2 (en) Communications using different modalities
GB2578121A (en) System and method for hands-free advanced control of real-time data stream interactions
US20080247530A1 (en) Outgoing call classification and disposition
US7180991B2 (en) Dynamic, interactive call notification
US9313328B2 (en) Active call processing and notifications
US20080247529A1 (en) Incoming Call Classification And Disposition
US11546741B2 (en) Call routing using call forwarding options in telephony networks
US9215409B2 (en) Systems and related methods for controlling audio communications between a relay service and an audio endpoint
US20070047726A1 (en) System and method for providing contextual information to a called party
US20120237009A1 (en) Systems and methods for multimodal communication
US8027447B2 (en) Call processing based on electronic calendar information
US10165116B2 (en) Method for including caller-provided subject information in the Caller-ID display of enterprise telephones
US7974399B2 (en) Enhanced whisper feature
US8358763B2 (en) Camping on a conference or telephony port
US20130156169A1 (en) Method and system for managing multiple simultaneously placed calls
US7508934B2 (en) Mouse enabled phone
WO2020081614A1 (en) Systems and method for control of telephone calls over cellular networks
GB2424142A (en) Call handling in a telecommunications network

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)