WO2023076187A2 - Systems and methods for query source identification and response - Google Patents

Systems and methods for query source identification and response

Info

Publication number
WO2023076187A2
WO2023076187A2 (PCT/US2022/047615)
Authority
WO
WIPO (PCT)
Prior art keywords
query
data
command
source
response
Prior art date
Application number
PCT/US2022/047615
Other languages
French (fr)
Other versions
WO2023076187A3 (en)
Inventor
Rong Li
Scott Poole
Original Assignee
Exxo, Inc.
Priority date
Filing date
Publication date
Application filed by Exxo, Inc. filed Critical Exxo, Inc.
Publication of WO2023076187A2 publication Critical patent/WO2023076187A2/en
Publication of WO2023076187A3 publication Critical patent/WO2023076187A3/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9032Query formulation
    • G06F16/90332Natural language query formulation or dialogue systems

Definitions

  • the present invention relates generally to the contextualization of a meeting space for the purpose of identifying the source of an audible query originating within the space.
  • the present invention relates generally to the analysis of an audible query to facilitate developing a response thereto and more particularly relates to systems and methods for analyzing an audible query originated in a contextualized space where identification of the query source grants access rights to data in accordance with the permissions associated with the query source.
  • the present invention overcomes many of the limitations of the prior art by providing, in an embodiment, systems and methods for a platform independent AI voice assistant configured to contextualize a meeting space to support, automatically and concurrently, a wide variety of meeting participants by identifying, through audio, visual, time of flight and other means, the source of a spoken query or command (sometimes generalized as “query” hereinafter for simplicity) and analyzing the query to cause relevant data to be provided, automatically, in response to the query or some automated task to be performed, for example a digital task such as retrieving a digital file.
  • the invention can be run in a cloud infrastructure, typically virtualized, or can be run in a customer deployment where proprietary or confidential data is to be accessed or where lower latency is desired.
  • the invention comprises a Single Page Application (“SPA”) front end that can run in any browser, thus permitting it to run on a smartphone, tablet, PC, WebTV, Smart Speaker or similar.
  • An aspect of the invention comprises, in at least some embodiments, an intermediary device that acts as an audio endpoint for interfacing with meeting participants via voice and further interfacing with audiovisual devices that expand the audio capability and enable data visualization and other visual functionalities, in some instances via a web browser running on the device.
  • the invention automatically identifies the source of a spoken query accurately and substantially in real-time, and further gleans context including the disambiguation of pronouns such as “me”, “us”, “I”, and similarly context-dependent terms such as “this”, “that”, and so on, enabling accurate analysis of the query and facilitating development of the desired response.
  • physical gestures, for example pointing, are also detected and correlated with the spoken words to assist in assigning proper context or meaning to the query.
  • additional physical world characteristics are observed and contextualized such as various gas levels (CO2, etc.), ambient noise levels, lighting levels, the presence and relative location of AV equipment, the presence and location of whiteboards, chalkboards or glassboards, and the gestures humans make while speaking that can be correlated with their spoken queries or commands.
  • This physical world data is then correlated with digital world data to perform one or more complex actions such as executing a command to “send this document to everyone in the meeting” in response to a person pointing at a document on the projection screen.
  • the system comprises a customer model that establishes, for each attendee, permitted levels of data access and maintains a user profile that assists in the accurate interpretation of that attendee’s spoken words.
  • natural language processing is used to develop and iteratively train the profile, with neither the user data nor the user query as uttered ever leaving the user’s premises or similar data silo.
  • iterative training can eventually increase the accuracy of source identification through voice only such that the use of other identification functionalities of the invention becomes less necessary or not needed.
  • analysis of a query can be performed using look-ahead techniques.
  • analysis of a query begins upon detection and transcription of the first substantive word of the query both in terms of identifying source and also identifying responsive data.
  • the analysis iterates with transcription of each additional word or other element of the query (e.g., a gesture) such that source is typically identified before the query is fully articulated and relevant responsive data is accessed within the permission associated with the identified source of the query.
  • Figure 1A shows a meeting space suitable for being contextualized by an embodiment of the present invention.
  • Figure 1B shows in process flow form an embodiment of the invention including both source identification and query analysis.
  • Figure 1C shows in process flow form an embodiment of the invention including source identification and query analysis using interim transcription to facilitate look-ahead processing.
  • Figure 2 illustrates in flow diagram form an embodiment of the source identification process.
  • Figure 3 illustrates in block diagram form the system components of an embodiment of the invention.
  • Figure 4 illustrates in block diagram form the modular functionalities of an embodiment of the invention.
  • Figure 5 illustrates in process flow form a more detailed description of an embodiment of a method for identifying the query source in accordance with an aspect of the invention.
  • Figure 6 illustrates an embodiment of the voice transcription module of Figure 4.
  • Figures 7A-7B illustrate an embodiment of the AI Coordinator/Inference module of Figure 4.
  • Figure 8A illustrates an embodiment of the Command Execution module of Figure 4.
  • Figure 8B illustrates an embodiment of the Data Preparation Manager of Figure 8A.
  • Figure 8C illustrates an alternative embodiment of the Data Preparation Manager of Figure 8A.
  • Figure 9 illustrates an embodiment of a data structure in accordance with an aspect of the invention.
  • Figure 10 illustrates an embodiment of the Dialog Control module of Figure 4.
  • Figure 11 illustrates a process flow for latency reduction in accordance with an embodiment of an aspect of the invention.
  • Figure 12 illustrates an embodiment of the model library service of an aspect of the invention, and describes a method and system for updating the Core Inference AI shown in Figure 4.
  • Figure 13 illustrates an embodiment of a training flow for updating the customer model in an aspect of the invention.
  • Figures 14A-14B illustrate an embodiment of the system elements and a process flow, respectively, for disambiguation of query terms using gaze tracking in accordance with an aspect of the invention.
  • Figures 14C-14D illustrate an embodiment of the system elements and a process flow, respectively, for disambiguation of query terms using time of flight (ToF) in accordance with an aspect of the invention.
  • Figure 15A illustrates an embodiment of a process flow for passively applying permissions to a contextualized meeting in accordance with an aspect of the invention.
  • Figure 15B illustrates an embodiment of a process flow for actively applying permissions to a contextualized meeting in accordance with an aspect of the invention.
  • Figure 16 illustrates an embodiment of a process flow for analyzing implied multiple queries in accordance with an aspect of the invention.
  • Referring to FIG. 1A, shown therein is a meeting room 10 in which a plurality of attendees 15 are positioned around a table 20.
  • a plurality of sensors 25A-n are connected, either wirelessly or by any other convenient means, to an intermediary device 30 and provide to that intermediary device such audio, visual, ToF, biometric and similar data as will be helpful in identifying the individual attendees so that utterances made by any of the attendees can be correlated with the source of the utterance and any contextual references are understood. While the sensors are shown as distinct in Figure 1A, some or all may be integrated into the housing of intermediary device 30. One of the attendees may be deemed an operator in some embodiments, although in other embodiments the operator may be remote from the meeting space 10.
  • the intermediary device 30 comprises one or more processors that execute processes as described hereinafter, some embodiments of which are shown generally in Figures 1B and 1C. As further described below, in an embodiment the intermediary device then queries data resources appropriate to the person generating the query. A response to the query is then communicated to one or more of the attendees and may, as one example, be displayed on video screen 40.
  • Referring to FIG. 1B, a process flow for an embodiment of the invention including both source identification and query analysis for a meeting occurring in a contextualized meeting space such as that shown in Figure 1A can be better appreciated. More specifically, the process begins at 100 and advances to capturing the entirety of the query at step 105. Typically a meeting comprises a plurality of attendees as shown in Figure 1A, such that the source of the query can be ambiguous. The source is identified through a disambiguation process at step 110, at least in part through the use of data requested from customer model 115. In at least some contexts, the source identification/disambiguation process can be thought of as eliminating dissonance in the query.
  • the customer model applies the permissions set for that source or, alternatively, sets the permissions associated with the most senior attendee, shown at step 120.
  • setting of permissions can be either active or passive, and can occur before a spoken query by identifying attendees visually rather than by voice or by voice identification not comprising a query, or by other means. Setting the permissions determines what data will be accessible for developing a response to the query.
  • the query can be analyzed as shown at step 125. Such analysis is discussed in greater detail below, but involves disambiguation of pronouns, gestures, and other imprecise or ambiguous terms within the query. Such a transcription process yields an understandable query.
  • the process then advances to step 130 for developing a response to the query by accessing a data source 135, again within the limits of any applicable permissions as set at step 120.
  • the response is then presented to the users/attendees at step 140 and the process ends at 145.
  • Presentment of the response can take any form suitable to the meeting space and the attendees, including audio, visual display, and so on. Additionally, the response can take the form of an action, such as retrieving and distributing a digital document, that applies to some or all attendees or any other designated group, e.g. “all of Dept X”.
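  • As an illustration only, the Figure 1B flow just described (capture at 105, source identification at 110, permissions at 120, analysis at 125, response development at 130-135, presentment at 140) can be sketched in Python. The function and class names below (Query, identify_source, permissions_for, and so on) are hypothetical and are not defined in the specification.

```python
# Illustrative sketch of the Figure 1B flow; all names are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Query:
    text: str                                      # full transcribed utterance (step 105)
    gestures: list = field(default_factory=list)   # correlated gestures, if any

def handle_query(query, customer_model, data_source, attendees):
    # Step 110: disambiguate/identify the source of the query using the
    # customer model (voice profiles, face data, identity signals, etc.).
    source = customer_model.identify_source(query, attendees)

    # Step 120: apply the permissions associated with the identified source
    # (or, in an alternative mode, those of the most senior attendee).
    permissions = customer_model.permissions_for(source)

    # Step 125: disambiguate pronouns, gestures and other imprecise terms
    # so that an understandable query results.
    understood = customer_model.disambiguate(query, source, attendees)

    # Steps 130-135: develop a response by accessing the data source within
    # the limits of the applicable permissions.
    response = data_source.answer(understood, permissions)

    # Step 140: the response is then presented as audio, a visual display,
    # or an action such as distributing a document.
    return response
```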
  • Referring to FIG. 1C, shown therein in process flow form is an embodiment of the invention including source identification and query analysis using interim transcription to facilitate look-ahead processing.
  • the process of Figure 1C varies from that of Figure 1B in that attempts to identify source and to begin analyzing the query start with detection of the first word of the query, and iterate with each additional word until (a) the source is identified and (b) the query can be sufficiently disambiguated and transcribed that an appropriate response can be developed. More specifically, the process of Figure 1C starts at 100 and the first word of a query is detected at 105A. An effort is made to identify source at 110, potentially by retrieving information from the customer model 115.
  • the process loops back to step 105A to detect the next word of the query. Again, disambiguation and transcription are performed on the updated form of the query, and identification of source is attempted. The process continues to iterate until source is identified, at which point permissions can be applied as shown at 120.
  • the current form of the query can be analyzed as shown at 125A, and a response developed based on that current form, step 130A.
  • if more words remain in the query, per step 155, the process loops to step 105A and steps 110-135 repeat.
  • the process can bypass steps 110-120 and jump to step 125A. Analysis of the query and development of an interim response continues until there are no more words in the query, yielding a “no” at step 155, whereupon the response from step 130A is provided to the users/attendees at step 140 and the process ends at 145.
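  • A minimal sketch of this look-ahead iteration follows, showing only the control flow in which source identification and interim query analysis are re-attempted as each new word arrives; the helper functions stand in for steps 110, 120, 125A and 130A and are assumptions rather than the actual implementation.

```python
# Illustrative control flow for the look-ahead processing of Figure 1C.
# identify_source, permissions_for, disambiguate and answer are hypothetical
# stand-ins for steps 110, 120, 125A and 130A.

def lookahead_process(word_stream, customer_model, data_source):
    words = []
    source = None
    permissions = None
    response = None

    for word in word_stream:                  # step 105A, repeated per word
        words.append(word)
        partial_query = " ".join(words)

        if source is None:
            # Step 110: attempt to identify the source on the partial query.
            source = customer_model.identify_source(partial_query)
            if source is not None:
                # Step 120: permissions are applied once the source is known.
                permissions = customer_model.permissions_for(source)

        if source is not None:
            # Steps 125A-130A: analyze the current form of the query and
            # develop an interim response; later words may supersede it.
            understood = customer_model.disambiguate(partial_query, source)
            response = data_source.answer(understood, permissions)

    # "No" at step 155: no more words, so the latest response is presented.
    return response
```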
  • the permissions for a given meeting can be set in accordance with the permissions associated with the most senior attendee when there are no concerns about anyone in the room getting even momentary access to information retrieved.
  • permissions for a given meeting can be set based on the most junior attendee if the desire is for information retrieved to be restricted so as not to expose it to the more junior attendees.
  • the system can dynamically apply these permissions based on the ability of the system to automatically identify meeting attendees together with the selected permissions mode.
  • the permissions mode can be preset, for example by an administrator, or can be implemented or modified at any time during the meeting via a voice or other command combined with suitable authentication such as an active form of identification, for example the reading of a presented fingerprint, an iris scan, or other similar techniques.
  • suitable authentication such as an active form of identification, for example the reading of a presented fingerprint, an iris scan, or other similar techniques.
  • permissions are based on the meeting attendee or attendees and mode rather than associated with the query source, in which case step 120 is effectively removed from the process and steps 125A et seq are performed for each iteration without regard to identification of query source.
  • the application of the permissions step can be relocated either to between steps 125A and 130A or to between steps 155 and 140, such that the query response is being developed regardless of permissions but only displayed once permissions have been applied and the data matched to the applicable permissions. It will be understood that, in the latter case, another iteration of steps 130A-135 may be required to triage data from the interim response should the permissions associated with the query source not be entitled to access some of the data available from data source 135.
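  • The two permissions modes just described can be illustrated with the hypothetical sketch below; the Attendee structure, its permission_level field, and the mode labels are assumptions introduced only for illustration.

```python
# Hypothetical illustration of meeting-level permission modes.

from dataclasses import dataclass

@dataclass
class Attendee:
    name: str
    permission_level: int   # higher value = broader data access

def meeting_permission_level(attendees, mode="most_senior"):
    # "most_senior": retrieval is governed by the most senior attendee,
    # accepting that others may momentarily see the retrieved information.
    if mode == "most_senior":
        return max(a.permission_level for a in attendees)
    # "most_junior": retrieval is restricted so nothing is exposed beyond
    # what the most junior attendee may access.
    if mode == "most_junior":
        return min(a.permission_level for a in attendees)
    raise ValueError(f"unknown permissions mode: {mode}")

# The level can be recomputed dynamically as attendees join or depart,
# mirroring the dynamic adjustment of permissions described elsewhere.
```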
  • Referring now to FIG. 2, an embodiment of a process for disambiguating/identifying the source of the query, shown generally at 110 in Figures 1B-1C, can be appreciated in greater detail.
  • the process indicated generally at 200 permits identifying a query source from among a plurality of attendees or speakers in a meeting room where voice authentication technology alone is inadequate.
  • Voice authentication can be effective for meetings involving a small number of attendees, where voice can be distinguished from within a small reference set.
  • the certainty of an accurate match degrades rapidly as the number of attendees increases. For example, in a meeting with 15 people, the voice authentication success rate is typically too low to be practical and so different methods are required.
  • an object of this aspect of the present invention is to supplement voice authentication with other technologies to eliminate potential misidentification of the query source.
  • Various technologies used to help eliminate potential identities include but are not limited to:
  • Wireless identity signaling devices such as wearables, for example the smart rings provided by Proxy™ or, previously, Motiv™, or other wearable biometric devices as well as natively managed signals via protocols such as BT/BTLE, WiFi, NFC, UWB or other third party proprietary signals.
  • Facial Recognition Technology, where the specific implementation, chosen by balancing considerations of accuracy, speed, angular accuracy and security, can be basic RGB camera image matching or more advanced and secure technologies such as IR dot grid technologies or ToF-based devices.
  • Voice Authentication Technology, which is distinct from voice recognition technology in that recognition technology simply converts spoken language into text with no regard for who the speaker is.
  • Voice Authentication Technology on the other hand is able to identify the speaker themselves from just their voice.
  • Traditional authentication technology focused on being able to work with a large reference set (one to many) and so had to limit the authentication being done on a single and specific utterance. By constraining to a small reference set (one to few) any utterance can be supported.
  • Referring again to Figure 2, the process 200 can be better understood.
  • the process begins at step 205 and, at step 210, a user command or inquiry, typically audible, is issued either actively or passively.
  • the voice is then compared with user profiles from the customer model 115 to determine whether the user or query source can be identified solely from voice recognition. If yes, the identification of the source is forwarded to the system client at step 220, along with the total relevant utterance that comprises the query or command.
  • the voice authentication AI is trained on the user’s voice uttering a “wake” word or phrase, as present day voice authentication technology typically operates off a fixed phrase.
  • voice authentication can be based on other training approaches, for example using larger speech libraries, such that enunciation of a fixed word or phrase is not necessary.
  • If voice alone does not permit identification of the query source, resulting in a NO from step 215, the location of the voice within the meeting space is detected as discussed hereinafter, and a determination is made as to whether the voice location associated with the query is proximate to only one face, indicated at step 225. If so, a yes results and the identity of the query source is ported to step 220 along with the relevant utterance. However, if a NO results at step 225, the process advances to step 230. At step 230, the process initiates a process of elimination by which other potential query sources are eliminated until only one unmatched potential source remains.
  • some candidates can be eliminated from the potential speakers, i.e., all people in the room. For example, if an identity signal location is detected, and a person at that location is also detected (especially if the facial recognition process is certain that a person is being detected, but uncertain who that person is) yet the voice location is elsewhere, that person can be eliminated as a candidate for the voice identity. Stated more broadly, it is possible to eliminate anyone recognized by facial recognition but not positioned where the voice originated from.
  • Referring next to FIG. 3, an embodiment of the system hardware components can be better appreciated.
  • the embodiment of Figure 3 comprises an intermediary device that, together with various embodiments of the system software described herein, operates to contextualize a meeting room or other space by, on the input side, adding audio, video, wireless, biometric and similar sensing to aid in identifying a person speaking or similar data source and, on the output side, by providing audio, visual and related means for communicating data responsive to the queries or other communications originating from a person speaking or otherwise issuing commands formatted for interpretation by the system.
  • These can include data visualization, including but not limited to communication through a web browser, or synthetically generated audio, text, or visual displays as briefly discussed above in connection with Figure 1A.
  • the exemplary intermediary device of Figure 3 comprises a microcontroller 300 or other suitable processor which communicates through a client 305 with an array of sensors and related devices as shown generally in Figure 1A as 25A-25n and more specifically in Figure 3 as 310A-310n.
  • the microcontroller 300, which may for example be an ARM A7 class microcontroller, typically operates under the control of software compatible with an operating system such as, for example, Linux, although numerous other operating systems are acceptable depending upon the embodiment.
  • An array of other devices 315A-315n provides user interface functions and support functions.
  • the controller 300 cooperates with one or more I/O ports, examples of which are shown at 320. Further, the controller 300 cooperates with storage in the form of, for example, main memory 325, typically volatile, and persistent memory 330, for example a flash device.
  • sensors and related devices 310A-310n comprise an ultra-wideband (UWB) controller with associated antenna array 335, a Bluetooth Low Energy (BTLE) controller with associated antenna array 340, a Time of Flight (ToF) camera/laser array, a proximity sensor, an Ethernet controller, an RGB and/or IR Camera Array, a WiFi controller with antenna, a microphone array, and a fingerprint sensor.
  • additional physical contextualization sensors can be included, including but not limited to gas level sensors, noise level sensors, humidity sensors, pressure sensors, proximity sensors and light level sensors. It will be understood that the number and type of sensors and related controllers is dependent upon the particular implementation and which such sensors are employed in such an implementation is a matter of design choice.
  • the ToF Array can be a multipoint camera array, for example five cameras although the actual number can be less or more depending upon the meeting space, and provides 360 degree visual coverage, in some instances including detection of IR imagery, to permit identification not only of attendees, but also to permit extrapolation of a finger, including at least the angle of a pointing finger, upon suitable AI training.
  • the camera array permits fingertips and visual media to be recognized in real time where the distance data of the relevant pixels permits determination of angles relative to a backplane, facilitating association of a pointing finger to a subject of interest being displayed.
  • a proximity sensor performs blunt object detection, to determine whether anyone is in the room, and can be configured for “power-on on approach” or “power-off on departure” for both security and power savings.
  • an RGB-IR camera can be configured to detect IR emissions from attendees, for example the heat from an attendee’s hand or finger to supplement pointing, after adjustment for ambient heat.
  • RGB imagery can be provided, either alone or with IR imagery, to facilitate use of a structured light facial recognition solution which can be more secure and reliable than simple RGB facial recognition.
  • devices 315A-315n can comprise one or more of a digital signal processor (DSP) 315A, an AI acceleration chip such as a specialized GPU or ASIC, button controls, speakers, and LED indicators.
  • the speakers and LED indicators can provide audio and visual output from the device to the meeting attendees to provide responses to queries or commands.
  • the button controls provide user control of functions that can involve explicit tactile or physical user interaction, such as volume or power.
  • An AI acceleration chip, potentially integrated into the microcontroller 300, permits the system to perform certain tasks with low latency, such as responses to commands for moving through a slide deck, jumping to a particular slide or other document, or performing voice, face and gesture recognition in real time.
  • a DSP, also potentially integrated into the microcontroller, permits curating of incoming audio to provide high levels of accuracy in interpreting incoming audio signals (i.e., voice/utterances).
  • the IR Camera array can be employed along with gaze tracking algorithms to provide the context of where a meeting participant is looking. This information can be utilized in a variety of ways such as generating metrics on meeting attendee focus and degree of participation, or correlated with other sensor inputs to more definitively determine the position of suspected objects (whiteboard, TV, projection screen, etc.) or to gain awareness whether people are looking at each other or at a specific person to help in identifying to whom an utterance is directed, e.g. “can you take care of that?”.
  • a fingerprint sensor 318A can be provided to permit active management of security permissions during a meeting, such as when sensitive documents or other materials are being displayed, and serves as an affirmatively physical alternative to other forms of recognition of meeting attendees, as described above. Because the system of the invention is contextually aware of a current meeting, including the identities of the attendees and their fingerprint, the permissions level can be automatically adjusted to match the permissions level of the most senior person currently in attendance, although in some embodiments such a senior attendee can also authorize, verbally or by any convenient means, the meeting to continue at that person’s permissions level even after they depart the meeting. Absent such authorization, the permissions level for a given meeting drops dynamically as more senior attendees depart.
  • a multipoint microphone array 318n, for example five microphones although the exact number can vary depending upon the meeting space, provides detection of angle of arrival and position for any utterances made during the meeting, thus assisting in determining the identity of the speaker and reducing the reference set of possible speakers for voice authentication.
  • the I/O ports 320 can, for example, comprise one or more USB ports, one or more Ethernet ports, HDMI input and output connections, one or more wireless displays, and one or more of each of WiFi, ultra wide-band (UWB) and Bluetooth Low Energy (BTLE) interfaces.
  • the WiFi, UWB and BTLE interfaces can provide recognition of attendees’ devices such as smartphones or tags, although WiFi can also be used for connection to a network if more suitable than a wired connection.
  • Figure 4 illustrates in block diagram form the system of the invention that resides on the other side of the Client, running in the operating system executing on the intermediary device of Figure 3, and identified generally at 400. More specifically, Figure 4 illustrates a plurality of modular functionalities that comprise an embodiment of the invention, and further denotes where such functionalities physically reside, typically in either a virtual or a physical server. It will be apparent to those skilled in the art that, in some embodiments, certain functionalities may be incorporated into the intermediary device of Figure 3.
  • system 400 performs speech to text transcription, inference of material information from the spoken utterance, and processing of that material information into a backend database query either for data sources that are well defined (e.g., Quickbooks, SalesForce.com or other CRM application, NetSuite, Multiple Listing Service, etc.) or data sources that are weakly defined (e.g., SQL, Oracle, Redshift, Snowflake, etc.). Additionally, system 400 can interface with non-data sources such as APIs for managing task commands or queries, for example PowerPoint/Excel/Word APIs for navigating and editing documents via voice, or Outlook or similar APIs for managing calendars. Finally, the system 400 can perform wholly self-controlled functions such as note taking, recording or executing action items, or administrative tasks such as reservations or food orders.
  • the User Interface 405 can be hosted on the intermediary device 30 with or without a visual medium connected. In embodiments that do not include a visual medium, some features such as data visualization would be limited.
  • the user interface 405 is a single page application running in a web browser and is therefore platform independent. As such, it can be run on any device that can run a fully featured browser, such as a laptop, desktop, smartphone, projector, webTV, Smart Speaker (or other device plugged into the intermediary device of Figure 3), a tablet, or any other suitable computing device.
  • the User Interface is primarily responsible for collecting the spoken utterances of users and transmitting into air the audible spoken responses generated by the present invention. Such spoken responses can be either answers or clarifying questions.
  • the user interface transmits visual responses to a connected visual medium (if implemented) for display to the user(s)/attendees.
  • the display data service 410 retrieves the data blob (“binary large object”) comprising a query result from the relevant database.
  • the client is notified via a message bus 430 that the data is ready, at which point a request for the data is made to that service.
  • Authentication service 415 verifies user account and product access and manages connector-specific token retrieval, update and verification.
  • Configuration service 420 manages user configuration changes communicated from the Client (the User Interface) by updating the Operational Database 455 accordingly.
  • a voice transcription service 425 is responsible for receiving the incoming stream of audio data from any client via voice input microservice 425A, converting that stream (e.g., audio) to a common format supported by the voice transcription engine 425B, receiving the transcription back (e.g., text) and dropping a message on the message bus 430 that the transcription is complete.
  • the client sends the audio stream directly to the transcription service, gets the transcription back and places the completed transcription on the bus itself.
  • the client can do the transcription locally using the client resources and drop the transcription onto the message bus.
  • the voice transcription service also configures the voice transcription engine appropriately for a given client.
  • Some clients can run as embedded browsers as is the case for Microsoft Powerpoint control, in which case the present invention can run in an embedded browser inside of PowerPoint.
  • This allows client type to be known, allowing the transcription engine to be optimized for that client and its associated queries/commands.
  • the voice transcription service communicates with other services in system 400 via message bus 430.
  • AI Coordinator 435 feeds the transcribed utterance into the correct AI inference engine indicated generally at 440.
  • the invention comprises multiple AI Inference engines 440A-n, each trained on a specific scope of understanding in order to increase accuracy in developing an inference by limiting how many individual slots and intents a given engine must attempt to discern among.
  • an intent represents an action that fulfills a user's spoken request.
  • Intents can optionally have arguments called slots.
  • a new AI Inference Engine can be created if desired for the particular implementation.
  • a parent AI that only identifies intent can serve to split the AI categorically into two or more parts.
  • the total utterance is passed onto a separate AI that is trained on both Intent and Slots but where the range of potential total utterances is narrower so as to increase accuracy of slot identification.
  • the AI Coordinator also takes the output of the AI Inference Engine (Intent/Slot data derived from the total transcribed utterance) and calls Command Execution Engine 450 with that data, again by using message bus 430.
  • the various AI Inference Engines 440A-n, each containerized so it can be migrated around the infrastructure, are each deployed as their own Web App and thus have essentially independent endpoints.
  • a simple text string input will output an intent, slots and meta-data such as confidence levels. This applies to any of engines 440A-n.
  • the AI engine itself is Python wrapped in a web framework (currently Flask), which runs inside the virtualized container.
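  • Because the specification describes each inference engine as Python wrapped in Flask inside a container, a minimal sketch of such an endpoint is given below; the route name, JSON payload shape and the stub inference function are assumptions for illustration, not the actual implementation.

```python
# Minimal sketch of a containerized inference engine's web endpoint.
# The /infer route, payload shape and run_inference stub are illustrative.

from flask import Flask, request, jsonify

app = Flask(__name__)

def run_inference(utterance: str):
    # Placeholder for the BERT-based model: returns an intent label, slot
    # key/value pairs and a confidence score for the transcribed utterance.
    return {"intent": "show_revenue",
            "slots": {"customer": "Red Rock Diner", "year": "2019"},
            "confidence": 0.93}

@app.route("/infer", methods=["POST"])
def infer():
    payload = request.get_json(force=True)
    result = run_inference(payload["utterance"])
    # Intent, slots and metadata such as confidence level are returned to
    # the AI Coordinator for mapping to a command template.
    return jsonify(result)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```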
  • Command execution service 450 turns intent and slot data into an executable query against the relevant one(s) of data sources 460A-n, executes that query on the data source and finally parses the results into a form that the client can consume.
  • a Dialogue Control Service 465 makes available audio playback data, for example an error message, an audible answer, or a clarifying question.
  • this service either pre-generates certain responses that have no dynamic component, or alternatively receives a dynamically generated text string and uses a text to speech engine to convert that into an audible form.
  • the Data blob that the client gets from the Display Data Service will contain the URI at which the associated audio file will reside such that the client can connect and stream/play that audio.
  • the dialogue control service will own the creation of the dynamic string to be converted to audio by taking in just the variable components from another service rather than taking in the whole string already constructed.
  • Referring to Figure 5, a speaker identification process is described for identifying a speaker whose identity may not otherwise be known by eliminating other users until the speaker is isolated.
  • the process begins at 500, where an utterance is made by an unidentified speaker.
  • users whose face is recognized and who are within proximity of their signal are eliminated as not being the speaker.
  • a determination is made whether the remaining number of unidentified users is small enough that the speaker can be identified through voice recognition. If not, at step 520, those users are eliminated whose face is recognized but are not within proximity of the voice location. Again, step 515 determines whether the remaining constellation of unidentified users is small enough that voice recognition can accurately identify the speaker.
  • step 525 all users whose face is recognized and not within proximity of the voice location are eliminated.
  • a test is again made at 515 to determine the identity of the speaker. If the result is “no”, the process ends at 530. If “yes”, the process advances at 535 and 540 by sending the identity of the speaker and the total relevant utterance to the client.
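  • A hedged sketch of the elimination loop of Figure 5 follows; the candidate attributes (face_recognized, near_identity_signal, near_voice_location), the elimination order and the threshold constant are illustrative assumptions rather than the patented steps verbatim.

```python
# Illustrative elimination of candidate speakers, loosely following steps
# 510-540 of Figure 5. Attribute names and the threshold are assumptions.

from dataclasses import dataclass

@dataclass
class Candidate:
    user_id: str
    face_recognized: bool
    near_identity_signal: bool   # wearable/UWB/BTLE signal detected nearby
    near_voice_location: bool    # positioned where the voice originated

VOICE_AUTH_MAX_SET = 5   # reference-set size at which voice auth is reliable

def identify_speaker(candidates, voice_authenticate):
    remaining = list(candidates)

    # Successive elimination passes: drop people who are confirmed present
    # (face and/or signal) but whose position is inconsistent with where
    # the voice originated.
    passes = [
        lambda c: c.face_recognized and c.near_identity_signal
                  and not c.near_voice_location,
        lambda c: c.face_recognized and not c.near_voice_location,
    ]
    for should_eliminate in passes:
        remaining = [c for c in remaining if not should_eliminate(c)]
        if len(remaining) <= VOICE_AUTH_MAX_SET:
            # The reference set is now small enough that voice
            # authentication can resolve the speaker (steps 535-540).
            return voice_authenticate(remaining)

    return None   # speaker could not be isolated (step 530)
```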
  • Referring to Figure 6, the voice transcription service takes in the streaming audio, indicated at 600, via a SignalR Hub 605, which is a form of communications bus that supports real-time streaming.
  • the streaming audio data is placed into an audio buffer 610 which then feeds into an audio convertor 615, for example an FFMPEG suite or other converter suitable for format transcoding, that can convert any supported though unknown audio into a specific format including containerizing the audio in the event the incoming audio lacks header data.
  • the audio is sent into a safety buffer 620 as well as being fed over a WebSocket interface 625 to the speech-to-text transcription engine 635.
  • the safety audio buffer 620 serves as a backup of the real-time audio data in the event the connection to the speech to text engine needs to be reset and streaming must be paused to permit the reset.
  • the partial audio data is discarded and the backup buffer sends the entire utterance after connection reset to ensure normal operation.
  • the safety buffer 620 retains at least 20 seconds of audio. It will be appreciated by those skilled in the art that the capacity of the safety buffer 620 can be any arbitrary size and varies with the particular implementation.
  • the streaming connection to the speech to text engine may also need to be reset in the event that a given utterance approaches or exceeds the time or size limit supported by a particular service for a continuous connection.
  • Exceeding such a limit can result in closing of a connection and an unplanned interruption to streaming, including an unmanaged slicing of the user utterance if such an utterance was being made when the connection was closed. Again the partial utterance is discarded and the contents of the backup buffer are retransmitted.
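  • As a rough illustration of the safety buffer behavior just described, the sketch below keeps approximately 20 seconds of audio in a rolling buffer so the full utterance can be retransmitted after a connection reset; the class name, chunking scheme and durations are assumptions.

```python
# Hypothetical rolling safety buffer retaining roughly 20 seconds of audio
# so the entire utterance can be resent after a streaming-connection reset.

from collections import deque

class SafetyBuffer:
    def __init__(self, seconds=20, chunk_ms=100):
        # Number of fixed-duration audio chunks in the retention window.
        self.chunks = deque(maxlen=int(seconds * 1000 / chunk_ms))

    def append(self, chunk: bytes):
        self.chunks.append(chunk)

    def replay(self, send):
        # After the connection to the speech-to-text engine is reset, the
        # partial audio already streamed is discarded and the buffered
        # utterance is resent in full to resume normal operation.
        for chunk in self.chunks:
            send(chunk)
```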
  • the transcription (speech to text) engine can run a custom language model that is trained by a particular implementation in order to increase accuracy within the scope supported by the particular embodiment.
  • in some embodiments, a slightly different method can be used. In these embodiments, the need for audio format translation is eliminated.
  • the audio is stored, for example in a single 20 second rolling buffer.
  • a local speech to text engine responsive to a wake word or phrase or other triggering signal listens constantly for that triggering signal.
  • the audio stream utilized by this engine is then copied directly from the single buffer and does not require a separate buffer.
  • the main audio handling function is notified and a connection to a speech-to-text engine, which can be either local or external depending upon the embodiment, is established if not already existing. Once the connection is ready to accept data, the audio buffer is sent, whereupon the transcription is returned and parsed by the system to capture what came after the trigger signal.
  • the AI Coordinator receives a transcribed query (a text representation of the total spoken utterance). It then references a mapping table 700 that translates specific inputs about where the spoken utterance originated from and provides an AI Identifier.
  • the input to the mapping table can be an identifier of the client itself, or an output of an Intent Only AI, or a user identifier, among others - essentially any information that would assist in determining which of the AI inference engines should receive the transcribed query for intent and slot determination.
  • the resultant AI identifier from the mapping table 700 is then input into the Inference Engine Switch 705, where the transcribed query will be directed to the specific inference engine deemed most appropriate in accordance with the mapping list, whereupon the inference engine will return the relevant slot/intent data and a confidence level, indicated at 710. If the confidence level is insufficient relative to a predetermined value, the process stops and the user will be provided either an error message or a clarification request. If the confidence level is sufficient, user access validation will take place at 715. User access will determine if the user is permitted to have the query processed based on the client type or the intent itself or some other information provided to the AI Coordinator.
  • the plurality of AI inference engines 750A-n can each be seen to comprise a virtualized container 755A-n, which in turn comprises a web app 760A-n communicating with, typically, a Python Web App-wrapped BERT-based AI, with a unique engine 750 for each category of intents, including one for data visualization and one for front office control.
  • the confidence threshold can be set by any convenient means, where its purpose is, at least in part, to assist in ensuring that no data is improperly disclosed during a presentation or a meeting.
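  • The coordinator's dispatch just described (mapping table 700, inference engine switch 705, confidence check 710 and user access validation 715) might look roughly like the sketch below; the data structures, threshold value and function names are assumptions for illustration only.

```python
# Illustrative AI Coordinator dispatch: mapping table -> engine switch ->
# confidence check -> user access validation. All names are hypothetical.

CONFIDENCE_THRESHOLD = 0.8   # illustrative; the actual value is configurable

def coordinate(transcribed_query, context, mapping_table, engines,
               user_access_ok, command_execution):
    # Mapping table 700: client identifier, intent-only AI output, user
    # identifier, etc. map to the identifier of the best inference engine.
    ai_id = mapping_table[(context.client_type, context.user_id)]

    # Inference Engine Switch 705: forward the query to that engine.
    intent, slots, confidence = engines[ai_id].infer(transcribed_query)

    if confidence < CONFIDENCE_THRESHOLD:       # check at 710
        return {"error": "Could you rephrase or clarify your request?"}

    # User access validation 715: may this user have the query processed,
    # given the client type, the intent itself, or other context?
    if not user_access_ok(context.user_id, intent, context.client_type):
        return {"error": "You are not permitted to run this request."}

    # Hand the intent/slot data to the Command Execution Service (450).
    return command_execution(intent, slots, context)
```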
  • FIG. 8A describes in greater detail an embodiment of the command execution service (“CES” sometimes hereinafter) shown generally in Figure 4.
  • the CES comprises a presentation layer 800A, a data preparation manager 800B, and data storage 800C.
  • the presentation layer 800A provides an API definition that enables the user interface via an AI Coordinator, while the data preparation manager 800B, sometimes referred to more generally as the domain layer with two alternative embodiments shown in Figures 8B and 8C, comprises the business logic of the CES.
  • the logic of the data preparation manager generates an output stored in the data storage 800C, sometimes referred to as the data layer.
  • the presentation layer and the data storage layer can be changed to permit the CES to operate on different platforms without changing the data preparation manager, thus making the present invention platform agnostic.
  • the data preparation manager 800B comprises a plurality of modules that convert AI-produced slot/intent data into results for the user.
  • the functionalities of the data preparation manager 800B include a Slot Parsing Engine 801 responsible for taking the Key/Value pair format provided by the AI Coordinator in its endpoint call and converting it into data structures capable of being utilized by the CES. It also manages converting a natural language query with multiple implied machine queries into the base set of information for multiple queries discussed in connection with Figure 16 herein.
  • the result of the slot parsing engine 801 is provided to an Intent-Command Mapping List 805, which is a Many-to-1 mapping of incoming intents (as determined by the relevant AI inference engine) to one of a plurality of common command templates, shown at 815.
  • the selector 810 references the mapping list to make the correct selection of a command template 815.
  • Each of the command templates 815 defines how to build a query if one is required, how to execute it, how to extract the relevant data from the results, and how to package that data for return to the user interface.
  • each intent and connector has associated therewith a method of extraction, formatting and configuring that is specific to that intent and connector, which can include data source type.
  • a Command Template Selector 810 checks the mapping and construction of the required parameters for the command templates 815, as well as instantiating the selected template and executing the template interface, then passes the result to a Selected Command Template step, all as shown at 820.
  • a Data Connector Manager function 830 comprises data structures which provide connector-specific information to the rest of the CES module.
  • the Data Connector Manager function includes a DataMap 835 for well-defined data sources that provides a specific mapping between a spoken utterance and how that utterance should be referenced when 1) building the query and 2) parsing the results.
  • each well-defined data connector has its own Data Connector Manager and one or more associated DataMaps. The DataMap structure is explained in greater detail in connection with Figure 9.
  • the Data Connector Manager uses the DataMap to generate a query, 840, and execute it, 845, against its data source on request. It also ensures that any DataSource-required authentication tokens (per the authentication service described above) are valid so that the query will succeed.
  • the query string generator portion 840 of the data connector manager 830 uses a Jargon Manager 860 and jargon loader 865 in building the query string as shown at 840, since in some cases the user will utter jargon and the query will need to know the classification of that jargon in order to be built properly.
  • the result of the query executor step 845 is provided to the selected command template step 820, which in turn provides its output to a data extractor 850.
  • the data extractor 850 is responsible for taking the full line-item results from the database query and extracting only the relevant portions needed to respond to the original natural language query.
  • for a data visualization query, for example, the data extractor step 850 would extract X, Y, Z axis information for each record.
  • for a list or record query, the data extractor step 850 would return a filtered list of records with all or a subset of information for each record or, in the case of an exact record, just that record.
  • the output of the data extractor step 850 is provided to a data formatter 855, which in turn is responsible for bucketing datapoint results collected in the Data Extractor into buckets that will represent axis ticks in the UI or, in the case of 3D charts, the size of a graphed shape such as may be used in bubble charts or similar visualizations. This includes both axes, potentially defined as continuous or discrete variables (time, amounts, quantities or categories).
  • a jargon manager function 860 identifies the class of a jargon utterance or provides all the potential jargon utterances of a particular class. In well-defined data sources, this information can simply be queried when a user has logged in.
  • In weakly defined data sources, jargon needs to be parsed out of all of the data in the data source.
  • the Jargon Manager is utilized in query generation and in Data Formatting. In the Data Formatting case, when the buckets (per the natural language query) are by some category, the categories are determined by requesting them from the jargon manager by their category label (e.g., customers, vendors, classes, states, regions, etc.)
  • a Jargon Loader 865 operates as a sub-component of the Jargon Manager for creating the jargon mappings at login, to enable rapid access during a query.
  • the final output is provided from the selected command template 820 following completion of data formatting.
  • a user’s utterance can include filtering in their query, for example, “show me revenue for Red Rock Diner in 2019”.
  • filtering takes place in one of three ways: (1) the data connector manager query string generator itself will attempt to build the filtering into the query request as much as that particular data source will allow (in SQL terms this is the WHERE clause); (2) the data extractor will attempt to filter out from the returned results any results that are not part of the desired response; and (3), if proper slot types are present, filtering may take place after data extraction AND after formatting.
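  • As a rough illustration of the first of these filtering stages, the sketch below shows how slot data from the example utterance might be folded into a SQL WHERE clause; the table, column and slot names are assumptions and do not reflect any particular connector.

```python
# Illustrative construction of a WHERE clause from slot data for the query
# "show me revenue for Red Rock Diner in 2019". Names are assumptions.

def build_where_clause(slots):
    conditions, params = [], []
    if "customer" in slots:
        conditions.append("customer_name = %s")
        params.append(slots["customer"])
    if "year" in slots:
        conditions.append("EXTRACT(YEAR FROM txn_date) = %s")
        params.append(int(slots["year"]))
    where = (" WHERE " + " AND ".join(conditions)) if conditions else ""
    return where, params

where, params = build_where_clause({"customer": "Red Rock Diner",
                                    "year": "2019"})
query = "SELECT SUM(amount) FROM transactions" + where
# query and params would then be passed to the connector's execute call.
# Whatever filtering cannot be pushed into the query itself is applied
# later by the data extractor, or after extraction and formatting.
```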
  • Referring to FIG. 8C, an alternative embodiment of the software architecture of the present invention can be better appreciated. Modules having the same function as that shown in Figure 8B are shown with like reference numerals. Figure 8C includes additional functionalities to improve resolution of ambiguities that may result from the spoken query.
  • the selected command template can access a filter engine 870 as described above, and can further include a clarification utility 875 accessed through the filter engine.
  • the clarification utility can operate to provide a user with the opportunity to clarify the meaning of a word or words within the query.
  • the clarification utility can be passed down from the selected command template 820 and directly utilized by the jargon manager 860.
  • the data connector manager 830 authorizes the interaction of the jargon manager with the filter engine and clarification utility.
  • the jargon manager can access a search index 880, which in an embodiment serves to store all jargon that has been captured by the system for future reference, including data that may yield multiple responses to a word or words in the query, thus leading to ambiguity.
  • the search index is a query-response subsystem, and stores any clarifying information that may have become associated with the term of interest.
  • a CRM record of type “Opportunity” may be clarified with the account name, the “owner” of that opportunity, and the region.
  • the response from the search index will yield multiple results from the jargon manager, whereupon the clarification utility can craft a response to be sent to the user that includes a request for clarification as to which of the multiple results is the appropriate one or ones.
  • a cache 885 can also be provided for use by the data connector manager 830. Once the response to the query has been developed, it is provided to the selected command template, which in turn provides it to a database in the data storage layer 800C, where it is available for retrieval by the client.
  • the DataMap comprises a data structure 900 whose task is to correctly parse natural language into the appropriate database reference for a well-defined data source.
  • Data structure 900 contains, first, a table name, which essentially corresponds to the concept of a table in a relational database.
  • a Time Column is a Table-specific primary reference for queries related to time for records from that table. For example, for most transactional records this would be a string “TxnDate” but for some non-transactional records it would be “CreatedDate” - it simply depends on the table (and thus the data source).
  • the data structure 900 next comprises a mapping among Slot Type indicated at 910, Utterance indicated at 915 and a Path Traversal Definition through a datasource record or path indicated at 920.
  • the Slot Type 910 is the label the AI Inference engine assigns to the uttered word or phrase; the slot type is a category and allows the CES to know how to treat the value in processing. The Utterance is the actual uttered word or phrase, and the path traversal is a list of nodes through the tree to the piece of data, bearing in mind that the return results for a data source are often a JSON-formed object that inherently possesses a tree structure.
  • a table can comprise multiple slot types and multiple utterances per slot type.
  • Each Utterance has a path (which is a list of nodes), a set of characteristics (essentially flags that would drive special treatment of that path or provide essential information about the data found at that path, such as data type - Boolean, string, etc.) and finally a UI-friendly label for that particular path.
  • the data structure 900 comprises mappings related to query string creation. These mappings tell the query string generator in the data connector manager what to use in the query for a particular uttered jargon. The first is a mapping between jargon classification, indicated at 930, and a list of paths that could potentially have a piece of data of that classification type at its location in the query result record; each path is a list of nodes, indicated at 935.
  • the data structure 900 comprises a mapping between jargon classification, indicated at 945, and parent node, indicated at 950, rather than path. This results because, when building a jargon reference into the actual query, the reference mechanism for a data source can differ from how the actual record might be traversed. In this case it uses just the parent node as reference in the WHERE clause. To know for a given table and classification what string to use in the WHERE clause, this mapping is consulted. This data structure serves to define how to interpret the specifically connected data sources’ response and traverse that response to the precise data sought.
  • this involves encoding data traversal in a relationship that will work for both flat structures (a database row for example) or tree structures (a JSON) or potentially any atypical structure.
  • This data structure furthermore includes a mechanism to resolve jargon (essentially proper nouns with near unlimited possibilities) within the data sources response if, for example, the returned data records needed to be filtered on some specific datapoint.
  • the data structure also encodes specific data source traversal information at purely a high level for usage in the actual data source machine query, Jargon Classification to Filterable data source Path Mapping indicated at 940.
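  • The DataMap structure of Figure 9 could be represented roughly as follows; the field names are illustrative paraphrases of the elements 905-950 described above and are not the actual schema.

```python
# Illustrative representation of the DataMap of Figure 9. Field names are
# paraphrases of the described elements, not the actual schema.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class UtteranceMapping:
    path: List[str]                 # node-by-node traversal to the data (920)
    characteristics: Dict[str, str] = field(default_factory=dict)  # e.g. data-type flags
    ui_label: str = ""              # UI-friendly label for this path

@dataclass
class DataMap:
    table_name: str                 # analogue of a relational-database table
    time_column: str                # e.g. "TxnDate" or "CreatedDate"
    # Slot Type (910) -> Utterance (915) -> path traversal definition (920)
    slots: Dict[str, Dict[str, UtteranceMapping]] = field(default_factory=dict)
    # Jargon classification (930) -> candidate result-record paths (935)
    jargon_paths: Dict[str, List[List[str]]] = field(default_factory=dict)
    # Jargon classification (945) -> parent node used in the WHERE clause (950)
    jargon_where_node: Dict[str, str] = field(default_factory=dict)
```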
  • the dialogue control service comprises a static synthesizer function and a dynamic synthesizer function, indicated at 1005 and 1010 respectively.
  • the static synthesizer 1005 takes a list of error strings on startup, indicated at 1015 and received via message bus 430, and synthesizes any that are not already in existence.
  • the dynamic synthesizer 1010 takes in a string to be generated, indicated at 1020, in real time from message bus 430 and, optionally, can also take in just the variable components and synthesize either the full string with the variable values or just the variable values and insert them into the rest of the synthetization.
  • all of these audio snippets are made available via the service as a public web server 1025. Additionally, in some embodiments the UI Client itself can receive the text response and initiate a real-time text-to-speech operation with either a local method or an external service to convert the text response to audio.
  • FIG. 11 a method for latency reduction in accordance with an embodiment of the invention can be better understood.
  • the method described here is designed to minimize turnaround time for a natural language query by ensuring that the query is processed the moment the last relevant piece of information is uttered. Thus, if any superfluous words are uttered towards the end of the utterance, no delay in providing a response results from the time it takes a speaker to utter them, such that those superfluous words do not impact the overall turnaround time. Additionally, processing the query is initiated before the speech to text engine has finalized the transcription and, if the interim transcription is not materially different, processing is performed from the point of the last material difference.
  • an interim transcription result is received at 1105, and is sent at 1110 to the AI coordinator.
  • the AI coordinator determines an inference and provides a confidence value at 1115. If the confidence value exceeds a threshold value, a “confident” result is reached; if not, a “not confident” result is reached. If the result is not confident, the process ends at 1120. If the result is “confident”, the process advances to step 1125 where slot data is sent to the CES. At 1130, the result is checked for a material difference in slot data. If there is no material difference, the process ends at 1135.
  • the query - already in progress - based on the prior partial utterance is discarded at 1140, and the updated query begins processing at 1145.
  • the process ends at 1150 once the new query is processed. This process is repeated for every interim result received from the voice transcription service stream and ultimately concludes when the user is done speaking and the voice transcription service provides its final result.
  • the processes of Figure 11 operate essentially as a loop, but the loop is driven by the occurrence of further speech, yielding a new iteration for each interim result - for example, one new iteration for each additional word - as in the sketch below.
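The following Python sketch illustrates that loop under stated assumptions: the coordinator, confidence threshold, slot comparison, and query handles are hypothetical objects standing in for the services described above.

```python
def on_interim_transcription(interim_text, state):
    # Each interim result from the voice transcription stream is inferred immediately.
    intent, slots, confidence = state.ai_coordinator.infer(interim_text)
    if confidence < state.threshold:
        return                                    # "not confident": wait for further speech

    if not materially_different(slots, state.slots_in_flight):
        return                                    # no material difference in slot data

    if state.query_in_flight is not None:
        state.query_in_flight.cancel()            # discard the query based on the prior partial utterance

    state.slots_in_flight = slots
    state.query_in_flight = state.ces.start_query(intent, slots)   # begin processing the updated query

def materially_different(new_slots, old_slots):
    # A real system may compare only the slots that affect query construction.
    return new_slots != old_slots
```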
  • Figure 12 illustrates a model library, from which models 115 (Figures 1, 2) can be better appreciated.
  • each customer will have a different model, indicated at 1200A-1200n for customers 1205A-1205n.
  • the model update flow depicted in Figure 12 provides continuous improvement of the system's natural language processing by continuously refining the latest-deployed NLP model via customer usage data and feedback. Further, it seeks to allow models to be refined across a plurality of customers while at the same time ensuring that none of a given customer's data and actual queries ever leaves their premises (or, if running online, their data silo).
  • incoming queries 1210 are stored in a dedicated query collection 1215.
  • the queries are also used for online training of the deployed model, designated Model-A, when user feedback is provided.
  • a “tournament” is held to determine a new best model for customer 1200A.
  • the participant models are the local improved models A1, A2, etc., plus models from model library 1225, which can comprise models collected from other customers, for example models Model-B3, Model-C2, and Model-Zx.
  • the specific tournament “rules” could vary depending upon the implementation, although in at least some embodiments the main determining factor would be each model's test result against queries from query collection 1215 from customer 1200A. Similarly, a test for a new model from customer 1200B would run tests against queries from that customer’s query collection.
  • Model library 1225 is managed by the model library service 1205.
  • the model library 1225 keeps track of all “tournament winners” as well as the metadata associated with those models.
  • the model library service also manages selection of models to be tested in a tournament, as shown at 1230.
  • models selected for such tournaments can be based at least in part on model novelty to the customer.
  • one approach is to prioritize models that have not been compared to existing models for a particular customer.
  • models selected for a given tournament can also be selected based on model popularity.
  • another selection criterion is model fitness gain, where a model that has high fitness should likely do well with a new customer.
  • a further criterion is model customer size, since a customer with a high user count should be more representative of the problem domain and a model that works well for such a customer has a comparatively high probability of working well for another customer.
  • Models in the model library 1225 that are not getting selected for tournament participation, or are not winning such tournaments, or are not deployed at customer sites can be retired from the model library after, for example, a threshold period of time, or according to other convenient criteria.
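A simplified tournament sketch is shown below; the model and query objects are hypothetical, and the scoring shown is plain accuracy standing in for whatever fitness measure an implementation adopts. The key property is that the customer's own query collection serves as the test set, so it never leaves the customer's premises or data silo.

```python
def run_tournament(candidate_models, query_collection):
    """Return the best-scoring model among local improved models and library models."""
    best_model, best_score = None, float("-inf")
    for model in candidate_models:
        score = evaluate(model, query_collection)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score

def evaluate(model, query_collection):
    # Accuracy against the customer's own collected queries; a real system might
    # weight recency, intent coverage, or per-slot correctness instead.
    correct = sum(1 for q in query_collection if model.predict(q.text) == q.expected_intent)
    return correct / max(len(query_collection), 1)
```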
  • the method of Figure 13 allows a user having an “on premises” installation or a private cloud to update its Al models without exposing customer query data outside of the customer network.
  • the system is designed to have a current base model that is the last "updated" and "qualified" model.
  • "updated" means the model has completed incremental training at the last customer installation that signaled some threshold of new query data was available and requested the base model for updating.
  • "qualified" means a given new model has shown an improvement in terms of F-1 score or a similar metric, and that the new model has also shown improvement as measured by its success rate against a predetermined "golden" test set. To avoid exposure of customer data or queries, such testing can be performed on customer premises.
  • the base model is updated and any future requests for the base model from an on-premises implementation will receive the new base model.
  • the on-premises installation will update its local base model with the core base model hosted by the vendor so that customers take advantage of the aggregate training data of all customers.
  • a query collector signals a vendor network that it is ready for a training update.
  • the vendor's base model manager sends the latest base model to the on-premises training service so that a new model can be trained and evaluated.
  • the base model is incrementally trained with newly collected queries, shown at 1315, and the updated model is sent out of the client network into the Vendor’s network at 1320.
  • a check is then made by the model manager at 1325 to determine whether the new model performs better in terms of its F-1 score and its score against the "golden" test set. If better, the new model is established as the new base model, step 1335; if not, the process ends, 1330.
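A sketch of that qualification check follows; the model interface and query fields are hypothetical, and macro-averaged F-1 is used here purely as an example of the "F-1 score or a similar metric" mentioned above.

```python
from collections import defaultdict

def success_rate(model, test_set):
    hits = sum(1 for q in test_set if model.predict(q.text) == q.expected_intent)
    return hits / max(len(test_set), 1)

def macro_f1(model, queries):
    # Macro-averaged F-1 over intents, computed from the model's predictions.
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for q in queries:
        pred = model.predict(q.text)
        if pred == q.expected_intent:
            tp[pred] += 1
        else:
            fp[pred] += 1
            fn[q.expected_intent] += 1
    scores = []
    for intent in set(tp) | set(fn):
        p = tp[intent] / (tp[intent] + fp[intent]) if tp[intent] + fp[intent] else 0.0
        r = tp[intent] / (tp[intent] + fn[intent]) if tp[intent] + fn[intent] else 0.0
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(scores) / max(len(scores), 1)

def qualify(candidate, current_base, validation_queries, golden_test_set):
    # The candidate becomes the new base model only if it improves on both measures.
    return (macro_f1(candidate, validation_queries) > macro_f1(current_base, validation_queries)
            and success_rate(candidate, golden_test_set) > success_rate(current_base, golden_test_set))
```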
  • Figures 14A-14B illustrate hardware and software components for performing gaze tracking in accordance with an embodiment of the invention.
  • This refers to a system for determining what (on a visual medium - TV screen, etc.) the user is pointing or looking at.
  • a gaze tracking device exhibiting context specific (meeting room, large presentation space, personal office, etc.) specifications would be employed. These specifications at a top level would include precision, accuracy, and speed, and at a more granular level would include, but not be limited to, sampling frequency, spectrum operation, angle of operation, pixel density and depth, among others.
  • a main consideration in specifying a specific gaze tracking implementation is to maximize gaze accuracy over a particular range given a particular context.
  • the visual medium observed by the gaze tracking device is subdivided into a grid, with an internal inventory of what objects, along with their visual characteristics, are in each grid sector.
  • Examples of gaze tracking devices include Tobii's offerings, such as the Tobii Pro Spectrum, Fusion and Nano; Tobii also sells a pair of glasses that will track the wearer's gaze.
  • Competitive offerings are available from EyeLink, SmartEye, SMI and others. Some considerations in choosing a specific implementation include accuracy, speed and cost. Some solutions typically used in research have very high sampling frequencies but are quite expensive, and in at least some implementations of the invention such performance metrics are unnecessary, allowing some cost savings.
  • Key metrics for selecting a device would likely include accuracy, since there is a limit to how small on-screen items can usefully be from a productivity perspective, and processing speed high enough that verbal instructions are executed quickly as compared to a human with a finger and a keyboard, which sets a quickly met upper bound on how long verbal instructions should take to execute. Cost is also a practical consideration, since there may be one unit per personal office.
  • a gaze tracking device 1400 provides its input into an intermediary device such as shown in Figure 3, or other computer, which in turn supplies its output to a system location either in the cloud or on premises, as shown at 1410.
  • a screen grid object content utterance mapping list, indicated at 1415, provides the correlation between the gridded screen and the captured gaze.
  • Referring to FIG. 14B, an embodiment of gaze tracking to identify a "this" in a spoken utterance can be better understood.
  • An utterance is received from a user at 1425, and an attempt to identify the referred-to object is made by searching the type and characteristics in the screen grid object mapping list 1415. If only one object is found, the process advances by sending that object ID to the system for parsing of any other portion of the utterance, as shown at 1435, and then the process ends. If, however, the result at step 1430 is either no match or too many matches, the process advances to step 1445 by querying the display driver for screen resolution and dimensions. Based on that result, at step 1450 the screen is divided into a 2D array with a plurality of sectors.
  • each sector is searched using the mapping list 1415. Again, if one object is found, the process advances back to step 1435, after which the process ends. If, again, many objects are found, the sector size is reduced and the sectors are again searched, with the process being iterated until only one object is found. If, despite the sector-by-sector examination, no object is found, the user can be asked for clarification as shown at 1465, after which the process ends at 1470.
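One possible realization of the sector search is sketched below in Python; the mapping-list interface and the grid-subdivision strategy are assumptions, not a prescribed implementation.

```python
def resolve_gazed_object(gaze_point, mapping_list, screen_w, screen_h,
                         start_div=2, max_div=32):
    """Shrink grid sectors around the gaze point until exactly one candidate remains."""
    divisions = start_div
    while divisions <= max_div:
        sector_w, sector_h = screen_w / divisions, screen_h / divisions
        col = min(int(gaze_point[0] // sector_w), divisions - 1)
        row = min(int(gaze_point[1] // sector_h), divisions - 1)
        sector = (col * sector_w, row * sector_h, sector_w, sector_h)

        candidates = mapping_list.objects_in(sector)   # objects whose bounds intersect this sector
        if len(candidates) == 1:
            return candidates[0]                       # object ID passed on for utterance parsing
        if not candidates:
            return None                                # no match: ask the user for clarification
        divisions *= 2                                 # too many matches: reduce the sector size
    return None
```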
  • Figures 14C-14D illustrate a time-of-flight based alternative (or supplement, in some embodiments) to the embodiment of Figures 14A-14B.
  • the time-of-flight hardware components comprise a 3D time-of-flight camera array that covers substantially 360 degrees of view horizontally and a suitable view vertically, such as 70 degrees, shown at 1472.
  • the output of the array is provided to an intermediary device as shown at 1474, and the output of the intermediary device is in turn provided to an intermediary system 1476.
  • an image database trained with IR/3D time-of-flight images of extended arms/fingers and visual media supplies its data to an Al algorithm using a convolutional neural network (CNN) 1480.
  • the system receives an utterance at 1484, and the boundary of the visual display device, such as a TV, computer monitor, projector, or other display device is identified at 1486.
  • the angle of the visual medium relative to a presumed flat backplane is calculated at 1488, permitting the location of a finger to be performed at 1490.
  • the finger angle relative to the presumed-flat backplane is calculated, 1492, permitting the projected intersection of the finger and the visual medium to be extrapolated at 1494.
  • An attempt is made to identify the referred-to object through the use of positional mapping of the UI to objects, step 1496, after which the object ID and utterance are provided to the system at 1498.
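The extrapolation at 1490-1494 is, geometrically, a ray/plane intersection; the sketch below shows that computation with hypothetical coordinates (in practice the inputs would come from the time-of-flight array and the calculated display angle).

```python
import numpy as np

def pointing_intersection(finger_tip, finger_dir, screen_point, screen_normal):
    """Return the 3D point where the pointing ray meets the display plane, or None."""
    finger_dir = finger_dir / np.linalg.norm(finger_dir)
    denom = np.dot(screen_normal, finger_dir)
    if abs(denom) < 1e-6:
        return None                                   # ray is parallel to the display plane
    t = np.dot(screen_normal, screen_point - finger_tip) / denom
    if t < 0:
        return None                                   # display is behind the pointing finger
    return finger_tip + t * finger_dir

# Example with purely illustrative coordinates (meters):
hit = pointing_intersection(
    finger_tip=np.array([1.0, 0.5, 1.2]),
    finger_dir=np.array([-0.2, 0.1, 1.0]),
    screen_point=np.array([0.0, 0.0, 3.0]),
    screen_normal=np.array([0.0, 0.0, -1.0]),
)
```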
  • Figure 15A, directed to passive mode permissions, begins with one or more attendees entering an empty room, shown at 1503.
  • the permissions level is automatically set to the level associated with the most senior person identified as being in attendance in the room, step 1506, and continues at that level until that person leaves the room, step 1509. If that senior person verbally authorizes continuation (or extension for a period of time) of their level of permissions, that level continues, step 1512 until either the set time elapses or the meeting otherwise ends.
  • the system automatically drops the permissions level to that of the next most senior person in attendance, shown at 1515. If the meeting goes over the scheduled time, such that others enter the room before the meeting ends, the permissions level is automatically and dynamically set to the level of the most senior person in the room at a given time, steps 1518-1527. Eventually the room is empty, and the permissions level is set to the lowest setting, step 1530. Setting permissions to allow access to only part of otherwise available data can also provide desirable security against unintentional exposure of sensitive data.
  • Figure 15B shows a process for active mode permissions, and relies on having at least the most senior person activate their permissions level by the use of a fingerprint scanner or other biometric device, or a keypad or other device requiring an affirmative action. Again levels automatically drop once that most senior person leaves the room, absent authorization for continuance of their permissions level, again by use of a fingerprint scanner, retinal scanner, or other biometric device, keypad, or similar.
  • the process of Figure 15B can be seen to be analogous to that of Figure 15A in all other respects and those steps are not repeated in the interest of simplicity.
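A minimal sketch of the passive-mode level selection is given below; the permission levels, the continuation object, and the numeric scale are hypothetical.

```python
import time
from dataclasses import dataclass

LOWEST_LEVEL = 0

@dataclass
class Continuation:
    """An authorization by a departing senior attendee to continue at their level."""
    level: int
    expires_at: float            # epoch seconds

    def active(self):
        return time.time() < self.expires_at

def effective_permissions(attendee_levels, continuation=None):
    """attendee_levels: permission levels of everyone currently identified in the room.
    The effective level tracks the most senior attendee present, drops dynamically as
    attendees leave, and falls back to the lowest setting when the room is empty."""
    current = max(attendee_levels, default=LOWEST_LEVEL)
    if continuation is not None and continuation.active():
        current = max(current, continuation.level)
    return current
```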
  • the number of queries to be executed is determined at 1610 by looking, for example, at the slot type with the most occurrences out of slot types that inform on what “table” to reference. Each slot is then examined, step 1615. If there are no more slots to be examined, the process ends at 1620.
  • at step 1625 a determination is made whether the required composition for performing a single inquiry yet exists. If yes, a new query object is formed at 1630 and the process loops to step 1615. If not, the process advances to step 1635, where a determination is made whether the slot type appears only once and the current pass is the first pass in the examination of slots. If yes, the process advances to step 1640, where the slot is duplicated as many times as there are queries (e.g., a search for revenue for multiple years), and the process advances to step 1645.
  • If the result at step 1635 was a no, the process jumps to step 1645. At 1645, a determination is made whether the composition is missing this slot type AND that slot type has not already been assigned to a query, in which case the slot is assigned to the query and the process advances to step 1650, where the query's composition is updated for the newly-assigned slot. The process then loops to step 1615 and continues until the result at 1615 is a yes. A simplified sketch of this decomposition follows.
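The sketch below illustrates the decomposition in simplified form; the slot representation is hypothetical, and the query count is taken from the most frequent slot type overall rather than only from table-informing slot types, purely to keep the example short.

```python
from collections import defaultdict

def decompose(slots):
    """slots: list of (slot_type, value) pairs from the AI inference engine."""
    by_type = defaultdict(list)
    for slot_type, value in slots:
        by_type[slot_type].append(value)

    # Number of implied machine queries, taken from the slot type with the most occurrences.
    n_queries = max(len(values) for values in by_type.values())
    compositions = [dict() for _ in range(n_queries)]

    for slot_type, values in by_type.items():
        if len(values) == 1:
            values = values * n_queries            # duplicate a singleton slot across all queries
        for composition, value in zip(compositions, values):
            composition[slot_type] = value
    return compositions

# e.g. "show revenue for 2019, 2020 and 2021"
# decompose([("metric", "revenue"), ("year", "2019"), ("year", "2020"), ("year", "2021")])
# -> [{"metric": "revenue", "year": "2019"},
#     {"metric": "revenue", "year": "2020"},
#     {"metric": "revenue", "year": "2021"}]
```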

Abstract

Systems and methods for a platform independent Al voice assistant configured to contextualize a meeting space to identify, automatically and concurrently, the source of a query or command spoken within the meeting space and to analyze the query or command, including disambiguation of pronouns and context-dependent terms, to cause relevant data to be provided, automatically, in response. Gaze tracking and analysis and access authorization are provided in some embodiments. Customer-specific models permit system training and operation without customer data leaving the customer's premises or data silo. An intermediary device provides an audio endpoint for interfacing with meeting participants and enabling data visualization. In an embodiment, at least some software aspects are configured to operate as a Single Page Application front end. In other embodiments, an iterative look-ahead technique is implemented to minimize system latency.

Description

Systems and Methods for Query Source Identification and Response
RELATED APPLICATION
[001] This application claims the benefit of U.S. Patent Application S.N. 63/271,403 filed October 25, 2021, which is incorporated herein as set forth in full.
FIELD OF THE INVENTION
[002] In one aspect, the present invention relates generally to the contextual ization of a meeting space for the purpose of identifying the source of an audible query originating within the space. In another aspect, the present invention relates generally to the analysis of an audible query to facilitate developing a response thereto and more particularly relates to systems and methods for analyzing an audible query originated in a contextualized space where identification of the query source grants access rights to data in accordance with the permissions associated with the query source.
BACKGROUND OF THE INVENTION
[003] Most decision-making processes, whether corporate, governmental, NGO, or personal, rely on access to reliable data. However, when even modestly complex issues require a discussion among multiple meeting attendees, all too often the discussion stalls because the facts surrounding a key issue are not immediately available. Frequently, the result is that a decision is not made or is made without the benefit of the missing information.
[004] As internet access to data has become ubiquitous, one common approach for such meetings is to assign to one or more individual meeting attendees the task of executing internet or corporate network searches as issues develop where the information is not available. This approach, while helpful in some instances, suffers from significant limitations. One is that the attendee responsible for executing a query has to assimilate the entire query before executing the search, resulting in delays that cause the meeting to lose momentum. Further, the attendee attempting to execute the search may not have permission to access the most relevant data, potentially causing the decision makers to rely on inaccurate or incomplete data. Yet further, this responsible attendee cannot fully participate in the meeting while performing this task, which detracts from the maximum benefit of their attending the meeting. Additionally, having this responsibility is likely not a formal part of their role.
[005] Numerous speech recognition applications exist, and such applications can be used in some instances. However, many conventional speech recognition systems cannot comprehend complex queries or understand ambiguities such as pronouns “we”, “everyone”, “this”. Further, conventional speech recognition systems typically are incapable of distinguishing among multiple potential sources and cannot identify source in a manner sufficient to determine what data access is appropriate for a given query. For example, in a meeting with numerous attendees, for example a dozen or more, voice recognition alone is insufficient to identify source.
[006] As a result, there has been a need for an automated data retrieval and command execution system and method by which meeting attendees can be associated with a given query such that the process of generating an automated response accesses the most relevant data to which the query source is permitted to have access and a system whereby ambiguous references to persons and objects in the room are understood.
SUMMARY OF THE INVENTION
[007] The present invention overcomes many of the limitations of the prior art by providing, in an embodiment, systems and methods for a platform independent Al voice assistant configured to contextualize a meeting space to support, automatically and concurrently, a wide variety of meeting participants by identifying, through audio, visual, time of flight and other means, the source of a spoken query or command (sometimes generalized as “query” hereinafter for simplicity) and analyzing the query to cause relevant data to be provided, automatically, in response to the query or some automated task to be performed, for example a digital task such as retrieving a digital file.
[008] In various embodiments, the invention can be run in a cloud infrastructure, typically virtualized, or can be run in a customer deployment where proprietary or confidential data is to be accessed or where lower latency is desired. In an embodiment, the invention comprises a Single Page Application (“SPA”) front end that can run in any browser, thus permitting it to run on a smartphone, tablet, PC, WebTV, Smart Speaker or similar. [009] An aspect of the invention comprises, in at least some embodiments, an intermediary device that acts as an audio endpoint for interfacing with meeting participants via voice and further interfacing with audiovisual devices that expand the audio capability and enable data visualization and other visual functionalities, in some instances via a web browser running on the device. By means of such features, the invention automatically identifies the source of a spoken query accurately and substantially in real-time, and further gleans context including the disambiguation of pronouns such as “me”, “us”, “I”, and similarly context-dependent terms such as “this”, “that”, and so on, enabling accurate analysis of the query and facilitating development of the desired response. In an embodiment, physical gestures, for example pointing, are also detected and correlated with the spoken words to assist in assigning proper context or meaning to query. In some embodiments, additional physical world characteristics are observed and contextualized such as various gas levels (CO2, etc...), ambient noise levels, lighting levels, the presence and relative location of AV equipment, the presence and location of whiteboards, chalkboards or glassboards, and the gestures humans make while speaking that can be correlated with their spoken queries or commands. This physical world data is then correlated with digital world data to perform one or more complex actions such as executing a command to “send this document to everyone in the meeting” in response to a person pointing at a document on the projection screen.
[0010] In some embodiments, the system comprises a customer model that establishes, for each attendee, permitted levels of data access and maintains a user profile that assists in the accurate interpretation of that attendee's spoken words. Further, in some embodiments, natural language processing is used to develop and iteratively train the profile without user data, or the user query as uttered, ever leaving the user's premises or similar data silo. In at least some embodiments, and for at least some groups of attendees, such iterative training can eventually increase the accuracy of source identification through voice only, such that the use of other identification functionalities of the invention becomes less necessary or unnecessary.
[0011] In a further aspect of the invention, in some embodiments analysis of a query can be performed using look-ahead techniques. In such embodiments, analysis of a query begins upon detection and transcription of the first substantive word of the query both in terms of identifying source and also identifying responsive data. The analysis iterates with transcription of each additional word or other element of the query (e.g., a gesture) such that source is typically identified before the query is fully articulated and relevant responsive data is accessed within the permission associated with the identified source of the query. [0012] It is therefore one object of at least some embodiments of the invention to provide a system and method for automatic identification of the source of a spoken query within a meeting space.
[0013] It is a further object of at least some embodiments of the invention to provide a system and method that contextualizes a meeting space to permit automatic identification of the source of a query accurately and substantially in real time.
[0014] It is a still further object of at least some embodiments of the invention to contextualize a meeting space and meeting attendees sufficiently to enable disambiguation of imprecise query elements, such as pronouns.
[0015] It is another object of at least some embodiments of the invention to associate with at least some meeting attendees a level of permitted data access. [0016] It is yet another object of at least some embodiments of the invention to associate with at least some meeting attendees a user profile to facilitate automatic, rapid and accurate comprehension of each attendee’s spoken queries including that attendee’s use of slang, colloquialisms, or other mannerisms specific to a given attendee.
[0017] The foregoing and other objects and benefits of the present invention can be better appreciated from the following detailed description of the invention, taken in combination with the appended Figures. It will further be appreciated that the details of features of various embodiments of the invention are disclosed hereinafter. It is to be understood that, while a specific feature or set of features may be described only in connection with a given embodiment, such features can be included or excluded from that or any other embodiment described hereinafter, depending upon the needs of a particular implementation of the invention. The description of a feature in connection with only a specific embodiment is for avoidance of redundancy only and is not to be understood as limiting the features of any embodiment of the invention, in terms of either inclusion or exclusion.
THE FIGURES
[0018] Figure 1A shows a meeting space suitable for being contextualized by an embodiment of the present invention.
[0019] Figure 1B shows in process flow form an embodiment of the invention including both source identification and query analysis.
[0020] Figure 1C shows in process flow form an embodiment of the invention including source identification and query analysis using interim transcription to facilitate look-ahead processing.
[0021] Figure 2 illustrates in flow diagram form an embodiment of the source identification process.
[0022] Figure 3 illustrates in block diagram form the system components of an embodiment of the invention.
[0023] Figure 4 illustrates in block diagram form the modular functionalities of an embodiment of the invention.
[0024] Figure 5 illustrates in process flow form a more detailed description of an embodiment of a method for identifying the query source in accordance with an aspect of the invention.
[0025] Figure 6 illustrates an embodiment of the voice transcription module of Figure 4.
[0026] Figures 7A-7B illustrate an embodiment of the Al Coordinator/Inference module of Figure 4.
[0027] Figure 8A illustrates an embodiment of the Command Execution module of Figure 4.
[0028] Figure 8B illustrates an embodiment of the Data Preparation Manager of Figure 8A.
[0029] Figure 8C illustrates an alternative embodiment of the Data Preparation Manager of Figure 8A.
[0030] Figure 9 illustrates an embodiment of a data structure in accordance with an aspect of the invention.
[0031] Figure 10 illustrates an embodiment of the Dialog Control module of Figure 4.
[0032] Figure 11 illustrates a process flow for latency reduction in accordance with an embodiment of an aspect of the invention.
[0033] Figure 12 illustrates an embodiment of the model library service of an aspect of the invention, and describes a method and system for updating Core Inference Al shown in Figure 4. [0034] Figure 13 illustrates an embodiment of a training flow for updating the customer model in an aspect of the invention.
[0035] Figures 14A-14B illustrate an embodiment of the system elements and a process flow, respectively, for disambiguation of query terms using gaze tracking in accordance with an aspect of the invention.
[0036] Figures 14C-14D illustrate an embodiment of the system elements and a process flow, respectively, for disambiguation of query terms using time of flight (ToF) in accordance with an aspect of the invention.
[0037] Figure 15A illustrates an embodiment of a process flow for passively applying permissions to a contextualized meeting in accordance with an aspect of the invention.
[0038] Figure 15B illustrates an embodiment of a process flow for actively applying permissions to a contextualized meeting in accordance with an aspect of the invention.
[0039] Figure 16 illustrates an embodiment of a process flow for analyzing implied multiple queries in accordance with an aspect of the invention.
[0040] Various aspects and embodiments of the invention are described below with reference to the above-described Figures, wherein like numerical designations denote like elements.
DETAILED DESCRIPTION OF THE INVENTION
[0041] Referring first to Figure 1A, shown therein is a meeting room 10 in which a plurality of attendees 15 are positioned around a table 20. A plurality of sensors 25A-n are connected, either wirelessly or by any other convenient means, to an intermediary device 30 and provide to that intermediary device such audio, visual, ToF, biometric and similar data as will be helpful in identifying the individual attendees so that utterances made by any of the attendees can be correlated with the source of the utterance and any contextual references are understood. While the sensors are shown as distinct in Figure 1A, some or all may be integrated into the housing of intermediary device 30. One of the attendees may be deemed an operator in some embodiments, although in other embodiments the operator may be remote from the meeting space 10. For an embodiment where the operator is within the meeting space, a keyboard 35 is provided for inputting certain commands to the intermediary device 30. [0042] The intermediary device 30 comprises one or more processors that execute processes as described hereinafter, some embodiments of which are shown generally in Figures 1B and 1C. As further described below, in an embodiment the intermediary device then queries data resources appropriate to the person generating the query. A response to the query is then communicated to one or more of the attendees and may, as one example, be displayed on video screen 40.
[0043] Referring next to Figure 1 B, a process flow for an embodiment of the invention including both source identification and query analysis for a meeting occurring in a contextualized meeting space such as that shown in Figure 1A can be better appreciated. More specifically, the process begins at 100 and advances to capturing the entirety of the query at step 105. Typically a meeting comprises a plurality of attendees as shown in Figure 1 A, such that the source of the query can be ambiguous. The source is identified through a disambiguation process at step 110, at least in part through the use of data requested from customer model 115. In at least some contexts, the source identification/ disambiguation process can be thought of as eliminating dissonance in the query. Once the source of the query is identified, the customer model applies the permissions set for that source or, alternatively, set the permissions associated with the most senior attendee, shown at step 120. As discussed in greater detail hereinafter, setting of permissions can be either active or passive, and can occur before a spoken query by identifying attendees visually rather than by voice or by voice identification not comprising a query, or by other means. Setting the permissions determines what data will be accessible for developing a response to the query.
[0044] Once the permissions have been set, the query can be analyzed as shown at step 125. Such analysis is discussed in greater detail below, but involves disambiguation of pronouns, gestures, and other imprecise or ambiguous terms within the query. Such a transcription process yields an understandable query. The process then advances to step 130 for developing a response to the query by accessing a data source 135, again within the limits of any applicable permissions as set at step 120. The response is then presented to the users/attendees at step 140 and the process ends at 145. Presentment of the response can take any form suitable to the meeting space and the attendees, including audio, visual display, and so on. Additionally, the response can take the form of an action, such as retrieving and distributing a digital document, that applies to some or all attendees or any other designated group, e.g. “all of Dept X”.
[0045] Referring next to Figure 1 C, shown therein in process flow form is an embodiment of the invention including source identification and query analysis using interim transcription to facilitate look-ahead processing. Thus, the process of Figure 1C varies from that of Figure 1 B in that attempts to identify source and to begin analyzing the query start with detection of the first word of the query, and iterates with each additional word until (a) the source is identified and (b) the query can be sufficiently disambiguated and transcribed that an appropriate response can be developed. More specifically, the process of Figure 1C starts at 100 and the first word of a query is detected at 105A. An effort is made to identify source at 110, potentially by retrieving information from the customer model 115. In some cases, the first word of the query will be insufficient to identify source, as determined at step 150, so the process loops back to step 105A to detect the next word of the query. Again, disambiguation and transcription are performed on the updated form of the query, and identification of source is attempted. The process continues to iterate until source is identified, at which point permissions can be applied as shown at 120.
[0046] Once appropriate permissions have been applied, the current form of the query can be analyzed as shown at 125A, and a response developed based on that current form, step 130A. However, if more words are detected as part of the query, step 155, the process loops to step 105A and steps 110-135 repeat. Alternatively, and depending upon the level of confidence that the source of the query has been accurately identified, the process can bypass steps 110- 120 and jump to step 125A. Analysis of the query and development of an interim response continues until there are no more words in the query, yielding a “no” at step 155, whereupon the response from step 130A is provided to the users/attendees at step 140 and the process ends at 145. As a still further alternative, also discussed in greater detail hereinafter, in an embodiment the permissions for a given meeting can be set in accordance with the permissions associated with the most senior attendee when there are no concerns about anyone in the room getting even momentary access to information retrieved. Alternatively, permissions for a given meeting can be set based on the most junior attendee if the desire is for information retrieved to be restricted so as not to expose it to the more junior attendees. Depending upon the embodiment, the system can dynamically apply these permissions based on the ability of the system to automatically identify meeting attendees together with the selected permissions mode. Depending upon the embodiment, the permissions mode can be preset, for example by an administrator, or can be implemented or modified at any time during the meeting via a voice or other command combined with suitable authentication such as an active form of identification, for example the reading of a presented fingerprint, an iris scan, or other similar techniques. In the above case, permissions are based on the meeting attendee or attendees and mode rather than associated with the query source, in which case step 120 is effectively removed from the process and steps 125A et seq are performed for each iteration without regard to identification of query source.
[0047] As yet a further alternative, because the query response is developed at step 130A but not provided to any user/attendee until step 140, the application of the permissions step can be relocated either to between steps 125A and 130A or to between steps 155 and 140, such that the query response is being developed regardless of permissions but only displayed once permissions have been applied and the data matched to the applicable permissions. It will be understood that, in the latter case, another iteration of steps 130A-135 may be required to triage data from the interim response should the permissions associated with the query source not be entitled to access some of the data available from data source 135.
[0048] With reference next to Figure 2, an embodiment of a process for disambiguating/identifying the source of the query, shown generally at 110 in Figures 1 B-1C, can be appreciated in greater detail. The process indicated generally at 200 permits identifying a query source from among a plurality of attendees or speakers in a meeting room where voice authentication technology alone is inadequate. Voice authentication can be effective for meetings involving a small number of attendees, where voice can be distinguished from within a small reference set. However, the certainty of an accurate match degrades rapidly as the number of attendees increases. For example, in a meeting with 15 people, the voice authentication success rate is typically too low to be practical and so different methods are required. For purposes of the present discussion, and without limitation of the invention, it will be assumed that more than five attendees at a meeting will mean that a high confidence match cannot be achieved solely from voice authentication technology. However, as discussed below in connection with training, over time voice authentication will improve such that, where a relatively finite universe of speakers is involved, voice recognition alone may be sufficient to yield a high confidence match. For now, for a new system with only limited training, voice authentication alone frequently does not yield satisfactory results where the number of attendees exceeds a low threshold. [0049] Thus, an object of this aspect of the present invention is to supplement voice authentication with other technologies to eliminate potential misidentification of the query source. Various technologies used to help eliminate potential identities include but are not limited to:
(a) Wireless identity signaling devices such as wearables, for example the smart rings provided by Proxy™ or, previously, Motiv™, or other wearable biometric devices as well as natively managed signals via protocols such as BT/BTLE, WiFi, NFC, UWB or other third party proprietary signals.
(b) Facial Recognition Technology, where the specific implementation is chosen by balancing considerations of accuracy, speed, angular accuracy and security, can be basic RGB camera image matching, or more advanced and secure technologies such as IR dot grid technologies or ToF- based devices.
(c) Voice Authentication Technology, which is distinct from voice recognition technology in that recognition technology, simply converts spoken language into text with no regard for who the speaker is. Voice Authentication Technology on the other hand is able to identify the speaker themselves from just their voice. Traditional authentication technology focused on being able to work with a large reference set (one to many) and so had to limit the authentication being done on a single and specific utterance. By constraining to a small reference set (one to few) any utterance can be supported.
[0050] With the foregoing in mind, the process 200 can be better understood. The process begins at step 205 and, at step 210, a user command or inquiry, typically audible, is issued either actively or passively. The voice is then compared with user profiles from the customer model 115 to determine whether the user or query source can be identified solely from voice recognition. If yes, the identification of the source is forwarded to the system client at step 220, along with the total relevant utterance that comprises the query or command. In some embodiments, the voice authentication Al is trained on the user’s voice uttering a “wake” word or phrase, as present day voice authentication technology typically operates off a fixed phrase. In other embodiments, voice authentication can be based on other training approaches, for example using larger speech libraries, such that enunciation of a fixed word or phrase is not necessary.
[0051] If voice alone does not permit identification of the query source, resulting in a NO from step 215, the location of the voice within the meeting space is detected as discussed hereinafter, and a determination is made as to whether the voice location associated with the query is proximate to only one face, indicated at step 225. If so, a yes results and the identity of the query source is ported to step 220 along with the relevant utterance. However, if a NO results at step 225, the process advances to step 230. At step 230, the process initiates a process of elimination by which other potential query sources are eliminated until only one unmatched potential source remains. More specifically, in comparing identity signals matching the location of people identified either by facial recognition or by the origin of their voice, some candidates can be eliminated from the potential speakers, i.e., all people in the room. For example, if an identity signal location is detected, and a person at that location is also detected (especially if the facial recognition process is certain that a person is being detected, but uncertain who that person is) yet the voice location is elsewhere, that person can be eliminated as a candidate for the voice identity. Stated more broadly, it is possible to eliminate anyone recognized by facial recognition but not positioned where the voice originated from. In at least some instances, it is possible to eliminate all others so that there is only one identity signal (e.g., coming from a smartphone on the table while the owner presents at the front of the room) unmatched to a user who is perhaps unseen but definitely heard. By narrowing the potential speakers down to that one smartphone on the table, we know the user at the front of the room who gave the voice command must be the owner of that smartphone with its identity signal. Identification of the source, if successful, is then passed on at 235 to the system client as indicated at 220. In occasional instances, none of the foregoing steps is successful, in which case a No results at step 235. In such an event, the system can generate a request that a given speaker be identified. The request is passed to the block 220, after which the process ends at 245.
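As one illustration only, the broad elimination rule described above (discard anyone recognized by facial recognition but not positioned where the voice originated) might be sketched as follows; the data model, distance units, and proximity threshold are assumptions.

```python
import math

PROXIMITY_M = 1.0    # illustrative proximity threshold, in meters

def near(a, b, threshold=PROXIMITY_M):
    return math.dist(a, b) <= threshold

def identify_speaker(voice_location, faces, identity_signals):
    """faces: {person_id: location of the recognized face};
    identity_signals: {person_id: location of that person's identity signal}."""
    candidates = set(identity_signals)

    # Eliminate anyone whose recognized face is not positioned where the voice originated.
    for person_id, face_loc in faces.items():
        if person_id in candidates and not near(face_loc, voice_location):
            candidates.discard(person_id)

    if len(candidates) == 1:
        return candidates.pop()      # the one unmatched identity signal must belong to the speaker
    return None                      # otherwise, fall back to asking the speaker to identify themselves
```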
[0052] Referring next to Figure 3, an embodiment of the system hardware components can be better appreciated. The embodiment of Figure 3 comprises an intermediary device that, together with various embodiments of the system software described herein, operates to contextualize a meeting room or other space by, on the input side, adding audio, video, wireless, biometric and similar sensing to aid in identifying a person speaking or similar data source and, on the output side, by providing audio, visual and related means for communicating data responsive to the queries or other communications originating from a person speaking or otherwise issuing commands formatted for interpretation by the system. These can include data visualization, including but not limited to communication through a web browser, or synthetically generated audio, text, or visual displays as briefly discussed above in connection with Figure 1A.
[0053] Thus, the exemplary intermediary device of Figure 3 comprises a microcontroller 300 or other suitable processor which communicates through a client 305 with an array of sensors and related devices as shown generally in Figure 1A as 25A-25n and more specifically in Figure 3 as 310A-310n. The microcontroller 300, which may for example be an ARM A7 class microcontroller, typically operates under the control of software compatible with an operating system such as, for example, Linux, although numerous other operating systems are acceptable depending upon the embodiment. An array of other devices 315A-315n provides user interface functions and support functions. The controller 300 cooperates with one or more I/O ports, examples of which are shown at 320. Further, the controller 300 cooperates with storage in the form of, for example, main memory 325, typically volatile, and persistent memory 330, for example a flash device.
[0054] In an embodiment, sensors and related devices 310A-310n comprise an ultra-wideband (UWB) controller with associated antenna array 335, a Bluetooth Low Energy (BTLE) controller with associated antenna array 340, a Time of Flight (ToF) camera/laser array, a proximity sensor, an Ethernet controller, an RGB and/or IR Camera Array, a WiFi controller with antenna, a microphone array, and a fingerprint sensor. While not pictured, additional physical contextualization sensors can be included including but not limited to gas level sensors, noise level sensors, humidity sensors, pressure sensors, proximity sensors and light level sensors. It will be understood that the number and type of sensors and related controllers is dependent upon the particular implementation and which such sensors are employed in such an implementation is a matter of design choice. The ToF Array can be a multipoint camera array, for example five cameras although the actual number can be less or more depending upon the meeting space, and provides 360 degree visual coverage, in some instances including detection of IR imagery, to permit identification not only of attendees, but also to permit extrapolation of a finger including at least the angle of a pointing finger upon suitable Al training. With such training, the camera array permits fingertips and visual media to be recognized in real time where the distance data of the relevant pixels permits determination of angles relative to a backplane, facilitating association of a pointing finger to a subject of interest being displayed. A proximity sensor performs blunt object detection, to determine whether anyone is in the room, and can be configured for “power-on on approach” or “power-off on departure” for both security and power savings. In some embodiments, an RGB- IR camera can be configured to detect IR emissions from attendees, for example the heat from an attendee’s hand or finger to supplement pointing, after adjustment for ambient heat. RGB imagery can be provided, either alone or with IR imagery, to facilitate use of a structured light facial recognition solution which can be more secure and reliable than simple RGB facial recognition.
[0055] Similarly, in an exemplary implementation, devices 315A-315n can comprise one or more of a digital signal processor (DSP) 315A, an Al acceleration chip such as a specialized GPU or ASIC, button controls, speakers, and LED indicators. For example, the LED’s and speakers can provide visual and audio from the device to the meeting attendees to provide responses to queries or commands. The button controls provider user control of functions that can involve explicit tactile or physical user interaction, such as volume or power. An Al acceleration chip, potentially integrated into the microcontroller 300, permits the system to perform certain tasks with low latency, such as responses to commands such as moving through a slide deck, jumping to a particular slide or other document or performing voice, face and gesture recognition in real-time. A DSP, also potentially integrated into the microcontroller, permits curating of incoming audio to provide high levels of accuracy in interpreting incoming audio signals (i.e. , voice/utterances).
[0056] The IR Camera array can be employed along with gaze tracking algorithms to provide the context of where a meeting participant is looking. This information can be utilized in a variety of ways such as generating metrics on meeting attendee focus and degree of participation, or correlated with other sensor inputs to more definitively determine the position of suspected objects (Whiteboard, TV, Projection Screen, etc...) or to gain awareness whether people are looking at each other or at a specific person to help in identifying to whom an utterance is directed, e.g. “can you take care of that”? .
[0057] A fingerprint sensor 318A can be provided to permit active management of security permissions during a meeting, such as when sensitive documents or other materials are being displayed, and serves as an affirmatively physical alternative to other forms of recognition of meeting attendees, as described above. Because the system of the invention is contextually aware of a current meeting, including the identities of the attendees and their fingerprint, the permissions level can be automatically adjusted to match the permissions level of the most senior person currently in attendance, although in some embodiments such a senior attendee can also authorize, verbally or by any convenient means, the meeting to continue at that person’s permissions level even after they depart the meeting. Absent such authorization, the permissions level for a given meeting drops dynamically as more senior attendees depart. If all attendees depart, the permissions level can be set to secure automatically the data from further view, as discussed further in connection with Figures 15A-15B hereinafter. A multipoint microphone array 318n, for example five microphones although the exact number can vary depending upon the meeting space, provide detection of angle of arrival and position detection for any utterances made during the meeting, thus assisting in determining the identity of the speaker and reducing the reference set of possible speakers for voice authentication. The I/O ports 320 can, for example, comprise one or more USB ports, one or more Ethernet ports, HDMI input and output connections, one or more wireless displays, and one or more of each of WiFi, ultra wide-band (UWB) and Bluetooth Low Energy (BTLE) interfaces. The WiFi, UWB and BTLE interfaces can provide recognition of attendees’ devices such as smartphones or tags, although WiFi can also be used for connection to a network if more suitable than a wired connection.
[0058] It will be appreciated by those skilled in the art that a given implementation may use any combination of sensors, I/O’s, etc., from among those shown in Figure 3 at 310, 315, and 320, and in some implementations may not use any of a given group.
[0059] Figure 4 illustrates in block diagram form the system of the invention that resides on the other side of the Client, running in the operating system executing on the intermediary device of Figure 3, and identified generally at 400. More specifically, Figure 4 illustrates a plurality of modular functionalities that comprise an embodiment of the invention, and further denotes where such functionalities physically reside, typically in either a virtual or a physical server. It will be apparent to those skilled in the art that, in some embodiments, certain functionalities may be incorporated into the intermediary device of Figure 3. In operation, and as explained in greater detail hereinafter, system 400 performs speech to text transcription, inference of material information from the spoken utterance, and processing of that material information into a backend database query either for datasources that are well defined (e.g., Quickbooks, SalesForce.com or other CRM application, NetSuite, Multiple Listing Service, etc.) or data sources that are weakly defined (e.g., SQL, Oracle, Redshift, Snowflake, etc.) Additionally, system 400 can interface with non-data sources such as API’s for managing task commands or queries, for example PowerPoint/Excel/Word API’s for navigating and editing documents via voice, or Outlook or similar API’s for managing calendars. Finally, the system 400 can perform wholly selfcontrolled functions such as note taking, recording or executing action items, or administrative tasks such as reservations or food orders.
[0060] More specifically, in an embodiment the User Interface 405 can be hosted on the intermediary device 30 with or without a visual medium connected. In embodiments that do not include a visual medium, some features such as data visualization would be limited. In an embodiment, the user interface 405 is a single page application running in a web browser and is therefore platform independent. As such, it can be run on any device can run a fully featured browser such as a laptop, desktop, smartphone, projector, webTV, Smart Speaker (or other device plugged into the intermediary device of Figure 3), a tablet, or any other suitable computing device. The User Interface is primarily responsible for collecting the spoken utterances of users and transmitting into air the audible spoken responses generated by the present invention. Such spoken responses can be either answers or clarifying questions. Further, the user interface transmits visual responses to a connected visual medium (if implemented) for display to the user(s)/attendees. At the request of the client, through the user interface 405, the display data service 410 retrieves the data blob (“binary large object) comprising a query result from the relevant database. The client is notified via a message bus 430 that the data is ready, at which point a request for the data is made to that service. [0061] Authentication service 415 verifies user account and product access and manages connector-specific token retrieval, update and verification. Configuration service 420 manages user configuration changes communicated from the Client (the User Interface) by updating the Operational Database 455 accordingly. A voice transcription service 425, described in greater detail in connection with Figure 6, is responsible for receiving the incoming stream of audio data from any client via voice input microservice 425A, converting that stream (e.g., audio) to a common format supported by the voice transcription engine 425B, receiving the transcription back (e.g., text) and dropping a message on the message bus 430 that the transcription is complete.. In an alternative embodiment which may reduce latency, the client sends the audio stream directly to the transcription service, gets the transcription back and places the completed transcription on the bus itself. In a yet additional embodiment, the client can do the transcription locally using the client resources and drop the transcription onto the message bus. The voice transcription service also configures the voice transcription engine appropriately for a given client. For example, some clients can run as embedded browsers as is the case for Microsoft Powerpoint control, in which case the present invention can run in an embedded browser inside of PowerPoint. This allows client type to be known, allowing the transcription engine to be optimized for that client and its associated queries/commands. The voice transcription service communicates with other services in system 400 via message bus 430.
[0062] Al Coordinator 435 feeds the transcribed utterance into the correct Al inference engine indicated generally at 440. In at least some embodiments, the invention comprises multiple Al Inference engines 440A-n, each trained on a specific scope of understanding in order to increase accuracy in developing an inference by limiting how many individual slots and intents a given engine must attempt to discern among. In the present context, an intent represents an action that fulfills a user's spoken request. Intents can optionally have arguments called slots. When other knowledge (such as client type) exists that will allow the intent/slots to be split apart categorically, a new Al Inference Engine can be created if desired for the particular implementation. Additionally, in an embodiment, a parent Al that only identifies intent can serve to split the Al categorically into two or more parts. In such an embodiment, the total utterance is passed onto a separate Al that is trained on both Intent and Slots but where the range of potential total utterances is narrower so as to increase accuracy of slot identification. The Al Coordinator also takes the output of the Al Inference Engine (I ntent/Slot data derived from the total transcribed utterance) and calls Command Execution Engine 450 with that data, again by using message bus 430. The various Al Inference Engines 440A-n, each containerized so it can be migrated around the infrastructure, is deployed as its own Web App and thus has essentially independent endpoints. A simple text string input will output an intent, slots and meta-data such as confidence levels. This applies to any of engines 440A-n. In an embodiment, the Al engine itself is Python wrapped in a web framework (currently using Flask) which is inside the virtualized container.
[0063] Command execution service 450, described in greater detail in connection with Figures 8A-8C, turns intent and slot data into an executable query in the relevant one(s) of data sources 460A-n, executing that query on the data source and finally parsing the results into a form that the client can consume. The command execution service 450, as well as the display data service 410, authentication service 415, and configuration service 520 all communicate with operational database 455. A Dialogue Control Service 465 makes available audio playback data, for example an error message, an audible answer, or a clarifying question. In various embodiments, this service either pre-generate certain responses that have no dynamic component, or alternatively receives a dynamically generated text string and uses a text to speech engine to convert that into an audible form. The Data blob that the client gets from the Display Data Service will contain the URI at which the associated audio file will reside such that the client can connect and stream/play that audio. Optionally, the dialogue control service will own the creation of the dynamic string to be converted to audio by taking in just the variable components from another service rather than taking in the whole string already constructed.
[0064] Referring next to Figure 5, a speaker identification process, indicated generally at 500, describes a process for identifying a speaker whose identity may not otherwise be known by eliminating other users until the speaker is isolated. The process begins at 500, where an utterance is made by an unidentified speaker. At step 510 users whose face is recognized and who are within proximity of their signal are eliminated as not being the speaker. At step 515, a determination is made whether the remaining number of unidentified users is small enough that the speaker can be identified through voice recognition. If not, at step 520, those users are eliminated whose face is recognized but are not within proximity of the voice location. Again, step 515 determines whether the remaining constellation of unidentified users is small enough that voice recognition can accurately identify the speaker. If not, at step 525 all users whose face is recognized and not within proximity of the voice location are eliminated. A test is again made at 515 to determine the identity of the speaker. If the result is “no”, the process ends at 530. If “yes”, the process advances at 535 and 540 by sending the identity of the speaker and the total relevant utterance to the client. [0065] Referring next to Figure 6, the voice transcription process described generally above can be appreciated in greater detail. This service takes in the streaming audio, indicated at 600, via a SignaIR Hub 605, which is a form of communications bus that supports real-time streaming. The streaming audio data is placed into an audio buffer 610 which then feeds into an audio convertor 615, for example an FFMPEG suite or other converter suitable for format transcoding, that can convert any supported though unknown audio into a specific format including containerizing the audio in the event the incoming audio lacks header data. After the convertor, the audio is sent into a safety buffer 620 as well as being fed over a WebSocket interface 625 to the speech-to-text transcription engine 635. The safety audio buffer 620 serves as a backup of the real-time audio data in the event the connection to the speech to text engine needs to be reset and streaming must be paused to permit the reset. In the event of such a reset, the partial audio data is discarded and the backup buffer sends the entire utterance after connection reset to ensure normal operation. In an embodiment, it is assumed that no single utterance is longer than 20 seconds and thus the safety buffer 620 retains at least 20 seconds of audio. It will be appreciated by those skilled in the art that the capacity of the safety buffer 620 can be any arbitrary size and varies with the particular implementation. The streaming connection to the speech to text engine may also need to be reset in the event that a given utterance approaches or exceeds the time or size limit supported by a particular service for a continuous connection. Exceeding such a limit can result in closing of a connection and an unplanned interruption to streaming, including an unmanaged slicing of the user utterance if such an utterance was being made when the connection was closed. Again the partial utterance is discarded and the contents of the backup buffer are retransmitted. Finally, in an embodiment, the transcription (speech to text) engine can run a custom language model that is trained by a particular implementation in order to increase accuracy within the scope supported by the particular embodiment.
[0066] In alternative embodiments that implement either a client-direct-to-Transcription-Service connection or a transcription function local to the client, a slightly different method can be used. In these embodiments, the need for audio format translation is eliminated. The audio is stored, for example, in a single 20-second rolling buffer. In an embodiment, a local speech-to-text engine responsive to a wake word or phrase or other triggering signal listens constantly for that triggering signal. The audio stream utilized by this engine is copied directly from the single buffer and does not require a separate buffer. When the wake word or similar trigger is heard, the main audio handling function is notified and a connection to a speech-to-text engine, which can be either local or external depending upon the embodiment, is established if not already existing. Once the connection is ready to accept data, the audio buffer is sent, whereupon the transcription is returned and parsed by the system to capture what came after the trigger signal.
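A minimal sketch of the single rolling buffer and wake-word dispatch described above follows; the 16 kHz/16-bit mono audio format and the wake_word_engine and stt_connection interfaces are assumptions made for the example rather than part of the disclosure.

import collections

SAMPLE_RATE = 16_000                      # assumed 16 kHz, 16-bit mono PCM
BYTES_PER_SECOND = SAMPLE_RATE * 2
BUFFER_SECONDS = 20                       # single 20-second rolling buffer

class RollingAudioBuffer:
    """Retains only the most recent BUFFER_SECONDS of raw audio."""
    def __init__(self) -> None:
        self._buf = collections.deque(maxlen=BUFFER_SECONDS * BYTES_PER_SECOND)

    def append(self, chunk: bytes) -> None:
        self._buf.extend(chunk)           # older bytes fall off automatically

    def snapshot(self) -> bytes:
        return bytes(self._buf)

def on_audio_chunk(chunk: bytes, buffer: RollingAudioBuffer,
                   wake_word_engine, stt_connection):
    """Feed the shared buffer; on a trigger, ship the buffered audio for transcription."""
    buffer.append(chunk)
    if wake_word_engine.detected(chunk):      # local trigger detection (assumed API)
        stt_connection.ensure_open()          # establish the connection if not already existing
        transcript = stt_connection.transcribe(buffer.snapshot())
        return transcript                     # the caller parses out what followed the trigger
    return None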
[0067] Next referring to Figure 7A, the AI Coordinator of Figure 4 can be better understood. The AI Coordinator receives a transcribed query (a text representation of the total spoken utterance). It then references a mapping table 700 that translates specific inputs about where the spoken utterance originated and provides an AI identifier. For example, the input to the mapping table can be an identifier of the client itself, an output of an intent-only AI, or a user identifier, among others - essentially any information that would assist in determining which of the AI inference engines should receive the transcribed query for intent and slot determination. The resultant AI identifier from the mapping table 700 is then input into the Inference Engine Switch 705, where the transcribed query will be directed to the specific inference engine deemed most appropriate in accordance with the mapping list, whereupon the inference engine will return the relevant slot/intent data and a confidence level, indicated at 710. If the confidence level is insufficient relative to a predetermined value, the process stops and the user is provided either an error message or a clarification request. If the confidence level is sufficient, user access validation takes place at 715. User access validation determines whether the user is permitted to have the query processed based on the client type, the intent itself, or other information provided to the AI Coordinator. If user access is confirmed, the intent and slot data are sent to the Command Execution Service; if it is not confirmed, the user is informed via the message bus to the client with a synthesized and visualized error message. Referring next to Figure 7B, the plurality of AI inference engines 750A-n can each be seen to comprise a virtualized container 755A-n, which in turn comprises a web app 760A-n communicating with, typically, a BERT-based AI wrapped in a Python web app, with a unique engine 750 for each category of intents, including one for data visualization and one for front office control. The confidence threshold can be set by any convenient means, where its purpose is, at least in part, to assist in ensuring that no data is improperly disclosed during a presentation or a meeting.
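The routing and gating performed by the AI Coordinator can be summarized in the following sketch; the mapping-table keys, the 0.8 confidence threshold, and the permission check are assumptions chosen for illustration, not values taken from the disclosure.

def user_has_access(origin: dict, intent: str) -> bool:
    # Placeholder access check; a real deployment would consult the authentication
    # and permissions services described earlier.
    return intent in origin.get("permitted_intents", [])

def coordinate(transcribed_query: str, origin: dict,
               mapping_table: dict, engines: dict,
               confidence_threshold: float = 0.8) -> dict:
    """Select an inference engine via the mapping table, then gate on confidence and access."""
    # Any originating information (client id, user id, intent-only AI output, ...) can key
    # the mapping table; a plain dictionary lookup stands in for it here.
    ai_id = mapping_table.get(origin.get("client_id")) or mapping_table.get(origin.get("user_id"))
    engine = engines[ai_id]                           # Inference Engine Switch

    intent, slots, confidence = engine.infer(transcribed_query)
    if confidence < confidence_threshold:
        return {"error": "clarification required"}    # stop and ask the user to rephrase
    if not user_has_access(origin, intent):
        return {"error": "not permitted"}             # reported back over the message bus
    return {"intent": intent, "slots": slots}         # forwarded to the Command Execution Service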
[0068] Figure 8A describes in greater detail an embodiment of the command execution service (“CES” sometimes hereinafter) shown generally in Figure 4. In an embodiment typical of a microservice, the CES comprises a presentation layer 800A, a data preparation manager 800B, and data storage 800C. The presentation layer 800A provides an API definition that enables the user interface via an AI Coordinator, while the data preparation manager 800B, sometimes referred to more generally as the domain layer with two alternative embodiments shown in Figures 8B and 8C, comprises the business logic of the CES. The logic of the data preparation manager generates an output stored in the data storage 800C, sometimes referred to as the data layer. In an embodiment shown in Figure 8B, the presentation layer and the data storage layer can be changed to permit the CES to operate on different platforms without changing the data preparation manager, thus making the present invention platform-agnostic. In general, the data preparation manager 800B comprises a plurality of modules that convert AI-produced slot/intent data into results for the user. The functionalities of the data preparation manager 800B include a Slot Parsing Engine 801 responsible for taking the Key/Value pair format provided by the AI Coordinator in its endpoint call and converting it into data structures capable of being utilized by the CES. It also manages converting a natural language query with multiple implied machine queries into the base set of information for multiple queries, as discussed in connection with Figure 16 herein.
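A sketch of the slot parsing step is shown below; the wire format is an assumption chosen only to make the key/value-to-structure conversion concrete.

from dataclasses import dataclass
from typing import List

@dataclass
class Slot:
    slot_type: str   # category label assigned by the inference engine
    value: str       # the uttered word or phrase

def parse_slots(payload: dict) -> List[Slot]:
    """Convert the key/value pairs supplied by the AI Coordinator into CES-usable structures."""
    # Assumed payload shape: {"intent": "...", "slots": [{"type": "...", "value": "..."}, ...]}
    return [Slot(slot_type=s["type"], value=s["value"]) for s in payload.get("slots", [])]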
The result of the slot parsing engine 801 is provided to an Intent-Command Mapping List 805, which is a many-to-one mapping of incoming intents (as determined by the relevant AI inference engine) to one of a plurality of common command templates, shown at 815. In operation, the selector 810 references the mapping list to make the correct selection of a command template 815. Each of the command templates 815 defines how to build a query if one is required, how to execute it, how to extract the relevant data from the results, and how to package that data for return to the user interface. Once the data is formatted, it is returned to the command template for some final styling parameters, after which it is deposited into the database and a message is sent to the message bus that the response is ready. In at least some embodiments, each intent and connector has associated therewith a method of extraction, formatting and configuring that is specific to that intent and connector, which can include data source type.

[0069] Still with respect to Figure 8B, a Command Template Selector 810 checks the mapping and construction of the required parameters for the command templates 815, as well as instantiating the selected template and executing the template interface, then passes the result to a Selected Command Template step, all as shown at 820. A Data Connector Manager function 830 comprises data structures which provide connector-specific information to the rest of the CES module. The Data Connector Manager function includes a DataMap 835 for well-defined data sources that provides a specific mapping between a spoken utterance and how that utterance should be referenced when 1) building the query and 2) parsing the results. In an embodiment, each well-defined data connector has its own Data Connector Manager and one or more associated DataMaps. The DataMap structure is explained in greater detail in connection with Figure 9.
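The many-to-one intent-to-template mapping and the command template interface described above might be outlined as follows; the intent names and the graph template behavior are illustrative assumptions rather than the patented templates.

class GraphCommandTemplate:
    """A common command template: build the query, execute it, extract, and package."""
    def run(self, slots, connector) -> dict:
        query = connector.build_query(slots)               # how to build the query, if one is required
        raw = connector.execute(query)                     # how to execute it against the data source
        points = [(r.get("x"), r.get("y")) for r in raw]   # extract only the relevant data
        return {"type": "graph", "points": points}         # package for return to the user interface

# Many-to-one mapping of incoming intents to command templates (names are illustrative).
INTENT_COMMAND_MAP = {
    "graph_revenue": GraphCommandTemplate,
    "graph_expenses": GraphCommandTemplate,
    "compare_metrics": GraphCommandTemplate,
}

def select_command_template(intent: str):
    """Command Template Selector: instantiate the template mapped to the incoming intent."""
    return INTENT_COMMAND_MAP[intent]()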
[0070] The Data Connector Manager uses the DataMap to generate a query, 840, and execute it, 845, against its data source on request. It also ensures that any DataSource-required authentication tokens (per the authentication service described above) are valid so that the query will succeed. The query string generator portion 840 of the data connector manager 830 uses a Jargon Manager 860 and jargon loader 865 in building the query string as shown at 840, as in some cases the user will utter jargon and the query will need to know the classification of that jargon in order to properly build itself. The result of the query executor step 845 is provided to the selected command template step 820, which in turn provides its output to a data extractor 850. The data extractor 850 is responsible for taking the full line-item results from the database query and extracting only the relevant portions needed to respond to the original natural language query. In the case of graphing data, for example, the data extractor step 850 would extract X, Y, Z axis information for each record. In the case of a list, it would return a filtered list of records with all or a subset of information for each record, or, in the case of an exact record, just that record. The output of the data extractor step 850 is provided to a data formatter 855, which in turn is responsible for bucketing datapoint results collected in the Data Extractor into buckets that will represent axis ticks in the UI or, in the case of 3D charts, the size of a graphed shape such as may be used in bubble charts or similar visualizations. This includes both axes, potentially defined as continuous or discrete variables (time, amounts, quantities or categories). A jargon manager function 860 identifies the class of a jargon utterance or provides all the potential jargon utterances of a particular class. In well-defined data sources, this information can simply be queried when a user has logged in. In weakly defined data sources, jargon needs to be parsed out of all of the data in the data source. The Jargon Manager is utilized in query generation and in Data Formatting. In the Data Formatting case, when the buckets (per the natural language query) are by some category, the categories are determined by requesting them from the jargon manager by their category label (e.g., customers, vendors, classes, states, regions, etc.). In an embodiment, a Jargon Loader 865 operates as a sub-component of the Jargon Manager for creating the jargon mappings at login, to enable rapid access during a query. In an embodiment, the final output is provided from the selected command template 820 following completion of data formatting.
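The data extractor and data formatter steps can be illustrated with the following sketch; the path-based traversal mirrors the DataMap described in connection with Figure 9, and the category bucketing stands in for an axis whose tick labels are supplied by the jargon manager. The record and path shapes are assumptions for the example.

from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def extract_points(records: Iterable[dict],
                   x_path: List[str], y_path: List[str]) -> List[Tuple]:
    """Data extractor: pull only the fields needed for the response from full line-item results."""
    def walk(record: dict, path: List[str]):
        for node in path:            # a path is a list of nodes, as in the DataMap
            record = record[node]
        return record
    return [(walk(r, x_path), walk(r, y_path)) for r in records]

def bucket_by_category(points: List[Tuple[str, float]],
                       categories: List[str]) -> Dict[str, float]:
    """Data formatter: bucket datapoints into the ticks that will appear on a UI axis."""
    buckets = defaultdict(float)
    for category, amount in points:
        if category in categories:   # category labels supplied by the jargon manager
            buckets[category] += amount
    return dict(buckets)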
[0071] In at least some embodiments, a user’s utterance can include filtering in their query, for example, “show me revenue for Red Rock Diner in 2019”. Depending upon the embodiment, filtering takes place in one of three ways: 1) the data connector manager query string generator itself will attempt to build the filtering into the query request as much as that particular data source will allow (in SQL terms, this is the WHERE clause); 2) the data extractor will attempt to filter out from the returned results any results that are not part of the desired response; and 3) if proper slot types are present, filtering may take place after data extraction AND after formatting. For example, developing a response to a query of “graph my top 3 customers by total invoice amounts” would first involve summing up invoice amounts of all datapoints extracted from all records and then filtering down to the top three. It will be appreciated that, in such a case, records cannot be prefiltered. In some instances, the foregoing filtering approaches can be performed sequentially, such that if 1) is not successful, 2) is executed and, if that is unsuccessful, 3) is executed.

[0072] Referring next to Figure 8C, an alternative embodiment of the software architecture of the present invention can be better appreciated. Modules having the same function as that shown in Figure 8B are shown with like reference numerals. Figure 8C includes additional functionalities to improve resolution of ambiguities that may result from the spoken query. Thus, the selected command template can access a filter engine 870 as described above, and can further include a clarification utility 875 accessed through the filter engine. In instances where the query includes a word that may create an ambiguity, such as a proper noun, the clarification utility can operate to provide a user with the opportunity to clarify the meaning of a word or words within the query. In an embodiment, the clarification utility can be passed down from the selected command template 820 and directly utilized by the jargon manager 860. In an embodiment, the data connector manager 830 authorizes the interaction of the jargon manager with the filter engine and clarification utility. The jargon manager can access a search index 880, which in an embodiment serves to store all jargon that has been captured by the system for future reference, including data that may yield multiple responses to a word or words in the query, thus leading to ambiguity. In an embodiment, the search index is a query-response subsystem, and stores any clarifying information that may have become associated with the term of interest. Thus, as an example, a CRM record of type “Opportunity” may be clarified with the account name, the “owner” of that opportunity, and the region. The response from the search index will yield multiple results from the jargon manager, whereupon the clarification utility can craft a response to be sent to the user that includes a request for clarification as to which of the multiple results is the appropriate one or ones. A cache 885 can also be provided for use by the data connector manager 830. Once the response to the query has been developed, it is provided to the selected command template, which in turn provides it to a database in the data storage layer 800C, where it is available for retrieval by the client.
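Returning to the filtering approaches described in paragraph [0071], the third case, where records cannot be pre-filtered, can be sketched as follows for the “top 3 customers by total invoice amounts” example; the data values and the slot-to-tuple format are invented for illustration.

from typing import List, Tuple

def top_n_by_total(points: List[Tuple[str, float]], n: int = 3) -> List[Tuple[str, float]]:
    """Post-format filtering: sum invoice amounts per customer, then keep only the top n."""
    totals = {}
    for customer, amount in points:
        totals[customer] = totals.get(customer, 0.0) + amount
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Example (illustrative data):
# top_n_by_total([("Red Rock Diner", 120.0), ("Acme", 80.0), ("Red Rock Diner", 60.0),
#                 ("Globex", 200.0), ("Initech", 40.0)])
# -> [("Globex", 200.0), ("Red Rock Diner", 180.0), ("Acme", 80.0)]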
[0073] Referring next to Figure 9, the DataMap comprises a data structure 900 whose task is to correctly parse natural language into the appropriate database reference for a well-defined data source. Data structure 900 contains, first, a table name, which essentially corresponds to the concept of a table in a relational database. A Time Column is a table-specific primary reference for queries related to time for records from that table. For example, for most transactional records this would be the string “TxnDate”, but for some non-transactional records it would be “CreatedDate” - it simply depends on the table (and thus the data source). The data structure 900 next comprises a mapping among a Slot Type indicated at 910, an Utterance indicated at 915 and a Path Traversal Definition through a data source record or path indicated at 920. The Slot Type 910 is the label the AI inference engine assigns to the uttered word or phrase; the slot type is a category and allows the CES to know how to treat the value in processing. The Utterance is the actual uttered word or phrase, and the path traversal is a list of nodes through the tree to the piece of data, bearing in mind that the returned results for a data source are often a JSON-formed object that inherently possesses a tree structure. A table can comprise multiple slot types and multiple utterances per slot type. Each Utterance has a path (which is a list of nodes), a set of characteristics (essentially flags that would drive special treatment of that path or provide essential information about the data found at that path, such as data type - Boolean, string, etc.) and finally a UI-friendly label for that particular path. Next, the data structure 900 comprises mappings related to query string creation. These mappings tell the query string generator in the data connector manager what to use in the query for a particular uttered jargon. The first is a mapping between jargon classification, indicated at 930, and a list of paths that could potentially have a piece of data of that classification type at its location in the query result record; each path is a list of nodes, indicated at 935. The nodes in this data structure are represented as strings, which would be the key in the key/value pair that is the total structure of any one node that exists at the data source. Finally, the data structure 900 comprises a mapping between jargon classification, indicated at 945, and parent node, indicated at 950, rather than path. This is because, when building a jargon reference into the actual query, the reference mechanism for a data source can differ from how the actual record might be traversed. In this case the query uses just the parent node as the reference in the WHERE clause. To know, for a given table and classification, what string to use in the WHERE clause, this mapping is consulted. This data structure serves to define how to interpret the specifically connected data source’s response and traverse that response to the precise data sought. In an embodiment, this involves encoding data traversal in a relationship that will work for both flat structures (a database row, for example) or tree structures (a JSON) or potentially any atypical structure. This data structure furthermore includes a mechanism to resolve jargon (essentially proper nouns with near unlimited possibilities) within the data source’s response if, for example, the returned data records needed to be filtered on some specific datapoint.
In an embodiment, the data structure also encodes specific data source traversal information at purely a high level for usage in the actual data source machine query, namely the Jargon Classification to Filterable Data Source Path Mapping indicated at 940.
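A DataMap for a single table might be laid out as shown in the following sketch; the field names, the Invoice table, and the JSON paths are assumptions made for this illustration rather than the patented schema.

from typing import List

# Illustrative shape of a DataMap for one table of a well-defined data source.
DATAMAP_INVOICE = {
    "table": "Invoice",
    "time_column": "TxnDate",                    # table-specific primary time reference
    "slots": {                                   # slot type -> utterance -> traversal definition
        "amount": {
            "invoice amount": {
                "path": ["Line", "Amount"],      # list of nodes through the JSON result tree
                "characteristics": {"data_type": "float"},
                "label": "Invoice Amount",       # UI-friendly label for this path
            },
        },
    },
    "jargon_paths": {                            # jargon classification -> candidate paths
        "customer": [["CustomerRef", "name"]],
    },
    "jargon_where_nodes": {                      # jargon classification -> parent node for the WHERE clause
        "customer": "CustomerRef",
    },
}

def resolve_path(record: dict, path: List[str]):
    """Traverse a JSON-formed query result along a DataMap path to the precise data sought."""
    for node in path:
        record = record[node]
    return record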
[0074] With reference next to Figure 10, the dialogue control service of Figure 4 can be better appreciated. In an embodiment, the dialogue control service comprises a static synthesizer function and a dynamic synthesizer function, indicated at 1005 and 1010 respectively. The static synthesizer 1005 takes a list of error strings on startup, indicated at 1015 and received via message bus 430, and synthesizes any that are not already in existence. The dynamic synthesizer 1010 takes in a string to be generated, indicated at 1020, in real time from message bus 430 and, optionally, can also take in just the variable components and synthesize either the full string with the variable values or just the variable values, inserting them into the rest of the synthesized output. In at least some embodiments, all of these audio snippets are made available via the service as a public web server 1025. Additionally, in some embodiments the UI client itself can receive the text response and initiate a real-time text-to-speech operation with either a local method or an external service to convert the text response to audio.
[0075] Next referring to Figure 11, a method for latency reduction in accordance with an embodiment of the invention can be better understood. The method described here is designed to minimize turnaround time for a natural language query by ensuring that the query is being processed the moment the last relevant piece of information is uttered. Thus, if any superfluous words are uttered towards the end of the utterance, no delay in providing a response results from the time it takes the speaker to utter them, such that those superfluous words do not impact the overall turnaround time. Additionally, processing the query is initiated before the speech-to-text engine has finalized the transcription and, if the interim transcription is not materially different, processing is performed from the point of the last material difference. Thus, as shown in Figure 11, an interim transcription result is received at 1105, and is sent at 1110 to the AI Coordinator. The AI Coordinator determines an inference and provides a confidence value at 1115. If the confidence value exceeds a threshold value, a “confident” result is reached; if not, a “not confident” result is reached. If the result is not confident, the process ends at 1120. If the result is “confident”, the process advances to step 1125 where slot data is sent to the CES. At 1130, the result is checked for a material difference in slot data. If there is no material difference, the process ends at 1135. If there is a material difference, the query - already in progress - based on the prior partial utterance is discarded at 1140, and the updated query begins processing at 1145. The process ends at 1150 once the new query is processed. This process is repeated for every interim result received from the voice transcription service stream and ultimately concludes when the user is done speaking and the voice transcription service provides its final result. Thus, the processes of Figure 11 operate essentially as a loop, but the loop is driven by the occurrence of further speech, which yields a new iteration of the loop - for example, one new iteration for each additional word.
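The interim-result loop of Figure 11 can be sketched as follows; the asynchronous query handle with a cancel() method, the shared state dictionary, and the 0.8 confidence threshold are assumptions made for the example.

def handle_interim_result(interim_text: str, state: dict,
                          ai_coordinator, ces, confidence_threshold: float = 0.8) -> None:
    """Process one interim transcription result; restart the query only on a material change."""
    intent, slots, confidence = ai_coordinator.infer(interim_text)
    if confidence < confidence_threshold:
        return                                    # not confident: wait for further speech

    if slots == state.get("last_slots"):
        return                                    # no material difference: let the running query finish

    pending = state.get("pending_query")
    if pending is not None:
        pending.cancel()                          # discard the query based on the prior partial utterance

    state["last_slots"] = slots
    state["pending_query"] = ces.execute_async(intent, slots)   # begin processing the updated query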
[0076] Figure 12 illustrates a model library, from which models 115 (Figures 1, 2) can be better appreciated. In an embodiment, each customer will have a different model, indicated at 1200A-1200n for customers 1205A-1205n. In an embodiment, the model update flow depicted in Figure 12 provides continuous improvement of the system’s natural language processing by continuously refining the latest-deployed NLP model via customer usage data and feedback. Further, it seeks to allow models to be refined across a plurality of customers while at the same time ensuring that none of a given customer’s data and actual queries ever leaves their premises (or, if running online, their data silo).
[0077] For example, in the case of a customer running their model manager 1215A in a data silo, incoming queries 1210 are stored in a dedicated query collection 1215. In addition, and in some instances at the same time, the queries are also used for online training of the deployed model, designated Model-A, when user feedback is provided. In the case of queries that have been reported by the user to result in clearly incorrect responses, or in the case where the query cannot be completed and the system is aware that a particular query failed, that query will have certain key parts of speech (nouns, verbs) replaced with random values so that the query is no longer specific to the organization in which it occurred. That newly obfuscated query is sent from the client organization into the external network, where it can be encoded by an operator, for example a human or another AI, and sent back into the client organization for inclusion in future training of the system. As online training continues, iteratively improved models Model-A1, Model-A2, etc., are obtained, in each case as determined by their performance against test queries pulled from customer model 1200A’s query collection 1215.
[0078] On a given trigger, for example if a threshold amount of time has elapsed or a performance improvement threshold has been met, a “tournament” is held to determine a new best model for customer 1200A. In an embodiment, the participant models are the local improved models Model-A1, Model-A2, etc., plus models from model library 1225, which can comprise models collected from other customers, for example models Model-B3, Model-C2, and Model-Zx. The specific tournament “rules” could vary depending upon the implementation, although in at least some embodiments the main determining factor would be each model’s test result against queries from query collection 1215 from customer 1200A. Similarly, a test for a new model from customer 1200B would run tests against queries from that customer’s query collection.
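One way a tournament round could be scored is sketched below; scoring by exact intent match against the customer’s collected test queries is an assumption standing in for whatever tournament rules a given implementation adopts, and the model interface is likewise assumed.

from typing import Iterable, List, Tuple

def run_tournament(candidate_models: List, test_queries: Iterable[dict]) -> Tuple[object, dict]:
    """Score each candidate model against the customer's query collection and pick a winner."""
    queries = list(test_queries)

    def accuracy(model) -> float:
        correct = sum(1 for q in queries
                      if model.predict(q["text"]) == q["expected_intent"])
        return correct / max(len(queries), 1)

    scores = {getattr(m, "name", repr(m)): accuracy(m) for m in candidate_models}
    winner = max(candidate_models, key=accuracy)   # the "winner" is redeployed for the customer
    return winner, scores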
[0079] The “winner”, or model with the best results, is then redeployed for customer 1200A, and the model is also noted in the model library 1225, which is managed by the model library service 1205. The model library 1225 keeps track of all “tournament winners” as well as the metadata associated with those models. The model library service also manages selection of models to be tested in a tournament, as shown at 1230. In an embodiment, models selected for such tournaments can be based at least in part on model novelty to the customer: in order to promote cross-training, one approach is to prioritize models that have not been compared to existing models for a particular customer. Alternatively, models selected for a given tournament can be selected based on model popularity; the more tournaments a particular model wins, the more likely it will perform well with a new customer. A still further approach is model fitness gain, where a model that has high fitness should likely do well with a new customer. Yet another approach is model customer size, since a customer with a high user count should be more representative of the problem domain, and a model that works well for such a customer has a comparatively high probability of working well for another customer. Models in the model library 1225 that are not getting selected for tournament participation, or are not winning such tournaments, or are not deployed at customer sites can be retired from the model library after, for example, a threshold period of time or other convenient criteria.

[0080] Referring next to Figure 13, a method for client-private training of the system can be better appreciated. In an embodiment, the method of Figure 13 allows a user having an “on premises” installation or a private cloud to update its AI models without exposing customer query data outside of the customer network. In such an embodiment, the system is designed to have a current base model that is the last updated and *qualified* model. In this context, “updated” means the model has completed incremental training from the last customer installation to signal that some threshold of new query data is available and has requested the base model for updating. In an embodiment, “qualified” means a given new model has shown an improvement in terms of F-1 score or a similar metric, and that the given new model has also shown improvement as measured by the new model’s success rate against a predetermined “golden” test set. To avoid exposure of customer data or queries, such testing can be performed on customer premises. Once these two items occur, the base model is updated and any future requests for the base model from an on-premises implementation will receive the new base model. Periodically the on-premises installation will update its local base model with the core base model hosted by the vendor so that customers take advantage of the aggregate training data of all customers.
[0081] Thus, with reference to Figure 13, at 1305 a query collector signals a vendor network that it is ready for a training update. At 1310, the vendor’s base model manager sends the latest base model for an on-premise training service evaluation of the new model. The base model is incrementally trained with newly collected queries, shown at 1315, and the updated model is sent out of the client network into the vendor’s network at 1320. A check is then made by the model manager at 1325 to determine whether the new model’s F-1 score and its score against the “golden” test set perform better or not. If better, the new model is established as the new base model, step 1335; if not, the process ends, 1330.

[0082] Figures 14A-14B illustrate hardware and software components for performing gaze tracking in accordance with an embodiment of the invention. This refers to a system for determining what (on a visual medium such as a TV screen) the user is pointing or looking at. In one embodiment, a gaze tracking device exhibiting context-specific (meeting room, large presentation space, personal office, etc.) specifications would be employed. These specifications at a top level would include precision, accuracy, and speed, and at a more granular level would include, but not be limited to, sampling frequency, spectrum of operation, angle of operation, pixel density and depth, among others. A main consideration in specifying a particular gaze tracking implementation is to maximize gaze accuracy over a particular range given a particular context. Using a gaze tracking device, the screen can be subdivided into a grid with an internal inventory of what objects, along with their visual characteristics, are in each grid cell. Examples of gaze tracking devices include Tobii’s solutions, such as the Tobii Pro Spectrum, Fusion and Nano, as well as Tobii glasses that track the wearer’s gaze. Competitive offerings are available from EyeLink, SmartEye, SMI and others. Some considerations in choosing a specific implementation include accuracy, speed and cost. Some solutions typically used in research have very high sampling frequencies but are quite expensive, and in at least some implementations of the invention such performance metrics are unnecessary, allowing some cost savings. Key metrics for selecting a device would likely take into account accuracy, since there is a limit to how small objects on a screen can get from a productivity perspective, and processing speed high enough that verbal instructions are executed quickly, the measuring stick being a human with a finger and a keyboard, as there is a quickly met upper bound on how long verbal instructions should take to execute. Cost is also a practical consideration, where there may be one unit per personal office.
[0083] Through a combination of the user’s utterance and narrowing down which cell they are looking in (while increasing the resolution if there is some initial ambiguity, for example where a sector is too large), the intended object (“this”) can be identified, which in turn permits an appropriate action to be taken based on the user’s utterance. In Figure 14A, a gaze tracking device 1400 provides its input into an intermediary device such as shown in Figure 3, or other computer, which in turn supplies its output to a system location either in the cloud or on premises, as shown at 1410. A screen grid object content utterance mapping list, indicated at 1415, provides the correlation between the gridded screen and the captured gaze.
[0084] In Figure 14B, an embodiment of gaze tracking to identify a “this” in a spoken utterance can be better understood. An utterance is received from a user at 1425, and an attempt to identify the referred-to object is made by searching the type and characteristics in the screen grid object mapping list 1415. If only one object is found, the process advances by sending that object ID to the system for parsing of any other portion of the utterance, as shown at 1435, and then the process ends. If, however, the result at step 1430 is either no match or too many matches, the process advances to step 1445 by querying the display driver for screen resolution and dimensions. Based on that result, at step 1450 the screen is divided into a 2D array with a plurality of sectors. Then, at step 1455, each sector is searched using the mapping list 1415. Again, if one object is found, the process advances back to step 1435, after which the process ends.

[0085] If, again, many objects are found, the sector size is reduced and the sectors are again searched, with the process being iterated until only one object is found. If, despite the sector-by-sector examination, no object is found, the user can be asked for clarification as shown at 1465, after which the process ends at 1470.
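The iterative sector refinement of Figures 14A-14B can be sketched as follows; the grid sizes, the object map format, and the doubling of the subdivision count are assumptions for the example.

from typing import List, Optional

def locate_gazed_object(gaze_x: float, gaze_y: float,
                        screen_w: int, screen_h: int,
                        object_map: List[dict],
                        start_divisions: int = 4, max_divisions: int = 64) -> Optional[str]:
    """Shrink the gaze sector until exactly one on-screen object falls inside it."""
    divisions = start_divisions
    while divisions <= max_divisions:
        sector_w, sector_h = screen_w / divisions, screen_h / divisions
        left = (gaze_x // sector_w) * sector_w
        top = (gaze_y // sector_h) * sector_h
        hits = [obj for obj in object_map
                if left <= obj["x"] < left + sector_w and top <= obj["y"] < top + sector_h]
        if len(hits) == 1:
            return hits[0]["id"]   # the unambiguous "this"
        if not hits:
            return None            # nothing in the sector: ask the user for clarification
        divisions *= 2             # too many matches: reduce the sector size and search again
    return None                    # still ambiguous: ask the user for clarification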
[0086] Figures 14C-14D illustrate a time-of-flight based alternative (or supplement, in some embodiments) to the embodiment of Figures 14A-14B. The time-of-flight hardware components comprise a 3D time-of-flight camera array that covers substantially 360 degrees of view horizontally and a suitable view vertically, such as 70 degrees, shown at 1472. The output of the array is provided to an intermediary device as shown at 1474, where in turn the output of the intermediary device is provided to an intermediary system 1476. Turning to the software functions that run in the hardware of Figure 14A, an image database trained with IR/3D time-of-flight images of extended arms/fingers and visual media supplies its data to an AI algorithm using a convolutional neural network (CNN) 1480. The system receives an utterance at 1484, and the boundary of the visual display device, such as a TV, computer monitor, projector, or other display device, is identified at 1486. The angle of the visual medium relative to a presumed flat backplane is calculated at 1488, permitting the location of a finger to be determined at 1490. The finger angle relative to the presumed-flat backplane is calculated, 1492, permitting the projected intersection of the finger and the visual medium to be extrapolated at 1494. An attempt is made to identify the referred-to object through the use of positional mapping of the UI to objects, step 1496, after which the object ID and utterance are provided to the system at 1498.
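The extrapolation of the pointing target can be reduced, in a simplified two-dimensional form, to the geometry below; treating the display as lying in a flat backplane and measuring the finger angle against that plane are the assumptions of this sketch, not the full 3D calculation of steps 1488-1494.

import math

def pointed_screen_x(finger_x: float, finger_distance: float, finger_angle_deg: float) -> float:
    """Extrapolate where a pointing finger intersects the (presumed flat) display plane.

    2D simplification: x runs along the display plane, finger_distance is the finger's
    perpendicular distance from that plane, and the angle is measured against the plane.
    A 90-degree angle points straight at the display, producing no lateral offset.
    """
    offset = finger_distance / math.tan(math.radians(finger_angle_deg))
    return finger_x + offset

# Example: a finger 2.0 m from the screen, 0.5 m from the left edge, angled at 45 degrees
# toward the right, projects to roughly x = 2.5 m along the screen plane.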
[0087] Referring next to Figures 15A-15B, the processes for managing passive and active mode permissions can be better appreciated. Figure 15A, directed to passive mode permissions, begins with one or more attendees entering an empty room, shown at 1503. In an embodiment, the permissions level is automatically set to the level associated with the most senior person identified as being in attendance in the room, step 1506, and continues at that level until that person leaves the room, step 1509. If that senior person verbally authorizes continuation (or extension for a period of time) of their level of permissions, that level continues, step 1512, until either the set time elapses or the meeting otherwise ends. If the senior person departs without authorizing a continuation of their permissions level, which authorization can occur in any convenient manner including that speaker’s utterance, a fingerprint authorization, etc., the system automatically drops the permissions level to that of the next most senior person in attendance, shown at 1515. If the meeting goes over the scheduled time, such that others enter the room before the meeting ends, the permissions level is automatically and dynamically set to the level of the most senior person in the room at a given time, steps 1518-1527. Eventually the room is empty, and the permissions level is set to the lowest setting, step 1530. Setting permissions to allow access to only part of otherwise available data can also provide desirable security against unintentional exposure of sensitive data.
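The passive-mode rule can be captured in a few lines; the numeric permission levels and the optional continuation override are assumptions made for the sketch.

from typing import List, Optional

def current_permission_level(people_in_room: List[dict],
                             continuation: Optional[dict] = None,
                             lowest_level: int = 0) -> int:
    """Passive mode: the room's permission level follows the most senior person present."""
    if continuation and continuation.get("active"):
        return continuation["level"]   # a departing senior attendee authorized continuation
    if not people_in_room:
        return lowest_level            # empty room: drop to the lowest setting
    return max(person["permission_level"] for person in people_in_room)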
[0088] Figure 15B shows a process for active mode permissions, and relies on having at least the most senior person activate their permissions level by the use of a fingerprint scanner or other biometric device, or a keypad or other device requiring an affirmative action. Again, levels automatically drop once that most senior person leaves the room, absent authorization for continuance of their permissions level, again by use of a fingerprint scanner, retinal scanner or other biometric device, keypad, or similar. The process of Figure 15B can be seen to be analogous to that of Figure 15A in all other respects, and those steps are not repeated in the interest of simplicity.
[0089] Referring next to Figure 16, a process for managing implied multiple queries in accordance with the invention can be better appreciated. This algorithm is called for when a user wants to compare two or more datasets. In such an approach, the natural language query must be split into multiple machine queries to be made on the backend. This is complicated by the fact that users often utter different words only along the dimensions they want to compare, which requires that the implied parameters of the query be determined. For example, if the query is “Compare my revenue and expenses in 2020”, then two independent “tables” need to be referenced in the data source, thus requiring two different database queries to be derived from a single natural language query. Thus, referring still to Figure 16, at 1605 all slot data provided by the AI Coordinator is loaded. The number of queries to be executed is determined at 1610 by looking, for example, at the slot type with the most occurrences out of slot types that inform on what “table” to reference. Each slot is then examined, step 1615. If there are no more slots to be examined, the process ends at 1620.
[0090] If there are more slots to be examined, the process advances to step 1625, where a determination is made whether the required composition for performing a single query yet exists. If yes, a new query object is formed at 1630 and the process loops to step 1615. If not, the process advances to step 1635, where a determination is made whether the slot type appears only once and the current step is the first pass in the examination of slots. If yes, the process advances to step 1640, where the slot is duplicated as many times as there are queries (e.g., a search for revenue for multiple years), and the process advances to step 1645. If the result at step 1635 was a no, the process jumps to step 1645. At 1645, a determination is made whether the composition is missing this slot type AND that slot type has not already been assigned to a query, in which case the slot is assigned to the query and the process advances to step 1650, where the query’s composition is updated for the newly-assigned slot. The process then loops to step 1615, and continues until the result at 1615 is a yes.
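The slot duplication of Figure 16 can be illustrated with the sketch below; treating a “metric” slot type as the one that determines which table to reference, and the dictionary slot format, are assumptions made only for this example.

from typing import Dict, List

def split_into_queries(slots: List[Dict[str, str]],
                       table_slot_type: str = "metric") -> List[List[Dict[str, str]]]:
    """Split one natural language query into the implied machine queries.

    "Compare my revenue and expenses in 2020" yields two compositions that reference
    different tables while sharing a duplicated copy of the 2020 slot.
    """
    table_slots = [s for s in slots if s["type"] == table_slot_type]   # one per machine query
    shared_slots = [s for s in slots if s["type"] != table_slot_type]  # duplicated across queries

    return [[table_slot] + [dict(s) for s in shared_slots] for table_slot in table_slots]

# split_into_queries([{"type": "metric", "value": "revenue"},
#                     {"type": "metric", "value": "expenses"},
#                     {"type": "year", "value": "2020"}])
# -> two slot compositions, one per metric, each carrying its own copy of the 2020 slot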
[0091] From the foregoing, those skilled in the art will recognize that a new and novel method and system for each of contextualizing a meeting space, identifying the source of and analyzing a spoken query or command, and parsing that query or command in order to develop a response, has been disclosed, offering significant improvement over the prior art. Given the teachings herein, those skilled in the art will recognize numerous alternatives and equivalents that do not vary from the invention, and therefore the present invention is not to be limited by the foregoing description, but only by the appended claims.

Claims

What is claimed is:
1. A system for retrieving data relevant to a query or command comprising
at least one input configured to receive a query or command in the form of at least one of a group of input signals comprising audio data, video data, time of flight data, biometric data, infrared data, thermal data, humidity data, proximity data, Bluetooth data, WiFi data, and ultra wide band receiver data,
an intermediary device comprising at least one processor and associated storage configured to execute a first process to receive data representative of the at least one input signal, execute a second process to extract from that data a source of the query or command, execute a third process of applying permissions to the query or command, and execute a fourth process by which a response to the query or command is developed in accordance with the permissions set by the third process, and
data storage for storing the response and making the response available to a client.
2. A method for automatically contextualizing a space to identify the source of a spoken query or command, and to automatically provide relevant data in response, comprising the steps of
automatically attempting to identify in at least one processor and associated memory the source of the query or command by comparing the voice to at least one user profile,
if the attempt is not successful, automatically analyzing in the at least one processor the location within the space from which the query originated to further attempt to identify the source,
in response to one or more identifying steps, automatically associating one or more permissions with the query to provide access to data where the accessible data is constrained by the permissions,
disambiguating the query or command to enable the query or command to be transcribed by a speech-to-text engine, and
developing a response to the query or command in response to the transcribed query or command.
3. A method for training an AI model in which training data generated within a customer network is kept private within that network, comprising the steps of
in a processor and associated storage, automatically obfuscating elements of the uttered query that are specific to the organization operating the customer network,
automatically sending the obfuscated uttered query to an external network configured to encode the query,
receiving the encoded query back into the customer network,
automatically reversing the obfuscation to de-obfuscate the query, and
automatically incorporating the encoded and de-obfuscated query into the training.
4. The system of claim 1 wherein execution of at least the fourth process comprises developing an interim response upon receipt of a first portion of the query or command, and revising the response as one or more successive portions of the query or command are received.
5. The method of claim 2 further comprising the steps of
automatically analyzing, in the at least one processor, a gesture by the source of the query or command and the location within the space from which the query originated to further attempt to identify the source,
extrapolating the angle of a pointing finger relative to a digital video medium, and
determining, based on at least one input received from at least one of a group comprising RGB, IR and ToF sensors, the location and angle of the finger relative to the ground plane of the visual medium.
6. The method of claim 5 further comprising the step of dividing the visual medium into sectors sized such that a single discrete object is located in a single sector intersecting the extrapolated angle.
7. The method of claim 2 wherein the permissions constrain the available data to the most senior person in a space.
8. The method of claim 2 wherein the permissions constrain the available data to the least senior person in a space.
9. The system of claim 1 wherein the intermediary device comprises a plurality of command templates.
10. The system of claim 1 wherein the intermediary device comprises in part one or more user profiles and each user profile comprises at least the permissions associated with a given user.
11. The system of claim 1 wherein the query or command is spoken and the second process comprises using voice recognition to identify the source of the query or command.
12. The system of claim 1 wherein the query or command is spoken and one of the second and third processes comprises identifying the position of the source within a space.
13. The system of claim 12 wherein a position of each person within the space is identified by use of at least one of the input signals.
14. The system of claim 1 wherein the fourth process comprises analyzing the query or command through the use of a jargon manager.
15. The system of claim 1 wherein the jargon manager comprises a search index maintained in the associated storage and the search index comprises a store of data associated with captured query terms.