US20220036899A1 - Multi-modal conversational agent platform - Google Patents
- Publication number
- US20220036899A1 US20220036899A1 US17/500,352 US202117500352A US2022036899A1 US 20220036899 A1 US20220036899 A1 US 20220036899A1 US 202117500352 A US202117500352 A US 202117500352A US 2022036899 A1 US2022036899 A1 US 2022036899A1
- Authority
- US
- United States
- Prior art keywords
- query
- data
- tenant
- response
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0603—Catalogue ordering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0641—Shopping interfaces
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/01—Customer relationship services
- G06Q30/015—Providing customer assistance, e.g. assisting a customer within a business location or via helpdesk
- G06Q30/016—After-sales
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
- G10L2015/228—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2201/00—Electronic components, circuits, software, systems or apparatus used in telephone systems
- H04M2201/40—Electronic components, circuits, software, systems or apparatus used in telephone systems using speech recognition
- H04M2201/405—Electronic components, circuits, software, systems or apparatus used in telephone systems using speech recognition involving speaker-dependent recognition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/487—Arrangements for providing information services, e.g. recorded voice services or time announcements
- H04M3/493—Interactive information services, e.g. directory enquiries ; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
Definitions
- Conversational agents can be utilized in e-commerce applications to allow a retail or service provider entity to interact with potential or existing customers in regard to a product or service without requiring a human customer support operator.
- Conversational agents can process data received in a variety of modalities, such as voice, text, and/or web site interactions.
- Conversational agents can also process data received from a variety of input devices, such as computing devices displaying a website of an e-commerce retailer, browser-enabled smartphones or mobile computing devices, and intelligent or virtual personal assistant devices.
- the semantic interpretation can be generated using a first data structure representing the first lexicon associated with the tenant.
- the first data structure can be generated based on at least one of: a catalog of items associated with the tenant and including a first item title and a first item description; one or more reviews associated with a first item; interactive user data associated with a first item; or a combination thereof.
- Generating the first data structure can include determining one or more attributes associated with a first item from the catalog of items; determining one or more synonyms associated with the first item from the catalog of items; determining one or more referring expressions associated with the first item from the catalog of items and/or the interactive user data associated with the first item; generating the first data structure based on the determining steps, the first data structure including a name, one or more attributes, one or more synonyms, one or more referring expressions, and/or one or more dialogs corresponding to the first item.
- the first data structure can be used in the first machine learning process to train the at least one of a plurality of classification algorithms.
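The catalog-to-dialog construction described above can be sketched as follows. This is an illustrative sketch only: the class name, field names, and the keyword heuristics standing in for attribute/synonym extraction are assumptions, not the patent's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ItemEntry:
    """Dialog-ready record for one catalog item (field names are illustrative)."""
    name: str
    attributes: dict = field(default_factory=dict)
    synonyms: list = field(default_factory=list)
    referring_expressions: list = field(default_factory=list)

def build_item_entry(title, description, interactive_phrases):
    """Derive attributes, synonyms, and referring expressions for one item.

    Toy keyword heuristics stand in for the NLP extraction the patent
    leaves unspecified.
    """
    attributes = {}
    for word in title.lower().split():
        if word in {"red", "blue", "black"}:  # example attribute vocabulary
            attributes["color"] = word
    # Treat description tokens ending in "phone" as synonyms of the item.
    synonyms = [w for w in description.lower().split() if w.endswith("phone")]
    return ItemEntry(
        name=title,
        attributes=attributes,
        synonyms=synonyms,
        referring_expressions=list(interactive_phrases),
    )
```

An entry built this way bundles the name, attributes, synonyms, and referring expressions the claims enumerate, and could then serve as training input for the classification algorithms.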
- the method can include receiving second data characterizing an utterance of a query associated with a second tenant; providing, to a second automated speech recognition engine, the received second data and a profile selected from a plurality of profiles based on the second tenant, the profile configuring the second automated speech recognition engine to process the received second data; receiving, from the second automated speech recognition engine, a text string characterizing the query; and processing, via the ensemble of natural language agents configured based on the second tenant, the text string characterizing the query to determine a textual response to the query, the textual response including at least one word from a second lexicon associated with the second tenant.
- the utterance of the query can include a plurality of natural language words spoken by a user and received by an input device of a first computing device.
- the utterance of the query can be provided by the user in regard to a first context associated with a first item provided by the tenant.
- the profile can include one or more configuration settings associated with the ensemble of natural language agents configured on a server including a data processor, one or more configuration settings associated with an ensemble of natural language agents configured on the first computing device, and one or more configuration settings specifying one or more speech processing engines configured on the server including the data processor.
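A tenant profile of the kind just described can be modeled as a plain lookup keyed by tenant. The tenant identifiers, setting names, and engine names below are hypothetical placeholders, not values from the patent.

```python
# Hypothetical tenant profiles; the keys and settings are illustrative only.
PROFILES = {
    "retail-tenant": {
        "server_nla": ["intent_ensemble_v2"],       # NLA ensemble on the server
        "device_nla": ["wakeword_small"],           # NLA ensemble on the client device
        "speech_engines": {"asr": "asr_engine_a", "tts": "tts_engine_x"},
    },
    "finance-tenant": {
        "server_nla": ["intent_finance_v1"],
        "device_nla": [],
        "speech_engines": {"asr": "asr_engine_b", "tts": "tts_engine_y"},
    },
}

def select_profile(tenant_id, profiles=PROFILES):
    """Look up the configuration profile for a tenant."""
    if tenant_id not in profiles:
        raise KeyError(f"no profile configured for tenant {tenant_id!r}")
    return profiles[tenant_id]
```

Selecting a profile per tenant is what lets the same platform configure different speech engines and agent ensembles for different entities.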
- the tenant can include at least one of a retail entity, a service provider entity, a financial entity, a manufacturing entity, an entertainment entity, an information storage entity, and a data processing entity.
- the method can include receiving, prior to receiving data characterizing the utterance of the query, an input to a web site provided via a web browser configured on a first computing device, the input causing the web browser to be authenticated and registered at a second computing device coupled to the first computing device via a network.
- FIG. 1 illustrates an example architecture of a system including a dialog processing platform, a client device configured as a multi-modal conversational agent, and a machine learning platform;
- FIG. 2 illustrates an example architecture of a client device configured as a multi-modal conversational agent of the system described in FIG. 1 ;
- FIG. 3 illustrates an example architecture of a dialog processing platform of the system described in FIG. 1 ;
- FIG. 4 is a flowchart illustrating an example method for determining a textual response to an utterance of a query provided by a user via a client device of the system described in FIG. 1 ;
- FIG. 5 is a flowchart illustrating an example method for providing a verbalized query response to a user via a client device of the system described in FIG. 1 ;
- FIG. 7 is a flowchart illustrating an example method for generating a first data structure used in generating the semantic representation associated with the text string characterizing a query ;
- FIG. 8 is a flowchart illustrating an example method for generating an initial conversation prompt via a multi-modal conversational agent of the system described in FIG. 1 ;
- FIG. 9 is a diagram illustrating an example data flow for processing a dialog using a multi-modal conversational agent and the system of FIG. 1 .
- conversational agent architectures do not provide the flexibility to mix and match different speech or natural language processing resources. For instance, existing conversational agent architectures may not provide a means for configuring and deploying new, updated, or alternate speech processing and/or natural language understanding resources.
- the speech or language processing resources of many conversational agent architectures are integrated within the architecture and are not replaceable with alternate natural language processing resources.
- many conversational agent architectures cannot support, or be reconfigured to support, new digital endpoint devices that were not part of the conversational agent architecture as originally designed.
- a conversational agent backend architecture may be configured to process textual dialog inputs provided to a conversational agent utilized in a website.
- the backend architecture may be able to process the textual inputs provided by a user via a keyboard of a mobile or personal computing device at which the user is viewing the website. However, the backend architecture may be unable to process voice inputs provided via a microphone of the mobile or personal computing device.
- the lack of re-configurability and modularity of backend architectures limits the use of existing conversational agent systems to support new digital endpoint devices, new natural language processing resources, and new lexicons. The inability to efficiently configure and deploy new processing resources in conversational agent frontend and backend architectures can reduce user engagement, customer satisfaction, and revenue for the entities deploying the conversational agent.
- the conversational agent frontend and backend architecture described herein allows entities deploying conversational agents to configure and/or reconfigure the natural language processing resources that best suit the application or application domain.
- the conversational agent frontend and backend architecture described herein can also enable entities deploying conversational agents to support a broader variety of user input/output devices that are not necessarily from the same technology provider or originally intended to operate with a particular conversational agent backend.
- the conversational agent frontend and backend architecture described herein includes components that can easily integrate multiple input modalities provided via smartphones with multi-touch and keyboard capabilities, and also includes backend adaptors or connectors to simplify the user's authentication and to provide access to backend application programming interfaces (API) from different frontend device or application configurations.
- FIG. 1 illustrates an example architecture of a conversational agent system 100 including a client device 102 , a dialog processing platform 120 , and a machine learning platform 165 .
- the client device 102 , the dialog processing platform 120 , and the machine learning platform 165 can be communicatively coupled via a network, such as network 118 .
- a user can provide an input associated with a query to the client device 102 via input device 114 .
- the client device 102 can include a frontend of the conversational agent system 100 .
- a conversational agent can be configured on the client device 102 as one or more applications 106 .
- the conversational agent can transmit data associated with the query to a backend of the conversational agent system 100 .
- the client device 102 includes a memory 104 , a processor 108 , a communications module 110 , and a display 112 .
- the memory 104 can store computer-readable instructions and/or data associated with processing multi-modal user data via a frontend and backend of the conversational agent system 100 .
- the memory 104 can include one or more applications 106 implementing a conversational agent frontend.
- the applications 106 can provide speech and textual conversational agent modalities to the client device 102 thereby configuring the client device 102 as a digital or telephony endpoint device.
- the processor 108 operates to execute the computer-readable instructions and/or data stored in memory 104 and to transmit the computer-readable instructions and/or data via the communications module 110 .
- the network 118 can include, but is not limited to, any one or more of the following network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, tree or hierarchical network, and the like.
- the client device 102 also includes a display 112 .
- the display 112 can be configured within or on the client device 102 . In other implementations, the display 112 can be external to the client device 102 .
- the client device 102 also includes an input device 114 , such as a microphone to receive voice inputs, or a keyboard, to receive textual inputs.
- the client device 102 also includes an output device 116 , such as a speaker or a display.
- the client device 102 can include a conversational agent frontend, e.g., one or more of applications 106 , which can receive inputs associated with a user query and provide responses to the user's query.
- the client device 102 can receive user queries which are uttered, spoken, or otherwise verbalized and received by the input device 114 , such as a microphone.
- the input device 114 can be a keyboard and the user can provide query data as a textual input, in addition to or separately from the inputs provided using a voice-based modality.
- a user can interact with the input device 114 to provide dialog data, such as a query, via an e-commerce web-site at which the user previously placed an order.
- the conversational agent system 100 includes a dialog processing platform 120 .
- the dialog processing platform 120 operates to receive dialog data, such as user queries provided to the client device 102 , and to process the dialog data to generate responses to the user provided dialog data.
- the dialog processing platform 120 can be configured on any device having an appropriate processor, memory, and communications capability for hosting the dialog processing platform as will be described herein.
- the dialog processing platform can be configured as one or more servers, which can be located on-premises of an entity deploying the conversational agent system 100 , or can be located remotely from the entity.
- the dialog processing platform 120 can be implemented as a distributed architecture or a cloud computing architecture.
- one or more of the components or functionality included in the dialog processing platform 120 can be configured in a microservices architecture.
- one or more components of the dialog processing platform 120 can be provided via a cloud computing server of an infrastructure-as-a-service (IaaS) and be able to support a platform-as-a-service (PaaS) and software-as-a-service (SaaS) services.
- the dialog processing platform 120 includes a communications module 122 to receive the computer-readable instructions and/or user data transmitted via network 118 .
- the dialog processing platform 120 also includes one or more processors 124 configured to execute instructions that when executed cause the processors to perform natural language processing on the received dialog data and to generate contextually specific responses to the user dialog inputs using one or more interchangeable and configurable natural language processing resources.
- the dialog processing platform 120 also includes a memory 128 configured to store the computer-readable instructions and/or user data associated with processing user dialog data and generating dialog responses.
- the memory 128 can store a plurality of profiles associated with each tenant or entity. The profile can configure one or more processing components of the dialog processing platform 120 with respect to the entity or tenant for which the conversational agent system 100 has been configured.
- the dialog processing platform 120 includes one or more subsystems such as subsystem 130 A and 130 B, collectively referred to as subsystems 130 .
- Each subsystem 130 and the components or functionality configured therein can correspond to a particular entity, or tenant, that has configured the conversational agent system 100 to provide conversational agents to end users.
- the dialog processing platform 120 can include a first subsystem 130 A which can be associated with a first tenant 130 A, such as a retail entity, and a second subsystem 130 B which can be associated with a second tenant 130 B, such as a financial services entity.
- the dialog processing platform 120 can be configured as a multi-tenant portal to provide natural language processing for different tenants, and their corresponding conversational agent frontend applications 106 , which can be configured on a variety of multi-modal digital endpoint client devices 102 .
- Subsystems 130 can include components implementing functionality to receive user dialog data from a variety of multi-modal conversational agents and to generate dialog responses in the context of a particular lexicon of a tenant or entity for which the conversational agent has been deployed. For example, as shown in FIG.
- the components can include an automatic speech recognition engine adapter (ASRA) 135 A for interfacing with a plurality of automated speech recognition (ASR) engines 140 , a plurality of natural language agent (NLA) ensembles 145 A, a text-to-speech synthesis engine adapter (TTSA) 150 for interfacing to a plurality of text-to-speech (TTS) synthesis engines 155 , and a plurality of catalog-to-dialog (CTD) modules 160 A.
- the dialog processing platform 120 can include one or more subsystems 130 .
- the plurality of ASR engines 140 , the plurality of NLA ensembles 145 , the plurality of TTS synthesis engines 155 , and the plurality of CTD modules 160 can be respectively referred to as ASR engines 140 , NLA ensembles 145 , TTS synthesis engines 155 , and CTD modules 160 .
- the subsystem 130 components can be configured directly within the dialog processing platform 120 such that the components are not configured within a subsystem 130 .
- the ASR engines 140 and the TTS synthesis engines 155 can be configured outside of the dialog processing platform 120 , such as in a cloud-based architecture.
- the dialog processing platform 120 can exchange data with the ASR engines 140 and the TTS synthesis engines 155 via the ASRA 135 and the TTSA 150 , respectively.
- the ASR 140 and/or TTS 155 can be configured within the dialog processing platform 120 .
- the components of the dialog processing platform 120 , as well as the ASR engines 140 and the TTS synthesis engines 155 can be implemented as microservices within a cloud-based or distributed computing architecture.
- the dialog processing platform 120 includes an ASRA 135 A configured to interface with the ASR engines 140 .
- the ASR engines 140 can include automated speech recognition engines configured to receive spoken or textual natural language inputs and to generate textual outputs corresponding to the inputs. For example, the ASR engines 140 can process the user's verbalized query or utterance "When will my order be delivered?" into a text string of natural language units characterizing the query. The text string can be further processed to determine an appropriate query response.
- the dialog processing platform 120 can dynamically select a particular ASR engine 140 that best suits a particular task, dialog, or received user query.
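The adapter-based selection just described can be sketched as a small routing layer. The engine and task names are invented for illustration; real ASR engines would decode audio rather than return canned strings.

```python
class StubASREngine:
    """Stand-in for a real ASR engine; transcribe() would decode audio."""
    def __init__(self, name):
        self.name = name

    def transcribe(self, audio):
        return f"<{self.name} transcript>"

class ASREngineAdapter:
    """Routes each request to the engine registered for its task.

    Mirrors the dynamic engine selection described above; task keys
    and the fallback scheme are illustrative assumptions.
    """
    def __init__(self, engines, default="general"):
        self.engines = engines   # task name -> engine instance
        self.default = default

    def transcribe(self, audio, task=None):
        # Fall back to the default engine when no task-specific one exists.
        engine = self.engines.get(task, self.engines[self.default])
        return engine.transcribe(audio)
```

Because the adapter owns the mapping, an engine can be swapped or added per tenant or per task without touching the rest of the platform.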
- the dialog processing platform 120 also includes a plurality of NLA ensembles 145 .
- the NLA ensembles 145 can include a plurality of components configured to receive the text string from the ASR engines 140 and to process the text string in order to determine a textual response to the user query.
- the NLA ensembles 145 can include a natural language understanding (NLU) module implementing a number of classification algorithms trained in a machine learning process to classify the text string into a semantic interpretation.
- the processing can include classifying an intent of the text string and extracting information from the text string.
- the NLU module combines different classification algorithms and/or models to generate accurate and robust interpretation of the text string.
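The patent combines multiple classification algorithms without fixing a combination strategy; simple majority voting, assumed here purely for illustration, shows the shape of such an ensemble.

```python
from collections import Counter

def classify_intent(text, classifiers):
    """Combine several intent classifiers by majority vote.

    Each classifier maps a text string to an intent label; the
    most common label across the ensemble wins.
    """
    votes = [clf(text) for clf in classifiers]
    winner, _count = Counter(votes).most_common(1)[0]
    return winner
```

Here a single weak or noisy classifier is outvoted by the rest of the ensemble, which is the robustness property the NLU module is after.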
- the NLA ensembles 145 can also include a dialog manager (DM) module.
- the DM module can determine an appropriate dialog action in a contextual sequence formed by the current or previous dialog sequences conducted with the user. In this way, the DM can generate a response action to increase natural language quality and fulfillment of the user's query objective.
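A minimal rule-based policy illustrates how a dialog manager can pick the next action from the classified intent plus accumulated context; the intent and action names are hypothetical, and a production DM module could equally use a learned policy.

```python
def next_action(intent, context):
    """Choose the next dialog action from intent plus conversation context."""
    if intent == "order_status":
        if "order_id" not in context:
            return "ask_order_id"      # slot missing: ask a follow-up question
        return "report_delivery_date"  # slot filled: fulfill the query
    return "clarify"                   # unrecognized intent: ask user to rephrase
```

Keeping the decision conditioned on `context` is what lets the agent behave sensibly across a multi-turn sequence rather than treating each utterance in isolation.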
- the NLA ensembles 145 can also include a natural language generator (NLG) module.
- the NLG module can process the action response determined by the dialog manager and can convert the action response into a corresponding textual response.
- the NLG module provides multimodal support for generating textual responses for a variety of different output device modalities, such as voice outputs or visually displayed (e.g., textual) outputs.
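The multimodal rendering step can be sketched as template-based generation with a per-modality wrapper. The templates and the SSML-style voice wrapper are illustrative assumptions, not the patent's NLG design.

```python
def render_response(action, slots, modality):
    """Render a dialog action for a voice or display endpoint."""
    templates = {
        "report_delivery_date": "Your order should arrive on {date}.",
        "ask_order_id": "Could you tell me your order number?",
    }
    # Fall back to a generic clarification when the action has no template.
    text = templates.get(action, "Sorry, could you rephrase that?").format(**slots)
    if modality == "voice":
        return f"<speak>{text}</speak>"  # markup for a speech synthesis engine
    return text                          # plain text for a visual display
```

The same dialog action thus yields either a string for the display 112 or marked-up text ready for a TTS synthesis engine.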
- the ensemble can include a set of models that are included in the NLU and optimized jointly to select the right response.
- the dialog processing platform 120 also includes a TTSA 150 configured to interface with the TTS synthesis engines 155 .
- the TTS synthesis engines 155 can include text-to-speech synthesis engines configured to convert textual responses to verbalized query responses. In this way, a response to a user's query can be determined as a text string and the text string can be provided to the TTS synthesis engines 155 to generate the query response as natural language speech.
- the dialog processing platform 120 can dynamically select a particular TTS synthesis engine 155 that best suits a particular task, dialog, or generated textual response.
- the dialog processing platform 120 also includes catalog-to-dialog (CTD) modules 160 .
- CTD modules 160 can be selected for use based on a profile associated with the tenant or entity.
- the CTD modules 160 can automatically convert data from a tenant or entity catalog, as well as billing and order information, into a data structure corresponding to a particular tenant or entity for which the conversational agent system 100 is deployed.
- the CTD modules 160 can derive product synonyms, attributes, and natural language queries from product titles and descriptions which can be found in the tenant or entity catalog.
- the CTD modules 160 can generate a data structure that is used by the machine learning platform 165 to train one or more classification algorithms included in the NLU module.
- the CTD modules 160 can instantiate, create, or implement fully configured conversational agents configured to process user queries or dialog inputs for a tenant.
- the CTD modules 160 can be used to efficiently pre-configure the conversational agent system 100 to automatically respond to queries about orders and/or products or services provided by the tenant or entity.
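- A toy sketch of the catalog-to-dialog idea: deriving a lookup structure and sample natural language queries from product titles. The structure's field names and the query template are illustrative assumptions, not the patent's schema:

```python
def catalog_to_dialog(catalog):
    """Build a dialog data structure from catalog entries: index each
    product SKU under its lowercase title tokens and generate sample
    natural language queries from the titles."""
    structure = {"synonyms": {}, "sample_queries": []}
    for item in catalog:
        for token in item["title"].lower().split():
            structure["synonyms"].setdefault(token, set()).add(item["sku"])
        structure["sample_queries"].append(
            f"Do you have the {item['title']} in stock?")
    return structure

catalog = [{"sku": "30018", "title": "Boyfriend Cardigan"},
           {"sku": "30022", "title": "Denim Jacket"}]
ctd = catalog_to_dialog(catalog)
```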
- the dialog processing platform 120 can process the user's query to determine a response regarding the previously placed order.
- the dialog processing platform 120 can generate a response to the user's query.
- the query response can be transmitted to the client device 102 and provided as speech output via output device 116 and/or provided as text displayed via display 112 .
- the conversational agent system 100 includes a machine learning platform 165 .
- Machine learning can refer to an application of artificial intelligence that automates the development of an analytical model by using algorithms that iteratively learn patterns from data without explicit indication of the data patterns.
- Machine learning can be used in pattern recognition, computer vision, email filtering and optical character recognition and enables the construction of algorithms or models that can accurately learn from data to predict outputs thereby making data-driven predictions or decisions.
- the machine learning platform 165 can include a number of components configured to generate one or more trained prediction models suitable for use in the conversational agent system 100 described in relation to FIG. 1 .
- a feature selector can provide a selected subset of features to a model trainer as inputs to a machine learning algorithm to generate one or more training models.
- a wide variety of machine learning algorithms can be selected for use including algorithms such as support vector regression, ordinary least squares regression (OLSR), linear regression, logistic regression, stepwise regression, multivariate adaptive regression splines (MARS), locally estimated scatterplot smoothing (LOESS), ordinal regression, Poisson regression, fast forest quantile regression, Bayesian linear regression, neural network regression, decision forest regression, boosted decision tree regression, artificial neural networks (ANN), Bayesian statistics, case-based reasoning, Gaussian process regression, inductive logic programming, learning automata, learning vector quantization, informal fuzzy networks, conditional random fields, genetic algorithms (GA), Information Theory, support vector machine (SVM), Averaged One-Dependence Estimators (AODE), Group method of data handling (GMDH), instance-based learning, lazy learning, Maximum Information Spanning Trees (MIST), and transfer learning methods based on pre-trained, generalized embeddings as well as domain-based embeddings.
- the CTD modules 160 can be used in the machine learning process to train the classification algorithms included in the NLU of the NLA ensembles 145 .
- the model trainer can evaluate the machine learning algorithm's prediction performance based on patterns in the received subset of features processed as training inputs and can generate one or more new training models.
- the generated training models, e.g., the classification algorithms and models included in the NLU of the NLA ensemble 145, are then capable of receiving user data including text strings corresponding to a user query and of outputting predicted textual responses including at least one word from a lexicon associated with the tenant or entity for which the conversational agent system 100 has been configured and deployed.
- FIG. 2 illustrates an example architecture of a client device 102 configured as a multi-modal conversational agent of the conversational agent system 100 described in relation to FIG. 1 .
- the client device 102 can include a plurality of applications 106 .
- the applications 106 can include easily installed, pre-packaged software developer kits that implement conversational agent frontend functionality on a client device 102 .
- the applications 106 can include APIs as JavaScript libraries received from the dialog processing platform 120 and incorporated into a website of the entity or tenant to enable support for text and/or voice modalities via customizable user interfaces.
- the applications 106 can implement client APIs on different client devices 102 and web browsers in order to provide responsive multi-modal interactive graphical user interfaces (GUI) that are customized for the entity or tenant.
- the GUI and applications 106 can be provided based on a profile associated with the tenant or entity.
- the conversational agent system 100 can provide customizable branded assets defining the look and feel of a user interface, different voices utilized by the TTS synthesis engines 155 , as well as textual responses generated by the NLA ensembles 145 which are specific to the tenant or entity.
- the web application 205 includes functionality configured to enable a web browser on a client device 102 to communicate with the dialog processing platform 120 .
- the web application 205 can include a media capture API, a web audio API, a document object model, and a web socket API.
- the web application 205 can be configured to capture dynamic content generated by the multi-modal conversation agent configured on the client device 102 .
- the dynamic content can include clickable and multimodal interactive components and data.
- the iOS application 210 includes functionality configured to provide support for multi-modal conversational agents implemented on client devices 102 configured with the proprietary iOS operating system developed by Apple Inc. of Cupertino, Calif., U.S.A.
- the interface representation and interactive user model used for a conversational agent configured on a client device web browser can be converted and provided using the same interface representation deployed on a mobile device web browser.
- the android application 215 includes functionality configured to provide support for multi-modal conversational agents implemented on client devices 102 configured with the Unix-based Android operating system developed by the Open Handset Alliance of Mountain View, Calif., U.S.A.
- the messaging application 220 includes functionality configured to provide messaging support for a variety of chat and messaging platforms. In some implementations, the messaging application 220 can reproduce the same interface representation multi-modal experience as enabled on other client device 102 interfaces.
- the telephony application 225 includes functionality configured to provide telephony support via public switched telephone network (PSTN) devices and voice over internet protocol (VoIP) devices.
- the telephony application 225 can be configured to generate short conversational prompts or dialog sequences without reference to the content of the screen.
- the conversational agent system described herein can enable support for smart speaker client devices 102 and the conversational agents configured on the client devices 102 can automatically adapt to the capabilities of different devices.
- FIG. 3 illustrates an example architecture 300 of a dialog processing platform 120 of the system 100 described in relation to FIG. 1 .
- the dialog processing platform 120 can serve as a backend of the conversational agent system 100 .
- One or more components included in the dialog processing platform 120 shown in FIG. 3 can be configured on a single server device or on multiple server devices.
- One or more of the components of the dialog processing platform 120 can also be configured as a microservice, for example in a cloud computing environment.
- the conversational agent system 100 can be configured as a robustly scalable architecture that can be provisioned based on resource allocation demands.
- the dialog processing platform 120 includes run-time components that are responsible for processing incoming speech or text inputs, determining the meaning in the context of a dialog and a tenant lexicon, and generating replies to the user which are provided as speech and/or text. Additionally, the dialog processing platform 120 provides a multi-tenant portal where both administrators and tenants can customize, manage, and monitor platform resources, and can generate run-time reports and analytic data. The dialog processing platform 120 interfaces with a number of real-time resources such as ASR engines 140 , TTS synthesis engines 155 , and telephony platforms. The dialog processing platform 120 also provides consistent authentication and access APIs to commercial e-commerce platforms.
- the dialog processing platform 120 includes a DPP server 302 .
- the DPP server 302 can act as a frontend to the dialog processing platform 120 and can appropriately route data received from or to be transmitted to client devices 102 as appropriate.
- the DPP server 302 routes requests or data to specific components of the dialog processing platform 120 based on registered tenant and application identifiers which can be included in a profile associated with a particular tenant.
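- The routing described above can be sketched as a registry keyed by tenant and application identifiers; the registry shape and field names are illustrative assumptions, not the DPP server's actual implementation:

```python
# Hypothetical registry mapping (tenant_id, app_id) to a handler.
HANDLERS = {}

def register(tenant_id, app_id, handler):
    HANDLERS[(tenant_id, app_id)] = handler

def route(request):
    """Dispatch a request to the component registered for its tenant
    and application identifiers."""
    key = (request["tenant_id"], request["app_id"])
    handler = HANDLERS.get(key)
    if handler is None:
        raise LookupError(f"no backend registered for {key}")
    return handler(request)

register("acme", "voice", lambda req: f"acme-voice:{req['payload']}")
out = route({"tenant_id": "acme", "app_id": "voice", "payload": "hello"})
```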
- the DPP server 302 can also securely stream audio to the ASR engines 140 and from the TTS synthesis engines 155 .
- the dialog processing platform 120 includes a plurality of adapters 304 configured to interface the ASR engines 140 and the TTS synthesis engines 155 to the DPP server 302 .
- the adapters 304 allow the dialog processing platform 120 to interface with a variety of speech processing engines, such as ASR engines 140 and TTS synthesis engines 155 .
- the speech processing engines can be configured in a cloud-based architecture of the dialog processing platform 120 and may not be collocated in the same server device as the DPP server 302 or other components of the dialog processing platform 120 .
- the adapters 304 include an ASR engine adapter 135 and a TTS synthesis engine adapter 150 .
- the ASR engine adapter 135 and the TTS synthesis engine adapter 150 enable tenants to dynamically select speech recognition and text-to-speech synthesis providers or natural language speech processing resources that best suit the user's objective, task, dialog, or query.
- the dialog processing platform 120 includes a voiceXML (VXML) adapter 310 which can couple the DPP server 302 to various media resources 312 .
- the media resources 312 can include VoIP networks, ASR engines, and TTS synthesis engines 314 .
- the media resources 312 enable the conversational agents to leverage existing telephony platforms, which can often be integrated with particular speech processing resources.
- the existing telephony platforms can provide interfaces for communications with VoIP infrastructures using session initiation protocol (SIP). In these configurations, VXML documents are exchanged during a voice call.
- the dialog processing platform 120 also includes an orchestrator component 316 .
- the orchestrator 316 provides an interface for administrators and tenants to access and configure the conversational agent system 100 .
- the administrator portal 318 can enable monitoring and resource provisioning, as well as providing rule-based alert and notification generation.
- the tenant portal 320 can allow customers or tenants of the conversational agent system 100 to configure reporting and analytic data, such as account management, customized reports and graphical data analysis, trend aggregation and analysis, as well as drill-down data associated with dialog utterances.
- the tenant portal 320 can also allow tenants to configure branding themes and implement a common look and feel for the tenant's conversational agent user interfaces.
- the tenant portal 320 can also provide an interface for onboarding or bootstrapping customer data.
- the tenant portal 320 can provide tenants with access to customizable conversational agent features such as user prompts, dialog content, colors, themes, usability or design attributes, icons, and default modalities, e.g., using voice or text as a first modality in a dialog.
- the tenant portal 320 can, in some implementations, provide tenants with customizable content via different ASR engines 140 and different TTS synthesis engines 155 which can be utilized to provide speech data in different voices and/or dialects.
- the tenant portal 320 can provide access to analytics reports and extract, transform, load (ETL) data feeds.
- the orchestrator 316 can provide secure access to one or more backends of a tenant's data infrastructure.
- the orchestrator 316 can provide one or more common APIs to various tenant data sources which can be associated with retail catalog data, user accounts, order status, order history, and the like.
- the common APIs can enable developers to reuse APIs from various client side implementations.
- the orchestrator 316 can further provide an interface 322 to human resources, such as human customer support operators who may be located at one or more call centers.
- the dialog processing platform 120 can include a variety of call center connectors 324 configured to interface with data systems at one or more call centers.
- the orchestrator 316 can provide an interface 326 configured to retrieve authentication information and propagate user authentication and/or credential information to one or more components of the system 300 to enable access to a user's account.
- the authentication information can identify one or more users, such as individuals who have accessed a tenant web site as a customer or who have interacted with the conversational agent system 100 previously.
- the interface 326 can provide an authentication mechanism for tenants seeking to authenticate users of the conversational agent system 100 .
- the dialog processing platform 120 can include a variety of end-user connectors 328 configured to interface the dialog processing platform 120 to one or more databases or data sources identifying end-users.
- the interface 326 can also enable access to the tenant's customer order and billing data via one or more catalog or e-commerce connectors 328 .
- the orchestrator 316 can also provide an interface 330 to tenant catalog and e-commerce data sources.
- the interface 330 can enable access to the tenant's catalog data which can be accessed via one or more catalog or e-commerce connectors 332 .
- the interface 330 enables access to tenant catalogs and/or catalog data and further enables the catalog data to be made available to the CTD modules 160 . In this way, data from one or more sources of catalog data can be ingested into the CTD modules 160 to populate the modules with product or item names, descriptions, brands, images, colors, swatches, as well as structured and free-form item or product attributes.
- the dialog processing platform 120 also includes a maestro component 334 .
- the maestro 334 enables administrators of the conversational agent system 100 to manage, deploy, and monitor conversational agent applications 106 independently.
- the maestro 334 provides infrastructure services to dynamically scale the number of instances of natural language resources, such as tenant subsystems 130 , ASR engines 140 , TTS synthesis engines 155 , NLA ensembles 145 , and CTD modules 160 .
- the maestro 334 can dynamically scale these resources as dialog traffic increases.
- the maestro 334 can deploy new resources without interrupting the processing being performed by existing resources.
- the maestro 334 can also manage updates to the CTD modules 160 with respect to updates to the tenant's e-commerce data and/or product catalogs.
- the maestro 334 provides the benefit of enabling the dialog processing platform 120 to operate as a highly scalable infrastructure for deploying artificially intelligent multi-modal conversational agent applications 106 for multiple tenants or multiple tenant subsystems 130 .
- the conversational agent system 100 can reduce the time, effort, and resources required to develop, test, and deploy conversational agents.
- the dialog processing platform 120 further includes a CTD module 160 .
- the CTD module 160 can implement methods to collect e-commerce data from tenant catalogs, product reviews, user account and order data, and user clickstream data collected at the tenant's web site to generate a data structure that can be used to learn specific domain knowledge and to onboard or bootstrap a newly configured conversational agent system 100 .
- the CTD module 160 can extract taxonomy labels associated with hierarchical relationships between categories of products and can associate the taxonomy labels with the products in the tenant catalog.
- the CTD module 160 can also extract structured product attributes (e.g., categories, colors, sizes, prices) and unstructured product attributes (e.g., fit details, product care instructions) and the corresponding values of those attributes.
- the CTD module 160 can normalize attribute values so that the attribute values share the same format throughout the catalog data structure. In this way, noisy values caused by poorly formatted content can be removed.
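- The normalization step above can be sketched as a small cleanup function; the specific cleanup rules (stripping markup, collapsing whitespace, lowercasing) are illustrative assumptions:

```python
import re

def normalize_value(raw):
    """Strip markup remnants, collapse whitespace, and lowercase so
    attribute values share one format across the catalog structure."""
    value = re.sub(r"<[^>]+>", "", raw)          # drop stray HTML tags
    value = re.sub(r"\s+", " ", value).strip()   # collapse whitespace
    return value.lower()

# Two noisy spellings of the same attribute value converge:
v1 = normalize_value("  Navy <b>Blue</b> ")
v2 = normalize_value("NAVY BLUE")
```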
- Products in an e-commerce catalog are typically organized in a multi-level taxonomy, which can group the products into specific categories.
- the categories can be broader at higher levels (e.g., there are more products) and narrower (e.g., there are fewer products) at lower levels of the product taxonomy.
- a product taxonomy associated with clothing can be represented as Clothing > Sweaters > Cardigans & Jackets.
- the category “Clothing” is quite general, while “Cardigans & Jackets” is a very specific type of clothing.
- a user's queries can refer to a category (e.g., dresses, pants, skirts, etc.) identified by a taxonomy label or to a specific product item (e.g., item #30018, Boyfriend Cardigan, etc.).
- a product search could either start from a generic category and narrow down to a specific product or vice versa.
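- The broad-to-narrow taxonomy above can be represented with a simple child-to-parent map; the map shape is an illustrative assumption used to recover the category path for a leaf label:

```python
def category_path(taxonomy, leaf):
    """Walk child -> parent links to produce the path from the root
    category down to a leaf taxonomy label."""
    path = [leaf]
    while taxonomy.get(path[-1]) is not None:
        path.append(taxonomy[path[-1]])
    return list(reversed(path))

# child -> parent map for the Clothing example in the text
taxonomy = {"Cardigans & Jackets": "Sweaters",
            "Sweaters": "Clothing",
            "Clothing": None}
path = category_path(taxonomy, "Cardigans & Jackets")
```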
- CTD module 160 can extract category labels from the catalog taxonomy, product attributes types and values, as well as product titles and descriptions.
- the CTD module 160 can automatically generate attribute type synonyms and lexical variations for each attribute type from search query logs, product descriptions and product reviews and can automatically extract referring expressions from the tenant product catalog or the user clickstream data.
- the CTD module 160 can also automatically generate dialogs based on the tenant catalog and the lexicon of natural language units or words that are associated with the tenant and included in the data structure.
- the CTD module 160 utilizes the extracted data to train classification algorithms to automatically categorize catalog categories and product attributes when provided in a natural language query by a user.
- the extracted data can also be used to train a full search engine based on the extracted catalog information.
- the full search engine can thus include indexes for each product category and attribute.
- the extracted data can also be used to automatically define a dialog frame structure that will be used by a dialog manager module, described later, to maintain a contextual state of the dialog with the user.
- the maestro 334 can interface with a plurality of natural language agent (NLA) ensembles 145 .
- Each of the NLA ensembles 145 can include one or more of a natural language understanding (NLU) module 336 , a dialog manager (DM) module 338 , and a natural language generator (NLG) module 340 .
- the NLA ensembles 145 can include pre-built automations, which when executed at run-time, implement dialog policies for a particular dialog context.
- the pre-built automations can include dialog policies associated with searching, frequently-asked-questions (FAQ), customer care or support, order tracking, and small talk or commonly occurring dialog sequences which may or may not be contextually relevant to the user's query.
- the NLA ensembles 145 can include reusable dialog policies, dialog state tracking mechanisms, domain and schema definitions. Customized NLA ensembles 145 can be added to the plurality of NLA ensembles 145 in a compositional manner as well.
- Each NLA ensemble 145 can include at least one of a natural language understanding (NLU) module 336 , a dialog manager (DM) module 338 , and a natural language generator (NLG) module 340 .
- the NLA ensemble 145 includes a natural language understanding (NLU) module 336 .
- the NLU module 336 can implement a variety of classification algorithms used to classify input text associated with a user utterance and generated by the ASR engines 140 into a semantic interpretation.
- the NLU 336 can classify input text when the utterance includes customer support requests/questions about products and services, as well as user queries.
- the NLU module 336 can implement a stochastic intent classifier and a named-entity recognizer ensemble to perform intent classification and information extraction, such as extraction of entity or user data.
- the NLU module 336 can combine different classification algorithms and can select the classification algorithm most likely to provide the best semantic interpretation for a particular task or user query by determining dialog context and integrating dialog histories.
- the classification algorithms included in the NLU module 336 can be trained in a supervised machine learning process using support vector machines or using conditional random field modeling methods. In some implementations, the classification algorithms included in the NLU module 336 can be trained using a convolutional neural network, a long short-term memory recurrent neural network, as well as a bidirectional long short-term memory recurrent neural network.
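- A minimal pure-Python stand-in for the supervised intent training described above (a token-count scorer rather than the SVM, CRF, or neural methods the text names); the training utterances and intent labels are invented for illustration:

```python
from collections import Counter, defaultdict

def train(examples):
    """Accumulate per-intent token counts from labeled utterances
    (a toy stand-in for supervised classifier training)."""
    counts = defaultdict(Counter)
    for text, intent in examples:
        counts[intent].update(text.lower().split())
    return counts

def classify(model, text):
    """Score each intent by summed token overlap and return the best."""
    tokens = text.lower().split()
    scores = {intent: sum(c[t] for t in tokens)
              for intent, c in model.items()}
    return max(scores, key=scores.get)

model = train([("when will my order be delivered", "order_status"),
               ("track my shipment status", "order_status"),
               ("show me cardigans in navy", "product_search"),
               ("do you have denim jackets", "product_search")])
pred = classify(model, "where is my order")
```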
- the NLU module 336 can receive the user query and can apply surface features and feature engineering, distributional semantic attributes, joint optimization of intent classifications and entity determinations, as well as rule-based domain knowledge in order to generate a semantic interpretation of the user query.
- the NLU module 336 can include one or more of intent classifiers (IC), named entity recognition (NER), and a model-selection component that can evaluate performance of various IC and NER components in order to select the configuration most likely to generate contextually accurate conversational results.
- the NLU module 336 can include competing models which can predict the same labels but using different algorithms and domain models where each model produces different labels (customer care inquiries, search queries, FAQ, etc.).
- the NLA ensemble 145 also includes a dialog manager (DM) module 338 .
- the DM module 338 can select a next action to take in a dialog with a user.
- the DM module 338 can provide automated learning from user dialog and interaction data.
- the DM module 338 can implement rules, frames, and stochastic-based policy optimization with dialog state tracking.
- the DM module 338 can maintain an understanding of dialog context with the user and can generate more natural interactions in a dialog by providing full context interpretation of a particular dialog with anaphora resolution and semantic slot dependencies.
- the DM module 338 can mitigate “cold-start” issues by implementing rule-based dialog management in combination with user simulation and reinforcement learning.
- sub-dialog and/or conversation automations can be reused in different domains.
- the DM module 338 can receive semantic interpretations generated by the NLU module 336 and can generate a dialog response action using a context interpreter, a dialog state tracker, a database of dialog history, and an ensemble of dialog action policies.
- the ensemble of dialog action policies can be refined and optimized using rules, frames and one or more machine learning techniques.
- the NLA ensemble 145 includes a natural language generator (NLG) module 340 .
- the NLG module 340 can generate a textual response based on the response action generated by the DM module 338 .
- the NLG module 340 can convert response actions into natural language and multi-modal responses that can be uttered or spoken to the user and/or can be provided as textual outputs for display to the user.
- the NLG module 340 can include a customizable template programming language which can be integrated with a dialog state at runtime.
- the NLG module 340 can be configured with a flexible template interpreter with dialog content access.
- the flexible template interpreter can be implemented using Jinja2, a web template engine.
- the NLG module 340 can receive a response action from the DM module 338 and can process the response action with dialog state information and using the template interpreter to generate output formats in speech synthesis markup language (SSML), VXML, as well as one or more media widgets.
- the NLG module 340 can further receive dialog prompt templates and multi-modal directives.
- the NLG module 340 can maintain or receive access to the current dialog state, a dialog history, and can refer to variables or language elements previously referred to in a dialog.
- the NLG module 340 can label a portion of the dialog as PERSON_TYPE and can associate a normalized GENDER slot value as FEMALE.
- the NLG module 340 can inspect the gender reference and customize the output by using the proper gender pronouns such as ‘her, she, etc.’
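- The pronoun customization above can be sketched with a small templating function; this is a pure-Python stand-in for the Jinja2-based interpreter named in the text, and the slot names and template syntax are illustrative assumptions:

```python
PRONOUNS = {"FEMALE": {"subj": "she", "obj": "her", "poss": "her"},
            "MALE":   {"subj": "he",  "obj": "him", "poss": "his"}}

def render(template, slots):
    """Fill a prompt template with slot values plus pronouns derived
    from a normalized GENDER slot value (Jinja2 filters could do the
    same at runtime)."""
    pronouns = PRONOUNS.get(slots.get("GENDER"),
                            {"subj": "they", "obj": "them", "poss": "their"})
    return template.format(**slots, **pronouns)

text = render("{name} updated {poss} order; {subj} will get an email.",
              {"name": "Alice", "GENDER": "FEMALE"})
```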
- FIG. 4 is a flowchart illustrating an example method for determining a textual response to an utterance of a user query processed by the dialog processing platform 120 described in relation to FIGS. 1 and 3 .
- data characterizing an utterance of a query associated with a tenant can be received.
- data characterizing the utterance can include audio data received by an input device 114 of client device 102 and provided to/received by the dialog processing platform 120 .
- the data characterizing the utterance can be provided via text; for example, a user can provide the utterance as textual input to a conversational agent configured in a web site of an e-commerce entity or tenant. The user can provide the utterance in regard to a goal or objective that the user seeks to accomplish in cooperation with the tenant.
- the user can provide the data characterizing the utterance of the query in a dialog with a conversational agent configured as an application 106 on the client device 102 .
- the received data can be provided to an automated speech recognition engine, such as ASR engine 140 , along with a profile selected from a plurality of profiles associated with the tenant.
- the profile can configure the ASR engine 140 to process the received data by specifying suitable configurations that are associated with the tenant and identified in the tenant profile.
- the configurations can include the tenant specific lexicon.
- the tenant specific lexicon can include domain language and channel audio characteristics associated with the tenant.
- the tenant specific lexicon can include product and/or service names, alternative phonetic annunciations and pronunciations, and audio channel information such as telephony or digital voice quality and/or audio coding types.
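- A hypothetical shape for such a tenant profile, and one way its lexicon could correct mis-recognized product names in a transcript; all field names and the variant list are invented for illustration, not the patent's schema:

```python
# Hypothetical tenant profile used to configure the ASR engine.
tenant_profile = {
    "tenant_id": "acme-apparel",
    "lexicon": {
        # canonical catalog term -> plausible mis-recognitions
        "Boyfriend Cardigan": ["boy friend cardigan", "boyfriend cardigen"],
    },
    "audio": {"channel": "telephony", "codec": "g711",
              "sample_rate_hz": 8000},
}

def apply_lexicon(transcript, profile):
    """Map known mis-recognitions back to canonical catalog terms."""
    for canonical, variants in profile["lexicon"].items():
        for variant in variants:
            transcript = transcript.replace(variant, canonical)
    return transcript

fixed = apply_lexicon("show me the boy friend cardigan", tenant_profile)
```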
- a language model for each state of a dialog can be identified.
- a language model can include a set of statically related or defined sentences.
- the language model can be identified when specific contextual conditions exist in a dialog, such as when the conversational agent expects to receive a business name. In such circumstances, a business name language model can be identified and activated.
- the ASR engine 140 can receive the data and process the audio data or textual data to determine a string of text corresponding to the data received at the client device 102 .
- the ASR engine 140 can receive the user's verbal utterance forming a query “When will my order be delivered?”.
- the ASR engine 140 can process the audio data including the verbal utterance to decompose the received data into a string of natural language units or words.
- the ASR engine 140 can select words to be included in the text string based on the profile associated with the tenant. In this way, the ASR engine 140 operates or is selected to operate in a manner that is most likely to generate a text string that is contextually most relevant to the tenant and best represents the intention of the user conveyed via the utterance.
- the profile can be defined and generated via the tenant portal 320 and can be distributed or made accessible to other components of the system 300 via the orchestrator 316 .
- the profile can be stored in the DPP server 302 and can be propagated to the maestro 334 .
- a tenant may prefer a TTS synthesis engine 155 configured with a male voice and customized to process specific product names which are not commonly recognized by an ASR engine 140 .
- the DPP server 302 can provide a TTS voice identifier to the TTS synthesis engine 155 each time speech is to be generated.
- the DPP server 302 can provide a list of specific product names which are not commonly recognized to the ASR engine 140 every time the system 300 is listening to the user.
- the maestro 334 can add more configurations of the TTS synthesis engines 155 based on the dialog context. By configuring the ASR engine 140 with a profile selected based on the tenant, specific ASR engine 140 technology can be easily changed, updated, and/or reconfigured on a tenant-specific basis.
- the NLA ensemble 145 can receive the text string characterizing the query.
- the query can include the utterance or portions of the utterance.
- the query can include a text request.
- the text string output by the ASR engine 140 can be conveyed to the NLA ensemble 145 for processing.
- the NLA ensemble 145 can process the text string to determine a textual response to the query.
- the text string can be first processed by the NLU module 336 to generate a semantic interpretation associated with the text string.
- the semantic interpretation can next be processed by the DM module 338 to determine a contextual sequence associated with the text string and a response action to the query (and the corresponding text string).
- the response action can then be processed by the NLG module 340 to determine a textual response corresponding to the response action.
- the NLA ensemble 145 has determined that the most contextually relevant response action to the user's query regarding the status of their order is “Your order will be delivered tomorrow.” Additional detail associated with processing the text string will be provided in the description of FIG. 6 .
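- The three-stage NLU, DM, NLG flow just described can be sketched end to end; each stage below is a toy stand-in for the corresponding module, and the intent and action names are illustrative assumptions:

```python
def nlu(text):
    # toy interpreter standing in for the NLU module's classifiers
    return {"intent": "order_status"}

def dm(state, interpretation):
    # toy policy standing in for the DM module
    if interpretation["intent"] == "order_status":
        return {"action": "report_status", "eta": "tomorrow"}
    return {"action": "clarify"}

def nlg(action):
    # toy realizer standing in for the NLG module
    if action["action"] == "report_status":
        return f"Your order will be delivered {action['eta']}."
    return "Could you rephrase that?"

def respond(text, state):
    """Chain the three ensemble stages: NLU -> DM -> NLG."""
    interpretation = nlu(text)
    action = dm(state, interpretation)
    return nlg(action)

reply = respond("When will my order be delivered?", {})
```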
- FIG. 5 is a flowchart illustrating an example method for providing a verbalized query response to a user via the client device 102 and the dialog processing platform 120 described in relation to FIGS. 1 and 3 .
- the textual response generated by the NLA ensemble 145 can be provided to the TTS synthesis engine 155 with the tenant profile.
- the tenant profile can be used to configure and select a TTS synthesis engine 155 associated with the tenant and such that the TTS synthesis engine 155 can generate a verbalized query response, which includes a plurality of natural language units or words selected from a lexicon associated with the tenant or the tenant's applications 106 .
- the NLA ensemble 145 has determined that the most contextually relevant response action to the user's query inquiring about the status of their order is “Your order will be delivered tomorrow.”
- the textual response action generated by the NLA ensemble 145 can be received by the TTS synthesis engine 155 .
- the TTS synthesis engine 155 can determine a verbalized query response, using the tenant profile.
- the DPP server 302 can receive a verbalized query response from the TTS engine 155 and in operation 515 , the DPP server 302 can provide the verbalized query response to the client device 102 .
- the client device 102 can further provide the verbalized query response to the user via the output device 116 , such as a speaker.
- the user can select between a textual modality and a voice or speech modality.
- the applications 106 can include a user-settable mechanism to configure the conversational agent for textual dialogs or voice dialogs.
- the DPP server 302 can exclude the ASR engines 140 and the TTS synthesis engines 155 and can transmit the textual data to the orchestrator 316 .
- FIG. 6 is a flowchart illustrating an example method for processing a text string characterizing a query.
- the text string characterizing the user's query, and generated by the ASR engine 140 , can be provided to the NLA ensemble 145 for processing to generate a textual response.
- the text string is initially provided to the NLU module 336 .
- a semantic representation associated with the text string can be generated by the NLU module 336 .
- the semantic representation can include attributes of the query such as the query intent, an intent type, and a category of the intent.
- the NLU module 336 can provide the location of the information extracted from the query. For example, the NLU module 336 can provide an index span indicating the position of a word in the query.
- the NLU module 336 can determine and provide confidence scores estimating the accuracy of the predictions as well as normalized values based on gazetteers and/or a backend database. For example, the NLU module 336 can normalize “trousers” to a taxonomic category “pants and shorts” based on the tenant's catalog data.
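The kind of semantic representation described (intent, slot value, index span, confidence score, and gazetteer-normalized value) might look like the following sketch. The gazetteer entries and scoring are assumptions made for the example, not the patented classification algorithms.

```python
# Toy gazetteer mapping user vocabulary to the tenant's taxonomy
# (entries invented for illustration).
GAZETTEER = {"trousers": "pants and shorts", "cardigan": "sweaters"}

def interpret(query: str, keyword: str) -> dict:
    """Build a toy semantic representation for one extracted slot value."""
    start = query.lower().find(keyword)
    return {
        "intent": "product_search",
        "slot": keyword,
        "span": (start, start + len(keyword)),      # index span in the query
        "normalized": GAZETTEER.get(keyword, keyword),
        "confidence": 0.9 if start >= 0 else 0.0,   # stand-in for a model score
    }
```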
- the DM module 338 determines a first contextual sequence associated with the text string. For example, the DM module 338 can receive the semantic representation generated by the NLU module 336 and can interpret the context of the semantic representation to determine a state of the dialog which the user's query is included. The DM module 338 can include a dialog state tracker and a dialog history component to determine the context of the semantic representation associated with the user's query.
- the DM module 338 can generate a response action based on the determined contextual sequence.
- the DM module 338 can further include an ensemble policy which can receive input from the dialog state tracker to generate the response action.
- the DM module 338 can generate the response action via one or more policy optimization models, rules, and/or frames.
- the DM module 338 can generate an optimal response to the user by combining a number of strategies.
- the DM module 338 can utilize a frame-based policy.
- the frame-based policy can determine intents and can associate slots to complete the task initiated by the user. Slots can include bits of information required to provide an answer to the user.
- if a user query is associated with purchasing shoes, it can be necessary to understand the type of shoes, the size of the shoe, and the width of the shoe, which can be a required parameter used to determine a suitable shoe fitting model.
- Mandatory and optional slots, as well as slots that are dependent on the value of other slots can be used to determine the next action of the dialog.
- the DM module 338 can determine which mandatory or optional slot may be necessary next in the dialog sequence based on which slot may shorten the time to reach the user's goal. For example, the DM module 338 can be configured to ask for a shoe style, since information received in regard to the shoe style can narrow down the potential choices more than dialog regarding the user's shoe size.
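The slot-selection heuristic described above can be sketched as follows. The frame, the mandatory/optional flags, and the per-slot "narrowing" scores are invented for the example; a deployed DM module would derive them from catalog statistics or a learned policy.

```python
from typing import Optional

# Toy frame for a shoe-purchase task (values fabricated for illustration):
# "narrowing" approximates how much asking for the slot reduces the candidates.
FRAME = {
    "style": {"mandatory": True,  "narrowing": 0.8},
    "size":  {"mandatory": True,  "narrowing": 0.3},
    "width": {"mandatory": False, "narrowing": 0.2},
}

def next_slot(filled: dict) -> Optional[str]:
    """Pick the unfilled slot expected to shorten the dialog the most,
    preferring mandatory slots over optional ones."""
    candidates = [(name, spec) for name, spec in FRAME.items()
                  if name not in filled]
    if not candidates:
        return None  # frame complete; the task can be fulfilled
    candidates.sort(key=lambda kv: (kv[1]["mandatory"], kv[1]["narrowing"]),
                    reverse=True)
    return candidates[0][0]
```

With this ordering the agent asks for the style first, matching the example in the text of style narrowing the choices more than size.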
- the DM module 338 can include one or more dialog policies.
- the dialog policies can be learned from data. For example, data associated with the sequences of dialog turns between the conversational agent/system 300 and the user can be converted into a vector representation and used to train a sequence model to predict the next optimal dialog action.
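As a deliberately tiny stand-in for the learned policy described above, the following counts which system action most often follows each observed dialog state in logged turn sequences, instead of training a sequence model over vector representations. The logged dialogs are fabricated for the example.

```python
from collections import Counter, defaultdict

def train_policy(dialogs):
    """dialogs: lists of (state, action) turns from past conversations.
    Returns a mapping from dialog state to the most frequent next action."""
    counts = defaultdict(Counter)
    for dialog in dialogs:
        for state, action in dialog:
            counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}

# Fabricated training logs standing in for real dialog-turn data.
LOGS = [
    [("greeting", "ask_need"), ("need_stated", "ask_size")],
    [("greeting", "ask_need"), ("need_stated", "ask_size")],
    [("greeting", "offer_help"), ("need_stated", "ask_style")],
]
POLICY = train_policy(LOGS)
```

A production system, as the text notes, would instead vectorize the turn sequences and train a sequence model to predict the next optimal dialog action.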
- the NLG module 340 can receive the response action generated by the DM module 338 and can generate a textual response.
- the NLG module 340 can include a copy of the dialog state from the dialog tracker configured in the DM module 338 and can process the action using a template interpreter.
- the template interpreter can include a Jinja or Jinja2 template interpreter written in the Python programming language.
- the template interpreter can output a textual response which can be further formatted by one or more output formatting components using SSML, VXML, and/or various other media widgets.
- the NLG module 340 can generate HyperText Markup Language (HTML) or meta-representations for GUI elements and content including clickable buttons, text, and images.
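A minimal sketch of the template-interpretation step follows. The platform is described as using a Jinja or Jinja2 interpreter; the standard library's `string.Template` stands in here so the example is self-contained, and the SSML wrapper shows one possible output formatting, not the patented formatter.

```python
from string import Template

# Toy response templates (a real system would use Jinja2 templates
# with access to the dialog state).
RESPONSE_TEMPLATES = {
    "inform_delivery_date": Template("Your order will be delivered $date."),
}

def render(action: str, **slots) -> str:
    """Fill the template for a response action with slot values."""
    return RESPONSE_TEMPLATES[action].substitute(**slots)

def to_ssml(text: str) -> str:
    """Wrap plain text in minimal SSML for the TTS synthesis engine."""
    return f"<speak>{text}</speak>"
```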
- FIG. 7 is a flowchart illustrating an example method for generating a first data structure.
- the data structure can be used by the NLU module 336 to generate the semantic representation associated with the text string characterizing a query.
- the data structure can include product attributes, product synonyms, referring expressions related to the tenant's products, and common dialogs related to the tenant's products.
- the data structure can be generated by the CTD module 160 .
- the CTD module 160 can determine one or more product attributes associated with an item from the tenant's catalog of products or items.
- the CTD module 160 can determine and generate the product attributes by extracting synonyms in a specific product domain.
- the product attributes can be used by the NLU module 336 to expand slot values associated with a particular product.
- the CTD module 160 and the data structure it generates can include the attributes of “moccasin, boots, heels, sandals” for a product identified as a “shoe”.
- the CTD module 160 can be trained on product or tenant domain data but can also learn patterns and context in which the words are used, thus allowing the CTD module 160 to automatically infer words with the same meaning.
- the CTD module 160 can employ word embeddings, lexical databases, such as WordNet, and lexical chains to determine the product attributes.
- the CTD module 160 can determine one or more synonyms associated with an item from the tenant product catalog.
- a product attribute can be a property or attribute of a product.
- a retailer category can be defined by a product taxonomy. For example, “sweaters” can be a category label associated with products in the clothing domain.
- the CTD module 160 can automatically determine that “pullovers”, “cardigans”, “turtleneck”, “shaker”, and “cardigan sweater” are all synonyms and are referring to the same category.
- the CTD module 160 can automatically expand the lexicon for both catalog searching and search query interpretation.
- the CTD module 160 can use both word and sentence embeddings and can extract similar words from a specific domain and click stream data from search query logs.
- the CTD module 160 can use prebuilt embeddings or can train specific embeddings for the domain using catalog and review data. Additionally, the CTD module 160 can include a classifier that can automatically classify unseen search terms into a taxonomy label.
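One simple way to classify an unseen search term into a taxonomy label, as described above, is nearest-neighbor search over embeddings. The 3-dimensional vectors below are fabricated for the example; a real system would use trained word or sentence embeddings over the catalog and review data.

```python
import math

# Fabricated toy embeddings: category centroids and term vectors.
CATEGORY_VECTORS = {
    "sweaters":         [0.9, 0.1, 0.0],
    "pants and shorts": [0.1, 0.9, 0.0],
}
TERM_VECTORS = {
    "turtleneck": [0.8, 0.2, 0.1],
    "trousers":   [0.0, 1.0, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def classify(term: str) -> str:
    """Assign a term the taxonomy label of its most similar category vector."""
    vec = TERM_VECTORS[term]
    return max(CATEGORY_VECTORS, key=lambda c: cosine(vec, CATEGORY_VECTORS[c]))
```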
- the CTD module 160 can determine one or more referring expressions associated with an item from the tenant product catalog. Additionally, or alternatively, the CTD module 160 can determine one or more referring expressions based on interactive user data associated with the item.
- the CTD module 160 can automatically learn how customers refer to items in the tenant's product catalog. For example, the CTD module 160 can process the tenant catalog and clickstream data received from users visiting the tenant's website or online product catalog and can apply word embeddings and sequence-to-sequence models. Semantic similarities can be determined and the results can be ranked for inclusion in the data structure.
- the CTD module 160 can generate the data structure based on operations 705 - 715 .
- the data structure can then be used to update the classification algorithms included in the NLU module 336 .
- the orchestrator 316 can configure periodic, e.g., daily, updates to the CTD module 160 and the data structure. For example, billing, order, catalog, clickstream, and review data can be uploaded to the CTD module 160 and processed to extract product titles, descriptions, and attributes.
- the CTD module 160 can normalize attribute values, extract keywords and n-grams, tokenize the data, and define a search index for use in the data structure.
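The tokenize/n-gram/index steps just described can be sketched as a small inverted index. The catalog rows are invented for illustration; real attribute normalization and ranking would be considerably richer.

```python
from collections import defaultdict

def tokenize(text: str) -> list:
    """Lowercase whitespace tokenization (a stand-in for real normalization)."""
    return text.lower().split()

def ngrams(tokens, n=2):
    """Contiguous n-grams over a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_index(catalog):
    """Map each token and bigram to the set of product ids containing it."""
    index = defaultdict(set)
    for product_id, title in catalog.items():
        tokens = tokenize(title)
        for term in tokens + ngrams(tokens):
            index[term].add(product_id)
    return index

# Fabricated mini-catalog for the example.
INDEX = build_index({1: "red leather boots", 2: "red canvas sandals"})
```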
- the data structure can then be used in the NLU module 336 to update a search index, optimize ranking functions, and update the classification algorithms used to generate the semantic interpretation associated with the text string characterizing the user's query.
- FIG. 8 is a flowchart illustrating an example method for generating an initial conversation prompt via a multi-modal conversational agent of the system described in relation to FIG. 1 .
- prior to receiving data characterizing an utterance of a user query, the conversational agent system 100 can generate an initial conversation prompt and configure the conversational agent 106 on the client device 102 to communicate and conduct multi-modal dialog exchanges with the dialog processing platform 120 .
- the web site receives an input provided via the web browser configured on the client device 102 .
- the user can provide the input, for example, by clicking the “Speak” button in the web site.
- the dialog processing platform 120 receives validation data associated with the client device 102 .
- a network connection will be initiated, e.g., via web sockets, and the web browser configured with application 205 , can be authenticated and registered through the DPP server 302 .
- the DPP server 302 can receive validation data about the audio and graphical processing capabilities of the client device 102 and can validate whether the client device 102 is able to render graphics and capture audio in real-time.
- the DPP server 302 can generate a conversation initiation message and provide the conversation initiation message to the maestro component 334 .
- the maestro component 334 can provide an initial conversation response message back to the DPP server 302 which can initiate a call to the TTS synthesis engine 155 via the TTS adapter 150 .
- the DPP server 302 will begin streaming audio data from the TTS adapter 150 to the application 205 .
- the DPP server 302 will generate an initial conversation prompt by providing an audible prompt and textual output on the display 112 of the client device 102 .
- the initial conversation prompt can inform the user that the system 100 is ready to receive a user query, for example, the initial conversation prompt can include “Hello. Welcome to ACME shoes. How may I help you?”.
- the client device 102 can receive data characterizing the utterance of a query associated with the tenant as described earlier in the discussion of FIG. 4 , operation 405 .
- FIG. 9 is a diagram illustrating an example data flow 900 for receiving and processing a user query using the multi-modal conversational agent system 100 of FIG. 1 .
- the conversational agent system 100 can receive data characterizing an utterance of a query. The data can be received in the context of a dialog and processed as follows.
- the client device 102 can receive a user query, such as “I am looking for a pair of elegant shoes for my wife”.
- the client device 102 can capture the utterance associated with the query via the microphone 114 configured on the client device 102 .
- the captured audio data is streamed by web application 205 to the DPP server 302 in addition to a profile associated with the tenant.
- the DPP server 302 streams the received audio data to the ASR adapter 135 .
- the ASR adapter 135 can provide the audio data to an ASR engine 140 associated with the tenant profile.
- the ASR engine 140 can be a pre-configured cloud-based ASR engine, such as the Google Cloud ASR offered by Google, LLC of Mountain View, Calif., U.S.A.
- the ASR engine 140 processes the audio data in real-time until the user completes the utterance associated with the query. After completing the utterance, the user is likely to pause and await a reply from the conversational agent system 100 .
- the ASR engine 140 can detect the end of the utterance and the subsequent period of silence and can provide the DPP server 302 with the best hypothetical text string corresponding to the user's utterance.
- the ASR engine 140 can generate a text string which exactly matches the words of the user's utterance.
- the text string can be combined with other parameters related to the processed utterance.
- the other parameters can include rankings associated with the recognized speech. The rankings can be dynamically adjusted based on the NLU module 336 .
- the NLU module 336 can process the top hypotheses generated by the ASR engine 140 and can evaluate those hypothetical responses in the context of other responses generated by the NLU module 336 so that the top hypothesis is selected over another hypothesis which can include a lower confidence ranking.
- the parameters can be associated with errors such as phonetically similar words. Small variations in text strings can be mitigated using similarity measures, such as the Levenshtein distance or fuzzy matching algorithm.
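The Levenshtein distance mentioned above can be computed with the standard dynamic-programming recurrence; a sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

A small distance between two ASR hypotheses (e.g., "show" vs. "shoe") flags the phonetically similar variants the text describes, which can then be resolved against the NLU context.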
- in step 4 , the DPP server 302 can provide the text string to the orchestrator component 316 and await a reply.
- the orchestrator component 316 can transmit the text string to the maestro component 334 .
- the maestro component 334 can provide the text string to the NLA ensemble 145 for response processing.
- the NLA ensemble 145 can determine the current state of the dialog via the DM module 338 and can generate a contextually appropriate textual response to the query.
- the NLA ensemble 145 can also, in some implementations, generate graphical content associated with the query and the dialog context to be displayed on the display 112 of the client device 102 .
- the textual response and the corresponding graphical content can be provided in a device-agnostic format.
- the NLA ensemble 145 can determine that the contextually appropriate textual response to the query is “I can help you with that. What size does she usually wear?”
- the dialog processing platform 120 can perform an authentication of the user.
- the orchestrator component 316 can be granted access to the user's account in the event that the user's query requires information associated with a specific order or account. For example, if the user utters “When will my order arrive?”, the maestro component 334 can interpret the utterance and query, via the orchestrator component 316 (as in step 6 ), and can prompt the user to provide account authentication credentials in order to determine the status of the order in step 6 a.
- the orchestrator component 316 can cache the authentication token for the duration of the dialog session to avoid repeating the authentication steps for other queries.
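The token-caching behavior described above might be sketched as follows; the class name and the 30-minute expiry are assumptions made for the example, not details from the patent.

```python
import time

class SessionAuthCache:
    """Cache an authentication token per dialog session so the user is not
    re-prompted for credentials on subsequent queries in the same session."""

    def __init__(self, ttl_seconds: float = 1800.0):  # assumed 30-minute TTL
        self.ttl = ttl_seconds
        self._tokens = {}  # session_id -> (token, expiry timestamp)

    def store(self, session_id: str, token: str) -> None:
        self._tokens[session_id] = (token, time.monotonic() + self.ttl)

    def get(self, session_id: str):
        """Return the cached token, or None if absent or expired."""
        entry = self._tokens.get(session_id)
        if entry is None or entry[1] < time.monotonic():
            self._tokens.pop(session_id, None)
            return None
        return entry[0]
```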
- the orchestrator component 316 can format the textual response and graphical content into a suitable format for the configuration of the client device 102 .
- the orchestrator component 316 can apply tenant-defined brand customizations provided via the customer portal 320 .
- the customizations can specify a color palette, font style, images and image formatting, and TTS synthesis engines 155 to use which may include one or more alternate voice dialects.
- the DPP server 302 can provide the textual response to the TTS adapter 150 to initiate speech synthesis processing by the TTS synthesis engines 155 to generate a verbalized query response.
- the TTS synthesis engines 155 can be remotely located from the DPP server 302 , such as when configured in a cloud-based, distributed conversational agent system.
- the DPP server 302 can also provide, in step 10 , the textual response graphically with the appropriate formatting on the display 112 of the client device 102 .
- the TTS adapter 150 can begin retrieving audio data associated with the verbalized query response from the TTS synthesis engine 155 in response to a request from the DPP server 302 .
- the TTS adapter 150 can subsequently provide, or stream, the verbalized query response to the DPP server 302 .
- the DPP server 302 can act as a proxy by sending the verbalized query response to web application 205 on the client device 102 .
- the web application 205 can provide the verbalized query response to the user via the output device 116 , audibly informing the user “I can help you with that. What size shoe does she usually wear?”.
- Steps 1 - 10 can be performed in an iterative manner via the client device 102 and the dialog processing platform 120 until the user's query has been fulfilled or the user terminates the dialog session.
- the web application 205 configured as the conversational agent on the client device 102 , can enable the user to switch from speech to text as input and output modes as well as switching from text to speech as input and output modes.
- Exemplary technical effects of the methods, systems, and computer-readable medium described herein include, by way of non-limiting example, processing a user query using a multi-modal conversation agent system.
- the conversational agent system can provide scalable, modular natural language processing resources for multiple tenants for which the user query can be directed.
- the conversational agent system can provide improved interfaces for processing the user query using distributed natural language resources.
- the conversational agent system can improve the contextual accuracy of conversational agent dialogs using a catalog-to-dialog data structure incorporated into a machine learning process used to train classification algorithms configured to process the user query and generate query responses.
- the conversational agent system also provides improved interfaces for tenants to customize conversational agent branding and provide more accurate dialog responses based on integrated e-commerce data sources such as user account, billing, and customer order data.
- the subject matter described herein can be implemented in analog electronic circuitry, digital electronic circuitry, and/or in computer software, firmware, or hardware, including the structural means disclosed in this specification and structural equivalents thereof, or in combinations of them.
- the subject matter described herein can be implemented as one or more computer program products, such as one or more computer programs tangibly embodied in an information carrier (e.g., in a machine-readable storage device), or embodied in a propagated signal, for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers).
- a computer program (also known as a program, software, software application, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a computer program does not necessarily correspond to a file.
- a program can be stored in a portion of a file that holds other programs or data, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
- a computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
- processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processor of any kind of digital computer.
- a processor will receive instructions and data from a read-only memory or a random access memory or both.
- the essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
- Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, (e.g., EPROM, EEPROM, and flash memory devices); magnetic disks, (e.g., internal hard disks or removable disks); magneto-optical disks; and optical disks (e.g., CD and DVD disks).
- the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- the subject matter described herein can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, (e.g., a mouse or a trackball), by which the user can provide input to the computer.
- Other kinds of devices can be used to provide for interaction with a user as well.
- feedback provided to the user can be any form of sensory feedback, (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input.
- modules refers to computing software, firmware, hardware, and/or various combinations thereof. At a minimum, however, modules are not to be interpreted as software that is not implemented on hardware, firmware, or recorded on a non-transitory processor readable recordable storage medium (i.e., modules are not software per se). Indeed “module” is to be interpreted to always include at least some physical, non-transitory hardware such as a part of a processor or computer. Two different modules can share the same physical hardware (e.g., two different modules can use the same processor and network interface). The modules described herein can be combined, integrated, separated, and/or duplicated to support various applications.
- a function described herein as being performed at a particular module can be performed at one or more other modules and/or by one or more other devices instead of or in addition to the function performed at the particular module.
- the modules can be implemented across multiple devices and/or other components local or remote to one another. Additionally, the modules can be moved from one device and added to another device, and/or can be included in both devices.
- the subject matter described herein can be implemented in a computing system that includes a back-end component (e.g., a data server), a middleware component (e.g., an application server), or a front-end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described herein), or any combination of such back-end, middleware, and front-end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- Approximating language can be applied to modify any quantitative representation that could permissibly vary without resulting in a change in the basic function to which it is related. Accordingly, a value modified by a term or terms, such as “about,” “approximately,” and “substantially,” is not to be limited to the precise value specified. In at least some instances, the approximating language can correspond to the precision of an instrument for measuring the value.
- range limitations can be combined and/or interchanged, such ranges are identified and include all the sub-ranges contained therein unless context or language indicates otherwise.
Abstract
A method includes receiving data characterizing an utterance of a query associated with a tenant; providing, to an automated speech recognition engine, the received data and a profile selected from a plurality of profiles based on the tenant, the profile configuring the automated speech recognition engine to process the received data; receiving, from the automated speech recognition engine, a text string characterizing the query; and processing, via an ensemble of natural language agents configured based on the tenant, the text string characterizing the query to determine a textual response to the query, the textual response including at least one word from a first lexicon associated with the tenant. Related systems, methods, apparatus, and computer readable mediums are also described.
Description
- This application is a continuation of U.S. patent application Ser. No. 16/696,482, filed on Nov. 26, 2019, entitled “Multi-modal Conversational Agent Platform”, which is hereby incorporated by reference in its entirety.
- Conversational agents can interact directly with users via voice or text modalities. A conversational agent and a user can exchange information with each other in a series of steps to fulfill a specific goal or objective of the user. The exchange of information can form a dialog between the conversational agent and the user. Information supplied by the user during one or more steps of the dialog can be processed by a system in which the conversational agent is configured and deployed to provide contextually relevant outputs relating to each of the dialog steps. In this way, the system can generate statements and/or questions during the dialog with the user in a contextually accurate and efficient manner with regard to the specific goal or objective of the user.
- Conversational agents can be utilized in e-commerce applications to allow a retail or service provider entity to interact with potential or existing customers in regard to a product or service without requiring a human customer support operator. Conversational agents can process data received in a variety of modalities, such as voice, text, and/or web site interactions. Conversational agents can also process data received from a variety of input devices, such as computing devices, which may for example display a website of an e-commerce retailer, a browser-enabled smartphone or mobile computing device, as well as intelligent or virtual personal assistant devices.
- In an aspect, a method includes receiving data characterizing an utterance of a query associated with a tenant; providing, to an automated speech recognition engine, the received data and a profile selected from a plurality of profiles based on the tenant, the profile configuring the automated speech recognition engine to process the received data; receiving, from the automated speech recognition engine, a text string characterizing the query; and processing, via an ensemble of natural language agents configured based on the tenant, the text string characterizing the query to determine a textual response to the query, the textual response including at least one word from a first lexicon associated with the tenant.
- One or more of the following features can be included in any feasible combination. For example, the method can include providing, to a text-to-speech synthesis engine, the textual response and the profile; receiving, from the text-to-speech synthesis engine, a verbalized query response determined by the text-to-speech synthesis engine based on the textual response; and providing the verbalized query response. The method can include providing a first configuration of a graphical user interface on a first client device based on the profile, the client device configured to receive the utterance from a user. Processing the text string characterizing the query can include generating a semantic interpretation associated with the text string, the semantic interpretation generated using at least one of a plurality of classification algorithms trained using a first machine learning process associated with the tenant; determining a first contextual sequence associated with the text string based on one or more previously processed text strings; generating a first response action based on the determined first contextual sequence; and generating the textual response based on the generated first response action.
- The semantic interpretation can be generated using a first data structure representing the first lexicon associated with the tenant. The first data structure can be generated based on at least one of: a catalog of items associated with the tenant and including a first item title and a first item description; one or more reviews associated with a first item; interactive user data associated with a first item; or a combination thereof. Generating the first data structure can include determining one or more attributes associated with a first item from the catalog of items; determining one or more synonyms associated with the first item from the catalog of items; determining one or more referring expressions associated with the first item from the catalog of items and/or the interactive user data associated with the first item; generating the first data structure based on the determining steps, the first data structure including a name, one or more attributes, one or more synonyms, one or more referring expressions, and/or one or more dialogs corresponding to the first item. The first data structure can be used in the first machine learning process to train the at least one of a plurality of classification algorithms.
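One way to picture the per-item data structure described above — name, attributes, synonyms, referring expressions — is a small builder over a catalog entry and logged user interactions. The extraction heuristics here (title words as synonyms, substring matching for referring expressions) are naive placeholders for the learned generators the passage implies:

```python
def build_item_entry(item: dict, interactions: list) -> dict:
    # Derive the per-item data structure: name, attributes, synonyms, and
    # referring expressions (helper logic is a naive stand-in).
    name = item["title"]
    attributes = item.get("attributes", [])
    # Synonyms: content words from the title (a learned generator in practice).
    synonyms = [w.lower() for w in name.split() if len(w) > 3]
    # Referring expressions: user utterances that mention any synonym.
    referring = [u for u in interactions if any(s in u.lower() for s in synonyms)]
    return {
        "name": name,
        "attributes": attributes,
        "synonyms": synonyms,
        "referring_expressions": referring,
    }
```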
- The method can include receiving second data characterizing an utterance of a second query associated with a second tenant; providing, to a second automated speech recognition engine, the received second data and a profile selected from a plurality of profiles based on the second tenant, the profile configuring the second automated speech recognition engine to process the received second data; receiving, from the second automated speech recognition engine, a text string characterizing the second query; and processing, via the ensemble of natural language agents configured based on the second tenant, the text string characterizing the second query to determine a textual response to the second query, the textual response including at least one word from a second lexicon associated with the second tenant.
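Handling a second tenant with its own profile and lexicon amounts to multi-tenant dispatch: each query is routed to the subsystem registered for its tenant. A minimal sketch (class and method names are assumptions, not from the patent):

```python
class TenantSubsystem:
    # Per-tenant processing path with its own lexicon.
    def __init__(self, tenant: str, lexicon: set):
        self.tenant = tenant
        self.lexicon = lexicon

    def process(self, text: str) -> str:
        matched = [w for w in text.split() if w in self.lexicon]
        return f"[{self.tenant}] handling: {', '.join(matched) or 'general query'}"

class Platform:
    # Multi-tenant dispatch: each query is routed by its tenant identifier.
    def __init__(self):
        self._subsystems = {}

    def register(self, subsystem: TenantSubsystem):
        self._subsystems[subsystem.tenant] = subsystem

    def handle(self, tenant: str, text: str) -> str:
        return self._subsystems[tenant].process(text)
```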
- The utterance of the query can include a plurality of natural language words spoken by a user and received by an input device of a first computing device. The utterance of the query can be provided by the user in regard to a first context associated with a first item provided by the tenant. The profile can include one or more configuration settings associated with the ensemble of natural language agents configured on a server including a data processor, one or more configuration settings associated with an ensemble of natural language agents configured on the first computing device, and one or more configuration settings specifying one or more speech processing engines configured on the server including the data processor.
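The three groups of configuration settings a profile carries — server-side agent settings, device-side agent settings, and the choice of speech processing engines — could be organized as a nested structure like the following (all keys are illustrative, not the patent's schema):

```python
# Illustrative profile layout covering the three groups of settings above.
profile = {
    "tenant": "acme-retail",
    "server_nla_config": {"ensemble": ["nlu", "dm", "nlg"], "lexicon": "acme-v2"},
    "device_nla_config": {"local_intents": ["help", "repeat"]},
    "speech_engines": {"asr": "provider-a/streaming", "tts": "provider-b/neural"},
}

def speech_engines_for(profile: dict):
    # Resolve which interchangeable speech processing engines to instantiate.
    engines = profile["speech_engines"]
    return engines["asr"], engines["tts"]
```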
- The tenant can include at least one of a retail entity, a service provider entity, a financial entity, a manufacturing entity, an entertainment entity, an information storage entity, and a data processing entity.
- The automated speech recognition engine can be configured to receive audio data corresponding to the utterance of the query and to generate, in response to the receiving, the text string including textual data corresponding to the received audio data, the automated speech recognition engine being selected from one or more inter-changeable speech processing engines included in the profile. The text-to-speech synthesis engine can be configured to receive the textual response, and to generate, in response to the receiving, the verbalized query response including audio data corresponding to the received textual response, the text-to-speech synthesis engine being selected from one or more inter-changeable speech processing engines included in the profile. The method can include receiving, prior to receiving data characterizing the utterance of the query, an input to a web site provided via a web browser configured on a first computing device, the input causing the web browser to be authenticated and registered at a second computing device coupled to the first computing device via a network.
- The method can include receiving, by the second computing device, validation data associated with the first computing device, the validation data including audio and graphical rendering settings configured on the first computing device; generating, in response to confirming the validation data, an initial conversation prompt by the second computing device and providing the initial conversation prompt to the web site configured on the first computing device; receiving, at an input device coupled to the first computing device and in response to providing the initial conversation prompt via the web site, the data characterizing the utterance of the query, the query associated with an item available via the web site; transmitting the provided verbalized query response to the first computing device; and providing the verbalized query response to the user via an output device coupled to the first computing device. The data characterizing the utterance of the query associated with the tenant can be provided via a textual interaction modality or via a speech interaction modality.
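The validation-then-prompt handshake described above can be sketched as two small server-side steps; the settings keys and greeting text are invented for illustration:

```python
def validate_client(validation: dict) -> bool:
    # Confirm the client reported usable audio and graphical rendering settings.
    return bool(validation.get("audio")) and bool(validation.get("graphics"))

def start_session(validation: dict) -> str:
    # On successful validation, produce the initial conversation prompt
    # to send to the web site on the first computing device.
    if not validate_client(validation):
        raise ValueError("client validation failed")
    return "Hi! How can I help you today?"
```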
- Non-transitory computer program products (i.e., physically embodied computer program products) are also described that store instructions, which when executed by one or more data processors of one or more computing systems, cause at least one data processor to perform operations herein. Similarly, computer systems are also described that may include one or more data processors and memory coupled to the one or more data processors. The memory may temporarily or permanently store instructions that cause at least one processor to perform one or more of the operations described herein. In addition, methods can be implemented by one or more data processors either within a single computing system or distributed among two or more computing systems. Such computing systems can be connected and can exchange data and/or commands or other instructions or the like via one or more connections, including a connection over a network (e.g., the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), via a direct connection between one or more of the multiple computing systems, etc.
- These and other features will be more readily understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
- FIG. 1 illustrates an example architecture of a system including a dialog processing platform, a client device configured as a multi-modal conversational agent, and a machine learning platform;
- FIG. 2 illustrates an example architecture of a client device configured as a multi-modal conversational agent of the system described in FIG. 1;
- FIG. 3 illustrates an example architecture of a dialog processing platform of the system described in FIG. 1;
- FIG. 4 is a flowchart illustrating an example method for determining a textual response to an utterance of a query provided by a user via a client device of the system described in FIG. 1;
- FIG. 5 is a flowchart illustrating an example method for providing a verbalized query response to a user via a client device of the system described in FIG. 1;
- FIG. 6 is a flowchart illustrating an example method for processing a text string characterizing a query;
- FIG. 7 is a flowchart illustrating an example method for generating a first data structure used in generating the semantic representation associated with the text string characterizing a query;
- FIG. 8 is a flowchart illustrating an example method for generating an initial conversation prompt via a multi-modal conversational agent of the system described in FIG. 1; and
- FIG. 9 is a diagram illustrating an example data flow for processing a dialog using a multi-modal conversational agent and the system of FIG. 1.
- It is noted that the drawings are not necessarily to scale. The drawings are intended to depict only typical aspects of the subject matter disclosed herein, and therefore should not be considered as limiting the scope of the disclosure.
- Advances in natural language processing have enabled a proliferation of digital endpoint devices capable of providing voice recognition capabilities. Personal and mobile computing devices, intelligent or virtual assistant devices, televisions, and even automobiles can receive voice-based inputs, often in addition to text-based inputs, and process the inputs in regard to a specific user objective or goal. A multi-modal conversational agent can be configured on or within these digital endpoint devices to receive voice or text-based inputs and to process the inputs in the context of a dialog with the user. A user can interact with the conversational agent in a dialog about a product offered by a retail or manufacturing entity; a service provided by a service provider such as an insurance company or a medical facility; or a transaction by a financial or banking entity; and/or the like.
- The backend architectures coupled to conversational agents, which receive and process user dialog data from the digital endpoint devices, can include closed, proprietary interfaces. As a result, the backend architectures coupled to many conversational agents deployed in a variety of digital endpoint devices cannot be easily extended or reconfigured to process a wider variety of endpoint devices, user queries, and dialogs beyond those that the conversational agent and corresponding backend architecture were originally designed to process. For example, a backend architecture coupled to a conversational agent associated with an endpoint device that can receive textual dialog inputs may be unable to process verbal dialog inputs. Additionally, a backend architecture coupled to a conversational agent associated with a retail entity may be unable to process textual or voice dialog data associated with an entertainment entity or a financial services entity. Similarly, a backend architecture associated with a conversational agent deployed in a customer support function of a retail entity may be unable to process user dialog inputs corresponding to new items or updated pricing in a catalog of the retail entity.
- Many conversational agent architectures do not provide the flexibility to mix-and-match different speech or natural language processing resources. For instance, existing conversational agent architectures may not provide a means for configuring and deploying new, updated, or alternate speech processing and/or natural language understanding resources. The speech or language processing resources of many conversational agent architectures are integrated within the architecture and are not replaceable with alternate natural language processing resources. In addition, even if new resources could be added, many conversational agent architectures cannot support or be reconfigured to support new digital endpoint devices that are not part of the conversational agent architecture as originally designed. For example, a conversational agent backend architecture may be configured to process textual dialog inputs provided to a conversational agent utilized in a website. The backend architecture may be able to process the textual inputs provided by a user via a keyboard of a mobile or personal computing device at which the user is viewing the website. However, the backend architecture may be unable to process voice inputs provided via a microphone of the mobile or personal computing device. The lack of re-configurability and modularity of backend architectures limits the ability of existing conversational agent systems to support new digital endpoint devices, new natural language processing resources, and new lexicons. The inability to efficiently configure and deploy new processing resources in conversational agent frontend and backend architectures can reduce user engagement, customer satisfaction, and revenue for the entities deploying the conversational agent.
- In some implementations, the conversational agent frontend and backend architecture described herein allows entities deploying conversational agents to configure and/or reconfigure the natural language processing resources that best suit the application or application domain. The conversational agent frontend and backend architecture described herein can also enable entities deploying conversational agents to support a broader variety of user input/output devices that are not necessarily from the same technology provider or originally intended to operate with a particular conversational agent backend. The conversational agent frontend and backend architecture described herein includes components that can easily integrate multiple input modalities provided via smartphones with multi-touch and keyboard capabilities, and also includes backend adaptors or connectors to simplify the user's authentication and to provide access to backend application programming interfaces (APIs) from different frontend device or application configurations.
- Accordingly, example conversational agent systems described herein enable system operators to replace or change backend components without altering the client user interface or other client-side processing implementations for speech and/or textual agent modalities. This can be especially beneficial when changing audio streaming configurations to adapt to different speech providers. The example conversational agent systems described herein can reduce client-side incompatibilities when configuring new or alternate backend language processing resources. In this way, the client-side interfaces and implementations remain unchanged regardless of which natural language processing components or resources are used.
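A stable client interface over replaceable backend components is essentially an adapter/registry pattern. A sketch of a speech recognition adapter in that spirit (names are assumptions; real engines would be network clients rather than plain callables):

```python
class AsrAdapter:
    # Uniform server-side interface over interchangeable ASR engines, so
    # engines can be swapped without touching client-side code.
    def __init__(self):
        self._engines = {}

    def register(self, name, engine):
        # `engine` is any callable mapping audio bytes to a text string.
        self._engines[name] = engine

    def transcribe(self, audio: bytes, engine_name: str) -> str:
        # The engine is chosen at call time, e.g. from a tenant profile.
        return self._engines[engine_name](audio)
```

Swapping speech providers then means registering a different engine under the tenant's configured name, with no change to callers of `transcribe`.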
- For example, the conversational agent frontend and backend architecture described herein can provide a modular, configurable architecture for use in a variety of domains. The improved conversational agent architecture described herein can include components to automatically extract information from a variety of domain resources, such as user data, website interaction data, product and/or services data, and customer order and billing data, which can be used to train one or more components of the multi-modal conversational agent architecture described herein. The conversational agent architecture described herein can utilize the extracted information to automatically generate synonyms for the names and characterizations of the products and/or services, which can then be used in dialog sequences with a user of the conversational agent. The conversational agent architecture described herein can also generate search indexes optimized for user inputs, as well as enhanced models used for natural language processing and dialog management. In this way, the conversational agent architecture described herein can more accurately capture and utilize a domain-specific lexicon to provide users with a more focused, satisfying, and robust dialog experience via the conversational agent.
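To make the training idea concrete, here is a deliberately tiny bag-of-words intent scorer trained on utterance/intent pairs of the kind the catalog-to-dialog extraction could produce; it is a stand-in for the real machine learning pipeline, not the patented method:

```python
from collections import Counter, defaultdict

def train_classifier(examples):
    # Minimal bag-of-words intent scorer trained on (utterance, intent)
    # pairs derived from catalog and interaction data.
    counts = defaultdict(Counter)
    for utterance, intent in examples:
        counts[intent].update(utterance.lower().split())

    def classify(utterance: str) -> str:
        tokens = utterance.lower().split()
        return max(counts, key=lambda intent: sum(counts[intent][t] for t in tokens))

    return classify
```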
-
FIG. 1 illustrates an example architecture of a conversational agent system 100 including a client device 102, a dialog processing platform 120, and a machine learning platform 165. The client device 102, the dialog processing platform 120, and the machine learning platform 165 can be communicatively coupled via a network, such as network 118. In broad terms, a user can provide an input associated with a query to the client device 102 via input device 114. The client device 102 can include a frontend of the conversational agent system 100. A conversational agent can be configured on the client device 102 as one or more applications 106. The conversational agent can transmit data associated with the query to a backend of the conversational agent system 100. The dialog processing platform 120 can be configured as the backend of the conversational agent system 100 and can receive the data from the client device 102 via the network 118. The dialog processing platform 120 can process the transmitted data to generate a response to the user query and can provide the generated response to the client device 102. The client device 102 can then output the query response via the output device 116. A user may iteratively provide inputs and receive outputs via the conversational agent system 100 in a dialog. The dialog can include natural language units, such as words, which can be processed and generated in the context of a lexicon that is associated with the domain of the subsystem for which the conversational agent system 100 has been implemented. - As shown in
FIG. 1, the conversational agent system 100 includes a client device 102. The client device 102 can include a large-format computing device or any other fully functional computing device, such as a desktop or laptop computer, which can transmit user data to the dialog processing platform 120. Additionally, or alternatively, other computing devices, such as small-format computing devices 102, can also transmit user data to the dialog processing platform 120. Small-format computing devices 102 can include a tablet, smartphone, intelligent or virtual digital assistant, or any other computing device configured to receive user inputs as voice and/or textual inputs and provide responses to the user as voice and/or textual outputs. - The
client device 102 includes a memory 104, a processor 108, a communications module 110, and a display 112. The memory 104 can store computer-readable instructions and/or data associated with processing multi-modal user data via a frontend and backend of the conversational agent system 100. For example, the memory 104 can include one or more applications 106 implementing a conversational agent frontend. The applications 106 can provide speech and textual conversational agent modalities to the client device 102, thereby configuring the client device 102 as a digital or telephony endpoint device. The processor 108 operates to execute the computer-readable instructions and/or data stored in memory 104 and to transmit the computer-readable instructions and/or data via the communications module 110. The communications module 110 transmits the computer-readable instructions and/or user data stored on or received by the client device 102 via network 118. The network 118 connects the client device 102 to the dialog processing platform 120. The network 118 can also be configured to connect the machine learning platform 165 to the dialog processing platform 120. The network 118 can include, for example, any one or more of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), the Internet, and the like. Further, the network 118 can include, but is not limited to, any one or more of the following network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, a tree or hierarchical network, and the like. The client device 102 also includes a display 112. In some implementations, the display 112 can be configured within or on the client device 102. In other implementations, the display 112 can be external to the client device 102.
The client device 102 also includes an input device 114, such as a microphone to receive voice inputs or a keyboard to receive textual inputs. The client device 102 also includes an output device 116, such as a speaker or a display. - The
client device 102 can include a conversational agent frontend, e.g., one or more of applications 106, which can receive inputs associated with a user query and provide responses to the user's query. For example, as shown in FIG. 1, the client device 102 can receive user queries which are uttered, spoken, or otherwise verbalized and received by the input device 114, such as a microphone. In some implementations, the input device 114 can be a keyboard and the user can provide query data as a textual input, in addition to or separately from the inputs provided using a voice-based modality. A user can interact with the input device 114 to provide dialog data, such as a query, via an e-commerce website at which the user previously placed an order. For example, the user can provide a query asking “When will my order be delivered?”. The conversational agent 106 configured on the client device 102 can receive the query via the input device 114 and cause processor 108 to transmit the query data to the dialog processing platform 120 for processing. Additional detail of the client device 102 and the conversational agent frontend applications 106 will be provided in the description of FIG. 2. - As shown in
FIG. 1, the conversational agent system 100 includes a dialog processing platform 120. The dialog processing platform 120 operates to receive dialog data, such as user queries provided to the client device 102, and to process the dialog data to generate responses to the user-provided dialog data. The dialog processing platform 120 can be configured on any device having an appropriate processor, memory, and communications capability for hosting the dialog processing platform, as will be described herein. In certain aspects, the dialog processing platform can be configured as one or more servers, which can be located on-premises of an entity deploying the conversational agent system 100, or can be located remotely from the entity. In some implementations, the dialog processing platform 120 can be implemented as a distributed architecture or a cloud computing architecture. In some implementations, one or more of the components or functionality included in the dialog processing platform 120 can be configured in a microservices architecture. In some implementations, one or more components of the dialog processing platform 120 can be provided via a cloud computing server of an infrastructure-as-a-service (IaaS) and can support platform-as-a-service (PaaS) and software-as-a-service (SaaS) services. - The
dialog processing platform 120 includes a communications module 122 to receive the computer-readable instructions and/or user data transmitted via network 118. The dialog processing platform 120 also includes one or more processors 124 configured to execute instructions that, when executed, cause the processors to perform natural language processing on the received dialog data and to generate contextually specific responses to the user dialog inputs using one or more interchangeable and configurable natural language processing resources. The dialog processing platform 120 also includes a memory 128 configured to store the computer-readable instructions and/or user data associated with processing user dialog data and generating dialog responses. The memory 128 can store a plurality of profiles associated with each tenant or entity. The profile can configure one or more processing components of the dialog processing platform 120 with respect to the entity or tenant for which the conversational agent system 100 has been configured. - As shown in
FIG. 1, the dialog processing platform 120 includes one or more subsystems, such as subsystems 130A and 130B, each of which can be associated with a tenant utilizing the conversational agent system 100 to provide conversational agents to end users. For example, the dialog processing platform 120 can include a first subsystem 130A which can be associated with a first tenant 130A, such as a retail entity, and a second subsystem 130B which can be associated with a second tenant 130B, such as a financial services entity. In this way, the dialog processing platform 120 can be configured as a multi-tenant portal to provide natural language processing for different tenants, and their corresponding conversational agent frontend applications 106, which can be configured on a variety of multi-modal digital endpoint client devices 102. - Subsystems 130 can include components implementing functionality to receive user dialog data from a variety of multi-modal conversational agents and to generate dialog responses in the context of a particular lexicon of a tenant or entity for which the conversational agent has been deployed. For example, as shown in
FIG. 1 in regard to subsystem 130A, the components can include an automatic speech recognition engine adapter (ASRA) 135A for interfacing with a plurality of automated speech recognition (ASR) engines 140, a plurality of natural language agent (NLA) ensembles 145A, a text-to-speech synthesis engine adapter (TTSA) 150 for interfacing to a plurality of text-to-speech (TTS) synthesis engines 155, and a plurality of catalog-to-dialog (CTD) modules 160A. In some implementations, the dialog processing platform 120 can include one or more subsystems 130. - The plurality of
ASR engines 140, the plurality of NLA ensembles 145, the plurality of TTS synthesis engines 155, and the plurality of CTD modules 160 can be respectively referred to as ASR engines 140, NLA ensembles 145, TTS synthesis engines 155, and CTD modules 160. In some implementations, the subsystem 130 components can be configured directly within the dialog processing platform 120 such that the components are not configured within a subsystem 130. As shown in FIG. 1, the ASR engines 140 and the TTS synthesis engines 155 can be configured outside of the dialog processing platform 120, such as in a cloud-based architecture. The dialog processing platform 120 can exchange data with the ASR engines 140 and the TTS synthesis engines 155 via the ASRA 135 and the TTSA 150, respectively. In some implementations, the ASR 140 and/or TTS 155, or portions thereof, can be configured within the dialog processing platform 120. In some implementations, the components of the dialog processing platform 120, as well as the ASR engines 140 and the TTS synthesis engines 155, can be implemented as microservices within a cloud-based or distributed computing architecture. - As shown in
FIG. 1, the dialog processing platform 120 includes an ASRA 135A configured to interface with the ASR engines 140. The ASR engines 140 can include automated speech recognition engines configured to receive spoken or textual natural language inputs and to generate textual outputs corresponding to the inputs. For example, the ASR engines 140 can process the user's verbalized query or utterance “When will my order be delivered?” into a text string of natural language units characterizing the query. The text string can be further processed to determine an appropriate query response. The dialog processing platform 120 can dynamically select a particular ASR engine 140 that best suits a particular task, dialog, or received user query. - The
dialog processing platform 120 also includes a plurality of NLA ensembles 145. The NLA ensembles 145 can include a plurality of components configured to receive the text string from the ASR engines 140 and to process the text string in order to determine a textual response to the user query. The NLA ensembles 145 can include a natural language understanding (NLU) module implementing a number of classification algorithms trained in a machine learning process to classify the text string into a semantic interpretation. The processing can include classifying an intent of the text string and extracting information from the text string. The NLU module combines different classification algorithms and/or models to generate an accurate and robust interpretation of the text string. The NLA ensembles 145 can also include a dialog manager (DM) module. The DM module can determine an appropriate dialog action in a contextual sequence formed by the current or previous dialog sequences conducted with the user. In this way, the DM can generate a response action to increase natural language quality and fulfillment of the user's query objective. The NLA ensembles 145 can also include a natural language generator (NLG) module. The NLG module can process the action response determined by the dialog manager and can convert the action response into a corresponding textual response. The NLG module provides multimodal support for generating textual responses for a variety of different output device modalities, such as voice outputs or visually displayed (e.g., textual) outputs. In some implementations, the ensemble can include a set of models that are included in the NLU and optimized jointly to select the right response. - The
dialog processing platform 120 also includes a TTSA 150 configured to interface with the TTS synthesis engines 155. The TTS synthesis engines 155 can include text-to-speech synthesis engines configured to convert textual responses to verbalized query responses. In this way, a response to a user's query can be determined as a text string, and the text string can be provided to the TTS synthesis engines 155 to generate the query response as natural language speech. The dialog processing platform 120 can dynamically select a particular TTS synthesis engine 155 that best suits a particular task, dialog, or generated textual response. - The
dialog processing platform 120 also includes catalog-to-dialog (CTD) modules 160. The CTD modules 160 can be selected for use based on a profile associated with the tenant or entity. The CTD modules 160 can automatically convert data from a tenant or entity catalog, as well as billing and order information, into a data structure corresponding to a particular tenant or entity for which the conversational agent system 100 is deployed. The CTD modules 160 can derive product synonyms, attributes, and natural language queries from product titles and descriptions which can be found in the tenant or entity catalog. The CTD modules 160 can generate a data structure that is used by the machine learning platform 165 to train one or more classification algorithms included in the NLU module. In some implementations, the CTD modules 160 can instantiate, create, or implement fully configured conversational agents configured to process user queries or dialog inputs for a tenant. In some implementations, the CTD modules 160 can be used to efficiently pre-configure the conversational agent system 100 to automatically respond to queries about orders and/or products or services provided by the tenant or entity. For example, referring back to FIG. 1, the dialog processing platform 120 can process the user's query to determine a response regarding the previously placed order. As a result of the processing initially described above and to be described in more detail in relation to FIG. 3, the dialog processing platform 120 can generate a response to the user's query. The query response can be transmitted to the client device 102 and provided as speech output via output device 116 and/or provided as text displayed via display 112. - The
conversational agent system 100 includes a machine learning platform 165. Machine learning can refer to an application of artificial intelligence that automates the development of an analytical model by using algorithms that iteratively learn patterns from data without explicit indication of the data patterns. Machine learning can be used in pattern recognition, computer vision, email filtering, and optical character recognition, and enables the construction of algorithms or models that can accurately learn from data to predict outputs, thereby making data-driven predictions or decisions. - The
machine learning platform 165 can include a number of components configured to generate one or more trained prediction models suitable for use in the conversational agent system 100 described in relation to FIG. 1. For example, during a machine learning process, a feature selector can provide a selected subset of features to a model trainer as inputs to a machine learning algorithm to generate one or more training models. A wide variety of machine learning algorithms can be selected for use, including algorithms such as support vector regression, ordinary least squares regression (OLSR), linear regression, logistic regression, stepwise regression, multivariate adaptive regression splines (MARS), locally estimated scatterplot smoothing (LOESS), ordinal regression, Poisson regression, fast forest quantile regression, Bayesian linear regression, neural network regression, decision forest regression, boosted decision tree regression, artificial neural networks (ANN), Bayesian statistics, case-based reasoning, Gaussian process regression, inductive logic programming, learning automata, learning vector quantization, informal fuzzy networks, conditional random fields, genetic algorithms (GA), information theory, support vector machine (SVM), Averaged One-Dependence Estimators (AODE), Group Method of Data Handling (GMDH), instance-based learning, lazy learning, Maximum Information Spanning Trees (MIST), and transfer learning methods based on pre-trained, generalized embeddings as well as domain-based embeddings. - The
CTD modules 160 can be used in the machine learning process to train the classification algorithms included in the NLU of the NLA ensembles 145. The model trainer can evaluate the machine learning algorithm's prediction performance based on patterns in the received subset of features processed as training inputs and can generate one or more new training models. The generated training models, e.g., classification algorithms and models included in the NLU of the NLA ensemble 145, are then capable of receiving user data including text strings corresponding to a user query and of outputting predicted textual responses including at least one word from a lexicon associated with the tenant or entity for which the conversational agent system 100 has been configured and deployed. -
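As a loose illustration of the train-then-classify flow described above, the sketch below matches a query against labeled utterances drawn from a tenant lexicon. The labels, utterances, and bag-of-words cosine similarity are invented stand-ins, not the patent's actual training algorithms:

```python
from collections import Counter
import math

# Hypothetical miniature training set derived from a tenant's CTD data;
# a real NLU would be trained on far larger corpora with learned models.
TRAINING = {
    "order_status": ["when will my order arrive", "track my order", "where is my package"],
    "product_search": ["show me red dresses", "find cardigans", "looking for shoes"],
}

def _vector(text):
    # Bag-of-words term counts for a whitespace-tokenized utterance.
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify_intent(query):
    """Return the intent label whose training utterances best match the query."""
    qv = _vector(query)
    scores = {
        label: max(_cosine(qv, _vector(u)) for u in utterances)
        for label, utterances in TRAINING.items()
    }
    return max(scores, key=scores.get)

print(classify_intent("where is my order"))  # → order_status
```

A deployed ensemble would also weigh dialog context and history when scoring candidate interpretations, as the surrounding text notes.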
FIG. 2 illustrates an example architecture of a client device 102 configured as a multi-modal conversational agent of the conversational agent system 100 described in relation to FIG. 1. As shown in FIG. 2, the client device 102 can include a plurality of applications 106. The applications 106 can include easily installed, pre-packaged software developer kits which implement conversational agent frontend functionality on a client device 102. The applications 106 can include APIs as JavaScript libraries received from the dialog processing platform 120 and incorporated into a website of the entity or tenant to enable support for text and/or voice modalities via a customizable user interface. The applications 106 can implement client APIs on different client devices 102 and web browsers in order to provide responsive multi-modal interactive graphical user interfaces (GUIs) that are customized for the entity or tenant. The GUI and applications 106 can be provided based on a profile associated with the tenant or entity. In this way, the conversational agent system 100 can provide customizable branded assets defining the look and feel of a user interface, different voices utilized by the TTS synthesis engines 155, as well as textual responses generated by the NLA ensembles 145 which are specific to the tenant or entity. - The
web application 205 includes functionality configured to enable a web browser on a client device 102 to communicate with the dialog processing platform 120. The web application 205 can include a media capture API, a web audio API, a document object model, and a WebSocket API. The web application 205 can be configured to capture dynamic content generated by the multi-modal conversational agent configured on the client device 102. For example, the dynamic content can include clickable and multi-modal interactive components and data. The iOS application 210 includes functionality configured to provide support for multi-modal conversational agents implemented on client devices 102 configured with the proprietary iOS operating system developed by Apple Inc. of Cupertino, Calif., U.S.A. In some implementations, the interface representation and interactive user model used for a conversational agent configured on a client device web browser can be converted and provided using the same interface representation deployed on a mobile device web browser. The Android application 215 includes functionality configured to provide support for multi-modal conversational agents implemented on client devices 102 configured with the Linux-based Android operating system developed by the Open Handset Alliance of Mountain View, Calif., U.S.A. The messaging application 220 includes functionality configured to provide messaging support for a variety of chat and messaging platforms. In some implementations, the messaging application 220 can reproduce the same multi-modal interface representation and experience as enabled on other client device 102 interfaces. The telephony application 225 includes functionality configured to provide telephony support via public switched telephone network (PSTN) devices and voice over internet protocol (VoIP) devices.
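One way to picture the per-channel applications above is as a registry that records which modalities each channel supports, so the agent can adapt its output accordingly. The channel names and capability flags below are illustrative assumptions, not the platform's actual schema:

```python
# Hypothetical channel registry; each entry notes which modalities the
# channel supports (e.g., no screen content over a PSTN voice call).
CHANNEL_APPS = {
    "web":       {"app": "web_application_205",       "voice": True,  "text": True,  "screen": True},
    "ios":       {"app": "ios_application_210",       "voice": True,  "text": True,  "screen": True},
    "android":   {"app": "android_application_215",   "voice": True,  "text": True,  "screen": True},
    "messaging": {"app": "messaging_application_220", "voice": False, "text": True,  "screen": True},
    "telephony": {"app": "telephony_application_225", "voice": True,  "text": False, "screen": False},
}

def response_modalities(channel):
    """Return the modalities a conversational agent may use on a channel."""
    caps = CHANNEL_APPS[channel]
    return [m for m in ("voice", "text", "screen") if caps[m]]
```

For example, `response_modalities("telephony")` yields only `["voice"]`, which is consistent with the telephony application generating short spoken prompts without reference to screen content.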
In some implementations, the telephony application 225 can be configured to generate short conversational prompts or dialog sequences without reference to the content of the screen. Accordingly, the conversational agent system described herein can enable support for smart-speaker client devices 102, and the conversational agents configured on the client devices 102 can automatically adapt to the capabilities of different devices. -
FIG. 3 illustrates an example architecture 300 of a dialog processing platform 120 of the system 100 described in relation to FIG. 1. The dialog processing platform 120 can serve as a backend of the conversational agent system 100. One or more components included in the dialog processing platform 120 shown in FIG. 3 can be configured on a single server device or on multiple server devices. One or more of the components of the dialog processing platform 120 can also be configured as a microservice, for example in a cloud computing environment. In this way, the conversational agent system 100 can be configured as a robustly scalable architecture that can be provisioned based on resource allocation demands. - The
dialog processing platform 120 includes run-time components that are responsible for processing incoming speech or text inputs, determining the meaning of those inputs in the context of a dialog and a tenant lexicon, and generating replies to the user which are provided as speech and/or text. Additionally, the dialog processing platform 120 provides a multi-tenant portal where both administrators and tenants can customize, manage, and monitor platform resources, and can generate run-time reports and analytic data. The dialog processing platform 120 interfaces with a number of real-time resources such as ASR engines 140, TTS synthesis engines 155, and telephony platforms. The dialog processing platform 120 also provides consistent authentication and access APIs to commercial e-commerce platforms. - As shown in
FIG. 3, the dialog processing platform 120 includes a DPP server 302. The DPP server 302 can act as a frontend to the dialog processing platform 120 and can route data received from or to be transmitted to client devices 102 as appropriate. The DPP server 302 routes requests or data to specific components of the dialog processing platform 120 based on registered tenant and application identifiers which can be included in a profile associated with a particular tenant. The DPP server 302 can also securely stream audio to the ASR engines 140 and from the TTS synthesis engines 155. - For example, as shown in
FIG. 3, the dialog processing platform 120 includes a plurality of adapters 304 configured to interface the ASR engines 140 and the TTS synthesis engines 155 to the DPP server 302. The adapters 304 allow the dialog processing platform 120 to interface with a variety of speech processing engines, such as ASR engines 140 and TTS synthesis engines 155. In some implementations, the speech processing engines can be configured in a cloud-based architecture of the dialog processing platform 120 and may not be collocated in the same server device as the DPP server 302 or other components of the dialog processing platform 120. - The
adapters 304 include an ASR engine adapter 135 and a TTS synthesis engine adapter 150. The ASR engine adapter 135 and the TTS synthesis engine adapter 150 enable tenants to dynamically select the speech recognition and text-to-speech synthesis providers or natural language speech processing resources that best suit the user's objective, task, dialog, or query. - As shown in
FIG. 3, the dialog processing platform 120 includes a VoiceXML (VXML) adapter 310 which can couple the DPP server 302 to various media resources 312. For example, the media resources 312 can include VoIP networks, ASR engines, and TTS synthesis engines 314. In some implementations, the media resources 312 enable the conversational agents to leverage existing telephony platforms, which can often be integrated with particular speech processing resources. The existing telephony platforms can provide interfaces for communications with VoIP infrastructures using session initiation protocol (SIP). In these configurations, VXML documents are exchanged during a voice call. - The
dialog processing platform 120 also includes an orchestrator component 316. The orchestrator 316 provides an interface for administrators and tenants to access and configure the conversational agent system 100. The administrator portal 318 can enable monitoring and resource provisioning, as well as rule-based alert and notification generation. The tenant portal 320 can allow customers or tenants of the conversational agent system 100 to configure reporting and analytic data, such as account management, customized reports and graphical data analysis, trend aggregation and analysis, as well as drill-down data associated with dialog utterances. The tenant portal 320 can also allow tenants to configure branding themes and implement a common look and feel for the tenant's conversational agent user interfaces. The tenant portal 320 can also provide an interface for onboarding or bootstrapping customer data. In some implementations, the tenant portal 320 can provide tenants with access to customizable conversational agent features such as user prompts, dialog content, colors, themes, usability or design attributes, icons, and default modalities, e.g., using voice or text as a first modality in a dialog. The tenant portal 320 can, in some implementations, provide tenants with customizable content via different ASR engines 140 and different TTS synthesis engines 155 which can be utilized to provide speech data in different voices and/or dialects. In some implementations, the tenant portal 320 can provide access to analytics reports and extract, transform, load (ETL) data feeds. - The orchestrator 316 can provide secure access to one or more backends of a tenant's data infrastructure. The orchestrator 316 can provide one or more common APIs to various tenant data sources which can be associated with retail catalog data, user accounts, order status, order history, and the like. The common APIs can enable developers to reuse APIs from various client-side implementations.
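The tenant-portal customizations above can be imagined as settings overlaid onto a tenant profile object. The field names below (branding theme, TTS voice, default modality, custom vocabulary) are assumptions standing in for whatever schema the platform actually uses:

```python
from dataclasses import dataclass, field

# Illustrative tenant profile; field names are invented for this sketch.
@dataclass
class TenantProfile:
    tenant_id: str
    brand_theme: str = "default"
    tts_voice_id: str = "voice-neutral-1"
    default_modality: str = "text"   # whether voice or text opens a dialog
    custom_vocabulary: list = field(default_factory=list)

def apply_portal_settings(profile, settings):
    """Overlay settings captured in the tenant portal onto the profile,
    rejecting keys the profile does not define."""
    for key, value in settings.items():
        if not hasattr(profile, key):
            raise AttributeError(f"unknown profile setting: {key}")
        setattr(profile, key, value)
    return profile
```

A profile configured this way could then be distributed to other components, consistent with the orchestrator propagating tenant configuration across the platform.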
- The orchestrator 316 can further provide an
interface 322 to human resources, such as human customer support operators who may be located at one or more call centers. The dialog processing platform 120 can include a variety of call center connectors 324 configured to interface with data systems at one or more call centers. - The orchestrator 316 can provide an
interface 326 configured to retrieve authentication information and propagate user authentication and/or credential information to one or more components of the system 300 to enable access to a user's account. For example, the authentication information can identify one or more users, such as individuals who have accessed a tenant web site as a customer or who have interacted with the conversational agent system 100 previously. The interface 326 can provide an authentication mechanism for tenants seeking to authenticate users of the conversational agent system 100. The dialog processing platform 120 can include a variety of end-user connectors 328 configured to interface the dialog processing platform 120 to one or more databases or data sources identifying end-users. The interface 326 can also enable access to the tenant's customer order and billing data via one or more catalog or e-commerce connectors 328. - The orchestrator 316 can also provide an
interface 330 to tenant catalog and e-commerce data sources. The interface 330 can enable access to the tenant's catalog data, which can be accessed via one or more catalog or e-commerce connectors 332. The interface 330 enables access to tenant catalogs and/or catalog data and further enables the catalog data to be made available to the CTD modules 160. In this way, data from one or more sources of catalog data can be ingested into the CTD modules 160 to populate the modules with product or item names, descriptions, brands, images, colors, swatches, as well as structured and free-form item or product attributes. - The
dialog processing platform 120 also includes a maestro component 334. The maestro 334 enables administrators of the conversational agent system 100 to manage, deploy, and monitor conversational agent applications 106 independently. The maestro 334 provides infrastructure services to dynamically scale the number of instances of natural language resources, such as tenant subsystems 130, ASR engines 140, TTS synthesis engines 155, NLA ensembles 145, and CTD modules 160. The maestro 334 can dynamically scale these resources as dialog traffic increases. The maestro 334 can deploy new resources without interrupting the processing being performed by existing resources. The maestro 334 can also manage updates to the CTD modules 160 with respect to updates to the tenant's e-commerce data and/or product catalogs. In this way, the maestro 334 provides the benefit of enabling the dialog processing platform 120 to operate as a highly scalable infrastructure for deploying artificially intelligent multi-modal conversational agent applications 106 for multiple tenants or multiple tenant subsystems 130. As a result, the conversational agent system 100 can reduce the time, effort, and resources required to develop, test, and deploy conversational agents. - The
dialog processing platform 120 further includes a CTD module 160. The CTD module 160 can implement methods to collect e-commerce data from tenant catalogs, product reviews, user account and order data, and user clickstream data collected at the tenant's web site to generate a data structure that can be used to learn specific domain knowledge and to onboard or bootstrap a newly configured conversational agent system 100. The CTD module 160 can extract taxonomy labels associated with hierarchical relationships between categories of products and can associate the taxonomy labels with the products in the tenant catalog. The CTD module 160 can also extract structured product attributes (e.g., categories, colors, sizes, prices) and unstructured product attributes (e.g., fit details, product care instructions) and the corresponding values of those attributes. The CTD module 160 can normalize attribute values so that the attribute values share the same format throughout the catalog data structure. In this way, noisy values caused by poorly formatted content can be removed. - Products in an e-commerce catalog are typically organized in a multi-level taxonomy, which can group the products into specific categories. The categories can be broader at higher levels (e.g., there are more products) and narrower (e.g., there are fewer products) at lower levels of the product taxonomy. For example, a product taxonomy associated with clothing can be represented as Clothing > Sweaters > Cardigans & Jackets. The category “Clothing” is quite general, while “Cardigans & Jackets” is a very specific type of clothing. A user's queries can refer to a category (e.g., dresses, pants, skirts, etc.) identified by a taxonomy label or to a specific product item (e.g., item #30018, Boyfriend Cardigan, etc.). In a web-based search session, a product search could either start from a generic category and narrow down to a specific product or vice versa.
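Two of the CTD steps above, splitting a taxonomy label into ordered levels and normalizing attribute values into one shared format, can be sketched as follows. The normalization rules shown are assumptions; the text does not specify which formatting noise the module actually strips:

```python
import re

def parse_taxonomy(label, sep=">"):
    """Split a label like 'Clothing > Sweaters > Cardigans & Jackets'
    into ordered levels, broadest category first."""
    return [level.strip() for level in label.split(sep)]

def normalize_attribute_value(value):
    """Collapse assumed formatting noise (case, separators, extra spaces)
    so that values like ' Navy-BLUE ' and 'navy blue' compare equal."""
    value = re.sub(r"[-_/]+", " ", value.strip().lower())
    return re.sub(r"\s+", " ", value)
```

With values normalized this way, a query mentioning "navy blue" can match a catalog entry stored as "Navy-Blue" even though the raw strings differ.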
CTD module 160 can extract category labels from the catalog taxonomy, product attribute types and values, as well as product titles and descriptions. - The
CTD module 160 can automatically generate attribute type synonyms and lexical variations for each attribute type from search query logs, product descriptions, and product reviews, and can automatically extract referring expressions from the tenant product catalog or the user clickstream data. The CTD module 160 can also automatically generate dialogs based on the tenant catalog and the lexicon of natural language units or words that are associated with the tenant and included in the data structure. - The
CTD module 160 utilizes the extracted data to train classification algorithms to automatically categorize catalog categories and product attributes when provided in a natural language query by a user. The extracted data can also be used to train a full search engine based on the extracted catalog information. The full search engine can thus include indexes for each product category and attribute. The extracted data can also be used to automatically define a dialog frame structure that will be used by a dialog manager module, described later, to maintain a contextual state of the dialog with the user. - As shown in
FIG. 3, the maestro 334 can interface with a plurality of natural language agent (NLA) ensembles 145. Each of the NLA ensembles 145 can include one or more of a natural language understanding (NLU) module 336, a dialog manager (DM) module 338, and a natural language generator (NLG) module 340. In some implementations, the NLA ensembles 145 can include pre-built automations which, when executed at run-time, implement dialog policies for a particular dialog context. For example, the pre-built automations can include dialog policies associated with searching, frequently-asked-questions (FAQ), customer care or support, order tracking, and small talk or commonly occurring dialog sequences which may or may not be contextually relevant to the user's query. The NLA ensembles 145 can include reusable dialog policies, dialog state tracking mechanisms, and domain and schema definitions. Customized NLA ensembles 145 can be added to the plurality of NLA ensembles 145 in a compositional manner as well. - Each
NLA ensemble 145 can include at least one of a natural language understanding (NLU) module 336, a dialog manager (DM) module 338, and a natural language generator (NLG) module 340. The operation of the NLA ensemble 145 and its modules will be described further in relation to FIGS. 5-7. - As shown in
FIG. 3, the NLA ensemble 145 includes a natural language understanding (NLU) module 336. The NLU module 336 can implement a variety of classification algorithms used to classify input text associated with a user utterance and generated by the ASR engines 140 into a semantic interpretation. In some implementations, the NLU 336 can classify input text when the utterance includes customer support requests/questions about products and services, as well as user queries. In some implementations, the NLU module 336 can implement a stochastic intent classifier and a named-entity recognizer ensemble to perform intent classification and information extraction, such as extraction of entity or user data. The NLU module 336 can combine different classification algorithms and can select the classification algorithm most likely to provide the best semantic interpretation for a particular task or user query by determining dialog context and integrating dialog histories. - The classification algorithms included in the
NLU module 336 can be trained in a supervised machine learning process using support vector machines or using conditional random field modeling methods. In some implementations, the classification algorithms included in the NLU module 336 can be trained using a convolutional neural network, a long short-term memory recurrent neural network, or a bidirectional long short-term memory recurrent neural network. The NLU module 336 can receive the user query and can determine surface features and feature engineering, distributional semantic attributes, and joint optimizations of intent classifications and entity determinations, as well as rule-based domain knowledge, in order to generate a semantic interpretation of the user query. In some implementations, the NLU module 336 can include one or more of intent classifiers (IC), named entity recognition (NER), and a model-selection component that can evaluate performance of various IC and NER components in order to select the configuration most likely to generate contextually accurate conversational results. The NLU module 336 can include competing models which can predict the same labels but using different algorithms, and domain models where each model produces different labels (customer care inquiries, search queries, FAQ, etc.). - The
NLA ensemble 145 also includes a dialog manager (DM) module 338. The DM module 338 can select a next action to take in a dialog with a user. The DM module 338 can provide automated learning from user dialog and interaction data. The DM module 338 can implement rules, frames, and stochastic-based policy optimization with dialog state tracking. The DM module 338 can maintain an understanding of dialog context with the user and can generate more natural interactions in a dialog by providing full context interpretation of a particular dialog with anaphora resolution and semantic slot dependencies. In new dialog scenarios, the DM module 338 can mitigate “cold-start” issues by implementing rule-based dialog management in combination with user simulation and reinforcement learning. In some implementations, sub-dialog and/or conversation automations can be reused in different domains. - The
DM module 338 can receive semantic interpretations generated by the NLU module 336 and can generate a dialog response action using a context interpreter, a dialog state tracker, a database of dialog history, and an ensemble of dialog action policies. The ensemble of dialog action policies can be refined and optimized using rules, frames, and one or more machine learning techniques. - As further shown in
FIG. 3, the NLA ensemble 145 includes a natural language generator (NLG) module 340. The NLG module 340 can generate a textual response based on the response action generated by the DM module 338. For example, the NLG module 340 can convert response actions into natural language and multi-modal responses that can be uttered or spoken to the user and/or can be provided as textual outputs for display to the user. The NLG module 340 can include a customizable template programming language which can be integrated with a dialog state at runtime. - In some implementations, the
NLG module 340 can be configured with a flexible template interpreter with dialog content access. For example, the flexible template interpreter can be implemented using Jinja2, a web template engine. The NLG module 340 can receive a response action from the DM module 338 and can process the response action with dialog state information, using the template interpreter to generate output formats in speech synthesis markup language (SSML), VXML, as well as one or more media widgets. The NLG module 340 can further receive dialog prompt templates and multi-modal directives. In some implementations, the NLG module 340 can maintain or receive access to the current dialog state and a dialog history, and can refer to variables or language elements previously referred to in a dialog. For example, a user may have previously provided the utterance “I am looking for a pair of shoes for my wife”. The NLG module 340 can label a portion of the dialog as PERSON_TYPE and can associate a normalized GENDER slot value of FEMALE. The NLG module 340 can inspect the gender reference and customize the output by using the proper gender pronouns such as ‘her, she, etc.’ -
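The pronoun-selection behavior just described might look like the following in miniature. Plain `str.format` stands in for the Jinja2 templating named in the text, and the slot names are assumptions:

```python
# Map a normalized GENDER slot value to pronouns a template can use.
PRONOUNS = {
    "FEMALE": {"subj": "she", "obj": "her"},
    "MALE": {"subj": "he", "obj": "him"},
}
DEFAULT_PRONOUNS = {"subj": "they", "obj": "them"}

def render_response(template, dialog_state):
    """Fill a response template from the dialog state, choosing pronouns
    from the tracked GENDER slot (falling back to neutral pronouns)."""
    slots = dict(dialog_state)
    slots.update(PRONOUNS.get(dialog_state.get("GENDER"), DEFAULT_PRONOUNS))
    return template.format(**slots)

state = {"GENDER": "FEMALE", "category": "shoes"}
reply = render_response("What size does {subj} wear? I can look for {category} for {obj}.", state)
# reply == "What size does she wear? I can look for shoes for her."
```

A Jinja2 implementation would additionally support conditionals and loops inside the template body, which is presumably why the text calls the interpreter "flexible".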
FIG. 4 is a flowchart illustrating an example method for determining a textual response to an utterance of a user query processed by the dialog processing platform 120 described in relation to FIGS. 1 and 3. - In
operation 405, data characterizing an utterance of a query associated with a tenant can be received. In some implementations, data characterizing the utterance can include audio data received by an input device 114 of a client device 102 and provided to/received by the dialog processing platform 120. In some implementations, the data characterizing the utterance can be provided via text; for example, a user can provide the utterance as textual input to a conversational agent configured in a web site of an e-commerce entity or tenant. The user can provide the utterance in regard to a goal or objective that the user seeks to accomplish in cooperation with the tenant. The user can provide the data characterizing the utterance of the query in a dialog with a conversational agent configured as an application 106 on the client device 102. - At
operation 410, the received data can be provided to an automated speech recognition engine, such as ASR engine 140, along with a profile selected from a plurality of profiles associated with the tenant. The profile can configure the ASR engine 140 to process the received data by specifying suitable configurations that are associated with the tenant and identified in the tenant profile. The configurations can include the tenant-specific lexicon. The tenant-specific lexicon can include domain language and channel audio characteristics associated with the tenant. For example, the tenant-specific lexicon can include product and/or service names, alternative phonetic annunciations and pronunciations, and audio channel information such as telephony or digital voice quality and/or audio coding types. For some ASR engines 140, a language model for each state of a dialog can be identified. A language model can include a set of statically related or defined sentences. The language model can be identified when specific contextual conditions exist in a dialog, such as when the conversational agent expects to receive a business name. In such circumstances, a business name language model can be identified and activated. - The
ASR engine 140 can receive the data and process the audio data or textual data to determine a string of text corresponding to the data received at the client device 102. For example, the ASR engine 140 can receive the user's verbal utterance forming a query “When will my order be delivered?”. The ASR engine 140 can process the audio data including the verbal utterance to decompose the received data into a string of natural language units or words. The ASR 140 can select words to be included in the text string based on the profile associated with the tenant. In this way, the ASR engine 140 operates, or is selected to operate, in a manner that is most likely to generate a text string that is contextually most relevant to the tenant and best represents the intention of the user conveyed via the utterance. The profile can be defined and generated via the tenant portal 320 and can be distributed or made accessible to other components of the system 300 via the orchestrator 316. In some implementations, the profile can be stored in the DPP server 302 and can be propagated to the maestro 334. For example, a tenant may prefer a TTS synthesis engine 155 configured with a male voice and customized to process specific product names which are not commonly recognized by an ASR engine 140. At run time, the DPP server 302 can provide a TTS voice identifier to the TTS synthesis engine 155 each time speech is to be generated. At the same time, the DPP server 302 can provide a list of specific product names which are not commonly recognized to the ASR engine 140 every time the system 300 is listening to the user. In some implementations, the maestro 334 can add more configurations of the TTS synthesis engines 155 based on the dialog context. By configuring the ASR engine 140 with a profile selected based on the tenant, specific ASR engine 140 technology can be easily changed, updated, and/or reconfigured on a tenant-specific basis. - In
operation 415, the NLA ensemble 145 can receive the text string characterizing the query. In some implementations, the query can include the utterance or portions of the utterance. In some implementations, the query can include a text request. The text string output by the ASR engine 140 can be conveyed to the NLA ensemble 145 for processing. - In
operation 420, the NLA ensemble 145 can process the text string to determine a textual response to the query. The text string can be first processed by the NLU module 336 to generate a semantic interpretation associated with the text string. The semantic interpretation can next be processed by the DM module 338 to determine a contextual sequence associated with the text string and a response action to the query (and the corresponding text string). The response action can then be processed by the NLG module 340 to determine a textual response corresponding to the response action. In the example use case of FIG. 1, the NLA ensemble 145 has determined that the most contextually relevant response action to the user's query regarding the status of their order is “Your order will be delivered tomorrow.” Additional detail associated with processing the text string will be provided in the description of FIG. 6. -
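Operation 420's three-stage hand-off (NLU to DM to NLG) reduces to a simple function composition. The toy stages below are invented stand-ins for the modules' real trained models, state tracking, and templating, using the order-status example from FIG. 1:

```python
def nlu(text):
    """Stand-in NLU: produce a semantic interpretation of the text string."""
    intent = "order_status" if "order" in text.lower() else "unknown"
    return {"intent": intent, "text": text}

def dm(interpretation):
    """Stand-in DM: map the interpretation to a response action."""
    if interpretation["intent"] == "order_status":
        return {"action": "inform_delivery", "slots": {"when": "tomorrow"}}
    return {"action": "clarify", "slots": {}}

def nlg(action):
    """Stand-in NLG: convert the response action into a textual response."""
    if action["action"] == "inform_delivery":
        return f"Your order will be delivered {action['slots']['when']}."
    return "Sorry, could you rephrase that?"

def respond(text):
    # The NLU -> DM -> NLG pipeline of operation 420 as a composition.
    return nlg(dm(nlu(text)))
```

Here `respond("When will my order be delivered?")` reproduces the document's example response, "Your order will be delivered tomorrow."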
FIG. 5 is a flowchart illustrating an example method for providing a verbalized query response to a user via the client device 102 and the dialog processing platform 120 described in relation to FIGS. 1 and 3. In operation 505, the textual response generated by the NLA ensemble 145 can be provided to the TTS synthesis engine 155 with the tenant profile. The tenant profile can be used to configure and select a TTS synthesis engine 155 associated with the tenant, such that the TTS synthesis engine 155 can generate a verbalized query response, which includes a plurality of natural language units or words selected from a lexicon associated with the tenant or the tenant's applications 106. In the example use case of FIG. 1, the NLA ensemble 145 has determined that the most contextually relevant response action to the user's query inquiring about the status of their order is “Your order will be delivered tomorrow.” The textual response action generated by the NLA ensemble 145 can be received by the TTS synthesis engine 155. The TTS synthesis engine 155 can determine a verbalized query response using the tenant profile. - In
operation 510, the DPP server 302 can receive a verbalized query response from the TTS engine 155, and in operation 515, the DPP server 302 can provide the verbalized query response to the client device 102. The client device 102 can further provide the verbalized query response to the user via the output device 116, such as a speaker. In some implementations, the user can select between a textual modality and a voice or speech modality. For example, the applications 106 can include a user-settable mechanism to configure the conversational agent for textual dialogs or voice dialogs. In implementations when the text mode is selected, the DPP server 302 can exclude the ASR engines 140 and the TTS synthesis engines 155 and can transmit the textual data to the orchestrator 316. -
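The modality branch just described, where text mode bypasses speech synthesis entirely, can be sketched as a small routing function. The function and the stand-in TTS callable are assumptions for illustration:

```python
def route_response(textual_response, modality, synthesize):
    """Return what a DPP-server-like component would send to the client:
    synthesized speech for voice dialogs, or the text unchanged for text
    dialogs. `synthesize` stands in for a TTS engine call."""
    if modality == "voice":
        return synthesize(textual_response)
    return textual_response

# Stand-in TTS engine: tags the text instead of producing real audio.
fake_tts = lambda text: f"<audio:{text}>"
```

In text mode the response passes through untouched, mirroring how the platform can exclude the ASR and TTS engines from the path.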
FIG. 6 is a flowchart illustrating an example method for processing a text string characterizing a query. The text string, characterizing the user's query and generated by the ASR engine 140, can be provided to the NLA ensemble 145 for processing to generate a textual response. The text string is initially provided to the NLU module 336. - In
operation 605, a semantic representation associated with the text string can be generated by the NLU module 336. The semantic representation can include attributes of the query such as the query intent, an intent type, and a category of the intent. The NLU module 336 can provide the location of the information extracted from the query. For example, the NLU module 336 can provide an index span indicating the position of a word in the query. In some implementations, the NLU module 336 can determine and provide confidence scores estimating the accuracy of the predictions, as well as normalized values based on gazetteers and/or a backend database. For example, the NLU module 336 can normalize “trousers” to a taxonomic category “pants and shorts” based on the tenant's catalog data. - In
operation 610, the DM module 338 determines a first contextual sequence associated with the text string. For example, the DM module 338 can receive the semantic representation generated by the NLU module 336 and can interpret the context of the semantic representation to determine a state of the dialog in which the user's query is included. The DM module 338 can include a dialog state tracker and a dialog history component to determine the context of the semantic representation associated with the user's query. - In
operation 615, the DM module 338 can generate a response action based on the determined contextual sequence. The DM module 338 can further include an ensemble policy which can receive input from the dialog state tracker to generate the response action. The DM module 338 can generate the response action via one or more policy optimization models, rules, and/or frames. The DM module 338 can generate an optimal response to the user by combining a number of strategies. For example, the DM module 338 can utilize a frame-based policy. The frame-based policy can determine intents and can associate slots to complete the task initiated by the user. Slots can include bits of information required to provide an answer to the user. If a user query is associated with purchasing shoes, it can be necessary to understand the type of shoes, the size of the shoe, and the width of the shoe, which can be a required parameter used to determine a suitable shoe fitting model. Mandatory and optional slots, as well as slots that are dependent on the values of other slots, can be used to determine the next action of the dialog. The DM module 338 can determine which mandatory or optional slot may be necessary next in the dialog sequence based on which slot may shorten the time to reach the goal. For example, the DM module 338 can be configured to ask for a shoe style, since information received in regard to the shoe style can narrow down the potential choices more than dialog regarding the user's shoe size. The DM module 338 can include one or more dialog policies. The dialog policies can be learned from data. For example, data associated with the sequences of dialog turns between the conversational agent/system 300 and the user can be converted into a vector representation and used to train a sequence model to predict the next optimal dialog action. - In
operation 620, the NLG module 340 can receive the response action generated by the DM module 338 and can generate a textual response. The NLG module 340 can include a copy of the dialog state from the dialog tracker configured in the DM module 338 and can process the action using a template interpreter. In some implementations, the template interpreter can include a Jinja or Jinja2 template interpreter written in the Python programming language. The template interpreter can output a textual response which can be further formatted by one or more output formatting components using SSML, VXML, and/or various other media widgets. In some implementations, the NLG module 340 can generate HyperText Markup Language (HTML) or meta-representations for GUI elements and content including clickable buttons, text, and images. -
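As an illustration of the template-interpreter step, the sketch below substitutes slot values into a canned response template using the standard library's `string.Template` as a stand-in for the Jinja2 interpreter mentioned above; the action name and slot variable are assumptions for the example.

```python
from string import Template

# Hypothetical response templates keyed by a dialog action name.
TEMPLATES = {
    "ask_size": Template("I can help you with that. "
                         "What size does $pronoun usually wear?"),
}

def realize(action: str, **slots) -> str:
    """Render the template for a response action with its slot values."""
    return TEMPLATES[action].substitute(**slots)

text = realize("ask_size", pronoun="she")
```

The rendered string could then be handed to the SSML/VXML formatting components for the chosen output modality.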
FIG. 7 is a flowchart illustrating an example method for generating a first data structure. The data structure can be used by the NLU module 336 to generate the semantic representation associated with the text string characterizing a query. The data structure can include product attributes, product synonyms, referring expressions related to the tenant's products, and common dialogs related to the tenant's products. The data structure can be generated by the CTD module 160. - For example, in
operation 705, the CTD module 160 can determine one or more product attributes associated with an item from the tenant's catalog of products or items. The CTD module 160 can determine and generate the product attributes by extracting synonyms in a specific product domain. The product attributes can be used by the NLU module 336 to expand slot values associated with a particular product. For example, the CTD module 160 and the data structure it generates can include the attributes “moccasin, boots, heels, sandals” for a product identified as a “shoe”. The CTD module 160 can be trained on product or tenant domain data but can also learn patterns and the contexts in which words are used, thus allowing the CTD module 160 to automatically infer words with the same meaning. The CTD module 160 can employ word embeddings, lexical databases, such as WordNet, and lexical chains to determine the product attributes. - In
operation 710, the CTD module 160 can determine one or more synonyms associated with an item from the tenant product catalog. A product attribute can be a property or attribute of a product. A retailer category can be defined by a product taxonomy. For example, “sweaters” can be a category label associated with products in the clothing domain. The CTD module 160 can automatically determine that “pullovers”, “cardigans”, “turtleneck”, “shaker”, and “cardigan sweater” are all synonyms referring to the same category. The CTD module 160 can automatically expand the lexicon for both catalog searching and search query interpretation. The CTD module 160 can use both word and sentence embeddings and can extract similar words from a specific domain and click stream data from search query logs. In some implementations, the CTD module 160 can use prebuilt embeddings or can train specific embeddings for the domain using catalog and review data. Additionally, the CTD module 160 can include a classifier that can automatically classify unseen search terms into a taxonomy label. - In
operation 715, the CTD module 160 can determine one or more referring expressions associated with an item from the tenant product catalog. Additionally, or alternatively, the CTD module 160 can determine one or more referring expressions based on interactive user data associated with the item. The CTD module 160 can automatically learn how customers refer to items in the tenant's product catalog. For example, the CTD module 160 can process the tenant catalog and clickstream data received from users visiting the tenant's website or online product catalog and can apply word embeddings and sequence-to-sequence models. Semantic similarities can be determined and the results can be ranked for inclusion in the data structure. - In
operation 720, the CTD module 160 can generate the data structure based on operations 705-715. The data structure can then be used to update the classification algorithms included in the NLU module 336. At run-time, the orchestrator 316 can configure periodic, e.g., daily, updates to the CTD module 160 and the data structure. For example, billing, order, catalog, clickstream, and review data can be uploaded to the CTD module 160 and processed to extract product titles, descriptions, and attributes. The CTD module 160 can normalize attribute values, extract keywords and n-grams, tokenize the data, and define a search index for use in the data structure. The data structure can then be used in the NLU module 336 to update a search index, optimize ranking functions, and update the classification algorithms used to generate the semantic interpretation associated with the text string characterizing the user's query. -
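The normalize/tokenize/n-gram step above can be pictured with a minimal sketch; the field names in the index entry are illustrative, not the module's actual schema.

```python
def tokenize(text: str) -> list:
    # Lowercase and split on whitespace after stripping commas.
    return [t for t in text.lower().replace(",", " ").split() if t]

def ngrams(tokens: list, n: int) -> list:
    # Contiguous n-grams, e.g. bigrams capture two-word product phrases.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def index_entry(title: str) -> dict:
    """Build one search-index entry from a product title."""
    tokens = tokenize(title)
    return {"tokens": tokens, "bigrams": ngrams(tokens, 2)}
```

For a title such as "Leather Ankle Boots", the entry holds the unigram tokens plus the bigrams "leather ankle" and "ankle boots" for phrase-level matching.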
FIG. 8 is a flowchart illustrating an example method for generating an initial conversation prompt via a multi-modal conversational agent of the system described in relation to FIG. 1. Prior to receiving data characterizing an utterance of a user query, the conversational agent system 100 can generate an initial conversation prompt and configure the conversational agent 106 on the client device 102 to communicate and conduct multi-modal dialog exchanges with the dialog processing platform 120. In the example described below, assume that a user is utilizing a smartphone device 102 and browsing an e-commerce website associated with a retail entity. The website offers both text and speech interfaces to the dialog processing platform 120. - In
operation 805, the website receives an input provided via the web browser configured on the client device 102. The user can provide the input, for example, by clicking the “Speak” button in the website. - In
operation 810, the dialog processing platform 120 receives validation data associated with the client device 102. For example, based on receiving the input in operation 805, a network connection will be initiated, e.g., via web sockets, and the web browser, configured with application 205, can be authenticated and registered through the DPP server 302. The DPP server 302 can receive validation data about the audio and graphical processing capabilities of the client device 102 and can validate whether the client device 102 is able to render graphics and capture audio in real-time. - Upon receiving the validation data and validating the
client device 102, the DPP server 302 can generate a conversation initiation message and provide the conversation initiation message to the maestro component 334. The maestro component 334 can provide an initial conversation response message back to the DPP server 302, which can initiate a call to the TTS synthesis engine 155 via the TTS adapter 150. The DPP server 302 will begin streaming audio data from the TTS adapter 150 to the application 205. In operation 815, the DPP server 302 will generate an initial conversation prompt by providing an audible prompt and textual output on the display 112 of the client device 102. The initial conversation prompt can inform the user that the system 100 is ready to receive a user query; for example, the initial conversation prompt can include “Hello. Welcome to ACME shoes. How may I help you?”. - In
operation 820, the client device 102 can receive data characterizing the utterance of a query associated with the tenant as described earlier in the discussion of FIG. 4, operation 405. -
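The validation and prompt-generation steps of FIG. 8 might be sketched as follows; the capability field names and greeting format are assumptions made for illustration.

```python
def validate_client(capabilities: dict) -> bool:
    """Accept the device only if it can render graphics and capture
    audio in real time, as operation 810 requires."""
    return bool(capabilities.get("audio_capture")
                and capabilities.get("graphics_render"))

def initial_prompt(tenant_name: str) -> str:
    """Build the initial conversation prompt of operation 815."""
    return f"Hello. Welcome to {tenant_name}. How may I help you?"
```

A device reporting only one of the two capabilities would fail validation and could fall back to a text-only dialog.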
FIG. 9 is a diagram illustrating an example data flow 900 for receiving and processing a user query using the multi-modal conversational agent system 100 of FIG. 1. Following the configuration of the initial conversation prompt described in relation to FIG. 8, the conversational agent system 100 can receive data characterizing an utterance of a query. The data can be received in the context of a dialog and processed as follows. - In
step 1, in response to the initial conversation prompt generated by the DPP server 302, the client device 102 can receive a user query, such as “I am looking for a pair of elegant shoes for my wife”. The client device 102 can capture the utterance associated with the query via the microphone 114 configured on the client device 102. The captured audio data is streamed by the web application 205 to the DPP server 302 in addition to a profile associated with the tenant. - In
step 2, the DPP server 302 streams the received audio data to the ASR adapter 135. The ASR adapter 135 can provide the audio data to an ASR engine 140 associated with the tenant profile. In some implementations, the ASR engine 140 can be a pre-configured cloud-based ASR engine, such as the Google Cloud ASR offered by Google, LLC of Mountain View, Calif., U.S.A. - In step 3, the
ASR engine 140 processes the audio data in real-time until the user completes the utterance associated with the query. After completing the utterance, the user is likely to pause and await a reply from the conversational agent system 100. The ASR engine 140 can detect the end of the utterance and the subsequent period of silence and can provide the DPP server 302 with the best hypothetical text string corresponding to the user's utterance. In a best-case scenario, the ASR engine 140 can generate a text string which exactly matches the words of the user's utterance. The text string can be combined with other parameters related to the processed utterance. In some implementations, the other parameters can include rankings associated with the recognized speech. The rankings can be dynamically adjusted based on the NLU module 336. For example, the NLU module 336 can process the top hypotheses generated by the ASR engine 140 and can evaluate those hypothetical responses in the context of other responses generated by the NLU module 336, so that the top hypothesis is selected over another hypothesis which has a lower confidence ranking. In some implementations, the parameters can be associated with errors such as phonetically similar words. Small variations in text strings can be mitigated using similarity measures, such as the Levenshtein distance or fuzzy matching algorithms. - In
step 4, the DPP server 302 can provide the text string to the orchestrator component 316 and await a reply. In step 5, the orchestrator component 316 can transmit the text string to the maestro component 334. - In
step 6, the maestro component 334 can provide the text string to the NLA ensemble 145 for response processing. The NLA ensemble 145 can determine the current state of the dialog via the DM module 338 and can generate a contextually appropriate textual response to the query. The NLA ensemble 145 can also, in some implementations, generate graphical content associated with the query and the dialog context to be displayed on the display 112 of the client device 102. The textual response and the corresponding graphical content can be provided in a device-agnostic format. The NLA ensemble 145 can determine that the contextually appropriate textual response to the query is “I can help you with that. What size does she usually wear?” - In
step 6a, the dialog processing platform 120 can perform an authentication of the user. The orchestrator component 316 can be granted access to the user's account in the event that the user's query requires information associated with a specific order or account. For example, if the user utters “When will my order arrive?”, the maestro component 334 can interpret the utterance and query, via the orchestrator component 316 (as in step 6), and can prompt the user to provide account authentication credentials in order to determine the status of the order in step 6a. After access has been granted in step 6b, the orchestrator component 316 can cache the authentication token for the duration of the dialog session to avoid repeating the authentication steps for other queries. - In
step 7, the orchestrator component 316 can format the textual response and graphical content into a suitable format for the configuration of the client device 102. For example, the orchestrator component 316 can apply tenant-defined brand customizations provided via the customer portal 320. The customizations can specify a color palette, font style, images and image formatting, and the TTS synthesis engines 155 to use, which may include one or more alternate voice dialects. - In
step 8, based on the format of the textual response and the graphical content provided by the orchestrator component 316, the DPP server 302 can provide the textual response to the TTS adapter 150 to initiate speech synthesis processing by the TTS synthesis engines 155 to generate a verbalized query response. In some implementations, the TTS synthesis engines 155 can be remotely located from the DPP server 302, such as when configured in a cloud-based, distributed conversational agent system. The DPP server 302 can also provide, in step 10, the textual response graphically with the appropriate formatting on the display 112 of the client device 102. - In step 9, the
TTS adapter 150 can begin retrieving audio data associated with the verbalized query response from the TTS synthesis engine 155 in response to a request from the DPP server 302. The TTS adapter 150 can subsequently provide, or stream, the verbalized query response to the DPP server 302. - In
step 10, the DPP server 302 can act as a proxy by sending the verbalized query response to the web application 205 on the client device 102. The web application 205 can provide the verbalized query response to the user via the output device 116, audibly informing the user “I can help you with that. What size does she usually wear?”. - Steps 1-10 can be performed in an iterative manner via the
client device 102 and the dialog processing platform 120 until the user's query has been fulfilled or the user terminates the dialog session. The web application 205, configured as the conversational agent on the client device 102, can enable the user to switch between speech and text as the input and output modalities in either direction. - Exemplary technical effects of the methods, systems, and computer-readable medium described herein include, by way of non-limiting example, processing a user query using a multi-modal conversational agent system. The conversational agent system can provide scalable, modular natural language processing resources for the multiple tenants to which the user query can be directed. The conversational agent system can provide improved interfaces for processing the user query using distributed natural language resources. The conversational agent system can improve the contextual accuracy of conversational agent dialogs using a catalog-to-dialog data structure incorporated into a machine learning process used to train classification algorithms configured to process the user query and generate query responses. The conversational agent system also provides improved interfaces for tenants to customize conversational agent branding and to provide more accurate dialog responses based on integrated e-commerce data sources such as user account, billing, and customer order data.
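The similarity mitigation mentioned in step 3 can be grounded with the classic dynamic-programming Levenshtein distance; the sketch below is a textbook implementation offered for illustration, not the platform's own code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

A small distance between two ASR hypotheses, such as phonetically similar words, can signal that they should be treated as the same candidate.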
- Certain exemplary embodiments have been described to provide an overall understanding of the principles of the structure, function, manufacture, and use of the systems, devices, and methods disclosed herein. One or more examples of these embodiments have been illustrated in the accompanying drawings. Those skilled in the art will understand that the systems, devices, and methods specifically described herein and illustrated in the accompanying drawings are non-limiting exemplary embodiments and that the scope of the present invention is defined solely by the claims. The features illustrated or described in connection with one exemplary embodiment can be combined with the features of other embodiments. Such modifications and variations are intended to be included within the scope of the present invention. Further, in the present disclosure, like-named components of the embodiments generally have similar features, and thus within a particular embodiment each feature of each like-named component is not necessarily fully elaborated upon.
- The subject matter described herein can be implemented in analog electronic circuitry, digital electronic circuitry, and/or in computer software, firmware, or hardware, including the structural means disclosed in this specification and structural equivalents thereof, or in combinations of them. The subject matter described herein can be implemented as one or more computer program products, such as one or more computer programs tangibly embodied in an information carrier (e.g., in a machine-readable storage device), or embodied in a propagated signal, for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). A computer program (also known as a program, software, software application, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file. A program can be stored in a portion of a file that holds other programs or data, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
- The processes and logic flows described in this specification, including the method steps of the subject matter described herein, can be performed by one or more programmable processors executing one or more computer programs to perform functions of the subject matter described herein by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus of the subject matter described herein can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
- Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processor of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, (e.g., EPROM, EEPROM, and flash memory devices); magnetic disks, (e.g., internal hard disks or removable disks); magneto-optical disks; and optical disks (e.g., CD and DVD disks). The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- To provide for interaction with a user, the subject matter described herein can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, (e.g., a mouse or a trackball), by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input.
- The techniques described herein can be implemented using one or more modules. As used herein, the term “module” refers to computing software, firmware, hardware, and/or various combinations thereof. At a minimum, however, modules are not to be interpreted as software that is not implemented on hardware, firmware, or recorded on a non-transitory processor readable recordable storage medium (i.e., modules are not software per se). Indeed “module” is to be interpreted to always include at least some physical, non-transitory hardware such as a part of a processor or computer. Two different modules can share the same physical hardware (e.g., two different modules can use the same processor and network interface). The modules described herein can be combined, integrated, separated, and/or duplicated to support various applications. Also, a function described herein as being performed at a particular module can be performed at one or more other modules and/or by one or more other devices instead of or in addition to the function performed at the particular module. Further, the modules can be implemented across multiple devices and/or other components local or remote to one another. Additionally, the modules can be moved from one device and added to another device, and/or can be included in both devices.
- The subject matter described herein can be implemented in a computing system that includes a back-end component (e.g., a data server), a middleware component (e.g., an application server), or a front-end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described herein), or any combination of such back-end, middleware, and front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- Approximating language, as used herein throughout the specification and claims, can be applied to modify any quantitative representation that could permissibly vary without resulting in a change in the basic function to which it is related. Accordingly, a value modified by a term or terms, such as “about,” “approximately,” and “substantially,” are not to be limited to the precise value specified. In at least some instances, the approximating language can correspond to the precision of an instrument for measuring the value. Here and throughout the specification and claims, range limitations can be combined and/or interchanged, such ranges are identified and include all the sub-ranges contained therein unless context or language indicates otherwise.
- One skilled in the art will appreciate further features and advantages of the invention based on the above-described embodiments. Accordingly, the present application is not to be limited by what has been particularly shown and described, except as indicated by the appended claims. All publications and references cited herein are expressly incorporated by reference in their entirety.
Claims (20)
1. A method comprising:
receiving, by a multitenant remote server including executable instances of natural language resources, data characterizing a query by a first user and associated with a first tenant, the multitenant remote server including a tenant portal enabling the first user configuration of tenant data;
deploying, responsive to the receiving, a first instance of an executable natural language resource configured to receive a text string characterizing the query and determine a textual response to the query;
providing, to an automated speech recognition engine via the multitenant remote server, the received data;
receiving, from the automated speech recognition engine, a text string characterizing the query; and
processing, via the first instance of the executable natural language agent ensemble, the text string characterizing the query to determine a textual response to the query, the textual response including at least one word from a lexicon associated with the first tenant.
2. The method of claim 1 , further comprising
providing, to a text-to-speech synthesis engine, the textual response;
receiving, from the text-to-speech synthesis engine, a verbalized query response determined by the text-to-speech synthesis engine based on the textual response; and
providing the verbalized query response.
3. The method of claim 2 , wherein the text-to-speech synthesis engine is configured to receive the textual response, and to generate, in response to the receiving, the verbalized query response including audio data corresponding to the received textual response, the text-to-speech synthesis engine being selected from one or more inter-changeable speech processing engines included in the profile.
4. The method of claim 2 , further comprising:
receiving, prior to receiving data characterizing the query, an input to a web site provided via a web browser configured on a first computing device, the input causing the web browser to be authenticated and registered at a second computing device coupled to the first computing device via a network.
5. The method of claim 4 , further comprising:
receiving, by the second computing device, validation data associated with the first computing device, the validation data including audio and graphical rendering settings configured on the first computing device;
generating, in response to confirming the validation data, an initial conversation prompt by the second computing device and providing the initial conversation prompt to the web site configured on the first computing device;
receiving, at an input device coupled to the first computing device and in response to providing the initial conversation prompt via the web site, the data characterizing an utterance of the query, the query associated with an item available via the web site;
transmitting the provided verbalized query response to the first computing device; and
providing the verbalized query response via an output device coupled to the first computing device.
6. The method of claim 1 , further comprising
providing a first configuration of a graphical user interface on a first client device, the client device configured to receive the utterance from a user.
7. The method of claim 1 , wherein processing the text string characterizing the query further comprises:
generating a semantic interpretation associated with the text string;
determining a first contextual sequence associated with the text string based on one or more previously processed text strings;
generating a first response action based on the determined first contextual sequence; and
generating the textual response based on the generated first response action.
8. The method of claim 7 , wherein the semantic interpretation is generated using a first data structure representing the lexicon associated with the first tenant.
9. The method of claim 8 , wherein the first data structure is generated based on at least one of: a catalog of items associated with the first tenant and including a first item title and a first item description, one or more reviews associated with a first item, interactive user data associated with a first item, or a combination thereof.
10. The method of claim 9 , wherein generating the first data structure includes
determining one or more attributes associated with a first item from the catalog of items;
determining one or more synonyms associated with the first item from the catalog of items;
determining one or more referring expressions associated with the first item from the catalog of items and/or the interactive user data associated with the first item;
generating the first data structure based on the determining steps, the first data structure including a name, one or more attributes, one or more synonyms, one or more referring expressions, and/or one or more dialogs corresponding to the first item.
11. The method of claim 8 , wherein the first data structure is used to train at least one of a plurality of classification algorithms.
12. The method of claim 1 , further comprising:
receiving second data characterizing an utterance of a second query associated with a second tenant;
providing, to a second automated speech recognition engine, the received second data;
receiving, from the second automated speech recognition engine, a second text string characterizing the second query; and
processing, via a second instance of the natural language agent ensemble configured based on the second tenant, the second text string characterizing the second query to determine a second textual response to the second query, the second textual response including at least one word from a second lexicon associated with the second tenant.
13. The method of claim 1 , wherein the query includes a plurality of natural language words spoken by the first user and received by an input device of a first computing device, the query provided by the first user in regard to a first context associated with a first item provided by the first tenant.
14. The method of claim 1 , wherein the received data is provided to the automated speech recognition engine with a profile selected from a plurality of profiles based on the first tenant, the profile configuring the automated speech recognition engine to process the received data.
15. The method of claim 14 , wherein the profile includes one or more configuration settings associated with the first instance of the natural language agent ensemble configured on a server including a data processor, one or more configuration settings associated with the natural language agent ensemble configured on a first computing device, or one or more configuration settings specifying one or more speech processing engines configured on a server including a data processor.
16. The method of claim 1 , wherein the first tenant includes at least one of a retail entity, a service provider entity, a financial entity, a manufacturing entity, an entertainment entity, an information storage entity, and a data processing entity.
17. The method of claim 1 , wherein the automated speech recognition engine is configured to receive audio data corresponding to the query and to generate, in response to the receiving, the text string including textual data corresponding to the received audio data, the automated speech recognition engine being selected from one or more interchangeable speech processing engines.
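Claim 17's interchangeable speech processing engines imply a common interface behind which recognizers can be swapped without touching the rest of the pipeline. One way to sketch that contract (engine names and transcripts here are hypothetical):

```python
# Minimal sketch of interchangeable speech-processing engines behind a common
# interface, so the platform can swap recognizers per tenant profile.
from abc import ABC, abstractmethod


class SpeechRecognizer(ABC):
    @abstractmethod
    def transcribe(self, audio: bytes) -> str:
        """Return a text string characterizing the query audio."""


class EngineA(SpeechRecognizer):
    def transcribe(self, audio: bytes) -> str:
        return f"engine-a transcript ({len(audio)} bytes)"


class EngineB(SpeechRecognizer):
    def transcribe(self, audio: bytes) -> str:
        return f"engine-b transcript ({len(audio)} bytes)"


ENGINES = {"a": EngineA, "b": EngineB}


def recognizer_for(engine_key: str) -> SpeechRecognizer:
    # The engine key would come from the tenant profile (cf. claims 14-15).
    return ENGINES[engine_key]()


print(recognizer_for("a").transcribe(b"\x00\x01"))
```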
18. The method of claim 1 , wherein the data characterizing the query associated with the first tenant is provided via a textual interaction modality or via a speech interaction modality.
19. A system comprising:
at least one data processor; and
memory storing instructions, which, when executed by the at least one data processor, cause the at least one data processor to perform operations comprising:
receiving, by a multitenant remote server including executable instances of natural language resources, data characterizing a query by a first user and associated with a first tenant, the multitenant remote server including a tenant portal enabling configuration of tenant data by the first user;
deploying, responsive to the receiving, a first instance of an executable natural language agent ensemble configured to receive a text string characterizing the query and determine a textual response to the query;
providing, to an automated speech recognition engine via the multitenant remote server, the received data;
receiving, from the automated speech recognition engine, a text string characterizing the query; and
processing, via the first instance of the executable natural language agent ensemble, the text string characterizing the query to determine a textual response to the query, the textual response including at least one word from a lexicon associated with the first tenant.
20. The system of claim 19 , the operations further comprising:
providing, to a text-to-speech synthesis engine, the textual response;
receiving, from the text-to-speech synthesis engine, a verbalized query response determined by the text-to-speech synthesis engine based on the textual response; and
providing the verbalized query response.
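Claims 19 and 20 together describe the full round trip: query audio goes through automated speech recognition, the tenant-configured agent ensemble produces a textual response containing a word from the tenant's lexicon, and text-to-speech synthesis verbalizes the result. A compact end-to-end sketch with stand-in components (none of these function bodies reflect the patent's actual implementation):

```python
# End-to-end sketch of the claimed pipeline: audio in, ASR to text, NLU to a
# textual response, TTS back out. All components are illustrative stand-ins.
def asr(audio: bytes) -> str:
    # Stand-in for the automated speech recognition engine.
    return "what are your hours"


def nlu(text: str, lexicon: set) -> str:
    # The textual response must include at least one word from the
    # tenant-associated lexicon, per the claims.
    return f"Our {sorted(lexicon)[0]} is open 9-5. You asked: {text}"


def tts(text: str) -> bytes:
    # Stand-in for the text-to-speech synthesis engine.
    return text.encode("utf-8")


def handle_query(audio: bytes, lexicon: set) -> bytes:
    text = asr(audio)
    response = nlu(text, lexicon)
    return tts(response)


out = handle_query(b"...", {"store"})
print(out.decode())
```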
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/500,352 US20220036899A1 (en) | 2019-11-26 | 2021-10-13 | Multi-modal conversational agent platform |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/696,482 US11176942B2 (en) | 2019-11-26 | 2019-11-26 | Multi-modal conversational agent platform |
US17/500,352 US20220036899A1 (en) | 2019-11-26 | 2021-10-13 | Multi-modal conversational agent platform |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/696,482 Continuation US11176942B2 (en) | 2019-11-26 | 2019-11-26 | Multi-modal conversational agent platform |
Publications (1)
Publication Number | Publication Date |
---|---|
US20220036899A1 true US20220036899A1 (en) | 2022-02-03 |
Family
ID=73793823
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/696,482 Active US11176942B2 (en) | 2019-11-26 | 2019-11-26 | Multi-modal conversational agent platform |
US17/500,352 Abandoned US20220036899A1 (en) | 2019-11-26 | 2021-10-13 | Multi-modal conversational agent platform |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/696,482 Active US11176942B2 (en) | 2019-11-26 | 2019-11-26 | Multi-modal conversational agent platform |
Country Status (2)
Country | Link |
---|---|
US (2) | US11176942B2 (en) |
WO (1) | WO2021108163A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11974025B2 (en) | 2007-04-17 | 2024-04-30 | Intent IQ, LLC | Targeted television advertisements based on online behavior |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9805125B2 (en) | 2014-06-20 | 2017-10-31 | Google Inc. | Displaying a summary of media content items |
US10206014B2 (en) * | 2014-06-20 | 2019-02-12 | Google Llc | Clarifying audible verbal information in video content |
US11386887B1 (en) | 2020-03-23 | 2022-07-12 | Amazon Technologies, Inc. | Natural language processing using context |
US11908480B1 (en) * | 2020-03-23 | 2024-02-20 | Amazon Technologies, Inc. | Natural language processing using context |
US11934806B2 (en) | 2020-03-30 | 2024-03-19 | Microsoft Technology Licensing, Llc | Development system and method |
US20220075960A1 (en) * | 2020-09-09 | 2022-03-10 | Achieve Intelligent Technologies, Inc. | Interactive Communication System with Natural Language Adaptive Components |
US20220114349A1 (en) * | 2020-10-09 | 2022-04-14 | Salesforce.Com, Inc. | Systems and methods of natural language generation for electronic catalog descriptions |
US20220335933A1 (en) * | 2021-04-16 | 2022-10-20 | At&T Intellectual Property I, L.P. | Customer support using a cloud-based message analysis model |
US11922476B2 (en) * | 2021-07-01 | 2024-03-05 | Capital One Services, Llc | Generating recommendations based on descriptors in a multi-dimensional search space |
CN113630306A (en) * | 2021-07-28 | 2021-11-09 | 北京达佳互联信息技术有限公司 | Information processing method, information processing device, electronic equipment and storage medium |
US11417337B1 (en) * | 2021-08-12 | 2022-08-16 | Cresta Intelligence Inc. | Initiating conversation monitoring system action based on conversational content |
US20230146336A1 (en) * | 2021-11-11 | 2023-05-11 | Maplebear Inc. (Dba Instacart) | Directly identifying items from an item catalog satisfying a received query using a model determining measures of similarity between items in the item catalog and the query |
US12022026B2 (en) | 2022-03-18 | 2024-06-25 | Capital One Services, Llc | System and method for serving multiple customers by a live agent |
US20230350928A1 (en) * | 2022-04-28 | 2023-11-02 | Knowbl LLC | Systems and methods for implementing a virtual agent performing context and query transformations using unsupervised machine learning models |
US20240005925A1 (en) * | 2022-06-30 | 2024-01-04 | Cdw Llc | Techniques for providing natural language understanding (nlu) services to contact centers |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190156210A1 (en) * | 2017-11-17 | 2019-05-23 | Facebook, Inc. | Machine-Learning Models Based on Non-local Neural Networks |
US20190213999A1 (en) * | 2018-01-08 | 2019-07-11 | Apple Inc. | Multi-directional dialog |
US20200027553A1 (en) * | 2018-07-18 | 2020-01-23 | International Business Machines Corporation | Dynamic selection of virtual agents in a multi-domain expert system |
Family Cites Families (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7869998B1 (en) * | 2002-04-23 | 2011-01-11 | At&T Intellectual Property Ii, L.P. | Voice-enabled dialog system |
US20040054534A1 (en) * | 2002-09-13 | 2004-03-18 | Junqua Jean-Claude | Client-server voice customization |
US7421393B1 (en) * | 2004-03-01 | 2008-09-02 | At&T Corp. | System for developing a dialog manager using modular spoken-dialog components |
US7412393B1 (en) * | 2004-03-01 | 2008-08-12 | At&T Corp. | Method for developing a dialog manager using modular spoken-dialog components |
US8909748B1 (en) * | 2006-06-22 | 2014-12-09 | Emc Corporation | Configurable views of context-relevant content |
US9177551B2 (en) * | 2008-01-22 | 2015-11-03 | At&T Intellectual Property I, L.P. | System and method of providing speech processing in user interface |
US9858925B2 (en) * | 2009-06-05 | 2018-01-02 | Apple Inc. | Using context information to facilitate processing of commands in a virtual assistant |
US10241752B2 (en) * | 2011-09-30 | 2019-03-26 | Apple Inc. | Interface for a virtual digital assistant |
WO2013155619A1 (en) | 2012-04-20 | 2013-10-24 | Sam Pasupalak | Conversational agent |
US10157175B2 (en) * | 2013-03-15 | 2018-12-18 | International Business Machines Corporation | Business intelligence data models with concept identification using language-specific clues |
US9412358B2 (en) * | 2014-05-13 | 2016-08-09 | At&T Intellectual Property I, L.P. | System and method for data-driven socially customized models for language generation |
US10482184B2 (en) | 2015-03-08 | 2019-11-19 | Google Llc | Context-based natural language processing |
US9984116B2 (en) * | 2015-08-28 | 2018-05-29 | International Business Machines Corporation | Automated management of natural language queries in enterprise business intelligence analytics |
WO2017112813A1 (en) * | 2015-12-22 | 2017-06-29 | Sri International | Multi-lingual virtual personal assistant |
US10007607B2 (en) * | 2016-05-31 | 2018-06-26 | Salesforce.Com, Inc. | Invalidation and refresh of multi-tier distributed caches |
US10270864B2 (en) * | 2016-06-21 | 2019-04-23 | Oracle International Corporation | Internet cloud-hosted natural language interactive messaging system server collaboration |
US10115400B2 (en) * | 2016-08-05 | 2018-10-30 | Sonos, Inc. | Multiple voice services |
US10115393B1 (en) * | 2016-10-31 | 2018-10-30 | Microsoft Technology Licensing, Llc | Reduced size computerized speech model speaker adaptation |
US10331791B2 (en) | 2016-11-23 | 2019-06-25 | Amazon Technologies, Inc. | Service for developing dialog-driven applications |
US11183181B2 (en) * | 2017-03-27 | 2021-11-23 | Sonos, Inc. | Systems and methods of multiple voice services |
KR102398649B1 (en) * | 2017-03-28 | 2022-05-17 | 삼성전자주식회사 | Electronic device for processing user utterance and method for operation thereof |
US11727513B2 (en) * | 2017-05-13 | 2023-08-15 | Regology, Inc. | Method and system for facilitating implementation of regulations by organizations |
US10958743B2 (en) * | 2017-07-31 | 2021-03-23 | Fanplayr Inc. | Method and system for segmentation as a service |
US11048663B2 (en) * | 2017-11-15 | 2021-06-29 | Salesforce.Com, Inc. | Database systems and methods for automated database modifications |
EP3598437A4 (en) * | 2018-01-16 | 2020-05-13 | SONY Corporation | Information processing device, information processing system, information processing method, and program |
KR102508677B1 (en) * | 2018-03-08 | 2023-03-13 | 삼성전자주식회사 | System for processing user utterance and controlling method thereof |
US10733018B2 (en) * | 2018-04-27 | 2020-08-04 | Paypal, Inc. | Systems and methods for providing services in a stateless application framework |
US10284541B1 (en) * | 2018-07-09 | 2019-05-07 | Capital One Services, Llc | System and method for generating enhanced distributed online registry |
US10672392B2 (en) * | 2018-07-23 | 2020-06-02 | Motorola Solutions, Inc. | Device, system and method for causing an output device to provide information for voice command functionality |
US11120788B2 (en) * | 2019-05-02 | 2021-09-14 | Microsoft Technology Licensing, Llc | Organizational-based language model generation |
- 2019-11-26 US US16/696,482 patent/US11176942B2/en active Active
- 2020-11-17 WO PCT/US2020/060820 patent/WO2021108163A1/en active Application Filing
- 2021-10-13 US US17/500,352 patent/US20220036899A1/en not_active Abandoned
Also Published As
Publication number | Publication date |
---|---|
WO2021108163A1 (en) | 2021-06-03 |
US11176942B2 (en) | 2021-11-16 |
US20210158811A1 (en) | 2021-05-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11176942B2 (en) | Multi-modal conversational agent platform | |
US20230237348A1 (en) | Chatbot for defining a machine learning (ml) solution | |
US11461420B2 (en) | Referring expression generation | |
JP2023520416A (en) | Improved techniques for out-of-domain (OOD) detection | |
CN114424185A (en) | Stop word data augmentation for natural language processing | |
WO2021252845A1 (en) | Entity level data augmentation in chatbots for robust named entity recognition | |
CN115398419A (en) | Method and system for object-based hyper-parameter tuning | |
US11972467B2 (en) | Question-answer expansion | |
WO2022159485A1 (en) | Context tag integration with named entity recognition models | |
CN115398436A (en) | Noise data augmentation for natural language processing | |
US20230186161A1 (en) | Data manufacturing frameworks for synthesizing synthetic training data to facilitate training a natural language to logical form model | |
CN116547676A (en) | Enhanced logic for natural language processing | |
CN116583837A (en) | Distance-based LOGIT values for natural language processing | |
US20230100508A1 (en) | Fusion of word embeddings and word scores for text classification | |
US20240061833A1 (en) | Techniques for augmenting training data for aggregation and sorting database operations in a natural language to database query system | |
CN116635862A (en) | Outside domain data augmentation for natural language processing | |
WO2023076754A1 (en) | Deep learning techniques for extraction of embedded data from documents | |
CN116235164A (en) | Out-of-range automatic transition for chat robots | |
US20230368773A1 (en) | Methods and systems for generating personal virtual agents | |
US20230351184A1 (en) | Query Classification with Sparse Soft Labels | |
US20230161963A1 (en) | System and techniques for handling long text for pre-trained language models | |
JP2024503517A (en) | Multifactor modeling for natural language processing | |
US20230134149A1 (en) | Rule-based techniques for extraction of question and answer pairs from data | |
US20240282298A1 (en) | Systems and methods for conversation orchestration using large language models | |
US20240062112A1 (en) | Adaptive training data augmentation to facilitate training named entity recognition models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
AS | Assignment |
Owner name: VUI, INC., MASSACHUSETTS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:DE FABBRIZIO, GIUSEPPE;STEPANOV, EVGENY;TORTORETO, GIULIANO;AND OTHERS;SIGNING DATES FROM 20200107 TO 20200115;REEL/FRAME:058159/0397 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |