CN112470216A - Voice application platform


Info

Publication number
CN112470216A
Authority
CN
China
Prior art keywords
voice assistant
voice
response
request
content items
Legal status
Pending
Application number
CN201980049296.7A
Other languages
Chinese (zh)
Inventor
R.T. Norton
N.G. Laidlaw
A.M. Dunn
J.K. McMahon
Current Assignee
SOUND LLC
Original Assignee
SOUND LLC
Priority claimed from US16/000,798 (US10235999B1)
Priority claimed from US16/000,789 (US10803865B2)
Priority claimed from US16/000,805 (US11437029B2)
Priority claimed from US16/000,799 (US10636425B2)
Application filed by SOUND LLC filed Critical SOUND LLC
Publication of CN112470216A publication Critical patent/CN112470216A/en

Classifications

    • G06F16/9024: Information retrieval; database structures; indexing and data structures therefor; graphs; linked lists
    • G06F16/909: Retrieval characterised by using metadata, e.g. geographical or spatial information such as location
    • G06F3/167: Audio in a user interface, e.g. using voice commands for navigating, audio feedback

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Among other things, requests are received from voice assistant devices, expressed according to different respective protocols of one or more voice assistant frameworks. Each request represents a voice input by a user to the corresponding voice assistant device. The received requests are re-expressed according to a common request protocol. Based on the received requests, responses to the requests are expressed according to a common response protocol. Each response is re-expressed according to the protocol of the framework in which the corresponding request was expressed. The responses are sent to the voice assistant devices for presentation to the users.

Description

Voice application platform
Technical Field
The present application relates to voice application platforms.
Background
The voice application platform provides services to the voice assistant and voice assistant device to enable them to listen to and respond to the end user's speech. The response may be spoken or presented as text, images, audio, and video (content item). In some cases, the response involves an action such as shutting down the device.
Voice assistants, such as Siri for apple, Alexa for amazon, Cortana for microsoft, and Assistant for google, are hosted on servers and accessed through dedicated voice assistant devices, such as the amazon Echo and apple HomePod, or sometimes on general-purpose workstations and mobile devices.
The voice assistant device typically has a microphone, speaker, processor, memory, communications facilities, and other hardware and software. The voice assistant device may detect and process a person's speech to derive information representing the end user's request, represent the information as a request message (sometimes referred to as an intent or containing an intent) according to a predefined protocol, and communicate the request message to a server over a communication network.
At the server, the voice application receives and processes the request message and determines the appropriate response. The response is included in a response message expressed according to a predefined protocol. The response message is sent to the voice assistant device over the communication network. The voice assistant interprets the response message and speaks or presents the response (or takes the action specified by the response). The operation of the voice application is supported by the operating system's infrastructure and other processes running on the server.
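The patent does not give a concrete message schema at this point, but a hedged sketch may help make the request/response flow concrete. The field names below (framework, intent, slots, outputSpeech, and so on) are assumptions for illustration only, not the protocol of any particular voice assistant framework.

```python
# Minimal sketch (not the patent's actual schema): a hypothetical request message
# as a voice assistant device/framework might send it, and the response message a
# voice application could return. All field names here are illustrative assumptions.

incoming_request = {
    "framework": "alexa",                 # which voice assistant framework sent it
    "intent": "GetEvents",                # the end user's derived intent
    "slots": {"date": "2019-06-04"},      # parameters extracted from the speech
    "session": {"new": True, "userId": "GUID-1234"},
}

response_message = {
    "outputSpeech": "There are two events on campus on Tuesday.",
    "card": {"type": "Simple", "title": "Campus events"},
    "shouldEndSession": False,            # keep the dialog open for follow-ups
}
```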
The services provided by the server to the client voice assistant devices to enable their interaction with the end user are sometimes referred to as voice assistant services (sometimes also referred to as or including skills, actions, or voice applications).
The interaction between the end user and the voice assistant may include a series of requests and responses. In some cases, the request is a question posed by the end user and the response is an answer to the question.
Typically, the server, voice assistant device, voice assistant services, predefined protocols, and basic voice applications are designed together as part of a dedicated voice assistant framework. To enable third parties, such as brands that want to interact with end users through a voice assistant, to create their own voice applications, the framework provides specialized APIs.
Disclosure of Invention
In some implementations, the generic voice application platform we describe here gives brands and organizations the ability to create and maintain, in one place, voice applications in which amazon Alexa, google assistant, apple HomePod, microsoft Cortana, and other devices can participate. The platform is designed to give brands and organizations the ability to deploy voice applications rapidly while providing flexibility via customization capabilities.
The platform provides features for processing voice requests and packages them within modules. The features include processors that handle voice requests for events, FAQs, daily updates, reminders, checklists, surveys, and latest messages, among other predefined features. Modules package references to features based on common usage related to industry-specific needs and include sample content that enables brands and organizations to reach market quickly.
Brand authors can manage voice content within the platform's voice content management system. The voice content management system provides an intuitive interface that requires no technical knowledge to create, modify, and remove the content that shapes the voice experience. The platform's content management system also provides guidance and insight to brand managers based on end-user usage analytics captured over time. The guidance includes cues such as visual indicators of which devices support a given media type (e.g., video and image media supported by the amazon Echo Show). The insight includes analysis of success rates of responses to a given question across device types (e.g., the insight that google assistant responds to the same question more successfully than amazon's Alexa).
Behind the scenes, the platform is cloud-based, eliminating the need for brands and organizations to invest in additional infrastructure. The cloud-based provisioning also means that periodic updates and enhancements become automatically available to the brands and organizations that are users of the platform.
The platform uses a layered architecture in which each layer does not rely on dependencies from the other layers of the system. The layers include a voice API layer, a business logic layer, a feature and module layer, a CMS layer, and a data layer.
The unique aspects of the platform are as follows:
1. The platform processes data from multiple voice assistant frameworks (such as Alexa, google Home, apple HomePod, and chat bots) through a single API/business logic layer. The platform refines the data and processes it to enhance the understanding of the end user's intent. In contrast to a rule-based engine, the platform uses graph-based pattern matching. Graph-based pattern matching provides a method of managing the consistency and confidence of the mapping between assistant intents and the features of the platform being used. This makes voice applications more manageable and updatable while still leaving the flexibility for machine learning to update the locations of nodes in the graph. The graph-based approach requires only one step to support a newly added voice assistant framework: new nodes (data points) are added to the graph database to create connections from the voice intents received from end users.
2. Because the platform accesses data from multiple voice assistant frameworks, it can compare how some frameworks perform relative to others. For example, the platform can see the failure rates of different voice applications and features across the various voice assistant frameworks; as a result, machine learning and algorithms can be used to understand the intent of end users better than the particular voice assistant framework they are using. This is done by detecting the success and failure patterns of each framework for the same type of content and determining what changes would make it more successful, which allows finding the best superset of content changes that fit all supported frameworks.
3. Because the platform collects performance data across multiple devices through a single API, it can collect and analyze performance and efficiently provide content recommendations. The platform uses machine learning and its own algorithms to assess how one voice application performs relative to another and to make real-time dynamic content suggestions directly to the voice application developer within the platform's user interface. This can optimize the performance of voice applications and enhance the overall end-user experience.
4. The platform supports collections of dynamic content, providing more than one way to answer a question or give a response. This creates a more engaging voice experience because the prompts and responses can change from session to session. It also allows the creation and alteration of roles that depend on the preferences and demographics of the end user. In contrast, on a typical existing platform, if ten end users ask Alexa the same question, the voice assistant will interact in the same manner ten times. The platform described here allows a voice application developer to set an unlimited number of different responses for each of the ten users, and the responses can even be personalized for each particular individual. For example, if the platform determines that the end user is a 35-year-old female living in the state of georgia, the developer may decide that the end user is likely to be more comfortable speaking with another female who has a southern accent and who uses a local spoken language and local references. The platform allows developers to change the words used by a particular voice platform when speaking to an end user. Developers can also use the platform to record amateur or professional voice talent with the relevant gender, accent, dialect, and the like. The result is a more trusted, human interaction between end users and their voice assistant devices.
5. The platform natively supports multi-language content for prompts and responses. This is useful for reaching more listeners in the united states and worldwide. It also creates a more inclusive and human experience between end users and their voice assistant devices. The multi-language support, along with the ability to add, modify, and remove multi-language content, is built into the interface for non-english-speaking administrators.
6. The platform provides a rapid launch with sample content via predefined modules, and flexibility via customization. The platform lets developers create a customized voice experience using the predefined modules and the platform's content management system, or using a combination of their own modules and content that interface with the platform via an API. This is important because it enables voice application creators and administrators to create and manage a more customized and trusted voice experience, which ultimately benefits the end user.
7. Using human speech for prompts and responses results in a more trusted and engaging experience than AI computer speech. The platform allows administrators to create and edit audio and video content directly within the platform; there is no need to leave the platform to create new content. An administrator can create voice interactions in a voice application, including creating rich media (audio and video) content, all in one place. In typical existing systems, administrators must create audio and video assets outside of the voice application platform. The platform enables administrators to add media directly within the platform and its user interface, increasing efficiency and speed to market. In addition, this ultimately results in a deeper, richer voice experience for the end user.
8. Voice assistant devices vary in how they handle multimedia based on their internal hardware. One device may support video, audio, images, and text, while another may support only text and audio. The platform provides media guidance, directly and in real time in the platform's user interface, as to whether particular content within the platform is supported by a particular voice assistant device and framework. This gives the user important information about what content he or she should focus on, while learning how to optimize the experience on a particular voice assistant device.
Thus, in general, in an aspect, requests are received from voice assistant devices, expressed according to respective protocols of one or more voice assistant frameworks. Each request represents a voice input by a user to the corresponding voice assistant device. The received requests are re-expressed according to a common request protocol. Based on the received requests, responses to the requests are expressed according to a common response protocol. Each response is re-expressed according to the protocol of the framework in which the corresponding request was expressed. The responses are sent to the voice assistant devices for presentation to the users.
Implementations may include one or a combination of two or more of the following features. The requests are expressed according to respective protocols of two or more voice assistant frameworks. The voice assistant frameworks include a framework of at least one of amazon, apple, google, microsoft, or chat robot developers. The generation of the responses includes traversing a graph using information from the requests. Traversing the graph includes identifying features to be used to implement the responses. The features are organized in modules. At least one module is predefined. At least one module is custom defined. At least one module includes a collection of predefined features and predefined content items tailored to a particular industry or organization. The features include information about content items to be included in the responses. The features include information about dynamic content items to be included in the responses. At least one content item is predefined. At least one content item is custom defined. The generation of the responses to the requests includes executing a voice application. The voice application includes a collection of functions that generate responses to requests spoken by a person. The generated responses include word output. The generated responses trigger other functions while providing word and sentence output. The instructions are executable by the processor to: receive data for two or more frameworks regarding the requests and corresponding responses, and analyze the received data to determine comparative performance of the responses for the frameworks. The performance includes performance of one or more voice assistant frameworks. The performance includes performance of one or more features used to implement the responses. The performance includes performance of one or more content items included in the responses. The performance includes performance of one or more voice applications.
The instructions are executable by the processor to expose, at a user interface of the voice application platform, features for selection and management of content items to be included in the responses. Information about the relative performance of respective content items, associated with characteristics of the content items, is presented through the user interface in real time as the content items are being selected or managed. Information about the selected or managed content items is received through the user interface. The voice application is executed to generate responses that include presentation of the selected and managed content items. The user interface is configured to enable a person without technical training to select or manage content items and to provide and receive information about the content items. The instructions are executable by the processor to enable selection of the content item to be included in a given one of the responses from among selectable possible content items. The selection of the content item to be included in a given response is based on the context of the end user's voice input. The context of the end user's voice input includes the geographic location of the voice assistant device to which the response is to be sent. The context of the end user's voice input includes demographic characteristics of the end user.
The instructions are executable by the processor to present a user interface configured to (a) enable creation of voice applications for processing the requests and for generating corresponding responses, (b) maintain modules of features with which the requests can be matched to generate responses, including standard modules and custom modules, (c) include in each module a set of features corresponding to a context in which responses are presented to end users, and (d) present the modules through the user interface.
The instructions are executable by the processor to expose, at a user interface of the voice application platform, features for enabling selection and management of content items to be included in the responses. Each content item requires a voice assistant device with corresponding content presentation capabilities. During selection and management of the content items, information regarding the capabilities of voice assistant devices that conform to various different voice assistant frameworks to present the selected and managed content items is presented simultaneously through the user interface. The voice application platform guides users without technical training with respect to the capabilities of the voice assistant frameworks and how they will present images, audio, video, and other forms of media.
In general, in one aspect, requests for service are received over a communication network from voice assistant devices that conform to one or more different voice assistant frameworks. The requests for service are based on speech of end users. The speech of the end users expresses intents. Data derived from the requests for service is used to traverse a graph of nodes and edges to reach features that match the respective requests for service. The features are executed to generate responses. The responses are sent over the communication network to the voice assistant devices to cause them to respond to the respective end users.
Implementations may include one or a combination of two or more of the following features. The voice assistant device from which the request is received conforms to two or more different voice assistant frameworks. Data is derived from a request for service by abstracting the information in the request into a data format that is common across two or more different voice assistant frameworks. The nodes of the graph are updated using the output of the machine learning algorithm. The information about the request is used to identify the initial node of the graph at which traversal begins. The nodes are automatically added to the graph to serve as the initial nodes of the graph that begin traversing with respect to requests that conform to the additional voice assistant framework.
In general, in one aspect, requests for service are received over a communication network from voice assistant devices that conform to one or more different voice assistant frameworks. The requests for service are based on speech of end users. The speech of the end users expresses intents. Responses to the received requests are determined. The responses are configured to be sent over the communication network to the voice assistant devices to cause them to respond to the respective end users. A measure of success of the determination of the responses is evaluated. Based on the relative measure of success of the responses, a user may manage subsequent responses to requests for service through a user interface.
Implementations may include one or a combination of two or more of the following features. The voice assistant devices from which the requests are received conform to two or more different voice assistant frameworks. Proposed responses are presented to the user through the user interface based on the evaluated measure of success, and the user may select a response to be sent to the voice assistant devices based on the proposed responses. The evaluation of the measure of success includes evaluating the success of content items carried by the responses across two or more different voice assistant frameworks. The evaluation of the measure of success includes evaluating the success of the responses relative to the voice assistant frameworks of the voice assistant devices to which the responses are sent. The evaluation of the measure of success includes evaluating the success of the responses relative to two or more different voice applications configured to receive the requests and determine the responses. Content items to be carried in subsequent responses are managed based on the measure of success.
In general, in one aspect, features are presented in a user interface of a voice application platform to enable selection and management of content items to be included in responses provided by a voice application to voice assistant devices conforming to one or more different voice assistant frameworks. While the content items are being selected and managed, information about the relative performance of the respective content items, associated with characteristics of the content items, is presented through the user interface. Information about the selected and managed content items is received through the user interface. The voice application is executed to generate responses that include the selected and managed content items.
Implementations may include one or a combination of two or more of the following features. Usage data is aggregated from voice assistant devices that conform to two or more different voice assistant frameworks. Information is generated regarding the relative performance of individual content items from the aggregated usage data. The usage data is aggregated through a generic API. Information about relative performance is generated by a machine learning algorithm.
In general, in one aspect, a request for service is received over a communication network from a voice assistant device that conforms to one or more disparate voice assistant frameworks. The request for service is based on the end user's speech. The speech of the end user represents the intent. A response to the received request is determined. The responses are configured to be sent over a communication network to the voice assistant apparatus to cause them to respond to the respective end user. The response includes the content item. The content items included in a given one of the responses are selected from the possible content items that are selectable. The selection of the content item to be included in a given response is based on the context of the end user's expressed intent.
Implementations may include one or a combination of two or more of the following features. The voice assistant devices from which the requests are received conform to two or more different voice assistant frameworks. One of the voice assistant frameworks includes a chat robot framework. The context of the end user's expressed intent may include the geographic location of the voice assistant device to which the response is to be sent. The context of the end user's expressed intent may include demographic characteristics of the end user. The demographic characteristics include linguistic characteristics inferred from the geographic location of the voice assistant device to which the response is to be sent or inferred from characteristics of words included in the received request. The demographic characteristics may include age. The linguistic characteristics include a local spoken language or a local reference. The demographic characteristics may include gender. The content item to be included in a given response may be selected based on end user preferences.
In general, in one aspect, a user interface is presented for development of voice applications. The user interface is configured to enable the creation of voice applications for processing requests received from voice assistant devices and for generating corresponding responses for presentation by the voice assistant devices to end users. Modules of features are maintained with which the requests can be matched to generate responses. Each module includes a set of features corresponding to a context in which responses are presented to end users. The maintenance of the modules includes (a) the maintenance of standard modules for respective contexts, and (b) enabling the generation and maintenance of custom modules of features with which requests may be matched to generate custom responses for the voice assistant devices. The modules are presented via the user interface.
Implementations may include one or a combination of two or more of the following features. The content item is maintained for use with the feature when generating the response. The maintaining of the content items includes (a) maintaining standard content items, and (b) enabling the generation and maintenance of customized content items to be used with the features to generate customized responses for the voice assistant apparatus. The context relates to a product or service in the defined market segment. The context relates to the demographics of the target group of end users. The context relates to the performance of the voice assistant apparatus. The context relates to the type of content item to be used with the feature when generating the response.
In general, in one aspect, a user interface is presented for development of a speech application. The user interface is configured to enable the creation of a voice application for processing requests received from the voice assistant device and for generating corresponding responses for presentation by the voice assistant device to the end user. A response to the received request is determined. The responses are configured to be sent over a communication network to the voice assistant apparatus to cause them to respond to the respective end user. The response includes the content item. The user interface enables the creation and editing of the content item in the rich media format for inclusion in the response.
Implementations may include one or a combination of two or more of the following features. Rich media formats include image, audio, and video formats. A user interface is presented through a platform that enables creation of a voice application. The platform enables direct recording and editing of content items within the platform through a user interface area.
In general, in one aspect, features are presented in a user interface of a speech application platform. The features enable selection and management of content items to be included in responses to be provided by a voice application to voice assistant devices that conform to one or more different voice assistant frameworks. Each content item requires a voice assistant device with corresponding content presentation capabilities. While the content item is being selected and managed, information is simultaneously presented via the user interface regarding the capabilities of the voice assistant devices that conform to the various different voice assistant frameworks to present the content item being selected and managed.
Implementations may include one or a combination of two or more of the following features. The voice assistant apparatus to which the response is to be provided conforms to two or more different voice assistant frameworks. The content rendering capabilities include the capabilities of the hardware and software of the voice assistant device. The content presentation performance relates to the type of content item. The types of content items include text, images, audio, and video.
In general, in one aspect, a user interface is presented for development of a speech application. The user interface is configured to enable the creation of a voice application for processing requests received from the voice assistant device and for generating corresponding responses for presentation by the voice assistant device to the end user. A response to the received request is determined. The responses are configured to be sent over a communication network to the voice assistant apparatus so that they are responsive to the respective end user, the responses including content items expressed in natural language. The user interface enables a user to select and manage the presentation of one or more content items in any of two or more natural languages.
Implementations may include one or a combination of two or more of the following features. The user interface is presented in any one of two or more different natural languages. Each content item is represented according to a data model. The representation of each content item inherits an object that includes properties of the natural language of the content item.
These and other aspects, features and implementations may be expressed as methods, apparatus, systems, components, program products, methods of doing business, means or steps for performing functions, and in other ways.
These and other aspects, features and implementations will become apparent from the following description, including the claims.
Drawings
Fig. 1, 2 to 10, 14 to 21, and 29 to 32 are block diagrams.
Fig. 11A, 11B, 12, and 13 are examples of code.
Fig. 22 to 28 and fig. 30 and 33 are user interface screens.
Detailed Description
As shown in fig. 1, here we describe a technology 10 that provides a generic voice application platform 12 (sometimes we refer to it simply as the "platform," the "generic platform," or the "cross-device platform"). The platform is configured to create, store, manage, control, and execute (among other actions) the voice applications 14 and to provide voice assistant services 11 to voice assistants 13 and voice assistant devices 18. The platform serves two types of users.
One type includes end users 28 of voice assistant devices and voice assistants. The end users are served by generic voice applications that can process requests from voice assistant devices conforming to any framework and formulate corresponding generic responses that can be translated into responses usable in any framework.
Another type of user includes platform participant users 45 who use the platform in a software-as-a-service model through the user interface 39 to create, store, and manage generic voice applications and related content items, among other things. The platform is configured to enable platform participant users to quickly create, store, and manage standardized, generic voice applications based on predefined standard content items and other components required by the voice applications. In other usage modes, the platform is configured to enable platform participant users to create, store, manage, and control, among other things, customized generic voice applications and related content items.
Standardized generic voice applications, content items, and other components may be stored on the platform server 222. Customized generic voice applications, content items, and other components may be stored on a customization server.
In operation, spoken requests (e.g., intents) 26 from end users are received by the voice assistant devices 18, which process them and formulate request messages 34. The request messages 34 are passed through the communication network 29 to a voice assistant server 31. The voice assistant server 31 is operated, for example, by the party controlling a particular framework (such as amazon with respect to the Alexa framework). The voice assistant server processes the incoming messages, parses them to derive request message elements, and passes the processed request information to the platform server. The platform server uses the received message elements to determine the best response based on the given standardized or customized voice application being executed. For this purpose, the platform server may refer to standard voice applications, content items, and other components stored and managed on the platform server, or may refer to a customization server for customized voice applications, customized content items, and other customized components. The platform server formulates the corresponding appropriate response message elements 35 and returns them to the voice assistant server, which uses them to generate a formal voice response message 32. The response may be spoken or presented as text, images, audio, or video. The platform stores content items 52 in various media formats for responses. In some cases, the response may involve a responsive action such as shutting down a device.
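As a rough illustration of the round trip just described, the following sketch shows a platform server receiving processed request elements, invoking a (standard or customized) voice application, and returning response message elements. The helper names (handle_request_elements, respond, EchoApp) are hypothetical and are not taken from the patent.

```python
# Hedged sketch of the request/response round trip, with hypothetical helper names.
# A platform server receives processed request elements from a voice assistant
# server, runs the applicable voice application, and returns response elements that
# the voice assistant server re-expresses in the originating framework's protocol.

def handle_request_elements(request_elements, voice_app):
    # Map the framework-specific elements into the platform's common form.
    common_request = {
        "intent": request_elements.get("intent"),
        "utterance": request_elements.get("text", ""),
        "framework": request_elements.get("framework"),
    }
    # The voice application (standard or custom) determines the best response.
    common_response = voice_app.respond(common_request)
    # Return response elements for the voice assistant server to deliver.
    return {"speech": common_response["speech"],
            "media": common_response.get("media", [])}

class EchoApp:
    """Stand-in voice application used only for this sketch."""
    def respond(self, common_request):
        return {"speech": f"You asked about {common_request['intent']}."}

print(handle_request_elements({"intent": "events", "framework": "alexa"}, EchoApp()))
```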
The three sets of servers (platform servers, customization servers, and voice assistant servers) may be created, managed, operated, owned, or controlled (or a combination of those actions) by three different parties, respectively: (a) a platform owner that operates the platform as a business, (b) platform participants that control their own customization servers, and (c) framework developers (such as microsoft, amazon, google, apple, and chat bot developers) that operate their own voice assistant servers to control the manner in which their frameworks' request and response messages are processed. In some implementations, two or more of the three sets of servers may be controlled by a single party for its own benefit or for the benefit of both itself and another party.
Because the platform is cloud-based (e.g., implemented using one or more servers in communication with the client voice assistant device over a communication network), platform participants do not need to invest in additional infrastructure to be able to create, edit, manage, and own robust voice applications. The cloud-based approach also enables periodic updates and enhancements to be added by the party controlling the generic voice application platform. Updates and enhancements become automatically and immediately available to platform participants.
Examples of platform participants as described above include brands, advertisers, developers, and other entities using the platform.
In some examples, people who use the platform as representatives of, or on behalf of, a platform participant are sometimes referred to as "platform participant users," "platform users," or "participant users." Participant users interact with the platform through one or more "participant user interfaces" 39 or simply "user interfaces".
As presented earlier, certain voice applications, sometimes referred to as "standard voice applications," are designed, developed, and stored by the party controlling the platform and made available for use by platform participants. Other voice applications, which we refer to as "customized voice applications," include customized content items, customized features, or other customized components, and are designed, developed, stored, and controlled for specific purposes or by specific platform participants. In some cases, these customized voice applications may be shared with other platform participants. In some cases, a customized voice application is dedicated to a single platform participant and is not shared.
We use the term "voice app" broadly to include, for example, any app that can accept a request for a user of a voice assistant device and formulate an element of a response to the request to return to the voice assistant device to implement the response. The voice application may be created by any method that involves elements that specify how information about an input request is accepted and used and how an appropriate response is to be generated based on the information about the input request. The response may include the content item and the elements of the response may be generated by performing the function defined in relation to the input request based on the information about the input request. In a typical known system, a speech application is "hardwired" as code that accepts a request as input and performs a pre-specified method or function based on the request to generate a response. Among the advantages of the platform and user interface we describe herein are that they provide participant users with an easy-to-use, robust, efficient, time-saving, highly flexible, cross-framework method to develop, update, control, maintain, gauge the effectiveness of, and deploy voice applications and content items that they use. Fine-grained cross-framework, cross-content, and cross-feature analysis is available to users, and also works with a background to improve the effectiveness of speech applications. The resulting application is robust, adaptive, dynamic and efficient, among other advantages.
The platform 12 is configured to be able to accept request message elements that conform to any type of voice assistant framework, execute generic voice applications using those message elements, and return response message elements that can be used to formulate a generic representation of a response message for any type of voice assistant framework.
In other words, the generic voice application platform may communicate with voice assistant devices belonging to (e.g., conforming to) multiple different current and future voice assistant frameworks simultaneously using request and response messages for each voice assistant device conforming to its framework's native protocol. At the same time, the generic application platform enables platform participants to develop, maintain and deploy robust generic voice applications that can interpret requests and formulate responses for voice assistant devices belonging to a variety of different frameworks without having to develop, maintain and deploy multiple parallel functionally similar voice applications, one for each framework to be serviced.
Thus, among the benefits of certain implementations of the platform, platform participants can formulate, maintain, and deploy engaging, efficient, robust voice applications through a single, easy-to-use, consistent participant user interface. The resulting voice applications can serve amazon Alexa, google assistant, apple HomePod, microsoft Cortana, and any other kind of current or future voice assistants and voice assistant devices generally. The platform is designed to enable platform participants to quickly and easily deploy voice applications while providing flexibility through customization capabilities.
Features and advantages of the present techniques and platforms are also as follows:
Graph-based. The platform can interact with any voice assistant framework, including existing proprietary and non-proprietary frameworks developed by amazon, google, apple, microsoft, and others, through a single generic API and generic business logic layer, providing services for, and processing data associated with, the voice assistant frameworks. The platform abstracts the received request messages and processes them using graph-based pattern matching rather than a rule-based engine (although it is possible to combine graph-based pattern matching with a rule-based approach) to understand the end user's request (e.g., intent). Graph-based pattern matching enables a consistent and confident approach to mapping request messages to the features to be used in formulating responses across multiple voice assistant frameworks. The graph-based approach is manageable, updatable, and flexible enough to enable machine learning to update the locations of nodes in the graph. A new voice assistant framework may be accommodated by the graph-based approach simply by adding new nodes (data points) to the graph database to create reachable connections based on request messages received from voice assistant devices that conform to the new voice assistant framework.
Cross-framework analysis. Because the generic voice application platform may access usage data from multiple different voice assistant frameworks, the platform may compare the relative performance among the frameworks. For example, the platform may analyze failure rates of different voice applications in processing and responding to received request messages, and failure rates of particular features or content items across multiple voice assistant frameworks. As a result, the platform may use machine learning and platform algorithms to better understand the end user's request (intent) than might be understood by the particular voice assistant framework being used (which may only access usage data for that framework). This advantage is achieved, for example, by detecting patterns of success and failure of each framework for a given type of feature or content item and determining changes that will make the content item or feature more successful. The analysis enables the platform to identify an optimal superset of content item and feature variations across the supported frameworks.
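A minimal sketch of the kind of cross-framework comparison described above, computed over hypothetical usage records; the record fields and the failure-rate metric are assumptions for illustration only.

```python
# Illustrative only: compute per-framework, per-feature failure rates from
# hypothetical usage records aggregated through the platform's single API.

from collections import defaultdict

usage_records = [
    {"framework": "alexa",  "feature": "events", "succeeded": True},
    {"framework": "alexa",  "feature": "events", "succeeded": False},
    {"framework": "google", "feature": "events", "succeeded": True},
]

def failure_rates(records):
    totals, failures = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["framework"], r["feature"])
        totals[key] += 1
        failures[key] += 0 if r["succeeded"] else 1
    return {key: failures[key] / totals[key] for key in totals}

print(failure_rates(usage_records))
# {('alexa', 'events'): 0.5, ('google', 'events'): 0.0}
```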
Robust content recommendation. Because the platform collects usage data across multiple voice assistant devices and multiple frameworks through a single API and can analyze their relative performance, the platform can provide efficient feature and content recommendations to platform participants. The platform uses machine learning and algorithms to report the relative performance of different voice applications to platform participants (including different voice applications for a given platform participant or different voice applications for different platform participants) to make real-time dynamic content suggestions directly to platform users within the platform user interface. These suggestions may help platform users optimize the performance of their voice applications and enhance the overall end-user experience.
Dynamic content. The platform supports a collection of items of dynamic content, for example, to provide more than one possible response to a request, such as an optional answer to a question. Dynamic content may enable a more engaging end user experience, for example, because the response may change between sessions. Dynamic content also enables the creation of one or more roles for voice assistants and changes in the end user experience depending on the end user's preferences and demographics. In a typical existing platform, if ten end users ask a given voice assistant the same question, the voice assistant will interact in the same way ten times. The generic voice application platform is capable of formulating an unlimited number of possible responses for each of the ten end users and personalizing each response to the particular end user. For example, if the platform determines that the end user is a 35 year old female living in georgia, a particular response may be selected based on the developer's decision that this end user may be more comfortable talking to another female (voice assistant) with a southern accent and speaking using a local spoken language and a local reference. The platform enables developers to change the words a given voice assistant framework uses while speaking to an end user, and to record amateur or professional voice talents with associated gender, accent, dialect, or other voice characteristics. The result is a more trusted and acceptable interaction between a given end user and the voice assistant.
Typically, the platform cannot "hear" the end user's accent because the request messages from the voice assistant frameworks do not carry audio files. The platform only receives text and can look for keywords that give clues that the end user may have an accent. An example would be that "y'all" in the text may be attributed to a southern accent in the united states. The platform may also couple the identification of keywords with geographic information, if available. The keyword "y'all" received from a voice assistant device in atlanta, georgia may suggest a southern accent.
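The following sketch illustrates the idea of selecting a dynamic response variant from simple contextual clues such as keywords in the received text and, when available, geographic information. The clue list, region codes, and response variants are invented for illustration and are not the platform's actual logic.

```python
# Illustrative sketch of dynamic-content selection based on contextual clues
# (keywords in the received text and optional geographic information).

SOUTHERN_CLUES = {"y'all", "fixin'"}   # assumed keyword clues, not from the patent

response_variants = {
    "southern": "Y'all are all set. Anything else I can help with?",
    "default":  "You're all set. Anything else I can help with?",
}

def pick_variant(utterance_text, region=None):
    words = set(utterance_text.lower().split())
    if region in {"GA", "AL", "TN"} or words & SOUTHERN_CLUES:
        return response_variants["southern"]
    return response_variants["default"]

print(pick_variant("can y'all remind me about the game", region="GA"))
```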
Multi-language content. The platform enables platform participants to reach a larger audience in the united states and worldwide by natively supporting multi-lingual content in responses. The platform also enables a more inclusive and human experience between end users and their voice assistants. The multi-lingual support, along with the ability to add, modify, and remove multi-lingual content, is built into the interface for non-english-speaking participant users.
Pre-stored and customized modules and content. The platform provides both: (a) an accelerated launch for brand owners or other platform participants using predefined (e.g., standard) features, feature modules, and sample content items, and (b) the flexibility of customized creation and use of features, modules, and content items, among other things. Platform participants may use standard features, modules, and content items 23 to speed development through an easy-to-use content management system, or may create customized end-user experiences by creating their own customized features, modules, and content items that operate with the platform using APIs. This arrangement enables platform participants to create and manage customized and trusted end-user experiences to better serve end users.
Human voice. Using human speech for responses, rather than only synthesized computer speech, produces a more authentic and engaging end-user experience. The platform enables participant users to create and edit audio and video content items directly within the platform through the user interface without resorting to other cross-platform content creation applications (although such applications may also be used). Platform participants can create voice applications that utilize and include rich media (audio and video) content items through a single participant user interface. Advantages of this arrangement include higher efficiency, faster time to market, and a deeper, richer end-user experience.
Media guidance regarding device capabilities. Voice assistant frameworks (and the voice assistant devices that conform to them) vary in how they handle various types of content items based on their internal hardware and software. For example, one framework may support video, audio, images, and text, while another may support only text and audio. The generic voice application platform provides media guidance as to whether a particular type of content item is supported by a particular voice assistant device or voice assistant framework, and provides that guidance directly in real time in the platform's participant user interface. This guidance enables a brand or other platform participant to determine which content to emphasize while learning how to optimize the end-user experience on a particular voice assistant device or voice assistant framework.
As explained earlier, in some implementations of the present technology we describe herein, the voice assistant device 18 processes the voice 26 of the end user 28, interprets the voice as a corresponding request 48, includes the request (e.g., intent) in a request message expressed according to the protocol of the voice assistant framework to which the voice assistant device belongs, and forwards the request message over one or more communication networks to a server, which processes the received request message. As also shown in fig. 1, the server formulates a response using the relevant features 43 of the voice application 14 and (in most cases) sends a corresponding response message back to the voice assistant device. The generic voice application platform includes a module 46 that organizes and provides features 43 to enable the voice application to process requests. In some implementations of the platform, features of such modules are implemented as a request handler 41 that handles potentially many different types of requests (e.g., intents) for voice applications, e.g., requests associated with features such as events, FAQs, daily updates, reminders, checklists, surveys, and up-to-date messages.
The features implemented as request processors in a given module may represent a collection of features that are all useful with respect to, for example, a set of platform participants sharing a common trait, such as a common use case for entities belonging to an industry or market. Each module may also include or be associated with pre-stored items of sample content 23 that may be invoked and used by the request handler in formulating a response to a request. The availability of pre-stored items of sample content can shorten the time to market for platform participants.
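One plausible way to organize features as request processors grouped into a module is sketched below; the handler names, module layout, and dispatch function are assumptions for illustration, not the platform's actual implementation.

```python
# Hedged sketch of the request-handler idea: features registered as handlers,
# grouped into a module, with requests dispatched by type. Names are illustrative.

def handle_events(request):
    return {"speech": f"Here are the events for {request.get('date', 'today')}."}

def handle_faq(request):
    return {"speech": "Here is the answer to that frequently asked question."}

higher_education_module = {
    "events": handle_events,
    "faq": handle_faq,
}

def dispatch(module, request):
    handler = module.get(request["type"])
    if handler is None:
        return {"speech": "Sorry, I can't help with that yet."}
    return handler(request)

print(dispatch(higher_education_module, {"type": "events", "date": "Tuesday"}))
```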
A participant user (e.g., a person working on behalf of a particular company, brand, organization, or other platform participant's interests) may create, edit, and manage customized content items 22 through the platform's user interface using the platform's content management system 54. The content management system provides an intuitive user interface that does not require technical knowledge to create, modify, and remove content items that shape the end user experience.
The content management system of the platform also provides guidance and insight to participants by collecting usage data and applying analytics 56 to the collected usage data 55. In the user interface, guidance can be provided by cues such as visual indicators of which media formats of content items 653 are supported by the voice assistant devices of a particular framework (e.g., video and image media supported by the amazon Echo Show). The insight includes, for example, analysis of the success rates, across voice assistant devices of different frameworks, of the responses formulated by a voice application for a given request (e.g., google assistant responds more successfully to a given request than amazon Alexa).
As shown in FIG. 2, the generic voice application platform 12 uses an architecture 70 composed of separate functional layers. The layers include: an API layer 72, a business logic layer 74, a feature and module layer 76, a CMS (content management system) layer 78, and a data layer 80.
API layer
The API layer processes request messages 73 received from the voice assistant device and requests 75 received from the customization modules and features. The API layer accepts request messages and other requests represented according to the protocol 82 associated with any possible proprietary or non-proprietary voice assistant framework. When the API layer receives a request message or other request that conforms to any defined protocol, the API layer abstracts (e.g., translates, transforms, or maps) the received request message or request to a request represented according to the common general protocol 84 for further processing. The abstraction enables the support of a variety of specialized and non-specialized voice assistant frameworks, voice assistant devices, and voice assistants using generic business logic and other logical layers (such as feature and module layers and CMS layers) without requiring a separate stack of logical layers for each voice assistant framework.
As an example, amazon Alexa and google assistant each provide request messages expressed in JSON to the API layer of the platform for processing. The protocol used to represent the request message is typically the same regardless of the framework to which the voice assistant device conforms, but the object and value pairs included in the request message differ between the two different frameworks supported by google and amazon, respectively. For example, both frameworks represent within the JSON protocol whether a user and session are new; the specific key-value pairs for google assistant are "userid | Unique Number" and "type | New", while the specific keys for Alexa are "userid | GUID" and "New | True". The platform detects which framework is associated with the particular voice assistant device that sent the request message to determine how the request message should be further processed. The platform reconciles the differences and normalizes the information into a common format for additional processing.
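A hedged sketch of that normalization step, using the new-user/new-session keys from the example above; the exact field names and mapping are illustrative assumptions rather than the platform's actual code.

```python
# Illustrative normalization of framework-specific request fields into a common
# format, based loosely on the key examples given above. Real field names vary by
# framework and version; this is an assumption-laden sketch, not the actual mapping.

def normalize_request(framework, message):
    if framework == "google":
        return {
            "user_id": message.get("userid"),          # a unique number per the example
            "is_new":  message.get("type") == "New",   # google marks newness via "type"
        }
    if framework == "alexa":
        return {
            "user_id": message.get("userid"),          # a GUID per the example
            "is_new":  message.get("New") is True,     # alexa marks newness with "New"
        }
    raise ValueError(f"unsupported framework: {framework}")

print(normalize_request("alexa", {"userid": "GUID-1234", "New": True}))
```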
Business logic layer
The business logic layer applies business logic to handle the platform's critical operations related to mapping the message elements of each incoming request to the specific appropriate modules and features that may be associated with the request to be processed. In some implementations, the business logic layer performs the mapping through graph traversal using a graph database 86 stored as one of the databases on the server. In some cases, the graph traversal determines which modules and features are most likely to match (e.g., are most likely to process and formulate an appropriate response to) a given request. The graph database includes data representing a graph of nodes connected by edges. Graph traversal is a search technique that finds patterns within the graph database based on term relationships. A pattern represents an edge within the graph connecting one or more nodes. For example, a request message from an amazon Alexa device with the literal phrase "stop" as one of the message elements maps, through Alexa-based edge values and the stop instruction, to the "stop" feature node of the graph. Based on the results of the graph traversal, the business logic layer processes requests that have been represented in the abstracted generic protocol to identify the most likely matching modules and features of the feature and module layer 76 of the generic voice application platform.
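A toy version of the traversal described above might look like the following; the edge list, scoring, and function names are assumptions for illustration, not the platform's actual graph database or matching logic. It also illustrates the earlier point that supporting an additional framework is largely a matter of adding nodes and edges to the graph data.

```python
# Illustrative graph of nodes and edges: traversal from a framework node along
# edge values found in the request reaches the most likely matching feature node.

graph_edges = [
    # (from node,        edge value, to node)
    ("framework:alexa",  "stop",     "feature:stop"),
    ("framework:alexa",  "events",   "feature:events"),
    ("framework:google", "stop",     "feature:stop"),
]

def match_feature(framework, phrases):
    """Follow edges whose values appear in the request's phrases; return the
    feature node reached most often (a crude stand-in for a confidence score)."""
    hits = {}
    for src, value, dst in graph_edges:
        if src == f"framework:{framework}" and value in phrases:
            hits[dst] = hits.get(dst, 0) + 1
    return max(hits, key=hits.get) if hits else None

# A request from an Alexa device containing the literal phrase "stop":
print(match_feature("alexa", ["stop"]))   # feature:stop

# Supporting a hypothetical new framework is a data change, not a code change:
graph_edges.append(("framework:homepod", "stop", "feature:stop"))
print(match_feature("homepod", ["stop"]))  # feature:stop
```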
Feature and module layer
Features 81 within the feature and module layer represent functions or processes 83 that are invoked as a result of processing requests in the voice API layer and business logic layer. For example, a function that returns an event list may involve parsing the message elements received from the request message and from the business logic layer to indicate the date of the event, the type of event (such as a basketball game), or both. Features within the platform are segmented according to the type of request to be processed. For example, all requests for information about an event may be processed by the functions of the event feature 85, while all requests for the latest generic update are processed by the functions of the daily update feature 87. Segmenting features by request type provides a structured format for handling requests and formulating responses. The functions of each feature and the content items they use may be stored and managed by the participant users, by the party controlling the platform, or by both. Because the features and modules are closely related to and use the content items, the feature and module layer is one of two layers (the other is the CMS layer) that participant users can view and work with directly by name in the user interface of the platform.
A module 89 provides a structure for referencing or packaging a set 91 of features 81 that are commonly used by or associated with a set of platform participants, such as companies belonging to a given industry, or a set of features associated with a given use case. More than one module may reference a given feature or include a given feature in its package. Because features reference and use content items, a module's references to features relate to references to particular content items (e.g., pre-stored sample or standard content items 23 managed by the platform for use by platform participants). For example, both a module for the higher education domain and a module for the health industry may include references to (e.g., package) the same event feature, but the use of the feature will differ based on the content items (e.g., sample or standard content items or custom content items) that are loaded when the feature is invoked by the two different references in the two different modules. The higher education event module may formulate a response relating to a particular sports team or school department; the health event module may formulate a response relating to activities of a city or an office.
As discussed later, the generic voice application platform includes a search engine that retrieves particular content items when a feature is invoked by performing a content search against a search index. For example, an incoming request message stating "what is happening on campus on Tuesday" is processed by the event feature searching against an index to return a list of events in the database having that Tuesday as their date value.
CMS layer
The standard and custom content items 23 are created, stored, and managed by participant users through a major portion of the platform user interface that exposes the features and functionality of the CMS layer 78. The CMS layer also enables participant users to control administrative and access rights. The CMS layer is designed to be easy enough for non-technical managers to use. The CMS layer supports content items in various formats, including audio such as mp3, video such as mp4, images such as png, plain text, and markup text such as SSML (Speech Synthesis Markup Language). For interoperability, in addition to supporting requests from the feature and module layer 76, the CMS layer provides its own APIs 90 to support requests from external applications. For example, platform participants may repurpose the content items stored within the CMS layer for external voice applications and for other distribution channels, such as presentation through mobile applications. In the latter case, the mobile application may retrieve content items stored within the CMS layer through the use of an API.
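For example, a mobile application might retrieve content items from the CMS API roughly as in the following TypeScript sketch. The endpoint path, query parameters, and response shape are hypothetical, since the specification does not define the CMS API surface.

// Hypothetical CMS API call; the URL, parameters, and response fields are illustrative only.
interface ContentItem {
  id: string;
  type: "audio" | "video" | "image" | "text" | "ssml";
  title: string;
  url?: string;   // e.g., location of an mp3, mp4, or png file
  body?: string;  // plain text or SSML
}

async function fetchContentItems(applicationId: string, feature: string): Promise<ContentItem[]> {
  const response = await fetch(
    `https://cms.example.com/api/content?application=${applicationId}&feature=${feature}`,
    { headers: { Accept: "application/json" } }
  );
  if (!response.ok) throw new Error(`CMS request failed: ${response.status}`);
  return (await response.json()) as ContentItem[];
}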
Data layer
The data layer is a repository of data used by all layers, user interfaces, and other functions of the platform. The data layer employs various storage mechanisms 92, such as a graph database 101, a file store 103, a search index 105, and relational and non-relational database storage. The data layer houses data for at least the following users, mechanisms, and uses: participant users, system permissions, mappings for modules and features, content items related to features and responses formulated by features, and usage data for analysis, among others.
Important aspects of the present techniques and platforms
Among the important aspects of the present technology and the platform, including the layers and user interfaces that comprise it, are the following, some of which have been mentioned earlier.
Support for various voice assistant devices using an API layer
The API layer can process request messages from any type of voice assistant device, including any voice assistant device belonging to or conforming with one or more voice assistant frameworks, such as those provided by Amazon, Google, Microsoft, Apple, and others. New or customized voice assistant devices, voice assistants, and voice assistant frameworks developed in the future can be accommodated in a consistent manner. Thus, by using a single API layer, various types (frameworks) of voice assistant devices can be accommodated without requiring the development of an entirely different code library for each framework.
Graph database technique for mapping sentence structure to features
The request message received at the platform (e.g., at the API layer) carries information about the speech of the user of the voice assistant device, typically represented as part of a loosely structured sentence pattern. An important function of the platform (and in some implementations, of the business logic layer of the platform) is to determine the correct or most appropriate or relevant or valid features (we sometimes call them the "appropriate features") that should be invoked for the message elements included in a given request message, based on the information carried in the loosely structured sentence patterns. While graph database techniques are typically used to identify pattern matches for entity relationships across large data sets of highly related data, the platform described here uses graph database techniques to identify pattern matches of loosely structured sentence patterns against defined features. For example, graph databases are commonly used to determine relationship patterns within the large data sets of social networks, where the individuals represented by the nodes may have several relationships with other individuals and shared interests represented within the graph. The platform described here leverages the graph database to match patterns in the types of user requests to features within the platform. The graph also enables the platform to work with manageable data sets.
Analysis across a voice assistant framework
The platform may capture, within a single repository (e.g., a database within the data layer), usage data of voice applications used across various voice assistant devices, voice assistants, and frameworks. Using the stored usage data, the platform may perform analyses and provide results to participant users and platform participants, e.g., results regarding the overall performance of a voice application across multiple types of devices or multiple frameworks and results regarding the performance of individual request and response interactions of a particular voice application. At the voice application level, the platform can perform, accumulate, store, and provide the results of analyses of coverage metrics, including: the number of voice application downloads, the number of voice application sessions, the number of unique application sessions, the length of the average application session, the most frequently received requests, the average rate at which requests are successfully mapped to features, and the requests that could not be successfully mapped to features.
The usage data for each analysis metric may be segmented by the type of voice assistant, voice assistant device, or voice assistant framework, by date range, or by various other parameters.
API layer and SDK
As explained earlier and as shown in FIG. 3, the voice assistant device 98 represents the request 99 spoken by the end user as structured data (a request message) according to the native protocol of the voice assistant device. The native protocol may be determined by the framework associated with the device. In some cases, the request message is represented according to a common protocol applied to types of voice assistant devices or frameworks that are not natively supported by the platform.
In order for the API layer (identified in FIG. 3 as voice experience API 110) to be able to process request messages 73 expressed according to a particular protocol, the platform supports a collection of SDKs 112 for different programming languages, voice assistant devices, and voice assistant frameworks.
The SDKs enable all types of voice assistant devices (conforming to any framework) to easily access the API layer. An SDK provides the developer or other platform participant with the desired format (protocol) for representing communications with the platform. The SDKs include tools that enable developers to: authorize and verify voice assistant devices to access the API layer in a manner that allows them to apply request messages in the desired format; authorize voice applications registered with the platform; formulate the original request message into a data structure conforming to the applicable protocol for presentation to the API layer; formulate responses received from the API into an appropriate data structure (a response message) according to the applicable protocol expected by the target voice assistant device; ensure that request messages are applied to the correct version of the API after an update is deployed; and support multiple programming languages.
The platform SDKs may support common programming languages such as JavaScript and TypeScript, C#, Java and Kotlin, Swift, and Go for creating skills, actions, extensions, and voice applications for various types of voice assistant devices and frameworks.
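A minimal sketch of how such an SDK might be used from TypeScript to forward a native request to the API layer and obtain the formulated response. The class, method, header names, and URL are hypothetical and merely stand in for whatever the actual SDKs expose.

// Hypothetical SDK surface, shown only to illustrate the responsibilities described above.
class VoicePlatformSdk {
  constructor(private apiKey: string, private applicationId: string) {}

  // Wraps the native request in the protocol the API layer expects, including auth and app identity.
  async forwardRequest(nativeRequest: object, framework: string): Promise<object> {
    const response = await fetch("https://api.example.com/v1/voice/requests", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: this.apiKey,          // authorizes this device/application to use the API layer
        "X-Framework": framework,            // e.g., "alexa" or "google-assistant"
        "X-Application": this.applicationId, // identifies the registered voice application
      },
      body: JSON.stringify(nativeRequest),
    });
    if (!response.ok) throw new Error(`API layer rejected the request: ${response.status}`);
    return response.json(); // response message already shaped for the target framework
  }
}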
For types of voice assistant devices (frameworks) whose processing is typically not written in one of the programming languages supported by the SDKs, the API layer may be accessed directly, enabling developers to develop other SDKs or to present request messages directly to the API layer. The SDKs may be open-sourced to help members of the development community who use programming languages other than those of the supported SDKs, by demonstrating design patterns and code architectures that meet the requirements of the native protocols of the various frameworks and of the API layer.
Once the SDK forwards the request message from the voice assistant device to the API layer, the API layer maps the message to the platform's internal generic protocol. The API layer also represents the response 113 formulated by the feature server 115 as a response message 117 that conforms to the protocol accepted by the voice assistant device that sent the request. The SDK may then accept the formulated response message from the API layer, validate the response message, and forward it over the network to the voice assistant device. The voice assistant device then renders or presents the response 119 (e.g., the content items carried in the response) to the end user. The presentation of the response may be by the voice assistant device's native AI voice reading the text included in the response, by playing an audio file directly, by presenting a video file, and so on, or a combination of these, if the voice assistant device supports those richer formats.
For example, a request message processed by the Amazon Alexa SDK is sent to the API layer for further processing. The API layer then maps the processed request to a normalized format (e.g., a common format). The normalized request is then further processed using the mapping to the specific feature, as explained further below. The response returned from the feature is then formulated as a response message in the appropriate framework format and sent back through the Amazon Alexa SDK for presentation as spoken text, audio, an image, or video.
However, the availability of the SDKs does not limit developers or other platform participants to developing voice applications using only the features provided by the platform. For example, if a developer wants to provide response behavior that cannot be implemented by any available feature, the developer can skip using the SDK to send the incoming request to the API layer and simply use the SDK to implement an explicit response to the request. This capability enables developers to migrate to the platform using existing skills and voice application experiences without having to start over.
For types of voice assistant devices or frameworks that are not supported by the platform, such as third-party chat bots, non-mainstream voice assistants, and the like, a developer may register the unsupported type of voice assistant device or framework in the CMS layer of the platform. Doing so generates a unique identifier for the voice assistant device or framework, which enables tracking and analysis of which types of requests work better for a particular type of voice assistant device or framework than for others, and enables usage data to be obtained for a given type of voice assistant device or framework.
Business logic layer graph traversal
To support different voice assistant devices, the business logic layer processes the patterns of request message elements included in the request messages provided by each type of voice assistant device or framework. As shown in FIG. 3, to be able to process the request elements 107 of request messages 108 from the various types of voice assistant devices (voice assistant frameworks) 98 and map the patterns of the request elements to the appropriate features 115, the business logic layer uses a traversal 117 of a graph database 116 of the relationships between the patterns of the request elements and the platform-supported features. The graph includes nodes for the request messages corresponding to each voice assistant device or framework and information about each feature supported by the platform. The graph database may begin a search at any node to find a match between the request elements and the appropriate feature to use.
The traversal 117 of the graph database to match request messages and their request elements with the appropriate features comprises at least the following steps: API consumption, node endpoint search, graph traversal 117, and output processing.
API consumption
A preliminary step in finding the appropriate feature to apply in formulating a response to a given request message is to create a RESTful API 110 for the business logic layer with unique endpoints that consume the request message elements of a native request message from a voice assistant device associated with a particular framework. Each unique endpoint in the RESTful API understands the protocol of the request elements included in request messages received from voice assistant devices that conform to a particular framework. For example, one endpoint may exist to consume the request elements included in request messages received from the Amazon Alexa SDK 112. A separate set of endpoints of the API consumes the types of request elements that the Google Assistant SDK 112 sends in its request messages. REST (representational state transfer) is an architectural style, based on the hypertext transfer protocol (HTTP), for providing APIs for communication between systems.
These endpoints of the RESTful API enable tracking of request elements that conform to the protocol of each framework of voice assistant devices, and they provide a generic set of endpoints for a generic set of request elements so that unregistered types (unsupported frameworks) of voice assistant devices or other applications can also interact with the features supported by the platform.
By having a collection of understood protocols associated with each different voice assistant framework and its corresponding voice assistant devices, along with a common set of protocols, the system can search the appropriate set of nodes in the graph database for matches to find the appropriate feature to formulate a response to a received request.
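A sketch of such per-framework endpoints, using Express with TypeScript; the routes and helper names are hypothetical, and the handlers are placeholders for the mapping and business logic described above.

import express from "express";

const app = express();
app.use(express.json());

// Endpoint that understands the Alexa request element protocol.
app.post("/api/requests/alexa", (req, res) => {
  res.json(handleRequest(toCommonElements("alexa", req.body)));
});

// Separate endpoint that understands the Google Assistant request element protocol.
app.post("/api/requests/google-assistant", (req, res) => {
  res.json(handleRequest(toCommonElements("google", req.body)));
});

// Generic endpoint for unregistered or unsupported device types.
app.post("/api/requests/generic", (req, res) => {
  res.json(handleRequest(req.body));
});

function toCommonElements(framework: string, nativeBody: unknown): object {
  return { framework, elements: nativeBody }; // placeholder for protocol-specific mapping
}

function handleRequest(commonElements: object): object {
  return { handled: commonElements }; // placeholder for graph traversal and feature invocation
}

app.listen(3000);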
Node endpoint search
Typically, the request elements of a request message from a voice assistant device of a given framework can be decomposed into a relationship between the general type of the request and internal request elements known as slots. (Slots are optional placeholders for values passed by the end user as part of the request. An example of a slot and slot value is US_City and Seattle: US_City is the slot and Seattle is the value.) Based on this structure, a graph database of the relationships of request elements to features can be built. The relationships captured by such a graph database may include common types of relationships.
As shown in FIG. 4, the relationship between a message element (which in some contexts we refer to as an intent) and a feature may be as simple as one type of message element 142 (intent 1) received from one type of voice assistant (assistant 1) relating to a particular feature 140, or (FIG. 5) may be more complex, e.g., message elements 142 from two assistants (assistant 1 and assistant 2) of different types (i.e., frameworks) of voice assistant devices relating to the same feature 140. Example types of message elements are an Alexa event search, which would share an edge 143 with the event feature node 140 in the graph, and an Alexa event location search, which would also share an edge 145 with the event feature node 140. The edge descriptor for the edge from a given message element to a given feature is "leads to"; the message element is a parent node that leads to a child feature node.
As shown in FIG. 6, the relationships may be more complex if a slot type 150 can be shared by two different message elements 152, 154 originating from a particular type 153 of voice assistant device, and if each of the two message elements also has its own slot type 156, 158 that is not shared with the other. Continuing with the example of the Alexa event search and Alexa event location search message elements relating to the event feature, the two different message elements 152, 154 will have an internal (i.e., shared) slot. Some slots 150 may be shared between the two different message elements and some slots 156, 158 may not be shared, for example a date slot type and a location name slot type. The Alexa event search message element type would include both the date and location name slot types, while the Alexa event location search would include only the location name slot type. The edge descriptor 160 for an edge from a message element to a slot is "includes", in that the message element includes one or more slots.
As shown in FIG. 7, in a more complex example, a feature 702 may also relate to multiple types of message elements from different types of voice assistant devices and the slots they include. In the example of the Alexa event search type of message element (intent 1) relating to the event feature 702, a voice assistant other than Alexa (assistant 1), such as Google Assistant (assistant 2), may have a framework that supports its own similar message element, here called Google event 701 (intent 1). The Google event node 701 in the graph shares a leading edge 711 to the same event feature 702 with which the Alexa event search 703 and the Alexa event location search 704 also share edges.
A node for a given message element may have edges that lead to a number of different features. For this to work, however, there must be a way to determine which of the different features a given actual message element leads to. For example, such a determination can be made if there are two different slot types, each of which relates to only one of the two features.
As shown in FIG. 7, if a first message element 703 relates to a feature 702 and has a slot type 706 shared with a second message element 704 that also refers to the same feature 702, and if the first message element has another slot type 708 that is not shared with the second message element, then the relationship 709 between the first message element 703 and the feature 702 is stronger than the relationship 711 between the second message element 704 and the feature 702. How this determination is made is discussed in more detail below with respect to graph traversal.
For example, consider two platform-supported features: an event feature and a daily message feature. Both features formulate response messages comprising different types of content items. One type of content item (for an event) may be event information including a date, time, location, event type, and description. The other type of content item (for a daily message) may be an audio or video message to be broadcast to a group of people according to a schedule. There are many different types of request message elements that may relate to, i.e., share a leading edge with, the nodes representing these two features in the graph. There are also message elements that may lead to either feature, but not both. Both features may be active in a voice application at a given time, so the only way to know from the request message element which feature it leads to is to look at the slots that the message element shares with each of the two features. For example, Alexa's "what's new" message element may lead to either the event feature or the daily message feature. However, the "what's new" message element may include multiple slot types, such as a date slot and a person name slot. The date slot shares an edge with both features, but the person name slot relates only to the daily message feature. Thus, if the message element in the received request message is Alexa's "what's new" message element and the request message includes a filled person name slot, the relationship between the request message and the daily message feature is stronger than its relationship to the event feature. On the other hand, if a feature node has more slot relationships with one intent node than with another intent node, and the request reaches the graph with no filled slots related to the first intent node, then the relationship of the feature node to the other intent node is the stronger one. Within the same example, if the received request includes Alexa's "what's new" intent and only the date slot is filled, the intent may lead to the event feature.
Using these types of relationships, a graph database may include any simple or complex combination of nodes, edges, features, and slots. Once a request message is received through the API layer, processing begins with the node in the graph that matches the type of the message element, and the slot types included in the message element are used to determine the best path to the most applicable feature.
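The relationships described above can also be pictured as data. The following TypeScript sketch encodes a few of the example nodes and the "leads to" and "includes" edges; the identifiers are illustrative, and a real deployment would hold these relationships in the graph database rather than in memory.

// Illustrative in-memory encoding of the graph described above.
type EdgeKind = "leadsTo" | "includes" | "relatesTo";

interface GraphNode { id: string; kind: "intent" | "feature" | "slot"; }
interface GraphEdge { from: string; to: string; kind: EdgeKind; }

const nodes: GraphNode[] = [
  { id: "alexa.eventSearch", kind: "intent" },
  { id: "alexa.eventLocationSearch", kind: "intent" },
  { id: "google.event", kind: "intent" },
  { id: "slot.date", kind: "slot" },
  { id: "slot.locationName", kind: "slot" },
  { id: "feature.event", kind: "feature" },
];

const edges: GraphEdge[] = [
  { from: "alexa.eventSearch", to: "feature.event", kind: "leadsTo" },
  { from: "alexa.eventLocationSearch", to: "feature.event", kind: "leadsTo" },
  { from: "google.event", to: "feature.event", kind: "leadsTo" },
  { from: "alexa.eventSearch", to: "slot.date", kind: "includes" },
  { from: "alexa.eventSearch", to: "slot.locationName", kind: "includes" },
  { from: "alexa.eventLocationSearch", to: "slot.locationName", kind: "includes" },
];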
Graph traversal
To find the most appropriate feature that matches a message element, the traversal starts from the node found in the endpoint search step and the slot nodes it includes. The logic of the business logic layer uses the graph to find all features that are directly connected to the node by an edge. As shown in FIG. 8, in the case of a simple relationship between a message element (intent 1) and a feature, the path traversed is one hop 190 along a single edge to a single feature 192, which is then selected to formulate the response message elements.
For more complex graph relationships in which a message element has multiple related features, the search process must consider the slots associated with the message element. If the message element includes only slots associated with a given feature type, the traversal path will continue to the feature with the strongest relationship, i.e., the one that includes the most slot relationships. In the example above of the event and daily message features that share Alexa's "what's new" message element, if the request message includes that message element along with a date slot and a person name slot, the traversal path will lead to the daily message feature, which is the only feature node that shares edges with both the person name and date slots, while the event feature shares an edge only with the date slot.
Message elements may also relate to other message elements, even when the related message elements are message element types of different types of voice assistant devices. Connecting these relationships together can result in a stronger path to the selected feature. The goal of the traversal logic is to determine the shortest path to a feature. If two features are the same number of edges from the message element node (i.e., have the same path length to the message element node), the traversed path must lead to the feature with the strongest relationship, i.e., the feature known to have the most connecting short edges. For example, instead of leading directly to the event feature, an Alexa event search message element may share an edge with the Google event message element, and the Google event message element may in turn have an edge leading to the event feature. The edge descriptor for the relationship between the Alexa event search message element and the Google event message element is "relates to". The traversal path from the Alexa event search to the event feature is then: Alexa event search, relates to, Google event, leads to, event feature.
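A simplified sketch of the selection logic described above: among the candidate features reachable from an intent node, prefer the one that shares edges with the most of the filled slots. The data structures and names are hypothetical, and tie-breaking by path length is omitted for brevity.

// Simplified selection among candidate features based on shared, filled slots.
function selectFeature(
  candidateFeatures: string[],
  featureSlots: Map<string, Set<string>>, // feature -> slot types it shares an edge with
  filledSlots: string[]
): string | undefined {
  let best: string | undefined;
  let bestScore = -1;
  for (const feature of candidateFeatures) {
    const shared = featureSlots.get(feature) ?? new Set<string>();
    const score = filledSlots.filter(slot => shared.has(slot)).length;
    if (score > bestScore) { best = feature; bestScore = score; }
  }
  return best;
}

// In the example above: the "what's new" intent with filled date and person name slots
// selects the daily message feature, because it shares edges with both slots.
const choice = selectFeature(
  ["feature.event", "feature.dailyMessage"],
  new Map([
    ["feature.event", new Set(["slot.date"])],
    ["feature.dailyMessage", new Set(["slot.date", "slot.personName"])],
  ]),
  ["slot.date", "slot.personName"]
); // choice === "feature.dailyMessage"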
Complex graph traversal
As shown in FIG. 9, a more complex example graph 300 includes multiple message elements carried in request messages from multiple types of voice assistant devices (corresponding to various frameworks) and multiple features. Several message elements may each be mapped to, and relate back to, multiple features. Depending on which slot is filled (i.e., has a value) based on the message element of the request message, the traversal from Alexa's speaker search intent node 302 may terminate at the FAQ feature node 304 or at the event feature node 306.
For example, if the message element is represented as Alexa's speaker search intent 302 and the person slot 308 value is filled, the traversal will follow a path 314 to Alexa's person information intent 310 and then to the FAQ feature 304.
On the other hand, if the message element is represented as Alexa's speaker search intent 302 but, instead of the person name slot value, the event type slot is filled, the traversal will follow a path 312 through Alexa's event location search intent 316 and Alexa's event search intent 318, which share edges with the event feature 306, and terminate at the event feature 306.
Similar traversal path analyses apply to the traversal paths from the Google event 320, Google location information 322, Google universal search 324, and Alexa universal search 326 message elements to the event feature 306 and the FAQ feature 304.
Note that each of the two features 304 and 306 can be reached, and can formulate response message elements, in response to request message elements received from voice assistant devices that conform to two different frameworks (Amazon's and Google's).
Output processing
After the appropriate matching feature is found through graph traversal, the business logic layer next formulates from the message elements a data structure suited to the feature. Once the data structure is formulated in a form usable by the feature, the platform invokes the feature using the structured data, formulates a formal response message that conforms to the appropriate protocol, and sends the response message derived from the feature to the originating voice assistant device. The processing may include a reverse mapping of the data structure returned by the feature to the formal response message.
Managing unseen nodes and confidence scores
If the search for the appropriate node at which the traversal path should begin shows that no node matches the message elements of the received request message, the platform returns, through the API layer, a response message to the originating voice assistant device indicating that the request is invalid or unsupported.
Apart from the simple case of no match, the number of edges from the initial message element to the appropriate feature may be too large for the traversed path to be logically considered to have reached an appropriate choice of feature. The number of edges traversed to reach a feature can be used to derive a so-called "confidence score" for the traversal path. A threshold for the confidence score may be configured, beyond which the resulting feature will not be considered a proper choice and the request will be considered bad or unsupported. For example, if the confidence score threshold is set at 10 edges, a message element that requires traversal of only one edge has a 100% confidence score, a traversal of five edges may have a 50% confidence score, and a traversal of ten edges may have a 0% confidence score. Any request whose traversal meets or exceeds the threshold is considered invalid.
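A minimal sketch of one plausible scoring scheme consistent with the figures above; the exact formula is not specified in the text. This linear scheme reproduces the five-edge (50%) and ten-edge (0%) examples and treats a single hop as full confidence.

// Hypothetical confidence scoring based on the number of edges traversed to reach a feature.
function confidenceScore(edgesTraversed: number, threshold = 10): number {
  if (edgesTraversed <= 1) return 1.0; // a direct hop is treated as full confidence
  return Math.max(0, (threshold - edgesTraversed) / threshold);
}

// Requests whose traversal reaches or exceeds the threshold are treated as invalid or unsupported.
function isAcceptable(edgesTraversed: number, threshold = 10): boolean {
  return edgesTraversed < threshold && confidenceScore(edgesTraversed, threshold) > 0;
}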
Feature and module layer
The platform supports features that can formulate responses to request messages and thus assist end users in interacting with voice assistant devices. In effect, the end user triggers a feature, and thereby the formulation of a response, with an utterance that is interpreted by the natural language processor of the voice assistant device as a message element representing the end user's intent. For example, the intent may be to have a question answered or to have an action performed, such as turning on a light. The message element is sent as part of a request message to the API layer for mapping by the business logic layer to a specific feature. The feature processes the intent and generates a response, as explained earlier.
A feature is a collection of one or more functional methods that can perform one or more of a variety of actions, such as retrieving data, sending data, invoking other functional methods, and formulating a response to a request message to be returned to the originating voice assistant device.
An example of such a feature is the event feature mentioned earlier. A user may speak to the voice assistant device to ask a question, such as "are there any health events in the Seattle office tomorrow?". The question is sent from the voice assistant device to the platform as a message element (intent) in a request message. At the platform, the event feature parses the words and other parameters of the message element and, in some cases, uses the parsed words and other parameters to retrieve a list of actual events from the platform database (or from a web service call to a customization server), based on a direct mapping of the words and other parameters to a database query or based on business logic.
Each feature uses a large amount of data input and custom business logic to generate a response. With respect to the event feature example discussed previously, the event feature can be configured to expect message elements (e.g., questions) having values for any number of placeholder parameters (e.g., slots). The event feature parses the question to extract the placeholder parameter values for further processing of the question. The processing may apply the parsed parameter values against a search index, a database, custom business logic, or a custom server to obtain one or more values for the parameters characterizing one or more answers to the question. The response formulated by the event feature may represent an answer to the question using a combination of content items including one or more of text, images, video, or audio. The content items are included as message elements in the formulated response message for return to the originating voice assistant device. Based on the message elements included in the formulated response message, the voice assistant at the voice assistant device may speak a text response or play an audio or video clip with an image (if the device supports images and video).
The execution modes supported by a feature enable, for example, the event feature to process a variety of different message elements of request messages using the same methods and processes (represented by the execution mode). For example, the end user may ask "what time does the football team play next?" or "what is happening at the TD Garden?", and the corresponding message elements of the request messages can be processed through the same execution mode of the event feature. The event feature looks for a pattern of event types or time frames to search for the corresponding items. In the above example, the event feature equates the values "football team" and "TD Garden" to an event type and a location. The word "next" in the end user's question implies a search for future events. The statement "what is happening at the TD Garden" does not include a time frame, and the feature handles the statement by defaulting to a question about future events.
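A sketch of that defaulting behavior; the slot names and the query shape are hypothetical and stand in for whatever the event feature's execution mode actually uses.

// Illustrative handling of event questions: a missing time frame defaults to future events.
interface EventQuery { eventType?: string; location?: string; from: Date; }

function buildEventQuery(slots: { eventType?: string; location?: string; date?: string }): EventQuery {
  return {
    eventType: slots.eventType,   // e.g., "football team" mapped to an event type
    location: slots.location,     // e.g., "TD Garden"
    // "next", or a missing time frame, both mean: search from now forward.
    from: slots.date ? new Date(slots.date) : new Date(),
  };
}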
In addition, a given feature may support industry-specific uses. For this reason, the platform supports modules, each of which packages, for participant users, one or more features along with their execution modes and content items (such as sample content items). The features packaged in a given module typically have a relationship to each other based on an industry (or some other logical basis). In some implementations, modules are represented within the code stack of the platform as containers that reference particular features and content items. The modules include the features and content items for creating, managing, updating, and fulfilling the end user's voice experience needs, as presented to participant users through the user interface of the platform.
Feature processing
Examples of methods performed by features are event handlers and FAQ handlers. A user may ask a voice assistant device a question, such as "are there any health events tomorrow in the Seattle office?". The corresponding feature parses the message elements in the corresponding request message and retrieves a list of events based on its use of a database, custom business logic, or responses from custom web service calls.
The business logic used by the business logic layer to process the message elements of a request message breaks down into three main steps: feature location search and discovery, the feature server request, and response processing.
At the end of this process, a response message is sent to the originating voice assistant device.
Feature location discovery
As shown in FIG. 10, when the voice experience server 110 receives a request message from a voice assistant device 521 and parses the message elements in the request message, the server sends a request 523 for a graph traversal. Once the graph has been traversed 501 for the supported types of voice assistant devices, the feature and module layer knows the type of feature 527 represented by the message elements of the request message. The feature type may be represented by a unique identifier such as a GUID, UUID, or keyword. With this unique ID, the feature and module layer may search 502 the feature database 504 for all the information defining the feature (including its execution modes and other information). Once the feature and module layer has the information about the feature, it can find out where the given voice application has registered the feature. The registration, or metadata about the feature, may exist on a server 505, possibly internally on a management server of the platform, or on a custom server controlled by a platform participant. Each of these servers can be scaled independently of the platform to properly handle fluctuations in the lookup requests that it needs to process, separately from any other features.
For example, if the traversal of the graph 501 results in selection of the event feature, that feature type (in this case, feature type "event") will have a unique identifier, such as a592a403-16ff-469a-8e91-dec68f5513b5. Using this identifier, the feature and module layer searches a feature management database 504, such as a PostgreSQL database. The database includes a table with records of the event feature type, the related voice applications, and the feature server location that each voice application has selected for the event feature. The feature server location record includes a URL for the location of the server 505, such as https://events-feature.voicify.com/api/eventSearch, and the HTTP method accepted by the feature server, such as HTTP GET. The feature server location record does not necessarily include a URL managed by the platform; the server location may be external, implementing a custom feature, such as https://thirdpartywebsite.com/api/eventSearch.
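The lookup described above might be represented roughly as follows; the record shape and query are illustrative only, not the platform's actual schema.

// Illustrative shape of a feature server location record in the feature management database.
interface FeatureServerRecord {
  featureTypeId: string;      // e.g., "a592a403-16ff-469a-8e91-dec68f5513b5"
  voiceApplicationId: string;
  serverUrl: string;          // e.g., "https://events-feature.voicify.com/api/eventSearch"
  httpMethod: "GET" | "POST";
}

// Finds where a given voice application has registered the server for a feature type.
function findFeatureServer(
  records: FeatureServerRecord[],
  featureTypeId: string,
  voiceApplicationId: string
): FeatureServerRecord | undefined {
  return records.find(
    r => r.featureTypeId === featureTypeId && r.voiceApplicationId === voiceApplicationId
  );
}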
Once the platform has found the appropriate feature server 505, it sends a service request 529 for the feature server 505 to execute the feature type, using parameters derived from the message elements of the request message, and waits for a response 499.
Feature server request
Once the feature server 505 is found, a service request is sent to it by creating an HTTP request that includes HTTP headers identifying that the request comes from the feature and module layer of the platform and an HTTP body that includes the words and parameters parsed from the message elements of the request message from the voice assistant device, represented according to the corresponding feature request protocol. The service request is then processed on the feature server, for example by using the words and parameters from the message elements to search for matching content items. Search results are returned to the feature and module layer, represented according to the service response protocol that the feature defines.
Each feature defines a feature request protocol and a feature response protocol. These protocols define the format and structure of the service requests and service responses used to send requests to and receive responses from the feature server. The feature request and feature response protocols define rigid formulation requirements. FIGS. 11A and 11B are examples of a JSON version of the feature request protocol, and FIG. 12 is an example of a JSON version of the feature response protocol. By defining strict feature request and feature response protocols, the platform can be confident that the feature server will be able to properly process each feature request and provide a proper feature response that the feature and module layer of the platform can properly process. This architecture, in addition to supporting the feature servers built into the platform, enables developers to create their own custom feature servers to handle requests and responses for a given type of feature.
The general structure of the feature request protocol includes information about the feature that is the subject of the service request, the content of the service request, and information about the message elements included in the message request from the voice assistant device that were used to traverse the graph to find the feature. This architecture also enables feature servers managed by the owner of the platform, or created on behalf of platform participants as custom feature servers, to process requests and responses in the native form in which they are handled by or come from the voice assistant device. This enables the custom and platform servers to use the full capabilities of the framework API of each type of voice assistant device.
For example, when sending a service request to an event feature server (whether managed internally in the platform or managed at a third-party server), the feature and module layer will send an HTTP request with an HTTP body following the example feature request protocol of FIGS. 11A and 11B and with the following headers: Authorization: 1d91e3e1-f3de-4028-ba19-47bd4526ca94; Application: 2e1541dd-716f-4369-b22f-b9f6f1fa2c6d.
The Authorization header value is a unique identifier that is automatically generated by the platform and is unique to the voice application and feature type. This value can be regenerated by the platform participant, enabling the feature server to ensure that a request does not come from a malicious third party. The Application header value is a unique identifier for the voice application, enabling the feature server to verify that the request came from an authorized voice application.
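A sketch of how the feature and module layer might issue such a request. The header names follow the example above; the body shape is only indicative of the feature request protocol shown in FIGS. 11A and 11B, and the function name is hypothetical.

// Sends a feature service request to a feature server using the headers described above.
async function callFeatureServer(
  serverUrl: string,
  authorization: string,   // per-application, per-feature-type secret
  applicationId: string,   // identifies the calling voice application
  featureRequest: object   // body shaped according to the feature request protocol
): Promise<object> {
  const response = await fetch(serverUrl, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: authorization,
      Application: applicationId,
    },
    body: JSON.stringify(featureRequest),
  });
  if (!response.ok) throw new Error(`Feature server error: ${response.status}`);
  return response.json(); // expected to follow the feature response protocol
}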
Response handling
Once the feature server 505 has finished processing the feature service request, it must return data represented according to the feature response protocol. The feature service response 499 includes information about the content items found by the feature server and possibly information about rich media content items for voice assistant devices capable of presenting richer content items. The feature service response may include URL pointers to file locations, such as image, video, or audio files. The data included in the feature service response is validated by the feature and module layer to ensure compliance with the service response protocol and to ensure that the data includes valid information.
If there is an error in the validation of the feature service response, or if the initial feature service request times out or is invalid, an error response message is sent to the voice assistant device for the initial request message received by the API layer.
If the feature server returns a successful feature service response that passes validation, the feature service response 519 is processed by the feature and module layer of the voice experience server 110 to formulate a response message to be sent to the voice assistant device. This process involves mapping the feature service response to the protocol of the framework of the voice assistant device 521, including mapping media files and other content items to an appropriate form. If the voice assistant device supports a rich media item format, such as video, the process will prioritize the rich media item. Otherwise, processing falls back to simple text content to be spoken or read by the voice assistant to the end user, e.g., if rich media is not included in the response. Using the message elements included in the response message, the originating voice assistant device will be able to render or present the response to the end user. If the initial request message came from a generic or unsupported AI device or voice assistant device, a generic response message is returned that includes the original version of the content items from the feature service response so that the unsupported AI device or voice assistant device can itself determine whether and how to use or render each content item.
For example, if the requesting voice assistant device supports rendering content richer than voice alone, such as images or video (as the Amazon Echo Show does), the response formulation process at the feature and module layer will map the URLs for rich media items included in the feature service response to the rich media properties in the message elements of the response message that conform to the framework protocol of the voice assistant device. Certain features may enable the voice assistant device to present multiple types of media items, such as images and text, while reading the answer to the end user. The business logic layer of the platform knows the configurations of the supported voice assistant devices, which facilitates formulating the response message according to the optimal configuration. For a voice assistant device that does not support rich media items, the default behavior of the feature and module layer is to formulate the message elements of the response message as a voice response so that the voice assistant device speaks the text sent to it in the response message.
For example, if the request message is from a voice assistant device such as the Echo Show, which supports images and text, the feature service response provided by the event feature may be as shown in FIG. 13. The feature service response shown in the example of FIG. 13 enables the result in the text response to be spoken and shown in the card area of the voice assistant device, and also maps the image URL to the appropriate card image URL according to the Alexa response message protocol.
Using the same example feature response, now assume that the requesting voice assistant device is an Alexa Echo Dot, which does not support the presentation of visual content items. The Alexa response message can then be much simpler:
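A rough reconstruction of such a text-only response follows; the exact JSON example appears only as a figure in the original specification, and the spoken text here is invented for illustration.

// Rough reconstruction; the original JSON is shown only as a figure in the specification.
const alexaTextOnlyResponse = {
  version: "1.0",
  response: {
    outputSpeech: {
      type: "PlainText",
      text: "The basketball game starts at 3:00 PM on May 2nd.", // text taken from the feature response
    },
    shouldEndSession: true,
  },
};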
This example maps only the text from the feature response to the text of the outputSpeech property of the Alexa protocol, which is then spoken to the user by the Alexa Echo Dot.
Feature content search
When a feature processes the message elements of a request message routed to it as a result of a graph traversal, the feature's processing must search for the content items to include in the response, as shown in FIG. 14. The feature server 505 is responsible for finding and including content items relevant to the feature-based service request. In some implementations, the feature server searches for content items 510 within a search index 511 of managed content items authored or otherwise controlled by platform participants. The content search index 511 provides an efficient repository, based on structured content items, for feature server queries. The content items identified in the search results are those that exactly match the query or that are likely matches based on search confidence; content items are excluded when zero content items are returned or when the returned items have low confidence scores.
There are two key aspects that enable the feature server to return the appropriate content items: content indexing 512 and content search 531. Content indexing and content search work together so that content items in the content database 504 can be found by the feature server 505 and provided to the feature and module layer for formulating responses to the voice assistant device.
Content indexing
As stored in the platform's database, each content item has certain fields and properties, such as textual content, identifiers, URLs, and the like, that contain simple information that can be easily searched when placed into the elastic search index 511. To improve the performance of the feature server, all content items reachable by the feature processing should be added to the elastic search index 511. Some content items used by a feature may have one or more particular properties that are more valuable in the index, and weights may be added to those properties in the fields of the index. The weighting enables the elastic search index to prioritize searches relative to the fields in descending order of their weights. The weights yield scores when a search against the index has multiple hits on different fields of a given content item.
For example, if an event content item has the following fields, the indicated weight values (on a scale of 1 to 5) may be associated with them: event name - 4, event location - 2, event start date/time - 5, event end date/time - 1, event details - 2, and event summary - 2.
These weights prioritize searches with respect to the start date/time of the event and the name of the event. Thus, if there are two events with similar descriptions but different start times, and the request includes a particular date to search for, such as "tomorrow" or "March 3rd", the top result will be the event content item whose start date and time match the requested date. If two events occur at the same time, the next field in the prioritized search is the name. For example, if there are two events with the same start date and time, 5/2/2018 3:00 PM, but one is named "basketball game" and the other "hockey game", then a request such as "what time is the hockey game on May 2nd?" will find the second event, named hockey game, as the top result and return it instead of the basketball game event.
Content items are added to, updated in, and removed from the elastic search index automatically as participant users update content items using the content management system 513. If a participant user deletes a content item by marking it as removed from the database (or deleting it altogether), the content indexer process 512 will remove the content item from each elastic search index that includes the content item. Likewise, if the participant user updates the properties of content items or adds new content items, those items 535 are updated 533 in, or added to, the elastic search index. The index can also be manually populated or reset. Doing so forces the content indexer process to rebuild the index by querying the database 504 for the content items that should be indexed and then recomposing the index and cache using that data.
For example, assume that a platform participant adds a new content item for the event feature having the following properties and values: event name: basketball game; event location: gymnasium; event start date/time: May 2nd, 3:00 PM; event end date/time: May 2nd, 5:30 PM; event details: the third matchup this year between the Rams and the Lions; event summary: tickets start at $15 and doors open at 1 PM! Purchase some merchandise to support your team!
Once a participant user marks the content item as valid or publishes the event, the content item is added directly to the elastic search index and the event becomes available to be found in searches performed by the feature server 505 on behalf of the event feature. Assume the participant user returns to the content item in the CMS and updates a property, such as: event location: the gymnasium at 100th Avenue. The update process will update the record of the content item in the database and also update the content item in the elastic search index. Now assume a disconnection occurs from the voice experience server 110 or the content management system 513 to the elastic search index 511 that could cause desynchronization, such as a failure in maintaining the elastic search index. When the connection is restored, the elastic search index is flushed, i.e., all content items in the index are removed. Once this is done, the index processor 512 communicates between the database 504 and the elastic search index 511 to re-add all the appropriate content items. Finally, if the participant user removes the basketball game event from the CMS, the event is marked as purged from the database and deleted entirely from the index to ensure that it will not be found by any feature server.
Content search
Once a content item has been added to the database and to the elastic search index by the content indexer, the item is ready to be found in searches by the feature server. If the index is not populated (has no data) because of a valid flush of the cache and index, or for any other reason, the feature server 505 falls back to querying the content database 504 directly using a conventional fuzzy search technique 514. Fuzzy searches produce lower-confidence results for content items but guarantee that content items can be reached while updates are being made to the system or if the index 511 becomes corrupted. In some implementations, the content database is a relational database 504 that includes the information managed in the content management system 513, including the content items and information about the features that a given voice application has enabled.
When the index is populated and reachable, the feature server performs searches against the index. A preliminary filter can enable fast searches, such as searching only for content items matching the feature type represented by the feature server. This enforces a rule that a given feature server will not return a content item associated with another feature. A search against the index returns a set of results that match the search request. If there is no match, the message element of the request message cannot be successfully processed, and an appropriate response message is returned from the feature server to the voice experience server explaining that the feature server does not know what to do with the message element. When a single content item is found in the search, also referred to as an exact match, that content item is returned as a response message element to the voice experience server. If many content items are found to match the message element, the content item with the highest score, based on the weights of the searched fields, is returned as the message element for inclusion in the response message.
In the example above involving the basketball game and hockey game events, the total possible score for a perfect match would be the sum of the weights of all indexable fields: 16. If the feature service request processed by the feature server includes information about the start date/time and name and nothing else, then the maximum achievable score is 9. If the search query includes the start time shared by both events and the name of the hockey game, the score for the basketball game will be 5 and the score for the hockey game will be 9, and the hockey game event information will be returned as the message element to be included in the response message sent to the voice assistant device.
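A sketch of the weighted scoring described above. The field weights match the example (totaling 16), and the matching logic is simplified to exact field matches; a real search index would apply fuzzier matching.

// Simplified weighted scoring: each field that matches the query contributes its weight.
const eventFieldWeights: Record<string, number> = {
  eventName: 4,
  eventLocation: 2,
  eventStart: 5,
  eventEnd: 1,
  eventDetails: 2,
  eventSummary: 2,
}; // total possible score: 16

function scoreContentItem(item: Record<string, string>, query: Record<string, string>): number {
  let score = 0;
  for (const [field, weight] of Object.entries(eventFieldWeights)) {
    if (query[field] && item[field] === query[field]) score += weight;
  }
  return score;
}

// With a query carrying only a start date/time and a name (maximum achievable score 9),
// the hockey game (name + start match, score 9) outranks the basketball game (start only, score 5).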
Feature and module customization
In addition to the modules that the platform supports and manages as standard, the platform enables platform participants to create custom modules. When building a custom module, the participant user may select registered feature types to add to the module. The platform also enables developers to create custom feature servers that replace the supported feature servers during execution of the voice application.
There are two aspects to customizing the way content items are retrieved and managed in a customized context: custom modules and custom features.
A custom module is a non-technical element and does not require separate development or maintenance by the platform participant, while a custom feature requires the developer to create and maintain a web server with which the platform can communicate in order to enable execution of the custom feature.
Creating customization modules
At a high level, a module 508 is a collection of features 509 and of contextualized content items 510 within those features, as shown in FIG. 15. As an example, the platform may be preconfigured to include a collection of industry modules, such as a higher education module or an employee health module, as shown in FIG. 16. When any of these modules is added to a voice application 507, the platform may pre-populate the features 516, 541 of the module with sample (e.g., standard) content items 517 that the platform participants 506 may use, update, or remove and replace with their own content items. As examples, the pre-populated (e.g., standard or sample) features 516, 541 may include frequently asked questions, quick polls, and surveys. The platform maintains and administers the pre-populated modules 515; however, platform participants are not limited to these pre-populated modules and their features. If a platform participant wishes to mix and match features of different modules, or wants to create a collection of features having a different context than the existing modules enable, the platform participant can create one or more custom modules, as shown in FIG. 17.
A custom module 518 must be given a name that is unique within the context of the voice application to which it belongs. Platform users may also give their modules descriptions to help establish the context for the features and for content item creation. When a developer creates a module with a unique name, it is registered within the platform. Once a platform participant has created a module with a unique name, the owner can begin adding features to it. The features may be pre-existing (e.g., standard or sample) platform-supported features 516 or custom features 520. If the added feature is a pre-existing feature 516, the owner may then begin adding content items to the feature within the custom module 518.
In addition to creating a new custom module from scratch, a platform participant can also add an existing (e.g., standard or sample) industry module to the voice application and adjust the features within the module by adding features, removing features, or using custom features instead of or in addition to pre-existing features to form a custom module 519, as shown in FIG. 18. As with pre-existing features, adding a feature to an industry module will not populate content items within the feature. For example, if the employee health module is already used by the voice application and the participant user wants to add another feature to the module that is not included or was previously removed, the participant user may view, through the user interface of the platform, the remaining supported feature types that have not yet been added and may add the desired feature to the module. The participant user may then choose whether to use pre-existing features, custom features that the participant user has developed itself, or custom features implemented or registered from third parties.
Creating custom features
A platform feature is implemented by a combination of a feature server and the feature type it represents. The feature type defines the expected feature request protocol, the feature response protocol, and the location of the feature server to which an HTTP call is sent when the feature type is identified as the appropriate feature found during graph traversal. This architecture applies both to supported, managed features and to custom features created to extend the platform. A platform participant may want to create a custom feature if it has pre-existing content items stored outside the platform database or content items managed by another system, if its security requirements do not allow the content items to be managed by an external system such as the platform, or if it wants to enhance or change the functionality or behavior of the platform.
To create a custom feature, a platform participant can create a publicly accessible web server that serves as a custom feature server. The custom feature server has an HTTP endpoint that accepts the expected feature service request, expressed according to the protocol in the HTTP body, and returns the expected feature service response, also expressed according to the protocol. In some implementations, the endpoint must return its feature service response within a limited period of time to ensure that the end user's experience is not degraded by slow processing outside of platform control.
The custom feature server may use the data from the feature service request in any manner, so long as the expected feature service response is returned. The custom feature server may use the message elements of the initial request message from the voice assistant device, track its own internal analytics, parse the message elements of the request message, and provide functionality unique to the voice assistant device or voice application sending the request message. As shown in FIG. 19, for example, if a platform participant has managed its event information using a third-party service and does not want to migrate that data to the platform, the participant may instead develop a custom event feature server 521 to replace the default (supported) event feature server 555. However, the custom event feature server 521 must accept a feature service request expressed according to the same protocol as the platform's event feature server 555 and return a feature service response expressed according to the same output protocol as the platform's server. Once the developer has created this publicly accessible custom event feature server, the developer can update the voice application in the CMS to change the feature server location to the URL of the custom event feature server.
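A minimal sketch of such a custom event feature server follows, assuming a JSON body shaped like the hypothetical FeatureServiceRequest above and using Express only as an example HTTP framework; the endpoint path, payload shape, and the third-party lookup are illustrative assumptions.

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Hypothetical endpoint that accepts a feature service request and returns a
// feature service response in the same protocol the platform's event feature
// server would use.
app.post("/custom-event-feature", async (req, res) => {
  const featureRequest = req.body;

  // Look up event data in the third-party system instead of the platform database.
  const events = await fetchEventsFromThirdParty(featureRequest.slots?.eventName);

  res.json({
    featureType: "event",
    contentItems: events.map(e => ({ id: e.id, text: e.summary })),
    speechText: events.length > 0
      ? `The next event is ${events[0].name} on ${events[0].startDate}.`
      : "I could not find any upcoming events.",
  });
});

// Placeholder for the participant's own integration with the external event service.
async function fetchEventsFromThirdParty(eventName?: string) {
  return [{ id: "1", name: "Orientation", startDate: "2019-09-01", summary: "Fall orientation" }];
}

app.listen(3000);
```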
Each custom feature server must be associated with an existing feature type; the platform needs this association in order to send feature service requests to the appropriate feature server. However, as shown in FIG. 20, a feature server may also register as a custom fallback feature server 523, so that for a given voice application, if a request from a voice assistant device does not match any feature type registered for the voice application, a feature service request 524 may be sent to the fallback custom feature server 523. This arrangement enables full customization of how responses are handled, such as creating a voice application that includes a custom module with no features other than a fallback custom feature. As shown in FIG. 21, in that case all feature service requests 525 will be forwarded to the custom feature server 523, which may be designed to process all message elements of the response message itself without using any platform-supported features 526. These types of custom features still require that the feature service response returned to the voice experience server match the protocol for the expected feature service response of the fallback type. The feature service request in this case may include the message elements of the initial request message from the voice assistant device and information about what the message elements are attempting to retrieve, such as the feature type they most closely match. As shown in FIG. 21, processing proceeds in this way even if the feature type is not registered in the voice application.
For example, if a given voice application does not have an event feature enabled in any of its modules, but a request message that includes message elements for an Alexa event search reaches the voice experience server, graph traversal will not be able to find a matching feature because the appropriate match would be the event feature. If the voice application has registered a custom fallback feature, processing will skip the graph traversal step, find the fallback feature server information in the content database, and send the initial native Alexa event search message elements to the custom fallback feature server. The custom feature server may then apply any desired processing to the original Alexa event search message elements and return a structured feature service response specific to the fallback feature type. If there are no registered features other than the custom fallback feature server, graph traversal will always be skipped in favor of proceeding directly to the custom feature server.
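The routing decision described above might look roughly like the following sketch; the registration shape, feature type names, and helper functions are assumptions for illustration.

```typescript
// Hypothetical routing sketch: if the incoming message elements match no feature
// type registered for the voice application and a fallback server is registered,
// the raw message elements are forwarded to the fallback custom feature server.
interface VoiceAppRegistration {
  registeredFeatureTypes: Map<string, string>; // featureType -> feature server URL
  fallbackFeatureServerUrl?: string;
}

async function routeRequest(app: VoiceAppRegistration, messageElements: { likelyFeatureType: string }) {
  const serverUrl = app.registeredFeatureTypes.get(messageElements.likelyFeatureType);
  if (!serverUrl && app.fallbackFeatureServerUrl) {
    // Skip normal graph traversal; send the initial message elements plus the
    // closest matching feature type to the fallback custom feature server.
    return postFeatureServiceRequest(app.fallbackFeatureServerUrl, {
      featureType: "fallback",
      closestMatch: messageElements.likelyFeatureType,
      originalMessageElements: messageElements,
    });
  }
  // Otherwise the matching registered feature server handles the request.
  return serverUrl ? postFeatureServiceRequest(serverUrl, messageElements) : undefined;
}

async function postFeatureServiceRequest(url: string, body: unknown) {
  const response = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  return response.json();
}
```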
Content management layer
The interaction between the voice assistant and the end user is provided by the voice application and is realized based on the content items managed by the platform. The platform enables participant users to create, modify, and delete content items for use by features as desired. These participant users may work with feature-based content items through the user interface of the platform using a web browser or mobile device. As discussed earlier, a feature may be implemented as a processor for requests involving a particular type of message element, such as requests for information about an event. Features also provide a consistent structure for adding content items based on the protocol defined by the platform.
For example, the event feature may include the following characteristics: event name, event location, event start date/time, event end date/time, event details, and event summary, among others. For this feature, the participant user simply adds, modifies, or removes information about an event 622 (FIG. 22) using fields presented within the user interface of the platform. The content items for the feature are added to a search index that is queried when an end user of a voice assistant device asks an event-specific question.
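An event content item of this kind could be sketched as follows; the field names mirror the characteristics listed above but their exact spelling, and the sample values, are assumptions.

```typescript
// Illustrative sketch of the event feature's content item characteristics.
interface EventContentItem {
  eventName: string;
  eventLocation: string;
  eventStart: string;   // e.g. ISO 8601 date/time
  eventEnd: string;
  eventDetails: string;
  eventSummary: string;
}

// When a participant user saves the form fields in the CMS user interface, the
// resulting content item can be added to the search index that event questions
// from end users are matched against.
const homecoming: EventContentItem = {
  eventName: "Homecoming Weekend",
  eventLocation: "Main Quad",
  eventStart: "2019-10-04T17:00:00Z",
  eventEnd: "2019-10-06T22:00:00Z",
  eventDetails: "Alumni events, campus tours, and the homecoming game.",
  eventSummary: "Annual homecoming celebration for students and alumni.",
};
```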
As shown in FIG. 23, a participant user may use a content management system user interface 611 to manage content items for the voice application for all selected feature types within a given module (whether a platform-managed module or a custom module). Additionally, the participant user may view cross-device (e.g., cross-framework) analytics 612 based on usage data for a given voice application across multiple frameworks of voice assistant devices, because the generic voice application platform can process request messages from all such voice assistant devices.
To add content items, the user interface sends content management requests to the CMS API using HTTP. The CMS API then manages where the content items are stored. The content items may include text or media assets such as audio in mp3 format or video in mp4 format. Content items in the form of media assets are uploaded to a blob store or file management system, and the metadata and related content items are stored in a scalable relational database.
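The storage routing just described could be sketched as shown below; the function names and storage clients are placeholders, not the platform's actual API.

```typescript
// Sketch: media assets go to the blob/file store, and the returned URL plus the
// text metadata is written to the relational content database.
async function saveContentItem(item: { text?: string; media?: { fileName: string; bytes: Uint8Array } }) {
  let mediaUrl: string | undefined;
  if (item.media) {
    // Upload the mp3/mp4 asset; only the publicly accessible URL is kept for
    // voice assistant devices to stream or render.
    mediaUrl = await uploadToBlobStore(item.media.fileName, item.media.bytes);
  }
  await insertIntoContentDatabase({ text: item.text, mediaUrl });
}

// Placeholders standing in for the blob store client and the database client.
async function uploadToBlobStore(fileName: string, bytes: Uint8Array): Promise<string> {
  return `https://files.example.com/${fileName}`;
}
async function insertIntoContentDatabase(row: { text?: string; mediaUrl?: string }): Promise<void> {
  console.log("insert content item", row);
}
```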
The CMS API is not limited to content items related to feature types; it also enables participant users to manage their accounts, voice applications, modules, features, and other aspects of the platform, including registering custom modules and custom features. Each content item is structured specifically for its corresponding feature type, in that the characteristics and fields of the content item conform consistently to a common protocol used to represent content items for any given feature type. Each content item is also associated with a particular voice application, to prevent platform participants other than the appropriate platform participant that has access to the voice application from viewing or using the owner's content items in the user interface. While a given feature type may be used across multiple modules, feature content items are directly associated with the module that manages them. For example, if a feature content item value representing the answer to a frequently asked question is the same for two modules, it is stored twice in the database.
Support and guidance from CMS
Voice assistant devices vary in how they can present content items based on their internal hardware and software. One voice assistant device may support video, audio, images, and text, while another may support only text and audio. The CMS may provide guidance and real-time feedback regarding content items added by participant users. For example, as shown in FIG. 24, a participant user may enter a text content item related to an event in addition to audio files and images also related to the event 652. The CMS interface will indicate which types of voice assistant devices support the submitted types of content items 651, 661 (FIG. 26).
Participant users who choose to include audio or video as part of the message elements of a response message may generate the content item directly within the CMS through the user interface of the platform 641. Thus, as shown in FIGS. 24 and 25, the platform enables a platform user to generate and manage multiple types of content items in one location 642.
Questions and answers
The platform is designed to store and provide different phrases and sentences that a voice assistant device can speak, for example, to answer end-user questions, optionally organized as sets of questions and answers. As shown in FIG. 22, the CMS interface enables platform users to create a set of questions 621 and answers 623.
Comprehensive multi-language support
The platform comprehensively supports multi-language content and voice interaction in the interface of the voice content management system. Because the voice content management system interface supports multiple languages, the interface is accessible to non-English platform users in their native languages. In some implementations, the platform may support the ability to publish non-English content. To make this approach useful, the instructions and prompts within the interface also need to be provided in the native language of the platform user.
The platform supports multi-language content for voice interaction from the data layer up through the final response message to the voice assistant device, based on a data model representing a given content item. All content items within the platform inherit from objects that include properties for language and version. Thus, any content item in the system may have corresponding items in other languages. For example, a question in the voice content management system asking "how big is the student population" with a language value of EN-US may have equivalent entries in Spanish and French with language values of ES-ES and FR-FR.
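A sketch of such a data model is shown below; the property and type names are illustrative assumptions, but the idea follows the inheritance of language and version properties described above.

```typescript
// Every content item inherits language and version properties, so equivalent
// items can exist per language for the same logical content.
interface LocalizedContentItem {
  language: string;  // e.g. "EN-US", "ES-ES", "FR-FR"
  version: number;
}

interface QuestionAnswerItem extends LocalizedContentItem {
  question: string;
  answers: string[];
}

const studentPopulation: QuestionAnswerItem[] = [
  { language: "EN-US", version: 1, question: "How big is the student population?", answers: ["About 1,800 students."] },
  { language: "ES-ES", version: 1, question: "¿Cuántos estudiantes hay?", answers: ["Aproximadamente 1.800 estudiantes."] },
];
```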
Analysis of
The analytics processing of the platform may analyze usage data representing many different aspects of the operation of the platform and process this vast amount of information to provide participant users with insight into the performance of their content items 624, features, modules, and voice applications. As shown in FIGS. 27 and 28, the data analysis may include metrics measured across different types (frameworks) of voice assistant devices 671, 681 and across the specific voice assistant devices that are the sources of the initial request messages, metrics on the types of features invoked by the message elements of the requests 672, and comparisons of the performance of the various content items used by a given feature. These types of analysis are separate from the analysis used by the platform itself to determine the performance of components, aspects, and the platform as a whole.
The key categories of analysis provided by the platform include data accumulation, data analysis and processing, key performance indicators, and intelligent suggestions.
Data accumulation
Analyzing the performance of content items is critical to enabling platform participants to create a good voice experience for the end users of voice assistant devices. There are points in the data flow at which the raw data can be collected and analyzed particularly effectively. The platform applies machine learning methods to the raw data to sort the data into buckets and to compare large amounts of data accumulated over time.
The types of data analyzed by the platform include each of the following (and combinations of two or more of them): the type of voice assistant (e.g., framework) from which the request message originated (e.g., Alexa, Google Assistant, Apple Siri, Microsoft Cortana, or a custom voice assistant), the type of voice assistant device from which the request message originated (e.g., Echo Show, Google Home, a mobile device, Echo Dot, or others), the type of feature invoked by the message elements of the request message, metadata for each processed content item, content items typically found together, the success rate of resolving the message elements of a request message to the appropriate feature, misses in finding content items, information about the end user whose speech initiated the request, information about the application, raw usage information, time of day, repeat versus new users, the geographic location and region from which the request message originated, and verified end user information, among others.
These data items may also be related to each other. As previously described, the relationships of the data items provide insight into the performance of the content items.
There are certain particularly effective locations in the operational flow of the platform where raw analytics data can be collected, and there are sub-flows for how it is collected. Once collected, the raw data can be processed into more easily understood structured data. The locations for data collection include: the initial receipt of a request message at the API layer, the performance of a content search by the feature server, and the processing of a response message by the voice experience server, among others.
Receipt of requests by a Voice experience API
The request message sent by the voice assistant device to the voice experience server API includes useful raw data. The raw data sent will depend on the type of voice assistant device, although the data sent by many types of voice assistant devices typically includes: a user identifier, information about the voice assistant device that originated the request message, information about the voice assistant that originated the request message, and certain data included in the request (e.g., message elements).
The API layer of the platform translates the raw data into an abstract form that is represented according to a set of protocols that are shared across different frameworks. As shown in FIG. 29, once the raw data is structured and represented according to the abstract protocol, it is sent to an accumulation data store implemented as a data lake 528, where it is stored for later processing 530 by one or more data analysis processes 529, for example.
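The translation into an abstract, framework-independent record and its write to the data lake might look roughly like the following sketch; the record shape, framework names, and storage call are assumptions for illustration.

```typescript
// Hypothetical abstract record shared across frameworks.
interface AbstractRequestRecord {
  framework: "alexa" | "google-assistant" | "siri" | "cortana" | "custom";
  deviceType: string;          // e.g. "Echo Show", "Google Home"
  userId: string;
  voiceApplicationId: string;
  messageElements: unknown;    // abstracted intent/slot data
  receivedAt: string;          // ISO timestamp
}

async function recordRawRequest(framework: AbstractRequestRecord["framework"], rawRequest: any) {
  const record: AbstractRequestRecord = {
    framework,
    deviceType: rawRequest.device?.type ?? "unknown",
    userId: rawRequest.userId ?? "anonymous",
    voiceApplicationId: rawRequest.applicationId,
    messageElements: rawRequest.messageElements,
    receivedAt: new Date().toISOString(),
  };
  // Append the structured record to the accumulation store (data lake).
  await writeToDataLake("requests/", record);
}

// Placeholder for the actual data lake client.
async function writeToDataLake(prefix: string, record: unknown) {
  console.log(`write ${prefix}${Date.now()}.json`, JSON.stringify(record));
}
```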
Feature server content search
By creating a search index that uses weights for the fields and allowing the message elements of a request message to match multiple feature content results, the platform can track the results returned in corresponding response messages and the content items commonly found in results across multiple request messages. This enables the platform to show platform participants, through the user interface of the platform, which of their content items are being used most frequently and which are being missed. The platform participant may then decide to change the wording, structure, or other characteristics of a content item or of the message elements of the response message to produce better results when interacting with end users.
As shown in FIG. 30, when feature server 505 queries content index 511 and receives possible results (content items), the raw possible results 527 can be stored in data lake 528. The stored data identifies the content items from the search results and related information about the queries that returned those results, such as the feature service requests. The data in the feature service request stored with the search result data relates to the request data originally sent from the API, in that the feature service request includes the initial message elements of the request message received from the voice assistant device.
Response handling
Once the message elements from the initial request message and the data of the content search results have been stored in the analysis data lake, the message elements to be included in the response message may be formulated by converting the feature service response from the feature server into a form that conforms to the protocol expected by the corresponding voice assistant device. The process of generating the message elements of the response message is a useful point for accumulating raw data for analysis.
For example, if data lake 528 includes message elements from request messages, information about the originating voice assistant devices, and the request messages and response messages, analysis process 529 can combine those data sets into a cleaner and leaner model 530 to make it easier to show, for example, how many end users use various types of voice assistant devices or how many request messages have generated successful response messages for a certain type of voice assistant device. For example, if a voice application has an Alexa skill and a Google Action that use SDKs to send the message elements of request messages to the voice application, the platform participants can learn how many end users use the Alexa skill versus the Google Action for the voice application overall, how many end users use Alexa versus Google for a particular feature such as the event feature, or how many end users of two different voice assistant devices request a particular content item.
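A static analysis pass over such records could be as simple as the aggregation sketched below; the record fields and metric names are assumptions, and a real pipeline would read from the data lake and write to the analytics database.

```typescript
// Sketch: count usage and success per framework so the CMS can show, e.g.,
// Alexa skill versus Google Action performance for a voice application.
interface ResponseRecord {
  framework: string;   // "alexa", "google-assistant", ...
  featureType: string; // e.g. "event"
  success: boolean;    // whether a matching content item was returned
}

function aggregateByFramework(records: ResponseRecord[]) {
  const stats = new Map<string, { total: number; successes: number }>();
  for (const r of records) {
    const entry = stats.get(r.framework) ?? { total: 0, successes: 0 };
    entry.total += 1;
    if (r.success) entry.successes += 1;
    stats.set(r.framework, entry);
  }
  // The leaner, structured result is what would be stored in the analytics database.
  return [...stats.entries()].map(([framework, s]) => ({
    framework,
    total: s.total,
    successRate: s.total ? s.successes / s.total : 0,
  }));
}
```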
The analysis process may also track which message element types for a given type of voice assistant device match a given feature, enabling platform participants to consider moving content items to a custom fallback feature server. Because the initial request message includes the initial message element type, analysis process 529 can skip the graph traversal and find the feature directly. For example, if a platform participant notices that a Google Action tends to use a particular message element type that the platform participant does not want mapped to the feature it is currently mapped to, the owner can disable the feature and customize how the message elements of the request message are handled by using a custom fallback feature server or a custom feature server.
The type of analysis discussed above may be considered static analysis, and the processing of data into abstract structures may be referred to as static data analysis. As discussed later, static data analysis differs from what is referred to as dynamic data analysis or intelligent data analysis, which uses machine learning to understand patterns in the analysis rather than displaying data directly.
Once the message elements of the request message have been mapped from their original state stored in data lake 528 to a more structured form stored in database 531, the original data in the data lake may be deleted or moved to a long-term archive by compressing the data into a file and saving it to blob store or file store. Archiving certain types of data enables training of new or revised machine learning algorithms without having to recollect the data for training, and also serves as a backup against data corruption or data loss in the analytics database 531.
Machine learning and intelligent advice
The analysis engine uses machine learning and extensive analysis data to provide analysis and recommendations to platform participants. This dynamic or intelligent data analysis may be used to provide intelligent suggestions to platform participants as to how to structure content items, where to place certain types of content items, which content items work well, and which content items do not work well.
As shown in FIG. 31, a general flow for processing analytics data includes: storing raw data in the data lake, retrieving the raw data from the data lake, sending the raw data to static analysis, sending output from the static analysis to machine learning 534, storing recommendations for platform participants in a separate database 535 for later use, requesting the recommendations 536 based on the output of the machine learning algorithm, and presenting the recommendations to the platform participants through the user interface of the platform.
Data analysis and processing
As shown in FIG. 31, processing within the analysis engine uses both the post-processed, statically analyzed data 531 and information generated from the pre-processed raw data in the data lake to infer relationships and to see patterns in those relationships. As with the static analysis step, the algorithms used for the dynamic analysis 533 are aimed at specific targets. The targets for dynamic analysis do not merely use static data, such as cross-device usage, success rates, or failure rates; dynamic analysis uses these usage and rate statistics to compare particular content items and characteristics.
For example, as shown in FIG. 32, dynamic analysis may detect the relative performance of content items. When dynamic analysis is performed on the growing amount of data accumulated over time, it can yield an increasingly deep understanding of why a particular content item works better than others. The results of this dynamic analysis may be information about sentence structure, the type of data within the content item, how well the voice assistant handles particular words, and other factors.
The dynamic analysis of the analytics data includes data collected at the voice application level and at the content item level. The data may include, for example: overall content item success 537 and failure 538 rates, content item success and failure rates when presented on a particular type of voice assistant device, comparisons of which content items are typically returned together in a feature content search, and identification of queries in feature server content searches that return common sets of results, among others.
The main difference in the collection of analytics data between static analysis and dynamic analysis is that static analysis uses only data within the context of a particular voice application and feature. This limitation arises because the result of static analysis is data that applies only to a particular application and its own features and content items. In contrast, dynamic analysis may use the raw data derived from the execution of all voice applications of all platform participants at once. Thus, a given platform participant may benefit from a dynamic analysis of all content items of all voice applications of all platform participants and may receive intelligent suggestions that enable the platform participant to provide effective content items to end users.
For example, the dynamic analysis and machine learning performed by the analysis engine of the platform may classify 539 analytics data for four voice applications of four different platform participants. Assume that all of the voice applications use the survey feature, regardless of which module is the source of the feature. Within the survey feature content, each voice application asks a similar question, such as "How big is the student population at Hamilton College?" Assume that the question has a set of acceptable answers, such as 1878, 1800, about 1800, and one thousand eight hundred.
In this example, static analysis will collect information about how many responses were successful and about the types of voice assistant devices or voice assistants 540 that produced the successes and failures. For example, certain types of voice assistants, such as Siri, may have a much higher failure rate than other voice assistants. The analysis engine may also collect information about the incorrect answers that were provided. During the dynamic analysis of these statistics, the analysis engine may detect a large number of failed Siri responses, most of which are "eighteen hundred". This may suggest that the speech processing of a particular type of voice assistant device or voice assistant performs worse than other types. The end user may actually have said "one thousand eight hundred", but Siri interpreted this speech as "eighteen hundred". Dynamic analysis may track the types of words that some voice assistants interpret less accurately than other voice assistants do and store this information in a structured database, as static analysis does. In this example, the machine learning algorithm 534 will record that "one thousand eight hundred" is a difficult phrase for Siri to process correctly. With this knowledge, the analytics engine can provide intelligent suggestions to platform participants. Because the analytics engine can use usage data from all four applications from different platform participants, it can store and provide the processed information to all four platform participants without requiring any platform participant to have access to another's private information used for training the machine learning and producing intelligent suggestions.
Intelligent advice
Intelligent suggestions are suggestions, derived from data generated by the machine learning and dynamic analysis stages of the analysis process, that are provided to platform participants about how to structure, represent, or otherwise alter content items to achieve an effective voice experience for end users who use the platform participants' voice applications and message elements on one or more types of voice assistant devices. These suggestions may include: rephrasing sentences, removing words, adding word variants, removing variants, or updating slot values, among others.
Suggestions are generated by sending an HTTP request to the CMS API to request suggestions when a content item is being updated. The CMS API checks the database for up-to-date information, such as success and failure rates for certain words with respect to certain voice assistants or voice assistant devices, and returns a set of suggestions, if any. The CMS client then presents these suggestions to the platform user through the user interface of the platform, enabling the platform user to change the wording based on the suggestions or to ignore them.
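The round trip just described could be sketched as shown below; the endpoint URL, payload, and suggestion shape are hypothetical placeholders rather than the platform's actual API.

```typescript
// Sketch: when a content item is being updated, the CMS client asks the CMS API
// for suggestions and presents whatever comes back in the user interface.
interface Suggestion {
  message: string;          // e.g. "Siri may mis-hear this number; consider adding a variant."
  suggestedVariant?: string;
}

async function requestSuggestions(contentItemText: string): Promise<Suggestion[]> {
  const response = await fetch("https://cms.example.com/api/suggestions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: contentItemText }),
  });
  if (!response.ok) return [];
  return response.json();
}
// The platform user can then apply or ignore each returned suggestion in the UI.
```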
Continuing the example above, in which machine learning and dynamic analysis have detected and tracked that Siri has difficulty with certain kinds of numbers, such as "one thousand eight hundred," suppose a platform participant creates a new survey question asking when the Declaration of Independence was signed, where the accepted answers are 1776 and "seventeen seventy-six." After the participant user enters the content items representing these answers, the CMS will request suggestions for these content items. Because the analysis engine knows that Siri is likely to interpret "seventeen seventy-six" incorrectly, for example as "seventeen hundred and seventy-six," it will suggest that the platform participant add that phrase as another answer variant, explaining that Siri may incorrectly interpret certain numbers and that adding the variant will help ensure that end users of an Apple HomePod have a better voice interaction experience. As shown in FIG. 33, such phrasing may be presented in the user interface for these intelligent suggestions 631.
Intelligent suggestions can be used for any type of feature or content item, as dynamic analysis can track data across features and within the context of a particular feature to provide the best intelligent suggestion.
Another type of intelligent suggestion, besides suggestions about content items and suggestions about features, is a recommendation to add a particular feature to a voice application. Such intelligent suggestions can be derived by tracking which features, when added to similar voice applications, are associated with greater success or greater usage of those voice applications. For example, by knowing which features are used the most and most successfully by voice applications in the same industry, dynamic analysis can track data about those features and modules and suggest to platform participants that they add them.
For example, if there are two voice applications in the higher education industry and one voice application experiences more usage and a higher success rate after adding the survey feature, the dynamic analysis may detect that the feature is a cause of the first voice application's greater success and suggest adding a similar feature to the second application, explaining that other platform participants in the same industry experience greater success when the feature is included.
Data layer
The data layer defines the types of storage used by the analytics engine and how those types of storage interact with the business logic or APIs and other parts of the application. The main storage includes: content databases, analytical data lakes, analytical structured databases, file and blob stores, content indexes and graph databases, and the like.
Each primary store is designed to be scalable using cloud technology so that it can replicate, keep data synchronized across regions of the world, and grow in size and throughput.
Content database
The content database is responsible for storing data related to managing content items owned by the platform. In some implementations, the database is a relational SQL style database that associates data about platform participants, voice applications, modules, features, content items, and other data.
The content database is updated through the CMS API using a connection from the CMS server and database. Requests made by the platform participants to the CMS through the user interface of the platform enable the platform participants to update content items.
The database may be implemented as a PostgreSQL database or any other SQL style database.
File and blob store
The file and blob stores may be implemented as traditional file stores in the cloud to enable scalable storage with security. The file and blob stores contain files uploaded by platform participants, such as audio recordings, video files, or images, or a combination of them. Each of these files is associated with a publicly accessible URL to enable voice assistant devices to access the files, e.g., to stream audio recordings and video files or to render images on voice assistant devices that support those formats.
When a platform participant uploads a file, the file data passes through the CMS API to the file and blob store. Once the upload is complete, the URL of the file is sent as a reply to the requesting client, and a reference to the URL of the file is stored in the content database. Platform participants may also use the CMS to remove and update files in storage through the user interface of the platform.
In some implementations, the file and blob stores may be implemented as Amazon Web Services S3 buckets.
Content indexing
The content index is a collection of flexible search indexes containing data from the content items in the content database. The content index provides better-performing content searches for the feature servers. When a query is made to the index from a feature server, a set of best-matching results is returned. As described earlier, the flexible search index enables the addition of weights to certain characteristics of a given type of data being added to the index.
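A minimal sketch of weighted field matching of this kind is shown below; in practice a dedicated search index would be used, and the field weights and scoring here are assumptions for illustration only.

```typescript
// Each field of an indexed content item carries a weight; results are ranked by
// the weighted number of query terms they contain.
type WeightedFields = Record<string, number>;

const eventFieldWeights: WeightedFields = { eventName: 3, eventSummary: 2, eventDetails: 1 };

function scoreItem(item: Record<string, string>, queryTerms: string[], weights: WeightedFields): number {
  let score = 0;
  for (const [field, weight] of Object.entries(weights)) {
    const text = (item[field] ?? "").toLowerCase();
    for (const term of queryTerms) {
      if (text.includes(term.toLowerCase())) score += weight;
    }
  }
  return score;
}

function search(items: Record<string, string>[], query: string, weights: WeightedFields) {
  const terms = query.split(/\s+/).filter(Boolean);
  return items
    .map(item => ({ item, score: scoreItem(item, terms, weights) }))
    .filter(r => r.score > 0)
    .sort((a, b) => b.score - a.score);
}
```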
Content items in the content index are updated by the CMS API as content items are added, updated, or deleted by platform participants.
Graph database
A graph database stores a graph of the relationships among features, the message elements of request messages, and message element slots. When a request message is received from a voice assistant device, the graph database is used during the graph traversal phase of the business logic layer. The graph may be traversed using the intents, the slots, and the edges connecting them to features to find the most appropriate feature for the message elements of the request message from the voice assistant device.
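One possible shape of such a traversal is sketched below; the node and edge model is an illustrative assumption rather than the platform's actual graph schema.

```typescript
// Breadth-first walk from the node for the request's message element type
// (intent) across slot and edge connections until a feature node is reached.
interface GraphNode {
  id: string;
  kind: "intent" | "slot" | "feature";
  edges: string[]; // ids of connected nodes
}

function findFeatureForIntent(graph: Map<string, GraphNode>, intentId: string): string | undefined {
  const visited = new Set<string>();
  const queue = [intentId];
  while (queue.length > 0) {
    const id = queue.shift()!;
    if (visited.has(id)) continue;
    visited.add(id);
    const node = graph.get(id);
    if (!node) continue;
    if (node.kind === "feature") return node.id; // closest matching feature
    queue.push(...node.edges);
  }
  return undefined; // no match: a fallback feature server may handle the request
}
```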
The graph database is updated by participant users who manage the relationships for new or updated message element types from providers such as Amazon, Google, Apple, and Microsoft.
Analytical data lake
The analytics data lake is a large data store for unstructured analytics data. Basic information is added to it based on request messages from the voice assistants and content searches from the feature servers. The static analysis and dynamic analysis phases and tasks take this large amount of data and structure it into smaller and more understandable units of information that are valuable to the analysis engine, such as usage and success/failure rates.
Analyzing structured databases
The analytics structured database is a SQL-style relational database that the CMS uses to show and provide structured analytics data and to store intelligent suggestion data. The database is updated by the data analysis stage after information is retrieved from the data lake and mapped to the table relationships that exist in the structured database.
Other implementations are within the scope of the following claims.

Claims (75)

1. An apparatus, comprising:
one or more processors for executing a program to perform,
a memory containing instructions executable by the processor to:
receiving, from the voice assistant device, requests according to the corresponding protocol representations of the one or more voice assistant frameworks, each request representing a voice input by the user to the corresponding voice assistant device,
re-representing the received request according to a common request protocol,
based on the received request, generating a response to the request expressed according to a common response protocol,
re-representing each of the responses according to the protocol of the framework with respect to which the corresponding request was represented, and
Sending the response to the voice assistant device for presentation to the user.
2. The apparatus of claim 1, wherein the request is represented according to corresponding protocols of two or more voice assistant frameworks.
3. The apparatus of claim 1, wherein the voice assistant framework comprises a framework of at least one of amazon, apple, google, microsoft, or a chat robot developer.
4. The apparatus of claim 1, wherein the generation of the response comprises traversing a graph using information from the request.
5. The apparatus of claim 4, wherein traversing the graph comprises identifying features to be used to implement the response.
6. The apparatus of claim 5, wherein the features are organized in modules.
7. The apparatus of claim 6, wherein at least one of the modules is predefined.
8. The apparatus of claim 6, wherein at least one of the modules is custom defined.
9. The apparatus of claim 6, wherein at least one of the modules comprises a set of predefined features having predefined content items tailored to a particular industry or organization.
10. The apparatus of claim 5, wherein the characteristic comprises information about a content item to be included in the response.
11. The device of claim 5, wherein the characteristic comprises information about a dynamic content item to be included in the response.
12. The apparatus of claim 10, wherein at least one of the content items is predefined.
13. The apparatus of claim 10, wherein at least one of the content items is custom defined.
14. The device of claim 1, wherein the generation of the response to the request comprises executing a voice application.
15. The device of claim 1, wherein the voice application comprises a set of functions that generate responses to requests spoken by a person.
16. The apparatus of claim 15, wherein the generated response comprises a verbal output.
17. The device of claim 16, wherein the generated response triggers other functions while providing the spoken output.
18. The device of claim 1, wherein the instructions are executable by the processor to:
receiving data regarding requests and corresponding responses for two or more frameworks, and
analyzing the received data to determine a comparative performance of the response for the framework.
19. The apparatus of claim 18, wherein the performance comprises performance of one or more features for implementing the response.
20. The device of claim 18, wherein the performance comprises performance of one or more content items included in the response.
21. The apparatus of claim 18, wherein the capabilities comprise capabilities of one or more voice applications.
22. The device of claim 1, wherein the instructions are executable by the processor to:
presenting, at a user interface of a voice application platform, features for selection and management of content items to be included in the response,
presenting, via the user interface, information about the relative performance of individual content items associated with characteristics of the content items in real time as the content items are being selected or managed,
receiving information on the selected or managed content items through the user interface, and
the voice application is executed to generate a response including the selected and managed content items.
23. The apparatus of claim 22, wherein the user interface is configured to enable non-technically trained personnel to select or manage a content item and provide and receive information about the content item.
24. The device of claim 1, wherein the instructions are executable by the processor to:
enabling selection of a content item to be included in a given one of the responses from the selectable possible content items, the selection of the content item to be included in the given response being based on a context of the end user's speech input.
25. The apparatus of claim 24, comprising including in the given response a content item that matches a context of an end user who uttered the request.
26. The apparatus of claim 25, wherein the context of the end user's voice input comprises a geographic location of the voice assistant device to which the response is to be sent.
27. The apparatus of claim 25, wherein the context of the end user's voice input comprises a demographic characteristic of the end user.
28. The device of claim 1, wherein the instructions are executable by the processor to:
a user interface is presented to the user for presentation,
configuring the user interface to enable creation of a voice application for processing the request and for generating the corresponding response,
modules that maintain characteristics with which the request can be matched to generate a response, including standard modules and custom modules,
in each of said modules, a set of features corresponding to a context in which said response is to be presented to said end user is included, and
and displaying the module through the user interface.
29. The device of claim 1, wherein the instructions are executable by the processor to:
presenting, at a user interface of a voice application platform, features that enable selection and management of content items to be included in the response, each of the content items requiring the voice assistant device to have corresponding content presentation capabilities,
during the selection and management of the content item, performance information regarding voice assistant devices that conform to respective different voice assistant frameworks is presented simultaneously through the user interface to present the content item being selected and managed.
30. The apparatus of claim 29, wherein the voice application platform guides non-technically trained users regarding the capabilities of the voice assistant frameworks and how they will present images, audio, video, and other forms of media.
31. A computer-implemented method, comprising:
receiving, over a communication network, a request for an end-user's voice-based service from a voice assistant device that conforms to one or more different voice assistant frameworks, the end-user's voice representing an intent,
traversing a graph of nodes and edges using data derived from requests for the service to arrive at matching characteristics for respective requests for the service,
executing the feature to generate a response, and
Sending the responses to the voice assistant device over the communication network such that they are responsive to the respective end users.
32. The method of claim 31, wherein the voice assistant device from which the request is received conforms to two or more different voice assistant frameworks.
33. The method of claim 31, comprising deriving data from a request for service by abstracting information in the request into a data format that is common across two or more different voice assistant frameworks.
34. The method of claim 31, comprising updating nodes of the graph using an output of a machine learning algorithm.
35. The method of claim 31, wherein the information about the request is used to identify an initial node of the graph at which to begin the traversal.
36. The method of claim 35, comprising automatically adding a node to the graph to serve as the initial node of the graph at which the traversal is initiated with respect to a request to conform to an additional voice assistant framework.
37. A computer-implemented method, comprising:
receiving, over a communication network, a request for an end-user's voice-based service from a voice assistant device that conforms to one or more different voice assistant frameworks, the end-user's voice representing an intent,
determining responses to the received requests, the responses configured to be sent over the communication network to the voice assistant apparatus to cause them to be responsive to respective end users,
evaluating a measure of success of the determination of the response,
based on the relative measure of success of the responses, the user is enabled to manage subsequent responses to the request for service through the user interface.
38. The method of claim 37, wherein the voice assistant device from which the request is received conforms to two or more different voice assistant frameworks.
39. The method of claim 37, comprising:
based on the assessed measure of success, presenting a suggested response to the user through the user interface and enabling the user to select a response to be sent to the voice assistant device based on the suggested response.
40. The method of claim 37, wherein evaluating a measure of success comprises evaluating success of a content item carried by the response across two or more different voice assistant frameworks.
41. The method of claim 37 wherein evaluating a measure of success comprises evaluating the success of the response relative to each voice assistant framework of the voice assistant apparatus to which the response is to be sent.
42. The method of claim 37, wherein evaluating a measure of success comprises evaluating success of a response relative to two or more different voice applications configured to receive the request and determine the response.
43. The method of claim 37, comprising: based on the measure of success, the content items to be carried in subsequent responses are managed.
44. A computer-implemented method, comprising:
presenting, at a user interface of a voice application platform, a feature that enables selection and management of content items to be included in responses to be provided by a voice application to voice assistant devices that conform to one or more different voice assistant frameworks,
presenting, via the user interface, information about the relative performance of individual content items associated with characteristics of the content items in real-time with the content items being selected and managed,
receiving information on the selected or managed content items through the user interface, and
executing a voice application to generate the response including the selected and managed content items.
45. The method of claim 44, comprising
Accumulating usage data from the voice assistant device that conforms to two or more different voice assistant frameworks, and
Information about the relative performance of the respective content items is generated from the accumulated usage data.
46. The method of claim 45, wherein the usage data is accumulated through a generic API.
47. The method of claim 45, wherein the information about the relative performance is generated by a machine learning algorithm.
48. A computer-implemented method, comprising:
receiving, over a communication network, a request for an end-user's voice-based service from a voice assistant device that conforms to one or more different voice assistant frameworks, the end-user's voice representing an intent,
determining responses to the received requests, the responses being configured to be sent to the voice assistant apparatus over the communications network to cause them to be responsive to respective end users, the responses including content items,
the content items included in a given one of the responses are selected from the selectable possible content items, the selection of the content items to be included in the given response being based on the context of the indicated intent of the end user.
49. The method of claim 48 wherein the voice assistant device from which the request is received conforms to two or more different voice assistant frameworks.
50. The method of claim 48 wherein one of the voice assistant frameworks comprises a chat robot framework.
51. The method of claim 48 wherein the context of the end user's expressed intent includes a geographic location of the voice assistant device to which the response is to be sent.
52. A method as defined in claim 48, wherein the context of the end user's represented intent includes demographic characteristics of the end user.
53. The method of claim 52, wherein the demographic characteristic comprises age.
54. The method of claim 52 wherein the demographic characteristics include linguistic characteristics inferred from a geographic location of the voice assistant device to which a response is to be sent or inferred from characteristics of words included in the received request.
55. The method of claim 54, wherein the linguistic characteristics include accents.
56. The method of claim 54, wherein the linguistic characteristics include a local spoken language or a local reference.
57. A method as defined in claim 52, wherein the demographic characteristic comprises gender.
58. The method of claim 52 comprising end-user preferences based on which content items to be included in the given response can be selected.
59. A computer-implemented method, comprising:
a user interface for development of a voice application is presented,
configuring the user interface to enable creation of a voice application for processing requests received from a voice assistant device and for generating corresponding responses for the voice assistant device to present to an end user,
a module that maintains characteristics with which the request can be matched to generate the response,
in each of the modules, including a set of features corresponding to a context in which the response is to be presented to the end user,
the maintenance of the modules includes (a) maintaining a standard module for a corresponding context, and (b) enabling the generation and maintenance of a customization module that enables features with which the request may be matched to generate a customized response for the voice assistant device, and
and displaying the module through the user interface.
60. The method of claim 59, comprising
Maintaining a content item to be used with the feature when generating the response,
the maintaining of the content items includes (a) maintaining standard content items, and (b) enabling the generation and maintenance of customized content items to be used with the features to generate customized responses for the voice assistant apparatus.
61. The method of claim 59, wherein the context relates to a product or service in a defined market segment.
62. The method of claim 59, wherein the context relates to demographics of a target population of end users.
63. The method of claim 59 wherein the context relates to performance of the voice assistant apparatus.
64. The method of claim 59, wherein the context relates to a type of content item to be used with the feature in generating the response.
65. A computer-implemented method, comprising:
a user interface for development of a voice application is presented,
configuring the user interface to enable creation of a voice application for processing requests received from a voice assistant device and for generating corresponding responses for the voice assistant device to present to an end user,
determining responses to the received requests, the responses being configured to be sent to the voice assistant apparatus over the communications network to cause them to be responsive to respective end users, the responses including content items,
the user interface enables creation and editing of content items in a rich media format for inclusion in the response.
66. The method of claim 65, wherein the rich media formats include image, audio, and video formats.
67. The method of claim 65, wherein the user interface is presented by a platform that enables creation of a voice application, and the platform enables recording and editing of the content item directly within the platform through the user interface.
68. A computer-implemented method, comprising:
presenting, at a user interface of a voice application platform, a feature that enables selection and management of content items to be included in a response to be provided by a voice application to a voice assistant device that conforms to one or more different voice assistant frameworks, each of the content items requiring the voice assistant device to have a corresponding content presentation capability,
simultaneously presenting, via the user interface, information regarding capabilities of voice assistant devices that conform to the respective different voice assistant frameworks while the content item is being selected and managed to present the content item being selected and managed.
69. The method of claim 68 wherein the voice assistant apparatus to which the response is to be provided conforms to two or more different voice assistant frameworks.
70. The method of claim 68 wherein the content presentation capabilities comprise capabilities of hardware and software of the voice assistant device.
71. The method of claim 68, wherein the content presentation performance relates to a type of content item.
72. The method of claim 68, wherein the types of content items include text, images, audio, and video.
73. A computer-implemented method, comprising:
a user interface for development of a voice application is presented,
configuring the user interface to enable creation of a voice application for processing requests received from a voice assistant device and for generating corresponding responses for the voice assistant device to present to an end user,
determining responses to the received requests, the responses being configured to be sent to the voice assistant apparatus over a communications network to cause them to be responsive to respective end users, the responses including content items expressed in natural language,
the user interface enables a user to select and manage the presentation of one or more content items in any of two or more natural languages.
74. The method of claim 73, wherein the user interface is presented in any one of two or more different natural languages.
75. The method of claim 73, comprising representing each content item according to a data model, and wherein the representation of each content item inherits objects that include characteristics of a natural language for the content item.
CN201980049296.7A 2018-06-05 2019-06-03 Voice application platform Pending CN112470216A (en)

Applications Claiming Priority (11)

Application Number Priority Date Filing Date Title
US16/000,798 US10235999B1 (en) 2018-06-05 2018-06-05 Voice application platform
US16/000,799 2018-06-05
US16/000,805 2018-06-05
US16/000,789 2018-06-05
US16/000,798 2018-06-05
US16/000,789 US10803865B2 (en) 2018-06-05 2018-06-05 Voice application platform
US16/000,805 US11437029B2 (en) 2018-06-05 2018-06-05 Voice application platform
US16/000,799 US10636425B2 (en) 2018-06-05 2018-06-05 Voice application platform
US16/353,977 US10943589B2 (en) 2018-06-05 2019-03-14 Voice application platform
US16/353,977 2019-03-14
PCT/US2019/035125 WO2019236444A1 (en) 2018-06-05 2019-06-03 Voice application platform

Publications (1)

Publication Number Publication Date
CN112470216A true CN112470216A (en) 2021-03-09

Family

ID=68769419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980049296.7A Pending CN112470216A (en) 2018-06-05 2019-06-03 Voice application platform

Country Status (4)

Country Link
EP (1) EP3803856A4 (en)
CN (1) CN112470216A (en)
CA (1) CA3102093A1 (en)
WO (1) WO2019236444A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IT202100012548A1 (en) * 2021-05-14 2022-11-14 Hitbytes Srl Method for building cross-platform voice applications
CN116893864B (en) * 2023-07-17 2024-02-13 无锡车联天下信息技术有限公司 Method and device for realizing voice assistant of intelligent cabin and electronic equipment


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150066817A1 (en) * 2013-08-27 2015-03-05 Persais, Llc System and method for virtual assistants with shared capabilities
US9886955B1 (en) * 2016-06-29 2018-02-06 EMC IP Holding Company LLC Artificial intelligence for infrastructure management
US10783883B2 (en) * 2016-11-03 2020-09-22 Google Llc Focus session at a voice interface device
US10235999B1 (en) * 2018-06-05 2019-03-19 Voicify, LLC Voice application platform

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050261907A1 (en) * 1999-04-12 2005-11-24 Ben Franklin Patent Holding Llc Voice integration platform
US20070033005A1 (en) * 2005-08-05 2007-02-08 Voicebox Technologies, Inc. Systems and methods for responding to natural language speech utterance
US20080091406A1 (en) * 2006-10-16 2008-04-17 Voicebox Technologies, Inc. System and method for a cooperative conversational voice user interface
KR20080112771A (en) * 2007-06-22 2008-12-26 주식회사 엘지씨엔에스 Voice portal service system and voice portal service method
CN102845128A (en) * 2010-04-28 2012-12-26 惠普发展公司,有限责任合伙企业 Techniques to provide integrated voice service management
CN103067443A (en) * 2011-10-18 2013-04-24 通用汽车环球科技运作有限责任公司 Speech-based interface service identification and enablement for connecting mobile devices
US20150235640A1 (en) * 2014-02-19 2015-08-20 Honeywell International Inc. Methods and systems for integration of speech into systems
CN107112016A (en) * 2015-01-05 2017-08-29 谷歌公司 Multi-modal cycle of states
US20160259779A1 (en) * 2015-03-06 2016-09-08 Nuance Communications, Inc. Evidence-Based Natural Language Input Recognition
US20180040324A1 (en) * 2016-08-05 2018-02-08 Sonos, Inc. Multiple Voice Services
CN107277153A (en) * 2017-06-30 2017-10-20 百度在线网络技术(北京)有限公司 Method, device and server for providing voice service

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YUJI KINOSHITA: "Spoken dialog strategy based on understanding graph search", 2009 IEEE International Conference on Acoustics, Speech and Signal Processing *
LUO, Congyou: "Research and Development of a Real-Time Voice Communication System Based on the SIP Protocol", China Master's Theses Full-text Database *

Also Published As

Publication number Publication date
CA3102093A1 (en) 2019-12-12
EP3803856A4 (en) 2021-07-21
WO2019236444A1 (en) 2019-12-12
EP3803856A1 (en) 2021-04-14

Similar Documents

Publication Publication Date Title
US11790904B2 (en) Voice application platform
US11615791B2 (en) Voice application platform
US11887597B2 (en) Voice application platform
US11437029B2 (en) Voice application platform
US11775254B2 (en) Analyzing graphical user interfaces to facilitate automatic interaction
US11769064B2 (en) Onboarding of entity data
US11107470B2 (en) Platform selection for performing requested actions in audio-based computing environments
US20230352017A1 (en) Platform selection for performing requested actions in audio-based computing environments
CN112470216A (en) Voice application platform
US11416229B2 (en) Debugging applications for delivery via an application delivery server
US11385990B2 (en) Debugging applications for delivery via an application delivery server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination