CN116897353A - Text editing using voice and gesture input for auxiliary systems - Google Patents


Info

Publication number
CN116897353A
Authority
CN
China
Prior art keywords
user
particular embodiments
message
text message
user interface
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280019144.4A
Other languages
Chinese (zh)
Inventor
加布里埃尔·凯瑟琳·莫斯基
克里斯托弗·E·巴尔梅斯
贾斯汀·丹尼
甘鑫
伊拉娜·奥利·沙洛维茨
普一鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Meta Platforms Inc
Original Assignee
Meta Platforms Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/407,922 (published as US20220284904A1)
Application filed by Meta Platforms Inc
Priority claimed from PCT/US2022/018697 (published as WO2022187480A1)
Publication of CN116897353A


Abstract

In one embodiment, a method includes: presenting, through a user interface of the client system, a text message including a plurality of n-grams based on a user utterance received at the client system; receiving, at a client system, a first user request to edit a text message; presenting, through a user interface, a text message visually divided into a plurality of blocks, wherein each block includes one or more of a plurality of n-grams of the text message, and the plurality of n-grams in each block are consecutive with respect to each other and grouped within the block based on analysis of the text message by a Natural Language Understanding (NLU) module; receiving, at the client system, a second user request to edit one or more of the plurality of blocks; and presenting, via the user interface, the edited text message generated based on the second user request.

Description

Text editing using voice and gesture input for auxiliary systems
Technical Field
The present disclosure relates generally to database and file management within a network environment, and more particularly to hardware and software for intelligent assistance systems.
Background
The auxiliary system may provide information or services on behalf of the user based on a combination of user input, location awareness, and the ability to access information (e.g., weather conditions, traffic congestion, news, stock prices, user schedules, retail prices, etc.) from various online resources. The user input may include text (e.g., online chat), voice, images, actions, or a combination thereof, particularly in an instant messaging application or other application. The auxiliary system may perform concierge-type services (e.g., booking dinner, purchasing event tickets, making travel arrangements) or provide information based on user input. The auxiliary system may also perform administrative or data processing tasks based on the online information and activities without user initiation or interaction. Examples of such tasks may include calendar management (e.g., sending a prompt for the user to defer a dinner date due to traffic conditions, updating both parties' calendars, and changing the restaurant reservation time). The auxiliary system may be implemented by a combination of a computing device, application programming interfaces (APIs), and a number of applications on the user device.
A social networking system, which may include a social networking website, may enable its users (e.g., individuals or organizations) to interact with the social networking system and with each other through the social networking system. The social networking system may utilize input from the user to create and store user profiles associated with the user in the social networking system. The user profile may include demographic information, communication channel information, and information about the user's personal interests. The social networking system may also utilize input from the user to create and store a record of the user's relationship with other users of the social networking system, as well as provide services (e.g., material/news feed posts, photo sharing, campaign organization, messaging, games, or advertisements) for facilitating social interactions between or among the users.
The social networking system may send content or messages related to its services to the user's mobile computing device or other computing device over one or more networks. The user may also install a software application on the user's mobile computing device or other computing device for accessing the user's profile and other data within the social-networking system. The social networking system may generate a personalized set of content objects (e.g., news feeds of comprehensive stories of other users with whom the user has a connection) for display to the user.
Disclosure of Invention
According to a first aspect of the present disclosure, there is provided a method comprising: by the client system: presenting, through a user interface of the client system, a text message based on a user utterance received at the client system, wherein the text message includes a plurality of n-grams; receiving, at the client system, a first user request to edit the text message; presenting, through the user interface, a text message visually divided into a plurality of blocks, wherein each block includes one or more of a plurality of n-grams of the text message, and wherein the plurality of n-grams in each block are consecutive with respect to each other and grouped within the block based on analysis of the text message by a Natural Language Understanding (NLU) module; receiving, at the client system, a second user request to edit one or more of the plurality of blocks; and presenting the edited text message via the user interface, wherein the edited text message is generated based on the second user request.
In some embodiments, the method further comprises: a prompt is presented through the user interface for entering a second user request, wherein the second user request includes information for editing the one or more blocks.
In some embodiments, each of the plurality of blocks is visually partitioned using one or more of: geometry, color, or identifier.
In some embodiments, one or more of the first user request or the second user request is based on one or more of: voice input, gesture input, or gaze input.
In some embodiments, the first user request is based on a gesture input, and the method further comprises: presenting, via the user interface, a gesture-based menu comprising a plurality of selection options for editing the plurality of blocks, wherein the second user request comprises a selection, based on one or more gesture inputs, of one or more of the plurality of selection options corresponding to the one or more blocks.
In some embodiments, the second user request includes a gesture input intended to clear the text message, and wherein editing one or more of the plurality of blocks includes clearing an n-gram corresponding to the one or more blocks.
In some embodiments, the method further comprises: a determination is made by a gesture classifier that the gesture input is intended to clear the text message based on one or more attributes associated with the gesture input.
In some embodiments, the plurality of blocks are visually partitioned using a plurality of identifiers, respectively, and wherein the second user request includes one or more references to the one or more identifiers of one or more corresponding blocks.
In some embodiments, the plurality of identifiers includes one or more of: numbers, letters, or symbols.
In some embodiments, the second user request includes a voice input referencing the one or more blocks.
In some embodiments, the references to the one or more blocks in the second user request comprise ambiguous references, and wherein the method further comprises: disambiguating the ambiguous reference based on a speech similarity model.
In some embodiments, one or more of the first user request or the second user request comprises: a voice input from a first user of the client system, and wherein the method further comprises: detecting a second user proximate to the first user based on sensor signals acquired by one or more sensors of the client system; and determining that the first user request and the second user request are directed to the client system based on the one or more gaze inputs of the first user.
In some embodiments, the second user request includes one or more gaze inputs directed to the one or more blocks.
In some embodiments, the method further comprises: the text message is edited based on the second user request.
In some embodiments, editing the text message includes altering one or more of the one or more n-grams in each of one or more of the one or more blocks to one or more other n-grams, respectively.
In some embodiments, editing the text message includes adding one or more n-grams to each of one or more of the one or more blocks.
In some embodiments, editing the text message includes altering an order associated with a plurality of n-grams in each of one or more of the one or more blocks.
According to a second aspect of the present disclosure, there is provided one or more computer-readable non-transitory storage media containing software that, when executed, is operable to: present, through a user interface of the client system, a text message based on a user utterance received at the client system, wherein the text message includes a plurality of n-grams; receive, at the client system, a first user request to edit the text message; present, through the user interface, a text message visually divided into a plurality of blocks, wherein each block includes one or more of a plurality of n-grams of the text message, and wherein the plurality of n-grams in each block are consecutive with respect to each other and are grouped within the block based on analysis of the text message by a Natural Language Understanding (NLU) module; receive, at the client system, a second user request to edit one or more of the plurality of blocks; and present the edited text message through the user interface, wherein the edited text message is generated based on the second user request.
According to a third aspect of the present disclosure, there is provided a system comprising: one or more processors; and a non-transitory memory coupled to the one or more processors and including instructions executable by the one or more processors, the one or more processors being operable when executing the instructions to: present, through a user interface of the client system, a text message based on a user utterance received at the client system, wherein the text message includes a plurality of n-grams; receive, at the client system, a first user request to edit the text message; present, through the user interface, a text message visually divided into a plurality of blocks, wherein each block includes one or more of a plurality of n-grams of the text message, and wherein the plurality of n-grams in each block are consecutive with respect to each other and are grouped within the block based on analysis of the text message by a Natural Language Understanding (NLU) module; receive, at the client system, a second user request to edit one or more of the plurality of blocks; and present the edited text message through the user interface, wherein the edited text message is generated based on the second user request.
In particular embodiments, the assistance system may assist the user in obtaining information or services. The assistance system may enable a user to interact with the assistance system through user input in various modalities (e.g., audio, speech, text, images, video, gestures, motion, location, orientation) in a stateful and multi-turn conversation to receive assistance from the assistance system. By way of example and not limitation, the auxiliary system may support single-modal input (e.g., voice-only input), multi-modal input (e.g., voice input and text input), hybrid/multi-modal input, or any combination thereof. The user input provided by the user may be associated with a particular assistance-related task and may include, for example, a user request (e.g., a verbal request for information or an action to perform), a user interaction with an assistance application associated with the assistance system (e.g., a User Interface (UI) element selected by touch or gesture), or any other type of suitable user input that may be detected and understood by the assistance system (e.g., a user movement detected by a user's client device). The auxiliary system may create and store such user profiles: the user profile includes personal information and contextual information associated with the user. In particular embodiments, the assistance system may analyze user input using Natural Language Understanding (NLU) techniques. The analysis may be based on user profiles of the users for more personalized and context-aware understanding. The auxiliary system may parse the entity associated with the user input based on the analysis. In particular embodiments, the auxiliary system may interact with different agents to obtain information or services associated with the parsed entities. The auxiliary system may generate responses for the user regarding these information or services by using Natural Language Generation (NLG). Through interaction with the user, the auxiliary system may use dialog management techniques to manage and forward dialog flows with the user. In particular embodiments, the assistance system may also assist the user in effectively and efficiently understanding the acquired information by summarizing the information. The assistance system may also assist the user in better participation in the online social network by providing tools that assist the user in interacting with the online social network (e.g., creating posts, comments, messages). Additionally, the assistance system may assist the user in managing different tasks, such as keeping track of events. In particular embodiments, the auxiliary system may actively perform tasks related to user interests and preferences at times related to the user based on the user profile without user input. In particular embodiments, the auxiliary system may check the privacy settings to ensure that access to user profiles or other user information is allowed and different tasks are performed depending on the user's privacy settings.
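By way of illustration only, the following minimal Python sketch walks through the stages described above (NLU analysis, entity/slot handling, agent execution, and response generation). All class names, function names, and rules in this sketch are assumptions made for the example; they are not components of the disclosed auxiliary system.

```python
# A self-contained, illustrative sketch of the assistance flow described
# above (NLU -> slots -> agent -> response). All names and rules here are
# assumptions for illustration, not the patent's implementation.
from dataclasses import dataclass

@dataclass
class UserProfile:
    user_id: str
    default_city: str = "Menlo Park"   # assumed personalization signal

def nlu(utterance: str, profile: UserProfile) -> dict:
    """Very small rule-based stand-in for NLU: classify intent and fill slots."""
    text = utterance.lower()
    if "weather" in text:
        return {"intent": "get_weather", "city": profile.default_city}
    if text.startswith("tell "):
        recipient, _, body = utterance[5:].partition(" ")
        return {"intent": "send_message", "recipient": recipient, "body": body}
    return {"intent": "unknown"}

def agent(frame: dict) -> str:
    """Toy agents standing in for first- or third-party services."""
    if frame["intent"] == "get_weather":
        return f"It is sunny in {frame['city']}."          # canned result
    if frame["intent"] == "send_message":
        return f"Message to {frame['recipient']} queued."
    return "Sorry, I did not understand that."

profile = UserProfile(user_id="u1")
print(agent(nlu("tell Kevin I'll be there in 10", profile)))
```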
In particular embodiments, the assistance system may assist the user through a hybrid architecture built on the client-side process and the server-side process. The client-side process and the server-side process may be two parallel workflows for processing user input and providing assistance to a user. In particular embodiments, the client-side process may be performed locally on a client system associated with the user. In contrast, the server-side process may be performed remotely on one or more computing systems. In particular embodiments, an arbiter on the client system may coordinate receiving user input (e.g., audio signals), determining whether to respond to the user input with a client-side process, a server-side process, or both, and analyzing the processing results from each process. The arbiter may instruct the client or server side agent to perform the task associated with the user input based on the foregoing analysis. The execution results may be further rendered as output by the client system. By utilizing both client-side and server-side processes, the auxiliary system can effectively help users optimize the use of computing resources while protecting user privacy and enhancing security.
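The arbiter's routing decision can be pictured with a small sketch. The policy below (keep privacy-sensitive or lightweight requests on the client, send heavier requests to the server) is a hypothetical example of such a decision, not the arbiter's actual logic.

```python
# Illustrative sketch of a client-side arbiter choosing between the
# client-side and server-side processes described above. The intent set
# and the policy are assumptions for illustration only.
ON_DEVICE_INTENTS = {"set_timer", "quick_clear", "dictation_edit"}

def arbitrate(intent: str, network_available: bool, contains_private_data: bool) -> str:
    if contains_private_data or not network_available:
        return "client"                      # keep data local / work offline
    if intent in ON_DEVICE_INTENTS:
        return "client"                      # cheap enough to run locally
    return "server"                          # fall back to server-side processing

print(arbitrate("dictation_edit", network_available=True, contains_private_data=False))   # client
print(arbitrate("book_restaurant", network_available=True, contains_private_data=False))  # server
```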
In particular embodiments, the assistance system may enable the user to edit a message using speech and gestures when a mouse or other fine-grained pointer is not available on the client system for selecting a word or text segment. In alternative embodiments, the assistance system may also enable the user to edit messages using voice and gestures in combination with normal pointer inputs. The auxiliary system may provide several functions for editing a message. The first function may be a quick-clear edit, in which the user may swipe after dictating the initial message, but before sending it, to clear the entire message. The auxiliary system may then prompt the user to enter a new message without the user having to speak the wake word again. The second function may be two-step voice editing. With two-step voice editing, the user may enter an initial message such as "tell Kevin I'll be there in 10" and then indicate a desired change by saying "I want to change it". The auxiliary system may then prompt the user to speak what they want to alter. For example, the user may say "change the time" or "change the time to 20". The auxiliary system may then look up a reference to "time" in the initial message and change it to "20". With one-step voice editing, the user can directly say "change the time to 20" without first telling the auxiliary system that he/she wants to edit the message, and the auxiliary system can automatically recognize the content to be changed. Similarly, with two-step voice editing, the user can say "change the time", the auxiliary system can respond with "What's the change?", and the user can say "change it to 20" or "change 10 to 20". The auxiliary system may also use n-gram editing or block editing to enable the user to edit the message: the message displayed on the client system is partitioned into blocks that are accessible by speech or gesture. The auxiliary system may intelligently divide the user's dictation into common phrases ("n-grams") and/or blocks, which may allow easier selection by voice or gesture. For example, if the user says "be there in 20" but wants to change it, the auxiliary system can split the message into two n-gram blocks, [be there] and [in 20]. The user may then use a gesture to select [in 20] and speak "in 30" to alter it while the microphone of the client system continues to listen to the user. Instead of n-gram editing or block editing, the auxiliary system may overlay a sequence of digits on the words in the user's dictation upon receiving a request from the user to alter it. Thus, the user can easily reference individual words in order to alter them. In connection with the editing methods described above, the auxiliary system may use gaze as an additional signal to determine when a user wants to enter text and/or edit the entered text. Thus, the auxiliary system may have the technical advantage of improving the user experience of editing dictated text, as the auxiliary system may provide various functions that enable a user to conveniently edit text.
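A minimal sketch of the n-gram/block editing idea follows: the dictated message is split into short blocks that are easy to target by voice or gesture, and a spoken "change ... to ..." command replaces one block. The fixed word-count heuristic and the command grammar are assumptions for illustration; the disclosure describes grouping the n-grams with an NLU module.

```python
# Illustrative sketch only: split a dictated message into short blocks and
# apply a spoken replacement command. Splitting heuristic and command
# grammar are assumptions, not the disclosed NLU-based grouping.
import re

def split_into_blocks(message: str, max_words: int = 2) -> list[str]:
    words = message.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def apply_voice_edit(blocks: list[str], command: str) -> list[str]:
    m = re.fullmatch(r"change (.+) to (.+)", command.strip(), flags=re.IGNORECASE)
    if not m:
        return blocks                         # not an edit command; leave blocks unchanged
    old, new = m.group(1), m.group(2)
    return [new if b.lower() == old.lower() else b for b in blocks]

blocks = split_into_blocks("be there in 20")          # ['be there', 'in 20']
blocks = apply_voice_edit(blocks, "change in 20 to in 30")
print(" ".join(blocks))                                # be there in 30
```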
Although this disclosure describes editing a particular message by any particular system in a particular manner, this disclosure contemplates editing any suitable message by any suitable system in any suitable manner.
In particular embodiments, a client system may present text messages based on user utterances received at the client system through a user interface of the client system. The text message may include a plurality of n-grams. The client system may then receive a first user request to edit the text message at the client system. In particular embodiments, a client system may present a text message visually divided into a plurality of blocks through a user interface. Each block may include one or more of a plurality of n-grams of the text message. In particular embodiments, the plurality of n-grams in each block may be contiguous with respect to each other and grouped within the block based on analysis of text messages by Natural Language Understanding (NLU) modules. The client system may then receive a second user request at the client system to edit one or more of the plurality of blocks. In particular embodiments, the client system may also present the edited text message via a user interface. The edited text message may be generated based on the second user request.
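The two-request flow above can be summarized as a small state machine. The sketch below is illustrative only: the state names, the whitespace-based block split, and the index-based targeting are assumptions standing in for the NLU-based grouping and the voice/gesture/gaze targeting described in the disclosure.

```python
# Sketch of the present -> first request -> blocks -> second request ->
# edited message flow as a tiny state machine (illustrative assumptions).
class EditSession:
    def __init__(self, transcript: str):
        self.transcript = transcript
        self.state = "PRESENTED"                 # message shown to the user
        self.blocks: list[str] = []

    def first_request(self) -> list[str]:
        """User asks to edit: show the message split into blocks."""
        self.blocks = self.transcript.split()    # stand-in for NLU grouping
        self.state = "EDITING"
        return self.blocks

    def second_request(self, index: int, replacement: str) -> str:
        """User targets one block (by number, gaze, or gesture) and replaces it."""
        assert self.state == "EDITING"
        self.blocks[index] = replacement
        self.transcript = " ".join(self.blocks)
        self.state = "PRESENTED"
        return self.transcript

session = EditSession("tell Kevin I'll be there in 10")
print(session.first_request())
print(session.second_request(index=6, replacement="20"))   # ... be there in 20
```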
For efficient text editing, there are certain technical challenges. One technical challenge may include efficiently and accurately locating a piece of text that a user wants to edit. The solution presented by the embodiments disclosed herein to address this challenge may be to use a combination of voice input, gesture input, gaze input, and visual indicators of blocks, as these different inputs may complement each other to increase the accuracy of determining which piece of text the user wants to edit, while the visual indicators may help the user to easily target this piece of text using the different inputs. Another technical challenge may include distinguishing between a user's voice interaction with the auxiliary system and another person's voice interaction. A solution presented by embodiments disclosed herein to address this challenge may be to use the user's gaze input because the user's voice input may be more likely to be directed to the auxiliary system when the user is speaking while looking at the auxiliary system (e.g., user interface). Another technical challenge may include disambiguating ambiguous references to a section of text in a user's speech input. The solution presented by the embodiments disclosed herein to address this challenge may be to disambiguate using a speech similarity model, as the model may determine a confidence score for the identified user input text, which may also be used to determine which piece of text the user wants to alter (e.g., the piece of text with the low confidence score).
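As a rough illustration of the disambiguation idea above, the sketch below scores each block by combining its similarity to the spoken reference with its (inverse) transcription confidence. Python's difflib is used only as a crude stand-in for a speech similarity model; the scoring formula is an assumption.

```python
# Illustrative disambiguation sketch: prefer the block that sounds like the
# spoken reference and whose transcription confidence is low.
from difflib import SequenceMatcher

def pick_edit_target(reference: str, blocks: list[str], confidences: list[float]) -> int:
    def score(i: int) -> float:
        similarity = SequenceMatcher(None, reference.lower(), blocks[i].lower()).ratio()
        return similarity + (1.0 - confidences[i])   # low ASR confidence boosts the score
    return max(range(len(blocks)), key=score)

blocks = ["tell Kelvin", "I'll be there", "in 10"]
confidences = [0.55, 0.97, 0.99]                     # "Kelvin" was a shaky transcription
print(pick_edit_target("Kevin", blocks, confidences))  # 0
```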
Certain embodiments disclosed herein may provide one or more technical advantages. Technical advantages of these embodiments may include improving the user experience of editing dictated text, as the auxiliary system may provide various functions that enable a user to conveniently edit text. One or more other technical advantages may be readily apparent to one skilled in the art from the figures, descriptions, and claims of the present disclosure.
The various embodiments disclosed herein are merely examples and the scope of the disclosure is not limited to these embodiments. A particular embodiment may include all, some, or none of the components, elements, features, functions, operations, or steps in the embodiments disclosed herein. In particular, embodiments according to the invention are disclosed in the appended claims directed to methods, storage media, systems and computer program products, wherein any feature mentioned in one claim category (e.g., methods) may also be claimed in another claim category (e.g., systems). The dependencies or return indications in the appended claims are chosen for form reasons only. However, any subject matter resulting from the intentional reference to any preceding claim (particularly to multiple dependencies) may also be claimed, such that multiple claims and any combination of features thereof are disclosed and may be claimed regardless of the dependencies selected in the appended claims. The claimed subject matter includes not only the various combinations of features recited in the appended claims, but also any other combinations of features in the claims, where each feature recited in a claim may be combined with any other feature or combination of features in the claim. Furthermore, any of the embodiments and features described or depicted herein may be claimed in separate claims and/or in any combination with any of the embodiments or features described or depicted herein or in any combination with any of the features in the appended claims.
It should be understood that any feature described herein as being suitable for incorporation into one or more aspects or embodiments of the present disclosure is intended to be generalized to any and all aspects and embodiments of the present disclosure. Other aspects of the disclosure will be understood by those skilled in the art from the description, claims, and drawings of the disclosure. The foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the claims.
Drawings
FIG. 1 illustrates an example network environment associated with an auxiliary system.
Fig. 2 shows an example architecture of an auxiliary system.
Fig. 3 shows an example flow chart of the auxiliary system.
FIG. 4 illustrates an example task-centric flow chart for processing user input.
FIG. 5A illustrates an example user interface displaying a dictated message.
FIG. 5B illustrates an example user interface displaying a user request to make a change based on a one-step correction.
FIG. 5C illustrates an example user interface displaying the edited message.
FIG. 5D illustrates an example user interface displaying a confirmation for sending the edited message.
FIG. 6A illustrates an example user interface displaying a dictated message.
FIG. 6B illustrates an example user interface displaying a user request to change the dictated message.
FIG. 6C illustrates an example user interface waiting for further dictation from the user.
FIG. 6D illustrates an example user interface displaying the new dictation.
FIG. 6E illustrates an example user interface displaying the transcribed new message.
FIG. 6F illustrates an example user interface displaying a confirmation for sending the edited message.
FIG. 7A illustrates an example user interface displaying a dictated message.
FIG. 7B illustrates an example user interface waiting for a new dictation from the user.
FIG. 7C illustrates an example user interface displaying the new dictation.
FIG. 7D illustrates an example user interface displaying the transcribed new message.
FIG. 7E illustrates an example user interface displaying a confirmation for sending the edited message.
FIG. 8A illustrates an example user interface displaying a dictated message.
FIG. 8B illustrates an example user interface displaying a gesture input targeting a portion of the message.
FIG. 8C illustrates an example user interface displaying the n-gram to be modified.
FIG. 8D illustrates an example user interface displaying a gesture input targeting the replacement.
FIG. 8E illustrates an example user interface displaying a confirmation of the selected replacement.
FIG. 8F illustrates an example user interface displaying the edited message.
FIG. 8G illustrates an example user interface displaying a selection to send the message.
FIG. 8H illustrates an example user interface displaying a confirmation that the message was sent.
FIG. 9A illustrates an example user interface displaying a dictated message.
FIG. 9B illustrates an example user interface displaying the message divided into blocks.
FIG. 9C illustrates an example user interface displaying a gaze input.
FIG. 9D illustrates an example user interface displaying edits to the blocks.
FIG. 9E illustrates an example user interface displaying a confirmation that the message was sent.
FIG. 10A illustrates an example user interface displaying a dictated message.
FIG. 10B illustrates an example user interface displaying a user request to make a change.
FIG. 10C illustrates an example user interface displaying blocks for editing.
FIG. 10D illustrates an example user interface displaying a selection of blocks.
FIG. 10E illustrates an example user interface displaying a confirmation of the selected block.
FIG. 10F illustrates an example user interface displaying editing of the selected block.
FIG. 10G illustrates an example user interface displaying a confirmation of the edited block.
FIG. 10H illustrates an example user interface displaying a selection to send the message.
FIG. 10I illustrates an example user interface displaying a confirmation that the message was sent.
FIG. 11A illustrates an example user interface displaying a dictated message.
FIG. 11B illustrates an example user interface displaying a user selection of a portion of the message for editing.
FIG. 11C illustrates an example user interface displaying the first word selected for editing.
FIG. 11D illustrates an example user interface displaying the last word selected for editing.
FIG. 11E illustrates an example user interface displaying an option to edit the selected words.
FIG. 11F illustrates an example user interface displaying an option to select voice input.
FIG. 11G illustrates an example user interface displaying a confirmation of editing with voice input.
FIG. 11H illustrates an example user interface displaying dictation from the user.
FIG. 11I illustrates an example user interface displaying the edited message.
FIG. 11J illustrates an example user interface displaying acceptance of the edits to the message.
FIG. 12A illustrates an example user interface displaying a dictated message.
FIG. 12B illustrates an example user interface displaying a user selection of a portion of the message for editing.
FIG. 12C illustrates an example user interface displaying the first word selected for editing.
FIG. 12D illustrates an example user interface displaying the last word selected for editing.
FIG. 12E illustrates an example user interface displaying an option to edit the selected words.
FIG. 12F illustrates an example user interface displaying an option to select voice input.
FIG. 12G illustrates an example user interface displaying a confirmation of editing with voice input.
FIG. 12H illustrates an example user interface displaying dictation from the user.
FIG. 12I illustrates an example user interface displaying the edited message.
Fig. 13 shows an example of editing a message by dividing the message and numbering the divisions.
Fig. 14A shows an example dictation of a message.
Fig. 14B shows an example of quick clear of a message.
Fig. 15A shows an example input message.
FIG. 15B illustrates an example n-gram overlay of identifiers on a smartphone.
FIG. 15C illustrates an example n-gram overlay of identifiers on a smart watch.
Fig. 15D illustrates an example n-gram overlay of identifiers on an intelligent network camera.
FIG. 16 illustrates an example method for efficient text editing.
FIG. 17 illustrates an example computer system.
Detailed Description
Overview of the System
FIG. 1 illustrates an example network environment 100 associated with an auxiliary system. Network environment 100 includes client system 130, auxiliary system 140, social-networking system 160, and third-party system 170 connected to each other by network 110. Although fig. 1 illustrates a particular arrangement of client system 130, auxiliary system 140, social-networking system 160, third-party system 170, and network 110, the present disclosure contemplates any suitable arrangement of client system 130, auxiliary system 140, social-networking system 160, third-party system 170, and network 110. By way of example and not limitation, two or more of client system 130, social-networking system 160, auxiliary system 140, and third-party system 170 may be directly connected to each other bypassing network 110. As another example, two or more of client system 130, auxiliary system 140, social-networking system 160, and third-party system 170 may be physically or logically co-located with each other, in whole or in part. Further, while FIG. 1 illustrates a particular number of client systems 130, auxiliary systems 140, social-networking systems 160, third-party systems 170, and networks 110, this disclosure contemplates any suitable number of client systems 130, auxiliary systems 140, social-networking systems 160, third-party systems 170, and networks 110. By way of example and not limitation, network environment 100 may include a plurality of client systems 130, a plurality of auxiliary systems 140, a plurality of social-networking systems 160, a plurality of third-party systems 170, and a plurality of networks 110.
This disclosure contemplates any suitable network 110. By way of example and not limitation, one or more portions of network 110 may include an ad hoc network (ad hoc network), an intranet, an extranet, a virtual private network (virtual private network, VPN), a local area network (local area network, LAN), a Wireless Local Area Network (WLAN), a wide area network (wide area network, WAN), a Wireless Wide Area Network (WWAN), a metropolitan area network (metropolitan area network, MAN), a portion of the internet, a portion of a public switched telephone network (Public Switched Telephone Network, PSTN), a cellular technology-based network, a satellite communication technology-based network, another network 110, or a combination of two or more of these networks.
Link 150 may connect client system 130, auxiliary system 140, social-networking system 160, and third-party system 170 to communication network 110 or connect client system 130, auxiliary system 140, social-networking system 160, and third-party system 170 to each other. This disclosure contemplates any suitable links 150. In particular embodiments, one or more links 150 include one or more wired (e.g., digital subscriber line (DSL) or data over cable service interface specification (DOCSIS)) links, one or more wireless (e.g., Wi-Fi or worldwide interoperability for microwave access (WiMAX)) links, or one or more optical (e.g., synchronous optical network (SONET) or synchronous digital hierarchy (SDH)) links. In particular embodiments, one or more links 150 each include an ad hoc network, an intranet, an extranet, VPN, LAN, WLAN, WAN, WWAN, MAN, a portion of the internet, a portion of the PSTN, a cellular technology based network, a satellite communication technology based network, another link 150, or a combination of two or more of these links 150. Links 150 need not be identical throughout network environment 100. In one or more aspects, the one or more first links 150 can be different from the one or more second links 150.
In particular embodiments, client system 130 may be any suitable electronic device that includes hardware, software, or embedded logic components, or a combination of two or more such components, and that is capable of performing the functions implemented or supported by client system 130. By way of example and not limitation, client system 130 may include a computer system such as a desktop, notebook or laptop computer, netbook, tablet computer, electronic book reader, Global Positioning System (GPS) device, camera, personal digital assistant (PDA), handheld electronic device, cellular telephone, smart phone, augmented reality (AR) smart glasses, virtual reality (VR) headset, other suitable electronic device, or any suitable combination thereof. In particular embodiments, client system 130 may be a smart auxiliary device. The present disclosure contemplates any suitable client systems 130. In particular embodiments, client system 130 may enable a network user at client system 130 to access network 110. The client system 130 may also enable the user to communicate with other users at other client systems 130.
In particular embodiments, client system 130 may include a web browser 132 and may have one or more add-on (add-on), plug-in, or other extensions. A user at client system 130 may enter a uniform resource locator (Uniform Resource Locator, URL) or other address to direct web browser 132 to a particular server (e.g., server 162, or a server associated with third party system 170), and web browser 132 may generate and transmit hypertext transfer protocol (Hyper Text Transfer Protocol, HTTP) requests to the server. The server may accept the HTTP request and transmit one or more hypertext markup language (Hyper Text Markup Language, HTML) files to the client system 130 in response to the HTTP request. Client system 130 may render a web page interface (e.g., a web page) for presentation to a user based on the HTML file from the server. The present disclosure contemplates any suitable source files. By way of example and not limitation, the web page interface may be rendered from an HTML file, an extensible hypertext markup language (Extensible Hyper Text Markup Language, XHTML) file, or an extensible markup language (Extensible Markup Language, XML) file, according to particular needs. Such interfaces may also execute scripts, combinations of markup languages and scripts, and the like. Herein, references to a web page interface include one or more corresponding source files (which a browser may use to render the web page interface), and vice versa, where appropriate.
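For context, the request/response exchange described above reduces to an ordinary HTTP GET. The short sketch below uses Python's standard library and a placeholder URL; it is illustrative only and not part of the disclosed browser implementation.

```python
# Minimal sketch of the HTTP request/response exchange described above
# (placeholder URL; the browser would then render the returned markup).
from urllib.request import urlopen

with urlopen("https://example.com/") as response:       # browser sends an HTTP GET
    html = response.read().decode("utf-8")               # server answers with HTML
print(html[:60])                                          # the client renders this markup
```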
In particular embodiments, client system 130 may include a social networking application 134 installed on client system 130. A user at client system 130 may use social networking application 134 to access an online social network. A user at client system 130 may use social networking application 134 to communicate with the user's social relationships (e.g., friends, attendees, account numbers of interest, contacts, etc.). A user at client system 130 may also interact with multiple content objects (e.g., posts, news articles, transient content, etc.) on an online social network using social networking application 134. By way of example and not limitation, a user may browse trending topics and breaking news using social network application 134.
In particular embodiments, client system 130 may include a secondary application 136. A user at client system 130 may interact with auxiliary system 140 using auxiliary application 136. In particular embodiments, secondary application 136 may include a secondary xbot function as a front-end interface for interacting with a user of client system 130 that includes receiving user input and providing output. In particular embodiments, secondary application 136 may comprise a stand-alone application. In particular embodiments, the secondary application 136 may be integrated into the social networking application 134 or another suitable application (e.g., a messaging application). In particular embodiments, secondary application 136 may also be integrated into client system 130, a secondary hardware device, or any other suitable hardware device. In particular embodiments, secondary application 136 may also be part of secondary system 140. In particular embodiments, secondary application 136 may be accessed through web browser 132. In particular embodiments, a user may interact with auxiliary system 140 by: user input is provided to the secondary application 136 through various modalities (e.g., audio, voice, text, visual, image, video, gesture, motion, activity, location, orientation). The auxiliary application 136 may communicate user input to the auxiliary system 140 (e.g., via an auxiliary xbot). The auxiliary system 140 may generate a response based on the user input. The secondary system 140 may send the generated response to the secondary application 136. The secondary application 136 may then present the response to the user at the client system 130 through various modalities (e.g., audio, text, images, and video). By way of example and not limitation, a user may interact with auxiliary system 140 by: user input (e.g., a verbal request for information about the current state of nearby vehicle traffic) is provided to the auxiliary xbot through the microphone of client system 130. The secondary application 136 may then communicate the user input to the secondary system 140 over the network 110. Accordingly, the assistance system 140 can analyze the user input, generate a response (e.g., vehicle traffic information obtained from a third party source) based on the analysis of the user input, and communicate the generated response back to the assistance application 136. The secondary application 136 may then present the generated response to the user in any suitable manner (e.g., display a text-based push notification and/or one or more images on a display of the client system 130 that show a local map of nearby vehicle traffic).
In particular embodiments, client system 130 may implement wake word detection techniques to allow a user to conveniently activate auxiliary system 140 using one or more wake words associated with auxiliary system 140. By way of example and not limitation, the system audio API on the client system 130 may continuously monitor for user input including audio data (e.g., frames of voice data) received at the client system 130. In this example, the wake word associated with the assist system 140 may be the speech phrase "hey assist". In this example, when the system audio API on the client system 130 detects the speech phrase "hey assist" in the monitored audio data, the assistance system 140 may be activated for subsequent interaction with the user. In alternative embodiments, similar detection techniques may be implemented to activate the auxiliary system 140 using specific non-audio user inputs associated with the auxiliary system 140. For example, the non-audio user input may be a particular visual signal detected by a low power sensor (e.g., camera) of the client system 130. By way of example and not limitation, the visual signal may be a static image (e.g., a bar code, a Quick Response (QR) code, a universal product code (universal product code, UPC)), a location of the user (e.g., a user's gaze on the client system 130), a user action (e.g., the user pointing at an object), or any other suitable visual signal.
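An illustrative monitoring loop for the wake-word behavior described above is sketched below. The frame source and the string-matching detector are stand-ins for a keyword-spotting model running on streaming audio, and the wake phrase shown is only an example.

```python
# Illustrative wake-word monitoring loop (assumed phrase and detector;
# a real system would run a keyword-spotting model on audio frames).
WAKE_PHRASE = "hey assistant"

def contains_wake_phrase(transcribed_frame: str) -> bool:
    return WAKE_PHRASE in transcribed_frame.lower()

def monitor(frames) -> str:
    for frame in frames:                      # continuous stream of (transcribed) audio frames
        if contains_wake_phrase(frame):
            return "ACTIVATED"                # hand off to the assistant for the next turn
    return "IDLE"

print(monitor(["music playing", "Hey Assistant, send a message"]))  # ACTIVATED
```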
In particular embodiments, client system 130 may include rendering device 137, and optionally companion device 138. Rendering device 137 may be configured to render output generated by auxiliary system 140 to a user. The companion device 138 may be configured to perform computations locally (i.e., on-device) under certain circumstances (e.g., when rendering device 137 is unable to perform the computations associated with a particular task, such as communication with auxiliary system 140). In particular embodiments, client system 130, rendering device 137, and/or companion device 138 may each be suitable electronic devices including: hardware, software, or embedded logic components, or a combination of two or more of these components, and are capable of independently or cooperatively performing the functions described herein as being implemented or supported by client system 130. By way of example and not limitation, client system 130, rendering device 137, and/or companion device 138 may each include a computer system, such as a desktop computer, notebook or laptop computer, netbook, tablet computer, e-book reader, GPS device, camera, personal digital assistant (PDA), handheld electronic device, cellular telephone, smart phone, smart speaker, virtual reality (VR) headset, augmented reality (AR) smart glasses, other suitable electronic device, or any suitable combination thereof. In particular embodiments, one or more of client system 130, rendering device 137, and companion device 138 may operate as intelligent auxiliary devices. By way of example and not limitation, rendering device 137 may include smart glasses and companion device 138 may include a smart phone. As another example and not by way of limitation, rendering device 137 may comprise a smart watch and companion device 138 may comprise a smart phone. As yet another example and not by way of limitation, rendering device 137 may include smart glasses and companion device 138 may include a smart remote control for the smart glasses. As yet another example and not by way of limitation, rendering device 137 may comprise a VR/AR headset and companion device 138 may comprise a smartphone.
In particular embodiments, the user may interact with auxiliary system 140 using rendering device 137 or companion device 138, alone or in combination. In particular embodiments, one or more of client system 130, rendering device 137, and companion device 138 may implement a multi-stage wake word detection model to enable a user to conveniently activate auxiliary system 140 by continuously monitoring one or more wake words associated with auxiliary system 140. In a first phase of the wake word detection model, rendering device 137 may receive audio user input (e.g., frames of speech data). If a wireless connection between rendering device 137 and companion device 138 is available, an application on rendering device 137 may communicate the received audio user input to a companion application on companion device 138 over the wireless connection. In the second phase of the wake word detection model, the companion application on companion device 138 may process the received audio user input to detect wake words associated with auxiliary system 140. The companion application on companion device 138 may then communicate the detected wake word over wireless network 110 to a server associated with secondary system 140. In a third stage of the wake word detection model, a server associated with the auxiliary system 140 may perform keyword verification on the detected wake word to verify whether the user wants to activate the auxiliary system 140 and receive assistance from the auxiliary system 140. In alternative embodiments, any of the processing, detecting, or keyword verification may be performed by rendering device 137 and/or companion device 138. In particular embodiments, when auxiliary system 140 has been activated by a user, an application on rendering device 137 may be configured to receive user input from the user, and an companion application on companion device 138 may be configured to process the user input (e.g., user request) received by the application on rendering device 137. In particular embodiments, rendering device 137 and companion device 138 may be associated (i.e., paired) with each other via one or more wireless communication protocols (e.g., bluetooth).
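The three stages of the wake-word detection model can be sketched as three small functions, one per device. The function names, the canned verification score, and the threshold are assumptions made for illustration.

```python
# Illustrative three-stage wake-word pipeline: rendering device captures
# audio, companion device detects the phrase, server verifies the keyword.
def stage1_capture(rendering_device_audio: str) -> str:
    return rendering_device_audio                        # e.g., smart glasses forward audio

def stage2_detect(audio: str) -> bool:
    return "hey assistant" in audio.lower()              # companion phone runs detection

def stage3_verify(audio: str, server_threshold: float = 0.8) -> bool:
    score = 0.9 if "hey assistant" in audio.lower() else 0.1   # canned server-side score
    return score >= server_threshold                     # keyword verification

audio = stage1_capture("Hey Assistant, edit my message")
if stage2_detect(audio) and stage3_verify(audio):
    print("assistant activated")
```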
The example workflow below illustrates how rendering device 137 and companion device 138 may handle user input provided by a user. In this example, an application on rendering device 137 may receive user input that includes a user request for rendering device 137. An application on rendering device 137 may then determine the state of the wireless connection (i.e., the network sharing (tethering) state) between rendering device 137 and companion device 138. If a wireless connection between rendering device 137 and companion device 138 is not available, an application on rendering device 137 may communicate a user request (optionally including additional data and/or contextual information available to rendering device 137) to auxiliary system 140 over network 110. The auxiliary system 140 may then generate a response to the user request and transmit the generated response back to the rendering device 137. Rendering device 137 may then present the response to the user in any suitable manner. Alternatively, if a wireless connection between rendering device 137 and companion device 138 is available, an application on rendering device 137 may communicate a user request (optionally including additional data and/or contextual information available to rendering device 137) to a companion application on companion device 138 over the wireless connection. The companion application on companion device 138 may then communicate the user request (optionally including additional data and/or contextual information available to companion device 138) to secondary system 140 over network 110. The auxiliary system 140 may then generate a response to the user request and transmit the generated response back to the companion device 138. The companion application on companion device 138 may then communicate the generated response to the application on rendering device 137. Rendering device 137 may then present the response to the user in any suitable manner. In the foregoing example workflow, rendering device 137 and companion device 138 may each perform one or more computations and/or processes in each respective step of the workflow. In particular embodiments, execution of the computations and/or processes disclosed herein may be adaptively switched between rendering device 137 and companion device 138 based at least in part on a device state of rendering device 137 and/or companion device 138, tasks associated with user inputs, and/or one or more additional factors. By way of example and not limitation, one factor may be the signal strength of the wireless connection between rendering device 137 and companion device 138. For example, if the signal strength of the wireless connection between rendering device 137 and companion device 138 is strong, the computations and processing may adaptively switch to be performed substantially by companion device 138, e.g., to benefit from the greater processing power of the Central Processing Unit (CPU) of companion device 138. Alternatively, if the signal strength of the wireless connection between rendering device 137 and companion device 138 is weak, the computation and processing may adaptively switch to be performed by rendering device 137 in a substantially independent manner. In particular embodiments, if client system 130 does not include companion device 138, the foregoing calculations and processing may be performed solely by rendering device 137 in an independent manner.
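The adaptive switching described above can be illustrated with a simple routing function. The signal-strength threshold and the route labels below are assumptions; the disclosure leaves the exact policy open.

```python
# Illustrative routing sketch: send the request via the companion device
# when a good wireless link exists, otherwise go directly to the assistant
# servers from the rendering device (threshold is an assumed value).
def route_request(companion_link_up: bool, signal_strength_dbm: int) -> str:
    if not companion_link_up:
        return "rendering_device -> assistant"            # no tethering: go direct
    if signal_strength_dbm > -60:                         # strong link: offload to phone CPU
        return "rendering_device -> companion -> assistant"
    return "rendering_device -> assistant"                # weak link: compute locally, go direct

print(route_request(companion_link_up=True, signal_strength_dbm=-50))
```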
In particular embodiments, the assistance system 140 may assist the user in performing various assistance-related tasks. Assistance system 140 may interact with social-networking system 160 and/or third-party system 170 in performing these assistance-related tasks.
In particular embodiments, social-networking system 160 may be a network-addressable computing system that may host an online social network. Social-networking system 160 may generate, store, receive, and send social-networking data, such as user profile data, concept profile data, social-graph information, or other suitable data related to an online social network. Social-networking system 160 may be accessed directly by other components of network environment 100 or through network 110. By way of example and not limitation, client system 130 may use web browser 132 or a local application associated with social-networking system 160 (e.g., a mobile social-networking application, a messaging application, another suitable application, or any combination thereof) to access social-networking system 160 directly or through network 110. In particular embodiments, social-networking system 160 may include one or more servers 162. Each server 162 may be a single server, or a distributed server across multiple computers or multiple data centers. By way of example and not limitation, each server 162 may be a web server, news server, mail server, message server, advertisement server, file server, application server, exchange server, database server, proxy server, another server adapted to perform the functions or processes described herein, or any combination thereof. In particular embodiments, each server 162 may include hardware, software, or embedded logic components, or a combination of two or more of these components, for performing the appropriate functions implemented or supported by server 162. In particular embodiments, social-networking system 160 may include one or more data stores 164. The data store 164 may be used to store various types of information. In particular embodiments, the information stored in data store 164 may be organized according to particular data structures. In particular embodiments, each data store 164 may be a relational database, a columnar database, an associative database, or other suitable database. Although this disclosure describes or illustrates a particular type of database, this disclosure contemplates any suitable type of database. Particular embodiments may provide such an interface: the interface enables client system 130, social-networking system 160, auxiliary system 140, or third-party system 170 to manage, retrieve, modify, add, or delete information stored in data store 164.
In particular embodiments, social-networking system 160 may store one or more social-graphs in one or more data stores 164. In particular embodiments, a social graph may include multiple nodes, which may include multiple user nodes (each corresponding to a particular user) or multiple concept nodes (each corresponding to a particular concept), and multiple edges connecting the nodes. Social-networking system 160 may provide users of the online social network with the ability to communicate and interact with other users. In particular embodiments, users may join an online social network via social-networking system 160, and may then add connections (e.g., relationships) with a number of other users in social-networking system 160 to which they want to connect. As used herein, the term "friend" may refer to any other user of social-networking system 160 with whom the user has formed a connection, association, or relationship via social-networking system 160.
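As a rough illustration of the social graph described above, the sketch below stores user and concept nodes with typed edges between them. The structure is a toy example, not the actual schema of social-networking system 160.

```python
# Toy social-graph sketch: user/concept nodes joined by typed edges
# (illustrative structure only).
from collections import defaultdict

class SocialGraph:
    def __init__(self):
        self.nodes = {}                                   # node_id -> {"type": ..., "name": ...}
        self.edges = defaultdict(set)                     # node_id -> {(edge_type, other_id)}

    def add_node(self, node_id, node_type, name):
        self.nodes[node_id] = {"type": node_type, "name": name}

    def add_edge(self, a, b, edge_type):
        self.edges[a].add((edge_type, b))                 # undirected: store both directions
        self.edges[b].add((edge_type, a))

g = SocialGraph()
g.add_node("u1", "user", "Alice")
g.add_node("u2", "user", "Bob")
g.add_edge("u1", "u2", "friend")
print(g.edges["u1"])    # {('friend', 'u2')}
```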
In particular embodiments, social-networking system 160 may provide users with the ability to take actions on various types of items or objects supported by social-networking system 160. By way of example and not limitation, these items and objects may include groups or social networks to which a user of social-networking system 160 may belong, activity or calendar entries to which the user may be interested, computer-based applications that the user may use, transactions that allow the user to purchase or sell items through a service, interactions with advertisements that the user may perform, or other suitable items or objects. The user may interact with anything as follows: the thing can be represented in social-networking system 160 or by an external system of third-party system 170 that is separate from social-networking system 160 and coupled to social-networking system 160 through network 110.
In particular embodiments, social-networking system 160 may be capable of linking various entities. By way of example and not limitation, social-networking system 160 may enable users to interact with each other and receive content from third-party system 170 or other entities, or allow users to interact with these entities through an Application Programming Interface (API) or other communication channel.
In particular embodiments, third party system 170 may include one or more types of servers, one or more data stores, one or more interfaces (including but not limited to APIs), one or more web services, one or more content sources, one or more networks, or any other suitable component with which, for example, a server may communicate. Third party system 170 may be operated by an entity different from the entity operating social-networking system 160. However, in particular embodiments, social-networking system 160 and third-party system 170 may operate in conjunction with each other to provide social-networking services to users of social-networking system 160 or third-party system 170. In this sense, social-networking system 160 may provide a platform or backbone that other systems (e.g., third-party systems 170) may use to provide social-networking services and functionality to users on the internet.
In particular embodiments, third party system 170 may include a third party content object provider. The third party content object provider may include one or more sources of content objects that may be delivered to the client system 130. By way of example and not limitation, the content object may include information related to things or activities of interest to the user, such as movie show times, movie reviews, restaurant menus, product information and reviews, or other suitable information. As another example and not by way of limitation, the content object may include an incentive content object, such as a coupon, gift certificate, or other suitable incentive object. In particular embodiments, a third party content provider may use one or more third party agents to provide content objects and/or services. The third party agent may be an implementation that is hosted and executed on the third party system 170.
In particular embodiments, social-networking system 160 also includes user-generated content objects that may enhance user interactions with social-networking system 160. User-generated content may include any content that a user may add, upload, send, or "post" to social-networking system 160. By way of example and not limitation, a user communicates a post from client system 130 to social-networking system 160. The post may include data such as status updates or other text data, location information, photos, videos, links, music, or other similar data or media. Content may also be added to social-networking system 160 by a third party via a "communication channel" (e.g., a news feed or stream).
In particular embodiments, social-networking system 160 may include various servers, subsystems, programs, modules, logs, and data stores. In particular embodiments, social-networking system 160 may include one or more of the following: a web server, an action logger, an API request server, a relevance and ranking engine, a content object classifier, a notification controller, an action log, a third-party content object exposure log, an inference module, an authorization/privacy server, a search module, an advertisement targeting module, a user interface module, a user profile store, a contact store, a third-party content store, or a location store. Social-networking system 160 may also include suitable components, such as a network interface, security mechanism, load balancer, failover server, management and network operations console, other suitable components, or any suitable combination thereof. In particular embodiments, social-networking system 160 may include one or more user profile stores for storing user profiles. A user profile may include, for example, biographical information, demographic information, behavioral information, social information, or other types of descriptive information (e.g., work experience, educational history, hobbies or preferences, interests, affinities, or location). The interest information may include interests associated with one or more categories. The categories may be general or specific. By way of example and not limitation, if a user "likes" an article about a brand of shoes, the category may be that brand, or may be the general category of "shoes" or "apparel". The contact store may be used to store contact information about users. The contact information may indicate users who have similar or common work experience, group memberships, hobbies, or educational history, or who are related or share common attributes in any other way. The contact information may also include user-defined contacts between different users and content (both internal and external). The web server may be used to link social-networking system 160 to one or more client systems 130 or one or more third-party systems 170 via network 110. The web server may include a mail server or other messaging functionality for receiving and routing messages between social-networking system 160 and one or more client systems 130. The API request server may allow, for example, auxiliary system 140 or third-party system 170 to access information from social-networking system 160 by invoking one or more APIs. The action logger may be used to receive communications from the web server regarding the user's actions on or off social-networking system 160. In conjunction with the action log, a third-party content object exposure log may be maintained of user exposures to third-party content objects. The notification controller may provide information about content objects to client system 130. The information may be pushed to client system 130 as a notification, or the information may be pulled from client system 130 in response to user input including a user request received from client system 130. The authorization server may be used to enforce one or more privacy settings of users of social-networking system 160. The privacy settings of a user may determine how particular information associated with the user may be shared.
The authorization server may allow users to opt in to or opt out of having their actions logged by social-networking system 160 or shared with other systems (e.g., third-party system 170), for example by setting appropriate privacy settings. The third-party content store may be used to store content objects received from third parties (e.g., third-party systems 170). The location store may be used to store location information received from client systems 130 and associated with users. The advertisement pricing module may combine social information, the current time, location information, or other suitable information to provide relevant advertisements to the user in the form of notifications.
Auxiliary system
Fig. 2 illustrates an example architecture 200 of the auxiliary system 140. In particular embodiments, the assistance system 140 may assist the user in obtaining information or services. The assistance system 140 may enable a user to interact with the assistance system 140 through user input of various modalities (e.g., audio, speech, text, visual, image, video, gestures, actions, activities, positions, orientations) in a stateful and multi-round conversation to receive assistance from the assistance system 140. By way of example and not limitation, the user input may include audio input (e.g., verbal commands) based on user speech, which may be processed by a system audio API (application programming interface) on the client system 130. The system audio API may perform techniques including echo cancellation, noise removal, beamforming, voice activation from a user, speaker recognition, voice activity detection (VAD), and/or any other suitable acoustic technique in order to generate audio data that is readily processed by the auxiliary system 140. In particular embodiments, auxiliary system 140 may support single-modality input (e.g., voice-only input), multi-modality input (e.g., voice input and text input), hybrid/multi-modality input, or any combination thereof. In particular embodiments, the user input may be user-generated input that is sent to auxiliary system 140 in a single round. The user input provided by the user may be associated with a particular assistance-related task and may include, for example, a user request (e.g., a verbal request for information or an action to perform), a user interaction with an assistance application 136 associated with assistance system 140 (e.g., selection of a UI element by touch or gesture), or any other type of suitable user input that may be detected and understood by assistance system 140 (e.g., user movement detected by the user's client system 130).
In particular embodiments, the auxiliary system 140 may create and store a user profile that includes personal information and contextual information associated with the user. In particular embodiments, auxiliary system 140 may analyze user input using Natural Language Understanding (NLU) techniques. The analysis may be based at least in part on the user profile of the user to provide a more personalized and context-aware understanding. The auxiliary system 140 may resolve the entities associated with the user input based on the analysis. In particular embodiments, the auxiliary system 140 may interact with different agents to obtain information or services associated with the resolved entities. The auxiliary system 140 may generate responses for the user regarding such information or services using Natural Language Generation (NLG). Through interaction with the user, the auxiliary system 140 may use dialog management techniques to manage and advance the dialog flow with the user. In particular embodiments, the assistance system 140 may also assist the user in effectively and efficiently understanding the acquired information by summarizing the information. The assistance system 140 may also help the user engage more with the online social network by providing tools that assist the user in interacting with the online social network (e.g., creating posts, comments, or messages). Additionally, the assistance system 140 may assist the user in managing different tasks, such as keeping track of events. In particular embodiments, auxiliary system 140 may, based on the user profile and without user input, proactively perform tasks relevant to the user's interests and preferences at times relevant to the user. In particular embodiments, the auxiliary system 140 may check privacy settings to ensure that access to the user profile or other user information is permitted and that the various tasks are performed subject to the user's privacy settings.
In particular embodiments, assistance system 140 may assist a user through an architecture built on client-side processes and server-side processes, which may operate in various modes of operation. In fig. 2, a client-side process is shown above dashed line 202, and a server-side process is shown below dashed line 202. The first mode of operation (i.e., on-device mode) may be a workflow in which the assistance system 140 processes user input and provides assistance to the user by executing client-side processes locally, either primarily or exclusively, on the client system 130. For example, if client system 130 is not connected to network 110 (i.e., when client system 130 is offline), auxiliary system 140 may process user input in the first mode of operation using only client-side processes. The second mode of operation (i.e., cloud mode) may be a workflow in which the assistance system 140 processes user input and provides assistance to the user by performing server-side processes primarily or exclusively on one or more remote servers (e.g., servers associated with the assistance system 140). As shown in fig. 2, the third mode of operation (i.e., the hybrid mode) may be a parallel workflow in which the assistance system 140 processes user input and provides assistance to the user by executing client-side processes locally on the client system 130, in conjunction with executing server-side processes on one or more remote servers (e.g., servers associated with the assistance system 140). For example, both the client system 130 and the server associated with the auxiliary system 140 may perform an automatic speech recognition (ASR) process and a Natural Language Understanding (NLU) process, but the client system 130 may delegate the dialog, agent, and Natural Language Generation (NLG) processes to the server associated with the auxiliary system 140.
In particular embodiments, selection of the operational mode may be based at least in part on a device state, a task associated with the user input, and/or one or more additional factors. By way of example and not limitation, as described above, one factor may be the network connection status of client system 130. For example, if client system 130 is not connected to network 110 (i.e., when client system 130 is offline), auxiliary system 140 may process user input in a first mode of operation (i.e., an on-device mode). As another example and not by way of limitation, another factor may be based on a measurement of the available battery power (i.e., battery status) of client system 130. For example, if the client system 130 needs to conserve battery power (e.g., when the client system 130 has minimal available battery power or the user has indicated a desire to conserve battery power of the client system 130), the auxiliary system 140 may process the user input in a second mode of operation (i.e., cloud mode) or a third mode of operation (i.e., hybrid mode) in order to perform fewer power-consuming operations on the client system 130. As yet another example and not by way of limitation, another factor may be one or more privacy constraints (e.g., specified privacy settings, applicable privacy policies). For example, if one or more privacy constraints limit or prevent particular data from being sent to a remote server (e.g., a server associated with the auxiliary system 140), the auxiliary system 140 may process user input in a first mode of operation (i.e., an on-device mode) in order to preserve user privacy. As yet another example and not by way of limitation, another factor may be unsynchronized contextual data between client system 130 and a remote server (e.g., a server associated with auxiliary system 140). For example, if client system 130 and the server associated with auxiliary system 140 are determined to have inconsistent, missing, and/or unsynchronized context data, auxiliary system 140 may process user input in a third mode of operation (i.e., a hybrid mode) to reduce the likelihood of an inadequate analysis associated with the user input. As yet another example and not by way of limitation, another factor may be a measurement of the latency of a connection between client system 130 and a remote server (e.g., a server associated with auxiliary system 140). For example, if a task associated with user input may significantly benefit from and/or require prompt or immediate execution (e.g., a photo-taking task), the auxiliary system 140 may process the user input in a first mode of operation (i.e., an on-device mode) to ensure that the task is executed in a timely manner. As yet another example and not by way of limitation, another factor may be whether a feature relevant to the task associated with the user input is supported only by a remote server (e.g., a server associated with auxiliary system 140). For example, if the relevant feature requires advanced technical functionality (e.g., high-performance processing capability, fast update cycles) that is supported only by the server associated with the auxiliary system 140 and not by the client system 130 at the time of the user input, the auxiliary system 140 may process the user input in the second mode of operation (i.e., cloud mode) or the third mode of operation (i.e., hybrid mode) in order to benefit from the relevant feature.
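The factor-based selection just described can be pictured, very loosely, as a rule cascade. The following Python sketch is only an illustration under assumed factor names, thresholds, and ordering; it is not the coordinator's actual decision logic.

from enum import Enum

class Mode(Enum):
    ON_DEVICE = 1   # first mode of operation
    CLOUD = 2       # second mode of operation
    HYBRID = 3      # third mode of operation

def select_mode(online: bool, battery_low: bool, privacy_restricted: bool,
                context_out_of_sync: bool, latency_sensitive: bool,
                needs_server_only_feature: bool) -> Mode:
    # Offline operation or privacy-restricted input keeps processing on the client.
    if not online or privacy_restricted:
        return Mode.ON_DEVICE
    # Tasks that must run immediately (e.g., taking a photo) stay local.
    if latency_sensitive:
        return Mode.ON_DEVICE
    # Features supported only server-side push processing toward the cloud.
    if needs_server_only_feature:
        return Mode.CLOUD
    # Low battery favors offloading work to the server.
    if battery_low:
        return Mode.CLOUD
    # Unsynchronized context between client and server favors hybrid processing.
    if context_out_of_sync:
        return Mode.HYBRID
    return Mode.HYBRID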
In particular embodiments, the on-device coordinator 206 on the client system 130 may coordinate receiving user input and may determine, at one or more decision points in the example workflow, which of the above-described modes of operation should be used to process or continue to process the user input. As discussed above, the selection of the operational mode may be based at least in part on the device state, the task associated with the user input, and/or one or more additional factors. By way of example and not limitation, referring to the workflow architecture shown in fig. 2, after receiving user input from a user, the on-device coordinator 206 may determine at decision point (D0) 205 whether to begin processing the user input in a first mode of operation (i.e., on-device mode), a second mode of operation (i.e., cloud mode), or a third mode of operation (i.e., hybrid mode). For example, at decision point (D0) 205, if client system 130 is not connected to network 110 (i.e., when client system 130 is offline), if one or more privacy constraints explicitly require on-device processing (e.g., adding or removing another person from a private call between multiple users), or if the user input is associated with a task that does not require or benefit from server-side processing (e.g., setting a reminder or calling another user), on-device coordinator 206 may select the first mode of operation (i.e., on-device mode). As another example, at decision point (D0) 205, if the client system 130 needs to conserve battery power (e.g., when the client system 130 has minimal available battery power or the user has indicated a desire to conserve battery power of the client system 130) or when additional utilization of computing resources needs to be restricted (e.g., when other processes running on the client system 130 require high CPU utilization, such as a Short Message Service (SMS) messaging application), the on-device coordinator 206 may select the second mode of operation (i.e., cloud mode) or the third mode of operation (i.e., hybrid mode).
In particular embodiments, as shown in fig. 2, if the on-device coordinator 206 determines at decision point (D0) 205 that the user input should be processed using the first mode of operation (i.e., on-device mode) or the third mode of operation (i.e., mixed mode), then the client-side process may proceed. By way of example and not limitation, if the user input includes speech data, the speech data may be received at a local Automatic Speech Recognition (ASR) module 208a on the client system 130. The ASR module 208a may allow the user to dictate and transcribe speech into written text, synthesize files into an audio stream, or issue commands that are recognized by the system as such.
In particular embodiments, the output of the ASR module 208a may be sent to a local Natural Language Understanding (NLU) module 210a. NLU module 210a may perform named entity resolution (Named Entity Resolution, NER) or named entity resolution may be performed by entity resolution module 212a, as described below. In particular embodiments, one or more of intent (intent), slot (slot), or domain (domain) may be the output of NLU module 210a.
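To make the NLU output concrete, the sketch below shows one way a domain, an intent, and slots could be bundled together. The field names and the confidence value are illustrative assumptions rather than the module's actual output format.

from dataclasses import dataclass

@dataclass
class Slot:
    name: str          # e.g., "SL:song_name"
    value: str         # surface text taken from the user utterance
    confidence: float

@dataclass
class NLUOutput:
    domain: str        # e.g., "music"
    intent: str        # e.g., "IN:play_music"
    slots: list[Slot]

# Hypothetical output for the utterance "Play Beethoven's 5th":
example = NLUOutput(
    domain="music",
    intent="IN:play_music",
    slots=[Slot(name="SL:song_name", value="Beethoven's 5th", confidence=0.93)],
)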
In particular embodiments, the user input may include non-voice data that may be received at local context engine 220 a. By way of example and not limitation, non-voice data may include location, visual material, touch, gesture, world update, social update, contextual information, information related to a person, activity data, and/or any other suitable type of non-voice data. The non-voice data may also include sensory data received by the client system 130 sensors (e.g., microphones, cameras) that may be accessed under privacy constraints and may be further analyzed by computer vision techniques. In particular embodiments, computer vision techniques may include human reconstruction, face detection, face recognition, hand tracking, eye movement tracking, and/or any other suitable computer vision technique. In particular embodiments, the non-speech data may be subject to a geometric construct, which may include constructing objects around the user using any suitable type of data collected by client system 130. By way of example and not limitation, a user may be wearing AR glasses, and the geometry may be used to determine the spatial location of surfaces and items (e.g., floors, walls, user's hands). In particular embodiments, the non-voice data may be inertial data collected by AR glasses or VR headset, and the inertial data may be data associated with linear and angular motion (e.g., measurements associated with user body motion). In particular embodiments, context engine 220a may determine various types of events and contexts based on non-speech data.
In particular embodiments, the output of NLU module 210a and/or context engine 220a may be sent to entity resolution module 212a. Entity resolution module 212a can resolve entities associated with one or more slots output by NLU module 210 a. In particular embodiments, each parsed entity may be associated with one or more entity identifiers. By way of example and not limitation, the identifier may include a unique user Identifier (ID) corresponding to a particular user (e.g., a unique user name or user ID number of social-networking system 160). In particular embodiments, each parsed entity may also be associated with a confidence score.
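As a toy illustration of the kind of output entity resolution might produce, consider the following sketch. The local index, the identifiers, and the confidence scores are invented for the example and do not reflect the actual implementation of entity resolution module 212a.

from dataclasses import dataclass

@dataclass
class ResolvedEntity:
    slot_name: str      # slot from the NLU output that was resolved
    entity_id: str      # e.g., a unique user ID in social-networking system 160
    confidence: float   # confidence score attached to the resolved entity

# Toy in-memory index standing in for whatever store the client actually queries.
LOCAL_INDEX = {
    "mom": ("user:1234", 0.97),
    "downtown coffee shop": ("place:5678", 0.81),
}

def resolve(slots: list[tuple[str, str]]) -> list[ResolvedEntity]:
    resolved = []
    for slot_name, value in slots:
        hit = LOCAL_INDEX.get(value.lower())
        if hit is not None:
            entity_id, score = hit
            resolved.append(ResolvedEntity(slot_name, entity_id, score))
    return resolved

print(resolve([("SL:contact", "Mom")]))
# -> [ResolvedEntity(slot_name='SL:contact', entity_id='user:1234', confidence=0.97)]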
In particular embodiments, at decision point (D0) 205, the on-device coordinator 206 may determine that the user input should be processed in a second mode of operation (i.e., cloud mode) or a third mode of operation (i.e., hybrid mode). In these modes of operation, user input may be handled by some server-side modules in a similar manner to the client-side process described above.
In particular embodiments, if the user input includes voice data, the user input voice data may be received at a remote Automatic Speech Recognition (ASR) module 208b on a remote server (e.g., a server associated with the auxiliary system 140). The ASR module 208b may allow the user to dictate and transcribe speech into written text, synthesize files into an audio stream, or issue commands that are recognized by the system as such.
In particular embodiments, the output of the ASR module 208b may be sent to a remote Natural Language Understanding (NLU) module 210b. In particular embodiments, NLU module 210b may perform Named Entity Resolution (NER), or named entity resolution may be performed by entity resolution module 212b of dialog manager module 216b, as described below. In a particular embodiment, one or more of the intent, slot, or domain may be an output of NLU module 210b.
In particular embodiments, the user input may include non-voice data, which may be received at the remote context engine 220 b. In particular embodiments, remote context engine 220b may determine various types of events and contexts based on non-speech data. In particular embodiments, the output of NLU module 210b and/or context engine 220b may be sent to remote dialog manager 216b.
In particular embodiments, as discussed above, the on-device coordinator 206 on the client system 130 may coordinate receiving user input and may determine, at one or more decision points in the example workflow, which of the above-described modes of operation should be used to process or continue to process the user input. As discussed further above, the selection of the operational mode may be based at least in part on the device state, the task associated with the user input, and/or one or more additional factors. By way of example and not limitation, with continued reference to the workflow architecture shown in fig. 2, after the entity resolution module 212a generates an output or null output, the on-device coordinator 206 may determine at decision point (D1) 215 whether to continue processing user input in the first mode of operation (i.e., on-device mode), the second mode of operation (i.e., cloud mode), or the third mode of operation (i.e., hybrid mode). For example, at decision point (D1) 215, if the identified intent is associated with a latency sensitive processing task (e.g., take a photograph, pause timer), the on-device coordinator 206 may select a first mode of operation (i.e., an on-device mode). As another example and not by way of limitation, if the on-device processing on the client system 130 does not support messaging tasks, the on-device coordinator 206 may select a third mode of operation (i.e., a hybrid mode) to process user input associated with the messaging request. As yet another example, at decision point (D1) 215, if the task being processed requires access to a social graph, knowledge graph, or concept graph that is not stored on the client system 130, the on-device coordinator 206 may select a second mode of operation (i.e., cloud mode) or a third mode of operation (i.e., hybrid mode). Alternatively, if there is a sufficient version (e.g., a small version and/or a bootstrapped version of a knowledge-graph) of an information graph (which includes the necessary information for the task) on the client system 130, the on-device coordinator 206 may instead select the first mode of operation (i.e., the on-device mode).
In particular embodiments, as shown in fig. 2, if the on-device coordinator 206 determines at decision point (D1) 215 that processing should proceed using either the first mode of operation (i.e., on-device mode) or the third mode of operation (i.e., hybrid mode), then the client-side process may proceed. By way of example and not limitation, the output from the entity resolution module 212a may be sent to the on-device dialog manager 216a. In particular embodiments, on-device dialog manager 216a may include dialog state tracker 218a and action selector 222a. The on-device dialog manager 216a may have complex dialog logic and product-related business logic to manage the dialog state and the flow of the dialog between the user and the auxiliary system 140. The on-device dialog manager 216a may include all functionality for end-to-end integration and multi-round support (e.g., confirmation, disambiguation). The on-device dialog manager 216a may also be lightweight with respect to computing constraints and resources, including memory, computation (CPU), and binary size. The on-device dialog manager 216a may also be extensible to enhance the developer experience. In particular embodiments, on-device dialog manager 216a may benefit auxiliary system 140 by, for example, providing offline support to alleviate network connectivity issues (e.g., unstable or unavailable network connections), using client-side processes to prevent privacy-sensitive information from being transferred out of the client system 130, and providing a stable user experience in highly latency-sensitive scenarios.
In particular embodiments, the on-device dialog manager 216a may also perform false trigger reduction. The implementation of false trigger reduction may detect and prevent false triggers from user input that would otherwise invoke the auxiliary system 140 (e.g., an unintended wake word), and may further prevent the auxiliary system 140 from generating data records based on false triggers that may be inaccurate and/or may be subject to privacy constraints. By way of example and not limitation, if the user is in a voice call, the user's dialog during the voice call may be considered private, and false trigger reduction may limit detection of wake words to audio user input received locally by the user's client system 130. In particular embodiments, on-device dialog manager 216a may implement false trigger reduction based on a nonce detector. If the nonce detector determines with high confidence that a received wake word is logically and/or contextually implausible at the point in time it was received from the user, the on-device dialog manager 216a may determine that the user did not intend to invoke the auxiliary system 140.
In particular embodiments, because of the limited computing power of client system 130, on-device dialog manager 216a may perform on-device learning based on learning algorithms specifically tailored to client system 130. By way of example and not limitation, federated learning techniques may be implemented by the on-device dialog manager 216a. Federated learning is a specific category of distributed machine learning techniques that can train a machine learning model using decentralized data stored on end devices (e.g., mobile phones). In particular embodiments, on-device dialog manager 216a may use a federated user representation learning model to extend existing neural network personalization techniques and enable federated learning for on-device dialog manager 216a. Federated user representation learning may personalize federated learning models by learning task-specific user representations (i.e., embeddings) and/or by personalizing model weights. Federated user representation learning is simple, scalable, privacy preserving, and resource efficient. Federated user representation learning may separate model parameters into federated parameters and private parameters. The private parameters (e.g., private user embeddings) may be trained locally on the client system 130 rather than being transmitted to or averaged by a remote server (e.g., a server associated with the auxiliary system 140). In contrast, the federated parameters may be trained remotely on the server. In particular embodiments, the on-device dialog manager 216a may use an active federated learning model, which may send a global model trained on a remote server to the client system 130 and compute gradients locally on the client system 130. Active federated learning may enable the on-device dialog manager 216a to minimize the transmission costs associated with downloading the model and uploading the gradients. For active federated learning, in each round, client systems 130 may be selected in a semi-random manner based at least in part on a probability conditioned on the current model and the data on the client systems 130, in order to optimize the efficiency of training the federated learning model.
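The split between federated parameters (shared and averaged on a server) and private parameters (kept on the device) can be sketched schematically as follows. The parameter names, learning rate, and update rule are assumptions made purely for illustration; this is not the actual training procedure.

# Each client keeps two groups of model parameters.
federated_params = {"encoder.w": [0.10, -0.32, 0.25]}   # shared, averaged on the server
private_params   = {"user_embedding": [0.01, 0.44]}     # trained locally, never uploaded

def local_training_step(gradients_fed, gradients_priv, lr=0.05):
    # Both parameter groups are updated locally on the device...
    for name, grad in gradients_fed.items():
        federated_params[name] = [w - lr * g for w, g in zip(federated_params[name], grad)]
    for name, grad in gradients_priv.items():
        private_params[name] = [w - lr * g for w, g in zip(private_params[name], grad)]
    # ...but only the federated group is sent back to the server.
    return {"upload": federated_params}

def server_average(client_uploads):
    # The server averages the federated parameters received from the selected clients.
    averaged = {}
    for name in client_uploads[0]:
        columns = zip(*(upload[name] for upload in client_uploads))
        averaged[name] = [sum(col) / len(client_uploads) for col in columns]
    return averaged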
In particular embodiments, dialog state tracker 218a may track state changes over time as a user interacts with the world and assistance system 140 interacts with the user. By way of example and not limitation, the dialog state tracker 218a may be subject to applicable privacy policies to track, for example, what the user is talking about, with whom the user is, where the user is, what tasks are currently being performed, and where the user gazes.
In particular embodiments, at decision point (D1) 215, the on-device coordinator 206 may determine to forward the user input to the server to continue processing in the second mode of operation (i.e., cloud mode) or the third mode of operation (i.e., hybrid mode). By way of example and not limitation, if a particular function or process (e.g., messaging) is not supported on the client system 130, the on-device coordinator 206 may determine at decision point (D1) 215 to use the third mode of operation (i.e., hybrid mode). In particular embodiments, on-device coordinator 206 may cause the output from NLU module 210a, context engine 220a, and entity resolution module 212a to be forwarded to entity resolution module 212b of remote dialog manager 216b through dialog manager agent 224 to continue the process. The dialog manager agent 224 may be a communication channel for exchanging information/events between the client system 130 and the server. In particular embodiments, dialog manager 216b may additionally include a remote arbiter 226b, a remote dialog state tracker 218b, and a remote action selector 222b. In particular embodiments, the auxiliary system 140 may have begun processing the user input in the second mode of operation (i.e., cloud mode) at decision point (D0) 205, and the on-device coordinator 206 may determine at decision point (D1) 215 to continue processing the user input based on the second mode of operation (i.e., cloud mode). Thus, the output from NLU module 210b and context engine 220b may be received at remote entity resolution module 212b. The remote entity resolution module 212b may have similar functionality to the local entity resolution module 212a, which may include resolving entities associated with slots. In particular embodiments, the entity resolution module 212b may access one or more of a social graph, a knowledge graph, or a concept graph when resolving an entity. The output from the entity resolution module 212b may be received at arbiter 226b.
In particular embodiments, remote arbiter 226b may be responsible for selecting between client-side upstream results and server-side upstream results (e.g., results from NLU modules 210a/210b, results from entity resolution modules 212a/212b, and results from context engines 220a/220b). The arbiter 226b may send the selected upstream result to the remote dialog state tracker 218b. In particular embodiments, similar to local dialog state tracker 218a, remote dialog state tracker 218b may use a task specification to convert the upstream results into candidate tasks and resolve the parameters using entity resolution.
In particular embodiments, at decision point (D2) 225, the on-device coordinator 206 may determine whether to continue to process user input based on the first mode of operation (i.e., on-device mode) or forward user input to a server in a third mode of operation (i.e., hybrid mode). The decision may depend on, for example, whether the client-side process is able to successfully resolve the tasks and slots, whether there is a valid task policy with specific feature support, and/or a context difference between the client-side process and the server-side process. In particular embodiments, the decision made at decision point (D2) 225 may be for a multi-round scenario. In particular embodiments, there may be at least two possible scenarios. In a first scenario, the auxiliary system 140 may have begun processing user input in a first mode of operation (i.e., an on-device mode) using the client-side dialog state. If the auxiliary system 140 decides to switch to having the remote server process the user input at the same point, the auxiliary system 140 may create and forward a programmed/predefined task with the current task state to the remote server. For subsequent rounds, the auxiliary system 140 may continue processing in a third mode of operation (i.e., a hybrid mode) using the server-side dialog state. In another scenario, the auxiliary system 140 may have already begun processing user input in the second mode of operation (i.e., cloud mode) or the third mode of operation (i.e., hybrid mode), and for all subsequent rounds, the auxiliary system 140 may rely substantially on server-side dialog states. If the on-device coordinator 206 determines to continue processing user input based on the first mode of operation (i.e., the on-device mode), an output from the dialog state tracker 218a may be received at the action selector 222 a.
In particular embodiments, at decision point (D2) 225, the on-device coordinator 206 may determine to forward the user input to the remote server and continue to process the user input in the second mode of operation (i.e., cloud mode) or the third mode of operation (i.e., hybrid mode). The auxiliary system 140 may create and forward a programmed/predefined task with the current task state to the server, which may be received at the action selector 222 b. In particular embodiments, auxiliary system 140 may have begun processing user input in the second mode of operation (i.e., cloud mode), and on-device coordinator 206 may determine at decision point (D2) 225 to continue processing user input in the second mode of operation (i.e., cloud mode). Thus, output from dialog state tracker 218b may be received at action selector 222 b.
In particular embodiments, action selectors 222a/222b may perform interaction management. The action selector 222a/222b may determine and trigger a set of general executable actions. These actions may be performed on the client system 130 or at a remote server. By way of example and not limitation, such actions may include providing information or suggestions to the user.
In particular embodiments, these actions may interact with agents 228a/228b, the user, and/or the auxiliary system 140 itself. These actions may include one or more of the following: slot request, confirmation, disambiguation, or agent execution. These actions may be implemented independently of the underlying implementation of the action selectors 222a/222b. For more complex scenarios (e.g., multi-round tasks, or tasks with complex business logic), the local action selector 222a may invoke one or more local agents 228a, and the remote action selector 222b may invoke one or more remote agents 228b to perform these actions. The agents 228a/228b may be invoked by task IDs, and any action may be routed to the correct agent 228a/228b using its task ID. In particular embodiments, agents 228a/228b may be configured to act as brokers between multiple content providers of a domain. A content provider may be an entity responsible for performing an action associated with an intent or for completing a task associated with the intent. In particular embodiments, agents 228a/228b may provide a number of functions for auxiliary system 140, including, for example, local template generation, task-specific business logic, and querying external APIs. The agents 228a/228b may use the context from the dialog state tracker 218a/218b in performing the actions of a task, and may also update the dialog state tracker 218a/218b. In particular embodiments, agents 228a/228b may also generate partial payloads from conversational actions.
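Routing an action to the correct agent by task ID, as described above, might look roughly like the following sketch. The registry, the task IDs, and the agent classes are hypothetical names introduced only for illustration.

class Agent:
    def execute(self, action: dict) -> str:
        raise NotImplementedError

class ReminderAgent(Agent):
    def execute(self, action: dict) -> str:
        return f"Reminder set: {action.get('text', '')}"

class MessagingAgent(Agent):
    def execute(self, action: dict) -> str:
        return f"Message sent to {action.get('recipient', '')}"

# Agents are registered under task IDs; each action carries a task ID used for routing.
AGENT_REGISTRY: dict[str, Agent] = {
    "task.create_reminder": ReminderAgent(),
    "task.send_message": MessagingAgent(),
}

def route_action(action: dict) -> str:
    agent = AGENT_REGISTRY[action["task_id"]]   # the task ID selects the agent
    return agent.execute(action)

print(route_action({"task_id": "task.send_message", "recipient": "Alex"}))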
In particular embodiments, the local agent 228a may have different implementations that are compiled/registered for different platforms (e.g., smart glasses versus VR headset). In particular embodiments, multiple device-specific implementations (e.g., real-time calling on client system 130 or a messaging application on client system 130) may be handled internally by a single agent 228a. Alternatively, device-specific implementations may be handled by multiple agents 228a associated with multiple domains. By way of example and not limitation, invoking agent 228a on smart glasses may be implemented in a different manner than invoking agent 228a on a smart phone. Different platforms may also utilize different numbers of agents 228a. The agent 228a may also be cross-platform (i.e., span different operating systems on client systems 130). Further, the agent 228a may have a minimized startup time or binary size impact. The local agent 228a may be adapted to a particular use case. By way of example and not limitation, one use case may be an emergency call on client system 130. As another example and not by way of limitation, another use case may be responding to user input without network connectivity. As yet another example and not by way of limitation, another use case may be that a particular domain or task is privacy sensitive and may prohibit user input from being sent to a remote server.
In particular embodiments, local action selector 222a may invoke local transport system 230a to perform an action, and remote action selector 222b may invoke remote transport system 230b to perform an action. Upon receipt of a trigger signal from the dialog state tracker 218a/218b, the delivery system 230a/230b may deliver the predefined event by performing a corresponding action. The delivery system 230a/230b may ensure that events are delivered to hosts with active connections. By way of example and not limitation, the delivery system 230a/230b may broadcast to all online devices belonging to a user. As another example and not by way of limitation, the delivery system 230a/230b may deliver events to a target-specific device. The transport system 230a/230b may also render the payload using the most current device context.
In particular embodiments, the on-device dialog manager 216a may also include a separate local action execution module, and the remote dialog manager 216b may also include a separate remote action execution module. The local execution module and the remote action execution module may have similar functions. In particular embodiments, the action execution module may call agents 228a/228b to perform tasks. The action execution module may also execute a set of universally executable actions determined by the action selectors 222a/222 b. The set of executable actions may interact with the agents 228a/228b, the users, and the auxiliary system 140 itself through the delivery system 230a/230 b.
In particular embodiments, if the first mode of operation (i.e., on-device mode) is used to process user input, results from agent 228a and/or delivery system 230a may be returned to on-device dialog manager 216a. The on-device dialog manager 216a may then instruct the local arbiter 226a to generate a final response based on these results. Arbiter 226a may aggregate these results and evaluate them. By way of example and not limitation, arbiter 226a may rank the results and select the best result for responding to the user input. If the user request is processed in the second mode of operation (i.e., cloud mode), results from the agent 228b and/or the delivery system 230b may be returned to the remote dialog manager 216b. The remote dialog manager 216b may, through the dialog manager agent 224, instruct the arbiter 226a to generate a final response based on these results. Similarly, arbiter 226a may analyze these results and select the best result to provide to the user. If user input is processed based on the third mode of operation (i.e., the hybrid mode), both client-side and server-side results (e.g., from agents 228a/228b and/or delivery systems 230a/230b) may be provided to arbiter 226a by on-device dialog manager 216a and remote dialog manager 216b, respectively. Arbiter 226a may then select between the client-side results and the server-side results to determine the final result to provide to the user. In particular embodiments, the logic to decide between these results may depend on the particular use case.
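A highly simplified picture of how an arbiter might rank client-side and server-side results and select one is given below. The confidence-based scoring and the tie-breaking margin are assumptions for illustration only, not the arbiter's actual logic.

from dataclasses import dataclass

@dataclass
class Result:
    source: str        # "client" or "server"
    confidence: float  # confidence reported by the producing process
    payload: str       # response content to render

def arbitrate(results: list[Result], prefer_client_margin: float = 0.05) -> Result:
    # Rank by confidence; under a hypothetical tie-breaking rule, prefer the
    # client-side result when the confidences are within a small margin
    # (e.g., to keep privacy-sensitive handling local).
    ranked = sorted(results, key=lambda r: r.confidence, reverse=True)
    best = ranked[0]
    for r in ranked[1:]:
        if r.source == "client" and best.confidence - r.confidence <= prefer_client_margin:
            return r
    return best

final = arbitrate([Result("client", 0.88, "Timer paused."),
                   Result("server", 0.90, "Timer paused.")])
print(final.source, final.payload)   # the client-side result wins under the margin rule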
In particular embodiments, local arbiter 226a may generate a response based on the final result and send the response to rendering output module 232. The rendering output module 232 may determine how to render the output in a manner appropriate for the client system 130. By way of example and not limitation, for a VR headset or AR smart glasses, rendering output module 232 may determine to render the output using a vision-based modality (e.g., an image or a video clip) that may be displayed by the VR headset or AR smart glasses. As another example, the response may be rendered as an audio signal that the user may play through a VR headset or AR smart glasses. As yet another example, the response may be rendered as augmented reality data for enhancing the user experience.
In particular embodiments, in addition to determining the mode of operation for processing user input, on-device coordinator 206 may determine whether to process the user input on rendering device 137, on companion device 138, or on a remote server. Rendering device 137 and/or companion device 138 may each process the user input using an auxiliary stack in a manner similar to that disclosed above. By way of example and not limitation, the on-device coordinator 206 may determine that a portion of the processing should be completed on the rendering device 137, that a portion of the processing should be completed on the companion device 138, and that the remaining processing should be completed on the remote server.
In particular embodiments, the auxiliary system 140 may have various capabilities including audio cognition, visual cognition, signal intelligence, reasoning, and memory. In particular embodiments, the audio-aware capabilities may enable the auxiliary system 140 to, for example: understanding user inputs associated with various domains in different languages, understanding and summarizing conversations, performing on-device audio recognition for complex commands, identifying the user by speech recognition, extracting topics from conversations and automatically tagging portions of the conversations, enabling wake-word-free audio interactions, filtering and amplifying user speech from ambient noise and conversations, and/or understanding which client system 130 the user is talking to when multiple client systems 130 are nearby.
In particular embodiments, the visual-cognitive capabilities may enable the auxiliary system 140 to, for example: performing face detection and tracking, identifying users, identifying people of interest in major metropolitan areas from different angles, identifying objects of interest in the world through a combination of existing machine learning models and one-shot learning, identifying moments of interest and automatically capturing them, implementing semantic understanding over multiple visual frames across different time segments, providing platform support for additional capabilities in person identification, place identification, or object identification, identifying a full set of settings and micro-locations including personalized locations, identifying complex activities, identifying complex gestures of the user for controlling the client system 130, processing images/videos from egocentric cameras (e.g., with respect to motion, capture angles, resolution), achieving similar levels of accuracy and speed for images with lower resolution, performing one-shot registration and identification of places and objects, and/or performing visual identification on the client system 130.
In particular embodiments, the assistance system 140 may utilize computer vision techniques to achieve visual awareness. In addition to computer vision techniques, the assistance system 140 may explore options that supplement these techniques to extend the recognition of objects. In particular embodiments, the auxiliary system 140 may use supplemental signals, such as optical character recognition (OCR) of an object's labels, GPS signals for location recognition, and/or signals from the user's client system 130 for identifying the user. In particular embodiments, the auxiliary system 140 may perform general scene recognition (e.g., home space, work space, public space) to set a context for the user and narrow down the computer vision search space to identify possible objects or people. In particular embodiments, the assistance system 140 may guide the user to train the assistance system 140. For example, crowdsourcing may be used to allow users to label objects and help the assistance system 140 identify more objects over time. As another example, when using auxiliary system 140, a user may register their personal objects as part of the initial setup. The assistance system 140 may also allow users to provide positive/negative signals for the objects with which they interact to train and improve their personalized models.
In particular embodiments, the ability to signal intelligence may enable the auxiliary system 140 to, for example: determining a user location, understanding a date/time, determining a home location, understanding a user's calendar and future intended places, integrating a richer sound understanding to identify settings/context by sound only, and/or building a signal intelligence model that can be personalized at run-time according to the user's personal routine.
In particular embodiments, the inference capabilities may enable the auxiliary system 140 to, for example: picking up previous conversation threads at any point in the future, synthesizing all signals to understand micro-contexts and personalized contexts, learning interaction patterns and preferences from the user's historical behavior and accurately suggesting interactions that the user may value, understanding what content the user might want to watch at what time of day based on micro-context understanding, generating highly predictive proactive suggestions, and/or understanding changes in the scene and how the changes might affect what the user wants.
In particular embodiments, the memory capability may enable the auxiliary system 140 to, for example: bearing in mind the social connections that the user has previously accessed or interacted with, writing into memory and querying memory (i.e., open dictation and automatic tagging) as desired, extracting richer preferences based on previous interactions and long-term learning, bearing in mind the user's life history, extracting rich information from self-centric data streams and automatic catalogs, and/or writing into memory in a structured form to form rich short-term, segment and long-term memories.
Fig. 3 illustrates an example flow chart 300 of the auxiliary system 140. In particular embodiments, auxiliary service module 305 may access request manager 310 upon receiving a user input. In a particular embodiment, the request manager 310 may include a context extractor 312 and a conversational understanding (CU) object generator 314. The context extractor 312 may extract context information associated with the user input. The context extractor 312 may also update the context information based on the assistance application 136 executing on the client system 130. By way of example and not limitation, the updated context information may include a content item being displayed on the client system 130. As another example and not by way of limitation, the updated context information may include whether a reminder is set on the client system 130. As another example and not by way of limitation, the updated context information may include whether a song is being played on the client system 130. CU object generator 314 may generate a particular content object related to the user input. The content object may include dialog session data and features associated with the user input that may be shared with all of the modules of the auxiliary system 140. In particular embodiments, request manager 310 may store the context information and the generated content objects in data store 320, where data store 320 is a particular data store implemented in the auxiliary system 140.
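The content object produced by CU object generator 314 could be pictured as a bundle of the raw user input, the extracted context, and dialog-session data. The structure below is an assumed sketch that mirrors the context examples above (displayed content, reminder state, song playback); it is not the actual object used by the system.

from dataclasses import dataclass, field
import time

@dataclass
class ContentObject:
    utterance: str                                 # the raw user input
    context: dict = field(default_factory=dict)    # extracted context information
    session_id: str = "session-0"                  # dialog session the input belongs to
    created_at: float = field(default_factory=time.time)

def extract_context(client_state: dict) -> dict:
    # Illustrative context extraction mirroring the examples above.
    return {
        "displayed_item": client_state.get("displayed_item"),
        "reminder_set": client_state.get("reminder_set", False),
        "song_playing": client_state.get("song_playing", False),
    }

content = ContentObject(
    utterance="remind me to call Alex at 5pm",
    context=extract_context({"reminder_set": False, "song_playing": True}),
)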
In particular embodiments, request manager 310 may send the generated content object to NLU module 210. NLU module 210 may perform a number of steps to process the content object. NLU module 210 may first check the content object against allow list/block list 330. In particular embodiments, allow list/block list 330 may include interpretation data that matches the user input. NLU module 210 may then perform characterization 332 of the content object. NLU module 210 may then perform domain classification/selection 334 of the user input based on the features generated by characterization 332 to classify the user input into a predefined domain. In particular embodiments, a domain may represent a social context of an interaction (e.g., education), or a namespace of a set of intents (e.g., music). The domain classification/selection result may be further processed based on two related processes. In one process, NLU module 210 may use meta-intent classifier 336a to process the domain classification/selection result. The meta-intent classifier 336a may determine a category that describes the intent of the user. An intent may be an element in a predefined semantic intent classification that may indicate the purpose of a user interaction with the assistance system 140. NLU module 210 may classify the user input as a member of a predefined classification. For example, the user input may be "Play Beethoven's 5th", and NLU module 210 may classify the input as having the intent [IN:play_music]. In particular embodiments, intents common to multiple domains may be processed by meta-intent classifier 336a. By way of example and not limitation, meta-intent classifier 336a may be based on a machine learning model that takes the domain classification/selection result as input and calculates a probability that the input is associated with a particular predefined meta-intent. NLU module 210 may then use meta-slot labeler 338a to label the classification result from meta-intent classifier 336a with one or more meta-slots. A slot may be a named substring corresponding to a string within the user input that represents a basic semantic entity. For example, the slot for "pizza" may be [SL:dish]. In particular embodiments, a set of valid or expected named slots may be conditioned on the classified intent. By way of example and not limitation, for the intent [IN:play_music], a valid slot may be [SL:song_name]. In particular embodiments, meta-slot labeler 338a may label generic slots such as a reference to an item (e.g., first), a type of slot, a value of a slot, and the like. In particular embodiments, NLU module 210 may use intent classifier 336b to process the domain classification/selection result. The intent classifier 336b may determine a user intent associated with the user input. In particular embodiments, for each domain, there may be one intent classifier 336b to determine the most likely intent in a given domain. By way of example and not limitation, intent classifier 336b may be based on a machine learning model that takes the domain classification/selection result as input and calculates a probability that the input is associated with a particular predefined intent. NLU module 210 may then use slot annotator 338b to annotate one or more slots associated with the user input.
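The overall flow from utterance to domain, intent, and slots can be illustrated with the toy pipeline below. The keyword-based rules stand in for the trained classifiers described above and are assumptions made purely for illustration.

def nlu_pipeline(utterance: str) -> dict:
    # Toy stand-ins for the stages described above; these keyword rules are
    # illustrative assumptions, not the trained classifiers the text refers to.
    features = utterance.lower().split()                    # characterization
    domain = "music" if "play" in features else "general"   # domain classification/selection
    intent = "IN:play_music" if domain == "music" else "IN:unknown"
    slots = {}
    if intent == "IN:play_music":
        # Everything after the verb is treated as the song name slot.
        slots["SL:song_name"] = utterance.split(" ", 1)[1]
    return {"domain": domain, "intent": intent, "slots": slots}

print(nlu_pipeline("Play Beethoven's 5th"))
# {'domain': 'music', 'intent': 'IN:play_music', 'slots': {'SL:song_name': "Beethoven's 5th"}}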
In particular embodiments, slot annotator 338b can annotate one or more slots for the n-grams of the user input. By way of example and not limitation, the user input may include "change 500 dollars in my account to Japanese yen". The intent classifier 336b may take the user input as input and represent the input as a vector. The intent classifier 336b may then calculate probabilities that the user input is associated with different predefined intents, based on vector comparisons between the vector representing the user input and the vectors representing the different predefined intents. In a similar manner, slot annotator 338b may take the user input as input and represent each word as a vector. Slot annotator 338b may then calculate probabilities that each word is associated with different predefined slots, based on vector comparisons between the vector representing the word and the vectors representing the different predefined slots. The user's intent may be classified as "change money". The slots of the user input may include "500", "dollars", "account", and "Japanese yen". The meta-intent of the user may be classified as "financial service". The meta-slot may include "finance".
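The vector-comparison step described above amounts to scoring the input against each predefined intent vector. The following cosine-similarity sketch, with invented toy embeddings, illustrates the idea; the vectors and intent names are assumptions, not values produced by the actual classifiers.

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Assumed toy embeddings for the user input and two predefined intents.
input_vector = [0.9, 0.1, 0.4]
intent_vectors = {
    "IN:change_money": [0.8, 0.2, 0.5],
    "IN:play_music":   [0.1, 0.9, 0.2],
}

scores = {intent: cosine(input_vector, vec) for intent, vec in intent_vectors.items()}
best_intent = max(scores, key=scores.get)
print(best_intent, round(scores[best_intent], 3))   # IN:change_money scores highest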
In particular embodiments, NLU module 210 may also extract information from one or more of a social graph, a knowledge graph, or a concept graph, and may retrieve user profiles stored locally on client system 130. NLU module 210 may also consider contextual information when analyzing user input. NLU module 210 may also process information from these different sources by identifying and aggregating the information, annotating the n-grams of the user input, ranking the n-grams with confidence scores based on the aggregated information, and formulating the ranked n-grams into features that NLU module 210 can use to understand the user input. In particular embodiments, NLU module 210 may identify one or more of a domain, an intent, or a slot from the user input in a personalized and context-aware manner. By way of example and not limitation, the user input may include "show me how to get to the coffee shop". NLU module 210 may identify the particular coffee shop that the user wants to go to based on the user's personal information and the associated contextual information. In particular embodiments, NLU module 210 may include a dictionary of a particular language, a parser, and grammar rules that divide sentences into internal representations. NLU module 210 may also include one or more programs that perform naive or stochastic semantic analysis, and may also use pragmatics to understand the user input. In particular embodiments, the parser may be based on a deep learning architecture that includes a plurality of long short-term memory (LSTM) networks. By way of example and not limitation, the parser may be based on a recurrent neural network grammar (RNNG) model, which is a recursive and recurrent LSTM algorithm.
In particular embodiments, the output of NLU module 210 may be sent to entity resolution module 212 to resolve relevant entities. Entities may include, for example, unique users or concepts, each of which may have a unique identifier (ID). These entities may include one or more of the following: real-world entities (from a general knowledge base), user entities (from user memory), context entities (device context/dialog context), or value resolutions (numbers, datetimes, etc.). In particular embodiments, entity resolution module 212 may include domain entity resolution 340 and generic entity resolution 342. Entity resolution module 212 can perform generic and domain-specific entity resolution. Generic entity resolution 342 can resolve entities by classifying slots and meta-slots into different generic topics. Domain entity resolution 340 may resolve entities by classifying slots and meta-slots into different domains. By way of example and not limitation, in response to an input querying the advantages of a particular brand of electric car, generic entity resolution 342 may resolve the referenced brand of electric car as a vehicle, and domain entity resolution 340 may resolve the referenced brand of electric car as an electric car.
In particular embodiments, entities may be resolved based on knowledge 350 about the world and the user. The assistance system 140 may extract ontology data from the graphs 352. By way of example and not limitation, the graphs 352 may include one or more of a knowledge graph, a social graph, or a concept graph. The ontology data may include structural relationships between different slots/meta-slots and domains. The ontology data may also include information on how slots/meta-slots may be grouped, how slots/meta-slots may be related within a hierarchy (where the higher level includes the domain), and how slots/meta-slots may be subdivided according to similarity and difference. For example, the knowledge graph may include a plurality of entities. Each entity may include a single record associated with one or more attribute values. A particular record may be associated with a unique entity identifier. Each record may have diverse values for an attribute of the entity. Each attribute value may be associated with a confidence probability and/or a semantic weight. The confidence probability of an attribute value represents the probability that the value is accurate for the given attribute. The semantic weight of an attribute value may represent how well the value semantically fits the given attribute, considering all available information. For example, the knowledge graph may include an entity for a book named "BookName", which may include information extracted from multiple content sources (e.g., online social networks, online encyclopedias, book review sources, media databases, and entertainment content sources) that may be deduplicated, resolved, and fused to generate a single unique record of the knowledge graph. In this example, the entity named "BookName" may be associated with a "fantasy" attribute value for the "type" entity attribute.
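Purely as a hedged illustration of the record structure described above (the field and class names below are assumptions, not the disclosure's schema), a knowledge-graph entity record might be modeled as:

    from dataclasses import dataclass, field

    @dataclass
    class AttributeValue:
        value: str
        confidence: float       # probability that the value is accurate for the attribute
        semantic_weight: float  # how well the value semantically fits the attribute

    @dataclass
    class KnowledgeGraphEntity:
        entity_id: str                                   # unique entity identifier
        attributes: dict = field(default_factory=dict)   # attribute name -> list of AttributeValue

    # A single fused record for the book "BookName", deduplicated from several sources.
    book = KnowledgeGraphEntity(
        entity_id="entity:book:bookname",
        attributes={"type": [AttributeValue("fantasy", confidence=0.92, semantic_weight=0.88)]},
    )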
In particular embodiments, the assistant user memory (AUM) 354 may include user episodic memories that help determine how to assist the user more effectively. The AUM 354 may be a central location for storing, retrieving, indexing, and searching user data. By way of example and not limitation, the AUM 354 may store information such as contacts, photos, reminders, and the like. In addition, the AUM 354 may automatically synchronize data to the server and other devices (for non-sensitive data only). By way of example and not limitation, if a user sets a nickname for a contact on one device, all devices may synchronize and obtain the nickname based on the AUM 354. In particular embodiments, the AUM 354 may first prepare events, user state, reminders, and trigger state for storage in a data store. A memory node identifier (ID) may be created to store an item object in the AUM 354, where an item may be some piece of information about the user (e.g., a photo, a reminder, etc.). By way of example and not limitation, the first few bits of the memory node ID may indicate that this is a memory node ID type, the next few bits may be the user ID, and the last few bits may be the creation time. The AUM 354 may then index the data for retrieval as needed. An index ID may be created for this purpose. In particular embodiments, given an "index key" (e.g., photo_location) and an "index value" (e.g., "San Francisco"), the AUM 354 may obtain a list of memory IDs having that attribute (e.g., photos of San Francisco). By way of example and not limitation, the first few bits may indicate that this is an index ID type, the next few bits may be the user ID, and the last few bits may encode the "index key" and "index value". The AUM 354 may also utilize a flexible query language for information retrieval. A relation index ID may be created for this purpose. In particular embodiments, given a source memory node and an edge type, the AUM 354 may obtain the memory IDs of all target nodes with outgoing edges of that type from the source. By way of example and not limitation, the first few bits may indicate that this is a relation index ID type, the next few bits may be the user ID, and the last few bits may be the source node ID and edge type. In particular embodiments, the AUM 354 may facilitate detection of concurrent updates of different events. More information on episodic memories may be found in U.S. patent application Ser. No. 16/552559, filed 27 August 2019, which is incorporated herein by reference.
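The exact bit layout of the memory node, index, and relation index IDs is only sketched in the text; a toy illustration (with an assumed 64-bit layout that is not specified in the disclosure) might look like:

    # Assumed toy layout: 8 bits for the ID type, 28 bits for the user ID,
    # 28 bits for the payload (creation time, or a hash of the index key/value).
    ID_TYPE_MEMORY, ID_TYPE_INDEX, ID_TYPE_RELATION = 1, 2, 3
    MASK_28 = (1 << 28) - 1

    def pack_id(id_type, user_id, payload):
        return (id_type << 56) | ((user_id & MASK_28) << 28) | (payload & MASK_28)

    def unpack_id(packed):
        return packed >> 56, (packed >> 28) & MASK_28, packed & MASK_28

    # Memory node ID for a photo item, payload holding a truncated creation time.
    memory_id = pack_id(ID_TYPE_MEMORY, user_id=42, payload=170_000_000 & MASK_28)

    # Index ID keyed on ("photo_location", "San Francisco"), hashed into the payload bits.
    index_payload = hash(("photo_location", "San Francisco")) & MASK_28
    index_id = pack_id(ID_TYPE_INDEX, user_id=42, payload=index_payload)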
In particular embodiments, entity resolution module 212 may use different techniques to resolve different types of entities. For real-world entities, entity resolution module 212 may use the knowledge graph to resolve the span to an entity, such as "music track", "movie", and the like. For user entities, entity resolution module 212 may use user memory or certain agents to resolve the span to user-specific entities, such as "contact", "reminder", or "relationship". For context entities, entity resolution module 212 may perform coreference resolution based on information from context engine 220 to resolve references to entities in context, such as "he", "she", "the first one", or "the last one". In particular embodiments, for coreference, entity resolution module 212 may create references for the entities determined by NLU module 210. Entity resolution module 212 can then resolve these references accurately. By way of example and not limitation, the user input may include "find me the nearest grocery store and direct me there". Based on coreference resolution, entity resolution module 212 may interpret "there" as "the nearest grocery store". In particular embodiments, coreference resolution may depend on information from context engine 220 and dialog manager 216 to interpret references with improved accuracy. In particular embodiments, entity resolution module 212 may also resolve an entity under context (device context or dialog context), such as an entity shown on the screen or an entity from the last dialog history. For value resolution, entity resolution module 212 may resolve referenced values into exact values in standardized form, such as a numerical value, a date/time, an address, and the like.
In particular embodiments, entity resolution module 212 may first perform a check against applicable privacy constraints to ensure that performing entity resolution does not violate any applicable privacy policies. By way of example and not limitation, the entity to be resolved may be another user whose privacy settings specify that their identity should not be searchable on the online social network. In this case, entity resolution module 212 may refrain from returning that user's entity identifier in response to the user input. By utilizing the described information obtained from the social graph, the knowledge graph, the concept graph, and the user profile, and by complying with any applicable privacy policies, entity resolution module 212 can resolve entities associated with the user input in a personalized, context-aware, and privacy-protected manner.
In particular embodiments, the entity resolution module 212 may work with the ASR module 208 to perform entity resolution. The following example illustrates how the entity resolution module 212 may resolve an entity name. The entity resolution module 212 may first expand names associated with the user into their respective normalized text forms, which can be phonetically transcribed as phonetic consonant representations using a double metaphone algorithm. The entity resolution module 212 may then determine an n-best set of candidate transcriptions and perform a parallel understanding process on all of the phonetic transcriptions in the n-best set of candidate transcriptions. In particular embodiments, each transcription that resolves to the same intent may then be collapsed into a single intent. Each intent may then be assigned a score corresponding to the highest-scoring candidate transcription for that intent. During the collapse, the entity resolution module 212 may identify the various possible text transcriptions associated with each slot, correlated by the boundary timing offsets associated with the slot's transcription. The entity resolution module 212 may then extract a subset of possible candidate transcriptions for each slot from a plurality (e.g., 1000) of candidate transcriptions, regardless of whether the candidate transcriptions are classified into the same intent. In this manner, the slots and intents may be scored lists of phrases. In particular embodiments, a new or running task that can handle the intent (e.g., a message composition task for an intent to send a message to another user) may be identified and provided with the intent. The identified task may then trigger the entity resolution module 212 by providing it with the scored list of phrases associated with one of its slots, along with the categories against which resolution should be performed. By way of example and not limitation, if the entity attribute is designated as "friend", the entity resolution module 212 may run each candidate list of terms through the same expansion that may be run at matcher compile time. Each candidate expansion of the terms may be matched against a precompiled trie matching structure. Matches may be scored using a function based at least in part on the transcribed input, the matched form, and the friend name. As another example and not by way of limitation, if the entity attribute is designated as "celebrity/notable person", the entity resolution module 212 may perform a parallel search against the knowledge graph for each candidate set of slots output from the ASR module 208. The entity resolution module 212 can score matches based on the matched person's popularity and the scoring signals provided by the ASR. In particular embodiments, when a memory category is specified, the entity resolution module 212 may perform the same search against the user's memories. The entity resolution module 212 can crawl backward through the user's memories and attempt to match each memory (e.g., a person recently mentioned in a conversation, or a person seen and recognized via visual signals, etc.). For each entity, the entity resolution module 212 may employ matching similar to how friends are matched (i.e., phonetic matching). In particular embodiments, the score may include a temporal decay factor associated with how recently the name was previously mentioned.
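As a rough, non-authoritative sketch of the friend-matching step (using a crude consonant key in place of a real double metaphone implementation, and with made-up weights), the scoring might resemble:

    import re

    def phonetic_key(name):
        # Crude stand-in for a double-metaphone style consonant representation.
        s = re.sub(r"[^a-z]", "", name.lower())
        s = re.sub(r"[aeiou]", "", s)           # drop vowels
        return re.sub(r"(.)\1+", r"\1", s)      # collapse repeated consonants

    def score_friend_match(transcription, friend_name, asr_score, recency_decay=1.0):
        # Combine phonetic-form match, ASR confidence, and a temporal decay factor.
        form_match = 1.0 if phonetic_key(transcription) == phonetic_key(friend_name) else 0.0
        return (0.6 * form_match + 0.4 * asr_score) * recency_decay

    candidates = ["Alex Johnson", "Alexa Smith", "Alexander Lee"]
    scored = sorted(((name, score_friend_match("alecks", name, asr_score=0.85))
                     for name in candidates), key=lambda p: p[1], reverse=True)

A production matcher would instead match candidate expansions against a precompiled trie, as described above; the point of the sketch is only the combination of phonetic form, ASR signal, and recency.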
The entity resolution module 212 may also combine, sort, and deduplicate all matches. In particular embodiments, a task may receive a candidate set. When there are multiple high scoring candidates, the entity resolution module 212 may perform user-facilitated disambiguation (e.g., obtain real-time user feedback from the user for the candidates).
In particular embodiments, context engine 220 may help entity resolution module 212 improve entity resolution. Context engine 220 may include an offline aggregator and an online inference service. The offline aggregator may process a plurality of data associated with the user collected from a previous time window. By way of example and not limitation, the data may include news feed posts/comments collected during a predetermined time frame (e.g., from a previous 90-day window), interactions with news feed posts/comments, search history, and the like. The processing results may be stored in context engine 220 as part of the user profile. The user profile of the user may include user profile data including demographic information, social information, and contextual information associated with the user. The user profile data may also include the user's interests and preferences over multiple topics, aggregated through conversations on the news feed, search logs, messaging platforms, and the like. The use of a user profile may be subject to privacy constraints to ensure that a user's information can be used only in his/her interest and cannot be shared with anyone else. In particular embodiments, the online inference service may analyze conversational data associated with the user received by the assistance system 140 at the current time. The analysis results may also be stored in context engine 220 as part of the user profile. In particular embodiments, both the offline aggregator and the online inference service may extract personalized features from the plurality of data. Other modules of the auxiliary system 140 may use the extracted personalized features to better understand user input. In particular embodiments, entity resolution module 212 may process information (e.g., the user profile) from context engine 220 using the following steps based on natural language processing (NLP). In particular embodiments, entity resolution module 212 may tokenize text via text normalization based on NLP, extract syntactic features from the text, and extract semantic features from the text. Entity resolution module 212 can also extract features from contextual information accessed from the dialog history between the user and the auxiliary system 140. Entity resolution module 212 can also perform global word embedding, domain-specific embedding, and/or dynamic embedding based on the contextual information. The processing results may be annotated with entities by an entity tagger. Based on the annotations, entity resolution module 212 may generate a dictionary. In particular embodiments, the dictionary may include global dictionary features that may be dynamically updated offline. Entity resolution module 212 can rank the entities annotated by the entity tagger. In particular embodiments, entity resolution module 212 may communicate with different graphs 352 (including one or more of the social graph, knowledge graph, or concept graph) to extract ontology data related to the information retrieved from context engine 220. In particular embodiments, entity resolution module 212 may also resolve entities based on the user profile, the ranked entities, and the information from the graphs 352.
In particular embodiments, entity resolution module 212 may be driven by the task (corresponding to agents 228). This inversion of processing order may make it possible for domain knowledge present in the task to be applied to pre-filter or bias the set of resolution targets when it is clearly and appropriately applicable. By way of example and not limitation, for the utterance "who is John?", no clear category constrains what "John" may refer to. Thus, entity resolution module 212 may resolve "John" against everything. As another example and not by way of limitation, for the utterance "send a message to John", entity resolution module 212 can readily determine that "John" refers to a person who can send and receive messages. Thus, entity resolution module 212 may bias the resolution toward friends. As another example and not by way of limitation, for the utterance "what is John's most famous album?", the task corresponding to the utterance involves a music album. Entity resolution module 212 may determine that entities related to a music album include singers, producers, and recording studios. Thus, entity resolution module 212 may search among these types of entities in the music domain to resolve "John".
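A minimal sketch of the task-driven pre-filtering described above (the intent names and category taxonomy here are assumptions introduced only for illustration):

    # Hypothetical mapping from task intent to the entity categories worth searching.
    TASK_ENTITY_BIAS = {
        "IN:send_message":    ["friend", "contact"],
        "IN:get_album_info":  ["singer", "producer", "recording_studio"],
        "IN:get_person_info": None,   # no bias: resolve against everything
    }

    def candidate_categories(intent, all_categories):
        # Pre-filter resolution targets using domain knowledge carried by the task.
        bias = TASK_ENTITY_BIAS.get(intent)
        return all_categories if bias is None else [c for c in all_categories if c in bias]

    print(candidate_categories("IN:send_message",
                               ["friend", "contact", "celebrity", "place"]))
    # -> ['friend', 'contact']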
In particular embodiments, the output of entity resolution module 212 may be sent to dialog manager 216 to advance the flow of the dialog with the user. Dialog manager 216 may be an asynchronous state machine that repeatedly updates the state and selects actions based on the new state. Dialog manager 216 may also store previous dialogs between the user and the auxiliary system 140. In particular embodiments, dialog manager 216 may perform dialog optimization. Dialog optimization is the challenge of understanding and identifying the most likely branching options in a dialog with the user. By way of example and not limitation, the auxiliary system 140 may implement dialog optimization techniques to avoid the need to confirm who the user wants to call, because the auxiliary system 140 may determine with high confidence that the person inferred from the context and available data is the intended recipient. In particular embodiments, dialog manager 216 may implement a reinforcement learning framework to improve dialog optimization. Dialog manager 216 may include dialog intent resolution 356, dialog state tracker 218, and action selector 222. In particular embodiments, dialog manager 216 may execute the selected action and then invoke dialog state tracker 218 again until the selected action requires a user response or no more actions are to be executed. Each selected action may depend on the execution result of the previous action. In particular embodiments, dialog intent resolution 356 may resolve the user intent associated with the current dialog session based on the dialog history between the user and the assistance system 140. Dialog intent resolution 356 can map the intents determined by NLU module 210 to different dialog intents. Dialog intent resolution 356 can also rank the dialog intents based on signals from NLU module 210, entity resolution module 212, and the dialog history between the user and the auxiliary system 140.
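A conceptual sketch of the update-then-act loop described above (the callables are placeholders standing in for dialog state tracker 218, action selector 222, and agent execution; this is not the disclosed implementation):

    def run_dialog_turn(state, user_input, track_state, select_action, execute):
        # One turn: update the state, then keep selecting and executing actions
        # until an action requires a user response or no further action applies.
        state = track_state(state, user_input)
        while True:
            action = select_action(state)
            if action is None:
                return state, None
            result = execute(action, state)        # may call an agent 228
            state = track_state(state, result)     # the next action sees prior results
            if result.get("needs_user_response"):
                return state, result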
In particular embodiments, dialog state tracker 218 may use a set of operators to track the dialog state. The operators may include the data and logic required to update the dialog state. Each operator may act as a delta of the dialog state after processing the most recently received user input. In particular embodiments, dialog state tracker 218 may include a task tracker that may be based on task specifications and different rules. Dialog state tracker 218 may also include a slot tracker and a coreference component that may be rule-based and/or recency-based. The coreference component can assist entity resolution module 212 in resolving entities. In alternative embodiments, with the coreference component, dialog state tracker 218 may replace entity resolution module 212 and may resolve any references/mentions and keep track of the state. In particular embodiments, dialog state tracker 218 may use the task specifications to convert upstream results into candidate tasks and resolve parameters using entity resolution. Both the user state (e.g., the user's current activity) and the task state (e.g., trigger conditions) may be tracked. Given the current state, dialog state tracker 218 may generate candidate tasks that the auxiliary system 140 may process and execute for the user. By way of example and not limitation, candidate tasks may include "give a suggestion", "obtain weather information", or "take a photo". In particular embodiments, dialog state tracker 218 may generate candidate tasks based on available data from, for example, the knowledge graph, the user memory, and the user task history. In particular embodiments, dialog state tracker 218 may then resolve the trigger object using the resolved parameters. By way of example and not limitation, the user input "remind me to call mom when she's online and I'm home tonight" may be converted by dialog state tracker 218 from the NLU output to a trigger representation, as shown in Table 1 below:
Table 1: conversion examples from NLU output to trigger representation
In the above example, "mom," "home," and "tonight" are represented by their respective entities: personEntity, locationEntity, and datetimeEntity.
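Because the body of Table 1 is not reproduced here, the shape of such a trigger representation can only be suggested; the nesting and field names below are hypothetical:

    trigger_representation = {
        "task": "create_smart_reminder",
        "todo": {"action": "call", "callee": {"personEntity": "mom"}},
        "trigger": {
            "type": "and",
            "conditions": [
                {"type": "contact_online",   "person":   {"personEntity":   "mom"}},
                {"type": "user_at_location", "location": {"locationEntity": "home"}},
                {"type": "within_time",      "time":     {"datetimeEntity": "tonight"}},
            ],
        },
    }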
In particular embodiments, dialog manager 216 may map the events determined by context engine 220 to actions. By way of example and not limitation, an action may be a natural language generation (NLG) action, a display or overlay, a device action, or a retrieval action. Dialog manager 216 may also perform context tracking and interaction management. Context tracking may include aggregating the stream of real-time events into a unified user state. Interaction management may include selecting the best action in each state. In particular embodiments, dialog state tracker 218 may perform context tracking (i.e., tracking events related to the user). To support processing of the event stream, dialog state tracker 218 may use event handlers (e.g., for disambiguation, confirmation, requests) that may consume various types of events and update the internal auxiliary state. Each event type may have one or more handlers. Each event handler may modify a certain slice of the auxiliary state. In particular embodiments, the event handlers may operate on disjoint subsets of the state (i.e., only one handler may have write access to a particular field in the state). In particular embodiments, all event handlers may have an opportunity to process a given event. By way of example and not limitation, dialog state tracker 218 may run all event handlers in parallel on each event, and may then merge the state updates proposed by the individual event handlers (e.g., for each event, most handlers may return a NULL update).
In particular embodiments, dialog state tracker 218 may work like any programmatic handler (logic) that requires versioning. In particular embodiments, instead of directly changing the dialog state, dialog state tracker 218 may be a side-effect-free component and may generate the n best candidates of dialog-state update operators that propose updates to the dialog state. Dialog state tracker 218 may include an intent resolver that contains the logic for handling different types of NLU intents and generating operators based on the dialog state. In particular embodiments, this logic may be organized by intent handlers (e.g., a disambiguation intent handler for handling intents when the auxiliary system 140 requires disambiguation, a confirmation intent handler that includes the logic for handling confirmations, etc.). The intent resolver may combine the turn intent with the dialog state to generate contextual updates for the dialog with the user. A slot resolution component can then recursively resolve the slots in the update operators using resolution providers (including the knowledge graph and domain agents). In particular embodiments, dialog state tracker 218 may update/rank the dialog states of the current dialog session. By way of example and not limitation, if the dialog session ends, dialog state tracker 218 may update the dialog state to "complete". As another example and not by way of limitation, dialog state tracker 218 may rank the dialog states based on priorities associated with them.
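A side-effect-free operator generator can be sketched as follows (a simplified illustration; the fields, scores, and thresholds are invented for the example):

    from dataclasses import dataclass

    @dataclass
    class StateUpdateOperator:
        # A proposed delta to the dialog state; nothing is applied until selected.
        description: str
        updates: dict
        score: float

    def propose_operators(nlu_hypotheses, dialog_state, n=3):
        # Generate the n best candidate operators instead of mutating state directly.
        candidates = []
        for hyp in nlu_hypotheses:
            if hyp.get("needs_disambiguation"):
                candidates.append(StateUpdateOperator(
                    "ask user to disambiguate",
                    {"pending_intent": hyp["intent"]},
                    hyp["score"] * 0.8))
            else:
                candidates.append(StateUpdateOperator(
                    "advance task for " + hyp["intent"],
                    {"active_intent": hyp["intent"]},
                    hyp["score"]))
        return sorted(candidates, key=lambda o: o.score, reverse=True)[:n]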
In particular embodiments, dialog state tracker 218 may communicate with action selector 222 regarding the dialog intents and associated content objects. In particular embodiments, action selector 222 may rank different dialog hypotheses for different dialog intents. Action selector 222 may take the candidate operators of the dialog state and consult dialog policy 360 to decide what actions should be executed. In particular embodiments, dialog policy 360 may be a tree-based policy, which is a pre-constructed dialog plan. Based on the current dialog state, dialog policy 360 may select a node to execute and generate the corresponding action. By way of example and not limitation, the tree-based policy may include topic-grouping nodes and dialog-action (leaf) nodes. In particular embodiments, dialog policy 360 may also include a data structure that describes the execution plan of an action by agent 228. Dialog policy 360 may also include a plurality of goals related to each other by logical operators. In particular embodiments, a goal may be the outcome of a portion of the dialog policy, and a goal may be constructed by dialog manager 216. A goal may be represented by an identifier (e.g., a string) with one or more named parameters that parameterize the goal. By way of example and not limitation, a goal with its associated goal parameters may be represented as {confirm_artist, args:{artist: "Madonna"}}. In particular embodiments, goals may be mapped to the leaves of the tree of the tree-structured representation of dialog policy 360.
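As an informal illustration of a goal bound to a leaf of a tree-based policy (the structure below is a toy, not the disclosed data structure):

    # A goal identifier with named parameters.
    goal = {"id": "confirm_artist", "args": {"artist": "Madonna"}}

    # Toy tree-based policy: a topic-grouping node whose leaves are dialog actions.
    policy_tree = {
        "music": {
            "confirm_artist": lambda args: "Did you mean " + args["artist"] + "?",
            "play_song":      lambda args: "Playing " + args["song"] + ".",
        },
    }

    print(policy_tree["music"][goal["id"]](goal["args"]))   # -> Did you mean Madonna?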
In particular embodiments, the auxiliary system 140 may use hierarchical dialog policies 360, where a generic policy 362 handles cross-domain business logic and task policies 364 handle task/domain-specific logic. The generic policy 362 can be used for actions that are not specific to an individual task. The generic policy 362 can be used to determine task stacking and switching, active tasks, notifications, and the like. The generic policy 362 may include: handling low-confidence intents, internal errors, and unacceptable user responses that require retries, and/or skipping or inserting confirmations based on ASR or NLU confidence scores. The generic policy 362 may also include the logic for ranking the dialog-state update candidates from the output of dialog state tracker 218 and picking the one to apply (e.g., picking the top-ranked task intent). In particular embodiments, the auxiliary system 140 may have a particular interface for the generic policy 362 that allows decentralized cross-domain policies/business rules (particularly those found in dialog state tracker 218) to be consolidated into the functionality of action selector 222. The interface for the generic policy 362 may also allow the creation of self-contained sub-policy units that can be bound to particular situations or clients (e.g., policy functions that can be easily switched on or off per client or situation). The interface for the generic policy 362 may also allow policy layering (i.e., multiple policy units) with back-off, in which highly specialized policy units that handle a particular situation are backed up by more generic policies 362 that apply to broader situations. In this context, the generic policy 362 may alternatively include intent- or task-specific policies.
In particular embodiments, a task policy 364 may include the logic of action selector 222 based on the task and the current state. Task policies 364 may be dynamic and ad-hoc. In particular embodiments, the types of task policies 364 may include one or more of the following: (1) manually crafted tree-based dialog plans; (2) coded policies that directly implement the interface for generating actions; (3) configurator-specified slot-filling tasks; or (4) machine learning model-based policies learned from data. In particular embodiments, the assistance system 140 may bootstrap new domains with rule-based logic and later refine the task policies 364 with machine learning models. In particular embodiments, the generic policy 362 may select one operator from the candidate operators to update the dialog state, followed by the task policy 364 selecting a user-facing action. Once a task is active in the dialog state, the corresponding task policy 364 may be consulted to select the correct action.
In particular embodiments, action selector 222 may select an action based on one or more of: the events determined by context engine 220, the dialog intent and state, the associated content objects, and the guidance from dialog policy 360. Each dialog policy 360 may subscribe to particular conditions on fields of the state. After an event is processed and the state is updated, action selector 222 may run a fast search algorithm (e.g., similar to Boolean satisfiability) to identify the policies that should be triggered based on the current state. In particular embodiments, if multiple policies are triggered, action selector 222 may use a tie-breaking mechanism to pick a particular policy. Alternatively, action selector 222 may use a more sophisticated approach that may dry-run each policy and then pick the particular policy determined to have a high likelihood of success. In particular embodiments, mapping events to actions may bring several technical advantages to the auxiliary system 140. One technical advantage may be that each event may be a state update from the user or the user's physical/digital environment, which may or may not trigger an action from the auxiliary system 140. Another technical advantage may be the ability to handle rapid bursts of events (e.g., a user entering a new building and seeing many people) by first consuming all events to update the state and then triggering one or more actions based on the final state. Another technical advantage may be consuming all events into a single global auxiliary state.
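The subscription-and-trigger pattern might be sketched like this (the policies, state fields, and tie-breaking rule are all illustrative assumptions):

    # Each policy subscribes to a condition on fields of the global auxiliary state.
    policies = [
        {"name": "reply_to_message", "when": lambda s: s.get("unread_message"),  "priority": 2},
        {"name": "suggest_call",     "when": lambda s: s.get("friend_birthday"), "priority": 1},
    ]

    def select_policy(state):
        # After events update the state, find the triggered policies and break ties.
        triggered = [p for p in policies if p["when"](state)]
        return max(triggered, key=lambda p: p["priority"]) if triggered else None

    print(select_policy({"unread_message": True, "friend_birthday": True})["name"])
    # -> reply_to_message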
In particular embodiments, action selector 222 may take a dialog-state update operator as part of the input for selecting a dialog action. Execution of the dialog action may generate a set of expectations to instruct dialog state tracker 218 how to handle future turns. In particular embodiments, the expectations may be used to provide context to dialog state tracker 218 when handling user input from the next turn. By way of example and not limitation, a slot-request dialog action may have the expectation of validating the value of the requested slot. In particular embodiments, both dialog state tracker 218 and action selector 222 may refrain from changing the dialog state until the selected action is executed. This may allow the auxiliary system 140 to execute dialog state tracker 218 and action selector 222 to process speculative ASR results and to perform n-best ranking with dry runs.
In particular embodiments, action selector 222 may invoke different agents 228 to execute the task. Meanwhile, dialog manager 216 may receive an instruction to update the dialog state. By way of example and not limitation, the update may include awaiting the response of agents 228. An agent 228 may select among registered content providers to complete the action. The data structure may be constructed by dialog manager 216 based on the intent and one or more slots associated with the intent. In particular embodiments, agents 228 may include first-party agents and third-party agents. In particular embodiments, first-party agents may include internal agents (e.g., agents associated with services provided by the online social network, such as a messaging service or a photo sharing service) that are accessible and controllable by the auxiliary system 140. In particular embodiments, third-party agents may include external agents (e.g., a third-party online music application agent, a ticketing agent) that the auxiliary system 140 cannot control. A first-party agent may be associated with a first-party provider that provides content objects and/or services hosted by social-networking system 160. A third-party agent may be associated with a third-party provider that provides content objects and/or services hosted by third-party system 170. In particular embodiments, each of the first-party or third-party agents may be designated for a particular domain. By way of example and not limitation, domains may include weather, traffic, music, shopping, social, video, photos, events, locations, and/or work. In particular embodiments, the auxiliary system 140 may use multiple agents 228 cooperatively in response to a user input. By way of example and not limitation, the user input may include "direct me to my next meeting". The auxiliary system 140 may use a calendar agent to retrieve the location of the next meeting. The assistance system 140 may then use a navigation agent to direct the user to the next meeting.
In particular embodiments, dialog manager 216 may support multi-turn compositional resolution of slot references. For a compositional parse from NLU module 210, the resolver can recursively resolve the nested slots. Dialog manager 216 may also support disambiguation for the nested slots. By way of example and not limitation, the user input may be "remind me to call Alex". The resolver may need to know which Alex to call before creating an actionable reminder to-do entity. When further user clarification is necessary for a particular slot, the resolver may pause resolution and set the resolution state. The generic policy 362 can examine the resolution state and create the corresponding dialog action for user clarification. Dialog manager 216 may update the nested slots based on the user input and the most recent dialog action in dialog state tracker 218. This capability may allow the auxiliary system 140 to interact with the user not only to collect missing slot values but also to reduce the ambiguity of more complex/ambiguous utterances in order to complete the task. In particular embodiments, dialog manager 216 may also support requesting missing slots in nested intents and multi-intent user inputs (e.g., "take this photo and send it to Dad"). In particular embodiments, dialog manager 216 may support machine learning models to achieve a more robust dialog experience. By way of example and not limitation, dialog state tracker 218 may use a neural network-based model (or any other suitable machine learning model) to model beliefs over task hypotheses. As another example and not by way of limitation, for action selector 222, the highest-priority policy units may include white-list/black-list overrides, which may have to occur by design; medium-priority units may include machine learning models designed for action selection; and lower-priority units may include rule-based fallbacks for when the machine learning models choose not to handle a situation. In particular embodiments, a machine learning model-based generic policy unit may help the assistance system 140 reduce redundant disambiguation or confirmation steps, thereby reducing the number of turns needed to execute a user input.
In particular embodiments, the action determined by action selector 222 may be sent to delivery system 230. Delivery system 230 may include a CU editor 370, a response generation component 380, a dialog state writing component 382, and a text-to-speech (TTS) component 390. Specifically, the output of action selector 222 may be received at CU editor 370. In particular embodiments, the output from action selector 222 may be represented as a <k, c, u, d> tuple, where k indicates a knowledge source, c indicates an interaction objective, u indicates a user model, and d indicates an utterance model.
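The <k, c, u, d> output could be represented, for illustration only, as a named tuple (the concrete field values below are invented):

    from collections import namedtuple

    ActionSelectorOutput = namedtuple(
        "ActionSelectorOutput",
        ["knowledge_source", "interaction_objective", "user_model", "utterance_model"])

    output = ActionSelectorOutput(
        knowledge_source="knowledge_graph:weather",
        interaction_objective="inform",
        user_model={"prefers_brief_answers": True},
        utterance_model={"register": "casual"},
    )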
In particular embodiments, CU editor 370 may use natural language generation (NLG) component 372 to generate communication content for the user. In particular embodiments, NLG component 372 may use different language models and/or language templates to generate natural language output. The generation of natural language output may be application-specific. The generation of natural language output may also be personalized for each user. In particular embodiments, NLG component 372 may include a content determination component, a sentence planner, and a surface realization component. The content determination component can determine the communication content based on the knowledge source, the interaction objective, and the user's expectations. By way of example and not limitation, this determination may be based on description logic. Description logic may include, for example, three basic notions: individuals (representing objects in the domain), concepts (describing sets of individuals), and roles (representing binary relations between individuals or concepts). Description logic may be characterized by a set of constructors that allow the natural language generator to build complex concepts/roles from atomic concepts/roles. In particular embodiments, the content determination component may perform the following tasks to determine the communication content. The first task may include a translation task, in which the input to NLG component 372 may be translated into concepts. The second task may include a selection task, in which relevant concepts may be selected from among the concepts generated by the translation task based on the user model. The third task may include a verification task, in which the coherence of the selected concepts may be verified. The fourth task may include an instantiation task, in which the verified concepts may be instantiated as an executable file that can be processed by NLG component 372. The sentence planner may determine the organization of the communication content to make it understandable by humans. The surface realization component can determine the specific words to use, the order of the sentences, and the style of the communication content.
In particular embodiments, CU editor 370 may also use UI payload generator 374 to determine the modality of the generated communication content. Because the generated communication content may be considered a response to the user input, CU editor 370 may also use response sequencer 376 to rank the generated communication content. By way of example and not limitation, the ranking may indicate the priority of the response. In particular embodiments, CU editor 370 may include a natural language synthesis (NLS) component that is separate from NLG component 372. The NLS component can specify attributes of the synthesized speech generated by CU editor 370, including gender, volume, cadence, style, or language domain, in order to customize the response for a particular user, task, or agent. The NLS component can adjust the language synthesis without engaging in the implementation of the associated task. In particular embodiments, CU editor 370 may check the privacy constraints associated with the user to ensure that the generation of the communication content complies with privacy policies.
In particular embodiments, delivery system 230 may perform different tasks based on the output of CU editor 370. These tasks may include writing (i.e., storing/updating) the dialog state into data store 330 using dialog state writing component 382 and generating a response using response generation component 380. In particular embodiments, if the determined modality of the communication content is audio, the output of CU editor 370 may additionally be sent to TTS component 390. In particular embodiments, the output from delivery system 230, which may include one or more of the generated response, the communication content, or the speech generated by TTS component 390, may then be sent back to dialog manager 216.
In particular embodiments, coordinator 206 may determine whether to process user input on client system 130, on a server, or in a third mode of operation (i.e., a hybrid mode) using both client system 130 and the server, based on the output of entity resolution module 212. In addition to determining how to handle user input, the coordinator 206 may also receive results from the agents 228 and/or results from the delivery system 230 provided by the dialog manager 216. The coordinator 206 may then forward these results to the arbiter 226. Arbiter 226 may aggregate the results, analyze the results, select the best result, and provide the selected result to rendering output module 232. In particular embodiments, arbiter 226 may consult dialog policy 360 to obtain guidance in analyzing these results. In particular embodiments, rendering output module 232 may generate a response appropriate for client system 130.
FIG. 4 illustrates an example task-centric flow chart 400 for processing user input. In particular embodiments, the assistance system 140 may not only assist users with voice-initiated experiences but may also assist users with more proactive, multimodal experiences that are initiated upon understanding the user context. In particular embodiments, the assistance system 140 may rely on auxiliary tasks for such purposes. An auxiliary task may be a central concept that is shared across the entire auxiliary stack in order to understand the user's intent, interact with the user and the world, and accomplish the correct task for the user. In particular embodiments, an auxiliary task may be the primitive unit of auxiliary functionality. An auxiliary task may include data fetching, updating some state, executing some command, or a complex task composed of a smaller set of tasks. Completing tasks correctly and successfully to deliver value to the user may be the goal for which the auxiliary system 140 is optimized. In particular embodiments, an auxiliary task may be defined as a function or a feature. If multiple surfaces have identical requirements, an auxiliary task can be shared across those surfaces and thus can be easily tracked. An auxiliary task can also be transferred from one device to another and easily picked up mid-task by another device, because the primitive units are identical. Furthermore, the consistent format of auxiliary tasks may allow developers working on different modules in the auxiliary stack to design around it more easily. In addition, the consistent format of auxiliary tasks may also allow task sharing. By way of example and not limitation, if a user is listening to music on smart glasses, the user may say "play the music on my phone". In the event that the phone has not yet been woken up or does not have a task to execute, the smart glasses may formulate a task that is provided to the phone, which the phone then executes to begin playing the music. In particular embodiments, if the surfaces have different expected behaviors, an auxiliary task may be kept separate for each surface. In particular embodiments, the assistance system 140 may identify the correct task based on user input or other signals of different modalities, conduct a conversation to gather all necessary information, and accomplish the task with action selector 222, implemented either on the server or locally on the result surface. In particular embodiments, the auxiliary stack may include a set of processing components for waking up, recognizing user input, understanding user intent, reasoning about tasks, and completing a task to generate a natural language response with voice.
In particular embodiments, the user input may include speech input. The speech input may be received at ASR module 208 to extract a text transcription from the speech input. ASR module 208 may use statistical models to determine the most likely sequence of words corresponding to a given portion of speech received by the auxiliary system 140 as audio input. The models may include one or more of the following: a hidden Markov model, a neural network, a deep learning model, or any combination thereof. The received audio input may be encoded into digital data at a particular sampling rate (e.g., 16 kHz, 44.1 kHz, or 96 kHz) and with a particular number of bits representing each sample (e.g., 8, 16, or 24 bits).
In particular embodiments, the ASR module 208 may include one or more of the following: a grapheme-to-phoneme (G2P) model, a pronunciation learning model, a personalized acoustic model, a personalized language model (PLM), or an end-pointing model. In particular embodiments, the grapheme-to-phoneme (G2P) model may be used to determine the user's grapheme-to-phoneme style (i.e., what a particular word may sound like when spoken by a particular user). In particular embodiments, the personalized acoustic model may be a model of the relationship between the audio signal and the sounds of the phonetic units in the language. Accordingly, such a personalized acoustic model may identify how the user's speech sounds. Training data (e.g., training speech received as audio input and the corresponding phonetic units that correspond to that speech) may be used to generate the personalized acoustic model. The personalized acoustic model may be trained or refined using the voice of a particular user to recognize that user's speech. In particular embodiments, the personalized language model may then determine the most likely phrase corresponding to the phonetic units identified for a particular audio input. The personalized language model may be a model of the probabilities with which various word sequences may occur in the language. The sounds of the phonetic units in the audio input may be matched to word sequences using the personalized language model, and greater weight may be assigned to word sequences that are more likely to be phrases in the language. The word sequence with the highest weight may then be selected as the text corresponding to the audio input. In particular embodiments, the personalized language model may also be used to predict what words the user is most likely to say given a context. In particular embodiments, the end-pointing model may detect when the end of an utterance is reached. In particular embodiments, based at least in part on the limited computing power of client system 130, the assistance system 140 can optimize the personalized language model at runtime during the client-side process. By way of example and not limitation, the assistance system 140 can pre-compute a plurality of personalized language models for a plurality of possible topics the user may talk about. When the user input is associated with a request for assistance, the assistance system 140 can quickly switch between the pre-computed language models and locally optimize them at runtime based on user activity. Thus, the auxiliary system 140 can conserve computing resources while efficiently identifying the topic associated with the user input. In particular embodiments, the assistance system 140 can also dynamically relearn the user's pronunciation at runtime.
In particular embodiments, the user input may include non-speech input. The non-speech input may be received at context engine 220 for determining events and context from the non-speech input. Context engine 220 may determine multimodal events, including voice/text intents, location updates, visual events, touch, gaze, gestures, activities, device/application events, and/or any other suitable type of event. Voice/text intents may depend on ASR module 208 and NLU module 210. Location updates may be consumed by dialog manager 216 to support various proactive/reactive scenarios. Visual events may be based on persons or objects appearing in the user's field of view. These events may be consumed by dialog manager 216 and recorded in a transient user state to support visual coreference (e.g., resolving "that" in "how much is that shirt?"). For activities (e.g., the user running), a corresponding flag may condition action selector 222. For device/application events, if an application updates the device state, this may be published to the auxiliary system 140 so that dialog manager 216 may use this context (what is currently displayed to the user) to handle reactive and proactive scenarios. By way of example and not limitation, context engine 220 may cause a push notification message to be displayed on a display screen of the user's client system 130. The user may interact with the push notification message, which may initiate a multimodal event (e.g., an event workflow for replying to a message received from another user). Other example multimodal events may include seeing a friend, seeing a landmark, being at home, running, recognizing faces in a photo, starting a call with a touch, taking a photo with a touch, opening an application, and so on. In particular embodiments, context engine 220 may also determine world/social events based on world/social updates (e.g., weather changes, a friend coming online). Social updates may include events to which the user subscribes (e.g., a friend's birthday, posts, comments, other notifications). Dialog manager 216 may consume these updates to trigger proactive actions based on the context (e.g., suggesting that the user call a friend on their birthday, but only if the user is not focused on something else). By way of example and not limitation, a received message may be a social event that may trigger a task of reading the message to the user.
In particular embodiments, the text transcription from ASR module 208 may be sent to NLU module 210. NLU module 210 may process the text transcription and extract the user intent (i.e., intents) and tag the slots or parsing results based on a linguistic ontology. In particular embodiments, the intents and slots from NLU module 210, and/or the events and context from context engine 220, may be sent to entity resolution module 212. In particular embodiments, entity resolution module 212 may resolve entities associated with the user input based on the output from NLU module 210 and/or context engine 220. Entity resolution module 212 may use different techniques to resolve the entities, including accessing user memory from the assistant user memory (AUM) 354. In particular embodiments, the AUM 354 may include user episodic memories that facilitate resolution of the entities by entity resolution module 212. The AUM 354 may be a central location for storing, retrieving, indexing, and searching user data.
In particular embodiments, entity resolution module 212 may provide one or more of the following to dialog state tracker 218: intents, slots, entities, events, context, or user memory. Dialog state tracker 218 may accordingly identify a set of state candidates for the task, interact with the user to gather the information necessary to fill the state, and invoke action selector 222 to complete the task. In particular embodiments, dialog state tracker 218 may include task tracker 410. Task tracker 410 may track the task state associated with an auxiliary task. In particular embodiments, the task state may be a data structure that persists across interaction turns and is updated in real time to capture the state of the task throughout the interaction. The task state may include all current information about task execution, such as parameters, confirmation status, confidence scores, and the like. Any incorrect or outdated information in the task state may lead to task execution failures or errors. The task state may also serve as a set of contextual information for many other components (e.g., ASR module 208, NLU module 210, etc.).
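One plausible (hypothetical) shape for such a task state, kept deliberately simple:

    from dataclasses import dataclass, field

    @dataclass
    class TaskState:
        # Persists across interaction turns and is updated in real time.
        task_id: str
        parameters: dict = field(default_factory=dict)   # resolved parameter values
        confirmation_status: str = "unconfirmed"
        confidence: float = 0.0
        history: list = field(default_factory=list)      # per-turn updates

    state = TaskState(task_id="create_reminder")
    state.parameters["callee"] = "mom"
    state.confidence = 0.73
    state.history.append({"turn": 1, "update": "callee resolved"})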
In particular embodiments, task tracker 410 may include an intent processor 411, a task candidate ranking module 414, a task candidate generation module 416, and a merge layer 419. In particular embodiments, a task may be identified by its ID name. If not explicitly set in the task specification, the task ID may be used to associate the corresponding component assets (e.g., dialog policy 360, agent execution, NLG dialog acts, etc.). Accordingly, the output from entity resolution module 212 may be received by task ID resolution component 417 of task candidate generation module 416 to resolve the task IDs of the corresponding tasks. In particular embodiments, task ID resolution component 417 may call task specification manager API 430 to access the trigger specifications and deployment specifications for resolving task IDs. Given these specifications, task ID resolution component 417 can resolve task IDs using intents, slots, dialog state, context, and user memory.
In particular embodiments, the technical specification of a task may be defined by a task specification. The task specification may be used by the auxiliary system 140 to trigger a task, conduct the dialog session, and find the correct execution module (e.g., agent 228) to execute the task. The task specification may be an implementation of the product requirements document. The task specification may serve as the general contract and requirements that all components agree on. The task specification may be considered a component specification of the product, and all development partners deliver their modules based on the specification. In particular embodiments, an auxiliary task may be defined in the implementation by its specification. By way of example and not limitation, the task specification may be defined in the following categories. One category may be the basic task schema, which includes basic identification information such as the ID, the name, and the schema of the input parameters. Another category may be the trigger specification, which concerns how the task may be triggered, such as intents, event message IDs, etc. Another category may be the dialog specification, which concerns how dialog manager 216 conducts the dialog with users and systems. Another category may be the execution specification, which concerns how the task is to be executed and completed. Another category may be the deployment specification, which concerns how the feature is to be deployed to particular surfaces, locales, and user groups.
In particular embodiments, task specification manager API 430 may be an API for accessing a task specification manager. The task specification manager may be a module in the runtime stack that loads the specification from all tasks and provides an interface to access all task specifications to obtain detailed information or generate task candidates. In particular embodiments, the task specification manager may be accessible to all components in the runtime stack through task specification manager API 430. The task specification manager may include a set of static utility functions to manage tasks with the task specification manager, such as filtering task candidates by the platform. Before the task specification lands, the auxiliary system 140 may also dynamically load the task specification to support end-to-end development during the development phase.
In particular embodiments, task specifications may be grouped by domain and stored in runtime configuration 435. The runtime stack may load all task specifications from runtime configuration 435 during build time. In particular embodiments, in runtime configuration 435, there may be a cconf file and a cinc file for each domain (e.g., sidechef_tasks.cconf and sidechef_tasks.cinc). By way of example and not limitation, <domain>_tasks.cconf may include all the details of the task specifications. As another example and not by way of limitation, <domain>_tasks.cinc may provide a way to override the generated specification if the functionality is not yet supported.
In particular embodiments, task execution may require a set of parameters. Accordingly, parameter resolution component 418 may resolve the parameter names using the parameter specification of the resolved task ID. These parameters may be resolved based on the NLU output (e.g., slot [SL:contact]), the dialog state (e.g., short-term calling history), user memory (e.g., user preferences, location, long-term calling history, etc.), or device context (e.g., timer state, screen content, etc.). In particular embodiments, the parameter modality may be text, audio, image, or other structured data. The mapping from slots to parameters may be defined by a filling policy and/or a language ontology. In particular embodiments, given the task trigger specifications, task candidate generation module 416 may find the list of tasks to be triggered as task candidates based on the resolved task IDs and parameters.
In particular embodiments, the generated task candidates may be sent to the task candidate ranking module 414 for further ranking. The task candidate ranking module 414 may use a rule-based ranker 415 to rank the task candidates. In particular embodiments, the rule-based ranker 415 may include a set of heuristics to favor certain domain tasks. The ranking logic may be described below with contextual priority principles. In particular embodiments, user-specified tasks may be prioritized over foreground tasks. Foreground tasks may be prioritized over device-domain tasks when the intent is a meta intent. Device-domain tasks may have a higher priority than tasks of the triggering intent domain. By way of example and not limitation, if a task domain is mentioned or specified in the utterance (e.g., "create a timer in TIMER app"), the ranking may pick that task. As another example and not by way of limitation, if the task domain is in the foreground or in an active state (e.g., "stop the timer" stops the timer when the timer application is in the foreground and there is an active timer), the ranking may pick that task. As yet another example and not by way of limitation, if the intent is a general meta intent and the task is device control while there is no other active application or active state, the ranking may pick that task. As yet another example and not by way of limitation, if the task is the same as the intent domain, the ranking may pick that task. In particular embodiments, the task candidate ranking module 414 may customize additional logic to check the matching of intent/slot/entity types. The ranked task candidates may be sent to the merge layer 419.
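By way of illustration only, the contextual priority principles above might be expressed as an ordered set of heuristics, each awarding a score tier. The rules and scores in the sketch below are assumptions used to show the shape of such a rule-based ranker, not the actual ranking logic.

from typing import Dict, List

def rank_task_candidates(candidates: List[Dict], context: Dict) -> List[Dict]:
    """Rank task candidates by contextual priority (illustrative heuristics only)."""

    def score(task: Dict) -> int:
        # 1. User-specified tasks (domain named in the utterance) rank highest.
        if task["domain"] in context.get("mentioned_domains", []):
            return 4
        # 2. Foreground/active-state tasks rank next when the intent is a meta intent.
        if context.get("is_meta_intent") and task["domain"] == context.get("foreground_domain"):
            return 3
        # 3. Device-domain tasks (e.g., device control) when nothing else is active.
        if task["domain"] == "device" and not context.get("foreground_domain"):
            return 2
        # 4. Tasks whose domain matches the triggering intent domain.
        if task["domain"] == context.get("intent_domain"):
            return 1
        return 0

    return sorted(candidates, key=score, reverse=True)

# Example: "stop the timer" while the timer app is in the foreground with an active timer.
ranked = rank_task_candidates(
    [{"task_id": "timer.stop", "domain": "timer"},
     {"task_id": "media.stop", "domain": "media"}],
    {"is_meta_intent": True, "foreground_domain": "timer", "intent_domain": "media"},
)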
In particular embodiments, the output from the entity resolution module 212 may be sent to the task ID resolution component 412 of the intent processing module 411. Similar to the task ID resolution component 417, the task ID resolution component 412 may resolve the task IDs of the corresponding tasks. In particular embodiments, the intent processing module 411 may also include a parameter resolution component 413. Similar to the parameter resolution component 418, the parameter resolution component 413 may resolve the parameter names using the parameter specification of the resolved task ID. In particular embodiments, the intent processing module 411 may handle features that are task-independent and may not be expressed within task-specific task specifications. The intent processing module 411 may output state candidates other than task candidates (e.g., parameter updates, validation updates, disambiguation updates, etc.). In particular embodiments, some tasks may require very complex trigger conditions or very complex parameter-filling logic that may not be reusable by other tasks, even if such tasks were supported in the task specification (e.g., voice commands in a call, media tasks via [IN:PLAY_MEDIA], etc.). The intent processing module 411 may also be suitable for this type of task. In particular embodiments, the results from the intent processing module 411 may be prioritized over the results from the task candidate ranking module 414. The results from the intent processing module 411 may also be sent to the merge layer 419.
In particular embodiments, merge layer 419 may combine results from intent processing module 411 and results from task candidate ranking module 414. The dialog state tracker 218 may suggest each task as a new state from which the dialog policy 360 is to select, thereby generating a list of state candidates. The combined results may be further sent to a dialog understanding enhancement engine (conversational understanding reinforcement engine, CURE) tracker 420. In particular embodiments, CURE tracker 420 may be a personalized learning process as follows: the personalized learning process is used to improve the determination of state candidates by the dialog state tracker 218 in different contexts using real-time user feedback.
In particular embodiments, the state candidates generated by CURE tracker 420 can be sent to action selector 222. Action selector 222 may consult task policy 364 and task policy 364 may be generated from the execution specification accessed through task specification manager API 430. In particular embodiments, the execution specification may describe how the task should be performed, and what actions the action selector 222 may need to take to complete the task.
In particular embodiments, action selector 222 may determine an action associated with the system. These actions may require the agent 228 to perform. Thus, the action selector 222 may send system actions to the agent 228, and the agent 228 may return the results of execution of those actions. In particular embodiments, the action selector may determine an action associated with the user or the device. These actions may need to be performed by the delivery system 230. Thus, the action selector 222 may send user/device actions to the delivery system 230, and the delivery system 230 may return the results of execution of those actions.
Embodiments disclosed herein may include or be implemented in conjunction with an artificial reality system. Artificial reality is a form of reality that has been adjusted in some manner before presentation to a user, which may include, for example, virtual reality (VR), augmented reality (AR), mixed reality (MR), hybrid reality, or some combination and/or derivative thereof. Artificial reality content may include entirely generated content, or generated content combined with captured content (e.g., real-world photographs). The artificial reality content may include video, audio, haptic feedback, or some combination thereof, and any of the above may be presented in a single channel or in multiple channels (e.g., stereoscopic video that produces a three-dimensional effect for the viewer). Further, in some embodiments, artificial reality may also be associated with applications, products, accessories, services, or some combination thereof, that are, for example, used to create content in an artificial reality and/or used in (e.g., to perform activities in) an artificial reality. The artificial reality system that provides the artificial reality content may be implemented on a variety of platforms, including a head-mounted display (HMD) connected to a host computer system, a standalone HMD, a mobile device or computing system, or any other hardware platform capable of providing artificial reality content to one or more viewers.
Text editing using voice input and gesture input
In particular embodiments, when a mouse or other fine pointer is not available on the client system 130 to select a word or text segment, the assistance system 140 may enable a user to edit a message using speech and gestures. In alternative embodiments, the assistance system 140 may also enable the user to edit messages using speech and gestures in combination with normal pointer inputs. The auxiliary system 140 may provide several functions to the user for editing a message. The first function may be quick-clear editing, in which the user may swipe after entering the initial message, before sending it, to clear the entire message. The auxiliary system 140 may then prompt the user to enter a new message without the user having to speak the wake-up word again. The second function may be two-step voice editing. With two-step voice editing, the user may enter an initial message such as "tell Kevin I'll be there in 10" and then want to change it by saying "I want to change it". The auxiliary system 140 may then prompt the user to say what they want to change. For example, the user may say "change the time" or "change the time to 20". The auxiliary system 140 may then look up the reference to "time" in the initial message and change it to "20". With one-step voice editing, the user can directly say "change the time to 20" without first telling the auxiliary system 140 that he/she wants to edit the message, and the auxiliary system 140 can automatically recognize the content to be changed. The assistance system 140 may also use n-gram editing or block editing to enable a user to edit a message by splitting the editing of a large block of message text in the display of the client system 130 into blocks that are accessible by voice or gestures. The assistance system 140 can intelligently divide the user's dictation into common phrases ("n-grams") and/or chunks, which can allow easier selection by voice or gesture. For example, if the user says "be there in 20" but wants to change it, the auxiliary system 140 can split the message into two n-gram blocks, [be there] and [in 20]. The user can then use a gesture to select [in 20] and say "in 30" to change it while the microphone of the client system continues to listen to the user. Instead of n-gram editing or block editing, the assist system 140 can place a sequence of numbers over the words in the user's dictation upon receiving a request from the user to change it. The user can then easily reference individual words to change them. In combination with the editing methods described above, the auxiliary system 140 may use gaze as an additional signal to determine when a user wants to enter text and/or edit the entered text. The auxiliary system 140 may therefore have the technical advantage of improving the user experience of editing dictated text, as the auxiliary system 140 may provide various functions that enable a user to conveniently edit text. Although this disclosure describes editing particular messages by particular systems in a particular manner, this disclosure contemplates editing any suitable message by any suitable system in any suitable manner.
In a particular embodiment, the auxiliary system 140 may present a text message based on the user utterance received at the client system 130 through a user interface of the client system 130. The text message may include a plurality of n-grams. The auxiliary system 140 may then receive a first user request to edit the text message at the client system 130. In particular embodiments, the auxiliary system 140 may present a text message visually divided into a plurality of blocks through a user interface. Each block may include one or more of a plurality of n-grams of the text message. In particular embodiments, the plurality of n-grams in each block may be contiguous with respect to each other and grouped within the block based on analysis of text messages by a Natural Language Understanding (NLU) module. The auxiliary system 140 may then receive a second user request at the client system 130 to edit one or more of the plurality of blocks. In particular embodiments, the auxiliary system 140 may also present the edited text message via a user interface. The edited text message may be generated based on the second user request.
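By way of illustration only, the five steps recited above might be arranged as a simple handler loop. The sketch below is a hedged approximation; the callables wait_for_request, nlu_chunk, and present are hypothetical placeholders for client-system and NLU functionality.

from typing import Callable, List

def edit_flow(utterance_text: str,
              wait_for_request: Callable[[], dict],
              nlu_chunk: Callable[[str], List[str]],
              present: Callable[[str], None]) -> str:
    """Illustrative message-editing flow: present, split into blocks, apply an edit."""
    present(utterance_text)                      # 1. present the dictated text message
    first_request = wait_for_request()           # 2. first user request, e.g., "I want to change it"
    if first_request.get("type") != "edit":
        return utterance_text
    blocks = nlu_chunk(utterance_text)           # 3. visually divide the message into NLU-based blocks
    present(" | ".join(blocks))
    second_request = wait_for_request()          # 4. second request targets one or more blocks
    target = second_request["block_index"]
    blocks[target] = second_request["replacement"]
    edited = " ".join(blocks)
    present(edited)                              # 5. present the edited text message
    return edited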
Editing a free-form text field using speech can be a challenge for users. The auxiliary system 140 may have many different problems to solve and balance to ensure accurate, low-effort interactions with the user. A first problem may be that most errors may come from the system (e.g., in the form of ASR recognition errors) rather than from the user pressing a wrong key, so users may be less forgiving, especially in high-pressure situations such as sending a message to another person. By way of example and not limitation, a user may say "Thai spray" but be misrecognized as the more common word "typhoon". A second problem may be how the user uses speech to change what has been entered. Modern keyboards (e.g., on mobile phones) may have a large number of features and context menus to provide comprehensive support for text entry and editing, which compounds the already serious discoverability issues of voice. A third problem may be that editing text entered by voice dictation may be positioned as an accessibility feature of the client system 130, which may be based on the assumption that the transcribed text is easier to edit with a keyboard than with voice commands. Most messages may be very short (e.g., text messages are typically less than five words), and personal preference appears to determine whether to re-enter the text of the message or edit the existing text, so both paths need to be supported.
In particular embodiments, to address the foregoing, the assistance system 140 may enable the user to make three levels of natural language based edits while the user is attempting to send a message to other people. The assistance system 140 can use the interaction model to edit messages of various lengths for the user through multimodal inputs such as speech and gestures. In particular embodiments, when a user wants to edit a message, the assistance system 140 may need to interact with the user as follows. The user may enter an edit mode, target text segments, select text segments, enter text (e.g., "delete" text, "replace" text, and "insert" text into an existing message), and exit the edit mode to send updated messages. In particular embodiments, the auxiliary system 140 may enable a user to easily enter and leave the editing mode. The auxiliary system 140 may give the user a variety of ways to edit their message and then send it, with exits appropriate for their entry paths. In this case, the auxiliary system 140 may have a unique behavior for the global "stop" voice command such that the auxiliary system 140 stops the sending of the message without completely exiting the flow, which is the primary system response to the user's "stop" utterance. In particular embodiments, using gestures (e.g., determining a target and selecting) and speech (e.g., input) plus gestures or speech may cover most cases of message editing with minimal effort or new interactions from the user.
In particular embodiments, the auxiliary system 140 may support multimodal input from a user as the user interacts with the auxiliary system 140 to edit a message. One or more of the first user request or the second user request may be based on one or more of: voice input, gesture input, or gaze input. By way of example and not limitation, a user may give a voice command to make any changes. As another example and not by way of limitation, a user may directly select an action button or scroll through a list/dial with a gesture to edit a message. As yet another example and not by way of limitation, a user may edit a message using a pinch gesture to select a button and other items. As yet another example and not by way of limitation, a user may select buttons and other items by looking at them and then tapping a finger to edit a message. As yet another example and not by way of limitation, a user may use gaze with speech to edit a message. In particular embodiments, the assistance system 140 may use a model based on a mixture of gestures and speech to enable the user to edit the message, including in the messaging platform rather than creating a message in the assistance system 140. This may include simple word selection in addition to fine-grained cursors. The user may then, for example, pinch the finger to drop the cursor or drag the cursor between words.
In particular embodiments, gaze-based editing may be complementary to speech and gestures (e.g., in an AR/VR system). Using gaze may enable the assistance system 140 to evolve beyond the following scenarios: the user must wake up the auxiliary system 140 using a wake-up word or a manual action (e.g., tapping a button). It can be boring for a user to constantly speak a wake-up word or tap on the same button whenever the user interacts with the auxiliary system 140, particularly if the user is in a longer dialogue session with the auxiliary system 140. With gaze, the assistance system 140 may enable more natural and human-like interactions for the user (e.g., waking up the assistance system 140) by tracking the user's eye gaze in the field of view of their display and allowing the user to speak when they have frozen their gaze on the source. Thus, the user may not need to speak the wake-up word. Alternatively, the user may focus his gaze on the auxiliary icon corresponding to the auxiliary system 140 and then begin speaking his request. In particular embodiments, the assistance system 140 may use gaze as an additional signal to determine when a user wants to enter text and/or edit the entered text. By way of example and not limitation, when the user's gaze is focused on a field, the assistance system 140 may prompt the user to dictate their utterance to enter text into the field. If the user indicates that they want to edit, the user's gaze at a particular portion of text may be used by the auxiliary system 140 as a signal to determine to prompt the user to edit the content. Using the gaze input of the user may be an effective solution to the following technical challenges: distinguishing between a user's voice interaction with the auxiliary system 140 and another person, because the user's voice input may be more likely to be directed to the auxiliary system 140 when the user looks at the auxiliary system 140 (e.g., user interface) while speaking. In a particular embodiment, one or more of the first user request or the second user request may include voice input from the first user of the client system 130. The assistance system 140 may detect a second user proximate to the first user based on sensor signals acquired by one or more sensors of the client system 130. Thus, the assistance system 140 may determine to direct the first user request and the second user request to the client system 130 based on one or more gaze inputs of the first user. By way of example and not limitation, if the user is with another person, the auxiliary system 140 may have difficulty determining with whom the user is talking (whether the person is also the auxiliary system 140). This may frustrate the user if the auxiliary system 140 responds to the user's voice while the user is actually talking to another person. With gaze, the assistance system 140 may respond to the user only when the user freezes the gaze on the assistance icon on their display. When the user gazes away from the secondary icon, the secondary system 140 may not provide a prompt, but rather listen to commands related to message editing.
In particular embodiments, the auxiliary system 140 may edit the text message based on the second user request. The assistance system 140 may provide different functions for editing message text using different combinations of speech, gestures, and gaze. One function may be quick clear editing. In particular embodiments, the second user request may include a gesture input that wants to clear the text message. Thus, editing one or more of the plurality of blocks may include clearing an n-gram corresponding to the one or more blocks. Because editing individual portions of a message may be difficult, it may be easier to reconstruct the entire message. In particular embodiments, auxiliary system 140 may implement a single, quick hint for clearing message content. In particular embodiments, when the user requests or selects a change message via voice, the assistance system 140 can allow the user to quickly clear the entire message to restart and activate the hand to determine the target and make the selection via a gesture. This approach may be quick and the user may feel that they are in control. It can also highlight how speech can become an input accelerator for short messages by providing a restarted experience. By way of example and not limitation, this may apply to the following cases: when the message is short, when the user is moving, when the user's hand has not been activated, and when there is a high confidence in ASR transcription. In other cases, the assistance system 140 may otherwise activate the hand and highlight the potential ASR error using an accessible submenu to quickly correct and add a "clear" graphical user interface (graphical user interface, GUI) hint, or support it as a voice command. With quick clear editing, the user may perform rough left-right swipes (either way), (e.g., on the "clear" button), perform gesture selections, or provide voice input (e.g., say "clear") to clear the entire message of user input before sending. In order not to accidentally clear a message when the user performs a random hand movement that is not used to clear the message, the assistance system 140 can determine that the gesture input intends to clear the text message by way of a gesture classifier based on one or more attributes associated with the gesture input. By way of example and not limitation, these attributes may include range, speed, position relative to client system 130, movement, and the like. In other words, the gesture classifier may be used to analyze the gesture to ensure that it is a purposeful gesture (e.g., with a normal swipe extension) for quick clear editing. It can be seen that the assistance system 140 can provide multiple paths for the user to quickly clear the entire message and then re-transcribe their message by voice. In particular embodiments, the auxiliary system 140 may then prompt the user to enter a new message without requiring the user to speak the wake-up word again. The prompt may also be considered feedback to the user that the auxiliary system 140 is again listening to the user's utterance. In terms of the interactive cost of switching modalities during a task, prompting voice input after clearing the content field with gestures or a side-to-side swipe may feel quick and relatively low energy. Alternatively, the assistance system 140 may prompt the user to confirm that they want to clear the entire message. After the message is cleared, the auxiliary system 140 may automatically turn on the microphone to listen for the replacement message.
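By way of illustration only, the purposeful-swipe check described above might be approximated by a simple classifier over gesture attributes such as extent, speed, and position. The feature names and thresholds below are assumptions, not the trained gesture classifier contemplated here.

from dataclasses import dataclass

@dataclass
class GestureFeatures:
    extent_cm: float       # how far the hand travelled
    speed_cm_s: float      # average speed of the swipe
    over_text_field: bool  # whether the gesture occurred over the message field
    horizontal: bool       # left-right movement (either direction)

def is_quick_clear_swipe(g: GestureFeatures) -> bool:
    """Return True only for a deliberate left-right swipe over the text field
    (illustrative thresholds, not a trained classifier)."""
    return (g.over_text_field
            and g.horizontal
            and g.extent_cm >= 8.0      # normal swipe extension, not a random twitch
            and g.speed_cm_s >= 15.0)   # fast enough to be purposeful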
The following may be an example workflow for quick-clear editing. The auxiliary system 140 may ask, for example, by speaking "Got it. Send or change this?" The user may pinch to select a button to clear the entire message, which may restart the message-content creation flow with a slightly updated prompt. For example, the auxiliary system 140 may ask "What's the new message?", and the user can then dictate the new message in its entirety, e.g., "I'll be there in thirty." The auxiliary system 140 may, for example, say "Updated. Send this?" The user may confirm (e.g., by speaking) whether to send the updated message. If not, they return to the standard message-editing flow.
The following may be another example workflow for quick-clear editing. The auxiliary system 140 may ask the user whether they want to "send or change it" ("it" = their message). For example, the auxiliary system 140 may say "Got it. Send or change this?" The auxiliary system 140 may then clear the message so that it can be quickly rewritten (rather than extending a multi-turn conversation to fill the slot that needs to be changed). For example, the auxiliary system 140 may say "Sure. What's the new message?", and the user can then dictate the new message in its entirety, e.g., "I'll be there in thirty." The auxiliary system 140 may, for example, say "Updated. Send this?" The user may confirm (e.g., by speaking) whether to send the updated message. If they do not, they return to the standard editing flow.
Another function may be two-step voice editing, which may require enabling three editing modes, including pausing the flow, two-step correction, and one-step correction. In particular embodiments, the auxiliary system 140 may enable the user to pause the flow, for example, by saying "I want to change it". By way of example and not limitation, the user may enter an initial message such as "Hey Assistant, tell Kevin I'll be there in twenty". The user may then want to change it by saying "I want to change it". This voice command may pause the flow of the user's dictation to the auxiliary system 140. The auxiliary system 140 may then prompt the user to say what they want to change. The auxiliary system 140 may present, through the user interface, a prompt for entering the second user request. The second user request may include information for editing the one or more blocks.
In particular embodiments, the auxiliary system 140 may enable the user to make a two-step correction, for example, by saying "change the time". Upon receiving this voice command, the auxiliary system 140 may respond by asking the user how they want to change it. The user may then say "change it to thirty". Upon receiving this voice command, the auxiliary system may disambiguate "it", look up the reference to "time" in the initial message, and change it to "thirty". It can be seen that in this scenario, editing the message with speech requires two steps from the user.
In particular embodiments, the auxiliary system 140 may support one-step correction by voice when the user starts deleting, replacing, or inserting message content by voice. The assistance system 140 can use a basic "select/highlight from <x> to <y>" model for the user (or in an accessibility mode) to make a voice selection within the text. This approach may be quick and may feel natural if the user starts the input by voice, without having to switch input modalities. In particular embodiments, the assistance system 140 may enable the user to make a one-step correction, for example, by saying "change <original text> to <new text>". The user may respond in this way to the prompt "Got it. Send it or change it?" Any voice or gesture interaction with the auxiliary system 140 may stop the automatic sending of the message. The auxiliary system 140 can properly end-point the slots, parse the <original text>, match it to content in the message, and replace the match with the <new text>. Continuing with the previous example of changing the arrival time, the user may say "change it to thirty" or "change in twenty to in thirty". Upon receiving the voice command, the auxiliary system may disambiguate "it", look up the reference to "time" in the initial message, and change it to "thirty". It can be seen that no follow-up from the auxiliary system 140 is required with this approach, allowing the user to edit the message using speech in only one step. The auxiliary system 140 may then confirm with the user whether to send the updated message. For example, the auxiliary system may say "Updated. Send this?" If they do not, they return to the standard editing flow.
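By way of illustration only, a one-step correction such as "change in twenty to in thirty" might be handled by parsing the command and substituting the matched span. The regular-expression grammar below is an assumed simplification of the NLU-based matching described above.

import re

CHANGE_PATTERN = re.compile(r"^change (?P<original>.+?) to (?P<new>.+)$", re.IGNORECASE)

def apply_one_step_correction(message: str, command: str) -> str:
    """Apply a 'change <original text> to <new text>' command to a dictated message."""
    match = CHANGE_PATTERN.match(command.strip())
    if not match:
        return message                                  # not a one-step correction
    original, new = match.group("original"), match.group("new")
    if original.lower() in message.lower():
        # Case-insensitive replacement of the first matching span.
        start = message.lower().index(original.lower())
        return message[:start] + new + message[start + len(original):]
    return message

print(apply_one_step_correction("be there in twenty", "change in twenty to in thirty"))
# -> "be there in thirty"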
In particular embodiments, the second user request may include a voice input referencing the one or more blocks, but the reference to the one or more blocks in the second user request may be an ambiguous reference. Thus, the assistance system 140 can disambiguate the ambiguous reference based on a speech similarity model. In particular, the assistance system 140 can use the speech similarity model to determine a confidence score for the recognized text of the user input, which can be further used to determine which portion of the message the user wants to change. By way of example and not limitation, a user may want to change "fifteen" to "fifty" in an entered message. The auxiliary system 140 can see that the ASR module 208 has low confidence that "fifteen" is the correct ASR transcription of the user's voice input. The assistance system 140 may then use the speech similarity model to determine that "fifteen" is similar in sound to "fifty", such that this word is likely the word the user wants to change. Using a speech similarity model for disambiguation may be an effective solution to the technical challenge of disambiguating ambiguous references to a text segment in the user's voice input, because such a model may determine a confidence score for the recognized text of the user input, which may be further used to determine which text segment (e.g., a segment with a low confidence score) the user wants to change.
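By way of illustration only, such a similarity check might be approximated by combining ASR word confidences with a string-similarity score (standing in here for a learned speech similarity model). The weighting below is an assumption; the sketch uses Python's standard difflib.

from difflib import SequenceMatcher
from typing import List, Tuple

def find_likely_target(asr_words: List[Tuple[str, float]], spoken_replacement: str) -> int:
    """Return the index of the word most likely referred to by an ambiguous edit.

    asr_words: (word, asr_confidence) pairs for the transcribed message.
    spoken_replacement: the new word the user said, e.g., "fifty".
    Words that sound/spell similarly to the replacement and that the ASR was
    unsure about are favored (illustrative weighting only)."""
    def score(word: str, confidence: float) -> float:
        similarity = SequenceMatcher(None, word.lower(), spoken_replacement.lower()).ratio()
        return similarity + (1.0 - confidence)   # low-confidence words are better candidates

    scores = [score(w, c) for w, c in asr_words]
    return max(range(len(asr_words)), key=scores.__getitem__)

words = [("be", 0.98), ("there", 0.97), ("in", 0.99), ("fifteen", 0.55)]
print(find_likely_target(words, "fifty"))   # -> 3 (the word "fifteen")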
In addition to using a speech similarity model to determine what to change, the auxiliary system 140 can run the message through the NLU module 210 for two-step corrections and one-step corrections to understand the message in the user's context. Continuing with the example of "change the time to twenty", the NLU module 210 may allow the auxiliary system 140 to determine what "the time" refers to. To protect privacy, the NLU module 210 may begin parsing an entered message only after the user requests a change to the message.
In particular embodiments, the auxiliary system 140 may be optimized for faster interactions during message editing. Because of the limited space on the display of the compact client system 130 and to avoid taking up too much field of view when the user walks or otherwise multitasking in a real world or AR environment, the assistance system 140 may also keep a minimal Graphical User Interface (GUI). Thus, the auxiliary system 140 can stream ASR transcription directly into the NLU module 210 to begin extracting semantics while the user is still dictating a message. The auxiliary system 140 may display a portion of the graphical user interface while the user is still speaking. As the ASR stream proceeds, the assist system 140 can update the graphical user interface along with the NLU update.
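By way of illustration only, the streaming behavior described above might be sketched as consuming partial ASR transcripts, running incremental NLU on each one, and refreshing a minimal GUI. The generator of partial transcripts and the incremental_nlu and render_gui callables are hypothetical placeholders.

from typing import Callable, Iterable

def stream_dictation(partial_transcripts: Iterable[str],
                     incremental_nlu: Callable[[str], dict],
                     render_gui: Callable[[str, dict], None]) -> dict:
    """Consume streaming ASR output and update NLU semantics and the GUI while the user speaks."""
    semantics: dict = {}
    text = ""
    for text in partial_transcripts:          # e.g., "be", "be there", "be there in twenty"
        semantics = incremental_nlu(text)     # start extracting semantics before dictation ends
        render_gui(text, semantics)           # keep the GUI minimal but live-updating
    return {"text": text, "semantics": semantics}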
In particular embodiments, the assistance system 140 may provide another function, n-gram/block editing, by splitting the editing of a large block of an entered message into n-grams/blocks accessible by speech, gesture, or gaze. Each of the plurality of blocks may be visually delimited using one or more of: a geometric shape, a color, or an identifier. In particular embodiments, editing the text message may include: changing one or more of the plurality of n-grams in each of the one or more blocks to one or more other n-grams, respectively. Alternatively, editing the text message may include: adding one or more n-grams to each of one or more of the one or more blocks. Another way of editing the text message may include: changing the order associated with the n-grams in each of the one or more blocks. In particular embodiments, the assistance system 140 can intelligently divide the user's dictation into common phrases ("n-grams") and/or blocks containing low-confidence words. Such grouping may allow easier selection by gesture, voice, or gaze. The user may then use their eyes to target these editable n-grams/blocks and speak directly into or between them, or the user may use their hands to remove or rearrange them. By way of example and not limitation, a user may say "be there in twenty". The user may then say "I want to change it", upon which the auxiliary system 140 may split the message into two blocks, namely [be there] and [in twenty]. The user may then highlight a block with different types of commands, such as a gesture touching [in twenty], a voice input of "in twenty", or a gaze at [in twenty]. The auxiliary system 140 may then turn on the microphone without the user speaking a wake-up word, waiting for the user's instruction on what to change. The user may then say "in thirty" to change it. N-gram/block editing may be particularly suitable for short messages. In addition, n-gram/block editing may transform text into similarly manipulable objects, thereby bringing targeting, selection, and input closer to the primary interaction model of the client system 130.
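By way of illustration only, the chunking step might be approximated by grouping the dictated words into short contiguous phrases, for example starting a new block at each function word. This heuristic is an assumption standing in for the NLU module's analysis.

from typing import List

FUNCTION_WORDS = {"in", "at", "on", "to", "by", "for", "with"}

def chunk_dictation(words: List[str]) -> List[str]:
    """Split dictated words into contiguous n-gram blocks, e.g.
    ["be", "there", "in", "twenty"] -> ["be there", "in twenty"]."""
    blocks: List[List[str]] = []
    for word in words:
        # Start a new block at a function word (a rough proxy for a phrase boundary).
        if not blocks or word.lower() in FUNCTION_WORDS:
            blocks.append([word])
        else:
            blocks[-1].append(word)
    return [" ".join(block) for block in blocks]

print(chunk_dictation(["be", "there", "in", "twenty"]))   # -> ['be there', 'in twenty']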
In particular embodiments, the assistance system 140 may use eye gaze for targeting and selection and voice as the input to enable n-gram/block editing. In other words, the second user request may include one or more gaze inputs directed to the one or more blocks. This approach may be useful for short messages and may be hands-free. By way of example and not limitation, the auxiliary system 140 may ask, for example, by speaking "Got it. Send or change this?" The auxiliary system 140 may then split the message into n-grams or blocks to be edited by the user, rather than having the user edit the entire message. The user may move their gaze onto one of the n-grams or blocks to select it (e.g., [in twenty]). Once it is selected, the microphone may be turned on. The user may then say, for example, "in thirty". The user may continue speaking while the microphone is on to send the message, for example by saying "send".
In particular embodiments, the assistance system 140 may enable n-gram/block editing using gestures only, i.e., the first user request may be based on a gesture input. Accordingly, the assistance system 140 may present, through the user interface, a gesture-based menu that includes selection options for the plurality of blocks to be edited. The second user request may then include: selecting, based on one or more gesture inputs, one or more of the selection options corresponding to the one or more blocks. Such an approach may be useful for short messages and allows easy targeting, selection, and input (especially without requiring additional steps to achieve continuous dictation from the user). Using this approach, switching modalities may feel seamless. This approach may also be combined with a voice-only option (e.g., no eye gaze). By way of example and not limitation, the auxiliary system 140 may ask "Got it. Send or change this?" The user may pinch to make a selection. The user can then say "change it" or pinch to select the "edit" button displayed in the GUI. In particular embodiments, the user may use their hands to activate hand tracking, for example by moving the hands to activate them. The user can then move a finger to target the n-gram/block (e.g., [in twenty]) and pinch it to select it. The user can select and hold to turn on the microphone and begin dictation. The user may hold the gesture and say "in thirty" instead of "in twenty". The assistance system 140 may then update the message content. The user may then move the finger to target the send button. The user may then choose to send the message, for example by pinching to select the "send" button.
In particular embodiments, the assistance system 140 may enable a user to quickly and accurately navigate to the piece of text that the user wants to edit by using a numbering mechanism or other visual indicators (colors, symbols, etc.). In particular, the plurality of blocks may be visually delimited using a plurality of identifiers, respectively. By way of example and not limitation, the plurality of identifiers may include numbers, letters, or symbols. The second user request may include one or more references to one or more identifiers of one or more corresponding blocks. For example, the auxiliary system 140 may overlay a sequence of numbers or other visual indicators on the words (n-grams) or blocks. When the user wants to change the entered message (e.g., "I'm running twenty minutes late"), the auxiliary system 140 can add a number (e.g., 1. I'm 2. running 3. twenty 4. minutes 5. late) or another visual indicator (color, symbol, etc.) on each word or block. These numbers or visual indicators may provide a simple way for the user to reference a single word or block. In particular embodiments, the user may speak the number or visual indicator to edit the corresponding word/block. For example, the user may swap number 2 with number 4, or replace number 2 with another word. This may be a method for accurately targeting and navigating within a phrase. In particular embodiments, a number grid may be overlaid on the text. When the grid appears, the user may speak commands, such as "swap 2 and 4". In certain embodiments, a confirmation may pop up along with the grid. In particular embodiments, the auxiliary system 140 may use these numbers if there is only a single line of perhaps only a few words. However, if the text is longer, including several sentences, the assist system 140 may first organize it by sentence. If it is a long document, the auxiliary system 140 may organize it by paragraph and sentence. By way of example and not limitation, the user may use "a1" to locate the first word of the first sentence of a paragraph. Using a combination of voice input, gesture input, gaze input, and visual indicators of the blocks may be an effective solution to the technical challenge of efficiently and accurately locating the text segment that the user wants to edit, because these different inputs may complement each other to increase the accuracy of determining which text segment the user wants to edit, while the visual indicators may help the user easily target such a text segment using the different inputs.
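By way of illustration only, the numbering mechanism might be sketched as overlaying indices on the dictated words and interpreting commands such as "swap 2 and 4" or "replace 3 with thirty". The command grammar below is an assumption made for illustration.

import re
from typing import List

def number_words(words: List[str]) -> str:
    """Render the dictation with a number attached to each word."""
    return " ".join(f"{i}.{w}" for i, w in enumerate(words, start=1))

def apply_numbered_command(words: List[str], command: str) -> List[str]:
    words = list(words)
    if m := re.match(r"swap (\d+) and (\d+)", command, re.IGNORECASE):
        i, j = int(m.group(1)) - 1, int(m.group(2)) - 1
        words[i], words[j] = words[j], words[i]
    elif m := re.match(r"replace (\d+) with (.+)", command, re.IGNORECASE):
        words[int(m.group(1)) - 1] = m.group(2)
    return words

message = ["I'm", "running", "twenty", "minutes", "late"]
print(number_words(message))                       # 1.I'm 2.running 3.twenty 4.minutes 5.late
print(apply_numbered_command(message, "replace 3 with thirty"))
# -> ["I'm", 'running', 'thirty', 'minutes', 'late']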
In particular embodiments, the assistance system 140 may allow a user to easily edit long messages using different functions as mentioned above. This may improve the functional balance with the user's existing experience and may be valuable for other use cases (e.g., taking notes and writing documents). The assistance system 140 may enable the user to edit long messages using pinching, dragging, and voice. This approach may have fine granularity control that is accurate to the alphabetic level and allows editing multiple pieces of a large text string.
The following may be an example user interaction with the auxiliary system 140 when editing a long message. After entering the edit mode, the user may see a text field and controls that allow them to exit. The user can point at the text and at the controls with their hand. The user may pinch to activate a cursor in the text. Depending on what is in focus, a pinch gesture may activate a cursor (for "insert") or select a word. The user may then hold the pinch gesture and move the hand to select words. Alternatively, the user may pinch and select each word instead of pinching, holding, dragging, and releasing. The user may make the selection without having to trace over the text. The selection may be automatically calculated as the hand moves in any direction. The user may also release the pinch after selecting words and then view the context menu that is shown. By way of example and not limitation, there may be three options on the menu, including "delete", "voice input", and "keyboard". The user may then move the hand to point at the voice input option on the menu. The user may then pinch once to select that option. The user may then speak the new message to the auxiliary system 140 and see the new message transcribed in real time with a distinct visual treatment. In particular embodiments, all other controls may be hidden in this state to allow the user to focus and to minimize their cognitive load. The user can then see the updated message after it is completely transcribed. The auxiliary system 140 may now return to a state in which the user can perform another edit or exit the edit mode.
The following may be another example of user interaction with the auxiliary system 140 when editing a long message. The user may edit using gaze, taps on the frame of smart glasses, and speech. This approach may allow continued visual attention on the message (i.e., the user does not need to attend to their hands), may provide granular control that is accurate to the letter level, may allow insertion, and may allow editing multiple pieces of a large text string. After entering the edit mode, the user may see a text field and controls that allow them to exit. They can also see a visual cue controlled by eye gaze. The user may use eye gaze to control the visual cue (e.g., to switch between highlighting or emphasizing a word and highlighting or emphasizing a cursor). While looking at a word, the user can tap the glasses frame to activate selection using the "cursor" from the previous step. The user may keep a finger on the frame and move their eyes to look at the ending word to make a selection. The user may not have to use eye gaze to trace over the text. The selection may be automatically calculated as the eyes move in any direction. The user may then release the finger from the frame after selecting the words and then see the context menu that is shown. The user may look at the voice input option on the menu. The user may tap once on the glasses frame while looking at that option to select it. The user may speak the new message to the auxiliary system 140 and see the new message transcribed in real time with a distinct visual treatment. In particular embodiments, all other controls may be hidden in this state to allow the user to focus and to minimize their cognitive load. The user can see the updated message after it is completely transcribed and may then exit the edit mode. The auxiliary system 140 may now return to a state in which the user can perform another edit or exit the edit mode.
In particular embodiments, the auxiliary system 140 may provide an online error checker. This approach may be particularly suitable for word replacement. It may make the switch from gesture to voice feel natural (as the auxiliary system 140 may prompt the user through a text-to-speech utterance) and may be faster than using gestures alone. By way of example and not limitation, the auxiliary system 140 may ask "Got it. Send or change this?" The user may pinch to make a selection. The auxiliary system 140 can highlight hypothesized ASR errors in the transcribed text. The user may point to focus on a point of interest in the text, for example, using a finger to target the underlined text. The user may then pinch to select, which may open the n best hypotheses (e.g., n may be 1 to 3). At this point, the auxiliary system 140 may turn off the microphone. The user may point at one of the suggested corrections to target it. The user may then select one of the options, for example by pinching. The auxiliary system 140 may then swap in the selected option. The user may also point at the "send" button to target it. The user may then "select" to send the message, for example by pinching.
In particular embodiments, the auxiliary system 140 may use some basic interaction rules while enabling the user to edit the message. When the user focuses on a field using a gesture, this may call up a gesture-forward context menu (e.g., if the user taps on a text message using a gesture, a follow-up menu with gesture selection options may pop up). In this way, the user can stay in the same modality. However, if the user gestures outside the focused field, this may remove the focus, turn off the microphone (whichever field is in focus at the time), and stop any automatic sending that is in progress. Another rule may be that if there is a gesture input before sending, the auxiliary system 140 may not confirm or automatically send, but may instead wait for a gesture input to send. Further, if the field is empty (e.g., it was cleared by the user's swipe), the assistance system 140 can prompt the user for input and automatically turn on the microphone for the user to respond.
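By way of illustration only, these interaction rules might be summarized as a small state update applied to each incoming event. The event names and state fields below are hypothetical.

from dataclasses import dataclass
from typing import Optional

@dataclass
class EditState:
    focused_field: Optional[str] = None
    mic_on: bool = False
    auto_send_pending: bool = False

def handle_event(state: EditState, event: dict) -> EditState:
    """Apply the basic interaction rules to a gesture/field event (illustrative only)."""
    if event["type"] == "gesture_on_field":
        state.focused_field = event["field"]          # gesture focus calls up a gesture-forward menu
    elif event["type"] == "gesture_outside_field":
        state.focused_field = None                    # remove focus...
        state.mic_on = False                          # ...turn off the microphone...
        state.auto_send_pending = False               # ...and stop any automatic send in progress
    elif event["type"] == "field_cleared":
        state.mic_on = True                           # prompt for input and listen for dictation
    elif event["type"] == "gesture_before_send":
        state.auto_send_pending = False               # wait for an explicit gesture to send
    return state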
FIGS. 5A to 5D illustrate an example of editing a message by using voice input. FIG. 5A illustrates an example user interface displaying a dictated message. FIG. 5A shows that there are three options under the message "be there in twenty" that has been dictated into text box 500, including "open" 505, "edit" 510, and "send" 515. The user may select each of these options by voice (i.e., by speaking the option). The "edit" 510 option may allow the user to edit the message. The "send" 515 option may allow the user to send the message. The "open" 505 option may allow the user to open the messaging application. Further, there may be a symbol 520 indicating the status of the microphone. For example, in FIG. 5A, symbol 520 may indicate that the microphone is on, waiting for further voice input from the user. FIG. 5B illustrates an example user interface displaying a user request to make a change based on a one-step correction. As shown in FIG. 5B, the user may request a change to the dictated message using a voice input, indicated by "change in twenty to in thirty" 525. When the auxiliary system 140 begins editing the message, symbol 520 may indicate that the microphone is now off. FIG. 5C illustrates an example user interface displaying the edited message. In FIG. 5C, the edited message in text box 500 is now "be there in thirty", wherein the edited content "in thirty" may be highlighted. Symbol 520 may indicate that the microphone is turned on again, waiting for further voice input from the user. FIG. 5D illustrates an example user interface displaying a confirmation of sending the edited message. As shown in FIG. 5D, the auxiliary system 140 may confirm with the user whether to send the edited message by highlighting the "send" 515 option. The user may then say "Yes" 530 to confirm. When the auxiliary system 140 begins sending the message, symbol 520 may indicate that the microphone is now off.
FIGS. 6A to 6F illustrate another example of editing a message by using voice input. FIG. 6A illustrates an example user interface displaying a dictated message. FIG. 6A shows that there are three options under the message "be there in twenty" that has been dictated into text box 500, including "open" 505, "edit" 510, and "send" 515. The user may select each of these options by voice (i.e., by speaking the option). The "edit" 510 option may allow the user to edit the message. The "send" 515 option may allow the user to send the message. The "open" 505 option may allow the user to open the messaging application. Symbol 520 may indicate that the microphone is on, waiting for further voice input from the user. FIG. 6B illustrates an example user interface displaying a user request to change the dictated message. As shown in FIG. 6B, the user may request to change the dictated message, indicated by "change it" 605. FIG. 6B also shows the user entering the edit mode, indicated by the highlighted "edit" 510 option. When the auxiliary system 140 begins editing the message, symbol 520 may indicate that the microphone is now off. FIG. 6C illustrates an example user interface displaying that the system is waiting for further dictation from the user. The auxiliary system 140 may gray out the previously dictated message in text box 500, indicating that the auxiliary system 140 is waiting for the user to dictate an entirely new message. FIG. 6C also shows a "cancel" 610 option for the user to cancel the editing of the message. FIG. 6D illustrates an example user interface displaying the new dictation. The user may say "I'll be there in thirty" 615. When the auxiliary system 140 begins to transcribe the new dictation, symbol 520 may indicate that the microphone is now off. FIG. 6E illustrates an example user interface displaying the transcribed new message. Text box 500 now has the new message "I'll be there in thirty". Symbol 520 may indicate that the microphone is now on again, waiting for further voice input from the user. FIG. 6F illustrates an example user interface displaying a confirmation of sending the edited message. As shown in FIG. 6F, the auxiliary system 140 may confirm with the user whether to send the edited message by highlighting the "send" 515 option. The user may then say "Yes" 620 to confirm. When the auxiliary system 140 begins sending the message, symbol 520 may indicate that the microphone is now off.
FIGS. 7A to 7E illustrate an example of editing a message by using gesture input and voice input. FIG. 7A illustrates an example user interface displaying a dictated message. FIG. 7A shows that there are three options under the message "be there in twenty" that has been dictated into text box 500, including "open" 505, "clear" 705, and "send" 515. The user may select each of these options by voice or gesture (i.e., by speaking the option or selecting it with a gesture). By way of example and not limitation, the gesture may be a hand movement/finger movement. The "clear" 705 option may allow the user to clear the entire message in text box 500. As shown in FIG. 7A, a user may use one hand 710 to point at "clear" 705 with a pinch gesture to select it. Symbol 520 may indicate that the microphone is now on, enabling voice input from the user. FIG. 7B illustrates an example user interface displaying that the system is waiting for a new dictation from the user. As shown in FIG. 7B, the previously dictated message "be there in twenty" has been cleared. The "clear" 705 option is now grayed out, indicating that the function is not available (because there is no message in text box 500). Symbol 520 may indicate that the microphone is now on, waiting for the user's dictation. FIG. 7C illustrates an example user interface displaying the new dictation. The user's new dictation may be "I'll be there in thirty" 715. When the auxiliary system 140 begins to transcribe the new dictation, symbol 520 may indicate that the microphone is now off. FIG. 7D illustrates an example user interface displaying the newly transcribed message. As shown in FIG. 7D, text box 500 may now have the message "I'll be there in thirty". Symbol 520 may indicate that the microphone is now on, waiting for further voice input from the user. FIG. 7E illustrates an example user interface displaying a confirmation of sending the edited message. As shown in FIG. 7E, the auxiliary system 140 may confirm with the user whether to send the edited message by highlighting the "send" 515 option. The user may then say "Yes" 720 to confirm. When the auxiliary system 140 begins sending the message, symbol 520 may indicate that the microphone is now off.
FIGS. 8A to 8H illustrate an example of editing a message by using gesture input and voice input. FIG. 8A illustrates an example user interface displaying a dictated message. In text box 500, the auxiliary system 140 displays "be dere n twenty minutes" as a transcription of the dictated message that may contain errors. The auxiliary system 140 may have identified the errors "dere n" and underlined them. Symbol 520 may indicate that the microphone is now on, waiting for voice input from the user. FIG. 8B illustrates an example user interface displaying a gesture input for targeting a portion of the message. As shown in FIG. 8B, the user can target "dere n" by pointing at it with a finger of the hand. Symbol 520 may indicate that the microphone is now off, waiting for further gesture input from the user. FIG. 8C illustrates an example user interface displaying the n-grams available for modification. The user may make a pinch gesture using hand 805 to confirm editing "dere n". Accordingly, the auxiliary system 140 may provide alternative n-grams from which the user may choose to replace "dere n". By way of example and not limitation, these n-grams may include "there in" 810 and "there in" 815. FIG. 8D illustrates an example user interface displaying a gesture input for targeting a replacement. As shown in FIG. 8D, the user can target "there in" 815 by pointing at it with a finger of the hand. FIG. 8E illustrates an example user interface displaying a confirmation of the selected replacement. In FIG. 8E, the user may use hand 805 to make a pinch gesture to confirm that "there in" 815 is selected. FIG. 8F illustrates an example user interface displaying the edited message. As shown in FIG. 8F, text box 500 may now have the edited message "be there in twenty minutes". FIG. 8G illustrates an example user interface displaying a selection to send the message. As shown in FIG. 8G, the user may send the edited message by using a finger of hand 805 to point at the "send" 515 option, in response to which the auxiliary system 140 may highlight the option. FIG. 8H illustrates an example user interface displaying a confirmation of sending the message. As shown in FIG. 8H, the user may make a pinch gesture with hand 805 to confirm the sending of the edited message.
FIGS. 9A to 9E illustrate an example of editing a message by using gaze input and voice input. FIG. 9A illustrates an example user interface displaying a dictated message. The dictated message may be "be there in twenty", displayed in text box 500. Symbol 520 may indicate that the microphone is now on, waiting for voice input from the user. FIG. 9B illustrates an example user interface displaying the divided blocks of the message. As shown in FIG. 9B, the user may have requested to change the dictated message. Accordingly, the auxiliary system 140 may have divided the message into two blocks, namely "be there" 905 and "in twenty" 910. Symbol 520 may indicate that the microphone is now off, waiting for the user's gaze input. FIG. 9C illustrates an example user interface displaying a gaze input. In FIG. 9C, circle 915 may indicate the user's gaze input, which is fixed on the block "in twenty" 910. This means that the user wants to edit the block "in twenty" 910. FIG. 9D illustrates an example user interface displaying an edit to the block. As shown in FIG. 9D, the user may have dictated the edit as replacing "in twenty" 910 with "in thirty" 920. FIG. 9E illustrates an example user interface displaying a confirmation of sending the message. In FIG. 9E, the user may fix gaze input 915 on the "send" 515 option. Accordingly, the auxiliary system 140 may display "Send" 925 as a way of confirming that the user wants to send the message.
FIGS. 10A to 10I illustrate an example of editing a message by using gesture input and voice input. FIG. 10A illustrates an example user interface displaying a dictated message. The dictated message may be "be there in twenty", displayed in text box 500. Symbol 520 may indicate that the microphone is now on, waiting for voice input from the user. FIG. 10B illustrates an example user interface displaying a user request to make a change. As shown in FIG. 10B, the user's utterance may be "Change" 1005, indicating that the user wants to edit the message. While the auxiliary system 140 is waiting for a gesture input from the user, symbol 520 may indicate that the microphone is now off. FIG. 10C illustrates an example user interface displaying the blocks for editing. In FIG. 10C, the message is divided into two blocks, including "be there" 1010 and "in twenty" 1015. The user may make a selection using hand 1020, for example by pointing with a finger at the block to be edited. FIG. 10D illustrates an example user interface displaying the selection of a block. As shown in FIG. 10D, the user can select "in twenty" 1015 by pointing a finger at it using hand 1020. After the selection, the block "in twenty" 1015 may be grayed out. FIG. 10E illustrates an example user interface displaying a confirmation of the selected block. In FIG. 10E, the user may make a pinch gesture using hand 1020 to confirm the selection of "in twenty" 1015. FIG. 10F illustrates an example user interface displaying the editing of the selected block. As shown in FIG. 10F, the user may have dictated "in thirty" 1025 as a replacement for "in twenty" 1015. FIG. 10G illustrates an example user interface displaying a confirmation of the edited block. In FIG. 10G, the user may make a pinch gesture using hand 1020 to confirm the new, edited block, e.g., "in thirty" 1030. FIG. 10H illustrates an example user interface displaying a selection to send the message. As shown in FIG. 10H, the user may point at the "send" 515 option with hand 1020 to send the edited message. FIG. 10I illustrates an example user interface displaying a confirmation of sending the message. As shown in FIG. 10I, the user may make a pinch gesture with hand 1020 to confirm the sending of the edited message, e.g., "be there in thirty".
Figs. 11A to 11J illustrate examples of editing a message by using gesture input and voice input. FIG. 11A illustrates an example user interface displaying a dictated message. The message may be "Are you almost done with your appointment? I'm close by at the pharmacy and can pick you up. I'm buying some snacks. Anything you want me to buy for you?" FIG. 11A shows that there are two options below the dictated message, namely "<" 1110 for returning to the previous step and "v" 1115 for accepting the edit. FIG. 11B illustrates an example user interface displaying the portion of the message selected by the user for editing. As shown in FIG. 11B, the user may use hand 1120 to point at the location the user wants to change, e.g., between "some" and "snacks". FIG. 11C illustrates an example user interface displaying the start of selecting a word for editing. In FIG. 11C, the user may make a pinch gesture using hand 1120 to place a virtual cursor between "some" and "snacks". FIG. 11D illustrates an example user interface displaying the end of selecting the word for editing. As shown in FIG. 11D, the user may move hand 1120 from left to right while maintaining the pinch gesture to select "snacks" for editing. FIG. 11E illustrates an example user interface displaying options for editing the selected word. In FIG. 11E, the user may use hand 1120 to point at the selected word "snacks". In response to the pointing, the auxiliary system 140 may bring up three options. The first option may be to delete the word, as indicated by trash-can symbol 1125. The second option may be a voice input for how to edit the word, as indicated by microphone symbol 1130. The third option may be to type in the edited content, as indicated by keyboard symbol 1135. FIG. 11F illustrates an example user interface displaying the selection of the voice-input option. As shown in FIG. 11F, the user may point at microphone symbol 1130 using a finger of hand 1120. FIG. 11G illustrates an example user interface displaying a confirmation of editing using voice input. In FIG. 11G, the user may make a pinch gesture using hand 1120 to confirm that the word "snacks" is to be edited using voice input. FIG. 11H illustrates an example user interface displaying a dictation from the user. As shown in FIG. 11H, the user may dictate "fresh fruits and" as a replacement for "snacks". At this point, the user may not have completed the dictation. FIG. 11I illustrates an example user interface displaying the edited message. As shown in FIG. 11I, the edited message may now be "Are you almost done with your appointment? I'm close by at the pharmacy and can pick you up. I'm buying some fresh fruits and chips. Anything you want me to buy for you?" FIG. 11J illustrates an example user interface displaying acceptance of the edits to the message. As shown in FIG. 11J, the user may use hand 1120 to make a pinch gesture directed at "v" 1115 to accept the edits to the message.
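The cursor placement and pinch-drag selection of FIGS. 11C-11D amount to mapping a fingertip position onto the nearest gap between words and then extending the selection to whole words. The following is a rough sketch under assumptions (left-to-right layout, known word extents in pixels); it is not the disclosed implementation.

```python
# Sketch under assumptions: words are laid out left to right and each word's
# horizontal extent is known. Mapping a fingertip x-coordinate to the nearest
# gap between words gives the virtual cursor of FIG. 11C; dragging while
# pinched extends the selection to whole words (FIG. 11D).
from bisect import bisect_left

def cursor_gap(word_right_edges, pointer_x):
    """Return the gap index (0 = before the first word) closest to pointer_x."""
    return bisect_left(word_right_edges, pointer_x)

def select_span(start_gap, end_gap, words):
    lo, hi = sorted((start_gap, end_gap))
    return words[lo:hi]               # whole words between the two gaps

words = ["I'm", "buying", "some", "snacks."]
right_edges = [30, 95, 140, 210]      # assumed pixel positions of word ends

start = cursor_gap(right_edges, 142)  # pinch starts just after "some"
end = cursor_gap(right_edges, 215)    # drag right past "snacks."
print(select_span(start, end, words)) # -> ['snacks.']
```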
Figs. 12A to 12I show examples of editing a message by using gaze input, gesture input, and voice input. FIG. 12A illustrates an example user interface displaying a dictated message. The message may be "Are you almost done with your appointment? I'm close by at the pharmacy and can pick you up. I'm buying some snacks. Anything you want me to buy for you?" FIG. 12A shows that there are two options below the dictated message, namely "<" 1110 for returning to the previous step and "v" 1115 for accepting the edit. Further, there may be a gaze input 1205 of the user, indicating that the user's gaze is moving toward the dictated message. FIG. 12B illustrates an example user interface displaying the portion of the message selected by the user for editing. As shown in FIG. 12B, the user's gaze input 1205 may fix on "some". FIG. 12C illustrates an example user interface displaying the start of selecting a word for editing. In FIG. 12C, the user's gaze input 1205 may fix between "some" and "snacks", and a cursor may appear between the two words. Meanwhile, the user may use hand 1210 to confirm that the user wants to edit the message. FIG. 12D illustrates an example user interface displaying the end of selecting the word for editing. As shown in FIG. 12D, the user's gaze input 1205 may have moved from the previous "snacks" to "buy". Since the user's hand 1210 has kept pointing steadily, the entire section "snacks. Anything you want me to buy" may be selected. FIG. 12E illustrates an example user interface displaying options for editing the selected word. In FIG. 12E, after the user withdraws hand 1210, the position at which the gaze input 1205 is fixed may determine the word selected for editing. Accordingly, "snacks" may be selected. The user may have three editing options, e.g., deleting the word as indicated by trash-can symbol 1125, voice input as indicated by microphone symbol 1130, and typing as indicated by keyboard symbol 1135. FIG. 12F illustrates an example user interface displaying the selection of the voice-input option. As shown in FIG. 12F, the user's gaze input 1205 may fix on microphone symbol 1130. FIG. 12G illustrates an example user interface displaying a confirmation of editing using voice input. In FIG. 12G, the user may point with hand 1210 to confirm that the word "snacks" is to be edited using voice input. FIG. 12H illustrates an example user interface displaying a dictation from the user. As shown in FIG. 12H, the user may dictate "fresh fruits and" as a replacement for "snacks". At this point, the user may not have completed the dictation. FIG. 12I illustrates an example user interface displaying the edited message. As shown in FIG. 12I, the edited message may now be "Are you almost done with your appointment? I'm close by at the pharmacy and can pick you up. I'm buying some fresh fruits and chips. Anything you want me to buy for you?"
Fig. 13 shows an example of editing a message by dividing the message and numbering the divisions. In FIG. 13, the dictated message may be "Are you almost done with your appointment? I'm close by at the pharmacy and can pick you up. I'm buying some snacks. Anything you want me to buy for you?" To facilitate editing the message, the auxiliary system 140 may divide the message into four sentences. The auxiliary system 140 may also assign numbers to the sentences. For example, the number 1 may be assigned to "Are you almost done with your appointment?", the number 2 to "I'm close by at the pharmacy and can pick you up.", the number 3 to "I'm buying some snacks.", and the number 4 to "Anything you want me to buy for you?" By way of example and not limitation, instruction 1305 may be "What do you wanna change? You can say 'number one'."
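A rough illustration of this division-and-numbering scheme follows. The sketch is not the disclosed implementation: it uses a simple regular expression in place of the NLU module, and the spoken-number vocabulary is an assumption.

```python
# Illustrative sketch: split a dictated message into sentences, number them,
# and resolve a spoken reference like "number three" back to its sentence.
import re

def number_sentences(message):
    sentences = [s.strip() for s in re.split(r'(?<=[.?!])\s+', message) if s.strip()]
    return {i + 1: s for i, s in enumerate(sentences)}

SPOKEN_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4}  # assumed vocabulary

def resolve_reference(utterance, numbered):
    # e.g. "change number three" -> sentence 3
    for word, value in SPOKEN_NUMBERS.items():
        if word in utterance.lower().split():
            return numbered.get(value)
    return None

msg = ("Are you almost done with your appointment? I'm close by at the pharmacy "
       "and can pick you up. I'm buying some snacks. Anything you want me to buy for you?")
numbered = number_sentences(msg)
print(numbered[1])                                          # sentence 1
print(resolve_reference("change number three", numbered))   # "I'm buying some snacks."
```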
Figs. 14A and 14B show examples of quickly clearing a message. FIG. 14A shows an example dictation of a message. The user 1405 may be wearing smart glasses as his client system 130. The user 1405 may use voice input to dictate a message, e.g., "be there in twenty" 1410. The auxiliary system 140 may transcribe the voice input and instruct the smart glasses to present the transcribed message on the display 1415 of the smart glasses. The transcribed message may be located in text box 500. There may also be other options, such as "open" 505 for opening a messaging application, "edit" 510 for entering the edit mode, "send" 515 for sending the message, and an indicator 520 of the microphone status. FIG. 14B shows an example of quickly clearing the message. The user 1405 may perform a swipe gesture using his hand 1420. In response to the swipe gesture, the auxiliary system 140 may delete the entire message. As a result, there may be no message left in text box 500 on display 1415.
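The swipe-to-clear behavior can be sketched as a simple rule over the attributes of a hand track (distance covered and duration). This is illustrative only; the disclosed gesture classifier would be a trained model rather than fixed thresholds, and the threshold values here are assumptions.

```python
# Minimal sketch (assumed thresholds): classify a hand track as a "swipe" from
# its attributes so the assistant can clear the whole dictated message.
import math

def is_swipe(track, min_distance=0.25, max_duration_s=0.5):
    """track: list of (t, x, y) samples in normalized screen coordinates."""
    (t0, x0, y0), (t1, x1, y1) = track[0], track[-1]
    distance = math.hypot(x1 - x0, y1 - y0)
    duration = t1 - t0
    return distance >= min_distance and duration <= max_duration_s

class TextBox:
    def __init__(self, text): self.text = text
    def clear(self): self.text = ""

def handle_gesture(track, text_box):
    if is_swipe(track):
        text_box.clear()          # wipe the whole dictated message

box = TextBox("be there in twenty")
handle_gesture([(0.00, 0.2, 0.5), (0.10, 0.5, 0.5), (0.20, 0.8, 0.5)], box)
print(repr(box.text))   # -> ''
```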
Figs. 15A to 15D show examples of editing a message based on an n-gram overlay of identifiers. FIG. 15A shows an example incoming message. The incoming message may be "the quick brown fox jumped over the lazy dog". The message may be displayed in a text box 1505 on the client system 130 (e.g., a smartphone). Within text box 1505, there may be two options: "edit" 1510 and "send" 1515. FIG. 15B illustrates an example n-gram overlay of identifiers on a smartphone. As shown in FIG. 15B, the user is now in edit mode 1520. For each n-gram, there may be an identifier, such as a number, overlaid on it. For example, "1" represents "the", "2" represents "quick", "3" represents "brown", "4" represents "fox", "5" represents "jumped", "6" represents "over", "7" represents "the", "8" represents "lazy", and "9" represents "dog". The user can simply speak a number to edit the corresponding n-gram. FIG. 15C illustrates an example n-gram overlay of identifiers on a smart watch. As shown in FIG. 15C, the user is now in edit mode 1520. For each n-gram, there may be an identifier, such as a number, overlaid on it. For example, "1" represents "the", "2" represents "quick", "3" represents "brown", "4" represents "fox", "5" represents "jumped", "6" represents "over", "7" represents "the", "8" represents "lazy", and "9" represents "dog". The user can simply speak a number to edit the corresponding n-gram. FIG. 15D illustrates an example n-gram overlay of identifiers on a smart network camera. As shown in FIG. 15D, the user is now in edit mode 1520. For each n-gram, there may be an identifier, such as a number, overlaid on it. For example, "1" represents "the", "2" represents "quick", "3" represents "brown", "4" represents "fox", "5" represents "jumped", "6" represents "over", "7" represents "the", "8" represents "lazy", and "9" represents "dog". The user can simply speak a number to edit the corresponding n-gram.
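The identifier overlay itself is a straightforward mapping from positions to tokens, as in the following sketch. Function names are illustrative, not the disclosed API.

```python
# Sketch (illustrative names): overlay a numeric identifier on each n-gram and
# let the user replace one by saying its number, e.g. "change 2 to slow".
def overlay_identifiers(tokens):
    return {str(i + 1): token for i, token in enumerate(tokens)}

def apply_numbered_edit(tokens, identifier, replacement):
    index = int(identifier) - 1
    edited = list(tokens)
    edited[index] = replacement
    return edited

tokens = "the quick brown fox jumped over the lazy dog".split()
print(overlay_identifiers(tokens))            # {'1': 'the', '2': 'quick', ...}
print(" ".join(apply_numbered_edit(tokens, "2", "slow")))
# -> "the slow brown fox jumped over the lazy dog"
```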
FIG. 16 illustrates an example method 1600 for efficient text editing. The method may begin at step 1610, where the auxiliary system 140 may present, through a user interface of the client system 130, a text message based on a user utterance received at the client system, wherein the text message includes a plurality of n-grams. At step 1620, the auxiliary system 140 may receive, at the client system 130, a first user request to edit the text message, wherein the first user request is based on one or more of: a voice input, a gesture input, or a gaze input. At step 1630, the auxiliary system 140 may present, through the user interface, the text message visually divided into a plurality of blocks, wherein each block includes one or more of the plurality of n-grams of the text message, wherein the n-grams in each block are contiguous with respect to each other and grouped within the block based on an analysis of the text message by a natural-language understanding (NLU) module, and wherein each of the plurality of blocks is visually divided using one or more of: a geometry, a color, or an identifier, the identifier comprising one or more of a number, a letter, or a symbol. At step 1640, the auxiliary system 140 may present, through the user interface, a prompt for entering a second user request, wherein the second user request includes information for editing the one or more blocks. At step 1650, the auxiliary system 140 may receive, at the client system 130, a second user request to edit one or more of the plurality of blocks, wherein the second user request is based on one or more of: a voice input referencing the one or more blocks, a gesture input, or a gaze input directed to the one or more blocks. At step 1660, if the second user request includes a voice input referencing the one or more blocks and the reference to the one or more blocks includes an ambiguous reference, the auxiliary system 140 may disambiguate the ambiguous reference based on a speech similarity model. At step 1670, the auxiliary system 140 may edit the text message based on the second user request, wherein editing the text message includes one or more of: clearing the n-grams corresponding to the one or more blocks if the second user request includes a gesture input intended to clear the text message; changing one or more of the n-grams in each of the one or more blocks to one or more other n-grams, respectively; adding one or more n-grams to each of the one or more blocks; or altering an order associated with the n-grams in each of the one or more blocks. At step 1680, the auxiliary system 140 may present, through the user interface, the edited text message, wherein the edited text message is generated based on the second user request. Particular embodiments may repeat one or more steps of the method of FIG. 16, where appropriate. Although this disclosure describes and illustrates particular steps of the method of FIG. 16 as occurring in a particular order, this disclosure contemplates any suitable steps of the method of FIG. 16 occurring in any suitable order. Furthermore, although this disclosure describes and illustrates an example method for efficient text editing that includes the particular steps of the method of FIG. 16, this disclosure contemplates any suitable method for efficient text editing that includes any suitable steps, which may include all, some, or none of the steps of the method of FIG. 16, where appropriate.
Furthermore, although this disclosure describes and illustrates particular components, devices, or systems performing particular steps of the method of fig. 16, this disclosure contemplates any suitable combination of any suitable components, devices, or systems performing any suitable steps of the method of fig. 16.
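As a rough, non-authoritative illustration of step 1660, the following sketch resolves an ambiguous spoken reference to one of the blocks by string similarity. The standard-library difflib matcher stands in for the speech similarity model described above; an actual system would compare phoneme sequences rather than spellings, and all names here are illustrative.

```python
# Sketch only: disambiguate a (possibly misrecognized) spoken reference to a
# block by picking the most similar block. difflib is a stand-in for the
# speech similarity model of step 1660.
import difflib

def disambiguate(spoken_reference, blocks):
    matches = difflib.get_close_matches(spoken_reference, blocks, n=1, cutoff=0.0)
    return matches[0] if matches else None

blocks = ["be there", "in twenty"]
print(disambiguate("in twenny", blocks))   # -> "in twenty"
```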
Privacy system
In particular embodiments, one or more objects (e.g., content or other types of objects) of a computing system may be associated with one or more privacy settings. The one or more objects may be stored on or otherwise associated with any suitable computing system or application, such as social-networking system 160, client system 130, auxiliary system 140, third-party system 170, a social-networking application, an auxiliary application, a messaging application, a photo-sharing application, or any other suitable computing system or application. While the examples discussed herein are in the context of an online social network, these privacy settings may be applied to any other suitable computing system. The privacy settings (or "access settings") of an object may be stored in any suitable manner (e.g., in association with the object, in an index on an authorization server, in another suitable manner, or any suitable combination thereof). The privacy settings of an object may specify how the object (or particular information associated with the object) may be accessed, stored, or otherwise used (e.g., viewed, shared, modified, copied, executed, revealed, identified) within the online social network. An object may be described as "visible" with respect to a particular user or other entity when the privacy settings of the object allow that user or other entity to access the object. By way of example and not limitation, a user of an online social network may specify privacy settings for a user profile page that identify a group of users that may access work-experience information on the user profile page, thereby denying other users access to that information.
In particular embodiments, the privacy settings of an object may specify a "blocked list" of users or other entities that should not be allowed to access particular information associated with the object. In particular embodiments, the blocked list may include third-party entities. The blocked list may specify one or more users or entities for which the object is not visible. By way of example and not limitation, a user may designate a group of users that may not access an album associated with the user, thereby denying those users access to the album (while also potentially allowing specific users not within the designated group of users to access the album). In particular embodiments, privacy settings may be associated with particular social-graph elements. The privacy settings of a social-graph element (e.g., a node or an edge) may specify how the social-graph element, information associated with the social-graph element, or content objects associated with the social-graph element may be accessed using the online social network. By way of example and not limitation, a particular photograph may have privacy settings specifying that only users tagged in the photograph and friends of users tagged in the photograph may access the photograph. In particular embodiments, the privacy settings may allow the user to opt in to or opt out of having their content, information, or actions stored/recorded by social-networking system 160 or auxiliary system 140, or shared with other systems (e.g., third-party systems 170). Although this disclosure describes using particular privacy settings in a particular manner, this disclosure contemplates using any suitable privacy settings in any suitable manner.
In particular embodiments, social-networking system 160 may present a "privacy wizard" (e.g., within a web page, a module, one or more dialog boxes, or any other suitable interface) to the first user to help the first user specify one or more privacy settings. The privacy wizard may display instructions, appropriate privacy related information, current privacy settings, one or more input fields for accepting one or more inputs from a first user specifying a change or confirmation of privacy settings, or any suitable combination thereof. In particular embodiments, social-networking system 160 may provide a "control panel" function to the first user that may display the first user's current privacy settings. The control panel function may be displayed to the first user at any suitable time (e.g., after input from the first user calling the control panel function, after a particular event or trigger action occurs). The control panel function may allow the first user to modify one or more of the first user's current privacy settings at any time in any suitable manner (e.g., redirect the first user to the privacy wizard).
The privacy settings associated with the object may specify any suitable granularity of allowing access or denying access. By way of example and not limitation, access may be specified or denied for the following users: a particular user (e.g., i me only, my roommate, my boss), a user within a particular degree of separation (e.g., friends, or friends of friends), a group of users (e.g., game clubs, my family), a network of users (e.g., employees of a particular employer, students or alumni of a particular university), all users ("public"), none users ("private"), users of third party system 170, a particular application (e.g., a third party application, an external website), other suitable entity, or any suitable combination thereof. Although this disclosure describes a particular granularity of allowing access or denying access, this disclosure contemplates any suitable granularity of allowing access or denying access.
In particular embodiments, one or more servers 162 may be authorization/privacy servers for enforcing privacy settings. In response to a request from a user (or other entity) for a particular object stored in data store 164, social-networking system 160 may send a request for the object to data store 164. The request may identify the user associated with the request, and the object may be sent only to the user (or the user's client system 130) if the authorization server determines that the user is authorized to access the object based on the privacy settings associated with the object. If the requesting user is not authorized to access the object, the authorization server may block retrieval of the requested object from data store 164 or may block transmission of the requested object to the user. In the context of a search query, an object may be provided as a search result only if the querying user is authorized to access the object (e.g., if the privacy settings for the object allow the object to be revealed to, found by, or otherwise visible to the querying user). In particular embodiments, the object may represent content that is visible to the user through the user's news feed. By way of example and not limitation, one or more objects may be visible on a user's "Trending" page. In particular embodiments, the object may correspond to a particular user. The object may be content associated with the particular user, or may be the particular user's account, or information stored on social-networking system 160 or another computing system. By way of example and not limitation, a first user may view one or more second users of the online social network through the "People You May Know" function of the online social network or by viewing the first user's friends list. By way of example and not limitation, a first user may specify that they do not wish to see objects associated with a particular second user in their news feed or friends list. An object may be excluded from the search results if its privacy settings do not allow it to appear to, be found by, or be visible to the user. Although this disclosure describes enforcing privacy settings in a particular manner, this disclosure contemplates enforcing privacy settings in any suitable manner.
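For illustration only, the authorization check described above can be sketched as a filter applied before objects are returned from the data store or included in search results. The data layout and function names below are assumptions, not the actual social-networking system API.

```python
# Illustrative sketch: check an object's privacy settings before returning it
# to a requesting user or including it in search results.
def is_visible(obj, requesting_user):
    settings = obj.get("privacy", {})
    if requesting_user in settings.get("blocked", []):
        return False                                   # blocked list wins
    audience = settings.get("audience", "public")
    if audience == "public":
        return True
    if audience == "friends":
        return requesting_user in obj.get("owner_friends", [])
    return requesting_user == obj.get("owner")         # "private": owner only

def filter_search_results(objects, requesting_user):
    return [o for o in objects if is_visible(o, requesting_user)]

photo = {"owner": "alice", "owner_friends": ["bob"],
         "privacy": {"audience": "friends", "blocked": ["carol"]}}
print(is_visible(photo, "bob"))    # True  (friend of the owner)
print(is_visible(photo, "carol"))  # False (on the blocked list)
```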
In particular embodiments, different objects of the same type associated with a user may have different privacy settings. Different types of objects associated with a user may have different types of privacy settings. By way of example and not limitation, a first user may specify that a status update of the first user is public, but that any images shared by the first user are only visible to friends of the first user on the online social network. As another example and not by way of limitation, a user may specify different privacy settings for different types of entities, such as individual users, friends of friends, followers, groups of users, or corporate entities. As another example and not by way of limitation, a first user may designate a group of users that may view video published by the first user while preventing the video from being visible to an employer of the first user. In particular embodiments, different privacy settings may be provided for different groups or demographics of users. By way of example and not limitation, a first user may specify that other users at the same university as the first user may view the first user's photos, but that other users who are members of the first user's family may not view those same photos.
In particular embodiments, social-networking system 160 may provide one or more default privacy settings for each object of a particular object type. The privacy settings of an object set as default may be changed by a user associated with the object. By way of example and not limitation, all images posted by a first user may have default privacy settings that are visible only to friends of the first user, and for a particular image, the first user may change the privacy settings of that image to be visible to friends and friends of friends.
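The per-type default with a per-object override described above can be sketched as follows. This is a simplified illustration under assumed names; it is not the disclosed storage scheme.

```python
# Sketch under assumptions: per-type default privacy settings with an optional
# per-object override, as in the "default friends, one image widened to
# friends-of-friends" example above.
TYPE_DEFAULTS = {"image": "friends", "status": "public"}

def effective_audience(obj):
    return obj.get("audience_override") or TYPE_DEFAULTS.get(obj["type"], "private")

posted_image = {"type": "image"}
special_image = {"type": "image", "audience_override": "friends_of_friends"}
print(effective_audience(posted_image))   # "friends" (the type default)
print(effective_audience(special_image))  # "friends_of_friends" (overridden)
```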
In particular embodiments, the privacy settings may allow a user to specify (e.g., by opting out, by not opting in) whether social-networking system 160 or auxiliary system 140 may receive, collect, record, or store particular objects or information associated with the user for any purpose. In particular embodiments, the privacy settings may allow the first user to specify whether a particular application or process may access, store, or use a particular object or information associated with the user. The privacy settings may allow the user to opt in to or opt out of having objects or information accessed, stored, or used by specific applications or processes. Social-networking system 160 or auxiliary system 140 may access such information in order to provide a particular function or service to the first user, but social-networking system 160 or auxiliary system 140 may not access the information for any other purpose. Before accessing, storing, or using such objects or information, social-networking system 160 or auxiliary system 140 may prompt the user to provide privacy settings specifying which applications or processes, if any, may access, store, or use the objects or information before allowing any such actions. By way of example and not limitation, a first user may send a message to a second user via an application (e.g., a messaging application) associated with the online social network, and may specify privacy settings that such messages should not be stored by social-networking system 160 or auxiliary system 140.
In particular embodiments, a user may specify whether a particular type of object or information associated with a first user may be accessed, stored, or used by social-networking system 160 or auxiliary system 140. By way of example and not limitation, a first user may specify that an image sent by the first user through social-networking system 160 or auxiliary system 140 may not be stored by social-networking system 160 or auxiliary system 140. As another example and not by way of limitation, a first user may specify that messages sent from the first user to a particular second user may not be stored by social-networking system 160 or auxiliary system 140. As yet another example and not by way of limitation, a first user may specify that all objects sent via a particular application may be saved by social-networking system 160 or auxiliary system 140.
In particular embodiments, the privacy settings may allow the first user to specify whether particular objects or information associated with the first user may be accessed from a particular client system 130 or third party system 170. The privacy settings may allow the first user to choose to join or choose not to access objects or information from a particular device (e.g., a phonebook on the user's smartphone), from a particular application (e.g., a messaging application), or from a particular system (e.g., an email server). Social-networking system 160 or auxiliary system 140 may provide default privacy settings for each device, system, or application and/or may prompt the first user to specify particular privacy settings for each context. By way of example and not limitation, a first user may utilize location services features of social-networking system 160 or auxiliary system 140 to provide recommendations for restaurants or other places in the vicinity of the user. The default privacy settings of the first user may specify that social-networking system 160 or secondary system 140 may provide location-based services using location information provided from first-user's client system 130, but social-networking system 160 or secondary system 140 may not store or provide location information of the first user to any third-party system 170. The first user may then update the privacy settings to allow the location information to be used by the third party image sharing application to geotag the photo.
In particular embodiments, the privacy settings may allow a user to specify one or more geographic locations from which objects may be accessed. Access to the object or denial of access may depend on the geographic location of the user attempting to access the object. By way of example and not limitation, users may share an object and specify that only users in the same city may access or view the object. As another example and not by way of limitation, a first user may share an object and specify that the object is only visible to a second user when the first user is in a particular location. If the first user leaves the particular location, the object may no longer be visible to the second user. As another example and not by way of limitation, a first user may specify that an object is only visible to a second user within a threshold distance from the first user. If the first user subsequently changes locations, the original second user having access to the object may lose access, while a new set of second users may gain access when they come within a threshold distance of the first user.
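The distance-gated visibility described above reduces to a location comparison that is re-evaluated when the sharing user moves. The following is a minimal sketch with assumed names and an assumed threshold; the actual system's location handling is not specified here.

```python
# Minimal sketch (assumed names/threshold): an object is visible only while
# the viewer is within a threshold distance of the sharing user.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0                                  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def object_visible(sharer_loc, viewer_loc, threshold_km=5.0):
    return haversine_km(*sharer_loc, *viewer_loc) <= threshold_km

sharer = (37.4847, -122.1477)        # illustrative coordinates only
viewer_nearby = (37.4805, -122.1600)
viewer_far = (40.7128, -74.0060)
print(object_visible(sharer, viewer_nearby))  # True  (within threshold)
print(object_visible(sharer, viewer_far))     # False (loses access)
```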
In particular embodiments, social-networking system 160 or auxiliary system 140 may have functionalities that may use, as inputs, personal or biometric information of a user for the purposes of user authentication or personalizing the user's experience. A user may choose to use these functionalities to enhance their experience on the online social network. By way of example and not limitation, a user may provide personal or biometric information to social-networking system 160 or auxiliary system 140. The user's privacy settings may specify that such information may be used only for particular processes (e.g., authentication), and further that such information may not be shared with any third-party system 170 or used for other processes or applications associated with social-networking system 160 or auxiliary system 140. As another example and not by way of limitation, social-networking system 160 may provide a functionality for a user to provide voiceprint recordings to the online social network. By way of example and not limitation, if a user wishes to utilize this functionality of the online social network, the user may provide a voice recording of his or her own voice to provide status updates on the online social network. The recording of the voice input may be compared to a voiceprint of the user to determine what words were spoken by the user. The user's privacy settings may specify that such voice recordings may be used only for voice-input purposes (e.g., to authenticate the user, to send voice messages, to improve voice recognition in order to use voice-operated features of the online social network), and may further specify that such voice recordings may not be shared with any third-party system 170 or used by other processes or applications associated with social-networking system 160. As another example and not by way of limitation, social-networking system 160 may provide a functionality for a user to provide a reference image (e.g., a facial profile, a retinal scan) to the online social network. The online social network may compare the reference image against later-received image inputs (e.g., to authenticate the user, to tag the user in photographs). The user's privacy settings may specify that such images may be used only for limited purposes (e.g., authentication, tagging the user in photographs), and may further specify that such images may not be shared with any third-party system 170 or used by other processes or applications associated with social-networking system 160.
System and method
FIG. 17 illustrates an example computer system 1700. In particular embodiments, one or more computer systems 1700 perform one or more steps of one or more methods described or illustrated herein. In particular embodiments, one or more computer systems 1700 provide the functionality described or illustrated herein. In particular embodiments, software running on one or more computer systems 1700 performs one or more steps of one or more methods described or illustrated herein, or provides functionality described or illustrated herein. Particular embodiments include one or more portions of one or more computer systems 1700. In this document, references to a computer system may encompass a computing device, and vice versa, where appropriate. Furthermore, references to a computer system may encompass one or more computer systems, where appropriate.
This disclosure contemplates any suitable number of computer systems 1700. The present disclosure contemplates computer system 1700 taking any suitable physical form. By way of example, and not limitation, computer system 1700 may be an embedded computer system, a system-on-chip (SOC), a single-board computer system (SBC) (e.g., a computer-on-module (COM) or a system-on-module (SOM)), a desktop computer system, a laptop or notebook computer system, an interactive self-service terminal (kiosk), a mainframe, a network of computer systems, a mobile telephone, a personal digital assistant (PDA), a server, a tablet computer system, or a combination of two or more of these. Where appropriate, computer system 1700 may include one or more computer systems 1700; be unitary or distributed; span multiple locations; span multiple machines; span multiple data centers; or reside in a cloud, which may include one or more cloud components in one or more networks. Where appropriate, one or more computer systems 1700 may perform, without substantial spatial or temporal limitation, one or more steps of one or more methods described or illustrated herein. By way of example, and not limitation, one or more computer systems 1700 may perform one or more steps of one or more methods described or illustrated herein in real time or in batch mode. Where appropriate, one or more computer systems 1700 may perform one or more steps of one or more methods described or illustrated herein at different times or at different locations.
In a particular embodiment, the computer system 1700 includes a processor 1702, a memory 1704, a storage 1706, an input/output (I/O) interface 1708, a communication interface 1710, and a bus 1712. Although this disclosure describes and illustrates a particular computer system having a particular number of particular components arranged in a particular manner, this disclosure contemplates any suitable computer system having any suitable number of any suitable components arranged in any suitable manner.
In a particular embodiment, the processor 1702 includes hardware for executing instructions (e.g., those comprising a computer program). By way of example, and not limitation, to execute instructions, the processor 1702 may retrieve (or fetch) instructions from an internal register, an internal cache, the memory 1704, or the storage 1706; decode and execute the instructions; and then write one or more results to an internal register, internal cache, memory 1704, or storage 1706. In particular embodiments, processor 1702 may include one or more internal caches for data, instructions, or addresses. The present disclosure contemplates processor 1702 including any suitable number of any suitable internal caches, where appropriate. By way of example, and not limitation, the processor 1702 may include one or more instruction caches, one or more data caches, and one or more translation lookaside buffers (TLBs). Instructions in the instruction caches may be copies of instructions in memory 1704 or storage 1706, and the instruction caches may speed up retrieval of those instructions by the processor 1702. The data in the data caches may be: copies of data in memory 1704 or storage 1706 for manipulation by instructions executing at processor 1702; the results of previous instructions executed at processor 1702 for access by subsequent instructions executing at processor 1702 or for writing to memory 1704 or storage 1706; or other suitable data. The data caches may speed up read or write operations of the processor 1702. The TLBs may speed up virtual-address translation for the processor 1702. In particular embodiments, processor 1702 may include one or more internal registers for data, instructions, or addresses. The present disclosure contemplates processor 1702 including any suitable number of any suitable internal registers, where appropriate. Where appropriate, the processor 1702 may include one or more arithmetic logic units (ALUs), be a multi-core processor, or include one or more processors 1702. Although this disclosure describes and illustrates a particular processor, this disclosure contemplates any suitable processor.
In a particular embodiment, the memory 1704 includes main memory for storing instructions to be executed by the processor 1702 or data for manipulation by the processor 1702. By way of example, and not limitation, computer system 1700 may load instructions from storage 1706 or another source (e.g., another computer system 1700) to memory 1704. The processor 1702 may then load the instructions from memory 1704 into an internal register or internal cache. To execute the instructions, the processor 1702 may retrieve and decode the instructions from the internal register or internal cache. During or after execution of the instructions, the processor 1702 may write one or more results (which may be intermediate results or final results) to the internal register or internal cache. The processor 1702 may then write one or more of these results to memory 1704. In particular embodiments, processor 1702 executes only instructions in one or more internal registers or internal caches or in memory 1704 (as opposed to storage 1706 or elsewhere), and manipulates only data in one or more internal registers or internal caches or in memory 1704 (as opposed to storage 1706 or elsewhere). One or more memory buses (which may each include an address bus and a data bus) may couple the processor 1702 to the memory 1704. As described below, the bus 1712 may include one or more memory buses. In particular embodiments, one or more memory management units (MMUs) are located between processor 1702 and memory 1704 and facilitate accesses to memory 1704 requested by processor 1702. In a particular embodiment, the memory 1704 includes random access memory (RAM). The RAM may be volatile memory, where appropriate. The RAM may be dynamic RAM (DRAM) or static RAM (SRAM), where appropriate. Further, the RAM may be single-port RAM or multi-port RAM, where appropriate. The present disclosure contemplates any suitable RAM. The memory 1704 may include one or more memories 1704, where appropriate. Although this disclosure describes and illustrates a particular memory, this disclosure contemplates any suitable memory.
In a particular embodiment, the storage 1706 includes mass storage for data or instructions. By way of example, and not limitation, storage 1706 may include a hard disk drive (HDD), a floppy disk drive (FDD), flash memory, an optical disc, a magneto-optical disc, magnetic tape, or a universal serial bus (USB) drive, or a combination of two or more of these. The storage 1706 may include removable or non-removable (or fixed) media, where appropriate. Storage 1706 may be internal or external to computer system 1700, where appropriate. In a particular embodiment, the storage 1706 is non-volatile, solid-state memory. In a particular embodiment, the storage 1706 includes read-only memory (ROM). The ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The present disclosure contemplates mass storage 1706 taking any suitable physical form. The storage 1706 may include one or more storage control units that facilitate communication between the processor 1702 and the storage 1706, where appropriate. The storage 1706 may include one or more storages 1706, where appropriate. Although this disclosure describes and illustrates particular storage, this disclosure contemplates any suitable storage.
In particular embodiments, I/O interface 1708 comprises hardware, software, or both, that provides one or more interfaces for communication between computer system 1700 and one or more I/O devices. Computer system 1700 may include one or more of these I/O devices, where appropriate. One or more of these I/O devices may enable communication between a person and computer system 1700. By way of example, and not limitation, the I/O device may include a keyboard, a keypad, a microphone, a monitor, a mouse, a printer, a scanner, a speaker, a still camera, a stylus, a tablet, a touch screen, a trackball, a camera, another suitable I/O device, or a combination of two or more of these. The I/O device may include one or more sensors. The present disclosure contemplates any suitable I/O devices and any suitable I/O interfaces 1708 for any suitable I/O devices. The I/O interface 1708 may include one or more devices or software drivers that enable the processor 1702 to drive one or more of these I/O devices, where appropriate. The I/O interfaces 1708 may include one or more I/O interfaces 1708, where appropriate. Although this disclosure describes and illustrates particular I/O interfaces, this disclosure contemplates any suitable I/O interfaces.
In particular embodiments, communication interface 1710 includes hardware, software, or both, that provides one or more interfaces for communication (e.g., packet-based) between computer system 1700 and one or more other computer systems 1700 or with one or more networks. By way of example and not limitation, the communication interface 1710 may include a network interface controller (network interface controller, NIC) or network adapter for communicating with an ethernet or other wire-based network, or a Wireless NIC (WNIC) or wireless adapter for communicating with a wireless network (e.g., wi-Fi network). The present disclosure contemplates any suitable networks and any suitable communication interfaces 1710 for any suitable networks. By way of example, and not limitation, computer system 1700 may communicate with the following networks: an ad hoc network, a personal area network (personal area network, PAN), a Local Area Network (LAN), a Wide Area Network (WAN), a Metropolitan Area Network (MAN), or one or more portions of the internet, or a combination of two or more of these. One or more portions of one or more of these networks may be wired or wireless. By way of example, computer system 1700 may communicate with the following networks: wireless PAN (WPAN) (e.g., bluetooth WPAN (BLUETOOTH WPAN)), wi-Fi network, wi-MAX network, cellular telephone network (e.g., global system for mobile communications (Global System for Mobile Communications, GSM) network), or other suitable wireless network, or a combination of two or more of these. Computer system 1700 may include any suitable communication interface 1710 for any of these networks, where appropriate. The communication interface 1710 may include one or more communication interfaces 1710, where appropriate. Although this disclosure describes and illustrates a particular communication interface, this disclosure contemplates any suitable communication interface.
In a particular embodiment, the bus 1712 includes hardware, software, or both that couple the components of the computer system 1700 to one another. By way of example, and not limitation, the bus 1712 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI Express (PCIe) bus, a serial advanced technology attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, or another suitable bus, or a combination of two or more of these. The bus 1712 may include one or more buses 1712, where appropriate. Although this disclosure describes and illustrates a particular bus, this disclosure contemplates any suitable bus or interconnect.
In this context, one or more computer-readable non-transitory storage media may include, where appropriate: one or more semiconductor-based integrated circuits (integrated circuit, ICs) or other integrated circuits (e.g., field-programmable gate array, FPGA) or application-specific IC (ASIC)), a Hard Disk Drive (HDD), a hybrid hard disk drive (hybrid hard drive, HHD), an optical disk drive (optical disc drive, ODD), a magneto-optical drive, a Floppy Disk Drive (FDD), a magnetic tape, a Solid State Drive (SSD), a RAM drive, a SECURE DIGITAL (SECURE DIGITAL) card or drive, any other suitable computer-readable non-transitory storage medium, or any suitable combination of two or more of these. The computer-readable non-transitory storage medium may be volatile, nonvolatile, or a combination of volatile and nonvolatile, where appropriate.
Others
Herein, "or" is inclusive rather than exclusive, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, "A or B" means "A, B, or both," unless expressly indicated otherwise or indicated otherwise by context. Moreover, "and" is both joint and several, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, "A and B" means "A and B, jointly or severally," unless expressly indicated otherwise or indicated otherwise by context.
The scope of the present disclosure encompasses all changes, substitutions, variations, alterations, and modifications to the example embodiments described or illustrated herein that a person of ordinary skill in the art would comprehend. The scope of the present disclosure is not limited to the example embodiments described or illustrated herein. Furthermore, although the present disclosure describes and illustrates respective embodiments herein as including particular components, elements, features, functions, operations, or steps, any of these embodiments may include any combination or permutation of any of the components, elements, features, functions, operations, or steps described or illustrated anywhere herein that a person of ordinary skill in the art would comprehend. Furthermore, references in the appended claims to an apparatus or system, or to a component of an apparatus or system, being adapted, arranged, capable, configured, implemented, operable, or operative to perform a particular function encompass that apparatus, system, or component, whether or not it or that particular function is activated, turned on, or unlocked, as long as that apparatus, system, or component is so adapted, arranged, capable, configured, implemented, operable, or operative. Furthermore, while particular embodiments have been described or illustrated herein as providing particular advantages, particular embodiments may provide some, all, or none of these advantages.

Claims (15)

1. A method comprising, by a client system:
presenting, through a user interface of the client system, a text message based on a user utterance received at the client system, wherein the text message includes a plurality of n-grams;
receiving, at the client system, a first user request to edit the text message;
presenting, through the user interface, the text message visually divided into a plurality of blocks, wherein each block includes one or more of the plurality of n-grams of the text message, and wherein the plurality of n-grams in each block are continuous with respect to each other and are grouped within the block based on analysis of the text message by a Natural Language Understanding (NLU) module;
receiving, at the client system, a second user request to edit one or more of the plurality of blocks; and
presenting, through the user interface, the edited text message, wherein the edited text message is generated based on the second user request.
2. The method of claim 1, further comprising:
presenting, via the user interface, a prompt for entering the second user request, wherein the second user request includes information for editing the one or more blocks.
3. The method of claim 1 or 2, wherein each of the plurality of blocks is visually partitioned using one or more of: geometry, color, or identifier.
4. The method of claim 1, wherein one or more of the first user request or the second user request is based on one or more of: voice input, gesture input, or gaze input.
5. A method according to any one of claims 1 to 3, wherein the first user request is based on gesture input, and wherein the method further comprises:
presenting, via the user interface, a gesture-based menu comprising a plurality of selection options for editing the plurality of tiles, wherein the second user request comprises: one or more selection options of the plurality of selection options corresponding to the one or more blocks are selected based on one or more gesture inputs.
6. The method of claim 1, wherein the second user request includes a gesture input intended to clear the text message, and wherein editing one or more of the plurality of blocks comprises: clearing the n-grams corresponding to the one or more blocks.
7. The method of claim 6, further comprising:
determining, by a gesture classifier and based on one or more attributes associated with the gesture input, that the gesture input is intended to clear the text message.
8. The method of the preceding claim, wherein the plurality of blocks are visually partitioned using a plurality of identifiers, respectively, and wherein the second user request comprises: one or more references to one or more identifiers of one or more corresponding blocks; and preferably wherein the plurality of identifiers comprises one or more of: numbers, letters, or symbols.
9. The method of claim 1, wherein the second user request comprises: a speech input referencing the one or more blocks; and preferably wherein the references to the one or more blocks in the second user request comprise ambiguous references, and wherein the method further comprises:
disambiguating the ambiguous reference based on a speech similarity model.
10. The method of claim 1, wherein one or more of the first user request or the second user request comprises: a voice input from a first user of the client system, and wherein the method further comprises:
detecting a second user proximate to the first user based on sensor signals acquired by one or more sensors of the client system; and
determining, based on one or more gaze inputs of the first user, that the first user request and the second user request are directed to the client system.
11. The method of claim 1, wherein the second user request includes one or more gaze inputs directed to the one or more blocks.
12. The method of any preceding claim, further comprising:
editing the text message based on the second user request.
13. The method of claim 12, wherein editing the text message comprises: changing one or more of the one or more n-grams in each of one or more of the one or more blocks to one or more other n-grams, respectively; or preferably, wherein editing the text message includes: adding one or more n-grams to each of one or more of the one or more blocks; or preferably, wherein editing the text message includes: altering an order associated with the one or more n-grams in each of one or more of the one or more blocks.
14. One or more computer-readable non-transitory storage media embodying software that is operable when executed to:
present, through a user interface of a client system, a text message based on a user utterance received at the client system, wherein the text message includes a plurality of n-grams;
receive, at the client system, a first user request to edit the text message;
present, through the user interface, the text message visually divided into a plurality of blocks, wherein each block includes one or more of the plurality of n-grams of the text message, and wherein the one or more n-grams in each block are continuous with respect to each other and grouped within the block based on analysis of the text message by a Natural Language Understanding (NLU) module;
receive, at the client system, a second user request to edit one or more of the plurality of blocks; and
present, through the user interface, the edited text message, wherein the edited text message is generated based on the second user request.
15. A system comprising: one or more processors; and a non-transitory memory coupled to the one or more processors and comprising instructions executable by the one or more processors, the one or more processors being operable when executing the instructions to:
present, through a user interface of a client system, a text message based on a user utterance received at the client system, wherein the text message includes a plurality of n-grams;
receive, at the client system, a first user request to edit the text message;
present, through the user interface, the text message visually divided into a plurality of blocks, wherein each block includes one or more of the plurality of n-grams of the text message, and wherein the plurality of n-grams in each block are continuous with respect to each other and grouped within the block based on analysis of the text message by a Natural Language Understanding (NLU) module;
receive, at the client system, a second user request to edit one or more of the plurality of blocks; and
present, through the user interface, the edited text message, wherein the edited text message is generated based on the second user request.
CN202280019144.4A 2021-03-03 2022-03-03 Text editing using voice and gesture input for auxiliary systems Pending CN116897353A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/156,209 2021-03-03
US17/407,922 US20220284904A1 (en) 2021-03-03 2021-08-20 Text Editing Using Voice and Gesture Inputs for Assistant Systems
US17/407,922 2021-08-20
PCT/US2022/018697 WO2022187480A1 (en) 2021-03-03 2022-03-03 Text editing using voice and gesture inputs for assistant systems

Publications (1)

Publication Number Publication Date
CN116897353A true CN116897353A (en) 2023-10-17

Family

ID=88312502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280019144.4A Pending CN116897353A (en) 2021-03-03 2022-03-03 Text editing using voice and gesture input for auxiliary systems

Country Status (1)

Country Link
CN (1) CN116897353A (en)

Similar Documents

Publication Publication Date Title
US20230418875A1 (en) Context Carryover Across Tasks for Assistant Systems
US20230401170A1 (en) Exploration of User Memories in Multi-turn Dialogs for Assistant Systems
US20220284904A1 (en) Text Editing Using Voice and Gesture Inputs for Assistant Systems
US20230128422A1 (en) Voice Command Integration into Augmented Reality Systems and Virtual Reality Systems
KR20230029582A (en) Using a single request to conference in the assistant system
US20220366904A1 (en) Active Listening for Assistant Systems
EP4327197A1 (en) Task execution based on real-world text detection for assistant systems
US20240054156A1 (en) Personalized Labeling for User Memory Exploration for Assistant Systems
US20220269870A1 (en) Readout of Communication Content Comprising Non-Latin or Non-Parsable Content Items for Assistant Systems
US20220358917A1 (en) Multi-device Mediation for Assistant Systems
US20220366170A1 (en) Auto-Capture of Interesting Moments by Assistant Systems
US20230419952A1 (en) Data Synthesis for Domain Development of Natural Language Understanding for Assistant Systems
CN116897353A (en) Text editing using voice and gesture input for auxiliary systems
US20230353652A1 (en) Presenting Personalized Content during Idle Time for Assistant Systems
EP4343493A1 (en) Presenting attention states associated with voice commands for assistant systems
US20230236555A1 (en) Event-Based Reasoning for Assistant Systems
TW202240461A (en) Text editing using voice and gesture inputs for assistant systems
CN117396837A (en) Multi-device mediation of assistant systems
CN117377942A (en) Active listening of assistant systems
WO2022178066A1 (en) Readout of communication content comprising non-latin or non-parsable content items for assistant systems
CN117396838A (en) Task execution based on real-world text detection for assistant systems
CN117396836A (en) Automatic acquisition of interesting moments by an assistant system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination