US20220076678A1 - Receiving a natural language request and retrieving a personal voice memo - Google Patents
Receiving a natural language request and retrieving a personal voice memo
- Publication number
- US20220076678A1
- Authority
- US
- United States
- Prior art keywords
- memo
- memos
- database
- user
- natural language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/63—Querying
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Definitions
- Some voice memo applications, such as Zoho Notebook® and Voice Memos® for iOS®, allow users to record and play back memos, starting and stopping using a manual modality (see submitted non-patent literature “Zoho”).
- Some voice memo applications do not support explicit or implicit searching for information in memos or retrieving information from the memos using voice modalities.
- Conventional smart-speaker virtual assistants allow storing and retrieving information using voice in limited ways. For example, Google Assistant® and Siri® can add and retrieve events from a cloud-stored calendar. However, using the feature requires the user to specify the content and the requests precisely to make the system do what is desired. For example, if a user asks Siri® “When is my husband's birthday?” and that information has not been pre-set in that user's device or device ecosystem, Siri® will reply “I don't know who your husband is.”
- Cardona® teaches, at a high level, how to use various current commercial virtual assistants to store arbitrary voice notes (see submitted non-patent literature “Cardona”). All systems implemented by Cardona® essentially transcribe speech to text that users can retrieve only through a visual modality. Prior art systems do not even allow a system to read back notes, or a summary of notes, using text-to-speech. Doing so without wasting the user's time on extraneous neighboring words and irrelevant information is a non-trivial and unsolved problem.
- Voicera® describes the existence, without enablement, of summarization of voice notes (see submitted non-patent literature “Voicera”). However, Voicera® still relies on a visual modality for reviewing information and does not address the problem of providing relevant information for users, using a speech modality, without wasting time with extraneous neighboring words and irrelevant information.
- voice-enabled virtual assistants currently do not have the capability to intelligently learn the preferences or favorites of a user and then later use that information to answer a question from the user. For example, Siri® does not learn a person's preferences or favorites in an intelligent manner.
- Siri® thinks that the user is asking about Siri's preference and a response is provided to the user as “I don't eat out that much.”
- when other virtual assistants are asked “What is my favorite restaurant?” they pick a restaurant that has the word “favorite” in its name, such as “My Favorite Cafe.”
- the Google Maps® application has an option to add places to a “Favorites” list, a “Want to Go” list or a “Starred Places” list, but it does not allow those lists to be queried using one's own voice.
- Google Assistant® has a feature of remembering a favorite place; however, it is able to store only a limited number of places and doesn't allow users to reliably query them (e.g., give directions to that place).
- a Google Assistant® (GA) interaction goes as follows: (i) user: “do you know what my favorite restaurant is?”; (ii) GA: “I don't know that yet. What's your favorite restaurant?”; (iii) user: “my favorite restaurant is Red Lobster,” (iv) GA: “OK, I'll remember that”; (v) user: “do you know what is my favorite beach?”; (vi) GA: “I remember you told me.
- the technology disclosed relates to (i) speech enabled virtual assistants implementing technology that is capable of recording voice memorandums (i.e., memos), intelligently storing the memos along with information derived from the memos, and intelligently retrieving information contained in or derived from the stored memos and (ii) speech enabled virtual assistants implementing technology that intelligently stores favorite information of a user for subsequent retrieval and presentation to the user at the appropriate time.
- the technology disclosed (i) receives, by a virtual assistant, a natural language utterance that includes memo information, (ii) interprets the received utterance according to a natural language grammar rule associated with a memo domain, (iii) stores, in a database, a memo derived from the interpretation of the memo information, (iv) receives another natural language utterance expressing a request (i.e., a request to query memo data from the database), (v) interprets the utterance expressing the request according to a natural language grammar rule for retrieving memo data, such that the rule recognizes query information, (vi) in response to a successful interpretation, uses the recognized query information to query the database for specific memo data related to that query information, and (vii) provides, to the user, a response generated in dependence upon the queried-for specific memo data.
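The store-then-retrieve flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two regular expressions stand in for the memo-domain grammar rules, an in-memory list stands in for the database, and the names `STORE_RULE`, `QUERY_RULE`, `store_memo` and `answer_query` are hypothetical and do not come from the disclosure.

```python
import re

# Hypothetical grammar rules for a toy memo domain (assumed patterns).
STORE_RULE = re.compile(
    r"^(?:remember that )?i cook (?P<food>\w+) in the (?P<place>\w+) "
    r"for (?P<minutes>\d+) minutes$"
)
QUERY_RULE = re.compile(r"^how long (?:should|do) i cook (?P<food>\w+)$")

memo_db = []  # stands in for the database of interpreted memos


def normalize(utterance):
    """Lowercase a transcription and drop trailing punctuation."""
    return utterance.lower().strip().rstrip("?.!")


def store_memo(utterance):
    """Interpret a statement with the storing rule and save the memo."""
    match = STORE_RULE.match(normalize(utterance))
    if match is None:
        return False  # the utterance is not a memo this rule understands
    memo = match.groupdict()
    memo["minutes"] = int(memo["minutes"])
    memo_db.append(memo)
    return True


def answer_query(utterance):
    """Interpret a request with the retrieval rule and query the memos."""
    match = QUERY_RULE.match(normalize(utterance))
    if match is None:
        return None  # interpretation failed, so no database query is made
    food = match.group("food")
    for memo in reversed(memo_db):  # prefer the most recently stored memo
        if memo["food"] == food:
            return (f"You should cook your {food} in the {memo['place']} "
                    f"for {memo['minutes']} minutes.")
    return f"I don't have a memo about cooking {food}."
```

Calling `store_memo("Remember that I cook lasagna in the oven for 30 minutes.")` and then `answer_query("How long should I cook lasagna?")` yields a response generated from the stored memo rather than a replay of the original utterance.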
- the technology disclosed operates in a similar manner as the storing and retrieval of memos.
- FIG. 1 illustrates a block diagram of an example environment capable of speech enabled virtual assistants implementing technology that is capable of receiving a request or query and intelligently retrieving information contained in or derived from previously stored memos.
- FIG. 2 illustrates a block diagram of an example environment capable of speech enabled virtual assistants implementing technology that is capable of recording voice memorandums (i.e., memos) and intelligently storing the memos along with information derived from the memos.
- FIG. 3 illustrates a block diagram of an example environment capable of speech enabled virtual assistants implementing technology that intelligently retrieves and presents favorite information of a user contained in or derived from previously identified and stored favorites.
- FIG. 4 illustrates a block diagram of an example environment capable of speech enabled virtual assistants implementing technology that is capable of receiving favorites and intelligently storing the favorites along with information derived from the favorites.
- FIGS. 5A, 5B and 5C show three example implementations of the technology disclosed using different types of virtual assistants.
- FIG. 6 shows an overhead view of an automobile designed to implement the technology disclosed.
- FIG. 7 illustrates an example environment in which personal memos and/or favorites can be stored, searched and retrieved for generation of intelligent responses using the technology disclosed.
- FIG. 8 is a block diagram of an example computer system that can implement various components of the environment of FIG. 7 .
- FIG. 9 illustrates TABLE 1, which includes example phrases that would trigger the storing of a personal memo.
- FIG. 10 illustrates TABLE 2, which includes example phrases that would trigger the storing of a personal memo.
- FIG. 11 illustrates TABLE 3, which includes example ways of invoking the storing of favorite information, querying favorite information and possible responses from a virtual assistant.
- FIG. 12 illustrates TABLE 4, which includes example ways of using favorite information for obtaining directions and travel information.
- FIG. 13 illustrates TABLE 5, which includes example ways of storing multiple favorites for a specific category and then later obtaining specific information for both of the favorites in the same category or obtaining favorite information of multiple favorites based on geographical location.
- An aspect of the technology disclosed relates to speech-enabled virtual assistants implementing recognition technology that is capable of recording voice memorandums (i.e., memos, or personal memos), intelligently storing the memos along with information derived from the memos, and intelligently retrieving information contained in or derived from the stored memos.
- Examples of recording voice memorandums (i.e., memos, or personal memos), intelligently storing the memos along with information derived from the memos, and intelligently retrieving information contained in or derived from the stored memos are provided below.
- the first example relates to cooking lasagna.
- the scenario is that just about every recipe on the internet indicates that lasagna should be cooked for 40 minutes. However, a particular user has determined that with their oven 40 minutes is too much, and as a result, their lasagna is always burned. The user was able to determine through experience that the perfect cooking time for their lasagna is 30 minutes. In order to remember that the perfect time for cooking lasagna in their oven is 30 minutes, the user will have an interaction with a virtual assistant (or some other type of technology that is capable of speech recognition and feedback) as follows (note that only the text in italics is the voice exchange or interaction with the virtual assistant; and the virtual assistant is named Hound):
- the second example relates to finding or locating lost objects.
- the scenario is that a user places an object somewhere (e.g., for hiding or storage), where the user wants to be sure to remember where the object was placed. Instead of writing a text, email or physical message to oneself, the user would have the following interaction with the virtual assistant.
- Another aspect of the technology disclosed relates to speech enabled virtual assistants implementing technology that intelligently stores favorite information of a user for subsequent retrieval and presentation to the user at the appropriate time.
- a concept is that the favorite information of the user is stored, such as favorite restaurants, grocery stores, beauty salons, gyms, recreation spots, parking garages, friends and family, etc. and then later used to answer inquiries from the user.
- Examples of this technology, which is capable of recording and intelligently storing memos and related information and retrieving information in dependence upon the stored memos, are provided below.
- the first example relates to favorite places and the scenario is that the user tells the virtual assistant about a favorite restaurant and then later on asks for directions to that restaurant.
- the second example relates to a routine commute and the scenario is that the user goes to the same gym, bar, grocery store etc. on a regular basis, so she tells the virtual assistant to remember this particular place as a favorite for later retrieval.
- the third example relates to making recommendations and the scenario is that a user asks a virtual assistant for a recommendation, where the user has previously given the virtual assistant some information about favorite restaurants, etc. or perhaps where the user has not previously provided favorite information.
- FIG. 1 illustrates a block diagram of an example environment 100 capable of speech enabled virtual assistants implementing technology that is capable of receiving a request or query and intelligently retrieving information contained in or derived from previously stored memos.
- the term “intelligently retrieving” is mentioned because the environment 100 , as discussed in further detail below, is capable of not just repeating a previous statement made by the user but is able to derive a more useful response to the user, as a result of having previously stored a memo or personal memo provided by the user.
- FIG. 1 illustrates that the example environment 100 includes a speech input 102 being received from a microphone or some other type of input device (e.g., an application running on a mobile phone or tablet, etc.).
- the speech input 102 includes search or query request 103 (hereinafter query 103 ).
- the query 103 can be in the form of a natural language utterance spoken by the user.
- the speech input 102 can be received by a virtual assistant (not illustrated) as query 103 .
- Speech enabled virtual assistants will simply be referred to herein as “virtual assistants” or a “virtual assistant.”
- a virtual assistant can be a device or an application residing on a device, such as a smart phone, a watch, glasses, a television, an automobile, etc.
- the virtual assistant is capable of interacting with a user using the user's speech and is capable of, for example, (i) providing information back to the user (e.g., an answer to a question), (ii) providing an actionary response (e.g., changing the thermostat or locking the doors to an automobile) or (iii) storing information for later retrieval (remotely or locally) or for increasing the knowledge base of the virtual assistant.
- a virtual assistant can monitor sound (e.g., conversations) to listen for a wake phrase that engages the virtual assistant and to listen to a trigger phrase uttered after the wake phrase that directs the virtual assistant (or any system in communication with the virtual assistant) to a particular domain.
- a wake phrase can be just one word or multiple words and a trigger phrase can be just one word or multiple words.
- the query 103 will be transcribed by the virtual assistant (or a system connected to the virtual assistant as described below with respect to FIG. 7 ) in operation 106 .
- text obtained from the transcriptions of the query 103 will be used to determine whether or not the user intended to query a particular domain, such as a memo domain 108 . If the memo domain 108 is identified, then the text obtained from the transcriptions will be interpreted using a particular grammar rule.
- a domain represents a particular subject area, and comprises or is associated with a specific grammar rule.
- a specific grammar rule is not necessarily one single rule but can be a set of rules that are suited to interpret a transcription of a natural language utterance that is related to a specific domain.
- the process of interpreting a natural language utterance within a particular domain produces exactly one interpretation. Different interpretations arise when systems interpret a natural language utterance in the context of different domains. Each interpretation represents the meaning of the natural language utterance as interpreted by a domain. For example, users make requests, such as asking “What time is it?” or directing the system to “Send a message,” and systems provide responses, such as by speaking the time.
- the natural language utterance that expresses the request can be interpreted according to a natural language grammar rule for retrieving memo data. This rule is obtained from the memo domain 108 . Further, the natural language grammar rule is interpreted to recognize query information from the natural language utterance (e.g. query 103 ). As an example, in operation 106 the received natural language utterance is “How long should I cook lasagna?”
- a memo transcription database 112 can be queried using the interpreted natural language utterance.
- the memo transcription database 112 includes text from previous natural language utterances directed to personal memos.
- the memo transcription database 112 can be an unstructured or a structured database storing unstructured or structured data. However, as previously discussed, merely providing text back to a user that has not been interpreted according to a specific domain would not be as helpful to the user.
- a memo interpretation database 114 is queried using the interpreted natural language utterance.
- the memo interpretation database 114 includes interpretations of natural language utterances directed to personal memos.
- the memo interpretation database 114 can be an unstructured or a structured database storing unstructured or structured data. Because the interpretations of the natural language utterances are made using a particular natural language grammar rule associated with the memo domain 108 , the information stored and retrieved from the memo interpretation database 114 will be easier to search and provide more accurate and meaningful results.
- An example memo retrieved from the memo interpretation database 114 could be structured data, such as “cook.lasagna.oven.30-minutes” that can be used to generate a response, or an example memo retrieved from the memo interpretation database 114 could already be in a form that is phrased as a natural language response such as “Violet, you should cook your lasagna in your oven for 30 minutes.”
- After obtaining the memo from the memo transcription database 112 or the memo interpretation database 114 in operation 110 , operation 118 generates an appropriate answer (response) for the user. As discussed above and in further detail below, an aspect of the technology disclosed is capable of providing a meaningful (appropriate) response to the user that is not necessarily a word-for-word repeat of a previously stored transcription, but something that is more helpful in answering the user's request or query. If operation 110 obtains the memo from the memo transcription database 112 , then the memo can be further interpreted using the specific grammar rule for retrieving memo data.
- the retrieved memo “To get a perfect lasagna, I cook it in the oven for 30 minutes” could be interpreted to generate a response such as “Violet, you should cook your lasagna in your oven for 30 minutes.”
- the memo retrieved from the memo interpretation database 114 is structured as “cook.lasagna.oven.30-minutes,” the system will generate “Violet, you should cook your lasagna in your oven for 30 minutes,” as an appropriate response.
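As a rough sketch of how a structured memo such as “cook.lasagna.oven.30-minutes” could be turned into the spoken response quoted above, the following Python function splits the dotted string and fills a sentence template. The field order (action, food, place, duration), the template and the function name `render_memo` are assumptions made for illustration; the disclosure does not define the layout of the structured data.

```python
def render_memo(structured, user_name):
    """Turn a dotted structured memo into a natural language response.

    Assumes (hypothetically) four dot-separated fields in the order
    action.food.place.duration, with the duration written as
    "<amount>-<unit>".
    """
    action, food, place, duration = structured.split(".")
    amount, unit = duration.split("-")
    return (f"{user_name}, you should {action} your {food} "
            f"in your {place} for {amount} {unit}.")


response = render_memo("cook.lasagna.oven.30-minutes", "Violet")
```

Keeping the stored memo structured and generating the sentence at response time means the same record can later be rendered differently, for example as text on a mobile device instead of speech.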
- the appropriate response or answer will be provided to the user in operation 120 , in the form of speech 122 or message/text to a mobile device 124 or some other device similar thereto.
- FIG. 2 illustrates a block diagram of an example environment capable of speech or text enabled virtual assistants implementing technology that is capable of recording voice memorandums (i.e., memos) and intelligently storing the memos along with information derived from the memos.
- FIG. 2 illustrates an environment 200 that implements the storing of a natural language utterance in the memo transcription database 112 and/or the memo interpretation database 114 .
- the environment of FIG. 2 is very similar to that of FIG. 1 , except that a statement 203 is received that causes the virtual assistant to store some or all of the statement 203 as a memo as opposed to conducting a query. Descriptions of redundant elements of FIG. 2 are omitted.
- the statement 203 is transcribed and then a domain, such as the memo domain 108 is identified.
- the text transcribed from the statement 203 is interpreted using a specific grammar rule for storing a memo that is associated with or included in the memo domain 108 .
- the memo obtained from the transcription of the natural language utterance is stored as a transcription in the memo transcription database 112 , and in operation 212 the memo obtained from an interpretation of the natural language utterance is stored in the memo interpretation database 114 .
- the actual recording of the natural language utterance that expresses the statement 203 can be stored in another database, or even the memo transcription database 112 and/or the memo interpretation database 114 .
- the differences between transcriptions and interpretations and between the memo transcription database 112 and the memo interpretation database 114 are described above in detail with reference to FIG. 1 .
- feedback is provided to the user in the form of speech 122 or message/text to a mobile device 124 or some other device similar thereto.
- the speech can include a request for confirmation to the user to confirm whether or not they intended to store a personal memo, or a confirmation to the user that the information has been stored as a personal memo.
- One aspect of the technology disclosed includes assigning a time period to a memo after which the memo will expire and then removing the memo (or memo related information) from the memo transcription database 112 and/or the memo interpretation database 114 .
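Memo expiry of this kind can be sketched as a time-to-live attached to each memo plus a purge step. The `Memo` class, its `ttl` field and the `purge_expired` function are hypothetical names chosen for this example, and a plain list stands in for the memo databases.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Memo:
    text: str
    stored_at: datetime
    ttl: Optional[timedelta] = None  # None: the memo never expires


def is_expired(memo: Memo, now: datetime) -> bool:
    """A memo expires once its assigned time period has fully elapsed."""
    return memo.ttl is not None and now >= memo.stored_at + memo.ttl


def purge_expired(memos: List[Memo], now: datetime) -> List[Memo]:
    """Return only the memos that are still valid at time `now`."""
    return [m for m in memos if not is_expired(m, now)]
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the purge step deterministic and easy to test.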
- Another aspect of the technology disclosed includes interpreting the query 103 and/or the statement 203 according to multiple domains (e.g., multiple grammar rules), wherein each domain of the multiple domains has an associated relevancy score for the interpreted utterance.
- the memo domain 108 is one domain of the multiple domains and the memo domain 108 has an advantage over the other domains with respect to interpreting queries and statements related to personal memos. As such, when any of the query 103 and/or the statement 203 is directed to a personal memo, the interpretation using the memo domain 108 will have the highest relevance score as compared to the other domains. Additionally, different interpretations of the query 103 and/or the statement 203 using the multiple domains can be stored in the memo interpretation database 114 .
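One simple way to picture interpretation by multiple domains with relevance scores is a keyword-weight table per domain, with the highest-scoring domain selected. The sketch below is a deliberately crude stand-in: the domain names, keywords and weights are invented for illustration and do not reflect how the disclosed system computes relevance.

```python
import re

# Hypothetical per-domain keyword weights (assumed values).
DOMAIN_KEYWORDS = {
    "memo": {"i": 0.5, "my": 0.5, "usually": 0.3},
    "recipes": {"cook": 0.4, "recipe": 0.6},
    "clock": {"time": 0.7},
}


def score_domains(utterance):
    """Score the utterance against every domain's keyword table."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    return {
        name: sum(weight for word, weight in keywords.items() if word in words)
        for name, keywords in DOMAIN_KEYWORDS.items()
    }


def best_domain(utterance):
    """Pick the domain with the highest relevance score, if any matched."""
    scores = score_domains(utterance)
    name = max(scores, key=scores.get)
    return name if scores[name] > 0 else None
```

With these assumed weights, “How long do I usually cook lasagna?” scores highest in the memo domain because of the personal pronoun, while “What time is it?” falls to the clock domain.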
- the information stored in the memo interpretation database can be stored along with additional information, such as meta-data or meta-information that describes the memo as pertaining to a short-term activity, daily weather, or an until-event, such as a child being at soccer practice, which is cancelled (or deleted) when the parent arrives at and then leaves the soccer field after picking up the child.
- the meta-data or meta-information can be explicitly stated by the user (e.g., “I'll be at work until 5 pm”) or it can be inferred from other information obtained from the user, such as other personal memos, other calendar information or other routine information obtained from general tendencies of the user.
- virtual assistants or related devices often have wake phrases to indicate to the virtual assistant that the user is attempting to engage or use the virtual assistant. Assume that the technology disclosed utilizes a standard wake phrase of “Ok Hound” to engage the virtual assistant.
- One way to indicate that a user's utterance is intended to retrieve information from a stored personal memo would be to assign specific wake phrases, such as “Ok Hound check my personal information for . . . ,” or “Hound check my memos for information regarding . . . ”.
- one way to indicate that a user's utterance is intended to be stored as a personal memo would be to assign specific wake phrases, such as “Ok Hound memo,” “Hound memo” or “Ok Hound remember.” Each of these example wake phrases would immediately indicate that the user is intending to retrieve or store a personal memo. However, sometimes users have difficulty remembering which wake phrases to use in which situation.
- the technology disclosed is capable of determining whether or not a natural language utterance received after a generic wake phrase includes a specific trigger phrase to indicate that the user intends to search for a memo or store a memo.
- a “trigger phrase” can include just a single word or multiple words
- a “wake phrase” can include just a single word or multiple words.
- The wake phrase and the trigger phrase can be used to make the system record, store and retrieve information to and from the “memo domain”. Additionally, weights on the “memo domain” can be invoked in order to make it the first domain (of multiple other domains) to consider when retrieving information.
- the trigger phrases can include personal pronouns, such as “I” (e.g., “Where did I put the key?”, “How long do I usually cook Lasagna?”) or possessives like “my” (e.g., “Where is my key?”).
- a trigger phrase may be identified as being an interrogative pronoun or a relative pronoun that is within 5 words of the personal pronoun, or a trigger phrase may be identified as being a personal pronoun followed by or preceded by an interrogative pronoun or a relative pronoun that is within 5 words of the personal pronoun.
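The proximity test described above can be sketched as a check on word positions. The word lists below are assumptions: they loosely include interrogative words such as “where” and “how” so that the example utterances given earlier match, and “within 5 words” is read here as a position difference of at most 5. The function name `has_memo_trigger` is hypothetical.

```python
import re

# Assumed word lists; a real system would use its own grammar resources.
PERSONAL = {"i", "my", "me", "mine"}
INTERROGATIVE_OR_RELATIVE = {
    "what", "which", "who", "whom", "whose", "that", "where", "when", "how",
}
MAX_DISTANCE = 5  # assumed reading of "within 5 words"


def has_memo_trigger(utterance):
    """True if a personal pronoun sits within MAX_DISTANCE words of an
    interrogative or relative word, on either side."""
    words = re.findall(r"[a-z']+", utterance.lower())
    personal = [i for i, w in enumerate(words) if w in PERSONAL]
    asking = [i for i, w in enumerate(words) if w in INTERROGATIVE_OR_RELATIVE]
    return any(abs(p - q) <= MAX_DISTANCE for p in personal for q in asking)
```

Under these assumptions, “Where did I put the key?” and “Where is my key?” both trigger the memo domain, while “What time is it?” does not because it contains no personal pronoun.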
- an appropriate grammar rule can also be selected in dependence upon the trigger phrase itself, other contents of the natural language utterance or a combination of both.
- FIG. 9 illustrates TABLE 1, which includes example phrases that would trigger the storing of a personal memo in the memo domain 108 or a specific sub-domain (e.g., cooking) of the memo domain 108 .
- Stage 1 examples require the stored memo and the query to be of a similar nature and the response is similar in nature as well. This is somewhat of a one-to-one correlation of the stored memo, the request and the response. This is the least complex of the stages, because the response is closely tied to the query.
- the query states, “do I usually leave . . . in the oven,” and the response states, “you usually leave . . . in the oven.”
- Stage 2 examples allow for more information to be inferred from the stored memo and the query for the memo and allow for different answers to be derived from the stored memo.
- the arrows on the first row of stage 2 indicate that the utterance used to invoke storage can be queried using three different options and there are three possibilities for response. In other words, each cell of stage 2 has three counterpart cells. Although the arrows do not indicate such due to space constraints on TABLE 1, the same goes for the second and third rows of stage 2. For example, in the second row of stage 2, the user can state “To get a perfect lasagna leave it for 30 minutes in the oven.” Now, this personal memo can be queried in, at least, three different ways.
- Stage 3 is the most complex stage, because it allows for additional information to be derived from the stored memo, not just the cooking time.
- the user most likely invoked the storage of the memo with a statement directed to the length of time for cooking lasagna, without really thinking about later retrieving an answer as to “where” the lasagna should be cooked.
- the virtual assistant identified at least two pieces of information from the memo, including the fact that the lasagna is cooked in the oven and that it is cooked for 30 minutes. Therefore, the virtual assistant can answer two different types of questions, including those related to how long to cook the lasagna and those related to where the lasagna should be cooked.
- FIG. 10 illustrates TABLE 2, which includes example phrases that would trigger the storing of a personal memo in the memo domain 108 or a specific sub-domain (e.g., object location) of the memo domain 108 , as well as ways to query the personal memo and possible responses from the virtual assistant.
- TABLE 2 is different from TABLE 1, because TABLE 2 also includes examples of grammar rules and sentence parsing that can be implemented to store memos along with additional information and how the memo and additional information can be used to identify a query and structure a response.
- each sentence used to invoke storage of a memo is parsed to identify various components. For example, in the first row of TABLE 2, the virtual assistant identifies the personal pronoun “I” and then looks for a verb that is near the “I”.
- any verb such as “put”, “am putting”, “'ll put” or “will put” that follows the “I” indicates to the virtual assistant that the utterance received from the user is related to the user putting an object somewhere.
- the virtual assistant then looks for a variable (i.e., variable1, such as keys) that is likely to be put somewhere.
- the virtual assistant looks for another variable (i.e., variable2) describing where variable1 is placed.
- this personal memo is stored with the additional information obtained from parsing the utterance, the memo can be queried when the user asks a question including any variation of the verb “put” along with variable1 (e.g., keys).
- Row 1 of TABLE 2 also describes the structure of the response with respect to the information included in the initial statement from the user and the subsequent query.
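- The parsing described for Row 1 of TABLE 2 (personal pronoun, a form of the verb “put”, variable1 and variable2) can be sketched with two regular expressions. The patterns, the list of prepositions, the function names and the response template are assumptions made for this example; the disclosure does not specify how the grammar rule is implemented.

```python
import re
from typing import Dict, List, Optional

# Hypothetical pattern: "I" + a form of "put" + variable1 + preposition + variable2.
PUT_STATEMENT = re.compile(
    r"^i\s*(?:put|am putting|'ll put|will put)\s+(?:my\s+|the\s+)?"
    r"(?P<variable1>.+?)\s+(?P<prep>in|on|under|at)\s+(?P<variable2>.+?)[.!]?$",
    re.IGNORECASE,
)
# Hypothetical pattern for the matching query.
PUT_QUERY = re.compile(
    r"^where\s+(?:did|do|will)\s+i\s+put\s+(?:my\s+|the\s+)?(?P<variable1>.+?)\??$",
    re.IGNORECASE,
)


def parse_statement(utterance: str) -> Optional[Dict[str, str]]:
    """Extract the object (variable1) and its location (variable2) from a statement."""
    m = PUT_STATEMENT.match(utterance.strip())
    if m is None:
        return None
    return {
        "verb": "put",
        "variable1": m["variable1"].lower(),
        "prep": m["prep"].lower(),
        "variable2": m["variable2"].lower(),
    }


def answer(query: str, memos: List[Dict[str, str]]) -> Optional[str]:
    """Answer a "where did I put ..." query from the most recently stored match."""
    m = PUT_QUERY.match(query.strip())
    if m is None:
        return None
    target = m["variable1"].lower()
    for memo in reversed(memos):  # newest memo is last in the list
        if memo["variable1"] == target:
            return f"You put your {memo['variable1']} {memo['prep']} {memo['variable2']}."
    return None
```

- For example, storing the parse of “I put my keys in the backpack.” and then asking “Where did I put my keys?” yields a response built from variable1 and variable2 rather than a word-for-word repeat of the stored statement.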
- the system may invoke user feedback to confirm whether or not a user intended to search for an answer based on a personal memo or to store a personal memo. If the user indicates that they did not intend to query a personal memo, then a different domain will be used to provide a response to the user's question. If the user indicates that they did not intend to store an utterance as a personal memo, then the personal memo will not be stored, or it will be deleted if it was stored.
- the confirmation requests to the user can be auditory or in the form of text and the user responses to the confirmation requests can be auditory or in the form of text. Additionally, if the virtual assistant cannot locate a memo that provides an answer to the user's request, then the virtual assistant can ask for a clarification.
- a user can store and query multiple memos that are related to the same subject. For example, a user may indicate that they put their keys in a refrigerator for safe keeping. Then at a later point the user may indicate that they put their keys in their backpack. Now, when a user asks where their keys are located, the virtual assistant should be able to indicate to the user that their keys are stored in their backpack. This scenario can be handled in many different ways. First, the virtual assistant may store each memo with time information and then make an assumption that when the user asks about the location of their keys, the user is referring to the most recent memo about their keys. This is essentially time ordering all of the memos related to the location of the user's keys.
- By saving all of the memos regarding the location of the user's keys, the virtual assistant will be able to tell the user where they placed the keys before they were placed in the backpack. This would be helpful if the user actually did not put them in the backpack. In this case, the user would probably find their nicely cooled keys in the refrigerator.
- a virtual assistant would parse search type statements to identify entities and attributes of the entities; search a database of memo information for the entity; and for database records related to the entity, check for the most recent one relating to the same attribute.
- the entity would be keys and the attribute would be location.
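- The time-ordering option described above can be sketched as a log of records keyed by entity and attribute. The class and field names below are hypothetical, and timestamps are passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemoRecord:
    entity: str       # e.g., "keys"
    attribute: str    # e.g., "location"
    value: str        # e.g., "backpack"
    stored_at: float  # time the memo was received, in seconds


class MemoLog:
    """Keeps every memo so that earlier values remain available."""

    def __init__(self) -> None:
        self._records: List[MemoRecord] = []

    def store(self, entity: str, attribute: str, value: str, stored_at: float) -> None:
        self._records.append(MemoRecord(entity, attribute, value, stored_at))

    def history(self, entity: str, attribute: str) -> List[MemoRecord]:
        """All records for the entity and attribute, newest first."""
        matches = [
            r for r in self._records
            if r.entity == entity and r.attribute == attribute
        ]
        return sorted(matches, key=lambda r: r.stored_at, reverse=True)

    def latest(self, entity: str, attribute: str) -> Optional[MemoRecord]:
        """The most recent record, assumed to be the one the user is asking about."""
        records = self.history(entity, attribute)
        return records[0] if records else None
```

- Because nothing is deleted, the second entry of `history("keys", "location")` still gives the earlier location if the most recent memo turns out to be wrong.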
- a second option would be to delete all previous memos relating to the location of the user's keys upon the storing of the most recent memo regarding the user's keys being in the backpack.
- a virtual assistant would parse store type statements to identify entities and attributes; search a database for records about the same attribute of the same entity (only one should be found); delete the record; and store a new record with the new information about the entity and its attribute.
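- The delete-then-store option can be sketched with an in-memory SQLite table. The table layout and function names are assumptions for illustration; the disclosure does not name a particular database.

```python
import sqlite3
from typing import Optional


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical memo table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE memos (entity TEXT, attribute TEXT, value TEXT)")
    return conn


def store_memo(conn: sqlite3.Connection, entity: str, attribute: str, value: str) -> None:
    """Delete any record for the same entity and attribute, then store the new one."""
    with conn:  # both statements commit together, or neither does
        conn.execute(
            "DELETE FROM memos WHERE entity = ? AND attribute = ?",
            (entity, attribute),
        )
        conn.execute(
            "INSERT INTO memos (entity, attribute, value) VALUES (?, ?, ?)",
            (entity, attribute, value),
        )


def lookup(conn: sqlite3.Connection, entity: str, attribute: str) -> Optional[str]:
    row = conn.execute(
        "SELECT value FROM memos WHERE entity = ? AND attribute = ?",
        (entity, attribute),
    ).fetchone()
    return row[0] if row else None
```

- Compared with time ordering, this keeps at most one record per entity and attribute, at the cost of losing the earlier location.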
- FIG. 3 illustrates a block diagram of an example environment capable of speech enabled virtual assistants implementing technology that intelligently retrieves and presents favorite information of a user contained in or derived from previously identified and stored favorites.
- the environment 300 illustrated in FIG. 3 is similar to the environment 100 of FIG. 1 , except that the query 103 is directed to a favorites domain 308 for the purpose of obtaining information from a favorites transcription database 312 or a favorites interpretation database 314 .
- the favorites domain 308 is similar to the memo domain 108 of FIG. 1 , except that the favorites domain 308 has a different grammar rule for interpreting the query 103 .
- the favorites transcription database 312 stores transcriptions of previously stored natural language utterances related to “favorites” of a user and the favorites interpretation database 314 stores interpretations of natural language utterances related to “favorites” of a user.
- favorites are different from personal memos, because they are inherently narrower in scope and have a longer duration of relevance.
- Some example categories of favorites could be favorite types of food, grocery stores, hotels, friends, gymnasiums or recreation facilities, hair dressers, schools, colleges, sports teams, etc.
- FIG. 4 illustrates a block diagram of an example environment capable of speech enabled virtual assistants implementing technology that is capable of receiving favorites and intelligently storing the favorites along with information derived from the favorites.
- the environment 400 of FIG. 4 is similar to the environment 200 of FIG. 2 , except that the statement 202 is (i) interpreted using the favorites domain 308 , (ii) transcribed and stored in the favorites transcription database 312 and (iii) interpreted for storage in the favorites interpretation database 314 .
- All of the descriptions provided above with respect to FIGS. 1 and 2 and memos are applicable to the storing and retrieval of favorites and information derived from the favorites. For example, wake phrases, trigger phrases, etc., are applicable to favorites.
- a memo and/or memo related information can indicate that a specific entity is a favorite of the user.
- FIG. 11 illustrates TABLE 3, which includes some example ways of invoking the storing of favorite information, querying favorite information and possible responses from a virtual assistant.
- FIG. 12 illustrates TABLE 4, which is similar to TABLE 3, except that it illustrates some example ways of using favorite information for obtaining directions and travel information.
- FIG. 13 illustrates TABLE 5, which is similar to TABLE 4, except that it illustrates some example ways of storing multiple favorites for a specific category and then later obtaining specific information for both of the favorites in the same category or obtaining favorite information of multiple favorites based on geographic location.
- favorites can include building a recommendations table based on the user's stored favorites.
- User “I like Red Lobster® Restaurant”
- Virtual Assistant obtains information regarding Red Lobster Restaurant from another service, such as Yelp® (e.g., Seafood/Bar/Kids' menu/Casual & Cozy/3.9 stars/etc.);
- Virtual Assistant “There are other restaurants in the area that have similar characteristics and ratings as your other favorites such as Fish Market Restaurant in San Mateo, would you like me to provide you with a full list of options?”
- FIGS. 5 A, 5 B and 5 C show three example implementations of the technology disclosed using different types of virtual assistants.
- FIG. 5A illustrates a mobile phone 502 .
- because mobile phones are battery-powered, it is important to minimize complex computations so as not to run down the battery. Therefore, mobile phone 502 may connect over the Internet to a server.
- the mobile phone 502 has a visual display that can provide information in some use cases.
- the mobile phone 502 also has a speaker, and in some use cases the mobile phone 502 may respond to an utterance using only speech.
- FIG. 5B also illustrates a home assistant device 504 , which may plug into a stationary power source, so it has power to do more advanced local processing than the mobile phone 502 .
- the home assistant device 504 may rely on a cloud server for interpretation of utterances according to specialized domains and in particular domains that require dynamic data to form useful results. Because the home assistant device 504 has no display, it is a speech-only device.
- FIG. 5C illustrates an automobile 506 .
- the automobile 506 may be able to connect to the Internet through a wireless network. However, if driven away from an area with a reliable wireless network, the automobile 506 must process utterances, respond, and give appropriate results reliably, using only local processing. As a result, the automobile 506 can run software locally for natural language utterance processing. Though many automobiles have visual displays, to avoid distracting drivers in dangerous ways, the automobile 506 may provide results with speech-only requests and responses or may provide results to a display for only non-driving passengers to view and interact with.
- FIG. 6 shows an overhead view of an automobile 600 designed to implement the technology disclosed.
- the automobile 600 has two front seats 602 , either of which can hold one person.
- the automobile 600 also has a back seat 604 that can hold several people.
- the automobile 600 has a driver information console 606 that displays basic information such as speed and energy level.
- the automobile 600 also has a dashboard console 608 for more complex human interactions that cannot be quickly conducted by speech, such as viewing and tapping locations on navigational maps.
- the automobile 600 has side bar microphones 610 and a ceiling-mounted console microphone 612 , all of which receive speech audio such that a digital signal processor embedded within the automobile can perform an algorithm to distinguish between speech from the driver and speech from the front-seated passenger.
- the automobile 600 also has a rear ceiling-mounted console microphone 614 that receives speech audio from rear-seated passengers.
- the automobile 600 also has a car audio sound system with speakers.
- the speakers can play music but also produce speech audio for spoken responses to user commands and results.
- the automobile 600 also has an embedded microprocessor. It runs software stored on non-transitory computer-readable media that instruct the processor to perform some or all of the operations discussed with reference to the algorithm of FIGS. 1-5, 7 and 8 , among other functions.
- FIG. 7 illustrates an example environment 700 in which personal memos and/or favorites (or information derived therefrom) can be stored, searched for retrieval and for generation of intelligent responses using the technology disclosed.
- the environment 700 includes at least one user device 702 , 706 .
- the user device 702 can be a mobile phone, tablet, workstation, desktop computer, laptop or any other type of user device running an application 704 .
- the user device 702 can be an automobile 706 or any other combination of hardware and software that is running an application 704 .
- the user devices 702 , 706 are connected to one or more communication networks 708 that allow for communication between various components of the environment 700 and that allow for performing of searches on the internet or other networks.
- the communication networks 708 include the internet.
- the communication networks 708 also can utilize dedicated or private communication links that are not necessarily part of the public internet.
- the communication networks 708 use standard communication technologies, protocols, and/or inter-process communication technologies.
- the user devices 702 , 706 are capable of receiving, for example, a first query in a first language, where the purpose of the query is to perform a search on the internet or a private network.
- the application 704 is implemented on the user devices 702 , 706 to capture the first query.
- the environment 700 also includes applications 710 that can be preinstalled on the user devices 702 , 706 or updated/installed on the user devices 702 , 706 over the communications networks 708 .
- the environment 700 includes Application Programming Interfaces (APIs) 711 that can also be preinstalled on the user devices 702 , 706 or updated/installed on the user devices 702 , 706 over the communications networks 708 .
- the APIs 711 can be implemented to allow the user devices 702 , 706 and the applications 710 to easily gain access to other components on the environment 700 as well as certain private networks.
- the environment 700 also includes an interpreter 712 that can be running on one or more platforms/servers that are part of a speech recognition system.
- the interpreter 712 can be a single computing device (e.g., a server), a cloud computing device, or it can be any combination of computing device, cloud computing devices, etc., that are capable of communicating with each other to perform the various tasks required to perform meaningful interpretation, as well as speech recognition, if desired.
- the interpreter 712 can include a deep learning system 714 that is capable of using artificial intelligence, neural networks, and/or machine learning to perform interpretations.
- the deep learning 714 can implement language embedding(s), such as a model or models 716 as well as a natural language domain 718 for providing domain-specific translations and interpretations for natural language processing (NLP).
- because the interpreter 712 can be spread over multiple servers and/or cloud computing devices, the operations of the deep learning 714 , the language embedding(s) 716 and the natural language domains 718 can also be spread over multiple servers and/or cloud computing devices.
- the applications 710 can be used by and/or in conjunction with the interpreter 712 to translate spoken input, as well as text input and text file input. Again, the various components of the environment 700 can communicate (exchange data) with each other using customized APIs 711 for security and efficiency.
- the interpreter 712 is capable of interpreting a query or statement (e.g., natural language utterance) obtained from the user devices 702 , 706 .
- the user devices 702 , 706 and the interpreter 712 can each include memory for storage of data and software applications, a processor for accessing data in executing applications, and components that facilitate communication over the communications networks 708 .
- the user devices 702 , 706 execute applications 704 , such as web browsers (e.g., a web browser application 704 executing on the user device 702 ), to allow developers to prepare and submit applications 710 and allow users to submit speech audio queries (e.g., the speech input 102 and query 103 of FIG. 1 ) including natural language utterances to be interpreted by the interpreter 712 .
- the interpreter 712 can implement one or more language embeddings (models) 716 from a repository of embeddings (models) (not illustrated) that are created and trained using the techniques that are known to a person of ordinary skill in the art.
- the natural language domain 718 can be implemented by the interpreter 712 in order to add context or real meaning to the transcription of the received speech input.
- the environment 700 can further include a topic analyzer 720 that can implement one or more topic models 722 to analyze and determine a topic of a query or statement. Some of the operations of the topic analyzer 720 could be performed during, for example, transcription operation 106 of FIG. 1 .
- the environment 700 can include a disambiguator 724 that is able to utilize any type of external data 726 (e.g., disambiguation information) in order to add further meaning to an obtained query.
- the disambiguator 724 is able to add further meaning to a query or statement by analyzing previous searches of the user, profile data of the user, location information, calendar information, date and time information, etc.
- the disambiguator 724 can be used to add synonyms to the initial search that can be helpful to narrow the search to what the user wants to find.
- the disambiguator 724 can also add additional limits to the search, such as certain dates and/or timeframes (e.g., based on the travel plans of the user additional limits can be added to the original query to identify events that are occurring while the user is traveling to a certain region).
- the topic analyzer 720 can analyze the query and determine that the topic (or domain) is “memo.cooking”.
- the disambiguator 724 can use the external data 726 to determine that the user has been cooking at their mother's house for the past few days. Accordingly, the disambiguator 724 can extend the terms of the first query from “How long do I cook lasagna?” to “How long do I cook lasagna at my mother's house?” Prior to extending the query, the system can ask the user if they are cooking at their home or at their mother's house.
- the combination of the results obtained by the topic analyzer 720 and the disambiguator 724 can essentially narrow the scope of the query.
- the disambiguator 724 can also use other mechanisms to extend the keywords of the received queries. This can be done by asking the user broad or specific questions regarding their initial query or can simply be done using artificial intelligence or other means to be able to further narrow the initial query.
- a searcher 732 of the environment 700 is implemented to perform a search for a memo or favorite information based on the query to obtain language.
- the searcher 732 can implement language and domain data 734 to determine which domains should be searched.
- the searcher 732 can, for example, identify a domain for a query in dependence upon at least one of a wake phrase, a trigger phrase, the contents or topic of the query, as determined by the topic analyzer 720 .
- the searcher 732 is not limited to searching just a single domain.
- the searcher 732 can search multiple domains in parallel or in series. For example, if an insufficient number of results are found after searching in the first domain (e.g., the memo domain) a second domain (e.g., favorites) may be searched.
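- The in-series search described above, where a second domain is consulted only when the first returns too few results, can be sketched as follows. The function name, the `min_results` threshold and the toy domain functions in the usage note are hypothetical.

```python
from typing import Callable, List, Tuple

# A domain search takes a query and returns a list of textual results.
DomainSearch = Callable[[str], List[str]]


def search_in_series(
    query: str,
    domains: List[Tuple[str, DomainSearch]],
    min_results: int = 1,
) -> List[Tuple[str, str]]:
    """Search domains in order, stopping once enough results have been found."""
    results: List[Tuple[str, str]] = []
    for name, search in domains:
        results.extend((name, hit) for hit in search(query))
        if len(results) >= min_results:
            break  # later domains are not consulted
    return results
```

- For instance, with a memo domain that returns nothing for a query and a favorites domain that returns one hit, the call falls through to favorites; if the memo domain already returns a hit, favorites is never searched.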
- scoring techniques can be implemented which will be understood by one of ordinary skill in the art.
- the user may have the option to select various scoring and ranking techniques to be implemented. For example, the user may select to have scoring and ranking independently implemented (and presented) for each domain.
- the scorer/ranker 730 may only present the top X results or a top Y percentage of results so as to not overwhelm the user.
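- Limiting the presented results to the top X results or a top Y percentage can be sketched as below. The function name and the choice to round the percentage up are assumptions; the disclosure leaves the scoring and ranking technique open.

```python
import math
from typing import List, Optional, Tuple


def top_results(
    scored: List[Tuple[str, float]],
    top_x: Optional[int] = None,
    top_percent: Optional[float] = None,
) -> List[Tuple[str, float]]:
    """Rank (result, score) pairs and keep the top X and/or top Y percent."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    limit = len(ranked)
    if top_x is not None:
        limit = min(limit, top_x)
    if top_percent is not None:
        # Round up so that a small non-empty list still yields a result.
        limit = min(limit, math.ceil(len(ranked) * top_percent / 100))
    return ranked[:limit]
```

- When both limits are given, the stricter one applies in this sketch.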
- the technology disclosed can also provide a brief visual or auditory summary of each result, making it easier for the user to determine which results they would like to view first.
- the interpreter 712 , topic analyzer 720 , disambiguator 724 , scorer/ranker 730 and/or the searcher 732 can be implemented using at least one hardware component and can also include firmware, or software running on hardware.
- Software that is combined with hardware to carry out the actions of the interpreter 712 , topic analyzer 720 , disambiguator 724 , scorer/ranker 730 and/or the searcher 732 can be stored on computer readable media such as rotating or non-rotating memory.
- the non-rotating memory can be volatile or non-volatile.
- computer readable media does not include a transitory electromagnetic signal that is not stored in a memory; computer readable media store program instructions for execution.
- the interpreter 712 , topic analyzer 720 , disambiguator 724 , scorer/ranker 730 and/or the searcher 732 , as well as the applications 710 , the topic models 722 , external data 726 , the language and domain data 734 and the APIs 711 can be wholly or partially hosted and/or executed in the cloud or by other entities connected through the communications network 708 .
- FIG. 8 is a block diagram of an example computer system that can implement various components of the environment 700 of FIG. 7 .
- Computer system 810 typically includes at least one processor 814 , which communicates with a number of peripheral devices via bus subsystem 812 .
- peripheral devices may include a storage subsystem 824 , comprising for example memory devices and a file storage subsystem, user interface input devices 822 , user interface output devices 820 , and a network interface 815 .
- the input and output devices allow user interaction with computer system 810 .
- Network interface 815 provides an interface to outside networks, including an interface to the communication networks 708 , and is coupled via the communication networks 708 to corresponding interface devices in other computer systems.
- User interface input devices 822 may include audio input devices such as speech recognition systems, microphones, and other types of input devices.
- use of the term “input device” is intended to include all possible types of devices and ways to input speech information into computer system 810 or onto communication network 708 .
- User interface output devices 820 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
- the display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
- the display subsystem may also provide non-visual display such as via audio output devices.
- use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 810 to the user or to another machine or computer system.
- Storage subsystem 824 stores programming and data constructs that provide the functionality of some or all of the modules described herein. These software modules are generally executed by processor 814 alone or in combination with other processors.
- Memory subsystem 825 used in the storage subsystem can include a number of memories including a main random-access memory (RAM) 830 for storage of instructions and data during program execution and a read only memory (ROM) 832 in which fixed instructions are stored.
- a file storage subsystem 828 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
- the modules implementing the functionality of certain embodiments may be stored by file storage subsystem 828 in the storage subsystem 824 , or in other machines accessible by the processor.
- Bus subsystem 812 provides a mechanism for letting the various components and subsystems of computer system 810 communicate with each other as intended. Although bus subsystem 812 is shown schematically as a single bus, alternative embodiments of the bus subsystem may use multiple busses.
- Computer system 810 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computer system 810 depicted in FIG. 8 is intended only as a specific example for purposes of illustrating the various embodiments. Many other configurations of computer system 810 are possible having more or fewer components than the computer system depicted in FIG. 8 .
- the technology disclosed can be practiced as a system, method, or article of manufacture.
- One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable.
- One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the following implementations.
- a method implementation of the technology disclosed includes a method of retrieving a personal memo from a database.
- the method includes receiving, by a virtual assistant, a natural language utterance that expresses a request, interpreting the natural language utterance according to a natural language grammar rule for retrieving memo data from the natural language utterance, the natural language grammar rule recognizing query information, responsive to interpreting the natural language utterance, using the query information to query the database for a memo related to the query information, and providing, to a user, a response generated in dependence upon the memo related to the query information.
- the natural language grammar rule for retrieving memo data is selected from a plurality of domain dependent grammar rules in accordance with contents of the received natural language utterance.
- the database is queried for the memo related to the query information by searching the database to identify any memo that includes information sufficient to provide an appropriate response to the user.
- the response is provided to the user, such that the response answers the request expressed by the natural language utterance as opposed to providing a word-for-word repeat of a transcription.
- a further implementation includes identifying a trigger phrase from the received natural language utterance, and responsive to identifying the trigger, selecting the natural language grammar rule for retrieving memo data in dependence upon at least one of (i) the identified trigger phrase and (ii) other contents of the natural language utterance.
- the trigger phrase includes both a personal pronoun and a following interrogative pronoun or relative pronoun that is within 5 words of the personal pronoun.
- the method can include receiving an indication that the user spoke a memo-specific wake phrase before the natural language utterance.
- the database storing the memo is a structured database, such that the memo is stored in a structured format.
- the database storing the memo is an unstructured database, such that the memo is stored in an unstructured format.
- the method includes receiving, from the user, a natural language utterance including memo information, interpreting the natural language utterance to extract the memo information, and storing the memo information in the database as a memo.
- In another implementation, the stored interpretation of the natural language utterance including the memo information includes personal information about the user.
- an implementation can include receiving, interpreting and storing multiple natural language utterances including the memo information as memos that relate to a subject along with additional information indicating a time-order of being received, and generating the response in dependence upon a stored memo (i) relating to the subject and (ii) that was interpreted from a most recently received natural language utterance including the memo information relating to the subject.
- Another implementation may include replacing other previously stored memos that relate to a subject with a most recently stored memo that relates to the subject when multiple natural language utterances including the memo information are received, interpreted and stored in the database as a memo that relates to a subject.
- the method includes allowing the user to confirm or acknowledge whether or not the user intended for the natural language utterance including the memo information to be stored as the memo.
- the method includes deleting the stored memo related to the natural language utterance including the memo information when the user indicates that the natural language utterance including the memo information was not intended to be stored as the memo.
- the method includes assigning a time period to the memo, after which the memo will expire, and removing the memo from the database when the time period has expired.
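- Assigning a time period to a memo and removing the memo once that period has expired can be sketched as follows. The class name, the use of seconds and the explicit `now` argument (used instead of a system clock to keep the example deterministic) are assumptions.

```python
from typing import Dict, List, Optional, Tuple


class ExpiringMemoStore:
    """Stores each memo with the time at which it expires."""

    def __init__(self) -> None:
        # memo_id -> (memo text, expiry time in seconds)
        self._memos: Dict[str, Tuple[str, float]] = {}

    def store(self, memo_id: str, text: str, now: float, ttl_seconds: float) -> None:
        self._memos[memo_id] = (text, now + ttl_seconds)

    def purge_expired(self, now: float) -> List[str]:
        """Remove memos whose time period has elapsed; return their ids."""
        expired = [mid for mid, (_, expires_at) in self._memos.items() if expires_at <= now]
        for mid in expired:
            del self._memos[mid]
        return expired

    def get(self, memo_id: str, now: float) -> Optional[str]:
        """Return the memo text if it exists and has not yet expired."""
        entry = self._memos.get(memo_id)
        if entry is None or entry[1] <= now:
            return None
        return entry[0]
```

- In practice `purge_expired` could be run periodically, while `get` already hides expired memos between purges.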
- An implementation may also include interpreting the natural language utterance that expresses the request according to multiple domains, each domain of the multiple domains having an associated relevancy score for the interpreted utterance, wherein a memo domain is one of the multiple domains, and wherein the memo domain has a score advantage relative to other domains.
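- One way to picture the score advantage given to the memo domain is to add a fixed bonus to its relevancy score before the best-scoring domain is selected. The additive bonus, its value and the function name are assumptions for illustration; the disclosure does not specify how the advantage is computed.

```python
from typing import Dict


def pick_domain(scores: Dict[str, float], memo_bonus: float = 0.1) -> str:
    """Choose the domain with the highest relevancy score after favoring "memo"."""
    adjusted = {
        domain: score + (memo_bonus if domain == "memo" else 0.0)
        for domain, score in scores.items()
    }
    return max(adjusted, key=lambda domain: adjusted[domain])
```

- With this sketch the memo domain wins close contests but can still lose to a domain that scores clearly higher.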
- the method may include storing a recording of the natural language utterance that expresses the request and/or storing a recording of the natural language utterance including the memo information.
- a first particular interpretation of the transcription of text is stored in the database in association with a first domain and a second particular interpretation of the transcription is stored in the database in association with a second domain, such that two or more interpretations are stored in the database.
- One implementation may include storing meta-data along with the memo, where the meta-data include information such as short-term activity information, daily weather information, until-event occurs information, and where the meta-data can be explicitly stated by the user or inferred from other information including other memos, regular commute information and/or calendar information.
- implementations may include a non-transitory computer-readable recording medium having a computer program for retrieving a personal memo from a database recorded thereon.
- the computer program, when executed on one or more processors, causes the processors to perform the method described above and any of the above-described implementations. Specifically, the method includes receiving, by a virtual assistant, a natural language utterance that expresses a request, interpreting the natural language utterance according to a natural language grammar rule for retrieving memo data from the natural language utterance, the natural language grammar rule recognizing query information, responsive to interpreting the natural language utterance, using the query information to query the database for a memo related to the query information, and providing, to a user, a response generated in dependence upon the memo related to the query information.
- a system implementation of the technology disclosed includes one or more processors coupled to memory.
- the memory is loaded with computer instructions to retrieve a personal memo from a database.
- the instructions, when executed on the one or more processors, implement actions including receiving, by a virtual assistant, a natural language utterance that expresses a request, interpreting the natural language utterance according to a natural language grammar rule for retrieving memo data from the natural language utterance, the natural language grammar rule recognizing query information, responsive to interpreting the natural language utterance, using the query information to query the database for a memo related to the query information, and providing, to a user, a response generated in dependence upon the memo related to the query information.
- The system can also include features described in connection with the methods disclosed. In the interest of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
- a given event or value is “responsive” (e.g., “in response to” or “responsive to”) to a predecessor event or value if the predecessor event or value influenced the given event or value. If there is an intervening processing element, step or time period, the given event or value can still be “responsive” to the predecessor event or value. If the intervening processing element or step combines more than one event or value, the signal output of the processing element or step is considered “responsive” to each of the event or value inputs. If the given event or value is the same as the predecessor event or value, this is merely a degenerate case in which the given event or value is still considered to be “responsive” to the predecessor event or value. “Dependency” (e.g. “in dependence upon” or “in dependence on”) of a given event or value upon another event or value is defined similarly.
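The implementations above that keep a time-order for memos on a subject, answer from the most recently received memo, optionally replace older memos, and expire memos after an assigned time period can be illustrated with a minimal sketch. Everything named here (`Memo`, `MemoStore`, `replace_older`, the use of plain numeric timestamps) is a hypothetical stand-in chosen for illustration and is not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Memo:
    subject: str
    content: str
    received_at: float                  # time the utterance was received, in seconds
    expires_at: Optional[float] = None  # None means the memo never expires


class MemoStore:
    """Hypothetical in-memory stand-in for the memo database."""

    def __init__(self) -> None:
        self._memos: list[Memo] = []

    def store(self, memo: Memo, replace_older: bool = False) -> None:
        if replace_older:
            # Variant in which the newest memo replaces earlier memos on the subject.
            self._memos = [m for m in self._memos if m.subject != memo.subject]
        self._memos.append(memo)

    def purge_expired(self, now: float) -> None:
        # Remove memos whose assigned time period has elapsed.
        self._memos = [
            m for m in self._memos if m.expires_at is None or m.expires_at > now
        ]

    def latest(self, subject: str, now: float) -> Optional[Memo]:
        # Return the most recently received, unexpired memo on the subject.
        self.purge_expired(now)
        candidates = [m for m in self._memos if m.subject == subject]
        return max(candidates, key=lambda m: m.received_at, default=None)
```

In this sketch, a response would be generated from whatever `latest` returns, so a later statement about the same subject takes precedence over an earlier one without the earlier one having to be deleted.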
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Artificial Intelligence (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
Description
- This application is a continuation of U.S. application Ser. No. 16/255,674, entitled “Using A Virtual Assistant To Store A Personal Voice Memo And To Obtain A Response Based On A Stored Personal Voice Memo That Is Retrieved According To A Received Query”, filed on Jan. 23, 2019, naming inventors Mara Selvaggi, Irina A. Spiridonova and Karl Stahl, the application of which is hereby incorporated by reference.
- Existing note-taking applications, such as Evernote® and Simplenote®, allow users to write notes using a manual input modality. However, such applications do not record memos, play back memos or play back intelligent interpretations of memos using a spoken modality.
- Some voice memo applications, such as Zoho Notebook® and Voice Memos® for iOS®, allow users to record and play back memos, starting and stopping using a manual modality (see submitted non-patent literature “Zoho”). However, such applications do not support explicit or implicit searching for information in memos or retrieving information from the memos using voice modalities.
- Conventional smart-speaker virtual assistants allow storing and retrieving information using voice in limited ways. For example, Google Assistant® and Siri® can add and retrieve events from a cloud-stored calendar. However, using the feature requires the user to specify the content and the requests precisely to make the system do what is desired. For example, if a user asks Siri® “When is my husband's birthday?” and that information has not been pre-set in that user's device or device ecosystem, Siri® will reply “I don't know who your husband is.”
- Cardona® teaches, at a high level, how to use various current commercial virtual assistants to store any arbitrary voice notes (see submitted non-patent literature “Cardona”). All systems implemented by Cardona® essentially transcribe speech to text that users can only retrieve through a visual modality. Prior art systems do not even allow a system to read back notes, or a summary of notes, using text-to-speech. Doing so without significantly wasting the time of a user, who would otherwise listen to extraneous neighboring words and irrelevant information, is a non-trivial and unsolved problem.
- Voicera® describes the existence, without enablement, of summarization of voice notes (see submitted non-patent literature “Voicera”). However, Voicera® still relies on a visual modality for reviewing information and does not address the problem of providing relevant information for users, using a speech modality, without wasting time with extraneous neighboring words and irrelevant information.
- U.S. Patent Application Publication No. 2006/0064411 A1 with title “Search engine using user intent” filed by Gross, et al., teaches a system for searching with results ranked based, in part, on past user activity. However, it does not use natural language and is not applicable to conversational voice search. Also, it does not provide for a user to explicitly retrieve stored information.
- U.S. Pat. No. 6,675,159 B1 with title “Concept-based search and retrieval system” issued to Lin, et al., teaches a system for natural-language-based retrieving of multimedia information stored with appropriate attribute metadata. However, the system only addresses retrieving multimedia information. It does not teach retrieval of information used to complete the interpretations or respond verbally to natural language queries.
- The submitted non-patent literature “Kolodner” teaches a specific speed-and-storage efficient method for storing and organizing facts for natural-language-based storage and retrieval. It is limited to a single domain of knowledge and would not be practical to implement for any arbitrary domains or conversation topics.
- U.S. Patent Application Publication No. 2014/0365222 A1 with title “Mobile systems and methods of supporting natural language human-machine interactions” filed by Weider teaches a method of storage and retrieval of personal information, such as user profile and environmental information. However, it does not extract information from conversational natural language expressions, and it does not filter for particular relevant information to retrieve for interpreting and responding to later natural language requests.
- Thus, a need arises for speech recognition technology that is capable of recording voice memorandums (i.e., memos), intelligently storing the memos along with information derived from the memos, and intelligently retrieving information contained in or derived from the stored memos.
- Additionally, voice-enabled virtual assistants currently do not have the capability to intelligently learn the preferences or favorites of a user and then later use that information to answer a question from the user. For example, Siri® does not learn a person's preferences or favorites in an intelligent manner. Specifically, when a user asks Siri® “What is my favorite restaurant?” Siri® thinks that the user is asking about Siri's preference and a response is provided to the user as “I don't eat out that much.” Furthermore, when other virtual assistants are asked “What is my favorite restaurant?” they pick a restaurant that has the word “favorite” in its name, such as “My Favorite Cafe.” The Google Maps® application has an option to add places to a “Favorites” list, a “Want to Go” list or a “Starred Places” list, but it does not allow those lists to be queried using one's own voice. Google Assistant® has a feature of remembering a favorite place; however, it is able to store only a limited number of places and doesn't allow users to reliably query them (e.g., give directions to that place). For example, a Google Assistant® (GA) interaction goes as follows: (i) user: “do you know what my favorite restaurant is?”; (ii) GA: “I don't know that yet. What's your favorite restaurant?”; (iii) user: “my favorite restaurant is Red Lobster,” (iv) GA: “OK, I'll remember that”; (v) user: “do you know what is my favorite beach?”; (vi) GA: “I remember you told me. ‘My favorite restaurant is Red Lobster’.”; (vii) user: “can you give me directions to my favorite restaurant?”; and (viii) GA: “Here you go. Directions from your location to IHOP . . . .” As is clear from the prior art, there is much needed improvement with respect to incorporating a user's preferences or favorites into a voice-enabled virtual assistant.
- Accordingly, an additional need arises for speech enabled virtual assistants that intelligently store favorite information of a user for subsequent retrieval and presentation to the user at the appropriate time.
- The technology disclosed relates to (i) speech enabled virtual assistants implementing technology that is capable of recording voice memorandums (i.e., memos), intelligently storing the memos along with information derived from the memos, and intelligently retrieving information contained in or derived from the stored memos and (ii) speech enabled virtual assistants implementing technology that intelligently stores favorite information of a user for subsequent retrieval and presentation to the user at the appropriate time.
- Regarding the recording, storage and retrieving of memos, the technology disclosed receives (by a virtual assistant) a natural language utterance that includes memo information, interprets the received utterance according to a natural language grammar rule associated with a memo domain and stores (in a database) a memo that is derived from the interpretation of the memo information, receives another natural language utterance expressing a request (i.e., a request to query memo data from the database), interprets the natural language utterance expressing a request according to a natural language grammar rule for retrieving memo data from the natural language utterance, such that the natural language rule for retrieving memo data recognizes query information, in response to a successful interpretation of the natural language utterance, uses the recognized query information to query the database for specific memo data related to the recognized query information, and provides, to the user, a response generated in dependence upon the queried-for specific memo data.
- Regarding the storing and retrieval of favorite information, the technology disclosed operates in a similar manner as the storing and retrieval of memos.
- Particular aspects of the technology disclosed are described in the claims, specification and drawings.
- FIG. 1 illustrates a block diagram of an example environment capable of speech enabled virtual assistants implementing technology that is capable of receiving a request or query and intelligently retrieving information contained in or derived from previously stored memos.
- FIG. 2 illustrates a block diagram of an example environment capable of speech enabled virtual assistants implementing technology that is capable of recording voice memorandums (i.e., memos) and intelligently storing the memos along with information derived from the memos.
- FIG. 3 illustrates a block diagram of an example environment capable of speech enabled virtual assistants implementing technology that intelligently retrieves and presents favorite information of a user contained in or derived from previously identified and stored favorites.
- FIG. 4 illustrates a block diagram of an example environment capable of speech enabled virtual assistants implementing technology that is capable of receiving favorites and intelligently storing the favorites along with information derived from the favorites.
- FIGS. 5A, 5B and 5C show three example implementations of the technology disclosed using different types of virtual assistants.
- FIG. 6 shows an overhead view of an automobile designed to implement the technology disclosed.
- FIG. 7 illustrates an example environment in which personal memos and/or favorites can be stored, searched and retrieved for generation of intelligent responses using the technology disclosed.
- FIG. 8 is a block diagram of an example computer system that can implement various components of the environment of FIG. 7.
- FIG. 9 illustrates TABLE 1, which includes example phrases that would trigger the storing of a personal memo.
- FIG. 10 illustrates TABLE 2, which includes example phrases that would trigger the storing of a personal memo.
- FIG. 11 illustrates TABLE 3, which includes example ways of invoking the storing of favorite information, querying favorite information and possible responses from a virtual assistant.
- FIG. 12 illustrates TABLE 4, which includes example ways of using favorite information for obtaining directions and travel information.
- FIG. 13 illustrates TABLE 5, which includes example ways of storing multiple favorites for a specific category and then later obtaining specific information for both of the favorites in the same category or obtaining favorite information of multiple favorites based on geographical location.
- The following detailed description is made with reference to the figures. Example implementations are described to illustrate the technology disclosed, not to limit its scope, which is defined by the claims. Those of ordinary skill in the art will recognize a variety of equivalent variations on the description that follows.
- An aspect of the technology disclosed relates to speech-enabled virtual assistants implementing recognition technology that is capable of recording voice memorandums (i.e., memos, or personal memos), intelligently storing the memos along with information derived from the memos, and intelligently retrieving information contained in or derived from the stored memos. Two specific examples of this speech recognition technology that is capable of recording and intelligently storing memos and related information and retrieving information in dependence upon the stored memos are provided below.
- The first example relates to cooking lasagna. The scenario is that just about every recipe on the internet indicates that lasagna should be cooked for 40 minutes. However, a particular user has determined that with their oven 40 minutes is too much, and as a result, their lasagna is always burned. The user was able to determine through experience that the perfect cooking time for their lasagna is 30 minutes. In order to remember that the perfect time for cooking lasagna in their oven is 30 minutes, the user will have an interaction with a virtual assistant (or some other type of technology that is capable of speech recognition and feedback) as follows (note that only the text in italics is the voice exchange or interaction with the virtual assistant; and the virtual assistant is named Hound):
- (i) User: “Ok Hound. To get a perfect lasagna, I cook it in the oven for 30 minutes.” [this phrase uttered by the user was identified by the virtual assistant as being related to a memo or a memo domain in dependence upon the virtual assistant identifying the trigger words “I” (a personal pronoun) and “cook” (a verb)].
- (ii) User: “Ok Hound. How long should I cook the lasagna?” [this phrase uttered by the user was identified by the virtual assistant as being related to querying a memo or a memo domain in dependence upon the virtual assistant identifying a request (e.g., an interrogatory) and trigger words such as “I” (a personal pronoun) and “cook” (a verb)].
- (iii) Hound: “You should cook the lasagna 30 minutes in the oven.” [this response from the virtual assistant was generated by obtaining the stored memo or information relating to the memo that indicated the cooking time in the oven for lasagna is 30 minutes].
- The second example relates to finding or locating lost objects. The scenario is that a user places an object somewhere (e.g., for hiding or storage), where the user wants to be sure to remember where the object was placed. Instead of writing a text, email or physical message to oneself, the user would have the following interaction with the virtual assistant.
- (i) User: “Hound, remember that I put the car key in my brown bag.” [this phrase uttered from the user could be identified by the virtual assistant as being related to a memo or memo domain in dependence upon the virtual assistant identifying the wake phrase “Hound, remember.”]
- (ii) User: “Ok Hound. Where did I put my car key?”
- (iii) Hound: “You put your car key in your brown bag.”
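The lost-object exchange above can be illustrated with a small sketch in which the grammar rules for storing and for querying a memo are approximated by regular expressions. This is only an assumption made for illustration: the disclosure does not specify how its grammar rules are written, and the names `STORE_RULE`, `QUERY_RULE`, `Assistant` and the confirmation wording are hypothetical.

```python
import re

# Hypothetical grammar rules, approximated here by regular expressions.
STORE_RULE = re.compile(
    r"remember that i put (?:the|my) (?P<item>.+?) in (?P<place>.+?)\.?$",
    re.IGNORECASE,
)
QUERY_RULE = re.compile(
    r"where did i put (?:the|my) (?P<item>.+?)\??$", re.IGNORECASE
)


def to_second_person(phrase: str) -> str:
    # Rephrase the user's first-person possessive for the spoken response.
    return re.sub(r"\bmy\b", "your", phrase, flags=re.IGNORECASE)


class Assistant:
    def __init__(self) -> None:
        self.memos: dict = {}  # maps an item to the place it was put

    def handle(self, utterance: str) -> str:
        # Strip the wake phrase ("Hound" or "Ok Hound") before interpretation.
        text = re.sub(
            r"^(?:ok )?hound[,.]?\s*", "", utterance.strip(), flags=re.IGNORECASE
        )
        if m := STORE_RULE.search(text):
            self.memos[m["item"].lower()] = m["place"]
            return "OK, I'll remember that."  # hypothetical confirmation wording
        if m := QUERY_RULE.search(text):
            item = m["item"].lower()
            place = self.memos.get(item)
            if place is None:
                return f"I don't have a memo about your {item}."
            return f"You put your {item} in {to_second_person(place)}."
        return "Sorry, I didn't understand that."
```

Feeding the two user utterances from the example to `Assistant.handle` in order stores a memo keyed by "car key" and then answers from it; note that the response is rephrased in the second person rather than repeating the stored utterance word for word.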
- Another aspect of the technology disclosed relates to speech enabled virtual assistants implementing technology that intelligently stores favorite information of a user for subsequent retrieval and presentation to the user at the appropriate time. A concept is that the favorite information of the user is stored, such as favorite restaurants, grocery stores, beauty salons, gyms, recreation spots, parking garages, friends and family, etc. and then later used to answer inquiries from the user. Three specific examples of this technology that is capable of recording and intelligently storing memos and related information and retrieving information in dependence upon the stored memos are provided below.
- The first example relates to favorite places and the scenario is that the user tells the virtual assistant about a favorite restaurant and then later on asks for directions to that restaurant.
- (i) User: “Ok Hound, my favorite restaurant is Spice Me at Half Moon Bay.” [this information conveyed from the user, triggered “favorites” or a favorites domain and in particular a favorite restaurant in dependence upon the trigger words “favorite” and “restaurant.”].
- (ii) User: “Ok Hound, give me directions to my favorite restaurant.”
- (iii) Hound: “Here you are . . . ” (and directions are provided to the user in one of various forms, such as spoken word, opening up a map or directions application, etc.).
- The second example relates to a routine commute and the scenario is that the user goes to the same gym, bar, grocery store etc. on a regular basis, so she tells the virtual assistant to remember this particular place as a favorite for later retrieval.
- (i) User: “Ok Hound, the gym I usually go to is Orange Theory Fitness® in Santa Clara.” [favorites or favorites domain is triggered by the words “I” and “usually”].
- (ii) User: “Ok Hound, how long will it take me to get to the gym?”
- (iii) Hound: “It will take you 15 minutes to get to the gym.” [the virtual assistant utilizes the information of the user's favorite gym to determine which gym the user is referring to and then estimate how long it will take to get there using the typical transportation scheme used by the user to get to the gym in view of present traffic conditions].
- The third example relates to making recommendations and the scenario is that a user asks a virtual assistant for a recommendation, where the user has previously given the virtual assistant some information about favorite restaurants, etc. or perhaps where the user has not previously provided favorite information.
- (i) User: “Ok Hound, give me a restaurant recommendation.”
- (ii) Hound: “Tell me what kind of food you like.”
- (iii) User: “I like Thai Food and Italian food the most.”
- (iv) User: “Ok Hound, are there any restaurants around I might like?”
- (v) Hound: “I have two restaurants that are close by that serve your favorite types of food but based on the fact that you recently had Thai food I will recommend Pasta Moon Italian Restaurant at Half Moon Bay.”
- Now, turning to the figures, various example aspects of the technology disclosed are provided below.
- FIG. 1 illustrates a block diagram of an example environment 100 capable of speech enabled virtual assistants implementing technology that is capable of receiving a request or query and intelligently retrieving information contained in or derived from previously stored memos. The term “intelligently retrieving” is mentioned because the environment 100, as discussed in further detail below, is capable of not just repeating a previous statement made by the user but is able to derive a more useful response to the user, as a result of having previously stored a memo or personal memo provided by the user.
- In particular, FIG. 1 illustrates that the example environment 100 includes a speech input 102 being received from a microphone or some other type of input device (e.g., an application running on a mobile phone or tablet, etc.). The speech input 102 includes a search or query request 103 (hereinafter query 103). The query 103 can be in the form of a natural language utterance spoken by the user.
- The speech input 102 can be received by a virtual assistant (not illustrated) as query 103. Speech enabled virtual assistants will simply be referred to herein as “virtual assistants” or a “virtual assistant.” A virtual assistant can be a device or an application residing on a device, such as a smart phone, a watch, glasses, a television, an automobile, etc. The virtual assistant is capable of interacting with a user using the user's speech and is capable of, for example, (i) providing information back to the user (e.g., an answer to a question), (ii) providing an actionary response (e.g., changing the thermostat or locking the doors to an automobile) or (iii) storing information for later retrieval (remotely or locally) or for increasing the knowledge base of the virtual assistant. A virtual assistant can monitor sound (e.g., conversations) to listen for a wake phrase that engages the virtual assistant and to listen to a trigger phrase uttered after the wake phrase that directs the virtual assistant (or any system in communication with the virtual assistant) to a particular domain. A wake phrase can be just one word or multiple words and a trigger phrase can be just one word or multiple words.
- Referring back to FIG. 1, the query 103 will be transcribed by the virtual assistant (or a system connected to the virtual assistant as described below with respect to FIG. 7) in operation 106. Next, in operation 106, text obtained from the transcription of the query 103 will be used to determine whether or not the user intended to query a particular domain, such as a memo domain 108. If the memo domain 108 is identified, then the text obtained from the transcription will be interpreted using a particular grammar rule.
- Regarding domains and grammar rules, a domain represents a particular subject area, and comprises or is associated with a specific grammar rule. A specific grammar rule is not necessarily one single rule but can be a set of rules that are suited to interpret a transcription of a natural language utterance that is related to a specific domain. The process of interpreting a natural language utterance within a particular domain produces exactly one interpretation. Different interpretations arise when systems interpret a natural language utterance in the context of different domains. Each interpretation represents the meaning of the natural language utterance as interpreted by a domain. For example, users make requests, such as asking “What time is it?” or directing the system to “Send a message.” Systems provide responses, such as by speaking the time. Systems also make requests of users, such as by asking, “To whom would you like to send a message?”, and in reply, users respond, such as by replying, “Mom.” Sequences of one or more requests and responses produce results such as sending a message or reporting the time of day. The interactions regarding the “time” are interpreted, for example, using a “time domain” with a specific grammar rule that is suited for interpreting text related to time. The same applies to “messages,” which are interpreted using a “messages domain.” Sub-domains can also exist. The number of domains is limitless, as well as the specific grammar rules implemented by or included in the domains. These are merely non-limiting examples of domains, grammar rules and transcriptions.
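As a rough illustration of the relationship between domains, grammar rules and interpretations, the following sketch lets each domain produce at most one interpretation of a transcription together with a relevancy score, and then keeps the highest-scoring interpretation. It is a sketch under stated assumptions: the rule functions, the numeric scores, and the `score_bonus` field (one possible way to give a memo domain a score advantage over other domains, as mentioned among the implementations earlier in this document) are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Interpretation:
    domain: str
    meaning: dict
    score: float  # relevancy score of this domain for the utterance


@dataclass
class Domain:
    name: str
    # The "grammar rule" is modelled as a function from transcription text to a
    # (meaning, score) pair, or None when the rule does not match.
    rule: Callable[[str], Optional[tuple[dict, float]]]
    score_bonus: float = 0.0  # hypothetical score advantage for this domain

    def interpret(self, text: str) -> Optional[Interpretation]:
        result = self.rule(text)
        if result is None:
            return None
        meaning, score = result
        return Interpretation(self.name, meaning, score + self.score_bonus)


def time_rule(text):
    if re.search(r"\bwhat time is it\b", text, re.IGNORECASE):
        return {"intent": "report_time"}, 0.9
    return None


def recipe_rule(text):
    m = re.search(r"\bhow long should i cook (?:the )?(\w+)", text, re.IGNORECASE)
    return ({"intent": "web_recipe", "dish": m.group(1)}, 0.6) if m else None


def memo_query_rule(text):
    m = re.search(r"\bhow long should i cook (?:the )?(\w+)", text, re.IGNORECASE)
    return ({"intent": "query_memo", "dish": m.group(1)}, 0.5) if m else None


def best_interpretation(text, domains):
    # Each domain yields at most one interpretation; keep the highest scoring one.
    candidates = [i for d in domains if (i := d.interpret(text)) is not None]
    return max(candidates, key=lambda i: i.score, default=None)


DOMAINS = [
    Domain("time", time_rule),
    Domain("recipes", recipe_rule),
    Domain("memo", memo_query_rule, score_bonus=0.2),
]
```

With these illustrative numbers, the utterance “How long should I cook lasagna?” matches both the hypothetical recipes domain and the memo domain, and the bonus tips the selection toward the memo domain, so a personal memo would be consulted before a generic answer.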
- Turning back to FIG. 1, when the received natural language utterance expresses a request, the natural language utterance that expresses the request can be interpreted according to a natural language grammar rule for retrieving memo data. This rule is obtained from the memo domain 108. Further, the natural language grammar rule is used to recognize query information from the natural language utterance (e.g., query 103). As an example, in operation 106 the received natural language utterance is “How long should I cook lasagna?”
- Responsive to the interpretation and obtaining of the query information, an appropriate database will be searched or queried. According to one aspect of the present invention, in operation 110 a memo transcription database 112 can be queried using the interpreted natural language utterance. The memo transcription database 112 includes text from previous natural language utterances directed to personal memos. The memo transcription database 112 can be an unstructured or a structured database storing unstructured or structured data. However, as previously discussed, merely providing text back to a user that has not been interpreted according to a specific domain would not be as helpful to the user. An example of such text would be “To get a perfect lasagna, I cook it in the oven for 30 minutes.” This is just a simple transcription of a previously stored or recorded personal memo (e.g., a word-for-word repeat of a transcription). While this is not a perfect answer to the user's query, it still provides enough information. Additionally, the actual recording of the natural language utterance that expresses the query 103 can be stored in another database, or even the memo transcription database 112 and/or the memo interpretation database 114. Further, the text stored in the memo transcription database 112 or the recording stored in another database can be stored for the purpose of later re-interpretation. For example, grammar rules of domains can be improved over time, therefore providing more accurate interpretations as time goes on. By storing the original text or recording that was used to create a first interpretation using the memo domain 108, it is possible to re-interpret the original text or recording if the grammar rules have been improved upon.
- According to another aspect of the present technology, in operation 110, a memo interpretation database 114 is queried using the interpreted natural language utterance. The memo interpretation database 114 includes interpretations of natural language utterances directed to personal memos. The memo interpretation database 114 can be an unstructured or a structured database storing unstructured or structured data. Because the interpretations of the natural language utterances are made using a particular natural language grammar rule associated with the memo domain 108, the information stored and retrieved from the memo interpretation database 114 will be easier to search and provide more accurate and meaningful results. An example memo retrieved from the memo interpretation database 114 could be structured data, such as “cook.lasagna.oven.30-minutes” that can be used to generate a response, or an example memo retrieved from the memo interpretation database 114 could already be in a form that is phrased as a natural language response such as “Violet, you should cook your lasagna in your oven for 30 minutes.”
- After obtaining the memo from the memo transcription database 112 or the memo interpretation database 114 in operation 110, operation 118 generates an appropriate answer (response) for the user. As discussed above and in further detail below, an aspect of the technology disclosed is capable of providing a meaningful (appropriate) response to the user that is not necessarily a word-for-word repeat of a previously stored transcription, but something that is sufficient for, and will actually be more helpful in, answering the user's request or query. If operation 110 obtains the memo from the memo transcription database 112, then the memo can be further interpreted using the specific grammar rule for retrieving memo data. For example, the retrieved memo “To get a perfect lasagna, I cook it in the oven for 30 minutes” could be interpreted to generate a response such as “Violet, you should cook your lasagna in your oven for 30 minutes.” If the memo retrieved from the memo interpretation database 114 is structured as “cook.lasagna.oven.30-minutes,” the system will generate “Violet, you should cook your lasagna in your oven for 30 minutes,” as an appropriate response. Once the appropriate response or answer is generated in operation 118, the appropriate response or answer will be provided to the user in operation 120, in the form of speech 122 or message/text to a mobile device 124 or some other device similar thereto.
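The step of turning a structured memo into a spoken-style answer can be sketched as follows. The dotted layout is taken from the example “cook.lasagna.oven.30-minutes” above, but the assumption that every structured memo has exactly the four fields action, item, appliance and duration, along with the function names and the response template, is hypothetical and used only for illustration.

```python
def parse_structured_memo(memo: str) -> dict:
    # Assumed layout: action.item.appliance.duration,
    # e.g. "cook.lasagna.oven.30-minutes".
    action, item, appliance, duration = memo.split(".")
    amount, unit = duration.split("-")
    return {
        "action": action,
        "item": item,
        "appliance": appliance,
        "amount": int(amount),
        "unit": unit,
    }


def generate_response(memo: str, user_name: str) -> str:
    # Fill a fixed response template from the parsed memo fields.
    fields = parse_structured_memo(memo)
    return (
        f"{user_name}, you should {fields['action']} your {fields['item']} "
        f"in your {fields['appliance']} for {fields['amount']} {fields['unit']}."
    )
```

Calling `generate_response("cook.lasagna.oven.30-minutes", "Violet")` reproduces the response quoted in the text. Keeping the memo structured, rather than storing only the final sentence, would also allow the same memo to be phrased differently for speech and for on-screen text.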
FIG. 2 illustrates a block diagram of an example environment capable of speech or text enabled virtual assistants implementing technology that is capable of recording voice memorandums (i.e., memos) and intelligently storing the memos along with information derived from the memos. - Specifically,
FIG. 2 illustrates an environment 200 that implements the storing of a natural language utterance in the memo transcription database 112 and/or the memo interpretation database 114. The environment of FIG. 2 is very similar to that of FIG. 1, except that a statement 203 is received that causes the virtual assistant to store some or all of the statement 203 as a memo as opposed to conducting a query. Descriptions of redundant elements of FIG. 2 are omitted. - In
operation 206 the statement 203 is transcribed and then a domain, such as the memo domain 108, is identified. Just as in FIG. 1, where the query 103 is transcribed and interpreted, the text transcribed from the statement 203 is interpreted using a specific grammar rule for storing a memo that is associated with or included in the memo domain 108. For example, the natural language utterance (e.g., statement 203) received from the user can be interpreted according to a natural language grammar rule for storing memo data. In operation 210 the memo, obtained from the transcription of the natural language utterance, is stored as a transcription in the memo transcription database 112, and in operation 212 the memo, obtained from an interpretation of the natural language utterance, is stored in the memo interpretation database 114. Additionally, the actual recording of the natural language utterance that expresses the statement 203 can be stored in another database, or even in the memo transcription database 112 and/or the memo interpretation database 114. The differences between transcriptions and interpretations and between the memo transcription database 112 and the memo interpretation database 114 are described above in detail with reference to FIG. 1. - In
operation 214 feedback is provided to the user in the form of speech 122 or a message/text to a mobile device 124 or some other similar device. The speech can include a request for the user to confirm whether or not they intended to store a personal memo, or a confirmation to the user that the information has been stored as a personal memo. - One aspect of the technology disclosed includes assigning a time period to a memo after which the memo will expire and then removing the memo (or memo related information) from the
memo transcription database 112 and/or the memo interpretation database 114. - Another aspect of the technology disclosed includes interpreting the
query 103 and/or the statement 203 according to multiple domains (e.g., multiple grammar rules), wherein each domain of the multiple domains has an associated relevancy score for the interpreted utterance. The memo domain 108 is one domain of the multiple domains and the memo domain 108 has an advantage over the other domains with respect to interpreting queries and statements related to personal memos. As such, when the query 103 and/or the statement 203 is directed to a personal memo, the interpretation using the memo domain 108 will have the highest relevancy score as compared to the other domains. Additionally, different interpretations of the query 103 and/or the statement 203 using the multiple domains can be stored in the memo interpretation database 114. - The information stored in the memo interpretation database 114 can be stored along with additional information, such as meta-data or meta-information that describes the memo as pertaining to a short-term activity, daily weather, or an until-event such as a child being at soccer practice, which is cancelled (or deleted) when the parent arrives at and then leaves the soccer field as a result of picking up the child. The meta-data or meta-information can be explicitly stated by the user (e.g., “I'll be at work until 5 pm”) or it can be inferred from other information obtained from the user, such as other personal memos, other calendar information or other routine information obtained from general tendencies of the user.
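 - The expiry period and the meta-data described above can be illustrated with a minimal sketch. It assumes a hypothetical `Memo` record with an optional time-to-live field named `ttl`; neither the record layout nor the names are taken from the disclosure, and a real system could just as well store an absolute expiry time or an until-event.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Memo:
    text: str
    stored_at: datetime
    # Hypothetical metadata field: None means the memo never expires.
    ttl: Optional[timedelta] = None

    def is_expired(self, now: datetime) -> bool:
        """A memo expires once its assigned time period has elapsed."""
        return self.ttl is not None and now >= self.stored_at + self.ttl


def purge_expired(memos: List[Memo], now: datetime) -> List[Memo]:
    """Return only the memos that are still valid at time `now`."""
    return [m for m in memos if not m.is_expired(now)]
```

In this sketch a short-term memo such as “I'll be at work until 5 pm” would carry a `ttl`, while a cooking memo would leave it unset and remain available indefinitely.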
- Additional examples of storing personal memos and then retrieving information related to the stored personal memos are provided below.
 - As mentioned above, virtual assistants or related devices often have wake phrases to indicate to the virtual assistant that the user is attempting to engage or use the virtual assistant. Assume that the technology disclosed utilizes a standard wake phrase of “Ok Hound” to engage the virtual assistant. One way to indicate that a user's utterance is intended to retrieve information from a stored personal memo would be to assign specific wake phrases, such as “Ok Hound check my personal information for . . . ,” or “Hound check my memos for information regarding . . . ”. Further, one way to indicate that a user's utterance is intended to be stored as a personal memo would be to assign specific wake phrases, such as “Ok Hound memo,” “Hound memo” or “Ok Hound remember.” Each of these example wake phrases would immediately indicate that the user is intending to retrieve or store a personal memo. However, sometimes users have difficulty remembering which wake phrases to use in which situation.
 - Accordingly, the technology disclosed is capable of determining whether or not a natural language utterance received after a generic wake phrase includes a specific trigger phrase to indicate that the user intends to search for a memo or store a memo. For the sake of simplicity, a “trigger phrase” can include just a single word or multiple words, and a “wake phrase” can include just a single word or multiple words. The wake phrase and trigger phrase can be used to make the system understand that it should record, store and retrieve the information to/from the “memo domain”. Additionally, weights on the “memo domain” can be invoked in order to make it the first domain (of multiple other domains) to consider when retrieving information.
- The trigger phrases can include personal pronouns, such as “I” (e.g., “Where did I put the key?”, “How long do I usually cook Lasagna?”) or possessives like “my” (e.g., “Where is my key?”). As another example, a trigger phrase may be identified as being an interrogative pronoun or a relative pronoun that is within 5 words of the personal pronoun, or a trigger phrase may be identified as being a personal pronoun followed by or preceded by an interrogative pronoun or a relative pronoun that is within 5 words of the personal pronoun. These are merely examples of the types of phrases that can be configured to indicate that the user is attempting to retrieve or store a personal memo.
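 - The proximity test described above can be sketched as follows. The word lists are assumptions chosen for illustration (the text names only “I” and “my” explicitly, and its examples use question words such as “where” and “how”), and the function name `has_memo_trigger` is hypothetical.

```python
import re
from typing import List

# Assumed word lists, for illustration only.
PERSONAL_WORDS = {"i", "my", "me", "mine"}
QUESTION_WORDS = {"where", "what", "which", "who", "whom", "whose", "how", "when"}
MAX_DISTANCE = 5  # "within 5 words of the personal pronoun"


def tokenize(utterance: str) -> List[str]:
    """Lowercase the utterance and split it into word tokens."""
    return re.findall(r"[a-z']+", utterance.lower())


def has_memo_trigger(utterance: str) -> bool:
    """True if a question word occurs within MAX_DISTANCE words of a personal word."""
    tokens = tokenize(utterance)
    personal = [i for i, t in enumerate(tokens) if t in PERSONAL_WORDS]
    question = [i for i, t in enumerate(tokens) if t in QUESTION_WORDS]
    return any(abs(p - q) <= MAX_DISTANCE for p in personal for q in question)
```

Under these assumptions, “Where did I put the key?” and “Where is my key?” both qualify as memo-directed, while a question with no personal word does not.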
- Once the trigger phrase is identified, then the appropriate domain (e.g., memo domain 108) will be selected and an appropriate grammar rule can also be selected in dependence upon the trigger phrase itself, other contents of the natural language utterance or a combination of both.
- For each domain, it is possible to (i) determine and assign all of the possible ways a user would store a personal memo, (ii) retrieve information from the stored personal memo and (iii) determine all of the ways for the virtual assistant to respond to the user.
-
FIG. 9 illustrates TABLE 1, which includes example phrases that would trigger the storing of a personal memo in the memo domain 108 or a specific sub-domain (e.g., cooking) of the memo domain 108. There can be multiple stages of complexity with respect to the virtual assistant understanding a request and providing an answer to the user. Different stages could be implemented by the virtual assistant due to many factors, such as availability of processing, communication bandwidth, certainty of interpretations and content of personal memos. -
Stage 1 examples require the stored memo and the query to be of a similar nature and the response is similar in nature as well. This is somewhat of a one-to-one correlation of the stored memo, the request and the response. This is the least complex of the stages, because the response is closely tied to the query. For example, in the first example of stage 1, the query states, “do I usually leave . . . in the oven,” and the response states “you usually leave . . . in the oven.” -
Stage 2 examples allow for more information to be inferred from the stored memo and the query for the memo and allow for different answers to be derived from the stored memo. Note that the arrows on the first row of stage 2 indicate that the utterance used to invoke storage can be queried using three different options and there are three possibilities for response. In other words, each cell of stage 2 has three counterpart cells. Although the arrows do not indicate such due to space constraints on TABLE 1, the same goes for the second and third rows of stage 2. For example, in the second row of stage 2, the user can state “To get a perfect lasagna leave it for 30 minutes in the oven.” Now, this personal memo can be queried in at least three different ways. In our example here, let's say that the user initiates the query using the phrase “How many minutes should I cook lasagna?” This is different from stage 1, because the virtual assistant has a broader range of potential queries that could result in finding a particular personal memo. The same goes for the response provided by the virtual assistant, such that a response to the query “How many minutes should I cook lasagna?” could be “You usually leave your lasagna in the oven for 30 minutes.” as opposed to “you should cook your lasagna for 30 minutes.” A particular response can be implemented by the virtual assistant based on previous responses that have been successful and/or unsuccessful (e.g., due to the user's vocabulary, certain responses can be more successful than others). -
Stage 3 is the most complex stage, because it allows for additional information to be derived from the stored memo, not just the cooking time. In the example for stage 3, the user most likely invoked the storage of the memo with a statement directed to the length of time for cooking lasagna, without really thinking about later retrieving an answer as to “where” the lasagna should be cooked. However, the virtual assistant identified at least two pieces of information from the memo, including the fact that the lasagna is cooked in the oven and that it is cooked for 30 minutes. Therefore, the virtual assistant can answer two different types of questions, including those related to how long to cook the lasagna and those related to where the lasagna should be cooked. -
FIG. 10 illustrates TABLE 2, which includes example phrases that would trigger the storing of a personal memo in the memo domain 108 or a specific sub-domain (e.g., object location) of the memo domain 108, as well as ways to query the personal memo and possible responses from the virtual assistant. TABLE 2 is different from TABLE 1, because TABLE 2 also includes examples of grammar rules and sentence parsing that can be implemented to store memos along with additional information and how the memo and additional information can be used to identify a query and structure a response. As described in TABLE 2, each sentence used to invoke storage of a memo is parsed to identify various components. For example, in the first row of TABLE 2, the virtual assistant identifies the personal pronoun “I” and then looks for a verb that is near the “I”. Here, any verb such as “put”, “am putting”, “'ll put” or “will put” that follows the “I” indicates to the virtual assistant that the utterance received from the user is related to the user putting an object somewhere. Continuing with this example, after the verb, the virtual assistant then looks for a variable (i.e., variable1, such as keys) that is likely to be put somewhere. Next the virtual assistant looks for another variable (i.e., variable2) describing where variable1 is placed. Once this personal memo is stored with the additional information obtained from parsing the utterance, the memo can be queried when the user asks a question including any variation of the verb “put” along with variable1 (e.g., keys). Row 1 of TABLE 2 also describes the structure of the response with respect to the information included in the initial statement from the user and the subsequent query. - The system may invoke user feedback to confirm whether or not a user intended to search for an answer based on a personal memo or to store a personal memo. 
If the user indicates that they did not intend to query a personal memo, then a different domain will be used to provide a response to the user's question. If the user indicates that they did not intend to store an utterance as a personal memo, then the personal memo will not be stored, or it will be deleted if it was stored. The confirmation requests to the user can be auditory or in the form of text and the user responses to the confirmation requests can be auditory or in the form of text. Additionally, if the virtual assistant cannot locate a memo that provides an answer to the user's request, then the virtual assistant can ask for a clarification.
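 - The sentence parsing described for the first row of TABLE 2 (the pronoun “I”, a form of the verb “put”, an object, then a place) can be approximated with a regular expression. This is only a sketch: the pattern, the list of prepositions, and the names `PUT_RULE`, `parse_put_statement` and `answer_where` are assumptions for illustration and are not the grammar rule of TABLE 2 itself.

```python
import re
from typing import Dict, Optional

# Assumed pattern: "I" + optional auxiliary + put/putting + object + preposition + place.
PUT_RULE = re.compile(
    r"\bI(?:\s+am|'ll|\s+will)?\s+(?:put|putting)\s+(?:my\s+|the\s+)?"
    r"(?P<variable1>.+?)\s+(?P<prep>in|on|under|inside)\s+(?P<variable2>.+?)[.!]?$",
    re.IGNORECASE,
)


def parse_put_statement(utterance: str) -> Optional[Dict[str, str]]:
    """Extract the object (variable1) and its place (variable2), or None if no match."""
    match = PUT_RULE.search(utterance.strip())
    if match is None:
        return None
    return {
        "variable1": match.group("variable1").lower(),
        "variable2": f"{match.group('prep').lower()} {match.group('variable2')}",
    }


def answer_where(memo: Dict[str, str], asked_object: str) -> Optional[str]:
    """Build a response for a 'where did I put ...' query about the same object."""
    if memo["variable1"] != asked_object.lower():
        return None
    return f"You put your {memo['variable1']} {memo['variable2']}."
```

A production grammar would need to handle many more verb forms and sentence shapes; the point here is only that the stored record keeps the object and the place as separate fields that a later query can match against.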
- Dealing with Multiple Related Memos
- A user can store and query multiple memos that are related to the same subject. For example, a user may indicate that they put their keys in a refrigerator for safe keeping. Then at a later point the user may indicate that they put their keys in their backpack. Now, when a user asks where their keys are located, the virtual assistant should be able to indicate to the user that their keys are stored in their backpack. This scenario can be handled in many different ways. First, the virtual assistant may store each memo with time information and then make an assumption that when the user asks about the location of their keys, the user is referring to the most recent memo about their keys. This is essentially time ordering all of the memos related to the location of the user's keys. By saving all of the memos regarding the location of the user's keys, the virtual assistant will be able to tell the user where they placed the keys before they were placed in the backpack. This would be helpful if the user actually did not put them in the backpack. In this case, the user would probably find their nicely cooled keys in the refrigerator. To accomplish this, a virtual assistant would parse search type statements to identify entities and attributes of the entities; search a database of memo information for the entity; and for database records related to the entity, check for the most recent one relating to the same attribute. In this example, the entity would be keys and the attribute would be location.
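 - The time-ordering approach above can be sketched with a small in-memory list. The `MemoRecord` fields and the helper names are hypothetical; the disclosure does not prescribe a storage schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class MemoRecord:
    entity: str        # e.g., "keys"
    attribute: str     # e.g., "location"
    value: str         # e.g., "backpack"
    stored_at: datetime


def history(records: List[MemoRecord], entity: str, attribute: str) -> List[MemoRecord]:
    """All records for the entity/attribute pair, most recent first."""
    matches = [r for r in records if r.entity == entity and r.attribute == attribute]
    return sorted(matches, key=lambda r: r.stored_at, reverse=True)


def latest(records: List[MemoRecord], entity: str, attribute: str) -> Optional[MemoRecord]:
    """The most recent memo, assumed to be what the user is asking about."""
    found = history(records, entity, attribute)
    return found[0] if found else None


def previous(records: List[MemoRecord], entity: str, attribute: str) -> Optional[MemoRecord]:
    """The memo stored just before the most recent one, if any."""
    found = history(records, entity, attribute)
    return found[1] if len(found) > 1 else None
```

Because nothing is deleted, `previous` can still report the refrigerator after the backpack memo has been stored.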
 - A second option would be to delete all previous memos relating to the location of the user's keys upon the storing of the most recent memo regarding the user's keys being in the backpack. To accomplish this, a virtual assistant would parse store type statements to identify entities and attributes; search a database for records about the same attribute of the same entity (only one should be found); delete the record; and store a new record with the new information about the entity and its attribute.
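 - The delete-then-store option can be sketched with a dictionary keyed by the entity and attribute pair, so that at most one record per pair ever exists. The class name `SingleValueMemoStore` and its methods are hypothetical.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (entity, attribute), e.g., ("keys", "location")


class SingleValueMemoStore:
    """Keeps only the newest value for each entity/attribute pair."""

    def __init__(self) -> None:
        self._records: Dict[Key, str] = {}

    def store(self, entity: str, attribute: str, value: str) -> Optional[str]:
        """Delete any existing record, store the new one, and return the deleted value."""
        old = self._records.pop((entity, attribute), None)
        self._records[(entity, attribute)] = value
        return old

    def lookup(self, entity: str, attribute: str) -> Optional[str]:
        """Return the single stored value for the pair, or None if nothing is stored."""
        return self._records.get((entity, attribute))
```

The trade-off relative to time ordering is that earlier locations are no longer recoverable once a newer memo replaces them.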
 - Additionally, the technology disclosed can understand when memos relate to changes in time. For example, a user might say “Ok Hound, remember that I pick up my dog every day of the workweek at 5 pm from doggy daycare” (this is a memo related to every Monday through Friday) or “Ok Hound, remember that today I pick up the dog at 4 pm from doggy hair salon” (this is a memo related to a specific day). Specific trigger phrases that will help indicate these behaviors are “every day,” “today,” and “tomorrow”.
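 - One way to picture how such time-related trigger phrases could be mapped to a schedule is sketched below. The `Schedule` record, the substring matching, and the special handling of “every day of the workweek” are assumptions for illustration; a real system would rely on its grammar rules rather than plain substring checks.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Tuple

WORKWEEK = (0, 1, 2, 3, 4)          # Monday through Friday, per date.weekday()
ALL_DAYS = (0, 1, 2, 3, 4, 5, 6)


@dataclass
class Schedule:
    weekdays: Optional[Tuple[int, ...]] = None  # recurring memo
    on_date: Optional[date] = None              # memo for one specific day


def schedule_from_utterance(utterance: str, today: date) -> Optional[Schedule]:
    """Map a time-related trigger phrase to a recurring or one-day schedule."""
    text = utterance.lower()
    if "every day of the workweek" in text:
        return Schedule(weekdays=WORKWEEK)
    if "every day" in text:
        return Schedule(weekdays=ALL_DAYS)
    if "tomorrow" in text:
        return Schedule(on_date=today + timedelta(days=1))
    if "today" in text:
        return Schedule(on_date=today)
    return None


def applies_on(schedule: Schedule, day: date) -> bool:
    """True if a memo with this schedule is relevant on the given day."""
    if schedule.weekdays is not None:
        return day.weekday() in schedule.weekdays
    return schedule.on_date == day
```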
-
FIG. 3 illustrates a block diagram of an example environment capable of speech enabled virtual assistants implementing technology that intelligently retrieves and presents favorite information of a user contained in or derived from previously identified and stored favorites. - Specifically, the
environment 300 illustrated in FIG. 3 is similar to the environment 100 of FIG. 1, except that the query 103 is directed to a favorites domain 308 for the purpose of obtaining information from a favorites transcription database 312 or a favorites interpretation database 314. The favorites domain 308 is similar to the memo domain 108 of FIG. 1, except that the favorites domain 308 has a different grammar rule for interpreting the query 103. Furthermore, the favorites transcription database 312 stores transcriptions of previously stored natural language utterances related to “favorites” of a user and the favorites interpretation database 314 stores interpretations of natural language utterances related to “favorites” of a user. - Generally, favorites are different from personal memos, because they are inherently narrower in scope and have a longer duration of relevance. Some example categories of favorites could be favorite types of food, grocery stores, hotels, friends, gymnasiums or recreation facilities, hair dressers, schools, colleges, sports teams, etc.
-
FIG. 4 illustrates a block diagram of an example environment capable of speech enabled virtual assistants implementing technology that is capable of receiving favorites and intelligently storing the favorites along with information derived from the favorites. - The
environment 400 of FIG. 4 is similar to the environment 200 of FIG. 2, except that the statement 202 is (i) interpreted using the favorites domain 308, (ii) transcribed and stored in the favorites transcription database 312 and (iii) interpreted for storage in the favorites interpretation database 314. All of the descriptions provided above with respect to FIGS. 1 and 2 and memos are applicable to the storing and retrieval of favorites and information derived from the favorites. For example, wake phrases, trigger phrases, etc., are applicable to favorites. Additionally, a memo and/or memo related information can indicate that a specific entity is a favorite of the user. Some examples of retrieving favorite information of the user and storing information related to a user's favorite are discussed below. -
FIG. 11 illustrates TABLE 3, which includes some example ways of invoking the storing of favorite information, querying favorite information and possible responses from a virtual assistant. -
FIG. 12 illustrates TABLE 4, which is similar to TABLE 3, except that it illustrates some example ways of using favorite information for obtaining directions and travel information. -
FIG. 13 illustrates TABLE 5, which is similar to TABLE 4, except that it illustrates some example ways of storing multiple favorites for a specific category and then later obtaining specific information for both of the favorites in the same category or obtaining favorite information of multiple favorites based on geographic location. - Other example implementations of “favorites” can include building a recommendations table based on the user's stored favorites. Here is an example: (i) User: “I like Red Lobster® Restaurant”; (ii) Virtual Assistant: obtains information regarding Red Lobster Restaurant from another service, such as Yelp® (e.g., Seafood/Bar/Kids' menu/Casual & Cozy/3.9 stars/etc.); (iii) User: “Are there any restaurants around here I might like?”; (iv) Virtual Assistant: “There are other restaurants in the area that have similar characteristics and ratings as your other favorites such as Fish Market Restaurant in San Mateo, would you like me to provide you with a full list of options?”
-
FIGS. 5A, 5B and 5C show three example implementations of the technology disclosed using different types of virtual assistants. For example, FIG. 5A illustrates a mobile phone 502. Because mobile phones are battery-powered, it is important to minimize complex computations so as not to run down the battery. Therefore, the mobile phone 502 may connect over the Internet to a server. The mobile phone 502 has a visual display that can provide information in some use cases. However, the mobile phone 502 also has a speaker, and in some use cases the mobile phone 502 may respond to an utterance using only speech. -
FIG. 5B also illustrates a home assistant device 504, which may plug into a stationary power source, so it has power to do more advanced local processing than the mobile phone 502. Like the mobile phone 502, the home assistant device 504 may rely on a cloud server for interpretation of utterances according to specialized domains and in particular domains that require dynamic data to form useful results. Because the home assistant device 504 has no display, it is a speech-only device. -
FIG. 5C illustrates an automobile 506. The automobile 506 may be able to connect to the Internet through a wireless network. However, if driven away from an area with a reliable wireless network, the automobile 506 must process utterances, respond, and give appropriate results reliably, using only local processing. As a result, the automobile 506 can run software locally for natural language utterance processing. Though many automobiles have visual displays, to avoid distracting drivers in dangerous ways, the automobile 506 may provide results with speech-only requests and responses or may provide results to a display for only non-driving passengers to view and interact with. -
FIG. 6 shows an overhead view of an automobile 600 designed to implement the technology disclosed. The automobile 600 has two front seats 602, either of which can hold one person. The automobile 600 also has a back seat 604 that can hold several people. The automobile 600 has a driver information console 606 that displays basic information such as speed and energy level. The automobile 600 also has a dashboard console 608 for more complex human interactions that cannot be quickly conducted by speech, such as viewing and tapping locations on navigational maps. - The
automobile 600 has side bar microphones 610 and a ceiling-mounted console microphone 612, all of which receive speech audio such that a digital signal processor embedded within the automobile can perform an algorithm to distinguish between speech from the driver and speech from the front-seated passenger. The automobile 600 also has a rear ceiling-mounted console microphone 614 that receives speech audio from rear-seated passengers. - The
automobile 600 also has a car audio sound system with speakers. The speakers can play music but also produce speech audio for spoken responses to user commands and results. The automobile 600 also has an embedded microprocessor. It runs software, stored on non-transitory computer-readable media, that instructs the processor to perform some or all of the operations discussed with reference to the algorithm of FIGS. 1-5, 7 and 8, among other functions. -
FIG. 7 illustrates an example environment 700 in which personal memos and/or favorites (or information derived therefrom) can be stored, searched for retrieval and used for generation of intelligent responses using the technology disclosed. The environment 700 includes at least one user device. The user device 702 can be a mobile phone, tablet, workstation, desktop computer, laptop or any other type of user device running an application 704. The user device 702 can be an automobile 706 or any other combination of hardware and software that is running an application 704. - The
user devices more communication networks 708 that allow for communication between various components of theenvironment 700 and that allow for performing of searches on the internet or other networks. In one implementation, thecommunication networks 708 include the internet. Thecommunication networks 708 also can utilize dedicated or private communication links that are not necessarily part of the public internet. In one implementation thecommunication networks 708 use standard communication technologies, protocols, and/or inter-process communication technologies. Theuser devices application 704 is implemented on theuser devices - The
environment 700 also includesapplications 710 that can be preinstalled on theuser devices user devices environment 700 includes Application Programming Interfaces (APIs) 711 that can also be preinstalled on theuser devices user devices APIs 711 can be implemented to allow theuser devices applications 710 to easily gain access to other components on theenvironment 700 as well as certain private networks. - The
environment 700 also includes an interpreter 712 that can be running on one or more platforms/servers that are part of a speech recognition system. The interpreter 712 can be a single computing device (e.g., a server), a cloud computing device, or it can be any combination of computing devices, cloud computing devices, etc., that are capable of communicating with each other to perform the various tasks required to perform meaningful interpretation, as well as speech recognition, if desired. The interpreter 712 can include a deep learning system 714 that is capable of using artificial intelligence, neural networks, and/or machine learning to perform interpretations. The deep learning system 714 can implement language embedding(s), such as a model or models 716, as well as a natural language domain 718 for providing domain-specific translations and interpretations for natural language processing (NLP). - Since the
interpreter 712 can be spread over multiple servers and/or cloud computing devices, the operations of the deep learning system 714, the language embedding(s) 716 and the natural language domains 718 can also be spread over multiple servers and/or cloud computing devices. The applications 710 can be used by and/or in conjunction with the interpreter 712 to translate spoken input, as well as text input and text file input. Again, the various components of the environment 700 can communicate (exchange data) with each other using customized APIs 711 for security and efficiency. The interpreter 712 is capable of interpreting a query or statement (e.g., natural language utterance) obtained from the user devices. - The
user devices interpreter 712 can each include memory for storage of data and software applications, a processor for accessing data in executing applications, and components that facilitate communication over the communications networks 708. Theuser devices applications 704, such as web browsers (e.g., aweb browser application 704 executing on the user device 702), to allow developers to prepare and submitapplications 710 and allow users to submit speech audio queries (e.g., thespeech input 102 and query 103 ofFIG. 1 ) including natural language utterances to be interpreted by theinterpreter 712. - As mentioned above, the
interpreter 712 can implement one or more language embeddings (models) 716 from a repository of embeddings (models) (not illustrated) that are created and trained using the techniques that are known to a person of ordinary skill in the art. - As also mentioned above, the
natural language domain 718 can be implemented by the interpreter 712 in order to add context or real meaning to the transcription of the received speech input. - The
environment 700 can further include a topic analyzer 720 that can implement one or more topic models 722 to analyze and determine a topic of a query or statement. Some of the operations of the topic analyzer 720 could be performed during, for example, transcription operation 106 of FIG. 1. - Furthermore, the
environment 700 can include a disambiguator 724 that is able to utilize any type of external data 726 (e.g., disambiguation information) in order to add further meaning to an obtained query. Essentially, the disambiguator 724 is able to add further meaning to a query or statement by analyzing previous searches of the user, profile data of the user, location information, calendar information, date and time information, etc. For example, the disambiguator 724 can be used to add synonyms to the initial search that can be helpful to narrow the search to what the user wants to find. The disambiguator 724 can also add additional limits to the search, such as certain dates and/or timeframes (e.g., based on the travel plans of the user additional limits can be added to the original query to identify events that are occurring while the user is traveling to a certain region). - For example, if the
query 103 obtained by one of the user devices is “How long do I cook lasagna?”, the topic analyzer 720 can analyze the query and determine that the topic (or domain) is “memo.cooking”. The disambiguator 724 can use the external data 726 to determine that the user has been cooking at their mother's house for the past few days. Accordingly, the disambiguator 724 can extend the terms of the first query from “How long do I cook lasagna?” to “How long do I cook lasagna at my mother's house?” Prior to extending the query, the system can ask the user if they are cooking at their home or at their mother's house. In other words, the combination of the results obtained by the topic analyzer 720 and the disambiguator 724 can essentially narrow the scope of the query. The disambiguator 724 can also use other mechanisms to extend the keywords of the received queries. This can be done by asking the user broad or specific questions regarding their initial query or can simply be done using artificial intelligence or other means to be able to further narrow the initial query. - Regardless of whether the
topic analyzer 720 and/or the disambiguator 724 are implemented to change the scope of any of the queries or statements, a searcher 732 of the environment 700 is implemented to perform a search for a memo or favorite information based on the query to obtain language. The searcher 732 can implement language and domain data 734 to determine which domains should be searched. - The
searcher 732 can, for example, identify a domain for a query in dependence upon at least one of a wake phrase, a trigger phrase, and the contents or topic of the query, as determined by the topic analyzer 720. The searcher 732 is not limited to searching just a single domain. The searcher 732 can search multiple domains in parallel or in series. For example, if an insufficient number of results are found after searching in the first domain (e.g., the memo domain), a second domain (e.g., favorites) may be searched. - Various scoring techniques can be implemented which will be understood by one of ordinary skill in the art. Further, the user may have the option to select various scoring and ranking techniques to be implemented. For example, the user may select to have scoring and ranking independently implemented (and presented) for each domain. The scorer/
ranker 730 may only present the top X results or a top Y percentage of results so as to not overwhelm the user. - Whether the results are presented in speech or text, the technology disclosed can also provide a brief visual or auditory summary of each result, making it easier for the user to determine which results they would like to view first.
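 - Limiting the presented results to the top X results or a top Y percentage can be sketched as below. The function name `top_results`, its parameters, and the choice to round the percentage cut-off up are assumptions for illustration; the scoring itself is left to whichever technique is selected.

```python
import math
from typing import List, Optional, Tuple


def top_results(
    scored: List[Tuple[str, float]],
    top_x: Optional[int] = None,
    top_percent: Optional[float] = None,
) -> List[str]:
    """Rank (result, score) pairs by score and keep the top X or top Y percent."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    if top_x is not None:
        limit = top_x
    elif top_percent is not None:
        # Round up so that a non-empty result list never shrinks to nothing.
        limit = math.ceil(len(ranked) * top_percent / 100)
    else:
        limit = len(ranked)
    return [text for text, _ in ranked[:limit]]
```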
- The
interpreter 712, topic analyzer 720, disambiguator 724, scorer/ranker 730 and/or the searcher 732 can be implemented using at least one hardware component and can also include firmware, or software running on hardware. Software that is combined with hardware to carry out the actions of the interpreter 712, topic analyzer 720, disambiguator 724, scorer/ranker 730 and/or the searcher 732 can be stored on computer readable media such as rotating or non-rotating memory. The non-rotating memory can be volatile or non-volatile. In this application, computer readable media does not include a transitory electromagnetic signal that is not stored in a memory; computer readable media store program instructions for execution. The interpreter 712, topic analyzer 720, disambiguator 724, scorer/ranker 730 and/or the searcher 732, as well as the applications 710, the topic models 722, external data 726, the language and domain data 734 and the APIs 711 can be wholly or partially hosted and/or executed in the cloud or by other entities connected through the communications network 708. -
FIG. 8 is a block diagram of an example computer system that can implement various components of the environment 700 of FIG. 7. Computer system 810 typically includes at least one processor 814, which communicates with a number of peripheral devices via bus subsystem 812. These peripheral devices may include a storage subsystem 824, comprising for example memory devices and a file storage subsystem, user interface input devices 822, user interface output devices 820, and a network interface 815. The input and output devices allow user interaction with computer system 810. Network interface 815 provides an interface to outside networks, including an interface to the communication networks 708, and is coupled via the communication networks 708 to corresponding interface devices in other computer systems. - User
interface input devices 822 may include audio input devices such as speech recognition systems, microphones, and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input speech information intocomputer system 810 or ontocommunication network 708. - User
interface output devices 820 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information fromcomputer system 810 to the user or to another machine or computer system. -
Storage subsystem 824 stores programming and data constructs that provide the functionality of some or all of the modules described herein. These software modules are generally executed byprocessor 814 alone or in combination with other processors. -
Memory subsystem 825 used in the storage subsystem can include a number of memories including a main random-access memory (RAM) 830 for storage of instructions and data during program execution and a read only memory (ROM) 832 in which fixed instructions are stored. Afile storage subsystem 828 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain embodiments may be stored byfile storage subsystem 828 in thestorage subsystem 824, or in other machines accessible by the processor. -
Bus subsystem 812 provides a mechanism for letting the various components and subsystems ofcomputer system 810 communicate with each other as intended. Althoughbus subsystem 812 is shown schematically as a single bus, alternative embodiments of the bus subsystem may use multiple busses. -
Computer system 810 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description ofcomputer system 810 depicted inFIG. 8 is intended only as a specific example for purposes of illustrating the various embodiments. Many other configurations ofcomputer system 810 are possible having more or fewer components than the computer system depicted inFIG. 8 . - We describe various implementations of retrieving a personal memo from a database and storing a memo in a database.
- The technology disclosed can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the following implementations.
- A method implementation of the technology disclosed includes a method of retrieving a personal memo from a database. The method includes: receiving, by a virtual assistant, a natural language utterance that expresses a request; interpreting the natural language utterance according to a natural language grammar rule for retrieving memo data from the natural language utterance, the natural language grammar rule recognizing query information; responsive to interpreting the natural language utterance, using the query information to query the database for a memo related to the query information; and providing, to a user, a response generated in dependence upon the memo related to the query information.
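The four steps of the method (receive, interpret with a grammar rule, query, respond) can be sketched in a few lines. Everything named here is hypothetical: the regular expression stands in for a natural language grammar rule, the list of dictionaries stands in for the database, and the `item` and `location` fields are an assumed memo layout rather than the format used by the technology disclosed.

```python
import re
from typing import Dict, List, Optional

# Hypothetical grammar rule: recognizes "where did I put/leave/park my <item>"
# and captures <item> as the query information.
RETRIEVE_RULE = re.compile(
    r"^where did i (?:put|leave|park) my (?P<item>[\w ]+?)\??$", re.IGNORECASE
)


def interpret(utterance: str) -> Optional[Dict[str, str]]:
    """Apply the grammar rule and return the recognized query information."""
    match = RETRIEVE_RULE.match(utterance.strip())
    if match is None:
        return None
    return {"item": match.group("item").lower()}


def query_memos(
    database: List[Dict[str, str]], query: Dict[str, str]
) -> Optional[Dict[str, str]]:
    """Return the first stored memo related to the query information."""
    for memo in database:
        if memo["item"] == query["item"]:
            return memo
    return None


def respond(utterance: str, database: List[Dict[str, str]]) -> str:
    """Generate a response in dependence upon the retrieved memo."""
    query = interpret(utterance)
    if query is None:
        return "Sorry, I did not understand that request."
    memo = query_memos(database, query)
    if memo is None:
        return f"I have no memo about your {query['item']}."
    return f"Your {memo['item']} is {memo['location']}."


if __name__ == "__main__":
    memos = [{"item": "passport", "location": "in the top drawer of the desk"}]
    print(respond("Where did I put my passport?", memos))
```

The response is built from the memo's fields, so it answers the question instead of replaying the stored text.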
- According to an implementation, the natural language grammar rule for retrieving memo data is selected from a plurality of domain dependent grammar rules in accordance with the contents of the received natural language utterance.
- In another implementation, the database is queried for the memo related to the query information by searching the database to identify any memo that includes information sufficient to provide an appropriate response to the user.
- In an implementation, the response provided to the user answers the request expressed by the natural language utterance, as opposed to providing a word-for-word repeat of a transcription.
- A further implementation includes identifying a trigger phrase from the received natural language utterance, and responsive to identifying the trigger phrase, selecting the natural language grammar rule for retrieving memo data in dependence upon at least one of (i) the identified trigger phrase and (ii) other contents of the natural language utterance.
- In an implementation, the trigger phrase includes a personal pronoun followed by an interrogative pronoun or a relative pronoun that is within 5 words of the personal pronoun.
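A trigger phrase test of this shape can be sketched as a word-window check. The pronoun lists below are small illustrative samples, and reading "within 5 words" as "among the five words that follow the personal pronoun" is an assumption made for the example.

```python
import re

# Illustrative word lists (assumptions, not an exhaustive grammar).
PERSONAL_PRONOUNS = {"i", "me", "we", "us"}
INTERROGATIVE_OR_RELATIVE = {"who", "whom", "whose", "what", "which", "that"}


def has_trigger_phrase(utterance: str, max_distance: int = 5) -> bool:
    """True if a personal pronoun is followed, within max_distance words,
    by an interrogative or relative pronoun."""
    words = re.findall(r"[a-z']+", utterance.lower())
    for index, word in enumerate(words):
        if word in PERSONAL_PRONOUNS:
            # Look only at the words that follow the personal pronoun.
            window = words[index + 1 : index + 1 + max_distance]
            if any(candidate in INTERROGATIVE_OR_RELATIVE for candidate in window):
                return True
    return False


if __name__ == "__main__":
    print(has_trigger_phrase("Tell me what I said about the lobster"))
    print(has_trigger_phrase("I am hungry"))
```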
- In a different implementation the method can include receiving an indication that the user spoke a memo-specific wake phrase before the natural language utterance.
- In a further implementation the database storing the memo is a structured database, such that the memo is stored in a structured format, and in another implementation the database storing the memo is an unstructured database, such that the memo is stored in an unstructured format.
- In one implementation the method includes receiving, from the user, a natural language utterance including memo information, interpreting the natural language utterance to extract the memo information, and storing the memo information in the database as a memo.
- In another implementation, the stored interpretation of the natural language utterance including the memo information includes personal information about the user.
- Moreover, an implementation can include receiving, interpreting and storing multiple natural language utterances including the memo information as memos that relate to a subject along with additional information indicating a time-order of being received, and generating the response in dependence upon a stored memo (i) relating to the subject and (ii) that was interpreted from a most recently received natural language utterance including the memo information relating to the subject.
- Another implementation may include, when multiple natural language utterances including the memo information are received, interpreted and stored in the database as memos that relate to a subject, replacing the previously stored memos that relate to the subject with the most recently stored memo that relates to the subject.
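The two ways of handling several memos on one subject (keeping all of them with a time-order and answering from the newest, or replacing older memos outright) can be sketched with a small in-memory store. The class names, the integer counter used as the time-order, and the `replace_older` flag are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Memo:
    subject: str
    text: str
    received_order: int  # increasing counter that records the time-order


class MemoStore:
    """Hypothetical in-memory stand-in for the memo database."""

    def __init__(self, replace_older: bool = False) -> None:
        self._memos: Dict[str, List[Memo]] = {}
        self._counter = 0
        self._replace_older = replace_older

    def store(self, subject: str, text: str) -> Memo:
        self._counter += 1
        memo = Memo(subject, text, self._counter)
        if self._replace_older:
            # Replacement variant: only the newest memo per subject survives.
            self._memos[subject] = [memo]
        else:
            # Time-ordered variant: keep every memo with its order.
            self._memos.setdefault(subject, []).append(memo)
        return memo

    def latest(self, subject: str) -> Optional[Memo]:
        """Return the most recently received memo for the subject, if any."""
        memos = self._memos.get(subject)
        if not memos:
            return None
        return max(memos, key=lambda memo: memo.received_order)

    def count(self, subject: str) -> int:
        return len(self._memos.get(subject, []))


if __name__ == "__main__":
    store = MemoStore()
    store.store("car", "parked on level 2")
    store.store("car", "parked on level 5")
    print(store.latest("car").text)
```

Both variants give the same answer to a query; they differ only in whether earlier memos remain available.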
- According to one implementation, the method includes allowing the user to confirm or acknowledge whether or not the user intended for the natural language utterance including the memo information to be stored as the memo.
- According to a further implementation, the method includes deleting the stored memo related to the natural language utterance including the memo information when the user indicates that the natural language utterance including the memo information was not intended to be stored as the memo.
- According to another implementation, the method includes assigning a time period to the memo, after which the memo will expire, and removing the memo from the database when the time period has expired.
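Memo expiry can be sketched by storing a time period with each memo and filtering on it. The field names and the use of seconds are assumptions, and the current time is passed in explicitly so the example stays deterministic; a running system would more likely read a clock such as `time.time()`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimedMemo:
    text: str
    stored_at: float    # time the memo was stored, in seconds
    ttl_seconds: float  # time period after which the memo expires

    def is_expired(self, now: float) -> bool:
        return now >= self.stored_at + self.ttl_seconds


def purge_expired(memos: List[TimedMemo], now: float) -> List[TimedMemo]:
    """Return only the memos whose time period has not yet elapsed."""
    return [memo for memo in memos if not memo.is_expired(now)]


if __name__ == "__main__":
    memos = [
        TimedMemo("car is on level 5", stored_at=0.0, ttl_seconds=3600.0),
        TimedMemo("passport is in the desk", stored_at=0.0, ttl_seconds=86400.0),
    ]
    remaining = purge_expired(memos, now=7200.0)
    print([memo.text for memo in remaining])
```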
- An implementation may also include interpreting the natural language utterance that expresses the request according to multiple domains, each domain of the multiple domains having an associated relevancy score for the interpreted utterance, wherein a memo domain is one of the multiple domains, and wherein the memo domain has a score advantage relative to other domains.
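One way to picture a score advantage for the memo domain is a fixed bonus added to that domain's relevancy score before the domains are compared. The additive form, the bonus value, and the domain names are assumptions; the text above says only that the memo domain has an advantage relative to other domains.

```python
from typing import Dict


def pick_domain(
    scores: Dict[str, float], memo_domain: str = "memo", memo_bonus: float = 0.1
) -> str:
    """Choose the best-scoring domain after giving the memo domain a bonus."""
    if not scores:
        raise ValueError("at least one domain score is required")
    # Assumption: the advantage is a constant added to the memo domain's score.
    adjusted = {
        domain: score + (memo_bonus if domain == memo_domain else 0.0)
        for domain, score in scores.items()
    }
    return max(adjusted, key=lambda domain: adjusted[domain])


if __name__ == "__main__":
    print(pick_domain({"weather": 0.62, "memo": 0.60}))
    print(pick_domain({"weather": 0.90, "memo": 0.60}))
```

With the bonus, a memo interpretation wins close contests but still loses when another domain is clearly more relevant.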
- Additionally, according to one implementation the method may include storing a recording of the natural language utterance that expresses the request and/or storing a recording of the natural language utterance including the memo information.
- According to an implementation, a first particular interpretation of the transcription of text is stored in the database in association with a first domain and a second particular interpretation of the transcription is stored in the database in association with a second domain, such that two or more interpretations are stored in the database.
- One implementation may include storing meta-data along with the memo, where the meta-data include information such as short-term activity information, daily weather information, until-event occurs information, and where the meta-data can be explicitly stated by the user or inferred from other information including other memos, regular commute information and/or calendar information.
- Other implementations may include a non-transitory computer-readable recording medium having a computer program for retrieving a personal memo from a database recorded thereon. The computer program, when executed on one or more processors, causes the processors to perform the method described above and any of the above-described implementations. Specifically, the method includes: receiving, by a virtual assistant, a natural language utterance that expresses a request; interpreting the natural language utterance according to a natural language grammar rule for retrieving memo data from the natural language utterance, the natural language grammar rule recognizing query information; responsive to interpreting the natural language utterance, using the query information to query the database for a memo related to the query information; and providing, to a user, a response generated in dependence upon the memo related to the query information.
- Each of the features discussed in this particular implementation section for the first system implementation applies equally to the computer-readable medium (CRM) implementation. As indicated above, the system features are not all repeated here and should be considered repeated by reference.
- A system implementation of the technology disclosed includes one or more processors coupled to memory. The memory is loaded with computer instructions to retrieve a personal memo from a database. The instructions, when executed on the one or more processors, implement actions including: receiving, by a virtual assistant, a natural language utterance that expresses a request; interpreting the natural language utterance according to a natural language grammar rule for retrieving memo data from the natural language utterance, the natural language grammar rule recognizing query information; responsive to interpreting the natural language utterance, using the query information to query the database for a memo related to the query information; and providing, to a user, a response generated in dependence upon the memo related to the query information.
- This system implementation and other systems disclosed optionally include one or more of the following features. The system can also include features described in connection with the methods disclosed. In the interest of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
- A given event or value is “responsive” (e.g., “in response to” or “responsive to”) to a predecessor event or value if the predecessor event or value influenced the given event or value. If there is an intervening processing element, step or time period, the given event or value can still be “responsive” to the predecessor event or value. If the intervening processing element or step combines more than one event or value, the signal output of the processing element or step is considered “responsive” to each of the event or value inputs. If the given event or value is the same as the predecessor event or value, this is merely a degenerate case in which the given event or value is still considered to be “responsive” to the predecessor event or value. “Dependency” (e.g. “in dependence upon” or “in dependence on”) of a given event or value upon another event or value is defined similarly.
Claims (18)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/531,371 US20220076678A1 (en) | 2019-01-23 | 2021-11-19 | Receiving a natural language request and retrieving a personal voice memo |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/255,674 US11211064B2 (en) | 2019-01-23 | 2019-01-23 | Using a virtual assistant to store a personal voice memo and to obtain a response based on a stored personal voice memo that is retrieved according to a received query |
US17/531,371 US20220076678A1 (en) | 2019-01-23 | 2021-11-19 | Receiving a natural language request and retrieving a personal voice memo |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/255,674 Continuation US11211064B2 (en) | 2019-01-23 | 2019-01-23 | Using a virtual assistant to store a personal voice memo and to obtain a response based on a stored personal voice memo that is retrieved according to a received query |
Publications (1)
Publication Number | Publication Date |
---|---|
US20220076678A1 (en) | 2022-03-10 |
Family
ID=71609026
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/255,674 Active 2039-03-19 US11211064B2 (en) | 2019-01-23 | 2019-01-23 | Using a virtual assistant to store a personal voice memo and to obtain a response based on a stored personal voice memo that is retrieved according to a received query |
US17/531,371 Abandoned US20220076678A1 (en) | 2019-01-23 | 2021-11-19 | Receiving a natural language request and retrieving a personal voice memo |
Country Status (1)
Country | Link |
---|---|
US (2) | US11211064B2 (en) |
Also Published As
Publication number | Publication date |
---|---|
US20200234698A1 (en) | 2020-07-23 |
US11211064B2 (en) | 2021-12-28 |
Legal Events
Code | Title | Description |
---|---|---|
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
AS | Assignment | Owner name: SOUNDHOUND, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SPIRIDONOVA, IRINA A.;STAHL, KARL;SELVAGGI, MARA;SIGNING DATES FROM 20190117 TO 20190122;REEL/FRAME:058783/0284 |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
AS | Assignment | Owner name: ACP POST OAK CREDIT II LLC, TEXAS Free format text: SECURITY INTEREST;ASSIGNORS:SOUNDHOUND, INC.;SOUNDHOUND AI IP, LLC;REEL/FRAME:063349/0355 Effective date: 20230414 |
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
AS | Assignment | Owner name: SOUNDHOUND AI IP HOLDING, LLC, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:SOUNDHOUND, INC.;REEL/FRAME:064083/0484 Effective date: 20230510 |
AS | Assignment | Owner name: SOUNDHOUND AI IP, LLC, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:SOUNDHOUND AI IP HOLDING, LLC;REEL/FRAME:064205/0676 Effective date: 20230510 |
AS | Assignment | Owner name: SOUNDHOUND AI IP, LLC, CALIFORNIA Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:ACP POST OAK CREDIT II LLC, AS COLLATERAL AGENT;REEL/FRAME:067698/0845 Effective date: 20240610 Owner name: SOUNDHOUND, INC., CALIFORNIA Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:ACP POST OAK CREDIT II LLC, AS COLLATERAL AGENT;REEL/FRAME:067698/0845 Effective date: 20240610 |
AS | Assignment | Owner name: MONROE CAPITAL MANAGEMENT ADVISORS, LLC, AS COLLATERAL AGENT, ILLINOIS Free format text: SECURITY INTEREST;ASSIGNOR:SOUNDHOUND, INC.;REEL/FRAME:068526/0413 Effective date: 20240806 |