WO2022125078A1 - Identifying and providing requested user information during voice calls and video calls - Google Patents

Identifying and providing requested user information during voice calls and video calls

Info

Publication number
WO2022125078A1
Authority
WO
WIPO (PCT)
Prior art keywords
computing device
user
information
selectable
party
Prior art date
Application number
PCT/US2020/063839
Other languages
French (fr)
Inventor
Brandon Charles Barbello
Shenaz Zack
Tim Wantland
Jan Piotr JEDRZEJOWICZ
Original Assignee
Google Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google Llc filed Critical Google Llc
Priority to PCT/US2020/063839 priority Critical patent/WO2022125078A1/en
Publication of WO2022125078A1 publication Critical patent/WO2022125078A1/en


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 3/00 Automatic or semi-automatic exchanges
    • H04M 3/42 Systems providing special services or facilities to subscribers
    • H04M 3/487 Arrangements for providing information services, e.g. recorded voice services or time announcements
    • H04M 3/493 Interactive information services, e.g. directory enquiries; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
    • H04M 3/4931 Directory assistance systems
    • H04M 3/4935 Connection initiated by DAS system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F 3/0482 Interaction with lists of selectable items, e.g. menus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0487 Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser
    • G06F 3/0488 Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72469 User interfaces specially adapted for cordless or mobile telephones for operating the device by selecting functions from two or more displayed items, e.g. menus or icons
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/221 Announcement of recognition results
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 2201/00 Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M 2201/40 Electronic components, circuits, software, systems or apparatus used in telephone systems using speech recognition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 2201/00 Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M 2201/42 Graphical user interfaces
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 2201/00 Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M 2201/50 Telephonic communication in combination with video communication
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 2203/00 Aspects of automatic or semi-automatic exchanges
    • H04M 2203/20 Aspects of automatic or semi-automatic exchanges related to features of supplementary services
    • H04M 2203/2038 Call context notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 2203/00 Aspects of automatic or semi-automatic exchanges
    • H04M 2203/25 Aspects of automatic or semi-automatic exchanges related to user interface aspects of the telephonic communication service
    • H04M 2203/251 Aspects of automatic or semi-automatic exchanges related to user interface aspects of the telephonic communication service where a voice mode or a visual mode can be used interchangeably
    • H04M 2203/253 Aspects of automatic or semi-automatic exchanges related to user interface aspects of the telephonic communication service where a voice mode or a visual mode can be used interchangeably where a visual mode is used instead of a voice mode
    • H04M 2203/254 Aspects of automatic or semi-automatic exchanges related to user interface aspects of the telephonic communication service where a voice mode or a visual mode can be used interchangeably where a visual mode is used instead of a voice mode where the visual mode comprises menus
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 3/00 Automatic or semi-automatic exchanges
    • H04M 3/42 Systems providing special services or facilities to subscribers
    • H04M 3/487 Arrangements for providing information services, e.g. recorded voice services or time announcements
    • H04M 3/493 Interactive information services, e.g. directory enquiries; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
    • H04M 3/4936 Speech interaction details

Definitions

  • Some telephone calls require a person to provide user information (e.g., reservation numbers, account numbers, personal information) to another person.
  • a utility company may require a caller to provide an account number and credit card number during a voice call.
  • the requested user information is not readily available or memorized.
  • the user of a computing device (e.g., a smartphone) may exit the communication application and open another application to locate the user information.
  • the user may need to find a physical copy of a previous billing statement to locate the requested user information.
  • the caller may need to provide sensitive user information (e.g., a credit card number, home address, birth date) in a public area.
  • This document describes systems and techniques to identify and provide requested user information during voice and video calls.
  • the described systems and techniques can determine whether audio data associated with a voice or video call between a user of a computing device and a third party includes a request for user information.
  • the third party asks for user information during the call.
  • the described systems and techniques identify user data responsive to the request for user information and display it on a display or provide it to the third party during the call. In this way, the described systems and techniques can improve user calls by identifying user data responsive to requests for user information and providing a convenient and secure way to communicate the user data to the third party.
  • a computing device obtains audio data that is output from a communication application executing on the computing device.
  • the audio data includes audible parts of a voice or video call between a user of the computing device and a third party.
  • the computing device determines whether the audio data includes a request for user information using the audible parts of the voice or video call.
  • the computing device identifies, using the audible parts, user data responsive to the request for user information.
  • the computing device displays the user data or provides it to the third party during the voice call or video call.
  • the described systems and techniques may allow user data to be disclosed safely and securely to a third party, for example when a user of the computing device is in an environment in which they may be overheard.
  • the described systems and techniques may allow sensitive user data such as credit card numbers to be provided to the third party without being disclosed to unauthorized persons in the environment.
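The end-to-end flow summarized above (obtain call audio, detect a request for user information, identify responsive user data, then display it or provide it to the third party) can be sketched in code. The sketch below is a minimal illustration only; every type and function in it (CallAudioSource, RequestDetector, UserDataStore, CallUi) is a hypothetical stand-in for the components the description names, such as the audio mixer, caption module, and machine-learned model, and not an API of any real platform.

```kotlin
// Hypothetical sketch of the described flow: obtain call audio, detect a
// request for user information, look up responsive user data, then either
// display it or provide it to the third party. Names are illustrative only.

data class UserDataItem(val label: String, val value: String)

interface CallAudioSource { fun nextChunk(): ByteArray? }                   // audio mixer output
interface RequestDetector { fun detectRequest(audio: ByteArray): String? }  // caption module + model
interface UserDataStore { fun lookup(requestTopic: String): UserDataItem? }
interface CallUi {
    fun showUserData(item: UserDataItem)                           // message element 138
    fun offerToProvide(item: UserDataItem, provide: () -> Unit)    // selectable control 134
}

fun assistCall(
    audio: CallAudioSource,
    detector: RequestDetector,
    store: UserDataStore,
    ui: CallUi,
    provideToThirdParty: (UserDataItem) -> Unit,
) {
    while (true) {
        val chunk = audio.nextChunk() ?: break                     // call ended
        val topic = detector.detectRequest(chunk) ?: continue      // e.g. "insurance member id"
        val item = store.lookup(topic) ?: continue
        ui.showUserData(item)                                      // display on the screen
        ui.offerToProvide(item) { provideToThirdParty(item) }      // share only if the user taps
    }
}
```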
  • FIG. 1 illustrates an example environment that includes a computing device that can identify and provide requested user information during voice calls and video calls.
  • FIG. 2 illustrates an example device diagram of a computing device that can identify and provide requested user information during voice calls and video calls.
  • FIG. 3 illustrates an example diagram of a machine-learned model of a computing device that can provide text descriptions for identifying and providing requested user information.
  • FIG. 4 illustrates a flow chart of example operations of a computing device that can identify and provide selectable controls and user data related to voice calls and video calls.
  • FIG. 5 illustrates example operations to identify and provide requested user information during voice calls and video calls.
  • FIGs. 6A-6D illustrate example user interfaces of a computing device to assist users with voice calls and video calls.
  • FIGs. 7A-7C illustrate other example user interfaces of a computing device to assist users with voice calls and video calls.
  • FIGs. 8A-8D illustrate other example user interfaces of a computing device to assist users with voice calls and video calls.
  • This document describes techniques and systems to identify and provide requested user information during voice calls and video calls on a computing device.
  • voice calls or video calls require a computing device user to provide user information (e.g., reservation numbers, account numbers, personal information) to another person.
  • the user must often search for the requested user information in personal records by opening another application or retrieving it from another device or a physical document. It can be challenging for users to locate or recall this information, especially when they are at work, running an errand, or away from home.
  • Some of the requested information can also be sensitive user information that users would prefer not to share in a public area (e.g. , in a cubicle at work, at the grocery store, on public transportation).
  • a user can use the communication application to call a medical office.
  • the medical office can request user information, including a home address, birth date, social security number, and medical insurance information.
  • the user can sometimes recall some of the requested user information (e.g., home address, birth date, and social security number).
  • the user must either find the medical insurance information on a website associated with the medical insurance provider, a medical insurance card, or a recent medical insurance document.
  • the user can also anticipate that the medical office will request the medical insurance information and locate it in advance.
  • Each of these options can be time-consuming or, in some situations, not possible.
  • the user may need to leave an application they are currently using on the smartphone and switch to a different application to find the information. In some cases, they may need to switch among multiple applications to find the required information. Such switching between applications can waste computing resources such as battery life and processing power. If users are in a public space, they may also be uncomfortable audibly sharing the requested user information, for example if the information is restricted or sensitive.
  • the described techniques and systems can conveniently assist users and securely communicate requested user information during voice and video calls by identifying and providing the user information.
  • the described techniques and systems can obtain audio data from a voice or video call and determine whether the conversation includes a request for user information. The described techniques and systems can then identify and provide user data responsive to the request.
  • the described techniques and systems could also assist a user who is hard of hearing in providing user data to a third party. For example, a user who is hard of hearing may not be able to hear a request for user data clearly, or at all. However, the described techniques and systems can determine whether a request was made and provide the data to the third party, or display the data for the user to provide to the third party. As such, the described techniques and systems can allow a user who is hard of hearing to respond to a request for user data that they may otherwise not have been able to respond to. The described techniques and systems may also improve the ease with which users who find difficulty in using a computing device may provide user data to a third party, for example users with dexterity impairments.
  • the smartphone can listen to the voice call and determine whether the medical office requests medical insurance information from the user.
  • the described systems and techniques can identify user data responsive to the request for user information from memory or another application.
  • the smartphone can then display the user data or provide it to the medical office.
  • the smartphone can also give the user the option to privately share the user data with the medical office by using a text-to-speech module to read the insurance information, sending it to the medical office as an email or text, or providing it as a series of dual-tone multi-frequency (DTMF) tones.
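One way a device could share a number with the third party without the user speaking it aloud, as described above, is to play it as DTMF tones on the active call. The sketch below assumes an Android-style android.telecom.Call handle, which is only available to an app acting as the in-call UI (an InCallService); tone timing and whether this path is permitted depend on the platform, so treat it as an illustration rather than a definitive implementation.

```kotlin
import android.telecom.Call
import kotlinx.coroutines.delay

// Illustrative only: plays each digit of the user data as a DTMF tone on an
// active android.telecom.Call. Assumes the app is the in-call UI (an
// InCallService) and therefore holds a Call object for the ongoing call.
suspend fun sendAsDtmf(call: Call, digits: String, toneMillis: Long = 250) {
    for (ch in digits) {
        if (!ch.isDigit() && ch != '*' && ch != '#') continue  // skip spaces, dashes, etc.
        call.playDtmfTone(ch)          // start the tone for this digit
        delay(toneMillis)
        call.stopDtmfTone()            // stop before moving to the next digit
        delay(100)                     // short gap so the far end hears distinct tones
    }
}
```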
  • the described techniques and systems provide a user-friendly and secure experience for smartphone users to locate and share requested user information.
  • the described techniques and systems may mean that a user can avoid switching between applications on their smartphone to find the requested user data. Battery and processing resources of the computing device may be conserved as a result.
  • a computing device can obtain audio data output from a communication application.
  • the audio data includes audible parts of a voice call or video call between a user of the computing device and a third party.
  • the computing device determines, using the audible parts, whether the audio data includes a request for user information, which is audibly provided by the third party during the voice call or the video call.
  • the computing device then identifies user data responsive to the request.
  • the computing device displays the user data or provides it to the third party during the call.
  • the computing device may only use the information from the audio data after the computing device receives explicit permission from a user of the computing device. For example, in situations discussed above in which the computing device may collect audio data from voice and video calls, individual users may be provided with an opportunity to provide input to control whether programs or features of the computing device can collect and make use of the information. The individual users may further be provided with an opportunity to control what the programs or features can or cannot do with the information.
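The explicit-permission requirement described above can be modeled as a simple consent gate that is checked before any call audio is analyzed or any user data is shared. The setting names below are hypothetical and shown only to make the gating concrete.

```kotlin
// Hypothetical consent gate: the caption pipeline only ever sees call audio
// if the user has explicitly opted in, and sharing data with the third party
// additionally requires a per-call confirmation tap unless auto-share is on.
data class CallAssistConsent(
    val mayAnalyzeCallAudio: Boolean = false,   // opt-in to on-device captioning
    val mayAutoShareUserData: Boolean = false,  // opt-in to sharing without a tap
)

fun shouldAnalyze(consent: CallAssistConsent): Boolean = consent.mayAnalyzeCallAudio

fun shouldShare(consent: CallAssistConsent, userTappedShare: Boolean): Boolean =
    consent.mayAnalyzeCallAudio && (consent.mayAutoShareUserData || userTappedShare)
```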
  • This example is just one illustration of how the described systems and techniques to identify and provide requested user information during voice calls or video calls can improve a user’s experience on a computing device.
  • Other examples and implementations are described throughout this document.
  • This document now describes additional example configurations, components, and methods for identifying and providing requested user information during voice calls and video calls.
  • FIG. 1 illustrates an example environment 100 that includes an example computing device 102 that can identify and provide requested user information during voice calls and video calls.
  • the environment 100 includes a computing system 104 and a caller system 106.
  • the computing device 102, the computing system 104, and the caller system 106 are communicatively coupled to network 108.
  • operations of the computing device 102 are described as being performed locally, in some examples, the operations may be performed by multiple computing devices and systems (e.g, the computing system 104), including additional computing devices and systems beyond those shown in FIG. 1.
  • the computing system 104, the caller system 106, or any other device or system communicatively coupled to the network 108 may perform some or all of the functionality of the computing device 102, or vice versa.
  • the computing system 104 represents any combination of one or more computers, mainframes, servers, cloud computing systems, or other types of remote computing systems capable of exchanging information with the computing device 102 via the network 108.
  • the computing system 104 can store, or provide access to, additional processors, stored data, or other computing resources needed by computing device 102 to implement the described systems and techniques for providing selectable controls for IVR systems on the computing device 102.
  • the caller system 106 can execute an interactive voice response (IVR) system 110 to transmit and receive telephony data with the computing device 102 via the network 108.
  • the caller system 106 can be a mobile telephone, landline telephone, laptop computer, workstation at a telephone call center, or other computing device configured to present the IVR system 110 to a caller.
  • the caller system 106 can also represent any combination of computers, computing devices, mainframes, servers, cloud computing systems, or other types of remote computing systems capable of communicating information via network 108 to implement a voice call or a video call between the caller system 106 and the computing device 102.
  • the network 108 represents any public or private communications network for transmitting data (e.g, voice communications, video communications, data packages) between computing systems, servers, and computing devices.
  • the network 108 can include a public switched telephone network (PSTN), a wireless network (e.g., a cellular network, a wireless local area network (WLAN)), a wired network (e.g., a local area network (LAN), a wide area network (WAN)), an Internet Protocol (IP) telephony network (e.g., a voice-over-IP (VoIP) network), or any combination thereof.
  • the network 108 may include network hubs, network switches, network routers, or any other network equipment that is operatively inter-coupled.
  • the computing device 102, the computing system 104, and the caller system 106 may transmit and receive data across the network 108 using any suitable communication techniques.
  • the computing device 102, the computing system 104, and the caller system 106 can be operatively coupled to the network 108 using respective network links.
  • the computing device 102 represents any suitable computing device capable of identifying and providing requested user information during voice calls and video calls.
  • the computing device 102 may be a smartphone on which a user provides inputs to make or accept voice calls or video calls with a caller entity (e.g, the caller system 106).
  • the computing device 102 includes one or more communication units 112.
  • the communication units 112 allow the computing device 102 to communicate over wireless or wired networks, including the network 108.
  • the communication units 112 can include transceivers for cellular phone communication or network data communication.
  • the computing device 102 can tune the communication units 112 and supporting circuitry (e.g, antennas, frontend modules, amplifiers) to one or more frequency bands defined by various communication standards.
  • the computing device 102 includes a user interface component 114, which includes an audio component 116, a display component 118, and an input component 120.
  • the computing device 102 also includes an operating system 122 and a communication application 124. These components and other components (not illustrated) of the computing device 102 are operatively coupled in various ways, including wired and wireless busses and links.
  • the computing device 102 may include additional components and interfaces omitted from FIG. 1 for the sake of clarity.
  • the user interface component 114 manages input and output to a user interface 126 controlled by the operating system 122 or applications executing on the computing device 102.
  • the communication application 124 can cause the user interface 126 to display various user interface elements, including input controls, navigational components, informational components, or a combination thereof.
  • the user interface component 114 can include the audio component 116, the display component 118, and the input component 120.
  • the audio component 116, the display component 118, and the input component 120 can be separate or integrated as a single component.
  • the audio component 116 (e.g., a single speaker or multiple speakers) can output audible sounds, including the audible parts of voice calls and video calls.
  • the display component 118 can display visual elements on the user interface 126.
  • the display component 118 can include any suitable display technology, including light-emitting diode (LED), organic light-emitting diode (OLED), and liquid crystal display (LCD) technologies.
  • the input component 120 may be a microphone, presence-sensitive device, touch screen, mouse, keyboard, or another type of component configured to receive user input.
  • the operating system 122 generally controls the computing device 102, including the communication units 112, the user interface component 114, and other peripherals.
  • the operating system 122 can manage hardware and software resources of the computing device 102 and provide common services for applications.
  • the operating system 122 can control task scheduling.
  • the operating system 122 and the applications are generally executable by one or more processors (e.g., a system on chip (SoC), a central processing unit (CPU)) to enable communications and user interaction with the computing device 102.
  • the operating system 122 generally provides for user interaction through a user interface 126.
  • the operating system 122 also provides an execution environment for applications, for example, the communication application 124.
  • the communication application 124 allows the computing device 102 to make and receive voice calls and video calls with callers, including the caller system 106.
  • the communication application 124 can cause the user interface 126 to display a caller box 128, a numeric-keypad icon 130, a speakerphone icon 132, selectable controls 134, an end-call icon 136, and a message element 138.
  • the caller box 128 can indicate the name and telephone number of the caller (e.g, the caller system 106).
  • the numeric-keypad icon 130 is a selectable icon that, when selected, causes a numeric keypad to be displayed on the user interface 126.
  • the speakerphone icon 132 is a selectable icon that, when selected, causes the computing device 102 to use a speakerphone functionality for the voice call or video call.
  • the selectable controls 134 are selectable by a user of the computing device 102 to perform a particular operation or function.
  • the selectable controls 134 are selectable by the user to read, by the computing device 102, an account number to the caller system 106.
  • the selectable controls 134 can include buttons, toggles, selectable text, sliders, checkboxes, or icons.
  • the end-call icon 136 allows a user of the computing device 102 to terminate a voice call or a video call.
  • the message element 138 provides textual information to the user.
  • the message element 138 can provide user data (e.g, an account number) responsive to user information requests.
  • the operating system 122 can correlate detected inputs at the input component 120 to elements of the user interface 126.
  • the operating system 122 or the communication application 124 can receive information from the user interface component 114 about the detected input.
  • the operating system 122 or the communication application 124 may perform a function or operation in response to the detected input. For example, the operating system 122 may determine that the input corresponds to the user selecting one of the selectable controls 134 and, in response, the operating system 122 can cause the computing device 102 to provide the user data to the third party without requiring the user to provide it audibly.
  • the operating system 122 or the communication application 124 can automatically generate the selectable controls 134 corresponding to the user data.
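A selectable control generated from identified user data can be modeled as a small descriptor: a label, a sharing method, and an action to run when the user taps it. The data classes below are an illustrative sketch under that assumption, not the implementation the document describes.

```kotlin
// Illustrative model of a selectable control generated from identified user
// data: a label drawn from the text description plus an action to run when
// the user taps it (read aloud, send as DTMF, or just reveal on screen).
enum class ShareMethod { READ_ALOUD, DTMF, SHOW_ON_SCREEN }

data class SelectableControl(
    val label: String,                 // e.g. "Share account number"
    val method: ShareMethod,
    val onSelected: () -> Unit,
)

fun controlsFor(label: String, value: String, share: (ShareMethod, String) -> Unit) =
    ShareMethod.values().map { method ->
        SelectableControl("$label (${method.name.lowercase()})", method) { share(method, value) }
    }
```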
  • the computing device 102 can obtain audio data from an audio mixer or sound engine of the operating system 122.
  • the audio data generally includes the audible parts of the voice call or the video call, including the IVR options provided by the IVR system 110.
  • FIG. 2 illustrates an example device diagram 200 of a computing device 202 that can identify and provide requested user information during voice calls and video calls.
  • the computing device 202 is an example of the computing device 102, with some additional detail.
  • the computing device 202 may be a smartphone 202-1, a tablet device 202-2, a laptop computer 202-3, a desktop computer 202-4, a computerized watch 202-5 or other wearable device, a voice-assistant system 202-6, a smart display system, or a computing system installed in a vehicle.
  • the computing device 202 includes one or more processors 204 and computer-readable storage media (CRM) 206.
  • the processors 204 may include any combination of one or more controllers, microcontrollers, processors, microprocessors, hardware processors, hardware processing units, digital-signal-processors, graphics processors, graphics processing units, and the like.
  • the processor 204 can be an integrated processor and memory subsystem, including, as non-limiting examples, an SoC, a CPU, a graphics processing unit, or a tensor processing unit.
  • An SoC generally integrates many of the components of the computing device 202 into a single device, including a central processing unit, a memory, and input and output ports.
  • a CPU generally executes commands and processes needed for the computing device 202.
  • a graphics processing unit performs operations to display graphics of the computing device 202 and can perform other specific computational tasks.
  • the tensor processing unit generally performs symbolic match operations in neural-network machine-learning applications.
  • the processors 204 can include a single core or multiple cores.
  • the CRM 206 can provide the computing device 202 with persistent and non-persistent storage of executable instructions (e.g., firmware, recovery firmware, software, applications, modules, programs, functions) and data (e.g., user data, operational data) to support the execution of the executable instructions.
  • the CRM 206 includes instructions that, when executed by the processors 204, execute the operating system 122 and the communication application 124.
  • Examples of the CRM 206 include volatile memory and nonvolatile memory, fixed and removable media devices, and any suitable memory device or electronic data storage that maintains executable instructions and supporting data.
  • the CRM 206 can include various implementations of random-access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), non-volatile RAM (NVRAM), read-only memory (ROM), flash memory, and other storage memory types in various memory device configurations.
  • the CRM 206 excludes propagating signals.
  • the CRM 206 can be a solid-state drive (SSD) or a hard disk drive (HDD).
  • the operating system 122 can also include or control an audio mixer 208 and caption module 210.
  • the audio mixer 208 and the caption module 210 can be specialized hardware components, software components, or a combination thereof. In other examples, the audio mixer 208 and the caption module 210 are separate from the operating system 122 (e.g., as a system plug-in or additional add-on service locally installed on the computing device 202).
  • the audio mixer 208 can obtain and consolidate audio data generated by applications, including the communication application 124, executing on the computing device 202.
  • the audio mixer 208 obtains audio streams from applications, such as the communication application 124, and generates audio output signals that reproduce the sounds encoded in the audio streams when combined and output from the audio component 116.
  • the audio mixer 208 may adjust the audio signals in other ways, for example, controlling focus, intent, and volume.
  • the audio mixer provides an interface between the application source that generates the content and the audio component 116 that creates sounds from the content.
  • the audio mixer 208 can manage raw audio data, analyze it, and direct audio signals to be output by the audio component 116 or sent, via the communication units 112, to another computing device (e.g., the caller system 106).
  • the caption module 210 is configured to analyze audio data, in raw form, as received (e.g., as a byte stream) by the audio mixer 208. For example, the caption module 210 can perform speech recognition on the audio data to determine whether the audio data includes selectable options of an IVR system, a request for user information, or communicated information related to a call context. Rather than process each audio signal, the caption module 210 can identify individual, pre-mixed audio data streams suitable for captioning. For example, the caption module 210 can automatically caption spoken audio data but not notification or sonification audio data (e.g., system beeps, rings). The caption module 210 may apply a filter to the byte streams received by the audio mixer 208 to identify the audio data suitable for captioning. The caption module 210 can use a machine-learned model to determine audio data descriptions from audible parts of a voice call or a video call.
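The filtering behavior described above (caption spoken call audio, skip notification and sonification streams, then hand captionable streams to the model) can be sketched as a predicate over per-stream metadata. The stream-usage tags and the CaptionModel interface below are assumptions made for illustration.

```kotlin
// Sketch of the captioning filter: only pre-mixed streams that carry speech
// from the voice or video call are handed to the speech-recognition model.
enum class StreamUsage { VOICE_CALL, MEDIA, NOTIFICATION, SONIFICATION }

data class AudioStream(val usage: StreamUsage, val bytes: ByteArray)

interface CaptionModel { fun describe(bytes: ByteArray): String }   // machine-learned model 302

fun captionableStreams(streams: List<AudioStream>): List<AudioStream> =
    streams.filter { it.usage == StreamUsage.VOICE_CALL }            // skip beeps, rings, music

fun caption(streams: List<AudioStream>, model: CaptionModel): List<String> =
    captionableStreams(streams).map { model.describe(it.bytes) }
```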
  • the operating system 122 can use metadata to focus the captioning on specific portions of the audio data.
  • the caption module 210 can focus on audio data related to providing selectable controls for IVR systems, user information in response to a request, or communicated information related to a call context.
  • the operating system 122 can identify “captionable” audio data based on metadata and refrain from captioning all audio data.
  • Some metadata examples include a context indicator specifying the nature of a voice call or a video call.
  • the audio mixer may use the context indicator to control routing, focus, and captioning decisions regarding the audio data.
  • Some computing devices can transcribe a voice call or a video call.
  • the transcription generally provides a direct transcription of the audible parts of the call and cannot determine whether the conversation includes selectable options of an IVR system, a request for user information, or communicated information related to the call context.
  • the user still must read the transcript to determine the desired menu option, the requested user information, or the communicated information.
  • even when the computing device provides a transcription, the user may still find it challenging to navigate the IVR system and select the desired option, locate requested user information, or find and save communicated information.
  • the described systems and techniques help users navigate IVR systems, provide user information in response to a request, or manage communicated information from voice calls and video calls by displaying selectable controls and message elements with the relevant information.
  • the computing device 202 also includes one or more sensors 214.
  • the sensors 214 obtain contextual information indicative of a physical operating environment of the computing device 202 or characteristics of the computing device 202 while functioning in a physical operating environment.
  • the caption module 210 can use this contextual information as metadata to focus the audio data processing.
  • Examples of the sensors 214 include movement sensors, temperature sensors, position sensors, proximity sensors, ambient light sensors, moisture sensors, pressure sensors, and the like.
  • the operating system 122 or the caption module 210 determines whether the audio data is for captioning. For example, the caption module 210 can determine whether the audio data includes selectable options of an IVR system, a request for user information, or communicated information related to the call context. Responsive to determining that the audio data is for captioning, the operating system 122 determines the audio data description. For example, the operating system 122 may execute a machine-learned model (e.g, an end-to-end Recurrent-Neural-Network-Transducer Automatic Speech-Recognition Model) trained to generate descriptions of audible parts of voice calls or video calls.
  • the machine-learned model can be any type of model suitable for learning descriptions of sounds, including transcriptions for spoken audio.
  • the machine-learned model used by the operating system 122 can be smaller and less complex than other machine-learned models because it only needs to be trained to identify audible parts of voice calls and video calls.
  • the machine-learned model can avoid processing all audio data sent to the audio mixer 208. In this way, the described systems and techniques can avoid using remote processing resources (e.g, a machine-learned model at a remote computing device) to avoid unnecessary privacy risks and potential processing latencies.
  • the machine-learned model can generate descriptions that more-accurately represent the audible parts of voice calls and video calls.
  • the operating system 122 can avoid wasting resources overanalyzing all audio data output by the communication application 124. This captioning determination enables the computing device 202 to execute a more-efficient, smaller, and less-complex machine-learned model. In this way, the machine-learned model can perform automatic speech-recognition and automatic sound classification techniques locally to maintain privacy.
  • the operating system 122 receives the machine-learned model description and displays it using the display component 118.
  • the display component 118 can also display other visual elements (e.g. , selectable controls that allow the user to perform an action on the computing device 202) related to the descriptions.
  • the operating system 122 can present the visual elements (e.g, the selectable controls 134, the message element 138) as part of the user interface 126.
  • a description can include transcriptions or a summary of the audible parts (e.g, the phone conversation) of voice calls and video calls.
  • the descriptions can also identify a context for the audible parts of the audio data. The details and operation of the machine-learned model are described in greater detail with respect to FIG. 3.
  • FIG. 3 illustrates an example diagram 300 of a machine-learned model 302 of the computing device 202 that can provide text descriptions for identifying and providing requested user information.
  • the computing device 202 can be the computing device 102 of FIG. 1 or a similar computing device.
  • the machine-learned model 302 can be part of the caption module 210.
  • the machine-learned model 302 can convert audio data 304 into the text descriptions 306 (e.g, text descriptions of selectable options provided by the IVR system 110) of the audible parts of a voice call or a video call without converting the audio data 304 into sound.
  • the audio data 304 can include different types, forms, or variations of data from the communication application 124.
  • the audio data 304 can include raw, pre-mixed audio byte stream data or processed byte stream data.
  • the machine-learned model 302 can include multiple machine-learned models combined into a single model that provides the text descriptions 306 in response to the audio data 304.
  • Applications can use the machine-learned model 302 to process the audio data 304 into the text descriptions 306.
  • the communication application 124 can communicate through the operating system 122 or the caption module 210 with the machine-learned model 302 using an application programming interface (API) (e.g, a public API across all applications).
  • the machine-learned model 302 can process the audio data 304 within a secure section or enclave of the operating system 122 or the CRM 206 to ensure user privacy and security.
  • the machine-learned model 302 can make inferences.
  • the machine-learned model 302 can be trained to receive the audio data 304 as an input and provide, as output data, the text descriptions 306 of the audible parts of a call.
  • the caption module 210 can process the audio data 304 locally.
  • the machine-learned model 302 can also perform classification, regression, clustering, anomaly detection, recommendation generation, and other tasks.
  • Engineers can train the machine-learned model 302 using supervised learning techniques. For example, engineers can train the machine-learned model 302 using training data 308 (e.g, truth data) that includes examples of descriptions inferred from examples of audio data 304 from a series of voice calls and video calls. The inferences can be manually applied by engineers or other experts, generated through crowd-sourcing, or provided by other techniques (e.g, complex speech-recognition and content-recognition algorithms).
  • the training data 308 can include audio data from voice calls and video calls similar to the audio data 304. As an example, consider that the audio data 304 includes a voice call with an IVR system used by a medical office.
  • the training data 308 for the machine-learned model 302 can include many audio data files from a broad range of voice calls and video calls with IVR systems.
  • the audio data 304 includes a voice call with a customer representative of a business.
  • the training data 308 can include many audio data files from a broad range of similar voice calls and video calls. Engineers can also use unsupervised learning techniques to train the machine-learned model 302.
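Supervised training as described above amounts to pairing example call audio with reference descriptions and fitting the model to reproduce them. The sketch below only shows how such training pairs might be represented and iterated over; the train call is a placeholder, not a real training API.

```kotlin
// Hypothetical representation of the training data 308: audio from example
// voice and video calls paired with human-provided reference descriptions.
data class TrainingExample(val audio: ByteArray, val referenceDescription: String)

interface TrainableCaptionModel {
    fun train(batch: List<TrainingExample>)       // placeholder for one optimization step
}

fun trainOffline(model: TrainableCaptionModel, data: List<TrainingExample>, batchSize: Int = 32) {
    // Offline ("batch") learning: iterate over the entire static training set.
    data.chunked(batchSize).forEach { model.train(it) }
}
```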
  • the machine-learned model 302 can be trained at a training computing system and then provided for storage and implementation at one or more computing devices 202.
  • the training computing system can include a model trainer.
  • the training computing system can be included in or separate from the computing device 202 that implements the machine-learned model 302.
  • Engineers can also train the machine-learned model 302 online or offline.
  • In offline training (e.g., batch learning), engineers train the machine-learned model 302 on the entirety of a static set of the training data 308.
  • In online learning, engineers continuously train the machine-learned model 302 as new training data 308 becomes available (e.g., while the machine-learned model 302 is used on the computing device 202 to perform inference).
  • engineers can initially train the machine-learned model 302 to replicate descriptions applied to audible parts of voice calls and video calls (e.g, captioned IVR systems, captioned telephone conversations).
  • the computing device 202 can feed the text descriptions 306 (and the corresponding portions of the audio data 304) back to the machine-learned model 302 as new training data 308. In this way, the machine-learned model 302 can continuously improve the accuracy of the text descriptions 306.
  • a user of the computing device 202 can provide input to the machine-learned model 302 to flag a particular description as having errors. The computing device 202 can use this flag to train the machine-learned model 302 and improve future predictions.
  • Engineers or trainers can perform centralized training of multiple machine-learned models 302 (e.g, based on a centrally stored dataset).
  • the trainer or engineer can use decentralized training techniques, including distributed training or federated learning, to train, update, or personalize the machine-learned model 302.
  • the engineer may only use user information to personalize the machine-learned model 302 after receiving explicit permission from a user. For example, in situations in which the computing device 202 may collect user information, individual users may be provided with an opportunity to provide input to control whether programs or features of the machine-learned model 302 can collect and make use of the user information. The individual users may further be provided with an opportunity to control what the programs or features can or cannot do with the user information.
  • the machine-learned model 302 can be or include one or more artificial neural networks.
  • the machine-learned model 302 can include a group of connected or non-fully connected nodes (e.g., neurons). Engineers can also organize the machine-learned model 302 into one or more layers (e.g., a deep network).
  • the machine-learned model 302 can include an input layer, an output layer, and one or more hidden layers positioned between the input layer and the output layer.
  • the machine-learned model 302 can also include one or more recurrent neural networks.
  • the machine-learned model 302 can be an end-to-end Recurrent-Neural- Network-Transducer Automatic Speech-Recognition Model.
  • Example recurrent neural networks include long short-term memory (LSTM) recurrent neural networks, gated recurrent units, bidirectional recurrent neural networks, continuous-time recurrent neural networks, neural history compressors, echo state networks, Elman networks, Jordan networks, recursive neural networks, Hopfield networks, fully recurrent networks, and sequence-to-sequence configurations.
  • At least some of the nodes of a recurrent neural network can form a cycle.
  • the machine-learned model 302 can be especially useful for processing sequential input data (e.g, the audio data 304).
  • a recurrent neural network can pass or retain information from a previous portion of the audio data 304 to a subsequent portion of the audio data 304 using recurrent or directed cyclical node connections.
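How a recurrent network carries information from earlier audio frames forward can be summarized by a standard recurrence (shown for illustration; the document does not commit to a particular formulation), where $x_t$ is the audio feature at time $t$, $h_t$ is the hidden state passed from one step to the next, and $y_t$ is the per-step output.

```latex
h_t = \sigma\left(W_x x_t + W_h h_{t-1} + b\right), \qquad
y_t = \operatorname{softmax}\left(W_y h_t + c\right)
```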
  • the audio data 304 can also include time-series data (e.g. , sound data versus time).
  • the machine-learned model 302 can analyze the audio data 304 over time to detect or predict spoken sounds and relevant non-spoken sounds to generate the text descriptions 306 of at least portions of the audio data 304.
  • the sequential sounds from the audio data 304 can indicate spoken words in a sentence (e.g. , natural language processing, speech detection, or processing).
  • the machine-learned model 302 can also include one or more convolutional neural networks.
  • a convolutional neural network can include multiple convolutional layers that perform convolutions over input data using learned filters or kernels. Engineers generally use convolutional neural networks to diagnose vision problems in still images or videos. Engineers can also apply convolutional neural networks to natural language processing of the audio data 304 to generate the text descriptions 306.
  • FIG. 4 illustrates a flow chart of example operations 400 of a computing device that can identify and provide selectable controls and user data related to voice calls and video calls.
  • the operations 400 are described below in the context of the computing device 202 of FIG. 2.
  • the computing device 202 can be the computing device 102 of FIG. 1 or a similar computing device.
  • the operations 400 may be performed in a different order than that illustrated in FIG. 4 or with additional or fewer operations.
  • the computing device optionally obtains content that includes user information of a computing device user.
  • the computing device can use the user information to help the user retrieve requested information or save communicated information related to voice calls and video calls.
  • the computing device 202 may obtain consent from the user to use the user information for voice calls and video calls. For example, the computing device 202 may only use user information after receiving explicit consent.
  • the computing device 202 can obtain the user information from user entry into an application on the computing device 202 (e.g., inputting contact information into a user profile, inputting an account number via a third-party application) or learning it from information received in an application (e.g., an account number included in an emailed statement, saved calendar entries).
  • the computing device displays a graphical user interface of a communication application.
  • the computing device 202 may direct the display component 118 to present the user interface 126 for the communication application 124 in response to the user making or receiving a voice call or a video call.
  • the computing device obtains audio data output from the communication application executing on the computing device.
  • the audio data includes audible parts of a voice call or a video call.
  • the communication application 124 allows a user of the computing device 202 to make and receive voice calls and video calls.
  • the audio mixer 208 obtains the audio data 304 output from the communication application 124 during the voice calls and video calls.
  • the audio data 304 includes audible parts of a voice call or a video call between a user of the computing device 202 and a third party.
  • the caption module 210 can extract the audio data 304 from the audio mixer 208.
  • the computing device determines whether the audio data includes relevant information using the audible parts of the voice call or video call.
  • the relevant information can be two or more selectable options of an IVR system (e.g., phone tree options), a request for user information (e.g., a request for a credit card number, address, account number), or communicated information (e.g., appointment details, contact information, account information).
  • the caption module 210 using the machine-learned model 302, can determine whether the audio data 304 includes relevant information.
  • the relevant information can include two or more selectable options of an IVR system, a request for user information, or communicated information. The user or the third party audibly provides the relevant information during the voice call or video call.
  • the caption module 210 or the machine-learned model 302 may filter out audio data 304 that does not require processing, including notification sounds and background noise. Examples of the machine-learned model 302 determining whether the audio data 304 includes two or more selectable options are illustrated in FIGs. 6A and 8A. Examples of the machine-learned model 302 determining whether the audio data 304 includes a request for user information are illustrated in FIGs. 6B, 6C, 7A, and 8B. Examples of the machine-learned model 302 determining whether the audio data 304 includes communicated information are illustrated in FIGs. 6D, 7B, 7C, and 8C.
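The three categories of relevant information distinguished above (selectable IVR options, a request for user information, and communicated information) map naturally onto a small sum type that downstream UI code can dispatch on. The sketch below is illustrative; the handler functions are placeholders.

```kotlin
// Sketch of the caption module's classification of relevant information.
sealed class RelevantInfo {
    data class IvrOptions(val options: List<String>) : RelevantInfo()   // phone-tree choices
    data class InfoRequest(val topic: String) : RelevantInfo()          // e.g. "account number"
    data class CommunicatedInfo(val summary: String) : RelevantInfo()   // e.g. appointment details
    object None : RelevantInfo()                                        // background noise, beeps
}

fun handle(info: RelevantInfo) = when (info) {
    is RelevantInfo.IvrOptions -> showSelectableControls(info.options)
    is RelevantInfo.InfoRequest -> lookUpAndOfferUserData(info.topic)
    is RelevantInfo.CommunicatedInfo -> offerToSave(info.summary)
    RelevantInfo.None -> Unit
}

// Placeholders for the UI reactions described in the text.
fun showSelectableControls(options: List<String>) {}
fun lookUpAndOfferUserData(topic: String) {}
fun offerToSave(summary: String) {}
```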
  • the computing device displays the user interface for the communication application. For example, in response to determining that the audio data 304 does not include relevant information, the computing device 202 displays the user interface 126 of the communication application 124.
  • the computing device determines a text description of the relevant information.
  • the text description transcribes the relevant information.
  • the caption module 210 can use the machine-learned model 302 to perform speech recognition on the audio data 304 and determine a text description 306 of the relevant information.
  • the text description 306 provides a transcription of at least a portion of the two or more selectable options, the request for user information, or the communicated information. Examples of the machine-learned model 302 determining the text description 306 of the two or more selectable options are illustrated in FIGs. 6A and 8A. Examples of the machine-learned model 302 determining the text description 306 of the request for user information are illustrated in FIGs. 6B, 6C, 7A, and 8B. Examples of the machine-learned model 302 determining the text description of the communicated information are illustrated in FIGs. 6D, 7B, 7C, and 8C.
  • the caption module 210 can improve the accuracy of the text description 306 in various ways, including by biasing the machine-learned model 302 based on contexts of the computing device 202.
  • the caption module 210 may bias the machine-learned model 302 based on the identity of the third party to the voice call or video call.
  • the user of the computing device 202 makes a voice call to a medical office.
  • the caption module 210 can bias the machine-learned model 302 using common words from a medical office conversation. In this way, the computing device 202 can improve the text descriptions 306 for this voice call.
  • the caption module 210 can use other contextual information types, including location information derived from a sensor 214 and information from other applications, to bias the machine-learned model 302.
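Context biasing as described above can be as simple as choosing a phrase list keyed on who is being called and handing it to the recognizer as hints. The callee categories, phrases, and the setBiasingPhrases hook below are assumptions for illustration.

```kotlin
// Illustrative context biasing: pick domain phrases based on who is being
// called, then pass them to the recognizer as biasing hints.
fun biasingPhrasesFor(calleeCategory: String): List<String> = when (calleeCategory) {
    "medical" -> listOf("member id", "group number", "date of birth", "insurance provider")
    "utility" -> listOf("account number", "billing address", "meter number")
    "airline" -> listOf("confirmation code", "frequent flyer number")
    else -> emptyList()
}

interface BiasableRecognizer { fun setBiasingPhrases(phrases: List<String>) }  // hypothetical hook

fun configureRecognizer(recognizer: BiasableRecognizer, calleeCategory: String) {
    recognizer.setBiasingPhrases(biasingPhrasesFor(calleeCategory))
}
```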
  • the computing device 202 can translate the text description 306 into another language before displaying it.
  • the caption module 210 may determine from the operating system 122 a preferred language of the user and translate the text description 306 into the preferred language. In this way, a Japanese user can view the text description 306 in Japanese, even if the audio data 304 is in a different language (e.g, Chinese or English).
  • the computing device optionally identifies user data responsive to the request for user information.
  • the computing device does not perform this operation if the audio data does not include a request for user information.
  • the computing device 202 can identify user data responsive to user information requests.
  • the computing device 202 can retrieve the user data from the CRM 206, the communication application 124, another application on the computing device 202, or remote computing devices associated with the user or the computing device 202.
  • a receptionist for the medical office can request the user provide her insurance information.
  • the computing device 202 can retrieve the medical insurance provider and user account number from an email previously received by the user and stored on the computing device 202. Examples of the computing device 202 identifying user data responsive to the request for user information are illustrated in FIGs. 6B, 6C, 7A, and 8B.
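Identifying responsive user data can be sketched as a keyed lookup over whichever sources the user has permitted, such as a stored profile, an email, or another application. The source interface and topic keys below are hypothetical.

```kotlin
// Hypothetical lookup of user data responsive to a recognized request topic.
// Each source is consulted in order; only user-approved sources are listed.
interface UserDataSource { fun find(topic: String): String? }

class ProfileSource(private val fields: Map<String, String>) : UserDataSource {
    override fun find(topic: String) = fields[topic]
}

fun identifyUserData(topic: String, sources: List<UserDataSource>): String? =
    sources.asSequence().mapNotNull { it.find(topic) }.firstOrNull()

// Usage sketch:
// identifyUserData("insurance member id", listOf(ProfileSource(profileFields), emailSource))
```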
  • the computing device may only use the information responsive to the request for user information after the computing device receives explicit permission from a user of the computing device. For example, in situations discussed above in which the computing device may collect user data, individual users may be provided with an opportunity to provide input to control whether programs or features of the computing device can collect and make use of the user data. The individual users may further be provided with an opportunity to control what the programs or features can or cannot do with the user data.
  • the computing device displays the user data or selectable controls.
  • the selectable controls are selectable by the user and include the text description.
  • the audio data included a request for user information.
  • the computing device can display the identified user data.
  • the audio data included two or more selectable options of an IVR system.
  • the user can use the selectable controls to indicate to the third party a selected option from the two or more selectable options.
  • the audio data included communicated information.
  • the user can use the selectable controls to save the communicated information in the computing device, the communication application, or another application.
  • the computing device 202 can cause the display component 118 to display the user data or the selectable controls 134.
  • the display component 118 can provide the user data as a text notification on the user interface 126.
  • the display component 118 can display the medical insurance provider and user account information as a text box on the user interface 126 during the voice call.
  • the display component 118 can also provide the selectable controls 134.
  • the display component 118 can provide the text description 306 or the requested information as part of a button on the user interface 126 for the communication application 124. Examples of the display component 118 displaying the selectable control 134 are illustrated in FIGs. 6A and 8A. Examples of the display component 118 displaying user data are illustrated in FIGs. 6B, 6C, 7A, and 8B. Examples of the display component 118 displaying selectable control 134 and user data in response to communicated information are illustrated in FIGs. 6D, 7B, 7C, and 8C.
  • the display component 118 can display selectable controls 134.
  • the selectable controls 134 provide a respective text description 318 of two or more selectable options provided by the IVR system 110.
  • the user can use the selectable controls 134 to indicate to the medical office a selected option from the two or more selectable options.
  • the display component 118 can display the selectable control 134.
  • the selectable control 134 includes the text description of the appointment.
  • the user can use the selectable control 134 to save the appointment details to a calendar application.
  • the computing device displays the user interface for the communication application.
  • the display component 118 can display the user interface 126 associated with the communication application 124.
  • the user interface 126 can include the user data and selectable controls 134.
  • FIG. 5 illustrates example operations 500 to identify and provide requested user information during voice calls and video calls.
  • the operations 500 are described in the context of the computing device 202 of FIG. 2.
  • the operations 500 may be performed in a different order or with additional or fewer operations.
  • a computing device obtains audio data output from a communication application executing on the computing device.
  • the audio data includes audible parts of a voice call or a video call between a user of the computing device and a third party.
  • the audio mixer 208 of the computing device 202 can obtain audio data 304 output from the communication application 124 executing on the computing device 202.
  • the caption module 210 can receive the audio data 304 from the audio mixer 208.
  • the audio data 304 includes audible parts of a voice call or a video call between a user of the computing device 202 and a third party (e.g., a person, a computerized IVR system).
  • the computing device determines, using the audible parts, whether the audio data includes a request for user information.
  • the third party audibly provides the request for user information during the voice call or the video call.
  • the machine-learned model 302 of the caption module 210 can determine, using the audible parts of the audio data 304, whether the audio data 304 includes a request for user information (e.g., an account number, payment information, home address). The third party audibly provides the user information request during the voice call or the video call.
  • the computing device optionally determines a text description of the request for user information.
  • the text description provides a transcription of at least a portion of the user information request.
  • the machine-learned model 302 determines a text description 306 of the request for user information.
  • the text description 306 provides a transcription of at least a portion of the request for user information.
  • the text description 306 includes a word-for-word transcription.
  • the text description 306 provides a paraphrasing of the user information request.
  • the computing device identifies, using the audible parts, user data responsive to the request for user information. For example, the computing device 202 identifies, using the audible parts of the audio data 304, user data (e.g., a credit card number, expiration date, and card verification value (CVV)) responsive to a request for user information (e.g., payment information).
  • the computing device displays the user data on a display of the computing device or provides the user data to the third party during the voice call or the video call.
  • the display component 118 displays the user data on the display of the computing device 202.
  • the computing device 202 can also provide the user data to the third party during the voice call or video call.
  • the display includes the user interface 126.
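As a rough illustration of how operations 500 fit together, the following Kotlin sketch wires the steps into a single handler. The function names and the trivial stub bodies are assumptions made for illustration; the patent describes the detection step as a machine-learned model (e.g., an RNN-T automatic speech-recognition model) rather than the keyword check used here.

```kotlin
// Sketch of the operations-500 flow: obtain audio, detect a request for user
// information, identify responsive user data, then display or provide it.
data class InfoRequest(val textDescription: String)

fun transcribe(audio: ByteArray): String =
    "Do you have medical insurance?" // stub standing in for on-device speech recognition

fun detectInformationRequest(transcript: String): InfoRequest? =
    if ("insurance" in transcript.lowercase()) InfoRequest("Medical insurance") else null

fun identifyResponsiveUserData(request: InfoRequest): String? =
    "Apex Medical Insurance Co., policy 123456789-0" // stub standing in for the lookup step

fun displayOrProvide(request: InfoRequest, userData: String) {
    println("${request.textDescription}: $userData") // message element / selectable control
}

fun handleCallAudio(audio: ByteArray) {
    val transcript = transcribe(audio)
    val request = detectInformationRequest(transcript) ?: return
    identifyResponsiveUserData(request)?.let { displayOrProvide(request, it) }
}

fun main() = handleCallAudio(ByteArray(0))
```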
  • This section illustrates example implementations of the described systems and techniques that can assist users with voice calls and video calls, which may operate separately or together in whole or in part.
  • This section describes various example implementations, each outlined in relation to a specific drawing for ease of reading.
  • FIGs. 6A-6D illustrate example user interfaces of a computing device to assist users with voice calls and video calls.
  • FIGs. 6A-6D are described in succession and in the context of the computing device 202 of FIG. 2.
  • the computing device 202 may provide different user interfaces with fewer or additional features than those illustrated in FIGs. 6A-6D.
  • the computing device 202 causes the display component 118 to display the user interface 126.
  • the user interface 126 is associated with the communication application 124.
  • the user interface 126 includes the caller box 128, the numeric-keypad icon 130, the speakerphone icon 132, the selectable controls 134, and the end-call icon 136.
  • the user has called a new medical provider Doctor Office.
  • the user has placed a voice call using the communication application 124.
  • the user can place a video call using the communication application 124 or another application on the computing device 202.
  • the caller box 128 indicates the business name (e.g., Doctor Office) and telephone number (e.g., (111) 555-1234) of the third party.
  • the Doctor Office uses the IVR system 110 to provide a menu of selectable options audibly.
  • the IVR system 110 can direct callers to appropriate personnel and staff at the Doctor Office.
  • the IVR system 110 provides the following dialogue upon answering the voice call: “Thank you for calling Doctor Office. Please listen to the following options and choose the option that best matches the purpose of your call today. For prescription refills, please press 1. To schedule an appointment, please press 2. For billing, please press 3. To speak to a nurse, please press 4.”
  • the caption module 210 obtains the audio data 304 output from the communication application 124. As described above, the audio mixer 208 can send the audio data 304 to the caption module 210. The caption module 210 then determines that the audio data 304 includes multiple selectable options. In response to this determination, the caption module 210 determines a text description 306 of the selectable options. For example, the machine-learned model 302 can transcribe at least a portion of the selectable options. The transcription can be a word-for-word transcription or paraphrasing of each of the selectable options. The caption module 210 then causes the display component 118 to display the selectable controls 134 on the user interface 126.
  • the selectable controls 134 include a selectable control associated with each of the selectable options provided by the IVR system 110: a first selectable control 134-1, a second selectable control 134-2, a third selectable control 134-3, and a fourth selectable control 134-4.
  • the selectable controls 134 include the text description 306 associated with a respective selectable option.
  • the first selectable control 134-1 includes the text “1 - Prescription refills.”
  • the number “1” indicates that the first selectable control 134-1 is associated with the first selectable option provided by the IVR system 110.
  • the second selectable control 134-2 provides the text “2 - Schedule an appointment.”
  • the third selectable control 134-3 displays the text “3 - Billing.”
  • the fourth selectable control 134-4 includes the text “4 - Speak with a nurse.”
  • the selectable controls 134 can omit the numbers associated with each selectable option.
  • the selectable controls 134 can be presented in various forms on the user interface 126.
  • the selectable controls 134 can be buttons, toggles, selectable text, sliders, checkboxes, or icons.
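To picture how such controls could be constructed, the Kotlin sketch below pulls “please press N” options out of an IVR prompt transcript into (digit, label) pairs. The regular expression is purely illustrative; the patent attributes this determination to the caption module 210 and its machine-learned model, not to pattern matching.

```kotlin
// Sketch: extract "please press N" style options from an IVR prompt transcript
// into (digit, label) pairs that could back selectable controls such as 134-1..134-4.
data class IvrOption(val digit: Char, val label: String)

fun parseIvrOptions(transcript: String): List<IvrOption> {
    val pattern = Regex("""(?i)(?:for|to)\s+([^.,]+?),?\s+please\s+press\s+(\d)""")
    return pattern.findAll(transcript).map { match ->
        val (label, digit) = match.destructured
        IvrOption(digit.first(), label.trim().replaceFirstChar { it.uppercase() })
    }.toList()
}

fun main() {
    val prompt = "For prescription refills, please press 1. To schedule an appointment, please press 2."
    parseIvrOptions(prompt).forEach { println("${it.digit} - ${it.label}") }
    // 1 - Prescription refills
    // 2 - Schedule an appointment
}
```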
  • the user can select a selectable control 134 to cause the computing device 202 to indicate to the IVR system 110 the selected option of the multiple selectable options.
  • the user can select the numeric-keypad icon 130 to display a numeric keypad and select a number associated with the desired selectable option. For example, the user can select the number “2” in the numeric keypad to schedule an appointment.
  • the computing device 202 can send a DTMF tone to the IVR system 110.
  • the IVR system 110 may allow the user to provide the selected option by audibly saying the number “2.”
  • the described systems and techniques allow the user to select the selectable control 134 associated with the desired option. In this example, the user selects the second selectable control 134-2 to schedule a new appointment.
  • the input component 120 causes the computing device 202 to send a DTMF tone associated with the number “2” or audible communication of the number “2” to the IVR system 110.
  • the described systems and techniques help the user navigate the selectable IVR menu options and select the desired option.
  • the computing device 202 can provide a series of selectable controls 134 in response to different levels of IVR menus.
  • the computing device 202 can update the selectable controls 134 to correspond to the current selectable options.
  • the computing device 202 can provide an option to display a previous menu of selectable options from earlier in the voice call or video call.
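For background on what “sending a DTMF tone” for a selected option involves, the Kotlin sketch below synthesizes the standard dual-tone signal for a keypad digit (for example, “2” combines 697 Hz and 1336 Hz). How the samples are injected into the call's uplink audio, or whether the telephony stack's own DTMF facility is used instead, is platform specific and outside this sketch.

```kotlin
import kotlin.math.PI
import kotlin.math.sin

// Sketch: synthesize the dual-tone multi-frequency (DTMF) signal for a keypad digit.
// Each digit combines one low-group and one high-group frequency per the DTMF standard.
fun dtmfSamples(digit: Char, sampleRate: Int = 8000, durationMs: Int = 150): ShortArray {
    val low = mapOf('1' to 697.0, '2' to 697.0, '3' to 697.0,
                    '4' to 770.0, '5' to 770.0, '6' to 770.0,
                    '7' to 852.0, '8' to 852.0, '9' to 852.0, '0' to 941.0)
    val high = mapOf('1' to 1209.0, '4' to 1209.0, '7' to 1209.0,
                     '2' to 1336.0, '5' to 1336.0, '8' to 1336.0, '0' to 1336.0,
                     '3' to 1477.0, '6' to 1477.0, '9' to 1477.0)
    val f1 = low.getValue(digit)
    val f2 = high.getValue(digit)
    val sampleCount = sampleRate * durationMs / 1000
    return ShortArray(sampleCount) { i ->
        val t = i.toDouble() / sampleRate
        ((sin(2 * PI * f1 * t) + sin(2 * PI * f2 * t)) * 0.25 * Short.MAX_VALUE).toInt().toShort()
    }
}

fun main() {
    println("Generated ${dtmfSamples('2').size} samples for digit 2") // 1200 samples at 8 kHz
}
```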
  • FIG. 6B is an example of the user interface 126 in response to a request for user information.
  • the IVR system 110 directs the user to a receptionist at the Doctor Office. Because the user is a new patient, the receptionist may ask a series of questions to set up an account or profile associated with the user. For example, the receptionist may request medical insurance information for the user.
  • the audio data 304 may include the following question: “Do you have medical insurance?”
  • the machine-learned model 302 can determine, using audible parts of the voice call with the Doctor Office, whether the audio data 304 includes a request for user information. In this example, the machine-learned model 302 can use the words “medical insurance,” along with other parts of the conversation and the context that the third party is a medical office, to determine that the audio data 304 includes a request for user information.
  • the machine-learned model 302 can determine the text description 306 of the request for user information in response.
  • the machine-learned model 302 or the caption module 210 determines the text description 306 includes “medical insurance.”
  • the caption module 210 or the computing device 202 can then identify user data responsive to the request for medical insurance information in the CRM 206 and cause the display component 118 to display it on the user interface 126.
  • the user data can include the insurance provider, the policy number, or the account identifier.
  • the computing device 202 can also retrieve the medical insurance information from an email in an email application or profile information stored in a contacts application. In some implementations, the computing device 202 can store and retrieve sensitive user data from a secure enclave of the CRM 206 or other memory in the computing device 202.
  • the display component 118 can display the user data (e.g., insurance provider and policy number) in a message element 600 on the user interface 126.
  • the message element 600 can be an icon, notification, message box, or similar user interface element to display textual information.
  • the message element 600 can also include the text description 306 of the request for user information to provide context.
  • the message element 600 provides the following text: “Your insurance provider: Apex Medical Insurance Co.” and “Your policy number: 123456789-0.”
  • the message element 600 provides both sets of user data in a single message element 600.
  • the display component 118 can include the user data in multiple message elements 600.
  • the display component 118 displays the message element 600 on the user interface 126 shortly after the receptionist asks the question.
  • the computing device 202 can determine from the audio data 304 that the user is a new patient at the Doctor Office.
  • the machine-learned model 302 or the caption module 210 can anticipate that the receptionist will ask for medical insurance information and retrieve this user data.
  • the machine-learned model 302 or the caption module 210 can anticipate that the medical insurance information may be requested when the user calls a medical office. In such situations, the medical insurance information can be displayed in response to a request for this information.
  • the computing device 202 can use the sensors 214 to determine the context of the computing device 202. In response to determining that the user is not looking at the display, the computing device 202 can cause the audio component 116 to provide an audio signal or haptic feedback.
  • the audio signal can alert the user that user data related to a user information request is displayed. For example, if the computing device 202 determines that the user is holding the computing device 202 to her ear (e.g., by using a proximity sensor, gyroscope, or accelerometer), the computing device 202 can cause the audio component 116 to provide an audio signal (e.g., a soft tone) that only the user can hear. In other implementations, the computing device 202 can provide haptic feedback to the user as an alert.
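A trivial sketch of this context-dependent alerting decision follows, with illustrative signal names standing in for whatever the sensors 214 actually report; the mapping of signals to alert types is an assumption, not a requirement of the described techniques.

```kotlin
// Sketch: choose how to alert the user that responsive user data is on screen.
enum class Alert { NONE, SOFT_TONE, HAPTIC }

fun chooseAlert(deviceHeldToEar: Boolean, userLookingAtDisplay: Boolean): Alert = when {
    userLookingAtDisplay -> Alert.NONE  // the user will see the message element directly
    deviceHeldToEar -> Alert.SOFT_TONE  // a quiet tone only the user can hear
    else -> Alert.HAPTIC                // e.g., device face down or in a pocket
}
```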
  • the user can audibly provide this information to the receptionist.
  • the user may be in a public setting and may not want to provide the user data audibly.
  • the user can select one of several selectable controls 134.
  • the display component 118 displays a fifth selectable control 134-5 and a sixth selectable control 134-6.
  • the fifth selectable control 134-5 includes the following text: “Read my insurance provider.”
  • the sixth selectable control 134-6 includes the following text: “Read my policy number.”
  • the computing device 202 causes the audio mixer 208 to audibly read the respective user data to the receptionist without requiring the user to provide this information audibly.
  • the computing device 202 can give the user additional selectable controls 134 to email, text, or otherwise send the user data (e.g., the medical insurance information) to the receptionist.
  • the described techniques and systems provide a secure and private way to share sensitive user data with another person or entity during voice calls and video calls.
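As a sketch of the “Read my insurance provider” style controls, the snippet below uses Android's TextToSpeech API to speak a stored value when a control is tapped. This is only an assumption about one possible implementation; whether and how the synthesized speech is routed into the call's uplink so the receptionist hears it is device and telephony-stack dependent, and is not shown.

```kotlin
import android.content.Context
import android.speech.tts.TextToSpeech

// Sketch: speak stored user data aloud when a "Read my ..." control is selected,
// so the user does not have to say it in a public setting.
class UserDataReader(context: Context) {
    private var ready = false
    private val tts = TextToSpeech(context) { status -> ready = (status == TextToSpeech.SUCCESS) }

    fun read(label: String, value: String) {
        if (ready) {
            tts.speak("$label: $value", TextToSpeech.QUEUE_ADD, null, "user-data")
        }
    }

    fun shutdown() = tts.shutdown()
}
```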
  • the computing device 202 provides user data in response to a proposed appointment time. Consider the previous voice call to the Doctor Office. After the user provides her medical insurance information, the receptionist suggests an appointment at 11 am on Tuesday.
  • the audio data 304 includes the following question from the receptionist: “Does next Tuesday at 11 am work for you?”
  • the computing device 202 can check user calendar information in a calendar application and identify a potential conflict.
  • the user has a dentist appointment scheduled at 11:15 am on Tuesday.
  • the computing device 202 causes the display component 118 to display this information in the message element 600.
  • the display component 118 can display the following text: “Dentist appointment at 11:15 am.”
  • the computing device 202 can also automatically suggest alternative times based on the user calendar information.
  • the display component 118 can display the following text: “You have a conflict, try these times instead: Tues. at 9:30 am [or] Wed.”
  • the computing device 202 helps the user schedule a new appointment at the Doctor Office.
  • the user need not recall the previously scheduled dentist appointment or open the calendar application on the computing device 202 while talking to the receptionist.
  • the user can also avoid calling the Doctor Office back to reschedule the appointment after recalling the conflict.
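The conflict check in this scenario is essentially an interval-overlap test against the user's calendar entries. A minimal Kotlin sketch follows, with an illustrative event model and an assumed one-hour visit length (the patent does not specify either):

```kotlin
import java.time.LocalDateTime

// Sketch: flag a conflict between a proposed appointment time and existing calendar events.
data class CalendarEvent(val title: String, val start: LocalDateTime, val end: LocalDateTime)

fun findConflict(proposedStart: LocalDateTime, proposedEnd: LocalDateTime,
                 events: List<CalendarEvent>): CalendarEvent? =
    events.firstOrNull { it.start.isBefore(proposedEnd) && proposedStart.isBefore(it.end) }

fun main() {
    val dentist = CalendarEvent(
        "Dentist appointment",
        LocalDateTime.of(2020, 11, 3, 11, 15),
        LocalDateTime.of(2020, 11, 3, 12, 0)
    )
    // Receptionist proposes Tuesday at 11 am; assume a one-hour visit.
    val conflict = findConflict(
        LocalDateTime.of(2020, 11, 3, 11, 0),
        LocalDateTime.of(2020, 11, 3, 12, 0),
        listOf(dentist)
    )
    println(conflict?.title ?: "No conflict") // Dentist appointment
}
```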
  • the computing device 202 displays communicated information related to the voice call.
  • the receptionist had an appointment slot available at 1 pm on Wednesday and confirmed the appointment by saying: “We have you scheduled for an appointment at 1 pm on Wednesday, November 4.”
  • the computing device 202 can cause the display component 118 to display the details of the appointment in the message element 600:
  • the message element 600 can provide the following communicated information: “Wednesday, Nov. 4, 2020 at 1 pm, Medical Appointment @ Doctor Office.”
  • the computing device 202 can also provide the user with several selectable controls related to the communicated information, including a seventh selectable control 134-7 and an eighth selectable control 134-8.
  • the seventh selectable control 134-7 displays the text “Save to Calendar.”
  • the seventh selectable control 134-7 causes the computing device 202 to save the appointment information to the calendar application.
  • the eighth selectable control 134-8 displays the text “Send to Spouse.”
  • the eighth selectable control 134-8 causes the computing device 202 to send the appointment information to the spouse.
  • the user can also cause the computing device 202 to save the appointment information to the calendar application via audible commands.
  • the computing device 202 can cause the display component 118 to leave the message element 600 and the selectable controls 134 related to the appointment on the user interface 126 until the termination of the voice call and for several minutes after that.
  • the user can retrieve this information, including the message element 600 and the selectable controls, by selecting the conversation with the Doctor Office in a history menu of the communication application 124. In this way, the user can save communicated information from a voice call or a video call without writing down the appointment, recalling the appointment later, or separately entering the appointment into the calendar application.
  • the features and functionality described with respect to FIGs. 6A-6D allow the computing device 202 to provide a more user-friendly experience for voice calls and video calls.
  • FIGs. 7A-7C illustrate other example user interfaces of a computing device to assist users with voice calls and video calls.
  • FIGs. 7A-7C are described in succession and in the context of the computing device 202.
  • the computing device 202 may provide different user interfaces with fewer or additional features than those illustrated in FIGs. 7A-7C.
  • the computing device 202 causes the display component to display the user interface 126.
  • the caller box 128 provides Amy’s name and telephone number (e.g., (111) 555-6789).
  • Amy asks the user for her new address.
  • the audio data 304 includes the following phrase: “What is your new address?”
  • the computing device 202 determines a description of the request.
  • the caption module 210 determines the text description 306 of the request includes the user’s home address.
  • the computing device 202 finds the home address in the CRM 206 and displays it on the user interface 126.
  • the display component 118 can cause a message element 700 to provide the text description 306 and the responsive user data.
  • the message element 700 provides the following information: “Your address: 100 First Street, San Francisco, CA 94016.” In most situations, the user likely recalls this user data but may need help recalling specific details (e.g., the zip code).
  • the computing device 202 can also cause the display component 118 to display selectable controls 702.
  • the user can audibly provide her home address to Amy.
  • the user may be in a public setting and may not want to provide her address audibly.
  • the user can select one of the selectable controls 702.
  • the selectable controls 702 include a first selectable control 702-1, a second selectable control 702-2, and a third selectable control 702-3.
  • the first selectable control 702-1 includes the following text: “Read my address.”
  • the first selectable control 702-1 causes the audio mixer 208 to audibly read the home address to Amy without requiring the user to provide this information audibly.
  • the second selectable control 702-2 includes the following text: “Text my address.” When selected, the second selectable control 702-2 causes the communication application 124 or another application to send, using the communication units 112, a text message to Amy with the home address.
  • the third selectable control 702-3 includes the following text: “Email my address.” The third selectable control 702-3 causes an email application to send an email to Amy with the home address when selected.
  • the computing device 202 can obtain the email address for Amy from a contact application. In this way, the computing device 202 provides the user with a safe way to share sensitive user data on a voice call or a video call without audibly broadcasting it to nearby individuals.
  • the computing device 202 displays communicated information related to the voice call.
  • Amy provides new contact information (e.g., her new work email address).
  • the computing device 202 provides the communication information to the user.
  • the caption module 210 determines that the audio data 304 includes Amy providing her new email address: “My email address is amy@email.com.”
  • the display component 118 displays the new email address in the message element 700.
  • the message element 700 provides the following text: “Amy’s email address: amy@email.com.”
  • the computing device 202 can verify that the new email address is not saved on the computing device 202 (e.g., in a contacts application or an email application). If the new email address is saved, then the computing device 202 may cause the caption module 210 not to display this communicated information. If the new email address is not saved, then the computing device 202 may cause the caption module 210 to display this communicated information.
  • the computing device 202 can display a fourth selectable control 702-4.
  • the fourth selectable control 702-4 includes the following text: “Save in Contacts.”
  • the fourth selectable control 702-4 causes the computing device 202 to save the email address to a contacts application when selected.
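A compact sketch of the “only surface what is not already saved” check described above follows; the lookup against a plain set of saved addresses is an assumption made for illustration, whereas a real device would query the contacts or email application.

```kotlin
// Sketch: offer a "Save in Contacts" control only when the communicated email
// address is not already stored on the device.
fun shouldOfferToSave(spokenEmail: String, savedEmails: Set<String>): Boolean =
    savedEmails.none { it.equals(spokenEmail, ignoreCase = true) }

fun main() {
    val saved = setOf("amy.old@example.com")
    println(shouldOfferToSave("amy@email.com", saved)) // true -> display the address and the control
}
```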
  • the computing device 202 provides additional selectable controls in response to communicated information during the voice call.
  • the audio data 304 includes the following phrase audibly spoken by the user: “I’ll meet you in 20 minutes at Mary’s Diner.” In response to this communicated information, the computing device 202 can display the address for Mary’s Diner in the message element 700.
  • the message element 702 includes the following text: “Address for Mary’s Diner, 500 S. 20 th Street, San Francisco, CA 94016.”
  • the computing device 202 can also display a fifth selectable control 702-5.
  • the fifth selectable control 702-5 displays the following text: “Directions to Mary’s Diner.” When selected, the fifth selectable control 702-5 causes the computing device 202 to initiate navigation instructions from a navigation application.
  • the fifth selectable control 702-5 can be a slice window of the navigation application that provides a subset of functionalities of the navigation application related to the communicated information.
  • the slice window for the navigation application can allow the user to select walking directions, driving directions, or public transport directions to Mary’s Diner.
  • FIGs. 8A-8D illustrate other example user interfaces of a computing device to assist users with voice calls and video calls.
  • FIGs. 8A-8D are described in succession and in the context of the computing device 202 of FIG. 2.
  • the computing device 202 may provide different user interfaces with fewer or additional features than those illustrated in FIGs. 8A-8D.
  • the computing device 202 causes the display component 118 to display the user interface 126 with message element 800 and selectable controls 802 in response to selectable options of the IVR system 110.
  • the caller box 128 indicates the business name (e.g., Utility Company) and telephone number (e.g., (111) 555-2345) of the called party.
  • the IVR system 110 uses a voice response system that prompts callers to provide audio responses to a series of questions and statements.
  • the audio data 304 includes the following statement: “Thank you for contacting us about becoming a new customer. Please state the type of service you are interested in.”
  • the IVR system 110 can listen for a phrase that matches or closely matches a list of offered services.
  • the Utility Company can listen for one of the following selectable options: home internet service, home telephone, or TV services.
  • the computing device 202 can determine that the audio data 304 includes an implicit list of two or more selectable options.
  • the display component 118 can display the following text in the message element 800: “Listed below are common responses offered by new customers.”
  • the selectable controls 802 can include a first selectable control 802-1 (e.g., “Home Internet Service”), a second selectable control 802-2 (e.g., “Home Telephone”), and a third selectable control 802-3 (e.g., “TV Services”).
  • the selectable controls 802 can include additional or fewer suggestions.
  • the user can select one of the selectable controls 802, causing the audio mixer 208 to provide the selected option to the IVR system 110 audibly.
  • the computing device 202 can determine the potential suggestions based on the audio data 304 by deciphering the available services from audible parts of the voice call.
  • the computing device 202 can also determine the selectable options based on data obtained from other computing devices given a similar request by the same utility provider or similar companies. In this way, the computing device 202 can help the user navigate open-ended IVR prompts and avoid ineffective responses or causing the system to restart.
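One simple way to picture the suggestion step is to rank candidate responses against the words in the open-ended prompt. The candidate list and the word-overlap scoring below are purely illustrative assumptions; as noted above, candidates might instead be derived from audible parts of the call or from other devices' interactions with the same business.

```kotlin
// Sketch: rank candidate responses to an open-ended IVR prompt by word overlap.
fun suggestResponses(prompt: String, candidates: List<String>): List<String> {
    val promptWords = prompt.lowercase().split(Regex("\\W+")).filter { it.isNotEmpty() }.toSet()
    return candidates.sortedByDescending { candidate ->
        candidate.lowercase().split(" ").count { it in promptWords }
    }
}

fun main() {
    val prompt = "Please state the type of service you are interested in."
    println(suggestResponses(prompt, listOf("Home Internet Service", "Home Telephone", "TV Services")))
    // [Home Internet Service, Home Telephone, TV Services]
}
```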
  • FIG. 8B is an example of the user interface 126 in response to a request for user information (e.g., payment information).
  • the IVR system 110 directs the user to an account specialist to set up a new account and initiate home internet services. Because the user is a new account holder, the account specialist collects payment information, including a credit card number, to set up the account.
  • the audio data 304 may include the following request from the specialist: “Please provide a preferred form of payment for your new services.”
  • the computing device 202 determines a text description 306 of the request.
  • the caption module 210 determines the text description 306 asks for credit card information.
  • the computing device 202 identifies the credit card information in the CRM 206 and displays the user data on the user interface 126.
  • the message element 800 includes the following information: “Your credit card information: #### - #### - #### - 1234, [Expiration date:] 01/21, and [PIN] 789.”
  • the computing device 202 can also determine whether the user data includes sensitive information. In response to determining that a portion of the user data is sensitive information, the computing device 202 can obscure a portion of the sensitive information (e.g., replacing at least some digits of the credit card number with a different symbol, such as “#” or “*”, or omitting them). In this way, the computing device 202 can maintain the secrecy of the sensitive information and obscure it from other persons.
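A minimal sketch of the obscuring step, assuming the display convention shown above (all but the last four digits replaced with “#”); the function name and grouping are illustrative:

```kotlin
// Sketch: obscure all but the last four digits of a card number before display.
fun maskCardNumber(cardNumber: String, visibleDigits: Int = 4): String {
    val digits = cardNumber.filter { it.isDigit() }
    return digits
        .mapIndexed { index, c -> if (index < digits.length - visibleDigits) '#' else c }
        .joinToString("")
        .chunked(4)
        .joinToString(" - ")
}

fun main() {
    println(maskCardNumber("4111222233331234")) // #### - #### - #### - 1234
}
```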
  • the display component 118 can display a selectable control 802 to maintain the secrecy of the user data.
  • the display component 118 displays a fourth selectable control 802-4 that includes the following text: “Read my credit card information.”
  • the fourth selectable control 802-4 causes the computing device 202 to audibly read the complete credit card number, expiration date, and PIN to the account specialist. In this way, the computing device 202 provides a secure way for the user to share sensitive credit card information with the account specialist.
  • the computing device 202 displays communicated information related to the voice call.
  • the account specialist provides account information (e.g., an account number and personal identification number (PIN)) to the user.
  • the audio data 304 includes the following statement: “Your new account number is UTIL12345, and the PIN associated with your account is 6789.”
  • the computing device 202 displays the account number and PIN in the message element 800.
  • the message element 800 displays: “Your account number: UTIL12345, Your PIN: 6789.”
  • the computing device 202 can provide the user with a fifth selectable control 802-5 and a sixth selectable control 802-6.
  • the fifth selectable control 802-5 includes the following text: “Save in Contacts.” When selected, the fifth selectable control 802-5 causes the computing device 202 to save the account number and PIN to a contacts application.
  • the sixth selectable control 802-6 includes the following text: “Save in Secure Memory.” When selected, the sixth selectable control 802-6 causes the computing device 202 to save the account number and PIN to a secure memory that requires special privileges by an application or a user to access.
  • the computing device 202 displays communicated information related to a previous voice call.
  • the previous voice call to the Utility Company.
  • the user could not review the communicated information displayed on the user interface during or shortly after the voice call.
  • the computing device 202 can store the message element 800, the fifth selectable control 802-5, the sixth selectable control 802-6, or a combination thereof related to the voice call. In this way, the user can access the text description 306 of the communicated information later.
  • the call history can provide a user interface 126 associated with each voice call or video call.
  • the user interface 126 associated with the history of the voice call with the Utility Company can include a history element 804.
  • the history element 804 can include historical information about the voice call, including the following text: “Outgoing call on November 2.”
  • the user may need to make another voice call or video call immediately after the termination of the voice call with the Utility Company or may need to perform another functionality on the computing device 202.
  • the computing device 202 can store the message elements 800 and the selectable controls 802 associated with each voice call or video call in memory associated with the communication application 124.
  • the communication application 124 can include a call history. In this way, the user can retrieve the message element 800 and the selectable controls 802 related to a voice call or video call later when convenient.
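As a sketch of retaining these elements for later review, the snippet below keeps message elements and control labels per call in a simple history structure. The data model is an assumption made for illustration; the described techniques only require that the communication application 124 store the elements with its call history.

```kotlin
import java.time.LocalDate

// Sketch: keep message elements and selectable-control labels per call so they
// can be revisited from the communication application's call history.
data class CallRecord(
    val party: String,
    val date: LocalDate,
    val messages: MutableList<String> = mutableListOf(),
    val controls: MutableList<String> = mutableListOf(),
)

class CallHistory {
    private val records = mutableListOf<CallRecord>()

    fun record(party: String, date: LocalDate): CallRecord =
        CallRecord(party, date).also { records.add(it) }

    fun forParty(party: String): List<CallRecord> = records.filter { it.party == party }
}

fun main() {
    val history = CallHistory()
    val call = history.record("Utility Company", LocalDate.of(2020, 11, 2))
    call.messages += "Your account number: UTIL12345, Your PIN: 6789"
    call.controls += "Save in Contacts"
    call.controls += "Save in Secure Memory"
    println(history.forParty("Utility Company").first().messages)
}
```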
  • Example 1 A method comprising: obtaining, by a computing device, audio data output from a communication application executing on the computing device, the audio data comprising audible parts of a voice call or a video call between a user of the computing device and a third party; determining, by the computing device and using the audible parts, whether the audio data includes a request for user information, the request for user information audibly provided by the third party during the voice call or the video call; identifying, by the computing device and using the audible parts, user data that is responsive to the request for user information; and displaying the user data on a display of the computing device or providing, by the computing device, the user data to the third party during the voice call or the video call.
  • Example 2 The method of example 1, the method further comprising: responsive to determining that the audio data includes the request for user information, determining, by the computing device, a text description of the request for user information, the text description providing a transcription of at least a portion of the request for user information; and displaying the text description on the display of the computing device.
  • Example 3 The method of any preceding example, wherein displaying the user data on the display of the computing device comprises displaying one or more selectable controls on the display of the computing device, the one or more selectable controls configured to be selectable by the user to provide, by the computing device, the user data to the third party without the user audibly communicating the user data.
  • Example 4 The method of example 3, the method further comprising: receiving a selection of one selectable control of the one or more selectable controls by the user; and responsive to receiving the selection of the one selectable control, communicating, by the computing device, the user data to the third party.
  • Example 5 The method of example 4, wherein the user data is communicated to the third party by at least one of emailing the user data to the third party, texting the user data to the third party, transmitting a digital file containing the user data to the third party, providing a series of dual-tone multi-frequency (DTMF) tones that represent the user data to the third party, or audibly reading, by the computing device, the user data to the third party.
  • Example 6 The method of any preceding example, the method further comprising: determining, by the computing device, whether the user is not looking at the display of the computing device; and responsive to determining that the user is not looking at the display of the computing device, providing, by the computing device, an audio signal or haptic feedback to alert the user that the user data is displayed on the computing device.
  • Example 7 The method of any of examples 4 through 6, the method further comprising: responsive to communicating the user data to the third party, identifying, by the computing device and using the audio data, a description of an additional request for user information, the additional request for user information audibly provided by the third party during the voice call or the video call; identifying, by the computing device, additional user data responsive to the additional request for user information; and displaying the additional user data on the display of the computing device during the voice call or the video call.
  • Example 8 The method of any preceding example, wherein the request for user information comprises a request by the third party for at least one of an account number, a reservation number, a confirmation number, a social security number, calendar information, personal information, a credit card number, or an address of the user.
  • Example 9 The method of any of examples 2 through 8, wherein determining the text description of the request for user information comprises executing, by the computing device, a machine-learned model to determine the text description of the request for user information, the machine-learned model trained to determine text descriptions from the audio data, the audio data received from an audio mixer of the computing device.
  • Example 10 The method of example 9, wherein the machine-learned model comprises an end-to-end Recurrent-Neural-Network-Transducer Automatic Speech-Recognition Model.
  • Example 11 The method of any preceding example, the method further comprising: determining, by the computing device and using the audible parts, whether the audio data includes two or more selectable options, the two or more selectable options audibly provided by the third party during the voice call or the video call; responsive to determining that the audio data includes the two or more selectable options, determining, by the computing device, a text description of the two or more selectable options, the text description of the two or more selectable options providing a transcription of at least a portion of the two or more selectable options; and displaying two or more other selectable controls on a display of the computing device, the two or more other selectable controls configured to be selectable by the user to provide an indication to the third party of a selected option of the two or more selectable options, each of the two or more other selectable controls providing the text description of a respective selectable option.
  • Example 12 The method of any preceding example, the method further comprising: determining, by the computing device and using the audible parts, whether the audio data includes communicated information, the communicated information related to a context of the voice call or the video call and audibly provided by the third party or the user during the voice call or the video call; responsive to determining that the audio data includes the communicated information, determining, by the computing device, a text description of the communicated information, the text description of the communicated information providing a transcription of at least a portion of the communicated information; and displaying another selectable control on the display, the other selectable control providing the text description of the communicated information and configured to be selectable by the user to save the communicated information in at least one of the computing device, the application, or another application on the computing device.
  • Example 13 The method of any preceding example, wherein the computing device comprises a smartphone, a computerized watch, a tablet device, a wearable device, or a laptop computer.
  • Example 14 A computing device comprising at least one processor configured to perform any of the methods of examples 1 through 13.
  • Example 15 A computer-readable storage medium comprising instructions that, when executed, configure a processor of a computing device to perform any of the methods of examples 1 through 13.

Abstract

This document describes systems and techniques to identify and provide requested user information during voice and video calls. The described systems and techniques can determine whether audio data associated with a voice or video call between a user of a computing device and a third party includes a request for user information. The third party asks for user information during the call. The described systems and techniques identify user data responsive to the request for user information and display it on a display or provide it to the third party during the call. In this way, the described systems and techniques can improve user calls by identifying user data responsive to requests for user information and providing a convenient and secure way to communicate the user data to the third party.

Description

IDENTIFYING AND PROVIDING REQUESTED USER INFORMATION
DURING VOICE CALLS AND VIDEO CALLS
BACKGROUND
[0001] Some telephone calls require a person to provide user information (e.g., reservation numbers, account numbers, personal information) to another person. For example, a utility company may require a caller to provide an account number and credit card number during a voice call. If the requested user information is not readily available or memorized, the user of a computing device (e.g., a smartphone) may need to exit the communication application and open another application to locate the user information. Alternatively, the user may need to find a physical copy of a previous billing statement to locate the requested user information. In other situations, the caller may need to provide sensitive user information (e.g., a credit card number, home address, birth date) in a public area.
SUMMARY
[0002] This document describes systems and techniques to identify and provide requested user information during voice and video calls. The described systems and techniques can determine whether audio data associated with a voice or video call between a user of a computing device and a third party includes a request for user information. The third party asks for user information during the call. The described systems and techniques identify user data responsive to the request for user information and display it on a display or provide it to the third party during the call. In this way, the described systems and techniques can improve user calls by identifying user data responsive to requests for user information and providing a convenient and secure way to communicate the user data to the third party.
[0003] For example, assume that a computing device obtains audio data that is output from a communication application executing on the computing device. The audio data includes audible parts of a voice or video call between a user of the computing device and a third party. The computing device determines whether the audio data includes a request for user information using the audible parts of the voice or video call. The computing device identifies, using the audible parts, user data responsive to the request for user information. The computing device then displays the user data or provides it to the third party during the voice call or video call.
[0004] The described systems and techniques may allow user data to be disclosed safely and securely to a third party, for example when a user of the computing device is in an environment in which they may be overheard. The described systems and techniques may allow sensitive user data such as credit card numbers to be provided to the third party without being disclosed to unauthorized persons in the environment.
[0005] This document also describes other methods, configurations, and systems to identify and provide requested user information during voice calls and video calls.
[0006] This Summary is provided to introduce simplified concepts for identifying and providing requested user information during voice calls and video calls, further described in the Detailed Description and Drawings. This Summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in determining the scope of the claimed subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] The details of one or more aspects of identifying and providing requested user information during voice calls and video calls are described in this document with reference to the following drawings. The same numbers are used throughout multiple drawings to reference like features and components.
[0008] FIG. 1 illustrates an example environment that includes a computing device that can identify and provide requested user information during voice calls and video calls.
[0009] FIG. 2 illustrates an example device diagram of a computing device that can identify and provide requested user information during voice calls and video calls.
[0010] FIG. 3 illustrates an example diagram of a machine-learned model of a computing device that can provide text descriptions for identifying and providing requested user information.
[0011] FIG. 4 illustrates a flow chart of example operations of a computing device that can identify and provide selectable controls and user data related to voice calls and video calls.
[0012] FIG. 5 illustrates example operations to identify and provide requested user information during voice calls and video calls.
[0013] FIGs. 6A-6D illustrate example user interfaces of a computing device to assist users with voice calls and video calls.
[0014] FIGs. 7A-7C illustrate other example user interfaces of a computing device to assist users with voice calls and video calls.
[0015] FIGs. 8A-8D illustrate other example user interfaces of a computing device to assist users with voice calls and video calls.
DETAILED DESCRIPTION
OVERVIEW
[0016] This document describes techniques and systems to identify and provide requested user information during voice calls and video calls on a computing device. As noted above, some voice calls or video calls require a computing device user to provide user information (e.g., reservation numbers, account numbers, personal information) to another person. The user must often search for the requested user information in personal records by opening another application or retrieving it from another device or a physical document. It can be challenging for users to locate or recall this information, especially when they are at work, running an errand, or away from home. Some of the requested information can also be sensitive user information that users would prefer not to share in a public area (e.g. , in a cubicle at work, at the grocery store, on public transportation).
[0017] Consider a smartphone with a communication application that allows users to make voice calls or video calls. For example, a user can use the communication application to call a medical office. If the user is a new patient, the medical office can request user information, including a home address, birth date, social security number, and medical insurance information. The user can sometimes recall some of the requested user information (e.g., home address, birth date, and social security number). Still, most users likely do not recall their medical insurance information. In this scenario, the user must either find the medical insurance information on a website associated with the medical insurance provider, a medical insurance card, or a recent medical insurance document. The user can also anticipate that the medical office will request the medical insurance information and locate it in advance. Each of these options can be time-consuming or, in some situations, not possible. The user may need to leave an application they are currently using on the smartphone and switch to a different application to find the information. In some cases, they may need to switch between multiple applications to find the required information. Such switching between applications can be a waste of computing resources such as battery life and processing power. If users are in a public space, they may also be uncomfortable audibly sharing the requested user information, for example if the information is restricted or sensitive.
[0018] The described techniques and systems can help users conveniently and securely communicate requested user information during voice and video calls by identifying and providing the user information. In particular, the described techniques and systems can obtain audio data from a voice or video call and determine whether the conversation includes a request for user information. The described techniques and systems can then identify and provide user data responsive to the request.
[0019] The described techniques and systems could also assist a user who is hard of hearing in providing user data to a third party. For example, a user who is hard of hearing may not be able to hear a request for user data clearly, or at all. However, the described techniques and systems can determine whether a request was made and provide the data to the third party, or display the data for the user to provide to the third party. As such, the described techniques and systems can allow a user who is hard of hearing to respond to a request for user data that they may otherwise not have been able to respond to. The described techniques and systems may also improve the ease with which users who find difficulty in using a computing device may provide user data to a third party, for example users with dexterity impairments.
[0020] Consider the medical office scenario described above. The smartphone can listen to the voice call and determine whether the medical office requests medical insurance information from the user. In response, the described systems and techniques can identify user data responsive to the request from memory or another application. The smartphone can then display the user data or provide it to the medical office. The smartphone can also give the user the option to privately share the user data with the medical office by using a text-to-speech module to read the insurance information, sending it to the medical office as an email or text, or providing it as a series of dual-tone multi-frequency (DTMF) tones. In this way, the described techniques and systems provide a user-friendly and secure experience for smartphone users to locate and share requested user information. The described techniques and systems may mean that a user can avoid switching between applications on their smartphone to find the requested user data. Battery and processing resources of the computing device may be conserved as a result.
[0021] As a non-limiting example, a computing device can obtain audio data output from a communication application. The audio data includes audible parts of a voice call or video call between a user of the computing device and a third party. The computing device determines, using the audible parts, whether the audio data includes a request for user information, which is audibly provided by the third party during the voice call or the video call. The computing device then identifies user data responsive to the request. The computing device displays the user data or provides it to the third party during the call.
[0022] The computing device may only use the information from the audio data after the computing device receives explicit permission from a user of the computing device. For example, in situations discussed above in which the computing device may collect audio data from voice and video calls, individual users may be provided with an opportunity to provide input to control whether programs or features of the computing device can collect and make use of the information. The individual users may further be provided with an opportunity to control what the programs or features can or cannot do with the information.
[0023] This example is just one illustration of how the described systems and techniques to identify and provide requested user information during voice calls or video calls can improve a user’s experience on a computing device. Other examples and implementations are described throughout this document. This document now describes additional example configurations, components, and methods for identifying and providing requested user information during voice calls and video calls.
EXAMPLE ENVIRONMENT
[0024] FIG. 1 illustrates an example environment 100 that includes an example computing device 102 that can identify and provide requested user information during voice calls and video calls. In addition to the computing device 102, the environment 100 includes a computing system 104 and a caller system 106. The computing device 102, the computing system 104, and the caller system 106 are communicatively coupled to network 108.
[0025] Although operations of the computing device 102 are described as being performed locally, in some examples, the operations may be performed by multiple computing devices and systems (e.g., the computing system 104), including additional computing devices and systems beyond those shown in FIG. 1. For example, the computing system 104, the caller system 106, or any other device or system communicatively coupled to the network 108, may perform some or all of the functionality of the computing device 102, or vice versa.
[0026] The computing system 104 represents any combination of one or more computers, mainframes, servers, cloud computing systems, or other types of remote computing systems capable of exchanging information with the computing device 102 via the network 108. The computing system 104 can store, or provide access to, additional processors, stored data, or other computing resources needed by computing device 102 to implement the described systems and techniques for providing selectable controls for IVR systems on the computing device 102.
[0027] The caller system 106 can execute an interactive voice response (IVR) system 110 to transmit and receive telephony data with the computing device 102 via the network 108. For example, the caller system 106 can be a mobile telephone, landline telephone, laptop computer, workstation at a telephone call center, or other computing device configured to present the IVR system 110 to a caller. The caller system 106 can also represent any combination of computers, computing devices, mainframes, servers, cloud computing systems, or other types of remote computing systems capable of communicating information via network 108 to implement a voice call or a video call between the caller system 106 and the computing device 102.
[0028] The network 108 represents any public or private communications network for transmitting data (e.g., voice communications, video communications, data packages) between computing systems, servers, and computing devices. For example, the network 108 can include a public switched telephone network (PSTN), a wireless network (e.g., a cellular network, a wireless local area network (WLAN)), a wired network (e.g., a local area network (LAN), a wide area network (WAN)), an Internet Protocol (IP) telephony network (e.g., a voice-over-IP (VoIP) network), or any combination thereof. The network 108 may include network hubs, network switches, network routers, or any other network equipment that is operatively inter-coupled. The computing device 102, the computing system 104, and the caller system 106 may transmit and receive data across the network 108 using any suitable communication techniques. The computing device 102, the computing system 104, and the caller system 106 can be operatively coupled to the network 108 using respective network links.
[0029] The computing device 102 represents any suitable computing device capable of identifying and providing requested user information during voice calls and video calls. For example, the computing device 102 may be a smartphone on which a user provides inputs to make or accept voice calls or video calls with a caller entity (e.g., the caller system 106).
[0030] The computing device 102 includes one or more communication units 112. The communication units 112 allow the computing device 102 to communicate over wireless or wired networks, including the network 108. For example, the communication units 112 can include transceivers for cellular phone communication or network data communication. The computing device 102 can tune the communication units 112 and supporting circuitry (e.g., antennas, frontend modules, amplifiers) to one or more frequency bands defined by various communication standards.
[0031] The computing device 102 includes a user interface component 114, which includes an audio component 116, a display component 118, and an input component 120. The computing device 102 also includes an operating system 122 and a communication application 124. These components and other components (not illustrated) of the computing device 102 are operatively coupled in various ways, including wired and wireless busses and links. The computing device 102 may include additional components and interfaces omitted from FIG. 1 for the sake of clarity.
[0032] The user interface component 114 manages input and output to a user interface 126 controlled by the operating system 122 or applications executing on the computing device 102. For example, the communication application 124 can cause the user interface 126 to display various user interface elements, including input controls, navigational components, informational components, or a combination thereof.
[0033] As described above, the user interface component 114 can include the audio component 116, the display component 118, and the input component 120. The audio component 116, the display component 118, and the input component 120 can be separate or integrated as a single component. The audio component 116 (e.g., a single speaker or multiple speakers) can receive an audio signal as input and convert the audio signal to audible sound. The display component 118 can display visual elements on the user interface 126. The display component 118 can include any suitable display technology, including light-emitting diode (LED), organic light-emitting diode (OLED), and liquid crystal display (LCD) technologies. The input component 120 may be a microphone, presence-sensitive device, touch screen, mouse, keyboard, or another type of component configured to receive user input.
[0034] The operating system 122 generally controls the computing device 102, including the communication units 112, the user interface component 114, and other peripherals. For example, the operating system 122 can manage hardware and software resources of the computing device 102 and provide common services for applications. As another example, the operating system 122 can control task scheduling. The operating system 122 and the applications are generally executable by one or more processors (e.g., a system on chip (SoC), a central processing unit (CPU)) to enable communications and user interaction with the computing device 102. The operating system 122 generally provides for user interaction through a user interface 126.
[0035] The operating system 122 also provides an execution environment for applications, for example, the communication application 124. The communication application 124 allows the computing device 102 to make and receive voice calls and video calls with callers, including the caller system 106.
[0036] During a voice call or a video call, the communication application 124 can cause the user interface 126 to display a caller box 128, a numeric-keypad icon 130, a speakerphone icon 132, selectable controls 134, an end-call icon 136, and a message element 138. The caller box 128 can indicate the name and telephone number of the caller (e.g., the caller system 106). The numeric-keypad icon 130 is a selectable icon that, when selected, causes a numeric keypad to be displayed on the user interface 126. The speakerphone icon 132 is a selectable icon that, when selected, causes the computing device 102 to use a speakerphone functionality for the voice call or video call. The selectable controls 134 are selectable by a user of the computing device 102 to perform a particular operation or function. In the illustrated example, the selectable controls 134 are selectable by the user to read, by the computing device 102, an account number to the caller system 106. The selectable controls 134 can include buttons, toggles, selectable text, sliders, checkboxes, or icons. The end-call icon 136 allows a user of the computing device 102 to terminate a voice call or a video call. The message element 138 provides textual information to the user. For example, the message element 138 can provide user data (e.g., an account number) responsive to user information requests.
[0037] The operating system 122 can correlate detected inputs at the input component 120 to elements of the user interface 126. In response to receiving an input at the input component 120 (e.g., a tap), the operating system 122 or the communication application 124 can receive information from the user interface component 114 about the detected input. The operating system 122 or the communication application 124 may perform a function or operation in response to the detected input. For example, the operating system 122 may determine that the input corresponds to the user selecting one of the selectable controls 134 and, in response, the operating system 122 can cause the computing device 102 to provide the user data to the third party without requiring the user to provide it audibly.
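To make this dispatch concrete, the following Kotlin sketch shows one way a tapped selectable control could be mapped to a call action under the behavior described above. The names (ControlAction, SelectableControl, CallSession, UserDataStore) are hypothetical stand-ins introduced for illustration, not an API disclosed in this document.

```kotlin
// Hypothetical sketch: map a tapped selectable control to a call action,
// so user data can reach the third party without being spoken by the user.
sealed class ControlAction {
    data class SendDtmf(val digit: Char) : ControlAction()        // e.g., IVR option "2"
    data class SpeakUserData(val key: String) : ControlAction()   // e.g., "account_number"
}

data class SelectableControl(val label: String, val action: ControlAction)

// Stand-ins for the call path and the local user-data store described in the text.
interface CallSession {
    fun sendDtmf(digit: Char)
    fun injectSynthesizedSpeech(text: String)
}
interface UserDataStore { fun lookup(key: String): String? }

fun onControlTapped(control: SelectableControl, call: CallSession, store: UserDataStore) {
    when (val action = control.action) {
        is ControlAction.SendDtmf -> call.sendDtmf(action.digit)
        is ControlAction.SpeakUserData ->
            store.lookup(action.key)?.let { call.injectSynthesizedSpeech(it) }
    }
}
```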
[0038] In operation, the operating system 122 or the communication application 124 can automatically generate the selectable controls 134 corresponding to the user data. The computing device 102 can obtain audio data from an audio mixer or sound engine of the operating system 122. The audio data generally includes the audible parts of the voice call or the video call, including the IVR options provided by the IVR system 110.
EXAMPLE CONFIGURATIONS
[0039] This section illustrates example configurations of systems to identify and provide requested user information during voice calls and video calls, which may occur separately or together in whole or in part. This section describes various example configurations, each described in relation to a drawing for ease of reading.
[0040] FIG. 2 illustrates an example device diagram 200 of a computing device 202 that can identify and provide requested user information during voice calls and video calls. The computing device 202 is an example of the computing device 102, with some additional detail.
[0041] As shown in FIG. 2, the computing device 202 may be a smartphone 202-1, a tablet device 202-2, a laptop computer 202-3, a desktop computer 202-4, a computerized watch 202-5 or other wearable device, a voice-assistant system 202-6, a smart display system, or a computing system installed in a vehicle.
[0042] In addition to the communication units 112 and the user interface component 114, the computing device 202 includes one or more processors 204 and computer-readable storage media (CRM) 206.
[0043] The processors 204 may include any combination of one or more controllers, microcontrollers, processors, microprocessors, hardware processors, hardware processing units, digital-signal-processors, graphics processors, graphics processing units, and the like. For example, the processor 204 can be an integrated processor and memory subsystem, including, as non-limiting examples, an SoC, a CPU, a graphics processing unit, or a tensor processing unit. An SoC generally integrates many of the components of the computing device 202 into a single device, including a central processing unit, a memory, and input and output ports. A CPU generally executes commands and processes needed for the computing device 202. A graphics processing unit performs operations to display graphics of the computing device 202 and can perform other specific computational tasks. The tensor processing unit generally performs symbolic match operations in neural-network machine-learning applications. The processors 204 can include a single core or multiple cores.
[0044] The CRM 206 can provide the computing device 202 with persistent and non-persistent storage of executable instructions (e.g., firmware, recovery firmware, software, applications, modules, programs, functions) and data (e.g., user data, operational data) to support the execution of the executable instructions. For example, the CRM 206 includes instructions that, when executed by the processors 204, execute the operating system 122 and the communication application 124. Examples of the CRM 206 include volatile memory and nonvolatile memory, fixed and removable media devices, and any suitable memory device or electronic data storage that maintains executable instructions and supporting data. The CRM 206 can include various implementations of random-access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), non-volatile RAM (NVRAM), read-only memory (ROM), flash memory, and other storage memory types in various memory device configurations. The CRM 206 excludes propagating signals. The CRM 206 can be a solid-state drive (SSD) or a hard disk drive (HDD).
[0045] The operating system 122 can also include or control an audio mixer 208 and caption module 210. The audio mixer 208 and the caption module 210 can be specialized hardware components, software components, or a combination thereof. In other examples, the audio mixer 208 and the caption module 210 are separate from the operating system 122 (e.g., as a system plug-in or additional add-on service locally installed on the computing device 202).
[0046] The audio mixer 208 can obtain and consolidate audio data generated by applications, including the communication application 124, executing on the computing device 202. The audio mixer 208 obtains audio streams from applications, such as the communication application 124, and generates audio output signals that reproduce the sounds encoded in the audio streams when combined and output from the audio component 116. The audio mixer 208 may adjust the audio signals in other ways, for example, controlling focus, intent, and volume. The audio mixer 208 provides an interface between the application source that generates the content and the audio component 116 that creates sounds from the content. The audio mixer 208 can manage raw audio data, analyze it, and direct audio signals to be output by the audio component 116 or sent, via the communication units 112, to another computing device (e.g., the caller system 106).
[0047] The caption module 210 is configured to analyze audio data, in raw form, as received (e.g., as a byte stream) by the audio mixer 208. For example, the caption module 210 can perform speech recognition on the audio data to determine whether the audio data includes selectable options of an IVR system, a request for user information, or communicated information related to a call context. Rather than process each audio signal, the caption module 210 can identify individual, pre-mixed audio data streams suitable for captioning. For example, the caption module 210 can automatically caption spoken audio data but not notification or sonification audio data (e.g., system beeps, rings). The caption module 210 may apply a filter to the byte streams received by the audio mixer 208 to identify the audio data suitable for captioning. The caption module 210 can use a machine-learned model to determine audio data descriptions from audible parts of a voice call or a video call.
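As a rough illustration of this filtering, the Kotlin sketch below keeps only pre-mixed call-speech streams for captioning and skips notification or sonification audio. The StreamUsage and AudioStream types are assumptions introduced solely for this example.

```kotlin
// Hypothetical sketch: keep only pre-mixed call-speech streams for captioning,
// skipping notification and sonification audio such as beeps and rings.
enum class StreamUsage { VOICE_CALL, VIDEO_CALL, NOTIFICATION, SYSTEM_SOUND, MEDIA }

class AudioStream(val usage: StreamUsage, val pcmBytes: ByteArray)

fun captionableStreams(premixed: List<AudioStream>): List<AudioStream> =
    premixed.filter { it.usage == StreamUsage.VOICE_CALL || it.usage == StreamUsage.VIDEO_CALL }
```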
[0048] Rather than captioning all the audio data, the operating system 122 can use metadata to focus the captioning on specific portions of the audio data. For example, the caption module 210 can focus on audio data related to providing selectable controls for IVR systems, user information in response to a request, or communicated information related to a call context. In other words, the operating system 122 can identify “captionable” audio data based on metadata and refrain from captioning all audio data. Some metadata examples include a context indicator specifying the nature of a voice call or a video call. The audio mixer 208 may use the context indicator to control routing, focus, and captioning decisions regarding the audio data.
[0049] Some computing devices can transcribe a voice call or a video call. The transcription, however, generally provides a direct transcription of the audible parts of the call and cannot determine whether the conversation includes selectable options of an IVR system, a request for user information, or communicated information related to the call context. The user still must read the transcript to determine the desired menu option, the requested user information, or the communicated information. Thus, even if the computing device provides a transcription, the user may still find it challenging to navigate the IVR system and select the desired option, locate requested user information, or find and save communicated information. In contrast, the described systems and techniques assist users in navigating IVR systems, providing user information in response to a request, or managing communicated information from voice calls and video calls by displaying selectable controls and message elements with the relevant information.
[0050] The computing device 202 also includes one or more sensors 214. The sensors 214 obtain contextual information indicative of a physical operating environment of the computing device 202 or characteristics of the computing device 202 while functioning in a physical operating environment. For example, the caption module 210 can use this contextual information as metadata to focus the audio data processing. Examples of the sensors 214 include movement sensors, temperature sensors, position sensors, proximity sensors, ambient light sensors, moisture sensors, pressure sensors, and the like.
[0051] In operation, the operating system 122 or the caption module 210 determines whether the audio data is for captioning. For example, the caption module 210 can determine whether the audio data includes selectable options of an IVR system, a request for user information, or communicated information related to the call context. Responsive to determining that the audio data is for captioning, the operating system 122 determines the audio data description. For example, the operating system 122 may execute a machine-learned model (e.g., an end-to-end Recurrent-Neural-Network-Transducer Automatic Speech-Recognition Model) trained to generate descriptions of audible parts of voice calls or video calls. The machine-learned model can be any type of model suitable for learning descriptions of sounds, including transcriptions for spoken audio. The machine-learned model used by the operating system 122 can be smaller and less complex than other machine-learned models because it only needs to be trained to identify audible parts of voice calls and video calls. The machine-learned model can avoid processing all audio data sent to the audio mixer 208. In this way, the described systems and techniques can avoid using remote processing resources (e.g., a machine-learned model at a remote computing device) to avoid unnecessary privacy risks and potential processing latencies.
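The gating step described here might be organized as in the following Kotlin sketch, which runs a local recognizer only when a buffer plausibly belongs to an active call. LocalSpeechModel and CallState are hypothetical names; an actual system would apply the caption module's own checks rather than these simple conditions.

```kotlin
// Hypothetical sketch: run the smaller on-device recognizer only when the
// buffer plausibly belongs to an active voice or video call; no audio leaves
// the device.
interface LocalSpeechModel { fun transcribe(pcm: ByteArray): String }

data class CallState(val isActiveCall: Boolean)

fun maybeCaption(pcm: ByteArray, state: CallState, model: LocalSpeechModel): String? {
    if (!state.isActiveCall) return null   // nothing to caption outside a call
    if (pcm.isEmpty()) return null         // skip empty buffers
    return model.transcribe(pcm)           // local inference, no remote processing
}
```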
[0052] By relying on original audio data instead of audio signals generated by the audio component 116, the machine-learned model can generate descriptions that more-accurately represent the audible parts of voice calls and video calls. By determining whether audio data is for captioning before using the machine-learned model, the operating system 122 can avoid wasting resources overanalyzing all audio data output by the communication application 124. This captioning determination enables the computing device 202 to execute a more-efficient, smaller, and less-complex machine-learned model. In this way, the machine-learned model can perform automatic speech-recognition and automatic sound classification techniques locally to maintain privacy.
[0053] The operating system 122 receives the machine-learned model description and displays it using the display component 118. The display component 118 can also display other visual elements (e.g., selectable controls that allow the user to perform an action on the computing device 202) related to the descriptions. For example, the operating system 122 can present the visual elements (e.g., the selectable controls 134, the message element 138) as part of the user interface 126. A description can include transcriptions or a summary of the audible parts (e.g., the phone conversation) of voice calls and video calls. The descriptions can also identify a context for the audible parts of the audio data. The details and operation of the machine-learned model are described in greater detail with respect to FIG. 3.
[0054] FIG. 3 illustrates an example diagram 300 of a machine-learned model 302 of the computing device 202 that can provide text descriptions for identifying and providing requested user information. In other implementations, the computing device 202 can be the computing device 102 of FIG. 1 or a similar computing device.
[0055] As illustrated in FIG. 3, the machine-learned model 302 can be part of the caption module 210. The machine-learned model 302 can convert audio data 304 into the text descriptions 306 (e.g., text descriptions of selectable options provided by the IVR system 110) of the audible parts of a voice call or a video call without converting the audio data 304 into sound. The audio data 304 can include different types, forms, or variations of data from the communication application 124. For example, the audio data 304 can include raw, pre-mixed audio byte stream data or processed byte stream data. The machine-learned model 302 can include multiple machine-learned models combined into a single model that provides the text descriptions 306 in response to the audio data 304.
[0056] Applications, including the communication application 124, can use the machine-learned model 302 to process the audio data 304 into the text descriptions 306. For example, the communication application 124 can communicate through the operating system 122 or the caption module 210 with the machine-learned model 302 using an application programming interface (API) (e.g., a public API across all applications). In some implementations, the machine-learned model 302 can process the audio data 304 within a secure section or enclave of the operating system 122 or the CRM 206 to ensure user privacy and security.
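One plausible shape for such an API boundary is sketched below in Kotlin. The CaptionService interface, its describe method, and CommunicationAppClient are illustrative assumptions, not a published interface.

```kotlin
// Hypothetical sketch of an application-facing caption API.
interface CaptionService {
    // Returns a text description for a chunk of pre-mixed call audio,
    // or null if the chunk contains nothing worth describing.
    fun describe(audioChunk: ByteArray): String?
}

// A communication application hands call audio to the service and receives
// text descriptions it can render as selectable controls or message elements.
class CommunicationAppClient(private val captions: CaptionService) {
    fun onCallAudio(chunk: ByteArray): String? = captions.describe(chunk)
}
```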
[0057] The machine-learned model 302 can make inferences. In particular, the machine-learned model 302 can be trained to receive the audio data 304 as an input and provide, as output data, the text descriptions 306 of the audible parts of a call. Through performing inference using the machine-learned model 302, the caption module 210 can process the audio data 304 locally. The machine-learned model 302 can also perform classification, regression, clustering, anomaly detection, recommendation generation, and other tasks.
[0058] Engineers can train the machine-learned model 302 using supervised learning techniques. For example, engineers can train the machine-learned model 302 using training data 308 (e.g., truth data) that includes examples of descriptions inferred from examples of audio data 304 from a series of voice calls and video calls. The inferences can be manually applied by engineers or other experts, generated through crowd-sourcing, or provided by other techniques (e.g., complex speech-recognition and content-recognition algorithms). The training data 308 can include audio data from voice calls and video calls similar to the audio data 304. As an example, consider that the audio data 304 includes a voice call with an IVR system used by a medical office. The training data 308 for the machine-learned model 302 can include many audio data files from a broad range of voice calls and video calls with IVR systems. As another example, consider that the audio data 304 includes a voice call with a customer representative of a business. The training data 308 can include many audio data files from a broad range of similar voice calls and video calls. Engineers can also use unsupervised learning techniques to train the machine-learned model 302.
[0059] The machine-learned model 302 can be trained at a training computing system and then provided for storage and implementation at one or more computing devices 202. For example, the training computing system can include a model trainer. The training computing system can be included in or separate from the computing device 202 that implements the machine-learned model 302.
[0060] Engineers can also train the machine-learned model 302 online or offline. In offline training (e.g., batch learning), engineers train the machine-learned model 302 on the entirety of a static set of the training data 308. In online learning, engineers continuously train the machine-learned model 302 as new training data 308 becomes available (e.g., while the machine-learned model 302 is used on the computing device 202 to perform inference). For example, engineers can initially train the machine-learned model 302 to replicate descriptions applied to audible parts of voice calls and video calls (e.g., captioned IVR systems, captioned telephone conversations). As the machine-learned model 302 infers the text descriptions 306 from the audio data 304, the computing device 202 can feed the text descriptions 306 (and the corresponding portions of the audio data 304) back to the machine-learned model 302 as new training data 308. In this way, the machine-learned model 302 can continuously improve the accuracy of the text descriptions 306. In some implementations, a user of the computing device 202 can provide input to the machine-learned model 302 to flag a particular description as having errors. The computing device 202 can use this flag to train the machine-learned model 302 and improve future predictions.
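The feedback loop described above could, in principle, be collected as in this Kotlin sketch, where user-flagged corrections become new training examples. FeedbackCollector and TrainingExample are hypothetical names, and retention is assumed to be contingent on the explicit user permission discussed below.

```kotlin
// Hypothetical sketch: collect user-flagged corrections as new training examples.
data class TrainingExample(val audio: ByteArray, val description: String)

class FeedbackCollector(
    private val newExamples: MutableList<TrainingExample> = mutableListOf()
) {
    fun onUserCorrection(audio: ByteArray, correctedText: String) {
        // Retained only with the user's explicit permission.
        newExamples.add(TrainingExample(audio, correctedText))
    }

    fun drainForTraining(): List<TrainingExample> {
        val batch = newExamples.toList()
        newExamples.clear()
        return batch
    }
}
```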
[0061] Engineers or trainers can perform centralized training of multiple machine-learned models 302 (e.g., based on a centrally stored dataset). In other implementations, the trainer or engineer can use decentralized training techniques, including distributed training or federated learning, to train, update, or personalize the machine-learned model 302. The engineer may only use user information to personalize the machine-learned model 302 after receiving explicit permission from a user. For example, in situations in which the computing device 202 may collect user information, individual users may be provided with an opportunity to provide input to control whether programs or features of the machine-learned model 302 can collect and make use of the user information. The individual users may further be provided with an opportunity to control what the programs or features can or cannot do with the user information.
[0062] The machine-learned model 302 can be or include one or more artificial neural networks. In such an implementation, the machine-learned model 302 can include a group of connected or non-fully connected nodes (e.g., neurons). Engineers can also organize the machine-learned model 302 into one or more layers (e.g., a deep network). In a deep-network implementation, the machine-learned model 302 can include an input layer, an output layer, and one or more hidden layers positioned between the input layer and the output layer.
[0063] The machine-learned model 302 can also include one or more recurrent neural networks. For example, the machine-learned model 302 can be an end-to-end Recurrent-Neural-Network-Transducer Automatic Speech-Recognition Model. Example recurrent neural networks include long short-term memory (LSTM) recurrent neural networks, gated recurrent units, bidirectional recurrent neural networks, continuous-time recurrent neural networks, neural history compressors, echo state networks, Elman networks, Jordan networks, recursive neural networks, Hopfield networks, fully recurrent networks, and sequence-to-sequence configurations.
[0064] At least some of the nodes of a recurrent neural network can form a cycle. When configured as a recurrent neural network, the machine-learned model 302 can be especially useful for processing sequential input data (e.g., the audio data 304). For example, a recurrent neural network can pass or retain information from a previous portion of the audio data 304 to a subsequent portion of the audio data 304 using recurrent or directed cyclical node connections.
[0065] The audio data 304 can also include time-series data (e.g., sound data versus time). As a recurrent neural network, the machine-learned model 302 can analyze the audio data 304 over time to detect or predict spoken sounds and relevant non-spoken sounds to generate the text descriptions 306 of at least portions of the audio data 304. For example, the sequential sounds from the audio data 304 can indicate spoken words in a sentence (e.g., natural language processing, speech detection, or processing).
[0066] The machine-learned model 302 can also include one or more convolutional neural networks. A convolutional neural network can include multiple convolutional layers that perform convolutions over input data using learned filters or kernels. Engineers generally use convolutional neural networks to solve vision problems in still images or videos. Engineers can also apply convolutional neural networks to natural language processing of the audio data 304 to generate the text descriptions 306.
[0067] This document describes the operations of the caption module 210 and the machine-learned model 302 in greater detail with respect to FIG. 4.
EXAMPLE METHODS
[0068] FIG. 4 illustrates a flow chart of example operations 400 of a computing device that can identify and provide selectable controls and user data related to voice calls and video calls. The operations 400 are described below in the context of the computing device 202 of FIG. 2. In other implementations, the computing device 202 can be the computing device 102 of FIG. 1 or a similar computing device. The operations 400 may be performed in a different order than that illustrated in FIG. 4 or with additional or fewer operations.
[0069] At 402, the computing device optionally obtains content that includes user information of a computing device user. The computing device can use the user information to help the user retrieve requested information or save communicated information related to voice calls and video calls. Before obtaining the user information or performing the operations described below, the computing device 202 may obtain consent from the user to use the user information for voice calls and video calls. For example, the computing device 202 may only use user information after receiving explicit consent. The computing device 202 can obtain the user information from user entry into an application on the computing device 202 (e.g., inputting contact information into a user profile, inputting an account number via a third-party application) or learning it from information received in an application (e.g., an account number included in an emailed statement, saved calendar entries).
[0070] At 404, the computing device displays a graphical user interface of a communication application. For example, the computing device 202 may direct the display component 118 to present the user interface 126 for the communication application 124 in response to the user making or receiving a voice call or a video call.
[0071] At 406, the computing device obtains audio data output from the communication application executing on the computing device. The audio data includes audible parts of a voice call or a video call. For example, the communication application 124 allows a user of the computing device 202 to make and receive voice calls and video calls. The audio mixer 208 obtains the audio data 304 output from the communication application 124 during the voice calls and video calls. The audio data 304 includes audible parts of a voice call or a video call between a user of the computing device 202 and a third party. To provide selectable controls and other information to the user during the voice call or the video call, the caption module 210 can extract the audio data 304 from the audio mixer 208.
[0072] At 408, the computing device determines whether the audio data includes relevant information using the audible parts of the voice call or video call. The relevant information can be two or more selectable options of an IVR system (e.g., phone tree options), a request for user information (e.g., a request for a credit card number, address, account number), or communicated information (e.g., appointment details, contact information, account information). For example, the caption module 210, using the machine-learned model 302, can determine whether the audio data 304 includes relevant information. The relevant information can include two or more selectable options of an IVR system, a request for user information, or communicated information. The user or the third party audibly provides the relevant information during the voice call or video call. The caption module 210 or the machine-learned model 302 may filter out audio data 304 that does not require processing, including notification sounds and background noise. Examples of the machine-learned model 302 determining whether the audio data 304 includes two or more selectable options are illustrated in FIGs. 6A and 8A. Examples of the machine-learned model 302 determining whether the audio data 304 includes a request for user information are illustrated in FIGs. 6B, 6C, 7A, and 8B. Examples of the machine-learned model 302 determining whether the audio data 304 includes communicated information are illustrated in FIGs. 6D, 7B, 7C, and 8C.
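For illustration, the following Kotlin sketch separates a transcribed utterance into the three categories of relevant information named at step 408. A deployed system would rely on the machine-learned model rather than keyword matching; the keyword lists here are assumptions chosen only to make the example runnable.

```kotlin
// Hypothetical sketch: sort a transcribed utterance into the three kinds of
// "relevant information" at step 408. Keyword heuristics stand in for the
// machine-learned classifier described in the text.
enum class RelevantInfo { IVR_OPTIONS, USER_INFO_REQUEST, COMMUNICATED_INFO, NONE }

fun classify(transcript: String): RelevantInfo {
    val t = transcript.lowercase()
    return when {
        Regex("press \\d").containsMatchIn(t) -> RelevantInfo.IVR_OPTIONS
        listOf("account number", "insurance", "card number", "your address").any { it in t } ->
            RelevantInfo.USER_INFO_REQUEST
        listOf("appointment", "my email address", "my number is").any { it in t } ->
            RelevantInfo.COMMUNICATED_INFO
        else -> RelevantInfo.NONE
    }
}
```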
[0073] If the audio data does not include relevant information, at 416, the computing device displays the user interface for the communication application. For example, in response to determining that the audio data 304 does not include relevant information, the computing device 202 displays the user interface 126 of the communication application 124.
[0074] If the computing device determines the audio data includes relevant information, at 410, the computing device determines a text description of the relevant information. The text description transcribes the relevant information. For example, the caption module 210 can use the machine-learned model 302 to perform speech recognition on the audio data 304 and determine a text description 306 of the relevant information. The text description 306 provides a transcription of at least a portion of the two or more selectable options, the request for user information, or the communicated information. Examples of the machine-learned model 302 determining the text description 306 of the two or more selectable options are illustrated in FIGs. 6A and 8A. Examples of the machine-learned model 302 determining the text description 306 of the request for user information are illustrated in FIGs. 6B, 6C, 7A, and 8B. Examples of the machine-learned model 302 determining the text description of the communicated information are illustrated in FIGs. 6D, 7B, 7C, and 8C.
[0075] The caption module 210 can improve the accuracy of the text description 306 in various ways, including by biasing the machine-learned model 302 based on contexts of the computing device 202. For example, the caption module 210 may bias the machine-learned model 302 based on the identity of the third party to the voice call or video call. Consider that the user of the computing device 202 makes a voice call to a medical office. The caption module 210 can bias the machine-learned model 302 using common words from a medical office conversation. In this way, the computing device 202 can improve the text descriptions 306 for this voice call. The caption module 210 can use other contextual information types, including location information derived from a sensor 214 and information from other applications, to bias the machine-learned model 302.
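Contextual biasing of this kind might look like the Kotlin sketch below, where a phrase-boost list derived from the called party's category is handed to the recognizer. BiasConfig, BiasableRecognizer, and the category strings are illustrative assumptions rather than a defined interface.

```kotlin
// Hypothetical sketch: build a phrase-boost list from the call context so the
// recognizer favors domain terms (e.g., medical vocabulary for a clinic call).
data class BiasConfig(val boostedPhrases: List<String>)

interface BiasableRecognizer {
    fun transcribe(pcm: ByteArray, bias: BiasConfig): String
}

fun biasFor(calledPartyCategory: String): BiasConfig = when (calledPartyCategory) {
    "medical" -> BiasConfig(listOf("appointment", "insurance", "policy number", "prescription refill"))
    "utility" -> BiasConfig(listOf("service address", "account number", "outage"))
    else -> BiasConfig(emptyList())
}
```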
[0076] In some implementations, the computing device 202 can translate the text description 306 into another language before displaying it. For example, the caption module 210 may determine from the operating system 122 a preferred language of the user and translate the text description 306 into the preferred language. In this way, a Japanese user can view the text description 306 in Japanese, even if the audio data 304 is in a different language (e.g., Chinese or English).
[0077] At 412, the computing device optionally identifies user data responsive to the request for user information. The computing device does not perform this operation if the audio data does not include a request for user information. For example, in response to determining that the third party requested user information, the computing device 202 can identify user data responsive to user information requests. The computing device 202 can retrieve the user data from the CRM 206, the communication application 124, another application on the computing device 202, or remote computing devices associated with the user or the computing device 202. Consider the medical office call scenario above. A receptionist for the medical office can request that the user provide her insurance information. In response, the computing device 202 can retrieve the medical insurance provider and user account number from an email previously received by the user and stored on the computing device 202. Examples of the computing device 202 identifying user data responsive to the request for user information are illustrated in FIGs. 6B, 6C, 7A, and 8B.
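A minimal Kotlin sketch of the lookup at step 412 follows, assuming user data is stored locally, keyed by category, and gated on the user's prior consent. The ConsentGatedStore and UserRecord names are hypothetical.

```kotlin
// Hypothetical sketch: return stored user data for a requested category only
// when the user has already granted permission for that category.
data class UserRecord(val category: String, val label: String, val value: String)

class ConsentGatedStore(
    private val records: List<UserRecord>,
    private val grantedCategories: Set<String>,
) {
    fun respondTo(requestCategory: String): List<UserRecord> =
        if (requestCategory in grantedCategories) {
            records.filter { it.category == requestCategory }
        } else {
            emptyList()
        }
}
```

With such a store, a request classified under an "insurance" category could return the stored provider name and policy number that the message element displays in FIG. 6B.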
[0078] The computing device may only use the information responsive to the request for user information after the computing device receives explicit permission from a user of the computing device. For example, in situations discussed above in which the computing device may collect user data, individual users may be provided with an opportunity to provide input to control whether programs or features of the computing device can collect and make use of the user data. The individual users may further be provided with an opportunity to control what the programs or features can or cannot do with the user data.
[0079] At 414, the computing device displays the user data or selectable controls. The selectable controls are selectable by the user and include the text description. Suppose the audio data included a request for user information. In that scenario, the computing device can display the identified user data. Suppose the audio data included two or more selectable options of an IVR system. In that scenario, the user can use the selectable controls to indicate to the third party a selected option from the two or more selectable options. Suppose the audio data included communicated information. In that scenario, the user can use the selectable controls to save the communicated information in the computing device, the communication application, or another application. For example, the computing device 202 can cause the display component 118 to display the user data or the selectable controls 134. The display component 118 can provide the user data as a text notification on the user interface 126. Consider the medical office call scenario above. The display component 118 can display the medical insurance provider and user account information as a text box on the user interface 126 during the voice call. The display component 118 can also provide the selectable controls 134. The display component 118 can provide the text description 306 or the requested information as part of a button on the user interface 126 for the communication application 124. Examples of the display component 118 displaying the selectable control 134 are illustrated in FIGs. 6A and 8A. Examples of the display component 118 displaying user data are illustrated in FIGs. 6B, 6C, 7A, and 8B. Examples of the display component 118 displaying selectable control 134 and user data in response to communicated information are illustrated in FIGs. 6D, 7B, 7C, and 8C.
[0080] Consider that the medical office used the IVR system 110 to direct the voice call to the receptionist. The display component 118 can display selectable controls 134. The selectable controls 134 provide a respective text description 306 of two or more selectable options provided by the IVR system 110. The user can use the selectable controls 134 to indicate to the medical office a selected option from the two or more selectable options.
[0081] Also, consider the user scheduling an appointment with the medical office. The display component 118 can display the selectable control 134. The selectable control 134 includes the text description of the appointment. The user can use the selectable control 134 to save the appointment details to a calendar application.
[0082] At 416, the computing device displays the user interface for the communication application. For example, the display component 118 can display the user interface 126 associated with the communication application 124. The user interface 126 can include the user data and selectable controls 134.
[0083] FIG. 5 illustrates example operations 500 to identify and provide requested user information during voice calls and video calls. The operations 500 are described in the context of the computing device 202 of FIG. 2. The operations 500 may be performed in a different order or with additional or fewer operations.
[0084] At 502, a computing device obtains audio data output from a communication application executing on the computing device. The audio data includes audible parts of a voice call or a video call between a user of the computing device and a third party. For example, the audio mixer 208 of the computing device 202 can obtain audio data 304 output from the communication application 124 executing on the computing device 202. The caption module 210 can receive the audio data 304 from the audio mixer 208. The audio data 304 includes audible parts of a voice call or a video call between a user of the computing device 202 and a third party (e.g., a person, a computerized IVR system).
[0085] At 504, the computing device determines, using the audible parts, whether the audio data includes a request for user information. The third party audibly provides the request for user information during the voice call or the video call. For example, the machine-learned model 302 of the caption module 210 can determine, using the audible parts of the audio data 304, whether the audio data 304 includes a request for user information (e.g., an account number, payment information, home address). The third party audibly provides the user information request during the voice call or the video call.
[0086] At 506, responsive to determining that the audio data includes the request for user information, the computing device optionally determines a text description of the request for user information. The text description provides a transcription of at least a portion of the user information request. For example, responsive to determining that the audio data 304 includes the request for user information, the machine-learned model 302 determines a text description 306 of the request for user information. The text description 306 provides a transcription of at least a portion of the request for user information. In some implementations, the text description 306 includes a word-for-word transcription. In other implementations, the text description 306 provides a paraphrasing of the user information request.
[0087] At 508, the computing device identifies, using the audible parts, user data responsive to the request for user information. For example, the computing device 202 identifies, using the audible parts of the audio data 304, user data (e.g., a credit card number, expiration date, and card verification value (CVV)) responsive to a request for user information (e.g., payment information).
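Putting steps 502 through 508 together, a rough Kotlin sketch might look like the following. The detection heuristic and the Recognizer and UserDataSource interfaces are assumptions, and the returned result would feed the display or read-out performed at step 510.

```kotlin
// Hypothetical sketch tying steps 502-508 together: obtain call audio, detect a
// request for user information, describe it, and gather responsive user data.
interface Recognizer { fun transcribe(pcm: ByteArray): String }
interface UserDataSource { fun findFor(requestText: String): Map<String, String> }

data class HandledRequest(val description: String, val responsiveData: Map<String, String>)

fun handleCallAudio(pcm: ByteArray, asr: Recognizer, store: UserDataSource): HandledRequest? {
    val text = asr.transcribe(pcm)                       // steps 502/504: audible parts to text
    val looksLikeRequest = listOf("number", "address", "insurance", "payment")
        .any { it in text.lowercase() }                  // step 504: request detection (stand-in)
    if (!looksLikeRequest) return null
    return HandledRequest(
        description = text,                              // step 506: text description
        responsiveData = store.findFor(text)             // step 508: responsive user data
    )
}
```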
[0088] At 510, the computing device displays the user data on a display of the computing device or provides the user data to the third party during the voice call or the video call. For example, the display component 118 displays the user data on the display of the computing device 202. The computing device 202 can also provide the user data to the third party during the voice call or video call. The display includes the user interface 126.
EXAMPLE IMPLEMENTATIONS
[0089] This section illustrates example implementations of the described systems and techniques that can assist users with voice calls and video calls, which may operate separately or together in whole or in part. This section describes various example implementations, each outlined in relation to a specific drawing for ease of reading.
[0090] FIGs. 6A-6D illustrate example user interfaces of a computing device to assist users with voice calls and video calls. FIGs. 6A-6D are described in succession and in the context of the computing device 202 of FIG. 2. The computing device 202 may provide different user interfaces with fewer or additional features than those illustrated in FIGs. 6A-6D.
[0091] In FIG. 6A, the computing device 202 causes the display component 118 to display the user interface 126. The user interface 126 is associated with the communication application 124. The user interface 126 includes the caller box 128, the numeric-keypad icon 130, the speakerphone icon 132, the selectable controls 134, and the end-call icon 136.
[0092] Consider that the user has called a new medical provider Doctor Office. In this implementation, the user has placed a voice call using the communication application 124. In other implementations, the user can place a video call using the communication application 124 or another application on the computing device 202. The caller box 128 indicates the business name (e.g., Doctor Office) and telephone number (e.g., (111) 555-1234) of the third party. The Doctor Office uses the IVR system 110 to provide a menu of selectable options audibly. The IVR system 110 can direct callers to appropriate personnel and staff at the Doctor Office. Consider that the IVR system 110 provides the following dialogue upon answering the voice call: “Thank you for calling Doctor Office. Please listen to the following options and choose the option that best matches the purpose of your call today. For prescription refills, please press 1. To schedule an appointment, please press 2. For billing, please press 3. To speak to a nurse, please press 4.”
[0093] As the IVR system 110 audibly provides the selectable options, the caption module 210 obtains the audio data 304 output from the communication application 124. As described above, the audio mixer 208 can send the audio data 304 to the caption module 210. The caption module 210 then determines that the audio data 304 includes multiple selectable options. In response to this determination, the caption module 210 determines a text description 306 of the selectable options. For example, the machine-learned model 302 can transcribe at least a portion of the selectable options. The transcription can be a word-for-word transcription or paraphrasing of each of the selectable options.
[0094] The caption module 210 then causes the display component 118 to display the selectable controls 134 on the user interface 126. The selectable controls 134 include a selectable control associated with each of the selectable options provided by the IVR system 110: a first selectable control 134-1, a second selectable control 134-2, a third selectable control 134-3, and a fourth selectable control 134-4. The selectable controls 134 include the text description 306 associated with a respective selectable option. For example, the first selectable control 134-1 includes the text “1 - Prescription refills.” The number “1” indicates that the first selectable control 134-1 is associated with the first selectable option provided by the IVR system 110. The second selectable control 134-2 provides the text “2 - Schedule an appointment.” The third selectable control 134-3 displays the text “3 - Billing.” And the fourth selectable control 134-4 includes the text “4 - Speak with a nurse.” In some implementations, the selectable controls 134 can omit the numbers associated with each selectable option.
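As an illustration only, the following Kotlin sketch parses a transcribed prompt such as the Doctor Office greeting into per-option entries that could back the selectable controls 134. The fixed regular expression is an assumption for this example; the described system derives the options from the machine-learned text description rather than a pattern.

```kotlin
// Hypothetical sketch: turn a transcribed IVR prompt into per-option entries.
data class IvrOption(val digit: Char, val label: String)

fun parseIvrOptions(transcript: String): List<IvrOption> {
    // Matches phrases such as "For billing, please press 3."
    val pattern = Regex("""(?:For|To)\s+([^,\.]+),\s*please press (\d)""", RegexOption.IGNORE_CASE)
    return pattern.findAll(transcript).map { match ->
        IvrOption(digit = match.groupValues[2].first(), label = match.groupValues[1].trim())
    }.toList()
}
```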
[0095] As described above, the selectable controls 134 can be presented in various forms on the user interface 126. For example, the selectable controls 134 can be buttons, toggles, selectable text, sliders, checkboxes, or icons. The user can select a selectable control 134 to cause the computing device 202 to indicate to the IVR system 110 the selected option of the multiple selectable options.
[0096] In response to the IVR system 110 providing the selectable options, the user can select the numeric-keypad icon 130 to display a numeric keypad and select a number associated with the desired selectable option. For example, the user can select the number “2” in the numeric keypad to schedule an appointment. In response, the computing device 202 can send a DTMF tone to the IVR system 110. In other implementations, the IVR system 110 may allow the user to provide the selected option by audibly saying the number “2.” The described systems and techniques allow the user to select the selectable control 134 associated with the desired option. In this example, the user selects the second selectable control 134-2 to schedule a new appointment. In response to the user selecting the second selectable control 134-2, the input component 120 causes the computing device 202 to send a DTMF tone associated with the number “2” or audible communication of the number “2” to the IVR system 110. In this way, the described systems and techniques help the user navigate the selectable IVR menu options and select the desired option.
[0097] In some implementations, the computing device 202 can provide a series of selectable controls 134 in response to different levels of IVR menus. The computing device 202 can update the selectable controls 134 to correspond to the current selectable options. In other implementations, the computing device 202 can provide an option to display a previous menu of selectable options from earlier in the voice call or video call.
[0098] FIG. 6B is an example of the user interface 126 in response to a request for user information. In response to the user selecting the second selectable control 134-2 in the previous scenario, the IVR system 110 directs the user to a receptionist at the Doctor Office. Because the user is a new patient, the receptionist may ask a series of questions to set up an account or profile associated with the user. For example, the receptionist may request medical insurance information for the user. In this situation, the audio data 304 may include the following question: “Do you have medical insurance?” The machine-learned model 302 can determine, using audible parts of the voice call with the Doctor Office, whether the audio data 304 includes a request for user information. In this example, the machine-learned model 302 can use the words “medical insurance,” along with other parts of the conversation and the context that the third party is a medical office, to determine that the audio data 304 includes a request for user information.
[0099] The machine-learned model 302 can determine the text description 306 of the request for user information in response. In this example, the machine-learned model 302 or the caption module 210 determines the text description 306 includes “medical insurance.” The caption module 210 or the computing device 202 can then identify user data responsive to the request for medical insurance information in the CRM 206 and cause the display component 118 to display it on the user interface 126. In this example, the user data can include the insurance provider, the policy number, or the account identifier. The computing device 202 can also retrieve the medical insurance information from an email in an email application or profile information stored in a contacts application. In some implementations, the computing device 202 can store and retrieve sensitive user data from a secure enclave of the CRM 206 or other memory in the computing device 202.
[0100] The display component 118 can display the user data (e.g., insurance provider and policy number) in a message element 600 on the user interface 126. The message element 600 can be an icon, notification, message box, or similar user interface element to display textual information. The message element 600 can also include the text description 306 of the request for user information to provide context. In this example, the message element 600 provides the following text: “Your insurance provider: Apex Medical Insurance Co.” and “Your policy number: 123456789-0.” In the depicted implementation, the message element 600 provides both sets of user data in a single message element 600. In other implementations, the display component 118 can include the user data in multiple message elements 600.
[0101] The display component 118 displays the message element 600 on the user interface 126 shortly after the receptionist asks the question. In some implementations, the computing device 202 can determine from the audio data 304 that the user is a new patient at the Doctor Office. In response to this context, the machine-learned model 302 or the caption module 210 can anticipate that the receptionist will ask for medical insurance information and retrieve this user data. In other implementations, the machine-learned model 302 or the caption module 210 can anticipate that the medical insurance information may be requested when the user calls a medical office. In such situations, the medical insurance information can be displayed in response to a request for this information.
[0102] The computing device 202 can use the sensors 214 to determine the context of the computing device 202. In response to determining that the user is not looking at the display, the computing device 202 can cause the audio component 116 to provide an audio signal or haptic feedback. The audio signal can alert the user that user data related to a user information request is displayed. For example, if the computing device 202 determines that the user is holding the computing device 202 to her ear (e.g., by using a proximity sensor, gyroscope, or accelerometer), the computing device 202 can cause the audio component 116 to provide an audio signal (e.g., a soft tone) that only the user can hear. In other implementations, the computing device 202 can provide haptic feedback to the user as an alert.
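The alert decision in paragraph [0102] could be reduced to a small policy like the Kotlin sketch below, in which sensor-derived context selects a visual, haptic, or soft-tone alert. DeviceContext, the Alert values, and chooseAlert are hypothetical names for illustration.

```kotlin
// Hypothetical sketch: choose how to alert the user that responsive data is
// available, based on simple sensor-derived context.
enum class Alert { VISUAL_ONLY, SOFT_TONE, HAPTIC }

data class DeviceContext(val nearEar: Boolean, val hapticsEnabled: Boolean)

fun chooseAlert(context: DeviceContext): Alert = when {
    !context.nearEar -> Alert.VISUAL_ONLY   // user is likely looking at the screen
    context.hapticsEnabled -> Alert.HAPTIC
    else -> Alert.SOFT_TONE                 // a tone only the user can hear
}
```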
[0103] In response to reading the message element 600 with the medical insurance information, the user can audibly provide this information to the receptionist. In some situations, the user may be in a public setting and may not want to provide the user data audibly. As a result, the user can select one of several selectable controls 134. The display component 118 displays a fifth selectable control 134-5 and a sixth selectable control 134-6. The fifth selectable control 134-5 includes the following text: “Read my insurance provider.” The sixth selectable control 134-6 includes the following text: “Read my policy number.” In response to the user selecting one of the selectable controls 134, the computing device 202 causes the audio mixer 208 to audibly read the respective user data to the receptionist without requiring the user to provide this information audibly. In other implementations, the computing device 202 can give the user additional selectable controls 134 to email, text, or otherwise send the user data (e.g., the medical insurance information) to the receptionist. In this way, the described techniques and systems provide a secure and private way to share sensitive user data with another person or entity during voice calls and video calls.
[0104] In FIG. 6C, the computing device 202 provides user data in response to a proposed appointment time. Consider the previous voice call to the Doctor Office. After the user provides her medical insurance information, the receptionist suggests an appointment at 11 am on Tuesday. For example, the audio data 304 includes the following question from the receptionist: “Does next Tuesday at 11 am work for you?” In response to the proposed time, the computing device 202 can check user calendar information in a calendar application and identify a potential conflict. In this example, the user has a dentist appointment scheduled at 11:15 am on Tuesday. The computing device 202 causes the display component 118 to display this information in the message element 600. For example, the display component 118 can display the following text: “Dentist appointment at 11:15 am.” In some implementations, the computing device 202 can also automatically suggest alternative times based on the user calendar information. The display component 118 can display the following text: “You have a conflict, try these times instead: Tues. at 9:30 am [or] Wed. at 1:00 pm.” In this way, the computing device 202 helps the user schedule a new appointment at the Doctor Office. The user does not need to recall the previously scheduled dentist appointment or open the calendar application on the computing device 202 while talking to the receptionist. The user can also avoid calling the Doctor Office back to reschedule the appointment after recalling the conflict.
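A simple version of the calendar check behind FIG. 6C is sketched below in Kotlin, assuming calendar entries are available locally and that an assumed visit length is used to test for overlap. CalendarEntry and conflictFor are illustrative names. If a conflict is found, free slots could then be proposed as alternative times, as the message element in FIG. 6C suggests.

```kotlin
// Hypothetical sketch: check a proposed appointment time against stored
// calendar entries and surface any overlapping entry as a conflict.
import java.time.Duration
import java.time.LocalDateTime

data class CalendarEntry(val title: String, val start: LocalDateTime, val end: LocalDateTime)

fun conflictFor(
    proposedStart: LocalDateTime,
    visitLength: Duration,
    entries: List<CalendarEntry>,
): CalendarEntry? {
    val proposedEnd = proposedStart.plus(visitLength)
    // Two intervals overlap when each starts before the other ends.
    return entries.firstOrNull { it.start < proposedEnd && proposedStart < it.end }
}
```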
[0105] In FIG. 6D, the computing device 202 displays communicated information related to the voice call. Consider the previous voice call to the Doctor Office. The receptionist had an appointment slot available at 1 pm on Wednesday and confirmed the appointment by saying: “We have you scheduled for an appointment at 1 pm on Wednesday, November 4.” In response, the computing device 202 can cause the display component 118 to display the details of the appointment in the message element 600. For example, the message element 600 can provide the following communicated information: “Wednesday, Nov. 4, 2020 at 1 pm, Medical Appointment @ Doctor Office.”
[0106] The computing device 202 can also provide the user with several selectable controls related to the communicated information, including a seventh selectable control 134-7 and an eighth selectable control 134-8. In this example, the seventh selectable control 134-7 displays the text “Save to Calendar.” When selected, the seventh selectable control 134-7 causes the computing device 202 to save the appointment information to the calendar application. The eighth selectable control 134-8 displays the text “Send to Spouse.” When selected, the eighth selectable control 134-8 causes the computing device 202 to send the appointment information to the spouse. The user can also cause the computing device 202 to save the appointment information to the calendar application via audible commands.
[0107] The computing device 202 can cause the display component 118 to leave the message element 600 and the selectable controls 134 related to the appointment on the user interface 126 until the termination of the voice call and for several minutes after that. In other implementations, the user can retrieve this information, including the message element 600 and the selectable controls, by selecting the conversation with the Doctor Office in a history menu of the communication application 124. In this way, the user can save communicated information from a voice call or a video call without writing down the appointment, recalling the appointment later, or separately entering the appointment into the calendar application. The features and functionality described with respect to FIGs. 6A-6D allow the computing device 202 to provide a more user-friendly experience for voice calls and video calls.
[0108] FIGs. 7A-7C illustrate other example user interfaces of a computing device to assist users with voice calls and video calls. FIGs. 7A-7C are described in succession and in the context of the computing device 202. The computing device 202 may provide different user interfaces with fewer or additional features than those illustrated in FIGs. 7A-7C.
[0109] In FIG. 7A, the computing device 202 causes the display component 118 to display the user interface 126. Consider that the user has placed a voice call using the communication application 124 to her friend Amy. The caller box 128 provides Amy’s name and telephone number (e.g., (111) 555-6789). During the voice call, Amy asks the user for her new address. As illustrated in FIG. 7A, the audio data 304 includes the following phrase: “What is your new address?”
[0110] In response to determining that the audio data 304 includes a request for user information (e.g., the user address), the computing device 202 determines a description of the request. In this example, the caption module 210 determines the text description 306 of the request includes the user’s home address. The computing device 202 finds the home address in the CRM 206 and displays it on the user interface 126. For example, the display component 118 can cause a message element 700 to provide the text description 306 and the responsive user data. The message element 700 provides the following information: “Your address: 100 First Street, San Francisco, CA 94016.” In most situations, the user likely recalls this user data but may need help recalling specific details (e.g., the zip code).
[0111] The computing device 202 can also cause the display component 118 to display selectable controls 702. The user can audibly provide her home address to Amy. In some situations, the user may be in a public setting and may not want to provide her address audibly. As a result, the user can select one of the selectable controls 702. In this example, the selectable controls 702 include a first selectable control 702-1, a second selectable control 702-2, and a third selectable control 702-3. The first selectable control 702-1 includes the following text: “Read my address.” When selected, the first selectable control 702-1 causes the audio mixer 208 to audibly read the home address to Amy without requiring the user to provide this information audibly. The second selectable control 702-2 includes the following text: “Text my address.” When selected, the second selectable control 702-2 causes the communication application 124 or another application to send, using the communication units 112, a text message to Amy with the home address. The third selectable control 702-3 includes the following text: “Email my address.” The third selectable control 702-3 causes an email application to send an email to Amy with the home address when selected. The computing device 202 can obtain the email address for Amy from a contacts application. In this way, the computing device 202 provides the user with a safe way to share sensitive user data on a voice call or a video call without audibly broadcasting it to nearby individuals.
[0112] In FIG. 7B, the computing device 202 displays communicated information related to the voice call. Consider the previous voice call with Amy and that Amy provides new contact information (e.g., her new work email address). In response, the computing device 202 provides the communicated information to the user. The caption module 210 determines that the audio data 304 includes Amy providing her new email address: “My email address is amy@email.com.” The display component 118 then displays the new email address in the message element 700. The message element 700 provides the following text: “Amy’s email address: amy@email.com.”
[0113] In some implementations, the computing device 202 can verify that the new email address is not saved on the computing device 202 (e.g., in a contacts application or an email application). If the new email address is saved, then the computing device 202 may cause the caption module 210 not to display this communicated information. If the new email address is not saved, then the computing device 202 may cause the caption module 210 to display this communicated information.
[0114] The computing device 202 can display a fourth selectable control 702-4. The fourth selectable control 702-4 includes the following text: “Save in Contacts.” When selected, the fourth selectable control 702-4 causes the computing device 202 to save the email address to a contacts application.

[0115] In FIG. 7C, the computing device 202 provides additional selectable controls in response to communicated information during the voice call. Consider the previous voice call with Amy and that the user and Amy agree to meet for lunch. The audio data 304 includes the following phrase audibly spoken by the user: “I’ll meet you in 20 minutes at Mary’s Diner.” In response to this communicated information, the computing device 202 can display the address for Mary’s Diner in the message element 700. The message element 700 includes the following text: “Address for Mary’s Diner: 500 S. 20th Street, San Francisco, CA 94016.” The computing device 202 can also display a fifth selectable control 702-5. The fifth selectable control 702-5 displays the following text: “Directions to Mary’s Diner.” When selected, the fifth selectable control 702-5 causes the computing device 202 to initiate navigation instructions from a navigation application.
[0116] In some implementations, the fifth selectable control 702-5 can be a slice window of the navigation application that provides a subset of functionalities of the navigation application related to the communicated information. For example, the slice window for the navigation application can allow the user to select walking directions, driving directions, or public transport directions to Mary’s Diner.
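One possible shape of this directions suggestion, sketched under the assumption that a place directory maps known business names to addresses. The TravelMode options illustrate the walking, driving, and public-transport choices a slice window might expose; all names here are hypothetical.

```kotlin
// Travel modes a navigation slice might offer for the detected place.
enum class TravelMode { WALKING, DRIVING, TRANSIT }

data class DirectionsSlice(val place: String, val address: String, val modes: List<TravelMode>)

fun directionsFor(transcript: String, placeDirectory: Map<String, String>): DirectionsSlice? {
    // Find the first known place mentioned in the audible parts of the call.
    val hit = placeDirectory.entries.firstOrNull { transcript.contains(it.key, ignoreCase = true) }
        ?: return null
    return DirectionsSlice(hit.key, hit.value, TravelMode.values().toList())
}

fun main() {
    val places = mapOf("Mary's Diner" to "500 S. 20th Street, San Francisco, CA 94016")
    println(directionsFor("I'll meet you in 20 minutes at Mary's Diner.", places))
}
```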
[0117] FIGs. 8A-8D illustrate other example user interfaces of a computing device to assist users during voice calls and video calls. FIGs. 8A-8D are described in succession and in the context of the computing device 202 of FIG. 2. The computing device 202 may provide different user interfaces with fewer or additional features than those illustrated in FIGs. 8A-8D.
[0118] In FIG. 8A, the computing device 202 causes the display component 118 to display the user interface 126 with message element 800 and selectable controls 802 in response to selectable options of the IVR system 110. Consider that the user placed a voice call to a new utility provider, Utility Company. The caller box 128 indicates the business name (e.g., Utility Company) and telephone number (e.g., (111) 555-2345) of the called party.
[0119] The IVR system 110 uses a voice response system that prompts callers to provide audio responses to a series of questions and statements. Consider that the audio data 304 includes the following statement: “Thank you for contacting us about becoming a new customer. Please state the type of service you are interested in.” The IVR system 110 can listen for a phrase that matches or closely matches a list of offered services. For example, the Utility Company can listen for one of the following selectable options: home internet service, home telephone, or TV services. The computing device 202 can determine that the audio data 304 includes an implicit list of two or more selectable options. The display component 118 can display the following text in the message element 800: “Listed below are common responses offered by new customers.” In this example, the selectable controls 802 can include a first selectable control 802-1 (e.g., “Home Internet Service”), a second selectable control 802-2 (e.g., “Home Telephone”), and a third selectable control 802-3 (e.g., “TV Services”). The selectable controls 802 can include additional or fewer suggestions. The user can select one of the selectable controls 802, causing the audio mixer 208 to provide the selected option to the IVR system 110 audibly.
[0120] The computing device 202 can determine the potential suggestions based on the audio data 304 by deciphering the available services from audible parts of the voice call. The computing device 202 can also determine the selectable options based on data obtained from other computing devices given a similar request by the same utility provider or similar companies. In this way, the computing device 202 can help the user navigate open-ended IVR prompts and avoid giving ineffective responses or causing the system to restart.
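A hedged sketch of how an open-ended IVR prompt might be turned into tappable options and how a selection might be spoken back into the call. The candidate-service list and the speak callback are placeholders for the caption module’s suggestions and the audio mixer; the trigger phrase is an assumption for illustration only.

```kotlin
// Produce the option labels that would back the selectable controls 802-1..802-3.
fun selectableOptions(prompt: String, knownServices: List<String>): List<String> {
    val asksForService = prompt.contains("type of service", ignoreCase = true)
    return if (asksForService) knownServices else emptyList()
}

// Selecting a control hands the chosen phrase to the audio path to be spoken into the call.
fun respondToIvr(selected: String, speak: (String) -> Unit) {
    speak(selected)
}

fun main() {
    val prompt = "Please state the type of service you are interested in."
    val options = selectableOptions(
        prompt,
        listOf("Home Internet Service", "Home Telephone", "TV Services")
    )
    options.forEach(::println)  // rendered as selectable controls
    respondToIvr(options.first()) { println("Speaking into call: $it") }
}
```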
[0121] FIG. 8B is an example of the user interface 126 in response to a request for user information (e.g., payment information). In response to the user selecting home internet services, the IVR system 110 directs the user to an account specialist to set up a new account and initiate home internet services. Because the user is a new account holder, the account specialist collects payment information, including a credit card number, to set up the account. For example, the audio data 304 may include the following request from the specialist: “Please provide a preferred form of payment for your new services.” In response to determining that the audio data 304 includes a request for user information, the computing device 202 determines a text description 306 of the request. In this example, the caption module 210 determines the text description 306 asks for credit card information. The computing device 202 identifies the credit card information in the CRM 206 and displays the user data on the user interface 126. The message element 800 includes the following information: “Your credit card information: #### - #### - #### - 1234, [Expiration date:] 01/21, and [PIN] 789.”
[0122] The computing device 202 can also determine whether the user data includes sensitive information. In response to determining that a portion of the user data is sensitive information, the computing device 202 can obscure a portion of the sensitive information (e.g., replacing at least some digits of the credit card number with a different symbol, such as “*”, or omitting them). In this way, the computing device 202 can maintain secrecy of the sensitive information and obscure it from other persons.
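The obscuring step lends itself to a concrete example. The sketch below keeps only the last four digits of a card number and replaces the rest with “*”; the exact masking policy (how many digits to keep, which symbol to use) is an assumption, not something the disclosure fixes.

```kotlin
// Mask all but the last `keepLast` digits of a card number for on-screen display.
fun maskCardNumber(cardNumber: String, keepLast: Int = 4, mask: Char = '*'): String {
    val digits = cardNumber.filter { it.isDigit() }
    return digits.mapIndexed { i, c ->
        if (i >= digits.length - keepLast) c else mask   // keep the trailing digits only
    }.joinToString("")
        .chunked(4)
        .joinToString(" - ")                             // group as #### - #### - ...
}

fun main() {
    println(maskCardNumber("4111 1111 1111 1234"))  // **** - **** - **** - 1234
}
```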
[0123] The display component 118 can display a selectable control 802 to maintain the secrecy of the user data. In this example, the display component 118 displays a fourth selectable control 802-4 that includes the following text: “Read my credit card information.” When selected, the fourth selectable control 802-4 causes the computing device 202 to audibly read the complete credit card number, expiration date, and PIN to the account specialist. In this way, the computing device 202 provides a secure way for the user to share sensitive credit card information with the account specialist.
[0124] In FIG. 8C, the computing device 202 displays communicated information related to the voice call. Consider the previous voice call to the Utility Company. The account specialist provides account information (e.g., an account number and personal identification number (PIN)) to the user. In this situation, the audio data 304 includes the following statement: “Your new account number is UTIL12345, and the PIN associated with your account is 6789.” In response, the computing device 202 displays the account number and PIN in the message element 800. Specifically, the message element 800 displays: “Your account number: UTIL12345, Your PIN: 6789.” The computing device 202 can provide the user with a fifth selectable control 802-5 and a sixth selectable control 802-6. The fifth selectable control 802-5 includes the following text: “Save in Contacts.” When selected, the fifth selectable control 802-5 causes the computing device 202 to save the account number and PIN to a contacts application. The sixth selectable control 802-6 includes the following text: “Save in Secure Memory.” When selected, the sixth selectable control 802-6 causes the computing device 202 to save the account number and PIN to a secure memory that an application or a user needs special privileges to access.
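A small sketch of the two save targets, with the secure store modeled as an in-memory map that refuses unprivileged access. A real device would rely on platform keystore or credential APIs rather than this toy class; the class and key names are illustrative.

```kotlin
// In-memory stand-in for a secure memory region gated by a privilege check.
class SecureStore {
    private val entries = mutableMapOf<String, String>()

    fun save(key: String, value: String, callerPrivileged: Boolean) {
        require(callerPrivileged) { "secure memory requires special privileges" }
        entries[key] = value
    }

    fun read(key: String, callerPrivileged: Boolean): String? =
        if (callerPrivileged) entries[key] else null
}

fun main() {
    val store = SecureStore()
    store.save("UtilityCompany.account", "UTIL12345 / PIN 6789", callerPrivileged = true)
    println(store.read("UtilityCompany.account", callerPrivileged = true))
    println(store.read("UtilityCompany.account", callerPrivileged = false)) // null: access denied
}
```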
[0125] In FIG. 8D, the computing device 202 displays communicated information related to a previous voice call. Consider the previous voice call to the Utility Company. In this example, the user could not review the communicated information displayed on the user interface during or shortly after the voice call. The computing device 202 can store the message element 800, the fifth selectable control 802-5, the sixth selectable control 802-6, or a combination thereof related to the voice call. In this way, the user can access the text description 306 of the communicated information later.
[0126] The call history can provide a user interface 126 associated with each voice call or video call. For example, the user interface 126 associated with the history of the voice call with the Utility Company can include a history element 804. The history element 804 can include historical information about the voice call, including the following text: “Outgoing call on November 2.”
[0127] In some situations, the user may need to make another voice call or video call immediately after the termination of the voice call with the Utility Company or may need to perform another functionality on the computing device 202. The computing device 202 can store the message elements 800 and the selectable controls 802 associated with each voice call or video call in memory associated with the communication application 124. The communication application 124 can include a call history. In this way, the user can retrieve the message element 800 and the selectable controls 802 related to a voice call or video call later when convenient.
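The call-history behavior can be sketched as a map from call identifiers to the stored message elements and controls. The CallRecord and CallHistory names are illustrative, not the communication application 124’s actual persistence layer.

```kotlin
// Per-call record of the message elements and selectable controls shown during the call.
data class CallRecord(
    val peer: String,
    val startedAt: String,
    val messages: MutableList<String> = mutableListOf(),
    val controls: MutableList<String> = mutableListOf(),
)

class CallHistory {
    private val records = mutableMapOf<String, CallRecord>()

    fun record(callId: String, record: CallRecord) { records[callId] = record }
    fun lookup(callId: String): CallRecord? = records[callId]
}

fun main() {
    val history = CallHistory()
    history.record("call-42", CallRecord("Utility Company", "Outgoing call on November 2").apply {
        messages += "Your account number: UTIL12345, Your PIN: 6789"
        controls += listOf("Save in Contacts", "Save in Secure Memory")
    })
    println(history.lookup("call-42"))  // retrieved later from the call history
}
```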
EXAMPLES
[0128] In the following section, examples are provided.
[0129] Example 1: A method comprising: obtaining, by a computing device, audio data output from a communication application executing on the computing device, the audio data comprising audible parts of a voice call or a video call between a user of the computing device and a third party; determining, by the computing device and using the audible parts, whether the audio data includes a request for user information, the request for user information audibly provided by the third party during the voice call or the video call; identifying, by the computing device and using the audible parts, user data that is responsive to the request for user information; and displaying the user data on a display of the computing device or providing, by the computing device, the user data to the third party during the voice call or the video call.
[0130] Example 2: The method of example 1, the method further comprising: responsive to determining that the audio data includes the request for user information, determining, by the computing device, a text description of the request for user information, the text description providing a transcription of at least a portion of the request for user information; and displaying the text description on the display of the computing device.
[0131] Example 3: The method of any preceding example, wherein displaying the user data on the display of the computing device comprises displaying one or more selectable controls on the display of the computing device, the one or more selectable controls configured to be selectable by the user to provide, by the computing device, the user data to the third party without the user audibly communicating the user data.
[0132] Example 4: The method of example 3, the method further comprising: receiving a selection of one selectable control of the one or more selectable controls by the user; and responsive to receiving the selection of the one selectable control, communicating, by the computing device, the user data to the third party.

[0133] Example 5: The method of example 4, wherein the user data is communicated to the third party by at least one of emailing the user data to the third party, texting the user data to the third party, transmitting a digital file containing the user data to the third party, providing a series of dual-tone multi-frequency (DTMF) tones that represent the user data to the third party, or audibly reading, by the computing device, the user data to the third party.
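As one illustration of the DTMF option in Example 5, the sketch below maps each digit of the user data to its standard DTMF low/high frequency pair (from the conventional telephone keypad layout); synthesizing the tones and injecting them into the call audio is left out, and the function names are hypothetical.

```kotlin
// Standard DTMF keypad frequencies (Hz): low-group row tone paired with high-group column tone.
val dtmf: Map<Char, Pair<Int, Int>> = mapOf(
    '1' to (697 to 1209), '2' to (697 to 1336), '3' to (697 to 1477),
    '4' to (770 to 1209), '5' to (770 to 1336), '6' to (770 to 1477),
    '7' to (852 to 1209), '8' to (852 to 1336), '9' to (852 to 1477),
    '*' to (941 to 1209), '0' to (941 to 1336), '#' to (941 to 1477),
)

// Map each dialable character of the user data to its tone pair; skip anything else.
fun tonesFor(userData: String): List<Pair<Int, Int>> =
    userData.mapNotNull { dtmf[it] }

fun main() {
    println(tonesFor("6789"))  // e.g., an account PIN rendered as DTMF tone pairs
}
```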
[0134] Example 6: The method of any preceding example, the method further comprising: determining, by the computing device, whether the user is not looking at the display of the computing device; and responsive to determining that the user is not looking at the display of the computing device, providing, by the computing device, an audio signal or haptic feedback to alert the user that the user data is displayed on the computing device.
[0135] Example 7: The method of any of examples 4 through 6, the method further comprising: responsive to communicating the user data to the third party, identifying, by the computing device and using the audio data, a description of an additional request for user information, the additional request for user information audibly provided by the third party during the voice call or the video call; identifying, by the computing device, additional user data responsive to the additional request for user information; and displaying the additional user data on the display of the computing device during the voice call or the video call.
[0136] Example 8: The method of any preceding example, wherein the request for user information comprises a request by the third party for at least one of an account number, a reservation number, a confirmation number, a social security number, calendar information, personal information, a credit card number, or an address of the user.
[0137] Example 9: The method of any of examples 2 through 8, wherein determining the text description of the request for user information comprises executing, by the computing device, a machine-learned model to determine the text description of the request for user information, the machine-learned model trained to determine text descriptions from the audio data, the audio data received from an audio mixer of the computing device.
[0138] Example 10: The method of example 9, wherein the machine-learned model comprises an end-to-end Recurrent-Neural-Network-Transducer Automatic Speech-Recognition Model.
[0139] Example 11: The method of any preceding example, the method further comprising: determining, by the computing device and using the audible parts, whether the audio data includes two or more selectable options, the two or more selectable options audibly provided by the third party during the voice call or the video call; responsive to determining that the audio data includes the two or more selectable options, determining, by the computing device, a text description of the two or more selectable options, the text description of the two or more selectable options providing a transcription of at least a portion of the two or more selectable options; and displaying two or more other selectable controls on a display of the computing device, the two or more other selectable controls configured to be selectable by the user to provide an indication to the third party of a selected option of the two or more selectable options, each of the two or more other selectable controls providing the text description of a respective selectable option.
[0140] Example 12: The method of any preceding example, the method further comprising: determining, by the computing device and using the audible parts, whether the audio data includes communicated information, the communicated information related to a context of the voice call or the video call and audibly provided by the third party or the user during the voice call or the video call; responsive to determining that the audio data includes the communicated information, determining, by the computing device, a text description of the communicated information, the text description of the communicated information providing a transcription of at least a portion of the communicated information; and displaying another selectable control on the display, the other selectable control providing the text description of the communicated information and configured to be selectable by the user to save the communicated information in at least one of the computing device, the application, or another application on the computing device.
[0141] Example 13: The method of any preceding example, wherein the computing device comprises a smartphone, a computerized watch, a tablet device, a wearable device, or a laptop computer.
[0142] Example 14: A computing device comprising at least one processor configured to perform any of the methods of examples 1 through 13.
[0143] Example 15: A computer-readable storage medium comprising instructions that, when executed, configure a processor of a computing device to perform any of the methods of examples 1 through 13.
CONCLUSION
[0144] While various configurations and methods to identify and provide requested user information during voice calls and video calls have been described in language specific to features and/or methods, it is to be understood that the subject of the appended claims is not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as non-limiting examples of identifying and providing requested user information during voice calls and video calls. Further, although various examples have been described above, with each example having certain features, it should be understood that it is not necessary for a particular feature of one example to be used exclusively with that example. Instead, any of the features described above and/or depicted in the drawings can be combined with any of the examples, in addition to or in substitution for any of the other features of those examples.

CLAIMS

What is claimed is:
1. A method comprising: obtaining, by a computing device, audio data output from a communication application executing on the computing device, the audio data comprising audible parts of a voice call or a video call between a user of the computing device and a third party; determining, by the computing device and using the audible parts, whether the audio data includes a request for user information, the request for user information audibly provided by the third party during the voice call or the video call; identifying, by the computing device and using the audible parts, user data that is responsive to the request for user information; and displaying the user data on a display of the computing device or providing, by the computing device, the user data to the third party during the voice call or the video call.
2. The method of claim 1, the method further comprising: responsive to determining that the audio data includes the request for user information, determining, by the computing device, a text description of the request for user information, the text description providing a transcription of at least a portion of the request for user information; and displaying the text description on the display of the computing device.
3. The method of any preceding claim, wherein displaying the user data on the display of the computing device comprises displaying one or more selectable controls on the display of the computing device, the one or more selectable controls configured to be selectable by the user to provide, by the computing device, the user data to the third party without the user audibly communicating the user data.
4. The method of claim 3, the method further comprising: receiving a selection of one selectable control of the one or more selectable controls by the user; and responsive to receiving the selection of the one selectable control, communicating, by the computing device, the user data to the third party.
5. The method of claim 4, wherein the user data is communicated to the third party by at least one of emailing the user data to the third party, texting the user data to the third party, transmitting a digital file containing the user data to the third party, providing a series of dual-tone multi-frequency (DTMF) tones that represent the user data to the third party, or audibly reading, by the computing device, the user data to the third party.
6. The method of any preceding claim, the method further comprising: determining, by the computing device, whether the user is not looking at the display of the computing device; and responsive to determining that the user is not looking at the display of the computing device, providing, by the computing device, an audio signal or haptic feedback to alert the user that the user data is displayed on the computing device.
7. The method of any of claims 4 through 6, the method further comprising: responsive to communicating the user data to the third party, identifying, by the computing device and using the audio data, a description of an additional request for user information, the additional request for user information audibly provided by the third party during the voice call or the video call; identifying, by the computing device, additional user data responsive to the additional request for user information; and displaying the additional user data on the display of the computing device during the voice call or the video call.
8. The method of any preceding claim, wherein the request for user information comprises a request by the third party for at least one of an account number, a reservation number, a confirmation number, a social security number, calendar information, personal information, a credit card number, or an address of the user.
9. The method of any of claims 2 through 8, wherein determining the text description of the request for user information comprises executing, by the computing device, a machine-learned model to determine the text description of the request for user information, the machine-learned model trained to determine text descriptions from the audio data, the audio data received from an audio mixer of the computing device.
10. The method of claim 9, wherein the machine-learned model comprises an end-to- end Recurrent-Neural-Network-Transducer Automatic Speech-Recognition Model.
11. The method of any preceding claim, the method further comprising: determining, by the computing device and using the audible parts, whether the audio data includes two or more selectable options, the two or more selectable options audibly provided by the third party during the voice call or the video call; responsive to determining that the audio data includes the two or more selectable options, determining, by the computing device, a text description of the two or more selectable options, the text description of the two or more selectable options providing a transcription of at least a portion of the two or more selectable options; and displaying two or more other selectable controls on a display of the computing device, the two or more other selectable controls configured to be selectable by the user to provide an indication to the third party of a selected option of the two or more selectable options, each of the two or more other selectable controls providing the text description of a respective selectable option.
12. The method of any preceding claim, the method further comprising: determining, by the computing device and using the audible parts, whether the audio data includes communicated information, the communicated information related to a context of the voice call or the video call and audibly provided by the third party or the user during the voice call or the video call; responsive to determining that the audio data includes the communicated information, determining, by the computing device, a text description of the communicated information, the text description of the communicated information providing a transcription of at least a portion of the communicated information; and displaying another selectable control on the display, the other selectable control providing the text description of the communicated information and configured to be selectable by the user to save the communicated information in at least one of the computing device, the application, or another application on the computing device.
13. The method of any preceding claim, wherein the computing device comprises a smartphone, a computerized watch, a tablet device, a wearable device, or a laptop computer.
14. A computing device comprising at least one processor configured to perform any of the methods of claims 1 through 13.
15. A computer-readable storage medium comprising instructions that, when executed, configure a processor of a computing device to perform any of the methods of claims 1 through 13.
PCT/US2020/063839 2020-12-08 2020-12-08 Identifying and providing requested user information during voice calls and video calls WO2022125078A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2020/063839 WO2022125078A1 (en) 2020-12-08 2020-12-08 Identifying and providing requested user information during voice calls and video calls


Publications (1)

Publication Number Publication Date
WO2022125078A1 true WO2022125078A1 (en) 2022-06-16

Family

ID=74141832

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/063839 WO2022125078A1 (en) 2020-12-08 2020-12-08 Identifying and providing requested user information during voice calls and video calls

Country Status (1)

Country Link
WO (1) WO2022125078A1 (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3063646A1 (en) * 2013-12-16 2016-09-07 Nuance Communications, Inc. Systems and methods for providing a virtual assistant
EP3513534A1 (en) * 2017-06-29 2019-07-24 Google LLC Proactive provision of new content to group chat participants
US20200053219A1 (en) * 2018-08-07 2020-02-13 Samsung Electronics Co., Ltd. Electronic device for performing task including call in response to user utterance and operation method thereof

Similar Documents

Publication Publication Date Title
US11349991B2 (en) Systems and methods to present voice message information to a user of a computing device
US11032421B2 (en) Systems for transitioning telephony-based and in-person servicing interactions to and from an artificial intelligence (AI) chat session
CN112136175B (en) Voice interaction for accessing calling functionality of companion device at primary device
US10521189B1 (en) Voice assistant with user data context
CN105320726B (en) Reduce the demand to manual beginning/end point and triggering phrase
US20180150762A1 (en) Multiple choice decision engine for an electronic personal assistant
US8934652B2 (en) Visual presentation of speaker-related information
CN112470454A (en) Synchronous communication using voice and text
EP3504702A1 (en) Systems and methods for artifical intelligence voice evolution
US20190237095A1 (en) Systems and methods for a neighborhood voice assistant
KR20200076439A (en) Electronic apparatus, controlling method of electronic apparatus and computer readadble medium
CN111429896B (en) Voice interaction for accessing calling functionality of companion device at primary device
US20210250438A1 (en) Graphical User Interface for a Voice Response System
US20240040039A1 (en) Selectable Controls for Interactive Voice Response Systems
US20230136309A1 (en) Virtual Assistant For Task Identification
WO2022125078A1 (en) Identifying and providing requested user information during voice calls and video calls
CN117882365A (en) Verbal menu for determining and visually displaying calls
WO2022125079A1 (en) Saving communicated information from voice calls and video calls
CN114667516A (en) Automated call classification and screening
US20220392444A1 (en) Artificial intelligence system for tasks
KR20240046433A (en) Determine and display estimated hold period for calls
US11924378B2 (en) Systems for transitioning telephony-based and in-person servicing interactions to and from an artificial intelligence (AI) chat session
US20230153061A1 (en) Hierarchical Context Specific Actions from Ambient Speech
KR20240046508A (en) Decision and visual display of voice menu for calls
CN113571143A (en) Audio information processing method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20838697

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20838697

Country of ref document: EP

Kind code of ref document: A1