WO2020023070A1 - Text-to-speech interface featuring visual content supplemental to audio playback of text documents - Google Patents


Info

Publication number
WO2020023070A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
computing devices
audio signal
visual content
playback
Application number
PCT/US2018/055643
Other languages
French (fr)
Inventor
Benedict Davies
Oliver Lade
Guillaume BONIFACE
Rafael DE LA TORRE FERNANDES
Jack WHYTE
Jakub ADAMEK
Simon Tokumine
Matthias Quasthoff
Yossi Matias
Ye Zhou
Rony Amira
Original Assignee
Google Llc
Application filed by Google Llc
Priority to CN201880095583.7A (CN112424853A)
Publication of WO2020023070A1

Classifications

    • G PHYSICS
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
          • G10L13/00 Speech synthesis; Text to speech systems
      • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
        • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
          • G09B5/00 Electrically-operated educational appliances
            • G09B5/04 Electrically-operated educational appliances with audible presentation of the material to be studied
            • G09B5/06 Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
              • G09B5/065 Combinations of audio and video presentations, e.g. videotapes, videodiscs, television systems

Definitions

  • the present disclosure relates generally to text-to-speech interfaces. More particularly, the present disclosure relates to systems and methods for interacting with and consuming textual content provided in association with computing devices, such as textual content displayed in a graphical user interface.
  • the method includes obtaining, by one or more computing devices, data descriptive of textual content included in a document.
  • the method includes generating, by the one or more computing devices, an audio signal that comprises speech of the textual content.
  • the method includes analyzing, by the one or more computing devices, one or both of the textual content and the audio signal to identify one or more semantic entities referenced by the textual content.
  • the method includes obtaining, by the one or more computing devices, one or more visual content items that are associated with the one or more semantic entities.
  • the method includes causing, by the one or more computing devices, playback of the audio signal to a user.
  • the method includes providing, by the one or more computing devices, the one or more visual content items for display to the user contemporaneously with playback of the audio signal to the user.
  • Figures 1A-12 show example user interfaces according to example embodiments of the present disclosure.
  • Figure 13A depicts a block diagram of an example computing system according to example embodiments of the present disclosure.
  • Figure 13B depicts a block diagram of an example computing device according to example embodiments of the present disclosure.
  • Figure 13C depicts a block diagram of an example computing device according to example embodiments of the present disclosure.
  • Example aspects of the present disclosure are directed to systems and methods that enable users to interact with and consume textual content provided in association with computing devices, such as textual content displayed in a graphical user interface.
  • a text-to-speech (TTS) system implemented by a computing device can automatically convert textual content displayed in a graphical user interface to an audio signal that includes speech of the textual content and the computing device can play (e.g., via a speaker or headphones) the audio signal back to the user.
  • the computing device can further provide one or more types of supplemental visual content for display to the user during playback of the audio signal to the user.
  • the supplemental visual content can include a “karaoke” style text display window in which particular segments (e.g., words or sentences) of the textual content are highlighted or otherwise visually modified as the corresponding portion of the audio signal that includes speech of such segments is played back. In such fashion, the user can easily follow along with the text as the corresponding speech of such text is being played. This enables users with low literacy or reading in a foreign language to sharpen their reading/language skills as they are able to both hear and see the text at the same time.
  • the supplemental visual content can include visual content items such as images or videos that relate to the textual content.
  • the computing device can identify one or more semantic entities referenced by the textual content, obtain one or more visual content items that are associated with (e.g., depict) the identified semantic entities, and can display the visual content items during playback of the audio signal.
  • a particular visual content item associated with a particular semantic entity can be displayed at the same time as the portion of the audio signal that references such particular semantic entity is played back to the user.
  • the systems and methods of the present disclosure provide, in addition to playback of an audio signal that verbalizes textual content included in a document, the user with a complementary visual experience that displays the text being read and/or relevant visual media in a user-controllable interface.
  • the systems and methods of the present disclosure can bring the simplicity of watching video or listening to the radio to web browsing or other document viewing scenarios.
  • a computing device (e.g., a user’s tablet, smartphone, etc.) can access a document that contains textual content.
  • the computing device can access and display a document such as a webpage, news article, word processing document, etc. that has textual content.
  • the document can be displayed within a browser application and/or other document viewing application.
  • a user can request that the textual content (or a portion thereof) be provided in an audio format.
  • a user may be unable to read text in a displayed language, may be vision-impaired, may have low literacy, or may otherwise wish to have text provided as speech in an audio output.
  • the user can select a portion of the document (e.g., a paragraph) or can simply select the document as a whole and can request that the textual content be provided in audio format.
  • the computing device can automatically operate to generate an audio signal that includes speech of the textual content without requiring a user prompt.
  • generation of an audio signal from the textual content can be automatically invoked or can be initiated once the user has indicated that they wish to receive such feature (for example by tapping on an audio shortcut button, icon, or other user interface feature).
  • the computing device can generate an audio signal that includes speech of at least some of the textual content included in the document.
  • the computing device can include a TTS system that performs various TTS techniques to convert the text to speech. Many different TTS techniques can be performed.
  • one or more machine-learned models such as, for example, an artificial neural network (hereinafter “neural network”) (e.g., a recurrent neural network) can be used to generate speech from the text.
  • the machine-learned models can be implemented on the device or can be accessed as part of a cloud service.
  • as a result of one or more TTS techniques, an audio signal (e.g., an audio file) can be produced that includes speech of the textual content.
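As a rough illustration only (not part of the original disclosure), the following Python sketch shows one way such a TTS step could be organized, preferring an on-device model and falling back to a cloud service. The `OnDeviceTtsModel` and `CloudTtsClient` types and their `synthesize` methods are hypothetical placeholders, and the per-word timings are an assumption used by the highlighting sketch further below.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class WordTiming:
    word: str
    start_s: float   # offset of the spoken word within the audio signal
    end_s: float

@dataclass
class TtsResult:
    audio: bytes                     # e.g., an encoded audio file
    word_timings: List[WordTiming]   # assumed available for highlighting

def synthesize(text: str,
               on_device_model: Optional["OnDeviceTtsModel"] = None,
               cloud_client: Optional["CloudTtsClient"] = None) -> TtsResult:
    """Convert textual content to speech, preferring an on-device
    machine-learned model and falling back to a cloud TTS service."""
    if on_device_model is not None:
        return on_device_model.synthesize(text)   # hypothetical call
    if cloud_client is not None:
        return cloud_client.synthesize(text)      # hypothetical call
    raise RuntimeError("no TTS backend available")
```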
  • the computing device can play the audio signal back to the user.
  • the computing device can include or communicate with (e.g., via a wired or wireless connection) one or more speakers, including, for example, speakers in the form of headphones.
  • the computing device can playback the audio signal to the user via the one or more speakers, such that the user hears speech of the textual content included in the document.
  • contemporaneous with playback of the audio signal, the computing device can further provide one or more types of supplemental visual content for display to the user.
  • supplemental visual content can include a “karaoke” style text display window in which certain segments (e.g., words or sentences) of the textual content are highlighted or otherwise visually modified as the corresponding portion of the audio signal that includes speech of such segments is played back.
  • the user can easily follow along with the text as the corresponding speech of such text is being played. This enables users with low literacy or reading in a foreign language to sharpen their reading/language skills as they are able to both hear and see the text at the same time.
  • the computing device can generate and provide a user interface that includes a text display area.
  • the text display area can present the user with the text that has been extracted from the document.
  • the text display area can appear as a dedicated area within the overall document, as an overlay or panel above the document content, and/or via another display mechanism.
  • as the audio signal is played back to the user, the text display area can display the current, preceding, and next section of the textual content.
  • the amount of text displayed in the text display area can range from a few words on either side of the current word being spoken within the audio signal to a sentence or two on either side of the current word, depending on variables including the user’s preference, their proficiency with the language, optimal scrolling or display characteristics, and/or the complexity and detail inherent in the topic of the content.
  • the currently displayed text can correspond to the current portion of the audio signal for which playback is currently occurring.
  • the word or words currently being spoken within the audio signal are highlighted in a fashion that is analogous to a karaoke machine.
  • Highlighting the text can include changing the color of the text, changing the formatting of the text, changing the size of the text, changing a local background color behind the text, or other visual modifications that cause the highlighted text to be visually distinguishable from other surrounding portions of the text.
  • text that has already been spoken can be faded out while currently-being-spoken text is bolder and highlighted, and upcoming text is in-between the faded and bolded looks.
  • the highlighting can be performed on a word-by-word basis as the current portion of the audio signal changes over time (that is, as the audio signal is played back so that different words of the text are spoken within the audio signal over time).
  • Other bases for highlighting (e.g., phrase-by-phrase, syllable-by-syllable, etc.) can be used as well.
  • the text can be scrolled through the text display area, such that the portion of the text that is currently being spoken within the audio signal and/or highlighted within the text display area scrolls into place as the corresponding portion of the audio signal approaches and is played back.
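A minimal sketch, assuming the TTS backend can report a start time for each spoken word (an assumption, not something the disclosure specifies), of how the currently spoken word could be located and highlighted; the `**word**` markers stand in for whatever visual treatment the interface applies.

```python
import bisect
from typing import List, Tuple

# Each entry is (start_time_seconds, word), e.g. produced alongside the TTS audio.
Timings = List[Tuple[float, str]]

def word_index_at(playback_pos_s: float, timings: Timings) -> int:
    """Index of the word being spoken at the current playback position."""
    starts = [start for start, _ in timings]
    return max(0, bisect.bisect_right(starts, playback_pos_s) - 1)

def render_text_window(playback_pos_s: float, timings: Timings,
                       context_words: int = 8) -> str:
    """Highlight the current word, with a few words of context on either side."""
    i = word_index_at(playback_pos_s, timings)
    lo, hi = max(0, i - context_words), min(len(timings), i + context_words + 1)
    return " ".join(("**" + w + "**") if j == i else w
                    for j, (_, w) in enumerate(timings[lo:hi], start=lo))
```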
  • the supplemental visual content can include visual content items such as images or videos that relate to the textual content.
  • the computing device can identify one or more semantic entities referenced by the textual content, obtain one or more visual content items that are associated with (e.g., depict) the identified semantic entities, and can display the visual content items during playback of the audio signal.
  • a particular visual content item associated with a particular semantic entity can be displayed at the same time as the portion of the audio signal that references such particular semantic entity is played back to the user.
  • the computing device can analyze one or both of the textual content and the audio signal to identify one or more semantic entities referenced by the textual content.
  • Semantic entities can include people, places, things, dates/times, events, organizations, or other semantic entities.
  • a machine-learned model stored on the computing device or accessed via a cloud service can analyze at least a portion of the textual content and/or audio signal to recognize one or more semantic entities referenced by the textual content.
  • Various natural language processing techniques can also be performed to recognize semantic entities.
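For illustration, a simple sketch of the database-matching approach mentioned above; the `ENTITY_DB` contents and the `find_entities` helper are invented examples, and a production system could instead use a machine-learned named-entity recognizer.

```python
import re
from typing import Dict, List

# Toy entity database; in practice this could be a knowledge-base lookup.
ENTITY_DB: Dict[str, str] = {
    "log cabin": "thing",
    "dog": "thing",
    "london": "place",
}

def find_entities(text: str, max_ngram: int = 3) -> List[Dict[str, object]]:
    """Match word n-grams of the text against a database of known semantic entities."""
    tokens = re.findall(r"[a-z']+", text.lower())
    found = []
    for n in range(max_ngram, 0, -1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in ENTITY_DB:
                found.append({"entity": phrase,
                              "type": ENTITY_DB[phrase],
                              "token_index": i})
    return found
```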
  • the visual content items can be displayed within a media display area of the user interface.
  • the media display area can be a complement to the text display area and can be used for showing visual content that relates to the text that is currently being displayed and/or actively in focus.
  • as the focus moves through the words of the text, the computing device can update the media display area to show media relevant to the word, concept, or theme being spoken aloud. For example, if the word currently in focus was ‘dog’ the media display area could show an image of a dog, or if the word currently in focus was part of a sentence about how to build a log cabin the media display area could display an animated GIF showing how a log cabin was constructed.
  • the visual content items on display can be extracted from the document itself (for example images that are included in, or linked from, the page) or can be fetched from a source such as the Knowledge Graph or Google Images.
  • metadata associated with visual content items can be matched against the semantic entities.
  • the displayed media can take the form of visual content such as images, GIFs or other animations, and/or videos.
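A small illustrative sketch of matching visual content metadata against a semantic entity, as described above; the candidate dictionary fields (`url`, `kind`, `tags`, `from_document`) are assumptions chosen for the example.

```python
from typing import Dict, List, Optional

def pick_visual_content(entity: str,
                        candidates: List[Dict[str, object]]) -> Optional[Dict[str, object]]:
    """Choose the candidate image/GIF/video whose metadata best matches the entity.

    Each candidate is a dict such as:
      {"url": "...", "kind": "image", "tags": ["dog", "golden retriever"]}
    Items extracted from the document itself get a small score bonus.
    """
    def score(item: Dict[str, object]) -> float:
        tags = {str(t).lower() for t in item.get("tags", [])}
        s = 1.0 if entity.lower() in tags else 0.0
        if item.get("from_document"):
            s += 0.5
        return s

    best = max(candidates, key=score, default=None)
    return best if best and score(best) > 0 else None
```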
  • in the case of videos, there can be a mechanism to mediate the audio output from the video with the playback of the audio signal.
  • choosing to play the video content could pause the playback of the audio signal.
  • the media display area does not appear at all in certain circumstances. For example, if there were no relevant media to display or the user had opted not to have it shown, the media display area may not be included in the user interface.
  • the media display area can abut the text display area or be provided within (e.g., as a sub-portion of) the text display area.
  • the visual content items can be displayed as a background of the text display area, and the text displayed within the text display area can be overlaid upon the one or more visual content items.
  • how the visual content items are displayed can be dependent upon or a function of the orientation of the device.
  • for example, if the device is in a landscape orientation, then the playback interface can be shown in full screen, while if the device is in a portrait orientation, then the playback interface can be shown on a portion of the display screen (e.g., upper area) while the document is shown in a separate portion of the display screen (e.g., lower area).
  • the device can process textual content in the document that lies ahead of the current position of the audio signal.
  • a combination of rules and/or heuristics can be used to process ahead of the audio signal.
  • Machine-learned models can also be used, for example, to evaluate a user response relevant to the content (e.g., how “sticky” is certain content) and to provide improved content predictions in subsequent operations based on such user response data.
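A hedged sketch of the look-ahead idea: process sentences ahead of the current audio position and cache visual content before playback reaches it. The `extract_entities` and `fetch_item` callables stand in for the entity-recognition and content-fetching steps described above.

```python
from typing import Callable, Dict, List

def prefetch_visual_content(upcoming_sentences: List[str],
                            extract_entities: Callable[[str], List[str]],
                            fetch_item: Callable[[str], object],
                            cache: Dict[str, object],
                            lookahead: int = 3) -> None:
    """Process text ahead of the current audio position and cue up visual content.

    Fetched items are cached so they are ready to display when the
    corresponding portion of the audio signal is played back.
    """
    for sentence in upcoming_sentences[:lookahead]:
        for entity in extract_entities(sentence):
            if entity not in cache:
                cache[entity] = fetch_item(entity)
```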
  • the user interface can further include a duration display area.
  • the duration display area can show the user how far through the playback of the entire document they are and their current position in the total playback experience.
  • the duration display area can communicate this in many different ways including as a straight linear bar or by representing the structure of the page so that distinct aspects of the document (e.g., such as headline, sub-headline, body text, images and captions) are represented.
  • the duration display area can allow the user to quickly and easily move the playback experience to a specific section of the page. This allows users to quickly navigate to specific sections of text that they wish to hear played back via audio.
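One possible sketch of the duration display logic, mapping the playback position to document sections and a tap on the bar back to a playback position; the `Section` structure is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Section:
    label: str        # e.g. "headline", "sub-headline", "body", "caption"
    start_s: float    # start offset of this section within the audio signal
    end_s: float

def section_at(position_s: float, sections: List[Section]) -> Section:
    """Section of the document currently being played back."""
    for s in sections:
        if s.start_s <= position_s < s.end_s:
            return s
    return sections[-1]

def seek_fraction(fraction: float, total_s: float) -> float:
    """Map a tap on a linear duration bar (0.0-1.0) to a playback position."""
    return max(0.0, min(1.0, fraction)) * total_s

def seek_to_section(label: str, sections: List[Section]) -> float:
    """Jump playback to the start of a named section of the page."""
    for s in sections:
        if s.label == label:
            return s.start_s
    raise KeyError(label)
```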
  • the controls available to the user can include, but are not limited to, play, pause, stop, fast forward, rewind, restart and/or move to the next or previous document (e.g., news article). There can also be capabilities to allow the user to jump forward or backwards by predefined intervals (for example jump forward/backward 30 seconds) or select a section of the text to be played on a loop.
  • the controls available to the user can also include the ability to change the speed at which the playback occurs as well as the voice and language used to speak aloud the text.
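A minimal sketch of the playback controls enumerated above (jump by a fixed interval, speed change, looping a selected span); the audio backend itself is left abstract.

```python
class PlaybackController:
    """Sketch of the user-facing playback controls."""

    def __init__(self, duration_s: float):
        self.duration_s = duration_s
        self.position_s = 0.0
        self.speed = 1.0
        self.playing = False
        self.loop = None  # optional (start_s, end_s) span to repeat

    def play(self):
        self.playing = True

    def pause(self):
        self.playing = False

    def restart(self):
        self.position_s = 0.0

    def jump(self, seconds: float):
        """Jump forward (positive) or backward (negative) by a fixed interval."""
        self.position_s = max(0.0, min(self.duration_s, self.position_s + seconds))

    def set_speed(self, speed: float):
        """Change playback rate (e.g., 0.75x, 1x, 1.5x)."""
        self.speed = speed

    def set_loop(self, start_s: float, end_s: float):
        """Repeat a selected span of the audio signal."""
        self.loop = (start_s, end_s)
```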
  • the user can also be enabled to place the playback experience into a minimalized or background state. For example, this would allow the user to undertake other tasks on their device (e.g., smartphone) while continuing to have the content of the document played back.
  • the playback user interface (e.g., text display area and/or media display area) can be reduced in size, collapsed, and/or removed from the screen when placed into the background state.
  • the playback user interface (e.g., text display area and/or media display area) can also automatically be reduced in size, collapsed, and/or removed from the screen so that the document is given greater screen space.
  • the user can also scroll the document and tap a paragraph to skip to it or replay some bits.
  • the user can keep the playback in the background. The user can interact with a notification to control the playback.
  • the playback experience for a document can also be available to users when they are offline. This allows the user to have the textual content of a document played back in audio format even when they don’t have an internet connection.
  • This offline capability can be controlled by the user (for example by pro-actively choosing to make the page available when offline) or done automatically (for example by caching visited pages or pre-caching pages before the user has visited them).
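An illustrative sketch of one way visited pages could be cached for offline playback; the cache location and file layout are assumptions, not details from the disclosure.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path.home() / ".playback_cache"   # hypothetical location

def _key(url: str) -> str:
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def cache_page(url: str, text: str, audio: bytes) -> None:
    """Store the extracted text and generated audio so playback works offline."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    (CACHE_DIR / (_key(url) + ".json")).write_text(json.dumps({"url": url, "text": text}))
    (CACHE_DIR / (_key(url) + ".audio")).write_bytes(audio)

def load_cached_page(url: str):
    """Return (text, audio) if the page was cached, else None."""
    meta = CACHE_DIR / (_key(url) + ".json")
    audio = CACHE_DIR / (_key(url) + ".audio")
    if meta.exists() and audio.exists():
        return json.loads(meta.read_text())["text"], audio.read_bytes()
    return None
```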
  • machine-learned models can be used to identify the important parts of a page for audio playback, and to generate one or more natural speaking voices. For example, multiple different speaking voices can be generated when the text being read is quoting two or more people (e.g., in an interview, press conference, theatrical dialog, etc.). This provides the ability to read billions of webpages across many different languages around the world.
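A rough sketch of assigning different voices to quoted speech versus the surrounding prose; the quote detection here is a naive regular expression and the `synthesize(text, voice_id)` callable is a hypothetical TTS backend, not an API named in the disclosure.

```python
import re
from typing import Callable, List, Tuple

def split_quotes(text: str) -> List[Tuple[str, str]]:
    """Split text into ('narrator', ...) and ('quote', ...) segments based on
    double quotation marks; a real system might use a learned model instead."""
    parts = re.split(r'"([^"]*)"', text)
    segments = []
    for i, part in enumerate(parts):
        if part.strip():
            segments.append(("quote" if i % 2 else "narrator", part.strip()))
    return segments

def synthesize_with_voices(text: str,
                           synthesize: Callable[[str, str], bytes]) -> List[bytes]:
    """Render quoted speech with a different voice than the surrounding prose."""
    voices = {"narrator": "voice_a", "quote": "voice_b"}
    return [synthesize(segment, voices[role]) for role, segment in split_quotes(text)]
```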
  • the computing device can use machine intelligence (e.g., machine-learned models) to understand where to insert speech of secondary text items in the audio signal.
  • machine intelligence (e.g., machine-learned models) can also be used to understand and generate narrative flow.
  • the systems and methods of the present disclosure provide, in addition to playback of an audio signal that verbalizes textual content included in a document, the user with a complementary visual experience that displays the text being read and/or relevant visual media in a user-controllable interface.
  • the systems and methods of the present disclosure can bring the simplicity of watching video or listening to the radio to web browsing or other document viewing scenarios.
  • the systems and methods of the present disclosure provide user interface features that enable the user to provide input, make a selection, and/or submit a command.
  • the systems and methods of the present disclosure can include or provide a graphical user interface that includes an alternative graphical shortcut that allows the user to directly access an audio signal that includes speech of textual content included in the user interface.
  • Figures 1A-12 show example user interfaces according to example embodiments of the present disclosure.
  • Figures 1A-1F show an example user flow of the playback experience.
  • a user has opened a document, which in this case is a webpage.
  • the user is provided with a graphical audio shortcut feature (e.g., shown here as a speaker icon) that enables the user to request audio playback of the textual content of the webpage.
  • the user has selected the audio shortcut feature and, as a result, the top bar 54 has expanded.
  • a chime and/or animation may accompany this action.
  • the audio playback and highlighting of textual content has started.
  • the words “Most of” are highlighted in correspondence with the audio playback of such words.
  • the user can tap the area with the highlighted text to bring up a control widget.
  • Possible buttons that can be included in the control widget include rewind, pause/play, speed control, etc.
  • the user is scrolling through the webpage to a portion of the webpage that is lower/further into the webpage.
  • In Figure 1E, the user has stopped scrolling.
  • a “play from here” button 56 shows up above the first paragraph shown in the user interface.
  • In Figure 1F, the user has selected the “play from here” button 56 and the playback experience starts from the selected paragraph.
  • Figure 3 shows an example title screen.
  • the top bar can transform and slide down to become part of the playback experience.
  • branded sound can mark the beginning of the experience, the page title and/or publisher can be verbalized, publisher imagery can be re-used, timing and an estimate of data usage can be provided, and/or other information can be presented.
  • Figure 5 shows example playback which includes a visual content item 502.
  • the visual content item 502 can be a full bleed image in the background of the interface.
  • the images can be from the document, from a knowledge graph, and/or from other locations.
  • the images can be dynamically panned in the background.
  • the playback experience can include word-by-word highlight, TTS readout, no document/webpage scroll (e.g., to maintain focus on the playback interface), a playback progress indicator, and/or other features.
  • Figure 6 shows the playback interface in an example collapsed state. For example, upon a user scroll of the document or other user interaction, the playback interface can collapse to enable document exploration.
  • Figure 7 shows example fast-forward menu options.
  • the progress bar 700 and the document can both include bookmarks 702 which allow the user to fast forward to a specific part of the document.
  • Figure 8 shows presentation of an example user interface that includes a text display area and a media display area next to the text display area.
  • On the left hand side is the document without the playback user interface. Once playback starts, as shown in the middle pane, the playback user interface 802 is presented above the document 804.
  • On the right hand side is a close up of the playback interface with various components identified.
  • Figure 9 shows another example user interface that includes a media display area 902.
  • Figures 10A-C show different example timeline options that enable fast-forwarding or other skipping among portions of the document structure.
  • Figure 10A shows the document structure/bookmarks with text bubbles along the progress bar.
  • Figure 10B shows the document structure/bookmarks by segmenting the progress bar into different portions.
  • Figure 10C shows the document structure/bookmarks with vertical columns/bars placed along the progress bar.
  • the document structure/bookmarks can be derived from HTML tags.
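For illustration, a small sketch of deriving bookmarks from HTML heading tags using only the Python standard library; a real implementation could use richer structural cues.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

class BookmarkExtractor(HTMLParser):
    """Derive document structure (e.g., headline, sub-headlines) from HTML tags,
    which can then be rendered as bookmarks along the progress bar."""

    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self._current: Optional[str] = None
        self.bookmarks: List[Tuple[str, str]] = []  # (tag, heading text)

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._current = tag

    def handle_data(self, data):
        if self._current and data.strip():
            self.bookmarks.append((self._current, data.strip()))
            self._current = None

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

parser = BookmarkExtractor()
parser.feed("<h1>Title</h1><p>Body text</p><h2>Section A</h2><p>More text</p>")
# parser.bookmarks -> [('h1', 'Title'), ('h2', 'Section A')]
```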
  • Figure 11 shows an example of the playback experience occurring in the background.
  • a notification 1100 (e.g., a sticky notification) can allow the user to control the playback while it occurs in the background. The notification 1100 can also include a smaller media display area 1102, as illustrated in Figure 11.
  • Figure 12 shows an example interface that provides suggestions to the user of what to play next. For example, after completion of playback of the audio signal, the playback interface can suggest related articles for the user to further the experience.
  • Figure 13A depicts a block diagram of an example computing system 100 according to example embodiments of the present disclosure.
  • the system 100 includes a user computing device 102, a server computing system 130, and a training computing system 150 that are communicatively coupled over a network 180.
  • the user computing device 102 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
  • the user computing device 102 includes one or more processors 112 and a memory 114.
  • the one or more processors 112 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 114 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing device 102 to perform operations.
  • the user computing device 102 can include an audio playback system 119 that implements the playback experience described herein.
  • the audio playback system 119 can be an application, a plug-in (e.g., a plug-in for a browser application), or other forms of software implementable by the one or more processors 112.
  • the user computing device 102 can access one or more documents 190 (e.g., over the network 180 and/or from the local memory 114) and the audio playback system 119 can generate an audio playback experience for textual content included in such document(s) 190.
  • the user computing device 102 can store or include one or more machine-learned models 120.
  • the machine-learned models 120 can be or can otherwise include various machine-learned models such as neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and/or linear models.
  • Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks or other forms of neural networks.
  • the one or more machine-learned models 120 can be received from the server computing system 130 over network 180, stored in the user computing device memory 114, and then used or otherwise implemented by the one or more processors 112.
  • the user computing device 102 can implement multiple parallel instances of a single machine-learned model 120.
  • one or more machine-learned models 140 can be included in or otherwise stored and implemented by the server computing system 130 that communicates with the user computing device 102 according to a client-server relationship.
  • the machine-learned models 140 can be implemented by the server computing system 130 as a portion of a web service (e.g., a semantic entity identification service).
  • one or more models 120 can be stored and implemented at the user computing device 102 and/or one or more models 140 can be stored and implemented at the server computing system 130.
  • the audio playback system can also include a TTS system 121.
  • the TTS system can perform any number of TTS techniques to generate an audio signal that includes speech of text.
  • the audio playback system can also include a visual content handler 123.
  • the visual content handler 123 can obtain visual content associated with semantic entities and can provide such visual content for display at an appropriate time.
  • the user computing device 102 can also include one or more user input component 122 that receives user input.
  • the user input component 122 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus).
  • the touch-sensitive component can serve to implement a virtual keyboard.
  • Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input.
  • the user computing device 102 can also include one or more speakers 124.
  • the speakers 124 can be physically connected to the device 102 or not physically connected to the device 102.
  • the speakers 124 can include stand-alone speakers, earbuds, or the like.
  • the server computing system 130 includes one or more processors 132 and a memory 134.
  • the one or more processors 132 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 134 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 134 can store data 136 and instructions 138 which are executed by the processor 132 to cause the server computing system 130 to perform operations.
  • the server computing system 130 includes or is otherwise implemented by one or more server computing devices. In instances in which the server computing system 130 includes plural server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.
  • the server computing system 130 can store or otherwise include one or more machine-learned models 140.
  • the models 140 can be or can otherwise include various machine-learned models.
  • Example machine-learned models include neural networks or other multi-layer non-linear models.
  • Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks.
  • the user computing device 102 and/or the server computing system 130 can train the models 120 and/or 140 via interaction with the training computing system 150 that is communicatively coupled over the network 180.
  • the training computing system 150 can be separate from the server computing system 130 or can be a portion of the server computing system 130.
  • the training computing system 150 includes one or more processors 152 and a memory 154.
  • the one or more processors 152 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 154 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 154 can store data 156 and instructions 158 which are executed by the processor 152 to cause the training computing system 150 to perform operations.
  • the training computing system 150 includes or is otherwise implemented by one or more server computing devices.
  • the training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors.
  • performing backwards propagation of errors can include performing truncated backpropagation through time.
  • the model trainer 160 can perform a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
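A minimal PyTorch-flavored sketch of the training techniques named above (backpropagation of errors with truncated backpropagation through time, plus weight decay and dropout for generalization); the model, data, and hyperparameters are placeholders, not the models of the disclosure.

```python
import torch
from torch import nn

model = nn.LSTM(input_size=32, hidden_size=64, num_layers=2,
                dropout=0.2, batch_first=True)          # dropout for generalization
head = nn.Linear(64, 32)
params = list(model.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-5)  # weight decay
loss_fn = nn.MSELoss()

sequence = torch.randn(8, 400, 32)   # placeholder training batch (B, T, F)
target = torch.randn(8, 400, 32)
chunk = 50                           # truncation length for BPTT

hidden = None
for t in range(0, sequence.size(1), chunk):
    x = sequence[:, t:t + chunk]
    y = target[:, t:t + chunk]
    out, hidden = model(x, hidden)
    # Detach the hidden state so gradients do not flow past the chunk boundary
    # (truncated backpropagation through time).
    hidden = tuple(h.detach() for h in hidden)
    loss = loss_fn(head(out), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```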
  • the training examples can be provided by the user computing device 102.
  • the model 120 provided to the user computing device 102 can be trained by the training computing system 150 on user-specific data received from the user computing device 102. In some instances, this process can be referred to as personalizing the model.
  • the model trainer 160 includes computer logic utilized to provide desired functionality.
  • the model trainer 160 can be implemented in hardware, firmware, and/or software controlling a general purpose processor.
  • the model trainer 160 includes program files stored on a storage device, loaded into a memory and executed by one or more processors.
  • the model trainer 160 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, a hard disk, or optical or magnetic media.
  • the network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links.
  • communication over the network 180 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
  • Figure 13A illustrates one example computing system that can be used to implement the present disclosure.
  • the user computing device 102 can include the model trainer 160 and the training dataset 162.
  • the models 120 can be both trained and used locally at the user computing device 102.
  • the user computing device 102 can implement the model trainer 160 to personalize the models 120 based on user-specific data.
  • Figure 13B depicts a block diagram of an example computing device 10 that performs according to example embodiments of the present disclosure.
  • the computing device 10 can be a user computing device or a server computing device.
  • the computing device 10 includes a number of applications (e.g., applications 1 through N). Each application contains its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model.
  • Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
  • each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components.
  • each application can communicate with each device component using an API (e.g., a public API).
  • the API used by each application is specific to that application.
  • Figure 13C depicts a block diagram of an example computing device 50 that performs according to example embodiments of the present disclosure.
  • the computing device 50 can be a user computing device or a server computing device.
  • the computing device 50 includes a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer.
  • Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
  • each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).
  • the central intelligence layer includes a number of machine-learned models. For example, as illustrated in Figure 13C, a respective machine-learned model can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of the computing device 50.
  • the central intelligence layer can communicate with a central device data layer. The central device data layer can be a centralized repository of data for the computing device 50.
  • the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components.
  • the central device data layer can communicate with each device component using an API (e.g., a private API).
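A schematic sketch, for illustration only, of the Figure 13C arrangement: applications call a common API on a central intelligence layer, which routes requests to per-application or shared models and reads device data through a private API. All class and method names are invented for the example.

```python
from typing import Any, Dict

class CentralDeviceDataLayer:
    """Centralized repository of device data, reached through a 'private' API."""
    def __init__(self):
        self._store: Dict[str, Any] = {}

    def _get(self, key: str) -> Any:           # private API
        return self._store.get(key)

    def _put(self, key: str, value: Any) -> None:
        self._store[key] = value

class CentralIntelligenceLayer:
    """Hosts machine-learned models shared by (or assigned to) applications and
    exposes a single common API that every application calls."""
    def __init__(self, data_layer: CentralDeviceDataLayer):
        self._models: Dict[str, Any] = {}      # per-application models
        self._shared_model: Any = None         # optionally one model for all apps
        self._data = data_layer

    def register_model(self, app_id: str, model: Any) -> None:
        self._models[app_id] = model

    def predict(self, app_id: str, inputs: Any) -> Any:
        """Common API: route the request to the app's model or a shared model."""
        model = self._models.get(app_id, self._shared_model)
        if model is None:
            raise LookupError(f"no model available for {app_id}")
        context = self._data._get("device_context")   # e.g., sensor/context data
        return model(inputs, context)                 # hypothetical model call
```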
  • processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination.
  • Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.

Abstract

A text-to-speech (TTS) system implemented by a computing device can automatically convert textual content displayed in a graphical user interface to an audio signal that includes speech of the textual content and the computing device can play (e.g., via a speaker or headphones) the audio signal back to the user. In addition, the computing device can further provide one or more types of supplemental visual content for display to the user during playback of the audio signal to the user. As one example, the supplemental visual content can include a "karaoke" style text display window in which particular segments (e.g., words or sentences) of the textual content are highlighted or otherwise visually modified as the corresponding portion of the audio signal that includes speech of such segments is played back.

Description

TEXT-TO-SPEECH INTERFACE FEATURING VISUAL CONTENT SUPPLEMENTAL TO AUDIO PLAYBACK OF TEXT DOCUMENTS
RELATED APPLICATIONS
[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 62/702,721, filed on July 24, 2018. U.S. Provisional Patent Application No. 62/702,721 is hereby incorporated by reference in its entirety.
FIELD
[0002] The present disclosure relates generally to text-to-speech interfaces. More particularly, the present disclosure relates to systems and methods for interacting with and consuming textual content provided in association with computing devices, such as textual content displayed in a graphical user interface.
BACKGROUND
[0003] For users with no or low literacy there are significant challenges in interacting with and/or consuming the textual content of documents such as, for example, web pages. The same is true for users who may be interacting with textual content that is in a language they struggle to or are unable to read. There are also circumstances when users who are able to read prefer not to or are physically unable to, for example when they want a lean-back experience or their attention is elsewhere and having the page verbalized to them is more useful. For these groups of users and many others, being able to hear the textual content of a document offers a considerable potential benefit.
[0004] However, even assuming that the user is able to receive audio playback of the text in a document, such audio playback may fail to convey the full context of and/or enable the complete utilization of the textual content included in the document. For example, certain persons may be unable to focus on the audio playback alone and become distracted. As another example, certain persons may wish to improve their literacy or foreign language skills and audio playback of the text alone may not serve these purposes as it does not challenge the user to also read or otherwise visually process the textual data.
SUMMARY
[0005] Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.
[0006] One example aspect of the present disclosure is directed to a computer- implemented method. The method includes obtaining, by one or more computing devices, data descriptive of textual content included in a document. The method includes generating, by the one or more computing devices, an audio signal that comprises speech of the textual content. The method includes analyzing, by the one or more computing devices, one or both of the textual content and the audio signal to identify one or more semantic entities referenced by the textual content. The method includes obtaining, by the one or more computing devices, one or more visual content items that are associated with the one or more semantic entities. The method includes causing, by the one or more computing devices, playback of the audio signal to a user. The method includes providing, by the one or more computing devices, the one or more visual content items for display to the user
contemporaneously with playback of the audio signal to the user.
[0007] Other aspects of the present disclosure are directed to various systems, apparatuses, non-transitory computer-readable media, user interfaces, and electronic devices.
[0008] These and other features, aspects, and advantages of various embodiments of the present disclosure will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate example embodiments of the present disclosure and, together with the description, serve to explain the related principles.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] Detailed discussion of embodiments directed to one of ordinary skill in the art is set forth in the specification, which makes reference to the appended figures, in which:
[0010] Figures 1A-12 show example user interfaces according to example embodiments of the present disclosure.
[0011] Figure 13A depicts a block diagram of an example computing system according to example embodiments of the present disclosure.
[0012] Figure 13B depicts a block diagram of an example computing device according to example embodiments of the present disclosure.
[0013] Figure 13C depicts a block diagram of an example computing device according to example embodiments of the present disclosure.
[0014] Reference numerals that are repeated across plural figures are intended to identify the same features in various implementations.
DETAILED DESCRIPTION
Overview
[0015] Example aspects of the present disclosure are directed to systems and methods that enable users to interact with and consume textual content provided in association with computing devices, such as textual content displayed in a graphical user interface. In particular, a text-to-speech (TTS) system implemented by a computing device can automatically convert textual content displayed in a graphical user interface to an audio signal that includes speech of the textual content and the computing device can play (e.g., via a speaker or headphones) the audio signal back to the user. In such fashion, users that have low literacy skills, that are reading in a foreign language, or that are seeking a lean-back experience can hear the content of a document read aloud to them. In addition, according to an aspect of the present disclosure, the computing device can further provide one or more types of supplemental visual content for display to the user during playback of the audio signal to the user. As one example, the supplemental visual content can include a“karaoke” style text display window in which particular segments (e.g., words or sentences) of the textual content are highlighted or otherwise visually modified as the corresponding portion of the audio signal that includes speech of such segments is played back. In such fashion, the user can easily follow along with the text as the corresponding speech of such text is being played. This enables users with low literacy or reading in a foreign language to sharpen their reading/language skills as they are able to both hear and see the text at the same time. As another example, the supplemental visual content can include visual content items such as images or videos that relate to the textual content. For example, the computing device can identify one or more semantic entities referenced by the textual content, obtain one or more visual content items that are associated with (e.g., depict) the identified semantic entities, and can display the visual content items during playback of the audio signal. In particular, in some implementations, a particular visual content item associated with a particular semantic entity can be displayed at the same time as the portion of the audio signal that references such particular semantic entity is played back to the user. Thus, the systems and methods of the present disclosure provide, in addition to playback of an audio signal that verbalizes textual content included in a document, the user with a complementary visual experience that displays the text being read and/or relevant visual media in a user-controllable interface. As such, the systems and methods of the present disclosure can bring the simplicity of watching video or listening to the radio to web browsing or other document viewing scenarios.
[0016] More particularly, a computing device (e.g., a user’s tablet, smartphone, etc.) can access a document that contains textual content. As examples, the computing device can access and display a document such as a webpage, news article, word processing document, etc. that has textual content. The document can be displayed within a browser application and/or other document viewing application.
[0017] A user can request that the textual content (or a portion thereof) be provided in an audio format. For example, a user may be unable to read text in a displayed language, may be vision-impaired, may have low literacy, or may otherwise wish to have text provided as speech in an audio output. The user can select a portion of the document (e.g., a paragraph) or can simply select the document as a whole and can request that the textual content be provided in audio format. Alternatively to a user prompt, the computing device can automatically operate to generate an audio signal that includes speech of the textual content without requiring a user prompt. Thus, generation of an audio signal from the textual content can be automatically invoked or can be initiated once the user has indicated that they wish to receive such feature (for example by tapping on an audio shortcut button, icon, or other user interface feature).
[0018] The computing device can generate an audio signal that includes speech of at least some of the textual content included in the document. For example, the computing device can include a TTS system that performs various TTS techniques to convert the text to speech. Many different TTS techniques can be performed. In one example, one or more machine-learned models such as, for example, an artificial neural network (hereinafter “neural network”) (e.g., a recurrent neural network) can be used to generate speech from the text. The machine-learned models can be implemented on the device or can be accessed as part of a cloud service. Thus, as a result of one or more TTS techniques, an audio signal (e.g., audio file) can be produced that includes speech of the textual content.
[0019] The computing device can play the audio signal back to the user. For example, the computing device can include or communicate with (e.g., via a wired or wireless connection) one or more speakers, including, for example, speakers in the form of headphones. The computing device can playback the audio signal to the user via the one or more speakers, such that the user hears speech of the textual content included in the document.
[0020] In addition, according to an aspect of the present disclosure, contemporaneous with playback of the audio signal, the computing device can further provide one or more types of supplemental visual content for display to the user. As one example, the
supplemental visual content can include a “karaoke” style text display window in which certain segments (e.g., words or sentences) of the textual content are highlighted or otherwise visually modified as the corresponding portion of the audio signal that includes speech of such segments is played back. In such fashion, the user can easily follow along with the text as the corresponding speech of such text is being played. This enables users with low literacy or reading in a foreign language to sharpen their reading/language skills as they are able to both hear and see the text at the same time.
[0021] Thus, in some implementations, the computing device can generate and provide a user interface that includes a text display area. The text display area can present the user with the text that has been extracted from the document. The text display area can appear as a dedicated area within the overall document, as an overlay or panel above the document content, and/or via another display mechanism. As the audio signal is played back to the user, the text display area can display the current, preceding, and next section of the textual content.
[0022] The amount of text displayed in the text display area can range from a few words on either side of the current word being spoken within the audio signal to a sentence or two on either side of the current word, depending on variables including the user’s preference, their proficiency with the language, optimal scrolling or display characteristics, and/or the complexity and detail inherent in the topic of the content. Thus, the currently displayed text can correspond to current portion of the audio signal for which playback is currently occurring.
[0023] In some implementations, the word or words currently being spoken within the audio signal are highlighted in a fashion that is analogous to a karaoke machine. Highlighting the text can include changing the color of the text, changing the formatting of the text, changing the size of the text, changing a local background color behind the text, or other visual modifications that cause the highlighted text to be visually distinguishable from other surrounding portions of the text. In some implementations, text that has already been spoken can be faded out while currently-being-spoken text is bolder and highlighted, and upcoming text is in-between the faded and bolded looks.
[0024] In some implementations, the highlighting can be performed on a word-by-word basis as the current portion of the audio signal changes over time (that is, as the audio signal is played back so that different words of the text are spoken within the audio signal over time). Other bases for highlighting (e.g., phrase-by-phrase, syllable-by-syllable, etc.) can be used as well. Furthermore, in some implementations, the text can be scrolled through the text display area, such that the portion of the text that is currently being spoken within the audio signal and/or highlighted within the text display area scrolls into place as the corresponding portion of the audio signal approaches and is played back.
[0025] This movement of the focus between words as they are spoken (e.g., via highlighting and/or scrolling) helps the user easily see where the playback has reached in the current section of text displayed. In the case of users with literacy issues or who are unfamiliar with the language and are wishing to improve their capabilities, it can also assist with their becoming more proficient with the language.
[0026] As another example, the supplemental visual content can include visual content items such as images or videos that relate to the textual content. In particular, in some implementations, the computing device can identify one or more semantic entities referenced by the textual content, obtain one or more visual content items that are associated with (e.g., depict) the identified semantic entities, and can display the visual content items during playback of the audio signal. In particular, in some implementations, a particular visual content item associated with a particular semantic entity can be displayed at the same time as the portion of the audio signal that references such particular semantic entity is played back to the user.
[0027] More particularly, the computing device can analyze one or both of the textual content and the audio signal to identify one or more semantic entities referenced by the textual content. Semantic entities can include people, places, things, dates/times, events, organizations, or other semantic entities. As one example, a machine-learned model stored on the computing device or accessed via a cloud service can analyze at least a portion of the textual content and/or audio signal to recognize one or more semantic entities referenced by the textual content. As another example, segments (e.g., words) of the textual content and/or audio signal can be matched against a database of semantic entities to recognize one or more semantic entities. Various natural language processing techniques can also be performed to recognize semantic entities.
[0028] In some implementations, the visual content items can be displayed within a media display area of the user interface. The media display area can be a complement to the text display area and can be used for showing visual content that relates to the text that is currently being displayed and/or actively in focus. Thus, in some implementations, as the focus moves through the words of the text, the computing device can update the media display area to show relevant media to the word, concept, or theme being spoken aloud. For example, if the word currently in focus was‘dog’ the media display area could show an image of a dog, or if the word currently in focus was part of a sentence about how to build a log cabin the media display area could display an animated GIF showing how a log cabin was constructed.
[0029] The visual content items on display can be extracted from the document itself (for example images that are included in, or linked from, the page) or can be fetched from a source such as the Knowledge Graph or Google Images. In some implementations, metadata associated with visual content items can be matched against the semantic entities. As examples, the displayed media can take the form of visual content such as images, GIFs or other animations, and/or videos. In the case of videos there can be a mechanism to mediate the audio output from the video with the playback of the audio signal. Alternatively, choosing to play the video content could pause the playback of the audio signal.
[0030] In some implementations, the media display area does not appear at all in certain circumstances. For example, if there were no relevant media to display or the user had opted not to have it shown, the media display area may not be included in the user interface. In some implementations, the media display area can abut the text display area or be provided within (e.g., as a sub-portion of) the text display area. As another example, in some implementations, the visual content items can be displayed as a background of the text display area, and the text displayed within the text display area can be overlaid upon the one or more visual content items. In some implementations, how the visual content items are displayed can be dependent upon or a function of the orientation of the device. For example, in some implementations, if the device is in a landscape orientation, then the playback interface can be shown in full screen, while if the device is in a portrait orientation, then the playback interface can be shown on a portion of the display screen (e.g., upper area) while the document is shown in a separate portion of the display screen (e.g., lower area).
[0031] In some implementations, the computing device can identify the semantic entities and obtain the visual content items in real-time contemporaneous with generation and/or playback of the audio signal that comprises speech of the textual content. Thus, the playback experience can be generated in real-time after a user has accessed a document and requested audio playback. [0032] In some implementations, the computing device can identify the semantic entities and obtain the visual content items in real-time contemporaneous with playback of the audio signal to the user but in advance of playback of a respective portion of the audio signal that references the one or more semantic entities. More particularly, the computing device can cue up the visual content items in advance, rather than obtaining the content when the
corresponding word is spoken in the audio signal. For example, when page processing begins, the device can process textual content in the document that lies ahead of the current position of the audio signal. A combination of rules and/or heuristics can be used to process ahead of the audio signal. Machine-learned models can also be used, for example, to evaluate a user response relevant to the content (e.g., how "sticky" certain content is) and to provide improved content predictions in subsequent operations based on such user response data.
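A minimal sketch of cueing up visual content ahead of the playback position is shown below, assuming per-sentence timings and an asynchronous fetch. The look-ahead window, the sentence structure, and the function names are assumptions for the example rather than details of the disclosure.

```python
# A sketch of pre-fetching media for content that lies ahead of the current
# playback position; the 15-second look-ahead window is an arbitrary assumption.
import asyncio

LOOKAHEAD_SECONDS = 15.0

async def prefetch_ahead(sentences, playback_position, fetch_media):
    """Fetch media for sentences that will be spoken within the look-ahead window."""
    tasks = []
    for sentence in sentences:
        if playback_position < sentence["start"] <= playback_position + LOOKAHEAD_SECONDS:
            for entity in sentence["entities"]:
                tasks.append(asyncio.create_task(fetch_media(entity)))
    return await asyncio.gather(*tasks)

async def fake_fetch(entity):
    await asyncio.sleep(0)  # stand-in for a network or knowledge-graph request
    return {"entity": entity, "url": f"https://example.invalid/{entity}.jpg"}

sentences = [{"start": 12.0, "entities": ["log cabin"]},
             {"start": 40.0, "entities": ["kentucky"]}]
print(asyncio.run(prefetch_ahead(sentences, playback_position=5.0, fetch_media=fake_fetch)))
```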
[0033] In some implementations, the user interface can further include a duration display area. The duration display area can show the user how far through the playback of the entire document they are and their current position in the total playback experience. The duration display area can communicate this in many different ways including as a straight linear bar or by representing the structure of the page so that distinct aspects of the document (e.g., such as headline, sub-headline, body text, images and captions) are represented.
[0034] The duration display area can allow the user to quickly and easily move the playback experience to a specific section of the page. This allows users to quickly navigate to specific sections of text that they wish to hear played back via audio.
[0035] A number of different controls can be provided to the user. The controls available to the user can include, but are not limited to, play, pause, stop, fast forward, rewind, restart and/or move to the next or previous document (e.g., news article). There can also be capabilities to allow the user to jump forward or backwards by predefined intervals (for example jump forward/backward 30 seconds) or select a section of the text to be played on a loop. The controls available to the user can also include the ability to change the speed at which the playback occurs as well as the voice and language used to speak aloud the text.
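The following sketch models, with stubbed state only, a handful of the controls described above such as play/pause, 30-second jumps, and speed adjustment. The PlaybackController class and its method names are assumptions for the example and do not drive a real audio engine.

```python
# A minimal, state-only sketch of playback controls; a real implementation would
# forward these calls to an audio engine.
class PlaybackController:
    def __init__(self, duration):
        self.duration = duration    # total length of the audio signal, in seconds
        self.position = 0.0
        self.speed = 1.0
        self.playing = False

    def play(self): self.playing = True
    def pause(self): self.playing = False
    def jump(self, seconds):        # e.g., jump(30) or jump(-30)
        self.position = min(max(self.position + seconds, 0.0), self.duration)
    def set_speed(self, speed):     # e.g., 0.75 for learners, 2.0 for skimming
        self.speed = speed

controller = PlaybackController(duration=240.0)
controller.play()
controller.jump(30)
controller.set_speed(1.5)
print(controller.position, controller.speed, controller.playing)  # 30.0 1.5 True
```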
[0036] In some implementations, the user can also be enabled to place the playback experience into a minimalized or background state. For example, this would allow the user to undertake other tasks on their device (e.g., smartphone) while continuing to have the content of the document played back. In some implementations, the playback user interface (e.g., text display area and/or media display area) can be reduced in size, collapsed, and/or removed from the screen when placed into the background state. In one example, if a user scrolls or otherwise interacts with the document in a document display window, the playback user interface (e.g., text display area and/or media display area) can automatically be reduced in size, collapsed, and/or removed from the screen so that the document is given greater screen space.
[0037] Thus, in some implementations, the user can also scroll the document and tap a paragraph to skip to it or to replay a portion of it. In some implementations, if the user wants to keep listening while doing something else, the user can keep the playback in the background. The user can interact with a notification to control the playback.
[0038] In some implementations, in addition or alternatively to supplemental visual content, the computing device can also add additional audio content to enhance the listening experience. For example, the computing device can add contextually appropriate background music or sound effects. Thus, in some implementations, additional audio content that relates to semantic entities identified within the textual content can be obtained and can be inserted into the audio signal alongside the speech of the textual content.
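As one possible way to schedule such additional audio, the sketch below maps identified semantic entities to sound effect cues placed at the times those entities are spoken. The entity-to-effect table, cue format, and gain value are purely illustrative assumptions.

```python
# A sketch of scheduling contextually appropriate sound effects alongside the
# synthesized speech; the mapping and cue fields are illustrative only.
SOUND_EFFECTS = {"rain": "rain_loop.ogg", "dog": "bark.ogg"}

def schedule_effects(entity_mentions):
    """entity_mentions: list of (entity, start_seconds) within the audio signal."""
    cues = []
    for entity, start in entity_mentions:
        if entity in SOUND_EFFECTS:
            cues.append({"file": SOUND_EFFECTS[entity], "at": start, "gain_db": -18})
    return cues  # to be mixed under the speech track by the audio engine

print(schedule_effects([("dog", 3.2), ("log cabin", 8.0), ("rain", 12.5)]))
```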
[0039] In some implementations, the playback experience for a document can also be available to users when they are offline. This allows the user to have the textual contents of the document played in audio format even when they do not have an internet connection. This offline capability can be controlled by the user (for example by pro-actively choosing to make the page available when offline) or done automatically (for example by caching visited pages or pre-caching pages before the user has visited them).
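A hedged sketch of caching a generated playback package for offline use is shown below, keyed by document URL. The file layout, field names, and cache directory are assumptions made for the example.

```python
# A sketch of caching generated audio and word timings for offline playback;
# the on-disk layout is illustrative only.
import hashlib
import json
import pathlib

CACHE_DIR = pathlib.Path("playback_cache")

def _key(url):
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]

def cache_playback(url, audio_bytes, word_timings):
    """Store the generated audio and its word timings for later offline playback."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = _key(url)
    (CACHE_DIR / f"{key}.audio").write_bytes(audio_bytes)
    (CACHE_DIR / f"{key}.json").write_text(json.dumps({"url": url, "timings": word_timings}))

def load_cached_playback(url):
    """Return (metadata, audio_bytes) if cached, else None to fall back to online generation."""
    key = _key(url)
    meta_path = CACHE_DIR / f"{key}.json"
    if not meta_path.exists():
        return None
    return json.loads(meta_path.read_text()), (CACHE_DIR / f"{key}.audio").read_bytes()
```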
[0040] Additional aspects of the present disclosure are directed to machine intelligence that is used to control various aspects of the playback experience. As one example, in some implementations, machine-learned models can be used to identify the important parts of a page for audio playback and to generate one or more natural speaking voices. For example, multiple different speaking voices can be generated when the text being read is quoting two or more people (e.g., in an interview, press conference, theatrical dialog, etc.). This provides the ability to read billions of webpages across many different languages around the world. As another example, machine intelligence (e.g., machine-learned models) can be used to analyze the content of the document and to generate a summary of the document which can be included in the audio playback.
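As a simple stand-in for the voice-assignment idea, the sketch below splits text into quoted and unquoted segments with a regular expression and alternates between two voices. A real system would, as described above, use machine intelligence to attribute quotes to speakers; the function and voice names here are assumptions only.

```python
# A naive sketch of assigning a different speaking voice to quoted passages;
# quote attribution by a machine-learned model is out of scope for this example.
import re

def assign_voices(text, narrator_voice="voice_a", quote_voice="voice_b"):
    """Split text into segments, switching voice for quoted material."""
    segments = []
    for i, chunk in enumerate(re.split(r'"([^"]*)"', text)):
        if not chunk.strip():
            continue
        voice = quote_voice if i % 2 == 1 else narrator_voice  # odd chunks are quotes
        segments.append((voice, chunk.strip()))
    return segments

print(assign_voices('The mayor said "we will rebuild" before leaving.'))
```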
[0041] As another example, machine intelligence (e.g., machine-learned models) can be used to understand the structure of the document. For example, if there is a sidebar, footnotes, end notes, or other secondary text items in the document, the computing device can use machine intelligence to understand where to insert speech of such secondary text items in the audio signal. In addition to structure, machine intelligence can be used to understand and generate narrative flow.
[0042] As another example, if certain content is referenced in the text (e.g., via hyperlink), the computing system can access such content and insert it into the speech included in the audio signal. This prevents the user from needing to follow the link and visit a whole new document. Again, machine intelligence can be used to determine how to weave such additional content into the primary narrative.
[0043] Any of the machine intelligence (e.g., machine-learned models) described herein can be trained based on user feedback (e.g., user indications that the performed operations were correct or incorrect). User feedback can be aggregated across multiple users to generate (e.g., re-train) a global model and/or user-specific feedback can be used to personalize (e.g., re-train) a personalized model. For example, user feedback can indicate if a sidebar is being introduced in the wrong place and/or if content accessed and inserted via a hyperlink is useful/not useful and/or relevant/not relevant. Re-training the models based on user feedback can enable improved model performance moving forward.
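For illustration, the sketch below turns a single piece of user feedback into a labeled training example that could be aggregated for re-training. The field names and label encoding are assumptions made for the example.

```python
# A sketch of converting user feedback (e.g., "the sidebar was inserted in the
# wrong place") into a labeled re-training example; fields are illustrative only.
def make_training_example(document_features, model_decision, user_feedback):
    """Pair the model's decision and its input features with a correct/incorrect label."""
    return {
        "features": document_features,   # e.g., structure around the sidebar
        "decision": model_decision,      # e.g., the chosen insertion point
        "label": 1 if user_feedback == "correct" else 0,
    }

# Examples from many users can re-train the global model; examples from a single
# consenting user can personalize a per-user model.
print(make_training_example({"sidebar_offset": 4}, {"insert_after_paragraph": 3}, "incorrect"))
```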
[0044] Thus, the systems and methods of the present disclosure provide, in addition to playback of an audio signal that verbalizes textual content included in a document, the user with a complementary visual experience that displays the text being read and/or relevant visual media in a user-controllable interface. As such, the systems and methods of the present disclosure can bring the simplicity of watching video or listening to the radio to web browsing or other document viewing scenarios.
[0045] The systems and methods of the present disclosure provide a number of technical effects and benefits. As one example technical effect and benefit, the playback experience of the present disclosure may reduce the number of screen interactions (e.g., scrolling operations) needed to consume a document. Each screen interaction requires processing and memory resources to handle and to display its result (e.g., to re-display the document at a different location after scrolling). By reducing the number of such screen interactions, the playback experience can save computing resources. As another example technical effect and benefit, the systems and methods of the present disclosure can automatically retrieve and play back content from linked pages or other ancillary content that is obtained from a different source beyond the main document. By automatically retrieving and playing back such content, the user is not required to individually load and display each such additional source (e.g., linked web pages). As a result, network and device resources are saved because the additional source is not required to be obtained in full and displayed to the user.
[0046] As yet another example, the systems and methods of the present disclosure provide features which specify a mechanism for enabling user input, making a selection, and/or submitting a command. For example, in some implementations, the systems and methods of the present disclosure can include or provide a graphical user interface that includes an alternative graphical shortcut that allows the user to directly access an audio signal that includes speech of textual content included in the user interface.
[0047] With reference now to the Figures, example embodiments of the present disclosure will be discussed in further detail.
Example User Interfaces
[0048] Figures 1A-12 show example user interfaces according to example embodiments of the present disclosure.
[0049] Figures 1A-1F show an example user flow of the playback experience. In Figure 1A, a user has opened a document, which in this case is a webpage. As shown in the circled area 52 at the bottom, the user is provided with a graphical audio shortcut feature (e.g., shown here as a speaker icon) that enables the user to request audio playback of the textual content of the webpage. In Figure 1B the user has selected the audio shortcut feature and, as a result, the top bar 54 has expanded. A chime and/or animation may accompany this action. In Figure 1C, the audio playback and highlighting of textual content has started. In particular, the words "Most of" are highlighted in correspondence with the audio playback of such words. The user can tap the area with the highlighted text to bring up a control widget. Possible buttons that can be included in the control widget include rewind, pause/play, speed control, etc. In Figure 1D, the user is scrolling through the webpage to a portion of the webpage that is lower/further into the webpage. In Figure 1E, the user has stopped scrolling. Further, a "play from here" button 56 shows up above the first paragraph shown in the user interface. In Figure 1F, the user has selected the play from here button 56 and the playback experience starts from the selected paragraph.
[0050] Figure 2 shows an example alternative launch affordance as part of the browser user interface. The user can select the circled audio shortcut feature (e.g., again shown here as a speaker icon) to start the playback experience. Alternatively the playback experience can automatically play based on the page entry-point.
[0051] Figure 3 shows an example title screen. For example, the top bar can transform and slide down to become part of the playback experience. In various implementations, branded sound can mark the beginning of the experience, the page title and/or publisher can be verbalized, publisher imagery can be re-used, timing and an estimate of data usage can be provided, and/or other information can be presented.
[0052] Figure 4 shows example controls. The controls can be overlaid on the top of the playback interface. Controls can include speed adjustment 402, full screen 404, download for offline playback 406, and/or other controls.
[0053] Figure 5 shows example playback which includes a visual content item 502. For example, the visual content item 502 can be a full bleed image in the background of the interface. The images can be from the document, from a knowledge graph, and/or from other locations. The images can be dynamically panned in the background. The playback experience can include word-by-word highlight, TTS readout, no document/webpage scroll (e.g., to maintain focus on the playback interface), a playback progress indicator, and/or other features.
[0054] Figure 6 shows the playback interface in an example collapsed state. For example, upon a user scroll of the document or other user interaction, the playback interface can collapse to enable document exploration.
[0055] Figure 7 shows example fast-forward menu options. For example, the progress bar 700 and the document can both include bookmarks 702 which allow the user to fast forward to a specific part of the document.
[0056] Figure 8 shows presentation of an example user interface that includes a text display area and a media display area next to the text display area. On the left hand side is the document without the playback user interface. Once playback starts, as shown in the middle pane, the playback user interface 802 is presented above the document 804. On the right hand side is a close up of the playback interface with various components identified. Figure 9 shows another example user interface that includes a media display area 902.
[0057] Figures 10A-C show different example timeline options that enable fast-forwarding or other skipping among portions of the document structure. Figure 10A shows the document structure/bookmarks with text bubbles along the progress bar. Figure 10B shows the document structure/bookmarks by segmenting the progress bar into different portions. Figure 10C shows the document structure/bookmarks with vertical columns/bars placed along the progress bar. In some implementations, the document structure/bookmarks can be derived from HTML tags.
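As an illustration of deriving such bookmarks from HTML tags, the sketch below collects heading elements using Python's standard-library HTML parser. The choice of heading tags and the bookmark fields are assumptions made for the example.

```python
# A sketch of deriving progress-bar bookmarks from HTML heading tags using only
# the standard library; tag choices and output format are illustrative.
from html.parser import HTMLParser

class BookmarkParser(HTMLParser):
    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._current = tag

    def handle_data(self, data):
        if self._current and data.strip():
            self.bookmarks.append({"tag": self._current, "title": data.strip()})
            self._current = None

parser = BookmarkParser()
parser.feed("<h1>Headline</h1><p>Body text...</p><h2>Sub-headline</h2>")
print(parser.bookmarks)
```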
[0058] Figure 11 shows an example of the playback experience occurring in the background. A notification 1100 (e.g., sticky notification) can allow the user to pause/resume the audio playback. The notification 1100 can also include a smaller media display area 1102, as illustrated in Figure 11.
[0059] Figure 12 shows an example interface that provides suggestions to the user of what to play next. For example, after completion of playback of the audio signal, the playback interface can suggest related articles for the user to further the experience.
Example Devices and Systems
[0060] Figure 13A depicts a block diagram of an example computing system 100 according to example embodiments of the present disclosure. The system 100 includes a user computing device 102, a server computing system 130, and a training computing system 150 that are communicatively coupled over a network 180.
[0061] The user computing device 102 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
[0062] The user computing device 102 includes one or more processors 112 and a memory 114. The one or more processors 112 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 114 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing device 102 to perform operations.
[0063] The user computing device 102 can include an audio playback system 119 that implements the playback experience described herein. The audio playback system 119 can be an application, a plug-in (e.g., a plug-in for a browser application), or other forms of software implementable by the one or more processors 112. The user computing device 102 can access one or more documents 190 (e.g., over the network 180 and/or from the local memory 114) and the audio playback system 119 can generate an audio playback experience for textual content included in such document(s) 190.
[0064] In some implementations, the user computing device 102 (e.g., the audio playback system 119) can store or include one or more machine-learned models 120. For example, the machine-learned models 120 can be or can otherwise include various machine- learned models such as neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and/or linear models. Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks or other forms of neural networks.
[0065] In some implementations, the one or more machine-learned models 120 can be received from the server computing system 130 over network 180, stored in the user computing device memory 114, and then used or otherwise implemented by the one or more processors 112. In some implementations, the user computing device 102 can implement multiple parallel instances of a single machine-learned model 120.
[0066] Additionally or alternatively, one or more machine-learned models 140 can be included in or otherwise stored and implemented by the server computing system 130 that communicates with the user computing device 102 according to a client-server relationship. For example, the machine-learned models 140 can be implemented by the server computing system 130 as a portion of a web service (e.g., a semantic entity identification service). Thus, one or more models 120 can be stored and implemented at the user computing device 102 and/or one or more models 140 can be stored and implemented at the server computing system 130.
[0067] The audio playback system can also include a TTS system 121. The TTS system can perform any number of TTS techniques to generate an audio signal that includes speech of text. The audio playback system can also include a visual content handler 123. The visual content handler 123 can obtain visual content associated with semantic entities and can provide such visual content for display at an appropriate time.
[0068] The user computing device 102 can also include one or more user input components 122 that receive user input. For example, the user input component 122 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus). The touch-sensitive component can serve to implement a virtual keyboard. Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input.
[0069] The user computing device 102 can also include one or more speakers 124. The speakers 124 can be physically connected to the device 102 or not physically connected to the device 102. The speakers 124 can include stand-alone speakers, earbuds, or the like.
[0070] The server computing system 130 includes one or more processors 132 and a memory 134. The one or more processors 132 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 134 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 134 can store data 136 and instructions 138 which are executed by the processor 132 to cause the server computing system 130 to perform operations.
[0071] In some implementations, the server computing system 130 includes or is otherwise implemented by one or more server computing devices. In instances in which the server computing system 130 includes plural server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.
[0072] As described above, the server computing system 130 can store or otherwise include one or more machine-learned models 140. For example, the models 140 can be or can otherwise include various machine-learned models. Example machine-learned models include neural networks or other multi-layer non-linear models. Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks.
[0073] The user computing device 102 and/or the server computing system 130 can train the models 120 and/or 140 via interaction with the training computing system 150 that is communicatively coupled over the network 180. The training computing system 150 can be separate from the server computing system 130 or can be a portion of the server computing system 130.
[0074] The training computing system 150 includes one or more processors 152 and a memory 154. The one or more processors 152 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 154 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 154 can store data 156 and instructions 158 which are executed by the processor 152 to cause the training computing system 150 to perform operations. In some implementations, the training computing system 150 includes or is otherwise implemented by one or more server computing devices. [0075] The training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors. In some implementations, performing backwards propagation of errors can include performing truncated backpropagation through time. The model trainer 160 can perform a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
[0076] In particular, the model trainer 160 can train the machine-learned models 120 and/or 140 based on a set of training data 162. The training data 162 can include, for example, user feedback or data descriptive of user actions performed in response to various playback experience settings or operations.
[0077] In some implementations, if the user has provided consent, the training examples can be provided by the user computing device 102. Thus, in such implementations, the model 120 provided to the user computing device 102 can be trained by the training computing system 150 on user-specific data received from the user computing device 102. In some instances, this process can be referred to as personalizing the model.
[0078] The model trainer 160 includes computer logic utilized to provide desired functionality. The model trainer 160 can be implemented in hardware, firmware, and/or software controlling a general purpose processor. For example, in some implementations, the model trainer 160 includes program files stored on a storage device, loaded into a memory and executed by one or more processors. In other implementations, the model trainer 160 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, a hard disk, or optical or magnetic media.
[0079] The network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links. In general, communication over the network 180 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
[0080] Figure 13A illustrates one example computing system that can be used to implement the present disclosure. Other computing systems can be used as well. For example, in some implementations, the user computing device 102 can include the model trainer 160 and the training dataset 162. In such implementations, the models 120 can be both trained and used locally at the user computing device 102. In some of such implementations, the user computing device 102 can implement the model trainer 160 to personalize the models 120 based on user-specific data.
[0081] Figure 13B depicts a block diagram of an example computing device 10 that performs according to example embodiments of the present disclosure. The computing device 10 can be a user computing device or a server computing device.
[0082] The computing device 10 includes a number of applications (e.g., applications 1 through N). Each application contains its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
[0083] As illustrated in Figure 13B, each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, each application can communicate with each device component using an API (e.g., a public API). In some implementations, the API used by each application is specific to that application.
[0084] Figure 13C depicts a block diagram of an example computing device 50 that performs according to example embodiments of the present disclosure. The computing device 50 can be a user computing device or a server computing device.
[0085] The computing device 50 includes a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. In some
implementations, each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).
[0086] The central intelligence layer includes a number of machine-learned models. For example, as illustrated in Figure 13C, a respective machine-learned model (e.g., a model) can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model (e.g., a single model) for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of the computing device 50. [0087] The central intelligence layer can communicate with a central device data layer. The central device data layer can be a centralized repository of data for the computing device 50. As illustrated in Figure 13C, the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API).
Additional Disclosure
[0088] The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and
functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.
[0089] While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.

Claims

WHAT IS CLAIMED IS:
1. A computer-implemented method, the method comprising:
obtaining, by one or more computing devices, data descriptive of textual content included in a document;
providing, by the one or more computing devices, a graphical user interface for display to a user, wherein the graphical user interface presents at least a portion of the textual content included in the document, and wherein the graphical user interface comprises a graphical audio shortcut feature that enables a user to request audio playback of at least a portion of the textual content;
receiving, by the one or more computing devices, a user input that selects the graphical audio shortcut feature;
generating, by the one or more computing devices, an audio signal that comprises speech of the textual content;
analyzing, by the one or more computing devices, one or both of the textual content and the audio signal to identify one or more semantic entities referenced by the textual content;
obtaining, by the one or more computing devices, one or more visual content items that are associated with the one or more semantic entities;
causing, by the one or more computing devices, playback of the audio signal to a user; and
providing, by the one or more computing devices, the one or more visual content items for display to the user in the graphical user interface contemporaneously with playback of the audio signal to the user.
2. The computer-implemented method of claim 1, wherein providing, by the one or more computing devices, the one or more visual content items for display to the user contemporaneously with playback of the audio signal to the user comprises providing for display, by the one or more computing devices, a respective one of a plurality of visual content items during a respective time period in which playback of a respective portion of the audio signal that references the semantic entity with which such respective visual content item is associated occurs.
3. The computer-implemented method of any preceding claim, wherein obtaining, by the one or more computing devices, the data descriptive of textual content included in the document comprises obtaining, by the one or more computing devices, data descriptive of a subset of the textual content included in the document that has been selected by the user, wherein said generating, analyzing, and obtaining are performed with respect to only the subset of the textual content that has been selected by the user.
4. The computer-implemented method of any preceding claim, wherein obtaining, by the one or more computing devices, the one or more visual content items that are associated with the one or more semantic entities comprises obtaining, by the one or more computing devices, one or more visual content items from the document.
5. The computer-implemented method of any preceding claim, wherein obtaining, by the one or more computing devices, the one or more visual content items that are associated with the one or more semantic entities comprises obtaining, by the one or more computing devices, one or more visual content items from an external data source that is different from the document.
6. The computer-implemented method of any preceding claim, wherein obtaining, by the one or more computing devices, the one or more visual content items that are associated with the one or more semantic entities comprises accessing, by the one or more computing devices, a knowledge graph to obtain visual content items that are associated with the one or more semantic entities within the knowledge graph.
7. The computer-implemented method of any preceding claim, wherein the one or more visual content items comprise one or more of: still images, animations, or videos.
8. The computer-implemented method of any preceding claim, further comprising: providing, by the one or more computing devices, a text display area in the graphical user interface for display to the user contemporaneously with playback of the audio signal to the user, wherein the text display area displays at least a portion of the text that corresponds to a current portion of the audio signal for which playback is currently occurring.
9. The computer-implemented method of claim 8, wherein providing, by the one or more computing devices, the one or more visual content items for display to the user contemporaneously with playback of the audio signal to the user comprises providing, by the one or more computing devices, the one or more visual content items for display to the user within or abutting the text display area.
10. The computer-implemented method of claim 9, wherein providing, by the one or more computing devices, the one or more visual content items for display to the user within or abutting the text display area comprises providing, by the one or more computing devices, the one or more visual content items as a background of the text display area, and wherein the portion of the text displayed within the text display area is overlaid upon the one or more visual content items.
11. The computer-implemented method of any of claims 8-10, further comprising: highlighting, by the one or more computing devices, a word included in the portion of the text displayed within the text display area that corresponds to the current portion of the audio signal for which playback is currently occurring, wherein the word is a subset of the portion of the text displayed within the text display area, and wherein said highlighting is repeatedly performed on a word-by-word basis as the current portion of the audio signal changes over time.
12. The computer-implemented method of any of claims 8-11, wherein providing, by the one or more computing devices, the text display area for display comprises scrolling, by the one or more computing devices, the text through the text display area contemporaneously with playback of the audio signal to the user.
13. The computer-implemented method of any preceding claim, further comprising, after providing, by the one or more computing devices, the one or more visual content items for display:
receiving, by the one or more computing devices, data descriptive of a user interaction with the document; and in response to receipt of the data descriptive of the user interaction with the document, collapsing or ceasing, by the one or more computing devices, display of the one or more visual content items to increase a relative size of a portion of a user interface in which the document is displayed.
14. The computer-implemented method of any preceding claim, wherein the document comprises a web page.
15. The computer-implemented method of any preceding claim, wherein said analyzing, by the one or more computing devices, one or both of the textual content and the audio signal to identify the one or more semantic entities referenced by the textual content and said obtaining, by the one or more computing devices, the one or more visual content items that are associated with the one or more semantic entities are performed in real-time contemporaneous with generation of the audio signal that comprises speech of the textual content.
16. The computer-implemented method of any preceding claim, wherein said analyzing, by the one or more computing devices, one or both of the textual content and the audio signal to identify the one or more semantic entities referenced by the textual content and said obtaining, by the one or more computing devices, the one or more visual content items that are associated with the one or more semantic entities are performed in real-time contemporaneous with playback of the audio signal to the user.
17. The computer-implemented method of any preceding claim, wherein said analyzing, by the one or more computing devices, one or both of the textual content and the audio signal to identify the one or more semantic entities referenced by the textual content and said obtaining, by the one or more computing devices, the one or more visual content items that are associated with the one or more semantic entities are performed in real-time contemporaneous with playback of the audio signal to the user but in advance of playback of a respective portion of the audio signal that references the one or more semantic entities.
18. A computing device, comprising: one or more processors; and
one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the computing device to perform the method of any of claims 1-17.
19. The computing device of claim 18, wherein the computing device comprises a mobile computing device.
20. One or more non-transitory computer-readable media that store instructions that, when executed by one or more processors, cause the one or more processors to perform the method of any of claims 1-17.