WO2012167276A1 - Automatically creating a mapping between text data and audio data - Google Patents

Automatically creating a mapping between text data and audio data

Info

Publication number
WO2012167276A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
location
text
data
work
Prior art date
Application number
PCT/US2012/040801
Other languages
French (fr)
Inventor
Xiang Cao
Alan C. Cannistraro
Gregory S. ROBBIN
Casey M. DOUGHERTY
Melissa Breglio HAJJ
Raymond Walsh
Original Assignee
Apple Inc.
Priority date
Filing date
Publication date
Application filed by Apple Inc. filed Critical Apple Inc.
Priority to CN201280036281.5A priority Critical patent/CN103703431B/en
Priority to KR1020167006970A priority patent/KR101700076B1/en
Priority to AU2012261818A priority patent/AU2012261818B2/en
Priority to KR1020157017690A priority patent/KR101674851B1/en
Priority to JP2014513799A priority patent/JP2014519058A/en
Priority to EP12729332.2A priority patent/EP2593846A4/en
Priority to KR1020137034641A priority patent/KR101622015B1/en
Publication of WO2012167276A1 publication Critical patent/WO2012167276A1/en
Priority to AU2016202974A priority patent/AU2016202974B2/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/68 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/685 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using automatically derived transcript of audio data, e.g. lyrics
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/166 - Editing, e.g. inserting or deleting
    • G06F 40/169 - Annotation, e.g. comment data or footnotes
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/19 - Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/28 - Constructional details of speech recognition systems

Definitions

  • the present invention relates to automatically creating a mapping between text data and audio data by analyzing the audio data to detect words reflected therein and compare those words to words in the document.
  • a method includes receiving audio data that reflects an audible version of a work for which a textual version exists; performing a speech-to-text analysis of the audio data to generate text for portions of the audio data; and based on the text generated for the portions of the audio data, generating a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the textual version of the work.
  • the method is performed by one or more computing devices.
  • generating text for portions of the audio data includes generating text for portions of the audio data based, at least in part, on textual context of the work. In some embodiments, generating text for portions of the audio data based, at least in part, on textual context of the work includes generating text based, at least in part, on one or more rules of grammar used in the textual version of the work. In some embodiments, generating text for portions of the audio data based, at least in part, on textual context of the work includes limiting which words the portions can be translated to based on which words are in the textual version of the work, or a subset thereof.
  • limiting which words the portions can be translated to based on which words are in the textual version of the work includes, for a given portion of the audio data, identifying a sub-section of the textual version of the work that corresponds to the given portion and limiting the words to only those words in the sub-section of the textual version of the work.
  • identifying the sub-section of the textual version of the work includes maintaining a current text location in the textual version of the work that corresponds to a current audio location, in the audio data, of the speech-to-text analysis; and the sub-section of the textual version of the work is a section associated with the current text location.
  • the portions include portions that correspond to individual words, and the mapping maps the locations of the portions that correspond to individual words to individual words in the textual version of the work. In some embodiments, the portions include portions that correspond to individual sentences, and the mapping maps the locations of the portions that correspond to individual sentences to individual sentences in the textual version of the work. In some embodiments, the portions include portions that correspond to fixed amounts of data, and the mapping maps the locations of the portions that correspond to fixed amounts of data to corresponding locations in the textual version of the work.
  • mapping includes: (1) embedding anchors in the audio data; (2) embedding anchors in the textual version of the work; or (3) storing the mapping in a media overlay that is stored in association with the audio data or the textual version of the work.
  • each of one or more text locations of the plurality of text locations indicates a relative location in the textual version of the work.
  • one text location, of the plurality of text locations indicates a relative location in the textual version of the work and another text location, of the plurality of text locations, indicates an absolute location from the relative location.
  • each of one or more text locations of the plurality of text locations indicates an anchor within the textual version of the work.
  • a method includes receiving a textual version of a work; performing a text-to-speech analysis of the textual version to generate first audio data; based on the first audio data and the textual version, generating a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the textual version of the work; receiving second audio data that reflects an audible version of the work for which the textual version exists; and based on (1) a comparison of the first audio data and the second audio data and (2) the first mapping, generating a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the textual version of the work.
  • the method is performed by one or more computing devices.
  • a method includes receiving audio input; performing a speech-to-text analysis of the audio input to generate text for portions of the audio input; determining whether the text generated for portions of the audio input matches text that is currently displayed; and in response to determining that the text matches text that is currently displayed, causing the text that is currently displayed to be highlighted.
  • the method is performed by one or more computing devices.
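As an illustration of the displayed-text-highlighting method described above, the following minimal Python sketch (not from the patent; the speech recognizer is assumed to already return plain word strings) matches recognized words against the currently displayed text and returns character ranges that could be highlighted.

    def ranges_to_highlight(recognized_words, displayed_text):
        # Return (start, end) character offsets of displayed words that were spoken.
        ranges = []
        cursor = 0
        lowered = displayed_text.lower()
        for word in recognized_words:
            idx = lowered.find(word.lower(), cursor)
            if idx == -1:
                continue  # the spoken word is not on the displayed page; skip it
            ranges.append((idx, idx + len(word)))
            cursor = idx + len(word)  # only match forward, in reading order
        return ranges

    # Example: a reader says "once upon a" while the page below is displayed.
    print(ranges_to_highlight(["once", "upon", "a"], "Once upon a time, in a land far away"))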
  • an electronic device includes an audio data receiving unit configured for receiving audio data that reflects an audible version of a work for which a textual version exists.
  • the electronic device also includes a processing unit coupled to the audio data receiving unit.
  • the processing unit is configured to: perform a speech-to-text analysis of the audio data to generate text for portions of the audio data; and based on the text generated for the portions of the audio data, generate a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the textual version of the work.
  • an electronic device includes a text receiving unit configured for receiving a textual version of a work.
  • the electronic device also includes a processing unit coupled to the text receiving unit, the processing unit configured to: perform a text-to-speech analysis of the textual version to generate first audio data; and based on the first audio data and the textual version, generate a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the textual version of the work.
  • the electronic device also includes an audio data receiving unit configured for receiving second audio data that reflects an audible version of the work for which the textual version exists.
  • the processing unit is further configured to, based on (1) a comparison of the first audio data and the second audio data and (2) the first mapping, generate a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the textual version of the work.
  • an electronic device includes an audio receiving unit configured for receiving audio input.
  • the electronic device also includes a processing unit coupled to the audio receiving unit.
  • the processing unit is configured to perform a speech-to-text analysis of the audio input to generate text for portions of the audio input; determine whether the text generated for portions of the audio input matches text that is currently displayed; and in response to determining that the text matches text that is currently displayed, cause the text that is currently displayed to be highlighted.
  • a method includes obtaining location data that indicates a specified location within a textual version of a work; inspecting a mapping between a plurality of audio locations in an audio version of the work and a corresponding plurality of text locations in the textual version of the work to: determine a particular text location, of the plurality of text locations, that corresponds to the specified location, and based on the particular text location, determine a particular audio location, of the plurality of audio locations, that corresponds to the particular text location.
  • the method includes providing the particular audio location, that was determined based on the particular text location, to a media player to cause the media player to establish the particular audio location as a current playback position of the audio data.
  • the method is performed by one or more computing devices.
  • obtaining comprises a server receiving, over a network, the location data from a first device; inspecting and providing are performed by the server; and providing comprises the server sending the particular audio location to a second device that executes the media player.
  • the second device and the first device are the same device.
  • obtaining, inspecting, and providing are performed by a computing device that is configured to display the textual version of the work and that executes the media player.
  • the method further includes determining, at a device that is configured to display the textual version of the work, the location data without input from a user of the device.
  • the method further includes receiving input from a user; and in response to receiving the input, determining the location data based on the input.
  • providing the particular audio location to the media player comprises providing the particular audio location to the media player to cause the media player to process the audio data beginning at the current playback position, which causes the media player to generate audio from the processed audio data; and causing the media player to process the audio data is performed in response to receiving the input.
  • the input selects multiple words in the textual version of the work; the specified location is a first specified location; the location data also indicates a second specified location, within the textual version of the work, that is different than the first specified location; inspecting further comprises inspecting the mapping to: determine a second particular text location, of the plurality of text locations, that corresponds to the second specified location, and based on the second particular text location, determine a second particular audio location, of the plurality of audio locations, that corresponds to the second particular text location; and providing the particular audio location to the media player comprises providing the second particular audio location to the media player to cause the media player to cease processing the audio data when the current playback position arrives at or near the second particular audio location.
  • the method further includes obtaining annotation data that is based on input from a user; storing the annotation data in association with the specified location; and causing information about the annotation data to be displayed.
  • causing information about the particular audio location and the annotation data to be displayed comprises: determining when a current playback position of the audio data is at or near the particular audio location; and in response to determining that the current playback position of the audio data is at or near the particular audio location, causing information about the annotation data to be displayed.
  • the annotation data includes text data; and causing information about the annotation data to be displayed comprises displaying the text data.
  • the annotation data includes voice data; and causing information about the annotation data to be displayed comprises processing the voice data to generate audio.
  • an electronic device includes a location data obtaining unit configured for obtaining location data that indicates a specified location within a textual version of a work.
  • the electronic device also includes a processing unit coupled to the location data obtaining unit.
  • the processing unit is configured to inspect a mapping between a plurality of audio locations in an audio version of the work and a corresponding plurality of text locations in the textual version of the work to: determine a particular text location, of the plurality of text locations, that corresponds to the specified location, and based on the particular text location, determine a particular audio location, of the plurality of audio locations, that corresponds to the particular text location; and provide the particular audio location, that was determined based on the particular text location, to a media player to cause the media player to establish the particular audio location as a current playback position of the audio data.
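A minimal sketch of the lookup the processing unit performs, under an assumed record layout (the patent does not prescribe one): given mapping records sorted by text offset, find the record whose text location is nearest to, and not after, the specified location, and return its audio location so a media player could seek to it.

    import bisect

    def audio_location_for(mapping, specified_text_offset):
        # mapping: list of (text_offset, audio_seconds) records sorted by text_offset.
        offsets = [text_off for text_off, _ in mapping]
        i = bisect.bisect_right(offsets, specified_text_offset) - 1
        i = max(i, 0)  # clamp if the specified location precedes the first record
        return mapping[i][1]

    mapping = [(0, 0.0), (120, 8.5), (260, 17.2), (400, 26.0)]
    seconds = audio_location_for(mapping, 300)  # e.g., the reader tapped character 300
    # player.seek(seconds)  # hypothetical media-player call to set the playback position
    print(seconds)  # 17.2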
  • a method includes obtaining location data that indicates a specified location within audio data; inspecting a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a textual version of a work to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and based on the particular audio location, determine a particular text location, of the plurality of text locations, that corresponds to the particular audio location; and causing a media player to display information about the particular text location.
  • the method is performed by one or more computing devices.
  • obtaining comprises a server receiving, over a network, the location data from a first device; inspecting and causing are performed by the server; and causing comprises the server sending the particular text location to a second device that executes the media player.
  • the second device and the first device are the same device.
  • obtaining, inspecting, and causing are performed by a computing device that is configured to display the textual version of the work and that executes the media player.
  • the method further includes determining, at a device that is configured to process the audio data, the location data without input from a user of the device.
  • the method further includes: receiving input from a user; and in response to receiving the input, determining the location data based on the input.
  • causing comprises causing the media player to display a portion of the textual version of the work that corresponds to the particular text location; and causing the media player to display the portion of the textual version of the work is performed in response to receiving the input.
  • the input selects a segment of the audio data; the specified location is a first specified location; the location data also indicates a second specified location, within the audio data, that is different than the first specified location; inspecting further comprises inspecting the mapping to: determine a second particular audio location, of the plurality of audio locations, that corresponds to the second specified location, and based on the second particular audio location, determine a second particular text location, of the plurality of text locations, that corresponds to the second particular audio location; and causing a media player to display information about the particular text location further comprises causing the media player to display information about the second particular text location.
  • the specified location corresponds to a current playback position in the audio data; causing is performed as the audio data at the specified location is processed and audio is generated; and causing comprises causing a second media player to highlight text within the textual version of the work at or near the particular text location.
  • the method further includes: obtaining annotation data that is based on input from a user; storing the annotation data in association with the specified location; and causing information about the annotation data to be displayed.
  • causing information about the annotation data to be displayed comprises: determining when a portion of the textual version of the work that corresponds to the particular text location is displayed; and in response to determining that a portion of the textual version of the work that corresponds to the particular text location is displayed, causing information about the annotation data to be displayed.
  • the annotation data includes text data; and causing information about the annotation data to be displayed comprises causing the text data to be displayed.
  • the annotation data includes voice data; and causing information about the annotation data to be displayed comprises causing the voice data to be processed to generate audio.
  • a method includes, during playback of an audio version of a work: obtaining location data that indicates a specified location within the audio version, and determining, based on the specified location, a particular text location, in a textual version of the work, that is associated with pause data that indicates when to pause playback of the audio version; and in response to determining that the particular text location is associated with pause data, pausing playback of the audio version.
  • the method is performed by one or more computing devices.
  • determining the particular text location comprises: inspecting a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a textual version of a work to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and based on the particular audio location, determine the particular text location, of the plurality of text locations, that corresponds to the particular audio location.
  • the pause data corresponds to the end of a page reflected in the textual version of the work. In some embodiments, the pause data corresponds to a location, within the textual version of the work, that immediately precedes a picture that does not include text.
  • the method further comprises continuing playback of the audio version in response to receiving user input. In some embodiments, the method further comprises continuing playback of the audio version in response to the lapse of a particular amount of time since playback of the audio version was paused.
  • a method includes, during playback of an audio version of a work: obtaining location data that indicates a specified location within the audio version, and determining, based on the specified location, a particular text location, in a textual version of the work, that is associated with the page end data that indicates an end of a first page reflected in the textual version of the work; and in response to determining that the particular text location is associated with the page end data, automatically causing the first page to cease to be displayed and causing a second page that is subsequent to the first page to be displayed.
  • the method is performed by one or more computing devices.
  • the method further comprises inspecting a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a textual version of a work to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and based on the particular audio location, determine the particular text location, of the plurality of text locations, that corresponds to the particular audio location.
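The following Python sketch illustrates, under assumed data structures, how pause data and page end data associated with text locations might drive playback: the current audio position is mapped back to a text offset, and playback pauses or the page turns if that offset is flagged. None of the names (pause_points, page_ends, player, reader_ui) come from the patent.

    import bisect

    def text_location_for(mapping, audio_seconds):
        # mapping: list of (audio_seconds, text_offset) records sorted by time.
        times = [t for t, _ in mapping]
        i = max(bisect.bisect_right(times, audio_seconds) - 1, 0)
        return mapping[i][1]

    def on_playback_tick(audio_seconds, mapping, pause_points, page_ends, player, reader_ui):
        text_offset = text_location_for(mapping, audio_seconds)
        if text_offset in page_ends:
            reader_ui.turn_to_next_page()  # hypothetical UI call: display the next page
        if text_offset in pause_points:
            player.pause()  # playback resumes on user input or after a timeout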
  • an electronic device includes a location obtaining unit configured for obtaining location data that indicates a specified location within audio data.
  • the electronic device also includes a processing unit coupled to the location obtaining unit.
  • the processing unit is configured to: inspect a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a textual version of a work to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and based on the particular audio location, determine a particular text location, of the plurality of text locations, that corresponds to the particular audio location; and cause a media player to display information about the particular text location.
  • an electronic device includes a location obtaining unit configured for obtaining location data that indicates a specified location within the audio version during playback of an audio version of a work.
  • the electronic device also includes a processing unit coupled to the location obtaining unit, the processing unit configured to, during playback of an audio version of a work: determine, based on the specified location, a particular text location, in a textual version of the work, that is associated with the page end data that indicates an end of a first page reflected in the textual version of the work; and in response to determining that the particular text location is associated with the page end data, automatically cause the first page to cease to be displayed and causing a second page that is subsequent to the first page to be displayed.
  • a method includes, while a first version of a work is processed, obtaining annotation data that is based on input from a user; storing association data that associates the annotation data with the work; and while a second version of the work is processed, causing information about the annotation data to be displayed, wherein the second version is different than the first version; and wherein the method is performed by one or more computing devices.
  • obtaining comprises determining location data that indicates a specified location within the first version of the work; storing comprises storing the location data in association with the work; the specified location corresponds to a particular location within the second version of the work; and causing comprises causing the information about the annotation data to be displayed in association with the particular location in the second version.
  • the annotation data includes text data; and causing information about the annotation data to be displayed comprises causing the text data to be displayed.
  • the annotation data includes voice data; and causing information about the annotation data to be displayed comprises causing the voice data to be processed to generate audio.
  • an electronic device includes an annotation obtaining unit configured for, while a first version of a work is processed, obtaining annotation data that is based on input from a user; and a processing unit coupled to the annotation obtaining unit and the association data storing unit, the processing unit configured for: causing association data that associates the annotation data with the work to be stored; and while a second version of the work is processed, causing information about the annotation data to be displayed, wherein the second version is different than the first version.
  • a method includes receiving data that establishes a first bookmark within a first version of a work.
  • the method further includes inspecting a mapping between a plurality of first locations in the first version of the work and a corresponding plurality of second locations in a second version of the work to: determine a particular first location, of the plurality of first locations, that corresponds to the first bookmark, and based on the particular first location, determine a particular second location, of the plurality of second locations, that corresponds to the particular first location; wherein the first version of the work is different than the second version of the work.
  • the method further includes causing data that establishes the particular second location as a second bookmark within the second version of the work to be stored; wherein the method is performed by one or more computing devices.
  • receiving comprises a server receiving, over a network, the input from a first device; inspecting is performed by the server; and causing comprises the server sending the particular second location to a second device.
  • the first device and the second device are different devices.
  • the first version of the work is one of an audio version of the work or a text version of the work and the second version of the work is the other of the audio version or the text version.
  • an electronic device includes a data receiving unit configured for receiving data that establishes a first bookmark within a first version of a work.
  • the electronic device also includes a processing unit coupled to the data receiving unit, the processing unit configured for: inspecting a mapping between a plurality of first locations in the first version of the work and a corresponding plurality of second locations in a second version of the work to: determine a particular first location, of the plurality of first locations, that corresponds to the first bookmark, and based on the particular first location, determine a particular second location, of the plurality of second locations, that corresponds to the particular first location; wherein the first version of the work is different than the second version of the work.
  • the processing unit is also configured for causing data that establishes the particular second location as a second bookmark within the second version of the work to be stored.
  • a method includes causing a portion of text of a work to be displayed by a device; while the portion of text is displayed: receiving, at the device, audio input from a user.
  • the method further includes, in response to receiving the audio input: analyzing the audio input to identify one or more words; determining whether the one or more words are reflected in the portion of the text; and in response to determining that the one or more words are reflected in the portion of the text, causing a visual indication to be displayed by the device.
  • causing the visual indication to be displayed comprises causing text data that corresponds to the one or more words to be highlighted.
  • an electronic device includes a processing unit configured for causing a portion of text of a work to be displayed by a device; and an audio receiving unit coupled to the processing unit and configured for receiving, at the device, audio input from a user.
  • the processing unit is further configured for, in response to receiving the audio input at the audio receiving unit: analyzing the audio input to identify one or more words; determining whether the one or more words are reflected in the portion of the text; and in response to determining that the one or more words are reflected in the portion of the text, causing a visual indication to be displayed by the device.
  • a computer-readable storage medium is provided, the computer-readable storage medium storing one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing any of the methods described above.
  • an electronic device is provided that comprises means for performing any of the methods described above.
  • an electronic device is provided that comprises one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for performing any of the methods described above.
  • an information processing apparatus for use in an electronic device is provided, the information processing apparatus comprising means for performing any of the methods described above.
  • EPUB is an electronic book standard maintained by the International Digital Publishing Forum (IDPF).
  • An EPUB file uses XHTML 1.1 (or DTBook) to construct the content of a book. Styling and layout are performed using a subset of CSS, referred to as OPS Style Sheets.
  • an audio version of the written work is created. For example, a recording of a famous individual (or one with a pleasant voice) reading a written work is created and made available for purchase, whether online or in a brick and mortar store.
  • a pre-recorded narration of a publication can be represented as a series of audio clips, each corresponding to part of the text.
  • Each single audio clip in the series of audio clips that make up a pre-recorded narration typically represents a single phrase or paragraph, but infers no order relative to the other clips or to the text of a document.
  • Media Overlays solve this problem of synchronization by tying the structured audio narration to its corresponding text in the EPUB Content Document using SMIL markup.
  • Media Overlays are a simplified subset of SMIL 3.0 that allow the playback sequence of these clips to be defined.
  • a media overlay file may associate the beginning of each paragraph in an e-book with a corresponding location in an audio version of the book.
  • FIG. 1 is a flow diagram that depicts a process for automatically creating a mapping between text data and audio data, according to an embodiment of the invention.
  • FIG. 2 is a block diagram that depicts a process that involves an audio-to-text correlator in generating a mapping between text data and audio data, according to an embodiment of the invention.
  • FIG. 3 is a flow diagram that depicts a process for using a mapping in one or more of these scenarios, according to an embodiment of the invention.
  • FIG. 4 is a block diagram that depicts an example system 400 that may be used to implement some of the processes described herein, according to an embodiment of the invention.
  • FIGs. 5A-B are flow diagrams that depict processes for bookmark switching, according to an embodiment of the invention.
  • FIG. 6 is a flow diagram that depicts a process for causing text, from a textual version of a work, to be highlighted while an audio version of the work is being played, according to an embodiment of the invention.
  • FIG. 7 is a flow diagram that depicts a process of highlighting displayed text in response to audio input from a user, according to an embodiment of the invention.
  • FIGs. 8A-B are flow diagrams that depict processes for transferring an annotation from one media context to another, according to an embodiment of the invention.
  • FIG. 9 is a block diagram that illustrates a computer system upon which an embodiment of the invention may be implemented.
  • FIGs. 10-18 are functional block diagrams of electronic devices in accordance with some embodiments.
  • a mapping is automatically created where the mapping maps locations within an audio version of a work (e.g., an audio book) with corresponding locations in a textual version of the work (e.g., an e-book).
  • the mapping is created by performing a speech-to-text analysis on the audio version to identify words reflected in the audio version. The identified words are matched up with the corresponding words in the textual version of the work.
  • the mapping associates locations (within the audio version) of the identified words with locations in the textual version of the work where the identified words are found.
  • the audio data reflects an audible reading of text of a textual version of a work, such as a book, web page, pamphlet, flyer, etc.
  • the audio data may be stored in one or more audio files.
  • the one or more audio files may be in one of many file formats. Non-limiting examples of audio file formats include AAC, MP3, WAV, and PCM.
  • the text data to which the audio data is mapped may be stored in one of many document file formats.
  • document file formats include DOC, TXT, PDF, RTF, HTML, XHTML, and EPUB.
  • a typical EPUB document is accompanied by a file that (a) lists each XHTML content document, and (b) indicates an order of the XHTML content documents. For example, if a book comprises 20 chapters, then an EPUB document for that book may have 20 different XHTML documents, one for each chapter. A file that accompanies the EPUB document identifies an order of the XHTML documents that corresponds to the order of the chapters in the book. Thus, a single (logical) document (whether an EPUB document or another type of document) may comprise multiple data items or files.
  • the words or characters reflected in the text data may be in one or multiple languages. For example, one portion of the text data may be in English while another portion of the text data may be in French. Although examples of English words are provided herein, embodiments of the invention may be applied to other languages, including character-based languages.
  • a mapping comprises a set of mapping records, where each mapping record associates an audio location with a text location.
  • Each audio location identifies a location in audio data.
  • An audio location may indicate an absolute location within the audio data, a relative location within the audio data, or a combination of an absolute location and a relative location.
  • an audio location may indicate a time offset (e.g., 04:32:24 indicating 4 hours, 32 minutes, 24 seconds) into the audio data, or a time range, as indicated above in Example A.
  • as an example of a relative location, an audio location may indicate a chapter number, a paragraph number, and a line number.
  • the audio location may indicate a chapter number and a time offset into the chapter indicated by the chapter number.
  • each text location identifies a location in text data, such as a textual version of a work.
  • a text location may indicate an absolute location within the textual version of the work, a relative location within the textual version of the work, or a combination of an absolute location and a relative location.
  • a text location may indicate a byte offset into the textual version of the work and/or an "anchor" within the textual version of the work.
  • An anchor is metadata within the text data that identifies a specific location or portion of text. An anchor may be stored separate from the text in the text data that is displayed to an end-user or may be stored among the text that is displayed to an end-user.
  • a text location may indicate a page number, a chapter number, a paragraph number, and/or a line number.
  • a text location may indicate a chapter number and an anchor into the chapter indicated by the chapter number.
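One possible in-memory representation of the mapping records described above (a sketch only; the patent does not prescribe a storage format): an absolute audio time offset paired with a text location given as a chapter number plus an anchor within that chapter.

    from dataclasses import dataclass

    @dataclass
    class MappingRecord:
        audio_offset_seconds: float  # absolute location within the audio data
        chapter: int                 # relative location within the textual version
        anchor: str                  # anchor identifying a specific spot in that chapter

    mapping = [
        MappingRecord(23.0, 1, "sentence1"),
        MappingRecord(45.0, 1, "sentence2"),
    ]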
  • the "par” element includes two child elements: a "text” element and an "audio” element.
  • the text element comprises an attribute "src” that identifies a particular sentence within an XHTML document that contains content from the first chapter of a book.
  • the audio element comprises a "src” attribute that identifies an audio file that contains an audio version of the first chapter of the book, a "clipBegin” attribute that identifies where an audio clip within the audio file begins, and a "clipEnd” attribute that identifies where the audio clip within the audio file ends.
  • seconds 23 through 45 in the audio file correspond to the first sentence in Chapter 1 of the book.
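A "par" element of the kind described above might look like the following SMIL fragment; the file names and fragment identifier are hypothetical, and the clip times follow the seconds 23 through 45 example.

    <par id="par1">
      <text src="chapter1.xhtml#sentence1"/>
      <audio src="chapter1_audio.m4a" clipBegin="0:00:23" clipEnd="0:00:45"/>
    </par>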
  • a mapping between a textual version of a work and an audio version of the same work is automatically generated. Because the mapping is generated automatically, the mapping may use much finer granularity than would be practical using manual text-to-audio mapping techniques.
  • Each automatically-generated text-to-audio mapping includes multiple mapping records, each of which associates a text location in the textual version with an audio location in the audio version.
  • FIG. 1 is a flow diagram that depicts a process 100 for automatically creating a mapping between a textual version of a work and an audio version of the same work, according to an embodiment of the invention.
  • a speech-to-text analyzer receives audio data that reflects an audible version of the work.
  • the speech-to-text analyzer performs an analysis of the audio data
  • the speech-to-text analyzer generates text for portions of the audio data.
  • based on the text generated for the portions of the audio data, the speech-to-text analyzer generates a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the textual version of the work.
  • Step 130 may involve the speech-to-text analyzer comparing the generated text with text in the textual version of the work to determine where, within the textual version of the work, the generated text is located. For each portion of generated text that is found in the textual version of the work, the speech-to-text analyzer associates (1) an audio location that indicates where, within the audio data, the corresponding portion of audio data is found with (2) a text location that indicates where, within the textual version of the work, the portion of text is found.
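A minimal Python sketch of steps 110-130 under simplifying assumptions: a hypothetical transcription result supplies (word, audio offset) pairs, and each recognized word is matched against the next occurrence of that word in the textual version, yielding (audio location, text location) records.

    import re

    def generate_mapping(transcription, textual_version):
        # transcription: iterable of (word, audio_offset_seconds) pairs.
        words = [(m.group(0).lower(), m.start())
                 for m in re.finditer(r"\w+", textual_version)]
        mapping = []
        text_pos = 0  # current text location, as an index into `words`
        for word, audio_offset in transcription:
            for i in range(text_pos, len(words)):
                if words[i][0] == word.lower():
                    mapping.append((audio_offset, words[i][1]))  # (seconds, char offset)
                    text_pos = i + 1
                    break  # words with no match ahead of the cursor are skipped
        return mapping

    print(generate_mapping([("call", 0.0), ("me", 0.4), ("ishmael", 0.7)],
                           "Call me Ishmael. Some years ago..."))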
  • the textual context of a textual version of a work includes intrinsic characteristics of the textual version of the work (e.g., the language the textual version of the work is written in, the specific words that the textual version of the work uses, the grammar and punctuation that the textual version of the work uses, the way the textual version of the work is structured, etc.) and extrinsic characteristics of the work (e.g., the time period in which the work was created, the genre to which the work belongs, the author of the work, etc.).
  • one technique described herein automatically creates a fine granularity mapping between the audio version of a work and the textual version of the same work by performing a speech-to-text conversion of the audio version of the work.
  • the textual context of a work is used to increase the accuracy of the speech-to-text analysis that is performed on the audio version of the work.
  • the speech-to-text analyzer (or another process) may analyze the textual version of the work prior to performing a speech-to-text analysis. The speech-to-text analyzer may then make use of the grammar information thus obtained to increase the accuracy of the speech-to-text analysis of the audio version of the work.
  • a user may provide input that identifies one or more rules of grammar that are followed by the author of the work.
  • the rules associated with the identified grammar are input to the speech-to-text analyzer to assist the analyzer in recognizing words in the audio version of the work.
  • conventional speech-to-text analyzers must be configured or designed to recognize virtually every word in the English language and, optionally, some words in other languages. Therefore, conventional speech-to-text analyzers must have access to a large dictionary of words.
  • the dictionary from which a speech-to-text analyzer selects words during a speech-to-text operation is referred to herein as the "candidate dictionary" of the speech-to-text analyzer.
  • the number of unique words in a typical candidate dictionary is approximately 500,000.
  • text from the textual version of a work is taken into account when performing the speech-to-text analysis of the audio version of the work.
  • the candidate dictionary used by the speech-to-text analyzer is restricted to the specific set of words that are in the text version of the work.
  • the only words that are considered to be "candidates" during the speech-to-text operation performed on an audio version of a work are those words that actually appear in the textual version of the work.
  • the speech-to-text operation may be significantly improved. For example, assume that the number of unique words in a particular work is 20,000. A conventional speech-to-text analyzer may have difficulty determining to which specific word, of a 500,000 word candidate dictionary, a particular portion of audio corresponds. However, that same portion of audio may unambiguously correspond to one particular word when only the 20,000 unique words that are in the textual version of the work are considered. Thus, with a much smaller dictionary of possible words, the accuracy of the speech-to-text analyzer may be significantly improved.
  • the candidate dictionary may be restricted to even fewer words than all of the words in the textual version of the work.
  • the candidate dictionary is limited to those words found in a particular portion of the textual version of the work. For example, during a speech-to-text translation of a work, it is possible to approximately track the "current translation position" of the translation operation relative to the textual version of the work. Such tracking may be performed, for example, by comparing (a) the text that has been generated during the speech-to-text operation so far, against (b) the textual version of the work.
  • the candidate dictionary may be further restricted based on the current translation position.
  • the candidate dictionary is limited to only those words that appear, within the textual version of the work, after the current translation position.
  • words that are found prior to the current translation position, but not thereafter, are effectively removed from the candidate dictionary.
  • Such removal may increase the accuracy of the speech-to-text analyzer, since the smaller the candidate dictionary, the less likely the speech-to-text analyzer will translate a portion of audio data to the wrong word.
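A sketch of the candidate-dictionary restriction, assuming plain text and a hypothetical recognizer whose vocabulary can be set externally: the dictionary is the set of unique words in the work, optionally limited to the words at or after the current translation position.

    import re

    def candidate_dictionary(textual_version, current_char_offset=0):
        remaining = textual_version[current_char_offset:]
        return {w.lower() for w in re.findall(r"\w+", remaining)}

    text = "It was the best of times, it was the worst of times."
    full_vocabulary = candidate_dictionary(text)       # every unique word in the work
    later_vocabulary = candidate_dictionary(text, 26)  # only words at or after offset 26
    # recognizer.set_vocabulary(later_vocabulary)      # hypothetical engine call
    print(sorted(later_vocabulary))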
  • an audio book and a digital book may be divided into a number of segments or sections.
  • the audio book may be associated with an audio section mapping and the digital book may be associated with a text section mapping.
  • the audio section mapping and the text section mapping may identify where each chapter begins or ends.
  • if the speech-to-text analyzer determines, based on the audio section mapping, that it is analyzing the 4th chapter of the audio book, then the speech-to-text analyzer uses the text section mapping to identify the 4th chapter of the digital book and limits the candidate dictionary to the words found in the 4th chapter.
  • the speech-to-text analyzer employs a sliding window that moves as the current translation position moves.
  • the speech-to-text analyzer moves the sliding window "across" the textual version of the work.
  • the sliding window indicates two locations within the textual version of the work.
  • the boundaries of the sliding window may be (a) the start of the paragraph that precedes the current translation position and (b) the end of the third paragraph after the current translation position.
  • the candidate dictionary is restricted to only those words that appear between those two locations.
  • the window may span any amount of text within the textual version of the work.
  • the window may span an absolute amount of text, such as 60 characters.
  • the window may span a relative amount of text from the textual version of the work, such as ten words, three "lines" of text, two sentences, or one "page" of text.
  • the speech-to-text analyzer may use formatting data within the textual version of the work to determine how much of the textual version of the work constitutes a line or a page.
  • the textual version of a work may comprise a page indicator (e.g., in the form of an HTML or XML tag) that indicates, within the content of the textual version of the work, the beginning of a page or the ending of a page.
  • the start of the window corresponds to the current translation position.
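A sketch of a paragraph-based sliding window (one of the boundary choices mentioned above): the candidate dictionary is limited to the words between the start of the paragraph preceding the current translation position and the end of the third paragraph after it.

    import re

    def window_dictionary(paragraphs, current_paragraph_index):
        start = max(current_paragraph_index - 1, 0)                  # one paragraph back
        end = min(current_paragraph_index + 3, len(paragraphs) - 1)  # three paragraphs ahead
        window_text = " ".join(paragraphs[start:end + 1])
        return {w.lower() for w in re.findall(r"\w+", window_text)}

    paragraphs = ["Call me Ishmael.", "Some years ago, never mind how long,",
                  "having little or no money in my purse,", "I thought I would sail about.",
                  "There is nothing surprising in this."]
    print(sorted(window_dictionary(paragraphs, 1)))  # window spans paragraphs 0 through 4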
  • the speech-to-text analyzer maintains a current text location that indicates the most recently-matched word in the textual version of the work and maintains a current audio location that indicates the most recently-identified word in the audio data.
  • the narrator (whose voice is reflected in the audio data) may misread text of the textual version of the work, add his/her own content, or skip portions of the textual version of the work during the recording.
  • when that happens, the next word that the speech-to-text analyzer detects in the audio data (i.e., after the current audio location) may be matched against words at or near the current text location.
  • Maintaining both locations may significantly increase the accuracy of the speech-to-text translation.
  • a text-to-speech generator and an audio-to-text correlator are used to automatically create a mapping between the audio version of a work and the textual version of a work.
  • FIG. 2 is a block diagram that depicts these analyzers and the data used to generate the mapping.
  • Textual version 210 of a work (such as an EPUB document) is input to text-to-speech generator 220.
  • Text-to-speech generator 220 may be implemented in software, hardware, or a combination of hardware and software. Whether implemented in software or hardware, text-to-speech generator 220 may be implemented on a single computing device or may be distributed among multiple computing devices.
  • Text-to-speech generator 220 generates audio data 230 based on document 210. During the generation of the audio data 230, text-to-speech generator 220 (or another component not shown) creates an audio-to-document mapping 240. Audio-to-document mapping 240 maps multiple text locations within document 210 to corresponding audio locations within generated audio data 230.
  • assume that text-to-speech generator 220 generates audio data for a word located at location Y within document 210. Further assume that the audio data that was generated for the word is located at location X within audio data 230. To reflect the correlation between the location of the word within document 210 and the location of the corresponding audio in audio data 230, a mapping would be created between location X and location Y.
  • because text-to-speech generator 220 knows where a word or phrase occurs in document 210 when a corresponding word or phrase of audio is generated, each mapping between the corresponding words or phrases can be easily generated.
  • Audio-to-text correlator 260 accepts, as input, generated audio data 230, audio book 250, and audio-to-document mapping 240. Audio-to-text correlator 260 performs two main steps: an audio-to-audio correlation step and a look-up step. For the audio-to-audio correlation step, audio-to-text correlator 260 compares generated audio data 230 with audio book 250 to determine the correlation between portions of audio data 230 and portions of audio book 250. For example, audio-to-text correlator 260 may determine, for each word represented in audio data 230, the location of the corresponding word in audio book 250.
  • the granularity at which audio data 230 is divided, for the purpose of establishing correlations, may vary from implementation to implementation. For example, a correlation may be established between each word in audio data 230 and each corresponding word in audio book 250. Alternatively, a correlation may be established based on fixed-duration time intervals (e.g. one mapping for every 1 minute of audio). In yet another alternative, a correlation may be established for portions of audio established based on other criteria, such as at paragraph or chapter boundaries, significant pauses (e.g., silence of greater than 3 seconds), or other locations based on data in audio book 250, such as audio markers within audio book 250.
  • audio-to-text correlator 260 uses audio-to-document mapping 240 to identify a text location (indicated in mapping 240) that corresponds to the audio location within generated audio data 230. Audio-to-text correlator 260 then associates the text location with the audio location within audio book 250 to create a mapping record in document-to-audio mapping 270.
  • Audio-to-text correlator 260 repeatedly performs the audio-to-audio correlation and look-up steps for each portion of audio data 230. Therefore, document-to-audio mapping 270 comprises multiple mapping records, each mapping record mapping a location within document 210 to a location within audio book 250.
  • the audio-to-audio correlation for each portion of audio data 230 is immediately followed by the look-up step for that portion of audio.
  • a mapping record in document-to-audio mapping 270 may thus be created for each portion of audio data 230 prior to proceeding to the next portion of audio data 230.
  • the audio-to-audio correlation step may be performed for many or for all of the portions of audio data 230 before any look-up step is performed.
  • the look-up steps for all portions can be performed in a batch, after all of the audio-to-audio correlations have been established.
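A sketch of the composition performed by audio-to-text correlator 260, under strong assumptions about the inputs: audio_to_document maps offsets in the generated (text-to-speech) audio to text locations, audio_alignment maps those same generated-audio offsets to offsets in audio book 250, and composing the two yields document-to-audio mapping 270.

    def compose_mappings(audio_to_document, audio_alignment):
        # Both inputs are dicts keyed by generated-audio offset (seconds).
        document_to_audio_book = []
        for generated_offset, text_location in audio_to_document.items():
            if generated_offset in audio_alignment:
                book_offset = audio_alignment[generated_offset]
                document_to_audio_book.append((text_location, book_offset))
        return sorted(document_to_audio_book)

    audio_to_document = {0.0: 0, 2.1: 35, 4.8: 71}    # generated audio secs -> char offset
    audio_alignment = {0.0: 1.2, 2.1: 3.9, 4.8: 7.5}  # generated audio secs -> book secs
    print(compose_mappings(audio_to_document, audio_alignment))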
  • a mapping has a number of attributes, one of which is the mapping's size, which refers to the number of mapping records in the mapping. Another attribute of a mapping is the mapping's "granularity."
  • the "granularity" of a mapping refers to the number of mapping records in the mapping relative to the size of the digital work.
  • the granularity of a mapping may vary from one digital work to another digital work.
  • a first mapping for a digital book that comprises 200 "pages" includes a mapping record only for each paragraph in the digital book.
  • the first mapping may comprise 1000 mapping records.
  • a second mapping for a digital "children's" book that comprises 20 pages includes a mapping record for each word in the children's book.
  • the second mapping may comprise 800 mapping records.
  • the first mapping comprises more mapping records than the second mapping
  • the granularity of the second mapping is finer than the granularity of the first mapping.
  • the granularity of a mapping may be dictated based on input to a speech-to-text analyzer that generates the mapping. For example, a user may specify a specific granularity before causing a speech-to-text analyzer to generate a mapping.
  • non-limiting examples of specific granularities include word granularity, sentence granularity, and paragraph granularity.
  • a user may specify the type of digital work (e.g., novel, children's book, short story) and the speech-to-text analyzer (or another process) determines the granularity based on the work's type. For example, a children's book may be associated with word granularity while a novel may be associated with sentence granularity.
  • mappings for the first three chapters of a digital book may have sentence granularity while a mapping for the remaining chapters of the digital book have word granularity.
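A trivial sketch of choosing a granularity from the work's type; the particular type-to-granularity table is illustrative, not taken from the patent.

    GRANULARITY_BY_TYPE = {
        "children's book": "word",
        "novel": "sentence",
        "short story": "sentence",
    }

    def granularity_for(work_type):
        return GRANULARITY_BY_TYPE.get(work_type, "paragraph")  # assumed default

    print(granularity_for("children's book"))  # word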
  • an audio-to-text mapping is generated at runtime or after a user has begun to consume the audio data and/or the text data on the user's device. For example, a user reads a textual version of a digital book using a tablet computer. The tablet computer keeps track of the most recent page or section of the digital book that the tablet computer has displayed to the user. The most recent page or section is identified by a "text bookmark."
  • the playback device may be the same tablet computer on which the user was reading the digital book or another device.
  • the text bookmark is retrieved, and a speech-to-text analysis is performed relative to at least a portion of the audio book.
  • "temporary" mapping records are generated to establish a correlation between the generated text and the corresponding locations within the audio book.
  • the portion of the audio book on which the speech-to-text analysis is performed may be limited to the portion that corresponds to the text bookmark.
  • an audio section mapping may already exist that indicates where certain portions of the audio book begin and/or end.
  • an audio section mapping may indicate where each chapter begins, where one or more pages begin, etc. Such an audio section mapping may be helpful to determine where to begin the speech-to-text analysis so that a speech-to-text analysis on the entire audio book is not required to be performed.
  • If the text bookmark indicates a location within the 12th chapter of the digital book and an audio section mapping associated with the audio data identifies where the 12th chapter begins in the audio data, then a speech-to-text analysis is not required to be performed on any of the first 11 chapters of the audio book.
  • the audio data may consist of 20 audio files, one audio file for each chapter. Therefore, only the audio file that corresponds to the 12th chapter is input to a speech-to-text analyzer.
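A minimal sketch of the file-selection idea just described, assuming a hypothetical audio section mapping keyed by chapter number; the table shape and function name are illustrative only.

```python
def audio_files_to_analyze(bookmark_chapter: int,
                           audio_section_mapping: dict[int, str]) -> list[str]:
    """Return only the audio file(s) that need a speech-to-text pass.

    audio_section_mapping maps a chapter number to the audio file in which
    that chapter is narrated, e.g. {12: "chapter_12.m4a"}. If the bookmarked
    chapter is not listed, fall back to analyzing every file.
    """
    if bookmark_chapter in audio_section_mapping:
        return [audio_section_mapping[bookmark_chapter]]
    return list(audio_section_mapping.values())
```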
  • Mapping records can be generated on-the-fly to facilitate audio-to-text transitions, as well as text-to-audio transitions. For example, assume that a user is listening to an audio book using a smart phone. The smart phone keeps track of the current location within the audio book that is being played. The current location is identified by an "audio bookmark.” Later, the user picks up a tablet computer and selects a digital book version of the audio book to display.
  • the tablet computer receives the audio bookmark (e.g., from a central server that is remote relative to the tablet computer and the smart phone), performs a speech-to-text analysis of at least a portion of the audio book, and identifies, within the audio book, a portion that corresponds to a portion of text within a textual version of the audio book that corresponds to the audio bookmark.
  • the tablet computer then begins displaying the identified portion within the textual version.
  • the portion of the audio book on which the speech-to-text analysis is performed may be limited to the portion that corresponds to the audio bookmark.
  • a speech-to-text analysis is performed on a portion of the audio book that spans one or more time segments (e.g., seconds) prior to the audio bookmark in the audio book and/or one or more time segments after the audio bookmark in the audio book.
  • the text produced by the speech-to-text analysis on that portion is compared to text in the textual version to locate where the series of words or phrases in the produced text match text in the textual version.
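For illustration, a simple way to compare the words produced by the speech-to-text analysis against the textual version is an exact run-of-words search, sketched below. A production matcher would tolerate recognition errors; this version, and its function name, are assumptions.

```python
def locate_transcript_in_text(transcribed_words: list[str],
                              text_words: list[str]) -> int | None:
    """Find where the word sequence produced by speech-to-text (taken from a
    window around the audio bookmark) occurs in the textual version. Returns
    the index into text_words of the first matching word, or None if the
    sequence is not found. Matching is case-insensitive and exact.
    """
    needle = [w.lower() for w in transcribed_words]
    haystack = [w.lower() for w in text_words]
    n = len(needle)
    if n == 0:
        return None
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return None
```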
  • a mapping (whether created manually or automatically) may be used in a variety of scenarios, examples of which follow.
  • a mapping may be used to identify a location within an e-book based on a "bookmark" established in an audio book.
  • a mapping may be used to identify which displayed text corresponds to an audio recording of a person reading the text as the audio recording is being played and cause the identified text to be highlighted.
  • a mapping may be used to identify a location in audio data and play audio at that location in response to input that selects displayed text from an e-book.
  • a user may select a word in an e-book, which selection causes audio that corresponds to that word to be played.
  • a user may create an annotation while "consuming" (e.g., reading or listening to) one version of a digital work (e.g., an e-book) and cause the annotation to be consumed while the user is consuming another version of the digital work (e.g., an audio book).
  • a user can make notes on a "page" of an e-book and may view those notes while listening to an audio book of the e-book.
  • a user can make a note while listening to an audio book and then can view that note when reading the corresponding e- book.
  • FIG. 3 is a flow diagram that depicts a process for using a mapping in one or more of these scenarios, according to an embodiment of the invention.
  • At step 310, location data that indicates a specified location within a first media item is obtained.
  • the first media item may be a textual version of a work or audio data that corresponds to a textual version of the work.
  • This step may be performed by a device (operated by a user) that consumes the first media item.
  • the step may be performed by a server that is located remotely relative to the device that consumes the first media item.
  • the device sends the location data to the server over a network using a communication protocol.
  • a mapping is inspected to determine a first media location that corresponds to the specified location. Similarly, this step may be performed by a device that consumes the first media item or by a server that is located remotely relative to the device.
  • a second media location that corresponds to the first media location and that is indicated in the mapping is determined. For example, if the specified location is an audio "bookmark", then the first media location is an audio location indicated in the mapping and the second media location is a text location that is associated with the audio location in the mapping. Similarly, if the specified location is a text "bookmark", then the first media location is a text location indicated in the mapping and the second media location is an audio location that is associated with the text location in the mapping.
  • the second media item is processed based on the second media location. For example, if the second media item is audio data, then the second media location is an audio location and is used as a current playback position in the audio data. As another example, if the second media item is a textual version of a work, then the second media location is a text location and is used to determine which portion of the textual version of the work to display.
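One direction of the FIG. 3 flow (audio location in, text location out) might look like the sketch below. The (audio offset, text location) record shape is an assumption; the reverse direction is symmetric.

```python
from bisect import bisect_right


def text_location_for_audio(mapping: list[tuple[float, str]],
                            specified_audio_secs: float) -> str | None:
    """Inspect the mapping for the record at or immediately before the
    specified audio location and return the paired text location, which a
    reader application can then use to decide which portion to display.
    The mapping is assumed sorted by audio offset.
    """
    if not mapping:
        return None
    audio_keys = [audio for audio, _ in mapping]
    i = max(bisect_right(audio_keys, specified_audio_secs) - 1, 0)
    return mapping[i][1]
```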
  • FIG. 4 is a block diagram that depicts an example system 400 that may be used to implement some of the processes described herein, according to an embodiment of the invention.
  • System 400 includes end-user device 410, intermediary device 420, and end-user device 430.
  • Non-limiting examples of end-user devices 410 and 430 include desktop computers, laptop computers, smart phones, tablet computers, and other handheld computing devices.
  • device 410 stores a digital media item 402 and executes a text media player 412 and an audio media player 414.
  • Text media player 412 is configured to process electronic text data and cause device 410 to display text (e.g., on a touch screen of device 410, not shown).
  • If digital media item 402 is an e-book, then text media player 412 may be configured to process digital media item 402, as long as digital media item 402 is in a text format that text media player 412 is configured to process.
  • Device 410 may execute one or more other media players (not shown) that are configured to process other types of media, such as video.
  • audio media player 414 is configured to process audio data and cause device 410 to generate audio (e.g., via speakers on device 410, not shown).
  • If digital media item 402 is an audio book, then audio media player 414 may be configured to process digital media item 402, as long as digital media item 402 is in an audio format that audio media player 414 is configured to process.
  • item 402 may comprise multiple files, whether audio files or text files.
  • Device 430 similarly stores a digital media item 404 and executes an audio media player 432 that is configured to process audio data and cause device 430 to generate audio.
  • Device 430 may execute one or more other media players (not shown) that are configured to process other types of media, such as video and text.
  • Intermediary device 420 stores a mapping 406 that maps audio locations within audio data to text locations in text data. For example, mapping 406 may map audio locations within digital media item 404 to text locations within digital media item 402. Although not depicted in FIG. 4, intermediary device 420 may store many mappings, one for each corresponding set of audio data and text data. Also, intermediary device 420 may interact with many end-user devices that are not shown.
  • intermediary device 420 may store digital media items that users may access via their respective devices.
  • For example, a device (e.g., device 430) may request a particular digital media item from intermediary device 420.
  • intermediary device 420 may store account data that associates one or more devices of a user with a single account. Thus, such account data may indicate that devices 410 and 430 are registered by the same user under the same account. Intermediary device 420 may also store account-item association data that associates an account with one or more digital media items owned (or purchased) by a particular user. Thus, intermediary device 420 may verify that device 430 may access a particular digital media item by determining whether the account-item association data indicates that device 430 and the particular digital media item are associated with the same account.
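A minimal sketch of the account-based access check just described. The two lookup tables are illustrative stand-ins for the account data and the account-item association data; the function and parameter names are invented.

```python
def device_may_access(device_id: str, item_id: str,
                      device_accounts: dict[str, str],
                      account_items: dict[str, set[str]]) -> bool:
    """Return True only if the device and the digital media item are
    associated with the same account, mirroring the verification performed
    by the intermediary device.
    """
    account = device_accounts.get(device_id)
    return account is not None and item_id in account_items.get(account, set())
```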
  • an end-user may own and operate more or fewer devices that consume digital media items, such as e-books and audio books.
  • Similarly, the entity that owns and operates intermediary device 420 may operate multiple devices, each of which provides the same service or which operate together to provide a service to the users of end-user devices 410 and 430.
  • Network 440 may be implemented by any medium or mechanism that provides for the exchange of data between various computing devices.
  • Examples of such a network include, without limitation, a network such as a Local Area Network (LAN), Wide Area Network (WAN), Ethernet or the Internet, or one or more terrestrial, satellite, or wireless links.
  • the network may include a combination of networks such as those described.
  • the network may transmit data according to Transmission Control Protocol (TCP), User Datagram Protocol (UDP), and/or Internet Protocol (IP).
  • mapping 406 may be stored separate from the text data and the audio data from which the mapping was generated. For example, as depicted in FIG. 4, mapping 406 is stored separate from digital media items 402 and 404 even though mapping 406 may be used to identify a media location in one digital media item based on a media location in the other digital media item. In fact, mapping 406 may be stored on a computing device that is separate from the devices that store digital media items 402 and 404.
  • Additionally or alternatively, a mapping may be stored as part of the text data. For example, mapping 406 may be stored in digital media item 402. However, even if the mapping is stored as part of the text data, the mapping may not be displayed to an end-user that consumes the text data. Additionally or alternatively still, a mapping may be stored as part of the audio data. For example, mapping 406 may be stored in digital media item 404.

BOOKMARK SWITCHING
  • bookmark switching refers to establishing a specified location (or "bookmark") in one version of a digital work and using the bookmark to find the corresponding location within another version of the digital work.
  • TA bookmark switching involves using a text bookmark established in an e-book to identify a corresponding audio location in an audio book.
  • AT bookmark switching involves using an audio bookmark established in an audio book to identify a corresponding text location within an e-book.
  • FIG. 5A is a flow diagram that depicts a process 500 for TA bookmark switching, according to an embodiment of the invention.
  • FIG. 5A is described using elements of system 400 depicted in FIG. 4.
  • a text media player 412 determines a text bookmark within digital media item 402 (e.g., a digital book).
  • Device 410 displays content from digital media item 402 to a user of device 410.
  • the text bookmark may be determined in response to input from the user. For example, the user may touch an area on a touch screen of device 410, where the display at or near that area shows one or more words. In response to the input, the text media player 412 determines the one or more words that are closest to the area. The text media player 412 determines the text bookmark based on the determined one or more words.
  • the text bookmark may be determined based on the last text data that was displayed to the user.
  • the digital media item 402 may comprise 200 electronic "pages" and page 110 was the last page that was displayed.
  • Text media player 412 determines that page 110 was the last page that was displayed.
  • Text media player 412 may establish page 110 as the text bookmark or may establish a point at the beginning of page 110 as the text bookmark, since there may be no way to know where the user stopped reading. It may be safe to assume that the user at least read the last sentence on page 109, which sentence may have ended on page 109 or on page 110. Therefore, the text media player 412 may establish the beginning of the next sentence (which begins on page 110) as the text bookmark.
  • If the granularity of the mapping is at the paragraph level, then text media player 412 may establish the beginning of the last paragraph on page 109 as the text bookmark. Similarly, if the granularity of the mapping is at the chapter level, then text media player 412 may establish the beginning of the chapter that includes page 110 as the text bookmark.
  • text media player 412 sends, over network 440 to intermediary device 420, data that indicates the text bookmark.
  • Intermediary device 420 may store the text bookmark in association with device 410 and/or an account of the user of device 410.
  • the user may have established an account with an operator of intermediary device 420.
  • the user then registered one or more devices, including device 410, with the operator.
  • the registration caused each of the one or more devices to be associated with the user's account.
  • One or more factors may cause the text media player 412 to send the text bookmark to intermediary device 420. Such factors may include the exiting (or closing down) of text media player 412, the establishment of the text bookmark by the user, or an explicit instruction by the user to save the text bookmark for use when listening to the audio book that corresponds to the textual version of the work for which the text bookmark is established.
  • intermediary device 420 has access to (e.g., stores) mapping 406, which, in this example, maps multiple audio locations in digital media item 404 with multiple text locations within digital media item 402.
  • intermediary device 420 inspects mapping 406 to determine a particular text location, of the multiple text locations, that corresponds to the text bookmark.
  • the text bookmark may not exactly match any of the multiple text locations in mapping 406.
  • intermediary device 420 may select the text location that is closest to the text bookmark.
  • intermediary device 420 may select the text location that is immediately before the text bookmark, which text location may or may not be the closest text location to the text bookmark.
  • For example, if the text bookmark indicates the 5th chapter, 3rd paragraph, 5th sentence, and the closest text locations in mapping 406 are (1) 5th chapter, 3rd paragraph, 1st sentence and (2) 5th chapter, 3rd paragraph, 6th sentence, then text location (1) is selected, even though text location (2) is closer to the text bookmark.
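The "immediately before" selection described above, combined with the lookup of the paired audio location, could be sketched as follows. The tuple-based text locations and record layout are assumptions for illustration.

```python
def audio_for_text_bookmark(mapping: list[tuple[tuple[int, int, int], float]],
                            bookmark: tuple[int, int, int]) -> float:
    """Select the mapping record whose text location (chapter, paragraph,
    sentence) is at or immediately before the text bookmark, then return the
    audio offset paired with it. Assumes the mapping is non-empty.
    """
    candidates = [(loc, audio) for loc, audio in mapping if loc <= bookmark]
    if not candidates:
        # Bookmark precedes every record; fall back to the first record.
        return mapping[0][1]
    _, audio = max(candidates)  # largest location that is not after the bookmark
    return audio
```

With a bookmark of (5, 3, 5) and records at (5, 3, 1) and (5, 3, 6), the record at (5, 3, 1) is chosen, matching the example above.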
  • intermediary device 420 determines a particular audio location, in mapping 406, that corresponds to the particular text location.
  • intermediary device 420 sends the particular audio location to device 430, which, in this example, is different than device 410.
  • device 410 may be a tablet computer and the device 430 may be a smart phone. In a related embodiment, device 430 is not involved. Thus, intermediary device 420 may send the particular audio location to device 410.
  • Step 510 may be performed automatically, i.e., in response to intermediary device 420 determining the particular audio location.
  • step 510 (or step 506) may be performed in response to receiving, from device 430, an indication that device 430 is about to process digital media item 404. The indication may be a request for an audio location that corresponds to the text bookmark.
  • audio media player 432 establishes the particular audio location as a current playback position of the audio data in digital media item 404. This establishment may be performed in response to receiving the particular audio location from intermediary device 420. Because the current playback position becomes the particular audio location, audio media player 432 is not required to play any of the audio that precedes the particular audio location in the audio data. For example, if the particular audio location indicates 2:56:03 (2 hours, 56 minutes, and 3 seconds), then audio media player 432 establishes that time in the audio data as the current playback position. Thus, if the user of device 430 selects a "play" button (whether graphical or physical) on device 430, then audio media player 432 begins processing the audio data at that 2:56:03 mark.
  • device 410 stores mapping 406 (or a copy thereof). Therefore, in place of steps 504-508, text media player 412 inspects mapping 406 to determine a particular text location, of the multiple text locations, that corresponds to the text bookmark. Then, text media player 412 determines a particular audio location, in mapping 406, that corresponds to the particular text location. The text media player 412 may then cause the particular audio location to be sent to intermediary device 420 to allow device 430 to retrieve the particular audio location and establish a current playback position in the audio data to be the particular audio location.
  • Text media player 412 may also cause the particular text location (or text bookmark) to be sent to intermediary device 420 to allow device 410 (or another device, not shown) to later retrieve the particular text location to allow another text media player executing on the other device to display a portion (e.g., a page) of another copy of digital media item 402, where the portion corresponds to the particular text location.
  • intermediary device 420 and device 430 are not involved. Thus, steps 504 and 510 are not performed. Thus, device 410 performs all other steps in FIG. 5A, including steps 506 and 508.
  • FIG. 5B is a flow diagram that depicts a process 550 for AT bookmark switching, according to an embodiment of the invention. As with FIG. 5A, FIG. 5B is described using elements of system 400 depicted in FIG. 4.
  • At step 552, audio media player 432 determines an audio bookmark within digital media item 404 (e.g., an audio book).
  • the audio bookmark may be determined in response to input from the user. For example, the user may stop the playback of the audio data, for example, by selecting a "stop" button that is displayed on a touch screen of device 430. Audio media player 432 determines the location within audio data of digital media item 404 that corresponds to where playback stopped. Thus, the audio bookmark may simply be the last place where the user stopped listening to the audio generated from digital media item 404. Additionally or alternatively, the user may select one or more graphical buttons on the touch screen of device 430 to establish a particular location within digital media item 404 as the audio bookmark. For example, device 430 displays a timeline that corresponds to the length of the audio data in digital media item 404. The user may select a position on the timeline and then provide one or more additional inputs that are used by audio media player 432 to establish the audio bookmark.
  • At step 554, device 430 sends, over network 440 to intermediary device 420, data that indicates the audio bookmark.
  • the intermediary device 420 may store the audio bookmark in association with device 430 and/or an account of the user of device 430.
  • the user established an account with an operator of intermediary device 420.
  • the user then registered one or more devices, including device 430, with the operator.
  • the registration caused each of the one or more devices to be associated with the user's account.
  • Intermediary device 420 also has access to (e.g., stores) mapping 406.
  • Mapping 406 maps multiple audio locations in the audio data of digital media item 404 with multiple text locations within text data of digital media item 402.
  • One or more factors may cause audio media player 432 to send the audio bookmark to intermediary device 420. Such factors may include the exiting (or closing down) of audio media player 432, the establishment of the audio bookmark by the user, or an explicit instruction by the user to save the audio bookmark for use when displaying portions of the textual version of the work (reflected in digital media item 402) that corresponds to digital media item 404, for which the audio bookmark is established.
  • intermediary device 420 inspects mapping 406 to determine a particular audio location, of the multiple audio locations, that corresponds to the audio bookmark.
  • the audio bookmark may not exactly match any of the multiple audio locations in mapping 406.
  • intermediary device 420 may select the audio location that is closest to the audio bookmark.
  • intermediary device 420 may select the audio location that is immediately before the audio bookmark, which audio location may or may not be the closest audio location to the audio bookmark. For example, if the audio bookmark indicates 02:43:19 (2 hours, 43 minutes, and 19 seconds) and the closest audio locations in mapping 406 are (1) 02:41:07 and (2) 02:43:56, then audio location (1) is selected, even though audio location (2) is closest to the audio bookmark.
  • intermediary device 420 determines a particular text location, in mapping 406, that corresponds to the particular audio location.
  • intermediary device 420 sends the particular text location to device 410, which, in this example, is different than device 430.
  • device 410 may be a tablet computer and device 430 may be a smart phone that is configured to process audio data and generate audible sounds.
  • Step 560 may be performed automatically, i.e., in response to intermediary device 420 determining the particular text location.
  • step 560 (or step 556) may be performed in response to receiving, from device 410, an indication that device 410 is about to process the digital media item 402.
  • the indication may be a request for a text location that corresponds to the audio bookmark.
  • text media player 412 displays information about the particular text location.
  • Step 562 may be performed in response to receiving the particular text location from intermediary device 420.
  • Device 410 is not required to display any of the content that precedes the particular text location in the textual version of the work reflected in digital media item 402. For example, if the particular text location indicates Chapter 3, paragraph 2, sentence 4, then device 410 displays a page that includes that sentence.
  • Text media player 412 may cause a marker to be displayed at the particular text location in the page that visually indicates, to a user of device 410, where to begin reading in the page.
  • the user is able to immediately read the textual version of the work beginning at a location that corresponds to the last words spoken by a narrator in the audio book.
  • device 410 stores mapping 406. Therefore, in place of steps 556-560, after step 554 (in which device 430 sends data that indicates the audio bookmark to intermediary device 420), intermediary device 420 sends the audio bookmark to device 410. Then, text media player 412 inspects mapping 406 to determine a particular audio location, of the multiple audio locations, that corresponds to the audio bookmark. Then, text media player 412 determines a particular text location, in mapping 406, that corresponds to the particular audio location. This alternative process then proceeds to step 562, described above. In another alternative embodiment, intermediary device 420 is not involved.
  • steps 554 and 560 are not performed.
  • device 430 performs all other steps in FIG. 5B, including steps 556 and 558.
  • text from a portion of a textual version of a work is highlighted or "lit up" while audio data that corresponds to the textual version of the work is played.
  • the audio data is an audio version of a textual version of the work and may reflect a reading, of text from the textual version, by a human user.
  • highlighting text refers to a media player (e.g., an "e-reader") visually distinguishing that text from other text that is concurrently displayed with the highlighted text.
  • Highlighting text may involve changing the font of the text, changing the font style of the text (e.g., italicize, bold, underline), changing the size of the text, changing the color of the text, changing the background color of the text, or creating an animation associated with the text.
  • An example of creating an animation is causing the text (or background of the text) to blink on and off or to change colors.
  • Another example of creating an animation is creating a graphic to appear above, below, or around the text.
  • For example, when the word "toaster" is detected in the audio data being played, the media player displays a toaster image above the word "toaster" in the displayed text.
  • Another example of an animation is a bouncing ball that "bounces" on a portion of text (e.g., word, syllable, or letter) when that portion is detected in audio data that is played.
  • FIG. 6 is a flow diagram that depicts a process 600 for causing text, from a textual version of a work, to be highlighted while an audio version of the work is being played, according to an embodiment of the invention.
  • the current playback position (which is constantly changing) of audio data of the audio version is determined. This step may be performed by a media player executing on a user's device. The media player processes the audio data to generate audio for the user.
  • a mapping record in a mapping is identified.
  • the current playback position may match or nearly match the audio location identified in the mapping record.
  • Step 620 may be performed by the media player if the media player has access to a mapping that maps multiple audio locations in the audio data with multiple text locations in the textual version of the work.
  • step 620 may be performed by another process executing on the user's device or by a server that receives the current playback position from the user's device over a network.
  • At step 630, the text location identified in the mapping record is retrieved.
  • At step 640, a portion of the textual version of the work that corresponds to the text location is caused to be highlighted. This step may be performed by the media player or another software application executing on the user's device. If a server performs the look-up steps (620 and 630), then step 640 may further involve the server sending the text location to the user's device. In response, the media player, or another software application, accepts the text location as input and causes the corresponding text to be highlighted.
  • different text locations in a mapping may be associated with different types of highlighting.
  • one text location in the mapping may be associated with the changing of the font color from black to red while another text location in the mapping may be associated with an animation, such as a toaster graphic that shows a piece of toast "popping" out of a toaster.
  • each mapping record in the mapping may include "highlighting data" that indicates how the text identified by the corresponding text location is to be highlighted.
  • the media player uses the highlighting data to determine how to highlight the text. If a mapping record does not include highlighting data, then the media player may not highlight the corresponding text. Alternatively, if a mapping record in the mapping does not include highlighting data, then the media player may use a "default" highlight technique (e.g., bolding the text) to highlight the text.
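A possible dispatch on per-record highlighting data, with a default of bolding when no highlighting data is present, is sketched below. The HTML-like output and the specific highlighting_data values are invented for illustration.

```python
def apply_highlight(text_span: str, highlighting_data: str | None) -> str:
    """Render a highlighted span as simple HTML-like markup.

    A record with no highlighting data falls back to a default technique
    (bolding); unrecognized techniques leave the text unchanged.
    """
    if highlighting_data is None:
        return f"<b>{text_span}</b>"  # default highlight technique
    if highlighting_data == "font-color:red":
        return f'<span style="color:red">{text_span}</span>'
    if highlighting_data == "animation:toaster-pop":
        return f'<span class="toaster-pop">{text_span}</span>'
    return text_span
```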
  • FIG. 7 is a flow diagram that depicts a process 700 of highlighting displayed text in response to audio input from a user, according to an embodiment of the invention.
  • a mapping is not required.
  • the audio input is used to highlight text in a portion of a textual version of a work that is concurrently displayed to the user.
  • audio input is received.
  • the audio input may be based on a user reading aloud text from a textual version of a work.
  • the audio input may be received by a device that displays a portion of the textual version.
  • the device may prompt the user to read aloud a word, phrase, or entire sentence.
  • the prompt may be visual or audio.
  • As an example of a visual prompt, the device may cause the following text to be displayed: "Please read the underlined text" while or immediately before the device displays a sentence that is underlined.
  • As an example of an audio prompt, the device may cause a computer-generated voice to read "Please read the underlined text" or cause a pre-recorded human voice to be played, where the pre-recorded human voice provides the same instruction.
  • a speech-to-text analysis is performed on the audio input to detect one or more words reflected in the audio input.
  • At step 730, for each detected word reflected in the audio input, that detected word is compared to a particular set of words.
  • the particular set of words may be all the words that are currently displayed by a computing device (e.g., an e-reader). Alternatively, the particular set of words may be all the words that the user was prompted to read.
  • At step 740, for each detected word that matches a word in the particular set, the device causes that matching word to be highlighted.
  • the steps depicted in process 700 may be performed by a single computing device that displays text from a textual version of a work. Alternatively, the steps depicted in process 700 may be performed by one or more computing devices that are different than the computing device that displays text from the textual version.
  • the audio input from a user in step 710 may be sent from the user's device over a network to a network server that performs the speech-to-text analysis. The network server may then send highlight data to the user's device to cause the user's device to highlight the appropriate text.
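Steps 730 and 740 amount to a set comparison between detected and displayed (or prompted) words, sketched below. Normalization details are an assumption; a real system would likely also handle ordering and repeated words.

```python
def words_to_highlight(detected_words: list[str],
                       displayed_words: list[str]) -> set[str]:
    """Compare each word detected in the user's speech against the set of
    currently displayed (or prompted) words and return the matches to be
    highlighted. Comparison is case-insensitive and ignores simple
    punctuation.
    """
    def normalize(w: str) -> str:
        return w.lower().strip(".,!?;:\"'")

    displayed = {normalize(w) for w in displayed_words}
    return {normalize(w) for w in detected_words if normalize(w) in displayed}
```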
  • a user of a media player that displays portions of a textual version of a work may select portions of displayed text and cause the corresponding audio to be played. For example, if a displayed word from the digital book is "donut" and the user selects that word (e.g., by touching a portion of the media player's touch screen that displays that word), then the audio of "donut" may be played.
  • a mapping that maps text locations in a textual version of the work with audio locations in audio data is used to identify the portion of the audio data that corresponds to the selected text.
  • the user may select a single word, a phrase, or even one or more sentences.
  • the media player may identify one or more text locations. For example, the media player may identify a single text location that corresponds to the selected portion, even if the selected portion comprises multiple lines or sentences. The identified text location may correspond to the beginning of the selected portion. As another example, the media player may identify a first text location that corresponds to the beginning of the selected portion and a second text location that corresponds to the ending of the selected portion.
  • the media player uses the identified text location to look up a mapping record in the mapping that indicates a text location that is closest (or closest prior) to the identified text location.
  • the media player uses the audio location indicated in the mapping record to identify where, in the audio data, to begin processing the audio data in order to generate audio. If only a single text location is identified, then only the word or sounds at or near the audio location may be played. Thus, after the word or sounds are played, the media player ceases to play any more audio.
  • the media player begins playing at or near the audio location and does not cease playing the audio that follows the audio location until (a) the end of the audio data is reached, (b) further input from the user (e.g., selection of a "stop” button), or (c) a pre-designated stopping point in the audio data (e.g., end of a page or chapter that requires further input to proceed).
  • the media player identifies two text locations based on the selected portion, then two audio locations are identified and may be used to identify where to begin playing and where to stop playing the corresponding audio.
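A sketch of mapping a selected text span to an audio start/stop pair, using the closest-prior record for the start and the closest-following record for the stop. The character-offset record shape and function name are illustrative assumptions.

```python
def playback_range_for_selection(mapping: list[tuple[int, float]],
                                 start_char: int,
                                 end_char: int) -> tuple[float, float]:
    """Map the start and end of a selected text span (character offsets) to
    an audio start/stop pair. Each record is (char_offset, audio_offset_secs);
    the mapping is assumed sorted by char_offset and non-empty.
    """
    start_audio = mapping[0][1]
    stop_audio = mapping[-1][1]
    for char_offset, audio_secs in mapping:
        if char_offset <= start_char:
            start_audio = audio_secs   # closest record at or before the selection start
        if char_offset >= end_char:
            stop_audio = audio_secs    # first record at or after the selection end
            break
    return start_audio, stop_audio
```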
  • the audio data identified by the audio location may be played slowly (i.e., at a slow playback speed) or continuously without advancing the current playback position in the audio data. For example, if a user of a tablet computer selects the displayed word "two" by touching a touch screen of the tablet computer with his finger and continuously touches the displayed word (i.e., without lifting his finger and without moving his finger to another displayed word), then the tablet computer plays the corresponding audio creating a sound reflected by reading the word "twoooooooooooooooo".
  • the speed at which a user drags her finger across displayed text on a touch screen of a media player causes the corresponding audio to be played at the same or similar speed. For example, a user selects the letter "d" of the displayed word “donut” and then slowly moves his finger across the displayed word.
  • the media player identifies the corresponding audio data (using the mapping) and plays the corresponding audio at the same speed at which the user moves his finger. Therefore, the media player creates audio that sounds as if the reader of the text of the textual version of the work pronounced the word "donut" as "dooooooonnnnnnnuuuuuuut.”
  • the time that a user "touches" a word displayed on a touch screen dictates how quickly or slowly the audio version of the word is played. For example, a quick tap of a displayed word by the user's finger causes the corresponding audio to be played at a normal speed, whereas the user holding down his finger on the selected word for more than 1 second causes the corresponding audio to be played at half the normal speed.
  • a user initiates the creation of annotations to one media version (e.g., audio) of a digital work and causes the annotations to be associated with another media version (e.g., text) of the digital work.
  • an annotation may be created in the context of one type of media
  • the annotation may be consumed in the context of another type of media.
  • the "context" in which an annotation is created or consumed refers to whether text is being displayed or audio is being played when the creation or consumption occurs.
  • an indication of the annotation may be displayed, by a device, at the beginning or the end of the corresponding textual version or on each "page" of the corresponding textual version.
  • the text that is displayed when an annotation is created in the text context is not used when consuming the annotation in the audio context.
  • an indication of the annotation may be displayed, by a device, at the beginning or end of the corresponding audio version or continuously while the corresponding audio version is being played.
  • an audio indication of the annotation may be played. For example, a "beep" is played simultaneously with the audio track in such a way that both the beep and the audio track can be heard.
  • FIGs. 8A-B are flow diagrams that depict processes for transferring an annotation from one context to another, according to an embodiment of the invention.
  • FIG. 8A is a flow diagram that depicts a process 800 for creating an annotation in the "text" context and consuming the annotation in the "audio" context.
  • FIG. 8B is a flow diagram that depicts a process 850 for creating an annotation in the "audio” context and consuming the annotation in the "text” context.
  • the creation and consumption of an annotation may occur on the same computing device (e.g., device 410) or on separate computing devices (e.g., devices 410 and 430).
  • FIG. 8A describes a scenario where the annotation is created and consumed on device 410
  • FIG. 8B describes a scenario where the annotation is created on device 430 and later consumed on device 410.
  • text media player 412, executing on device 410, causes text (e.g., in the form of a page) from digital media item 402 to be displayed.
  • text media player 412 determines a text location within a textual version of the work reflected in digital media item 402.
  • the text location is eventually stored in association with an annotation.
  • the text location may be determined in a number of ways.
  • text media player 412 may receive input that selects the text location within the displayed text.
  • the input may be a user touching a touch screen (that displays the text) of device 410 for a period of time.
  • the input may select a specific word, a number of words, the beginning or ending of a page, before or after a sentence, etc.
  • the input may also include first selecting a button, which causes text media player 412 to change to a "create annotation" mode where an annotation may be created and associated with the text location.
  • text media player 412 determines the text location automatically (without user input) based on which portion of the textual version of the work (reflected in digital media item 402) is being displayed. For example, if device 410 is displaying page 20 of the textual version of the work, then the annotation will be associated with page 20.
  • At step 806, text media player 412 receives input that selects a "Create Annotation" button that may be displayed on the touch screen. Such a button may be displayed in response to input in step 804 that selects the text location, where, for example, the user touches the touch screen for a period of time, such as one second.
  • Although step 804 is depicted as occurring before step 806, the selection of the "Create Annotation" button may alternatively occur prior to the determination of the text location.
  • text media player 412 receives input that is used to create annotation data.
  • the input may be voice data (such as the user speaking into a microphone of device 410) or text data (such as the user selecting keys on a keyboard, whether physical or graphical). If the annotation data is voice data, text media player 412 (or another process) may perform speech-to-text analysis on the voice data to create a textual version of the voice data.
  • text media player 412 stores the annotation data in association with the text location.
  • Text media player 412 uses a mapping (e.g., a copy of mapping 406) to identify a particular text location, in the mapping, that is closest to the text location. Then, using the mapping, text media player 412 identifies an audio location that corresponds to the particular text location.
  • text media player 412 sends, over network 440 to intermediary device 420, the annotation data and the text location.
  • intermediary device 420 stores the annotation data in association with the text location.
  • Intermediary device 420 uses a mapping (e.g., mapping 406) to identify a particular text location, in mapping 406, that is closest to the text location. Then, using mapping 406, intermediary device 420 identifies an audio location that corresponds to the particular text location.
  • Intermediary device 420 sends the identified audio location over network 440 to device 410.
  • Intermediary device 420 may send the identified audio location in response to a request, from device 410, for certain audio data and/or for annotations associated with certain audio data. For example, in response to a request for an audio book version of "The Tale of Two Cities", intermediary device 420 determines whether there is any annotation data associated with that audio book and, if so, sends the annotation data to device 410.
  • Step 810 may also comprise storing date and/or time information that indicates when the annotation was created. This information may be displayed later when the annotation is consumed in the audio context.
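One possible shape for the stored annotation data, carrying the text location it was attached to, the audio location derived from the mapping, and the creation timestamp mentioned above, is sketched here. All field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Annotation:
    """An annotation captured in the text context, to be consumed later in
    the audio context."""
    annotation_data: str                     # typed text, or a path to recorded voice data
    text_location: str                       # e.g. "page20" or "ch5.para3.sent1"
    audio_offset_secs: float | None = None   # filled in via the mapping
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```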
  • audio media player 414 plays audio by processing audio data of digital media item 404, which, in this example (although not shown), may be stored on device 410 or may be streamed to device 410 from intermediary device 420 over network 440.
  • audio media player 414 determines when the current playback position in the audio data matches or nearly matches the audio location identified in step 810 using mapping 406.
  • Alternatively, audio media player 414 may cause data that indicates that an annotation is available to be displayed, regardless of where the current playback position is located and without having to play any audio as indicated in step 812. In other words, step 812 is not required in this alternative.
  • a user may launch audio media player 414 and cause audio media player 414 to load the audio data of digital media item 404.
  • Audio media player 414 determines that annotation data is associated with the audio data. Audio media player 414 causes information about the audio data (e.g., title, artist, genre, length, etc.) to be displayed without generating any audio associated with the audio data.
  • the information may include a reference to the annotation data and information about a location within the audio data that is associated with the annotation data, where the location corresponds to the audio location identified in step 810.
  • audio media player 414 consumes the annotation data. If the annotation data is voice data, then consuming the annotation data may involve processing the voice data to generate audio or converting the voice data to text data and displaying the text data. If the annotation data is text data, then consuming the annotation data may involve displaying the text data, for example, in a side panel of a GUI that displays attributes of the audio data that is played or in a new window that appears separate from the GUI.
  • Non- limiting examples of attributes include time length of the audio data, the current playback position, which may indicate an absolute location within the audio data (e.g., a time offset) or a relative position within the audio data (e.g., chapter or section number), a waveform of the audio data, and title of the digital work.
  • FIG. 8B describes a scenario, as noted previously, where an annotation is created on device 430 and later consumed on device 410.
  • audio media player 432 processes audio data from digital media item 404 to play audio.
  • audio media player 432 determines an audio location within the audio data.
  • the audio location is eventually stored in association with an annotation.
  • the audio location may be determined in a number of ways.
  • audio media player 432 may receive input that selects the audio location within the audio data.
  • the input may be a user touching a touch screen (that displays attributes of the audio data) of device 430 for a period of time.
  • the input may select an absolute position within a timeline that reflects the length of the audio data or a relative position within the audio data, such as a chapter number and a paragraph number.
  • the input may also comprise first selecting a button, which causes audio media player 432 to change to a "create annotation" mode where an annotation may be created and associated with the audio location.
  • audio media player 432 determines the audio location automatically (without user input) based on which portion of the audio data is being processed. For example, if audio media player 432 is processing a portion of the audio data that corresponds to chapter 20 of a digital work reflected in digital media item 404, then audio media player 432 determines that the audio location is at least somewhere within chapter 20.
  • audio media player 432 receives input that selects a "Create Annotation" button that may be displayed on the touch screen of device 430. Such a button may be displayed in response to input in step 854 that selects the audio location, where, for example, the user touches the touch screen continuously for a period of time, such as one second.
  • Although step 854 is depicted as occurring before step 856, the selection of the "Create Annotation" button may alternatively occur prior to the determination of the audio location.
  • audio media player 432 receives input that is used to create annotation data, similar to step 808.
  • audio media player 432 stores the annotation data in association with the audio location.
  • Audio media player 432 uses a mapping (e.g., mapping 406) to identify a particular audio location, in the mapping, that is closest to the audio location determined in step 854. Then, using the mapping, audio media player 432 identifies a text location that corresponds to the particular audio location.
  • audio media player 432 sends, over network 440 to intermediary device 420, the annotation data and the audio location. In response,
  • intermediary device 420 stores the annotation data in association with the audio location.
  • Intermediary device 420 uses mapping 406 to identify a particular audio location, in the mapping, that is closest to the audio location determined in step 854. Then, using mapping 406, intermediary device 420 identifies a text location that corresponds to the particular audio location.
  • Intermediary device 420 sends the identified text location over network 440 to device 410.
  • Intermediary device 420 may send the identified text location in response to a request, from device 410, for certain text data and/or for annotations associated with certain text data. For example, in response to a request for a digital book of "The Grapes of Wrath", intermediary device 420 determines whether there is any annotation data associated with that digital book and, if so, sends the annotation data to device 410.
  • Step 860 may also comprise storing date and/or time information that indicates when the annotation was created. This information may be displayed later when the annotation is consumed in the text context.
  • device 410 displays text data associated with digital media item 402, which is a textual version of digital media item 404.
  • Device 410 displays the text data of digital media item 402 based on a locally-stored copy of digital media item 402 or, if a locally-stored copy does not exist, may display the text data while the text data is streamed from intermediary device 420.
  • device 410 determines when a portion of the textual version of the work (reflected in digital media item 402) that includes the text location (identified in step 860) is displayed. Alternatively, device 410 may display data that indicates that an annotation is available regardless of what portion of the textual version of the work, if any, is displayed.
  • text media player 412 consumes the annotation data. If the annotation data is voice data, then consuming the annotation data may comprise playing the voice data or converting the voice data to text data and displaying the text data. If the annotation data is text data, then consuming the annotation data may comprise displaying the text data, for example, in a side panel of a GUI that displays a portion of the textual version of the work or in a new window that appears separate from the GUI.
  • a user of a media player may view a textual version of a work while simultaneously listening to an audio version of the work.
  • This scenario is referred to herein as the "read aloud” scenario.
  • when the media player is concurrently displaying a portion of a textual version of a work and playing a portion of an audio version of the work, the media player is said to be in "read aloud mode."
  • a media player visually indicates whether the media player is in read aloud mode.
  • a visual indication of being in read aloud mode may be an icon or graphic that appears somewhere on a screen of the media player. For example, an image of a narrator "character" is displayed by the media player and animated on each page that is displayed by the media player while the media player is in read aloud mode.
  • One example of a setting in the read aloud mode is an automatic page turn setting. If the media player is operating under the automatic page turn setting, then when the current playback position in the audio data corresponds to the end of a page displayed by the media player, the page is automatically "turned," i.e., without user input. "Turning" a digital page involves ceasing to display a first page and displaying a second page that is subsequent to the first page. Such "turning" may include displaying graphics that make it appear that the first page is an actual page that is turning. Thus, under the automatic page turn setting, the media player determines when the current playback position of the audio data corresponds to the last word on a displayed page. This determination is made possible by translating the current audio location into a current text location using a mapping, as described herein, whether the mapping is stored on the media player or on a server that is remote to the media player.
  • Another example of a setting in the read aloud mode is an end of page setting. If the media player is operating under the end of page setting, then the media player detects when the current playback position of the audio data corresponds to the text at the end of a page that is displayed by the media player. In response to this detection, the media player causes the playback of the audio data to cease. Only input from a user of the media player will cause the media player to continue processing the audio data. Also, the input may cause the media player to "turn" the page. Such input may be voice input or input via a touch screen of the media player.
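The automatic page turn and end of page settings reduce to a check of the current playback position against the audio location that the mapping associates with the last word on the displayed page. A minimal sketch, with invented setting names and return values:

```python
def on_playback_tick(current_audio_secs: float,
                     page_end_audio_secs: float,
                     setting: str) -> str:
    """Under "auto_turn" the page is turned once the playback position
    reaches the audio location mapped to the last word on the displayed page;
    under "end_of_page" playback pauses there until further user input.
    page_end_audio_secs is assumed to come from a mapping lookup not shown.
    """
    if current_audio_secs < page_end_audio_secs:
        return "keep_playing"
    if setting == "auto_turn":
        return "turn_page"
    if setting == "end_of_page":
        return "pause_until_user_input"
    return "keep_playing"
```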
  • Another example of a setting in the read aloud mode is a book control setting. If the media player is operating under the book control setting, then data (e.g., metadata) that is associated with the textual version of the work is used to control the playback of the corresponding audio data.
  • certain data such as tags in the textual data or in a mapping, indicates when to pause or stop playback of the audio data, regardless of the page position.
  • a textual version of a children's book might have a page with multiple pictures of objects, one of which is an apple.
  • the audio version of the children's book could ask, "Can you find the apple?" and the portion of the textual version that corresponds to the end of the question has a tag (or other data) that indicates when to pause the audio playback.
  • the media player reads the tag and, in response, pauses playback until additional input from the user, such as user selection of a displayed apple on a touch screen of the media player.
  • the mapping associated with the audio version and the textual version may include pause data that indicates when to pause the audio.
  • the media player detects the pause data while the current playback position of the audio version is changing, the media player pauses the playback until the user provides input, such as tapping the displayed apple on the touch screen. Once the user provides the required input, playback of the audio version resumes.
  • the textual versions contain pictures.
  • a page of a textual version of a work may include only a picture without any text or may include a picture and text while other pages in the textual version do not include any pictures.
  • a textual version of a work includes a "pause tag" that indicates when playback of an audio version of the work should be paused.
  • a pause tag may precede a picture within the textual version or may immediately follow a question in the textual version.
  • a pause tag may correspond to a particular text location within the textual version of the work.
  • the media player determines, based on a mapping, when the current playback of an audio version of the work corresponds to the particular text location. In response to the determination, the media player pauses the playback of the audio data.
  • the pause may be pre-determined, such as three seconds, after which the media player automatically begins to play the audio data again (i.e., without further user input).
  • the amount of time to pause may be determined based on information in the pause tag itself or information in metadata of the textual version, where the information indicates an amount of time, such as five seconds, after which the media player automatically plays the audio data again beginning where the media player stopped playback.
  • the media player receives user input that causes the media player to continue playing the audio version of the work after the media player pauses the playback. The user input may be required to continue playback or may be used to shorten the pause time.
  • a mapping associated with the audio version and the textual version of the work includes pause data that indicates where, in the audio version, to pause for a certain amount of time or until user input is received. For example, while the media player processes the audio version of the work, the media player keeps track of the current playback position in the audio version. When the current playback position corresponds, in the mapping, to an audio location that is associated with pause data, the media player pauses the playback of the audio data.
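The pause data described above can be modeled as pause points keyed by audio offset, each carrying either a fixed pause duration or a marker that user input is required. The following sketch uses invented names and assumes the player calls it once per playback tick.

```python
from dataclasses import dataclass


@dataclass
class PausePoint:
    audio_offset_secs: float
    duration_secs: float | None  # None means wait for user input (e.g. tapping the apple)


def pause_due(previous_audio_secs: float,
              current_audio_secs: float,
              pause_points: list[PausePoint]) -> PausePoint | None:
    """Return the pause point, if any, that the playback position crossed
    since the last tick. A numeric duration means playback resumes on its own
    after that many seconds; None means playback waits for user input.
    """
    for point in pause_points:
        if previous_audio_secs < point.audio_offset_secs <= current_audio_secs:
            return point
    return None
```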
  • the techniques described herein are implemented by one or more special-purpose computing devices.
  • the special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination.
  • Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques.
  • the special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices or any other device that incorporates hard-wired and/or program logic to implement the techniques.
  • FIG. 9 is a block diagram that illustrates a computer system 900 upon which an embodiment of the invention may be implemented.
  • Computer system 900 includes a bus 902 or other communication mechanism for communicating information, and a hardware processor 904 coupled with bus 902 for processing information.
  • Hardware processor 904 may be, for example, a general purpose microprocessor.
  • Computer system 900 also includes a main memory 906, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 902 for storing information and instructions to be executed by processor 904.
  • Main memory 906 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 904.
  • Such instructions, when stored in non-transitory storage media accessible to processor 904, render computer system 900 into a special-purpose machine that is customized to perform the operations specified in the instructions.
  • Computer system 900 further includes a read only memory (ROM) 908 or other static storage device coupled to bus 902 for storing static information and instructions for processor 904.
  • a storage device 910, such as a magnetic disk or optical disk, is provided and coupled to bus 902 for storing information and instructions.
  • Computer system 900 may be coupled via bus 902 to a display 912, such as a cathode ray tube (CRT), for displaying information to a computer user.
  • An input device 914 is coupled to bus 902 for communicating information and command selections to processor 904.
  • Another type of user input device is cursor control 916, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 904 and for controlling cursor movement on display 912.
  • This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
  • Computer system 900 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 900 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 900 in response to processor 904 executing one or more sequences of one or more instructions contained in main memory 906. Such instructions may be read into main memory 906 from another storage medium, such as storage device 910. Execution of the sequences of instructions contained in main memory 906 causes processor 904 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
  • Non-volatile media includes, for example, optical or magnetic disks, such as storage device 910.
  • Volatile media includes dynamic memory, such as main memory 906.
  • Common forms of storage media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge.
  • Storage media is distinct from but may be used in conjunction with transmission media.
  • Transmission media participates in transferring information between storage media.
  • transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 902.
  • Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
  • Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 904 for execution.
  • the instructions may initially be carried on a magnetic disk or solid state drive of a remote computer.
  • the remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem.
  • a modem local to computer system 900 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal.
  • An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 902.
  • Bus 902 carries the data to main memory 906, from which processor 904 retrieves and executes the instructions.
  • the instructions received by main memory 906 may optionally be stored on storage device 910 either before or after execution by processor 904.
  • Computer system 900 also includes a communication interface 918 coupled to bus 902.
  • Communication interface 918 provides a two-way data communication coupling to a network link 920 that is connected to a local network 922.
  • communication interface 918 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line.
  • communication interface 918 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN.
  • Wireless links may also be implemented.
  • communication interface 918 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
  • Network link 920 typically provides data communication through one or more networks to other data devices.
  • network link 920 may provide a connection through local network 922 to a host computer 924 or to data equipment operated by an Internet Service Provider (ISP) 926.
  • ISP 926 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet" 928.
  • Internet 928 uses electrical, electromagnetic or optical signals that carry digital data streams.
  • the signals through the various networks and the signals on network link 920 and through communication interface 918, which carry the digital data to and from computer system 900, are example forms of transmission media.
  • Computer system 900 can send messages and receive data, including program code, through the network(s), network link 920 and communication interface 918.
  • a server 930 might transmit a requested code for an application program through Internet 928, ISP 926, local network 922 and communication interface 918.
  • Figures 10-18 show functional block diagrams of electronic devices 1000-1800 in accordance with the principles of the invention as described above.
  • the functional blocks of the device may be implemented by hardware, software, or a combination of hardware and software to carry out the principles of the invention. It is understood by persons of skill in the art that the functional blocks described in Figures 10-18 may be combined or separated into sub-blocks to implement the principles of the invention as described above. Therefore, the description herein may support any possible combination or separation or further definition of the functional blocks described herein.
  • the electronic device 1000 includes an audio data receiving unit 1002 configured for receiving audio data that reflects an audible version of a work for which a textual version exists.
  • the electronic device 1000 also includes a processing unit 1006 coupled to the audio data receiving unit 1002.
  • the processing unit 1006 includes a speech to text unit 1008 and a mapping unit 1010.
  • the processing unit 1006 is configured to perform a speech-to-text analysis of the audio data to generate text for portions of the audio data (e.g., with the speech to text unit 1008); and based on the text generated for the portions of the audio data, generate a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the textual version of the work (e.g., with the mapping unit 1010).
  • the electronic device 1100 includes a text receiving unit 1102 configured for receiving a textual version of a work.
  • the electronic device 1100 also includes an audio data receiving unit 1104 configured for receiving second audio data that reflects an audible version of the work for which the textual version exists.
  • the electronic device 1100 also includes a processing unit 1106 coupled to the text receiving unit 1102.
  • the processing unit 1106 includes a text to speech unit 1108 and a mapping unit 1110.
  • the processing unit 1106 is configured to perform a text-to-speech analysis of the textual version to generate first audio data (e.g., with the text to speech unit 1108); and based on the first audio data and the textual version, generate a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the textual version of the work (e.g., with the mapping unit 1110).
  • the processing unit 1106 is further configured to, based on (1) a comparison of the first audio data and the second audio data and (2) the first mapping, generate a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the textual version of the work (e.g., with the mapping unit 1110).
  • the electronic device 1200 includes an audio receiving unit 1202 configured for receiving audio input.
  • the electronic device 1200 also includes a processing unit 1206 coupled to the audio receiving unit 1202.
  • the processing unit 1206 includes a speech to text unit 1208, a text matching unit 1209, and a display control unit 1210.
  • the processing unit 1206 is configured to perform a speech-to-text analysis of the audio input to generate text for portions of the audio input (e.g., with the speech to text unit 1208); determine whether the text generated for portions of the audio input matches text that is currently displayed (e.g., with the text matching unit 1209); and in response to determining that the text matches text that is currently displayed, cause the text that is currently displayed to be highlighted (e.g., with the display control unit 1210).
  • the electronic device 1300 includes a location data obtaining unit 1302 configured for obtaining location data that indicates a specified location within a textual version of a work.
  • the electronic device 1300 also includes a processing unit 1306 coupled to the location data obtaining unit 1302.
  • the processing unit 1306 includes a map inspecting unit 1308.
  • the processing unit 1306 is configured to inspect a mapping (e.g., with the map inspecting unit 1308) between a plurality of audio locations in an audio version of the work and a corresponding plurality of text locations in the textual version of the work to: determine a particular text location, of the plurality of text locations, that corresponds to the specified location, and based on the particular text location, determine a particular audio location, of the plurality of audio locations, that corresponds to the particular text location.
  • the processing unit 1306 is also configured to provide the particular audio location, that was determined based on the particular text location, to a media player to cause the media player to establish the particular audio location as a current playback position of the audio data.
  • the electronic device 1400 includes a location obtaining unit 1402 configured for obtaining location data that indicates a specified location within audio data.
  • the electronic device 1400 also includes a processing unit 1406 coupled to the location obtaining unit 1402.
  • the processing unit 1406 includes a map inspecting unit 1408 and a display control unit 1410.
  • the processing unit 1406 is configured to inspect a mapping (e.g., with the map inspecting unit 1408) between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a textual version of a work to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and based on the particular audio location, determine a particular text location, of the plurality of text locations, that corresponds to the particular audio location.
  • the processing unit 1406 is also configured to cause a media player to display information about the particular text location (e.g., with the display control unit 1410).
  • the electronic device 1500 includes a location obtaining unit 1502 configured for obtaining location data that indicates a specified location within the audio version during playback of an audio version of a work.
  • the electronic device 1500 also includes a processing unit 1506 coupled to the location obtaining unit 1502.
  • the processing unit 1506 includes a text location determining unit 1508 and a display control unit 1510.
  • the processing unit 1506 is configured to, during playback of an audio version of a work: determine, based on the specified location, a particular text location, in a textual version of the work, that is associated with the page end data that indicates an end of a first page reflected in the textual version of the work (e.g., with the text location determining unit 1508); and in response to determining that the particular text location is associated with the page end data, automatically cause the first page to cease to be displayed and cause a second page that is subsequent to the first page to be displayed (e.g., with the display control unit 1510).
  • the electronic device 1600 includes an annotation obtaining unit 1602 configured for, while a first version of a work is processed, obtaining annotation data that is based on input from a user.
  • the electronic device 1600 also includes an association data storing unit 1603.
  • the electronic device 1600 also includes a processing unit 1606 coupled to the annotation obtaining unit 1602 and the association data storing unit 1603.
  • the processing unit 1606 includes a display control unit 1610.
  • the processing unit 1606 is configured for causing association data that associates the annotation data with the work to be stored (e.g., in the association data storing unit 1603); and while a second version of the work is processed, causing information about the annotation data to be displayed (e.g., with the display control unit 1610), wherein the second version is different than the first version.
  • the electronic device 1700 includes a data receiving unit 1702 configured for receiving data that establishes a first bookmark within a first version of a work.
  • the electronic device 1700 also includes location data storing unit 1703.
  • the electronic device 1700 also includes a processing unit 1706 coupled to the data receiving unit 1702 and the location data storing unit 1703.
  • the processing unit 1706 includes a map inspection unit 1708.
  • the processing unit 1706 is configured for inspecting a mapping between a plurality of first locations in the first version of the work and a corresponding plurality of second locations in a second version of the work (e.g., with the map inspection unit 1708) to: determine a particular first location, of the plurality of first locations, that corresponds to the first bookmark, and based on the particular first location, determine a particular second location, of the plurality of second locations, that corresponds to the particular first location, wherein the first version of the work is different than the second version of the work; and causing data that establishes the particular second location as a second bookmark within the second version of the work to be stored (e.g., in the location data storing unit 1703).
  • the electronic device 1800 includes an audio receiving unit 1802 configured for receiving, at the device, audio input from a user.
  • the electronic device 1800 also includes a processing unit 1806 coupled to the audio receiving unit 1802.
  • the processing unit 1806 includes a word analyzing unit 1808 and a display control unit 1810.
  • the processing unit 1806 is configured for causing a portion of text of a work to be displayed by a device (e.g., with the display control unit 1810); and in response to receiving the audio input at the audio receiving unit: analyzing the audio input to identify one or more words (e.g., with the word analyzing unit 1808); determining whether the one or more words are reflected in the portion of the text (e.g., with the word analyzing unit 1808); and in response to determining that the one or more words are reflected in the portion of the text, causing a visual indication to be displayed by the device (e.g., with the display control unit 1810).

Abstract

Techniques are provided for creating a mapping that maps locations in audio data (e.g., an audio book) to corresponding locations in text data (e.g., an e-book). Techniques are provided for using a mapping between audio data and text data, whether the mapping is created automatically or manually. A mapping may be used for bookmark switching where a bookmark established in one version of a digital work (e.g., e-book) is used to identify a corresponding location within another version of the digital work (e.g., an audio book). Alternatively, the mapping may be used to play audio that corresponds to text selected by a user. Alternatively, the mapping may be used to automatically highlight text in response to audio that corresponds to the text being played. Alternatively, the mapping may be used to determine where an annotation created in one media context will be consumed in another media context.

Description

AUTOMATICALLY CREATING A MAPPING BETWEEN TEXT DATA
AND AUDIO DATA
FIELD OF THE INVENTION
[0001] The present invention relates to automatically creating a mapping between text data and audio data by analyzing the audio data to detect words reflected therein and comparing those words to words in the document.
SUMMARY
[0002] In accordance with some embodiments, a method is provided that includes receiving audio data that reflects an audible version of a work for which a textual version exists; performing a speech-to-text analysis of the audio data to generate text for portions of the audio data; and based on the text generated for the portions of the audio data, generating a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the textual version of the work. The method is performed by one or more computing devices.
[0003] In some embodiments, generating text for portions of the audio data includes generating text for portions of the audio data based, at least in part, on textual context of the work. In some embodiments, generating text for portions of the audio data based, at least in part, on textual context of the work includes generating text based, at least in part, on one or more rules of grammar used in the textual version of the work. In some embodiments, generating text for portions of the audio data based, at least in part, on textual context of the work includes limiting which words the portions can be translated to based on which words are in the textual version of the work, or a subset thereof. In some embodiments, limiting which words the portions can be translated to based on which words are in the textual version of the work includes, for a given portion of the audio data, identifying a sub-section of the textual version of the work that corresponds to the given portion and limiting the words to only those words in the sub-section of the textual version of the work. In some embodiments, identifying the sub-section of the textual version of the work includes maintaining a current text location in the textual version of the work that corresponds to a current audio location, in the audio data, of the speech-to-text analysis; and the sub-section of the textual version of the work is a section associated with the current text location.
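By way of non-limiting illustration, the following Python sketch shows one way the candidate words of a speech-to-text engine might be limited to a sub-section of the textual version surrounding the current text location; the tokenization, the window size, and the function names are assumptions made for the example.

# Sketch: limit the words a speech-to-text engine may emit to those found
# in a window of the textual version around the current text location.
# The window size and tokenization are illustrative assumptions.
import re

def tokenize(text):
    # Lower-case word tokens; real systems would also normalize punctuation.
    return re.findall(r"[a-zA-Z']+", text.lower())

def candidate_vocabulary(book_text, current_word_index, window=500):
    words = tokenize(book_text)
    start = max(0, current_word_index - window)
    end = min(len(words), current_word_index + window)
    return set(words[start:end])

# Usage: the recognizer is asked to choose only among these words for the
# next portion of audio, which greatly narrows its search space.
book = "Why did the chicken cross the road? To get to the other side."
print(candidate_vocabulary(book, current_word_index=3, window=5))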
[0004] In some embodiments, the portions include portions that correspond to individual words, and the mapping maps the locations of the portions that correspond to individual words to individual words in the textual version of the work. In some embodiments, the portions include portions that correspond to individual sentences, and the mapping maps the locations of the portions that correspond to individual sentences to individual sentences in the textual version of the work. In some embodiments, the portions include portions that correspond to fixed amounts of data, and the mapping maps the locations of the portions that correspond to fixed amounts of data to corresponding locations in the textual version of the work.
[0005] In some embodiments, generating the mapping includes: (1) embedding anchors in the audio data; (2) embedding anchors in the textual version of the work; or (3) storing the mapping in a media overlay that is stored in association with the audio data or the textual version of the work.
[0006] In some embodiments, each of one or more text locations of the plurality of text locations indicates a relative location in the textual version of the work. In some embodiments, one text location, of the plurality of text locations, indicates a relative location in the textual version of the work and another text location, of the plurality of text locations, indicates an absolute location from the relative location. In some embodiments, each of one or more text locations of the plurality of text locations indicates an anchor within the textual version of the work.
[0007] In accordance with some embodiments, a method is provided that includes receiving a textual version of a work; performing a text-to-speech analysis of the textual version to generate first audio data; based on the first audio data and the textual version, generating a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the textual version of the work; receiving second audio data that reflects an audible version of the work for which the textual version exists; and based on (1) a comparison of the first audio data and the second audio data and (2) the first mapping, generating a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the textual version of the work. The method is performed by one or more computing devices.
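One non-limiting way to realize the comparison described above is to align the text-to-speech audio with the narrated audio, for example with dynamic time warping over per-frame acoustic features, and then translate each audio location of the first mapping through that alignment. The Python sketch below assumes that per-frame feature values and a frame rate are supplied by other components; all names are illustrative.

# Sketch: align TTS-generated audio with narrated audio using dynamic time
# warping (DTW) over per-frame features, then translate the first mapping's
# audio locations into locations in the narrated audio.  Feature extraction
# and the frame rate are assumed to be handled elsewhere.
def dtw_path(a, b):
    # a, b: lists of per-frame feature values (illustratively 1-D).
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack to recover a frame-to-frame alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min((cost[i - 1][j - 1], i - 1, j - 1),
                   (cost[i - 1][j], i - 1, j),
                   (cost[i][j - 1], i, j - 1))
        i, j = step[1], step[2]
    return dict(path)          # maps TTS frame index -> narrated frame index

def second_mapping(first_mapping, tts_frames, narrated_frames, fps=100):
    # first_mapping: list of (tts_time_seconds, text_location) pairs.
    align = dtw_path(tts_frames, narrated_frames)
    result = []
    for tts_time, text_loc in first_mapping:
        frame = min(int(tts_time * fps), len(tts_frames) - 1)
        narrated_frame = align.get(frame, frame)
        result.append((narrated_frame / fps, text_loc))
    return result

# Toy usage with made-up per-frame features:
tts = [0.1, 0.5, 0.9, 0.4, 0.2]
narrated = [0.1, 0.1, 0.5, 0.8, 0.5, 0.2]
print(second_mapping([(0.02, "anchor-1")], tts, narrated, fps=100))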
[0008] In accordance with some embodiments, a method is provided that includes receiving audio input; performing a speech-to-text analysis of the audio input to generate text for portions of the audio input; determining whether the text generated for portions of the audio input matches text that is currently displayed; and in response to determining that the text matches text that is currently displayed, causing the text that is currently displayed to be highlighted. The method is performed by one or more computing devices.
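A minimal sketch of the matching step follows, assuming the recognizer yields a list of words and the displayed text is available as a string; exact, case-insensitive phrase matching is an illustrative simplification.

# Sketch: determine whether words recognized from audio input appear in the
# currently displayed text, and return the character range to highlight.
# The matching strategy (exact phrase, case-insensitive) is an assumption.
def find_highlight_range(recognized_words, displayed_text):
    phrase = " ".join(recognized_words).lower()
    start = displayed_text.lower().find(phrase)
    if start == -1:
        return None                      # no match; nothing to highlight
    return (start, start + len(phrase))  # character offsets to highlight

# Usage with hypothetical recognizer output:
displayed = "Why did the chicken cross the road?"
print(find_highlight_range(["chicken", "cross"], displayed))  # (12, 25)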
[0009] In accordance with some embodiments, an electronic device is provided that includes an audio data receiving unit configured for receiving audio data that reflects an audible version of a work for which a textual version exists. The electronic device also includes a processing unit coupled to the audio data receiving unit. The processing unit is configured to: perform a speech-to-text analysis of the audio data to generate text for portions of the audio data; and based on the text generated for the portions of the audio data, generate a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the textual version of the work.
[00010] In accordance with some embodiments, an electronic device is provided that includes a text receiving unit configured for receiving a textual version of a work. The electronic device also includes a processing unit coupled to the text receiving unit, the processing unit configured to: perform a text-to-speech analysis of the textual version to generate first audio data; and based on the first audio data and the textual version, generate a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the textual version of the work. The electronic device also includes an audio data receiving unit configured for receiving second audio data that reflects an audible version of the work for which the textual version exists. The processing unit is further configured to, based on (1) a comparison of the first audio data and the second audio data and (2) the first mapping, generate a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the textual version of the work.
[00011] In accordance with some embodiments, an electronic device is provided that includes an audio receiving unit configured for receiving audio input. The electronic device also includes a processing unit coupled to the audio receiving unit. The processing unit is configured to perform a speech-to-text analysis of the audio input to generate text for portions of the audio input; determine whether the text generated for portions of the audio input matches text that is currently displayed; and in response to determining that the text matches text that is currently displayed, cause the text that is currently displayed to be highlighted.
[00012] In accordance with some embodiments, a method is provided that includes obtaining location data that indicates a specified location within a textual version of a work; inspecting a mapping between a plurality of audio locations in an audio version of the work and a corresponding plurality of text locations in the textual version of the work to: determine a particular text location, of the plurality of text locations, that corresponds to the specified location, and based on the particular text location, determine a particular audio location, of the plurality of audio locations, that corresponds to the particular text location. The method includes providing the particular audio location, that was determined based on the particular text location, to a media player to cause the media player to establish the particular audio location as a current playback position of the audio data. The method is performed by one or more computing devices.
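By way of non-limiting illustration, the mapping inspection described above may be implemented as a search over mapping records sorted by text location; the record layout, the character-offset convention, and the media player interface in the following Python sketch are assumptions made for the example.

# Sketch: given a specified text location, find the closest mapped text
# location at or before it and hand the associated audio location to a
# media player.  The mapping layout and the player object are assumptions.
import bisect

def audio_location_for(mapping, specified_text_offset):
    # mapping: list of (text_offset, audio_seconds), sorted by text_offset.
    text_offsets = [t for t, _ in mapping]
    i = bisect.bisect_right(text_offsets, specified_text_offset) - 1
    if i < 0:
        i = 0
    return mapping[i][1]

class FakePlayer:                     # stand-in for a real media player API
    def seek(self, seconds):
        print("playback position set to", seconds, "seconds")

mapping = [(0, 0.0), (120, 9.5), (260, 21.2), (400, 33.8)]
FakePlayer().seek(audio_location_for(mapping, specified_text_offset=300))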
[00013] In some embodiments, obtaining comprises a server receiving, over a network, the location data from a first device; inspecting and providing are performed by the server; and providing comprises the server sending the particular audio location to a second device that executes the media player. In some embodiments, the second device and the first device are the same device. In some embodiments, obtaining, inspecting, and providing are performed by a computing device that is configured to display the textual version of the work and that executes the media player. In some embodiments, the method further includes determining, at a device that is configured to display the textual version of the work, the location data without input from a user of the device.
[00014] In some embodiments, the method further includes receiving input from a user; and in response to receiving the input, determining the location data based on the input. In some embodiments, providing the particular audio location to the media player comprises providing the particular audio location to the media player to cause the media player to process the audio data beginning at the current playback position, which causes the media player to generate audio from the processed audio data; and causing the media player to process the audio data is performed in response to receiving the input.
[00015] In some embodiments, the input selects multiple words in the textual version of the work; the specified location is a first specified location; the location data also indicates a second specified location, within the textual version of the work, that is different than the first specified location; inspecting further comprises inspecting the mapping to: determine a second particular text location, of the plurality of text locations, that corresponds to the second specified location, and based on the second particular text location, determine a second particular audio location, of the plurality of audio locations, that corresponds to the second particular text location; and providing the particular audio location to the media player comprises providing the second particular audio location to the media player to cause the media player to cease processing the audio data when the current playback position arrives at or near the second particular audio location.
[00016] In some embodiments, the method further includes obtaining annotation data that is based on input from a user; storing the annotation data in association with the specified location; and causing information about the annotation data to be displayed. In some embodiments, causing information about the particular audio location and the annotation data to be displayed comprises: determining when a current playback position of the audio data is at or near the particular audio location; and in response to determining that the current playback position of the audio data is at or near the particular audio location, causing information about the annotation data to be displayed.
[00017] In some embodiments, the annotation data includes text data; and causing information about the annotation data to be displayed comprises displaying the text data. In some embodiments, the annotation data includes voice data; and causing information about the annotation data to be displayed comprises processing the voice data to generate audio.
[00018] In accordance with some embodiments, an electronic device is provided that includes a location data obtaining unit configured for obtaining location data that indicates a specified location within a textual version of a work. The electronic device also includes a processing unit coupled to the location data obtaining unit. The processing unit is configured to inspect a mapping between a plurality of audio locations in an audio version of the work and a corresponding plurality of text locations in the textual version of the work to: determine a particular text location, of the plurality of text locations, that corresponds to the specified location, and based on the particular text location, determine a particular audio location, of the plurality of audio locations, that corresponds to the particular text location; and provide the particular audio location, that was determined based on the particular text location, to a media player to cause the media player to establish the particular audio location as a current playback position of the audio data.
[00019] In accordance with some embodiments, a method is provided that includes obtaining location data that indicates a specified location within audio data; inspecting a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a textual version of a work to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and based on the particular audio location, determine a particular text location, of the plurality of text locations, that corresponds to the particular audio location; and causing a media player to display information about the particular text location. The method is performed by one or more computing devices.
[00020] In some embodiments, obtaining comprises a server receiving, over a network, the location data from a first device; inspecting and causing are performed by the server; and causing comprises the server sending the particular text location to a second device that executes the media player. In some embodiments, the second device and the first device are the same device. In some embodiments, obtaining, inspecting, and causing are performed by a computing device that is configured to display the textual version of the work and that executes the media player. In some embodiments, the method further includes determining, at a device that is configured to process the audio data, the location data without input from a user of the device. [00021] In some embodiments, the method further includes: receiving input from a user; and in response to receiving the input, determining the location data based on the input. In some embodiments, causing comprises causing the media player to display a portion of the textual version of the work that corresponds to the particular text location; and causing the media player to display the portion of the textual version of the work is performed in response to receiving the input.
[00022] In some embodiments, the input selects a segment of the audio data; the specified location is a first specified location; the location data also indicates a second specified location, within the audio data, that is different than the first specified location; inspecting further comprises inspecting the mapping to: determine a second particular audio location, of the plurality of audio locations, that corresponds to the second specified location, and based on the second particular audio location, determine a second particular text location, of the plurality of text locations, that corresponds to the second particular audio location; and causing a media player to display information about the particular text location further comprises causing the media player to display information about the second particular text location.
[00023] In some embodiments, the specified location corresponds to a current playback position in the audio data; causing is performed as the audio data at the specified location is processed and audio is generated; and causing comprises causing a second media player to highlight text within the textual version of the work at or near the particular text location.
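A small sketch of this highlighting behavior follows, assuming mapping records sorted by audio time and text locations expressed as character offsets; both conventions are illustrative choices rather than requirements.

# Sketch: as audio plays, map the current playback position to the nearest
# preceding mapped audio location and return the span of text to highlight
# (from that record's text offset up to the next record's text offset).
import bisect

def highlight_span_for(mapping, playback_seconds):
    # mapping: list of (audio_seconds, text_offset) pairs sorted by audio time.
    audio_times = [a for a, _ in mapping]
    i = max(bisect.bisect_right(audio_times, playback_seconds) - 1, 0)
    start = mapping[i][1]
    end = mapping[i + 1][1] if i + 1 < len(mapping) else start
    return (start, end)      # character offsets of the text to highlight

mapping = [(0.0, 0), (9.5, 120), (21.2, 260), (33.8, 400)]
for position in (5.0, 10.0, 25.0):
    print(position, "->", highlight_span_for(mapping, position))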
[00024] In some embodiments, the method further includes: obtaining annotation data that is based on input from a user; storing the annotation data in association with the specified location; and causing information about the annotation data to be displayed. In some embodiments, causing information about the annotation data to be displayed comprises: determining when a portion of the textual version of the work that corresponds to the particular text location is displayed; and in response to determining that a portion of the textual version of the work that corresponds to the particular text location is displayed, causing information about the annotation data to be displayed.
[00025] In some embodiments, the annotation data includes text data; and causing information about the annotation data to be displayed comprises causing the text data to be displayed. In some embodiments, the annotation data includes voice data; and causing information about the annotation data to be displayed comprises causing the voice data to be processed to generate audio.
[00026] In accordance with some embodiments, a method is provided that includes, during playback of an audio version of a work: obtaining location data that indicates a specified location within the audio version, and determining, based on the specified location, a particular text location, in a textual version of the work, that is associated with pause data that indicates when to pause playback of the audio version; and in response to determining that the particular text location is associated with pause data, pausing playback of the audio version. The method is performed by one or more computing devices.
[00027] In some embodiments, the pause data is within the textual version of the work. In some embodiments, determining the particular text location comprises: inspecting a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a textual version of a work to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and based on the particular audio location, determine the particular text location, of the plurality of text locations, that corresponds to the particular audio location.
[00028] In some embodiments, the pause data corresponds to the end of a page reflected in the textual version of the work. In some embodiments, the pause data corresponds to a location, within the textual version of the work, that immediately precedes a picture that does not include text.
[00029] In some embodiments, the method further comprises continuing playback of the audio version in response to receiving user input. In some embodiments, the method further comprises continuing playback of the audio version in response to the lapse of a particular amount of time since playback of the audio version was paused.
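The following sketch illustrates one way a playback loop might honor pause data, resuming on user input or after a lapse of a particular amount of time; the pause-data representation and the player hooks are assumptions made for illustration.

# Sketch: while audio plays, check whether the text location reached is
# associated with pause data and, if so, pause until user input is received
# or a timeout lapses.  The pause-data representation is an assumption.
import time

def maybe_pause(pause_points, text_location, wait_for_user, timeout_seconds=5.0):
    # pause_points: set of text offsets marked with pause data, e.g. page
    # ends or locations immediately preceding a picture without text.
    if text_location not in pause_points:
        return
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        if wait_for_user():              # e.g. a "continue" tap was received
            return
        time.sleep(0.05)
    # Resume automatically once the timeout lapses.

maybe_pause({400, 812}, 400, wait_for_user=lambda: True)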
[00030] In accordance with some embodiments, a method is provided that includes, during playback of an audio version of a work: obtaining location data that indicates a specified location within the audio version, and determining, based on the specified location, a particular text location, in a textual version of the work, that is associated with the page end data that indicates an end of a first page reflected in the textual version of the work; and in response to determining that the particular text location is associated with the page end data, automatically causing the first page to cease to be displayed and causing a second page that is subsequent to the first page to be displayed. The method is performed by one or more computing devices.
[00031] In some embodiments, the method further comprises inspecting a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a textual version of a work to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and based on the particular audio location, determine the particular text location, of the plurality of text locations, that corresponds to the particular audio location.
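An illustrative sketch of automatic page turning driven by page end data follows, assuming the ending text offset of each page is known; the page table and the display callback are hypothetical.

# Sketch: when the playback position reaches a text location beyond the end
# of the currently displayed page, cease displaying that page and display
# the next one.  The page table and display callback are assumptions.
def page_for(page_ends, text_location):
    # page_ends[i] is the text offset at which page i+1 ends.
    for page_number, end_offset in enumerate(page_ends, start=1):
        if text_location <= end_offset:
            return page_number
    return len(page_ends)

def on_playback_progress(page_ends, current_page, text_location, show_page):
    page = page_for(page_ends, text_location)
    if page > current_page:
        show_page(page)        # cease displaying the first page, show the next
    return page

page_ends = [500, 1020, 1540]
current = on_playback_progress(page_ends, 1, 510, show_page=print)  # prints 2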
[00032] In accordance with some embodiments, an electronic device is provided that includes a location obtaining unit configured for obtaining location data that indicates a specified location within audio data. The electronic device also includes a processing unit coupled to the location obtaining unit. The processing unit is configured to: inspect a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a textual version of a work to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and based on the particular audio location, determine a particular text location, of the plurality of text locations, that corresponds to the particular audio location; and cause a media player to display information about the particular text location.
[00033] In accordance with some embodiments, an electronic device is provided that includes a location obtaining unit configured for obtaining location data that indicates a specified location within the audio version during playback of an audio version of a work. The electronic device also includes a processing unit coupled to the location obtaining unit, the processing unit configured to, during playback of an audio version of a work: determine, based on the specified location, a particular text location, in a textual version of the work, that is associated with the page end data that indicates an end of a first page reflected in the textual version of the work; and in response to determining that the particular text location is associated with the page end data, automatically cause the first page to cease to be displayed and cause a second page that is subsequent to the first page to be displayed.
[00034] In accordance with some embodiments, a method is provided that includes, while a first version of a work is processed, obtaining annotation data that is based on input from a user; storing association data that associates the annotation data with the work; and while a second version of the work is processed, causing information about the annotation data to be displayed, wherein the second version is different than the first version; and wherein the method is performed by one or more computing devices.
[00035] In some embodiments, obtaining comprises determining location data that indicates a specified location within the first version of the work; storing comprises storing the location data in association with the work; the specified location corresponds to a particular location within the second version of the work; and causing comprises causing the information about the annotation data to be displayed in association with the particular location in the second version.
[00036] In some embodiments, the first version is an audio version of the work and the second version is a textual version of the work; causing information about the annotation data to be displayed comprises: determining when a portion of the textual version of the work that corresponds to the particular location is displayed; and in response to determining that a portion of the textual version of the work that corresponds to the particular location is displayed, causing information about the annotation data to be displayed. In some embodiments, the first version is a textual version of the work and the second version is an audio version of the work; causing information about the annotation data to be displayed comprises: determining when a portion of the audio version of the work that corresponds to the particular location is played; and in response to determining that a portion of the audio version of the work that corresponds to the particular location is played, causing information about the annotation data to be displayed.
[00037] In some embodiments, the annotation data includes text data; and causing information about the annotation data to be displayed comprises causing the text data to be displayed. In some embodiments, the annotation data includes voice data; and causing information about the annotation data to be displayed comprises causing the voice data to be processed to generate audio.
[00038] In accordance with some embodiments, an electronic device is provided that includes an annotation obtaining unit configured for, while a first version of a work is processed, obtaining annotation data that is based on input from a user; and a processing unit coupled to the annotation obtaining unit and the association data storing unit, the processing unit configured for: causing association data that associates the annotation data with the work to be stored; and while a second version of the work is processed, causing information about the annotation data to be displayed, wherein the second version is different than the first version.
[00039] In accordance with some embodiments, a method is provided that includes receiving data that establishes a first bookmark within a first version of a work. The method further includes inspecting a mapping between a plurality of first locations in the first version of the work and a corresponding plurality of second locations in a second version of the work to: determine a particular first location, of the plurality of first locations, that corresponds to the first bookmark, and based on the particular first location, determine a particular second location, of the plurality of second locations, that corresponds to the particular first location; wherein the first version of the work is different than the second version of the work. The method further includes causing data that establishes the particular second location as a second bookmark within the second version of the work to be stored; wherein the method is performed by one or more computing devices.
[00040] In some embodiments, receiving comprises a server receiving, over a network, the input from a first device; inspecting is performed by the server; and causing comprises the server sending the particular second location to a second device. In some embodiments, the first device and the second device are different devices. In some embodiments, the first version of the work is one of an audio version of the work or a text version of the work and the second version of the work is the other of the audio version or the text version. [00041] In accordance with some embodiments, an electronic device is provided that includes a data receiving unit configured for receiving data that establishes a first bookmark within a first version of a work. The electronic device also includes a processing unit coupled to the data receiving unit, the processing unit configured for: inspecting a mapping between a plurality of first locations in the first version of the work and a corresponding plurality of second locations in a second version of the work to: determine a particular first location, of the plurality of first locations, that corresponds to the first bookmark, and based on the particular first location, determine a particular second location, of the plurality of second locations, that corresponds to the particular first location; wherein the first version of the work is different than the second version of the work. The processing unit is also configured for causing data that establishes the particular second location as a second bookmark within the second version of the work to be stored.
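A minimal sketch of bookmark switching through such a mapping follows, assuming mapping records of (audio time, text offset) pairs and a simple per-user bookmark store; both are illustrative choices.

# Sketch: translate a bookmark in one version of a work into a bookmark in
# the other version by inspecting the mapping.  The record layout and the
# bookmark store are assumptions made for illustration.
import bisect

def switch_bookmark(mapping, bookmark, from_version):
    # mapping: list of (audio_seconds, text_offset) pairs in which the audio
    # time and the text offset both increase from record to record.
    if from_version == "audio":      # audio-time bookmark -> text offset
        keys = [a for a, _ in mapping]
        values = [t for _, t in mapping]
    else:                            # text-offset bookmark -> audio time
        keys = [t for _, t in mapping]
        values = [a for a, _ in mapping]
    i = max(bisect.bisect_right(keys, bookmark) - 1, 0)
    return values[i]

mapping = [(0.0, 0), (9.5, 120), (21.2, 260), (33.8, 400)]
bookmarks = {}                       # hypothetical per-user bookmark store
bookmarks["ebook"] = switch_bookmark(mapping, 15.0, from_version="audio")
bookmarks["audiobook"] = switch_bookmark(mapping, 300, from_version="text")
print(bookmarks)                     # {'ebook': 120, 'audiobook': 21.2}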
[00042] In accordance with some embodiments, a method is provided that includes causing a portion of text of a work to be displayed by a device; while the portion of text is displayed: receiving, at the device, audio input from a user. The method further includes, in response to receiving the audio input: analyzing the audio input to identify one or more words; determining whether the one or more words are reflected in the portion of the text; and in response to determining that the one or more words are reflected in the portion of the text, causing a visual indication to be displayed by the device. In some embodiments, causing the visual indication to be displayed comprises causing text data that corresponds to the one or more words to be highlighted.
[00043] In accordance with some embodiments, an electronic device is provided that includes a processing unit configured for causing a portion of text of a work to be displayed by a device; and an audio receiving unit coupled to the processing unit and configured for receiving, at the device, audio input from a user. The processing unit is further configured for, in response to receiving the audio input at the audio receiving unit: analyzing the audio input to identify one or more words; determining whether the one or more words are reflected in the portion of the text; and in response to determining that the one or more words are reflected in the portion of the text, causing a visual indication to be displayed by the device.
[00044] In accordance with some embodiments, a computer-readable storage medium is provided, the computer-readable storage medium storing one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing any of the methods described above. In accordance with some embodiments, an electronic device is provided that comprises means for performing any of the methods described above. In some embodiments, an electronic device is provided that comprises one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for performing any of the methods described above. In some embodiments, an information processing apparatus for use in an electronic device is provided, the information processing apparatus comprising means for performing any of the methods described above.
BACKGROUND
[00045] With the cost of handheld electronic devices decreasing and large demand for digital content, creative works that have once been published on printed media are increasingly becoming available as digital media. For example, digital books (also known as "e-books") are increasingly popular, along with specialized handheld electronic devices known as e-book readers (or "e-readers"). Also, other handheld devices, such as tablet computers and smart phones, although not designed solely as e-readers, have the capability to be operated as e-readers.
[00046] A common standard by which e-books are formatted is the EPUB standard (short for "electronic publication"), which is a free and open e-book standard by the International Digital Publishing Forum (IDPF). An EPUB file uses XHTML 1.1 (or DTBook) to construct the content of a book. Styling and layout are performed using a subset of CSS, referred to as OPS Style Sheets.
[00047] For some written works, especially those that become popular, an audio version of the written work is created. For example, a recording of a famous individual (or one with a pleasant voice) reading a written work is created and made available for purchase, whether online or in a brick and mortar store.
[00048] It is not uncommon for consumers to purchase both an e-book and an audio version (or "audio book") of the e-book. In some cases, a user reads the entirety of an e-book and then desires to listen to the audio book. In other cases, a user transitions between reading and listening to the book, based on the user's circumstances. For example, while engaging in sports or driving during a commute, the user will tend to listen to the audio version of the book. On the other hand, when lounging in a sofa-chair prior to bed, the user will tend to read the e-book version of the book. Unfortunately, such transitions can be painful, since the user must remember where she stopped in the e-book and manually locate where to begin in the audio book, or vice versa. Even if the user remembers clearly what was happening in the book where the user left off, such transitions can still be painful because knowing what is happening does not necessarily make it easy to find the portion of an eBook or audio book that corresponds to those happenings. Thus, switching between an e-book and an audio book may be extremely time-consuming. [00049] The specification "EPUB Media Overlays 3.0" defines a usage of SMIL
(Synchronized Multimedia Integration Language), the Package Document, the EPUB Style Sheet, and the EPUB Content Document for representation of synchronized text and audio publications. A pre-recorded narration of a publication can be represented as a series of audio clips, each corresponding to part of the text. Each single audio clip, in the series of audio clips that make up a pre-recorded narration, typically represents a single phrase or paragraph, but infers no order relative to the other clips or to the text of a document. Media Overlays solve this problem of synchronization by tying the structured audio narration to its corresponding text in the EPUB Content Document using SMIL markup. Media Overlays are a simplified subset of SMIL 3.0 that allow the playback sequence of these clips to be defined.
[0010] Unfortunately, creating Media Overlay files is largely a manual process.
Consequently, the granularity of the mapping between audio and textual versions of a work is very coarse. For example, a media overlay file may associate the beginning of each paragraph in an e-book with a corresponding location in an audio version of the book. The reason that media overlay files, especially for novels, do not contain a mapping at any finer level of granularity, such as on a word-by-word basis, is that creating such a highly granular media overlay file might take countless hours of human labor.
[0011] The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] In the drawings:
[0013] FIG. 1 is a flow diagram that depicts a process for automatically creating a mapping between text data and audio data, according to an embodiment of the invention;
[0014] FIG. 2 is a block diagram that depicts a process that involves an audio-to-text correlator in generating a mapping between text data and audio data, according to an embodiment of the invention;
[0015] FIG. 3 is a flow diagram that depicts a process for using a mapping in one or more of these scenarios, according to an embodiment of the invention;
[0016] FIG. 4 is a block diagram that depicts an example system 400 that may be used to implement some of the processes described herein, according to an embodiment of the invention. [0017] FIGs. 5A-B are flow diagrams that depict processes for bookmark switching, according to an embodiment of the invention;
[0018] FIG. 6 is a flow diagram that depicts a process for causing text, from a textual version of a work, to be highlighted while an audio version of the work is being played, according to an embodiment of the invention;
[0019] FIG. 7 is a flow diagram that depicts a process of highlighting displayed text in response to audio input from a user, according to an embodiment of the invention;
[0020] FIGs. 8A-B are flow diagrams that depict processes for transferring an annotation from one media context to another, according to an embodiment of the invention; and
[0021] FIG. 9 is a block diagram that illustrates a computer system upon which an embodiment of the invention may be implemented.
[0022] FIGs. 10-18 are functional block diagrams of electronic devices in accordance with some embodiments.
DETAILED DESCRIPTION
[0023] In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
OVERVIEW OF AUTOMATIC GENERATION OF AUDIO-TO-TEXT MAPPING
[0024] According to one approach, a mapping is automatically created where the mapping maps locations within an audio version of a work (e.g., an audio book) with corresponding locations in a textual version of the work (e.g., an e-book). The mapping is created by performing a speech-to-text analysis on the audio version to identify words reflected in the audio version. The identified words are matched up with the corresponding words in the textual version of the work. The mapping associates locations (within the audio version) of the identified words with locations in the textual version of the work where the identified words are found.
AUDIO VERSION FORMATS
[0025] The audio data reflects an audible reading of text of a textual version of a work, such as a book, web page, pamphlet, flyer, etc. The audio data may be stored in one or more audio files. The one or more audio files may be in one of many file formats. Non-limiting examples of audio file formats include AAC, MP3, WAV, and PCM.
TEXTUAL VERSION FORMATS
[0026] Similarly, the text data to which the audio data is mapped may be stored in one of many document file formats. Non-limiting examples of document file formats include DOC, TXT, PDF, RTF, HTML, XHTML, and EPUB.
[0027] A typical EPUB document is accompanied by a file that (a) lists each XHTML content document, and (b) indicates an order of the XHTML content documents. For example, if a book comprises 20 chapters, then an EPUB document for that book may have 20 different XHTML documents, one for each chapter. A file that accompanies the EPUB document identifies an order of the XHTML documents that corresponds to the order of the chapters in the book. Thus, a single (logical) document (whether an EPUB document or another type of document) may comprise multiple data items or files.
[0028] The words or characters reflected in the text data may be in one or multiple languages. For example, one portion of the text data may be in English while another portion of the text data may be in French. Although examples of English words are provided herein, embodiments of the invention may be applied to other languages, including character-based languages.
AUDIO AND TEXT LOCATIONS IN MAPPING
[0029] As described herein, a mapping comprises a set of mapping records, where each mapping record associates an audio location with a text location.
[0030] Each audio location identifies a location in audio data. An audio location may indicate an absolute location within the audio data, a relative location within the audio data, or a combination of an absolute location and a relative location. As an example of an absolute location, an audio location may indicate a time offset (e.g., 04:32:24 indicating 4 hours, 32 minutes, 24 seconds) into the audio data, or a time range, as indicated below in Example A. As an example of a relative location, an audio location may indicate a chapter number, a paragraph number, and a line number. As an example of a combination of an absolute location and a relative location, the audio location may indicate a chapter number and a time offset into the chapter indicated by the chapter number.
[0031] Similarly, each text location identifies a location in text data, such as a textual version of a work. A text location may indicate an absolute location within the textual version of the work, a relative location within the textual version of the work, or a combination of an absolute location and a relative location. As an example of an absolute location, a text location may indicate a byte offset into the textual version of the work and/or an "anchor" within the textual version of the work. An anchor is metadata within the text data that identifies a specific location or portion of text. An anchor may be stored separate from the text in the text data that is displayed to an end-user or may be stored among the text that is displayed to an end-user. For example, text data may include the following sentence: "Why did the chicken <i name="123"/>cross the road?" where "<i name="123"/>" is the anchor. When that sentence is displayed to a user, the user only sees "Why did the chicken cross the road?" Similarly, the same sentence may have multiple anchors as follows: "<i name="123"/>Why <i name="124"/>did <i name="125"/>the <i name="126"/>chicken <i name="127"/>cross <i name="128"/>the <i name="129"/>road?" In this example, there is an anchor prior to each word in the sentence.
[0032] As an example of a relative location, a text location may indicate a page number, a chapter number, a paragraph number, and/or a line number. As an example of a combination of an absolute location and a relative location, a text location may indicate a chapter number and an anchor into the chapter indicated by the chapter number.
[0033] Examples of how to represent a text location and an audio location are provided in the specification entitled "EPUB Media Overlays 3.0," which defines a usage of SMIL (Synchronized Multimedia Integration Language), an EPUB Style Sheet, and an EPUB Content Document. An example of an association that associates a text location with an audio location and that is provided in the specification is as follows:
<par>
    <text src="chapter1.xhtml#sentence1"/>
    <audio src="chapter1_audio.mp3" clipBegin="23s" clipEnd="45s"/>
</par>
EXAMPLE A
[0034] In Example A, the "par" element includes two child elements: a "text" element and an "audio" element. The text element comprises an attribute "src" that identifies a particular sentence within an XHTML document that contains content from the first chapter of a book. The audio element comprises a "src" attribute that identifies an audio file that contains an audio version of the first chapter of the book, a "clipBegin" attribute that identifies where an audio clip within the audio file begins, and a "clipEnd" attribute that identifies where the audio clip within the audio file ends. Thus, seconds 23 through 45 in the audio file correspond to the first sentence in Chapter 1 of the book.
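For illustration only, the following Python sketch shows one way such a "par" element could be parsed into a single mapping record. The element and attribute names follow Example A; the MappingRecord type and the parse_par helper are hypothetical, and namespace handling is omitted.

import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class MappingRecord:
    text_location: str   # e.g., "chapter1.xhtml#sentence1"
    audio_source: str    # e.g., "chapter1_audio.mp3"
    clip_begin: str      # e.g., "23s"
    clip_end: str        # e.g., "45s"

def parse_par(par_xml):
    # Parse a single "par" element of the form shown in Example A.
    par = ET.fromstring(par_xml)
    text_el = par.find("text")
    audio_el = par.find("audio")
    return MappingRecord(
        text_location=text_el.get("src"),
        audio_source=audio_el.get("src"),
        clip_begin=audio_el.get("clipBegin"),
        clip_end=audio_el.get("clipEnd"),
    )

record = parse_par(
    '<par>'
    '<text src="chapter1.xhtml#sentence1"/>'
    '<audio src="chapter1_audio.mp3" clipBegin="23s" clipEnd="45s"/>'
    '</par>'
)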
CREATING A MAPPING BETWEEN TEXT AND AUDIO
[0035] According to an embodiment, a mapping between a textual version of a work and an audio version of the same work is automatically generated. Because the mapping is generated automatically, the mapping may use much finer granularity than would be practical using manual text-to-audio mapping techniques. Each automatically-generated text-to-audio mapping includes multiple mapping records, each of which associates a text location in the textual version with an audio location in the audio version.
[0036] FIG. 1 is a flow diagram that depicts a process 100 for automatically creating a mapping between a textual version of a work and an audio version of the same work, according to an embodiment of the invention. At step 110, a speech-to-text analyzer receives audio data that reflects an audible version of the work. At step 120, while the speech-to-text analyzer performs an analysis of the audio data, the speech-to-text analyzer generates text for portions of the audio data. At step 130, based on the text generated for the portions of the audio data, the speech-to-text analyzer generates a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the textual version of the work.
[0037] Step 130 may involve the speech-to-text analyzer comparing the generated text with text in the textual version of the work to determine where, within the textual version of the work, the generated text is located. For each portion of generated text that is found in the textual version of the work, the speech-to-text analyzer associates (1) an audio location that indicates where, within the audio data, the corresponding portion of audio data is found with (2) a text location that indicates where, within the textual version of the work, the portion of text is found.
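A minimal sketch of steps 120 and 130, assuming the speech-to-text analyzer yields (word, audio offset) pairs and that text locations are expressed as word indexes into the textual version; the forward-scanning match shown here is only one possible strategy.

def build_mapping(recognized_words, book_words):
    # recognized_words: list of (word, audio_offset_in_seconds) pairs produced
    #     by the speech-to-text analyzer
    # book_words: words of the textual version of the work, in reading order
    mapping = []
    cursor = 0
    for word, audio_offset in recognized_words:
        # Search forward from the current position for the recognized word.
        for i in range(cursor, len(book_words)):
            if book_words[i].lower() == word.lower():
                mapping.append({"audio": audio_offset, "text": i})
                cursor = i + 1
                break
    return mapping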
TEXTUAL CONTEXT
[0038] Every document has a "textual context". The textual context of a textual version of a work includes intrinsic characteristics of the textual version of the work (e.g., the language the textual version of the work is written in, the specific words that the textual version of the work uses, the grammar and punctuation that the textual version of the work uses, the way the textual version of the work is structured, etc.) and extrinsic characteristics of the work (e.g., the time period in which the work was created, the genre to which the work belongs, the author of the work, etc.).
[0039] Different works may have significantly different textual contexts. For example, the grammar used in a classic English novel may be very different from the grammar of modern poetry. Thus, while a certain word order may follow the rules of one grammar, that same word order may violate the rules of another grammar. Similarly, the grammar used in both a classic English novel and modern poetry may differ from the grammar (or lack thereof) employed in a text message sent from one teenager to another.
[0040] As mentioned above, one technique described herein automatically creates a fine granularity mapping between the audio version of a work and the textual version of the same work by performing a speech-to-text conversion of the audio version of the work. In an embodiment, the textual context of a work is used to increase the accuracy of the speech-to-text analysis that is performed on the audio version of the work. For example, in order to determine the grammar employed in a work, the speech-to-text analyzer (or another process) may analyze the textual version of the work prior to performing a speech-to-text analysis. The speech-to-text analyzer may then make use of the grammar information thus obtained to increase the accuracy of the speech-to-text analysis of the audio version of the work.
[0041] Instead of or in addition to automatically determining the grammar of a work based on the textual version of the work, a user may provide input that identifies one or more rules of grammar that are followed by the author of the work. The rules associated with the identified grammar are input to the speech-to-text analyzer to assist the analyzer in recognizing words in the audio version of the work.
LIMITING THE CANDIDATE DICTIONARY BASED ON TEXTUAL VERSION
[0042] Typically, speech-to-text analyzers must be configured or designed to recognize virtually every word in the English language and, optionally, some words in other languages. Therefore, speech-to-text analyzers must have access to a large dictionary of words. The dictionary from which a speech-to-text analyzer selects words during a speech-to-text operation is referred to herein as the "candidate dictionary" of the speech-to-text analyzer. The number of unique words in a typical candidate dictionary is approximately 500,000.
[0043] In an embodiment, text from the textual version of a work is taken into account when performing the speech-to-text analysis of the audio version of the work. Specifically, in one embodiment, during the speech-to-text analysis of an audio version of a work, the candidate dictionary used by the speech-to-text analyzer is restricted to the specific set of words that are in the text version of the work. In other words, the only words that are considered to be "candidates" during the speech-to-text operation performed on an audio version of a work are those words that actually appear in the textual version of the work.
[0044] By limiting the candidate dictionary used in the speech-to-text translation of a particular work to those words that appear in the textual version of the work, the speech-to-text operation may be significantly improved. For example, assume that the number of unique words in a particular work is 20,000. A conventional speech-to-text analyzer may have difficulty determining to which specific word, of a 500,000 word candidate dictionary, a particular portion of audio corresponds. However, that same portion of audio may unambiguously correspond to one particular word when only the 20,000 unique words that are in the textual version of the work are considered. Thus, with a much smaller dictionary of possible words, the accuracy of the speech-to-text analyzer may be significantly improved.
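Restricting the candidate dictionary can be as simple as collecting the set of unique words in the textual version and handing that set to the recognizer. In the sketch below, the set_candidate_words call is a hypothetical recognizer API.

import re

def candidate_dictionary(book_text):
    # Collect the unique words that actually appear in the textual version.
    return {w.lower() for w in re.findall(r"[A-Za-z']+", book_text)}

candidates = candidate_dictionary("Why did the chicken cross the road?")
# recognizer.set_candidate_words(candidates)   # hypothetical recognizer API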
LIMITING THE CANDIDATE DICTIONARY BASED ON CURRENT POSITION
[0045] To improve accuracy, the candidate dictionary may be restricted to even fewer words than all of the words in the textual version of the work. In one embodiment, the candidate dictionary is limited to those words found in a particular portion of the textual version of the work. For example, during a speech-to-text translation of a work, it is possible to approximately track the "current translation position" of the translation operation relative to the textual version of the work. Such tracking may be performed, for example, by comparing (a) the text that has been generated during the speech-to-text operation so far, against (b) the textual version of the work.
[0046] Once the current translation position has been determined, the candidate dictionary may be further restricted based on the current translation position. For example, in one embodiment, the candidate dictionary is limited to only those words that appear, within the textual version of the work, after the current translation position. Thus, words that are found prior to the current translation position, but not thereafter, are effectively removed from the candidate dictionary. Such removal may increase the accuracy of the speech-to-text analyzer, since the smaller the candidate dictionary, the less likely the speech-to-text analyzer will translate a portion of audio data to the wrong word.
[0047] As another example, prior to a speech-to-text analysis, an audio book and a digital book may be divided into a number of segments or sections. The audio book may be associated with an audio section mapping and the digital book may be associated with a text section mapping. For example, the audio section mapping and the text section mapping may identify where each chapter begins or ends. These respective mappings may be used by a speech-to-text analyzer to limit the candidate dictionary. For example, if the speech-to-text analyzer determines, based on the audio section mapping, that the speech-to-text analyzer is analyzing the 4th chapter of the audio book, then the speech-to-text analyzer uses the text section mapping to identify the 4th chapter of the digital book and limit the candidate dictionary to the words found in the 4th chapter.
[0048] In a related embodiment, the speech-to-text analyzer employs a sliding window that moves as the current translation position moves. As the speech-to-text analyzer is analyzing the audio data, the speech-to-text analyzer moves the sliding window "across" the textual version of the work. The sliding window indicates two locations within the textual version of the work. For example, the boundaries of the sliding window may be (a) the start of the paragraph that precedes the current translation position and (b) the end of the third paragraph after the current translation position. The candidate dictionary is restricted to only those words that appear between those two locations.
[0049] While a specific example was given above, the window may span any amount of text within the textual version of the work. For example, the window may span an absolute amount of text, such as 60 characters. As another example, the window may span a relative amount of text from the textual version of the work, such as ten words, three "lines" of text, 2 sentences, or 1 "page" of text. In the relative amount scenario, the speech-to-text analyzer may use formatting data within the textual version of the work to determine how much of the textual version of the work constitutes a line or a page. For example, the textual version of a work may comprise a page indicator (e.g., in the form of an HTML or XML tag) that indicates, within the content of the textual version of the work, the beginning of a page or the ending of a page.
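A sketch of one such sliding window, expressed in paragraphs (from the paragraph that precedes the current translation position through the third paragraph after it, as in the example above); how the textual version is split into paragraphs is an assumption.

def window_candidates(paragraphs, current_paragraph_index):
    # Window spans from the paragraph preceding the current translation
    # position through the third paragraph after it.
    start = max(0, current_paragraph_index - 1)
    end = min(len(paragraphs), current_paragraph_index + 4)
    words = set()
    for paragraph in paragraphs[start:end]:
        words.update(w.lower() for w in paragraph.split())
    return words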
[0050] In an embodiment, the start of the window corresponds to the current translation position. For example, the speech-to-text analyzer maintains a current text location that indicates the most recently-matched word in the textual version of the work and maintains a current audio location that indicates the most recently-identified word in the audio data. Unless the narrator (whose voice is reflected in the audio data) misreads text of the textual version of the work, adds his/her own content, or skips portions of the textual version of the work during the recording, the next word that the speech-to-text analyzer detects in the audio data (i.e., after the current audio location) is most likely the next word in the textual version of the work (i.e., after the current text location). Maintaining both locations may significantly increase the accuracy of the speech-to-text translation.
CREATING A MAPPING USING AUDIO-TO-AUDIO CORRELATION
[0051] In an embodiment, a text-to-speech generator and an audio-to-text correlator are used to automatically create a mapping between the audio version of a work and the textual version of a work. FIG. 2 is a block diagram that depicts these analyzers and the data used to generate the mapping. Textual version 210 of a work (such as an EPUB document) is input to text-to-speech generator 220. Text-to-speech generator 220 may be implemented in software, hardware, or a combination of hardware and software. Whether implemented in software or hardware, text-to-speech generator 220 may be implemented on a single computing device or may be distributed among multiple computing devices.
[0052] Text-to-speech generator 220 generates audio data 230 based on document 210. During the generation of the audio data 230, text-to-speech generator 220 (or another component not shown) creates an audio-to-document mapping 240. Audio-to-document mapping 240 maps multiple text locations within document 210 to corresponding audio locations within generated audio data 230.
[0053] For example, assume that text-to-speech generator 220 generates audio data for a word located at location Y within document 210. Further assume that the audio data that was generated for the word is located at a location X within audio data 230. To reflect the correlation between the location of the word within the document 210 and the location of the corresponding audio in the audio data 230, a mapping would be created between location X and location Y.
[0054] Because text-to-speech generator 220 knows where a word or phrase occurs in document 210 when a corresponding word or phrase of audio is generated, each mapping between the corresponding words or phrases can be easily generated.
[0055] Audio-to-text correlator 260 accepts, as input, generated audio data 230, audio book 250, and audio-to-document mapping 240. Audio-to-text correlator 260 performs two main steps: an audio-to-audio correlation step and a look-up step. For the audio-to-audio correlation step, audio-to-text correlator 260 compares generated audio data 230 with audio book 250 to determine the correlation between portions of audio data 230 and portions of audio book 250. For example, audio-to-text correlator 260 may determine, for each word represented in audio data 230, the location of the corresponding word in audio book 250.
[0056] The granularity at which audio data 230 is divided, for the purpose of establishing correlations, may vary from implementation to implementation. For example, a correlation may be established between each word in audio data 230 and each corresponding word in audio book 250. Alternatively, a correlation may be established based on fixed-duration time intervals (e.g. one mapping for every 1 minute of audio). In yet another alternative, a correlation may be established for portions of audio established based on other criteria, such as at paragraph or chapter boundaries, significant pauses (e.g., silence of greater than 3 seconds), or other locations based on data in audio book 250, such as audio markers within audio book 250.
[0057] After a correlation between a portion of audio data 230 and a portion of audio book 250 is identified, audio-to-text correlator 260 uses audio-to-document mapping 240 to identify a text location (indicated in mapping 240) that corresponds to the audio location within generated audio data 230. Audio-to-text correlator 260 then associates the text location with the audio location within audio book 250 to create a mapping record in document-to-audio mapping 270.
[0058] For example, assume that a portion of audio book 250 (located at location Z) matches the portion of generated audio data 230 that is located at location X. Based on a mapping record (in audio-to-document mapping 240) that correlates location X to location Y within document 210, a mapping record in document-to-audio mapping 270 would be created that correlates location Z of the audio book 250 with location Y within document 210.
[0059] Audio-to-text correlator 260 repeatedly performs the audio-to-audio correlation and look-up steps for each portion of audio data 230. Therefore, document-to-audio mapping 270 comprises multiple mapping records, each mapping record mapping a location within document 210 to a location within audio book 250.
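The look-up step amounts to a join between the audio-to-audio correlations and audio-to-document mapping 240; a minimal sketch follows, in which the audio-to-audio correlation step is assumed to have already produced pairs of matching locations.

def build_document_to_audio(audio_to_document, audio_correlations):
    # audio_to_document: dict mapping a location in generated audio data 230
    #     to a location in document 210 (audio-to-document mapping 240)
    # audio_correlations: list of (location in audio data 230,
    #     location in audio book 250) pairs from the audio-to-audio correlation step
    document_to_audio = []
    for generated_location, audio_book_location in audio_correlations:
        if generated_location in audio_to_document:
            document_to_audio.append({
                "text": audio_to_document[generated_location],  # location Y
                "audio": audio_book_location,                   # location Z
            })
    return document_to_audio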
[0060] In an embodiment, the audio-to-audio correlation for each portion of audio data 230 is immediately followed by the look-up step for that portion of audio. Thus, document-to-audio mapping 270 may be created for each portion of audio data 230 prior to proceeding to the next portion of audio data 230. Alternatively, the audio-to-audio correlation step may be performed for many or for all of the portions of audio data 230 before any look-up step is performed. The look-up steps for all portions can be performed in a batch, after all of the audio-to-audio correlations have been established.
MAPPING GRANULARITY
[0061] A mapping has a number of attributes, one of which is the mapping's size, which refers to the number of mapping records in the mapping. Another attribute of a mapping is the mapping's "granularity." The "granularity" of a mapping refers to the number of mapping records in the mapping relative to the size of the digital work. Thus, the granularity of a mapping may vary from one digital work to another digital work. For example, a first mapping for a digital book that comprises 200 "pages" includes a mapping record only for each paragraph in the digital book. Thus, the first mapping may comprise 1000 mapping records. On the other hand, a second mapping for a digital "children's" book that comprises 20 pages includes a mapping record for each word in the children's book. Thus, the second mapping may comprise 800 mapping records. Even though the first mapping comprises more mapping records than the second mapping, the granularity of the second mapping is finer than the granularity of the first mapping.
[0062] In an embodiment, the granularity of a mapping may be dictated based on input to a speech-to-text analyzer that generates the mapping. For example, a user may specify a specific granularity before causing a speech-to-text analyzer to generate a mapping. Non-limiting examples of specific granularities include:
- word granularity (i.e., an association for each word),
- sentence granularity (i.e., an association for each sentence),
- paragraph granularity (i.e., an association for each paragraph),
- 10-word granularity (i.e., a mapping for each 10 word portion in the digital work), and
- 10-second granularity (i.e., a mapping for each 10 seconds of audio).
[0063] As another example, a user may specify the type of digital work (e.g., novel, children's book, short story) and the speech-to-text analyzer (or another process) determines the granularity based on the work's type. For example, a children's book may be associated with word granularity while a novel may be associated with sentence granularity.
[0064] The granularity of a mapping may even vary within the same digital work. For example, a mapping for the first three chapters of a digital book may have sentence granularity while a mapping for the remaining chapters of the digital book have word granularity.
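A trivial sketch of selecting a granularity from a user-specified work type, based on the associations mentioned above; the table itself is illustrative only.

GRANULARITY_BY_WORK_TYPE = {
    "children's book": "word",
    "novel": "sentence",
    "short story": "sentence",
}

def choose_granularity(work_type, default="paragraph"):
    # Fall back to a default granularity for work types that are not listed.
    return GRANULARITY_BY_WORK_TYPE.get(work_type, default)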
ON-THE-FLY MAPPING GENERATION DURING TEXT-TO-AUDIO TRANSITIONS
[0065] While an audio-to-text mapping will, in many cases, be generated prior to a user needing to rely on one, in one embodiment, an audio-to-text mapping is generated at runtime or after a user has begun to consume the audio data and/or the text data on the user's device. For example, a user reads a textual version of a digital book using a tablet computer. The tablet computer keeps track of the most recent page or section of the digital book that the tablet computer has displayed to the user. The most recent page or section is identified by a "text bookmark."
[0066] Later, the user selects to play an audio book version of the same work. The playback device may be the same tablet computer on which the user was reading the digital book or another device. Regardless of the device upon which the audio book is to be played, the text bookmark is retrieved, and a speech-to-text analysis is performed relative to at least a portion of the audio book. During the speech-to-text analysis, "temporary" mapping records are generated to establish a correlation between the generated text and the corresponding locations within the audio book.
[0067] Once the text and correlation records have been generated, a text-to-text comparison is used to determine the generated text that corresponds to the text bookmark. Then, the temporary mapping records are used to identify the portion of the audio book that corresponds to the portion of generated text that corresponds to the text bookmark. Playback of the audio book is then initiated from that position.
[0068] The portion of the audio book on which the speech-to-text analysis is performed may be limited to the portion that corresponds to the text bookmark. For example, an audio section mapping may already exist that indicates where certain portions of the audio book begin and/or end. For example, an audio section mapping may indicate where each chapter begins, where one or more pages begin, etc. Such an audio section mapping may be helpful to determine where to begin the speech-to-text analysis so that a speech-to-text analysis on the entire audio book is not required to be performed. For example, if the text bookmark indicates a location within the 12th chapter of the digital book and an audio section mapping associated with the audio data identifies where the 12th chapter begins in the audio data, then a speech-to-text analysis is not required to be performed on any of the first 11 chapters of the audio book. For example, the audio data may consist of 20 audio files, one audio file for each chapter. Therefore, only the audio file that corresponds to the 12th chapter is input to a speech-to-text analyzer.
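A sketch of the text-to-audio transition described above, assuming one audio file per chapter, a text bookmark of the form (chapter number, words at the bookmark), and a speech_to_text function that returns (word, audio offset) records; all of these names are hypothetical.

def resume_audio_from_text_bookmark(text_bookmark, chapter_audio_files, speech_to_text):
    chapter_number, bookmark_words = text_bookmark
    # Only the audio file for the bookmarked chapter is analyzed.
    temporary_mapping = speech_to_text(chapter_audio_files[chapter_number])
    # temporary_mapping: list of (generated_word, audio_offset) records
    target = [w.lower() for w in bookmark_words]
    generated = [w.lower() for w, _ in temporary_mapping]
    # Text-to-text comparison: find where the generated text matches the bookmark.
    for i in range(len(generated) - len(target) + 1):
        if generated[i:i + len(target)] == target:
            return temporary_mapping[i][1]   # audio offset at which to begin playback
    return 0.0                               # fall back to the start of the chapter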
ON-THE-FLY MAPPING GENERATION DURING AUDIO-TO-TEXT TRANSITIONS
[0069] Mapping records can be generated on-the-fly to facilitate audio-to-text transitions, as well as text-to-audio transitions. For example, assume that a user is listening to an audio book using a smart phone. The smart phone keeps track of the current location within the audio book that is being played. The current location is identified by an "audio bookmark." Later, the user picks up a tablet computer and selects a digital book version of the audio book to display. The tablet computer receives the audio bookmark (e.g., from a central server that is remote relative to the tablet computer and the smart phone), performs a speech-to-text analysis of at least a portion of the audio book, and identifies, within the textual version of the work (i.e., the digital book), a portion of text that corresponds to the audio bookmark. The tablet computer then begins displaying the identified portion within the textual version.
[0070] The portion of the audio book on which the speech-to-text analysis is performed may be limited to the portion that corresponds to the audio bookmark. For example, a speech-to-text analysis is performed on a portion of the audio book that spans one or more time segments (e.g., seconds) prior to the audio bookmark in the audio book and/or one or more time segments after the audio bookmark in the audio book. The text produced by the speech-to-text analysis on that portion is compared to text in the textual version to locate where the series of words or phrases in the produced text match text in the textual version.
[0071] If there exists a text section mapping that indicates where certain portions of the textual version begin or end and the audio bookmark can be used to identify a section in the text section mapping, then much of the textual version need not be analyzed in order to locate where the series of words or phrases in the produced text match text in the textual version. For example, if the audio bookmark indicates a location within the 3rd chapter of the audio book and a text section mapping associated with the digital book identifies where the 3rd chapter begins in the textual version, then a speech-to-text analysis is not required to be performed on any of the first two chapters of the audio book or on any of the chapters of the audio book after the 3rd chapter.
OVERVIEW OF USE OF AUDIO-TO-TEXT MAPPINGS
[0072] According to one approach, a mapping (whether created manually or automatically) is used to identify the locations within an audio version of a digital work (e.g., an audio book) that correspond to locations within a textual version of the digital work (e.g., an e-book). For example, a mapping may be used to identify a location within an e-book based on a "bookmark" established in an audio book. As another example, a mapping may be used to identify which displayed text corresponds to an audio recording of a person reading the text as the audio recording is being played and cause the identified text to be highlighted. Thus, while an audio book is being played, a user of an e-book reader may follow along as the e-book reader highlights the corresponding text. As another example, a mapping may be used to identify a location in audio data and play audio at that location in response to input that selects displayed text from an e-book. Thus, a user may select a word in an e-book, which selection causes audio that corresponds to that word to be played. As another example, a user may create an annotation while "consuming" (e.g., reading or listening to) one version of a digital work (e.g., an e-book) and cause the annotation to be consumed while the user is consuming another version of the digital work (e.g., an audio book). Thus, a user can make notes on a "page" of an e-book and may view those notes while listening to an audio book of the e-book. Similarly, a user can make a note while listening to an audio book and then can view that note when reading the corresponding e-book.
[0073] FIG. 3 is a flow diagram that depicts a process for using a mapping in one or more of these scenarios, according to an embodiment of the invention.
[0074] At step 310, location data that indicates a specified location within a first media item is obtained. The first media item may be a textual version of a work or audio data that corresponds to a textual version of the work. This step may be performed by a device (operated by a user) that consumes the first media item. Alternatively, the step may be performed by a server that is located remotely relative to the device that consumes the first media item. In the latter case, the device sends the location data to the server over a network using a communication protocol.
[0075] At step 320, a mapping is inspected to determine a first media location that corresponds to the specified location. Similarly, this step may be performed by a device that consumes the first media item or by a server that is located remotely relative to the device.
[0076] At step 330, a second media location that corresponds to the first media location and that is indicated in the mapping is determined. For example, if the specified location is an audio "bookmark", then the first media location is an audio location indicated in the mapping and the second media location is a text location that is associated with the audio location in the mapping. Similarly, if the specified location is a text "bookmark", then the first media location is a text location indicated in the mapping and the second media location is an audio location that is associated with the text location in the mapping.
[0077] At step 340, the second media item is processed based on the second media location. For example, if the second media item is audio data, then the second media location is an audio location and is used as a current playback position in the audio data. As another example, if the second media item is a textual version of a work, then the second media location is a text location and is used to determine which portion of the textual version of the work to display.
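A condensed sketch of steps 310-340, assuming numeric media locations (e.g., time offsets or word indexes) and a hypothetical second_media_player object whose seek method sets the playback position or the displayed portion.

def process_second_media(specified_location, mapping, second_media_player):
    # mapping: list of records of the form {"first": ..., "second": ...}
    # Step 320: find the record whose first-media location is nearest the
    # specified location (e.g., a bookmark).
    record = min(mapping, key=lambda r: abs(r["first"] - specified_location))
    # Step 330: take the corresponding second-media location from that record.
    second_location = record["second"]
    # Step 340: process the second media item based on that location.
    second_media_player.seek(second_location)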
[0078] Examples of using process 300 in specific scenarios are provided below.
ARCHITECTURE OVERVIEW
[0079] Each of the example scenarios mentioned above and described in detail below may involve one or more computing devices. FIG. 4 is a block diagram that depicts an example system 400 that may be used to implement some of the processes described herein, according to an embodiment of the invention. System 400 includes end-user device 410, intermediary device 420, and end-user device 430. Non-limiting examples of end-user devices 410 and 430 include desktop computers, laptop computers, smart phones, tablet computers, and other handheld computing devices.
[0080] As depicted in FIG. 4, device 410 stores a digital media item 402 and executes a text media player 412 and an audio media player 414. Text media player 412 is configured to process electronic text data and cause device 410 to display text (e.g., on a touch screen of device 410, not shown). Thus, if digital media item 402 is an e-book, then text media player 412 may be configured to process digital media item 402, as long as digital media item 402 is in a text format that text media player 412 is configured to process. Device 410 may execute one or more other media players (not shown) that are configured to process other types of media, such as video.
[0081] Similarly, audio media player 414 is configured to process audio data and cause device 410 to generate audio (e.g., via speakers on device 410, not shown). Thus, if digital media item 402 is an audio book, then audio media player 414 may be configured to process digital media item 402, as long as digital media item 402 is in an audio format that audio media player 414 is configured to process. Whether item 402 is an e-book or an audio book, item 402 may comprise multiple files, whether audio files or text files.
[0082] Device 430 similarly stores a digital media item 404 and executes an audio media player 432 that is configured to process audio data and cause device 430 to generate audio. Device 430 may execute one or more other media players (not shown) that are configured to process other types of media, such as video and text.
[0083] Intermediary device 420 stores a mapping 406 that maps audio locations within audio data to text locations in text data. For example, mapping 406 may map audio locations within digital media item 404 to text locations within digital media item 402. Although not depicted in FIG. 4, intermediary device 420 may store many mappings, one for each corresponding set of audio data and text data. Also, intermediary device 420 may interact with many end-user devices not shown.
[0084] Also, intermediary device 420 may store digital media items that users may access via their respective devices. Thus, instead of storing a local copy of a digital media item, a device (e.g., device 430) may request the digital media item from intermediary device 420.
[0085] Additionally, intermediary device 420 may store account data that associates one or more devices of a user with a single account. Thus, such account data may indicate that devices 410 and 430 are registered by the same user under the same account. Intermediary device 420 may also store account-item association data that associates an account with one or more digital media items owned (or purchased) by a particular user. Thus, intermediary device 420 may verify that device 430 may access a particular digital media item by determining whether the account-item association data indicates that device 430 and the particular digital media item are associated with the same account.
[0086] Although only two end-user devices are depicted, an end-user may own and operate more or fewer devices that consume digital media items, such as e-books and audio books. Similarly, although only a single intermediary device 420 is depicted, the entity that owns and operates intermediary device 420 may operate multiple devices, each of which provides the same service, or which may operate together to provide a service to the user of end-user devices 410 and 430.
[0087] Communication between intermediary device 420 and end-user devices 410 and 430 is made possible via network 440. Network 440 may be implemented by any medium or mechanism that provides for the exchange of data between various computing devices. Examples of such a network include, without limitation, a network such as a Local Area Network (LAN), Wide Area Network (WAN), Ethernet or the Internet, or one or more terrestrial, satellite, or wireless links. The network may include a combination of networks such as those described. The network may transmit data according to Transmission Control Protocol (TCP), User Datagram Protocol (UDP), and/or Internet Protocol (IP).
STORAGE LOCATION OF MAPPING
[0088] A mapping may be stored separate from the text data and the audio data from which the mapping was generated. For example, as depicted in FIG. 4, mapping 406 is stored separate from digital media items 402 and 404 even though mapping 406 may be used to identify a media location in one digital media item based on a media location in the other digital media item. In fact, mapping 406 is stored on a computing device (intermediary device 420) that is separate from devices 410 and 430, which store, respectively, digital media items 402 and 404.
[0089] Additionally or alternatively, a mapping may be stored as part of the corresponding text data. For example, mapping 406 may be stored in digital media item 402. However, even if the mapping is stored as part of the text data, the mapping may not be displayed to an end-user that consumes the text data. Additionally or alternatively still, a mapping may be stored as part of the audio data. For example, mapping 406 may be stored in digital media item 404.
BOOKMARK SWITCHING
[0090] "Bookmark switching" refers to establishing a specified location (or "bookmark") in one version of a digital work and using the bookmark to find the corresponding location within another version of the digital work. There are two types of bookmark switching: text- to-audio (TA) bookmark switching and audio-to-text (AT) bookmark switching. TA bookmark switching involves using a text bookmark established in an e-book to identify a corresponding audio location in an audio book. Conversely, another type of bookmark switching referred to herein as AT bookmark switching involves using an audio bookmark established in an audio book to identify a corresponding text location within an e-book.
TEXT-TO-AUDIO BOOKMARK SWITCHING
[0091] FIG. 5A is a flow diagram that depicts a process 500 for TA bookmark switching, according to an embodiment of the invention. FIG. 5A is described using elements of system 400 depicted in FIG. 4.
[0092] At step 502, a text media player 412 (e.g., an e-reader) determines a text bookmark within digital media item 402 (e.g., a digital book). Device 410 displays content from digital media item 402 to a user of device 410.
[0093] The text bookmark may be determined in response to input from the user. For example, the user may touch an area on a touch screen of device 410 at or near which one or more words are displayed. In response to the input, text media player 412 determines the one or more words that are closest to the area and determines the text bookmark based on the determined one or more words.
[0094] Alternatively, the text bookmark may be determined based on the last text data that was displayed to the user. For example, the digital media item 402 may comprise 200 electronic "pages" and page 110 was the last page that was displayed. Text media player 412 determines that page 110 was the last page that was displayed. Text media player 412 may establish page 110 as the text bookmark or may establish a point at the beginning of page 110 as the text bookmark, since there may be no way to know where the user stopped reading. It may be safe to assume that the user at least read the last sentence on page 109, which sentence may have ended on page 109 or on page 110. Therefore, the text media player 412 may establish the beginning of the next sentence (which begins on page 110) as the text bookmark. However, if the granularity of the mapping is at the paragraph level, then text media player 412 may establish the beginning of the last paragraph on page 109. Similarly, if the granularity of the mapping is at the chapter level, then text media player 412 may establish the beginning of the chapter that includes page 110 as the text bookmark.
[0095] At step 504, text media player 412 sends, over network 440 to intermediary device 420, data that indicates the text bookmark. Intermediary device 420 may store the text bookmark in association with device 410 and/or an account of the user of device 410. Previous to step 502, the user may have established an account with an operator of intermediary device 420. The user then registered one or more devices, including device 410, with the operator. The registration caused each of the one or more devices to be associated with the user's account.
[0096] One or more factors may cause the text media player 412 to send the text bookmark to intermediary device 420. Such factors may include the exiting (or closing down) of text media player 412, the establishment of the text bookmark by the user, or an explicit instruction by the user to save the text bookmark for use when listening to the audio book that corresponds to the textual version of the work for which the text bookmark is established.
[0097] As noted previously, intermediary device 420 has access to (e.g., stores) mapping 406, which, in this example, maps multiple audio locations in digital media item 404 with multiple text locations within digital media item 402.
[0098] At step 506, intermediary device 420 inspects mapping 406 to determine a particular text location, of the multiple text locations, that corresponds to the text bookmark. The text bookmark may not exactly match any of the multiple text locations in mapping 406. However, intermediary device 420 may select the text location that is closest to the text bookmark. Alternatively, intermediary device 420 may select the text location that is immediately before the text bookmark, which text location may or may not be the closest text location to the text bookmark. For example, if the text bookmark indicates 5th chapter, 3rd paragraph, 5th sentence and the closest text locations in mapping 406 are (1) 5th chapter, 3rd paragraph, 1st sentence and (2) 5th chapter, 3rd paragraph, 6th sentence, then, under the latter approach, text location (1) is selected even though text location (2) is closer to the text bookmark.
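The two selection policies described above (the closest text location versus the text location immediately before the bookmark) can be sketched as follows, assuming text locations reduced to sentence numbers within the bookmarked paragraph; with mapped locations 1 and 6 and a bookmark at sentence 5, the latter policy yields 1 while the former yields 6.

import bisect

def location_immediately_before(locations, bookmark):
    # locations: the mapping's text locations, in sorted order
    i = bisect.bisect_right(locations, bookmark) - 1
    return locations[max(i, 0)]

def closest_location(locations, bookmark):
    return min(locations, key=lambda loc: abs(loc - bookmark))

sentence_locations = [1, 6]                          # 1st and 6th sentences
location_immediately_before(sentence_locations, 5)   # -> 1
closest_location(sentence_locations, 5)              # -> 6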
[0099] At step 508, once the particular text location in the mapping is identified, intermediary device 420 determines a particular audio location, in mapping 406, that corresponds to the particular text location.
[00100] At step 510, intermediary device 420 sends the particular audio location to device 430, which, in this example, is different than device 410. For example, device 410 may be a tablet computer and device 430 may be a smart phone. In a related embodiment, device 430 is not involved. Thus, intermediary device 420 may send the particular audio location to device 410.
[0100] Step 510 may be performed automatically, i.e., in response to intermediary device 420 determining the particular audio location. Alternatively, step 510 (or step 506) may be performed in response to receiving, from device 430, an indication that device 430 is about to process digital media item 404. The indication may be a request for an audio location that corresponds to the text bookmark.
[0101] At step 512, audio media player 432 establishes the particular audio location as a current playback position of the audio data in digital media item 404. This establishment may be performed in response to receiving the particular audio location from intermediary device 420. Because the current playback position becomes the particular audio location, audio media player 432 is not required to play any of the audio that precedes the particular audio location in the audio data. For example, if the particular audio location indicates 2:56:03 (2 hours, 56 minutes, and 3 seconds), then audio media player 432 establishes that time in the audio data as the current playback position. Thus, if the user of device 430 selects a "play" button (whether graphical or physical) on device 430, then audio media player 432 begins processing the audio data at that 2:56:03 mark.
[0102] In an alternative embodiment, device 410 stores mapping 406 (or a copy thereof). Therefore, in place of steps 504-508, text media player 412 inspects mapping 406 to determine a particular text location, of the multiple text locations, that corresponds to the text bookmark. Then, text media player 412 determines a particular audio location, in mapping 406, that corresponds to the particular text location. The text media player 412 may then cause the particular audio location to be sent to intermediary device 420 to allow device 430 to retrieve the particular audio location and establish a current playback position in the audio data to be the particular audio location. Text media player 412 may also cause the particular text location (or text bookmark) to be sent to intermediary device 420 to allow device 410 (or another device, not shown) to later retrieve the particular text location to allow another text media player executing on the other device to display a portion (e.g., a page) of another copy of digital media item 402, where the portion corresponds to the particular text location.
[0103] In another alternative embodiment, intermediary device 420 and device 430 are not involved. Thus, steps 504 and 510 are not performed. Thus, device 410 performs all other steps in FIG. 5A, including steps 506 and 508.
AUDIO-TO-TEXT BOOKMARK SWITCHING
[0104] FIG. 5B is a flow diagram that depicts a process 550 for AT bookmark switching, according to an embodiment of the invention. Similarly to FIG. 5A, FIG. 5B is described using elements of system 400 depicted in FIG. 4.
[0105] At step 552, audio media player 432 determines an audio bookmark within digital media item 404 (e.g., an audio book).
[0106] The audio bookmark may be determined in response to input from the user. For example, the user may stop the playback of the audio data, for example, by selecting a "stop" button that is displayed on a touch screen of device 430. Audio media player 432 determines the location within audio data of digital media item 404 that corresponds to where playback stopped. Thus, the audio bookmark may simply be the last place where the user stopped listening to the audio generated from digital media item 404. Additionally or alternatively, the user may select one or more graphical buttons on the touch screen of device 430 to establish a particular location within digital media item 404 as the audio bookmark. For example, device 430 displays a timeline that corresponds to the length of the audio data in digital media item 404. The user may select a position on the timeline and then provide one or more additional inputs that are used by audio media player 432 to establish the audio bookmark.
[0107] At step 554, device 430 sends, over network 440 to intermediary device 420, data that indicates the audio bookmark. The intermediary device 420 may store the audio bookmark in association with device 430 and/or an account of the user of device 430. Previous to step 552, the user established an account with an operator of intermediary device 420. The user then registered one or more devices, including device 430, with the operator. The registration caused each of the one or more devices to be associated with the user's account.
[0108] Intermediary device 420 also has access to (e.g., stores) mapping 406. Mapping 406 maps multiple audio locations in the audio data of digital media item 404 with multiple text locations within text data of digital media item 402.
[0109] One or more factors may cause audio media player 432 to send the audio bookmark to intermediary device 420. Such factors may include the exiting (or closing down) of audio media player 432, the establishment of the audio bookmark by the user, or an explicit instruction by the user to save the audio bookmark for use when displaying portions of the textual version of the work (reflected in digital media item 402) that corresponds to digital media item 404, for which the audio bookmark is established.
[0110] At step 556, intermediary device 420 inspects mapping 406 to determine a particular audio location, of the multiple audio locations, that corresponds to the audio bookmark. The audio bookmark may not exactly match any of the multiple audio locations in mapping 406. However, intermediary device 420 may select the audio location that is closest to the audio bookmark. Alternatively, intermediary device 420 may select the audio location that is immediately before the audio bookmark, which audio location may or may not be the closest audio location to the audio bookmark. For example, if the audio bookmark indicates 02:43:19 (or 2 hours, 43 minutes, and 19 seconds) and the closest audio locations in mapping 406 are (1) 02:41:07 and (2) 02:43:56, then the audio location (1) is selected, even though audio location (2) is closest to the audio bookmark.
[0111] At step 558, once the particular audio location in the mapping is identified, intermediary device 420 determines a particular text location, in mapping 406, that corresponds to the particular audio location.
[0112] At step 560, intermediary device 420 sends the particular text location to device 410, which, in this example, is different than device 430. For example, device 410 may be a tablet computer and device 430 may be a smart phone that is configured to process audio data and generate audible sounds.
[0113] Step 560 may be performed automatically, i.e., in response to intermediary device 420 determining the particular text location. Alternatively, step 560 (or step 556) may be performed in response to receiving, from device 410, an indication that device 410 is about to process the digital media item 402. The indication may be a request for a text location that corresponds to the audio bookmark.
[0114] At step 562, text media player 412 displays information about the particular text location. Step 562 may be performed in response to receiving the particular text location from intermediary device 420. Device 410 is not required to display any of the content that precedes the particular text location in the textual version of the work reflected in digital media item 402. For example, if the particular text location indicates Chapter 3, paragraph 2, sentence 4, then device 410 displays a page that includes that sentence. Text media player 412 may cause a marker to be displayed at the particular text location in the page that visually indicates, to a user of device 410, where to begin reading in the page. Thus, the user is able to immediately read the textual version of the work beginning at a location that corresponds to the last words spoken by a narrator in the audio book.
[0115] In an alternative embodiment, device 410 stores mapping 406. Therefore, in place of steps 556-560, after step 554 (wherein device 430 sends data that indicates the audio bookmark to intermediary device 420), intermediary device 420 sends the audio bookmark to device 410. Then, text media player 412 inspects mapping 406 to determine a particular audio location, of the multiple audio locations, that corresponds to the audio bookmark. Then, text media player 412 determines a particular text location, in mapping 406, that corresponds to the particular audio location. This alternative process then proceeds to step 562, described above.
[0116] In another alternative embodiment, intermediary device 420 is not involved.
Thus, steps 554 and 560 are not performed. Thus, device 430 performs all other steps in FIG. 5B, including steps 556 and 558.
HIGHLIGHT TEXT IN RESPONSE TO PLAYING AUDIO
[0117] In an embodiment, text from a portion of a textual version of a work is highlighted or "lit up" while audio data that corresponds to the textual version of the work is played. As noted previously, the audio data is an audio version of a textual version of the work and may reflect a reading of text from the textual version by a human user. As used herein, "highlighting" text refers to a media player (e.g., an "e-reader") visually distinguishing that text from other text that is concurrently displayed with the highlighted text. Highlighting text may involve changing the font of the text, changing the font style of the text (e.g., italicize, bold, underline), changing the size of the text, changing the color of the text, changing the background color of the text, or creating an animation associated with the text. An example of creating an animation is causing the text (or background of the text) to blink on and off or to change colors. Another example of creating an animation is creating a graphic to appear above, below, or around the text. For example, in response to the word "toaster" being played and detected by a media player, the media player displays a toaster image above the word "toaster" in the displayed text. Another example of an animation is a bouncing ball that "bounces" on a portion of text (e.g., word, syllable, or letter) when that portion is detected in audio data that is played.
[0118] FIG. 6 is a flow diagram that depicts a process 600 for causing text, from a textual version of a work, to be highlighted while an audio version of the work is being played, according to an embodiment of the invention.
[0119] At step 610, the current playback position (which is constantly changing) of audio data of the audio version is determined. This step may be performed by a media player executing on a user's device. The media player processes the audio data to generate audio for the user.
[0120] At step 620, based on the current playback position, a mapping record in a mapping is identified. The current playback position may match or nearly match the audio location identified in the mapping record.
[0121] Step 620 may be performed by the media player if the media player has access to a mapping that maps multiple audio locations in the audio data with multiple text locations in the textual version of the work. Alternatively, step 620 may be performed by another process executing on the user's device or by a server that receives the current playback position from the user's device over a network.
[0122] At step 630, the text location identified in the mapping record is identified.
[0123] At step 640, a portion of the textual version of the work that corresponds to the text location is caused to be highlighted. This step may be performed by the media player or another software application executing on the user's device. If a server performs the look-up steps (620 and 630), then step 640 may further involve the server sending the text location to the user's device. In response, the media player, or another software application, accepts the text location as input and causes the corresponding text to be highlighted.
[0124] In an embodiment, different text locations that are identified, by the media player, in the mapping are associated with different types of highlighting. For example, one text location in the mapping may be associated with the changing of the font color from black to red while another text location in the mapping may be associated with an animation, such as a toaster graphic that shows a piece of toast "popping" out of a toaster. Therefore, each mapping record in the mapping may include "highlighting data" that indicates how the text identified by the corresponding text location is to be highlighted. Thus, for each mapping record in the mapping that the media player identifies and that includes highlighting data, the media player uses the highlighting data to determine how to highlight the text. If a mapping record does not include highlighting data, then the media player may not highlight the corresponding text. Alternatively, if a mapping record in the mapping does not include highlighting data, then the media player may use a "default" highlight technique (e.g., bolding the text) to highlight the text.
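Steps 610-640, together with the optional per-record highlighting data, might be implemented roughly as follows; the renderer object and its highlight method are hypothetical stand-ins for the media player's display logic.

import bisect

def highlight_for_playback_position(mapping, playback_position, renderer):
    # mapping: records sorted by audio location, each of the form
    #   {"audio": seconds, "text": text_location, "highlight": optional style}
    audio_locations = [record["audio"] for record in mapping]
    i = bisect.bisect_right(audio_locations, playback_position) - 1
    if i < 0:
        return                                 # playback has not reached the first record
    record = mapping[i]
    style = record.get("highlight", "bold")    # "default" highlight technique
    renderer.highlight(record["text"], style)  # hypothetical display call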
HIGHLIGHTING TEXT BASED ON AUDIO INPUT
[0125] FIG. 7 is a flow diagram that depicts a process 700 of highlighting displayed text in response to audio input from a user, according to an embodiment of the invention. In this embodiment, a mapping is not required. The audio input is used to highlight text in a portion of a textual version of a work that is concurrently displayed to the user.
[0126] At step 710, audio input is received. The audio input may be based on a user reading aloud text from a textual version of a work. The audio input may be received by a device that displays a portion of the textual version. The device may prompt the user to read aloud a word, phrase, or entire sentence. The prompt may be visual or audio. As an example of a visual prompt, the device may cause the following text to be displayed: "Please read the underlined text" while or immediately before the device displays a sentence that is underlined. As an example of an audio prompt, the device may cause a computer-generated voice to read "Please read the underlined text" or cause a pre-recorded human voice to be played, where the pre-recorded human voice provides the same instruction.
[0127] At step 720, a speech-to-text analysis is performed on the audio input to detect one or more words reflected in the audio input.
[0128] At step 730, for each detected word reflected in the audio input, that detected word is compared to a particular set of words. The particular set of words may be all the words that are currently displayed by a computing device (e.g., an e-reader). Alternatively, the particular set of words may be all the words that the user was prompted to read.
[0129] At step 740, for each detected word that matches a word in the particular set, the device causes that matching word to be highlighted.
[0130] The steps depicted in process 700 may be performed by a single computing device that displays text from a textual version of a work. Alternatively, the steps depicted in process 700 may be performed by one or more computing devices that are different than the computing device that displays text from the textual version. For example, the audio input from a user in step 710 may be sent from the user's device over a network to a network server that performs the speech-to-text analysis. The network server may then send highlight data to the user's device to cause the user's device to highlight the appropriate text.
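Steps 720-740 reduce to comparing the recognized words against the particular set of words; in this sketch, speech_to_text and highlight_word are hypothetical stand-ins for the analyzer and the display call.

def highlight_read_words(audio_input, displayed_words, speech_to_text, highlight_word):
    # displayed_words: the particular set of words currently displayed (or prompted)
    displayed = {w.lower() for w in displayed_words}
    for word in speech_to_text(audio_input):   # step 720: detect words in the audio input
        if word.lower() in displayed:          # step 730: compare to the particular set
            highlight_word(word)               # step 740: highlight the matching word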
PLAYING AUDIO IN RESPONSE TO TEXT SELECTION
[0131] In an embodiment, a user of a media player that displays portions of a textual version of a work may select portions of displayed text and cause the corresponding audio to be played. For example, if a displayed word from the digital book is "donut" and the user selects that word (e.g., by touching a portion of the media player's touch screen that displays that word), then the audio of "donut" may be played.
[0132] A mapping that maps text locations in a textual version of the work with audio locations in audio data is used to identify the portion of the audio data that corresponds to the selected text. The user may select a single word, a phrase, or even one or more sentences. In response to input that selects a portion of the displayed text, the media player may identify one or more text locations. For example, the media player may identify a single text location that corresponds to the selected portion, even if the selected portion comprises multiple lines or sentences. The identified text location may correspond to the beginning of the selected portion. As another example, the media player may identify a first text location that corresponds to the beginning of the selected portion and a second text location that corresponds to the ending of the selected portion.
[0133] The media player uses the identified text location to look up a mapping record in the mapping that indicates a text location that is closest (or closest prior) to the identified text location. The media player uses the audio location indicated in the mapping record to identify where, in the audio data, to begin processing the audio data in order to generate audio. If only a single text location is identified, then only the word or sounds at or near the audio location may be played. Thus, after the word or sounds are played, the media player ceases to play any more audio. Alternatively, the media player begins playing at or near the audio location and does not cease playing the audio that follows the audio location until (a) the end of the audio data is reached, (b) further input is received from the user (e.g., selection of a "stop" button), or (c) a pre-designated stopping point in the audio data is reached (e.g., the end of a page or chapter that requires further input to proceed).
[0134] If the media player identifies two text locations based on the selected portion, then two audio locations are identified and may be used to identify where to begin playing and where to stop playing the corresponding audio.
[0135] In an embodiment, the audio data identified by the audio location may be played slowly (i.e., at a slow playback speed) or continuously without advancing the current playback position in the audio data. For example, if a user of a tablet computer selects the displayed word "two" by touching a touch screen of the tablet computer with his finger and continuously touches the displayed word (i.e., without lifting his finger and without moving his finger to another displayed word), then the tablet computer plays the corresponding audio continuously, creating a sound as if the reader were drawing out the word: "twoooooooooooooooo".
[0136] In a similar embodiment, the speed at which a user drags a finger across displayed text on a touch screen of a media player determines the speed at which the corresponding audio is played. For example, a user selects the letter "d" of the displayed word "donut" and then slowly moves his finger across the displayed word. In response to this input, the media player identifies the corresponding audio data (using the mapping) and plays the corresponding audio at the same speed at which the user moves his finger. Therefore, the media player creates audio that sounds as if the reader of the text of the textual version of the work pronounced the word "donut" as "dooooooonnnnnnuuuuuut."
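A hedged sketch of one way the drag speed might be translated into a playback rate; the nominal narration speed and the clamping bounds are arbitrary illustrative values, not taken from the disclosure.

```python
def playback_rate_for_drag(drag_speed_chars_per_sec,
                           narration_speed_chars_per_sec=15.0,
                           min_rate=0.1, max_rate=2.0):
    """Map the speed of a finger drag across displayed text to an audio
    playback rate: dragging at the narrator's pace gives a rate of 1.0,
    while a slower drag stretches the audio ("dooooooonnnnnnuuuuuut")."""
    rate = drag_speed_chars_per_sec / narration_speed_chars_per_sec
    return max(min_rate, min(rate, max_rate))
```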
[0137] In a similar embodiment, the time that a user "touches" a word displayed on a touch screen dictates how quickly or slowly the audio version of the word is played. For example, a quick tap of a displayed word by the user's finger causes the corresponding audio to be played at a normal speed, whereas the user holding down his finger on the selected word for more than 1 second causes the corresponding audio to be played at ½ the normal speed.

TRANSFERRING USER ANNOTATIONS
[0138] In an embodiment, a user initiates the creation of annotations to one media version (e.g., audio) of a digital work and causes the annotations to be associated with another media version (e.g., text) of the digital work. Thus, while an annotation may be created in the context of one type of media, the annotation may be consumed in the context of another type of media. The "context" in which an annotation is created or consumed refers to whether text is being displayed or audio is being played when the creation or consumption occurs.
[0139] Although the following examples involve determining an audio location or a text location when an annotation is created, some embodiments of the invention are not so limited. For example, the current playback position within an audio file when an annotation is created in the audio context is not used when consuming the annotation in the text context. Instead, an indication of the annotation may be displayed, by a device, at the beginning or the end of the corresponding textual version or on each "page" of the corresponding textual version. As another example, the text that is displayed when an annotation is created in the text context is not used when consuming the annotation in the audio context. Instead, an indication of the annotation may be displayed, by a device, at the beginning or end of the corresponding audio version or continuously while the corresponding audio version is being played. Additionally or alternatively to a visual indication, an audio indication of the annotation may be played. For example, a "beep" is played simultaneously with the audio track in such a way that both the beep and the audio track can be heard.
[0140] FIGs. 8A-B are flow diagrams that depict processes for transferring an annotation from one context to another, according to an embodiment of the invention. Specifically, FIG. 8A is a flow diagram that depicts a process 800 for creating an annotation in the "text" context and consuming the annotation in the "audio" context, while FIG. 8B is a flow diagram that depicts a process 850 for creating an annotation in the "audio" context and consuming the annotation in the "text" context. The creation and consumption of an annotation may occur on the same computing device (e.g., device 410) or on separate computing devices (e.g., devices 410 and 430). FIG. 8A describes a scenario where the annotation is created and consumed on device 410, while FIG. 8B describes a scenario where the annotation is created on device 430 and later consumed on device 410.
[0141] At step 802 in FIG. 8A, text media player 412, executing on device 410, causes text (e.g., in the form of a page) from digital media item 402 to be displayed.
[0142] At step 804, text media player 412 determines a text location within a textual version of the work reflected in digital media item 402. The text location is eventually stored in association with an annotation. The text location may be determined in a number of ways. For example, text media player 412 may receive input that selects the text location within the displayed text. The input may be a user touching a touch screen (that displays the text) of device 410 for a period of time. The input may select a specific word, a number of words, the beginning or ending of a page, before or after a sentence, etc. The input may also include first selecting a button, which causes text media player 412 to change to a "create annotation" mode where an annotation may be created and associated with the text location.
[0143] As another example of determining a text location, text media player 412 determines the text location automatically (without user input) based on which portion of the textual version of the work (reflected in digital media item 402) is being displayed. For example, if device 410 is displaying page 20 of the textual version of the work, then the annotation will be associated with page 20.
[0144] At step 806, text media player 412 receives input that selects a "Create Annotation" button that may be displayed on the touch screen. Such a button may be displayed in response to input in step 804 that selects the text location, where, for example, the user touches the touch screen for a period of time, such as one second.
[0145] Although step 804 is depicted as occurring before step 806, alternatively, the selection of the "Create Annotation" button may occur prior to the determination of the text location.
[0146] At step 808, text media player 412 receives input that is used to create annotation data. The input may be voice data (such as the user speaking into a microphone of device 410) or text data (such as the user selecting keys on a keyboard, whether physical or graphical). If the annotation data is voice data, text media player 412 (or another process) may perform speech-to-text analysis on the voice data to create a textual version of the voice data.
[0147] At step 810, text media player 412 stores the annotation data in association with the text location. Text media player 412 uses a mapping (e.g., a copy of mapping 406) to identify a particular text location, in the mapping, that is closest to the text location. Then, using the mapping, text media player 412 identifies an audio location that corresponds to the particular text location.
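As a non-authoritative illustration of step 810, the following sketch stores annotation data together with the selected text location and the audio location derived from the mapping; the Annotation structure and field names are assumptions made only for this sketch.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Annotation:
    payload: str          # text data, or a reference to recorded voice data
    text_offset: int      # text location determined in step 804
    audio_offset: float   # audio location derived from the mapping in step 810
    created_at: float = field(default_factory=time.time)

def create_text_context_annotation(mapping, text_offset, payload):
    """Step 810 sketch: store annotation data with the selected text location
    and the corresponding audio location.

    mapping: list of (text_offset, audio_offset) pairs.
    """
    # Pick the mapping entry whose text location is closest to the selection.
    _, audio_offset = min(mapping, key=lambda entry: abs(entry[0] - text_offset))
    return Annotation(payload, text_offset, audio_offset)
```

The created_at field corresponds to the optional date/time information described in paragraph [0149].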
[0148] Alternatively to step 810, text media player 412 sends, over network 440 to intermediary device 420, the annotation data and the text location. In response, intermediary device 420 stores the annotation data in association with the text location. Intermediary device 420 uses a mapping (e.g., mapping 406) to identify a particular text location, in mapping 406, that is closest to the text location. Then, using mapping 406, intermediary device 420 identifies an audio location that corresponds to the particular text location. Intermediary device 420 sends the identified audio location over network 440 to device 410. Intermediary device 420 may send the identified audio location in response to a request, from device 410, for certain audio data and/or for annotations associated with certain audio data. For example, in response to a request for an audio book version of "The Tale of Two Cities", intermediary device 420 determines whether there is any annotation data associated with that audio book and, if so, sends the annotation data to device 410.
[0149] Step 810 may also comprise storing date and/or time information that indicates when the annotation was created. This information may be displayed later when the annotation is consumed in the audio context.
[0150] At step 812, audio media player 414 plays audio by processing audio data of digital media item 404, which, in this example (although not shown), may be stored on device 410 or may be streamed to device 410 from intermediary device 420 over network 440.
[0151] At step 814, audio media player 414 determines when the current playback position in the audio data matches or nearly matches the audio location identified in step 810 using mapping 406. Alternatively, audio media player 414 may cause data that indicates that an annotation is available to be displayed, regardless of where the current playback position is located and without having to play any audio, as indicated in step 812. In other words, step 812 is unnecessary. For example, a user may launch audio media player 414 and cause audio media player 414 to load the audio data of digital media item 404. Audio media player 414 determines that annotation data is associated with the audio data. Audio media player 414 causes information about the audio data (e.g., title, artist, genre, length, etc.) to be displayed without generating any audio associated with the audio data. The information may include a reference to the annotation data and information about a location within the audio data that is associated with the annotation data, where the location corresponds to the audio location identified in step 810.
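A minimal sketch of the check performed in step 814, assuming annotations expose the audio location stored in step 810; the tolerance value is an arbitrary illustrative choice.

```python
def annotations_due(annotations, current_audio_offset, tolerance_secs=0.5):
    """Step 814 sketch: return the annotations whose stored audio location
    matches, or nearly matches, the current playback position.

    annotations: objects exposing an audio_offset attribute (in seconds).
    """
    return [a for a in annotations
            if abs(a.audio_offset - current_audio_offset) <= tolerance_secs]
```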
[0152] At step 816, audio media player 414 consumes the annotation data. If the annotation data is voice data, then consuming the annotation data may involve processing the voice data to generate audio or converting the voice data to text data and displaying the text data. If the annotation data is text data, then consuming the annotation data may involve displaying the text data, for example, in a side panel of a GUI that displays attributes of the audio data that is played or in a new window that appears separate from the GUI. Non-limiting examples of attributes include time length of the audio data, the current playback position, which may indicate an absolute location within the audio data (e.g., a time offset) or a relative position within the audio data (e.g., chapter or section number), a waveform of the audio data, and title of the digital work.

[0153] FIG. 8B describes a scenario, as noted previously, where an annotation is created on device 430 and later consumed on device 410.
[0154] At step 852, audio media player 432 processes audio data from digital media item 404 to play audio.
[0155] At step 854, audio media player 432 determines an audio location within the audio data. The audio location is eventually stored in association with an annotation. The audio location may be determined in a number of ways. For example, audio media player 432 may receive input that selects the audio location within the audio data. The input may be a user touching a touch screen (that displays attributes of the audio data) of device 430 for a period of time. The input may select an absolute position within a timeline that reflects the length of the audio data or a relative position within the audio data, such as a chapter number and a paragraph number. The input may also comprise first selecting a button, which causes audio media player 432 to change to a "create annotation" mode where an annotation may be created and associated with the audio location.
[0156] As another example of determining an audio location, audio media player 432 determines the audio location automatically (without user input) based on which portion of the audio data is being processed. For example, if audio media player 432 is processing a portion of the audio data that corresponds to chapter 20 of a digital work reflected in digital media item 404, then audio media player 432 determines that the audio location is at least somewhere within chapter 20.
[0157] At step 856, audio media player 432 receives input that selects a "Create Annotation" button that may be displayed on the touch screen of device 430. Such a button may be displayed in response to input in step 854 that selects the audio location, where, for example, the user touches the touch screen continuously for a period of time, such as one second.
[0158] Although step 854 is depicted as occurring before step 856, alternatively, the selection of the "Create Annotation" button may occur prior to the determination of the audio location.
[0159] At step 858, audio media player 432 receives input that is used to create annotation data, similar to step 808.
[0160] At step 860, audio media player 432 stores the annotation data in association with the audio location. Audio media player 432 uses a mapping (e.g., mapping 406) to identify a particular audio location, in the mapping, that is closest to the audio location determined in step 854. Then, using the mapping, audio media player 432 identifies a text location that corresponds to the particular audio location.

[0161] Alternatively to step 860, audio media player 432 sends, over network 440 to intermediary device 420, the annotation data and the audio location. In response, intermediary device 420 stores the annotation data in association with the audio location. Intermediary device 420 uses mapping 406 to identify a particular audio location, in the mapping, that is closest to the audio location determined in step 854. Then, using mapping 406, intermediary device 420 identifies a text location that corresponds to the particular audio location. Intermediary device 420 sends the identified text location over network 440 to device 410. Intermediary device 420 may send the identified text location in response to a request, from device 410, for certain text data and/or for annotations associated with certain text data. For example, in response to a request for a digital book of "The Grapes of Wrath", intermediary device 420 determines whether there is any annotation data associated with that digital book and, if so, sends the annotation data to device 410.
[0162] Step 860 may also comprise storing date and/or time information that indicates when the annotation was created. This information may be displayed later when the annotation is consumed in the text context.
[0163] At step 862, device 410 displays text data associated with digital media item 402, which is a textual version of digital media item 404. Device 410 may display the text data of digital media item 402 from a locally-stored copy of digital media item 402 or, if a locally-stored copy does not exist, while the text data is streamed from intermediary device 420.
[0164] At step 864, device 410 determines when a portion of the textual version of the work (reflected in digital media item 402) that includes the text location (identified in step 860) is displayed. Alternatively, device 410 may display data that indicates that an annotation is available regardless of what portion of the textual version of the work, if any, is displayed.
[0165] At step 866, text media player 412 consumes the annotation data. If the annotation data is voice data, then consuming the annotation data may comprise playing the voice data or converting the voice data to text data and displaying the text data. If the annotation data is text data, then consuming the annotation data may comprise displaying the text data, for example, in a side panel of a GUI that displays a portion of the textual version of the work or in a new window that appears separate from the GUI.
READ ALOUD FEATURE
[0166] As described above, a user of a media player may view a textual version of a work while simultaneously listening to an audio version of the work. This scenario is referred to herein as the "read aloud" scenario. When the media player is concurrently displaying a portion of a textual version of a work and playing a portion of an audio version of the work, the media player is said to be in "read aloud mode."
[0167] In an embodiment, a media player visually indicates whether the media player is in read aloud mode. A visual indication of being in read aloud mode may be an icon or graphic that appears somewhere on a screen of the media player. For example, an image of a narrator "character" is displayed by the media player and animated on each page that is displayed by the media player while the media player is in read aloud mode.
[0168] While the media player is in read aloud mode, a user may select a number of settings that are provided via the media player and that are associated with this scenario.
[0169] One example of a setting in the read aloud mode is an automatic page turn setting. If the media player is operating under the automatic page turn setting, then when the current playback position in the audio data corresponds to the end of a page displayed by the media player, the page is automatically "turned," i.e., without user input. "Turning" a digital page involves ceasing to display a first page and displaying a second page that is subsequent to the first page. Such "turning" may include displaying graphics that make it appear that the first page is an actual page that is turning. Thus, under the automatic page turn setting, the media player determines when the current playback position of the audio data corresponds to the last word on a displayed page. This determination is made possible by translating the current audio location into a current text location using a mapping, as described herein, whether the mapping is stored on the media player or on a server that is remote to the media player.
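One possible, purely illustrative realization of the automatic page turn check follows, assuming the mapping is available as audio/text offset pairs and that the text offset of the last word on the displayed page is known; the names below are not part of the disclosure.

```python
import bisect

def should_turn_page(mapping, current_audio_offset, page_end_text_offset):
    """Return True when the current playback position corresponds to (or has
    passed) the text location of the last word on the displayed page.

    mapping: list of (audio_offset, text_offset) pairs sorted by audio_offset.
    """
    audio_offsets = [a for a, _ in mapping]
    i = bisect.bisect_right(audio_offsets, current_audio_offset) - 1
    if i < 0:
        return False  # playback has not yet reached the first mapped location
    return mapping[i][1] >= page_end_text_offset
```

Under the end of page setting described next, the same check could trigger pausing playback instead of turning the page.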
[0170] Another example of a setting in the read aloud mode is an end of page setting. If the media player is operating under the end of page setting, then the media player detects when the current playback position of the audio data corresponds to the text at the end of a page that is displayed by the media player. In response to this detection, the media player causes the playback of the audio data to cease. Only input from a user of the media player will cause the media player to continue processing the audio data. Also, the input may cause the media player to "turn" the page. Such input may be voice input or input via a touch screen of the media player.
[0171] Another example of a setting in the read aloud mode is a book control setting. If the media player is operating under the book control setting, then data (e.g., metadata) that is associated with the textual version of the work is used to control the playback of the corresponding audio data. Thus, certain data, such as tags in the textual data or in a mapping, indicates when to pause or stop playback of the audio data, regardless of the page position. For example, a textual version of a children's book might have a page with multiple pictures of objects, one of which is an apple. The audio version of the children's book could ask, "Can you find the apple?" and the portion of the textual version that corresponds to the end of the question has a tag (or other data) that indicates when to pause the audio playback. The media player reads the tag and, in response, pauses playback until additional input from the user, such as user selection of a displayed apple on a touch screen of the media player.
Alternatively, the mapping associated with the audio version and the textual version may include pause data that indicates when to pause the audio. Thus, when the media player detects the pause data while the current playback position of the audio version is changing, the media player pauses the playback until the user provides input, such as tapping the displayed apple on the touch screen. Once the user provides the required input, playback of the audio version resumes.
AUTOMATICALLY PAUSING PLAYBACK OF AUDIO DATA
[0172] In some scenarios (other than at the end of a page as in the end of page setting described above), it may be beneficial to automatically pause the playback of an audio version of a work while a portion of a textual version of the work is being displayed. For example, for some works, the textual versions contain pictures. Specifically, a page of a textual version of a work may include only a picture without any text or may include a picture and text while other pages in the textual version do not include any pictures. In such situations, it may be beneficial to stop playing the audio version of a work to allow a reader to analyze the picture in silence.
[0173] In an embodiment, a textual version of a work includes a "pause tag" that indicates when playback of an audio version of the work should be paused. For example, a pause tag may precede a picture within the textual version or may immediately follow a question in the textual version. Thus, a pause tag may correspond to a particular text location within the textual version of the work. The media player (or a remote server) determines, based on a mapping, when the current playback of an audio version of the work corresponds to the particular text location. In response to the determination, the media player pauses the playback of the audio data. The pause may be for a pre-determined amount of time, such as three seconds, after which the media player automatically begins to play the audio data again (i.e., without further user input). Alternatively, the amount of time to pause may be determined based on information in the pause tag itself or information in metadata of the textual version, where the information indicates an amount of time, such as five seconds, after which the media player automatically plays the audio data again beginning where the media player stopped playback. Alternatively still, the media player receives user input that causes the media player to continue playing the audio version of the work after the media player pauses the playback. The user input may be required to continue playback or may be used to shorten the pause time.

[0174] In a related embodiment, a mapping associated with the audio version and the textual version of the work includes pause data that indicates where, in the audio version, to pause for a certain amount of time or until user input is received. For example, while the media player processes the audio version of the work, the media player keeps track of the current playback position in the audio version. When the current playback position corresponds, in the mapping, to an audio location that is associated with pause data, the media player pauses the playback of the audio data.
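A hedged sketch of how pause data carried in the mapping might be handled; the entry layout, the "await_input" marker, and the player interface are assumptions made only for illustration.

```python
import time

def handle_pause_data(mapping_entry, player):
    """Pause playback when the mapping entry just reached carries pause data.

    mapping_entry: dict such as
        {"audio_offset": 12.4, "text_offset": 1031, "pause": 5.0}
    where "pause" is absent for ordinary entries, holds a number of seconds
    for a timed pause, or holds "await_input" to require user input.
    player: assumed interface with pause(), resume(), and wait_for_user_input().
    """
    pause = mapping_entry.get("pause")
    if pause is None:
        return
    player.pause()
    if pause == "await_input":
        player.wait_for_user_input()   # e.g., tapping the displayed apple
    else:
        time.sleep(float(pause))       # e.g., a three- or five-second pause
    player.resume()
```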
HARDWARE OVERVIEW
[0175] According to one embodiment, the techniques described herein are implemented by one or more special-purpose computing devices. The special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices or any other device that incorporates hard-wired and/or program logic to implement the techniques.
[0176] For example, FIG. 9 is a block diagram that illustrates a computer system 900 upon which an embodiment of the invention may be implemented. Computer system 900 includes a bus 902 or other communication mechanism for communicating information, and a hardware processor 904 coupled with bus 902 for processing information. Hardware processor 904 may be, for example, a general purpose microprocessor.
[0177] Computer system 900 also includes a main memory 906, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 902 for storing information and instructions to be executed by processor 904. Main memory 906 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 904. Such instructions, when stored in non-transitory storage media accessible to processor 904, render computer system 900 into a special-purpose machine that is customized to perform the operations specified in the instructions.
[0178] Computer system 900 further includes a read only memory (ROM) 908 or other static storage device coupled to bus 902 for storing static information and instructions for processor 904. A storage device 910, such as a magnetic disk or optical disk, is provided and coupled to bus 902 for storing information and instructions.
[0179] Computer system 900 may be coupled via bus 902 to a display 912, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 914, including alphanumeric and other keys, is coupled to bus 902 for communicating information and command selections to processor 904. Another type of user input device is cursor control 916, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 904 and for controlling cursor movement on display 912. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
[0180] Computer system 900 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 900 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 900 in response to processor 904 executing one or more sequences of one or more instructions contained in main memory 906. Such instructions may be read into main memory 906 from another storage medium, such as storage device 910. Execution of the sequences of instructions contained in main memory 906 causes processor 904 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
[0181] The term "storage media" as used herein refers to any non-transitory media that store data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 910. Volatile media includes dynamic memory, such as main memory 906. Common forms of storage media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge.
[0182] Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 902. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

[0183] Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 904 for execution. For example, the instructions may initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 900 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 902. Bus 902 carries the data to main memory 906, from which processor 904 retrieves and executes the instructions. The instructions received by main memory 906 may optionally be stored on storage device 910 either before or after execution by processor 904.
[0184] Computer system 900 also includes a communication interface 918 coupled to bus 902. Communication interface 918 provides a two-way data communication coupling to a network link 920 that is connected to a local network 922. For example, communication interface 918 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 918 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 918 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
[0185] Network link 920 typically provides data communication through one or more networks to other data devices. For example, network link 920 may provide a connection through local network 922 to a host computer 924 or to data equipment operated by an Internet Service Provider (ISP) 926. ISP 926 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet" 928. Local network 922 and Internet 928 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 920 and through communication interface 918, which carry the digital data to and from computer system 900, are example forms of transmission media.
[0186] Computer system 900 can send messages and receive data, including program code, through the network(s), network link 920 and communication interface 918. In the Internet example, a server 930 might transmit a requested code for an application program through Internet 928, ISP 926, local network 922 and communication interface 918.
[0187] The received code may be executed by processor 904 as it is received, and/or stored in storage device 910, or other non-volatile storage for later execution.

[0188] In accordance with some embodiments, Figures 10-18 show functional block diagrams of electronic devices 1000-1800 in accordance with the principles of the invention as described above. The functional blocks of the device may be implemented by hardware, software, or a combination of hardware and software to carry out the principles of the invention. It is understood by persons of skill in the art that the functional blocks described in Figures 10-18 may be combined or separated into sub-blocks to implement the principles of the invention as described above. Therefore, the description herein may support any possible combination or separation or further definition of the functional blocks described herein.
[0189] As shown in Figure 10, the electronic device 1000 includes an audio data receiving unit 1002 configured for receiving audio data that reflects an audible version of a work for which a textual version exists. The electronic device 1000 also includes a processing unit 1006 coupled to the audio data receiving unit 1002. In some embodiments, the processing unit 1006 includes a speech to text unit 1008 and a mapping unit 1010.
[0190] The processing unit 1006 is configured to perform a speech-to-text analysis of the audio data to generate text for portions of the audio data (e.g., with the speech to text unit 1008); and based on the text generated for the portions of the audio data, generate a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the textual version of the work (e.g., with the mapping unit 1010).
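As a greatly simplified, illustrative sketch of the mapping generation performed by processing unit 1006, the following assumes a recognizer that returns each recognized word with a time offset; refinements such as restricting the recognizer to the words of the textual version are omitted, and all names are assumptions.

```python
def generate_mapping(recognized_words, textual_version):
    """Map audio locations to text locations by matching the words produced
    by the speech-to-text analysis, in order, against the textual version.

    recognized_words: iterable of (word, audio_offset_seconds) pairs.
    textual_version:  the full text of the work as a single string.
    Returns a list of (audio_offset, character_offset) pairs.
    """
    lowered = textual_version.lower()
    mapping = []
    cursor = 0  # current character offset in the textual version
    for word, audio_offset in recognized_words:
        found = lowered.find(word.lower(), cursor)
        if found == -1:
            continue  # recognition error or out-of-order word; skip it
        mapping.append((audio_offset, found))
        cursor = found + len(word)
    return mapping
```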
[0191] As shown in Figure 11, the electronic device 1100 includes a text receiving unit 1102 configured for receiving a textual version of a work. The electronic device 1100 also includes an audio data receiving unit 1104 configured for receiving second audio data that reflects an audible version of the work for which the textual version exists. The electronic device 1100 also includes a processing unit 1106 coupled to the text receiving unit 1102. In some embodiments, the processing unit 1106 includes a text to speech unit 1108 and a mapping unit 1110.
[0192] The processing unit 1106 is configured to perform a text-to-speech analysis of the textual version to generate first audio data (e.g., with the text to speech unit 1108); and based on the first audio data and the textual version, generate a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the textual version of the work (e.g., with the mapping unit 1110). The processing unit 1106 is further configured to, based on (1) a comparison of the first audio data and the second audio data and (2) the first mapping, generate a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the textual version of the work (e.g., with the mapping unit 1110).
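An illustrative sketch of how the second mapping might be composed from the first mapping and a comparison of the two audio signals; the audio-alignment input is assumed to have been produced elsewhere and is not specified by the disclosure.

```python
def compose_second_mapping(first_mapping, audio_alignment):
    """Derive the second mapping (narrated audio -> text) from (1) the first
    mapping (generated audio -> text) and (2) an alignment between the
    generated audio and the narrated audio.

    first_mapping:   list of (generated_audio_offset, text_offset) pairs.
    audio_alignment: dict of generated_audio_offset -> narrated_audio_offset,
                     produced by comparing the two audio signals.
    Returns a list of (narrated_audio_offset, text_offset) pairs.
    """
    second_mapping = []
    for generated_offset, text_offset in first_mapping:
        narrated_offset = audio_alignment.get(generated_offset)
        if narrated_offset is not None:
            second_mapping.append((narrated_offset, text_offset))
    return second_mapping
```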
[0193] As shown in Figure 12, the electronic device 1200 includes an audio receiving unit 1202 configured for receiving audio input. The electronic device 1200 also includes a processing unit 1206 coupled to the audio receiving unit 1202. In some embodiments, the processing unit 1206 includes a speech to text unit 1208, a text matching unit 1209, and a display control unit 1210.
[0194] The processing unit 1206 is configured to perform a speech-to-text analysis of the audio input to generate text for portions of the audio input (e.g., with the speech to text unit 1208); determine whether the text generated for portions of the audio input matches text that is currently displayed (e.g., with the text matching unit 1209); and in response to determining that the text matches text that is currently displayed, cause the text that is currently displayed to be highlighted (e.g., with the display control unit 1210).
[0195] As shown in Figure 13, the electronic device 1300 includes a location data obtaining unit 1302 configured for obtaining location data that indicates a specified location within a textual version of a work. The electronic device 1300 also includes a processing unit 1306 coupled to the location data obtaining unit 1302. In some embodiments, the processing unit 1306 includes a map inspecting unit 1308.
[0196] The processing unit 1306 is configured to inspect a mapping (e.g., with the map inspecting unit 1308) between a plurality of audio locations in an audio version of the work and a corresponding plurality of text locations in the textual version of the work to: determine a particular text location, of the plurality of text locations, that corresponds to the specified location, and based on the particular text location, determine a particular audio location, of the plurality of audio locations, that corresponds to the particular text location. The processing unit 1306 is also configured to provide the particular audio location, that was determined based on the particular text location, to a media player to cause the media player to establish the particular audio location as a current playback position of the audio data.
[0197] As shown in Figure 14, the electronic device 1400 includes a location obtaining unit 1402 configured for obtaining location data that indicates a specified location within audio data. The electronic device 1400 also includes a processing unit 1406 coupled to the location obtaining unit 1402. In some embodiments, the processing unit 1406 includes a map inspecting unit 1408 and a display control unit 1410.
[0198] The processing unit 1406 is configured to inspect a mapping (e.g., with the map inspecting unit 1408) between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a textual version of a work to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and based on the particular audio location, determine a particular text location, of the plurality of text locations, that corresponds to the particular audio location. The processing unit 1406 is also configured to cause a media player to display information about the particular text location (e.g., with the display control unit 1410).

[0199] As shown in Figure 15, the electronic device 1500 includes a location obtaining unit 1502 configured for obtaining location data that indicates a specified location within an audio version of a work during playback of the audio version. The electronic device 1500 also includes a processing unit 1506 coupled to the location obtaining unit 1502. In some embodiments, the processing unit 1506 includes a text location determining unit 1508 and a display control unit 1510.
[0200] The processing unit 1506 is configured to, during playback of an audio version of a work: determine, based on the specified location, a particular text location, in a textual version of the work, that is associated with page end data that indicates an end of a first page reflected in the textual version of the work (e.g., with the text location determining unit 1508); and in response to determining that the particular text location is associated with the page end data, automatically cause the first page to cease to be displayed and cause a second page that is subsequent to the first page to be displayed (e.g., with the display control unit 1510).
[0201] As shown in Figure 16, the electronic device 1600 includes an annotation obtaining unit 1602 configured for, while a first version of a work is processed, obtaining annotation data that is based on input from a user. The electronic device 1600 also includes an association data storing unit 1603. The electronic device 1600 also includes a processing unit 1606 coupled to the annotation obtaining unit 1602 and the association data storing unit 1603. In some embodiments, the processing unit 1606 includes a display control unit 1610.
[0202] The processing unit 1606 is configured for causing association data that associates the annotation data with the work to be stored (e.g., in the association data storing unit 1603); and while a second version of the work is processed, causing information about the annotation data to be displayed (e.g., with the display control unit 1610), wherein the second version is different than the first version.
[0203] As shown in Figure 17, the electronic device 1700 includes a data receiving unit 1702 configured for receiving data that establishes a first bookmark within a first version of a work. The electronic device 1700 also includes location data storing unit 1703. The electronic device 1700 also includes a processing unit 1706 coupled to the data receiving unit 1702 and the location data storing unit 1703. In some embodiments, the processing unit 1706 includes a map inspection unit 1708.
[0204] The processing unit 1706 is configured for inspecting a mapping between a plurality of first locations in the first version of the work and a corresponding plurality of second locations in a second version of the work (e.g., with the map inspection unit 1708) to: determine a particular first location, of the plurality of first locations, that corresponds to the first bookmark, and based on the particular first location, determine a particular second location, of the plurality of second locations, that corresponds to the particular first location, wherein the first version of the work is different than the second version of the work; and causing data that establishes the particular second location as a second bookmark within the second version of the work to be stored (e.g., in the location data storing unit 1703).
[0205] As shown in Figure 18, the electronic device 1800 includes an audio receiving unit 1802 configured for receiving, at the device, audio input from a user. The electronic device 1800 also includes a processing unit 1806 coupled to the audio receiving unit 1802. In some embodiments, the processing unit 1806 includes a word analyzing unit 1808 and a display control unit 1810.
[0206] The processing unit 1806 is configured for causing a portion of text of a work to be displayed by a device (e.g., with the display control unit 1810); and in response to receiving the audio input at the audio receiving unit: analyzing the audio input to identify one or more words (e.g., with the word analyzing unit 1808); determining whether the one or more words are reflected in the portion of the text (e.g., with the word analyzing unit 1808); and in response to determining that the one or more words are reflected in the portion of the text, causing a visual indication to be displayed by the device (e.g., with the display control unit 1810).
[0207] In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and what is intended by the applicants to be the scope of the invention, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.

CLAIMS

What is claimed is:
1. A method comprising:
receiving audio data that reflects an audible version of a work for which a textual version exists;
performing a speech-to-text analysis of the audio data to generate text for portions of the audio data; and
based on the text generated for the portions of the audio data, generating a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the textual version of the work;
wherein the method is performed by one or more computing devices.
2. The method of Claim 1, wherein generating text for portions of the audio data includes generating text for portions of the audio data based, at least in part, on textual context of the work.
3. The method of Claim 2, wherein generating text for portions of the audio data based, at least in part, on textual context of the work includes generating text based, at least in part, on one or more rules of grammar used in the textual version of the work.
4. The method of Claim 2, wherein generating text for portions of the audio data based, at least in part, on textual context of the work includes limiting which words the portions can be translated to based on which words are in the textual version of the work, or a subset thereof.
5. The method of Claim 4, wherein limiting which words the portions can be translated to based on which words are in the textual version of the work includes, for a given portion of the audio data, identifying a sub-section of the textual version of the work that corresponds to the given portion and limiting the words to only those words in the sub-section of the textual version of the work.
6. The method of Claim 5, wherein: identifying the sub-section of the textual version of the work includes maintaining a current text location in the textual version of the work that corresponds to a current audio location, in the audio data, of the speech-to-text analysis; and
the sub-section of the textual version of the work is a section associated with the current text location.
7. The method of any of Claims 1-6, wherein the portions include portions that correspond to individual words, and the mapping maps the locations of the portions that correspond to individual words to individual words in the textual version of the work.
8. The method of any of Claims 1-6, wherein the portions include portions that correspond to individual sentences, and the mapping maps the locations of the portions that correspond to individual sentences to individual sentences in the textual version of the work.
9. The method of any of Claims 1-6, wherein the portions include portions that correspond to fixed amounts of data, and the mapping maps the locations of the portions that correspond to fixed amounts of data to corresponding locations in the textual version of the work.
10. The method of any of Claims 1-9, wherein generating the mapping includes: (1) embedding anchors in the audio data; (2) embedding anchors in the textual version of the work; or (3) storing the mapping in a media overlay that is stored in association with the audio data or the textual version of the work.
11. The method of any of Claims 1-10, wherein each of one or more text locations of the plurality of text locations indicates a relative location in the textual version of the work.
12. The method of any of Claims 1-10, wherein one text location, of the plurality of text locations, indicates a relative location in the textual version of the work and another text location, of the plurality of text locations, indicates an absolute location from the relative location.
13. The method of any of Claims 1-10, wherein each of one or more text locations of the plurality of text locations indicates an anchor within the textual version of the work.
14. A method comprising:
receiving a textual version of a work;
performing a text-to-speech analysis of the textual version to generate first audio data;
based on the first audio data and the textual version, generating a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the textual version of the work;
receiving second audio data that reflects an audible version of the work for which the textual version exists; and
based on (1) a comparison of the first audio data and the second audio data and (2) the first mapping, generating a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the textual version of the work;
wherein the method is performed by one or more computing devices.
15. A method comprising:
receiving audio input;
performing a speech-to-text analysis of the audio input to generate text for portions of the audio input;
determining whether the text generated for portions of the audio input matches text that is currently displayed; and
in response to determining that the text matches text that is currently displayed, causing the text that is currently displayed to be highlighted;
wherein the method is performed by one or more computing devices.
16. An electronic device, comprising:
an audio data receiving unit configured for receiving audio data that reflects an audible version of a work for which a textual version exists; and
a processing unit coupled to the audio data receiving unit, the processing unit configured to:
perform a speech-to-text analysis of the audio data to generate text for portions of the audio data; and
based on the text generated for the portions of the audio data, generate a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the textual version of the work.
17. An electronic device, comprising:
a text receiving unit configured for receiving a textual version of a work; and
a processing unit coupled to the text receiving unit, the processing unit configured to:
perform a text-to-speech analysis of the textual version to generate first audio data; and
based on the first audio data and the textual version, generate a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the textual version of the work;
an audio data receiving unit configured for receiving second audio data that reflects an audible version of the work for which the textual version exists;
the processing unit further configured to, based on (1) a comparison of the first audio data and the second audio data and (2) the first mapping, generate a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the textual version of the work.
18. An electronic device, comprising:
an audio receiving unit configured for receiving audio input; and
a processing unit coupled to the audio receiving unit, the processing unit configured to:
perform a speech-to-text analysis of the audio input to generate text for portions of the audio input;
determine whether the text generated for portions of the audio input matches text that is currently displayed; and
in response to determining that the text matches text that is currently displayed, cause the text that is currently displayed to be highlighted.
19. An electronic device, comprising one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 1-15.
20. A computer-readable storage medium, storing one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing any of the methods of claims 1-15.
21. An electronic device, comprising means for performing any of the methods of claims 1-15.
22. An information processing apparatus for use in an electronic device, comprising means for performing any of the methods of claims 1-15.
23. A method comprising:
obtaining location data that indicates a specified location within a textual version of a work;
inspecting a mapping between a plurality of audio locations in an audio version of the work and a corresponding plurality of text locations in the textual version of the work to:
determine a particular text location, of the plurality of text locations, that corresponds to the specified location, and
based on the particular text location, determine a particular audio location, of the plurality of audio locations, that corresponds to the particular text location;
providing the particular audio location, that was determined based on the particular text location, to a media player to cause the media player to establish the particular audio location as a current playback position of the audio data;
wherein the method is performed by one or more computing devices.
24. The method of Claim 23, wherein:
obtaining comprises a server receiving, over a network, the location data from a first device;
inspecting and providing are performed by the server;
providing comprises the server sending the particular audio location to a second device that executes the media player.
25. The method of Claim 24, wherein the second device and the first device are the same device.
26. The method of Claim 23, wherein obtaining, inspecting, and providing are performed by a computing device that is configured to display the textual version of the work and that executes the media player.
27. The method of Claim 23, further comprising determining, at a device that is configured to display the textual version of the work, the location data without input from a user of the device.
28. The method of any of Claims 23-27, further comprising:
receiving input from a user; and
in response to receiving the input, determining the location data based on the input.
29. The method of Claim 28, wherein:
providing the particular audio location to the media player comprises providing the particular audio location to the media player to cause the media player to process the audio data beginning at the current playback position, which causes the media player to generate audio from the processed audio data; and
causing the media player to process the audio data is performed in response to receiving the input.
30. The method of Claim 29, wherein:
the input selects multiple words in the textual version of the work;
the specified location is a first specified location;
the location data also indicates a second specified location, within the textual version of the work, that is different than the first specified location;
inspecting further comprises inspecting the mapping to:
determine a second particular text location, of the plurality of text locations, that corresponds to the second specified location, and
based on the second particular text location, determine a second particular audio location, of the plurality of audio locations, that corresponds to the second particular text location; and
providing the particular audio location to the media player comprises providing the second particular audio location to the media player to cause the media player to cease processing the audio data when the current playback position arrives at or near the second particular audio location.
31. The method of any of Claims 23-30, further comprising:
obtaining annotation data that is based on input from a user;
storing the annotation data in association with the specified location; and
causing information about the annotation data to be displayed.
32. The method of Claim 31, wherein causing information about the annotation data to be displayed comprises:
determining when a current playback position of the audio data is at or near the particular audio location; and
in response to determining that the current playback position of the audio data is at or near the particular audio location, causing information about the annotation data to be displayed.
33. The method of any of Claims 31-32, wherein:
the annotation data includes text data; and
causing information about the annotation data to be displayed comprises displaying the text data.
34. The method of any of Claims 31-33, wherein:
the annotation data includes voice data; and
causing information about the annotation data to be displayed comprises processing the voice data to generate audio.
35. An electronic device, comprising one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 23-34.
36. A computer-readable storage medium, storing one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing any of the methods of claims 23-34.
37. An electronic device, comprising means for performing any of the methods of claims 23-34.
38. An information processing apparatus for use in an electronic device, comprising means for performing any of the methods of claims 23-34.
39. An electronic device, comprising:
a location data obtaining unit configured for obtaining location data that indicates a specified location within a textual version of a work; and
a processing unit coupled to the location data obtaining unit, the processing unit configured to:
inspect a mapping between a plurality of audio locations in an audio version of the work and a corresponding plurality of text locations in the textual version of the work to:
determine a particular text location, of the plurality of text locations, that corresponds to the specified location, and
based on the particular text location, determine a particular audio location, of the plurality of audio locations, that corresponds to the particular text location;
provide the particular audio location, that was determined based on the particular text location, to a media player to cause the media player to establish the particular audio location as a current playback position of the audio data.
40. A method comprising:
obtaining location data that indicates a specified location within audio data;
inspecting a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a textual version of a work to:
determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and
based on the particular audio location, determine a particular text location, of the plurality of text locations, that corresponds to the particular audio location;
causing a media player to display information about the particular text location;
wherein the method is performed by one or more computing devices.
41. The method of Claim 40, wherein:
obtaining comprises a server receiving, over a network, the location data from a first device;
inspecting and causing are performed by the server; and
causing comprises the server sending the particular text location to a second device that executes the media player.
42. The method of Claim 41, wherein the second device and the first device are the same device.
43. The method of Claim 40, wherein obtaining, inspecting, and causing are performed by a computing device that is configured to display the textual version of the work and that executes the media player.
44. The method of Claim 40, further comprising determining, at a device that is configured to process the audio data, the location data without input from a user of the device.
45. The method of any of Claims 40-44, further comprising:
receiving input from a user; and
in response to receiving the input, determining the location data based on the input.
46. The method of Claim 45, wherein:
causing comprises causing the media player to display a portion of the textual version of the work that corresponds to the particular text location; and
causing the media player to display the portion of the textual version of the work is performed in response to receiving the input.
47. The method of Claim 46, wherein:
the input selects a segment of the audio data;
the specified location is a first specified location;
the location data also indicates a second specified location, within the audio data, that is different than the first specified location;
inspecting further comprises inspecting the mapping to:
determine a second particular audio location, of the plurality of audio locations, that corresponds to the second specified location, and
based on the second particular audio location, determine a second particular text location, of the plurality of text locations, that corresponds to the second particular audio location;
causing a media player to display information about the particular text location further comprises causing the media player to display information about the second particular text location.
48. The method of any of Claims 40-47, wherein:
the specified location corresponds to a current playback position in the audio data;
causing is performed as the audio data at the specified location is processed and audio is generated;
causing comprises causing a second media player to highlight text within the textual version of the work at or near the particular text location.
49. The method of any of Claims 40-48, further comprising:
obtaining annotation data that is based on input from a user;
storing the annotation data in association with the specified location; and
causing information about the annotation data to be displayed.
50. The method of Claim 49, wherein causing information about the annotation data to be displayed comprises:
determining when a portion of the textual version of the work that corresponds to the particular text location is displayed; and
in response to determining that a portion of the textual version of the work that corresponds to the particular text location is displayed, causing information about the annotation data to be displayed.
51. The method of any of Claims 49-50, wherein:
the annotation data includes text data; and
causing information about the annotation data to be displayed comprises causing the text data to be displayed.
52. The method of any of Claims 49-51, wherein:
the annotation data includes voice data; and
causing information about the annotation data to be displayed comprises causing the voice data to be processed to generate audio.
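Claims 40-52 run the same mapping in the opposite direction: an audio position is resolved to a text location so a reader view can be scrolled or highlighted. A hedged sketch under the same assumed anchor-pair representation (all names hypothetical):

```python
# Hypothetical reverse lookup: the same anchor pairs, indexed by audio offset,
# resolve a playback position to a text location for display or highlighting.
import bisect
from typing import List, Tuple

AudioToText = List[Tuple[float, int]]   # (audio seconds, text offset)

def text_position_for_audio(mapping: AudioToText, audio_seconds: float) -> int:
    """Text offset of the anchor at or before the given audio position."""
    times = [t for t, _ in mapping]
    index = max(bisect.bisect_right(times, audio_seconds) - 1, 0)
    return mapping[index][1]

mapping = [(0.0, 0), (95.4, 1200), (190.8, 2400)]
print(text_position_for_audio(mapping, 120.0))   # 1200 -> highlight near here
```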
53. A method comprising:
during playback of an audio version of a work:
obtaining location data that indicates a specified location within the audio version, and
determining, based on the specified location, a particular text location, in a textual version of the work, that is associated with pause data that indicates when to pause playback of the audio version; and
in response to determining that the particular text location is associated with pause data, pausing playback of the audio version;
wherein the method is performed by one or more computing devices.
54. The method of Claim 53, wherein the pause data is within the textual version of the work.
55. The method of any of Claims 53-54, wherein determining the particular text location comprises:
inspecting a mapping between a plurality of audio locations in the audio version and a corresponding plurality of text locations in the textual version of the work to:
determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and
based on the particular audio location, determine the particular text location, of the plurality of text locations, that corresponds to the particular audio location.
56. The method of any of Claims 53-55, wherein the pause data corresponds to the end of a page reflected in the textual version of the work.
57. The method of any of Claims 53-55, wherein the pause data corresponds to a location, within the textual version of the work, that immediately precedes a picture that does not include text.
58. The method of any of Claims 53-57, further comprising continuing playback of the audio version in response to receiving user input.
59. The method of any of Claims 53-57, further comprising continuing playback of the audio version in response to the lapse of a particular amount of time since playback of the audio version was paused.
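Claims 53-59 describe pausing playback when the mapped text location carries pause data (for example, the end of a page, or a point just before a picture that contains no text). The sketch below is only one plausible way to model that check; the offsets and function name are illustrative assumptions, not taken from the patent:

```python
# Hypothetical pause-data check: the mapped text location for the current
# playback position is tested against a set of offsets that carry pause
# markers (end of a page, or a point just before a text-free picture).
PAUSE_OFFSETS = {2400, 5100}   # assumed text offsets associated with pause data

def should_pause(mapped_text_offset: int, consumed: set) -> bool:
    """True the first time playback reaches an offset that carries pause data."""
    if mapped_text_offset in PAUSE_OFFSETS and mapped_text_offset not in consumed:
        consumed.add(mapped_text_offset)
        return True
    return False

seen: set = set()
print(should_pause(2400, seen))   # True  -> pause playback here
print(should_pause(2400, seen))   # False -> marker already consumed, keep playing
```

Resuming after a lapse of time (claim 59) or on user input (claim 58) would then sit outside this check, in whatever logic drives the player.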
60. A method comprising:
during playback of an audio version of a work:
obtaining location data that indicates a specified location within the audio version, and
determining, based on the specified location, a particular text location, in a textual version of the work, that is associated with page end data that indicates an end of a first page reflected in the textual version of the work; and
in response to determining that the particular text location is associated with the page end data, automatically causing the first page to cease to be displayed and causing a second page that is subsequent to the first page to be displayed;
wherein the method is performed by one or more computing devices.
61. The method of Claim 60, wherein determining the particular text location comprises:
inspecting a mapping between a plurality of audio locations in the audio version and a corresponding plurality of text locations in the textual version of the work to:
determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and
based on the particular audio location, determine the particular text location, of the plurality of text locations, that corresponds to the particular audio location.
62. An electronic device, comprising one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 40-61.
63. A computer-readable storage medium, storing one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing any of the methods of claims 40-61.
64. An electronic device, comprising means for performing any of the methods of claims 40-61.
65. An information processing apparatus for use in an electronic device, comprising means for performing any of the methods of claims 40-61.
66. An electronic device, comprising:
a location obtaining unit configured for obtaining location data that indicates a specified location within audio data; and
a processing unit coupled to the location obtaining unit, the processing unit configured to:
inspect a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a textual version of a work to:
determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and
based on the particular audio location, determine a particular text location, of the plurality of text locations, that corresponds to the particular audio location;
cause a media player to display information about the particular text location.
67. An electronic device, comprising:
a location obtaining unit configured for obtaining, during playback of an audio version of a work, location data that indicates a specified location within the audio version; and
a processing unit coupled to the location obtaining unit, the processing unit configured to, during playback of the audio version of the work:
determine, based on the specified location, a particular text location, in a textual version of the work, that is associated with page end data that indicates an end of a first page reflected in the textual version of the work; and
in response to determining that the particular text location is associated with the page end data, automatically cause the first page to cease to be displayed and cause a second page that is subsequent to the first page to be displayed.
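Claims 60-67 cover the automatic page turn: when the text location mapped from the current playback position crosses an end-of-page marker, the next page is displayed. A rough, self-contained sketch (page boundaries and function names are assumptions for illustration):

```python
# Hypothetical page-turn logic: each page of the text version ends at a known
# text offset; when the mapped playback position crosses it, show the next page.
PAGE_END_OFFSETS = [2400, 4800, 7200]   # assumed end-of-page text offsets

def page_for_offset(text_offset: int) -> int:
    """Zero-based page whose text range contains the given offset."""
    for page, end_offset in enumerate(PAGE_END_OFFSETS):
        if text_offset < end_offset:
            return page
    return len(PAGE_END_OFFSETS) - 1

def page_to_display(current_page: int, mapped_text_offset: int) -> int:
    """A larger return value than current_page implies an automatic page turn."""
    return max(current_page, page_for_offset(mapped_text_offset))

print(page_to_display(0, 2500))   # 1 -> cease displaying page 0, display page 1
```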
68. A method comprising:
while a first version of a work is processed, obtaining annotation data that is based on input from a user;
storing association data that associates the annotation data with the work; and
while a second version of the work is processed, causing information about the annotation data to be displayed, wherein the second version is different than the first version;
wherein the method is performed by one or more computing devices.
69. The method of Claim 68, wherein:
obtaining comprises determining location data that indicates a specified location within the first version of the work;
storing comprises storing the location data in association with the work;
the specified location corresponds to a particular location within the second version of the work; and
causing comprises causing the information about the annotation data to be displayed in association with the particular location in the second version.
70. The method of Claim 69, wherein:
the first version is an audio version of the work and the second version is a textual version of the work;
causing information about the annotation data to be displayed comprises:
determining when a portion of the textual version of the work that corresponds to the particular location is displayed; and
in response to determining that a portion of the textual version of the work that corresponds to the particular location is displayed, causing information about the annotation data to be displayed.
71. The method of Claim 69, wherein:
the first version is a textual version of the work and the second version is an audio version of the work;
causing information about the annotation data to be displayed comprises:
determining when a portion of the audio version of the work that corresponds to the particular location is played; and
in response to determining that a portion of the audio version of the work that corresponds to the particular location is played, causing information about the annotation data to be displayed.
72. The method of any of Claims 68-71, wherein:
the annotation data includes text data; and
causing information about the annotation data to be displayed comprises causing the text data to be displayed.
73. The method of any of Claims 68-71, wherein:
the annotation data includes voice data; and
causing information about the annotation data to be displayed comprises causing the voice data to be processed to generate audio.
74. An electronic device, comprising one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 68-73.
75. A computer-readable storage medium, storing one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing any of the methods of claims 68-73.
76. An electronic device, comprising means for performing any of the methods of claims 68-73.
77. An information processing apparatus for use in an electronic device, comprising means for performing any of the methods of claims 68-73.
78. An electronic device, comprising:
an annotation obtaining unit configured for, while a first version of a work is processed, obtaining annotation data that is based on input from a user; and
a processing unit coupled to the annotation obtaining unit, the processing unit configured for:
causing association data that associates the annotation data with the work to be stored; and
while a second version of the work is processed, causing information about the annotation data to be displayed, wherein the second version is different than the first version.
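Claims 68-78 concern annotations captured against one version of a work and surfaced while the other version is consumed. One way to picture this, purely as an assumption-laden sketch, is an annotation store keyed by work identifier, with each note anchored at a version-neutral location (here a text offset obtained via the mapping):

```python
# Hypothetical cross-version annotation store: each note is anchored at a
# version-neutral text offset so it can be surfaced from either version.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Annotation:
    text_offset: int   # anchor derived via the text/audio mapping
    payload: str       # typed text, or a reference to recorded voice data

@dataclass
class AnnotationStore:
    by_work: Dict[str, List[Annotation]] = field(default_factory=dict)

    def add(self, work_id: str, note: Annotation) -> None:
        self.by_work.setdefault(work_id, []).append(note)

    def visible_on_page(self, work_id: str, start: int, end: int) -> List[Annotation]:
        """Notes whose anchor falls inside the currently displayed text range."""
        return [n for n in self.by_work.get(work_id, [])
                if start <= n.text_offset < end]

store = AnnotationStore()
store.add("example-work", Annotation(text_offset=1300, payload="check this passage"))
print(store.visible_on_page("example-work", 1200, 2400))   # note is surfaced here
```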
79. A method comprising:
receiving data that establishes a first bookmark within a first version of a work;
inspecting a mapping between a plurality of first locations in the first version of the work and a corresponding plurality of second locations in a second version of the work to:
determine a particular first location, of the plurality of first locations, that corresponds to the first bookmark, and
based on the particular first location, determine a particular second location, of the plurality of second locations, that corresponds to the particular first location;
wherein the first version of the work is different than the second version of the work;
causing data that establishes the particular second location as a second bookmark within the second version of the work to be stored;
wherein the method is performed by one or more computing devices.
80. The method of Claim 79, wherein:
receiving comprises a server receiving, over a network, the data from a first device;
inspecting is performed by the server; and
causing comprises the server sending the particular second location to a second device.
81. The method of Claim 80, wherein the first device and the second device are different devices.
82. The method of any of Claims 79-81, wherein the first version of the work is one of an audio version of the work or a text version of the work and the second version of the work is the other of the audio version or the text version.
83. An electronic device, comprising one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 79-82.
84. A computer-readable storage medium, storing one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing any of the methods of claims 79-82.
85. An electronic device, comprising means for performing any of the methods of claims 79-82.
86. An information processing apparatus for use in an electronic device, comprising means for performing any of the methods of claims 79-82.
87. An electronic device, comprising:
a data receiving unit configured for receiving data that establishes a first bookmark within a first version of a work; and
a processing unit coupled to the data receiving unit, the processing unit configured for:
inspecting a mapping between a plurality of first locations in the first version of the work and a corresponding plurality of second locations in a second version of the work to:
determine a particular first location, of the plurality of first locations, that corresponds to the first bookmark, and
based on the particular first location, determine a particular second location, of the plurality of second locations, that corresponds to the particular first location;
wherein the first version of the work is different than the second version of the work; and
causing data that establishes the particular second location as a second bookmark within the second version of the work to be stored.
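Claims 79-87 apply the same idea to bookmarks: a bookmark set in one version is converted through the mapping and stored as an equivalent bookmark in the other version. A toy sketch of the storage side (the conversion itself would reuse a lookup like the ones sketched earlier; all names are hypothetical):

```python
# Hypothetical bookmark mirroring: once the mapping has converted a location
# from one version to the other, both bookmarks are stored for the same work.
from typing import Dict, Tuple

bookmarks: Dict[Tuple[str, str], float] = {}   # (work_id, version) -> location

def mirror_bookmark(work_id: str, audio_seconds: float, text_offset: int) -> None:
    """Keep the audio-version and text-version bookmarks of a work in step."""
    bookmarks[(work_id, "audio")] = audio_seconds
    bookmarks[(work_id, "text")] = float(text_offset)

mirror_bookmark("example-work", audio_seconds=130.0, text_offset=1200)
print(bookmarks)   # both versions now open at the equivalent position
```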
88. A method comprising:
causing a portion of text of a work to be displayed by a device;
while the portion of text is displayed:
receiving, at the device, audio input from a user;
in response to receiving the audio input:
analyzing the audio input to identify one or more words;
determining whether the one or more words are reflected in the portion of the text;
in response to determining that the one or more words are reflected in the portion of the text, causing a visual indication to be displayed by the device.
89. The method of Claim 88, wherein causing the visual indication to be displayed comprises causing text data that corresponds to the one or more words to be highlighted.
90. An electronic device, comprising one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 88-89.
91. A computer-readable storage medium, storing one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing any of the methods of claims 88-89.
92. An electronic device, comprising means for performing any of the methods of claims 88-89.
93. An information processing apparatus for use in an electronic device, comprising means for performing any of the methods of claims 88-89.
94. An electronic device, comprising:
a processing unit configured for causing a portion of text of a work to be displayed by a device;
an audio receiving unit coupled to the processing unit and configured for receiving, at the device, audio input from a user; and
the processing unit further configured for, in response to receiving the audio input at the audio receiving unit:
analyzing the audio input to identify one or more words;
determining whether the one or more words are reflected in the portion of the text; and
in response to determining that the one or more words are reflected in the portion of the text, causing a visual indication to be displayed by the device.
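Claims 88-94 describe a read-along feature: speech from the user is recognized, the recognized words are checked against the displayed portion of the text, and a matching span is visually indicated (for example, highlighted). Speech recognition itself is outside the sketch below; it only shows the matching step, with the example page text and function name as assumptions:

```python
# Hypothetical read-along matcher: recognized words from the user's speech are
# searched for in the displayed text; a hit returns the span to highlight.
import re
from typing import Optional, Tuple

def find_spoken_words(displayed_text: str, recognized: str) -> Optional[Tuple[int, int]]:
    """(start, end) of the recognized phrase within the displayed text, if present."""
    match = re.search(re.escape(recognized.strip()), displayed_text, re.IGNORECASE)
    return (match.start(), match.end()) if match else None

page = "Call me Ishmael. Some years ago, never mind how long precisely..."
print(find_spoken_words(page, "some years ago"))   # (17, 31) -> highlight this span
```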
PCT/US2012/040801 2011-06-03 2012-06-04 Automatically creating a mapping between text data and audio data WO2012167276A1 (en)

Priority Applications (8)

Application Number Priority Date Filing Date Title
CN201280036281.5A CN103703431B (en) 2011-06-03 2012-06-04 Automatically create the mapping between text data and voice data
KR1020167006970A KR101700076B1 (en) 2011-06-03 2012-06-04 Automatically creating a mapping between text data and audio data
AU2012261818A AU2012261818B2 (en) 2011-06-03 2012-06-04 Automatically creating a mapping between text data and audio data
KR1020157017690A KR101674851B1 (en) 2011-06-03 2012-06-04 Automatically creating a mapping between text data and audio data
JP2014513799A JP2014519058A (en) 2011-06-03 2012-06-04 Automatic creation of mapping between text data and audio data
EP12729332.2A EP2593846A4 (en) 2011-06-03 2012-06-04 Automatically creating a mapping between text data and audio data
KR1020137034641A KR101622015B1 (en) 2011-06-03 2012-06-04 Automatically creating a mapping between text data and audio data
AU2016202974A AU2016202974B2 (en) 2011-06-03 2016-05-09 Automatically creating a mapping between text data and audio data

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US201161493372P 2011-06-03 2011-06-03
US61/493,372 2011-06-03
US201161494375P 2011-06-07 2011-06-07
US61/494,375 2011-06-07
US13/267,749 2011-10-06
US13/267,738 2011-10-06
US13/267,749 US10672399B2 (en) 2011-06-03 2011-10-06 Switching between text data and audio data based on a mapping
US13/267,738 US20120310642A1 (en) 2011-06-03 2011-10-06 Automatically creating a mapping between text data and audio data

Publications (1)

Publication Number Publication Date
WO2012167276A1 true WO2012167276A1 (en) 2012-12-06

Family

ID=47262337

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2012/040801 WO2012167276A1 (en) 2011-06-03 2012-06-04 Automatically creating a mapping between text data and audio data

Country Status (7)

Country Link
US (2) US10672399B2 (en)
EP (1) EP2593846A4 (en)
JP (1) JP2014519058A (en)
KR (4) KR101622015B1 (en)
CN (1) CN103703431B (en)
AU (2) AU2012261818B2 (en)
WO (1) WO2012167276A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210097999A1 (en) * 2018-06-27 2021-04-01 Google Llc Rendering responses to a spoken utterance of a user utilizing a local text-response map
CN109522427B (en) * 2018-09-30 2021-12-10 北京光年无限科技有限公司 Intelligent robot-oriented story data processing method and device

Families Citing this family (277)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8645137B2 (en) 2000-03-16 2014-02-04 Apple Inc. Fast, language-independent method for user authentication by voice
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US9020811B2 (en) * 2006-10-13 2015-04-28 Syscom, Inc. Method and system for converting text files searchable text and for processing the searchable text
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US10002189B2 (en) 2007-12-20 2018-06-19 Apple Inc. Method and apparatus for searching using an active ontology
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US10496753B2 (en) 2010-01-18 2019-12-03 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US20100030549A1 (en) 2008-07-31 2010-02-04 Lee Michael M Mobile device having human language translation capability with positional feedback
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
WO2010067118A1 (en) 2008-12-11 2010-06-17 Novauris Technologies Limited Speech recognition involving a mobile device
US10706373B2 (en) 2011-06-03 2020-07-07 Apple Inc. Performing actions associated with task items that represent tasks to perform
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US9431006B2 (en) 2009-07-02 2016-08-30 Apple Inc. Methods and apparatuses for automatic speech recognition
AU2011203833B2 (en) * 2010-01-11 2014-07-10 Apple Inc. Electronic text manipulation and display
US10705794B2 (en) 2010-01-18 2020-07-07 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US10553209B2 (en) 2010-01-18 2020-02-04 Apple Inc. Systems and methods for hands-free notification summaries
US10276170B2 (en) 2010-01-18 2019-04-30 Apple Inc. Intelligent automated assistant
US10679605B2 (en) 2010-01-18 2020-06-09 Apple Inc. Hands-free list-reading by intelligent automated assistant
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US10762293B2 (en) 2010-12-22 2020-09-01 Apple Inc. Using parts-of-speech tagging and named entity recognition for spelling correction
US9645986B2 (en) 2011-02-24 2017-05-09 Google Inc. Method, medium, and system for creating an electronic book with an umbrella policy
US10019995B1 (en) 2011-03-01 2018-07-10 Alice J. Stiebel Methods and systems for language learning based on a series of pitch patterns
US11062615B1 (en) 2011-03-01 2021-07-13 Intelligibility Training LLC Methods and systems for remote language learning in a pandemic-aware world
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US8855797B2 (en) 2011-03-23 2014-10-07 Audible, Inc. Managing playback of synchronized content
US9734153B2 (en) 2011-03-23 2017-08-15 Audible, Inc. Managing related digital content
US8948892B2 (en) 2011-03-23 2015-02-03 Audible, Inc. Managing playback of synchronized content
US9703781B2 (en) 2011-03-23 2017-07-11 Audible, Inc. Managing related digital content
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
US10672399B2 (en) 2011-06-03 2020-06-02 Apple Inc. Switching between text data and audio data based on a mapping
US8994660B2 (en) 2011-08-29 2015-03-31 Apple Inc. Text correction processing
US8819012B2 (en) * 2011-08-30 2014-08-26 International Business Machines Corporation Accessing anchors in voice site content
US9141404B2 (en) 2011-10-24 2015-09-22 Google Inc. Extensible framework for ereader tools
JP5941264B2 (en) * 2011-11-01 2016-06-29 キヤノン株式会社 Information processing apparatus and information processing method
US9031493B2 (en) 2011-11-18 2015-05-12 Google Inc. Custom narration of electronic books
US20130129310A1 (en) * 2011-11-22 2013-05-23 Pleiades Publishing Limited Inc. Electronic book
US9213705B1 (en) * 2011-12-19 2015-12-15 Audible, Inc. Presenting content related to primary audio content
US9117195B2 (en) * 2012-02-13 2015-08-25 Google Inc. Synchronized consumption modes for e-books
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
US20130268826A1 (en) * 2012-04-06 2013-10-10 Google Inc. Synchronizing progress in audio and text versions of electronic books
US9075760B2 (en) 2012-05-07 2015-07-07 Audible, Inc. Narration settings distribution for content customization
US9280610B2 (en) 2012-05-14 2016-03-08 Apple Inc. Crowd sourcing information to fulfill user requests
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US9317500B2 (en) 2012-05-30 2016-04-19 Audible, Inc. Synchronizing translated digital content
US8933312B2 (en) * 2012-06-01 2015-01-13 Makemusic, Inc. Distribution of audio sheet music as an electronic book
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
US9141257B1 (en) 2012-06-18 2015-09-22 Audible, Inc. Selecting and conveying supplemental content
US9536439B1 (en) 2012-06-27 2017-01-03 Audible, Inc. Conveying questions with content
US9679608B2 (en) 2012-06-28 2017-06-13 Audible, Inc. Pacing content
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
US20140013192A1 (en) * 2012-07-09 2014-01-09 Sas Institute Inc. Techniques for touch-based digital document audio and user interface enhancement
US10109278B2 (en) 2012-08-02 2018-10-23 Audible, Inc. Aligning body matter across content formats
US9047356B2 (en) 2012-09-05 2015-06-02 Google Inc. Synchronizing multiple reading positions in electronic books
US9576574B2 (en) 2012-09-10 2017-02-21 Apple Inc. Context-sensitive handling of interruptions by intelligent digital assistant
US9547647B2 (en) 2012-09-19 2017-01-17 Apple Inc. Voice-based media searching
US9684641B1 (en) * 2012-09-21 2017-06-20 Amazon Technologies, Inc. Presenting content in multiple languages
US9367196B1 (en) 2012-09-26 2016-06-14 Audible, Inc. Conveying branched content
US9632647B1 (en) * 2012-10-09 2017-04-25 Audible, Inc. Selecting presentation positions in dynamic content
US9223830B1 (en) 2012-10-26 2015-12-29 Audible, Inc. Content presentation analysis
US9542936B2 (en) * 2012-12-29 2017-01-10 Genesys Telecommunications Laboratories, Inc. Fast out-of-vocabulary search in automatic speech recognition systems
US20140191976A1 (en) * 2013-01-07 2014-07-10 Microsoft Corporation Location Based Augmentation For Story Reading
US9280906B2 (en) * 2013-02-04 2016-03-08 Audible, Inc. Prompting a user for input during a synchronous presentation of audio content and textual content
US9472113B1 (en) 2013-02-05 2016-10-18 Audible, Inc. Synchronizing playback of digital content with physical content
CN104969289B (en) 2013-02-07 2021-05-28 苹果公司 Voice trigger of digital assistant
KR101952179B1 (en) * 2013-03-05 2019-05-22 엘지전자 주식회사 Mobile terminal and control method for the mobile terminal
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US9368114B2 (en) 2013-03-14 2016-06-14 Apple Inc. Context-sensitive handling of interruptions
KR101759009B1 (en) 2013-03-15 2017-07-17 애플 인크. Training an at least partial voice command system
WO2014144579A1 (en) 2013-03-15 2014-09-18 Apple Inc. System and method for updating an adaptive speech recognition model
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
KR102045281B1 (en) * 2013-06-04 2019-11-15 삼성전자주식회사 Method for processing data and an electronis device thereof
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
US9317486B1 (en) 2013-06-07 2016-04-19 Audible, Inc. Synchronizing playback of digital content with captured physical content
WO2014197336A1 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
WO2014197334A2 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
WO2014197335A1 (en) 2013-06-08 2014-12-11 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
KR101959188B1 (en) 2013-06-09 2019-07-02 애플 인크. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
WO2014200731A1 (en) 2013-06-13 2014-12-18 Apple Inc. System and method for emergency calls initiated by voice command
EP3017576A4 (en) * 2013-07-03 2016-06-29 Ericsson Telefon Ab L M Providing an electronic book to a user equipment
KR101749009B1 (en) 2013-08-06 2017-06-19 애플 인크. Auto-activating smart responses based on activities from remote devices
KR20150024188A (en) * 2013-08-26 2015-03-06 삼성전자주식회사 A method for modifiying text data corresponding to voice data and an electronic device therefor
US9489360B2 (en) * 2013-09-05 2016-11-08 Audible, Inc. Identifying extra material in companion content
WO2015040743A1 (en) 2013-09-20 2015-03-26 株式会社東芝 Annotation sharing method, annotation sharing device, and annotation sharing program
US20150089368A1 (en) * 2013-09-25 2015-03-26 Audible, Inc. Searching within audio content
KR102143997B1 (en) * 2013-10-17 2020-08-12 삼성전자 주식회사 Apparatus and method for processing an information list in electronic device
US10296160B2 (en) 2013-12-06 2019-05-21 Apple Inc. Method for extracting salient dialog usage from live data
US9620105B2 (en) 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
US10592095B2 (en) 2014-05-23 2020-03-17 Apple Inc. Instantaneous speaking of content on touch devices
US9502031B2 (en) 2014-05-27 2016-11-22 Apple Inc. Method for supporting dynamic grammars in WFST-based ASR
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US10289433B2 (en) 2014-05-30 2019-05-14 Apple Inc. Domain specific language for encoding assistant dialog
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US10078631B2 (en) 2014-05-30 2018-09-18 Apple Inc. Entropy-guided text prediction using combined word and character n-gram language models
US9966065B2 (en) 2014-05-30 2018-05-08 Apple Inc. Multi-command single utterance input method
US9734193B2 (en) 2014-05-30 2017-08-15 Apple Inc. Determining domain salience ranking from ambiguous words in natural speech
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US10659851B2 (en) 2014-06-30 2020-05-19 Apple Inc. Real-time digital assistant knowledge updates
US10446141B2 (en) 2014-08-28 2019-10-15 Apple Inc. Automatic speech recognition based on user feedback
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10789041B2 (en) 2014-09-12 2020-09-29 Apple Inc. Dynamic thresholds for always listening speech trigger
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US10552013B2 (en) 2014-12-02 2020-02-04 Apple Inc. Data detection
US9711141B2 (en) 2014-12-09 2017-07-18 Apple Inc. Disambiguating heteronyms in speech synthesis
US10152299B2 (en) 2015-03-06 2018-12-11 Apple Inc. Reducing response latency of intelligent automated assistants
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
CN106033678A (en) * 2015-03-18 2016-10-19 珠海金山办公软件有限公司 Playing content display method and apparatus thereof
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
US10089059B1 (en) * 2015-03-30 2018-10-02 Audible, Inc. Managing playback of media content with location data
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US10331304B2 (en) 2015-05-06 2019-06-25 Microsoft Technology Licensing, Llc Techniques to automatically generate bookmarks for media files
US10460227B2 (en) 2015-05-15 2019-10-29 Apple Inc. Virtual assistant in a communication session
US10200824B2 (en) 2015-05-27 2019-02-05 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US10127220B2 (en) 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings
US9578173B2 (en) 2015-06-05 2017-02-21 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US20160378747A1 (en) 2015-06-29 2016-12-29 Apple Inc. Virtual assistant for media playback
US10387570B2 (en) * 2015-08-27 2019-08-20 Lenovo (Singapore) Pte Ltd Enhanced e-reader experience
US10740384B2 (en) 2015-09-08 2020-08-11 Apple Inc. Intelligent automated assistant for media search and playback
US10331312B2 (en) 2015-09-08 2019-06-25 Apple Inc. Intelligent automated assistant in a media environment
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
GB201516553D0 (en) 2015-09-18 2015-11-04 Microsoft Technology Licensing Llc Inertia audio scrolling
GB201516552D0 (en) * 2015-09-18 2015-11-04 Microsoft Technology Licensing Llc Keyword zoom
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification
US20170098324A1 (en) * 2015-10-05 2017-04-06 Vitthal Srinivasan Method and system for automatically converting input text into animated video
KR101663300B1 (en) * 2015-11-04 2016-10-07 주식회사 디앤피코퍼레이션 Apparatus and method for implementing interactive fairy tale book
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10147416B2 (en) * 2015-12-09 2018-12-04 Amazon Technologies, Inc. Text-to-speech processing systems and methods
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10235367B2 (en) 2016-01-11 2019-03-19 Microsoft Technology Licensing, Llc Organization, retrieval, annotation and presentation of media data files using signals captured from a viewing environment
CN105632484B (en) * 2016-02-19 2019-04-09 云知声(上海)智能科技有限公司 Speech database for speech synthesis pause information automatic marking method and system
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US10606950B2 (en) * 2016-03-16 2020-03-31 Sony Mobile Communications, Inc. Controlling playback of speech-containing audio data
CN109074240B (en) * 2016-04-27 2021-11-23 索尼公司 Information processing apparatus, information processing method, and program
US20170315976A1 (en) * 2016-04-29 2017-11-02 Seagate Technology Llc Annotations for digital media items post capture
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
DK179309B1 (en) 2016-06-09 2018-04-23 Apple Inc Intelligent automated assistant in a home environment
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10586535B2 (en) 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment
DK179415B1 (en) 2016-06-11 2018-06-14 Apple Inc Intelligent device arbitration and control
DK179343B1 (en) 2016-06-11 2018-05-14 Apple Inc Intelligent task discovery
DK201670540A1 (en) 2016-06-11 2018-01-08 Apple Inc Application integration with a digital assistant
DK179049B1 (en) 2016-06-11 2017-09-18 Apple Inc Data driven natural language event detection and classification
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
CN106527845B (en) * 2016-10-11 2019-12-10 东南大学 Method and device for carrying out voice annotation and reproducing mouse operation in text
US10489110B2 (en) 2016-11-22 2019-11-26 Microsoft Technology Licensing, Llc Implicit narration for aural user interface
US10559297B2 (en) * 2016-11-28 2020-02-11 Microsoft Technology Licensing, Llc Audio landmarking for aural user interface
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
US10475438B1 (en) * 2017-03-02 2019-11-12 Amazon Technologies, Inc. Contextual text-to-speech processing
CN107122179A (en) 2017-03-31 2017-09-01 阿里巴巴集团控股有限公司 The function control method and device of voice
WO2018187234A1 (en) * 2017-04-03 2018-10-11 Ex-Iq, Inc. Hands-free annotations of audio text
DK201770383A1 (en) 2017-05-09 2018-12-14 Apple Inc. User interface for correcting recognition errors
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
DK180048B1 (en) 2017-05-11 2020-02-04 Apple Inc. MAINTAINING THE DATA PROTECTION OF PERSONAL INFORMATION
DK201770439A1 (en) 2017-05-11 2018-12-13 Apple Inc. Offline personal assistant
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
DK179745B1 (en) 2017-05-12 2019-05-01 Apple Inc. SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT
DK201770427A1 (en) 2017-05-12 2018-12-20 Apple Inc. Low-latency intelligent automated assistant
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
DK201770431A1 (en) 2017-05-15 2018-12-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
DK201770432A1 (en) 2017-05-15 2018-12-21 Apple Inc. Hierarchical belief states for digital assistants
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US20180336892A1 (en) 2017-05-16 2018-11-22 Apple Inc. Detecting a trigger of a digital assistant
DK179560B1 (en) 2017-05-16 2019-02-18 Apple Inc. Far-field extension for digital assistant services
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
CN107657973B (en) * 2017-09-27 2020-05-08 风变科技(深圳)有限公司 Text and audio mixed display method and device, terminal equipment and storage medium
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
CN107885430B (en) * 2017-11-07 2020-07-24 Oppo广东移动通信有限公司 Audio playing method and device, storage medium and electronic equipment
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
CN108255386B (en) * 2018-02-12 2019-07-05 掌阅科技股份有限公司 The display methods of the hand-written notes of e-book calculates equipment and computer storage medium
CN108460120A (en) * 2018-02-13 2018-08-28 广州视源电子科技股份有限公司 Data save method, device, terminal device and storage medium
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
DK201870355A1 (en) 2018-06-01 2019-12-16 Apple Inc. Virtual assistant operation in multi-device environments
DK180639B1 (en) 2018-06-01 2021-11-04 Apple Inc DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT
DK179822B1 (en) 2018-06-01 2019-07-12 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
EP3824461B1 (en) * 2018-07-19 2022-08-31 Dolby International AB Method and system for creating object-based audio content
US10380997B1 (en) * 2018-07-27 2019-08-13 Deepgram, Inc. Deep learning internal state index-based search and classification
KR102136464B1 (en) * 2018-07-31 2020-07-21 전자부품연구원 Audio Segmentation Method based on Attention Mechanism
EP4191562A1 (en) * 2018-08-27 2023-06-07 Google LLC Algorithmic determination of a story readers discontinuation of reading
WO2020046387A1 (en) 2018-08-31 2020-03-05 Google Llc Dynamic adjustment of story time special effects based on contextual data
EP3837597A1 (en) 2018-09-04 2021-06-23 Google LLC Detection of story reader progress for pre-caching special effects
WO2020050820A1 (en) 2018-09-04 2020-03-12 Google Llc Reading progress estimation based on phonetic fuzzy matching and confidence interval
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
CN109491740B (en) * 2018-10-30 2021-09-10 北京云测信息技术有限公司 Automatic multi-version funnel page optimization method based on context background information
GB2578742A (en) * 2018-11-06 2020-05-27 Arm Ip Ltd Resources and methods for tracking progression in a literary work
EP3660848A1 (en) 2018-11-29 2020-06-03 Ricoh Company, Ltd. Apparatus, system, and method of display control, and carrier means
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
KR20200092763A (en) * 2019-01-25 2020-08-04 삼성전자주식회사 Electronic device for processing user speech and controlling method thereof
CN110110136A (en) * 2019-02-27 2019-08-09 咪咕数字传媒有限公司 A kind of text sound matching process, electronic equipment and storage medium
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US10930284B2 (en) 2019-04-11 2021-02-23 Advanced New Technologies Co., Ltd. Information processing system, method, device and equipment
DK201970509A1 (en) 2019-05-06 2021-01-15 Apple Inc Spoken notifications
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
DK180129B1 (en) 2019-05-31 2020-06-02 Apple Inc. User activity shortcut suggestions
DK201970510A1 (en) 2019-05-31 2021-02-11 Apple Inc Voice identification in digital assistant systems
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11227599B2 (en) 2019-06-01 2022-01-18 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
CN113412514A (en) 2019-07-09 2021-09-17 谷歌有限责任公司 On-device speech synthesis of text segments for training of on-device speech recognition models
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11354920B2 (en) * 2019-10-12 2022-06-07 International Business Machines Corporation Updating and implementing a document from an audio proceeding
US10805665B1 (en) 2019-12-13 2020-10-13 Bank Of America Corporation Synchronizing text-to-audio with interactive videos in the video framework
US11350185B2 (en) * 2019-12-13 2022-05-31 Bank Of America Corporation Text-to-audio for interactive videos using a markup language
USD954967S1 (en) * 2020-02-21 2022-06-14 Bone Foam, Inc. Dual leg support device
US11043220B1 (en) 2020-05-11 2021-06-22 Apple Inc. Digital assistant hardware abstraction
US11061543B1 (en) 2020-05-11 2021-07-13 Apple Inc. Providing relevant data items based on context
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11490204B2 (en) 2020-07-20 2022-11-01 Apple Inc. Multi-device audio adjustment coordination
US11438683B2 (en) 2020-07-21 2022-09-06 Apple Inc. User identification using headphones
WO2022047516A1 (en) * 2020-09-04 2022-03-10 The University Of Melbourne System and method for audio annotation
CN112530472B (en) * 2020-11-26 2022-06-21 北京字节跳动网络技术有限公司 Audio and text synchronization method and device, readable medium and electronic equipment
CN112990173B (en) * 2021-02-04 2023-10-27 上海哔哩哔哩科技有限公司 Reading processing method, device and system
US11798536B2 (en) 2021-06-14 2023-10-24 International Business Machines Corporation Annotation of media files with convenient pause points
KR102553832B1 (en) 2021-07-27 2023-07-07 울산과학기술원 Controlling and assisting device for listening means
US11537781B1 (en) 2021-09-15 2022-12-27 Lumos Information Services, LLC System and method to support synchronization, closed captioning and highlight within a text document or a media file
US20230177258A1 (en) * 2021-12-02 2023-06-08 At&T Intellectual Property I, L.P. Shared annotation of media sub-content

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6260011B1 (en) * 2000-03-20 2001-07-10 Microsoft Corporation Methods and apparatus for automatically synchronizing electronic audio files with electronic text files
US20020095290A1 (en) * 1999-02-05 2002-07-18 Jonathan Kahn Speech recognition program mapping tool to align an audio file to verbatim text
US20020129057A1 (en) * 2001-03-09 2002-09-12 Steven Spielberg Method and apparatus for annotating a document
US20040054530A1 (en) * 2002-09-18 2004-03-18 International Business Machines Corporation Generating speech recognition grammars from a large corpus of data
US20040216049A1 (en) * 2000-09-20 2004-10-28 International Business Machines Corporation Method for enhancing dictation and command discrimination
US20050010409A1 (en) * 2001-11-19 2005-01-13 Hull Jonathan J. Printable representations for time-based media
US20060020890A1 (en) * 2004-07-23 2006-01-26 Findaway World, Inc. Personal media player apparatus and method
US20070156627A1 (en) * 2005-12-15 2007-07-05 General Instrument Corporation Method and apparatus for creating and using electronic content bookmarks
US20070203955A1 (en) * 2006-02-28 2007-08-30 Sandisk Il Ltd. Bookmarked synchronization of files
US20070271104A1 (en) * 2006-05-19 2007-11-22 Mckay Martin Streaming speech with synchronized highlighting generated by a server
US20080255837A1 (en) * 2004-11-30 2008-10-16 Jonathan Kahn Method for locating an audio segment within an audio file
US7490039B1 (en) * 2001-12-13 2009-02-10 Cisco Technology, Inc. Text to speech system and method having interactive spelling capabilities
US20100324895A1 (en) * 2009-01-15 2010-12-23 K-Nfb Reading Technology, Inc. Synchronization for document narration

Family Cites Families (790)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US626011A (en) * 1899-05-30 stuckwisch
US3828132A (en) 1970-10-30 1974-08-06 Bell Telephone Labor Inc Speech synthesis by concatenation of formant encoded words
US3704345A (en) 1971-03-19 1972-11-28 Bell Telephone Labor Inc Conversion of printed text into synthetic speech
US3979557A (en) 1974-07-03 1976-09-07 International Telephone And Telegraph Corporation Speech processor system for pitch period extraction using prediction filters
BG24190A1 (en) 1976-09-08 1978-01-10 Antonov Method of synthesis of speech and device for effecting same
JPS597120B2 (en) 1978-11-24 1984-02-16 日本電気株式会社 speech analysis device
US4310721A (en) 1980-01-23 1982-01-12 The United States Of America As Represented By The Secretary Of The Army Half duplex integral vocoder modem system
US4348553A (en) 1980-07-02 1982-09-07 International Business Machines Corporation Parallel pattern verifier with dynamic time warping
US5047617A (en) 1982-01-25 1991-09-10 Symbol Technologies, Inc. Narrow-bodied, single- and twin-windowed portable laser scanning head for reading bar code symbols
DE3382806T2 (en) 1982-06-11 1996-11-14 Mitsubishi Electric Corp Vector quantizer
US4688195A (en) 1983-01-28 1987-08-18 Texas Instruments Incorporated Natural-language interface generating system
JPS603056A (en) 1983-06-21 1985-01-09 Toshiba Corp Information rearranging device
DE3335358A1 (en) 1983-09-29 1985-04-11 Siemens AG, 1000 Berlin und 8000 München METHOD FOR DETERMINING LANGUAGE SPECTRES FOR AUTOMATIC VOICE RECOGNITION AND VOICE ENCODING
US5164900A (en) 1983-11-14 1992-11-17 Colman Bernath Method and device for phonetically encoding Chinese textual data for data processing entry
US4726065A (en) 1984-01-26 1988-02-16 Horst Froessl Image manipulation by speech signals
US4955047A (en) 1984-03-26 1990-09-04 Dytel Corporation Automated attendant with direct inward system access
US4811243A (en) 1984-04-06 1989-03-07 Racine Marsh V Computer aided coordinate digitizing system
US4692941A (en) 1984-04-10 1987-09-08 First Byte Real-time text-to-speech conversion system
US4783807A (en) 1984-08-27 1988-11-08 John Marley System and method for sound recognition with feature selection synchronized to voice pitch
US4718094A (en) 1984-11-19 1988-01-05 International Business Machines Corp. Speech recognition system
US5165007A (en) 1985-02-01 1992-11-17 International Business Machines Corporation Feneme-based Markov models for words
US4944013A (en) 1985-04-03 1990-07-24 British Telecommunications Public Limited Company Multi-pulse speech coder
US4819271A (en) 1985-05-29 1989-04-04 International Business Machines Corporation Constructing Markov model word baseforms from multiple utterances by concatenating model sequences for word segments
US4833712A (en) 1985-05-29 1989-05-23 International Business Machines Corporation Automatic generation of simple Markov model stunted baseforms for words in a vocabulary
US4829583A (en) 1985-06-03 1989-05-09 Sino Business Machines, Inc. Method and apparatus for processing ideographic characters
EP0218859A3 (en) 1985-10-11 1989-09-06 International Business Machines Corporation Signal processor communication interface
US4776016A (en) 1985-11-21 1988-10-04 Position Orientation Systems, Inc. Voice control system
JPH0833744B2 (en) 1986-01-09 1996-03-29 株式会社東芝 Speech synthesizer
US4724542A (en) 1986-01-22 1988-02-09 International Business Machines Corporation Automatic reference adaptation during dynamic signature verification
US5759101A (en) 1986-03-10 1998-06-02 Response Reward Systems L.C. Central and remote evaluation of responses of participatory broadcast audience with automatic crediting and couponing
US5057915A (en) 1986-03-10 1991-10-15 Kohorn H Von System and method for attracting shoppers to sales outlets
US5032989A (en) 1986-03-19 1991-07-16 Realpro, Ltd. Real estate search and location system and method
DE3779351D1 (en) 1986-03-28 1992-07-02 American Telephone And Telegraph Co., New York, N.Y., Us
US4903305A (en) 1986-05-12 1990-02-20 Dragon Systems, Inc. Method for representing word models for use in speech recognition
WO1988002516A1 (en) 1986-10-03 1988-04-07 British Telecommunications Public Limited Company Language translation system
AU592236B2 (en) 1986-10-16 1990-01-04 Mitsubishi Denki Kabushiki Kaisha Amplitude-adapted vector quantizer
US4829576A (en) 1986-10-21 1989-05-09 Dragon Systems, Inc. Voice recognition system
US4852168A (en) 1986-11-18 1989-07-25 Sprague Richard P Compression of stored waveforms for artificial speech
US4727354A (en) 1987-01-07 1988-02-23 Unisys Corporation System for selecting best fit vector code in vector quantization encoding
US4827520A (en) 1987-01-16 1989-05-02 Prince Corporation Voice actuated control system for use in a vehicle
US5179627A (en) 1987-02-10 1993-01-12 Dictaphone Corporation Digital dictation system
US4965763A (en) 1987-03-03 1990-10-23 International Business Machines Corporation Computer method for automatic extraction of commonly specified information from business correspondence
US5644727A (en) 1987-04-15 1997-07-01 Proprietary Financial Products, Inc. System for the operation and management of one or more financial accounts through the use of a digital communication and computation system for exchange, investment and borrowing
EP0293259A3 (en) 1987-05-29 1990-03-07 Kabushiki Kaisha Toshiba Voice recognition system used in telephone apparatus
DE3723078A1 (en) 1987-07-11 1989-01-19 Philips Patentverwaltung METHOD FOR DETECTING CONTINUOUSLY SPOKEN WORDS
CA1288516C (en) 1987-07-31 1991-09-03 Leendert M. Bijnagte Apparatus and method for communicating textual and image information between a host computer and a remote display terminal
US4974191A (en) 1987-07-31 1990-11-27 Syntellect Software Inc. Adaptive natural language computer interface system
US4827518A (en) 1987-08-06 1989-05-02 Bell Communications Research, Inc. Speaker verification system using integrated circuit cards
US5022081A (en) 1987-10-01 1991-06-04 Sharp Kabushiki Kaisha Information recognition system
US4852173A (en) 1987-10-29 1989-07-25 International Business Machines Corporation Design and construction of a binary-tree system for language modelling
US5072452A (en) 1987-10-30 1991-12-10 International Business Machines Corporation Automatic determination of labels and Markov word models in a speech recognition system
DE3876379T2 (en) 1987-10-30 1993-06-09 Ibm AUTOMATIC DETERMINATION OF LABELS AND MARKOV WORD MODELS IN A VOICE RECOGNITION SYSTEM.
US4914586A (en) 1987-11-06 1990-04-03 Xerox Corporation Garbage collector for hypermedia systems
US4992972A (en) 1987-11-18 1991-02-12 International Business Machines Corporation Flexible context searchable on-line information system with help files and modules for on-line computer system documentation
US5220657A (en) 1987-12-02 1993-06-15 Xerox Corporation Updating local copy of shared data in a collaborative system
US4984177A (en) 1988-02-05 1991-01-08 Advanced Products And Technologies, Inc. Voice language translator
CA1333420C (en) 1988-02-29 1994-12-06 Tokumichi Murakami Vector quantizer
US4914590A (en) 1988-05-18 1990-04-03 Emhart Industries, Inc. Natural language understanding system
FR2636163B1 (en) 1988-09-02 1991-07-05 Hamon Christian METHOD AND DEVICE FOR SYNTHESIZING SPEECH BY ADDING-COVERING WAVEFORMS
US4839853A (en) 1988-09-15 1989-06-13 Bell Communications Research, Inc. Computer information retrieval using latent semantic structure
JPH0293597A (en) 1988-09-30 1990-04-04 Nippon I B M Kk Speech recognition device
US4905163A (en) 1988-10-03 1990-02-27 Minnesota Mining & Manufacturing Company Intelligent optical navigator dynamic information presentation and navigation system
US5282265A (en) 1988-10-04 1994-01-25 Canon Kabushiki Kaisha Knowledge information processing system
DE3837590A1 (en) 1988-11-05 1990-05-10 Ant Nachrichtentech PROCESS FOR REDUCING THE DATA RATE OF DIGITAL IMAGE DATA
ATE102731T1 (en) 1988-11-23 1994-03-15 Digital Equipment Corp NAME PRONUNCIATION BY A SYNTHESIZER.
US5027406A (en) 1988-12-06 1991-06-25 Dragon Systems, Inc. Method for interactive speech recognition and training
US5127055A (en) 1988-12-30 1992-06-30 Kurzweil Applied Intelligence, Inc. Speech recognition apparatus & method having dynamic reference pattern adaptation
US5293448A (en) 1989-10-02 1994-03-08 Nippon Telegraph And Telephone Corporation Speech analysis-synthesis method and apparatus therefor
SE466029B (en) 1989-03-06 1991-12-02 Ibm Svenska Ab DEVICE AND PROCEDURE FOR ANALYSIS OF NATURAL LANGUAGES IN A COMPUTER-BASED INFORMATION PROCESSING SYSTEM
JPH0782544B2 (en) 1989-03-24 1995-09-06 インターナショナル・ビジネス・マシーンズ・コーポレーション DP matching method and apparatus using multi-template
US4977598A (en) 1989-04-13 1990-12-11 Texas Instruments Incorporated Efficient pruning algorithm for hidden markov model speech recognition
US5197005A (en) 1989-05-01 1993-03-23 Intelligent Business Systems Database retrieval system having a natural language interface
US5010574A (en) 1989-06-13 1991-04-23 At&T Bell Laboratories Vector quantizer search arrangement
JP2940005B2 (en) 1989-07-20 1999-08-25 日本電気株式会社 Audio coding device
US5091945A (en) 1989-09-28 1992-02-25 At&T Bell Laboratories Source dependent channel coding with error protection
CA2027705C (en) 1989-10-17 1994-02-15 Masami Akamine Speech coding system utilizing a recursive computation technique for improvement in processing speed
US5020112A (en) 1989-10-31 1991-05-28 At&T Bell Laboratories Image recognition method using two-dimensional stochastic grammars
US5220639A (en) 1989-12-01 1993-06-15 National Science Council Mandarin speech input method for Chinese computers and a mandarin speech recognition machine
US5021971A (en) 1989-12-07 1991-06-04 Unisys Corporation Reflective binary encoder for vector quantization
US5179652A (en) 1989-12-13 1993-01-12 Anthony I. Rozmanith Method and apparatus for storing, transmitting and retrieving graphical and tabular data
CH681573A5 (en) 1990-02-13 1993-04-15 Astral Automatic teller arrangement involving bank computers - is operated by user data card carrying personal data, account information and transaction records
EP0443548B1 (en) 1990-02-22 2003-07-23 Nec Corporation Speech coder
US5301109A (en) 1990-06-11 1994-04-05 Bell Communications Research, Inc. Computerized cross-language document retrieval using latent semantic indexing
JP3266246B2 (en) 1990-06-15 2002-03-18 インターナシヨナル・ビジネス・マシーンズ・コーポレーシヨン Natural language analysis apparatus and method, and knowledge base construction method for natural language analysis
US5202952A (en) 1990-06-22 1993-04-13 Dragon Systems, Inc. Large-vocabulary continuous speech prefiltering and processing system
GB9017600D0 (en) 1990-08-10 1990-09-26 British Aerospace An assembly and method for binary tree-searched vector quantisation data compression processing
US5309359A (en) 1990-08-16 1994-05-03 Boris Katz Method and apparatus for generating and utilizing annotations to facilitate computer text retrieval
US5404295A (en) 1990-08-16 1995-04-04 Katz; Boris Method and apparatus for utilizing annotations to facilitate computer retrieval of database material
US5297170A (en) 1990-08-21 1994-03-22 Codex Corporation Lattice and trellis-coded quantization
US5400434A (en) 1990-09-04 1995-03-21 Matsushita Electric Industrial Co., Ltd. Voice source for synthetic speech system
JPH0833739B2 (en) 1990-09-13 1996-03-29 三菱電機株式会社 Pattern expression model learning device
US5216747A (en) 1990-09-20 1993-06-01 Digital Voice Systems, Inc. Voiced/unvoiced estimation of an acoustic signal
US5128672A (en) 1990-10-30 1992-07-07 Apple Computer, Inc. Dynamic predictive keyboard
US5325298A (en) 1990-11-07 1994-06-28 Hnc, Inc. Methods for generating or revising context vectors for a plurality of word stems
US5317507A (en) 1990-11-07 1994-05-31 Gallant Stephen I Method for document retrieval and for word sense disambiguation using neural networks
US5247579A (en) 1990-12-05 1993-09-21 Digital Voice Systems, Inc. Methods for speech transmission
US5345536A (en) 1990-12-21 1994-09-06 Matsushita Electric Industrial Co., Ltd. Method of speech recognition
US5127053A (en) 1990-12-24 1992-06-30 General Electric Company Low-complexity method for improving the performance of autocorrelation-based pitch detectors
US5133011A (en) 1990-12-26 1992-07-21 International Business Machines Corporation Method and apparatus for linear vocal control of cursor position
US5268990A (en) 1991-01-31 1993-12-07 Sri International Method for recognizing speech using linguistically-motivated hidden Markov models
GB9105367D0 (en) 1991-03-13 1991-04-24 Univ Strathclyde Computerised information-retrieval database systems
US5303406A (en) 1991-04-29 1994-04-12 Motorola, Inc. Noise squelch circuit with adaptive noise shaping
US5500905A (en) 1991-06-12 1996-03-19 Microelectronics And Computer Technology Corporation Pattern recognition neural network with saccade-like operation
US5475587A (en) 1991-06-28 1995-12-12 Digital Equipment Corporation Method and apparatus for efficient morphological text analysis using a high-level language for compact specification of inflectional paradigms
US5293452A (en) 1991-07-01 1994-03-08 Texas Instruments Incorporated Voice log-in using spoken name input
US5687077A (en) 1991-07-31 1997-11-11 Universal Dynamics Limited Method and apparatus for adaptive control
US5199077A (en) 1991-09-19 1993-03-30 Xerox Corporation Wordspotting for voice editing and indexing
JP2662120B2 (en) 1991-10-01 1997-10-08 インターナショナル・ビジネス・マシーンズ・コーポレイション Speech recognition device and processing unit for speech recognition
US5222146A (en) 1991-10-23 1993-06-22 International Business Machines Corporation Speech recognition apparatus having a speech coder outputting acoustic prototype ranks
US5757979A (en) 1991-10-30 1998-05-26 Fuji Electric Co., Ltd. Apparatus and method for nonlinear normalization of image
KR940002854B1 (en) 1991-11-06 1994-04-04 한국전기통신공사 Sound synthesizing system
US5386494A (en) 1991-12-06 1995-01-31 Apple Computer, Inc. Method and apparatus for controlling a speech recognition function using a cursor control device
US6081750A (en) 1991-12-23 2000-06-27 Hoffberg; Steven Mark Ergonomic man-machine interface incorporating adaptive pattern recognition based control system
US5903454A (en) 1991-12-23 1999-05-11 Hoffberg; Linda Irene Human-factored interface incorporating adaptive pattern recognition based controller apparatus
US5502790A (en) 1991-12-24 1996-03-26 Oki Electric Industry Co., Ltd. Speech recognition method and system using triphones, diphones, and phonemes
US5349645A (en) 1991-12-31 1994-09-20 Matsushita Electric Industrial Co., Ltd. Word hypothesizer for continuous speech decoding using stressed-vowel centered bidirectional tree searches
US5267345A (en) 1992-02-10 1993-11-30 International Business Machines Corporation Speech recognition apparatus which predicts word classes from context and words from word classes
EP0559349B1 (en) 1992-03-02 1999-01-07 AT&T Corp. Training method and apparatus for speech recognition
US6055514A (en) 1992-03-20 2000-04-25 Wren; Stephen Corey System for marketing foods and services utilizing computerized central and remote facilities
US5317647A (en) 1992-04-07 1994-05-31 Apple Computer, Inc. Constrained attribute grammars for syntactic pattern recognition
US5412804A (en) 1992-04-30 1995-05-02 Oracle Corporation Extending the semantics of the outer join operator for un-nesting queries to a data base
JPH07506908A (en) 1992-05-20 1995-07-27 インダストリアル リサーチ リミテッド Wideband reverberation support system
US5293584A (en) 1992-05-21 1994-03-08 International Business Machines Corporation Speech recognition system for natural language translation
US5434777A (en) 1992-05-27 1995-07-18 Apple Computer, Inc. Method and apparatus for processing natural language
US5390281A (en) 1992-05-27 1995-02-14 Apple Computer, Inc. Method and apparatus for deducing user intent and providing computer implemented services
US5734789A (en) 1992-06-01 1998-03-31 Hughes Electronics Voiced, unvoiced or noise modes in a CELP vocoder
JPH064093A (en) 1992-06-18 1994-01-14 Matsushita Electric Ind Co Ltd Hmm generating device, hmm storage device, likelihood calculating device, and recognizing device
US5333275A (en) 1992-06-23 1994-07-26 Wheatley Barbara J System and method for time aligning speech
US5325297A (en) 1992-06-25 1994-06-28 System Of Multiple-Colored Images For Internationally Listed Estates, Inc. Computer implemented method and system for storing and retrieving textual data and compressed image data
JPH0619965A (en) 1992-07-01 1994-01-28 Canon Inc Natural language processor
US5999908A (en) 1992-08-06 1999-12-07 Abelow; Daniel H. Customer-based product design module
GB9220404D0 (en) 1992-08-20 1992-11-11 Nat Security Agency Method of identifying,retrieving and sorting documents
US5412806A (en) 1992-08-20 1995-05-02 Hewlett-Packard Company Calibration of logical cost formulae for queries in a heterogeneous DBMS using synthetic database
US5333236A (en) 1992-09-10 1994-07-26 International Business Machines Corporation Speech recognizer having a speech coder for an acoustic match based on context-dependent speech-transition acoustic models
US5384893A (en) 1992-09-23 1995-01-24 Emerson & Stern Associates, Inc. Method and apparatus for speech synthesis based on prosodic analysis
FR2696036B1 (en) 1992-09-24 1994-10-14 France Telecom Method of measuring resemblance between sound samples and device for implementing this method.
JPH0772840B2 (en) 1992-09-29 1995-08-02 日本アイ・ビー・エム株式会社 Speech model configuration method, speech recognition method, speech recognition device, and speech model training method
US5758313A (en) 1992-10-16 1998-05-26 Mobile Information Systems, Inc. Method and apparatus for tracking vehicle location
US6092043A (en) 1992-11-13 2000-07-18 Dragon Systems, Inc. Apparatuses and method for training and operating speech recognition systems
US5909666A (en) 1992-11-13 1999-06-01 Dragon Systems, Inc. Speech recognition system which creates acoustic models by concatenating acoustic models of individual words
DE69327774T2 (en) 1992-11-18 2000-06-21 Canon Information Syst Inc Processor for converting data into speech and sequence control for this
US5455888A (en) 1992-12-04 1995-10-03 Northern Telecom Limited Speech bandwidth extension method and apparatus
US5412756A (en) 1992-12-22 1995-05-02 Mitsubishi Denki Kabushiki Kaisha Artificial intelligence software shell for plant operation simulation
US5390279A (en) 1992-12-31 1995-02-14 Apple Computer, Inc. Partitioning speech rules by context for speech recognition
US5613036A (en) 1992-12-31 1997-03-18 Apple Computer, Inc. Dynamic categories for a speech recognition system
US5734791A (en) 1992-12-31 1998-03-31 Apple Computer, Inc. Rapid tree-based method for vector quantization
US5384892A (en) 1992-12-31 1995-01-24 Apple Computer, Inc. Dynamic language model for speech recognition
US6122616A (en) 1993-01-21 2000-09-19 Apple Computer, Inc. Method and apparatus for diphone aliasing
US5864844A (en) 1993-02-18 1999-01-26 Apple Computer, Inc. System and method for enhancing a user interface with a computer based training tool
CA2091658A1 (en) 1993-03-15 1994-09-16 Matthew Lennig Method and apparatus for automation of directory assistance using speech recognition
US6055531A (en) 1993-03-24 2000-04-25 Engate Incorporated Down-line transcription system having context sensitive searching capability
US5536902A (en) 1993-04-14 1996-07-16 Yamaha Corporation Method of and apparatus for analyzing and synthesizing a sound by extracting and controlling a sound parameter
US5444823A (en) 1993-04-16 1995-08-22 Compaq Computer Corporation Intelligent search engine for associated on-line documentation having questionless case-based knowledge base
US5574823A (en) 1993-06-23 1996-11-12 Her Majesty The Queen In Right Of Canada As Represented By The Minister Of Communications Frequency selective harmonic coding
JPH0756933A (en) 1993-06-24 1995-03-03 Xerox Corp Method for retrieval of document
US5515475A (en) 1993-06-24 1996-05-07 Northern Telecom Limited Speech recognition method using a two-pass search
JP3685812B2 (en) 1993-06-29 2005-08-24 ソニー株式会社 Audio signal transmitter / receiver
US5794207A (en) 1996-09-04 1998-08-11 Walker Asset Management Limited Partnership Method and apparatus for a cryptographically assisted commercial network system designed to facilitate buyer-driven conditional purchase offers
AU7323694A (en) 1993-07-07 1995-02-06 Inference Corporation Case-based organizing and querying of a database
US5495604A (en) 1993-08-25 1996-02-27 Asymetrix Corporation Method and apparatus for the modeling and query of database structures using natural language-like constructs
US5619694A (en) 1993-08-26 1997-04-08 Nec Corporation Case database storage/retrieval system
US5940811A (en) 1993-08-27 1999-08-17 Affinity Technology Group, Inc. Closed loop financial transaction method and apparatus
US5377258A (en) 1993-08-30 1994-12-27 National Medical Research Council Method and apparatus for an automated and interactive behavioral guidance system
US5873056A (en) 1993-10-12 1999-02-16 The Syracuse University Natural language processing system for semantic vector representation which accounts for lexical ambiguity
JP2986345B2 (en) * 1993-10-18 1999-12-06 インターナショナル・ビジネス・マシーンズ・コーポレイション Voice recording indexing apparatus and method
US5578808A (en) 1993-12-22 1996-11-26 Datamark Services, Inc. Data card that can be used for transactions involving separate card issuers
WO1995017711A1 (en) 1993-12-23 1995-06-29 Diacom Technologies, Inc. Method and apparatus for implementing user feedback
US5621859A (en) 1994-01-19 1997-04-15 Bbn Corporation Single tree method for grammar directed, very large vocabulary speech recognizer
US5584024A (en) 1994-03-24 1996-12-10 Software Ag Interactive database query system and method for prohibiting the selection of semantically incorrect query parameters
US5642519A (en) 1994-04-29 1997-06-24 Sun Microsystems, Inc. Speech interpreter with a unified grammar compiler
EP0684607B1 (en) 1994-05-25 2001-03-14 Victor Company Of Japan, Limited Variable transfer rate data reproduction apparatus
US5493677A (en) 1994-06-08 1996-02-20 Systems Research & Applications Corporation Generation, archiving, and retrieval of digital images with evoked suggestion-set captions and natural language interface
US5812697A (en) 1994-06-10 1998-09-22 Nippon Steel Corporation Method and apparatus for recognizing hand-written characters using a weighting dictionary
US5675819A (en) 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
JPH0869470A (en) 1994-06-21 1996-03-12 Canon Inc Natural language processing device and method
US5948040A (en) 1994-06-24 1999-09-07 Delorme Publishing Co. Travel reservation information and planning system
WO1996001453A1 (en) 1994-07-01 1996-01-18 Palm Computing, Inc. Multiple pen stroke character set and handwriting recognition system
US5682539A (en) 1994-09-29 1997-10-28 Conrad; Donovan Anticipated meaning natural language interface
GB2293667B (en) 1994-09-30 1998-05-27 Intermation Limited Database management system
US5715468A (en) 1994-09-30 1998-02-03 Budzinski; Robert Lucius Memory system for storing and retrieving experience and knowledge with natural language
US5845255A (en) 1994-10-28 1998-12-01 Advanced Health Med-E-Systems Corporation Prescription management system
US5577241A (en) 1994-12-07 1996-11-19 Excite, Inc. Information retrieval system and method with implementation extensible query architecture
US5748974A (en) 1994-12-13 1998-05-05 International Business Machines Corporation Multimodal natural language interface for cross-application tasks
US5794050A (en) 1995-01-04 1998-08-11 Intelligent Text Processing, Inc. Natural language understanding system
CN1183841A (en) 1995-02-13 1998-06-03 英特特拉斯特技术公司 System and method for secure transaction management and electronic rights protection
US5701400A (en) 1995-03-08 1997-12-23 Amado; Carlos Armando Method and apparatus for applying if-then-else rules to data sets in a relational data base and generating from the results of application of said rules a database of diagnostics linked to said data sets to aid executive analysis of financial data
US5749081A (en) 1995-04-06 1998-05-05 Firefly Network, Inc. System and method for recommending items to a user
US5642464A (en) 1995-05-03 1997-06-24 Northern Telecom Limited Methods and apparatus for noise conditioning in digital speech compression systems using linear predictive coding
US5812698A (en) 1995-05-12 1998-09-22 Synaptics, Inc. Handwriting recognition system and method
TW338815B (en) 1995-06-05 1998-08-21 Motorola Inc Method and apparatus for character recognition of handwritten input
US5664055A (en) 1995-06-07 1997-09-02 Lucent Technologies Inc. CS-ACELP speech compression system with adaptive pitch prediction filter gain based on a measure of periodicity
US6496182B1 (en) 1995-06-07 2002-12-17 Microsoft Corporation Method and system for providing touch-sensitive screens for the visually impaired
US5991441A (en) 1995-06-07 1999-11-23 Wang Laboratories, Inc. Real time handwriting recognition system
US5710886A (en) 1995-06-16 1998-01-20 Sellectsoft, L.C. Electric couponing method and apparatus
JP3284832B2 (en) 1995-06-22 2002-05-20 セイコーエプソン株式会社 Speech recognition dialogue processing method and speech recognition dialogue device
US6038533A (en) 1995-07-07 2000-03-14 Lucent Technologies Inc. System and method for selecting training text
US6026388A (en) 1995-08-16 2000-02-15 Textwise, Llc User interface and other enhancements for natural language information retrieval system and method
JP3697748B2 (en) 1995-08-21 2005-09-21 セイコーエプソン株式会社 Terminal, voice recognition device
US5712957A (en) 1995-09-08 1998-01-27 Carnegie Mellon University Locating and correcting erroneously recognized portions of utterances by rescoring based on two n-best lists
US5737734A (en) 1995-09-15 1998-04-07 Infonautics Corporation Query word relevance adjustment in a search of an information retrieval system
US5790978A (en) 1995-09-15 1998-08-04 Lucent Technologies, Inc. System and method for determining pitch contours
US6173261B1 (en) 1998-09-30 2001-01-09 At&T Corp Grammar fragment acquisition using syntactic and semantic clustering
US5884323A (en) 1995-10-13 1999-03-16 3Com Corporation Extendible method and apparatus for synchronizing files on two different computer systems
US5799276A (en) 1995-11-07 1998-08-25 Accent Incorporated Knowledge-based speech recognition system and methods having frame length computed based upon estimated pitch period of vocalic intervals
US6064959A (en) 1997-03-28 2000-05-16 Dragon Systems, Inc. Error correction in speech recognition
US5794237A (en) 1995-11-13 1998-08-11 International Business Machines Corporation System and method for improving problem source identification in computer systems employing relevance feedback and statistical source ranking
US5706442A (en) 1995-12-20 1998-01-06 Block Financial Corporation System for on-line financial services using distributed objects
CA2242874A1 (en) 1996-01-17 1997-07-24 Personal Agents, Inc. Intelligent agents for electronic commerce
US6119101A (en) 1996-01-17 2000-09-12 Personal Agents, Inc. Intelligent agents for electronic commerce
US6125356A (en) 1996-01-18 2000-09-26 Rosefaire Development, Ltd. Portable sales presentation system with selective scripted seller prompts
US5987404A (en) 1996-01-29 1999-11-16 International Business Machines Corporation Statistical natural language understanding using hidden clumpings
US5729694A (en) 1996-02-06 1998-03-17 The Regents Of The University Of California Speech coding, reconstruction and recognition using acoustics and electromagnetic waves
US6076088A (en) 1996-02-09 2000-06-13 Paik; Woojin Information extraction system and method using concept relation concept (CRC) triples
US5835893A (en) 1996-02-15 1998-11-10 Atr Interpreting Telecommunications Research Labs Class-based word clustering for speech recognition using a three-level balanced hierarchical similarity
US5901287A (en) 1996-04-01 1999-05-04 The Sabre Group Inc. Information aggregation and synthesization system
US5867799A (en) 1996-04-04 1999-02-02 Lang; Andrew K. Information system and method for filtering a massive flow of information entities to meet user information classification needs
US5963924A (en) 1996-04-26 1999-10-05 Verifone, Inc. System, method and article of manufacture for the use of payment instrument holders and payment instruments in network electronic commerce
US5987140A (en) 1996-04-26 1999-11-16 Verifone, Inc. System, method and article of manufacture for secure network electronic payment and credit collection
US5913193A (en) 1996-04-30 1999-06-15 Microsoft Corporation Method and system of runtime acoustic unit selection for speech synthesis
US5857184A (en) 1996-05-03 1999-01-05 Walden Media, Inc. Language and method for creating, organizing, and retrieving data from a database
US5828999A (en) 1996-05-06 1998-10-27 Apple Computer, Inc. Method and system for deriving a large-span semantic language model for large-vocabulary recognition systems
FR2748342B1 (en) 1996-05-06 1998-07-17 France Telecom METHOD AND DEVICE FOR FILTERING A SPEECH SIGNAL BY EQUALIZATION, USING A STATISTICAL MODEL OF THIS SIGNAL
US5826261A (en) 1996-05-10 1998-10-20 Spencer; Graham System and method for querying multiple, distributed databases by selective sharing of local relative significance information for terms related to the query
US6366883B1 (en) 1996-05-15 2002-04-02 Atr Interpreting Telecommunications Concatenation of speech segments by use of a speech synthesizer
US5727950A (en) 1996-05-22 1998-03-17 Netsage Corporation Agent based instruction system and method
US6556712B1 (en) 1996-05-23 2003-04-29 Apple Computer, Inc. Methods and apparatus for handwriting recognition
US5966533A (en) 1996-06-11 1999-10-12 Excite, Inc. Method and system for dynamically synthesizing a computer program by differentially resolving atoms based on user context data
US5915249A (en) 1996-06-14 1999-06-22 Excite, Inc. System and method for accelerated query evaluation of very large full-text databases
US5987132A (en) 1996-06-17 1999-11-16 Verifone, Inc. System, method and article of manufacture for conditionally accepting a payment method utilizing an extensible, flexible architecture
US5825881A (en) 1996-06-28 1998-10-20 Allsoft Distributing Inc. Public network merchandising system
US6070147A (en) 1996-07-02 2000-05-30 Tecmark Services, Inc. Customer identification and marketing analysis systems
CN100371914C (en) 1996-07-22 2008-02-27 Cyva研究公司 Tool for safety and exchanging personal information
US6453281B1 (en) 1996-07-30 2002-09-17 Vxi Corporation Portable audio database device with icon-based graphical user-interface
EP0829811A1 (en) 1996-09-11 1998-03-18 Nippon Telegraph And Telephone Corporation Method and system for information retrieval
US6181935B1 (en) 1996-09-27 2001-01-30 Software.Com, Inc. Mobility extended telephone application programming interface and method of use
US5794182A (en) 1996-09-30 1998-08-11 Apple Computer, Inc. Linear predictive speech encoding systems with efficient combination pitch coefficients computation
US6199076B1 (en) * 1996-10-02 2001-03-06 James Logan Audio program player including a dynamic program selection controller
US5721827A (en) 1996-10-02 1998-02-24 James Logan System for electrically distributing personalized information
US5732216A (en) 1996-10-02 1998-03-24 Internet Angles, Inc. Audio message exchange system
US5913203A (en) 1996-10-03 1999-06-15 Jaesent Inc. System and method for pseudo cash transactions
US5930769A (en) 1996-10-07 1999-07-27 Rose; Andrea System and method for fashion shopping
US5836771A (en) 1996-12-02 1998-11-17 Ho; Chi Fai Learning method and system based on questioning
US6282511B1 (en) 1996-12-04 2001-08-28 At&T Voiced interface with hyperlinked information
US6665639B2 (en) 1996-12-06 2003-12-16 Sensory, Inc. Speech recognition in consumer electronic products
US6078914A (en) 1996-12-09 2000-06-20 Open Text Corporation Natural language meta-search system and method
US5839106A (en) 1996-12-17 1998-11-17 Apple Computer, Inc. Large-vocabulary speech recognition using an integrated syntactic and semantic statistical language model
US5966126A (en) 1996-12-23 1999-10-12 Szabo; Andrew J. Graphic user interface for database system
US5932869A (en) 1996-12-27 1999-08-03 Graphic Technology, Inc. Promotional system with magnetic stripe and visual thermo-reversible print surfaced medium
JP3579204B2 (en) 1997-01-17 2004-10-20 富士通株式会社 Document summarizing apparatus and method
US5941944A (en) 1997-03-03 1999-08-24 Microsoft Corporation Method for providing a substitute for a requested inaccessible object by identifying substantially similar objects using weights corresponding to object features
US5930801A (en) 1997-03-07 1999-07-27 Xerox Corporation Shared-data environment in which each file has independent security properties
US6076051A (en) 1997-03-07 2000-06-13 Microsoft Corporation Information retrieval utilizing semantic representation of text
US6260013B1 (en) 1997-03-14 2001-07-10 Lernout & Hauspie Speech Products N.V. Speech recognition system employing discriminatively trained models
WO1998041956A1 (en) 1997-03-20 1998-09-24 Schlumberger Technologies, Inc. System and method of transactional taxation using secure stored data devices
US5822743A (en) 1997-04-08 1998-10-13 1215627 Ontario Inc. Knowledge-based information retrieval system
US5970474A (en) 1997-04-24 1999-10-19 Sears, Roebuck And Co. Registry information system for shoppers
US5895464A (en) 1997-04-30 1999-04-20 Eastman Kodak Company Computer program product and a method for using natural language for the description, search and retrieval of multi-media objects
DE69816185T2 (en) 1997-06-12 2004-04-15 Hewlett-Packard Co. (N.D.Ges.D.Staates Delaware), Palo Alto Image processing method and device
US6017219A (en) 1997-06-18 2000-01-25 International Business Machines Corporation System and method for interactive reading and language instruction
WO1999001834A1 (en) 1997-07-02 1999-01-14 Coueignoux, Philippe, J., M. System and method for the secure discovery, exploitation and publication of information
US5860063A (en) 1997-07-11 1999-01-12 At&T Corp Automated meaningful phrase clustering
US5933822A (en) 1997-07-22 1999-08-03 Microsoft Corporation Apparatus and methods for an information retrieval system that employs natural language processing of search results to improve overall precision
US5974146A (en) 1997-07-30 1999-10-26 Huntington Bancshares Incorporated Real time bank-centric universal payment system
US6016476A (en) 1997-08-11 2000-01-18 International Business Machines Corporation Portable information and transaction processing system and method utilizing biometric authorization and digital certificate security
US5895466A (en) 1997-08-19 1999-04-20 At&T Corp Automated natural language understanding customer service system
US6081774A (en) 1997-08-22 2000-06-27 Novell, Inc. Natural language information retrieval system and method
US6404876B1 (en) 1997-09-25 2002-06-11 Gte Intelligent Network Services Incorporated System and method for voice activated dialing and routing under open access network control
US6023684A (en) 1997-10-01 2000-02-08 Security First Technologies, Inc. Three tier financial transaction system with cache memory
DE69712485T2 (en) 1997-10-23 2002-12-12 Sony Int Europe Gmbh Voice interface for a home network
US6108627A (en) 1997-10-31 2000-08-22 Nortel Networks Corporation Automatic transcription tool
US5943670A (en) 1997-11-21 1999-08-24 International Business Machines Corporation System and method for categorizing objects in combined categories
US5960422A (en) 1997-11-26 1999-09-28 International Business Machines Corporation System and method for optimized source selection in an information retrieval system
US6026375A (en) 1997-12-05 2000-02-15 Nortel Networks Corporation Method and apparatus for processing orders from customers in a mobile environment
US6064960A (en) 1997-12-18 2000-05-16 Apple Computer, Inc. Method and apparatus for improved duration modeling of phonemes
US6094649A (en) 1997-12-22 2000-07-25 Partnet, Inc. Keyword searches of structured databases
US6173287B1 (en) 1998-03-11 2001-01-09 Digital Equipment Corporation Technique for ranking multimedia annotations of interest
US6195641B1 (en) 1998-03-27 2001-02-27 International Business Machines Corp. Network universal spoken language vocabulary
US6026393A (en) 1998-03-31 2000-02-15 Casebank Technologies Inc. Configuration knowledge as an aid to case retrieval
US6233559B1 (en) 1998-04-01 2001-05-15 Motorola, Inc. Speech control of multiple applications using applets
US6173279B1 (en) 1998-04-09 2001-01-09 At&T Corp. Method of using a natural language interface to retrieve information from one or more data resources
US6088731A (en) 1998-04-24 2000-07-11 Associative Computing, Inc. Intelligent assistant for use with a local computer and with the internet
EP1076865B1 (en) 1998-04-27 2002-12-18 BRITISH TELECOMMUNICATIONS public limited company Database access tool
US6081780A (en) * 1998-04-28 2000-06-27 International Business Machines Corporation TTS and prosody based authoring system
US6029132A (en) 1998-04-30 2000-02-22 Matsushita Electric Industrial Co. Method for letter-to-sound in text-to-speech synthesis
US6016471A (en) 1998-04-29 2000-01-18 Matsushita Electric Industrial Co., Ltd. Method and apparatus using decision trees to generate and score multiple pronunciations for a spelled word
US6285786B1 (en) 1998-04-30 2001-09-04 Motorola, Inc. Text recognizer and method using non-cumulative character scoring in a forward search
US6144938A (en) 1998-05-01 2000-11-07 Sun Microsystems, Inc. Voice user interface with personality
US7711672B2 (en) 1998-05-28 2010-05-04 Lawrence Au Semantic network methods to disambiguate natural language meaning
US6778970B2 (en) 1998-05-28 2004-08-17 Lawrence Au Topological methods to organize semantic network data flows for conversational applications
US20070094222A1 (en) 1998-05-28 2007-04-26 Lawrence Au Method and system for using voice input for performing network functions
US6144958A (en) 1998-07-15 2000-11-07 Amazon.Com, Inc. System and method for correcting spelling errors in search queries
US6105865A (en) 1998-07-17 2000-08-22 Hardesty; Laurence Daniel Financial transaction system with retirement saving benefit
US6499013B1 (en) 1998-09-09 2002-12-24 One Voice Technologies, Inc. Interactive user interface using speech recognition and natural language processing
US6434524B1 (en) 1998-09-09 2002-08-13 One Voice Technologies, Inc. Object interactive user interface using speech recognition and natural language processing
US6266637B1 (en) 1998-09-11 2001-07-24 International Business Machines Corporation Phrase splicing and variable substitution using a trainable speech synthesizer
DE29825146U1 (en) 1998-09-11 2005-08-18 Püllen, Rainer Audio on demand system
US6792082B1 (en) 1998-09-11 2004-09-14 Comverse Ltd. Voice mail system with personal assistant provisioning
US6317831B1 (en) 1998-09-21 2001-11-13 Openwave Systems Inc. Method and apparatus for establishing a secure connection over a one-way data path
US7137126B1 (en) 1998-10-02 2006-11-14 International Business Machines Corporation Conversational computing via conversational virtual machine
US6275824B1 (en) 1998-10-02 2001-08-14 Ncr Corporation System and method for managing data privacy in a database management system
GB9821969D0 (en) 1998-10-08 1998-12-02 Canon Kk Apparatus and method for processing natural language
US6928614B1 (en) 1998-10-13 2005-08-09 Visteon Global Technologies, Inc. Mobile office with speech recognition
US6453292B2 (en) 1998-10-28 2002-09-17 International Business Machines Corporation Command boundary identifier for conversational natural language
US6208971B1 (en) 1998-10-30 2001-03-27 Apple Computer, Inc. Method and apparatus for command recognition using data-driven semantic inference
US6321092B1 (en) 1998-11-03 2001-11-20 Signal Soft Corporation Multiple input data management for wireless location-based applications
US6839669B1 (en) 1998-11-05 2005-01-04 Scansoft, Inc. Performing actions identified in recognized speech
US6519565B1 (en) 1998-11-10 2003-02-11 Voice Security Systems, Inc. Method of comparing utterances for security control
US6446076B1 (en) 1998-11-12 2002-09-03 Accenture Llp. Voice interactive web-based agent system responsive to a user location for prioritizing and formatting information
US6606599B2 (en) 1998-12-23 2003-08-12 Interactive Speech Technologies, Llc Method for integrating computing processes with an interface controlled by voice actuated grammars
WO2000030069A2 (en) 1998-11-13 2000-05-25 Lernout & Hauspie Speech Products N.V. Speech synthesis using concatenation of speech waveforms
US6246981B1 (en) 1998-11-25 2001-06-12 International Business Machines Corporation Natural language task-oriented dialog manager and method
US7082397B2 (en) 1998-12-01 2006-07-25 Nuance Communications, Inc. System for and method of creating and browsing a voice web
US6260024B1 (en) 1998-12-02 2001-07-10 Gary Shkedy Method and apparatus for facilitating buyer-driven purchase orders on a commercial network system
US7881936B2 (en) 1998-12-04 2011-02-01 Tegic Communications, Inc. Multimodal disambiguation of speech recognition
US6317707B1 (en) 1998-12-07 2001-11-13 At&T Corp. Automatic clustering of tokens from a corpus for grammar acquisition
US6177905B1 (en) 1998-12-08 2001-01-23 Avaya Technology Corp. Location-triggered reminder for mobile user devices
US6308149B1 (en) 1998-12-16 2001-10-23 Xerox Corporation Grouping words with equivalent substrings by automatic clustering based on suffix relationships
US6523172B1 (en) 1998-12-17 2003-02-18 Evolutionary Technologies International, Inc. Parser translator system and method
US6460029B1 (en) 1998-12-23 2002-10-01 Microsoft Corporation System for improving search text
US6513063B1 (en) 1999-01-05 2003-01-28 Sri International Accessing network-based electronic information through scripted online interfaces using spoken input
US7036128B1 (en) 1999-01-05 2006-04-25 Sri International Offices Using a community of distributed electronic agents to support a highly mobile, ambient computing environment
US6523061B1 (en) 1999-01-05 2003-02-18 Sri International, Inc. System, method, and article of manufacture for agent-based navigation in a speech-based data navigation system
US6757718B1 (en) 1999-01-05 2004-06-29 Sri International Mobile navigation of network-based electronic information using spoken input
US6851115B1 (en) 1999-01-05 2005-02-01 Sri International Software-based architecture for communication and cooperation among distributed electronic agents
US6742021B1 (en) 1999-01-05 2004-05-25 Sri International, Inc. Navigating network-based electronic information using spoken input with multimodal error feedback
US7152070B1 (en) 1999-01-08 2006-12-19 The Regents Of The University Of California System and method for integrating and accessing multiple data sources within a data warehouse architecture
JP2000207167A (en) 1999-01-14 2000-07-28 Hewlett Packard Co <Hp> Method for describing language for hyper presentation, hyper presentation system, mobile computer and hyper presentation method
US6282507B1 (en) 1999-01-29 2001-08-28 Sony Corporation Method and apparatus for interactive source language expression recognition and alternative hypothesis presentation and selection
US6505183B1 (en) 1999-02-04 2003-01-07 Authoria, Inc. Human resource knowledge modeling and delivery system
US6317718B1 (en) 1999-02-26 2001-11-13 Accenture Properties (2) B.V. System, method and article of manufacture for location-based filtering for shopping agent in the physical world
GB9904662D0 (en) 1999-03-01 1999-04-21 Canon Kk Natural language search method and apparatus
US6356905B1 (en) 1999-03-05 2002-03-12 Accenture Llp System, method and article of manufacture for mobile communication utilizing an interface support framework
US6928404B1 (en) 1999-03-17 2005-08-09 International Business Machines Corporation System and methods for acoustic and language modeling for automatic speech recognition with large vocabularies
US6584464B1 (en) 1999-03-19 2003-06-24 Ask Jeeves, Inc. Grammar template query system
EP1088299A2 (en) 1999-03-26 2001-04-04 Scansoft, Inc. Client-server speech recognition
US6356854B1 (en) 1999-04-05 2002-03-12 Delphi Technologies, Inc. Holographic object position and type sensing system and method
WO2000060435A2 (en) 1999-04-07 2000-10-12 Rensselaer Polytechnic Institute System and method for accessing personal information
US6631346B1 (en) 1999-04-07 2003-10-07 Matsushita Electric Industrial Co., Ltd. Method and apparatus for natural language parsing using multiple passes and tags
US6647260B2 (en) 1999-04-09 2003-11-11 Openwave Systems Inc. Method and system facilitating web based provisioning of two-way mobile communications devices
US6924828B1 (en) 1999-04-27 2005-08-02 Surfnotes Method and apparatus for improved information representation
US6697780B1 (en) 1999-04-30 2004-02-24 At&T Corp. Method and apparatus for rapid acoustic unit selection from a large speech corpus
US20020032564A1 (en) 2000-04-19 2002-03-14 Farzad Ehsani Phrase-based dialogue modeling with particular application to creating a recognition grammar for a voice-controlled user interface
EP1224569A4 (en) 1999-05-28 2005-08-10 Sehda Inc Phrase-based dialogue modeling with particular application to creating recognition grammars for voice-controlled user interfaces
US6931384B1 (en) 1999-06-04 2005-08-16 Microsoft Corporation System and method providing utility-based decision making about clarification dialog given communicative uncertainty
US6598039B1 (en) 1999-06-08 2003-07-22 Albert-Inc. S.A. Natural language interface for searching database
US6615175B1 (en) 1999-06-10 2003-09-02 Robert F. Gazdzinski “Smart” elevator system and method
US8065155B1 (en) 1999-06-10 2011-11-22 Gazdzinski Robert F Adaptive advertising apparatus and methods
US7093693B1 (en) 1999-06-10 2006-08-22 Gazdzinski Robert F Elevator access control system and method
US7711565B1 (en) 1999-06-10 2010-05-04 Gazdzinski Robert F “Smart” elevator system and method
US6711585B1 (en) 1999-06-15 2004-03-23 Kanisa Inc. System and method for implementing a knowledge management system
US6442518B1 (en) * 1999-07-14 2002-08-27 Compaq Information Technologies Group, L.P. Method for refining time alignments of closed captions
JP3361291B2 (en) 1999-07-23 2003-01-07 コナミ株式会社 Speech synthesis method, speech synthesis device, and computer-readable medium recording speech synthesis program
US6421672B1 (en) 1999-07-27 2002-07-16 Verizon Services Corp. Apparatus for and method of disambiguation of directory listing searches utilizing multiple selectable secondary search keys
US6628808B1 (en) 1999-07-28 2003-09-30 Datacard Corporation Apparatus and method for verifying a scanned image
US6622121B1 (en) 1999-08-20 2003-09-16 International Business Machines Corporation Testing speech recognition systems using test data generated by text-to-speech conversion
EP1079387A3 (en) 1999-08-26 2003-07-09 Matsushita Electric Industrial Co., Ltd. Mechanism for storing information about recorded television broadcasts
US6697824B1 (en) 1999-08-31 2004-02-24 Accenture Llp Relationship management in an E-commerce application framework
US6912499B1 (en) 1999-08-31 2005-06-28 Nortel Networks Limited Method and apparatus for training a multilingual speech model set
US6601234B1 (en) 1999-08-31 2003-07-29 Accenture Llp Attribute dictionary in a business logic services environment
US7127403B1 (en) 1999-09-13 2006-10-24 Microstrategy, Inc. System and method for personalizing an interactive voice broadcast of a voice service based on particulars of a request
US6601026B2 (en) 1999-09-17 2003-07-29 Discern Communications, Inc. Information retrieval by natural language querying
US6625583B1 (en) 1999-10-06 2003-09-23 Goldman, Sachs & Co. Handheld trading system interface
US6505175B1 (en) 1999-10-06 2003-01-07 Goldman, Sachs & Co. Order centric tracking system
US7020685B1 (en) 1999-10-08 2006-03-28 Openwave Systems Inc. Method and apparatus for providing internet content to SMS-based wireless devices
CA2387079C (en) 1999-10-19 2011-10-18 Sony Electronics Inc. Natural language interface control system
US6807574B1 (en) 1999-10-22 2004-10-19 Tellme Networks, Inc. Method and apparatus for content personalization over a telephone interface
JP2001125896A (en) 1999-10-26 2001-05-11 Victor Co Of Japan Ltd Natural language interactive system
US7310600B1 (en) 1999-10-28 2007-12-18 Canon Kabushiki Kaisha Language recognition using a similarity measure
GB2355834A (en) 1999-10-29 2001-05-02 Nokia Mobile Phones Ltd Speech recognition
US9076448B2 (en) 1999-11-12 2015-07-07 Nuance Communications, Inc. Distributed real time speech recognition system
US7050977B1 (en) 1999-11-12 2006-05-23 Phoenix Solutions, Inc. Speech-enabled server for internet website and method
US6633846B1 (en) 1999-11-12 2003-10-14 Phoenix Solutions, Inc. Distributed realtime speech recognition system
US7725307B2 (en) 1999-11-12 2010-05-25 Phoenix Solutions, Inc. Query engine for processing voice based queries including semantic decoding
US6615172B1 (en) 1999-11-12 2003-09-02 Phoenix Solutions, Inc. Intelligent query engine for processing voice based queries
US6665640B1 (en) 1999-11-12 2003-12-16 Phoenix Solutions, Inc. Interactive speech based learning/training system formulating search queries based on natural language parsing of recognized user queries
US7392185B2 (en) 1999-11-12 2008-06-24 Phoenix Solutions, Inc. Speech based learning/training system using semantic decoding
US7412643B1 (en) * 1999-11-23 2008-08-12 International Business Machines Corporation Method and apparatus for linking representation and realization data
US6532446B1 (en) 1999-11-24 2003-03-11 Openwave Systems Inc. Server based speech recognition user interface for wireless devices
US6526382B1 (en) 1999-12-07 2003-02-25 Comverse, Inc. Language-oriented user interfaces for voice activated services
US7024363B1 (en) 1999-12-14 2006-04-04 International Business Machines Corporation Methods and apparatus for contingent transfer and execution of spoken language interfaces
US6397186B1 (en) 1999-12-22 2002-05-28 Ambush Interactive, Inc. Hands-free, voice-operated remote control transmitter
US6526395B1 (en) 1999-12-31 2003-02-25 Intel Corporation Application of personality models and interaction with synthetic characters in a computing system
US6556983B1 (en) 2000-01-12 2003-04-29 Microsoft Corporation Methods and apparatus for finding semantic information, such as usage logs, similar to a query using a pattern lattice data space
US6546388B1 (en) 2000-01-14 2003-04-08 International Business Machines Corporation Metadata search results ranking system
US6701294B1 (en) 2000-01-19 2004-03-02 Lucent Technologies, Inc. User interface for translating natural language inquiries into database queries and data presentations
US6829603B1 (en) 2000-02-02 2004-12-07 International Business Machines Corp. System, method and program product for interactive natural dialog
US6895558B1 (en) 2000-02-11 2005-05-17 Microsoft Corporation Multi-access mode electronic personal assistant
US6640098B1 (en) 2000-02-14 2003-10-28 Action Engine Corporation System for obtaining service-related information for local interactive wireless devices
AU2001243277A1 (en) 2000-02-25 2001-09-03 Synquiry Technologies, Ltd. Conceptual factoring and unification of graphs representing semantic models
US6895380B2 (en) 2000-03-02 2005-05-17 Electro Standards Laboratories Voice actuation with contextual learning for intelligent machine control
US6449620B1 (en) 2000-03-02 2002-09-10 Nimble Technology, Inc. Method and apparatus for generating information pages using semi-structured data stored in a structured manner
WO2001067225A2 (en) 2000-03-06 2001-09-13 Kanisa Inc. A system and method for providing an intelligent multi-step dialog with a user
US6466654B1 (en) 2000-03-06 2002-10-15 Avaya Technology Corp. Personal virtual assistant with semantic tagging
US6757362B1 (en) 2000-03-06 2004-06-29 Avaya Technology Corp. Personal virtual assistant
US6477488B1 (en) 2000-03-10 2002-11-05 Apple Computer, Inc. Method for dynamic context scope selection in hybrid n-gram+LSA language modeling
US6615220B1 (en) 2000-03-14 2003-09-02 Oracle International Corporation Method and mechanism for data consolidation
US6510417B1 (en) 2000-03-21 2003-01-21 America Online, Inc. System and method for voice access to internet-based information
GB2366009B (en) 2000-03-22 2004-07-21 Canon Kk Natural language machine interface
US20020035474A1 (en) 2000-07-18 2002-03-21 Ahmet Alpdemir Voice-interactive marketplace providing time and money saving benefits and real-time promotion publishing and feedback
US6934684B2 (en) 2000-03-24 2005-08-23 Dialsurf, Inc. Voice-interactive marketplace providing promotion and promotion tracking, loyalty reward and redemption, and other features
JP3728172B2 (en) 2000-03-31 2005-12-21 キヤノン株式会社 Speech synthesis method and apparatus
US7177798B2 (en) 2000-04-07 2007-02-13 Rensselaer Polytechnic Institute Natural language interface using constrained intermediate dictionary of results
US6865533B2 (en) 2000-04-21 2005-03-08 Lessac Technology Inc. Text to speech
US6810379B1 (en) 2000-04-24 2004-10-26 Sensory, Inc. Client/server architecture for text-to-speech synthesis
US7107204B1 (en) 2000-04-24 2006-09-12 Microsoft Corporation Computer-aided writing system and method with cross-language writing wizard
WO2001084535A2 (en) 2000-05-02 2001-11-08 Dragon Systems, Inc. Error correction in speech recognition
US20020010584A1 (en) 2000-05-24 2002-01-24 Schultz Mitchell Jay Interactive voice communication method and system for information and entertainment
US20020042707A1 (en) 2000-06-19 2002-04-11 Gang Zhao Grammar-packaged parsing
US6680675B1 (en) 2000-06-21 2004-01-20 Fujitsu Limited Interactive to-do list item notification system including GPS interface
US6684187B1 (en) 2000-06-30 2004-01-27 At&T Corp. Method and system for preselection of suitable units for concatenative speech
US6691111B2 (en) 2000-06-30 2004-02-10 Research In Motion Limited System and method for implementing a natural language user interface
US6505158B1 (en) 2000-07-05 2003-01-07 At&T Corp. Synthesis-based pre-selection of suitable units for concatenative speech
JP3949356B2 (en) 2000-07-12 2007-07-25 三菱電機株式会社 Spoken dialogue system
TW521266B (en) 2000-07-13 2003-02-21 Verbaltek Inc Perceptual phonetic feature speech recognition system and method
US7139709B2 (en) 2000-07-20 2006-11-21 Microsoft Corporation Middleware layer between speech related applications and engines
JP2002041276A (en) 2000-07-24 2002-02-08 Sony Corp Interactive operation-supporting system, interactive operation-supporting method and recording medium
US20060143007A1 (en) 2000-07-24 2006-06-29 Koh V E User interaction with voice information services
US7853664B1 (en) 2000-07-31 2010-12-14 Landmark Digital Services Llc Method and system for purchasing pre-recorded music
US7092928B1 (en) 2000-07-31 2006-08-15 Quantum Leap Research, Inc. Intelligent portal engine
US6778951B1 (en) 2000-08-09 2004-08-17 Concerto Software, Inc. Information retrieval method with natural language interface
US6766320B1 (en) 2000-08-24 2004-07-20 Microsoft Corporation Search engine with natural language-based robust parsing for user query and relevance feedback learning
DE10042944C2 (en) 2000-08-31 2003-03-13 Siemens Ag Grapheme-phoneme conversion
US6799098B2 (en) 2000-09-01 2004-09-28 Beltpack Corporation Remote control system for a locomotive using voice commands
WO2002023796A1 (en) 2000-09-11 2002-03-21 Sentrycom Ltd. A biometric-based system and method for enabling authentication of electronic messages sent over a network
JP3784289B2 (en) 2000-09-12 2006-06-07 松下電器産業株式会社 Media editing method and apparatus
AU2001290882A1 (en) 2000-09-15 2002-03-26 Lernout And Hauspie Speech Products N.V. Fast waveform synchronization for concatenation and time-scale modification of speech
US7216080B2 (en) 2000-09-29 2007-05-08 Mindfabric Holdings Llc Natural-language voice-activated personal assistant
US7219058B1 (en) 2000-10-13 2007-05-15 At&T Corp. System and method for processing speech recognition results
US6832194B1 (en) 2000-10-26 2004-12-14 Sensory, Incorporated Audio recognition peripheral system
US7027974B1 (en) 2000-10-27 2006-04-11 Science Applications International Corporation Ontology-based parser for natural language processing
US7006969B2 (en) 2000-11-02 2006-02-28 At&T Corp. System and method of pattern recognition in very high-dimensional space
TW518482B (en) * 2000-11-10 2003-01-21 Future Display Systems Inc Method for taking notes on an article displayed by an electronic book
JP2002169588A (en) 2000-11-16 2002-06-14 Internatl Business Mach Corp <Ibm> Text display device, text display control method, storage medium, program transmission device, and reception supporting method
US6957076B2 (en) 2000-11-22 2005-10-18 Denso Corporation Location specific reminders for wireless mobiles
US20040085162A1 (en) 2000-11-29 2004-05-06 Rajeev Agarwal Method and apparatus for providing a mixed-initiative dialog between a user and a machine
US20020067308A1 (en) 2000-12-06 2002-06-06 Xerox Corporation Location/time-based reminder for personal electronic devices
JP2004516516A (en) 2000-12-18 2004-06-03 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ How to save utterance and select vocabulary to recognize words
US20040190688A1 (en) 2003-03-31 2004-09-30 Timmins Timothy A. Communications methods and systems using voiceprints
TW490655B (en) 2000-12-27 2002-06-11 Winbond Electronics Corp Method and device for recognizing authorized users using voice spectrum information
US6937986B2 (en) 2000-12-28 2005-08-30 Comverse, Inc. Automatic dynamic speech recognition vocabulary based on external sources of information
US20020133347A1 (en) 2000-12-29 2002-09-19 Eberhard Schoneburg Method and apparatus for natural language dialog interface
AU2001255568A1 (en) 2000-12-29 2002-07-16 General Electric Company Method and system for identifying repeatedly malfunctioning equipment
US7085723B2 (en) 2001-01-12 2006-08-01 International Business Machines Corporation System and method for determining utterance context in a multi-context speech application
US7257537B2 (en) 2001-01-12 2007-08-14 International Business Machines Corporation Method and apparatus for performing dialog management in a computer conversational interface
US20020099552A1 (en) * 2001-01-25 2002-07-25 Darryl Rubin Annotating electronic information with audio clips
JP2002229955A (en) 2001-02-02 2002-08-16 Matsushita Electric Ind Co Ltd Information terminal device and authentication system
US6964023B2 (en) 2001-02-05 2005-11-08 International Business Machines Corporation System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input
US6885987B2 (en) 2001-02-09 2005-04-26 Fastmobile, Inc. Method and apparatus for encoding and decoding pause information
US6622136B2 (en) 2001-02-16 2003-09-16 Motorola, Inc. Interactive tool for semi-automatic creation of a domain model
US7171365B2 (en) 2001-02-16 2007-01-30 International Business Machines Corporation Tracking time using portable recorders and speech recognition
US7290039B1 (en) 2001-02-27 2007-10-30 Microsoft Corporation Intent based processing
GB2372864B (en) 2001-02-28 2005-09-07 Vox Generation Ltd Spoken language interface
US6721728B2 (en) 2001-03-02 2004-04-13 The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration System, method and apparatus for discovering phrases in a database
AU2002237495A1 (en) 2001-03-13 2002-09-24 Intelligate Ltd. Dynamic natural language understanding
US6677929B2 (en) 2001-03-21 2004-01-13 Agilent Technologies, Inc. Optical pseudo trackball controls the operation of an appliance or machine
JP3925611B2 (en) 2001-03-22 2007-06-06 Seiko Epson Corporation Information providing system, information providing apparatus, program, information storage medium, and user interface setting method
US7058889B2 (en) * 2001-03-23 2006-06-06 Koninklijke Philips Electronics N.V. Synchronizing text/visual information with audio playback
US6738743B2 (en) 2001-03-28 2004-05-18 Intel Corporation Unified client-server distributed architectures for spoken dialogue systems
US6996531B2 (en) 2001-03-30 2006-02-07 Comverse Ltd. Automated database assistance using a telephone for a speech based or text based multimedia communication mode
GB0110326D0 (en) 2001-04-27 2001-06-20 Ibm Method and apparatus for interoperation between legacy software and screen reader programs
US6654740B2 (en) 2001-05-08 2003-11-25 Sunflare Co., Ltd. Probabilistic information retrieval based on differential latent semantic space
DE60213595T2 (en) 2001-05-10 2007-08-09 Koninklijke Philips Electronics N.V. UNDERSTANDING SPEAKER VOTES
US7085722B2 (en) 2001-05-14 2006-08-01 Sony Computer Entertainment America Inc. System and method for menu-driven voice control of characters in a game environment
JP2002344880A (en) 2001-05-22 2002-11-29 Megafusion Corp Contents distribution system
US6944594B2 (en) 2001-05-30 2005-09-13 Bellsouth Intellectual Property Corporation Multi-context conversational environment system and method
US7020663B2 (en) * 2001-05-30 2006-03-28 George M. Hay System and method for the delivery of electronic books
US20020194003A1 (en) 2001-06-05 2002-12-19 Mozer Todd F. Client-server security system and method
US20020198714A1 (en) 2001-06-26 2002-12-26 Guojun Zhou Statistical spoken dialog system
US7139722B2 (en) 2001-06-27 2006-11-21 Bellsouth Intellectual Property Corporation Location and time sensitive wireless calendaring
US6604059B2 (en) 2001-07-10 2003-08-05 Koninklijke Philips Electronics N.V. Predictive calendar
US7987151B2 (en) 2001-08-10 2011-07-26 General Dynamics Advanced Info Systems, Inc. Apparatus and method for problem solving using intelligent agents
US6813491B1 (en) 2001-08-31 2004-11-02 Openwave Systems Inc. Method and apparatus for adapting settings of wireless communication devices in accordance with user proximity
US7313526B2 (en) 2001-09-05 2007-12-25 Voice Signal Technologies, Inc. Speech recognition using selectable recognition modes
US7953447B2 (en) 2001-09-05 2011-05-31 Vocera Communications, Inc. Voice-controlled communications system and method using a badge application
US7403938B2 (en) 2001-09-24 2008-07-22 Iac Search & Media, Inc. Natural language query processing
US6985865B1 (en) 2001-09-26 2006-01-10 Sprint Spectrum L.P. Method and system for enhanced response to voice commands in a voice command platform
US20050196732A1 (en) 2001-09-26 2005-09-08 Scientific Learning Corporation Method and apparatus for automated training of language learning skills
US6650735B2 (en) 2001-09-27 2003-11-18 Microsoft Corporation Integrated voice access to a variety of personal information services
US7324947B2 (en) 2001-10-03 2008-01-29 Promptu Systems Corporation Global speech user interface
US7167832B2 (en) 2001-10-15 2007-01-23 At&T Corp. Method for dialog management
GB2381409B (en) 2001-10-27 2004-04-28 Hewlett Packard Ltd Asynchronous access to synchronous voice services
NO316480B1 (en) 2001-11-15 2004-01-26 Forinnova As Method and system for textual examination and discovery
US20030101054A1 (en) 2001-11-27 2003-05-29 Ncc, Llc Integrated system and method for electronic speech recognition and transcription
JP2003163745A (en) 2001-11-28 2003-06-06 Matsushita Electric Ind Co Ltd Telephone set, interactive responder, interactive responding terminal, and interactive response system
US7483832B2 (en) 2001-12-10 2009-01-27 At&T Intellectual Property I, L.P. Method and system for customizing voice translation of text to speech
TW541517B (en) 2001-12-25 2003-07-11 Univ Nat Cheng Kung Speech recognition system
US20030167335A1 (en) 2002-03-04 2003-09-04 Vigilos, Inc. System and method for network-based communication
CN1295672C (en) 2002-03-27 2007-01-17 Nokia Corporation Pattern recognition
US7197460B1 (en) 2002-04-23 2007-03-27 At&T Corp. System for handling frequently asked questions in a natural language dialog service
US6847966B1 (en) 2002-04-24 2005-01-25 Engenium Corporation Method and system for optimally searching a document database using a representative semantic space
US7546382B2 (en) 2002-05-28 2009-06-09 International Business Machines Corporation Methods and systems for authoring of mixed-initiative multi-modal interactions and related browsing mechanisms
WO2003102919A1 (en) 2002-05-31 2003-12-11 Onkyo Corporation Network type content reproduction system
US7398209B2 (en) 2002-06-03 2008-07-08 Voicebox Technologies, Inc. Systems and methods for responding to natural language speech utterance
US20030233230A1 (en) 2002-06-12 2003-12-18 Lucent Technologies Inc. System and method for representing and resolving ambiguity in spoken dialogue systems
US6999066B2 (en) 2002-06-24 2006-02-14 Xerox Corporation System for audible feedback for touch screen displays
WO2004002144A1 (en) 2002-06-24 2003-12-31 Matsushita Electric Industrial Co., Ltd. Metadata preparing device, preparing method therefor and retrieving device
US7299033B2 (en) 2002-06-28 2007-11-20 Openwave Systems Inc. Domain-based management of distribution of digital content from multiple suppliers to multiple wireless services subscribers
US7233790B2 (en) 2002-06-28 2007-06-19 Openwave Systems, Inc. Device capability based discovery, packaging and provisioning of content for wireless mobile devices
US7693720B2 (en) 2002-07-15 2010-04-06 Voicebox Technologies, Inc. Mobile systems and methods for responding to natural language speech utterance
US7467087B1 (en) 2002-10-10 2008-12-16 Gillick Laurence S Training and using pronunciation guessers in speech recognition
JP2004152063A (en) 2002-10-31 2004-05-27 Nec Corp Structuring method, structuring device and structuring program of multimedia contents, and providing method thereof
US7783486B2 (en) 2002-11-22 2010-08-24 Roy Jonathan Rosser Response generator for mimicking human-computer natural language conversation
WO2004053836A1 (en) 2002-12-10 2004-06-24 Kirusa, Inc. Techniques for disambiguating speech input using multimodal interfaces
US7386449B2 (en) 2002-12-11 2008-06-10 Voice Enabling Systems Technology Inc. Knowledge-based flexible natural speech dialogue system
US7956766B2 (en) 2003-01-06 2011-06-07 Panasonic Corporation Apparatus operating system
US20040152055A1 (en) * 2003-01-30 2004-08-05 Gliessner Michael J.G. Video based language learning system
US7529671B2 (en) 2003-03-04 2009-05-05 Microsoft Corporation Block synchronous decoding
US6980949B2 (en) 2003-03-14 2005-12-27 Sonum Technologies, Inc. Natural language processor
US20040186714A1 (en) 2003-03-18 2004-09-23 Aurilab, Llc Speech recognition improvement through post-processing
US20060217967A1 (en) 2003-03-20 2006-09-28 Doug Goertzen System and methods for storing and presenting personal information
US7496498B2 (en) 2003-03-24 2009-02-24 Microsoft Corporation Front-end architecture for a multi-lingual text-to-speech system
US20040220798A1 (en) 2003-05-01 2004-11-04 Visteon Global Technologies, Inc. Remote voice identification system
US7421393B1 (en) 2004-03-01 2008-09-02 At&T Corp. System for developing a dialog manager using modular spoken-dialog components
US20050045373A1 (en) 2003-05-27 2005-03-03 Joseph Born Portable media device with audio prompt menu
US7200559B2 (en) 2003-05-29 2007-04-03 Microsoft Corporation Semantic object synchronous understanding implemented with speech application language tags
US7720683B1 (en) 2003-06-13 2010-05-18 Sensory, Inc. Method and apparatus of specifying and performing speech recognition operations
US7580551B1 (en) 2003-06-30 2009-08-25 The Research Foundation Of State University Of Ny Method and apparatus for analyzing and/or comparing handwritten and/or biometric samples
US20050015772A1 (en) 2003-07-16 2005-01-20 Saare John E. Method and system for device specific application optimization via a portal server
JP4551635B2 (en) 2003-07-31 2010-09-29 Sony Corporation Pipeline processing system and information processing apparatus
JP2005070645A (en) 2003-08-27 2005-03-17 Casio Comput Co Ltd Text and voice synchronizing device and text and voice synchronization processing program
US7475010B2 (en) 2003-09-03 2009-01-06 Lingospot, Inc. Adaptive and scalable method for resolving natural language ambiguities
US7418392B1 (en) 2003-09-25 2008-08-26 Sensory, Inc. System and method for controlling the operation of a device by voice commands
US7460652B2 (en) 2003-09-26 2008-12-02 At&T Intellectual Property I, L.P. VoiceXML and rule engine based switchboard for interactive voice response (IVR) services
US7155706B2 (en) 2003-10-24 2006-12-26 Microsoft Corporation Administrative tool environment
US7292726B2 (en) 2003-11-10 2007-11-06 Microsoft Corporation Recognition of electronic ink with late strokes
US7302099B2 (en) 2003-11-10 2007-11-27 Microsoft Corporation Stroke segmentation for template-based cursive handwriting recognition
US7584092B2 (en) 2004-11-15 2009-09-01 Microsoft Corporation Unsupervised learning of paraphrase/translation alternations and selective application thereof
US7412385B2 (en) 2003-11-12 2008-08-12 Microsoft Corporation System for identifying paraphrases using machine translation
US20050108074A1 (en) 2003-11-14 2005-05-19 Bloechl Peter E. Method and system for prioritization of task items
US7447630B2 (en) 2003-11-26 2008-11-04 Microsoft Corporation Method and apparatus for multi-sensory speech enhancement
DE602004016681D1 (en) 2003-12-05 2008-10-30 Kenwood Corp AUDIO DEVICE CONTROL DEVICE, AUDIO DEVICE CONTROL METHOD AND PROGRAM
ES2312851T3 (en) 2003-12-16 2009-03-01 Loquendo Spa VOICE-TEXT METHOD AND SYSTEM AND THE ASSOCIATED COMPUTER PROGRAM.
US7427024B1 (en) 2003-12-17 2008-09-23 Gazdzinski Mark J Chattel management apparatus and methods
JP2005189454A (en) 2003-12-25 2005-07-14 Casio Comput Co Ltd Text synchronous speech reproduction controller and program
US7552055B2 (en) 2004-01-10 2009-06-23 Microsoft Corporation Dialog component re-use in recognition systems
US7298904B2 (en) 2004-01-14 2007-11-20 International Business Machines Corporation Method and apparatus for scaling handwritten character input for handwriting recognition
WO2005071663A2 (en) 2004-01-16 2005-08-04 Scansoft, Inc. Corpus-based speech synthesis based on segment recombination
US20050165607A1 (en) 2004-01-22 2005-07-28 At&T Corp. System and method to disambiguate and clarify user intention in a spoken dialog system
EP1560200B8 (en) 2004-01-29 2009-08-05 Harman Becker Automotive Systems GmbH Method and system for spoken dialogue interface
KR100612839B1 (en) 2004-02-18 2006-08-18 Samsung Electronics Co., Ltd. Method and apparatus for domain-based dialog speech recognition
KR100462292B1 (en) 2004-02-26 2004-12-17 NHN Corp. A method for providing search results list based on importance information and a system thereof
US7505906B2 (en) 2004-02-26 2009-03-17 At&T Intellectual Property, Ii System and method for augmenting spoken language understanding by correcting common errors in linguistic performance
US7693715B2 (en) 2004-03-10 2010-04-06 Microsoft Corporation Generating large units of graphonemes with mutual information criterion for letter to sound conversion
US7478033B2 (en) 2004-03-16 2009-01-13 Google Inc. Systems and methods for translating Chinese pinyin to Chinese characters
US7084758B1 (en) 2004-03-19 2006-08-01 Advanced Micro Devices, Inc. Location-based reminders
US7409337B1 (en) 2004-03-30 2008-08-05 Microsoft Corporation Natural language processing interface
US7496512B2 (en) 2004-04-13 2009-02-24 Microsoft Corporation Refining of segmental boundaries in speech waveforms using contextual-dependent models
US20050273626A1 (en) 2004-06-02 2005-12-08 Steven Pearson System and method for portable authentication
US8095364B2 (en) 2004-06-02 2012-01-10 Tegic Communications, Inc. Multimodal disambiguation of speech recognition
US20050289463A1 (en) 2004-06-23 2005-12-29 Google Inc., A Delaware Corporation Systems and methods for spell correction of non-roman characters and words
US7720674B2 (en) 2004-06-29 2010-05-18 Sap Ag Systems and methods for processing natural language queries
US7228278B2 (en) 2004-07-06 2007-06-05 Voxify, Inc. Multi-slot dialog systems and methods
JP2006023860A (en) 2004-07-06 2006-01-26 Sharp Corp Information browser, information browsing program, information browsing program recording medium, and information browsing system
JP4652737B2 (en) 2004-07-14 2011-03-16 International Business Machines Corporation Word boundary probability estimation device and method, probabilistic language model construction device and method, kana-kanji conversion device and method, and unknown word model construction method
TWI252049B (en) 2004-07-23 2006-03-21 Inventec Corp Sound control system and method
US7725318B2 (en) 2004-07-30 2010-05-25 Nice Systems Inc. System and method for improving the accuracy of audio searching
CN101077014B (en) * 2004-08-09 2013-09-25 The Nielsen Company (US), LLC Methods and apparatus to monitor audio/visual content from various sources
US7853574B2 (en) 2004-08-26 2010-12-14 International Business Machines Corporation Method of generating a context-inferenced search query and of sorting a result of the query
KR20060022001A (en) 2004-09-06 2006-03-09 Hyundai Mobis Co., Ltd. Button mounting structure for a car audio
US20060061488A1 (en) 2004-09-17 2006-03-23 Dunton Randy R Location based task reminder
US7716056B2 (en) 2004-09-27 2010-05-11 Robert Bosch Corporation Method and system for interactive conversational dialogue for cognitively overloaded device users
US8107401B2 (en) 2004-09-30 2012-01-31 Avaya Inc. Method and apparatus for providing a virtual assistant to a communication participant
US7603381B2 (en) 2004-09-30 2009-10-13 Microsoft Corporation Contextual action publishing
US7693719B2 (en) 2004-10-29 2010-04-06 Microsoft Corporation Providing personalized voice font for text-to-speech applications
US7735012B2 (en) 2004-11-04 2010-06-08 Apple Inc. Audio user interface for computing devices
US7552046B2 (en) 2004-11-15 2009-06-23 Microsoft Corporation Unsupervised learning of paraphrase/translation alternations and selective application thereof
US7546235B2 (en) 2004-11-15 2009-06-09 Microsoft Corporation Unsupervised learning of paraphrase/translation alternations and selective application thereof
US7885844B1 (en) 2004-11-16 2011-02-08 Amazon Technologies, Inc. Automatically generating task recommendations for human task performers
US7702500B2 (en) 2004-11-24 2010-04-20 Blaedow Karen R Method and apparatus for determining the meaning of natural language
CN1609859A (en) 2004-11-26 2005-04-27 孙斌 Search result clustering method
US7376645B2 (en) 2004-11-29 2008-05-20 The Intellection Group, Inc. Multimodal natural language query system and architecture for processing voice and proximity-based queries
US8214214B2 (en) 2004-12-03 2012-07-03 Phoenix Solutions, Inc. Emotion detection device and method for use in distributed systems
US20060122834A1 (en) 2004-12-03 2006-06-08 Bennett Ian M Emotion detection device & method for use in distributed systems
US7636657B2 (en) 2004-12-09 2009-12-22 Microsoft Corporation Method and apparatus for automatic grammar generation from data entries
US8478589B2 (en) 2005-01-05 2013-07-02 At&T Intellectual Property Ii, L.P. Library of existing spoken dialog data for use in generating new natural language spoken dialog systems
US7873654B2 (en) 2005-01-24 2011-01-18 The Intellection Group, Inc. Multimodal natural language query system for processing and analyzing voice and proximity-based queries
US7508373B2 (en) 2005-01-28 2009-03-24 Microsoft Corporation Form factor and input method for language input
GB0502259D0 (en) 2005-02-03 2005-03-09 British Telecomm Document searching tool and method
US7949533B2 (en) 2005-02-04 2011-05-24 Vococollect, Inc. Methods and systems for assessing and improving the performance of a speech recognition system
US20060194181A1 (en) * 2005-02-28 2006-08-31 Outland Research, Llc Method and apparatus for electronic books with enhanced educational features
US7676026B1 (en) 2005-03-08 2010-03-09 Baxtech Asia Pte Ltd Desktop telephony system
US7925525B2 (en) 2005-03-25 2011-04-12 Microsoft Corporation Smart reminders
US7721301B2 (en) 2005-03-31 2010-05-18 Microsoft Corporation Processing files from a mobile device using voice commands
US20080120342A1 (en) * 2005-04-07 2008-05-22 Iofy Corporation System and Method for Providing Data to be Used in a Presentation on a Device
US7684990B2 (en) 2005-04-29 2010-03-23 Nuance Communications, Inc. Method and apparatus for multiple value confirmation and correction in spoken dialog systems
WO2006129967A1 (en) 2005-05-30 2006-12-07 Daumsoft, Inc. Conversation system and method using conversational agent
US8041570B2 (en) 2005-05-31 2011-10-18 Robert Bosch Corporation Dialogue management using scripts
US8024195B2 (en) 2005-06-27 2011-09-20 Sensory, Inc. Systems and methods of performing speech recognition using historical information
US8396715B2 (en) 2005-06-28 2013-03-12 Microsoft Corporation Confidence threshold tuning
US7925995B2 (en) 2005-06-30 2011-04-12 Microsoft Corporation Integration of location logs, GPS signals, and spatial resources for identifying user activities, goals, and context
US7826945B2 (en) 2005-07-01 2010-11-02 You Zhang Automobile speech-recognition interface
US20070027732A1 (en) 2005-07-28 2007-02-01 Accu-Spatial, Llc Context-sensitive, location-dependent information delivery at a construction site
US8271549B2 (en) 2005-08-05 2012-09-18 Intel Corporation System and method for automatically managing media content
US7640160B2 (en) 2005-08-05 2009-12-29 Voicebox Technologies, Inc. Systems and methods for responding to natural language speech utterance
US7362738B2 (en) 2005-08-09 2008-04-22 Deere & Company Method and system for delivering information to a user
US7620549B2 (en) 2005-08-10 2009-11-17 Voicebox Technologies, Inc. System and method of supporting adaptive misrecognition in conversational speech
US20070041361A1 (en) 2005-08-15 2007-02-22 Nokia Corporation Apparatus and methods for implementing an in-call voice user interface using context information
US7949529B2 (en) 2005-08-29 2011-05-24 Voicebox Technologies, Inc. Mobile systems and methods of supporting natural language human-machine interactions
WO2007027989A2 (en) 2005-08-31 2007-03-08 Voicebox Technologies, Inc. Dynamic speech sharpening
US8265939B2 (en) 2005-08-31 2012-09-11 Nuance Communications, Inc. Hierarchical methods and apparatus for extracting user intent from spoken utterances
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US20070055514A1 (en) * 2005-09-08 2007-03-08 Beattie Valerie L Intelligent tutoring feedback
JP4908094B2 (en) 2005-09-30 2012-04-04 Ricoh Co., Ltd. Information processing system, information processing method, and information processing program
US7577522B2 (en) 2005-12-05 2009-08-18 Outland Research, Llc Spatially associated personal reminder system and method
US7930168B2 (en) 2005-10-04 2011-04-19 Robert Bosch Gmbh Natural language processing of disfluent sentences
US8620667B2 (en) 2005-10-17 2013-12-31 Microsoft Corporation Flexible speech-activated command and control
US7707032B2 (en) 2005-10-20 2010-04-27 National Cheng Kung University Method and system for matching speech data
US8229745B2 (en) 2005-10-21 2012-07-24 Nuance Communications, Inc. Creating a mixed-initiative grammar from directed dialog grammars
US20070106674A1 (en) 2005-11-10 2007-05-10 Purusharth Agrawal Field sales process facilitation systems and methods
US20070112572A1 (en) 2005-11-15 2007-05-17 Fail Keith W Method and apparatus for assisting vision impaired individuals with selecting items from a list
US8326629B2 (en) 2005-11-22 2012-12-04 Nuance Communications, Inc. Dynamically changing voice attributes during speech synthesis based upon parameter differentiation for dialog contexts
US7822749B2 (en) 2005-11-28 2010-10-26 Commvault Systems, Inc. Systems and methods for classifying and transferring information in a storage network
KR20070057496A (en) 2005-12-02 2007-06-07 Samsung Electronics Co., Ltd. Liquid crystal display
KR100810500B1 (en) 2005-12-08 2008-03-07 Electronics and Telecommunications Research Institute Method for enhancing usability in a spoken dialog system
GB2433403B (en) 2005-12-16 2009-06-24 Emil Ltd A text editing apparatus and method
US20070211071A1 (en) 2005-12-20 2007-09-13 Benjamin Slotznick Method and apparatus for interacting with a visually displayed document on a screen reader
DE102005061365A1 (en) 2005-12-21 2007-06-28 Siemens Ag Background applications e.g. home banking system, controlling method for use over e.g. user interface, involves associating transactions and transaction parameters over universal dialog specification, and universally operating applications
US7996228B2 (en) 2005-12-22 2011-08-09 Microsoft Corporation Voice initiated network operations
US7599918B2 (en) 2005-12-29 2009-10-06 Microsoft Corporation Dynamic search with implicit user intention mining
JP2007183864A (en) 2006-01-10 2007-07-19 Fujitsu Ltd File retrieval method and system therefor
US20070174188A1 (en) 2006-01-25 2007-07-26 Fish Robert D Electronic marketplace that facilitates transactions between consolidated buyers and/or sellers
JP2007206317A (en) 2006-02-01 2007-08-16 Yamaha Corp Authoring method and apparatus, and program
IL174107A0 (en) 2006-02-01 2006-08-01 Grois Dan Method and system for advertising by means of a search engine over a data network
US8595041B2 (en) 2006-02-07 2013-11-26 Sap Ag Task responsibility system
US7983910B2 (en) 2006-03-03 2011-07-19 International Business Machines Corporation Communicating across voice and text channels with emotion preservation
KR100764174B1 (en) 2006-03-03 2007-10-08 Samsung Electronics Co., Ltd. Apparatus for providing voice dialogue service and method for operating the apparatus
US7752152B2 (en) 2006-03-17 2010-07-06 Microsoft Corporation Using predictive user models for language modeling on a personal device with user behavior models based on statistical modeling
JP4734155B2 (en) 2006-03-24 2011-07-27 Toshiba Corporation Speech recognition apparatus, speech recognition method, and speech recognition program
US7707027B2 (en) 2006-04-13 2010-04-27 Nuance Communications, Inc. Identification and rejection of meaningless input during natural language classification
US8214213B1 (en) 2006-04-27 2012-07-03 At&T Intellectual Property Ii, L.P. Speech recognition based on pronunciation modeling
US20070276714A1 (en) 2006-05-15 2007-11-29 Sap Ag Business process map management
US20070276651A1 (en) 2006-05-23 2007-11-29 Motorola, Inc. Grammar adaptation through cooperative client and server based speech recognition
US8423347B2 (en) 2006-06-06 2013-04-16 Microsoft Corporation Natural language personal information management
US7523108B2 (en) 2006-06-07 2009-04-21 Platformation, Inc. Methods and apparatus for searching with awareness of geography and languages
US7483894B2 (en) 2006-06-07 2009-01-27 Platformation Technologies, Inc Methods and apparatus for entity search
US20100257160A1 (en) 2006-06-07 2010-10-07 Yu Cao Methods & apparatus for searching with awareness of different types of information
KR100776800B1 (en) 2006-06-16 2007-11-19 Electronics and Telecommunications Research Institute Method and system (apparatus) for user specific service using intelligent gadget
KR20080001227A (en) 2006-06-29 2008-01-03 LG.Philips LCD Co., Ltd. Apparatus for fixing a lamp of the back-light
US7548895B2 (en) 2006-06-30 2009-06-16 Microsoft Corporation Communication-prompted user assistance
US8050500B1 (en) 2006-07-06 2011-11-01 Senapps, LLC Recognition method and system
TWI312103B (en) 2006-07-17 2009-07-11 Asia Optical Co Inc Image pickup systems and methods
US20080027726A1 (en) * 2006-07-28 2008-01-31 Eric Louis Hansen Text to audio mapping, and animation of the text
US8170790B2 (en) 2006-09-05 2012-05-01 Garmin Switzerland Gmbh Apparatus for switching navigation device mode
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
EP2067102A2 (en) * 2006-09-15 2009-06-10 Exbiblio B.V. Capture and display of annotations in paper and electronic documents
US20080077384A1 (en) 2006-09-22 2008-03-27 International Business Machines Corporation Dynamically translating a software application to a user selected target language that is not natively provided by the software application
US8214208B2 (en) 2006-09-28 2012-07-03 Reqall, Inc. Method and system for sharing portable voice profiles
US7649454B2 (en) 2006-09-28 2010-01-19 Ektimisi Semiotics Holdings, Llc System and method for providing a task reminder based on historical travel information
US7528713B2 (en) 2006-09-28 2009-05-05 Ektimisi Semiotics Holdings, Llc Apparatus and method for providing a task reminder based on travel history
US7930197B2 (en) 2006-09-28 2011-04-19 Microsoft Corporation Personal data mining
US20080082338A1 (en) 2006-09-29 2008-04-03 O'neil Michael P Systems and methods for secure voice identification and medical device interface
US8073681B2 (en) 2006-10-16 2011-12-06 Voicebox Technologies, Inc. System and method for a cooperative conversational voice user interface
US8600760B2 (en) 2006-11-28 2013-12-03 General Motors Llc Correcting substitution errors during automatic speech recognition by accepting a second best when first best is confusable
GB2457855B (en) 2006-11-30 2011-01-12 Nat Inst Of Advanced Ind Scien Speech recognition system and speech recognition system program
US20080129520A1 (en) 2006-12-01 2008-06-05 Apple Computer, Inc. Electronic device with enhanced audio feedback
US8045808B2 (en) 2006-12-04 2011-10-25 Trend Micro Incorporated Pure adversarial approach for identifying text content in images
US20080140652A1 (en) * 2006-12-07 2008-06-12 Jonathan Travis Millman Authoring tool
US20080140413A1 (en) * 2006-12-07 2008-06-12 Jonathan Travis Millman Synchronization of audio to reading
US8032510B2 (en) 2008-03-03 2011-10-04 Yahoo! Inc. Social aspects of content aggregation, syndication, sharing, and updating
WO2008085742A2 (en) 2007-01-07 2008-07-17 Apple Inc. Portable multifunction device, method and graphical user interface for interacting with user input elements in displayed content
KR100883657B1 (en) 2007-01-26 2009-02-18 Samsung Electronics Co., Ltd. Method and apparatus for searching for music using speech recognition
US7818176B2 (en) 2007-02-06 2010-10-19 Voicebox Technologies, Inc. System and method for selecting and presenting advertisements based on natural language processing of voice-based input
US7822608B2 (en) 2007-02-27 2010-10-26 Nuance Communications, Inc. Disambiguating a speech recognition grammar in a multimodal application
US20080221901A1 (en) 2007-03-07 2008-09-11 Joseph Cerra Mobile general search environment speech processing facility
US20080256613A1 (en) 2007-03-13 2008-10-16 Grover Noel J Voice print identification portal
US7801729B2 (en) 2007-03-13 2010-09-21 Sensory, Inc. Using multiple attributes to create a voice search playlist
US8219406B2 (en) 2007-03-15 2012-07-10 Microsoft Corporation Speech-centric multimodal user interface design in mobile technology
JP2008250375A (en) 2007-03-29 2008-10-16 Toshiba Corp Character input device, method, and program
US7809610B2 (en) 2007-04-09 2010-10-05 Platformation, Inc. Methods and apparatus for freshness and completeness of information
US8457946B2 (en) 2007-04-26 2013-06-04 Microsoft Corporation Recognition architecture for generating Asian characters
US7983915B2 (en) 2007-04-30 2011-07-19 Sonic Foundry, Inc. Audio content search engine
US8032383B1 (en) 2007-05-04 2011-10-04 Foneweb, Inc. Speech controlled services and devices using internet
US9292807B2 (en) 2007-05-10 2016-03-22 Microsoft Technology Licensing, Llc Recommending actions based on context
US8055708B2 (en) 2007-06-01 2011-11-08 Microsoft Corporation Multimedia spaces
US8204238B2 (en) 2007-06-08 2012-06-19 Sensory, Inc Systems and methods of sonic communication
KR20080109322A (en) 2007-06-12 2008-12-17 LG Electronics Inc. Method and apparatus for providing services by comprehending the user's intuited intention
US20080313335A1 (en) 2007-06-15 2008-12-18 Searete Llc, A Limited Liability Corporation Of The State Of Delaware Communicator establishing aspects with context identifying
US8190627B2 (en) 2007-06-28 2012-05-29 Microsoft Corporation Machine assisted query formulation
US8019606B2 (en) 2007-06-29 2011-09-13 Microsoft Corporation Identification and selection of a software application via speech
JP4424382B2 (en) 2007-07-04 2010-03-03 Sony Corporation Content reproduction apparatus and content automatic reception method
JP2009036999A (en) 2007-08-01 2009-02-19 Infocom Corp Interactive method using computer, interactive system, computer program and computer-readable storage medium
KR101359715B1 (en) 2007-08-24 2014-02-10 Samsung Electronics Co., Ltd. Method and apparatus for providing mobile voice web
WO2009029910A2 (en) 2007-08-31 2009-03-05 Proxpro, Inc. Situation-aware personal information management for a mobile device
US20090058823A1 (en) 2007-09-04 2009-03-05 Apple Inc. Virtual Keyboards in Multi-Language Environment
US8838760B2 (en) 2007-09-14 2014-09-16 Ricoh Co., Ltd. Workflow-enabled provider
KR100920267B1 (en) 2007-09-17 2009-10-05 Electronics and Telecommunications Research Institute System for voice communication analysis and method thereof
US8706476B2 (en) 2007-09-18 2014-04-22 Ariadne Genomics, Inc. Natural language processing method by analyzing primitive sentences, logical clauses, clause types and verbal blocks
US8165886B1 (en) 2007-10-04 2012-04-24 Great Northern Research LLC Speech interface system and method for control and interaction with applications on a computing system
US8036901B2 (en) 2007-10-05 2011-10-11 Sensory, Incorporated Systems and methods of performing speech recognition using sensory inputs of human position
US20090112677A1 (en) 2007-10-24 2009-04-30 Rhett Randolph L Method for automatically developing suggested optimal work schedules from unsorted group and individual task lists
US7840447B2 (en) 2007-10-30 2010-11-23 Leonard Kleinrock Pricing and auctioning of bundled items among multiple sellers and buyers
US20090112572A1 (en) * 2007-10-30 2009-04-30 Karl Ola Thorn System and method for input of text to an application operating on a device
US7983997B2 (en) 2007-11-02 2011-07-19 Florida Institute For Human And Machine Cognition, Inc. Interactive complex task teaching system that allows for natural language input, recognizes a user's intent, and automatically performs tasks in document object model (DOM) nodes
JP4926004B2 (en) 2007-11-12 2012-05-09 Ricoh Co., Ltd. Document processing apparatus, document processing method, and document processing program
US7890525B2 (en) 2007-11-14 2011-02-15 International Business Machines Corporation Foreign language abbreviation translation in an instant messaging system
US8112280B2 (en) 2007-11-19 2012-02-07 Sensory, Inc. Systems and methods of performing speech recognition with barge-in for use in a bluetooth system
CN101878479B (en) * 2007-11-28 2013-04-24 Fujitsu Limited Metallic pipe managed by wireless IC tag, and the wireless IC tag
US8140335B2 (en) 2007-12-11 2012-03-20 Voicebox Technologies, Inc. System and method for providing a natural language voice user interface in an integrated voice navigation services environment
US10002189B2 (en) 2007-12-20 2018-06-19 Apple Inc. Method and apparatus for searching using an active ontology
US8675830B2 (en) 2007-12-21 2014-03-18 Bce Inc. Method and apparatus for interrupting an active telephony session to deliver information to a subscriber
US8219407B1 (en) 2007-12-27 2012-07-10 Great Northern Research, LLC Method for processing the output of a speech recognizer
US20090187577A1 (en) 2008-01-20 2009-07-23 Aviv Reznik System and Method Providing Audio-on-Demand to a User's Personal Online Device as Part of an Online Audio Community
KR101334066B1 (en) 2008-02-11 2013-11-29 이점식 Self-evolving artificially intelligent cyber robot system and providing method
US8099289B2 (en) 2008-02-13 2012-01-17 Sensory, Inc. Voice interface and search for electronic devices including bluetooth headsets and remote systems
US20090239552A1 (en) 2008-03-24 2009-09-24 Yahoo! Inc. Location-based opportunistic recommendations
US8958848B2 (en) 2008-04-08 2015-02-17 Lg Electronics Inc. Mobile terminal and menu control method thereof
US8666824B2 (en) 2008-04-23 2014-03-04 Dell Products L.P. Digital media content location and purchasing system
US8594995B2 (en) 2008-04-24 2013-11-26 Nuance Communications, Inc. Multilingual asynchronous communications of speech messages recorded in digital media files
US8249857B2 (en) 2008-04-24 2012-08-21 International Business Machines Corporation Multilingual administration of enterprise data with user selected target language translation
US8285344B2 (en) 2008-05-21 2012-10-09 DP Technlogies, Inc. Method and apparatus for adjusting audio for a user environment
US8589161B2 (en) 2008-05-27 2013-11-19 Voicebox Technologies, Inc. System and method for an integrated, multi-modal, multi-device natural language voice services environment
US8694355B2 (en) 2008-05-30 2014-04-08 Sri International Method and apparatus for automated assistance with task management
US8423288B2 (en) 2009-11-30 2013-04-16 Apple Inc. Dynamic alerts for calendar events
US8166019B1 (en) 2008-07-21 2012-04-24 Sprint Communications Company L.P. Providing suggested actions in response to textual communications
US8756519B2 (en) 2008-09-12 2014-06-17 Google Inc. Techniques for sharing content on a web page
KR101005074B1 (en) 2008-09-18 2010-12-30 Soohyun Tech Co., Ltd. Plastic pipe connection fixing device
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US9200913B2 (en) 2008-10-07 2015-12-01 Telecommunication Systems, Inc. User interface for predictive traffic
US8832319B2 (en) * 2008-11-18 2014-09-09 Amazon Technologies, Inc. Synchronization of digital content
US8442824B2 (en) 2008-11-26 2013-05-14 Nuance Communications, Inc. Device, system, and method of liveness detection utilizing voice biometrics
US8140328B2 (en) 2008-12-01 2012-03-20 At&T Intellectual Property I, L.P. User intention based on N-best list of recognition hypotheses for utterances in a dialog
US8489599B2 (en) 2008-12-02 2013-07-16 Palo Alto Research Center Incorporated Context and activity-driven content delivery and interaction
JP5257311B2 (en) 2008-12-05 2013-08-07 Sony Corporation Information processing apparatus and information processing method
CN102317938B (en) 2008-12-22 2014-07-30 Google Inc. Asynchronous distributed de-duplication for replicated content addressable storage clusters
WO2010075623A1 (en) 2008-12-31 2010-07-08 Bce Inc. System and method for unlocking a device
US8326637B2 (en) 2009-02-20 2012-12-04 Voicebox Technologies, Inc. System and method for processing multi-modal device interactions in a natural language voice services environment
US20100225809A1 (en) * 2009-03-09 2010-09-09 Sony Corporation And Sony Electronics Inc. Electronic book with enhanced features
US8805823B2 (en) 2009-04-14 2014-08-12 Sri International Content processing systems and methods
WO2010126321A2 (en) 2009-04-30 2010-11-04 Samsung Electronics Co., Ltd. Apparatus and method for user intention inference using multimodal information
KR101581883B1 (en) 2009-04-30 2016-01-11 Samsung Electronics Co., Ltd. Apparatus for detecting voice using motion information and method thereof
KR101032792B1 (en) 2009-04-30 2011-05-06 Kolon Co., Ltd. Polyester fabric for airbag and manufacturing method thereof
WO2010131911A2 (en) * 2009-05-13 2010-11-18 Lee Doohan Multimedia file playing method and multimedia player
US8498857B2 (en) 2009-05-19 2013-07-30 Tata Consultancy Services Limited System and method for rapid prototyping of existing speech recognition solutions in different languages
US10706373B2 (en) 2011-06-03 2020-07-07 Apple Inc. Performing actions associated with task items that represent tasks to perform
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US10540976B2 (en) 2009-06-05 2020-01-21 Apple Inc. Contextual voice commands
KR101562792B1 (en) 2009-06-10 2015-10-23 Samsung Electronics Co., Ltd. Apparatus and method for providing goal predictive interface
US8290777B1 (en) * 2009-06-12 2012-10-16 Amazon Technologies, Inc. Synchronizing the playing and displaying of digital content
US8484027B1 (en) * 2009-06-12 2013-07-09 Skyreader Media Inc. Method for live remote narration of a digital book
US20100324709A1 (en) * 2009-06-22 2010-12-23 Tree Of Life Publishing E-book reader with voice annotation
US9754224B2 (en) 2009-06-26 2017-09-05 International Business Machines Corporation Action based to-do list
US8527278B2 (en) 2009-06-29 2013-09-03 Abraham Ben David Intelligent home automation
US20110047072A1 (en) 2009-08-07 2011-02-24 Visa U.S.A. Inc. Systems and Methods for Propensity Analysis and Validation
US8768313B2 (en) 2009-08-17 2014-07-01 Digimarc Corporation Methods and systems for image or audio recognition processing
CN101996631B (en) * 2009-08-28 2014-12-03 International Business Machines Corporation Method and device for aligning texts
US9213558B2 (en) 2009-09-02 2015-12-15 Sri International Method and apparatus for tailoring the output of an intelligent automated assistant to a user
US8321527B2 (en) 2009-09-10 2012-11-27 Tribal Brands System and method for tracking user location and associated activity and responsively providing mobile device updates
US8768308B2 (en) 2009-09-29 2014-07-01 Deutsche Telekom Ag Apparatus and method for creating and managing personal schedules via context-sensing and actuation
KR20110036385A (en) 2009-10-01 2011-04-07 Samsung Electronics Co., Ltd. Apparatus for analyzing intention of user and method thereof
US9197736B2 (en) 2009-12-31 2015-11-24 Digimarc Corporation Intuitive computing methods and systems
US20110099507A1 (en) 2009-10-28 2011-04-28 Google Inc. Displaying a collection of interactive elements that trigger actions directed to an item
US20120137367A1 (en) 2009-11-06 2012-05-31 Cataphora, Inc. Continuous anomaly detection based on behavior modeling and heterogeneous information analysis
KR20120091325A (en) 2009-11-10 2012-08-17 Dulcetta Incorporated Dynamic audio playback of soundtracks for electronic visual works
US9171541B2 (en) 2009-11-10 2015-10-27 Voicebox Technologies Corporation System and method for hybrid processing in a natural language voice services environment
US9502025B2 (en) 2009-11-10 2016-11-22 Voicebox Technologies Corporation System and method for providing a natural language content dedication service
US8712759B2 (en) 2009-11-13 2014-04-29 Clausal Computing Oy Specializing disambiguation of a natural language expression
KR101960835B1 (en) 2009-11-24 2019-03-21 Samsung Electronics Co., Ltd. Schedule Management System Using Interactive Robot and Method Thereof
US20110153330A1 (en) 2009-11-27 2011-06-23 i-SCROLL System and method for rendering text synchronized audio
US8396888B2 (en) 2009-12-04 2013-03-12 Google Inc. Location-based searching using a search area that corresponds to a geographical location of a computing device
KR101622111B1 (en) 2009-12-11 2016-05-18 Samsung Electronics Co., Ltd. Dialog system and conversational method thereof
US20110161309A1 (en) 2009-12-29 2011-06-30 Lx1 Technology Limited Method Of Sorting The Result Set Of A Search Engine
US8494852B2 (en) 2010-01-05 2013-07-23 Google Inc. Word-level correction of speech input
US8334842B2 (en) 2010-01-15 2012-12-18 Microsoft Corporation Recognizing user intent in motion capture system
US8626511B2 (en) 2010-01-22 2014-01-07 Google Inc. Multi-dimensional disambiguation of voice commands
US20110218855A1 (en) 2010-03-03 2011-09-08 Platformation, Inc. Offering Promotions Based on Query Analysis
US8521513B2 (en) 2010-03-12 2013-08-27 Microsoft Corporation Localization for interactive voice response systems
US8374864B2 (en) * 2010-03-17 2013-02-12 Cisco Technology, Inc. Correlation of transcribed text with corresponding audio
US9323756B2 (en) * 2010-03-22 2016-04-26 Lenovo (Singapore) Pte. Ltd. Audio book and e-book synchronization
KR101369810B1 (en) 2010-04-09 2014-03-05 이초강 Empirical Context Aware Computing Method For Robot
US8265928B2 (en) 2010-04-14 2012-09-11 Google Inc. Geotagged environmental audio for enhanced speech recognition accuracy
US20110279368A1 (en) 2010-05-12 2011-11-17 Microsoft Corporation Inferring user intent to engage a motion capture system
US8392186B2 (en) * 2010-05-18 2013-03-05 K-Nfb Reading Technology, Inc. Audio synchronization for document narration with user-selected playback
US8694313B2 (en) 2010-05-19 2014-04-08 Google Inc. Disambiguation of contact information using historical data
US8522283B2 (en) 2010-05-20 2013-08-27 Google Inc. Television remote control data transfer
US8468012B2 (en) 2010-05-26 2013-06-18 Google Inc. Acoustic model adaptation using geographic information
EP2397972B1 (en) 2010-06-08 2015-01-07 Vodafone Holding GmbH Smart card with microphone
US20110306426A1 (en) 2010-06-10 2011-12-15 Microsoft Corporation Activity Participation Based On User Intent
US8234111B2 (en) 2010-06-14 2012-07-31 Google Inc. Speech and noise models for speech recognition
US8411874B2 (en) 2010-06-30 2013-04-02 Google Inc. Removing noise from audio
US8861925B1 (en) 2010-07-28 2014-10-14 Intuit Inc. Methods and systems for audio-visual synchronization
US8775156B2 (en) 2010-08-05 2014-07-08 Google Inc. Translating languages in response to device motion
US8473289B2 (en) 2010-08-06 2013-06-25 Google Inc. Disambiguating input based on context
US8359020B2 (en) 2010-08-06 2013-01-22 Google Inc. Automatically monitoring for voice input based on context
EP2614448A1 (en) * 2010-09-09 2013-07-17 Sony Ericsson Mobile Communications AB Annotating e-books/e-magazines with application results
US8812321B2 (en) 2010-09-30 2014-08-19 At&T Intellectual Property I, L.P. System and method for combining speech recognition outputs from a plurality of domain-specific speech recognizers via machine learning
US20120084634A1 (en) * 2010-10-05 2012-04-05 Sony Corporation Method and apparatus for annotating text
US8862255B2 (en) * 2011-03-23 2014-10-14 Audible, Inc. Managing playback of synchronized content
KR20140039194A (en) 2011-04-25 2014-04-01 비비오, 인크. System and method for an intelligent personal timeline assistant
TWI488174B (en) * 2011-06-03 2015-06-11 Apple Inc Automatically creating a mapping between text data and audio data
US10672399B2 (en) 2011-06-03 2020-06-02 Apple Inc. Switching between text data and audio data based on a mapping

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020095290A1 (en) * 1999-02-05 2002-07-18 Jonathan Kahn Speech recognition program mapping tool to align an audio file to verbatim text
US6260011B1 (en) * 2000-03-20 2001-07-10 Microsoft Corporation Methods and apparatus for automatically synchronizing electronic audio files with electronic text files
US20040216049A1 (en) * 2000-09-20 2004-10-28 International Business Machines Corporation Method for enhancing dictation and command discrimination
US20020129057A1 (en) * 2001-03-09 2002-09-12 Steven Spielberg Method and apparatus for annotating a document
US20090228126A1 (en) * 2001-03-09 2009-09-10 Steven Spielberg Method and apparatus for annotating a line-based document
US20050010409A1 (en) * 2001-11-19 2005-01-13 Hull Jonathan J. Printable representations for time-based media
US7490039B1 (en) * 2001-12-13 2009-02-10 Cisco Technology, Inc. Text to speech system and method having interactive spelling capabilities
US20040054530A1 (en) * 2002-09-18 2004-03-18 International Business Machines Corporation Generating speech recognition grammars from a large corpus of data
US20060020890A1 (en) * 2004-07-23 2006-01-26 Findaway World, Inc. Personal media player apparatus and method
US20080255837A1 (en) * 2004-11-30 2008-10-16 Jonathan Kahn Method for locating an audio segment within an audio file
US20070156627A1 (en) * 2005-12-15 2007-07-05 General Instrument Corporation Method and apparatus for creating and using electronic content bookmarks
US20070203955A1 (en) * 2006-02-28 2007-08-30 Sandisk Il Ltd. Bookmarked synchronization of files
US20070271104A1 (en) * 2006-05-19 2007-11-22 Mckay Martin Streaming speech with synchronized highlighting generated by a server
US20100324895A1 (en) * 2009-01-15 2010-12-23 K-Nfb Reading Technology, Inc. Synchronization for document narration
US20100324905A1 (en) * 2009-01-15 2010-12-23 K-Nfb Reading Technology, Inc. Voice models for document narration

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210097999A1 (en) * 2018-06-27 2021-04-01 Google Llc Rendering responses to a spoken utterance of a user utilizing a local text-response map
CN109522427B (en) * 2018-09-30 2021-12-10 Beijing Guangnian Wuxian Technology Co., Ltd. Intelligent robot-oriented story data processing method and device

Also Published As

Publication number Publication date
KR101324910B1 (en) 2013-11-04
US20120310642A1 (en) 2012-12-06
KR20120135137A (en) 2012-12-12
US20120310649A1 (en) 2012-12-06
KR20140027421A (en) 2014-03-06
KR101674851B1 (en) 2016-11-09
EP2593846A1 (en) 2013-05-22
AU2012261818A1 (en) 2014-01-16
KR101622015B1 (en) 2016-05-17
KR20150085115A (en) 2015-07-22
AU2016202974B2 (en) 2018-04-05
CN103703431A (en) 2014-04-02
KR20160036077A (en) 2016-04-01
AU2016202974A1 (en) 2016-06-02
EP2593846A4 (en) 2014-12-03
AU2012261818B2 (en) 2016-02-25
US10672399B2 (en) 2020-06-02
KR101700076B1 (en) 2017-01-26
CN103703431B (en) 2018-02-09
JP2014519058A (en) 2014-08-07

Similar Documents

Publication Publication Date Title
AU2016202974B2 (en) Automatically creating a mapping between text data and audio data
JP5463385B2 (en) Automatic creation of mapping between text data and audio data
US10671251B2 (en) Interactive eReader interface generation based on synchronization of textual and audial descriptors
US11657725B2 (en) E-reader interface system with audio and highlighting synchronization for digital books
US9786283B2 (en) Transcription of speech
US20200294487A1 (en) Hands-free annotations of audio text
US20140039871A1 (en) Synchronous Texts
US20060194181A1 (en) Method and apparatus for electronic books with enhanced educational features
US11551568B2 (en) System and method for dual mode presentation of content in a target language to improve listening fluency in the target language
US20150170648A1 (en) Ebook interaction using speech recognition
US10650089B1 (en) Sentence parsing correction system
US20080243510A1 (en) Overlapping screen reading of non-sequential text
US20230419847A1 (en) System and method for dual mode presentation of content in a target language to improve listening fluency in the target language
Luz et al. Interface design strategies for computer-assisted speech transcription
KR101030777B1 (en) Method and apparatus for producing script data
Masoodian et al. TRAED: Speech audio editing using imperfect transcripts
KR20100014031A (en) Device and method for making u-contents by easily, quickly and accurately extracting only wanted part from multimedia file
Wald et al. Benefiting disabled students by developing an application that uses captioning of multimedia to enhance learning for all students

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase
Ref document number: 2012729332
Country of ref document: EP

121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 12729332
Country of ref document: EP
Kind code of ref document: A1

ENP Entry into the national phase
Ref document number: 2014513799
Country of ref document: JP
Kind code of ref document: A

NENP Non-entry into the national phase
Ref country code: DE

ENP Entry into the national phase
Ref document number: 20137034641
Country of ref document: KR
Kind code of ref document: A

ENP Entry into the national phase
Ref document number: 2012261818
Country of ref document: AU
Date of ref document: 20120604
Kind code of ref document: A