TW201312548A - Automatically creating a mapping between text data and audio data - Google Patents

Info

Publication number
TW201312548A
Authority
TW
Taiwan
Prior art keywords
text
audio
work
version
mapping
Prior art date
Application number
TW101119921A
Other languages
Chinese (zh)
Other versions
TWI488174B (en)
Inventor
Xiang Cao
Alan C Cannistraro
Gregory S Robbin
Casey M Dougherty
Melissa Breglio Hajj
Original Assignee
Apple Inc
Priority date
Filing date
Publication date
Priority to US201161493372P
Priority to US13/267,738 (published as US20120310642A1)
Application filed by Apple Inc
Publication of TW201312548A
Application granted
Publication of TWI488174B

Abstract

Techniques are provided for creating a mapping that maps locations in audio data (e.g., an audio book) to corresponding locations in text data (e.g., an e-book). Techniques are also provided for using a mapping between audio data and text data, whether the mapping is created automatically or manually. A mapping may be used for bookmark switching, where a bookmark established in one version of a digital work (e.g., an e-book) is used to identify a corresponding location within another version of the digital work (e.g., an audio book). Alternatively, the mapping may be used to play audio that corresponds to text selected by a user. Alternatively, the mapping may be used to automatically highlight text in response to the audio that corresponds to that text being played. Alternatively, the mapping may be used to determine where an annotation created in one media context (e.g., audio) will be consumed in another media context (e.g., text).

Description

Automatically creating a mapping between text data and audio data

The present invention relates to automatically creating a mapping between text data and audio data by analyzing the audio data to detect the words reflected in it and comparing the detected words to the words in a text document.

With the cost of handheld electronic devices decreasing and the demand for digital content increasing, creative works that were originally published in print are increasingly becoming available as digital media. For example, digital books (also known as "e-books") are becoming increasingly popular, along with specialized handheld electronic devices known as e-book readers (or "e-readers"). Also, although not designed exclusively as e-readers, other handheld devices such as tablet computers and smart phones have the capability to operate as e-readers.

A common standard for formatting e-books is the EPUB standard (short for "electronic publication"), a free and open e-book standard proposed by the International Digital Publishing Forum (IDPF). An EPUB file uses XHTML 1.1 (or DTBook) to structure the content of the book. Styling and layout are performed using a subset of CSS referred to as OPS style sheets.

For some written works, especially works that have become popular, an audio version of the written work is created. For example, a well-known individual (or simply a person with a pleasant voice) creates a recording of the written work being read aloud, and the recording is made available for purchase, whether online or in a brick-and-mortar store.

It is not uncommon for consumers to purchase both the e-book version and the audio book version of a work. In some cases, a user reads the entire e-book and then wants to listen to the audio book. In other situations, the user switches between reading the book and listening to the book, depending on the user's circumstances. For example, while exercising or driving during a commute, a user will tend to listen to the audio version of the book. On the other hand, when lounging on the couch before going to bed, the user will tend to read the e-book version of the book. Unfortunately, such transitions can be laborious, because the user must remember where he or she stopped in the e-book and manually locate the corresponding place in the audio book, or vice versa. Even if the user clearly remembers the events in the book at the point where he or she stopped, the transition can be laborious, because it is not necessarily easy to find the portion of the e-book or audio book that corresponds to those events. Therefore, switching between an e-book and an audio book can be extremely time consuming.

The specification "EPUB Media Overlays 3.0" defines a usage of SMIL (Synchronized Multimedia Integration Language), the package document, EPUB style sheets, and EPUB content documents for representing synchronized text and audio publications. A pre-recorded narration of a publication can be represented as a series of audio clips, each corresponding to a portion of the text. A single audio clip in the series that makes up a pre-recorded narration typically represents a single phrase or paragraph, but by itself conveys no information about its order relative to the other clips or relative to the text of the document. Media Overlays solve this synchronization problem by tying the structured audio narration to its corresponding text in the EPUB content document using SMIL markup. Media Overlays are a simplified subset of SMIL 3.0 that allows the playback order of these clips to be defined.

Unfortunately, creating media overlay files is largely a manual process. As a result, the granularity of the mapping between the audio version of a work and the text version is extremely coarse. For example, a media overlay file may associate the beginning of each paragraph in the e-book with the corresponding location in the audio version of the book. The reason that media overlay files (especially for novels) do not contain mappings at any finer level of detail, such as on a word-by-word basis, is that it could take countless hours of manpower to create such high-resolution media overlay files.

The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.

According to some embodiments, a method is provided, the method comprising: receiving audio data that reflects an audible version of a work for which a text version exists; performing a speech-to-text analysis of the audio data to generate text for portions of the audio data; and, based on the text generated for the portions of the audio data, generating a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work. The method is performed by one or more computing devices.

In some embodiments, generating the text for the portions of the audio data includes generating the text based at least in part on a textual context of the work. In some embodiments, generating the text based at least in part on the textual context of the work comprises generating the text based at least in part on one or more grammar rules followed in the text version of the work. In some embodiments, generating the text based at least in part on the textual context of the work comprises limiting which words the audio data can be translated into based on which words are in the text version of the work, or a subset thereof. In some embodiments, limiting which words the audio data can be translated into based on which words are in the text version of the work includes identifying, for a given portion of the audio data, a sub-section of the text version of the work that corresponds to the given portion, and limiting the candidate words to those words in that sub-section of the text version of the work. In some embodiments, identifying the sub-section of the text version of the work includes maintaining a current text position in the text version of the work, the current text position corresponding to a current audio location of the speech-to-text analysis in the audio data; the sub-section of the text version of the work is a section associated with the current text position.

In some embodiments, the portions include portions corresponding to individual words, and the mapping maps the locations of the portions corresponding to individual words to individual words in the text version of the work. In some embodiments, the portions include portions corresponding to individual sentences, and the mapping maps the locations of the portions corresponding to individual sentences to individual sentences in the text version of the work. In some embodiments, the portions include portions corresponding to fixed amounts of data, and the mapping maps the locations of the portions corresponding to fixed amounts of data to respective locations in the text version of the work.

In some embodiments, generating the mapping includes: (1) embedding anchors in the audio data; (2) embedding anchors in the text version of the work; or (3) storing the mapping in a media overlay that is stored in association with the audio data or the text version of the work.

In some embodiments, each of one or more of the plurality of text positions indicates a relative position in the text version of the work. In some embodiments, one of the plurality of text positions indicates a relative position in the text version of the work, and another of the plurality of text positions indicates an absolute position from that relative position. In some embodiments, each of one or more of the plurality of text positions indicates an anchor within the text version of the work.

According to some embodiments, a method is provided, the method comprising: receiving a text version of a work; performing a text-to-speech analysis of the text version to generate first audio data; generating, based on the first audio data and the text version, a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the text version of the work; receiving second audio data that reflects an audible version of the work for which the text version exists; and generating, based on (1) a comparison of the first audio data with the second audio data and (2) the first mapping, a second mapping between a second plurality of audio locations in the second audio data and the plurality of text positions in the text version of the work. The method is performed by one or more computing devices.

According to some embodiments, a method is provided, the method comprising: receiving audio input; performing a speech-to-text analysis of the audio input to generate text for a portion of the audio input; determining whether the text generated for the portion of the audio input matches currently displayed text; and, in response to determining that the text matches the currently displayed text, causing the currently displayed text to be highlighted. The method is performed by one or more computing devices.

In accordance with some embodiments, an electronic device is provided that includes an audio data receiving unit configured to receive audio data that reflects an audible version of a work for which a text version exists. The electronic device also includes a processing unit coupled to the audio data receiving unit. The processing unit is configured to: perform a speech-to-text analysis of the audio data to generate text for portions of the audio data; and, based on the text generated for the portions of the audio data, generate a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work.

In accordance with some embodiments, an electronic device is provided that includes a text receiving unit configured to receive a text version of a work. The electronic device also includes a processing unit coupled to the text receiving unit, the processing unit configured to: perform a text-to-speech analysis of the text version to generate first audio data; and generate, based on the first audio data and the text version, a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the text version of the work. The electronic device also includes an audio data receiving unit configured to receive second audio data that reflects an audible version of the work for which the text version exists. The processing unit is further configured to generate, based on (1) a comparison of the first audio data with the second audio data and (2) the first mapping, a second mapping between a second plurality of audio locations in the second audio data and the plurality of text positions in the text version of the work.

In accordance with some embodiments, an electronic device is provided that includes an audio receiving unit configured to receive audio input. The electronic device also includes a processing unit coupled to the audio receiving unit. The processing unit is configured to: perform a speech-to-text analysis of the audio input to generate text for a portion of the audio input; determine whether the text generated for the portion of the audio input matches currently displayed text; and, in response to determining that the text matches the currently displayed text, cause the currently displayed text to be highlighted.

According to some embodiments, a method is provided, the method comprising: obtaining location data indicating a specified location within a text version of a work; checking a mapping between a plurality of audio locations in an audio version of the work and a corresponding plurality of text positions in the text version of the work to: determine a particular text position of the plurality of text positions that corresponds to the specified location, and determine, based on the particular text position, a particular audio location of the plurality of audio locations that corresponds to the particular text position. The method includes providing the particular audio location determined based on the particular text position to a media player to cause the media player to establish the particular audio location as the current playback position of the audio data. The method is performed by one or more computing devices.

In some embodiments, the obtaining includes a server receiving the location data from a first device over a network; the checking and providing are performed by the server; and the providing includes the server sending the particular audio location to a second device that executes the media player. In some embodiments, the second device and the first device are the same device. In some embodiments, the obtaining, checking, and providing are performed by a computing device that is configured to display the text version of the work and that executes the media player. In some embodiments, the method further includes determining the location data at the device configured to display the text version of the work, without input from a user of the device.

In some embodiments, the method further comprises: receiving input from a user; and, in response to receiving the input, determining the location data based on the input. In some embodiments, providing the particular audio location to the media player includes providing the particular audio location to the media player such that the media player processes the audio data beginning at the current playback position, thereby causing the media player to generate audio from the processed audio data; and causing the media player to process the audio data in response to receiving the input.

In some embodiments, the input selects a plurality of words in the text version of the work; the specified location is a first specified location; the location data also indicates a second specified location in the text version of the work that is different from the first specified location; the checking further comprises checking the mapping to: determine a second particular text position of the plurality of text positions that corresponds to the second specified location, and determine, based on the second particular text position, a second particular audio location of the plurality of audio locations that corresponds to the second particular text position; and providing the particular audio location to the media player includes providing the second particular audio location to the media player to cause the media player to cease processing the audio data when the current playback position reaches or approaches the second particular audio location.

In some embodiments, the method further comprises: obtaining annotation data based on input from a user; storing the annotation data in association with the specified location; and causing information about the annotation data to be displayed. In some embodiments, causing information about the annotation data to be displayed includes: determining when the current playback position of the audio data is at or near the particular audio location; and, in response to determining that the current playback position of the audio data is at or near the particular audio location, causing information about the annotation data to be displayed.

In some embodiments, the annotation data includes text data, and causing information about the annotation data to be displayed includes displaying the text data. In some embodiments, the annotation data includes voice data, and causing information about the annotation data to be displayed includes processing the voice data to generate audio.

In accordance with some embodiments, an electronic device is provided that includes a location data obtaining unit configured to obtain location data indicating a specified location within a text version of a work. The electronic device also includes a processing unit coupled to the location data obtaining unit. The processing unit is configured to: check a mapping between a plurality of audio locations in an audio version of the work and a corresponding plurality of text positions in the text version of the work to determine a particular text position of the plurality of text positions that corresponds to the specified location, and to determine, based on the particular text position, a particular audio location of the plurality of audio locations that corresponds to the particular text position; and provide the particular audio location determined based on the particular text position to a media player to cause the media player to establish the particular audio location as the current playback position of the audio data.

According to some embodiments, a method is provided, the method comprising: obtaining location data indicating a specified location in audio data; checking a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text positions in a text version of a work to: determine a particular audio location of the plurality of audio locations that corresponds to the specified location, and determine, based on the particular audio location, a particular text position of the plurality of text positions that corresponds to the particular audio location; and causing a media player to display information about the particular text position. The method is performed by one or more computing devices.

In some embodiments, the obtaining includes a server receiving the location data from a first device over a network; the checking and causing are performed by the server; and the causing includes the server sending the particular text position to a second device that executes the media player. In some embodiments, the second device and the first device are the same device. In some embodiments, the obtaining, checking, and causing are performed by a computing device that is configured to display the text version of the work and that executes the media player. In some embodiments, the method further includes determining the location data at the device configured to process the audio data, without input from a user of the device.

In some embodiments, the method further comprises: receiving input from a user; and, in response to receiving the input, determining the location data based on the input. In some embodiments, the causing includes causing the media player to display a portion of the text version of the work that corresponds to the particular text position, and the media player displays that portion of the text version of the work in response to receiving the input.

In some embodiments, the input selects a segment of the audio data; the specified location is a first specified location; the location data also indicates a second specified location in the audio data that is different from the first specified location; the checking further comprises checking the mapping to: determine a second particular audio location of the plurality of audio locations that corresponds to the second specified location, and determine, based on the second particular audio location, a second particular text position of the plurality of text positions that corresponds to the second particular audio location; and causing a media player to display information about the particular text position further includes causing the media player to display information about the second particular text position.

In some embodiments, the specified location corresponds to a current playback position of the audio data; the location data is generated as the audio data is processed and the audio is generated; and the causing includes causing a second media player to highlight text in the text version of the work that is at or near the particular text position.

In some embodiments, the method further comprises: obtaining annotation data based on input from a user; storing the annotation data in association with the specified location; and causing information about the annotation data to be displayed. In some embodiments, causing information about the annotation data to be displayed includes: determining when a portion of the text version of the work that corresponds to the particular text position is displayed; and, in response to determining that the portion of the text version of the work that corresponds to the particular text position is displayed, causing information about the annotation data to be displayed.

In some embodiments, the annotation data includes text data, and causing information about the annotation data to be displayed includes causing the text data to be displayed. In some embodiments, the annotation data includes voice data, and causing information about the annotation data to be displayed includes causing the voice data to be processed to generate audio.

According to some embodiments, a method is provided, the method comprising: during playback of an audio version of a work, obtaining location data indicating a specified location within the audio version, and determining, based on the specified location, a particular text position in a text version of the work that is associated with pause data, the pause data indicating when playback of the audio version is to be paused; and, in response to determining that the particular text position is associated with the pause data, pausing playback of the audio version. The method is performed by one or more computing devices.

In some embodiments, the pause data is within the text version of the work. In some embodiments, determining the particular text position comprises checking a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text positions in the text version of the work to: determine a particular audio location of the plurality of audio locations that corresponds to the specified location, and determine, based on the particular audio location, the particular text position of the plurality of text positions that corresponds to the particular audio location.

In some embodiments, the pause data corresponds to the end of a page reflected in the text version of the work. In some embodiments, the pause data corresponds to a location within the text version of the work that immediately precedes a picture that does not include any text.

In some embodiments, the method further comprises resuming playback of the audio version in response to receiving user input. In some embodiments, the method further includes resuming playback of the audio version in response to a certain amount of time elapsing since playback of the audio version was paused.

According to some embodiments, a method is provided, the method comprising: during playback of an audio version of a work, obtaining location data indicating a specified location within the audio version, and determining, based on the specified location, a particular text position in a text version of the work that is associated with end-of-page data, the end-of-page data indicating the end of a first page reflected in the text version of the work; and, in response to determining that the particular text position is associated with the end-of-page data, automatically causing the first page to cease being displayed and causing a second page that follows the first page to be displayed. The method is performed by one or more computing devices.

In some embodiments, the method further includes checking a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text positions in the text version of the work to: determine a particular audio location of the plurality of audio locations that corresponds to the specified location, and determine, based on the particular audio location, the particular text position of the plurality of text positions that corresponds to the particular audio location.

In accordance with some embodiments, an electronic device is provided that includes a location obtaining unit configured to obtain location data indicating a specified location within audio data. The electronic device also includes a processing unit coupled to the location obtaining unit. The processing unit is configured to: check a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text positions in a text version of a work to determine a particular audio location of the plurality of audio locations that corresponds to the specified location, and to determine, based on the particular audio location, a particular text position of the plurality of text positions that corresponds to the particular audio location; and cause a media player to display information about the particular text position.

According to some embodiments, an electronic device is provided that includes a location obtaining unit configured to obtain, during playback of an audio version of a work, location data indicating a specified location within the audio version. The electronic device also includes a processing unit coupled to the location obtaining unit, the processing unit configured to, during playback of the audio version of the work: determine, based on the specified location, a particular text position in a text version of the work that is associated with end-of-page data, the end-of-page data indicating the end of a first page reflected in the text version of the work; and, in response to determining that the particular text position is associated with the end-of-page data, automatically cause the first page to cease being displayed and cause a second page that follows the first page to be displayed.

According to some embodiments, a computer-readable storage medium is provided that stores one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing any of the above methods. In accordance with some embodiments, an electronic device is provided that includes means for performing any of the above methods. In some embodiments, an electronic device is provided that comprises one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs comprising instructions for performing any of the above methods. In some embodiments, an information processing apparatus for use in an electronic device is provided, the information processing apparatus comprising means for performing any of the above methods.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent that the invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the invention.

Overview of the automatic generation of audio-to-text mapping

According to one approach, a mapping is automatically created that maps locations within the audio version of a work (e.g., an audio book) to corresponding locations in the text version of the work (e.g., an e-book). The mapping is created by performing a speech-to-text analysis of the audio version to identify the words reflected in the audio version. The identified words are matched to the corresponding words in the text version of the work. The mapping associates the locations of the identified words (within the audio version) with the locations of those words in the text version of the work.

Audio version format

The audio data reflects an audible reading of the text version of a work (such as a book, web page, brochure, leaflet, etc.). The audio data can be stored in one or more audio files. The one or more audio files can be in any one of a number of file formats. Non-limiting examples of audio file formats include AAC, MP3, WAV, and PCM.

Text version format

Similarly, the text data to which the audio data is mapped can be stored in any one of a number of document file formats. Non-limiting examples of document file formats include DOC, TXT, PDF, RTF, HTML, XHTML, and EPUB.

A typical EPUB document is accompanied by a file that (a) lists each XHTML content document and (b) indicates the order of the XHTML content documents. For example, if a book contains 20 chapters, the EPUB document for the book may have 20 different XHTML documents, one for each chapter. The file accompanying the EPUB document identifies the order of the XHTML documents, which corresponds to the order of the chapters in the book. Therefore, a single (logical) document (whether an EPUB document or another type of document) can comprise multiple data items or files.

Words or characters reflected in the text may be in one or more languages. For example, one part of the text may be in English and another part of the text may be in French. Although examples of English words are provided herein, embodiments of the invention are applicable to other languages including word-based languages.

Audio and text locations in the mapping

As described herein, a mapping includes a set of mapping records, with each mapping record associating an audio location with a text location.

Each audio location identifies a location in the audio data. An audio location can indicate an absolute position within the audio data, a relative position within the audio data, or a combination of an absolute and a relative position. As an example of an absolute position, as illustrated in Example A below, an audio location may indicate a time offset into the audio data (e.g., 04:32:24 indicating 4 hours, 32 minutes, 24 seconds), or a time range. As an example of a relative position, an audio location can indicate a chapter number, a paragraph number, and a line number. As an example of a combination of an absolute and a relative position, an audio location may indicate a chapter number and a time offset within the chapter indicated by the chapter number.

Similarly, each text location identifies a location in text data, such as the text version of a work. A text position may indicate an absolute position within the text version of the work, a relative position within the text version of the work, or a combination of an absolute and a relative position. As an example of an absolute position, a text position may indicate a byte offset into the text version of the work and/or an "anchor" within the text version of the work. An anchor is metadata that identifies a specific location or portion of text within the text data. An anchor may be stored separately from the text that is displayed to the end user, or may be stored within the text that is displayed to the end user. For example, the text may include the following sentence: "Why did the chicken<i name="123"/> cross the road?", where "<i name="123"/>" is an anchor. When the sentence is displayed to the user, the user only sees "Why did the chicken cross the road?". Similarly, the same sentence can have multiple anchors, as follows: "<i name="123"/>Why <i name="124"/>did <i name="125"/>the <i name="126"/>chicken <i name="127"/>cross <i name="128"/>the <i name="129"/>road?". In this example, there is an anchor before each word in the sentence.

As an example of a relative position, a text position may indicate a page number, a chapter number, a paragraph number, and/or a line number. As an example of a combination of an absolute and a relative position, a text position may indicate a chapter number and an anchor within the chapter indicated by the chapter number.
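
A mapping record of the kind described above can be represented in many ways. The following is a minimal Python sketch of one possible representation; the field names and the example values are illustrative assumptions rather than part of any particular file format.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class AudioLocation:
        chapter: Optional[int] = None      # relative component (optional)
        offset_seconds: float = 0.0        # absolute, or relative to the chapter

    @dataclass
    class TextLocation:
        chapter: Optional[int] = None      # relative component (optional)
        anchor: Optional[str] = None       # e.g. the anchor named "123"
        byte_offset: Optional[int] = None  # absolute alternative to an anchor

    @dataclass
    class MappingRecord:
        audio: AudioLocation
        text: TextLocation

    # Example: the audio at 04:32:24 is associated with anchor "123" in chapter 1.
    record = MappingRecord(
        AudioLocation(offset_seconds=4 * 3600 + 32 * 60 + 24),
        TextLocation(chapter=1, anchor="123"),
    )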

One example of the manner in which text positions and audio locations may be indicated is provided in the specification entitled "EPUB Media Overlays 3.0", which defines a usage of SMIL (Synchronized Multimedia Integration Language), EPUB style sheets, and EPUB content documents. An example of an association between a text position and an audio location, of the kind provided in that specification, is as follows:

Example A

In Example A, the "par" element consists of two child elements: a "text" element and an "audio" element. The text element contains an attribute, "src", which identifies a specific sentence in an XHTML document that contains content from the first chapter of a book. The audio element includes a "src" attribute that identifies an audio file containing the audio version of the first chapter of the book, a "clipBegin" attribute that identifies where the audio clip begins within that audio file, and a "clipEnd" attribute that identifies where the audio clip ends within that audio file. Thus, seconds 23 through 45 of the audio file correspond to the first sentence in chapter 1 of the book.
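
A sketch of what the media-overlay fragment in Example A might look like, consistent with the description above, is shown below; the file names (chapter1.xhtml, chapter1_audio.m4a) and the fragment identifier are illustrative assumptions.

    <par id="sentence1-par">
      <text src="chapter1.xhtml#sentence1"/>
      <audio src="chapter1_audio.m4a" clipBegin="0:00:23" clipEnd="0:00:45"/>
    </par>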

Map between text and audio

According to an embodiment, a mapping between the text version of a work and the audio version of the same work is generated automatically. Because the mapping is generated automatically, the mapping can have a much finer granularity than would be practical using manual text-to-audio mapping techniques. Each automatically generated text-to-audio mapping includes multiple mapping records, each of which associates a text position in the text version with an audio position in the audio version.

FIG. 1 is a flow diagram of a process 100 for automatically creating a mapping between a text version of a work and an audio version of the same work, in accordance with an embodiment of the present invention. At step 110, a speech-to-text analyzer receives audio data that reflects an audible version of the work. At step 120, while the speech-to-text analyzer performs its analysis of the audio data, the speech-to-text analyzer generates text for portions of the audio data. At step 130, based on the text generated for the portions of the audio data, the speech-to-text analyzer generates a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work.

Step 130 may involve the speech-to-text analyzer comparing the generated text with the text in the text version of the work to determine where, within the text version of the work, the generated text is found. For each portion of the generated text that is found in the text version of the work, the speech-to-text analyzer associates the following two items: (1) an audio location indicating where, within the audio data, the corresponding portion of the audio data is found, and (2) a text location indicating where, within the text version of the work, the text is found.
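
As an illustration of steps 110-130, the following Python sketch shows one way the generated text could be matched against the text version to build mapping records. The speech-to-text interface is a hypothetical stand-in: audio_portions is assumed to yield each recognized word together with its time offset in the audio data, and work_words is assumed to list each word of the text version together with its text position.

    def build_mapping(audio_portions, work_words):
        """audio_portions: iterable of (recognized_word, audio_offset_seconds).
        work_words: list of (word, text_position) pairs for the text version."""
        mapping = []
        cursor = 0  # current position in the text version of the work
        for recognized_word, audio_offset in audio_portions:
            # Scan forward from the current text position for the recognized word.
            for i in range(cursor, len(work_words)):
                word, text_position = work_words[i]
                if word.lower() == recognized_word.lower():
                    mapping.append({"audio": audio_offset, "text": text_position})
                    cursor = i + 1
                    break
            # Words that cannot be found (misread or extra narration) are skipped.
        return mapping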

Text context

Each work has a "textual context." The textual context of the text version of a work includes characteristics inherent to the text version of the work (for example, the language in which the text version of the work is written, the specific words used in the text version of the work, the grammar and punctuation used in the text version of the work, the way in which the text version of the work is structured, and so on) and characteristics extrinsic to the work (for example, the period in which the work was created, the genre of the work, the author of the work, and so on).

Different works can have significantly different textual contexts. For example, the grammar used in a classic English novel can be quite different from the grammar of modern poetry. Therefore, although a sequence of words may follow the rules of one grammar, the same sequence of words may violate the rules of another grammar. Similarly, the grammar used in both classic English novels and modern poetry may differ from the grammar (or lack of grammar) used in text messages sent from one teenager to another.

As mentioned above, one of the techniques described herein automatically creates a fine-grained mapping between the audio version of a work and the text version of the same work by performing a speech-to-text conversion of the audio version of the work. In an embodiment, the textual context of the work is used to increase the accuracy of the speech-to-text analysis performed on the audio version of the work. For example, to determine the grammar used in a work, the speech-to-text analyzer (or another process) can analyze the text version of the work before performing the speech-to-text analysis. The speech-to-text analyzer can then use the grammar information thus obtained to increase the accuracy of the speech-to-text analysis of the audio version of the work.

Instead of, or in addition to, deriving grammar information from the text version of the work, a user can provide input that identifies one or more grammar rules followed by the author of the work. The rules associated with the identified grammar are provided as input to the speech-to-text analyzer to assist the analyzer in recognizing the words in the audio version of the work.

Limit candidate dictionary based on text version

In general, a speech-to-text analyzer must be configured or designed to recognize virtually every word in the English language and, as appropriate, some words in other languages. Therefore, the speech-to-text analyzer must have access to a large dictionary of words. The dictionary from which the speech-to-text analyzer selects words during speech-to-text operations is referred to herein as the "candidate dictionary" of the speech-to-text analyzer. The number of unique words in a typical candidate dictionary is approximately 500,000.

In an embodiment, the text from the text version of the work is taken into account when performing the speech-to-text analysis of the audio version of the work. In particular, in one embodiment, during the speech-to-text analysis of the audio version of a work, the candidate dictionary used by the speech-to-text analyzer is limited to the particular set of words that are in the text version of the work. In other words, the only words considered as "candidates" during the speech-to-text operation performed on the audio version of the work are those words that actually appear in the text version of the work.

Speech-to-text operations can be significantly improved by limiting the candidate dictionary used for the speech-to-text translation of a particular work to those words that appear in the text version of the work. For example, suppose the number of unique words in a particular work is 20,000. A conventional speech-to-text analyzer may have difficulty determining which word, of the 500,000 words in its candidate dictionary, corresponds to a particular portion of the audio. However, when only the 20,000 unique words in the text version of the work are considered, the same portion of the audio may clearly correspond to a particular word. Thus, the accuracy of the speech-to-text analyzer can be significantly improved by using a much smaller dictionary of possible words.
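
A minimal sketch of deriving such a restricted candidate dictionary is shown below. The analyzer interface in the commented lines is hypothetical; the point illustrated is simply that the candidate set comes from the work itself rather than from a general-purpose 500,000-word dictionary.

    import re

    def candidate_dictionary(text_version: str) -> set:
        # Use only the unique words that actually appear in the text version of the work.
        return {w.lower() for w in re.findall(r"[A-Za-z']+", text_version)}

    # candidates = candidate_dictionary(open("work.txt").read())   # e.g. ~20,000 words
    # analyzer.set_candidate_words(candidates)                     # hypothetical analyzer API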

Limit candidate dictionary based on current location

To improve accuracy further, the candidate dictionary can be limited to fewer words than all of the words in the text version of the work. In an embodiment, the candidate dictionary is limited to those words found in a particular portion of the text version of the work. For example, during the speech-to-text translation of a work, it is possible to roughly track the "current translation position" of the translation operation relative to the text version of the work. This tracking can be performed, for example, by comparing (a) the text generated so far during the speech-to-text operation with (b) the text version of the work.

Once the current translation position has been determined, the candidate dictionary can be further restricted based on the current translation position. For example, in one embodiment, the candidate dictionary is limited to only those words that appear in the text version of the work after the current translation position. Therefore, words that are found before the current translation position, but not after it, are effectively removed from the candidate dictionary. This removal increases the accuracy of the speech-to-text analyzer, because the smaller the candidate dictionary, the less likely the speech-to-text analyzer is to translate a portion of the audio data into the wrong word.

As another example, an audio book and a digital book can be divided into a number of paragraphs or sections before the speech-to-text analysis. The audio book can be associated with an audio chapter mapping and the digital book can be associated with a text chapter mapping. For example, the audio chapter mapping and the text chapter mapping identify where each chapter begins or ends. These individual mappings can be used by the speech-to-text analyzer to limit the candidate dictionary. For example, if the speech-to-text analyzer determines, based on the audio chapter mapping, that it is analyzing the fourth chapter of the audio book, then the speech-to-text analyzer uses the text chapter mapping to identify chapter 4 of the digital book. The candidate dictionary is limited to the words found in chapter 4.

In a related embodiment, the speech-to-text analyzer uses a sliding window that moves as the current translation position moves. As the speech-to-text analyzer analyzes the audio data, the speech-to-text analyzer moves the sliding window "across" the text version of the work. The sliding window indicates two locations within the text version of the work. For example, the boundaries of the sliding window can be (a) the beginning of the paragraph preceding the current translation position, and (b) the end of the third paragraph after the current translation position. The candidate dictionary is limited to only those words that appear between those two locations.

Although specific examples are provided above, the window can span any amount of text within the text version of the work. For example, a window can span an absolute amount of text, such as 60 characters. As another example, the window can span a relative amount of text in the text version of the work, such as 10 words, 3 "lines" of text, 2 sentences, or 1 "page" of text. In the relative case, the speech-to-text analyzer can use formatting data within the text version of the work to determine how much of the text version of the work constitutes a line or a page. For example, the text version of the work may include page indicators (e.g., in the form of HTML or XML tags) that indicate, within the content of the text version of the work, where a page begins or ends.

In an embodiment, the beginning of the window corresponds to the current translation position. For example, the speech-to-text analyzer maintains the current text position of the most recently matched word in the text version of the work and the current audio position of the most recently recognized word in the audio data. Unless the narrator (whose voice is reflected in the audio data) misread the text version of the work during the recording, added his or her own content, or skipped portions of the text version of the work, the next word detected by the speech-to-text analyzer in the audio data (i.e., after the current audio position) is most likely the next word in the text version of the work (i.e., after the current text position). Maintaining these two positions can significantly increase the accuracy of the speech-to-text translation.
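
The sliding-window restriction can be sketched as follows. The window boundaries used here (the paragraph before the current translation position through the end of the third paragraph after it, per the example above) and the paragraph-based splitting are illustrative assumptions.

    def window_candidates(paragraphs, current_paragraph_index):
        """paragraphs: list of paragraph strings from the text version of the work.
        Returns the candidate words for the window around the current translation position."""
        start = max(0, current_paragraph_index - 1)               # paragraph before current
        end = min(len(paragraphs), current_paragraph_index + 4)   # through the 3rd paragraph after
        window_text = " ".join(paragraphs[start:end])
        return {w.lower() for w in window_text.split()}

    # As the current translation position advances, the window is recomputed,
    # so the candidate dictionary "slides" across the text version of the work.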

Using audio-to-audio correlation to create a mapping

In an embodiment, a text-to-speech generator and an audio-to-text correlator are used to automatically create a mapping between the audio version of a work and the text version of the work. FIG. 2 is a block diagram depicting these components and the data used to generate the mapping. A text version 210 of the work (such as an EPUB document) is input to text-to-speech generator 220. Text-to-speech generator 220 can be implemented in software, hardware, or a combination of hardware and software. Whether implemented in software or hardware, text-to-speech generator 220 can be implemented on a single computing device or distributed among multiple computing devices.

Text-to-speech generator 220 generates audio data 230 based on document 210. During the generation of audio data 230, text-to-speech generator 220 (or another component not shown) creates an audio-to-document mapping 240. Audio-to-document mapping 240 maps a plurality of text locations within document 210 to corresponding audio locations within the generated audio data 230.

For example, assume that text-to-speech generator 220 produces audio data for a word located at position Y within document 210. Further assume that the audio generated for that word is located at position X within audio data 230. To reflect the correlation between the location of the word within document 210 and the location of the corresponding audio within audio data 230, a mapping is created between position X and position Y.

Because text-to-speech generator 220 knows, at the time it produces the audio for a word or phrase, where the corresponding word or phrase appears within document 210, each mapping between corresponding words or phrases can be generated easily.

Audio-to-text correlator 260 accepts as input the generated audio data 230, audio book 250, and audio-to-document mapping 240. Audio-to-text correlator 260 performs two main steps: an audio-to-audio correlation step and a lookup step. In the audio-to-audio correlation step, audio-to-text correlator 260 compares the generated audio data 230 with audio book 250 to determine correlations between portions of audio data 230 and portions of audio book 250. For example, audio-to-text correlator 260 can determine, for each word represented in audio data 230, the position of the corresponding word in audio book 250.

The granularity at which audio data 230 is divided into portions for correlation purposes may vary from implementation to implementation. For example, a correlation can be established between each word in audio data 230 and each corresponding word in audio book 250. Alternatively, correlations can be established based on fixed-duration intervals (e.g., one mapping for every minute of audio). In yet another alternative, correlations can be established for portions of the audio based on other criteria, such as paragraph or chapter boundaries, significant pauses (e.g., more than 3 seconds of silence), or other locations based on data associated with audio book 250 (such as audio markers within audio book 250).

After identifying a correlation between a portion of audio data 230 and a portion of audio book 250, audio-to-text correlator 260 uses audio-to-document mapping 240 to identify the text location that corresponds (as indicated by mapping 240) to the audio location within the generated audio data 230. Audio-to-text correlator 260 then associates that text location with the audio location within audio book 250 to create a mapping record in document-to-audio mapping 270.

For example, assume that a portion of audio book 250 (at position Z) matches the portion of the generated audio data 230 at position X. Based on the mapping record (in audio-to-document mapping 240) that relates position X to position Y within document 210, a mapping record is created in document-to-audio mapping 270 that associates position Z of audio book 250 with position Y within document 210.

Audio-to-text correlator 260 repeats the audio-to-audio correlation and lookup steps for each portion of audio data 230. Thus, document-to-audio mapping 270 comes to contain multiple mapping records, each of which maps a location within document 210 to a location within audio book 250.
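
The lookup step can be sketched as follows: given (a) the correlations between positions in the generated audio data 230 and positions in audio book 250 produced by the audio-to-audio correlation step, and (b) audio-to-document mapping 240, a document-to-audio mapping 270 is assembled. The data shapes used here are illustrative assumptions.

    def build_document_to_audio_mapping(audio_correlations, audio_to_document_map):
        """audio_correlations: list of (generated_audio_pos, audio_book_pos) pairs.
        audio_to_document_map: dict of generated_audio_pos -> text position in the document."""
        document_to_audio = []
        for generated_pos, audio_book_pos in audio_correlations:
            text_pos = audio_to_document_map.get(generated_pos)
            if text_pos is not None:
                # Position Z in the audio book is associated with position Y in the document.
                document_to_audio.append({"text": text_pos, "audio": audio_book_pos})
        return document_to_audio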

In an embodiment, the audio-to-audio correlation for each portion of audio data 230 is immediately followed by the lookup step for that portion. Thus, a mapping record in document-to-audio mapping 270 can be created for one portion of audio data 230 before proceeding to the next portion of audio data 230. Alternatively, the audio-to-audio correlation step can be performed for many or all portions of audio data 230 before any of the lookup steps are performed. After all of the audio-to-audio correlations have been established, the lookup steps for all of the portions can be performed in a batch.

Mapping granularity

A mapping has several attributes, one of which is the size of the mapping, which refers to the number of mapping records in the mapping. Another attribute of a mapping is its "granularity." The "granularity" of a mapping refers to the number of mapping records in the mapping relative to the size of the digital work. Therefore, granularity can vary from one digital work to another. For example, a first mapping for a digital book containing 200 "pages" may include a mapping record for only each paragraph in the digital book. Therefore, the first mapping may contain 1,000 mapping records. On the other hand, a second mapping for a 20-page children's book may include a mapping record for each word in the children's book. Therefore, the second mapping may contain 800 mapping records. Although the first mapping contains more mapping records than the second mapping, the granularity of the second mapping is finer than the granularity of the first mapping.

In an embodiment, the granularity of the mapping may be specified based on input to the speech-to-text analyzer that generates the mapping. For example, a user can specify a particular granularity before causing the speech-to-text analyzer to generate a mapping. Non-limiting examples of specific granularities include:
- word granularity (i.e., an association for each word),
- sentence granularity (i.e., an association for each sentence),
- paragraph granularity (i.e., an association for each paragraph),
- 10-word granularity (i.e., a mapping for each 10-word portion of the digital work), and
- 10-second granularity (i.e., a mapping for every 10 seconds of audio).

As another example, a user may specify a type of digital work (e.g., a novel, a children's book, a short story), and the speech-to-text analyzer (or another process) determines the granularity based on the type of work. For example, a children's book may be associated with word granularity, and a novel may be associated with sentence granularity.

The granularity of the mapping can vary even within the same digital work. For example, the mapping for the first three chapters of a digital book may have sentence granularity, while the mapping for the remaining chapters of the digital book has word granularity.

On-the-fly mapping generation during a text-to-audio transition

In many cases, an audio-to-text mapping will be generated before a user needs to rely on the mapping. However, in one embodiment, an audio-to-text mapping is generated "on the fly," at run time or after a user has begun to access the audio data and/or the text data on the user's device. For example, a user uses a tablet computer to read the text version of a digital book. The tablet computer keeps track of the most recent page or chapter of the digital book that it has displayed to the user. The most recent page or chapter is identified by a "text bookmark."

Later, the user chooses to play the audio book version of the same book. The playback device may be the same tablet computer or another device than the one the user used to read the digital book. Regardless of which device will play the audio book, the text bookmark is retrieved, and at least a portion of the audio book undergoes speech-to-text analysis. During the speech-to-text analysis, "temporary" mapping records are generated that establish correlations between the generated text and the corresponding locations within the audio book.

Once the text and the correlation records have been generated, a text-to-text comparison is used to determine which portion of the generated text corresponds to the text bookmark. Next, the temporary mapping records are used to identify the portion of the audio book that corresponds to that portion of the generated text, and thus to the text bookmark. Playback of the audio book then begins at that position.

The portion of the audio book on which the speech-to-text analysis is performed can be limited to the portion corresponding to the text bookmark. For example, an audio chapter mapping indicating where certain portions of the audio book begin and/or end may already exist. For example, an audio chapter mapping can indicate where each chapter begins, where one or more pages begin, and so on. This audio chapter mapping can help determine where to begin the speech-to-text analysis so that the entire audio book does not need to be analyzed. For example, if the text bookmark indicates a position in chapter 12 of the digital book and the audio chapter mapping associated with the audio data identifies where chapter 12 begins in the audio data, then there is no need to perform speech-to-text analysis on any of the first 11 chapters of the audio book. For example, the audio data may consist of 20 audio files, one per chapter. In that case, only the audio file corresponding to chapter 12 is input to the speech-to-text analyzer.
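
A sketch of this bookmark-switching flow is shown below. The per-chapter data shapes and the speech_to_text function are hypothetical, and the sketch assumes the recognized words line up one-to-one with the words of the chapter text; the point illustrated is that only the audio for the bookmarked chapter is analyzed.

    def audio_position_for_text_bookmark(bookmark_chapter, bookmark_word_index,
                                         audio_chapter_files, speech_to_text):
        """bookmark_chapter: chapter number containing the text bookmark.
        bookmark_word_index: index of the bookmarked word within that chapter's text.
        audio_chapter_files: dict of chapter number -> audio file for that chapter.
        speech_to_text: function(audio_file) -> list of (word, audio_offset_seconds)."""
        # Only the audio for the bookmarked chapter is analyzed ("temporary" mapping records).
        temporary_records = speech_to_text(audio_chapter_files[bookmark_chapter])
        # The record whose position in the generated text corresponds to the bookmark
        # yields the audio offset at which playback should begin.
        index = min(bookmark_word_index, len(temporary_records) - 1)
        return temporary_records[index][1]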

On-the-fly mapping generation during an audio-to-text transition

Mapping records can also be generated on the fly to facilitate audio-to-text transitions as well as text-to-audio transitions. For example, suppose a user is listening to an audio book using a smart phone. The smart phone keeps track of the current playback location. The current location is identified by an "audio bookmark." Later, the user picks up a tablet computer and selects the digital book version of the audio book for display. The tablet computer receives the audio bookmark (e.g., from a central server that is remote relative to the tablet computer and the smart phone), performs speech-to-text analysis of at least a portion of the audio book, and identifies the portion of the text version of the book that corresponds to the audio bookmark in the audio book. The tablet computer then begins displaying the identified portion of the text version.

The portion of the audio book on which the speech-to-text analysis is performed can be limited to the portion corresponding to the audio bookmark. For example, the speech-to-text analysis may be performed on a portion of the audio book that spans one or more time segments (e.g., seconds) before the audio bookmark and/or one or more time segments after the audio bookmark. The text generated by the speech-to-text analysis of that portion is compared with the text in the text version to locate where the series of words or phrases in the generated text matches the text in the text version.

If there is a text chapter map indicating where certain parts of the text version begin or end, and the audio bookmark can be used to identify a chapter in the text chapter map, then it is not necessary to analyze most of the text version in order to locate where the series of words or phrases in the generated text matches the text in the text version. For example, if the audio bookmark indicates a position in Chapter 3 of the audio book, and the text chapter map associated with the digital book identifies where Chapter 3 begins in the text version, then speech-to-text analysis is not required on any of the first two chapters of the audio book or on any of the chapters after Chapter 3 of the audio book.
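
A comparable sketch for the opposite direction, again with invented names: the transcribe_window callable stands in for a speech-to-text component that transcribes only a short span of audio around the audio bookmark, and the result is a word offset within the candidate chapter of the text version.

    from typing import Callable, List

    def locate_text_for_audio_bookmark(
        audio_bookmark_seconds: float,
        transcribe_window: Callable[[float, float], List[str]],  # (start, end) -> words heard
        chapter_text_words: List[str],  # words of the chapter named by the audio chapter map
        window_seconds: float = 10.0,
    ) -> int:
        """Return the word offset in the chapter text that corresponds to the audio bookmark."""
        # Speech-to-text analysis is limited to a short window around the audio bookmark.
        start = max(0.0, audio_bookmark_seconds - window_seconds)
        heard = [w.lower() for w in transcribe_window(start, audio_bookmark_seconds + window_seconds)]
        haystack = [w.lower() for w in chapter_text_words]

        # Slide the heard phrase over the chapter text and keep the best-matching offset.
        best_offset, best_score = 0, -1
        for i in range(max(1, len(haystack) - len(heard) + 1)):
            score = sum(1 for a, b in zip(heard, haystack[i:i + len(heard)]) if a == b)
            if score > best_score:
                best_offset, best_score = i, score
        return best_offset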

Overview of the use of audio to text mapping

According to one approach, a mapping (whether established manually or automatically) is used to identify a location within an audio version of a digital work (e.g., an audio book) that corresponds to a location within a text version of the digital work (e.g., an e-book). For example, the mapping can be used to identify a location within an e-book based on a "bookmark" established in an audio book. As another example, while an audio recording of a person reading the text is playing, the mapping can be used to identify which displayed text corresponds to the audio recording and to cause the identified text to be highlighted. Therefore, while the audio book is playing, the user of an e-book reader can follow along as the e-book reader causes the corresponding text to be highlighted. As another example, the mapping can be used to identify a location in the audio data and to play the audio at that location in response to input that selects displayed text in the e-book. Therefore, the user can select a word in the e-book, and the selection causes the audio that corresponds to that word to be played. As another example, a user may create an annotation while consuming (e.g., reading or listening to) one version of a digital work (e.g., an e-book) and cause the annotation to be consumed while the user is consuming another version of the digital work (e.g., an audio book). Therefore, the user can make annotations on a "page" of the e-book and can later view those annotations while listening to the audio book version of the e-book. Similarly, the user can make an annotation while listening to the audio book and can later view the annotation while reading the corresponding e-book.

FIG. 3 is a flow diagram depicting a process 300 for using a mapping in one or more of these scenarios, in accordance with an embodiment of the present invention.

At step 310, location data that specifies a location within a first media item is obtained. The first media item may be the text version of a work or the audio data that corresponds to the text version of the work. This step can be performed by the device that consumes the first media item (operated by the user). Alternatively, this step can be performed by a server that is located remotely relative to the device that consumes the first media item. In that case, the device transmits the location data to the server over a network using a communication protocol.

At step 320, the mapping is inspected to determine a first media location that corresponds to the specified location. Similarly, this step can be performed by the device that consumes the first media item or by a server located remotely relative to that device.

At step 330, a second media location that corresponds to the first media location, and that is indicated in the mapping, is determined. For example, if the specified location is an audio "bookmark", the first media location is an audio location indicated in the mapping, and the second media location is the text location associated with that audio location in the mapping. Similarly, if the specified location is a text "bookmark", the first media location is a text location indicated in the mapping, and the second media location is the audio location associated with that text location in the mapping.

At step 340, the second media item is processed based on the second media location. For example, if the second media item is audio data, the second media location is an audio location and is used as the current playback position in the audio data. As another example, if the second media item is the text version of a work, the second media location is a text location and is used to determine which portion of the text version of the work to display.
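
As a purely illustrative sketch (the data layout and names below are assumptions, not a prescribed mapping format), steps 310 through 340 can be reduced to an ordered list of (audio position, text position) pairs and a lookup that works in either direction:

    import bisect
    from typing import List, Tuple

    # Each record pairs an audio position (seconds) with a text position
    # (simplified here to a character offset in the text version).
    MappingRecord = Tuple[float, int]

    def lookup(mapping: List[MappingRecord], specified, from_audio: bool):
        """Steps 320-330: find the record for the specified location, return the other side."""
        keys = [r[0] if from_audio else r[1] for r in mapping]
        # Select the record at or just before the specified location.
        i = max(0, bisect.bisect_right(keys, specified) - 1)
        audio_pos, text_pos = mapping[i]
        return text_pos if from_audio else audio_pos

    # Example: a tiny mapping, a text bookmark at character offset 1250, and an
    # audio bookmark at 40.0 seconds.
    mapping = [(0.0, 0), (12.5, 480), (31.2, 1100), (55.9, 1700)]
    print(lookup(mapping, 1250, from_audio=False))  # -> 31.2, used as the playback position
    print(lookup(mapping, 40.0, from_audio=True))   # -> 1100, used to pick the page to display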

Examples of using process 300 in particular contexts are provided below.

Architecture overview

Each of the example contexts mentioned above, and described in detail below, may involve one or more computing devices. FIG. 4 is a block diagram of an example system 400 that can be used to implement some of the processes described herein, in accordance with an embodiment of the present invention. System 400 includes end user device 410, intermediate device 420, and end user device 430. Non-limiting examples of end user devices 410 and 430 include desktop computers, laptop computers, smart phones, tablets, and other handheld computing devices.

As depicted in FIG. 4, device 410 stores digital media item 402 and executes text media player 412 and audio media player 414. Text media player 412 is configured to process electronic text data and cause device 410 to display text (e.g., on a touch screen of device 410, not shown). Thus, if digital media item 402 is an e-book, text media player 412 can be configured to process digital media item 402 as long as digital media item 402 is in a text format that text media player 412 is configured to process. Device 410 can execute one or more other media players (not shown) that are configured to process other types of media, such as video.

Similarly, audio media player 414 is configured to process audio data and cause device 410 to generate audio (e.g., via a speaker on device 410, not shown). Thus, if digital media item 402 is an audio book, audio media player 414 can be configured to process digital media item 402 as long as digital media item 402 is in an audio format that audio media player 414 is configured to process. Regardless of whether item 402 is an e-book or an audio book, item 402 can comprise multiple files, whether audio files or text files.

Device 430 similarly stores digital media item 404 and executes an audio media player 432 that is configured to process audio data and cause device 430 to generate audio. Device 430 can execute one or more other media players (not shown) configured to process other types of media, such as video and text.

Intermediate device 420 stores a mapping 406 that maps audio locations within audio data to text locations within text data. For example, mapping 406 can map audio locations within digital media item 404 to text locations within digital media item 402. Although not depicted in FIG. 4, intermediate device 420 can store many mappings, each associating the audio data and the text data of a different work. Also, intermediate device 420 can interact with many end user devices that are not shown.

Also, intermediate device 420 can store digital media items that users can access via their respective devices. Thus, a device (e.g., device 430) can request a digital media item from intermediate device 420 instead of storing a local copy of the digital media item.

Additionally, intermediate device 420 can store account data that associates one or more devices of a user with a single account. Thus, the account data may indicate that devices 410 and 430 are registered by the same user under the same account. Intermediate device 420 can also store account-item association data that associates an account with one or more digital media items owned (or purchased) by a particular user. Thus, intermediate device 420 can verify that device 430 may access a particular digital media item by determining whether the account-item association data indicates that device 430 and the particular digital media item are associated with the same account.

Although only two end user devices are depicted, an end user can own and operate more or fewer devices for consuming digital media items (such as e-books and audio books). Similarly, although only a single intermediate device 420 is depicted, the entity that owns and operates intermediate device 420 can operate multiple devices, each of which provides the same service or which operate together to provide services to the users of end user devices 410 and 430.

Communication between intermediate device 420 and end user devices 410 and 430 is made possible via network 440. Network 440 can be implemented by any medium or mechanism that provides for the exchange of data between the various computing devices. Examples of such a network include, without limitation, a local area network (LAN), a wide area network (WAN), Ethernet or the Internet, or one or more terrestrial, satellite, or wireless links. The network may include a combination of networks such as those described. The network can transmit data according to the Transmission Control Protocol (TCP), the User Datagram Protocol (UDP), and/or the Internet Protocol (IP).

Map storage location

The mapping can be stored separately from the text data and the audio data from which the mapping is generated. For example, as depicted in FIG. 4, mapping 406 is stored separately from digital media items 402 and 404, even though mapping 406 can be used to identify media locations in one of the digital media items based on media locations in the other. In fact, mapping 406 is stored on a computing device (intermediate device 420) that is separate from devices 410 and 430, which store digital media items 402 and 404, respectively.

Additionally or alternatively, the mapping can be stored as part of the corresponding text data. For example, mapping 406 can be stored in digital media item 402. However, even if the mapping is stored as part of the text data, the mapping may not be displayed to an end user who consumes the text data. Additionally or alternatively, the mapping can be stored as part of the audio data. For example, mapping 406 can be stored in digital media item 404.
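
One possible on-disk representation is sketched below; it is only an example (the JSON layout, the file name, and the chapter/paragraph/sentence coordinates are assumptions), but it shows how the same small document could sit on an intermediate device, inside the e-book package, or inside the audio book package without being shown to the end user.

    import json

    # Hypothetical mapping for one work: audio offsets in seconds paired with
    # text anchors expressed as chapter/paragraph/sentence coordinates.
    mapping = {
        "work_id": "example-work-001",
        "records": [
            {"audio_seconds": 0.0,   "text": {"chapter": 1, "paragraph": 1, "sentence": 1}},
            {"audio_seconds": 94.3,  "text": {"chapter": 1, "paragraph": 5, "sentence": 2}},
            {"audio_seconds": 512.8, "text": {"chapter": 2, "paragraph": 1, "sentence": 1}},
        ],
    }

    # Stored separately from both media items (e.g., on the intermediate device) ...
    with open("mapping-example-work-001.json", "w", encoding="utf-8") as f:
        json.dump(mapping, f, indent=2)

    # ... or embedded in either media item's container as just another member file,
    # in which case the reader or player simply ignores it when rendering pages or
    # playing audio.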

Bookmark switching

"Bookmark switching" refers to the creation of a specified location (or "bookmark") in one version of a digital work and the use of a bookmark to find the corresponding location within another version of a digital work. There are two types of bookmark switching: text to audio (TA) bookmark switching and audio to text (AT) bookmark switching. TA bookmark switching involves using a text bookmark created in an e-book to identify the corresponding audio location in the audio book. In contrast, another type of bookmark switching, referred to herein as AT bookmark switching, involves the use of audio tags built into the audio book to identify corresponding text locations within the e-book.

Text-to-audio bookmark switching

FIG. 5A is a flow diagram depicting a process 500 for TA bookmark switching in accordance with an embodiment of the present invention. Figure 5A is described using elements of system 400 depicted in Figure 4.

At step 502, text media player 412 (e.g., an e-reader) determines a text bookmark within digital media item 402 (e.g., a digital book). Device 410 displays the content from digital media item 402 to the user of device 410.

The text bookmark can be determined in response to input from the user. For example, the user may touch an area on the touch screen of device 410 where one or more words are displayed at or near that area. In response to the input, text media player 412 determines the one or more words that are closest to the touched area. Text media player 412 then determines the text bookmark based on the one or more determined words.

Alternatively, the text bookmark can be determined based on the last text data displayed to the user. For example, digital media item 402 can comprise 200 electronic "pages", and page 110 is the last page displayed. Text media player 412 determines that page 110 was the last page displayed. Text media player 412 can establish page 110 as the text bookmark, or can establish a point at the beginning of page 110 as the text bookmark, since there may be no way to know where on the page the user stopped reading. It may be assumed that the user read at least the last sentence that begins on page 109, which may have ended on page 109 or on page 110. Thus, text media player 412 can establish the beginning of the next sentence (which begins on page 110) as the text bookmark. However, if the granularity of the mapping is at the paragraph level, text media player 412 can establish the beginning of the last paragraph that begins on page 109 as the text bookmark. Similarly, if the granularity of the mapping is at the chapter level, text media player 412 can establish the beginning of the chapter that includes page 110 as the text bookmark.

At step 504, text media player 412 sends data indicating the text bookmark to intermediate device 420 over network 440. Intermediate device 420 can store the text bookmark in association with the user of device 410 and/or with device 410. Prior to step 502, the user may have established an account with the operator of intermediate device 420. The user then registers one or more devices, including device 410, with the operator. Registration causes each of the one or more devices to be associated with the user's account.

One or more factors may cause text media player 412 to send the text bookmark to intermediate device 420. Such factors may include the exiting (or shutting down) of text media player 412, the establishment of the text bookmark by the user, or an explicit command from the user to save the text bookmark for use in listening to the audio book that corresponds to the text version of the work for which the text bookmark was established.

As previously mentioned, the intermediary device 420 accesses (e.g., stores) the mapping 406, which in this example maps a plurality of audio locations in the digital media item 404 to a plurality of text locations within the digital media item 402.

At step 506, intermediate device 420 inspects mapping 406 to determine, among the plurality of text locations, a particular text location that corresponds to the text bookmark. The text bookmark may not exactly match any of the text locations in mapping 406. However, intermediate device 420 can select the text location that is closest to the text bookmark. Alternatively, intermediate device 420 can select the text location that immediately precedes the text bookmark, which may or may not be the text location closest to the text bookmark. For example, if the text bookmark indicates the fifth sentence of the third paragraph of Chapter 5, and the closest text locations in mapping 406 are (1) the first sentence of the third paragraph of Chapter 5 and (2) the sixth sentence of the third paragraph of Chapter 5, then text location (1) is selected, even though text location (2) is closer to the text bookmark.

At step 508, upon identifying a particular text location in the map, intermediate device 420 determines a particular audio location in map 406 that corresponds to a particular text location.

At step 510, intermediate device 420 sends a particular audio location to device 430, which in this example is different than device 410. For example, device 410 can be a tablet and device 430 can be a smart phone. In a related embodiment, device 430 is not involved. Thus, the intermediate device 420 can send a particular audio location to the device 410.

Step 510 can be performed automatically (i.e., in response to intermediate device 420 determining the particular audio location). Alternatively, step 510 (or step 506) can be performed in response to receiving, from device 430, an indication that device 430 is planning to process digital media item 404. The indication can be a request for the audio location that corresponds to the text bookmark.

At step 512, audio media player 432 establishes the particular audio location as the current playback position of the audio data in digital media item 404. This establishment can be performed in response to receiving the particular audio location from intermediate device 420. Because the current playback position becomes the particular audio location, audio media player 432 is not required to play any of the audio that precedes the particular audio location in the audio data. For example, if the particular audio location indicates 2:56:03 (2 hours, 56 minutes, and 3 seconds), audio media player 432 establishes that time in the audio data as the current playback position. Thus, if the user of device 430 selects a "play" button (whether graphical or physical) on device 430, audio media player 432 begins processing the audio data at the 2:56:03 mark.

In an alternative embodiment, device 410 stores mapping 406 (or a copy of mapping 406). Thus, instead of steps 504 through 508, text media player 412 inspects mapping 406 to determine the particular text location, among the plurality of text locations, that corresponds to the text bookmark. Next, text media player 412 determines the particular audio location in mapping 406 that corresponds to the particular text location. Text media player 412 can then cause the particular audio location to be sent to intermediate device 420 to allow device 430 to retrieve the particular audio location and establish the current playback position in the audio data as the particular audio location. Text media player 412 can also cause the particular text location (or the text bookmark) to be sent to intermediate device 420 to allow device 410 (or another device not shown) to later retrieve the particular text location, so that another text media player executing on the other device can display a portion (e.g., a page) of another copy of digital media item 402, where the portion corresponds to the particular text location.

In another alternative embodiment, intermediate device 420 and device 430 are not involved. Therefore, steps 504 and 510 are not performed. Instead, device 410 performs all of the other steps of FIG. 5A, including steps 506 and 508.
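
To tie the steps of process 500 together, the sketch below (class and method names are invented for illustration, and the per-account bookkeeping is a simplification) shows an intermediate device receiving a text bookmark at step 504, resolving it through the mapping at steps 506 and 508, and handing the resulting audio location to whichever registered device asks for it at step 510.

    import bisect
    from typing import Dict, List, Tuple

    class IntermediateDevice:
        def __init__(self, mapping: List[Tuple[int, float]]):
            # mapping: (text position, audio position in seconds), sorted by text position.
            self.mapping = mapping
            self.text_bookmarks: Dict[str, int] = {}  # account id -> text position

        def save_text_bookmark(self, account: str, text_position: int) -> None:
            """Step 504: a text media player reports the user's text bookmark."""
            self.text_bookmarks[account] = text_position

        def audio_location_for(self, account: str) -> float:
            """Steps 506-510: resolve the bookmark and return the matching audio location."""
            bookmark = self.text_bookmarks[account]
            keys = [text for text, _ in self.mapping]
            i = max(0, bisect.bisect_right(keys, bookmark) - 1)  # entry at or just before
            return self.mapping[i][1]

    server = IntermediateDevice([(0, 0.0), (500, 60.0), (1200, 150.0)])
    server.save_text_bookmark("user-1", 900)      # e.g., reported by a tablet
    print(server.audio_location_for("user-1"))    # e.g., requested by a smart phone -> 60.0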

Audio-to-text bookmark switching

FIG. 5B is a flow diagram depicting a process 550 for AT bookmark switching in accordance with an embodiment of the present invention. Similar to FIG. 5A, FIG. 5B is described using elements of system 400 depicted in FIG. 4.

At step 552, the audio media player 432 determines an audio bookmark within the digital media item 404 (eg, an audio book).

The audio bookmark can be determined in response to input from the user. For example, the user may stop playback of the audio data, for example by selecting a "stop" button displayed on the touch screen of device 430. Audio media player 432 determines the location within the audio data of digital media item 404 that corresponds to where playback stopped. Thus, the audio bookmark may simply be the last place where the user stopped listening to the audio generated from digital media item 404. Additionally or alternatively, the user can select one or more graphical buttons on the touch screen of device 430 to establish a particular location within digital media item 404 as the audio bookmark. For example, device 430 displays a timeline that corresponds to the length of the audio data in digital media item 404. The user can select a location on the timeline and then provide one or more additional inputs that audio media player 432 uses to establish the audio bookmark.

At step 554, device 430 sends data indicating the audio bookmark to intermediate device 420 over network 440. Intermediate device 420 can store the audio bookmark in association with the user of device 430 and/or with device 430. Prior to step 552, the user establishes an account with the operator of intermediate device 420. The user then registers one or more devices, including device 430, with the operator. Registration causes each of the one or more devices to be associated with the user's account.

Intermediate device 420 also accesses (e.g., stores) map 406. The map 406 maps a plurality of audio locations in the audio material of the digital media item 404 to a plurality of text locations within the text material of the digital media item 402.

One or more factors may cause audio media player 432 to send the audio bookmark to intermediate device 420. Such factors may include the exiting (or shutting down) of audio media player 432, the establishment of the audio bookmark by the user, or an explicit command from the user to save the audio bookmark for use in displaying the text version (reflected in digital media item 402) of the work that corresponds to digital media item 404.

At step 556, intermediate device 420 inspects mapping 406 to determine, among the plurality of audio locations, a particular audio location that corresponds to the audio bookmark. The audio bookmark may not exactly match any of the audio locations in mapping 406. However, intermediate device 420 can select the audio location that is closest to the audio bookmark. Alternatively, intermediate device 420 can select the audio location that immediately precedes the audio bookmark, which may or may not be the audio location closest to the audio bookmark. For example, if the audio bookmark indicates 02:43:19 (2 hours, 43 minutes, and 19 seconds), and the closest audio locations in mapping 406 are (1) 02:41:07 and (2) 02:43:56, then audio location (1) is selected, even though audio location (2) is closest to the audio bookmark.

At step 558, upon identifying a particular audio location in the map, intermediate device 420 determines a particular text location in map 406 that corresponds to the particular audio location.

At step 560, the intermediate device 420 sends a particular text location to the device 410, which in this example is different from the device 430. For example, device 410 can be a tablet, and device 430 can be a smart phone configured to process audio material and produce audible sound.

Step 560 can be performed automatically (i.e., in response to intermediate device 420 determining the particular text location). Alternatively, step 560 (or step 556) can be performed in response to receiving, from device 410, an indication that device 410 is planning to process digital media item 402. The indication can be a request for the text location that corresponds to the audio bookmark.

At step 562, text media player 412 displays information about the particular text location. Step 562 can be performed in response to receiving the particular text location from intermediate device 420. Device 410 is not required to display any of the content that precedes the particular text location in the text version of the work reflected in digital media item 402. For example, if the particular text location indicates the fourth sentence of the second paragraph of Chapter 3, device 410 displays the page that includes that sentence. Text media player 412 can cause a marker to be displayed at the particular text location in the page that visually indicates to the user of device 410 where to begin reading. Therefore, the user can immediately begin reading the text version from the position that corresponds to the last words spoken by the narrator in the audio book.

In an alternative embodiment, device 410 stores mapping 406. Thus, instead of steps 556 through 560, after step 554 (in which device 430 sends data indicating the audio bookmark to intermediate device 420), intermediate device 420 sends the audio bookmark to device 410. Next, text media player 412 inspects mapping 406 to determine the particular audio location, among the plurality of audio locations, that corresponds to the audio bookmark. Next, text media player 412 determines the particular text location in mapping 406 that corresponds to the particular audio location. This alternative process then proceeds to step 562 described above.

In another alternative embodiment, intermediate device 420 is not involved. Therefore, steps 554 and 560 are not performed. Instead, device 430 performs all of the other steps of FIG. 5B, including steps 556 and 558.

Highlighting text in response to playing audio

In an embodiment, while the audio data that corresponds to the text version of a work is playing, text from the text version of the work is highlighted, or "lit up". As mentioned previously, the audio data is an audio version of the text version of the work and may reflect a human user reading aloud from the text version. As used herein, "highlighting" text refers to a media player (e.g., an "e-reader") visually distinguishing that text from other text that is displayed at the same time. Highlighting text can involve changing the font of the text, changing the font style of the text (e.g., italics, bold, underline), changing the size of the text, changing the color of the text, changing the background color of the text, or creating an animation associated with the text. An example of creating an animation is causing the text (or the background of the text) to blink or change color intermittently. Another example of creating an animation is causing a graphic to appear above, below, or around the text. For example, in response to the media player playing and detecting the word "toaster", the media player displays an image of a toaster above the word "toaster" in the displayed text. Another example of an animation is a bouncing ball that "bounces" on different portions of the displayed text (e.g., a word, syllable, or letter) as the corresponding audio is played.

FIG. 6 is a flow diagram depicting a process 600 for causing text from a text version of a work to be highlighted while an audio version of the work is being played, in accordance with an embodiment of the present invention.

At step 610, the current playback position (which is constantly changing) within the audio data of the audio version is determined. This step can be performed by a media player executing on a user's device. The media player processes the audio data to generate audio for the user.

At step 620, a mapping record in the map is identified based on the current play position. The current playback position can match or almost match the audio location identified in the mapping record.

If the media player has access to the mapping, which maps the plurality of audio locations in the audio data to the plurality of text locations in the text version of the work, step 620 can be performed by the media player. Alternatively, step 620 can be performed by another process executing on the user's device, or by a server that receives the current playback position from the user's device over a network.

At step 630, the location of the text identified in the mapping record is identified.

At step 640, the portion of the text version of the work that corresponds to the text location is caused to be highlighted. This step can be performed by the media player or by another software application executing on the user's device. If a server performs the lookup steps (620 and 630), step 640 may further involve the server sending the text location to the user's device. In response, the media player or another software application accepts the text location as input and causes the corresponding text to be highlighted.

In an embodiment, different text locations identified by the media player in the mapping are associated with different types of highlighting. For example, one text location in the mapping can be associated with a change of font color from black to red, while another text location in the mapping can be associated with an animation (such as a graphic of a toaster from which a piece of toast is displayed "popping out"). Therefore, each mapping record in the mapping may include "highlight data" that indicates how the text identified by the corresponding text location is to be highlighted. Thus, for each mapping record that the media player identifies and that includes highlight data, the media player uses the highlight data to determine how the text is highlighted. If a mapping record does not include highlight data, the media player may not highlight the corresponding text. Alternatively, if a mapping record does not include highlight data, the media player can use a "default" highlighting technique (e.g., making the text bold) to highlight the text.
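
A minimal sketch of process 600 follows; the polling style, the record layout, and the highlight-style strings are all assumptions made for illustration, not a required implementation.

    import bisect
    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    @dataclass
    class MappingRecord:
        audio_seconds: float
        text_range: Tuple[int, int]  # (start, end) character offsets in the text version
        highlight: Optional[str]     # e.g. "bold", "red", an animation name, or None

    def highlight_for_position(records: List[MappingRecord], playback_seconds: float):
        """Steps 620-640: pick the record for the current playback position and
        return which text to highlight and how."""
        keys = [r.audio_seconds for r in records]
        i = max(0, bisect.bisect_right(keys, playback_seconds) - 1)
        record = records[i]
        style = record.highlight or "bold"  # fall back to a preset highlighting technique
        return record.text_range, style

    records = [
        MappingRecord(0.0, (0, 42), None),
        MappingRecord(3.2, (43, 90), "red"),
        MappingRecord(7.8, (91, 130), "toaster-animation"),
    ]
    # Called repeatedly (e.g., on a timer) as the current playback position changes:
    print(highlight_for_position(records, 4.0))  # -> ((43, 90), 'red')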

Highlighting text based on audio input

FIG. 7 is a flow diagram depicting a process 700 for highlighting displayed text in response to audio input from a user, in accordance with an embodiment of the present invention. In this embodiment, no mapping is required. The audio input is used to highlight text within the portion of the text version of a work that is being displayed to the user.

At step 710, audio input is received. The audio input can be based on the user reading aloud text from the text version of a work. The audio input can be received by the device that displays the portion of the text version. The device may prompt the user to read a word, a phrase, or an entire sentence. The prompt can be a visual cue or an audio cue. As an example of a visual cue, at the same time as (or just before) the device displays a sentence with an underline, the device can cause the following text to be displayed: "Please read the underlined text." As an example of an audio cue, the device can cause computer-generated speech that says "Please read the underlined text" to be played, or cause pre-recorded human speech that gives the same instruction to be played.

At step 720, speech-to-text analysis is performed on the audio input to detect one or more words reflected in the audio input.

At step 730, each detected word reflected in the audio input is compared against a particular set of words. The particular set of words can be all of the words that are currently displayed by a computing device (e.g., an e-reader). Alternatively, the particular set of words can be all of the words that the user was prompted to read.

At step 740, for each detected word that matches a word in the particular set, the device causes the matching word to be highlighted.

The steps depicted in process 700 can be performed by a single computing device that displays text from the text version of the work. Alternatively, the steps depicted in process 700 can be performed by one or more computing devices other than the computing device that displays the text. For example, the audio input received from the user in step 710 can be sent over a network from the user's device to a web server that performs the speech-to-text analysis. The web server then sends highlight data to the user's device so that the user's device highlights the appropriate text.
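
For illustration, the sketch below walks through steps 720 through 740 under stated assumptions: the speech_to_text callable stands in for whatever recognizer is used, the highlight callback stands in for the display layer, and the punctuation-stripping match is a simplification rather than the claimed method.

    from typing import Callable, List

    def highlight_read_words(
        audio_input: bytes,
        displayed_words: List[str],
        speech_to_text: Callable[[bytes], List[str]],
        highlight: Callable[[int], None],
    ) -> None:
        """Highlight each displayed word that the user was detected reading aloud."""
        normalize = lambda w: w.lower().strip(".,!?\"'")
        detected = {normalize(w) for w in speech_to_text(audio_input)}      # step 720
        for index, word in enumerate(displayed_words):                      # step 730
            if normalize(word) in detected:                                 # step 740
                highlight(index)

    # Example with stand-ins for the recognizer and the display layer:
    fake_recognizer = lambda _audio: ["please", "read", "the", "underlined", "text"]
    highlight_read_words(b"", ["Please", "read", "the", "underlined", "text."],
                         fake_recognizer, lambda i: print("highlight word", i))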

Playing audio in response to text selection

In an embodiment, a user of a media player that displays a portion of the text version of a work may select part of the displayed text and cause the corresponding audio to be played. For example, if a word displayed from a digital book is "donut" and the user selects that word (for example, by touching the part of the media player's touch screen that displays the word), then the audio that corresponds to "donut" can be played.

A mapping that maps text locations in the text version of the work to audio locations in the audio data is used to identify the portion of the audio data that corresponds to the selected text. The user can select a single word, a phrase, or even one or more sentences. In response to input that selects a portion of the displayed text, the media player can identify one or more text locations. For example, the media player can identify a single text location that corresponds to the selected portion, even if the selected portion spans multiple lines or sentences. The identified text location may correspond to the beginning of the selected portion. As another example, the media player can identify a first text location that corresponds to the beginning of the selected portion and a second text location that corresponds to the end of the selected portion.

The media player uses the identified text location to find the mapping record in the mapping that indicates the text location closest to the identified text location (or closest to, but preceding, the identified text location). The media player then uses the audio location indicated in that mapping record to identify where in the audio data to begin processing in order to generate audio. If only a single text location is identified, only the word or sound at or near the corresponding audio location may be played. In that case, after the word or sound is played, the media player stops so that no further audio is played. Alternatively, the media player begins playing at or near the audio location and does not stop playback of the audio that follows the audio location until (a) the end of the audio data is reached, (b) further input is received from the user (e.g., selection of a "stop" button), or (c) a pre-specified stopping point in the audio data is reached (e.g., the end of a page or chapter, at which additional input is required to continue).

If the media player identifies two text locations based on the selected portion, then two audio locations are identified and can be used to determine where to start playing and where to stop playing the corresponding audio.
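
The following sketch shows one way a media player could turn a text selection into a playback range; the mapping layout and names are assumptions for the example, and the rule of choosing the record at or just before each text location mirrors the bookmark handling described earlier.

    import bisect
    from typing import List, Optional, Tuple

    # (text character offset, audio position in seconds), sorted by text offset.
    Mapping = List[Tuple[int, float]]

    def playback_range_for_selection(
        mapping: Mapping,
        selection_start: int,
        selection_end: Optional[int] = None,
    ) -> Tuple[float, Optional[float]]:
        """Return (start, stop) audio positions for the selected text; stop is None
        when only a single text location was identified."""
        keys = [text for text, _ in mapping]

        def audio_at(text_offset: int) -> float:
            i = max(0, bisect.bisect_right(keys, text_offset) - 1)
            return mapping[i][1]

        start = audio_at(selection_start)
        stop = audio_at(selection_end) if selection_end is not None else None
        return start, stop

    mapping = [(0, 0.0), (120, 15.0), (260, 33.5), (410, 52.0)]
    print(playback_range_for_selection(mapping, 130))       # single word  -> (15.0, None)
    print(playback_range_for_selection(mapping, 130, 300))  # phrase       -> (15.0, 33.5)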

In an embodiment, the audio identified by the audio location may be played slowly (i.e., at a slow playback speed) or continuously, without advancing past the current playback position in the audio data. For example, if the user of a tablet selects the displayed word "two" by touching the tablet's touch screen with a finger and continuously touching the displayed word (i.e., without lifting the finger and without moving the finger to another displayed word), then the tablet plays the corresponding audio in a way that creates the sound of the word being read as "twoooooooooooooooo".

In a similar embodiment, the speed at which the user drags a finger across the displayed text on the media player's touch screen dictates the speed at which the corresponding audio is played. For example, the user selects the letter "d" of the displayed word "donut" and then slowly moves the finger across the displayed word. In response to this input, the media player identifies the corresponding audio data (using the mapping) and plays the corresponding audio at the same speed at which the user moves the finger. Therefore, the media player creates a sound as though the narrator of the audio version of the work is reading the word "donut" as "dooooooonnnnnnuuuuuut".

In a similar embodiment, the user "touches" the time of the word displayed on the touch screen to indicate the extent to which the audio version of the word is played quickly or slowly. For example, the quick tap of the user's finger on the displayed word causes the corresponding audio to be played at normal speed, and the user presses his finger on the selected word for more than 1 second to play the corresponding audio at a normal speed of 1⁄2. .

Transferring user annotations

In an embodiment, a user initiates the creation of an annotation while consuming one media version (e.g., audio) of a digital work, and the annotation is associated with another media version (e.g., text) of the digital work. Thus, although an annotation may be created in the context of one type of media, the annotation can be consumed in the context of another type of media. The "content context" in which an annotation is created or consumed refers to whether text is being displayed or audio is being played when the creation or consumption takes place.

Although the following examples involve determining an audio location or a text location at the time the annotation is created, some embodiments of the invention are not so limited. For example, when an annotation is consumed in the text content context, the current playback position within the audio file at the time the annotation was created in the audio content context might not be used. Instead, the device can display an indication of the annotation at the beginning or end of the corresponding text version, or on each "page" of the corresponding text version. As another example, when an annotation is consumed in the audio content context, the text that was displayed when the annotation was created in the text content context might not be used. Instead, the device can display an indication of the annotation at the beginning or end of the corresponding audio version, or can display the indication continuously while the audio version is playing. In addition to or instead of a visual indication, an audio indication of the annotation can be played. For example, a beep can be played at the same time as the audio track, in such a way that both the beep and the audio track can be heard.

FIGS. 8A and 8B are flow diagrams depicting processes for transferring an annotation from one content context to another content context, in accordance with an embodiment of the present invention. Specifically, FIG. 8A is a flow diagram depicting a process 800 for creating an annotation in the "text" content context and consuming the annotation in the "audio" content context, while FIG. 8B is a flow diagram depicting a process 850 for creating an annotation in the "audio" content context and consuming the annotation in the "text" content context. The creation and consumption of an annotation can occur on the same computing device (e.g., device 410) or on separate computing devices (e.g., devices 410 and 430). FIG. 8A depicts a scenario in which the annotation is created and consumed on device 410, while FIG. 8B depicts a scenario in which the annotation is created on device 430 and later consumed on device 410.

At step 802 in FIG. 8A, text media player 412, executing on device 410, causes text from digital media item 402 to be displayed (e.g., in the form of a page).

At step 804, text media player 412 determines a text location within the text version of the work reflected in digital media item 402. The text location is eventually stored in association with the annotation. The text location can be determined in several ways. For example, text media player 412 can receive input that selects a text location within the displayed text. The input can be the user touching the touch screen of device 410 (which displays the text) for a period of time. The input can select a specific word, a number of words, the beginning or end of a page, a location before or after a sentence, and so on. The input may also include first selecting a button, which causes text media player 412 to change to a "create annotation" mode in which an annotation can be created and associated with the text location.

As another example of determining the text location, text media player 412 automatically determines the text location (without user input) based on which portion of the text version of the work (reflected in digital media item 402) is being displayed. For example, if device 410 is displaying page 20 of the text version of the work, the annotation will be associated with page 20.

At step 806, text media player 412 receives input that selects a "Create Annotation" button, which can be displayed on the touch screen. The button can be displayed in response to the input that selects the text location in step 804, for example when the user touches the touch screen for a period of time (such as one second).

Although step 804 is depicted as occurring prior to step 806, alternatively, the selection of the "Create Annotation" button may occur prior to determining the text position.

At step 808, text media player 412 receives input to create annotation data. The input can be voice data (such as the user speaking into a microphone of device 410) or text data (such as the user selecting keys on a keyboard, whether physical keys or graphical keys). If the annotation data is voice data, text media player 412 (or another process) can perform speech-to-text analysis on the voice data to create a text version of the voice data.

At step 810, text media player 412 stores the annotation data in association with the text location. Text media player 412 uses a mapping (e.g., a copy of mapping 406) to identify the particular text location in the mapping that is closest to the text location determined in step 804. Next, using the mapping, text media player 412 identifies the audio location that corresponds to the particular text location.

Alternatively to step 810, text media player 412 sends the annotation data and the text location to intermediate device 420 over network 440. In response, intermediate device 420 stores the annotation data in association with the text location. Intermediate device 420 uses a mapping (e.g., mapping 406) to identify the particular text location in mapping 406 that is closest to the text location. Next, using mapping 406, intermediate device 420 identifies the audio location that corresponds to the particular text location. Intermediate device 420 sends the identified audio location to device 410 over network 440. Intermediate device 420 can send the identified audio location in response to a request from device 410 for audio data and/or for annotations associated with the audio data. For example, in response to a request for an audio book version of "The Tale of Two Cities", intermediate device 420 determines whether any annotation data associated with that audio book exists and, if so, sends the annotation data to device 410.

Step 810 can also include storing date and/or time information that indicates when the annotation was created. This information can be displayed later, when the annotation is consumed in the audio content context.
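
The sketch below (data shapes and names are invented for illustration) captures the essence of steps 804 through 810: the annotation is stored against the text location at which it was created, together with the audio location derived from the mapping and a creation timestamp.

    import bisect
    import time
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Annotation:
        body: str             # the note itself (or a transcript of a voice note)
        text_position: int    # where the annotation was created in the text version
        audio_seconds: float  # derived via the mapping, for later consumption in audio
        created_at: float = field(default_factory=time.time)

    def create_annotation(mapping: List[Tuple[int, float]],
                          text_position: int, body: str) -> Annotation:
        keys = [text for text, _ in mapping]
        i = max(0, bisect.bisect_right(keys, text_position) - 1)  # closest preceding record
        return Annotation(body=body, text_position=text_position,
                          audio_seconds=mapping[i][1])

    mapping = [(0, 0.0), (500, 70.0), (900, 131.0)]
    note = create_annotation(mapping, 620, "Check this passage against Chapter 2.")
    print(note.audio_seconds)  # an audio player can surface the note around 70.0 seconds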

At step 812, audio media player 414 plays audio by processing the audio data of digital media item 404, which in this example (although not depicted) may be stored on device 410 or may be streamed from intermediate device 420 to device 410 over network 440.

At step 814, audio media player 414 determines when the current playback position in the audio data matches, or nearly matches, the audio location identified in step 810 using mapping 406. Alternatively, audio media player 414 can cause information indicating that an annotation is available to be displayed regardless of where the current playback position is and without playing any audio; in other words, step 812 is unnecessary. For example, the user can start audio media player 414 and cause audio media player 414 to load the audio data of digital media item 404. Audio media player 414 determines that annotation data is associated with the audio data. Audio media player 414 causes information about the audio data (e.g., title, artist, genre, length, and so forth) to be displayed without generating any audio from the audio data. The information may include a reference to the annotation data and information about the location in the audio data that is associated with the annotation data, where that location corresponds to the audio location identified in step 810.

At step 816, audio media player 414 consumes the annotation data. If the annotation data is voice data, consuming the annotation data may involve processing the voice data to generate audio, or converting the voice data into text data and displaying the text data. If the annotation data is text data, consuming the annotation data may involve, for example, displaying the text data in a side panel of a GUI that displays attributes of the audio data being played, or in a new window that appears separate from the GUI. Non-limiting examples of such attributes include the length of the audio data, an absolute position within the audio data (e.g., a time offset), a relative position within the audio data (e.g., a chapter number and paragraph number), the current playback position, a waveform of the audio data, and the title of the digital work.

As previously mentioned, FIG. 8B depicts a scenario in which the annotation is created on device 430 and consumed on device 410 at a later time.

At step 852, audio media player 432 processes the audio data of digital media item 404 to play audio.

At step 854, audio media player 432 determines an audio location within the audio data. The audio location is eventually stored in association with the annotation. The audio location can be determined in several ways. For example, audio media player 432 can receive input that selects an audio location within the audio data. The input can be the user touching the touch screen of device 430 (which displays attributes of the audio data) for a period of time. The input can select an absolute position on a timeline that reflects the length of the audio data, or a relative position within the audio data (such as a chapter number and a paragraph number). The input may also include first selecting a button, which causes audio media player 432 to change to a "create annotation" mode in which an annotation can be created and associated with the audio location.

As another example of determining the audio location, audio media player 432 automatically determines the audio location (without user input) based on which portion of the audio data is being processed. For example, if audio media player 432 is processing the portion of the audio data that corresponds to Chapter 20 of the digital work reflected in digital media item 404, audio media player 432 determines that the audio location is at least somewhere within Chapter 20.

At step 856, audio media player 432 receives input that selects a "Create Annotation" button, which can be displayed on the touch screen of device 430. The button can be displayed in response to the input that selects the audio location in step 854, for example when the user continuously touches the touch screen for a period of time (such as one second).

Although step 854 is depicted as occurring prior to step 856, alternatively, the selection of the "Create Annotation" button may occur prior to determining the audio location.

At step 858, similar to step 808, audio media player 432 receives input to create annotation data.

At step 860, audio media player 432 stores the annotation data in association with the audio location. Audio media player 432 uses a mapping (e.g., mapping 406) to identify the particular audio location in the mapping that is closest to the audio location determined in step 854. Next, using the mapping, audio media player 432 identifies the text location that corresponds to the particular audio location.

Alternatively to step 860, audio media player 432 sends the annotation data and the audio location to intermediate device 420 over network 440. In response, intermediate device 420 stores the annotation data in association with the audio location. Intermediate device 420 uses mapping 406 to identify the particular audio location in the mapping that is closest to the audio location determined in step 854. Next, using mapping 406, intermediate device 420 identifies the text location that corresponds to the particular audio location. Intermediate device 420 sends the identified text location to device 410 over network 440. Intermediate device 420 can send the identified text location in response to a request from device 410 for text data and/or for annotations associated with the text data. For example, in response to a request for the digital book "The Grapes of Wrath", intermediate device 420 determines whether any annotation data associated with that digital book exists and, if so, sends the annotation data to device 410.

Step 860 can also include storing date and/or time information that indicates when the annotation was created. This information can be displayed later, when the annotation is consumed in the text content context.

At step 862, device 410 displays text data associated with digital media item 402, which reflects the text version of the work whose audio version is reflected in digital media item 404. Device 410 displays the text data of digital media item 402 from a locally stored copy of digital media item 402 or, if no local copy exists, displays the text data as the text data is streamed from intermediate device 420.

At step 864, device 410 determines when the portion of the text version of the work (reflected in digital media item 402) that includes the text location (identified in step 860) is displayed. Alternatively, device 410 may display information indicating that an annotation is available, regardless of which portion of the text version of the work, if any, is displayed.

At step 866, text media player 412 consumes the annotation data. If the annotation data is voice data, consuming the annotation data may involve playing the voice data, or converting the voice data into text data and displaying the text data. If the annotation data is text data, consuming the annotation data may involve, for example, displaying the text data in a side panel of a GUI that displays a portion of the text version of the work, or in a new window that appears separate from the GUI.

Hardware overview

According to one embodiment, the techniques described herein are implemented by one or more special-purpose computing devices. The special-purpose computing devices may be hardwired to perform the techniques, or may include digital electronic devices that are persistently programmed to perform the techniques, such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs), or may include one or more general-purpose hardware processors programmed to perform the techniques according to program instructions in firmware, memory, other storage, or a combination thereof. Such special-purpose computing devices may also combine custom hardwired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networked devices, or any other device that incorporates hardwired and/or programmed logic to implement the techniques.

For example, FIG. 9 is a block diagram illustrating a computer system 900 upon which an embodiment of the invention may be implemented. Computer system 900 includes a bus 902 or other communication mechanism for communicating information, and a hardware processor 904 coupled with bus 902 for processing information. Hardware processor 904 may be, for example, a general-purpose microprocessor.

Computer system 900 also includes primary memory 906, such as random access memory (RAM) or other dynamic storage device, coupled to bus 902 for storing information and instructions to be executed by processor 904. The primary memory 906 can also be used to store temporary variables or other intermediate information during execution of instructions to be executed by the processor 904. Such instructions, when stored in a non-transitory storage medium that is accessible to the processor 904, cause the computer system 900 to be a dedicated machine that is customized to perform the operations specified in the instructions.

Computer system 900 further includes a read only memory (ROM) 908 or other static storage device coupled to bus 902 for storing static information and instructions for processor 904. A storage device 910, such as a disk or optical disk, is provided and coupled to bus bar 902 for storing information and instructions.

Computer system 900 can be coupled via bus 902 to a display 912, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 914, including alphanumeric and other keys, is coupled to bus 902 for communicating information and command selections to processor 904. Another type of user input device is cursor control 916, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 904 and for controlling cursor movement on display 912. This input device typically has two degrees of freedom in two axes (a first axis (e.g., x) and a second axis (e.g., y)) that allow the device to specify positions in a plane.

Computer system 900 can implement the techniques described herein using custom hardwired logic, one or more ASICs or FPGAs, firmware, and/or program logic which, in combination with the computer system, causes or programs computer system 900 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 900 in response to processor 904 executing one or more sequences of one or more instructions contained in primary memory 906. Such instructions may be read into primary memory 906 from another storage medium, such as storage device 910. Execution of the sequences of instructions contained in primary memory 906 causes processor 904 to perform the process steps described herein. In alternative embodiments, hardwired circuitry may be used in place of, or in combination with, software instructions.

The term "storage medium" as used herein refers to any non-transitory medium that stores material and/or instructions that cause the machine to operate in a particular fashion. Such storage media may include non-volatile media and/or volatile media. Non-volatile media includes, for example, a compact disc or a magnetic disk such as storage device 910. Volatile media includes dynamic memory such as primary memory 906. Common forms of storage media include, for example, flexible disks, flexible disks, hard drives, solid state drives, magnetic tape, or any other magnetic data storage media, CD-ROM, any other optical data storage media, with aperture patterns Any physical media, RAM, PROM, and EPROM, FLASH-EPROM, NVRAM, any other memory chip or device.

Storage media are distinct from, but may be used in conjunction with, transmission media. Transmission media participate in transferring information between storage media. For example, transmission media include coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 902. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications.

Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 904 for execution. For example, the instructions may initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 900 can receive the data on the telephone line and use an infrared transmitter to convert the data to an infrared signal. An infrared detector can receive the data carried in the infrared signal, and appropriate circuitry can place the data on bus 902. Bus 902 carries the data to primary memory 906, from which processor 904 retrieves and executes the instructions. The instructions received by primary memory 906 may optionally be stored on storage device 910 either before or after execution by processor 904.

Computer system 900 also includes a communication interface 918 coupled to bus 902. Communication interface 918 provides a two-way data communication coupling to a network link 920 that is connected to a local network 922. For example, communication interface 918 may be an integrated services digital network (ISDN) card, a cable modem, a satellite modem, or a modem that provides a data communication connection to a corresponding type of telephone line. As another example, communication interface 918 may be a local area network (LAN) card that provides a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 918 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.

Network link 920 typically provides data communication through one or more networks to other data devices. For example, network link 920 may provide a connection through local network 922 to a host computer 924 or to data equipment operated by an Internet Service Provider (ISP) 926. ISP 926 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the "Internet" 928. Local network 922 and Internet 928 both use electrical, electromagnetic, or optical signals that carry digital data streams. The signals through the various networks, and the signals on network link 920 and through communication interface 918 that carry the digital data to and from computer system 900, are example forms of transmission media.

Computer system 900 can send messages and receive data, including program code, through the network(s), network link 920, and communication interface 918. In the Internet example, a server 930 might transmit requested code for an application program through Internet 928, ISP 926, local network 922, and communication interface 918.

The received code may be executed by processor 904 as it is received, and/or stored in storage device 910 or other non-volatile storage for later execution.

Figures 10 through 15 show functional block diagrams of electronic devices 1000 through 1500 configured in accordance with the principles of the invention as described above, in accordance with some embodiments. The functional blocks of the devices may be implemented by hardware, software, or a combination of hardware and software to carry out the principles of the invention. Those skilled in the art will appreciate that the functional blocks described in Figures 10 through 15 may be combined or separated into sub-blocks to implement the principles of the invention described above. Accordingly, the description herein supports any possible combination, separation, or further definition of the functional blocks described herein.

As shown in FIG. 10, electronic device 1000 includes an audio data receiving unit 1002 configured to receive audio data, the audio data reflecting an audible version of a work for which a text version exists. The electronic device 1000 also includes a processing unit 1006 coupled to the audio data receiving unit 1002. In some embodiments, processing unit 1006 includes a speech-to-text unit 1008 and a mapping unit 1010.

Processing unit 1006 is configured to: perform a speech-to-text analysis of the audio data to generate text for portions of the audio data (e.g., by speech-to-text unit 1008); and, based on the text generated for the portions of the audio data, generate a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work (e.g., by mapping unit 1010).
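One minimal way such a mapping could be assembled, assuming the speech-to-text step yields timestamped text segments, is sketched below in Python; the segment format, the search-window size, and the substring-matching strategy are illustrative assumptions rather than the claimed method.

```python
# Illustrative sketch only: builds an audio-to-text mapping from hypothetical
# speech-to-text output. The segment format (start time in seconds plus
# transcribed text) and the matching strategy are assumptions.

def build_mapping(transcribed_segments, book_text):
    """transcribed_segments: list of (start_seconds, segment_text) pairs.
    book_text: the text version of the work.
    Returns a list of (audio_seconds, character_offset) pairs."""
    mapping = []
    cursor = 0  # current search position within the text version
    for start_seconds, segment_text in transcribed_segments:
        # Search only a window ahead of the cursor so that a phrase repeated
        # elsewhere in the work does not produce a false match.
        window = book_text[cursor:cursor + 5000]
        index = window.lower().find(segment_text.lower())
        if index == -1:
            continue  # transcription error; skip and keep the cursor in place
        char_offset = cursor + index
        mapping.append((start_seconds, char_offset))
        cursor = char_offset + len(segment_text)
    return mapping
```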

As shown in FIG. 11, electronic device 1100 includes a text receiving unit 1102 configured to receive a text version of a work. The electronic device 1100 also includes an audio data receiving unit 1104 configured to receive second audio data, the second audio data reflecting an audible version of the work for which the text version exists. The electronic device 1100 also includes a processing unit 1106 coupled to the text receiving unit 1102. In some embodiments, processing unit 1106 includes a text-to-speech unit 1108 and a mapping unit 1110.

Processing unit 1106 is configured to: perform a text-to-speech analysis of the text version to generate first audio data (e.g., by text-to-speech unit 1108); and generate, based on the first audio data and the text version, a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the text version of the work (e.g., by mapping unit 1110). The processing unit 1106 is further configured to generate, based on (1) a comparison of the first audio data with the second audio data and (2) the first mapping, a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the text version of the work (e.g., by mapping unit 1110).
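A minimal sketch of how the second mapping could be derived is given below, assuming the text-to-speech engine reports where each piece of text was synthesized (the first mapping) and that an audio-to-audio alignment, such as dynamic time warping, supplies narrated-to-synthesized time pairs; both input formats are illustrative assumptions.

```python
# Illustrative sketch only: composes (a) the mapping reported by a
# text-to-speech engine between its generated audio and text offsets with
# (b) an alignment between the narrated audio and the generated audio,
# yielding the second mapping from narrated audio to text locations.
import bisect

def compose_mappings(tts_to_text, narrated_to_tts):
    """tts_to_text: list of (tts_seconds, char_offset) pairs, sorted by time.
    narrated_to_tts: list of (narrated_seconds, tts_seconds) pairs.
    Returns a list of (narrated_seconds, char_offset) pairs."""
    tts_times = [t for t, _ in tts_to_text]
    second_mapping = []
    for narrated_seconds, tts_seconds in narrated_to_tts:
        # Find the last synthesized-audio time at or before the aligned time.
        i = bisect.bisect_right(tts_times, tts_seconds) - 1
        if i >= 0:
            second_mapping.append((narrated_seconds, tts_to_text[i][1]))
    return second_mapping
```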

As shown in FIG. 12, electronic device 1200 includes an audio receiving unit 1202 configured to receive audio input. The electronic device 1200 also includes a processing unit 1206 coupled to the audio receiving unit 1202. In some embodiments, processing unit 1206 includes a speech-to-text unit 1208, a text matching unit 1209, and a display control unit 1210.

Processing unit 1206 is configured to: perform a speech-to-text analysis of the audio input to generate text for portions of the audio input (e.g., by speech-to-text unit 1208); determine whether the text generated for a portion of the audio input matches currently displayed text (e.g., by text matching unit 1209); and, in response to determining that the text matches the currently displayed text, cause the currently displayed text to be highlighted (e.g., by display control unit 1210).
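A minimal sketch of the matching step follows, assuming the speech-to-text output and the displayed text are available as plain strings; the whole-word substring matching shown here is an illustrative simplification.

```python
# Illustrative sketch only: compares hypothetical speech-to-text output
# against the currently displayed text and returns character ranges that a
# display controller could highlight.

def ranges_to_highlight(recognized_text, displayed_text):
    """Returns (start, end) character ranges in displayed_text that match
    the words recognized from the audio input, in reading order."""
    ranges = []
    cursor = 0
    lowered = displayed_text.lower()
    for word in recognized_text.lower().split():
        index = lowered.find(word, cursor)
        if index == -1:
            continue  # the word is not on the current page; skip it
        ranges.append((index, index + len(word)))
        cursor = index + len(word)
    return ranges
```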

As shown in FIG. 13, electronic device 1300 includes a location data obtaining unit 1302 configured to obtain location data indicating a specified location within a text version of a work. The electronic device 1300 also includes a processing unit 1306 coupled to the location data obtaining unit 1302. In some embodiments, processing unit 1306 includes a mapping check unit 1308.

Processing unit 1306 is configured to inspect a mapping between a plurality of audio locations in an audio version of the work and a corresponding plurality of text locations in the text version of the work (e.g., by mapping check unit 1308) to: determine a particular text location, of the plurality of text locations, that corresponds to the specified location; and determine, based on the particular text location, a particular audio location, of the plurality of audio locations, that corresponds to the particular text location. Processing unit 1306 is also configured to provide the particular audio location, determined based on the particular text location, to a media player to cause the media player to establish the particular audio location as a current playback position of the audio data.
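A minimal sketch of this text-to-audio lookup, as used for bookmark switching, is shown below; the mapping is assumed to be stored as character-offset/audio-time pairs sorted by offset, which is an illustrative data layout rather than a required one. Reversing the order of the pairs gives the audio-to-text lookup described for electronic device 1400 below.

```python
# Illustrative sketch only: the text-to-audio lookup used for bookmark
# switching. The mapping is assumed to be a list of (char_offset,
# audio_seconds) pairs sorted by character offset.
import bisect

def audio_location_for(mapping, specified_text_location):
    """Returns the audio position to hand to the media player as the
    current playback position."""
    offsets = [offset for offset, _ in mapping]
    i = bisect.bisect_right(offsets, specified_text_location) - 1
    if i < 0:
        return 0.0  # the specified location precedes the first mapped pair
    return mapping[i][1]
```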

As shown in FIG. 14, electronic device 1400 includes a location obtaining unit 1402 configured to obtain location data indicating a specified location within audio data. The electronic device 1400 also includes a processing unit 1406 coupled to the location obtaining unit 1402. In some embodiments, processing unit 1406 includes a mapping check unit 1408 and a display control unit 1410.

The processing unit 1406 is configured to inspect a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a text version of a work (e.g., by mapping check unit 1408) to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location; and determine, based on the particular audio location, a particular text location, of the plurality of text locations, that corresponds to the particular audio location. Processing unit 1406 is also configured to cause a media player to display information about the particular text location (e.g., by display control unit 1410).

As shown in FIG. 15, the electronic device 1500 includes a location obtaining unit 1502 configured to obtain location data indicating a specified location within an audio version of a work during playback of the audio version. The electronic device 1500 also includes a processing unit 1506 coupled to the location data obtaining unit 1502. In some embodiments, processing unit 1506 includes a text location determination unit 1508 and a display control unit 1510.

The processing unit 1506 is configured to: determine, based on the specified location during playback of the audio version of the work, a particular text location in the text version of the work (e.g., by text location determination unit 1508), the particular text location corresponding to end-of-page data, the end-of-page data indicating that the particular text location falls at the end of a first page in the text version of the work; and, in response to determining that the particular text location is associated with the end-of-page data, automatically cause the first page to cease being displayed and a second page that follows the first page to be displayed (e.g., by display control unit 1510).
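A minimal sketch of the automatic page-turn check follows, assuming the mapping is stored as audio-time/character-offset pairs sorted by time and that the end of the displayed page is known as a character offset; the function names and the callback are illustrative assumptions.

```python
# Illustrative sketch only: the automatic page-turn check. The mapping is
# assumed to be a list of (audio_seconds, char_offset) pairs sorted by time,
# and page_end_offset marks the last character of the displayed page.
import bisect

def maybe_turn_page(mapping, playback_seconds, page_end_offset, show_next_page):
    """Calls show_next_page() once playback reaches text at or beyond the
    end of the displayed page."""
    times = [t for t, _ in mapping]
    i = bisect.bisect_right(times, playback_seconds) - 1
    if i < 0:
        return  # playback has not yet reached the first mapped location
    current_text_offset = mapping[i][1]
    if current_text_offset >= page_end_offset:
        show_next_page()
```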

In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and of what the applicants intend to fall within the scope of the invention, is the set of claims presented in this application, in their literal and equivalent scope, including any subsequent corrections.

210‧‧‧ text version/document

220‧‧‧Text to Speech Generator/Text to Speech Analyzer

230‧‧‧ Audio Information

240‧‧‧Audio to file mapping

250‧‧‧ audio books

260‧‧‧Audio to text correlator/speech to text analyzer

270‧‧‧File to audio mapping/file to text mapping

400‧‧‧ system

402‧‧‧Digital Media Project

404‧‧‧Digital Media Project

406‧‧‧ mapping

410‧‧‧End User Devices

412‧‧‧Text Media Player

414‧‧‧Audio Media Player

420‧‧‧Intermediate device

430‧‧‧End User Devices

432‧‧‧Audio Media Player

440‧‧‧Network

900‧‧‧Computer system

902‧‧‧Bus

904‧‧‧ hardware processor

906‧‧‧ main memory

908‧‧‧Read-only memory (ROM)

910‧‧‧Storage device

912‧‧‧ display

914‧‧‧Input device

916‧‧‧ cursor control

918‧‧‧Communication interface

920‧‧‧Network link

922‧‧‧Local Network

924‧‧‧Host computer

926‧‧‧ Internet Service Provider (ISP)

928‧‧‧Internet

930‧‧‧Server

1000‧‧‧Electronics

1002‧‧‧Audio data receiving unit

1006‧‧‧Processing unit

1008‧‧‧Speech to text unit

1010‧‧‧ mapping unit

1100‧‧‧Electronics

1102‧‧‧Text receiving unit

1104‧‧‧Audio data receiving unit

1106‧‧‧Processing unit

1108‧‧‧Text to voice unit

1110‧‧‧ mapping unit

1200‧‧‧Electronics

1204‧‧‧Optical receiving unit

1206‧‧‧Processing unit

1208‧‧‧Speech to text unit

1209‧‧‧Text Matching Unit

1210‧‧‧Display Control Unit

1300‧‧‧Electronics

1302‧‧‧Location data acquisition unit

1306‧‧‧Processing unit

1308‧‧‧ mapping check unit

1400‧‧‧Electronics

1402‧‧‧Location Acquisition Unit

1406‧‧‧Processing unit

1408‧‧‧ mapping check unit

1410‧‧‧Display Control Unit

1500‧‧‧Electronics

1502‧‧‧Location Acquisition Unit/Location Data Acquisition Unit

1506‧‧‧Processing unit

1508‧‧‧Text position determination unit

1510‧‧‧Display Control Unit

FIG. 1 is a flowchart depicting a process for automatically creating a mapping between text data and audio data, in accordance with an embodiment of the present invention; FIG. 2 is a block diagram depicting a process, involving an audio-to-text correlator, for generating a mapping between text data and audio data, in accordance with an embodiment of the present invention; FIG. 3 is a flowchart depicting a process for using a mapping in one or more scenarios, in accordance with an embodiment of the present invention; FIG. 4 is a block diagram of an example system 400 that may be used to implement some of the processes described herein, in accordance with an embodiment of the present invention.

FIGS. 5A-5B are flowcharts depicting a process for bookmark switching, in accordance with an embodiment of the present invention; FIG. 6 is a diagram depicting a process for highlighting text of a work while an audio version of the work is being played, in accordance with an embodiment of the present invention; FIG. 7 is a flowchart depicting a process for highlighting displayed text in response to audio input from a user, in accordance with an embodiment of the present invention; FIGS. 8A-8B are flow diagrams depicting a process for transferring annotations from one media context to another media context, in accordance with an embodiment of the present invention; and FIG. 9 is a block diagram illustrating a computer system upon which embodiments of the present invention may be implemented.

FIGS. 10 through 15 are functional block diagrams of electronic devices in accordance with some embodiments.

Claims (20)

  1. A method comprising: receiving audio data, the audio data reflecting an audible version of a work for which a text version exists; performing a speech-to-text analysis of the audio data to generate text for portions of the audio data; and based on the text generated for the portions of the audio data, generating a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work; wherein the method is performed by one or more computing devices.
  2. The method of claim 1, wherein generating the text for the portions of the audio data comprises: generating the text for a portion of the audio data based at least in part on a textual context of the work.
  3. The method of claim 2, wherein generating the text for the portion of the audio data based at least in part on the textual context of the work comprises: generating the text based at least in part on one or more grammar rules in the text version of the work.
  4. The method of claim 2, wherein generating the text for the portion of the audio data based at least in part on the textual context of the work comprises: restricting, based on which words are in the text version of the work or a subset thereof, which words the portion may be translated into.
  5. The method of claim 4, wherein restricting which words the portion may be translated into based on which words are in the text version of the work comprises: identifying, for a given portion of the audio data, a sub-section of the text version of the work that corresponds to the given portion, and limiting the words to those words in the sub-section of the text version of the work.
  6. The method of claim 5, wherein: identifying the sub-section of the text version of the work comprises: maintaining a current text position in the text version of the work, the current text position corresponding to a current audio position, within the audio data, of the speech-to-text analysis; and the sub-section of the text version of the work is a chapter associated with the current text position.
  7. The method of any one of claims 1 to 6, wherein the portions include portions corresponding to individual words, and the mapping maps locations of the portions corresponding to the individual words to individual words in the text version of the work.
  8. The method of any one of claims 1 to 6, wherein the portions include portions corresponding to individual sentences, and the mapping maps locations of the portions corresponding to the individual sentences to individual sentences in the text version of the work.
  9. The method of any one of claims 1 to 6, wherein the portions include a portion corresponding to a fixed amount of data, and the mapping maps a location of the portion corresponding to the fixed amount of data to a corresponding location in the text version of the work.
  10. The method of any one of claims 1 to 9, wherein generating the mapping comprises: (1) embedding an anchor in the audio data; (2) embedding the anchor in the text version of the work; or (3) storing the mapping in a media overlay that is stored in association with the audio data or the text version of the work.
  11. The method of any one of claims 1 to 10, wherein each of one or more of the plurality of text locations indicates a relative location in the text version of the work.
  12. The method of any one of claims 1 to 10, wherein one of the plurality of text locations indicates a relative location in the text version of the work, and another of the plurality of text locations indicates an absolute location from the relative location.
  13. The method of any one of claims 1 to 10, wherein each of one or more of the plurality of text locations indicates an anchor within the text version of the work.
  14. A method comprising: receiving a text version of a work; performing a text-to-speech analysis of the text version to generate first audio data; generating, based on the first audio data and the text version, a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the text version of the work; receiving second audio data, the second audio data reflecting an audible version of the work for which the text version exists; and generating, based on (1) a comparison of the first audio data with the second audio data and (2) the first mapping, a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the text version of the work; wherein the method is performed by one or more computing devices.
  15. A method comprising: receiving an audio input; performing a speech-to-text analysis of the audio input to generate text for portions of the audio input; determining whether the text generated for a portion of the audio input matches currently displayed text; and in response to determining that the text matches the currently displayed text, causing the currently displayed text to be highlighted; wherein the method is performed by one or more computing devices.
  16. An electronic device comprising: at least one processor; and a memory storing one or more programs for execution by the at least one processor, the one or more programs including instructions for: receiving audio data, the audio data reflecting an audible version of a work for which a text version exists; performing a speech-to-text analysis of the audio data to generate text for portions of the audio data; and based on the text generated for the portions of the audio data, generating a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work.
  17. An electronic device comprising: at least one processor; and a memory storing one or more programs for execution by the at least one processor, the one or more programs including instructions for: receiving a text version of a work; performing a text-to-speech analysis of the text version to generate first audio data; generating, based on the first audio data and the text version, a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the text version of the work; receiving second audio data, the second audio data reflecting an audible version of the work for which the text version exists; and generating, based on (1) a comparison of the first audio data with the second audio data and (2) the first mapping, a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the text version of the work.
  18. An electronic device comprising: at least one processor; and a memory storing one or more programs for execution by the at least one processor, the one or more programs including instructions for: receiving an audio input; performing a speech-to-text analysis of the audio input to generate text for portions of the audio input; determining whether the text generated for a portion of the audio input matches currently displayed text; and in response to determining that the text matches the currently displayed text, causing the currently displayed text to be highlighted.
  19. A computer readable medium comprising one or more programs which, when executed by one or more processors of an electronic device, cause the electronic device to: receive audio data, the audio data reflecting an audible version of a work for which a text version exists; perform a speech-to-text analysis of the audio data to generate text for portions of the audio data; and based on the text generated for the portions of the audio data, generate a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work.
  20. An electronic device comprising: means for receiving audio data, the audio data reflecting an audible version of a work for which a text version exists; means for performing a speech-to-text analysis of the audio data to generate text for portions of the audio data; and means for generating, based on the text generated for the portions of the audio data, a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work.
TW101119921A 2011-06-03 2012-06-01 Automatically creating a mapping between text data and audio data TWI488174B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US201161493372P true 2011-06-03 2011-06-03
US13/267,738 US20120310642A1 (en) 2011-06-03 2011-10-06 Automatically creating a mapping between text data and audio data

Publications (2)

Publication Number Publication Date
TW201312548A true TW201312548A (en) 2013-03-16
TWI488174B TWI488174B (en) 2015-06-11

Family

ID=47675616

Family Applications (1)

Application Number Title Priority Date Filing Date
TW101119921A TWI488174B (en) 2011-06-03 2012-06-01 Automatically creating a mapping between text data and audio data

Country Status (3)

Country Link
JP (2) JP5463385B2 (en)
CN (1) CN102937959A (en)
TW (1) TWI488174B (en)

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI510940B (en) * 2014-05-09 2015-12-01 Univ Nan Kai Technology Image browsing device for establishing note by voice signal and method thereof
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US10318871B2 (en) 2005-09-08 2019-06-11 Apple Inc. Method and apparatus for building an intelligent automated assistant
US10332518B2 (en) 2017-05-09 2019-06-25 Apple Inc. User interface for correcting recognition errors
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
US10403283B1 (en) 2018-06-01 2019-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10417405B2 (en) 2011-03-21 2019-09-17 Apple Inc. Device access using voice authentication
US10417344B2 (en) 2014-05-30 2019-09-17 Apple Inc. Exemplar-based natural language processing
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10438595B2 (en) 2014-09-30 2019-10-08 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10453443B2 (en) 2014-09-30 2019-10-22 Apple Inc. Providing an indication of the suitability of speech recognition
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10521466B2 (en) 2016-09-19 2019-12-31 Apple Inc. Data driven natural language event detection and classification

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120310642A1 (en) * 2011-06-03 2012-12-06 Apple Inc. Automatically creating a mapping between text data and audio data
US20160162560A1 (en) * 2013-07-03 2016-06-09 Telefonaktiebolaget L M Ericsson (Publ) Providing an electronic book to a user equipment
WO2015040743A1 (en) * 2013-09-20 2015-03-26 株式会社東芝 Annotation sharing method, annotation sharing device, and annotation sharing program
CN104765714A (en) * 2014-01-08 2015-07-08 中国移动通信集团浙江有限公司 Switching method and device for electronic reading and listening
CN104866543A (en) * 2015-05-06 2015-08-26 陆默 Various book carrier switching method and device
CN105302908A (en) * 2015-11-02 2016-02-03 北京奇虎科技有限公司 E-book related audio resource recommendation method and apparatus
CN105373605A (en) * 2015-11-11 2016-03-02 中国农业大学 Batch storage method and system for data files
CN107948405A (en) * 2017-11-13 2018-04-20 百度在线网络技术(北京)有限公司 A kind of information processing method, device and terminal device
CN108172247A (en) * 2017-12-22 2018-06-15 北京壹人壹本信息科技有限公司 Record playing method, mobile terminal and the device with store function

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2986345B2 (en) * 1993-10-18 1999-12-06 インターナショナル・ビジネス・マシーンズ・コーポレイション Sound recording indexed apparatus and method
JP2000074685A (en) * 1998-08-31 2000-03-14 Matsushita Electric Ind Co Ltd Retrieving method in mobile unit and car navigation system
US6369811B1 (en) * 1998-09-09 2002-04-09 Ricoh Company Limited Automatic adaptive document help for paper documents
BR0008346A (en) * 1999-02-19 2002-01-29 Custom Speech Usa Inc System and method for automating transcription services for one or more voice users
US6260011B1 (en) * 2000-03-20 2001-07-10 Microsoft Corporation Methods and apparatus for automatically synchronizing electronic audio files with electronic text files
JP2002169588A (en) * 2000-11-16 2002-06-14 Internatl Business Mach Corp <Ibm> Text display device, text display control method, storage medium, program transmission device, and reception supporting method
US7366979B2 (en) * 2001-03-09 2008-04-29 Copernicus Investments, Llc Method and apparatus for annotating a document
JP2002344880A (en) * 2001-05-22 2002-11-29 Megafusion Corp Contents distribution system
US20050131559A1 (en) * 2002-05-30 2005-06-16 Jonathan Kahn Method for locating an audio segment within an audio file
US7567902B2 (en) * 2002-09-18 2009-07-28 Nuance Communications, Inc. Generating speech recognition grammars from a large corpus of data
JP2004152063A (en) * 2002-10-31 2004-05-27 Nec Corp Structuring method, structuring device and structuring program of multimedia contents, and providing method thereof
JP2005070645A (en) * 2003-08-27 2005-03-17 Casio Comput Co Ltd Text and voice synchronizing device and text and voice synchronization processing program
WO2005069171A1 (en) * 2004-01-14 2005-07-28 Nec Corporation Document correlation device and document correlation method
JP2006023860A (en) * 2004-07-06 2006-01-26 Sharp Corp Information browser, information browsing program, information browsing program recording medium, and information browsing system
US20070055514A1 (en) * 2005-09-08 2007-03-08 Beattie Valerie L Intelligent tutoring feedback
JP2007206317A (en) * 2006-02-01 2007-08-16 Yamaha Corp Authoring method and apparatus, and program
US20080140652A1 (en) * 2006-12-07 2008-06-12 Jonathan Travis Millman Authoring tool
JP2010506232A (en) * 2007-02-14 2010-02-25 エルジー エレクトロニクス インコーポレイティド Method and apparatus for encoding and decoding object-based audio signal
US8515757B2 (en) * 2007-03-20 2013-08-20 Nuance Communications, Inc. Indexing digitized speech with words represented in the digitized speech
US20090112572A1 (en) * 2007-10-30 2009-04-30 Karl Ola Thorn System and method for input of text to an application operating on a device
US8055693B2 (en) * 2008-02-25 2011-11-08 Mitsubishi Electric Research Laboratories, Inc. Method for retrieving items represented by particles from an information database
JP2010078979A (en) * 2008-09-26 2010-04-08 Nec Infrontia Corp Voice recording device, recorded voice retrieval method, and program
US8498866B2 (en) * 2009-01-15 2013-07-30 K-Nfb Reading Technology, Inc. Systems and methods for multiple language document narration

Cited By (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10318871B2 (en) 2005-09-08 2019-06-11 Apple Inc. Method and apparatus for building an intelligent automated assistant
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US10417405B2 (en) 2011-03-21 2019-09-17 Apple Inc. Device access using voice authentication
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
TWI510940B (en) * 2014-05-09 2015-12-01 Univ Nan Kai Technology Image browsing device for establishing note by voice signal and method thereof
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US10417344B2 (en) 2014-05-30 2019-09-17 Apple Inc. Exemplar-based natural language processing
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10390213B2 (en) 2014-09-30 2019-08-20 Apple Inc. Social reminders
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US10453443B2 (en) 2014-09-30 2019-10-22 Apple Inc. Providing an indication of the suitability of speech recognition
TWI585744B (en) * 2014-09-30 2017-06-01 蘋果公司 Method, system, and computer-readable storage medium for operating a virtual assistant
US10438595B2 (en) 2014-09-30 2019-10-08 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10354652B2 (en) 2015-12-02 2019-07-16 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10521466B2 (en) 2016-09-19 2019-12-31 Apple Inc. Data driven natural language event detection and classification
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10332518B2 (en) 2017-05-09 2019-06-25 Apple Inc. User interface for correcting recognition errors
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10529332B2 (en) 2018-01-04 2020-01-07 Apple Inc. Virtual assistant activation
US10403283B1 (en) 2018-06-01 2019-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US10504518B1 (en) 2018-06-03 2019-12-10 Apple Inc. Accelerated task performance

Also Published As

Publication number Publication date
CN102937959A (en) 2013-02-20
JP5463385B2 (en) 2014-04-09
JP2014132345A (en) 2014-07-17
TWI488174B (en) 2015-06-11
JP2013008357A (en) 2013-01-10

Similar Documents

Publication Publication Date Title
Harrington Phonetic analysis of speech corpora
Maekawa et al. Spontaneous Speech Corpus of Japanese.
TWI543150B (en) Method, computer-readable storage device, and system for providing voice stream augmented note taking
US8364488B2 (en) Voice models for document narration
US7493559B1 (en) System and method for direct multi-modal annotation of objects
US9478219B2 (en) Audio synchronization for document narration with user-selected playback
JP4559946B2 (en) Input device, input method, and input program
US9378739B2 (en) Identifying corresponding positions in different representations of a textual work
US10088976B2 (en) Systems and methods for multiple voice document narration
JP4463256B2 (en) System and method for providing automatically completed recommended words that link multiple languages
Pasha et al. Madamira: A fast, comprehensive tool for morphological analysis and disambiguation of arabic.
JP2008268684A (en) Voice reproducing device, electronic dictionary, voice reproducing method, and voice reproducing program
JP4370811B2 (en) Voice display output control device and voice display output control processing program
US9865248B2 (en) Intelligent text-to-speech conversion
US20110301943A1 (en) System and method of dictation for a speech recognition command system
US8793133B2 (en) Systems and methods document narration
CN101382937B (en) Multimedia resource processing method based on speech recognition and on-line teaching system thereof
US7693717B2 (en) Session file modification with annotation using speech recognition or text to speech
JP4225703B2 (en) Information access method, information access system and program
Barras et al. Transcriber: development and use of a tool for assisting speech corpora production
US8954329B2 (en) Methods and apparatus for acoustic disambiguation by insertion of disambiguating textual information
Bucholtz Variation in transcription
US7668718B2 (en) Synchronized pattern recognition source data processed by manual or automatic means for creation of shared speaker-dependent speech user profile
EP1096472B1 (en) Audio playback of a multi-source written document
US6985864B2 (en) Electronic document processing apparatus and method for forming summary text and speech read-out