TW201312548A - Automatically creating a mapping between text data and audio data - Google Patents

Automatically creating a mapping between text data and audio data

Info

Publication number
TW201312548A
TW201312548A
Authority
TW
Taiwan
Prior art keywords
text
audio
version
work
mapping
Prior art date
Application number
TW101119921A
Other languages
Chinese (zh)
Other versions
TWI488174B (en)
Inventor
Xiang Cao
Alan C Cannistraro
Gregory S Robbin
Casey M Dougherty
Melissa Breglio Hajj
Original Assignee
Apple Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US13/267,738 external-priority patent/US20120310642A1/en
Application filed by Apple Inc filed Critical Apple Inc
Publication of TW201312548A publication Critical patent/TW201312548A/en
Application granted granted Critical
Publication of TWI488174B publication Critical patent/TWI488174B/en

Abstract

Techniques are provided for creating a mapping that maps locations in audio data (e.g., an audio book) to corresponding locations in text data (e.g., an e-book). Techniques are also provided for using a mapping between audio data and text data, whether the mapping is created automatically or manually. A mapping may be used for bookmark switching, where a bookmark established in one version of a digital work (e.g., an e-book) is used to identify a corresponding location within another version of the digital work (e.g., an audio book). Alternatively, the mapping may be used to play audio that corresponds to text selected by a user. Alternatively, the mapping may be used to automatically highlight text in response to audio that corresponds to the text being played. Alternatively, the mapping may be used to determine where an annotation created in one media context (e.g., audio) will be consumed in another media context (e.g., text).

Description

Automatically creating a mapping between text data and audio data

The present invention relates to automatically creating a mapping between text data and audio data by analyzing the audio data to detect the words reflected in the audio data and comparing those words, word by word, against the words in the document.
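The word-by-word comparison can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the speech recognizer emits (word, timestamp) pairs and aligns them greedily against the tokenized text, searching only a small window ahead so one misrecognized word cannot derail the alignment.

```python
def build_mapping(recognized, text_words):
    """Greedily align speech-to-text output against the document text.

    recognized: list of (word, audio_seconds) pairs from a recognizer.
    text_words: list of words in the text version of the work.
    Returns a list of (audio_seconds, text_index) anchor pairs.
    """
    mapping = []
    cursor = 0  # current position in the text version
    for word, t in recognized:
        # Search a small window ahead of the cursor; punctuation and case
        # are normalized away before comparing.
        for i in range(cursor, min(cursor + 10, len(text_words))):
            if text_words[i].lower().strip('.,;:"!?') == word.lower():
                mapping.append((t, i))
                cursor = i + 1
                break
    return mapping

text = "Call me Ishmael. Some years ago, never mind how long".split()
asr = [("call", 0.0), ("me", 0.3), ("ishmael", 0.5), ("some", 1.2), ("years", 1.5)]
anchors = build_mapping(asr, text)
```

A production aligner would use dynamic programming (e.g., edit-distance alignment) rather than this greedy scan, but the output shape is the same: audio positions anchored to text positions.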

With the decreasing cost of handheld electronic devices and the great demand for digital content, creative works that were once published in print media are increasingly becoming available as digital media. For example, digital books (also known as "e-books") are becoming increasingly popular, along with dedicated handheld electronic devices known as e-book readers (or "e-readers"). Moreover, although not designed exclusively as e-readers, other handheld devices such as tablet computers and smartphones are capable of operating as e-readers.

A common standard for formatting e-books is the EPUB standard (short for "electronic publication"), a free and open e-book standard proposed by the International Digital Publishing Forum (IDPF). An EPUB file uses XHTML 1.1 (or DTBook) to construct the content of a book. Styling and layout are performed using a subset of CSS referred to as OPS style sheets.

For some written works, particularly those that become popular, an audio version of the written work is created. For example, a recording of a well-known individual (or a person with a pleasant voice) reading the written work is created and made available for purchase, whether online or in a brick-and-mortar store.

It is not uncommon for consumers to purchase both an e-book and the audio version of the e-book (or "audio book"). In some cases, a user reads the entirety of an e-book and then wants to listen to the audio book. In other cases, the user switches between reading the book and listening to the book, depending on the user's circumstances. For example, while exercising or while driving during a commute, a user will tend to listen to the audio version of a book. On the other hand, while lounging on a couch before going to bed, the user will tend to read the e-book version of the book. Unfortunately, such transitions can be laborious, because the user must remember where he or she stopped in the e-book and manually locate the place to begin in the audio book, or vice versa. Even if the user clearly remembers the events in the book at the point where the user stopped, the transitions can still be laborious, because knowing the events does not necessarily make it easy to find the portions of the e-book or audio book that correspond to those events. Thus, switching between an e-book and an audio book can be extremely time-consuming.

The specification "EPUB Media Overlays 3.0" defines the use of SMIL (Synchronized Multimedia Integration Language), package documents, EPUB style sheets, and EPUB content documents for representing synchronized text and audio publications. A pre-recorded narration of a publication can be represented as a series of audio clips, each corresponding to a portion of the text. A single audio clip in the series of audio clips that make up a pre-recorded narration typically represents a single phrase or paragraph, but implies no order relative to the other clips or to the text of the document. Media overlays solve this synchronization problem by using SMIL markup to tie the structured audio narration to its corresponding text in the EPUB content document. A media overlay is a simplified subset of SMIL 3.0 that allows the playback sequence of these clips to be defined.
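In an EPUB 3 media overlay, each SMIL `<par>` element pairs a text fragment with an audio clip. The sketch below parses a simplified, namespace-stripped overlay fragment with Python's standard `xml.etree.ElementTree`; real overlay documents carry the SMIL and EPUB namespaces, and the file names here are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified media-overlay fragment (real files use SMIL/EPUB namespaces).
smil = """
<smil>
  <body>
    <seq>
      <par id="p1">
        <text src="chapter1.xhtml#para1"/>
        <audio src="chapter1.mp3" clipBegin="0:00:00" clipEnd="0:00:08"/>
      </par>
      <par id="p2">
        <text src="chapter1.xhtml#para2"/>
        <audio src="chapter1.mp3" clipBegin="0:00:08" clipEnd="0:00:19"/>
      </par>
    </seq>
  </body>
</smil>
"""

def overlay_pairs(doc):
    """Yield (text fragment, clipBegin, clipEnd) for each <par>, in order."""
    root = ET.fromstring(doc)
    for par in root.iter('par'):
        text = par.find('text').get('src')
        audio = par.find('audio')
        yield text, audio.get('clipBegin'), audio.get('clipEnd')

pairs = list(overlay_pairs(smil))
```

Note the coarse granularity visible even in this toy example: each clip spans a whole paragraph, which is exactly the limitation the following paragraph describes.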

Unfortunately, creating a media overlay file is largely a manual process. Consequently, the granularity of the mapping between the audio version and the text version of a work is very coarse. For example, a media overlay file may associate the beginning of each paragraph in an e-book with the corresponding location in the audio version of the book. The reason media overlay files (especially for novels) do not contain mappings at any finer level of granularity, such as on a word-by-word basis, is that creating such a fine-grained media overlay file could take countless hours of human effort.

The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.

According to some embodiments, a method is provided that includes: receiving audio data, the audio data reflecting an audible version of a work for which a text version exists; performing a speech-to-text analysis of the audio data to generate text for portions of the audio data; and, based on the text generated for the portions of the audio data, generating a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work. The method is performed by one or more computing devices.

In some embodiments, generating text for portions of the audio data includes generating the text for the portions of the audio data based at least in part on the textual context of the work. In some embodiments, generating the text based at least in part on the textual context of the work includes generating text based at least in part on one or more grammar rules used in the text version of the work. In some embodiments, generating the text based at least in part on the textual context of the work includes restricting which words the portions may be translated into based on which words are in the text version of the work, or a subset thereof. In some embodiments, restricting which words the portions may be translated into based on which words are in the text version of the work includes, for a given portion of the audio data, identifying a sub-section of the text version of the work that corresponds to the given portion and restricting the candidate words to only those words in that sub-section of the text version of the work. In some embodiments, identifying the sub-section of the text version of the work includes maintaining a current text position in the text version of the work, the current text position corresponding to a current audio position of the speech-to-text analysis within the audio data; and the sub-section of the text version of the work is a section associated with the current text position.
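The vocabulary-restriction idea can be sketched as a window of candidate words around the current text position. This is only an illustration under simplifying assumptions: a real recognizer is constrained with a language model, not a bare word set, and the window sizes here are arbitrary.

```python
def restrict_candidates(text_words, current_pos, window=50):
    """Restrict the recognizer's candidate vocabulary to words near the
    current text position, per the windowed sub-section idea above.
    A small look-behind tolerates slight misalignment."""
    lo = max(0, current_pos - 5)
    hi = min(len(text_words), current_pos + window)
    return set(w.lower() for w in text_words[lo:hi])

words = "it was the best of times it was the worst of times".split()
vocab = restrict_candidates(words, current_pos=0, window=6)
```

Restricting candidates this way shrinks the recognizer's search space dramatically: instead of choosing among an entire dictionary, it only has to decide which nearby word of the known text was spoken.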

In some embodiments, the portions include portions corresponding to individual words, and the mapping maps the locations of the portions corresponding to individual words to individual words in the text version of the work. In some embodiments, the portions include portions corresponding to individual sentences, and the mapping maps the locations of the portions corresponding to individual sentences to individual sentences in the text version of the work. In some embodiments, the portions include portions corresponding to fixed amounts of data, and the mapping maps the locations of the portions corresponding to fixed amounts of data to corresponding locations in the text version of the work.

In some embodiments, generating the mapping includes: (1) embedding anchors in the audio data; (2) embedding anchors in the text version of the work; or (3) storing the mapping in a media overlay that is stored in association with the audio data or the text version of the work.

In some embodiments, each of one or more of the plurality of text locations indicates a relative location in the text version of the work. In some embodiments, one of the plurality of text locations indicates a relative location in the text version of the work, and another of the plurality of text locations indicates an absolute location from that relative location. In some embodiments, each of one or more of the plurality of text locations indicates an anchor within the text version of the work.

According to some embodiments, a method is provided that includes: receiving a text version of a work; performing a text-to-speech analysis of the text version to generate first audio data; based on the first audio data and the text version, generating a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the text version of the work; receiving second audio data reflecting an audible version of the work for which the text version exists; and, based on (1) a comparison of the first audio data with the second audio data and (2) the first mapping, generating a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the text version of the work. The method is performed by one or more computing devices.

According to some embodiments, a method is provided that includes: receiving audio input; performing a speech-to-text analysis of the audio input to generate text for portions of the audio input; determining whether the text generated for a portion of the audio input matches currently displayed text; and, in response to determining that the text matches the currently displayed text, causing the currently displayed text to be highlighted. The method is performed by one or more computing devices.

According to some embodiments, an electronic device is provided that includes an audio data receiving unit configured to receive audio data, the audio data reflecting an audible version of a work for which a text version exists. The electronic device also includes a processing unit coupled to the audio data receiving unit. The processing unit is configured to: perform a speech-to-text analysis of the audio data to generate text for portions of the audio data; and, based on the text generated for the portions of the audio data, generate a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work.

According to some embodiments, an electronic device is provided that includes a text receiving unit configured to receive a text version of a work. The electronic device also includes a processing unit coupled to the text receiving unit, the processing unit configured to: perform a text-to-speech analysis of the text version to generate first audio data; and, based on the first audio data and the text version, generate a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the text version of the work. The electronic device also includes an audio data receiving unit configured to receive second audio data, the second audio data reflecting an audible version of the work for which the text version exists. The processing unit is further configured to generate, based on (1) a comparison of the first audio data with the second audio data and (2) the first mapping, a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the text version of the work.

According to some embodiments, an electronic device is provided that includes an audio receiving unit configured to receive audio input. The electronic device also includes a processing unit coupled to the audio receiving unit. The processing unit is configured to: perform a speech-to-text analysis of the audio input to generate text for portions of the audio input; determine whether the text generated for a portion of the audio input matches currently displayed text; and, in response to determining that the text matches the currently displayed text, cause the currently displayed text to be highlighted.

According to some embodiments, a method is provided that includes: obtaining location data indicating a specified location within a text version of a work; and examining a mapping between a plurality of audio locations in an audio version of the work and a corresponding plurality of text locations in the text version of the work to: determine a particular text location, of the plurality of text locations, that corresponds to the specified location, and determine, based on the particular text location, a particular audio location, of the plurality of audio locations, that corresponds to the particular text location. The method includes providing the particular audio location, determined based on the particular text location, to a media player to cause the media player to establish the particular audio location as a current playback position of the audio data. The method is performed by one or more computing devices.
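Bookmark switching in this direction reduces to a lookup in the mapping: find the last mapped text position at or before the bookmark, and hand its paired audio position to the media player. A minimal sketch, with an invented mapping of character offsets to seconds (the binary search via `bisect` is my choice, not something the text mandates):

```python
import bisect

# Mapping: sorted (text_offset_chars, audio_offset_seconds) anchor pairs.
MAPPING = [(0, 0.0), (120, 9.5), (260, 21.0), (400, 33.2)]

def audio_position_for(text_offset):
    """Return the audio position of the last anchor at or before text_offset.
    This is the position a media player would seek to."""
    offsets = [t for t, _ in MAPPING]
    i = bisect.bisect_right(offsets, text_offset) - 1
    return MAPPING[max(i, 0)][1]
```

With a fine-grained (word-level) mapping the seek lands almost exactly where the reader stopped; with a coarse (paragraph-level) mapping it lands at the start of the enclosing paragraph.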

In some embodiments, the obtaining comprises a server receiving the location data from a first device over a network; the examining and providing are performed by the server; and the providing comprises the server sending the particular audio location to a second device executing the media player. In some embodiments, the second device and the first device are the same device. In some embodiments, the obtaining, examining, and providing are performed by a computing device that is configured to display the text version of the work and to execute the media player. In some embodiments, the method further includes determining the location data, without input from a user of a device, at the device configured to display the text version of the work.

In some embodiments, the method further includes: receiving input from a user; and, in response to receiving the input, determining the location data based on the input. In some embodiments, providing the particular audio location to the media player comprises providing the particular audio location to the media player to cause the media player to process the audio data beginning at the current playback position, which causes the media player to generate audio from the processed audio data; and causing the media player to process the audio data is performed in response to receiving the input.

In some embodiments, the input selects multiple words in the text version of the work; the specified location is a first specified location; the location data also indicates a second specified location within the text version of the work that is different from the first specified location; the examining further comprises examining the mapping to: determine a second particular text location, of the plurality of text locations, that corresponds to the second specified location, and determine, based on the second particular text location, a second particular audio location, of the plurality of audio locations, that corresponds to the second particular text location; and providing the particular audio location to the media player comprises providing the second particular audio location to the media player to cause the media player to cease processing the audio data when the current playback position reaches or nears the second particular audio location.
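Playing only a selected span therefore needs two lookups: one for the start of the selection (where playback begins) and one for its end (where playback ceases). A sketch under the same invented character-offset mapping as before:

```python
def playback_bounds(mapping, start_text, end_text):
    """Given a user's text selection [start_text, end_text], return the
    (start, stop) audio positions: the player seeks to start and ceases
    processing when playback reaches stop."""
    def lookup(offset):
        best = mapping[0][1]
        for t, a in mapping:
            if t <= offset:
                best = a
            else:
                break
        return best
    return lookup(start_text), lookup(end_text)

mapping = [(0, 0.0), (120, 9.5), (260, 21.0), (400, 33.2)]
bounds = playback_bounds(mapping, 125, 270)
```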

In some embodiments, the method further includes: obtaining annotation data based on input from a user; storing the annotation data in association with the specified location; and causing information about the annotation data to be displayed. In some embodiments, causing information about the particular audio location and the annotation data to be displayed comprises: determining when a current playback position of the audio data is at or near the particular audio location; and, in response to determining that the current playback position of the audio data is at or near the particular audio location, causing information about the annotation data to be displayed.

In some embodiments, the annotation data includes text data, and causing information about the annotation data to be displayed comprises displaying the text data. In some embodiments, the annotation data includes voice data, and causing information about the annotation data to be displayed comprises processing the voice data to generate audio.

According to some embodiments, an electronic device is provided that includes a location data obtaining unit configured to obtain location data, the location data indicating a specified location within a text version of a work. The electronic device also includes a processing unit coupled to the location data obtaining unit. The processing unit is configured to: examine a mapping between a plurality of audio locations in an audio version of the work and a corresponding plurality of text locations in the text version of the work to: determine a particular text location, of the plurality of text locations, that corresponds to the specified location, and determine, based on the particular text location, a particular audio location, of the plurality of audio locations, that corresponds to the particular text location; and provide the particular audio location, determined based on the particular text location, to a media player to cause the media player to establish the particular audio location as a current playback position of the audio data.

According to some embodiments, a method is provided that includes: obtaining location data indicating a specified location within audio data; examining a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a text version of a work to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and determine, based on the particular audio location, a particular text location, of the plurality of text locations, that corresponds to the particular audio location; and causing a media player to display information about the particular text location. The method is performed by one or more computing devices.

In some embodiments, the obtaining comprises a server receiving the location data from a first device over a network; the examining and causing are performed by the server; and the causing comprises the server sending the particular text location to a second device executing the media player. In some embodiments, the second device and the first device are the same device. In some embodiments, the obtaining, examining, and causing are performed by a computing device that is configured to display the text version of the work and to execute the media player. In some embodiments, the method further includes determining the location data, without input from a user of a device, at the device configured to process the audio data.

In some embodiments, the method further includes: receiving input from a user; and, in response to receiving the input, determining the location data based on the input. In some embodiments, the causing comprises causing the media player to display a portion of the text version of the work that corresponds to the particular text location; and causing the media player to display the portion of the text version of the work is performed in response to receiving the input.

In some embodiments, the input selects a segment of the audio data; the specified location is a first specified location; the location data also indicates a second specified location within the audio data that is different from the first specified location; the examining further comprises examining the mapping to: determine a second particular audio location, of the plurality of audio locations, that corresponds to the second specified location, and determine, based on the second particular audio location, a second particular text location, of the plurality of text locations, that corresponds to the second particular audio location; and causing a media player to display information about the particular text location further comprises causing the media player to display information about the second particular text location.

In some embodiments, the specified location corresponds to a current playback position in the audio data; the causing is performed as the audio data at the specified location is processed and audio is generated; and the causing comprises causing a second media player to highlight text, within the text version of the work, that is at or near the particular text location.
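This follow-along highlighting is the inverse lookup of bookmark switching: map the current playback position back to a text position and highlight what it lands on. A sketch with an invented word-level mapping (a real reader would drive this from the player's position callbacks):

```python
def word_to_highlight(mapping, words, playback_seconds):
    """Return the word at the text position mapped from the current
    playback position, i.e. the word a reader app would highlight.

    mapping: sorted (word_index, audio_seconds) anchor pairs.
    """
    index = 0
    for word_index, audio_seconds in mapping:
        if audio_seconds <= playback_seconds:
            index = word_index
        else:
            break
    return words[index]

words = ["Call", "me", "Ishmael."]
mapping = [(0, 0.0), (1, 0.3), (2, 0.5)]  # word index -> audio seconds
```

Because the mapping here is word-granular, the highlight tracks the narration word by word; with a paragraph-granular mapping the same loop would highlight whole paragraphs instead.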

In some embodiments, the method further includes: obtaining annotation data based on input from a user; storing the annotation data in association with the specified location; and causing information about the annotation data to be displayed. In some embodiments, causing information about the annotation data to be displayed comprises: determining when a portion of the text version of the work that corresponds to the particular text location is displayed; and, in response to determining that the portion of the text version of the work that corresponds to the particular text location is displayed, causing information about the annotation data to be displayed.

In some embodiments, the annotation data includes text data, and causing information about the annotation data to be displayed comprises causing the text data to be displayed. In some embodiments, the annotation data includes voice data, and causing information about the annotation data to be displayed comprises causing the voice data to be processed to generate audio.

According to some embodiments, a method is provided that includes, during playback of an audio version of a work: obtaining location data indicating a specified location within the audio version; determining, based on the specified location, a particular text location in a text version of the work, the particular text location being associated with pause data that indicates when playback of the audio version is to be paused; and, in response to determining that the particular text location is associated with the pause data, pausing playback of the audio version. The method is performed by one or more computing devices.
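The pause check can be sketched as one more mapping lookup performed during playback: translate the current audio position to a text position, then test whether that text position carries pause data. The mapping and pause positions below are invented for illustration.

```python
def maybe_pause(mapping, pause_positions, playback_seconds):
    """Return True if playback should pause: map the current audio
    position to a text position and check whether that position is
    associated with pause data.

    mapping: sorted (text_offset, audio_seconds) anchor pairs.
    pause_positions: set of text offsets carrying pause data.
    """
    text_pos = None
    for t, a in mapping:
        if a <= playback_seconds:
            text_pos = t
        else:
            break
    return text_pos in pause_positions

mapping = [(0, 0.0), (120, 9.5), (260, 21.0)]
pauses = {260}  # e.g., end of a displayed page, or just before a picture
```

The same loop structure drives the automatic page turn described below: instead of pause data, the text position is checked against end-of-page data and the display advances.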

In some embodiments, the pause data is within the text version of the work. In some embodiments, determining the particular text location comprises examining a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and determine, based on the particular audio location, the particular text location, of the plurality of text locations, that corresponds to the particular audio location.

In some embodiments, the pause data corresponds to the end of a page reflected in the text version of the work. In some embodiments, the pause data corresponds to a location within the text version of the work that immediately precedes a picture that does not include text.

In some embodiments, the method further comprises resuming playback of the audio version in response to receiving user input. In some embodiments, the method further comprises resuming playback of the audio version in response to the passage of a particular amount of time since playback of the audio version was paused.

According to some embodiments, a method is provided that comprises, during playback of an audio version of a work: obtaining location data that indicates a specified location within the audio version, and determining, based on the specified location, a particular text location in a text version of the work, wherein the particular text location is associated with end-of-page data that indicates the end of a first page reflected in the text version of the work; and in response to determining that the particular text location is associated with the end-of-page data, automatically causing the first page to cease to be displayed and causing a second page, which follows the first page, to be displayed. The method is performed by one or more computing devices.

In some embodiments, the method further comprises inspecting a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location; and determine, based on the particular audio location, the particular text location, of the plurality of text locations, that corresponds to the particular audio location.

According to some embodiments, an electronic device is provided that comprises a location obtaining unit configured to obtain location data that indicates a specified location within audio data. The electronic device also comprises a processing unit coupled to the location obtaining unit. The processing unit is configured to: inspect a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a text version of a work to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location, and determine, based on the particular audio location, a particular text location, of the plurality of text locations, that corresponds to the particular audio location; and cause a media player to display information about the particular text location.

According to some embodiments, an electronic device is provided that comprises a location obtaining unit configured to obtain location data that indicates, during playback of an audio version of a work, a specified location within the audio version. The electronic device also comprises a processing unit coupled to the location obtaining unit. The processing unit is configured to, during playback of the audio version of the work: determine, based on the specified location, a particular text location in a text version of the work, wherein the particular text location is associated with end-of-page data that indicates the end of a first page reflected in the text version of the work; and in response to determining that the particular text location is associated with the end-of-page data, automatically cause the first page to cease to be displayed and cause a second page, which follows the first page, to be displayed.

According to some embodiments, a computer-readable storage medium is provided that stores one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing any of the methods described above. According to some embodiments, an electronic device is provided that comprises means for performing any of the methods described above. In some embodiments, an electronic device is provided that comprises one or more processors, and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for performing any of the methods described above. In some embodiments, an information processing apparatus for use in an electronic device is provided, the information processing apparatus comprising means for performing any of the methods described above.

In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the invention.

Overview of the automatic generation of an audio-to-text mapping

According to one approach, a mapping is automatically created that maps locations within an audio version of a work (e.g., an audio book) to corresponding locations in a text version of the work (e.g., an e-book). The mapping is created by performing a speech-to-text analysis of the audio version to identify words reflected in the audio version. The identified words are matched with the corresponding words in the text version of the work. The mapping associates the locations (within the audio version) of the identified words with the locations, in the text version of the work, at which the identified words are found.

Audio version formats

The audio data reflects an audible reading of the text of a text version of a work (such as a book, a web page, a brochure, a flyer, etc.). The audio data may be stored in one or more audio files. The one or more audio files may be in one of many file formats. Non-limiting examples of audio file formats include AAC, MP3, WAV, and PCM.

Text version formats

Similarly, the text data to which the audio data is mapped may be stored in one of many document file formats. Non-limiting examples of document file formats include DOC, TXT, PDF, RTF, HTML, XHTML, and EPUB.

A typical EPUB document is accompanied by a file that (a) lists each XHTML content document and (b) indicates the order of the XHTML content documents. For example, if a book comprises 20 chapters, the EPUB document for that book may have 20 different XHTML documents, one for each chapter. The file that accompanies the EPUB document identifies an order of the XHTML documents that corresponds to the order of the chapters in the book. Thus, a single (logical) document (whether an EPUB document or another type of document) may comprise multiple data items or files.

The words or characters reflected in the text data may be in one or more languages. For example, one portion of the text data may be in English while another portion of the text data is in French. Although examples of English words are provided herein, embodiments of the invention may be applied to other languages, including character-based languages.

Audio and text locations in a mapping

As described herein, a mapping comprises a set of mapping records, where each mapping record associates an audio location with a text location.

Each audio location identifies a location in the audio data. An audio location may indicate an absolute location within the audio data, a relative location within the audio data, or a combination of an absolute location and a relative location. As an example of an absolute location, as indicated in Example A, an audio location may indicate a time offset into the audio data (e.g., 04:32:24 indicates 4 hours, 32 minutes, and 24 seconds), or a time range. As an example of a relative location, an audio location may indicate a chapter number, a paragraph number, and a line number. As an example of a combination of an absolute location and a relative location, an audio location may indicate a chapter number and a time offset into the chapter indicated by the chapter number.

Similarly, each text location identifies a location in text data (such as the text version of a work). A text location may indicate an absolute location within the text version of the work, a relative location within the text version of the work, or a combination of an absolute location and a relative location. As an example of an absolute location, a text location may indicate a byte offset into the text version of the work and/or an "anchor" within the text version of the work. An anchor is metadata, within the text data, that identifies a specific location or portion of text. An anchor may be stored separately from the text, within the text data, that is displayed to end users, or may be stored among the text that is displayed to end users. For example, the text data may include the following sentence: "Why did the chicken<i name="123"/>cross the road?", where "<i name="123"/>" is an anchor. When that sentence is displayed to a user, the user sees only "Why did the chicken cross the road?". Similarly, the same sentence may have multiple anchors, as follows: "<i name="123"/>Why<i name="124"/>did<i name="125"/>the<i name="126"/>chicken<i name="127"/>cross<i name="128"/>the<i name="129"/>road?". In this example, there is an anchor before each word in the sentence.

As an example of a relative location, a text location may indicate a page number, a chapter number, a paragraph number, and/or a line number. As an example of a combination of an absolute location and a relative location, a text location may indicate a chapter number and an anchor within the chapter indicated by the chapter number.
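For illustration only (the patent does not prescribe a storage format, so the dictionary layout below is an assumption), a mapping record pairing the kinds of audio and text locations described above might be represented as:

```python
# A mapping record associates one audio location with one text location.
# Audio locations may be absolute (a time offset), relative (chapter/
# paragraph/line numbers), or a combination; likewise for text locations.
mapping_record = {
    "audio": {"chapter": 1, "offset": "04:32:24"},  # chapter + time offset
    "text":  {"chapter": 1, "anchor": "123"},       # chapter + anchor name
}

def parse_offset(hms):
    """Convert an 'HH:MM:SS' time offset (e.g., '04:32:24' = 4 hours,
    32 minutes, 24 seconds) into a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s
```

Here 04:32:24 corresponds to 16,344 seconds into the audio data.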

Examples of ways to represent text locations and audio locations are provided in the specification entitled "EPUB Media Overlays 3.0," which defines the usage of SMIL (Synchronized Multimedia Integration Language), EPUB style sheets, and EPUB content documents. An example, provided in that specification, of an association that associates a text location with an audio location is as follows:

實例A Example A
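The markup of Example A is not reproduced in this text (it appeared as a figure in the original publication). Based on the description that follows, and on the corresponding example in the "EPUB Media Overlays 3.0" specification, it presumably resembled the following (file names and clip times are illustrative):

```xml
<par id="par1">
    <text src="chapter1.xhtml#sentence1"/>
    <audio src="chapter1_audio.mp3" clipBegin="0:00:23" clipEnd="0:00:45"/>
</par>
```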

In Example A, the "par" element includes two child elements: a "text" element and an "audio" element. The text element comprises a "src" attribute that identifies a specific sentence, within an XHTML document, that contains content from the first chapter of a book. The audio element comprises a "src" attribute that identifies an audio file containing an audio version of the first chapter of the book, a "clipBegin" attribute that identifies where, within the audio file, an audio clip begins, and a "clipEnd" attribute that identifies where, within the audio file, the audio clip ends. Thus, seconds 23 through 45 of the audio file correspond to the first sentence of chapter 1 of the book.

Creating a mapping between text and audio

According to an embodiment, a mapping between a text version of a work and an audio version of the same work is generated automatically. Because the mapping is generated automatically, the mapping can use a much finer granularity than would be practical using manual text-to-audio mapping techniques. Each automatically generated text-to-audio mapping includes multiple mapping records, each of which associates a text location in the text version with an audio location in the audio version.

FIG. 1 is a flow diagram depicting a process 100 for automatically creating a mapping between a text version of a work and an audio version of the same work, according to an embodiment of the invention. At step 110, a speech-to-text analyzer receives audio data that reflects an audible version of the work. At step 120, while the speech-to-text analyzer performs an analysis of the audio data, the speech-to-text analyzer generates text for portions of the audio data. At step 130, based on the text generated for those portions of the audio data, the speech-to-text analyzer generates a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work.

Step 130 may involve the speech-to-text analyzer comparing the generated text with text in the text version of the work to determine where the generated text is located within the text version of the work. For each portion of generated text that is found in the text version of the work, the speech-to-text analyzer associates (1) an audio location that indicates where, within the audio data, the corresponding portion of the audio data is found with (2) a text location that indicates where, within the text version of the work, that portion of text is found.
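As a simplified sketch of steps 120 and 130 (not the patented implementation: the recognized-word list stands in for a real speech-to-text engine, and the forward-scan matching is only one possible alignment strategy):

```python
def build_mapping(recognized_words, text_words):
    """Align words recognized from the audio data (step 120) with the
    text version of the work, emitting mapping records that pair an
    audio location with a text location (step 130).

    recognized_words: list of (word, audio_time_offset_seconds) pairs
    text_words:       list of words from the text version, in order
    """
    mapping = []
    text_pos = 0  # current position within the text version
    for word, time_offset in recognized_words:
        # Scan forward from the current text position for the next
        # occurrence of the recognized word.
        for i in range(text_pos, len(text_words)):
            if text_words[i].lower() == word.lower():
                mapping.append({"audio": time_offset, "text": i})
                text_pos = i + 1
                break
    return mapping

recognized = [("why", 23.0), ("did", 23.4), ("the", 23.7), ("chicken", 24.0)]
text = ["Why", "did", "the", "chicken", "cross", "the", "road"]
records = build_mapping(recognized, text)
```

Each resulting record pairs the time offset at which a word was recognized in the audio with the index at which the same word appears in the text version.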

Textual context

Each document has a "textual context." The textual context of a text version of a work includes intrinsic characteristics of the text version of the work (e.g., the language in which the text version of the work is written, the particular words used in the text version of the work, the grammar and punctuation used in the text version of the work, the manner in which the text version of the work is structured, etc.) and characteristics extrinsic to the work (e.g., the time period in which the work was created, the genre to which the work belongs, the author of the work, etc.).

Different works may have significantly different textual contexts. For example, the grammar used in a classic English novel may differ greatly from the grammar of modern poetry. Thus, although a certain word order may follow the rules of one grammar, the same word order may violate the rules of another grammar. Similarly, the grammar used in both classic English novels and modern poetry may differ from the grammar (or lack thereof) used in text messages sent from one teenager to another.

As mentioned above, one technique described herein automatically creates a fine-granularity mapping between an audio version of a work and a text version of the same work by performing a speech-to-text conversion of the audio version of the work. In an embodiment, the textual context of the work is used to increase the accuracy of the speech-to-text analysis performed on the audio version of the work. For example, to determine the grammar used in a work, the speech-to-text analyzer (or another process) may analyze the text version of the work before performing the speech-to-text analysis. The speech-to-text analyzer may then use the grammar information obtained in this manner to increase the accuracy of the speech-to-text analysis of the audio version of the work.

Instead of, or in addition to, automatically determining the grammar of a work based on the text version of the work, a user may provide input that identifies one or more grammar rules followed by the author of the work. Rules associated with the identified grammar are input to the speech-to-text analyzer to assist the analyzer in recognizing words in the audio version of the work.

Restricting the candidate dictionary based on the text version

Normally, a speech-to-text analyzer must be configured or designed to recognize virtually every word in the English language and, optionally, some words in other languages. Consequently, the speech-to-text analyzer must have access to a large dictionary of words. The dictionary from which a speech-to-text analyzer selects words during a speech-to-text operation is referred to herein as the analyzer's "candidate dictionary." The number of unique words in a typical candidate dictionary is approximately 500,000.

In an embodiment, text from the text version of a work is taken into account when performing a speech-to-text analysis of the audio version of the work. Specifically, in one embodiment, during the speech-to-text analysis of the audio version of a work, the candidate dictionary used by the speech-to-text analyzer is limited to the particular set of words that are in the text version of the work. In other words, the only words considered as "candidates" during the speech-to-text operation performed on the audio version of the work are those words that actually appear in the text version of the work.

By limiting the candidate dictionary used in the speech-to-text translation of a particular work to those words that appear in the text version of the work, the speech-to-text operation may be significantly improved. For example, assume that the number of unique words in a particular work is 20,000. A conventional speech-to-text analyzer may have difficulty determining which particular word, of a 500,000-word candidate dictionary, a particular portion of the audio corresponds to. However, when only the 20,000 unique words that are in the text version of the work are considered, that same portion of the audio may clearly correspond to one particular word. Thus, with this much smaller dictionary of possible words, the accuracy of the speech-to-text analyzer may be significantly improved.
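The dictionary restriction described above can be sketched as follows (the punctuation-stripping tokenization is an illustrative assumption; a real speech-to-text engine would consume the restricted dictionary internally):

```python
def candidate_dictionary(text_version_words):
    """Restrict the candidate dictionary to the unique words that
    actually appear in the text version of the work."""
    stripped = (w.strip('.,?!;:"').lower() for w in text_version_words)
    return {w for w in stripped if w}

FULL_DICTIONARY_SIZE = 500_000  # typical general-purpose candidate dictionary

text_words = "Why did the chicken cross the road ? The chicken crossed .".split()
candidates = candidate_dictionary(text_words)
# Only words occurring in the work are considered during recognition:
# 7 unique words here, instead of roughly 500,000.
```

Every recognition decision is then made over `candidates` rather than the full 500,000-word dictionary, which is what narrows the search and improves accuracy.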

Restricting the candidate dictionary based on the current location

To improve accuracy, the candidate dictionary may be limited to fewer than all of the words in the text version of the work. In an embodiment, the candidate dictionary is limited to those words found in a particular portion of the text version of the work. For example, during the speech-to-text translation of a work, it is possible to approximately track the "current translation position" of the translation operation relative to the text version of the work. This tracking may be performed, for example, by comparing (a) the text that has been generated thus far during the speech-to-text operation with (b) the text version of the work.

Once the current translation position has been determined, the candidate dictionary may be further limited based on the current translation position. For example, in an embodiment, the candidate dictionary is limited to only those words that appear within the text version of the work after the current translation position. Thus, words that are found before the current translation position, but not after it, are effectively removed from the candidate dictionary. This removal may increase the accuracy of the speech-to-text analyzer, because the smaller the candidate dictionary, the less likely the speech-to-text analyzer is to translate a portion of the audio data into the wrong word.

As another example, before the speech-to-text analysis, the audio book and the digital book may be divided into a number of paragraphs or sections. The audio book may be associated with an audio section mapping, and the digital book may be associated with a text section mapping. For example, the audio section mapping and the text section mapping may identify where each chapter begins or ends. These respective mappings may be used by the speech-to-text analyzer to limit the candidate dictionary. For example, if the speech-to-text analyzer determines, based on the audio section mapping, that the speech-to-text analyzer is analyzing chapter 4 of the audio book, then the speech-to-text analyzer uses the text section mapping to identify chapter 4 of the digital book and limits the candidate dictionary to the words found in chapter 4.

In a related embodiment, the speech-to-text analyzer uses a sliding window that moves as the current translation position moves. As the speech-to-text analyzer analyzes the audio data, the speech-to-text analyzer moves the sliding window "across" the text version of the work. The sliding window indicates two locations within the text version of the work. For example, the boundaries of the sliding window may be (a) the beginning of the paragraph that precedes the current translation position and (b) the end of the third paragraph after the current translation position. The candidate dictionary is limited to only those words that appear between those two locations.

Although specific examples are provided above, the window may span any amount of text within the text version of the work. For example, the window may span an absolute amount of text, such as 60 characters. As another example, the window may span a relative amount of text from the text version of the work, such as 10 words, 3 "lines" of text, 2 sentences, or 1 "page" of text. In the relative-amount scenario, the speech-to-text analyzer may use formatting data within the text version of the work to determine how much of the text version of the work constitutes a line or a page. For example, the text version of the work may include page indicators (e.g., in the form of HTML or XML tags) that indicate, within the content of the text version of the work, the beginning of a page or the end of a page.

In an embodiment, the beginning of the window corresponds to the current translation position. For example, the speech-to-text analyzer maintains a current text position that indicates the most recently matched word in the text version of the work, and maintains a current audio position that indicates the most recently recognized word in the audio data. Unless the narrator (whose voice is reflected in the audio data) misread the text of the text version of the work during the recording, added his or her own content, or skipped portions of the text version of the work, the next word that the speech-to-text analyzer detects in the audio data (i.e., after the current audio position) is most likely the next word in the text version of the work (i.e., after the current text position). Maintaining both positions can significantly increase the accuracy of the speech-to-text translation.
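A sketch of the sliding-window restriction described above, using the example boundaries of one paragraph before the current translation position through the third paragraph after it (paragraph indexing and tokenization here are illustrative assumptions):

```python
def window_candidates(paragraphs, current_paragraph_index):
    """Limit the candidate dictionary to words appearing between
    (a) the start of the paragraph preceding the current translation
    position and (b) the end of the third paragraph after it."""
    start = max(current_paragraph_index - 1, 0)
    end = min(current_paragraph_index + 3, len(paragraphs) - 1)
    words = set()
    for paragraph in paragraphs[start:end + 1]:
        words.update(w.lower() for w in paragraph.split())
    return words

# Six toy "paragraphs"; the current translation position is in paragraph 1,
# so the window spans paragraphs 0 through 4 and excludes paragraph 5.
paragraphs = ["a b", "c d", "e f", "g h", "i j", "k l"]
window = window_candidates(paragraphs, 1)
```

As the current translation position advances, the window (and thus the candidate dictionary) slides forward with it.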

Creating a mapping using audio-to-audio correlation

In an embodiment, a text-to-speech generator and an audio-to-text correlator are used to automatically create a mapping between an audio version of a work and a text version of the work. FIG. 2 is a block diagram depicting these analyzers and the data used to generate the mapping. A text version 210 of a work (such as an EPUB document) is input to a text-to-speech generator 220. The text-to-speech generator 220 may be implemented in software, hardware, or a combination of hardware and software. Whether implemented in software or hardware, the text-to-speech generator 220 may be implemented on a single computing device or may be distributed among multiple computing devices.

The text-to-speech generator 220 generates audio data 230 based on document 210. During generation of the audio data 230, the text-to-speech generator 220 (or another component, not shown) creates an audio-to-document mapping 240. The audio-to-document mapping 240 maps text locations within document 210 to corresponding audio locations within the generated audio data 230.

For example, assume that the text-to-speech generator 220 generates audio data for a word located at position Y within document 210. Assume further that the audio generated for that word is located at position X within audio data 230. To reflect the correlation between the location of the word within document 210 and the location of the corresponding audio within audio data 230, a mapping would be created between position X and position Y.

Because the text-to-speech generator 220 knows, at the time the audio for each word or phrase is generated, where the corresponding word or phrase occurs within document 210, each mapping between corresponding words or phrases can be readily generated.

The audio-to-text correlator 260 accepts the generated audio data 230, an audio book 250, and the audio-to-document mapping 240 as input. The audio-to-text correlator 260 performs two main steps: an audio-to-audio correlation step and a lookup step. For the audio-to-audio correlation step, the audio-to-text correlator 260 compares the generated audio data 230 with the audio book 250 to determine correlations between portions of audio data 230 and portions of audio book 250. For example, for each word represented in audio data 230, the audio-to-text correlator 260 may determine the location of the corresponding word in audio book 250.

The granularity with which audio data 230 is divided for the purpose of establishing correlations may vary from implementation to implementation. For example, a correlation may be established between each word in audio data 230 and each corresponding word in audio book 250. Alternatively, correlations may be established based on fixed-duration intervals (e.g., one mapping for every minute of audio). In yet another alternative, correlations may be established for portions of the audio defined by other criteria, such as at paragraph or chapter boundaries, at significant pauses (e.g., silence longer than 3 seconds), or at other locations based on data in audio book 250 (such as audio markers within audio book 250).

After identifying a correlation between a portion of audio data 230 and a portion of audio book 250, the audio-to-text correlator 260 uses the audio-to-document mapping 240 to identify the text location that corresponds (as indicated in mapping 240) to the audio location within the generated audio data 230. The audio-to-text correlator 260 then associates that text location with the audio location within audio book 250 to create a mapping record in a document-to-audio mapping 270.

For example, assume that a portion of audio book 250 (located at position Z) matches a portion of the generated audio data 230 located at position X. Based on the mapping record (in the audio-to-document mapping 240) that relates position X to position Y within document 210, a mapping record is created, in the document-to-audio mapping 270, that relates position Z of audio book 250 to position Y within document 210.

音訊至文字相關器260針對音訊資料230之每一部分重複地執行音訊至音訊相關及查找步驟。因此，文件至音訊映射270包含多個映射記錄，每一映射記錄將文件210內之位置映射至音訊書250內的位置。 Audio-to-text correlator 260 repeatedly performs the audio-to-audio correlation and lookup steps for each portion of audio material 230. Thus, document-to-audio mapping 270 contains multiple mapping records, each of which maps a position within document 210 to a position within audio book 250.

在實施例中，針對音訊資料230之每一部分的音訊至音訊相關立即繼之以音訊之彼部分的查找步驟。因此，在進行至音訊資料230之下一部分之前，可針對音訊資料230之每一部分建立文件至音訊映射270。或者，在執行任何查找步驟之前，可針對音訊資料230之許多或所有部分執行音訊至音訊相關步驟。在已建立所有音訊至音訊相關之後，可批量執行針對所有部分之查找步驟。 In an embodiment, the audio-to-audio correlation for each portion of audio material 230 is immediately followed by the lookup step for that portion of the audio. Thus, a record of document-to-audio mapping 270 may be created for each portion of audio material 230 before proceeding to the next portion of audio material 230. Alternatively, the audio-to-audio correlation step may be performed for many or all portions of audio material 230 before any lookup step is performed. After all the audio-to-audio correlations have been established, the lookup steps for all portions may be performed in a batch.

映射細微度 Mapping granularity

映射具有數個屬性，該等屬性中之一者為映射之大小，映射之大小指代映射中之映射記錄的數目。映射之另一屬性為映射之「細微度」。映射之「細微度」指代映射中之映射記錄相對於數位著作之大小的數目。因此，映射之細微度可在數位著作間變化。舉例而言，包含200「頁」之數位書的第一映射包括僅針對數位書中之每一段落的映射記錄。因此，第一映射可包含1000個映射記錄。另一方面，包含20頁之數位「兒童」書的第二映射包括針對兒童書中之每一單字的映射記錄。因此，第二映射可包含800個映射記錄。儘管第一映射相較於第二映射包含更多映射記錄，但第二映射之細微度相較於第一映射之細微度為較精細的。 A mapping has several attributes, one of which is the size of the mapping, which refers to the number of mapping records in the mapping. Another attribute of a mapping is its "granularity". The "granularity" of a mapping refers to the number of mapping records in the mapping relative to the size of the digital work. Thus, granularity may vary from one mapping to another. For example, a first mapping for a digital book comprising 200 "pages" includes a mapping record only for each paragraph of the digital book. Thus, the first mapping may contain 1,000 mapping records. On the other hand, a second mapping for a 20-page digital "children's" book includes a mapping record for each word in the children's book. Thus, the second mapping may contain 800 mapping records. Although the first mapping contains more mapping records than the second mapping, the granularity of the second mapping is finer than the granularity of the first mapping.

在實施例中，可基於至產生映射之話音至文字分析器的輸入來規定映射之細微度。舉例而言，使用者可在使得話音至文字分析器產生映射之前指定特定細微度。特定細微度之非限制性實例包括：- 單字細微度(亦即，針對每一單字之關聯)，- 句子細微度(亦即，針對每一句子之關聯)，- 段落細微度(亦即，針對每一段落之關聯)，- 10字細微度(亦即，針對數位著作中之每10個單字部分的映射)，及- 10秒細微度(亦即，針對每10秒音訊之映射)。 In an embodiment, the granularity of the mapping may be specified based on input to the speech-to-text analyzer that generates the mapping. For example, a user may specify a particular granularity before causing the speech-to-text analyzer to generate the mapping. Non-limiting examples of specific granularities include: - word granularity (i.e., an association for each word), - sentence granularity (i.e., an association for each sentence), - paragraph granularity (i.e., an association for each paragraph), - 10-word granularity (i.e., a mapping for each 10-word portion of the digital work), and - 10-second granularity (i.e., a mapping for every 10 seconds of audio).
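As a rough illustration of how a granularity setting changes the way the text side of a mapping is segmented, the sketch below splits text at word, sentence, or paragraph granularity. The splitting rules are simplified assumptions; a real speech-to-text analyzer would use much more robust tokenization.

```python
import re

def segment(text, granularity):
    """Split text into the units that would each receive one mapping
    record at the requested granularity (illustrative only)."""
    if granularity == "word":
        return text.split()
    if granularity == "sentence":
        # Naive rule: split after terminal punctuation followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if granularity == "paragraph":
        # Naive rule: paragraphs are separated by blank lines.
        return [p for p in text.split("\n\n") if p.strip()]
    raise ValueError(f"unsupported granularity: {granularity}")

sample = "One two. Three four!\n\nFive six."
print(segment(sample, "sentence"))
```

A finer granularity yields more segments and therefore more mapping records for the same work, which matches the 200-page versus 20-page comparison above.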

作為另一實例，使用者可指定數位著作之類型(例如，小說、兒童書、短篇故事)，且話音至文字分析器(或另一處理程序)基於著作之類型來判定細微度。舉例而言，兒童書可與單字細微度相關聯，而小說可與句子細微度相關聯。 As another example, a user may specify the type of the digital work (e.g., novel, children's book, short story), and the speech-to-text analyzer (or another process) determines the granularity based on the type of the work. For example, a children's book may be associated with word granularity, while a novel may be associated with sentence granularity.

映射之細微度可甚至在同一數位著作內變化。舉例而言，數位書之前三章的映射可具有句子細微度，而該數位書之剩餘章的映射具有單字細微度。 The granularity of a mapping may vary even within the same digital work. For example, the mapping for the first three chapters of a digital book may have sentence granularity, while the mapping for the remaining chapters of the digital book has word granularity.

文字至音訊轉變期間之運作中映射產生 On-the-fly mapping generation during a text-to-audio transition

在許多狀況下，儘管音訊至文字映射將在使用者需要依靠一映射之前產生，但在一實施例中，在執行時間或在使用者已開始在使用者之器件上取用音訊資料及/或文字資料之後，產生音訊至文字映射。舉例而言，使用者使用平板電腦閱讀數位書之文字版本。平板電腦追蹤平板電腦已顯示給使用者的數位書最新近頁或章節。藉由「文字書籤」來識別最新近頁或章節。 In many situations, the audio-to-text mapping will be generated before a user ever needs to rely on a mapping. In one embodiment, however, the audio-to-text mapping is generated at runtime, or after the user has already begun to consume the audio data and/or the text data on the user's device. For example, a user uses a tablet computer to read the text version of a digital book. The tablet computer keeps track of the most recent page or section of the digital book that the tablet computer has displayed to the user. The most recent page or section is identified by a "text bookmark".

稍後，使用者選擇播放同一著作之音訊書版本。播放器件可為使用者閱讀數位書之同一平板電腦或另一器件。無關於將播放音訊書之器件，擷取文字標籤，且關於音訊書之至少一部分執行話音至文字分析。在話音至文字分析期間，「臨時」映射記錄產生以建立所產生文字與音訊書內之相應位置之間的相關。 Later, the user chooses to play the audio book version of the same work. The playback device may be the same tablet computer on which the user read the digital book, or another device. Regardless of which device will play the audio book, the text bookmark is retrieved, and a speech-to-text analysis is performed on at least a portion of the audio book. During the speech-to-text analysis, "temporary" mapping records are generated to establish correlations between the generated text and corresponding positions within the audio book.

一旦已產生文字及相關記錄，則文字至文字比較用以判定對應於文字書籤之所產生文字。接著，臨時映射記錄用以識別音訊書之對應於所產生文字之部分的部分，所產生文字之該部分對應於文字書籤。音訊書之播放接著自彼位置起始。 Once the text and the temporary mapping records have been generated, a text-to-text comparison is used to determine which of the generated text corresponds to the text bookmark. The temporary mapping records are then used to identify the portion of the audio book that corresponds to that portion of the generated text, i.e., the portion of the generated text that corresponds to the text bookmark. Playback of the audio book then begins at that position.

可將執行話音至文字分析的音訊書部分限於對應於文字書籤之部分。舉例而言，指示音訊書之某些部分開始及/或結束於何處的音訊章節映射可能已存在。舉例而言，音訊章節映射可指示每一章開始於何處，一或多個頁開始於何處等。此音訊章節映射可有助於判定在何處開始話音至文字分析，使得不需要執行對整個音訊書之話音至文字分析。舉例而言，若文字書籤指示數位書之第12章內之位置，且與音訊資料相關聯之音訊章節映射識別第12章在音訊資料中於何處開始，則不需要對音訊書之前11章中的任一者執行話音至文字分析。舉例而言，音訊資料可由20個音訊檔案組成，每一章有一個音訊檔案。因此，僅將對應於第12章之音訊檔案輸入至話音至文字分析器。 The portion of the audio book on which the speech-to-text analysis is performed may be limited to the portion that corresponds to the text bookmark. For example, an audio chapter mapping that indicates where certain portions of the audio book begin and/or end may already exist. For example, the audio chapter mapping may indicate where each chapter begins, where one or more pages begin, and so on. Such an audio chapter mapping can help determine where to begin the speech-to-text analysis, so that the speech-to-text analysis need not be performed on the entire audio book. For example, if the text bookmark indicates a position within chapter 12 of the digital book, and an audio chapter mapping associated with the audio data identifies where chapter 12 begins in the audio data, then the speech-to-text analysis need not be performed on any of the first 11 chapters of the audio book. For example, the audio data may consist of 20 audio files, one audio file per chapter. In that case, only the audio file corresponding to chapter 12 is input to the speech-to-text analyzer.
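The chapter-limiting optimization described above can be sketched as follows, assuming (as in the example) that the audio data consists of one audio file per chapter. The file-naming scheme and function name are hypothetical.

```python
def files_to_analyze(bookmark_chapter, audio_files_by_chapter):
    """Pick only the audio file(s) covering the chapter that contains
    the text bookmark, so the rest of the audiobook is never analyzed."""
    return [audio_files_by_chapter[bookmark_chapter]]

# Hypothetical 20-chapter audiobook, one file per chapter:
files = {ch: f"chapter_{ch:02d}.m4a" for ch in range(1, 21)}
print(files_to_analyze(12, files))  # ['chapter_12.m4a']
```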

音訊至文字轉變期間之運作中映射產生 On-the-fly mapping generation during an audio-to-text transition

可在運作中產生映射記錄以促進音訊至文字轉變以及文字至音訊轉變。舉例而言，假設使用者正使用智慧型電話聽取音訊書。智慧型電話追蹤音訊書內之正播放的當前位置。當前位置係藉由「音訊書籤」來識別。稍後，使用者拾取平板電腦且選擇音訊書之數位書版本以供顯示。平板電腦接收音訊書籤(例如，自相對於平板電腦及智慧型電話為遠端的中央伺服器)、執行音訊書之至少一部分的話音至文字分析，且識別音訊書內之對應於音訊書之文字版本內的一文字部分的部分，音訊書內之該部分對應於音訊書籤。平板電腦接著開始顯示文字版本內之所識別部分。 Mapping records may be generated on the fly to facilitate audio-to-text transitions as well as text-to-audio transitions. For example, suppose a user is listening to an audio book using a smartphone. The smartphone keeps track of the current position, within the audio book, that is being played. The current position is identified by an "audio bookmark". Later, the user picks up a tablet computer and selects the digital book version of the audio book for display. The tablet computer receives the audio bookmark (e.g., from a central server that is remote relative to the tablet computer and the smartphone), performs a speech-to-text analysis on at least a portion of the audio book, and identifies the portion of the text version of the work that corresponds to the portion of the audio book identified by the audio bookmark. The tablet computer then begins displaying the identified portion of the text version.

可將執行話音至文字分析的音訊書部分限於對應於音訊書籤之部分。舉例而言，對音訊書之一部分執行話音至文字分析，該部分橫越在音訊書中之音訊書籤之前的一或多個時間區段(例如，秒)及/或音訊書中之音訊書籤之後的一或多個時間區段。比較藉由對彼部分之話音至文字分析所產生的文字與文字版本中之文字，以定位所產生文字中之該等系列單字或片語在何處與文字版本中之文字匹配。 The portion of the audio book on which the speech-to-text analysis is performed may be limited to the portion that corresponds to the audio bookmark. For example, the speech-to-text analysis is performed on a portion of the audio book that spans one or more segments of time (e.g., seconds) before the audio bookmark in the audio book and/or one or more segments of time after the audio bookmark. The text produced by the speech-to-text analysis of that portion is compared with the text of the text version in order to locate where the series of words or phrases in the generated text matches the text in the text version.

若存在指示文字版本之某些部分開始或結束於何處之文字章節映射，且音訊書籤可用以識別文字章節映射中之章節，則不需要分析文字版本之大部分以便定位所產生文字中的該等系列單字或片語在何處與文字版本中之文字匹配。舉例而言，若音訊書籤指示音訊書之第3章內的位置，且與數位書相關聯之文字章節映射識別第3章在文字版本中於何處開始，則不需要對音訊書之前兩章中之任一者或對音訊書之第3章之後的章中之任一者執行話音至文字分析。 If a text chapter mapping exists that indicates where certain portions of the text version begin or end, and the audio bookmark can be used to identify a chapter in the text chapter mapping, then most of the text version need not be analyzed in order to locate where the series of words or phrases in the generated text matches the text in the text version. For example, if the audio bookmark indicates a position within chapter 3 of the audio book, and a text chapter mapping associated with the digital book identifies where chapter 3 begins in the text version, then the speech-to-text analysis need not be performed on either of the first two chapters of the audio book or on any of the chapters after chapter 3 of the audio book.
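Locating where the generated word series matches the text version can be sketched as a naive substring search. A production implementation would need to tolerate recognition errors, punctuation, and case differences; all names below are illustrative assumptions.

```python
def locate_generated_text(generated_words, text_version, search_start=0):
    """Find where the word series produced by speech-to-text matches the
    text version. Returns a character offset, or -1 if no match (naive:
    exact, case-insensitive substring search)."""
    needle = " ".join(generated_words).lower()
    haystack = text_version.lower()
    return haystack.find(needle, search_start)

pos = locate_generated_text(["quick", "brown", "fox"],
                            "The quick brown fox jumps.")
print(pos)  # 4
```

With a text chapter mapping as described above, `search_start` could be set to the beginning of the relevant chapter so that only that chapter is scanned.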

音訊至文字映射之使用的綜述 Overview of the use of audio to text mapping

根據一方法，映射(不管手動建立抑或自動建立)用以識別數位著作之音訊版本(例如，音訊書)內的對應於該數位著作之文字版本(例如，電子書)內之位置的位置。舉例而言，映射可用以基於建立於音訊書中之「書籤」識別電子書內之位置。作為另一實例，映射可用以隨著朗讀文字之人員的音訊錄音正在播放而識別哪一所顯示文字對應於該音訊錄音，且使得所識別文字經反白顯示。因此，在正播放音訊書之同時，由於電子書閱讀器使相應文字反白顯示，因此電子書閱讀器之使用者可跟隨。作為另一實例，映射可用以識別音訊資料中之位置，且回應於自電子書選擇所顯示文字之輸入而播放彼位置處的音訊。因此，使用者可選擇電子書中之單字，此選擇使得對應於彼單字之音訊被播放。作為另一實例，使用者可建立註解同時「取用」(例如，閱讀或聽取)數位著作之一版本(例如，電子書)，且使得註解被取用同時使用者正取用數位著作的另一版本(例如，音訊書)。因此，使用者可對電子書之「頁」作出註釋，且可檢視彼等註釋同時聽取電子書的音訊書。類似地，使用者可作出註釋同時聽取音訊書，且接著可在閱讀相應電子書時檢視彼註釋。 According to one approach, a mapping (whether created manually or automatically) is used to identify a position within the audio version of a digital work (e.g., an audio book) that corresponds to a position within the text version of that digital work (e.g., an e-book). For example, the mapping may be used to identify a position within the e-book based on a "bookmark" established in the audio book. As another example, the mapping may be used to identify, while an audio recording of a person reading the text aloud is playing, which displayed text corresponds to the audio recording, and to cause the identified text to be highlighted. Thus, while the audio book is playing, the user of an e-book reader can follow along as the e-book reader highlights the corresponding text. As another example, the mapping may be used to identify a position in the audio data and, in response to input that selects displayed text from the e-book, play the audio at that position. Thus, a user may select a word in the e-book, and the selection causes the audio that corresponds to that word to be played. As another example, a user may create an annotation while "consuming" (e.g., reading or listening to) one version of a digital work (e.g., the e-book) and have the annotation be consumable while the user is consuming another version of the digital work (e.g., the audio book). Thus, a user may make notes on a "page" of the e-book and may view those notes while listening to the audio book of the same work. Similarly, a user may make a note while listening to the audio book and may then view that note while reading the corresponding e-book.

圖3為描繪根據本發明之實施例的用於在此等情境中之一或多者下使用映射之處理程序的流程圖。 FIG. 3 is a flow diagram depicting a process for using a mapping in one or more of these scenarios, according to an embodiment of the invention.

在步驟310處，獲得指示第一媒體項目內之指定位置的位置資料。第一媒體項目可為著作之文字版本或對應於該著作之文字版本的音訊資料。此步驟可藉由取用第一媒體項目之器件(藉由使用者操作)執行。或者，該步驟可藉由相對於取用第一媒體項目之器件遠端地定位的伺服器執行。因此，器件使用通信協定經由網路將位置資料發送至伺服器。 At step 310, position data that indicates a specified position within a first media item is obtained. The first media item may be the text version of a work or audio data that corresponds to the text version of the work. This step may be performed by the device (operated by a user) that consumes the first media item. Alternatively, this step may be performed by a server that is located remotely relative to the device that consumes the first media item. In that case, the device sends the position data over a network to the server using a communication protocol.

在步驟320處，檢查映射以判定對應於指定位置之第一媒體位置。類似地，此步驟可藉由取用第一媒體項目之器件或藉由相對於該器件遠端地定位的伺服器執行。 At step 320, the mapping is inspected to determine a first media position that corresponds to the specified position. Similarly, this step may be performed by the device that consumes the first media item or by a server located remotely relative to that device.

在步驟330處，判定對應於第一媒體位置且指示於映射中的第二媒體位置。舉例而言，若指定位置為音訊「書籤」，則第一媒體位置為指示於映射中之音訊位置，且第二媒體位置為與映射中之音訊位置相關聯的文字位置。類似地，舉例而言，若指定位置為文字「書籤」，則第一媒體位置為指示於映射中之文字位置，且第二媒體位置為與映射中之文字位置相關聯的音訊位置。 At step 330, a second media position that corresponds to the first media position and that is indicated in the mapping is determined. For example, if the specified position is an audio "bookmark", then the first media position is an audio position indicated in the mapping, and the second media position is the text position that is associated, in the mapping, with that audio position. Similarly, if the specified position is a text "bookmark", then the first media position is a text position indicated in the mapping, and the second media position is the audio position that is associated, in the mapping, with that text position.

在步驟340處，基於第二媒體位置處理第二媒體項目。舉例而言，若第二媒體項目為音訊資料，則第二媒體位置為音訊位置且用作音訊資料中的當前播放位置。作為另一實例，若第二媒體項目為著作之文字版本，則第二媒體位置為文字位置且用以判定顯示著作之文字版本的哪一部分。 At step 340, the second media item is processed based on the second media position. For example, if the second media item is audio data, then the second media position is an audio position and is used as the current playback position in the audio data. As another example, if the second media item is the text version of a work, then the second media position is a text position and is used to determine which portion of the text version of the work to display.
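Steps 310 through 340 can be sketched in miniature as a single lookup that works in either direction. The mapping is assumed here to be a dictionary of audio positions (seconds) to text positions (character offsets), which is only one possible encoding of the mapping records.

```python
def switch_position(mapping, specified_position, from_audio):
    """Given a bookmark in one media version, look up the corresponding
    position in the other version (illustrative sketch of steps 310-340)."""
    if from_audio:   # audio bookmark -> text position
        pairs = sorted(mapping.items())                # (audio_pos, text_pos)
    else:            # text bookmark -> audio position
        pairs = sorted((t, a) for a, t in mapping.items())
    # Step 320/330: pick the mapped position at or just before the bookmark.
    candidate = None
    for first, second in pairs:
        if first <= specified_position:
            candidate = second
        else:
            break
    return candidate  # step 340 would seek/scroll to this position

m = {0.0: 0, 60.0: 950, 120.0: 1875}   # audio seconds -> text offsets
print(switch_position(m, 70.0, from_audio=True))   # 950
```

The "at or just before" rule mirrors the bookmark-switching examples later in this description; a different tie-breaking rule (nearest position) would also be consistent with the text.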

下文提供在特定情境下使用處理程序300之實例。 Examples of using process 300 in specific scenarios are provided below.

架構綜述 Architecture overview

上文所提及且下文詳細描述之實例情境中的每一者可涉及一或多個計算器件。圖4為根據本發明之實施例的可用以實施本文中所描述之處理程序中之一些的實例系統400之方塊圖。系統400包括終端使用者器件410、中間器件420及終端使用者器件430。終端使用者器件410及430之非限制性實例包括桌上型電腦、膝上型電腦、智慧型電話、平板電腦及其他手持式計算器件。 Each of the example scenarios mentioned above, and described in detail below, may involve one or more computing devices. FIG. 4 is a block diagram of an example system 400 that may be used to implement some of the processes described herein, according to an embodiment of the invention. System 400 includes end-user device 410, intermediate device 420, and end-user device 430. Non-limiting examples of end-user devices 410 and 430 include desktop computers, laptop computers, smartphones, tablet computers, and other handheld computing devices.

如圖4中所描繪,器件410儲存數位媒體項目402,且執行文字媒體播放器412及音訊媒體播放器414。文字媒體播放器412經組態以處理電子文字資料,且使得器件410顯示文字(例如,在器件410之未圖示的觸控式螢幕上)。因此,若數位媒體項目402為電子書,則文字媒體播放器412可經組態以處理數位媒體項目402,只要數位媒體項目402係呈文字媒體播放器412經組態以處理之文字格式即可。器件410可執行經組態以處理其他類型之媒體(諸如,視訊)的一或多個其他媒體播放器(未圖示)。 As depicted in FIG. 4, device 410 stores digital media item 402 and executes text media player 412 and audio media player 414. Text media player 412 is configured to process electronic text material and cause device 410 to display text (e.g., on a touch screen (not shown) of device 410). Thus, if the digital media item 402 is an e-book, the text media player 412 can be configured to process the digital media item 402 as long as the digital media item 402 is in a text format that the text media player 412 is configured to process. . Device 410 can execute one or more other media players (not shown) that are configured to process other types of media, such as video.

類似地,音訊媒體播放器414經組態以處理音訊資料,且使得器件410產生音訊(例如,經由器件410上之未圖示的揚聲器)。因此,若數位媒體項目402為音訊書,則音訊媒體播放器414可經組態以處理數位媒體項目402,只要數位媒體項目402係呈音訊媒體播放器414經組態以處理之音訊格式即可。不管項目402為電子書抑或音訊書,項目402可包含多個檔案,不管為音訊檔案抑或文字檔案。 Similarly, audio media player 414 is configured to process audio material and cause device 410 to generate audio (e.g., via a speaker (not shown) on device 410). Thus, if the digital media item 402 is an audio book, the audio media player 414 can be configured to process the digital media item 402 as long as the digital media item 402 is in the audio format that the audio media player 414 is configured to process. . Regardless of whether the item 402 is an e-book or an audio book, the item 402 can include multiple files, whether it is an audio file or a text file.

器件430類似地儲存數位媒體項目404，且執行經組態以處理音訊資料並使得器件430產生音訊的音訊媒體播放器432。器件430可執行經組態以處理其他類型之媒體(諸如，視訊及文字)的一或多個其他媒體播放器(未圖示)。 Device 430 similarly stores digital media item 404 and executes audio media player 432, which is configured to process audio data and cause device 430 to generate audio. Device 430 may execute one or more other media players (not shown) that are configured to process other types of media, such as video and text.

中間器件420儲存映射406，映射406將音訊資料內之音訊位置映射至文字資料中之文字位置。舉例而言，映射406可將數位媒體項目404內之音訊位置映射至數位媒體項目402內的文字位置。儘管圖4中未描繪，但中間器件420可儲存許多映射，音訊資料及文字資料的每一相應集合有一映射。又，中間器件420可與未圖示之許多終端使用者器件互動。 Intermediate device 420 stores mapping 406, which maps audio positions within audio data to text positions within text data. For example, mapping 406 may map audio positions within digital media item 404 to text positions within digital media item 402. Although not depicted in FIG. 4, intermediate device 420 may store many mappings, one for each corresponding set of audio data and text data. Also, intermediate device 420 may interact with many end-user devices that are not shown.

又，中間器件420可儲存使用者可經由其各別器件存取的數位媒體項目。因此，器件(例如，器件430)可向中間器件420請求數位媒體項目，而非儲存數位媒體項目的本端複本。 Also, intermediate device 420 may store digital media items that users may access via their respective devices. Thus, a device (e.g., device 430) may request a digital media item from intermediate device 420 rather than storing a local copy of the digital media item.

另外，中間器件420可儲存使使用者之一或多個器件與單一帳戶相關聯的帳戶資料。因此，此帳戶資料可指示藉由同一使用者以同一帳戶登記器件410及430。中間器件420亦可儲存使帳戶與特定使用者所擁有(或購買)之一或多個數位媒體項目相關聯的帳戶-項目關聯資料。因此，中間器件420可藉由以下操作來驗證器件430可存取特定數位媒體項目：判定帳戶-項目關聯資料是否指示器件430及特定數位媒體項目與同一帳戶相關聯。 In addition, intermediate device 420 may store account data that associates one or more devices of a user with a single account. Thus, the account data may indicate that devices 410 and 430 were registered, by the same user, under the same account. Intermediate device 420 may also store account-item association data that associates an account with one or more digital media items owned (or purchased) by a particular user. Thus, intermediate device 420 may verify that device 430 may access a particular digital media item by determining whether the account-item association data indicates that device 430 and the particular digital media item are associated with the same account.

儘管僅描繪兩個終端使用者器件，但終端使用者可擁有且操作取用數位媒體項目(諸如，電子書及音訊書)的更多或更少之器件。類似地，儘管僅描繪單一中間器件420，但擁有且操作中間器件420之實體可操作多個器件，該多個器件中之每一者提供相同服務或可一起操作以向終端使用者器件410及430的使用者提供服務。 Although only two end-user devices are depicted, an end user may own and operate more or fewer devices that consume digital media items (such as e-books and audio books). Similarly, although only a single intermediate device 420 is depicted, the entity that owns and operates intermediate device 420 may operate multiple devices, each of which may provide the same service or which may operate together to provide a service to the users of end-user devices 410 and 430.

經由網路440使中間器件420與終端使用者器件410及430之間的通信為可能的。網路440可藉由提供各種計算器件之間的資料交換之任何媒體或機構實施。此網路之實例包括(但不限於)諸如區域網路(LAN)、廣域網路(WAN)、乙太網路或網際網路的網路,或一或多個陸地、衛星或無線鏈路。網路可包括諸如所描述之彼等網路的網路之組合。網路可根據傳輸控制協定(TCP)、使用者資料報協定(UDP)及/或網際網路協定(IP)傳輸資料。 Communication between the intermediate device 420 and the end user devices 410 and 430 is possible via the network 440. Network 440 can be implemented by any medium or mechanism that provides for the exchange of data between various computing devices. Examples of such networks include, but are not limited to, networks such as regional networks (LANs), wide area networks (WANs), Ethernet or the Internet, or one or more terrestrial, satellite or wireless links. The network may include a combination of networks such as the networks described. The network can transmit data in accordance with Transmission Control Protocol (TCP), User Datagram Protocol (UDP), and/or Internet Protocol (IP).

映射之儲存位置 Map storage location

可與產生映射所自之文字資料及音訊資料單獨地儲存映射。舉例而言，如圖4中所描繪，與數位媒體項目402及404單獨地儲存映射406，即使映射406可用以基於一數位媒體項目中的媒體位置來識別另一數位媒體項目中的媒體位置亦如此。實際上，映射406儲存於與分別儲存數位媒體項目402及404的器件410及430分離之計算器件(中間器件420)上。 A mapping may be stored separately from the text data and the audio data from which the mapping was generated. For example, as depicted in FIG. 4, mapping 406 is stored separately from digital media items 402 and 404, even though mapping 406 may be used to identify a media position in one digital media item based on a media position in the other digital media item. Indeed, mapping 406 is stored on a computing device (intermediate device 420) that is separate from devices 410 and 430, which store digital media items 402 and 404, respectively.

另外或或者，映射可作為相應文字資料之部分被儲存。舉例而言，映射406可儲存於數位媒體項目402中。然而，即使映射作為文字資料之部分被儲存，映射仍可能並不顯示給取用文字資料之終端使用者。再另外或或者，映射可作為音訊資料之部分被儲存。舉例而言，映射406可儲存於數位媒體項目404中。 Additionally or alternatively, a mapping may be stored as part of the corresponding text data. For example, mapping 406 may be stored in digital media item 402. However, even if a mapping is stored as part of the text data, the mapping might not be displayed to an end user who consumes the text data. Still additionally or alternatively, a mapping may be stored as part of the audio data. For example, mapping 406 may be stored in digital media item 404.

書籤切換 Bookmark switching

「書籤切換」指代在數位著作之一版本中建立指定位置(或「書籤」)及使用書籤來找到數位著作之另一版本內的相應位置。存在兩種類型之書籤切換：文字至音訊(TA)書籤切換及音訊至文字(AT)書籤切換。TA書籤切換涉及使用建立於電子書中之文字書籤以識別音訊書中的相應音訊位置。相反，本文中稱為AT書籤切換之另一類型之書籤切換涉及使用建立於音訊書中的音訊標籤來識別電子書內的相應文字位置。 "Bookmark switching" refers to establishing a specified position (or "bookmark") in one version of a digital work and using the bookmark to find the corresponding position within another version of the digital work. There are two types of bookmark switching: text-to-audio (TA) bookmark switching and audio-to-text (AT) bookmark switching. TA bookmark switching involves using a text bookmark established in an e-book to identify the corresponding audio position in an audio book. Conversely, the other type of bookmark switching, referred to herein as AT bookmark switching, involves using an audio bookmark established in an audio book to identify the corresponding text position within an e-book.

文字至音訊書籤切換 Text-to-audio bookmark switching

圖5A為描繪根據本發明之實施例的用於TA書籤切換之處理程序500的流程圖。使用描繪於圖4中之系統400的元件來描述圖5A。 FIG. 5A is a flow diagram depicting a process 500 for TA bookmark switching in accordance with an embodiment of the present invention. Figure 5A is described using elements of system 400 depicted in Figure 4.

在步驟502處,文字媒體播放器412(例如,電子閱讀器)判定數位媒體項目402(例如,數位書)內的文字書籤。器件410將來自數位媒體項目402之內容顯示給器件410的使用者。 At step 502, text media player 412 (e.g., an e-reader) determines a text bookmark within digital media item 402 (e.g., a digital book). Device 410 displays the content from digital media item 402 to the user of device 410.

可回應於來自使用者之輸入來判定文字書籤。舉例而言，使用者可觸碰器件410之觸控式螢幕上的區域。器件410之顯示器在該區域處或靠近該區域顯示一或多個單字。回應於輸入，文字媒體播放器412判定最接近區域之一或多個單字。文字媒體播放器412基於所判定之一或多個單字判定文字書籤。 The text bookmark may be determined in response to input from the user. For example, the user may touch an area on the touch screen of device 410. The display of device 410 displays one or more words at or near that area. In response to the input, text media player 412 determines the one or more words that are closest to the area. Text media player 412 determines the text bookmark based on the determined one or more words.

或者，可基於顯示給使用者之最後文字資料判定文字書籤。舉例而言，數位媒體項目402可包含200個電子「頁」，且頁110為所顯示之最後頁。文字媒體播放器412判定頁110為所顯示之最後頁。文字媒體播放器412可將頁110建立為文字書籤，或可將頁110之開始處的點建立為文字書籤，此係由於可能不存在知曉使用者在何處停止閱讀的方式。假設使用者至少閱讀頁109上之最後句子可為安全的，該句子可能已在頁109或頁110上結束。因此，文字媒體播放器412可將下一句子(其在頁110上開始)之開始建立為文字書籤。然而，若映射之細微度係處於段落層級，則文字媒體播放器412可建立頁109上之最後段落的開始。類似地，若映射之細微度係處於章節層級，則文字媒體播放器412可將包括頁110之章的開始建立為文字書籤。 Alternatively, the text bookmark may be determined based on the last text data displayed to the user. For example, digital media item 402 may comprise 200 electronic "pages", and page 110 is the last page that was displayed. Text media player 412 determines that page 110 was the last page displayed. Text media player 412 may establish page 110 as the text bookmark, or may establish a point at the beginning of page 110 as the text bookmark, since there may be no way of knowing where the user stopped reading. It may be safe to assume that the user read at least the last sentence that begins on page 109, which sentence may have ended on page 109 or on page 110. Therefore, text media player 412 may establish, as the text bookmark, the beginning of the next sentence (which begins on page 110). However, if the granularity of the mapping is at the paragraph level, then text media player 412 may establish the beginning of the last paragraph on page 109. Similarly, if the granularity of the mapping is at the chapter level, then text media player 412 may establish, as the text bookmark, the beginning of the chapter that includes page 110.

在步驟504處，文字媒體播放器412經由網路440將指示文字書籤之資料發送至中間器件420。中間器件420可儲存與器件410及/或器件410之使用者之帳戶相關聯的文字書籤。在步驟502之前，使用者可能已藉由中間器件420之操作者建立帳戶。使用者接著藉由操作者登記包括器件410之一或多個器件。登記使得該一或多個器件中之每一者與使用者之帳戶相關聯。 At step 504, text media player 412 sends data that indicates the text bookmark over network 440 to intermediate device 420. Intermediate device 420 may store the text bookmark in association with device 410 and/or an account of the user of device 410. Prior to step 502, the user may have established an account with the operator of intermediate device 420. The user then registers one or more devices, including device 410, with the operator. The registration causes each of the one or more devices to be associated with the user's account.

一或多個因素可使得文字媒體播放器412將文字書籤發送至中間器件420。此等因素可包括文字媒體播放器412之退出(或停機)、文字書籤藉由使用者的建立，或藉由使用者進行以保存文字書籤以用於在聽取對應於著作之文字版本的音訊書時使用的顯式指令，文字書籤係針對該文字版本而建立。 One or more factors may cause text media player 412 to send the text bookmark to intermediate device 420. Such factors may include the exiting (or shutting down) of text media player 412, the establishment of the text bookmark by the user, or an explicit instruction by the user to save the text bookmark for use when listening to the audio book that corresponds to the text version of the work for which the text bookmark was established.

如先前所提到，中間器件420存取(例如，儲存)映射406，在此實例中，映射406映射數位媒體項目404中之多個音訊位置與數位媒體項目402內的多個文字位置。 As previously mentioned, intermediate device 420 accesses (e.g., stores) mapping 406, which, in this example, maps multiple audio positions in digital media item 404 to multiple text positions within digital media item 402.

在步驟506處，中間器件420檢查映射406以判定多個文字位置中之對應於文字書籤的特定文字位置。文字書籤可能並非與映射406中之多個文字位置中之任一者準確匹配。然而，中間器件420可選擇最接近文字書籤之文字位置。或者，中間器件420可選擇恰在文字書籤之前的文字位置，該文字位置可能或可能並非最接近文字書籤的文字位置。舉例而言，若文字書籤指示第5章第3段第5句，且映射406中之最接近文字位置為(1)第5章第3段第1句及(2)第5章第3段第6句，則選擇文字位置(1)。 At step 506, intermediate device 420 inspects mapping 406 to determine which of the multiple text positions corresponds to the text bookmark. The text bookmark might not exactly match any of the multiple text positions in mapping 406. However, intermediate device 420 may select the text position that is closest to the text bookmark. Alternatively, intermediate device 420 may select the text position that immediately precedes the text bookmark, which may or may not be the text position closest to the text bookmark. For example, if the text bookmark indicates chapter 5, paragraph 3, sentence 5, and the closest text positions in mapping 406 are (1) chapter 5, paragraph 3, sentence 1 and (2) chapter 5, paragraph 3, sentence 6, then text position (1) is selected.
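The selection rule in the chapter 5 example (pick the mapped text position immediately preceding the bookmark) can be sketched with a binary search. Encoding positions as (chapter, paragraph, sentence) tuples is an illustrative assumption; tuples compare lexicographically, which matches how such positions are ordered.

```python
import bisect

def preceding_position(mapped_positions, bookmark):
    """Return the mapped text position at or immediately before the
    bookmark, falling back to the first position (illustrative sketch)."""
    positions = sorted(mapped_positions)
    i = bisect.bisect_right(positions, bookmark)
    return positions[i - 1] if i else positions[0]

# The example above: bookmark at chapter 5, paragraph 3, sentence 5.
mapped = [(5, 3, 1), (5, 3, 6)]
print(preceding_position(mapped, (5, 3, 5)))  # (5, 3, 1)
```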

在步驟508處,一旦識別映射中之特定文字位置,則中間器件420判定映射406中之對應於特定文字位置的特定音訊位置。 At step 508, upon identifying a particular text location in the map, intermediate device 420 determines a particular audio location in map 406 that corresponds to a particular text location.

在步驟510處,中間器件420將特定音訊位置發送至器件430,器件430在此實例中不同於器件410。舉例而言,器件410可為平板電腦,且器件430可為智慧型電話。在相關實施例中,不涉及器件430。因此,中間器件420可將特定音訊位置發送至器件410。 At step 510, intermediate device 420 sends a particular audio location to device 430, which in this example is different than device 410. For example, device 410 can be a tablet and device 430 can be a smart phone. In a related embodiment, device 430 is not involved. Thus, the intermediate device 420 can send a particular audio location to the device 410.

可(亦即)回應於中間器件420判定特定音訊位置而自動執行步驟510。或者，可回應於自器件430接收到器件430正打算處理數位媒體項目404的指示而執行步驟510或步驟506。該指示可為對對應於文字書籤之音訊位置的請求。 Step 510 may be performed automatically, i.e., in response to intermediate device 420 determining the particular audio position. Alternatively, step 510 or step 506 may be performed in response to receiving, from device 430, an indication that device 430 is about to process digital media item 404. The indication may be a request for the audio position that corresponds to the text bookmark.

在步驟512處，音訊媒體播放器432將特定音訊位置建立為數位媒體項目404中之音訊資料的當前播放位置。可回應於自中間器件420接收到特定音訊位置而執行此建立。因為當前播放位置變為特定音訊位置，所以不要求音訊媒體播放器432播放先於音訊資料中之特定音訊位置的音訊中之任一者。舉例而言，若特定音訊位置指示2:56:03(2小時56分及3秒)，則音訊媒體播放器432將音訊資料中之彼時間建立為當前播放位置。因此，若器件430之使用者選擇器件430上之「播放」按鈕(不管為圖形抑或實體)，則音訊媒體播放器432開始處理彼2:56:03標誌處的音訊資料。 At step 512, audio media player 432 establishes the particular audio position as the current playback position of the audio data in digital media item 404. This establishment may be performed in response to receiving the particular audio position from intermediate device 420. Because the current playback position becomes the particular audio position, audio media player 432 is not required to play any of the audio that precedes the particular audio position in the audio data. For example, if the particular audio position indicates 2:56:03 (2 hours, 56 minutes, and 3 seconds), then audio media player 432 establishes that time in the audio data as the current playback position. Thus, if the user of device 430 selects a "play" button (whether graphical or physical) on device 430, then audio media player 432 begins processing the audio data at the 2:56:03 mark.
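Converting an audio position such as 2:56:03 into a numeric playback offset can be sketched as follows; the H:MM:SS string format is assumed from the example above, and the function name is illustrative.

```python
def timestamp_to_seconds(ts):
    """Convert an 'H:MM:SS' audio position such as '2:56:03' into a
    playback offset in seconds."""
    h, m, s = (int(part) for part in ts.split(":"))
    return h * 3600 + m * 60 + s

print(timestamp_to_seconds("2:56:03"))  # 10563
```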

在替代性實施例中，器件410儲存映射406(或映射406之複本)。因此，替代於步驟504至508，文字媒體播放器412檢查映射406以判定多個文字位置中之對應於文字書籤的特定文字位置。接著，文字媒體播放器412判定映射406中之對應於特定文字位置的特定音訊位置。文字媒體播放器412可接著使得特定音訊位置發送至中間器件420，以允許器件430擷取特定音訊位置且將音訊資料中之當前播放位置建立為特定音訊位置。文字媒體播放器412亦可使得特定文字位置(或文字書籤)發送至中間器件420，以允許器件410(或另一未圖示器件)稍後擷取特定文字位置，從而允許在另一器件上執行的另一文字媒體播放器顯示數位媒體項目402的另一複本之一部分(例如，頁)，其中該部分對應於特定文字位置。 In an alternative embodiment, device 410 stores mapping 406 (or a copy of mapping 406). Thus, instead of steps 504 through 508, text media player 412 inspects mapping 406 to determine which of the multiple text positions corresponds to the text bookmark. Text media player 412 then determines the particular audio position in mapping 406 that corresponds to the particular text position. Text media player 412 may then cause the particular audio position to be sent to intermediate device 420, in order to allow device 430 to retrieve the particular audio position and establish the current playback position in the audio data as the particular audio position. Text media player 412 may also cause the particular text position (or the text bookmark) to be sent to intermediate device 420, in order to allow device 410 (or another device, not shown) to later retrieve the particular text position, thereby allowing another text media player, executing on the other device, to display the portion (e.g., the page) of another copy of digital media item 402 that corresponds to the particular text position.

In another alternative embodiment, intermediary device 420 and device 430 are not involved. Thus, steps 504 and 510 are not performed. Instead, device 410 performs all the other steps in FIG. 5A, including steps 506 and 508.
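The single-device lookup described above (steps 506 and 508 performed entirely on device 410) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the mapping contents, the (chapter, paragraph, sentence) key format, and the function name are invented, and a production version would fall back to the nearest preceding mapped position rather than require an exact key.

```python
# Hypothetical mapping keyed by structural text positions, written as
# (chapter, paragraph, sentence) tuples, each paired with an audio
# offset "H:MM:SS" into the audio book. All values are invented.
MAPPING = {
    (3, 2, 3): "2:55:40",
    (3, 2, 4): "2:56:03",
    (3, 3, 1): "2:57:12",
}

def audio_position_for_bookmark(text_bookmark):
    """Sketch of steps 506-508 on one device: look up the audio
    position mapped to the text bookmark. Exact-key lookup keeps the
    sketch short; returns None when the bookmark is not mapped."""
    return MAPPING.get(text_bookmark)
```

The returned offset would then become the current playback position, as in step 512 above.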

Audio-to-Text Bookmark Switching

FIG. 5B is a flow diagram that depicts a process 550 for AT bookmark switching, according to an embodiment of the invention. As with FIG. 5A, FIG. 5B is described using elements of system 400 depicted in FIG. 4.

At step 552, the audio media player 432 determines an audio bookmark within digital media item 404 (e.g., an audio book).

The audio bookmark may be determined in response to input from a user. For example, the user may stop the playback of the audio data, for instance by selecting a "stop" button displayed on a touch screen of device 430. The audio media player 432 determines the location, within the audio data of digital media item 404, that corresponds to where the playback stopped. Thus, the audio bookmark may simply be the last place at which the user stopped listening to the audio generated from digital media item 404. Additionally or alternatively, the user may select one or more graphical buttons on the touch screen of device 430 to establish a specific location within digital media item 404 as the audio bookmark. For example, device 430 displays a timeline that corresponds to the length of the audio data in digital media item 404. The user may select a position on the timeline and then provide one or more additional inputs that the audio media player 432 uses to establish the audio bookmark.

At step 554, device 430 sends data indicating the audio bookmark to intermediary device 420 over network 440. Intermediary device 420 may store the audio bookmark in association with device 430 and/or an account of the user of device 430. Prior to step 552, the user establishes an account with the operator of intermediary device 420. The user then registers, with the operator, one or more devices, including device 430. The registration causes each of the one or more devices to be associated with the user's account.

Intermediary device 420 also has access to (e.g., stores) mapping 406. Mapping 406 maps multiple audio locations in the audio data of digital media item 404 to multiple text locations within the text data of digital media item 402.

One or more factors may cause the audio media player 432 to send the audio bookmark to intermediary device 420. Such factors may include the exiting (or shutting down) of the audio media player 432, the establishment of the audio bookmark by the user, or an explicit instruction by the user to save the audio bookmark, which was established with respect to digital media item 404, for use when displaying portions of the text version (reflected in digital media item 402) of the work to which digital media item 404 corresponds.

At step 556, intermediary device 420 inspects mapping 406 to determine the particular audio location, among the multiple audio locations, that corresponds to the audio bookmark. The audio bookmark might not exactly match any of the multiple audio locations in mapping 406. However, intermediary device 420 may select the audio location that is closest to the audio bookmark. Alternatively, intermediary device 420 may select the audio location that immediately precedes the audio bookmark, which may or may not be the audio location closest to the audio bookmark. For example, if the audio bookmark indicates 02:43:19 (or 2 hours, 43 minutes, and 19 seconds), and the closest audio locations in mapping 406 are (1) 02:41:07 and (2) 02:43:56, then audio location (1) is selected, even though audio location (2) is closest to the audio bookmark.
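The selection rule of step 556 can be sketched as follows, using the 02:43:19 example above. The helper names and the list of mapped audio positions are illustrative assumptions; the rule picks the mapped position at or immediately before the bookmark even when a later mapped position is numerically closer.

```python
from bisect import bisect_right

def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' timestamp to a count of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

# Hypothetical mapping entries (audio positions only, sorted ascending);
# in the patent each entry would also carry a corresponding text location.
AUDIO_POSITIONS = [hms_to_seconds(t) for t in ("02:41:07", "02:43:56")]

def select_audio_position(bookmark_hms):
    """Pick the mapped audio position at or immediately before the
    bookmark, per the alternative rule described for step 556."""
    bookmark = hms_to_seconds(bookmark_hms)
    i = bisect_right(AUDIO_POSITIONS, bookmark) - 1
    return AUDIO_POSITIONS[max(i, 0)]
```

With the example bookmark 02:43:19, the sketch selects 02:41:07 rather than the numerically closer 02:43:56.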

At step 558, once the particular audio location in the mapping is identified, intermediary device 420 determines the particular text location, in mapping 406, that corresponds to the particular audio location.

At step 560, intermediary device 420 sends the particular text location to device 410, which in this example is different from device 430. For example, device 410 may be a tablet computer and device 430 may be a smart phone that is configured to process audio data and generate audible sounds.

Step 560 may be performed automatically, i.e., in response to intermediary device 420 determining the particular text location. Alternatively, step 560 (or step 556) may be performed in response to receiving, from device 410, an indication that device 410 is about to process digital media item 402. The indication may be a request for the text location that corresponds to the audio bookmark.

At step 562, the text media player 412 displays information about the particular text location. Step 562 may be performed in response to receiving the particular text location from intermediary device 420. Device 410 is not required to display any of the content that precedes the particular text location in the text version of the work reflected in digital media item 402. For example, if the particular text location indicates Chapter 3, paragraph 2, sentence 4, then device 410 displays the page that includes that sentence. The text media player 412 may cause a marker to be displayed at the particular text location in the page, where the marker visually indicates, to the user of device 410, where in the page to begin reading. Thus, the user can immediately begin reading the text version of the work at the location that corresponds to the last word spoken by the narrator in the audio book.

In an alternative embodiment, device 410 stores mapping 406. Thus, instead of steps 556 through 560, after step 554 (in which device 430 sends data indicating the audio bookmark to intermediary device 420), intermediary device 420 sends the audio bookmark to device 410. The text media player 412 then inspects mapping 406 to determine the particular audio location, among the multiple audio locations, that corresponds to the audio bookmark. Next, the text media player 412 determines the particular text location, in mapping 406, that corresponds to the particular audio location. This alternative process then proceeds to step 562, described above.

In another alternative embodiment, intermediary device 420 is not involved. Thus, steps 554 and 560 are not performed. Instead, device 430 performs all the other steps in FIG. 5B, including steps 556 and 558.

Highlighting Text in Response to Playing Audio

In an embodiment, while audio data that corresponds to a text version of a work is being played, text from a portion of the text version of the work is highlighted, or "lit up." As noted previously, the audio data is an audio version of the text version of the work and may reflect a reading aloud, by a human user, of the text from the text version. As used herein, "highlighting" text refers to a media player (e.g., an "e-reader") visually distinguishing that text from other text that is displayed at the same time as the highlighted text. Highlighting text may involve changing the font of the text, changing the font style of the text (e.g., italics, bold, underline), changing the size of the text, changing the color of the text, changing the background color of the text, or creating an animation associated with the text. An example of creating an animation is causing the text (or the background of the text) to blink or change color intermittently. Another example of creating an animation is creating a graphic that appears above, below, or around the text. For example, in response to the media player playing and detecting the word "toaster," the media player displays an image of a toaster above the word "toaster" in the displayed text. Another example of an animation is a bouncing ball that "bounces" on a portion of the text (e.g., a word, syllable, or letter) as that portion is detected in the audio data being played.

FIG. 6 is a flow diagram that depicts a process 600 for causing text from a text version of a work to be highlighted while the audio version of the work is being played, according to an embodiment of the invention.

At step 610, the current playback position (which is constantly changing) of the audio data of the audio version is determined. This step may be performed by a media player executing on a user's device. The media player processes the audio data to generate audio for the user.

At step 620, based on the current playback position, a mapping record in the mapping is identified. The current playback position may match, or nearly match, the audio location identified in the mapping record.

Step 620 may be performed by the media player if the media player has access to the mapping, which maps multiple audio locations in the audio data to multiple text locations in the text version of the work. Alternatively, step 620 may be performed by another process executing on the user's device, or by a server that receives the current playback position from the user's device over a network.

At step 630, the text location identified in the mapping record is identified.

At step 640, the portion of the text version of the work that corresponds to the text location is caused to be highlighted. This step may be performed by the media player or by another software application executing on the user's device. If a server performs the lookup steps (620 and 630), then step 640 may further involve the server sending the text location to the user's device. In response, the media player (or another software application) accepts the text location as input and causes the corresponding text to be highlighted.

In an embodiment, different text locations in the mapping identified by the media player are associated with different types of highlighting. For example, one text location in the mapping may be associated with a change of font color from black to red, while another text location in the mapping may be associated with an animation, such as a toaster graphic that shows a slice of toast "popping" out of the toaster. Thus, each mapping record in the mapping may include "highlight data" that indicates the manner in which the text identified by the corresponding text location is to be highlighted. Accordingly, for each mapping record that the media player identifies and that includes highlight data, the media player uses the highlight data to determine how to highlight the text. If a mapping record does not include highlight data, then the media player might not highlight the corresponding text. Alternatively, if a mapping record in the mapping does not include highlight data, the media player may use a "default" highlighting technique (e.g., bolding the text) to highlight the text.
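A minimal sketch of the highlight lookup of steps 620 through 640, extended with the per-record "highlight data" and default technique just described. The record fields, effect names, and matching tolerance are assumptions for illustration, not part of the patent.

```python
# Hypothetical mapping records: each pairs an audio position (seconds)
# with a text range and optional "highlight data" naming an effect.
MAPPING = [
    {"audio": 12.0, "text": (0, 14), "highlight": "color:red"},
    {"audio": 15.5, "text": (15, 29), "highlight": None},
    {"audio": 19.0, "text": (30, 47), "highlight": "animate:toaster"},
]

DEFAULT_HIGHLIGHT = "bold"  # fallback when a record carries no highlight data

def highlight_for(current_playback_seconds, tolerance=0.5):
    """Return (text_range, effect) for the record whose audio position
    matches or nearly matches the current playback position, or None
    when no record is close enough."""
    for record in MAPPING:
        if abs(record["audio"] - current_playback_seconds) <= tolerance:
            effect = record["highlight"] or DEFAULT_HIGHLIGHT
            return record["text"], effect
    return None
```

A record with no highlight data falls back to the default bold treatment, matching the last alternative above.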

Highlighting Text Based on Audio Input

FIG. 7 is a flow diagram that depicts a process 700 for highlighting displayed text in response to audio input from a user, according to an embodiment of the invention. In this embodiment, a mapping is not required. The audio input is used to highlight text in a portion of the text version of a work that is concurrently displayed to the user.

At step 710, audio input is received. The audio input may be based on the user reading aloud text from the text version of the work. The audio input may be received by a device that displays a portion of the text version. The device may prompt the user to read a word, a phrase, or an entire sentence. The prompt may be a visual prompt or an audio prompt. As an example of a visual prompt, at the same time as, or immediately before, the device displays an underlined sentence, the device may cause the following text to be displayed: "Please read the underlined text." As an example of an audio prompt, the device may cause a computer-generated voice to read "Please read the underlined text," or cause a pre-recorded human voice to be played, where the pre-recorded human voice provides the same instruction.

At step 720, a speech-to-text analysis is performed on the audio input to detect one or more words reflected in the audio input.

At step 730, each detected word reflected in the audio input is compared against a particular set of words. The particular set of words may be all the words that are concurrently displayed by a computing device (e.g., an e-reader). Alternatively, the particular set of words may be all the words that the user was prompted to read.

At step 740, for each detected word that matches a word in the particular set, the device causes that matching word to be highlighted.
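Steps 720 through 740 can be sketched as follows, assuming the speech-to-text analysis has already produced a plain string. The case-insensitive, punctuation-stripping comparison is an assumption; the patent leaves the details of the word comparison open.

```python
import re

def words_to_highlight(speech_to_text_output, displayed_text):
    """Compare each word detected in the audio input (step 730) against
    the set of currently displayed words, and return the matches that
    the device should highlight (step 740)."""
    detected = re.findall(r"[a-z']+", speech_to_text_output.lower())
    displayed = set(re.findall(r"[a-z']+", displayed_text.lower()))
    return [word for word in detected if word in displayed]
```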

The steps depicted in process 700 may be performed by a single computing device that displays the text from the text version of the work. Alternatively, the steps depicted in process 700 may be performed by one or more computing devices other than the computing device that displays the text from the text version. For example, the audio input from the user in step 710 may be sent over a network from the user's device to a network server that performs the speech-to-text analysis. The network server may then send highlight data to the user's device, causing the user's device to highlight the appropriate text.

Playing Audio in Response to a Text Selection

In an embodiment, a user of a media player that displays portions of the text version of a work may select portions of the displayed text and cause the corresponding audio to be played. For example, if a displayed word from a digital book is "donut" and the user selects that word (e.g., by touching the portion of the media player's touch screen that displays the word), then the audio for "donut" may be played.

A mapping that maps text locations in the text version of the work to audio locations in the audio data is used to identify the portion of the audio data that corresponds to the selected text. The user may select a single word, a phrase, or even one or more sentences. In response to input that selects a portion of the displayed text, the media player may identify one or more text locations. For example, the media player may identify a single text location that corresponds to the selected portion, even if the selected portion comprises multiple lines or sentences. The identified text location may correspond to the beginning of the selected portion. As another example, the media player may identify a first text location that corresponds to the beginning of the selected portion and a second text location that corresponds to the end of the selected portion.

The media player uses the identified text location to look up a mapping record, in the mapping, that indicates a text location that is closest to (or most closely precedes) the identified text location. The media player uses the audio location indicated in that mapping record to identify where, in the audio data, to begin processing the audio data in order to generate audio. If only a single text location was identified, then only the word or sounds at or near the audio location may be played. In that case, after the word or sounds are played, the media player ceases playing, so that no further audio is played. Alternatively, the media player begins playing at or near the audio location and does not cease playing the audio that follows the audio location until (a) the end of the audio data is reached, (b) further input is received from the user (e.g., selection of a "stop" button), or (c) a pre-designated stopping point in the audio data is reached (e.g., the end of a page or chapter, at which further input is required in order to proceed).

If the media player identifies two text locations based on the selected portion, then two audio locations are identified, and they may be used to identify where to begin playing and where to cease playing the corresponding audio.
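The two-lookup case can be sketched as follows. The (text_offset, audio_seconds) records and the at-or-just-preceding lookup rule are illustrative assumptions consistent with the description above; offsets and times are invented.

```python
from bisect import bisect_right

# Hypothetical mapping: (text_offset, audio_seconds) pairs, sorted by
# text offset into the text version of the work.
MAPPING = [(0, 0.0), (100, 25.0), (250, 61.0), (400, 98.5)]

def nearest_audio(text_offset):
    """Audio position for the mapping record at or just before offset."""
    offsets = [t for t, _ in MAPPING]
    i = max(bisect_right(offsets, text_offset) - 1, 0)
    return MAPPING[i][1]

def audio_span_for_selection(sel_start, sel_end):
    """Resolve a text selection into (start, stop) audio positions:
    one lookup for the start of the selection, one for the end."""
    return nearest_audio(sel_start), nearest_audio(sel_end)
```

The first value tells the media player where to begin playing; the second, where to cease.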

In an embodiment, the audio data identified by an audio location may be played slowly (i.e., at a slow playback speed) or continuously, without advancing past the current playback position in the audio data. For example, if a user of a tablet computer selects the displayed word "two" by touching the tablet computer's touch screen with a finger and continuously touching the displayed word (i.e., without lifting the finger and without moving the finger to another displayed word), then the tablet computer plays the corresponding audio so as to create the sound reflected by reading the word as "twoooooooooooooooo."

In a similar embodiment, the speed at which a user drags a finger across displayed text on the media player's touch screen causes the corresponding audio to be played at the same or a similar speed. For example, the user selects the letter "d" of the displayed word "donut" and then slowly moves the finger across the displayed word. In response to this input, the media player identifies the corresponding audio data (using the mapping) and plays the corresponding audio at the same speed at which the user moves the finger. Thus, the media player creates audio that sounds as though a reader of the text version of the work is pronouncing the word "donut" as "dooooooonnnnnnuuuuuut."

In a similar embodiment, the length of time that a user "touches" a word displayed on the touch screen indicates how quickly or slowly the audio version of the word is played. For example, a quick tap of the user's finger on a displayed word causes the corresponding audio to be played at normal speed, whereas the user pressing a finger on the selected word for more than one second causes the corresponding audio to be played at half normal speed.
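The touch-duration example above can be sketched as a simple rate function. The one-second threshold and the two rates come directly from the example; the function name and interface are invented.

```python
def playback_rate(touch_duration_seconds):
    """Map how long the user holds a displayed word to a playback rate:
    a quick tap plays the word at normal speed (1.0), while a press
    held for more than one second plays it at half speed (0.5)."""
    return 0.5 if touch_duration_seconds > 1.0 else 1.0
```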

Transferring User Annotations

In an embodiment, a user initiates the creation of an annotation in one media version (e.g., audio) of a digital work and causes the annotation to be associated with another media version (e.g., text) of the digital work. Thus, although the annotation may be created in the context of one type of media, the annotation may be consumed in the context of another type of media. The "context" in which an annotation is created or consumed refers to whether text is being displayed or audio is being played when the creation or consumption occurs.

Although the following examples involve determining a location within the audio, or a text location, when the annotation is created, some embodiments of the invention are not so limited. For example, when an annotation is consumed in the text context, the current playback position within the audio file at the time the annotation was created in the audio context might not be used. Instead, a device may display an indication of the annotation at the beginning or end of the corresponding text version, or on every "page" of the corresponding text version. As another example, when an annotation is consumed in the audio context, the text that was displayed when the annotation was created in the text context might not be used. Instead, a device may display an indication of the annotation at the beginning or end of the corresponding audio version, or may display the indication of the annotation continuously while the corresponding audio version is being played. In addition to, or instead of, a visual indication, an audio indication of the annotation may be played. For example, a "beep" may be played simultaneously with the audio track, in such a way that both the beep and the audio track are audible.

FIGS. 8A-8B are flow diagrams that depict processes for transferring an annotation from one context to another, according to an embodiment of the invention. Specifically, FIG. 8A is a flow diagram that depicts a process 800 for creating an annotation in a "text" context and consuming the annotation in an "audio" context, while FIG. 8B is a flow diagram that depicts a process 850 for creating an annotation in an "audio" context and consuming the annotation in a "text" context. The creation and consumption of an annotation may occur on the same computing device (e.g., device 410) or on separate computing devices (e.g., devices 410 and 430). FIG. 8A describes a scenario in which the annotation is created and consumed on device 410, while FIG. 8B describes a scenario in which the annotation is created on device 430 and later consumed on device 410.

At step 802 in FIG. 8A, the text media player 412, executing on device 410, causes text from digital media item 402 to be displayed (i.e., in the form of a page).

At step 804, the text media player 412 determines a text location within the text version of the work reflected in digital media item 402. The text location is eventually stored in association with the annotation. The text location may be determined in a number of ways. For example, the text media player 412 may receive input that selects a text location within the displayed text. The input may be the user touching the touch screen of device 410 (which displays the text) for a period of time. The input may select a specific word, a number of words, the beginning or end of a page, a point before or after a sentence, and so forth. The input may also include first selecting a button, which causes the text media player 412 to change to a "create annotation" mode in which an annotation can be created and associated with a text location.

As another example of determining the text location, the text media player 412 automatically determines the text location (without user input) based on which portion of the text version of the work (reflected in digital media item 402) is being displayed. For example, if device 410 is displaying page 20 of the text version of the work, then the annotation will be associated with page 20.

At step 806, the text media player 412 receives input that selects a "create annotation" button, which may be displayed on the touch screen. The button may be displayed in response to the input, in step 804, that selects the text location, where, for example, the user touches the touch screen for a period of time (such as one second).

Although step 804 is depicted as occurring before step 806, alternatively, the selection of the "create annotation" button may occur before the text location is determined.

At step 808, the text media player 412 receives input that creates annotation data. The input may be voice data (such as the user speaking into a microphone of device 410) or text data (such as the user selecting keys on a keyboard, whether physical or graphical keys). If the annotation data is voice data, then the text media player 412 (or another process) may perform a speech-to-text analysis on the voice data to create a text version of the voice data.

At step 810, the text media player 412 stores the annotation data in association with the text location. The text media player 412 uses a mapping (e.g., a copy of mapping 406) to identify the particular text location, in the mapping, that is closest to the text location. Then, using the mapping, the text media player identifies the audio location that corresponds to the particular text location.

As an alternative to step 810, the text media player 412 sends the annotation data and the text location to intermediary device 420 over network 440. In response, intermediary device 420 stores the annotation data in association with the text location. Intermediary device 420 uses a mapping (e.g., mapping 406) to identify the particular text location, in mapping 406, that is closest to the text location. Then, using mapping 406, intermediary device 420 identifies the audio location that corresponds to the particular text location. Intermediary device 420 sends the identified audio location to device 410 over network 440. Intermediary device 420 may send the identified audio location in response to a request, from device 410, for certain audio data and/or for annotations associated with certain audio data. For example, in response to a request for the audio book version of "The Tale of Two Cities," intermediary device 420 determines whether there is any annotation data associated with that audio book and, if so, sends the annotation data to device 410.
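Step 810 (whether performed locally or on the intermediary device) can be sketched as follows. The annotation store, record fields, and nearest-preceding-position lookup are illustrative assumptions, not the patent's data model.

```python
from bisect import bisect_right

# Hypothetical mapping and annotation store; all values are invented.
MAPPING = [(0, 0.0), (500, 120.0), (900, 255.5)]  # (text_offset, audio_seconds)
ANNOTATIONS = []

def save_annotation(note, text_offset):
    """Persist the note against its text location, then use the mapping
    to derive the audio location at which the note should later surface
    in the audio context (as in step 814)."""
    offsets = [t for t, _ in MAPPING]
    i = max(bisect_right(offsets, text_offset) - 1, 0)
    audio_position = MAPPING[i][1]
    ANNOTATIONS.append(
        {"note": note, "text_offset": text_offset, "audio": audio_position}
    )
    return audio_position
```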

Step 810 may also comprise storing date and/or time information that indicates when the annotation was created. This information may be displayed later, when the annotation is consumed in the audio context.

At step 812, the audio media player 414 plays audio by processing the audio data of digital media item 404, which, in this example (although not depicted), may be stored on device 410 or may be streamed from intermediary device 420 to device 410 over network 440.

At step 814, the audio media player 414 determines when the current playback position in the audio data matches, or nearly matches, the audio location that was identified, using mapping 406, in step 810. Alternatively, instead of playing audio as in step 812, the audio media player 414 may cause data indicating that an annotation is available to be displayed, regardless of where the current playback position is located and without any audio being played. In other words, step 812 is unnecessary. For example, the user may start the audio media player 414 and cause the audio media player 414 to load the audio data of digital media item 404. The audio media player 414 determines that the annotation data is associated with the audio data. Without generating any audio associated with the audio data, the audio media player 414 causes information about the audio data (e.g., title, artist, genre, length, and so forth) to be displayed. That information may include a reference to the annotation data, and information about the location, within the audio data, that is associated with the annotation data, where that location corresponds to the audio location identified in step 810.

At step 816, audio media player 414 consumes the annotation data. If the annotation data is voice data, then consuming the annotation data may involve processing the voice data to generate audio, or converting the voice data to text data and displaying the text data. If the annotation data is text data, then consuming the annotation data may involve, for example, displaying the text data in a side panel of a GUI that displays attributes of the audio data being played, or in a new window that appears separate from the GUI. Non-limiting examples of such attributes include the length of the audio data; the current playback position, which may indicate an absolute location within the audio data (e.g., a time offset) or a relative location within the audio data (e.g., a chapter or section number); a waveform of the audio data; and the title of the digital work.
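By way of illustration only, the check performed at step 814 — whether the current playback position matches or nearly matches a stored annotation's audio location — might be sketched as a tolerance comparison against a sorted list of annotation positions. The function name, millisecond units, and tolerance value below are assumptions introduced for this sketch, not part of the disclosed embodiments.

```python
import bisect

def annotation_due(annotation_positions, current_position_ms, tolerance_ms=500):
    """Return the annotation audio position that the current playback
    position matches or nearly matches (within tolerance), else None.
    Units and tolerance are illustrative assumptions."""
    positions = sorted(annotation_positions)
    i = bisect.bisect_left(positions, current_position_ms)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = positions[max(0, i - 1):i + 1]
    for pos in candidates:
        if abs(pos - current_position_ms) <= tolerance_ms:
            return pos
    return None
```

A player loop would call such a check periodically during playback and, on a non-None result, surface the associated annotation as in step 816.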

As mentioned previously, FIG. 8B depicts a scenario in which an annotation is created on device 430 and later consumed on device 410.

At step 852, audio media player 432 processes audio data from digital media item 404 to play audio.

At step 854, audio media player 432 determines an audio location within the audio data. It is this audio location that will eventually be stored in association with the annotation. The audio location may be determined in a number of ways. For example, audio media player 432 may receive input that selects an audio location within the audio data. The input may be the user touching a touch screen of device 430 (which displays attributes of the audio data) for a period of time. The input may select an absolute position within a timeline that reflects the length of the audio data, or a relative position within the audio data (such as a chapter and paragraph number). The input may also involve first selecting a button, which causes audio media player 432 to change to a "create annotation" mode in which an annotation may be created and associated with the audio location.

As another example of determining the audio location, audio media player 432 automatically determines the audio location (without user input) based on which portion of the audio data is being processed. For example, if audio media player 432 is processing the portion of the audio data that corresponds to chapter 20 of the digital work reflected in digital media item 404, then audio media player 432 determines that the audio location is at least somewhere within chapter 20.

At step 856, audio media player 432 receives input that selects a "create annotation" button, which may be displayed on the touch screen of device 430. The button may be displayed in response to the input that selected the audio location in step 854, where, for example, the user continuously touches the touch screen for a period of time (such as one second).

Although step 854 is depicted as occurring before step 856, the selection of the "create annotation" button may alternatively occur before the audio location is determined.

At step 858, similar to step 808, the first media player receives input for creating annotation data.

At step 860, audio media player 432 stores the annotation data in association with the audio location. Audio media player 432 uses a mapping (e.g., mapping 406) to identify the particular audio location, in the mapping, that is closest to the audio location determined in step 854. Then, using the mapping, audio media player 432 identifies the text location that corresponds to the particular audio location.
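The two lookups of step 860 — nearest mapping entry, then its corresponding text location — might be sketched as follows. This is an illustrative sketch only: representing the mapping as sorted (audio-millisecond, character-offset) pairs, and the placeholder values, are assumptions introduced here.

```python
import bisect

# A mapping (e.g., mapping 406) as (audio_ms, text_offset) pairs,
# sorted by audio position. The pair values are placeholders.
MAPPING = [(0, 0), (30_000, 450), (60_000, 910), (90_000, 1400)]

def text_location_for(audio_ms, mapping=MAPPING):
    """Identify the mapping entry whose audio location is closest to
    audio_ms, then return that entry's corresponding text location."""
    audio_positions = [a for a, _ in mapping]
    i = bisect.bisect_left(audio_positions, audio_ms)
    candidates = mapping[max(0, i - 1):i + 1]
    closest = min(candidates, key=lambda pair: abs(pair[0] - audio_ms))
    return closest[1]
```

The same lookup works whether it runs on the end-user device or on an intermediary device holding the mapping.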

Alternatively to step 860, audio media player 432 sends the annotation data and the audio location to intermediary device 420 over network 440. In response, intermediary device 420 stores the annotation data in association with the audio location. Intermediary device 420 uses mapping 406 to identify the particular audio location, in the mapping, that is closest to the audio location determined in step 854. Then, using mapping 406, intermediary device 420 identifies the text location that corresponds to the particular audio location. Intermediary device 420 sends the identified text location to device 410 over network 440. Intermediary device 420 may send the identified text location in response to a request, from device 410, for certain text data and/or for annotations associated with certain text data. For example, in response to a request for the digital book "The Grapes of Wrath", intermediary device 420 determines whether any annotation data is associated with that digital book and, if so, sends the annotation data to device 410.

Step 860 may also include storing date and/or time information that indicates when the annotation was created. This information may be displayed later, when the annotation is consumed in the text context.

At step 862, device 410 displays text data associated with digital media item 402, which is the textual version of digital media item 404. Device 410 displays the text data of digital media item 402 based on a locally-stored copy of digital media item 402 or, if no locally-stored copy exists, the text data may be displayed while it is streamed from intermediary device 420.

At step 864, device 410 determines when the portion of the textual version of the work (reflected in digital media item 402) that includes the text location (identified in step 860) is displayed. Alternatively, device 410 may display data indicating that an annotation is available, regardless of which portion of the textual version of the work, if any, is displayed.

At step 866, text media player 412 consumes the annotation data. If the annotation data is voice data, then consuming the annotation data may involve playing the voice data, or converting the voice data to text data and displaying the text data. If the annotation data is text data, then consuming the annotation data may involve, for example, displaying the text data in a side panel of a GUI that displays a portion of the textual version of the work, or in a new window that appears separate from the GUI.

Hardware Overview

According to one embodiment, the techniques described herein are implemented by one or more special-purpose computing devices. The special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices, such as one or more application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs), that are persistently programmed to perform the techniques, or may include one or more general-purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination thereof. Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices, or any other device that incorporates hard-wired and/or program logic to implement the techniques.

For example, FIG. 9 is a block diagram that illustrates a computer system 900 upon which an embodiment of the invention may be implemented. Computer system 900 includes a bus 902 or other communication mechanism for communicating information, and a hardware processor 904 coupled with bus 902 for processing information. Hardware processor 904 may be, for example, a general-purpose microprocessor.

Computer system 900 also includes a main memory 906, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 902 for storing information and instructions to be executed by processor 904. Main memory 906 may also be used for storing temporary variables or other intermediate information during execution of the instructions to be executed by processor 904. Such instructions, when stored in non-transitory storage media accessible to processor 904, render computer system 900 a special-purpose machine that is customized to perform the operations specified in the instructions.

Computer system 900 further includes a read-only memory (ROM) 908 or other static storage device coupled to bus 902 for storing static information and instructions for processor 904. A storage device 910, such as a magnetic disk or optical disk, is provided and coupled to bus 902 for storing information and instructions.

Computer system 900 may be coupled via bus 902 to a display 912, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 914, including alphanumeric and other keys, is coupled to bus 902 for communicating information and command selections to processor 904. Another type of user input device is cursor control 916, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 904 and for controlling cursor movement on display 912. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allow the device to specify positions in a plane.

Computer system 900 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware, and/or program logic which, in combination with the computer system, causes or programs computer system 900 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 900 in response to processor 904 executing one or more sequences of one or more instructions contained in main memory 906. Such instructions may be read into main memory 906 from another storage medium, such as storage device 910. Execution of the sequences of instructions contained in main memory 906 causes processor 904 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of, or in combination with, software instructions.

The term "storage media" as used herein refers to any non-transitory media that store data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 910. Volatile media includes dynamic memory, such as main memory 906. Common forms of storage media include, for example, a floppy disk, a flexible disk, a hard disk, a solid-state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, an NVRAM, or any other memory chip or cartridge.

Storage media is distinct from, but may be used in conjunction with, transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 902. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications.

Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 904 for execution. For example, the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 900 can receive the data on the telephone line and use an infrared transmitter to convert the data to an infrared signal. An infrared detector can receive the data carried in the infrared signal, and appropriate circuitry can place the data on bus 902. Bus 902 carries the data to main memory 906, from which processor 904 retrieves and executes the instructions. The instructions received by main memory 906 may optionally be stored on storage device 910 either before or after execution by processor 904.

Computer system 900 also includes a communication interface 918 coupled to bus 902. Communication interface 918 provides a two-way data communication coupling to a network link 920 that is connected to a local network 922. For example, communication interface 918 may be an integrated services digital network (ISDN) card, a cable modem, a satellite modem, or a modem that provides a data communication connection to a corresponding type of telephone line. As another example, communication interface 918 may be a local area network (LAN) card that provides a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 918 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.

Network link 920 typically provides data communication through one or more networks to other data devices. For example, network link 920 may provide a connection through local network 922 to a host computer 924, or to data equipment operated by an Internet Service Provider (ISP) 926. ISP 926 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the "Internet" 928. Local network 922 and Internet 928 both use electrical, electromagnetic, or optical signals that carry digital data streams. The signals through the various networks, and the signals on network link 920 and through communication interface 918, which carry the digital data to and from computer system 900, are example forms of transmission media.

Computer system 900 can send messages and receive data, including program code, through the network(s), network link 920, and communication interface 918. In the Internet example, a server 930 might transmit a requested code for an application program through Internet 928, ISP 926, local network 922, and communication interface 918.

The received code may be executed by processor 904 as it is received, and/or stored in storage device 910 or other non-volatile storage for later execution.

In accordance with some embodiments, FIGS. 10 through 15 show functional block diagrams of electronic devices 1000 through 1500 in accordance with the principles of the invention as described above. The functional blocks of the devices may be implemented by hardware, software, or a combination of hardware and software to carry out the principles of the invention. It is understood by persons of skill in the art that the functional blocks described in FIGS. 10 through 15 may be combined or separated into sub-blocks to implement the principles of the invention as described above. Therefore, the description herein may support any possible combination, separation, or further definition of the functional blocks described herein.

As shown in FIG. 10, an electronic device 1000 includes an audio data receiving unit 1002 configured to receive audio data that reflects an audible version of a work for which a text version exists. Electronic device 1000 also includes a processing unit 1006 coupled to audio data receiving unit 1002. In some embodiments, processing unit 1006 includes a speech-to-text unit 1008 and a mapping unit 1010.

Processing unit 1006 is configured to perform a speech-to-text analysis of the audio data to generate text for portions of the audio data (e.g., with speech-to-text unit 1008), and, based on the text generated for the portions of the audio data, to generate a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work (e.g., with mapping unit 1010).
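By way of illustration only, one simple way the mapping unit could turn speech-to-text output into such a mapping is greedy alignment: each recognized word (with its audio timestamp) is matched to its next occurrence in the text version, yielding (audio position, text position) pairs. This sketch assumes a word/timestamp representation for the recognizer output and character offsets for text positions; neither is mandated by the disclosure, and real recognizers require more robust alignment.

```python
def build_mapping(recognized_words, text_version):
    """recognized_words: (word, audio_ms) pairs from speech-to-text,
    in playback order. Greedily align each word to its next occurrence
    in the text version and record (audio_ms, char_offset) pairs."""
    mapping = []
    cursor = 0
    lowered = text_version.lower()
    for word, audio_ms in recognized_words:
        offset = lowered.find(word.lower(), cursor)
        if offset == -1:
            continue  # recognition error; skip words with no match
        mapping.append((audio_ms, offset))
        cursor = offset + len(word)
    return mapping
```

Skipping unmatched words is one crude way to tolerate recognition errors; the granularity of the pairs (word, sentence, paragraph) is a design choice.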

As shown in FIG. 11, an electronic device 1100 includes a text receiving unit 1102 configured to receive a text version of a work. Electronic device 1100 also includes an audio data receiving unit 1104 configured to receive second audio data that reflects an audible version of the work for which the text version exists. Electronic device 1100 also includes a processing unit 1106 coupled to text receiving unit 1102. In some embodiments, processing unit 1106 includes a text-to-speech unit 1108 and a mapping unit 1110.

Processing unit 1106 is configured to perform a text-to-speech analysis of the text version to generate first audio data (e.g., with text-to-speech unit 1108) and, based on the first audio data and the text version, to generate a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the text version of the work (e.g., with mapping unit 1110). Processing unit 1106 is further configured to generate, based on (1) a comparison of the first audio data and the second audio data and (2) the first mapping, a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the text version of the work (e.g., with mapping unit 1110).
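The second mapping can be viewed as a composition: the first mapping ties synthesized-audio times to text locations, and the audio-to-audio comparison (e.g., a dynamic-time-warping alignment of the two waveforms, which is outside the scope of this sketch) supplies a function from synthesized-audio times to narrated-audio times. The sketch below assumes that alignment function is given; the function names and pair representation are illustrative assumptions.

```python
def second_mapping(first_mapping, align):
    """first_mapping: (tts_audio_ms, text_offset) pairs.
    align: a function mapping a time in the synthesized (TTS) audio to
    the corresponding time in the second, narrated audio — e.g., derived
    from an audio-to-audio alignment of the two recordings.
    Returns (narrated_audio_ms, text_offset) pairs."""
    return [(align(tts_ms), text_offset) for tts_ms, text_offset in first_mapping]
```

With this framing, improving the second mapping reduces to improving the audio-to-audio alignment, since the text offsets are carried over unchanged.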

As shown in FIG. 12, an electronic device 1200 includes an audio receiving unit 1202 configured to receive audio input. Electronic device 1200 also includes a processing unit 1206 coupled to audio receiving unit 1202. In some embodiments, processing unit 1206 includes a speech-to-text unit 1208, a text matching unit 1209, and a display control unit 1210.

Processing unit 1206 is configured to perform a speech-to-text analysis of the audio input to generate text for portions of the audio input (e.g., with speech-to-text unit 1208); to determine whether the text generated for the portions of the audio input matches currently displayed text (e.g., with text matching unit 1209); and, in response to determining that the text matches the currently displayed text, to cause the currently displayed text to be highlighted (e.g., with display control unit 1210).
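The matching-and-highlighting step might be sketched, under simplifying assumptions, as a case-insensitive substring match that returns the span of displayed text to highlight. The function name and (start, end) span representation are assumptions of this sketch; a practical matcher would also tolerate recognition errors and punctuation differences.

```python
def maybe_highlight(recognized_text, displayed_text):
    """Return the (start, end) character span of displayed_text to
    highlight if the recognized speech matches text currently on
    screen; return None when there is no match."""
    needle = recognized_text.strip().lower()
    start = displayed_text.lower().find(needle)
    if start == -1:
        return None
    return (start, start + len(needle))
```

A display control unit would then apply the highlight style to the returned span, or do nothing on None.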

As shown in FIG. 13, an electronic device 1300 includes a position data obtaining unit 1302 configured to obtain position data that indicates a specified location within a text version of a work. Electronic device 1300 also includes a processing unit 1306 coupled to position data obtaining unit 1302. In some embodiments, processing unit 1306 includes a mapping inspection unit 1308.

Processing unit 1306 is configured to inspect a mapping between a plurality of audio locations in an audio version of the work and a corresponding plurality of text locations in the text version of the work (e.g., with mapping inspection unit 1308) to: determine a particular text location, of the plurality of text locations, that corresponds to the specified location; and, based on the particular text location, determine a particular audio location, of the plurality of audio locations, that corresponds to the particular text location. Processing unit 1306 is also configured to provide the particular audio location, determined based on the particular text location, to a media player, to cause the media player to establish the particular audio location as the current playback position of the audio data.
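This text-to-audio bookmark switch might be sketched end to end as follows: pick the mapped text location nearest the bookmark, look up its audio location, and hand that to the player. The player stub, dictionary representation of the mapping, and nearest-key rule are all assumptions introduced for illustration.

```python
class MediaPlayerStub:
    """Minimal stand-in for the media player of FIG. 13; only the
    behavior needed here (accepting a playback position) is modeled."""
    def __init__(self):
        self.playback_ms = 0

    def seek(self, ms):
        self.playback_ms = ms

def switch_bookmark(text_bookmark, mapping, player):
    """mapping: {text_offset: audio_ms}. Determine the particular
    text location nearest the bookmark, then its audio location,
    and establish that as the player's current playback position."""
    particular_text = min(mapping, key=lambda t: abs(t - text_bookmark))
    particular_audio = mapping[particular_text]
    player.seek(particular_audio)
    return particular_audio
```

The reverse direction (FIG. 14) inverts the same lookup, going from an audio location to a text location.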

As shown in FIG. 14, an electronic device 1400 includes a position obtaining unit 1402 configured to obtain position data that indicates a specified location within audio data. Electronic device 1400 also includes a processing unit 1406 coupled to position obtaining unit 1402. In some embodiments, processing unit 1406 includes a mapping inspection unit 1408 and a display control unit 1410.

Processing unit 1406 is configured to inspect a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in a text version of a work (e.g., with mapping inspection unit 1408) to: determine a particular audio location, of the plurality of audio locations, that corresponds to the specified location; and, based on the particular audio location, determine a particular text location, of the plurality of text locations, that corresponds to the particular audio location. Processing unit 1406 is also configured to cause a media player to display information about the particular text location (e.g., with display control unit 1410).

As shown in FIG. 15, an electronic device 1500 includes a position obtaining unit 1502 configured to obtain position data that indicates a specified location within an audio version of a work during playback of the audio version. Electronic device 1500 also includes a processing unit 1506 coupled to position data obtaining unit 1502. In some embodiments, processing unit 1506 includes a text location determining unit 1508 and a display control unit 1510.

Processing unit 1506 is configured to, during playback of the audio version of the work: determine, based on the specified location, a particular text location in the text version of the work (e.g., with text location determining unit 1508), the particular text location being associated with end-of-page data that indicates the end of a first page reflected in the text version of the work; and, in response to determining that the particular text location is associated with the end-of-page data, automatically cause the first page to cease being displayed and cause a second page, subsequent to the first page, to be displayed (e.g., with display control unit 1510).
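The automatic page turn might be sketched as a periodic check during playback: map the current audio position to its latest text location, and turn the page once that location reaches the end-of-page offset for the displayed page. The display stub, page-break list, and per-tick callback shape are assumptions of this sketch, not part of the disclosed embodiments.

```python
class DisplayStub:
    """Minimal stand-in for the display logic of FIG. 15."""
    def __init__(self):
        self.current_page = 0

    def show_page(self, n):
        self.current_page = n

def on_playback_position(audio_ms, mapping, page_breaks, display):
    """mapping: (audio_ms, text_offset) pairs sorted by audio position.
    page_breaks: text offset at which each page ends, in page order.
    Turn the page when the mapped text location reaches the end of
    the currently displayed page."""
    # Latest mapped text location at or before the playback position.
    text_offset = 0
    for a, t in mapping:
        if a <= audio_ms:
            text_offset = t
        else:
            break
    if text_offset >= page_breaks[display.current_page]:
        display.show_page(display.current_page + 1)
```

Calling this on every playback tick keeps the displayed page synchronized with the narration without user input.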

In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and what the applicants intend to be the scope of the invention, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.

210‧‧‧text version/document
220‧‧‧text-to-speech generator/text-to-speech analyzer
230‧‧‧audio data
240‧‧‧audio-to-document mapping
250‧‧‧audio book
260‧‧‧audio-to-text correlator/speech-to-speech analyzer
270‧‧‧document-to-audio mapping/document-to-text mapping
400‧‧‧system
402‧‧‧digital media item
404‧‧‧digital media item
406‧‧‧mapping
410‧‧‧end-user device
412‧‧‧text media player
414‧‧‧audio media player
420‧‧‧intermediary device
430‧‧‧end-user device
432‧‧‧audio media player
440‧‧‧network
900‧‧‧computer system
902‧‧‧bus
904‧‧‧hardware processor
906‧‧‧main memory
908‧‧‧read-only memory (ROM)
910‧‧‧storage device
912‧‧‧display
914‧‧‧input device
916‧‧‧cursor control
918‧‧‧communication interface
920‧‧‧network link
922‧‧‧local network
924‧‧‧host computer
926‧‧‧Internet Service Provider (ISP)
928‧‧‧Internet
930‧‧‧server
1000‧‧‧electronic device
1002‧‧‧audio data receiving unit
1006‧‧‧processing unit
1008‧‧‧speech-to-text unit
1010‧‧‧mapping unit
1100‧‧‧electronic device
1102‧‧‧text receiving unit
1104‧‧‧audio data receiving unit
1106‧‧‧processing unit
1108‧‧‧text-to-speech unit
1110‧‧‧mapping unit
1200‧‧‧electronic device
1204‧‧‧audio receiving unit
1206‧‧‧processing unit
1208‧‧‧speech-to-text unit
1209‧‧‧text matching unit
1210‧‧‧display control unit
1300‧‧‧electronic device
1302‧‧‧position data obtaining unit
1306‧‧‧processing unit
1308‧‧‧mapping inspection unit
1400‧‧‧electronic device
1402‧‧‧position obtaining unit
1406‧‧‧processing unit
1408‧‧‧mapping inspection unit
1410‧‧‧display control unit
1500‧‧‧electronic device
1502‧‧‧position obtaining unit/position data obtaining unit
1506‧‧‧processing unit
1508‧‧‧text location determining unit
1510‧‧‧display control unit

圖1為描繪根據本發明之實施例的用於在文字資料與音訊資料之間自動地建立映射之處理程序的流程圖;圖2為描繪根據本發明之實施例的在於文字資料與音訊資料之間產生映射時涉及音訊至文字相關器之處理程序的方塊圖;圖3為描繪根據本發明之實施例的用於在此等情境中之 一或多者下使用映射之處理程序的流程圖;圖4為根據本發明之實施例的可用以實施本文中所描述之處理程序中之一些的實例系統400之方塊圖。 1 is a flow chart depicting a process for automatically establishing a mapping between text material and audio material in accordance with an embodiment of the present invention; and FIG. 2 is a diagram depicting text data and audio data in accordance with an embodiment of the present invention. A block diagram of a processing procedure involving an audio to text correlator when generating a map; FIG. 3 is a diagram depicting a scenario in accordance with an embodiment of the present invention A flowchart of one or more of the processing procedures using mappings; FIG. 4 is a block diagram of an example system 400 that can be used to implement some of the processing procedures described herein in accordance with an embodiment of the present invention.

FIGS. 5A-5B are flowcharts depicting a process for bookmark switching, in accordance with an embodiment of the invention; FIG. 6 is a flowchart depicting a process for causing text from the text version of a work to be highlighted while the audio version of the work is playing, in accordance with an embodiment of the invention; FIG. 7 is a flowchart depicting a process for highlighting displayed text in response to audio input from a user, in accordance with an embodiment of the invention; FIGS. 8A-8B are flowcharts depicting a process for transferring an annotation from one media context to another media context, in accordance with an embodiment of the invention; and FIG. 9 is a block diagram of a computer system upon which embodiments of the invention may be implemented.

FIGS. 10-15 are functional block diagrams of electronic devices, in accordance with some embodiments.
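The bookmark switching shown in FIGS. 5A-5B can be sketched as a lookup in the mapping: a bookmark established in the text version is translated to a position in the audio version by finding the nearest mapped text location. This is an illustrative sketch only; the list-of-pairs representation and the nearest-preceding-entry policy are assumptions, not details taken from the patent.

```python
# Illustrative sketch of bookmark switching: translate a text bookmark
# into an audio position via the mapping. The mapping is assumed to be a
# list of (char_offset, audio_seconds) pairs sorted by character offset.
import bisect

def text_bookmark_to_audio(mapping, text_offset):
    """Return the audio position for the mapped location nearest at or
    before the bookmarked text offset."""
    offsets = [off for off, _ in mapping]
    # bisect_right finds the first mapped offset past the bookmark;
    # the entry just before it is the nearest preceding mapped location.
    i = bisect.bisect_right(offsets, text_offset) - 1
    return mapping[max(i, 0)][1]

mapping = [(0, 0.0), (120, 9.5), (480, 41.2)]
print(text_bookmark_to_audio(mapping, 300))  # → 9.5
```

The same lookup run in the other direction (audio position to text offset) covers switching from the audio book back to the e-book.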

Claims (20)

A method comprising: receiving audio data, the audio data reflecting an audible version of a work for which a text version exists; performing a speech-to-text analysis of the audio data to generate text for portions of the audio data; and based on the text generated for the portions of the audio data, generating a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work; wherein the method is performed by one or more computing devices.

The method of claim 1, wherein generating text for portions of the audio data comprises: generating text for portions of the audio data based at least in part on a textual context of the work.

The method of claim 2, wherein generating text for portions of the audio data based at least in part on the textual context of the work comprises: generating text based at least in part on one or more grammar rules used in the text version of the work.

The method of claim 2, wherein generating text for portions of the audio data based at least in part on the textual context of the work comprises: restricting which words the portions may be translated into based on which words are in the text version of the work or a subset thereof.
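The mapping generation recited in claim 1 can be sketched as a two-step alignment: a speech-to-text engine emits words with audio timestamps, and each recognized word is matched forward against the text version. The recognizer output format, the case-insensitive matching, and the skip-on-mismatch policy below are illustrative assumptions, not part of the claims.

```python
# Illustrative sketch only: builds an audio-to-text mapping from
# hypothetical speech-to-text output with word-level timestamps.

def build_mapping(recognized_words, text_version):
    """recognized_words: [(word, audio_time_seconds), ...] from a
    speech-to-text engine; text_version: full text of the work.
    Returns [(audio_time, char_offset), ...] pairs."""
    mapping = []
    cursor = 0  # current search position in the text version
    lowered = text_version.lower()
    for word, audio_time in recognized_words:
        # Search forward from the cursor so the alignment stays monotonic.
        offset = lowered.find(word.lower(), cursor)
        if offset == -1:
            continue  # misrecognized word: skip it rather than misalign
        mapping.append((audio_time, offset))
        cursor = offset + len(word)
    return mapping

words = [("call", 0.0), ("me", 0.4), ("ishmael", 0.6)]
print(build_mapping(words, "Call me Ishmael."))
# → [(0.0, 0), (0.4, 5), (0.6, 8)]
```

Each pair associates an audio location (a playback time) with a text location (a character offset), which is the shape of mapping the claims describe.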
The method of claim 4, wherein restricting which words the portions may be translated into based on which words are in the text version of the work comprises: for a given portion of the audio data, identifying a subsection of the text version of the work that corresponds to the given portion, and limiting the words to only those words in that subsection of the text version of the work.

The method of claim 5, wherein: identifying the subsection of the text version of the work comprises maintaining a current text location in the text version of the work, the current text location corresponding to a current audio location of the speech-to-text analysis in the audio data; and the subsection of the text version of the work is a section associated with the current text location.

The method of any one of claims 1 to 6, wherein the portions include portions corresponding to individual words, and the mapping maps the locations of the portions corresponding to individual words to individual words in the text version of the work.

The method of any one of claims 1 to 6, wherein the portions include portions corresponding to individual sentences, and the mapping maps the locations of the portions corresponding to individual sentences to individual sentences in the text version of the work.
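The vocabulary restriction of claims 4-6 can be sketched as follows: the candidate words the recognizer may emit for a portion of audio are limited to the words appearing in the subsection of the text around the current text location. The window size, tokenization, and punctuation stripping below are illustrative assumptions.

```python
# Sketch of restricting the speech-to-text vocabulary to a subsection of
# the text version around the maintained current text location.

def candidate_words(text_version, current_pos, window=60):
    """Return the set of words the speech-to-text step may choose from,
    taken from the subsection around the current text location."""
    lo = max(0, current_pos - window)
    hi = min(len(text_version), current_pos + window)
    subsection = text_version[lo:hi]
    return {w.strip('.,;!?').lower() for w in subsection.split()}

text = "It was the best of times, it was the worst of times."
vocab = candidate_words(text, current_pos=10, window=60)
print("worst" in vocab, "zebra" in vocab)  # → True False
```

Shrinking the candidate set this way is what lets the recognizer disambiguate acoustically similar words: only words the author actually wrote near the current location are eligible transcriptions.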
The method of any one of claims 1 to 6, wherein the portions include portions corresponding to fixed amounts of data, and the mapping maps the locations of the portions corresponding to fixed amounts of data to corresponding locations in the text version of the work.

The method of any one of claims 1 to 9, wherein generating the mapping comprises: (1) embedding anchors in the audio data; (2) embedding anchors in the text version of the work; or (3) storing the mapping in a media overlay that is stored in association with the audio data or with the text version of the work.

The method of any one of claims 1 to 10, wherein each of one or more text locations of the plurality of text locations indicates a relative location in the text version of the work.

The method of any one of claims 1 to 10, wherein one text location of the plurality of text locations indicates a relative location in the text version of the work, and another text location of the plurality of text locations indicates an absolute location from that relative location.

The method of any one of claims 1 to 10, wherein each of one or more text locations of the plurality of text locations indicates an anchor within the text version of the work.
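A media-overlay form of the mapping (option 3 of claim 10) keeps the mapping in a separate structure stored alongside the audio and text rather than embedded in either. The JSON layout and field names below are illustrative assumptions; the claims only require that the overlay associate audio locations with relative, absolute-from-relative, or anchor-style text locations (claims 11-13).

```python
# Sketch of a media-overlay representation of the mapping, stored in
# association with the audio data and the text version of the work.
import json

overlay = {
    "work": "example-work",
    "entries": [
        # relative text location: a fraction into the text version
        {"audio_seconds": 0.0, "text": {"relative": 0.0}},
        # relative location plus an absolute offset from it (claim 12)
        {"audio_seconds": 125.4, "text": {"relative": 0.05, "offset_chars": 310}},
        # anchor referring into the text version (claim 13)
        {"audio_seconds": 3600.0, "text": {"anchor": "ch02-para17"}},
    ],
}

serialized = json.dumps(overlay)
# the overlay round-trips independently of the audio and text files
print(json.loads(serialized)["entries"][2]["text"]["anchor"])  # → ch02-para17
```

Keeping the overlay separate means neither the e-book file nor the audio book file needs to be modified when a mapping is created or updated.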
A method comprising: receiving a text version of a work; performing a text-to-speech analysis of the text version to generate first audio data; based on the first audio data and the text version, generating a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the text version of the work; receiving second audio data that reflects an audible version of the work for which the text version exists; and based on (1) a comparison of the first audio data with the second audio data and (2) the first mapping, generating a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the text version of the work; wherein the method is performed by one or more computing devices.

A method comprising: receiving audio input; performing a speech-to-text analysis of the audio input to generate text for portions of the audio input; determining whether the text generated for the portions of the audio input matches currently displayed text; and in response to determining that the text matches the currently displayed text, causing the currently displayed text to be highlighted; wherein the method is performed by one or more computing devices.
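The highlight-on-match step of claim 15 can be sketched as a word-level comparison between the recognized text and the displayed text. The recognizer output, the case-insensitive word matching, and the `[[...]]` markers standing in for visual highlighting are all illustrative assumptions.

```python
# Sketch of claim 15: text recognized from audio input is compared with
# the currently displayed text, and matching words are highlighted.

def highlight_matches(recognized_text, displayed_text):
    """Wrap each displayed word that matches a recognized word in
    [[...]] markers standing in for visual highlighting."""
    recognized = {w.lower() for w in recognized_text.split()}
    out = []
    for word in displayed_text.split():
        if word.strip('.,!?').lower() in recognized:
            out.append("[[" + word + "]]")
        else:
            out.append(word)
    return " ".join(out)

print(highlight_matches("call me", "Call me Ishmael."))
# → [[Call]] [[me]] Ishmael.
```

In a reading-tutor style use, this is the loop that follows a user reading aloud: each spoken word that matches the page is lit up as it is recognized.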
An electronic device comprising: at least one processor; and memory storing one or more programs for execution by the at least one processor, the one or more programs including instructions for: receiving audio data, the audio data reflecting an audible version of a work for which a text version exists; performing a speech-to-text analysis of the audio data to generate text for portions of the audio data; and based on the text generated for the portions of the audio data, generating a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work.

An electronic device comprising: at least one processor; and memory storing one or more programs for execution by the at least one processor, the one or more programs including instructions for: receiving a text version of a work; performing a text-to-speech analysis of the text version to generate first audio data; based on the first audio data and the text version, generating a first mapping between a first plurality of audio locations in the first audio data and a corresponding plurality of text locations in the text version of the work; receiving second audio data that reflects an audible version of the work for which the text version exists; and based on (1) a comparison of the first audio data with the second audio data and (2) the first mapping, generating a second mapping between a second plurality of audio locations in the second audio data and the plurality of text locations in the text version of the work.

An electronic device comprising: at least one processor; and memory storing one or more programs for execution by the at least one processor, the one or more programs including instructions for: receiving audio input; performing a speech-to-text analysis of the audio input to generate text for portions of the audio input; determining whether the text generated for the portions of the audio input matches currently displayed text; and in response to determining that the text matches the currently displayed text, causing the currently displayed text to be highlighted.

A computer-readable medium comprising one or more programs that, when executed by one or more processors of an electronic device, cause the electronic device to: receive audio data, the audio data reflecting an audible version of a work for which a text version exists; perform a speech-to-text analysis of the audio data to generate text for portions of the audio data; and based on the text generated for the portions of the audio data, generate a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work.
An electronic device comprising: means for receiving audio data, the audio data reflecting an audible version of a work for which a text version exists; means for performing a speech-to-text analysis of the audio data to generate text for portions of the audio data; and means for generating, based on the text generated for the portions of the audio data, a mapping between a plurality of audio locations in the audio data and a corresponding plurality of text locations in the text version of the work.
TW101119921A 2011-06-03 2012-06-01 Automatically creating a mapping between text data and audio data TWI488174B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201161493372P 2011-06-03 2011-06-03
US13/267,738 US20120310642A1 (en) 2011-06-03 2011-10-06 Automatically creating a mapping between text data and audio data

Publications (2)

Publication Number Publication Date
TW201312548A true TW201312548A (en) 2013-03-16
TWI488174B TWI488174B (en) 2015-06-11

Family

ID=47675616

Family Applications (1)

Application Number Title Priority Date Filing Date
TW101119921A TWI488174B (en) 2011-06-03 2012-06-01 Automatically creating a mapping between text data and audio data

Country Status (3)

Country Link
JP (2) JP5463385B2 (en)
CN (1) CN102937959A (en)
TW (1) TWI488174B (en)

Cited By (139)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI510940B (en) * 2014-05-09 2015-12-01 Univ Nan Kai Technology Image browsing device for establishing note by voice signal and method thereof
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US10318871B2 (en) 2005-09-08 2019-06-11 Apple Inc. Method and apparatus for building an intelligent automated assistant
US10332518B2 (en) 2017-05-09 2019-06-25 Apple Inc. User interface for correcting recognition errors
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10403283B1 (en) 2018-06-01 2019-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10417344B2 (en) 2014-05-30 2019-09-17 Apple Inc. Exemplar-based natural language processing
US10417405B2 (en) 2011-03-21 2019-09-17 Apple Inc. Device access using voice authentication
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10438595B2 (en) 2014-09-30 2019-10-08 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US10453443B2 (en) 2014-09-30 2019-10-22 Apple Inc. Providing an indication of the suitability of speech recognition
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10521466B2 (en) 2016-06-11 2019-12-31 Apple Inc. Data driven natural language event detection and classification
US10529332B2 (en) 2015-03-08 2020-01-07 Apple Inc. Virtual assistant activation
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
US10643611B2 (en) 2008-10-02 2020-05-05 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
US10657961B2 (en) 2013-06-08 2020-05-19 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10684703B2 (en) 2018-06-01 2020-06-16 Apple Inc. Attention aware virtual assistant dismissal
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10699717B2 (en) 2014-05-30 2020-06-30 Apple Inc. Intelligent assistant for home automation
US10706841B2 (en) 2010-01-18 2020-07-07 Apple Inc. Task flow identification based on user intent
US10714117B2 (en) 2013-02-07 2020-07-14 Apple Inc. Voice trigger for a digital assistant
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10741185B2 (en) 2010-01-18 2020-08-11 Apple Inc. Intelligent automated assistant
US10748546B2 (en) 2017-05-16 2020-08-18 Apple Inc. Digital assistant services based on device capabilities
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10769385B2 (en) 2013-06-09 2020-09-08 Apple Inc. System and method for inferring user intent from speech inputs
US10789945B2 (en) 2017-05-12 2020-09-29 Apple Inc. Low-latency intelligent automated assistant
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10795541B2 (en) 2009-06-05 2020-10-06 Apple Inc. Intelligent organization of tasks items
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US10904611B2 (en) 2014-06-30 2021-01-26 Apple Inc. Intelligent automated assistant for TV user interactions
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US10942703B2 (en) 2015-12-23 2021-03-09 Apple Inc. Proactive assistance based on dialog communication between devices
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US11010127B2 (en) 2015-06-29 2021-05-18 Apple Inc. Virtual assistant for media playback
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US11023513B2 (en) 2007-12-20 2021-06-01 Apple Inc. Method and apparatus for searching using an active ontology
US11048473B2 (en) 2013-06-09 2021-06-29 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US11069336B2 (en) 2012-03-02 2021-07-20 Apple Inc. Systems and methods for name pronunciation
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
US11080012B2 (en) 2009-06-05 2021-08-03 Apple Inc. Interface for a virtual digital assistant
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11127397B2 (en) 2015-05-27 2021-09-21 Apple Inc. Device voice control
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US11231904B2 (en) 2015-03-06 2022-01-25 Apple Inc. Reducing response latency of intelligent automated assistants
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11314370B2 (en) 2013-12-06 2022-04-26 Apple Inc. Method for extracting salient dialog usage from live data
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11350253B2 (en) 2011-06-03 2022-05-31 Apple Inc. Active transport based notifications
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11495218B2 (en) 2018-06-01 2022-11-08 Apple Inc. Virtual assistant operation in multi-device environments
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
TWI812070B (en) * 2022-03-15 2023-08-11 宏碁股份有限公司 Method and system for converting audio file to text
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11886805B2 (en) 2015-11-09 2024-01-30 Apple Inc. Unconventional virtual assistant interactions
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10672399B2 (en) * 2011-06-03 2020-06-02 Apple Inc. Switching between text data and audio data based on a mapping
CN105706408A (en) * 2013-07-03 2016-06-22 瑞典爱立信有限公司 Providiing electronic books to user device
WO2015040743A1 (en) 2013-09-20 2015-03-26 株式会社東芝 Annotation sharing method, annotation sharing device, and annotation sharing program
CN104765714A (en) * 2014-01-08 2015-07-08 中国移动通信集团浙江有限公司 Switching method and device for electronic reading and listening
CN105893387B (en) * 2015-01-04 2021-03-23 伊姆西Ip控股有限责任公司 Intelligent multimedia processing method and system
CN104866543A (en) * 2015-05-06 2015-08-26 陆默 Various book carrier switching method and device
CN106469040B (en) * 2015-08-19 2019-06-21 华为终端有限公司 Communication means, server and equipment
CN105302908B (en) * 2015-11-02 2020-06-26 北京奇虎科技有限公司 Electronic book related audible audio resource recommendation method and device
CN105373605A (en) * 2015-11-11 2016-03-02 中国农业大学 Batch storage method and system for data files
JP6880556B2 (en) * 2016-03-10 2021-06-02 凸版印刷株式会社 Information presentation device, information presentation method, information presentation system, and program
CN107948405A (en) * 2017-11-13 2018-04-20 百度在线网络技术(北京)有限公司 A kind of information processing method, device and terminal device
CN108172247A (en) * 2017-12-22 2018-06-15 北京壹人壹本信息科技有限公司 Record playing method, mobile terminal and the device with store function
CN108108143B (en) * 2017-12-22 2021-08-17 北京壹人壹本信息科技有限公司 Recording playback method, mobile terminal and device with storage function
CN109522427B (en) * 2018-09-30 2021-12-10 北京光年无限科技有限公司 Intelligent robot-oriented story data processing method and device
CN109634700A (en) * 2018-11-26 2019-04-16 维沃移动通信有限公司 A kind of the content of text display methods and terminal device of audio
US11114085B2 (en) 2018-12-28 2021-09-07 Spotify Ab Text-to-speech from media content item snippets
CN110750229A (en) * 2019-09-30 2020-02-04 北京淇瑀信息科技有限公司 Voice quality inspection display method and device and electronic equipment
CN111324330B (en) * 2020-02-07 2021-04-30 掌阅科技股份有限公司 Electronic book playing processing method, computing device and computer storage medium
CN111459446B (en) * 2020-03-27 2021-08-17 掌阅科技股份有限公司 Resource processing method of electronic book, computing equipment and computer storage medium
CN112530472B (en) * 2020-11-26 2022-06-21 北京字节跳动网络技术有限公司 Audio and text synchronization method and device, readable medium and electronic equipment
US11798536B2 (en) 2021-06-14 2023-10-24 International Business Machines Corporation Annotation of media files with convenient pause points
US11537781B1 (en) 2021-09-15 2022-12-27 Lumos Information Services, LLC System and method to support synchronization, closed captioning and highlight within a text document or a media file

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2986345B2 (en) * 1993-10-18 1999-12-06 インターナショナル・ビジネス・マシーンズ・コーポレイション Voice recording indexing apparatus and method
JP2000074685A (en) * 1998-08-31 2000-03-14 Matsushita Electric Ind Co Ltd Retrieving method in mobile unit and car navigation system
US6369811B1 (en) * 1998-09-09 2002-04-09 Ricoh Company Limited Automatic adaptive document help for paper documents
IL144557A0 (en) * 1999-02-19 2002-05-23 Custom Speech Usa Inc Automated transcription system and method using two speech converting instances and computer-assisted correction
US6260011B1 (en) * 2000-03-20 2001-07-10 Microsoft Corporation Methods and apparatus for automatically synchronizing electronic audio files with electronic text files
JP2002169588A (en) * 2000-11-16 2002-06-14 Internatl Business Mach Corp <Ibm> Text display device, text display control method, storage medium, program transmission device, and reception supporting method
US7366979B2 (en) * 2001-03-09 2008-04-29 Copernicus Investments, Llc Method and apparatus for annotating a document
JP2002344880A (en) * 2001-05-22 2002-11-29 Megafusion Corp Contents distribution system
WO2003102920A1 (en) * 2002-05-30 2003-12-11 Custom Speech Usa, Inc. A method for locating an audio segment within an audio file
US7567902B2 (en) * 2002-09-18 2009-07-28 Nuance Communications, Inc. Generating speech recognition grammars from a large corpus of data
JP2004152063A (en) * 2002-10-31 2004-05-27 Nec Corp Structuring method, structuring device and structuring program of multimedia contents, and providing method thereof
JP2005070645A (en) * 2003-08-27 2005-03-17 Casio Comput Co Ltd Text and voice synchronizing device and text and voice synchronization processing program
JP4600828B2 (en) * 2004-01-14 2010-12-22 日本電気株式会社 Document association apparatus and document association method
JP2006023860A (en) * 2004-07-06 2006-01-26 Sharp Corp Information browser, information browsing program, information browsing program recording medium, and information browsing system
US20070055514A1 (en) * 2005-09-08 2007-03-08 Beattie Valerie L Intelligent tutoring feedback
JP2007206317A (en) * 2006-02-01 2007-08-16 Yamaha Corp Authoring method and apparatus, and program
US20080140652A1 (en) * 2006-12-07 2008-06-12 Jonathan Travis Millman Authoring tool
JP5232795B2 (en) * 2007-02-14 2013-07-10 エルジー エレクトロニクス インコーポレイティド Method and apparatus for encoding and decoding object-based audio signals
US8515757B2 (en) * 2007-03-20 2013-08-20 Nuance Communications, Inc. Indexing digitized speech with words represented in the digitized speech
US20090112572A1 (en) * 2007-10-30 2009-04-30 Karl Ola Thorn System and method for input of text to an application operating on a device
US8055693B2 (en) * 2008-02-25 2011-11-08 Mitsubishi Electric Research Laboratories, Inc. Method for retrieving items represented by particles from an information database
JP2010078979A (en) * 2008-09-26 2010-04-08 Nec Infrontia Corp Voice recording device, recorded voice retrieval method, and program
US8954328B2 (en) * 2009-01-15 2015-02-10 K-Nfb Reading Technology, Inc. Systems and methods for document narration with multiple characters having multiple moods

Cited By (216)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11928604B2 (en) 2005-09-08 2024-03-12 Apple Inc. Method and apparatus for building an intelligent automated assistant
US10318871B2 (en) 2005-09-08 2019-06-11 Apple Inc. Method and apparatus for building an intelligent automated assistant
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11023513B2 (en) 2007-12-20 2021-06-01 Apple Inc. Method and apparatus for searching using an active ontology
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
US11348582B2 (en) 2008-10-02 2022-05-31 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11900936B2 (en) 2008-10-02 2024-02-13 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10643611B2 (en) 2008-10-02 2020-05-05 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11080012B2 (en) 2009-06-05 2021-08-03 Apple Inc. Interface for a virtual digital assistant
US10795541B2 (en) 2009-06-05 2020-10-06 Apple Inc. Intelligent organization of tasks items
US10741185B2 (en) 2010-01-18 2020-08-11 Apple Inc. Intelligent automated assistant
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US10706841B2 (en) 2010-01-18 2020-07-07 Apple Inc. Task flow identification based on user intent
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US10692504B2 (en) 2010-02-25 2020-06-23 Apple Inc. User profiling for voice input processing
US10417405B2 (en) 2011-03-21 2019-09-17 Apple Inc. Device access using voice authentication
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11350253B2 (en) 2011-06-03 2022-05-31 Apple Inc. Active transport based notifications
US11069336B2 (en) 2012-03-02 2021-07-20 Apple Inc. Systems and methods for name pronunciation
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11321116B2 (en) 2012-05-15 2022-05-03 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US10978090B2 (en) 2013-02-07 2021-04-13 Apple Inc. Voice trigger for a digital assistant
US11557310B2 (en) 2013-02-07 2023-01-17 Apple Inc. Voice trigger for a digital assistant
US11862186B2 (en) 2013-02-07 2024-01-02 Apple Inc. Voice trigger for a digital assistant
US10714117B2 (en) 2013-02-07 2020-07-14 Apple Inc. Voice trigger for a digital assistant
US11636869B2 (en) 2013-02-07 2023-04-25 Apple Inc. Voice trigger for a digital assistant
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US10657961B2 (en) 2013-06-08 2020-05-19 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US11048473B2 (en) 2013-06-09 2021-06-29 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US11727219B2 (en) 2013-06-09 2023-08-15 Apple Inc. System and method for inferring user intent from speech inputs
US10769385B2 (en) 2013-06-09 2020-09-08 Apple Inc. System and method for inferring user intent from speech inputs
US11314370B2 (en) 2013-12-06 2022-04-26 Apple Inc. Method for extracting salient dialog usage from live data
TWI510940B (en) * 2014-05-09 2015-12-01 Univ Nan Kai Technology Image browsing device for establishing note by voice signal and method thereof
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US10657966B2 (en) 2014-05-30 2020-05-19 Apple Inc. Better resolution when referencing to concepts
US10699717B2 (en) 2014-05-30 2020-06-30 Apple Inc. Intelligent assistant for home automation
US10714095B2 (en) 2014-05-30 2020-07-14 Apple Inc. Intelligent assistant for home automation
US10878809B2 (en) 2014-05-30 2020-12-29 Apple Inc. Multi-command single utterance input method
US10417344B2 (en) 2014-05-30 2019-09-17 Apple Inc. Exemplar-based natural language processing
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US11810562B2 (en) 2014-05-30 2023-11-07 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US11670289B2 (en) 2014-05-30 2023-06-06 Apple Inc. Multi-command single utterance input method
US11699448B2 (en) 2014-05-30 2023-07-11 Apple Inc. Intelligent assistant for home automation
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US10904611B2 (en) 2014-06-30 2021-01-26 Apple Inc. Intelligent automated assistant for TV user interactions
US11838579B2 (en) 2014-06-30 2023-12-05 Apple Inc. Intelligent automated assistant for TV user interactions
US11516537B2 (en) 2014-06-30 2022-11-29 Apple Inc. Intelligent automated assistant for TV user interactions
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10438595B2 (en) 2014-09-30 2019-10-08 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US10453443B2 (en) 2014-09-30 2019-10-22 Apple Inc. Providing an indication of the suitability of speech recognition
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
TWI585744B (en) * 2014-09-30 2017-06-01 蘋果公司 Method, system, and computer-readable storage medium for operating a virtual assistant
US10390213B2 (en) 2014-09-30 2019-08-20 Apple Inc. Social reminders
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US11231904B2 (en) 2015-03-06 2022-01-25 Apple Inc. Reducing response latency of intelligent automated assistants
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US10930282B2 (en) 2015-03-08 2021-02-23 Apple Inc. Competing devices responding to voice triggers
US11842734B2 (en) 2015-03-08 2023-12-12 Apple Inc. Virtual assistant activation
US10529332B2 (en) 2015-03-08 2020-01-07 Apple Inc. Virtual assistant activation
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11127397B2 (en) 2015-05-27 2021-09-21 Apple Inc. Device voice control
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
US10681212B2 (en) 2015-06-05 2020-06-09 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US11947873B2 (en) 2015-06-29 2024-04-02 Apple Inc. Virtual assistant for media playback
US11010127B2 (en) 2015-06-29 2021-05-18 Apple Inc. Virtual assistant for media playback
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11550542B2 (en) 2015-09-08 2023-01-10 Apple Inc. Zero latency digital assistant
US11954405B2 (en) 2015-09-08 2024-04-09 Apple Inc. Zero latency digital assistant
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US11809886B2 (en) 2015-11-06 2023-11-07 Apple Inc. Intelligent automated assistant in a messaging environment
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US11886805B2 (en) 2015-11-09 2024-01-30 Apple Inc. Unconventional virtual assistant interactions
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10354652B2 (en) 2015-12-02 2019-07-16 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10942703B2 (en) 2015-12-23 2021-03-09 Apple Inc. Proactive assistance based on dialog communication between devices
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US11657820B2 (en) 2016-06-10 2023-05-23 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US11809783B2 (en) 2016-06-11 2023-11-07 Apple Inc. Intelligent device arbitration and control
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US11749275B2 (en) 2016-06-11 2023-09-05 Apple Inc. Application integration with a digital assistant
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
US10942702B2 (en) 2016-06-11 2021-03-09 Apple Inc. Intelligent device arbitration and control
US10521466B2 (en) 2016-06-11 2019-12-31 Apple Inc. Data driven natural language event detection and classification
US10580409B2 (en) 2016-06-11 2020-03-03 Apple Inc. Application integration with a digital assistant
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10553215B2 (en) 2016-09-23 2020-02-04 Apple Inc. Intelligent automated assistant
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US11656884B2 (en) 2017-01-09 2023-05-23 Apple Inc. Application integration with a digital assistant
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
US10332518B2 (en) 2017-05-09 2019-06-25 Apple Inc. User interface for correcting recognition errors
US10741181B2 (en) 2017-05-09 2020-08-11 Apple Inc. User interface for correcting recognition errors
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US11599331B2 (en) 2017-05-11 2023-03-07 Apple Inc. Maintaining privacy of personal information
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
US10847142B2 (en) 2017-05-11 2020-11-24 Apple Inc. Maintaining privacy of personal information
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
US10789945B2 (en) 2017-05-12 2020-09-29 Apple Inc. Low-latency intelligent automated assistant
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US11837237B2 (en) 2017-05-12 2023-12-05 Apple Inc. User-specific acoustic models
US11538469B2 (en) 2017-05-12 2022-12-27 Apple Inc. Low-latency intelligent automated assistant
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US11380310B2 (en) 2017-05-12 2022-07-05 Apple Inc. Low-latency intelligent automated assistant
US11580990B2 (en) 2017-05-12 2023-02-14 Apple Inc. User-specific acoustic models
US11862151B2 (en) 2017-05-12 2024-01-02 Apple Inc. Low-latency intelligent automated assistant
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US11675829B2 (en) 2017-05-16 2023-06-13 Apple Inc. Intelligent automated assistant for media exploration
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
US10748546B2 (en) 2017-05-16 2020-08-18 Apple Inc. Digital assistant services based on device capabilities
US11217255B2 (en) 2017-05-16 2022-01-04 Apple Inc. Far-field extension for digital assistant services
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10909171B2 (en) 2017-05-16 2021-02-02 Apple Inc. Intelligent automated assistant for media exploration
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US11710482B2 (en) 2018-03-26 2023-07-25 Apple Inc. Natural assistant interaction
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US11854539B2 (en) 2018-05-07 2023-12-26 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11907436B2 (en) 2018-05-07 2024-02-20 Apple Inc. Raise to speak
US11487364B2 (en) 2018-05-07 2022-11-01 Apple Inc. Raise to speak
US11169616B2 (en) 2018-05-07 2021-11-09 Apple Inc. Raise to speak
US11900923B2 (en) 2018-05-07 2024-02-13 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
US11431642B2 (en) 2018-06-01 2022-08-30 Apple Inc. Variable latency device coordination
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US10720160B2 (en) 2018-06-01 2020-07-21 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10684703B2 (en) 2018-06-01 2020-06-16 Apple Inc. Attention aware virtual assistant dismissal
US11495218B2 (en) 2018-06-01 2022-11-08 Apple Inc. Virtual assistant operation in multi-device environments
US10984798B2 (en) 2018-06-01 2021-04-20 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11009970B2 (en) 2018-06-01 2021-05-18 Apple Inc. Attention aware virtual assistant dismissal
US11630525B2 (en) 2018-06-01 2023-04-18 Apple Inc. Attention aware virtual assistant dismissal
US11360577B2 (en) 2018-06-01 2022-06-14 Apple Inc. Attention aware virtual assistant dismissal
US10403283B1 (en) 2018-06-01 2019-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10504518B1 (en) 2018-06-03 2019-12-10 Apple Inc. Accelerated task performance
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US10944859B2 (en) 2018-06-03 2021-03-09 Apple Inc. Accelerated task performance
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11893992B2 (en) 2018-09-28 2024-02-06 Apple Inc. Multi-modal inputs for voice commands
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11783815B2 (en) 2019-03-18 2023-10-10 Apple Inc. Multimodality in digital assistant systems
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11705130B2 (en) 2019-05-06 2023-07-18 Apple Inc. Spoken notifications
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11675491B2 (en) 2019-05-06 2023-06-13 Apple Inc. User configurable task triggers
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11888791B2 (en) 2019-05-21 2024-01-30 Apple Inc. Providing message response suggestions
US11360739B2 (en) 2019-05-31 2022-06-14 Apple Inc. User activity shortcut suggestions
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11924254B2 (en) 2020-05-11 2024-03-05 Apple Inc. Digital assistant hardware abstraction
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
US11750962B2 (en) 2020-07-21 2023-09-05 Apple Inc. User identification using headphones
TWI812070B (en) * 2022-03-15 2023-08-11 宏碁股份有限公司 Method and system for converting audio file to text

Also Published As

Publication number Publication date
JP5463385B2 (en) 2014-04-09
JP2014132345A (en) 2014-07-17
JP2013008357A (en) 2013-01-10
TWI488174B (en) 2015-06-11
CN102937959A (en) 2013-02-20

Similar Documents

Publication Publication Date Title
TWI488174B (en) Automatically creating a mapping between text data and audio data
AU2016202974B2 (en) Automatically creating a mapping between text data and audio data
US10671251B2 (en) Interactive eReader interface generation based on synchronization of textual and audial descriptors
US11657725B2 (en) E-reader interface system with audio and highlighting synchronization for digital books
US8209169B2 (en) Synchronization of an input text of a speech with a recording of the speech
KR100287093B1 (en) Speech synthesis method, speech synthesis device, hypertext control method and control device
MacWhinney et al. Transcribing, searching and data sharing: The CLAN software and the TalkBank data repository
JP5257330B2 (en) Statement recording device, statement recording method, program, and recording medium
US20060194181A1 (en) Method and apparatus for electronic books with enhanced educational features
US10650089B1 (en) Sentence parsing correction system
Öktem et al. Corpora compilation for prosody-informed speech processing
Carletta et al. The NITE XML Toolkit meets the ICSI Meeting Corpus: Import, annotation, and browsing
KR101030777B1 (en) Method and apparatus for producing script data
KR20100014031A (en) Device and method for making u-contents by easily, quickly and accurately extracting only wanted part from multimedia file
TW202221697A (en) Analysis system and upload method for language
JPH04199262A (en) Picture displaying method for multimedia document processing system
JP2001014137A (en) Electronic document processing method, electronic document processor, and storage medium recording electronic document processing program

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees