TWI269191B - Method of synchronizing speech waveform playback and text display - Google Patents

Method of synchronizing speech waveform playback and text display

Info

Publication number
TWI269191B
Authority
TW
Taiwan
Prior art keywords
input
text
feature vector
vector sequence
sequence
Prior art date 2005-07-27
Application number
TW094125461A
Other languages
Chinese (zh)
Other versions
TW200705222A (en)
Inventor
Ren-Yuan Lyu
Yuang-Chin Chiang
Hong-Wen Sie
Original Assignee
Ren-Yuan Lyu
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2005-07-27
Filing date: 2005-07-27
Publication date: 2006-12-21
Application filed by Ren-Yuan Lyu
Priority to TW094125461A
Application granted
Publication of TWI269191B
Publication of TW200705222A


Abstract

A method of synchronizing speech waveform playback and text display is disclosed. The synchronization can be performed at the syllable, character, or word level. The method includes the following steps: getting an input text, which includes multiple syllables, characters, and words; getting the reference feature vector sequence for the input text by concatenating multiple reference feature vector sequences drawn from a database of feature vector sequences of all linguistic units (such as syllables, characters, and words) of the target language or languages; getting a speech waveform; extracting the feature vector sequence from the speech waveform; and searching for the syllable boundaries by aligning the extracted feature vector sequence with the reference feature vector sequence, where the alignment is performed using the dynamic time warping (DTW) technique.

Description


Description of the Invention

[Technical Field of the Invention]

The present invention relates to a speech segmentation technique: by locating syllable boundary points, it allows speech playback and text display to be synchronized.

[Prior Art]

In the traditional publishing industry, an audio book (a recorded document) is usually produced by first writing a recording script (a text file); a professional narrator then records it in a studio, sentence by sentence or paragraph by paragraph, under the supervision of a recording engineer. When the whole recording process is finished, a digital sound waveform file is produced, and waveform editing software is then used to cut the waveform at sentence or paragraph boundaries by hand, sentence by sentence or paragraph by paragraph. This process is very time-consuming, so its cost in time is high. If segmentation into smaller linguistic units such as the character, syllable, or word is required, the cost of manual cutting is several times higher still. Audio books on the market are therefore mostly presented as whole recorded passages; the few that come with a computer browsing program can synchronize at the sentence level, that is, while a sentence is played, the whole sentence of text appears on the screen at the same time. The subtitle display of ordinary television programs or movie dialogue also belongs to this class of application.

Products in which the screen shows the text character by character while the sound is playing are found in karaoke sing-along video programs. Such a program first displays the lyrics on the screen and then, using a pre-edited musical score, manually creates an alignment, one by one, between each character or syllable of the lyrics and each note of the score. When the notes of the accompaniment are later played (usually generated by an electronic synthesizer from a numeric representation of the notes, commonly known as MIDI), the tempo of the piece is fixed, so the computer can keep the played notes synchronized with the characters or syllables of the lyrics, helping the singer follow the melody. For natural vocal recordings or performances of (non-electronically-synthesized) musical instruments, however, no product has yet appeared in which sound and text are synchronized.

In the field of computer speech processing (synthesis or recognition), many automatic speech waveform segmentation techniques have been developed [see referenced academic papers 1 and 2 at the end of the Prior Art section]. These techniques can segment a speech waveform at various speech unit levels. For example, with the well-known Hidden Markov Model (HMM), after models are built in advance for every elementary speech unit, the elementary speech units contained in the waveform of any sentence can be recognized, and along with them the boundary position of every speech unit of that sentence within the waveform can be decoded. The method disclosed in Republic of China Patent 497092, "Segmentation method for Chinese speech synthesis," is such a method, using HMMs to segment Chinese speech waveforms into syllable units. That method segments a large amount of recorded human speech so that the pieces can later serve as the elementary synthesis units of a speech synthesis system, which concatenates them into a synthesized speech waveform for an arbitrary Chinese sentence. This is the text-to-speech (TTS) technology of the computer speech processing field.

Properly applied, text-to-speech technology can also achieve a sound-and-text synchronization effect, mainly because the technology starts from sentence text as input: the system analyzes the elementary linguistic units that make up the sentence, such as phonemes and syllables, and then concatenates the pre-recorded speech waveforms of those elementary units into a sentence, with some smoothing of the signal junctions and prosodic adjustment to polish the concatenated synthetic waveform. However refined these techniques are, the naturalness of the result still falls far short of a person recording the whole sentence directly. From the standpoint of language learning, a recording of a whole sentence or a whole article by a real speaker still has irreplaceable value. Moreover, the spoken dialogue of existing television programs and movies often comes with sentence-level text; marking it with phonetic notation with the aid of signal processing technology, and further linking the sound waveform to the text sequence, has real value, especially in the field of language learning.

Among known speech segmentation techniques, the use of HMMs is quite common, but an HMM system involves a rather complicated and costly speech unit model training stage, the so-called "model training" [see referenced academic paper 3 at the end of the Prior Art section]. This stage requires a very large speech database: for every speech unit there must be sufficient speech waveform data (empirically, at least 100 speech templates per unit to be modeled) before model parameters that adequately represent that unit can be "trained," so it is not easy to implement. And although HMMs, applied in the field of speech recognition to decode the speech units of waveforms or feature vector sequences of unknown text content, have achieved great success, they are not the best choice for the application described in the present invention once efficiency or implementation cost is considered. The present invention instead proposes an approach based on dynamic time warping (DTW): it is enough to record, one by one, a sound file for each elementary speech unit of the language to be processed (for Chinese, typically the syllable) and to convert each recording into a feature vector sequence with a feature extraction algorithm; the result serves as the reference speech database, without "assuming" any statistical speech model, let alone the complicated and costly model training process.

[Referenced Academic Papers]

1. Chou, F.-C., C.-Y. Tseng and L.-S. Lee, "Automatic Segmental and Prosodic Labeling of Mandarin Speech," Proceedings of the International Conference on Spoken Language Processing, 1998, pp. 1263-1266.
2. Ljolje, A. and M. D. Riley, "Automatic segmentation of speech for TTS," Proceedings of the European Conference on Speech Communication and Technology, 1993, pp. 1445-1448.
3. Huang, Acero, and Hon, "Spoken Language Processing," Prentice Hall PTR, 2001, pp. 389-393.

[Summary of the Invention]

The primary object of the present invention is to provide a technique for synchronizing speech playback with text display, whose core is a speech segmentation technique.
per sample)。如圖5為該儲存下來的聲音波形概念 圖,其中橫軸代表時間(time),通常以秒作單位,縱軸代表 聲波之振幅(amplitude),並在後文以『輸入語音4〇』指涉 之0 為達到錄音之功能,文字顯示同步之應用程式1 〇最好 在一個有錄音功能之電腦上執行。 步驟204 : =特=擷取演算引擎15取得輸入語音40之對應的輸 入特彳政向里序列41,此步驟如同步驟202關於製作夹者特 :向量序列資料庫™,亦即將-語音;:特= =以便在步驟2G5之計算時使用。目9為從語音波形轉 徵向量之示意圖,其中圖上半部份為原始語音 ί圖=表,Γ1 ’通常以秒為單位,縱轴代表聲音振 列柄徵餘演算法所麻出之特徵向量序 之單位時間’可與原始語音波形之時間軸取相同 向r母個維度的數值大小,則以亮度之明暗程度來 份心::::越黑暗的部份代表數值越大,越白亮的部 1269191 在圖6中所不的特徵向量序N 41,為一抽象概念圖, 亦即其僅顯特徵向量序列巾某單—維度隨時間變化的 曲線(其杈軸代表時間,縱軸代表該單一維度數值之大 小)’其並未元全顯現特徵向量實為複數維度向量的事實, 惟此表不並無礙於往後觀念之闡釋。故後文仍以特徵向量 序列41來指涉之。 步驟205 : 利用動悲日T軸校正技術(DTW,Dynamic Time Warping) 尋找出該輸入語音40之複數音節邊界點,此部分由動態時 軸校正計算引擎16所完成,請參考圖6。 動態時軸校正技術為一已知之數學計算方法,重點針 對兩個向量序列做點對點(p〇int t〇 p〇int)的最佳連結 (optimal alignment),有關動態時軸校正技術(dtw)可參考 ρ·383〜ρ·385, “Spoken Language Processing”,coauthored by Huang,Acero,and Hon,published by Prentice Hall PTR, 2001。以下介紹動態時軸校正技術(DTW)之基本概念: 請參考圖7,X轴上有六個點X〇〜X5,Y轴有四個點 Y0〜Y3,皆為一維向量序列,向量值分別列於χ〇〜χ5盥 Υ0〜Υ3的旁邊,譬如Χ5之向量值為3。動態時軸校正技術 (DTW)的重點在於找到一條最佳連結路徑,起始點為 (Χ0,Υ0),終點為(Χ5,Υ3),而最佳連結路徑即在Χ0〜Χ5與 Υ0〜Υ3的各相交點做最佳的選擇。在χ〇〜Χ5與Y〇〜Y3的 各相交點可以算出向量值之距離(一律取絕對值),譬如 D(X2,Y1)=| 4_1 丨=3, D(X4,Y3)=| 1_2 | = 1 等等,最佳連結 15 1269191 攸(XG,YG)開始即尋找下—個相交㉟(往下—格、往右 二f或往右1斜角-格),因此可以選擇之路徑有相當多 :7顯不的虛線A的路徑較虛線B的路徑來得好,原 =亚線A路徑所經過的交叉點向量值之距離總和較小。 =上’以圖7為例,虛線A的路徑為最佳連結路徑。 時軸校正技術為-已知之數學計算方法,因此在此 不再贅述其細節。 旦在本發明中,這兩個向量序列即是串接之參考特徵向 輪人特徵向量序列41 (在圖6中以特徵向 ^某早-維度隨時間變化之曲線作為示意,該曲線之走 ^類似原始聲音波型),找出最佳連結路徑⑽ 串接之參考特徵向量相w之各連接點31b,3ie,3id = 二可找出輸人特徵向量序列41的邊界點,並進—步換算 輸t語Ϊ4〇的音節邊界點45b,45c,45d,45e(代表時間:可 為早位)。需;主意的是由於特徵向量序列31之連接點 = 與結束點31f即是對應輸入特徵向量序列;;】 n 與結絲攸,所以音節邊界點 ,5f疋不而要由動悲時軸校正技術找出。 =出音節邊界點45b,45e,45d,45e之後,請參考圖8, 虽播放輸人語音40時,可設計—計時器 節邊界點45b,45C,45d,45e所代表之日#門赴二^ 丁 〇十至曰 數之輸入文字2”的每—.「i=:;',賴的複 丁」3 音即」即設計特殊反 二。譬如輸入語音40在進入音節邊界點45 #,『好』即開始反白或以顯著之方式呈現,而輸入語音 1269191 ^白進^^邊界點45d時,而由下—個字『溫』開始進 資料同f之方式呈現。如此便達到文字資料於語音 資料同放的目的。需注意岐,文字資料於語音 :!到的字以顯著之方式呈現,有的方式 方式呈現,有的是僅將目輕到的二 木而其餘的字不顯示等等。 4Sf音邊界點45b,45c,45d,45e連同起始點45a 二等育訊’可獨立儲存於另一槽案(seg檔) 劇本樓案⑽檔)中,連同前^^ ί (wav擋),可作為本發明應用程式10之輸 出物件,如圖10A以及圖10B所示。 其中圖10A為記載音節邊界資訊之seg檔之 例,其f第一行代表本檔案之檔頭,並記錄其相 對應之聲音檔案(wav檔),在此例中該聲者 為”錄音劇本.wav,,,第二行tmipo第ίί tmax = 8.368代表其相對應之聲音檔之起始時間及 結束時間’單位為「秒」。第四行segments: Size = 28 代表本槽案之音節邊界資訊總共有段 (segments),接下來為連續重複的資訊如反白部 所示,其中一片段資訊如下, segment[2]: tmin = 0.0344375 tmax=0.358875 text = ’’ma 1 ’’ 這片段資訊代表第2 (音節)段,起始時間 0.03443 75,結束時間〇·3 5 8875,該音段對應文字 1269191 (或音標)為“mal“。餘類推。 另外如圖1 0B所示為一記載文字、音標及音節 邊界之tcp檔實施例,其中第一行代表本檔案之檔 頭’並記錄其相對應之聲音標案(wav槽),在此例 :該聲音檔檔名為,,錄音劇本·wav”,第二行開始, 前三攔位如同文字輸入介面丨丨所述,分土 序號、文字及音標,第4攔位則為音節邊水 其,以分隔兩邊界點的時間(以秒為單位),比 如第2行第4攔位前2個數字0.0344375 _ 〇 358875 1表第第2行第3攔位第夏個音節”mal:之起始 呀間為0.0344375,結束時間為〇 358875。餘類推。 上所陳’本發明無論就目的、手段及功效,在在均 顯示其迥異於習知技術之特徵,懇請㈣查委員明察, ^曰賜准專利,俾嘉惠社會,實感德便。惟應注意的是, 上述諸多實施例僅係為了便於說明而舉例而已,本發明所 圍,申請專利範圍所述為準,而非僅限 施例’譬如本發明之技術可以應詩任何語文, 而非僅限於中文或英文。 【圖式簡單說明】 圖1係本發明關於語音播放盥 架構圖實施例。 Ά子顯不同步之應用程式之 圖2係本發明之流程圖。 圖3係本發明關於複數之輸人 圖3A係本發明關於複數之輪人文字介面之實施例。[Reference Academic Papers] 1· Chou, F.-C5 C,-Y. Tseng and L.-s. Lee, “Automatic Segmental and Prosodic Labeling of Mandarin Speech,” Proceedings of International Conference on Spoken Language Processing,” 1998 , pp1263-1266 2. Ljolje, A. and MD Riley, “Automatic segmentation of speech for TTS,” 'Proceedings of European Conference on Speech Communication and Technology, 1993, pp· 1445-1448· 3. Huang, Acero, and Hon, "Spoken Language Processing", published by Prentice Hall Ptr, 2001, pp. 389-393 [Abstract] δ '1269191 The main purpose of the present invention is to provide a technique for synchronizing voice playback with text display, the core of which is voice cutting Technology. 
Another primary object of the present invention is to reach the syllable-unit boundary points of speech purely through computer software computation, without any manual processing.

A further object of the present invention is to avoid the drawbacks of HMMs and the algorithms derived from them: the costly, very large speech database they require and their inefficient and complicated "speech unit modeling" process.

To achieve the above objects, the method of the present invention for synchronizing speech playback with text display comprises the following steps:

obtaining an input text comprising a plurality of characters;

according to the input text, selecting from a reference feature vector sequence database the reference feature vector sequences corresponding to the input text, and concatenating these reference feature vector sequences into a concatenated reference feature vector sequence, wherein the reference feature vector sequence database comprises a plurality of reference feature vector sequences;

obtaining an input speech, wherein the input speech is a human voice recorded for the input text;

computing the input feature vector sequence corresponding to the input speech; and

finding the syllable boundary points of the input speech, wherein the syllable boundary points are found by applying the dynamic time warping (DTW) technique to the concatenated reference feature vector sequence and the input feature vector sequence.

Once the syllable boundary points are available, the speech can be segmented, and the speech waveform can be linked to the text at the syllable level. This can then be applied wherever speech and text should appear in word-by-word synchronization, such as audio books, television and movie dialogue, language teaching, and karaoke sing-along video programs, or even to synchronizing the voice of an animated cartoon with a character's mouth movements.

[Embodiments]

So that the examiners may better understand the technical content of the present invention, a preferred embodiment is described below.

Please refer to FIG. 1, an architecture diagram of the application 10 for synchronizing speech playback with text display. The application 10 mainly lets a user (for example, an audio book producer) enter text and the speech corresponding to that text, after which the application 10 makes speech playback and text display proceed in word-by-word synchronization. The application 10 mainly comprises a text input interface 11, a voice input interface 12, a text-to-phonetic-notation code 13, a feature vector search code 14, a feature extraction engine 15, a dynamic time warping computation engine 16, a feature vector concatenation code 17, and a reference feature vector sequence database 50. The function of each code is explained below together with the flowchart of the present invention in FIG. 2; please also refer to FIGS. 3 to 8 for the technique of the present invention.
Step 201:

Obtain an input text 20 (see FIG. 2); for example, the user enters the Chinese characters 『媽媽好溫柔』.

The user can enter the input text 20 through the text input interface 11. FIG. 3A shows a specific embodiment that uses common text editing software as the text input interface 11, in which the first field is a running serial number, the second field is the text, and the third field is the phonetic notation (optionally with word boundaries). Note that the text portion can be Chinese characters (covering Mandarin and other Chinese languages), English, or any other language, and the phonetic sequence is notation at the "syllable," "character," or "word" level. In FIG. 3A, serial number 001 is the Chinese text 『媽媽好溫柔』 with the Mandarin syllable notation "ma1-ma1-hau3-un1-rou2," 002 is Chinese text with Taiwanese (Min-nan) syllable notation, 003 is English text with word-level notation, and 004 is English text with syllable-level notation; in the English text, "-" marks a syllable boundary and "=" marks a word boundary.

Step 202:

According to the input text 20, select from a reference feature vector sequence database 50 the reference feature vector sequences 30 corresponding to the input text 20, and concatenate them into a concatenated reference feature vector sequence 31.

The meaning of the reference feature vector sequence database 50 is explained first. To produce the feature vector database 50, a "reference speech database" must be prepared in advance. The basic step of its production is to record the speech of all elementary linguistic units of the target language or languages (such as Chinese or English, where Chinese covers the various Chinese languages such as Mandarin, Min-nan, Hakka, and so on). The elementary linguistic units referred to here include at least phonemes; syllables, where a syllable covers every tone (for Mandarin the first, second, third, and fourth tones and the neutral tone, so that 『ㄅㄧ』, 『ㄅㄧˊ』, 『ㄅㄧˇ』, 『ㄅㄧˋ』, 『ㄅㄚ』, 『ㄅㄚˊ』, 『ㄅㄚˇ』, 『ㄅㄚˋ』, 『ㄅㄚ˙』, 『ㄨ』, 『ㄨˊ』, 『ㄨˇ』, 『ㄨˋ』, and so on are all Mandarin syllables); letters or characters; and words. The feature extraction engine 15 then converts the "reference speech database" into the reference feature vector sequence database 50, which comprises a plurality of reference feature vector sequences 30. Each feature vector can have a plurality of dimensions, and the dimensions can include the spectrum, pitch, energy, cepstrum, and linear prediction coefficients (LPC) of the original speech signal, as well as related parameters derived from them.

The feature extraction engine 15 is a known technique; see Hsiao-Chuan Wang, "Speech Signal Processing," Chuan Hwa Book Co., 2005, Chap. 7, pp. 7-2 to 7-10; it is therefore not described further here.
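For illustration only, the following is a minimal sketch of how the reference feature vector sequence database 50 might be realized, assuming one recording per syllable and the availability of the Python library librosa. MFCCs stand in here for the spectrum/cepstrum-style features listed above, and every file name and function name is a hypothetical stand-in rather than something specified by the patent.

```python
import librosa

def extract_features(wav_path, n_mfcc=13):
    """Convert a waveform file into a feature vector sequence (frames x dims).

    MFCCs are one concrete choice among the spectrum / pitch / energy /
    cepstrum / LPC style features the description allows (an assumption).
    """
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    return mfcc.T  # one feature vector per time frame

def build_reference_database(syllable_wavs):
    """Build database 50: one feature vector sequence 30 per syllable label.

    syllable_wavs: dict mapping a syllable label (e.g. "ma1") to a wav path.
    """
    return {syl: extract_features(path) for syl, path in syllable_wavs.items()}

# Hypothetical usage:
# ref_db = build_reference_database({"ma1": "ma1.wav", "hau3": "hau3.wav"})
```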
Once the reference feature vector sequence database 50 exists, the feature vector search code 14 can be used to select from it the reference feature vector sequences 30 corresponding to the input text 20. That is, since the input text 20 is 『媽媽好溫柔』, the database 50 is searched for the reference feature vector sequences 30 of 『ㄇㄚ』, 『ㄇㄚ』, 『ㄏㄠˇ』, 『ㄨㄣ』, and 『ㄖㄡˊ』.

Note that if in step 201 the user entered only the input text without supplying phonetic notation, the text-to-phonetic-notation code 13 must first find the syllable of each character, that is, find 『ㄇㄚ』, 『ㄇㄚ』, 『ㄏㄠˇ』, 『ㄨㄣ』, and 『ㄖㄡˊ』 for 『媽媽好溫柔』. Applications that automatically label syllables are an existing technique: for Chinese there are tables listing the phonetic notations corresponding to each character, and for the common case of one character with several pronunciations, a pronunciation dictionary of words can be built; since the pronunciation of a word is far less ambiguous than that of a single character, this greatly reduces the one-character-to-many-pronunciations problem, and where ambiguity still remains, the most frequently occurring pronunciation can be taken. As this is a known technique, its principle is not described further here.

After the reference feature vector sequences 30 of 『ㄇㄚ』, 『ㄇㄚ』, 『ㄏㄠˇ』, 『ㄨㄣ』, and 『ㄖㄡˊ』 are found, the feature vector concatenation code 17 concatenates these reference feature vector sequences 30 into a concatenated reference feature vector sequence 31 (please refer to FIG. 6) for use in the computation of step 205.
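Continuing the sketch above, the work of the feature vector search code 14 and the feature vector concatenation code 17 reduces, under the same assumptions, to dictionary lookups followed by stacking the per-syllable feature matrices along the time axis. The span bookkeeping is an added convenience for reading off boundary points in step 205, not something the patent prescribes.

```python
import numpy as np

def concat_reference_sequence(ref_db, syllables):
    """Build the concatenated reference feature vector sequence 31.

    ref_db: dict of syllable label -> feature matrix (frames x dims)
    syllables: phonetic labels of the input text 20, here assumed given,
               e.g. ["ma1", "ma1", "hau3", "un1", "rou2"].
    Returns the stacked matrix and (syllable, start, end) frame spans,
    whose junctions correspond to the connection points 31b..31e.
    """
    pieces, spans, start = [], [], 0
    for syl in syllables:
        feats = ref_db[syl]
        pieces.append(feats)
        spans.append((syl, start, start + len(feats)))  # [start, end) frames
        start += len(feats)
    return np.vstack(pieces), spans

# Hypothetical usage:
# ref_seq, spans = concat_reference_sequence(ref_db, ["ma1", "ma1", "hau3", "un1", "rou2"])
```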
Step 203:

Obtain an input speech 40.

The user enters the input speech 40, that is, the spoken voice of 『媽媽好溫柔』, through the voice input interface 12. The input speech 40 is a human voice entered by the user and can be stored as a sound file (wave file). FIG. 5A shows a specific embodiment of the voice input interface 12: the interface takes the sentences of the recording script one by one and displays each in a window, uses speech endpoint detection to detect the narrator's voice automatically, and removes the environmental noise before and after the sentence (the silent segments containing no voice). The result is then stored in a common standard sound file format that records at least the sampling rate of the sound and the number of bits per sample. FIG. 5 is a conceptual diagram of the stored sound waveform, in which the horizontal axis represents time, usually in seconds, and the vertical axis represents the amplitude of the sound wave; hereinafter it is referred to as the "input speech 40."

To provide the recording function, the application 10 is best executed on a computer with recording capability.

Step 204:

Use the feature extraction engine 15 to obtain the input feature vector sequence 41 corresponding to the input speech 40. As in the construction of the reference feature vector sequence database 50 in step 202, this step converts a speech waveform into a feature vector sequence so that it can be used in the computation of step 205. FIG. 9 illustrates the conversion from a speech waveform to a feature vector sequence: the upper half of the figure is the original speech waveform (horizontal axis time, usually in seconds; vertical axis amplitude), and the lower half is the feature vector sequence produced by the feature extraction algorithm, whose unit of time can be taken on the same time axis as the original waveform; the magnitude of each dimension is rendered as brightness, the darker a region the larger the value and the brighter the smaller.

The feature vector sequence 41 drawn in FIG. 6 is an abstract conceptual picture: it shows only the curve of a single dimension of the feature vector sequence varying over time (its horizontal axis represents time and its vertical axis the value of that single dimension). It does not fully display the fact that a feature vector is really a vector of several dimensions, but this does not hinder the explanation that follows, so the figure is still used hereinafter to refer to the feature vector sequence 41.

Step 205:

Find the syllable boundary points of the input speech 40 using dynamic time warping (DTW). This part is carried out by the dynamic time warping computation engine 16; please refer to FIG. 6.

Dynamic time warping is a known mathematical method whose focus is to compute an optimal point-to-point alignment between two vector sequences; for DTW, see pp. 383-385 of "Spoken Language Processing," by Huang, Acero, and Hon, Prentice Hall PTR, 2001. Its basic concept is as follows. Referring to FIG. 7, the X axis has six points X0 to X5 and the Y axis has four points Y0 to Y3, both one-dimensional vector sequences; the vector values are listed beside X0 to X5 and Y0 to Y3 (for example, the vector value of X5 is 3). The point of DTW is to find an optimal alignment path that starts at (X0, Y0) and ends at (X5, Y3), making the best choice at each grid intersection of X0 to X5 with Y0 to Y3. At each intersection the distance between the two vector values can be computed (always taking the absolute value), for example D(X2, Y1) = |4 - 1| = 3 and D(X4, Y3) = |1 - 2| = 1. Starting from (X0, Y0), the path always moves to a next intersection (one cell down, one cell right, or one cell diagonally), so quite many paths can be chosen. As FIG. 7 shows, the path of dashed line A is better than the path of dashed line B, because the sum of the distances of the vector values at the intersections it passes through is smaller; taking FIG. 7 as the example, the path of dashed line A is the optimal alignment path. Since dynamic time warping is a known mathematical method, its details are not repeated here.
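The grid search just described is the classic DTW recurrence. Below is a minimal, self-contained sketch of it, an illustration of the known technique rather than code from the patent, using Euclidean distances between frames and the three moves described above (a step in one sequence, a step in the other, or a diagonal step):

```python
import numpy as np

def dtw_align(ref, inp):
    """Align two feature vector sequences with dynamic time warping.

    ref: concatenated reference sequence 31, shape (N, D)
    inp: input feature vector sequence 41, shape (M, D)
    Returns (total_cost, path), where path is the optimal alignment as
    (ref_frame, input_frame) pairs from (0, 0) to (N - 1, M - 1).
    """
    N, M = len(ref), len(inp)
    # Local distance between every pair of frames (Euclidean here).
    dist = np.linalg.norm(ref[:, None, :] - inp[None, :, :], axis=2)
    acc = np.full((N, M), np.inf)  # accumulated cost of the best path
    acc[0, 0] = dist[0, 0]
    for i in range(N):
        for j in range(M):
            if i == 0 and j == 0:
                continue
            prev = min(
                acc[i - 1, j] if i > 0 else np.inf,                # advance ref only
                acc[i, j - 1] if j > 0 else np.inf,                # advance input only
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,  # diagonal step
            )
            acc[i, j] = dist[i, j] + prev
    # Backtrack from the end point to recover the optimal path.
    i, j = N - 1, M - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((s for s in steps if s[0] >= 0 and s[1] >= 0),
                   key=lambda s: acc[s])
        path.append((i, j))
    path.reverse()
    return acc[N - 1, M - 1], path
```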
In the present invention, the two vector sequences are the concatenated reference feature vector sequence 31 and the input feature vector sequence 41 (illustrated in FIG. 6 by single-dimension curves whose shape resembles the original sound waveforms). After the optimal alignment path 80 is found, the connection points 31b, 31c, 31d, 31e of the concatenated reference feature vector sequence 31 locate the corresponding boundary points of the input feature vector sequence 41, which are then converted into the syllable boundary points 45b, 45c, 45d, 45e of the input speech 40 (expressed as times, for example in seconds). Note that since the connection point 31a at the start and the point 31f at the end of the sequence 31 correspond directly to the starting point and ending point of the input feature vector sequence 41, the syllable boundary points 45a and 45f need not be found by dynamic time warping.

After the syllable boundary points 45b, 45c, 45d, 45e are found, please refer to FIG. 8: while the input speech 40 is being played, a timer can be arranged so that at the time represented by each of the boundary points 45b, 45c, 45d, 45e, the corresponding character of the input text 20 is presented in a special, conspicuous way. For example, when the input speech 40 passes the boundary point 45b, 『好』 is highlighted (for example, shown in inverse video); when it passes the boundary point 45d, the next character 『溫』 is presented in the same way; and so on. Thus the goal of displaying the text data in synchronization with the speech data is achieved. Note that there are many ways to present the current character conspicuously: some highlight it, some show it in inverse video, and some display only the current character while leaving the remaining characters hidden.

The syllable boundary points 45b, 45c, 45d, 45e, together with the starting point 45a and the ending point 45f, can be stored independently in a separate file (a seg file) or merged into the recording script file (a tcp file); together with the sound file (wav file), these can serve as the output objects of the application 10 of the present invention, as shown in FIG. 10A and FIG. 10B.

FIG. 10A is an embodiment of a seg file recording the syllable boundary information. Its first line is the file header and records the corresponding sound file (wav file), in this example "錄音劇本.wav". The second and third lines record the start and end times of the corresponding sound file (here 0 and tmax = 8.368), in seconds. The fourth line, segments: size = 28, states that the syllable boundary information of this file comprises 28 segments in total, followed by consecutive repeated records (shown highlighted in the figure) such as:

segment[2]:
tmin = 0.0344375
tmax = 0.358875
text = "ma1"

This record represents the second (syllable) segment, with start time 0.0344375 and end time 0.358875, and the text (or phonetic notation) corresponding to this segment is "ma1"; the rest follow by analogy.

FIG. 10B is an embodiment of a tcp file recording the text, phonetic notation, and syllable boundaries. Its first line is the file header and records the corresponding sound file (wav file), here named "錄音劇本.wav". From the second line on, the first three fields are the running serial number, text, and phonetic notation described for the text input interface 11, and the fourth field holds the syllable boundary times separating consecutive boundary points, in seconds. For example, the first two numbers of the fourth field of the second line, 0.0344375 and 0.358875, state that the first syllable "ma1" in the third field of that line starts at 0.0344375 and ends at 0.358875; the rest follow by analogy.
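Tying the sketches together, the input-side boundary times can be read off the alignment path at the reference-side syllable junctions and written out in a seg-like form. The frame-to-seconds conversion and the simplified output layout below are assumptions modeled loosely on FIG. 10A, not the exact file format.

```python
def syllable_boundaries(path, spans, frame_period=0.01):
    """Map reference-side syllable junctions onto input-side times.

    path: (ref_frame, input_frame) pairs from dtw_align
    spans: (syllable, ref_start, ref_end) triples from concat_reference_sequence
    frame_period: seconds per feature frame (an assumed analysis step of 10 ms)
    Returns (syllable, tmin, tmax) triples for the input speech 40.
    """
    ref_to_input = {}
    for r, i in path:  # keep the first input frame aligned to each ref frame
        ref_to_input.setdefault(r, i)
    last_input = path[-1][1]
    segments = []
    for syl, r_start, r_end in spans:
        t0 = ref_to_input[r_start] * frame_period
        t1 = ref_to_input.get(r_end, last_input + 1) * frame_period
        segments.append((syl, t0, t1))
    return segments

def write_seg(segments, wav_name, out_path):
    """Write the records in a seg-like layout (simplified from FIG. 10A)."""
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(f'file = "{wav_name}"\n')
        f.write(f"segments: size = {len(segments)}\n")
        for k, (syl, t0, t1) in enumerate(segments, 1):
            f.write(f'segment[{k}]:\ntmin = {t0}\ntmax = {t1}\ntext = "{syl}"\n')
```

A playback routine can then walk these (tmin, tmax) intervals with a timer and highlight the corresponding character, as described above.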
As set forth above, the present invention, in its objects, means, and effects alike, exhibits characteristics quite different from the known techniques; the examiners are respectfully requested to examine it and grant the patent, to the benefit of society. It should be noted, however, that the many embodiments above are given only for convenience of explanation; the scope of the present invention is defined by the appended claims and is not limited to the embodiments. For example, the technique of the present invention can be applied to any language, not only Chinese or English.

[Brief Description of the Drawings]

FIG. 1 is an embodiment of the architecture of the application for synchronizing speech playback with text display according to the present invention.
FIG. 2 is a flowchart of the present invention.
FIG. 3 is an embodiment of the input text according to the present invention.
FIG. 3A is an embodiment of the input text interface according to the present invention.

FIG. 4 is an embodiment of the reference speech database according to the present invention.
FIG. 5 is an embodiment of the input speech according to the present invention.
FIG. 5A is an embodiment of the voice input interface (recording interface) according to the present invention.
FIG. 6 is a schematic diagram of the dynamic time warping (DTW) technique according to the present invention.
FIG. 7 is an explanatory diagram of the main principle of the dynamic time warping (DTW) technique.
FIG. 8 is a schematic diagram showing the effect of the present invention.
FIG. 9 is a schematic diagram of the feature vector extraction technique according to the present invention.
FIG. 10A is an embodiment of a syllable boundary point information record file according to the present invention.
FIG. 10B is an embodiment of a file merging the syllable boundary point information into the text and phonetic notation records according to the present invention.

[Description of Reference Numerals]

application for synchronizing speech playback with text display 10; text input interface 11; voice input interface 12; text-to-phonetic-notation code 13; feature vector search code 14; feature extraction engine 15; dynamic time warping computation engine 16; feature vector concatenation code 17; input text 20; reference feature vector sequence 30; concatenated reference feature vector sequence 31; connection points 31a, 31b, 31c, 31d, 31e, 31f; input speech 40; input feature vector sequence 41; syllable boundary points 45a, 45b, 45c, 45d, 45e, 45f; reference feature vector sequence database 50; optimal alignment path 80

Claims (7)

Claims:

1. A method for synchronizing speech playback with text display, comprising the following steps:
obtaining an input text comprising a plurality of characters;
according to the input text, selecting from a reference feature vector sequence database the reference feature vector sequences corresponding to the input text and concatenating the reference feature vector sequences into a concatenated reference feature vector sequence, wherein the reference feature vector sequence database comprises a plurality of reference feature vector sequences;
obtaining an input speech, wherein the input speech is a human voice entered for the input text;
obtaining the input feature vector sequence corresponding to the input speech; and
finding the syllable boundary points of the input speech, wherein the syllable boundary points are found by applying the dynamic time warping (DTW) technique to the concatenated reference feature vector sequence and the input feature vector sequence, optimally aligning the two vector sequences.

2. The method for synchronizing speech playback with text display of claim 1, wherein the input text can be Chinese or English.

3. A computer-readable article that synchronizes speech playback with text display, the article comprising a medium on which program code can be recorded, the medium comprising the following program code:
a first program code for obtaining an input text comprising a plurality of characters;
a second program code for selecting, according to the input text, the reference feature vector sequences corresponding to the input text from a reference feature vector sequence database and concatenating the reference feature vector sequences into a concatenated reference feature vector sequence, wherein the reference feature vector sequence database comprises a plurality of reference feature vector sequences;
a third program code for obtaining an input speech, wherein the input speech is a human voice entered for the input text;
a fourth program code for obtaining the input feature vector sequence corresponding to the input speech; and
a fifth program code for finding the syllable boundary points of the input speech, wherein the syllable boundary points are found by applying the dynamic time warping (DTW) technique to the concatenated reference feature vector sequence and the input feature vector sequence, optimally aligning the two vector sequences.

4. The computer-readable article of claim 3, wherein the input text can be Chinese or English.

5. A computer-readable article that synchronizes speech playback with text display, the article comprising a medium on which program code can be recorded, the medium comprising the following:
a text input interface for entering an input text comprising a plurality of characters;
a voice input interface for entering the input speech recorded for the input text, wherein the input speech is a human voice;
a reference feature vector sequence database comprising a plurality of reference feature vector sequences;
a feature vector search code for selecting from the reference feature vector sequence database the reference feature vector sequences corresponding to the input text;
a feature vector concatenation code for concatenating the reference feature vector sequences into a concatenated reference feature vector sequence;
a feature extraction engine for obtaining the input feature vector sequence corresponding to the input speech; and
a dynamic time warping computation engine for finding the syllable boundary points of the input speech, wherein the syllable boundary points are found by applying the dynamic time warping (DTW) technique to the concatenated reference feature vector sequence and the input feature vector sequence, optimally aligning the two vector sequences.

6. The computer-readable article of claim 5, further comprising a text-to-phonetic-notation code for finding the syllables corresponding to the input text.

7. The computer-readable article of claim 5, wherein the input text can be Chinese or English.

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW094125461A 2005-07-27 2005-07-27 Method of synchronizing speech waveform playback and text display (TWI269191B)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW094125461A 2005-07-27 2005-07-27 Method of synchronizing speech waveform playback and text display (TWI269191B)

Publications (2)

Publication Number Publication Date
TWI269191B 2006-12-21
TW200705222A TW200705222A (en) 2007-02-01

Family

ID=38291478

Family Applications (1)

Application Number Title Priority Date Filing Date
TW094125461A 2005-07-27 2005-07-27 Method of synchronizing speech waveform playback and text display (TWI269191B)

Country Status (1)

Country Link
TW (1) TWI269191B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI402824B (en) * 2009-10-15 2013-07-21 Univ Nat Cheng Kung A pronunciation variation generation method for spontaneous speech synthesis
TWI470589B (en) * 2011-08-12 2015-01-21 Hwa Jiuh Digital Technology Ltd Cloud digital speech recording system


Also Published As

Publication number Publication date
TW200705222A (en) 2007-02-01

Similar Documents

Publication Publication Date Title
US8219398B2 (en) Computerized speech synthesizer for synthesizing speech from text
US10347238B2 (en) Text-based insertion and replacement in audio narration
WO2017190674A1 (en) Method and device for processing audio data, and computer storage medium
US8825486B2 (en) Method and apparatus for generating synthetic speech with contrastive stress
Cano et al. Voice Morphing System for Impersonating in Karaoke Applications.
JP2003295882A (en) Text structure for speech synthesis, speech synthesizing method, speech synthesizer and computer program therefor
JP2002221980A (en) Text voice converter
US8914291B2 (en) Method and apparatus for generating synthetic speech with contrastive stress
JP2008026622A (en) Evaluation apparatus
JP4697432B2 (en) Music playback apparatus, music playback method, and music playback program
KR100710600B1 (en) The method and apparatus that createdplayback auto synchronization of image, text, lip's shape using TTS
TWI269191B (en) Method of synchronizing speech waveform playback and text display
Kruspe et al. Retrieval of Textual Song Lyrics from Sung Inputs.
Dong et al. I2r speech2singing perfects everyone's singing.
JP5334716B2 (en) Character information presentation control device and program
Nolan The rise and fall of the British school of intonation analysis
JP2003186489A (en) Voice information database generation system, device and method for sound-recorded document creation, device and method for sound recording management, and device and method for labeling
US20070219799A1 (en) Text to speech synthesis system using syllables as concatenative units
JP6252420B2 (en) Speech synthesis apparatus and speech synthesis system
JP2001117598A (en) Device and method for voice conversion
JP2001125599A (en) Voice data synchronizing device and voice data generator
JP5193654B2 (en) Duet part singing system
Lin et al. A corpus-based singing voice synthesis system for Mandarin Chinese
KR101135198B1 (en) Method and apparatus for producing contents using voice
CN113178185A (en) Singing synthesis method and system based on turning note processing method

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees