TWI790705B

TWI790705B - Method for adjusting speech rate and system using the same

Info

Publication number: TWI790705B
Application number: TW110129198A
Authority: TW
Inventors: 楊崇文; 吳根丞; 黃顯詔
Original assignee: 宏正自動科技股份有限公司
Priority date: 2021-08-06
Filing date: 2021-08-06
Publication date: 2023-01-21
Also published as: TW202308396A; CN115705838A

Abstract

A method for adjusting speech rate includes obtaining an original speech signal of a plurality of characters and a total adjustment time, analyzing the original speech signal to obtain a voiced signal section and an unvoiced signal section corresponding to each character, calculating a frame-adjustment amount according to the total adjustment time and a single-frame duration, and adjusting the number of sound frames in at least one voiced signal section according to the frame-adjustment amount to form an adjusted speech signal.

Description

Speech rate adjustment method and system

The present invention relates to speech processing technology, and in particular to a speech rate adjustment method and a system thereof.

The audio description of a video uses a specially made audio track to describe, in speech, events such as character actions and scene changes. Audio description can improve the accessibility of visual imagery for people who are blind, have low vision, or have other visual impairments.

Audio descriptions are expensive and cumbersome to create. Traditionally, video producers hire script writers and voice talent to create them. In this traditional approach, the script writer watches the video to find the segments that need an audio description, determines the time points at which the descriptions are to be inserted, estimates the time available for each description, and writes a descriptive-audio script based on the content of the segments. The voice talent then records, according to the script, an audio description that fits the available time. Under that time limit, the script writer and the voice talent usually have to repeat this process several times to optimize the resulting audio description. For example, the script writer may change which video segment receives the description to obtain a new available time, or rewrite the script to fit a shorter available time, or the voice talent may repeatedly adjust the speaking speed to fit the available time. Because of these challenges, traditional audio description services are quite expensive.

An embodiment of the present invention provides a speech rate adjustment method, which includes: obtaining an original speech signal of multiple characters and a total adjustment duration; analyzing the original speech signal to obtain a voiced signal segment and an unvoiced signal segment corresponding to each character; calculating a frame adjustment amount according to the total adjustment duration and a unit frame duration; and adjusting the number of frames in at least one voiced signal segment according to the frame adjustment amount to form an adjusted speech signal.

Another embodiment of the present invention provides a speech rate adjustment system, which includes a storage unit, an analysis unit, and an adjustment unit. The analysis unit is coupled to the storage unit, and the adjustment unit is coupled to the storage unit and the analysis unit. The storage unit temporarily stores an original speech signal of multiple characters. The analysis unit analyzes the original speech signal to obtain a voiced signal segment and an unvoiced signal segment corresponding to each character. The adjustment unit calculates a frame adjustment amount according to a total adjustment duration and a unit frame duration, and adjusts the number of frames in at least one voiced signal segment according to the frame adjustment amount to form an adjusted speech signal.

In summary, the speech rate adjustment method of any embodiment is suitable for adjusting the speech rate of a speech signal to provide an audio file that satisfies a time limit, thereby reducing the number of repeated recordings and greatly reducing the production cost of the audio file.

Referring to FIG. 1 and FIG. 2, an embodiment of the present invention provides a speech rate adjustment system 10, which includes a storage unit 110, an analysis unit 120, and an adjustment unit 130. The analysis unit 120 is coupled to the storage unit 110, and the adjustment unit 130 is coupled to the storage unit 110 and the analysis unit 120. Here, the storage unit 110 temporarily stores the original speech signal of at least one character. The present invention also provides a speech rate adjustment method, which can be implemented by the speech rate adjustment system 10. For clarity, the following description takes the original speech signal of multiple characters as an example.

Here, the speech rate adjustment method is applicable to adjusting the speech rate of the pronunciation of a word or sentence, or the speech rate of the audio description of a video. The technical content of the method is detailed below.

In one embodiment, the analysis unit 120 obtains and analyzes the original speech signal Si (step S21) to obtain the voiced-sound signal segment and the unvoiced-sound signal segment corresponding to each character (step S22). The adjustment unit 130 obtains the total adjustment duration N1 (step S21), calculates the number of frames to be removed (that is, the frame adjustment amount) according to the total adjustment duration N1 and the unit frame duration (step S23), and adjusts the number of frames in at least one of the voiced signal segments of these characters according to the frame adjustment amount to form the adjusted speech signal So (step S24). A speech frame is the smallest signal segment used in speech signal processing, and the unit frame duration is the time length of one such segment.
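The calculation in step S23 can be sketched as a division of the total adjustment duration by the unit frame duration. This is a minimal sketch, not the patent's stated formula: the function name, the use of integer milliseconds, and the choice to round up are assumptions made for illustration.

```python
def frame_adjustment_amount(total_adjust_ms: int, frame_ms: int) -> int:
    """Number of frames whose combined duration covers the total adjustment duration.

    Assumption: durations are integer milliseconds and a partial frame is
    rounded up so that the adjusted signal never exceeds the time limit.
    """
    if frame_ms <= 0:
        raise ValueError("frame duration must be positive")
    if total_adjust_ms < 0:
        raise ValueError("total adjustment duration must be non-negative")
    # Ceiling division on integers avoids floating-point rounding surprises.
    return -(-total_adjust_ms // frame_ms)


# Example: shortening by 400 ms with 20 ms frames needs 20 frames.
print(frame_adjustment_amount(400, 20))
```

Working in integer milliseconds keeps the result exact; a floating-point version would need an explicit tolerance before rounding.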

In some embodiments, the original speech signal Si includes the character sound signals Sp1~Sp7 of these characters Wo, as shown in FIG. 3. Each of the character sound signals Sp1~Sp7 includes a plurality of frames (hereinafter referred to as original frames).

In linguistics, a sound produced with vocal cord vibration is called a voiced sound, and a sound produced without vocal cord vibration is called an unvoiced sound. There are also consonants that contain both unvoiced and voiced sounds. In the present case, when a character containing such a consonant is encountered, the voiced and unvoiced parts can first be distinguished, and the speech rate is then adjusted.

In some embodiments of step S22, the analysis unit 120 analyzes the original speech signal Si to find the character sound signals Sp1~Sp7 of each character Wo in the original speech signal Si (that is, the signal segment corresponding to each character), and then analyzes the character sound signals Sp1~Sp7 to find, in each of them, the voiced signal segment (the signal segment corresponding to voiced pronunciation) and the unvoiced signal segment (the signal segment corresponding to unvoiced pronunciation). For example, referring to FIG. 3, the analysis unit 120 can convert the original speech signal Si from an amplitude-versus-time speech waveform F1 into a frequency-versus-time sound spectrum F2, and identify the character sound signals Sp1~Sp7 of each character Wo according to the energy distribution. In FIG. 3, for the speech waveform F1, the horizontal axis is time (seconds) and the vertical axis is amplitude (decibels); for the sound spectrum F2, the horizontal axis is time (seconds) and the vertical axis is frequency (hertz, Hz). The analysis unit 120 then identifies the voiced signal segment Z2 and the unvoiced signal segment Z1 according to the energy distribution within the character sound signal Sp of each character Wo.
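The text identifies voiced and unvoiced segments from the energy distribution in the spectrum but does not give the rule. As a rough stand-in, the sketch below labels each frame with a classic time-domain heuristic (high short-time energy and low zero-crossing rate suggest a voiced frame). This is a substituted technique, not the analysis described above, and the function names and thresholds are hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def label_frames(samples, frame_len, energy_thresh, zcr_thresh):
    """Label consecutive non-overlapping frames as 'voiced' or 'unvoiced'.

    Assumption: thresholds are tuned per recording; silence is not
    handled separately in this sketch.
    """
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        voiced = (frame_energy(frame) >= energy_thresh
                  and zero_crossing_rate(frame) <= zcr_thresh)
        labels.append("voiced" if voiced else "unvoiced")
    return labels


# A loud, slowly varying frame followed by a quiet, rapidly alternating one.
samples = [1.0] * 4 + [-1.0] * 4 + [0.1, -0.1] * 4
print(label_frames(samples, frame_len=8, energy_thresh=0.1, zcr_thresh=0.5))
```

Runs of consecutive frames with the same label would then form the segments Z2 and Z1 of a character.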

In an embodiment of step S23, the adjustment unit 130 removes original frames from the voiced signal segment Z2, one original frame at a fixed interval, until the frame adjustment amount has been removed, so as to form an adjusted speech signal So whose speech rate is faster than that of the original speech signal Si. The adjustment unit 130 performs this frame deletion on voiced signal segments Z2 whose total number of frames is greater than the frame adjustment amount.

For example, assume that the frame adjustment amount is 20 original frames to be deleted per character. When the voiced signal segment Z2 of the character sound signal Sp1 of a character Wo has 100 original frames, the adjustment unit 130 processes this voiced signal segment Z2 by deleting one original frame at every interval of 5 original frames.
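The fixed-interval deletion in this example can be sketched as follows. The sketch assumes that "one frame at every interval of 5" means dropping every fifth frame of the segment; the function name and this interpretation are illustrative assumptions.

```python
def remove_frames_evenly(frames, n_remove):
    """Drop n_remove frames from a voiced segment at a fixed interval.

    Assumption: the interval is len(frames) // n_remove and the last
    frame of each interval is the one that is dropped.
    """
    if n_remove <= 0:
        return list(frames)
    if n_remove >= len(frames):
        raise ValueError("segment must have more frames than the adjustment amount")
    interval = len(frames) // n_remove
    kept, removed = [], 0
    for position, frame in enumerate(frames, start=1):
        if removed < n_remove and position % interval == 0:
            removed += 1          # skip this frame
            continue
        kept.append(frame)
    return kept


# 100 original frames, 20 to delete: every 5th frame is dropped.
segment = list(range(1, 101))
shortened = remove_frames_evenly(segment, 20)
print(len(shortened), shortened[:8])
```

Spreading the deletions evenly, rather than cutting a contiguous block, keeps each removed gap to a single frame duration.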

In some embodiments, the adjustment unit 130 can calculate the frame adjustment amount, that is, the number of original frames to be removed from each character Wo, according to the total adjustment duration, the unit frame duration, and the processing count (the number of character sound signals among Sp1~Sp7 that have a voiced signal segment Z2), and then delete that number of original frames from each character sound signal that has a voiced signal segment Z2. In some embodiments, after the frame adjustment amount is calculated, the adjustment unit 130 first confirms whether the number of frames in the voiced signal segment Z2 of every character sound signal that has a voiced signal segment Z2 is greater than the current frame adjustment amount. Only when all of them are greater does the adjustment unit 130 perform frame deletion. Otherwise, the adjustment unit 130 excludes the character sound signals whose frame count is smaller, obtains a new processing count, and recalculates the frame adjustment amount.
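The exclude-and-recalculate loop described above can be sketched like this. It is only an illustration under stated assumptions: the total is expressed in frames, the per-character amount is rounded up, and a segment whose frame count merely equals the amount is excluded too (the text only requires "greater than"). All names are hypothetical.

```python
import math


def per_character_adjustment(voiced_frame_counts, total_frames_to_remove):
    """Return (frames to remove per character, characters to process).

    voiced_frame_counts maps a character label to the number of frames
    in its voiced segment; 0 means the character has no voiced segment.
    """
    candidates = {k: n for k, n in voiced_frame_counts.items() if n > 0}
    while candidates:
        per_char = math.ceil(total_frames_to_remove / len(candidates))
        too_short = [k for k, n in candidates.items() if n <= per_char]
        if not too_short:
            return per_char, sorted(candidates)
        for k in too_short:       # exclude and recompute with a new count
            del candidates[k]
    raise ValueError("no voiced segment is long enough for this adjustment")


counts = {"Sp1": 100, "Sp2": 80, "Sp3": 10, "Sp4": 0}
print(per_character_adjustment(counts, 60))
```

In the usage above, Sp4 has no voiced segment and Sp3 is too short for the first estimate of 20 frames, so the 60 frames are redistributed over Sp1 and Sp2 at 30 frames each.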

In another embodiment of step S23, the adjustment unit 130 inserts frames (hereinafter referred to as supplementary frames) into the voiced signal segment Z2, one supplementary frame at a fixed interval, until the frame adjustment amount has been inserted, so as to form an adjusted speech signal So whose speech rate is slower than that of the original speech signal Si.

For example, assume that the frame adjustment amount is 20 supplementary frames to be added per character. When the voiced signal segment Z2 of the character sound signal Sp1 of a character Wo has 100 original frames, the adjustment unit 130 processes this voiced signal segment Z2 by inserting one supplementary frame after every 5 original frames.

In some embodiments, the inserted supplementary frame is related to at least one original frame adjacent to the insertion position in the voiced signal segment Z2. In one embodiment, the supplementary frame is the average of the original frames adjacent to the insertion position. For example, continuing the previous example, the adjustment unit 130 inserts, between the 5th and 6th original frames of the voiced signal segment Z2, a supplementary frame obtained by averaging the 5th and 6th original frames. In another embodiment, the supplementary frame is the original frame immediately before the insertion position. For example, continuing the previous example, the adjustment unit 130 inserts, between the 5th and 6th original frames of the voiced signal segment Z2, a supplementary frame obtained by copying the 5th original frame. In yet another embodiment, the supplementary frame is the original frame immediately after the insertion position. For example, continuing the previous example, the adjustment unit 130 inserts, between the 5th and 6th original frames of the voiced signal segment Z2, a supplementary frame obtained by copying the 6th original frame.
In other words, the adjustment unit 130 can adjust, according to the actual situation, the interval at which one or more frames are removed or inserted, and the present invention is not limited in this respect.
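The three ways of building a supplementary frame (average of the neighbours, copy of the previous frame, copy of the next frame) can be sketched together. Frames are modelled as lists of samples; the function names, the `mode` strings, and the fallback used when an insertion falls after the last frame are assumptions for illustration.

```python
def make_supplementary(prev_frame, next_frame, mode):
    """Build one supplementary frame from the frames around the insertion point."""
    if mode == "average":
        return [(a + b) / 2 for a, b in zip(prev_frame, next_frame)]
    if mode == "previous":
        return list(prev_frame)
    if mode == "next":
        return list(next_frame)
    raise ValueError(f"unknown mode: {mode}")


def insert_frames_evenly(frames, n_insert, mode="average"):
    """Insert n_insert supplementary frames at a fixed interval."""
    if n_insert <= 0:
        return [list(f) for f in frames]
    interval = len(frames) // n_insert
    if interval < 1:
        raise ValueError("segment has fewer frames than the adjustment amount")
    out, inserted = [], 0
    for position, frame in enumerate(frames, start=1):
        out.append(list(frame))
        if inserted < n_insert and position % interval == 0:
            # Assumption: at the end of the segment the last frame
            # stands in for the missing next frame.
            nxt = frames[position] if position < len(frames) else frame
            out.append(make_supplementary(frame, nxt, mode))
            inserted += 1
    return out


frames = [[0.0], [2.0], [4.0], [6.0]]
print(insert_frames_evenly(frames, 2, mode="average"))
```

With 100 original frames and an adjustment amount of 20, the same call yields 120 frames, one supplementary frame after every 5 original frames.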

In some embodiments, the speech rate adjustment system 10 can further generate an audio-described video according to the adjusted speech signal So and a moving image, where the adjusted speech signal corresponds to silent content in the moving image. In some embodiments, the moving image can be a silent image sequence (for example, a GIF) or an original video (for example, a movie or an animation), and the generated audio-described video is correspondingly an image sequence with sound (for example, a GIF with sound) or an audio-described video (for example, an audio-described movie or animation).

In some embodiments, the original speech signal Si and the adjusted speech signal So can each provide an event description of the image content (that is, the silent content) of a silent passage in the original video. Here, a silent passage is one in which nobody is speaking and there are no sound effects that are meaningful to the plot (for example, a door opening or a vehicle approaching). In other words, the original speech signal Si and the adjusted speech signal So describe, at different speech rates, events such as character actions and/or scene changes in this silent passage.

In some embodiments, referring to FIG. 4, the speech rate adjustment system 10 may further include an audio-video processing unit 180. The audio-video processing unit 180 is coupled to the storage unit 110 and the adjustment unit 130. Here, referring to FIG. 4 and FIG. 5, the audio-video processing unit 180 can generate an audio-described video Vo according to the adjusted speech signal So and the original video Vi (step S26).

In some embodiments of step S21, the original speech signal Si and the total adjustment duration N1 may be provided by an external device coupled to the speech rate adjustment system 10, and/or input to the speech rate adjustment system 10 by a user through a user interface.

In some embodiments, referring to FIG. 4, the speech rate adjustment system 10 may further include a conversion unit 140 and a judging unit 150. The conversion unit 140 is coupled to the storage unit 110, and the judging unit 150 is coupled to the conversion unit 140, the analysis unit 120, and the adjustment unit 130.

In some embodiments of step S21, referring to FIG. 4 and FIG. 5, the conversion unit 140 receives a description text XL corresponding to the silent content (step S11) and converts the description text XL into the original speech signal Si (step S12). The description text XL records an event description composed of these characters Wo, and this event description narrates the silent content in the original video Vi.

After the conversion, the conversion unit 140 temporarily stores the generated original speech signal Si in the storage unit 110, and the judging unit 150 compares the total duration of the original speech signal Si with the total duration Tt of the silent content (step S13).

In some embodiments, the judging unit 150 uses the comparison step (step S13) to confirm whether the total duration of the original speech signal Si is greater than the total duration Tt of the silent content (step S14).

When the total duration of the original speech signal Si is greater than the total duration Tt of the silent content, the judging unit 150 calculates the time difference between the two to obtain the total adjustment duration N1 (step S15) and provides it to the adjustment unit 130. The judging unit 150 also enables the analysis unit 120 to start analyzing the generated original speech signal Si, so that the adjustment unit 130 generates, according to the total adjustment duration N1 and the analysis result, an adjusted speech signal So whose speech rate is faster than that of the original speech signal Si (that is, steps S22~S24 are then executed).

When the total duration of the original speech signal Si is not greater than the total duration Tt of the silent content, the judging unit 150 does not enable the analysis unit 120 (that is, steps S22~S24 are not executed). In this case, if the speech rate adjustment system 10 has the audio-video processing unit 180, the judging unit 150 causes the audio-video processing unit 180 to generate the audio-described video Vo according to the original speech signal Si and the original video Vi (step S26).

In some embodiments, referring to FIG. 4 and FIG. 6, the judging unit 150 uses the comparison step (step S13) to confirm whether the total duration of the original speech signal Si is equal to the total duration Tt of the silent content (step S14').

When the total duration of the original speech signal Si is not equal to the total duration Tt of the silent content, the judging unit 150 calculates the time difference between the two to obtain the total adjustment duration N1 (step S15) and provides it to the adjustment unit 130. The judging unit 150 also enables the analysis unit 120 to start analyzing the generated original speech signal Si, so that the adjustment unit 130 generates the adjusted speech signal So according to the total adjustment duration N1 and the analysis result (that is, steps S22~S24 are then executed). When the total duration of the original speech signal Si is greater than the total duration Tt of the silent content, the adjustment unit 130 generates an adjusted speech signal So whose speech rate is faster than that of the original speech signal Si (step S24). When the total duration of the original speech signal Si is less than the total duration Tt of the silent content, the adjustment unit 130 generates an adjusted speech signal So whose speech rate is slower than that of the original speech signal Si (step S24).

When the total duration of the original speech signal Si is equal to the total duration Tt of the silent content, the judging unit 150 does not enable the analysis unit 120 (that is, steps S22~S24 are not executed). In this case, if the speech rate adjustment system 10 has the audio-video processing unit 180, the judging unit 150 causes the audio-video processing unit 180 to generate the audio-described video Vo according to the original speech signal Si and the original video Vi (step S26).
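The decision made by the judging unit in the FIG. 6 variant (steps S13, S14' and S15) reduces to the sign of a duration difference. A minimal sketch follows, assuming integer milliseconds; the function name and the returned labels are hypothetical.

```python
def plan_adjustment(speech_ms: int, silent_ms: int):
    """Compare the speech duration with the silent-content duration Tt.

    Returns (action, total adjustment duration N1 in ms), where action is
    'speed_up' (remove frames), 'slow_down' (insert frames) or 'none'.
    """
    diff = speech_ms - silent_ms
    if diff == 0:
        return ("none", 0)          # use the original speech signal as is
    if diff > 0:
        return ("speed_up", diff)   # speech is too long for the gap
    return ("slow_down", -diff)     # speech is shorter than the gap


print(plan_adjustment(5400, 5000))
print(plan_adjustment(4700, 5000))
print(plan_adjustment(5000, 5000))
```

The FIG. 5 variant differs only in that a speech signal shorter than the silent content is also left unchanged.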

In some embodiments, the audio-video processing unit 180 can combine the speech signal (that is, the original speech signal Si or the adjusted speech signal So) with the original video Vi by mixing, replacement, or association, to form an audio-described video Vo in which the speech signal serves as the audio of the silent content.

In one embodiment, the audio-video processing unit 180 receives the original video Vi and separates it into an original audio track and a silent picture video. The audio-video processing unit 180 then mixes the original audio track with the speech signal (that is, the original speech signal Si or the adjusted speech signal So) to form an adjusted audio track, and combines the adjusted audio track with the silent picture video into the audio-described video Vo by synchronizing the two.

In another embodiment, the audio-video processing unit 180 receives the original video Vi and separates it into an original audio track and a silent picture video. The audio-video processing unit 180 then replaces, with the speech signal, the audio track segment of the original audio track that corresponds to the playback time of the silent content to form an adjusted audio track, and combines the adjusted audio track with the silent picture video into the audio-described video Vo by synchronizing the two.

In yet another embodiment, the audio-video processing unit 180 receives the original video Vi and finds the audio track segment corresponding to the silent content in the original video Vi. The audio-video processing unit 180 then creates a substitution signal that associates the speech signal with this audio track segment, and generates an audio-described video Vo containing the original video Vi, the speech signal, and the substitution signal. Assume that the audio track segment corresponding to the silent content lies between a first playback time and a second playback time in the original audio track. During playback of the audio-described video Vo, the substitution signal is triggered at the first playback time, so that playback switches from the original audio track to the speech signal; at the second playback time, playback switches back and resumes the original audio track from the position of the second playback time.
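The effect of the substitution signal on playback can be sketched as a selection between two audio sources by playback time. This is only an illustration: the function name is hypothetical, and treating the first playback time as inclusive and the second as exclusive is an assumption.

```python
def active_source(t_s: float, first_s: float, second_s: float) -> str:
    """Which audio plays at time t_s (seconds) in the audio-described video.

    Between the first and second playback times the speech signal
    replaces the original audio track; elsewhere the original plays.
    """
    return "speech" if first_s <= t_s < second_s else "original"


# Silent content spans 12.0 s to 17.5 s of the original audio track.
for t in (11.9, 12.0, 15.0, 17.5):
    print(t, active_source(t, 12.0, 17.5))
```

Because the original audio track is never modified, the same video file can be played with or without the description.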

In some embodiments, semantic recognition can be performed after the adjusted speech signal So is generated, and the adjusted speech signal So is output only when its semantics can be recognized.

In some embodiments, referring to FIG. 4, the speech rate adjustment system 10 may further include a recognition unit 160 and an update unit 170. The recognition unit 160 is coupled to the adjustment unit 130 and the update unit 170.

Referring to FIG. 4 and FIG. 5, or to FIG. 4 and FIG. 6, after the adjustment unit 130 generates the adjusted speech signal So, the recognition unit 160 first detects the semantics of the adjusted speech signal So (step S25) to confirm whether the semantics are recognizable (that is, whether the content of the speech played back from the adjusted speech signal So can be understood). Semantic detection techniques are well known to those skilled in the art and are not described further here.

When the semantics are not recognizable, the recognition unit 160 does not output the adjusted speech signal So and causes the update unit 170 to update the description text so as to reduce the number of characters in the event description (step S27). The update unit 170 then provides the updated description text to the conversion unit 140 for conversion (that is, step S12 is then executed). In this case, if the speech rate adjustment system 10 has the audio-video processing unit 180, the recognition unit 160 does not output the generated adjusted speech signal So to the audio-video processing unit 180.

Only when the semantics are recognizable does the recognition unit 160 output the adjusted speech signal So. In this case, if the speech rate adjustment system 10 has the audio-video processing unit 180, the recognition unit 160 outputs the adjusted speech signal So to the audio-video processing unit 180 for audio-video processing (that is, step S26 is executed).
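The loop formed by steps S12, S25 and S27 (synthesize, adjust, check intelligibility, shorten the text and retry) can be sketched with the individual stages passed in as callables. Every name here is hypothetical, the callables stand in for the conversion, adjustment, recognition and update units, and the `max_rounds` bound is an assumption added so the sketch always terminates.

```python
def produce_adjusted_speech(text, synthesize, adjust, is_intelligible,
                            shorten, max_rounds=5):
    """Retry with a shorter description until the adjusted speech is intelligible."""
    for _ in range(max_rounds):
        speech = synthesize(text)        # step S12: text to original speech
        adjusted = adjust(speech)        # steps S13 to S24: fit the time limit
        if is_intelligible(adjusted):    # step S25: semantic check
            return text, adjusted        # ready for step S26
        text = shorten(text)             # step S27: fewer characters
    raise RuntimeError("no intelligible result within max_rounds")


# Toy stand-ins: speech is a list of words, intelligible if at most 3 words.
result = produce_adjusted_speech(
    "a b c d e",
    synthesize=lambda t: t.split(),
    adjust=lambda s: s,
    is_intelligible=lambda s: len(s) <= 3,
    shorten=lambda t: " ".join(t.split()[:-1]),
)
print(result)
```

Passing the stages as parameters keeps the control flow visible without committing to any particular text-to-speech or speech-recognition library.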

In some embodiments, the storage unit 110, the analysis unit 120, the adjustment unit 130, the conversion unit 140, the judging unit 150, the recognition unit 160, the update unit 170, and the audio-video processing unit 180 can be implemented by one or more processing components.

In some embodiments, the original speech signal Si and the adjusted speech signal So can provide the speech corresponding to the silent content of a silent image sequence. In other words, the original speech signal Si and the adjusted speech signal So provide speech with the same content at different speech rates. For example, the silent image sequence can show the mouth-shape changes of the pronunciation of a word or sentence, while the original speech signal Si and the adjusted speech signal So provide the pronunciation of that word or sentence.

In some embodiments, referring to FIG. 7, the speech rate adjustment system 10 may further include a merging unit 190. The merging unit 190 is coupled to the adjustment unit 130. Referring to FIG. 7 and FIG. 8, the merging unit 190 receives a silent image sequence Mi and, by synchronizing the adjusted speech signal So with the silent image sequence Mi, combines them into an image sequence with sound Mo (step S26').

In some embodiments, the storage unit 110, the analysis unit 120, the adjustment unit 130, and the merging unit 190 can be implemented by one or more processing components.

In some embodiments, the storage unit 110 can be implemented by one or more memories. The aforementioned processing components can be microprocessors, microcontrollers, central processing units, programmable logic controllers, logic circuits, analog circuits, digital circuits, or any analog and/or digital devices that manipulate signals based on operating instructions.

In some embodiments, the speech rate adjustment method of any embodiment can be implemented as a computer program product, such that the method is carried out once a computer loads and executes the program. In some embodiments, the computer program product may be a non-transitory recording medium, with the program stored on that medium for the computer to load. In some embodiments, the program itself may be the computer program product and may be transmitted to the computer by wired or wireless means.
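As a rough illustration of such a program, the following sketch follows the method as summarized in the abstract: a frame adjustment amount is computed from the total adjustment duration and the unit frame duration, and frames are then removed from voiced signal segments only. Several details are assumptions rather than the patented implementation: the amount is obtained by rounding the quotient, a positive total adjustment duration means the speech must be shortened, the removals are spread over the voiced segments in proportion to their length, and only removal (not insertion of supplementary frames) is shown. All function and type names are hypothetical.

```python
from typing import List, Sequence, Tuple

# (is_voiced, frames): hypothetical layout for one signal segment.
Segment = Tuple[bool, List[object]]


def frame_adjustment_amount(total_adjust_s: float, frame_s: float) -> int:
    """Frames to remove, assumed to be the rounded quotient of the two durations."""
    if frame_s <= 0:
        raise ValueError("unit frame duration must be positive")
    return round(total_adjust_s / frame_s)


def remove_at_fixed_interval(frames: Sequence[object], count: int) -> List[object]:
    """Drop `count` frames spaced evenly across one voiced segment."""
    n = len(frames)
    if not 0 <= count <= n:
        raise ValueError("cannot remove more frames than the segment holds")
    if count == 0:
        return list(frames)
    step = n / count
    # One dropped index near the middle of each of `count` equal spans.
    drop = {int(k * step + step / 2) for k in range(count)}
    return [f for i, f in enumerate(frames) if i not in drop]


def shorten_speech(
    segments: Sequence[Segment], total_adjust_s: float, frame_s: float
) -> List[Segment]:
    """Remove frames from voiced segments only; unvoiced segments are kept intact."""
    remaining = frame_adjustment_amount(total_adjust_s, frame_s)
    voiced_left = sum(len(frames) for voiced, frames in segments if voiced)
    if not 0 <= remaining <= voiced_left:
        raise ValueError("adjustment exceeds the available voiced frames")
    adjusted: List[Segment] = []
    for voiced, frames in segments:
        if voiced and frames:
            # Proportional share of the removals still outstanding (an assumption).
            share = min(len(frames), round(remaining * len(frames) / voiced_left))
            adjusted.append((True, remove_at_fixed_interval(frames, share)))
            remaining -= share
            voiced_left -= len(frames)
        else:
            adjusted.append((voiced, list(frames)))
    return adjusted
```

For instance, with a unit frame duration of 0.02 s and a total adjustment duration of 0.08 s, four frames are removed in total, all of them taken from voiced segments.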

10: speech rate adjustment system
110: storage unit
120: analysis unit
130: adjustment unit
140: conversion unit
150: judgment unit
160: recognition unit
170: update unit
180: audio-visual processing unit
190: merging unit
Si: original speech signal
N1: total adjustment duration
So: adjusted speech signal
Wo: character
Sp1~Sp7: character sound signals
F1: speech waveform
F2: sound spectrum
Z1: unvoiced signal segment
Z2: voiced signal segment
Vi: original video
Vo: audio-described video
Mi: silent continuous picture
Mo: continuous picture with sound
XL: description text
Tt: total duration of the silent content
S21~S27: steps
S11~S15: steps
S14': step
S26': step

FIG. 1 is a flowchart of a speech rate adjustment method according to some embodiments.
FIG. 2 is a functional block diagram of a speech rate adjustment system according to some embodiments.
FIG. 3 is a schematic diagram of an original speech signal according to an embodiment.
FIG. 4 is a functional block diagram of a speech rate adjustment system according to some embodiments.
FIG. 5 is a flowchart of a speech rate adjustment method according to some embodiments.
FIG. 6 is a flowchart of a speech rate adjustment method according to some embodiments.
FIG. 7 is a functional block diagram of a speech rate adjustment system according to some embodiments.
FIG. 8 is a flowchart of a speech rate adjustment method according to some embodiments.

S21~S24: steps

Claims

A speech rate adjustment method, comprising: obtaining an original speech signal of a plurality of characters and a total adjustment duration; analyzing the original speech signal to obtain a voiced signal segment and an unvoiced signal segment corresponding to each of the characters; calculating a frame adjustment amount according to the total adjustment duration and a unit frame duration; and adjusting a frame count of at least one of the voiced signal segments according to the frame adjustment amount to form an adjusted speech signal.

The speech rate adjustment method as described in claim 1, wherein the step of adjusting a frame count of at least one of the voiced signal segments according to the frame adjustment amount to form an adjusted speech signal includes: at a fixed interval, removing from the at least one voiced signal segment at least one original frame corresponding to the frame adjustment amount, or inserting at least one supplementary frame corresponding to the frame adjustment amount, so as to form the adjusted speech signal having a speech rate different from that of the original speech signal, wherein the at least one supplementary frame is related to the at least one original frame adjacent to the insertion position.

The speech rate adjustment method as described in claim 1, further comprising: generating an audio-described image according to the adjusted speech signal and a dynamic image, wherein the adjusted speech signal corresponds to silent content in the dynamic image.

The speech rate adjustment method as described in claim 3, wherein the step of obtaining the original speech signal of the plurality of characters and the total adjustment duration includes: receiving a description text, wherein the description text records an event description of the silent content composed of the plurality of characters; and converting the description text into the original speech signal, wherein the total adjustment duration is the time difference between the total duration of the original speech signal and the total duration of the silent content.

The speech rate adjustment method as described in claim 4, further comprising: determining whether the semantics of the adjusted speech signal are recognizable; when the semantics of the adjusted speech signal are not recognizable, reducing the number of the characters constituting the event description in the description text, converting the updated description text into the original speech signal, and continuing with the step of analyzing the original speech signal to obtain the voiced signal segment and the unvoiced signal segment corresponding to each of the characters; and when the semantics are recognizable, outputting the adjusted speech signal.

A speech rate adjustment system, comprising: a storage unit, temporarily storing an original speech signal of a plurality of characters; an analysis unit, coupled to the storage unit, analyzing the original speech signal to obtain a voiced signal segment and an unvoiced signal segment corresponding to each of the characters; and an adjustment unit, coupled to the storage unit and the analysis unit, calculating a frame adjustment amount according to a total adjustment duration and a unit frame duration, and adjusting, according to the frame adjustment amount, a frame count of at least one of the voiced signal segments of the characters to form an adjusted speech signal.

The speech rate adjustment system as described in claim 6, further comprising: a merging unit, coupled to the adjustment unit, receiving a silent continuous picture and combining the adjusted speech signal with the silent continuous picture to form a continuous picture with sound.

The speech rate adjustment system as described in claim 6, further comprising: an audio-visual processing unit, coupled to the adjustment unit, generating an audio-described video according to the adjusted speech signal and an original video, wherein the adjusted speech signal is used to provide an event description of silent content in the original video.

The speech rate adjustment system as described in claim 8, further comprising: a conversion unit, coupled to the storage unit, receiving a description text and converting the description text into the original speech signal, wherein the description text records the event description composed of the plurality of characters, and the total adjustment duration is the time difference between the total duration of the original speech signal and the total duration of the silent content.

The speech rate adjustment system as described in claim 9, further comprising: a recognition unit, coupled to the adjustment unit, determining whether the semantics of the adjusted speech signal are recognizable; and an update unit, coupled to the recognition unit and the conversion unit, updating the description text to reduce the number of the characters constituting the event description when the semantics are not recognizable, and providing the updated description text to the conversion unit for conversion.