TW201415884A - System for adjusting display time of expressions based on analyzing voice signal and method thereof - Google Patents


Info

Publication number
TW201415884A
Authority
TW
Taiwan
Prior art keywords
signal
audio signal
words
voice
display time
Prior art date
Application number
TW101136461A
Other languages
Chinese (zh)
Inventor
Ke Ding
Original Assignee
Inventec Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inventec Corp filed Critical Inventec Corp
Priority to TW101136461A priority Critical patent/TW201415884A/en
Publication of TW201415884A publication Critical patent/TW201415884A/en

Landscapes

  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

A system for adjusting the display time of expressions based on analyzing a voice signal, and a method thereof, are provided. The system analyzes the voice signal to obtain the start time of each speech section, matches each expression of an exposition corresponding to the voice signal with a speech section in sequence, and adjusts the display time of each expression to the start time of its corresponding speech section. The system and method thereby improve the efficiency of adjusting the display time of expressions and achieve the effect of aligning the display time of each expression with the appearance time of its corresponding voice.

Description

System and method for synchronizing a vocal signal with its textual description data

A system and method for synchronizing the display time of words, and in particular a system and method for synchronizing a vocal signal with its textual description data.

Dynamic-lyrics technology allows a player, while a song file is being played, to read the lyrics corresponding to the file and display, in synchronization, the lyrics represented by the vocals being played. In fact, dynamic lyrics are not limited to song files: any multimedia file whose vocals carry meaning can use dynamic lyrics to display, in synchronization, the words those vocals represent.

As more and more multimedia player software and multimedia players support dynamic lyrics, owners of multimedia files increasingly expect that, when the audio signal in a multimedia file is played, the player will simultaneously display the words represented by the vocals being played.

Textual description data that supports dynamic lyrics must include a display time for each of the words it records; only then can the player software or device display each word at the moment the playback time of the audio signal matches that word's display time.

Although most audio signals today have corresponding textual description data that can be used with dynamic lyrics, and tool software exists for modifying the recorded display times when they drift from the times at which the corresponding vocals actually occur, these tools only let the user manually correct each word's display time one by one, or shift the display times of all words by a uniform offset.

Moreover, the textual description data for some audio signals records only the words and no display times at all. Other audio signals exist in multiple versions whose vocals appear at slightly different times; some vocals may be played earlier or later than the time at which the corresponding words are displayed. Manually adding or correcting the display time of every word would take a great deal of time and is highly inefficient.

In summary, the prior art has long suffered from the inability to efficiently adjust the display times of the words corresponding to an audio signal, and an improved technical means is therefore needed to solve this problem.

In view of the prior-art problem that adjusting the display times of words is inefficient, the present invention discloses a system and method for synchronizing a vocal signal with its textual description data. The disclosed system comprises at least: a loading module for loading an audio signal, the audio signal corresponding to textual description data; a vocal extraction module for extracting a vocal signal from the audio signal, the vocal signal containing speech segments; a segment analysis module for analyzing the vocal signal to obtain the start time of each speech segment; a word correspondence module for matching the words in the textual description data with the speech segments; and a time adjustment module for adjusting the display time of each word to the start time of its corresponding speech segment.

The disclosed method for synchronizing a vocal signal with its textual description data comprises at least the steps of: loading an audio signal, the audio signal corresponding to textual description data; extracting a vocal signal from the audio signal, the vocal signal containing speech segments; analyzing the vocal signal to obtain the start time of each speech segment; matching each word in the textual description data, in sequence, with a speech segment; and adjusting the display time of each word to the start time of its corresponding speech segment.
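The final two steps reduce to a one-to-one pairing of ordered captions with segment start times. The sketch below is a minimal illustration of that core, not the patent's implementation; the function name and the requirement that the counts match are assumptions for clarity.

```python
def align_captions(segment_starts, captions):
    """Pair each caption, in order, with the start time (in seconds)
    of its corresponding voice segment.

    segment_starts and captions are assumed to be in playback order;
    the patent's method matches them one-to-one in sequence.
    """
    if len(segment_starts) != len(captions):
        raise ValueError("caption count must match segment count")
    return list(zip(segment_starts, captions))

# each caption's display time becomes its segment's start time
pairs = align_captions([12.5, 30.0, 47.2],
                       ["line one", "line two", "line three"])
```

A mismatch between the number of captions and the number of detected segments is left as an error here; the patent does not specify how such a mismatch would be handled.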

The disclosed system and method differ from the prior art in that the present invention analyzes the vocal signal within the audio signal to obtain the start time of each speech segment, matches each word in the corresponding textual description data with a speech segment in sequence, and then adjusts the display time of each word to the start time of its corresponding speech segment. This solves the problems of the prior art and achieves the technical effect of automatically adjusting each word's display time to the time at which the corresponding vocal appears.

The features and embodiments of the present invention are described in detail below with reference to the drawings, in sufficient detail that anyone skilled in the relevant art can readily understand the technical means applied to solve the technical problem and implement them accordingly, thereby realizing the achievable effects of the invention.

The present invention detects the vocal signal in an audio signal and, based on the start time of each speech segment in the vocal signal, adjusts the display time of the words corresponding to each segment, so that every word is displayed while its corresponding speech segment is being played, with no early or late display.

The audio signal referred to in the present invention contains at least a vocal signal and produces vocals when played; examples include song files and comic-dialogue recordings, though the invention is not limited to these, and the audio signal may also be contained in a multimedia audio/video file. The vocal signal may contain one or more speech segments, delimited by the presence or absence of vocals; that is, there are no vocals between one speech segment and the next.

The operation of the system of the present invention is first described with reference to FIG. 1, an architecture diagram of the system for synchronizing a vocal signal with its textual description data. As shown in FIG. 1, the system comprises a loading module 110, a vocal extraction module 120, a segment analysis module 130, a word correspondence module 150, and a time adjustment module 160.

The loading module 110 is responsible for loading the audio signal. In general, the loading module 110 loads the audio signal into the memory (not shown) of the device implementing the invention, though the invention is not limited to this. The loading module 110 may load the audio signal from a storage medium (not shown) of the device, or from a device external to it; the invention places no particular restriction on the source.

The audio signal loaded by the loading module 110 has corresponding textual description data. For example, when the audio signal is a song file, the textual description data is the corresponding lyrics file; when the audio signal is a comic-dialogue recording, it is the corresponding subtitle file; and when the audio signal is contained in a multimedia audio/video file, it is the subtitle file of that multimedia file. The textual description data of the invention is not limited to these examples.

The textual description data corresponding to the audio signal may be stored in the storage medium of the device implementing the invention or in a device external to it; the invention places no particular restriction here. Notably, the audio signal and its corresponding textual description data need not be stored in the same device.

In general, when loading the audio signal, the loading module 110 also loads the corresponding textual description data into the memory of the device implementing the invention, though the invention is not limited to this.

The vocal extraction module 120 is responsible for extracting the vocal signal from the audio signal loaded by the loading module 110. The playback time of the extracted vocal signal is the same as that of the loaded audio signal.

The vocal extraction module 120 may attenuate specific frequencies in the audio signal, for example frequencies outside the range of 300 Hz to 3000 Hz, so that non-vocal frequencies are attenuated and what remains after attenuation is mainly the vocal signal. Alternatively, the vocal extraction module 120 may invert the left channel of the audio signal and add it to the right channel, and invert the right channel and add it to the left, producing an intermediate signal in which the vocals are cancelled; it then inverts this intermediate signal and adds it to the original audio signal (or inverts the original audio signal and adds the intermediate signal), and the resulting sum is the vocal signal. The manner in which the vocal extraction module 120 extracts the vocal signal from the audio signal is not limited to the above.
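The channel-inversion approach can be sketched with NumPy as follows. This is a minimal illustration, not the patent's implementation, and it assumes the idealized condition under which the trick works: the vocal is mixed identically into both channels (center-panned) while other components differ between channels.

```python
import numpy as np

def extract_center(left, right):
    """Estimate the center-panned (vocal) component of a stereo mix.

    Inverting one channel and summing cancels whatever is identical in
    both channels (the karaoke trick gives the side signal); the mid
    signal, where the differing components cancel instead, is the
    vocal estimate when the vocal is centered.
    """
    side = (left - right) / 2.0   # vocal cancels: instrumental residue
    mid = (left + right) / 2.0    # differing components cancel
    return mid

# toy mix: centered vocal plus an instrument in opposite phase per channel
t = np.linspace(0.0, 1.0, 1000)
vocal = np.sin(2 * np.pi * 5 * t)
instrument = np.sin(2 * np.pi * 50 * t)
left, right = vocal + instrument, vocal - instrument
estimate = extract_center(left, right)
```

On real recordings the separation is only approximate, since reverb and centered instruments leak into the mid signal; the patent accordingly names this as one of several possible extraction techniques.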

The segment analysis module 130 is responsible for analyzing the vocal signal extracted by the vocal extraction module 120 to obtain the start time of each speech segment. In general, the segment analysis module 130 detects vocals in the vocal signal and, while doing so, determines the point at which a vocal is interrupted or stops, then continues detecting new vocals. When the module 130 detects that a vocal has not continued, it treats the continuous vocal detected before the interruption or stop as one speech segment; when it subsequently detects a new vocal, it treats that new vocal as a new speech segment. The point at which the module 130 detects the beginning of a continuous vocal is the start time referred to in the present invention. The manner in which the segment analysis module 130 obtains the speech segments in the vocal signal is not limited to the above.
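Detecting segments separated by silence can be sketched as a simple amplitude gate over fixed-size frames. The threshold and frame size below are arbitrary assumptions for illustration, not values from the patent.

```python
def segment_starts(samples, rate, threshold=0.1, frame=100):
    """Return start times (in seconds) of voiced segments.

    A segment is a run of frames whose peak amplitude exceeds
    `threshold`; a quiet frame ends the current segment, and the next
    loud frame starts a new one (mirroring the continue/interrupt
    logic described for module 130).
    """
    starts, in_segment = [], False
    for i in range(0, len(samples), frame):
        loud = max(abs(s) for s in samples[i:i + frame]) > threshold
        if loud and not in_segment:
            starts.append(i / rate)   # start time of a new segment
            in_segment = True
        elif not loud:
            in_segment = False
    return starts

# toy vocal track at 1000 samples/s: two bursts separated by silence
sig = [0.0] * 200 + [0.5] * 300 + [0.0] * 200 + [0.5] * 300
print(segment_starts(sig, rate=1000))   # → [0.2, 0.7]
```

A production detector would smooth the energy envelope and require a minimum silence duration before closing a segment, so that brief pauses within a sung line do not split it.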

The word correspondence module 150 is responsible for establishing the correspondence between the words in the textual description data and the speech segments obtained by the segment analysis module 130, that is, for matching each word in the textual description data with a speech segment. In general, the word correspondence module 150 matches the words with the speech segments in sequence, following the order of the words in the textual description data and the order in which the segment analysis module 130 obtained the segments; however, the manner in which the word correspondence module 150 matches words with speech segments is not limited to the above.

When the textual description data includes, in addition to the words, a display time for each word, for example when it is a dynamic-lyrics (LRC) file, the word correspondence module 150 may first rearrange the words according to their display times and then match them, in the rearranged order, with the speech segments in sequence. Notably, some words in the textual description data may correspond to two or more display times; the word correspondence module 150 splits such a word into as many copies as there are display times, each copy corresponding to a different display time, and then arranges the words according to their display times.
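The splitting and reordering described above can be sketched for the LRC format, where a line may carry several `[mm:ss.xx]` timestamps. The sketch below is an illustrative parser under the assumption of that standard timestamp syntax; it is not taken from the patent.

```python
import re

def parse_lrc(lines):
    """Expand each LRC line into (seconds, text) entries, one per
    timestamp, then sort by time.

    A line tagged with two display times becomes two entries carrying
    the same text, as module 150 is described to do.
    """
    entries = []
    for line in lines:
        stamps = re.findall(r"\[(\d+):(\d+(?:\.\d+)?)\]", line)
        text = re.sub(r"\[\d+:\d+(?:\.\d+)?\]", "", line).strip()
        for minutes, seconds in stamps:
            entries.append((int(minutes) * 60 + float(seconds), text))
    return sorted(entries)

lyrics = parse_lrc([
    "[00:12.00][01:15.30]chorus line",
    "[00:30.50]verse line",
])
# → [(12.0, 'chorus line'), (30.5, 'verse line'), (75.3, 'chorus line')]
```

After this expansion each entry has exactly one display time, so the sorted list can be matched one-to-one with the speech segments in sequence.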

The time adjustment module 160 is responsible for adjusting the display time of each word in the textual description data to the start time of its corresponding speech segment, based on the correspondence established by the word correspondence module 150 and the start times of the speech segments recorded by the segment analysis module 130. The time adjustment module 160 may adjust the display times recorded in the copy of the textual description data that the loading module 110 loaded into memory, or, when the textual description data is stored in the storage medium of the device implementing the invention, it may directly adjust the display times recorded in the stored textual description data.
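Writing the adjusted times back can be sketched as re-emitting each caption with its segment's start time in place of the old timestamp. The `[mm:ss.xx]` output format is an assumption borrowed from the LRC convention; the function name is illustrative.

```python
def retime_lrc(texts, starts):
    """Re-emit caption lines with display times replaced by the start
    times of their corresponding voice segments, in [mm:ss.xx] form.

    texts and starts are assumed already matched one-to-one in order.
    """
    lines = []
    for text, start in zip(texts, starts):
        minutes, seconds = divmod(start, 60)
        lines.append("[%02d:%05.2f]%s" % (minutes, seconds, text))
    return lines

print(retime_lrc(["first line", "second line"], [12.5, 75.3]))
# → ['[00:12.50]first line', '[01:15.30]second line']
```

The resulting lines could be held in memory or written back over the stored lyrics file, matching the two storage options the paragraph describes.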

In addition, the present invention may include an optional playback module 190, responsible for displaying, in synchronization, each word whose display time equals the time at which the corresponding speech segment of the audio signal is played; that is, when the playback time of the audio signal equals the display time of a word as adjusted by the time adjustment module 160, the playback module displays the word whose adjusted display time equals the playback time. In some embodiments, the playback module 190 may also play the audio signal loaded by the loading module 110.

The operation of the system and method of the present invention is next explained through a first embodiment, with reference to FIG. 2, a flowchart of the method for synchronizing a vocal signal with its textual description data. In this embodiment, the audio signal is assumed to be a song file and the corresponding textual description data a lyrics file.

After a user downloads a song file (audio signal) to a computer's hard disk and plays it, a different version of the song file may cause the lyrics (words) to be displayed at times that differ slightly from the times at which each line of lyrics (speech segment) occurs in the file; that is, the display times recorded in the lyrics file (textual description data) differ from the start times of the lyrics in the song file, so the lyrics are displayed slightly earlier or later than they are sung. In this situation the user can use the present invention to adjust the display time of each line of lyrics recorded in the lyrics file.

First, the loading module 110 loads the audio signal (step 210); in this embodiment, the song file is loaded from the computer's hard disk into its memory. It is also assumed that the loading module 110 loads the lyrics file corresponding to the song file into the computer's memory.

After the loading module 110 loads the audio signal (step 210), the vocal extraction module 120 extracts the vocal signal from the loaded audio signal (step 220), and the segment analysis module 130 then analyzes the extracted vocal signal to obtain the start time of each speech segment (step 230). In this embodiment, the vocal signal is assumed to contain twelve speech segments.

After the segment analysis module 130 obtains the start times of the speech segments, the word correspondence module 150 matches the lines in the textual description data, in sequence, with the speech segments of the vocal signal (step 250). In this embodiment, because the lyrics file (textual description data) includes display times for the lyrics (words), the word correspondence module 150 reorders the lyrics according to the display times of the twelve lines recorded in the file and then matches the reordered twelve lines, in sequence, with the twelve speech segments, so that each line corresponds to a different segment.

In this embodiment, if the lyrics file contains only eleven lines but one line corresponds to two display times, the word correspondence module 150 duplicates that line into two identical lines, one for each of the two display times. The lyrics file then effectively contains twelve lines, each with a different display time. The word correspondence module 150 can then reorder the lines by their display times and match the reordered twelve lines, in sequence, with the twelve speech segments of the vocal signal.

After the word correspondence module 150 matches the lines in the textual description data with the speech segments (step 250), the time adjustment module 160 adjusts, based on the start times obtained by the segment analysis module 130, the display time of each line to the start time of its corresponding segment (step 260). In this embodiment, the time adjustment module 160 adjusts the display times of the lines recorded in the memory of the user's computer, and also the display times of the lines in the lyrics file recorded in the computer's storage medium, to the start times of the corresponding speech segments. In this way, the display times in the lyrics file are automatically corrected, and the user need not adjust them manually.

A second embodiment is now described, again with reference to the method flowchart of FIG. 2. In this embodiment, the audio signal is likewise a song file and the textual description data likewise a lyrics file.

First, the loading module 110 loads the song file (audio signal) from the computer's hard disk into its memory (step 210). In this embodiment, the loading module 110 is also assumed to download, over a network from a lyrics server, the lyrics file corresponding to the loaded song file; this downloaded lyrics file contains no display times for the lyrics.

After the loading module 110 loads the audio signal (step 210), the vocal extraction module 120 extracts the vocal signal from the loaded audio signal (step 220), the segment analysis module 130 analyzes the extracted vocal signal to obtain the start time of each speech segment (step 230), and the word correspondence module 150 matches the lines in the textual description data, in sequence, with the speech segments of the vocal signal (step 250).

In this embodiment, because the lyrics file (textual description data) contains no display times for the lyrics (words), the word correspondence module 150 matches the lyrics, in the order in which they are recorded in the lyrics file, with the speech segments of the vocal signal in sequence, so that each line corresponds to a different segment.

After the word correspondence module 150 matches the lines in the textual description data with the speech segments (step 250), the time adjustment module 160 adjusts the display time of each line to the start time of its corresponding segment (step 260). In this embodiment, because the lyrics file loaded by the loading module 110 contains no display times, the lines recorded in the memory of the user's computer have no corresponding display times; the time adjustment module 160 therefore adds the start time of each line's corresponding speech segment to the computer's memory, setting each line's display time from none to the start time of its corresponding segment.

In both embodiments above, if the playback module 190 is included, it may play the song file (audio signal) (step 280) and, during playback, determine whether the playback time equals the display time (as adjusted by the time adjustment module 160) of any line of lyrics (words) recorded in memory. When the playback time equals the display time of a recorded line, the playback module 190 displays that line, thereby displaying, in synchronization, each word whose display time equals the time at which the corresponding speech segment of the audio signal is played (step 290). In this way, the display times in the lyrics file are automatically corrected, and the user need not adjust them manually.
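The playback module's check, display the line whose adjusted time has been reached by the playback position, can be sketched as a sorted-list lookup with the standard-library `bisect` module. This is an illustrative generalization (an exact-equality check on a sampled playback clock would often miss the instant), not the patent's stated mechanism.

```python
import bisect

def current_caption(entries, position):
    """Return the caption whose display time most recently passed
    `position` (in seconds), or None before the first caption.

    `entries` is a list of (seconds, text) sorted by time, as produced
    after the adjustment step.
    """
    times = [t for t, _ in entries]
    i = bisect.bisect_right(times, position) - 1
    return entries[i][1] if i >= 0 else None

timed = [(12.0, "first"), (30.5, "second"), (75.3, "third")]
print(current_caption(timed, 31.0))   # → 'second'
```

A player would call this on each refresh tick with the current playback position and redraw only when the returned caption changes.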

In summary, the present invention differs from the prior art in that it analyzes the vocal signal in the audio signal to obtain the start time of each speech segment, matches each word in the corresponding textual description data with a speech segment in sequence, and then adjusts the display time of each word to the start time of its corresponding segment. This technical means solves the prior-art problem that adjusting the display times of words is inefficient, thereby achieving the technical effect of automatically adjusting each word's display time to the time at which the corresponding vocal appears.

Furthermore, the method of the present invention for synchronizing a vocal signal with its textual description data may be implemented in hardware, in software, or in a combination of hardware and software; it may also be implemented in a centralized fashion in a single computer system, or in a distributed fashion in which different elements are spread across several interconnected computer systems.

Although embodiments of the present invention are disclosed above, they are not intended to directly limit the scope of patent protection of the invention. Any modifications of form or detail made to the implementation of the invention by a person of ordinary skill in the art, without departing from the spirit and scope disclosed herein, fall within the scope of patent protection of the invention, which is to be determined by the appended claims.

110‧‧‧loading module

120‧‧‧vocal extraction module

130‧‧‧segment analysis module

150‧‧‧word correspondence module

160‧‧‧time adjustment module

190‧‧‧playback module

Step 210‧‧‧Load the audio signal; the audio signal corresponds to textual description data containing a plurality of words

Step 220‧‧‧Extract the vocal signal from the audio signal; the vocal signal contains a plurality of speech segments

Step 230‧‧‧Analyze the vocal signal to obtain the start time of each speech segment

Step 250‧‧‧Match each word in the textual description data, in sequence, with a speech segment

Step 260‧‧‧Adjust the display time of each word to the start time of the corresponding speech segment

Step 280‧‧‧Play the audio signal

Step 290‧‧‧Synchronously display each word whose display time equals the playback time of the corresponding speech segment of the audio signal

FIG. 1 is an architecture diagram of the system for synchronizing a vocal signal with its textual description data according to the present invention.

FIG. 2 is a flowchart of the method for synchronizing a vocal signal with its textual description data according to the present invention.


Claims (10)

1. A method for synchronizing a vocal signal with its textual description data, the method comprising at least the following steps: loading an audio signal, the audio signal corresponding to textual description data, the textual description data containing a plurality of phrases; extracting a vocal signal from the audio signal, the vocal signal containing a plurality of speech segments; analyzing the vocal signal to obtain the start time of each speech segment; mapping each phrase of the textual description data to a speech segment in sequence; and setting the display time of each phrase to the start time of its corresponding speech segment.

2. The method of claim 1, further comprising, after the step of setting the display time of each phrase to the start time of its corresponding speech segment, the steps of playing the audio signal and synchronously displaying each phrase whose display time matches the playback time of the corresponding speech segment of the audio signal.

3. The method of claim 1, wherein the step of extracting the vocal signal from the audio signal attenuates specific frequencies of the audio signal, or first inversely superimposes the left and right channels of the audio signal to produce an intermediate signal and then inversely superimposes the audio signal and the intermediate signal.
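The channel-inversion extraction recited in claim 3 relies on vocals typically being mixed to the center of a stereo recording: subtracting the right channel from the left cancels center-panned content, leaving an instrumental-only intermediate signal, and inversely superimposing the mix with that intermediate signal in turn emphasizes the vocal. The per-sample sketch below is illustrative only; it is exact only in the contrived case where the instruments sit entirely in one channel, and real mixes require frequency-domain processing to separate cleanly:

```python
def extract_vocal_estimate(left, right):
    """Sketch of the two inverse superpositions of claim 3.
    Step 1: intermediate = L - R  (center vocal cancels, side content remains)
    Step 2: inversely superimpose the mono mix with the intermediate signal.
    Exact only when instruments occupy a single channel; illustrative only."""
    intermediate = [l - r for l, r in zip(left, right)]      # vocal-free side signal
    mix = [(l + r) / 2.0 for l, r in zip(left, right)]       # mono downmix of the audio signal
    return [m - s / 2.0 for m, s in zip(mix, intermediate)]  # vocal estimate
```

For instance, with a center vocal plus an instrument panned hard left, the instrument survives in the intermediate signal and the second superposition recovers the vocal exactly.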
4. The method of claim 1, wherein the step of mapping each phrase of the textual description data to a speech segment in sequence further comprises ordering the phrases by their corresponding display times and then mapping the phrases to the speech segments in that order.

5. The method of claim 4, further comprising, before the step of ordering the phrases by their corresponding display times, splitting any phrase that corresponds to a plurality of display times into as many phrases as there are display times, one phrase per display time.
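Claims 4 and 5 together amount to flattening each phrase record into one entry per display time and then sorting the entries by time. A minimal sketch, assuming a hypothetical input shape of `(phrase, list_of_display_times)` pairs:

```python
def order_phrases(phrase_times):
    """phrase_times: list of (phrase, [display_time, ...]) pairs.
    Split any phrase with several display times into one entry per time
    (claim 5), then order all entries by display time (claim 4)."""
    flat = [(t, phrase) for phrase, times in phrase_times for t in times]
    flat.sort(key=lambda entry: entry[0])   # ascending display time
    return [phrase for _, phrase in flat]
```

A repeated chorus line, for example, becomes one entry per occurrence and interleaves correctly with the verses once sorted.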
6. A system for synchronizing a vocal signal with its textual description data, the system comprising at least: a loading module for loading an audio signal, the audio signal corresponding to textual description data, the textual description data containing a plurality of phrases; a vocal extraction module for extracting a vocal signal from the audio signal, the vocal signal containing a plurality of speech segments; a segment analysis module for analyzing the vocal signal to obtain the start time of each speech segment; a phrase mapping module for mapping each speech segment to a phrase of the textual description data; and a time adjustment module for setting the display time of each phrase to the start time of its corresponding speech segment.

7. The system of claim 6, further comprising a playback module for playing the audio signal and synchronously displaying each phrase whose display time matches the playback time of the corresponding speech segment of the audio signal.

8. The system of claim 6, wherein the vocal extraction module extracts the vocal signal from the audio signal by attenuating specific frequencies of the audio signal, or by inversely superimposing the left and right channels of the audio signal to produce an intermediate signal and then inversely superimposing the audio signal and the intermediate signal.

9. The system of claim 6, wherein the phrase mapping module further orders the phrases by their corresponding display times and then maps the phrases to the speech segments in that order.

10. The system of claim 9, wherein the phrase mapping module further splits any phrase that corresponds to a plurality of display times into as many phrases as there are display times, one phrase per display time, so that the phrases can be ordered by their corresponding display times.
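The phrase mapping and time adjustment modules recited above cooperate in a single pass: pair the ordered phrases with the ordered segment start times and overwrite each phrase's display time. A minimal sketch, assuming both lists are already prepared and using a hypothetical record shape:

```python
def adjust_display_times(phrases, segment_starts):
    """Pair each phrase with a speech segment in order (phrase mapping
    module) and set its display time to that segment's start time (time
    adjustment module). Phrases beyond the number of detected segments
    are simply dropped in this sketch."""
    return [{"text": p, "display_time": t} for p, t in zip(phrases, segment_starts)]
```

A playback module would then show each record when the playhead reaches its `display_time`, which is exactly the synchronized display of steps 280 and 290.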
TW101136461A 2012-10-03 2012-10-03 System for adjusting display time of expressions based on analyzing voice signal and method thereof TW201415884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW101136461A TW201415884A (en) 2012-10-03 2012-10-03 System for adjusting display time of expressions based on analyzing voice signal and method thereof


Publications (1)

Publication Number Publication Date
TW201415884A true TW201415884A (en) 2014-04-16

Family

ID=55182134


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111540374A (en) * 2020-04-17 2020-08-14 杭州网易云音乐科技有限公司 Method and device for extracting accompaniment and voice, and method and device for generating word-by-word lyrics
CN112580340A (en) * 2020-12-30 2021-03-30 网易(杭州)网络有限公司 Word-by-word lyric generating method and device, storage medium and electronic equipment

