TWI584139B - Speech recognition system and its information processing method applied to non-real-time signal source - Google Patents

Speech recognition system and its information processing method applied to non-real-time signal source

Info

Publication number
TWI584139B
TWI584139B TW105129201A
Authority
TW
Taiwan
Prior art keywords
information
audio
generate
audio feature
preset
Prior art date
Application number
TW105129201A
Other languages
Chinese (zh)
Other versions
TW201810081A (en)
Inventor
Yu-Hong Chen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed
Priority to TW105129201A priority Critical patent/TWI584139B/en
Application granted granted Critical
Publication of TWI584139B publication Critical patent/TWI584139B/en
Publication of TW201810081A publication Critical patent/TW201810081A/en

Links

Description

Voiceprint recognition system applied to a non-real-time signal source and information processing method thereof

The present invention relates to a voiceprint recognition system, and more particularly to a voiceprint recognition system applied to a non-real-time signal source and an information processing method thereof.

With the popularity of mobile devices and the rapid development of the Internet, nearly everyone carries a mobile device that can go online, allowing people to connect to the network and watch non-real-time signal sources while commuting or resting. A non-real-time signal source refers to a media signal source such as a music video, drama, news program, or animation stored on a media streaming website (for example YouTube, Tudou, or Youku), so that people are not bored while commuting or taking a break. Many content providers therefore place their own media signal sources on such websites for viewing, in order to evaluate whether their content is well received.

One approach in the prior art is to judge whether a media signal source is popular from its click-through rate. However, viewers sometimes open a media signal source and immediately close it, open it and close it upon finding it is not to their liking, or open it but never actually watch it because of some interruption. The click-through rate of the media signal source is therefore inaccurate, and a provider who evaluates popularity from such inaccurate click counts may reach an incorrect assessment. There is thus room for improvement in the art.

In view of the above problems in the prior art, the present invention provides a voiceprint recognition system applied to a non-real-time signal source and an information processing method thereof. A user's mobile device transmits a set of current audio feature information of the non-real-time signal source being watched to a cloud server for comparison, producing comparison result information from which the provider can learn which non-real-time signal sources users watch most often, thereby achieving better voiceprint recognition.

The technical means adopted to achieve the above object is a voiceprint recognition system applied to a non-real-time signal source, comprising: a remote cloud server pre-built with plural sets of preset audio feature information; and a mobile device connected to the cloud server via a network and on which an application is installed. The mobile device processes a non-real-time signal source to obtain current audio information and executes an audio feature computation procedure on the current audio information to generate a set of current audio feature information. The mobile device transmits the set of current audio feature information to the cloud server over the network, where it is compared against the sets of preset audio feature information to generate a set of comparison result information for reference.

With the above construction, the user installs the application on the mobile device to watch the non-real-time signal source. While the user is watching, the application processes the non-real-time signal source to obtain the current audio information and executes the audio feature computation procedure on it to generate the set of current audio feature information, which is transmitted to the cloud server and compared against the sets of preset audio feature information to generate the set of comparison result information. The provider can then refer to the non-real-time signal sources that users watch most often, thereby achieving better voiceprint recognition.
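As an illustration of this exchange, the following is a minimal sketch of how the mobile device might upload the set of current audio feature information to the cloud server. The HTTP transport, the endpoint URL, and the JSON payload layout are assumptions made for illustration only; the patent does not specify how the data is carried over the network.

```python
# Hypothetical client-side upload of the current audio feature information.
# The endpoint URL and payload format are illustrative assumptions.
import requests

def send_fingerprint(current_bits, server_url="https://cloud-server.example/compare"):
    """Send the fingerprint bits to the cloud server and return its comparison result."""
    payload = {"fingerprint": [int(b) for b in current_bits]}
    response = requests.post(server_url, json=payload, timeout=10)
    response.raise_for_status()
    # Expected to contain the comparison result information, e.g. the best-matching
    # non-real-time signal source and its error rate (assumed response format).
    return response.json()
```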

Another technical means adopted to achieve the above object is an information processing method applied to a non-real-time signal source, comprising a user-side mode and a remote mode. The user-side mode comprises the following steps: obtaining current audio information of a non-real-time signal source; executing an audio feature computation procedure on the current audio information to generate a set of current audio feature information; and transmitting the set of current audio feature information to a remote end for comparison against the audio features preset at the remote end. The remote mode comprises the following steps: obtaining preset audio information of a plurality of non-real-time signal sources; executing the audio feature computation procedure on each item of preset audio information to generate a corresponding set of preset audio feature information; and receiving the set of current audio feature information and comparing it against the sets of preset audio feature information to generate a set of comparison result information for reference.

With the above method, the user-side mode is executed to transmit the set of current audio feature information to the remote end, and the remote mode is executed to compare the received current audio feature information against the sets of preset audio feature information and generate the set of comparison result information. The provider can then refer to the non-real-time signal sources that users watch most often, thereby achieving better voiceprint recognition.

Referring to FIG. 1, a preferred embodiment of the voiceprint recognition system of the present invention applied to a non-real-time signal source includes a mobile device 10 and a remote cloud server 20, the mobile device 10 being connected to the cloud server 20 through a network. In this embodiment, the mobile device 10 may be a smartphone or a tablet computer, and the cloud server 20 may be a computer.

The mobile device 10 has a microprocessor 11, a first communication unit 12, a touch display 13, and a first memory unit 14; the microprocessor 11 is connected to the first communication unit 12, the touch display 13, and the first memory unit 14. The microprocessor 11 executes an application provided by the content provider, the first communication unit 12 connects the mobile device 10 to the network, and the touch display 13 allows the user to operate the application and connect to the network to watch one or more non-real-time signal sources stored on a media streaming website by the provider. The first memory unit 14 stores information and programs related to the application.

When the user watches the non-real-time signal source on the mobile device 10, the application operated through the touch display 13 processes the watched non-real-time signal source at a specific sampling rate and with specific audio parameters to obtain current audio information of that source. The microprocessor 11 executes an audio feature computation procedure on the current audio information to generate a set of current audio feature information and transmits it through the first communication unit 12 over the network to the cloud server 20. In this embodiment, the current audio information may be a segment of the audio of the non-real-time signal source or its complete audio.

When the microprocessor 11 executes the audio feature computation procedure, it first performs audio down-sampling on the current audio information and correspondingly generates a plurality of frames, in which adjacent frames contain overlapping information.
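A minimal sketch of this down-sampling and framing step is shown below in Python. The target sampling rate, frame length, and hop size are illustrative assumptions; the description only requires that the audio be down-sampled and that adjacent frames overlap.

```python
import numpy as np
from scipy.signal import resample_poly

def frame_audio(samples, orig_rate, target_rate=8000, frame_len=2048, hop=1024):
    """Down-sample the audio and split it into overlapping frames.

    With hop < frame_len, consecutive frames share frame_len - hop samples,
    matching the requirement that adjacent frames contain overlapping information.
    """
    # Audio down-sampling processing (rates are assumed values)
    downsampled = resample_poly(samples, target_rate, orig_rate)
    # Split into overlapping frames
    frames = [downsampled[start:start + frame_len]
              for start in range(0, len(downsampled) - frame_len + 1, hop)]
    return np.array(frames)
```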

The frames are then processed to individually generate corresponding audio frequency information; in this embodiment, each frame is frequency-converted by a Fourier transform to produce its corresponding audio frequency information.

The audio frequency information is then filtered to individually produce a plurality of corresponding filter values; in this embodiment, the audio frequency information is filtered by Mel filters of different frequency bands, so that each item of audio frequency information produces a filter value for each Mel filter band.
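The following sketch covers the two steps above: each frame is converted to the frequency domain with a Fourier transform, and the resulting spectrum is reduced to one filter value per Mel band. The number of Mel bands and the use of a magnitude spectrum are assumptions made for illustration.

```python
import numpy as np

def mel_filterbank(n_bands, n_fft, sample_rate):
    """Build triangular Mel filters spanning 0 Hz to the Nyquist frequency."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_bands + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    filters = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(1, n_bands + 1):
        left, center, right = bins[b - 1], bins[b], bins[b + 1]
        for k in range(left, center):
            filters[b - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            filters[b - 1, k] = (right - k) / max(right - center, 1)
    return filters

def frame_to_filter_values(frame, filters):
    """Fourier-transform one frame and produce one filter value per Mel band."""
    spectrum = np.abs(np.fft.rfft(frame))   # audio frequency information of the frame
    return filters @ spectrum               # filter values, one per Mel band
```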

The filter values of each item of audio frequency information are then arranged, in order, into a corresponding matrix, so that every item of audio frequency information corresponds to one matrix.

The microprocessor 11 convolves a mask with the matrix of each item of audio frequency information to produce a corresponding convolution value, and then binarizes each convolution value in turn, noting a convolution value greater than zero as "1" and a convolution value less than or equal to zero as "0". The annotated convolution values are arranged in order to produce the set of current audio feature information.
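A sketch of the convolution and binarization step follows. The description yields one convolution value per matrix, so the mask is assumed here to have the same shape as each matrix, in which case the convolution collapses to a single weighted sum; the actual mask coefficients are not specified in the patent and must be supplied.

```python
import numpy as np

def fingerprint_bits(frame_matrices, mask):
    """Convolve the mask with each frame's matrix and binarize the result.

    Each matrix yields one convolution value; values > 0 are noted as 1,
    values <= 0 as 0, and the bits are concatenated in order to form the
    set of audio feature information.
    """
    bits = []
    for matrix in frame_matrices:
        conv_value = float(np.sum(matrix * mask))   # single convolution value
        bits.append(1 if conv_value > 0 else 0)     # binarization at zero
    return np.array(bits, dtype=np.uint8)
```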

The cloud server 20 has a processor 21, a second communication unit 22, and a second memory unit 23; the processor 21 is connected to the second communication unit 22 and the second memory unit 23. The second communication unit 22 connects the cloud server 20 to the network and to the first communication unit 12 of the mobile device 10, and the second memory unit 23 is pre-built with plural sets of preset audio feature information.

The sets of preset audio feature information pre-built in the second memory unit 23 of the cloud server 20 are generated in advance from the non-real-time signal sources provided by the content provider, each set being stored in the second memory unit 23. In this embodiment, every non-real-time signal source provided by the provider corresponds to one set of preset audio feature information.

The processor 21 processes each non-real-time signal source provided by the content provider according to a specific audio specification to obtain corresponding preset audio information, and executes the audio feature computation procedure on that preset audio information to generate the corresponding preset audio feature information. In this embodiment, each item of preset audio information is the complete audio of the corresponding non-real-time signal source, and the audio feature computation procedure executed by the processor 21 of the cloud server 20 is the same as the one executed by the microprocessor 11 of the mobile device 10.

When executing the audio feature computation procedure, the processor 21 of the cloud server 20 performs audio down-sampling on each item of preset audio information and generates the corresponding frames, in which adjacent frames contain overlapping information; frequency-converts the frames by the Fourier transform to individually generate corresponding audio frequency information; filters the audio frequency information of each item of preset audio information with the Mel filters of different frequency bands to produce the corresponding plurality of filter values; arranges the filter values of each item of preset audio information, in order, into corresponding matrices; convolves the mask with each matrix to produce the corresponding convolution values; binarizes each convolution value in turn, noting a convolution value greater than zero as "1" and a convolution value less than or equal to zero as "0"; and arranges the annotated convolution values in order to generate the respective sets of preset audio feature information.

The processor 21 receives the current audio feature information transmitted by the mobile device 10 and compares it, one set at a time, against the sets of preset audio feature information to produce a plurality of error rates, where a lower error rate means the current audio feature information is closer to the preset audio feature information and a higher error rate means it is more different. The processor 21 generates a set of comparison result information according to the lowest error rate, from which the provider can learn which non-real-time signal sources users watch most often.
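A sketch of this comparison step is given below: the current fingerprint is matched against every preset fingerprint, a bit error rate is computed for each, and the preset with the lowest error rate is reported. Sliding the (typically shorter) current fingerprint over every offset of a preset fingerprint is an assumption made here, since the current audio information may be only a segment of the source.

```python
import numpy as np

def bit_error_rate(current_bits, preset_bits):
    """Lowest fraction of differing bits over all alignments of the two fingerprints."""
    n = len(current_bits)
    best = 1.0
    for offset in range(len(preset_bits) - n + 1):
        window = preset_bits[offset:offset + n]
        best = min(best, float(np.mean(current_bits != window)))
    return best

def best_match(current_bits, preset_fingerprints):
    """Return the id of the preset fingerprint with the lowest error rate, plus all rates."""
    rates = {source_id: bit_error_rate(current_bits, bits)
             for source_id, bits in preset_fingerprints.items()}
    best_id = min(rates, key=rates.get)
    return best_id, rates
```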

In this embodiment, the second communication unit 22 of the cloud server 20 transmits the set of comparison result information over the network to the first communication unit 12 of the mobile device 10, and the microprocessor 11 of the mobile device 10 sends it to the touch display 13 for display, so that the user can check whether the watched non-real-time signal source matches the non-real-time signal source identified by the cloud server 20 and thereby confirm the click-support result for the favorite non-real-time signal source.

From the above, an information processing method applied to a non-real-time signal source can further be summarized, comprising a user-side mode and a remote mode, the user-side mode being executed by the mobile device 10 and the remote mode by the cloud server 20. The user-side mode comprises the following steps: obtaining current audio information of a non-real-time signal source (S11), where in this embodiment the mobile device 10 processes the non-real-time signal source at a specific sampling rate and with specific audio parameters to obtain the current audio information, which may be a segment of the source's audio or its complete audio; processing the current audio information with an audio feature computation procedure to generate a set of current audio feature information (S12); and transmitting the set of current audio feature information to the cloud server 20 (S13).

The remote mode comprises the following steps: obtaining preset audio information of a plurality of non-real-time signal sources (S21), where in this embodiment the non-real-time signal sources are provided in advance by the content provider and each item of preset audio information is the complete audio of the corresponding source; executing the audio feature computation procedure on the preset audio information to generate a corresponding set of preset audio feature information for each source (S22); and receiving the set of current audio feature information transmitted by the mobile device 10 and comparing it against the sets of preset audio feature information to generate a set of comparison result information for reference (S23). In this embodiment, the received current audio feature information is compared, one set at a time, against the sets of preset audio feature information to produce a plurality of error rates; a lower error rate means the current audio feature information is closer to the preset audio feature information, a higher error rate means it is more different, and the cloud server 20 generates the set of comparison result information according to the lowest error rate.

In this embodiment, the audio feature computation procedure executed by the mobile device 10 and by the cloud server 20 comprises the following steps: processing the audio information to obtain a plurality of frames (S31), where the audio information is down-sampled to obtain the frames and adjacent frames contain overlapping information; the procedure executed by the mobile device 10 is the same as the one executed by the cloud server 20, except that the mobile device 10 applies it to the current audio information while the cloud server 20 applies it to each item of preset audio information, each of which yields its own plurality of frames; processing each frame to individually generate corresponding audio frequency information (S32), in this embodiment by frequency conversion with a Fourier transform; filtering each item of audio frequency information to individually produce a plurality of corresponding filter values (S33), in this embodiment with Mel filters of different frequency bands so that each item of audio frequency information produces a filter value for each band; arranging the filter values of each item of audio frequency information, in order, into a corresponding matrix (S34), so that every item of audio frequency information corresponds to one matrix; convolving a mask with the matrix of each item of audio frequency information to produce a convolution value (S35); determining whether the produced convolution value is greater than zero (S36); if so, noting the convolution value as "1" (S37), each convolution value being binarized in this embodiment; if not, noting the convolution value as "0" (S38); and arranging all annotated convolution values in order to produce the corresponding audio feature information (S39), the mobile device 10 producing the corresponding current audio feature information and the cloud server 20 producing the corresponding plural sets of preset audio feature information.
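Chaining the helpers sketched in the preceding sections gives a compact end-to-end view of steps S31 to S39. The sampling parameters, the 32 Mel bands, the 8x4 matrix shape, and the example mask are all illustrative assumptions not stated in the patent.

```python
import numpy as np

def example_mask(rows=8, cols=4):
    """Illustrative mask: compares energy in the left and right halves of each matrix."""
    mask = np.ones((rows, cols))
    mask[:, cols // 2:] = -1.0
    return mask

def audio_feature_program(samples, orig_rate, mask):
    """End-to-end sketch of S31-S39, reusing frame_audio, mel_filterbank,
    frame_to_filter_values and fingerprint_bits from the earlier sketches."""
    frames = frame_audio(samples, orig_rate)                            # S31
    filters = mel_filterbank(n_bands=32, n_fft=frames.shape[1], sample_rate=8000)
    matrices = [frame_to_filter_values(f, filters).reshape(8, 4)        # S32-S34
                for f in frames]
    return fingerprint_bits(matrices, mask)                             # S35-S39
```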

As described above, the mobile device 10 obtains the current audio information of the watched non-real-time signal source, processes it with the audio feature computation procedure to generate the corresponding current audio feature information, and transmits it to the cloud server 20 for comparison against the preset audio feature information to generate the corresponding comparison result information. The provider can refer to the non-real-time signal sources users watch most often, and the user can check the click-support result for a favorite non-real-time signal source, thereby achieving better voiceprint recognition.

10‧‧‧Mobile device
11‧‧‧Microprocessor
12‧‧‧First communication unit
13‧‧‧Touch display
14‧‧‧First memory unit
20‧‧‧Cloud server
21‧‧‧Processor
22‧‧‧Second communication unit
23‧‧‧Second memory unit

FIG. 1 is a block diagram of the system architecture of a preferred embodiment of the present invention. FIG. 2 is a flowchart of the user-side mode of the preferred embodiment. FIG. 3 is a flowchart of the remote mode of the preferred embodiment. FIG. 4 is a flowchart of the audio feature computation procedure of the preferred embodiment.


Claims (7)

1. A voiceprint recognition system applied to a non-real-time signal source, comprising: a remote cloud server pre-built with plural sets of preset audio feature information; and a mobile device connected to the cloud server via a network, on which an application is installed; wherein the mobile device processes a non-real-time signal source to obtain current audio information and executes an audio feature computation procedure on the current audio information to generate a set of current audio feature information, and the mobile device transmits the set of current audio feature information over the network to the cloud server, where it is compared against the sets of preset audio feature information to generate a set of comparison result information for reference; wherein the audio feature computation procedure processes the corresponding audio information to generate a plurality of frames, processes each frame to individually generate corresponding audio frequency information, filters each item of audio frequency information to individually produce a plurality of corresponding filter values, arranges the filter values of each item of audio frequency information, in order, into a corresponding matrix, convolves a mask with each matrix to produce a convolution value, notes a convolution value greater than zero as 1 and a convolution value less than or equal to zero as 0, and produces the corresponding audio feature information from the annotated convolution values.

2. The voiceprint recognition system applied to a non-real-time signal source as claimed in claim 1, wherein the mobile device has a microprocessor, a first communication unit, a touch display, and a first memory unit; the microprocessor is connected to the first communication unit, the touch display, and the first memory unit; the microprocessor is used to process information, the first communication unit is used to connect to the network, the touch display is used to operate the application, and the first memory unit stores the audio feature computation procedure.

3. The voiceprint recognition system applied to a non-real-time signal source as claimed in claim 2, wherein the cloud server has a processor, a second communication unit, and a second memory unit; the processor is connected to the second communication unit and the second memory unit; the second communication unit is used to connect to the network and to the first communication unit of the mobile device, and the second memory unit stores the sets of preset audio feature information.

4. The voiceprint recognition system applied to a non-real-time signal source as claimed in claim 3, wherein the processor of the cloud server processes a plurality of non-real-time signal sources to obtain corresponding preset audio information for each, and executes the audio feature computation procedure on the preset audio information to generate the preset audio feature information.

5. The voiceprint recognition system applied to a non-real-time signal source as claimed in claim 4, wherein each frame is frequency-converted by a Fourier transform to produce the audio frequency information, and each item of audio frequency information is filtered by Mel filters of different frequency bands to individually produce the filter values.

6. An information processing method applied to a non-real-time signal source, in which a mobile device executes a user-side mode comprising the following steps: obtaining current audio information of a non-real-time signal source; executing an audio feature computation procedure on the current audio information to generate a set of current audio feature information, the audio feature computation procedure comprising the steps of: processing the current audio information to obtain a plurality of frames; processing each frame to individually generate corresponding audio frequency information; filtering each item of audio frequency information to individually produce a plurality of corresponding filter values; arranging the filter values of each item of audio frequency information, in order, into a corresponding matrix; convolving a mask with the matrix of each item of audio frequency information to produce a convolution value; determining whether the produced convolution value is greater than zero; if so, noting the convolution value as 1; if not, noting the convolution value as 0; and arranging all annotated convolution values in order to produce the set of current audio feature information; and transmitting the set of current audio feature information to a remote end for comparison against audio features preset at the remote end.

7. An information processing method applied to a non-real-time signal source, in which a cloud server executes a remote mode comprising the following steps: obtaining preset audio information of a plurality of non-real-time signal sources; executing the audio feature computation procedure on the preset audio information to generate a corresponding set of preset audio feature information for each, the audio feature computation procedure comprising the steps of: processing the preset audio information to obtain a respective plurality of frames; processing each frame to individually generate corresponding audio frequency information; filtering each item of audio frequency information to individually produce a plurality of corresponding filter values; arranging the filter values of each item of audio frequency information, in order, into a corresponding matrix; convolving a mask with the matrix of each item of audio frequency information to produce a convolution value; determining whether the produced convolution value is greater than zero; if so, noting the convolution value as 1; if not, noting the convolution value as 0; and arranging all annotated convolution values in order to generate plural sets of preset audio feature information; and receiving the set of current audio feature information and comparing it against the sets of preset audio feature information to generate a set of comparison result information for reference.
TW105129201A 2016-09-09 2016-09-09 Speech recognition system and its information processing method applied to non - real - time signal source TWI584139B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW105129201A TWI584139B (en) 2016-09-09 2016-09-09 Speech recognition system and its information processing method applied to non - real - time signal source

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW105129201A TWI584139B (en) 2016-09-09 2016-09-09 Speech recognition system and its information processing method applied to non - real - time signal source

Publications (2)

Publication Number Publication Date
TWI584139B true TWI584139B (en) 2017-05-21
TW201810081A TW201810081A (en) 2018-03-16

Family

ID=59367491

Family Applications (1)

Application Number Title Priority Date Filing Date
TW105129201A TWI584139B (en) 2016-09-09 2016-09-09 Speech recognition system and its information processing method applied to non - real - time signal source

Country Status (1)

Country Link
TW (1) TWI584139B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060190450A1 (en) * 2003-09-23 2006-08-24 Predixis Corporation Audio fingerprinting system and method
US20060254411A1 (en) * 2002-10-03 2006-11-16 Polyphonic Human Media Interface, S.L. Method and system for music recommendation
TW201109944A (en) * 2009-09-08 2011-03-16 Univ Nat Cheng Kung Music recommendation method and program product thereof
TW201537558A (en) * 2014-03-31 2015-10-01 Kung-Lan Wang Voiceprint data processing method, voiceprint data transaction method and system based on the same


Also Published As

Publication number Publication date
TW201810081A (en) 2018-03-16

Similar Documents

Publication Publication Date Title
US11853370B2 (en) Scene aware searching
US20170257650A1 (en) Systems and methods for live media content matching
US8699862B1 (en) Synchronized content playback related to content recognition
US20140114455A1 (en) Apparatus and method for scene change detection-based trigger for audio fingerprinting analysis
US20210185410A1 (en) Computing System with Channel-Change-Based Trigger Feature
US20190379929A1 (en) Media Content Identification on Mobile Devices
US10901685B2 (en) Systems and methods for composition of audio content from multi-object audio
US10516914B2 (en) Method and system for implementing automatic audio optimization for streaming services
US10582271B2 (en) On-demand captioning and translation
WO2020026009A1 (en) Video object recommendation method and apparatus, and device/terminal/server
WO2017113701A1 (en) Video highlight compilation method, apparatus, electronic device, server and system
US11606619B2 (en) Display device and display device control method
US20180332328A1 (en) Method and System for Implementing Conflict Resolution for Electronic Program Guides
US11076180B2 (en) Concurrent media stream aggregate fingerprinting
US20200186874A1 (en) Media player with integrated wireless video link capability & method and system for implementing video tuning and wireless video communication
US11134279B1 (en) Validation of media using fingerprinting
WO2016150274A1 (en) Song splicing algorithm and apparatus
TWI584139B (en) Speech recognition system and its information processing method applied to non - real - time signal source
WO2016189535A1 (en) A system and method to generate an interactive video on the fly
WO2017080216A1 (en) Method for recommending video through bluetooth technology, remote controller, and smart tv
US11133036B2 (en) System and method for associating audio feeds to corresponding video feeds
CN113392238A (en) Media file processing method and device, computer readable medium and electronic equipment
US11627373B2 (en) Systems and methods for providing survey data from a mobile device
US10491950B2 (en) Audio correlation for viewership determination
CN112019917A (en) Audio data extraction method, device, equipment and storage medium