TW201703029A - Method and device for recognizing stuttered speech and computer program product - Google Patents


Info

Publication number
TW201703029A
TW201703029A (application TW104121880A)
Authority
TW
Taiwan
Prior art keywords
syllable
signal
characteristic
syllables
volume
Prior art date
Application number
TW104121880A
Other languages
Chinese (zh)
Other versions
TWI585757B (en)
Inventor
葉品忻
楊淑蘭
楊智傑
謝孟達
Original Assignee
國立屏東大學
Priority date
Filing date
Publication date
Application filed by 國立屏東大學 filed Critical 國立屏東大學
Priority to TW104121880A priority Critical patent/TWI585757B/en
Publication of TW201703029A publication Critical patent/TW201703029A/en
Application granted granted Critical
Publication of TWI585757B publication Critical patent/TWI585757B/en

Abstract

A method, a device, and a computer program product for recognizing stuttered speech are provided. The method includes: dividing an audio signal into multiple syllables; performing a dynamic time warping algorithm, according to several feature values, on a first syllable and a second syllable that are adjacent to each other among the syllables, in order to determine whether the first syllable is similar to the second syllable; and, if the first syllable is similar to the second syllable, determining that a stuttering phenomenon exists in the audio signal.

Description

Stutter detection method and device, and computer program product

The present invention relates to a method, a device, and a computer program product capable of automatically detecting stuttered speech.

In the field of speech-language therapy, assessing whether a person stutters is labor-intensive, and such assessments rely heavily on the listener's subjective judgment; it is difficult to obtain consistent, objective criteria across different raters. If a computer could judge stuttering automatically, the assessment would gain an objective standard and save manpower. One conventional approach records a person's speech to obtain an audio signal, divides the signal into multiple learning samples, and runs a machine learning algorithm; the resulting model is then used to judge whether a test audio signal exhibits stuttering. However, machine learning methods require collecting many speech samples, and insufficient samples reduce the accuracy of the judgment.

An embodiment of the present invention provides a stutter detection method. The method includes: obtaining an audio signal; computing a plurality of feature values for the audio signal to obtain a plurality of feature signals; dividing the audio signal into a plurality of syllables according to one of the feature signals; for an adjacent first syllable and second syllable, performing a dynamic time warping algorithm according to the feature signals to determine whether the first syllable is similar to the second syllable; and, if the first syllable is similar to the second syllable, determining that the audio signal exhibits stuttering.

In an embodiment, the feature signals are respectively computed according to the following equations (1)~(6), with (1)~(4) shown in their conventional forms:

Volume = Σ_{i=1..n} |s_i| ...(1)

ZCR = Σ_{i=2..n} |sgn[s_i] − sgn[s_{i-1}]| ...(2)

Entropy = −Σ_{k=1..N} p_k × log2(p_k) / log2(N), where p_k = s(f_k) / Σ_{j=1..N} s(f_j) ...(3)

HOD = Σ_{i=2..n} |s_i − s_{i-1}| ...(4)

VH = α×Volume + (1−α)×HOD ...(5)

VE = Volume×(1−entropy) ...(6)

where Volume is the volume feature signal, ZCR the zero-crossing-rate feature signal, Entropy the entropy feature signal, HOD the differential feature signal, VH the volume-differential feature signal, and VE the volume-entropy feature signal; s_i is the audio signal, n is the length of an audio frame, sgn[ ] is the sign function, s(f_k) is the amplitude of the k-th frequency of the audio signal in the frequency domain, N is the length of the audio signal in the frequency domain, and α is a constant.

In an embodiment, dividing the audio signal into syllables includes: determining whether the volume-differential feature signal exceeds a syllable threshold; and taking the portions of the volume-differential feature signal above the syllable threshold to determine the syllables.

In an embodiment, determining whether the first syllable and the second syllable are similar includes: for each feature signal, performing the dynamic time warping algorithm on the first syllable and the second syllable to obtain an error value, and determining whether that error value is less than an error threshold; and, if the number of feature signals whose error value is below the error threshold reaches a similarity threshold count, judging the first syllable similar to the second syllable. In an embodiment, the similarity threshold count is 4.

Embodiments of the present invention also provide a computer program product and a stutter detection device for performing the stutter detection method described above. With them, whether a person stutters can be judged automatically without collecting many learning samples.

To make the above features and advantages of the invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.

100‧‧‧stutter detection device

110‧‧‧processor

120‧‧‧memory

130‧‧‧sound capture module

140‧‧‧transmission module

210‧‧‧volume-differential feature signal

211‧‧‧syllable threshold

220, 230‧‧‧syllables

300‧‧‧matrix

310‧‧‧path

S401~S406‧‧‧steps

[FIG. 1] is a schematic diagram of a stutter detection device according to an embodiment.

[FIG. 2] is a schematic diagram of dividing an audio signal into a plurality of syllables according to an embodiment.

[FIG. 3] is a schematic diagram of the matrix in the dynamic time warping algorithm according to an embodiment.

[FIG. 4] is a flowchart of a stutter detection method according to an embodiment.

As used herein, terms such as "first" and "second" do not denote order or sequence; they merely distinguish elements or operations described with the same technical term.

FIG. 1 is a schematic diagram of a stutter detection device according to an embodiment. Referring to FIG. 1, the stutter detection device 100 includes a processor 110, a memory 120, a sound capture module 130, and a transmission module 140. The stutter detection device 100 can be implemented as any form of electronic device, such as a mobile phone, a tablet computer, a personal computer, or an embedded system.

The processor 110 is, for example, a central processing unit, a microprocessor, or any general-purpose processor capable of executing instructions. The memory 120 can be a random access memory, a flash memory, or the like; it stores a plurality of instructions that the processor 110 executes to carry out the stutter detection method. The sound capture module 130 is, for example, a microphone, and the transmission module 140 can be a circuit conforming to any suitable communication protocol, such as a Universal Serial Bus (USB) module or a Bluetooth module. The stutter detection device 100 may further include other modules, such as a display module or a power module; the invention is not limited in this respect. The stutter detection method is described in detail below.

First, a user speaks to the sound capture module 130, which transmits the captured audio signal to the processor 110. In this embodiment the processor 110 obtains the audio signal through the sound capture module 130, but in other embodiments it may obtain the audio signal from another device through the transmission module 140; the invention is not limited in this respect.

Next, the processor 110 computes a plurality of feature values for the audio signal to obtain a plurality of feature signals. In this embodiment there are six feature signals: the volume feature signal, the zero-crossing-rate feature signal, the entropy feature signal, the differential feature signal, the volume-differential feature signal, and the volume-entropy feature signal, computed by the following equations (1)~(6), with (1)~(4) shown in their conventional forms:

Volume = Σ_{i=1..n} |s_i| ...(1)

ZCR = Σ_{i=2..n} |sgn[s_i] − sgn[s_{i-1}]| ...(2)

Entropy = −Σ_{k=1..N} p_k × log2(p_k) / log2(N), where p_k = s(f_k) / Σ_{j=1..N} s(f_j) ...(3)

HOD = Σ_{i=2..n} |s_i − s_{i-1}| ...(4)

VH = α×Volume + (1−α)×HOD ...(5)

VE = Volume×(1−entropy) ...(6)

Specifically, Volume is the volume feature signal, and s_i is the amplitude of the audio signal at time point i. n is the length of an audio frame, for example 20 milliseconds, although other lengths may be used in other embodiments. In other words, the audio signal is cut into a plurality of frames; for each frame a value is computed according to equation (1), and the values over the whole signal constitute the volume feature signal.
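As an illustrative sketch (not part of the patent text), the per-frame volume can be computed as follows, assuming equation (1) sums the absolute amplitudes within each frame and that the frame length n is given in samples:

```python
# Hypothetical sketch: per-frame volume feature, assuming
# Volume = sum of |s_i| over each frame of n samples (equation (1)).
def volume_feature(signal, n=160):
    """Cut `signal` into frames of n samples; return one volume value per frame."""
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    return [sum(abs(s) for s in frame) for frame in frames]

if __name__ == "__main__":
    sig = [0.0, 0.5, -0.5, 1.0] * 10  # 40 samples -> 10 frames of 4
    print(volume_feature(sig, n=4))   # each frame sums to 2.0
```

The frame length default (160 samples, i.e. 20 ms at 8 kHz) is an assumption for illustration only.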

ZCR is the zero-crossing-rate feature signal, and sgn[ ] denotes the sign function: for example, sgn[x] is 1 when the variable x is positive, and 0 otherwise.
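A corresponding sketch of the zero-crossing-rate feature, using the sgn[ ] convention above (1 for positive samples, 0 otherwise), so each sign change between consecutive samples adds one; the frame length parameter is an assumption:

```python
# Hypothetical sketch: per-frame zero-crossing rate (equation (2)),
# with sgn[x] = 1 for x > 0 and 0 otherwise, as defined in the text.
def sgn(x):
    return 1 if x > 0 else 0

def zcr_feature(signal, n=160):
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    return [sum(abs(sgn(f[j]) - sgn(f[j - 1])) for j in range(1, len(f)))
            for f in frames]

if __name__ == "__main__":
    # First frame alternates sign (3 crossings); second frame is constant.
    print(zcr_feature([1, -1, 1, -1, 1, 1, 1, 1], n=4))  # [3, 0]
```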

Entropy is the entropy feature signal, s(f_k) is the amplitude of the k-th frequency of the audio signal in the frequency domain, and N is the length of the audio signal in the frequency domain. For example, for each frame the audio signal is first transformed from the time domain to the frequency domain, e.g., by a Fourier transform or a Fast Fourier Transform (FFT), to obtain the amplitude at each frequency (N values in total), and s(f_k) is the k-th of these amplitudes.
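The entropy computation can be sketched as follows. The DFT, the normalization of the amplitudes into a probability distribution, and the division by log2(N) so that the result lies in [0, 1] are assumptions consistent with, but not spelled out in, the text:

```python
import cmath
import math

# Hypothetical sketch: spectral entropy of one frame (equation (3)).
def dft_amplitudes(frame):
    """Amplitudes s(f_k) of a direct DFT of the frame (N = len(frame))."""
    N = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / N)
                    for t in range(N))) for k in range(N)]

def entropy_feature(frame):
    """Normalized spectral entropy in [0, 1]; assumes len(frame) > 1."""
    amps = dft_amplitudes(frame)
    total = sum(amps)
    if total == 0:
        return 0.0
    probs = [a / total for a in amps]
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return h / math.log2(len(amps))  # divide by log2(N) to normalize
```

A constant (pure-DC) frame concentrates all energy in one bin and yields entropy near 0, while an impulse spreads energy uniformly and yields entropy 1.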

HOD is the differential feature signal, VH is the volume-differential feature signal, and VE is the volume-entropy feature signal, where α is a constant between 0 and 1.
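A sketch of the remaining features. Taking HOD as the summed absolute first-order difference of a frame is an assumption (the text only names it a differential feature), as is the default α = 0.5:

```python
# Hypothetical sketch of equations (4)-(6); HOD's exact form and
# alpha = 0.5 are assumptions for illustration.
def hod_feature(frame):
    """Summed absolute first-order difference of a frame (equation (4))."""
    return sum(abs(frame[i] - frame[i - 1]) for i in range(1, len(frame)))

def vh_feature(volume, hod, alpha=0.5):
    # VH = alpha * Volume + (1 - alpha) * HOD   (equation (5))
    return alpha * volume + (1 - alpha) * hod

def ve_feature(volume, entropy):
    # VE = Volume * (1 - entropy)               (equation (6))
    return volume * (1 - entropy)
```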

After the six feature signals are obtained, the audio signal can be divided into a plurality of syllables according to at least one of them. Referring to FIG. 2, which is a schematic diagram of dividing an audio signal into a plurality of syllables according to an embodiment: taking the volume-differential feature signal 210 as an example, it is first determined whether the amplitude at each sample point of the signal 210 exceeds the syllable threshold 211, and the portions of the signal 210 above the threshold 211 then determine the syllable 220 and the syllable 230. In other embodiments, other feature signals may be used to divide the audio signal into syllables; the invention is not limited in this respect.
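The thresholding step can be sketched as follows; returning (start, end) index pairs for each above-threshold run is an illustrative choice, not a detail from the patent:

```python
# Hypothetical sketch: keep maximal runs of feature samples above the
# syllable threshold; each run is one syllable, as (start, end) indices
# (end exclusive).
def segment_syllables(feature, threshold):
    syllables, start = [], None
    for i, v in enumerate(feature):
        if v > threshold and start is None:
            start = i                      # run begins
        elif v <= threshold and start is not None:
            syllables.append((start, i))   # run ends
            start = None
    if start is not None:                  # signal ends inside a run
        syllables.append((start, len(feature)))
    return syllables

if __name__ == "__main__":
    print(segment_syllables([0, 5, 6, 0, 0, 7, 8, 9, 0], 1))  # [(1, 3), (5, 8)]
```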

If the user speaks Chinese, each syllable obtained above may correspond to one character. Next, it is determined whether any two adjacent syllables are similar; if two adjacent syllables are similar, the audio signal is judged to exhibit stuttering. For example, suppose the syllables include a first syllable and a second syllable adjacent to each other; note that the two syllables may differ in length. For each feature signal, the processor 110 performs dynamic time warping on the first syllable and the second syllable to obtain an error value. Referring to FIG. 3, suppose the first syllable contains a frames and the second syllable contains b frames, where a and b are positive integers; performing the dynamic time warping algorithm then produces an a-by-b matrix 300. Taking the volume feature signal as an example, each element of the matrix 300 represents the difference in volume between the corresponding frame of the first syllable and the corresponding frame of the second syllable. The dynamic time warping algorithm finds the path 310 through the matrix 300 with the smallest accumulated difference; this accumulated difference is the error value between the first syllable and the second syllable.

In this embodiment there are six feature signals, so each feature signal has its own matrix and error value, and each error value is compared with an error threshold. The number of feature signals whose error value is below the error threshold is then counted; if this number reaches a similarity threshold count, the first syllable is judged similar to the second syllable. The similarity threshold count is, for example, 4, although the invention is not limited to this. In other words, if four of the six feature signals are very close, the first syllable and the second syllable are similar, and stuttering occurs on these two syllables.
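The dynamic time warping error and the six-feature vote can be sketched as follows; the error threshold value passed in is an illustrative placeholder, not a value from the patent:

```python
# Hypothetical sketch: classic DTW over two per-frame feature sequences,
# plus the vote over the per-feature errors described above.
def dtw_distance(a, b):
    """Minimum accumulated |a_i - b_j| over a warping path (the matrix 300)."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[len(a)][len(b)]

def syllables_similar(errors, error_threshold, similar_count=4):
    """True if at least `similar_count` per-feature DTW errors are below threshold."""
    return sum(e < error_threshold for e in errors) >= similar_count

if __name__ == "__main__":
    # The warp absorbs the repeated frame, so these sequences match exactly.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```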

In the embodiment above, the number of feature signals is 6 and the similarity threshold count is 4. In other embodiments, more feature values may be used to produce more feature signals, and the similarity threshold count may be adjusted accordingly. Moreover, the stutter detection method described above can be applied to any language; the invention is not limited in this respect.

The invention also provides a computer program product; when a computer loads and executes it, the stutter detection method described above is carried out. For example, the computer program product can be loaded into the memory 120 of FIG. 1 and executed by the processor 110. The invention does not limit the programming language in which the computer program product is implemented.

The stutter detection method of the invention can also be expressed as the flowchart of FIG. 4. In step S401, an audio signal is obtained. In step S402, a plurality of feature values are computed for the audio signal to obtain a plurality of feature signals. In step S403, the audio signal is divided into a plurality of syllables according to one of the feature signals. In step S404, for an adjacent first syllable and second syllable among the syllables, the dynamic time warping algorithm is performed according to the feature signals to determine whether the first syllable is similar to the second syllable. If two adjacent syllables are similar, it is determined in step S405 that the audio signal exhibits stuttering; if no adjacent syllables are similar, it is determined in step S406 that it does not. The steps of FIG. 4 have been described in detail above and are not repeated here. Note that each step of FIG. 4 can be implemented as program code or as circuitry; the invention is not limited in this respect. Furthermore, the method of FIG. 4 can be used together with the above embodiments or on its own; in other words, other steps may be inserted between the steps of FIG. 4.

Although the invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Anyone with ordinary skill in the art may make modifications and refinements without departing from the spirit and scope of the invention; the scope of protection of the invention is therefore defined by the appended claims.


Claims (7)

1. A stutter detection method, comprising: obtaining an audio signal; computing a plurality of feature values for the audio signal to obtain a plurality of feature signals; dividing the audio signal into a plurality of syllables according to one of the feature signals; for an adjacent first syllable and second syllable among the syllables, performing a dynamic time warping algorithm according to the feature signals to determine whether the first syllable is similar to the second syllable; and, if the first syllable is similar to the second syllable, determining that the audio signal exhibits a stuttering phenomenon.

2. The stutter detection method of claim 1, wherein the feature signals are respectively computed according to the following equations (1)~(6), including:

VH = α×Volume + (1−α)×HOD ...(5)

VE = Volume×(1−entropy) ...(6)

wherein Volume is a volume feature signal, ZCR is a zero-crossing-rate feature signal, Entropy is an entropy feature signal, HOD is a differential feature signal, VH is a volume-differential feature signal, VE is a volume-entropy feature signal, s_i is the amplitude of the audio signal at time point i, n is the length of an audio frame, sgn[ ] denotes a sign function, s(f_k) is the amplitude of the k-th frequency of the audio signal in the frequency domain, N is the length of the audio signal in the frequency domain, and α is a constant.

3. The stutter detection method of claim 2, wherein dividing the audio signal into the syllables according to one of the feature signals comprises: determining whether the volume-differential feature signal is greater than a syllable threshold; and taking the portions of the volume-differential feature signal greater than the syllable threshold to determine the syllables.

4. The stutter detection method of claim 1, wherein determining whether the first syllable is similar to the second syllable comprises: for each of the feature signals, performing the dynamic time warping algorithm on the first syllable and the second syllable to obtain an error value, and determining whether the error value is less than an error threshold; and, if the number of the feature signals whose error value is less than the error threshold is greater than a similarity threshold count, determining that the first syllable is similar to the second syllable.

5. The stutter detection method of claim 4, wherein the number of the feature signals is 6 and the similarity threshold count is 4.

6. A computer program product which, when loaded into a computer and executed, carries out the stutter detection method of any one of claims 1 to 5.
7. A stutter detection device, comprising: a memory storing a plurality of instructions; and a processor configured to execute the instructions to perform the following steps: obtaining an audio signal; computing a plurality of feature values for the audio signal to obtain a plurality of feature signals; dividing the audio signal into a plurality of syllables according to one of the feature signals; for an adjacent first syllable and second syllable among the syllables, performing a dynamic time warping algorithm according to the feature signals to determine whether the first syllable is similar to the second syllable; and, if the first syllable is similar to the second syllable, determining that the audio signal exhibits a stuttering phenomenon.
TW104121880A 2015-07-06 2015-07-06 Method and device for recognizing stuttered speech and computer program product TWI585757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW104121880A TWI585757B (en) 2015-07-06 2015-07-06 Method and device for recognizing stuttered speech and computer program product


Publications (2)

Publication Number Publication Date
TW201703029A true TW201703029A (en) 2017-01-16
TWI585757B TWI585757B (en) 2017-06-01


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1565906A1 (en) * 2002-11-22 2005-08-24 Koninklijke Philips Electronics N.V. Speech recognition device and method
CN201076475Y (en) * 2007-09-30 2008-06-25 四川微迪数字技术有限公司 Stuttering screening device

Also Published As

Publication number Publication date
TWI585757B (en) 2017-06-01

Similar Documents

Publication Publication Date Title
US10964339B2 (en) Low-complexity voice activity detection
CN107731223B (en) Voice activity detection method, related device and equipment
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
JP2011107715A5 (en)
WO2020037555A1 (en) Method, device, apparatus, and system for evaluating microphone array consistency
WO2017063516A1 (en) Method of determining noise signal, and method and device for audio noise removal
CN107509155B (en) Array microphone correction method, device, equipment and storage medium
WO2021196475A1 (en) Intelligent language fluency recognition method and apparatus, computer device, and storage medium
US10224029B2 (en) Method for using voiceprint identification to operate voice recognition and electronic device thereof
CN111868823A (en) Sound source separation method, device and equipment
CN112509568A (en) Voice awakening method and device
CN109074814B (en) Noise detection method and terminal equipment
CN110570871A (en) TristouNet-based voiceprint recognition method, device and equipment
US11915718B2 (en) Position detection method, apparatus, electronic device and computer readable storage medium
CN106847299B (en) Time delay estimation method and device
US10818298B2 (en) Audio processing
TWI585757B (en) Method and device for recognizing stuttered speech and computer program product
TWI585756B (en) Method and device for recognizing stuttered speech and computer program product
CN112216285B (en) Multi-user session detection method, system, mobile terminal and storage medium
CN114420165A (en) Audio circuit testing method, device, equipment and storage medium
CN111354365B (en) Pure voice data sampling rate identification method, device and system
GB2580821A (en) Analysing speech signals
TWI752551B (en) Method, device and computer program product for detecting cluttering
TWI559300B (en) Time domain based voice event detection method and related device
CN113129904B (en) Voiceprint determination method, apparatus, system, device and storage medium