TWI839132B - Voice activity detection system

Info

Publication number: TWI839132B
Application number: TW112106990A
Authority: TW
Inventors: 周青翰; 湯迪文; 黃柏穎
Original assignee: 奇景光電股份有限公司
Priority date: 2022-06-14
Filing date: 2023-02-24
Publication date: 2024-04-11

Abstract

A voice activity detection (VAD) system includes a voice frame detector that detects a voice frame during which a voice signal is not silent; a voice detector that detects the presence of human speech according to the voice frame; and a threshold update unit that updates an associated threshold for detecting the presence of human speech according to the result of human speech detection by the voice detector.

Description

Voice Activity Detection System

The present invention relates to voice activity detection (VAD), and more particularly to a voice activity detection system with an adaptive threshold.

Voice activity detection detects or recognizes human speech and is mainly used in speech processing. It can be used to activate voice-based applications. During non-speech periods, voice activity detection shuts down some processing to avoid unnecessary transmission, thereby reducing communication bandwidth and power consumption.

Conventional voice activity detection systems are error-prone and unreliable, especially in noisy environments. A novel mechanism is therefore needed to overcome the shortcomings of conventional voice activity detection systems.

In view of the above, one object of the embodiments of the present invention is to provide a voice activity detection system with an adaptive threshold that adapts to environmental changes and overcomes noise, thereby outputting reliable and correct detection results.

According to an embodiment of the present invention, the voice activity detection system includes a voice frame detector and a voice detector. The voice frame detector detects a voice frame during which a voice signal is not silent. The voice detector detects human speech according to the voice frame.

In one embodiment, the voice activity detection system further includes a threshold update unit that updates the associated threshold for detecting human speech according to the result of human speech detection by the voice detector.

100: Voice activity detection system

100A: Voice activity detection system

100B: Voice activity detection system

11: Transducer

12: Voice frame detector

13: Voice detector

14: Threshold update unit

15: Controller

16: Image sensor

17: Artificial intelligence engine

18: Speech recognition unit

19: Face recognition unit

200: Voice activity detection method

21: Convert sound into a voice signal

22: Detect a voice frame during which the voice signal is not silent

23: Is human speech detected?

24: Update threshold

TH_B: First threshold

TH_C: Second threshold

FIG. 1 shows a block diagram of a voice activity detection system according to an embodiment of the present invention.

FIG. 2 shows a flow chart of a voice activity detection method according to an embodiment of the present invention.

FIG. 3A illustrates waveforms of a voice signal and its endpoints.

FIG. 3B illustrates the volume and high-order difference of the voice signal.

FIG. 3C illustrates voice frames.

FIG. 4A illustrates waveforms of a voice signal and its endpoints.

FIG. 4B illustrates the autocorrelation and the corresponding first threshold TH_B.

FIG. 4C illustrates the normalized squared difference and the corresponding second threshold TH_C.

FIG. 5A illustrates the autocorrelation and how the updated first threshold is obtained.

FIG. 5B illustrates the normalized squared difference and how the updated second threshold is obtained.

FIG. 6 shows a block diagram of a voice activity detection system according to a first exemplary embodiment of the present invention.

FIG. 7 shows a block diagram of a voice activity detection system according to a second exemplary embodiment of the present invention.

FIG. 1 shows a block diagram of a voice activity detection (VAD) system 100 according to an embodiment of the present invention, and FIG. 2 shows a flow chart of a voice activity detection method 200 according to an embodiment of the present invention.

The voice activity detection system 100 of this embodiment may include a transducer 11, such as a microphone, for converting sound into an (electronic) voice signal (step 21).

The voice activity detection system 100 may include a voice frame detector 12 that receives the voice signal and detects a voice frame during which the voice signal is not silent (step 22). In one embodiment, the voice frame detector 12 uses end-point detection (EPD) to determine the endpoints of the voice signal, between which the voice signal is not silent. In one embodiment, a point at which the amplitude of the voice signal (which represents volume) exceeds a preset threshold is determined as an endpoint. In another embodiment, a point at which the high-order difference (HOD) of the voice signal (which represents slope) exceeds a preset threshold is determined as an endpoint. FIG. 3A illustrates waveforms of a voice signal and its endpoints, FIG. 3B illustrates the volume and high-order difference of the voice signal, and FIG. 3C illustrates voice frames.
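The volume-based variant of end-point detection can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `frame_volumes` and `find_endpoints`, the non-overlapping framing, and the use of summed absolute amplitude as the volume measure are all assumptions made for the example. The high-order-difference variant is not covered.

```python
def frame_volumes(signal, frame_size):
    """Sum of absolute amplitudes per non-overlapping frame (assumed volume measure)."""
    return [
        sum(abs(x) for x in signal[start:start + frame_size])
        for start in range(0, len(signal) - frame_size + 1, frame_size)
    ]


def find_endpoints(signal, frame_size, volume_threshold):
    """Return (start, end) sample-index pairs of runs whose volume exceeds the threshold."""
    segments = []
    run_start = None
    for index, volume in enumerate(frame_volumes(signal, frame_size)):
        if volume > volume_threshold and run_start is None:
            run_start = index * frame_size                      # a non-silent run opens
        elif volume <= volume_threshold and run_start is not None:
            segments.append((run_start, index * frame_size))    # the run closes
            run_start = None
    if run_start is not None:
        # The signal ended while still non-silent: close at the last full frame.
        segments.append((run_start, (len(signal) // frame_size) * frame_size))
    return segments


quiet = [0.0] * 8
loud = [1.0, -1.0] * 4
print(find_endpoints(quiet + loud + quiet, frame_size=4, volume_threshold=1.0))
# prints [(8, 16)]
```

Each returned pair marks one non-silent stretch, which would then be handed to the voice detector as a voice frame.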

The voice activity detection system 100 of this embodiment may include a voice detector 13 that detects human speech according to the voice frame (step 23).

In this embodiment, human speech is detected (by the voice detector 13) when the value of the similarity or correlation between voice frames is greater than the corresponding threshold. Specifically, an auto-correlation function is applied to the voice frame to determine an autocorrelation value, which represents the similarity between the voice frame and a (delayed) voice frame with a delay time (or, equivalently, serves to detect pitch). The auto-correlation function (ACF) is defined over the delay time τ, where s represents the voice frame and i = 0, …, n-1.
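The ACF equation itself is not reproduced in this text. The sketch below assumes the common truncated-sum form, ACF(τ) = Σ s(i)·s(i+τ) for i = 0, …, n-1-τ, so that the delayed index never leaves the frame; the function name `acf` and that summation limit are assumptions rather than details confirmed by the source.

```python
def acf(frame, tau):
    """Autocorrelation of one voice frame at delay tau (assumed truncated-sum form)."""
    n = len(frame)
    # Multiply the frame by a copy of itself shifted by tau and sum the overlap.
    return sum(frame[i] * frame[i + tau] for i in range(n - tau))


# A frame that repeats every 4 samples correlates strongly at delay 4
# and negatively at the half-period delay 2.
frame = [1, 0, -1, 0] * 4
print(acf(frame, 0), acf(frame, 4), acf(frame, 2))
# prints 8 6 -7
```

A large value at some non-zero delay indicates periodicity, which is why the ACF is a common cue for voiced (pitched) speech.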

In this embodiment, a normalized squared difference function (NSDF) is further applied to the voice frames (e.g., a voice frame and a voice frame with a delay time) to determine a normalized squared difference value.
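The NSDF equation is likewise missing from this text. The sketch below assumes the widely used form NSDF(τ) = 2·Σ s(i)·s(i+τ) / Σ (s(i)² + s(i+τ)²), summed over i = 0, …, n-1-τ, which keeps the result between -1 and 1. The function name `nsdf` and the zero-energy fallback are assumptions for illustration.

```python
def nsdf(frame, tau):
    """Normalized squared difference of a frame at delay tau (assumed common form)."""
    n = len(frame)
    cross = sum(frame[i] * frame[i + tau] for i in range(n - tau))
    energy = sum(frame[i] ** 2 + frame[i + tau] ** 2 for i in range(n - tau))
    # An all-zero overlap has no defined ratio; report 0.0 (illustrative choice).
    return 2.0 * cross / energy if energy else 0.0


frame = [1, 0, -1, 0] * 4
print(nsdf(frame, 4), nsdf(frame, 2), nsdf(frame, 1))
# prints 1.0 -1.0 0.0
```

Because the value is normalized by the energy of the overlapping samples, it is insensitive to overall loudness, unlike the raw autocorrelation.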

In this embodiment, human speech is detected when the autocorrelation value is greater than a first threshold and the normalized squared difference value is greater than a second threshold. FIG. 4A illustrates waveforms of a voice signal and its endpoints, FIG. 4B illustrates the autocorrelation and the corresponding first threshold TH_B, and FIG. 4C illustrates the normalized squared difference and the corresponding second threshold TH_C.

Returning to FIG. 2, if human speech is detected, another voice frame is detected. If human speech is not detected (indicating that noise is detected), the threshold associated with the similarity between voice frames is updated (or adjusted) in step 24 before another voice frame is detected. In this way, the voice activity detection system 100 and the voice activity detection method 200 determine the threshold adaptively according to the human speech detection result and thus adapt to the current environment, instead of using a fixed threshold as conventional voice activity detection systems and methods do.
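The control flow of steps 23 and 24 can be sketched as a loop that applies the two-threshold test and adapts the thresholds only after a frame is classified as non-speech. The names `run_vad`, `measure`, and `update_thresholds` are hypothetical, and the stand-in callbacks in the usage lines are placeholders chosen only to make the example runnable.

```python
def run_vad(frames, measure, update_thresholds, th_b, th_c):
    """Classify each voice frame; adapt TH_B and TH_C only after a non-speech frame."""
    results = []
    for frame in frames:
        acf_value, nsdf_value = measure(frame)
        speech = acf_value > th_b and nsdf_value > th_c   # step 23: both tests must pass
        results.append(speech)
        if not speech:
            th_b, th_c = update_thresholds(frame)         # step 24: noise frame adapts
    return results, (th_b, th_c)


# Stand-ins: each "frame" is already a (acf_value, nsdf_value) pair, and the
# update simply adopts the noise frame's own values (illustrative only).
frames = [(5.0, 0.9), (2.0, 0.3), (2.5, 0.4)]
results, thresholds = run_vad(frames, lambda f: f, lambda f: f, th_b=3.0, th_c=0.5)
print(results, thresholds)
# prints [True, False, True] (2.0, 0.3)
```

In the usage lines, the third frame would have failed against the initial thresholds but passes after the second (noise) frame lowers them, which is the adaptive behavior the paragraph describes.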

The voice activity detection system 100 of this embodiment may include a threshold update unit 14. When human speech is not detected, the threshold update unit 14 is activated by an activation signal (issued by the voice detector 13) to determine an updated (first or second) threshold. The activation signal becomes active when human speech is not detected.

FIG. 5A illustrates the autocorrelation and how the updated first threshold is obtained. In this embodiment, the maximum autocorrelation value within a specific range (e.g., max(ACF(62:188))) is subtracted from the autocorrelation value with no delay (i.e., ACF(0)) to obtain the updated first threshold.

FIG. 5B illustrates the normalized squared difference and how the updated second threshold is obtained. In this embodiment, the updated second threshold is equal to the maximum autocorrelation value within the specific range (e.g., max(ACF(62:188))).
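The two update rules can be written out directly: the updated first threshold is ACF(0) minus max(ACF(62:188)), and the updated second threshold is max(ACF(62:188)). In the sketch below, the helper `acf` uses an assumed truncated-sum form, the range 62:188 is read as inclusive at both ends, and the name `updated_thresholds` is hypothetical.

```python
def acf(frame, tau):
    """Autocorrelation at delay tau (assumed truncated-sum form)."""
    return sum(frame[i] * frame[i + tau] for i in range(len(frame) - tau))


def updated_thresholds(frame, lag_min=62, lag_max=188):
    """Return (TH_B, TH_C) computed from a frame classified as non-speech."""
    peak = max(acf(frame, tau) for tau in range(lag_min, lag_max + 1))
    th_b = acf(frame, 0) - peak   # updated first threshold: ACF(0) - max(ACF(62:188))
    th_c = peak                   # updated second threshold: max(ACF(62:188))
    return th_b, th_c


# Synthetic frame with an impulse every 100 samples: ACF(0) = 4 and the
# largest value in the delay range 62..188 is ACF(100) = 3.
frame = [1 if i % 100 == 0 else 0 for i in range(400)]
print(updated_thresholds(frame))
# prints (1, 3)
```

The frame must be longer than the largest delay for the range to be meaningful; the sample rate that maps delays 62 to 188 onto a pitch range is not stated in the text.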

According to the above embodiment, because the threshold used for detecting human speech is determined adaptively, the voice activity detection system 100 and the voice activity detection method 200 adapt to environmental changes and overcome noise, thereby outputting reliable and correct detection results.

FIG. 6 shows a block diagram of a voice activity detection system 100A according to a first exemplary embodiment of the present invention. In this embodiment, (only) when human speech is detected, the voice detector 13 issues a voice trigger signal to the controller 15, which in turn issues an image trigger signal to wake up an image sensor 16 (e.g., a contact image sensor (CIS)) to capture an image. Notably, the image sensor 16 normally stays in a low-power mode or sleep mode until the image trigger signal becomes active. In this way, power consumption and communication bandwidth can be greatly reduced.
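The trigger chain (voice detector to controller to image sensor) can be modeled as a small sketch. The class and method names below are hypothetical and stand in for hardware signals; the point is only that the sensor leaves its sleep state when, and only when, a speech detection propagates through the controller.

```python
class ImageSensor:
    """Stand-in for image sensor 16: sleeps until an image trigger arrives."""

    def __init__(self):
        self.mode = "sleep"
        self.captures = 0

    def on_image_trigger(self):
        self.mode = "active"            # wake up from the low-power mode
        self.captures += 1
        return f"image-{self.captures}"


class Controller:
    """Stand-in for controller 15: turns a voice trigger into an image trigger."""

    def __init__(self, sensor):
        self.sensor = sensor

    def on_voice_trigger(self):
        return self.sensor.on_image_trigger()


def handle_frame(speech_detected, controller):
    """Voice detector 13 issues the voice trigger only when speech is detected."""
    return controller.on_voice_trigger() if speech_detected else None


sensor = ImageSensor()
controller = Controller(sensor)
print(handle_frame(False, controller), sensor.mode)   # prints None sleep
print(handle_frame(True, controller), sensor.mode)    # prints image-1 active
```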

In this embodiment, the voice activity detection system 100A may include an artificial intelligence (AI) engine 17, such as an artificial neural network, that analyzes the image captured by the image sensor 16 and transmits the analysis result to the controller 15, which executes a specific function or application according to the analysis result.

FIG. 7 shows a block diagram of a voice activity detection system 100B according to a second exemplary embodiment of the present invention. The voice activity detection system 100B of FIG. 7 is similar to the voice activity detection system 100A of FIG. 6, and the differences are described below.

In this embodiment, the voice activity detection system 100B may further include a speech recognition unit 18 that, according to the voice frame (from the voice frame detector 12), recognizes the spoken language (and may even transcribe it into text), identifies the speaker, or does both.

The voice activity detection system 100B of this embodiment may further include a face recognition unit 19 that recognizes a face in the image captured by the image sensor 16. The face recognition unit 19 is activated only when the image trigger signal (from the controller 15) becomes active.

The above describes only preferred embodiments of the present invention and is not intended to limit the scope of the claims; all other equivalent changes or modifications made without departing from the spirit disclosed by the invention shall fall within the scope of the claims below.

Claims

1. A voice activity detection system, comprising: a voice frame detector that detects a voice frame during which a voice signal is not silent; a voice detector that detects human speech according to the similarity or correlation between voice frames, the voice detector performing autocorrelation on the voice frame to determine an autocorrelation value, which represents the similarity between the voice frame and a voice frame with a delay time; and a threshold update unit that updates the associated threshold for detecting human speech according to the result of human speech detection by the voice detector; wherein the voice frame detector uses end-point detection to determine the endpoints of the voice signal, between which the voice signal is not silent, and the threshold update unit updates the associated threshold only when human speech is not detected.

2. The voice activity detection system of claim 1, further comprising: a transducer that converts sound into the voice signal.

3. The voice activity detection system of claim 1, wherein a point at which the amplitude or the high-order difference of the voice signal is greater than a preset threshold is determined as an endpoint.

4. The voice activity detection system of claim 1, wherein the voice detector detects human speech when the value of the similarity between the voice frames is greater than the associated threshold.

5. The voice activity detection system of claim 1, wherein the voice detector performs a normalized squared difference on a voice frame and a voice frame with a delay time to determine a normalized squared difference value.

6. The voice activity detection system of claim 5, wherein human speech is detected when the autocorrelation value is greater than a first threshold and the normalized squared difference value is greater than a second threshold.

7. The voice activity detection system of claim 6, wherein the maximum autocorrelation value within a specific range is subtracted from the autocorrelation value with no delay to obtain an updated first threshold, and an updated second threshold is equal to the maximum autocorrelation value within the specific range.

8. The voice activity detection system of claim 1, further comprising: a controller that receives a voice trigger signal from the voice detector when human speech is detected; and an image sensor, wherein the controller issues an image trigger signal to wake up the image sensor from a low-power mode when human speech is detected.

9. The voice activity detection system of claim 8, further comprising: an artificial intelligence engine that analyzes the image captured by the image sensor and transmits the analysis result to the controller, which executes a specific function or application according to the analysis result.

10. The voice activity detection system of claim 9, further comprising: a speech recognition unit, activated only when the image trigger signal becomes active, that recognizes the spoken language or identifies the speaker.

11. The voice activity detection system of claim 9, further comprising: a face recognition unit, activated only when the image trigger signal becomes active, that recognizes a face in the image captured by the image sensor.