TWI839132B - Voice activity detection system - Google Patents

Voice activity detection system

Info

Publication number
TWI839132B
Authority
TW
Taiwan
Prior art keywords
voice
activity detection
detection system
critical value
voice activity
Prior art date
Application number
TW112106990A
Other languages
Chinese (zh)
Other versions
TW202349378A (en)
Inventor
周青翰
湯迪文
黃柏穎
Original Assignee
奇景光電股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/839,962 (US20230402057A1)
Application filed by 奇景光電股份有限公司
Publication of TW202349378A
Application granted
Publication of TWI839132B

Abstract

A voice activity detection (VAD) system includes a voice frame detector that detects a voice frame during which a voice signal is not silent; a voice detector that detects the presence of human speech according to the voice frame; and a threshold update unit that updates an associated threshold for detecting the presence of human speech according to the result of human speech detection by the voice detector.

Description

Voice Activity Detection System

The present invention relates to voice activity detection (VAD), and more particularly to a voice activity detection system with an adaptive threshold.

Voice activity detection (VAD) detects or recognizes human speech and is mainly used in speech processing. VAD can be used to activate speech-based applications. During non-speech periods, VAD disables some processing to avoid unnecessary transmission, thereby reducing communication bandwidth and power consumption.

Conventional voice activity detection systems are error-prone and unreliable, especially in noisy environments. A novel mechanism is therefore needed to overcome the drawbacks of conventional voice activity detection systems.

In view of the foregoing, one object of the embodiments of the present invention is to provide a voice activity detection system with an adaptive threshold that adapts to environmental changes and overcomes noise, thereby outputting reliable and accurate detection results.

According to one embodiment, a voice activity detection system includes a voice frame detector and a voice detector. The voice frame detector detects a voice frame during which a voice signal is not silent. The voice detector detects human speech according to the voice frame.

In one embodiment, the voice activity detection system further includes a threshold update unit that updates the associated threshold for detecting human speech according to the result of human speech detection by the voice detector.

100: Voice activity detection system

100A: Voice activity detection system

100B: Voice activity detection system

11: Transducer

12: Voice frame detector

13: Voice detector

14: Threshold update unit

15: Controller

16: Image sensor

17: Artificial intelligence engine

18: Speech recognition unit

19: Face recognition unit

200: Voice activity detection method

21: Convert sound into a voice signal

22: Detect a voice frame during which the voice signal is not silent

23: Is human speech detected?

24: Update threshold

TH_B: First threshold

TH_C: Second threshold

FIG. 1 shows a block diagram of a voice activity detection system according to an embodiment of the present invention.

FIG. 2 shows a flow chart of a voice activity detection method according to an embodiment of the present invention.

FIG. 3A illustrates the waveform of a voice signal and its endpoints.

FIG. 3B illustrates the volume and the high-order difference of the voice signal.

FIG. 3C illustrates voice frames.

FIG. 4A illustrates the waveform of a voice signal and its endpoints.

FIG. 4B illustrates the auto-correlation and the corresponding first threshold TH_B.

FIG. 4C illustrates the normalized squared difference and the corresponding second threshold TH_C.

FIG. 5A illustrates the auto-correlation and how the updated first threshold is obtained.

FIG. 5B illustrates the normalized squared difference and how the updated second threshold is obtained.

FIG. 6 shows a block diagram of a voice activity detection system according to a first exemplary embodiment of the present invention.

FIG. 7 shows a block diagram of a voice activity detection system according to a second exemplary embodiment of the present invention.

FIG. 1 shows a block diagram of a voice activity detection (VAD) system 100 according to an embodiment of the present invention, and FIG. 2 shows a flow chart of a voice activity detection method 200 according to an embodiment of the present invention.

The voice activity detection system 100 of this embodiment may include a transducer 11, such as a microphone, that converts sound into an (electrical) voice signal (step 21).

The voice activity detection system 100 may include a voice frame detector 12 that receives the voice signal and detects a voice frame during which the voice signal is not silent (step 22). In one embodiment, the voice frame detector 12 uses end-point detection (EPD) to determine the endpoints of the voice signal, between which the voice signal is not silent. In one embodiment, a point at which the amplitude of the voice signal (representing volume) exceeds a preset threshold is determined to be an endpoint. In another embodiment, a point at which the high-order difference (HOD) of the voice signal (representing slope) exceeds a preset threshold is determined to be an endpoint. FIG. 3A illustrates the waveform of a voice signal and its endpoints, FIG. 3B illustrates the volume and the high-order difference of the voice signal, and FIG. 3C illustrates voice frames.
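
The end-point detection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the fixed frame length, and the definitions of volume (sum of absolute amplitudes) and high-order difference (sum of absolute order-th differences) are assumptions chosen for the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def volume(frame: Sequence[float]) -> float:
    """Volume of a frame, assumed here to be the sum of absolute amplitudes."""
    return sum(abs(x) for x in frame)


def high_order_difference(frame: Sequence[float], order: int = 1) -> float:
    """Sum of absolute order-th differences of a frame (assumed definition)."""
    diff = list(frame)
    for _ in range(order):
        diff = [b - a for a, b in zip(diff, diff[1:])]
    return sum(abs(x) for x in diff)


def find_endpoints(
    signal: Sequence[float],
    frame_len: int,
    threshold: float,
    measure: Callable[[Sequence[float]], float] = volume,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of each non-silent stretch."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    n_frames = len(signal) // frame_len
    for k in range(n_frames):
        frame = signal[k * frame_len:(k + 1) * frame_len]
        active = measure(frame) > threshold
        if active and start is None:
            start = k * frame_len  # rising edge: opening endpoint
        elif not active and start is not None:
            segments.append((start, k * frame_len))  # falling edge: closing endpoint
            start = None
    if start is not None:  # the signal ended while still active
        segments.append((start, n_frames * frame_len))
    return segments
```

Passing `measure=high_order_difference` switches the same routine from the volume criterion to the slope criterion; the samples between each returned pair of endpoints form a voice frame handed to the voice detector.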

The voice activity detection system 100 of this embodiment may include a voice detector 13 that detects human speech according to the voice frame (step 23).

In this embodiment, the voice detector 13 detects human speech when a value of similarity or correlation between voice frames is greater than the associated threshold. Specifically, an auto-correlation function (ACF) is applied to the voice frame to determine an auto-correlation value, which represents the similarity between the voice frame and a delayed copy of the voice frame (and can be used to detect pitch). The ACF can be expressed as follows:

[Equation image: Figure 112106990-A0305-02-0004-1]

where τ represents the delay time, s represents the voice frame, and i = 0, …, n-1.
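
The equation itself survives only as an image reference in this text, so the sketch below uses the commonly used short-time form, ACF(τ) = Σ s(i)·s(i+τ), with the sum restricted to indices where both samples lie inside the frame. That form, and the function name `acf`, are assumptions rather than the patent's exact expression.

```python
from typing import Sequence


def acf(frame: Sequence[float], tau: int) -> float:
    """Short-time auto-correlation of a frame at lag tau.

    Products whose second sample would fall outside the frame are skipped,
    which is equivalent to zero-padding the frame.
    """
    n = len(frame)
    return sum(frame[i] * frame[i + tau] for i in range(n - tau))
```

For a periodic frame the value peaks when τ equals the period, which is why a large auto-correlation value at a plausible pitch lag suggests voiced speech.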

In this embodiment, a normalized squared difference function (NSDF) is further applied to the voice frame (e.g., to the voice frame and a delayed copy of the voice frame) to determine a normalized squared difference value. The NSDF can be expressed as follows:

[Equation image: Figure 112106990-A0305-02-0004-2]
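
Because the NSDF equation is likewise available only as an image reference, the following sketch assumes the widely used form NSDF(τ) = 2·Σ s(i)·s(i+τ) / Σ (s(i)² + s(i+τ)²), which is bounded between -1 and 1. The patent's exact expression may differ, and the name `nsdf` is illustrative.

```python
from typing import Sequence


def nsdf(frame: Sequence[float], tau: int) -> float:
    """Normalized squared difference function of a frame at lag tau.

    Uses the common form 2 * sum(s[i] * s[i + tau]) divided by
    sum(s[i]**2 + s[i + tau]**2); returns 0.0 when the denominator is zero.
    """
    n = len(frame)
    num = 2.0 * sum(frame[i] * frame[i + tau] for i in range(n - tau))
    den = sum(frame[i] ** 2 + frame[i + tau] ** 2 for i in range(n - tau))
    return num / den if den > 0 else 0.0
```

Unlike the raw auto-correlation, this value does not scale with signal level, so a threshold on it depends less on how loud the input is.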

In this embodiment, human speech is detected when the auto-correlation value is greater than a first threshold and the normalized squared difference value is greater than a second threshold. FIG. 4A illustrates the waveform of a voice signal and its endpoints, FIG. 4B illustrates the auto-correlation and the corresponding first threshold TH_B, and FIG. 4C illustrates the normalized squared difference and the corresponding second threshold TH_C.
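
The two-threshold decision can be sketched as below. The text does not say at which lag the two values are read, so this example assumes that each is taken as the peak over a lag range [lag_min, lag_max]. That choice, the helper definitions, and all names are illustrative assumptions.

```python
from typing import Sequence


def acf(frame: Sequence[float], tau: int) -> float:
    """Short-time auto-correlation at lag tau (assumed common form)."""
    return sum(frame[i] * frame[i + tau] for i in range(len(frame) - tau))


def nsdf(frame: Sequence[float], tau: int) -> float:
    """Normalized squared difference function at lag tau (assumed common form)."""
    n = len(frame)
    num = 2.0 * sum(frame[i] * frame[i + tau] for i in range(n - tau))
    den = sum(frame[i] ** 2 + frame[i + tau] ** 2 for i in range(n - tau))
    return num / den if den > 0 else 0.0


def is_human_speech(
    frame: Sequence[float],
    th_b: float,
    th_c: float,
    lag_min: int,
    lag_max: int,
) -> bool:
    """Report speech only if both peak values exceed their thresholds."""
    lags = range(lag_min, lag_max + 1)
    acf_peak = max(acf(frame, tau) for tau in lags)
    nsdf_peak = max(nsdf(frame, tau) for tau in lags)
    return acf_peak > th_b and nsdf_peak > th_c
```

Requiring both conditions means a loud but aperiodic frame (high energy, low normalized similarity) and a faint periodic one (high normalized similarity, low auto-correlation) are both rejected.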

Returning to FIG. 2, if human speech is detected, the flow proceeds to detect another voice frame. If human speech is not detected (indicating that noise is detected), the threshold associated with the similarity between voice frames is updated (or adjusted) in step 24 before another voice frame is detected. The voice activity detection system 100 and the voice activity detection method 200 thus determine the threshold adaptively according to the result of human speech detection and adapt to the current environment, whereas conventional voice activity detection systems and methods use a fixed threshold.
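
The control flow of steps 23 and 24 can be sketched as a loop in which the detection and update rules are passed in as functions. The signature of `run_vad` and the representation of the thresholds as a (TH_B, TH_C) tuple are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Thresholds = Tuple[float, float]  # (TH_B, TH_C)


def run_vad(
    frames: Iterable[Sequence[float]],
    thresholds: Thresholds,
    detect: Callable[[Sequence[float], Thresholds], bool],
    update: Callable[[Sequence[float], Thresholds], Thresholds],
) -> Tuple[List[bool], Thresholds]:
    """Classify each voice frame and adapt the thresholds on non-speech frames."""
    results: List[bool] = []
    for frame in frames:
        speech = detect(frame, thresholds)  # step 23: is human speech detected?
        results.append(speech)
        if not speech:
            # Step 24: the frame is treated as noise, so the thresholds are
            # re-derived from it before the next frame is examined.
            thresholds = update(frame, thresholds)
    return results, thresholds
```

The thresholds are left untouched on frames classified as speech, so only frames judged to be noise shape the decision boundary.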

The voice activity detection system 100 of this embodiment may include a threshold update unit 14 that determines an updated (first/second) threshold. The threshold update unit 14 is activated by an activation signal issued by the voice detector 13; the activation signal becomes active when human speech is not detected.

FIG. 5A illustrates the auto-correlation and how the updated first threshold is obtained. In this embodiment, the maximum auto-correlation value over a specified range (e.g., max(ACF(62:188))) is subtracted from the auto-correlation value without delay (i.e., ACF(0)) to obtain the updated first threshold.

FIG. 5B illustrates the normalized squared difference and how the updated second threshold is obtained. In this embodiment, the updated second threshold is equal to the maximum auto-correlation value over a specified range (e.g., max(ACF(62:188))).
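
The two update rules can be sketched as follows. The sketch assumes that ACF(62:188) denotes lags 62 through 188 inclusive and that the auto-correlation takes the common short-time form; the function names and the default lag range are illustrative.

```python
from typing import Sequence, Tuple


def acf(frame: Sequence[float], tau: int) -> float:
    """Short-time auto-correlation at lag tau (assumed common form)."""
    return sum(frame[i] * frame[i + tau] for i in range(len(frame) - tau))


def updated_thresholds(
    frame: Sequence[float],
    lag_min: int = 62,
    lag_max: int = 188,
) -> Tuple[float, float]:
    """Derive (TH_B, TH_C) from a frame in which no speech was detected."""
    peak = max(acf(frame, tau) for tau in range(lag_min, lag_max + 1))
    th_b = acf(frame, 0) - peak  # first threshold: ACF(0) minus the peak
    th_c = peak                  # second threshold: the peak itself, as stated in the text
    return th_b, th_c
```

Both values are derived from the most recent non-speech frame, so they follow the current noise environment rather than staying fixed.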

According to the embodiment described above, because the threshold used for detecting human speech is determined adaptively, the voice activity detection system 100 and the voice activity detection method 200 adapt to environmental changes and overcome noise, thereby outputting reliable and accurate detection results.

FIG. 6 shows a block diagram of a voice activity detection system 100A according to a first exemplary embodiment of the present invention. In this embodiment, (only) when human speech is detected, the voice detector 13 issues a voice trigger signal to a controller 15, which issues an image trigger signal to wake up an image sensor 16 (e.g., a contact image sensor (CIS)) to capture images. Notably, the image sensor 16 normally stays in a low-power mode or sleep mode until the image trigger signal becomes active. Power consumption and communication bandwidth are thereby substantially reduced.
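
The trigger chain of this embodiment can be sketched with two small classes. The class and method names are hypothetical, and the sketch models only the wake-up behavior described above, not any particular sensor interface.

```python
from typing import Optional


class ImageSensor:
    """Hypothetical image sensor that stays in low-power mode until woken."""

    def __init__(self) -> None:
        self.awake = False
        self.frames_captured = 0

    def wake(self) -> None:
        self.awake = True

    def capture(self) -> str:
        if not self.awake:
            raise RuntimeError("sensor is in low-power mode")
        self.frames_captured += 1
        return f"image-{self.frames_captured}"


class Controller:
    """Hypothetical controller that turns a voice trigger into an image trigger."""

    def __init__(self, sensor: ImageSensor) -> None:
        self.sensor = sensor

    def on_frame(self, speech_detected: bool) -> Optional[str]:
        """Handle one VAD result; only a voice trigger wakes the sensor."""
        if not speech_detected:
            return None  # no voice trigger: the sensor stays asleep
        self.sensor.wake()  # the image trigger signal becomes active
        return self.sensor.capture()
```

Keeping the sensor asleep until speech is detected is what yields the power and bandwidth savings: the image path does no work while only noise is present.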

In this embodiment, the voice activity detection system 100A may include an artificial intelligence (AI) engine 17, such as an artificial neural network, that analyzes the images captured by the image sensor 16 and sends the analysis results to the controller 15, which performs a specific function or application according to the analysis results.

FIG. 7 shows a block diagram of a voice activity detection system 100B according to a second exemplary embodiment of the present invention. The voice activity detection system 100B of FIG. 7 is similar to the voice activity detection system 100A of FIG. 6, with the differences described below.

In this embodiment, the voice activity detection system 100B may further include a speech recognition unit 18 that, according to the voice frame (from the voice frame detector 12), recognizes spoken language and may even transcribe it into text, or identifies the speaker, or both.

The voice activity detection system 100B of this embodiment may further include a face recognition unit 19 that recognizes a face in the images captured by the image sensor 16. The face recognition unit 19 is activated only when the image trigger signal (from the controller 15) becomes active.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the scope of its claims; all other equivalent changes or modifications made without departing from the spirit disclosed by the invention shall fall within the scope of the following claims.


Claims (11)

1. A voice activity detection system, comprising: a voice frame detector that detects a voice frame during which a voice signal is not silent; a voice detector that detects human speech according to similarity or correlation between voice frames, the voice detector performing auto-correlation on the voice frame to determine an auto-correlation value that represents the similarity between the voice frame and a delayed voice frame; and a threshold update unit that updates an associated threshold for detecting human speech according to the result of human speech detection by the voice detector; wherein the voice frame detector uses end-point detection to determine endpoints of the voice signal, between which the voice signal is not silent, and the threshold update unit updates the associated threshold only when human speech is not detected.

2. The voice activity detection system of claim 1, further comprising: a transducer that converts sound into the voice signal.

3. The voice activity detection system of claim 1, wherein an amplitude or a high-order difference of the voice signal that is greater than a preset threshold is determined to be an endpoint.

4. The voice activity detection system of claim 1, wherein the voice detector detects human speech when a value of the similarity between the voice frames is greater than the associated threshold.

5. The voice activity detection system of claim 1, wherein the voice detector performs a normalized squared difference on the voice frame and a delayed voice frame to determine a normalized squared difference value.

6. The voice activity detection system of claim 5, wherein human speech is detected when the auto-correlation value is greater than a first threshold and the normalized squared difference value is greater than a second threshold.

7. The voice activity detection system of claim 6, wherein a maximum auto-correlation value over a specified range is subtracted from the auto-correlation value without delay to obtain an updated first threshold, and an updated second threshold is equal to the maximum auto-correlation value over the specified range.

8. The voice activity detection system of claim 1, further comprising: a controller that receives a voice trigger signal from the voice detector when human speech is detected; and an image sensor, wherein the controller issues an image trigger signal to wake up the image sensor from a low-power mode when human speech is detected.

9. The voice activity detection system of claim 8, further comprising: an artificial intelligence engine that analyzes images captured by the image sensor and sends analysis results to the controller, which performs a specific function or application according to the analysis results.

10. The voice activity detection system of claim 9, further comprising: a speech recognition unit that is activated only when the image trigger signal becomes active, to recognize spoken language or to identify the speaker.

11. The voice activity detection system of claim 9, further comprising: a face recognition unit that is activated only when the image trigger signal becomes active, to recognize a face in the images captured by the image sensor.

TW112106990A 2022-06-14 2023-02-24 Voice activity detection system TWI839132B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US17/839,962 2022-06-14
US17/839,962 US20230402057A1 (en) 2022-06-14 2022-06-14 Voice activity detection system

Publications (2)

Publication Number Publication Date
TW202349378A 2023-12-16
TWI839132B 2024-04-11

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150112689A1 (en) 2013-10-18 2015-04-23 Knowles Electronics Llc Acoustic Activity Detection Apparatus And Method

Similar Documents

Publication Publication Date Title
US9502028B2 (en) Acoustic activity detection apparatus and method
CN111566730B (en) Voice command processing in low power devices
US9940949B1 (en) Dynamic adjustment of expression detection criteria
KR100636317B1 (en) Distributed Speech Recognition System and method
US7227960B2 (en) Robot and controlling method of the same
US20180061396A1 (en) Methods and systems for keyword detection using keyword repetitions
KR101437830B1 (en) Method and apparatus for detecting voice activity
JP3255584B2 (en) Sound detection device and method
KR20090054642A (en) Method for recognizing voice, and apparatus for implementing the same
KR20110131147A (en) Method of noise reduction using instantaneous signal-to-noise ratio as the principal quantity for optimal estimation
JP4682700B2 (en) Voice recognition device
CN112073862B (en) Digital processor, microphone assembly and method for detecting keyword
CN205754809U (en) A kind of robot self-adapting volume control system
JP2012242609A (en) Voice recognition device, robot, and voice recognition method
CN106033673B (en) A kind of near-end voice signals detection method and device
TWI839132B (en) Voice activity detection system
KR20080059881A (en) Apparatus for preprocessing of speech signal and method for extracting end-point of speech signal thereof
JP2023553451A (en) Hot phrase trigger based on sequence of detections
US10104472B2 (en) Acoustic capture devices and methods thereof
US20230402057A1 (en) Voice activity detection system
US20220114447A1 (en) Adaptive tuning parameters for a classification neural network
KR102308022B1 (en) Apparatus for recognizing call sign and method for the same
KR20230118165A (en) Adapting Automated Speech Recognition Parameters Based on Hotword Attributes
CN110958033B (en) Method and terminal for controlling communication of half-duplex digital intercom system
Kim et al. Sound's Direction Detection and Speech Recognition System for Humanoid Active Audition