TW201633291A

TW201633291A - Speech segment detection device and method for detecting speech segment

Info

Publication number: TW201633291A
Application number: TW104119363A
Authority: TW
Inventors: Toshiyuki Hanazawa
Original assignee: Mitsubishi Electric Corp
Priority date: 2015-03-12
Filing date: 2015-06-16
Publication date: 2016-09-16
Also published as: JP6444490B2; JPWO2016143125A1; WO2016143125A1

Abstract

A speech segment detection device comprises: a speech segment detection unit (4) for detecting a provisional start time indicating the start point and a provisional end time indicating the end point of the speech segment included in an input signal using a pattern recognition model for distinguishing speech from noise included in an input signal on the basis of a first feature calculated by a first feature calculation unit (1); and a start-end time correction unit (5) for correcting the start time and end time on the basis of a comparison between a second feature calculated by a second feature calculation unit (2) and a threshold value.

Description

Speech segment detection device and speech segment detection method

The present invention relates to a technique for detecting speech segments in an input signal using a plurality of features.

Speech segment detection, which extracts the segments of an input signal in which speech is present, is a very important preprocessing step for speech recognition. In general, speech recognition performs pattern recognition on the segments found by speech segment detection to obtain a recognition result, so errors in segment detection greatly reduce recognition accuracy. A basic method of speech segment detection is to calculate the power of the input signal and detect, as speech segments, the segments in which the calculated power is at or above a set threshold. This method works well when the background noise is low and stationary.
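The basic power-threshold method described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the use of decibels, and the small floor added before the logarithm are all assumptions made for the example.

```python
import math

def frame_power_db(frame):
    """Mean power of one frame in dB (a small floor avoids log(0))."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_sq + 1e-12)

def power_threshold_segments(frames, threshold_db):
    """Return (first_frame, last_frame) index pairs whose power is >= threshold_db."""
    segments, start = [], None
    for i, frame in enumerate(frames):
        active = frame_power_db(frame) >= threshold_db
        if active and start is None:
            start = i                        # a segment begins here
        elif not active and start is not None:
            segments.append((start, i - 1))  # the segment ended on the previous frame
            start = None
    if start is not None:                    # the signal ended while still active
        segments.append((start, len(frames) - 1))
    return segments
```

Because every loud frame counts as speech, a hammer blow that exceeds the threshold is reported as a segment, which is the weakness the following paragraphs address.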

Meanwhile, hands-free speech recognition is a very effective user interface for tasks such as entering inspection results during maintenance work on plant equipment and supporting the operation of various factory automation (FA) devices. However, the maintenance environment of plant equipment and the operating environment of FA devices are environments in which non-stationary noise, such as knocking and hammering sounds, occurs frequently. A method that detects speech segments using only the power calculated from the input signal therefore falsely detects non-stationary noise as speech, which lowers segment detection accuracy and prevents the subsequent speech recognition from achieving sufficient performance.

To address this problem, Patent Document 1, for example, discloses a speech segment detection method in which the feature used for detection is, in place of the power of the input signal, the cepstrum, which represents the spectral characteristics of the input signal, together with HMMs (hidden Markov models) that take the cepstrum as their parameter. Specifically, several HMMs are trained in advance for speech and for noise. When detecting the start of a speech segment, the likelihood of each HMM is calculated, and if, within 12 frames (120 milliseconds), there are 4 or more frames in which a speech HMM has the highest likelihood, the first of those 12 frames is detected as the start of the speech segment.

[Prior Art Documents] [Patent Documents]

[Patent Document 1] Japanese Patent Laid-Open Publication No. 2001-343983

However, because the technique disclosed in Patent Document 1 detects speech segments using a feature that represents the spectral characteristics of the input signal, it can suppress false detection, as speech, of noise whose spectral characteristics differ from those of speech. Unvoiced consonants (p, t, k, s, sh, h, f) and the like, on the other hand, have spectral characteristics similar to noise, so the technique cannot correctly distinguish them from noise.

The present invention was made to solve the above problem. Its object is to suppress false detection of non-stationary noise as speech while improving the detection accuracy of unvoiced consonants at the beginning and end of an utterance.

A speech segment detection device according to the present invention includes: a first feature calculation unit that calculates, from an input signal, a first feature representing spectral characteristics; a second feature calculation unit that calculates, from the input signal, a second feature representing a speech characteristic different from the first feature; a speech segment detection unit that, using a recognition model for distinguishing speech from noise contained in the input signal, detects from the first feature calculated by the first feature calculation unit a start time indicating the start point and an end time indicating the end point of a speech segment contained in the input signal; and a start-end time correction unit that corrects the start time and end time detected by the speech segment detection unit on the basis of a comparison between the second feature calculated by the second feature calculation unit and a threshold.

According to the present invention, false detection of non-stationary noise as a speech segment can be suppressed, and the detection accuracy of unvoiced consonants at the beginning and end of an utterance can be improved.

1‧‧‧First feature calculation unit

2‧‧‧Second feature calculation unit

3‧‧‧Pattern recognition model storage unit

4‧‧‧Speech segment detection unit

5‧‧‧Start-end time correction unit

5a‧‧‧Start-end time correction unit

5b‧‧‧Start-end time correction unit

6‧‧‧Threshold calculation unit

10‧‧‧Speech segment detection device

10a‧‧‧Speech segment detection device

10b‧‧‧Speech segment detection device

20‧‧‧Processor

30‧‧‧Memory

[Fig. 1] is a block diagram showing the configuration of the speech segment detection device of the first embodiment;
[Fig. 2] is a diagram showing the hardware configuration of the speech segment detection device of the first embodiment;
[Fig. 3A] is a flowchart showing the operation of the speech segment detection device of the first embodiment;
[Fig. 3B] is a flowchart showing the operation of the speech segment detection device of the first embodiment;
[Fig. 4] is a diagram showing the search sections used by the start-end time correction unit of the speech segment detection device of the first embodiment;
[Fig. 5] is a block diagram showing the configuration of the speech segment detection device of the second embodiment;
[Fig. 6] is a diagram showing the search sections used by the start-end time correction unit and the threshold calculation section used by the threshold calculation unit of the speech segment detection device of the second embodiment;
[Fig. 7A] is a flowchart showing the operation of the speech segment detection device of the second embodiment;
[Fig. 7B] is a flowchart showing the operation of the speech segment detection device of the second embodiment;
[Fig. 8] is a block diagram showing the configuration of the speech segment detection device of the third embodiment;
[Fig. 9A] is a flowchart showing the operation of the speech segment detection device of the third embodiment; and
[Fig. 9B] is a flowchart showing the operation of the speech segment detection device of the third embodiment.

Hereinafter, to describe the present invention in more detail, modes for carrying out the invention are described with reference to the accompanying drawings.

[First Embodiment]

Fig. 1 is a block diagram showing the configuration of the speech segment detection device 10 of the first embodiment.

The speech segment detection device 10 comprises a first feature calculation unit 1, a second feature calculation unit 2, a pattern recognition model storage unit 3, a speech segment detection unit 4, and a start-end time correction unit 5.

The first feature calculation unit 1 performs acoustic analysis of an externally supplied input signal and converts it into a time series of a feature representing spectral characteristics (hereinafter, the first feature). The first feature is, for example, the 1st to 12th coefficients of the MFCC (mel-frequency cepstral coefficients). For brevity, the 1st to 12th MFCC coefficients are referred to below simply as the MFCC.

The second feature calculation unit 2 calculates a time series of a feature that differs from the first feature produced by the first feature calculation unit 1 and that is suited to detecting speech that is hard to identify with the first feature (hereinafter, the second feature). For example, it calculates a time series of a feature suited to detecting unvoiced consonants and other speech that the first feature has difficulty distinguishing from noise. Here, unvoiced consonants are p, t, k, s, sh, h, f, and the like. Because the energy of unvoiced consonants is generally concentrated at high frequencies, high-frequency-emphasized power, for example, is calculated as the second feature.

The pattern recognition model storage unit 3 stores a pattern recognition model for distinguishing speech from noise in the input signal. The first embodiment is described using GMMs (Gaussian mixture models) as an example. Specifically, the pattern recognition model consists of one GMM that models speech (hereinafter, the speech GMM) and one GMM that models noise (hereinafter, the noise GMM). The parameters of the speech GMM and the noise GMM are obtained in advance by training, for example with maximum likelihood estimation. The speech GMM parameters are trained on the MFCC of a variety of speech, and the noise GMM parameters on the MFCC of a variety of noise.

The speech segment detection unit 4 refers to the pattern recognition model stored in the pattern recognition model storage unit 3, performs pattern matching on the first feature calculated by the first feature calculation unit 1, and detects a provisional start time indicating the start point and a provisional end time indicating the end point of a speech segment in the input signal. The start-end time correction unit 5 corrects the provisional start time and provisional end time detected by the speech segment detection unit 4 on the basis of the second feature and determines the final start time and end time. The start-end time correction unit 5 outputs the resulting start time and end time as time information for the speech segment in the input signal.

Fig. 2 is a diagram showing the hardware configuration of the speech segment detection device 10 of the first embodiment.

The first feature calculation unit 1, second feature calculation unit 2, speech segment detection unit 4, and start-end time correction unit 5 of the speech segment detection device 10 are implemented by the processor 20 executing a program stored in the memory 30. The pattern recognition model storage unit 3 is constituted by the memory 30. A plurality of processors 20 and a plurality of memories 30 may also cooperate to perform these functions.

Next, the operation of the speech segment detection device 10 is described.

Figs. 3A and 3B are flowcharts showing the operation of the speech segment detection device 10 of the first embodiment.

When a signal is input (step ST1), the first feature calculation unit 1 divides the input signal into time sections of a set length (hereinafter, frames) and converts the input signal frame by frame to calculate the first feature (step ST2). Adjacent frames may overlap in time. For example, with a frame length of 30 milliseconds, the frame is shifted by 10 milliseconds at a time while the input signal is converted to calculate the first feature. The first feature is the MFCC, as described above. That is, in step ST2 the first feature calculation unit 1 calculates and outputs a time series of the MFCC at 10-millisecond intervals.
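The framing in step ST2 (30-millisecond frames shifted by 10 milliseconds) can be illustrated with a short sketch. The function name and the choice to drop a trailing partial frame are assumptions for the example; the MFCC computation itself is omitted.

```python
def split_into_frames(samples, sample_rate, frame_ms=30, shift_ms=10):
    """Split a sample sequence into overlapping frames.

    frame_ms and shift_ms follow the example values in the text
    (30 ms frames, 10 ms shift). A trailing partial frame is dropped.
    """
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    shift = sample_rate * shift_ms // 1000       # samples between frame starts
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames
```

With these values each frame shares two thirds of its samples with the next one, so one feature vector is produced every 10 milliseconds.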

The second feature calculation unit 2 divides the input signal at the same frame interval as the first feature calculation unit 1 and converts the input signal frame by frame to calculate the second feature (step ST3). The following description assumes that high-frequency-emphasized power is calculated as the second feature in step ST3. The second feature calculation unit 2 treats the first K frames of the input signal (for example, K = 10) as a noise section containing no speech and calculates the average power over those K frames as the noise level (step ST4). The second feature calculation unit 2 then subtracts the noise level calculated in step ST4 from the high-frequency-emphasized power calculated for each frame in step ST3 to obtain the high-frequency-emphasized difference power (step ST5). In step ST5, the second feature calculation unit 2 calculates and outputs a time series of the high-frequency-emphasized difference power at 10-millisecond intervals.
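Steps ST3 to ST5 can be sketched as below. The text does not say how high frequencies are emphasized or in which domain the subtraction is done, so this sketch assumes a first-order pre-emphasis filter with coefficient 0.97 and power expressed in decibels; both choices, and all function names, are illustrative assumptions.

```python
import math

def high_freq_emphasized_power(frame, alpha=0.97):
    """Log power (dB) of one frame after first-order pre-emphasis.

    The filter y[n] = x[n] - alpha * x[n-1] and alpha = 0.97 are assumptions;
    the source only states that high frequencies are emphasized.
    """
    emphasized = [frame[n] - alpha * frame[n - 1] for n in range(1, len(frame))]
    mean_sq = sum(v * v for v in emphasized) / len(emphasized)
    return 10.0 * math.log10(mean_sq + 1e-12)

def difference_power(frames, k=10):
    """High-frequency-emphasized difference power for every frame."""
    powers = [high_freq_emphasized_power(f) for f in frames]   # step ST3
    noise_frames = powers[:k]                                  # first K frames
    noise_level = sum(noise_frames) / len(noise_frames)        # step ST4
    return [p - noise_level for p in powers]                   # step ST5
```

Frames at the noise level come out near 0, so a threshold on this value expresses "how far above the leading noise" a frame is.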

The speech segment detection unit 4 takes as input the time series of the first feature calculated in step ST2, that is, the MFCC, refers to the pattern recognition model stored in the pattern recognition model storage unit 3, and calculates for each frame the log-likelihood Ls of the speech GMM and the log-likelihood Ln of the noise GMM (step ST6). Using Ls and Ln calculated in step ST6, the speech segment detection unit 4 calculates the log-likelihood difference S according to equation (1) below (step ST7).

S = Ls - Ln (1)
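Equation (1) can be illustrated for a single feature vector as follows. The sketch assumes diagonal-covariance GMMs given as (weights, means, variances) tuples; that representation and the function names are assumptions, since the text does not specify the covariance structure.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) under a diagonal-covariance GMM, computed with log-sum-exp."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Log density of one diagonal Gaussian component.
        log_gauss = -0.5 * sum(
            math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_gauss)
    peak = max(log_terms)  # subtract the peak for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))

def log_likelihood_difference(x, speech_gmm, noise_gmm):
    """S = Ls - Ln for one frame (steps ST6 and ST7)."""
    ls = gmm_log_likelihood(x, *speech_gmm)
    ln = gmm_log_likelihood(x, *noise_gmm)
    return ls - ln
```

A positive S means the frame fits the speech model better than the noise model; a negative S means the opposite.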

The speech segment detection unit 4 searches forward along the time axis for a section in which frames whose log-likelihood difference S calculated in step ST7 is at or above a set threshold Th_S continue for at least a set number of frames Th_T1 (step ST8). Within the section found in step ST8, the speech segment detection unit 4 takes the time of the first frame, in the forward direction, at which S is at or above Th_S as the provisional start time Tb’ of the speech segment (step ST9).

Next, the speech segment detection unit 4 searches forward along the time axis for a section in which frames whose log-likelihood difference S calculated in step ST7 is below the set threshold Th_S continue for at least a set number of frames Th_T2 (step ST10). Within the section found in step ST10, the speech segment detection unit 4 takes the time of the first frame, in the forward direction, at which S is below Th_S as the provisional end time Te’ of the speech segment (step ST11). The searches in steps ST8 and ST10 continue until the target frames are found.
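The run-length searches of steps ST8 to ST11 can be sketched as below, working in frame indices rather than times. The helper names are hypothetical, and starting the end search at the provisional start frame is an assumption; the text only says both searches run forward along the time axis.

```python
def find_run_start(flags, min_run, begin=0):
    """Index of the first frame of the first run of at least min_run
    consecutive True flags at or after index begin, or None if there is none."""
    run = 0
    for i in range(begin, len(flags)):
        run = run + 1 if flags[i] else 0
        if run >= min_run:
            return i - min_run + 1
    return None

def provisional_endpoints(s, th_s, th_t1, th_t2):
    """Return (Tb', Te') as frame indices, or None if no speech run is found.

    Te' is None when the signal ends before a long enough non-speech run."""
    above = [v >= th_s for v in s]
    tb = find_run_start(above, th_t1)                 # steps ST8, ST9
    if tb is None:
        return None
    below = [not a for a in above]
    te = find_run_start(below, th_t2, begin=tb)       # steps ST10, ST11
    return tb, te
```

Requiring Th_T1 and Th_T2 consecutive frames keeps a single spurious frame from opening or closing a segment.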

The start-end time correction unit 5 refers to the time series of the high-frequency-emphasized difference power calculated in step ST5. In the section from time Tb1 of frame b1, which precedes the provisional start time Tb’ detected in step ST9, to time Tb2 of frame b2, which follows Tb’, it searches forward along the time axis for a section in which frames whose high-frequency-emphasized difference power is at or above a threshold Th_P1 continue for at least the set number of frames Th_T1 (step ST12). The start-end time correction unit 5 then determines whether such a section was found in step ST12 (step ST13). If a section was found (step ST13; YES), the start-end time correction unit 5 takes the time of the first frame in that section, in the forward direction, at which the high-frequency-emphasized difference power is at or above Th_P1 as the start time Tb (step ST14). If no section was found (step ST13; NO), it uses the provisional start time Tb’ detected in step ST9 as the start time Tb (step ST15).

Next, the start-end time correction unit 5 again refers to the time series of the high-frequency-emphasized difference power calculated in step ST5. In the section from time Te2 of frame e2, which follows the provisional end time Te’ detected in step ST11, back to time Te1 of frame e1, which precedes Te’, it searches backward along the time axis for a section in which frames whose high-frequency-emphasized difference power is at or above the threshold Th_P1 continue for at least the set number of frames Th_T1 (step ST16). The start-end time correction unit 5 then determines whether such a section was found in step ST16 (step ST17). If a section was found (step ST17; YES), the start-end time correction unit 5 takes the time of the first frame in that section, in the backward direction, at which the high-frequency-emphasized difference power is at or above Th_P1 as the end time Te (step ST18). If no section was found (step ST17; NO), it uses the provisional end time Te’ detected in step ST11 as the end time Te (step ST19).

The start-end time correction unit 5 outputs the start time Tb obtained in step ST14 or ST15 and the end time Te obtained in step ST18 or ST19 as time information for the speech segment (step ST20) and ends processing.
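The correction in steps ST12 to ST19 can be sketched in frame indices as follows. The parameter names `before` and `after` (how many frames the search window extends on each side of the provisional endpoint) are hypothetical, as is the clamping of the window to the signal boundaries.

```python
def correct_start(p, tb_prov, before, after, th_p1, th_t1):
    """Corrected start frame Tb from difference power p (steps ST12 to ST15)."""
    lo = max(0, tb_prov - before)            # frame b1 (time Tb1)
    hi = min(len(p) - 1, tb_prov + after)    # frame b2 (time Tb2)
    run = 0
    for i in range(lo, hi + 1):              # forward search, step ST12
        run = run + 1 if p[i] >= th_p1 else 0
        if run >= th_t1:
            return i - th_t1 + 1             # first frame of the run, step ST14
    return tb_prov                           # nothing found, step ST15

def correct_end(p, te_prov, before, after, th_p1, th_t1):
    """Corrected end frame Te from difference power p (steps ST16 to ST19)."""
    lo = max(0, te_prov - before)            # frame e1 (time Te1)
    hi = min(len(p) - 1, te_prov + after)    # frame e2 (time Te2)
    run = 0
    for i in range(hi, lo - 1, -1):          # backward search, step ST16
        run = run + 1 if p[i] >= th_p1 else 0
        if run >= th_t1:
            return i + th_t1 - 1             # latest frame of the run, step ST18
    return te_prov                           # nothing found, step ST19
```

Searching the start window forward and the end window backward means the corrected endpoints are the outermost frames in each window that still carry sustained high-frequency energy, which is where a leading or trailing unvoiced consonant would sit.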

The thresholds Th_S, Th_P1, Th_T1, and Th_T2 are preset constants greater than or equal to 0.

Fig. 4 is a diagram showing the search sections used by the start-end time correction unit 5 of the speech segment detection device 10 of the first embodiment.

In Fig. 4, the horizontal axis indicates time and the vertical axis indicates the magnitude of the log-likelihood difference S between the speech GMM and the noise GMM. Time Tb’ is the provisional start time calculated in step ST9, and time Te’ is the provisional end time calculated in step ST11. Section A, which runs from time Tb1 of frame b1 preceding the provisional start time Tb’ to time Tb2 of frame b2 following it, is the search section in which the start-end time correction unit 5 searches in order to correct the start time. Arrow B indicates the direction in which the start-end time correction unit 5 searches section A, namely forward along the time axis.

Section C, which runs from time Te2 of frame e2 following the provisional end time Te’ back to time Te1 of frame e1 preceding it, is the search section in which the start-end time correction unit 5 searches in order to correct the end time. Arrow D indicates the direction in which the start-end time correction unit 5 searches section C, namely backward along the time axis.

As a specific example, time Tb1 is set 25 frames before the provisional start time Tb’, time Tb2 is set 10 frames after Tb’, time Te1 is set 10 frames before the provisional end time Te’, and time Te2 is set 30 frames after Te’. Alternatively, Tb2 may be set 0 frames from the provisional start time Tb’ and Te1 may be set 0 frames from the provisional end time Te’, so that the endpoints are never corrected toward the inside of the speech segment detected with the first feature.
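As a worked example of these offsets, the sketch below computes both search windows in frame indices. The function name and dictionary layout are hypothetical; the default offsets are the example values from the text, and with the 10-millisecond frame shift of the first embodiment 25 frames correspond to 250 milliseconds.

```python
def search_windows(tb_prov, te_prov,
                   start_before=25, start_after=10,
                   end_before=10, end_after=30):
    """Search windows around the provisional endpoints, in frame indices."""
    return {
        "start": (tb_prov - start_before, tb_prov + start_after),  # (Tb1, Tb2)
        "end": (te_prov - end_before, te_prov + end_after),        # (Te1, Te2)
    }
```

Setting `start_after=0` and `end_before=0` gives the alternative configuration, in which the start can only move earlier and the end can only move later.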

As described above, the first embodiment comprises: the first feature calculation unit 1, which calculates the first feature of the input signal; the second feature calculation unit 2, which calculates from the input signal the second feature, suited to detecting speech that the first feature has difficulty distinguishing from noise; the speech segment detection unit 4, which applies a pattern recognition technique to the first feature to discriminate speech from noise and calculates the provisional start time and provisional end time; and the start-end time correction unit 5, which uses the second feature to correct the provisional start time and provisional end time and obtains time information for the speech segment. The processing of the speech segment detection unit 4 suppresses detection of non-stationary noise with different spectral features as a speech segment, and the processing of the start-end time correction unit 5 suppresses missed detection of speech that spectral features have difficulty distinguishing from noise, so the detection accuracy of speech segments improves.

In addition, in the first embodiment the second feature calculation unit 2 calculates, as the second feature, the high-frequency-emphasized difference power, which is suited to detecting unvoiced consonants that spectral features have difficulty distinguishing from noise, and the start-end time correction unit 5 uses the time series of this difference power to correct the provisional start time and provisional end time and obtain time information for the speech segment. Missed detection of unvoiced consonants is thereby suppressed, and the detection accuracy of speech segments improves.

The first embodiment uses maximum likelihood estimation as an example for training the parameters of the speech GMM and noise GMM that make up the pattern recognition model stored in the pattern recognition model storage unit 3, but parameter training that actively discriminates speech from noise, such as maximum mutual information estimation, may also be applied.

The first embodiment uses one speech GMM and one noise GMM as the GMMs making up the pattern recognition model stored in the pattern recognition model storage unit 3, but a plurality of each may be used. In that case, the log-likelihood of the speech GMM may be the maximum or a weighted average of the log-likelihoods of the plurality of speech GMMs. Likewise, the log-likelihood of the noise GMM is the maximum or a weighted average of the log-likelihoods of the plurality of noise GMMs.
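Combining the log-likelihoods of several GMMs of one class could look like the sketch below. The function name is hypothetical, and how the weights for the weighted average would be chosen is not stated in the text, so they are left as a caller-supplied parameter.

```python
def combine_log_likelihoods(log_likelihoods, weights=None):
    """Combine per-model log-likelihoods into one value for a class.

    With weights=None the maximum is returned; otherwise the weighted
    average, normalized by the sum of the weights.
    """
    if weights is None:
        return max(log_likelihoods)
    total = sum(weights)
    return sum(w * l for w, l in zip(weights, log_likelihoods)) / total
```

The combined speech and noise values then take the places of Ls and Ln in equation (1).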

The first embodiment uses GMMs as the pattern recognition model stored in the pattern recognition model storage unit 3, but HMMs may be used instead. Pattern recognition techniques such as logistic regression models, support vector machines, and neural networks may also be used.

In the first embodiment the second feature calculation unit 2 calculates the high-frequency-emphasized difference power as the feature suited to detecting unvoiced consonants, but any feature suited to detecting unvoiced consonants, that is, any feature characteristic of unvoiced consonants, may be applied. For example, the power of the input signal may be calculated per frequency band, specifically the power below 2 kHz and the power at or above 2 kHz, and the ratio of the two powers used as the feature.
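The band power ratio mentioned as an alternative can be sketched with a direct discrete Fourier transform, which is slow but needs no external library. The function name, the exclusion of the DC bin, and the direction of the ratio (high band over low band) are assumptions; the text only says the ratio of the two powers is used.

```python
import cmath
import math

def band_power_ratio(frame, sample_rate, split_hz=2000.0):
    """Ratio of power at or above split_hz to power below it for one frame."""
    n = len(frame)
    low = high = 0.0
    for k in range(1, n // 2 + 1):            # positive-frequency bins, DC skipped
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        power = abs(coeff) ** 2
        if k * sample_rate / n < split_hz:    # centre frequency of bin k
            low += power
        else:
            high += power
    return high / (low + 1e-12)               # small floor avoids division by zero
```

A frame dominated by high-frequency energy, as expected for consonants such as s or sh, yields a ratio well above 1 under this definition.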

[Second Embodiment]

In the first embodiment, the start-end time correction unit 5 uses the preset threshold Th_P1 when comparing the high-frequency-emphasized difference power with a threshold. The second embodiment instead calculates the threshold against which the high-frequency-emphasized difference power is compared from the standard deviation of the high-frequency-emphasized difference power.

第5圖係顯示第二實施例的聲音區間檢測裝置10a的構成方塊圖。 Fig. 5 is a block diagram showing the configuration of the sound section detecting device 10a of the second embodiment.

第二實施例的聲音區間檢測裝置10a，追加設置臨界值算出部6至第一實施例所示的聲音區間檢測裝置10。 The sound section detecting device 10a of the second embodiment additionally adds the threshold value calculating unit 6 to the sound section detecting device 10 shown in the first embodiment.

Fig. 6 is a diagram showing the search section used by the start-end correction unit 5a of the speech segment detection device 10a of the second embodiment and the threshold calculation section used by the threshold calculation unit 6.

In the following, parts that are the same as or equivalent to components of the speech segment detection device 10 of the first embodiment are given the same reference signs as in the first embodiment, and their description is omitted or simplified.

The threshold calculation unit 6 calculates the threshold referenced by the start-end correction unit 5a from the second feature calculated by the second feature calculation unit 2, that is, the time series of the high-frequency-emphasized difference power, and from the provisional start time Tb' detected by the speech segment detection unit 4. Referring to Fig. 6, the threshold calculation unit 6 takes as time Tb0 the time that lies Tv frames before time Tb1 of frame b1, where frame b1 precedes the provisional start time Tb' in the time series. It then calculates the standard deviation sd of the high-frequency-emphasized difference power over the section E from time Tb0 to time Tb1 according to equation (2).

In equation (2), mp is the mean of the high-frequency-emphasized difference power over the section E from time Tb0 to time Tb1, p_i is the high-frequency-emphasized difference power at time i, and sqrt() denotes the square root function. The frame count Tv is a preset constant, for example 50 frames.
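Equation (2) itself is not reproduced in this text. A form consistent with the definitions above would be the following, where N denotes the number of frames in section E. The normalization by N (rather than N - 1) is an assumption, as the source does not show it.

```latex
sd = \mathrm{sqrt}\!\left( \frac{1}{N} \sum_{i=Tb0}^{Tb1} \left( p_i - mp \right)^2 \right),
\qquad
mp = \frac{1}{N} \sum_{i=Tb0}^{Tb1} p_i
```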

Using the standard deviation sd of the high-frequency-emphasized difference power calculated by equation (2), the threshold calculation unit 6 calculates the start-end correction threshold Th_P2 according to equation (3) below.

Th_P2 = α * sd + β (3)

In equation (3), α and β are predetermined constants greater than or equal to 0. The start-end correction threshold Th_P2 calculated by the threshold calculation unit 6 is output to the start-end correction unit 5a.
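The threshold calculation of equations (2) and (3) can be sketched as below. This is an illustration under stated assumptions: the function name `correction_threshold`, the default values of `alpha` and `beta`, and the use of the population standard deviation (`statistics.pstdev`) are not taken from the source, which only requires α and β to be constants of 0 or more and gives 50 frames as an example for Tv.

```python
from statistics import pstdev


def correction_threshold(power, tb1, tv=50, alpha=2.0, beta=0.5):
    """Return Th_P2 = alpha * sd + beta for section E = [tb1 - tv, tb1].

    power is the per-frame high-frequency-emphasized difference power.
    alpha and beta are illustrative values only (assumption).
    """
    tb0 = max(0, tb1 - tv)           # clamp at the start of the signal
    section_e = power[tb0:tb1 + 1]   # frames Tb0..Tb1 inclusive
    sd = pstdev(section_e)           # population standard deviation (assumption)
    return alpha * sd + beta
```

With a perfectly flat noise floor the standard deviation is 0 and the threshold falls to β, which matches the stated aim of using a low threshold in stationary noise.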

Next, the operation of the speech segment detection device 10a is described.

Figs. 7A and 7B are flowcharts showing the operation of the speech segment detection device 10a of the second embodiment.

In the following, steps that are the same as or equivalent to those of the speech segment detection device 10 of the first embodiment are given the same reference signs as in Figs. 3A and 3B, and their description is omitted or simplified.

When the speech segment detection unit 4 detects the provisional end time Te' of the speech in step ST11, the threshold calculation unit 6 calculates, as time Tb0, the time that lies Tv frames before time Tb1 of frame b1, where frame b1 precedes the provisional start time Tb' detected in step ST9 (step ST31). For the section from time Tb0 calculated in step ST31 to time Tb1, the threshold calculation unit 6 calculates the standard deviation sd of the high-frequency-emphasized difference power according to equation (2) above (step ST32). Using the standard deviation sd calculated in step ST32, the threshold calculation unit 6 then calculates the start-end correction threshold Th_P2 according to equation (3) above (step ST33).

Referring to the time series of the high-frequency-emphasized difference power calculated in step ST5, the start-end correction unit 5a searches the section from time Tb1 of frame b1, which precedes the provisional start time Tb' detected in step ST9, to time Tb2 of frame b2, which follows Tb'. It searches in the forward direction of the time axis for a section in which frames whose high-frequency-emphasized difference power is at or above the start-end correction threshold Th_P2 calculated in step ST33 continue for at least the preset frame-count threshold Th_T1 (step ST34).

The start-end correction unit 5a determines whether such a section was found in step ST34 (step ST35). If a section was found (step ST35; YES), the start-end correction unit 5a takes, as the start time Tb, the time of the frame in that section at which the high-frequency-emphasized difference power first reaches or exceeds the start-end correction threshold Th_P2 (step ST36). If no section was found (step ST35; NO), the start-end correction unit 5a uses the provisional start time Tb' detected in step ST9 as the start time Tb (step ST15).
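Steps ST34 to ST36, with the fallback of step ST15, can be sketched as a forward scan for a run of consecutive frames at or above the threshold. The function name `correct_start` and the use of integer frame indices as times are assumptions made for illustration.

```python
def correct_start(power, tb_prov, tb1, tb2, th_p2, th_t1):
    """Return the corrected start frame Tb, or tb_prov if no run is found.

    Scans frames tb1..tb2 forward for at least th_t1 consecutive frames
    whose power is >= th_p2 and returns the first frame of that run.
    """
    run_start = None
    run_len = 0
    for t in range(tb1, tb2 + 1):
        if power[t] >= th_p2:
            if run_len == 0:
                run_start = t      # first frame at or above the threshold
            run_len += 1
            if run_len >= th_t1:
                return run_start   # step ST36
        else:
            run_len = 0            # run broken; start over
    return tb_prov                 # step ST15: keep the provisional start
```

Requiring a run of Th_T1 frames, rather than a single frame, keeps an isolated noise spike inside the search section from moving the start time.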

Next, referring to the time series of the high-frequency-emphasized difference power calculated in step ST5, the start-end correction unit 5a searches the range from time Te2 of frame e2, which follows the provisional end time Te' detected in step ST11, back to time Te1 of frame e1, which precedes Te'. It searches in the reverse direction of the time axis for a section in which frames whose high-frequency-emphasized difference power is at or above the start-end correction threshold Th_P2 continue for at least the preset frame-count threshold Th_T1 (step ST37). The start-end correction unit 5a determines whether such a section was found in step ST37 (step ST38). If a section was found (step ST38; YES), the start-end correction unit 5a takes, as the end time Te, the time of the frame in that section at which the high-frequency-emphasized difference power first reaches or exceeds the start-end correction threshold Th_P2 (step ST39). If no section was found (step ST38; NO), the start-end correction unit 5a uses the provisional end time Te' detected in step ST11 as the end time Te (step ST19).
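The end-time correction of steps ST37 to ST39, with the fallback of step ST19, mirrors the start-time search but scans backward. In this sketch the function name `correct_end` is hypothetical, and "first" is read as first in scan order, so the returned frame is the latest frame of the run; that reading is an assumption.

```python
def correct_end(power, te_prov, te1, te2, th_p2, th_t1):
    """Return the corrected end frame Te, or te_prov if no run is found.

    Scans frames te2 down to te1 for at least th_t1 consecutive frames
    whose power is >= th_p2 and returns the first frame met in scan order.
    """
    run_end = None
    run_len = 0
    for t in range(te2, te1 - 1, -1):
        if power[t] >= th_p2:
            if run_len == 0:
                run_end = t        # latest frame of the candidate run
            run_len += 1
            if run_len >= th_t1:
                return run_end     # step ST39
        else:
            run_len = 0            # run broken; start over
    return te_prov                 # step ST19: keep the provisional end
```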

The start-end correction unit 5a outputs the start time Tb obtained in step ST36 or step ST15 and the end time Te obtained in step ST39 or step ST19 as the time information of the speech segment (step ST20), and ends the process.

As described above, the second embodiment includes the threshold calculation unit 6 and the start-end correction unit 5a. The threshold calculation unit 6 takes as time Tb0 the time that lies Tv frames before time Tb1 of frame b1, which precedes the provisional start time Tb', and calculates the start-end correction threshold Th_P2 from the standard deviation sd of the high-frequency-emphasized difference power over the section from time Tb0 to time Tb1. The start-end correction unit 5a corrects the provisional start time and the provisional end time based on the calculated threshold Th_P2 and the time series of the high-frequency-emphasized difference power, and obtains the time information of the speech segment. In a stationary noise environment, where the standard deviation of the high-frequency-emphasized difference power is small, a low start-end correction threshold can therefore be set, which improves detection of weak unvoiced consonants. In a non-stationary noise environment, where the standard deviation is large, a high start-end correction threshold can be set, which suppresses false detection of noise as speech.

[Third embodiment]

In the third embodiment, the start and end times are corrected by taking into account not only the time series of the high-frequency-emphasized difference power calculated by the second feature calculation unit 2 but also the time series of the log-likelihood difference S calculated by the speech segment detection unit 4.

Fig. 8 is a block diagram showing the configuration of the speech segment detection device 10b of the third embodiment.

The speech segment detection device 10b of the third embodiment has the same configuration as the speech segment detection device 10a of the second embodiment. In the following, parts that are the same as or equivalent to components of the speech segment detection device 10a of the second embodiment are given the same reference signs as in the second embodiment, and their description is omitted or simplified.

As in the first and second embodiments, the speech segment detection unit 4 outputs the provisional start time Tb' and the provisional end time Te' to the start-end correction unit 5b. The speech segment detection unit 4 also outputs to the start-end correction unit 5b the log-likelihood difference S between the speech GMM and the noise GMM calculated for each frame according to equation (1) above, that is, the time series of the log-likelihood difference S. As in the second embodiment, the threshold calculation unit 6 calculates the start-end correction threshold Th_P2, which is the threshold referenced by the start-end correction unit 5b, from the time series of the high-frequency-emphasized difference power input from the second feature calculation unit 2 and the provisional start time Tb' detected by the speech segment detection unit 4.

The start-end correction unit 5b corrects the provisional start time Tb' and the provisional end time Te' detected by the speech segment detection unit 4, and obtains the start time Tb and the end time Te. The correction is based on the time series of the high-frequency-emphasized difference power input from the second feature calculation unit 2, the time series of the log-likelihood difference S input from the speech segment detection unit 4, and the start-end correction threshold Th_P2 input from the threshold calculation unit 6.

Next, the operation of the speech segment detection device 10b is described.

Figs. 9A and 9B are flowcharts showing the operation of the speech segment detection device 10b of the third embodiment.

In the following, steps that are the same as those of the speech segment detection device 10a of the second embodiment are given the same reference signs as in Figs. 7A and 7B, and their description is omitted or simplified.

When the threshold calculation unit 6 has calculated the start-end correction threshold Th_P2 in step ST33, the start-end correction unit 5b refers to the time series of the high-frequency-emphasized difference power calculated in step ST5 and the time series of the log-likelihood difference S calculated in step ST7. It searches the section from time Tb1 of frame b1, which precedes the provisional start time Tb' of the speech segment detected in step ST9, to time Tb2 of frame b2, which follows Tb'. It searches in the forward direction of the time axis for a section in which frames whose high-frequency-emphasized difference power is at or above the start-end correction threshold Th_P2 calculated in step ST33, and whose log-likelihood difference S is at or above the threshold Th_S2, continue for at least the preset frame-count threshold Th_T1 (step ST41).

Here, the threshold Th_S2 is a preset constant that is greater than or equal to 0 and smaller than the threshold Th_S.

The start-end correction unit 5b determines whether such a section was found in step ST41 (step ST42). If a section was found (step ST42; YES), the start-end correction unit 5b takes, as the start time Tb, the time of the first frame in that section at which the high-frequency-emphasized difference power is at or above the start-end correction threshold Th_P2 and the log-likelihood difference S is at or above the threshold Th_S2 (step ST43). If no section was found (step ST42; NO), the start-end correction unit 5b uses the provisional start time Tb' detected in step ST9 as the start time Tb (step ST15).
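The joint condition of steps ST41 to ST43 can be sketched as the same forward run search with a two-feature test per frame. The function name `correct_start_joint`, the list `llr` holding the per-frame log-likelihood difference S, and the use of integer frame indices as times are assumptions made for illustration.

```python
def correct_start_joint(power, llr, tb_prov, tb1, tb2, th_p2, th_s2, th_t1):
    """Return the corrected start frame Tb, or tb_prov if no run is found.

    A frame qualifies only if power >= th_p2 AND llr >= th_s2; the run
    must last at least th_t1 consecutive frames.
    """
    run_start = None
    run_len = 0
    for t in range(tb1, tb2 + 1):
        if power[t] >= th_p2 and llr[t] >= th_s2:
            if run_len == 0:
                run_start = t      # first frame satisfying both conditions
            run_len += 1
            if run_len >= th_t1:
                return run_start   # step ST43
        else:
            run_len = 0            # either feature fell below its threshold
    return tb_prov                 # step ST15: keep the provisional start
```

Because both features must clear their thresholds on the same frames, Th_S2 can be set below Th_S without the search latching onto frames where only one feature is elevated.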

Next, the start-end correction unit 5b refers to the time series of the high-frequency-emphasized difference power calculated in step ST5 and the time series of the log-likelihood difference S calculated in step ST7. It searches the section from time Te2 of frame e2, which follows the provisional end time Te' detected in step ST11, back to time Te1 of frame e1, which precedes Te'. It searches in the reverse direction of the time axis for a section in which frames whose high-frequency-emphasized difference power is at or above the start-end correction threshold Th_P2, and whose log-likelihood difference S is at or above the preset threshold Th_S2, continue for at least the preset frame-count threshold Th_T1 (step ST44).

The start-end correction unit 5b determines whether such a section was found in step ST44 (step ST45). If a section was found (step ST45; YES), the start-end correction unit 5b takes, as the end time Te, the time of the first frame in that section at which the high-frequency-emphasized difference power is at or above the start-end correction threshold Th_P2 and the log-likelihood difference S is at or above the threshold Th_S2 (step ST46). If no section was found (step ST45; NO), the start-end correction unit 5b uses the provisional end time Te' detected in step ST11 as the end time Te (step ST19).

The start-end correction unit 5b outputs the start time Tb obtained in step ST43 or step ST15 and the end time Te obtained in step ST46 or step ST19 as the time information of the speech segment (step ST20), and ends the process.

As described above, setting the threshold Th_S2 to a value smaller than the threshold Th_S makes it easier to detect weak unvoiced consonants and similar sounds that could not be detected when the provisional start time Tb' and the provisional end time Te' were detected. If the search were performed using only the time series of the log-likelihood difference S, without the time series of the high-frequency-emphasized difference power, and with Th_S2 set smaller than Th_S, the likelihood of detecting noise would increase. By using both time series and correcting the provisional start time Tb' and the provisional end time Te' only when both features are at or above their thresholds, the correction accuracy can be improved.

The start-end correction unit 5b corrects the start and end times using the log-likelihood difference in addition to the high-frequency-emphasized difference power, which makes it easier to detect weak unvoiced consonants and similar sounds that could not be detected when the provisional start time was detected. However, correcting the start and end times using only the log-likelihood difference with a low threshold increases the likelihood of falsely detecting noise as speech. The log-likelihood difference is therefore used together with another feature, and the start and end times are corrected only when both features are at or above their thresholds, which improves the correction accuracy.

As described above, the third embodiment includes the start-end correction unit 5b, which corrects the provisional start time and the provisional end time detected by the speech segment detection unit 4. The correction is based on the time series of the high-frequency-emphasized difference power calculated by the second feature calculation unit 2, the time series of the log-likelihood difference calculated by the speech segment detection unit 4, and the start-end correction threshold input from the threshold calculation unit 6. This suppresses corrections that falsely detect noise as speech and improves the accuracy with which the start point and end point of the speech are corrected.

Also, according to the third embodiment, the threshold Th_S2 is set to a value smaller than the threshold Th_S, which makes it easier to detect weak unvoiced consonants and similar sounds that could not be detected when the provisional start time Tb' and the provisional end time Te' were detected.

In the third embodiment described above, the start-end correction unit 5b is applied to the speech segment detection device 10a of the second embodiment, but it may instead be applied to the speech segment detection device 10 of the first embodiment.

In the first to third embodiments described above, detection of unvoiced consonants was described as an example of speech that is difficult to distinguish from noise using the first feature. Besides unvoiced consonants, devoiced vowels may also be detected. Furthermore, where voiced consonants, such as the consonant part of a voiced sound, or vowels are pronounced indistinctly and are difficult to distinguish from noise using the first feature, the device may be configured to detect such speech.

In addition to the above, within the scope of the invention, the embodiments may be freely combined, any component of any embodiment may be modified, and any component may be omitted from any embodiment.

[Industrial applicability]

The speech segment detection device according to the present invention is applicable to devices that require speech segment detection, such as speech recognition devices. It prevents non-stationary noise from being falsely detected as speech and improves the detection accuracy of unvoiced consonants at the beginnings and ends of words.

4‧‧‧Speech segment detection unit

5‧‧‧Start-end correction unit

10‧‧‧Speech segment detection device

Claims

A speech segment detection device comprising: a first feature calculation unit that calculates, from an input signal, a first feature representing a spectral characteristic; a second feature calculation unit that calculates, from the input signal, a second feature representing a speech characteristic different from the first feature; a speech segment detection unit that, based on the first feature calculated by the first feature calculation unit and using an identification model for distinguishing speech from noise included in the input signal, detects a start time indicating the start point and an end time indicating the end point of a speech segment included in the input signal; and a start-end correction unit that corrects the start time and the end time detected by the speech segment detection unit based on a comparison between the second feature calculated by the second feature calculation unit and a threshold.

The speech segment detection device according to claim 1, further comprising a threshold calculation unit that calculates the standard deviation of the second feature over a section going back a predetermined time from the start time detected by the speech segment detection unit, and calculates the threshold based on the calculated standard deviation of the second feature.

The speech segment detection device according to claim 1, wherein the speech segment detection unit calculates, by referring to the identification model, a likelihood difference between a speech model that models speech and a noise model that models noise; and the start-end correction unit corrects the start time and the end time detected by the speech segment detection unit based on, in addition to the comparison between the second feature and the threshold, a comparison between the likelihood difference calculated by the speech segment detection unit and a threshold.

The speech segment detection device according to claim 1, wherein the second feature calculation unit calculates the second feature representing a characteristic of unvoiced consonants among the speech included in the input signal.

A speech segment detection method comprising: a first feature calculation step in which a first feature calculation unit calculates, from an input signal, a first feature representing a spectral characteristic; a second feature calculation step in which a second feature calculation unit calculates, from the input signal, a second feature representing a speech characteristic different from the first feature; a detection step in which a speech segment detection unit, based on the first feature and using an identification model for distinguishing speech from noise included in the input signal, detects a start time indicating the start point and an end time indicating the end point of a speech segment included in the input signal; and a correction step in which a start-end correction unit corrects the start time and the end time based on a comparison between the second feature and a threshold.