JPS62211698A

JPS62211698A - Detection of voice section

Info

Publication number: JPS62211698A
Application number: JP61054500A
Authority: JP
Inventors: 広之野戸
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1986-03-12
Filing date: 1986-03-12
Publication date: 1987-09-17
Also published as: JPH0556512B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing was introduced, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a speech signal processing method, and in particular to a speech section detection method for use in a speech recognition device.

(Prior Art) One previously proposed speech section detection method is disclosed in Japanese Unexamined Patent Application Publication No. S60-114900. That method exploits the difference between the power spectra of noise sections and speech sections: a pre-registered noise power spectrum is subtracted from the input power spectrum, and detection is based on the magnitude of the remaining power spectrum.

This conventional method is briefly described below with reference to FIG. 2.

FIG. 2 is a block diagram for explaining the conventional speech section detection method. The input speech signal picked up by a microphone 101 is decomposed into the input power of each band by low-, mid-, and high-band filters 102, 103, and 104 and a rectifying and smoothing section 105 for each band. These three input powers pass through a multiplexer 106 and enter an environmental noise learning section 107 and an environmental noise removing section 108. The environmental noise learning section 107 stores noise power learned in advance; the environmental noise removing section 108 reads this noise power and subtracts it from the input power. The difference power spectrum left by this subtraction enters an energy-based determination section 109, where it is compared with a threshold preset in an energy threshold memory 110 to make a first voiced/silent decision. When this decision cannot distinguish voiced from silent, a determination section 111 in the next stage, which uses a statistical distance measure, calculates the distance between the difference power spectrum and the voiced and unvoiced consonants stored in advance in a standard pattern memory 112 and makes a second voiced/silent decision, so that speech recognition with a high recognition rate can be performed.

(Problems to Be Solved by the Invention) With this conventional speech section detection method, however, if the nature of the noise, for example the noise power, changes between the time the noise is measured and the time speech is detected, false detections occur or detection performance deteriorates markedly. Avoiding this requires re-registering the noise power and other quantities on which the decision is based, and the method also requires a speech recognition technique with a high discrimination rate for speech.
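
The weakness described above can be illustrated with a minimal Python sketch of the first-stage decision in the conventional method. The function name, the clipping of negative residuals at zero, and the summing of the residual over bands are assumptions made for illustration; the text only states that the registered noise power is subtracted from the input power and the remainder is compared with a preset threshold.

```python
def conventional_is_voiced(input_power, registered_noise_power, energy_threshold):
    """First-stage voiced/silent decision by noise subtraction (illustrative).

    Assumptions not taken from the source: negative residuals are clipped
    to zero, and the residual is summed over the bands before thresholding.
    """
    residual = [max(p - n, 0.0) for p, n in zip(input_power, registered_noise_power)]
    return sum(residual) > energy_threshold


registered = [4.0, 4.0, 4.0]   # per-band noise power stored at registration time
noise_then = [4.5, 4.0, 3.5]   # noise-only frame while the noise is unchanged
noise_now = [8.0, 8.0, 8.0]    # noise-only frame after the noise power has grown

print(conventional_is_voiced(noise_then, registered, 2.0))  # False
print(conventional_is_voiced(noise_now, registered, 2.0))   # True: a false detection
```

Once the stored noise power is out of date, a frame containing only noise exceeds the threshold, which is why the registered values would have to be measured again.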

The present invention has been made to solve these problems: the vulnerability to changes in the nature of the noise and to power fluctuations, the need for noise registration, and the need for a speech discrimination technique.

Accordingly, an object of the present invention is to provide a speech section detection method that can detect speech sections accurately even under high noise and that can be used without attention to noise changes or noise registration.

(Means for Solving the Problems) To achieve this object, the speech section detection method of the present invention includes the following processes.

First, it includes a process that, for each preset fixed frame length, uses the input feature pattern (a set of feature values representing the characteristics of the input signal) and determines the time sections other than those judged to be speech section candidates to be noise sections.

It also includes a process of learning, from the input feature patterns within this noise section, a set of characteristic values representative of the noise as a noise standard pattern. In this case it is preferable, for example, to treat the input feature pattern in the noise section as a set of noise feature values (hereinafter, the noise feature pattern) and to learn the set consisting of the average shape and the average fluctuation width of this noise feature pattern as the noise standard pattern. The noise standard pattern is preferably learned only within noise sections while tracking changes in the nature of the noise, for example slow changes (this learning is called section-limited tracking learning).

It further includes a process of detecting the speech section candidate as a speech section on the basis of the magnitude relationship between a threshold and the inter-pattern distance between the input feature pattern and the noise standard pattern. In this case it is preferable to learn, by section-limited tracking learning, the set consisting of the average value and the average fluctuation width of the inter-pattern distance (hereinafter sometimes simply called the distance) within noise periods as a set of characteristic values representative of the distance during noise (called the noise-time distance standard pattern). The threshold is preferably a value calculated from the noise-time distance standard pattern thus obtained.

More specifically, letting $T$ be the threshold, $\bar{D}$ the estimated mean of the inter-pattern distance in noise sections, $\bar{E}$ its estimated mean deviation, and $C$ an arbitrary constant, the threshold is preferably calculated according to

$$T = \bar{D} + C\,\bar{E}.$$

As for the noise standard pattern, specifically, letting $i$ be discrete time, $f_i$ the characteristic value of the input feature pattern that is the object of learning, $F_i$ the corresponding characteristic value in the noise standard pattern, and $K$ a constant with $K > 1$, a preferable learning method calculates

$$F_i = \frac{(K-1)\,F_{i-1}}{K} + \frac{f_i}{K}$$

within noise sections and

$$F_i = F_{i-1}$$

within speech sections.

Likewise, for the noise-time distance standard pattern, letting $i$ be discrete time, $g_i$ the characteristic value of the inter-pattern distance that is the object of learning, $G_i$ the corresponding characteristic value in the noise-time distance standard pattern, and $L$ a constant with $L > 1$, a preferable learning method calculates

$$G_i = \frac{(L-1)\,G_{i-1}}{L} + \frac{g_i}{L}$$

within noise sections and

$$G_i = G_{i-1}$$

within speech sections.
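
The two learning rules share one recursion, which can be sketched in Python as follows. The function name `tracking_update` and the sample values are illustrative assumptions; only the recursion itself and the rule of holding the value during speech come from the text.

```python
def tracking_update(previous, sample, k, in_noise_section):
    """One step of section-limited tracking learning for a single value."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    if not in_noise_section:
        return previous                          # speech section: value is held
    return ((k - 1) * previous) / k + sample / k  # noise section: value is updated


level = 10.0
frames = [(20.0, True), (20.0, True), (90.0, False), (20.0, True)]
for sample, in_noise in frames:
    level = tracking_update(level, sample, k=4, in_noise_section=in_noise)
    print(level)
```

Each update keeps a fraction (K-1)/K of the previous value, so a larger constant makes the learned value follow the noise more slowly, and the loud third frame has no effect because it lies in a speech section.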

With the learning of the noise standard pattern and the noise-time distance standard pattern described above, speech sections can be detected accurately without any need to learn the noise anew. Even when a new noise source appears, provided it persists longer than a speech section, it is learned, the method returns to a state in which speech can again be detected, and speech sections are detected accurately.

Furthermore, in making the above determination it is preferable, in addition to comparing the inter-pattern distance with the threshold, to judge as a true speech section, and detect, only a speech section candidate that satisfies various features relating to the utterance length of speech. Because features relating to the utterance length are used, speech sections can then be detected without malfunction even in the presence of impulsive noise.

In this case, the features relating to the utterance length of speech can be, for example:

(a) the time during which the inter-pattern distance exceeds the threshold is about 40 ms or more;

(b) the time during which the inter-pattern distance falls below the threshold is about 400 ms or less; and

(c) no word is uttered continuously for about 4 seconds or more.

A speech section candidate is then judged to be a speech section when it satisfies at least one of the combination of features (a) and (b) and feature (c).

It is also preferable to include a process of determining a speech section candidate that was not detected as a speech section on the basis of the magnitude relationship with the threshold to be a noise section.

(Operation) With this configuration, speech section candidates and noise sections are detected in the speech input signal, learning is performed within the noise sections, and each speech section candidate is judged to be a speech section or a noise section according to the magnitude relationship between the inter-pattern distance and the threshold. No special noise-learning period is therefore needed, no special recognition technique of the conventional kind is needed, and speech sections can still be detected accurately.

(Embodiment) An embodiment of the speech section detection method of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a device used to explain the embodiment of the speech section detection method of the present invention. The configuration and basic processing of this device are first described briefly.

In FIG. 1, 201 is an input terminal to which a speech input signal I is applied. The speech input signal I passes through a low-pass filter 202 with a cutoff of, for example, 4 kHz. The next-stage circuit 203 then samples and holds it at a sampling frequency of, for example, 10 kHz and A/D-converts it into digital speech data D1, which is output to the following preprocessing signal processor 204.

The preprocessing signal processor 204 applies to the speech data D1 a bank of 20 IIR single-resonance digital band-pass filters with Q = 6 whose center frequencies run from 250 Hz to 4.0 kHz at 1/5-octave intervals. The absolute values of each filter output are averaged over, for example, 128 points at a time, and the resulting signal is sent as the input feature pattern D2 to the microprocessor 205 for speech section detection and speech recognition.
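
The averaging step can be sketched as follows, assuming the band-pass filter outputs are already available as one list of samples per channel. The filter bank itself is not implemented here, and the names `frame_features`, `FRAME_SAMPLES`, and `SAMPLE_RATE_HZ` are illustrative. The 10 kHz rate and the 128-point average are the example values given in the text.

```python
SAMPLE_RATE_HZ = 10_000   # example sampling frequency from the text
FRAME_SAMPLES = 128       # example averaging length from the text


def frame_features(channel_outputs):
    """Average |filter output| over consecutive 128-sample blocks.

    channel_outputs: one list of samples per band-pass channel.
    Returns one list per frame holding a feature value for every channel.
    """
    n_frames = min(len(ch) for ch in channel_outputs) // FRAME_SAMPLES
    frames = []
    for i in range(n_frames):
        lo = i * FRAME_SAMPLES
        hi = lo + FRAME_SAMPLES
        frames.append([sum(abs(v) for v in ch[lo:hi]) / FRAME_SAMPLES
                       for ch in channel_outputs])
    return frames


frame_duration_ms = 1000 * FRAME_SAMPLES / SAMPLE_RATE_HZ

# Two toy channels, 256 samples each, giving two frames.
toy = [[1.0, -1.0] * 128, [2.0] * 256]
print(frame_features(toy))
print(frame_duration_ms)
```

At 10 kHz, 128 samples span 12.8 ms, so each frame of the input feature pattern covers roughly 13 ms of signal.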

The microprocessor 205 for speech section detection and speech recognition performs the various processes described later to detect speech sections, performs speech recognition, and outputs the recognition result R to an output terminal 206.

Next, an outline of the embodiment of the speech section detection process of the present invention is given with reference to the block diagram of FIG. 1 and the flowchart of FIG. 3. In the flowchart, processing steps are denoted by S.

First, the device is started and processing begins (S1).

Next, the various variables are initialized (S2), after which the speech input signal I is received at the input terminal 201 (S3).

The speech input signal I is converted into speech data D1 by the circuit 203 and output, and the signal processor 204 extracts the features of the input signal.

Here the input feature pattern is obtained as, for example, 20 numerical values (one per channel) per frame of about 13 ms; the feature value of the $j$-th channel in the $i$-th frame is denoted $x_{j,i}$.

Next, the inter-pattern distance $D_i$ between this input feature pattern and the noise standard pattern is calculated (S5).

For this calculation, the threshold $T_i$ for the $i$-th frame is assumed to have been obtained already in the speech section detection process for the immediately preceding $(i-1)$-th frame. In this embodiment, speech section candidates are first determined using the input feature pattern, and the remaining sections are determined to be noise sections. The noise standard pattern, which summarizes the input feature patterns of the noise sections, is represented by the mean and the mean deviation of the feature value in each channel. Let the characteristic value $A_{j,i-1}$ be the mean, and the characteristic value $O_{j,i-1}$ the mean deviation, of the noise standard pattern for the $j$-th channel, obtained by learning the noise standard pattern only within noise sections (section-limited tracking learning), up to the $(i-1)$-th frame, while tracking slow changes in the nature of the noise. The inter-pattern distance $D_i$ between the noise standard pattern and the input feature pattern in the $i$-th frame is then calculated by the following equation. [Equation not preserved in this text.]

Here $L(x)$ is a ramp function, given by the following equation. [Equation not preserved in this text.]

Next, this distance $D_i$ is compared with the threshold $T_i$ for the $i$-th frame, calculated from the noise-time distance standard pattern learned up to the $(i-1)$-th frame, and the speech section is determined using the following criteria, which are based on various features relating to the utterance length of speech (S6).

(1) If a section in which the distance $D_i$ exceeds the threshold $T_i$ lasts 3 frames (about 40 ms) or more, that section is taken as a speech section candidate.

(2) If the distance $D_i$ stays below the threshold $T_i$ for 30 frames (about 400 ms) or more from the end of a speech section candidate, that candidate is determined to be a speech section.

(3) If a section in which the distance $D_i$ exceeds the threshold $T_i$ lasts 300 frames (about 4 s) or more, or if criteria (1) and (2) are not met, the section is determined to be a noise section in every case.
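
The three duration criteria can be sketched as a small state machine over per-frame comparison results. This is a simplified offline illustration rather than the processing of the embodiment: it assumes the comparison of $D_i$ with $T_i$ is already available as a list of booleans, it lets any above-threshold frame inside an open candidate extend that candidate, it applies the 300-frame limit to a single uninterrupted run, and it leaves a candidate that is still open at the end of the input unconfirmed. The function and constant names are hypothetical.

```python
MIN_SPEECH_FRAMES = 3     # criterion (1): about 40 ms
END_SILENCE_FRAMES = 30   # criterion (2): about 400 ms
MAX_SPEECH_FRAMES = 300   # criterion (3): about 4 s


def find_speech_sections(above):
    """Return (start, end) frame pairs, end exclusive, judged to be speech.

    above[i] is True when the distance of frame i exceeds its threshold.
    """
    sections = []
    start = None         # first frame of the open candidate, if any
    last_above = None    # most recent above-threshold frame in the candidate
    run = 0              # length of the current above-threshold run
    rejected = False     # set when criterion (3) marks the candidate as noise
    for i, flag in enumerate(above):
        if flag:
            run += 1
            if start is None and run >= MIN_SPEECH_FRAMES:
                start = i - run + 1              # criterion (1): candidate opens
            if start is not None:
                last_above = i
                if run >= MAX_SPEECH_FRAMES:
                    rejected = True              # criterion (3): too long
        else:
            run = 0
            if start is not None and i - last_above >= END_SILENCE_FRAMES:
                if not rejected:
                    sections.append((start, last_above + 1))  # criterion (2)
                start, last_above, rejected = None, None, False
    return sections


word = [False] * 5 + [True] * 10 + [False] * 40
click = [False] * 5 + [True] * 2 + [False] * 40
print(find_speech_sections(word))   # [(5, 15)]
print(find_speech_sections(click))  # []
```

The two-frame burst never opens a candidate, which is how the duration criteria keep impulsive noise from being reported as speech.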

On the basis of the results of the determination under criteria (1) to (3), the noise-time distance standard pattern is learned by section-limited tracking learning in order to obtain the threshold $T_{i+1}$ for the next, $(i+1)$-th, frame. This learning uses a method that tracks changes in the nature of the noise, described below.

First, the case in which the $i$-th frame is determined to be part of a noise section is described. In this case the data so judged are regarded as the latest noise data, and learning is performed by updating the noise standard pattern and the noise-time distance standard pattern. The update follows the equations below.

Let $i$ be discrete time (equal to the frame number), $f_i$ the characteristic value of the input feature pattern that is the object of learning, $g_i$ the characteristic value of the inter-pattern distance that is the object of learning, $F_i$ the characteristic value in the noise standard pattern, $G_i$ the characteristic value in the noise-time distance standard pattern, and $K$ and $L$ constants with $K > 1$ and $L > 1$. Within a noise section the characteristic value $F_i$ is

$$F_i = \frac{(K-1)\,F_{i-1}}{K} + \frac{f_i}{K}$$

and the characteristic value $G_i$ within the same section is

$$G_i = \frac{(L-1)\,G_{i-1}}{L} + \frac{g_i}{L}.$$

Accordingly, the concrete characteristic values of the noise standard pattern are

$$A_{j,i} = \frac{(K_1-1)\,A_{j,i-1}}{K_1} + \frac{x_{j,i}}{K_1}, \qquad K_1 > 1,$$

$$O_{j,i} = \frac{(K_2-1)\,O_{j,i-1}}{K_2} + \frac{\lvert x_{j,i} - A_{j,i-1} \rvert}{K_2}, \qquad K_2 > 1,$$

and the concrete characteristic values of the noise-time distance standard pattern, $\bar{D}_i$ (the mean of the inter-pattern distance over the noise sections up to the $i$-th frame) and $\bar{E}_i$ (the mean deviation of the inter-pattern distance over the noise sections up to the $i$-th frame), are

$$\bar{D}_i = \frac{(K_3-1)\,\bar{D}_{i-1}}{K_3} + \frac{D_i}{K_3}, \qquad K_3 > 1,$$

$$\bar{E}_i = \frac{(K_4-1)\,\bar{E}_{i-1}}{K_4} + \frac{\lvert D_i - \bar{D}_{i-1} \rvert}{K_4}, \qquad K_4 > 1.$$

$K_1$ to $K_4$ are arbitrary constants that set how closely learning tracks the noise and are determined empirically; the smaller they are, the faster the noise fluctuations that can be followed. If they are too small, however, speech whose onset is gradual is judged to be noise.
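
A minimal sketch of the per-channel update of the noise standard pattern for a frame judged to be noise is given below. It assumes that both the mean and the mean deviation are smoothed with a recursion of the form ((K-1) x previous + new) / K and that the deviation is measured against the previous mean. The function name and the numeric values are illustrative only.

```python
def update_noise_pattern(mean, deviation, frame, k1, k2):
    """Update the per-channel mean and mean deviation from one noise frame.

    mean, deviation, frame: lists with one entry per channel.
    k1, k2: tracking constants, both greater than 1.
    """
    new_mean = [((k1 - 1) * a + x) / k1 for a, x in zip(mean, frame)]
    # The deviation is taken relative to the mean learned before this frame.
    new_deviation = [((k2 - 1) * o + abs(x - a)) / k2
                     for o, x, a in zip(deviation, frame, mean)]
    return new_mean, new_deviation


mean, deviation = update_noise_pattern([10.0, 6.0], [2.0, 1.0], [14.0, 6.0], k1=4, k2=4)
print(mean)       # [11.0, 6.0]
print(deviation)  # [2.5, 0.75]
```

For a frame judged to be a speech section candidate the function would simply not be called, which leaves the learned pattern unchanged.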

Next, the case in which the $i$-th frame is determined to be a speech section candidate is described. In this case the noise standard pattern and the noise-time distance standard pattern are not updated. For the noise standard pattern, therefore,

$$A_{j,i} = A_{j,i-1}, \qquad O_{j,i} = O_{j,i-1},$$

and for the noise-time distance standard pattern,

$$\bar{D}_i = \bar{D}_{i-1}, \qquad \bar{E}_i = \bar{E}_{i-1}.$$

When a speech section candidate is judged not to be a speech section but a noise section, learning is carried out retroactively over the preceding frames.

Next, the threshold $T_{i+1}$ for the next, $(i+1)$-th, frame is determined from the noise-time distance standard pattern described above (S8). The threshold $T_{i+1}$ is calculated from the characteristic values $\bar{D}_i$ and $\bar{E}_i$ by the following equation.

$$T_{i+1} = \bar{D}_i + C\,\bar{E}_i$$

Here $C$ is a constant that allows a safety margin: the larger it is, the lower the detection sensitivity, but the less likely noise is to be falsely detected.
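
As a worked illustration, the following sketch updates the distance statistics from one noise frame and derives the threshold for the next frame. The function names and all numeric values, including the tracking constants and C = 3, are assumptions chosen for the example; the source states only that these constants are set empirically.

```python
def update_distance_stats(mean_dist, mean_dev, distance, k3, k4):
    """Update the mean and mean deviation of the distance from a noise frame."""
    new_mean = ((k3 - 1) * mean_dist + distance) / k3
    new_dev = ((k4 - 1) * mean_dev + abs(distance - mean_dist)) / k4
    return new_mean, new_dev


def next_threshold(mean_dist, mean_dev, c):
    """Threshold for the next frame: mean distance plus C mean deviations."""
    return mean_dist + c * mean_dev


mean_dist, mean_dev = update_distance_stats(8.0, 1.0, 12.0, k3=4, k4=4)
print(mean_dist, mean_dev)                        # 9.0 1.75
print(next_threshold(mean_dist, mean_dev, c=3.0))  # 14.25
```

Raising C in this example moves the threshold further above the typical noise-time distance, trading sensitivity for fewer false detections.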

It is then determined whether a speech section has been detected for the $i$-th frame (S9). If not, processing returns to step S3 and the same processing is repeated. If a speech section has been detected, speech recognition is performed on it (S10), and it is then determined whether speech recognition for the input speech section has finished (S11); if it has, processing ends (S12). Processing steps S5 to S11 are performed by the microprocessor 205 for speech section detection and speech recognition.

FIG. 4 shows, on a logarithmic scale, how each variable evolves in this embodiment for the word "Tokkyo" uttered in noise. The sum of the input feature pattern in this figure is the sum of the spectral intensities over the channels of the sonagram obtained in the experiment. The horizontal axis is time, that is, the frame number, and the vertical axis is the overall magnitude on a logarithmic scale. Curve I shows the distance $D_i$ of the input feature pattern and curve II shows the threshold $T_i$. As the experimental data show, the threshold $T_i$ varies with the distance $D_i$ in the noise section (frames 1 to 20) and is constant within the speech section. The distance $D_i$ also changes more than the sum of the input feature pattern does, which shows that it is effective for section detection. The start and end of the detected speech section are marked with long tick marks, showing that the speech section is detected accurately even in noise.

(Effects of the Invention) As is clear from the above description, the present invention requires no special noise-learning period and, when the nature of the noise changes slowly, learns by tracking that change, so speech sections can be detected accurately without any need to learn the noise anew.

Moreover, because detection uses features relating to the utterance length of speech, impulsive noise does not cause malfunction, and only speech sections are detected.

Furthermore, even when a new noise source appears, if it persists longer than the word speech sections to be accepted, it is learned and the method returns to a state in which speech can again be detected.

In addition, the method of the present invention does not use the special speech recognition technique that the conventional method required, so a device implementing the method is simple and easy to build.

With these effects, the present invention is well suited to a speech section detection device in a speech recognition device, to a transmission-information compression device that transmits only the speech sections in a voice communication device, and to other devices that detect speech sections.

[Brief explanation of drawings]

FIG. 1 is a block diagram schematically showing a device used to explain the speech section detection method of the present invention; FIG. 2 is a block diagram showing a device used to explain the conventional speech section detection method; FIG. 3 is a flowchart of the processing used to explain an embodiment of the speech section detection method of the present invention; and FIG. 4 is a diagram showing experimental results used to explain an embodiment of the speech section detection method of the present invention.

201 ... input terminal
202 ... low-pass filter
203 ... sample-and-hold and A/D converter
204 ... preprocessing signal processor
205 ... microprocessor for speech section detection and speech recognition
206 ... output terminal

Procedural amendment, November 21, 1986

Claims

[Claims]

(1) A speech section detection method in which a set of feature values representing the characteristics of an input speech signal for each preset fixed frame length is used as an input feature pattern and a speech section is detected using the input feature pattern, the method comprising: (a) a process of determining sections other than those judged to be speech section candidates to be noise sections; (b) a process of learning a set of characteristic values representative of the noise within the noise sections as a noise standard pattern; and (c) a process of detecting the speech section candidate as a speech section on the basis of the magnitude relationship between a threshold and the inter-pattern distance between the input feature pattern and the noise standard pattern.

(2) The speech section detection method according to claim 1, wherein the threshold is a value calculated from a noise-time distance standard pattern obtained by learning, within the noise sections, a set of characteristic values representative of the inter-pattern distance during noise.

(3) The speech section detection method according to claim 2, wherein, letting $T$ be the threshold, $\bar{D}$ the estimated mean of the inter-pattern distance in the noise sections, $\bar{E}$ its estimated mean deviation, and $C$ an arbitrary constant, the threshold is calculated according to $T = \bar{D} + C\,\bar{E}$.

(4) The speech section detection method according to claim 1, wherein the noise standard pattern is learned by a learning method that tracks changes in the nature of the noise.

(5) The speech section detection method according to claim 4, wherein, letting $i$ be discrete time, $f_i$ the characteristic value of the input feature pattern that is the object of learning, $F_i$ the characteristic value in the noise standard pattern, and $K$ a constant with $K > 1$, the noise standard pattern is learned by calculating $F_i = \{(K-1)\,F_{i-1}/K\} + f_i/K$ within noise sections and $F_i = F_{i-1}$ within speech sections.

(6) The speech section detection method according to claim 2, wherein the noise-time distance standard pattern is learned by a learning method that tracks changes in the nature of the noise.

(7) The speech section detection method according to claim 6, wherein, letting $i$ be discrete time, $g_i$ the characteristic value of the inter-pattern distance that is the object of learning, $G_i$ the characteristic value in the noise-time distance standard pattern, and $L$ a constant with $L > 1$, the noise-time distance standard pattern is learned by calculating $G_i = \{(L-1)\,G_{i-1}/L\} + g_i/L$ within noise sections and $G_i = G_{i-1}$ within speech sections.

(8) The speech section detection method according to claim 1, wherein the detection of a speech section based on the magnitude relationship between the inter-pattern distance and the threshold is performed by, in addition to comparing the inter-pattern distance with the threshold, detecting as a speech section a speech section candidate that satisfies features relating to the utterance length of speech.

(9) The speech section detection method according to claim 8, wherein the features relating to the utterance length of speech are (a) that the time during which the inter-pattern distance exceeds the threshold is about 40 ms or more, (b) that the time during which the inter-pattern distance falls below the threshold is about 400 ms or less, and (c) that no word is uttered continuously for about 4 seconds or more, and a speech section candidate is determined to be a speech section when it satisfies at least one of the combination of features (a) and (b) and feature (c).

(10) The speech section detection method according to claim 1, further comprising a process of determining a speech section candidate that is not detected as a speech section on the basis of the magnitude relationship with the threshold to be a noise section.