JPH03233600A

JPH03233600A - Voice segmenting method and voice recognition device

Info

Publication number: JPH03233600A
Application number: JP2030185A
Authority: JP
Inventors: Shinichi Tsurufuji; 鶴藤　真一
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1990-02-09
Filing date: 1990-02-09
Publication date: 1991-10-17
Anticipated expiration: 2014-10-25
Also published as: JP2966460B2

Abstract

PURPOSE: To accurately detect the time region of speech at all times by setting a threshold from an ambient noise level that is detected by an acoustic input means separate from the one supplying the acoustic signal, segmenting an acoustic signal region according to that threshold, and extracting the acoustic signal region as a speech region. CONSTITUTION: A 1st segmentation control part 3 uses a 1st threshold value, which varies with ambient noise, to detect a 1st acoustic signal region in which speech can be regarded as certainly present within an acoustic signal containing speech. A 2nd segmentation control part 7 then uses a 2nd threshold value, smaller than the 1st, to detect a 2nd acoustic signal region obtained by extending the time length around the 1st acoustic signal region, and this 2nd region is regarded as the speech region. A feature parameter that properly represents the features of the speech can therefore be extracted from the acoustic signal spanning the speech region, and the speech recognition rate can be improved.

Description

DETAILED DESCRIPTION OF THE INVENTION

(a) Field of Industrial Application

The present invention relates to a speech recognition device, and more particularly to a speech extraction method for detecting the time region of speech input to the speech recognition device.

(b) Prior Art

In a speech recognition device, ambient noise always enters the microphone together with the speech, so accurately detecting the time region of the speech within that ambient noise is an important problem.

For example, even in an office where background music (BGM) is playing, there may be a need to use speech recognition, for instance for input to a word processor. In that case the BGM is mixed with the speaker's voice and enters the speech recognition microphone, so speech recognition is impossible unless the device can accurately detect from which time position to which time position of the input acoustic signal the speech region extends. The same applies when automotive electrical equipment is to be operated by speech recognition in a car where music or songs are being played on in-vehicle audio equipment such as a car stereo.

Conventional devices therefore used a speech extraction method that detects the level of the signal input to the microphone and extracts, as the time region of speech, the period during which that level is at or above a specific threshold determined in advance from the environment and conditions in which the speech is uttered.

However, because the playback level of ambient noise such as BGM and songs is not constant, this conventional method, which relies only on a fixed threshold, cannot extract speech accurately.

(c) Problems to Be Solved by the Invention

The present invention has been made in view of the above drawback of the prior art. It provides a speech extraction method that can accurately detect the time region of speech even in an ambient noise environment whose level fluctuates, and further realizes a speech recognition device that employs this speech extraction method.

(d) Means for Solving the Problems

The speech extraction method of the present invention detects the presence of speech in the time region in which the level of an acoustic signal containing speech reaches or exceeds a specific threshold, and extracts the speech region. The threshold is set from an ambient noise level detected by an acoustic input means separate from the one supplying the acoustic signal, an acoustic signal region is cut out using that threshold, and the acoustic signal region is extracted as the speech region.

The speech recognition device of the present invention comprises: a microphone for inputting speech; a speech analysis unit that analyzes the acoustic signal obtained from the microphone and extracts a time series of speech feature parameters; a speech pattern creation unit that creates a speech pattern based on the feature parameter time series obtained from the speech analysis unit; a standard speech pattern memory that stores in advance the speech patterns of a plurality of standard utterances as standard speech patterns; an identification unit that identifies the speech pattern by pattern matching it against each speech pattern in the memory; an acoustic input terminal for inputting ambient noise; a first extraction threshold setting unit that sets a first speech extraction threshold from the noise sound level of an audio device that is connected to the input terminal and is the source of the ambient noise; a first extraction control unit that detects a first acoustic signal region in the acoustic signal obtained from the microphone using the first extraction threshold set by that setting unit; a second extraction threshold setting unit that, for the acoustic signal in which the first acoustic signal region detected by that control unit lies at the center, sets a second threshold lower than the first threshold, again based on the ambient noise level; and a second extraction control unit that detects, using the second extraction threshold set by that setting unit, a second acoustic signal region containing the first acoustic signal region. The second acoustic signal region detected by the second extraction control unit is regarded as the speech region, and the speech pattern creation unit creates the speech pattern from those feature parameters, among the time series obtained from the speech analysis unit, that lie within the speech region.

(e) Operation

According to the speech extraction method of the present invention, the threshold used to detect, by level, the time region of speech in an acoustic signal containing speech can be set dynamically according to the ambient noise level, so the speech region can be detected effectively even in an environment where the ambient noise fluctuates.

According to the speech recognition device of the present invention, the first extraction control unit uses the first threshold, which varies with the ambient noise, to detect in the acoustic signal containing speech a first acoustic signal region in which speech can be regarded as certainly present. The second extraction control unit then uses a second threshold smaller than the first threshold to detect a second acoustic signal region whose time length is extended around the first acoustic signal region, and this second acoustic signal region is regarded as the speech region. Feature parameters that properly represent the characteristics of the speech can thus be extracted from the acoustic signal spanning the speech region, and creating the speech pattern from these feature parameters makes it possible to improve the speech recognition rate.

(f) Embodiment

FIG. 1 shows the configuration of an embodiment of the speech recognition device of the present invention.

In the figure, 1 is a microphone into which speech is input, and 2 is a speech analysis unit that analyzes the acoustic signal input from microphone 1 and extracts a time series of feature parameters representing the characteristics of the speech; for example, frequency analysis yields spectral parameters that preserve the acoustic signal level information. 3 is a first extraction control unit for cutting out, from the time series of feature parameters obtained from speech analysis unit 2, the time region in which speech is present. It attaches a provisional start code and a provisional end code to the first and last feature parameters of that time region, respectively, and outputs the series of feature parameters (including a sufficient number of parameters preceding and following the coded ones). 4 is a first speech buffer that temporarily stores the feature parameter time series to which first extraction control unit 3 has attached the provisional start and end codes.

5 is a noise level input terminal separate from the above microphone. Connected to it is a second microphone for ambient noise input, the output terminal of the audio playback equipment that is the source of the ambient noise, or the signal line for that equipment's playback level display (for example, a level meter consisting of an LED bar display). 6 is a first threshold setting unit that sets the first threshold, which first extraction control unit 3 needs to cut the time region of speech out of the feature parameter time series, with reference to the acoustic signal from microphone 1 and the ambient noise level from noise level input terminal 5.

7 is a second extraction control unit that cuts out again, more strictly, the time region in which speech is present from the feature parameter time series obtained from first speech buffer 4 with the provisional start and end codes attached. It attaches a true start code to the feature parameter at a position earlier in time than the provisional first feature parameter (corresponding to the start position of the true speech region) and a true end code to the feature parameter at a position later in time than the provisional last feature parameter (the end position of the true speech region), and outputs the resulting series of feature parameters. 8 is a second speech buffer that temporarily stores the feature parameter time series to which second extraction control unit 7 has attached the true start and end codes. 9 is a second threshold setting unit that sets the second threshold, which second extraction control unit 7 needs to cut the true time region of speech out of the feature parameter time series, to a value smaller than the first threshold. The value is chosen so that the true time region of the speech can be extracted properly; it varies somewhat with the environment but is empirically set to about 80% of the first threshold.

10 is a speech pattern creation unit that creates the input speech pattern from the feature parameter time series of the true speech region, that is, the parameters lying between the true start code and end code in the time series stored in second buffer 8; it yields a speech pattern in which the feature pattern is normalized to a specific time series. 11 is a noise level buffer that stores the noise level obtained from noise level input terminal 5 over the true speech region obtained from second extraction control unit 7. 12 is a level comparison unit that compares the time average of the noise level in noise level buffer 11 with a predetermined, empirically set level, and inhibits the speech pattern creation processing in speech pattern creation unit 10 when the average noise level is greater than the predetermined level.

13 is a standard speech pattern memory that stores in advance the speech patterns of many standard utterances as standard speech patterns. 14 is an identification unit that pattern-matches the input speech pattern obtained from speech pattern creation unit 10 against each standard speech pattern in standard speech pattern memory 13, detects the standard speech pattern whose inter-pattern error is smallest and no greater than the recognition threshold (the allowable limit of this error), and outputs a recognition result signal corresponding to the detected standard speech pattern.

15 is a recognition threshold setting unit that variably sets the recognition threshold of identification unit 14 according to the average noise level in noise level buffer 11; when the average noise level is high, the recognition threshold becomes large.

FIG. 2 is a signal waveform diagram showing the speech extraction operation of the speech recognition device of the present invention; the operation is described in detail with reference to this figure.

First, the method of setting the extraction threshold for the time region of speech is explained.

First extraction threshold setting unit 6 takes in the noise level from noise level input terminal 5, which changes stepwise as shown by N in FIG. 2, at fixed intervals (for example, every 5 msec) and determines the first threshold for speech extraction according to the level taken in. In this case the signal line for a level meter consisting of an LED bar display is connected to noise level input terminal 5.

That is, the setting of this extraction threshold (written Vt1) is a function of the noise level N, as follows.

Vt1 = f(N)

Specific examples of f(N) are listed below.

First function example:

f(N) = a × N + b

Here, a and b are constants. In particular, b is given a value greater than the level of the normal steady noise so that, under normal steady noise conditions, noise input from microphone 1 is not cut out as speech by first extraction control unit 3.
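As an illustration of the first function example, the following minimal Python sketch computes Vt1 = a × N + b for a stepwise noise level sampled at fixed intervals. The function name `first_threshold` and the values chosen for a, b, and the noise samples are assumptions made for this example; the patent does not give numeric values.

```python
def first_threshold(noise_level: float, a: float = 1.5, b: float = 10.0) -> float:
    """First extraction threshold Vt1 = f(N) = a * N + b.

    a and b are illustrative constants (assumed values); b is meant to sit
    above the level of the normal steady noise picked up by microphone 1.
    """
    return a * noise_level + b


# Stepwise noise levels N, as if read from the level meter at fixed intervals.
noise_levels = [0.0, 4.0, 4.0, 8.0, 2.0]
thresholds = [first_threshold(n) for n in noise_levels]
print(thresholds)
```

With a positive a, the threshold rises and falls together with the reported noise level, while b keeps it above the steady background even when the noise source is silent.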

[...] (about 1 sec), a method that sets the extraction threshold based on the input from microphone 1 is effective. The method of setting the extraction threshold in this case is shown below.

Here, C in the case condition is a constant, t1 and t2 denote times before the current time, and a_i is the weight for time i. The above expression therefore amounts to the time average of the acoustic signal level of the noise alone from microphone 1 before speech input, plus the constant b described above.

The f(N) given above assumes that only a known sound is input to microphone 1 as noise. Stationary ambient noise is also input to microphone 1, however, and the threshold setting described above cannot deal with it. Because ambient noise is constantly input through microphone 1, this input is accumulated in first extraction threshold setting unit 6 [...] represents the power of the input from microphone 1 a fixed time (for example, about 50 msec) before the current input.

In the fourth function example above, regardless of whether the noise level N is larger or smaller than the constant C, the larger of the f(N) values given by the two expressions above can be used as f(N').

By adopting function examples f(N) such as those above, the first threshold Vt1, which varies with the ambient noise N, is determined as shown by the solid curve in FIG. 2.

First extraction control unit 3 therefore compares the level of the feature parameter time series obtained from speech analysis unit 2 (in this case, as shown by the broken curve V in FIG. 2, the sum ΣV (= V) of the frequency spectrum levels at each time point) with the first threshold Vt1. It attaches a provisional start code to the feature parameter at the first time point Ts1 of the continuous run in which ΣV ≥ Vt1, and a provisional end code to the feature parameter at the last time point Te1 of that run.
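The comparison performed by first extraction control unit 3 can be sketched as a scan over frames. This is a minimal sketch under stated assumptions: time is discretized into frames, `levels[t]` stands for the summed spectrum level ΣV at frame t, `vt1[t]` is the first threshold at that frame, and only the first qualifying run is returned. The function name and the sample data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def first_extraction(
    levels: Sequence[float], vt1: Sequence[float]
) -> Optional[Tuple[int, int]]:
    """Return (Ts1, Te1) for the first continuous run with level >= Vt1.

    Returns None when no frame reaches the threshold.
    """
    n = min(len(levels), len(vt1))
    start = None
    for t in range(n):
        if levels[t] >= vt1[t]:
            if start is None:
                start = t          # provisional start frame Ts1
        elif start is not None:
            return start, t - 1    # provisional end frame Te1
    if start is not None:
        return start, n - 1        # run lasts until the final frame
    return None


levels = [2, 3, 6, 9, 12, 11, 8, 3, 2]      # summed spectrum level per frame
vt1 = [10, 10, 8, 8, 8, 8, 10, 10, 10]      # time-varying first threshold
print(first_extraction(levels, vt1))
```

In this example the frame with level 8 is excluded because the threshold has already risen to 10 by then, which mirrors how a noise-dependent threshold can clip the edges of an utterance.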

The feature parameter time series with the provisional start and end codes attached is stored in first speech buffer 4. Buffer 4 also holds a sufficient stretch of the time series preceding the feature parameter with the provisional start code and following the feature parameter with the provisional end code.

Next, speech extraction by second extraction control unit 7 is described.

When the noise level from noise level input terminal 5 is high, first extraction control unit 3 may fail to cut out the beginning and end of the utterance accurately, so that only a speech region shorter than the true one is detected. Second extraction control unit 7 is provided to compensate for this.

That is, in second extraction control unit 7, a second threshold Vt2 smaller than the first threshold Vt1 set by first extraction threshold setting unit 6 is set, and this threshold Vt2 is used to cut a more appropriate speech region out of the feature parameter time series in first speech buffer 4.

The setting of the second threshold Vt2 is now explained. The first threshold Vt1 set by first extraction threshold setting unit 6 is supplied, together with time information, to second extraction threshold setting unit 9.

Second extraction threshold setting unit 9 determines a second threshold Vt2 that is smaller than the first threshold at the provisional start time Ts1, where the speech level satisfies V(Ts1) = Vt1(Ts1), and likewise determines a second threshold Vt2 that is smaller than the first threshold at the provisional end time Te1, where V(Te1) = Vt1(Te1).

Specifically, the second threshold Vt2 used to determine the true start time Ts2 is a function of Vt1(Ts1) and is expressed as follows.

For example,

Vt2 = Vt1(Ts1) - d, where d is a constant,

or

Vt2 = Vt1(Ts1) / m, where m is a constant.

Similarly, the second threshold Vt2 used to determine the true end time Te2 is a function of Vt1(Te1) and, as for the true start time Ts2, is expressed as follows.

For example,

Vt2 = Vt1(Te1) - d, where d is a constant,

or

Vt2 = Vt1(Te1) / m, where m is a constant.

As with the first threshold Vt1, if a minimum-value constant C is set for the second threshold Vt2 as well, there is no risk of cutting out regions of stationary noise as speech.

Using the second threshold Vt2 set by second extraction threshold setting unit 9, second extraction control unit 7 therefore detects, before the time Ts1 stored in first speech buffer 4, the time Ts2 at which V(Ts2) = Vt2 and which can be regarded as the true start of the speech, and attaches a true start code to the feature parameter at that time. It likewise detects, after the time Te1, the time Te2 at which V(Te2) = Vt2 and which can be regarded as the true end of the speech, and attaches a true end code to the feature parameter at that time.

The feature parameter time series with the true start and end codes attached is temporarily stored in second speech buffer 8, and the portion of the time series between the start code and the end code is supplied to speech pattern creation unit 10.

When the noise level is very high, however, even the speech extraction means described above may have difficulty detecting the speech region accurately, and a safeguard that suppresses speech recognition in such cases is needed.

In the embodiment of FIG. 1, level comparison unit 12 is provided as this safeguard.

That is, the noise level over the speech region cut out by second extraction control unit 7 (Ts2 to Te2 in FIG. 2) is stored in noise level buffer 11. From it, level comparison unit 12 calculates the time average of the noise level, ave(N) = ΣN / (Te2 - Ts2), and when this value is at or above a fixed value it inhibits speech pattern creation in speech pattern creation unit 10.
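The safeguard in level comparison unit 12 can be sketched as follows. This is a minimal sketch assuming frame-indexed noise samples; it divides by the number of frames in the region as the discrete counterpart of the duration Te2 - Ts2. The function names, the limit value, and the sample data are hypothetical.

```python
from typing import Sequence


def average_noise(noise_levels: Sequence[float], ts2: int, te2: int) -> float:
    """Time average ave(N) of the noise level over frames Ts2..Te2 inclusive."""
    segment = noise_levels[ts2:te2 + 1]
    return sum(segment) / len(segment)


def pattern_creation_allowed(
    noise_levels: Sequence[float], ts2: int, te2: int, limit: float
) -> bool:
    """Inhibit speech pattern creation when ave(N) is at or above the limit."""
    return average_noise(noise_levels, ts2, te2) < limit


noise = [4, 4, 8, 8, 8, 2, 2]
print(average_noise(noise, 2, 4), pattern_creation_allowed(noise, 2, 4, limit=6.0))
print(average_noise(noise, 4, 6), pattern_creation_allowed(noise, 4, 6, limit=6.0))
```

The first region is rejected because its average noise reaches the limit; the second is accepted even though it begins during a loud frame, since the average over the whole region decides.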

On the other hand, a speech pattern created by speech pattern creation unit 10 under noise within the allowable range is pattern-matched by identification unit 14 against the many standard patterns stored in advance in standard pattern memory 13. The most similar standard pattern (that is, the one with the smallest error D) is held in identification unit 14 as the recognition result, together with its similarity (which is inversely related to the error D).

Identification unit 14 then uses the recognition threshold from recognition threshold setting unit 15 to make the final decision on whether the recognition result it holds is valid.

The method of setting the recognition threshold in recognition threshold setting unit 15 is now explained. When the degree of similarity is expressed by the error D, the recognition threshold Dt is determined so as to track the average noise level ave(N) over the speech region (Ts2 to Te2 in FIG. 2), for example as follows.

Dt = p × ave(N) + q

Here, p and q are constants.

That is, the recognition threshold Dt is set larger when the ambient noise is high.

Identification unit 14 therefore treats the recognition result as valid when the result is more similar than the recognition threshold Dt, which varies with the ambient noise level in this way, requires (that is, when the patterns are similar). Even if the input pattern is somewhat deformed according to the magnitude of the noise level, this deformation is absorbed and a recognition result can still be obtained.
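The noise-dependent acceptance test can be sketched as below. This is a minimal sketch assuming the pattern-matching errors D have already been computed by some matcher, which is not implemented here; a result is accepted when the smallest error does not exceed Dt = p × ave(N) + q. The function names, the values of p and q, and the word labels are hypothetical.

```python
from typing import Dict, Optional


def recognition_threshold(avg_noise: float, p: float = 0.5, q: float = 2.0) -> float:
    """Recognition threshold Dt = p * ave(N) + q (p and q are assumed values)."""
    return p * avg_noise + q


def identify(errors: Dict[str, float], avg_noise: float) -> Optional[str]:
    """Return the label with the smallest error D if D <= Dt, else None."""
    label, err = min(errors.items(), key=lambda item: item[1])
    return label if err <= recognition_threshold(avg_noise) else None


# Hypothetical inter-pattern errors against two stored standard patterns.
errors = {"play": 4.0, "stop": 6.5}
print(identify(errors, avg_noise=2.0))   # Dt = 3.0
print(identify(errors, avg_noise=8.0))   # Dt = 6.0
```

The same best match is rejected in quiet conditions and accepted in noisy ones, which reflects the stated intent of tolerating more pattern deformation as the noise level rises.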

The speech recognition device described above can be used, for example, as a means of operating a car stereo in an automobile, in which case the car stereo itself is the ambient noise source to be handled. The input to noise level input terminal 5 need not come directly from the output line of the audio equipment; ambient noise can also be picked up through a microphone and an analog-to-digital converter.

(g) Effects of the Invention

According to the speech extraction method of the present invention, the threshold used to detect, by level, the time region of speech in an acoustic signal containing speech can be set dynamically according to the ambient noise level, so the speech region can be detected effectively even in an audio playback environment whose level fluctuates. Furthermore, a speech recognition device that employs the speech extraction method of the present invention can detect the speech region more appropriately, so an improvement in the accuracy of speech recognition processing can be expected.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing the configuration of the speech recognition device of the present invention, and FIG. 2 is a signal diagram showing the speech extraction method of the present invention as adopted in the device of FIG. 1.

Description of symbols: 1: microphone; 2: speech analysis unit; 3: first extraction threshold control unit; 4: first speech buffer; 5: noise level input terminal; 6: first extraction threshold setting unit; 7: second extraction threshold control unit; 8: second speech buffer; 9: second extraction threshold setting unit; 10: speech pattern creation unit; 11: noise level buffer; 12: level comparison unit; 13: standard pattern memory; 14: identification unit; 15: recognition threshold setting unit.

Claims

(1) A speech extraction method that detects the presence of speech in the time region in which the level of an acoustic signal containing speech reaches or exceeds a specific threshold and cuts out the speech region, wherein the threshold is set from an ambient noise level detected by an acoustic input means separate from the one supplying the acoustic signal, an acoustic signal region is cut out using the threshold, and the acoustic signal region is extracted as the speech region.

(2) The speech extraction method described, wherein the threshold is set based on the acoustic signal itself when the ambient noise level is smaller than a predetermined value.

(3) A speech extraction method that detects the presence of speech in the time region in which the level of an acoustic signal containing speech reaches or exceeds a specific threshold and cuts out the speech region, wherein a first threshold is set from an ambient noise level detected by an acoustic input means separate from the one supplying the acoustic signal, a first acoustic signal region is cut out using that threshold, a second threshold lower than the first is then set, again based on the ambient noise level, for the acoustic signal in which the first acoustic signal region lies at the center, a second acoustic signal region containing the first acoustic signal region is cut out using the second threshold, and the second acoustic signal region is extracted as the speech region.

(4) The speech extraction method according to claim 1, 2, or 3, wherein, if the ambient noise level in the extracted speech region is higher than a value set according to the level of the acoustic signal in that speech region, the extraction of that speech region is invalidated.

(5) The speech extraction method according to claim 1, 2, 3, or 4, wherein the ambient noise is an audio signal input directly from an output terminal of an audio device that reproduces sounds such as music.

(6) A speech recognition device comprising: a microphone that inputs speech; a speech analysis section that analyzes the acoustic signal obtained from the microphone and extracts a time series of speech feature parameters; a speech pattern creation section that creates a speech pattern based on the feature parameter time series obtained from the speech analysis section; a standard pattern memory that stores a plurality of standard speech patterns in advance; an identification section that identifies the speech pattern by pattern matching between it and each standard speech pattern in the memory; an audio input terminal for inputting ambient noise; a first extraction threshold setting section that sets a first extraction threshold based on the noise level from an audio device that is connected to the input terminal and is a source of ambient noise; a first extraction control section that detects a first acoustic signal region from the acoustic signal obtained from the microphone using the first extraction threshold set by that setting section; a second extraction threshold setting section that, for the acoustic signal in which the first acoustic signal region detected by that control section lies at the center, sets a second extraction threshold lower than the first extraction threshold based on the ambient noise level; and a second extraction control section that detects a second acoustic signal region containing the first acoustic signal region using the second extraction threshold set by that setting section, wherein the second acoustic signal region detected by the second extraction control section is regarded as the speech region, and the speech pattern creation section creates the speech pattern from the portion of the feature parameter time series obtained from the speech analysis section that lies within the speech region.

(7) The speech recognition device according to claim 6, wherein the identification section outputs a recognition result signal associated with the standard speech pattern that matches the input speech pattern with the smallest error, provided that this minimum error is smaller than a predetermined recognition error, and the recognition error is variably set according to the ambient noise in the speech region.

(8) The speech recognition device according to claim 6 or 7, wherein the audio input terminal is coupled to an output terminal of a vehicle-mounted sound reproduction device whose output becomes ambient noise for the speech input to the microphone, and the identification result signal of the identification section is output to a control circuit of the vehicle-mounted sound reproduction device, so that the vehicle-mounted sound reproduction device is operated by speech recognition.
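
The two-threshold extraction recited in claims (1) to (4) can be illustrated with a minimal Python sketch. It is not the patented implementation: the claims give no numeric values or formulas, so the gain constants, the use of the mean noise level over the whole input, the fallback to the leading microphone frames when the noise is very low, and the function name `extract_speech_region` are all assumptions made for illustration. Inputs are per-frame levels of the microphone signal and of the separately input ambient noise.

```python
from typing import List, Optional, Tuple

# Hypothetical constants; the claims do not specify any values.
FIRST_GAIN = 3.0    # first threshold = FIRST_GAIN * base level
SECOND_GAIN = 1.5   # second (lower) threshold = SECOND_GAIN * base level
MIN_NOISE = 0.5     # below this noise level, use the signal itself (claim 2)
LEAD_FRAMES = 2     # leading frames assumed to hold no speech (assumption)
REJECT_RATIO = 0.5  # reject if region noise > REJECT_RATIO * region signal


def extract_speech_region(
    signal: List[float], noise: List[float]
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, or None."""
    if not signal or len(signal) != len(noise):
        raise ValueError("signal and noise must be non-empty and equal length")

    # Base level: ambient noise from the separate input, or, when that
    # is very low, an estimate taken from the microphone signal itself.
    noise_level = sum(noise) / len(noise)
    if noise_level < MIN_NOISE:
        lead = signal[:LEAD_FRAMES]
        base = sum(lead) / len(lead)
    else:
        base = noise_level
    first_threshold = FIRST_GAIN * base
    second_threshold = SECOND_GAIN * base

    # First region: span from the first to the last frame that reaches
    # the first threshold, where speech is almost certainly present.
    above = [i for i, level in enumerate(signal) if level >= first_threshold]
    if not above:
        return None
    start, end = above[0], above[-1] + 1

    # Second region: widen the first region outward while neighbouring
    # frames still reach the lower second threshold.
    while start > 0 and signal[start - 1] >= second_threshold:
        start -= 1
    while end < len(signal) and signal[end] >= second_threshold:
        end += 1

    # Invalidate the region if the noise inside it is too high relative
    # to the signal level inside it (claim 4).
    region_signal = sum(signal[start:end]) / (end - start)
    region_noise = sum(noise[start:end]) / (end - start)
    if region_noise > REJECT_RATIO * region_signal:
        return None
    return start, end
```

With a steady noise level of 1 and the frame levels `[1, 1, 2, 4, 9, 10, 8, 4, 2, 1, 1]`, the first threshold selects frames 3 to 7 and the second threshold widens the region by one frame on each side. A per-frame threshold that tracks the noise over time would be a natural refinement, but that choice is outside what the claims specify.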