JPH10124084A

JPH10124084A - Voice processor

Info

Publication number: JPH10124084A
Application number: JP8275702A
Authority: JP
Inventors: Takashi Miki; 敬三木
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1996-10-18
Filing date: 1996-10-18
Publication date: 1998-05-15

Abstract

PROBLEM TO BE SOLVED: To prevent speech recognition accuracy from deteriorating even when the average value of the estimated noise spectrum fluctuates, by providing pattern recognition means that collates the estimated speech spectrum against internally held reference parameters to identify the input speech. SOLUTION: A pattern recognition section 104 compares its reference parameters (the spectral feature parameter sequences of various speech elements stored internally, together with the spectral feature parameter sequence of background noise only) with the estimated speech spectrum sequence output from an estimated speech spectrum computing section 111, and outputs the category name of the speech standard pattern giving the highest similarity to an external device as the identification result. This reduces collation errors in background-noise segments when section 104 compares the estimated speech spectrum sequence with the speech standard patterns, which raises collation accuracy and, as a result, improves speech recognition performance.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

Field of the Invention: The present invention relates to a speech processing apparatus having a function of removing a background noise signal contained in a speech signal.

[0002]

Description of the Related Art: A speech signal input to a speech processing apparatus has a background noise signal superimposed on it, and the spectral subtraction method (hereinafter, the SS method) is used as a simple and effective technique for removing this background noise signal.

[0003] The SS method extracts the speech spectrum by subtracting the spectrum of the noise signal (hereinafter, the noise spectrum) from the spectrum of the speech signal on which the noise signal is superimposed (hereinafter, the speech spectrum).

[0004] Normally, the noise spectrum is estimated from a portion in which no speech is uttered (the result is hereinafter called the estimated noise spectrum). For example, it is estimated as the average of the noise spectrum over an interval extending a fixed time back from just before the speech is uttered.

[0005]

Problems to Be Solved by the Invention: However, when the average value of the noise spectrum fluctuates, the error between the estimated noise spectrum and the true noise spectrum grows. In particular, when the estimated noise spectrum value exceeds the true noise spectrum value, the speech spectrum value obtained by the SS method may become negative. This phenomenon is called overestimation and causes speech recognition accuracy to decline.

[0006] A simple and effective means of avoiding the decline in speech recognition accuracy caused by overestimation has therefore been desired.

[0007]

Means for Solving the Problems: To solve this problem, the speech processing apparatus of the present invention, which spectrally analyzes an input speech signal including sections in which no speech is input, obtains spectral feature parameters, and processes the input speech signal, comprises:

(1) acoustic analysis means for spectrally analyzing the input speech signal frame by frame, converting it into spectral feature parameters, and calculating the acoustic power;

(2) speech detection means for detecting, from the acoustic power, the speech start frame, which is the first frame of the section in which speech is input;

(3) estimated noise spectrum calculation means for estimating a noise spectrum from the spectral feature parameters of the section extending from the frame immediately preceding the speech start frame back to a frame a fixed time earlier, and outputting the estimated noise spectrum;

(4) estimated noise spectrum holding means for holding the estimated noise spectrum output from the estimated noise spectrum calculation means;

(5) reference pattern noise spectrum holding means for holding the spectral feature parameters of background noise only as a reference pattern noise spectrum;

(6) estimated speech spectrum calculation means for calculating an estimated speech spectrum as the difference between the spectral feature parameters and the estimated noise spectrum and, where the estimated speech spectrum value is less than or equal to the reference pattern noise spectrum value, using the reference pattern noise spectrum as the estimated speech spectrum; and

(7) pattern recognition means that holds internally, as reference parameters, the spectral feature parameters of speech and the spectral feature parameters of background noise only, and identifies the input speech by collating the estimated speech spectrum against those reference parameters.

[0008]

BEST MODE FOR CARRYING OUT THE INVENTION

(A) Embodiment. An embodiment of the speech processing apparatus according to the present invention is described in detail below with reference to the drawings. This embodiment is a speech processing apparatus that recognizes the speech type of an input speech signal. It has two operation modes. In the learning mode, it learns speech standard patterns (reference spectral feature parameters) for the various speech elements (phoneme segments and words) needed to recognize the speech type of the input speech signal and, at the same time, learns the spectrum of the background noise alone at that time (hereinafter, the learning-time noise spectrum). In the recognition mode, it recognizes the input speech signal.

[0009] (A-1) Configuration of the Embodiment. FIG. 1 is a functional block diagram of the speech processing apparatus of this embodiment. As shown in FIG. 1, the apparatus consists of an acoustic analysis unit 101, a speech detection unit 102, an SS processing unit 103, a pattern recognition unit 104, an estimated noise spectrum storage memory 105, and a learning-time noise spectrum storage unit 106.

[0010] According to the functions it performs, the SS processing unit 103 can be divided into an estimated noise spectrum calculation unit 110 and an estimated speech spectrum calculation unit 111, as shown in FIG. 1.

[0011] As described above, the speech processing apparatus of this embodiment has a learning mode and a recognition mode. The mode selection signal that selects between them is input to the learning-time noise spectrum storage unit 106, the estimated speech spectrum calculation unit 111, and the pattern recognition unit 104, and each of these units has different functions depending on the operation mode. The functions of each unit when the mode selection signal places the apparatus in the recognition mode are described first.

[0012] The acoustic analysis unit 101 converts the input speech signal from analog to digital, performs spectrum analysis on each frame, and calculates the spectral feature parameters and the acoustic power. In the spectrum analysis, the spectral feature parameter is, for example, the value Xi(t) obtained by averaging, within each frame, the absolute value of the signal passing through each of 16 band-pass filters (hereinafter, BPFs) with different center frequencies. Here t is the frame number and i is the BPF number (i = 1 to 16).

[0013] The acoustic analysis unit 101 also calculates the acoustic power P(t) as the sum of the spectral feature parameters Xi(t), as shown in equation (1). The sum Σ is taken over i = 1 to 16.

[0014]

$$P(t) = \sum_{i=1}^{16} X_i(t) \qquad (1)$$

The speech detection unit 102 detects the section in which speech is input (hereinafter, the speech section) from the acoustic power P(t) output from the acoustic analysis unit 101.

[0015] Here, the first frame of the speech section is called the speech start frame ts, and the last frame is called the speech end frame te. The speech section may be detected by exploiting the difference between the acoustic power of the speech section and that of the background noise section (the section in which no speech is input and only background noise is input). Alternatively, a detection method based on speech feature quantities, such as the one disclosed in JP-A-62-211698, may be used.
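The power computation of equation (1) and a power-based detection of ts and te can be sketched as follows. This is only a minimal illustration: the fixed `threshold`, the function names, and the rule "first and last frame whose power exceeds the threshold" are assumptions made for the example, since the text says only that the power difference between speech and background noise may be used.

```python
def acoustic_power(frame):
    """Equation (1): P(t) is the sum of the band outputs X_i(t) of one frame."""
    return sum(frame)


def detect_speech_section(frames, threshold):
    """Return (ts, te), the first and last frame index whose power exceeds
    the threshold, or None if no frame does.

    The fixed threshold is an assumption of this sketch, not the patent's rule.
    """
    active = [t for t, frame in enumerate(frames)
              if acoustic_power(frame) > threshold]
    if not active:
        return None
    return active[0], active[-1]


# Toy input with 3 bands per frame instead of the 16 BPF outputs in the text.
frames = [
    [1.0, 1.0, 1.0],   # background noise
    [1.0, 2.0, 1.0],   # background noise
    [5.0, 6.0, 4.0],   # speech
    [6.0, 7.0, 5.0],   # speech
    [1.0, 1.0, 2.0],   # background noise
]
section = detect_speech_section(frames, threshold=10.0)
```

A practical detector would typically add smoothing or hangover logic so that short dips in power do not split the speech section; none of that is specified in the source.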

[0016] The estimated noise spectrum calculation unit 110 calculates the average of the spectral feature parameters over the section from the frame (ts−1) immediately preceding the speech start frame ts back to the frame (ts−NT), a fixed time (NT−1 frames) earlier. This average is hereinafter called the estimated noise spectrum Ai, and the unit outputs it to the estimated noise spectrum storage memory 105. Equation (2) gives Ai; the sum Σ is taken over t = ts−NT to ts−1.

[0017]

$$A_i = \frac{1}{\mathrm{NT}} \sum_{t=\mathrm{ts}-\mathrm{NT}}^{\mathrm{ts}-1} X_i(t) \qquad (2)$$

The estimated noise spectrum storage memory 105 stores the estimated noise spectrum Ai output from the estimated noise spectrum calculation unit 110 and supplies it to the estimated speech spectrum calculation unit 111.
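Equation (2) is a per-band average over the NT frames that immediately precede the speech start frame. A minimal sketch, assuming frames are stored as a list of per-band lists indexed from 0 (the function name and data layout are illustrative, not taken from the patent):

```python
def estimate_noise_spectrum(frames, ts, nt):
    """Equation (2): A_i is the mean of X_i(t) over t = ts-NT .. ts-1."""
    if nt <= 0 or ts < nt:
        raise ValueError("need NT > 0 and at least NT frames before ts")
    window = frames[ts - nt:ts]          # frames ts-NT .. ts-1 inclusive
    n_bands = len(window[0])
    return [sum(frame[i] for frame in window) / nt for i in range(n_bands)]


# Toy input: 2 bands, speech assumed to start at frame 3, NT = 2.
example_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
a = estimate_noise_spectrum(example_frames, ts=3, nt=2)
```

Because the window ends at ts−1, the speech frame itself never contributes to the estimate.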

[0018] The learning-time noise spectrum storage unit 106 supplies the learning-time noise spectrum Bi, which it has stored internally in the learning mode described later, to the estimated speech spectrum calculation unit 111.

[0019] The estimated speech spectrum calculation unit 111 calculates the estimated speech spectrum sequence XS(t) by subtracting the estimated noise spectrum component k·Ai from the spectral feature parameters Xi(t) of the speech section output from the acoustic analysis unit 101. However, if after this subtraction the value of XSi(t), an element of XS(t), is less than or equal to the value of the learning-time noise spectrum Bi output from the learning-time noise spectrum storage unit 106, the unit substitutes Bi for XSi(t). Equation (3) gives XSi(t). Here k is a preset coefficient indicating the degree of subtraction; a value of about 0.5 to 2.0 is normally chosen.
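The subtract-then-floor rule for one frame can be sketched as below. The function name and the list-based layout are assumptions for illustration; the rule itself (subtract k·Ai, and fall back to Bi whenever the result is less than or equal to Bi) follows the description in the text.

```python
def estimated_speech_spectrum(x, a, b, k=1.0):
    """Per band: XS_i = X_i - k*A_i if that exceeds B_i, otherwise B_i."""
    result = []
    for xi, ai, bi in zip(x, a, b):
        diff = xi - k * ai
        result.append(diff if diff > bi else bi)
    return result


# Band 0: 10 - 2 = 8 > 1, kept.  Band 1: 3 - 4 = -1 <= 1, replaced by B_i.
# Band 2: 5 - 4 = 1 <= 1, replaced by B_i (the boundary case).
xs = estimated_speech_spectrum(
    x=[10.0, 3.0, 5.0], a=[2.0, 4.0, 4.0], b=[1.0, 1.0, 1.0], k=1.0
)
```

Since every output band is at least Bi, an overestimated noise spectrum can no longer produce a negative speech spectrum value, provided Bi itself is non-negative.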

[0020]

$$XS_i(t) = \begin{cases} X_i(t) - k \cdot A_i & \text{if } X_i(t) - k \cdot A_i > B_i \\ B_i & \text{if } X_i(t) - k \cdot A_i \le B_i \end{cases} \qquad (3)$$

The pattern recognition unit 104 holds reference parameters internally. These consist of the spectral feature parameter sequences of the various speech elements (called speech standard patterns; not shown) and the spectral feature parameter sequence Xi(t) of background noise only (the same spectral information as the learning-time noise spectrum Bi stored in the learning-time noise spectrum storage unit). The unit compares these reference parameters with the estimated speech spectrum sequence XS(t) output from the estimated speech spectrum calculation unit 111, and outputs the category name of the speech standard pattern giving the highest similarity to an external device (not shown) or the like as the recognition result.

[0021] The similarity may be calculated using, for example, the DP (dynamic programming) method.
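One common form of DP matching is dynamic time warping, sketched below. The source names only "the DP method", so the Euclidean frame distance, the three-way step pattern, and the use of the smallest accumulated distance as a stand-in for the highest similarity are all assumptions of this example; `dp_distance` and `recognize` are hypothetical names.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two spectral frames (an assumed local cost)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


def dp_distance(seq_a, seq_b):
    """Accumulated distance along the best time alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognize(xs, templates):
    """Return the category name whose template is closest to xs."""
    return min(templates, key=lambda name: dp_distance(xs, templates[name]))


# Toy one-band templates standing in for speech standard patterns.
templates = {
    "yes": [[1.0], [5.0], [1.0]],
    "no": [[5.0], [1.0], [5.0]],
}
result = recognize([[1.0], [1.0], [5.0], [1.0]], templates)
```

The input is one frame longer than either template, which is exactly the case time warping is meant to absorb.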

[0022] Next, the functions of each unit when the mode selection signal places this embodiment in the learning mode are described. The functions of the acoustic analysis unit 101 and the speech detection unit 102 are the same as in the recognition mode, so their description is omitted. Of the units constituting this embodiment, the SS processing unit 103 (the estimated noise spectrum calculation unit 110 and the estimated speech spectrum calculation unit 111) and the estimated noise spectrum storage memory 105 are needed only in the recognition mode.

[0023] The learning-time noise spectrum storage unit 106 uses the speech start frame ts and the speech end frame te to pick out, from the spectral feature parameters Xi(t) output from the acoustic analysis unit 101, those belonging to the background noise section, and stores them as the learning-time noise spectrum Bi.
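A minimal sketch of how Bi might be derived from a learning utterance is shown below. The text says only that the background-noise frames are separated out using ts and te and stored as Bi; reducing those frames to a single per-band average is an assumption of this example, as is the function name.

```python
def learning_noise_spectrum(frames, ts, te):
    """Average the frames outside the speech section [ts, te] per band.

    Averaging is an assumed reduction; the source does not say how the
    background-noise frames are condensed into B_i.
    """
    noise = [frame for t, frame in enumerate(frames) if t < ts or t > te]
    if not noise:
        raise ValueError("no background-noise frames outside [ts, te]")
    n_bands = len(noise[0])
    return [sum(frame[i] for frame in noise) / len(noise)
            for i in range(n_bands)]


# Frames 1 and 2 are speech; frames 0 and 3 are background noise.
b = learning_noise_spectrum(
    [[1.0, 3.0], [9.0, 9.0], [9.0, 9.0], [3.0, 5.0]], ts=1, te=2
)
```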

[0024] The pattern recognition unit 104 derives the speech standard patterns and the spectral feature parameter sequence of the background noise alone at that time from the spectral feature parameters Xi(t) output from the acoustic analysis unit 101, and stores both internally as reference parameters. Specifically, the pattern recognition unit 104 determines from the speech start frame ts and the speech end frame te whether each spectral feature parameter Xi(t) belongs to the speech section or to the background noise section. It then learns speech standard patterns for the various speech elements from the speech section, learns the spectrum sequence of the background noise alone at that time from the background noise section (the same spectral information as the learning-time noise spectrum Bi stored in the learning-time noise spectrum storage unit 106), and stores these spectral feature parameter sequences internally.

[0025] In the learning-time noise spectrum storage unit 106 and the pattern recognition unit 104, the learning-time noise spectrum Bi and the reference parameters are updated each time this embodiment performs new learning. Although both units discriminate between the spectral feature parameters of the speech section and those of the background noise section using the speech start frame ts and the speech end frame te, the invention is not limited to this method.

[0026] (A-2) Operation of the Embodiment. The operation of the speech processing apparatus of this embodiment, configured as above, is described below. As noted, the embodiment has a recognition mode and a learning mode, and the operation of each unit differs between them. The operation when the mode selection signal places the apparatus in the recognition mode is described first.

[0027] The acoustic analysis unit 101 performs spectrum analysis on each frame of the input speech signal supplied to the speech processing apparatus, and calculates the acoustic power P(t) and the spectral feature parameters Xi(t).

[0028] The acoustic power P(t) is input to the speech detection unit 102, and the spectral feature parameters Xi(t) are input to the estimated noise spectrum calculation unit 110 and the estimated speech spectrum calculation unit 111.

[0029] The speech start frame ts and the speech end frame te are detected from the acoustic power P(t) input to the speech detection unit 102.

[0030] From the speech start frame ts and the spectral feature parameters Xi(t) input to the estimated noise spectrum calculation unit 110, the estimated noise spectrum Ai is calculated by equation (2) and output to the estimated noise spectrum storage memory 105.

[0031] The learning-time noise spectrum storage unit 106 supplies the learning-time noise spectrum Bi stored internally to the estimated speech spectrum calculation unit 111.

[0032] The estimated speech spectrum calculation unit 111 receives the spectral feature parameters Xi(t), the estimated noise spectrum Ai, and the learning-time noise spectrum Bi. It calculates the estimated speech spectrum sequence XS(t) by equation (3) and outputs it to the pattern recognition unit 104.

[0033] The pattern recognition unit 104 calculates the similarity between the estimated speech spectrum sequence XS(t) output from the estimated speech spectrum calculation unit 111 and the reference parameters stored internally, and outputs the category name of the most similar speech standard pattern (the spectral feature parameters of a speech element) as the recognition result.

[0034] Next, the operation of this embodiment in the learning mode is described. The operations of the acoustic analysis unit 101 and the speech detection unit 102 are the same as in the recognition mode, so their description is omitted.

[0035] The spectral feature parameters Xi(t) output from the acoustic analysis unit 101 are input to the learning-time noise spectrum storage unit 106 and the pattern recognition unit 104.

[0036] Of the spectral feature parameters Xi(t), the learning-time noise spectrum storage unit 106 stores only those of the background noise section as the learning-time noise spectrum Bi.

[0037] Meanwhile, the pattern recognition unit 104 separates the spectral feature parameters Xi(t) into the speech standard patterns and the learning-time noise spectrum, and stores both internally.

[0038] (A-3) Effects of the Embodiment. In the speech processing apparatus of this embodiment, when the value XSi(t) obtained by subtracting the estimated noise spectrum component k·Ai from the spectral feature parameter Xi(t) is less than or equal to the learning-time noise spectrum value Bi, the estimated speech spectrum calculation unit 111 substitutes Bi for the estimated speech spectrum XSi(t). This produces the following effects.

[0039] The spectrum sequence of the background noise sections in the estimated speech spectrum sequence XS(t) (sections in which no speech is uttered, for example pause sections such as breaks between phrases) can be made to match the spectrum Bi of the background noise section at learning time. Consequently, when the pattern recognition unit 104 compares the estimated speech spectrum sequence XS(t) with the speech standard patterns, collation errors in the background noise sections are eliminated, collation accuracy rises, and speech recognition performance improves as a result.

[0040] FIG. 2 shows the results of an evaluation experiment comparing the speech processing apparatus of this embodiment with a speech processing apparatus using the conventional SS method (hereinafter, the conventional type). In the experiment, signals were generated by superimposing 25 types of noise on speech of 100 uttered words, and the speech recognition rates were compared. For the conventional type, the estimated speech spectrum sequence XS(t) was estimated by replacing the learning-time noise spectrum Bi in equation (3) with k1·Ai, where k1 is a predetermined constant (see equation (4)).

[0041]

$$XS_i = \begin{cases} X_i - k \cdot A_i & \text{if } X_i - k \cdot A_i > k_1 \cdot A_i \\ k_1 \cdot A_i & \text{if } X_i - k \cdot A_i \le k_1 \cdot A_i \end{cases} \qquad (4)$$

The recognition rates, by ratio of speech power to noise power (S/N ratio), were as follows: at 20 dB, 83.2% for the conventional type and 87.1% for this embodiment; at 15 dB, 65.7% and 71.0%; at 10 dB, 47.0% and 49.7%; and at 5 dB, 31.5% and 33.1%.

[0042] In every case, the recognition rate of the speech processing apparatus of this embodiment is higher than that of the conventional speech processing apparatus. At an S/N ratio of 15 dB in particular, it exceeds the conventional type by more than 5 percentage points, which demonstrates the effectiveness of the speech processing apparatus of this embodiment.
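The size of the improvement at each S/N ratio follows directly from the reported recognition rates. The short calculation below simply subtracts the conventional rate from the embodiment's rate; the dictionary layout and function name are illustrative only.

```python
# S/N ratio in dB -> (conventional type %, this embodiment %), as reported.
rates = {
    20: (83.2, 87.1),
    15: (65.7, 71.0),
    10: (47.0, 49.7),
    5: (31.5, 33.1),
}


def gains(rates):
    """Improvement in percentage points at each S/N ratio, rounded to 0.1."""
    return {snr: round(new - old, 1) for snr, (old, new) in rates.items()}


improvement = gains(rates)
for snr, gain in improvement.items():
    print(f"{snr:>2} dB: +{gain} points")
```

The largest gain, 5.3 points, occurs at 15 dB, consistent with the statement that the improvement there is above 5 points.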

[0043] Furthermore, the amount of computation needed to calculate the estimated speech spectrum sequence XS(t) is the same for the speech processing apparatus of this embodiment and the conventional speech processing apparatus.

[0044] (B) Other Embodiments. In the above embodiment, BPFs are used for the spectrum analysis in the acoustic analysis unit 101. However, the present invention can also be applied easily to LPC-cepstrum-based feature parameters, which are currently the mainstream for spectrum analysis, by converting them into the frequency spectrum domain using, for example, the discrete cosine transform (DCT).

[0045] The present invention is characterized by its method of estimating the estimated speech spectrum, and any pattern recognition method may be used. For example, a pattern recognition method that uses spectral feature parameter sequences, hidden Markov models (HMMs), or the like as the speech standard patterns may be applied.

[0046] In the above embodiment, the speech processing apparatus replaces the learning-time noise spectrum with a new one at each learning session. Alternatively, the average (simple average, weighted average, or the like) of the learning-time noise spectra obtained in the individual learning sessions may be used as the learning-time noise spectrum, or the learning-time noise spectrum may be determined by forgetting learning.
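One way to realize the weighted-average or forgetting update is an exponential forgetting rule, sketched below. The source does not define "forgetting learning", so this specific rule, the `forget` parameter, and the function name are assumptions chosen for illustration.

```python
def update_noise_spectrum(b_old, b_new, forget=0.5):
    """Blend the stored B_i with the spectrum from the latest learning session.

    forget = 1.0 reproduces the embodiment's behavior (replace outright);
    smaller values retain more of the previously learned spectrum.
    """
    if not 0.0 <= forget <= 1.0:
        raise ValueError("forget must lie in [0, 1]")
    return [(1.0 - forget) * old + forget * new
            for old, new in zip(b_old, b_new)]


blended = update_noise_spectrum([2.0, 4.0], [4.0, 8.0], forget=0.5)
replaced = update_noise_spectrum([2.0, 4.0], [4.0, 8.0], forget=1.0)
```

With `forget=0.5` each new session contributes half of the stored value, so older sessions decay geometrically.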

[0047] Although the above embodiment is a speech processing apparatus that recognizes the speech type of an input speech signal, the present invention can also be applied to a speech processing apparatus that distinguishes between the voices of different people.

[0048] The above embodiment has a learning mode in which the pattern recognition unit 104 learns the speech standard patterns needed to recognize the speech type of the input speech signal together with the spectrum of background noise only, and in which the learning-time noise spectrum storage unit 106 learns the spectrum of background noise only. This learning may instead be performed by another apparatus. That is, the pattern recognition unit 104 and the learning-time noise spectrum storage unit 106 may be trained by another apparatus and then incorporated into this embodiment.

[0049]

Effects of the Invention: As described above, in the speech processing apparatus of the present invention, when the difference between the spectral feature parameter and the estimated noise spectrum is less than or equal to the reference pattern noise spectrum value, the estimated speech spectrum calculation means substitutes the reference pattern noise spectrum for that value. This produces the following effects.

[0050] Even when the estimated noise spectrum value exceeds the true noise spectrum value, the estimated speech spectrum value is prevented from becoming negative, and the speech recognition performance of the pattern recognition means can be improved.

[0051] Furthermore, the spectrum of the background noise sections of the estimated speech spectrum can be made to match the spectrum of the background noise at learning time. In the comparison between the estimated speech spectrum and the speech standard patterns, collation errors in the background noise sections are thereby eliminated, and speech recognition performance improves as a result.

[Brief description of the drawings]

FIG. 1 is a functional block diagram of the embodiment.

FIG. 2 is a table illustrating the effects of the embodiment.

[Explanation of symbols]

101: acoustic analysis unit; 102: speech detection unit; 104: pattern recognition unit; 105: estimated noise spectrum storage memory; 106: learning-time noise spectrum storage unit; 110: estimated noise spectrum calculation unit; 111: estimated speech spectrum calculation unit.

Claims

[Claims]

1. A speech processing apparatus that spectrally analyzes an input speech signal including sections in which no speech is input, obtains spectral feature parameters, and processes the input speech signal, the apparatus comprising: acoustic analysis means for spectrally analyzing the input speech signal frame by frame, converting it into spectral feature parameters, and calculating the acoustic power; speech detection means for detecting, from the acoustic power, a speech start frame, which is the first frame of the section in which speech is input; estimated noise spectrum calculation means for estimating a noise spectrum from the spectral feature parameters of the section extending from the frame immediately preceding the speech start frame back to a frame a fixed time earlier, and outputting the estimated noise spectrum; estimated noise spectrum holding means for holding the estimated noise spectrum output from the estimated noise spectrum calculation means; reference pattern noise spectrum holding means for holding the spectral feature parameters of background noise only as a reference pattern noise spectrum; estimated speech spectrum calculation means for calculating an estimated speech spectrum as the difference between the spectral feature parameters and the estimated noise spectrum and, where the estimated speech spectrum value is less than or equal to the reference pattern noise spectrum value, using the reference pattern noise spectrum as the estimated speech spectrum; and pattern recognition means that holds internally, as reference parameters, the spectral feature parameters of speech and the spectral feature parameters of background noise only, and identifies the input speech by collating the estimated speech spectrum against the internally held reference parameters.

2. The speech processing apparatus according to claim 1, wherein the reference pattern noise spectrum holding means holds, as the reference pattern noise spectrum, the spectral feature parameters of background noise only that are output from the acoustic analysis means when the reference parameters needed for speech processing are learned; and wherein the pattern recognition means, when the reference parameters needed for speech processing are learned, stores internally as the reference parameters the speech standard patterns, which are spectral feature parameters of speech, and the spectral feature parameters of background noise only output from the acoustic analysis means, and, during input speech processing, identifies the input speech by collating the estimated speech spectrum against the internally stored reference parameters.