JP3493849B2 - Voice recognition device

Info

Publication number: JP3493849B2
Application number: JP30154995A
Authority: JP
Inventors: 活樹南野; 雅男渡; 和夫石井; 弘史角田; 浩明小川; 雅則表; 等本田; 聡藤村
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1995-11-20
Filing date: 1995-11-20
Publication date: 2004-02-03
Anticipated expiration: 2015-11-20
Also published as: JPH09146586A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a method by which a voice recognition device prompts the user to speak louder according to the level of background noise. More generally, the invention belongs to the field of speech recognition. Speech recognition technology is used when equipment is operated by voice, when speech replaces a keyboard or similar device as an input means, and as an interface between humans and robots or other systems.

[0002]

[Prior Art] Two techniques have been widely used for speech recognition: a pattern matching method called DP matching, and a probabilistic method based on the HMM (Hidden Markov Model). In DP matching, a reference pattern is prepared in advance for each candidate word to be recognized. The feature pattern obtained by analyzing the input speech signal is matched against every reference pattern (that is, a distance is computed) while the time axes are aligned, and the most similar pattern is selected.

[0003] In the HMM method, a probabilistic model for each candidate word is prepared in advance using a Markov model that has several states and symbol output probabilities associated with its state transitions. The probability that each model generates the feature sequence obtained by analyzing the input speech signal is computed, and the model with the highest probability is selected. In continuous speech recognition, grammar and the semantic connections between words are used in addition to DP matching or the HMM method to determine the word sequence corresponding to the input signal.

[0004]

[Problems to Be Solved by the Invention] Speech recognition of this kind may be used in a wide variety of environments, and background noise is often a cause of degraded recognition performance. In particular, when background noise is superimposed on the input speech, the features obtained by acoustic analysis of the input signal change substantially and end up far from the reference patterns or probabilistic models, so the speech cannot be recognized.

[0005] To avoid this, noise removal methods such as spectral subtraction and adaptive processing methods using multiple microphones have been proposed. These aim to remove the noise contained in the input signal, but removing noise from a speech signal that it has already corrupted is a very difficult problem.

[0006] The object of the present invention is to observe the level of background noise and, according to that level, prompt the user to speak louder. By prompting the user to speak louder, the invention reduces the influence of the background noise itself and thereby aims to prevent degradation of recognition performance.

[0007]

[Means for Solving the Problems] A voice recognition device according to the present invention comprises: an input unit for a speech signal; an acoustic analysis unit that analyzes the input signal and extracts features; a parameter storage unit that stores parameters for recognition, obtained in a learning process from features extracted in advance from training data; a recognition unit that determines the word or word sequence corresponding to the input signal by scoring, based on distance, occurrence probability, or the like, using the features obtained by analyzing the input signal and the stored parameters; a determination unit that calculates the energy of the input speech signal and determines whether that energy is smaller than a threshold obtained from the mean and standard deviation of the background noise energy; and a warning unit that warns the user to speak louder when the energy of the input speech signal is determined to be smaller than the threshold.

Further, when obtaining the mean and standard deviation of the background noise energy, the determination unit updates them successively using input signals judged to be non-speech on the basis of a speech/non-speech discrimination result.

Further, as a condition for updating the average spectrum of the non-speech portion and the mean and standard deviation of its energy, the determination unit forces an update when, on the basis of the speech/non-speech discrimination result, the input has continued to be judged as speech for a certain time or longer.

Further, the determination unit uses the energy of the input signal to discriminate speech from non-speech, judging the input as speech when the energy exceeds a predetermined threshold and as non-speech otherwise.

[0008]

[Embodiments of the Invention] An embodiment of the present invention is described in detail below with reference to the drawings. Speech recognition is first described briefly, followed by a detailed description of the system configuration of the invention and its operation.

[0009] (a) Speech recognition

A speech recognition system typically consists of an input unit 11, an acoustic analysis unit 12, a recognition unit 13, a parameter storage unit 14, and an output unit 15, as shown in FIG. 1. The input unit 11 consists of a device for inputting a speech signal, such as a microphone, an amplifier that amplifies the input signal, and an A/D converter that converts it to a digital signal. The input signal is sampled (for example, at 12 kHz) and then sent to the acoustic analysis unit 12.

[0010] The acoustic analysis unit 12 extracts the features required for recognition from the input speech signal. For example, it may extract simple quantities such as signal energy, the number of zero crossings, or pitch, or it may perform frequency analysis by linear predictive analysis (LPC), the fast Fourier transform (FFT), band-pass filters (BPF), or the wavelet transform. Features may be extracted, for example, as a time series of vectors whose elements are band-divided energies. Difference data, representing the change in the features over time, may also be extracted at the same time as additional features. The features thus obtained may be further transformed into features with greater separability by applying a suitable mapping such as the KL transform or a neural network. The feature vectors may also be compressed by vector quantization or the like into quantized features. In any case, the acoustic analysis unit 12 extracts the features required for recognition from the input speech signal and sends them to the recognition unit 13.
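
The difference data described in this paragraph can be illustrated with a short sketch. It assumes the features are already available as a list of per-frame vectors and that the first frame receives a zero difference; the function name `add_delta_features` is hypothetical and does not come from the patent.

```python
def add_delta_features(frames):
    """Append first-order differences to each feature vector in a time series."""
    out = []
    for t, cur in enumerate(frames):
        # Assumption: the first frame has no predecessor, so its delta is zero.
        prev = frames[t - 1] if t > 0 else cur
        delta = [c - p for c, p in zip(cur, prev)]
        out.append(list(cur) + delta)
    return out
```

Each output vector is twice as long as its input: the original features followed by their frame-to-frame change.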

[0011] The recognition unit 13 performs recognition on unknown speech data using the parameters 14, which are prepared in advance from features obtained by acoustic analysis of training speech data. Here, recognition means selecting, from a given recognition dictionary, the word that corresponds to the input speech signal. The main recognition methods use DP matching, neural networks, or HMMs.

[0012] In DP matching, reference patterns called templates are obtained in advance as parameters from the features of analyzed speech signals, and the template judged closest to the features of the unknown speech is found. To absorb variations in speaking rate, a technique called dynamic time warping is often used, which stretches and compresses the time axis so as to minimize the distortion with respect to the template.
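
A minimal sketch of template matching with dynamic time warping follows. It is an illustration under stated assumptions rather than the method of the patent: the local distance is taken to be Euclidean, the allowed steps are the three basic ones (match, insertion, deletion), and the names `dtw_distance` and `recognize_dtw` are hypothetical.

```python
from math import inf


def dtw_distance(template, observed):
    """Accumulated DTW distance between two sequences of feature vectors."""
    def frame_dist(a, b):
        # Assumed local distance: Euclidean distance between two frames.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n, m = len(template), len(observed)
    # cost[i][j]: minimal accumulated distance aligning template[:i] with observed[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(template[i - 1], observed[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize_dtw(templates, observed):
    """Return the dictionary word whose template is closest to the input."""
    return min(templates, key=lambda word: dtw_distance(templates[word], observed))
```

Because the time axis may stretch, an utterance spoken more slowly than its template can still align with zero accumulated distance.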

[0013] A neural network performs recognition with a network model that imitates the structure of the human brain. The weight coefficients of its paths are determined in advance as parameters through a learning process. The features of the unknown speech are fed to the network, the distance to each word in the dictionary is computed from the resulting output, and the recognized word is determined. An HMM performs recognition with a probabilistic model. The transition probabilities and output symbol probabilities of a state transition model are determined in advance from training data, and the recognized word is determined from the probability that each model generates the features of the unknown speech.
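
The HMM scoring step can be sketched with the forward algorithm for a discrete-symbol model. This is a simplified illustration: symbols are assumed to be emitted by states rather than by transitions, the model parameters are assumed to be already trained, and the names `forward_probability` and `recognize_hmm` are hypothetical.

```python
def forward_probability(symbols, initial, transition, emission):
    """Probability that a discrete HMM generates the observed symbol sequence."""
    n_states = len(initial)
    # alpha[s]: probability of the symbols seen so far with the model in state s
    alpha = [initial[s] * emission[s][symbols[0]] for s in range(n_states)]
    for sym in symbols[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states)) * emission[s][sym]
            for s in range(n_states)
        ]
    return sum(alpha)


def recognize_hmm(models, symbols):
    """Return the word whose model gives the highest occurrence probability.

    models maps each word to a tuple (initial, transition, emission).
    """
    return max(models, key=lambda word: forward_probability(symbols, *models[word]))
```

In practice log probabilities are usually used to avoid underflow on long sequences; the plain products are kept here for readability.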

[0014] As described above, recognition generally involves a learning process in which parameters determined from training data (templates, weight coefficients of a network model, statistical parameters of a probabilistic model, and so on) are obtained in advance and stored in the parameter storage unit 14. In the recognition process, the input unknown speech signal undergoes acoustic analysis 12, and each word in the given dictionary is then scored using the learned parameters 14, with a distance, occurrence probability, or other measure appropriate to the recognition method. The word with the highest score (or the several highest-scoring words) is selected as the recognition result and sent to the output unit 15. The output unit 15 displays the received recognition result on a screen, outputs it as sound, or uses it to issue commands such as operating another device.

[0015] The above briefly describes isolated word recognition. For continuous speech recognition, the parameter storage unit 14 also stores the grammar of the language, the semantic connection relationships between words, and so on. Using these constraints, the recognition unit 13 determines the sequence of consecutive words corresponding to the input speech signal, and the recognition result is sent to the output unit 15.

[0016] (b) System configuration

In the speech recognition processing described above, the acoustic analysis unit 12 outputs a sequence of features obtained by analyzing the input speech signal at predetermined short time intervals. This feature sequence is used in both learning and recognition. The problem is that the features extracted by the acoustic analysis unit 12 vary greatly under the influence of background noise. A voice recognition device configured as shown in FIG. 2 is therefore realized. The input unit 21, acoustic analysis unit 22, recognition unit 23, parameter storage unit 24, and output unit 25 are the same as the corresponding units in FIG. 1. A background noise observation unit 26 and a warning unit 27 are added to them.

[0017] The background noise observation unit 26 calculates the mean μ and standard deviation σ of the background noise energy. The calculation method is described later. A function that determines a threshold r from μ and σ, equation (1), is defined in advance.

[0018]

[Equation 1]

It is then determined whether the energy e of the input signal satisfies equation (2).

[0019]

[Equation 2]

The result of this determination is sent to the warning unit 27. For example, as in equation (3), the threshold r is set to the mean background noise energy μ plus c times the standard deviation σ.

[0020]

[Equation 3] r = μ + cσ

An upper limit r_max and a lower limit r_min are, however, set for the threshold, as in equation (4).

[Equation 4]

It is then determined whether the speech was input so as to satisfy equation (2), and the result is sent to the warning unit 27.
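
The threshold and the loudness check can be sketched as follows. The equation images are not reproduced in this text, so several points are assumptions: equation (2) is taken to be e > r, equation (4) is taken to clamp r to the range from r_min to r_max, and the default c = 2.0 is an arbitrary illustrative value. The function names are hypothetical.

```python
def noise_threshold(mu, sigma, c=2.0, r_min=0.0, r_max=float("inf")):
    """Threshold r = mu + c * sigma, limited to the range [r_min, r_max]."""
    r = mu + c * sigma
    # Assumed reading of equation (4): keep r between the lower and upper limits.
    return min(max(r, r_min), r_max)


def needs_warning(e, r):
    """True when the input energy e fails the assumed condition e > r."""
    return not (e > r)
```

With μ = 10, σ = 2, and c = 3 the threshold is 16, so an utterance with energy 17 passes and one with energy 16 triggers the warning.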

[0021] Based on this determination result, the warning unit 27 warns the user so that the energy e of the input signal will satisfy equation (2). For example, an LED may be placed where the user can see it and lit only while equation (2) is satisfied. Alternatively, a display 31 such as that in FIG. 3 may be provided, on which a level meter 33 moves according to the magnitude of e and a warning line 32 is drawn at the threshold r. The user is thus asked to speak so that the LED lights or so that the level meter 33 exceeds the warning line 32.

[0022] Alternatively, for an utterance that does not satisfy decision equation (2), a warning sound may be played, or a warning message may be displayed or spoken in response. For example, a message such as "Please speak a little louder." may be sent to the user.

[0023] (c) Background noise observation unit

FIG. 4 shows the operation of the background noise observation unit 26 in detail. First, the input speech signal is frequency-analyzed at predetermined short time intervals (4-1) to obtain a feature vector whose elements are band-divided energies (hereinafter called the input spectrum). For example, by using band-pass filters or by computing the amplitude characteristic with an FFT, a feature vector whose elements are the energies of the individual bands is obtained at fixed time intervals. This vector is denoted X_t. As shown in equation (5), X_t is an n-dimensional vector observed at time t.

[0024]

[Equation 5]

Since each element of X_t represents an energy, equation (6) is satisfied.

[Equation 6]

Here, T denotes transposition. This frequency analysis may reuse the analysis result of the acoustic analysis unit 22. Next, the energy of the input signal is calculated according to equation (7), and it is determined whether equation (2) is satisfied (4-2).
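
One way to obtain an input spectrum X_t and its energy e is sketched below. It assumes a naive discrete Fourier transform over a single frame, equal-width bands, and an energy e equal to the sum of the band energies; the last point is an assumed reading of equation (7), whose image is not reproduced here. The function names are hypothetical.

```python
import cmath


def band_energies(frame, n_bands):
    """Sum the power spectrum of one frame into n_bands equal-width bands."""
    n = len(frame)
    half = n // 2
    power = []
    # Naive O(n^2) DFT over bins 0 .. half-1; adequate for a short sketch.
    for k in range(half):
        acc = sum(frame[i] * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n))
        power.append(abs(acc) ** 2)
    width = half // n_bands
    # Bins left over when half is not a multiple of n_bands are ignored.
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]


def frame_energy(x_t):
    """Assumed form of the frame energy: the sum of the band energies."""
    return sum(x_t)
```

Every element of the returned vector is a sum of squared magnitudes and is therefore non-negative, which matches the statement that the elements of X_t represent energies.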

[0025]

[Equation 7]

This result is sent to the warning unit 27. The average spectrum V of the background noise and the mean μ and standard deviation σ of its energy are then updated (4-3). Here, V is an n-dimensional vector expressed as in equation (8).

[0026]

[Equation 8]

The method of updating these parameters is described in detail later. Finally, the threshold r is calculated from μ and σ by equation (1) (4-4). Note that the threshold r used in the determination of equation (2) at time t is the one obtained at time t−1.

[0027] (d) Method of updating the background noise

Next, the method of obtaining the average spectrum V of the background noise and the mean μ and standard deviation σ of its energy is described. Basically, speech/non-speech discrimination is performed on the input signal, and the update of equation (9) is carried out only when the input is judged to be non-speech.

[0028]

[Equation 9]

Here, V on the right-hand side is the value before the update and V on the left-hand side is the value after the update. α₁ and β₁ are weight coefficients related as in equation (10).

[Equation 10]

For example, they are set to α₁ = 0.95 and β₁ = 0.05. Increasing β₁ makes V reflect more recent X_t more strongly. Using the updated V, the mean energy of the background noise is expressed as in equation (11).

[0029]

[Equation 11]

The standard deviation σ is obtained by updating the variance as in equation (12).

[Equation 12]

Here, σ² on the right-hand side is the value before the update and σ² on the left-hand side is the value after the update; e and μ are the values obtained from equations (7) and (11), respectively. α₂ and β₂ are weight coefficients set so as to satisfy equation (13).
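
The recursive update can be sketched as follows. Because the equation images are not reproduced, the forms used here are assumptions inferred from the surrounding text: V ← α₁V + β₁X_t with α₁ + β₁ = 1, μ equal to the sum of the elements of V, e equal to the sum of the elements of X_t, and σ² ← α₂σ² + β₂(e − μ)² with α₂ + β₂ = 1. The default for α₂ and the function name are hypothetical.

```python
def update_noise_stats(V, var, x_t, alpha1=0.95, alpha2=0.95):
    """One recursive update of the noise spectrum V, mean energy, and variance."""
    beta1 = 1.0 - alpha1  # assumed constraint: alpha1 + beta1 = 1
    beta2 = 1.0 - alpha2  # assumed constraint: alpha2 + beta2 = 1
    V_new = [alpha1 * v + beta1 * x for v, x in zip(V, x_t)]
    mu = sum(V_new)       # assumed: mean energy is the sum of the elements of V
    e = sum(x_t)          # assumed: frame energy is the sum of the elements of X_t
    var_new = alpha2 * var + beta2 * (e - mu) ** 2
    return V_new, mu, var_new
```

The standard deviation is the square root of the returned variance. A call is intended only for frames judged to be non-speech, as the text describes.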

[0030]

[Equation 13]

As an alternative to the above update method, the m most recent input spectra X_t from times at which an update was judged appropriate can be kept at all times, and their average computed as the new V. The mean energy μ and standard deviation σ are then also easily obtained.
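
The alternative that keeps the m most recent spectra can be sketched with a fixed-length buffer. The class name `NoiseWindow` is hypothetical, the frame energy is assumed to be the sum of the band energies, and the population standard deviation is an assumed choice.

```python
from collections import deque
from statistics import fmean, pstdev


class NoiseWindow:
    """Keeps the m most recent noise spectra and derives V, mu, and sigma."""

    def __init__(self, m):
        # A deque with maxlen discards the oldest spectrum automatically.
        self.frames = deque(maxlen=m)

    def add(self, x_t):
        self.frames.append(list(x_t))

    def stats(self):
        """Return (V, mu, sigma); requires at least one stored spectrum."""
        n = len(self.frames)
        dims = len(self.frames[0])
        V = [sum(f[i] for f in self.frames) / n for i in range(dims)]
        energies = [sum(f) for f in self.frames]  # assumed frame energy
        return V, fmean(energies), pstdev(energies)
```

Compared with the recursive update, this variant forgets old frames completely after m updates, at the cost of storing m vectors.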

[0031] With the update method described above, however, there is a risk that the average spectrum of the background noise is never updated again, for example when the environment changes and the input is mistakenly judged to be speech even though no speech is being input.

[0032] The average spectrum V of the background noise therefore needs to track changes in the environment. One way to do this is to estimate the maximum length of an input utterance in advance and to force an update of the vector V when the time judged as speech is too long. That is, the time d for which the input has been continuously judged as speech is counted, and when it exceeds a threshold d_max, that is, when equation (14) holds, V, μ, and σ are forcibly updated even though the input is judged to be speech.

[0033]

[Equation 14]

The update method in this case is as described above. FIG. 5 summarizes the background noise update method described so far. To restate: first, based on the speech/non-speech determination (5-1), V is updated if the input is judged to be non-speech (5-4). Otherwise, V is also updated (5-4) if decision equation (14) judges the continuous speech portion to be long (5-2). By updating V in this way, a more recent average spectrum V of the background noise, and the mean μ and standard deviation σ of its energy, are obtained while following changes in the environment to some extent.
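
The update flow of FIG. 5 can be sketched as a small state machine. Several details are assumptions: equation (14) is taken to be d > d_max, d is counted in frames, the forced update keeps firing on every frame while the speech judgment continues, and only V is tracked to keep the example short. The class name `NoiseTracker` is hypothetical.

```python
class NoiseTracker:
    """Updates the noise spectrum V on non-speech frames, or by force."""

    def __init__(self, V, d_max, alpha=0.95):
        self.V = list(V)
        self.d_max = d_max  # longest run of speech frames accepted as real speech
        self.alpha = alpha
        self.d = 0          # number of consecutive frames judged as speech

    def step(self, x_t, is_speech):
        """Process one frame and return True if V was updated."""
        if is_speech:
            self.d += 1
            if self.d <= self.d_max:
                return False  # plausible speech: leave the noise estimate alone
        else:
            self.d = 0
        # Reached for non-speech frames and for over-long speech runs.
        beta = 1.0 - self.alpha
        self.V = [self.alpha * v + beta * x for v, x in zip(self.V, x_t)]
        return True
```

If a change in the environment causes every frame to be judged as speech, the estimate starts adapting again once the run exceeds d_max frames, which is the behavior the forced update is meant to provide.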

[0034] (e) Method of discriminating speech and non-speech portions

The speech/non-speech discrimination method, which is important for updating the background noise spectrum and energy, is now described briefly. Basically, attention is paid to the difference between the frequency characteristics (spectra) of background noise and speech, and speech is distinguished from non-speech by whether the input is far from or close to the average spectrum of the background noise. First, the difference vector V − X_t is obtained from the input spectrum X_t and the average spectrum V of the background noise. The input is judged to be speech when the magnitude ε of this vector is large and non-speech when it is small. That is, for a threshold r₁, the input is judged to be speech if equation (15) is satisfied and non-speech otherwise.

[0035]

[Equation 15]

[0036] For ε, equations (16), (17), and the like can be used.

[Equation 16]

[Equation 17]

A Mahalanobis distance, which takes the variance into account, may also be used. The threshold r₁ in decision equation (15) is preferably set to a value found suitable through experiments. Alternatively, r₁ may be varied according to the mean μ and standard deviation σ of the background noise energy, like the r used in decision equation (2). In any case, the input should be judged as speech when X_t is far from V and as non-speech when X_t lies near V.
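
The spectral-distance test can be sketched as follows. Equations (16) and (17) are not reproduced in this text, so the Euclidean and sum-of-absolute-differences distances below are assumed examples of a magnitude ε, and decision equation (15) is assumed to be ε > r₁. The function names are hypothetical.

```python
def euclidean(V, x_t):
    """Euclidean length of the difference vector V - X_t."""
    return sum((v - x) ** 2 for v, x in zip(V, x_t)) ** 0.5


def manhattan(V, x_t):
    """Sum of absolute element differences between V and X_t."""
    return sum(abs(v - x) for v, x in zip(V, x_t))


def is_speech_by_spectrum(V, x_t, r1, distance=euclidean):
    """Speech when the input spectrum is farther than r1 from the noise spectrum."""
    return distance(V, x_t) > r1
```

The same threshold means different things for different distances, so r₁ has to be tuned for whichever measure is chosen.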

[0037] As another method of discriminating speech from non-speech, the speech and non-speech portions can simply be determined from the energy of the input signal. Specifically, the energy e of the input signal is compared with a threshold r₂, and the input is judged to be speech if equation (18) holds and non-speech otherwise.

[0038]

[Equation 18]

In this case too, the threshold r₂ is preferably set to a value found suitable through experiments. Alternatively, r₂ may be varied according to the mean μ and standard deviation σ of the background noise energy, in the same way as the threshold r. As yet another method, the number of times z per unit time that adjacent samples of the input signal take values of opposite sign (called the zero-crossing rate) can be calculated and compared with a threshold r₃, and the input judged to be speech if equation (19) holds and non-speech otherwise.

[0039]

[Equation 19]

Decision equations (15), (18), and (19) can also be combined for discrimination. Speech/non-speech determination methods other than those above can of course also be used.
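
One possible combination of the three criteria is sketched below. The text does not say how they are combined, so the majority vote is purely an assumption, as are the forms ε > r₁, e > r₂, and z > r₃ for equations (15), (18), and (19). The zero crossings are counted per frame rather than per unit time, and the function names are hypothetical.

```python
def zero_crossings(samples):
    """Count adjacent sample pairs whose signs differ (zero counts as positive)."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))


def is_speech_combined(samples, e, eps, r1, r2, r3):
    """Majority vote over the spectral, energy, and zero-crossing criteria."""
    votes = [
        eps > r1,                     # assumed form of equation (15)
        e > r2,                       # assumed form of equation (18)
        zero_crossings(samples) > r3  # assumed form of equation (19)
    ]
    return sum(votes) >= 2
```

A stricter rule (all three criteria must agree) or a looser one (any single criterion) could be substituted by changing the final comparison.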

[0040]

[Effects of the Invention] As described above, if the background noise is observed and the user is warned to speak louder according to the mean and standard deviation of the background noise energy, the user will speak louder accordingly. As a result, the influence of the background noise is reduced and degradation of recognition performance can be prevented. Note that the present invention aims to prompt the user to speak louder and does not aim to extract speech segments from the speech/non-speech discrimination result; in this respect it differs greatly from conventional speech segment detection.

[Brief description of drawings]

[FIG. 1] A block diagram showing the general configuration of a voice recognition device.

[FIG. 2] A block diagram showing the configuration of the voice recognition device of the present invention.

[FIG. 3] A diagram showing an example of the display in the warning unit.

[FIG. 4] A diagram explaining the operation of the background noise observation unit.

[FIG. 5] A diagram explaining the method of updating the background noise.

[Explanation of symbols]

21: input unit, 22: acoustic analysis unit, 23: recognition unit, 24: parameter storage unit, 25: output unit, 26: background noise observation unit, 27: warning unit

Continuation of the front page

(72) Inventor: 角田弘史, c/o Sony Corporation, 6-7-35 Kitashinagawa, Shinagawa-ku, Tokyo
(72) Inventor: 小川浩明, c/o Sony Corporation, 6-7-35 Kitashinagawa, Shinagawa-ku, Tokyo
(72) Inventor: 表雅則, c/o Sony Corporation, 6-7-35 Kitashinagawa, Shinagawa-ku, Tokyo
(72) Inventor: 本田等, c/o Sony Corporation, 6-7-35 Kitashinagawa, Shinagawa-ku, Tokyo
(72) Inventor: 藤村聡, c/o Sony Corporation, 6-7-35 Kitashinagawa, Shinagawa-ku, Tokyo

(56) References cited: JP 63-223795 (JP, A); JP 7-64595 (JP, A); JP 58-85498 (JP, A); JP 1-502858 (JP, A); JP 1-302298 (JP, A); JP 6-75588 (JP, A); JP 8-23756 (JP, B2); Japanese Patent 2589468 (JP, B2); Furui, "Digital Speech Processing" (ディジタル音声処理), Tokai University Press, Japan, September 25, 1985, p. 153, lines 10-16

(58) Fields searched (Int.Cl.7, DB name): G10L 11/00 - 11/02, G10L 15/00 - 15/28

Claims

(57) [Claims]

1. A voice recognition device comprising: an input unit for a speech signal; an acoustic analysis unit that analyzes the input signal and extracts features; a parameter storage unit that stores parameters for recognition, obtained in a learning process from features extracted in advance from training data; and a recognition unit that determines the word or word sequence corresponding to the input signal by scoring, based on distance, occurrence probability, or the like, using the features obtained by analyzing the input signal and the stored parameters; the voice recognition device further comprising: a determination unit that calculates the energy of the input speech signal and determines whether that energy is smaller than a threshold obtained from the mean and standard deviation of the background noise energy; and a warning unit that warns the user to speak louder when the energy of the input speech signal is determined to be smaller than the threshold.

2. The voice recognition device according to claim 1, wherein, when obtaining the mean and standard deviation of the background noise energy, the determination unit updates them successively using input signals judged to be non-speech on the basis of a speech/non-speech discrimination result.

3. The voice recognition device according to claim 2, wherein, as a condition for updating the average spectrum of the non-speech portion and the mean and standard deviation of its energy, the determination unit forces the update when, on the basis of the speech/non-speech discrimination result, the input has continued to be judged as speech for a certain time or longer.

4. The voice recognition device according to claim 2, wherein the determination unit uses the energy of the input signal to discriminate speech from non-speech, judging the input as speech when the energy exceeds a predetermined threshold and as non-speech otherwise.

5. The voice recognition device according to claim 2, wherein the determination unit uses the zero-crossing rate of the input signal to discriminate speech from non-speech, judging the input as speech when the rate exceeds a predetermined threshold and as non-speech otherwise.

6. The voice recognition device according to claim 2, wherein the determination unit uses the spectrum obtained by frequency analysis of the input signal to discriminate speech from non-speech, judging the input as speech when its difference from the average spectrum of the non-speech portion exceeds a predetermined threshold and as non-speech otherwise.

7. The voice recognition device according to claim 2, wherein, to discriminate speech from non-speech, the determination unit combines three features: the energy of the input signal, the zero-crossing rate, and the difference between the spectrum obtained by frequency analysis of the input signal and the average spectrum of the non-speech portion.