JP2853731B2

JP2853731B2 - Voice recognition device

Info

Publication number: JP2853731B2
Application number: JP7136725A
Authority: JP
Inventors: 信輔坂井
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1995-06-02
Filing date: 1995-06-02
Publication date: 1999-02-03
Anticipated expiration: 2014-02-03
Also published as: JPH08328583A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device.

[0002]

[Prior Art] Because a speech recognition device requires a very large amount of computation, attempts have long been made to reduce that computation by beam search. Two methods of setting the beam width for candidate pruning in beam search are well known: at each pruning step, keeping a fixed number of candidates in descending order of likelihood, and keeping the candidates whose likelihood lies within a fixed range of the maximum likelihood.
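The two conventional beam-width settings described above can be sketched as follows. This is a minimal illustration, not code from the patent: the function names, the dictionary representation of candidates, and the example scores are all hypothetical, and a higher score is assumed to mean a more likely candidate.

```python
import heapq

def prune_top_m(scores, m):
    """Fixed-number beam: keep the m candidates with the highest scores.

    Selecting the m best requires a partial sort at every pruning step.
    """
    return dict(heapq.nlargest(m, scores.items(), key=lambda kv: kv[1]))

def prune_by_margin(scores, width):
    """Fixed-width beam: keep candidates scoring within `width` of the best."""
    best = max(scores.values())
    return {c: s for c, s in scores.items() if s >= best - width}

# Hypothetical candidate scores for one frame (assumption: higher is better).
scores = {"a": -1.0, "b": -1.5, "c": -4.0, "d": -9.0}
top2 = prune_top_m(scores, 2)
near = prune_by_margin(scores, 3.5)
```

The fixed-width variant needs only a maximum and one comparison per candidate, while the fixed-number variant must rank candidates; the number of survivors of the fixed-width variant depends on how the scores are spread.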

[0003] In the paper "Beam Search in Continuous Speech Recognition" by Ito et al., published in the proceedings of the Acoustical Society research presentation meeting, October 1993, pp. 73-74, it is reported that the method of keeping a fixed number of candidates gives search efficiency that is more stable against changes in the beam width setting. In addition, the number of candidates whose likelihood lies within a fixed range of the maximum likelihood is thought to vary under the influence of the ambient noise environment at the time of utterance, whereas the method of keeping a fixed number of candidates has no such variation in the number of candidates.

[0004]

[Problem to be Solved by the Invention] However, the above method of keeping a fixed number of candidates (say M) requires a sorting operation at each pruning step in order to find the M-th ranked candidate, and therefore has the drawback of a large processing load.

[0005]

[Means for Solving the Problem] According to the invention of claim 1, a speech recognition device is obtained which is characterized in that, in the first several frames of the input speech, the difference between the maximum likelihood and the likelihood of the candidate at a fixed rank is determined, and thereafter a threshold for candidate pruning in the beam search is set using that difference.

[0006] According to the invention of claim 2, a speech recognition device is obtained which is characterized by comprising: a feature extraction unit that analyzes a speech signal and outputs a feature vector time series; a standard pattern storage unit that stores standard patterns created in advance; a cumulative likelihood storage unit that holds cumulative likelihoods; a recurrence formula calculation unit that obtains new cumulative likelihoods from the cumulative likelihoods stored in the cumulative likelihood storage unit, the feature vector time series, and the standard patterns; a cumulative likelihood output unit that, for a certain subsequence of the feature vector time series, outputs a fixed number of the cumulative likelihoods obtained by the recurrence formula calculation unit and accumulates the difference between the maximum cumulative likelihood and the likelihood at that fixed rank, and, for subsequent subsequences, determines the cumulative likelihoods to be output by means of a threshold obtained using the accumulated likelihood difference; and a result output unit that obtains a recognition result for the speech signal from the cumulative likelihoods output by the cumulative likelihood output unit.

[0007] According to the invention of claim 3, a threshold setting method in a speech recognition device is obtained which is characterized in that, for each of an arbitrary number of subsequences of the input speech, the average of the difference between the cumulative likelihood of the M-th ranked candidate and the maximum cumulative likelihood is obtained, and, until the next subsequence, the threshold for candidate pruning is set using the average of the difference obtained in the preceding subsequence.

[0008]

[Embodiment] Next, the present invention will be described with reference to the drawings.

[0009] FIG. 1 is a block diagram showing one embodiment of the present invention. Referring to FIG. 1, the embodiment comprises a feature extraction unit 101, a standard pattern storage unit 102, a cumulative likelihood storage unit 103, a recurrence formula calculation unit 104, a cumulative likelihood output unit 105, and a result output unit 106.

[0010] The feature extraction unit 101 converts the speech input into a time series of feature vectors and outputs it to the recurrence formula calculation unit 104. The standard pattern storage unit 102 stores the standard patterns. The cumulative likelihood storage unit 103 stores the cumulative likelihoods output from the cumulative likelihood output unit 105; before processing starts, it holds the initial cumulative likelihood value 1.0 for every recognition path candidate. The recurrence formula calculation unit 104 obtains the cumulative likelihoods up to the i-th frame from the feature vector of the i-th frame, the standard patterns, and the cumulative likelihoods up to the (i−1)-th frame. The cumulative likelihood output unit 105 selects, from the input set of cumulative likelihoods, those to be used in the cumulative likelihood calculation for the next frame, and outputs them to the cumulative likelihood storage unit 103. The result output unit 106 outputs the recognition result based on the cumulative likelihoods up to the final frame.

[0011] Next, the operation of this embodiment will be described with reference to FIGS. 1 and 2.

[0012] In the feature extraction unit 101, the input speech is converted at fixed time intervals into feature vectors representing the frequency spectrum of the speech, which are output to the recurrence formula calculation unit 104. This fixed time interval is hereinafter called a frame. In the i-th frame, the recurrence formula calculation unit 104 uses the standard patterns $REF = \{R_1, \ldots, R_N\}$ held in the standard pattern storage unit 102, where $R_w = \{r_w(1), \ldots, r_w(J_w)\}$, to obtain the local likelihood $l_w(i, j)$ ($w = 1, \ldots, N$; $j = 1, \ldots, J_w$) of the feature vector of the current frame with respect to each standard pattern. Here $N$ is the number of standard patterns and $J_w$ is the frame length of the w-th standard pattern. Next, from these local likelihoods and the cumulative likelihood set of the (i−1)-th frame held in the cumulative likelihood storage unit 103, $G = \{g_1(i-1, 1), \ldots, g_1(i-1, J_1), \ldots, g_N(i-1, 1), \ldots, g_N(i-1, J_N)\}$, the recognition path candidates of the current frame and their cumulative likelihoods are obtained as Equation 1 below by a maximization process based on dynamic programming (step 1 in FIG. 2).

[0013]

[Equation 1]

The cumulative likelihood output unit 105 compares i with a predetermined K. If i ≤ K, it finds the M-th largest cumulative likelihood, takes this as the threshold TH for candidate pruning, and obtains the difference d between the maximum likelihood and TH. So that an average can be computed later, the running sum $S_d$ of d is updated as $S_d = S_d + d$ (steps 2, 6, and 7).

[0014] Note that $S_d$ is initialized to 0 before the first frame.

[0015] If i = K, the average $D = S_d / K$ of the difference between the maximum likelihood and the candidate pruning threshold over the K frames is obtained (step 3).

[0016] If i > K, the threshold TH for candidate pruning is set to $TH = g_{max} - D$, where $g_{max}$ is the maximum cumulative likelihood in the i-th frame (step 4).

[0017] In each frame, the cumulative likelihood output unit 105 outputs to the cumulative likelihood storage unit only those recognition path candidates whose cumulative likelihood is greater than the threshold TH (step 5).
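The threshold logic of steps 2 through 7 can be sketched as below. This is only an illustration under stated assumptions: the class name `AdaptiveBeamPruner`, its attribute names, and the dictionary representation of the candidates are hypothetical and do not come from the patent. The text keeps candidates whose likelihood is greater than TH; the sketch uses `>=` so that the M-th candidate itself survives during the first K frames, which is an assumption.

```python
import heapq

class AdaptiveBeamPruner:
    """Learn the average gap D over the first K frames, then prune by g_max - D."""

    def __init__(self, m, k):
        self.m = m          # rank M of the candidate that defines the gap
        self.k = k          # number K of learning frames
        self.frame = 0      # current frame index i (1-based after first call)
        self.sum_d = 0.0    # running sum S_d, initialized to 0
        self.avg_d = None   # average gap D, available once i == K

    def prune(self, scores):
        """scores maps each path candidate to its cumulative likelihood."""
        self.frame += 1
        g_max = max(scores.values())
        if self.frame <= self.k:
            # i <= K: TH is the M-th largest cumulative likelihood
            # (or the smallest one if fewer than M candidates exist).
            th = heapq.nlargest(self.m, scores.values())[-1]
            self.sum_d += g_max - th          # S_d = S_d + d
            if self.frame == self.k:
                self.avg_d = self.sum_d / self.k   # D = S_d / K
        else:
            # i > K: no ranking needed, only the maximum and one subtraction.
            th = g_max - self.avg_d           # TH = g_max - D
        # Assumption: ">=" instead of the text's "greater than".
        return {c: g for c, g in scores.items() if g >= th}

pruner = AdaptiveBeamPruner(m=2, k=2)
f1 = pruner.prune({"a": 10.0, "b": 8.0, "c": 5.0})
f2 = pruner.prune({"a": 20.0, "b": 16.0, "c": 15.0})
f3 = pruner.prune({"a": 30.0, "b": 28.0, "c": 27.0, "d": 20.0})
```

In this usage the gaps in the two learning frames are 2 and 4, so D is 3 and the third frame is pruned at 27 without ranking its candidates.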

[0018] If the current frame is the final frame, the cumulative likelihood output unit 105 outputs to the result output unit 106 all recognition path candidates that have reached the end point of a standard pattern. The result output unit 106 finds the recognition path candidate with the largest cumulative likelihood and outputs the recognition result (steps 8 and 9).

[0019] This embodiment has been described using the example in which the average of the difference between the cumulative likelihood of the M-th ranked candidate and the maximum cumulative likelihood is obtained over the first K frames of the input speech. More generally, the above average can be obtained for each of an arbitrary number $L_{max}$ of subsequences $li_1, li_2, \ldots, li_{L_{max}}$ ($L_{max} \geq 1$) of the input speech (these subsequences are provisionally called learning intervals), and, between learning interval $li_k$ and the next learning interval $li_{k+1}$, the threshold for candidate pruning can be set using the average of the difference obtained in $li_k$.
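The generalization to several learning intervals could look like the sketch below. The function name, the `(start, end)` interval representation with an exclusive end and 0-based frame indices, and the behavior before the first learning interval (nothing is pruned) are all assumptions made for illustration; the text does not specify them. As in the text, the M-th ranked candidate defines the gap inside a learning interval, and the most recent average is used until the next interval begins.

```python
import heapq

def prune_with_learning_intervals(frames, intervals, m):
    """frames: list of dicts (candidate -> cumulative likelihood), one per frame.
    intervals: sorted, non-overlapping (start, end) pairs, end exclusive.
    Returns the surviving candidates for each frame.
    """
    # Map each learning frame to the last frame of its interval.
    last_of = {i: end - 1 for start, end in intervals for i in range(start, end)}
    avg_d = None              # average gap from the latest finished interval
    sum_d, count = 0.0, 0     # accumulators for the interval in progress
    kept = []
    for i, scores in enumerate(frames):
        g_max = max(scores.values())
        if i in last_of:
            # Learning frame: rank candidates and record the gap to rank M.
            th = heapq.nlargest(m, scores.values())[-1]
            sum_d += g_max - th
            count += 1
            if i == last_of[i]:
                avg_d = sum_d / count
                sum_d, count = 0.0, 0
        elif avg_d is not None:
            th = g_max - avg_d
        else:
            th = min(scores.values())   # assumption: no pruning before any learning
        kept.append({c: g for c, g in scores.items() if g >= th})
    return kept

frames = [
    {"a": 10.0, "b": 8.0, "c": 5.0},               # learning interval 1
    {"a": 20.0, "b": 18.5, "c": 17.0},             # pruned with D = 2
    {"a": 30.0, "b": 25.0, "c": 24.0},             # learning interval 2
    {"a": 40.0, "b": 36.0, "c": 35.0, "d": 30.0},  # pruned with D = 5
]
kept = prune_with_learning_intervals(frames, [(0, 1), (2, 3)], m=2)
```

Re-learning the gap in the second interval widens the beam for the last frame, which is the kind of adaptation to changing conditions the paragraph describes.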

[0020]

[Effects of the Invention] As described above, even when the difference between the cumulative likelihood of the M-th ranked candidate and the maximum cumulative likelihood varies greatly in response to changes in the ambient noise environment or in the set value of the beam width M, the speech recognition device according to the present invention obtains the average of this difference using part of the input and uses it to determine the pruning threshold. Candidate pruning equivalent to the method of keeping the candidates up to rank M in cumulative likelihood is therefore performed over every interval of the input, so that the device has stable search efficiency without a large processing load for determining the pruning threshold.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the speech recognition device of the present invention.

FIG. 2 is a flowchart showing the processing flow of the embodiment of the speech recognition device shown in FIG. 1.

[Explanation of symbols]

101 Feature extraction unit
102 Standard pattern storage unit
103 Cumulative likelihood storage unit
104 Recurrence formula calculation unit
105 Cumulative likelihood output unit
106 Result output unit

Continuation of the front page

(58) Fields surveyed (Int. Cl.⁶, DB name): G10L 3/00 561; G10L 5/06; JICST file (JOIS)

Claims

(57) [Claims]

1. A speech recognition device characterized in that, in the first several frames of the input speech, the difference between the maximum likelihood and the likelihood of the candidate at a fixed rank is determined, and thereafter a threshold for candidate pruning in a beam search is set using that difference.

2. A speech recognition device characterized by comprising: a feature extraction unit that analyzes a speech signal and outputs a feature vector time series; a standard pattern storage unit that stores standard patterns created in advance; a cumulative likelihood storage unit that holds cumulative likelihoods; a recurrence formula calculation unit that obtains new cumulative likelihoods from the cumulative likelihoods stored in the cumulative likelihood storage unit, the feature vector time series, and the standard patterns; a cumulative likelihood output unit that, for a certain subsequence of the feature vector time series, outputs a fixed number of the cumulative likelihoods obtained by the recurrence formula calculation unit and accumulates the difference between the maximum cumulative likelihood and the likelihood at that fixed rank, and, for subsequent subsequences, determines the cumulative likelihoods to be output by means of a threshold obtained using the accumulated likelihood difference; and a result output unit that obtains a recognition result for the speech signal from the cumulative likelihoods output by the cumulative likelihood output unit.

3. A threshold setting method in a speech recognition device, characterized in that, for each of an arbitrary number of threshold learning sections within the input speech, the average of the difference between the maximum likelihood and the likelihood of the candidate at a fixed rank is obtained, and, between that threshold learning section and the next threshold learning section, the threshold for candidate pruning is set using the average likelihood difference obtained in that threshold learning section.