JP4748605B2 - Voice recognition method and apparatus, voice recognition program and recording medium therefor - Google Patents

Voice recognition method and apparatus, voice recognition program and recording medium therefor

Info

Publication number
JP4748605B2
JP4748605B2 JP2007087842A
Authority
JP
Japan
Prior art keywords
transition
state
acoustic
speech
likelihood
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP2007087842A
Other languages
Japanese (ja)
Other versions
JP2008249807A (en)
Inventor
恒夫 加藤 (Tsuneo Kato)
顕吾 藤田 (Kengo Fujita)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KDDI Corp
Original Assignee
KDDI Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by KDDI Corp filed Critical KDDI Corp
Priority to JP2007087842A priority Critical patent/JP4748605B2/en
Publication of JP2008249807A publication Critical patent/JP2008249807A/en
Application granted granted Critical
Publication of JP4748605B2 publication Critical patent/JP4748605B2/en

Description

The present invention relates to a speech recognition method and apparatus, a speech recognition program, and a recording medium therefor, and more particularly to a speech recognition method and apparatus that perform speech recognition using a state transition probability model typified by an HMM, as well as a speech recognition program and a recording medium therefor.

In speech recognition, the word sequence closest to an input speech signal is determined on the basis of its similarity (probability) to words expressed by concatenating acoustic models generated in advance. The HMM (Hidden Markov Model) is one such acoustic model, built on probability and statistical theory, and is defined by state transition probabilities and an output probability density function for each state. A conventional speech recognition procedure is described below, taking the case where an HMM is used as the acoustic model as an example.

In a speech recognition apparatus, recognition proceeds according to a grammar, which describes the set of recognizable sentences as sequences of words, and a word dictionary, which describes the readings (phoneme strings) of the words that make up those sentences. FIG. 6 shows an example of a grammar; here, the case of distinguishing the four utterances 「伊藤です」 ("Ito desu"), 「糸井です」 ("Itoi desu"), 「今井です」 ("Imai desu"), and 「土井です」 ("Doi desu") is used as an example.

The grammar shown in FIG. 6 is a state transition diagram whose start (beginning of sentence) is the state "1" indicated by the circled numeral 1 and whose end (end of sentence) is state "5"; each transition outputs the word associated with its arrow. Each word in the grammar is expressed as an HMM state sequence according to its reading (phoneme string), and the set of words in the word dictionary is expanded into a tree-structured dictionary as shown in FIG. 7.

In the tree-structured dictionary, each word is decomposed into a phoneme string; the word 「糸井」 (Itoi), for example, is expanded into the sequence of four phonemes "i", "t", "o", "i". Each phoneme is usually composed of about three states (HMM states). The tree-structured dictionary is a state transition diagram whose branches fan out toward the right, obtained by merging, among the words expressed as HMM state sequences, the partial state sequences they share from the beginning. In the tree-structured dictionary of FIG. 7, the three words 「伊藤」 (Ito), 「糸井」 (Itoi), and 「今井」 (Imai) share the HMM state sequence corresponding to the word-initial "i", and 「伊藤」 and 「糸井」 further share the HMM state sequence up to "ito". Likewise, the HMM state sequence corresponding to the initial "d" is shared between 「土井」 (Doi) and 「です」 (desu). "sil" in the figure denotes a non-speech (silence) segment.
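
As an illustration of the prefix merging just described, the following Python sketch builds a minimal tree-structured (prefix) dictionary from word-to-phoneme mappings; the class name, function name, and phoneme strings are illustrative assumptions rather than anything specified in the patent.

```python
# Minimal sketch of a tree-structured (prefix-tree) dictionary built from
# phoneme strings. Words that share leading phonemes share a single branch.

class TrieNode:
    def __init__(self, phoneme=None):
        self.phoneme = phoneme   # phoneme label of this node ("i", "t", ...)
        self.children = {}       # phoneme -> TrieNode
        self.word = None         # set on the node where a word ends

def build_tree_dictionary(lexicon):
    """lexicon: dict mapping word -> list of phonemes (illustrative readings)."""
    root = TrieNode()
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            node = node.children.setdefault(ph, TrieNode(ph))
        node.word = word         # word-final node; a word hypothesis is emitted here
    return root

if __name__ == "__main__":
    lexicon = {
        "伊藤": ["i", "t", "o"],
        "糸井": ["i", "t", "o", "i"],
        "今井": ["i", "m", "a", "i"],
        "土井": ["d", "o", "i"],
        "です": ["d", "e", "s", "u"],
    }
    root = build_tree_dictionary(lexicon)
    print(sorted(root.children.keys()))   # ['d', 'i']: only two first-phoneme branches
```

Words such as 伊藤 and 糸井 then occupy a single shared branch for their common prefix, which is exactly the merging shown in FIG. 7.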

In the speech recognition process, following the grammar constraints of FIG. 6, tokens called state hypotheses move from left to right through the tree-structured dictionary of FIG. 7, starting from the word-initial HMM states. When a state hypothesis reaches a word-final HMM state, it leaves a record called a word hypothesis and moves to the destination state of that word in the grammar of FIG. 6. If the destination state is not the end (end of sentence), the tree-structured dictionary is searched in the same way from the next time step, following the grammar constraints.

While a state hypothesis moves from left to right through an HMM state sequence in the tree-structured dictionary, a score (log likelihood) measuring how well the input speech matches that word is computed. Each HMM state of the tree-structured dictionary has a probability distribution (output probability density function) that returns a likelihood for an input acoustic feature parameter, and transition probabilities (state transition probabilities) are defined for the transitions between HMM states. The word score (log likelihood) is computed by accumulating these probabilities along the time axis.

With a period of one frame of the feature vectors obtained by analyzing the speech signal at fixed time intervals (for example, 10 ms), each state hypothesis that has reached an HMM state repeatedly performs, in parallel, a transition to its own HMM state (self-transition) and a transition to the HMM state on its right (L-R transition). If α_j(t) denotes the log likelihood that state j is occupied at the t-th frame, α_j(t) is expressed by the following equation (1), where a_ij is the transition probability from state i to state j and b_j(o_t) is the probability that state j outputs the acoustic feature o_t.

$$ \alpha_j(t) = \alpha_i(t-1) + \log a_{ij} + \log b_j(o_t) \qquad (1) $$

FIG. 8 shows the space (trellis) of self-transitions and L-R transitions when a word sequence consisting of N HMM states is searched for over a speech signal composed of T frames, that is, when state hypotheses move through an HMM state sequence. The trellis space is a lattice graph of the possible state sequences, with frames on the horizontal axis and states on the vertical axis; each state sequence is drawn as a polyline connecting the points (circles) that represent the state at each time.

As shown in FIG. 8, many paths reach state j at the timing of the t-th frame. Since the purpose of speech recognition is to find the most probable path (maximum-likelihood path), Viterbi processing that keeps only the higher score is performed for every frame and every HMM state according to the following equation (2).

$$ \alpha_j(t) = \max_i \left\{ \alpha_i(t-1) + \log a_{ij} \right\} + \log b_j(o_t) \qquad (2) $$
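
A compact Python illustration of the recursion in equation (2) for a purely left-to-right HMM is given below; each state keeps the better of its self-transition and the L-R transition from its left neighbor. The function name and the dummy probability values are assumptions used only for the example.

```python
import math

# Sketch of the Viterbi recursion of equation (2) for a left-to-right HMM:
# alpha_j(t) = max_i { alpha_i(t-1) + log a_ij } + log b_j(o_t),
# where i is restricted to j (self-transition) or j-1 (L-R transition).

NEG_INF = float("-inf")

def viterbi_left_to_right(log_a_self, log_a_next, log_b, num_frames):
    """
    log_a_self[j]: log self-transition probability of state j
    log_a_next[j]: log probability of the L-R transition from state j to j+1
    log_b(j, t):   log output probability of state j for the frame-t feature
    """
    n_states = len(log_a_self)
    alpha = [NEG_INF] * n_states
    alpha[0] = log_b(0, 0)                              # path starts in state 0
    for t in range(1, num_frames):
        new_alpha = [NEG_INF] * n_states
        for j in range(n_states):
            best = alpha[j] + log_a_self[j]             # self-transition
            if j > 0:
                best = max(best, alpha[j - 1] + log_a_next[j - 1])   # L-R transition
            if best > NEG_INF:
                new_alpha[j] = best + log_b(j, t)
        alpha = new_alpha
    return alpha

if __name__ == "__main__":
    log_a_self = [math.log(0.6)] * 3
    log_a_next = [math.log(0.4)] * 3
    log_b = lambda j, t: math.log(0.1)                  # flat dummy output probabilities
    print(viterbi_left_to_right(log_a_self, log_a_next, log_b, num_frames=5))
```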

Because the speech recognition process must search all word chains permitted by the grammar, a large number of state hypotheses simultaneously perform both a transition to their own HMM state (in FIG. 8, the self-transition to the right) and a transition to the adjacent HMM state (in FIG. 8, the L-R transition to the lower right), so the amount of computation becomes enormous. To curb this growth in computation, pruning is normally performed during the Viterbi search to remove state hypotheses with low probability from the search space.

In pruning, state hypotheses whose likelihood lies within a fixed margin of the maximum likelihood at the time being processed are kept as the search space for the next time step, while state hypotheses whose likelihood falls below that margin are excluded from the search space for the next time step.

Patent Documents 1, 2, and 3 disclose techniques that adaptively change the threshold of the pruning condition in order to prune the number of state hypotheses efficiently and reduce the amount of computation.
[Patent Document 1] JP 2001-75596 A
[Patent Document 2] JP 2003-15683 A
[Patent Document 3] JP 2004-198832 A

In a speech recognition apparatus, the search process is started from a non-speech segment several hundred milliseconds before the timing at which speech is first detected, according to a grammar that begins with the silence "sil", as described with reference to FIG. 6.

In this non-speech segment before the utterance, however, a phenomenon occurs in which the search space (the set of state hypotheses) spreads uselessly into the words ahead. In the example of FIG. 6, the search space expands beyond the silence "sil" into "Ito", "Itoi", "Imai", "Doi", and even "desu", so the number of state hypotheses grows and the recognition process is delayed.

In the non-speech segment before the utterance, the likelihood differences between state hypotheses do not become large enough, so pruning is not performed effectively and computation is wasted on useless likelihood calculations. On the other hand, if the speech detection threshold is raised, utterances may be missed and the search may never be started.

Thus, although effective pruning is difficult in the non-speech segment before the utterance because the likelihood differences between state hypotheses do not become large enough, the prior art described above sets the pruning condition regardless of whether or not the current segment is the non-speech segment before the utterance. There is therefore the technical problem that computation is wasted on useless likelihood calculations, particularly in the non-speech segment before the utterance.

An object of the present invention is to solve the above problems of the prior art and to provide a speech recognition method and apparatus, a speech recognition program, and a recording medium therefor that can reduce the number of state hypotheses and increase the speed of recognition processing by restricting the forward expansion of the search space in non-speech segments.

To achieve the above object, the present invention provides, in a speech recognition method that matches acoustic parameters extracted from a speech signal against an acoustic model, which is a probabilistic state transition model, transitions state hypotheses while computing the likelihood between the acoustic parameters and the acoustic model, and takes the maximum-likelihood state transition path as the recognition result: acoustic analysis means for extracting acoustic parameters from the speech signal and temporarily storing them; utterance detection means for detecting the start of speech, that is, the utterance, on the basis of the acoustic parameters; a transition search unit that transitions the state hypotheses and computes their likelihoods on the basis of the acoustic parameters from a time a predetermined interval before the utterance detection timing; and transition restriction means for restricting the transition destinations of the state hypotheses, wherein the transition restriction means restricts the transition destinations until the utterance is detected and lifts the restriction after the utterance is detected.

According to the present invention, even in the silent segment before the utterance is detected, where pruning cannot be performed effectively because the likelihood differences between state hypotheses do not become large enough, restricting the transition range of the state hypotheses suppresses the growth in the number of state hypotheses and reduces the amount of computation, so the recognition processing time can be reduced while keeping the loss of recognition accuracy small.

The best mode for carrying out the present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the main part of a speech recognition apparatus 1 according to the present invention.

The speech signal input unit 11 converts the input speech signal into a digital signal. The acoustic analysis unit 12 converts the speech signal into acoustic feature parameters and temporarily stores them. An acoustic feature parameter is a feature vector obtained by analyzing the input speech at fixed time intervals (for example, 10 ms; hereinafter referred to as a frame). The speech signal is thus converted into a sequence of feature vectors X = x1, x2, ..., xT.
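
For concreteness, the sketch below frames a sampled signal into 10 ms frames and computes a simple log power per frame. The 10 ms frame length follows the text, but the sampling rate and the use of log power as the "feature" are assumed stand-ins for the real acoustic feature parameters.

```python
import math

# Sketch: split a sampled signal into 10 ms frames and compute a per-frame
# log power. Real acoustic feature parameters would be richer vectors.

def frame_log_power(samples, sample_rate=16000, frame_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        powers.append(math.log(energy + 1e-12))   # epsilon avoids log(0) in silence
    return powers
```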

The utterance detection unit 13 refers to the power contained in the acoustic parameters of each frame and determines the first frame whose power exceeds a predetermined threshold Wref to be the start frame of the speech, that is, the utterance frame. The HMM database 14 stores HMMs in advance as the probabilistic state transition networks of the acoustic model.
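
A minimal sketch of this power-threshold detection follows, assuming per-frame power values such as those produced by the previous sketch; the function name and the example threshold value are hypothetical.

```python
# Sketch of utterance detection as described for unit 13: the first frame
# whose power exceeds the threshold Wref is treated as the start of speech.

def detect_utterance(frame_powers, w_ref):
    """Return the index of the first frame whose power exceeds w_ref, or None."""
    for t, power in enumerate(frame_powers):
        if power > w_ref:
            return t
    return None

# Example (threshold value chosen arbitrarily):
# start_frame = detect_utterance(frame_log_power(samples), w_ref=-2.0)
```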

When the utterance detection unit 13 detects an utterance, the transition search unit 15 sequentially reads from the acoustic analysis unit 12 the acoustic feature parameters from a time several hundred milliseconds before the speech detection timing onward and, while following the grammar constraints, matches the HMMs against the time series of acoustic feature parameters to compute acoustic log likelihoods. When the state sequence branches into several paths because of the grammar constraints, the transition search unit 15 duplicates the state hypothesis once per branch and advances a state hypothesis along each branch to compute its likelihood.

Until the utterance start timing detected by the utterance detection unit 13, the transition restriction unit 16 limits, by means of a predetermined restricted area, the transitions of the state hypotheses in the state direction on the trellis in the transition search unit 15, that is, the L-R transitions to other HMM states excluding self-transitions; after the utterance start timing, it lifts this restriction on state transitions and allows the same state transitions as in the conventional method. The recognition result determination unit 17 takes, among the state hypotheses that have reached the end of a sentence in the grammar, the state transition path with the maximum likelihood as the speech recognition result.
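
The division of work among units 11 to 17 can be pictured with the hypothetical skeleton below, which only mirrors the block diagram of FIG. 1; the class and method names and the look-back length are assumptions, and the search itself is left unimplemented.

```python
# Hypothetical skeleton mirroring FIG. 1; names and the look-back length are
# assumptions, and method bodies are reduced to the essential calls.

class SpeechRecognizer:
    def __init__(self, hmm_database, grammar, lookback_frames=30):
        self.hmm_database = hmm_database        # unit 14: pre-stored HMMs
        self.grammar = grammar
        self.lookback_frames = lookback_frames  # "several hundred ms" at 10 ms frames

    def recognize(self, signal):
        features = self.analyze(signal)               # unit 12: acoustic analysis
        start = self.detect_utterance(features)       # unit 13: power threshold Wref
        if start is None:
            return None                               # no utterance detected
        first = max(0, start - self.lookback_frames)  # go back before the detection
        hypotheses = self.search(features, first, start)    # units 15 and 16
        return self.best_sentence_end_path(hypotheses)      # unit 17

    def search(self, features, first_frame, utterance_frame):
        # Transition search (unit 15) with the transition restriction (unit 16)
        # applied only to frames before utterance_frame.
        raise NotImplementedError

    def analyze(self, signal): ...
    def detect_utterance(self, features): ...
    def best_sentence_end_path(self, hypotheses): ...
```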

FIGS. 2 and 3 are flowcharts showing the likelihood calculation procedure in the likelihood calculation unit 15, and FIG. 4 schematically shows the state transitions on the HMM trellis in this embodiment in comparison with those of the prior art. In FIG. 4, (a) → (b) → (c) → (d) shows the state transition process of this embodiment, and (a) → (b) → (e) → (f) shows the state transition process of the prior art.

In FIG. 2, step S1 determines whether an utterance has been detected by the utterance detection unit 13. When an utterance is detected, the process proceeds to step S2, where a variable n that counts the number of HMM state transitions after utterance detection is reset. In step S3, a frame several hundred milliseconds before the utterance detection timing is set as the start frame for reading in acoustic feature parameters. In step S4, the acoustic feature parameters temporarily stored in the acoustic analysis unit 12 are read into the transition search unit 15 in order, starting from this read start frame.

In step S5, the transition search unit 15 selects one of the active state hypotheses as the current target of calculation. In step S6, a self-transition is performed and its likelihood is calculated and updated. In step S7, it is determined whether self-transitions and likelihood calculations have been completed for all state hypotheses; if not, the process returns to step S5 and the above steps are repeated for the other state hypotheses that are to transition at this timing.

When self-transitions and likelihood calculations have been completed for all state hypotheses that are to transition at this timing, the process proceeds to step S8, where one of the active state hypotheses is again selected as the current target of calculation. In step S9, it is determined whether the current acoustic feature parameter read in step S4 is a pre-utterance acoustic feature parameter. Since it is initially judged to be a pre-utterance parameter, the process proceeds to the pre-utterance search of step S10.

FIG. 3 is a flowchart showing the procedure of the pre-utterance search. In step S61, whether the HMM state at the L-R transition destination lies inside the restricted area is determined on the basis of the transition count n. As shown in FIGS. 4(a) and 4(b), as long as the transition count n is still small and the L-R transition destination HMM state is outside the restricted area, the process proceeds to step S62 and the L-R transition likelihood is calculated. In step S63, Viterbi processing is performed and, in accordance with equation (2), only the state hypothesis with the highest score is kept. In step S64, the transition count n is incremented.

In contrast, for the HMM state [1] shown in FIG. 4(c), the L-R transition destination HMM state [2] is judged in step S61 to be inside the restricted area, so the process jumps to step S64; that is, the L-R transition is prohibited.
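
Steps S61 to S64 of the pre-utterance search can be sketched as follows. Judging the restricted area by comparing the destination state depth with a maximum permitted count n_max is an illustrative interpretation of the transition-count test described above, and the function and variable names are hypothetical.

```python
# Sketch of one pre-utterance search step (S61-S64): the L-R transition is
# skipped when its destination HMM state falls inside the restricted area.

def pre_utterance_step(hypotheses, n_max, log_a_next, log_b, t):
    """hypotheses: dict mapping state index -> accumulated log likelihood."""
    updated = dict(hypotheses)
    for j, score in hypotheses.items():
        dest = j + 1
        if dest > n_max or j >= len(log_a_next):
            # S61: destination lies inside the restricted area (or there is no
            # further state), so the L-R transition is prohibited and skipped.
            continue
        lr_score = score + log_a_next[j] + log_b(dest, t)      # S62: L-R likelihood
        if lr_score > updated.get(dest, float("-inf")):        # S63: Viterbi, keep best
            updated[dest] = lr_score
    return updated
```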

Returning to FIG. 2, step S13 determines whether the L-R transitions and Viterbi processing described above have been completed for all state hypotheses that are to transition at this timing. If not, the process returns to step S8 and the above steps are repeated for the other state hypotheses that are to transition at this timing.

On the other hand, if step S9 determines that the acoustic feature parameter read in step S4 is a post-utterance acoustic feature parameter, the process proceeds to step S11. In step S11, as shown in FIGS. 4(e) and 4(f), each state hypothesis performs an L-R transition and its likelihood is calculated without any restriction on the transition destination, as in the prior art. In step S12, Viterbi processing is performed.

Thereafter, when the above processing has been completed for all state hypotheses that are to transition at this timing, the process proceeds to step S14, where pruning is performed so that, among all current state hypotheses, only those with the highest scores are kept and the other state hypotheses are excluded from the next search. In this embodiment, the likelihood α_j(t) of state j at time t is compared with the maximum likelihood α_max(t) among all state hypotheses at the same time: state hypotheses satisfying the following expression (3) are kept in the search space for the next time step, and state hypotheses satisfying the following expression (4) are excluded from the search space for the next time step. Here θ_pruning is a positive real number giving the pruning threshold.

$$ \alpha_j(t) \ge \alpha_{\max}(t) - \theta_{\mathrm{pruning}} \qquad (3) $$

$$ \alpha_j(t) < \alpha_{\max}(t) - \theta_{\mathrm{pruning}} \qquad (4) $$
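
A short sketch of the beam pruning expressed by (3) and (4): hypotheses whose score lies within θ_pruning of the best score at the current frame survive, and the rest are dropped. Representing the hypotheses as a dictionary from state index to log likelihood is an assumption made for illustration.

```python
# Sketch of the pruning of expressions (3) and (4): keep state hypotheses whose
# score is within theta_pruning of the best score at the current frame.

def prune(hypotheses, theta_pruning):
    """hypotheses: dict mapping state index -> log likelihood alpha_j(t)."""
    if not hypotheses:
        return hypotheses
    alpha_max = max(hypotheses.values())
    return {j: a for j, a in hypotheses.items()
            if a >= alpha_max - theta_pruning}    # (3) kept; (4) discarded
```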

In step S15, whether there is a next frame is determined; if there is, the process returns to step S4, the acoustic feature parameters of the next frame are read in, and the above processing is repeated. When the above processing has been completed for all frames and the state hypotheses have reached the last HMM state in the grammar, in step S16 the recognition result determination unit 17 fixes, as the speech recognition result, the state transition path with the maximum likelihood among the state hypotheses that have reached the end of a sentence in the grammar.

In the embodiment described above, the transition restriction range was described as being fixed in advance, but the present invention is not limited to this; the transition restriction range may instead be set variably on the basis of the power of the background noise measured immediately before the utterance is detected by the utterance detection unit 13.

For example, when the background noise power measured before utterance detection is large, the probability that the start of the speech is not captured and the detection of the utterance is delayed is higher than when the background noise power is small. In such a case, it is desirable to widen the transition restriction range.

FIG. 5 shows the relationship between the transition count n and the power of the speech signal when whether a transition enters the restricted range is judged on the basis of the transition count n. The larger the power immediately before the power exceeds the speech detection threshold Wref and the utterance is detected, that is, the larger the background noise power, the larger the transition count n becomes, widening the range within which transitions are permitted.
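
One way to realize the monotonic relationship of FIG. 5 is sketched below: the permitted transition count grows with the background-noise power measured just before detection. The linear mapping and its constants are assumptions; the text specifies only that a larger noise power should widen the permitted range.

```python
# Sketch of a variable transition restriction range in the spirit of FIG. 5:
# the higher the background-noise power just before detection, the larger the
# permitted transition count n. The linear mapping and constants are illustrative.

def allowed_transition_count(noise_log_power, n_min=2, n_max=10,
                             low_power=-6.0, high_power=0.0):
    if noise_log_power <= low_power:
        return n_min
    if noise_log_power >= high_power:
        return n_max
    ratio = (noise_log_power - low_power) / (high_power - low_power)
    return int(round(n_min + ratio * (n_max - n_min)))
```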

Further, although omitted from the above description, the restriction of transition destinations described above and conventional score-based pruning may be performed in parallel.

FIG. 1 is a block diagram showing the configuration of the main part of a speech recognition apparatus 1 according to the present invention.
FIG. 2 is a flowchart showing the likelihood calculation procedure.
FIG. 3 is a flowchart showing the procedure of the pre-utterance likelihood calculation process.
FIG. 4 is a diagram schematically showing the state transitions on the HMM trellis.
FIG. 5 is a diagram showing the relationship between the transition count n, which defines the transition restriction range, and the speech signal power.
FIG. 6 is a diagram showing an example of a grammar.
FIG. 7 is a diagram showing an example of a tree-structured dictionary.
FIG. 8 is a diagram showing an example of a trellis space.

Explanation of symbols

1: speech recognition apparatus, 11: speech signal input unit, 12: acoustic analysis unit, 13: utterance detection unit, 14: HMM database, 15: likelihood calculation unit, 16: transition restriction unit, 17: recognition result determination unit

Claims (7)

[Claim 1] A speech recognition apparatus that matches acoustic parameters extracted from a speech signal against an acoustic model, which is a probabilistic state transition model, transitions state hypotheses while computing the likelihood between the acoustic parameters and the acoustic model, and takes the maximum-likelihood state transition path as the speech recognition result, the apparatus comprising:
acoustic analysis means for extracting acoustic parameters from the speech signal and temporarily storing them;
utterance detection means for detecting an utterance on the basis of the acoustic parameters;
a transition search unit that performs likelihood calculation and state transitions on the basis of the acoustic parameters from a time a predetermined interval before the utterance detection timing; and
transition restriction means for restricting the transition destinations of the state hypotheses,
wherein the transition search unit causes active state hypotheses to repeat self-transitions and L-R transitions, and the transition restriction means makes the permitted range of L-R transitions narrower before the utterance is detected than after it is detected.
[Claim 2] The speech recognition apparatus according to claim 1, wherein the permitted transition range before the utterance is detected is determined on the basis of the power of the background noise immediately before the utterance is detected.
[Claim 3] The speech recognition apparatus according to claim 1 or 2, wherein the transition destinations permitted before the utterance is detected are made narrower as the power of the background noise immediately before the utterance is detected becomes smaller.
[Claim 4] The speech recognition apparatus according to any one of claims 1 to 3, wherein the acoustic model is an HMM.
[Claim 5] A speech recognition method that matches acoustic parameters extracted from a speech signal against an acoustic model, which is a probabilistic state transition model, transitions state hypotheses while computing the likelihood between the acoustic parameters and the acoustic model, and takes the maximum-likelihood state transition path as the speech recognition result, the method comprising:
a step of storing the acoustic parameters;
a step of detecting an utterance on the basis of the acoustic parameters;
a step of transitioning the state hypotheses and computing their likelihoods on the basis of the acoustic parameters from a time a predetermined interval before the utterance detection timing; and
a step of restricting the transition destination of each state hypothesis in the course of transitioning the state hypotheses,
wherein, in the step of transitioning the state hypotheses and computing their likelihoods, active state hypotheses repeat self-transitions and L-R transitions, and, in the step of restricting the transition destinations, the permitted range of L-R transitions is made narrower before the utterance is detected than after it is detected.
[Claim 6] A speech recognition program for causing a computer to execute a speech recognition method that matches acoustic parameters extracted from a speech signal against an acoustic model, which is a probabilistic state transition model, transitions state hypotheses while computing the likelihood between the acoustic parameters and the acoustic model, and takes the maximum-likelihood state transition path as the speech recognition result, the method comprising:
a step of storing the acoustic parameters;
a step of detecting an utterance on the basis of the acoustic parameters;
a step of transitioning the state hypotheses and computing their likelihoods on the basis of the acoustic parameters from a timing a predetermined interval before the utterance detection timing; and
a step of restricting the transition destination of each state hypothesis in the course of transitioning the state hypotheses,
wherein, in the step of transitioning the state hypotheses and computing their likelihoods, active state hypotheses repeat self-transitions and L-R transitions, and, in the step of restricting the transition destinations, the permitted range of L-R transitions is made narrower before the utterance is detected than after it is detected.
[Claim 7] A recording medium storing, in a computer-readable manner, the speech recognition program according to claim 6.
JP2007087842A 2007-03-29 2007-03-29 Voice recognition method and apparatus, voice recognition program and recording medium therefor Expired - Fee Related JP4748605B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2007087842A JP4748605B2 (en) 2007-03-29 2007-03-29 Voice recognition method and apparatus, voice recognition program and recording medium therefor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2007087842A JP4748605B2 (en) 2007-03-29 2007-03-29 Voice recognition method and apparatus, voice recognition program and recording medium therefor

Publications (2)

Publication Number Publication Date
JP2008249807A JP2008249807A (en) 2008-10-16
JP4748605B2 true JP4748605B2 (en) 2011-08-17

Family

ID=39974860

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2007087842A Expired - Fee Related JP4748605B2 (en) 2007-03-29 2007-03-29 Voice recognition method and apparatus, voice recognition program and recording medium therefor

Country Status (1)

Country Link
JP (1) JP4748605B2 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0235500A (en) * 1988-07-26 1990-02-06 Sharp Corp Speed recognition system
JPH0635495A (en) * 1992-07-16 1994-02-10 Ricoh Co Ltd Speech recognizing device
JPH09274494A (en) * 1996-04-09 1997-10-21 Fuji Xerox Co Ltd Device and method for speech recognition
JP4027356B2 (en) * 2004-10-08 2007-12-26 キヤノン株式会社 Character string input device and control method thereof

Also Published As

Publication number Publication date
JP2008249807A (en) 2008-10-16

Similar Documents

Publication Publication Date Title
WO2017101450A1 (en) Voice recognition method and device
JP4757936B2 (en) Pattern recognition method and apparatus, pattern recognition program and recording medium therefor
JP2002504719A (en) A system that uses silence in speech recognition
US20100100379A1 (en) Voice recognition correlation rule learning system, voice recognition correlation rule learning program, and voice recognition correlation rule learning method
WO2010128560A1 (en) Voice recognition device, voice recognition method, and voice recognition program
JP5447373B2 (en) Language model score prefetch value assigning apparatus and method, and program recording medium
JP6276513B2 (en) Speech recognition apparatus and speech recognition program
JP5309343B2 (en) Pattern recognition method and apparatus, pattern recognition program and recording medium therefor
JP4583772B2 (en) Speech recognition system, speech recognition method, and speech recognition program
JP4748605B2 (en) Voice recognition method and apparatus, voice recognition program and recording medium therefor
JP5008078B2 (en) Pattern recognition method and apparatus, pattern recognition program and recording medium therefor
KR100915638B1 (en) The method and system for high-speed voice recognition
JP6026224B2 (en) Pattern recognition method and apparatus, pattern recognition program and recording medium therefor
KR101229108B1 (en) Apparatus for utterance verification based on word specific confidence threshold
JP4883717B2 (en) Voice recognition method and apparatus, voice recognition program and recording medium therefor
JP3104900B2 (en) Voice recognition method
KR20170090815A (en) Speech recognition device and operating method thereof
US20040148163A1 (en) System and method for utilizing an anchor to reduce memory requirements for speech recognition
JP3615088B2 (en) Speech recognition method and apparatus
JP6014519B2 (en) Search device in probabilistic finite state transition model
KR101134450B1 (en) Method for speech recognition
JP5158877B2 (en) Speech recognition method and apparatus
JP2004012615A (en) Continuous speech recognition apparatus and continuous speech recognition method, continuous speech recognition program and program recording medium
JP2017211513A (en) Speech recognition device, method therefor, and program
KR100557650B1 (en) A speech recognition method using context-dependent phone duration model

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20090127

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20110114

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20110202

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20110401

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20110511

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20110511

R150 Certificate of patent or registration of utility model

Ref document number: 4748605

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20140527

Year of fee payment: 3

LAPS Cancellation because of no payment of annual fees