JP2012063651A

JP2012063651A - Voice recognition device, voice recognition method and voice recognition program

Info

Publication number: JP2012063651A
Application number: JP2010208760A
Authority: JP
Inventors: Masanobu Nakamura; 匡伸中村; Takashi Masuko; 貴史益子
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2010-09-17
Filing date: 2010-09-17
Publication date: 2012-03-29
Anticipated expiration: 2030-09-17
Also published as: JP5161942B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition technique that achieves highly robust voice recognition while restraining increases in calculation cost and in the memory capacity required for calculation. SOLUTION: A voice recognition device stores a plurality of sound models having the same structure and a search network common to the plurality of sound models; accepts a voice input; extracts a sound feature quantity from the input voice; calculates, using the extracted sound feature quantity, the plurality of sound models and the search network, a score corresponding to each of the plurality of sound models for each route, a route being a series of nodes passed on the way from a start point node to an end point node on the search network; retrieves a first route, which is the optimum route from the start point node to the end point node on the search network, by selecting the route whose score is the greatest among the routes; and outputs information indicating the first route as the result of voice recognition.

Description

Embodiments of the present invention relate to a speech recognition apparatus, a speech recognition method, and a speech recognition program.

Conventionally, two approaches have been used to perform highly robust speech recognition in a speech recognition apparatus. One, disclosed for example in Patent Documents 1 and 2, calculates a plurality of optimal paths, each consisting of a sequence of words, by running a plurality of path search units (decoders) simultaneously against a plurality of acoustic models. The other, disclosed in Patent Document 3, calculates such a plurality of optimal paths by running a single path search unit sequentially.

Patent Document 1: JP 2005-221678 A
Patent Document 2: JP 2003-108188 A
Patent Document 3: JP 2007-225931 A

However, the conventional methods risk increasing the calculation cost and the amount of memory needed to perform highly robust speech recognition. In particular, methods that perform calculations using a plurality of acoustic models risk a marked increase in calculation cost and memory usage.

The speech recognition apparatus of the embodiment includes: a first storage unit that stores a plurality of acoustic models having the same structure; a second storage unit that stores a search network common to the plurality of acoustic models, the search network having a start node representing a start point, an end node representing an end point, and at least one node between the start node and the end node; a reception unit that accepts speech input; an extraction unit that extracts acoustic features from the speech; a calculation unit that, using the acoustic features, the plurality of acoustic models, and the search network, calculates a score corresponding to each of the plurality of acoustic models for each path, a path being a sequence of nodes traversed from the start node to the end node on the search network; a search unit that searches for a first path, which is the optimal path from the start node to the end node on the search network, by selecting from among the paths the first path for which the score corresponding to at least one acoustic model is the maximum; and an output unit that outputs information indicating the first path, which is the speech recognition result.

FIG. 1 is a diagram illustrating the functional configuration of the speech recognition apparatus according to the embodiment.
FIG. 2 is a diagram illustrating a search network.
FIG. 3 is a diagram for explaining the search for the optimal path by the token passing technique.
FIG. 4 is a flowchart showing the procedure of the speech recognition process.
FIG. 5 is a flowchart showing the detailed procedure of step S3 in FIG. 4.
FIG. 6 is a diagram showing an example of updating the cumulative score list.
FIG. 7 is a diagram showing an example of updating the cumulative score list.
FIG. 8 is a diagram for explaining the search for the optimal path according to a modification.

[First embodiment]
First, the hardware configuration of the speech recognition apparatus will be described. The speech recognition apparatus according to this embodiment includes a control unit such as a CPU (Central Processing Unit) that controls the entire apparatus; a main storage unit such as a ROM (Read Only Memory) and a RAM (Random Access Memory) that stores various data and programs; an auxiliary storage unit such as an HDD (Hard Disk Drive) or a CD (Compact Disk) drive device that stores various data and programs; and a bus that connects them. It thus has the hardware configuration of an ordinary computer. In addition, a speech input unit, to which speech is input, is connected to the speech recognition apparatus by wire or wirelessly.

Next, the functional configuration of the speech recognition apparatus according to this embodiment, given this hardware configuration, will be described with reference to FIG. 1. The speech recognition apparatus 100 includes a speech input reception unit 101, an acoustic feature extraction unit 102, an acoustic model storage unit 103, a search network storage unit 104, and a path search unit 105. The speech input reception unit 101, the acoustic feature extraction unit 102, and the path search unit 105 are each realized by the control unit executing programs stored in the main storage unit or the auxiliary storage unit. The acoustic model storage unit 103 and the search network storage unit 104 are configured, for example, in the auxiliary storage unit.

The speech input reception unit 101 accepts an analog speech signal representing the speech input to the speech input unit and converts it into a digital speech signal. The acoustic feature extraction unit 102 extracts acoustic features for each time from the speech signal digitized by the speech input reception unit 101. The acoustic model storage unit 103 stores a plurality of (for example, M) acoustic models. A hidden Markov model (HMM) in units of phonemes, for example, may be used as the acoustic model. The plurality of acoustic models must, however, share a common structure that the path search unit 105 can search using the single search network stored in the search network storage unit 104 described later. For example, when the acoustic models are phoneme-unit hidden Markov models, the number of states and the state transition structure of corresponding phoneme models are the same across the acoustic models.

The search network storage unit 104 stores a single search network. FIG. 2 is a diagram illustrating a search network. The search network consists of nodes, drawn as circles in the figure, and arcs, drawn as arrows representing transitions between nodes. Among the nodes, the one representing the start point is the start node S and the one representing the end point is the end node E. A sequence of nodes from the start node S to the end node E constitutes a path. Labels are assigned to some or all of the nodes and arcs. A label may be an arbitrary identifier that distinguishes a node or arc on the search network, or it may be a phoneme or word associated with the node or arc; it can be chosen according to the speech recognition result required. It is not necessary to label every node and arc. In the example search network shown in FIG. 2, each node is given a numeric identifier and each arc is given a word. Specifically, the nodes are labeled "1" to "6", the arc from node 2 to node 3 is labeled "word 1", the arc from node 2 to node 5 is labeled "word 2", and the arc from node 1 to node 4 is labeled "word 3".
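The node-and-arc structure described above can be sketched as a small data structure. This is a minimal illustration only: the class names `Arc` and `SearchNetwork` and the method `incoming` are hypothetical, and only the three labeled arcs named in the text are included, since the rest of FIG. 2 cannot be recovered from the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Arc:
    src: str                      # node the arc leaves
    dst: str                      # node the arc enters
    label: Optional[str] = None   # optional label, here a word


@dataclass
class SearchNetwork:
    start: str                    # start node S
    end: str                      # end node E
    arcs: List[Arc] = field(default_factory=list)

    def incoming(self, node: str) -> List[Arc]:
        """Arcs entering `node`, i.e. its possible predecessors."""
        return [a for a in self.arcs if a.dst == node]


# Only the three labeled arcs named in the text are listed;
# the remaining arcs of FIG. 2 are not recoverable from the description.
net = SearchNetwork(
    start="S",
    end="E",
    arcs=[
        Arc("2", "3", "word 1"),
        Arc("2", "5", "word 2"),
        Arc("1", "4", "word 3"),
    ],
)

print([a.label for a in net.incoming("3")])  # ['word 1']
```

Storing arcs as a flat list keeps the sketch short; a real decoder would more likely index arcs by destination node so that predecessor lookup is not a linear scan.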

Each node n of the search network has an acoustic score list scr(t,n) that holds, for each of the M acoustic models, the acoustic score l_m^(t,n) corresponding to the m-th acoustic model (1 ≤ m ≤ M) at time t. scr(t,n) is expressed, for example, by Equation 1.

The acoustic score list need only be referable for each node; it is not limited to being held by the node itself.

Likewise, the arc from node j to node i has a transition score list trans(j,i) that holds, for each of the M acoustic models, the transition score a_m^ji corresponding to the m-th acoustic model (1 ≤ m ≤ M). trans(j,i) is expressed, for example, by Equation 2.

The transition score list need only be referable for each arc; it is not limited to being held by the arc itself.
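Equations 1 and 2 are not reproduced in this text. From the definitions above (one score per acoustic model, M models in total), forms consistent with the description would be the following; the tuple notation is an assumption.

```latex
\begin{align}
\mathrm{scr}(t,n)   &= \left( l_1^{(t,n)},\; l_2^{(t,n)},\; \ldots,\; l_M^{(t,n)} \right) \tag{1} \\
\mathrm{trans}(j,i) &= \left( a_1^{ji},\; a_2^{ji},\; \ldots,\; a_M^{ji} \right) \tag{2}
\end{align}
```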

The path search unit 105 uses the acoustic features extracted by the acoustic feature extraction unit 102, the plurality of acoustic models stored in the acoustic model storage unit 103, and the search network stored in the search network storage unit 104 to search for the path that is optimal as the recognition result for the speech input to the speech input unit (referred to as the optimal path). In doing so, the path search unit 105 searches, for each node traversed between the start node S and the end node E, for the optimal path from the start node S to that node in turn, and thereby finds the optimal path from the start node S to the end node E. The path search unit 105 then generates path history information indicating at least one of the labels assigned to the nodes on the optimal path from the start node S to the end node E and the labels assigned to the arcs representing the transitions between those nodes, and outputs it as the speech recognition result. Here, the path history information is assumed to indicate the words assigned as labels to the arcs on the path.

One search method is a technique that propagates tokens over the search network (token passing), as described, for example, in Reference 1 below.
(Reference 1) S. J. Young, N. H. Russell, and J. H. S. Thornton, "Token Passing: a Conceptual Model for Connected Speech Recognition Systems," CUED Technical Report F-INFENG/TR38, Cambridge University, 1989.

In the token passing technique, as illustrated in FIG. 3, the path search unit 105 generates, for each node n in the search network that can be reached at time t, a token token(t,n) having path history information hist(t,n), which indicates the words assigned to the arcs on the path from the start node S to node n, and a cumulative score list cumscr(t,n), which accumulates and holds the acoustic scores corresponding to each of the plurality of acoustic models. The path search unit 105 searches for the optimal path by propagating this token to the nodes along paths in the search network, updating the path history information and the cumulative score list as it goes. FIG. 3 shows an example in which, at time t = 2, tokens have been propagated from the start node S to node 1, node 2, and node 4. The cumulative score list cumscr(t,n) is expressed as an M-dimensional vector, for example by Equation 3.

In Equation 3, S_m^(t,n) (1 ≤ m ≤ M) is the acoustic score accumulated for the m-th acoustic model along the path from the start node S to node n at time t. The path history information and the cumulative score list need only be referable for each token; they are not limited to being held by the token itself. For example, the path history information and cumulative score lists for all tokens may be stored in a storage area of the main storage unit, such as the RAM, with each token referring to them.
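A token as described above pairs a word history with an M-dimensional vector of cumulative scores, which Equation 3 presumably writes as (S_1^(t,n), ..., S_M^(t,n)). The following is a minimal sketch of that pairing; the names `Token` and `initial_token` are hypothetical, and the initial values follow the description of step S10 (empty history, all scores 0 at the start node).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    hist: Tuple[str, ...]   # hist(t, n): words on the arcs from S to node n
    cumscr: List[float]     # cumscr(t, n): one cumulative score per acoustic model


def initial_token(num_models: int) -> Token:
    """Token placed on the start node S at t = 0: no words, all scores 0."""
    return Token(hist=(), cumscr=[0.0] * num_models)


tok = initial_token(3)
print(tok)  # Token(hist=(), cumscr=[0.0, 0.0, 0.0])
```

Here the token owns its history and score list directly. As the text notes, an implementation could equally keep these in a shared table and have each token hold only a reference.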

Specifically, the path search unit 105 uses the set V_{t-1} of all tokens token(t-1,k), each having path history information hist(t-1,k) and a cumulative score list cumscr(t-1,k) at time t-1, to obtain the token token(t,i) of node i at time t, and thereby obtains the optimal path from the start node S to node i at time t. The set V_{t-1} is the set of tokens held by nodes that immediately precede node i on a path from the start node S to node i and that can be reached at time t-1. When pruning (narrowing down the candidates for the optimal path) is performed during the search, it means the set of tokens that remain unpruned.

To obtain the optimal path from the start node S to node i at time t, the path search unit 105 first selects, from all tokens in the token set V_{t-1}, the set V^→(t,i) of tokens that can transition to node i at time t. Next, for every cumulative score list cumscr(t-1,k) held by the selected token set V^→(t,i) (where k ranges over all node numbers satisfying token(t-1,k) ∈ V^→(t,i)), the path search unit 105 adds (accumulates) the values of cumscr(t-1,k) and of the transition score list trans(k,i) expressed by Equation 2 that correspond to each of the M acoustic models. Among the resulting scores, it selects the node having the maximum score as the optimal node j^* according to Equation 4.

The path search unit 105 then propagates the token corresponding to the optimal node j^*, namely the optimal token token(t-1,j^*), to node i. It updates the cumulative score list cumscr(t,i) of the token token(t,i) according to Equation 5, using the cumulative score list cumscr(t-1,j^*) of the optimal token, the transition score list trans(j^*,i), and the acoustic score list scr(t,i) for the acoustic features extracted at node i at time t.
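Equations 4 and 5 are not reproduced in this text. One reading consistent with the description is given below, using the symbols of Equations 1 to 3. The inner maximum over m in Equation 4 is an interpretation of "the node having the maximum score" among the per-model sums, and the element-wise sum in Equation 5 assumes the scores are additive (for example, log-domain) quantities.

```latex
\begin{align}
j^{*} &= \operatorname*{arg\,max}_{k \,:\, \mathrm{token}(t-1,k) \in V^{\rightarrow}(t,i)}
         \; \max_{1 \le m \le M} \left( S_m^{(t-1,k)} + a_m^{ki} \right) \tag{4} \\
\mathrm{cumscr}(t,i) &= \left( S_m^{(t-1,j^{*})} + a_m^{j^{*}i} + l_m^{(t,i)} \right)_{m = 1,\ldots,M} \tag{5}
\end{align}
```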

When the arc representing the transition from the optimal node j^* to node i carries the label "w", the path history information hist(t,i) of the token token(t,i) is obtained by appending "w" to hist(t-1,j^*), as expressed by Equation 6.

When the arc representing the transition from the optimal node j^* to node i carries no label, hist(t,i) is obtained by assigning hist(t-1,j^*) to it unchanged, as expressed by Equation 7.
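The single propagation step described above (select the optimal predecessor, add the transition and acoustic scores, extend the history) might be sketched as follows. The function name `propagate`, its argument layout, and the numeric scores are all hypothetical; the selection rule uses the maximum over per-model sums, which is one interpretation of Equation 4 rather than a confirmed form.

```python
from typing import Dict, List, Optional, Tuple

Hist = Tuple[str, ...]


def propagate(
    prev: Dict[str, Tuple[Hist, List[float]]],  # k -> (hist(t-1,k), cumscr(t-1,k))
    trans: Dict[str, List[float]],              # k -> trans(k, i), for arcs k -> i
    labels: Dict[str, Optional[str]],           # k -> label on arc k -> i, or None
    scr_i: List[float],                         # scr(t, i)
) -> Tuple[str, Hist, List[float]]:
    # Tokens that can move to node i: predecessors that hold a token at t-1.
    candidates = [k for k in prev if k in trans]

    # Equation 4 as read here: best per-model sum of cumulative + transition score.
    def best_entry(k: str) -> float:
        return max(s + a for s, a in zip(prev[k][1], trans[k]))

    j_star = max(candidates, key=best_entry)
    hist_j, cum_j = prev[j_star]

    # Equation 5: element-wise sum over the M acoustic models.
    cumscr_i = [s + a + l for s, a, l in zip(cum_j, trans[j_star], scr_i)]

    # Equations 6 and 7: append the arc label if there is one, else copy.
    w = labels.get(j_star)
    hist_i = hist_j + (w,) if w is not None else hist_j
    return j_star, hist_i, cumscr_i


# Made-up scores for node i = 5 with predecessors 2 and 4 and M = 3 models.
prev = {"2": (("a",), [-5.0, -6.0, -7.0]), "4": (("word 3",), [-2.0, -8.0, -9.0])}
trans = {"2": [-1.0, -1.0, -1.0], "4": [-1.0, -1.0, -1.0]}
labels = {"2": "word 2", "4": None}
result = propagate(prev, trans, labels, scr_i=[-0.5, -0.5, -0.5])
print(result)  # ('4', ('word 3',), [-3.5, -9.5, -10.5])
```

In this example node 4 wins because its best per-model sum (-3.0) exceeds node 2's (-6.0), even though node 4 scores worse under the other two models.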

The path search unit 105 performs these processes at every time t (t = 1, 2, ..., T) up to the time T at which the speech input ends (the end time). It then outputs the path history information hist(T,E) of the token token(T,E) corresponding to the end node E at time t = T as indicating the optimal path. This is the speech recognition result.

When labels are assigned to all nodes of the search network, the optimal path obtained by the above method means the sequence of nodes that the path traverses. In the example of FIG. 3, word labels are assigned to the arcs representing transitions from one node to another. When the smallest unit handled in the optimal path is a word, there is no need to determine explicitly the optimal path through, for example, the states of each phoneme within each word; it suffices to determine the optimal path in units of words. In such a case, there may be a plurality of tokens token(t-1,k) having the same path history information as that of the token token(t,i) obtained by propagating the token token(t-1,j^*) corresponding to the optimal node j^* at time t-1. Letting V^*→(t,i) denote the set of such tokens, the path search unit 105 selects, for each of the M acoustic models, the maximum of the corresponding values over the paths from the start node S to node i at time t, and uses these maxima to update the cumulative score list cumscr(t,i) of the token token(t,i) corresponding to node i. Specifically, the path search unit 105 updates the cumulative score list cumscr(t,i) using Equation 8 instead of Equation 5.

Here, k ranges over all node numbers satisfying token(t-1,k) ∈ V^*→(t,i). When the path search unit 105 updates the cumulative score list using Equation 8, it is desirable, for convenience, to update the path history information hist(t,i) of the token token(t,i) using Equation 6 or Equation 7. The path search unit 105 also prunes according to a pruning condition. One example of a pruning condition is that every value calculated for the plurality of acoustic models in the cumulative score list of a token falls below a threshold. The path search unit 105 removes the tokens themselves that satisfy such a condition. Alternatively, the pruning condition may be that a value calculated for at least one acoustic model in the cumulative score list of a token falls below a threshold. In that case the path search unit 105 prunes by removing the values that satisfy the condition from the cumulative score list.
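The per-model update of Equation 8 and the two pruning conditions described above can be illustrated with a short sketch. Equation 8 is not reproduced in this text, so `merge_per_model` implements one reading of it: for each acoustic model, take the best predecessor independently, then add the acoustic score at node i. The function names, the use of negative infinity to mark a removed value, and all numbers are assumptions made for illustration.

```python
from typing import List


def merge_per_model(
    cums: List[List[float]],    # cumscr(t-1,k) for each token in V*(t,i)
    transs: List[List[float]],  # trans(k,i) for the same tokens, same order
    scr_i: List[float],         # scr(t,i)
) -> List[float]:
    """Equation 8 as read here: an independent maximum for each acoustic model."""
    num_models = len(scr_i)
    return [
        max(c[m] + a[m] for c, a in zip(cums, transs)) + scr_i[m]
        for m in range(num_models)
    ]


def should_drop_token(cumscr: List[float], threshold: float) -> bool:
    """First condition: every model's cumulative score is below the threshold."""
    return all(s < threshold for s in cumscr)


def drop_low_values(cumscr: List[float], threshold: float) -> List[float]:
    """Second condition: remove only the below-threshold values.

    A removed value is represented as -inf (an assumption), so that it can
    never win a later maximum.
    """
    return [s if s >= threshold else float("-inf") for s in cumscr]


# Made-up scores in the spirit of FIG. 7: node 3 is reached from node 2 and
# from node 3 itself; model 1 prefers node 2, models 2 and 3 prefer node 3.
from_node2 = [-2.0, -9.0, -8.0]
from_node3 = [-4.0, -5.0, -6.0]
no_cost = [-1.0, -1.0, -1.0]
merged = merge_per_model([from_node2, from_node3], [no_cost, no_cost], [-0.5] * 3)
print(merged)                                   # [-3.5, -6.5, -7.5]
print(should_drop_token(merged, -3.0))          # True
print(drop_low_values(merged, -7.0))            # [-3.5, -6.5, -inf]
```

Because the merged tokens share one history, taking a different predecessor for each model does not change the recognized word sequence; it only keeps the best score per model.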

Next, the procedure of the speech recognition process performed by the speech recognition apparatus 100 according to this embodiment will be described with reference to FIG. 4. Using the speech input reception unit 101, the speech recognition apparatus 100 accepts an analog speech signal representing the speech input to the speech input unit and converts it into a digital speech signal (step S1). Using the acoustic feature extraction unit 102, the speech recognition apparatus 100 extracts acoustic features for each time from the speech signal digitized in step S1 (step S2). Using the path search unit 105, the speech recognition apparatus 100 searches for the optimal path with the acoustic features extracted in step S2, the plurality of acoustic models stored in the acoustic model storage unit 103, and the search network stored in the search network storage unit 104 (step S3).

FIG. 5 is a flowchart showing the detailed procedure of step S3 in FIG. 4. At time t = 0, as initial processing, the path search unit 105 generates a token having path history information hist(t,n), which indicates as a path the words assigned to the arcs on the path from the start node S to node n at time t, and a cumulative score list cumscr(t,n), which accumulates and holds the acoustic scores corresponding to the plurality of (M) acoustic models (step S10). Here, t = 0 is a suitable point in time before the speech is input, and node n at that time is the start node S. Consequently, the path history information indicates no words, and all values of the cumulative score list cumscr(t,n) are 0.

The path search unit 105 increments time t by 1 (step S11) and selects a node i that can be reached at time t. It then updates the path history information hist(t,i) so that it indicates the sequence of words assigned to the arcs representing the transitions between the nodes on the path from the start node S to node i, updates the cumulative score list, and generates the token at time t (step S12). Next, the path search unit 105 determines whether the path history information held by all the tokens that can propagate to node i, which can be reached at time t, differs (step S13). If the result is affirmative (step S13: YES), the path search unit 105 obtains the token token(t,i) of node i at time t using the set V_{t-1} of all tokens token(t-1,k) having path history information hist(t-1,k) and a cumulative score list cumscr(t-1,k) at time t-1. In doing so, it calculates the cumulative score list cumscr(t,i) of the token corresponding to node i using Equation 5 above (step S14). For example, for node 5 (i = 5), when acoustic scores have been calculated for three acoustic models as illustrated in FIG. 6, the cumulative score list held by the token of node 4 is calculated as the cumulative score list cumscr(t,5) corresponding to node 5. The process then proceeds to step S16.

If, on the other hand, the result of step S13 is negative (step S13: NO), the path search unit 105 obtains the token token(t,i) of node i at time t using the set V_{t-1} of all tokens token(t-1,k) having path history information hist(t-1,k) and a cumulative score list cumscr(t-1,k) at time t-1. In doing so, it calculates the cumulative score list cumscr(t,i) of the token corresponding to node i at time t using Equation 8 above (step S15). For example, for node 3 (i = 3), when acoustic scores have been calculated for three acoustic models as illustrated in FIG. 7, the value from node 2 is selected for the first acoustic score, the value from node 3 for the second, and the value from node 3 for the third, and the cumulative score list cumscr(t,3) is calculated from them. The process then proceeds to step S16.

In step S16, the path search unit 105 updates the path history information of node i at time t using Equation 6 or Equation 7. The path search unit 105 then determines whether the token corresponding to node i at time t meets the pruning condition (step S17). If the result is affirmative (step S17: YES), the path search unit 105 removes the token (step S19) and proceeds to step S18; if the result is negative (step S17: NO), it proceeds directly to step S18. In step S18, the path search unit 105 determines whether there is any node other than node i that can be reached by time t. If the result is affirmative (step S18: YES), the process returns to step S12 and the same processing is performed for the remaining nodes. If the result of step S18 is negative (step S18: NO), the path search unit 105 determines whether time t has reached the end time T (step S20). If the result is negative (step S20: NO), the process returns to step S11. If the result is affirmative (step S20: YES), the path search unit 105 outputs the path history information held by the token corresponding to the end node E at time t = T (step S21). The path indicated by this path history information is the optimal path and represents the recognition result for the speech whose input was accepted in step S1.
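The loop of FIG. 5 (steps S10 to S21) can be sketched end to end on a toy network. This is an illustration under several assumptions, not the patented implementation: the function name `decode`, the callable signatures `scr(t, node)` and `trans(k, i)`, the toy arcs, and every score are hypothetical. Histories are compared after the arc label is appended, and the per-model maximum is always taken over the tokens sharing the winning history, which reduces to the Equation 5 update when only one such token exists.

```python
from typing import Callable, Dict, List, Optional, Tuple

Hist = Tuple[str, ...]
Token = Tuple[Hist, List[float]]        # (hist, cumscr)
Arc = Tuple[str, str, Optional[str]]    # (source, destination, label or None)


def decode(
    arcs: List[Arc],
    start: str,
    end: str,
    scr: Callable[[int, str], List[float]],    # scr(t, node)
    trans: Callable[[str, str], List[float]],  # trans(k, i)
    T: int,
    M: int,
    threshold: float = float("-inf"),          # default: no pruning
) -> Optional[Token]:
    tokens: Dict[str, Token] = {start: ((), [0.0] * M)}        # S10
    for t in range(1, T + 1):                                  # S11, S20
        new_tokens: Dict[str, Token] = {}
        for i in sorted({dst for _, dst, _ in arcs}):          # S12, S18
            cands = []
            for k, dst, w in arcs:
                if dst == i and k in tokens:
                    hist_k, cum_k = tokens[k]
                    ext = hist_k + (w,) if w is not None else hist_k  # Eq. 6 / 7
                    summed = [s + a for s, a in zip(cum_k, trans(k, i))]
                    cands.append((ext, summed))
            if not cands:
                continue  # node i cannot be reached at time t
            # Eq. 4 as read here: predecessor with the best per-model sum.
            best_hist, _ = max(cands, key=lambda c: max(c[1]))
            # S13: predecessors whose history matches the winner's are merged.
            same = [v for h, v in cands if h == best_hist]
            l = scr(t, i)
            # S15 (Eq. 8); identical to S14 (Eq. 5) when len(same) == 1.
            cum_i = [max(v[m] for v in same) + l[m] for m in range(M)]
            if all(s < threshold for s in cum_i):              # S17, S19
                continue
            new_tokens[i] = (best_hist, cum_i)
        tokens = new_tokens
    return tokens.get(end)                                     # S21


# Toy network and made-up scores for M = 2 acoustic models.
arcs = [
    ("S", "1", None), ("S", "2", None),
    ("1", "1", None), ("2", "2", None), ("2", "1", None),
    ("1", "E", "hello"), ("2", "E", "world"),
]
node_scores = {"1": [-1.0, -3.0], "2": [-2.0, -2.5], "E": [0.0, 0.0]}
result = decode(
    arcs, "S", "E",
    scr=lambda t, n: node_scores[n],
    trans=lambda k, i: [0.0, 0.0],
    T=3, M=2,
)
print(result)  # (('hello',), [-2.0, -5.5])
```

In this run the second model's score at the end node (-5.5) comes from a predecessor chain that passed through node 2 at t = 1, while the first model's score (-2.0) comes from node 1. Both share the same empty history at the merge point, so a single token and a single pass over the network serve both models.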

As described above, the speech recognition apparatus according to this embodiment searches for the optimal path on a single search network using a plurality of acoustic models, calculating the accumulated acoustic score for each of the acoustic models as it goes. This makes it possible to perform highly robust speech recognition while suppressing increases in the calculation cost and in the amount of memory used for the calculation.

Note that the present invention is not limited to the above embodiment as it stands; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the plurality of constituent elements disclosed in the above embodiment. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiment. Furthermore, constituent elements from different embodiments may be combined as appropriate.

In the embodiment described above, the various programs executed by the speech recognition apparatus 100 may be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. The programs may also be recorded, as files in an installable or executable format, on a computer-readable recording medium such as a CD-ROM, a flexible disk (FD), a CD-R, or a DVD (Digital Versatile Disk), and provided as a computer program product.

In the embodiment described above, the method by which the route search unit 105 searches for the optimum route is not limited to the example above; for example, the dynamic programming technique described in Reference 2 may be used.
(Reference 2) L. Rabiner and B.-H. Juang, "Fundamentals of Speech Recognition", Prentice Hall Signal Processing Series, pp. 339-342, 1993.

An example of searching for the optimum route on the search network according to this modification will be described with reference to FIG. 8. In the figure, the horizontal axis represents nodes and the vertical axis represents time; the circles represent states in the search network, and the arrows represent state transitions. Each state (t,n) in the search network corresponds to node n at time t. The state (0,S), corresponding to the start node S at time t = 0, is called the initial state, and the state (T,E), corresponding to the terminal node E at time T when the voice input ends, is called the end state. Each state has route history information hist(t,n), which indicates the route from the initial state (0,S) to the state (t,n), and a cumulative score list cumscr(t,n), shown in Equation 9, which holds the accumulated acoustic scores corresponding to the plurality of acoustic models.

However, it is sufficient that the route history information and cumulative score list corresponding to each state can be referenced; they need not be held by the state itself. For example, the route history information and cumulative score lists for all states may be stored in a storage area of a main storage unit such as a RAM, with each state referring to them.
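The centralized storage layout described here can be sketched as follows. This is only an illustration, assuming that Python dictionaries keyed by (t, n) stand in for the storage area in main memory; the names `StateTable`, `put`, and `get` are hypothetical and do not appear in the embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class StateTable:
    """Hypothetical central store: states hold no data, they only index into it."""
    hist: dict = field(default_factory=dict)    # (t, n) -> tuple of labels
    cumscr: dict = field(default_factory=dict)  # (t, n) -> list of M scores

    def put(self, t, n, history, scores):
        # Store copies so later updates to the caller's lists do not leak in.
        self.hist[(t, n)] = tuple(history)
        self.cumscr[(t, n)] = list(scores)

    def get(self, t, n):
        # Look up the history and score list that belong to state (t, n).
        return self.hist[(t, n)], self.cumscr[(t, n)]


# Initial state (0, S) with an empty history and M = 2 models (assumed values).
table = StateTable()
table.put(0, "S", [], [0.0, 0.0])
```

With this layout a state is just the key (t, n), so removing a pruned state amounts to deleting two dictionary entries.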

At time t-1, the route search unit 105 uses the set Q_{t-1} of states (t-1,k), each having route history information hist(t-1,k) and a cumulative score list cumscr(t-1,k), to obtain the optimum route to the state (t,i) corresponding to node i at time t. The set Q_{t-1} is the set of states that are reachable at time t-1 and are passed immediately before the state (t,i) on a route from the initial state (0,S) to the state (t,i) at time t; when pruning is performed during the search for the optimum route, it means the set of states that remain without having been pruned.

To obtain the optimum route from the initial state (0,S) to the state (t,i) corresponding to node i at time t, the route search unit 105 first selects, from all the states included in the set Q_{t-1} of states at time t-1, the set Q^→(t,i) of states that can transition to the state (t,i) at time t. Next, for every state in the selected set Q^→(t,i), the route search unit 105 forms scores as the sum (accumulation) of the values, corresponding to each of the M acoustic models, of the cumulative score list cumscr(t-1,k) (where k ranges over all node numbers satisfying (t-1,k) ∈ Q^→(t,i)) and of the transition score list trans(k,i) given by Equation 2. Using Equation 10, it selects the node with the maximum score as the optimum node j^*.
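The predecessor selection can be sketched in Python as follows. Equation 10 is not reproduced in this text, so the sketch rests on two assumptions: scores are additive (for example, log-domain) values, and "the maximum score" means the largest per-model sum taken over all candidate predecessors and all M models. The function name and the dictionary layout are hypothetical.

```python
def select_best_predecessor(candidates, cumscr_prev, trans, i):
    """Return the node k whose best per-model sum
    cumscr_prev[k][m] + trans[(k, i)][m] is largest (assumed reading of Eq. 10)."""
    def best_over_models(k):
        # Element-wise sum of the two M-element lists, then the best model.
        return max(c + a for c, a in zip(cumscr_prev[k], trans[(k, i)]))
    return max(candidates, key=best_over_models)


# Two candidate predecessors of node "c", M = 2 acoustic models (made-up values).
cumscr_prev = {"a": [-5.0, -9.0], "b": [-7.0, -4.0]}
trans = {("a", "c"): [-1.0, -1.0], ("b", "c"): [-0.5, -0.5]}
j_star = select_best_predecessor(["a", "b"], cumscr_prev, trans, "c")
```

Here node "b" is chosen because its second model reaches -4.5, which exceeds the best value for node "a" (-6.0).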

Then, using the cumulative score list cumscr(t-1,j^*) of the state (t-1,j^*) corresponding to the optimum node j^* at time t-1, the transition score list trans(j^*,i), and the acoustic score list scr(t,i) corresponding to node i at time t, the route search unit 105 calculates the cumulative score list cumscr(t,i) of the state (t,i) according to Equation 11.

When the label "w" is assigned to the transition from the optimum node j^* to node i, the route history information hist(t,i) of the state (t,i) is obtained by appending "w" to hist(t-1,j^*), as shown in Equation 12.

When no label is assigned to the transition from the optimum node j^* to node i, hist(t,i) is obtained by assigning hist(t-1,j^*) to it unchanged, as shown in Equation 13.
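The score and history updates can be sketched together as one propagation step. Equations 11 to 13 are not reproduced in this text, so the sketch assumes that the cumulative score list is the element-wise sum of the three M-element lists named above and that a history is an ordered tuple of labels. The function name `propagate` and its parameters are hypothetical.

```python
def propagate(cumscr_best, trans_best, scr_i, hist_best, label=None):
    """Compute cumscr(t, i) and hist(t, i) from the optimum predecessor j*."""
    # Assumed form of Equation 11: per-model sum of accumulated, transition,
    # and acoustic scores.
    cumscr_i = [c + a + s for c, a, s in zip(cumscr_best, trans_best, scr_i)]
    # Equation 12: append the label; Equation 13: carry the history over as is.
    hist_i = hist_best + (label,) if label is not None else hist_best
    return cumscr_i, hist_i


# Labeled transition: the history grows by one word.
cumscr_i, hist_i = propagate(
    [-7.0, -4.0], [-0.5, -0.5], [-2.0, -3.0], ("hello",), "world"
)
# Unlabeled transition: the history is unchanged.
_, hist_same = propagate([-7.0, -4.0], [-0.5, -0.5], [-2.0, -3.0], ("hello",))
```

Keeping the history as an immutable tuple means several states can share one history object without copying it.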

The route search unit 105 performs these processes at every time t (t = 1, 2, ..., T) up to the time T at which the voice input ends (the end time), and outputs the route history information hist(T,E) of the end state (T,E), corresponding to the terminal node E at time t = T, as indicating the optimum route. This is the speech recognition result.

When labels are assigned to all nodes in the search network, the optimum route obtained by the above method is the sequence of nodes through which the optimum route passes. When word labels are assigned to the transitions from one node to another and the smallest unit handled in the optimum route is a word, there is no need to explicitly obtain the optimum route through, for example, the states of each phoneme within each word; it is sufficient to obtain the optimum route in units of words. In such a case, there may be a plurality of states (t-1,k) having the same route history information as that of the state (t,i) obtained by propagating the state (t-1,j^*) corresponding to the optimum node j^* at time t-1. Letting Q^*→(t,i) denote the set of such states, the route search unit 105 selects the maximum among the values corresponding to each of the M acoustic models on the routes from the initial state (0,S) to the states (t-1,k), and uses it to update the cumulative score list cumscr(t,i) of the token (t,i) corresponding to node i at time t. Specifically, the route search unit 105 updates the cumulative score list cumscr(t,i) using Equation 14 below instead of Equation 11.
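One possible reading of this merged update is sketched below. Equation 14 is not reproduced in this text, so the sketch is an assumption: for each acoustic model separately, it takes the best accumulated-plus-transition value over all states in Q^*→(t,i) and then adds that model's acoustic score at node i. The function name and data layout are hypothetical.

```python
def merge_same_history(states, cumscr_prev, trans, scr_i, i):
    """Per-model maximum over predecessor states that share one history
    (assumed reading of Equation 14)."""
    merged = []
    for m in range(len(scr_i)):
        # Best predecessor for model m only; other models may pick other states.
        best = max(cumscr_prev[k][m] + trans[(k, i)][m] for k in states)
        merged.append(best + scr_i[m])
    return merged


# Two predecessors with identical word histories, M = 2 models (made-up values).
cumscr_prev = {"a": [-5.0, -9.0], "b": [-7.0, -4.0]}
trans = {("a", "c"): [-1.0, -1.0], ("b", "c"): [-0.5, -0.5]}
merged = merge_same_history(["a", "b"], cumscr_prev, trans, [-2.0, -3.0], "c")
```

In this example model 0 takes its value from node "a" and model 1 from node "b", which differs from the single-predecessor update where both models follow j^*.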

Here, k ranges over all node numbers satisfying (t-1,k) ∈ Q^*→(t,i). When the route search unit 105 updates the cumulative score list using Equation 14, it is desirable, for convenience, to update the route history information hist(t,i) of the state (t,i) using Equation 12 or Equation 13. The route search unit 105 also prunes according to a pruning condition. For example, the pruning condition is that all the values calculated for the plurality of acoustic models in the cumulative score list of a state fall below a threshold; the route search unit 105 removes any state that satisfies this condition. Alternatively, the pruning condition may be that the value calculated for at least one acoustic model in the cumulative score list of a state falls below a threshold; in that case, the route search unit 105 prunes by removing the values that satisfy the condition from the cumulative score list.
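The two pruning conditions can be contrasted in a short sketch. The threshold value, the function names, and the use of `None` to mark a value removed from the list are all assumptions made for illustration; the text does not specify how a removed value is represented.

```python
def should_remove_state(cumscr, threshold):
    """First condition: remove the state when every model's value is below
    the threshold."""
    return all(s < threshold for s in cumscr)


def remove_low_values(cumscr, threshold):
    """Second condition: keep the state but drop each below-threshold value.
    None marks a removed value (an assumption of this sketch)."""
    return [s if s >= threshold else None for s in cumscr]


threshold = -10.0
drop_all_low = should_remove_state([-12.0, -15.0], threshold)
drop_mixed = should_remove_state([-12.0, -8.0], threshold)
kept_values = remove_low_values([-12.0, -8.0], threshold)
```

The first condition discards a route only when no acoustic model supports it, while the second stops tracking individual models on a route that otherwise survives.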

The procedure of the speech recognition processing performed by the speech recognition apparatus 100 according to this modification is the same as that shown in FIG. 4. The procedure of step S3 is also substantially the same as that shown in FIG. 5, except that this modification performs the calculation using the route history information and cumulative score list held by each state instead of those held by each token. The calculation method itself is the same as in the first embodiment.

With this configuration as well, a plurality of acoustic models are used on a single search network, the accumulated acoustic scores corresponding to each of the acoustic models are calculated, and the optimum route is searched for. This makes highly robust speech recognition possible while suppressing increases in calculation cost and in the amount of memory used for the calculation.

In the embodiment and modification described above, the route search unit 105 need not perform pruning.

The embodiment and modification described above explain the processing performed by the route search unit 105 when a word is assigned as a label to the transition from one node to another. The processing is not limited to this case: the route search unit 105 performs the same processing when a word is assigned as a label to the node itself.

In the embodiment and modification described above, the route search unit 105 calculates the acoustic score for each node; however, the acoustic score may instead be calculated for each arc.

Description of Symbols

100 Speech recognition apparatus
101 Voice input reception unit
102 Acoustic feature extraction unit
103 Acoustic model storage unit
104 Search network storage unit
105 Route search unit

Claims

A speech recognition apparatus comprising:
a first storage unit that stores a plurality of acoustic models having the same structure;
a second storage unit that stores a search network common to the plurality of acoustic models, the search network including a start node representing a start point, an end node representing an end point, and at least one node between the start node and the end node;
a reception unit that receives a voice input;
an extraction unit that extracts an acoustic feature quantity using the voice;
a calculation unit that uses the acoustic feature quantity, the plurality of acoustic models, and the search network to calculate, for each route indicating a sequence of nodes passed from the start node to the end node on the search network, a score corresponding to each of the plurality of acoustic models;
a search unit that searches for a first route, which is the optimum route from the start node to the end node on the search network, by selecting, from among the routes, the first route for which the score corresponding to at least one acoustic model is the maximum; and
an output unit that outputs information indicating the first route, which is the recognition result of the voice.

The speech recognition apparatus according to claim 1, wherein the first storage unit stores the plurality of acoustic models, each of which is configured as a hidden Markov model, the number of states of the hidden Markov model and the structure of the state transitions being common to the plurality of acoustic models.

The calculation unit uses the acoustic feature quantity, the plurality of acoustic models, and the search network to calculate, for each node passed from the start node to the end node, a first score corresponding to each of the plurality of acoustic models, and calculates the score corresponding to each of the plurality of acoustic models by accumulating the first scores calculated for the nodes passed on the route from the start node to that node;
The search unit sequentially searches for an optimum route from the start node to the end node for each node passing from the start node to the end node, and reaches from the start node to the end node. Corresponds to at least one acoustic model for each second route that reaches all the second nodes that reach the first node, and that is a node that passes one node before the first node 2. The optimal route from the start node to the first node via the third route is selected by selecting a third route that maximizes the score. Voice recognition device.

The speech recognition apparatus according to claim 1, wherein
the calculation unit uses the acoustic feature quantity, the plurality of acoustic models, and the search network to calculate, for each node passed from the start node to the end node, a first score corresponding to each of the plurality of acoustic models, and calculates the score by accumulating the first scores calculated for the nodes passed on the route from the start node to that node, and
the search unit sequentially searches for the optimum route from the start node to the end node for each node passed from the start node to the end node and, when there is one second route reaching a first node, which is a node passed on the way from the start node to the end node, updates the score calculated for the second route in correspondence with each of the plurality of acoustic models to the score selected, for each acoustic model, from among the scores calculated in correspondence with each of the plurality of acoustic models for third routes from the start node to second nodes passed immediately before the first node, thereby searching for the second route, which is the optimum route from the start node to the first node.

The speech recognition apparatus according to claim 1, wherein
a label that is at least one of a phoneme and a word is assigned to all or some of at least one of the nodes and the arcs representing transitions between the nodes, and
the output unit outputs the information indicating the sequence of labels assigned to all or some of at least one of the nodes passed from the start node to the end node on the first route and the arcs representing transitions between those nodes.

A speech recognition method executed by a speech recognition apparatus that includes a first storage unit that stores a plurality of acoustic models having the same structure, a second storage unit that stores a search network common to the plurality of acoustic models and including a start node representing a start point, an end node representing an end point, and at least one node between the start node and the end node, a reception unit, an extraction unit, a calculation unit, a search unit, and an output unit, the method comprising:
a step in which the reception unit receives a voice input;
a step in which the extraction unit extracts an acoustic feature quantity using the voice;
a step in which the calculation unit uses the acoustic feature quantity, the plurality of acoustic models, and the search network to calculate, for each route indicating a sequence of nodes passed from the start node to the end node on the search network, a score corresponding to each of the plurality of acoustic models;
a step in which the search unit searches for a first route, which is the optimum route from the start node to the end node on the search network, by selecting, from among the routes, the first route for which the score corresponding to at least one acoustic model is the maximum; and
a step in which the output unit outputs information indicating the first route, which is the recognition result of the voice.

A speech recognition program for causing a computer included in a speech recognition apparatus, the apparatus comprising a first storage unit that stores a plurality of acoustic models having the same structure and a second storage unit that stores a search network common to the plurality of acoustic models and including a start node representing a start point, an end node representing an end point, and at least one node between the start node and the end node, to function as:
reception means for receiving a voice input;
extraction means for extracting an acoustic feature quantity using the voice;
calculation means for calculating, using the extracted acoustic feature quantity, the plurality of acoustic models, and the search network, a score corresponding to each of the plurality of acoustic models for each route indicating a sequence of nodes passed from the start node to the end node on the search network;
search means for searching for a first route, which is the optimum route from the start node to the end node on the search network, by selecting, from among the routes, the first route for which the score corresponding to at least one acoustic model is the maximum; and
output means for outputting information indicating the first route, which is the recognition result of the voice.