JP2010204274A - Speech recognition device and method and program therefore

Info

Publication number: JP2010204274A
Application number: JP2009048035A
Authority: JP
Inventors: Takashi Masuko; 貴史益子
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2009-03-02
Filing date: 2009-03-02
Publication date: 2010-09-16

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device capable of obtaining a recognition result with a small amount of calculation. SOLUTION: The speech recognition device includes: a feature extracting section 111 for extracting an acoustic feature amount for each frame from input speech; and a search section 112 which performs speech recognition by searching and pruning, for the acoustic feature amount, on a compressed network 102 created by merging a plurality of adjoining nodes in a search network 101, excluding from the search the nodes in the search network 101 that correspond to the nodes pruned in the compressed network 102, and repeating the searching and pruning until the end point of the input speech. COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a speech recognition apparatus, a method thereof, and a program thereof.

Conventionally, a technique has been proposed in which words are clustered, matching is performed on the word with the largest initial score in each cluster (the representative word), and the other words in the cluster are then re-evaluated (see Patent Document 1).

In speech recognition, a large recognition vocabulary makes the search network large, and the search requires a large amount of calculation. A beam search such as that described in Non-Patent Document 1 is used to reduce this amount of calculation. A beam search narrows the search space, and thereby reduces the amount of calculation, by alternating between searching the nodes of the search network and pruning at each step of the search. However, because the search network has many branches near the beginning of a sentence and near word boundaries, the number of nodes to be searched is large even with a beam search, and a large amount of calculation is still required.

To address this, Patent Document 2 proposes the following three methods.

In the first method, the first search uses a search network that has been made smaller by merging similar phonemes near the beginning of words in advance. If the phoneme merging prevents the search result from being uniquely determined, a search network without merged phonemes is constructed from the result of the first search and a second search is performed, thereby reducing the amount of calculation.

In the second method, rough standard patterns are used to narrow down the recognition candidates with a small amount of calculation, and only those candidates are re-matched using precise standard patterns.

The third method is the second method without the precise standard patterns and without the re-matching.

X. Huang, A. Acero and H.-W. Hon, “Spoken Language Processing,” Prentice Hall PTR, pp. 606-608, 2001.

JP 2005-134442 A
JP 2001-312293 A

However, each of the above conventional techniques has a problem in that the amount of calculation for speech recognition is not sufficiently reduced, and, when it is reduced, recognition accuracy deteriorates.

The present invention has been made to solve the above problems, and an object of the present invention is to provide a speech recognition apparatus, a method thereof, and a program thereof that obtain a highly accurate speech recognition result with a small amount of calculation.

The present invention is a speech recognition apparatus comprising: a feature extraction unit that extracts an acoustic feature amount for each frame from input speech; and a search unit that performs speech recognition by searching and pruning, for the acoustic feature amount, on at least one compressed network generated by merging a plurality of adjacent nodes in a search network, excluding from the search targets the nodes of the search network that correspond to the pruned nodes of the compressed network, and searching and pruning until the end of the input speech.

According to the present invention, a highly accurate speech recognition result can be obtained with a small amount of calculation.

Hereinafter, a speech recognition apparatus according to an embodiment of the present invention will be described with reference to the accompanying drawings.

(First embodiment)
A speech recognition apparatus according to a first embodiment will be described with reference to FIGS. 1 to 4 and FIG. 11.

The configuration of the speech recognition apparatus of this embodiment will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the speech recognition apparatus according to this embodiment.

The speech recognition apparatus includes a feature extraction unit 111, a search network 101, a compressed network 102, and a search unit 112.

The feature extraction unit 111 extracts an acoustic feature amount for each frame from the input speech.

The search network 101 receives the acoustic feature amount from the feature extraction unit 111.

The compressed network 102 is generated by merging adjacent nodes with high similarity in the search network 101.

The search unit 112 performs a search using the search network 101 and the compressed network 102 and outputs a recognition result.

Note that this speech recognition apparatus can also be realized by using, for example, a general-purpose computer as basic hardware. In that case, the apparatus may be realized by installing the program on the computer in advance, or by storing the program on a storage medium such as a CD-ROM, or distributing it over a network, and installing it on the computer as appropriate.

First, the search network 101 and a search method using the search network 101 will be described with reference to FIG. 11.

FIG. 11 shows an example of the search network 101 for isolated word recognition in which each phoneme is modeled by a three-state left-to-right HMM and the recognition vocabulary consists of nine words: “Sato”, “Shibata”, “Hirai”, “Honda”, “Kitamura”, “Katayama”, “Takahashi”, “Nakai”, and “Maki”. A black circle represents a start node, a white circle a normal node, and a double circle an end node. A word in a square is the word label attached to the arc to which it is joined by a dotted line.

Various search methods have been proposed. In this embodiment, a method called “token passing” is described.

In token passing, tokens are assigned to nodes, and each time an input frame is given they are propagated along the arcs to the next nodes. Each token holds a cumulative score and the history of the word labels on the path along which it has been propagated.

First, before the search starts, a token is assigned to the start node. When an input frame is given, tokens are propagated along the arcs to the next nodes. If several tokens are propagated to the same node, the token with the best cumulative score is selected. The score of each node that holds a token is then evaluated and added to the cumulative score of that token. Finally, when the speech input ends, the word label history held by the token assigned to an end node is output as the recognition result.
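The token passing procedure above can be sketched as follows. This is a minimal illustration, not the implementation of the apparatus: the arc list format, the `node_score` callback, and the toy network are assumptions made for the example, and scores are assumed to be log-likelihoods where higher is better.

```python
def token_passing(arcs, start, finals, frames, node_score):
    """Token passing over a network given as (src, dst, word_label) arcs.

    A token is a (cumulative score, word label history) pair. The format of
    `arcs` and the `node_score(node, frame)` callback are hypothetical.
    """
    tokens = {start: (0.0, ())}  # token assigned to the start node
    for frame in frames:
        propagated = {}
        for src, dst, label in arcs:
            if src not in tokens:
                continue
            score, history = tokens[src]
            if label is not None:
                history = history + (label,)
            # Several tokens may reach the same node: keep the best one.
            if dst not in propagated or score > propagated[dst][0]:
                propagated[dst] = (score, history)
        # Evaluate the score of every node that now holds a token.
        tokens = {node: (score + node_score(node, frame), history)
                  for node, (score, history) in propagated.items()}
    alive = [tokens[node] for node in finals if node in tokens]
    return max(alive, key=lambda t: t[0]) if alive else None


# Toy network: two one-node "words" with self-loops (hypothetical data).
ARCS = [(0, 1, "A"), (0, 2, "B"), (1, 1, None), (2, 2, None)]
SCORES = {1: -1.0, 2: -2.0}
best = token_passing(ARCS, 0, {1, 2}, range(3), lambda node, frame: SCORES[node])
```

In this toy run the token that passed through the arc labelled "A" ends with the better cumulative score, so its history is returned as the recognition result.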

When an N-best search is performed and several tokens are propagated to the same node, the top N tokens (word label histories and their cumulative scores) are kept instead of only the token with the best cumulative score.
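A small sketch of the N-best merge at a single node is shown below. Keeping only the best score for each distinct word history before taking the top N is an assumption of this example; the text only states that the top N tokens are kept.

```python
import heapq


def keep_n_best(incoming, n):
    """Keep the top-n tokens among those propagated to one node.

    `incoming` is an iterable of (cumulative score, word history) pairs.
    Deduplicating by word history is an assumption of this sketch.
    """
    best_by_history = {}
    for score, history in incoming:
        if history not in best_by_history or score > best_by_history[history]:
            best_by_history[history] = score
    return heapq.nlargest(n, ((s, h) for h, s in best_by_history.items()))


incoming = [(-3.0, ("A",)), (-1.0, ("B",)), (-2.0, ("A",)), (-5.0, ("C",))]
top2 = keep_n_best(incoming, 2)
```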

In the search on the compressed network 102, described later, no recognition result needs to be output, because the recognition result is obtained from the search on the search network 101. The search on the compressed network 102 therefore does not need to keep word label histories, and even in N-best recognition it needs no special processing such as managing multiple tokens (word label histories and their cumulative scores).

Next, methods for generating the compressed network 102 will be described. The following methods can be used.

The first method merges adjacent nodes whose similarity is greater than a predetermined value. The nodes adjacent to a given node are typically its sibling nodes, that is, nodes that share a parent node or a child node with it. However, parent and child nodes directly connected to it by an arc may also be included.
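The first method might look like the sketch below, restricted to sibling nodes. The `children` dictionary, the `similarity` callback, and the use of a union-find structure (which makes merging transitive) are assumptions of this example; the text does not specify how similarity is computed or how chains of similar nodes are handled.

```python
def merge_similar_siblings(children, similarity, threshold):
    """Map each node to the representative of its merged group.

    `children` maps a parent node to its list of child nodes; siblings whose
    similarity exceeds `threshold` are merged. All names are hypothetical.
    """
    representative = {}

    def find(node):
        representative.setdefault(node, node)
        while representative[node] != node:
            representative[node] = representative[representative[node]]
            node = representative[node]
        return node

    for siblings in children.values():
        for node in siblings:
            find(node)  # register every sibling, merged or not
        for i, a in enumerate(siblings):
            for b in siblings[i + 1:]:
                if similarity(a, b) > threshold:
                    ra, rb = find(a), find(b)
                    if ra != rb:
                        representative[max(ra, rb)] = min(ra, rb)
    return {node: find(node) for node in list(representative)}


# Hypothetical similarities between the three children of node 0.
CHILDREN = {0: [1, 2, 3]}
SIM = {(1, 2): 0.9, (1, 3): 0.1, (2, 3): 0.2}
merged = merge_similar_siblings(CHILDREN, lambda a, b: SIM[(a, b)], 0.5)
```

Here nodes 1 and 2 end up in one merged node while node 3 stays on its own.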

The second method repeatedly merges the adjacent nodes with the highest similarity until the compression rate given by equation (1) below becomes larger than a predetermined value.

Compression rate = (number of nodes in the network after compression) / (number of nodes in the network before compression) ... (1)

However, the generation method is not limited to these.

The operation of the speech recognition apparatus will be described with reference to FIGS. 1 and 2. FIG. 2 is a flowchart showing the operation of the speech recognition apparatus when one compressed network 102 is used.

In step S1111, the feature extraction unit 111 performs feature extraction on the input speech for each frame at a fixed time interval to obtain acoustic feature amounts.

In step S1121, the search unit 112 first performs a search on the compressed network 102 using the acoustic feature amount obtained by the feature extraction unit 111.

In step S1122, the search unit 112 performs pruning on the compressed network 102.

In step S1123, the search unit 112 performs a search on the search network 101. Here, the search unit 112 reduces the amount of calculation for the search by excluding from the search targets the nodes that correspond to nodes pruned on the compressed network 102.

In step S1124, the search unit 112 performs pruning on the search network 101.

In step S1125, the search unit 112 repeats the above steps if the end of the input speech has not been reached (N), and outputs the recognition result when the end of the input speech is reached (Y).
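The control flow of steps S1121 to S1125 can be sketched as below. This is a deliberately simplified illustration: each node is treated as an independent hypothesis, so token propagation along arcs and word histories are omitted, and the beam-width pruning rule, the scoring callbacks, and all names are assumptions of the example rather than details given in the text.

```python
def beam_prune(scores, beam):
    """Keep hypotheses whose score is within `beam` of the best one."""
    if not scores:
        return {}
    best = max(scores.values())
    return {node: s for node, s in scores.items() if s >= best - beam}


def recognize(frames, fine_to_coarse, coarse_score, fine_score, beam):
    """Per-frame loop: compressed network first, then the search network.

    `fine_to_coarse` maps each search-network node to its merged node.
    Returns the best surviving node and the number of fine-node evaluations.
    """
    coarse = {c: 0.0 for c in set(fine_to_coarse.values())}
    fine = {n: 0.0 for n in fine_to_coarse}
    fine_evaluations = 0
    for frame in frames:
        # S1121: search on the compressed network.
        coarse = {c: s + coarse_score(c, frame) for c, s in coarse.items()}
        # S1122: prune on the compressed network.
        coarse = beam_prune(coarse, beam)
        # S1123: search only nodes whose merged node survived.
        fine = {n: s for n, s in fine.items() if fine_to_coarse[n] in coarse}
        fine_evaluations += len(fine)
        fine = {n: s + fine_score(n, frame) for n, s in fine.items()}
        # S1124: prune on the search network.
        fine = beam_prune(fine, beam)
    # S1125: end of input reached, output the best hypothesis.
    best = max(fine, key=fine.get) if fine else None
    return best, fine_evaluations


# Hypothetical scores: merged node "B" is pruned in the first frame.
MAP = {"a1": "A", "a2": "A", "b1": "B"}
COARSE = {"A": -1.0, "B": -5.0}
FINE = {"a1": -1.0, "a2": -2.0, "b1": -6.0}
result = recognize(range(2), MAP, lambda c, f: COARSE[c], lambda n, f: FINE[n], beam=3.0)
```

In this toy run only four fine-node evaluations are made over two frames, instead of the six that a search without the compressed network would need.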

Steps S1121 to S1125 above are the search steps performed by the search unit 112. Next, the operation of steps S1121 to S1123 will be described with reference to FIGS. 3 and 4, in comparison with the operation of a conventional speech recognition apparatus.

FIG. 3 is an example of the search network 101. Circles are network nodes, the numbers in the circles are node numbers, arrows are arcs, and “word 1” to “word 6” are the word labels attached to the arcs.

FIG. 4 is an example of the compressed network 102 generated by merging adjacent nodes with high similarity in the search network 101 shown in FIG. 3. In the compressed network 102, node -1 is generated by merging nodes 1 and 2 of the search network 101, and node -2 by merging nodes 3, 4, and 5. Node -0 and node -3 correspond to node 0 and node 6 of the search network 101, respectively; because no merging was performed for them, they are equivalent to node 0 and node 6.
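The correspondence between FIG. 3 and FIG. 4 can be written down as a node map, from which the compression rate of equation (1) follows directly. The dictionary representation and the string names for merged nodes are assumptions of this sketch.

```python
# Search-network node -> compressed-network node, as described for FIG. 4.
FINE_TO_COMPRESSED = {0: "-0", 1: "-1", 2: "-1", 3: "-2", 4: "-2", 5: "-2", 6: "-3"}


def compression_rate(node_map):
    """Equation (1): nodes after compression / nodes before compression."""
    return len(set(node_map.values())) / len(node_map)


def members(node_map):
    """Group search-network nodes by the merged node they belong to."""
    groups = {}
    for node, merged in node_map.items():
        groups.setdefault(merged, []).append(node)
    return groups


rate = compression_rate(FINE_TO_COMPRESSED)  # 4 nodes / 7 nodes
groups = members(FINE_TO_COMPRESSED)
```

For this example the compressed network has 4 nodes against 7 in the search network, so the compression rate is 4/7, roughly 0.57.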

The following describes the case where a conventional speech recognition apparatus and the speech recognition apparatus of this embodiment search from node 0 of the search network 101 and from the corresponding node -0 of the compressed network 102.

A conventional speech recognition apparatus searches only the search network 101, without using the compressed network 102. When searching from node 0, it therefore has to search six nodes, node 1 to node 6. Here, a search consists of calculating the likelihood of the acoustic feature amount at each node, calculating the cumulative likelihood up to each node, and managing the word history.

Next, the speech recognition apparatus of this embodiment will be described with reference to FIGS. 2, 3, and 4.

In step S1121, the search unit 112 performs a search on the compressed network 102 in FIG. 4. When searching from node -0, it searches three nodes: node -1, node -2, and node -3. Because no word labels are attached to the arcs of the compressed network 102, the word history does not need to be managed in the search on the compressed network 102. An N-best search normally requires managing the top N word histories, but because the search on the compressed network 102 does not manage word histories, no special processing is needed for an N-best search either.

In step S1122, the search unit 112 performs pruning on node -1, node -2, and node -3, which were searched in step S1121.

In step S1123, the search unit 112 searches only the nodes of the search network 101 that correspond to nodes of the compressed network 102 that were not pruned.

For example, if the search unit 112 prunes node -2 and node -3 but not node -1 in step S1122, then in step S1123 it excludes nodes 3 to 5, which correspond to node -2, and node 6, which corresponds to node -3, from the search targets, and searches only node 1 and node 2, which correspond to node -1. In this case, a total of five nodes are searched: three in the compressed network 102 and two in the search network 101.

The speech recognition apparatus of this embodiment can thus search fewer nodes than the conventional speech recognition apparatus and obtain a recognition result with a smaller amount of calculation.

If the search unit 112 prunes node -1 and node -2 but not node -3 in step S1122, then in step S1123 it excludes nodes 1 to 5, which correspond to node -1 and node -2, from the search targets and searches only node 6, which corresponds to node -3. In this case, a total of four nodes are searched: three in the compressed network 102 and one in the search network 101.

Again, the apparatus searches fewer nodes than the conventional apparatus and obtains a recognition result with a smaller amount of calculation. Moreover, because node -3 of the compressed network 102 is equivalent to node 6 of the search network 101, the likelihood of the acoustic feature amount at node 6 is the same as that at node -3 and does not need to be recalculated, which reduces the amount of calculation further.

Finally, if node -1 and node -3 are pruned but node -2 is not in step S1122, then in step S1123 the search unit 112 excludes node 1 and node 2, which correspond to node -1, and node 6, which corresponds to node -3, from the search targets and searches only nodes 3 to 5. In this case, a total of six nodes are searched: three in the compressed network 102 and three in the search network 101. The total number of searched nodes is the same as in the conventional speech recognition apparatus, but the search on the compressed network 102 does not manage word histories. The amount of calculation is therefore still smaller than in the conventional apparatus, even though the total number of searched nodes is the same.
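The three cases above can be checked with a short count. The node map follows the FIG. 3 and FIG. 4 example; the function name and data layout are assumptions of this sketch.

```python
# Search-network nodes reachable from node 0 -> merged node (FIG. 4 example).
FINE_TO_COMPRESSED = {1: "-1", 2: "-1", 3: "-2", 4: "-2", 5: "-2", 6: "-3"}


def nodes_searched(surviving):
    """Total nodes searched when only `surviving` merged nodes are not pruned."""
    compressed_searched = len(set(FINE_TO_COMPRESSED.values()))  # always 3
    fine_searched = sum(1 for merged in FINE_TO_COMPRESSED.values()
                        if merged in surviving)
    return compressed_searched + fine_searched


# One surviving merged node per case, as in the three examples in the text.
totals = {merged: nodes_searched({merged}) for merged in ("-1", "-2", "-3")}
```

The totals are 5, 6, and 4 nodes when node -1, node -2, or node -3 survives, compared with 6 nodes for the conventional search.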

According to the speech recognition apparatus of this embodiment, the amount of calculation can be reduced by searching and pruning on the compressed network 102 and excluding from the search targets the nodes of the search network 101 that correspond to nodes pruned on the compressed network 102.

In addition, because searching and pruning are performed in every frame not only on the compressed network 102 but also on the search network 101, the recognition result and the cumulative likelihood of the input speech calculated on the search network 101 are available immediately after the utterance ends.

(Modification)
A plurality of compressed networks 102 may also be used.

In this case, for example, a first compressed network 102 and (k+1)-th compressed networks 102 are used, where k = 1, ..., K-1.

The first compressed network 102 is generated by merging adjacent nodes with high similarity in the search network 101. The (k+1)-th compressed network 102 is generated by merging adjacent nodes with high similarity in the k-th compressed network 102.

First, searching and pruning are performed on the K-th compressed network 102. Next, the nodes of the k-th compressed network 102 that correspond to nodes pruned on the (k+1)-th compressed network 102 are excluded from the search targets, and a search is performed on the k-th compressed network 102.

This pruning and searching are repeated from k = K-1 down to k = 1.

Finally, the nodes of the search network 101 that correspond to nodes pruned on the first compressed network 102 are excluded from the search targets, and a search is performed on the search network 101.
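The coarse-to-fine cascade over several compressed networks might be sketched as follows. Level 0 stands for the search network 101 and level K for the K-th (coarsest) compressed network; the `keep` callback, which stands in for search plus pruning at one level, and all data below are assumptions of this example.

```python
def cascade_survivors(level_nodes, parent_maps, keep):
    """Run search and pruning from the coarsest level down to level 0.

    level_nodes[k]: nodes at level k (0 = search network, K = coarsest).
    parent_maps[k]: node at level k -> its merged node at level k + 1.
    keep(level, candidates): hypothetical callback returning the set of
    candidates that survive search and pruning at that level.
    """
    top = len(level_nodes) - 1
    survivors = keep(top, set(level_nodes[top]))
    for k in range(top - 1, -1, -1):
        # Exclude nodes whose merged node was pruned one level up.
        candidates = {n for n in level_nodes[k] if parent_maps[k][n] in survivors}
        survivors = keep(k, candidates)
    return survivors


# Hypothetical two-level cascade on top of the FIG. 3 / FIG. 4 node map.
LEVELS = [[1, 2, 3, 4, 5, 6], ["-1", "-2", "-3"], ["X", "Y"]]
PARENTS = [
    {1: "-1", 2: "-1", 3: "-2", 4: "-2", 5: "-2", 6: "-3"},
    {"-1": "X", "-2": "X", "-3": "Y"},
]
SCORE = {"X": -1.0, "Y": -3.0, "-1": -2.0, "-2": -1.0, "-3": -0.5,
         1: -1.0, 2: -2.0, 3: -3.0, 4: -0.5, 5: -4.0, 6: -0.1}


def keep_best(level, candidates):
    # Toy pruning rule: keep only the best-scoring candidate.
    return {max(candidates, key=SCORE.get)} if candidates else set()


final = cascade_survivors(LEVELS, PARENTS, keep_best)
```

With these toy scores, node 6 is never evaluated even though it has the best score at level 0, because its merged node was pruned at a coarser level. The example thus also shows how aggressive pruning on a compressed network can remove a good hypothesis.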

(Second embodiment)
A speech recognition apparatus according to a second embodiment will be described with reference to FIGS. 5 and 6.

The configuration of the speech recognition apparatus of this embodiment will be described with reference to FIG. 5. FIG. 5 is a block diagram showing the speech recognition apparatus.

The speech recognition apparatus includes a feature extraction unit 211, a search network 201, a partial compressed network 202, and a search unit 212.

The feature extraction unit 211 extracts an acoustic feature amount for each frame from the input speech.

The partial compressed network 202 is generated by merging adjacent nodes with high similarity in a partial network of the search network 201, which receives the acoustic feature amount from the feature extraction unit 211.

The search unit 212 performs search processing using the search network 201 and the partial compressed network 202, and outputs a recognition result.

Next, methods for selecting a partial network of the search network 201 will be described.

The first selection method selects partial networks whose number of branches is larger than a predetermined value.

The second selection method selects partial networks whose compression rate given by equation (1) is larger than a predetermined value.

However, the selection method is not limited to these.
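The first selection method could be sketched as below. Treating a partial network as a parent node together with its outgoing branches is an assumption of this example; the text does not define how partial networks are delimited.

```python
def select_partial_networks(children, min_branches):
    """Select parents whose number of branches exceeds `min_branches`.

    `children` maps a node to the list of nodes it branches to (hypothetical
    representation of the search network).
    """
    return {parent: kids for parent, kids in children.items()
            if len(kids) > min_branches}


CHILDREN = {0: [1, 2, 3, 4], 1: [5], 2: [6, 7]}
selected = select_partial_networks(CHILDREN, 2)
```

Only the part of the network that fans out from node 0 is selected here, which matches the motivation that branching is heaviest near the beginning of a sentence and near word boundaries.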

Next, methods for generating the partial compressed network 202 will be described.

The first generation method merges adjacent nodes whose similarity is greater than a predetermined value.

The second generation method repeatedly merges the adjacent nodes with the highest similarity until the compression rate given by equation (1) becomes larger than a predetermined value.

Note that the entire search network 201 also counts as a partial network, and a compressed network generated for the entire search network 201 also counts as a partial compressed network 202.

The operation of the speech recognition apparatus will be described with reference to FIGS. 5 and 6. FIG. 6 is a flowchart showing the operation of the speech recognition apparatus.

In step S2111, the feature extraction unit 211 performs feature extraction on the input speech for each frame at a fixed time interval to obtain acoustic feature amounts.

In step S2121, the search unit 212, using the acoustic feature amount obtained by the feature extraction unit 211, first determines whether a partial compressed network 202 exists for the part of the search network 201 being searched. If one exists, the process proceeds to step S2122 (Y); otherwise it proceeds to step S2124 (N).

In step S2122, the search unit 212 performs a search on the partial compressed network 202, and in step S2123 it performs pruning.

In step S2124, the search unit 212 performs a search on the search network 201. If a partial compressed network 202 exists for a partial network of the search network 201 included in the search targets, the nodes corresponding to nodes pruned on the partial compressed network 202 are excluded from the search targets. This reduces the amount of calculation for the search.

In step S2125, the search unit 212 performs pruning on the search network 201.

In step S2126, the search unit 212 repeats the above steps if the end of the input speech has not been reached (N), and outputs the recognition result when the end of the input speech is reached (Y).
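The exclusion rule of step S2124 differs from the first embodiment in that some search-network nodes have no partial compressed network at all. A minimal sketch, with hypothetical names and data, is:

```python
def search_targets(active_nodes, fine_to_partial, surviving_partial):
    """Nodes of the search network to be searched in step S2124.

    Nodes not covered by any partial compressed network are always searched;
    covered nodes are searched only if their merged node was not pruned.
    """
    return [node for node in active_nodes
            if node not in fine_to_partial
            or fine_to_partial[node] in surviving_partial]


# Hypothetical data: nodes 1-3 are covered, nodes 7 and 8 are not.
PARTIAL_MAP = {1: "-1", 2: "-1", 3: "-2"}
targets = search_targets([1, 2, 3, 7, 8], PARTIAL_MAP, {"-1"})
```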

According to the speech recognition apparatus of this embodiment, searching and pruning are performed on the partial compressed network 202, and the nodes of the search network 201 that correspond to nodes pruned on the partial compressed network 202 are then excluded from the search targets. This suppresses the increase in the amount of calculation in the partial networks and reduces the amount of calculation more efficiently.

(Modification)
In this embodiment as well, as in the first embodiment, a plurality of partial compressed networks 202 may be generated and used for each partial network.

A partial compressed network 202 may also be generated and used for a partial network of a partial compressed network 202.

(Third embodiment)
A speech recognition apparatus according to a third embodiment will be described with reference to FIGS. 7 and 8.

The configuration of the speech recognition apparatus of this embodiment will be described with reference to FIG. 7. FIG. 7 is a block diagram showing the speech recognition apparatus according to this embodiment.

The speech recognition apparatus includes a feature extraction unit 311, a search network 301, a compressed network generation unit 312, and a search unit 313.

The feature extraction unit 311 extracts an acoustic feature amount for each frame from the input speech.

The compressed network generation unit 312 generates a partial compressed network 302 by merging adjacent nodes with high similarity in a partial network of the search network 301.

When the acoustic feature amount is input from the feature extraction unit 311, the search unit 313 performs search processing using the search network 301 and the partial compressed network 302 and outputs a recognition result.

The operation of the speech recognition apparatus of this embodiment will be described with reference to FIGS. 7 and 8. FIG. 8 is a flowchart showing the operation of the speech recognition apparatus.

ステップＳ３１２１において、圧縮ネットワーク生成部３１２は、探索ネットワーク３０１の部分ネットワークに対して隣接する類似度の大きいノードをマージすることにより部分圧縮ネットワーク３０２を生成する。なお、探索ネットワーク３０４における部分ネットワークの選択方法は、第２の実施形態と同様である。また、部分圧縮ネットワーク３０２の生成方法も第２の実施形態と同様である。 In step S 3121, the compressed network generation unit 312 generates the partial compressed network 302 by merging adjacent nodes with high similarity to the partial network of the search network 301. Note that the method for selecting a partial network in the search network 304 is the same as in the second embodiment. The method for generating the partial compression network 302 is the same as in the second embodiment.

ステップＳ３１１１において、特徴抽出部３１１は、入力音声から一定時間間隔のフレーム毎に特徴抽出を行い、音響特徴量を求める。 In step S3111, the feature extraction unit 311 extracts features from the input speech for each frame at fixed time intervals to obtain acoustic feature amounts.

ステップＳ３１３１において、探索部３１３は、特徴抽出部３１１で求められた音響特徴量を用い、まず探索ネットワーク１０１の探索部分に対応する部分圧縮ネットワーク１０２が存在するかを判定する。存在する場合はステップＳ３１３２に進み、存在しない場合はステップＳ３１３４に進む。 In step S3131, the search unit 313, using the acoustic feature amount obtained by the feature extraction unit 311, first determines whether a partial compression network 102 exists that corresponds to the portion of the search network 101 being searched. If it exists, the process proceeds to step S3132; if it does not, the process proceeds to step S3134.

ステップＳ３１３２において、探索部３１３は、部分圧縮ネットワーク１０２が存在するので、部分圧縮ネットワーク１０２上で探索し、ステップＳ３１３３において、枝狩りを行う。 In step S3132, because the partial compression network 102 exists, the search unit 313 performs a search on the partial compression network 102, and in step S3133 it performs pruning.

ステップＳ３１３４において、探索部３１３は、探索ネットワーク１０１上で探索を行う。この際、探索対象に含まれる探索ネットワーク１０１の部分ネットワークに対応する部分圧縮ネットワーク１０２が存在する場合には、部分圧縮ネットワーク１０２上で枝狩りされたノードに対応するノードを探索の対象から除外する。これにより探索にかかる計算量を削減する。 In step S3134, the search unit 313 performs a search on the search network 101. At this time, if a partial compression network 102 exists that corresponds to a partial network of the search network 101 included in the search target, the nodes corresponding to the nodes pruned on the partial compression network 102 are excluded from the search target. This reduces the amount of calculation required for the search.

ステップＳ３１３５において、探索部３１３は、探索ネットワーク１０１上で枝狩りを行う。 In step S3135, the search unit 313 performs pruning on the search network 101.

ステップＳ３１３６において、探索部３１３は、以上の各ステップを繰り返し（Ｎの場合）、入力音声の終端に到達すると認識結果を出力する（Ｙの場合）。 In step S3136, the search unit 313 repeats the above steps (N) until the end of the input speech is reached, at which point it outputs the recognition result (Y).

本実施形態の音声認識装置によれば、予め圧縮ネットワーク１０２を生成することなく、探索ネットワーク１０１の部分ネットワークに対応する部分圧縮ネットワーク１０２を生成する。そして、部分圧縮ネットワーク１０２で探索及び枝狩りを行い、次に、部分圧縮ネットワーク１０２で枝狩りされたノードに対応する探索ネットワーク１０１のノードを探索対象から除外する。これにより、計算量を削減できる。 According to the speech recognition apparatus of this embodiment, the partial compression network 102 corresponding to a partial network of the search network 101 is generated without generating the compression network 102 in advance. Search and pruning are then performed on the partial compression network 102, and the nodes of the search network 101 corresponding to the nodes pruned on the partial compression network 102 are excluded from the search targets. The amount of calculation can thereby be reduced.
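The per-frame flow of steps S3131 to S3135 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the beam-width pruning rule, the `members` mapping from merged nodes to original nodes, the toy scores, and the names `prune`, `two_stage_step`, and `score_fn` are all hypothetical.

```python
def prune(scores, beam):
    """Return the nodes whose score lies within `beam` of the best score."""
    if not scores:
        return set()
    best = max(scores.values())
    return {node for node, score in scores.items() if score >= best - beam}


def two_stage_step(compressed_scores, members, score_fn, beam):
    """Process one frame: prune on the compressed network, then search
    only the original nodes that belong to a surviving merged node."""
    # Search and pruning on the partial compression network.
    surviving = prune(compressed_scores, beam)
    # Original nodes under a pruned merged node are never scored.
    allowed = [n for merged in sorted(surviving) for n in members[merged]]
    scores = {n: score_fn(n) for n in allowed}
    # Pruning on the original search network.
    return prune(scores, beam)


# Toy data (hypothetical log-likelihood scores for one frame).
compressed_scores = {"A": -1.0, "B": -9.0}
members = {"A": ["a1", "a2"], "B": ["b1", "b2", "b3"]}
frame_scores = {"a1": -1.2, "a2": -4.5, "b1": -8.0, "b2": -9.5, "b3": -9.9}

evaluated = []


def score_fn(node):
    evaluated.append(node)  # record which original nodes were actually scored
    return frame_scores[node]


survivors = two_stage_step(compressed_scores, members, score_fn, beam=3.0)
print(survivors, evaluated)
```

In this toy case merged node B falls outside the beam, so only two of the five original nodes are scored, which is where the reduction in calculation comes from.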

（第４の実施形態） (Fourth embodiment)
第４の実施形態の音声認識装置について図９と図１０を参照して説明する。この音声認識装置は、連続音声認識に適用したものである。 A speech recognition apparatus according to a fourth embodiment will be described with reference to FIGS. 9 and 10. This speech recognition apparatus is applied to continuous speech recognition.

本実施形態の音声認識装置の構成について図９を参照して説明する。図９は、本実施形態に係る音声認識装置を示すブロック図である。 The configuration of the speech recognition apparatus of this embodiment will be described with reference to FIG. 9. FIG. 9 is a block diagram showing the speech recognition apparatus according to this embodiment.

音声認識装置は、特徴抽出部４１１、音響モデル４０１、単語辞書４０２、言語モデル４０３、探索ネットワーク生成部４１２、圧縮ネットワーク生成部４１３、探索部４１４とを備えている。 The speech recognition apparatus includes a feature extraction unit 411, an acoustic model 401, a word dictionary 402, a language model 403, a search network generation unit 412, a compression network generation unit 413, and a search unit 414.

特徴抽出部４１１は、入力音声からフレーム毎に音響特徴量を抽出する。 The feature extraction unit 411 extracts an acoustic feature amount for each frame from the input speech.

探索ネットワーク生成部４１２は、音響モデル４０１、単語辞書４０２及び言語モデル４０３から単語仮説を展開して探索ネットワーク１０１を生成する。 The search network generation unit 412 generates a search network 101 by expanding word hypotheses from the acoustic model 401, the word dictionary 402, and the language model 403.

圧縮ネットワーク生成部４１３は、生成された探索ネットワーク１０１の部分ネットワークに対して、隣接する類似度の大きいノードをマージすることにより、部分圧縮ネットワーク１０２を生成する。 The compression network generation unit 413 generates a partial compression network 102 by merging adjacent, highly similar nodes in a partial network of the generated search network 101.

探索部４１４は、特徴抽出部４１１から音響特徴量が入力されると探索ネットワーク１０１と、部分圧縮ネットワーク１０２を用いて探索処理を行い、認識結果を出力する。 When an acoustic feature amount is input from the feature extraction unit 411, the search unit 414 performs search processing using the search network 101 and the partial compression network 102, and outputs a recognition result.

本実施形態に係る音声認識装置の動作について図９と図１０を参照して説明する。図１０は、本実施形態に係る音声認識装置の動作を示すフローチャートである。 The operation of the speech recognition apparatus according to this embodiment will be described with reference to FIGS. 9 and 10. FIG. 10 is a flowchart showing the operation of the speech recognition apparatus according to this embodiment.

ステップＳ４１１１において、特徴抽出部４１１は、入力音声から一定時間間隔のフレーム毎に特徴抽出を行い、音響特徴量を求める。 In step S4111, the feature extraction unit 411 extracts features from the input speech for each frame at fixed time intervals to obtain acoustic feature amounts.

ステップＳ４１２１において、探索ネットワーク生成部４１２は、音響モデル４０１、単語辞書４０２及び言語モデル４０３を用いて探索の途中結果に従って単語仮説を展開し、探索ネットワーク４０４を生成する。 In step S4121, the search network generation unit 412 uses the acoustic model 401, the word dictionary 402, and the language model 403 to expand word hypotheses according to the intermediate results of the search, and generates a search network 404.

ステップＳ４１３１において、圧縮ネットワーク生成部４１３は、上記生成された探索ネットワーク４０４の部分ネットワークに対して、隣接する類似度の大きいノードをマージすることにより、部分圧縮ネットワーク４０５を生成する。なお、探索ネットワーク４０４における部分ネットワークの選択方法は、第２の実施形態と同様である。また、部分圧縮ネットワーク４０５の生成方法も第２の実施形態と同様である。 In step S4131, the compression network generation unit 413 generates a partial compression network 405 by merging adjacent, highly similar nodes in a partial network of the generated search network 404. The method for selecting the partial network in the search network 404 is the same as in the second embodiment, as is the method for generating the partial compression network 405.
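Merging the child nodes of one parent can be sketched as below. The selection and generation methods of the second embodiment are not reproduced in this section, so the sketch rests on assumptions: it uses the branch-count criterion mentioned in the claims (compress only when the number of branches from one parent exceeds a threshold), and the greedy grouping, the vowel/consonant similarity test, and the names `merge_children`, `similar`, and `min_branches` are hypothetical.

```python
VOWELS = set("aiueo")


def similar(a, b):
    """Toy similarity test (assumption): both vowels or both consonants."""
    return (a in VOWELS) == (b in VOWELS)


def merge_children(children, similar, min_branches):
    """Greedily group the children of one parent node.

    Returns a list of groups (each a list of original child nodes), or
    None when the parent has too few branches to be worth compressing.
    """
    if len(children) <= min_branches:
        return None
    groups = []
    for child in children:
        for group in groups:
            # Compare against the group's first member (its representative).
            if similar(group[0], child):
                group.append(child)
                break
        else:
            groups.append([child])
    return groups


children = ["a", "i", "u", "k", "s", "t"]
groups = merge_children(children, similar, min_branches=4)
compression_ratio = len(children) / len(groups)
print(groups, compression_ratio)
```

Each returned group stands for one merged node of the partial compression network, and the group's members are the original nodes to exclude when that merged node is pruned.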

以下のステップＳ４１４１〜Ｓ４１４６は、第２の実施形態における図６のステップＳ２１２１〜Ｓ２１２６と同様である。 The following steps S4141 to S4146 are the same as steps S2121 to S2126 of FIG. 6 in the second embodiment.

従来の連続音声認識においては、単語境界で多数の単語仮説が展開され、分岐数が非常に大きい部分ネットワークを含む探索ネットワークが作成されるため、多数のノードの探索が必要となる。 In conventional continuous speech recognition, a large number of word hypotheses are expanded at word boundaries, creating a search network that contains partial networks with a very large number of branches, so a large number of nodes must be searched.

しかし、このような場合にも、本実施形態の音声認識装置によれば、生成された探索ネットワーク４０４の部分ネットワークに対応する部分圧縮ネットワーク４０５を生成して探索及び枝狩りを行い、部分圧縮ネットワーク４０５で枝狩りされたノードに対応する探索ネットワーク４０４のノードを探索対象から除外することで、計算量を削減できる。 Even in such a case, however, the speech recognition apparatus of this embodiment generates the partial compression network 405 corresponding to a partial network of the generated search network 404, performs search and pruning on it, and excludes from the search targets the nodes of the search network 404 that correspond to the nodes pruned on the partial compression network 405, so that the amount of calculation can be reduced.

（変更例） (Modifications)
なお、本発明は上記実施形態そのままに限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で構成要素を変形して具体化できる。また、上記実施形態に開示されている複数の構成要素の適宜な組み合わせにより、種々の発明を形成できる。例えば、実施形態に示される全構成要素から幾つかの構成要素を削除してもよい。さらに、異なる実施形態にわたる構成要素を適宜組み合わせてもよい。 The present invention is not limited to the above embodiments as they are, and at the implementation stage it can be embodied by modifying the constituent elements without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the above embodiments. For example, some constituent elements may be deleted from all the constituent elements shown in an embodiment. Furthermore, constituent elements of different embodiments may be combined as appropriate.

例えば、上記各実施形態では、圧縮ネットワーク１０２を作成するために、マージする隣接するノードとは、共通の親ノードを持つ子ノードと定義していたが、これに代えて、親ノードとそれに繋がる子ノードをマージしてもよい。 For example, in each of the above embodiments, the adjacent nodes to be merged in order to create the compression network 102 were defined as child nodes having a common parent node. Instead, a parent node and a child node connected to it may be merged.

これにより、「共通の親ノードを持つ子ノード」が、空間（語彙）方向に圧縮するのに対し、親ノードと子ノードとをマージすると時間方向に圧縮することができる。例えば、「おじさん／ｏ−ｊ−ｉ−ｓ−ａ−ｎ」と「おじいさん／ｏ−ｊ−ｉ−ｉ−ｓ−ａ−ｎ」のように、母音の長さが異なる単語同士を一つにまとめることができる。 Whereas merging "child nodes having a common parent node" compresses the network in the spatial (vocabulary) direction, merging a parent node with its child node compresses it in the time direction. For example, words that differ only in vowel length, such as "ojisan (uncle) / o-j-i-s-a-n" and "ojiisan (grandfather) / o-j-i-i-s-a-n", can be combined into one.
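The time-direction compression in this example can be illustrated with a short sketch. It assumes, purely for illustration, that a parent node and its child are merged when they carry the same phoneme label. The text does not state the merge criterion, and the function name `merge_parent_child` is hypothetical.

```python
from itertools import groupby


def merge_parent_child(phonemes):
    """Collapse each run of identical consecutive phonemes into one node."""
    return [label for label, _ in groupby(phonemes)]


ojisan = "o-j-i-s-a-n".split("-")     # uncle
ojiisan = "o-j-i-i-s-a-n".split("-")  # grandfather

merged_ojisan = merge_parent_child(ojisan)
merged_ojiisan = merge_parent_child(ojiisan)
print(merged_ojisan == merged_ojiisan)
```

Under this assumption both words map to the same merged path, so a single path in the compression network stands for both of them during the first pruning pass.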

本発明の第１の実施形態に係る音声認識装置の構成を示すブロック図である。 FIG. 1 is a block diagram showing the configuration of the speech recognition apparatus according to the first embodiment of the present invention.
第１の実施形態の動作を示すフローチャートである。 FIG. 2 is a flowchart showing the operation of the first embodiment.
探索ネットワークの例である。 FIG. 3 is an example of a search network.
図３の探索ネットワークから生成された圧縮ネットワーク１０２の例である。 FIG. 4 is an example of the compression network 102 generated from the search network of FIG. 3.
第２の実施形態に係る音声認識装置の構成を示すブロック図である。 FIG. 5 is a block diagram showing the configuration of the speech recognition apparatus according to the second embodiment.
第２の実施形態の動作を示すフローチャートである。 FIG. 6 is a flowchart showing the operation of the second embodiment.
第３の実施形態に係る音声認識装置の構成を示すブロック図である。 FIG. 7 is a block diagram showing the configuration of the speech recognition apparatus according to the third embodiment.
第３の実施形態の動作を示すフローチャートである。 FIG. 8 is a flowchart showing the operation of the third embodiment.
第４の実施形態に係る音声認識装置の構成を示すブロック図である。 FIG. 9 is a block diagram showing the configuration of the speech recognition apparatus according to the fourth embodiment.
第４の実施形態の動作を示すフローチャートである。 FIG. 10 is a flowchart showing the operation of the fourth embodiment.
第１の実施形態の探索ネットワークの説明図である。 FIG. 11 is an explanatory diagram of the search network of the first embodiment.

１０１・・・探索ネットワーク 101 ... Search network
１０２・・・圧縮ネットワーク 102 ... Compression network
１１１・・・特徴抽出部 111 ... Feature extraction unit
１１２・・・探索部 112 ... Search unit

Claims

A speech recognition apparatus comprising:
a feature extraction unit that extracts an acoustic feature amount for each frame from input speech; and
a search unit that performs speech recognition by performing search and pruning, using the acoustic feature amount, on at least one compression network generated by merging a plurality of adjacent nodes in a search network, excluding from the search targets the nodes of the search network that correspond to the pruned nodes of the compression network, and performing search and pruning up to the end of the input speech.

The speech recognition apparatus according to claim 1, wherein the compression network is generated from a partial network that is a part of the search network.

The speech recognition apparatus according to claim 1, wherein the compression network is a network generated by merging a portion of the search network in which the number of branches from one parent node is larger than an arbitrary number.

The speech recognition apparatus according to claim 1, wherein the compression network is a network generated by merging child nodes of the search network and has a compression ratio larger than an arbitrary value.

The speech recognition apparatus according to claim 1, wherein the plurality of adjacent nodes to be merged are a parent node in the search network and a child node connected to the parent node.

A speech recognition method comprising:
a feature extraction step in which a feature extraction unit extracts an acoustic feature amount for each frame from input speech; and
a search step in which a search unit performs speech recognition by performing search and pruning, using the acoustic feature amount, on at least one compression network generated by merging a plurality of adjacent nodes in a search network, excluding from the search targets the nodes of the search network that correspond to the pruned nodes of the compression network, and performing search and pruning up to the end of the input speech.

A speech recognition program for causing a computer to realize:
a feature extraction function that extracts an acoustic feature amount for each frame from input speech; and
a search function that performs speech recognition by performing search and pruning, using the acoustic feature amount, on at least one compression network generated by merging a plurality of adjacent nodes in a search network, excluding from the search targets the nodes of the search network that correspond to the pruned nodes of the compression network, and performing search and pruning up to the end of the input speech.