JP2003140685A

JP2003140685A - Continuous voice recognition device and its program

Info

Publication number: JP2003140685A
Application number: JP2001332825A
Authority: JP
Inventors: Toru Imai; 亨今井
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2001-10-30
Filing date: 2001-10-30
Publication date: 2003-05-16
Anticipated expiration: 2021-10-30
Also published as: JP3813491B2

Abstract

PROBLEM TO BE SOLVED: To provide a continuous speech recognition device, and a program therefor, that determines recognition results at an early stage with good real-time response, achieves high recognition accuracy, and imposes a light computational load, even when detailed acoustic and language models are used for speech recognition. SOLUTION: The continuous speech recognition device is equipped with means 21 and 31 for storing a simple acoustic model, a simple language model, a detailed acoustic model, and a detailed language model; a first pass processing means 20 for performing a forward search on continuous speech using the simple acoustic model and the simple language model and generating a word start list consisting of each candidate word and information on its start time; and a second pass processing means 30 for performing a forward search on the continuous speech, based on the start time information and within the range of the words included in the word start list, using the detailed acoustic model and the detailed language model, and generating a word string corresponding to the continuous speech.

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a continuous speech recognition device that recognizes continuously uttered speech and generates the word string represented by the uttered continuous speech, and to a program therefor.

[0002]

[Description of the Related Art] Conventionally, the following two methods have been known as continuous speech recognition methods for recognizing continuously uttered speech and generating the word string represented by that speech. The first method, as disclosed in Imai et al., "Early determination of speech recognition results by successive comparison of the most likely word sequences", IEICE Transactions, Vol. J84-D-II, No. 9, pp. 1942-1949 (2000), performs the following processing in two passes.

[0003] In the first pass, a search from the beginning of the sentence toward its end (hereinafter referred to as a forward search) is performed using a detailed acoustic model and a simple language model to obtain a plurality of word strings as recognition candidates. Then, in the second pass, the scores of the word strings obtained in the first pass are updated using a detailed language model, and the word string with the highest score is adopted as the recognition result.
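
The second-pass rescoring of this first conventional method can be sketched as follows. This is only a minimal illustration under stated assumptions: the names `rescore_nbest` and `toy_detailed_lm`, the log-domain scores, and the toy data are all hypothetical and are not taken from the cited paper.

```python
from typing import Callable, List, Optional, Tuple

# One first-pass hypothesis: (word string, acoustic score, simple-LM score).
# All scores are assumed to be log-domain, so larger is better.
Hypothesis = Tuple[Tuple[str, ...], float, float]

def rescore_nbest(
    nbest: List[Hypothesis],
    detailed_lm_score: Callable[[Tuple[str, ...]], float],
) -> Optional[Tuple[str, ...]]:
    """Swap the simple-LM score for the detailed-LM score and return the best word string."""
    best_words, best_total = None, float("-inf")
    for words, acoustic, _simple_lm in nbest:
        total = acoustic + detailed_lm_score(words)
        if total > best_total:
            best_words, best_total = words, total
    return best_words

def toy_detailed_lm(words: Tuple[str, ...]) -> float:
    """Hypothetical detailed language model that strongly prefers one word string."""
    return -1.0 if words == ("recognize", "speech") else -5.0

nbest = [
    (("wreck", "a", "nice", "beach"), -10.0, -2.0),
    (("recognize", "speech"), -11.0, -4.0),
]
best = rescore_nbest(nbest, toy_detailed_lm)
```

In this sketch the winner can only be one of the word strings already present in the first-pass list, which is the limitation discussed in paragraph [0006].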

[0004] The second method performs the following processing, as disclosed in Long Nguyen et al., "Efficient 2-pass N-best decoder", Proceedings of the DARPA Speech Recognition Workshop, pp. 100-103 (1997).

[0005] First, in the first pass, a forward search is performed using a simple acoustic model and a simple language model to create a list of candidate words and their end times. Then, under the constraints of this word end list, the second pass performs a search from the end of the sentence toward its beginning (hereinafter referred to as a backward search) using a detailed acoustic model and a detailed language model.

[0006]

[Problems to Be Solved by the Invention] However, the first conventional method uses a detailed acoustic model in the first pass, so the more detailed the acoustic model, the greater the amount of processing needed to narrow down the recognition candidates and the longer it takes to determine the recognition result. In addition, because the second pass only updates scores within the range of the word strings obtained in the first pass, the capability of the detailed language model cannot be fully exploited and high recognition accuracy cannot be obtained.

[0007] The second conventional method has the following problems. Because the second pass is a backward search, it requires, unlike an ordinary forward search, a detailed language model that runs from the end of the sentence toward its beginning. A word end found in the first pass tends to remain a word end candidate over a certain interval, which may increase the amount of processing in the second pass. Furthermore, when recognition results are to be determined successively without waiting for the end of the utterance, the backward second pass makes it impossible to apply the optimal early determination method, which exploits the uniqueness of candidate word strings from the beginning of the sentence; recognition accuracy therefore drops and the method is unsuitable for real-time processing.

[0008] The present invention has been made to solve these problems. Its object is to provide a continuous speech recognition device, and a program therefor, that offers excellent real-time performance by determining recognition results early, high recognition accuracy, and a light computational load, even when detailed acoustic and language models are used for speech recognition.

[0009]

[Means for Solving the Problems] In view of the above, the invention according to claim 1 is a continuous speech recognition device for recognizing uttered continuous speech and generating a word string corresponding to the continuous speech, comprising: means for storing a simple first acoustic model, a simple first language model, a second acoustic model more detailed than the first acoustic model, and a second language model more detailed than the first language model; first pass processing means for performing a forward search on the continuous speech using the simple first acoustic model and the simple first language model, and generating a word start list consisting of information on each word that has reached a word end as a candidate for generating the word string and information on the start time at which each candidate word was uttered; and second pass processing means for performing a forward search on the continuous speech using the detailed second acoustic model and the detailed second language model, within the range of the candidate words included in the word start list and based on the information on the start time at which each candidate word was uttered, and generating a word string corresponding to the continuous speech.

[0010] With this configuration, the words to be searched by the second pass processing means and their start times are restricted with high accuracy by the word start list. Moreover, because consecutive word ends are likely to share a common word start, the word start list is less redundant than a word end list. A continuous speech recognition device can therefore be realized that improves word recognition accuracy without increasing the overall amount of processing, even when more detailed acoustic and language models are used. In addition, because the second pass processing means performs a forward search from the beginning of the sentence toward its end, it can apply the optimal early determination method, which exploits the uniqueness of candidate word strings from the beginning of the sentence and in principle does not reduce recognition accuracy, and it is therefore suited to real-time processing.
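
The redundancy argument can be illustrated with a small sketch. It assumes a hypothetical record type `WordEnd` and invented frame indices; the patent does not specify how the first pass represents its events.

```python
from collections import namedtuple

# One first-pass event: `word` reached a word end at frame `end`,
# having started at frame `start`. Frame indices are illustrative only.
WordEnd = namedtuple("WordEnd", ["word", "start", "end"])

def build_word_start_list(word_end_events):
    """Collapse word-end events into unique (word, start) pairs, sorted by start time, then word."""
    unique = {(e.word, e.start) for e in word_end_events}
    return sorted(unique, key=lambda ws: (ws[1], ws[0]))

def build_word_end_list(word_end_events):
    """Unique (word, end) pairs, sorted by end time, then word (the conventional constraint)."""
    unique = {(e.word, e.end) for e in word_end_events}
    return sorted(unique, key=lambda we: (we[1], we[0]))

# The same word often stays a word-end candidate over several consecutive frames.
events = [
    WordEnd("news", 10, 40),
    WordEnd("news", 10, 41),
    WordEnd("news", 10, 42),
    WordEnd("noon", 10, 41),
    WordEnd("today", 42, 80),
    WordEnd("today", 42, 81),
]
word_start_list = build_word_start_list(events)
word_end_list = build_word_end_list(events)
```

With this toy input the word end list holds six entries while the word start list holds three, because consecutive word ends of the same word share one start time.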

[0011] The invention according to claim 2 is the device of claim 1, wherein the first pass processing means further registers in the word start list information on words that have reached the vicinity of a word end during the forward search by the first pass processing means, together with information on the start times at which those words were uttered. With this configuration, the forward search by the second pass processing means and the subsequent processing are applied not only to words that have reached a word end but also to words that have reached the vicinity of a word end, so that a continuous speech recognition device capable of more accurate speech recognition can be realized.

[0012] The invention according to claim 3 is the device of claim 1, wherein the first pass processing means further registers in the word start list the word average score of each word included in the word start list, and the second pass processing means further limits the candidate words to those whose word average score is equal to or greater than a predetermined value and generates the word string corresponding to the continuous speech from the words so limited. With this configuration, the words processed by the second pass processing means are limited, so that a continuous speech recognition device with a reduced processing load can be realized.
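
A minimal sketch of this limiting step follows. The dictionary layout, the field name `avg_score`, and the threshold value are assumptions made for illustration; this section does not define how the word average score is computed.

```python
def filter_by_average_score(word_start_list, threshold):
    """Keep only the entries whose word average score is at or above the threshold."""
    return [entry for entry in word_start_list if entry["avg_score"] >= threshold]

# Hypothetical word start list with log-domain average scores (larger is better).
word_start_list = [
    {"word": "news", "start": 10, "avg_score": -3.2},
    {"word": "noon", "start": 10, "avg_score": -7.9},
    {"word": "today", "start": 42, "avg_score": -4.0},
]
limited = filter_by_average_score(word_start_list, threshold=-4.0)
```

Only the entries for "news" and "today" survive here, so the second pass would search two candidate words instead of three.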

[0013] The invention according to claim 4 is the device of claim 1, wherein the second pass processing means further performs its forward search using, as a start time, a predetermined time within a certain range before and after the start time at which each candidate word was uttered. With this configuration, a predetermined time within a certain range before and after the start time of each candidate word is added as a start time for the forward search by the second pass processing means, so that a continuous speech recognition device capable of more accurate speech recognition can be realized.
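
One way to picture this widening of start times is sketched below. The function name `expand_start_times`, the frame-based time unit, and the choice to add every frame inside the window (rather than selected times) are assumptions, not details from the patent.

```python
def expand_start_times(word_start_list, margin, num_frames):
    """For each (word, start), also allow starts within +/- margin frames, clipped to the utterance."""
    expanded = set()
    for word, start in word_start_list:
        first = max(0, start - margin)
        last = min(num_frames - 1, start + margin)
        for t in range(first, last + 1):
            expanded.add((word, t))
    return sorted(expanded, key=lambda ws: (ws[1], ws[0]))

widened = expand_start_times([("news", 10)], margin=1, num_frames=100)
clipped = expand_start_times([("a", 0)], margin=2, num_frames=2)
```

A larger margin tolerates timing errors in the first pass at the cost of more start hypotheses for the second pass to evaluate.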

[0014] The invention according to claim 5 is the device of claim 1, wherein the second pass processing means performs its forward search and the processing for generating the word string corresponding to the continuous speech as soon as the forward search by the first pass processing means has generated information on a candidate word and its start time, even before the first pass processing means has finished generating the word start list. With this configuration, the forward search by the second pass processing means proceeds as soon as the forward search by the first pass processing means has generated information on a candidate word and its start time, so that an early determination method that does not reduce recognition accuracy can be applied and a continuous speech recognition device suited to real-time processing can be realized.
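
The interleaving of the two passes can be sketched with Python generators. Both functions are hypothetical stand-ins: the real passes perform acoustic and language scoring, whereas here they only record the order in which work happens.

```python
def first_pass(frames, log):
    """Stand-in first pass: yields each (word, start) entry as soon as it is produced."""
    for t, finished_entries in enumerate(frames):
        log.append(("pass1", t))
        for entry in finished_entries:
            yield entry

def second_pass(entries, log):
    """Stand-in second pass: consumes entries one at a time, without waiting for the full list."""
    for word, start in entries:
        log.append(("pass2", word, start))
        yield word

# frames[t] holds the word start list entries that the first pass emits at frame t.
frames = [[], [("news", 0)], [], [("today", 2)]]
log = []
result = list(second_pass(first_pass(frames, log), log))
```

The log shows second-pass work on "news" happening before the first pass reaches frame 2. Materializing the first pass with `list(...)` before starting the second pass would instead give the batch ordering described for claim 6.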

[0015] The invention according to claim 6 is the device of claim 1, wherein the second pass processing means starts its forward search, and performs the processing for generating the word string corresponding to the continuous speech, after the first pass processing means has finished generating the word start list. With this configuration, a continuous speech recognition device can be realized that has a light computational load and improves word recognition accuracy even when real-time processing is not required.

[0016] The invention according to claim 7 is, with reference to claim 1, a program that causes a computer to execute processing for recognizing uttered continuous speech and generating a word string corresponding to the continuous speech, the program causing the computer to execute: a step of storing a simple first acoustic model, a simple first language model, a second acoustic model more detailed than the first acoustic model, and a second language model more detailed than the first language model; a first pass processing step of performing a forward search on the continuous speech using the simple first acoustic model and the simple first language model, and generating a word start list consisting of information on each candidate word for generating the word string and information on the start time at which each candidate word was uttered; and a second pass processing step of performing a forward search on the continuous speech using the detailed second acoustic model and the detailed second language model, within the range of the candidate words included in the word start list and based on the information on the start time at which each candidate word was uttered, and generating a word string corresponding to the continuous speech.

[0017] With this configuration, the words to be searched in the second pass processing step and their start times are restricted with high accuracy by the word start list. Moreover, because consecutive word ends are likely to share a common word start, the word start list is less redundant than a word end list. A continuous speech recognition program can therefore be realized that improves word recognition accuracy without increasing the overall amount of processing, even when more detailed acoustic and language models are used. In addition, because the second pass processing step performs a forward search from the beginning of the sentence toward its end, the optimal early determination method, which exploits the uniqueness of candidate word strings from the beginning of the sentence and in principle does not reduce recognition accuracy, can be applied, making the program suited to real-time processing.

[0018]

[Embodiments of the Invention] A continuous speech recognition device according to a first embodiment of the present invention is described below with reference to the accompanying drawings. FIG. 1 is a block diagram showing the schematic configuration of a continuous speech recognition device 100 according to the first embodiment of the present invention. The continuous speech recognition device 100 comprises an acoustic analysis unit 10 that acoustically analyzes the input speech and generates an acoustic analysis result, a first pass processing unit 20 that generates a word start list from the acoustic analysis result, and a second pass processing unit 30 that generates a recognized word string using the acoustic analysis result and the word start list.

[0019] The first pass processing unit 20 further comprises a pronunciation dictionary / simple model storage unit 21, a tree-structured phoneme network generation unit (hereinafter referred to as the tree-structured phoneme NW generation unit) 22, an acoustic score calculation unit 23, a language score calculation unit 24, and a first forward search unit 25. The pronunciation dictionary / simple model storage unit 21 stores the pronunciation dictionary, the simple acoustic model, the simple language model, and other data used for speech recognition in the first pass. Here, "simple" means, needless to say, that the model is small in scale; for example, a model with a small number of states.

[0020] The tree-structured phoneme NW generation unit 22 takes as input the acoustic analysis result of the input speech output from the acoustic analysis unit 10, together with the pronunciation dictionary, the simple acoustic model, and other data stored in the pronunciation dictionary / simple model storage unit 21. It generates a phoneme network with a tree structure (hereinafter referred to as a tree-structured phoneme network) according to the acoustic analysis result of the input speech, and outputs the generated tree-structured phoneme network to the acoustic score calculation unit 23, the language score calculation unit 24, and the first forward search unit 25.
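
The idea of a tree-structured phoneme network can be sketched as a prefix tree over a pronunciation dictionary. This is a simplification under stated assumptions: it uses only the dictionary (not the acoustic analysis result or the acoustic model), it ignores triphone context, and the phoneme symbols and function names are invented for the example.

```python
def build_phoneme_tree(pronunciations):
    """Build a prefix tree in which words with the same initial phonemes share nodes."""
    root = {"children": {}, "words": []}
    for word, phonemes in pronunciations.items():
        node = root
        for phoneme in phonemes:
            node = node["children"].setdefault(phoneme, {"children": {}, "words": []})
        node["words"].append(word)  # the word identity becomes known at its last phoneme
    return root

def count_nodes(node):
    """Count this node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node["children"].values())

# Hypothetical pronunciation dictionary.
pronunciations = {
    "news": ["n", "u", "z"],
    "noon": ["n", "u", "n"],
    "now": ["n", "a", "w"],
}
tree = build_phoneme_tree(pronunciations)
phoneme_nodes_in_tree = count_nodes(tree) - 1  # exclude the root
phoneme_nodes_if_linear = sum(len(p) for p in pronunciations.values())
```

Sharing prefixes reduces the number of phoneme nodes from nine to six in this toy dictionary, which is why a tree is attractive when the whole vocabulary is active in the first pass.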

[0021] The acoustic score calculation unit 23 takes as input the acoustic analysis result of the input speech, the simple acoustic model, and the tree-structured phoneme network, calculates the acoustic score for the acoustic analysis result using the simple acoustic model and the tree-structured phoneme network, and outputs it to the first forward search unit 25. As the simple acoustic model, for example, a triphone hidden Markov model (hereinafter referred to as HMM) with a small number of states can be used. The method of calculating the acoustic score is well known and its description is omitted.

[0022] The language score calculation unit 24 takes as input the simple language model and the tree-structured phoneme network, calculates the language score for the active nodes on the tree-structured phoneme network using the simple language model, and outputs it to the first forward search unit 25. As the simple language model, for example, a word bigram can be used. The method of calculating the language score is well known and its description is omitted.
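
As a reminder of what a word bigram score looks like, here is a small sketch. The patent leaves the calculation method unspecified, so the add-one smoothing, the `<s>` sentence-start token, the function names, and the toy corpus are all assumptions.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their histories over tokenized sentences, with a sentence-start token."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence)
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return history_counts, bigram_counts

def bigram_log_prob(prev, word, history_counts, bigram_counts, vocab_size):
    """Add-one smoothed log P(word | prev)."""
    numerator = bigram_counts[(prev, word)] + 1
    denominator = history_counts[prev] + vocab_size
    return math.log(numerator / denominator)

corpus = [["news", "today"], ["news", "tonight"]]
vocab = {w for sentence in corpus for w in sentence}
history_counts, bigram_counts = train_bigram(corpus)
score = bigram_log_prob("news", "today", history_counts, bigram_counts, len(vocab))
```

A bigram conditions on only one preceding word, which keeps the first-pass language score cheap compared with the trigram used in the second pass.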

[0023] The first forward search unit 25 takes as input the tree-structured phoneme network, the acoustic score, and the language score. Using the acoustic and language scores, it propagates the active nodes forward on the tree-structured phoneme network, creates a word start list, which is a list of the words that survived to a word end without being pruned together with their start times, and outputs the created word start list to the second pass processing unit 30. Here, "forward" refers to the direction from the beginning of the sentence to its end, and a search in the forward direction is hereinafter called a "forward search".

[0024] The tree-structured phoneme network may be static, in which a single tree-structured phoneme network is used in a loop, or dynamic, in which a network obtained by connecting multiple tree-structured phoneme networks is used. To make the word start list accurate, the first forward search unit 25 performs a word-pair approximation search that depends on the immediately preceding word.

[0025] The second pass processing unit 30, in turn, further comprises a pronunciation dictionary / detailed model storage unit 31, a linear-structured phoneme network generation unit (hereinafter referred to as the linear-structured phoneme NW generation unit) 32, an acoustic score calculation unit 33, a language score calculation unit 34, and a second forward search unit 35. The pronunciation dictionary / detailed model storage unit 31 stores the pronunciation dictionary, the detailed acoustic model, the detailed language model, and other data used for speech recognition in the second pass. Here, "detailed" means, needless to say, that the model is larger than a certain scale; for example, a model with a large number of states.

[0026] The linear-structured phoneme NW generation unit 32 takes as input the acoustic analysis result of the input speech output from the acoustic analysis unit 10, together with the pronunciation dictionary, the detailed acoustic model, and other data stored in the pronunciation dictionary / detailed model storage unit 31. It generates a phoneme network with a linear structure (hereinafter referred to as a linear-structured phoneme network) according to the acoustic analysis result of the input speech, and outputs the generated linear-structured phoneme network to the acoustic score calculation unit 33, the language score calculation unit 34, and the second forward search unit 35.

[0027] The acoustic score calculation unit 33 takes as input the acoustic analysis result of the input speech, the detailed acoustic model, and the linear-structured phoneme network, calculates the acoustic score for the acoustic analysis result using the detailed acoustic model and the linear-structured phoneme network, and outputs it to the second forward search unit 35. As the detailed acoustic model, for example, a triphone HMM with a large number of states can be used. The method of calculating the acoustic score is well known, as with the calculation by the acoustic score calculation unit 23, and its description is omitted.

[0028] The language score calculation unit 34 takes as input the detailed language model and the linear-structured phoneme network, calculates the language score for the active word-initial nodes on the linear-structured phoneme network using the detailed language model, and outputs it to the second forward search unit 35. As the detailed language model, for example, a word trigram can be used. The method of calculating the language score is well known, as with the calculation by the language score calculation unit 24, and its description is omitted.

[0029] The second forward search unit 35 takes as input the linear-structured phoneme network, the acoustic score output from the acoustic score calculation unit 33 (hereinafter referred to as the second acoustic score), the language score output from the language score calculation unit 34 (hereinafter referred to as the second language score), and the word start list output from the first forward search unit 25. It determines the recognized word string and outputs it to the outside of the continuous speech recognition device 100.

[0030] In doing so, the second forward search unit 35 propagates the active nodes forward on the linear-structured phoneme network, restricted to the words and start times included in the word start list, and, using the second acoustic score and the second language score, can perform early determination that exploits the uniqueness of word string candidates from the beginning of the sentence without waiting for the end of the utterance. The processing of the second forward search unit 35 can, of course, also be performed after the utterance has ended.
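
The notion of uniqueness from the beginning of the sentence can be pictured with a simplified sketch: words shared as a prefix by every surviving hypothesis can no longer change and may be output immediately. This is an illustrative simplification, not the early determination method of the cited Imai et al. paper, and the function names are hypothetical.

```python
def common_prefix(hypotheses):
    """Longest run of words, from the sentence start, shared by all active hypotheses."""
    if not hypotheses:
        return []
    prefix = []
    for words_at_position in zip(*hypotheses):
        if all(w == words_at_position[0] for w in words_at_position):
            prefix.append(words_at_position[0])
        else:
            break
    return prefix

def newly_confirmed(hypotheses, already_confirmed):
    """Words that can be output now, given how many words were output earlier."""
    return common_prefix(hypotheses)[already_confirmed:]

active = [
    ["the", "news", "today"],
    ["the", "news", "to"],
    ["the", "news", "two", "day"],
]
confirmed_now = newly_confirmed(active, already_confirmed=1)
```

Because a forward search grows hypotheses from the sentence start, such a shared prefix exists while the speaker is still talking; a backward second pass offers no comparable prefix until the utterance ends.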

[0031] When a word trigram is used as the detailed language model, the second forward search unit 35 can perform a 1-best search, which advances the search while keeping one optimal word history for each immediately preceding word. The linear-structured phoneme network is used because the words to be activated differ at each time according to the word start list, so there is no need for multiple words to share a single phoneme node.
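
Keeping one word history per immediately preceding word can be sketched as a recombination step. The function name, the tuple layout, and the scores are assumptions; a real decoder would apply this at every word boundary rather than to complete lists.

```python
def keep_best_per_last_word(hypotheses):
    """Among hypotheses ending in the same word, keep only the one with the highest score."""
    best = {}
    for words, score in hypotheses:
        last_word = words[-1]
        if last_word not in best or score > best[last_word][1]:
            best[last_word] = (words, score)
    return best

# Hypothetical partial hypotheses with log-domain scores (larger is better).
hypotheses = [
    (("the", "news", "today"), -12.0),
    (("a", "noose", "today"), -15.5),
    (("the", "news", "two"), -14.0),
]
survivors = keep_best_per_last_word(hypotheses)
```

With a trigram model this is an approximation, since two histories ending in the same word may differ in the word before it, but it bounds the number of histories by the number of distinct last words.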

[0032] Besides the configuration described above, the following embodiments of the continuous speech recognition device 100 are also possible.
(1) An embodiment in which the first forward search unit 25 registers in the word start list not only words that have reached a word end but also words that have reached the vicinity of a word end, together with their start times, and the second forward search unit 35 applies the above processing to the additionally registered words as well.
(2) An embodiment in which the first forward search unit 25 additionally registers word average scores in the word start list, and the second forward search unit 35 prunes the list down to the words whose registered word average score exceeds a predetermined threshold before performing the above processing.
(3) An embodiment in which the second forward search unit 35 adds, as start times, predetermined times within a time range of fixed width before and after each start time registered in the word start list, widening the search range within which a word search may start.

[0033] Regarding the order in which the first pass processing unit 20 and the second pass processing unit 30 execute their processing, the following two embodiments are conceivable.
(1) An embodiment suited to real-time processing, in which the second pass processing unit 30 runs in parallel with the first pass processing unit 20 at a fixed delay and words are determined early without waiting for the end of the utterance. In this embodiment, even before the first pass processing unit 20 has finished generating the word start list, the second pass processing unit 30 performs its forward search and the processing for generating the word string corresponding to the continuous speech as soon as information on a given candidate word and its start time has been generated.

[0034] (2) An embodiment for cases where real-time processing is not required, in which processing in the second pass processing unit 30 starts after processing in the first pass processing unit 20 has finished, that is, after the end of the utterance. In this embodiment, the second pass processing unit 30 starts its forward search after the first pass processing unit 20 has generated the word start list, and then carries out the processing for generating the word string corresponding to the continuous speech.
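The two execution orders differ only in how far the second pass lags behind the first. The following is a minimal sketch, assuming both passes advance one frame per step and that the lag is expressed by a hypothetical `delay` parameter counted in frames; the source specifies only "a fixed delay" and does not say how it is realised.

```python
def schedule(num_frames, delay):
    """Yield (first_pass_frame, second_pass_frame) pairs, one per processing step.

    None means the corresponding pass is idle at that step.
    The second pass handles frame t exactly `delay` steps after the first pass.
    """
    for step in range(num_frames + delay):
        first = step if step < num_frames else None
        second = step - delay if step >= delay else None
        yield first, second


# Order (1): the second pass runs in parallel, one frame behind the first pass.
parallel = list(schedule(num_frames=3, delay=1))

# Order (2): the second pass starts only after the first pass has finished,
# which corresponds to a delay equal to the utterance length.
sequential = list(schedule(num_frames=2, delay=2))
```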

[0035] The processing in the continuous speech recognition device 100 according to the first embodiment of the present invention is described below with reference to the drawings. FIGS. 2 and 3 are flowcharts showing the flow of processing in the first pass processing unit 20 of the continuous speech recognition device 100 according to the first embodiment. The continuous speech recognition device according to the first embodiment can be implemented on a general-purpose computer having an interface, a control and arithmetic unit, and a storage device (none of which are shown). In that case, the pronunciation dictionary / simple model storage unit 21 and the pronunciation dictionary / detailed model storage unit 31 correspond to the storage device, while the remaining components of the first pass processing unit 20 and the second pass processing unit 30, as well as the acoustic analysis unit 10, correspond to the control and arithmetic unit.

[0036] In step S210, the first pass processing unit 20 performs initialization: it sets the processing time t of the input speech to 0, makes only the phoneme node corresponding to the sentence-start word <s> active, and sets that node's total score to 0. In step S220, the first pass processing unit 20 retrieves the acoustic analysis result 32 for the input speech at time t from the acoustic analysis unit 10.

[0037] In step S230, the first pass processing unit 20 selects one node from among all active nodes and designates it node n. In step S240, the acoustic score calculation unit 23 calculates the simple acoustic score of node n for the input speech at time t, and the first forward search unit 25 adds this simple acoustic score to the total score of the node n selected in step S230.

[0038] In step S250, the language score calculation unit 24 calculates the simple language score of node n, and the first forward search unit 25 updates the simple language score component of node n's total score each time a node transition occurs. The score is updated at every transition because the phoneme network is a tree-structured phoneme network, which is efficient to search but in which a single node is shared by multiple words.
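Why a shared node forces the language score to be revised on every transition can be seen in a small sketch. It rests on assumptions that the source does not state: the simple language model is taken to be a unigram log-probability table, and the score held at a shared node is taken to be the best score among the words still reachable through it (a common look-ahead choice). The lexicon, the scores, and the names `Node`, `build_tree`, and `language_scores_along` are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical pronunciation dictionary and unigram log-probabilities.
LEXICON = {
    "kita": ["k", "i", "t", "a"],
    "kiku": ["k", "i", "k", "u"],
    "sora": ["s", "o", "r", "a"],
}
UNIGRAM_LOGP = {"kita": -2.0, "kiku": -3.0, "sora": -1.5}


@dataclass
class Node:
    phoneme: str
    children: dict = field(default_factory=dict)
    words: list = field(default_factory=list)  # words whose pronunciation passes through this node
    lm_score: float = float("-inf")            # best language score among those words


def build_tree(lexicon, logp):
    """Build a prefix tree of phonemes; words with a common prefix share nodes."""
    root = Node("<root>")
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            node = node.children.setdefault(ph, Node(ph))
            node.words.append(word)
            node.lm_score = max(node.lm_score, logp[word])
    return root


def language_scores_along(root, phonemes):
    """Return the language score seen at each node along one pronunciation."""
    node, scores = root, []
    for ph in phonemes:
        node = node.children[ph]
        scores.append(node.lm_score)
    return scores


root = build_tree(LEXICON, UNIGRAM_LOGP)
scores = language_scores_along(root, LEXICON["kiku"])
print(scores)
```

While the path is still shared by "kita" and "kiku", the node carries the better of the two scores; once the path narrows to "kiku" alone, the score has to be replaced, which is the update the paragraph describes.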

[0039] In step S261, the first forward search unit 25 determines whether the total score of node n is less than or equal to the pruning threshold. If it is, the first forward search unit 25 deactivates node n in step S262, and the process proceeds to step S270.
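The pruning test can be sketched as follows. The source does not say how the pruning threshold is chosen; this sketch assumes a beam-style threshold set a fixed width below the best total score, and the name `prune` and the parameter `beam_width` are hypothetical.

```python
def prune(active_scores, beam_width):
    """Split active nodes into kept and deactivated ones.

    active_scores maps node id -> total score (larger is better).
    A node is deactivated when its score is less than or equal to the threshold.
    """
    if not active_scores:
        return {}, []
    threshold = max(active_scores.values()) - beam_width  # assumed beam-style threshold
    kept = {n: s for n, s in active_scores.items() if s > threshold}
    removed = sorted(n for n, s in active_scores.items() if s <= threshold)
    return kept, removed


kept, removed = prune({"n1": -10.0, "n2": -55.0, "n3": -30.0}, beam_width=40.0)
print(kept, removed)
```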

[0040] If it is determined in step S261 that the total score of node n exceeds the pruning threshold, the process moves to step S263. In step S263, the first forward search unit 25 determines whether node n is the end of a word w. If node n is determined to be the end of word w, the process proceeds to step S264. An embodiment in which the process proceeds to step S264 when node n is not the end of word w but has a sufficiently high score near the word end is also possible and is not excluded.

[0041] In step S264, the first forward search unit 25 adds the word w to which node n belongs, its start time, and its word average score to the word start list; if the same word is already registered with the same start time, the entry is updated to the larger of the two word average scores. In step S265, the first forward search unit 25 activates the head nodes of all succeeding words.
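The registration rule of step S264 can be sketched in a few lines. The data layout (a dictionary keyed by start time and word) and the name `register_word` are assumptions made for illustration, and the scores are made up.

```python
def register_word(word_start_list, word, start_time, avg_score):
    """Add (word, start_time) to the list, keeping the larger word average score."""
    key = (start_time, word)
    previous = word_start_list.get(key)
    if previous is None or avg_score > previous:
        word_start_list[key] = avg_score


word_start_list = {}
register_word(word_start_list, "w1", 4, -60.0)  # first registration
register_word(word_start_list, "w1", 4, -55.0)  # same word and start time, better score: updated
register_word(word_start_list, "w1", 4, -70.0)  # worse score: ignored
register_word(word_start_list, "w3", 5, -64.0)  # different entry
print(word_start_list)
```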

[0042] If it is determined in step S263 that node n is not a word end, the process moves to step S266, in which the first forward search unit 25 activates all nodes of the succeeding phonemes. When the processing in step S262, step S265, or step S266 has finished, the process proceeds to step S270.

[0043] In step S270, the first forward search unit 25 determines whether all active nodes have been processed. If they have, the process proceeds to step S280; if not, the process returns to step S230, selects the next active node, and repeats the above processing.

[0044] In step S280, the first forward search unit 25 determines whether processing has finished for all of the input speech. If it has, the first pass ends; if not, the process moves to step S290. In step S290, the first forward search unit 25 adds 1 to time t, after which the process returns to step S220 and the above processing is repeated for the input speech at time t + 1.

[0045] FIGS. 4 and 5 are flowcharts showing the flow of processing in the second pass processing unit 30 of the continuous speech recognition device 100 according to the first embodiment of the present invention. The processing in the second pass processing unit 30 is described below with reference to the drawings.

[0046] In step S410, the second pass processing unit 30 performs initialization: it sets the processing time t of the input speech to 0, makes only the phoneme node corresponding to the sentence-start word <s> active, and sets that node's total score to 0. In step S420, the second pass processing unit 30 retrieves the acoustic analysis result 32 for the input speech at time t from the acoustic analysis unit 10.

[0047] In step S430, the second pass processing unit 30 selects one node from among all active nodes and designates it node n. In step S440, the acoustic score calculation unit 33 calculates the detailed acoustic score of node n for the input speech at time t, and the second forward search unit 35 adds this detailed acoustic score to the total score of the node n selected in step S430.

[0048] In step S451, the second forward search unit 35 determines whether the total score of node n is less than or equal to the pruning threshold. If it is, the second forward search unit 35 deactivates node n in step S452, and the process proceeds to step S470.

[0049] If it is determined in step S451 that the total score of node n exceeds the pruning threshold, the process moves to step S453. An embodiment in which step S451 also uses the word average score registered in the word start list obtained by the first pass processing unit 20 when deciding whether to prune the head node of a word is also possible and is not excluded.

[0050] In step S453, the second forward search unit 35 determines whether node n is the end of a word w. If node n is determined to be the end of word w, the process proceeds to step S454; if not, the process moves to step S456.

[0051] In step S454, the second forward search unit 35 refers to the word start list and activates the head nodes of all words that can start at time t + 1. An embodiment that, in addition to the word start times registered in the word start list, allows a word search to start within a fixed margin before and after those times is of course also possible. In step S455, the second forward search unit 35 adds the detailed language score to the total score of each node activated in step S454.
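The lookup performed in step S454, including the optional margin, can be sketched as below. The word start list is represented here simply as a set of (start time, word) pairs taken from the example described for FIG. 7; the function name `startable_words` and the `margin` parameter are hypothetical.

```python
# (start_time, word) pairs corresponding to the FIG. 7 example.
WORD_START_LIST = {(0, "<s>"), (4, "w1"), (5, "w3"), (9, "w3"), (9, "w4")}


def startable_words(word_start_list, t, margin=0):
    """Return the words whose search may start at time t + 1.

    margin=0 follows the registered start times exactly; a positive margin also
    admits words registered up to `margin` frames before or after t + 1.
    """
    target = t + 1
    return sorted({w for (start, w) in word_start_list if abs(start - target) <= margin})


print(startable_words(WORD_START_LIST, t=3))            # exact start times only
print(startable_words(WORD_START_LIST, t=3, margin=1))  # one frame of slack on either side
print(startable_words(WORD_START_LIST, t=8))
```

With a margin of 1, w1 (registered at time 4) may also start at times 3 and 5, which matches the variant described for FIG. 8.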

[0052] In step S456, the second forward search unit 35 activates all nodes of the succeeding phonemes. When the processing in step S452, step S455, or step S456 has finished, the process proceeds to step S460.

[0053] In step S460, the second forward search unit 35 determines whether all active nodes have been processed. If they have, the process proceeds to step S470; if not, the process returns to step S430, selects the next active node, and repeats the above processing. In step S470, the second forward search unit 35 refers to the word histories of all active nodes and, if there is a section in which the word string following the sentence-start word <s> is unique, determines that section early as part of the recognition result.
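One way to read the early determination of step S470 is as the longest word prefix shared by the histories of every active node. The sketch below follows that reading; the list-of-words representation of a history and the name `early_determined` are assumptions, and bookkeeping of words already output is omitted.

```python
def early_determined(histories):
    """Return the words after <s> that are common to every active word history.

    Each history is a list of words beginning with "<s>".
    """
    if not histories:
        return []
    prefix = []
    for words in zip(*histories):  # compare the histories position by position
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break
    return prefix[1:]  # drop the sentence-start word itself


active_histories = [
    ["<s>", "w1", "w3"],
    ["<s>", "w1", "w4"],
    ["<s>", "w1"],
]
print(early_determined(active_histories))
```

Here every surviving hypothesis agrees on w1, so w1 can be output before the utterance ends, while the choice between w3 and w4 remains open.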

[0054] In step S480, the second forward search unit 35 determines whether processing has finished for all of the input speech. If it has, the second pass ends; if not, the process moves to step S490. In step S490, the second forward search unit 35 adds 1 to time t, after which the process returns to step S420 and the above processing is repeated for the input speech at time t + 1.

[0055] When real-time operation is not required of the speech recognition, an embodiment is also possible in which words are not determined early in step S470; instead, after the processing in step S480 has finished, the word history is traced back from the sentence-end word </s> and the word string for the whole utterance is output at once.
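The traceback can be sketched as a walk along back-pointers. The source does not describe how word histories are stored; the `WordEntry` record with a `prev` index and the function `traceback` are hypothetical.

```python
from typing import NamedTuple, Optional


class WordEntry(NamedTuple):
    word: str
    prev: Optional[int]  # index of the preceding word entry; None for <s>


def traceback(entries, last_index):
    """Follow back-pointers from the given entry to <s> and return the words in order."""
    words = []
    i = last_index
    while i is not None:
        words.append(entries[i].word)
        i = entries[i].prev
    words.reverse()
    return words


entries = [
    WordEntry("<s>", None),  # 0
    WordEntry("w1", 0),      # 1
    WordEntry("w2", 0),      # 2: a competing hypothesis that is not on the final path
    WordEntry("w4", 1),      # 3
    WordEntry("</s>", 3),    # 4
]
print(traceback(entries, last_index=4))
```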

[0056] The operation of the first forward search unit 25 of the present invention is described with reference to FIG. 6. The sentence-start word <s> starts at time 0 and has times 2, 3, and 4 as end candidates. Word w1 follows <s> when <s> ends at time 3; it starts at time 4 and has times 8 and 9 as end candidates. Word w2 follows <s> when <s> ends at time 2; it starts at time 3 but is pruned at time 8.

[0057] Word w3 follows <s> when <s> ends at time 4; it starts at time 5 and has times 12 and 13 as end candidates. Word w3 can also follow word w1 when w1 ends at time 8, in which case it starts at time 9 and has times 12 and 13 as end candidates. Word w4 follows word w1 when w1 ends at time 8; it starts at time 9 and has times 13, 14, and 15 as end candidates.

[0058] FIG. 7 shows an example of the word start list created by the first forward search unit 25 in the situation of FIG. 6. The word candidates starting at time 0 consist of the sentence-start word <s>, whose word average score is given in parentheses; in the case shown in FIG. 7 it is −59. Similarly, the word candidate starting at time 4 is w1, the candidate starting at time 5 is w3, and the candidates starting at time 9 are w3 and w4. Word w2 was pruned before reaching its word end and therefore does not appear in this word start list.
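The relationship between the hypotheses of FIG. 6 and the list of FIG. 7 can be reproduced with a short sketch. The tuple layout and the name `build_word_start_list` are assumptions; word average scores are left out because the text gives only the value for <s>.

```python
from collections import defaultdict

# (word, start_time, reached_word_end) for each hypothesis described for FIG. 6.
FIG6_HYPOTHESES = [
    ("<s>", 0, True),
    ("w2", 3, False),  # pruned at time 8 before reaching its word end
    ("w1", 4, True),
    ("w3", 5, True),
    ("w3", 9, True),
    ("w4", 9, True),
]


def build_word_start_list(hypotheses):
    """Group the words that reached a word end by their start time."""
    by_start = defaultdict(list)
    for word, start, reached_end in hypotheses:
        if reached_end and word not in by_start[start]:
            by_start[start].append(word)
    return dict(sorted(by_start.items()))


word_starts = build_word_start_list(FIG6_HYPOTHESES)
print(word_starts)
```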

[0059] FIG. 8 illustrates the operation of the second forward search unit 35, which operates under the constraints of the word start list shown in FIG. 7. The sentence-start word <s> becomes active at time 0, and the forward search begins. Word w1 becomes active at time 4 and its forward search begins there, although an embodiment in which the forward search may also start at times 3 and 5, immediately before and after time 4, is also possible.

[0060] Similarly, word w3 becomes active at times 5 and 9 and at the times immediately before and after them, and word w4 becomes active at time 9 and at the times immediately before and after it, at which point their forward searches begin. Because the word start list tightly restricts both the words the second pass processing unit 30 must search and the times at which those searches may start, word accuracy can be improved without increasing the overall amount of processing, even when more detailed acoustic and language models are used.

[0061] As described above, the continuous speech recognition device and program according to the first embodiment of the present invention perform the detailed forward search using a list of candidate words and their start times obtained with simple models, so word accuracy can be improved without increasing the overall amount of processing even when more detailed acoustic and language models are used. In addition, because the second pass also searches forward from the beginning of the sentence toward its end, an optimal early determination method that exploits the uniqueness of the candidate word string from the beginning of the sentence can be applied, which makes the approach well suited to real-time processing.

[0062] The first embodiment of the present invention has been described as a continuous speech recognition method in which the continuous speech recognition device according to the first embodiment performs the processing of steps S210 to S490. The method can also be carried out on a computer on which a continuous speech recognition program that executes the continuous speech recognition operation including steps S210 to S490 has been installed.

[0063] Besides loading the continuous speech recognition program into a computer from a predetermined storage medium, the same effect can be obtained by acquiring the program in file form through a communication interface and a network and executing it on the computer. Using a network also makes the program easier to update and distribute.

【００６４】[0064]

[Effects of the Invention] As described above, the present invention makes it possible to realize a continuous speech recognition device, and a program for it, that determines recognition results early and therefore has good real-time response, and that achieves high recognition accuracy with a light computational load, even when detailed acoustic and language models are used for speech recognition.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of the continuous speech recognition device according to the first embodiment of the present invention.

FIG. 2 is a flowchart showing the flow of processing performed in the first pass processing unit of the continuous speech recognition device according to the first embodiment of the present invention.

FIG. 3 is a flowchart showing in detail the flow of part of the processing performed in the first pass processing unit of the continuous speech recognition device according to the first embodiment of the present invention.

FIG. 4 is a flowchart showing the flow of processing performed in the second pass processing unit of the continuous speech recognition device according to the first embodiment of the present invention.

FIG. 5 is a flowchart showing in detail the flow of part of the processing performed in the second pass processing unit of the continuous speech recognition device according to the first embodiment of the present invention.

FIG. 6 is a diagram for explaining the operation of the first forward search unit of the continuous speech recognition device according to the first embodiment of the present invention.

FIG. 7 is a diagram showing an example of the word start list created by the first forward search unit of the continuous speech recognition device according to the first embodiment of the present invention.

FIG. 8 is a diagram for explaining the operation of the second forward search unit of the continuous speech recognition device according to the first embodiment of the present invention, which operates under the constraints of the word start list.

[Explanation of symbols]

10 Acoustic analysis unit
20 First pass processing unit
21 Pronunciation dictionary / simple model storage unit
22 Tree-structured phoneme network (NW) generation unit
23 Acoustic score calculation unit
24 Language score calculation unit
25 First forward search unit
30 Second pass processing unit
31 Pronunciation dictionary / detailed model storage unit
32 Linear-structured phoneme network (NW) generation unit
33 Acoustic score calculation unit
34 Language score calculation unit
35 Second forward search unit
100 Continuous speech recognition device

Claims

[Claims]

1. A continuous speech recognition device for recognizing uttered continuous speech and generating a word string corresponding to the continuous speech, comprising: means for storing a simple first acoustic model, a simple first language model, a second acoustic model that is more detailed than the first acoustic model, and a second language model that is more detailed than the first language model; first pass processing means for performing a forward search on the continuous speech using the simple first acoustic model and the simple first language model, and generating a word start list that includes information on each word that has reached a word end as a candidate for generating the word string and information on the start time at which each candidate word was uttered; and second pass processing means for performing, using the detailed second acoustic model and the detailed second language model, a forward search on the continuous speech within the range of the candidate words included in the word start list and on the basis of the information on the start time at which each candidate word was uttered, thereby generating a word string corresponding to the continuous speech.

2. The continuous speech recognition device according to claim 1, wherein the first pass processing means further additionally registers in the word start list information on words that have reached the vicinity of a word end during the forward search in the first pass processing means and information on the start times at which those words were uttered.

3. The first pass processing means further additionally registers the word average score of each of the words included in the word start list to the word start list, and the second pass processing means further comprises: 2. The candidate words are limited to those having an average word score of each word of a predetermined value or more, and a word string corresponding to the continuous speech is generated for the limited words. Continuous speech recognizer.

4. The continuous speech recognition device according to claim 1, wherein the second pass processing means further performs its forward search using, as start times, predetermined times within a fixed range before and after the start time at which each candidate word was uttered.

5. The continuous speech recognition device according to claim 1, wherein, even before the first pass processing means has completed generation of the word start list, the second pass processing means performs its forward search and carries out processing for generating a word string corresponding to the continuous speech as soon as the forward search in the first pass processing means has generated information on a candidate word and information on its start time.

6. The continuous speech recognition device according to claim 1, wherein the second pass processing means starts its forward search after the first pass processing means has completed generation of the word start list, and carries out processing for generating a word string corresponding to the continuous speech.

7. A program for causing a computer to execute processing for recognizing uttered continuous speech and generating a word string corresponding to the continuous speech, the computer having a simple first acoustic model, a simple first language model, a second acoustic model that is more detailed than the first acoustic model, and a second language model that is more detailed than the first language model, the program causing the computer to execute: a first pass processing step of performing a forward search on the continuous speech using the simple first acoustic model and the simple first language model, and generating a word start list that includes information on each word that is a candidate for generating the word string and information on the start time at which each candidate word was uttered; and a second pass processing step of performing, using the detailed second acoustic model and the detailed second language model, a forward search on the continuous speech within the range of the candidate words included in the word start list and on the basis of the information on the start time at which each candidate word was uttered, thereby generating a word string corresponding to the continuous speech.