JP2001092495A - Continuous speech recognition method - Google Patents

Continuous speech recognition method

Info

Publication number
JP2001092495A
Authority
JP
Japan
Prior art keywords
word
hypothesis
score
search
word string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP26823799A
Other languages
Japanese (ja)
Other versions
JP3559479B2 (en)
Inventor
Atsunori Ogawa
厚徳 小川
Yoshiaki Noda
喜昭 野田
Shoichi Matsunaga
昭一 松永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP26823799A priority Critical patent/JP3559479B2/en
Publication of JP2001092495A publication Critical patent/JP2001092495A/en
Application granted granted Critical
Publication of JP3559479B2 publication Critical patent/JP3559479B2/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Abstract

PROBLEM TO BE SOLVED: To reliably obtain a solution and also shorten processing time by performing a first-pass search using a coarse model to create a word lattice, and then performing a second-pass search backward from the end of the utterance on the word lattice using a high-accuracy model. SOLUTION: In the second-pass search, the hypothesis whose search is most delayed among all hypotheses, namely the one whose head word boundary time is latest (N7), is selected; that hypothesis is extended by one word and its score is obtained; pruning is performed; the most delayed hypothesis is selected again; and the same operations are repeated.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention: The present invention relates to a continuous speech recognition method for finding, through a plurality of search stages, the hypothesis closest to an input speech from among the many word-string hypotheses that can be generated under a prescribed grammar or set of connection relations.

[0002]

2. Description of the Related Art: First, an example of a conventional continuous speech recognition method will be described with reference to FIG. 6. In the figure, an input speech 11 is converted by an analysis processing unit 12 into a time series of feature parameter vector data; a search processing unit 13 then matches this time series against an acoustic model 15 corresponding to the word-string hypotheses (hereinafter simply called hypotheses) permitted by a grammar/language model 16. The score, which is the evaluation value of this matching, consists of an acoustic score indicating the acoustic closeness between the input speech and the hypothesis and a language score indicating the probability that the hypothesis exists; the hypothesis with the highest score is output as the recognition result 14.

[0003] Cepstrum analysis is often used as the signal processing in the analysis processing unit 12, and typical feature parameters include MFCC (Mel Frequency Cepstral Coefficient), ΔMFCC, and logarithmic power. As the acoustic model 15, the Hidden Markov Model (hereinafter HMM), built on probability and statistical theory, is the mainstream. An HMM is usually created for each phoneme (a phoneme model), but nowadays the dominant form is the triphone HMM, which, when modeling a given phoneme, also takes into account the phonemes connected before and after it (i.e., considers the phoneme context). Details of the HMM are disclosed in, for example, Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," edited by the Institute of Electronics, Information and Communication Engineers.

[0004] The grammar/language model 16 specifies the connection relations between words that define the sentences to be recognized; a word network whose branches are words, a probabilistic language model, or the like is used. In continuous speech recognition, the grammar often takes the form of a word network, as shown in FIG. 7, in which any word can be connected to any other word. With this form, hypotheses of arbitrary word strings can be generated within the range of words registered in the word network. As the probabilistic language model, the probability of a single word occurring and the probability of chains of two or more words are used. The model representing the occurrence probability of a single word is called the word 1-gram, and the models representing the two-word and three-word chain probabilities are called the word 2-gram and word 3-gram, respectively. Using such a language probability model suppresses the generation of hypotheses that cannot exist in the language (here, Japanese). Details of these probabilistic language models are disclosed in, for example, Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," edited by the Institute of Electronics, Information and Communication Engineers.
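
By way of illustration only (this sketch is not part of the patent text), the language score of a word-string hypothesis under a word 2-gram model can be accumulated as a sum of log conditional probabilities; the probability-table interface here is a hypothetical placeholder.

```python
import math

def bigram_language_score(words, p_unigram, p_bigram):
    """Word 2-gram language score of a word-string hypothesis.

    p_unigram : dict word -> P(word), used for the first word
    p_bigram  : dict (prev, word) -> P(word | prev)
    Returns the sum of log probabilities over the word string.
    """
    score = math.log(p_unigram[words[0]])
    for prev, word in zip(words, words[1:]):
        score += math.log(p_bigram[(prev, word)])
    return score
```

A word 3-gram score would be accumulated the same way, conditioning each word on its two predecessors instead of one.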

[0005] The search processing unit 13 matches the time series of feature parameter vector data against the acoustic models corresponding to the hypotheses on the word network, which expresses the word connection relations specified by the grammar, to obtain an acoustic score indicating acoustic likelihood; at the same time it obtains a language score from the language model corresponding to each hypothesis. The hypothesis score, consisting of the acoustic score and the language score, is computed for each hypothesis from the beginning to the end of the continuous input speech, and the hypothesis with the highest score, that is, the hypothesis closest to the input speech, is output as the recognition result. In continuous speech recognition, the number of hypotheses that can be generated by the grammar is enormous, so in order to obtain recognition results quickly and accurately, a search method called multi-pass search is often adopted, in which the hypothesis search is performed in multiple stages and the candidate hypotheses are narrowed down stage by stage. Details of multi-pass search are disclosed in, for example, R. Schwartz, L. Nguyen, and J. Makhoul, "Multiple-pass Search Strategies," in Automatic Speech and Speaker Recognition: Advanced Topics, pp. 429-456, Kluwer Academic Publishers (1996).

[0006] Here, the most common form, a multi-pass search that narrows down hypotheses in two stages, is described with reference to FIG. 8. In the first-stage search (first-pass search) 21, from among the huge number of hypotheses that can be generated by a word network such as that shown in FIG. 7, candidate hypotheses close to the input speech are narrowed down at high speed using computationally cheap models: a coarse language model in the grammar/language model 16, for example the word 2-gram, and a coarse acoustic model in the acoustic model 15, for example a triphone HMM that considers only the within-word phoneme context.

[0007] In this first-pass search 21, a method called time-synchronous beam search is often used. In a time-synchronous beam search, the matching of the input speech against the hypotheses normally proceeds time-synchronously, with the computation for each analysis frame performed simultaneously for all hypotheses that can be generated by the word network of FIG. 7; however, since the number of hypotheses that can be generated grows dramatically as time passes, it is difficult to finish this processing in a realistic processing time. Therefore, the search aims to finish within a realistic processing time by terminating (pruning) the search for hypotheses that are unlikely to become the recognition result. Two pruning criteria are common in time-synchronous beam search: keep the m hypotheses with the highest scores among all hypotheses and terminate the rest; or set a threshold equal to the highest score among all hypotheses minus a fixed value θ, keep only the hypotheses whose scores are at or above that threshold, and prune those below it. The parameters m and θ, which determine the pruning criterion, are called beam widths. Since a time-synchronous beam search prunes hypotheses judged hopeless by comparing the hypothesis scores at the same time instant, the possibility of pruning the hypothesis that would be the correct solution is small. However, because it prunes hypotheses in mid-search, a time-synchronous beam search does not necessarily yield the highest-scoring hypothesis as the recognition result; it is nevertheless a search method that always yields a solution if the beam width is made sufficiently large. Details of time-synchronous beam search are disclosed in, for example, R. Haeb-Umbach and H. Ney, "Improvements in beam search for 10000-word continuous-speech recognition," IEEE Trans. Speech and Audio Processing, Vol. 2, No. 2, pp. 353-356 (1994).

[0008] The result of the first-pass search is obtained as an intermediate representation such as a trellis or a word lattice; here we assume a word network called a word lattice, shown in FIG. 9, which compactly expresses the word connection relations. The word lattice stores, as the result of the first-pass search, the word boundary times and, at each such time, the score of the hypothesis up to that time. Details of the word lattice are disclosed in, for example, S. Ortmanns and H. Ney, "A word graph algorithm for large vocabulary continuous speech recognition," Computer Speech and Language, Vol. 11, No. 1, pp. 43-72 (1997).

[0009] As shown in FIG. 8, the second-stage search (second-pass search) 22 recomputes, for each analysis frame, the hypothesis scores on the word lattice 23 obtained from the first-pass search 21, using a high-accuracy acoustic model from the acoustic models 15 and a high-accuracy language model from the grammar/language model 16, and obtains the final recognition result 14. Methods often used for the second-pass search 22 include N-best rescoring and A* search.

[0010] In N-best rescoring, the scores of multiple sentence candidates (the N with the highest scores), called the N-best sentence candidates and ordered by the scores from the search with the coarse models, are replaced by scores from a search with the high-accuracy models, and the sentence candidates are reordered in descending order. When N-best rescoring is used for the second-pass search 22, first, based on the first-pass scores stored in the word lattice, the N sentence candidates with the highest scores (the N-best sentence candidates) are generated from the word lattice 23; the scores from the coarse language model, e.g. the word 2-gram, are then replaced by scores from a more accurate language model, e.g. the word 3-gram, the scores are recomputed, and the sentence candidates are reordered in descending order of the recomputed scores. N-best rescoring is simple to implement and reliably yields a recognition result. Details of N-best rescoring are disclosed in, for example, L. Nguyen, R. Schwartz, Y. Zhao, and G. Zavaliagkos, "Is N-best dead?," Proc. DARPA Speech and Natural Language Workshop, pp. 411-414 (1994).

[0011] In an A* search, expansion is performed preferentially from the hypothesis n with the highest score defined by the following equation (best-first search):

f_n(t) = g_n(t) + h_n(t)   (1)

Here, t is the time (frame number); g_n(t) is the score of the already searched section, that is, in FIG. 10, the score of the hypothesis linking the word boundary times (also called nodes) N0-N1-N2-N3-N4-N5; and h_n(t) is the estimated score (heuristic) of the unsearched section from the word boundary time N5 to the beginning of the utterance. That is, f_n(t) is an estimated score over the whole of hypothesis n, so using f_n(t) as the score of hypothesis n amounts to evaluating every hypothesis over the entire section from start to end. This also makes it possible to compare hypotheses whose searches have progressed to different degrees (which differ in temporal length). It is known that, for the A* search to yield the highest-scoring solution (the optimal solution), the value of h_n(t) must be no smaller than its true value (were that value known) (A* admissibility); furthermore, the closer h_n(t) is to the true value, the more efficient the search. When the A* search is used for the second-pass search 22, hypotheses are expanded word by word on the word lattice from the end of the sentence, in the direction opposite to the first-pass search 21, as shown in FIG. 10. In this case, g_n(t) is recomputed using a language model and an acoustic model each more accurate than those used in the first-pass search, for example the word 3-gram and a triphone HMM that considers the phoneme context both within and across words.

[0012] As the heuristic h_n(t) of the second-pass search score, the first-pass score stored in the word lattice 23 can be used. In FIG. 10 there are currently six hypotheses: N0-N1-N2-N3-N6, N0-N1-N2-N3-N4-N5, N0-N1-N2-N7, N0-N1-N8, N0-N11-N9, and N0-N11-N10; from these, the one with the largest f_n(t) (in this example N0-N1-N2-N3-N4-N5) is selected and expanded. Details of the A* search are disclosed in, for example, Nils J. Nilsson, "Problem-Solving Methods in Artificial Intelligence" (Japanese translation by 合田周平 and 増田一比古, Corona Publishing).
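
An illustrative best-first loop over equation (1): since Python's heapq is a min-heap, f is negated so the largest f pops first, and the search runs backward from the end-of-sentence node as described above. The lattice accessor and score functions are assumed stand-ins for the high-accuracy rescoring and the stored first-pass heuristic.

```python
import heapq
import itertools

def a_star_search(end_node, start_node, g_score, heuristic, predecessors):
    """Best-first (A*) expansion maximizing f_n(t) = g_n(t) + h_n(t)."""
    counter = itertools.count()  # tie-breaker so paths never compare
    start = (end_node,)
    heap = [(-(g_score(start) + heuristic(end_node)), next(counter), start)]
    while heap:
        neg_f, _, path = heapq.heappop(heap)
        head = path[-1]
        if head == start_node:             # the whole utterance is covered
            return path
        for node in predecessors(head):    # extend the hypothesis one word
            new_path = path + (node,)
            f = g_score(new_path) + heuristic(node)
            heapq.heappush(heap, (-f, next(counter), new_path))
    return None
```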

[0013]

SUMMARY OF THE INVENTION: N-best rescoring has the following problems: when generating the N-best sentence candidates from the word lattice, many similar candidates differing by only one word appear, so rescoring must be applied to a relatively large number of sentence candidates to obtain sufficient recognition accuracy; and although the acoustic scores can also be recomputed with a more accurate acoustic model, doing so is not efficient. The A* search, on the other hand, has the advantage that the result of the first-pass search can be used as the heuristic in the second-pass search, but since the first-pass and second-pass searches use different models, an h_n(t) close to the true value is not always obtained, and depending on the input speech the search can become inefficient. It suffices if h_n(t) is close to the true value and expansion of the hypothesis with the highest f_n(t) proceeds well; but when h_n(t) is far from the true value, the number of hypotheses grows enormously and real-time recognition becomes difficult.

[0014] The present invention has been made in view of the above problems of N-best rescoring and the A* search, and its object is to provide a continuous speech recognition method that, like the A* search, exploits the result of a pass search using coarse models, performs the subsequent pass search efficiently, and, like the time-synchronous beam search and N-best rescoring, is guaranteed to yield a solution.

[0015]

MEANS FOR SOLVING THE PROBLEMS: According to the present invention, in a search that uses high-accuracy models on the word network (word lattice) obtained by a search with coarse models, the hypothesis whose search is most delayed is repeatedly expanded preferentially. In this way the lengths of the hypotheses under expansion stay roughly aligned. Pruning can therefore be performed during hypothesis expansion, an efficient search becomes possible, and a solution is always obtained.

[0016] Further, in the search using the high-accuracy models, the score is computed while giving each word boundary time stored in the previously obtained word lattice a width of 5 milliseconds or more, or one frame or more.

[0017]

DESCRIPTION OF THE PREFERRED EMBODIMENTS: An embodiment of the present invention is described below. In this embodiment, as shown for example in FIG. 8, a first-pass search is performed on the vector data series of input feature parameters using a coarse acoustic model and a coarse language model to generate a word lattice 23; a second-pass search is then performed on that word lattice 23 using a high-accuracy acoustic model and a high-accuracy language model.

[0018] What characterizes this embodiment is the method of the second-pass search. Like the conventional A* search, this second-pass search expands hypotheses word by word from the end of the sentence (the end of the input speech), in the direction opposite to the first-pass search. In doing so, however, this invention expands the word-level hypotheses preferentially from the one whose search is most delayed (shortest-first search). For example, as shown in FIG. 1, suppose the search has been expanded into seven hypotheses: the hypothesis consisting of the word boundary times (nodes) N0-N1-N3-N8, the hypothesis N0-N1-N3-N9, the hypothesis N0-N1-N3-N7, ..., and the hypothesis N0-N1-N4-N5-N13. The hypothesis whose search is most delayed is found by selecting, among the head nodes N8, N9, N7, N10, N11, N12, N13 of the hypotheses, the node N7 whose time t is latest. Here, each time is measured from the start of the input speech and increases toward the end; since the search runs backward, the hypothesis with the latest head time is the one that has advanced least. The hypothesis is then expanded from the node N7 selected in this way. For example, suppose the nodes connected on the start side of node N7 in the word lattice 23 (FIG. 8) are N14, N15, and N16, and that the estimated score (heuristic) of the unsearched section from node N14 to the start of the utterance is h_n1(t), and likewise the heuristics of the unsearched sections from N15 and N16 to the start are h_n2(t) and h_n3(t), respectively.
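
A minimal sketch of this shortest-first selection (the hypothesis object with a head_time field is an assumption); because the second pass runs backward from the utterance end, the hypothesis whose head-node time is latest is the one that has advanced least.

```python
def select_most_delayed(hypotheses):
    """Pick the hypothesis whose head (start-side) node time is latest.

    Times are frame indices measured from the start of the input speech;
    in the backward second pass, the latest head time marks the
    least-advanced (most delayed) hypothesis.
    """
    return max(hypotheses, key=lambda h: h.head_time)
```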

[0019] The score g_n1(t) of the hypothesis running from node N0 through N7 to N14 is computed for each analysis frame, and the sum f_n1(t) of this g_n1(t) and h_n1(t), that is, the score of that hypothesis over the entire utterance, is obtained. Likewise, the score g_n2(t) of the hypothesis reaching node N15 and its whole-utterance score f_n2(t) are obtained, as are the score g_n3(t) of the hypothesis reaching node N16 and its whole-utterance score f_n3(t).

[0020] In this way, a hypothesis expansion that extends the most delayed hypothesis by one word is performed from its head node N7; each time a hypothesis is extended by one word, the most delayed hypothesis is selected again and extended by one word in turn. As a result, the hypotheses are expanded with their temporal lengths kept roughly aligned, as in a time-synchronous beam search. Score-based pruning therefore becomes possible, and in this embodiment pruning is performed while the hypotheses are expanded. One or both of two pruning techniques can be used. In the first, while the score g_ni(t) (here, e.g., i = 1, 2, 3) of a hypothesis being extended is computed frame by frame, at the end of each analysis frame a threshold is set equal to the highest score g_n(t) among all hypotheses at that point minus a fixed value θ; any hypothesis whose score falls at or below this threshold has its computation terminated there and is removed.

[0021] For example, as shown in FIG. 2, the envelope of the highest score g_n(t) obtained in the frame-by-frame computation is represented by curve 31, and curve 32 is the score smaller than curve 31 by θ; during the computation of a hypothesis expansion, hypotheses whose score g_n(t) falls to curve 32 or below are removed, and only the hypotheses whose scores fall between curves 31 and 32 remain. Note that FIG. 2 shows the case in which the score is computed in a direction such that it decreases as the hypothesis is extended.

[0022] The other pruning technique is: each time one hypothesis is extended and expanded by one word, take the m hypotheses with the largest whole-utterance scores f_n(t) among all hypotheses, keep those m hypotheses, and remove the hypotheses with smaller scores. The procedure of hypothesis expansion described above is shown in FIG. 3. First, from the head node group N = {n1, ..., nx} of all hypotheses, the node ni with the latest time is selected (S1). The node group {ni1, ..., niy} expanded from node ni is extracted (S2). For each extracted node nij (j = 1, ..., y) in turn, computation of the hypothesis score g_n(t)(nij) is started (S3); during the computation of each g_n(t)(nij), a threshold is obtained by subtracting θ from the highest score among the per-frame computation results, and if the computed score falls to the threshold or below (S4), the computation is stopped and expansion toward that node nij is abandoned, that is, the hypothesis expanding to that node is pruned, and the procedure moves to step S7 (S12).

[0023] If the score does not fall below the threshold during the computation and the score computation finishes (S5), then if node nij is not the start of the utterance (S6) and the computation has not finished for all of the extracted nodes nij (S7), the procedure returns to step S3 and starts computing the score for the next node nij. When the hypothesis scores have been computed for all nij (S7), the previously selected ni is deleted from the head node group N and all nij are added to the head node group N (S8). If the number of hypotheses at this point is m or fewer (S9), the procedure returns to step S1, again selects the node with the latest time from the head node group, and performs the same processing. If, on the other hand, the number of hypotheses exceeds m, the m hypotheses with the largest whole-utterance scores f_n(t) = g_n(t) + h_n(t) are taken, only those hypotheses are kept, and the others are removed (S10). Along with this removal, the head nodes of the removed hypotheses are also removed from the head node group N. After this pruning, the procedure returns to step S1.

[0024] If nij is the start of the utterance in step S6, the whole-utterance score f_n(t) = g_n(t)(nij) of the hypothesis obtained at that point is stored for that hypothesis, and the procedure moves to step S7 (S12). This nij is not added back to the head node group N (the search for nij is finished). The above processing is repeated until there is no head node left to select in step S1; when no head node remains, the stored hypothesis with the largest score, or a predetermined number of hypotheses in descending order of score, is output as the recognition result.

[0025] Since the first-pass and second-pass searches use different models, the word boundaries may shift between the first and second passes even for the same hypothesis. In this embodiment, therefore, the word boundary times of the first-pass search stored in the word lattice are not used as-is as the word boundary times of the second-pass search; instead, the second-pass search is performed while allowing a shift of several frames before and after each boundary.

[0026] That is, suppose, for example as shown in FIG. 4A, that the word boundary times stored in the word lattice are t1 between word A and word B, and t2 between word B and word C. Then, as shown in FIG. 4B, not only t1 but also t1−Δ and t1+Δ are taken as boundary times between word A and word B, and not only t2 but also t2−Δ and t2+Δ are taken as boundary times between word B and word C. The score computation in this case starts at time t2+Δ; the value Δg(t2+Δ, t2) when time t2 is reached is stored; the computation continues and the value Δg(t2+Δ, t2−Δ) when t2−Δ is reached is stored; it continues further and the value g(t2+Δ, t1+Δ) when time t1+Δ is reached is stored, then the value g(t2+Δ, t1) when t1 is reached, and then the value g(t2+Δ, t1−Δ) when t1−Δ is reached. The score at time t1+Δ is the maximum of the three scores obtained by extending the hypothesis from t2+Δ, t2, and t2−Δ to t1+Δ, namely g(t2+Δ, t1+Δ), g(t2+Δ, t1+Δ) − Δg(t2+Δ, t2), and g(t2+Δ, t1+Δ) − Δg(t2+Δ, t2−Δ). The score at time t1 is the maximum of the three scores obtained by extending from t2+Δ, t2, and t2−Δ to t1, namely g(t2+Δ, t1), g(t2+Δ, t1) − Δg(t2+Δ, t2), and g(t2+Δ, t1) − Δg(t2+Δ, t2−Δ). The score at time t1−Δ is the maximum of the three scores obtained by extending from t2+Δ, t2, and t2−Δ to t1−Δ, namely g(t2+Δ, t1−Δ), g(t2+Δ, t1−Δ) − Δg(t2+Δ, t2), and g(t2+Δ, t1−Δ) − Δg(t2+Δ, t2−Δ).
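
A compact sketch of this boundary-slack computation for one word, assuming frame scores accumulate so that a run started at t2+Δ can be corrected to a later start by subtracting the stored offsets Δg; all names here are illustrative.

```python
def slack_boundary_scores(g_run, dg_t2, dg_t2m):
    """Best score at each candidate left boundary of a word.

    g_run  : dict mapping each end time (t1+D, t1, t1-D) to the score
             g(t2+D, end) of the single pass started at t2+D
    dg_t2  : stored offset Dg(t2+D, t2), the score accumulated by t2
    dg_t2m : stored offset Dg(t2+D, t2-D)
    For each end time, takes the maximum over the three start times
    t2+D, t2 and t2-D, as described in the text above.
    """
    return {end: max(score,            # hypothesis extended from t2+D
                     score - dg_t2,    # extended from t2
                     score - dg_t2m)   # extended from t2-D
            for end, score in g_run.items()}
```

One forward pass started at t2+Δ thus covers all nine start/end combinations, rather than computing three separate passes.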

[0027] Δ is set to at least one analysis frame, or about 5 milliseconds or more; however, since a larger Δ increases the amount of computation, it is kept to about several frames, or several tens of milliseconds, or less. Although f_n(t) = g_n(t) + h_n(t) was used above as the whole-utterance score of a hypothesis, accuracy can be improved further by giving h_n(t) a weight α close to 1 and using f_n(t) = g_n(t) + αh_n(t) as the whole-utterance score. To obtain α, a score h is computed for a suitable word string using the coarse models used in the first-pass search, a score g is computed for the same word string using the high-accuracy models used in the second-pass search, and the weight is obtained as α = g/h.

[0028] Although the invention has been applied above to the second-pass search, it can also be applied when recognition is performed with a three-stage search. The point is that the invention is applicable whenever a pass search with coarse models is performed to create a word lattice and a pass search with high-accuracy models is then performed on that word lattice. Next, the results of a comparative continuous speech recognition experiment are described, in which the above N-best rescoring and the search according to this invention (hereafter called the time-asynchronous beam search) were applied to the large-vocabulary continuous speech recognition system developed by the present inventors. The large-vocabulary continuous speech recognition system is described in detail in IEICE Technical Report SP96-102, Yoshiaki Noda, Shoichi Matsunaga, and Shigeki Sagayama, "A Study of Approximate Computation Methods in Large-Vocabulary Continuous Speech Recognition Using Word Graphs" (1997). The acoustic model is a triphone HMM with 2000 total states and 8 mixture components, trained on 6700 sentences from one month of news programs. The features are 39-dimensional in total: 12 MFCC dimensions with their first- and second-order regression coefficients, and logarithmic power with its first- and second-order regression coefficients. The language models are a word 2-gram and a word 3-gram trained on 500,000 sentences from four years of news program scripts and on transcriptions of one month of news program speech. As the evaluation set, 50 sentences (1800 words in total, average utterance length 12 seconds) were selected from five days of news programs. The word error rate of the highest-scoring hypothesis (optimal solution) among the hypotheses contained in the word lattice obtained as the result of the first-pass search was 9.51%.

[0029] FIG. 5A shows the experimental results for N-best rescoring and for the time-asynchronous beam search without allowing any shift of the word boundary times. They show that the time-asynchronous beam search obtains a solution faster and with higher accuracy than N-best rescoring. Next, the effect of allowing a shift of several msec in the word boundary times was investigated for the time-asynchronous beam search. Experiments were run with the allowed shift varied from 10 to 50 msec, taking the no-shift case of FIG. 5A (0 msec) as the baseline. The results are shown in FIG. 5B. They show that allowing a shift of about 20 msec yields a more accurate solution, and that allowing a shift yields higher accuracy than not allowing one. The A* search was also evaluated in this experiment, but for the fourth sentence no solution was obtained even after about 30 minutes. According to this invention, by contrast, all solutions were obtained within practical time, confirming that the invention is superior to the A* search.

[0030]

EFFECTS OF THE INVENTION: As described above, according to the present invention, a solution is always obtained by performing hypothesis expansion and pruning such that the lengths of the hypotheses under expansion stay as aligned as possible, as in the time-synchronous beam search, which yields a solution stably. In addition, by using the word boundary times and score information stored in the word lattice as the result of a pass search with coarse models, as in the A* search, and by allowing a shift of several frames instead of using the stored word boundary times as-is in the pass search with high-accuracy models, the final solution is obtained accurately, efficiently, and stably.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a hypothesis expansion diagram for explaining the preferential expansion of the hypothesis whose search is most delayed, which is a main feature of the present invention.

FIG. 2 is a diagram illustrating pruning by a score beam.

FIG. 3 is a flowchart showing an example of a processing procedure for preferentially expanding the hypothesis whose search is most delayed, which is a main feature of the present invention, and for pruning.

FIG. 4 is an explanatory diagram of allowing a shift of the word boundary times according to the present invention.

FIG. 5 is a diagram showing experimental results demonstrating the effect of the present invention.

FIG. 6 is a diagram showing the general functional configuration of speech recognition processing.

FIG. 7 is a diagram showing a word network permitted by the grammar.

FIG. 8 is a diagram showing the functional configuration of continuous speech recognition processing based on a multi-pass search.

FIG. 9 is a diagram showing an example of the word lattice generated by the first-pass search in FIG. 8.

FIG. 10 is a diagram showing hypothesis expansion in a conventional A* search.

Continuation of the front page: (72) Inventor: Shoichi Matsunaga, 2-3-1 Otemachi, Chiyoda-ku, Tokyo, within Nippon Telegraph and Telephone Corporation. F-terms (reference): 5D015 AA01 BB01 HH23 LL03

Claims (5)

[Claims]

[Claim 1] A continuous speech recognition method comprising an acoustic model for obtaining an acoustic score indicating the acoustic closeness between a word and an input speech, and a grammar defining the connection relations between words or a language model for obtaining a language score indicating the ease of those connections, in which a continuously uttered input speech is searched using a coarse acoustic model and a coarse language model to narrow down, from the word-string hypotheses permitted by the grammar, those close to the input speech and create a word network, after which a search is performed on said word network using an acoustic model and a language model more accurate than those of said search, to narrow down, from the word-string hypotheses permitted by the word network for said input speech, those still closer to the input speech, and finally one or more word-string hypotheses closest to the input speech are taken as the recognition result, characterized in that the search using said high-accuracy acoustic model and language model is performed by selecting, at each expansion of a word-string hypothesis, the word-string hypothesis whose search is most delayed, and, at each expansion of a word-string hypothesis, terminating the expansion of any word-string hypothesis that falls outside a predetermined condition based on the obtained word-string score.
[Claim 2] The continuous speech recognition method according to claim 1, characterized in that the expansion of a word-string hypothesis is terminated when the score g_n(t) of the already searched section, computed at the expansion of the word-string hypothesis, falls to a threshold or below, whether in the final computation result or during the computation.
[Claim 3] The continuous speech recognition method according to claim 1 or 2, characterized in that, upon finishing the hypothesis expansion of word strings for one node (one word boundary on said word network), the hypothesis score f_n(t) of each word string is taken as the sum of the score g_n(t) of the already searched section and the score h_n(t) of the unsearched section obtained in the preceding search, and the expansion of all word-string hypotheses other than the m word-string hypotheses with the largest f_n(t) is terminated.
[Claim 4] The continuous speech recognition method according to claim 3, characterized in that the score h_n(t) of said unsearched section is multiplied by a weight α, giving f_n(t) = g_n(t) + αh_n(t).
[Claim 5] The continuous speech recognition method according to any one of claims 1 to 4, characterized in that the score computation at the expansion of a word-string hypothesis is performed within a range shifted by about 5 milliseconds or more from the word boundary times stored in said word network.
JP26823799A 1999-09-22 1999-09-22 Continuous speech recognition method Expired - Lifetime JP3559479B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP26823799A JP3559479B2 (en) 1999-09-22 1999-09-22 Continuous speech recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP26823799A JP3559479B2 (en) 1999-09-22 1999-09-22 Continuous speech recognition method

Publications (2)

Publication Number Publication Date
JP2001092495A true JP2001092495A (en) 2001-04-06
JP3559479B2 JP3559479B2 (en) 2004-09-02

Family

ID=17455820

Family Applications (1)

Application Number Title Priority Date Filing Date
JP26823799A Expired - Lifetime JP3559479B2 (en) 1999-09-22 1999-09-22 Continuous speech recognition method

Country Status (1)

Country Link
JP (1) JP3559479B2 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005096271A1 (en) * 2004-03-30 2005-10-13 Pioneer Corporation Speech recognition device and speech recognition method
JP2014149637A (en) * 2013-01-31 2014-08-21 Nippon Telegr & Teleph Corp <Ntt> Approximate oracle sentence selection device, method, and program
CN105723449A (en) * 2013-11-06 2016-06-29 系统翻译国际有限公司 System for analyzing speech content on basis of extraction of keywords from recorded voice data, indexing method using system and method for analyzing speech content
US20160284345A1 (en) 2013-11-06 2016-09-29 Systran International Co., Ltd. System for grasping keyword extraction based speech content on recorded voice data, indexing method using the system, and method for grasping speech content
JP2016539364A (en) * 2013-11-06 2016-12-15 シストラン・インターナショナル・カンパニー・リミテッドSystran International Co., Ltd. Utterance content grasping system based on extraction of core words from recorded speech data, indexing method and utterance content grasping method using this system
US10304441B2 (en) 2013-11-06 2019-05-28 Systran International Co., Ltd. System for grasping keyword extraction based speech content on recorded voice data, indexing method using the system, and method for grasping speech content

Also Published As

Publication number Publication date
JP3559479B2 (en) 2004-09-02

Similar Documents

Publication Publication Date Title
JP4322815B2 (en) Speech recognition system and method
JP4465564B2 (en) Voice recognition apparatus, voice recognition method, and recording medium
US5241619A (en) Word dependent N-best search method
JP3672595B2 (en) Minimum false positive rate training of combined string models
JP4802434B2 (en) Voice recognition apparatus, voice recognition method, and recording medium recording program
JP5310563B2 (en) Speech recognition system, speech recognition method, and speech recognition program
KR20040076035A (en) Method and apparatus for speech recognition using phone connection information
JP2001249684A (en) Device and method for recognizing speech, and recording medium
Hain et al. The cu-htk march 2000 hub5e transcription system
US6980954B1 (en) Search method based on single triphone tree for large vocabulary continuous speech recognizer
JP2013125144A (en) Speech recognition device and program thereof
US20070038451A1 (en) Voice recognition for large dynamic vocabularies
JP2003208195A5 (en)
JP3559479B2 (en) Continuous speech recognition method
JP4528540B2 (en) Voice recognition method and apparatus, voice recognition program, and storage medium storing voice recognition program
JP3494338B2 (en) Voice recognition method
JP2017044901A (en) Sound production sequence extension device and program thereof
JP4733436B2 (en) Word / semantic expression group database creation method, speech understanding method, word / semantic expression group database creation device, speech understanding device, program, and storage medium
JP3368989B2 (en) Voice recognition method
JP3532248B2 (en) Speech recognition device using learning speech pattern model
JPH08241096A (en) Speech recognition method
Fu et al. Combination of multiple predictors to improve confidence measure based on local posterior probabilities
JP2731133B2 (en) Continuous speech recognition device
JP4600705B2 (en) Voice recognition apparatus, voice recognition method, and recording medium
JP2002149188A (en) Device and method for processing natural language and recording medium

Legal Events

Date Code Title Description
TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20040427

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20040521

R151 Written notification of patent or utility model registration

Ref document number: 3559479

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R151

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090528

Year of fee payment: 5

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20100528

Year of fee payment: 6

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20110528

Year of fee payment: 7

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120528

Year of fee payment: 8

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130528

Year of fee payment: 9

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20140528

Year of fee payment: 10

S531 Written request for registration of change of domicile

Free format text: JAPANESE INTERMEDIATE CODE: R313531

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

EXPY Cancellation because of completion of term