JP2001092495A - Continuous speech recognition method - Google Patents

Continuous speech recognition method

Info

Publication number
JP2001092495A
Authority
JP
Japan
Prior art keywords
word
hypothesis
score
search
word string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP26823799A
Other languages
Japanese (ja)
Other versions
JP3559479B2 (en)
Inventor
Atsunori Ogawa
厚徳 小川
Yoshiaki Noda
喜昭 野田
Shoichi Matsunaga
昭一 松永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP26823799A priority Critical patent/JP3559479B2/en
Publication of JP2001092495A publication Critical patent/JP2001092495A/en
Application granted granted Critical
Publication of JP3559479B2 publication Critical patent/JP3559479B2/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Abstract

PROBLEM TO BE SOLVED: To reliably obtain a solution and also shorten processing time by performing a first-pass search using a coarse model to create a word lattice, and then performing a second-pass search backward from the end of the utterance on the word lattice using a high-accuracy model. SOLUTION: In the second-pass search, the hypothesis whose search is most delayed among all hypotheses, namely the one whose head word boundary time is latest (N7), is selected; that hypothesis is extended by one word and its score is obtained; pruning is performed; the most delayed hypothesis is selected again; and the same operations are repeated.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention: The present invention relates to a continuous speech recognition method for finding, through a plurality of search stages, the hypothesis closest to an input speech from among the many word-string hypotheses that can be generated under a prescribed grammar or set of connection relations.

[0002]

2. Description of the Related Art: First, an example of a conventional continuous speech recognition method will be described with reference to FIG. 6. In the figure, an input speech 11 is converted by an analysis processing unit 12 into a time series of feature parameter vector data; a search processing unit 13 then matches this time series against an acoustic model 15 corresponding to the word-string hypotheses (hereinafter simply called hypotheses) permitted by a grammar/language model 16. The score, which is the evaluation value of this matching, consists of an acoustic score indicating the acoustic closeness between the input speech and the hypothesis and a language score indicating the probability that the hypothesis exists; the hypothesis with the highest score is output as the recognition result 14.

[0003] Cepstrum analysis is often used as the signal processing in the analysis processing unit 12, and typical feature parameters include MFCC (Mel Frequency Cepstral Coefficient), ΔMFCC, and logarithmic power. As the acoustic model 15, the Hidden Markov Model (hereinafter HMM), built on probability and statistical theory, is the mainstream. An HMM is usually created for each phoneme (a phoneme model), but nowadays the dominant form is the triphone HMM, which, when modeling a given phoneme, also takes into account the phonemes connected before and after it (i.e., considers the phoneme context). Details of the HMM are disclosed in, for example, Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," edited by the Institute of Electronics, Information and Communication Engineers.

[0004] The grammar/language model 16 specifies the connection relations between words that define the sentences to be recognized; a word network whose branches are words, a probabilistic language model, or the like is used. In continuous speech recognition, the grammar often takes the form of a word network, as shown in FIG. 7, in which any word can be connected to any other word. With this form, hypotheses of arbitrary word strings can be generated within the range of words registered in the word network. As the probabilistic language model, the probability of a single word occurring and the probability of chains of two or more words are used. The model representing the occurrence probability of a single word is called the word 1-gram, and the models representing the two-word and three-word chain probabilities are called the word 2-gram and word 3-gram, respectively. Using such a language probability model suppresses the generation of hypotheses that cannot exist in the language (here, Japanese). Details of these probabilistic language models are disclosed in, for example, Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," edited by the Institute of Electronics, Information and Communication Engineers.
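
By way of illustration only (this sketch is not part of the patent text), the language score of a word-string hypothesis under a word 2-gram model can be accumulated as a sum of log conditional probabilities; the probability-table interface here is a hypothetical placeholder.

```python
import math

def bigram_language_score(words, p_unigram, p_bigram):
    """Word 2-gram language score of a word-string hypothesis.

    p_unigram : dict word -> P(word), used for the first word
    p_bigram  : dict (prev, word) -> P(word | prev)
    Returns the sum of log probabilities over the word string.
    """
    score = math.log(p_unigram[words[0]])
    for prev, word in zip(words, words[1:]):
        score += math.log(p_bigram[(prev, word)])
    return score
```

A word 3-gram score would be accumulated the same way, conditioning each word on its two predecessors instead of one.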

[0005] The search processing unit 13 matches the time series of feature parameter vector data against the acoustic models corresponding to the hypotheses on the word network, which expresses the word connection relations specified by the grammar, to obtain an acoustic score indicating acoustic likelihood; at the same time it obtains a language score from the language model corresponding to each hypothesis. The hypothesis score, consisting of the acoustic score and the language score, is computed for each hypothesis from the beginning to the end of the continuous input speech, and the hypothesis with the highest score, that is, the hypothesis closest to the input speech, is output as the recognition result. In continuous speech recognition, the number of hypotheses that can be generated by the grammar is enormous, so in order to obtain recognition results quickly and accurately, a search method called multi-pass search is often adopted, in which the hypothesis search is performed in multiple stages and the candidate hypotheses are narrowed down stage by stage. Details of multi-pass search are disclosed in, for example, R. Schwartz, L. Nguyen, and J. Makhoul, "Multiple-pass Search Strategies," in Automatic Speech and Speaker Recognition: Advanced Topics, pp. 429-456, Kluwer Academic Publishers (1996).

[0006] Here, the most common form, a multi-pass search that narrows down hypotheses in two stages, is described with reference to FIG. 8. In the first-stage search (first-pass search) 21, from among the huge number of hypotheses that can be generated by a word network such as that shown in FIG. 7, candidate hypotheses close to the input speech are narrowed down at high speed using computationally cheap models: a coarse language model in the grammar/language model 16, for example the word 2-gram, and a coarse acoustic model in the acoustic model 15, for example a triphone HMM that considers only the within-word phoneme context.

[0007] In this first-pass search 21, a method called time-synchronous beam search is often used. In a time-synchronous beam search, the matching of the input speech against the hypotheses normally proceeds time-synchronously, with the computation for each analysis frame performed simultaneously for all hypotheses that can be generated by the word network of FIG. 7; however, since the number of hypotheses that can be generated grows dramatically as time passes, it is difficult to finish this processing in a realistic processing time. Therefore, the search aims to finish within a realistic processing time by terminating (pruning) the search for hypotheses that are unlikely to become the recognition result. Two pruning criteria are common in time-synchronous beam search: keep the m hypotheses with the highest scores among all hypotheses and terminate the rest; or set a threshold equal to the highest score among all hypotheses minus a fixed value θ, keep only the hypotheses whose scores are at or above that threshold, and prune those below it. The parameters m and θ, which determine the pruning criterion, are called beam widths. Since a time-synchronous beam search prunes hypotheses judged hopeless by comparing the hypothesis scores at the same time instant, the possibility of pruning the hypothesis that would be the correct solution is small. However, because it prunes hypotheses in mid-search, a time-synchronous beam search does not necessarily yield the highest-scoring hypothesis as the recognition result; it is nevertheless a search method that always yields a solution if the beam width is made sufficiently large. Details of time-synchronous beam search are disclosed in, for example, R. Haeb-Umbach and H. Ney, "Improvements in beam search for 10000-word continuous-speech recognition," IEEE Trans. Speech and Audio Processing, Vol. 2, No. 2, pp. 353-356 (1994).

[0008] The result of the first-pass search is obtained as an intermediate representation such as a trellis or a word lattice; here we assume a word network called a word lattice, shown in FIG. 9, which compactly expresses the word connection relations. The word lattice stores, as the result of the first-pass search, the word boundary times and, at each such time, the score of the hypothesis up to that time. Details of the word lattice are disclosed in, for example, S. Ortmanns and H. Ney, "A word graph algorithm for large vocabulary continuous speech recognition," Computer Speech and Language, Vol. 11, No. 1, pp. 43-72 (1997).

[0009] As shown in FIG. 8, the second-stage search (second-pass search) 22 recomputes, for each analysis frame, the hypothesis scores on the word lattice 23 obtained from the first-pass search 21, using a high-accuracy acoustic model from the acoustic models 15 and a high-accuracy language model from the grammar/language model 16, and obtains the final recognition result 14. Methods often used for the second-pass search 22 include N-best rescoring and A* search.

[0010] In N-best rescoring, the scores of multiple sentence candidates (the N with the highest scores), called the N-best sentence candidates and ordered by the scores from the search with the coarse models, are replaced by scores from a search with the high-accuracy models, and the sentence candidates are reordered in descending order. When N-best rescoring is used for the second-pass search 22, first, based on the first-pass scores stored in the word lattice, the N sentence candidates with the highest scores (the N-best sentence candidates) are generated from the word lattice 23; the scores from the coarse language model, e.g. the word 2-gram, are then replaced by scores from a more accurate language model, e.g. the word 3-gram, the scores are recomputed, and the sentence candidates are reordered in descending order of the recomputed scores. N-best rescoring is simple to implement and reliably yields a recognition result. Details of N-best rescoring are disclosed in, for example, L. Nguyen, R. Schwartz, Y. Zhao, and G. Zavaliagkos, "Is N-best dead?," Proc. DARPA Speech and Natural Language Workshop, pp. 411-414 (1994).

[0011] In an A* search, expansion is performed preferentially from the hypothesis n with the highest score defined by the following equation (best-first search):

f_n(t) = g_n(t) + h_n(t)   (1)

Here, t is the time (frame number); g_n(t) is the score of the already searched section, that is, in FIG. 10, the score of the hypothesis linking the word boundary times (also called nodes) N0-N1-N2-N3-N4-N5; and h_n(t) is the estimated score (heuristic) of the unsearched section from the word boundary time N5 to the beginning of the utterance. That is, f_n(t) is an estimated score over the whole of hypothesis n, so using f_n(t) as the score of hypothesis n amounts to evaluating every hypothesis over the entire section from start to end. This also makes it possible to compare hypotheses whose searches have progressed to different degrees (which differ in temporal length). It is known that, for the A* search to yield the highest-scoring solution (the optimal solution), the value of h_n(t) must be no smaller than its true value (were that value known) (A* admissibility); furthermore, the closer h_n(t) is to the true value, the more efficient the search. When the A* search is used for the second-pass search 22, hypotheses are expanded word by word on the word lattice from the end of the sentence, in the direction opposite to the first-pass search 21, as shown in FIG. 10. In this case, g_n(t) is recomputed using a language model and an acoustic model each more accurate than those used in the first-pass search, for example the word 3-gram and a triphone HMM that considers the phoneme context both within and across words.

[0012] As the heuristic h_n(t) of the second-pass search score, the first-pass score stored in the word lattice 23 can be used. In FIG. 10 there are currently six hypotheses: N0-N1-N2-N3-N6, N0-N1-N2-N3-N4-N5, N0-N1-N2-N7, N0-N1-N8, N0-N11-N9, and N0-N11-N10; from these, the one with the largest f_n(t) (in this example N0-N1-N2-N3-N4-N5) is selected and expanded. Details of the A* search are disclosed in, for example, Nils J. Nilsson, "Problem-Solving Methods in Artificial Intelligence" (Japanese translation by 合田周平 and 増田一比古, Corona Publishing).
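
An illustrative best-first loop over equation (1): since Python's heapq is a min-heap, f is negated so the largest f pops first, and the search runs backward from the end-of-sentence node as described above. The lattice accessor and score functions are assumed stand-ins for the high-accuracy rescoring and the stored first-pass heuristic.

```python
import heapq
import itertools

def a_star_search(end_node, start_node, g_score, heuristic, predecessors):
    """Best-first (A*) expansion maximizing f_n(t) = g_n(t) + h_n(t)."""
    counter = itertools.count()  # tie-breaker so paths never compare
    start = (end_node,)
    heap = [(-(g_score(start) + heuristic(end_node)), next(counter), start)]
    while heap:
        neg_f, _, path = heapq.heappop(heap)
        head = path[-1]
        if head == start_node:             # the whole utterance is covered
            return path
        for node in predecessors(head):    # extend the hypothesis one word
            new_path = path + (node,)
            f = g_score(new_path) + heuristic(node)
            heapq.heappush(heap, (-f, next(counter), new_path))
    return None
```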

[0013]

SUMMARY OF THE INVENTION: N-best rescoring has the following problems: when generating the N-best sentence candidates from the word lattice, many similar candidates differing by only one word appear, so rescoring must be applied to a relatively large number of sentence candidates to obtain sufficient recognition accuracy; and although the acoustic scores can also be recomputed with a more accurate acoustic model, doing so is not efficient. The A* search, on the other hand, has the advantage that the result of the first-pass search can be used as the heuristic in the second-pass search, but since the first-pass and second-pass searches use different models, an h_n(t) close to the true value is not always obtained, and depending on the input speech the search can become inefficient. It suffices if h_n(t) is close to the true value and expansion of the hypothesis with the highest f_n(t) proceeds well; but when h_n(t) is far from the true value, the number of hypotheses grows enormously and real-time recognition becomes difficult.

[0014] The present invention has been made in view of the above problems of N-best rescoring and the A* search, and its object is to provide a continuous speech recognition method that, like the A* search, exploits the result of a pass search using coarse models, performs the subsequent pass search efficiently, and, like the time-synchronous beam search and N-best rescoring, is guaranteed to yield a solution.

[0015]

MEANS FOR SOLVING THE PROBLEMS: According to the present invention, in a search that uses high-accuracy models on the word network (word lattice) obtained by a search with coarse models, the hypothesis whose search is most delayed is repeatedly expanded preferentially. In this way the lengths of the hypotheses under expansion stay roughly aligned. Pruning can therefore be performed during hypothesis expansion, an efficient search becomes possible, and a solution is always obtained.

[0016] Further, in the search using the high-accuracy models, the score is computed while giving each word boundary time stored in the previously obtained word lattice a width of 5 milliseconds or more, or one frame or more.

[0017]

DESCRIPTION OF THE PREFERRED EMBODIMENTS: An embodiment of the present invention is described below. In this embodiment, as shown for example in FIG. 8, a first-pass search is performed on the vector data series of input feature parameters using a coarse acoustic model and a coarse language model to generate a word lattice 23; a second-pass search is then performed on that word lattice 23 using a high-accuracy acoustic model and a high-accuracy language model.

[0018] What characterizes this embodiment is the method of the second-pass search. Like the conventional A* search, this second-pass search expands hypotheses word by word from the end of the sentence (the end of the input speech), in the direction opposite to the first-pass search. In doing so, however, this invention expands the word-level hypotheses preferentially from the one whose search is most delayed (shortest-first search). For example, as shown in FIG. 1, suppose the search has been expanded into seven hypotheses: the hypothesis consisting of the word boundary times (nodes) N0-N1-N3-N8, the hypothesis N0-N1-N3-N9, the hypothesis N0-N1-N3-N7, ..., and the hypothesis N0-N1-N4-N5-N13. The hypothesis whose search is most delayed is found by selecting, among the head nodes N8, N9, N7, N10, N11, N12, N13 of the hypotheses, the node N7 whose time t is latest. Here, each time is measured from the start of the input speech and increases toward the end; since the search runs backward, the hypothesis with the latest head time is the one that has advanced least. The hypothesis is then expanded from the node N7 selected in this way. For example, suppose the nodes connected on the start side of node N7 in the word lattice 23 (FIG. 8) are N14, N15, and N16, and that the estimated score (heuristic) of the unsearched section from node N14 to the start of the utterance is h_n1(t), and likewise the heuristics of the unsearched sections from N15 and N16 to the start are h_n2(t) and h_n3(t), respectively.
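
A minimal sketch of this shortest-first selection (the hypothesis object with a head_time field is an assumption); because the second pass runs backward from the utterance end, the hypothesis whose head-node time is latest is the one that has advanced least.

```python
def select_most_delayed(hypotheses):
    """Pick the hypothesis whose head (start-side) node time is latest.

    Times are frame indices measured from the start of the input speech;
    in the backward second pass, the latest head time marks the
    least-advanced (most delayed) hypothesis.
    """
    return max(hypotheses, key=lambda h: h.head_time)
```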

[0019] The score g_n1(t) of the hypothesis running from node N0 through N7 to N14 is computed for each analysis frame, and the sum f_n1(t) of this g_n1(t) and h_n1(t), that is, the score of that hypothesis over the entire utterance, is obtained. Likewise, the score g_n2(t) of the hypothesis reaching node N15 and its whole-utterance score f_n2(t) are obtained, as are the score g_n3(t) of the hypothesis reaching node N16 and its whole-utterance score f_n3(t).

[0020] In this way, a hypothesis expansion that extends the most delayed hypothesis by one word is performed from its head node N7; each time a hypothesis is extended by one word, the most delayed hypothesis is selected again and extended by one word in turn. As a result, the hypotheses are expanded with their temporal lengths kept roughly aligned, as in a time-synchronous beam search. Score-based pruning therefore becomes possible, and in this embodiment pruning is performed while the hypotheses are expanded. One or both of two pruning techniques can be used. In the first, while the score g_ni(t) (here, e.g., i = 1, 2, 3) of a hypothesis being extended is computed frame by frame, at the end of each analysis frame a threshold is set equal to the highest score g_n(t) among all hypotheses at that point minus a fixed value θ; any hypothesis whose score falls at or below this threshold has its computation terminated there and is removed.

[0021] For example, as shown in FIG. 2, the envelope of the highest score g_n(t) obtained in the frame-by-frame computation is represented by curve 31, and curve 32 is the score smaller than curve 31 by θ; during the computation of a hypothesis expansion, hypotheses whose score g_n(t) falls to curve 32 or below are removed, and only the hypotheses whose scores fall between curves 31 and 32 remain. Note that FIG. 2 shows the case in which the score is computed in a direction such that it decreases as the hypothesis is extended.

[0022] The other pruning technique is: each time one hypothesis is extended and expanded by one word, take the m hypotheses with the largest whole-utterance scores f_n(t) among all hypotheses, keep those m hypotheses, and remove the hypotheses with smaller scores. The procedure of hypothesis expansion described above is shown in FIG. 3. First, from the head node group N = {n1, ..., nx} of all hypotheses, the node ni with the latest time is selected (S1). The node group {ni1, ..., niy} expanded from node ni is extracted (S2). For each extracted node nij (j = 1, ..., y) in turn, computation of the hypothesis score g_n(t)(nij) is started (S3); during the computation of each g_n(t)(nij), a threshold is obtained by subtracting θ from the highest score among the per-frame computation results, and if the computed score falls to the threshold or below (S4), the computation is stopped and expansion toward that node nij is abandoned, that is, the hypothesis expanding to that node is pruned, and the procedure moves to step S7 (S12).

[0023] If the score does not fall below the threshold during the computation and the score computation finishes (S5), then if node nij is not the start of the utterance (S6) and the computation has not finished for all of the extracted nodes nij (S7), the procedure returns to step S3 and starts computing the score for the next node nij. When the hypothesis scores have been computed for all nij (S7), the previously selected ni is deleted from the head node group N and all nij are added to the head node group N (S8). If the number of hypotheses at this point is m or fewer (S9), the procedure returns to step S1, again selects the node with the latest time from the head node group, and performs the same processing. If, on the other hand, the number of hypotheses exceeds m, the m hypotheses with the largest whole-utterance scores f_n(t) = g_n(t) + h_n(t) are taken, only those hypotheses are kept, and the others are removed (S10). Along with this removal, the head nodes of the removed hypotheses are also removed from the head node group N. After this pruning, the procedure returns to step S1.

[0024] If nij is the start of the utterance in step S6, the whole-utterance score f_n(t) = g_n(t)(nij) of the hypothesis obtained at that point is stored for that hypothesis, and the procedure moves to step S7 (S12). This nij is not added back to the head node group N (the search for nij is finished). The above processing is repeated until there is no head node left to select in step S1; when no head node remains, the stored hypothesis with the largest score, or a predetermined number of hypotheses in descending order of score, is output as the recognition result.

[0025] Since the first-pass and second-pass searches use different models, the word boundaries may shift between the first and second passes even for the same hypothesis. In this embodiment, therefore, the word boundary times of the first-pass search stored in the word lattice are not used as-is as the word boundary times of the second-pass search; instead, the second-pass search is performed while allowing a shift of several frames before and after each boundary.

[0026] That is, suppose, for example as shown in FIG. 4A, that the word boundary times stored in the word lattice are t1 between word A and word B, and t2 between word B and word C. Then, as shown in FIG. 4B, not only t1 but also t1−Δ and t1+Δ are taken as boundary times between word A and word B, and not only t2 but also t2−Δ and t2+Δ are taken as boundary times between word B and word C. The score computation in this case starts at time t2+Δ; the value Δg(t2+Δ, t2) when time t2 is reached is stored; the computation continues and the value Δg(t2+Δ, t2−Δ) when t2−Δ is reached is stored; it continues further and the value g(t2+Δ, t1+Δ) when time t1+Δ is reached is stored, then the value g(t2+Δ, t1) when t1 is reached, and then the value g(t2+Δ, t1−Δ) when t1−Δ is reached. The score at time t1+Δ is the maximum of the three scores obtained by extending the hypothesis from t2+Δ, t2, and t2−Δ to t1+Δ, namely g(t2+Δ, t1+Δ), g(t2+Δ, t1+Δ) − Δg(t2+Δ, t2), and g(t2+Δ, t1+Δ) − Δg(t2+Δ, t2−Δ). The score at time t1 is the maximum of the three scores obtained by extending from t2+Δ, t2, and t2−Δ to t1, namely g(t2+Δ, t1), g(t2+Δ, t1) − Δg(t2+Δ, t2), and g(t2+Δ, t1) − Δg(t2+Δ, t2−Δ). The score at time t1−Δ is the maximum of the three scores obtained by extending from t2+Δ, t2, and t2−Δ to t1−Δ, namely g(t2+Δ, t1−Δ), g(t2+Δ, t1−Δ) − Δg(t2+Δ, t2), and g(t2+Δ, t1−Δ) − Δg(t2+Δ, t2−Δ).
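
A compact sketch of this boundary-slack computation for one word, assuming frame scores accumulate so that a run started at t2+Δ can be corrected to a later start by subtracting the stored offsets Δg; all names here are illustrative.

```python
def slack_boundary_scores(g_run, dg_t2, dg_t2m):
    """Best score at each candidate left boundary of a word.

    g_run  : dict mapping each end time (t1+D, t1, t1-D) to the score
             g(t2+D, end) of the single pass started at t2+D
    dg_t2  : stored offset Dg(t2+D, t2), the score accumulated by t2
    dg_t2m : stored offset Dg(t2+D, t2-D)
    For each end time, takes the maximum over the three start times
    t2+D, t2 and t2-D, as described in the text above.
    """
    return {end: max(score,            # hypothesis extended from t2+D
                     score - dg_t2,    # extended from t2
                     score - dg_t2m)   # extended from t2-D
            for end, score in g_run.items()}
```

One forward pass started at t2+Δ thus covers all nine start/end combinations, rather than computing three separate passes.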

[0027] Δ is set to at least one analysis frame, or about 5 milliseconds or more; however, since a larger Δ increases the amount of computation, it is kept to about several frames, or several tens of milliseconds, or less. Although f_n(t) = g_n(t) + h_n(t) was used above as the whole-utterance score of a hypothesis, accuracy can be improved further by giving h_n(t) a weight α close to 1 and using f_n(t) = g_n(t) + αh_n(t) as the whole-utterance score. To obtain α, a score h is computed for a suitable word string using the coarse models used in the first-pass search, a score g is computed for the same word string using the high-accuracy models used in the second-pass search, and the weight is obtained as α = g/h.

[0028] Although the invention has been applied above to the second-pass search, it can also be applied when recognition is performed with a three-stage search. The point is that the invention is applicable whenever a pass search with coarse models is performed to create a word lattice and a pass search with high-accuracy models is then performed on that word lattice. Next, the results of a comparative continuous speech recognition experiment are described, in which the above N-best rescoring and the search according to this invention (hereafter called the time-asynchronous beam search) were applied to the large-vocabulary continuous speech recognition system developed by the present inventors. The large-vocabulary continuous speech recognition system is described in detail in IEICE Technical Report SP96-102, Yoshiaki Noda, Shoichi Matsunaga, and Shigeki Sagayama, "A Study of Approximate Computation Methods in Large-Vocabulary Continuous Speech Recognition Using Word Graphs" (1997). The acoustic model is a triphone HMM with 2000 total states and 8 mixture components, trained on 6700 sentences from one month of news programs. The features are 39-dimensional in total: 12 MFCC dimensions with their first- and second-order regression coefficients, and logarithmic power with its first- and second-order regression coefficients. The language models are a word 2-gram and a word 3-gram trained on 500,000 sentences from four years of news program scripts and on transcriptions of one month of news program speech. As the evaluation set, 50 sentences (1800 words in total, average utterance length 12 seconds) were selected from five days of news programs. The word error rate of the highest-scoring hypothesis (optimal solution) among the hypotheses contained in the word lattice obtained as the result of the first-pass search was 9.51%.

[0029] FIG. 5A shows the experimental results for N-best rescoring and for the time-asynchronous beam search without allowing any shift of the word boundary times. They show that the time-asynchronous beam search obtains a solution faster and with higher accuracy than N-best rescoring. Next, the effect of allowing a shift of several msec in the word boundary times was investigated for the time-asynchronous beam search. Experiments were run with the allowed shift varied from 10 to 50 msec, taking the no-shift case of FIG. 5A (0 msec) as the baseline. The results are shown in FIG. 5B. They show that allowing a shift of about 20 msec yields a more accurate solution, and that allowing a shift yields higher accuracy than not allowing one. The A* search was also evaluated in this experiment, but for the fourth sentence no solution was obtained even after about 30 minutes. According to this invention, by contrast, all solutions were obtained within practical time, confirming that the invention is superior to the A* search.

[0030]

EFFECTS OF THE INVENTION: As described above, according to the present invention, a solution is always obtained by performing hypothesis expansion and pruning such that the lengths of the hypotheses under expansion stay as aligned as possible, as in the time-synchronous beam search, which yields a solution stably. In addition, by using the word boundary times and score information stored in the word lattice as the result of a pass search with coarse models, as in the A* search, and by allowing a shift of several frames instead of using the stored word boundary times as-is in the pass search with high-accuracy models, the final solution is obtained accurately, efficiently, and stably.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a hypothesis expansion diagram for explaining the preferential expansion of the hypothesis whose search is most delayed, which is a main feature of the present invention.

FIG. 2 is a diagram illustrating pruning by a score beam.

FIG. 3 is a flowchart showing an example of a processing procedure for preferentially expanding the hypothesis whose search is most delayed, which is a main feature of the present invention, and for pruning.

FIG. 4 is an explanatory diagram of allowing a shift of the word boundary times according to the present invention.

FIG. 5 is a diagram showing experimental results demonstrating the effect of the present invention.

FIG. 6 is a diagram showing the general functional configuration of speech recognition processing.

FIG. 7 is a diagram showing a word network permitted by the grammar.

FIG. 8 is a diagram showing the functional configuration of continuous speech recognition processing based on a multi-pass search.

FIG. 9 is a diagram showing an example of the word lattice generated by the first-pass search in FIG. 8.

FIG. 10 is a diagram showing hypothesis expansion in a conventional A* search.

Continuation of the front page: (72) Inventor: Shoichi Matsunaga, 2-3-1 Otemachi, Chiyoda-ku, Tokyo, within Nippon Telegraph and Telephone Corporation. F-terms (reference): 5D015 AA01 BB01 HH23 LL03

Claims (5)

[Claims]

[Claim 1] A continuous speech recognition method comprising an acoustic model for obtaining an acoustic score indicating the acoustic closeness between a word and an input speech, and a grammar defining the connection relations between words or a language model for obtaining a language score indicating the ease of those connections, in which a continuously uttered input speech is searched using a coarse acoustic model and a coarse language model to narrow down, from the word-string hypotheses permitted by the grammar, those close to the input speech and create a word network, after which a search is performed on said word network using an acoustic model and a language model more accurate than those of said search, to narrow down, from the word-string hypotheses permitted by the word network for said input speech, those still closer to the input speech, and finally one or more word-string hypotheses closest to the input speech are taken as the recognition result, characterized in that the search using said high-accuracy acoustic model and language model is performed by selecting, at each expansion of a word-string hypothesis, the word-string hypothesis whose search is most delayed, and, at each expansion of a word-string hypothesis, terminating the expansion of any word-string hypothesis that falls outside a predetermined condition based on the obtained word-string score.
[Claim 2] The continuous speech recognition method according to claim 1, characterized in that the expansion of a word-string hypothesis is terminated when the score g_n(t) of the already searched section, computed at the expansion of the word-string hypothesis, falls to a threshold or below, whether in the final computation result or during the computation.
[Claim 3] The continuous speech recognition method according to claim 1 or 2, characterized in that, upon finishing the hypothesis expansion of word strings for one node (one word boundary on said word network), the hypothesis score f_n(t) of each word string is taken as the sum of the score g_n(t) of the already searched section and the score h_n(t) of the unsearched section obtained in the preceding search, and the expansion of all word-string hypotheses other than the m word-string hypotheses with the largest f_n(t) is terminated.
[Claim 4] The continuous speech recognition method according to claim 3, characterized in that the score h_n(t) of said unsearched section is multiplied by a weight α, giving f_n(t) = g_n(t) + αh_n(t).
[Claim 5] The continuous speech recognition method according to any one of claims 1 to 4, characterized in that the score computation at the expansion of a word-string hypothesis is performed within a range shifted by about 5 milliseconds or more from the word boundary times stored in said word network.
JP26823799A 1999-09-22 1999-09-22 Continuous speech recognition method Expired - Lifetime JP3559479B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP26823799A JP3559479B2 (en) 1999-09-22 1999-09-22 Continuous speech recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP26823799A JP3559479B2 (en) 1999-09-22 1999-09-22 Continuous speech recognition method

Publications (2)

Publication Number Publication Date
JP2001092495A true JP2001092495A (en) 2001-04-06
JP3559479B2 JP3559479B2 (en) 2004-09-02

Family

ID=17455820

Family Applications (1)

Application Number Title Priority Date Filing Date
JP26823799A Expired - Lifetime JP3559479B2 (en) 1999-09-22 1999-09-22 Continuous speech recognition method

Country Status (1)

Country Link
JP (1) JP3559479B2 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005096271A1 (en) * 2004-03-30 2005-10-13 Pioneer Corporation Speech recognition device and speech recognition method
JP2014149637A (en) * 2013-01-31 2014-08-21 Nippon Telegr & Teleph Corp <Ntt> Approximate oracle sentence selection device, method, and program
CN105723449A (en) * 2013-11-06 2016-06-29 系统翻译国际有限公司 System for analyzing speech content on basis of extraction of keywords from recorded voice data, indexing method using system and method for analyzing speech content
US20160284345A1 (en) 2013-11-06 2016-09-29 Systran International Co., Ltd. System for grasping keyword extraction based speech content on recorded voice data, indexing method using the system, and method for grasping speech content
JP2016539364A (en) * 2013-11-06 2016-12-15 シストラン・インターナショナル・カンパニー・リミテッドSystran International Co., Ltd. Utterance content grasping system based on extraction of core words from recorded speech data, indexing method and utterance content grasping method using this system
US10304441B2 (en) 2013-11-06 2019-05-28 Systran International Co., Ltd. System for grasping keyword extraction based speech content on recorded voice data, indexing method using the system, and method for grasping speech content

Also Published As

Publication number Publication date
JP3559479B2 (en) 2004-09-02

Similar Documents

Publication Publication Date Title
JP4322815B2 (en) Speech recognition system and method
JP4465564B2 (en) Voice recognition apparatus, voice recognition method, and recording medium
US5241619A (en) Word dependent N-best search method
JP3672595B2 (en) Minimum false positive rate training of combined string models
JP4802434B2 (en) Voice recognition apparatus, voice recognition method, and recording medium recording program
JP5310563B2 (en) Speech recognition system, speech recognition method, and speech recognition program
KR20040076035A (en) Method and apparatus for speech recognition using phone connection information
JP2001249684A (en) Device and method for recognizing speech, and recording medium
Hain et al. The cu-htk march 2000 hub5e transcription system
US6980954B1 (en) Search method based on single triphone tree for large vocabulary continuous speech recognizer
JP2013125144A (en) Speech recognition device and program thereof
US20070038451A1 (en) Voice recognition for large dynamic vocabularies
JP2003208195A5 (en)
JP3559479B2 (en) Continuous speech recognition method
JP4528540B2 (en) Voice recognition method and apparatus, voice recognition program, and storage medium storing voice recognition program
JP3494338B2 (en) Voice recognition method
JP2017044901A (en) Sound production sequence extension device and program thereof
JP4733436B2 (en) Word / semantic expression group database creation method, speech understanding method, word / semantic expression group database creation device, speech understanding device, program, and storage medium
JP3368989B2 (en) Voice recognition method
JP3532248B2 (en) Speech recognition device using learning speech pattern model
JPH08241096A (en) Speech recognition method
Fu et al. Combination of multiple predictors to improve confidence measure based on local posterior probabilities
JP2731133B2 (en) Continuous speech recognition device
JP4600705B2 (en) Voice recognition apparatus, voice recognition method, and recording medium
JP2002149188A (en) Device and method for processing natural language and recording medium

Legal Events

Date Code Title Description
TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20040427

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20040521

R151 Written notification of patent or utility model registration

Ref document number: 3559479

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R151

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090528

Year of fee payment: 5

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20100528

Year of fee payment: 6

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20110528

Year of fee payment: 7

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120528

Year of fee payment: 8

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130528

Year of fee payment: 9

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20140528

Year of fee payment: 10

S531 Written request for registration of change of domicile

Free format text: JAPANESE INTERMEDIATE CODE: R313531

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

EXPY Cancellation because of completion of term