JP4297349B2 - Speech recognition system - Google Patents


Info

Publication number
JP4297349B2
Authority
JP
Japan
Prior art keywords
phoneme
recognition result
unit
likelihood
utterance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP2004098760A
Other languages
Japanese (ja)
Other versions
JP2005284018A (en)
Inventor
Kengo Fujita
Masaki Naito
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KDDI Corp
Original Assignee
KDDI Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by KDDI Corp
Priority to JP2004098760A
Publication of JP2005284018A
Application granted
Publication of JP4297349B2

Links

Images

Description

The present invention relates to a speech recognition system, and more particularly to a speech recognition system that can sequentially output speech recognition result candidates while an utterance is still in progress.

An example of a conventional speech recognition system (conventional method 1) will be described with reference to FIG. 3. The speech input unit 1 receives a speech signal, i.e., the user's voice, and sends it to the speech detection unit 2. The speech detection unit 2 cuts the speech signal obtained from the speech input unit 1 into frames of length T, shifted by ΔT each step as shown in FIG. 4, starting from the beginning of the speech section, and sends each frame to the acoustic analysis unit 3. It also sends an end-of-speech detection signal to the recognition result candidate determination unit 6 at the time the end of the speech section is detected. The acoustic analysis unit 3 extracts, from each frame, acoustic parameters representing the characteristics of the speech signal and sends them to the matching unit 4. The matching unit 4 matches the recognition vocabulary standard patterns 5, obtained by concatenating models representing the characteristics of phonemes, against the acoustic parameters from the first frame up to the current time, and computes the likelihood of the phoneme sequences that make up the recognition vocabulary. The likelihood at each time can be computed with the Viterbi algorithm from the matching results up to the previous time step; this is described, for example, in Seiichi Nakagawa, "Speech Recognition by Probabilistic Models". The recognition result candidate determination unit 6 outputs the phoneme sequence hypotheses with high likelihood as recognition result candidates at the time the end detection signal is received from the speech detection unit 2.
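The frame slicing described above (frame length T, shift ΔT per step) can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the sample values for T and ΔT are assumptions.

```python
def slice_frames(signal, frame_len, shift):
    """Cut a signal into overlapping frames of length `frame_len`,
    advancing the window start by `shift` samples each step
    (frame length T and shift ΔT in the text)."""
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += shift
    return frames

# e.g. a 10-sample signal with T = 4 and ΔT = 2 yields 4 overlapping frames
frames = slice_frames(list(range(10)), frame_len=4, shift=2)
```

Each successive frame overlaps the previous one by T − ΔT samples, so the acoustic analysis unit sees a smoothly advancing view of the signal.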

Next, another example of a conventional system (conventional method 2) will be described with reference to FIG. 5. Parts identical or equivalent to those in FIG. 3 carry the same reference numerals. In this conventional example, the functions of the speech input unit 1, speech detection unit 2, acoustic analysis unit 3, and matching unit 4 are the same as in FIG. 3. It differs from the conventional example of FIG. 3 in that the matching unit 4 outputs, at each time, the phoneme sequences with high likelihood as recognition result candidates.

Documents describing the technology of the above conventional systems include, for example, Patent Documents 1 and 2 below.
JP 2003-255972 A; JP 2003-345386 A; Seiichi Nakagawa, "Speech Recognition by Probabilistic Models", Corona Publishing, July 1, 1988.

Conventional methods 1 and 2 have the following problems. In conventional method 1, no recognition result candidate can be obtained until the utterance is finished, so it takes time before information can be presented based on the recognition result.

In conventional method 2, the phoneme sequences with high likelihood at each time during matching are taken as recognition result candidates, so phonemes for which matching has only just begun may be included among the candidates, and many similar phoneme sequences containing such phonemes may be output as recognition result candidates.

For example, FIG. 6 shows phoneme sequences with high likelihood obtained by matching the acoustic parameters from the first frame up to a certain time. Even when the phoneme sequence actually uttered is "o/NN/s/e/i", sequences similar to it often appear as high-likelihood hypotheses, as in FIG. 6: "o/NN/s/e", which is a portion of the uttered sequence, or "o/NN/s/e/i/n", "o/NN/s/e/i/g", and "o/NN/s/e/i/s", in which a phoneme that follows the uttered sequence in a dictionary word has been appended.

The present invention has been made in view of the problems of the prior art described above, and its object is to provide a speech recognition system capable of obtaining accurate recognition result candidates while the utterance is still in progress.

To achieve this object, the present invention provides a speech recognition system that can sequentially output speech recognition result candidates during an utterance, comprising: a speech detection unit that cuts the speech signal into frames of length T shifted by ΔT; an acoustic analysis unit that extracts, from the frames obtained by the speech detection unit, acoustic parameters representing the characteristics of the speech signal; a matching unit that computes the likelihood of phoneme sequences from the acoustic parameters from the start of the utterance up to a point during the utterance; and recognition result candidate determination means that determines recognition result candidates during the utterance based on the matching results obtained from the matching unit. The matching unit consists of a first matching unit, which matches the acoustic parameters from the first frame up to the current time against recognition vocabulary standard patterns constructed by concatenating models representing phoneme characteristics according to the phoneme order of the recognition vocabulary and outputs recognition result phoneme sequence candidates together with a first likelihood, and a second matching unit, which matches the acoustic parameters from the first frame up to the current time against a background pattern modeling Japanese in general and outputs a second likelihood. The recognition result candidate determination means consists of a normalized score calculation unit, which computes a normalized score, i.e., the difference between the first and second likelihoods of a phoneme sequence obtained from the matching unit divided by the length of the interval over which the recognition processing has been performed, and outputs the phoneme sequences whose normalized score exceeds a predetermined threshold together with their first likelihoods m_i, and a recognition result candidate determination unit, which exponentiates the likelihoods m_i of the phoneme sequences above the threshold to obtain first posterior probabilities, obtains from these the second posterior probability that each phoneme sequence is contained in the input speech from the start of the utterance up to the current point, and determines as recognition result candidates the phoneme sequences whose second posterior probability is at least a predetermined reference value.

According to the inventions of claims 1 and 2, accurate recognition result candidates can be obtained during the utterance, and the time until information is presented can be shortened.

Further, since recognition result candidates are selected in detail using the normalized score and the posterior probabilities, the output of unnecessary recognition result candidates can be reduced.

The present invention will now be described in detail with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention. In FIG. 1, parts identical or equivalent to those in FIG. 3 carry the same reference numerals.

In the figure, the speech input unit 1 receives the speech signal from the user and sends it to the speech detection unit 2. The speech detection unit 2 cuts the speech signal obtained from the speech input unit 1 into frames of length T (time), shifted by ΔT (time) each step as shown in FIG. 4, and sends each frame to the acoustic analysis unit 3. The acoustic analysis unit 3 extracts acoustic parameters representing the characteristics of the speech signal from each frame and sends them to the first matching unit 4 and the second matching unit 7.

The first matching unit 4 matches the recognition vocabulary standard patterns 5, obtained by concatenating models representing phoneme characteristics, against the acoustic parameters from the first frame up to the current time, and sends the recognition result phoneme sequence candidates and their likelihoods m_K to the recognition result candidate determination means 10. The second matching unit 7 matches a background pattern 8, which models Japanese in general, against the acoustic parameters from the first frame up to the current time, and sends the likelihood m_B to the recognition result candidate determination means 10.

The recognition result candidate determination means 10 selects the phoneme sequences with high likelihood from among those obtained from the first matching unit 4 and, based on the likelihood of each sequence, determines as recognition result candidates only those phoneme sequences that are likely to be the phoneme sequence of the input speech from the start of the utterance up to the current point. As one concrete example, it can be composed of a normalized score calculation unit 11 and a recognition result candidate determination unit 12.

The normalized score calculation unit 11 computes a normalized score S from the likelihood m_K obtained by the first matching unit 4 and the likelihood m_B obtained by the second matching unit 7 using the formula below, and sends the phoneme sequence hypotheses whose score exceeds a predetermined threshold (hereinafter called the N-best phoneme sequences) and the likelihood m_i of each such sequence (the likelihoods m_K above the threshold) to the recognition result candidate determination unit 12.

S = (m_K − m_B) / (t_P − t_S)

Here, t_S denotes the start time of the first frame and t_P the current time; that is, t_P − t_S is the length of the interval used for matching. The normalized score S is described in detail in Japanese Patent Application No. 2003-048608 filed by the present applicant.
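As a sketch (not the patent's code), the normalized score can be computed directly from the two likelihoods and the matched interval. The log-likelihood and time values below are made-up illustrative numbers.

```python
def normalized_score(m_k, m_b, t_s, t_p):
    """Normalized score S = (m_K - m_B) / (t_P - t_S):
    the recognition-vocabulary likelihood minus the background
    likelihood, divided by the length of the matched interval."""
    interval = t_p - t_s
    if interval <= 0:
        raise ValueError("current time must be after the start of the first frame")
    return (m_k - m_b) / interval

# illustrative values: log-likelihoods -100 and -120 over a 2-second interval
s = normalized_score(m_k=-100.0, m_b=-120.0, t_s=0.0, t_p=2.0)
```

Dividing by the interval length keeps the score comparable across hypotheses evaluated at different points in the utterance, since raw log-likelihoods grow in magnitude with the number of frames matched.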

Next, the recognition result candidate determination unit 12 uses the N-best phoneme sequences and the likelihood of each to compute a posterior probability for each phoneme sequence as follows.

(1) The likelihoods m_i (1 ≤ i ≤ N) of the N-best phoneme sequences are exponentiated, and the posterior probability p_i of the i-th likelihood m_i (the first posterior probability) is obtained from the equation below.

(Equation (1), reproduced in the original only as an image; from the description, the exponentiated likelihoods normalized over the N-best sequences: p_i = exp(m_i) / Σ_{j=1}^{N} exp(m_j).)

(2) Next, let δ_{x,i} = 1 when a phoneme sequence x is contained in the i-th candidate and δ_{x,i} = 0 otherwise. The posterior probability p_x that the input speech contains the phoneme sequence x (the second posterior probability) is then computed by the equation below.

(Equation (2), reproduced in the original only as an image; from the description and the worked example below: p_x = Σ_{i=1}^{N} δ_{x,i} p_i.)
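The exponentiate-and-normalize step in (1) is, in effect, a softmax over the N-best log-likelihoods. A minimal sketch, assuming the plain form p_i = exp(m_i) / Σ_j exp(m_j) (the patent reproduces the equation only as an image), with the maximum subtracted first for numerical stability:

```python
import math

def first_posteriors(likelihoods):
    """Exponentiate the N-best log-likelihoods m_i and normalize them
    into posterior probabilities p_i (equation (1)). Subtracting the
    maximum before exponentiating avoids overflow and does not change
    the normalized result."""
    m_max = max(likelihoods)
    exps = [math.exp(m - m_max) for m in likelihoods]
    total = sum(exps)
    return [e / total for e in exps]

# four equal log-likelihoods share the probability mass equally
p = first_posteriors([-10.0, -10.0, -10.0, -10.0])
```

The stability shift matters in practice because acoustic log-likelihoods are large negative numbers, and exponentiating them directly would underflow to zero.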

The recognition result candidate determination unit 12 then determines as recognition result candidates the phoneme sequences whose posterior probability lies within a certain range θ of the maximum posterior probability among the phoneme sequences.

FIG. 2 shows an example of the phoneme sequences obtained when recognition is performed with a recognition dictionary containing words such as 音声 (speech), 音声認識 (speech recognition), 音声合成 (speech synthesis), and 音声操作 (speech operation), and the input speech おんせい (onsei) up to time t is matched against the recognition vocabulary standard patterns. Here it is assumed that five phoneme sequences with high normalized scores were obtained and that the posterior probabilities p_i (i = 1, 2, ..., 5) were computed from the likelihoods of these sequences.

The phoneme sequence properly corresponding to the speech おんせい up to time t is "o/NN/s/e/i" (p_1 = 0.22), but other sequences also appear with high normalized scores: "o/NN/s/e" (p_5 = 0.18), which forms part of "o/NN/s/e/i", as well as "o/NN/s/e/i/n" (p_2 = 0.21), "o/NN/s/e/i/g" (p_3 = 0.20), and "o/NN/s/e/i/s" (p_4 = 0.19), in which phonemes from the dictionary words 音声認識, 音声合成, and 音声操作 that follow "o/NN/s/e/i" have been appended.

In this case, by the definition of the posterior probability p_x above, the posterior probability of the phoneme sequence "o/NN/s/e" is p_{o/NN/s/e} = 1.0. Likewise, the posterior probability of "o/NN/s/e/i" is computed as p_{o/NN/s/e/i} = 0.82. When similar phoneme sequences such as "o/NN/s/e/i/n", "o/NN/s/e/i/g", and "o/NN/s/e/i/s" are present among the candidates, their posterior probabilities are expected to take low values: p_{o/NN/s/e/i/n} = 0.21, p_{o/NN/s/e/i/g} = 0.20, and p_{o/NN/s/e/i/s} = 0.19, respectively.

If, for example, θ = 0.50 is used as the candidate selection criterion, the phoneme sequences with high posterior probability, "o/NN/s/e" and "o/NN/s/e/i", are output as recognition result candidates, while "o/NN/s/e/i/n", "o/NN/s/e/i/g", and "o/NN/s/e/i/s" are excluded.
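The numbers in this worked example can be reproduced with a short sketch (an illustration, not the patent's code): the second posterior p_x sums the first posteriors p_i of every N-best candidate that contains the sequence x, and candidates are kept when their p_x lies within θ = 0.50 of the maximum. Containment is checked here as a phoneme-prefix match, which is an assumption that suffices for this example.

```python
# first posteriors p_i from FIG. 2, paired with the N-best phoneme sequences
nbest = [
    ("o/NN/s/e/i",   0.22),
    ("o/NN/s/e/i/n", 0.21),
    ("o/NN/s/e/i/g", 0.20),
    ("o/NN/s/e/i/s", 0.19),
    ("o/NN/s/e",     0.18),
]

def second_posterior(x, nbest):
    """p_x = sum of p_i over candidates i containing the sequence x
    (delta_{x,i} = 1); containment approximated as a phoneme-prefix test."""
    xs = x.split("/")
    return sum(p for seq, p in nbest if seq.split("/")[:len(xs)] == xs)

p_x = {seq: second_posterior(seq, nbest) for seq, _ in nbest}
max_p = max(p_x.values())
theta = 0.50
candidates = {seq for seq, p in p_x.items() if p >= max_p - theta}
# "o/NN/s/e" (1.0) and "o/NN/s/e/i" (0.82) survive; the three longer
# variants, each contained only in itself, fall below max_p - theta
```

Because "o/NN/s/e" is a prefix of all five candidates, it collects the full probability mass, while each extended sequence collects only its own p_i, matching the values quoted in the text.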

Thus, according to this embodiment, accurate recognition result candidates can be output while the utterance is still in progress.

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing examples of phoneme sequences with high normalized scores when the uttered phoneme sequence is "o/NN/s/e/i".
FIG. 3 is a block diagram showing the configuration of an example of a conventional speech recognition system.
FIG. 4 is an explanatory diagram of speech section detection and feature parameter extraction.
FIG. 5 is a block diagram showing the configuration of another example of a conventional speech recognition system.
FIG. 6 is an explanatory diagram of the recognition result candidates output by the speech recognition system of FIG. 5.

Explanation of reference numerals

1: speech input unit; 2: speech detection unit; 3: acoustic analysis unit; 4: first matching unit; 5: recognition vocabulary standard patterns; 7: second matching unit; 8: background pattern; 10: recognition result candidate determination means; 11: normalized score calculation unit; 12: recognition result candidate determination unit.

Claims (2)

1. A speech recognition system capable of sequentially outputting speech recognition result candidates during an utterance, comprising:
a speech detection unit that cuts a speech signal into frames of length T shifted by ΔT;
an acoustic analysis unit that extracts, from the frames obtained by the speech detection unit, acoustic parameters representing the characteristics of the speech signal;
a matching unit that computes the likelihood of phoneme sequences from the acoustic parameters from the start of the utterance up to a point during the utterance; and
recognition result candidate determination means that determines recognition result candidates during the utterance based on the phoneme sequence matching results obtained from the matching unit,
wherein the matching unit comprises a first matching unit, which matches the acoustic parameters from the first frame up to the current time against recognition vocabulary standard patterns constructed by concatenating models representing phoneme characteristics according to the phoneme order of the recognition vocabulary and outputs recognition result phoneme sequence candidates and a first likelihood, and a second matching unit, which matches the acoustic parameters from the first frame up to the current time against a background pattern modeling Japanese in general and outputs a second likelihood, and
wherein the recognition result candidate determination means comprises:
a normalized score calculation unit that computes a normalized score, a value obtained by taking the difference between the first and second likelihoods of a phoneme sequence obtained from the matching unit and dividing the difference by the interval over which the recognition processing has been performed, and outputs the phoneme sequences whose normalized score exceeds a predetermined threshold together with their first likelihoods m_i; and
a recognition result candidate determination unit that exponentiates the likelihoods m_i of the phoneme sequences above the threshold to obtain first posterior probabilities, obtains from the first posterior probabilities the second posterior probability of each phoneme sequence contained in the input speech from the start of the utterance up to the current point, and determines as recognition result candidates the phoneme sequences whose second posterior probability is at least a predetermined reference value.
2. The speech recognition system according to claim 1, wherein the recognition result candidate determination unit exponentiates the likelihoods m_i (1 ≤ i ≤ N) of the phoneme sequences above the threshold and obtains the first posterior probability p_i of the i-th likelihood m_i from equation (1) below, and computes the second posterior probability p_x of a phoneme sequence x contained in the input speech from the start of the utterance up to the current point by equation (2) below, where δ_{x,i} = 1 when the phoneme sequence x is contained in the i-th candidate and δ_{x,i} = 0 otherwise.
(Equation (1), reproduced in the original only as an image: p_i = exp(m_i) / Σ_{j=1}^{N} exp(m_j).)
(Equation (2), reproduced in the original only as an image: p_x = Σ_{i=1}^{N} δ_{x,i} p_i.)
JP2004098760A 2004-03-30 2004-03-30 Speech recognition system Expired - Fee Related JP4297349B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2004098760A JP4297349B2 (en) 2004-03-30 2004-03-30 Speech recognition system


Publications (2)

Publication Number Publication Date
JP2005284018A (en) 2005-10-13
JP4297349B2 (en) 2009-07-15

Family

ID=35182450

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2004098760A Expired - Fee Related JP4297349B2 (en) 2004-03-30 2004-03-30 Speech recognition system

Country Status (1)

Country Link
JP (1) JP4297349B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101208166B1 * 2010-12-16 2012-12-04 NHN Corp. Speech recognition client system, speech recognition server system and speech recognition method for processing speech recognition in online

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2808906B2 * 1991-02-07 1998-10-08 NEC Corporation Voice recognition device
JP3039095B2 * 1992-01-30 2000-05-08 NEC Corporation Voice recognition device
JP3440840B2 * 1998-09-18 2003-08-25 Matsushita Electric Industrial Co., Ltd. Voice recognition method and apparatus
JP2002041082A (en) * 2000-07-28 2002-02-08 Hitachi Ltd Voice-recognition dictionary
JP2002358097A (en) * 2001-06-01 2002-12-13 Mitsubishi Electric Corp Voice recognition device
JP4219603B2 (en) * 2002-03-04 2009-02-04 三菱電機株式会社 Voice recognition device



Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20050831

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20080204

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20080430

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20080630

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20081203

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20090127

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20090401


A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20090409

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120424

Year of fee payment: 3


FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20150424

Year of fee payment: 6

LAPS Cancellation because of no payment of annual fees