JP5235849B2

JP5235849B2 - Speech recognition apparatus, method and program

Info

Publication number: JP5235849B2
Application number: JP2009270640A
Authority: JP
Inventors: 哲小橋川; 太一浅見; 義和山口; 浩和政瀧; 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-11-27
Filing date: 2009-11-27
Publication date: 2013-07-10
Anticipated expiration: 2029-11-27
Also published as: JP2011112963A

Abstract

PROBLEM TO BE SOLVED: To reduce the processing time needed to calculate the reliability of a speech recognition result. SOLUTION: A prior reliability score calculation unit of the speech recognition apparatus takes the per-frame speech feature sequence as input, takes as the prior reliability of each frame the difference between the output probability of the maximum likelihood state of the monophones and that of a speech model or a pause model, and outputs a reliability score obtained by averaging the prior reliability over each audio file. A speech recognition processing unit takes the speech feature sequence and the reliability score as input, performs speech recognition processing, and outputs a speech recognition result and the reliability score. COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a speech recognition apparatus, method, and program for efficiently recognizing speech data of varying sound quality.

In recent years, as memory devices for recording audio data have become cheaper, large amounts of audio data have become easy to obtain. When such data is passed to a speech recognizer, recognition accuracy and processing time vary greatly with the quality of the audio data.

For this reason, methods that attach a reliability to the speech recognition result have been studied as a way of suppressing problems caused by recognition errors. FIG. 11 shows the functional configuration of a speech recognition apparatus 900 that attaches a reliability to its recognition results. The speech recognition apparatus 900 includes an acoustic analysis unit 120, an acoustic model storage unit 140, a dictionary/language model storage unit 150, a search unit 160, and a reliability calculation unit 190.

The acoustic analysis unit 120 analyzes the input speech signal 110 in units called frames, each a few tens of milliseconds long, for example by mel-frequency cepstral coefficient (MFCC) analysis, and generates an acoustic feature parameter sequence 130. The search unit 160 searches for recognition result candidates for the acoustic feature parameter sequence 130 using the acoustic model storage unit 140 and the dictionary/language model storage unit 150. The search outputs the N-best speech recognition results 170, ranked from first to N-th, together with their scores 180.

The reliability calculation unit 190 calculates and outputs a reliability score 200 for each of the speech recognition results 170, based on the speech recognition results 170 and the scores 180. The reliability score 200 is obtained, for example, from simple score differences among the N-best candidates and the arithmetic mean of their scores.

By referring to the reliability score 200, the apparatus discards the corresponding speech recognition result 170 or asks the speaker to confirm the result, thereby suppressing problems caused by misrecognition.

JP 2005-148342 A

However, the conventional speech recognition apparatus 900 calculates the reliability score from scores that are available only after speech recognition processing has been performed. Obtaining the reliability score therefore costs the full processing time of speech recognition. As a result, extra processing time may be spent on audio data that is unusable because, for example, a poor S/N ratio causes almost everything to be misrecognized. In addition, when a large number of audio files are to be recognized, files with low recognition accuracy take a long time to process and hold up other files with high recognition accuracy, which can lower the efficiency of the recognition processing as a whole. Furthermore, because the score is derived from recognition results obtained with a language model, the value of the reliability score depends on that language model.

The present invention has been made in view of these problems. Its object is to provide a speech recognition apparatus, method, and program that can calculate a reliability score in a short processing time without performing speech recognition processing, and that output a reliability score that does not depend on a language model.

The speech recognition apparatus of the present invention includes a feature analysis unit, a prior reliability score calculation unit, and a speech recognition processing unit. The feature analysis unit analyzes the speech features of the input digital speech signal frame by frame and outputs a speech feature sequence. The prior reliability score calculation unit takes the per-frame speech feature sequence as input, takes as the prior reliability of each frame the difference between the output probability of the maximum likelihood state of the monophones and the output probability of the maximum likelihood state of a speech model (for example, a speech GMM, where GMM stands for Gaussian Mixture Model) or a pause model (for example, a pause HMM, where HMM stands for Hidden Markov Model, whose states are the GMMs it contains), and outputs a reliability score obtained by averaging the prior reliability over each audio file. The speech recognition processing unit takes the speech feature sequence and the reliability score as input and outputs a speech recognition result.

According to the speech recognition apparatus of the present invention, the prior reliability score calculation unit takes the per-frame speech feature sequence as input, takes as the prior reliability of each frame the difference between the output probability of the maximum likelihood state of the monophones and that of the speech model or pause model, and outputs a reliability score obtained by averaging the prior reliability over each audio file. The reliability score is therefore obtained with lighter processing than in the conventional apparatus. Moreover, by deciding whether to perform speech recognition according to the value of the reliability score, the apparatus also avoids spending time on recognizing audio files whose reliability, and hence recognition accuracy, is low.

FIG. 1 schematically shows the relationship between speech features and likelihood (or output probability), to explain the basic idea of the invention.
FIG. 2 shows an example functional configuration of the speech recognition apparatus 100 of the invention.
FIG. 3 shows the operation flow of the speech recognition apparatus 100.
FIG. 4 shows an example functional configuration of the prior reliability score calculation unit 30.
FIG. 5 schematically shows how the output probabilities of the monophones, the pause model, and the speech model change over time.
FIG. 6 shows the case where FIG. 4 is extended to two or more kinds of acoustic models.
FIG. 7 shows an example functional configuration of the speech recognition apparatus 250 of the invention.
FIG. 8 shows an example of the relationship between the reliability score C and the beam search width N(C).
FIG. 9 shows an example functional configuration of the speech recognition apparatus 300 of the invention.
FIG. 10 shows the operation flow of the speech recognition apparatus 300.
FIG. 11 shows the functional configuration of the conventional speech recognition apparatus 900 disclosed in Patent Document 1.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in different drawings carry the same reference numerals, and their description is not repeated. Before the embodiments are described, the basic idea of the invention is explained.
[Basic idea of the invention]
FIG. 1 shows the relationship between speech features and likelihood. Likelihood is a value that expresses plausibility in general, and an output probability value may be used in its place. The horizontal axis is the speech feature and the vertical axis is the likelihood. The figure shows the distributions of the speech model (broken line) contained in the acoustic model and of the monophone phoneme models "*-a+*", "*-i+*", and "*-u+*". A phoneme model normally consists of several states, and each state consists of a mixture distribution made up of several basis distributions (hereinafter, "mixture distribution" includes Gaussian mixtures). For simplicity, the figure draws each phoneme model with one state and one mixture component.

Here, a monophone is a context-independent phoneme model. Unlike a context-dependent phoneme model (for example, a triphone), which is constrained by the preceding and following phonemes, a monophone has no such constraint, so the number of phoneme models is small. For example, with 30 phonemes a monophone acoustic model contains 30 phoneme models, whereas a triphone model contains 30^3 (27,000).
The GMM used for the speech model, for example, is a Gaussian mixture model trained on speech, that is, on the training data of all phonemes, so its likelihood varies relatively gently across speech feature values. A monophone, by contrast, is trained on the training data of a single phoneme, so its likelihood is sharply peaked around the speech features of that phoneme.
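The model counts quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function name `model_counts` is hypothetical, and the triphone count is the theoretical upper bound that assumes every left and right context is modeled.

```python
def model_counts(num_phonemes):
    """Return (monophone_count, triphone_count) for a phoneme inventory.

    Assumption: every (left, centre, right) combination gets its own
    triphone model, which is the upper bound used in the text.
    """
    monophones = num_phonemes        # one context-independent model per phoneme
    triphones = num_phonemes ** 3    # left context x centre x right context
    return monophones, triphones


counts_for_30 = model_counts(30)
```

The gap between the two counts is one reason a monophone-based score is cheap to evaluate on every frame.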

Therefore, the reliability of an audio file can be judged by comparing the likelihood of the speech model for a given speech feature with the likelihood of the monophone for the same feature. Specifically, for the speech feature O_t^clean(a) of a phoneme a recorded without the influence of noise, the likelihood of the monophone "*-a+*" is large, while the likelihood of the speech model for the same feature O_t^clean(a) is relatively small. As a result, there is a gap between the two values.

In contrast, the speech feature O_t^noisy(a) of a phoneme a recorded under strong noise differs from the original feature, so the gap between the monophone likelihood and the speech model likelihood becomes small.
Thus, the quality of recorded speech can be evaluated from the difference between the monophone likelihood and the speech model likelihood for a speech feature. The basic idea of the invention builds on this point: the difference between the output probability of the maximum likelihood state of the monophones and the output probability of the speech model is computed as a prior reliability, from which a reliability score for each audio file is obtained.
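The effect described above can be illustrated with a toy one-dimensional model. Everything in the sketch is an assumption made for illustration: each monophone is a single narrow Gaussian, the speech model is a single broad Gaussian, the numeric parameters are invented, and a "noisy" feature is simulated simply as a value displaced from the phoneme centre. Log likelihoods are used so that the gap is a simple difference.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a one-dimensional normal distribution."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


# Hypothetical parameters: (mean, standard deviation) per model.
MONOPHONES = {"a": (0.0, 0.5), "i": (3.0, 0.5), "u": (-3.0, 0.5)}  # sharply peaked
SPEECH_MODEL = (0.0, 3.0)                                          # broad, covers all phonemes


def likelihood_gap(x):
    """Best monophone log likelihood minus speech model log likelihood."""
    best_monophone = max(gaussian_log_pdf(x, m, s) for m, s in MONOPHONES.values())
    speech = gaussian_log_pdf(x, *SPEECH_MODEL)
    return best_monophone - speech


clean_gap = likelihood_gap(0.0)   # feature at the centre of phoneme "a"
noisy_gap = likelihood_gap(1.5)   # feature pushed away from every phoneme centre
```

In this toy setting the clean feature gives a clearly larger gap than the displaced one, which is the behaviour the prior reliability relies on.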

FIG. 2 shows an example functional configuration of the speech recognition apparatus 100 of the invention, and FIG. 3 shows its operation flow. The speech recognition apparatus 100 includes an A/D conversion unit 10, a feature analysis unit 20, a prior reliability score calculation unit 30, a speech recognition processing unit 40, an acoustic model parameter memory 50, and a language model parameter memory 60. The apparatus is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The A/D conversion unit 10 discretizes the speech signal, for example at a sampling frequency of 16 kHz, and converts it into a digital speech signal. When a digital speech signal is input directly, the A/D conversion unit 10 is unnecessary.

The feature analysis unit 20 takes the digital speech signal as input and outputs a speech feature sequence for each frame, where one frame consists of, for example, 320 samples (20 ms) (step S20). The speech features are, for example, MFCCs (Mel-Frequency Cepstrum Coefficients) of orders 1 to 12, dynamic parameters such as ΔMFCC that represent their change, and power and Δpower. Processing such as cepstral mean normalization (CMN) may also be applied.
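The frame length quoted above follows from the sampling rate: 320 samples at 16 kHz last 20 ms. The sketch below shows one way to cut a sample sequence into such frames. It assumes non-overlapping frames and drops an incomplete final frame; the description does not state the frame shift, so both choices are assumptions, and the function name is hypothetical.

```python
SAMPLE_RATE_HZ = 16_000
FRAME_LENGTH = 320  # samples per frame, as in the example above


def split_into_frames(samples, frame_length=FRAME_LENGTH):
    """Split a sample sequence into consecutive frames of equal length.

    Assumption: frames do not overlap and a trailing partial frame is dropped.
    """
    n_frames = len(samples) // frame_length
    return [samples[i * frame_length:(i + 1) * frame_length] for i in range(n_frames)]


frame_ms = 1000 * FRAME_LENGTH / SAMPLE_RATE_HZ   # duration of one frame in milliseconds
frames = split_into_frames(list(range(1000)))     # 1000 dummy samples
```

A real front end would then compute MFCC-type features from each frame; that step is outside this sketch.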

The prior reliability score calculation unit 30 takes the per-frame speech feature sequence as input, takes as the prior reliability of each frame the difference between the output probability of the maximum likelihood state of the monophones and the output probability of the maximum likelihood state within the speech model or pause model (that is, among the GMMs they contain), and outputs a reliability score obtained by averaging the prior reliability over each audio file (step S30).

The speech recognition processing unit 40 refers to the acoustic model stored in the acoustic model parameter memory 50 and the language model stored in the language model parameter memory 60, performs speech recognition on the speech feature sequence, and outputs the recognition result and the reliability score (step S40). As the broken line indicates, the speech recognition processing unit 40 may switch speech recognition on or off according to the reliability score of the audio file. The speech recognition process of step S40 is repeated until all frames of the audio file have been processed.

In the speech recognition apparatus 100, the prior reliability score calculation unit 30 assigns a prior reliability to each frame and calculates a reliability score by averaging over the audio file (that is, the average prior reliability per frame). A reliability score based on the speech feature sequence requires less computation than the conventional approach of deriving the score from recognition results. In addition, when several audio files are processed, deciding from the prior reliability whether to run speech recognition avoids spending time on files whose prior reliability, and hence recognition accuracy, is low. Next, a more specific configuration example of the prior reliability score calculation unit 30, the main part of the first embodiment, is described in more detail.

[Prior reliability score calculation unit]
FIG. 4 shows an example functional configuration of the prior reliability score calculation unit 30. The unit includes monophone maximum likelihood detection means 32, pause/speech model maximum likelihood detection means 33, prior reliability calculation means 34, and reliability score calculation means 35.
For the speech feature sequence input at each frame t, the monophone maximum likelihood detection means 32 outputs the output probability P(t, s1) of the maximum likelihood state s1 among the monophones to the prior reliability calculation means 34. The pause/speech model maximum likelihood detection means 33 outputs the output probability P(t, g1) of the maximum likelihood state g1 of the speech model or pause model for the same speech feature sequence to the prior reliability calculation means 34.

FIG. 5 schematically shows how the output probabilities of the monophones, the pause model, and the speech model change over time. The horizontal direction represents time in frames t. The vertical direction represents the states of the monophones (including the pause model) and of the speech model at each frame t. For example, each monophone (including the pause model) consists of three states; the monophone "*-a+*" consists of a_1, a_2, and a_3. A black circle marks the maximum likelihood state s1 among the monophones. A hatched circle marks the maximum likelihood state g1 among the pause model and the speech model. When s1 and g1 coincide (s1 = g1), the state is marked with a black circle.
At time t_1, none of the monophones other than the pause holds the maximum likelihood state, and the first state of the pause model is the maximum likelihood state. Likewise, at time t_2 the second state of the pause model is the maximum likelihood state, and at time t_3 the third state of the pause model is. Times t_1 to t_3 are therefore non-speech. In these frames the maximum likelihood state among the monophones coincides with the maximum likelihood state among the pause model and the speech model (s1 = g1), so the prior reliability is 0.
At time t_4, the third state of "*-a+*" is the maximum likelihood state s1 among the monophones other than the pause, and among the pause model and the speech model the speech model is the maximum likelihood state g1, so the frame is speech. In this embodiment, the difference between the output probability of the maximum likelihood state s1 of the monophone "*-a+*" at time t_4 and the output probability of the maximum likelihood state g1 of the speech model is taken as the prior reliability.
At time t_19, the second state of "*-i+*" is the maximum likelihood state s1 among the monophones other than the pause, and the third state of the pause model is the maximum likelihood state g1 among the pause model and the speech model. In this case, the difference between the output probability of the maximum likelihood state s1 of the monophone "*-i+*" and the output probability of the maximum likelihood state g1 of the pause model is taken as the prior reliability. FIG. 5 shows only part of the time axis. An audio file is, for example, several minutes long (for example, 30,000 frames).

In this way, the prior reliability calculation means 34 outputs the difference between the output probability P(t, s1) of the maximum likelihood state of the monophones and the output probability P(t, g1) of the maximum likelihood state of the speech model or pause model to the reliability score calculation means 35 as the prior reliability C(t) (Equation (1)).
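The per-frame calculation can be sketched as follows. Equation (1) is not reproduced in this text and may contain a further case (a constant C_const is mentioned elsewhere in the description), so the sketch covers only the difference P(t, s1) - P(t, g1) described above. The dictionary layout, state names, and score values are hypothetical, and the scores are assumed to be already computed output probabilities (for example, in the log domain).

```python
def prior_reliability(monophone_scores, speech_pause_scores):
    """Return C(t) = P(t, s1) - P(t, g1) for one frame.

    monophone_scores:    {state name: score} over all monophone states,
                         including the pause model states.
    speech_pause_scores: {state name: score} over the speech model and
                         pause model states only.
    """
    s1 = max(monophone_scores, key=monophone_scores.get)        # best monophone state
    g1 = max(speech_pause_scores, key=speech_pause_scores.get)  # best speech/pause state
    return monophone_scores[s1] - speech_pause_scores[g1]


# Non-speech frame: the pause state wins on both sides, so s1 = g1 and C(t) = 0.
pause_frame = prior_reliability(
    {"a_3": -9.0, "i_2": -8.5, "pause_1": -2.0},
    {"speech": -6.0, "pause_1": -2.0},
)

# Speech frame: a monophone state clearly beats the broad speech model.
speech_frame = prior_reliability(
    {"a_3": -3.0, "i_2": -7.0, "pause_1": -9.0},
    {"speech": -5.0, "pause_1": -9.0},
)
```

Because the pause states appear on both sides, a frame dominated by the pause model yields zero without any special handling.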

Here, s1 is the state (mixture distribution) with the highest likelihood at time t among the states belonging to the monophones, and g1 is the mixture distribution with the highest likelihood at time t within the speech model or pause model. P(t, s) is the output probability at time t of state s (that is, of the mixture distribution belonging to it), given by Equation (2).

Here, M_s is the number of mixture components of state s, and c_{s,m} is the mixture weight of component m of state s. The weights c_{s,m} are determined by acoustic model training and take values in the range 0 ≤ c_{s,m} ≤ 1. For example, with 16 mixture components the weights average 1/16. N(·) denotes the output probability of the feature O_t at time t under the (basis) normal distribution with mean μ_{s,m} and variance Σ_{s,m}.
The reliability score calculation means 35 accumulates the prior reliability C(t) over the duration T of the audio file (its total number of frames), averages it, and outputs the resulting reliability score C (Equation (3)).
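Equation (2) itself is not reproduced in this text. From the definitions of M_s, c_{s,m}, and N(·), it is presumably the weighted sum of the component densities, P(t, s) = sum over m of c_{s,m} N(O_t; μ_{s,m}, Σ_{s,m}). The sketch below implements that reading under two assumptions that the description does not confirm: the covariance is diagonal, and the probability is kept in the linear (not log) domain. The function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance normal distribution at vector x."""
    density = 1.0
    for xi, mi, vi in zip(x, mean, var):
        density *= math.exp(-(xi - mi) ** 2 / (2 * vi)) / math.sqrt(2 * math.pi * vi)
    return density


def state_output_probability(x, weights, means, variances):
    """Mixture output probability of one state: sum_m c_m * N(x; mu_m, var_m)."""
    return sum(
        c * gaussian_pdf(x, mu, var)
        for c, mu, var in zip(weights, means, variances)
    )


# One-component state evaluated at its own mean (1-dimensional feature).
single = state_output_probability([0.0], [1.0], [[0.0]], [[1.0]])

# Two identical components with weights 0.5 each give the same value.
split = state_output_probability([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
```

In practice such densities are usually evaluated in the log domain for numerical stability; the linear form is used here only to mirror the definitions.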

In this way, the prior reliability score calculation unit 30 calculates the reliability score C, which represents the reliability of an audio file, by averaging the per-frame prior reliability over the total number of frames in the file. Because the score is computed per audio file, no elaborate processing is required.
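The file-level averaging can be written in a few lines. The sketch assumes the per-frame prior reliabilities C(t) have already been computed for all T frames of a file; the function name and the handling of an empty file are assumptions.

```python
def reliability_score(frame_reliabilities):
    """Average the per-frame prior reliability C(t) over all T frames of a file."""
    if not frame_reliabilities:
        raise ValueError("audio file has no frames")
    return sum(frame_reliabilities) / len(frame_reliabilities)


# Two non-speech frames (C(t) = 0) followed by two speech frames.
example_score = reliability_score([0.0, 0.0, 2.0, 4.0])
```

Note that non-speech frames contribute zeros to the sum but still count toward T, as in the description.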

The speech recognition processing unit 40 takes as input the speech feature sequence output by the feature analysis unit 20 and the reliability score C, performs speech recognition, and outputs the recognition result. The reliability score C may be output at the same time. The recognition here uses all acoustic models stored in the acoustic model parameter memory 50. The speech recognition processing unit 40 may switch speech recognition on or off according to the value of the reliability score C.

The reliability score C may also be the per-file average of a prior reliability calculated from the monophones (including the pause model) and speech models contained in two or more kinds of acoustic models. FIG. 6 shows an example of the time course of the output probabilities when the two kinds are a male acoustic model and a female acoustic model. For the speech feature sequence at each time t, the prior reliability score calculation unit 30' obtains the output probabilities P_male(t, s1) and P_female(t, s1) of the maximum likelihood states of the male and female monophones and takes the larger as P(t, s1). It takes the larger of the output probabilities P_male(t, g1) and P_female(t, g1) of the maximum likelihood states of the male and female speech or pause models as P(t, g1), and obtains the difference P(t, s1) - P(t, g1) as the prior reliability C(t).

That is, the pause/speech model maximum likelihood detection means 33' takes the larger of the output probabilities P_male(t, g1) and P_female(t, g1) of the maximum likelihood states of the male and female speech or pause models as P(t, g1). The monophone maximum likelihood detection means 32' takes the larger of the output probabilities P_male(t, s1) and P_female(t, s1) of the maximum likelihood states of the male and female monophones as P(t, s1). The reliability score calculation means 35 then outputs, as the reliability score C, the prior reliability C(t) averaged over the total number of frames in the audio file.
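The two-model variant can be sketched as taking the maximum over all acoustic models on each side before forming the difference. The nested dictionary layout, model names, and scores are hypothetical; the scores are assumed to be precomputed per-state output probabilities for one frame.

```python
def prior_reliability_multi(monophone_scores_by_model, speech_pause_scores_by_model):
    """C(t) when several acoustic models (e.g. male and female) are available.

    Each argument maps an acoustic-model name to {state name: score}.
    P(t, s1) and P(t, g1) are each the best score over all models.
    """
    p_s1 = max(max(states.values()) for states in monophone_scores_by_model.values())
    p_g1 = max(max(states.values()) for states in speech_pause_scores_by_model.values())
    return p_s1 - p_g1


c_t = prior_reliability_multi(
    {"male": {"a_3": -3.0, "i_2": -7.0}, "female": {"a_3": -2.5, "i_2": -6.0}},
    {"male": {"speech": -5.0, "pause_3": -8.0}, "female": {"speech": -5.5, "pause_3": -9.0}},
)
```

In this example the best monophone state comes from the female model and the best speech model score from the male model; the two maxima are taken independently, as the description states.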

The prior reliability score calculation unit 30' may also use three or more kinds of acoustic models. Using several kinds of acoustic models in this way can be expected to improve the accuracy of the reliability score C even when the subsequent speech recognition processing uses several acoustic models.

The reliability score C may also be calculated by comparing the output probabilities of the maximum likelihood states of two or more kinds of speech models or pause models for the speech feature sequence, and restricting the monophone calculation to the kind with the larger output probability. That is, instead of obtaining P_male(t, s1) and P_female(t, s1) for every frame as in the example above, for frames in which the output probability of the male speech or pause model exceeds that of the female one, only the male monophones are evaluated, and vice versa.
Specifically, the pause/speech model maximum likelihood detection means 33'' takes the larger of the output probabilities P_male(t, g1) and P_female(t, g1) of the maximum likelihood states of the male and female speech or pause models as P(t, g1). The monophone maximum likelihood detection means 32'' receives that decision and obtains the output probability P(t, s1) of the maximum likelihood state for only one of the two sets of monophones. Because the monophone output probabilities are not calculated for every kind of model, this variant can be expected to reduce the amount of computation.
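The restricted variant can be sketched by first choosing the acoustic model from the speech/pause scores and only then scoring that model's monophones. The callback `monophone_scorer`, the model names, and the scores are hypothetical; the callback stands in for whatever routine evaluates the monophone states of one acoustic model for the current frame.

```python
def prior_reliability_restricted(speech_pause_scores_by_model, monophone_scorer):
    """C(t) where monophones are scored only for the best speech/pause model.

    speech_pause_scores_by_model: {model name: {state name: score}}
    monophone_scorer: callable(model name) -> {state name: score};
                      called once, for the selected model only.
    """
    best_model, p_g1 = max(
        ((name, max(states.values())) for name, states in speech_pause_scores_by_model.items()),
        key=lambda item: item[1],
    )
    monophone_scores = monophone_scorer(best_model)
    return best_model, max(monophone_scores.values()) - p_g1


calls = []  # records which models had their monophones evaluated


def scorer(model_name):
    calls.append(model_name)
    table = {"male": {"a_3": -3.0}, "female": {"a_3": -2.5}}
    return table[model_name]


model, c_t = prior_reliability_restricted(
    {"male": {"speech": -5.0}, "female": {"speech": -5.5}}, scorer
)
```

Here only the male monophones are evaluated, even though the female monophone would have scored higher; that trade of accuracy for computation is the point of the variant.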

FIG. 7 shows an example functional configuration of a speech recognition apparatus 250 of the present invention. The speech recognition apparatus 250 differs from the speech recognition apparatus 100 in that it includes a recognition processing control unit 251. The recognition processing control unit 251 outputs, to the speech recognition processing unit 40, a control signal that stops speech recognition processing when the reliability score C is at or below a threshold C_th. Because the reliability score C is calculated per audio file, the speech recognition processing unit 40 switches speech recognition processing on or off file by file. The threshold C_th can be derived, for example, from the distribution of reliability scores over the training data of the acoustic model: with mean μ and standard deviation σ of that distribution, one may set C_th = μ − 2σ. Likewise, the constant high reliability score value C_const shown in Expression (1) may be set to C_const = μ + 2σ or similar.
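The threshold derivation can be illustrated with a short sketch. The function names `score_thresholds` and `should_recognize` are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is meant.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def score_thresholds(train_scores: Sequence[float], k: float = 2.0) -> Tuple[float, float]:
    """Return (C_th, C_const) = (mu - k*sigma, mu + k*sigma) computed from
    reliability scores of the acoustic model's training data."""
    mu = mean(train_scores)
    sigma = pstdev(train_scores)  # population standard deviation (assumption)
    return mu - k * sigma, mu + k * sigma


def should_recognize(c: float, c_th: float) -> bool:
    """Recognition is stopped when C is at or below C_th, so a file is
    processed only when its score is strictly above the threshold."""
    return c > c_th
```

A caller would compute the thresholds once from the training data and then apply `should_recognize` to each file's score.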

The recognition processing control unit 251 may instead output a beam search width N(C) as the control signal. An example is shown in Expression (4).

FIG. 8 illustrates the relationship between the reliability score C and the beam search width N(C). The horizontal axis is the reliability score C, and the vertical axis is the beam search width N(C).

As shown in FIG. 8, Expression (4) distributes the beam search width N(C) over the range N_min to N_max in proportion to the value of the reliability score C over a predetermined range C_min to C_max. Because the proportionality coefficient is negative, a small reliability score C gives a large beam search width N(C), and a large C gives a small N(C). The relationship between C and N(C) may of course be a nonlinear function. When the beam search width N(C) is used as the control signal, it is not limited to a count-based beam width; it may be, for example, a score beam width, a word-end score beam width, or a word-end count beam width.

For example, with C_max = μ + σ and C_min = μ − σ, N_max may be set to 1.5 times the normally used beam width and N_min to half of it. When the average sound quality is extremely poor (for example, C < C_min), widening the beam search cannot be expected to improve accuracy and only increases processing time, so the beam search width may be made small, for example N_min. The control signal may also include a not-to-be-recognized instruction so that speech recognition processing is not performed. A signal that stops speech recognition processing and a beam search width control signal may also be used together.
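Expression (4) itself is not reproduced in this text. One linear form consistent with the description of FIG. 8 (N_max at C_min, falling to N_min at C_max) is N(C) = N_max − (N_max − N_min)(C − C_min)/(C_max − C_min). The sketch below assumes that form; the function name `beam_width`, the rounding to an integer, and the clamping to N_min above C_max are assumptions, while the narrowing to N_min below C_min follows the option mentioned in the text.

```python
def beam_width(c: float, c_min: float, c_max: float, n_min: int, n_max: int) -> int:
    """Map a reliability score C to a beam search width N(C).

    Assumed linear form with a negative slope between (c_min, n_max) and
    (c_max, n_min). Outside that range the width is set to n_min.
    """
    if c_max <= c_min:
        raise ValueError("c_max must be greater than c_min")
    if c < c_min:
        # Extremely poor audio: a wider beam is not expected to help.
        return n_min
    if c >= c_max:
        # Clean audio: the narrowest beam is assumed to suffice.
        return n_min
    ratio = (c - c_min) / (c_max - c_min)
    return round(n_max - (n_max - n_min) * ratio)
```

For instance, with a normal beam width of 1000, the example settings in the text would give `n_max=1500` and `n_min=500`.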

Thus, the speech recognition apparatus 250 including the recognition processing control unit 251 can make speech recognition of multiple audio files more efficient and improve recognition accuracy. The function of the recognition processing control unit 251 may instead be provided in the speech recognition processing unit 40.

FIG. 9 shows an example functional configuration of a speech recognition apparatus 300 of the present invention, and FIG. 10 shows its operation flow. The speech recognition apparatus 300 differs from the speech recognition apparatuses 100 and 250 in that it includes an audio file processing unit 301 and a sorted speech recognition processing unit 302.

The audio file processing unit 301 sorts the multiple audio files in descending order of their reliability scores C (step S301). The sorted speech recognition processing unit 302 performs speech recognition processing in that descending order of reliability score C (step S302).
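Steps S301 and S302 can be sketched as a sort followed by an ordered loop. The function name `recognize_in_score_order`, the `recognize` callback, and the optional `max_files` limit are hypothetical; the limit stands in for a resource or time constraint and is not part of the described apparatus.

```python
from typing import Callable, Dict, List, Optional, Tuple


def recognize_in_score_order(
    scores: Dict[str, float],
    recognize: Callable[[str], str],
    max_files: Optional[int] = None,
) -> List[Tuple[str, str]]:
    """Run recognition on audio files in descending order of reliability score C.

    scores maps a file name to its reliability score.
    recognize is a placeholder for the speech recognition processing.
    max_files optionally limits how many files are processed (assumption).
    """
    # Step S301: sort the files by reliability score, highest first.
    ordered = sorted(scores, key=scores.get, reverse=True)
    if max_files is not None:
        ordered = ordered[:max_files]
    # Step S302: recognize in that order.
    return [(name, recognize(name)) for name in ordered]
```

When the limit is smaller than the number of files, the lowest-scoring files are the ones left unprocessed.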

Executing speech recognition processing in order of reliability score C in this way improves processing efficiency when multiple audio files are to be recognized. For example, when computing resources or processing time make it difficult to run speech recognition on all audio files, files with a low reliability score C are left unprocessed and only files with a high reliability score C, for which high recognition accuracy is expected, are recognized, so that highly accurate recognition results can be collected. The function of the audio file processing unit 301 may be included in the sorted speech recognition processing unit 302.

As described above, the speech recognition apparatus of the present invention obtains a prior reliability from the speech feature amount sequence and calculates a reliability score by averaging that prior reliability per audio file. The reliability score is therefore obtained with lighter processing than in the conventional speech recognition apparatus. Because the processing is based on speech feature amounts, the reliability score does not depend on the language model. In addition, deciding whether to perform speech recognition processing according to the value of the reliability score avoids spending time on audio files whose recognition accuracy will be low, for example because of a poor S/N ratio.
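The per-file averaging summarized above can be sketched in a few lines. The function name `reliability_score` is hypothetical, and the sign convention (monophone score minus speech/pause model score) is an assumption; the inputs are the per-frame maximum-likelihood-state scores P(t, s1) and P(t, g1), already computed elsewhere.

```python
from typing import Sequence


def reliability_score(monophone_best: Sequence[float], gate_best: Sequence[float]) -> float:
    """Average the per-frame prior reliability over one audio file.

    monophone_best[t] is P(t, s1), the best monophone state score at frame t.
    gate_best[t] is P(t, g1), the best speech/pause model state score at frame t.
    The prior reliability of a frame is taken as their difference
    (sign convention assumed here).
    """
    if not monophone_best or len(monophone_best) != len(gate_best):
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [s - g for s, g in zip(monophone_best, gate_best)]
    return sum(diffs) / len(diffs)
```

No decoding or language model is involved, which is why this score is cheap relative to a confidence computed from recognition results.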

The processes described for the above method and apparatus need not be executed only in time series in the order described; they may be executed in parallel or individually according to the processing capability of the apparatus executing them or as needed.

When the processing means of the above apparatus are realized by a computer, the processing content of the functions each apparatus should have is described by a program. Executing this program on the computer realizes the processing means of each apparatus on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or magnetic tape may be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be realized in hardware.

Claims

A feature amount analysis unit that analyzes a speech feature amount of an input speech digital signal in units of frames and outputs a speech feature amount sequence;
A prior reliability score calculation unit that receives the speech feature amount sequence for each frame as input, takes the difference between the maximum-likelihood-state output probability of the monophone and the maximum-likelihood-state output probability of the speech model or pause model as the prior reliability of the frame, and outputs a reliability score obtained by averaging the prior reliability per audio file;
A speech recognition processing unit that performs speech recognition processing using the speech feature amount sequence and the reliability score as inputs;
A speech recognition apparatus comprising:

The speech recognition apparatus according to claim 1,
The speech recognition apparatus, wherein the reliability score is a value obtained by averaging prior reliability based on two or more kinds of acoustic models in units of audio files.

The speech recognition apparatus according to claim 1,
The speech recognition apparatus, wherein the prior reliability is obtained by comparing, across two or more types of acoustic models, the maximum-likelihood-state output probabilities of the speech model or pause model for the speech feature amount sequence, and is the difference between the monophone output probability calculated only for the acoustic model type with the maximum output probability and the maximum-likelihood-state output probability of the speech model or pause model of that acoustic model type.

The speech recognition apparatus according to any one of claims 1 to 3,
A recognition processing control unit that receives the reliability score as input, generates a control signal for selecting the audio files to be subjected to speech recognition processing, and outputs the control signal to the speech recognition processing unit;
The speech recognition apparatus further comprising:

The speech recognition apparatus according to any one of claims 1 to 3,
An audio file processing unit that sorts a plurality of audio files in descending order of the reliability scores of the plurality of audio files;
A sorted speech recognition processing unit that performs speech recognition processing in descending order of reliability score;
A speech recognition apparatus, further comprising:

A feature amount analysis step in which a feature amount analysis unit analyzes a speech feature amount of an input speech digital signal in units of frames and outputs a speech feature amount sequence;
A prior reliability score calculation step in which a prior reliability score calculation unit receives the speech feature amount sequence for each frame as input, takes the difference between the maximum-likelihood-state output probability of the monophone and the maximum-likelihood-state output probability of the speech model or pause model as the prior reliability of the frame, and outputs a reliability score obtained by averaging the prior reliability per audio file;
A speech recognition processing step in which a speech recognition processing unit performs speech recognition processing using the speech feature amount sequence and the reliability score as inputs;
A speech recognition method including:

A program for causing a computer to function as the speech recognition apparatus according to any one of claims 1 to 4.