
JP2007127738A - Voice recognition device and program therefor

Info

Publication number: JP2007127738A
Application number: JP2005319094A
Authority: JP
Inventors: Hiroaki Tagawa; 博章田川; Takahiro Adachi; 隆弘足立; Reiko Yamada; 玲子山田; Tatsuya Hirahara; 達也平原
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-11-02
Filing date: 2005-11-02
Publication date: 2007-05-24
Anticipated expiration: 2025-11-02
Also published as: JP4798606B2

Abstract

PROBLEM TO BE SOLVED: A conventional voice recognition device sometimes outputs a misrecognition result, and in such a case users' trust in the voice recognition device cannot be built.
SOLUTION: Output of a misrecognition result can be prevented by a voice recognition device comprising: a voice input part which receives a voice input; a voice recognition part which performs voice recognition processing on the received voice using stored acoustic data and acquires result information of the voice recognition processing; a pronunciation evaluating part which acquires a pronunciation evaluation result by evaluating the pronunciation of the received voice using the acoustic data; a judging part which judges whether or not the pronunciation evaluation result has a predetermined relation to a predetermined threshold value; and an output part which outputs the voice recognition result for the result information acquired by the voice recognition part only when the judging part judges that the pronunciation evaluation result has the predetermined relation.
COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a speech recognition apparatus and the like that performs speech recognition processing.

Among conventional speech recognition apparatuses, there are apparatuses that use various speech recognition algorithms to improve recognition accuracy (see, for example, Patent Documents 1 to 4).
Patent Documents: JP 2003-44079 A (page 1, FIG. 1, etc.); JP 2003-22091 A (page 1, FIG. 1, etc.); JP H10-254350 A (page 1, FIG. 1, etc.); JP H07-199989 A (page 1, FIG. 1, etc.)

However, even when a conventional speech recognition apparatus performs speech recognition processing with a highly accurate algorithm, it may still output an erroneous recognition result, and in such a case the user's trust in the speech recognition apparatus cannot be built.

A speech recognition apparatus that rates the pronunciation of the input speech and does not output the recognition result when the pronunciation rating is low can greatly reduce the possibility of outputting an erroneous recognition result. As a result, the user's trust in the speech recognition apparatus can be secured.

Specifically, the speech recognition apparatus is as follows.

The speech recognition apparatus according to the first aspect of the invention comprises: an acoustic data storage unit that stores acoustic data, which is data relating to the speech to be compared against; a voice input unit that receives voice input; a speech recognition unit that performs speech recognition processing on the voice received by the voice input unit using the acoustic data, and acquires speech recognition processing result information that is the result of that processing; a pronunciation rating unit that performs pronunciation rating processing on the voice received by the voice input unit using the acoustic data, and acquires a pronunciation rating result; a determination unit that determines whether the pronunciation rating result has a predetermined relationship with a predetermined threshold; and an output unit that outputs a speech recognition result for the speech recognition processing result information acquired by the speech recognition unit only when the determination unit determines that the predetermined relationship holds.

With this configuration, the possibility of outputting an erroneous recognition result can be greatly reduced.

The speech recognition apparatus according to the second aspect of the invention is the apparatus of the first aspect in which the pronunciation rating unit performs the pronunciation rating processing on the voice received by the voice input unit using both the speech recognition processing result information acquired by the speech recognition unit and the acoustic data, and acquires the pronunciation rating result.

The speech recognition apparatus according to the third aspect of the invention is the apparatus of the first or second aspect in which the pronunciation rating unit comprises: frame classification means for dividing the voice received by the voice reception unit into frames; frame audio data acquisition means for obtaining one or more pieces of frame audio data, which is the audio data of each divided frame; optimum state determination means for determining the optimum state of the one or more pieces of frame audio data; pronunciation interval probability value acquisition means for acquiring, for each pronunciation interval, the probability value of the optimum state determined by the optimum state determination means; and rating value calculation means for calculating a rating value of the voice using, as parameters, the one or more probability values for each of the one or more pronunciation intervals acquired by the pronunciation interval probability value acquisition means.

With this configuration, pronunciation rating can be performed with higher accuracy, and the possibility of outputting an erroneous recognition result can be reduced still further.

The speech recognition apparatus according to the fourth aspect of the invention is the apparatus of the first or second aspect in which the pronunciation rating unit comprises: frame classification means for dividing the voice received by the voice reception unit into frames; optimum state determination means for determining the optimum state of the one or more pieces of frame audio data; pronunciation interval frame phoneme probability value acquisition means for acquiring, for each pronunciation interval, one or more probability values over all states of the phoneme that contains the optimum state of each frame determined by the optimum state determination means; and rating value calculation means for calculating a rating value of the voice using, as parameters, the one or more probability values for each of the one or more pronunciation intervals acquired by the pronunciation interval frame phoneme probability value acquisition means.

The speech recognition apparatus according to the fifth aspect of the invention is the apparatus of the third or fourth aspect in which the pronunciation rating unit further comprises: silence data storage means that stores silence data, which is data representing silence; and silent interval detection means for detecting silent intervals on the basis of the input audio data and the silence data, and in which the optimum state determination means determines the optimum state of the one or more pieces of frame audio data excluding the frame audio data of the silent intervals.

With this configuration, detecting silence data enables more accurate pronunciation rating and, as a result, reduces the possibility of outputting an erroneous recognition result still further.

The speech recognition apparatus according to the present invention can prevent an erroneous recognition result from being output.

Hereinafter, embodiments of the speech recognition apparatus and the like will be described with reference to the drawings. Components given the same reference numerals in the embodiments operate in the same way, so repeated description of them may be omitted.

(Embodiment 1)
The speech recognition apparatus according to the present embodiment performs pronunciation rating processing that judges how good the pronunciation of the received voice is, and outputs a speech recognition result only when the rating value resulting from that processing satisfies a predetermined condition. The rating value in the pronunciation rating processing performed by this apparatus is calculated for each pronunciation interval.

In the per-frame pronunciation rating processing performed by this apparatus, the posterior probability of the optimum state for a frame of the input speech is calculated using dynamic programming. This posterior probability is therefore called DAP (Dynamic A Posteriori Probability), and an apparatus that performs similarity calculation and pronunciation rating processing based on DAP is called DAPS. Because this apparatus calculates a rating value for each pronunciation interval, the posterior probability it calculates is called t-DAP, in contrast to DAP.

FIG. 1 is a block diagram of the speech recognition apparatus according to the present embodiment.

The speech recognition apparatus includes a reception unit 101, a voice reception unit 102, an acoustic data storage unit 103, a recognition candidate data storage unit 104, a speech recognition unit 105, a pronunciation rating unit 106, a determination unit 107, and an output unit 108.

The pronunciation rating unit 106 includes frame classification means 1061, frame audio data acquisition means 1062, optimum state determination means 1063, pronunciation interval probability value acquisition means 1064, and rating value calculation means 1065.

The reception unit 101 receives instructions such as a start instruction, which instructs the speech recognition apparatus to start processing. The start instruction and the like may be input by any means, such as a numeric keypad, keyboard, mouse, or menu screen. The reception unit 101 can be realized by a device driver for an input means such as a numeric keypad or keyboard, by control software for a menu screen, or the like.

The voice reception unit 102 receives voice input. The voice is input from a microphone or the like, and is the information to be recognized. The voice reception unit 102 can be realized by a microphone and its driver software, or the like.

The acoustic data storage unit 103 stores acoustic data, which is data relating to the speech to be compared against. The acoustic data is preferably, for example, data based on an HMM formed by concatenating the hidden Markov models (HMMs) of individual phonemes. It is also preferable that the acoustic data be based on an HMM in which the HMMs corresponding to the phonemes of the input speech are concatenated in input order. However, the acoustic data need not be based on a concatenation of per-phoneme HMMs; it may simply be the set of HMMs for all phonemes. Nor does the acoustic data need to be based on HMMs at all. It may be based on another model, such as a single Gaussian distribution model, a probabilistic model (GMM: Gaussian mixture model), or a statistical model. Data based on an HMM has, for example, a state identifier and transition probability information for each frame. Data based on an HMM may also be, for example, a model trained (estimated) from two or more pieces of data uttered by a plurality of people who speak the recognition target language as their native language. The acoustic data storage unit 103 is preferably a non-volatile recording medium, but can also be realized by a volatile recording medium.

The recognition candidate data storage unit 104 stores one or more pieces of recognition candidate data, which is data indicating candidates for speech recognition. Recognition candidate data is, for example, the phoneme transcription of a word. Here, a phoneme transcription is information that represents phonemes as a character string, for example "inu", "neko", "ume", or "hitsuji". The recognition candidate data storage unit 104 may store information that represents a single phoneme (for example, "u") as a character string instead of a word. Recognition candidate data may also be, for example, an identifier that identifies a speech recognition candidate. In that case, the data indicating the speech recognition candidate itself is held, for example, by an external apparatus or by this speech recognition apparatus (not shown). The recognition candidate data storage unit 104 is preferably a non-volatile recording medium, but can also be realized by a volatile recording medium.

The speech recognition unit 105 performs speech recognition processing on the voice received by the voice reception unit 102 using the acoustic data, or the acoustic data together with the recognition candidate data, and acquires speech recognition processing result information that is the result of that processing. The speech recognition processing result information is, for example, the recognition candidate data (for example, a phoneme transcription) or the optimum state sequence obtained as the recognition result. Its data structure is not restricted. The speech recognition unit 105 normally performs speech recognition processing after the reception unit 101 has received a start instruction. The speech recognition unit 105 may recognize specific words and the like using the acoustic data and the recognition candidate data, or may recognize unspecified words and the like using the acoustic data alone, without the recognition candidate data. The speech recognition algorithm is not restricted. Since various known techniques exist for speech recognition processing itself, its description is omitted. The speech recognition unit 105 can usually be realized by an MPU, a memory, and the like. Its processing procedure is usually realized by software recorded on a recording medium such as a ROM, but it may also be realized by hardware (a dedicated circuit).

The pronunciation rating unit 106 performs pronunciation rating processing on the voice received by the voice reception unit 102 using the acoustic data in the acoustic data storage unit 103, and acquires a pronunciation rating result. Normally, the pronunciation rating unit 106 performs this processing using both the acoustic data and the speech recognition processing result information acquired by the speech recognition unit 105. In other words, the pronunciation rating unit 106 normally also makes use of the result of the speech recognition performed by the speech recognition unit 105. The pronunciation rating processing is described in detail later, but various pronunciation rating algorithms are conceivable. The pronunciation rating unit 106 can usually be realized by an MPU, a memory, and the like. Its processing procedure is usually realized by software recorded on a recording medium such as a ROM, but it may also be realized by hardware (a dedicated circuit).

The frame classification means 1061, which is part of the pronunciation rating unit 106, divides the voice received by the voice reception unit 102 into frames. This processing uses a known technique.

The frame audio data acquisition means 1062 obtains one or more pieces of frame audio data, which is the audio data of each frame produced by the frame classification means 1061.

The optimum state determination means 1063 determines the optimum state for at least one of the one or more pieces of frame audio data. For example, the optimum state determination means 1063 acquires, from the HMMs of all phonemes in the acoustic data storage unit 103, the HMMs corresponding to the one or more phonemes that make up the speech to be recognized, such as a word, and concatenates the acquired HMMs in phoneme order to form data (data relating to the speech to be recognized, based on an HMM in which the hidden Markov models of the individual phonemes are concatenated). Then, based on this data and on each feature vector o_t of the acquired feature vector sequence, it determines the optimum state of a given frame (the optimum state for feature vector o_t). The algorithm for determining the optimum state is, for example, the Viterbi algorithm.
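The following is a minimal sketch of how the Viterbi algorithm can assign an optimum state to every frame. It assumes the concatenated HMM has already been reduced to log-domain tables; the function name `viterbi` and the table layout (`log_init`, `log_trans`, `log_emit`) are illustrative assumptions, not taken from the patent.

```python
NEG_INF = float("-inf")

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state index for every frame.

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log likelihood of feature vector o_t in state s
    (All three tables are hypothetical inputs for this sketch.)
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            best = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            ptr.append(best)
            new_score.append(score[best] + log_trans[best][s] + log_emit[t][s])
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

A left-to-right model is expressed here simply by setting the disallowed entries of `log_trans` to `NEG_INF`.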

The pronunciation interval probability value acquisition means 1064 acquires, for each pronunciation interval, the probability value of the optimum state determined by the optimum state determination means 1063. Here, a pronunciation interval is an interval that forms one unit of pronunciation, such as a phoneme, a syllable, or a word.

The rating value calculation means 1065 calculates a rating value of the voice using, as parameters, the one or more probability values for each of the one or more pronunciation intervals acquired by the pronunciation interval probability value acquisition means 1064. For example, the rating value calculation means 1065 calculates, for each pronunciation interval, the time average of the one or more probability values acquired for that interval, obtains one or more time averages in this way, and calculates the rating value of the voice using those time averages as parameters.
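The per-interval time averaging described above can be sketched as follows. The text does not state here how the interval averages are combined into a single rating, so the plain mean used in `rating_value` is an assumption made only for illustration; the function names and the `(start, end)` interval representation are likewise hypothetical.

```python
def interval_time_averages(frame_probs, intervals):
    """Average the optimum-state probabilities over each pronunciation interval.

    frame_probs[t] : probability value of the optimum state at frame t
    intervals      : list of (start, end) frame indices, end exclusive
    """
    averages = []
    for start, end in intervals:
        segment = frame_probs[start:end]
        averages.append(sum(segment) / len(segment))
    return averages

def rating_value(frame_probs, intervals):
    # Assumption: the overall rating is the plain mean of the interval averages.
    averages = interval_time_averages(frame_probs, intervals)
    return sum(averages) / len(averages)

# Three intervals (for example, three phonemes) of two frames each.
probs = [0.5, 1.0, 0.25, 0.75, 1.0, 1.0]
spans = [(0, 2), (2, 4), (4, 6)]
print(interval_time_averages(probs, spans), rating_value(probs, spans))
```

Averaging within each interval first keeps a long, well-pronounced interval from masking a short, poorly pronounced one, which is one reason to compute the rating per pronunciation interval rather than over all frames at once.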

The frame classification means 1061, the frame audio data acquisition means 1062, the optimum state determination means 1063, the pronunciation interval probability value acquisition means 1064, and the rating value calculation means 1065 can usually be realized by an MPU, a memory, and the like. The processing procedures of the frame classification means 1061 and the other means are usually realized by software recorded on a recording medium such as a ROM, but they may also be realized by hardware (a dedicated circuit).

The determination unit 107 determines whether the pronunciation rating result has a predetermined relationship with a threshold. The pronunciation rating result is usually a numerical value indicating the level of the pronunciation rating, for example a value from 0 to 1 or one of ten grades from 1 to 10. The threshold is usually a numerical value stored in advance. The predetermined relationship is, for example, that the pronunciation rating result (for example, a value from 0 to 1) is greater than or equal to the threshold (for example, 0.50), or alternatively that it is strictly greater than the threshold (for example, 0.50). The determination unit 107 can usually be realized by an MPU, a memory, and the like. Its processing procedure is usually realized by software recorded on a recording medium such as a ROM, but it may also be realized by hardware (a dedicated circuit).
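The two example relationships differ only at the boundary, which the short sketch below makes explicit. The function name `satisfies_threshold` and the relation labels `"ge"` and `"gt"` are hypothetical; only the 0.50 threshold comes from the example in the text.

```python
import operator

# The two example relationships: "greater than or equal to" and "greater than".
RELATIONS = {"ge": operator.ge, "gt": operator.gt}

def satisfies_threshold(rating, threshold=0.50, relation="ge"):
    """Return True when the rating has the chosen relationship to the threshold."""
    return RELATIONS[relation](rating, threshold)

print(satisfies_threshold(0.50, relation="ge"))  # boundary value passes
print(satisfies_threshold(0.50, relation="gt"))  # boundary value is rejected
```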

The output unit 108 outputs a speech recognition result for the speech recognition processing result information acquired by the speech recognition unit 105 only when the determination unit 107 determines that the predetermined relationship holds. The speech recognition result may be the same information as the speech recognition processing result information, or information generated from it. When the speech recognition processing result information is a phoneme transcription (for example, "neko"), the speech recognition result is, for example, a character string (for example, "ねこ", the Japanese word for cat). When the determination unit 107 determines that the predetermined relationship does not hold, the output unit 108 normally outputs information indicating that the speech could not be recognized, although it may instead output nothing. Here, output is a concept that includes display on a display, printing on a printer, sound output, transmission to an external apparatus, storage on a recording medium, and the like. The output unit 108 may or may not be considered to include an output device such as a display or a speaker. The output unit 108 can be realized by driver software for an output device, or by such driver software together with the output device, or the like.

Next, the operation of the speech recognition apparatus will be described with reference to the flowcharts of FIGS. 2 and 3.

(Step S201) The reception unit 101 determines whether a start instruction has been received. If a start instruction has been received, the process proceeds to step S202; otherwise, it returns to step S201.

(Step S202) The voice reception unit 102 determines whether voice has been received. If voice has been received, the process proceeds to step S203; otherwise, it returns to step S202. This voice is the voice to be recognized.

(Step S203) The speech recognition unit 105 performs speech recognition processing on the voice received in step S202. Specifically, the speech recognition unit 105 performs the processing using the acoustic data and the recognition candidate data, and acquires speech recognition processing result information that is the result of that processing.

(Step S204) The pronunciation rating unit 106 performs pronunciation rating processing on the voice received by the voice reception unit 102 using the acoustic data in the acoustic data storage unit 103, and acquires a pronunciation rating result. Normally, the pronunciation rating unit 106 performs this processing on the basis of the speech recognition processing result information obtained in step S203, using the acoustic data. The pronunciation rating processing is described in detail with reference to the flowchart of FIG. 3.

(Step S205) The determination unit 107 determines whether the pronunciation rating result has a predetermined relationship with the predetermined threshold (for example, pronunciation rating result > threshold). If the relationship holds, the process proceeds to step S206; otherwise, it proceeds to step S207.

(Step S206) The output unit 108 outputs the speech recognition result for the speech recognition processing result information (for example, the character string of the recognition result). The process then returns to step S202.

(Step S207) The output unit 108 outputs information indicating that the speech could not be recognized. The process then returns to step S202.
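Steps S203 to S207 amount to gating the recognizer's output on the pronunciation rating. The sketch below shows that control flow for one utterance that has already been received. The function `recognize_with_gate` and the two callables passed to it are hypothetical stand-ins for the speech recognition unit 105 and the pronunciation rating unit 106, and returning `None` for step S207 is an assumption of this sketch.

```python
def recognize_with_gate(speech, recognize, rate_pronunciation, threshold=0.5):
    """Run steps S203 to S207 once and return the result, or None if withheld."""
    result_info = recognize(speech)                    # S203: speech recognition
    rating = rate_pronunciation(speech, result_info)   # S204: pronunciation rating
    if rating > threshold:                             # S205: compare with threshold
        return result_info                             # S206: output the result
    return None                                        # S207: caller reports failure

# Usage with stand-in callables; a real system would plug in its own components.
good = recognize_with_gate(b"audio", lambda s: "neko", lambda s, r: 0.8)
poor = recognize_with_gate(b"audio", lambda s: "neko", lambda s, r: 0.3)
print(good, poor)
```

Passing the recognition result into the rating callable mirrors the normal case in step S204, where the rating uses the speech recognition processing result information.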

In the flowchart of FIG. 2, the pronunciation rating processing may instead be performed first, with the speech recognition processing performed only when the pronunciation rating result has the predetermined relationship with the threshold.

In the flowchart of FIG. 2, the unit in which voice is received may be a word, a phrase, a sentence, two or more sentences, or any other unit. The units in which speech recognition and pronunciation rating are performed are likewise unrestricted. The flowchart of FIG. 2 shows an example in which, for instance, the voice of one word is received, recognized, and rated. When the unit received is a sentence, the speech recognition processing of step S203 and the pronunciation rating processing of step S204 are, for example, repeated as many times as there are pronunciation intervals.

In the flowchart of FIG. 2, the processing ends when the power is turned off or when an interrupt requesting termination occurs.

次に、図２のステップＳ２０４の発音評定処理について図３のフローチャートを用いて説明する。 Next, the pronunciation rating process in step S204 of FIG. 2 will be described using the flowchart of FIG.

(Step S301) The frame classification unit 1061 divides the speech received by the voice receiving unit 102 into frames. At this stage, frame audio data, that is, the audio data of each frame, is formed. The frame division performed by the frame classification unit 1061 is, for example, preprocessing for the extraction of frame audio data by the frame audio data acquisition unit 1062, and the input audio data is not necessarily divided into all of its frames at once.

(Step S302) The frame audio data acquisition unit 1062 sets the counter i to 1.

(Step S303) The frame audio data acquisition unit 1062 determines whether an i-th frame exists among the frames produced in step S301. If it exists, the process goes to step S304; if not, it goes to step S306.

(Step S304) The frame audio data acquisition unit 1062 acquires the i-th frame audio data. Acquiring frame audio data means, for example, analyzing the divided audio data, extracting feature vector data, and placing it in memory. The frame audio data is, for example, the data obtained by dividing the input audio data into frames, and it includes, for example, the feature vector data extracted from the divided audio data by speech analysis. The feature vector data is, for example, a 39-dimensional vector based on MFCCs obtained by applying a discrete cosine transform to the outputs of a 24-channel filter bank of triangular filters: 12 dimensions each of static, delta, and delta-delta parameters, plus normalized power, delta power, and delta-delta power.

(Step S305) The frame audio data acquisition unit 1062 increments the counter i by 1. The process then returns to step S303.

(Step S306) The pronunciation interval probability value acquisition unit 1064 calculates the forward and backward likelihoods of all states for all frames, and from them obtains the probability values of all states for all frames. Specifically, it calculates, for example, the posterior probability that each feature vector was generated from the state in question. This posterior probability corresponds to the state occupation count that appears in the Baum-Welch algorithm for maximum likelihood estimation of HMMs. The Baum-Welch algorithm is well known, so its description is omitted.

(Step S307) The pronunciation interval probability value acquisition unit 1064 calculates the optimum state probability value for every frame.

(Step S308) The pronunciation interval probability value acquisition unit 1064 reads out all of the probability values of the one or more optimum states corresponding to the pronunciation interval (for example, a word interval).

(Step S309) The rating value calculation means 1065 calculates the rating value of the speech using, as parameters, the one or more probability values acquired in step S308 for each of the one or more pronunciation intervals. For example, it calculates the mean (time average) of the one or more probability values acquired in step S308. The process then returns to the calling function. The rating value calculation is described in detail later.

The specific operation of the speech recognition apparatus in the present embodiment is described below.

First, in this speech recognition apparatus, phoneme HMMs of native pronunciation are trained in advance, by means not shown, from a speech database of native pronunciation in the language of the word (or sentence) to be recognized. Let L be the number of phoneme types and λ_l the HMM for the l-th phoneme. Because this training process is a known technique, a detailed description is omitted. The HMM specifications are shown in FIG. 4 and are the same in the specific examples of the other embodiments, although other HMM specifications may of course be used.

Then, from the L trained phoneme HMMs, the HMMs corresponding to the one or more phonemes that make up the speech to be recognized, such as a word or sentence, are obtained, and acoustic data is constructed by concatenating those HMMs in phoneme order. The acoustic data is stored in the acoustic data storage unit 103.

Next, a user (here, for example, an American) inputs an instruction to start speech recognition, for example by clicking a predetermined button with a mouse.

Next, the user pronounces the word "right", and the voice receiving unit 102 receives the input of the speech pronounced by the user.

Next, suppose that the speech recognition unit 105 performs speech recognition processing and obtains the phoneme transcription "right" as the recognition result.

Next, the pronunciation rating unit 106 rates the pronunciation as follows.

The frame classification unit 1061 divides the speech received by the voice receiving unit 102 into short-time frames. The frame interval is assumed to be predetermined.

The frame audio data acquisition unit 1062 then performs spectral analysis on the audio data divided by the frame classification unit 1061 and calculates the feature vector sequence O = o_1, o_2, ..., o_T, where T is the sequence length. The feature vector sequence is the set of the feature vectors of the individual frames. Each feature vector is, for example, a 39-dimensional vector based on MFCCs obtained by applying a discrete cosine transform to the outputs of a 24-channel filter bank of triangular filters: 12 dimensions each of static, delta, and delta-delta parameters, plus normalized power, delta power, and delta-delta power. It is preferable to apply cepstral mean removal in the spectral analysis. The speech analysis conditions are shown in the table of FIG. 5 and are the same in the specific examples of the other embodiments, although other speech analysis conditions may of course be used.
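To make the 39-dimensional layout concrete, the sketch below stacks static, delta, and delta-delta parameters per frame. It assumes each input frame already holds 12 MFCCs followed by normalized power (13 values). The delta formula used here, a central difference with clamped edges, is a simplifying assumption and is not taken from this specification; the function names are hypothetical.

```python
from typing import List

Vector = List[float]


def deltas(frames: List[Vector]) -> List[Vector]:
    """Central-difference deltas; the first and last frames reuse themselves as neighbours."""
    n = len(frames)
    out: List[Vector] = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def stack_features(static: List[Vector]) -> List[Vector]:
    """Concatenate static, delta, and delta-delta parameters for every frame.

    static[t] is assumed to hold 12 MFCCs plus normalized power (13 values),
    so each output vector has 13 * 3 = 39 dimensions.
    """
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

The ordering of the 39 values inside each vector is arbitrary here; any fixed ordering works as long as training and rating use the same one.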

Next, the optimum state determination unit 1063 determines the optimum state of a given frame (the optimum state for feature vector o_t) on the basis of each feature vector o_t in the acquired feature vector sequence. It determines the optimum state using, for example, the Viterbi algorithm, applied to the HMMs concatenated as described above. The optimum state determination unit 1063 thus obtains an optimum state sequence, that is, the optimum states of two or more frames.
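The optimum state sequence can be illustrated with a textbook Viterbi pass in the log domain. This is a generic sketch, not the procedure of the optimum state determination unit 1063 itself: it assumes the concatenated HMM has already been flattened into an initial distribution, a transition matrix, and per-frame emission likelihoods, and the names `viterbi` and `safe_log` are hypothetical.

```python
import math
from typing import List


def safe_log(p: float) -> float:
    """Natural log that maps zero probability to negative infinity."""
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi(log_init: List[float],
            log_trans: List[List[float]],
            log_emit: List[List[float]]) -> List[int]:
    """Return the most likely state sequence (0-based state identifiers).

    log_emit[t][j] is the log likelihood of frame t's feature vector in state j.
    """
    n_frames, n_states = len(log_emit), len(log_init)
    score = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    back: List[List[int]] = []
    for t in range(1, n_frames):
        new_score, ptr = [], []
        for j in range(n_states):
            # Best predecessor of state j at frame t.
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emit[t][j])
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

Working in the log domain avoids numerical underflow on long utterances, which is why the inputs are log probabilities rather than raw probabilities.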

Next, the pronunciation interval probability value acquisition unit 1064 first calculates, by Equation 1 below, the optimum state probability value γ_t(q_t^*) in the optimum state q_t^*. Here γ_t(q_t^*) is the value obtained by substituting q_t^* for j in the posterior probability function γ_t(j) of state j, which is calculated using Equation 2. The probability value γ_t(j) is the posterior probability that the t-th feature vector o_t was generated from state j, and it is calculated using dynamic programming. Here j is a state identifier that identifies a state.

In Equation 1, q_t denotes the state identifier for o_t. The probability value γ_t(j) corresponds to the state occupation count that appears in the Baum-Welch algorithm for maximum likelihood estimation of HMMs.

In Equation 2, α_t(j) and β_t(j) are calculated by the forward-backward algorithm using all of the HMMs; α_t(j) is the forward likelihood and β_t(j) is the backward likelihood. The Baum-Welch and forward-backward algorithms are well known, so a detailed description is omitted.
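As an illustration of how the state posteriors follow from the forward and backward likelihoods, the sketch below computes γ_t(j) as α_t(j)β_t(j) normalized over all states at frame t. It is a textbook, unscaled forward-backward pass that assumes a small flattened HMM and plain probabilities; the function name `state_posteriors` is hypothetical, and a practical implementation would need scaling or log arithmetic for long utterances.

```python
from typing import List


def state_posteriors(init: List[float],
                     trans: List[List[float]],
                     emit: List[List[float]]) -> List[List[float]]:
    """Return gamma[t][j], the posterior probability of state j at frame t.

    emit[t][j] is the likelihood of frame t's feature vector in state j.
    """
    n_frames, n_states = len(emit), len(init)
    alpha = [[0.0] * n_states for _ in range(n_frames)]
    beta = [[1.0] * n_states for _ in range(n_frames)]

    # Forward likelihoods.
    for j in range(n_states):
        alpha[0][j] = init[j] * emit[0][j]
    for t in range(1, n_frames):
        for j in range(n_states):
            alpha[t][j] = emit[t][j] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(n_states))

    # Backward likelihoods (beta at the last frame stays 1.0).
    for t in range(n_frames - 2, -1, -1):
        for i in range(n_states):
            beta[t][i] = sum(
                trans[i][j] * emit[t + 1][j] * beta[t + 1][j] for j in range(n_states))

    # Normalize alpha * beta over all states at each frame.
    gamma: List[List[float]] = []
    for t in range(n_frames):
        norm = sum(alpha[t][i] * beta[t][i] for i in range(n_states))
        gamma.append([alpha[t][j] * beta[t][j] / norm for j in range(n_states)])
    return gamma
```

Because each row is normalized over every state, the values in a row sum to 1, which is what makes the score comparable across speakers and recording conditions.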

Also in Equation 2, N denotes the total number of states across all HMMs.
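The images for Equations 1 and 2 are not reproduced in this text. Based only on the definitions given here (γ_t(j) as the posterior probability of state j at frame t, α_t(j) and β_t(j) as the forward and backward likelihoods, and N as the total number of states), the standard forward-backward form would read as follows. This is a reconstruction under those assumptions and may differ in notation from the original equations.

```latex
\begin{align}
\gamma_t(q_t^{*}) &= P\left(q_t = q_t^{*} \mid O\right) \tag{1} \\
\gamma_t(j) &= \frac{\alpha_t(j)\,\beta_t(j)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)} \tag{2}
\end{align}
```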

Then the pronunciation interval probability value acquisition unit 1064 acquires all of the probability values of the one or more optimum states corresponding to the pronunciation interval, and the rating value calculation means 1065 calculates their mean (time average). Specifically, the rating value calculation means 1065 calculates the rating value by Equation 3.
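The image for Equation 3 is likewise not reproduced in this text. Assuming the pronunciation interval spans frames t_s through t_e (symbols introduced here for illustration only), a time average of the optimum state probability values consistent with the description would be:

```latex
\begin{equation}
\text{t-DAP} = \frac{1}{t_e - t_s + 1} \sum_{t = t_s}^{t_e} \gamma_t(q_t^{*}) \tag{3}
\end{equation}
```

Scores such as "68" in the worked example suggest that the value is reported as a percentage, that is, multiplied by 100, but that scaling is an inference rather than something stated with the equation.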

The pronunciation rating unit 106 may first determine the optimum states and then obtain their probability values (each probability value is between 0 and 1 inclusive). Alternatively, it may first obtain the probability values of all states, then determine the optimum state for each feature vector in the feature vector sequence, and then take the probability value corresponding to that optimum state.

If the user's utterance at frame t is close to the pronunciation represented by the acoustic data (for example, correct native pronunciation), the numerator of Equation 2 becomes large relative to the values for all states of all other possible phonemes, and as a result the probability value of the optimum state (the rating value) becomes large. Conversely, if that segment is not close to the pronunciation represented by the acoustic data, the rating value becomes small. When the utterance is not close to any native pronunciation, the DAP rating value (also called the DAP score) is approximately 1/N. Because N is the number of all states in all phoneme HMMs, it is normally large, so the DAP score becomes sufficiently small. Moreover, the DAP score is defined here as the ratio of the probability value in the optimum state to the probability values over all possible states. Therefore, even if the spectrum varies somewhat because of differences in speaker characteristics or recording environment, the variation cancels out as long as the user pronounces correctly, and the rating value remains high.
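The 1/N behaviour described above can be checked with a small numeric sketch. The function `t_dap` below is a hypothetical helper that averages the posterior of the optimum state over the frames of one pronunciation interval; the posterior values in the usage note are made up for illustration and are not measurements.

```python
from typing import List


def t_dap(gamma: List[List[float]], path: List[int]) -> float:
    """Time average of gamma[t][path[t]] over one pronunciation interval.

    gamma[t][j] is the posterior of state j at frame t, and path[t] is the
    optimum state identifier (0-based) at frame t.
    """
    if not path:
        raise ValueError("the pronunciation interval contains no frames")
    return sum(gamma[t][q] for t, q in enumerate(path)) / len(path)
```

With N = 4 states and completely uninformative posteriors, `t_dap([[0.25] * 4] * 3, [0, 1, 2])` gives 0.25, which is 1/N. With posteriors concentrated on the optimum states, for example `t_dap([[0.9, 0.1], [0.2, 0.8]], [0, 1])`, the score rises to about 0.85.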

FIGS. 6 and 7 show the DAP scores (per-frame scores) that the rating value calculation means 1065 calculates as intermediate results. In FIGS. 6 and 7, the horizontal axis is the analysis frame number and the vertical axis is the score in percent. Thick dashed lines indicate phoneme boundaries and thin dotted lines indicate state boundaries (both obtained by the Viterbi algorithm), and phoneme names are shown at the top of each figure. FIG. 6 shows the DAP scores for the English word "right" pronounced by an American male. The horizontal and vertical axes are the same in the rating value graphs described later.

FIG. 7 shows the DAP scores for the English word "right" pronounced by a Japanese male. The American speaker's pronunciation generally scores higher than the Japanese speaker's. It can also be seen in FIG. 6 that the score dips in places at state boundaries.

Suppose that the rating value calculation means 1065 then calculates, for example, a pronunciation rating result (also called the t-DAP score) of "68" for the pronunciation interval (see FIG. 8).

Next, the determination unit 107 compares the prestored threshold "50" with the pronunciation rating result "68" and determines whether a predetermined condition is satisfied. The predetermined condition is, for example, "pronunciation rating result > threshold". Because 68 > 50, the determination unit 107 determines that the condition is satisfied.

Next, the output unit 108 outputs the speech recognition processing result information "right" acquired by the speech recognition unit 105, for example by showing it on a display.

Next, another user (assumed to be Japanese) pronounces the word "right", and the voice receiving unit 102 receives the input of the speech pronounced by the user.

Next, suppose that the speech recognition unit 105 performs speech recognition processing and obtains the phoneme transcription "light" as the recognition result.

Next, the pronunciation rating unit 106 rates the pronunciation using the algorithm described above. Suppose that it obtains a pronunciation rating result of "29".

Next, the determination unit 107 compares the prestored threshold "50" with the pronunciation rating result "29" and determines whether the predetermined condition is satisfied. Because 50 > 29, the determination unit 107 determines that the condition is not satisfied.

The output unit 108 then outputs the prestored message "Speech could not be recognized." This output is, for example, audio output from a speaker together with display on a screen.

As described above, the present embodiment provides a speech recognition apparatus that rates pronunciation and outputs the speech recognition result only when a predetermined rating result is obtained (that is, only when the pronunciation is good). This processing reduces the likelihood of outputting an erroneous recognition result and yields a highly reliable speech recognition apparatus. In addition, the user's trust in the apparatus is not undermined, and the user can use it with confidence.

In the present embodiment, pronunciation rating is performed after speech recognition processing. However, pronunciation rating may be performed first, with speech recognition processing performed only when the rating result satisfies the predetermined condition. The same applies to the other embodiments.

In the present embodiment, the speech recognition algorithm used by the speech recognition unit is not restricted. Speech recognition may or may not use the phoneme transcriptions in the recognition candidate data storage unit 104. The same applies to the other embodiments.

In the present embodiment, the rating value calculation means calculates the time average of the optimum state probability values in one or more optimum states. However, it may calculate the score with any function that takes those optimum state probability values as parameters and yields a result indicating how good the pronunciation is. For example, it may take the median of the optimum state probability values together with the 10 values around the median and use the mean of those 11 values as the score.
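The median-based alternative can be sketched as follows. The text does not say how "the 10 values around the median" are chosen, so this sketch assumes the values are sorted by magnitude and that five values are taken on each side of the median; the function name `median_window_score` and the `half_width` parameter are hypothetical.

```python
from typing import List


def median_window_score(values: List[float], half_width: int = 5) -> float:
    """Mean of the median and up to `half_width` sorted values on each side of it.

    With half_width=5 this averages 11 values when enough values exist;
    shorter inputs simply use whatever values are available.
    """
    if not values:
        raise ValueError("at least one probability value is required")
    ordered = sorted(values)
    mid = len(ordered) // 2  # upper median for even-length input (an assumption)
    lo = max(mid - half_width, 0)
    hi = min(mid + half_width + 1, len(ordered))
    window = ordered[lo:hi]
    return sum(window) / len(window)
```

Compared with a plain time average, a window around the median ignores the lowest and highest frames, so brief dips such as those seen at state boundaries have less influence on the score.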

In the present embodiment, the range over which pronunciation is rated does not necessarily coincide with the range of the recognition result that is output. For example, when the score of part of a sentence is lower than the threshold as a result of pronunciation rating, the recognition result for the entire sentence may be withheld.
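One way to read this is that a single low-scoring part suppresses the whole sentence. The sketch below shows that policy under stated assumptions: `gate_sentence` is a hypothetical function, the scores are per-interval rating values for the parts of one sentence, and an empty score list is treated as a failure.

```python
from typing import List, Optional


def gate_sentence(recognized_sentence: str,
                  interval_scores: List[float],
                  threshold: float) -> Optional[str]:
    """Return the whole sentence only if every rated part exceeds the threshold."""
    if interval_scores and all(score > threshold for score in interval_scores):
        return recognized_sentence
    return None  # withhold the entire sentence, not just the weak part
```

Other policies, such as outputting only the parts that pass, would also be consistent with the statement that the rated range and the output range need not match.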

Furthermore, the processing in the present embodiment may be implemented in software. The software may be distributed by software download or the like, or recorded on a recording medium such as a CD-ROM and distributed. This also applies to the other embodiments in this specification. The software that implements the information processing apparatus of the present embodiment is the following program. That is, the program causes a computer to execute: a speech input step of receiving speech input; a speech recognition step of performing speech recognition processing on the speech received in the speech input step, using stored acoustic data, and acquiring speech recognition processing result information that is the result of that processing; a pronunciation rating step of performing pronunciation rating processing on the speech received in the speech input step, using the acoustic data, and acquiring a pronunciation rating result; a determination step of determining whether the pronunciation rating result has a predetermined relationship to a predetermined threshold; and an output step of outputting the speech recognition result for the speech recognition processing result information acquired in the speech recognition step only when the determination step determines that the predetermined relationship holds.

In the above program, it is preferable that the pronunciation rating step perform pronunciation rating processing on the speech received in the speech input step using the speech recognition processing result information acquired in the speech recognition step together with the acoustic data, and thereby acquire the pronunciation rating result.

Furthermore, in the above program, it is preferable that the pronunciation rating step comprise: a frame classification step of dividing the speech received in the speech receiving step into frames; a frame audio data acquisition step of obtaining one or more items of frame audio data, each being the audio data of one of the frames; an optimum state determination step of determining the optimum state of the one or more items of frame audio data; a pronunciation interval probability value acquisition step of acquiring, for each pronunciation interval, the probability value of the optimum state determined in the optimum state determination step; and a rating value calculation step of calculating the rating value of the speech using, as parameters, the one or more probability values acquired for each of the one or more pronunciation intervals in the pronunciation interval probability value acquisition step.

(Embodiment 2)
The speech recognition apparatus of the present embodiment differs from that of Embodiment 1 in the pronunciation rating algorithm used by the pronunciation rating unit. In the present embodiment, the rating value is calculated by evaluating, over the pronunciation interval, the probability values of all states in the phoneme that contains the optimum state. The posterior probability calculated by the pronunciation rating device in the present embodiment is called t-p-DAP, in contrast to the t-DAP described in Embodiment 1.

FIG. 9 is a block diagram of the speech recognition apparatus in the present embodiment.

The speech recognition apparatus of FIG. 9 differs from that of FIG. 1 only in the pronunciation rating unit 906.

The pronunciation rating unit 906 comprises the frame classification unit 1061, the frame audio data acquisition unit 1062, the optimum state determination unit 1063, a pronunciation interval frame phoneme probability value acquisition unit 9061, and a rating value calculation means 9062.

The pronunciation interval frame phoneme probability value acquisition unit 9061 of the pronunciation rating unit 906 acquires, for each pronunciation interval, the one or more probability values of all states of the phoneme that contains the optimum state of each frame determined by the optimum state determination unit 1063.

The rating value calculation means 9062 calculates the rating value of the speech using, as parameters, the one or more probability values acquired for each of the one or more pronunciation intervals by the pronunciation interval frame phoneme probability value acquisition unit 9061. For example, it obtains, for each frame, the sum of the one or more probability values of all states of the phoneme that contains the frame's optimum state as determined by the optimum state determination unit 1063; from these per-frame sums it obtains one or more time averages, one for each pronunciation interval; and it calculates the rating value of the speech using those time averages as parameters.

The pronunciation interval frame phoneme probability value acquisition unit 9061 and the rating value calculation means 9062 can normally be implemented with an MPU, memory, and the like. Their processing procedures are normally implemented in software recorded on a recording medium such as a ROM, but they may also be implemented in hardware (dedicated circuits).

Next, the operation of this speech recognition apparatus is described. It differs from the speech recognition apparatus of Embodiment 1 only in the pronunciation rating process of step S204, which is described with reference to the flowchart of FIG. 10. Only the steps in FIG. 10 that differ from those in FIG. 3 are described.

(Step S1001) The pronunciation interval frame phoneme probability value acquisition unit 9061 sets j to 1.

(Step S1002) The pronunciation interval frame phoneme probability value acquisition unit 9061 determines whether a j-th frame exists in the current pronunciation interval (for example, a word). If it exists, the process goes to step S1003; if not, it jumps to step S1006.

(Step S1003) The pronunciation interval frame phoneme probability value acquisition unit 9061 acquires all of the probability values of the phoneme that contains the optimum state of the j-th frame.

(Step S1004) The rating value calculation means 9062 calculates the rating value of one frame of speech using the one or more probability values acquired in step S1003 as parameters, and stores it temporarily in memory.

(Step S1005) The pronunciation interval frame phoneme probability value acquisition unit 9061 increments j by 1. The process then returns to step S1002.

(Step S1006) The rating value calculation means 9062 calculates the rating value of the current pronunciation interval. For example, it obtains, for each frame, the sum of the one or more probability values of all states of the phoneme that contains the frame's optimum state as determined by the optimum state determination unit 1063, and from these per-frame sums it calculates their time average over the pronunciation interval as the rating value of the speech in that interval. The process then returns to the calling function.
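Steps S1001 to S1006 can be sketched as a per-frame sum followed by a time average. The sketch assumes the state posteriors and the optimum state sequence are already available and that a lookup table maps each state identifier to its phoneme; the names `p_dap`, `t_p_dap`, and `phoneme_of_state` are hypothetical and do not appear in this specification.

```python
from typing import List


def p_dap(gamma_t: List[float], optimum_state: int,
          phoneme_of_state: List[str]) -> float:
    """Sum the posteriors of every state in the phoneme containing the optimum state."""
    target = phoneme_of_state[optimum_state]
    return sum(g for j, g in enumerate(gamma_t) if phoneme_of_state[j] == target)


def t_p_dap(gamma: List[List[float]], path: List[int],
            phoneme_of_state: List[str]) -> float:
    """Time average of the per-frame sums over one pronunciation interval."""
    if not path:
        raise ValueError("the pronunciation interval contains no frames")
    per_frame = [p_dap(gamma[t], q, phoneme_of_state) for t, q in enumerate(path)]
    return sum(per_frame) / len(per_frame)
```

Because the sum covers all states of the phoneme rather than only the optimum state, a frame near a state boundary, where the posterior is split between neighbouring states of the same phoneme, is not penalized as it would be with the per-state score of Embodiment 1.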

The specific operation of the speech recognition apparatus in the present embodiment is described below. Because the rating value calculation algorithm differs from that of Embodiment 1, the description focuses on that operation.

First, after inputting a start instruction, the user pronounces the speech to be recognized (here, "right"), and the voice receiving unit 102 receives the input of the speech pronounced by the user.

Next, the frame classification unit 1061 divides the speech received by the voice receiving unit 102 into short-time frames.

The frame audio data acquisition unit 1062 then performs spectral analysis on the audio data divided by the frame classification unit 1061 and calculates the feature vector sequence O = o_1, o_2, ..., o_T.

Next, the pronunciation interval frame phoneme probability value acquisition unit 9061 calculates the posterior probability (probability value) of each state for each frame, using Equations 1 and 2 above.

Next, the optimum state determination unit 1063 determines the optimum state of each frame (the optimum state for feature vector o_t) on the basis of each feature vector o_t in the acquired feature vector sequence; that is, it obtains the optimum state sequence. The posterior probabilities of the states of each frame and the optimum states may be computed in either order.

Next, for each pronunciation interval, the pronunciation interval frame phoneme probability value acquisition means 9061 acquires all probability values of the phoneme that contains the optimal state of each frame in that interval.

Then, the rating value calculation means 9062 calculates, for each frame, the sum of all probability values of the phoneme that contains the frame's optimal state. This per-frame sum is called “p-DAP” and is calculated by Equation 4.
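Equation 4 itself is not reproduced in this text. Based only on the prose description, the per-frame quantity could plausibly be written as below. This is a reconstruction under assumptions, not the patent's own formula: $q_t^{*}$ is the optimal state of frame $t$, $\gamma_t(j)$ is the posterior probability of state $j$ at frame $t$, and $S(q_t^{*})$ is a symbol introduced here for the set of all states of the phoneme that contains $q_t^{*}$.

```latex
\text{p-DAP}(t) = \sum_{j \in S(q_t^{*})} \gamma_t(j)
```

Under this reading, p-DAP rewards a frame as long as the probability mass falls anywhere inside the correct phoneme, not only on the single optimal state.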

The rating value calculation means 9062 then takes the time average of the per-frame sums over each pronunciation interval to obtain the rating value for that interval. Specifically, the rating value calculation means 9062 calculates the rating value by Equation 5.
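The two steps above (a per-frame sum over the phoneme's states, then a time average over a pronunciation interval) can be sketched as follows. The function names, the `state_phoneme` lookup table, and the scale factor of 100 are assumptions; the factor is suggested only by example scores such as 84 and 33, and the exact form of Equation 5 is not shown in this text.

```python
from typing import List, Sequence


def p_dap(gamma_t: Sequence[float], state_phoneme: Sequence[str], optimal_state: int) -> float:
    """Sum the posteriors of all states belonging to the optimal state's phoneme."""
    target = state_phoneme[optimal_state]
    return sum(p for j, p in enumerate(gamma_t) if state_phoneme[j] == target)


def t_p_dap(gamma: List[Sequence[float]], state_phoneme: Sequence[str],
            optimal_states: Sequence[int], start: int, end: int,
            scale: float = 100.0) -> float:
    """Time-average p-DAP over frames start..end-1 of one pronunciation interval."""
    if end <= start:
        raise ValueError("the interval must contain at least one frame")
    total = sum(p_dap(gamma[t], state_phoneme, optimal_states[t]) for t in range(start, end))
    return scale * total / (end - start)
```

Passing a phoneme-sized interval or a word-sized interval as `start` and `end` corresponds to the “Phoneme” and “Word” averaging ranges mentioned for the t-p-DAP score.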

The rating values calculated by the rating value calculation means 9062 (also called “t-p-DAP scores”) are shown in the table of FIG. 11, which gives the rating results for an American male speaker and a Japanese male speaker. “Phoneme” and “Word” indicate the range over which the t-p-DAP time average is taken. Here, the time average of p-DAP is used in place of the DAP of Embodiment 1. In FIG. 11, the American male's pronunciation receives a higher rating value than the Japanese male's, which is a good rating result.

Next, the determination unit 107 compares the prestored threshold “60” with the pronunciation rating result “84” for the American male's utterance and determines whether the predetermined condition is satisfied. Because 84 > 60, the determination unit 107 determines that the condition is satisfied.

次に、出力部１０８は、音声認識部１０５が取得した音声認識処理結果情報「ｒｉｇｈｔ」を出力する。 Next, the output unit 108 outputs the voice recognition processing result information “right” acquired by the voice recognition unit 105.

また、判断部１０７は、予め格納している閾値「６０」と、日本人男性の発声の場合の発音評定結果「３３」を比較し、所定の条件を満たさない、と判断する。 Further, the determination unit 107 compares the threshold value “60” stored in advance with the pronunciation evaluation result “33” in the case of a Japanese male utterance, and determines that the predetermined condition is not satisfied.

そして、出力部１０８は、予め格納している情報「音声認識できませんでした。」を出力する。 Then, the output unit 108 outputs prestored information “speech recognition failed”.
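The decision made by the determination unit and the output unit in this example amounts to a simple gate on the rating value. The sketch below assumes the “predetermined relationship” is a strict greater-than comparison, as in the 84 > 60 example; the function name and default arguments are illustrative.

```python
def gated_output(recognition_result: str, rating: float,
                 threshold: float = 60.0,
                 failure_message: str = "Speech recognition failed.") -> str:
    """Return the recognition result only when the rating exceeds the threshold."""
    # Strict '>' is an assumption; the text only requires a predetermined
    # relationship between the rating and the threshold.
    if rating > threshold:
        return recognition_result
    return failure_message
```

With the values from the text, a rating of 84 returns “right”, while a rating of 33 returns the prestored failure message.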

As described above, this embodiment provides a speech recognition apparatus that rates pronunciation and outputs the speech recognition result only when a predetermined rating result is obtained (that is, only when the rating is good). This processing reduces the possibility of outputting an erroneous recognition result and yields a highly reliable speech recognition apparatus. The user's trust in the apparatus is also preserved, so the user can use it with confidence. Furthermore, this embodiment rates pronunciation accurately and, as a result, can output correct recognition results with high accuracy.

Note that in this embodiment, the rating value calculation means that performs pronunciation rating calculates the time average of the sum of the one or more probability values over all states of the phoneme containing each frame's optimal state. However, the score may be calculated by any function that takes this sum as a parameter and yields an indication of how good or bad the pronunciation is. For example, the rating value calculation means may take, from among the per-frame sums, the median and the 30 values surrounding it, and use the average of those 31 values as the score.
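The median-based alternative mentioned above can be sketched as follows. The reading adopted here, sorting the per-frame sums and averaging the median together with the 15 values on either side of it, is an assumption; the text says only that the median and the 30 values around it are used. The function name and the `half_width` parameter are hypothetical.

```python
from typing import Sequence


def median_window_score(values: Sequence[float], half_width: int = 15) -> float:
    """Average the median and up to half_width sorted values on each side of it."""
    if not values:
        raise ValueError("at least one value is required")
    ordered = sorted(values)
    mid = len(ordered) // 2
    lo = max(0, mid - half_width)
    hi = min(len(ordered), mid + half_width + 1)
    window = ordered[lo:hi]  # 31 values when enough frames are available
    return sum(window) / len(window)
```

Compared with a plain time average, a score of this kind is less sensitive to a few frames with extreme values.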

Furthermore, the software that implements the information processing apparatus in this embodiment is the following program. That is, the program causes a computer to execute: a voice input step of receiving a voice input; a speech recognition step of performing speech recognition processing on the voice received in the voice input step using stored acoustic data, and acquiring speech recognition processing result information, which is the result of that processing; a pronunciation rating step of performing pronunciation rating processing on the voice received in the voice input step using the acoustic data, and acquiring a pronunciation rating result; a determination step of determining whether the pronunciation rating result has a predetermined relationship with a predetermined threshold; and an output step of outputting a speech recognition result for the speech recognition processing result information acquired in the speech recognition step only when the determination step determines that the predetermined relationship holds.

Furthermore, in the above program, the pronunciation rating step preferably comprises: a frame dividing step of dividing the voice received by the voice receiving unit into frames; an optimal state determination step of determining the optimal state of the one or more pieces of frame audio data; a pronunciation interval frame phoneme probability value acquisition step of acquiring, for each pronunciation interval, one or more probability values over all states of the phoneme containing the optimal state of each frame determined in the optimal state determination step; and a rating value calculation step of calculating a rating value of the voice using, as parameters, the one or more probability values acquired for each of the one or more pronunciation intervals in the pronunciation interval frame phoneme probability value acquisition step.

(Embodiment 3)
The speech recognition apparatus in this embodiment differs from those of Embodiments 1 and 2 in the pronunciation rating algorithm used by the pronunciation rating unit. In this embodiment, the apparatus detects silent sections and can rate pronunciation with those sections taken into account. More specifically, the apparatus normally rates pronunciation without considering frames in silent sections.

図１２は、本実施の形態における音声認識装置のブロック図である。 FIG. 12 is a block diagram of the speech recognition apparatus in the present embodiment.

図１２の音声認識装置は、図１の音声認識装置と比較して、発音評定部１２０６のみが異なる。 The speech recognition apparatus in FIG. 12 differs from the speech recognition apparatus in FIG. 1 only in the pronunciation rating unit 1206.

The pronunciation rating unit 1206 comprises frame dividing means 1061, frame audio data acquisition means 1062, optimal state determination means 1063, pronunciation interval probability value acquisition means 1064, silence data storage means 12061, silent section detection means 12062, and rating value calculation means 12063.

The pronunciation rating unit 1206 performs pronunciation rating processing on the voice received by the voice receiving unit 102 using the acoustic data in the acoustic data storage unit 103, and acquires a pronunciation rating result. Normally, the pronunciation rating unit 1206 performs the pronunciation rating processing on the voice received by the voice receiving unit 102 using both the speech recognition processing result information acquired by the speech recognition unit 105 and the acoustic data, and acquires the pronunciation rating result.

発音評定部１２０６は、通常、ＭＰＵやメモリ等から実現され得る。発音評定部１２０６の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The pronunciation rating unit 1206 can be usually realized by an MPU, a memory, or the like. The processing procedure of the pronunciation rating unit 1206 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

The silence data storage means 12061 of the pronunciation rating unit 1206 stores silence data, which is data representing silence. The silence data is preferably, for example, data based on an HMM formed by concatenating per-phoneme hidden Markov models. However, the silence data need not be HMM-based; it may be based on another model, such as a single Gaussian distribution model, a probabilistic model (GMM: Gaussian mixture model), or a statistical model. The silence data storage means 12061 is preferably a non-volatile recording medium, but a volatile recording medium can also be used.

The silent section detection means 12062 detects silent sections based on the input voice data and the silence data in the silence data storage means 12061. More specifically, it detects silent sections based on the frame audio data acquired by the frame audio data acquisition means 1062 and the silence data in the silence data storage means 12061. The silent section detection means 12062 may judge that a piece of frame audio data belongs to a silent section when its similarity to the silence data is equal to or greater than a predetermined value. Alternatively, it may make that judgment when the probability value of the frame acquired by the pronunciation interval probability value acquisition means 1064 (described below) is equal to or less than a predetermined value and the similarity between the frame audio data and the silence data is equal to or greater than a predetermined value.
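The two alternative judgment rules described above can be expressed as a single predicate. This is an illustrative sketch: the function name, the argument names, and the way similarity is measured are assumptions, since the text does not fix a similarity measure at this point.

```python
from typing import Optional


def is_silent_frame(similarity: float, sim_threshold: float,
                    state_prob: Optional[float] = None,
                    prob_threshold: Optional[float] = None) -> bool:
    """Judge whether a frame belongs to a silent section.

    Rule 1: similarity to the silence data >= sim_threshold.
    Rule 2: additionally require the frame's probability value to be
            <= prob_threshold (used when both optional arguments are given).
    """
    similar = similarity >= sim_threshold
    if state_prob is None or prob_threshold is None:
        return similar
    return state_prob <= prob_threshold and similar
```

The second rule is stricter: a frame that resembles silence but is also well explained by the expected phoneme state is not treated as silent.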

The rating value calculation means 12063 calculates the rating value of the voice excluding the silent sections detected by the silent section detection means 12062, using as parameters the one or more probability values acquired for each of the one or more pronunciation intervals by the pronunciation interval probability value acquisition means 1064. How the rating value calculation means 12063 uses these probability values to calculate the rating value does not matter. Here, for example, it is assumed to calculate the rating value by t-DAP. The rating value calculation means 12063 need not exclude silent sections entirely; it may instead calculate the rating value so as to reduce their influence.

無音区間検出手段１２０６２、および評定値算出手段１２０６３は、通常、ＭＰＵやメモリ等から実現され得る。無音区間検出手段１２０６２等の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The silent section detection means 12062 and the rating value calculation means 12063 can be usually realized by an MPU, a memory, or the like. The processing procedure of the silent section detecting means 12062 and the like is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

次に、本音声認識装置の動作について説明する。本音声認識装置は、実施の形態１の音声認識装置と比較して、ステップＳ２０４の発音評定処理のみが異なる。本音声認識装置の発音評定処理について、図１３のフローチャートを用いて説明する。図１３のフローチャートにおいて、図３と異なるステップについてのみ説明する。 Next, the operation of this speech recognition apparatus will be described. This speech recognition apparatus differs from the speech recognition apparatus according to the first embodiment only in the pronunciation rating process in step S204. The pronunciation rating process of this speech recognition apparatus will be described using the flowchart of FIG. In the flowchart of FIG. 13, only steps different from those in FIG. 3 will be described.

（ステップＳ１３０１）発音区間確率値取得手段１０６４は、カウンタｉに１を代入する。 (Step S1301) The pronunciation interval probability value acquisition unit 1064 substitutes 1 for the counter i.

（ステップＳ１３０２）発音区間確率値取得手段１０６４は、ｉ番目のフレーム音声データが存在するか否かを判断する。ｉ番目のフレーム音声データが存在すればステップＳ１３０３に行き、ｉ番目のフレーム音声データが存在しなければステップＳ１３０９に行く。なお、ｉ番目のフレーム音声データは、ステップＳ３０４で取得されたフレーム音声データの中のｉ番目のデータである。 (Step S1302) The sounding section probability value acquisition unit 1064 determines whether or not the i-th frame sound data exists. If the i-th frame sound data exists, the process proceeds to step S1303. If the i-th frame sound data does not exist, the process proceeds to step S1309. Note that the i-th frame audio data is the i-th data in the frame audio data acquired in step S304.

（ステップＳ１３０３）無音区間検出手段１２０６２は、ｉ番目のフレーム音声データの最適状態の確率値を取得する。 (Step S1303) The silent section detecting means 12062 acquires the optimal state probability value of the i-th frame sound data.

(Step S1304) The silent section detection means 12062 determines whether the probability value acquired in step S1303 is lower than (or equal to or lower than) a threshold. If it is, the process goes to step S1305; if it is not, the process goes to step S1307.

（ステップＳ１３０５）無音区間検出手段１２０６２は、無音データ格納手段１２０６１の無音データ、音響データ格納部１０３の全音響データを取得する。 (Step S1305) The silent section detecting unit 12062 acquires the silent data of the silent data storage unit 12061 and all the acoustic data of the acoustic data storage unit 103.

(Step S1306) The silent section detection means 12062 determines whether, for the i-th frame audio data, the probability value of the silence data is the highest. If it is the highest, the process goes to step S1308; if it is not, the process goes to step S1307.

（ステップＳ１３０７）発音区間確率値取得手段１０６４は、ステップＳ１３０３で取得した確率値を一時格納する。 (Step S1307) The pronunciation interval probability value acquisition means 1064 temporarily stores the probability value acquired in Step S1303.

（ステップＳ１３０８）発音区間確率値取得手段１０６４は、カウンタｉを１、インクリメントし、ステップＳ１３０２に行く。 (Step S1308) The sounding section probability value acquisition unit 1064 increments the counter i by 1, and goes to step S1302.

（ステップＳ１３０９）評定値算出手段１２０６３は、ステップＳ１３０７で一時格納された１以上の確率値をパラメータとして音声の評定値を算出し、上位関数にリターンする。 (Step S1309) The rating value calculation means 12063 calculates a voice rating value using the one or more probability values temporarily stored in step S1307 as parameters, and returns to the upper function.
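Steps S1301 to S1309 can be sketched as a single loop. The code below is a simplified illustration: the per-frame check of whether the silence data has the highest probability (steps S1305 and S1306) is assumed to be precomputed and passed in as `silence_is_best`, the final rating is assumed to be a t-DAP-style mean scaled by 100, and all names are hypothetical.

```python
from typing import List, Sequence


def rate_excluding_silence(opt_probs: Sequence[float], silence_is_best: Sequence[bool],
                           prob_threshold: float, scale: float = 100.0) -> float:
    """Average the optimal-state probabilities of all frames not judged silent."""
    kept: List[float] = []
    i = 0                                      # S1301 (0-based here)
    while i < len(opt_probs):                  # S1302: does frame i exist?
        p = opt_probs[i]                       # S1303: optimal-state probability
        low = p < prob_threshold               # S1304: below the threshold?
        if not (low and silence_is_best[i]):   # S1305-S1306: silence most likely?
            kept.append(p)                     # S1307: keep the value
        i += 1                                 # S1308: next frame
    if not kept:
        raise ValueError("every frame was judged silent")
    return scale * sum(kept) / len(kept)       # S1309: rating value
```

A frame is dropped only when both conditions hold, so a low-probability frame that is not best explained by silence still lowers the score.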

In the flowchart of FIG. 13, the rating value calculation means 12063 preferably calculates the score while ignoring the rating values of silent sections, as described above. Alternatively, it may calculate the score with the influence of the rating values of silent sections reduced, for example, to 1/10.
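One way to reduce the influence of silent frames instead of discarding them is a weighted mean. The sketch below assumes that the 1/10 factor is applied as a per-frame weight; the text does not specify how the reduction is carried out, and the names and the scale factor of 100 are illustrative.

```python
from typing import Sequence


def weighted_rating(probs: Sequence[float], silent_flags: Sequence[bool],
                    silence_weight: float = 0.1, scale: float = 100.0) -> float:
    """Weighted mean of frame probabilities with silent frames down-weighted."""
    weights = [silence_weight if silent else 1.0 for silent in silent_flags]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("total weight is zero")
    return scale * sum(w * p for w, p in zip(weights, probs)) / total_weight
```

Setting `silence_weight` to 0 reproduces the behaviour of ignoring silent frames completely.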

以下、本実施の形態における音声認識装置の具体的な動作について説明する。本実施の形態において、無音区間を考慮して評定値を算出するので、評定値の算出アルゴリズムが実施の形態１等とは異なる。そこで、その異なる処理を中心に説明する。 Hereinafter, a specific operation of the speech recognition apparatus in the present embodiment will be described. In the present embodiment, since the rating value is calculated in consideration of the silent section, the rating value calculation algorithm is different from that of the first embodiment. Therefore, the different processing will be mainly described.

まず、ユーザが、開始指示を入力する。次に、ユーザは、例えば、認識対象の音声を発音する。そして、音声受付部１０２は、ユーザが発音した音声の入力を受け付ける。 First, the user inputs a start instruction. Next, for example, the user pronounces the speech to be recognized. Then, the voice receiving unit 102 receives an input of a voice generated by the user.

Next, the optimal state determination means 1063 determines the optimal state of a given frame (the optimal state for feature vector $o_t$) based on each feature vector $o_t$ in the acquired feature vector sequence.

Next, the pronunciation interval probability value acquisition means 1064 calculates the forward likelihood and backward likelihood of every state in every frame. It then uses Equation 1 to calculate, frame by frame for all frames, the probability value of the optimal state, $\gamma_t(q_t^{*})$.

さらに、無音区間検出手段１２０６２は、フレーム音声データごとに、最適状態の確率値を取得し、取得した確率値が所定の値より低いか否かを判断する。 Furthermore, the silent section detection means 12062 acquires the probability value of the optimal state for each frame audio data, and determines whether or not the acquired probability value is lower than a predetermined value.

When the silent section detection means 12062 determines that the acquired probability value is lower than the predetermined value, it determines whether the posterior probability (DAP score) indicating the similarity between the frame audio data and the silence data is higher than every posterior probability indicating the similarity between the frame audio data and other data. If the posterior probability for the silence data is the highest, the frame is treated as belonging to a silent section and the pronunciation rating unit 1206 ignores it. The rating value calculation means 12063 then calculates the t-DAP score with the silent sections excluded. Here, the rating value calculation means 12063 is assumed to have calculated a t-DAP score of “75”.

次に、判断部１０７は、予め格納している閾値「６０」と、発音評定結果「７５」を比較し、所定の条件を満たすか否かを判断する。そして、「７５＞６０」であるので、判断部１０７は、所定の条件を満たす、と判断する。 Next, the determination unit 107 compares the threshold value “60” stored in advance with the pronunciation evaluation result “75” to determine whether or not a predetermined condition is satisfied. Since “75> 60”, the determination unit 107 determines that a predetermined condition is satisfied.

次に、出力部１０８は、音声認識部１０５が取得した音声認識処理結果情報（例えば、「ｒｉｇｈｔ」）を出力する。 Next, the output unit 108 outputs the speech recognition processing result information (for example, “right”) acquired by the speech recognition unit 105.

かかる出力処理は、実施の形態１における処理と同様である。 Such output processing is the same as the processing in the first embodiment.

As described above, this embodiment provides a speech recognition apparatus that rates pronunciation and outputs the speech recognition result only when a predetermined rating result is obtained (that is, only when the rating is good). This processing reduces the possibility of outputting an erroneous recognition result and yields a highly reliable speech recognition apparatus. The user's trust in the apparatus is also preserved, so the user can use it with confidence. Furthermore, because this embodiment decides whether to output the speech recognition result based on a pronunciation rating that excludes silent sections, it can output correct recognition results with even higher accuracy.

Note that in this embodiment, any pronunciation rating algorithm may be used in the rating value calculation means. The algorithm may be t-DAP, t-p-DAP, or another algorithm.

In the pronunciation rating step of the above program, it is also preferable that the program further comprises a silent section detection step of detecting silent sections based on the input voice data and stored silence data, and that the optimal state determination step determines the optimal state of the one or more pieces of frame audio data excluding the frame audio data of the silent sections.

また、上記各実施の形態において、各処理（各機能）は、単一の装置（システム）によって集中処理されることによって実現されてもよく、あるいは、複数の装置によって分散処理されることによって実現されてもよい。 In each of the above embodiments, each process (each function) may be realized by centralized processing by a single device (system), or by distributed processing by a plurality of devices. May be.

また、図１４は、本明細書で述べたプログラムを実行して、上述した種々の実施の形態の音声認識装置を実現するコンピュータの外観を示す。上述の実施の形態は、コンピュータハードウェア及びその上で実行されるコンピュータプログラムで実現され得る。図１４は、このコンピュータシステム３４０の概観図であり、図１５は、コンピュータシステム３４０のブロック図である。 FIG. 14 shows the external appearance of a computer that executes the programs described in this specification to realize the speech recognition apparatuses according to the various embodiments described above. The above-described embodiments can be realized by computer hardware and a computer program executed thereon. FIG. 14 is an overview diagram of the computer system 340, and FIG. 15 is a block diagram of the computer system 340.

In FIG. 14, the computer system 340 includes a computer 341 containing an FD (Flexible Disk) drive and a CD-ROM (Compact Disk Read Only Memory) drive, a keyboard 342, a mouse 343, a monitor 344, and a microphone 345.

図１５において、コンピュータ３４１は、ＦＤドライブ３４１１、ＣＤ−ＲＯＭドライブ３４１２に加えて、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）３４１３と、ＣＰＵ３４１３、ＣＤ−ＲＯＭドライブ３４１２及びＦＤドライブ３４１１に接続されたバス３４１４と、ブートアッププログラム等のプログラムを記憶するためのＲＯＭ（Ｒｅａｄ−ＯｎｌｙＭｅｍｏｒｙ）３４１５と、ＣＰＵ３４１３に接続され、アプリケーションプログラムの命令を一時的に記憶するとともに一時記憶空間を提供するためのＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）３４１６と、アプリケーションプログラム、システムプログラム、及びデータを記憶するためのハードディスク３４１７とを含む。ここでは、図示しないが、コンピュータ３４１は、さらに、ＬＡＮへの接続を提供するネットワークカードを含んでも良い。 In FIG. 15, in addition to the FD drive 3411 and the CD-ROM drive 3412, the computer 341 includes a CPU (Central Processing Unit) 3413, a bus 3414 connected to the CPU 3413, the CD-ROM drive 3412, and the FD drive 3411, and a boot. A ROM (Read-Only Memory) 3415 for storing a program such as an up program, and a RAM (Random Access Memory) connected to the CPU 3413 for temporarily storing instructions of an application program and providing a temporary storage space 3416 and a hard disk 3417 for storing application programs, system programs, and data. Although not shown here, the computer 341 may further include a network card that provides connection to the LAN.

コンピュータシステム３４０に、上述した実施の形態の音声処理装置の機能を実行させるプログラムは、ＣＤ−ＲＯＭ３５０１、またはＦＤ３５０２に記憶されて、ＣＤ−ＲＯＭドライブ３４１２またはＦＤドライブ３４１１に挿入され、さらにハードディスク３４１７に転送されても良い。これに代えて、プログラムは、図示しないネットワークを介してコンピュータ３４１に送信され、ハードディスク３４１７に記憶されても良い。プログラムは実行の際にＲＡＭ３４１６にロードされる。プログラムは、ＣＤ−ＲＯＭ３５０１、ＦＤ３５０２またはネットワークから直接、ロードされても良い。 A program that causes the computer system 340 to execute the functions of the sound processing apparatus according to the above-described embodiment is stored in the CD-ROM 3501 or the FD 3502, inserted into the CD-ROM drive 3412 or the FD drive 3411, and further stored in the hard disk 3417. May be forwarded. Alternatively, the program may be transmitted to the computer 341 via a network (not shown) and stored in the hard disk 3417. The program is loaded into the RAM 3416 at the time of execution. The program may be loaded directly from the CD-ROM 3501, the FD 3502, or the network.

The program need not include an operating system (OS), third-party programs, or the like that cause the computer 341 to execute the functions of the voice processing apparatus of the above embodiments. The program need only contain those instructions that call the appropriate functions (modules) in a controlled manner so that the desired result is obtained. How the computer system 340 operates is well known, so a detailed description is omitted.

また、上記プログラムを実行するコンピュータは、単数であってもよく、複数であってもよい。すなわち、集中処理を行ってもよく、あるいは分散処理を行ってもよい。 Further, the computer that executes the program may be singular or plural. That is, centralized processing may be performed, or distributed processing may be performed.

本発明は、以上の実施の形態に限定されることなく、種々の変更が可能であり、それらも本発明の範囲内に包含されるものであることは言うまでもない。 The present invention is not limited to the above-described embodiments, and various modifications are possible, and it goes without saying that these are also included in the scope of the present invention.

以上のように、本発明にかかる音声認識装置は、誤った認識結果を出力することを防止できるという効果を有し、高性能な音声認識装置等として有用である。 As described above, the speech recognition apparatus according to the present invention has an effect of preventing an erroneous recognition result from being output, and is useful as a high-performance speech recognition apparatus or the like.

FIG. 1: Block diagram of the speech recognition apparatus according to Embodiment 1
FIG. 2: Flowchart explaining the operation of the speech recognition apparatus
FIG. 3: Flowchart explaining the pronunciation rating process
FIG. 4: Diagram showing the HMM specifications
FIG. 5: Diagram showing the speech analysis conditions
FIG. 6: Diagram showing DAP scores
FIG. 7: Diagram showing DAP scores
FIG. 8: Diagram showing an output example of t-DAP scores
FIG. 9: Block diagram of the speech recognition apparatus according to Embodiment 2
FIG. 10: Flowchart explaining the operation of the pronunciation rating process
FIG. 11: Diagram showing t-p-DAP scores
FIG. 12: Block diagram of the speech recognition apparatus according to Embodiment 3
FIG. 13: Flowchart explaining the operation of the pronunciation rating process
FIG. 14: External view of a computer that implements the speech recognition apparatus
FIG. 15: Block diagram of a computer system that implements the speech recognition apparatus

Explanation of symbols

101 Receiving unit
102 Voice receiving unit
103 Acoustic data storage unit
104 Recognition candidate data storage unit
105 Speech recognition unit
106, 906, 1206 Pronunciation rating unit
107 Determination unit
108 Output unit
1061 Frame dividing means
1062 Frame audio data acquisition means
1063 Optimal state determination means
1064 Pronunciation interval probability value acquisition means
1065, 9062, 12063 Rating value calculation means
9061 Pronunciation interval frame phoneme probability value acquisition means
12061 Silence data storage means
12062 Silent section detection means

Claims

1. A speech recognition apparatus comprising:
an acoustic data storage unit that stores acoustic data, which is data relating to the target speech to be compared;
a voice input unit that receives a voice input;
a speech recognition unit that performs speech recognition processing on the voice received by the voice input unit using the acoustic data, and acquires speech recognition processing result information, which is the result of the speech recognition processing;
a pronunciation rating unit that performs pronunciation rating processing on the voice received by the voice input unit using the acoustic data, and acquires a pronunciation rating result;
a determination unit that determines whether the pronunciation rating result has a predetermined relationship with a predetermined threshold; and
an output unit that outputs a speech recognition result for the speech recognition processing result information acquired by the speech recognition unit only when the determination unit determines that the predetermined relationship holds.

2. The speech recognition apparatus according to claim 1, wherein the pronunciation rating unit performs pronunciation rating processing on the voice received by the voice input unit using the speech recognition processing result information acquired by the speech recognition unit and the acoustic data, and acquires the pronunciation rating result.

The speech recognition apparatus, wherein the pronunciation rating unit comprises:
frame dividing means for dividing the voice received by the voice input unit into frames;
frame audio data acquisition means for obtaining one or more frame audio data, which is audio data for each of the divided frames;
optimal state determining means for determining an optimal state of the one or more frame audio data;
pronunciation interval probability value acquisition means for acquiring, for each pronunciation interval, the probability value of the optimal state determined by the optimal state determining means; and
rating value calculation means for calculating a rating value of the voice using, as parameters, the one or more probability values for each of the one or more pronunciation intervals acquired by the pronunciation interval probability value acquisition means.
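The data flow in this claim (frames, optimal-state probability per frame, grouping by pronunciation interval, rating value) can be sketched in Python as follows. This is an illustration under stated assumptions, not the formula of the specification: the optimal state of each frame is assumed to have been determined upstream (for example by Viterbi alignment against the acoustic data), and the rating is assumed to be the mean of the per-interval mean probabilities scaled to 0-100. All function and parameter names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def split_into_frames(samples: Sequence[float],
                      frame_len: int, hop: int) -> List[List[float]]:
    """Divide a sample sequence into fixed-length frames."""
    return [list(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, hop)]


def group_by_interval(frame_probs: Sequence[float],
                      frame_intervals: Sequence[int]) -> Dict[int, List[float]]:
    """Collect the optimal-state probability of each frame per
    pronunciation interval (interval index supplied per frame)."""
    grouped: Dict[int, List[float]] = defaultdict(list)
    for prob, interval in zip(frame_probs, frame_intervals):
        grouped[interval].append(prob)
    return dict(grouped)


def rating_value(frame_probs: Sequence[float],
                 frame_intervals: Sequence[int]) -> float:
    """Assumed rating: average the probabilities inside each interval,
    then average across intervals and scale to 0-100."""
    grouped = group_by_interval(frame_probs, frame_intervals)
    if not grouped:
        raise ValueError("no frames to rate")
    per_interval = [sum(p) / len(p) for p in grouped.values()]
    return 100.0 * sum(per_interval) / len(per_interval)


if __name__ == "__main__":
    # Four frames: the first two belong to interval 0, the rest to interval 1.
    print(rating_value([0.5, 0.25, 1.0, 0.5], [0, 0, 1, 1]))
```

Averaging per interval first keeps a long interval from dominating the rating merely because it spans more frames; other aggregation choices are equally compatible with the claim wording.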

The speech recognition apparatus, wherein the pronunciation rating unit comprises:
frame dividing means for dividing the voice received by the voice input unit into frames;
optimal state determining means for determining an optimal state of the one or more frame audio data;
pronunciation interval frame phoneme probability value acquisition means for acquiring, for each pronunciation interval, one or more probability values over all states of the phoneme that has the optimal state of each frame determined by the optimal state determining means; and
rating value calculation means for calculating a rating value of the voice using, as parameters, the one or more probability values for each of the one or more pronunciation intervals acquired by the pronunciation interval frame phoneme probability value acquisition means.

The speech recognition apparatus according to claim 3 or 4, wherein the pronunciation rating unit further comprises:
silence data storage means for storing silence data, which is data indicating silence; and
silent section detecting means for detecting a silent section based on the input voice data and the silence data,
and wherein the optimal state determining means determines an optimal state of the one or more frame audio data excluding the frame audio data for the silent section.
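The exclusion of silent frames described in this claim can be sketched as below. The sketch assumes that comparing the input against the silence data yields, for each frame, a probability that the frame is silence, and that a frame is treated as silent when this probability reaches a cutoff. Both the per-frame probability input and the `silence_threshold` parameter are hypothetical; the claim only states that detection is based on the input voice data and the silence data.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def detect_silent_frames(silence_probs: Sequence[float],
                         silence_threshold: float) -> List[bool]:
    """Flag a frame as silent when its (assumed) probability under the
    silence data is at or above the cutoff."""
    return [p >= silence_threshold for p in silence_probs]


def drop_silent_frames(frames: Sequence[T],
                       silent_flags: Sequence[bool]) -> List[T]:
    """Keep only the frames outside the detected silent sections, so that
    later optimal-state determination ignores silence."""
    if len(frames) != len(silent_flags):
        raise ValueError("one flag is required per frame")
    return [f for f, silent in zip(frames, silent_flags) if not silent]


if __name__ == "__main__":
    flags = detect_silent_frames([0.9, 0.1, 0.2, 0.95], silence_threshold=0.5)
    print(drop_silent_frames(["f0", "f1", "f2", "f3"], flags))
```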

The speech recognition apparatus according to any one of claims 1 to 5, wherein the acoustic data is data relating to a target speech to be compared, and is data based on a concatenation of hidden Markov models (HMMs) for the individual phonemes.

A program for causing a computer to execute:
a voice input step of receiving voice input;
a speech recognition step of performing speech recognition processing on the voice received in the voice input step using stored acoustic data, and acquiring speech recognition processing result information, which is a result of the speech recognition processing;
a pronunciation rating step of performing pronunciation rating processing on the voice received in the voice input step using the acoustic data, and acquiring a pronunciation rating result;
a determination step of determining whether or not the pronunciation rating result has a predetermined relationship with a predetermined threshold; and
an output step of outputting a speech recognition result for the speech recognition processing result information acquired in the speech recognition step only when it is determined in the determination step that the predetermined relationship exists.