JP2012252026A - Voice recognition device, voice recognition method, and voice recognition program

Info

Publication number: JP2012252026A
Application number: JP2011122054A
Authority: JP
Inventors: Yusuke Nakajima; 悠輔中島; Kosuke Tsujino; 孝輔辻野; Shinya Iizuka; 真也飯塚; Masayuki Tanabe; 正幸田邉; Takeshi Nakabo; 壯中坊
Original assignee: ATR-TREK CO Ltd; NTT Docomo Inc
Current assignee: ATR-TREK CO Ltd; NTT Docomo Inc
Priority date: 2011-05-31
Filing date: 2011-05-31
Publication date: 2012-12-20
Anticipated expiration: 2031-05-31
Also published as: JP5752488B2

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of voice recognition. SOLUTION: A voice recognition device 100 comprises: a voice processor 130 that extracts the words included in voice data and calculates a reliability degree for each word; a speaker degree calculating section 140 that calculates, for each word, a speaker degree indicating how likely the voice data corresponding to the word is to be the voice of the speaker; a score calculating section 150 that calculates a score for each word on the basis of the reliability degree calculated by the voice processor 130 and the speaker degree calculated by the speaker degree calculating section 140; a threshold setting section 160 that sets a predetermined threshold; and a word selecting section 170 that selects the words to be employed as the voice recognition result on the basis of the score of each word and the predetermined threshold.

Description

The present invention relates to a speech recognition device, a speech recognition method, and a speech recognition program.

Techniques have conventionally been proposed for performing speech recognition by a statistical method that uses an acoustic model of human speech and a language model (see, for example, Patent Document 1).

Patent Document 1: JP 2008-58503 A

However, with the technique described in the above patent document, when noise other than the speaker's voice contains human speech, a word extracted from the human speech in the noise may be erroneously adopted as the speech recognition result. This lowers speech recognition accuracy.

The present invention has been made to solve the above problem, and its object is to provide a speech recognition device, a speech recognition method, and a speech recognition program that can improve speech recognition accuracy.

A speech recognition device according to the present invention comprises: speech processing means for extracting the words contained in sound data and outputting a reliability for each word; speaker degree calculating means for calculating, for each word, a speaker degree indicating how likely the sound data corresponding to the word is to be the speaker's voice; score calculating means for calculating a score for each word based on the reliability of each word output by the speech processing means and the speaker degree of each word calculated by the speaker degree calculating means; and word selecting means for selecting the words to be adopted as the speech recognition result based on the score of each word and a predetermined threshold.

A speech recognition method according to the present invention is a method executed by a speech recognition device and comprises: a speech processing step of extracting the words contained in sound data and outputting a reliability for each word; a speaker degree calculating step of calculating, for each word, a speaker degree indicating how likely the sound data corresponding to the word is to be the speaker's voice; a score calculating step of calculating a score for each word based on the reliability of each word output in the speech processing step and the speaker degree of each word calculated in the speaker degree calculating step; and a word selecting step of selecting the words to be adopted as the speech recognition result based on the score of each word and a predetermined threshold.

A speech recognition program according to the present invention causes a computer to function as: speech processing means for extracting the words contained in sound data and outputting a reliability for each word; speaker degree calculating means for calculating, for each word, a speaker degree indicating how likely the sound data corresponding to the word is to be the speaker's voice; score calculating means for calculating a score for each word based on the reliability of each word output by the speech processing means and the speaker degree of each word calculated by the speaker degree calculating means; and word selecting means for selecting the words to be adopted as the speech recognition result based on the score of each word and a predetermined threshold.

In the above speech recognition device, method, and program, the words contained in sound data are extracted, a reliability is output for each word, and a speaker degree indicating how likely the sound data corresponding to each word is to be the speaker's voice is calculated. A score is then calculated for each word from its reliability and its speaker degree, and the words to be adopted as the speech recognition result are selected based on the score of each word and a predetermined threshold. Because the score reflects the speaker degree in addition to the reliability, the scores of words extracted from the speaker's voice are clearly separated from the scores of words extracted from noise, and speech recognition accuracy improves.

The speaker degree calculating means may calculate the speaker degree of each word using at least one of the volume of the word, the likelihood of the word under a speech model, the likelihood of the word under a noise model, the spatial transfer characteristic of the word, the fundamental frequency of the word, and the voice quality of the word. Using at least one of these quantities makes the difference between the speaker degree of a word extracted from the speaker's voice and that of a word extracted from noise clearer, which further improves speech recognition accuracy.

The speaker degree calculating means may also calculate the speaker degree of each word using the volume of the word within the frequency band of human speech. In this case, the difference in volume between a word extracted from the speaker's voice and a word extracted from noise that contains much sound outside the human speech band becomes clearer, and so does the difference in speaker degree. Speech recognition accuracy can therefore be further improved.

The speech recognition device may further comprise threshold setting means for setting the predetermined threshold based on the maximum of the scores calculated by the score calculating means, and the word selecting means may select the words to be adopted as the speech recognition result using the threshold set by the threshold setting means. In this case, word selection adapts to an overall rise or fall in the scores of all words caused by the sound collection environment, which suppresses environment-dependent variation in the speech recognition result.

The speech processing means may output a relative reliability for each word, taking the maximum reliability among the words as the reference, and the score calculating means may calculate the score of each word using this relative reliability. In this case, the score adapts to an overall rise or fall in the reliabilities of all words caused by the sound collection environment. Environment-dependent variation in the speech recognition result can therefore be further suppressed.

The speaker degree calculating means may calculate a relative speaker degree for each word, taking the maximum speaker degree among the words as the reference, and the score calculating means may calculate the score of each word using this relative speaker degree. In this case, the score adapts to an overall rise or fall in the speaker degrees of all words caused by the sound collection environment. Environment-dependent variation in the speech recognition result can therefore be further suppressed.

The word selecting means may, after deleting the words that are not to be adopted as the speech recognition result based on the score of each word and the predetermined threshold, further delete any word that is discontinuous with the word string containing the word with the maximum score, and adopt the remaining words as the speech recognition result. In this case, erroneous adoption of words extracted from noise is further reduced, and speech recognition accuracy is further improved.
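The two-stage deletion described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that "discontinuous" means "not adjacent, in word order, to the run of surviving words that contains the top-scoring word", and the function name `select_contiguous` is hypothetical.

```python
def select_contiguous(scores, threshold):
    """Return indices of the words kept after both deletion stages.

    Assumption: a word is "continuous" with the best word if every word
    between them (in utterance order) also survived the threshold test.
    """
    if not scores:
        return []
    # Stage 1: keep only words whose score exceeds the threshold.
    kept = [s > threshold for s in scores]
    # Locate the word with the maximum score.
    best = max(range(len(scores)), key=lambda i: scores[i])
    if not kept[best]:
        return []
    # Stage 2: grow the run of kept words around the best word.
    lo = hi = best
    while lo > 0 and kept[lo - 1]:
        lo -= 1
    while hi < len(scores) - 1 and kept[hi + 1]:
        hi += 1
    return list(range(lo, hi + 1))
```

With the illustrative scores `[-29.0, -17.3, -33.2, -5.6, -7.5, -13.1, -23.5, -8.0]` and a threshold of `-9.6`, the last word passes the threshold but is separated from the best-scoring run, so only indices 3 and 4 are returned.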

The speech recognition device, speech recognition method, and speech recognition program according to the present invention can improve speech recognition accuracy.

A block diagram showing the functions of one embodiment of a speech recognition device employing the speech recognition method according to the present invention.
A block diagram showing the hardware configuration of the speech recognition device of FIG. 1.
A flowchart showing the speech recognition procedure performed by the speech recognition device of FIG. 1.
A diagram showing an example of a speech recognition result produced by the speech recognition device of FIG. 1.
A diagram showing another example of a speech recognition result produced by the speech recognition device of FIG. 1.
A block diagram showing another hardware configuration of the speech recognition device of FIG. 1.
A block diagram showing the structure of one embodiment of a speech recognition program employing the speech recognition method according to the present invention.

Embodiments of a speech recognition device and a speech recognition program employing the speech recognition method according to the present invention are described below.

FIG. 1 is a block diagram showing the functions of a speech recognition device 100 that employs the speech recognition method according to the present invention. The speech recognition device 100 is used, for example, to allow data to be entered into an application by voice.

As shown in FIG. 1, the speech recognition device 100 includes a sound data input unit 110, a feature calculation unit 120, a speech processing unit 130, an acoustic model holding unit 131, a language model holding unit 132, a dictionary data holding unit 133, a speaker degree calculation unit 140, a score calculation unit 150, a threshold setting unit 160, a threshold holding unit 161, a word selection unit 170, and a speech recognition result output unit 180.

The sound data input unit 110 acquires sound data, for example through a microphone.

The feature calculation unit 120 calculates feature data representing acoustic features from the sound data for each time interval (frame) of, for example, 10 ms. The feature data is a spectrum used for speech recognition, for example frequency-domain data such as MFCCs (Mel-Frequency Cepstral Coefficients).
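The frame-by-frame processing can be sketched as below. This is only an illustration under stated assumptions: frames are non-overlapping (the source does not say whether frames overlap), the per-frame log energy is a stand-in feature rather than MFCCs, and the function names are hypothetical.

```python
import numpy as np

def split_into_frames(samples, sample_rate, frame_ms=10):
    """Cut a 1-D signal into consecutive, non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    samples = np.asarray(samples, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    return samples[: n_frames * frame_len].reshape(n_frames, frame_len)

def log_energy(frames):
    """Stand-in feature: natural log of the energy of each frame."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-12)

# One second of a 440 Hz tone sampled at 16 kHz gives 100 frames of 160 samples.
t = np.arange(16000) / 16000.0
frames = split_into_frames(np.sin(2 * np.pi * 440 * t), 16000)
features = log_energy(frames)
```

A real front end would replace `log_energy` with an MFCC computation; the framing step is the part that corresponds to the 10 ms interval mentioned in the text.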

The speech processing unit 130 refers to the feature data calculated by the feature calculation unit 120 and to the data stored in the acoustic model holding unit 131, the language model holding unit 132, and the dictionary data holding unit 133, extracts the words contained in the sound data acquired by the sound data input unit 110, and outputs a reliability for each word.

The acoustic model holding unit 131 stores phonemes in association with their spectra. The language model holding unit 132 stores statistical information indicating the chain probabilities of words, characters, and the like. The dictionary data holding unit 133 stores word data, for example each word together with the phonemes or phonetic symbols that represent its pronunciation.

The speaker degree calculation unit 140 calculates, from the sound data or feature data corresponding to each word, a speaker degree indicating how likely the sound data corresponding to the word is to be the speaker's voice.

The score calculation unit 150 calculates a score for each word based on the reliability of the word output by the speech processing unit 130 and the speaker degree of the word calculated by the speaker degree calculation unit 140. This embodiment describes an example in which the score increases as the reliability and the speaker degree increase. Alternatively, the score may be defined so that it decreases as the reliability and the speaker degree increase.

The threshold setting unit 160 refers to the highest score calculated by the score calculation unit 150 and to the data stored in the threshold holding unit 161, and sets the threshold used to select the words to be adopted as the speech recognition result. The threshold holding unit 161 stores relative threshold data, for example the difference between the threshold and the highest score. The threshold setting unit 160 sets the threshold by adding the difference stored in the threshold holding unit 161 to the highest score calculated by the score calculation unit 150. The threshold used for word selection therefore changes with the highest score calculated by the score calculation unit 150.

The word selection unit 170 selects the words to be adopted as the speech recognition result based on the score of each word calculated by the score calculation unit 150 and the threshold set by the threshold setting unit 160. In this embodiment, words whose score is higher than the threshold set by the threshold setting unit 160 are selected.

The speech recognition result output unit 180 outputs the words selected by the word selection unit 170, for example by displaying them on the display screen of an application.

FIG. 2 is a block diagram showing the hardware configuration of the speech recognition device 100. As hardware, the speech recognition device 100 includes a CPU 11, a RAM 12, a ROM 13, an input device 14, an auxiliary storage device 15, a communication device 16, an output device 17, and a reading device 18 for a storage medium 18a. The functions of the units of the speech recognition device 100 described above are realized by loading programs, data, and the like from the auxiliary storage device 15, the reading device 18, or the like into the RAM 12 or the like and having the CPU 11 execute the programs. The input device 14 is, for example, a microphone constituting the sound data input unit 110, and the output device 17 is, for example, a monitor constituting the speech recognition result output unit 180.

FIG. 3 is a flowchart showing the speech recognition procedure executed by the speech recognition device 100. First, the sound data input unit 110 acquires sound data (step S10), and the feature calculation unit 120 calculates feature data from the sound data for each frame (step S20).

Next, the speech processing unit 130 applies statistical processing to the feature data, extracts the words contained in the sound data, and outputs a reliability for each word (step S30). As a specific example, a plurality of candidate word strings (hereinafter, hypotheses) are first listed (an N-best list), and the reliability of each word making up each hypothesis is calculated. The reliability of each hypothesis is then calculated from the reliabilities of its words, and the hypothesis with the highest reliability is selected. The words making up the selected hypothesis are the result of extracting the words contained in the sound data. For the selected hypothesis, the reliability of each word is output together with data such as the word boundaries, the phoneme string of each word, the written form of each word, the reading of each word, the part-of-speech information of each word, the time information of each word, and the dependency information of each word. The time information of a word is expressed, for example, by the start frame number and end frame number of the sound data corresponding to the word. The reliability of a word indicates how likely the word is to be correct and is calculated from, for example, the likelihood of the word under the acoustic model and its likelihood under the language model. In this embodiment, GWPP (Generalized Word Posterior Probability), for example, is calculated as the reliability of each word.
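The hypothesis selection step can be illustrated with a short sketch. The source does not state how the reliability of a hypothesis is derived from the reliabilities of its words, so the geometric mean used here is purely an assumption, as are the function names and the sample N-best list.

```python
import math

def hypothesis_reliability(word_reliabilities):
    """Assumed aggregation: geometric mean of per-word reliabilities (all > 0)."""
    logs = [math.log(p) for p in word_reliabilities]
    return math.exp(sum(logs) / len(logs))

def pick_best_hypothesis(nbest):
    """Return the hypothesis (list of (word, reliability)) with the highest reliability."""
    return max(nbest, key=lambda hyp: hypothesis_reliability([p for _, p in hyp]))

# Hypothetical N-best list with two candidate word strings.
nbest = [
    [("Hiroshima", 0.6), ("okonomiyaki", 0.7)],
    [("Hiroshima", 0.6), ("okonomi", 0.2), ("yaki", 0.3)],
]
best = pick_best_hypothesis(nbest)
```

Any other monotone aggregation (for example the product or the minimum) could be substituted in `hypothesis_reliability` without changing the overall flow.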

Next, the speaker degree calculation unit 140 calculates the speaker degree of each word extracted by the speech processing unit 130 (step S40). In this embodiment, the speaker degree of each word is calculated from the volume of the word. As a specific example, the sound data corresponding to the time information of each word is cut out for each word, and from it the sound data within the frequency band of human speech is cut out. The frequency band of human speech is the band in which the amplitude of human speech is high, for example 300 Hz to 3.4 kHz. The volume is then calculated from the cut-out sound data, and
speaker degree = volume
is set. The volume may be calculated, for example, as the time average of the amplitude of the sound data or as the maximum of the amplitude of the sound data. In this embodiment, a relative speaker degree is also calculated for each word, taking the maximum speaker degree among the words as the reference. As a specific example, the difference between the speaker degree of each word and the maximum speaker degree among the words is calculated. The volume may be calculated from the feature data instead of the sound data, or from both the sound data and the feature data.
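A minimal sketch of this step follows, under several assumptions: the band limitation is done by zeroing FFT bins (the source does not name a filtering method), the volume is the time average of the absolute amplitude on a linear scale (the source states no unit), and the function names are hypothetical.

```python
import numpy as np

def band_volume(samples, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Time-averaged absolute amplitude of the signal restricted to [low_hz, high_hz]."""
    samples = np.asarray(samples, dtype=float)
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    # Discard everything outside the assumed human-speech band.
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    band = np.fft.irfft(spectrum, n=len(samples))
    return float(np.mean(np.abs(band)))

def relative_speaker_degrees(degrees):
    """Difference between each speaker degree and the maximum speaker degree."""
    degrees = np.asarray(degrees, dtype=float)
    return degrees - degrees.max()

# A 1 kHz tone lies inside the band; a 100 Hz tone is removed almost entirely.
sr = 8000
t = np.arange(sr) / sr
in_band = band_volume(np.sin(2 * np.pi * 1000 * t), sr)
out_of_band = band_volume(np.sin(2 * np.pi * 100 * t), sr)
```

Because the relative speaker degree is a difference from the maximum, it is 0 for the loudest word and negative for every other word.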

Next, the score calculation unit 150 calculates the score of each word (step S50). In this embodiment, the score is calculated from the GWPP of each word and the relative speaker degree of each word, for example by
score = 10 log₁₀(GWPP) + relative speaker degree
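The score formula translates directly into code. The sketch below assumes GWPP is a probability in (0, 1]; the function name `word_score` and the input check are illustrative additions, not part of the source.

```python
import math

def word_score(gwpp, relative_speaker_degree):
    """Score = 10 * log10(GWPP) + relative speaker degree."""
    if gwpp <= 0.0:
        # log10 is undefined for non-positive values; the source does not
        # say how such a case would be handled.
        raise ValueError("GWPP must be positive")
    return 10.0 * math.log10(gwpp) + relative_speaker_degree
```

Since GWPP is at most 1, the first term is never positive, and since the relative speaker degree is at most 0, the score of every word is at most 0 under these assumptions.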

Next, the threshold setting unit 160 sets the threshold used to select the words to be adopted as the speech recognition result (step S60). In this embodiment, the threshold is calculated, for example, by
threshold = maximum score + relative threshold

Next, the word selection unit 170 selects the words whose score is higher than the threshold (step S70), and the speech recognition result output unit 180 outputs the selected words (step S80).
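Steps S60 and S70 can be sketched together as follows. The function names and the sample words and scores are hypothetical; the relative threshold is assumed to be a non-positive offset stored in advance.

```python
def set_threshold(scores, relative_threshold):
    """Threshold = maximum score + relative threshold (step S60)."""
    return max(scores) + relative_threshold

def select_words(words, scores, relative_threshold):
    """Keep the words whose score is strictly higher than the threshold (step S70)."""
    threshold = set_threshold(scores, relative_threshold)
    return [w for w, s in zip(words, scores) if s > threshold]

# Hypothetical input: three words with scores -3, -5, and -12.
selected = select_words(["a", "b", "c"], [-3.0, -5.0, -12.0], -4.0)
```

Because the threshold moves with the maximum score, shifting every score by the same amount leaves the selection unchanged, which is the environment-independence the text describes.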

FIG. 4 shows the speech recognition result obtained when, in an environment where the noise contains human speech, the speaker says "Hiroshima, okonomiyaki" toward the microphone or the like constituting the sound data input unit 110. In this example, the threshold holding unit 161 stores -4 as the relative threshold data. The extracted words are "love" (恋), "wait" (待って), "masu" (ます), "Hiroshima" (広島), "okonomiyaki" (お好み焼き), "jewelry" (ジュエリー), "shobo" (書房), and "shareholder" (株主). For these words, in this order, the calculated GWPP values are 0.008, 0.059, 0.03, 0.554, 0.708, 0.049, 0.014, and 0.57. The calculated speaker degrees are -10, -7, -20, -5, -8, -2, -7, and -11. Since the maximum speaker degree is -2, the relative speaker degrees are -8, -5, -18, -3, -6, 0, -5, and -9. The resulting scores are -29, -17.3, -33.2, -5.6, -7.5, -13.1, -23.5, and -11.4. Since the maximum score is -5.6, the threshold is
-5.6 - 4 = -9.6
"Hiroshima" and "okonomiyaki", whose scores are higher than this threshold, are adopted as the speech recognition result, and the other words are rejected.
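The figures quoted for FIG. 4 can be checked numerically. The sketch below uses only the GWPP values, speaker degrees, and relative threshold given in the text; the English word labels and variable names are illustrative, and the threshold is computed from the unrounded maximum score, which is an assumption (the text uses the rounded value -5.6).

```python
import numpy as np

words = ["love", "wait", "masu", "Hiroshima", "okonomiyaki",
         "jewelry", "shobo", "shareholder"]
gwpp = np.array([0.008, 0.059, 0.03, 0.554, 0.708, 0.049, 0.014, 0.57])
speaker_degree = np.array([-10, -7, -20, -5, -8, -2, -7, -11], dtype=float)
relative_threshold = -4.0

# Relative speaker degree: difference from the maximum speaker degree (-2).
relative_degree = speaker_degree - speaker_degree.max()
# Score = 10 * log10(GWPP) + relative speaker degree.
score = 10.0 * np.log10(gwpp) + relative_degree
# Threshold = maximum score + relative threshold.
threshold = float(score.max()) + relative_threshold
# Adopt the words whose score is strictly higher than the threshold.
adopted = [w for w, s in zip(words, score) if s > threshold]

for w, s in zip(words, score):
    print(f"{w:12s} {s:6.1f}")
print("threshold:", round(threshold, 1), "adopted:", adopted)
```

Rounded to one decimal place, the computed scores and threshold agree with the values in the text, and only "Hiroshima" and "okonomiyaki" exceed the threshold.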

In the example of FIG. 4, the GWPP of "shareholder", a word extracted from noise, is 0.57, while the GWPP of "Hiroshima", a word extracted from the speaker's voice, is 0.554. That is, the reliability of the word extracted from noise is higher than the reliability of the word extracted from the speaker's voice. In contrast, the score of "shareholder" is -11.4 and the score of "Hiroshima" is -5.6, so the score of the word extracted from noise is lower than the score of the word extracted from the speaker's voice. As a result, "shareholder" is not adopted as the speech recognition result even though its reliability was higher than that of "Hiroshima", and only "Hiroshima" and "okonomiyaki", which were extracted from the speaker's voice, are adopted. Thus, by calculating the score of each word from its speaker degree in addition to its reliability, the speech recognition device 100 clearly separates the scores of words extracted from the speaker's voice from the scores of words extracted from noise, and speech recognition accuracy improves. In particular, because the speaker degree calculation unit 140 calculates the speaker degree from the volume of the sound data corresponding to each word, the difference between the speaker degree of words extracted from the speaker's voice and that of words extracted from noise becomes clearer, which further improves speech recognition accuracy.

また、発話者度算出部１４０は、人の音声の周波数帯域における各単語の音量を用いて発話者度を算出している。これにより、発話者の音声から抽出された単語と、人の音声以外の周波数帯域の音を多く含む雑音から抽出された単語とで、音量の差異がより明確となり、発話者度の差異がより明確となる。このため、音声認識精度をより向上させることができる。 Further, the speaker degree calculation unit 140 calculates the speaker degree using the volume of each word in the frequency band of human speech. As a result, the difference in volume between the word extracted from the speech of the speaker and the word extracted from noise that contains a lot of sound in a frequency band other than human speech becomes clearer, and the difference in the degree of speaker is more pronounced. It becomes clear. For this reason, the voice recognition accuracy can be further improved.

また、音声認識装置１００は、スコア算出部１５０により算出されたスコアの最大値に基づいて閾値を設定する閾値設定部１６０を更に備え、単語選定部１７０は、閾値設定部１６０により設定された閾値を用いて単語を選定している。このため、集音環境による全単語のスコアの増減に柔軟に対応して単語を選定することができ、集音環境による音声認識結果のばらつきを抑制することができる。 The speech recognition apparatus 100 further includes a threshold setting unit 160 that sets a threshold based on the maximum value of the score calculated by the score calculation unit 150, and the word selection unit 170 includes the threshold set by the threshold setting unit 160. Words are selected using. For this reason, it is possible to select words flexibly corresponding to the increase / decrease in the score of all words due to the sound collection environment, and it is possible to suppress variations in the speech recognition results due to the sound collection environment.

また、発話者度算出部１４０は、各単語の発話者度の最大値を基準として、各単語の相対発話者度を算出し、スコア算出部１５０は、発話者度算出部１４０により算出された各単語の相対発話者度を用いて各単語のスコアを算出している。このため、集音環境による全単語の発話者度の増減に柔軟に対応してスコアを算出し、単語を選定することができる。従って、集音環境による音声認識結果のばらつきをより抑制することができる。 Further, the speaker degree calculation unit 140 calculates the relative speaker degree of each word based on the maximum value of the speaker degree of each word, and the score calculation unit 150 is calculated by the speaker degree calculation unit 140. The score of each word is calculated using the relative speaker degree of each word. For this reason, it is possible to select a word by calculating a score in a flexible manner corresponding to the increase or decrease in the degree of speaker of all words due to the sound collection environment. Therefore, it is possible to further suppress the variation in the speech recognition result due to the sound collection environment.

以上、本発明に係る音声認識方法を採用した音声認識装置の好適な実施形態について説明してきたが、本発明は上述した実施形態に限られるものではなく、その要旨を逸脱しない範囲で様々な変更が可能である。 The preferred embodiment of the speech recognition apparatus employing the speech recognition method according to the present invention has been described above. However, the present invention is not limited to the above-described embodiment, and various modifications are possible without departing from the gist of the invention.

発話者度算出部１４０は、各単語の音量を用いて発話者度を算出しているが、各単語の音量に代えて、各単語の音声モデルの尤度、各単語の雑音モデルの尤度、各単語の空間伝達特性、各単語の基本周波数、及び各単語の声質、のいずれかを用いて発話者度を算出しても、発話者の音声から抽出された単語の発話者度と、雑音から抽出された単語の発話者度との差異を明確にすることができる。また、各単語の音量、各単語の音声モデルの尤度、各単語の雑音モデルの尤度、各単語の空間伝達特性、各単語の基本周波数、及び各単語の声質、のうち２つ以上を組み合わせて用い、発話者度を算出してもよい。この場合の発話者度は、例えば、音量と、音声モデルの尤度及び雑音モデルの尤度に基づいて算出される雑音尤度比と、空間伝達特性、基本周波数及び声質から算出される発話者度調整値とを用いて、
発話者度＝音量−α×１０Ｌｏｇ_１０（雑音尤度比）＋β×（発話者度調整値）
により算出される。α，βは所定の重み係数である。 The speaker degree calculation unit 140 calculates the speaker degree using the volume of each word. However, even when the speaker degree is calculated using, instead of the volume of each word, any one of the likelihood of the speech model of each word, the likelihood of the noise model of each word, the spatial transfer characteristic of each word, the fundamental frequency of each word, and the voice quality of each word, the difference between the speaker degree of a word extracted from the speaker's speech and the speaker degree of a word extracted from noise can be made clear. The speaker degree may also be calculated using a combination of two or more of the volume of each word, the likelihood of the speech model of each word, the likelihood of the noise model of each word, the spatial transfer characteristic of each word, the fundamental frequency of each word, and the voice quality of each word. In this case, the speaker degree is calculated, for example, from the volume, the noise likelihood ratio calculated based on the likelihood of the speech model and the likelihood of the noise model, and the speaker degree adjustment value calculated from the spatial transfer characteristic, the fundamental frequency, and the voice quality, as
Speaker degree = volume − α × 10 log₁₀(noise likelihood ratio) + β × (speaker degree adjustment value)
where α and β are predetermined weighting factors.
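As a minimal sketch of the combined formula above, the following Python function evaluates the speaker degree for one word. The function and parameter names are hypothetical, the default weights of 0.2 are taken from the FIG. 5 example in the text, and the unit of the volume term is an assumption (the text does not state it here).

```python
import math


def speaker_degree(volume, noise_likelihood_ratio, adjustment,
                   alpha=0.2, beta=0.2):
    """Speaker degree = volume - alpha * 10*log10(ratio) + beta * adjustment.

    volume: per-word volume (unit assumed, e.g. a log-scale level).
    noise_likelihood_ratio: noise-model likelihood / speech-model likelihood.
    adjustment: speaker degree adjustment value (how it is derived from the
        spatial transfer characteristic, fundamental frequency, and voice
        quality is not specified here, so it is passed in as a number).
    """
    if noise_likelihood_ratio <= 0:
        raise ValueError("noise likelihood ratio must be positive")
    penalty = alpha * 10.0 * math.log10(noise_likelihood_ratio)
    return volume - penalty + beta * adjustment
```

A larger noise likelihood ratio lowers the speaker degree, which matches the text's observation that words from the speaker's own voice tend to have a lower ratio.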

音声モデルの尤度は、音声から学習したＧＭＭ（Gaussian mixture model）に基づいてフレームごとに算出される。雑音モデルの尤度は、雑音から学習したＧＭＭに基づいてフレームごとに算出される。各単語の音声モデルの尤度及び各単語の雑音モデルの尤度は、各単語に対応するフレームについて算出された尤度の総和として算出される。また、各単語の雑音尤度比は、
雑音尤度比＝雑音モデルの尤度／音声モデルの尤度
により算出される。発話者の音声は、音データ入力部１１０を構成するマイクロホン等の近くで発せられることから、発話者の音声を含む音データの雑音尤度比は、発話者の音声を含まない音データの雑音尤度比よりも低くなる。このため、発話者の音声を含む音データから抽出された単語と、発話者の音声を含まない音データから抽出された単語との間では、雑音尤度比の差異が大きくなり易い。 The likelihood of the speech model is calculated for each frame based on a GMM (Gaussian mixture model) learned from speech. The likelihood of the noise model is calculated for each frame based on a GMM learned from noise. The likelihood of the speech model of each word and the likelihood of the noise model of each word are each calculated as the sum of the likelihoods calculated for the frames corresponding to that word. The noise likelihood ratio of each word is calculated by
Noise likelihood ratio = likelihood of noise model / likelihood of speech model
Since the speaker's voice is uttered near the microphone or the like constituting the sound data input unit 110, the noise likelihood ratio of sound data that includes the speaker's voice is lower than that of sound data that does not include the speaker's voice. For this reason, the difference in noise likelihood ratio tends to be large between a word extracted from sound data that includes the speaker's voice and a word extracted from sound data that does not.
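The per-word ratio described above can be sketched as follows, assuming the per-frame likelihoods from the two GMMs have already been computed and are positive, linear-scale values (the text does not say whether linear or log likelihoods are summed). The function name is hypothetical.

```python
def noise_likelihood_ratio(noise_frame_likelihoods, speech_frame_likelihoods):
    """Ratio of summed noise-model likelihood to summed speech-model likelihood.

    Each argument holds the per-frame likelihoods for the frames that belong
    to one word. Per-word likelihood is the sum over those frames.
    """
    noise_total = sum(noise_frame_likelihoods)
    speech_total = sum(speech_frame_likelihoods)
    if speech_total <= 0:
        raise ValueError("speech-model likelihood must be positive")
    return noise_total / speech_total
```

A word whose frames fit the speech model better than the noise model yields a ratio below 1, and a word dominated by noise yields a ratio above 1.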

各単語の空間伝達特性は、例えば線形予測分析により得られる線形予測係数や、残響時間として算出される。残響時間は、各単語の終端において、音量が所定値まで減衰するのに要する時間である。各単語の空間伝達特性は、音声の残響の程度に応じて変動する。発話者の音声は、音データ入力部１１０を構成するマイクロホン等の近くで発せられることから、発話者の音声を含む音データの残響は、発話者の音声を含まない音データの残響と比べて少ない。このため、発話者の音声を含む音データから抽出された単語と、発話者の音声を含まない音データから抽出された単語との間では、空間伝達特性の差異が大きくなり易い。 The spatial transfer characteristic of each word is calculated as, for example, linear prediction coefficients obtained by linear prediction analysis, or a reverberation time. The reverberation time is the time required for the volume to decay to a predetermined value at the end of each word. The spatial transfer characteristic of each word varies with the degree of reverberation of the sound. Since the speaker's voice is uttered near the microphone or the like constituting the sound data input unit 110, sound data that includes the speaker's voice has less reverberation than sound data that does not. For this reason, the difference in spatial transfer characteristic tends to be large between a word extracted from sound data that includes the speaker's voice and a word extracted from sound data that does not.
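One simple way to read the reverberation-time definition above is sketched below: given a sequence of per-frame levels starting at the end of a word, return the time at which the level first falls to a target value. The function name, the dB scale, and the frame period are illustrative assumptions, not details given in the text.

```python
def decay_time(levels_db, frame_period_s, target_db):
    """Time until the level first decays to target_db or below.

    levels_db: per-frame levels measured from the end of a word onward.
    frame_period_s: time between successive frames, in seconds.
    Returns None if the level never reaches the target in the given frames.
    """
    for index, level in enumerate(levels_db):
        if level <= target_db:
            return index * frame_period_s
    return None
```

Under this reading, a word spoken close to the microphone would show a short decay time, and a distant, reverberant voice would show a longer one.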

各単語の基本周波数は、例えば、フーリエ変換により得られる周波数パワー特性に対して、音声の基本周波数Ｆ０の倍音を透過させるくし形フィルタを適用し、フィルタ通過後のパワーが最大となっている成分の周波数として算出される。或いは、各単語の基本周波数は、音声波形の時間領域での自己相関が最大となる値を１周期とする周波数として算出される。各単語の基本周波数により、各単語の音声らしさを把握することができる。発話者の音声は、音データ入力部１１０を構成するマイクロホン等の近くで発せられることから、発話者の音声を含む音データの基本周波数は、発話者の音声を含まない音データの基本周波数と比べてより音声らしい値となる。このため、発話者の音声を含む音データから抽出された単語と、発話者の音声を含まない音データから抽出された単語との間では、基本周波数の差異が大きくなり易い。 The fundamental frequency of each word is calculated, for example, by applying a comb filter that passes the harmonics of the fundamental frequency F0 of speech to the frequency power characteristic obtained by Fourier transform, and taking the frequency of the component whose power after filtering is maximum. Alternatively, the fundamental frequency of each word is calculated as the frequency whose period is the lag at which the time-domain autocorrelation of the speech waveform is maximum. The fundamental frequency of each word indicates how speech-like the word is. Since the speaker's voice is uttered near the microphone or the like constituting the sound data input unit 110, the fundamental frequency of sound data that includes the speaker's voice takes a more speech-like value than that of sound data that does not. For this reason, the difference in fundamental frequency tends to be large between a word extracted from sound data that includes the speaker's voice and a word extracted from sound data that does not.
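The autocorrelation variant described above can be sketched in plain Python as follows. This is an illustrative estimator, not the patent's implementation: the function name and the 60 to 400 Hz search range are assumptions, and the raw (unnormalized) autocorrelation is used for simplicity.

```python
import math


def f0_by_autocorrelation(samples, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate F0 as sample_rate / lag, where lag maximizes autocorrelation.

    Only lags corresponding to frequencies in [f_min, f_max] are searched
    (an assumed range for human speech). Returns None if no lag in range
    has positive autocorrelation.
    """
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(samples) - 1)
    best_lag, best_value = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else None


# Usage: a synthetic 100 Hz sine sampled at 8 kHz for 0.1 s.
rate = 8000
tone = [math.sin(2 * math.pi * 100 * n / rate) for n in range(800)]
estimate = f0_by_autocorrelation(tone, rate)
```

For real speech a windowed, normalized autocorrelation would usually be more robust. The sketch only illustrates the "lag of maximum autocorrelation as one period" idea.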

各単語の声質は、例えば、音声のスペクトル傾斜係数として算出される。各単語の声質により、発声方法の傾向が示される。このため、発話者の音声を含む音データから抽出された単語と、発話者の音声を含まない音データから抽出された単語との間では、声質の差異が大きくなり易い。 The voice quality of each word is calculated as, for example, a spectral slope coefficient of speech. The voice quality of each word indicates the tendency of the utterance method. For this reason, a difference in voice quality tends to be large between a word extracted from sound data including the voice of the speaker and a word extracted from sound data not including the voice of the speaker.
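The text does not define the spectral slope coefficient. One common reading is the slope of a least-squares line fitted to the power spectrum (in dB) against frequency. The sketch below follows that assumption, and the function name is hypothetical.

```python
def spectral_tilt(freqs_hz, power_db):
    """Least-squares slope (dB per Hz) of power_db as a function of freqs_hz."""
    if len(freqs_hz) != len(power_db) or len(freqs_hz) < 2:
        raise ValueError("need at least two matching (frequency, power) points")
    n = len(freqs_hz)
    mean_f = sum(freqs_hz) / n
    mean_p = sum(power_db) / n
    numerator = sum((f - mean_f) * (p - mean_p)
                    for f, p in zip(freqs_hz, power_db))
    denominator = sum((f - mean_f) ** 2 for f in freqs_hz)
    if denominator == 0:
        raise ValueError("frequencies must not all be equal")
    return numerator / denominator
```

A steeper negative slope means energy falls off faster toward high frequencies, which is one way to capture a difference in how an utterance is produced.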

図５は、図４に示す例と同じ条件において、各単語の音量、各単語の音声モデルの尤度、各単語の雑音モデルの尤度、各単語の空間伝達特性、各単語の基本周波数、及び各単語の声質、の全てを組み合わせて用い、スコアを算出した場合の音声認識結果を示している。この例において、閾値保持部１６１には、相対閾値データとして−６が記憶されている。また、上記重み係数αは０．２に設定され、係数βは０．２に設定されている。この場合、スコアの算出結果は、−３３，−１７．５，−３３．４，−２．２，−２．５，−１３．１，−２３．３，−１３となっている。スコアの最大値が−２．２であることから、閾値は
−２．２−６＝−８．２
となり、この閾値よりも高いスコアの「広島、お好み焼き」が音声認識結果として採用され、他の単語は不採用とされている。 FIG. 5 shows the speech recognition result when, under the same conditions as the example shown in FIG. 4, the score is calculated using all of the volume of each word, the likelihood of the speech model of each word, the likelihood of the noise model of each word, the spatial transfer characteristic of each word, the fundamental frequency of each word, and the voice quality of each word in combination. In this example, the threshold holding unit 161 stores −6 as relative threshold data. The weighting factor α is set to 0.2, and the factor β is set to 0.2. In this case, the calculated scores are −33, −17.5, −33.4, −2.2, −2.5, −13.1, −23.3, and −13. Since the maximum score is −2.2, the threshold is
−2.2 − 6 = −8.2
and “Hiroshima, Okonomiyaki”, whose scores are higher than this threshold, is adopted as the speech recognition result, while the other words are rejected.
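The threshold arithmetic of the FIG. 5 example can be reproduced with a short sketch. The scores and the relative threshold of −6 come from the text. The function name is hypothetical, and words are identified by position only, since the text names just some of the eight words.

```python
def select_by_relative_threshold(scores, relative_threshold):
    """Return (threshold, indices of words whose score exceeds it).

    The threshold is the maximum score plus the (negative) relative threshold.
    """
    threshold = max(scores) + relative_threshold
    kept = [i for i, score in enumerate(scores) if score > threshold]
    return threshold, kept


# Scores from the FIG. 5 example, in the order listed in the text.
scores = [-33, -17.5, -33.4, -2.2, -2.5, -13.1, -23.3, -13]
threshold, kept = select_by_relative_threshold(scores, -6)
```

Here the threshold comes out at about −8.2 and only the two words scoring −2.2 and −2.5 are kept, which agrees with the adopted result “Hiroshima, Okonomiyaki”.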

図５の例では、雑音から抽出された単語である「株主」のスコアは−１３であり、発話者の音声から抽出された単語である「広島」のスコアは−２．２であり、これらの単語のスコアの差は−１０．８である。一方、図４の例では、「株主」のスコアは−１１．４であり、広島のスコアは−５．６であり、これらの単語のスコアの差は−５．８であった。即ち、図５の例では、雑音から抽出された単語のスコアと、発話者の音声から抽出されたスコアとの差異がより大きくなっている。このように、各単語の音量、各単語の音声モデルの尤度、各単語の雑音モデルの尤度、各単語の空間伝達特性、各単語の基本周波数、及び各単語の声質、のうち２つ以上を組み合わせて用いることによって、発話者の音声から抽出された単語の発話者度と、雑音から抽出された単語の発話者度との差異をより明確にすることができ、音声認識精度をより向上させることができる。 In the example of FIG. 5, the score of “shareholder”, a word extracted from noise, is −13, the score of “Hiroshima”, a word extracted from the speaker's voice, is −2.2, and the difference between these scores is −10.8. In the example of FIG. 4, by contrast, the score of “shareholder” was −11.4, the score of “Hiroshima” was −5.6, and the difference between these scores was −5.8. That is, in the example of FIG. 5, the difference between the score of the word extracted from noise and the score of the word extracted from the speaker's voice is larger. In this way, by using two or more of the volume of each word, the likelihood of the speech model of each word, the likelihood of the noise model of each word, the spatial transfer characteristic of each word, the fundamental frequency of each word, and the voice quality of each word in combination, the difference between the speaker degree of a word extracted from the speaker's voice and the speaker degree of a word extracted from noise can be made clearer, and the speech recognition accuracy can be further improved.

また、音声処理部１３０は、各単語の信頼度の最大値を基準として、各単語の相対的な信頼度を出力し、スコア算出部１５０は、音声処理部１３０により出力された各単語の相対的な信頼度を用いて各単語のスコアを算出してもよい。この場合、集音環境による全単語の信頼度の増減に柔軟に対応してスコアを算出し、単語を選定することができる。従って、集音環境による音声認識結果のばらつきをより抑制することができる。 The speech processing unit 130 may output the relative reliability of each word with reference to the maximum value of the reliabilities of the words, and the score calculation unit 150 may calculate the score of each word using the relative reliability output by the speech processing unit 130. In this case, scores can be calculated and words selected in a way that flexibly follows increases or decreases in the reliabilities of all words caused by the sound collection environment. Therefore, variation in the speech recognition results caused by the sound collection environment can be further suppressed.

また、単語選定部１７０は、各単語のスコアと閾値とに基づいて、音声認識結果として採用しない単語を削除した後に、スコアが最大である単語を含む単語列と不連続となる単語を更に削除し、残った単語を音声認識結果として採用してもよい。この場合、図４の例において、第３位のスコアである「株主」までが閾値を上回っていたとしても、「株主」は、最大スコアの「広島」を含む単語列「広島、お好み焼き」と不連続であるために削除される。このように、雑音から抽出された単語の誤採用をより低減させることができ、音声認識精度をより向上させることができる。 The word selection unit 170 may also, after deleting the words that are not to be adopted as the speech recognition result based on the score of each word and the threshold, further delete any word that is discontinuous with the word string containing the word with the maximum score, and adopt the remaining words as the speech recognition result. In this case, in the example of FIG. 4, even if the scores down to “shareholder”, which has the third highest score, exceed the threshold, “shareholder” is deleted because it is discontinuous with the word string “Hiroshima, Okonomiyaki”, which contains the maximum-score word “Hiroshima”. In this way, erroneous adoption of words extracted from noise can be further reduced, and the speech recognition accuracy can be further improved.
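The two-stage selection described above can be sketched as follows: keep only the unbroken run of above-threshold words that contains the maximum-score word. The function name is hypothetical. The demonstration reuses the FIG. 5 scores from the text with an illustrative threshold of −13.05, chosen only so that the third highest score (−13, “shareholder”) passes the threshold, as in the scenario the text describes.

```python
def keep_run_with_best(scores, threshold):
    """Indices of the contiguous above-threshold run containing the top score."""
    if not scores:
        return []
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] <= threshold:
        return []
    low = high = best
    # Extend the run to the left and right while neighbors pass the threshold.
    while low > 0 and scores[low - 1] > threshold:
        low -= 1
    while high < len(scores) - 1 and scores[high + 1] > threshold:
        high += 1
    return list(range(low, high + 1))


scores = [-33, -17.5, -33.4, -2.2, -2.5, -13.1, -23.3, -13]
run = keep_run_with_best(scores, -13.05)
```

With this illustrative threshold, the words at positions 3, 4, and 7 pass the score test, but position 7 is separated from the run containing the best word and is dropped, leaving positions 3 and 4.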

また、音声認識装置１００のハードウェア構成の一例として、図２の構成を示したが、これに限られない。例えば、音声認識装置１００のハードウェア構成は、図６に示すように、ネットワークを介して接続されたクライアント装置２１０及びサーバ装置２２０に機能が分散された構成であってもよい。例えば、音データ入力部１１０、特徴量算出部１２０、単語選定部１７０、及び音声認識結果出力部１８０をクライアント装置２１０に構成し、残りの部分をサーバ装置２２０に構成することで、クライアント装置２１０の演算負荷を軽減することができる。この場合、クライアント装置２１０とサーバ装置２２０との間では、特徴量データ、スコア算出結果、閾値設定結果等が送受信されるため、これらのデータを圧縮し、ネットワークの負荷を軽減することができる。クライアント装置２１０とサーバ装置２２０との機能の分担は上述した例に限られない。更に、クライアント装置２１０又はサーバ装置２２０が更に複数の装置に分かれていてもよい。 The configuration of FIG. 2 has been shown as an example of the hardware configuration of the speech recognition apparatus 100, but the configuration is not limited to this. For example, as shown in FIG. 6, the functions of the speech recognition apparatus 100 may be distributed between a client device 210 and a server device 220 connected via a network. For example, configuring the sound data input unit 110, the feature amount calculation unit 120, the word selection unit 170, and the speech recognition result output unit 180 in the client device 210 and the remaining units in the server device 220 reduces the computational load on the client device 210. In this case, feature amount data, score calculation results, threshold setting results, and the like are transmitted and received between the client device 210 and the server device 220, so these data can be compressed to reduce the load on the network. The division of functions between the client device 210 and the server device 220 is not limited to the above example. Furthermore, the client device 210 or the server device 220 may itself be divided into a plurality of devices.

なお、音声認識装置１００に係る発明は、コンピュータを音声認識装置として機能させるための音声認識プログラムに係る発明として捉えることができる。 The invention relating to the speech recognition apparatus 100 can be regarded as an invention relating to a speech recognition program for causing a computer to function as a speech recognition apparatus.

図７は、コンピュータを音声認識装置１００として機能させるための音声認識プログラムＰ１００のモジュールを示すブロック図である。図７の音声認識プログラムＰ１００は、音データ入力モジュールＰ１１０と、特徴量算出モジュールＰ１２０と、音声処理モジュールＰ１３０と、発話者度算出モジュールＰ１４０と、スコア算出モジュールＰ１５０と、閾値設定モジュールＰ１６０と、単語選定モジュールＰ１７０と、音声認識結果出力モジュールＰ１８０と、を備えている。各モジュールＰ１１０〜Ｐ１８０が実行されることにより実現される機能は、図１の各部１１０〜１８０の機能とそれぞれ同様である。音声認識プログラムＰ１００は、例えば、図２に示す記憶媒体１８ａに格納されて音声認識装置１００に提供される。記憶媒体１８ａとしては、フレキシブルディスク、ＣＤ、ＤＶＤ等の記憶媒体が挙げられる。また、音声認識プログラムＰ１００は、搬送波に重畳されたコンピュータデータ信号として、有線ネットワーク又は無線ネットワークを介して音声認識装置１００に提供されるものであってもよい。 FIG. 7 is a block diagram showing the modules of a speech recognition program P100 for causing a computer to function as the speech recognition apparatus 100. The speech recognition program P100 of FIG. 7 includes a sound data input module P110, a feature amount calculation module P120, a speech processing module P130, a speaker degree calculation module P140, a score calculation module P150, a threshold setting module P160, a word selection module P170, and a speech recognition result output module P180. The functions realized by executing the modules P110 to P180 are the same as the functions of the units 110 to 180 in FIG. 1, respectively. The speech recognition program P100 is, for example, stored in the storage medium 18a shown in FIG. 2 and provided to the speech recognition apparatus 100. Examples of the storage medium 18a include a flexible disk, a CD, and a DVD. The speech recognition program P100 may also be provided to the speech recognition apparatus 100 via a wired or wireless network as a computer data signal superimposed on a carrier wave.

１００…音声認識装置、１３０…音声処理部、１４０…発話者度算出部、１５０…スコア算出部、１６０…閾値設定部、１７０…単語選定部、Ｐ１００…音声認識プログラム、Ｐ１３０…音声処理モジュール、Ｐ１４０…発話者度算出モジュール、Ｐ１５０…スコア算出モジュール、Ｐ１６０…閾値設定モジュール、Ｐ１７０…単語選定モジュール。 DESCRIPTION OF SYMBOLS 100 ... Voice recognition apparatus, 130 ... Voice processing part, 140 ... Speaker degree calculation part, 150 ... Score calculation part, 160 ... Threshold setting part, 170 ... Word selection part, P100 ... Voice recognition program, P130 ... Voice processing module, P140 ... Speaker degree calculation module, P150 ... Score calculation module, P160 ... Threshold setting module, P170 ... Word selection module.

Claims

A speech recognition apparatus comprising:
speech processing means for extracting words included in sound data and outputting the reliability of each word;
speaker degree calculating means for calculating a speaker degree of each word, the speaker degree indicating how likely the sound data corresponding to the word is to be the voice of the speaker;
score calculating means for calculating a score of each word based on the reliability of each word output by the speech processing means and the speaker degree of each word calculated by the speaker degree calculating means; and
word selection means for selecting a word to be adopted as a speech recognition result based on the score of each word and a predetermined threshold.

The speech recognition apparatus according to claim 1, wherein
the speaker degree calculating means calculates the speaker degree of each word using at least one of the volume of each word, the likelihood of a speech model of each word, the likelihood of a noise model of each word, the spatial transfer characteristic of each word, the fundamental frequency of each word, and the voice quality of each word.

The speech recognition apparatus according to claim 2, wherein
the speaker degree calculating means calculates the speaker degree of each word using the volume of each word in the frequency band of human speech.

The speech recognition apparatus according to claim 1, further comprising
threshold setting means for setting the predetermined threshold based on the maximum value of the scores calculated by the score calculating means, wherein
the word selection means selects a word to be adopted as a speech recognition result using the predetermined threshold set by the threshold setting means.

The speech recognition apparatus according to claim 1, wherein
the speech processing means outputs the relative reliability of each word with reference to the maximum value of the reliabilities of the words, and
the score calculating means calculates the score of each word using the relative reliability of each word output by the speech processing means.

The speech recognition apparatus according to claim 1, wherein
the speaker degree calculating means calculates the relative speaker degree of each word with reference to the maximum value of the speaker degrees of the words, and
the score calculating means calculates the score of each word using the relative speaker degree of each word calculated by the speaker degree calculating means.

The speech recognition apparatus according to claim 1, wherein
the word selection means, after deleting words that are not to be adopted as a speech recognition result based on the score of each word and the predetermined threshold, further deletes any word that is discontinuous with the word string containing the word having the maximum score, and adopts the remaining words as the speech recognition result.

A speech recognition method executed by a speech recognition apparatus, the method comprising:
a speech processing step of extracting words included in sound data and outputting the reliability of each word;
a speaker degree calculating step of calculating a speaker degree of each word, the speaker degree indicating how likely the sound data corresponding to the word is to be the voice of the speaker;
a score calculating step of calculating a score of each word based on the reliability of each word output in the speech processing step and the speaker degree of each word calculated in the speaker degree calculating step; and
a word selection step of selecting a word to be adopted as a speech recognition result based on the score of each word and a predetermined threshold.

A speech recognition program that causes a computer to function as:
speech processing means for extracting words included in sound data and outputting the reliability of each word;
speaker degree calculating means for calculating a speaker degree of each word, the speaker degree indicating how likely the sound data corresponding to the word is to be the voice of the speaker;
score calculating means for calculating a score of each word based on the reliability of each word output by the speech processing means and the speaker degree of each word calculated by the speaker degree calculating means; and
word selection means for selecting a word to be adopted as a speech recognition result based on the score of each word and a predetermined threshold.