JP2012255867A

JP2012255867A - Voice recognition device

Info

Publication number: JP2012255867A
Application number: JP2011128127A
Authority: JP
Inventors: Seisho Watabe; 生聖渡部
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2011-06-08
Filing date: 2011-06-08
Publication date: 2012-12-27

Abstract

PROBLEM TO BE SOLVED: To provide a robust voice recognition device having a lower false recognition rate. SOLUTION: The voice recognition device includes: a language model 22 that restricts phoneme sequence patterns; a word phoneme label dictionary 21 storing phoneme labels; a label conversion rule dictionary 23 storing rules for converting the phoneme labels; an acoustic model 24 for generating standard voice patterns; a sound feature amount converting part 12 that converts input voice to a feature amount; a phoneme label converting part 13 that refers to the language model 22, the word phoneme label dictionary 21, and the label conversion rule dictionary 23 to produce converted phoneme labels; a similarity calculation part 14 that converts the phoneme labels from the phoneme label converting part 13 to standard voice patterns using the acoustic model 24 and calculates their similarity to the voice converted by the sound feature amount converting part 12; and a maximum likelihood grammar deciding part 15 that determines the input sentence from the result obtained by the similarity calculation part 14.

Description

The present invention relates to a speech recognition apparatus.

In recent years, command-type speech recognition systems have come into use. Because the task is limited in a command-type system, the recognition grammar can be fixed in advance. In speech recognition control, the recognition grammar serves as a constraint on phoneme sequences (chain patterns of the acoustic model).

Conventionally, kana readings are usually assigned to phoneme patterns one-to-one, and this approach is effective in many speech recognition cases. Even with this approach, however, the misrecognition rate can range from 5% to 50%.
In such cases, the commands that tend to be misrecognized are skewed, and the results of misrecognition are often skewed as well. In other words, there are patterns that are easily confused.

Patent Document 1 discloses a speech recognition technique for cases where noise lowers the S/N ratio. Acoustic models for different S/N ratios are prepared in advance so that acoustic fluctuation can be absorbed. In addition, phoneme patterns are generated in advance, to the extent they can be anticipated, for the points where noise causes phoneme conversion.

Patent Document 2 discloses a correction location determination apparatus that correctly and efficiently corrects misrecognized portions of a speech recognition result. When the recognition result contains no character string whose pronunciation is similar to the correct character string, the correction location determination means uses a language model describing word connection constraints to search for the position where the correct character string is most likely to be inserted, and the recognition result correction means inserts the correct character string at that position.

Patent Document 3 discloses a speech recognition apparatus that obtains highly robust recognition results while improving semantic suitability. It computes N or more maximum likelihood solutions and their likelihoods (scores) using the speech feature amounts extracted by a feature extraction unit, the phoneme HMMs stored in a speech model storage unit, and the word 2-gram model stored in a language model storage unit.

Patent Document 1: JP-A-8-123467
Patent Document 2: JP 2006-267319 A
Patent Document 3: JP 2005-221752 A

Non-Patent Document 1: [Retrieved May 10, 2011], Internet <URL: http://shower.human.waseda.ac.jp/~m-kouki/pukiwiki_public/66.html>
Non-Patent Document 2: [Retrieved May 10, 2011], Internet <URL: http://www.slp.is.ritsumei.ac.jp/C/pattern-rec/pr.pdf>

When kana readings are assigned to phoneme patterns one-to-one, misrecognition tends to occur in three cases: when a phoneme is dropped, when a phoneme is converted, and when phonemes blend into an intermediate sound. A phoneme is dropped, for example, when the "i" in "Akita" is not pronounced, giving "a k (i) t a". A phoneme is converted, for example, when "hon" is "h o N" but "ippon" becomes "i q p o N". Phonemes blend, for example, when "right roll", pronounced "m i g i r o_ r u", has its "i r" replaced by "y" and becomes "m i g y o_ r u".

In such cases, the recognition result is influenced by the phoneme pattern that comes first, and misrecognition is likely. For example, if the recognition grammar contains "akuta", it is easily confused with "Akita": even when the user says "Akita", the recognition result may be "akuta". Likewise, if the grammar contains "right yaw", it is easily confused with "right roll": even when the user says "right roll", the recognition result may be "right yaw".

The speech recognition apparatus includes: a language model that stores a fixed grammar restricting phoneme sequence patterns; a word phoneme label dictionary that stores phoneme labels for dividing sentences extracted by the language model into phonemes; a label conversion rule dictionary that stores rules for converting the phoneme model when the phoneme pattern divided by the word phoneme label dictionary contains a specific pattern; an acoustic model that models standard phoneme patterns; an acoustic feature amount conversion unit that converts an input speech signal into feature amounts; a phoneme label conversion unit that produces phoneme labels by referring to the language model, the word phoneme label dictionary, and the label conversion rule dictionary; a similarity calculation unit that converts the phoneme labels from the phoneme label conversion unit into speech patterns based on the acoustic model and calculates their similarity to the speech signal featurized by the acoustic feature amount conversion unit; and a maximum likelihood grammar determination unit that determines the appropriate input sentence based on the results from the similarity calculation unit.
As a result, phoneme labels can be converted online for specific patterns without any prior training of the acoustic model or the language model.

The invention provides a speech recognition apparatus that is highly robust and has a reduced misrecognition rate.

FIG. 1 is a diagram showing the components of the speech recognition apparatus according to Embodiment 1.
FIG. 2 is a diagram showing the phoneme labels according to Embodiment 1.
FIG. 3 is a diagram showing the components and data of the speech recognition apparatus according to Embodiment 1.

Embodiment 1
Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows the components of the speech recognition apparatus 1. The speech recognition apparatus 1 includes a voice input unit 11, an acoustic feature amount conversion unit 12, a phoneme label conversion unit 13, a similarity calculation unit 14, a maximum likelihood grammar determination unit 15, a word phoneme label dictionary 21, a language model 22, a label conversion rule dictionary 23, and an acoustic model 24.

The voice input unit 11 collects the voice uttered by the user. The voice input unit 11 is, for example, a microphone. It outputs the input voice to the acoustic feature amount conversion unit 12.

The acoustic feature amount conversion unit 12 samples the collected voice and analyzes the sampled voice data, extracting feature amounts for each fixed interval. For example, it featurizes the voice using MFCCs (Mel Frequency Cepstral Coefficients) (Non-Patent Document 1). The acoustic feature amount conversion unit 12 outputs the featurized voice to the similarity calculation unit 14.

The phoneme label conversion unit 13 converts phoneme labels by referring to the word phoneme label dictionary 21, the language model 22, and the label conversion rule dictionary 23. It outputs the stored phoneme labels to the similarity calculation unit 14.

The word phoneme label dictionary 21 stores a label for each phoneme. FIG. 2 shows examples of the phoneme labels stored in the word phoneme label dictionary 21. In word phoneme label notation, a long vowel is written without a space: for example, "a_" for a long "a", and "y o_ k a N" for "yokan" with a long "o".

The language model 22 is a fixed grammar that statically defines sequences of morphemes and thereby restricts phoneme sequence patterns. In other words, the language model 22 models chains of words based on grammar and statistics.
For example, if the sound data input to the phoneme label conversion unit 13 is "Sato-san no denwa bango o oshiete" ("Tell me Mr. Sato's phone number"), the language model 22 restricts the appearance pattern of each element, "Sato", "san", "no", "denwa", "bango", "o", "oshie", and "te", based on the grammar and the statistical model. More specifically, it restricts the words that may follow "Sato" to words such as "san" and "kun", and the words that may follow "san" to words such as "no" and "wa". The appearance patterns of the other words are restricted in the same way. The language model 22 can perform equivalent processing using N-grams with probabilities.
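The successor restriction described above can be pictured as a lookup table from each word to the words allowed to follow it. The following is only a minimal sketch under that assumption: the table `ALLOWED_NEXT`, the romanized words, and the helper names are hypothetical and do not appear in the specification.

```python
# Hypothetical fixed grammar: each word maps to the set of words allowed to follow it.
ALLOWED_NEXT = {
    "Sato": {"san", "kun"},
    "san": {"no", "wa"},
    "kun": {"no", "wa"},
    "no": {"denwa"},
    "denwa": {"bango"},
}


def is_allowed(words):
    """Return True if every adjacent word pair is permitted by the grammar."""
    return all(b in ALLOWED_NEXT.get(a, set()) for a, b in zip(words, words[1:]))


def expand(start, depth):
    """Enumerate the word sequences of length `depth` that start with `start`."""
    if depth == 1:
        return [[start]]
    return [
        [start] + tail
        for nxt in sorted(ALLOWED_NEXT.get(start, set()))
        for tail in expand(nxt, depth - 1)
    ]


print(is_allowed(["Sato", "san", "no", "denwa", "bango"]))
print(expand("Sato", 2))
```

A table like this only captures the grammatical restriction; the N-gram variant mentioned above would attach a probability to each successor instead of a yes/no membership.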

The label conversion rule dictionary 23 records rules that convert a specific pattern, when it appears in a phoneme label defined by combining the word phoneme label dictionary 21 and the language model 22, into a corresponding label pattern. One example is a rule that changes the phoneme pattern {i r} to {i y r}. Typically, rules can be added to or deleted from the label conversion rule dictionary 23 freely.
If a phoneme label contains a pattern recorded in the label conversion rule dictionary 23, the phoneme label conversion unit 13 records a version of the label with the pattern changed according to the rule. The phoneme label conversion unit 13 preferably records both phoneme models, the one before and the one after the pattern change.
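One way to picture this rule lookup is as a substitution over space-separated phoneme labels that keeps the unconverted label alongside the converted one. The sketch below assumes that representation; `apply_rule`, `variants`, and `RULES` are illustrative names, not part of the specification.

```python
def apply_rule(phonemes, pattern, replacement):
    """Replace every occurrence of `pattern` (a list of phonemes) with `replacement`."""
    if not pattern:
        raise ValueError("pattern must contain at least one phoneme")
    out, i, n = [], 0, len(pattern)
    while i < len(phonemes):
        if phonemes[i:i + n] == pattern:
            out.extend(replacement)
            i += n
        else:
            out.append(phonemes[i])
            i += 1
    return out


def variants(label, rules):
    """Return the original label plus each rule-converted label, without duplicates."""
    base = label.split()
    results = [base]
    for pattern, replacement in rules:
        converted = apply_rule(base, pattern.split(), replacement.split())
        if converted not in results:
            results.append(converted)
    return [" ".join(v) for v in results]


# The single rule used as the example in the description: {i r} -> {i y r}.
RULES = [("i r", "i y r")]

print(variants("m i g i r o_ r u", RULES))
print(variants("m i g i y o_", RULES))
```

Matching on whole phoneme tokens rather than on raw characters keeps a rule for {i r} from firing inside an unrelated label, and a label with no matching pattern simply yields itself.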

The similarity calculation unit 14 refers to the acoustic model 24 and calculates the similarity between the standard phoneme patterns and the input speech pattern under the linguistic restrictions. More specifically, for each phoneme label input from the phoneme label conversion unit 13, the similarity calculation unit 14 calculates the similarity between the speech pattern generated by referring to the acoustic model 24 and the speech pattern input from the acoustic feature amount conversion unit 12.
The similarity calculation unit 14 outputs the calculated similarity to the maximum likelihood grammar determination unit 15.

For each phoneme, the acoustic model 24 records standard patterns for its combinations with the phonemes before and after it. For example, for the phoneme {m}, it records patterns for combinations with preceding and following phonemes such as {a} and {i}. The acoustic model 24 also records the pronunciation corresponding to each phoneme combination.
The similarity calculation unit 14 creates standard speech patterns from the phoneme labels input from the phoneme label conversion unit 13 and the acoustic model 24, and calculates their similarity to the featurized voice output by the acoustic feature amount conversion unit 12.

The maximum likelihood grammar determination unit 15 determines the grammar with the highest similarity based on the similarities calculated by the similarity calculation unit 14.

Next, the operation of the speech recognition apparatus 1 is described. The description below assumes that "right roll" is input to the voice input unit 11 as speech. FIG. 3 shows the components of the speech recognition apparatus 1 together with the data produced when "right roll" is spoken into the voice input unit 11.

The voice input unit 11 collects the voice uttered by the user and outputs the collected voice to the acoustic feature amount conversion unit 12.

The acoustic feature amount conversion unit 12 analyzes the speech signal input to the voice input unit 11, cuts out the speech segments delimited by silence, and featurizes them. It outputs the featurized speech pattern to the similarity calculation unit 14.

The phoneme label conversion unit 13 converts phoneme labels based on the rules.
Specifically, the phoneme label conversion unit 13 first refers to the language model 22. For the sentence "right roll direction" input to the voice input unit 11, it thereby restricts the appearance pattern of each element, "right", "roll", and "direction", based on the grammar and the statistical model. More specifically, by referring to the language model 22, the phoneme label conversion unit 13 extracts "roll" and "yaw" as words that can grammatically follow "right".
Next, for the words extracted using the language model 22, the phoneme label conversion unit 13 refers to the word phoneme label dictionary 21 and extracts {m i g i} for "right", {r o_ r u} for "roll", and {y o_} for "yaw". The phoneme label conversion unit 13 thus stores {m i g i r o_ r u} as "right roll" and {m i g i y o_} as "right yaw".

Next, the phoneme label conversion unit 13 refers to the label conversion rule dictionary 23. Assume here that the label conversion rule dictionary 23 contains a rule that converts {i r} to {i y r}.
Written as a phoneme label, "right roll direction" is {m i g i r o_ r u h o_ k o_}. Because the label conversion rule dictionary 23 contains the rule converting {i r} to {i y r}, the phoneme label conversion unit 13 converts {m i g i r o_ r u h o_ k o_} to {m i g i y r o_ r u h o_ k o_} and stores it. The phoneme label conversion unit 13 stores both the phoneme model before conversion based on the label conversion rule dictionary 23 and the phoneme model after conversion. That is, it records both {m i g i r o_ r u} and {m i g i y r o_ r u} as "right roll".
The label conversion rule dictionary 23 contains no rule that converts the phoneme label of "right yaw". The phoneme label conversion unit 13 therefore keeps {m i g i y o_} recorded as "right yaw".

The similarity calculation unit 14 fits the standard phoneme patterns recorded in the acoustic model 24 to each of the labels {m i g i r o_ r u}, {m i g i y r o_ r u}, and {m i g i y o_} recorded in the phoneme label conversion unit 13. It thereby generates a speech pattern for each of {m i g i r o_ r u}, {m i g i y r o_ r u}, and {m i g i y o_}.

The similarity calculation unit 14 calculates the similarity between each of the speech patterns generated from the acoustic model 24 and the phoneme labels recorded in the phoneme label conversion unit 13, and the speech pattern input from the acoustic feature amount conversion unit 12. That is, against the speech pattern input from the acoustic feature amount conversion unit 12, it calculates the similarity of the speech patterns for {m i g i r o_ r u} and {m i g i y r o_ r u} as "right roll" and for {m i g i y o_} as "right yaw". Typically, the similarity calculation unit 14 calculates the similarity between the audio frequencies of the generated speech pattern and those of the speech pattern input from the acoustic feature amount conversion unit 12 (Non-Patent Document 2).
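The specification does not spell out the similarity measure beyond the reference to Non-Patent Document 2. Purely as an illustration of comparing a generated pattern with an input pattern of different length, the sketch below substitutes dynamic time warping (DTW) over sequences of single numbers; the choice of DTW, the scalar features, and the function names are all assumptions, and real feature vectors such as MFCCs would need a vector distance in place of `abs`.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def similarity(a, b):
    """Map the distance to a score in (0, 1]; identical sequences score 1.0."""
    return 1.0 / (1.0 + dtw_distance(a, b))


# A stretched copy of the same contour still aligns with zero distance.
print(similarity([1, 2, 3], [1, 2, 2, 3]))
print(similarity([1, 2, 3], [3, 2, 1]))
```

The warping step is what lets a slowly spoken utterance match a standard pattern of different duration, which is why a frame-by-frame difference alone would be a poor stand-in here.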

The maximum likelihood grammar determination unit 15 identifies the highest of the similarities calculated by the similarity calculation unit 14. For example, if the similarity calculation unit 14 finds that the {m i g i y r o_ r u} entry of the language model has the highest similarity, the unit determines that "right roll" was input.
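The decision step amounts to taking the maximum over all candidate labels and reporting the command that the winning label belongs to. A minimal sketch follows; the score values are invented for illustration only and the function name `decide` is hypothetical.

```python
def decide(scores):
    """Return the (command, phoneme label) pair with the highest similarity score."""
    return max(scores, key=scores.get)


# Illustrative scores only; keys are (command, phoneme label) pairs.
scores = {
    ("right roll", "m i g i r o_ r u"): 0.61,
    ("right roll", "m i g i y r o_ r u"): 0.83,
    ("right yaw", "m i g i y o_"): 0.74,
}

command, label = decide(scores)
print(command, "/", label)
```

Because both the unconverted and the converted label map to the same command, the converted label can win the comparison against "right yaw" without changing what is reported to the user.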

The speech recognition apparatus 1 can thereby reduce the misrecognition rate in speech recognition.
The speech recognition apparatus 1 can change a specific phoneme pattern to a corresponding phoneme pattern. What is changed is the language model, not the acoustic model. Changing the acoustic model would require training and data collection by the speech recognition apparatus 1; here, by contrast, the user can freely set the conversion patterns in the label conversion rule dictionary 23 without any training or data collection.

The present invention is not limited to the embodiment above and can be modified as appropriate without departing from its spirit. For example, although the label conversion rule dictionary 23 was described as containing a rule that converts {i r} to {i y r}, many more conversion rules may be stored and used for phoneme label conversion. Furthermore, when several of the conversion rules registered in the label conversion rule dictionary 23 apply to one phoneme model, the phoneme label conversion unit 13 may generate converted phoneme labels for various combinations: for example, one with only the first rule applied, one with only the second rule applied, and one with both the first and second rules applied.
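The combinations mentioned above (first rule only, second rule only, both) can be enumerated by applying every subset of the rules. The sketch below assumes space-separated phoneme labels; the second rule, {o_ r} to {o r}, is hypothetical and included only so that two rules match the same label.

```python
from itertools import combinations


def apply_rule(phonemes, pattern, replacement):
    """Replace every occurrence of `pattern` (a list of phonemes) with `replacement`."""
    out, i, n = [], 0, len(pattern)
    while i < len(phonemes):
        if n and phonemes[i:i + n] == pattern:
            out.extend(replacement)
            i += n
        else:
            out.append(phonemes[i])
            i += 1
    return out


def all_variants(label, rules):
    """Apply every subset of `rules` (in listed order) and collect the distinct labels."""
    base = label.split()
    seen = []
    for size in range(len(rules) + 1):
        for subset in combinations(rules, size):
            current = base
            for pattern, replacement in subset:
                current = apply_rule(current, pattern.split(), replacement.split())
            text = " ".join(current)
            if text not in seen:
                seen.append(text)
    return seen


# First rule is from the description; the second is a made-up example.
rules = [("i r", "i y r"), ("o_ r", "o r")]
for variant in all_variants("m i g i r o_ r u", rules):
    print(variant)
```

The number of subsets doubles with each additional rule, so a practical rule dictionary would likely need to cap how many rules are combined for one label.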

DESCRIPTION OF SYMBOLS
1 Speech recognition apparatus
11 Voice input unit
12 Acoustic feature amount conversion unit
13 Phoneme label conversion unit
14 Similarity calculation unit
15 Maximum likelihood grammar determination unit
21 Word phoneme label dictionary
22 Language model
23 Label conversion rule dictionary
24 Acoustic model

Claims

A speech recognition apparatus comprising:
a language model that stores a fixed grammar restricting phoneme sequence patterns;
a word phoneme label dictionary that stores phoneme labels for dividing sentences extracted by the language model into phonemes;
a label conversion rule dictionary that stores rules for converting a phoneme model when the phoneme pattern divided by the word phoneme label dictionary contains a specific pattern;
an acoustic model that stores standard phoneme patterns;
an acoustic feature amount conversion unit that converts an input speech signal into feature amounts;
a phoneme label conversion unit that converts to phoneme labels by referring to the language model, the word phoneme label dictionary, and the label conversion rule dictionary;
a similarity calculation unit that converts the phoneme labels converted by the phoneme label conversion unit into speech patterns based on the acoustic model, and calculates their similarity to the speech signal featurized by the acoustic feature amount conversion unit; and
a maximum likelihood grammar determination unit that determines an appropriate input sentence based on the calculation result of the similarity calculation unit.