JP2010072020A

JP2010072020A - Voice recognition device, and transportation apparatus loaded therewith

Info

Publication number: JP2010072020A
Application number: JP2008235986A
Authority: JP
Inventors: Takashi Akasaka; 貴志赤坂
Original assignee: Yamaha Motor Co Ltd
Current assignee: Yamaha Motor Co Ltd
Priority date: 2008-09-16
Filing date: 2008-09-16
Publication date: 2010-04-02

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device that dispenses with prior investigation for creation of a word dictionary. SOLUTION: This voice recognition device 10 is equipped with an utterance section detection part 11 for detecting an utterance section including a voice signal from an acoustic signal; a characteristic quantity extraction part 12 for extracting an acoustic characteristic quantity from the voice signal by analyzing the voice signal in the utterance section detected by the utterance section detection part 11; a word dictionary 13 wherein a plurality of kinds of words are sorted by the mora number and registered; an acoustic model 14 wherein the acoustic characteristic quantity of each word registered in the word dictionary 13 is registered; and a collation part 15 for calculating the likelihood of each word by collating, with reference to the word dictionary 13 and the acoustic model 14, the acoustic characteristic quantity extracted by the characteristic quantity extraction part 12 with the acoustic characteristic quantity of each word registered in the word dictionary 13, and outputting the word having the highest likelihood as the recognition result. COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a speech recognition device and, more particularly, to a speech recognition device that recognizes words in a speech signal.

Transport equipment such as motorcycles, automobiles, ships, airplanes, and helicopters carries electronic information devices such as navigation systems, mobile phones, and audio equipment. Recently, speech recognition devices have also been installed so that these devices can be operated by voice.

The hidden Markov model (hereinafter abbreviated as "HMM") is widely known as one speech recognition method (see Non-Patent Document 1). Because an HMM statistically learns the variability inherent in speech, such as speaker differences and utterance variation, it achieves high recognition accuracy and is now an established speech recognition method.

Japanese Patent Laid-Open No. 2000-099077 (Patent Document 1) describes a speech recognition device that recognizes speech faster by skipping collation with dictionaries that, judging from the length of the input speech, cannot be the recognition result. The device includes an utterance length measurement unit that measures the length of the input utterance and a dictionary selection unit that selects, from speech recognition dictionaries prepared in advance, the dictionaries within a certain length range. By estimating from the utterance length which dictionary lengths are possible recognition results and narrowing the candidates accordingly, the device avoids the wasted work of collating the input with dictionaries that cannot be the result. In the speech recognition dictionary used there, a dictionary length is registered for each dictionary item (see FIG. 4 of Patent Document 1). For example, 650 ms is registered as the dictionary length of "Shinjuku". If dictionaries with lengths of 60% to 160% of the utterance length L are allowed and L is 500 ms, the dictionary selection unit selects dictionaries whose dictionary length is 300 ms to 800 ms.
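The prior-art selection rule described above can be sketched as follows. This is only an illustration of the arithmetic: the function name, the dictionary layout, and the second entry are hypothetical, and only the "Shinjuku" length of 650 ms and the 60% to 160% window come from the text.

```python
def select_by_length(dictionary_ms, utterance_ms, lo_pct=60, hi_pct=160):
    """Return the entries whose registered length lies in the allowed window.

    dictionary_ms maps an entry name to its registered dictionary length in ms.
    Integer percentages avoid floating-point rounding at the window edges.
    """
    lo = utterance_ms * lo_pct // 100   # 60% of the utterance length
    hi = utterance_ms * hi_pct // 100   # 160% of the utterance length
    return [name for name, length in dictionary_ms.items() if lo <= length <= hi]


# "Shinjuku" = 650 ms is from the text; the other entry is a made-up placeholder.
prior_art_dictionary = {"Shinjuku": 650, "hypothetical_long_entry": 1200}
selected = select_by_length(prior_art_dictionary, 500)  # window is 300 ms to 800 ms
print(selected)
```

Every length in such a dictionary has to be measured beforehand, which is the cost the invention aims to avoid.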

However, creating such a speech recognition dictionary requires investigating the dictionary length of every dictionary item in advance, and this prior investigation takes an enormous amount of work.
Patent Document 1: JP 2000-099077 A
Patent Document 2: JP 2007-206239 A
Patent Document 3: Japanese Patent Application No. 2008-199713 (unpublished prior application by the same applicant)
Non-Patent Document 1: Lawrence Rabiner and Biing-Hwang Juang, translation supervised by Sadaoki Furui, "Fundamentals of Speech Recognition" (Japanese edition), Chapter 6, NTT Advanced Technology, 1995

An object of the present invention is to provide a speech recognition device that does not require prior investigation to create its word dictionary.

Means for Solving the Problems and Effects of the Invention

The speech recognition device according to the present invention comprises storage means, feature extraction means, utterance length specifying means, word dictionary range designation means, and matching means. The storage means stores a word dictionary and an acoustic model. In the word dictionary, a plurality of kinds of words are registered together with an utterance length that is known in advance as an index of how long each word takes to utter. In the acoustic model, the acoustic features of the words registered in the word dictionary are registered. The feature extraction means analyzes a speech signal and extracts acoustic features. The utterance length specifying means specifies an utterance length based on the speech signal. The word dictionary range designation means designates the range of the word dictionary that includes words of the utterance length specified by the utterance length specifying means. The matching means refers to the word dictionary and the acoustic model, calculates the likelihood of each word by matching the acoustic features extracted by the feature extraction means against the acoustic features of each word in the range designated by the word dictionary range designation means, and takes the word with the highest likelihood as the recognition result.

According to the present invention, words are registered in the word dictionary together with an utterance length that is known in advance (mora count, vowel count, vowel length, or the like), so no prior investigation is needed to create the word dictionary.

Preferably, the words in the word dictionary are sorted by mora count. The storage means further stores a mora count address table in which the start address and/or the end address for each mora count is registered. The word dictionary range designation means refers to the mora count address table and designates, by start and/or end addresses, the range that includes words of the utterance length specified by the utterance length specifying means.

In this case, the words only need to be sorted by mora count, which is known in advance as the utterance length, so the word dictionary is easy to create.

Preferably, the speech recognition device further includes utterance section detection means for detecting, in an acoustic signal, an utterance section that contains the speech signal. The utterance length specifying means includes utterance time measurement means for measuring, as the utterance length, the duration of the utterance section detected by the utterance section detection means.

In this case, the range of the word dictionary can be narrowed according to the measured duration of the utterance section.

Preferably, the utterance length specifying means includes vowel estimation means for estimating, as the utterance length, the vowel count or the vowel length based on the acoustic features extracted by the feature extraction means.

In this case, the range of the word dictionary can be narrowed according to the estimated vowel count or vowel length.

Preferably, the word dictionary is divided into a plurality of sub-word dictionaries, one for each mora count. The word dictionary range designation means designates the sub-word dictionary or dictionaries that include words of the utterance length specified by the utterance length specifying means.

In this case, the words do not need to be registered in address order.

Preferably, the word dictionary range designation means designates a range that includes not only words of the utterance length specified by the utterance length specifying means but also words whose utterance length lies within a predetermined range close to the specified utterance length.

In this case, the correct word can be recognized even if the specified utterance length contains some error.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In the drawings, identical or corresponding parts are denoted by the same reference numerals, and their description is not repeated.

[First Embodiment]
Referring to FIG. 1, a speech recognition device 10 according to the first embodiment of the present invention includes an utterance section detection unit 11, a feature extraction unit 12, a word dictionary 13, an acoustic model 14, and a matching unit 15. The word dictionary 13 and the acoustic model 14 are stored in a storage device such as a hard disk.

The utterance section detection unit 11 detects, in the acoustic signal, an utterance section that contains the speech signal. More specifically, the utterance section detection unit 11 divides the acoustic signal into frames of a predetermined duration (for example, 10 ms), applies a fast Fourier transform to each frame, differentiates the Fourier-transformed signal to obtain differential coefficients, and, based on the frequency distribution (histogram) of those differential coefficients, determines the frames that contain speech to be the utterance section. For details, the description of Japanese Patent Application No. 2008-199713 (Patent Document 3) is incorporated herein by reference.

The feature extraction unit 12 analyzes the speech signal in the utterance section detected by the utterance section detection unit 11 and extracts acoustic features from that speech signal.

Referring to FIG. 2, a plurality of kinds of words are registered in the word dictionary 13, sorted by mora count k. In Japanese, the mora count k is approximately equal to the number of characters when the word is written in kana. For example, the word "tsu" (つ), with mora count k = 1, is registered at address s1. The word "Aki" (あき), with k = 2, is registered at address s2, and the word "Yui" (ゆい), with k = 2, is registered at address e2. The word "Atami" (あたみ), with k = 3, is registered at address s3, and the word "Waseda" (わせだ), with k = 3, is registered at address e3. The word "Ariake" (ありあけ), with k = 4, is registered at address s4. The word "Wakura Onsen" (わくらおんせん), with k = 7, is registered at address e7. The word "Yukigaya-Otsuka" (ゆきがやおおつか), with k = 8, is registered at address s8.

In addition to the word dictionary 13 and the acoustic model 14, the storage device stores a mora count address table 16, as shown in FIG. 3. In the mora count address table 16, each mora count k is registered in association with a start address sk and an end address ek. Each start address sk is the first address in the word dictionary 13 at which a word with the corresponding mora count k is stored. Each end address ek is the last address in the word dictionary 13 at which a word with the corresponding mora count k is stored. For example, the word with mora count k = 1 is stored at start address s1 of the word dictionary 13, so the start address s1 is registered in the mora count address table 16 for k = 1. Because there is only one word with k = 1, the start address s1 equals the end address e1, and the end address e1, identical to s1, is registered for k = 1. Likewise, the words with k = 2 are stored from start address s2 through end address e2 of the word dictionary 13, so the start address s2 and the end address e2 are registered for k = 2. The same applies to words with other mora counts.
In the present embodiment, as described above, the mora count is registered as an index of the temporal length of a word when it is uttered (hereinafter referred to as the "utterance length").
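The relationship between the sorted word dictionary and the mora count address table can be sketched as follows. This is a minimal illustration, not the patented implementation: list indices stand in for the addresses sk and ek, the function name is hypothetical, and only the eight example words named in the text are registered (the real dictionary of FIG. 2 holds more words per mora count).

```python
def build_mora_address_table(sorted_entries):
    """Map each mora count k to (start, end) indices, both inclusive.

    sorted_entries is a list of (word, mora_count) pairs already sorted by
    mora_count, so all words with the same count are contiguous.
    """
    table = {}
    for address, (_word, k) in enumerate(sorted_entries):
        start = table[k][0] if k in table else address
        table[k] = (start, address)
    return table


# Example words and mora counts taken from the description of FIG. 2.
entries = [
    ("tsu", 1), ("aki", 2), ("yui", 2), ("atami", 3), ("waseda", 3),
    ("ariake", 4), ("wakuraonsen", 7), ("yukigayaotsuka", 8),
]
word_dictionary = sorted(entries, key=lambda e: e[1])  # sort by mora count
mora_address_table = build_mora_address_table(word_dictionary)
print(mora_address_table[2])  # (1, 2): "aki" through "yui"
```

Because the mora count follows directly from the kana spelling, building this table needs no measurement of spoken word durations.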

Referring again to FIG. 1, the acoustic features (for example, feature vectors such as spectra or cepstra) of the words registered in the word dictionary 13 are registered in the acoustic model 14. The acoustic model 14 models, for each word, the acoustic characteristics of a standard speech pattern and serves as reference information for evaluating acoustic similarity to the input speech pattern. It holds acoustic parameters learned for each small unit such as a phoneme or a mora.

The word dictionary 13 and the acoustic model 14 are created by training on learning data recorded in advance. The learning data is, for example, speech data in which utterances by many speakers in various situations are recorded.

The matching unit 15 refers to the word dictionary 13 and the acoustic model 14, matches the acoustic features extracted by the feature extraction unit 12 against the acoustic features of each word registered in the word dictionary 13, calculates with the HMM algorithm a likelihood representing how plausible each word is, and takes the word with the highest of the calculated likelihoods as the recognition result. For details, the description of JP 2007-206239 A (Patent Document 2) is incorporated herein by reference.

The speech recognition device 10 further includes an utterance time measurement unit 17 and a word dictionary range designation unit 18. To specify the utterance length from the speech signal, the utterance time measurement unit 17 measures, as the utterance length, the duration of the utterance section detected by the utterance section detection unit 11 (hereinafter referred to as the "utterance time"). The word dictionary range designation unit 18 designates a range 19 of the word dictionary 13 (hereinafter referred to as the "recognition target range") that includes words of the utterance time measured by the utterance time measurement unit 17 and words whose utterance time lies within a predetermined range close to it. The matching unit 15 matches the acoustic features not against every word registered in the word dictionary 13 but only against the words in the recognition target range 19 designated by the word dictionary range designation unit 18. Details are described later.

Next, the operation of the speech recognition device 10 will be described with reference to the flowchart shown in FIG. 4.

An acoustic signal such as that shown in FIG. 5(a) is input to the speech recognition device 10. When the speech recognition device 10 is built into a navigation system, a spoken command instructing an operation is picked up by a microphone, and the signal is input from the microphone to the speech recognition device 10. A command may consist of a single word or of a short phrase of several words.

First, the utterance section detection unit 11 detects, in the input acoustic signal, the utterance section V from the start time ts to the end time te of the speech signal, as shown in FIG. 5(b) (S1).

Next, as shown in FIG. 5(c), the utterance time measurement unit 17 measures the utterance time T (= te − ts) of the utterance section V detected by the utterance section detection unit 11 (S2).

Next, the word dictionary range designation unit 18 estimates the mora count L of the word to be recognized from the utterance time T measured by the utterance time measurement unit 17 (S3). Specifically, the mora count L (a natural number) is determined within the range of the following expression (1).
αT ≤ L ≤ βT (1)

Here, α and β are predetermined coefficients. For example, when 2 < αT ≤ 3 and 7 ≤ βT < 8, the mora count is estimated to be L = 3 to 7.
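Expression (1) amounts to taking every natural number between αT and βT. A minimal sketch follows, assuming T is given in seconds and using illustrative coefficient values that are not from the source; the function name is hypothetical.

```python
import math


def estimate_mora_range(utterance_time, alpha, beta):
    """Return (L_min, L_max), the natural numbers L with alpha*T <= L <= beta*T.

    L_min is the smallest integer >= alpha*T (at least 1) and L_max is the
    largest integer <= beta*T. If L_max < L_min, no mora count satisfies (1).
    """
    l_min = max(1, math.ceil(alpha * utterance_time))
    l_max = math.floor(beta * utterance_time)
    return l_min, l_max


# Assumed values: T = 1.0 s, alpha = 2.5, beta = 7.5, so alpha*T = 2.5 and
# beta*T = 7.5, which matches the case 2 < alpha*T <= 3 and 7 <= beta*T < 8.
print(estimate_mora_range(1.0, 2.5, 7.5))  # (3, 7)
```

Suitable values of α and β depend on speaking rate and on the time unit of T, so they would have to be tuned for the target application.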

Next, the word dictionary range designation unit 18 designates the recognition target range 19 in the word dictionary 13 based on the estimated mora count L (S4). Specifically, it refers to the mora count address table 16 shown in FIG. 3, reads the start address sk corresponding to the smallest estimated mora count Lmin, and reads the end address ek corresponding to the largest estimated mora count Lmax. It then designates the range from the start address sk to the end address ek in the word dictionary 13 shown in FIG. 2 as the recognition target range 19. For example, when the mora count is estimated to be L = 3 to 7, the start address s3 corresponding to L = 3 and the end address e7 corresponding to L = 7 are read, and the range from the start address s3 to the end address e7 is designated as the recognition target range 19.
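Step S4 can be sketched as a table lookup. The table below uses list indices in place of the addresses sk and ek and covers only the example words named in the text. The handling of mora counts that have no registered words (falling back to the nearest registered count inside the range) is an assumption of this sketch, since the source does not describe that case.

```python
def designate_range(mora_address_table, l_min, l_max):
    """Return (start, end) covering all words with l_min <= mora count <= l_max.

    mora_address_table maps a mora count k to (start_k, end_k), inclusive.
    Returns None when no registered mora count falls inside the range.
    """
    counts = [k for k in mora_address_table if l_min <= k <= l_max]
    if not counts:
        return None
    start = mora_address_table[min(counts)][0]  # start address for smallest k
    end = mora_address_table[max(counts)][1]    # end address for largest k
    return start, end


# Hypothetical table: indices stand in for addresses s1..e8 of FIG. 3.
table = {1: (0, 0), 2: (1, 2), 3: (3, 4), 4: (5, 5), 7: (6, 6), 8: (7, 7)}
print(designate_range(table, 3, 7))  # (3, 6): from s3 through e7
```

Only two lookups are needed regardless of how many words the range contains, which is what makes the contiguous, sorted layout convenient.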

Meanwhile, the feature extraction unit 12 analyzes the speech signal in the utterance section detected by the utterance section detection unit 11 and extracts acoustic features, such as mel-cepstra, from that speech signal (S5).

Next, the matching unit 15 refers to the word dictionary 13 and the acoustic model 14, matches the acoustic features extracted by the feature extraction unit 12 against the acoustic features of each word in the range designated by the word dictionary range designation unit 18, calculates with the HMM algorithm a likelihood representing how plausible each word is, and takes the word with the highest of the calculated likelihoods as the recognition result (S6).
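The control flow of step S6, scoring only the words inside the recognition target range and keeping the best one, can be sketched as below. The scoring function here is a toy stand-in (negative squared distance to a made-up template vector), not an HMM; the templates, feature values, and function names are all hypothetical and serve only to make the example runnable.

```python
def recognize(features, word_dictionary, target_range, score):
    """Return the word in target_range with the highest score.

    target_range is an inclusive (start, end) index pair into word_dictionary;
    score(features, word) plays the role of the likelihood computation.
    """
    start, end = target_range
    candidates = word_dictionary[start:end + 1]
    return max(candidates, key=lambda word: score(features, word))


# Toy stand-in for the acoustic model: one made-up template vector per word.
TEMPLATES = {"atami": [1.0, 0.0], "waseda": [0.0, 1.0], "ariake": [1.0, 1.0]}


def toy_score(features, word):
    """Higher is better: negative squared distance to the word's template."""
    return -sum((f - t) ** 2 for f, t in zip(features, TEMPLATES[word]))


words = ["tsu", "aki", "yui", "atami", "waseda", "ariake"]
result = recognize([0.9, 0.1], words, (3, 5), toy_score)
print(result)  # atami
```

Words outside the range ("tsu", "aki", "yui" here) are never scored, which is where the time saving comes from.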

As described above, according to the first embodiment of the present invention, words only need to be sorted by mora count, which is known in advance as the utterance length, and registered in the word dictionary 13. The word dictionary 13 can therefore be created easily without investigating the utterance length of each word beforehand. In addition, because the utterance time is measured to estimate the mora count and a range close to that mora count is designated as the recognition target range 19, there is no need to match against every word in the word dictionary 13, and speech recognition can be performed in a short time. By referring to the mora count address table 16, the recognition target range 19 can be designated quickly using the start address corresponding to the smallest estimated mora count and the end address corresponding to the largest estimated mora count. Furthermore, because the recognition target range 19 includes not only the estimated mora count but also nearby mora counts, the correct word can be recognized even if the estimated mora count contains some error.

In the present embodiment, both the start address and the end address are registered in the mora count address table 16. However, if all the words are registered in address order, it is sufficient to register only one of the start address and the end address.

[Second Embodiment]
Referring to FIG. 6, a speech recognition device 20 according to the second embodiment of the present invention includes a vowel estimation unit 21 in place of the utterance time measurement unit 17 of the first embodiment. The vowel estimation unit 21 estimates, as the utterance length, the vowel count or the vowel length based on the acoustic features extracted by the feature extraction unit 12. The utterance section detection unit 11 of the first embodiment is not provided.

In Japanese, the mora count is approximately equal to the vowel count. FIG. 7 shows the phoneme notation of the words shown in FIG. 2. For example, the phoneme notation of the word "tsu" is /ts u/, with vowel count n = 1. The phoneme notation of "Aki" is /a k i/, with n = 2. The phoneme notation of "Atami" is /a t a m i/, with n = 3. The phoneme notation of "Ariake" is /a r i a k e/, with n = 4. For all of these words the mora count equals the vowel count. On the other hand, for a word containing the moraic nasal "n" (ん), such as "Wakura Onsen", the phoneme notation is /w a k u r a o N s e N/, and the strict vowel count is n = 5. In this example, however, the phoneme /N/ is also counted as a vowel, so that the vowel count is taken to be n = 7 and the mora count k = 7. In addition, a geminate consonant (sokuon) is treated as one vowel, and a long vowel is treated as two vowels.
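The counting convention above can be sketched on phoneme strings. The phoneme sequences are those given in the text; the symbols "q" for a geminate consonant and a trailing ":" for a long vowel are assumptions of this sketch, since the source does not say how FIG. 7 writes them.

```python
VOWELS = {"a", "i", "u", "e", "o"}


def count_vowel_units(phonemes):
    """Count vowels using the convention described in the text.

    Plain vowels and the moraic nasal "N" count as 1, a geminate consonant
    (written "q" here, an assumed symbol) counts as 1, and a long vowel
    (written with a trailing ":", also assumed) counts as 2.
    """
    total = 0
    for p in phonemes:
        if p in VOWELS or p in ("N", "q"):
            total += 1
        elif p.endswith(":") and p[:-1] in VOWELS:
            total += 2
    return total


print(count_vowel_units("ts u".split()))                   # 1
print(count_vowel_units("a r i a k e".split()))            # 4
print(count_vowel_units("w a k u r a o N s e N".split()))  # 7
```

With this convention the count agrees with the mora count k used to sort the word dictionary, so the same address table can be reused.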

Next, the operation of the speech recognition device 20 will be described with reference to the flowchart shown in FIG. 8.

Unlike in the first embodiment, the vowel estimation unit 21 estimates the vowel count K or the vowel length based on the acoustic features extracted by the feature extraction unit 12 (S7).

Detecting only the vowels in a speech signal is relatively easy compared with general speech recognition. One approach looks at the zero-crossing count zc of each frame obtained by dividing the speech signal at predetermined time intervals. A suitable minimum value min and maximum value max are set in advance, the number of frames fn satisfying min < zc < max is counted, and fn is taken as the estimated vowel length. Alternatively, the average vowel length mean may be calculated and set in advance, and fn / mean may be taken as the estimated vowel count. A vowel may also be detected with an HMM or GMM (Gaussian Mixture Model) trained specifically for vowel discrimination. In that case, a likelihood threshold Lt is set, and the number of frames for which the likelihood Lf output by the HMM or GMM satisfies Lt < Lf may be counted and taken as the estimated vowel length. Various other methods known in the technical field of phoneme segmentation can also be used.
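The zero-crossing approach can be sketched in plain Python as follows. The sampling rate, frame length, thresholds, and the synthetic test signal are illustrative assumptions, and rounding fn / mean to the nearest integer is a choice made for this sketch rather than something the source specifies.

```python
import math


def zero_crossings(frame):
    """Number of sign changes between consecutive samples in one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def estimate_vowel_length(samples, frame_len, zc_min, zc_max):
    """Count frames fn whose zero-crossing count zc satisfies zc_min < zc < zc_max."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return sum(1 for f in frames if zc_min < zero_crossings(f) < zc_max)


def estimate_vowel_count(fn, mean_vowel_frames):
    """Estimate the vowel count as fn / mean, rounded (rounding is an assumption)."""
    return round(fn / mean_vowel_frames)


# Synthetic signal at an assumed 8 kHz rate with 10 ms (80-sample) frames:
# three frames of a 200 Hz tone (few zero crossings, vowel-like) followed by
# two frames of a sample-by-sample alternating signal (many zero crossings).
tone = [math.sin(math.pi * n / 20 + 0.3) for n in range(240)]
buzz = [1.0 if n % 2 == 0 else -1.0 for n in range(160)]
fn = estimate_vowel_length(tone + buzz, frame_len=80, zc_min=1, zc_max=20)
print(fn)  # 3
```

Real speech would need thresholds tuned on recorded data, and an upstream step to exclude silence, whose zero-crossing behavior depends on background noise.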

Next, the word dictionary range designation unit 18 estimates the mora count L of the word to be recognized from the vowel count K or the vowel length estimated by the vowel estimation unit 21 (S3). Specifically, the mora count L (a natural number) is determined within the range of the following expression (2).
K − m ≤ L ≤ K − n (2)

Here, m and n are predetermined integers equal to or greater than 0. For example, when k1 − 1 < K − m ≤ k1 and k2 ≤ K − n < k2 + 1, the mora count is estimated to be L = k1 to k2.

Next, the word dictionary range designation unit 18 designates the recognition target range 19 in the word dictionary 13 based on the estimated mora count L (S4). When the mora count is L = k1 to k2, the range from the start address sk1 to the end address ek2 in the word dictionary 13 shown in FIG. 2 is designated as the recognition target range 19.

As described above, according to the second embodiment of the present invention, the vowel count or the vowel length is estimated from the extracted acoustic features, and the mora count is then estimated from it, so a recognition target range 19 close to that mora count can be designated. Therefore, unlike in the first embodiment, the utterance section detection unit 11 does not need to be provided.

[Third Embodiment]
Unlike in the above embodiments, the word dictionary 13 in the third embodiment of the present invention is divided into N sub-word dictionaries 131 to 13N, one for each mora count, as shown in FIG. 9. For example, the word "tsu" (つ), with mora count k = 1, is registered in the sub-word dictionary 131; words with k = 2, such as "Aki" (あき), are registered in the sub-word dictionary 132; words with k = 3, such as "Atami" (あたみ), are registered in the sub-word dictionary 133; words with k = 7, such as "Higashi-Funabashi" (ひがしふなばし), are registered in the sub-word dictionary 137; and words with k = N are registered in the sub-word dictionary 13N.

In place of the mora count address table 16 shown in FIG. 3, the present embodiment provides a mora count dictionary table 30, as shown in FIG. 10. In the mora count dictionary table 30, each mora count k is registered in association with the corresponding one of the sub-word dictionaries 131 to 13N.

In this embodiment, as shown in FIG. 11, the word dictionary range specifying unit 18 selects one or more of the sub-word dictionaries 131 to 13N based on the mora number L estimated in step S3 (S8). When the mora number L = k1 to k2, the sub-word dictionaries 13k1 to 13k2 are selected from the word dictionary 13 and specified as the recognition target range 19.
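The selection in step S8 can be sketched as a lookup keyed by mora number. This is a minimal illustration, assuming that the mora number dictionary table 30 can be modeled as a mapping from each mora number k to the word list of its sub-word dictionary. The name `select_sub_dictionaries` and the romanized entries are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical mora number dictionary table 30:
# mora number k -> contents of the corresponding sub-word dictionary.
MORA_DICTIONARY_TABLE: Dict[int, List[str]] = {
    1: ["tsu"],
    2: ["aki"],
    3: ["atami"],
    7: ["higashifunabashi"],
}


def select_sub_dictionaries(
    table: Dict[int, List[str]], k1: int, k2: int
) -> List[str]:
    """Collect the words of every sub-word dictionary for mora numbers k1..k2."""
    selected: List[str] = []
    for k in range(k1, k2 + 1):
        # Mora numbers with no sub-word dictionary contribute nothing.
        selected.extend(table.get(k, []))
    return selected
```

Because each sub-word dictionary is reached through its mora number rather than through an address range, a new word can be appended to its sub-word dictionary without reordering any other entries.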

As described above, according to the third embodiment of the present invention, the word dictionary 13 is divided by mora number into the plurality of sub-word dictionaries 131 to 13N, so words need not be registered in address order as in the first and second embodiments.

[Applications]
The speech recognition device according to the above embodiments is typically mounted on a motorcycle. For example, as shown in FIG. 12, a motorcycle 50 carries an in-vehicle communication device 51, an in-vehicle information device 52, a helmet-side wireless communication device 54A fitted to a helmet 53A worn by the driver, and a helmet-side wireless communication device 54B fitted to a helmet 53B worn by a passenger. The speech recognition device is built into the in-vehicle information device 52.

The motorcycle 50 includes a body frame 55, a power unit 56 attached to the body frame 55 so as to swing up and down, a rear wheel 57 that rotates under driving force from the power unit 56, a front wheel 59 serving as the steered wheel and attached to the front of the body frame 55 via a front fork 58, and a handlebar 60 that turns together with the front fork 58. The handlebar 60 is provided with a main power switch 61.

The power unit 56 is swingably connected to the lower part of the body frame 55 near its center and is elastically coupled to the rear of the body frame 55 via a rear suspension unit 62. A driver's seat 63 is arranged on the upper part of the body frame 55 near its center, and a passenger's seat 64 is arranged behind it. On the body frame 55, between the seat 63 and the handlebar 60, a driver's footrest 65 is provided on which the driver places his or her feet. Below the driver's seat 63, footrests 66 for the passenger's feet are provided on both sides of the body frame 55. To detect whether the driver and the passenger are on board, the seats 63 and 64 are provided with a driver's seat seating sensor 67 and a passenger's seat seating sensor 68, respectively.

The in-vehicle communication device 51 is fixed to the body frame 55 below the passenger's seat 64. It is connected to an antenna 69 fixed to the body frame 55 behind the passenger's seat 64 and communicates wirelessly with the helmet-side wireless communication devices 54A and 54B. The in-vehicle information device 52 is fixed to the handlebar 60 and is wired to the in-vehicle communication device 51. Examples of the in-vehicle information device 52 include a navigation system that gives voice guidance along a travel route, a music player, a radio, and a telephone voice relay device that relays the call audio of a mobile phone. The in-vehicle communication device 51 and the in-vehicle information device 52 operate on power supplied from an in-vehicle battery 70.

On the inner surface of each of the helmets 53A and 53B, a pair of speakers 71 is fixed at positions facing the wearer's left and right ears, and a microphone 72 is fixed at a position facing the wearer's mouth. The helmet-side wireless communication devices 54A and 54B are fixed to the back of the respective helmet shells. Each of the helmet-side wireless communication devices 54A and 54B has an antenna 73 and is connected to the speakers 71 and the microphone 72.

Embodiments of the present invention have been described above, but they are merely examples for carrying out the invention. The present invention is therefore not limited to the embodiments described above, and those embodiments can be modified as appropriate without departing from the spirit of the invention.

FIG. 1 is a functional block diagram showing the overall configuration of a speech recognition device according to the first embodiment of the present invention.
FIG. 2 is a diagram showing the configuration of the word dictionary used in the speech recognition device shown in FIG. 1.
FIG. 3 is a diagram showing the configuration of the mora number address table used in the speech recognition device shown in FIG. 1.
FIG. 4 is a flowchart showing the operation of the speech recognition device shown in FIG. 1.
FIG. 5 is a waveform diagram of a speech signal for explaining the processing performed by the utterance section detection unit and the utterance time measuring unit in FIG. 1.
FIG. 6 is a functional block diagram showing the overall configuration of a speech recognition device according to the second embodiment of the present invention.
FIG. 7 is a diagram showing the phoneme notation of words in order to explain the relationship between the mora number and the number of vowels in the word dictionary used in the speech recognition device shown in FIG. 6.
FIG. 8 is a flowchart showing the operation of the speech recognition device shown in FIG. 6.
FIG. 9 is a diagram showing the configuration of the word dictionary used in a speech recognition device according to the third embodiment of the present invention.
FIG. 10 is a diagram showing the configuration of the mora number dictionary table used in the speech recognition device according to the third embodiment of the present invention.
FIG. 11 is a flowchart showing the operation of the speech recognition device according to the third embodiment of the present invention.
FIG. 12 is a side view showing the external configuration of a motorcycle equipped with a speech recognition device according to an embodiment of the present invention.

Explanation of symbols

10 Speech recognition device
11 Utterance section detection unit
12 Feature quantity extraction unit
13 Word dictionary
14 Acoustic model
15 Collation unit
16 Mora number address table
17 Utterance time measuring unit
18 Word dictionary range specifying unit
19 Recognition target range
20 Speech recognition device
21 Vowel estimation unit
30 Mora number dictionary table
131-13N Sub-word dictionaries

Claims

A speech recognition device comprising:
storage means for storing a word dictionary in which a plurality of types of words are registered together with an utterance length, known in advance, that serves as an index of the length of each word when uttered, and an acoustic model in which acoustic feature quantities of the words registered in the word dictionary are registered;
feature quantity extraction means for analyzing a speech signal and extracting acoustic feature quantities from it;
utterance length specifying means for specifying an utterance length based on the speech signal;
word dictionary range specifying means for specifying, within the word dictionary, a range including words of the utterance length specified by the utterance length specifying means; and
collation means for calculating the likelihood of each word by referring to the word dictionary and the acoustic model and collating the acoustic feature quantities extracted by the feature quantity extraction means with the acoustic feature quantities of each word included in the range specified by the word dictionary range specifying means, and for taking the word with the highest likelihood as the recognition result.

The speech recognition device according to claim 1, wherein
the words in the word dictionary are sorted by mora number,
the storage means further stores a mora number address table in which the start and/or end address of each mora number is registered, and
the word dictionary range specifying means refers to the mora number address table and specifies, by the start and/or end addresses, the range including words of the utterance length specified by the utterance length specifying means.

The speech recognition device according to claim 1, further comprising
utterance section detection means for detecting, from an acoustic signal, an utterance section containing a speech signal, wherein
the utterance length specifying means includes
utterance time measuring means for measuring, as the utterance length, the duration of the utterance section detected by the utterance section detection means.

The speech recognition device according to claim 1, wherein
the utterance length specifying means includes
vowel estimation means for estimating, as the utterance length, the number of vowels or the vowel length based on the acoustic feature quantities extracted by the feature quantity extraction means.

The speech recognition device according to claim 1, wherein
the word dictionary is divided by mora number into a plurality of sub-word dictionaries, and
the word dictionary range specifying means specifies the sub-word dictionary that includes words of the utterance length specified by the utterance length specifying means.

The speech recognition device according to claim 1, wherein
the word dictionary range specifying means specifies a range including not only words of the utterance length specified by the utterance length specifying means but also words whose utterance length falls within a predetermined range close to the specified utterance length.

Transportation equipment equipped with the speech recognition device according to any one of claims 1 to 6.