JP4986028B2 - Speech recognition apparatus, utterance determination method thereof, utterance determination program, and storage medium thereof - Google Patents

Speech recognition apparatus, utterance determination method thereof, utterance determination program, and storage medium thereof

Info

Publication number
JP4986028B2
JP4986028B2
Authority
JP
Japan
Prior art keywords
utterance
syllable
section
speech
input speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP2007010853A
Other languages
Japanese (ja)
Other versions
JP2008176155A (en)
Inventor
Kengo Fujita
Tsuneo Kato
Hisashi Kawai
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KDDI Corp
Original Assignee
KDDI Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by KDDI Corp filed Critical KDDI Corp
Priority to JP2007010853A priority Critical patent/JP4986028B2/en
Publication of JP2008176155A publication Critical patent/JP2008176155A/en
Application granted granted Critical
Publication of JP4986028B2 publication Critical patent/JP4986028B2/en
Expired - Fee Related
Anticipated expiration

Description

The present invention relates to a speech recognition apparatus, an utterance determination method therefor, an utterance determination program, and a storage medium therefor, and more particularly to a speech recognition apparatus well suited to recognizing syllable-emphasized utterances, together with its utterance determination method, utterance determination program, and storage medium.

FIG. 9 shows the configuration of the main part of a conventional speech recognition apparatus. It comprises an acoustic analysis unit 51 that extracts acoustic features from the input speech, and a search processing unit 52 that, based on the extracted acoustic features, performs a search according to a statistical acoustic model 53 and a language model 54 prepared in advance and outputs a speech recognition result.

The acoustic analysis unit 51 cuts a frame of length T out of the input speech and extracts an n-dimensional acoustic feature vector representing its characteristics. As shown in FIG. 10, this process proceeds by shifting the frame position by ΔT at a time and is repeated until the end of the speech.
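As an illustration, here is a minimal Python sketch of this framing step (the 25 ms / 10 ms values are common defaults, not values taken from the patent):

```python
import numpy as np

def frame_signal(x, frame_len, frame_shift):
    """Cut a 1-D signal into frames of length `frame_len` (the T above),
    advancing the frame position by `frame_shift` samples (the ΔT above)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // frame_shift)
    return np.stack([x[i * frame_shift:i * frame_shift + frame_len]
                     for i in range(n_frames)])

# e.g. 25 ms frames shifted by 10 ms at a 16 kHz sampling rate
frames = frame_signal(np.random.randn(16000), frame_len=400, frame_shift=160)
print(frames.shape)  # (98, 400)
```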

The search processing unit 52 searches for which of the word sequences whose transitions are defined by the language model is the most probable for the input speech. As the language model, either a fixed grammar model, in which the word transition patterns are defined in advance, or a probabilistic grammar model, in which the words that can be transitioned to next are determined stochastically from the word sequence fixed up to a certain time, is used.

For example, in the fixed grammar model illustrated in FIG. 11, the words that can be reached from the initial silent state "sil" are the four surnames "Ito", "Itoi", "Imai", and "Doi"; from there, the only possible transition is through the word "desu" ("is"), after which the sequence finally returns to the silent state "sil". In other words, the search finds the maximum-likelihood word sequence within "[sil] {Ito / Itoi / Imai / Doi} desu [sil]".
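For illustration, the FIG. 11 grammar could be written as a simple word-transition table (a hypothetical encoding; the patent does not prescribe a data structure):

```python
# Fixed grammar of FIG. 11 as a word-transition table:
# [sil] -> {Ito | Itoi | Imai | Doi} -> desu -> [sil]
fixed_grammar = {
    "sil_start": ["Ito", "Itoi", "Imai", "Doi"],
    "Ito": ["desu"], "Itoi": ["desu"], "Imai": ["desu"], "Doi": ["desu"],
    "desu": ["sil_end"],
}
```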

Whether the fixed grammar model or the probabilistic grammar model is used, the search process using the per-frame acoustic features proceeds in units of phonemes, into which words are further subdivided. Each word is represented as a concatenation of per-phoneme HMM state sequences. FIG. 12 shows the HMM state sequence for the word "Imai".

The phoneme representation of "Imai" is "i/m/a/i", but to improve search performance, context-dependent HMM state sequences that depend on the preceding and following phonemes, as in FIG. 12, are generally used. Here, "sil-i+m" denotes the HMM state sequence for phoneme "i" when its preceding phoneme is "sil" and its following phoneme is "m". Each HMM state allows a transition to itself (self-transition) and a transition to the state on its right (LR transition), and the self-transition and LR-transition probabilities are described in the acoustic model. The acoustic model also describes the probability distributions used to compute, for each HMM state, the likelihood (acoustic likelihood) of the acoustic feature vector obtained for each frame.

The search process amounts to computing, for every frame and for every HMM state to be considered at that frame, the sum of the transition probability and the acoustic likelihood (the cumulative likelihood) for both the self-transition and the LR transition, repeatedly choosing the more likely transition (the one with the higher cumulative likelihood), and finally determining the HMM state sequence with the highest cumulative likelihood. The algorithm that searches for the maximum-likelihood path in this way is called the Viterbi algorithm.
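The following is a minimal log-domain sketch of Viterbi search over a strict left-to-right state sequence of this kind; the topology matches the self-transition/LR-transition structure described above, while the toy probabilities are illustrative assumptions:

```python
import numpy as np

def viterbi_lr(log_emit, log_self, log_next):
    """Left-to-right Viterbi: log_emit[t, s] is the log acoustic likelihood
    of state s at frame t; log_self[s] / log_next[s] are the log probabilities
    of the self-transition and of the transition from state s to state s+1."""
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)
    delta[0, 0] = log_emit[0, 0]            # the path must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1, s] + log_self[s]
            move = delta[t - 1, s - 1] + log_next[s - 1] if s > 0 else -np.inf
            delta[t, s] = max(stay, move) + log_emit[t, s]   # cumulative likelihood
    return delta[-1, -1]                    # score of the best path ending in the last state

# toy example: 10 frames, 3 states
rng = np.random.default_rng(0)
score = viterbi_lr(np.log(rng.random((10, 3))),
                   np.log(np.full(3, 0.6)), np.log(np.full(3, 0.4)))
print(score)
```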

When a recognition error forces the user to speak again, even if the user's first utterance was a normal utterance such as one would address to another person in everyday life, the re-utterance often becomes a syllable-emphasized utterance in which each syllable is separated and stressed, with the same intent as when one enunciates so as to be more easily understood by a human listener.

FIGS. 13 and 14 show, respectively, the waveform of the normal utterance "Kanagawa" and the waveform of the syllable-emphasized utterance "Ka-na-ga-wa", both of the same content "Kanagawa" by the same speaker. In the syllable-emphasized utterance there are silent intervals between syllables in mid-utterance, which are not seen in the normal utterance, and the waveform looks as if the syllables had been uttered individually.

In a syllable-emphasized utterance the utterance intervals are not continuous as in a normal utterance; as shown in FIG. 14, there is a silent interval between syllables. However, in an ordinary speech recognition apparatus the HMM state sequence of each word described by the language model does not allow a transition to "sil" between syllables, as shown in FIG. 12, so for a syllable-emphasized utterance the search must proceed as if some phoneme were present in the silent intervals between syllables. As a result, the drop in acoustic likelihood over these silent intervals lowers the cumulative likelihood of the HMM state sequence corresponding to the actual utterance content, which can cause misrecognition.

Conventionally, this problem has been addressed by adding to the language model or the acoustic model a description that allows a transition to "sil" after each syllable of a word's HMM state sequence, so as to cover the silent intervals between syllables of a syllable-emphasized utterance.

Patent Document 1 discloses a technique that tries to maintain recognition performance even for syllable-emphasized utterances, in which silence is inserted and the acoustic features between syllables deviate from those of normal utterance, by adding to the HMM state sequence for normal utterance several HMM state sequences for syllable-emphasized utterance (multi-path modeling), for example by adding a skippable transition to silence as a following-phoneme context.

Patent Document 2 discloses a technique that, by using a model that allows the insertion of silence after each syllable, tries to maintain recognition performance for syllable-emphasized utterances not only in languages such as Japanese (the target of Patent Document 1 above), in which a syllable always ends after a vowel, but also in languages, English among them, in which a syllable boundary can fall after any phoneme.
Patent Document 1: JP 2002-189494 A; Patent Document 2: JP 2006-243123 A

FIG. 15 shows the HMM state sequence of FIG. 12 running through "sil-i+m" and "i-m+a", with a transition to "sil" added after the syllable "i". At the end of the first syllable "i" of the two consecutive syllables "i·ma", in addition to "sil-i+m", a transition is allowed into a single-state "sil" either via an HMM state sequence dependent on the following "sil" context, such as "sil-i+sil", or via an HMM state sequence with no following-context dependency, such as "sil-i+*". Correspondingly, at the start of the second syllable "ma", in addition to "i-m+a", a transition to "sil-m+a", which depends on the preceding "sil" context, is added.

The transition into the single-state "sil" at the end of the first syllable can also be skipped. The point where a syllable-emphasized utterance differs most from a normal utterance is the presence of silent intervals between syllables, but under the influence of these silent intervals, acoustic features intermediate between syllables uttered individually and normal utterance may also appear. Allowing the many alternative transitions of FIG. 15 is a countermeasure against these differences between syllable-emphasized and normal utterance.

However, considering multiple transitions as in FIG. 15 for every syllable of every word increases the amount of processing required for the search, and may delay the time until a recognition result is obtained. Moreover, since the same language model is used even when the input is a normal utterance, the widened search space caused by the unnecessary HMM state sequences for syllable-emphasized utterance also degrades recognition performance.

Thus, when the input to the speech recognition apparatus is a syllable-emphasized utterance, a search process aimed at normal utterance carries a high risk of misrecognition. To prevent the adverse effects of misrecognition, possible countermeasures when the input is a syllable-emphasized utterance include not executing the search and instead prompting the user to speak again in a normal voice, or switching to a search process aimed at syllable-emphasized utterance; either way, it is necessary to determine whether the input is a syllable-emphasized utterance before the search process.

An object of the present invention is to solve the above problems of the prior art and to provide a speech recognition apparatus capable of determining, before the search process, whether the input is a syllable-emphasized utterance, together with its utterance determination method, utterance determination program, and storage medium.

To achieve the above object, the speech recognition apparatus of the present invention is characterized by the following means.
(1) It comprises acoustic analysis means for extracting acoustic features from the input speech, a statistical model for performing speech recognition based on the extracted acoustic features, syllable-emphasized-utterance determination means for determining, based on the periodicity of the extracted acoustic features, whether the input speech is a syllable-emphasized utterance, and a search processing unit that executes a search process by applying the statistical model to the acoustic features, and it executes a specific speech recognition operation according to the result of determining whether the input speech is a syllable-emphasized utterance.
(2) It includes means for requesting the speaker to utter again when the input speech is determined to be a syllable-emphasized utterance.
(3) It comprises a first statistical model corresponding to acoustic features specific to normal utterance and a second statistical model corresponding to acoustic features specific to syllable-emphasized utterance, and the search processing unit executes the search using the second statistical model when the input speech is determined to be a syllable-emphasized utterance.

According to the present invention, the following effects are achieved.
(1) Whether the user's utterance is a normal utterance or a syllable-emphasized utterance can be determined before the search process starts, so the apparatus can move in a short time to the processing appropriate to the user's utterance.
(2) Whether the user's utterance is normal or syllable-emphasized is determined by focusing on the periodicity of the acoustic features, so an accurate determination is possible with a small processing load.
(3) When the user's utterance is determined to be syllable-emphasized, the user is prompted to re-utter in a normal voice, so accurate speech recognition based on normal utterance becomes possible.
(4) A statistical model for normal utterance and a statistical model for syllable-emphasized utterance are provided and used selectively according to whether the user's utterance is normal or syllable-emphasized, so good speech recognition is possible whichever way the user speaks.

Preferred embodiments of the present invention will now be described in detail with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the main part of a first embodiment of the speech recognition apparatus according to the present invention.

The input control unit 11 includes a re-utterance request unit 14 that, when the syllable-emphasized-utterance determination unit 13 described in detail later determines the input to be a syllable-emphasized utterance, outputs a message requesting the user to speak again in a normal voice, for example "Please say it again, speaking as you would in a normal conversation." The acoustic analysis unit 12 extracts acoustic features from the input speech.

Using the acoustic features extracted by the acoustic analysis unit 12, the syllable-emphasized-utterance determination unit 13 detects the periodicity with which syllables uttered at roughly regular intervals appear, and determines whether the input speech is a syllable-emphasized utterance. The search processing unit 15, based on the extracted acoustic features, performs a search according to a statistical acoustic model 16 and a language model 17 prepared in advance, and outputs a speech recognition result.

FIG. 2 is a block diagram showing the configuration of the main part of a second embodiment of the speech recognition apparatus according to the present invention, in which the same reference numerals as above denote the same or equivalent parts.

The first acoustic model 16 and the first language model 17 hold statistical models for performing speech recognition based on the acoustic features of normal utterance. The second acoustic model 18 and the second language model 19 hold statistical models for performing speech recognition based on the acoustic features of syllable-emphasized utterance. The statistical model selection unit 20 selects the first statistical models 16 and 17 if the input speech is a normal utterance, and the second statistical models 18 and 19 if it is a syllable-emphasized utterance. The search processing unit 15 executes the search using the selected statistical models.

FIG. 3 schematically shows the configuration of the syllable-emphasized-utterance determination unit 13. Its main components are an utterance-interval detection unit 131 that detects the utterance intervals of the input speech based on acoustic features extracted by the acoustic analysis unit 12, such as the power (E) of the input speech and its n-dimensional MFCC (including the zeroth-order MFCC term C0), and a periodicity determination unit 132 that determines the periodicity with which the detected utterance intervals appear. The result of determining whether the input speech is a syllable-emphasized utterance is output to the input control unit 11 in the first embodiment of FIG. 1, and to the statistical model selection unit 20 in the second embodiment of FIG. 2.

Next, the operation of the syllable-emphasized-utterance determination unit 13 will be described in detail. Among the acoustic features commonly used in speech recognition are cepstral-domain features (MFCC: Mel Frequency Cepstrum Coefficients) and power. MFCCs are parameters representing the spectral envelope, extracted by applying a mel-scale filter bank to the power spectrum obtained by FFT analysis of the speech data for each frame, and then performing a discrete cosine transform (DCT) on the frequency-warped power spectrum; the details are explained in, for example, "Speech Recognition System" (edited by Kiyohiro Shikano, Katsunobu Ito, Tatsuya Kawahara, Kazuya Takeda, and Mikio Yamamoto, Ohmsha; ISBN 4-274-13228-5).

In speech recognition, a commonly used acoustic feature vector consists of the 12-dimensional MFCCs (MFCC1, MFCC2, ..., MFCC12) obtained by applying a discrete cosine transform to the spectral features of the input speech and performing three operations in the cepstral domain (DC-component removal, liftering, and cepstral mean subtraction), their first-order time derivatives (ΔMFCC1, ΔMFCC2, ..., ΔMFCC12), and the first-order time derivative of the power E (ΔE), for 25 dimensions in total; a 38-dimensional feature vector that further adds the second-order time derivatives (ΔΔMFCC1, ΔΔMFCC2, ..., ΔΔMFCC12, ΔΔE) is also often used.
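A sketch of assembling such a feature vector with the librosa library (librosa's MFCC pipeline differs in detail from the liftering and cepstral-mean-subtraction steps described above, and ΔC0 is used here as a stand-in for ΔE, so this only approximates the 25/38-dimensional layout):

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)  # any mono signal will do

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # row 0 is C0, rows 1..12 are MFCC1..12
d1 = librosa.feature.delta(mfcc)                    # first-order time derivatives
d2 = librosa.feature.delta(mfcc, order=2)           # second-order time derivatives

# 12 MFCCs + 12 ΔMFCCs + ΔC0 (standing in for ΔE) = 25 dimensions,
# plus the 13 ΔΔ terms = 38 dimensions in total
feat25 = np.vstack([mfcc[1:], d1[1:], d1[:1]])
feat38 = np.vstack([feat25, d2])
print(feat25.shape[0], feat38.shape[0])  # 25 38
```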

ΔE represents the temporal change in power. At the boundary where a silent interval switches to an utterance interval the power increases sharply, so ΔE has a large positive peak; conversely, at the boundary from an utterance interval to a silent interval, ΔE has a negative peak of large absolute value. The utterance intervals and silent intervals can therefore be distinguished by the positive and negative peaks (or the maximum amplitude) of ΔE.

Furthermore, multiplying ΔE by the product (ΔMFCC_1) of the absolute values of the n-dimensional ΔMFCCs (|ΔMFCC1|, |ΔMFCC2|, ..., |ΔMFCCn|) further emphasizes the peaks at the boundaries between utterance and silent intervals, while suppressing the appearance of peaks caused by power changes elsewhere.
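Continuing from the previous sketch (reusing `d1`), a sketch of this boundary-emphasis signal; the choice n = 12 and the use of ΔC0 in place of ΔE are assumptions:

```python
# Multiply ΔE (approximated here by ΔC0 = d1[0]) by the product of the
# absolute ΔMFCC values: boundary peaks are emphasized, while incidental
# power fluctuations elsewhere are suppressed.
n = 12
delta_e = d1[0]
dmfcc_prod = np.prod(np.abs(d1[1:n + 1]), axis=0)
boundary_signal = delta_e * dmfcc_prod  # ΔE · |ΔMFCC_1| in the patent's notation
```

Note that with n = 12 the raw product of twelve small magnitudes can underflow toward zero in float arithmetic; summing log magnitudes is a common numerical workaround, although the patent describes the plain product.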

FIGS. 4 and 5 show how ΔE·|ΔMFCC_1| changes for the normal utterance and the syllable-emphasized utterance of the same content "Kanagawa" by the same speaker, whose waveforms are shown in FIGS. 13 and 14. For both utterances, a positive peak appears at the start of each utterance interval and a negative peak at its end.

Comparing FIGS. 4 and 5, in the normal utterance of FIG. 4 the peaks corresponding to the individual syllables run together, whereas in the syllable-emphasized utterance of FIG. 5 peaks appear at nearly constant intervals. If the peaks appeared perfectly periodically, that is, if the peak interval were perfectly constant, then taking the autocorrelation of ΔE·|ΔMFCC_1| for the syllable-emphasized utterance would produce peaks at lags τ equal to integer multiples of this period.

In practice, however, the peak intervals fluctuate, and the peaks of ΔE·|ΔMFCC_1| are very sharp, so there is a strong possibility that no clear peaks appear in the autocorrelation. In such cases, as shown in FIG. 6, peaks can be made to appear in the autocorrelation result by peak-picking ΔE·|ΔMFCC_1| (turning it into a rectangular wave): for every fixed number of frames, the whole interval is represented by the maximum amplitude within it, which absorbs the fluctuation of the peak intervals. The fixed intervals may partially overlap, as in the frame processing for speech recognition described with reference to FIG. 10, or they may be contiguous without overlapping.
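A sketch of this max-amplitude rectangularization, using non-overlapping blocks (the block length is an illustrative assumption; the patent allows either overlapping or contiguous intervals):

```python
def rectangularize(sig, win):
    """Represent each length-`win` block of `sig` by the sample with the
    largest magnitude in that block, absorbing jitter in the peak positions."""
    out = np.empty_like(sig)
    for start in range(0, len(sig), win):
        block = sig[start:start + win]
        out[start:start + win] = block[np.argmax(np.abs(block))]
    return out

smoothed = rectangularize(boundary_signal, win=10)  # e.g. blocks of 10 frames
```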

FIG. 7 shows the autocorrelation of ΔE·|ΔMFCC_1| after the fluctuation has been absorbed; the lags τp1, τp2, and τp3 at which large peaks appear are spaced at nearly constant intervals. The closer the peak intervals of ΔE·|ΔMFCC_1| are to constant, the larger the values of the individual peaks in the autocorrelation. One can therefore set a criterion such as, for example, judging the input speech to be a syllable-emphasized utterance when the first peak at lag τp1, that is, the first-order autocorrelation, exceeds a predetermined threshold.

Accordingly, in this embodiment the utterance-interval detection unit 131 detects the utterance intervals based on the time-series information described with reference to FIG. 6, and the periodicity determination unit 132 determines the input speech to be a syllable-emphasized utterance when the first-order autocorrelation of this time series exceeds a predetermined threshold.
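Putting the pieces together, a sketch of the periodicity decision; the normalization, the way the first peak is located, and the threshold value are all assumptions, since the patent only specifies that the first-order autocorrelation is compared against a predetermined threshold:

```python
def is_syllable_emphasized(sig, threshold=0.4, min_lag=5):
    """Return True when the first autocorrelation peak (past `min_lag`)
    of the rectangularized boundary signal exceeds `threshold`."""
    sig = sig - sig.mean()
    ac = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    ac = ac / ac[0]  # normalize so that the zero-lag value is 1
    # the first local maximum past min_lag corresponds to the τp1 peak of FIG. 7
    for lag in range(min_lag + 1, len(ac) - 1):
        if ac[lag] >= ac[lag - 1] and ac[lag] >= ac[lag + 1]:
            return ac[lag] > threshold
    return False

print(is_syllable_emphasized(smoothed))
```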

If, instead of the power E, the zeroth-order MFCC term is used, that is, C0, which corresponds to the DC component of the spectrum in each frame, the autocorrelation of ΔC0·|ΔMFCC_1| for the same utterance shows, as in FIG. 8, the same behavior as the autocorrelation of ΔE·|ΔMFCC_1| in FIG. 7, albeit on a different scale. ΔC0 may therefore be used in place of ΔE to determine in the same way whether the input is a syllable-emphasized utterance.

Furthermore, the embodiment above was described as detecting the utterance intervals based on time-series information obtained by multiplying the time rate of change (ΔE) of the power E of the input speech, or the time rate of change (ΔC0) of the zeroth-order MFCC term (C0) of the input speech, by the product (ΔMFCC_1) of the absolute values of the time rates of change of the n-dimensional MFCCs of the input speech, and then representing each predetermined fixed interval by the maximum amplitude within that interval; however, the present invention is not limited to this, and the following modifications are possible.

As a first modification, the utterance intervals may be detected based only on the time rate of change (ΔE) of the power (E) of the input speech.

As a second modification, the utterance intervals may be detected based only on the time rate of change (ΔC0) of the zeroth-order MFCC term (C0) of the input speech.

As a third modification, the utterance intervals may be detected based on time-series information obtained by multiplying the time rate of change (ΔE) of the power E of the input speech, or the time rate of change (ΔC0) of the zeroth-order MFCC term (C0) of the input speech, by the product (ΔMFCC_1) of the absolute values of the time rates of change of the n-dimensional MFCCs of the input speech.

As a fourth modification, the utterance intervals may be detected based on time-series information obtained by multiplying the time rate of change (ΔE) of the power E of the input speech, or the time rate of change (ΔC0) of the zeroth-order MFCC term (C0) of the input speech, by the product (ΔMFCC_1) of the absolute values of the time rates of change of the n-dimensional MFCCs of the input speech, and then smoothing the result.

FIG. 1 is a block diagram of a first embodiment of the speech recognition apparatus according to the present invention.
FIG. 2 is a block diagram of a second embodiment of the speech recognition apparatus according to the present invention.
FIG. 3 schematically shows the configuration of the syllable-emphasized-utterance determination unit.
FIG. 4 is a waveform chart showing the change of ΔE·|ΔMFCC_1| for the normal utterance "Kanagawa".
FIG. 5 is a waveform chart showing the change of ΔE·|ΔMFCC_1| for the syllable-emphasized utterance "Ka-na-ga-wa".
FIG. 6 is a waveform chart obtained by applying the fluctuation-absorbing technique to the waveform of FIG. 5.
FIG. 7 shows the autocorrelation result of ΔE·|ΔMFCC_1| after the fluctuation has been absorbed.
FIG. 8 shows the autocorrelation result of ΔC0·|ΔMFCC_1| after the fluctuation has been absorbed.
FIG. 9 shows the configuration of the main part of a conventional speech recognition apparatus.
FIG. 10 illustrates how acoustic features are extracted in the acoustic analysis unit.
FIG. 11 schematically shows the structure of a fixed grammar model.
FIG. 12 shows the HMM state sequence of the word "Imai".
FIG. 13 is a waveform chart of the normal utterance "Kanagawa".
FIG. 14 is a waveform chart of the syllable-emphasized utterance "Ka-na-ga-wa".
FIG. 15 shows an HMM state sequence with transitions to "sil" added.

Explanation of symbols

11: input control unit, 12: acoustic analysis unit, 13: syllable-emphasized-utterance determination unit, 14: re-utterance request unit, 15: search processing unit, 16: first acoustic model, 17: first language model, 18: second acoustic model, 19: second language model, 20: statistical model selection unit

Claims (22)

1. A speech recognition apparatus comprising:
acoustic analysis means for extracting acoustic features from input speech;
a statistical model for performing speech recognition based on the extracted acoustic features;
syllable-emphasized-utterance determination means for determining, based on the periodicity of the extracted acoustic features, whether the input speech is a syllable-emphasized utterance; and
a search processing unit that executes a search process by applying the statistical model to the acoustic features,
wherein a specific speech recognition operation is executed according to the result of determining whether the input speech is a syllable-emphasized utterance,
the syllable-emphasized-utterance determination means comprises:
utterance-interval detection means for detecting the utterance interval of each syllable based on the boundary at which a silent interval switches to an utterance interval and the boundary at which an utterance interval switches to a silent interval; and
periodicity determination means for determining the periodicity with which the utterance interval of each syllable appears,
and input speech having said periodicity is determined to be a syllable-emphasized utterance.

2. The speech recognition apparatus according to claim 1, wherein the syllable-emphasized-utterance determination means determines the input speech to be a syllable-emphasized utterance when the appearance periodicity is higher than a predetermined reference value.

3. The speech recognition apparatus according to claim 1 or 2, wherein the periodicity determination means determines the appearance periodicity based on the autocorrelation of the appearance period of the utterance interval of each syllable.

4. The speech recognition apparatus according to claim 1 or 2, wherein the periodicity determination means determines the appearance periodicity based on the first-order autocorrelation of the appearance period of the utterance interval of each syllable.

5. The speech recognition apparatus according to any one of claims 1 to 4, further comprising means for requesting the speaker to utter again when the input speech is determined to be a syllable-emphasized utterance.

6. The speech recognition apparatus according to any one of claims 1 to 4, comprising a first statistical model corresponding to acoustic features specific to normal utterance and a second statistical model corresponding to acoustic features specific to syllable-emphasized utterance, wherein the search processing unit executes the search process using the second statistical model when the input speech is determined to be a syllable-emphasized utterance.

7. The speech recognition apparatus according to any one of claims 2 to 6, wherein the acoustic features include the power (E) of the input speech, and the utterance-interval detection means detects the utterance interval of each syllable based on the time rate of change (ΔE) of the power of the input speech.

8. The speech recognition apparatus according to any one of claims 2 to 6, wherein the acoustic features include the MFCC of the input speech, and the utterance-interval detection means detects the utterance interval of each syllable based on the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech.

9. The speech recognition apparatus according to any one of claims 2 to 6, wherein the acoustic features include the power (E) of the input speech and its n-dimensional MFCC, and the utterance-interval detection means detects the utterance interval of each syllable based on time-series information obtained by multiplying either the time rate of change (ΔE) of the power of the input speech or the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech by the product of the absolute values of the time rates of change ΔMFCC of the n dimensions of the MFCC of the input speech.

10. The speech recognition apparatus according to any one of claims 2 to 6, wherein the acoustic features include the power (E) of the input speech and its n-dimensional MFCC, and the utterance-interval detection means detects the utterance interval of each syllable based on time-series information obtained by multiplying either the time rate of change (ΔE) of the power of the input speech or the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech by the product of the absolute values of the time rates of change ΔMFCC of the n dimensions of the MFCC of the input speech, and then smoothing the result.

11. The speech recognition apparatus according to any one of claims 2 to 6, wherein the acoustic features include the power (E) of the input speech and its n-dimensional MFCC, and the utterance-interval detection means detects the utterance interval of each syllable based on time-series information obtained by multiplying either the time rate of change (ΔE) of the power of the input speech or the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech by the product of the absolute values of the time rates of change ΔMFCC of the n dimensions of the MFCC of the input speech, and then representing each predetermined fixed interval by the maximum amplitude within that interval.

12. An utterance determination method for determining whether input speech is a syllable-emphasized utterance, comprising:
a step of extracting acoustic features from the input speech;
a step of detecting, from the extracted acoustic features, the utterance interval of each syllable of the input speech based on the boundary at which a silent interval switches to an utterance interval and the boundary at which an utterance interval switches to a silent interval;
a step of determining the appearance periodicity of the utterance interval of each syllable; and
a step of determining, based on the appearance periodicity of the utterance interval of each syllable, whether the input speech is a syllable-emphasized utterance.

13. The utterance determination method according to claim 12, wherein the step of determining whether the input speech is a syllable-emphasized utterance determines the input speech to be a syllable-emphasized utterance when the appearance periodicity is higher than a predetermined reference value.

14. The utterance determination method according to claim 12 or 13, wherein the step of determining the appearance periodicity determines it based on the autocorrelation of the appearance period of the utterance interval of each syllable.

15. The utterance determination method according to claim 12 or 13, wherein the step of determining the appearance periodicity determines it based on the first-order autocorrelation of the appearance period of the utterance interval of each syllable.

16. The utterance determination method according to any one of claims 12 to 15, wherein the acoustic features include the power (E) of the input speech, and the step of detecting the utterance interval of each syllable detects the utterance interval based on the time rate of change (ΔE) of the power of the input speech.

17. The utterance determination method according to any one of claims 12 to 15, wherein the acoustic features include the MFCC of the input speech, and the step of detecting the utterance interval of each syllable detects the utterance interval based on the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech.

18. The utterance determination method according to any one of claims 12 to 15, wherein the acoustic features include the power (E) of the input speech and its n-dimensional MFCC, and the step of detecting the utterance interval of each syllable detects the utterance interval based on time-series information obtained by multiplying either the time rate of change (ΔE) of the power of the input speech or the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech by the product of the absolute values of the time rates of change ΔMFCC of the n dimensions of the MFCC of the input speech.

19. The utterance determination method according to any one of claims 12 to 15, wherein the acoustic features include the power (E) of the input speech and its n-dimensional MFCC, and the step of detecting the utterance interval of each syllable detects the utterance interval based on time-series information obtained by multiplying either the time rate of change (ΔE) of the power of the input speech or the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech by the product of the absolute values of the time rates of change ΔMFCC of the n dimensions of the MFCC of the input speech, and then smoothing the result.

20. The utterance determination method according to any one of claims 12 to 15, wherein the acoustic features include the power (E) of the input speech and its n-dimensional MFCC, and the step of detecting the utterance interval of each syllable detects the utterance interval based on time-series information obtained by multiplying either the time rate of change (ΔE) of the power of the input speech or the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech by the product of the absolute values of the time rates of change ΔMFCC of the n dimensions of the MFCC of the input speech, and then representing each predetermined fixed interval by the maximum amplitude within that interval.

21. An utterance determination program causing a computer to execute the utterance determination method according to any one of claims 12 to 20.

22. A storage medium storing the utterance determination program according to claim 21 so as to be readable by a computer.
JP2007010853A 2007-01-19 2007-01-19 Speech recognition apparatus, utterance determination method thereof, utterance determination program, and storage medium thereof Expired - Fee Related JP4986028B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2007010853A JP4986028B2 (en) 2007-01-19 2007-01-19 Speech recognition apparatus, utterance determination method thereof, utterance determination program, and storage medium thereof


Publications (2)

Publication Number Publication Date
JP2008176155A JP2008176155A (en) 2008-07-31
JP4986028B2 true JP4986028B2 (en) 2012-07-25

Family

ID=39703216

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2007010853A Expired - Fee Related JP4986028B2 (en) 2007-01-19 2007-01-19 Speech recognition apparatus, utterance determination method thereof, utterance determination program, and storage medium thereof

Country Status (1)

Country Link
JP (1) JP4986028B2 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
MX2011008605A (en) * 2009-02-27 2011-09-09 Panasonic Corp Tone determination device and tone determination method.
JP2015215503A (en) * 2014-05-12 2015-12-03 日本電信電話株式会社 Voice recognition method, voice recognition device and voice recognition program
CN110070883B (en) * 2016-01-14 2023-07-28 深圳市韶音科技有限公司 Speech enhancement method
US11996115B2 (en) 2019-03-08 2024-05-28 Nec Corporation Sound processing method
CN111768800B (en) * 2020-06-23 2024-06-25 中兴通讯股份有限公司 Voice signal processing method, equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS62164097A (en) * 1986-01-14 1987-07-20 株式会社リコー Voice discrimination system
JPS62166400A (en) * 1986-01-20 1987-07-22 株式会社リコー Voice wordprocessor
JPH0383100A (en) * 1989-08-25 1991-04-09 Ricoh Co Ltd Detector for voice section
JP3720595B2 (en) * 1998-09-17 2005-11-30 キヤノン株式会社 Speech recognition apparatus and method, and computer-readable memory
JP3588030B2 (en) * 2000-03-16 2004-11-10 三菱電機株式会社 Voice section determination device and voice section determination method
JP2006010739A (en) * 2004-06-22 2006-01-12 Toyota Central Res & Dev Lab Inc Speech recognition device
US8175868B2 (en) * 2005-10-20 2012-05-08 Nec Corporation Voice judging system, voice judging method and program for voice judgment

Also Published As

Publication number Publication date
JP2008176155A (en) 2008-07-31


Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20090707

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20110210

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20110406

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20110513

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20120125

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20120326

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20120418

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20120418

R150 Certificate of patent or registration of utility model

Ref document number: 4986028

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150


FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20150511

Year of fee payment: 3

LAPS Cancellation because of no payment of annual fees