JP4986028B2 - Speech recognition apparatus, utterance determination method thereof, utterance determination program, and storage medium thereof - Google Patents

Speech recognition apparatus, utterance determination method thereof, utterance determination program, and storage medium thereof

Info

Publication number
JP4986028B2
JP4986028B2
Authority
JP
Japan
Prior art keywords
utterance
syllable
section
speech
input speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP2007010853A
Other languages
Japanese (ja)
Other versions
JP2008176155A (en)
Inventor
Kengo Fujita
Tsuneo Kato
Hisashi Kawai
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KDDI Corp
Original Assignee
KDDI Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by KDDI Corp filed Critical KDDI Corp
Priority to JP2007010853A priority Critical patent/JP4986028B2/en
Publication of JP2008176155A publication Critical patent/JP2008176155A/en
Application granted granted Critical
Publication of JP4986028B2 publication Critical patent/JP4986028B2/en
Expired - Fee Related
Anticipated expiration

Description

The present invention relates to a speech recognition apparatus, an utterance determination method therefor, an utterance determination program, and a storage medium therefor, and more particularly to a speech recognition apparatus well suited to recognizing syllable-emphasized utterances, together with its utterance determination method, utterance determination program, and storage medium.

FIG. 9 shows the configuration of the main part of a conventional speech recognition apparatus. It comprises an acoustic analysis unit 51 that extracts acoustic features from the input speech, and a search processing unit 52 that, based on the extracted acoustic features, performs a search according to a statistical acoustic model 53 and a language model 54 prepared in advance and outputs a speech recognition result.

The acoustic analysis unit 51 cuts a frame of length T out of the input speech and extracts an n-dimensional acoustic feature vector representing its characteristics. As shown in FIG. 10, this process proceeds by shifting the frame position by ΔT at a time and is repeated until the end of the speech.
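As an illustration, here is a minimal Python sketch of this framing step (the 25 ms / 10 ms values are common defaults, not values taken from the patent):

```python
import numpy as np

def frame_signal(x, frame_len, frame_shift):
    """Cut a 1-D signal into frames of length `frame_len` (the T above),
    advancing the frame position by `frame_shift` samples (the ΔT above)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // frame_shift)
    return np.stack([x[i * frame_shift:i * frame_shift + frame_len]
                     for i in range(n_frames)])

# e.g. 25 ms frames shifted by 10 ms at a 16 kHz sampling rate
frames = frame_signal(np.random.randn(16000), frame_len=400, frame_shift=160)
print(frames.shape)  # (98, 400)
```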

The search processing unit 52 searches for which of the word sequences whose transitions are defined by the language model is the most probable for the input speech. As the language model, either a fixed grammar model, in which the word transition patterns are defined in advance, or a probabilistic grammar model, in which the words that can be transitioned to next are determined stochastically from the word sequence fixed up to a certain time, is used.

For example, in the fixed grammar model illustrated in FIG. 11, the words that can be reached from the initial silent state "sil" are the four surnames "Ito", "Itoi", "Imai", and "Doi"; from there, the only possible transition is through the word "desu" ("is"), after which the sequence finally returns to the silent state "sil". In other words, the search finds the maximum-likelihood word sequence within "[sil] {Ito / Itoi / Imai / Doi} desu [sil]".
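For illustration, the FIG. 11 grammar could be written as a simple word-transition table (a hypothetical encoding; the patent does not prescribe a data structure):

```python
# Fixed grammar of FIG. 11 as a word-transition table:
# [sil] -> {Ito | Itoi | Imai | Doi} -> desu -> [sil]
fixed_grammar = {
    "sil_start": ["Ito", "Itoi", "Imai", "Doi"],
    "Ito": ["desu"], "Itoi": ["desu"], "Imai": ["desu"], "Doi": ["desu"],
    "desu": ["sil_end"],
}
```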

Whether the fixed grammar model or the probabilistic grammar model is used, the search process using the per-frame acoustic features proceeds in units of phonemes, into which words are further subdivided. Each word is represented as a concatenation of per-phoneme HMM state sequences. FIG. 12 shows the HMM state sequence for the word "Imai".

The phoneme representation of "Imai" is "i/m/a/i", but to improve search performance, context-dependent HMM state sequences that depend on the preceding and following phonemes, as in FIG. 12, are generally used. Here, "sil-i+m" denotes the HMM state sequence for phoneme "i" when its preceding phoneme is "sil" and its following phoneme is "m". Each HMM state allows a transition to itself (self-transition) and a transition to the state on its right (LR transition), and the self-transition and LR-transition probabilities are described in the acoustic model. The acoustic model also describes the probability distributions used to compute, for each HMM state, the likelihood (acoustic likelihood) of the acoustic feature vector obtained for each frame.

The search process amounts to computing, for every frame and for every HMM state to be considered at that frame, the sum of the transition probability and the acoustic likelihood (the cumulative likelihood) for both the self-transition and the LR transition, repeatedly choosing the more likely transition (the one with the higher cumulative likelihood), and finally determining the HMM state sequence with the highest cumulative likelihood. The algorithm that searches for the maximum-likelihood path in this way is called the Viterbi algorithm.
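The following is a minimal log-domain sketch of Viterbi search over a strict left-to-right state sequence of this kind; the topology matches the self-transition/LR-transition structure described above, while the toy probabilities are illustrative assumptions:

```python
import numpy as np

def viterbi_lr(log_emit, log_self, log_next):
    """Left-to-right Viterbi: log_emit[t, s] is the log acoustic likelihood
    of state s at frame t; log_self[s] / log_next[s] are the log probabilities
    of the self-transition and of the transition from state s to state s+1."""
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)
    delta[0, 0] = log_emit[0, 0]            # the path must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1, s] + log_self[s]
            move = delta[t - 1, s - 1] + log_next[s - 1] if s > 0 else -np.inf
            delta[t, s] = max(stay, move) + log_emit[t, s]   # cumulative likelihood
    return delta[-1, -1]                    # score of the best path ending in the last state

# toy example: 10 frames, 3 states
rng = np.random.default_rng(0)
score = viterbi_lr(np.log(rng.random((10, 3))),
                   np.log(np.full(3, 0.6)), np.log(np.full(3, 0.4)))
print(score)
```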

When a recognition error forces the user to speak again, even if the user's first utterance was a normal utterance such as one would address to another person in everyday life, the re-utterance often becomes a syllable-emphasized utterance in which each syllable is separated and stressed, with the same intent as when one enunciates so as to be more easily understood by a human listener.

FIGS. 13 and 14 show, respectively, the waveform of the normal utterance "Kanagawa" and the waveform of the syllable-emphasized utterance "Ka-na-ga-wa", both of the same content "Kanagawa" by the same speaker. In the syllable-emphasized utterance there are silent intervals between syllables in mid-utterance, which are not seen in the normal utterance, and the waveform looks as if the syllables had been uttered individually.

In a syllable-emphasized utterance the utterance intervals are not continuous as in a normal utterance; as shown in FIG. 14, there is a silent interval between syllables. However, in an ordinary speech recognition apparatus the HMM state sequence of each word described by the language model does not allow a transition to "sil" between syllables, as shown in FIG. 12, so for a syllable-emphasized utterance the search must proceed as if some phoneme were present in the silent intervals between syllables. As a result, the drop in acoustic likelihood over these silent intervals lowers the cumulative likelihood of the HMM state sequence corresponding to the actual utterance content, which can cause misrecognition.

Conventionally, this problem has been addressed by adding to the language model or the acoustic model a description that allows a transition to "sil" after each syllable of a word's HMM state sequence, so as to cover the silent intervals between syllables of a syllable-emphasized utterance.

Patent Document 1 discloses a technique that tries to maintain recognition performance even for syllable-emphasized utterances, in which silence is inserted and the acoustic features between syllables deviate from those of normal utterance, by adding to the HMM state sequence for normal utterance several HMM state sequences for syllable-emphasized utterance (multi-path modeling), for example by adding a skippable transition to silence as a following-phoneme context.

Patent Document 2 discloses a technique that, by using a model that allows the insertion of silence after each syllable, tries to maintain recognition performance for syllable-emphasized utterances not only in languages such as Japanese (the target of Patent Document 1 above), in which a syllable always ends after a vowel, but also in languages, English among them, in which a syllable boundary can fall after any phoneme.
Patent Document 1: JP 2002-189494 A; Patent Document 2: JP 2006-243123 A

FIG. 15 shows the HMM state sequence of FIG. 12 running through "sil-i+m" and "i-m+a", with a transition to "sil" added after the syllable "i". At the end of the first syllable "i" of the two consecutive syllables "i·ma", in addition to "sil-i+m", a transition is allowed into a single-state "sil" either via an HMM state sequence dependent on the following "sil" context, such as "sil-i+sil", or via an HMM state sequence with no following-context dependency, such as "sil-i+*". Correspondingly, at the start of the second syllable "ma", in addition to "i-m+a", a transition to "sil-m+a", which depends on the preceding "sil" context, is added.

The transition into the single-state "sil" at the end of the first syllable can also be skipped. The point where a syllable-emphasized utterance differs most from a normal utterance is the presence of silent intervals between syllables, but under the influence of these silent intervals, acoustic features intermediate between syllables uttered individually and normal utterance may also appear. Allowing the many alternative transitions of FIG. 15 is a countermeasure against these differences between syllable-emphasized and normal utterance.

However, considering multiple transitions as in FIG. 15 for every syllable of every word increases the amount of processing required for the search, and may delay the time until a recognition result is obtained. Moreover, since the same language model is used even when the input is a normal utterance, the widened search space caused by the unnecessary HMM state sequences for syllable-emphasized utterance also degrades recognition performance.

Thus, when the input to the speech recognition apparatus is a syllable-emphasized utterance, a search process aimed at normal utterance carries a high risk of misrecognition. To prevent the adverse effects of misrecognition, possible countermeasures when the input is a syllable-emphasized utterance include not executing the search and instead prompting the user to speak again in a normal voice, or switching to a search process aimed at syllable-emphasized utterance; either way, it is necessary to determine whether the input is a syllable-emphasized utterance before the search process.

An object of the present invention is to solve the above problems of the prior art and to provide a speech recognition apparatus capable of determining, before the search process, whether the input is a syllable-emphasized utterance, together with its utterance determination method, utterance determination program, and storage medium.

To achieve the above object, the speech recognition apparatus of the present invention is characterized by the following means.
(1) It comprises acoustic analysis means for extracting acoustic features from the input speech, a statistical model for performing speech recognition based on the extracted acoustic features, syllable-emphasized-utterance determination means for determining, based on the periodicity of the extracted acoustic features, whether the input speech is a syllable-emphasized utterance, and a search processing unit that executes a search process by applying the statistical model to the acoustic features, and it executes a specific speech recognition operation according to the result of determining whether the input speech is a syllable-emphasized utterance.
(2) It includes means for requesting the speaker to utter again when the input speech is determined to be a syllable-emphasized utterance.
(3) It comprises a first statistical model corresponding to acoustic features specific to normal utterance and a second statistical model corresponding to acoustic features specific to syllable-emphasized utterance, and the search processing unit executes the search using the second statistical model when the input speech is determined to be a syllable-emphasized utterance.

According to the present invention, the following effects are achieved.
(1) Whether the user's utterance is a normal utterance or a syllable-emphasized utterance can be determined before the search process starts, so the apparatus can move in a short time to the processing appropriate to the user's utterance.
(2) Whether the user's utterance is normal or syllable-emphasized is determined by focusing on the periodicity of the acoustic features, so an accurate determination is possible with a small processing load.
(3) When the user's utterance is determined to be syllable-emphasized, the user is prompted to re-utter in a normal voice, so accurate speech recognition based on normal utterance becomes possible.
(4) A statistical model for normal utterance and a statistical model for syllable-emphasized utterance are provided and used selectively according to whether the user's utterance is normal or syllable-emphasized, so good speech recognition is possible whichever way the user speaks.

Preferred embodiments of the present invention will now be described in detail with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the main part of a first embodiment of the speech recognition apparatus according to the present invention.

The input control unit 11 includes a re-utterance request unit 14 that, when the syllable-emphasized-utterance determination unit 13 described in detail later determines the input to be a syllable-emphasized utterance, outputs a message requesting the user to speak again in a normal voice, for example "Please say it again, speaking as you would in a normal conversation." The acoustic analysis unit 12 extracts acoustic features from the input speech.

Using the acoustic features extracted by the acoustic analysis unit 12, the syllable-emphasized-utterance determination unit 13 detects the periodicity with which syllables uttered at roughly regular intervals appear, and determines whether the input speech is a syllable-emphasized utterance. The search processing unit 15, based on the extracted acoustic features, performs a search according to a statistical acoustic model 16 and a language model 17 prepared in advance, and outputs a speech recognition result.

FIG. 2 is a block diagram showing the configuration of the main part of a second embodiment of the speech recognition apparatus according to the present invention, in which the same reference numerals as above denote the same or equivalent parts.

The first acoustic model 16 and the first language model 17 hold statistical models for performing speech recognition based on the acoustic features of normal utterance. The second acoustic model 18 and the second language model 19 hold statistical models for performing speech recognition based on the acoustic features of syllable-emphasized utterance. The statistical model selection unit 20 selects the first statistical models 16 and 17 if the input speech is a normal utterance, and the second statistical models 18 and 19 if it is a syllable-emphasized utterance. The search processing unit 15 executes the search using the selected statistical models.

FIG. 3 schematically shows the configuration of the syllable-emphasized-utterance determination unit 13. Its main components are an utterance-interval detection unit 131 that detects the utterance intervals of the input speech based on acoustic features extracted by the acoustic analysis unit 12, such as the power (E) of the input speech and its n-dimensional MFCC (including the zeroth-order MFCC term C0), and a periodicity determination unit 132 that determines the periodicity with which the detected utterance intervals appear. The result of determining whether the input speech is a syllable-emphasized utterance is output to the input control unit 11 in the first embodiment of FIG. 1, and to the statistical model selection unit 20 in the second embodiment of FIG. 2.

Next, the operation of the syllable-emphasized-utterance determination unit 13 will be described in detail. Among the acoustic features commonly used in speech recognition are cepstral-domain features (MFCC: Mel Frequency Cepstrum Coefficients) and power. MFCCs are parameters representing the spectral envelope, extracted by applying a mel-scale filter bank to the power spectrum obtained by FFT analysis of the speech data for each frame, and then performing a discrete cosine transform (DCT) on the frequency-warped power spectrum; the details are explained in, for example, "Speech Recognition System" (edited by Kiyohiro Shikano, Katsunobu Ito, Tatsuya Kawahara, Kazuya Takeda, and Mikio Yamamoto, Ohmsha; ISBN 4-274-13228-5).

In speech recognition, a commonly used acoustic feature vector consists of the 12-dimensional MFCCs (MFCC1, MFCC2, ..., MFCC12) obtained by applying a discrete cosine transform to the spectral features of the input speech and performing three operations in the cepstral domain (DC-component removal, liftering, and cepstral mean subtraction), their first-order time derivatives (ΔMFCC1, ΔMFCC2, ..., ΔMFCC12), and the first-order time derivative of the power E (ΔE), for 25 dimensions in total; a 38-dimensional feature vector that further adds the second-order time derivatives (ΔΔMFCC1, ΔΔMFCC2, ..., ΔΔMFCC12, ΔΔE) is also often used.
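A sketch of assembling such a feature vector with the librosa library (librosa's MFCC pipeline differs in detail from the liftering and cepstral-mean-subtraction steps described above, and ΔC0 is used here as a stand-in for ΔE, so this only approximates the 25/38-dimensional layout):

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)  # any mono signal will do

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # row 0 is C0, rows 1..12 are MFCC1..12
d1 = librosa.feature.delta(mfcc)                    # first-order time derivatives
d2 = librosa.feature.delta(mfcc, order=2)           # second-order time derivatives

# 12 MFCCs + 12 ΔMFCCs + ΔC0 (standing in for ΔE) = 25 dimensions,
# plus the 13 ΔΔ terms = 38 dimensions in total
feat25 = np.vstack([mfcc[1:], d1[1:], d1[:1]])
feat38 = np.vstack([feat25, d2])
print(feat25.shape[0], feat38.shape[0])  # 25 38
```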

ΔE represents the temporal change in power. At the boundary where a silent interval switches to an utterance interval the power increases sharply, so ΔE has a large positive peak; conversely, at the boundary from an utterance interval to a silent interval, ΔE has a negative peak of large absolute value. The utterance intervals and silent intervals can therefore be distinguished by the positive and negative peaks (or the maximum amplitude) of ΔE.

Furthermore, multiplying ΔE by the product (ΔMFCC_1) of the absolute values of the n-dimensional ΔMFCCs (|ΔMFCC1|, |ΔMFCC2|, ..., |ΔMFCCn|) further emphasizes the peaks at the boundaries between utterance and silent intervals, while suppressing the appearance of peaks caused by power changes elsewhere.
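Continuing from the previous sketch (reusing `d1`), a sketch of this boundary-emphasis signal; the choice n = 12 and the use of ΔC0 in place of ΔE are assumptions:

```python
# Multiply ΔE (approximated here by ΔC0 = d1[0]) by the product of the
# absolute ΔMFCC values: boundary peaks are emphasized, while incidental
# power fluctuations elsewhere are suppressed.
n = 12
delta_e = d1[0]
dmfcc_prod = np.prod(np.abs(d1[1:n + 1]), axis=0)
boundary_signal = delta_e * dmfcc_prod  # ΔE · |ΔMFCC_1| in the patent's notation
```

Note that with n = 12 the raw product of twelve small magnitudes can underflow toward zero in float arithmetic; summing log magnitudes is a common numerical workaround, although the patent describes the plain product.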

FIGS. 4 and 5 show how ΔE·|ΔMFCC_1| changes for the normal utterance and the syllable-emphasized utterance of the same content "Kanagawa" by the same speaker, whose waveforms are shown in FIGS. 13 and 14. For both utterances, a positive peak appears at the start of each utterance interval and a negative peak at its end.

Comparing FIGS. 4 and 5, in the normal utterance of FIG. 4 the peaks corresponding to the individual syllables run together, whereas in the syllable-emphasized utterance of FIG. 5 peaks appear at nearly constant intervals. If the peaks appeared perfectly periodically, that is, if the peak interval were perfectly constant, then taking the autocorrelation of ΔE·|ΔMFCC_1| for the syllable-emphasized utterance would produce peaks at lags τ equal to integer multiples of this period.

In practice, however, the peak intervals fluctuate, and the peaks of ΔE·|ΔMFCC_1| are very sharp, so there is a strong possibility that no clear peaks appear in the autocorrelation. In such cases, as shown in FIG. 6, peaks can be made to appear in the autocorrelation result by peak-picking ΔE·|ΔMFCC_1| (turning it into a rectangular wave): for every fixed number of frames, the whole interval is represented by the maximum amplitude within it, which absorbs the fluctuation of the peak intervals. The fixed intervals may partially overlap, as in the frame processing for speech recognition described with reference to FIG. 10, or they may be contiguous without overlapping.
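A sketch of this max-amplitude rectangularization, using non-overlapping blocks (the block length is an illustrative assumption; the patent allows either overlapping or contiguous intervals):

```python
def rectangularize(sig, win):
    """Represent each length-`win` block of `sig` by the sample with the
    largest magnitude in that block, absorbing jitter in the peak positions."""
    out = np.empty_like(sig)
    for start in range(0, len(sig), win):
        block = sig[start:start + win]
        out[start:start + win] = block[np.argmax(np.abs(block))]
    return out

smoothed = rectangularize(boundary_signal, win=10)  # e.g. blocks of 10 frames
```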

FIG. 7 shows the autocorrelation of ΔE·|ΔMFCC_1| after the fluctuation has been absorbed; the lags τp1, τp2, and τp3 at which large peaks appear are spaced at nearly constant intervals. The closer the peak intervals of ΔE·|ΔMFCC_1| are to constant, the larger the values of the individual peaks in the autocorrelation. One can therefore set a criterion such as, for example, judging the input speech to be a syllable-emphasized utterance when the first peak at lag τp1, that is, the first-order autocorrelation, exceeds a predetermined threshold.

Accordingly, in this embodiment the utterance-interval detection unit 131 detects the utterance intervals based on the time-series information described with reference to FIG. 6, and the periodicity determination unit 132 determines the input speech to be a syllable-emphasized utterance when the first-order autocorrelation of this time series exceeds a predetermined threshold.
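Putting the pieces together, a sketch of the periodicity decision; the normalization, the way the first peak is located, and the threshold value are all assumptions, since the patent only specifies that the first-order autocorrelation is compared against a predetermined threshold:

```python
def is_syllable_emphasized(sig, threshold=0.4, min_lag=5):
    """Return True when the first autocorrelation peak (past `min_lag`)
    of the rectangularized boundary signal exceeds `threshold`."""
    sig = sig - sig.mean()
    ac = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    ac = ac / ac[0]  # normalize so that the zero-lag value is 1
    # the first local maximum past min_lag corresponds to the τp1 peak of FIG. 7
    for lag in range(min_lag + 1, len(ac) - 1):
        if ac[lag] >= ac[lag - 1] and ac[lag] >= ac[lag + 1]:
            return ac[lag] > threshold
    return False

print(is_syllable_emphasized(smoothed))
```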

If, instead of the power E, the zeroth-order MFCC term is used, that is, C0, which corresponds to the DC component of the spectrum in each frame, the autocorrelation of ΔC0·|ΔMFCC_1| for the same utterance shows, as in FIG. 8, the same behavior as the autocorrelation of ΔE·|ΔMFCC_1| in FIG. 7, albeit on a different scale. ΔC0 may therefore be used in place of ΔE to determine in the same way whether the input is a syllable-emphasized utterance.

Furthermore, the embodiment above was described as detecting the utterance intervals based on time-series information obtained by multiplying the time rate of change (ΔE) of the power E of the input speech, or the time rate of change (ΔC0) of the zeroth-order MFCC term (C0) of the input speech, by the product (ΔMFCC_1) of the absolute values of the time rates of change of the n-dimensional MFCCs of the input speech, and then representing each predetermined fixed interval by the maximum amplitude within that interval; however, the present invention is not limited to this, and the following modifications are possible.

As a first modification, the utterance intervals may be detected based only on the time rate of change (ΔE) of the power (E) of the input speech.

As a second modification, the utterance intervals may be detected based only on the time rate of change (ΔC0) of the zeroth-order MFCC term (C0) of the input speech.

As a third modification, the utterance intervals may be detected based on time-series information obtained by multiplying the time rate of change (ΔE) of the power E of the input speech, or the time rate of change (ΔC0) of the zeroth-order MFCC term (C0) of the input speech, by the product (ΔMFCC_1) of the absolute values of the time rates of change of the n-dimensional MFCCs of the input speech.

As a fourth modification, the utterance intervals may be detected based on time-series information obtained by multiplying the time rate of change (ΔE) of the power E of the input speech, or the time rate of change (ΔC0) of the zeroth-order MFCC term (C0) of the input speech, by the product (ΔMFCC_1) of the absolute values of the time rates of change of the n-dimensional MFCCs of the input speech, and then smoothing the result.

FIG. 1 is a block diagram of a first embodiment of the speech recognition apparatus according to the present invention.
FIG. 2 is a block diagram of a second embodiment of the speech recognition apparatus according to the present invention.
FIG. 3 schematically shows the configuration of the syllable-emphasized-utterance determination unit.
FIG. 4 is a waveform chart showing the change of ΔE·|ΔMFCC_1| for the normal utterance "Kanagawa".
FIG. 5 is a waveform chart showing the change of ΔE·|ΔMFCC_1| for the syllable-emphasized utterance "Ka-na-ga-wa".
FIG. 6 is a waveform chart obtained by applying the fluctuation-absorbing technique to the waveform of FIG. 5.
FIG. 7 shows the autocorrelation result of ΔE·|ΔMFCC_1| after the fluctuation has been absorbed.
FIG. 8 shows the autocorrelation result of ΔC0·|ΔMFCC_1| after the fluctuation has been absorbed.
FIG. 9 shows the configuration of the main part of a conventional speech recognition apparatus.
FIG. 10 illustrates how acoustic features are extracted in the acoustic analysis unit.
FIG. 11 schematically shows the structure of a fixed grammar model.
FIG. 12 shows the HMM state sequence of the word "Imai".
FIG. 13 is a waveform chart of the normal utterance "Kanagawa".
FIG. 14 is a waveform chart of the syllable-emphasized utterance "Ka-na-ga-wa".
FIG. 15 shows an HMM state sequence with transitions to "sil" added.

Explanation of symbols

11: input control unit, 12: acoustic analysis unit, 13: syllable-emphasized-utterance determination unit, 14: re-utterance request unit, 15: search processing unit, 16: first acoustic model, 17: first language model, 18: second acoustic model, 19: second language model, 20: statistical model selection unit

Claims (22)

1. A speech recognition apparatus comprising:
acoustic analysis means for extracting acoustic features from input speech;
a statistical model for performing speech recognition based on the extracted acoustic features;
syllable-emphasized-utterance determination means for determining, based on the periodicity of the extracted acoustic features, whether the input speech is a syllable-emphasized utterance; and
a search processing unit that executes a search process by applying the statistical model to the acoustic features,
wherein a specific speech recognition operation is executed according to the result of determining whether the input speech is a syllable-emphasized utterance,
the syllable-emphasized-utterance determination means comprises:
utterance-interval detection means for detecting the utterance interval of each syllable based on the boundary at which a silent interval switches to an utterance interval and the boundary at which an utterance interval switches to a silent interval; and
periodicity determination means for determining the periodicity with which the utterance interval of each syllable appears,
and input speech having said periodicity is determined to be a syllable-emphasized utterance.

2. The speech recognition apparatus according to claim 1, wherein the syllable-emphasized-utterance determination means determines the input speech to be a syllable-emphasized utterance when the appearance periodicity is higher than a predetermined reference value.

3. The speech recognition apparatus according to claim 1 or 2, wherein the periodicity determination means determines the appearance periodicity based on the autocorrelation of the appearance period of the utterance interval of each syllable.

4. The speech recognition apparatus according to claim 1 or 2, wherein the periodicity determination means determines the appearance periodicity based on the first-order autocorrelation of the appearance period of the utterance interval of each syllable.

5. The speech recognition apparatus according to any one of claims 1 to 4, further comprising means for requesting the speaker to utter again when the input speech is determined to be a syllable-emphasized utterance.

6. The speech recognition apparatus according to any one of claims 1 to 4, comprising a first statistical model corresponding to acoustic features specific to normal utterance and a second statistical model corresponding to acoustic features specific to syllable-emphasized utterance, wherein the search processing unit executes the search process using the second statistical model when the input speech is determined to be a syllable-emphasized utterance.

7. The speech recognition apparatus according to any one of claims 2 to 6, wherein the acoustic features include the power (E) of the input speech, and the utterance-interval detection means detects the utterance interval of each syllable based on the time rate of change (ΔE) of the power of the input speech.

8. The speech recognition apparatus according to any one of claims 2 to 6, wherein the acoustic features include the MFCC of the input speech, and the utterance-interval detection means detects the utterance interval of each syllable based on the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech.

9. The speech recognition apparatus according to any one of claims 2 to 6, wherein the acoustic features include the power (E) of the input speech and its n-dimensional MFCC, and the utterance-interval detection means detects the utterance interval of each syllable based on time-series information obtained by multiplying either the time rate of change (ΔE) of the power of the input speech or the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech by the product of the absolute values of the time rates of change ΔMFCC of the n dimensions of the MFCC of the input speech.

10. The speech recognition apparatus according to any one of claims 2 to 6, wherein the acoustic features include the power (E) of the input speech and its n-dimensional MFCC, and the utterance-interval detection means detects the utterance interval of each syllable based on time-series information obtained by multiplying either the time rate of change (ΔE) of the power of the input speech or the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech by the product of the absolute values of the time rates of change ΔMFCC of the n dimensions of the MFCC of the input speech, and then smoothing the result.

11. The speech recognition apparatus according to any one of claims 2 to 6, wherein the acoustic features include the power (E) of the input speech and its n-dimensional MFCC, and the utterance-interval detection means detects the utterance interval of each syllable based on time-series information obtained by multiplying either the time rate of change (ΔE) of the power of the input speech or the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech by the product of the absolute values of the time rates of change ΔMFCC of the n dimensions of the MFCC of the input speech, and then representing each predetermined fixed interval by the maximum amplitude within that interval.

12. An utterance determination method for determining whether input speech is a syllable-emphasized utterance, comprising:
a step of extracting acoustic features from the input speech;
a step of detecting, from the extracted acoustic features, the utterance interval of each syllable of the input speech based on the boundary at which a silent interval switches to an utterance interval and the boundary at which an utterance interval switches to a silent interval;
a step of determining the appearance periodicity of the utterance interval of each syllable; and
a step of determining, based on the appearance periodicity of the utterance interval of each syllable, whether the input speech is a syllable-emphasized utterance.

13. The utterance determination method according to claim 12, wherein the step of determining whether the input speech is a syllable-emphasized utterance determines the input speech to be a syllable-emphasized utterance when the appearance periodicity is higher than a predetermined reference value.

14. The utterance determination method according to claim 12 or 13, wherein the step of determining the appearance periodicity determines it based on the autocorrelation of the appearance period of the utterance interval of each syllable.

15. The utterance determination method according to claim 12 or 13, wherein the step of determining the appearance periodicity determines it based on the first-order autocorrelation of the appearance period of the utterance interval of each syllable.

16. The utterance determination method according to any one of claims 12 to 15, wherein the acoustic features include the power (E) of the input speech, and the step of detecting the utterance interval of each syllable detects the utterance interval based on the time rate of change (ΔE) of the power of the input speech.

17. The utterance determination method according to any one of claims 12 to 15, wherein the acoustic features include the MFCC of the input speech, and the step of detecting the utterance interval of each syllable detects the utterance interval based on the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech.

18. The utterance determination method according to any one of claims 12 to 15, wherein the acoustic features include the power (E) of the input speech and its n-dimensional MFCC, and the step of detecting the utterance interval of each syllable detects the utterance interval based on time-series information obtained by multiplying either the time rate of change (ΔE) of the power of the input speech or the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech by the product of the absolute values of the time rates of change ΔMFCC of the n dimensions of the MFCC of the input speech.

19. The utterance determination method according to any one of claims 12 to 15, wherein the acoustic features include the power (E) of the input speech and its n-dimensional MFCC, and the step of detecting the utterance interval of each syllable detects the utterance interval based on time-series information obtained by multiplying either the time rate of change (ΔE) of the power of the input speech or the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech by the product of the absolute values of the time rates of change ΔMFCC of the n dimensions of the MFCC of the input speech, and then smoothing the result.

20. The utterance determination method according to any one of claims 12 to 15, wherein the acoustic features include the power (E) of the input speech and its n-dimensional MFCC, and the step of detecting the utterance interval of each syllable detects the utterance interval based on time-series information obtained by multiplying either the time rate of change (ΔE) of the power of the input speech or the time rate of change (ΔC0) of the zeroth-order MFCC term of the input speech by the product of the absolute values of the time rates of change ΔMFCC of the n dimensions of the MFCC of the input speech, and then representing each predetermined fixed interval by the maximum amplitude within that interval.

21. An utterance determination program causing a computer to execute the utterance determination method according to any one of claims 12 to 20.

22. A storage medium storing the utterance determination program according to claim 21 so as to be readable by a computer.
JP2007010853A 2007-01-19 2007-01-19 Speech recognition apparatus, utterance determination method thereof, utterance determination program, and storage medium thereof Expired - Fee Related JP4986028B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2007010853A JP4986028B2 (en) 2007-01-19 2007-01-19 Speech recognition apparatus, utterance determination method thereof, utterance determination program, and storage medium thereof


Publications (2)

Publication Number Publication Date
JP2008176155A JP2008176155A (en) 2008-07-31
JP4986028B2 true JP4986028B2 (en) 2012-07-25

Family

ID=39703216

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2007010853A Expired - Fee Related JP4986028B2 (en) 2007-01-19 2007-01-19 Speech recognition apparatus, utterance determination method thereof, utterance determination program, and storage medium thereof

Country Status (1)

Country Link
JP (1) JP4986028B2 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
MX2011008605A (en) * 2009-02-27 2011-09-09 Panasonic Corp Tone determination device and tone determination method.
JP2015215503A (en) * 2014-05-12 2015-12-03 日本電信電話株式会社 Voice recognition method, voice recognition device and voice recognition program
CN110070883B (en) * 2016-01-14 2023-07-28 深圳市韶音科技有限公司 Speech enhancement method
US11996115B2 (en) 2019-03-08 2024-05-28 Nec Corporation Sound processing method
CN111768800B (en) * 2020-06-23 2024-06-25 中兴通讯股份有限公司 Voice signal processing method, equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS62164097A (en) * 1986-01-14 1987-07-20 株式会社リコー Voice discrimination system
JPS62166400A (en) * 1986-01-20 1987-07-22 株式会社リコー Voice wordprocessor
JPH0383100A (en) * 1989-08-25 1991-04-09 Ricoh Co Ltd Detector for voice section
JP3720595B2 (en) * 1998-09-17 2005-11-30 キヤノン株式会社 Speech recognition apparatus and method, and computer-readable memory
JP3588030B2 (en) * 2000-03-16 2004-11-10 三菱電機株式会社 Voice section determination device and voice section determination method
JP2006010739A (en) * 2004-06-22 2006-01-12 Toyota Central Res & Dev Lab Inc Speech recognition device
US8175868B2 (en) * 2005-10-20 2012-05-08 Nec Corporation Voice judging system, voice judging method and program for voice judgment

Also Published As

Publication number Publication date
JP2008176155A (en) 2008-07-31


Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20090707

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20110210

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20110406

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20110513

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20120125

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20120326

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20120418

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20120418

R150 Certificate of patent or registration of utility model

Ref document number: 4986028

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150


FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20150511

Year of fee payment: 3

LAPS Cancellation because of no payment of annual fees