JP2007079072A - Method and device for speech recognition - Google Patents

Method and device for speech recognition

Info

Publication number
JP2007079072A
JP2007079072A (application JP2005266130A)
Authority
JP
Japan
Prior art keywords
speech
time
speech recognition
procedure
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2005266130A
Other languages
Japanese (ja)
Other versions
JP4576612B2 (en)
Inventor
Akira Saso (佐宗 晃)
Masataka Goto (後藤 真孝)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Institute of Advanced Industrial Science and Technology AIST
Original Assignee
National Institute of Advanced Industrial Science and Technology AIST
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Institute of Advanced Industrial Science and Technology AIST filed Critical National Institute of Advanced Industrial Science and Technology AIST
Priority to JP2005266130A priority Critical patent/JP4576612B2/en
Publication of JP2007079072A publication Critical patent/JP2007079072A/en
Application granted granted Critical
Publication of JP4576612B2 publication Critical patent/JP4576612B2/en
Legal status: Expired - Fee Related


Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition method and apparatus that exclude the feature quantities of speech sections that become prolonged sounds and perform speech recognition based on an ARHMM.
SOLUTION: The speech recognition method comprises procedures for judging a section in which the phonetic variation of the input speech is small to be a prolonged-sound section, deleting the speech feature quantities of that section, and recognizing the remaining feature quantities. The speech recognition apparatus comprises means for implementing these procedures.
COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a speech recognition method and a speech recognition apparatus for speech that includes high-fundamental-frequency speech and prolonged (lengthened) sounds.

Recent speech recognition technology has made it possible to recognize large-vocabulary continuous speech with high accuracy, but its range of application is quite limited.
For example, recognition accuracy degrades markedly for speech in noisy environments with background noise or reverberation, for diverse utterance styles such as dialogue speech, emotional speech, and singing, and for diverse speakers such as children, the elderly, and people with disabilities.

When recognizing speech that includes high-fundamental-frequency and prolonged sounds, such as singing, children's voices, and the voices of voice actors in animation (modeling of acoustic signals with an ARHMM and parameter estimation methods are described, for example, in Patent Documents 1-3 and Non-Patent Documents 1-3 below), conventional speech recognition methods have difficulty for the following reasons.
First, because the harmonic structure of high-fundamental-frequency speech is sparse in the frequency domain, features widely used for speech, such as the LPC (linear predictive coding) cepstrum (the inverse Fourier transform of the log spectrum) and the MFCC (Mel-Frequency Cepstrum Coefficient, a feature representing the spectral envelope extracted from speech with a human perceptual scale taken into account), cannot accurately extract the formant features that convey phonemic identity. Second, in a conventional HMM-based recognition system that performs recognition with an acoustic model, a hidden Markov model (HMM) trained on speech read aloud from newspaper articles and the like, the durations of prolonged sounds mismatch the state transition probabilities of the HMM, so recognition accuracy degrades.

JP 2003-5785 A
JP 2003-99085 A
JP 2004-287010 A
Akira Sasou and Kazuyo Tanaka, "HMM-based excitation modeling and vocal-tract feature extraction robust to high fundamental frequencies," IEICE Transactions (D-II), Vol. J84-D-II, No. 9, pp. 1960-1969, Sep. 2001.
Akira Sasou, Masataka Goto, Satoru Hayamizu, and Kazuyo Tanaka, "Comparison of Auto-Regressive, Non-Stationary Excited Signal Parameter Estimation Methods," Proc. IEEE Workshop on Machine Learning for Signal Processing (MLSP2004), pp. 295-304, Sep. 2004.
Akira Sasou, Masataka Goto, Satoru Hayamizu, and Kazuyo Tanaka, "An Auto-Regressive, Non-Stationary Excited Signal Parameter Estimation Method and an Evaluation of a Singing-Voice Recognition," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2005), Vol. 1, pp. 237-240, Mar. 2005.

In view of the above problems, an object of the present invention is to provide a speech recognition method and a speech recognition apparatus that obtain speech features with an ARHMM-based speech analysis method and then perform recognition after removing the features of prolonged-sound sections.

FIG. 1 is a block diagram of a speech recognition apparatus configured to execute the speech recognition method of the present invention.
To solve the above problems, the present invention uses the speech recognition method shown in FIG. 1, which sequentially combines a feature extraction procedure based on an ARHMM (Auto-Regressive Hidden Markov Model: a model in which the output of an HMM is passed through an AR (autoregressive) filter; equivalently, the time series obtained by inverse-filtering the observed time series with the AR filter is represented by the HMM. Because the HMM output passes through the AR filter, the observed time series is represented by continuously varying statistics; moreover, when the observed time series varies with a certain correlation structure, removing that correlation with the AR filter can reduce it to a simpler time series), a prolonged-sound correction procedure, and a speech recognition procedure, together with a speech recognition apparatus comprising means for executing each of these procedures.
To extract features from high-fundamental-frequency speech such as singing, the present invention adopts an analysis method based on the ARHMM.
For the problematic prolonged sounds, a section in which the phonemic variation of the input speech is small is judged to be a prolonged-sound section, the speech features of that section are deleted, and the remaining features are recognized, which mitigates the degradation of recognition accuracy caused by prolonged sounds. The proposed method computes the Δ coefficients, obtained as regression coefficients along the time axis of the feature time series, by Equation (9) below.
Δc(n,i) denotes the Δ coefficient of the i-th element of the speech feature vector at frame time n. Prolonged sounds are detected by exploiting the fact that these Δ coefficients approach zero in prolonged-sound sections, where phonemic variation is small. The specific procedure is as follows. First, the time series s(n) of the squared sum of the Δ coefficients is obtained from Equation (10) below.

Next, a time series l(n) is obtained from Equation (11) below by applying a smoothing operation, for example a moving average, to s(n).
A threshold l_thr is set for the time series l(n) obtained in this way. When N_r consecutive values starting from some time n_s fall below the threshold l_thr, as in Equation (12) below, the section is judged to be a prolonged sound, and from time (n_s + N_r) onward the feature vector of each frame is deleted for as long as l(n) remains below the threshold.

Specifically, the following means are adopted.
(1) The speech recognition method judges a section in which the phonemic variation of the input speech is small to be a prolonged-sound section, deletes the speech features of that section, and recognizes the remaining features.
(2) The speech recognition method of (1) detects prolonged-sound sections of the input speech as follows: Δ coefficients are computed from the speech features obtained from the frame at each time; a smoothing operation is applied to the time series formed by arranging the squared sums of each frame's Δ coefficients in frame-time order; a threshold is set; and when the number of consecutive frames below the threshold exceeds a fixed count, the subsequent frames are judged to be a prolonged-sound section for as long as they remain below the threshold. The features of those frames are deleted and the remaining features are recognized. (A code sketch of this procedure follows below.)
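Purely as an illustration of procedure (2), not the patent's reference implementation, the following NumPy sketch shows the detection and deletion logic. The window half-widths K and M, the threshold l_thr, and the run length N_r are assumed free parameters; no values for them are specified above.

```python
import numpy as np

def delta_coefficients(c, K=2):
    """Time-axis regression (delta) coefficients of a feature sequence.

    c: (num_frames, dim) array of per-frame feature vectors.
    Uses the standard regression formula; K is an assumed half-width.
    """
    num = len(c)
    denom = float(sum(k * k for k in range(-K, K + 1)))
    padded = np.pad(c, ((K, K), (0, 0)), mode="edge")
    delta = np.zeros_like(c, dtype=float)
    for k in range(-K, K + 1):
        delta += k * padded[K + k : K + k + num]
    return delta / denom

def prolonged_sound_mask(c, l_thr=0.5, N_r=10, M=2, K=2):
    """Boolean mask over frames: True where features should be deleted.

    Implements the description above: s(n) is the squared sum of the
    delta coefficients, l(n) a moving-average smoothing of s(n); once
    N_r consecutive values fall below l_thr, frames from (n_s + N_r)
    onward are flagged for as long as l(n) stays below the threshold.
    """
    dc = delta_coefficients(np.asarray(c, dtype=float), K)
    s = np.sum(dc ** 2, axis=1)                      # squared sum per frame
    kernel = np.ones(2 * M + 1) / (2 * M + 1)
    l = np.convolve(s, kernel, mode="same")          # smoothed time series
    mask = np.zeros(len(l), dtype=bool)
    run = 0
    for n in range(len(l)):
        run = run + 1 if l[n] < l_thr else 0
        if run > N_r:        # the first N_r sub-threshold frames are kept;
            mask[n] = True   # deletion starts at n_s + N_r
    return mask

# Usage: keep only the frames outside detected prolonged-sound sections.
# feats: (num_frames, dim) ARHMM-based MFCC sequence
# kept = feats[~prolonged_sound_mask(feats)]
```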

(3) The speech recognition method of (1) or (2) applies the above procedures to speech features obtained by speech analysis based on an Auto-Regressive Hidden Markov Model.
(4) The speech recognition method of (3) executes the following procedures:
(4-1) Procedure 1: capture the speech signal frame by frame over time, analyze the time series of the speech signal based on the ARHMM, and obtain the log AR (autoregressive) spectrum amplitude u(n) at frame time n by Equation (13) below, where N is the number of FFT samples.
(4-2) Procedure 2: obtain the mel filter bank outputs using triangular windows arranged on the mel-frequency axis.

(4-3) Procedure 3: apply the discrete cosine transform to the log value u(n) obtained in Procedure 1 and the mel filter bank outputs obtained in Procedure 2 to produce ARHMM-based MFCCs (features representing the spectral envelope extracted from speech with a human perceptual scale taken into account).
(4-4) Procedure 4: obtain the time series s(n) of the squared sum of the Δ coefficients at frame time n from Equation (14) below.
(4-5) Procedure 5: obtain the time series l(n) from Equation (15) below by applying a smoothing operation, for example a moving average, to s(n).

(4-6) Procedure 6: set a threshold l_thr for the time series l(n) obtained as above; when N_r consecutive values starting from some time n_s fall below the threshold l_thr as in Equation (16) below, judge the section to be a prolonged sound and, from time (n_s + N_r) onward, obtain a signal in which the feature vector at each time is deleted for as long as l(n) remains below the threshold.
(4-7) Procedure 7: perform speech recognition based on the features obtained by executing Procedure 6. (An illustrative code sketch of Procedures 1-3 follows after this list.)
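The ARHMM parameter estimation itself (Non-Patent Documents 1-3) is not reproduced here. Assuming per-frame AR coefficients a(1)..a(p) and an excitation variance σ² are already available from that analysis, and assuming the standard log AR amplitude spectrum u_k(n) = log(σ(n)/|Σ_{i=0}^{p} a_n(i) e^{-j2πki/N}|) with a(0) = 1 for the unreproduced Equation (13), Procedures 1-3 could be sketched as follows; the filter-bank sizes and coefficient counts are assumed values.

```python
import numpy as np
from scipy.fft import dct

def log_ar_spectrum(a, sigma2, n_fft=512):
    """Log AR (all-pole) spectrum amplitude for one frame (cf. Eq. (13)).

    a: AR coefficients a(1)..a(p) from the ARHMM analysis (assumed given;
    the ARHMM estimation itself is not reproduced here).
    sigma2: excitation variance for the frame. n_fft: N, the FFT size.
    """
    A = np.fft.rfft(np.concatenate(([1.0], np.asarray(a))), n=n_fft)
    return np.log(np.sqrt(sigma2) / np.abs(A))

def mel_filterbank(n_filters=24, n_fft=512, sr=16000):
    """Triangular windows spaced uniformly on the mel-frequency axis.

    All sizes are assumed values; the text above does not specify them.
    """
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft // 2 + 1) * mel_to_hz(mels) / (sr / 2.0)).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        lo, ctr, hi = bins[j], bins[j + 1], bins[j + 2]
        if ctr > lo:   # rising edge of the triangle
            fb[j, lo:ctr] = np.linspace(0.0, 1.0, ctr - lo, endpoint=False)
        if hi > ctr:   # falling edge of the triangle
            fb[j, ctr:hi] = np.linspace(1.0, 0.0, hi - ctr, endpoint=False)
    return fb

def arhmm_mfcc(a, sigma2, n_coefs=12, n_fft=512, sr=16000):
    """ARHMM-based MFCC for one frame, mirroring Procedures 1-3 above:
    log AR spectrum -> mel filter bank -> log -> DCT."""
    u = log_ar_spectrum(a, sigma2, n_fft)                 # Procedure 1
    fb_out = mel_filterbank(24, n_fft, sr) @ np.exp(u)    # Procedure 2
    return dct(np.log(fb_out), norm="ortho")[:n_coefs]    # Procedure 3
```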

(5) The speech recognition apparatus judges a section in which the phonemic variation of the input speech is small to be a prolonged-sound section, deletes the speech features of that section, and recognizes the remaining features.
(6) The speech recognition apparatus of (5) comprises means for detecting prolonged-sound sections of the input speech as follows: Δ coefficients are computed from the speech features obtained from the frame at each time; a smoothing operation is applied to the time series formed by arranging the squared sums of each frame's Δ coefficients in frame-time order; a threshold is set; and when the number of consecutive frames below the threshold exceeds a fixed count, the subsequent frames are judged to be a prolonged-sound section for as long as they remain below the threshold, the features of those frames are deleted, and the remaining features are recognized.
(7) The speech recognition apparatus of (5) or (6) applies the means of (5) or (6) to speech features obtained by speech analysis based on an Auto-Regressive Hidden Markov Model.

(8) The speech recognition apparatus of (7) comprises:
(8-1) Means 1 for capturing the speech signal frame by frame over time, analyzing the time series of the speech signal based on the ARHMM, and obtaining the log AR (autoregressive) spectrum amplitude u(n) at frame time n by Equation (17) below, where N is the number of FFT samples;
(8-2) Means 2 for obtaining the mel filter bank outputs using triangular windows arranged on the mel-frequency axis;
(8-3) Means 3 for applying the discrete cosine transform to the log value u(n) obtained by Means 1 and the mel filter bank outputs obtained by Means 2 to produce ARHMM-based MFCCs;
(8-4) Means 4 for obtaining the time series s(n) of the squared sum of the Δ coefficients at frame time n from Equation (18) below;

(8-5) Means 5 for obtaining the time series l(n) from Equation (19) below by applying a smoothing operation, for example a moving average, to s(n);
(8-6) Means 6 for setting a threshold l_thr for the time series l(n) obtained as above and, when N_r consecutive values starting from some time n_s fall below the threshold l_thr as in Equation (20) below, judging the section to be a prolonged sound and, from time (n_s + N_r) onward, obtaining a signal in which the feature vector at each time is deleted for as long as l(n) remains below the threshold; and
(8-7) Means 7 for performing speech recognition based on the features output by Means 6.

A conventional speech recognition system uses an acoustic model trained, for example, on speech read aloud from newspaper articles, so for speech containing many prolonged sounds, such as singing, a mismatch arises, particularly in the state transition probabilities. In addition, conventional phonemic feature extraction techniques such as linear prediction and mel filter bank analysis tend to lose accuracy when analyzing high-fundamental-frequency speech, so a mismatch with the acoustic model also arises in the phonemic information. For these reasons, recognition accuracy degrades markedly when a conventional system is applied to singing and similar speech.
To solve this problem, the present invention eliminates the state-transition-probability mismatch by detecting and deleting prolonged-sound sections. Detecting prolonged-sound sections accurately requires correctly extracting phonemic features even from high-fundamental-frequency speech such as singing. As noted above, however, conventional feature extraction loses accuracy on such speech, so the detection accuracy of prolonged-sound sections also degrades and the state-transition mismatch may not be eliminated. By combining the previously developed ARHMM-based technique, which extracts phonemic features accurately even from high-fundamental-frequency speech, with the prolonged-sound correction processing, the present invention improves phonemic feature extraction accuracy and prolonged-sound detection accuracy at the same time, achieving high recognition accuracy.

Embodiments of the present invention will be described in detail with reference to the drawings.

As shown in FIG. 1, a speech recognition apparatus is configured to execute the speech recognition method of the present invention.
The speech recognition apparatus basically has at least an input/output (I/O) device that captures the speech signal and outputs the computation results, a storage device (memory), and a central processing unit, and executes the prescribed procedures under a prescribed program. The apparatus may be implemented, for example, on a personal computer, in which case it may also include a device for capturing the speech signal.
A singing-voice recognition experiment using the speech recognition method of the present invention is described below.
For the experiment, twelve songs with few English expressions were selected from the Japanese popular music in the RWC Music Database for research, and their vocal files (vocal-only data without instrumental accompaniment) were used. The sampling frequency is 16 kHz. Recognition used the large-vocabulary continuous speech recognition system Julius and an acoustic model trained on speech read aloud from Japanese newspaper articles. The features used to train the acoustic model are MFCCs (features representing the spectral envelope extracted from speech with a human perceptual scale taken into account). Thus the acoustic model of the recognition system used in this experiment is completely open with respect to singing voice (no singing data was used in training). The word dictionary and language model were generated from the lyrics of each song.
In this experiment, ARHMM-based MFCCs are obtained from the AR coefficients a(i) by the following procedure.

The procedure is described with reference to the flowchart.
First (START):
(1) The speech signal is captured frame by frame over time, the time series of the speech signal is analyzed based on the ARHMM, and the log AR (autoregressive) spectrum amplitude u(n) at frame time n is obtained by Equation (21) below (step S1). This processing corresponds to computing the log FFT (fast Fourier transform) amplitude in the ordinary MFCC procedure; N in the equation is the number of FFT samples. The remaining steps are the same as for ordinary MFCCs:
(2) the mel filter bank outputs are obtained using triangular windows arranged on the mel-frequency axis (step S2), and
(3) the discrete cosine transform is applied to the log value u(n) obtained in step 1 and the mel filter bank outputs obtained in step 2 to produce ARHMM-based MFCCs (step S3).
Next, regression analysis along the time axis of the ARHMM-based MFCCs obtained above yields the Δ coefficients, and prolonged-sound sections are detected from the Δ coefficients using Equations (5), (6), and (7) below.
For prolonged sounds, a section in which the phonemic variation of the input speech is small is judged to be a prolonged-sound section, the speech features of that section are deleted, and the remaining features are recognized, which mitigates the degradation of recognition accuracy caused by prolonged sounds. The proposed method detects prolonged sounds by exploiting the fact that the Δ coefficients, obtained as time-axis regression coefficients of the feature time series, approach zero in prolonged-sound sections, where phonemic variation is small.

Specifically:
(4) The time series s(n) of the squared sum of the Δ coefficients at frame time n is obtained (step S4).
(5) Next, the time series l(n) is obtained from Equation (23) below by applying a smoothing operation, for example a moving average, to s(n) (step S5).
(6) A threshold l_thr is set for the time series l(n) obtained as above; when N_r consecutive values starting from some time n_s fall below the threshold l_thr as in Equation (24) below, the section is judged to be a prolonged sound and, from time (n_s + N_r) onward, a signal is obtained in which the feature vector at each time is deleted for as long as l(n) remains below the threshold (step S6).
(7) Speech recognition is performed based on the features obtained in step 6 (step S7).
End (END).
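Tying the two illustrative sketches above together, the S1-S7 flow could look like the following. Here recognize() is a hypothetical stand-in for the HMM-based recognizer (for example, Julius driven with the retained feature frames), and frames_ar is the assumed output of the ARHMM analysis; neither interface is specified by the text above.

```python
import numpy as np

# Hypothetical end-to-end flow over steps S1-S7, reusing the illustrative
# helpers arhmm_mfcc() and prolonged_sound_mask() sketched earlier.
# frames_ar: assumed per-frame (a, sigma2) pairs from an ARHMM analysis.
# recognize: stand-in for the HMM-based recognizer operating on features.
def recognize_with_prolonged_sound_correction(frames_ar, recognize):
    feats = np.array([arhmm_mfcc(a, s2) for a, s2 in frames_ar])  # S1-S3
    mask = prolonged_sound_mask(feats)                            # S4-S6
    return recognize(feats[~mask])                                # S7
```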

Steps S1-S3 above correspond to the ARHMM-based feature extraction procedure SA, or feature extraction means MA, of FIG. 1; steps S4-S6 correspond to the prolonged-sound correction procedure SB, or prolonged-sound correction means MB, of FIG. 1; and step S7 corresponds to the speech recognition procedure SC, or speech recognition means MC, of FIG. 1.
FIG. 2 shows an example of prolonged-sound section detection. FIG. 2 is a diagram of the output signals at each stage of the speech recognition apparatus of the present invention. From top to bottom it shows the waveform of the singing voice (a), the extracted phonemic features (b), the result of applying Equation (11) to the delta features of the ARHMM-based MFCCs (c), and, in the bottom panel, the section information (d) for the frames whose features are deleted according to Equation (12). As the figure shows, the present invention correctly deletes the features in sections where the phoneme is constant. The vertical axis of FIG. 2(c) ranges from 0 to 40, with 5 per division.
Tables 1 and 2 show the results of recognizing the singing voices with ARHMM-based MFCCs and with conventional MFCCs as the speech features, without the prolonged-sound correction processing. Table 1 gives the word correct rate (Correct Word Rate [%]) and Table 2 the error rate (Error Rate [%]). Comparing the averages of the word correct rates and error rates, using ARHMM-based MFCCs as features improves the recognition rate over conventional MFCCs.

Next, Tables 3 and 4 show the recognition results when the prolonged-sound correction processing is applied to the ARHMM-based MFCCs and to the conventional MFCCs extracted from the singing voices. The recognition rate improves more when the prolonged-sound correction is applied to the ARHMM-based MFCCs than when it is applied to the conventional MFCCs.

Comparing the MFCC results without the prolonged-sound correction (Tables 1 and 2) and with it (Tables 3 and 4) confirms the effectiveness of the prolonged-sound correction processing. The ARHMM-based MFCC results without the correction (Tables 1 and 2) show only a slight improvement, but the combination of ARHMM-based MFCCs and the prolonged-sound correction (ARHMM in Tables 3 and 4) improves both the phonemic feature extraction accuracy and, through it, the prolonged-sound detection accuracy, yielding an improvement larger than the sum of the improvements obtained when each is used alone. This shows that ARHMM-based MFCCs and the prolonged-sound correction processing are an optimal combination.

Industrial Applicability

A karaoke machine that recognizes the singing voice to detect which part of the song the singer is singing and controls the accompaniment tempo based on that information.
Automated telop (caption) display of lyrics and dialogue by recognizing singing voices and animation voices.

FIG. 1 is a flowchart of the speech recognition method of the present invention. FIG. 2 is a diagram of the output signals at each stage of the speech recognition apparatus of the present invention.

Explanation of Symbols

SA, MA: feature extraction procedure SA and feature extraction means MA based on the ARHMM
SB, MB: prolonged-sound correction procedure SB and prolonged-sound correction means MB
SC, MC: speech recognition procedure SC and speech recognition means MC

Claims (8)

1. A speech recognition method characterized by judging a section in which the phonemic variation of the input speech is small to be a prolonged-sound section, deleting the speech features of that section, and recognizing the remaining features.
2. The speech recognition method according to claim 1, comprising a procedure of detecting prolonged-sound sections of the input speech by computing Δ coefficients from the speech features obtained in the frame at each time, applying a smoothing operation to the time series formed by arranging the squared sums of the Δ coefficients in frame-time order, setting a threshold, and, when the number of consecutive frames below the threshold exceeds a fixed count, judging the subsequent frames to be a prolonged-sound section for as long as they remain below the threshold, deleting the features of those frames, and recognizing the remaining features.
3. The speech recognition method according to claim 1 or 2, characterized in that the procedure of claim 1 or 2 is applied to speech features obtained by speech analysis based on an autoregressive hidden Markov model.
4. The speech recognition method according to claim 3, characterized by executing:
(1) Procedure 1 of capturing the speech signal frame by frame over time, analyzing the time series of the speech signal based on the autoregressive hidden Markov model, and obtaining the log AR (autoregressive) spectrum amplitude u(n) at frame time n by Equation (1) below, where N corresponds to the number of FFT samples;
(2) Procedure 2 of obtaining the mel filter bank outputs using triangular windows arranged on the mel-frequency axis;
(3) Procedure 3 of applying the discrete cosine transform to these to produce ARHMM-based MFCCs;
(4) Procedure 4 of obtaining the time series s(n) of the squared sum of the Δ coefficients at frame time n from Equation (2) below;
(5) Procedure 5 of obtaining the time series l(n) from Equation (3) below by applying a smoothing operation, for example a moving average, to s(n);
(6) Procedure 6 of setting a threshold l_thr for the time series l(n) obtained as above and, when N_r consecutive values starting from some time n_s fall below the threshold l_thr as in Equation (4) below, judging the section to be a prolonged sound and, from time (n_s + N_r) onward, obtaining a signal in which the feature vector at each time is deleted for as long as l(n) remains below the threshold; and
(7) Procedure 7 of performing speech recognition based on the features obtained by executing Procedure 6.
5. A speech recognition apparatus characterized by judging a section in which the phonemic variation of the input speech is small to be a prolonged-sound section, deleting the speech features of that section, and recognizing the remaining features.
6. The speech recognition apparatus according to claim 5, comprising means for detecting prolonged-sound sections of the input speech by computing Δ coefficients from the speech features obtained from the frame at each time, applying a smoothing operation to the time series formed by arranging the squared sums of each frame's Δ coefficients in frame-time order, setting a threshold, and, when the number of consecutive frames below the threshold exceeds a fixed count, judging the subsequent frames to be a prolonged-sound section for as long as they remain below the threshold, deleting the features of those frames, and recognizing the remaining features.
7. The speech recognition apparatus according to claim 5 or 6, characterized in that the means of claim 5 or 6 are applied to speech features obtained by speech analysis based on an autoregressive hidden Markov model.
8. The speech recognition apparatus according to claim 7, characterized by comprising:
(1) Means 1 for capturing the speech signal frame by frame over time, analyzing the time series of the speech signal based on the autoregressive hidden Markov model, and obtaining the log autoregressive spectrum amplitude u(n) at frame time n by Equation (5) below, where N is the number of FFT samples;
(2) Means 2 for obtaining the mel filter bank outputs using triangular windows arranged on the mel-frequency axis;
(3) Means 3 for applying the discrete cosine transform to the log value u(n) obtained by Means 1 and the mel filter bank outputs obtained by Means 2 to produce autoregressive-hidden-Markov-model-based MFCCs;
(4) Means 4 for obtaining the time series s(n) of the squared sum of the Δ coefficients at frame time n from Equation (6) below;
(5) Means 5 for obtaining the time series l(n) from Equation (7) below by applying a smoothing operation, for example a moving average, to s(n);
(6) Means 6 for setting a threshold l_thr for the time series l(n) obtained as above and, when N_r consecutive values starting from some time n_s fall below the threshold l_thr as in Equation (8) below, judging the section to be a prolonged sound and, from time (n_s + N_r) onward, obtaining a signal in which the feature vector at each time is deleted for as long as l(n) remains below the threshold; and
(7) Means 7 for performing speech recognition based on the features output by Means 6.
JP2005266130A 2005-09-13 2005-09-13 Speech recognition method and speech recognition apparatus Expired - Fee Related JP4576612B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2005266130A JP4576612B2 (en) 2005-09-13 2005-09-13 Speech recognition method and speech recognition apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2005266130A JP4576612B2 (en) 2005-09-13 2005-09-13 Speech recognition method and speech recognition apparatus

Publications (2)

Publication Number Publication Date
JP2007079072A true JP2007079072A (en) 2007-03-29
JP4576612B2 JP4576612B2 (en) 2010-11-10

Family

ID=37939459

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2005266130A Expired - Fee Related JP4576612B2 (en) 2005-09-13 2005-09-13 Speech recognition method and speech recognition apparatus

Country Status (1)

Country Link
JP (1) JP4576612B2 (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS60129796A * 1983-12-17 1985-07-11 電子計算機基本技術研究組合 Syllable boundary detection system
JPH04211299A (en) * 1991-02-08 1992-08-03 Matsushita Electric Ind Co Ltd Monosyllabic voice recognizing device
JPH11250063A (en) * 1998-02-27 1999-09-17 Toshiba Corp Retrieval device and method therefor
JP2000099099A (en) * 1998-09-22 2000-04-07 Sharp Corp Data reproducing device
JP2002311981A (en) * 2001-04-17 2002-10-25 Sony Corp Natural language processing system and natural language processing method as well as program and recording medium
JP2003005785A (en) * 2001-06-26 2003-01-08 National Institute Of Advanced Industrial & Technology Separating method and separating device for sound source
JP2004012883A (en) * 2002-06-07 2004-01-15 Sharp Corp Speech recognition device, speech recognition method, speech recognition program, and program recording medium
JP2004287010A (en) * 2003-03-20 2004-10-14 National Institute Of Advanced Industrial & Technology Method and device for wavelength recognition, and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Keiichi Tokuda (徳田 恵一), "A Speech Parameter Generation Algorithm Based on HMM," IEICE Technical Report, Vol. 95, No. 468, pp. 35-42, Jan. 19, 1996. (JPN6010004118; ISSN: 0001685302) *
Keiichi Tokuda (徳田 恵一), "Speech Synthesis Based on Hidden Markov Models," IEICE Technical Report, Vol. 99, No. 255, pp. 47-54, Aug. 5, 1999. (JPN6010004120; ISSN: 0001685303) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445924A (en) * 2020-03-18 2020-07-24 中山大学 Method for detecting and positioning smooth processing in voice segment based on autoregressive model coefficient
CN111914721A (en) * 2020-07-27 2020-11-10 华中科技大学 Machining state identification method based on linear regression and Gaussian threshold
CN111914721B (en) * 2020-07-27 2024-02-06 华中科技大学 Machining state identification method based on linear regression and Gaussian threshold

Also Published As

Publication number Publication date
JP4576612B2 (en) 2010-11-10

Similar Documents

Publication Publication Date Title
Chang et al. Large vocabulary Mandarin speech recognition with different approaches in modeling tones
Shahnawazuddin et al. Creating speaker independent ASR system through prosody modification based data augmentation
Shahnawazuddin et al. Pitch-Adaptive Front-End Features for Robust Children's ASR.
Wang et al. Speaker identification by combining MFCC and phase information in noisy environments
Deshwal et al. Feature extraction methods in language identification: a survey
WO2004111996A1 (en) Acoustic interval detection method and device
Bezoui et al. Feature extraction of some Quranic recitation using mel-frequency cepstral coeficients (MFCC)
WO2007046267A1 (en) Voice judging system, voice judging method, and program for voice judgment
JP2006171750A (en) Feature vector extracting method for speech recognition
Shahnawazuddin et al. Effect of prosody modification on children's ASR
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
Fukuda et al. Detecting breathing sounds in realistic Japanese telephone conversations and its application to automatic speech recognition
Rao et al. Speech processing in mobile environments
Shahnawazuddin et al. Pitch-normalized acoustic features for robust children's speech recognition
Eringis et al. Improving speech recognition rate through analysis parameters
Alku et al. The linear predictive modeling of speech from higher-lag autocorrelation coefficients applied to noise-robust speaker recognition
Vegesna et al. Prosody modification for speech recognition in emotionally mismatched conditions
US20140200889A1 (en) System and Method for Speech Recognition Using Pitch-Synchronous Spectral Parameters
Chadha et al. Optimal feature extraction and selection techniques for speech processing: A review
Zolnay et al. Using multiple acoustic feature sets for speech recognition
Zolnay et al. Extraction methods of voicing feature for robust speech recognition.
JP4576612B2 (en) Speech recognition method and speech recognition apparatus
Hasija et al. Recognition of Children Punjabi Speech using Tonal Non-Tonal Classifier
Khonglah et al. Speech enhancement using source information for phoneme recognition of speech with background music
Yavuz et al. A Phoneme-Based Approach for Eliminating Out-of-vocabulary Problem Turkish Speech Recognition Using Hidden Markov Model.

Legal Events

Date Code Title Description
2007-03-14 A621 Written request for application examination (JAPANESE INTERMEDIATE CODE: A621)
2010-01-18 A977 Report on retrieval (JAPANESE INTERMEDIATE CODE: A971007)
2010-02-02 A131 Notification of reasons for refusal (JAPANESE INTERMEDIATE CODE: A131)
2010-03-19 A521 Request for written amendment filed (JAPANESE INTERMEDIATE CODE: A523)
TRDD Decision of grant or rejection written
2010-08-03 A01 Written decision to grant a patent or to grant a registration (utility model) (JAPANESE INTERMEDIATE CODE: A01)
2010-08-04 A61 First payment of annual fees (during grant procedure) (JAPANESE INTERMEDIATE CODE: A61)
FPAY Renewal fee payment (PAYMENT UNTIL: 2013-09-03; Year of fee payment: 3)
R150 Certificate of patent or registration of utility model (JAPANESE INTERMEDIATE CODE: R150)
R250 Receipt of annual fees (JAPANESE INTERMEDIATE CODE: R250)
R250 Receipt of annual fees (JAPANESE INTERMEDIATE CODE: R250)
S533 Written request for registration of change of name (JAPANESE INTERMEDIATE CODE: R313533)
R350 Written notification of registration of transfer (JAPANESE INTERMEDIATE CODE: R350)
R250 Receipt of annual fees (JAPANESE INTERMEDIATE CODE: R250)
LAPS Cancellation because of no payment of annual fees