JP2013122508A - Voice recognition device and program - Google Patents

Voice recognition device and program

Info

Publication number
JP2013122508A
JP2013122508A
Authority
JP
Japan
Prior art keywords
speech
unit
learning
voice
feature amount
Legal status
Pending
Application number
JP2011270381A
Other languages
Japanese (ja)
Inventor
Satoshi Tamura (田村 哲嗣)
Satoru Hayamizu (速水 悟)
Current Assignee
Individual
Original Assignee
Individual
Application filed by Individual
Priority to JP2011270381A
Publication of JP2013122508A
Legal status: Pending

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that uses a learning speech signal and its correct-class information to build a discriminative feature transformation optimized for speech recognition, and that suppresses the influence of acoustic noise by applying this transformation to a recognition speech signal and extracting acoustic features for recognition.

SOLUTION: A learning unit 10 includes a class identification transformation construction unit 11, which takes as inputs the vector sequence into which a speech input unit 1 converts the learning speech signal and the correct-class information of that signal, and computes a class identification transformation; and a dimension compression / linear orthogonal transformation construction unit 12, which computes a dimension compression / linear orthogonal transformation from the vector sequence, the correct-class information, and the class identification transformation. An application unit 20 includes a feature extraction unit 21, which extracts acoustic features from the vector sequence into which the speech input unit 1 converts a recognition speech signal, using the class identification transformation and the dimension compression / linear orthogonal transformation; and a speech recognition unit 2, which performs speech recognition using the acoustic features and outputs a recognition result.

Description

The present invention relates to a speech recognition device that converts acoustic features into text, and to programs therefor.

Speech recognition is a technique that extracts acoustic features from an input speech signal and converts them into text by pattern matching or a hidden Markov model (HMM). Within this process, acoustic feature extraction converts the input speech signal, through signal processing and signal analysis, into a time series of vectors, i.e., features. Practical speech recognition systems most often use acoustic features called Mel-Frequency Cepstrum Coefficients (MFCCs), extracted by acoustic analysis.
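As a point of reference (not part of the patent itself), MFCCs of the kind described here can be computed with the librosa library; the file name, sampling rate, and frame parameters below are illustrative assumptions:

    # Hedged sketch: MFCC extraction with librosa; parameter values are assumptions.
    import librosa

    signal, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)  # 25 ms frames, 10 ms shift
    print(mfcc.shape)  # (13, number_of_frames)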

A known problem of existing speech recognition is its low recognition performance under noise, for example in real environments. As summarized in Non-Patent Document 1, techniques commonly used to improve recognition performance include spectral subtraction and cepstral mean normalization, which reduce noise during acoustic feature extraction, and maximum a posteriori estimation and maximum likelihood linear regression, which adapt a hidden Markov model to noise.
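Of the techniques just listed, cepstral mean normalization is simple enough to show inline: it subtracts the per-coefficient time average, removing stationary channel distortion. A minimal sketch, assuming the features arrive as a (frames x coefficients) array:

    import numpy as np

    def cepstral_mean_normalization(features):
        """features: (num_frames, num_coefficients) array of cepstral vectors."""
        return features - features.mean(axis=0, keepdims=True)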

Another way to improve recognition performance is to replace methods based mainly on acoustic analysis and signal processing with discriminative acoustic feature extraction. For example, Non-Patent Documents 2 and 3 obtain the acoustic features extracted from a speech signal by optimization under a maximum mutual information criterion or a minimum phone error criterion.

A genetic algorithm, on the other hand, represents candidate solutions as genes, generates a population of individuals carrying those genes, and searches for the optimal solution by applying operations such as selection, crossover, and mutation according to fitness. Non-Patent Document 4 uses a genetic algorithm, guided by the output likelihood of a hidden Markov model, to improve the extraction of Mel-scale cepstrum coefficients.

As another prior example of applying genetic algorithms to speech processing, Patent Document 1 uses a genetic algorithm to determine control parameters, such as acoustic-tube cross-sectional areas and current values, for generating speech mechanically and electrically as a method of speech analysis.

Patent Document 1: JP H07-114397 A

Non-Patent Document 1: "Report on the Patent Application Technology Trend Survey: Speech Recognition Technology," Japan Patent Office, 2003.
Non-Patent Document 2: D. Povey et al., "fMPE: Discriminatively trained features for speech recognition," Proc. ICASSP 2005, pp. 961-964, 2005.
Non-Patent Document 3: D. Povey et al., "Boosted MMI for model and feature-space discriminative training," Proc. ICASSP 2008, pp. 4057-4060, 2008.
Non-Patent Document 4: Z. Behzad et al., "Discriminative transformation for speech features based on genetic algorithm and HMM likelihoods," IEICE Electronics Express, vol. 7, no. 4, pp. 247-253, 2010.

Conventional speech recognition suffers from a marked drop in recognition performance in noisy environments. One cause is that the extraction of Mel-scale cepstrum coefficients, the standard acoustic feature, is a signal-processing-based procedure and is not necessarily the feature extraction method best suited to speech recognition. Various attempts have been made to improve MFCC extraction, but the speech recognition performance obtained remains insufficient.

An object of the present invention is to improve the performance of speech recognition under noise by extracting discriminative acoustic features that differ from Mel-scale cepstrum coefficients, are optimized for speech recognition, and take the noise environment into account.

Another object of the present invention is to provide a program that causes a computer to extract discriminative acoustic features optimized for speech recognition with the noise environment taken into account, and to operate as a speech recognition device using those features.

The present invention is characterized by providing a speech recognition device that extracts acoustic features optimized for speech recognition.

To achieve the above object, the invention of claim 1 is a speech recognition device comprising: a speech input unit that receives a speaker's speech signal, converts it into a digital signal, and converts the digital signal into a vector sequence for output; a class identification transformation construction unit that receives the vector sequence of a learning speech signal output by the speech input unit and the correct-class information of the learning speech signal, and outputs a class identification transformation; a dimension compression / linear orthogonal transformation construction unit that receives the vector sequence of the learning speech signal, the correct-class information of the learning speech signal, and the class identification transformation, and outputs a dimension compression / linear orthogonal transformation; a feature extraction unit that receives the vector sequence of a recognition speech signal output by the speech input unit, the class identification transformation, and the dimension compression / linear orthogonal transformation, and extracts an acoustic feature; and a speech recognition unit that performs speech recognition on the recognition speech signal based on the acoustic feature generated by the feature extraction unit.

The invention of claim 2 is a speech recognition device comprising: a speech input unit that receives a speaker's speech signal, converts it into a digital signal, and converts the digital signal into a vector sequence for output; a feature transformation construction unit that receives the vector sequence of a learning speech signal output by the speech input unit and the correct-class information of the learning speech signal, and outputs a feature transformation obtained by a genetic algorithm; a feature extraction unit that receives the vector sequence of a recognition speech signal output by the speech input unit and the feature transformation output by the feature transformation construction unit, and extracts an acoustic feature; and a speech recognition unit that performs speech recognition on the recognition speech signal based on the acoustic feature generated by the feature extraction unit.

The invention of claim 3 is a program for causing a computer to perform speech recognition, characterized in that the speech recognition device of claim 1 or claim 2 is realized by the computer.

According to the invention of claim 1, acoustic feature extraction with a learning mechanism performs a feature transformation optimized for speech recognition and extracts discriminative acoustic features. The invention of claim 1 first generates a class identification transformation that, for a vector taken from the vector sequence, returns an intermediate vector composed of information on whether the vector belongs to each class to be identified (such as a phoneme); second, it generates a dimension compression / linear orthogonal transformation that performs dimension compression and linear orthogonalization on the intermediate vector obtained by the class identification transformation. The invention of claim 1 thus extracts features with high recognition performance.

According to the invention of claim 2, acoustic feature extraction with a learning mechanism likewise performs a feature transformation optimized for speech recognition and extracts discriminative acoustic features. In the invention of claim 2, the transformation from the vector sequence to the acoustic features is generated using a genetic algorithm. The invention of claim 2 thus also extracts features with high recognition performance.

According to the invention of claim 3, by executing the program, a computer can easily be realized as the speech recognition device of claim 1 or claim 2.

As described above, the speech recognition device according to the present invention uses acoustic feature extraction with a learning mechanism to extract acoustic features that are optimized for speech recognition and robust in noisy environments, achieving high speech recognition performance not only in quiet environments but also under noise.

FIG. 1: Functional block diagram of an embodiment of the speech recognition device of claim 1.
FIG. 2: Functional block diagram of an embodiment of the speech recognition device of claim 2.
FIG. 3: Illustration of the conversion from a digitized speech signal to a frequency vector sequence.
FIG. 4: Performance comparison by noise type between an embodiment of the speech recognition device of claim 1 and a conventional speech recognition device.
FIG. 5: Performance comparison by noise intensity between an embodiment of the speech recognition device of claim 1 and a conventional speech recognition device.

An embodiment of a speech recognition device embodying the present invention is described below with reference to FIGS. 1 and 2.

FIG. 1 shows an embodiment of the invention of claim 1.
The speech input unit 1 receives the speaker's speech, samples the input speech signal in accordance with the sampling theorem, quantizes it with an appropriate quantization step, and converts it into a digital signal. Next, as shown in FIG. 3, the digital signal is cut into segments of fixed length at fixed intervals to extract speech frames. For each speech frame, frequency information is obtained by a Fourier transform and aggregated into a vector by a mel filter bank, yielding a frequency vector. Arranging these frequency vectors in time order over the whole digitized signal produces a vector sequence (hereinafter, a frequency vector sequence).
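A minimal sketch of this front end in NumPy follows; the sampling rate, frame length, frame shift, window, and number of mel bands are assumptions, since the patent does not fix them:

    import numpy as np
    import librosa  # used here only for its mel filter bank

    def frequency_vector_sequence(signal, sr=16000, frame_len=400,
                                  frame_shift=160, n_mels=24):
        """Convert a digitized speech signal into a (T, N) frequency vector sequence."""
        frames = []
        for start in range(0, len(signal) - frame_len + 1, frame_shift):
            frame = signal[start:start + frame_len] * np.hamming(frame_len)
            frames.append(np.abs(np.fft.rfft(frame)))   # per-frame frequency information
        mel_fb = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_mels)
        return np.stack(frames) @ mel_fb.T              # one N-dimensional vector per frame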

The learning unit 10 uses the frequency vector sequence obtained by feeding the learning speech signal to the speech input unit 1, together with the correct-class information of the learning speech signal, to compute the class identification transformation and the dimension compression / linear orthogonal transformation. The application unit 20 uses the transformations obtained by the learning unit 10 to convert the frequency vector sequence obtained by feeding the recognition speech signal to the speech input unit 1 into acoustic features, performs speech recognition, and outputs the recognition result.

The class identification transformation construction unit 11 first constructs, for each class, a classifier that decides whether a frequency vector belongs to that class; for example, "a classifier that decides whether a given frequency vector belongs to phoneme a" and "a classifier that decides whether a given frequency vector belongs to phoneme b".

An example configuration of a classifier built with a genetic algorithm in the class identification transformation construction unit 11 is described below.
For a class c, a classifier f that indicates whether an N-dimensional frequency vector x belongs to c is expressed by the following equation. The classifier returns a positive value if the frequency vector x belongs to c, and a negative value if it does not.

f(\mathbf{x}) = a_0^{(c)} + \sum_{i=1}^{N} a_i^{(c)} x_i

The classifier parameters a(c) of the classifier f are determined using the learning speech signal and its correct-class information: a(c) is chosen so that, over the frequency vector sequence obtained by feeding the learning speech signal to the speech input unit 1, the classifier returns the correct decision for as many vectors as possible.

Concretely, the classifier parameters a(c) are determined by a genetic algorithm as follows. A population of individuals whose genes are the N+1 coefficients of a(c) is generated to form the initial generation; each individual is initialized randomly.

For each individual v of the current generation, the fitness E(v) is computed by the following equation, where a is the classifier parameter vector obtained from the genes of v, r_k is the k-th frequency vector of the frequency vector sequence obtained by feeding the learning speech signal to the speech input unit 1, and l_k is the correct-class label of the k-th frequency vector, equal to 1 if r_k belongs to class c and -1 otherwise. K is the total number of frequency vectors obtained from the learning speech signal, and sgn is the sign function.

E(v) = \sum_{k=1}^{K} l_k \,\mathrm{sgn}\big(f(\mathbf{r}_k)\big)

Based on this fitness, the basic operations of the genetic algorithm (elitist selection, inheritance, crossover, and mutation) are applied to create the next generation from the current one (hereinafter, creating the next generation from the current generation is called a generation change).

Generation changes are repeated a fixed number of times from the initial generation to produce the final generation. The individual with the highest fitness is taken from the final generation, and its genes are decoded to obtain the optimized classifier parameters a(c).
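A compact sketch of this per-class genetic algorithm follows; the population size, mutation scale, and generation count are arbitrary choices, as the patent does not specify them:

    import numpy as np

    rng = np.random.default_rng(0)

    def fitness(a, R, labels):
        """E(v): agreement between sgn(f(r_k)) and the labels l_k in {+1, -1}."""
        scores = a[0] + R @ a[1:]                 # f(x) = a0 + sum_i a_i x_i
        return float(np.sum(labels * np.sign(scores)))

    def evolve_classifier(R, labels, pop=50, gens=100, sigma=0.1):
        """R: (K, N) learning frequency vectors; returns the optimized a(c)."""
        n = R.shape[1] + 1                        # N+1 coefficients per gene
        population = rng.standard_normal((pop, n))
        for _ in range(gens):
            fit = np.array([fitness(a, R, labels) for a in population])
            elite = population[np.argsort(fit)[::-1][:pop // 2]]   # elitist selection
            pairs = rng.integers(0, len(elite), (pop - len(elite), 2))
            mask = rng.random((pop - len(elite), n)) < 0.5
            children = np.where(mask, elite[pairs[:, 0]], elite[pairs[:, 1]])  # crossover
            children += sigma * rng.standard_normal(children.shape)           # mutation
            population = np.vstack([elite, children])
        fit = np.array([fitness(a, R, labels) for a in population])
        return population[np.argmax(fit)]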

Note that the present invention does not preclude using a support vector machine or the like to determine the classifier parameters a(c). Although a linear classifier is shown as an example, the invention also does not preclude using a nonlinear classifier.

Second, the class identification transformation construction unit 11 constructs such a classifier for every class. Expressing the classifier parameters a(c) of class c as a row vector, all the row vectors are stacked to form the class identification transformation matrix. The output of the class identification transformation construction unit 11 is thus a class identification transformation matrix with C rows and N+1 columns, where C is the number of classes to identify.
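Continuing the sketch above, stacking one evolved parameter vector per class yields the matrix A; the per-class label arrays here are hypothetical names, assumed to be prepared from the correct-class information:

    # One GA run per class; labels_for_class[c] holds the ±1 labels for class c.
    A = np.vstack([evolve_classifier(R, labels_for_class[c]) for c in range(C)])
    # A has shape (C, N + 1): the class identification transformation matrix.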

The dimension compression / linear orthogonal transformation construction unit 12 performs dimension compression and linear orthogonalization on the class identification transformation obtained by the class identification transformation construction unit 11.

An example configuration of the dimension compression / linear orthogonalization using a genetic algorithm in the dimension compression / linear orthogonal transformation construction unit 12 is described below.
In the frequency vector sequence obtained by feeding the learning speech signal to the speech input unit 1, each frequency vector is extended to x' and multiplied by the class identification transformation matrix A as in the following equation, giving a C-dimensional vector y (hereinafter, an intermediate vector).

\mathbf{y} = A\,\mathbf{x}', \qquad \mathbf{x}' = [\,1,\ x_1,\ \ldots,\ x_N\,]^{\top}

For the intermediate vector y, the first dimension g1 of the dimension compression / linear orthogonal transformation is computed by the following equation.

g_1 = \mathbf{b}^{(1)\top}\mathbf{y} = \sum_{j=1}^{C} b_j^{(1)} y_j

The parameter b(1) of g1 is determined using the learning speech signal and its correct-class information: over the frequency vector sequence obtained from the learning speech signal, the mean intermediate vector y is computed for each class, the value of g1 is evaluated at each class mean, and b(1) is chosen so that the variance of these values is maximized.

Concretely, b(1) is determined by a genetic algorithm as follows. A population of individuals whose genes are the C coefficients of b(1) is generated to form the initial generation; each individual is initialized randomly.

For each individual v of the current generation, the fitness E(v) is computed by the following equation, where b is the parameter vector obtained from the genes of v and var is the variance function.

E(v) = \mathrm{var}\big(\mathbf{b}^{\top}\bar{\mathbf{y}}^{(1)},\ \ldots,\ \mathbf{b}^{\top}\bar{\mathbf{y}}^{(C)}\big), \quad \text{where } \bar{\mathbf{y}}^{(c)} \text{ is the mean intermediate vector of class } c

Based on this fitness, the basic operations of the genetic algorithm are applied to create the next generation from the current one (generation change).

Generation changes are repeated a fixed number of times from the initial generation to produce the final generation. The individual with the highest fitness is taken from the final generation, and its genes are decoded to obtain the optimized parameter b(1).

The second dimension g2 of the dimension compression / linear orthogonal transformation is computed by the following equation.

g_2 = \mathbf{b}^{(2)\top}\mathbf{y}

The parameter b(2) of g2 is determined by a genetic algorithm in the same way as the parameter b(1) of g1, except that b(2) is sought under the constraint that its inner product with b(1) is zero, i.e., that the following equation holds.

\mathbf{b}^{(1)\top}\mathbf{b}^{(2)} = 0

Likewise, the m-th dimension gm of the dimension compression / linear orthogonal transformation is obtained by evaluating gm at the class mean vectors and choosing b(m) to maximize the variance, under the constraint that the inner products of b(m) with each of b(1), b(2), ..., b(m-1) are all zero.
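One way to honor these orthogonality constraints inside the genetic algorithm, sketched below, is a Gram-Schmidt-style projection applied to every gene after mutation; the patent does not prescribe how the constraint is enforced, so this is an assumption:

    import numpy as np

    def project_out(b, previous):
        """Remove from b its components along the already-found directions."""
        for p in previous:
            b = b - (b @ p) / (p @ p) * p
        return b

    def variance_fitness(b, class_means):
        """E(v): variance of b·y-bar(c) over the per-class mean intermediate vectors."""
        return float(np.var(class_means @ b))

    # Inside the GA loop of the earlier sketch, one would apply
    #   population = np.array([project_out(b, found_directions) for b in population])
    # after mutation, then rank individuals with variance_fitness.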

Note that the present invention does not preclude using principal component analysis or the like to determine the parameters b(1) and b(2) through b(m).
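For comparison, that alternative is compact with scikit-learn: PCA fitted to the per-class mean intermediate vectors yields mutually orthogonal directions of maximal class-mean variance (the array name and the output dimensionality M are assumptions):

    import numpy as np
    from sklearn.decomposition import PCA

    # class_means: (C, C) array whose rows are the per-class mean intermediate vectors.
    M = 12                       # illustrative output dimensionality, M <= C
    pca = PCA(n_components=M).fit(class_means)
    B = pca.components_          # (M, C): rows play the role of b(1) ... b(M)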

The output of the dimension compression / linear orthogonal transformation construction unit 12 is a dimension compression / linear orthogonal transformation matrix with M rows and C columns, where M is the output dimensionality of the transformation.

When performing speech recognition, the recognition speech signal is fed to the speech input unit 1 and converted into a frequency vector sequence.

The feature extraction unit 21 converts the frequency vector sequence into acoustic features using the class identification transformation obtained by the class identification transformation construction unit 11 and the dimension compression / linear orthogonal transformation obtained by the construction unit 12. In the frequency vector sequence obtained by feeding the recognition speech signal to the speech input unit 1, each frequency vector is extended to x' and multiplied by the class identification transformation matrix A and the dimension compression / linear orthogonal transformation matrix B as in the following equation, giving the M-dimensional acoustic feature z.

\mathbf{z} = B\,A\,\mathbf{x}'
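Applied over a whole utterance this is two matrix products per frame; a sketch reusing the shapes established above:

    import numpy as np

    def extract_features(X, A, B):
        """X: (T, N) frequency vectors; A: (C, N+1); B: (M, C) -> (T, M) features z."""
        X_ext = np.hstack([np.ones((X.shape[0], 1)), X])  # extended vectors x' = [1, x]
        return X_ext @ A.T @ B.T                          # z = B A x' for every frame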

The speech recognition unit 2 performs speech recognition using the acoustic features output by the feature extraction unit 21. A hidden Markov model is used as the model, the acoustic features are matched against the model by the Viterbi algorithm, and the word hypothesis with the highest likelihood is output as the recognition result.
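For orientation, a minimal log-domain Viterbi pass over per-frame state log-likelihoods is sketched below; how those likelihoods are produced from z, and the HMM topology, are assumptions outside the patent's text:

    import numpy as np

    def viterbi(log_emissions, log_trans, log_init):
        """log_emissions: (T, S); log_trans: (S, S); log_init: (S,).
        Returns the most likely state path and its log-likelihood."""
        T, S = log_emissions.shape
        delta = log_init + log_emissions[0]
        backptr = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans           # predecessor i -> state j
            backptr[t] = np.argmax(scores, axis=0)
            delta = scores[backptr[t], np.arange(S)] + log_emissions[t]
        path = [int(np.argmax(delta))]
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[t, path[-1]]))
        return path[::-1], float(np.max(delta))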

FIG. 2 shows an embodiment of the invention of claim 2.
The speech input unit 1 receives the speaker's speech, samples the input speech signal in accordance with the sampling theorem, quantizes it with an appropriate quantization step, and converts it into a digital signal. Next, as shown in FIG. 3, the digital signal is cut into segments of fixed length at fixed intervals to extract speech frames. For each speech frame, frequency information is obtained by a Fourier transform and aggregated into a vector by a mel filter bank, yielding a frequency vector. Arranging these frequency vectors in time order over the whole digitized signal produces a vector sequence (hereinafter, a frequency vector sequence).

The learning unit 10 computes a feature transformation using the frequency vector sequence obtained by feeding the learning speech signal to the speech input unit 1 and the correct-class information of the learning speech signal. The application unit 20 uses the feature transformation obtained by the learning unit 10 to convert the frequency vector sequence obtained by feeding the recognition speech signal to the speech input unit 1 into acoustic features, performs speech recognition, and outputs the recognition result.

The feature transformation construction unit 16 generates, by a genetic algorithm, a feature transformation that converts a frequency vector sequence into acoustic features.

An example configuration of the feature transformation using a genetic algorithm in the feature transformation construction unit 16 is described below.
In the frequency vector sequence obtained by feeding the learning speech signal to the speech input unit 1, the first dimension h1 of the feature transformation is computed for each (N-dimensional) frequency vector x of the sequence by the following equation.

h_1 = \mathbf{d}^{(1)\top}\mathbf{x} = \sum_{i=1}^{N} d_i^{(1)} x_i

The parameter d(1) of h1 is determined using the learning speech signal and its correct-class information: over the frequency vector sequence obtained from the learning speech signal, the mean vector is computed for each class, the value of h1 is evaluated at each class mean, and d(1) is chosen so that the variance of these values is maximized.

Concretely, d(1) is determined by a genetic algorithm as follows. A population of individuals whose genes are the N coefficients of d(1) is generated to form the initial generation; each individual is initialized randomly.

For each individual v of the current generation, the fitness E(v) is computed by the following equation, where d is the parameter vector obtained from the genes of v and var is the variance function.

E(v) = \mathrm{var}\big(\mathbf{d}^{\top}\bar{\mathbf{x}}^{(1)},\ \ldots,\ \mathbf{d}^{\top}\bar{\mathbf{x}}^{(C)}\big), \quad \text{where } \bar{\mathbf{x}}^{(c)} \text{ is the mean frequency vector of class } c

Based on this fitness, the basic operations of the genetic algorithm are applied to create the next generation from the current one (generation change).

Generation changes are repeated a fixed number of times from the initial generation to produce the final generation. The individual with the highest fitness is taken from the final generation, and its genes are decoded to obtain the optimized parameter d(1).

The second dimension h2 of the feature transformation is computed by the following equation.

h_2 = \mathbf{d}^{(2)\top}\mathbf{x}

The parameter d(2) of h2 is determined by a genetic algorithm in the same way as the parameter d(1) of h1, except that d(2) is sought under the constraint that its inner product with d(1) is zero, i.e., that the following equation holds.

\mathbf{d}^{(1)\top}\mathbf{d}^{(2)} = 0

Likewise, the m-th dimension hm of the feature transformation is obtained by evaluating hm at the class mean vectors and choosing d(m) to maximize the variance, under the constraint that the inner products of d(m) with each of d(1), d(2), ..., d(m-1) are all zero.

The output of the feature transformation construction unit 16 is a feature transformation matrix with M rows and N columns, where M is the output dimensionality of the feature transformation.

When performing speech recognition, the recognition speech signal is fed to the speech input unit 1 and converted into a frequency vector sequence.

The feature extraction unit 26 converts the frequency vector sequence into acoustic features using the feature transformation obtained by the feature transformation construction unit 16. In the frequency vector sequence obtained by feeding the recognition speech signal to the speech input unit 1, each frequency vector x is multiplied by the feature transformation matrix D as in the following equation, giving the M-dimensional acoustic feature z.

\mathbf{z} = D\,\mathbf{x}

The speech recognition unit 2 performs speech recognition using the acoustic features output by the feature extraction unit 26. A hidden Markov model is used as the model, the acoustic features are matched against the model by the Viterbi algorithm, and the word hypothesis with the highest likelihood is output as the recognition result.

The embodiments of the present invention are not limited to those described above; they may be modified without departing from the spirit of the invention.

FIG. 4 compares the speech recognition accuracy of the speech recognition device according to the embodiment of the claim 1 invention shown in FIG. 1 with that of a conventional speech recognition device comprising the speech input unit 1, a feature extraction unit generating Mel-scale cepstrum coefficients, and the speech recognition unit 2, evaluated on eight different noise types using CENSREC-1, the common evaluation corpus for speech recognition in noisy environments.

FIG. 5 compares the speech recognition accuracy of the same two devices, evaluated at seven different noise intensities using the CENSREC-1 corpus.

The speech recognition experiments shown in FIGS. 4 and 5 were conducted under the same conditions as the reference method (baseline) supplied with the CENSREC-1 corpus.

FIGS. 4 and 5 show that the speech recognition device according to the embodiment of the claim 1 invention shown in FIG. 1 improves speech recognition performance significantly over the conventional device, confirming that the object of the present invention, improved speech recognition performance under noise, is achieved.

DESCRIPTION OF SYMBOLS
1 ... Speech input unit
2 ... Speech recognition unit
10 ... Learning unit
11 ... Class identification transformation construction unit
12 ... Dimension compression / linear orthogonal transformation construction unit
16 ... Feature transformation construction unit
20 ... Application unit
21 ... Feature extraction unit
26 ... Feature extraction unit

Claims (3)

1. A speech recognition device comprising:
a speech input unit that receives a speech signal of a speaker, converts it into a digital signal, converts the digital signal into a vector sequence, and outputs the vector sequence;
a class identification transformation construction unit that receives, as inputs, the vector sequence of a learning speech signal output by the speech input unit and correct-class information of the learning speech signal, and outputs a class identification transformation;
a dimension compression / linear orthogonal transformation construction unit that receives, as inputs, the vector sequence of the learning speech signal output by the speech input unit, the correct-class information of the learning speech signal, and the class identification transformation, and outputs a dimension compression / linear orthogonal transformation;
a feature extraction unit that receives, as inputs, the vector sequence of a recognition speech signal output by the speech input unit, the class identification transformation, and the dimension compression / linear orthogonal transformation, and extracts an acoustic feature; and
a speech recognition unit that performs speech recognition on the recognition speech signal based on the acoustic feature generated by the feature extraction unit.
2. A speech recognition device comprising:
a speech input unit that receives a speech signal of a speaker, converts it into a digital signal, converts the digital signal into a vector sequence, and outputs the vector sequence;
a feature transformation construction unit that receives, as inputs, the vector sequence of a learning speech signal output by the speech input unit and correct-class information of the learning speech signal, and outputs a feature transformation obtained by a genetic algorithm;
a feature extraction unit that receives, as inputs, the vector sequence of a recognition speech signal output by the speech input unit and the feature transformation output by the feature transformation construction unit, and extracts an acoustic feature; and
a speech recognition unit that performs speech recognition on the recognition speech signal based on the acoustic feature generated by the feature extraction unit.
3. A program for causing a computer to perform speech recognition, wherein the program realizes, on the computer, the speech recognition device according to claim 1 or claim 2.
JP2011270381A 2011-12-09 2011-12-09 Voice recognition device and program Pending JP2013122508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2011270381A JP2013122508A (en) 2011-12-09 2011-12-09 Voice recognition device and program


Publications (1)

Publication Number Publication Date
JP2013122508A true JP2013122508A (en) 2013-06-20

Family

ID=48774503

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2011270381A Pending JP2013122508A (en) 2011-12-09 2011-12-09 Voice recognition device and program

Country Status (1)

Country Link
JP (1) JP2013122508A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020095707A1 (en) * 2018-11-08 2020-05-14 Nippon Telegraph And Telephone Corporation Optimization device, optimization method, and program
JP2020076874A (en) * 2018-11-08 2020-05-21 Nippon Telegraph And Telephone Corporation Optimization device, optimization method, and program
US20220005471A1 (en) * 2018-11-08 2022-01-06 Nippon Telegraph And Telephone Corporation Optimization apparatus, optimization method, and program
JP7167640B2 (en) 2018-11-08 2022-11-09 Nippon Telegraph And Telephone Corporation Optimization device, optimization method, and program
