JP2013122508A - Voice recognition device and program - Google Patents

Voice recognition device and program

Info

Publication number
JP2013122508A
JP2013122508A
Authority
JP
Japan
Prior art keywords
speech
unit
learning
voice
feature amount
Legal status
Pending
Application number
JP2011270381A
Other languages
Japanese (ja)
Inventor
Satoshi Tamura (田村 哲嗣)
Satoru Hayamizu (速水 悟)
Current Assignee
Individual
Original Assignee
Individual
Application filed by Individual
Priority to JP2011270381A
Publication of JP2013122508A
Legal status: Pending

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that uses a learning speech signal and its correct-class information to build a discriminative feature transformation optimized for speech recognition, and that suppresses the influence of acoustic noise by applying this transformation to a recognition speech signal and extracting acoustic features for recognition.

SOLUTION: A learning unit 10 includes a class identification transformation construction unit 11, which takes as inputs the vector sequence into which a speech input unit 1 converts the learning speech signal and the correct-class information of that signal, and computes a class identification transformation; and a dimension compression / linear orthogonal transformation construction unit 12, which computes a dimension compression / linear orthogonal transformation from the vector sequence, the correct-class information, and the class identification transformation. An application unit 20 includes a feature extraction unit 21, which extracts acoustic features from the vector sequence into which the speech input unit 1 converts a recognition speech signal, using the class identification transformation and the dimension compression / linear orthogonal transformation; and a speech recognition unit 2, which performs speech recognition using the acoustic features and outputs a recognition result.

Description

The present invention relates to a speech recognition device that converts acoustic features into text, and to programs therefor.

Speech recognition is a technique that extracts acoustic features from an input speech signal and converts them into text by pattern matching or a hidden Markov model (HMM). Within this process, acoustic feature extraction converts the input speech signal, through signal processing and signal analysis, into a time series of vectors, i.e., features. Practical speech recognition systems most often use acoustic features called Mel-Frequency Cepstrum Coefficients (MFCCs), extracted by acoustic analysis.
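As a point of reference (not part of the patent itself), MFCCs of the kind described here can be computed with the librosa library; the file name, sampling rate, and frame parameters below are illustrative assumptions:

    # Hedged sketch: MFCC extraction with librosa; parameter values are assumptions.
    import librosa

    signal, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)  # 25 ms frames, 10 ms shift
    print(mfcc.shape)  # (13, number_of_frames)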

A known problem of existing speech recognition is its low recognition performance under noise, for example in real environments. As summarized in Non-Patent Document 1, techniques commonly used to improve recognition performance include spectral subtraction and cepstral mean normalization, which reduce noise during acoustic feature extraction, and maximum a posteriori estimation and maximum likelihood linear regression, which adapt a hidden Markov model to noise.
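Of the techniques just listed, cepstral mean normalization is simple enough to show inline: it subtracts the per-coefficient time average, removing stationary channel distortion. A minimal sketch, assuming the features arrive as a (frames x coefficients) array:

    import numpy as np

    def cepstral_mean_normalization(features):
        """features: (num_frames, num_coefficients) array of cepstral vectors."""
        return features - features.mean(axis=0, keepdims=True)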

Another way to improve recognition performance is to replace methods based mainly on acoustic analysis and signal processing with discriminative acoustic feature extraction. For example, Non-Patent Documents 2 and 3 obtain the acoustic features extracted from a speech signal by optimization under a maximum mutual information criterion or a minimum phone error criterion.

A genetic algorithm, on the other hand, represents candidate solutions as genes, generates a population of individuals carrying those genes, and searches for the optimal solution by applying operations such as selection, crossover, and mutation according to fitness. Non-Patent Document 4 uses a genetic algorithm, guided by the output likelihood of a hidden Markov model, to improve the extraction of Mel-scale cepstrum coefficients.

As another prior example of applying genetic algorithms to speech processing, Patent Document 1 uses a genetic algorithm to determine control parameters, such as acoustic-tube cross-sectional areas and current values, for generating speech mechanically and electrically as a method of speech analysis.

Patent Document 1: JP H07-114397 A

Non-Patent Document 1: "Report on the Patent Application Technology Trend Survey: Speech Recognition Technology," Japan Patent Office, 2003.
Non-Patent Document 2: D. Povey et al., "fMPE: Discriminatively trained features for speech recognition," Proc. ICASSP 2005, pp. 961-964, 2005.
Non-Patent Document 3: D. Povey et al., "Boosted MMI for model and feature-space discriminative training," Proc. ICASSP 2008, pp. 4057-4060, 2008.
Non-Patent Document 4: Z. Behzad et al., "Discriminative transformation for speech features based on genetic algorithm and HMM likelihoods," IEICE Electronics Express, vol. 7, no. 4, pp. 247-253, 2010.

Conventional speech recognition suffers from a marked drop in recognition performance in noisy environments. One cause is that the extraction of Mel-scale cepstrum coefficients, the standard acoustic feature, is a signal-processing-based procedure and is not necessarily the feature extraction method best suited to speech recognition. Various attempts have been made to improve MFCC extraction, but the speech recognition performance obtained remains insufficient.

An object of the present invention is to improve the performance of speech recognition under noise by extracting discriminative acoustic features that differ from Mel-scale cepstrum coefficients, are optimized for speech recognition, and take the noise environment into account.

Another object of the present invention is to provide a program that causes a computer to extract discriminative acoustic features optimized for speech recognition with the noise environment taken into account, and to operate as a speech recognition device using those features.

The present invention is characterized by providing a speech recognition device that extracts acoustic features optimized for speech recognition.

To achieve the above object, the invention of claim 1 is a speech recognition device comprising: a speech input unit that receives a speaker's speech signal, converts it into a digital signal, and converts the digital signal into a vector sequence for output; a class identification transformation construction unit that receives the vector sequence of a learning speech signal output by the speech input unit and the correct-class information of the learning speech signal, and outputs a class identification transformation; a dimension compression / linear orthogonal transformation construction unit that receives the vector sequence of the learning speech signal, the correct-class information of the learning speech signal, and the class identification transformation, and outputs a dimension compression / linear orthogonal transformation; a feature extraction unit that receives the vector sequence of a recognition speech signal output by the speech input unit, the class identification transformation, and the dimension compression / linear orthogonal transformation, and extracts an acoustic feature; and a speech recognition unit that performs speech recognition on the recognition speech signal based on the acoustic feature generated by the feature extraction unit.

The invention of claim 2 is a speech recognition device comprising: a speech input unit that receives a speaker's speech signal, converts it into a digital signal, and converts the digital signal into a vector sequence for output; a feature transformation construction unit that receives the vector sequence of a learning speech signal output by the speech input unit and the correct-class information of the learning speech signal, and outputs a feature transformation obtained by a genetic algorithm; a feature extraction unit that receives the vector sequence of a recognition speech signal output by the speech input unit and the feature transformation output by the feature transformation construction unit, and extracts an acoustic feature; and a speech recognition unit that performs speech recognition on the recognition speech signal based on the acoustic feature generated by the feature extraction unit.

The invention of claim 3 is a program for causing a computer to perform speech recognition, characterized in that the speech recognition device of claim 1 or claim 2 is realized by the computer.

According to the invention of claim 1, acoustic feature extraction with a learning mechanism performs a feature transformation optimized for speech recognition and extracts discriminative acoustic features. The invention of claim 1 first generates a class identification transformation that, for a vector taken from the vector sequence, returns an intermediate vector composed of information on whether the vector belongs to each class to be identified (such as a phoneme); second, it generates a dimension compression / linear orthogonal transformation that performs dimension compression and linear orthogonalization on the intermediate vector obtained by the class identification transformation. The invention of claim 1 thus extracts features with high recognition performance.

According to the invention of claim 2, acoustic feature extraction with a learning mechanism likewise performs a feature transformation optimized for speech recognition and extracts discriminative acoustic features. In the invention of claim 2, the transformation from the vector sequence to the acoustic features is generated using a genetic algorithm. The invention of claim 2 thus also extracts features with high recognition performance.

According to the invention of claim 3, by executing the program, a computer can easily be realized as the speech recognition device of claim 1 or claim 2.

As described above, the speech recognition device according to the present invention uses acoustic feature extraction with a learning mechanism to extract acoustic features that are optimized for speech recognition and robust in noisy environments, achieving high speech recognition performance not only in quiet environments but also under noise.

FIG. 1: Functional block diagram of an embodiment of the speech recognition device of claim 1.
FIG. 2: Functional block diagram of an embodiment of the speech recognition device of claim 2.
FIG. 3: Illustration of the conversion from a digitized speech signal to a frequency vector sequence.
FIG. 4: Performance comparison by noise type between an embodiment of the speech recognition device of claim 1 and a conventional speech recognition device.
FIG. 5: Performance comparison by noise intensity between an embodiment of the speech recognition device of claim 1 and a conventional speech recognition device.

An embodiment of a speech recognition device embodying the present invention is described below with reference to FIGS. 1 and 2.

FIG. 1 shows an embodiment of the invention of claim 1.
The speech input unit 1 receives the speaker's speech, samples the input speech signal in accordance with the sampling theorem, quantizes it with an appropriate quantization step, and converts it into a digital signal. Next, as shown in FIG. 3, the digital signal is cut into segments of fixed length at fixed intervals to extract speech frames. For each speech frame, frequency information is obtained by a Fourier transform and aggregated into a vector by a mel filter bank, yielding a frequency vector. Arranging these frequency vectors in time order over the whole digitized signal produces a vector sequence (hereinafter, a frequency vector sequence).
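A minimal sketch of this front end in NumPy follows; the sampling rate, frame length, frame shift, window, and number of mel bands are assumptions, since the patent does not fix them:

    import numpy as np
    import librosa  # used here only for its mel filter bank

    def frequency_vector_sequence(signal, sr=16000, frame_len=400,
                                  frame_shift=160, n_mels=24):
        """Convert a digitized speech signal into a (T, N) frequency vector sequence."""
        frames = []
        for start in range(0, len(signal) - frame_len + 1, frame_shift):
            frame = signal[start:start + frame_len] * np.hamming(frame_len)
            frames.append(np.abs(np.fft.rfft(frame)))   # per-frame frequency information
        mel_fb = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_mels)
        return np.stack(frames) @ mel_fb.T              # one N-dimensional vector per frame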

The learning unit 10 uses the frequency vector sequence obtained by feeding the learning speech signal to the speech input unit 1, together with the correct-class information of the learning speech signal, to compute the class identification transformation and the dimension compression / linear orthogonal transformation. The application unit 20 uses the transformations obtained by the learning unit 10 to convert the frequency vector sequence obtained by feeding the recognition speech signal to the speech input unit 1 into acoustic features, performs speech recognition, and outputs the recognition result.

The class identification transformation construction unit 11 first constructs, for each class, a classifier that decides whether a frequency vector belongs to that class; for example, "a classifier that decides whether a given frequency vector belongs to phoneme a" and "a classifier that decides whether a given frequency vector belongs to phoneme b".

An example configuration of a classifier built with a genetic algorithm in the class identification transformation construction unit 11 is described below.
For a class c, a classifier f that indicates whether an N-dimensional frequency vector x belongs to c is expressed by the following equation. The classifier returns a positive value if the frequency vector x belongs to c, and a negative value if it does not.

f(\mathbf{x}) = a_0^{(c)} + \sum_{i=1}^{N} a_i^{(c)} x_i

The classifier parameters a(c) of the classifier f are determined using the learning speech signal and its correct-class information: a(c) is chosen so that, over the frequency vector sequence obtained by feeding the learning speech signal to the speech input unit 1, the classifier returns the correct decision for as many vectors as possible.

Concretely, the classifier parameters a(c) are determined by a genetic algorithm as follows. A population of individuals whose genes are the N+1 coefficients of a(c) is generated to form the initial generation; each individual is initialized randomly.

For each individual v of the current generation, the fitness E(v) is computed by the following equation, where a is the classifier parameter vector obtained from the genes of v, r_k is the k-th frequency vector of the frequency vector sequence obtained by feeding the learning speech signal to the speech input unit 1, and l_k is the correct-class label of the k-th frequency vector, equal to 1 if r_k belongs to class c and -1 otherwise. K is the total number of frequency vectors obtained from the learning speech signal, and sgn is the sign function.

E(v) = \sum_{k=1}^{K} l_k \,\mathrm{sgn}\big(f(\mathbf{r}_k)\big)

Based on this fitness, the basic operations of the genetic algorithm (elitist selection, inheritance, crossover, and mutation) are applied to create the next generation from the current one (hereinafter, creating the next generation from the current generation is called a generation change).

Generation changes are repeated a fixed number of times from the initial generation to produce the final generation. The individual with the highest fitness is taken from the final generation, and its genes are decoded to obtain the optimized classifier parameters a(c).
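A compact sketch of this per-class genetic algorithm follows; the population size, mutation scale, and generation count are arbitrary choices, as the patent does not specify them:

    import numpy as np

    rng = np.random.default_rng(0)

    def fitness(a, R, labels):
        """E(v): agreement between sgn(f(r_k)) and the labels l_k in {+1, -1}."""
        scores = a[0] + R @ a[1:]                 # f(x) = a0 + sum_i a_i x_i
        return float(np.sum(labels * np.sign(scores)))

    def evolve_classifier(R, labels, pop=50, gens=100, sigma=0.1):
        """R: (K, N) learning frequency vectors; returns the optimized a(c)."""
        n = R.shape[1] + 1                        # N+1 coefficients per gene
        population = rng.standard_normal((pop, n))
        for _ in range(gens):
            fit = np.array([fitness(a, R, labels) for a in population])
            elite = population[np.argsort(fit)[::-1][:pop // 2]]   # elitist selection
            pairs = rng.integers(0, len(elite), (pop - len(elite), 2))
            mask = rng.random((pop - len(elite), n)) < 0.5
            children = np.where(mask, elite[pairs[:, 0]], elite[pairs[:, 1]])  # crossover
            children += sigma * rng.standard_normal(children.shape)           # mutation
            population = np.vstack([elite, children])
        fit = np.array([fitness(a, R, labels) for a in population])
        return population[np.argmax(fit)]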

Note that the present invention does not preclude using a support vector machine or the like to determine the classifier parameters a(c). Although a linear classifier is shown as an example, the invention also does not preclude using a nonlinear classifier.

Second, the class identification transformation construction unit 11 constructs such a classifier for every class. Expressing the classifier parameters a(c) of class c as a row vector, all the row vectors are stacked to form the class identification transformation matrix. The output of the class identification transformation construction unit 11 is thus a class identification transformation matrix with C rows and N+1 columns, where C is the number of classes to identify.
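Continuing the sketch above, stacking one evolved parameter vector per class yields the matrix A; the per-class label arrays here are hypothetical names, assumed to be prepared from the correct-class information:

    # One GA run per class; labels_for_class[c] holds the ±1 labels for class c.
    A = np.vstack([evolve_classifier(R, labels_for_class[c]) for c in range(C)])
    # A has shape (C, N + 1): the class identification transformation matrix.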

The dimension compression / linear orthogonal transformation construction unit 12 performs dimension compression and linear orthogonalization on the class identification transformation obtained by the class identification transformation construction unit 11.

An example configuration of the dimension compression / linear orthogonalization using a genetic algorithm in the dimension compression / linear orthogonal transformation construction unit 12 is described below.
In the frequency vector sequence obtained by feeding the learning speech signal to the speech input unit 1, each frequency vector is extended to x' and multiplied by the class identification transformation matrix A as in the following equation, giving a C-dimensional vector y (hereinafter, an intermediate vector).

\mathbf{y} = A\,\mathbf{x}', \qquad \mathbf{x}' = [\,1,\ x_1,\ \ldots,\ x_N\,]^{\top}

For the intermediate vector y, the first dimension g1 of the dimension compression / linear orthogonal transformation is computed by the following equation.

g_1 = \mathbf{b}^{(1)\top}\mathbf{y} = \sum_{j=1}^{C} b_j^{(1)} y_j

The parameter b(1) of g1 is determined using the learning speech signal and its correct-class information: over the frequency vector sequence obtained from the learning speech signal, the mean intermediate vector y is computed for each class, the value of g1 is evaluated at each class mean, and b(1) is chosen so that the variance of these values is maximized.

Concretely, b(1) is determined by a genetic algorithm as follows. A population of individuals whose genes are the C coefficients of b(1) is generated to form the initial generation; each individual is initialized randomly.

For each individual v of the current generation, the fitness E(v) is computed by the following equation, where b is the parameter vector obtained from the genes of v and var is the variance function.

E(v) = \mathrm{var}\big(\mathbf{b}^{\top}\bar{\mathbf{y}}^{(1)},\ \ldots,\ \mathbf{b}^{\top}\bar{\mathbf{y}}^{(C)}\big), \quad \text{where } \bar{\mathbf{y}}^{(c)} \text{ is the mean intermediate vector of class } c

Based on this fitness, the basic operations of the genetic algorithm are applied to create the next generation from the current one (generation change).

Generation changes are repeated a fixed number of times from the initial generation to produce the final generation. The individual with the highest fitness is taken from the final generation, and its genes are decoded to obtain the optimized parameter b(1).

The second dimension g2 of the dimension compression / linear orthogonal transformation is computed by the following equation.

g_2 = \mathbf{b}^{(2)\top}\mathbf{y}

The parameter b(2) of g2 is determined by a genetic algorithm in the same way as the parameter b(1) of g1, except that b(2) is sought under the constraint that its inner product with b(1) is zero, i.e., that the following equation holds.

\mathbf{b}^{(1)\top}\mathbf{b}^{(2)} = 0

Likewise, the m-th dimension gm of the dimension compression / linear orthogonal transformation is obtained by evaluating gm at the class mean vectors and choosing b(m) to maximize the variance, under the constraint that the inner products of b(m) with each of b(1), b(2), ..., b(m-1) are all zero.
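One way to honor these orthogonality constraints inside the genetic algorithm, sketched below, is a Gram-Schmidt-style projection applied to every gene after mutation; the patent does not prescribe how the constraint is enforced, so this is an assumption:

    import numpy as np

    def project_out(b, previous):
        """Remove from b its components along the already-found directions."""
        for p in previous:
            b = b - (b @ p) / (p @ p) * p
        return b

    def variance_fitness(b, class_means):
        """E(v): variance of b·y-bar(c) over the per-class mean intermediate vectors."""
        return float(np.var(class_means @ b))

    # Inside the GA loop of the earlier sketch, one would apply
    #   population = np.array([project_out(b, found_directions) for b in population])
    # after mutation, then rank individuals with variance_fitness.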

Note that the present invention does not preclude using principal component analysis or the like to determine the parameters b(1) and b(2) through b(m).
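For comparison, that alternative is compact with scikit-learn: PCA fitted to the per-class mean intermediate vectors yields mutually orthogonal directions of maximal class-mean variance (the array name and the output dimensionality M are assumptions):

    import numpy as np
    from sklearn.decomposition import PCA

    # class_means: (C, C) array whose rows are the per-class mean intermediate vectors.
    M = 12                       # illustrative output dimensionality, M <= C
    pca = PCA(n_components=M).fit(class_means)
    B = pca.components_          # (M, C): rows play the role of b(1) ... b(M)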

The output of the dimension compression / linear orthogonal transformation construction unit 12 is a dimension compression / linear orthogonal transformation matrix with M rows and C columns, where M is the output dimensionality of the transformation.

When performing speech recognition, the recognition speech signal is fed to the speech input unit 1 and converted into a frequency vector sequence.

The feature extraction unit 21 converts the frequency vector sequence into acoustic features using the class identification transformation obtained by the class identification transformation construction unit 11 and the dimension compression / linear orthogonal transformation obtained by the construction unit 12. In the frequency vector sequence obtained by feeding the recognition speech signal to the speech input unit 1, each frequency vector is extended to x' and multiplied by the class identification transformation matrix A and the dimension compression / linear orthogonal transformation matrix B as in the following equation, giving the M-dimensional acoustic feature z.

\mathbf{z} = B\,A\,\mathbf{x}'
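Applied over a whole utterance this is two matrix products per frame; a sketch reusing the shapes established above:

    import numpy as np

    def extract_features(X, A, B):
        """X: (T, N) frequency vectors; A: (C, N+1); B: (M, C) -> (T, M) features z."""
        X_ext = np.hstack([np.ones((X.shape[0], 1)), X])  # extended vectors x' = [1, x]
        return X_ext @ A.T @ B.T                          # z = B A x' for every frame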

The speech recognition unit 2 performs speech recognition using the acoustic features output by the feature extraction unit 21. A hidden Markov model is used as the model, the acoustic features are matched against the model by the Viterbi algorithm, and the word hypothesis with the highest likelihood is output as the recognition result.
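For orientation, a minimal log-domain Viterbi pass over per-frame state log-likelihoods is sketched below; how those likelihoods are produced from z, and the HMM topology, are assumptions outside the patent's text:

    import numpy as np

    def viterbi(log_emissions, log_trans, log_init):
        """log_emissions: (T, S); log_trans: (S, S); log_init: (S,).
        Returns the most likely state path and its log-likelihood."""
        T, S = log_emissions.shape
        delta = log_init + log_emissions[0]
        backptr = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans           # predecessor i -> state j
            backptr[t] = np.argmax(scores, axis=0)
            delta = scores[backptr[t], np.arange(S)] + log_emissions[t]
        path = [int(np.argmax(delta))]
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[t, path[-1]]))
        return path[::-1], float(np.max(delta))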

FIG. 2 shows an embodiment of the invention of claim 2.
The speech input unit 1 receives the speaker's speech, samples the input speech signal in accordance with the sampling theorem, quantizes it with an appropriate quantization step, and converts it into a digital signal. Next, as shown in FIG. 3, the digital signal is cut into segments of fixed length at fixed intervals to extract speech frames. For each speech frame, frequency information is obtained by a Fourier transform and aggregated into a vector by a mel filter bank, yielding a frequency vector. Arranging these frequency vectors in time order over the whole digitized signal produces a vector sequence (hereinafter, a frequency vector sequence).

The learning unit 10 computes a feature transformation using the frequency vector sequence obtained by feeding the learning speech signal to the speech input unit 1 and the correct-class information of the learning speech signal. The application unit 20 uses the feature transformation obtained by the learning unit 10 to convert the frequency vector sequence obtained by feeding the recognition speech signal to the speech input unit 1 into acoustic features, performs speech recognition, and outputs the recognition result.

The feature transformation construction unit 16 generates, by a genetic algorithm, a feature transformation that converts a frequency vector sequence into acoustic features.

An example configuration of the feature transformation using a genetic algorithm in the feature transformation construction unit 16 is described below.
In the frequency vector sequence obtained by feeding the learning speech signal to the speech input unit 1, the first dimension h1 of the feature transformation is computed for each (N-dimensional) frequency vector x of the sequence by the following equation.

h_1 = \mathbf{d}^{(1)\top}\mathbf{x} = \sum_{i=1}^{N} d_i^{(1)} x_i

The parameter d(1) of h1 is determined using the learning speech signal and its correct-class information: over the frequency vector sequence obtained from the learning speech signal, the mean vector is computed for each class, the value of h1 is evaluated at each class mean, and d(1) is chosen so that the variance of these values is maximized.

Concretely, d(1) is determined by a genetic algorithm as follows. A population of individuals whose genes are the N coefficients of d(1) is generated to form the initial generation; each individual is initialized randomly.

For each individual v of the current generation, the fitness E(v) is computed by the following equation, where d is the parameter vector obtained from the genes of v and var is the variance function.

E(v) = \mathrm{var}\big(\mathbf{d}^{\top}\bar{\mathbf{x}}^{(1)},\ \ldots,\ \mathbf{d}^{\top}\bar{\mathbf{x}}^{(C)}\big), \quad \text{where } \bar{\mathbf{x}}^{(c)} \text{ is the mean frequency vector of class } c

Based on this fitness, the basic operations of the genetic algorithm are applied to create the next generation from the current one (generation change).

Generation changes are repeated a fixed number of times from the initial generation to produce the final generation. The individual with the highest fitness is taken from the final generation, and its genes are decoded to obtain the optimized parameter d(1).

The second dimension h2 of the feature transformation is computed by the following equation.

h_2 = \mathbf{d}^{(2)\top}\mathbf{x}

The parameter d(2) of h2 is determined by a genetic algorithm in the same way as the parameter d(1) of h1, except that d(2) is sought under the constraint that its inner product with d(1) is zero, i.e., that the following equation holds.

\mathbf{d}^{(1)\top}\mathbf{d}^{(2)} = 0

Likewise, the m-th dimension hm of the feature transformation is obtained by evaluating hm at the class mean vectors and choosing d(m) to maximize the variance, under the constraint that the inner products of d(m) with each of d(1), d(2), ..., d(m-1) are all zero.

The output of the feature transformation construction unit 16 is a feature transformation matrix with M rows and N columns, where M is the output dimensionality of the feature transformation.

When performing speech recognition, the recognition speech signal is fed to the speech input unit 1 and converted into a frequency vector sequence.

The feature extraction unit 26 converts the frequency vector sequence into acoustic features using the feature transformation obtained by the feature transformation construction unit 16. In the frequency vector sequence obtained by feeding the recognition speech signal to the speech input unit 1, each frequency vector x is multiplied by the feature transformation matrix D as in the following equation, giving the M-dimensional acoustic feature z.

\mathbf{z} = D\,\mathbf{x}

The speech recognition unit 2 performs speech recognition using the acoustic features output by the feature extraction unit 26. A hidden Markov model is used as the model, the acoustic features are matched against the model by the Viterbi algorithm, and the word hypothesis with the highest likelihood is output as the recognition result.

The embodiments of the present invention are not limited to those described above; they may be modified without departing from the spirit of the invention.

FIG. 4 compares the speech recognition accuracy of the speech recognition device according to the embodiment of the claim 1 invention shown in FIG. 1 with that of a conventional speech recognition device comprising the speech input unit 1, a feature extraction unit generating Mel-scale cepstrum coefficients, and the speech recognition unit 2, evaluated on eight different noise types using CENSREC-1, the common evaluation corpus for speech recognition in noisy environments.

FIG. 5 compares the speech recognition accuracy of the same two devices, evaluated at seven different noise intensities using the CENSREC-1 corpus.

The speech recognition experiments shown in FIGS. 4 and 5 were conducted under the same conditions as the reference method (baseline) supplied with the CENSREC-1 corpus.

FIGS. 4 and 5 show that the speech recognition device according to the embodiment of the claim 1 invention shown in FIG. 1 improves speech recognition performance significantly over the conventional device, confirming that the object of the present invention, improved speech recognition performance under noise, is achieved.

DESCRIPTION OF SYMBOLS
1 ... Speech input unit
2 ... Speech recognition unit
10 ... Learning unit
11 ... Class identification transformation construction unit
12 ... Dimension compression / linear orthogonal transformation construction unit
16 ... Feature transformation construction unit
20 ... Application unit
21 ... Feature extraction unit
26 ... Feature extraction unit

Claims (3)

1. A speech recognition device comprising:
a speech input unit that receives a speech signal of a speaker, converts it into a digital signal, converts the digital signal into a vector sequence, and outputs the vector sequence;
a class identification transformation construction unit that receives, as inputs, the vector sequence of a learning speech signal output by the speech input unit and correct-class information of the learning speech signal, and outputs a class identification transformation;
a dimension compression / linear orthogonal transformation construction unit that receives, as inputs, the vector sequence of the learning speech signal output by the speech input unit, the correct-class information of the learning speech signal, and the class identification transformation, and outputs a dimension compression / linear orthogonal transformation;
a feature extraction unit that receives, as inputs, the vector sequence of a recognition speech signal output by the speech input unit, the class identification transformation, and the dimension compression / linear orthogonal transformation, and extracts an acoustic feature; and
a speech recognition unit that performs speech recognition on the recognition speech signal based on the acoustic feature generated by the feature extraction unit.
2. A speech recognition device comprising:
a speech input unit that receives a speech signal of a speaker, converts it into a digital signal, converts the digital signal into a vector sequence, and outputs the vector sequence;
a feature transformation construction unit that receives, as inputs, the vector sequence of a learning speech signal output by the speech input unit and correct-class information of the learning speech signal, and outputs a feature transformation obtained by a genetic algorithm;
a feature extraction unit that receives, as inputs, the vector sequence of a recognition speech signal output by the speech input unit and the feature transformation output by the feature transformation construction unit, and extracts an acoustic feature; and
a speech recognition unit that performs speech recognition on the recognition speech signal based on the acoustic feature generated by the feature extraction unit.
3. A program for causing a computer to perform speech recognition, wherein the program realizes, on the computer, the speech recognition device according to claim 1 or claim 2.
JP2011270381A 2011-12-09 2011-12-09 Voice recognition device and program Pending JP2013122508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2011270381A JP2013122508A (en) 2011-12-09 2011-12-09 Voice recognition device and program


Publications (1)

Publication Number Publication Date
JP2013122508A true JP2013122508A (en) 2013-06-20

Family

ID=48774503

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2011270381A Pending JP2013122508A (en) 2011-12-09 2011-12-09 Voice recognition device and program

Country Status (1)

Country Link
JP (1) JP2013122508A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020095707A1 (en) * 2018-11-08 2020-05-14 Nippon Telegraph And Telephone Corporation Optimization device, optimization method, and program
JP2020076874A (en) * 2018-11-08 2020-05-21 Nippon Telegraph And Telephone Corporation Optimization device, optimization method, and program
US20220005471A1 (en) * 2018-11-08 2022-01-06 Nippon Telegraph And Telephone Corporation Optimization apparatus, optimization method, and program
JP7167640B2 (en) 2018-11-08 2022-11-09 Nippon Telegraph And Telephone Corporation Optimization device, optimization method, and program
