WO2017130387A1 - Speech recognition device - Google Patents

Speech recognition device

Info

Publication number
WO2017130387A1
WO2017130387A1 (PCT/JP2016/052724)
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
test data
base matrix
acoustic model
data
Prior art date
Application number
PCT/JP2016/052724
Other languages
French (fr)
Japanese (ja)
Inventor
裕紀 金川
勇気 太刀岡
Original Assignee
三菱電機株式会社 (Mitsubishi Electric Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 三菱電機株式会社 (Mitsubishi Electric Corporation)
Priority to PCT/JP2016/052724 priority Critical patent/WO2017130387A1/en
Priority to JP2016541466A priority patent/JP6054004B1/en
Priority to TW105115458A priority patent/TW201727620A/en
Publication of WO2017130387A1 publication Critical patent/WO2017130387A1/en

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 — Adaptation
    • G10L 15/07 — Adaptation to the speaker


Abstract

A basis matrix contribution calculation means (103) calculates the contribution (107) of each basis matrix using the basis matrices (106). A basis matrix weight application means (202) generates a transformation matrix (207) by weighting the basis matrices using the basis matrix weights (206), the basis matrix contributions (107), and the basis matrices (106). A matrix application means (203) for feature data uses the transformation matrix (207) to turn the test data (205) into converted test data (208). A decoding means (204) performs speech recognition by matching the converted test data (208) against an acoustic model (105).

Description

Speech recognition device
The present invention relates to a speech recognition apparatus that converts acoustic feature quantities using basis matrices and a transformation matrix, as part of a technique for adapting feature quantities to match an acoustic model.
In speech recognition technology, many speaker adaptation techniques (feature-quantity adaptation methods) have been proposed with the aim of reducing the influence of the speaker, noise, the microphone, and other factors that cause the input speech signal to deviate from the acoustic model, which expresses context information such as phonemes as standard speech patterns.

Conventionally, the CMLLR (Constrained MLLR) method disclosed in Non-Patent Document 1 is known as such a feature-quantity adaptation method. CMLLR is a method that transforms the means and variances of the model parameters; since this transformation is equivalent to transforming the feature vectors, CMLLR amounts to obtaining a transformation matrix in the feature domain. Specifically, as in Equation (1), it finds an affine transformation matrix W that brings the D-dimensional acoustic feature quantity o_t computed from the input speech closer to the acoustic model, which is the standard pattern of the phonemes.

[Equation (1): rendered as an image in the original publication]
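Equation (1) is given only as an image in this publication, but the surrounding text describes a CMLLR-style affine transform of the D-dimensional feature vector o_t. The following is a minimal illustrative sketch, not the patent's own implementation; the extended form W = [A | b] is an assumption based on the usual CMLLR convention.

```python
import numpy as np

def apply_affine_transform(features, W):
    """Apply an affine feature transform o_hat_t = A @ o_t + b.

    features: (T, D) array of acoustic feature vectors o_t.
    W: (D, D+1) matrix [A | b] (assumed CMLLR-style layout).
    """
    A = W[:, :-1]               # (D, D) linear part
    b = W[:, -1]                # (D,) bias part
    return features @ A.T + b   # (T, D) transformed features

# Toy usage: the identity transform leaves the features unchanged.
T, D = 100, 39
feats = np.random.randn(T, D)
W_identity = np.hstack([np.eye(D), np.zeros((D, 1))])
assert np.allclose(apply_affine_transform(feats, W_identity), feats)
```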
However, in the feature-quantity adaptation method described in Non-Patent Document 1, the transformation matrix W is estimated from the adaptation data alone, and it is known that, when the amount of data is insufficient for estimating the transformation matrix, adaptation can actually degrade performance. The cause is overfitting: the amount of adaptation data is small relative to the number of parameters to be estimated. For example, when a 39-dimensional acoustic feature quantity is used, consisting of a 13-dimensional MFCC (Mel-Frequency Cepstrum Coefficient) vector concatenated with its dynamic features, the number of parameters to be estimated equals the number of elements of the transformation matrix and is as large as 39 × 40 = 1560.
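As an illustration of the 39-dimensional feature vector mentioned above (13 MFCCs concatenated with their dynamic features), the following sketch uses librosa; the file name is hypothetical and the analysis parameters are illustrative, since the patent does not prescribe them.

```python
import numpy as np
import librosa

# Load the utterance (path is hypothetical); 16 kHz is a common choice, not mandated here.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 static MFCCs per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, T)

# First- and second-order dynamic features (delta and delta-delta).
d1 = librosa.feature.delta(mfcc, order=1)
d2 = librosa.feature.delta(mfcc, order=2)

# Concatenate into a 39-dimensional feature vector per frame, as in the example above.
features = np.vstack([mfcc, d1, d2]).T               # (T, 39)
print(features.shape)
```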
To address this problem, the feature-quantity adaptation method described in Non-Patent Document 2 reduces the number of parameters to be estimated: instead of estimating the transformation matrix W directly from the adaptation data, it expresses W by weighting N basis matrices W_1:Nmax (n = 1, ..., N ≤ N_max), where N_max = D(D + 1). Specifically, as in Equation (2), the basis matrices W_n are weighted by weights d_n to obtain the transformation matrix W adapted to the speaker.

[Equation (2): rendered as an image in the original publication]

The basis matrices are obtained from the learning data; at adaptation time, only their weights d_n for the transformation matrix of the input speaker are estimated. The only parameters to be estimated in the adaptation step are the weights d_n, and according to Non-Patent Document 2, for 100 frames (= 1 second) of data the number of parameters to be estimated is only about 20, by Equation (3).

N = min(ηβ, N_max), where η = 0.2   (3)

This means that N is varied according to the number of input frames β, limiting the number of basis matrices that are used.
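The following sketch illustrates one reading of Equations (2) and (3) as described above: the transformation matrix is a weighted sum of the first N basis matrices, with N limited by the number of adaptation frames β (η = 0.2). It is illustrative only; the exact update used in Non-Patent Document 2 is not reproduced here.

```python
import numpy as np

def num_bases(beta, D, eta=0.2):
    """Equation (3): N = min(eta * beta, N_max), with N_max = D * (D + 1)."""
    n_max = D * (D + 1)
    return int(min(eta * beta, n_max))

def compose_transform(bases, d):
    """Equation (2): W = sum_n d_n * W_n over the first N = len(d) basis matrices."""
    return sum(d[n] * bases[n] for n in range(len(d)))

# Toy usage with random basis matrices of shape (D, D+1).
D, beta = 39, 100                                   # 100 frames, i.e. about 1 second of data
bases = [np.random.randn(D, D + 1) for _ in range(D * (D + 1))]
N = num_bases(beta, D)                              # about 20 for 100 frames, as stated in the text
d = np.random.randn(N)
W = compose_transform(bases, d)
print(N, W.shape)
```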
The implementation steps of the speech recognition apparatus described in Non-Patent Document 2 are broadly divided into two: a learning step that obtains the basis matrices W_1:Nmax from the learning data, and an adaptation step that obtains the transformation matrix W using the adaptation data (test data) and N of the basis matrices W_1:Nmax.

In the learning step, an acoustic model that is the standard pattern of the phonemes is first obtained from the learning data. An HMM (Hidden Markov Model) is used as the standard pattern. Conventionally used feature vectors such as filter bank coefficients, MFCC, and PLP (Perceptual Linear Predictive) coefficients can be used as the acoustic feature quantities of the learning data.

[Equation/description block: rendered as an image in the original publication]

Next, in the adaptation step, the weights of the basis matrices are first generated using the test data; these weights correspond to the d_n described above. The basis matrices are weighted with the obtained weights, and the transformation matrix W is obtained as the weighted matrix. To find the optimum W, the weights of the basis matrices and the weighted matrix are obtained iteratively according to Equation (4).

[Equation (4): rendered as an image in the original publication]

Finally, the converted test data are generated using the weighted matrix and the test data; this conversion can be performed using Equation (1). Speech recognition is performed by matching the obtained converted test data against the standard phoneme patterns expressed by the acoustic model, and a recognition result is obtained.
In the conventional speech recognition apparatus described above, the transformation matrix W is obtained in the adaptation step by weighting the basis matrices W_n in order, starting from those with the highest contribution, according to Equation (4). However, although the indices n of the basis matrices W_1:Nmax are assigned in descending order of contribution, Equation (4) does not take the contribution of each basis matrix into account; until the multiplication by d_n, all basis matrices are treated as having the same contribution. As a result, basis matrices with a low contribution can exert an undue influence, and the effect of adaptation may not be sufficiently obtained.
The present invention has been made to solve this problem, and an object of the present invention is to provide a speech recognition apparatus capable of improving the estimation accuracy of the transformation matrix at adaptation time and thereby improving speech recognition accuracy.
The speech recognition apparatus according to the present invention includes: an acoustic model calculation unit that calculates an acoustic model obtained by modeling the standard pattern of learning data using the acoustic feature quantities of the learning data; a basis matrix calculation unit that calculates basis matrices using the acoustic model and the learning data; a basis matrix contribution calculation unit that calculates the contribution of each basis matrix using the basis matrices; a basis matrix weight calculation unit that calculates the weights of the basis matrices using the acoustic feature quantities of test data, the acoustic model, and the basis matrices; a basis matrix weight application unit that generates a transformation matrix by weighting the basis matrices using the basis matrix weights, the basis matrix contributions, and the basis matrices; a matrix application unit for feature data that uses the transformation matrix to convert the test data into converted test data to be recognized against the acoustic model; and a decoding unit that performs speech recognition by matching the converted test data against the acoustic model.
The speech recognition apparatus according to the present invention calculates the contribution of each basis matrix and generates a transformation matrix by weighting the basis matrices using the basis matrix contributions, the basis matrix weights, and the basis matrices. This improves the estimation accuracy of the transformation matrix at adaptation time and improves speech recognition performance.
The drawings are as follows.
FIG. 1 is a configuration diagram showing the speech recognition apparatus according to Embodiment 1 of this invention.
FIG. 2 is a hardware configuration diagram of the speech recognition apparatus according to Embodiment 1.
FIG. 3 is a flowchart showing the flow of the learning step of the speech recognition apparatus according to Embodiment 1.
FIG. 4 is a flowchart showing the flow of the adaptation step of the speech recognition apparatus according to Embodiment 1.
FIG. 5 is a configuration diagram showing the speech recognition apparatus according to Embodiment 2 of this invention.
FIG. 6 is a flowchart showing the flow of the learning step of the speech recognition apparatus according to Embodiment 2.
FIG. 7 is a flowchart showing the flow of the adaptation step of the speech recognition apparatus according to Embodiment 2.
FIG. 8 is an explanatory diagram showing the processing performed by the basis matrix weight calculation unit of the speech recognition apparatus according to Embodiment 2.
Hereinafter, in order to explain the present invention in more detail, modes for carrying out the invention will be described with reference to the accompanying drawings.

Embodiment 1.
FIG. 1 is a configuration diagram of the speech recognition apparatus according to this embodiment.
As illustrated, the speech recognition apparatus according to this embodiment comprises a learning step execution unit 100 and an adaptation step execution unit 200. The learning step execution unit 100 includes an acoustic model calculation unit 101, a basis matrix calculation unit 102, and a basis matrix contribution calculation unit 103. The adaptation step execution unit 200 includes a basis matrix weight calculation unit 201, a basis matrix weight application unit 202, a matrix application unit 203 for feature data, and a decoding unit 204.
The acoustic model calculation unit 101 in the learning step execution unit 100 is a processing unit that calculates an acoustic model 105 obtained by modeling the standard pattern of the learning data 104 using the acoustic feature quantities of the learning data 104. The basis matrix calculation unit 102 is a processing unit that calculates basis matrices 106 using the acoustic model 105 calculated by the acoustic model calculation unit 101 and the learning data 104. The basis matrix contribution calculation unit 103 is a processing unit that calculates basis matrix contributions 107 using the basis matrices 106 calculated by the basis matrix calculation unit 102.
The basis matrix weight calculation unit 201 in the adaptation step execution unit 200 is a processing unit that calculates basis matrix weights 206 using the acoustic feature quantities of the test data 205, the acoustic model 105, and the basis matrices 106. The basis matrix weight application unit 202 is a processing unit that weights the basis matrices 106 using the basis matrix weights 206 calculated by the basis matrix weight calculation unit 201, the basis matrix contributions 107, and the basis matrices 106, and generates a transformation matrix 207 as the weighted matrix. The matrix application unit 203 for feature data is a processing unit that uses the transformation matrix 207 obtained by the basis matrix weight application unit 202 and the test data 205 to convert the test data 205 so as to be suitable for recognition against the acoustic model, generating converted test data 208. The decoding unit 204 is a processing unit that performs speech recognition by matching the converted test data 208 obtained by the matrix application unit 203 for feature data against the acoustic model 105, and outputs a recognition result 209. In FIG. 1, the arrow from the acoustic model 105 to the decoding unit 204 is omitted.
FIG. 2 is a hardware configuration diagram of the speech recognition apparatus according to Embodiment 1.
The speech recognition apparatus is realized using a computer and includes a processor 1, a memory 2, an input/output interface (input/output I/F) 3, and a bus 4. The processor 1 is a functional unit that performs the arithmetic processing of the computer. The memory 2 is a storage unit that stores various programs and calculation results and provides the work area used when the processor 1 performs arithmetic processing. The input/output interface 3 is an interface for inputting the learning data 104 and the test data 205 and for outputting the recognition result 209 to the outside. The bus 4 connects the processor 1, the memory 2, and the input/output interface 3 to one another.
The acoustic model calculation unit 101, the basis matrix calculation unit 102, the basis matrix contribution calculation unit 103, the basis matrix weight calculation unit 201, the basis matrix weight application unit 202, the matrix application unit 203 for feature data, and the decoding unit 204 shown in FIG. 1 are each realized by the processor 1 executing a program stored in the memory 2. The acoustic model 105, the basis matrices 106, the basis matrix weights 206, the transformation matrix 207, and the converted test data 208 are each stored in storage areas of the memory 2. A plurality of processors 1 and memories 2 may be provided, and the plurality of processors 1 and memories 2 may be configured to carry out the functions described above in cooperation.
Next, the operation of the speech recognition apparatus according to Embodiment 1 will be described.
First, the learning step performed by the learning step execution unit 100 will be described with reference to the flowchart of FIG. 3.
In the learning step, the acoustic model calculation unit 101 first creates, from the learning data 104, an acoustic model 105 that is the standard pattern of the phonemes (step ST1). Here, conventionally used feature vectors such as filter bank coefficients, MFCC (Mel Frequency Cepstrum Coefficient), and PLP (Perceptual Linear Predictive) coefficients can be used as the acoustic feature quantities.
[Equation/description block: rendered as an image in the original publication]

In addition, a contribution ω_n 107 corresponding to the index n of each basis matrix 106 is obtained from the basis matrices 106 using the basis matrix contribution calculation unit 103 (step ST3). The contribution 107 takes larger values for the sets n that have higher expressive power for the learning data.
As a concrete example of the basis matrix contribution 107, the singular values k_1:Nmax obtained when the basis matrices W_1:Nmax 106 are computed can be used. This is because a basis matrix whose index n corresponds to a large singular value contributes strongly to expressing the matrix M. Therefore, instead of the contribution calculation unit 103 computing the singular values k_1:Nmax again, the contribution 107 can equally be obtained by retaining the singular values k_1:Nmax calculated by the basis matrix calculation unit 102.
Furthermore, rather than using the singular values k_1:Nmax directly, the basis matrix contribution calculation unit 103 can apply a transformation function φ(·) to the singular values to obtain φ(k_n), which makes it possible to control the contribution assigned to each basis matrix W_n 106. A sigmoid function or the like can be used as the transformation function.
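The following sketch illustrates this contribution calculation: the contributions ω_n are taken from the singular values obtained when the basis matrices are computed, optionally passed through a transformation function such as a sigmoid. The statistics matrix M and the centering inside the sigmoid are assumptions made for illustration; the patent does not give the exact formula.

```python
import numpy as np

def basis_contributions(singular_values, transform="sigmoid", scale=1.0):
    """Turn the singular values k_1..k_Nmax into contributions omega_1..omega_Nmax.

    transform="none" uses the singular values directly;
    transform="sigmoid" applies phi(k_n) so that the contributions can be controlled.
    """
    k = np.asarray(singular_values, dtype=float)
    if transform == "sigmoid":
        # Monotonic in k, so basis matrices with larger singular values keep larger contributions.
        return 1.0 / (1.0 + np.exp(-scale * (k - k.mean())))
    return k

# Toy usage: singular values from an SVD of some statistics matrix M (its contents are assumed).
M = np.random.randn(40, 40)
_, k, _ = np.linalg.svd(M)
omega = basis_contributions(k, transform="sigmoid")
print(omega[:5])
```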
Next, the adaptation step performed by the adaptation step execution unit 200 will be described with reference to the flowchart of FIG. 4.
In the adaptation step, the basis matrix weight calculation unit 201 first generates basis matrix weights d_n 206 from the test data 205, the acoustic model 105, and the basis matrices 106 (step ST11). Next, the basis matrix weight application unit 202 obtains the transformation matrix W 207 as a matrix weighted using the basis matrix weights 206 obtained in step ST11, the basis matrices 106, and the basis matrix contributions ω_1:Nmax 107 (step ST12). The basis matrix weights 206 and the transformation matrix 207 are obtained iteratively based on Equation (5).

[Equation (5): rendered as an image in the original publication]
That is, step ST11 and step ST12 are repeated iteratively; when the increase in likelihood falls below a threshold, or when a predetermined number of iterations has been performed, the process proceeds to the next step. Here, the likelihood is an index of how close the input speech is to the standard pattern of the acoustic model 105. By computing the difference in likelihood, the increase in likelihood since the transformation matrix was last estimated is obtained. If the likelihood difference is smaller than the set value, that is, if the increase in likelihood is smaller than the set value, the estimation process can be regarded as having converged, and it is judged that the estimation has been performed with high accuracy. On the other hand, if the likelihood difference is equal to or greater than the set value, that is, if the increase in likelihood is equal to or greater than the set value, it is judged that the estimation process has not converged. In this case, the basis matrix weights 206 are estimated again to obtain a more accurate transformation matrix 207.
In the present invention, the basis matrix weight application unit 202 multiplies each basis matrix W_n 106 by its contribution ω_n 107 when estimating the transformation matrix W 207. This makes it possible to take the basis matrix contributions into account, and an improvement in the estimation accuracy of the transformation matrix W 207 can be expected.
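A minimal sketch of the adaptation loop of steps ST11–ST12 with this contribution weighting: the transform is rebuilt as a contribution- and weight-scaled sum of basis matrices (one plausible reading of Equation (5), which appears only as an image here), and iteration stops when the likelihood gain falls below a threshold or after a fixed number of iterations. The estimate_weights and log_likelihood callables are placeholders for statistics that a real GMM/HMM toolkit would supply.

```python
import numpy as np

def adapt_transform(test_feats, bases, omega, estimate_weights, log_likelihood,
                    max_iters=10, tol=1e-3):
    """Iteratively estimate W = sum_n omega_n * d_n * W_n (assumed reading of Eq. (5)).

    estimate_weights(test_feats, W, bases, omega) -> d : re-estimates the basis weights (step ST11)
    log_likelihood(test_feats, W) -> float             : likelihood of the transformed data
    """
    D = test_feats.shape[1]
    W = np.hstack([np.eye(D), np.zeros((D, 1))])        # start from the identity transform
    prev_ll = log_likelihood(test_feats, W)
    for _ in range(max_iters):
        d = estimate_weights(test_feats, W, bases, omega)              # step ST11
        W = sum(omega[n] * d[n] * bases[n] for n in range(len(d)))     # step ST12
        ll = log_likelihood(test_feats, W)
        if ll - prev_ll < tol:                           # likelihood gain below threshold: converged
            break
        prev_ll = ll
    return W
```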
Finally, using the transformation matrix 207 and the test data 205, the matrix application unit 203 for feature data generates the converted test data 208 (step ST13); specifically, the conversion can be performed using Equation (1). The decoding unit 204 obtains a recognition result 209 by matching the obtained converted test data 208 against the standard phoneme patterns expressed by the acoustic model 105 (step ST14).
The decoding unit 204 performs speech recognition processing based on an HMM (Hidden Markov Model). In detail, as the output probability model of the HMM, a GMM-HMM using Gaussian mixture models (GMM, Gaussian Mixture Model) or an NN-HMM using a neural network (NN, Neural Network) can be used.
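For reference, the sketch below evaluates the log output probability of a single diagonal-covariance GMM state for one feature vector; it is a generic GMM computation, not the patent's decoder.

```python
import numpy as np

def gmm_log_prob(x, weights, means, variances):
    """Log output probability of a diagonal-covariance GMM state.

    x: (D,) feature vector; weights: (K,); means, variances: (K, D).
    """
    D = x.shape[0]
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_components = np.log(weights) + log_norm + log_exp
    m = log_components.max()
    return m + np.log(np.sum(np.exp(log_components - m)))   # log-sum-exp over the mixture

# Toy usage with a 2-component GMM over 39-dimensional features.
K, D = 2, 39
x = np.random.randn(D)
print(gmm_log_prob(x, np.array([0.5, 0.5]), np.random.randn(K, D), np.ones((K, D))))
```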
As described above, the speech recognition apparatus of Embodiment 1 includes: an acoustic model calculation unit that calculates an acoustic model obtained by modeling the standard pattern of learning data using the acoustic feature quantities of the learning data; a basis matrix calculation unit that calculates basis matrices using the acoustic model and the learning data; a basis matrix contribution calculation unit that calculates the contribution of each basis matrix using the basis matrices; a basis matrix weight calculation unit that calculates the weights of the basis matrices using the acoustic feature quantities of test data, the acoustic model, and the basis matrices; a basis matrix weight application unit that generates a transformation matrix by weighting the basis matrices using the basis matrix weights, the basis matrix contributions, and the basis matrices; a matrix application unit for feature data that uses the transformation matrix to convert the test data into converted test data to be recognized against the acoustic model; and a decoding unit that performs speech recognition by matching the converted test data against the acoustic model. Consequently, the influence of basis matrices with a high contribution can be increased while the influence of basis matrices with a low contribution is suppressed, improving the estimation accuracy of the transformation matrix at adaptation time and improving speech recognition performance.
Embodiment 2.
In Embodiment 2, the transformation matrix, and the basis matrices used for estimating the transformation matrix, are obtained for each class, such as each phoneme.
FIG. 5 is a configuration diagram of the speech recognition apparatus according to Embodiment 2. The speech recognition apparatus according to Embodiment 2 comprises a learning step execution unit 100a and an adaptation step execution unit 200a. The learning step execution unit 100a includes an acoustic model calculation unit 101a and a basis matrix calculation unit 102a. The adaptation step execution unit 200a includes a basis matrix weight calculation unit 201a, a basis matrix weight application unit 202a, a matrix application unit 203a for feature data, a decoding unit 204, an alignment calculation unit 210, and a data class classification unit 211.
The acoustic model calculation unit 101a in the learning step execution unit 100a is a processing unit that obtains an acoustic model 105a by modeling the standard pattern of the learning data 104a for each class, which has been clustered by class, using the acoustic feature quantities of the learning data 104a for each class. The basis matrix calculation unit 102a is a processing unit that calculates the basis matrices 106a for each class using the acoustic model 105a and the learning data 104a for each class.
The alignment calculation unit 210 in the adaptation step execution unit 200a is a processing unit that calculates an alignment 212 indicating the state sequence of the acoustic feature quantities of the test data 205. The data class classification unit 211 is a processing unit that classifies the test data 205 by class using the test data 205 and the alignment 212 and outputs the result as test data 213 for each class. The basis matrix weight calculation unit 201a is a processing unit that obtains the weights for the basis matrices 106a for each class using the test data 213 for each class, the acoustic model 105a, and the basis matrices 106a for each class, and outputs basis matrix weights 206a for each class. The basis matrix weight application unit 202a is a processing unit that generates transformation matrices 207a for each class by weighting, using the basis matrices 106a for each class and the basis matrix weights 206a for each class. The matrix application unit 203a for feature data is a processing unit that uses the test data 205, the alignment 212, and the transformation matrices 207a for each class to convert the test data 205 so as to be suitable for recognition against the acoustic model, generating converted test data 208a. The decoding unit 204 is a processing unit that performs speech recognition by matching the converted test data 208a against the acoustic model 105a and outputs a recognition result 209. In FIG. 5, the arrow from the acoustic model 105a to the decoding unit 204 is omitted. These processing units are realized by the processor shown in FIG. 2 executing programs stored in the memory.
Next, the operation of the speech recognition apparatus according to Embodiment 2 will be described.
First, the learning step performed by the learning step execution unit 100a will be described with reference to the flowchart of FIG. 6.
In the learning step, the learning data are classified in advance into C classes, for example by phoneme, and learning data 104a clustered for each class are prepared. The number of classes C and the way the classes are divided may be determined manually according to the phonemes, or may be determined by clustering using a decision tree or the K-means method. The acoustic model calculation unit 101a calculates the acoustic model 105a from the learning data 104a for each class (step ST101). Next, the learning data 104a for each class and the acoustic model 105a are input to the basis matrix calculation unit 102a to obtain the basis matrices 106a for each class (step ST102).
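The text above notes that the classes may be defined manually or by clustering, for example with the K-means method. The sketch below shows the clustering route; representing each phoneme of the learning data by the mean of its frames is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical phoneme set, each represented by the mean of its frames in the learning data.
phoneme_labels = ["sil", "a", "k", "i", "o", "u"]
phoneme_means = np.random.randn(len(phoneme_labels), 39)   # (num_phonemes, D)

C = 3                                                      # number of classes, decided in advance
kmeans = KMeans(n_clusters=C, n_init=10, random_state=0).fit(phoneme_means)

# Map each phoneme to one of the C classes; the learning data 104a are then split accordingly.
phoneme_to_class = dict(zip(phoneme_labels, kmeans.labels_))
print(phoneme_to_class)
```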
Next, the adaptation step performed by the adaptation step execution unit 200a will be described with reference to the flowchart of FIG. 7.
In the adaptation step, the alignment calculation unit 210 calculates the alignment 212 from the test data 205 (step ST201). Here, the alignment is an HMM state sequence and is used to associate the phoneme and class information corresponding to each time t of the test data. Next, the data class classification unit 211 classifies the test data 205 by class using the alignment 212 and generates the test data corresponding to class 1 through class C as the test data 213 for each class (step ST202). The basis matrix weight calculation unit 201a then calculates the basis matrix weights 206a for each class for the test data 213 for each class, using the acoustic model 105a and the basis matrices 106a for each class (step ST203). Furthermore, the basis matrix weight application unit 202a calculates the transformation matrices 207a for each class from the basis matrix weights 206a for each class, using the basis matrices 106a for each class (step ST204). Step ST203 and step ST204 are repeated iteratively, and when the increase in likelihood falls below a threshold, or when the predetermined number of iterations has been performed, the process proceeds to step ST205.
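A sketch of steps ST203–ST204: the basis matrix weights and the transformation matrix are estimated separately for each class, reusing the adapt_transform routine sketched for Embodiment 1. The dictionaries of per-class data, basis matrices, and contributions are assumed inputs.

```python
def adapt_per_class(class_feats, class_bases, class_omega,
                    estimate_weights, log_likelihood):
    """Estimate one transformation matrix per class (steps ST203-ST204, illustrative only).

    class_feats[c]: frames of the test data assigned to class c by the alignment;
    class_bases[c], class_omega[c]: basis matrices and contributions for class c.
    """
    transforms = {}
    for c, feats in class_feats.items():
        # adapt_transform is the Embodiment 1 sketch shown earlier.
        transforms[c] = adapt_transform(feats, class_bases[c], class_omega[c],
                                        estimate_weights, log_likelihood)
    return transforms
```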
FIG. 8 is an explanatory diagram showing the processing performed by the basis matrix weight calculation unit 201a. The acoustic feature sequence shown in FIG. 8 presents the continuously changing acoustic feature quantities of the test data in time series; o_t in the figure denotes the feature vector at time t.
The alignment shown in FIG. 8 shows the phoneme string "sil a k i" for the case where the user utters "aki". The phoneme string of "aki" is "a k i", and the silence at the beginning of the utterance is represented by "sil". The numbers shown in the alignment indicate HMM state numbers; that is, the alignment is the HMM state sequence corresponding to the acoustic feature sequence. Furthermore, the straight arrows in the alignment indicate transitions to the next state, and the curved arrows indicate self-transitions.
In the second embodiment, the alignment 212 associates a phoneme with the acoustic feature o_t at each time, and a basis matrix suited to transforming the features of that phoneme is used, so that weights for the basis matrices matched to the acoustic characteristics of the test data can be estimated.
Next, in step ST205, the feature data matrix application unit 203a calculates the transformed test data 208a using the transformation matrices 207a for each class obtained in step ST204, the test data 205, and the alignment 212. That is, the feature data matrix application unit 203a uses the class information obtained from the alignment 212 to associate the acoustic feature at each time with the transformation matrix 207a of the corresponding class, and multiplies the feature vector by that transformation matrix to generate the transformed test data 208a. The decoding unit 204 then performs speech recognition by matching the transformed test data 208a obtained in step ST205 against the acoustic model 105a, and obtains the recognition result 209 (step ST206).
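The per-frame application of the class-specific transform and the subsequent scoring can be sketched as follows. The structures transforms, frame_classes, and the diagonal-Gaussian scoring reuse the toy objects from the earlier sketches and are assumptions; the embodiment's decoding unit performs full HMM-based recognition rather than the single-Gaussian score shown here.

import numpy as np

def apply_class_transforms(test_feats, frame_classes, transforms):
    # Step ST205: for each frame o_t, pick the transform of its class c(t)
    # and compute o'_t = W_{c(t)} [o_t; 1].
    T, D = test_feats.shape
    transformed = np.empty((T, D))
    for t in range(T):
        w = transforms[frame_classes[t]]                  # D x (D+1)
        transformed[t] = w @ np.append(test_feats[t], 1.0)
    return transformed

def decode(transformed, acoustic_model):
    # Step ST206 (placeholder): score the transformed test data 208a against
    # the toy diagonal-Gaussian acoustic model instead of HMM decoding.
    diff = transformed - acoustic_model["mean"]
    return float(-0.5 * np.sum(diff ** 2 / acoustic_model["var"]))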
As described above, the speech recognition apparatus according to the second embodiment includes: an acoustic model calculation unit that calculates an acoustic model modeling the standard pattern of the learning data using the acoustic features of the clustered learning data; a basis matrix calculation unit that calculates a basis matrix for each class using the acoustic model and the learning data; an alignment calculation unit that calculates an alignment indicating the state sequence of the acoustic features of the test data; a data class classification unit that classifies the test data by class using the test data and the alignment; a basis matrix weight calculation unit that obtains weights for the basis matrix of each class using the test data for each class, the basis matrices, and the acoustic model; a basis matrix weight application unit that generates a transformation matrix for each class by weighting, using the basis matrix and the basis matrix weights of each class; a feature data matrix application unit that generates transformed test data for recognizing the test data against the acoustic model, using the test data, the alignment, and the transformation matrix for each class; and a decoding unit that performs speech recognition by matching the transformed test data against the acoustic model. The estimation accuracy of the transformation matrix at adaptation time is therefore improved, and speech recognition performance can be improved.
Within the scope of the present invention, the embodiments may be freely combined, any component of each embodiment may be modified, and any component of each embodiment may be omitted. For example, by combining the first and second embodiments and reflecting the basis matrix contribution degrees described in the first embodiment in the basis matrix weight application unit 202a of the second embodiment, the adaptation accuracy can be further improved.
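One way such a combination might look is sketched below: the class-specific basis weights of the second embodiment are scaled by the per-basis contribution degrees of the first embodiment before the weighted sum is formed. The paragraph above does not specify how the contribution enters the weighting, so this elementwise scaling is an assumption.

def build_transform_with_contribution(weights, contributions, bases):
    # Combination sketch: scale each basis weight d_n by its contribution
    # degree rho_n, then take the weighted sum of the basis matrices (numpy
    # arrays of shape D x (D+1)) to form the class transformation matrix.
    assert len(weights) == len(contributions) == len(bases)
    return sum(d_n * rho_n * w_n
               for d_n, rho_n, w_n in zip(weights, contributions, bases))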
As described above, the speech recognition apparatus according to the present invention enables robust speaker adaptation even with a small amount of data, and is therefore suitable for application to navigation devices, home appliances, and the like in order to improve speech recognition performance.
100, 100a learning step execution unit; 101, 101a acoustic model calculation unit; 102, 102a basis matrix calculation unit; 103 basis matrix contribution calculation unit; 104 learning data; 104a learning data for each class; 105, 105a acoustic model; 106 basis matrix; 106a basis matrix for each class; 107 contribution degree; 200, 200a adaptation step execution unit; 201, 201a basis matrix weight calculation unit; 202, 202a basis matrix weight application unit; 203, 203a feature data matrix application unit; 204 decoding unit; 205 test data; 206 basis matrix weights; 206a basis matrix weights for each class; 207 transformation matrix; 207a transformation matrix for each class; 208, 208a transformed test data; 209 recognition result; 210 alignment calculation unit; 211 data class classification unit; 212 alignment; 213 test data for each class.

Claims (2)

  1.  A speech recognition device comprising:
     an acoustic model calculation unit that calculates an acoustic model modeling a standard pattern of learning data using acoustic features of the learning data;
     a basis matrix calculation unit that calculates a basis matrix using the acoustic model and the learning data;
     a basis matrix contribution calculation unit that calculates a contribution degree of the basis matrix using the basis matrix;
     a basis matrix weight calculation unit that calculates weights for the basis matrix using acoustic features of test data, the acoustic model, and the basis matrix;
     a basis matrix weight application unit that generates a transformation matrix by weighting the basis matrix, using the basis matrix weights, the basis matrix contribution degree, and the basis matrix;
     a feature data matrix application unit that converts the test data into transformed test data for recognition against the acoustic model, using the transformation matrix; and
     a decoding unit that performs speech recognition by matching the transformed test data against the acoustic model.
  2.  A speech recognition device comprising:
     an acoustic model calculation unit that calculates an acoustic model modeling a standard pattern of clustered learning data using acoustic features of the learning data;
     a basis matrix calculation unit that calculates a basis matrix for each class using the acoustic model and the learning data;
     an alignment calculation unit that calculates an alignment indicating a state sequence of acoustic features of test data;
     a data class classification unit that classifies the test data by class using the test data and the alignment;
     a basis matrix weight calculation unit that obtains weights for the basis matrix of each class using the test data for each class, the basis matrices, and the acoustic model;
     a basis matrix weight application unit that generates a transformation matrix for each class by weighting, using the basis matrix for each class and the basis matrix weights for each class;
     a feature data matrix application unit that generates transformed test data for recognizing the test data against the acoustic model, using the test data, the alignment, and the transformation matrix for each class; and
     a decoding unit that performs speech recognition by matching the transformed test data against the acoustic model.
PCT/JP2016/052724 2016-01-29 2016-01-29 Speech recognition device WO2017130387A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/JP2016/052724 WO2017130387A1 (en) 2016-01-29 2016-01-29 Speech recognition device
JP2016541466A JP6054004B1 (en) 2016-01-29 2016-01-29 Voice recognition device
TW105115458A TW201727620A (en) 2016-01-29 2016-05-19 Speech recognition device capable of using a basis matrix to convert the acoustic features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2016/052724 WO2017130387A1 (en) 2016-01-29 2016-01-29 Speech recognition device

Publications (1)

Publication Number Publication Date
WO2017130387A1 true WO2017130387A1 (en) 2017-08-03

Family

ID=57582225

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/052724 WO2017130387A1 (en) 2016-01-29 2016-01-29 Speech recognition device

Country Status (3)

Country Link
JP (1) JP6054004B1 (en)
TW (1) TW201727620A (en)
WO (1) WO2017130387A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003216178A (en) * 2002-01-18 2003-07-30 Nec Corp Hierarchical intrinsic space extraction device, adaptive model creation device, extraction, creation method and extraction, creation program thereof
US20050182626A1 (en) * 2004-02-18 2005-08-18 Samsung Electronics Co., Ltd. Speaker clustering and adaptation method based on the HMM model variation information and its apparatus for speech recognition
US20120173240A1 (en) * 2010-12-30 2012-07-05 Microsoft Corporation Subspace Speech Adaptation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
D.POVEY ET AL.: "A Basis Representation of Constrained MLLR Transforms for Robust Adaptation", COMPUTER SPEECH AND LANGUAGE, vol. 26, no. 1, January 2012 (2012-01-01), pages 35 - 51, XP028098288, DOI: doi:10.1016/j.csl.2011.04.002 *

Also Published As

Publication number Publication date
JPWO2017130387A1 (en) 2018-02-01
TW201727620A (en) 2017-08-01
JP6054004B1 (en) 2016-12-27

Similar Documents

Publication Publication Date Title
Arik et al. Deep voice 2: Multi-speaker neural text-to-speech
JP5326892B2 (en) Information processing apparatus, program, and method for generating acoustic model
US11450332B2 (en) Audio conversion learning device, audio conversion device, method, and program
Xue et al. Online end-to-end neural diarization with speaker-tracing buffer
Seki et al. A deep neural network integrated with filterbank learning for speech recognition
WO2020036178A1 (en) Voice conversion learning device, voice conversion device, method, and program
WO2019240228A1 (en) Voice conversion learning device, voice conversion device, method, and program
US8600744B2 (en) System and method for improving robustness of speech recognition using vocal tract length normalization codebooks
Panchapagesan et al. Frequency warping for VTLN and speaker adaptation by linear transformation of standard MFCC
Ghaffarzadegan et al. Deep neural network training for whispered speech recognition using small databases and generative model sampling
Kannadaguli et al. A comparison of Bayesian and HMM based approaches in machine learning for emotion detection in native Kannada speaker
US8874438B2 (en) User and vocabulary-adaptive determination of confidence and rejecting thresholds
Shah et al. Unsupervised Vocal Tract Length Warped Posterior Features for Non-Parallel Voice Conversion.
WO2019212375A1 (en) Method for obtaining speaker-dependent small high-level acoustic speech attributes
JP3088357B2 (en) Unspecified speaker acoustic model generation device and speech recognition device
US9892726B1 (en) Class-based discriminative training of speech models
JP4922225B2 (en) Speech recognition apparatus and speech recognition program
JP6054004B1 (en) Voice recognition device
Kannadaguli et al. Comparison of hidden markov model and artificial neural network based machine learning techniques using DDMFCC vectors for emotion recognition in Kannada
Sen et al. A novel bangla spoken numerals recognition system using convolutional neural network
Kannadaguli et al. Comparison of artificial neural network and gaussian mixture model based machine learning techniques using ddmfcc vectors for emotion recognition in kannada
Shinoda Speaker adaptation techniques for speech recognition using probabilistic models
Tang et al. Deep neural network trained with speaker representation for speaker normalization
Suzuki et al. Discriminative re-ranking for automatic speech recognition by leveraging invariant structures
JP3589508B2 (en) Speaker adaptive speech recognition method and speaker adaptive speech recognizer

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2016541466

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16887973

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16887973

Country of ref document: EP

Kind code of ref document: A1