WO2017130387A1 - Speech recognition device - Google Patents

Speech recognition device

Info

Publication number
WO2017130387A1
WO2017130387A1 (PCT/JP2016/052724)
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
test data
base matrix
acoustic model
data
Prior art date
Application number
PCT/JP2016/052724
Other languages
French (fr)
Japanese (ja)
Inventor
裕紀 金川
勇気 太刀岡
Original Assignee
三菱電機株式会社 (Mitsubishi Electric Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 三菱電機株式会社 (Mitsubishi Electric Corporation)
Priority to PCT/JP2016/052724 priority Critical patent/WO2017130387A1/en
Priority to JP2016541466A priority patent/JP6054004B1/en
Priority to TW105115458A priority patent/TW201727620A/en
Publication of WO2017130387A1 publication Critical patent/WO2017130387A1/en

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 — Adaptation
    • G10L 15/07 — Adaptation to the speaker


Abstract

A basis matrix contribution calculation means (103) calculates the contribution (107) of each basis matrix using the basis matrices (106). A basis matrix weight application means (202) generates a transformation matrix (207) by weighting the basis matrices using the basis matrix weights (206), the basis matrix contributions (107), and the basis matrices (106). A matrix application means (203) for feature data uses the transformation matrix (207) to turn the test data (205) into converted test data (208). A decoding means (204) performs speech recognition by matching the converted test data (208) against an acoustic model (105).

Description

Speech recognition device
The present invention relates to a speech recognition apparatus that converts acoustic feature quantities using basis matrices and a transformation matrix, as part of a technique for adapting feature quantities to match an acoustic model.
In speech recognition technology, many speaker adaptation techniques (feature-quantity adaptation methods) have been proposed with the aim of reducing the influence of the speaker, noise, the microphone, and other factors that cause the input speech signal to deviate from the acoustic model, which expresses context information such as phonemes as standard speech patterns.

Conventionally, the CMLLR (Constrained MLLR) method disclosed in Non-Patent Document 1 is known as such a feature-quantity adaptation method. CMLLR is a method that transforms the means and variances of the model parameters; since this transformation is equivalent to transforming the feature vectors, CMLLR amounts to obtaining a transformation matrix in the feature domain. Specifically, as in Equation (1), it finds an affine transformation matrix W that brings the D-dimensional acoustic feature quantity o_t computed from the input speech closer to the acoustic model, which is the standard pattern of the phonemes.

[Equation (1): rendered as an image in the original publication]
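Equation (1) is given only as an image in this publication, but the surrounding text describes a CMLLR-style affine transform of the D-dimensional feature vector o_t. The following is a minimal illustrative sketch, not the patent's own implementation; the extended form W = [A | b] is an assumption based on the usual CMLLR convention.

```python
import numpy as np

def apply_affine_transform(features, W):
    """Apply an affine feature transform o_hat_t = A @ o_t + b.

    features: (T, D) array of acoustic feature vectors o_t.
    W: (D, D+1) matrix [A | b] (assumed CMLLR-style layout).
    """
    A = W[:, :-1]               # (D, D) linear part
    b = W[:, -1]                # (D,) bias part
    return features @ A.T + b   # (T, D) transformed features

# Toy usage: the identity transform leaves the features unchanged.
T, D = 100, 39
feats = np.random.randn(T, D)
W_identity = np.hstack([np.eye(D), np.zeros((D, 1))])
assert np.allclose(apply_affine_transform(feats, W_identity), feats)
```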
However, in the feature-quantity adaptation method described in Non-Patent Document 1, the transformation matrix W is estimated from the adaptation data alone, and it is known that, when the amount of data is insufficient for estimating the transformation matrix, adaptation can actually degrade performance. The cause is overfitting: the amount of adaptation data is small relative to the number of parameters to be estimated. For example, when a 39-dimensional acoustic feature quantity is used, consisting of a 13-dimensional MFCC (Mel-Frequency Cepstrum Coefficient) vector concatenated with its dynamic features, the number of parameters to be estimated equals the number of elements of the transformation matrix and is as large as 39 × 40 = 1560.
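As an illustration of the 39-dimensional feature vector mentioned above (13 MFCCs concatenated with their dynamic features), the following sketch uses librosa; the file name is hypothetical and the analysis parameters are illustrative, since the patent does not prescribe them.

```python
import numpy as np
import librosa

# Load the utterance (path is hypothetical); 16 kHz is a common choice, not mandated here.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 static MFCCs per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, T)

# First- and second-order dynamic features (delta and delta-delta).
d1 = librosa.feature.delta(mfcc, order=1)
d2 = librosa.feature.delta(mfcc, order=2)

# Concatenate into a 39-dimensional feature vector per frame, as in the example above.
features = np.vstack([mfcc, d1, d2]).T               # (T, 39)
print(features.shape)
```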
To address this problem, the feature-quantity adaptation method described in Non-Patent Document 2 reduces the number of parameters to be estimated: instead of estimating the transformation matrix W directly from the adaptation data, it expresses W by weighting N basis matrices W_1:Nmax (n = 1, ..., N ≤ N_max), where N_max = D(D + 1). Specifically, as in Equation (2), the basis matrices W_n are weighted by weights d_n to obtain the transformation matrix W adapted to the speaker.

[Equation (2): rendered as an image in the original publication]

The basis matrices are obtained from the learning data; at adaptation time, only their weights d_n for the transformation matrix of the input speaker are estimated. The only parameters to be estimated in the adaptation step are the weights d_n, and according to Non-Patent Document 2, for 100 frames (= 1 second) of data the number of parameters to be estimated is only about 20, by Equation (3).

N = min(ηβ, N_max), where η = 0.2   (3)

This means that N is varied according to the number of input frames β, limiting the number of basis matrices that are used.
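The following sketch illustrates one reading of Equations (2) and (3) as described above: the transformation matrix is a weighted sum of the first N basis matrices, with N limited by the number of adaptation frames β (η = 0.2). It is illustrative only; the exact update used in Non-Patent Document 2 is not reproduced here.

```python
import numpy as np

def num_bases(beta, D, eta=0.2):
    """Equation (3): N = min(eta * beta, N_max), with N_max = D * (D + 1)."""
    n_max = D * (D + 1)
    return int(min(eta * beta, n_max))

def compose_transform(bases, d):
    """Equation (2): W = sum_n d_n * W_n over the first N = len(d) basis matrices."""
    return sum(d[n] * bases[n] for n in range(len(d)))

# Toy usage with random basis matrices of shape (D, D+1).
D, beta = 39, 100                                   # 100 frames, i.e. about 1 second of data
bases = [np.random.randn(D, D + 1) for _ in range(D * (D + 1))]
N = num_bases(beta, D)                              # about 20 for 100 frames, as stated in the text
d = np.random.randn(N)
W = compose_transform(bases, d)
print(N, W.shape)
```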
The implementation steps of the speech recognition apparatus described in Non-Patent Document 2 are broadly divided into two: a learning step that obtains the basis matrices W_1:Nmax from the learning data, and an adaptation step that obtains the transformation matrix W using the adaptation data (test data) and N of the basis matrices W_1:Nmax.

In the learning step, an acoustic model that is the standard pattern of the phonemes is first obtained from the learning data. An HMM (Hidden Markov Model) is used as the standard pattern. Conventionally used feature vectors such as filter bank coefficients, MFCC, and PLP (Perceptual Linear Predictive) coefficients can be used as the acoustic feature quantities of the learning data.

[Equation/description block: rendered as an image in the original publication]

Next, in the adaptation step, the weights of the basis matrices are first generated using the test data; these weights correspond to the d_n described above. The basis matrices are weighted with the obtained weights, and the transformation matrix W is obtained as the weighted matrix. To find the optimum W, the weights of the basis matrices and the weighted matrix are obtained iteratively according to Equation (4).

[Equation (4): rendered as an image in the original publication]

Finally, the converted test data are generated using the weighted matrix and the test data; this conversion can be performed using Equation (1). Speech recognition is performed by matching the obtained converted test data against the standard phoneme patterns expressed by the acoustic model, and a recognition result is obtained.
In the conventional speech recognition apparatus described above, the transformation matrix W is obtained in the adaptation step by weighting the basis matrices W_n in order, starting from those with the highest contribution, according to Equation (4). However, although the indices n of the basis matrices W_1:Nmax are assigned in descending order of contribution, Equation (4) does not take the contribution of each basis matrix into account; until the multiplication by d_n, all basis matrices are treated as having the same contribution. As a result, basis matrices with a low contribution can exert an undue influence, and the effect of adaptation may not be sufficiently obtained.
The present invention has been made to solve this problem, and an object of the present invention is to provide a speech recognition apparatus capable of improving the estimation accuracy of the transformation matrix at adaptation time and thereby improving speech recognition accuracy.
The speech recognition apparatus according to the present invention includes: an acoustic model calculation unit that calculates an acoustic model obtained by modeling the standard pattern of learning data using the acoustic feature quantities of the learning data; a basis matrix calculation unit that calculates basis matrices using the acoustic model and the learning data; a basis matrix contribution calculation unit that calculates the contribution of each basis matrix using the basis matrices; a basis matrix weight calculation unit that calculates the weights of the basis matrices using the acoustic feature quantities of test data, the acoustic model, and the basis matrices; a basis matrix weight application unit that generates a transformation matrix by weighting the basis matrices using the basis matrix weights, the basis matrix contributions, and the basis matrices; a matrix application unit for feature data that uses the transformation matrix to convert the test data into converted test data to be recognized against the acoustic model; and a decoding unit that performs speech recognition by matching the converted test data against the acoustic model.
The speech recognition apparatus according to the present invention calculates the contribution of each basis matrix and generates a transformation matrix by weighting the basis matrices using the basis matrix contributions, the basis matrix weights, and the basis matrices. This improves the estimation accuracy of the transformation matrix at adaptation time and improves speech recognition performance.
The drawings are as follows.
FIG. 1 is a configuration diagram showing the speech recognition apparatus according to Embodiment 1 of this invention.
FIG. 2 is a hardware configuration diagram of the speech recognition apparatus according to Embodiment 1.
FIG. 3 is a flowchart showing the flow of the learning step of the speech recognition apparatus according to Embodiment 1.
FIG. 4 is a flowchart showing the flow of the adaptation step of the speech recognition apparatus according to Embodiment 1.
FIG. 5 is a configuration diagram showing the speech recognition apparatus according to Embodiment 2 of this invention.
FIG. 6 is a flowchart showing the flow of the learning step of the speech recognition apparatus according to Embodiment 2.
FIG. 7 is a flowchart showing the flow of the adaptation step of the speech recognition apparatus according to Embodiment 2.
FIG. 8 is an explanatory diagram showing the processing performed by the basis matrix weight calculation unit of the speech recognition apparatus according to Embodiment 2.
Hereinafter, in order to explain the present invention in more detail, modes for carrying out the invention will be described with reference to the accompanying drawings.

Embodiment 1.
FIG. 1 is a configuration diagram of the speech recognition apparatus according to this embodiment.
As illustrated, the speech recognition apparatus according to this embodiment comprises a learning step execution unit 100 and an adaptation step execution unit 200. The learning step execution unit 100 includes an acoustic model calculation unit 101, a basis matrix calculation unit 102, and a basis matrix contribution calculation unit 103. The adaptation step execution unit 200 includes a basis matrix weight calculation unit 201, a basis matrix weight application unit 202, a matrix application unit 203 for feature data, and a decoding unit 204.
The acoustic model calculation unit 101 in the learning step execution unit 100 is a processing unit that calculates an acoustic model 105 obtained by modeling the standard pattern of the learning data 104 using the acoustic feature quantities of the learning data 104. The basis matrix calculation unit 102 is a processing unit that calculates basis matrices 106 using the acoustic model 105 calculated by the acoustic model calculation unit 101 and the learning data 104. The basis matrix contribution calculation unit 103 is a processing unit that calculates basis matrix contributions 107 using the basis matrices 106 calculated by the basis matrix calculation unit 102.
The basis matrix weight calculation unit 201 in the adaptation step execution unit 200 is a processing unit that calculates basis matrix weights 206 using the acoustic feature quantities of the test data 205, the acoustic model 105, and the basis matrices 106. The basis matrix weight application unit 202 is a processing unit that weights the basis matrices 106 using the basis matrix weights 206 calculated by the basis matrix weight calculation unit 201, the basis matrix contributions 107, and the basis matrices 106, and generates a transformation matrix 207 as the weighted matrix. The matrix application unit 203 for feature data is a processing unit that uses the transformation matrix 207 obtained by the basis matrix weight application unit 202 and the test data 205 to convert the test data 205 so as to be suitable for recognition against the acoustic model, generating converted test data 208. The decoding unit 204 is a processing unit that performs speech recognition by matching the converted test data 208 obtained by the matrix application unit 203 for feature data against the acoustic model 105, and outputs a recognition result 209. In FIG. 1, the arrow from the acoustic model 105 to the decoding unit 204 is omitted.
FIG. 2 is a hardware configuration diagram of the speech recognition apparatus according to Embodiment 1.
The speech recognition apparatus is realized using a computer and includes a processor 1, a memory 2, an input/output interface (input/output I/F) 3, and a bus 4. The processor 1 is a functional unit that performs the arithmetic processing of the computer. The memory 2 is a storage unit that stores various programs and calculation results and provides the work area used when the processor 1 performs arithmetic processing. The input/output interface 3 is an interface for inputting the learning data 104 and the test data 205 and for outputting the recognition result 209 to the outside. The bus 4 connects the processor 1, the memory 2, and the input/output interface 3 to one another.
The acoustic model calculation unit 101, the basis matrix calculation unit 102, the basis matrix contribution calculation unit 103, the basis matrix weight calculation unit 201, the basis matrix weight application unit 202, the matrix application unit 203 for feature data, and the decoding unit 204 shown in FIG. 1 are each realized by the processor 1 executing a program stored in the memory 2. The acoustic model 105, the basis matrices 106, the basis matrix weights 206, the transformation matrix 207, and the converted test data 208 are each stored in storage areas of the memory 2. A plurality of processors 1 and memories 2 may be provided, and the plurality of processors 1 and memories 2 may be configured to carry out the functions described above in cooperation.
Next, the operation of the speech recognition apparatus according to Embodiment 1 will be described.
First, the learning step performed by the learning step execution unit 100 will be described with reference to the flowchart of FIG. 3.
In the learning step, the acoustic model calculation unit 101 first creates, from the learning data 104, an acoustic model 105 that is the standard pattern of the phonemes (step ST1). Here, conventionally used feature vectors such as filter bank coefficients, MFCC (Mel Frequency Cepstrum Coefficient), and PLP (Perceptual Linear Predictive) coefficients can be used as the acoustic feature quantities.
[Equation/description block: rendered as an image in the original publication]

In addition, a contribution ω_n 107 corresponding to the index n of each basis matrix 106 is obtained from the basis matrices 106 using the basis matrix contribution calculation unit 103 (step ST3). The contribution 107 takes larger values for the sets n that have higher expressive power for the learning data.
As a concrete example of the basis matrix contribution 107, the singular values k_1:Nmax obtained when the basis matrices W_1:Nmax 106 are computed can be used. This is because a basis matrix whose index n corresponds to a large singular value contributes strongly to expressing the matrix M. Therefore, instead of the contribution calculation unit 103 computing the singular values k_1:Nmax again, the contribution 107 can equally be obtained by retaining the singular values k_1:Nmax calculated by the basis matrix calculation unit 102.
Furthermore, rather than using the singular values k_1:Nmax directly, the basis matrix contribution calculation unit 103 can apply a transformation function φ(·) to the singular values to obtain φ(k_n), which makes it possible to control the contribution assigned to each basis matrix W_n 106. A sigmoid function or the like can be used as the transformation function.
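The following sketch illustrates this contribution calculation: the contributions ω_n are taken from the singular values obtained when the basis matrices are computed, optionally passed through a transformation function such as a sigmoid. The statistics matrix M and the centering inside the sigmoid are assumptions made for illustration; the patent does not give the exact formula.

```python
import numpy as np

def basis_contributions(singular_values, transform="sigmoid", scale=1.0):
    """Turn the singular values k_1..k_Nmax into contributions omega_1..omega_Nmax.

    transform="none" uses the singular values directly;
    transform="sigmoid" applies phi(k_n) so that the contributions can be controlled.
    """
    k = np.asarray(singular_values, dtype=float)
    if transform == "sigmoid":
        # Monotonic in k, so basis matrices with larger singular values keep larger contributions.
        return 1.0 / (1.0 + np.exp(-scale * (k - k.mean())))
    return k

# Toy usage: singular values from an SVD of some statistics matrix M (its contents are assumed).
M = np.random.randn(40, 40)
_, k, _ = np.linalg.svd(M)
omega = basis_contributions(k, transform="sigmoid")
print(omega[:5])
```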
Next, the adaptation step performed by the adaptation step execution unit 200 will be described with reference to the flowchart of FIG. 4.
In the adaptation step, the basis matrix weight calculation unit 201 first generates basis matrix weights d_n 206 from the test data 205, the acoustic model 105, and the basis matrices 106 (step ST11). Next, the basis matrix weight application unit 202 obtains the transformation matrix W 207 as a matrix weighted using the basis matrix weights 206 obtained in step ST11, the basis matrices 106, and the basis matrix contributions ω_1:Nmax 107 (step ST12). The basis matrix weights 206 and the transformation matrix 207 are obtained iteratively based on Equation (5).

[Equation (5): rendered as an image in the original publication]
That is, step ST11 and step ST12 are repeated iteratively; when the increase in likelihood falls below a threshold, or when a predetermined number of iterations has been performed, the process proceeds to the next step. Here, the likelihood is an index of how close the input speech is to the standard pattern of the acoustic model 105. By computing the difference in likelihood, the increase in likelihood since the transformation matrix was last estimated is obtained. If the likelihood difference is smaller than the set value, that is, if the increase in likelihood is smaller than the set value, the estimation process can be regarded as having converged, and it is judged that the estimation has been performed with high accuracy. On the other hand, if the likelihood difference is equal to or greater than the set value, that is, if the increase in likelihood is equal to or greater than the set value, it is judged that the estimation process has not converged. In this case, the basis matrix weights 206 are estimated again to obtain a more accurate transformation matrix 207.
In the present invention, the basis matrix weight application unit 202 multiplies each basis matrix W_n 106 by its contribution ω_n 107 when estimating the transformation matrix W 207. This makes it possible to take the basis matrix contributions into account, and an improvement in the estimation accuracy of the transformation matrix W 207 can be expected.
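A minimal sketch of the adaptation loop of steps ST11–ST12 with this contribution weighting: the transform is rebuilt as a contribution- and weight-scaled sum of basis matrices (one plausible reading of Equation (5), which appears only as an image here), and iteration stops when the likelihood gain falls below a threshold or after a fixed number of iterations. The estimate_weights and log_likelihood callables are placeholders for statistics that a real GMM/HMM toolkit would supply.

```python
import numpy as np

def adapt_transform(test_feats, bases, omega, estimate_weights, log_likelihood,
                    max_iters=10, tol=1e-3):
    """Iteratively estimate W = sum_n omega_n * d_n * W_n (assumed reading of Eq. (5)).

    estimate_weights(test_feats, W, bases, omega) -> d : re-estimates the basis weights (step ST11)
    log_likelihood(test_feats, W) -> float             : likelihood of the transformed data
    """
    D = test_feats.shape[1]
    W = np.hstack([np.eye(D), np.zeros((D, 1))])        # start from the identity transform
    prev_ll = log_likelihood(test_feats, W)
    for _ in range(max_iters):
        d = estimate_weights(test_feats, W, bases, omega)              # step ST11
        W = sum(omega[n] * d[n] * bases[n] for n in range(len(d)))     # step ST12
        ll = log_likelihood(test_feats, W)
        if ll - prev_ll < tol:                           # likelihood gain below threshold: converged
            break
        prev_ll = ll
    return W
```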
Finally, using the transformation matrix 207 and the test data 205, the matrix application unit 203 for feature data generates the converted test data 208 (step ST13); specifically, the conversion can be performed using Equation (1). The decoding unit 204 obtains a recognition result 209 by matching the obtained converted test data 208 against the standard phoneme patterns expressed by the acoustic model 105 (step ST14).
The decoding unit 204 performs speech recognition processing based on an HMM (Hidden Markov Model). In detail, as the output probability model of the HMM, a GMM-HMM using Gaussian mixture models (GMM, Gaussian Mixture Model) or an NN-HMM using a neural network (NN, Neural Network) can be used.
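For reference, the sketch below evaluates the log output probability of a single diagonal-covariance GMM state for one feature vector; it is a generic GMM computation, not the patent's decoder.

```python
import numpy as np

def gmm_log_prob(x, weights, means, variances):
    """Log output probability of a diagonal-covariance GMM state.

    x: (D,) feature vector; weights: (K,); means, variances: (K, D).
    """
    D = x.shape[0]
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_components = np.log(weights) + log_norm + log_exp
    m = log_components.max()
    return m + np.log(np.sum(np.exp(log_components - m)))   # log-sum-exp over the mixture

# Toy usage with a 2-component GMM over 39-dimensional features.
K, D = 2, 39
x = np.random.randn(D)
print(gmm_log_prob(x, np.array([0.5, 0.5]), np.random.randn(K, D), np.ones((K, D))))
```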
As described above, the speech recognition apparatus of Embodiment 1 includes: an acoustic model calculation unit that calculates an acoustic model obtained by modeling the standard pattern of learning data using the acoustic feature quantities of the learning data; a basis matrix calculation unit that calculates basis matrices using the acoustic model and the learning data; a basis matrix contribution calculation unit that calculates the contribution of each basis matrix using the basis matrices; a basis matrix weight calculation unit that calculates the weights of the basis matrices using the acoustic feature quantities of test data, the acoustic model, and the basis matrices; a basis matrix weight application unit that generates a transformation matrix by weighting the basis matrices using the basis matrix weights, the basis matrix contributions, and the basis matrices; a matrix application unit for feature data that uses the transformation matrix to convert the test data into converted test data to be recognized against the acoustic model; and a decoding unit that performs speech recognition by matching the converted test data against the acoustic model. Consequently, the influence of basis matrices with a high contribution can be increased while the influence of basis matrices with a low contribution is suppressed, improving the estimation accuracy of the transformation matrix at adaptation time and improving speech recognition performance.
Embodiment 2.
In Embodiment 2, the transformation matrix, and the basis matrices used for estimating the transformation matrix, are obtained for each class, such as each phoneme.
FIG. 5 is a configuration diagram of the speech recognition apparatus according to Embodiment 2. The speech recognition apparatus according to Embodiment 2 comprises a learning step execution unit 100a and an adaptation step execution unit 200a. The learning step execution unit 100a includes an acoustic model calculation unit 101a and a basis matrix calculation unit 102a. The adaptation step execution unit 200a includes a basis matrix weight calculation unit 201a, a basis matrix weight application unit 202a, a matrix application unit 203a for feature data, a decoding unit 204, an alignment calculation unit 210, and a data class classification unit 211.
The acoustic model calculation unit 101a in the learning step execution unit 100a is a processing unit that obtains an acoustic model 105a by modeling the standard pattern of the learning data 104a for each class, which has been clustered by class, using the acoustic feature quantities of the learning data 104a for each class. The basis matrix calculation unit 102a is a processing unit that calculates the basis matrices 106a for each class using the acoustic model 105a and the learning data 104a for each class.
The alignment calculation unit 210 in the adaptation step execution unit 200a is a processing unit that calculates an alignment 212 indicating the state sequence of the acoustic feature quantities of the test data 205. The data class classification unit 211 is a processing unit that classifies the test data 205 by class using the test data 205 and the alignment 212 and outputs the result as test data 213 for each class. The basis matrix weight calculation unit 201a is a processing unit that obtains the weights for the basis matrices 106a for each class using the test data 213 for each class, the acoustic model 105a, and the basis matrices 106a for each class, and outputs basis matrix weights 206a for each class. The basis matrix weight application unit 202a is a processing unit that generates transformation matrices 207a for each class by weighting, using the basis matrices 106a for each class and the basis matrix weights 206a for each class. The matrix application unit 203a for feature data is a processing unit that uses the test data 205, the alignment 212, and the transformation matrices 207a for each class to convert the test data 205 so as to be suitable for recognition against the acoustic model, generating converted test data 208a. The decoding unit 204 is a processing unit that performs speech recognition by matching the converted test data 208a against the acoustic model 105a and outputs a recognition result 209. In FIG. 5, the arrow from the acoustic model 105a to the decoding unit 204 is omitted. These processing units are realized by the processor shown in FIG. 2 executing programs stored in the memory.
Next, the operation of the speech recognition apparatus according to Embodiment 2 will be described.
First, the learning step performed by the learning step execution unit 100a will be described with reference to the flowchart of FIG. 6.
In the learning step, the learning data are classified in advance into C classes, for example by phoneme, and learning data 104a clustered for each class are prepared. The number of classes C and the way the classes are divided may be determined manually according to the phonemes, or may be determined by clustering using a decision tree or the K-means method. The acoustic model calculation unit 101a calculates the acoustic model 105a from the learning data 104a for each class (step ST101). Next, the learning data 104a for each class and the acoustic model 105a are input to the basis matrix calculation unit 102a to obtain the basis matrices 106a for each class (step ST102).
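The text above notes that the classes may be defined manually or by clustering, for example with the K-means method. The sketch below shows the clustering route; representing each phoneme of the learning data by the mean of its frames is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical phoneme set, each represented by the mean of its frames in the learning data.
phoneme_labels = ["sil", "a", "k", "i", "o", "u"]
phoneme_means = np.random.randn(len(phoneme_labels), 39)   # (num_phonemes, D)

C = 3                                                      # number of classes, decided in advance
kmeans = KMeans(n_clusters=C, n_init=10, random_state=0).fit(phoneme_means)

# Map each phoneme to one of the C classes; the learning data 104a are then split accordingly.
phoneme_to_class = dict(zip(phoneme_labels, kmeans.labels_))
print(phoneme_to_class)
```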
Next, the adaptation step performed by the adaptation step execution unit 200a will be described with reference to the flowchart of FIG. 7.
In the adaptation step, the alignment calculation unit 210 calculates the alignment 212 from the test data 205 (step ST201). Here, the alignment is an HMM state sequence and is used to associate the phoneme and class information corresponding to each time t of the test data. Next, the data class classification unit 211 classifies the test data 205 by class using the alignment 212 and generates the test data corresponding to class 1 through class C as the test data 213 for each class (step ST202). The basis matrix weight calculation unit 201a then calculates the basis matrix weights 206a for each class for the test data 213 for each class, using the acoustic model 105a and the basis matrices 106a for each class (step ST203). Furthermore, the basis matrix weight application unit 202a calculates the transformation matrices 207a for each class from the basis matrix weights 206a for each class, using the basis matrices 106a for each class (step ST204). Step ST203 and step ST204 are repeated iteratively, and when the increase in likelihood falls below a threshold, or when the predetermined number of iterations has been performed, the process proceeds to step ST205.
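A sketch of steps ST203–ST204: the basis matrix weights and the transformation matrix are estimated separately for each class, reusing the adapt_transform routine sketched for Embodiment 1. The dictionaries of per-class data, basis matrices, and contributions are assumed inputs.

```python
def adapt_per_class(class_feats, class_bases, class_omega,
                    estimate_weights, log_likelihood):
    """Estimate one transformation matrix per class (steps ST203-ST204, illustrative only).

    class_feats[c]: frames of the test data assigned to class c by the alignment;
    class_bases[c], class_omega[c]: basis matrices and contributions for class c.
    """
    transforms = {}
    for c, feats in class_feats.items():
        # adapt_transform is the Embodiment 1 sketch shown earlier.
        transforms[c] = adapt_transform(feats, class_bases[c], class_omega[c],
                                        estimate_weights, log_likelihood)
    return transforms
```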
FIG. 8 is an explanatory diagram showing the processing performed by the basis matrix weight calculation unit 201a. The acoustic feature sequence shown in FIG. 8 presents the continuously changing acoustic feature quantities of the test data in time series; o_t in the figure denotes the feature vector at time t.
The alignment shown in FIG. 8 shows the phoneme string "sil a k i" for the case where the user utters "aki". The phoneme string of "aki" is "a k i", and the silence at the beginning of the utterance is represented by "sil". The numbers shown in the alignment indicate HMM state numbers; that is, the alignment is the HMM state sequence corresponding to the acoustic feature sequence. Furthermore, the straight arrows in the alignment indicate transitions to the next state, and the curved arrows indicate self-transitions.
In the second embodiment, the alignment 212 associates a phoneme with the acoustic feature o_t at each time, and a basis matrix suited to transforming the features of that phoneme is used, so that weights for the basis matrices matched to the acoustic characteristics of the test data can be estimated.
Next, in step ST205, the feature data matrix application unit 203a calculates the transformed test data 208a using the transformation matrices 207a for each class obtained in step ST204, the test data 205, and the alignment 212. That is, the feature data matrix application unit 203a uses the class information obtained from the alignment 212 to associate the acoustic feature at each time with the transformation matrix 207a of the corresponding class, and multiplies the feature vector by that transformation matrix to generate the transformed test data 208a. The decoding unit 204 then performs speech recognition by matching the transformed test data 208a obtained in step ST205 against the acoustic model 105a, and obtains the recognition result 209 (step ST206).
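The per-frame application of the class-specific transform and the subsequent scoring can be sketched as follows. The structures transforms, frame_classes, and the diagonal-Gaussian scoring reuse the toy objects from the earlier sketches and are assumptions; the embodiment's decoding unit performs full HMM-based recognition rather than the single-Gaussian score shown here.

import numpy as np

def apply_class_transforms(test_feats, frame_classes, transforms):
    # Step ST205: for each frame o_t, pick the transform of its class c(t)
    # and compute o'_t = W_{c(t)} [o_t; 1].
    T, D = test_feats.shape
    transformed = np.empty((T, D))
    for t in range(T):
        w = transforms[frame_classes[t]]                  # D x (D+1)
        transformed[t] = w @ np.append(test_feats[t], 1.0)
    return transformed

def decode(transformed, acoustic_model):
    # Step ST206 (placeholder): score the transformed test data 208a against
    # the toy diagonal-Gaussian acoustic model instead of HMM decoding.
    diff = transformed - acoustic_model["mean"]
    return float(-0.5 * np.sum(diff ** 2 / acoustic_model["var"]))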
As described above, the speech recognition apparatus according to the second embodiment includes: an acoustic model calculation unit that calculates an acoustic model modeling the standard pattern of the learning data using the acoustic features of the clustered learning data; a basis matrix calculation unit that calculates a basis matrix for each class using the acoustic model and the learning data; an alignment calculation unit that calculates an alignment indicating the state sequence of the acoustic features of the test data; a data class classification unit that classifies the test data by class using the test data and the alignment; a basis matrix weight calculation unit that obtains weights for the basis matrix of each class using the test data for each class, the basis matrices, and the acoustic model; a basis matrix weight application unit that generates a transformation matrix for each class by weighting, using the basis matrix and the basis matrix weights of each class; a feature data matrix application unit that generates transformed test data for recognizing the test data against the acoustic model, using the test data, the alignment, and the transformation matrix for each class; and a decoding unit that performs speech recognition by matching the transformed test data against the acoustic model. The estimation accuracy of the transformation matrix at adaptation time is therefore improved, and speech recognition performance can be improved.
Within the scope of the present invention, the embodiments may be freely combined, any component of each embodiment may be modified, and any component of each embodiment may be omitted. For example, by combining the first and second embodiments and reflecting the basis matrix contribution degrees described in the first embodiment in the basis matrix weight application unit 202a of the second embodiment, the adaptation accuracy can be further improved.
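One way such a combination might look is sketched below: the class-specific basis weights of the second embodiment are scaled by the per-basis contribution degrees of the first embodiment before the weighted sum is formed. The paragraph above does not specify how the contribution enters the weighting, so this elementwise scaling is an assumption.

def build_transform_with_contribution(weights, contributions, bases):
    # Combination sketch: scale each basis weight d_n by its contribution
    # degree rho_n, then take the weighted sum of the basis matrices (numpy
    # arrays of shape D x (D+1)) to form the class transformation matrix.
    assert len(weights) == len(contributions) == len(bases)
    return sum(d_n * rho_n * w_n
               for d_n, rho_n, w_n in zip(weights, contributions, bases))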
As described above, the speech recognition apparatus according to the present invention enables robust speaker adaptation even with a small amount of data, and is therefore suitable for application to navigation devices, home appliances, and the like in order to improve speech recognition performance.
100, 100a learning step execution unit; 101, 101a acoustic model calculation unit; 102, 102a basis matrix calculation unit; 103 basis matrix contribution calculation unit; 104 learning data; 104a learning data for each class; 105, 105a acoustic model; 106 basis matrix; 106a basis matrix for each class; 107 contribution degree; 200, 200a adaptation step execution unit; 201, 201a basis matrix weight calculation unit; 202, 202a basis matrix weight application unit; 203, 203a feature data matrix application unit; 204 decoding unit; 205 test data; 206 basis matrix weights; 206a basis matrix weights for each class; 207 transformation matrix; 207a transformation matrix for each class; 208, 208a transformed test data; 209 recognition result; 210 alignment calculation unit; 211 data class classification unit; 212 alignment; 213 test data for each class.

Claims (2)

  1.  A speech recognition device comprising:
     an acoustic model calculation unit that calculates an acoustic model modeling a standard pattern of learning data using acoustic features of the learning data;
     a basis matrix calculation unit that calculates a basis matrix using the acoustic model and the learning data;
     a basis matrix contribution calculation unit that calculates a contribution degree of the basis matrix using the basis matrix;
     a basis matrix weight calculation unit that calculates weights for the basis matrix using acoustic features of test data, the acoustic model, and the basis matrix;
     a basis matrix weight application unit that generates a transformation matrix by weighting the basis matrix, using the basis matrix weights, the basis matrix contribution degree, and the basis matrix;
     a feature data matrix application unit that converts the test data into transformed test data for recognition against the acoustic model, using the transformation matrix; and
     a decoding unit that performs speech recognition by matching the transformed test data against the acoustic model.
  2.  A speech recognition device comprising:
     an acoustic model calculation unit that calculates an acoustic model modeling a standard pattern of clustered learning data using acoustic features of the learning data;
     a basis matrix calculation unit that calculates a basis matrix for each class using the acoustic model and the learning data;
     an alignment calculation unit that calculates an alignment indicating a state sequence of acoustic features of test data;
     a data class classification unit that classifies the test data by class using the test data and the alignment;
     a basis matrix weight calculation unit that obtains weights for the basis matrix of each class using the test data for each class, the basis matrices, and the acoustic model;
     a basis matrix weight application unit that generates a transformation matrix for each class by weighting, using the basis matrix for each class and the basis matrix weights for each class;
     a feature data matrix application unit that generates transformed test data for recognizing the test data against the acoustic model, using the test data, the alignment, and the transformation matrix for each class; and
     a decoding unit that performs speech recognition by matching the transformed test data against the acoustic model.
PCT/JP2016/052724 2016-01-29 2016-01-29 Speech recognition device WO2017130387A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/JP2016/052724 WO2017130387A1 (en) 2016-01-29 2016-01-29 Speech recognition device
JP2016541466A JP6054004B1 (en) 2016-01-29 2016-01-29 Voice recognition device
TW105115458A TW201727620A (en) 2016-01-29 2016-05-19 Speech recognition device capable of using a basis matrix to convert the acoustic features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2016/052724 WO2017130387A1 (en) 2016-01-29 2016-01-29 Speech recognition device

Publications (1)

Publication Number Publication Date
WO2017130387A1 true WO2017130387A1 (en) 2017-08-03

Family

ID=57582225

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/052724 WO2017130387A1 (en) 2016-01-29 2016-01-29 Speech recognition device

Country Status (3)

Country Link
JP (1) JP6054004B1 (en)
TW (1) TW201727620A (en)
WO (1) WO2017130387A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003216178A (en) * 2002-01-18 2003-07-30 Nec Corp Hierarchical intrinsic space extraction device, adaptive model creation device, extraction, creation method and extraction, creation program thereof
US20050182626A1 (en) * 2004-02-18 2005-08-18 Samsung Electronics Co., Ltd. Speaker clustering and adaptation method based on the HMM model variation information and its apparatus for speech recognition
US20120173240A1 (en) * 2010-12-30 2012-07-05 Microsoft Corporation Subspace Speech Adaptation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
D.POVEY ET AL.: "A Basis Representation of Constrained MLLR Transforms for Robust Adaptation", COMPUTER SPEECH AND LANGUAGE, vol. 26, no. 1, January 2012 (2012-01-01), pages 35 - 51, XP028098288, DOI: doi:10.1016/j.csl.2011.04.002 *

Also Published As

Publication number Publication date
JPWO2017130387A1 (en) 2018-02-01
TW201727620A (en) 2017-08-01
JP6054004B1 (en) 2016-12-27

Similar Documents

Publication Publication Date Title
Arik et al. Deep voice 2: Multi-speaker neural text-to-speech
JP5326892B2 (en) Information processing apparatus, program, and method for generating acoustic model
US11450332B2 (en) Audio conversion learning device, audio conversion device, method, and program
Xue et al. Online end-to-end neural diarization with speaker-tracing buffer
Seki et al. A deep neural network integrated with filterbank learning for speech recognition
WO2020036178A1 (en) Voice conversion learning device, voice conversion device, method, and program
WO2019240228A1 (en) Voice conversion learning device, voice conversion device, method, and program
US8600744B2 (en) System and method for improving robustness of speech recognition using vocal tract length normalization codebooks
Panchapagesan et al. Frequency warping for VTLN and speaker adaptation by linear transformation of standard MFCC
Ghaffarzadegan et al. Deep neural network training for whispered speech recognition using small databases and generative model sampling
Kannadaguli et al. A comparison of Bayesian and HMM based approaches in machine learning for emotion detection in native Kannada speaker
US8874438B2 (en) User and vocabulary-adaptive determination of confidence and rejecting thresholds
Shah et al. Unsupervised Vocal Tract Length Warped Posterior Features for Non-Parallel Voice Conversion.
WO2019212375A1 (en) Method for obtaining speaker-dependent small high-level acoustic speech attributes
JP3088357B2 (en) Unspecified speaker acoustic model generation device and speech recognition device
US9892726B1 (en) Class-based discriminative training of speech models
JP4922225B2 (en) Speech recognition apparatus and speech recognition program
JP6054004B1 (en) Voice recognition device
Kannadaguli et al. Comparison of hidden markov model and artificial neural network based machine learning techniques using DDMFCC vectors for emotion recognition in Kannada
Sen et al. A novel bangla spoken numerals recognition system using convolutional neural network
Kannadaguli et al. Comparison of artificial neural network and gaussian mixture model based machine learning techniques using ddmfcc vectors for emotion recognition in kannada
Shinoda Speaker adaptation techniques for speech recognition using probabilistic models
Tang et al. Deep neural network trained with speaker representation for speaker normalization
Suzuki et al. Discriminative re-ranking for automatic speech recognition by leveraging invariant structures
JP3589508B2 (en) Speaker adaptive speech recognition method and speaker adaptive speech recognizer

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2016541466

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16887973

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16887973

Country of ref document: EP

Kind code of ref document: A1