WO2018134916A1 - Speech recognition device - Google Patents

Speech recognition device

Info

Publication number
WO2018134916A1
WO2018134916A1 · PCT/JP2017/001551
Authority
WO
WIPO (PCT)
Prior art keywords
learning
determination
domain
unit
speech recognition
Prior art date
Application number
PCT/JP2017/001551
Other languages
French (fr)
Japanese (ja)
Inventor
裕紀 金川
Original Assignee
Mitsubishi Electric Corporation (三菱電機株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corporation (三菱電機株式会社)
Priority to PCT/JP2017/001551 priority Critical patent/WO2018134916A1/en
Priority to JP2018562783A priority patent/JP6532619B2/en
Publication of WO2018134916A1 publication Critical patent/WO2018134916A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08: Speech classification or search
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the present invention relates to a speech recognition apparatus that determines to which domain input speech belongs.
  • a conventional method obtains a speech recognition result of the desired domain while determining which domain the input speech belongs to, as follows: first, a recognition result is calculated for each domain by speech recognition, and then the recognition results of the domains are compared with one another to obtain the final recognition result. For example, in the method disclosed in Patent Document 1, speech recognition results are first obtained by a plurality of speech recognition systems, each using a statistical language model prepared for a different domain.
  • a weighting coefficient controls the degree of influence of the acoustic score and the language score, and is determined experimentally so as to reduce utterance-domain determination errors.
  • the domain of the recognition result with the maximum score of the above expression is determined as the optimal domain, and the recognition result is presented as the optimal recognition result.
  • the weighted sum of the score obtained at recognition time and the score obtained from the recognition result is taken, and the optimum domain is determined from the magnitude of that sum.
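  • the prior-art weighted-sum determination described above can be sketched as follows; the domain names, score values, and the coefficient are invented for illustration and are not taken from Patent Document 1:

```python
# Hypothetical prior-art weighted-sum domain determination.
# Domain names, scores, and the coefficient w are invented for illustration.
domain_scores = {
    "navigation": {"acoustic": -120.5, "language": -30.2},
    "music":      {"acoustic": -118.9, "language": -35.7},
    "phone":      {"acoustic": -125.0, "language": -28.4},
}
w = 0.8  # empirically tuned coefficient controlling the language score's influence

def combined_score(s):
    # weighted sum of the acoustic score and the language score
    return s["acoustic"] + w * s["language"]

# The domain with the maximum combined score is taken as the optimal domain.
best_domain = max(domain_scores, key=lambda d: combined_score(domain_scores[d]))
print(best_domain)
```

  • as the text notes, when the combined scores of two domains are close, this magnitude-only comparison becomes unreliable, which motivates the learned determination model of the invention.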
  • however, the weighting coefficient in the weighted sum must be determined empirically, and for some utterances the difference in score between domains is small, so there is a problem that discrimination by score magnitude alone is difficult.
  • the present invention has been made to solve such a problem, and an object of the present invention is to provide a speech recognition apparatus capable of improving the accuracy of domain determination and improving the accuracy of speech recognition.
  • the speech recognition apparatus includes a learning speech recognition unit that calculates a learning score, a value indicating a speech recognition result, from learning speech data, and a learning feature amount conversion unit that converts the learning score into a learning feature amount.
  • a domain determination unit that collates a determination feature amount with a domain determination model and calculates a domain determination result indicating to which domain the input speech data belongs is also provided.
  • the speech recognition apparatus calculates a domain determination model indicating the relationship between feature amounts and domains by using learning label data that defines to which domain the learning speech data belongs, and then determines the domain of the input speech data using the domain determination model.
  • the domain determination accuracy and thus the speech recognition performance can be improved compared with the conventional case where the optimum domain is determined from the magnitude of the recognition score.
  • FIG. 1 is a configuration diagram of a speech recognition apparatus according to the first embodiment.
  • the speech recognition apparatus according to the present embodiment includes a learning execution unit 100 and a determination execution unit 200 as illustrated.
  • the learning execution unit 100 includes a learning speech recognition unit 102, a learning feature amount conversion unit 104, and a model learning unit 106
  • the determination execution unit 200 includes a determination speech recognition unit 202, a determination feature amount conversion unit 204, and a domain determination unit 205.
  • the learning speech recognition unit 102 in the learning execution unit 100 is a processing unit that calculates the learning score 103 using the learning speech data 101.
  • the learning feature amount conversion unit 104 is a processing unit that converts the learning score 103 calculated by the learning speech recognition unit 102 into a learning feature amount.
  • the model learning unit 106 is a processing unit that calculates the domain determination model 107 using the learning feature amount calculated by the learning feature amount conversion unit 104 and the learning label data 105 of the domain corresponding to the learning speech.
  • the determination voice recognition unit 202 and the determination feature amount conversion unit 204 are the same as those of the learning execution unit 100, respectively. That is, the determination speech recognition unit 202 has the same configuration as the learning speech recognition unit 102 and is a processing unit that calculates the determination score 203 using the input speech data 201.
  • the determination feature amount conversion unit 204 is a processing unit that converts the determination score 203 calculated by the determination speech recognition unit 202 into a determination feature amount.
  • the domain determination unit 205 is a processing unit that calculates the domain determination result 206 using the determination feature amount calculated by the determination feature amount conversion unit 204 and the domain determination model 107.
  • FIG. 2 is a hardware configuration diagram of the speech recognition apparatus according to the first embodiment.
  • the speech recognition apparatus is realized using a computer, and includes a processor 1, a memory 2, an input / output interface (input / output I / F) 3, and a bus 4.
  • the processor 1 is a functional unit that performs calculation processing as a computer
  • the memory 2 is a storage unit that stores various programs and calculation results, and constitutes a work area when the processor 1 performs calculation processing.
  • the input / output interface 3 is an interface for inputting the learning voice data 101 and the input voice data 201 and outputting the domain determination result 206 to the outside.
  • the bus 4 is a bus for connecting the processor 1, the memory 2, and the input / output interface 3 to each other.
  • a plurality of processors 1 and memories 2 may be provided, and the plurality of processors 1 and the memories 2 may be configured to perform the functions described above in cooperation.
  • the learning speech recognition unit 102 performs speech recognition on the learning speech data 101 and calculates the learning score 103 (step ST101).
  • the learning speech recognition unit 102 includes a plurality of speech recognizers A to C, and each reads a language model and an acoustic model corresponding to each domain.
  • Scores A to C of the learning score 103 are the first recognition results from the speech recognizers A to C.
  • an acoustic score or a language score can be used.
  • three speech recognizers A to C are used here as an example, but the number of recognizers can be selected appropriately according to the number of domains.
  • the learning feature amount conversion unit 104 converts the learning score 103 into a learning feature amount (step ST102).
  • as a specific method of conversion to a learning feature amount, a method of arranging the acoustic score and the language score for each domain into a vector is conceivable, as shown in FIG. 4. In the example shown in FIG. 4, the vector has 2 score types (acoustic score and language score) × 3 domains = 6 dimensions.
  • the scores used for vectorization are not limited to the acoustic score and the language score; the sum of the two, or any other score obtained from the learning speech recognition unit 102, may be used.
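  • the vectorization of per-domain scores described above (2 score types × 3 domains = 6 dimensions, as in FIG. 4) can be sketched as follows; the score values are invented placeholders for recognizer output:

```python
# Arrange the acoustic and language scores of each domain into one vector,
# as in FIG. 4: 2 score types x 3 domains = 6 dimensions. Values are invented.
scores = {
    "A": {"acoustic": -120.5, "language": -30.2},
    "B": {"acoustic": -118.9, "language": -35.7},
    "C": {"acoustic": -125.0, "language": -28.4},
}

def to_feature(scores, domains=("A", "B", "C")):
    # Concatenate (acoustic, language) per domain in a fixed order.
    vec = []
    for d in domains:
        vec.append(scores[d]["acoustic"])
        vec.append(scores[d]["language"])
    return vec

feature = to_feature(scores)
print(len(feature))  # 6-dimensional learning feature amount
```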
  • the domain determination model 107 is calculated by the model learning unit 106 using the learning feature amount converted from the learning score 103 and the learning label data 105 (step ST103).
  • the learning label data 105 defines to which domain the learning speech data 101 belongs.
  • the model learning unit 106 calculates a model so that the learning feature amount obtained by the learning feature amount conversion unit 104 is associated with the learning label data 105.
  • a statistical method such as a mixed Gaussian distribution model, a support vector machine, or a neural network can be used as a method used by the model learning unit 106.
  • in this way, the learning execution unit 100 applies the learning speech data 101 to a plurality of speech recognizers, converts the obtained recognition scores into learning feature amounts, and, using the learning feature amounts together with the learning label data 105 indicating the domain of each utterance, models the correspondence between recognition scores and domains in a statistical machine learning framework.
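  • as a rough illustration of this learning step, the toy classifier below stands in for the statistical methods the text names (mixed Gaussian distribution model, support vector machine, neural network); a simple nearest-centroid model is used here only to show how feature vectors are associated with domain labels, and all data is invented:

```python
# Toy stand-in for the model learning unit 106: a nearest-centroid model
# that associates feature vectors with domain labels. The patent names
# GMM / SVM / neural network methods; this sketch only shows the idea.
from collections import defaultdict

def train(features, labels):
    # Compute the mean feature vector (centroid) per domain label.
    sums = defaultdict(lambda: None)
    counts = defaultdict(int)
    for x, y in zip(features, labels):
        if sums[y] is None:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(model, x):
    # Pick the domain whose centroid is closest to the feature vector.
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: dist(model[y]))

# Invented 2-dim learning features and domain labels.
features = [[-120.0, -30.0], [-119.0, -31.0], [-140.0, -50.0], [-141.0, -49.0]]
labels = ["navigation", "navigation", "music", "music"]
model = train(features, labels)  # the "domain determination model"
print(predict(model, [-121.0, -29.0]))
```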
  • the determination score 203 is calculated from the input voice data 201 by the determination voice recognition unit 202 (step ST111).
  • each speech recognizer in the determination speech recognition unit 202 is the same as the corresponding recognizer used in the learning step.
  • the scores A to C of the determination score 203 are the first recognition results from each speech recognizer.
  • the determination score 203 is converted into a determination feature amount by the determination feature amount conversion unit 204 (step ST112).
  • the determination feature quantity conversion unit 204 uses the same feature quantity conversion unit as in the learning step.
  • the determination feature amount generated by the determination feature amount conversion unit 204 from the determination score 203 and the domain determination model 107 are input to the domain determination unit 205, and the domain determination result 206 is calculated (step ST113).
  • the domain determination unit 205 uses the same statistical method as the model learning unit 106 in the learning step.
  • the domain determination unit 205 compares the determination feature quantity with the domain determination model 107, selects the domain having the highest occurrence probability, and sets the selected domain and the speech recognition result corresponding to the domain as the domain determination result 206.
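  • the selection performed by the domain determination unit 205 can be sketched as below; the per-domain occurrence probabilities and recognition result strings are invented placeholders for what the model and recognizers would actually produce:

```python
# Sketch of the domain determination unit 205: given per-domain occurrence
# probabilities from the domain determination model (values invented), pick
# the domain with the highest probability and pair it with that domain's
# speech recognition result.
probabilities = {"navigation": 0.72, "music": 0.19, "phone": 0.09}
recognition_results = {
    "navigation": "set destination to Tokyo",
    "music": "play jazz",
    "phone": "call home",
}

best = max(probabilities, key=probabilities.get)
domain_determination_result = (best, recognition_results[best])
print(domain_determination_result)
```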
  • as described above, the first embodiment includes a learning speech recognition unit that calculates a learning score, a value indicating a speech recognition result, from learning speech data; a learning feature amount conversion unit that converts the learning score into a learning feature amount; a model learning unit that calculates a domain determination model indicating the relationship between feature amounts and domains, using the learning feature amount and learning label data defining the domain of each learning utterance; a determination speech recognition unit that calculates a determination score, a value indicating a speech recognition result, from input speech data; a determination feature amount conversion unit that converts the determination score into a determination feature amount; and a domain determination unit that collates the determination feature amount with the domain determination model and calculates a domain determination result indicating the domain of the input speech data. Since the association between score trends and domains can be learned in advance, an improvement in domain determination accuracy can be expected over methods that determine the domain from the scores of the input speech alone.
  • Embodiment 2 In the second embodiment, the N (N is an integer of 2 or more) best recognition results are used for domain determination.
  • FIG. 6 is a configuration diagram of the speech recognition apparatus according to the present embodiment.
  • the speech recognition apparatus includes a learning execution unit 100a and a determination execution unit 200a.
  • the learning execution unit 100a includes a learning speech recognition unit 102a, a learning feature amount conversion unit 104a, and a model learning unit 106.
  • the determination execution unit 200a includes a determination speech recognition unit 202a, a determination feature amount conversion unit 204a, and a domain.
  • a determination unit 205 is provided.
  • the same reference symbols are attached to components identical to those of the first embodiment, and their description is omitted.
  • the learning speech recognition unit 102a in the learning execution unit 100a is a processing unit that uses the learning speech data 101 to calculate the N best learning scores 103a of the recognition results.
  • the learning feature amount conversion unit 104a is a processing unit that converts the N best learning score 103a calculated by the learning speech recognition unit 102a into a learning feature amount.
  • the model learning unit 106 is a processing unit that calculates the domain determination model 107 using the learning feature amount calculated by the learning feature amount conversion unit 104a and the learning label data 105, the label data of the domain corresponding to the learning speech.
  • the determination speech recognition unit 202a and the determination feature amount conversion unit 204a use the same configuration as the learning speech recognition unit 102a and the learning feature amount conversion unit 104a in the learning execution unit 100a.
  • the determination speech recognition unit 202 a is a processing unit that calculates the N best determination score 203 a using the input speech data 201.
  • the determination feature value conversion unit 204a is a processing unit that converts the N best determination score 203a calculated by the determination speech recognition unit 202a into a determination feature value.
  • the domain determination unit 205 is a processing unit that calculates the domain determination result 206 using the determination feature amount calculated by the determination feature amount conversion unit 204 a and the domain determination model 107.
  • the learning speech recognition unit 102a, the learning feature amount conversion unit 104a, the model learning unit 106, the determination speech recognition unit 202a, the determination feature amount conversion unit 204a, and the domain determination unit 205 illustrated in FIG. 6 are realized by the processor 1 of FIG. 2 executing a program stored in the memory 2. The learning speech data 101, the learning score 103a, the learning label data 105, the domain determination model 107, the input speech data 201, the determination score 203a, and the domain determination result 206 are each stored in the storage area of the memory 2. A plurality of processors 1 and memories 2 may be provided and configured to perform the functions described above in cooperation.
  • the N best learning score 103a is calculated from the learning speech data 101 by the learning speech recognition unit 102a (step ST201).
  • the learning speech recognition unit 102a includes a plurality of speech recognizers A to C, and each reads a language model and an acoustic model corresponding to each domain.
  • Scores A1 to C1 and scores A2 to C2 of the learning score 103a are the first and second recognition results obtained from each speech recognizer.
  • three recognizers A to C are used as an example.
  • the learning score 103a is converted into a learning feature quantity by the learning feature quantity conversion unit 104a (step ST202).
  • as a method of conversion to a learning feature amount, the acoustic score and the language score of the N best results are arranged for each domain and vectorized, as shown in FIG. 8. In this example, 2 score types (acoustic score and language score) × 3 domains × 2 best results are converted into a 12-dimensional learning feature amount.
  • the scores used for vectorization are not limited to the acoustic score and the language score; the sum of the two, or any other score obtained from the learning speech recognition unit 102a, may be used.
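  • the N best feature conversion described above might look like the following sketch (score values invented; 2 score types × 3 domains × 2 best = 12 dimensions, as in FIG. 8):

```python
# Sketch of the N-best feature conversion: for each domain, concatenate the
# (acoustic, language) score pairs of the 1st- and 2nd-best results.
# 2 score types x 3 domains x 2 best = 12 dimensions. Values are invented.
nbest_scores = {
    "A": [(-120.5, -30.2), (-122.1, -31.0)],  # (acoustic, language) for 1st/2nd best
    "B": [(-118.9, -35.7), (-119.4, -36.2)],
    "C": [(-125.0, -28.4), (-126.3, -29.1)],
}

def to_nbest_feature(scores, domains=("A", "B", "C"), n=2):
    vec = []
    for d in domains:
        for rank in range(n):
            vec.extend(scores[d][rank])
    return vec

feature = to_nbest_feature(nbest_scores)
print(len(feature))  # 12-dimensional learning feature amount
```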
  • the domain determination model 107 is calculated by the model learning unit 106 using the learning feature amount converted from the learning score 103a and the learning label data 105 (step ST203).
  • the learning label data 105 defines to which domain the learning speech data 101 belongs.
  • the model learning unit 106 calculates the domain determination model 107 so that the learning feature amount obtained by the learning feature amount conversion unit 104 a is associated with the learning label data 105.
  • the N best determination score 203a is calculated from the input speech data 201 by the determination speech recognition unit 202a (step ST211).
  • the determination speech recognition unit 202a uses the same speech recognition unit as the learning speech recognition unit 102a in the learning step.
  • the scores A1 to C1 and scores A2 to C2 of the determination score 203a are the first and second recognition results from each speech recognizer.
  • the determination score 203a is converted into a determination feature value by the determination feature value conversion unit 204a (step ST212).
  • the determination feature value conversion unit 204a uses the same feature value conversion unit as the learning feature value conversion unit 104a in the learning step.
  • the determination feature amount generated by the determination feature amount conversion unit 204a from the determination score 203a and the domain determination model 107 are input to the domain determination unit 205, and the domain determination result 206 is calculated (step ST213).
  • the domain determination unit 205 performs processing using the same statistical method as the model learning unit 106 in the learning step.
  • the domain determination unit 205 collates the feature amount input from the determination feature amount conversion unit 204a with the domain determination model 107, selects the domain having the highest occurrence probability, and sets the selected domain and the speech recognition result corresponding to that domain as the domain determination result 206.
  • as described above, in the second embodiment the N best learning score, a value indicating the N (N is an integer of 2 or more) best speech recognition results, is calculated from the learning speech data, and the N best determination score is likewise calculated from the input speech data. With a determination feature amount conversion unit that converts the N best determination score into a determination feature amount and a domain determination unit that collates the determination feature amount with the domain determination model and calculates a domain determination result indicating the domain of the input speech data, the N best results can be taken into account in the feature amounts used for domain determination.
  • Embodiment 3 In the third embodiment, in addition to the configuration of the second embodiment, dimensional compression of the feature amount is performed.
  • FIG. 10 is a configuration diagram of the speech recognition apparatus according to the present embodiment.
  • the speech recognition apparatus includes a learning execution unit 100b and a determination execution unit 200b.
  • the learning execution unit 100b includes a learning speech recognition unit 102a, a learning feature amount conversion unit 104a, a dimension compression matrix estimation unit 108, a learning dimension compression unit 110, and a model learning unit 106.
  • the determination execution unit 200b includes a determination speech recognition unit 202a, a determination feature amount conversion unit 204a, a determination dimension compression unit 207, and a domain determination unit 205.
  • the same reference symbols are attached to components identical to those of the second embodiment, and their description is omitted.
  • the dimension compression matrix estimation unit 108 in the learning execution unit 100b is a processing unit that calculates the dimension compression matrix 109 using the learning feature amount calculated from the learning feature amount conversion unit 104a and the learning label data 105.
  • the learning dimension compression unit 110 is a processing unit that multiplies the learning feature amount calculated from the learning feature amount conversion unit 104a by the dimension compression matrix 109 to compress the dimension of the learning feature amount.
  • the model learning unit 106 is a processing unit that calculates the domain determination model 107 using the learning feature amount compressed by the learning dimension compression unit 110 and the learning label data 105.
  • the determination speech recognition unit 202a and the determination feature amount conversion unit 204a use the same configuration as the learning speech recognition unit 102a and the learning feature amount conversion unit 104a of the learning execution unit 100b.
  • the determination dimension compression unit 207 is a processing unit that multiplies the determination feature amount calculated from the determination feature amount conversion unit 204a by the dimension compression matrix 109 to compress the dimension of the determination feature amount.
  • the dimensional compression matrix 109 is matrix data for performing dimensional compression of multidimensional feature values.
  • the domain determination unit 205 is a processing unit that calculates the domain determination result 206 using the determination feature amount calculated by the determination dimension compression unit 207 and the domain determination model 107.
  • the learning speech recognition unit 102a, the learning feature amount conversion unit 104a, the dimension compression matrix estimation unit 108, the learning dimension compression unit 110, the model learning unit 106, the determination speech recognition unit 202a, the determination feature amount conversion unit 204a, the determination dimension compression unit 207, and the domain determination unit 205 are each realized by the processor 1 executing a program stored in the memory 2.
  • the learning speech data 101, the learning score 103a, the learning label data 105, the domain determination model 107, the dimension compression matrix 109, the input speech data 201, the determination score 203a, and the domain determination result 206 are stored in the memory 2, respectively. It is stored in the area.
  • a plurality of processors 1 and memories 2 may be provided, and the plurality of processors 1 and the memories 2 may be configured to perform the functions described above in cooperation.
  • a learning score 103a is calculated from the learning speech data 101 by the learning speech recognition unit 102a (step ST301).
  • the learning speech recognition unit 102a includes a plurality of speech recognizers A to C, and each reads a language model and an acoustic model corresponding to each domain.
  • Scores A1 to C1 and scores A2 to C2 of the learning score 103a are the first and second recognition results obtained from each speech recognizer.
  • three recognizers A to C are used as an example.
  • the number of recognizers may be changed according to the number of domains, or the number of N best results of recognition may be changed.
  • the learning score 103a is converted into a learning feature quantity by the learning feature quantity conversion unit 104a (step ST302).
  • as a specific method of conversion to a learning feature amount, a method of arranging the acoustic score and the language score of the N best results for each domain and vectorizing them, as shown in FIG., is conceivable.
  • the scores used for vectorization are not limited to the acoustic score and the language score; the sum of the two, or any other score obtained from the learning speech recognition unit 102a, may be used.
  • the dimension compression matrix 109 is estimated by the dimension compression matrix estimation unit 108 using the learning feature amount converted from the learning score 103a and the learning label data 105 (step ST303).
  • the dimension compression matrix is calculated by applying a dimension compression method such as linear discriminant analysis (LDA: Linear Discriminant Analysis) or heteroscedastic discriminant analysis (HDA: Heteroscedastic Discriminant Analysis) to the feature vector obtained from the N best scores.
  • Advantages of dimensional compression include that supervised methods such as LDA and HDA can generate feature quantities suitable for identification, and a reduction in the number of model parameters when modeling with a mixed Gaussian distribution.
  • the learning dimension compression unit 110 dimensionally compresses the learning feature amount converted from the learning score 103a (step ST304). As shown in FIG. 12, dimension compression converts the feature amount obtained from the N best scores into a low-order vector by multiplying it by the dimension compression matrix 109. In the example of FIG. 12, the recognition results from first place to third place are used.
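  • the multiplication by the dimension compression matrix can be sketched as plain matrix-vector arithmetic; the fixed 3×12 matrix below is an illustrative stand-in for one that would actually be estimated by LDA or HDA, and the feature values are dummies:

```python
# Sketch of the dimension compression step (FIG. 12): multiply a 12-dim
# N-best feature vector by a compression matrix to get a low-order vector.
# In the patent the matrix is estimated with LDA or HDA; here a fixed
# illustrative 3x12 averaging matrix stands in for the estimated one.
feature = [float(i) for i in range(12)]          # 12-dim feature (dummy values)
matrix = [[1.0 / 12] * 12 for _ in range(3)]     # 3x12 dimension compression matrix

def compress(matrix, vec):
    # Plain matrix-vector product: one output component per matrix row.
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

compressed = compress(matrix, feature)
print(len(compressed))  # 3-dimensional compressed feature amount
```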
  • the domain determination model 107 is learned by the model learning unit 106 using the learning feature amount dimension-compressed by the learning dimension compression unit 110 and the learning label data 105 (step ST305).
  • the model learning unit 106 calculates a model so as to associate the learning feature amount dimensionally compressed by the learning dimensional compression unit 110 with the learning label data 105.
  • a determination score 203a is calculated from the input voice data 201 by the determination voice recognition unit 202a (step ST311).
  • the determination speech recognition unit 202a uses the same speech recognition unit as the learning speech recognition unit 102a in the learning step.
  • the scores A1 to C1 and scores A2 to C2 of the determination score 203a are the first and second recognition results from each speech recognizer.
  • the determination score 203a is converted into a determination feature value by the determination feature value conversion unit 204a (step ST312).
  • the determination feature quantity conversion unit 204a uses the same configuration as the learning feature quantity conversion unit 104a in the learning step.
  • the determination dimension compression unit 207 dimensionally compresses the determination feature amount converted from the determination score 203a (step ST313). The dimensional compression multiplies the feature amount obtained from the N best scores by the dimension compression matrix 109, as shown in FIG., to convert it into a low-order feature amount.
  • the domain determination unit 205 obtains the domain determination result 206 from the feature quantity dimension-compressed by the determination dimension compression unit 207 and the domain determination model 107 (step ST314).
  • the domain determination unit 205 performs processing using the same statistical method as in the learning step.
  • the domain determination unit 205 collates the determination feature amount dimension-compressed by the determination dimension compression unit 207 with the domain determination model 107, selects the domain having the highest occurrence probability, and sets the selected domain and the speech recognition result corresponding to that domain as the domain determination result 206.
  • as described above, in the third embodiment the N best learning score, a value indicating the N (N is an integer of 2 or more) best speech recognition results, is calculated from the learning speech data; a learning feature amount conversion unit converts the N best learning score into a learning feature amount; a dimension compression matrix estimation unit calculates a dimension compression matrix from the learning feature amount and the learning label data defining the domain of each learning utterance; a learning dimension compression unit compresses the dimension of the learning feature amount; and a model learning unit calculates a domain determination model indicating the relationship between feature amounts and domains from the compressed learning feature amount and the learning label data.
  • on the determination side, a determination speech recognition unit calculates the N best determination score, a value indicating the N best speech recognition results; a determination feature amount conversion unit converts it into a determination feature amount; a determination dimension compression unit compresses the dimension of the determination feature amount using the dimension compression matrix; and a domain determination unit collates the compressed determination feature amount with the domain determination model and calculates a domain determination result indicating the domain of the input speech data.
  • because the feature amount is compressed to a low dimension, feature amounts suited to discrimination can be handled, and the number of model parameters can be reduced depending on the type of model.
  • since the dimension compression matrix estimation unit receives the feature amounts and teacher labels and outputs a matrix that converts them to a low dimension, supervised methods such as LDA and HDA can generate feature amounts suitable for discrimination.
  • Embodiment 4 In the fourth embodiment, N (N is an integer of 2 or more) best recognition results are generated and a domain determination model is generated for each N best rank.
  • FIG. 14 is a configuration diagram of the speech recognition apparatus according to the present embodiment.
  • the speech recognition apparatus includes a learning execution unit 100c and a determination execution unit 200c.
  • the learning execution unit 100c includes a learning speech recognition unit 102a, a first learning feature amount conversion unit 104b, a second learning feature amount conversion unit 104c, a first model learning unit 106a, and a second model learning unit 106b.
  • the determination execution unit 200c includes a determination speech recognition unit 202a, a first determination feature amount conversion unit 204b, a second determination feature amount conversion unit 204c, a first domain determination unit 205a, a second domain determination unit 205b, and a domain determination unit 208.
  • components identical to those of the embodiments described above are denoted by the same reference symbols, and their description is omitted.
  • the first learning feature amount conversion unit 104b and the second learning feature amount conversion unit 104c each have the same configuration as the learning feature amount conversion unit 104 of the first embodiment.
  • the first learning feature amount conversion unit 104b is configured to convert the scores A1 to C1 of the first recognition results into a feature amount.
  • the second learning feature amount conversion unit 104c is configured to convert the scores A2 to C2 of the second recognition results into a feature amount.
  • the first model learning unit 106a and the second model learning unit 106b have the same configuration as the model learning unit 106 of the first embodiment.
  • the first model learning unit 106a is configured to calculate the first domain determination model 107a using the learning feature amount calculated by the first learning feature amount conversion unit 104b and the learning label data 105.
  • the second model learning unit 106b is configured to calculate the second domain determination model 107b using the learning feature amount calculated by the second learning feature amount conversion unit 104c and the learning label data 105.
  • here, the configuration for N = 2 is shown as an example, but N can be set to any value.
  • the determination speech recognition unit 202a, the first determination feature amount conversion unit 204b, and the second determination feature amount conversion unit 204c have the same configurations as the learning speech recognition unit 102a, the first learning feature amount conversion unit 104b, and the second learning feature amount conversion unit 104c in the learning execution unit 100c, respectively.
  • the first domain determination unit 205a uses the determination feature amount calculated by the first determination feature amount conversion unit 204b and the first domain determination model 107a to calculate the first domain determination result 206a.
  • the second domain determination unit 205b is a processing unit that calculates the second domain determination result 206b using the determination feature amount calculated by the second determination feature amount conversion unit 204c and the second domain determination model 107b.
  • the domain determination unit 208 is a processing unit that calculates the domain final determination result 209 using the first domain determination result 206a and the second domain determination result 206b.
  • the determination speech recognition unit 202a, the first determination feature amount conversion unit 204b, the second determination feature amount conversion unit 204c, the first domain determination unit 205a, the second domain determination unit 205b, and the domain determination unit 208 are each realized by the processor 1 shown in FIG. 2 executing a program stored in the memory 2.
  • the learning speech data 101, the learning score 103a, the learning label data 105, the domain determination model 107, the input speech data 201, the determination score 203a, the domain determination result 206, and the domain final determination result 209 are each stored in a storage area of the memory 2.
  • a plurality of processors 1 and memories 2 may be provided, and the plurality of processors 1 and the memories 2 may be configured to perform the functions described above in cooperation.
  • the N best learning score 103a is calculated from the learning speech data 101 by the learning speech recognition unit 102a (step ST401).
  • the learning speech recognition unit 102a includes a plurality of speech recognizers A to C, and each reads a language model and an acoustic model corresponding to each domain.
  • Scores A1 to C1 and scores A2 to C2 of the learning score 103a are the first and second recognition results obtained from each speech recognizer.
  • three recognizers A to C are used as an example.
  • the learning score 103a is converted into each learning feature amount by the first learning feature amount conversion unit 104b and the second learning feature amount conversion unit 104c every N best (step ST402).
  • as a specific method of converting to a learning feature amount, as shown in FIG. 4, a method of vectorizing the acoustic score and the language score by arranging the N best scores for each domain is conceivable.
  • the score required for vectorization is not limited to the acoustic score and the language score, but may be anything obtained by adding the acoustic score and the language score, or any other score obtained from the learning speech recognition unit 102a.
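The per-rank feature conversion described above can be sketched as follows; the domain names and score values are illustrative, not taken from the publication.

```python
# Embodiment-4-style feature conversion: for each N best rank n, arrange the
# (acoustic, language) scores of the n-th recognition result of every domain
# recognizer into its own feature vector. All values are illustrative.

def nbest_features(nbest_scores, domain_order, n_best):
    """nbest_scores: {domain: [(acoustic, language) per rank]} -> one vector per rank."""
    features = []
    for rank in range(n_best):
        vec = []
        for domain in domain_order:
            acoustic, language = nbest_scores[domain][rank]
            vec.extend([acoustic, language])
        features.append(vec)
    return features

scores = {  # two N best hypotheses from each of recognizers A to C (toy data)
    "A": [(-120.5, -30.2), (-121.0, -31.0)],
    "B": [(-118.9, -35.7), (-119.4, -36.0)],
    "C": [(-125.0, -28.1), (-126.2, -29.3)],
}
features = nbest_features(scores, ["A", "B", "C"], n_best=2)
# yields two 6-dimensional vectors, one per N best rank
```

Because each rank gets its own vector (and, later, its own model), the feature dimensionality stays at 2 × (number of domains) regardless of N.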
  • the first model learning unit 106a and the second model learning unit 106b use the learning feature amounts converted from the learning score 103a, together with the learning label data 105, for each N best rank to obtain the first domain determination model 107a and the second domain determination model 107b (step ST403). That is, the first model learning unit 106a and the second model learning unit 106b each calculate a model so as to associate the learning feature amounts obtained by the first learning feature amount conversion unit 104b and the second learning feature amount conversion unit 104c with the learning label data 105.
  • the N best determination score 203a is calculated from the input speech data 201 by the determination speech recognition unit 202a (step ST411).
  • the determination speech recognition unit 202a uses the same speech recognition unit as the learning speech recognition unit 102a in the learning step.
  • the scores A1 to C1 and scores A2 to C2 of the determination score 203a are the first and second recognition results from each speech recognizer.
  • the determination score 203a is converted into determination feature values every N best by the first determination feature value conversion unit 204b and the second determination feature value conversion unit 204c (step ST412).
  • the first determination feature amount conversion unit 204b and the second determination feature amount conversion unit 204c use the same feature amount converters as the first learning feature amount conversion unit 104b and the second learning feature amount conversion unit 104c in the learning step.
  • the first domain determination unit 205a and the second domain determination unit 205b acquire, for each N best rank, the determination feature amounts generated by the first determination feature amount conversion unit 204b and the second determination feature amount conversion unit 204c, together with the first domain determination model 107a and the second domain determination model 107b, and obtain the N best domain determination results (the first domain determination result 206a and the second domain determination result 206b) (step ST413).
  • the first domain determination unit 205a and the second domain determination unit 205b use the same statistical method as the first model learning unit 106a and the second model learning unit 106b in the learning step.
  • the first domain determination unit 205a and the second domain determination unit 205b collate the determination feature amounts generated by the first determination feature amount conversion unit 204b and the second determination feature amount conversion unit 204c against the first domain determination model 107a and the second domain determination model 107b, respectively, output the domain having the highest occurrence probability, and set that domain and the recognition result corresponding to it as the first domain determination result 206a and the second domain determination result 206b.
  • domain determining section 208 obtains domain final determination result 209 from N best domain determination results (first domain determination result 206a and second domain determination result 206b) (step ST414).
  • as a domain determination method, a simple majority vote of the N best domain determination results as shown in FIG. 17, or a majority vote with weights according to the rank of each domain determination result, is available. In the example of FIG. 17, the recognition results from the first place to the third place are shown.
  • a model is generated for each N best rank, so that the appearance of scores at an arbitrary rank can be modeled while the increase in the number of dimensions of the feature amount is suppressed. Further, by integrating the determination results of the N best domains by a method such as majority vote, dependence on only the top recognition result can be suppressed.
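The integration in step ST414 can be sketched as follows; the per-rank results and the weights are illustrative values, not from the publication.

```python
from collections import Counter

# Integrate the per-rank domain determination results: a simple majority vote,
# with an optional weight per N best rank. Results and weights are illustrative.

def integrate(domain_results, weights=None):
    """domain_results: domain decided for each N best rank (rank 1 first)."""
    if weights is None:
        weights = [1.0] * len(domain_results)  # simple (unweighted) majority
    tally = Counter()
    for domain, weight in zip(domain_results, weights):
        tally[domain] += weight
    return tally.most_common(1)[0][0]

per_rank = ["name", "address", "address"]          # results for ranks 1 to 3
final = integrate(per_rank)                        # simple majority favors "address"
weighted = integrate(per_rank, weights=[3, 1, 1])  # rank-1 weight flips it to "name"
```

The weighted variant shows why rank-dependent weights matter: a confident first-place result can outvote two agreeing lower-ranked results.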
  • a learning speech recognition unit that calculates an N best learning score, which is a value indicating the N (N is an integer of 2 or more) best speech recognition results of the learning speech data; a learning feature amount conversion unit that converts the N best learning score into a learning feature amount for each N best rank; a model learning unit that calculates, for each N best rank, a domain determination model indicating the relationship between a feature amount and a domain, using the learning feature amount for each rank and learning label data defining which domain the learning speech data is an utterance of; a determination speech recognition unit that calculates an N best determination score, which is a value indicating the N best speech recognition results of input speech data; a determination feature amount conversion unit that converts the N best determination score into a determination feature amount for each N best rank; and a domain determination unit that collates the determination feature amount for each N best rank with the domain determination model for each N best rank and calculates a domain determination result for each rank.
  • the speech recognition device relates to a configuration for determining which domain the input speech is an utterance of, and is suitable for application to navigation devices, home appliances, and the like, in order to improve speech recognition performance.
  • 100, 100a, 100b, 100c learning execution unit, 101 learning speech data, 102, 102a learning speech recognition unit, 103, 103a learning score, 104, 104a learning feature amount conversion unit, 104b first learning feature amount conversion unit, 104c second learning feature amount conversion unit, 105 learning label data, 106 model learning unit, 106a first model learning unit, 106b second model learning unit, 107 domain determination model, 107a first domain determination model, 107b second domain determination model, 108 dimension compression matrix estimation unit, 109 dimension compression matrix, 110 learning dimension compression unit, 200, 200a, 200b, 200c determination execution unit, 201 input speech data, 202, 202a determination speech recognition unit, 203, 203a determination score, 204, 204a determination feature amount conversion unit, 204b first determination feature amount conversion unit, 204c second determination feature amount conversion unit, 205 domain determination unit, 205a first domain determination unit, 205b second domain determination unit, 206 domain determination result, 206a first domain determination result,

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

A model learning unit (106) in a learning execution unit (100) calculates a domain determination model (107) representing a relationship between a feature amount and a domain using learning label data (105). A determination execution unit (200) performs speech recognition on input speech data (201) via a determination speech recognition unit (202) and calculates scores (203) as the speech recognition results. A determination feature amount conversion unit (204) calculates a feature amount based on the scores (203). A domain determination unit (205) calculates a domain determination result (206) indicating the domain represented by the input speech data by applying the domain determination model (107) to the feature amount calculated by the determination feature amount conversion unit (204).

Description

Speech recognition device
The present invention relates to a speech recognition device that determines which domain an input speech utterance belongs to.
In a speech recognition device whose recognition targets span a plurality of domains indicating categories such as address, name, and telephone number, the following method has been used to obtain the speech recognition result of the desired domain while determining which domain the input speech belongs to. First, a recognition result is calculated for each domain by speech recognition, and then the recognition results of the domains are compared with one another by score to obtain the final recognition result. For example, in the method disclosed in Patent Document 1, speech recognition results are first obtained by a plurality of speech recognition systems using statistical language models prepared for the different domains. As the reliability indicating which of the recognition results obtained by the recognition systems of the respective domains is closest to the domain of the utterance, a score given by the weighted sum of the acoustic score S_AM and the language score S_LM obtained at the time of speech recognition is used:

score = S_AM + α·S_LM
Here, α is a coefficient that controls the relative influence of the acoustic score and the language score, and is determined experimentally so as to reduce utterance-domain errors. The domain of the recognition result whose score by the above expression is maximum is determined to be the optimal domain, and that recognition result is presented as the optimal recognition result.
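As a concrete illustration of this conventional rule, the following sketch selects the domain that maximizes score = S_AM + α·S_LM; all score values and the weight α are illustrative, not taken from the publication.

```python
# Conventional domain selection: weighted sum of acoustic and language scores,
# score = S_AM + alpha * S_LM; the domain with the maximum score wins.
# All numeric values here are illustrative.

def select_domain(scores, alpha):
    """scores: {domain: (acoustic_score, language_score)} -> best domain."""
    return max(scores, key=lambda d: scores[d][0] + alpha * scores[d][1])

recognizer_scores = {
    "address": (-120.5, -30.2),
    "name": (-118.9, -35.7),
    "phone_number": (-125.0, -28.1),
}
alpha = 0.8  # experimentally tuned weight (illustrative)
best = select_domain(recognizer_scores, alpha)
```

Note that the result depends directly on the hand-tuned α, which is exactly the weakness the invention addresses by learning the score-to-domain correspondence instead.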
Patent Document 1: International Publication No. WO 2015/118645
In the conventional speech recognition device described above, the weighted sum of the score obtained at recognition time and the score obtained from the recognition result is taken, and the optimal domain is determined by the magnitude of that sum. However, the weighting coefficient of the weighted sum must be determined empirically, and, depending on the utterance, the score difference between domains can be small, making discrimination by score magnitude alone difficult.
The present invention has been made to solve these problems, and an object thereof is to provide a speech recognition device capable of improving domain determination accuracy and thereby improving speech recognition accuracy.
The speech recognition device according to the present invention includes: a learning speech recognition unit that calculates a learning score, which is a value indicating a speech recognition result, from learning speech data; a learning feature amount conversion unit that converts the learning score into a learning feature amount; a model learning unit that calculates a domain determination model indicating the relationship between a feature amount and a domain, using the learning feature amount and learning label data defining which domain the learning speech data is an utterance of; a determination speech recognition unit that calculates a determination score, which is a value indicating a speech recognition result, from input speech data; a determination feature amount conversion unit that converts the determination score into a determination feature amount; and a domain determination unit that collates the determination feature amount with the domain determination model and calculates a domain determination result indicating which domain the input speech data is an utterance of.
The speech recognition device according to the present invention calculates a domain determination model indicating the relationship between a feature amount and a domain using learning label data that defines which domain the learning speech data is an utterance of, and determines which domain the input speech data belongs to using this domain determination model. As a result, domain determination accuracy, and hence speech recognition performance, can be improved compared with the conventional approach of choosing the optimal domain by the magnitude of recognition scores.
FIG. 1 is a configuration diagram showing a speech recognition device according to Embodiment 1 of the present invention.
FIG. 2 is a hardware configuration diagram of the speech recognition device according to Embodiment 1 of the present invention.
FIG. 3 is a flowchart showing the flow of the domain determination model learning step of the speech recognition device according to Embodiment 1 of the present invention.
FIG. 4 is an explanatory diagram showing means for converting scores into feature amounts in the speech recognition device according to Embodiment 1 of the present invention.
FIG. 5 is a flowchart showing the flow of the domain determination step of the speech recognition device according to Embodiment 1 of the present invention.
FIG. 6 is a configuration diagram showing a speech recognition device according to Embodiment 2 of the present invention.
FIG. 7 is a flowchart showing the flow of the domain determination model learning step of the speech recognition device according to Embodiment 2 of the present invention.
FIG. 8 is an explanatory diagram showing means for converting scores into feature amounts in the speech recognition device according to Embodiment 2 of the present invention.
FIG. 9 is a flowchart showing the flow of the domain determination step of the speech recognition device according to Embodiment 2 of the present invention.
FIG. 10 is a configuration diagram showing a speech recognition device according to Embodiment 3 of the present invention.
FIG. 11 is a flowchart showing the flow of the domain determination model learning step of the speech recognition device according to Embodiment 3 of the present invention.
FIG. 12 is an explanatory diagram showing means for dimensionally compressing feature amounts in the speech recognition device according to Embodiment 3 of the present invention.
FIG. 13 is a flowchart showing the flow of the domain determination step of the speech recognition device according to Embodiment 3 of the present invention.
FIG. 14 is a configuration diagram showing a speech recognition device according to Embodiment 4 of the present invention.
FIG. 15 is a flowchart showing the flow of the domain determination model learning step of the speech recognition device according to Embodiment 4 of the present invention.
FIG. 16 is a flowchart showing the flow of the domain determination step of the speech recognition device according to Embodiment 4 of the present invention.
FIG. 17 is an explanatory diagram showing means for integrating a plurality of domain determination results in the speech recognition device according to Embodiment 4 of the present invention.
Hereinafter, in order to describe the present invention in more detail, modes for carrying out the invention will be described with reference to the accompanying drawings.
Embodiment 1.
FIG. 1 is a configuration diagram of the speech recognition device according to Embodiment 1. As illustrated, the speech recognition device according to the present embodiment includes a learning execution unit 100 and a determination execution unit 200. The learning execution unit 100 includes a learning speech recognition unit 102, a learning feature amount conversion unit 104, and a model learning unit 106; the determination execution unit 200 includes a determination speech recognition unit 202, a determination feature amount conversion unit 204, and a domain determination unit 205.
The learning speech recognition unit 102 in the learning execution unit 100 is a processing unit that calculates the learning score 103 using the learning speech data 101. The learning feature amount conversion unit 104 is a processing unit that converts the learning score 103 calculated by the learning speech recognition unit 102 into a learning feature amount. The model learning unit 106 is a processing unit that calculates the domain determination model 107 using the learning feature amount calculated by the learning feature amount conversion unit 104 and the learning label data 105 of the domain corresponding to the learning speech.
In the determination execution unit 200, the determination speech recognition unit 202 and the determination feature amount conversion unit 204 are the same as their counterparts in the learning execution unit 100. That is, the determination speech recognition unit 202 has the same configuration as the learning speech recognition unit 102 and is a processing unit that calculates the determination score 203 using the input speech data 201. The determination feature amount conversion unit 204 is a processing unit that converts the determination score 203 calculated by the determination speech recognition unit 202 into a determination feature amount. The domain determination unit 205 is a processing unit that calculates the domain determination result 206 using the determination feature amount calculated by the determination feature amount conversion unit 204 and the domain determination model 107.
FIG. 2 is a hardware configuration diagram of the speech recognition device according to Embodiment 1.
The speech recognition device is realized using a computer and includes a processor 1, a memory 2, an input/output interface (input/output I/F) 3, and a bus 4. The processor 1 is a functional unit that performs arithmetic processing as a computer; the memory 2 is a storage unit that stores various programs and computation results and constitutes a work area for the arithmetic processing of the processor 1. The input/output interface 3 is an interface for inputting the learning speech data 101 and the input speech data 201 and for outputting the domain determination result 206 to the outside. The bus 4 interconnects the processor 1, the memory 2, and the input/output interface 3.
The learning speech recognition unit 102, the learning feature amount conversion unit 104, the model learning unit 106, the determination speech recognition unit 202, the determination feature amount conversion unit 204, and the domain determination unit 205 shown in FIG. 1 are each realized by the processor 1 executing a program stored in the memory 2. The learning speech data 101, the learning score 103, the learning label data 105, the domain determination model 107, the input speech data 201, the determination score 203, and the domain determination result 206 are each stored in a storage area of the memory 2. A plurality of processors 1 and memories 2 may be provided, with the plurality of processors 1 and memories 2 cooperating to execute the functions described above.
Next, the operation of the speech recognition device according to Embodiment 1 will be described.
First, the domain determination model learning step performed by the learning execution unit 100 will be described with reference to the flowchart of FIG. 3.
In the learning step, the learning speech recognition unit 102 first performs speech recognition on the learning speech data 101 and calculates the learning score 103 (step ST101). Here, the learning speech recognition unit 102 includes a plurality of speech recognizers A to C, each of which loads a language model and an acoustic model corresponding to its domain. The scores A to C of the learning score 103 are the first-place recognition results from the speech recognizers A to C. As the learning score 103, an acoustic score or a language score, for example, can be used. In the present embodiment, three speech recognizers A to C are used as an example, but the number can be selected as appropriate according to the number of domains.
Next, the learning feature amount conversion unit 104 converts the learning score 103 into a learning feature amount (step ST102). As a specific conversion method, as shown in FIG. 4, the acoustic score and the language score can be arranged for each domain and vectorized. In the example shown in FIG. 4, this yields 2 (acoustic score + language score) × the number of domains = 6 dimensions. The scores used for vectorization are not limited to the acoustic score and the language score; the sum of the acoustic score and the language score, or any other value obtained from the learning speech recognition unit 102, may be used.
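The vectorization of FIG. 4 can be sketched as follows; the domain names and score values are illustrative, not taken from the publication.

```python
# Arrange the acoustic and language scores of the top recognition result of
# each domain recognizer into one feature vector: 2 scores x 3 domains = 6 dims.
# Score values and domain names are illustrative.

def scores_to_feature(scores_by_domain, domain_order):
    """scores_by_domain: {domain: (acoustic, language)} -> flat feature vector."""
    feature = []
    for domain in domain_order:
        acoustic, language = scores_by_domain[domain]
        feature.extend([acoustic, language])
    return feature

domains = ["A", "B", "C"]  # recognizers A to C
scores = {"A": (-120.5, -30.2), "B": (-118.9, -35.7), "C": (-125.0, -28.1)}
feature = scores_to_feature(scores, domains)  # 6-dimensional, as in FIG. 4
```

Keeping a fixed domain order matters: the learned model associates each vector position with a specific recognizer, so learning and determination must use the same ordering.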
Next, the model learning unit 106 calculates the domain determination model 107 using the learning feature amounts converted from the learning score 103 and the learning label data 105 (step ST103). Here, the learning label data 105 defines which domain each utterance of the learning speech data 101 belongs to. The model learning unit 106 calculates a model that associates the learning feature amounts obtained by the learning feature amount conversion unit 104 with the learning label data 105. As the method used by the model learning unit 106, statistical techniques such as a Gaussian mixture model, a support vector machine, or a neural network can be used.
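As a minimal sketch of step ST103, the following fits one diagonal-covariance Gaussian per domain, a deliberate simplification of the Gaussian mixture model, support vector machine, or neural network options named above; the feature values and labels are illustrative.

```python
# Minimal stand-in for the model learning unit: fit one diagonal-covariance
# Gaussian per domain to the learning feature vectors. This simplifies the
# statistical methods named in the text; data values are illustrative.

def learn_domain_model(features, labels):
    """features: list of vectors, labels: parallel domain names -> model dict."""
    model = {}
    for domain in set(labels):
        rows = [f for f, l in zip(features, labels) if l == domain]
        dim = len(rows[0])
        means = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
        varis = [max(sum((r[d] - means[d]) ** 2 for r in rows) / len(rows),
                     1e-6)  # variance floor keeps the model usable
                 for d in range(dim)]
        model[domain] = (means, varis)
    return model

learning_features = [[-120.0, -30.0], [-121.0, -31.0],   # "address" utterances
                     [-118.0, -36.0], [-117.0, -35.0]]   # "name" utterances
learning_labels = ["address", "address", "name", "name"]
domain_model = learn_domain_model(learning_features, learning_labels)
```

The resulting per-domain (means, variances) pairs play the role of the domain determination model 107: they summarize how score patterns tend to look for each domain.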
In this way, the learning execution unit 100 feeds the learning speech data 101 to a plurality of speech recognizers, converts the obtained recognition scores into learning feature amounts, and, using these learning feature amounts together with the learning label data 105 indicating the domain of each utterance, models the correspondence between score patterns and domains within a statistical machine learning framework.
Next, the domain determination step performed by the determination execution unit 200 will be described with reference to the flowchart of FIG. 5.
In the determination step, the determination speech recognition unit 202 first calculates the determination score 203 from the input speech data 201 (step ST111). Here, each recognizer in the determination speech recognition unit 202 is the same as the corresponding recognizer used in the learning step. The scores A to C of the determination score 203 are the first-place recognition results from the respective speech recognizers.
Next, the determination feature amount conversion unit 204 converts the determination score 203 into a determination feature amount (step ST112). The determination feature amount conversion unit 204 uses the same feature amount conversion as in the learning step.
Next, the determination feature amount generated from the determination score 203 by the determination feature amount conversion unit 204 and the domain determination model 107 are input to the domain determination unit 205, which calculates the domain determination result 206 (step ST113). The domain determination unit 205 uses the same statistical technique as the model learning unit 106 in the learning step. The domain determination unit 205 collates the determination feature amount with the domain determination model 107, selects the domain with the highest occurrence probability, and outputs the selected domain and the speech recognition result corresponding to that domain as the domain determination result 206.
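Step ST113 can be sketched as follows: a determination feature amount is collated with a Gaussian domain determination model (hand-written here for illustration; the publication names Gaussian mixtures, SVMs, and neural networks as options) and the domain with the highest likelihood is output.

```python
import math

# Sketch of the domain determination step: score a determination feature vector
# against each domain's (means, variances) and output the most likely domain.
# The model parameters and the feature vector are illustrative.

def log_likelihood(feature, means, varis):
    """Diagonal-covariance Gaussian log-likelihood of one feature vector."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(feature, means, varis))

domain_model = {  # {domain: (means, variances)}, illustrative parameters
    "address": ([-120.0, -30.0], [4.0, 2.0]),
    "name": ([-118.0, -36.0], [4.0, 2.0]),
}
feature = [-119.5, -30.5]  # determination feature amount for one utterance
result = max(domain_model,
             key=lambda d: log_likelihood(feature, *domain_model[d]))
```

Because the same converter and the same statistical family are used in learning and determination, the likelihood comparison is meaningful; here the feature sits near the "address" distribution, so that domain is selected.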
As described above, the speech recognition apparatus of Embodiment 1 includes a learning speech recognition unit that calculates, from learning speech data, learning scores that are values indicating speech recognition results; a learning feature conversion unit that converts the learning scores into learning features; a model learning unit that calculates a domain determination model representing the relationship between features and domains, using the learning features and learning label data that defines which domain each learning utterance belongs to; a determination speech recognition unit that calculates, from input speech data, determination scores that are values indicating speech recognition results; a determination feature conversion unit that converts the determination scores into determination features; and a domain determination unit that matches the determination features against the domain determination model and calculates a domain determination result indicating which domain the input speech data belongs to. Because the correspondence between score tendencies and domains can be learned in advance, higher domain determination accuracy can be expected than with methods that determine the domain directly from the scores obtained from the input speech data.
Embodiment 2.
Embodiment 2 is an example in which each speech recognizer of the learning speech recognition unit and the determination speech recognition unit generates the N-best recognition results (N is an integer of 2 or more), so that lower-ranked results are also taken into account when determining the domain.
FIG. 6 is a configuration diagram of the speech recognition apparatus according to this embodiment.
As shown in the figure, the speech recognition apparatus according to this embodiment includes a learning execution unit 100a and a determination execution unit 200a. The learning execution unit 100a includes a learning speech recognition unit 102a, a learning feature conversion unit 104a, and a model learning unit 106; the determination execution unit 200a includes a determination speech recognition unit 202a, a determination feature conversion unit 204a, and a domain determination unit 205. Components identical to those of Embodiment 1 are given the same reference numerals, and their description is omitted or simplified.
The learning speech recognition unit 102a in the learning execution unit 100a is a processing unit that uses the learning speech data 101 to calculate learning scores 103a for the top N recognition results. The learning feature conversion unit 104a is a processing unit that converts the N-best learning scores 103a calculated by the learning speech recognition unit 102a into learning features. The model learning unit 106 is a processing unit that calculates the domain determination model 107 using the learning features calculated by the learning feature conversion unit 104a and the learning label data 105, which is the label data of the domains corresponding to the learning utterances.
In the determination execution unit 200a, the determination speech recognition unit 202a and the determination feature conversion unit 204a have the same configuration as the learning speech recognition unit 102a and the learning feature conversion unit 104a in the learning execution unit 100a. The determination speech recognition unit 202a is a processing unit that calculates the N-best determination scores 203a from the input speech data 201. The determination feature conversion unit 204a is a processing unit that converts the N-best determination scores 203a calculated by the determination speech recognition unit 202a into determination features. The domain determination unit 205 is a processing unit that calculates the domain determination result 206 using the determination features calculated by the determination feature conversion unit 204a and the domain determination model 107.
The learning speech recognition unit 102a, learning feature conversion unit 104a, model learning unit 106, determination speech recognition unit 202a, determination feature conversion unit 204a, and domain determination unit 205 shown in FIG. 6 are each realized by the processor 1 shown in FIG. 2 executing a program stored in the memory 2. The learning speech data 101, learning scores 103a, learning label data 105, domain determination model 107, input speech data 201, determination scores 203a, and domain determination result 206 are each stored in a storage area of the memory 2. A plurality of processors 1 and memories 2 may be provided, and the plurality of processors 1 and memories 2 may cooperate to execute the functions described above.
Next, the operation of the speech recognition apparatus of Embodiment 2 will be described.
First, the domain determination model learning step performed by the learning execution unit 100a will be described using the flowchart of FIG. 7.
In the learning step, first, the learning speech recognition unit 102a calculates the N-best learning scores 103a from the learning speech data 101 (step ST201). The learning speech recognition unit 102a consists of a plurality of speech recognizers A to C, each of which has loaded a language model and an acoustic model corresponding to its domain. Scores A1 to C1 and scores A2 to C2 of the learning scores 103a are the first- and second-ranked recognition results obtained from the respective speech recognizers. Although this embodiment uses three recognizers A to C as an example, the number of recognizers may be changed according to the number of domains, and the number N of N-best results may also be changed.
Next, the learning feature conversion unit 104a converts the learning scores 103a into a learning feature (step ST202). One concrete conversion method, shown in FIG. 8, is to arrange the acoustic and language scores of the N-best results for each domain into a vector. The illustrated example converts 2 (acoustic score + language score) × 3 domains × 2-best into a 12-dimensional learning feature. The scores used for vectorization are not limited to the acoustic score and the language score; the sum of the two, or any other value obtainable from the learning speech recognition unit 102a, may be used.
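The vectorization of FIG. 8 can be sketched as follows. The score values are hypothetical placeholders, and the exact ordering of the per-domain (acoustic, language) pairs is an assumption, since the text states only that the scores are arranged per domain and rank:

```python
# Build the 12-dimensional learning feature of FIG. 8:
# 2 scores (acoustic + language) x 3 domains (A-C) x 2-best hypotheses.
# All score values below are hypothetical placeholders.
nbest_scores = {
    "A": [(-1523.4, -210.7), (-1540.2, -215.3)],  # domain A: (acoustic, language) for 1st/2nd best
    "B": [(-1611.8, -198.2), (-1625.0, -201.9)],
    "C": [(-1587.1, -224.5), (-1590.6, -230.1)],
}

def to_feature_vector(nbest_scores, domains=("A", "B", "C"), n_best=2):
    """Concatenate the (acoustic, language) score pair for each domain and rank."""
    vec = []
    for d in domains:
        for rank in range(n_best):
            acoustic, language = nbest_scores[d][rank]
            vec.extend([acoustic, language])
    return vec

feature = to_feature_vector(nbest_scores)
assert len(feature) == 12  # 2 scores x 3 domains x 2-best
```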
Next, the model learning unit 106 calculates the domain determination model 107 using the learning features converted from the learning scores 103a and the learning label data 105 (step ST203). The learning label data 105 defines which domain each utterance in the learning speech data 101 belongs to. The model learning unit 106 calculates the domain determination model 107 so as to associate the learning features obtained by the learning feature conversion unit 104a with the learning label data 105.
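Step ST203 can be sketched as follows. The patent leaves the statistical model open, so this illustration assumes a single diagonal Gaussian fitted per domain from the labeled learning features; the domain names and data are hypothetical:

```python
import numpy as np

def fit_domain_model(features, labels):
    """Fit one diagonal Gaussian per domain from labeled learning features.
    features: (n_samples, n_dims) array; labels: per-sample domain names."""
    model = {}
    labels = np.asarray(labels)
    for domain in np.unique(labels):
        x = features[labels == domain]
        model[domain] = (x.mean(axis=0), x.var(axis=0) + 1e-6)  # (mean, variance)
    return model

# Hypothetical 2-D learning features for two domains
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
labels = ["navi"] * 50 + ["audio"] * 50
model = fit_domain_model(feats, labels)
```

The determination step would then score an input feature under each stored (mean, variance) pair and select the domain with the highest likelihood.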
Next, the domain determination step performed by the determination execution unit 200a will be described using the flowchart of FIG. 9.
In the determination step, first, the determination speech recognition unit 202a calculates the N-best determination scores 203a from the input speech data 201 (step ST211). The determination speech recognition unit 202a uses the same speech recognizers as the learning speech recognition unit 102a in the learning step. Scores A1 to C1 and scores A2 to C2 of the determination scores 203a are the first- and second-ranked recognition results from the respective speech recognizers.
Next, the determination feature conversion unit 204a converts the determination scores 203a into a determination feature (step ST212). The determination feature conversion unit 204a is the same feature conversion unit as the learning feature conversion unit 104a in the learning step.
Next, the determination feature generated from the determination scores 203a by the determination feature conversion unit 204a and the domain determination model 107 are input to the domain determination unit 205, which calculates the domain determination result 206 (step ST213). The domain determination unit 205 performs its processing using the same statistical method as the model learning unit 106 in the learning step: it matches the input determination feature against the domain determination model 107, selects the domain with the highest occurrence probability, and outputs the selected domain and the speech recognition result corresponding to that domain as the domain determination result 206.
As described above, the speech recognition apparatus of Embodiment 2 includes a learning speech recognition unit that calculates, from learning speech data, N-best learning scores that are values indicating the N best speech recognition results (N is an integer of 2 or more); a learning feature conversion unit that converts the N-best learning scores into learning features; a model learning unit that calculates a domain determination model representing the relationship between features and domains, using the learning features and learning label data that defines which domain each learning utterance belongs to; a determination speech recognition unit that calculates, from input speech data, N-best determination scores that are values indicating the N best speech recognition results; a determination feature conversion unit that converts the N-best determination scores into determination features; and a domain determination unit that matches the determination features against the domain determination model and calculates a domain determination result indicating which domain the input speech data belongs to. Since the features used for domain determination now take the N-best results into account, a further improvement in domain determination accuracy can be expected in addition to the effects of Embodiment 1.
Embodiment 3.
Embodiment 3 adds dimensional compression of the features to the configuration of Embodiment 2.
FIG. 10 is a configuration diagram of the speech recognition apparatus according to this embodiment.
As shown in the figure, the speech recognition apparatus according to this embodiment includes a learning execution unit 100b and a determination execution unit 200b. The learning execution unit 100b includes a learning speech recognition unit 102a, a learning feature conversion unit 104a, a dimension compression matrix estimation unit 108, a learning dimension compression unit 110, and a model learning unit 106; the determination execution unit 200b includes a determination speech recognition unit 202a, a determination feature conversion unit 204a, a determination dimension compression unit 207, and a domain determination unit 205. Components identical to those of Embodiment 2 are given the same reference numerals, and their description is omitted or simplified.
The dimension compression matrix estimation unit 108 in the learning execution unit 100b is a processing unit that calculates the dimension compression matrix 109 using the learning features calculated by the learning feature conversion unit 104a and the learning label data 105. The learning dimension compression unit 110 is a processing unit that multiplies the learning features calculated by the learning feature conversion unit 104a by the dimension compression matrix 109 to compress their dimensionality. The model learning unit 106 is a processing unit that calculates the domain determination model 107 using the learning features compressed by the learning dimension compression unit 110 and the learning label data 105.
In the determination execution unit 200b, the determination speech recognition unit 202a and the determination feature conversion unit 204a have the same configuration as the learning speech recognition unit 102a and the learning feature conversion unit 104a of the learning execution unit 100b. The determination dimension compression unit 207 is a processing unit that multiplies the determination features calculated by the determination feature conversion unit 204a by the dimension compression matrix 109 to compress their dimensionality. Here, the dimension compression matrix 109 is matrix data for performing dimensional compression of multidimensional features. The domain determination unit 205 is a processing unit that calculates the domain determination result 206 using the determination features calculated by the determination dimension compression unit 207 and the domain determination model 107.
The learning speech recognition unit 102a, learning feature conversion unit 104a, model learning unit 106, dimension compression matrix estimation unit 108, learning dimension compression unit 110, determination speech recognition unit 202a, determination feature conversion unit 204a, determination dimension compression unit 207, and domain determination unit 205 shown in FIG. 10 are each realized by the processor 1 executing a program stored in the memory 2. The learning speech data 101, learning scores 103a, learning label data 105, domain determination model 107, dimension compression matrix 109, input speech data 201, determination scores 203a, and domain determination result 206 are each stored in a storage area of the memory 2. A plurality of processors 1 and memories 2 may be provided, and the plurality of processors 1 and memories 2 may cooperate to execute the functions described above.
Next, the operation of the speech recognition apparatus of Embodiment 3 will be described.
First, the domain determination model learning step performed by the learning execution unit 100b will be described using the flowchart of FIG. 11.
In the learning step, first, the learning speech recognition unit 102a calculates the learning scores 103a from the learning speech data 101 (step ST301). The learning speech recognition unit 102a consists of a plurality of speech recognizers A to C, each of which has loaded a language model and an acoustic model corresponding to its domain. Scores A1 to C1 and scores A2 to C2 of the learning scores 103a are the first- and second-ranked recognition results obtained from the respective speech recognizers. Although this embodiment uses three recognizers A to C as an example, the number of recognizers may be changed according to the number of domains, and the number N of N-best results may also be changed.
Next, the learning feature conversion unit 104a converts the learning scores 103a into a learning feature (step ST302). As in Embodiment 2, one concrete conversion method, shown in FIG. 8, is to arrange the acoustic and language scores of the N-best results for each domain into a vector. The scores used for vectorization are not limited to the acoustic score and the language score; the sum of the two, or any other value obtainable from the learning speech recognition unit 102a, may be used.
Next, the dimension compression matrix estimation unit 108 estimates the dimension compression matrix 109 using the learning features converted from the learning scores 103a and the learning label data 105 (step ST303). Specifically, as shown in FIG. 12, the matrix is calculated by applying a dimension compression method such as linear discriminant analysis (LDA) or heteroscedastic discriminant analysis (HDA) to the feature vectors obtained from the N-best scores. The advantages of dimensional compression are that supervised methods such as LDA and HDA can generate features well suited to discrimination, and that, when modeling with a Gaussian mixture distribution, the number of model parameters is reduced.
Next, using the dimension compression matrix 109 calculated by the dimension compression matrix estimation unit 108 and the learning features converted from the learning scores 103a, the learning dimension compression unit 110 dimensionally compresses those learning features (step ST304). Dimensional compression, as shown in FIG. 12, converts the features obtained from the N-best scores into low-order vector features by multiplying them by the dimension compression matrix 109. The example of FIG. 12 shows the case in which the first- through third-ranked recognition results were obtained.
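Steps ST303 and ST304 can be sketched with classical Fisher LDA, one of the two supervised compression methods named above; HDA would replace the eigenvalue problem with a heteroscedastic objective. The feature dimensionality, sample counts, and data below are hypothetical:

```python
import numpy as np

def estimate_lda_matrix(X, y, out_dim):
    """Estimate an LDA projection matrix: top eigenvectors of Sw^-1 Sb,
    where Sw/Sb are the within-/between-class scatter matrices."""
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # the most discriminative directions maximize between- vs within-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:out_dim]]   # shape (d, out_dim)

# Step ST303: estimate the matrix from hypothetical 12-dim labeled features
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(i * 3.0, 1.0, (40, 12)) for i in range(3)])  # 3 domains
y = np.repeat(np.arange(3), 40)
W = estimate_lda_matrix(X, y, out_dim=2)      # dimension compression matrix 109

# Step ST304: compress features by multiplying with the matrix
X_low = X @ W                                 # (120, 2) low-order feature vectors
assert X_low.shape == (120, 2)
```

Note that LDA yields at most (number of domains − 1) discriminative directions, which bounds the usable `out_dim`.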
Next, the model learning unit 106 learns the domain determination model 107 using the learning features dimension-compressed by the learning dimension compression unit 110 and the learning label data 105 (step ST305). The model learning unit 106 calculates the model so as to associate the compressed learning features with the learning label data 105.
Next, the domain determination step performed by the determination execution unit 200b will be described using the flowchart of FIG. 13.
In the determination step, first, the determination speech recognition unit 202a calculates the determination scores 203a from the input speech data 201 (step ST311). The determination speech recognition unit 202a uses the same speech recognizers as the learning speech recognition unit 102a in the learning step. Scores A1 to C1 and scores A2 to C2 of the determination scores 203a are the first- and second-ranked recognition results from the respective speech recognizers.
Next, the determination feature conversion unit 204a converts the determination scores 203a into a determination feature (step ST312). The determination feature conversion unit 204a has the same configuration as the learning feature conversion unit 104a in the learning step.
Next, using the dimension compression matrix 109 calculated by the dimension compression matrix estimation unit 108 and the determination feature converted from the determination scores 203a, the determination dimension compression unit 207 dimensionally compresses that determination feature (step ST313). As in the learning dimension compression unit 110 of the learning execution unit 100b, the compression converts the feature obtained from the N-best scores into a low-order vector feature by multiplying it by the dimension compression matrix 109, as shown in FIG. 12.
Next, the domain determination unit 205 obtains the domain determination result 206 from the features dimension-compressed by the determination dimension compression unit 207 and the domain determination model 107 (step ST314). The domain determination unit 205 performs its processing using the same statistical method as in the learning step: it matches the compressed determination feature against the domain determination model 107, selects the domain with the highest occurrence probability, and outputs the selected domain and the speech recognition result corresponding to that domain as the domain determination result 206.
As described above, the speech recognition apparatus of Embodiment 3 includes a learning speech recognition unit that calculates, from learning speech data, N-best learning scores that are values indicating the N best speech recognition results (N is an integer of 2 or more); a learning feature conversion unit that converts the N-best learning scores into learning features; a dimension compression matrix estimation unit that estimates a dimension compression matrix for compressing the dimensionality of the learning features, using the learning features and learning label data that defines which domain each learning utterance belongs to; a learning dimension compression unit that compresses the dimensionality of the learning features using the learning features and the dimension compression matrix; a model learning unit that calculates a domain determination model representing the relationship between features and domains, using the learning features compressed by the learning dimension compression unit and the learning label data; a determination speech recognition unit that calculates, from input speech data, N-best determination scores that are values indicating the N best speech recognition results; a determination feature conversion unit that converts the N-best determination scores into determination features; a determination dimension compression unit that compresses the dimensionality of the determination features using the determination features and the dimension compression matrix; and a domain determination unit that matches the determination features compressed by the determination dimension compression unit against the domain determination model and calculates a domain determination result indicating which domain the input speech data belongs to. In addition to the effects of Embodiment 2, compressing the features to a lower dimensionality makes it possible to handle features well suited to discrimination and, depending on the type of model, to reduce the number of model parameters.
Also, according to the speech recognition apparatus of Embodiment 3, the dimension compression matrix estimation unit takes features and teacher labels as input and outputs a matrix that converts the features to a lower dimensionality, so features well suited to discrimination can be generated.
Embodiment 4.
Embodiment 4 is an example in which the N-best recognition results (N is an integer of 2 or more) are generated and a domain determination model is generated for each N-best rank.
FIG. 14 is a configuration diagram of the speech recognition apparatus according to this embodiment.
As shown in the figure, the speech recognition apparatus according to this embodiment includes a learning execution unit 100c and a determination execution unit 200c. The learning execution unit 100c includes a learning speech recognition unit 102a, a first learning feature conversion unit 104b and a second learning feature conversion unit 104c, and a first model learning unit 106a and a second model learning unit 106b; the determination execution unit 200c includes a determination speech recognition unit 202a, a first determination feature conversion unit 204b and a second determination feature conversion unit 204c, a first domain determination unit 205a and a second domain determination unit 205b, and a domain finalization unit 208. Components identical to those of Embodiment 2 are given the same reference numerals, and their description is omitted or simplified.
The first learning feature conversion unit 104b and the second learning feature conversion unit 104c each have the same configuration as the learning feature conversion unit 104 of Embodiment 1, and are processing units that convert the learning scores 103a calculated by the learning speech recognition unit 102a into learning features. However, the first learning feature conversion unit 104b converts the first-ranked scores A1 to C1 into features, while the second learning feature conversion unit 104c converts the second-ranked scores A2 to C2. The first model learning unit 106a and the second model learning unit 106b each have the same configuration as the model learning unit 106 of Embodiment 1, except that the first model learning unit 106a calculates the first domain determination model 107a using the learning features calculated by the first learning feature conversion unit 104b and the learning label data 105, while the second model learning unit 106b calculates the second domain determination model 107b using the learning features calculated by the second learning feature conversion unit 104c and the learning label data 105. Although the illustrated example shows the per-rank configuration for N = 2, any value of N may be used.
In the determination execution unit 200c, the determination speech recognition unit 202a, the first determination feature amount conversion unit 204b, and the second determination feature amount conversion unit 204c have the same configurations as the learning speech recognition unit 102a, the first learning feature amount conversion unit 104b, and the second learning feature amount conversion unit 104c of the learning execution unit 100c, respectively. The first domain determination unit 205a is a processing unit that calculates the first domain determination result 206a using the determination feature amount calculated by the first determination feature amount conversion unit 204b and the first domain determination model 107a. The second domain determination unit 205b is a processing unit that calculates the second domain determination result 206b using the determination feature amount calculated by the second determination feature amount conversion unit 204c and the second domain determination model 107b. The domain confirmation unit 208 is a processing unit that calculates the domain final determination result 209 using the first domain determination result 206a and the second domain determination result 206b. Although the illustrated learning execution unit 100c and determination execution unit 200c show a per-rank configuration with N = 2, N may be set to any value.
The learning speech recognition unit 102a, first learning feature amount conversion unit 104b, second learning feature amount conversion unit 104c, first model learning unit 106a, second model learning unit 106b, determination speech recognition unit 202a, first determination feature amount conversion unit 204b, second determination feature amount conversion unit 204c, first domain determination unit 205a, second domain determination unit 205b, and domain confirmation unit 208 shown in FIG. 14 are each realized by the processor 1 shown in FIG. 2 executing a program stored in the memory 2. The learning speech data 101, learning score 103a, learning label data 105, domain determination model 107, input speech data 201, determination score 203a, domain determination result 206, and domain final determination result 209 are each stored in a storage area of the memory 2. A plurality of processors 1 and memories 2 may be provided, with the plurality of processors 1 and memories 2 cooperating to perform the functions described above.
Next, the operation of the speech recognition apparatus according to Embodiment 4 will be described.
First, the domain determination model learning step performed by the learning execution unit 100c will be described with reference to the flowchart of FIG. 15.
In the learning step, the learning speech recognition unit 102a first calculates the N-best learning score 103a from the learning speech data 101 (step ST401). Here, the learning speech recognition unit 102a consists of a plurality of speech recognizers A to C, each of which has loaded a language model and an acoustic model corresponding to its domain. The scores A1 to C1 and A2 to C2 of the learning score 103a are the first- and second-rank recognition results obtained from each speech recognizer. Although this embodiment uses three recognizers A to C as an example, the number of recognizers may be varied according to the number of domains, and the number N of N-best recognition results may also be varied.
Next, the learning score 103a is converted, rank by rank, into learning feature amounts by the first learning feature amount conversion unit 104b and the second learning feature amount conversion unit 104c (step ST402). As a concrete conversion method, as shown in FIG. 4, the acoustic and language scores of the N-best results can be arranged per domain and vectorized. The scores used for vectorization are not limited to the acoustic score and the language score; the sum of the two, or any other value obtainable from the learning speech recognition unit 102a, may be used.
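As an illustrative sketch of this vectorization step, a per-rank feature vector can be formed by concatenating each recognizer's acoustic and language scores at that rank. The dictionary layout and the score values below are assumptions made for illustration; the text only requires that the scores be arranged into a vector per domain.

```python
# Sketch: convert N-best scores from several domain-specific recognizers
# into one feature vector per rank. Data layout is hypothetical.

def rank_feature(scores, rank):
    """Concatenate (acoustic, language) scores of every recognizer at `rank`."""
    vec = []
    for recognizer in sorted(scores):           # e.g. "A", "B", "C"
        acoustic, language = scores[recognizer][rank]
        vec.extend([acoustic, language])
    return vec

# Hypothetical 2-best scores for recognizers A-C (rank 0 = 1st result).
nbest = {
    "A": [(-120.5, -30.2), (-121.0, -31.8)],
    "B": [(-118.9, -28.7), (-119.4, -29.9)],
    "C": [(-125.1, -35.6), (-126.3, -36.0)],
}

first_rank = rank_feature(nbest, 0)   # fed to the first feature converter
second_rank = rank_feature(nbest, 1)  # fed to the second feature converter
print(first_rank)
```

With three recognizers and two scores each, every rank yields a six-dimensional vector, which is why modeling each rank separately keeps the dimensionality from growing with N.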
Next, using the learning feature amounts converted from the learning score 103a and the learning label data 105, the first model learning unit 106a and the second model learning unit 106b obtain, for each N-best rank, the first domain determination model 107a and the second domain determination model 107b (step ST403). That is, the first model learning unit 106a and the second model learning unit 106b each calculate a model that associates the learning feature amounts obtained by the first learning feature amount conversion unit 104b and the second learning feature amount conversion unit 104c with the learning label data 105.
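The text leaves the statistical method open. As a minimal stand-in, a nearest-centroid model could be fitted per rank as follows; the training features, labels, and domain names are invented for illustration and do not come from the patent.

```python
# Sketch: fit one simple "domain determination model" for one N-best rank.
# A nearest-centroid classifier stands in for whatever statistical model
# the model learning units actually use.

def fit_centroids(features, labels):
    """Return {domain: mean feature vector} for one rank's training set."""
    sums, counts = {}, {}
    for vec, dom in zip(features, labels):
        acc = sums.setdefault(dom, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[dom] = counts.get(dom, 0) + 1
    return {dom: [v / counts[dom] for v in acc] for dom, acc in sums.items()}

# Invented rank-1 learning features (2 dims) with domain labels.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["navi", "music", "music", "navi"]
labels = ["navi", "navi", "music", "music"]  # one label per feature vector
model_rank1 = fit_centroids(feats, labels)
print(sorted(model_rank1))  # the domains the model can decide between
```

A second, independent call on the rank-2 features would produce the second model, mirroring the two model learning units.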
Next, the domain determination step performed by the determination execution unit 200c will be described with reference to the flowchart of FIG. 16.
In the determination step, the determination speech recognition unit 202a first calculates the N-best determination score 203a from the input speech data 201 (step ST411). Here, the determination speech recognition unit 202a uses the same speech recognition unit as the learning speech recognition unit 102a of the learning step. The scores A1 to C1 and A2 to C2 of the determination score 203a are the first- and second-rank recognition results from each speech recognizer.
Next, the determination score 203a is converted into determination feature amounts, rank by rank, by the first determination feature amount conversion unit 204b and the second determination feature amount conversion unit 204c (step ST412). The first determination feature amount conversion unit 204b and the second determination feature amount conversion unit 204c use the same feature amount conversion as the first learning feature amount conversion unit 104b and the second learning feature amount conversion unit 104c of the learning step.
Next, the first domain determination unit 205a and the second domain determination unit 205b acquire, for each N-best rank, the determination feature amounts generated by the first determination feature amount conversion unit 204b and the second determination feature amount conversion unit 204c together with the first domain determination model 107a and the second domain determination model 107b, and obtain N domain determination results (the first domain determination result 206a and the second domain determination result 206b) (step ST413). The first domain determination unit 205a and the second domain determination unit 205b use the same statistical method as the first model learning unit 106a and the second model learning unit 106b of the learning step: each collates the determination feature amount generated by its feature amount conversion unit against its domain determination model, outputs the domain with the highest occurrence probability, and takes that domain and the recognition result corresponding to it as the first domain determination result 206a or the second domain determination result 206b.
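As a matching sketch of this collation step (using the same invented nearest-centroid stand-in; the actual statistical method is not fixed by the text), one rank's determination unit can score its determination feature against its model and emit the domain with the highest pseudo-probability:

```python
import math

# Sketch: collate one rank's determination feature with that rank's model
# (here: domain centroids, an illustrative stand-in) and return the domain
# with the highest score, mimicking "output the domain with the highest
# occurrence probability". Centroid values and domain names are invented.

def determine_domain(feature, centroids):
    # Softmax over negative Euclidean distances gives pseudo-probabilities.
    scores = {dom: -math.dist(feature, c) for dom, c in centroids.items()}
    m = max(scores.values())
    expd = {dom: math.exp(s - m) for dom, s in scores.items()}
    z = sum(expd.values())
    probs = {dom: e / z for dom, e in expd.items()}
    return max(probs, key=probs.get), probs

centroids = {"navi": [0.95, 0.05], "music": [0.05, 0.95]}
domain, probs = determine_domain([0.8, 0.2], centroids)
print(domain)  # the feature lies closer to the "navi" centroid
```

Running the same function with the rank-2 feature and the rank-2 model would yield the second domain determination result.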
Next, the domain confirmation unit 208 obtains the domain final determination result 209 from the N domain determination results (the first domain determination result 206a and the second domain determination result 206b) (step ST414). The domain can be confirmed, for example, by a simple majority vote over the N domain determination results as shown in FIG. 17, or by a majority vote weighted according to the rank of each determination result. The example of FIG. 17 shows a case in which the first- through third-rank recognition results were obtained.
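The two confirmation strategies mentioned here can be sketched as follows. The 1/(rank+1) weighting is an illustrative choice, not one prescribed by the text, and the domain names are invented.

```python
from collections import Counter

# Sketch: confirm the final domain from per-rank determination results.

def simple_majority(results):
    """results: list of domains, one per N-best rank."""
    return Counter(results).most_common(1)[0][0]

def weighted_majority(results, weights=None):
    """Weight higher-ranked results more; 1/(rank+1) is an illustrative
    weighting, not one prescribed by the text."""
    if weights is None:
        weights = [1.0 / (i + 1) for i in range(len(results))]
    totals = {}
    for dom, w in zip(results, weights):
        totals[dom] = totals.get(dom, 0.0) + w
    return max(totals, key=totals.get)

ranks = ["navi", "music", "navi"]      # e.g. rank-1 to rank-3 decisions
print(simple_majority(ranks))          # "navi" (2 of 3 votes)
print(weighted_majority(["music", "navi", "navi"]))  # "music": rank 1 outweighs
```

The second call shows how weighting changes the outcome: the rank-1 vote (weight 1.0) outweighs the two lower-ranked votes (0.5 + 0.333), whereas a simple majority would have chosen the other domain.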
As described above, Embodiment 4 differs from Embodiment 2 in that a model is generated for each N-best rank, so the score distribution at any given rank can be modeled while the growth in the dimensionality of the feature amount is suppressed. Furthermore, integrating the domain determination results of the N-best ranks by majority vote or a similar method prevents the decision from depending only on the top-ranked recognition result.
As described above, the speech recognition apparatus of Embodiment 4 includes: a learning speech recognition unit that calculates, from learning speech data, N-best learning scores, which are values indicating the N-best (N being an integer of 2 or more) speech recognition results; a learning feature amount conversion unit that converts the N-best learning scores into a learning feature amount for each rank; a model learning unit that calculates, for each rank, a domain determination model indicating the relationship between feature amounts and domains, using the per-rank learning feature amounts and learning label data defining which domain's utterance the learning speech data is; a determination speech recognition unit that calculates, from input speech data, N-best determination scores, which are values indicating the N-best speech recognition results; a determination feature amount conversion unit that converts the N-best determination scores into a determination feature amount for each rank; a domain determination unit that collates the per-rank determination feature amounts against the per-rank domain determination models and calculates a domain determination result for each rank; and a domain confirmation unit that calculates, from the per-rank domain determination results, a domain final determination result indicating which domain's utterance the input speech data is. The N-best results can therefore be taken into account in the feature amounts used for domain determination, and an improvement in domain determination accuracy can be expected in addition to the effects of Embodiment 1.
Within the scope of the present invention, the embodiments may be freely combined, any component of any embodiment may be modified, and any component of any embodiment may be omitted.
As described above, the speech recognition device according to the present invention concerns a configuration for determining which domain's utterance an input speech is, and is suitable for application to navigation devices, home appliances, and the like to improve speech recognition performance.
100, 100a, 100b, 100c learning execution unit; 101 learning speech data; 102, 102a learning speech recognition unit; 103, 103a learning score; 104, 104a learning feature amount conversion unit; 104b first learning feature amount conversion unit; 104c second learning feature amount conversion unit; 105 learning label data; 106 model learning unit; 106a first model learning unit; 106b second model learning unit; 107 domain determination model; 107a first domain determination model; 107b second domain determination model; 108 dimension compression matrix estimation unit; 109 dimension compression matrix; 110 learning dimension compression unit; 200, 200a, 200b, 200c determination execution unit; 201 input speech data; 202, 202a determination speech recognition unit; 203, 203a determination score; 204, 204a determination feature amount conversion unit; 204b first determination feature amount conversion unit; 204c second determination feature amount conversion unit; 205 domain determination unit; 205a first domain determination unit; 205b second domain determination unit; 206 domain determination result; 206a first domain determination result; 206b second domain determination result; 207 determination dimension compression unit; 208 domain confirmation unit; 209 domain final determination result.

Claims (5)

  1.  A speech recognition device comprising:
     a learning speech recognition unit that calculates, from learning speech data, a learning score that is a value indicating a speech recognition result;
     a learning feature amount conversion unit that converts the learning score into a learning feature amount;
     a model learning unit that calculates a domain determination model indicating a relationship between feature amounts and domains, using the learning feature amount and learning label data defining which domain's utterance the learning speech data is;
     a determination speech recognition unit that calculates, from input speech data, a determination score that is a value indicating a speech recognition result;
     a determination feature amount conversion unit that converts the determination score into a determination feature amount; and
     a domain determination unit that collates the determination feature amount with the domain determination model and calculates a domain determination result indicating which domain's utterance the input speech data is.
  2.  A speech recognition device comprising:
     a learning speech recognition unit that calculates, from learning speech data, N-best learning scores that are values indicating N-best (N being an integer of 2 or more) speech recognition results;
     a learning feature amount conversion unit that converts the N-best learning scores into a learning feature amount;
     a model learning unit that calculates a domain determination model indicating a relationship between feature amounts and domains, using the learning feature amount and learning label data defining which domain's utterance the learning speech data is;
     a determination speech recognition unit that calculates, from input speech data, N-best determination scores that are values indicating N-best speech recognition results;
     a determination feature amount conversion unit that converts the N-best determination scores into a determination feature amount; and
     a domain determination unit that collates the determination feature amount with the domain determination model and calculates a domain determination result indicating which domain's utterance the input speech data is.
  3.  A speech recognition device comprising:
     a learning speech recognition unit that calculates, from learning speech data, N-best learning scores that are values indicating N-best (N being an integer of 2 or more) speech recognition results;
     a learning feature amount conversion unit that converts the N-best learning scores into a learning feature amount;
     a dimension compression matrix estimation unit that estimates a dimension compression matrix for compressing the dimensions of the learning feature amount, using the learning feature amount and learning label data defining which domain's utterance the learning speech data is;
     a learning dimension compression unit that compresses the dimensions of the learning feature amount using the learning feature amount and the dimension compression matrix;
     a model learning unit that calculates a domain determination model indicating a relationship between feature amounts and domains, using the learning feature amount compressed by the learning dimension compression unit and the learning label data;
     a determination speech recognition unit that calculates, from input speech data, N-best determination scores that are values indicating N-best speech recognition results;
     a determination feature amount conversion unit that converts the N-best determination scores into a determination feature amount;
     a determination dimension compression unit that compresses the dimensions of the determination feature amount using the determination feature amount and the dimension compression matrix; and
     a domain determination unit that collates the determination feature amount compressed by the determination dimension compression unit with the domain determination model and calculates a domain determination result indicating which domain's utterance the input speech data is.
  4.  The speech recognition device according to claim 3, wherein the dimension compression matrix estimation unit receives a feature amount and a teacher label as input and outputs a matrix that converts the dimensions of the feature amount to a lower dimensionality.
  5.  A speech recognition device comprising:
     a learning speech recognition unit that calculates, from learning speech data, N-best learning scores that are values indicating N-best (N being an integer of 2 or more) speech recognition results;
     a learning feature amount conversion unit that converts the N-best learning scores into a learning feature amount for each rank;
     a model learning unit that calculates, for each rank, a domain determination model indicating a relationship between feature amounts and domains, using the per-rank learning feature amounts and learning label data defining which domain's utterance the learning speech data is;
     a determination speech recognition unit that calculates, from input speech data, N-best determination scores that are values indicating N-best speech recognition results;
     a determination feature amount conversion unit that converts the N-best determination scores into a determination feature amount for each rank;
     a domain determination unit that collates the per-rank determination feature amounts with the per-rank domain determination models and calculates a domain determination result for each rank; and
     a domain confirmation unit that calculates, using the per-rank domain determination results, a domain final determination result indicating which domain's utterance the input speech data is.
PCT/JP2017/001551 2017-01-18 2017-01-18 Speech recognition device WO2018134916A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2017/001551 WO2018134916A1 (en) 2017-01-18 2017-01-18 Speech recognition device
JP2018562783A JP6532619B2 (en) 2017-01-18 2017-01-18 Voice recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2017/001551 WO2018134916A1 (en) 2017-01-18 2017-01-18 Speech recognition device

Publications (1)

Publication Number Publication Date
WO2018134916A1 true WO2018134916A1 (en) 2018-07-26

Family

ID=62907889

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/001551 WO2018134916A1 (en) 2017-01-18 2017-01-18 Speech recognition device

Country Status (2)

Country Link
JP (1) JP6532619B2 (en)
WO (1) WO2018134916A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113016029A (en) * 2018-11-02 2021-06-22 株式会社赛斯特安国际 Method and apparatus for providing context-based speech recognition service
WO2022177165A1 (en) * 2021-02-19 2022-08-25 삼성전자주식회사 Electronic device and method for analyzing speech recognition result

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012022069A (en) * 2010-07-13 2012-02-02 Nippon Telegr & Teleph Corp <Ntt> Speech recognition method, and device and program for the same
JP2012047924A (en) * 2010-08-26 2012-03-08 Sony Corp Information processing device and information processing method, and program
JP2013167666A (en) * 2012-02-14 2013-08-29 Nec Corp Speech recognition device, speech recognition method, and program

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3265701B2 (en) * 1993-04-20 2002-03-18 富士通株式会社 Pattern recognition device using multi-determiner
WO2008096582A1 (en) * 2007-02-06 2008-08-14 Nec Corporation Recognizer weight learning device, speech recognizing device, and system
JP6003492B2 (en) * 2012-10-01 2016-10-05 富士ゼロックス株式会社 Character recognition device and program
JP6188831B2 (en) * 2014-02-06 2017-08-30 三菱電機株式会社 Voice search apparatus and voice search method



Also Published As

Publication number Publication date
JP6532619B2 (en) 2019-06-19
JPWO2018134916A1 (en) 2019-04-11

Similar Documents

Publication Publication Date Title
US20190325859A1 (en) System and methods for adapting neural network acoustic models
US8566093B2 (en) Intersession variability compensation for automatic extraction of information from voice
US8271283B2 (en) Method and apparatus for recognizing speech by measuring confidence levels of respective frames
Thakur et al. Speech recognition using euclidean distance
JP2006510933A (en) Sensor-based speech recognition device selection, adaptation, and combination
CN109801646B (en) Voice endpoint detection method and device based on fusion features
US9378735B1 (en) Estimating speaker-specific affine transforms for neural network based speech recognition systems
Gill et al. Vector quantization based speaker identification
KR100574769B1 (en) Speaker and environment adaptation based on eigenvoices imcluding maximum likelihood method
WO2018134916A1 (en) Speech recognition device
JP6845489B2 (en) Speech processor, speech processing method, and speech processing program
CN111667839A (en) Registration method and apparatus, speaker recognition method and apparatus
JP2009086581A (en) Apparatus and program for creating speaker model of speech recognition
JP2020060757A (en) Speaker recognition device, speaker recognition method, and program
JP4652232B2 (en) Method and system for analysis of speech signals for compressed representation of speakers
JP2012108429A (en) Voice selection device, utterance selection device, voice selection system, method for selecting voice, and voice selection program
JP6791816B2 (en) Voice section detection device, voice section detection method, and program
KR101041035B1 (en) Method and Apparatus for rapid speaker recognition and registration thereof
JP6114210B2 (en) Speech recognition apparatus, feature quantity conversion matrix generation apparatus, speech recognition method, feature quantity conversion matrix generation method, and program
Hossan et al. Speaker recognition utilizing distributed DCT-II based Mel frequency cepstral coefficients and fuzzy vector quantization
CN111798844A (en) Artificial intelligent speaker customized personalized service system based on voiceprint recognition
JP6054004B1 (en) Voice recognition device
Yu et al. Speaker recognition models.
CN109872725B (en) Multi-view vector processing method and device
Guanyu et al. Design and implementation of a high-performance client/server voiceprint recognition system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17893040

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2018562783

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17893040

Country of ref document: EP

Kind code of ref document: A1