WO2018134916A1 - Speech recognition device - Google Patents

Speech recognition device

Info

Publication number
WO2018134916A1
WO2018134916A1 · PCT/JP2017/001551
Authority
WO
WIPO (PCT)
Prior art keywords
learning
determination
domain
unit
speech recognition
Prior art date
Application number
PCT/JP2017/001551
Other languages
French (fr)
Japanese (ja)
Inventor
裕紀 金川
Original Assignee
Mitsubishi Electric Corporation (三菱電機株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corporation (三菱電機株式会社)
Priority to PCT/JP2017/001551 priority Critical patent/WO2018134916A1/en
Priority to JP2018562783A priority patent/JP6532619B2/en
Publication of WO2018134916A1 publication Critical patent/WO2018134916A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08: Speech classification or search
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the present invention relates to a speech recognition apparatus that determines to which domain input speech belongs.
  • a conventional method obtains a speech recognition result of the desired domain while determining which domain the input speech belongs to, as follows: first, a recognition result is calculated for each domain by speech recognition, and then the recognition results of the domains are compared with one another to obtain the final recognition result. For example, in the method disclosed in Patent Document 1, speech recognition results are first obtained by a plurality of speech recognition systems, each using a statistical language model prepared for a different domain.
  • a weighting coefficient controls the degree of influence of the acoustic score and the language score, and is determined experimentally so as to reduce utterance-domain determination errors.
  • the domain of the recognition result with the maximum score of the above expression is determined as the optimal domain, and the recognition result is presented as the optimal recognition result.
  • the weighted sum of the score obtained at recognition time and the score obtained from the recognition result is taken, and the optimum domain is determined from the magnitude of that sum.
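  • the prior-art weighted-sum determination described above can be sketched as follows; the domain names, score values, and the coefficient are invented for illustration and are not taken from Patent Document 1:

```python
# Hypothetical prior-art weighted-sum domain determination.
# Domain names, scores, and the coefficient w are invented for illustration.
domain_scores = {
    "navigation": {"acoustic": -120.5, "language": -30.2},
    "music":      {"acoustic": -118.9, "language": -35.7},
    "phone":      {"acoustic": -125.0, "language": -28.4},
}
w = 0.8  # empirically tuned coefficient controlling the language score's influence

def combined_score(s):
    # weighted sum of the acoustic score and the language score
    return s["acoustic"] + w * s["language"]

# The domain with the maximum combined score is taken as the optimal domain.
best_domain = max(domain_scores, key=lambda d: combined_score(domain_scores[d]))
print(best_domain)
```

  • as the text notes, when the combined scores of two domains are close, this magnitude-only comparison becomes unreliable, which motivates the learned determination model of the invention.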
  • however, the weighting coefficient in the weighted sum must be determined empirically, and for some utterances the difference in score between domains is small, so there is a problem that discrimination by score magnitude alone is difficult.
  • the present invention has been made to solve such a problem, and an object of the present invention is to provide a speech recognition apparatus capable of improving the accuracy of domain determination and improving the accuracy of speech recognition.
  • the speech recognition apparatus includes a learning speech recognition unit that calculates a learning score, a value indicating a speech recognition result, from learning speech data, and a learning feature amount conversion unit that converts the learning score into a learning feature amount.
  • a domain determination unit that collates a determination feature amount with a domain determination model and calculates a domain determination result indicating to which domain the input speech data belongs is also provided.
  • the speech recognition apparatus calculates a domain determination model indicating the relationship between feature amounts and domains by using learning label data that defines to which domain the learning speech data belongs, and then determines the domain of the input speech data using the domain determination model.
  • the domain determination accuracy and thus the speech recognition performance can be improved compared with the conventional case where the optimum domain is determined from the magnitude of the recognition score.
  • FIG. 1 is a configuration diagram of a speech recognition apparatus according to the first embodiment.
  • the speech recognition apparatus according to the present embodiment includes a learning execution unit 100 and a determination execution unit 200 as illustrated.
  • the learning execution unit 100 includes a learning speech recognition unit 102, a learning feature amount conversion unit 104, and a model learning unit 106
  • the determination execution unit 200 includes a determination speech recognition unit 202, a determination feature amount conversion unit 204, and a domain determination unit 205.
  • the learning speech recognition unit 102 in the learning execution unit 100 is a processing unit that calculates the learning score 103 using the learning speech data 101.
  • the learning feature amount conversion unit 104 is a processing unit that converts the learning score 103 calculated by the learning speech recognition unit 102 into a learning feature amount.
  • the model learning unit 106 is a processing unit that calculates the domain determination model 107 using the learning feature amount calculated by the learning feature amount conversion unit 104 and the learning label data 105 of the domain corresponding to the learning speech.
  • the determination voice recognition unit 202 and the determination feature amount conversion unit 204 are the same as those of the learning execution unit 100, respectively. That is, the determination speech recognition unit 202 has the same configuration as the learning speech recognition unit 102 and is a processing unit that calculates the determination score 203 using the input speech data 201.
  • the determination feature amount conversion unit 204 is a processing unit that converts the determination score 203 calculated by the determination speech recognition unit 202 into a determination feature amount.
  • the domain determination unit 205 is a processing unit that calculates the domain determination result 206 using the determination feature amount calculated by the determination feature amount conversion unit 204 and the domain determination model 107.
  • FIG. 2 is a hardware configuration diagram of the speech recognition apparatus according to the first embodiment.
  • the speech recognition apparatus is realized using a computer, and includes a processor 1, a memory 2, an input / output interface (input / output I / F) 3, and a bus 4.
  • the processor 1 is a functional unit that performs calculation processing as a computer
  • the memory 2 is a storage unit that stores various programs and calculation results, and constitutes a work area when the processor 1 performs calculation processing.
  • the input / output interface 3 is an interface for inputting the learning voice data 101 and the input voice data 201 and outputting the domain determination result 206 to the outside.
  • the bus 4 is a bus for connecting the processor 1, the memory 2, and the input / output interface 3 to each other.
  • a plurality of processors 1 and memories 2 may be provided, and the plurality of processors 1 and the memories 2 may be configured to perform the functions described above in cooperation.
  • the learning speech recognition unit 102 performs speech recognition on the learning speech data 101 and calculates the learning score 103 (step ST101).
  • the learning speech recognition unit 102 includes a plurality of speech recognizers A to C, and each reads a language model and an acoustic model corresponding to each domain.
  • Scores A to C of the learning score 103 are the first recognition results from the speech recognizers A to C.
  • an acoustic score or a language score can be used.
  • three speech recognizers A to C are used here as an example, but the number of recognizers can be selected appropriately according to the number of domains.
  • the learning feature amount conversion unit 104 converts the learning score 103 into a learning feature amount (step ST102).
  • as a specific method of conversion to a learning feature amount, a method of arranging the acoustic score and the language score for each domain into a vector is conceivable, as shown in FIG. 4. In the example shown in FIG. 4, the vector has 2 score types (acoustic score and language score) × 3 domains = 6 dimensions.
  • the scores used for vectorization are not limited to the acoustic score and the language score; the sum of the two, or any other score obtained from the learning speech recognition unit 102, may be used.
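  • the vectorization of per-domain scores described above (2 score types × 3 domains = 6 dimensions, as in FIG. 4) can be sketched as follows; the score values are invented placeholders for recognizer output:

```python
# Arrange the acoustic and language scores of each domain into one vector,
# as in FIG. 4: 2 score types x 3 domains = 6 dimensions. Values are invented.
scores = {
    "A": {"acoustic": -120.5, "language": -30.2},
    "B": {"acoustic": -118.9, "language": -35.7},
    "C": {"acoustic": -125.0, "language": -28.4},
}

def to_feature(scores, domains=("A", "B", "C")):
    # Concatenate (acoustic, language) per domain in a fixed order.
    vec = []
    for d in domains:
        vec.append(scores[d]["acoustic"])
        vec.append(scores[d]["language"])
    return vec

feature = to_feature(scores)
print(len(feature))  # 6-dimensional learning feature amount
```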
  • the domain determination model 107 is calculated by the model learning unit 106 using the learning feature amount converted from the learning score 103 and the learning label data 105 (step ST103).
  • the learning label data 105 defines to which domain the learning speech data 101 belongs.
  • the model learning unit 106 calculates a model so that the learning feature amount obtained by the learning feature amount conversion unit 104 is associated with the learning label data 105.
  • a statistical method such as a mixed Gaussian distribution model, a support vector machine, or a neural network can be used as a method used by the model learning unit 106.
  • in this way, the learning execution unit 100 applies the learning speech data 101 to a plurality of speech recognizers, converts the obtained recognition scores into learning feature amounts, and, using the learning feature amounts together with the learning label data 105 indicating the domain of each utterance, models the correspondence between recognition scores and domains in a statistical machine learning framework.
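  • as a rough illustration of this learning step, the toy classifier below stands in for the statistical methods the text names (mixed Gaussian distribution model, support vector machine, neural network); a simple nearest-centroid model is used here only to show how feature vectors are associated with domain labels, and all data is invented:

```python
# Toy stand-in for the model learning unit 106: a nearest-centroid model
# that associates feature vectors with domain labels. The patent names
# GMM / SVM / neural network methods; this sketch only shows the idea.
from collections import defaultdict

def train(features, labels):
    # Compute the mean feature vector (centroid) per domain label.
    sums = defaultdict(lambda: None)
    counts = defaultdict(int)
    for x, y in zip(features, labels):
        if sums[y] is None:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(model, x):
    # Pick the domain whose centroid is closest to the feature vector.
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: dist(model[y]))

# Invented 2-dim learning features and domain labels.
features = [[-120.0, -30.0], [-119.0, -31.0], [-140.0, -50.0], [-141.0, -49.0]]
labels = ["navigation", "navigation", "music", "music"]
model = train(features, labels)  # the "domain determination model"
print(predict(model, [-121.0, -29.0]))
```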
  • the determination score 203 is calculated from the input voice data 201 by the determination voice recognition unit 202 (step ST111).
  • each speech recognizer in the determination speech recognition unit 202 is the same as the corresponding recognizer used in the learning step.
  • the scores A to C of the determination score 203 are the first recognition results from each speech recognizer.
  • the determination score 203 is converted into a determination feature amount by the determination feature amount conversion unit 204 (step ST112).
  • the determination feature quantity conversion unit 204 uses the same feature quantity conversion unit as in the learning step.
  • the determination feature amount generated by the determination feature amount conversion unit 204 from the determination score 203 and the domain determination model 107 are input to the domain determination unit 205, and the domain determination result 206 is calculated (step ST113).
  • the domain determination unit 205 uses the same statistical method as the model learning unit 106 in the learning step.
  • the domain determination unit 205 compares the determination feature quantity with the domain determination model 107, selects the domain having the highest occurrence probability, and sets the selected domain and the speech recognition result corresponding to the domain as the domain determination result 206.
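  • the selection performed by the domain determination unit 205 can be sketched as below; the per-domain occurrence probabilities and recognition result strings are invented placeholders for what the model and recognizers would actually produce:

```python
# Sketch of the domain determination unit 205: given per-domain occurrence
# probabilities from the domain determination model (values invented), pick
# the domain with the highest probability and pair it with that domain's
# speech recognition result.
probabilities = {"navigation": 0.72, "music": 0.19, "phone": 0.09}
recognition_results = {
    "navigation": "set destination to Tokyo",
    "music": "play jazz",
    "phone": "call home",
}

best = max(probabilities, key=probabilities.get)
domain_determination_result = (best, recognition_results[best])
print(domain_determination_result)
```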
  • as described above, the first embodiment includes a learning speech recognition unit that calculates a learning score, a value indicating a speech recognition result, from learning speech data; a learning feature amount conversion unit that converts the learning score into a learning feature amount; a model learning unit that calculates a domain determination model indicating the relationship between feature amounts and domains, using the learning feature amount and learning label data defining the domain of each learning utterance; a determination speech recognition unit that calculates a determination score, a value indicating a speech recognition result, from input speech data; a determination feature amount conversion unit that converts the determination score into a determination feature amount; and a domain determination unit that collates the determination feature amount with the domain determination model and calculates a domain determination result indicating the domain of the input speech data. Since the association between score trends and domains can be learned in advance, an improvement in domain determination accuracy can be expected over methods that determine the domain from the scores of the input speech alone.
  • Embodiment 2 In the second embodiment, the N (N is an integer of 2 or more) best recognition results are used for domain determination.
  • FIG. 6 is a configuration diagram of the speech recognition apparatus according to the present embodiment.
  • the speech recognition apparatus includes a learning execution unit 100a and a determination execution unit 200a.
  • the learning execution unit 100a includes a learning speech recognition unit 102a, a learning feature amount conversion unit 104a, and a model learning unit 106.
  • the determination execution unit 200a includes a determination speech recognition unit 202a, a determination feature amount conversion unit 204a, and a domain.
  • a determination unit 205 is provided.
  • the same reference symbols are attached to components identical to those of the first embodiment, and their description is omitted.
  • the learning speech recognition unit 102a in the learning execution unit 100a is a processing unit that uses the learning speech data 101 to calculate the N best learning scores 103a of the recognition results.
  • the learning feature amount conversion unit 104a is a processing unit that converts the N best learning score 103a calculated by the learning speech recognition unit 102a into a learning feature amount.
  • the model learning unit 106 is a processing unit that calculates the domain determination model 107 using the learning feature amount calculated by the learning feature amount conversion unit 104a and the learning label data 105, the label data of the domain corresponding to the learning speech.
  • the determination speech recognition unit 202a and the determination feature amount conversion unit 204a use the same configuration as the learning speech recognition unit 102a and the learning feature amount conversion unit 104a in the learning execution unit 100a.
  • the determination speech recognition unit 202 a is a processing unit that calculates the N best determination score 203 a using the input speech data 201.
  • the determination feature value conversion unit 204a is a processing unit that converts the N best determination score 203a calculated by the determination speech recognition unit 202a into a determination feature value.
  • the domain determination unit 205 is a processing unit that calculates the domain determination result 206 using the determination feature amount calculated by the determination feature amount conversion unit 204 a and the domain determination model 107.
  • the learning speech recognition unit 102a, the learning feature amount conversion unit 104a, the model learning unit 106, the determination speech recognition unit 202a, the determination feature amount conversion unit 204a, and the domain determination unit 205 illustrated in FIG. 6 are realized by the processor 1 of FIG. 2 executing a program stored in the memory 2. The learning speech data 101, the learning score 103a, the learning label data 105, the domain determination model 107, the input speech data 201, the determination score 203a, and the domain determination result 206 are each stored in the storage area of the memory 2. A plurality of processors 1 and memories 2 may be provided and configured to perform the functions described above in cooperation.
  • the N best learning score 103a is calculated from the learning speech data 101 by the learning speech recognition unit 102a (step ST201).
  • the learning speech recognition unit 102a includes a plurality of speech recognizers A to C, and each reads a language model and an acoustic model corresponding to each domain.
  • Scores A1 to C1 and scores A2 to C2 of the learning score 103a are the first and second recognition results obtained from each speech recognizer.
  • three recognizers A to C are used as an example.
  • the learning score 103a is converted into a learning feature quantity by the learning feature quantity conversion unit 104a (step ST202).
  • as a method of conversion to a learning feature amount, the acoustic score and the language score of the N best results are arranged for each domain and vectorized, as shown in FIG. 8. In this example, 2 score types (acoustic score and language score) × 3 domains × 2 best results are converted into a 12-dimensional learning feature amount.
  • the scores used for vectorization are not limited to the acoustic score and the language score; the sum of the two, or any other score obtained from the learning speech recognition unit 102a, may be used.
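  • the N best feature conversion described above might look like the following sketch (score values invented; 2 score types × 3 domains × 2 best = 12 dimensions, as in FIG. 8):

```python
# Sketch of the N-best feature conversion: for each domain, concatenate the
# (acoustic, language) score pairs of the 1st- and 2nd-best results.
# 2 score types x 3 domains x 2 best = 12 dimensions. Values are invented.
nbest_scores = {
    "A": [(-120.5, -30.2), (-122.1, -31.0)],  # (acoustic, language) for 1st/2nd best
    "B": [(-118.9, -35.7), (-119.4, -36.2)],
    "C": [(-125.0, -28.4), (-126.3, -29.1)],
}

def to_nbest_feature(scores, domains=("A", "B", "C"), n=2):
    vec = []
    for d in domains:
        for rank in range(n):
            vec.extend(scores[d][rank])
    return vec

feature = to_nbest_feature(nbest_scores)
print(len(feature))  # 12-dimensional learning feature amount
```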
  • the domain determination model 107 is calculated by the model learning unit 106 using the learning feature amount converted from the learning score 103a and the learning label data 105 (step ST203).
  • the learning label data 105 defines to which domain the learning speech data 101 belongs.
  • the model learning unit 106 calculates the domain determination model 107 so that the learning feature amount obtained by the learning feature amount conversion unit 104 a is associated with the learning label data 105.
  • the N best determination score 203a is calculated from the input speech data 201 by the determination speech recognition unit 202a (step ST211).
  • the determination speech recognition unit 202a uses the same speech recognition unit as the learning speech recognition unit 102a in the learning step.
  • the scores A1 to C1 and scores A2 to C2 of the determination score 203a are the first and second recognition results from each speech recognizer.
  • the determination score 203a is converted into a determination feature value by the determination feature value conversion unit 204a (step ST212).
  • the determination feature value conversion unit 204a uses the same feature value conversion unit as the learning feature value conversion unit 104a in the learning step.
  • the determination feature amount generated by the determination feature amount conversion unit 204a from the determination score 203a and the domain determination model 107 are input to the domain determination unit 205, and the domain determination result 206 is calculated (step ST213).
  • the domain determination unit 205 performs processing using the same statistical method as the model learning unit 106 in the learning step.
  • the domain determination unit 205 collates the feature amount input from the determination feature amount conversion unit 204a with the domain determination model 107, selects the domain having the highest occurrence probability, and sets the selected domain and the speech recognition result corresponding to that domain as the domain determination result 206.
  • as described above, in the second embodiment the N best learning score, a value indicating the N (N is an integer of 2 or more) best speech recognition results, is calculated from the learning speech data, and the N best determination score is likewise calculated from the input speech data. With a determination feature amount conversion unit that converts the N best determination score into a determination feature amount and a domain determination unit that collates the determination feature amount with the domain determination model and calculates a domain determination result indicating the domain of the input speech data, the N best results can be taken into account in the feature amounts used for domain determination.
  • Embodiment 3 In the third embodiment, in addition to the configuration of the second embodiment, dimensional compression of the feature amount is performed.
  • FIG. 10 is a configuration diagram of the speech recognition apparatus according to the present embodiment.
  • the speech recognition apparatus includes a learning execution unit 100b and a determination execution unit 200b.
  • the learning execution unit 100b includes a learning speech recognition unit 102a, a learning feature amount conversion unit 104a, a dimension compression matrix estimation unit 108, a learning dimension compression unit 110, and a model learning unit 106.
  • the determination execution unit 200b includes a determination speech recognition unit 202a, a determination feature amount conversion unit 204a, a determination dimension compression unit 207, and a domain determination unit 205.
  • the same reference symbols are attached to components identical to those of the second embodiment, and their description is omitted.
  • the dimension compression matrix estimation unit 108 in the learning execution unit 100b is a processing unit that calculates the dimension compression matrix 109 using the learning feature amount calculated from the learning feature amount conversion unit 104a and the learning label data 105.
  • the learning dimension compression unit 110 is a processing unit that multiplies the learning feature amount calculated from the learning feature amount conversion unit 104a by the dimension compression matrix 109 to compress the dimension of the learning feature amount.
  • the model learning unit 106 is a processing unit that calculates the domain determination model 107 using the learning feature amount compressed by the learning dimension compression unit 110 and the learning label data 105.
  • the determination speech recognition unit 202a and the determination feature amount conversion unit 204a use the same configuration as the learning speech recognition unit 102a and the learning feature amount conversion unit 104a of the learning execution unit 100b.
  • the determination dimension compression unit 207 is a processing unit that multiplies the determination feature amount calculated from the determination feature amount conversion unit 204a by the dimension compression matrix 109 to compress the dimension of the determination feature amount.
  • the dimensional compression matrix 109 is matrix data for performing dimensional compression of multidimensional feature values.
  • the domain determination unit 205 is a processing unit that calculates the domain determination result 206 using the determination feature amount calculated by the determination dimension compression unit 207 and the domain determination model 107.
  • the learning speech recognition unit 102a, the learning feature amount conversion unit 104a, the dimension compression matrix estimation unit 108, the learning dimension compression unit 110, the model learning unit 106, the determination speech recognition unit 202a, the determination feature amount conversion unit 204a, the determination dimension compression unit 207, and the domain determination unit 205 are each realized by the processor 1 executing a program stored in the memory 2.
  • the learning speech data 101, the learning score 103a, the learning label data 105, the domain determination model 107, the dimension compression matrix 109, the input speech data 201, the determination score 203a, and the domain determination result 206 are stored in the memory 2, respectively. It is stored in the area.
  • a plurality of processors 1 and memories 2 may be provided, and the plurality of processors 1 and the memories 2 may be configured to perform the functions described above in cooperation.
  • a learning score 103a is calculated from the learning speech data 101 by the learning speech recognition unit 102a (step ST301).
  • the learning speech recognition unit 102a includes a plurality of speech recognizers A to C, and each reads a language model and an acoustic model corresponding to each domain.
  • Scores A1 to C1 and scores A2 to C2 of the learning score 103a are the first and second recognition results obtained from each speech recognizer.
  • three recognizers A to C are used as an example.
  • the number of recognizers may be changed according to the number of domains, or the number of N best results of recognition may be changed.
  • the learning score 103a is converted into a learning feature quantity by the learning feature quantity conversion unit 104a (step ST302).
  • as a specific method of conversion to a learning feature amount, a method of arranging the acoustic score and the language score of the N best results for each domain and vectorizing them, as shown in FIG., is conceivable.
  • the scores used for vectorization are not limited to the acoustic score and the language score; the sum of the two, or any other score obtained from the learning speech recognition unit 102a, may be used.
  • the dimension compression matrix 109 is estimated by the dimension compression matrix estimation unit 108 using the learning feature amount converted from the learning score 103a and the learning label data 105 (step ST303).
  • the dimension compression matrix is calculated by applying a dimension compression method such as linear discriminant analysis (LDA: Linear Discriminant Analysis) or heteroscedastic discriminant analysis (HDA: Heteroscedastic Discriminant Analysis) to the feature vector obtained from the N best scores.
  • Advantages of dimensional compression include that supervised methods such as LDA and HDA can generate feature quantities suitable for identification, and a reduction in the number of model parameters when modeling with a mixed Gaussian distribution.
  • the learning dimension compression unit 110 dimensionally compresses the learning feature amount converted from the learning score 103a (step ST304). As shown in FIG. 12, dimension compression converts the feature amount obtained from the N best scores into a low-order vector by multiplying it by the dimension compression matrix 109. In the example of FIG. 12, the recognition results from first place to third place are used.
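  • the multiplication by the dimension compression matrix can be sketched as plain matrix-vector arithmetic; the fixed 3×12 matrix below is an illustrative stand-in for one that would actually be estimated by LDA or HDA, and the feature values are dummies:

```python
# Sketch of the dimension compression step (FIG. 12): multiply a 12-dim
# N-best feature vector by a compression matrix to get a low-order vector.
# In the patent the matrix is estimated with LDA or HDA; here a fixed
# illustrative 3x12 averaging matrix stands in for the estimated one.
feature = [float(i) for i in range(12)]          # 12-dim feature (dummy values)
matrix = [[1.0 / 12] * 12 for _ in range(3)]     # 3x12 dimension compression matrix

def compress(matrix, vec):
    # Plain matrix-vector product: one output component per matrix row.
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

compressed = compress(matrix, feature)
print(len(compressed))  # 3-dimensional compressed feature amount
```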
  • the domain determination model 107 is learned by the model learning unit 106 using the learning feature amount dimension-compressed by the learning dimension compression unit 110 and the learning label data 105 (step ST305).
  • the model learning unit 106 calculates a model so as to associate the learning feature amount dimensionally compressed by the learning dimensional compression unit 110 with the learning label data 105.
  • a determination score 203a is calculated from the input voice data 201 by the determination voice recognition unit 202a (step ST311).
  • the determination speech recognition unit 202a uses the same speech recognition unit as the learning speech recognition unit 102a in the learning step.
  • the scores A1 to C1 and scores A2 to C2 of the determination score 203a are the first and second recognition results from each speech recognizer.
  • the determination score 203a is converted into a determination feature value by the determination feature value conversion unit 204a (step ST312).
  • the determination feature quantity conversion unit 204a uses the same configuration as the learning feature quantity conversion unit 104a in the learning step.
  • the determination dimension compression unit 207 dimensionally compresses the determination feature amount converted from the determination score 203a (step ST313). The dimensional compression multiplies the feature amount obtained from the N best scores by the dimension compression matrix 109, as shown in FIG., to convert it into a low-order feature amount.
  • the domain determination unit 205 obtains the domain determination result 206 from the feature quantity dimension-compressed by the determination dimension compression unit 207 and the domain determination model 107 (step ST314).
  • the domain determination unit 205 performs processing using the same statistical method as in the learning step.
  • the domain determination unit 205 collates the determination feature amount dimension-compressed by the determination dimension compression unit 207 with the domain determination model 107, selects the domain having the highest occurrence probability, and sets the selected domain and the speech recognition result corresponding to that domain as the domain determination result 206.
  • as described above, in the third embodiment the N best learning score, a value indicating the N (N is an integer of 2 or more) best speech recognition results, is calculated from the learning speech data; a learning feature amount conversion unit converts the N best learning score into a learning feature amount; a dimension compression matrix estimation unit calculates a dimension compression matrix from the learning feature amount and the learning label data defining the domain of each learning utterance; a learning dimension compression unit compresses the dimension of the learning feature amount; and a model learning unit calculates a domain determination model indicating the relationship between feature amounts and domains from the compressed learning feature amount and the learning label data.
  • on the determination side, a determination speech recognition unit calculates the N best determination score, a value indicating the N best speech recognition results; a determination feature amount conversion unit converts it into a determination feature amount; a determination dimension compression unit compresses the dimension of the determination feature amount using the dimension compression matrix; and a domain determination unit collates the compressed determination feature amount with the domain determination model and calculates a domain determination result indicating the domain of the input speech data.
  • because the feature amount is compressed to a low dimension, feature amounts suited to discrimination can be handled, and the number of model parameters can be reduced depending on the type of model.
  • since the dimension compression matrix estimation unit receives the feature amounts and teacher labels and outputs a matrix that converts them to a low dimension, supervised methods such as LDA and HDA can generate feature amounts suitable for discrimination.
  • Embodiment 4 In the fourth embodiment, N (N is an integer of 2 or more) best recognition results are generated and a domain determination model is generated for each N best rank.
  • FIG. 14 is a configuration diagram of the speech recognition apparatus according to the present embodiment.
  • the speech recognition apparatus includes a learning execution unit 100c and a determination execution unit 200c.
  • the learning execution unit 100c includes a learning speech recognition unit 102a, a first learning feature amount conversion unit 104b, a second learning feature amount conversion unit 104c, a first model learning unit 106a, and a second model learning unit 106b.
  • the determination execution unit 200c includes a determination speech recognition unit 202a, a first determination feature amount conversion unit 204b, a second determination feature amount conversion unit 204c, a first domain determination unit 205a, a second domain determination unit 205b, and a domain determination unit 208.
  • components identical to those of the embodiments described above are denoted by the same reference symbols, and their description is omitted.
  • the first learning feature amount conversion unit 104b and the second learning feature amount conversion unit 104c each have the same configuration as the learning feature amount conversion unit 104 of the first embodiment.
  • the first learning feature amount conversion unit 104b is configured to convert the scores A1 to C1 of the first recognition results into a feature amount.
  • the second learning feature amount conversion unit 104c is configured to convert the scores A2 to C2 of the second recognition results into a feature amount.
  • the first model learning unit 106a and the second model learning unit 106b have the same configuration as the model learning unit 106 of the first embodiment.
  • the first model learning unit 106a is configured to calculate the first domain determination model 107a using the learning feature amount calculated by the first learning feature amount conversion unit 104b and the learning label data 105.
  • the second model learning unit 106b is configured to calculate the second domain determination model 107b using the learning feature amount calculated by the second learning feature amount conversion unit 104c and the learning label data 105.
  • here, the configuration for N = 2 is shown as an example, but N can be set to any value.
  • the determination speech recognition unit 202a, the first determination feature amount conversion unit 204b, and the second determination feature amount conversion unit 204c have the same configurations as the learning speech recognition unit 102a, the first learning feature amount conversion unit 104b, and the second learning feature amount conversion unit 104c in the learning execution unit 100c, respectively.
  • the first domain determination unit 205a uses the determination feature amount calculated by the first determination feature amount conversion unit 204b and the first domain determination model 107a to calculate the first domain determination result 206a.
  • the second domain determination unit 205b is a processing unit that calculates the second domain determination result 206b using the determination feature amount calculated by the second determination feature amount conversion unit 204c and the second domain determination model 107b.
  • the domain determination unit 208 is a processing unit that calculates the domain final determination result 209 using the first domain determination result 206a and the second domain determination result 206b.
  • the determination speech recognition unit 202a, the first determination feature amount conversion unit 204b, the second determination feature amount conversion unit 204c, the first domain determination unit 205a, the second domain determination unit 205b, and the domain determination unit 208 are each realized by the processor 1 shown in FIG. 2 executing a program stored in the memory 2.
  • the learning speech data 101, the learning score 103a, the learning label data 105, the domain determination model 107, the input speech data 201, the determination score 203a, the domain determination result 206, and the domain final determination result 209 are each stored in a storage area of the memory 2.
  • a plurality of processors 1 and memories 2 may be provided, and the plurality of processors 1 and the memories 2 may be configured to perform the functions described above in cooperation.
  • the N best learning score 103a is calculated from the learning speech data 101 by the learning speech recognition unit 102a (step ST401).
  • the learning speech recognition unit 102a includes a plurality of speech recognizers A to C, and each reads a language model and an acoustic model corresponding to each domain.
  • Scores A1 to C1 and scores A2 to C2 of the learning score 103a are the first and second recognition results obtained from each speech recognizer.
  • three recognizers A to C are used as an example.
  • the learning score 103a is converted into each learning feature amount by the first learning feature amount conversion unit 104b and the second learning feature amount conversion unit 104c every N best (step ST402).
  • as a specific method of converting to a learning feature amount, as shown in FIG. 4, a method of vectorizing the acoustic score and the language score by arranging the N best scores for each domain is conceivable.
  • the score required for vectorization is not limited to the acoustic score and the language score, but may be anything obtained by adding the acoustic score and the language score, or any other score obtained from the learning speech recognition unit 102a.
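The per-rank feature conversion described above can be sketched as follows; the domain names and score values are illustrative, not taken from the publication.

```python
# Embodiment-4-style feature conversion: for each N best rank n, arrange the
# (acoustic, language) scores of the n-th recognition result of every domain
# recognizer into its own feature vector. All values are illustrative.

def nbest_features(nbest_scores, domain_order, n_best):
    """nbest_scores: {domain: [(acoustic, language) per rank]} -> one vector per rank."""
    features = []
    for rank in range(n_best):
        vec = []
        for domain in domain_order:
            acoustic, language = nbest_scores[domain][rank]
            vec.extend([acoustic, language])
        features.append(vec)
    return features

scores = {  # two N best hypotheses from each of recognizers A to C (toy data)
    "A": [(-120.5, -30.2), (-121.0, -31.0)],
    "B": [(-118.9, -35.7), (-119.4, -36.0)],
    "C": [(-125.0, -28.1), (-126.2, -29.3)],
}
features = nbest_features(scores, ["A", "B", "C"], n_best=2)
# yields two 6-dimensional vectors, one per N best rank
```

Because each rank gets its own vector (and, later, its own model), the feature dimensionality stays at 2 × (number of domains) regardless of N.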
  • the first model learning unit 106a and the second model learning unit 106b use the learning feature amounts converted from the learning score 103a, together with the learning label data 105, for each N best rank to obtain the first domain determination model 107a and the second domain determination model 107b (step ST403). That is, the first model learning unit 106a and the second model learning unit 106b each calculate a model so as to associate the learning feature amounts obtained by the first learning feature amount conversion unit 104b and the second learning feature amount conversion unit 104c with the learning label data 105.
  • the N best determination score 203a is calculated from the input speech data 201 by the determination speech recognition unit 202a (step ST411).
  • the determination speech recognition unit 202a uses the same speech recognition unit as the learning speech recognition unit 102a in the learning step.
  • the scores A1 to C1 and scores A2 to C2 of the determination score 203a are the first and second recognition results from each speech recognizer.
  • the determination score 203a is converted into determination feature values every N best by the first determination feature value conversion unit 204b and the second determination feature value conversion unit 204c (step ST412).
  • the first determination feature amount conversion unit 204b and the second determination feature amount conversion unit 204c use the same feature amount converters as the first learning feature amount conversion unit 104b and the second learning feature amount conversion unit 104c in the learning step.
  • the first domain determination unit 205a and the second domain determination unit 205b acquire, for each N best rank, the determination feature amounts generated by the first determination feature amount conversion unit 204b and the second determination feature amount conversion unit 204c, together with the first domain determination model 107a and the second domain determination model 107b, and obtain the N best domain determination results (the first domain determination result 206a and the second domain determination result 206b) (step ST413).
  • the first domain determination unit 205a and the second domain determination unit 205b use the same statistical method as the first model learning unit 106a and the second model learning unit 106b in the learning step.
  • the first domain determination unit 205a and the second domain determination unit 205b collate the determination feature amounts generated by the first determination feature amount conversion unit 204b and the second determination feature amount conversion unit 204c against the first domain determination model 107a and the second domain determination model 107b, respectively, output the domain having the highest occurrence probability, and set that domain and the recognition result corresponding to it as the first domain determination result 206a and the second domain determination result 206b.
  • domain determining section 208 obtains domain final determination result 209 from N best domain determination results (first domain determination result 206a and second domain determination result 206b) (step ST414).
  • as a domain determination method, a simple majority vote of the N best domain determination results as shown in FIG. 17, or a majority vote with weights according to the rank of each domain determination result, is available. In the example of FIG. 17, the recognition results from the first place to the third place are shown.
  • a model is generated for each N best rank, so that the appearance of scores at an arbitrary rank can be modeled while the increase in the number of dimensions of the feature amount is suppressed. Further, by integrating the determination results of the N best domains by a method such as majority vote, dependence on only the top recognition result can be suppressed.
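The integration in step ST414 can be sketched as follows; the per-rank results and the weights are illustrative values, not from the publication.

```python
from collections import Counter

# Integrate the per-rank domain determination results: a simple majority vote,
# with an optional weight per N best rank. Results and weights are illustrative.

def integrate(domain_results, weights=None):
    """domain_results: domain decided for each N best rank (rank 1 first)."""
    if weights is None:
        weights = [1.0] * len(domain_results)  # simple (unweighted) majority
    tally = Counter()
    for domain, weight in zip(domain_results, weights):
        tally[domain] += weight
    return tally.most_common(1)[0][0]

per_rank = ["name", "address", "address"]          # results for ranks 1 to 3
final = integrate(per_rank)                        # simple majority favors "address"
weighted = integrate(per_rank, weights=[3, 1, 1])  # rank-1 weight flips it to "name"
```

The weighted variant shows why rank-dependent weights matter: a confident first-place result can outvote two agreeing lower-ranked results.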
  • a learning speech recognition unit that calculates an N best learning score, which is a value indicating the N (N is an integer of 2 or more) best speech recognition results of the learning speech data; a learning feature amount conversion unit that converts the N best learning score into a learning feature amount for each N best rank; a model learning unit that calculates, for each N best rank, a domain determination model indicating the relationship between a feature amount and a domain, using the learning feature amount for each rank and learning label data defining which domain the learning speech data is an utterance of; a determination speech recognition unit that calculates an N best determination score, which is a value indicating the N best speech recognition results of input speech data; a determination feature amount conversion unit that converts the N best determination score into a determination feature amount for each N best rank; and a domain determination unit that collates the determination feature amount for each N best rank with the domain determination model for each N best rank and calculates a domain determination result for each rank.
  • the speech recognition device relates to a configuration for determining which domain the input speech is an utterance of, and is suitable for application to navigation devices, home appliances, and the like, in order to improve speech recognition performance.
  • 100, 100a, 100b, 100c learning execution unit, 101 learning speech data, 102, 102a learning speech recognition unit, 103, 103a learning score, 104, 104a learning feature amount conversion unit, 104b first learning feature amount conversion unit, 104c second learning feature amount conversion unit, 105 learning label data, 106 model learning unit, 106a first model learning unit, 106b second model learning unit, 107 domain determination model, 107a first domain determination model, 107b second domain determination model, 108 dimension compression matrix estimation unit, 109 dimension compression matrix, 110 learning dimension compression unit, 200, 200a, 200b, 200c determination execution unit, 201 input speech data, 202, 202a determination speech recognition unit, 203, 203a determination score, 204, 204a determination feature amount conversion unit, 204b first determination feature amount conversion unit, 204c second determination feature amount conversion unit, 205 domain determination unit, 205a first domain determination unit, 205b second domain determination unit, 206 domain determination result, 206a first domain determination result,

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

A model learning unit (106) in a learning execution unit (100) calculates a domain determination model (107) representing a relationship between a feature amount and a domain using learning label data (105). A determination execution unit (200) performs speech recognition on input speech data (201) via a determination speech recognition unit (202) and calculates scores (203) as the speech recognition results. A determination feature amount conversion unit (204) calculates a feature amount based on the scores (203). A domain determination unit (205) calculates a domain determination result (206) indicating the domain represented by the input speech data by applying the domain determination model (107) to the feature amount calculated by the determination feature amount conversion unit (204).

Description

Speech recognition device
The present invention relates to a speech recognition device that determines which domain an input speech utterance belongs to.
In a speech recognition device whose recognition targets span a plurality of domains indicating categories such as address, name, and telephone number, the following method has been used to obtain the speech recognition result of the desired domain while determining which domain the input speech belongs to. First, a recognition result is calculated for each domain by speech recognition, and then the recognition results of the domains are compared with one another by score to obtain the final recognition result. For example, in the method disclosed in Patent Document 1, speech recognition results are first obtained by a plurality of speech recognition systems using statistical language models prepared for the different domains. As the reliability indicating which of the recognition results obtained by the recognition systems of the respective domains is closest to the domain of the utterance, a score given by the weighted sum of the acoustic score S_AM and the language score S_LM obtained at the time of speech recognition is used:

score = S_AM + α·S_LM
Here, α is a coefficient that controls the relative influence of the acoustic score and the language score, and is determined experimentally so as to reduce utterance-domain errors. The domain of the recognition result whose score by the above expression is maximum is determined to be the optimal domain, and that recognition result is presented as the optimal recognition result.
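As a concrete illustration of this conventional rule, the following sketch selects the domain that maximizes score = S_AM + α·S_LM; all score values and the weight α are illustrative, not taken from the publication.

```python
# Conventional domain selection: weighted sum of acoustic and language scores,
# score = S_AM + alpha * S_LM; the domain with the maximum score wins.
# All numeric values here are illustrative.

def select_domain(scores, alpha):
    """scores: {domain: (acoustic_score, language_score)} -> best domain."""
    return max(scores, key=lambda d: scores[d][0] + alpha * scores[d][1])

recognizer_scores = {
    "address": (-120.5, -30.2),
    "name": (-118.9, -35.7),
    "phone_number": (-125.0, -28.1),
}
alpha = 0.8  # experimentally tuned weight (illustrative)
best = select_domain(recognizer_scores, alpha)
```

Note that the result depends directly on the hand-tuned α, which is exactly the weakness the invention addresses by learning the score-to-domain correspondence instead.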
Patent Document 1: International Publication No. WO 2015/118645
In the conventional speech recognition device described above, the weighted sum of the score obtained at recognition time and the score obtained from the recognition result is taken, and the optimal domain is determined by the magnitude of that sum. However, the weighting coefficient of the weighted sum must be determined empirically, and, depending on the utterance, the score difference between domains can be small, making discrimination by score magnitude alone difficult.
The present invention has been made to solve these problems, and an object thereof is to provide a speech recognition device capable of improving domain determination accuracy and thereby improving speech recognition accuracy.
The speech recognition device according to the present invention includes: a learning speech recognition unit that calculates a learning score, which is a value indicating a speech recognition result, from learning speech data; a learning feature amount conversion unit that converts the learning score into a learning feature amount; a model learning unit that calculates a domain determination model indicating the relationship between a feature amount and a domain, using the learning feature amount and learning label data defining which domain the learning speech data is an utterance of; a determination speech recognition unit that calculates a determination score, which is a value indicating a speech recognition result, from input speech data; a determination feature amount conversion unit that converts the determination score into a determination feature amount; and a domain determination unit that collates the determination feature amount with the domain determination model and calculates a domain determination result indicating which domain the input speech data is an utterance of.
The speech recognition device according to the present invention calculates a domain determination model indicating the relationship between a feature amount and a domain using learning label data that defines which domain the learning speech data is an utterance of, and determines which domain the input speech data belongs to using this domain determination model. As a result, domain determination accuracy, and hence speech recognition performance, can be improved compared with the conventional approach of choosing the optimal domain by the magnitude of recognition scores.
FIG. 1 is a configuration diagram showing a speech recognition device according to Embodiment 1 of the present invention.
FIG. 2 is a hardware configuration diagram of the speech recognition device according to Embodiment 1 of the present invention.
FIG. 3 is a flowchart showing the flow of the domain determination model learning step of the speech recognition device according to Embodiment 1 of the present invention.
FIG. 4 is an explanatory diagram showing means for converting scores into feature amounts in the speech recognition device according to Embodiment 1 of the present invention.
FIG. 5 is a flowchart showing the flow of the domain determination step of the speech recognition device according to Embodiment 1 of the present invention.
FIG. 6 is a configuration diagram showing a speech recognition device according to Embodiment 2 of the present invention.
FIG. 7 is a flowchart showing the flow of the domain determination model learning step of the speech recognition device according to Embodiment 2 of the present invention.
FIG. 8 is an explanatory diagram showing means for converting scores into feature amounts in the speech recognition device according to Embodiment 2 of the present invention.
FIG. 9 is a flowchart showing the flow of the domain determination step of the speech recognition device according to Embodiment 2 of the present invention.
FIG. 10 is a configuration diagram showing a speech recognition device according to Embodiment 3 of the present invention.
FIG. 11 is a flowchart showing the flow of the domain determination model learning step of the speech recognition device according to Embodiment 3 of the present invention.
FIG. 12 is an explanatory diagram showing means for dimensionally compressing feature amounts in the speech recognition device according to Embodiment 3 of the present invention.
FIG. 13 is a flowchart showing the flow of the domain determination step of the speech recognition device according to Embodiment 3 of the present invention.
FIG. 14 is a configuration diagram showing a speech recognition device according to Embodiment 4 of the present invention.
FIG. 15 is a flowchart showing the flow of the domain determination model learning step of the speech recognition device according to Embodiment 4 of the present invention.
FIG. 16 is a flowchart showing the flow of the domain determination step of the speech recognition device according to Embodiment 4 of the present invention.
FIG. 17 is an explanatory diagram showing means for integrating a plurality of domain determination results in the speech recognition device according to Embodiment 4 of the present invention.
Hereinafter, in order to describe the present invention in more detail, modes for carrying out the invention will be described with reference to the accompanying drawings.
Embodiment 1.
FIG. 1 is a configuration diagram of the speech recognition device according to Embodiment 1. As illustrated, the speech recognition device according to the present embodiment includes a learning execution unit 100 and a determination execution unit 200. The learning execution unit 100 includes a learning speech recognition unit 102, a learning feature amount conversion unit 104, and a model learning unit 106; the determination execution unit 200 includes a determination speech recognition unit 202, a determination feature amount conversion unit 204, and a domain determination unit 205.
The learning speech recognition unit 102 in the learning execution unit 100 is a processing unit that calculates the learning score 103 using the learning speech data 101. The learning feature amount conversion unit 104 is a processing unit that converts the learning score 103 calculated by the learning speech recognition unit 102 into a learning feature amount. The model learning unit 106 is a processing unit that calculates the domain determination model 107 using the learning feature amount calculated by the learning feature amount conversion unit 104 and the learning label data 105 of the domain corresponding to the learning speech.
In the determination execution unit 200, the determination speech recognition unit 202 and the determination feature amount conversion unit 204 are the same as their counterparts in the learning execution unit 100. That is, the determination speech recognition unit 202 has the same configuration as the learning speech recognition unit 102 and is a processing unit that calculates the determination score 203 using the input speech data 201. The determination feature amount conversion unit 204 is a processing unit that converts the determination score 203 calculated by the determination speech recognition unit 202 into a determination feature amount. The domain determination unit 205 is a processing unit that calculates the domain determination result 206 using the determination feature amount calculated by the determination feature amount conversion unit 204 and the domain determination model 107.
FIG. 2 is a hardware configuration diagram of the speech recognition device according to Embodiment 1.
The speech recognition device is realized using a computer and includes a processor 1, a memory 2, an input/output interface (input/output I/F) 3, and a bus 4. The processor 1 is a functional unit that performs arithmetic processing as a computer; the memory 2 is a storage unit that stores various programs and computation results and constitutes a work area for the arithmetic processing of the processor 1. The input/output interface 3 is an interface for inputting the learning speech data 101 and the input speech data 201 and for outputting the domain determination result 206 to the outside. The bus 4 interconnects the processor 1, the memory 2, and the input/output interface 3.
The learning speech recognition unit 102, the learning feature amount conversion unit 104, the model learning unit 106, the determination speech recognition unit 202, the determination feature amount conversion unit 204, and the domain determination unit 205 shown in FIG. 1 are each realized by the processor 1 executing a program stored in the memory 2. The learning speech data 101, the learning score 103, the learning label data 105, the domain determination model 107, the input speech data 201, the determination score 203, and the domain determination result 206 are each stored in a storage area of the memory 2. A plurality of processors 1 and memories 2 may be provided, with the plurality of processors 1 and memories 2 cooperating to execute the functions described above.
Next, the operation of the speech recognition device according to Embodiment 1 will be described.
First, the domain determination model learning step performed by the learning execution unit 100 will be described with reference to the flowchart of FIG. 3.
In the learning step, the learning speech recognition unit 102 first performs speech recognition on the learning speech data 101 and calculates the learning score 103 (step ST101). Here, the learning speech recognition unit 102 includes a plurality of speech recognizers A to C, each of which loads a language model and an acoustic model corresponding to its domain. The scores A to C of the learning score 103 are the first-place recognition results from the speech recognizers A to C. As the learning score 103, an acoustic score or a language score, for example, can be used. In the present embodiment, three speech recognizers A to C are used as an example, but the number can be selected as appropriate according to the number of domains.
Next, the learning feature amount conversion unit 104 converts the learning score 103 into a learning feature amount (step ST102). As a specific conversion method, as shown in FIG. 4, the acoustic score and the language score can be arranged for each domain and vectorized. In the example shown in FIG. 4, this yields 2 (acoustic score + language score) × the number of domains = 6 dimensions. The scores used for vectorization are not limited to the acoustic score and the language score; the sum of the acoustic score and the language score, or any other value obtained from the learning speech recognition unit 102, may be used.
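The vectorization of FIG. 4 can be sketched as follows; the domain names and score values are illustrative, not taken from the publication.

```python
# Arrange the acoustic and language scores of the top recognition result of
# each domain recognizer into one feature vector: 2 scores x 3 domains = 6 dims.
# Score values and domain names are illustrative.

def scores_to_feature(scores_by_domain, domain_order):
    """scores_by_domain: {domain: (acoustic, language)} -> flat feature vector."""
    feature = []
    for domain in domain_order:
        acoustic, language = scores_by_domain[domain]
        feature.extend([acoustic, language])
    return feature

domains = ["A", "B", "C"]  # recognizers A to C
scores = {"A": (-120.5, -30.2), "B": (-118.9, -35.7), "C": (-125.0, -28.1)}
feature = scores_to_feature(scores, domains)  # 6-dimensional, as in FIG. 4
```

Keeping a fixed domain order matters: the learned model associates each vector position with a specific recognizer, so learning and determination must use the same ordering.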
Next, the model learning unit 106 calculates the domain determination model 107 using the learning feature amounts converted from the learning score 103 and the learning label data 105 (step ST103). Here, the learning label data 105 defines which domain each utterance of the learning speech data 101 belongs to. The model learning unit 106 calculates a model that associates the learning feature amounts obtained by the learning feature amount conversion unit 104 with the learning label data 105. As the method used by the model learning unit 106, statistical techniques such as a Gaussian mixture model, a support vector machine, or a neural network can be used.
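As a minimal sketch of step ST103, the following fits one diagonal-covariance Gaussian per domain, a deliberate simplification of the Gaussian mixture model, support vector machine, or neural network options named above; the feature values and labels are illustrative.

```python
# Minimal stand-in for the model learning unit: fit one diagonal-covariance
# Gaussian per domain to the learning feature vectors. This simplifies the
# statistical methods named in the text; data values are illustrative.

def learn_domain_model(features, labels):
    """features: list of vectors, labels: parallel domain names -> model dict."""
    model = {}
    for domain in set(labels):
        rows = [f for f, l in zip(features, labels) if l == domain]
        dim = len(rows[0])
        means = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
        varis = [max(sum((r[d] - means[d]) ** 2 for r in rows) / len(rows),
                     1e-6)  # variance floor keeps the model usable
                 for d in range(dim)]
        model[domain] = (means, varis)
    return model

learning_features = [[-120.0, -30.0], [-121.0, -31.0],   # "address" utterances
                     [-118.0, -36.0], [-117.0, -35.0]]   # "name" utterances
learning_labels = ["address", "address", "name", "name"]
domain_model = learn_domain_model(learning_features, learning_labels)
```

The resulting per-domain (means, variances) pairs play the role of the domain determination model 107: they summarize how score patterns tend to look for each domain.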
In this way, the learning execution unit 100 feeds the learning speech data 101 to a plurality of speech recognizers, converts the obtained recognition scores into learning feature amounts, and, using these learning feature amounts together with the learning label data 105 indicating the domain of each utterance, models the correspondence between score patterns and domains within a statistical machine learning framework.
Next, the domain determination step performed by the determination execution unit 200 will be described with reference to the flowchart of FIG. 5.
In the determination step, the determination speech recognition unit 202 first calculates the determination score 203 from the input speech data 201 (step ST111). Here, each recognizer in the determination speech recognition unit 202 is the same as the corresponding recognizer used in the learning step. The scores A to C of the determination score 203 are the first-place recognition results from the respective speech recognizers.
Next, the determination feature amount conversion unit 204 converts the determination score 203 into a determination feature amount (step ST112). The determination feature amount conversion unit 204 uses the same feature amount conversion as in the learning step.
Next, the determination feature amount generated from the determination score 203 by the determination feature amount conversion unit 204 and the domain determination model 107 are input to the domain determination unit 205, which calculates the domain determination result 206 (step ST113). The domain determination unit 205 uses the same statistical technique as the model learning unit 106 in the learning step. The domain determination unit 205 collates the determination feature amount with the domain determination model 107, selects the domain with the highest occurrence probability, and outputs the selected domain and the speech recognition result corresponding to that domain as the domain determination result 206.
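Step ST113 can be sketched as follows: a determination feature amount is collated with a Gaussian domain determination model (hand-written here for illustration; the publication names Gaussian mixtures, SVMs, and neural networks as options) and the domain with the highest likelihood is output.

```python
import math

# Sketch of the domain determination step: score a determination feature vector
# against each domain's (means, variances) and output the most likely domain.
# The model parameters and the feature vector are illustrative.

def log_likelihood(feature, means, varis):
    """Diagonal-covariance Gaussian log-likelihood of one feature vector."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(feature, means, varis))

domain_model = {  # {domain: (means, variances)}, illustrative parameters
    "address": ([-120.0, -30.0], [4.0, 2.0]),
    "name": ([-118.0, -36.0], [4.0, 2.0]),
}
feature = [-119.5, -30.5]  # determination feature amount for one utterance
result = max(domain_model,
             key=lambda d: log_likelihood(feature, *domain_model[d]))
```

Because the same converter and the same statistical family are used in learning and determination, the likelihood comparison is meaningful; here the feature sits near the "address" distribution, so that domain is selected.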
As described above, the speech recognition apparatus of Embodiment 1 includes a learning speech recognition unit that calculates, from learning speech data, learning scores that are values indicating speech recognition results; a learning feature conversion unit that converts the learning scores into learning features; a model learning unit that calculates a domain determination model representing the relationship between features and domains, using the learning features and learning label data that defines which domain each learning utterance belongs to; a determination speech recognition unit that calculates, from input speech data, determination scores that are values indicating speech recognition results; a determination feature conversion unit that converts the determination scores into determination features; and a domain determination unit that matches the determination features against the domain determination model and calculates a domain determination result indicating which domain the input speech data belongs to. Because the correspondence between score tendencies and domains can be learned in advance, higher domain determination accuracy can be expected than with methods that determine the domain directly from the scores obtained from the input speech data.
Embodiment 2.
Embodiment 2 is an example in which each speech recognizer of the learning speech recognition unit and the determination speech recognition unit generates the N-best recognition results (N is an integer of 2 or more), so that lower-ranked results are also taken into account when determining the domain.
FIG. 6 is a configuration diagram of the speech recognition apparatus according to this embodiment.
As shown in the figure, the speech recognition apparatus according to this embodiment includes a learning execution unit 100a and a determination execution unit 200a. The learning execution unit 100a includes a learning speech recognition unit 102a, a learning feature conversion unit 104a, and a model learning unit 106; the determination execution unit 200a includes a determination speech recognition unit 202a, a determination feature conversion unit 204a, and a domain determination unit 205. Components identical to those of Embodiment 1 are given the same reference numerals, and their description is omitted or simplified.
The learning speech recognition unit 102a in the learning execution unit 100a is a processing unit that uses the learning speech data 101 to calculate learning scores 103a for the top N recognition results. The learning feature conversion unit 104a is a processing unit that converts the N-best learning scores 103a calculated by the learning speech recognition unit 102a into learning features. The model learning unit 106 is a processing unit that calculates the domain determination model 107 using the learning features calculated by the learning feature conversion unit 104a and the learning label data 105, which is the label data of the domains corresponding to the learning utterances.
In the determination execution unit 200a, the determination speech recognition unit 202a and the determination feature conversion unit 204a have the same configuration as the learning speech recognition unit 102a and the learning feature conversion unit 104a in the learning execution unit 100a. The determination speech recognition unit 202a is a processing unit that calculates the N-best determination scores 203a from the input speech data 201. The determination feature conversion unit 204a is a processing unit that converts the N-best determination scores 203a calculated by the determination speech recognition unit 202a into determination features. The domain determination unit 205 is a processing unit that calculates the domain determination result 206 using the determination features calculated by the determination feature conversion unit 204a and the domain determination model 107.
The learning speech recognition unit 102a, learning feature conversion unit 104a, model learning unit 106, determination speech recognition unit 202a, determination feature conversion unit 204a, and domain determination unit 205 shown in FIG. 6 are each realized by the processor 1 shown in FIG. 2 executing a program stored in the memory 2. The learning speech data 101, learning scores 103a, learning label data 105, domain determination model 107, input speech data 201, determination scores 203a, and domain determination result 206 are each stored in a storage area of the memory 2. A plurality of processors 1 and memories 2 may be provided, and the plurality of processors 1 and memories 2 may cooperate to execute the functions described above.
Next, the operation of the speech recognition apparatus of Embodiment 2 will be described.
First, the domain determination model learning step performed by the learning execution unit 100a will be described using the flowchart of FIG. 7.
In the learning step, first, the learning speech recognition unit 102a calculates the N-best learning scores 103a from the learning speech data 101 (step ST201). The learning speech recognition unit 102a consists of a plurality of speech recognizers A to C, each of which has loaded a language model and an acoustic model corresponding to its domain. Scores A1 to C1 and scores A2 to C2 of the learning scores 103a are the first- and second-ranked recognition results obtained from the respective speech recognizers. Although this embodiment uses three recognizers A to C as an example, the number of recognizers may be changed according to the number of domains, and the number N of N-best results may also be changed.
Next, the learning feature conversion unit 104a converts the learning scores 103a into a learning feature (step ST202). One concrete conversion method, shown in FIG. 8, is to arrange the acoustic and language scores of the N-best results for each domain into a vector. The illustrated example converts 2 (acoustic score + language score) × 3 domains × 2-best into a 12-dimensional learning feature. The scores used for vectorization are not limited to the acoustic score and the language score; the sum of the two, or any other value obtainable from the learning speech recognition unit 102a, may be used.
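The vectorization of FIG. 8 can be sketched as follows. The score values are hypothetical placeholders, and the exact ordering of the per-domain (acoustic, language) pairs is an assumption, since the text states only that the scores are arranged per domain and rank:

```python
# Build the 12-dimensional learning feature of FIG. 8:
# 2 scores (acoustic + language) x 3 domains (A-C) x 2-best hypotheses.
# All score values below are hypothetical placeholders.
nbest_scores = {
    "A": [(-1523.4, -210.7), (-1540.2, -215.3)],  # domain A: (acoustic, language) for 1st/2nd best
    "B": [(-1611.8, -198.2), (-1625.0, -201.9)],
    "C": [(-1587.1, -224.5), (-1590.6, -230.1)],
}

def to_feature_vector(nbest_scores, domains=("A", "B", "C"), n_best=2):
    """Concatenate the (acoustic, language) score pair for each domain and rank."""
    vec = []
    for d in domains:
        for rank in range(n_best):
            acoustic, language = nbest_scores[d][rank]
            vec.extend([acoustic, language])
    return vec

feature = to_feature_vector(nbest_scores)
assert len(feature) == 12  # 2 scores x 3 domains x 2-best
```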
Next, the model learning unit 106 calculates the domain determination model 107 using the learning features converted from the learning scores 103a and the learning label data 105 (step ST203). The learning label data 105 defines which domain each utterance in the learning speech data 101 belongs to. The model learning unit 106 calculates the domain determination model 107 so as to associate the learning features obtained by the learning feature conversion unit 104a with the learning label data 105.
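Step ST203 can be sketched as follows. The patent leaves the statistical model open, so this illustration assumes a single diagonal Gaussian fitted per domain from the labeled learning features; the domain names and data are hypothetical:

```python
import numpy as np

def fit_domain_model(features, labels):
    """Fit one diagonal Gaussian per domain from labeled learning features.
    features: (n_samples, n_dims) array; labels: per-sample domain names."""
    model = {}
    labels = np.asarray(labels)
    for domain in np.unique(labels):
        x = features[labels == domain]
        model[domain] = (x.mean(axis=0), x.var(axis=0) + 1e-6)  # (mean, variance)
    return model

# Hypothetical 2-D learning features for two domains
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
labels = ["navi"] * 50 + ["audio"] * 50
model = fit_domain_model(feats, labels)
```

The determination step would then score an input feature under each stored (mean, variance) pair and select the domain with the highest likelihood.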
Next, the domain determination step performed by the determination execution unit 200a will be described using the flowchart of FIG. 9.
In the determination step, first, the determination speech recognition unit 202a calculates the N-best determination scores 203a from the input speech data 201 (step ST211). The determination speech recognition unit 202a uses the same speech recognizers as the learning speech recognition unit 102a in the learning step. Scores A1 to C1 and scores A2 to C2 of the determination scores 203a are the first- and second-ranked recognition results from the respective speech recognizers.
Next, the determination feature conversion unit 204a converts the determination scores 203a into a determination feature (step ST212). The determination feature conversion unit 204a is the same feature conversion unit as the learning feature conversion unit 104a in the learning step.
Next, the determination feature generated from the determination scores 203a by the determination feature conversion unit 204a and the domain determination model 107 are input to the domain determination unit 205, which calculates the domain determination result 206 (step ST213). The domain determination unit 205 performs its processing using the same statistical method as the model learning unit 106 in the learning step: it matches the input determination feature against the domain determination model 107, selects the domain with the highest occurrence probability, and outputs the selected domain and the speech recognition result corresponding to that domain as the domain determination result 206.
As described above, the speech recognition apparatus of Embodiment 2 includes a learning speech recognition unit that calculates, from learning speech data, N-best learning scores that are values indicating the N best speech recognition results (N is an integer of 2 or more); a learning feature conversion unit that converts the N-best learning scores into learning features; a model learning unit that calculates a domain determination model representing the relationship between features and domains, using the learning features and learning label data that defines which domain each learning utterance belongs to; a determination speech recognition unit that calculates, from input speech data, N-best determination scores that are values indicating the N best speech recognition results; a determination feature conversion unit that converts the N-best determination scores into determination features; and a domain determination unit that matches the determination features against the domain determination model and calculates a domain determination result indicating which domain the input speech data belongs to. Since the features used for domain determination now take the N-best results into account, a further improvement in domain determination accuracy can be expected in addition to the effects of Embodiment 1.
Embodiment 3.
Embodiment 3 adds dimensional compression of the features to the configuration of Embodiment 2.
FIG. 10 is a configuration diagram of the speech recognition apparatus according to this embodiment.
As shown in the figure, the speech recognition apparatus according to this embodiment includes a learning execution unit 100b and a determination execution unit 200b. The learning execution unit 100b includes a learning speech recognition unit 102a, a learning feature conversion unit 104a, a dimension compression matrix estimation unit 108, a learning dimension compression unit 110, and a model learning unit 106; the determination execution unit 200b includes a determination speech recognition unit 202a, a determination feature conversion unit 204a, a determination dimension compression unit 207, and a domain determination unit 205. Components identical to those of Embodiment 2 are given the same reference numerals, and their description is omitted or simplified.
The dimension compression matrix estimation unit 108 in the learning execution unit 100b is a processing unit that calculates the dimension compression matrix 109 using the learning features calculated by the learning feature conversion unit 104a and the learning label data 105. The learning dimension compression unit 110 is a processing unit that multiplies the learning features calculated by the learning feature conversion unit 104a by the dimension compression matrix 109 to compress their dimensionality. The model learning unit 106 is a processing unit that calculates the domain determination model 107 using the learning features compressed by the learning dimension compression unit 110 and the learning label data 105.
In the determination execution unit 200b, the determination speech recognition unit 202a and the determination feature conversion unit 204a have the same configuration as the learning speech recognition unit 102a and the learning feature conversion unit 104a of the learning execution unit 100b. The determination dimension compression unit 207 is a processing unit that multiplies the determination features calculated by the determination feature conversion unit 204a by the dimension compression matrix 109 to compress their dimensionality. Here, the dimension compression matrix 109 is matrix data for performing dimensional compression of multidimensional features. The domain determination unit 205 is a processing unit that calculates the domain determination result 206 using the determination features calculated by the determination dimension compression unit 207 and the domain determination model 107.
The learning speech recognition unit 102a, learning feature conversion unit 104a, model learning unit 106, dimension compression matrix estimation unit 108, learning dimension compression unit 110, determination speech recognition unit 202a, determination feature conversion unit 204a, determination dimension compression unit 207, and domain determination unit 205 shown in FIG. 10 are each realized by the processor 1 executing a program stored in the memory 2. The learning speech data 101, learning scores 103a, learning label data 105, domain determination model 107, dimension compression matrix 109, input speech data 201, determination scores 203a, and domain determination result 206 are each stored in a storage area of the memory 2. A plurality of processors 1 and memories 2 may be provided, and the plurality of processors 1 and memories 2 may cooperate to execute the functions described above.
Next, the operation of the speech recognition apparatus of Embodiment 3 will be described.
First, the domain determination model learning step performed by the learning execution unit 100b will be described using the flowchart of FIG. 11.
In the learning step, first, the learning speech recognition unit 102a calculates the learning scores 103a from the learning speech data 101 (step ST301). The learning speech recognition unit 102a consists of a plurality of speech recognizers A to C, each of which has loaded a language model and an acoustic model corresponding to its domain. Scores A1 to C1 and scores A2 to C2 of the learning scores 103a are the first- and second-ranked recognition results obtained from the respective speech recognizers. Although this embodiment uses three recognizers A to C as an example, the number of recognizers may be changed according to the number of domains, and the number N of N-best results may also be changed.
Next, the learning feature conversion unit 104a converts the learning scores 103a into a learning feature (step ST302). As in Embodiment 2, one concrete conversion method, shown in FIG. 8, is to arrange the acoustic and language scores of the N-best results for each domain into a vector. The scores used for vectorization are not limited to the acoustic score and the language score; the sum of the two, or any other value obtainable from the learning speech recognition unit 102a, may be used.
Next, the dimension compression matrix estimation unit 108 estimates the dimension compression matrix 109 using the learning features converted from the learning scores 103a and the learning label data 105 (step ST303). Specifically, as shown in FIG. 12, the matrix is calculated by applying a dimension compression method such as linear discriminant analysis (LDA) or heteroscedastic discriminant analysis (HDA) to the feature vectors obtained from the N-best scores. The advantages of dimensional compression are that supervised methods such as LDA and HDA can generate features well suited to discrimination, and that, when modeling with a Gaussian mixture distribution, the number of model parameters is reduced.
Next, using the dimension compression matrix 109 calculated by the dimension compression matrix estimation unit 108 and the learning features converted from the learning scores 103a, the learning dimension compression unit 110 dimensionally compresses those learning features (step ST304). Dimensional compression, as shown in FIG. 12, converts the features obtained from the N-best scores into low-order vector features by multiplying them by the dimension compression matrix 109. The example of FIG. 12 shows the case in which the first- through third-ranked recognition results were obtained.
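Steps ST303 and ST304 can be sketched with classical Fisher LDA, one of the two supervised compression methods named above; HDA would replace the eigenvalue problem with a heteroscedastic objective. The feature dimensionality, sample counts, and data below are hypothetical:

```python
import numpy as np

def estimate_lda_matrix(X, y, out_dim):
    """Estimate an LDA projection matrix: top eigenvectors of Sw^-1 Sb,
    where Sw/Sb are the within-/between-class scatter matrices."""
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # the most discriminative directions maximize between- vs within-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:out_dim]]   # shape (d, out_dim)

# Step ST303: estimate the matrix from hypothetical 12-dim labeled features
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(i * 3.0, 1.0, (40, 12)) for i in range(3)])  # 3 domains
y = np.repeat(np.arange(3), 40)
W = estimate_lda_matrix(X, y, out_dim=2)      # dimension compression matrix 109

# Step ST304: compress features by multiplying with the matrix
X_low = X @ W                                 # (120, 2) low-order feature vectors
assert X_low.shape == (120, 2)
```

Note that LDA yields at most (number of domains − 1) discriminative directions, which bounds the usable `out_dim`.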
Next, the model learning unit 106 learns the domain determination model 107 using the learning features dimension-compressed by the learning dimension compression unit 110 and the learning label data 105 (step ST305). The model learning unit 106 calculates the model so as to associate the compressed learning features with the learning label data 105.
Next, the domain determination step performed by the determination execution unit 200b will be described using the flowchart of FIG. 13.
In the determination step, first, the determination speech recognition unit 202a calculates the determination scores 203a from the input speech data 201 (step ST311). The determination speech recognition unit 202a uses the same speech recognizers as the learning speech recognition unit 102a in the learning step. Scores A1 to C1 and scores A2 to C2 of the determination scores 203a are the first- and second-ranked recognition results from the respective speech recognizers.
Next, the determination feature conversion unit 204a converts the determination scores 203a into a determination feature (step ST312). The determination feature conversion unit 204a has the same configuration as the learning feature conversion unit 104a in the learning step.
Next, using the dimension compression matrix 109 calculated by the dimension compression matrix estimation unit 108 and the determination feature converted from the determination scores 203a, the determination dimension compression unit 207 dimensionally compresses that determination feature (step ST313). As in the learning dimension compression unit 110 of the learning execution unit 100b, the compression converts the feature obtained from the N-best scores into a low-order vector feature by multiplying it by the dimension compression matrix 109, as shown in FIG. 12.
Next, the domain determination unit 205 obtains the domain determination result 206 from the features dimension-compressed by the determination dimension compression unit 207 and the domain determination model 107 (step ST314). The domain determination unit 205 performs its processing using the same statistical method as in the learning step: it matches the compressed determination feature against the domain determination model 107, selects the domain with the highest occurrence probability, and outputs the selected domain and the speech recognition result corresponding to that domain as the domain determination result 206.
As described above, the speech recognition apparatus of Embodiment 3 includes a learning speech recognition unit that calculates, from learning speech data, N-best learning scores that are values indicating the N best speech recognition results (N is an integer of 2 or more); a learning feature conversion unit that converts the N-best learning scores into learning features; a dimension compression matrix estimation unit that estimates a dimension compression matrix for compressing the dimensionality of the learning features, using the learning features and learning label data that defines which domain each learning utterance belongs to; a learning dimension compression unit that compresses the dimensionality of the learning features using the learning features and the dimension compression matrix; a model learning unit that calculates a domain determination model representing the relationship between features and domains, using the learning features compressed by the learning dimension compression unit and the learning label data; a determination speech recognition unit that calculates, from input speech data, N-best determination scores that are values indicating the N best speech recognition results; a determination feature conversion unit that converts the N-best determination scores into determination features; a determination dimension compression unit that compresses the dimensionality of the determination features using the determination features and the dimension compression matrix; and a domain determination unit that matches the determination features compressed by the determination dimension compression unit against the domain determination model and calculates a domain determination result indicating which domain the input speech data belongs to. In addition to the effects of Embodiment 2, compressing the features to a lower dimensionality makes it possible to handle features well suited to discrimination and, depending on the type of model, to reduce the number of model parameters.
Also, according to the speech recognition apparatus of Embodiment 3, the dimension compression matrix estimation unit takes features and teacher labels as input and outputs a matrix that converts the features to a lower dimensionality, so features well suited to discrimination can be generated.
Embodiment 4.
Embodiment 4 is an example in which the N-best recognition results (N is an integer of 2 or more) are generated and a domain determination model is generated for each N-best rank.
FIG. 14 is a configuration diagram of the speech recognition apparatus according to this embodiment.
As shown in the figure, the speech recognition apparatus according to this embodiment includes a learning execution unit 100c and a determination execution unit 200c. The learning execution unit 100c includes a learning speech recognition unit 102a, a first learning feature conversion unit 104b and a second learning feature conversion unit 104c, and a first model learning unit 106a and a second model learning unit 106b; the determination execution unit 200c includes a determination speech recognition unit 202a, a first determination feature conversion unit 204b and a second determination feature conversion unit 204c, a first domain determination unit 205a and a second domain determination unit 205b, and a domain finalization unit 208. Components identical to those of Embodiment 2 are given the same reference numerals, and their description is omitted or simplified.
The first learning feature conversion unit 104b and the second learning feature conversion unit 104c each have the same configuration as the learning feature conversion unit 104 of Embodiment 1, and are processing units that convert the learning scores 103a calculated by the learning speech recognition unit 102a into learning features. However, the first learning feature conversion unit 104b converts the first-ranked scores A1 to C1 into features, while the second learning feature conversion unit 104c converts the second-ranked scores A2 to C2. The first model learning unit 106a and the second model learning unit 106b each have the same configuration as the model learning unit 106 of Embodiment 1, except that the first model learning unit 106a calculates the first domain determination model 107a using the learning features calculated by the first learning feature conversion unit 104b and the learning label data 105, while the second model learning unit 106b calculates the second domain determination model 107b using the learning features calculated by the second learning feature conversion unit 104c and the learning label data 105. Although the illustrated example shows the per-rank configuration for N = 2, any value of N may be used.
In the determination execution unit 200c, the determination speech recognition unit 202a, the first determination feature amount conversion unit 204b, and the second determination feature amount conversion unit 204c have the same configurations as the learning speech recognition unit 102a, the first learning feature amount conversion unit 104b, and the second learning feature amount conversion unit 104c of the learning execution unit 100c, respectively. The first domain determination unit 205a is a processing unit that calculates the first domain determination result 206a using the determination feature amount calculated by the first determination feature amount conversion unit 204b and the first domain determination model 107a. The second domain determination unit 205b is a processing unit that calculates the second domain determination result 206b using the determination feature amount calculated by the second determination feature amount conversion unit 204c and the second domain determination model 107b. The domain confirmation unit 208 is a processing unit that calculates the domain final determination result 209 using the first domain determination result 206a and the second domain determination result 206b. Although the illustrated learning execution unit 100c and determination execution unit 200c show a per-rank configuration with N = 2, N may be set to any value.
The learning speech recognition unit 102a, first learning feature amount conversion unit 104b, second learning feature amount conversion unit 104c, first model learning unit 106a, second model learning unit 106b, determination speech recognition unit 202a, first determination feature amount conversion unit 204b, second determination feature amount conversion unit 204c, first domain determination unit 205a, second domain determination unit 205b, and domain confirmation unit 208 shown in FIG. 14 are each realized by the processor 1 shown in FIG. 2 executing a program stored in the memory 2. The learning speech data 101, learning score 103a, learning label data 105, domain determination model 107, input speech data 201, determination score 203a, domain determination result 206, and domain final determination result 209 are each stored in a storage area of the memory 2. A plurality of processors 1 and memories 2 may be provided, with the plurality of processors 1 and memories 2 cooperating to perform the functions described above.
Next, the operation of the speech recognition apparatus according to Embodiment 4 will be described.
First, the domain determination model learning step performed by the learning execution unit 100c will be described with reference to the flowchart of FIG. 15.
In the learning step, the learning speech recognition unit 102a first calculates the N-best learning score 103a from the learning speech data 101 (step ST401). Here, the learning speech recognition unit 102a consists of a plurality of speech recognizers A to C, each of which has loaded a language model and an acoustic model corresponding to its domain. The scores A1 to C1 and A2 to C2 of the learning score 103a are the first- and second-rank recognition results obtained from each speech recognizer. Although this embodiment uses three recognizers A to C as an example, the number of recognizers may be varied according to the number of domains, and the number N of N-best recognition results may also be varied.
Next, the learning score 103a is converted, rank by rank, into learning feature amounts by the first learning feature amount conversion unit 104b and the second learning feature amount conversion unit 104c (step ST402). As a concrete conversion method, as shown in FIG. 4, the acoustic and language scores of the N-best results can be arranged per domain and vectorized. The scores used for vectorization are not limited to the acoustic score and the language score; the sum of the two, or any other value obtainable from the learning speech recognition unit 102a, may be used.
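As an illustrative sketch of this vectorization step, a per-rank feature vector can be formed by concatenating each recognizer's acoustic and language scores at that rank. The dictionary layout and the score values below are assumptions made for illustration; the text only requires that the scores be arranged into a vector per domain.

```python
# Sketch: convert N-best scores from several domain-specific recognizers
# into one feature vector per rank. Data layout is hypothetical.

def rank_feature(scores, rank):
    """Concatenate (acoustic, language) scores of every recognizer at `rank`."""
    vec = []
    for recognizer in sorted(scores):           # e.g. "A", "B", "C"
        acoustic, language = scores[recognizer][rank]
        vec.extend([acoustic, language])
    return vec

# Hypothetical 2-best scores for recognizers A-C (rank 0 = 1st result).
nbest = {
    "A": [(-120.5, -30.2), (-121.0, -31.8)],
    "B": [(-118.9, -28.7), (-119.4, -29.9)],
    "C": [(-125.1, -35.6), (-126.3, -36.0)],
}

first_rank = rank_feature(nbest, 0)   # fed to the first feature converter
second_rank = rank_feature(nbest, 1)  # fed to the second feature converter
print(first_rank)
```

With three recognizers and two scores each, every rank yields a six-dimensional vector, which is why modeling each rank separately keeps the dimensionality from growing with N.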
Next, using the learning feature amounts converted from the learning score 103a and the learning label data 105, the first model learning unit 106a and the second model learning unit 106b obtain, for each N-best rank, the first domain determination model 107a and the second domain determination model 107b (step ST403). That is, the first model learning unit 106a and the second model learning unit 106b each calculate a model that associates the learning feature amounts obtained by the first learning feature amount conversion unit 104b and the second learning feature amount conversion unit 104c with the learning label data 105.
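The text leaves the statistical method open. As a minimal stand-in, a nearest-centroid model could be fitted per rank as follows; the training features, labels, and domain names are invented for illustration and do not come from the patent.

```python
# Sketch: fit one simple "domain determination model" for one N-best rank.
# A nearest-centroid classifier stands in for whatever statistical model
# the model learning units actually use.

def fit_centroids(features, labels):
    """Return {domain: mean feature vector} for one rank's training set."""
    sums, counts = {}, {}
    for vec, dom in zip(features, labels):
        acc = sums.setdefault(dom, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[dom] = counts.get(dom, 0) + 1
    return {dom: [v / counts[dom] for v in acc] for dom, acc in sums.items()}

# Invented rank-1 learning features (2 dims) with domain labels.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["navi", "music", "music", "navi"]
labels = ["navi", "navi", "music", "music"]  # one label per feature vector
model_rank1 = fit_centroids(feats, labels)
print(sorted(model_rank1))  # the domains the model can decide between
```

A second, independent call on the rank-2 features would produce the second model, mirroring the two model learning units.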
Next, the domain determination step performed by the determination execution unit 200c will be described with reference to the flowchart of FIG. 16.
In the determination step, the determination speech recognition unit 202a first calculates the N-best determination score 203a from the input speech data 201 (step ST411). Here, the determination speech recognition unit 202a uses the same speech recognition unit as the learning speech recognition unit 102a of the learning step. The scores A1 to C1 and A2 to C2 of the determination score 203a are the first- and second-rank recognition results from each speech recognizer.
Next, the determination score 203a is converted into determination feature amounts, rank by rank, by the first determination feature amount conversion unit 204b and the second determination feature amount conversion unit 204c (step ST412). The first determination feature amount conversion unit 204b and the second determination feature amount conversion unit 204c use the same feature amount conversion as the first learning feature amount conversion unit 104b and the second learning feature amount conversion unit 104c of the learning step.
Next, the first domain determination unit 205a and the second domain determination unit 205b acquire, for each N-best rank, the determination feature amounts generated by the first determination feature amount conversion unit 204b and the second determination feature amount conversion unit 204c together with the first domain determination model 107a and the second domain determination model 107b, and obtain N domain determination results (the first domain determination result 206a and the second domain determination result 206b) (step ST413). The first domain determination unit 205a and the second domain determination unit 205b use the same statistical method as the first model learning unit 106a and the second model learning unit 106b of the learning step: each collates the determination feature amount generated by its feature amount conversion unit against its domain determination model, outputs the domain with the highest occurrence probability, and takes that domain and the recognition result corresponding to it as the first domain determination result 206a or the second domain determination result 206b.
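As a matching sketch of this collation step (using the same invented nearest-centroid stand-in; the actual statistical method is not fixed by the text), one rank's determination unit can score its determination feature against its model and emit the domain with the highest pseudo-probability:

```python
import math

# Sketch: collate one rank's determination feature with that rank's model
# (here: domain centroids, an illustrative stand-in) and return the domain
# with the highest score, mimicking "output the domain with the highest
# occurrence probability". Centroid values and domain names are invented.

def determine_domain(feature, centroids):
    # Softmax over negative Euclidean distances gives pseudo-probabilities.
    scores = {dom: -math.dist(feature, c) for dom, c in centroids.items()}
    m = max(scores.values())
    expd = {dom: math.exp(s - m) for dom, s in scores.items()}
    z = sum(expd.values())
    probs = {dom: e / z for dom, e in expd.items()}
    return max(probs, key=probs.get), probs

centroids = {"navi": [0.95, 0.05], "music": [0.05, 0.95]}
domain, probs = determine_domain([0.8, 0.2], centroids)
print(domain)  # the feature lies closer to the "navi" centroid
```

Running the same function with the rank-2 feature and the rank-2 model would yield the second domain determination result.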
Next, the domain confirmation unit 208 obtains the domain final determination result 209 from the N domain determination results (the first domain determination result 206a and the second domain determination result 206b) (step ST414). The domain can be confirmed, for example, by a simple majority vote over the N domain determination results as shown in FIG. 17, or by a majority vote weighted according to the rank of each determination result. The example of FIG. 17 shows a case in which the first- through third-rank recognition results were obtained.
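The two confirmation strategies mentioned here can be sketched as follows. The 1/(rank+1) weighting is an illustrative choice, not one prescribed by the text, and the domain names are invented.

```python
from collections import Counter

# Sketch: confirm the final domain from per-rank determination results.

def simple_majority(results):
    """results: list of domains, one per N-best rank."""
    return Counter(results).most_common(1)[0][0]

def weighted_majority(results, weights=None):
    """Weight higher-ranked results more; 1/(rank+1) is an illustrative
    weighting, not one prescribed by the text."""
    if weights is None:
        weights = [1.0 / (i + 1) for i in range(len(results))]
    totals = {}
    for dom, w in zip(results, weights):
        totals[dom] = totals.get(dom, 0.0) + w
    return max(totals, key=totals.get)

ranks = ["navi", "music", "navi"]      # e.g. rank-1 to rank-3 decisions
print(simple_majority(ranks))          # "navi" (2 of 3 votes)
print(weighted_majority(["music", "navi", "navi"]))  # "music": rank 1 outweighs
```

The second call shows how weighting changes the outcome: the rank-1 vote (weight 1.0) outweighs the two lower-ranked votes (0.5 + 0.333), whereas a simple majority would have chosen the other domain.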
As described above, Embodiment 4 differs from Embodiment 2 in that a model is generated for each N-best rank, so the score distribution at any given rank can be modeled while the growth in the dimensionality of the feature amount is suppressed. Furthermore, integrating the domain determination results of the N-best ranks by majority vote or a similar method prevents the decision from depending only on the top-ranked recognition result.
As described above, the speech recognition apparatus of Embodiment 4 includes: a learning speech recognition unit that calculates, from learning speech data, N-best learning scores, which are values indicating the N-best (N being an integer of 2 or more) speech recognition results; a learning feature amount conversion unit that converts the N-best learning scores into a learning feature amount for each rank; a model learning unit that calculates, for each rank, a domain determination model indicating the relationship between feature amounts and domains, using the per-rank learning feature amounts and learning label data defining which domain's utterance the learning speech data is; a determination speech recognition unit that calculates, from input speech data, N-best determination scores, which are values indicating the N-best speech recognition results; a determination feature amount conversion unit that converts the N-best determination scores into a determination feature amount for each rank; a domain determination unit that collates the per-rank determination feature amounts against the per-rank domain determination models and calculates a domain determination result for each rank; and a domain confirmation unit that calculates, from the per-rank domain determination results, a domain final determination result indicating which domain's utterance the input speech data is. The N-best results can therefore be taken into account in the feature amounts used for domain determination, and an improvement in domain determination accuracy can be expected in addition to the effects of Embodiment 1.
Within the scope of the present invention, the embodiments may be freely combined, any component of any embodiment may be modified, and any component of any embodiment may be omitted.
As described above, the speech recognition device according to the present invention concerns a configuration for determining which domain's utterance an input speech is, and is suitable for application to navigation devices, home appliances, and the like to improve speech recognition performance.
100, 100a, 100b, 100c learning execution unit; 101 learning speech data; 102, 102a learning speech recognition unit; 103, 103a learning score; 104, 104a learning feature amount conversion unit; 104b first learning feature amount conversion unit; 104c second learning feature amount conversion unit; 105 learning label data; 106 model learning unit; 106a first model learning unit; 106b second model learning unit; 107 domain determination model; 107a first domain determination model; 107b second domain determination model; 108 dimension compression matrix estimation unit; 109 dimension compression matrix; 110 learning dimension compression unit; 200, 200a, 200b, 200c determination execution unit; 201 input speech data; 202, 202a determination speech recognition unit; 203, 203a determination score; 204, 204a determination feature amount conversion unit; 204b first determination feature amount conversion unit; 204c second determination feature amount conversion unit; 205 domain determination unit; 205a first domain determination unit; 205b second domain determination unit; 206 domain determination result; 206a first domain determination result; 206b second domain determination result; 207 determination dimension compression unit; 208 domain confirmation unit; 209 domain final determination result.

Claims (5)

  1.  A speech recognition device comprising:
     a learning speech recognition unit that calculates, from learning speech data, a learning score that is a value indicating a speech recognition result;
     a learning feature amount conversion unit that converts the learning score into a learning feature amount;
     a model learning unit that calculates a domain determination model indicating a relationship between feature amounts and domains, using the learning feature amount and learning label data defining which domain's utterance the learning speech data is;
     a determination speech recognition unit that calculates, from input speech data, a determination score that is a value indicating a speech recognition result;
     a determination feature amount conversion unit that converts the determination score into a determination feature amount; and
     a domain determination unit that collates the determination feature amount with the domain determination model and calculates a domain determination result indicating which domain's utterance the input speech data is.
  2.  A speech recognition device comprising:
     a learning speech recognition unit that calculates, from learning speech data, N-best learning scores that are values indicating N-best (N being an integer of 2 or more) speech recognition results;
     a learning feature amount conversion unit that converts the N-best learning scores into a learning feature amount;
     a model learning unit that calculates a domain determination model indicating a relationship between feature amounts and domains, using the learning feature amount and learning label data defining which domain's utterance the learning speech data is;
     a determination speech recognition unit that calculates, from input speech data, N-best determination scores that are values indicating N-best speech recognition results;
     a determination feature amount conversion unit that converts the N-best determination scores into a determination feature amount; and
     a domain determination unit that collates the determination feature amount with the domain determination model and calculates a domain determination result indicating which domain's utterance the input speech data is.
  3.  A speech recognition device comprising:
     a learning speech recognition unit that calculates, from learning speech data, N-best learning scores that are values indicating N-best (N being an integer of 2 or more) speech recognition results;
     a learning feature amount conversion unit that converts the N-best learning scores into a learning feature amount;
     a dimension compression matrix estimation unit that estimates a dimension compression matrix for compressing the dimensions of the learning feature amount, using the learning feature amount and learning label data defining which domain's utterance the learning speech data is;
     a learning dimension compression unit that compresses the dimensions of the learning feature amount using the learning feature amount and the dimension compression matrix;
     a model learning unit that calculates a domain determination model indicating a relationship between feature amounts and domains, using the learning feature amount compressed by the learning dimension compression unit and the learning label data;
     a determination speech recognition unit that calculates, from input speech data, N-best determination scores that are values indicating N-best speech recognition results;
     a determination feature amount conversion unit that converts the N-best determination scores into a determination feature amount;
     a determination dimension compression unit that compresses the dimensions of the determination feature amount using the determination feature amount and the dimension compression matrix; and
     a domain determination unit that collates the determination feature amount compressed by the determination dimension compression unit with the domain determination model and calculates a domain determination result indicating which domain's utterance the input speech data is.
  4.  The speech recognition device according to claim 3, wherein the dimension compression matrix estimation unit receives a feature amount and a teacher label as input and outputs a matrix that converts the dimensions of the feature amount to a lower dimensionality.
  5.  A speech recognition device comprising:
     a learning speech recognition unit that calculates, from learning speech data, N-best learning scores that are values indicating N-best (N being an integer of 2 or more) speech recognition results;
     a learning feature amount conversion unit that converts the N-best learning scores into a learning feature amount for each rank;
     a model learning unit that calculates, for each rank, a domain determination model indicating a relationship between feature amounts and domains, using the per-rank learning feature amounts and learning label data defining which domain's utterance the learning speech data is;
     a determination speech recognition unit that calculates, from input speech data, N-best determination scores that are values indicating N-best speech recognition results;
     a determination feature amount conversion unit that converts the N-best determination scores into a determination feature amount for each rank;
     a domain determination unit that collates the per-rank determination feature amounts with the per-rank domain determination models and calculates a domain determination result for each rank; and
     a domain confirmation unit that calculates, using the per-rank domain determination results, a domain final determination result indicating which domain's utterance the input speech data is.
PCT/JP2017/001551 2017-01-18 2017-01-18 Speech recognition device WO2018134916A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2017/001551 WO2018134916A1 (en) 2017-01-18 2017-01-18 Speech recognition device
JP2018562783A JP6532619B2 (en) 2017-01-18 2017-01-18 Voice recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2017/001551 WO2018134916A1 (en) 2017-01-18 2017-01-18 Speech recognition device

Publications (1)

Publication Number Publication Date
WO2018134916A1 true WO2018134916A1 (en) 2018-07-26

Family

ID=62907889

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/001551 WO2018134916A1 (en) 2017-01-18 2017-01-18 Speech recognition device

Country Status (2)

Country Link
JP (1) JP6532619B2 (en)
WO (1) WO2018134916A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113016029A (en) * 2018-11-02 2021-06-22 株式会社赛斯特安国际 Method and apparatus for providing context-based speech recognition service
WO2022177165A1 (en) * 2021-02-19 2022-08-25 삼성전자주식회사 Electronic device and method for analyzing speech recognition result

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012022069A (en) * 2010-07-13 2012-02-02 Nippon Telegr & Teleph Corp <Ntt> Speech recognition method, and device and program for the same
JP2012047924A (en) * 2010-08-26 2012-03-08 Sony Corp Information processing device and information processing method, and program
JP2013167666A (en) * 2012-02-14 2013-08-29 Nec Corp Speech recognition device, speech recognition method, and program

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3265701B2 (en) * 1993-04-20 2002-03-18 富士通株式会社 Pattern recognition device using multi-determiner
WO2008096582A1 (en) * 2007-02-06 2008-08-14 Nec Corporation Recognizer weight learning device, speech recognizing device, and system
JP6003492B2 (en) * 2012-10-01 2016-10-05 富士ゼロックス株式会社 Character recognition device and program
JP6188831B2 (en) * 2014-02-06 2017-08-30 三菱電機株式会社 Voice search apparatus and voice search method



Also Published As

Publication number Publication date
JP6532619B2 (en) 2019-06-19
JPWO2018134916A1 (en) 2019-04-11

Similar Documents

Publication Publication Date Title
US20190325859A1 (en) System and methods for adapting neural network acoustic models
US8566093B2 (en) Intersession variability compensation for automatic extraction of information from voice
US8271283B2 (en) Method and apparatus for recognizing speech by measuring confidence levels of respective frames
Thakur et al. Speech recognition using euclidean distance
JP2006510933A (en) Sensor-based speech recognition device selection, adaptation, and combination
CN109801646B (en) Voice endpoint detection method and device based on fusion features
US9378735B1 (en) Estimating speaker-specific affine transforms for neural network based speech recognition systems
Gill et al. Vector quantization based speaker identification
KR100574769B1 (en) Speaker and environment adaptation based on eigenvoices imcluding maximum likelihood method
WO2018134916A1 (en) Speech recognition device
JP6845489B2 (en) Speech processor, speech processing method, and speech processing program
CN111667839A (en) Registration method and apparatus, speaker recognition method and apparatus
JP2009086581A (en) Apparatus and program for creating speaker model of speech recognition
JP2020060757A (en) Speaker recognition device, speaker recognition method, and program
JP4652232B2 (en) Method and system for analysis of speech signals for compressed representation of speakers
JP2012108429A (en) Voice selection device, utterance selection device, voice selection system, method for selecting voice, and voice selection program
JP6791816B2 (en) Voice section detection device, voice section detection method, and program
KR101041035B1 (en) Method and Apparatus for rapid speaker recognition and registration thereof
JP6114210B2 (en) Speech recognition apparatus, feature quantity conversion matrix generation apparatus, speech recognition method, feature quantity conversion matrix generation method, and program
Hossan et al. Speaker recognition utilizing distributed DCT-II based Mel frequency cepstral coefficients and fuzzy vector quantization
CN111798844A (en) Artificial intelligent speaker customized personalized service system based on voiceprint recognition
JP6054004B1 (en) Voice recognition device
Yu et al. Speaker recognition models.
CN109872725B (en) Multi-view vector processing method and device
Guanyu et al. Design and implementation of a high-performance client/server voiceprint recognition system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17893040

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2018562783

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17893040

Country of ref document: EP

Kind code of ref document: A1