WO2010050414A1 - Model adaptation device, method thereof, and program thereof - Google Patents
Model adaptation device, method thereof, and program thereof
- Publication number
- WO2010050414A1 (PCT/JP2009/068263)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- model
- adaptation
- phoneme
- sentence
- distance
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
Definitions
- the present invention relates to a model adaptation apparatus, method and program for adapting an acoustic model to a target person such as a speaker in order to improve recognition accuracy in speech recognition or the like.
- a model adaptation technique is known that aims to improve recognition accuracy by adapting an acoustic model in speech recognition to a speaker or the like.
- for example, Patent Document 1 and FIG. 1 disclose a sentence list prepared to efficiently secure the minimum learning amount for each phoneme unit possessed by the acoustic model.
- an original text database containing a sufficient amount of phonemes, together with variations in phoneme environment and other factors, is provided, and a number list is generated by counting occurrences of each phoneme in the original text database.
- a sorted list is generated by rearranging the phonemes of the number list in order of count, and all sentences containing the least-frequent phoneme in the sorted list are gathered into a minimum-count phoneme sentence list.
- for each sentence in that list, the learning efficiency score and the learning variation efficiency of the phoneme model are calculated, and an efficiency calculation sentence list is generated.
- the sentences supplied from the efficiency calculation sentence list are rearranged in order of learning efficiency score, and a rearranged sentence list is generated in which the sentences are ordered by learning variation efficiency.
- sentences are selected in order from the top of the rearranged sentence list until the reference learning data number a, which is the number of speech data items required for each phoneme, is reached.
- a selected sentence list is generated from the selected sentences, and the phonemes included in the selected sentence list are counted to generate a selected-sentence phoneme number list. For any phoneme whose count in the selected-sentence phoneme number list has not reached the reference learning data number a, a sentence list of phonemes below the reference learning data number, containing that phoneme, is generated.
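The greedy selection described above can be sketched roughly as follows. This is an illustrative reading of Patent Document 1's procedure, not its exact algorithm; the sentence representation, the `phonemes_of` function, and the reference count `a` are our assumptions (here a sentence's phonemes are simply its characters):

```python
from collections import Counter

def select_sentences(sentences, phonemes_of, a):
    """Greedy sketch: pick sentences, starting from the rarest
    phoneme, until every phoneme has at least `a` occurrences
    (`a` plays the role of the reference learning data number)."""
    counts = Counter()
    for s in sentences:
        counts.update(phonemes_of(s))
    # phonemes ordered from least to most frequent (the "sorted list")
    order = sorted(counts, key=counts.get)
    selected, have = [], Counter()
    for p in order:
        for s in sentences:
            if have[p] >= a:
                break
            if p in phonemes_of(s) and s not in selected:
                selected.append(s)
                have.update(phonemes_of(s))
    return selected
```

In this sketch, phonemes already covered by earlier selections are skipped, which mirrors why the procedure sorts phonemes by rarity first.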
- Patent Document 2 discloses an invention in which speaker clustering is performed for each group of phonemes, and an appropriate speaker cluster of phonemes is created and selected to perform more precise model adaptation.
- Patent Document 3 discloses an invention relating to a method and apparatus that allows a user to perform a search by keyword speech against a multimedia database including speech.
- Patent Document 4 discloses an invention related to phoneme model adaptation by phoneme model clustering.
- Patent Document 5 discloses an invention relating to a writer identification method and a writer identification device that can determine that handwriting belongs to the same author even if the stroke order used when writing a character for registration in the dictionary differs from the stroke order used when writing the character at identification time.
- the invention of Patent Document 1 has a problem in that the reference learning data number a, which is the necessary minimum learning amount, must be given manually in advance, so it is difficult to set it appropriately for each speaker. That is, since the relationship between the speaker to be adapted and the model is not considered, the amount of learning may be excessive or insufficient for particular phonemes depending on the speaker.
- the invention disclosed in Patent Document 5 creates a dictionary that identifies each user by adding the features of each user's distinctive handwriting to a standard dictionary. However, this writer identification method, which allows a dictionary to be created for each user from a single handwritten input, has the problem that accurate model adaptation is difficult for voice identification that takes the user's utterance as input.
- the present invention has been made in view of the above, and an object of the present invention is to provide a model adaptation apparatus, a method thereof, and a program thereof capable of performing efficient model adaptation.
- a model adaptation device according to the present invention is a model adaptation device that adapts a model to an input feature quantity, which is input data, by approximating the model to the features of the input feature quantity, and comprises: model adaptation means for performing model adaptation corresponding to each label from the input feature quantity and a first teacher label string representing its content, and outputting adaptation feature information for the model adaptation; distance calculation means for calculating, for each label, an inter-model distance between the adaptation feature information and the model; detection means for detecting labels whose inter-model distance exceeds a predetermined threshold; and label generation means for generating, when one or more labels are obtained as the output of the detection means, a second teacher label string including at least one of the detected labels.
- a model adaptation method according to the present invention is a model adaptation method that adapts a model to an input feature quantity, which is input data, by approximating the model to the features of the input feature quantity, and includes: a model adaptation procedure that performs model adaptation corresponding to each label from the input feature quantity and a first teacher label string representing its content, and outputs adaptation feature information for the model adaptation; a distance calculation procedure that calculates, for each label, an inter-model distance between the adaptation feature information and the model; a detection procedure that detects labels whose inter-model distance exceeds a predetermined threshold; and a label generation procedure that generates, when one or more labels are obtained as the output of the detection procedure, a second teacher label string including at least one of the detected labels.
- a model adaptation program according to the present invention is a model adaptation program that adapts a model to an input feature quantity, which is input data, by approximating the model to the features of the input feature quantity, and causes a computer to execute: model adaptation processing that performs model adaptation corresponding to each label from the input feature quantity and a first teacher label string representing its content, and outputs adaptation feature information for the model adaptation; distance calculation processing that calculates, for each label, an inter-model distance between the adaptation feature information and the model; detection processing that detects labels whose inter-model distance exceeds a predetermined threshold; and label generation processing that generates, when one or more labels are obtained as the output of the detection processing, a second teacher label string including at least one of the detected labels.
- according to the present invention, the model adaptation means performs model adaptation and outputs adaptation feature information, the distance calculation means calculates the inter-model distance between the adaptation feature information and the model for each label, and the label generation means generates a second teacher label string including labels whose inter-model distance exceeds the threshold, so that a model adaptation device, a method thereof, and a program thereof capable of performing model adaptation efficiently can be provided.
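Read as data flow, one round of the claimed device can be sketched as below. This is a minimal illustration in which labels are phonemes, the model and the adaptation feature information are reduced to per-label scalar means, and `sentence_db` stands in for a text database; all names and simplifications are ours, not the patent's:

```python
def model_adaptation_round(adapt_info, model, threshold, sentence_db):
    """One round of the claimed flow: distance calculation,
    detection, and generation of a second teacher label string."""
    # distance calculation means: per-label inter-model distance;
    # labels absent from the adaptation info get distance zero
    dist = {lab: abs(adapt_info.get(lab, model[lab]) - model[lab])
            for lab in model}
    # detection means: labels whose distance exceeds the threshold
    detected = [lab for lab, d in dist.items() if d > threshold]
    # label generation means: when labels were detected, the new
    # teacher label string is built from sentences containing them
    if not detected:
        return []  # empty set: the model is already close enough
    return [s for s in sentence_db if any(lab in s for lab in detected)]
```

An empty return value corresponds to the case where no label exceeds the threshold, which is what lets the surrounding system decide that re-adaptation is unnecessary.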
- 10 Model adaptation apparatus
- 11 Input means
- 12 Text database
- 13 Sentence list
- 14 Model adaptation means
- 15 Model
- 16 Distance calculation means
- 17 Phoneme detection means
- 18 Label generation means
- 19 Statistics database
- 20 Output means
- 100 Speaker adaptation system
- 10b Model adaptation part
- 110 Input means
- 120 Text database
- 130 Sentence list
- 150 Acoustic model
- 200 Sentence presentation means
- 210 Determination means
- 220 Model update means
- 230 Output means
- 10c Model adaptation device
- 17b Phoneme detection means
- 30 Class database
- 100b Language adaptation system
- 10d Model adaptation unit
- FIG. 2 is a diagram showing an overall configuration of the model adaptation apparatus according to the first embodiment of the present invention.
- the model adaptation apparatus 10 in FIG. 2 uses the input speech and the sentence list of the utterance content to approximate the target acoustic model to the characteristics of the input speech, thereby adapting this acoustic model to the speaker of the input speech.
- the model adaptation apparatus 10 is a general-purpose computer system, and includes, as components not shown, a CPU (Central Processing Unit), a RAM (Random Access Memory), a ROM (Read Only Memory), and a non-volatile storage device.
- the CPU reads an OS (Operating System) and a model adaptation program stored in a RAM, a ROM, or a nonvolatile storage device, and executes a model adaptation process.
- the model adaptation apparatus 10 does not have to be a single computer system, and may be configured by a plurality of computer systems.
- the model adaptation apparatus 10 of the present invention includes a model adaptation unit 14, a distance calculation unit 16, a phoneme detection unit 17, a label generation unit 18, and a statistic database 19.
- the input unit 11 inputs an input voice or a feature amount series obtained by acoustic analysis of the input voice.
- the sentence list 13 is a sentence set having a plurality of sentences describing the content to be uttered by the speaker, that is, the content of the input speech, and is selected in advance from the text database 12 storing a plurality of sentences having predetermined phonemes.
- the predetermined phoneme in the text database 12 refers to a predetermined sufficient amount of phonemes that enables speech recognition.
- the model 15 is an acoustic model used for speech recognition, for example, and is, for example, an HMM (Hidden Markov Model) having a feature amount series representing features of each phoneme.
- the model adaptation unit 14 performs model adaptation corresponding to each phoneme, treating each phoneme as a label, using the speech input by the input unit 11 as the input feature quantity and the sentence list 13 of the utterance content as the first teacher label string, so that the model 15 approaches the input speech, and outputs adaptation feature information to the statistics database 19.
- the feature information for adaptation is a sufficient statistic for approximating the model 15 to the input speech.
- the distance calculation means 16 acquires the adaptation feature information output by the model adaptation means 14 from the statistics database 19, calculates the distance between the adaptation feature information and the original model 15 as an acoustic inter-model distance for each phoneme, and outputs a distance value for each phoneme. At this time, phonemes that did not appear in the sentence list 13 may not exist in the adaptation feature information; in that case, the distance value may be set to zero.
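For instance, if the adaptation feature information for each phoneme is reduced to a mean vector, the per-phoneme distance computation might be sketched as follows. The Euclidean distance is our choice for illustration; the patent does not fix a specific distance measure:

```python
import math

def phoneme_distances(adapt_stats, model_means):
    """Distance value per phoneme between adaptation statistics and
    the original model; phonemes that did not appear in the sentence
    list are given a distance of zero, as described above."""
    out = {}
    for ph, mean in model_means.items():
        if ph not in adapt_stats:
            out[ph] = 0.0  # phoneme absent from the sentence list
        else:
            out[ph] = math.sqrt(sum((a - m) ** 2
                                    for a, m in zip(adapt_stats[ph], mean)))
    return out
```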
- if any phoneme distance value output by the distance calculation means 16 exceeds a predetermined threshold, the phoneme detection means 17 outputs that phoneme as a detection result.
- for label generation, for example, an arbitrary sentence composed of the detected phonemes may be automatically generated, or a sentence including the detected phonemes may be selected from the text database 12.
- when no phoneme is detected, label generation is not performed; that is, for example, an empty set is output as the generation result.
- the one or more sentences generated by the label generation means 18 are output from the model adaptation device 10 and are used to perform model adaptation again as a new sentence list.
- the text database 12 may use an external database connected to a network such as the Internet.
- the text database 12, the sentence list 13, the model 15, and the statistics database 19 may each be a non-volatile storage device such as a hard disk drive, a magneto-optical disk drive, or a flash memory, or a volatile storage device such as a DRAM (Dynamic Random Access Memory).
- the text database 12, the sentence list 13, the model 15, and the statistics database 19 may be storage devices externally attached to the model adaptation device 10.
- the model adaptation apparatus 10 inputs a voice (S100). Specifically, a speech waveform input from a microphone or a feature amount series obtained by acoustic analysis of the speech waveform is obtained as an input.
- the model adaptation apparatus 10 adapts the target model 15 to be close to the input voice by using the input voice and the sentence list 13 of the utterance content (S101). Specifically, the model adaptation unit 14 of the model adaptation apparatus 10 performs model adaptation on the model 15 from the feature quantity series of the input speech obtained in step S100 and the sentence list 13 representing its content, and outputs, for example, a sufficient statistic as adaptation feature information to the statistics database 19.
- the sentence list 13 may be a teacher label in which the utterance content is described in monophones, and the model adaptation means 14 performs supervised model adaptation; for example, for phoneme /s/, the movement vector F(s) = (s1, s2, ..., sn) and the number of adaptation samples (number of frames) are obtained as adaptation feature information.
- model adaptation using a feature quantity series in this way is well known as an existing technique, and detailed description thereof is therefore omitted here.
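As a hedged illustration of such adaptation feature information, the movement vector and the adaptation sample count for one phoneme could be computed like this. The frame-to-phoneme alignment is assumed to be given, and this is a simplification, not the patent's specified computation:

```python
def adaptation_statistics(frames, model_mean):
    """Sketch: the movement vector is the difference between the mean
    of the frames aligned to a phoneme and the model's mean for that
    phoneme; the sample count is simply the number of frames."""
    n = len(frames)
    dim = len(model_mean)
    frame_mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    movement = [fm - mm for fm, mm in zip(frame_mean, model_mean)]
    return movement, n
```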
- the model adaptation apparatus 10 detects phonemes having a large difference between the input speech and the model 15 (S103). Specifically, the phoneme detection unit 17 of the model adaptation apparatus 10 outputs, as a detection result, any phoneme whose distance value, obtained in step S102 as the output of the distance calculation unit 16, exceeds a predetermined threshold.
- the phoneme detection target is not limited to the phoneme / a / or the phoneme / s /, and all phonemes included in the sentence list 13 may be the detection target or may be partially the detection target.
- the threshold value Dthre may be the same value for all phonemes, or a different threshold value may be used for each phoneme.
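The detection step with either a shared or a per-phoneme threshold Dthre might look like this; the dict-or-scalar interface is our convenience, not the patent's:

```python
def detect_phonemes(distances, thresholds):
    """Phoneme detection: `thresholds` is either one number applied
    to every phoneme or a per-phoneme dict (missing entries are
    treated as never exceeded)."""
    def thr(ph):
        if isinstance(thresholds, dict):
            return thresholds.get(ph, float("inf"))
        return thresholds
    return [ph for ph, d in distances.items() if d > thr(ph)]
```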
- the model adaptation apparatus 10 generates a sentence for model adaptation again (S104).
- specifically, the label generation unit 18 of the model adaptation apparatus 10 generates one or more sentences including the detected phonemes for the phonemes related to the detection result obtained from the phoneme detection unit 17 in step S103.
- for example, a sentence including the detected phonemes is searched for in the text database 12, and the sentences extracted by this search are output (S105).
- for example, sentences including phoneme /a/ and phoneme /e/ are searched for in the text database 12, and if one or more exist, they are output.
- if no phoneme is detected in step S103, the process may be terminated without performing label generation in step S104, or may be terminated after outputting a message indicating that there is no label generation result.
- in the present embodiment, a monophone representing a single phoneme is used as the model, but the same applies when a phone-environment-dependent diphone or triphone model is used.
- as described above, the model adaptation apparatus 10 performs model adaptation on the model 15 to be adapted using the input speech and the first sentence list 13, detects phonemes whose distance is large, and generates a new sentence list including the detected phonemes.
- therefore, even for the same sentence list, the obtained sentences may differ if the adaptation target model differs. That is, even when speakers and models differ, it is possible to adapt a model efficiently by generating a more suitable sentence list.
- FIG. 4 is a diagram illustrating the overall configuration of the speaker adaptation system according to the present embodiment.
- the speaker adaptation system 100 shown in FIG. 4 includes an input unit 110, a model adaptation unit 10b, a text database 120, a sentence list 130, an acoustic model 150, a sentence presentation unit 200, a determination unit 210, a model update unit 220, and an output unit 230.
- the speaker adaptation system 100 is a general-purpose computer system, and includes a CPU, a RAM, a ROM, and a non-volatile storage device as components not shown.
- the CPU reads the OS and the speaker adaptation program stored in the RAM, ROM, or nonvolatile storage device, and executes speaker adaptation processing. As a result, it is possible to adapt the target model to be close to the characteristics of the input speech.
- the speaker adaptation system 100 does not have to be a single computer system, and may be configured by a plurality of computer systems.
- the input means 110 is an input device such as a microphone, and may include an A / D conversion means or an acoustic analysis means as a configuration not shown.
- the text database 120 is a set of sentences including a sufficient amount of phonemes and environment and other variations in phonemes.
- the sentence list 130 is a teacher label used for speaker adaptation processing, and is a set of sentences composed of one or more sentences extracted from the text database 120.
- the acoustic model 150 is, for example, an HMM (Hidden Markov Model) having a feature amount series representing features of each phoneme.
- the sentence presentation means 200 presents a teacher label, that is, a sentence list to be uttered, to the speaker in order to perform speaker adaptation.
- the model adaptation unit 10b corresponds to the model adaptation device 10 of FIG. 2. Therefore, in the following, the differences from FIG. 2 will be mainly described, and description of the configuration corresponding to FIG. 2 and having the same function will be omitted.
- when at least one phoneme is detected by the phoneme detection unit 17, the label generation unit 18 generates one or more sentences including the detected phonemes in order to perform model adaptation again, and notifies the determination means 210. If there is no detected phoneme, the determination means 210 is notified of this fact.
- the determination unit 210 receives the output of the label generation unit 18 and, when a sentence is generated, sets the sentence as a new adaptive sentence list. When the sentence is not generated, the model update unit 220 is notified of that.
- when the model update unit 220 receives notification from the determination unit 210 that no sentence has been generated, it applies the adaptation feature information received from the statistics database 19 to the acoustic model 150 to obtain a post-adaptation acoustic model.
- the output means 230 outputs the post-adaptation acoustic model obtained by the model update means 220.
- since the technique for model update in speaker adaptation is well known as an existing technique, detailed description thereof is omitted.
- the text database 120 may use an external database connected to a network such as the Internet.
- the text database 120, the sentence list 130, the model 150, and the statistics database 19 may each be a non-volatile storage device such as a hard disk drive, a magneto-optical disk drive, or a flash memory, or a volatile storage device such as a DRAM. Further, the text database 120, the sentence list 130, the model 150, and the statistics database 19 may be storage devices externally attached to the speaker adaptation system 100.
- the speaker adaptation system 100 inputs a voice (S200). Specifically, the speaker adaptation system 100 obtains, as input, a speech waveform input from a microphone via the input unit 110, or a feature quantity series obtained by acoustic analysis thereof.
- model adaptation processing as shown in FIG. 3 is performed by the model adaptation unit 14, the distance calculation unit 16, the phoneme detection unit 17, and the label generation unit 18 in the model adaptation unit 10b of the speaker adaptation system 100 (S201).
- the speaker adaptation system 100 determines whether a sentence has been output in the model adaptation process (S202). Specifically, if the determination unit 210 of the speaker adaptation system 100 outputs a sentence as a result of the model adaptation process in step S201, the output sentence is set as a new sentence list.
- the new sentence list is presented again to the speaker by the speaker adaptation system 100 (S203).
- specifically, the sentence presentation unit 200 of the speaker adaptation system 100 presents the new sentence list to the speaker as a teacher label for speaker adaptation, accepts a new voice input, and repeats the process from the voice input in step S200.
- thereafter, the model adaptation means 14 performs model adaptation again using the new sentence list and the speech input based on it, outputs adaptation feature information again, and stores it in the statistics database 19; the distance calculation means 16 acquires the adaptation feature information again from the statistics database 19 and recalculates the distance between the adaptation feature information and the acoustic model for each phoneme; the phoneme detection means 17, if any of the recalculated distance values exceeds the predetermined threshold, outputs the corresponding phoneme as a detection result again; and the label generation unit 18 searches the text database 120 again for sentences including the phonemes related to the detection result, and outputs the sentences extracted by this search.
- when no sentence is output, the determination unit 210 notifies the model update unit 220 to that effect.
- the speaker adaptation system 100 executes a model update process when a sentence is not generated as a result of the determination process in step S202 (S204). Specifically, the model update unit 220 of the speaker adaptation system 100 applies the adaptation feature information received from the statistics database 19 to the acoustic model 150 to obtain an after-adaptation acoustic model. Thereafter, the output unit 230 outputs the obtained post-adaptation acoustic model as a speaker adaptive acoustic model (S205).
- as described above, speaker adaptation that gives priority to phonemes with a large distance is performed on the acoustic model that the speaker wants to adapt, so that efficient speaker adaptation can be realized.
- in addition, once the model has been adapted sufficiently, subsequent adaptation processing can be avoided. That is, since the adaptation process can be stopped when it is determined that the acoustic model is sufficiently close, a criterion for stopping speaker adaptation can be provided.
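The loop described above, with the absence of a generated sentence list doubling as the stopping criterion, can be sketched as follows; the function names and the `max_rounds` safety cap are hypothetical:

```python
def speaker_adaptation(speak, sentence_list, model, adapt_round,
                       max_rounds=10):
    """Sketch of the overall speaker adaptation loop: present the
    sentence list, adapt, and repeat until no new sentence list is
    generated -- the stopping criterion described in the text."""
    for _ in range(max_rounds):
        speech = speak(sentence_list)          # S200 / S203: utterance
        model, new_list = adapt_round(speech, sentence_list, model)  # S201
        if not new_list:                       # S202: nothing generated
            break                              # -> model is close enough
        sentence_list = new_list
    return model                               # S204 / S205: final model
```

Here `adapt_round` is assumed to return both the updated model and the (possibly empty) new sentence list produced by label generation.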
- in the present embodiment, a sufficient statistic is used as the adaptation feature information and the distance between the adaptation feature information and the original model is calculated; however, the same applies when a model after adaptation is used as the adaptation feature information and the distance between the adapted model and the original model is calculated. In that case, it is only necessary to calculate the distance between the two models, and since techniques for calculating the distance between models are well known, description thereof is omitted here.
- the present embodiment uses a class database to increase the efficiency of speaker adaptation even with a small sentence list.
- the class database is a database constructed in advance from a large amount of speech data; for example, it is obtained by executing the model adaptation processing according to the first embodiment for a plurality of speakers and classifying the distance calculation results for each phoneme.
- for example, when the distance value of phoneme /t/ is large, phonemes belonging to the same class as phoneme /t/ can also be targeted for label generation, even if they did not appear in the original sentence list (teacher label).
- FIG. 6 is a diagram showing an overall configuration of the model adaptation apparatus according to the second embodiment.
- the model adaptation apparatus 10c of FIG. 6 uses the input speech and the sentence list of the utterance content to adapt the target model so as to be close to the features of the input speech.
- the model adaptation apparatus 10c of the present invention is a general-purpose computer system, and includes a CPU, a RAM, a ROM, and a nonvolatile storage device as components not shown.
- the CPU reads the OS and the model adaptation program stored in the RAM, ROM, or nonvolatile storage device, and executes model adaptation processing.
- the model adaptation apparatus 10c does not have to be a single computer system, and may be configured by a plurality of computer systems.
- the model adaptation apparatus 10c of the present invention includes a model adaptation unit 14, a distance calculation unit 16, a phoneme detection unit 17b, a label generation unit 18, a statistics database 19, and a class database 30.
- the model adaptation unit 14, the distance calculation unit 16, the label generation unit 18, and the statistic database 19 are the same as those in FIG. Only the differences from FIG. 2 will be described below.
- if any phoneme distance value output by the distance calculation unit 16 exceeds a predetermined threshold, the phoneme detection unit 17b outputs that phoneme as a detection result.
- at the same time, the class database 30 is referred to, and for phonemes or phoneme combinations exceeding the threshold, phonemes belonging to the same class are also output as detection results.
- the class database 30 is a database having information that classifies phonemes or combinations of phonemes. For example, suppose the phoneme /p/, phoneme /b/, phoneme /t/, and phoneme /d/ belong to the same class; when two or more of these are obtained as detection results, the remaining phonemes are also output as detection results. Alternatively, a rule may be described in which, for a predetermined combination of phonemes, another predetermined phoneme is also detected.
- the class database 30 may be a non-volatile storage device such as a hard disk drive, a magneto-optical disk drive, or a flash memory, or may be a volatile storage device such as a DRAM.
- the class database 30 may be a storage device externally attached to the model adaptation device 10c.
- in step S103, the model adaptation apparatus 10c detects phonemes having a large difference between the input speech and the model 15. Specifically, the phoneme detection unit 17b of the model adaptation apparatus 10c outputs, as a detection result, any phoneme whose distance value, obtained in step S102 as the output of the distance calculation unit 16, exceeds a predetermined threshold. At the same time, the class database 30 is referred to, and for phonemes or phoneme combinations exceeding the threshold, phonemes belonging to the same class are also output as detection results.
- for example, since the phoneme /p/, phoneme /b/, phoneme /t/, and phoneme /d/ belong to the same class in the class database 30, when phoneme /p/ and phoneme /d/ are detected, phoneme /t/ and phoneme /b/ are also detected by referring to the class database 30.
- the threshold value Dthre may be the same value for all phonemes, a different threshold value may be used for each phoneme, or a different threshold value may be used for each class existing in the class database 30.
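The class rule above, in which two or more detected members pull in the rest of their class, can be sketched as follows; this is a simplification, since the patent also allows arbitrary combination rules:

```python
def expand_with_classes(detected, phoneme_classes):
    """When two or more phonemes of a class appear in the detection
    result, the remaining members of that class are added as well."""
    result = set(detected)
    for cls in phoneme_classes:
        if len(result & set(cls)) >= 2:
            result |= set(cls)
    return result
```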
- as described above, when performing model adaptation on the model 15 to be adapted using the input speech and the first sentence list 13, the model adaptation apparatus 10c can use the class database 30 to also detect phonemes that did not exist in the sentence list 13. That is, even when the sentence list 13 is small, it is possible to adapt the model efficiently by generating a suitable sentence list.
- FIG. 7 is a diagram illustrating the overall configuration of the language adaptation system according to the present embodiment.
- the language adaptation system 100b shown in FIG. 7 includes an input unit 110, a model adaptation unit 10d, a text database 120, a sentence list 130, an acoustic model 150, a sentence presentation unit 200, a determination unit 210, and a model update unit. 220 and output means 230.
- the language adaptation system 100b is a general-purpose computer system, and includes a CPU, a RAM, a ROM, and a non-volatile storage device as components not shown.
- the CPU reads the OS and the language adaptation program stored in the RAM, ROM, or nonvolatile storage device, and executes language adaptation processing.
- the language adaptation system 100b does not need to be a single computer system, and may be configured by a plurality of computer systems.
- the input means 110, the text database 120, the sentence list 130, the acoustic model 150, the sentence presentation means 200, the determination means 210, the model update means 220, and the output means 230 are the same as those in FIG. 4, and description thereof is omitted. Only the differences from FIG. 4 will be described below.
- the model adaptation unit 10d replaces the model adaptation unit 10b of FIG. 4 and corresponds to the model adaptation device 10c of FIG. 6. Therefore, in the following, the differences from FIG. 6 will be mainly described, and description of the configuration corresponding to FIG. 6 and having the same function will be omitted.
- when the phoneme detection unit 17b has detected at least one phoneme, the label generation unit 18b generates one or more sentences containing the detected phonemes so that model adaptation can be performed again. When no phoneme has been detected, the label generation unit 18b notifies the determination means 210 of this fact.
- the determination unit 210 receives the output of the label generation unit 18b; when sentences have been generated, it sets them as the new adaptation sentence list, and when no sentence has been generated, it notifies the model update unit 220 to that effect.
- the text database 120 may be an external database accessed over a network such as the Internet.
- the text database 120, the sentence list 130, the model 150, the statistics database 19, and the class database 30 may be non-volatile storage devices such as hard disk drives, magneto-optical disk drives, or flash memories, or may be volatile storage devices such as DRAM.
- the text database 120, the sentence list 130, the model 150, the statistics database 19, and the class database 30 may be storage devices externally attached to the language adaptation system 100b.
- the language adaptation system 100b executes the model adaptation process in step S201. Specifically, the model adaptation process shown in FIG. 3 is performed by the model adaptation means 14, the distance calculation means 16, the phoneme detection means 17b, and the label generation means 18b in the model adaptation unit 10d of the language adaptation system 100b.
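The model adaptation process of FIG. 3 — adapt with the current sentence list, compute a per-phoneme distance between the adaptation statistics and the model, detect phonemes whose distance exceeds the threshold, and fetch sentences for a new list — could be sketched as below. This is a simplified illustration under stated assumptions: the Euclidean distance over mean vectors and the dictionary data structures are stand-ins for the acoustic distance and sufficient statistics of the embodiment, and the function names are chosen here.

```python
import math

def phoneme_distance(adapted_mean, model_mean):
    # Simplified acoustic distance: Euclidean distance between mean vectors
    # (a stand-in for the distance over sufficient statistics).
    return math.sqrt(sum((a - m) ** 2 for a, m in zip(adapted_mean, model_mean)))

def detect_phonemes(adapted_stats, model_means, threshold):
    # Return the phonemes whose adaptation-to-model distance exceeds the
    # threshold (the role of the phoneme detection means).
    return [p for p, mean in adapted_stats.items()
            if phoneme_distance(mean, model_means[p]) > threshold]

def generate_sentences(detected, text_database):
    # Search the text database for sentences that contain any detected
    # phoneme (the role of the label generation means).
    return [s for s, phonemes in text_database.items()
            if any(p in phonemes for p in detected)]
```

In an iterative run, the sentences returned by `generate_sentences` would become the new sentence list, and the loop would repeat until `detect_phonemes` returns an empty list.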
- suppose, for example, that in the data of Kansai-dialect Japanese speakers extracted from a speaker group consisting of a plurality of speakers, the phoneme /i:/ (where ':' is the long-vowel symbol), the phoneme /u:/, and the phoneme /e:/ belong to the same class.
- in that case, when the distance value of the phoneme /i:/ exceeds the threshold, the phoneme detection means 17b refers to the class database and also detects the phonemes /u:/ and /e:/ belonging to the same class, and the label generation means 18b generates sentences including the phonemes /i:/, /u:/, and /e:/.
- in this way, phoneme classes whose distance from the model is large for the language to which the speaker is to be adapted, for example phonemes characteristic of Japanese speakers speaking in the Kansai dialect, are emphasized during adaptation. Therefore, efficient language adaptation can be realized even when the first sentence list is small.
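The class-based expansion in this example — a phoneme that exceeds the threshold drags its whole class into the detection result — could be sketched as follows. The class assignment mirrors the /i:/, /u:/, /e:/ example above; the function itself and the class name are illustrative assumptions, not taken from the embodiment.

```python
def expand_by_class(detected, class_db):
    # class_db maps a class name to the set of phonemes it contains.
    # Any class that contains a detected phoneme contributes all of its
    # members to the result, as the phoneme detection means 17b does.
    expanded = set(detected)
    for members in class_db.values():
        if expanded & members:
            expanded |= members
    return expanded

# Class database for the Kansai-dialect example: the long vowels
# /i:/, /u:/ and /e:/ belong to one class (class name assumed here).
kansai_class_db = {"kansai_long_vowels": {"i:", "u:", "e:"}}
```

With this database, detecting only /i:/ yields all three long vowels, so the generated sentences cover the whole class at once.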
- in the above, a dialect was described as an example of language adaptation, in which the acoustic model is adapted to a language.
- the same applies to adaptation across language differences, that is, between Japanese and English, or to English spoken by a Japanese speaker.
- the same also applies to speaker adaptation, which adapts the model to a specific speaker within the same language or dialect.
- the adapted acoustic model obtained by the present invention can be expected to provide high recognition accuracy when used for speech recognition; similarly, high verification accuracy can be expected when it is used for speaker verification.
- the present invention is applicable to such a situation.
- the model adaptation apparatus and method described above can be realized by hardware, software, or a combination thereof.
- the model adaptation apparatus described above can be realized by hardware, but it can also be realized by a computer reading, from a recording medium, a program that causes the computer to function as that system, and executing the program.
- the model adaptation method can likewise be realized by hardware, but it can also be realized by a computer reading, from a computer-readable recording medium, a program that causes the computer to execute the method, and executing the program.
- any hardware can be used as long as the functions of the respective means described above can be realized.
- the hardware may be configured individually for each function of the means described above, or may be configured to integrate the functions of several means.
- the present invention is applicable to uses such as speech input and authentication services using speech recognition and speaker verification technology.
Description
11 Input means
12 Text database
13 Sentence list
14 Model adaptation means
15 Model
16 Distance calculation means
17 Phoneme detection means
18 Label generation means
19 Statistics database
20 Output means
100 Speaker adaptation system
10b Model adaptation unit
110 Input means
120 Text database
130 Sentence list
150 Acoustic model
200 Sentence presentation means
210 Determination means
220 Model update means
230 Output means
10c Model adaptation device
17b Phoneme detection means
30 Class database
100b Language adaptation system
10d Model adaptation unit
FIG. 2 is a diagram showing the overall configuration of the model adaptation apparatus according to the first embodiment of the present invention. The model adaptation apparatus 10 of FIG. 2 adapts a target acoustic model to the speaker of input speech by approximating the acoustic model to the features of the input speech, using the input speech and a sentence list of the utterance contents.
Next, the model adaptation processing according to this embodiment is described with reference to the flowchart of FIG. 3. First, the model adaptation apparatus 10 receives speech as input (S100). Specifically, a speech waveform input from a microphone, or a feature sequence obtained by acoustic analysis of that waveform, is obtained as input.
As an implementation example of the model adaptation apparatus according to this embodiment, an example of a speaker adaptation system is described below. FIG. 4 is a diagram showing the overall configuration of the speaker adaptation system according to this example. The speaker adaptation system 100 shown in FIG. 4 includes input means 110, a model adaptation unit 10b, a text database 120, a sentence list 130, an acoustic model 150, sentence presentation means 200, determination means 210, model update means 220, and output means 230.
Next, the overall flow of the speaker adaptation processing according to this example is described with reference to the flowchart of FIG. 5. First, the speaker adaptation system 100 receives speech as input (S200). Specifically, the speaker adaptation system 100 obtains as input, via the input means 110, a speech waveform input from a microphone or a feature sequence obtained by acoustic analysis of that waveform.
The second embodiment of the present invention is described below in detail with reference to the drawings. Compared with the first embodiment, this embodiment improves the efficiency of speaker adaptation even with a small sentence list by using a class database.
Next, the model adaptation processing according to this embodiment is described. This embodiment is the same as FIG. 3 except for the phoneme detection processing in step S103 of FIG. 3, so the description of the common parts is omitted.
As an implementation example of the model adaptation apparatus according to the second embodiment of the present invention, an example of a language adaptation system is described below. FIG. 7 is a diagram showing the overall configuration of the language adaptation system according to this example. The language adaptation system 100b shown in FIG. 7 includes input means 110, a model adaptation unit 10d, a text database 120, a sentence list 130, an acoustic model 150, sentence presentation means 200, determination means 210, model update means 220, and output means 230.
Next, the language adaptation processing according to this example is described. This example is the same as FIG. 5 except for the model adaptation processing in step S201 of FIG. 5, so the description of the common parts is omitted.
Claims (18)
1. A model adaptation device that adapts a model to input features, which are input data, by approximating the model to the characteristics of the input features, the device comprising:
model adaptation means for performing, from the input features and a first teacher label sequence representing their content, model adaptation corresponding to each label, and outputting adaptation feature information for the model adaptation;
distance calculation means for calculating, for each label, an inter-model distance between the adaptation feature information and the model;
detection means for detecting labels for which the inter-model distance exceeds a predetermined threshold; and
label generation means for generating, when one or more labels are obtained as output of the detection means, a second teacher label sequence containing at least one of the detected labels.
2. A model adaptation device using model adaptation that adapts an acoustic model used for speech recognition to the speaker of input speech by approximating the acoustic model to the characteristics of the input speech, the device comprising:
a text database storing a plurality of sentences having predetermined phonemes;
a sentence list having a plurality of sentences describing the content of the input speech;
input means to which the input speech is input;
model adaptation means for performing the model adaptation using the input speech and the sentence list, and outputting adaptation feature information, which is sufficient statistics for approximating the acoustic model to the input speech;
a statistics database storing the adaptation feature information;
distance calculation means for calculating, for each phoneme, an acoustic distance between the adaptation feature information and the acoustic model, and outputting a distance value for each phoneme;
phoneme detection means for outputting, when any of the distance values exceed a predetermined threshold, those exceeding the threshold as a detection result; and
label generation means for searching the text database for sentences containing the phonemes of the detection result, and outputting the sentences extracted by the search.
3. The model adaptation device according to claim 2, further comprising:
determination means for setting, when the label generation means has output sentences from the search, those sentences as a new sentence list, and for issuing a notification to that effect when the label generation means has not output any sentence;
model update means for acquiring, upon receiving the notification from the determination means that no sentence was output, the adaptation feature information from the statistics database and applying it to the acoustic model to obtain an adapted acoustic model;
output means for outputting the adapted acoustic model; and
sentence presentation means for presenting the sentence list and the new sentence list,
wherein the model adaptation means performs model adaptation again using speech input based on the new sentence list and the new sentence list, and outputs new adaptation feature information,
the distance calculation means calculates, for each phoneme, the distance between the new adaptation feature information and the acoustic model, and outputs a new distance value for each phoneme,
the phoneme detection means outputs, when any of the new distance values exceed the threshold, those exceeding the threshold as a new detection result, and
the label generation means searches the text database for sentences containing the phonemes of the new detection result, and outputs the sentences extracted by the search.
4. The model adaptation device according to claim 2 or 3, wherein the phoneme detection means uses a different threshold for each phoneme.
5. The model adaptation device according to any one of claims 2 to 4, further comprising a class database storing information that classifies phonemes or combinations of phonemes,
wherein the phoneme detection means refers to the class database and, if any of the distance values of the phonemes output by the distance calculation means exceeds the threshold, also outputs, as the detection result, the phonemes belonging to the same class as the phoneme that exceeded the threshold.
6. The model adaptation device according to any one of claims 2 to 5, wherein the input speech includes speech and feature-sequence data obtained by acoustic analysis of the speech.
7. A model adaptation method that adapts a model to input features, which are input data, by approximating the model to the characteristics of the input features, the method comprising:
a model adaptation step of performing, from the input features and a first teacher label sequence representing their content, model adaptation corresponding to each label, and outputting adaptation feature information for the model adaptation;
a distance calculation step of calculating, for each label, an inter-model distance between the adaptation feature information and the model;
a detection step of detecting labels for which the inter-model distance exceeds a predetermined threshold; and
a label generation step of generating, when one or more labels are obtained as output of the detection step, a second teacher label sequence containing at least one of the detected labels.
8. A model adaptation method using model adaptation that adapts an acoustic model used for speech recognition to the speaker of input speech by approximating the acoustic model to the characteristics of the input speech, the method comprising:
an input step in which the input speech is input;
a model adaptation step of performing the model adaptation using the input speech and a sentence list having a plurality of sentences describing the content of the input speech, and outputting adaptation feature information, which is sufficient statistics for approximating the acoustic model to the input speech;
a step of storing the adaptation feature information in a statistics database;
a distance calculation step of calculating, for each phoneme, an acoustic distance between the adaptation feature information and the acoustic model, and outputting a distance value for each phoneme;
a phoneme detection step of outputting, when any of the distance values exceed a predetermined threshold, those exceeding the threshold as a detection result; and
a label generation step of searching a text database, which stores a plurality of sentences having predetermined phonemes, for sentences containing the phonemes of the detection result, and outputting the sentences extracted by the search.
9. The model adaptation method according to claim 8, further comprising:
a determination step of setting, when the label generation step has output sentences from the search, those sentences as a new sentence list, and of issuing a notification to that effect when the label generation step has not output any sentence;
a model update step of acquiring, upon receiving the notification from the determination step that no sentence was output, the adaptation feature information from the statistics database and applying it to the acoustic model to obtain an adapted acoustic model;
an output step of outputting the adapted acoustic model; and
a sentence presentation step of presenting the sentence list and the new sentence list,
wherein the model adaptation step performs model adaptation again using speech input based on the new sentence list and the new sentence list, and outputs new adaptation feature information,
the distance calculation step calculates, for each phoneme, the distance between the new adaptation feature information and the acoustic model, and outputs a new distance value for each phoneme,
the phoneme detection step outputs, when any of the new distance values exceed the threshold, those exceeding the threshold as a new detection result, and
the label generation step searches the text database for sentences containing the phonemes of the new detection result, and outputs the sentences extracted by the search.
10. The model adaptation method according to claim 8 or 9, wherein the phoneme detection step uses a different threshold for each phoneme.
11. The model adaptation method according to any one of claims 8 to 10, further comprising a step of storing, in a class database, information that classifies phonemes or combinations of phonemes,
wherein the phoneme detection step refers to the class database and, if any of the distance values of the phonemes output by the distance calculation step exceeds the threshold, also outputs, as the detection result, the phonemes belonging to the same class as the phoneme that exceeded the threshold.
12. The model adaptation method according to any one of claims 8 to 11, wherein the input speech includes speech and feature-sequence data obtained by acoustic analysis of the speech.
13. A model adaptation program that adapts a model to input features, which are input data, by approximating the model to the characteristics of the input features, the program causing a computer to execute:
model adaptation processing of performing, from the input features and a first teacher label sequence representing their content, model adaptation corresponding to each label, and outputting adaptation feature information for the model adaptation;
distance calculation processing of calculating, for each label, an inter-model distance between the adaptation feature information and the model;
detection processing of detecting labels for which the inter-model distance exceeds a predetermined threshold; and
label generation processing of generating, when one or more labels are obtained as output of the detection processing, a second teacher label sequence containing at least one of the detected labels.
14. A model adaptation program using model adaptation that adapts an acoustic model used for speech recognition to the speaker of input speech by approximating the acoustic model to the characteristics of the input speech, the program causing a computer to execute:
input processing in which the input speech is input;
model adaptation processing of performing the model adaptation using the input speech and a sentence list having a plurality of sentences describing the content of the input speech, and outputting adaptation feature information, which is sufficient statistics for approximating the acoustic model to the input speech;
processing of storing the adaptation feature information in a statistics database;
distance calculation processing of calculating, for each phoneme, an acoustic distance between the adaptation feature information and the acoustic model, and outputting a distance value for each phoneme;
phoneme detection processing of outputting, when any of the distance values exceed a predetermined threshold, those exceeding the threshold as a detection result; and
label generation processing of searching a text database, which stores a plurality of sentences having predetermined phonemes, for sentences containing the phonemes of the detection result, and outputting the sentences extracted by the search.
15. The model adaptation program according to claim 14, further causing the computer to execute:
determination processing of setting, when the label generation processing has output sentences from the search, those sentences as a new sentence list, and of issuing a notification to that effect when the label generation processing has not output any sentence;
model update processing of acquiring, upon receiving the notification from the determination processing that no sentence was output, the adaptation feature information from the statistics database and applying it to the acoustic model to obtain an adapted acoustic model;
output processing of outputting the adapted acoustic model; and
sentence presentation processing of presenting the sentence list and the new sentence list,
wherein the model adaptation processing performs model adaptation again using speech input based on the new sentence list and the new sentence list, and outputs new adaptation feature information,
the distance calculation processing calculates, for each phoneme, the distance between the new adaptation feature information and the acoustic model, and outputs a new distance value for each phoneme,
the phoneme detection processing outputs, when any of the new distance values exceed the threshold, those exceeding the threshold as a new detection result, and
the label generation processing searches the text database for sentences containing the phonemes of the new detection result, and outputs the sentences extracted by the search.
16. The model adaptation program according to claim 14 or 15, wherein the phoneme detection processing uses a different threshold for each phoneme.
17. The model adaptation program according to any one of claims 14 to 16, further causing the computer to execute processing of storing, in a class database, information that classifies phonemes or combinations of phonemes,
wherein the phoneme detection processing refers to the class database and, if any of the distance values of the phonemes output by the distance calculation processing exceeds the threshold, also outputs, as the detection result, the phonemes belonging to the same class as the phoneme that exceeded the threshold.
18. The model adaptation program according to any one of claims 14 to 17, wherein the input speech includes speech and feature-sequence data obtained by acoustic analysis of the speech.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2010535770A JP5376341B2 (ja) | 2008-10-31 | 2009-10-23 | モデル適応装置、その方法及びそのプログラム |
US12/998,469 US20110224985A1 (en) | 2008-10-31 | 2009-10-23 | Model adaptation device, method thereof, and program thereof |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008281387 | 2008-10-31 | ||
JP2008-281387 | 2008-10-31 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010050414A1 true WO2010050414A1 (ja) | 2010-05-06 |
Family
ID=42128777
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/068263 WO2010050414A1 (ja) | 2008-10-31 | 2009-10-23 | モデル適応装置、その方法及びそのプログラム |
Country Status (3)
Country | Link |
---|---|
US (1) | US20110224985A1 (ja) |
JP (1) | JP5376341B2 (ja) |
WO (1) | WO2010050414A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4772164B2 (ja) * | 2009-01-30 | 2011-09-14 | 三菱電機株式会社 | 音声認識装置 |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8595004B2 (en) * | 2007-12-18 | 2013-11-26 | Nec Corporation | Pronunciation variation rule extraction apparatus, pronunciation variation rule extraction method, and pronunciation variation rule extraction program |
KR20170034227A (ko) * | 2015-09-18 | 2017-03-28 | 삼성전자주식회사 | 음성 인식 장치 및 방법과, 음성 인식을 위한 변환 파라미터 학습 장치 및 방법 |
EP3535751A4 (en) * | 2016-11-10 | 2020-05-20 | Nuance Communications, Inc. | METHOD FOR LANGUAGE-INDEPENDENT WAY RECOGNITION |
CN109754784B (zh) * | 2017-11-02 | 2021-01-29 | 华为技术有限公司 | 训练滤波模型的方法和语音识别的方法 |
CN114678040B (zh) * | 2022-05-19 | 2022-08-30 | 北京海天瑞声科技股份有限公司 | 语音一致性检测方法、装置、设备及存储介质 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002132288A (ja) * | 2000-10-24 | 2002-05-09 | Fujitsu Ltd | エンロール文音声入力方法とエンロール文音声入力装置とそれを実現するためのプログラムを記録した記録媒体 |
WO2007105409A1 (ja) * | 2006-02-27 | 2007-09-20 | Nec Corporation | 標準パタン適応装置、標準パタン適応方法および標準パタン適応プログラム |
JP2007248730A (ja) * | 2006-03-15 | 2007-09-27 | Nippon Telegr & Teleph Corp <Ntt> | 音響モデル適応装置、音響モデル適応方法、音響モデル適応プログラム及び記録媒体 |
JP2008129527A (ja) * | 2006-11-24 | 2008-06-05 | Nippon Telegr & Teleph Corp <Ntt> | 音響モデル生成装置、方法、プログラム及びその記録媒体 |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6272462B1 (en) * | 1999-02-25 | 2001-08-07 | Panasonic Technologies, Inc. | Supervised adaptation using corrective N-best decoding |
JP2001134285A (ja) * | 1999-11-01 | 2001-05-18 | Matsushita Electric Ind Co Ltd | 音声認識装置 |
US7209881B2 (en) * | 2001-12-20 | 2007-04-24 | Matsushita Electric Industrial Co., Ltd. | Preparing acoustic models by sufficient statistics and noise-superimposed speech data |
JP3981640B2 (ja) * | 2003-02-20 | 2007-09-26 | 日本電信電話株式会社 | 音素モデル学習用文リスト生成装置、および生成プログラム |
US7412383B1 (en) * | 2003-04-04 | 2008-08-12 | At&T Corp | Reducing time for annotating speech data to develop a dialog application |
KR100612840B1 (ko) * | 2004-02-18 | 2006-08-18 | 삼성전자주식회사 | 모델 변이 기반의 화자 클러스터링 방법, 화자 적응 방법및 이들을 이용한 음성 인식 장치 |
US7529669B2 (en) * | 2006-06-14 | 2009-05-05 | Nec Laboratories America, Inc. | Voice-based multimodal speaker authentication using adaptive training and applications thereof |
US8155961B2 (en) * | 2008-12-09 | 2012-04-10 | Nokia Corporation | Adaptation of automatic speech recognition acoustic models |
- 2009-10-23: international application PCT/JP2009/068263 filed (WO, active Application Filing)
- 2009-10-23: national application US 12/998,469 filed (status: abandoned)
- 2009-10-23: national application JP 2010535770 filed, granted as JP 5376341 B2 (active)
Non-Patent Citations (1)
Title |
---|
TANI ET AL.: "Jubun Tokeiryo o Mochiita Kyoshi Nashi Washa Tekio ni Okeru Washa Sentakuho", IEICE TECHNICAL REPORT NLC2007-33-86, vol. 107, no. 405, 13 December 2007 (2007-12-13), pages 85 - 89 * |
Also Published As
Publication number | Publication date |
---|---|
JP5376341B2 (ja) | 2013-12-25 |
JPWO2010050414A1 (ja) | 2012-03-29 |
US20110224985A1 (en) | 2011-09-15 |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09823525; Country of ref document: EP; Kind code of ref document: A1
 | ENP | Entry into the national phase | Ref document number: 2010535770; Country of ref document: JP; Kind code of ref document: A
 | WWE | Wipo information: entry into national phase | Ref document number: 12998469; Country of ref document: US
 | NENP | Non-entry into the national phase | Ref country code: DE
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 09823525; Country of ref document: EP; Kind code of ref document: A1