WO2008069308A1 - Speech recognition apparatus and speech recognition method - Google Patents
Speech recognition apparatus and speech recognition method
- Publication number
- WO2008069308A1 (application PCT/JP2007/073674)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- utterance
- speech
- model
- data
- speech recognition
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/197—Probabilistic grammars, e.g. word n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
Definitions
- the present invention relates to a speech recognition technology, and more particularly to a speech recognition technology using an acoustic model and a language model, and a model learning technology.
- Non-Patent Document 1 describes a technique that uses, for high-speed utterances, a dedicated acoustic model learned only from fast speech together with a dictionary in which utterance deformations (pronunciation variants) are registered. The technique of this document improves recognition performance by using models dedicated to particular speech rates.
- Non-Patent Document 1: Takahiro Shinozaki, Sadaoki Furui, "HIDDEN MODE HMM USING BAYESIAN NETWORK FOR MODELING SPEAKING RATE FLUCTUATION", Automatic Speech Recognition and Understanding (ASRU) Workshop 2003, pp. 417-422
- Non-Patent Document 2: Kita, "Language Models and Calculation 4: Stochastic Language Models", University of Tokyo Press, 1999, pp. 57-62
- Non-Patent Document 3: Steve Young et al., "The HTK Book (for HTK Version 3.3)", Cambridge University Engineering Department, April 2005, pp. 35-40, 54-64, 127-130
Disclosure of the Invention
Problems to be Solved by the Invention
- the utterance speed is a characteristic measured based on the utterance content.
- the utterance content is estimated using the recognition result of the input speech data.
- the recognition result may contain an error
- the utterance speed obtained from such a recognition result therefore lacks accuracy, and it is difficult to improve recognition accuracy in speech recognition methods that learn models based on the speech rate.
- the recognition accuracy may be degraded.
- the above problem arises from using, as a feature quantity representing a spoken-language phenomenon, a quantity such as the utterance speed that is difficult to measure accurately.
- the recognition accuracy is significantly improved under ideal conditions where the correct feature value is known.
- in practice, however, the correct value is unknown, so it is difficult to improve the recognition accuracy.
- furthermore, utterance speed is inherently an acoustic feature, and the utterance content, which is a linguistic feature, is unrelated to changes in utterance speed. The improvement obtainable in speech recognition from the speech rate is therefore limited to the acoustic side, and its absolute magnitude cannot be expected to be large.
- the present invention has been made in view of the above problems, and an object of the present invention is to provide a speech recognition technology that recognizes with high accuracy speech, such as spontaneous spoken language, whose accurate feature quantities are difficult to capture.
Means for Solving the Problems
- the speech recognition apparatus includes a speech recognition unit that performs speech recognition processing using an acoustic model and a language model, and a model learning unit that learns the acoustic model and the language model according to an utterance length representing the length of a speech section in speech data.
- FIG. 1 is a block diagram of a model learning unit in a first embodiment of the present invention.
- FIG. 2 is a block diagram of a speech recognition unit in the first embodiment of the present invention.
- FIG. 3 is a block diagram of a model learning unit in the second embodiment of the present invention.
- FIG. 4 is a block diagram of a speech recognition unit in the second embodiment of the present invention.
- FIG. 5 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.
- FIG. 5 shows the configuration of the speech recognition apparatus according to the embodiment of the present invention.
- the speech recognition apparatus 100 includes a model learning unit 100A that performs learning processing of a model used for speech recognition, and a speech recognition unit 100B that performs recognition processing of input speech and outputs the recognition result.
- the illustrated configuration is common to the first and second embodiments described later.
- FIG. 1 shows the configuration of model learning unit 100A_1 in the first embodiment.
- the model learning unit 100A_1 comprises voice data 101, transcription text data 102, section detection means 103, data selection means 104, utterance length data 105, model learning means 106, and an utterance-length-specific model 107.
- the common element 110, surrounded by a chain line in the drawing, is shared between the present embodiment and the second embodiment described later.
- the audio data 101 is data for learning an acoustic model.
- the audio data 101 is A/D-converted at 16 bits per sample with a sampling frequency of 44.1 kHz. Since the voice data 101 captures all sound during a conversation, sections in which speech is uttered are mixed with sections of silence and noise.
- the transcription text data 102 is text data produced by a person who listens to the voice data 101 and writes down its contents.
- the section detecting means 103 analyzes the input voice data 101 to detect a voice section and outputs it as voice section information.
- as a method for detecting a voice segment, for example, the power of the audio can be calculated and a segment in which the power exceeds a certain threshold can be taken as a voice segment.
- the audio power is, for example, a value obtained by summing the squares of the amplitudes of the audio data over regular intervals of about 10 msec.
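- as a rough illustration, the following sketch (hypothetical function and parameter names; the threshold value is a placeholder to be tuned per recording) computes the frame power as the sum of squared amplitudes over roughly 10-msec frames and collects runs of frames above the threshold as voice sections:

```python
import numpy as np

def detect_voice_sections(samples, sr, frame_ms=10, threshold=1e6):
    """Return (start_sec, end_sec) pairs where frame power exceeds a threshold."""
    frame_len = int(sr * frame_ms / 1000)                  # ~10 msec frames
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    power = (frames.astype(np.float64) ** 2).sum(axis=1)  # sum of squared amplitudes
    voiced = power > threshold

    sections, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            sections.append((start * frame_ms / 1000.0, i * frame_ms / 1000.0))
            start = None
    if start is not None:
        sections.append((start * frame_ms / 1000.0, n_frames * frame_ms / 1000.0))
    return sections
```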
- the data selection unit 104 cuts out voice data according to the voice section detected by the section detection unit 103, and selects and outputs the transcription text data 102 corresponding to that section. At that time, the cut-out portions of the voice data and the transcription text are classified according to the length of the voice section, that is, the utterance length, and stored in a storage device. In this embodiment, there are three classification units: "short utterance", "medium utterance", and "long utterance". The utterance length data 105 is the voice data and transcription text classified into these three units.
- "Short utterance" corresponds to, for example, an utterance composed of one or two words, such as a response to a question from the other party, or a question.
- its vocabulary is composed of words indicating responses, such as "yes", and words answering questions. Since such an utterance is normally about 1 second long, in this embodiment the utterance length of a "short utterance" is defined as less than 1 second.
- "Medium utterance" corresponds to, for example, fixed phrases such as "Thank you very much", or responses to simple questions whose answers are already organized in the speaker's mind, such as "Where will you be on January 1st?". In this embodiment, the utterance length of such a "medium utterance" is defined as about 1 to 3 seconds. "Long utterance" corresponds to cases such as explaining an event or describing matters that are not yet organized in one's head, and in this embodiment its utterance length is defined as 3 seconds or more.
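- the three units amount to a simple threshold rule; the sketch below (hypothetical names) maps a detected section length in seconds to the unit used in this embodiment:

```python
def utterance_length_unit(duration_sec):
    """Map a speech-section duration to the three units of this embodiment."""
    if duration_sec < 1.0:
        return "short"   # responses of roughly one or two words
    elif duration_sec < 3.0:
        return "medium"  # fixed phrases, answers to simple questions
    else:
        return "long"    # explanations of matters not yet organized
```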
- the model learning means 106 uses the utterance length data 105 to learn the acoustic model and language model used for speech recognition for each of the above classifications.
- the utterance length model 107 is an acoustic model and a language model learned for each utterance length unit.
- the language model is a model expressed by an N-gram approximation as described in Non-Patent Document 2, and its learning is performed mainly by maximum likelihood estimation.
- an N-gram is a language modeling technique that approximates the occurrence probability of a word sequence using the occurrence probability (conditional probability) of the N-th word conditioned on the preceding N-1 words of its history.
- the occurrence probability can be calculated by counting the frequencies of word strings in the learning corpus. For example, the occurrence probability of the two-word chain "watashi" (I) followed by the particle "wa" is the number of occurrences of that chain divided by the total number of two-word chains.
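- a minimal sketch of this counting procedure is shown below ("watashi" and "wa" stand in for the two-word chain of the example; the corpus and names are hypothetical):

```python
from collections import Counter

def bigram_stats(corpus_sentences):
    """Count two-word chains in a learning corpus."""
    bigrams = Counter()
    for words in corpus_sentences:            # each sentence is a list of words
        bigrams.update(zip(words, words[1:]))
    return bigrams, sum(bigrams.values())

corpus = [["watashi", "wa", "gakusei", "desu"], ["watashi", "wa", "iku"]]
bigrams, total = bigram_stats(corpus)
# occurrences of the chain divided by the total number of two-word chains
p = bigrams[("watashi", "wa")] / total        # -> 2 / 5
```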
- the acoustic model is a probabilistic model that expresses the acoustic features of speech.
- as the acoustic model, HMMs (Hidden Markov Models) with triphones as the phonetic unit, as described in the HMM toolkit manual (Non-Patent Document 3, pages 35 to 40), are widely used.
- learning of an acoustic model will be described.
- as described in Non-Patent Document 3, pages 54 to 64, the acoustic features of speech are extracted from the speech data at regular intervals of about 10 msec by applying pre-emphasis, FFT, filter bank processing, and a cosine transform. In addition to the extracted features, the power and the differences between preceding and following frames can also be used. Next, using the extracted features and the label data obtained from the corresponding transcription text, the forward-backward algorithm described on pages 127 to 130 of Non-Patent Document 3 is applied to compute the probabilities, thereby associating the features with the label data.
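- the feature extraction step might look like the following sketch; note that it uses the librosa library rather than HTK, so details (window shape, filter bank, liftering) differ from Non-Patent Document 3:

```python
import numpy as np
import librosa

def extract_features(wav_path):
    """MFCC-style features every ~10 msec, plus first-order differences."""
    y, sr = librosa.load(wav_path, sr=16000)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])              # pre-emphasis
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)  # 10 msec hop at 16 kHz
    delta = librosa.feature.delta(mfcc)                     # inter-frame differences
    return np.vstack([mfcc, delta]).T                       # frames x features
```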
- as the label data, the aforementioned triphones or the like can be used.
- for example, for the phoneme sequence "w a t a k u s i", the label data will be "*-w+a w-a+t a-t+a t-a+k a-k+u k-u+s u-s+i s-i+*".
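- the expansion of a phoneme sequence into such triphone labels can be sketched as follows (hypothetical function name):

```python
def to_triphones(phonemes):
    """Expand a phoneme sequence into left-right context triphone labels."""
    padded = ["*"] + list(phonemes) + ["*"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones("watakusi"))
# ['*-w+a', 'w-a+t', 'a-t+a', 't-a+k', 'a-k+u', 'k-u+s', 'u-s+i', 's-i+*']
```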
- the model learning means 106 learns an acoustic model and a language model by the process as described above for each of the three units of utterance length.
- that is, three types of models are learned: a model for "short utterances" with an utterance length of less than 1 second, a model for "medium utterances" with an utterance length of 1 to 3 seconds, and a model for "long utterances" with an utterance length of 3 seconds or more.
- the learned acoustic models and language models constitute the utterance-length-specific model 107.
- FIG. 2 shows the configuration of the speech recognition unit 100B_1 in the first embodiment.
- the speech recognition unit 100B_1 includes section detection means 103, utterance length determination means 201, utterance length-specific model 107, model selection means 202, and recognition means 203.
- the section detection means 103 has basically the same function as that in the model learning unit 100A_1 described above: it detects a voice section from the input voice data and outputs the start time and end time of the voice section as section information.
- the utterance length determination means 201 calculates the utterance length, that is, the length of the section, based on the section information, and then determines which of the prescribed units, such as "less than 1 second", "1 to 3 seconds", or "3 seconds or more", the calculated utterance length falls into.
- the model selection unit 202 selects an acoustic model and a language model corresponding to the unit of the utterance length determined by the utterance length determination unit 201 from the utterance length model 107 described above.
- the recognition unit 203 recognizes input speech using the acoustic model and language model selected by the model selection unit 202, and outputs the recognition result.
- the recognition method is roughly divided into acoustic analysis processing and search processing.
- the acoustic analysis is a process for calculating the above-mentioned voice feature amount.
- the search calculates word scores using the calculated feature values, the acoustic model, and the language model, and outputs high-scoring words as recognition candidates.
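- the overall flow of FIG. 2 can be outlined as below; this is glue code under assumed interfaces, reusing the helper sketches above, with `decode` standing in for the acoustic analysis and search performed by the recognition means 203:

```python
def recognize(audio, sr, models, decode):
    """models maps a unit ("short"/"medium"/"long") to an (acoustic, language) pair."""
    results = []
    for start, end in detect_voice_sections(audio, sr):  # section detection 103
        unit = utterance_length_unit(end - start)        # utterance length determination 201
        acoustic_model, language_model = models[unit]    # model selection 202 from model 107
        segment = audio[int(start * sr):int(end * sr)]
        results.append(decode(segment, acoustic_model, language_model))
    return results
```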
- as described above, in this embodiment the acoustic model and the language model are learned according to the utterance length as a speech feature quantity, and speech recognition is performed using those models, so the accuracy of speech recognition can be improved.
- besides being created separately for acoustics and language as in the above embodiment, the learned model may also be expressed by conditional probabilities, for example with the utterance length as the condition.
- also, in speech recognition, when the utterance length is, for example, 3 seconds, instead of using only the 3-second model, a linear sum with the 2-second or 4-second utterance models may be used.
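- such a weighted linear sum might be sketched as follows, where `score` is a hypothetical interface returning a model's score for a hypothesis:

```python
def interpolated_score(hypothesis, features, models_by_unit, weights):
    """Weighted linear sum over neighboring utterance-length models,
    e.g. weights = {"medium": 0.5, "long": 0.5} near a unit boundary."""
    return sum(w * models_by_unit[unit].score(hypothesis, features)
               for unit, w in weights.items())
```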
- in the second embodiment, model learning and speech recognition are performed focusing not only on the utterance length described above but also on the utterance time, which is the time elapsed from the beginning of the speech section.
- FIG. 3 shows the configuration of the model learning unit in the second embodiment.
- the model learning unit 100A_2 of the present embodiment comprises the utterance length data 105 obtained by the common element 110 shown in FIG. 1, utterance time determination means 301, utterance-length/utterance-time data 302, model learning means 106, and an utterance-length/utterance-time model 303.
- the utterance time determination means 301 further divides the voice data and transcription data of the utterance length data 105, already classified by utterance length, into three parts: the first 1 second, the last 1 second, and the remaining central part. Each divided part corresponds to a detailed data portion in the present invention. Note that the number of divisions is not limited to three as in this embodiment; other numbers such as four or five may be used. A combination of multiple parts, such as the first 1 second together with the last 1 second, may also be treated as a single classification.
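- the three-way division by utterance time can be sketched as follows (hypothetical names; utterances shorter than two edge lengths would need special handling):

```python
def split_by_utterance_time(samples, sr, edge_sec=1.0):
    """Divide one utterance into first-second, central, and last-second parts."""
    edge = int(edge_sec * sr)
    return {
        "head": samples[:edge],         # first 1 second
        "middle": samples[edge:-edge],  # remaining central part
        "tail": samples[-edge:],        # last 1 second
    }
```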
- the utterance-length/utterance-time data 302 consists of the voice data and transcription text divided by the utterance time determination means 301, classified by utterance length and utterance time.
- the model learning means 106 learns an acoustic model and a language model for each utterance length and each utterance time, using the utterance-length/utterance-time data 302.
- the learned acoustic models and language models constitute the utterance-length/utterance-time model 303.
- FIG. 4 shows the configuration of the speech recognition unit 100B_2 of the present embodiment.
- the speech recognition unit 100B_2 comprises section detection means 103, utterance length determination means 201, utterance time determination means 301, the utterance-length/utterance-time model 303, model selection means 401, and recognition means 203.
- the section detection means 103 and the utterance length determination means 201 are the same as those of the speech recognition unit 100B_1 shown in FIG. 2. That is, a voice section is detected from the input voice data, and it is determined which unit the length of the section, that is, the utterance length, corresponds to.
- the utterance time determination means 301 identifies three parts of the input voice based on the section information: the first 1-second part, the last 1-second part, and the remaining central part.
- the model selection means 401 selects, from the utterance-length/utterance-time model 303, the acoustic model and language model corresponding to the speech data to be recognized, based on the utterance length and utterance time information.
- for example, if the speech waveform to be recognized is shorter than 1 second and its first 1 second is to be recognized, the model learned from speech data whose utterance length is less than 1 second and whose utterance time corresponds to the first 1 second is selected.
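- this selection rule amounts to a lookup keyed by the pair (utterance-length unit, utterance-time part), as in the hypothetical sketch below:

```python
def select_model(models, duration_sec, part):
    """models is keyed by (unit, part), e.g. models[("short", "head")]."""
    unit = utterance_length_unit(duration_sec)  # from the earlier sketch
    return models[(unit, part)]

# for a waveform under 1 second, recognizing its first 1-second part:
# acoustic_model, language_model = select_model(models, 0.8, "head")
```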
- in this way, acoustic models and language models are created for each utterance time, and speech recognition is performed using the dedicated model matching the utterance time observed in the input speech, which is expected to improve recognition performance.
- in addition, the vocabulary at the beginning of a recognized section can be narrowed down to response words such as "yes", and that at the end of a section to sentence-final expressions, which can improve processing efficiency.
- the utterance time used as a speech feature quantity is obtained simply by measuring time from the beginning of the determined speech section. Like the utterance length, it is therefore information unrelated to the content of the utterance and does not cause discrepancies in observed values between learning and recognition, so stable speech recognition can be realized.
- the learning model using the utterance time may also be expressed by conditional probabilities with the utterance length and the utterance time as conditions. Likewise, at recognition time it is not necessary to use only the model selected by utterance length and utterance time; a model of an adjacent utterance length or utterance time, or a weighted linear sum with other models, may be used.
- in the above embodiments, the utterance length is classified into the three categories "short utterance", "medium utterance", and "long utterance", but it may instead be classified into two, four, or more units.
- with fewer units the classification is coarse and it is difficult to improve recognition accuracy, while processing becomes more complicated as the number of units increases; it is therefore desirable to set the number of utterance length units with this trade-off in mind.
- the present invention is suitable for various speech recognition apparatuses that require highly accurate speech recognition.
- the present invention may be implemented as a computer program corresponding to the means provided in the speech recognition apparatus in each of the above embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/518,075 US8706487B2 (en) | 2006-12-08 | 2007-12-07 | Audio recognition apparatus and speech recognition method using acoustic models and language models |
JP2008548349A JP5240456B2 (ja) | 2006-12-08 | 2007-12-07 | Speech recognition apparatus and speech recognition method |
EP07850261A EP2096630A4 (en) | 2006-12-08 | 2007-12-07 | AUDIO RECOGNITION DEVICE AND AUDIO RECOGNITION METHOD |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2006-331871 | 2006-12-08 | ||
JP2006331871 | 2006-12-08 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008069308A1 (ja) | 2008-06-12 |
Family
ID=39492183
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2007/073674 WO2008069308A1 (ja) | Speech recognition apparatus and speech recognition method | 2006-12-08 | 2007-12-07 |
Country Status (4)
Country | Link |
---|---|
US (1) | US8706487B2 (ja) |
EP (1) | EP2096630A4 (ja) |
JP (1) | JP5240456B2 (ja) |
WO (1) | WO2008069308A1 (ja) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011107314A (ja) * | 2009-11-16 | 2011-06-02 | Nippon Telegr & Teleph Corp <Ntt> | Speech recognition apparatus, speech recognition method, and speech recognition program |
US9031841B2 (en) | 2011-12-28 | 2015-05-12 | Fujitsu Limited | Speech recognition apparatus, speech recognition method, and speech recognition program |
JP2020187211A (ja) * | 2019-05-13 | 2020-11-19 | Hitachi Ltd | Dialogue apparatus, dialogue method, and dialogue computer program |
JP2021121875A (ja) * | 2018-10-19 | 2021-08-26 | Yahoo Japan Corp | Learning apparatus, detection apparatus, learning method, learning program, detection method, and detection program |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6341092B2 (ja) * | 2012-10-31 | 2018-06-13 | NEC Corp | Expression classification apparatus, expression classification method, dissatisfaction detection apparatus, and dissatisfaction detection method |
US9754607B2 (en) * | 2015-08-26 | 2017-09-05 | Apple Inc. | Acoustic scene interpretation systems and related methods |
US9799327B1 (en) * | 2016-02-26 | 2017-10-24 | Google Inc. | Speech recognition with attention-based recurrent neural networks |
CN109313900A (zh) * | 2016-06-15 | 2019-02-05 | Sony Corp | Information processing device and information processing method |
US10586529B2 (en) * | 2017-09-14 | 2020-03-10 | International Business Machines Corporation | Processing of speech signal |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003330485A (ja) * | 2002-05-10 | 2003-11-19 | Tokai Rika Co Ltd | Speech recognition apparatus, speech recognition system, and speech recognition method |
JP2004126143A (ja) * | 2002-10-01 | 2004-04-22 | Mitsubishi Electric Corp | Speech recognition apparatus and speech recognition program |
JP2007249051A (ja) * | 2006-03-17 | 2007-09-27 | Nippon Telegr & Teleph Corp <Ntt> | Acoustic model creation apparatus, acoustic model creation method, program therefor, and recording medium therefor |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS6239900A (ja) | 1985-08-15 | 1987-02-20 | Canon Inc. | Speech recognition apparatus |
US5774851A (en) * | 1985-08-15 | 1998-06-30 | Canon Kabushiki Kaisha | Speech recognition apparatus utilizing utterance length information |
JP2829014B2 (ja) | 1989-01-12 | 1998-11-25 | Toshiba Corp | Speech recognition apparatus and method |
US5444817A (en) * | 1991-10-02 | 1995-08-22 | Matsushita Electric Industrial Co., Ltd. | Speech recognizing apparatus using the predicted duration of syllables |
US5583961A (en) * | 1993-03-25 | 1996-12-10 | British Telecommunications Public Limited Company | Speaker recognition using spectral coefficients normalized with respect to unequal frequency bands |
US5893059A (en) * | 1997-04-17 | 1999-04-06 | Nynex Science And Technology, Inc. | Speech recoginition methods and apparatus |
US6014624A (en) * | 1997-04-18 | 2000-01-11 | Nynex Science And Technology, Inc. | Method and apparatus for transitioning from one voice recognition system to another |
JP3058125B2 (ja) | 1997-06-27 | 2000-07-04 | NEC Corp | Speech recognition apparatus |
JP2000099077A (ja) | 1998-09-28 | 2000-04-07 | Matsushita Electric Ind Co Ltd | Speech recognition apparatus |
WO2001026092A2 (en) * | 1999-10-06 | 2001-04-12 | Lernout & Hauspie Speech Products N.V. | Attribute-based word modeling |
EP1189202A1 (en) * | 2000-09-18 | 2002-03-20 | Sony International (Europe) GmbH | Duration models for speech recognition |
US20020087309A1 (en) * | 2000-12-29 | 2002-07-04 | Lee Victor Wai Leung | Computer-implemented speech expectation-based probability method and system |
JP3893893B2 (ja) | 2001-03-30 | 2007-03-14 | Seiko Epson Corp | Voice search method for web pages, voice search apparatus, and voice search program |
JP4124416B2 (ja) | 2002-01-28 | 2008-07-23 | National Institute of Information and Communications Technology | Semi-automatic captioned program production system |
JP2006039120A (ja) * | 2004-07-26 | 2006-02-09 | Sony Corp | Dialogue apparatus, dialogue method, program, and recording medium |
US20060149544A1 (en) * | 2005-01-05 | 2006-07-06 | At&T Corp. | Error prediction in spoken dialog systems |
JP4906379B2 (ja) * | 2006-03-22 | 2012-03-28 | Fujitsu Ltd | Speech recognition apparatus, speech recognition method, and computer program |
-
2007
- 2007-12-07 JP JP2008548349A patent/JP5240456B2/ja active Active
- 2007-12-07 US US12/518,075 patent/US8706487B2/en active Active
- 2007-12-07 EP EP07850261A patent/EP2096630A4/en not_active Withdrawn
- 2007-12-07 WO PCT/JP2007/073674 patent/WO2008069308A1/ja active Search and Examination
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003330485A (ja) * | 2002-05-10 | 2003-11-19 | Tokai Rika Co Ltd | Speech recognition apparatus, speech recognition system, and speech recognition method |
JP2004126143A (ja) * | 2002-10-01 | 2004-04-22 | Mitsubishi Electric Corp | Speech recognition apparatus and speech recognition program |
JP2007249051A (ja) * | 2006-03-17 | 2007-09-27 | Nippon Telegr & Teleph Corp <Ntt> | Acoustic model creation apparatus, acoustic model creation method, program therefor, and recording medium therefor |
Non-Patent Citations (7)
Title |
---|
FUJIMURA K. ET AL.: "Jitsukankyo ni Okeru SNR-betsu Onkyo Model no Hyoka", IEICE TECHNICAL REPORT, vol. 104, no. 631, 21 January 2005 (2005-01-21), pages 43 - 48 * |
KITA: "Language models and calculation 4: stochastic language models", 1999, UNIVERSITY OF TOKYO PRESS, pages: 57 - 62 |
KUDO I. ET AL.: "Voice Across Japan Database", THE TRANSACTIONS OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS, vol. 40, no. 9, 15 September 1999 (1999-09-15), pages 3432 - 3445, XP008109285 * |
NISHIDA M. ET AL.: "BIC ni Motozuku Tokeiteki Washa Model Sentaku ni yoru Kyoshi Nashi Washa Indexing", THE TRANSACTIONS OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS D-II, vol. J87-D-II, no. 2, 1 February 2004 (2004-02-01), pages 504 - 512, XP008109283 * |
See also references of EP2096630A4 |
STEVE YOUNG ET AL.: "The HTK Book (for HTK Version 3.3)", April 2005, CAMBRIDGE UNIVERSITY ENGINEERING DEPARTMENT, pages: 35-40, 54-64, 127-130
TAKAHIRO SHINOZAKI; SADAOKI FURUI: "HIDDEN MODE HMM USING BAYESIAN NETWORK FOR MODELING SPEAKING RATE FLUCTUATION", AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU) WORKSHOP, 2003, pages 417 - 422, XP010713323, DOI: doi:10.1109/ASRU.2003.1318477 |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011107314A (ja) * | 2009-11-16 | 2011-06-02 | Nippon Telegr & Teleph Corp <Ntt> | Speech recognition apparatus, speech recognition method, and speech recognition program |
US9031841B2 (en) | 2011-12-28 | 2015-05-12 | Fujitsu Limited | Speech recognition apparatus, speech recognition method, and speech recognition program |
JP2021121875A (ja) * | 2018-10-19 | 2021-08-26 | Yahoo Japan Corp | Learning apparatus, detection apparatus, learning method, learning program, detection method, and detection program |
JP7212718B2 (ja) | 2018-10-19 | 2023-01-25 | Yahoo Japan Corp | Learning apparatus, detection apparatus, learning method, learning program, detection method, and detection program |
JP2020187211A (ja) * | 2019-05-13 | 2020-11-19 | Hitachi Ltd | Dialogue apparatus, dialogue method, and dialogue computer program |
JP7229847B2 (ja) | 2019-05-13 | 2023-02-28 | Hitachi Ltd | Dialogue apparatus, dialogue method, and dialogue computer program |
Also Published As
Publication number | Publication date |
---|---|
EP2096630A1 (en) | 2009-09-02 |
JP5240456B2 (ja) | 2013-07-17 |
US20100324897A1 (en) | 2010-12-23 |
EP2096630A4 (en) | 2012-03-14 |
US8706487B2 (en) | 2014-04-22 |
JPWO2008069308A1 (ja) | 2010-03-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11270685B2 (en) | Speech based user recognition | |
JP5240456B2 (ja) | Speech recognition apparatus and speech recognition method | |
CN106463113B (zh) | Predicting pronunciation in speech recognition | |
EP1557822B1 (en) | Automatic speech recognition adaptation using user corrections | |
JP4301102B2 (ja) | Speech processing apparatus, speech processing method, program, and recording medium | |
US9646605B2 (en) | False alarm reduction in speech recognition systems using contextual information | |
KR101153078B1 (ko) | Hidden conditional random field models for speech classification and speech recognition | |
US20040210437A1 (en) | Semi-discrete utterance recognizer for carefully articulated speech | |
JP2011033680A (ja) | Speech processing apparatus and method, and program | |
CN101887725A (zh) | Method for calculating phoneme posterior probability based on a phoneme confusion network | |
Ghai et al. | Analysis of automatic speech recognition systems for indo-aryan languages: Punjabi a case study | |
Hasnat et al. | Isolated and continuous bangla speech recognition: implementation, performance and application perspective | |
KR101014086B1 (ko) | Speech processing apparatus and method, and recording medium | |
Gorin et al. | Learning spoken language without transcriptions | |
Jothilakshmi et al. | Large scale data enabled evolution of spoken language research and applications | |
Mary et al. | Searching speech databases: features, techniques and evaluation measures | |
JP2008176202A (ja) | Speech recognition apparatus and speech recognition program | |
Harvill et al. | Frame-Level Stutter Detection. | |
Proença et al. | Mispronunciation Detection in Children's Reading of Sentences | |
Hirschberg et al. | Generalizing prosodic prediction of speech recognition errors | |
JP7098587B2 (ja) | Information processing apparatus, keyword detection apparatus, information processing method, and program | |
Pranjol et al. | Bengali speech recognition: An overview | |
JP6199994B2 (ja) | False alarm reduction in speech recognition systems using context information | |
Pisarn et al. | An HMM-based method for Thai spelling speech recognition | |
Kamath et al. | Automatic Speech Recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 07850261 Country of ref document: EP Kind code of ref document: A1 |
|
DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | ||
WWE | Wipo information: entry into national phase |
Ref document number: 12518075 Country of ref document: US Ref document number: 2008548349 Country of ref document: JP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
REEP | Request for entry into the european phase |
Ref document number: 2007850261 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2007850261 Country of ref document: EP |
|
DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) |