JP2005189294A - Speech recognition device - Google Patents

Speech recognition device

Info

Publication number
JP2005189294A
Authority
JP
Japan
Prior art keywords
speech
syllable
character string
recognition
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2003427481A
Other languages
Japanese (ja)
Inventor
Hiroyuki Hoshino
Takakatsu Yoshimura
Iko Terasawa
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toyota Central R&D Labs Inc
Original Assignee
Toyota Central R&D Labs Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toyota Central R&D Labs Inc filed Critical Toyota Central R&D Labs Inc
Priority to JP2003427481A priority Critical patent/JP2005189294A/en
Publication of JP2005189294A publication Critical patent/JP2005189294A/en
Pending legal-status Critical Current


Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device capable of recognizing speech without removing noise, even when noise such as breathing or lip noise is included in the speech waveform.

SOLUTION: A speech analyzer 12 analyzes speech uttered by a user and outputs a speech waveform signal, a syllable extractor 14 cuts single-syllable portions out of the speech waveform signal, and a syllable recognizer 16 recognizes each extracted single-syllable portion. The recognized character string, and character strings obtained by deleting at least one character from it, are then matched against a recognition dictionary.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a speech recognition device, and more particularly to a speech recognition device that recognizes speech based on a speech waveform uttered as discrete single syllables.

Conventionally, speech recognition devices are known that recognize speech from a waveform uttered with the word divided into syllables. Because detecting breath, lip noise, and the like causes misrecognition in such devices, syllables are selected from syllable candidates by threshold comparison so that breath, lip noise, and other non-syllable sounds are removed as noise (for example, Patent Document 1), or breathing sounds, lip noise, and the like are removed using a voiced hidden Markov model and a silent hidden Markov model (for example, Patent Document 2).
Japanese Patent Laid-Open No. 7-261779; Japanese Patent Laid-Open No. 11-288293

However, in the conventional techniques described above, breath, lip noise, and the like are recognized from their acoustic features at the speech-interval detection stage of signal processing and removed as non-speech sounds. This requires complex signal processing, and genuine speech may be misrecognized as noise and lost.

The present invention was made to solve the above problems, and its object is to provide a speech recognition device capable of recognizing speech even when the speech waveform contains noise such as breath or lip noise, without removing that noise.

To achieve the above object, the present invention comprises: speech analysis means for analyzing speech uttered by a user and outputting a speech waveform signal; syllable extraction means for extracting syllable portions from the speech waveform signal; recognition means for performing speech recognition on each extracted syllable portion using an acoustic model and outputting a character string; and matching means for matching against a recognition dictionary both the character string recognized by the recognition means and character strings obtained by deleting at least one character from it.

The speech analysis means of the present invention analyzes speech uttered by the user and outputs a speech waveform signal, and the syllable extraction means extracts the syllable portions from the speech waveform signal. The recognition means performs speech recognition on each extracted syllable portion using the acoustic model and outputs a character string, and the matching means matches against the recognition dictionary both the character string recognized by the recognition means and character strings obtained by deleting at least one character from it. The matching means then outputs the character strings that match the dictionary as recognition candidates.

If the character string does not match the recognition dictionary even after one character is deleted, the matching means increases the number of deleted characters until a deleted-character string matches the recognition dictionary, so that a recognition candidate can still be output.

As described above, according to the present invention, character strings obtained by deleting at least one character from the recognized character string are matched against the recognition dictionary. As a result, speech recognition is possible even when the speech waveform contains noise such as breath or lip noise, without removing that noise.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

As shown in FIG. 1, this embodiment includes a microphone 10 that detects speech uttered by a user with each word divided into syllables, and a speech analyzer 12 that analyzes the speech detected by the microphone 10 and outputs a speech waveform signal carrying the speech power information.

Connected to the speech analyzer 12 is a syllable extractor 14 that extracts speech intervals (single-syllable portions) from the input speech waveform signal and outputs a syllable string signal.

A syllable recognizer 16, to which a storage device 18 is connected, is connected to the syllable extractor 14. The storage device 18 stores acoustic models (isolated syllable models) of a total of 110 basic Japanese single syllables: the 50 sounds of the Japanese syllabary such as "a", "i", "u", "e", "o", "ka", "ki", "ku", "ke", "ko", "sa", ...; voiced sounds such as "ga", "gi", "gu", ...; semi-voiced sounds such as "pa", "pi", "pu", ...; and contracted sounds such as "kya", "kyu", "kyo", ....

The syllable recognizer 16 is connected to a character string matcher 20, to which a storage device 22 storing a large number of words as a recognition dictionary is connected.

Next, the operation of the speech recognition device of this embodiment will be described. Speech uttered by the user is detected by the microphone 10 and input to the speech analyzer 12. The speech analyzer 12 analyzes the single-syllable utterances spoken with the word divided into syllables, extracts features such as the speech power information, and outputs the speech waveform signal W shown in FIG. 2. FIG. 2 shows the speech waveform signal for the utterance "i-tsu-shi-n-sha-(nasal breath)"; because a nasal breath was emitted at the end of the utterance, it too is included in the waveform signal.

The speech waveform signal output from the speech analyzer 12 is input to the syllable extractor 14. The syllable extractor 14 compares the input speech waveform signal with a power threshold, extracts the portions whose power exceeds the threshold as speech intervals, and outputs the syllable string signal S shown in FIG. 2. Since the power (amplitude) of the speech waveform signal drops between syllables, the speech intervals can be extracted by setting a threshold above these low-power portions.
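The threshold comparison above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, frame length, and threshold value are all invented for the example, and a real analyzer would compute power over windowed, possibly overlapping frames.

```python
# Sketch of power-threshold speech-interval extraction (illustrative only;
# frame_len and threshold are invented values, not from the patent).

def extract_intervals(waveform, frame_len=400, threshold=0.05):
    """Return (start, end) sample ranges whose per-frame mean power
    exceeds the threshold, merged into contiguous speech intervals."""
    n_frames = len(waveform) // frame_len
    intervals, start = [], None
    for f in range(n_frames):
        frame = waveform[f * frame_len:(f + 1) * frame_len]
        power = sum(x * x for x in frame) / frame_len
        if power > threshold and start is None:
            start = f * frame_len                     # interval opens
        elif power <= threshold and start is not None:
            intervals.append((start, f * frame_len))  # interval closes
            start = None
    if start is not None:                             # still open at the end
        intervals.append((start, n_frames * frame_len))
    return intervals
```

With a low threshold, any loud-enough burst — including a nasal breath — becomes its own interval, which is exactly the behavior the following paragraphs address.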

If the power threshold is set high, breath and lip noise often go undetected, but syllables uttered quietly are lost at the same time, so the threshold must be set low. In the example of FIG. 2, the nasal breath emitted at the end of the utterance is therefore also erroneously extracted as a syllable interval.

The syllable recognizer 16 performs speech recognition on each single-syllable portion extracted as a syllable interval, using the single-syllable acoustic models stored in the storage device 18. In the example of FIG. 2, if everything other than the nasal breath is recognized correctly, the result is "i-tsu-shi-n-sha-(nasal breath)". The (nasal breath) portion is recognized as some syllable, for example "fu". Consequently, with ordinary processing, the subsequent character string matching would not produce a correct result.

Therefore, in this embodiment, the character string matcher 20 performs matching not only on the recognition result from the syllable recognizer 16 but also on the character strings obtained by deleting n characters (for example, one or two) from that result, and outputs the matching strings as recognition candidates. The number of deleted characters is increased in turn until a match with the dictionary is found.

In the example of FIG. 2, the string is recognized as "i-tsu-shi-n-sha-fu", so one syllable (one character) is first deleted and string matching is performed on the six candidates "tsu-shi-n-sha-fu", "i-shi-n-sha-fu", "i-tsu-n-sha-fu", "i-tsu-shi-sha-fu", "i-tsu-shi-n-fu", and "i-tsu-shi-n-sha". The sixth candidate, "i-tsu-shi-n-sha", matches the dictionary, so "i-tsu-shi-n-sha" is output as the recognition candidate.
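The one-deletion matching step above can be sketched as follows. Syllables are represented as tuples of romanized strings; the function names and the single-entry dictionary are illustrative, not from the patent.

```python
# Sketch of one-syllable-deletion matching (names and dictionary contents
# are illustrative, not the patent's implementation).

def deletion_candidates(syllables):
    """All sequences formed by deleting exactly one syllable, in order."""
    return [syllables[:i] + syllables[i + 1:] for i in range(len(syllables))]

def match_one_deletion(recognized, dictionary):
    """Try the recognized sequence first, then each one-deletion candidate."""
    if recognized in dictionary:
        return recognized          # no-noise case: exact match
    for cand in deletion_candidates(recognized):
        if cand in dictionary:
            return cand            # first candidate found in the dictionary
    return None

dictionary = {("i", "tsu", "shi", "n", "sha")}
heard = ("i", "tsu", "shi", "n", "sha", "fu")   # trailing "fu" is the nasal breath
print(match_one_deletion(heard, dictionary))    # → ('i', 'tsu', 'shi', 'n', 'sha')
```

As in the text, a six-syllable input yields six one-deletion candidates, and the sixth (final syllable dropped) is the one that matches.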

If no recognition candidate is obtained after deleting one syllable, one more syllable (that is, two syllables) is deleted and the same matching process is performed. This deletion and subsequent matching are repeated until a recognition candidate is obtained.
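The repeated-deletion loop can be sketched as below, assuming deleted syllables may come from any positions in the sequence. This is a hypothetical rendering: it stops at the smallest deletion count that yields at least one dictionary match and returns all matches found at that count.

```python
# Sketch of the increasing-deletion loop: delete 0, 1, 2, ... syllables
# until at least one candidate matches the dictionary (illustrative only).
from itertools import combinations

def match_with_deletions(syllables, dictionary):
    n = len(syllables)
    for k in range(n):                  # k = number of syllables deleted
        hits = []
        for drop in combinations(range(n), k):
            cand = tuple(s for i, s in enumerate(syllables) if i not in drop)
            if cand in dictionary and cand not in hits:
                hits.append(cand)
        if hits:
            return hits                 # all matches at the smallest k
    return []
```

Returning every match at the smallest deletion count leaves room for the tie-breaking step discussed later, when more than one dictionary word survives.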

When no noise such as breath or lip noise is included, the character string recognized by the syllable recognizer 16 matches the recognition dictionary without any syllable being deleted.

Furthermore, when the result is recognized as, for example, "mi-ka-n-fu" (the final "fu" being a nasal breath), deleting one syllable yields "ka-n-fu", "mi-n-fu", "mi-ka-fu", and "mi-ka-n", and string matching against the dictionary may then yield two or more matches, such as both "ka-n-fu" and "mi-ka-n". In such a case, "mi-ka-n" may be selected as the recognition candidate by taking into account, for example, the frequency with which each word appears in text.
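The frequency-based tie-break can be sketched as follows. The frequency counts here are invented for illustration; a real system would draw them from corpus statistics.

```python
# Sketch of the frequency-based tie-break between multiple dictionary
# matches (frequency counts are hypothetical, not from the patent).

def pick_by_frequency(candidates, freq):
    """Choose the candidate whose word appears most often in text."""
    return max(candidates, key=lambda w: freq.get(w, 0))

candidates = [("ka", "n", "fu"), ("mi", "ka", "n")]
freq = {("mi", "ka", "n"): 120, ("ka", "n", "fu"): 3}   # invented counts
print(pick_by_frequency(candidates, freq))              # → ('mi', 'ka', 'n')
```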

FIG. 1 is a block diagram showing an embodiment of the present invention. FIG. 2 is a waveform diagram showing an example of a speech waveform signal and of the syllable string signal obtained by extracting each syllable portion from it.

Explanation of symbols

10 microphone
12 speech analyzer
16 syllable recognizer
20 character string matcher

Claims (2)

A speech recognition device comprising:
speech analysis means for analyzing speech uttered by a user and outputting a speech waveform signal;
syllable extraction means for extracting syllable portions from the speech waveform signal;
recognition means for performing speech recognition on each syllable portion extracted using an acoustic model and outputting a character string; and
matching means for matching, against a recognition dictionary, the character string recognized by the recognition means and a character string obtained by deleting at least one character from the character string.
The speech recognition device according to claim 1, wherein the matching means performs matching while increasing the number of characters to be deleted until the character string matches the recognition dictionary.
JP2003427481A 2003-12-24 2003-12-24 Speech recognition device Pending JP2005189294A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2003427481A JP2005189294A (en) 2003-12-24 2003-12-24 Speech recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2003427481A JP2005189294A (en) 2003-12-24 2003-12-24 Speech recognition device

Publications (1)

Publication Number Publication Date
JP2005189294A true JP2005189294A (en) 2005-07-14

Family

ID=34786744

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2003427481A Pending JP2005189294A (en) 2003-12-24 2003-12-24 Speech recognition device

Country Status (1)

Country Link
JP (1) JP2005189294A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105976814A (en) * 2015-12-10 2016-09-28 乐视致新电子科技(天津)有限公司 Headset control method and device
CN110634505A (en) * 2018-06-21 2019-12-31 卡西欧计算机株式会社 Sound period detection device, sound period detection method, storage medium, sound recognition device, and robot


Similar Documents

Publication Publication Date Title
US8666745B2 (en) Speech recognition system with huge vocabulary
US6973427B2 (en) Method for adding phonetic descriptions to a speech recognition lexicon
JP2006251147A (en) Speech recognition method
US6502072B2 (en) Two-tier noise rejection in speech recognition
JP5647455B2 (en) Apparatus, method, and program for detecting inspiratory sound contained in voice
JP2010078877A (en) Speech recognition device, speech recognition method, and speech recognition program
KR101122591B1 (en) Apparatus and method for speech recognition by keyword recognition
JPS6138479B2 (en)
JP2001195087A (en) Voice recognition system
JP2005189294A (en) Speech recognition device
EP3718107B1 (en) Speech signal processing and evaluation
JP2006010739A (en) Speech recognition device
JP2011180308A (en) Voice recognition device and recording medium
JP2002372988A (en) Recognition dictionary preparing device and rejection dictionary and rejection dictionary generating method
JP2005189293A (en) Voice recognition device
JP4213608B2 (en) Speech waveform information analyzer and its pre-processing device
JP2594916B2 (en) Voice recognition device
JPH1130994A (en) Voice recognizing method and device therefor and recording medium recorded with voice recognition program
JPS6033599A (en) Voice recognition equipment
JP2001013983A (en) Speech recognition apparatus using speech synthesis and speech recognition method
JPH0695684A (en) Sound recognizing system
JPS60159798A (en) Voice recognition equipment
JPS62166400A (en) Voice wordprocessor
JP2002341891A (en) Speech recognition device and speech recognition method
JPS6363098A (en) Voice recognition