JP2005189294A - Speech recognition device - Google Patents

Speech recognition device

Info

Publication number
JP2005189294A
Authority
JP
Japan
Prior art keywords
speech
syllable
character string
recognition
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2003427481A
Other languages
Japanese (ja)
Inventor
Hiroyuki Hoshino
Takakatsu Yoshimura
Iko Terasawa
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toyota Central R&D Labs Inc
Original Assignee
Toyota Central R&D Labs Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toyota Central R&D Labs Inc filed Critical Toyota Central R&D Labs Inc
Priority to JP2003427481A priority Critical patent/JP2005189294A/en
Publication of JP2005189294A publication Critical patent/JP2005189294A/en
Pending legal-status Critical Current


Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device capable of recognizing speech without removing noise, even when noise such as breathing or lip noise is included in the speech waveform.

SOLUTION: A speech analyzer 12 analyzes speech uttered by a user and outputs a speech waveform signal, a syllable extractor 14 cuts single-syllable portions out of the speech waveform signal, and a syllable recognizer 16 recognizes each extracted single-syllable portion. The recognized character string, and character strings obtained by deleting at least one character from it, are then matched against a recognition dictionary.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a speech recognition device, and more particularly to a speech recognition device that recognizes speech based on a speech waveform uttered as discrete single syllables.

Conventionally, speech recognition devices are known that recognize speech from a waveform uttered with the word divided into syllables. Because detecting breath, lip noise, and the like causes misrecognition in such devices, syllables are selected from syllable candidates by threshold comparison so that breath, lip noise, and other non-syllable sounds are removed as noise (for example, Patent Document 1), or breathing sounds, lip noise, and the like are removed using a voiced hidden Markov model and a silent hidden Markov model (for example, Patent Document 2).
Japanese Patent Laid-Open No. 7-261779; Japanese Patent Laid-Open No. 11-288293

However, in the conventional techniques described above, breath, lip noise, and the like are recognized from their acoustic features at the speech-interval detection stage of signal processing and removed as non-speech sounds. This requires complex signal processing, and genuine speech may be misrecognized as noise and lost.

The present invention was made to solve the above problems, and its object is to provide a speech recognition device capable of recognizing speech even when the speech waveform contains noise such as breath or lip noise, without removing that noise.

To achieve the above object, the present invention comprises: speech analysis means for analyzing speech uttered by a user and outputting a speech waveform signal; syllable extraction means for extracting syllable portions from the speech waveform signal; recognition means for performing speech recognition on each extracted syllable portion using an acoustic model and outputting a character string; and matching means for matching against a recognition dictionary both the character string recognized by the recognition means and character strings obtained by deleting at least one character from it.

The speech analysis means of the present invention analyzes speech uttered by the user and outputs a speech waveform signal, and the syllable extraction means extracts the syllable portions from the speech waveform signal. The recognition means performs speech recognition on each extracted syllable portion using the acoustic model and outputs a character string, and the matching means matches against the recognition dictionary both the character string recognized by the recognition means and character strings obtained by deleting at least one character from it. The matching means then outputs the character strings that match the dictionary as recognition candidates.

If the character string does not match the recognition dictionary even after one character is deleted, the matching means increases the number of deleted characters until a deleted-character string matches the recognition dictionary, so that a recognition candidate can still be output.

As described above, according to the present invention, character strings obtained by deleting at least one character from the recognized character string are matched against the recognition dictionary. As a result, speech recognition is possible even when the speech waveform contains noise such as breath or lip noise, without removing that noise.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

As shown in FIG. 1, this embodiment includes a microphone 10 that detects speech uttered by a user with each word divided into syllables, and a speech analyzer 12 that analyzes the speech detected by the microphone 10 and outputs a speech waveform signal carrying the speech power information.

Connected to the speech analyzer 12 is a syllable extractor 14 that extracts speech intervals (single-syllable portions) from the input speech waveform signal and outputs a syllable string signal.

A syllable recognizer 16, to which a storage device 18 is connected, is connected to the syllable extractor 14. The storage device 18 stores acoustic models (isolated syllable models) of a total of 110 basic Japanese single syllables: the 50 sounds of the Japanese syllabary such as "a", "i", "u", "e", "o", "ka", "ki", "ku", "ke", "ko", "sa", ...; voiced sounds such as "ga", "gi", "gu", ...; semi-voiced sounds such as "pa", "pi", "pu", ...; and contracted sounds such as "kya", "kyu", "kyo", ....

The syllable recognizer 16 is connected to a character string matcher 20, to which a storage device 22 storing a large number of words as a recognition dictionary is connected.

Next, the operation of the speech recognition device of this embodiment will be described. Speech uttered by the user is detected by the microphone 10 and input to the speech analyzer 12. The speech analyzer 12 analyzes the single-syllable utterances spoken with the word divided into syllables, extracts features such as the speech power information, and outputs the speech waveform signal W shown in FIG. 2. FIG. 2 shows the speech waveform signal for the utterance "i-tsu-shi-n-sha-(nasal breath)"; because a nasal breath was emitted at the end of the utterance, it too is included in the waveform signal.

The speech waveform signal output from the speech analyzer 12 is input to the syllable extractor 14. The syllable extractor 14 compares the input speech waveform signal with a power threshold, extracts the portions whose power exceeds the threshold as speech intervals, and outputs the syllable string signal S shown in FIG. 2. Since the power (amplitude) of the speech waveform signal drops between syllables, the speech intervals can be extracted by setting a threshold above these low-power portions.
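The threshold comparison above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, frame length, and threshold value are all invented for the example, and a real analyzer would compute power over windowed, possibly overlapping frames.

```python
# Sketch of power-threshold speech-interval extraction (illustrative only;
# frame_len and threshold are invented values, not from the patent).

def extract_intervals(waveform, frame_len=400, threshold=0.05):
    """Return (start, end) sample ranges whose per-frame mean power
    exceeds the threshold, merged into contiguous speech intervals."""
    n_frames = len(waveform) // frame_len
    intervals, start = [], None
    for f in range(n_frames):
        frame = waveform[f * frame_len:(f + 1) * frame_len]
        power = sum(x * x for x in frame) / frame_len
        if power > threshold and start is None:
            start = f * frame_len                     # interval opens
        elif power <= threshold and start is not None:
            intervals.append((start, f * frame_len))  # interval closes
            start = None
    if start is not None:                             # still open at the end
        intervals.append((start, n_frames * frame_len))
    return intervals
```

With a low threshold, any loud-enough burst — including a nasal breath — becomes its own interval, which is exactly the behavior the following paragraphs address.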

If the power threshold is set high, breath and lip noise often go undetected, but syllables uttered quietly are lost at the same time, so the threshold must be set low. In the example of FIG. 2, the nasal breath emitted at the end of the utterance is therefore also erroneously extracted as a syllable interval.

The syllable recognizer 16 performs speech recognition on each single-syllable portion extracted as a syllable interval, using the single-syllable acoustic models stored in the storage device 18. In the example of FIG. 2, if everything other than the nasal breath is recognized correctly, the result is "i-tsu-shi-n-sha-(nasal breath)". The (nasal breath) portion is recognized as some syllable, for example "fu". Consequently, with ordinary processing, the subsequent character string matching would not produce a correct result.

Therefore, in this embodiment, the character string matcher 20 performs matching not only on the recognition result from the syllable recognizer 16 but also on the character strings obtained by deleting n characters (for example, one or two) from that result, and outputs the matching strings as recognition candidates. The number of deleted characters is increased in turn until a match with the dictionary is found.

In the example of FIG. 2, the string is recognized as "i-tsu-shi-n-sha-fu", so one syllable (one character) is first deleted and string matching is performed on the six candidates "tsu-shi-n-sha-fu", "i-shi-n-sha-fu", "i-tsu-n-sha-fu", "i-tsu-shi-sha-fu", "i-tsu-shi-n-fu", and "i-tsu-shi-n-sha". The sixth candidate, "i-tsu-shi-n-sha", matches the dictionary, so "i-tsu-shi-n-sha" is output as the recognition candidate.
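The one-deletion matching step above can be sketched as follows. Syllables are represented as tuples of romanized strings; the function names and the single-entry dictionary are illustrative, not from the patent.

```python
# Sketch of one-syllable-deletion matching (names and dictionary contents
# are illustrative, not the patent's implementation).

def deletion_candidates(syllables):
    """All sequences formed by deleting exactly one syllable, in order."""
    return [syllables[:i] + syllables[i + 1:] for i in range(len(syllables))]

def match_one_deletion(recognized, dictionary):
    """Try the recognized sequence first, then each one-deletion candidate."""
    if recognized in dictionary:
        return recognized          # no-noise case: exact match
    for cand in deletion_candidates(recognized):
        if cand in dictionary:
            return cand            # first candidate found in the dictionary
    return None

dictionary = {("i", "tsu", "shi", "n", "sha")}
heard = ("i", "tsu", "shi", "n", "sha", "fu")   # trailing "fu" is the nasal breath
print(match_one_deletion(heard, dictionary))    # → ('i', 'tsu', 'shi', 'n', 'sha')
```

As in the text, a six-syllable input yields six one-deletion candidates, and the sixth (final syllable dropped) is the one that matches.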

If no recognition candidate is obtained after deleting one syllable, one more syllable (that is, two syllables) is deleted and the same matching process is performed. This deletion and subsequent matching are repeated until a recognition candidate is obtained.
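The repeated-deletion loop can be sketched as below, assuming deleted syllables may come from any positions in the sequence. This is a hypothetical rendering: it stops at the smallest deletion count that yields at least one dictionary match and returns all matches found at that count.

```python
# Sketch of the increasing-deletion loop: delete 0, 1, 2, ... syllables
# until at least one candidate matches the dictionary (illustrative only).
from itertools import combinations

def match_with_deletions(syllables, dictionary):
    n = len(syllables)
    for k in range(n):                  # k = number of syllables deleted
        hits = []
        for drop in combinations(range(n), k):
            cand = tuple(s for i, s in enumerate(syllables) if i not in drop)
            if cand in dictionary and cand not in hits:
                hits.append(cand)
        if hits:
            return hits                 # all matches at the smallest k
    return []
```

Returning every match at the smallest deletion count leaves room for the tie-breaking step discussed later, when more than one dictionary word survives.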

When no noise such as breath or lip noise is included, the character string recognized by the syllable recognizer 16 matches the recognition dictionary without any syllable being deleted.

Furthermore, when the result is recognized as, for example, "mi-ka-n-fu" (the final "fu" being a nasal breath), deleting one syllable yields "ka-n-fu", "mi-n-fu", "mi-ka-fu", and "mi-ka-n", and string matching against the dictionary may then yield two or more matches, such as both "ka-n-fu" and "mi-ka-n". In such a case, "mi-ka-n" may be selected as the recognition candidate by taking into account, for example, the frequency with which each word appears in text.
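The frequency-based tie-break can be sketched as follows. The frequency counts here are invented for illustration; a real system would draw them from corpus statistics.

```python
# Sketch of the frequency-based tie-break between multiple dictionary
# matches (frequency counts are hypothetical, not from the patent).

def pick_by_frequency(candidates, freq):
    """Choose the candidate whose word appears most often in text."""
    return max(candidates, key=lambda w: freq.get(w, 0))

candidates = [("ka", "n", "fu"), ("mi", "ka", "n")]
freq = {("mi", "ka", "n"): 120, ("ka", "n", "fu"): 3}   # invented counts
print(pick_by_frequency(candidates, freq))              # → ('mi', 'ka', 'n')
```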

FIG. 1 is a block diagram showing an embodiment of the present invention. FIG. 2 is a waveform diagram showing an example of a speech waveform signal and of the syllable string signal obtained by extracting each syllable portion from it.

Explanation of symbols

10 microphone
12 speech analyzer
16 syllable recognizer
20 character string matcher

Claims (2)

A speech recognition device comprising:
speech analysis means for analyzing speech uttered by a user and outputting a speech waveform signal;
syllable extraction means for extracting syllable portions from the speech waveform signal;
recognition means for performing speech recognition on each syllable portion extracted using an acoustic model and outputting a character string; and
matching means for matching, against a recognition dictionary, the character string recognized by the recognition means and a character string obtained by deleting at least one character from the character string.
The speech recognition device according to claim 1, wherein the matching means performs matching while increasing the number of characters to be deleted until the character string matches the recognition dictionary.
JP2003427481A 2003-12-24 2003-12-24 Speech recognition device Pending JP2005189294A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2003427481A JP2005189294A (en) 2003-12-24 2003-12-24 Speech recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2003427481A JP2005189294A (en) 2003-12-24 2003-12-24 Speech recognition device

Publications (1)

Publication Number Publication Date
JP2005189294A true JP2005189294A (en) 2005-07-14

Family

ID=34786744

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2003427481A Pending JP2005189294A (en) 2003-12-24 2003-12-24 Speech recognition device

Country Status (1)

Country Link
JP (1) JP2005189294A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105976814A (en) * 2015-12-10 2016-09-28 乐视致新电子科技(天津)有限公司 Headset control method and device
CN110634505A (en) * 2018-06-21 2019-12-31 卡西欧计算机株式会社 Sound period detection device, sound period detection method, storage medium, sound recognition device, and robot


Similar Documents

Publication Publication Date Title
US8666745B2 (en) Speech recognition system with huge vocabulary
US6973427B2 (en) Method for adding phonetic descriptions to a speech recognition lexicon
JP2006251147A (en) Speech recognition method
US6502072B2 (en) Two-tier noise rejection in speech recognition
JP5647455B2 (en) Apparatus, method, and program for detecting inspiratory sound contained in voice
JP2010078877A (en) Speech recognition device, speech recognition method, and speech recognition program
KR101122591B1 (en) Apparatus and method for speech recognition by keyword recognition
JPS6138479B2 (en)
JP2001195087A (en) Voice recognition system
JP2005189294A (en) Speech recognition device
EP3718107B1 (en) Speech signal processing and evaluation
JP2006010739A (en) Speech recognition device
JP2011180308A (en) Voice recognition device and recording medium
JP2002372988A (en) Recognition dictionary preparing device and rejection dictionary and rejection dictionary generating method
JP2005189293A (en) Voice recognition device
JP4213608B2 (en) Speech waveform information analyzer and its pre-processing device
JP2594916B2 (en) Voice recognition device
JPH1130994A (en) Voice recognizing method and device therefor and recording medium recorded with voice recognition program
JPS6033599A (en) Voice recognition equipment
JP2001013983A (en) Speech recognition apparatus using speech synthesis and speech recognition method
JPH0695684A (en) Sound recognizing system
JPS60159798A (en) Voice recognition equipment
JPS62166400A (en) Voice wordprocessor
JP2002341891A (en) Speech recognition device and speech recognition method
JPS6363098A (en) Voice recognition